## 29 May 2008

### Measures of correlation

As NLPers, we're often tasked with measuring similarity of things, where things are usually words or phrases or something like that. The most standard example is measuring collocations. Namely, for every pair of words, do they co-locate (appear next to each other) more than they "should." There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated!

The most straightforward measure, in my mind, is mutual information. If we have two words "Y" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y take on the same values more often than chance. I.e., do they seem roughly statistically independent. The mutual information statistic gives, for every X/Y pair (every pair of words) a score. We can then sort all word pairs by this score and find the strongest collocations.

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.

The next set of methods seem to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio.

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.

In writing this, I just came across collocations.de, which among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate because so much depends on what you're trying to do.

Tanton said...

I'm actually working in collocations right now in a 700M word-pair corpus. I'm using Chi-Square as the test mechanism (which is similar to log-likelihood, right?) The main problem I had was that the size of corpus caused ANY pair to seem more correlated than it actually was. So, I had to increase the odds that two pairs were NOT correlated to counter that. I'll look into the ideas you presented here and the web site you linked to. This is very timely, thanks!

David said...

I realize that collocation and coreference are different, but given a large enough corpus (e.g. Reuters RCV1, 10 years of the NYTimes, etc.), wouldn't the same techniques be reasonable for coreference?

Bob Carpenter said...

I blogged about this in the past: Phrase Extraction: Binomial Hypothesis Testing vs. Coding Loss.

In LingPipe, we do a bunch of things for independence testing. We've found chi-square to work best for collocations (so did the citations in Manning and Schuetze).

We use MAP estimates of language models to measure hotness in one corpus vs. another, such as the news in the last hour vs. the news in the last month or year. If the "background" model is a unigram LM and the foreground a bigram, this becomes similar to the mutual information measure, though done with simple binomial hypothesis tests.

Before us, Takashi Tomokiyo and Matthew Hurst wrote a paper A language model approach to keyphrase extraction which uses a foreground vs. background model; it's different in that they use mutual information.

In the real world, you need some clever hacks, like restricting the phrases to noun phrases, or just phrases that are capitalized (like Amazon does), or just sequences following "the" (as Tomokiyo and Hurst did).

The coolest implementation I've seen of collocation or phrase extraction is on scirus.com. See the "refining your search" sidebar after a search.

For higher-order collocation (not phrases), things like singular value decomposition work pretty well.

For more reading about the details, I also have a blog entry about chi-squared testing and boundary conditions on counts: Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions.

Anonymous said...

Hi Hal,

I've been following the blog for a while and while I don't have a useful comment on collocations I did have a question that I was hoping you might answer...

When dealing with statistical algorithms I am not sure which to choose when and why.

Take for example SVM, LDA and MaxEnt... They all seem to have about the same capabilities so what is the relative advantage/disadvantages of each?

For instance MaxEnt, LDA and SVM have all been used "successfully" to do part of speech tagging but why would I use one over the other?

Pretty open ended but your other descriptions have been enlightening so if you had a moment to answer that would be super awesome.

Thanks so much,

Sam

hal said...

Aleks Jakulin tried to post the following but apparently it ended up in blogopurgatory:

The weird results you're getting are a consequence of an overfitted P(X,Y) model of co-occurrences - when the counts are low, the frequentist probabilities are way off, and log-loss makes it even worse.

For that I'd recommend you either use MI's p-values, or MI-at-risk measure (the value of MI at the lower 5%).

I'd recommend you to take a look at Chapter 4 of http://www.stat.columbia.edu/~jakulin/Int/jakulin05phd.pdf or at http://www.stat.columbia.edu/~jakulin/Int/jakulin-bratko-ICML2004.pdf -- don't worry about 3-way interactions, although I've had some interesting 3-way effects that I describe in my dissertation at "Synonymy and Polysemy" on printed page 59 (page 47).

hal said...

David -- I'm not quite sure how to do something like this for coreference. It might possibly work for name ("Bush") vs. nominal ("President") pairs, but certainly not for pronouns.

Bob Moore said...

Hal, your comments about mutual information (MI) don't quite square with my experience; perhaps because you may be conflating pointwise MI with average MI. Infrequent pairs coming out on top is certainly a problem with pointwise MI. A pair of items that each have one occurrence and happen to co-occur will have the highest possible pointwise MI. However, the formula you give is for average MI, which in my experience does not have this problem. In fact, if anything, the opposite is true; with average MI, weakly correlated but very frequent pairs often come out with higher scores than you would like. I dealt with some issues that I think are related to your concerns in my 2004 EMNLP paper, "On log-likelihood-ratios and the significance of rare events".

hal said...

Hi Bob --

Indeed, I was conflating PMI and (real) MI. I guess PMI just seems crazy to me. Though I can see why (average) MI might overly favor common things. If I recall from your EMNLP paper, your conclusion is that Fisher's exact test is almost always to be preferred over log-lik ratio?

I fully admit that I'm probably totally biased on these things, but I really feel like almost any classical measure is going to either overly favor rare things or do the opposite in trying to correct for that. Given how easy it is to marginalize out a Binomial under a Beta and compute KLs between posterior Betas (or similarly with Multinomial under a Dirichlet), I'm a bit surprised that I haven't seen anything along these lines.... Maybe my intuition is just bad here, but I feel like this is likely to do the right thing.

gustavolacerda said...

I recommend LaTeXMathML. You just need to add a couple of JavaScript lines to your webpage template. See http://www.maths.nottingham.ac.uk/personal/drw/lm.html

Brendan said...

I'm used to seeing Pearson correlation as the first thing people mention for a correlation measure, but you don't even bother. Is there a reason I'm missing?

hal said...

brendan -- a couple of reasons. first, i'm not using correlation in the standard statistical correlation sense. all i want to know is: when i see the phrase "real estate" do these words "go together" or not (for instance). the bigger problem is that sure you could do pearson, but every event will be either (0,0), (1,1), (0,1) or (1,0), so pearson is going to suck :).

Anonymous said...

Ultima Online Gold, UO Gold, crestingwait
lotro gold
wow gold
warhammer gold
Age of Conan Gold, AOC Gold

Anonymous said...

It is the knight noah which makes me very happy these days, my brother says knight gold is his favorite games gold he likes, he usually knight online gold to start his game and most of the time he will win the knight online noah back and give me some cheap knight gold to play the game.

cutepig said...

Do you want to buy the cheapest dofus kamas, I think many people want to buy the cheapest dofus gold, but some times we all do not how to buy dofus kamas, so we all hope in the game can earn much cheap dofus kamas, this is the best hope for us to play the Dofus.

Anonymous said...
. said...

qishaya said...

one day i went shopping outside,and in an ed hardy store,I found some kinds of ed hardy i love most they are Your website is really good Thank you for the information ed hardy ed hardy ed hardy clothing ed hardy clothing ed hardy shoes ed hardy shoes don ed hardy don ed hardy ed hardy clothes ed hardy clothes ed hardy bags ed hardy bags ed hardy swimwear ed hardy swimwear ed hardy jeans ed hardy jeans ed hardy mens ed hardy mens Thank you for the information

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex