29 May 2008

Measures of correlation

As NLPers, we're often tasked with measuring similarity of things, where things are usually words or phrases or something like that. The most standard example is measuring collocations. Namely, for every pair of words, do they co-locate (appear next to each other) more than they "should." There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated!

The most straightforward measure, in my mind, is mutual information. If we have two words "Y" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y take on the same values more often than chance. I.e., do they seem roughly statistically independent. The mutual information statistic gives, for every X/Y pair (every pair of words) a score. We can then sort all word pairs by this score and find the strongest collocations.

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.

The next set of methods seem to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio.

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.

In writing this, I just came across collocations.de, which among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate because so much depends on what you're trying to do.


Tanton said...

I'm actually working in collocations right now in a 700M word-pair corpus. I'm using Chi-Square as the test mechanism (which is similar to log-likelihood, right?) The main problem I had was that the size of corpus caused ANY pair to seem more correlated than it actually was. So, I had to increase the odds that two pairs were NOT correlated to counter that. I'll look into the ideas you presented here and the web site you linked to. This is very timely, thanks!

David said...

I realize that collocation and coreference are different, but given a large enough corpus (e.g. Reuters RCV1, 10 years of the NYTimes, etc.), wouldn't the same techniques be reasonable for coreference?

Anonymous said...

I blogged about this in the past: Phrase Extraction: Binomial Hypothesis Testing vs. Coding Loss.

In LingPipe, we do a bunch of things for independence testing. We've found chi-square to work best for collocations (so did the citations in Manning and Schuetze).

We use MAP estimates of language models to measure hotness in one corpus vs. another, such as the news in the last hour vs. the news in the last month or year. If the "background" model is a unigram LM and the foreground a bigram, this becomes similar to the mutual information measure, though done with simple binomial hypothesis tests.

Before us, Takashi Tomokiyo and Matthew Hurst wrote a paper A language model approach to keyphrase extraction which uses a foreground vs. background model; it's different in that they use mutual information.

In the real world, you need some clever hacks, like restricting the phrases to noun phrases, or just phrases that are capitalized (like Amazon does), or just sequences following "the" (as Tomokiyo and Hurst did).

The coolest implementation I've seen of collocation or phrase extraction is on scirus.com. See the "refining your search" sidebar after a search.

For higher-order collocation (not phrases), things like singular value decomposition work pretty well.

For more reading about the details, I also have a blog entry about chi-squared testing and boundary conditions on counts: Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions.

Anonymous said...

Hi Hal,

I've been following the blog for a while and while I don't have a useful comment on collocations I did have a question that I was hoping you might answer...

When dealing with statistical algorithms I am not sure which to choose when and why.

Take for example SVM, LDA and MaxEnt... They all seem to have about the same capabilities so what is the relative advantage/disadvantages of each?

For instance MaxEnt, LDA and SVM have all been used "successfully" to do part of speech tagging but why would I use one over the other?

Pretty open ended but your other descriptions have been enlightening so if you had a moment to answer that would be super awesome.

Thanks so much,


hal said...

Aleks Jakulin tried to post the following but apparently it ended up in blogopurgatory:

The weird results you're getting are a consequence of an overfitted P(X,Y) model of co-occurrences - when the counts are low, the frequentist probabilities are way off, and log-loss makes it even worse.

For that I'd recommend you either use MI's p-values, or MI-at-risk measure (the value of MI at the lower 5%).

I'd recommend you to take a look at Chapter 4 of http://www.stat.columbia.edu/~jakulin/Int/jakulin05phd.pdf or at http://www.stat.columbia.edu/~jakulin/Int/jakulin-bratko-ICML2004.pdf -- don't worry about 3-way interactions, although I've had some interesting 3-way effects that I describe in my dissertation at "Synonymy and Polysemy" on printed page 59 (page 47).

hal said...

David -- I'm not quite sure how to do something like this for coreference. It might possibly work for name ("Bush") vs. nominal ("President") pairs, but certainly not for pronouns.

Anonymous said...

Hal, your comments about mutual information (MI) don't quite square with my experience; perhaps because you may be conflating pointwise MI with average MI. Infrequent pairs coming out on top is certainly a problem with pointwise MI. A pair of items that each have one occurrence and happen to co-occur will have the highest possible pointwise MI. However, the formula you give is for average MI, which in my experience does not have this problem. In fact, if anything, the opposite is true; with average MI, weakly correlated but very frequent pairs often come out with higher scores than you would like. I dealt with some issues that I think are related to your concerns in my 2004 EMNLP paper, "On log-likelihood-ratios and the significance of rare events".

hal said...

Hi Bob --

Indeed, I was conflating PMI and (real) MI. I guess PMI just seems crazy to me. Though I can see why (average) MI might overly favor common things. If I recall from your EMNLP paper, your conclusion is that Fisher's exact test is almost always to be preferred over log-lik ratio?

I fully admit that I'm probably totally biased on these things, but I really feel like almost any classical measure is going to either overly favor rare things or do the opposite in trying to correct for that. Given how easy it is to marginalize out a Binomial under a Beta and compute KLs between posterior Betas (or similarly with Multinomial under a Dirichlet), I'm a bit surprised that I haven't seen anything along these lines.... Maybe my intuition is just bad here, but I feel like this is likely to do the right thing.

Anonymous said...

I recommend LaTeXMathML. You just need to add a couple of JavaScript lines to your webpage template. See http://www.maths.nottingham.ac.uk/personal/drw/lm.html

Brendan O'Connor said...

I'm used to seeing Pearson correlation as the first thing people mention for a correlation measure, but you don't even bother. Is there a reason I'm missing?

hal said...

brendan -- a couple of reasons. first, i'm not using correlation in the standard statistical correlation sense. all i want to know is: when i see the phrase "real estate" do these words "go together" or not (for instance). the bigger problem is that sure you could do pearson, but every event will be either (0,0), (1,1), (0,1) or (1,0), so pearson is going to suck :).

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花