As NLPers, we're often tasked with measuring the similarity of things, where things are usually words or phrases or something like that. The most standard example is measuring collocations. Namely, for every pair of words, do they co-locate (appear next to each other) more than they "should"? There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated!

The most straightforward measure, in my mind, is mutual information. If we have two words "X" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y take on the same values more often than chance? I.e., do they seem statistically dependent rather than roughly independent? The mutual information statistic gives, for every X/Y pair (every pair of words), a score. We can then sort all word pairs by this score and find the strongest collocations.
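Here is a minimal sketch of that formula for the binary "does occur" / "does not occur" case. The function name and the four count arguments are my own choices for illustration: I'm assuming you have already counted, over some set of positions or windows, how often both words occur, only X occurs, only Y occurs, and neither occurs.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between two binary indicators X and Y.

    n11: positions where both words occur, n10: only X, n01: only Y,
    n00: neither. Probabilities are plain maximum likelihood estimates.
    """
    n = n11 + n10 + n01 + n00
    joint = {(1, 1): n11 / n, (1, 0): n10 / n, (0, 1): n01 / n, (0, 0): n00 / n}
    px = {1: (n11 + n10) / n, 0: (n01 + n00) / n}
    py = {1: (n11 + n01) / n, 0: (n10 + n00) / n}
    mi = 0.0
    for (x, y), pxy in joint.items():
        if pxy > 0:  # a cell with zero probability contributes nothing
            mi += pxy * math.log2(pxy / (px[x] * py[y]))
    return mi
```

A table whose cells are exactly what independence predicts scores 0, and a table where X and Y always agree (half the time both occur, half the time neither) scores 1 bit.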

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.
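As a rough sketch of the "count term in front" idea: the code below multiplies the pointwise score for the "both occur" cell by the raw co-occurrence count. This is only one way to read the suggestion, and I'm not claiming it is the RlogF formula itself. The function names, the argument layout, and the toy counts are all made up for illustration.

```python
import math

def pointwise_score(n_xy, n_x, n_y, n):
    """log2 of p(x, y) / (p(x) p(y)) for the 'both occur' cell.

    n_xy: co-occurrence count, n_x and n_y: counts of each word,
    n: total number of positions. Assumes n_xy > 0.
    """
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))

def count_weighted_score(n_xy, n_x, n_y, n):
    """The same score, weighted by how often the pair was actually seen."""
    return n_xy * pointwise_score(n_xy, n_x, n_y, n)

# Hypothetical counts: a pair seen once, and a pair seen a thousand times.
N = 1_000_000
rare = (1, 1, 1, N)
frequent = (1000, 2000, 2000, N)
```

With these toy counts the unweighted score ranks the one-off pair above the frequent one, and the count-weighted score reverses that order.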

The next set of methods seems to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you can have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use Dice coefficient, t-score, chi-squared, or even something simpler like the log-likelihood ratio.
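To make two of these concrete, here is a small sketch of Pearson's chi-squared statistic and the log-likelihood ratio (often written G²) on a 2x2 table of counts. The table layout `[[both, only X], [only Y, neither]]` and the function names are assumptions for this example, and the code assumes every row and column total is nonzero.

```python
import math

def expected_counts(table):
    """Counts expected under independence, given the table's margins."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def chi_squared(table):
    """Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def log_likelihood_ratio(table):
    """G^2 = 2 * sum of observed * ln(observed / expected), skipping empty cells."""
    exp = expected_counts(table)
    return 2.0 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                     for i in range(2) for j in range(2) if table[i][j] > 0)
```

Both statistics are 0 when the observed counts match the independence model exactly, and both grow as the table departs from it. Turning either into a p-value means comparing it against a chi-squared distribution with one degree of freedom, which is left out here.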

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.
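One possible shape for such a Bayesian test, sketched under assumptions of my own: compare the marginal likelihood of the 2x2 counts under a "dependent" model (a Dirichlet prior over the four cells) against an "independent" model (a separate Beta prior on each word's occurrence probability). The symmetric prior strength `prior=1.0` and the function names are illustrative choices, not something taken from a published method.

```python
from math import lgamma

def log_beta(alphas):
    """Log of the multivariate Beta function B(alpha_1, ..., alpha_k)."""
    return sum(lgamma(a) for a in alphas) - lgamma(sum(alphas))

def log_bayes_factor(n11, n10, n01, n00, prior=1.0):
    """Log Bayes factor: dependent model versus independent model.

    Positive values favor dependence, negative values favor independence.
    Both marginal likelihoods are for the same ordered sequence of
    observations, so the multinomial coefficient cancels in the ratio.
    """
    # Dependent: the four cell probabilities share one Dirichlet prior.
    dep = (log_beta([n11 + prior, n10 + prior, n01 + prior, n00 + prior])
           - log_beta([prior] * 4))
    # Independent: one Beta-Binomial for X's margin, one for Y's margin.
    ind_x = log_beta([n11 + n10 + prior, n01 + n00 + prior]) - log_beta([prior] * 2)
    ind_y = log_beta([n11 + n01 + prior, n10 + n00 + prior]) - log_beta([prior] * 2)
    return dep - (ind_x + ind_y)
```

Because the parameters are integrated out rather than plugged in at their maximum likelihood values, a pair seen only a handful of times cannot produce arbitrarily strong evidence by itself. How well this behaves on real corpora is something I have not checked.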

In writing this, I just came across collocations.de, which, among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate because so much depends on what you're trying to do.

## 22 comments:

I'm actually working in collocations right now in a 700M word-pair corpus. I'm using Chi-Square as the test mechanism (which is similar to log-likelihood, right?) The main problem I had was that the size of corpus caused ANY pair to seem more correlated than it actually was. So, I had to increase the odds that two pairs were NOT correlated to counter that. I'll look into the ideas you presented here and the web site you linked to. This is very timely, thanks!

I realize that collocation and coreference are different, but given a large enough corpus (e.g. Reuters RCV1, 10 years of the NYTimes, etc.), wouldn't the same techniques be reasonable for coreference?

I blogged about this in the past: Phrase Extraction: Binomial Hypothesis Testing vs. Coding Loss.

In LingPipe, we do a bunch of things for independence testing. We've found chi-square to work best for collocations (so did the citations in Manning and Schuetze).

We use MAP estimates of language models to measure hotness in one corpus vs. another, such as the news in the last hour vs. the news in the last month or year. If the "background" model is a unigram LM and the foreground a bigram, this becomes similar to the mutual information measure, though done with simple binomial hypothesis tests.

Before us, Takashi Tomokiyo and Matthew Hurst wrote a paper A language model approach to keyphrase extraction which uses a foreground vs. background model; it's different in that they use mutual information.

In the real world, you need some clever hacks, like restricting the phrases to noun phrases, or just phrases that are capitalized (like Amazon does), or just sequences following "the" (as Tomokiyo and Hurst did).

The coolest implementation I've seen of collocation or phrase extraction is on scirus.com. See the "refining your search" sidebar after a search.

For higher-order collocation (not phrases), things like singular value decomposition work pretty well.

For more reading about the details, I also have a blog entry about chi-squared testing and boundary conditions on counts: Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions.

Hi Hal,

I've been following the blog for a while and while I don't have a useful comment on collocations I did have a question that I was hoping you might answer...

When dealing with statistical algorithms I am not sure which to choose when and why.

Take for example SVM, LDA and MaxEnt... They all seem to have about the same capabilities so what is the relative advantage/disadvantages of each?

For instance MaxEnt, LDA and SVM have all been used "successfully" to do part of speech tagging but why would I use one over the other?

Pretty open ended but your other descriptions have been enlightening so if you had a moment to answer that would be super awesome.

Thanks so much,

Sam

Aleks Jakulin tried to post the following but apparently it ended up in blogopurgatory:

The weird results you're getting are a consequence of an overfitted P(X,Y) model of co-occurrences - when the counts are low, the frequentist probabilities are way off, and log-loss makes it even worse.

For that I'd recommend you either use MI's p-values, or MI-at-risk measure (the value of MI at the lower 5%).

I'd recommend you to take a look at Chapter 4 of http://www.stat.columbia.edu/~jakulin/Int/jakulin05phd.pdf or at http://www.stat.columbia.edu/~jakulin/Int/jakulin-bratko-ICML2004.pdf -- don't worry about 3-way interactions, although I've had some interesting 3-way effects that I describe in my dissertation at "Synonymy and Polysemy" on printed page 59 (page 47).

David -- I'm not quite sure how to do something like this for coreference. It might possibly work for name ("Bush") vs. nominal ("President") pairs, but certainly not for pronouns.

Hal, your comments about mutual information (MI) don't quite square with my experience; perhaps because you may be conflating pointwise MI with average MI. Infrequent pairs coming out on top is certainly a problem with pointwise MI. A pair of items that each have one occurrence and happen to co-occur will have the highest possible pointwise MI. However, the formula you give is for average MI, which in my experience does not have this problem. In fact, if anything, the opposite is true; with average MI, weakly correlated but very frequent pairs often come out with higher scores than you would like. I dealt with some issues that I think are related to your concerns in my 2004 EMNLP paper, "On log-likelihood-ratios and the significance of rare events".
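To make the pointwise versus average distinction concrete, here is a small sketch with made-up counts. The corpus size, the two example tables, and the function names are all hypothetical, chosen only to contrast a pair of words that each occur once (and happen to co-occur) with a frequent, moderately associated pair.

```python
import math

def cell_probs(n11, n10, n01, n00):
    """Joint and marginal maximum likelihood estimates from a 2x2 table."""
    n = n11 + n10 + n01 + n00
    joint = {(1, 1): n11 / n, (1, 0): n10 / n, (0, 1): n01 / n, (0, 0): n00 / n}
    px = {1: (n11 + n10) / n, 0: (n01 + n00) / n}
    py = {1: (n11 + n01) / n, 0: (n10 + n00) / n}
    return joint, px, py

def pointwise_mi(n11, n10, n01, n00):
    """Only the 'both occur' cell: log2 p(1,1) / (p(X=1) p(Y=1))."""
    joint, px, py = cell_probs(n11, n10, n01, n00)
    return math.log2(joint[(1, 1)] / (px[1] * py[1]))

def average_mi(n11, n10, n01, n00):
    """All four cells, each weighted by its joint probability."""
    joint, px, py = cell_probs(n11, n10, n01, n00)
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

N = 1_000_000
hapax_pair = (1, 0, 0, N - 1)               # each word occurs once, together
frequent_pair = (1000, 1000, 1000, N - 3000)
```

On these tables the pointwise score puts the one-off pair on top (it reaches log2 of the corpus size), while the average score ranks the frequent pair higher, since the rare cell is weighted by its tiny joint probability.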

Hi Bob --

Indeed, I was conflating PMI and (real) MI. I guess PMI just seems crazy to me. Though I can see why (average) MI might overly favor common things. If I recall from your EMNLP paper, your conclusion is that Fisher's exact test is almost always to be preferred over log-lik ratio?

I fully admit that I'm probably totally biased on these things, but I really feel like almost any classical measure is going to either overly favor rare things or do the opposite in trying to correct for that. Given how easy it is to marginalize out a Binomial under a Beta and compute KLs between posterior Betas (or similarly with Multinomial under a Dirichlet), I'm a bit surprised that I haven't seen anything along these lines.... Maybe my intuition is just bad here, but I feel like this is likely to do the right thing.

I recommend LaTeXMathML. You just need to add a couple of JavaScript lines to your webpage template. See http://www.maths.nottingham.ac.uk/personal/drw/lm.html

I'm used to seeing Pearson correlation as the first thing people mention for a correlation measure, but you don't even bother. Is there a reason I'm missing?

brendan -- a couple of reasons. first, i'm not using correlation in the standard statistical correlation sense. all i want to know is: when i see the phrase "real estate" do these words "go together" or not (for instance). the bigger problem is that sure you could do pearson, but every event will be either (0,0), (1,1), (0,1) or (1,0), so pearson is going to suck :).
