28 February 2007

Loss Functions for Chunking Tasks

I feel like this is starting to be a bit of a dead horse, but I wanted to follow up on previous posts about f-score versus accuracy for chunking problems.

An easy observation is that if you have a chunking problem in which the majority of the chunks span multiple words, then it is possible to get a model that achieves quite good accuracy but abysmal f-score. Of course, real world taggers may not actually do this. What I wanted to know was the extent to which accuracy, f-score and ACE score fail to correlate when used with a real world tagger.
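To make the observation concrete, here is a minimal sketch in Python, assuming BIO-encoded tags and exact-match chunk scoring in the CoNLL style. The helper names (`chunks`, `hamming_accuracy`, `chunk_f1`) and the toy tag sequences are made up for illustration; this is not TagChunk's evaluation code.

```python
def chunks(tags):
    """Extract (start, end, type) spans from a BIO tag sequence.

    Lenient about malformed input: a stray "I-X" after "O" starts a new chunk.
    """
    spans, start, kind = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last chunk
        # An open chunk ends at "O", at any "B-", or at an "I-" of another type.
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != kind
        ):
            spans.append((start, i, kind))
            start, kind = None, None
        if tag != "O" and start is None:
            start, kind = i, tag[2:]
    return set(spans)


def hamming_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chunk_f1(gold, pred):
    """F-score over exactly matching chunks (same boundaries and type)."""
    g, p = chunks(gold), chunks(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


# Toy data: five 10-word chunks. The hypothetical tagger gets every word
# right except the last word of each chunk, which it tags "O".
gold = (["B-NP"] + ["I-NP"] * 9) * 5
pred = (["B-NP"] + ["I-NP"] * 8 + ["O"]) * 5

print(hamming_accuracy(gold, pred))  # 0.9
print(chunk_f1(gold, pred))          # 0.0
```

Nine out of ten tags are right, yet no predicted chunk has the correct right boundary, so the f-score is zero.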

Here's the experiment. I use TagChunk for all the experiments. I do experiments on both syntactic chunking (CoNLL data) and NER (also CoNLL data), both in English. In the experiments, I vary (A) the amount of training data used, (B) the size of the beam used by the model, and (C) the number of training iterations. When (A) is varied, I run five sets of training data, each randomly selected.

The data set sizes I used are: 8, 16, 32, 64, 125, 250, 500, 1000, 2000, 4000 and 8000. (These are numbers of sentences, not words. For the NER data, I remove the "DOCSTART" sentences.) The beam sizes I use are 1, 5 and 10. The number of iterations ranges from 1 to 10.

For the chunking problem, I tracked Hamming accuracy and f-score (on test) in each of these settings. These are drawn below:



As we can see, the relationship is ridiculously strongly linear. Basically once we've gotten above an accuracy of 80%, it would be really really hard to improve accuracy and not improve F-score. The correlation coefficient for this data is 0.9979 and Kendall's tau is 0.9795, both indicating incredibly strong correlation (formal caveat: these samples are not independent).
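For reference, both statistics are easy to compute from a list of (accuracy, F) pairs. Below is a minimal pure-Python sketch. The function names and the sample points are hypothetical placeholders, not the values from these experiments, and the tau shown is the simple tau-a variant, which applies no correction for ties.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Placeholder points, one per (hypothetical) experimental setting.
acc = [0.80, 0.85, 0.90, 0.93, 0.95]
f = [0.55, 0.68, 0.80, 0.86, 0.91]

print(pearson(acc, f))
print(kendall_tau_a(acc, f))
```

Pearson measures how close the points lie to a straight line, while tau only asks whether the two metrics rank the settings in the same order, which is the question that matters when choosing between models.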

For the NER task, I do the same experiments, but this time I keep track of accuracy, F-score and (a slightly simplified version of) the ACE metric. The results are below:



The left-most image is accuracy-versus-F, the middle is accuracy-versus-ACE and the right is F-versus-ACE. ACE seems to be the outlier: it correlates least with the other two. As before, with accuracy-to-F, we get a ridiculously high correlation coefficient (0.9959) and tau (0.9761). This drops somewhat when going to accuracy-to-ACE (corr=0.9596 and tau=0.9286) or to F-to-ACE (corr=0.9715 and tau=0.9253).

Nevertheless, the majority of the non-linearity occurs in the "very low accuracy" region. Here, that region is in the 0.8-0.9 range, not the 0.1-0.5 range as in chunking. This is because in chunking, almost every word is in a chunk, whereas in NER there are a ton of "out of chunk" words.
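A quick way to see where that floor comes from: a tagger that labels every word "O" finds no entities at all, yet its Hamming accuracy equals the fraction of out-of-chunk words. The sketch below uses a made-up ten-word sentence with BIO tags; the tag names and proportions are illustrative, not statistics from the CoNLL data.

```python
def hamming_accuracy(gold, pred):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Hypothetical NER sentence: one two-word entity, eight out-of-chunk words.
gold = ["B-PER", "I-PER"] + ["O"] * 8

# Trivial baseline that never predicts an entity.
all_o = ["O"] * len(gold)

print(hamming_accuracy(gold, all_o))  # 0.8
```

Any model worth comparing already sits above this baseline, so for NER the interesting part of the accuracy scale starts high.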

The take-away from these experiments is that, so long as you have a reasonably good model (i.e., once you're getting accuracies that are sufficiently high), it doesn't really matter what you optimize. If your model is terrible or if you don't have much data, then it does. It also seems to matter a lot whether the end metric is F or ACE. For F, it's pretty much always okay to just optimize accuracy. For ACE, not so much, particularly if you don't have sufficient data.

10 comments:

Libin said...

Happy families are all alike; every unhappy family is unhappy in its own way.

Here, we could replace 'happy' with 'high-scored', and 'families' with 'metrics'.

Anonymous said...

Natural Language Processing using an Ontology. What does everyone think?

http://www.landcglobal.com

Language and Computing Inc.

hal said...

i love it, libin! is that some actual proverb/saying, or are you just especially clever?

i guess at the extreme, a raw 0/1 loss over the entire structure is also "reasonable" and really drives this point home... since once you get it 100% correct, no matter how you measure, you're doing great.

there is a small counter-example to this. a few years back in the summarization community when people first started using automatic evaluation, it was observed that greater agreement with human judgments could be had if the scorer completely ignored stop words. the repercussion was that if you built your summarization system to exclude all stop words, you would get a score like twice as high as the best system, because you could pack a ton more content words in the 100 word limit.

so i think that the cute proverb is only accurate when the metric isn't gameable.

Anonymous said...

Take a look at the first few lines of Tolstoy's Anna Karenina ...
