11 June 2007

First-best, Balanced F and All That

Our M.O. in NLP land is to evaluate our systems in a first-best setting, typically against a balanced F measure (balanced F means that precision and recall are weighted equally). Occasionally we see precision/recall curves, but typically only for straightforward classification tasks, not for more complex applications.

Why is this (potentially) bad? Well, it's typically because our evaluation criteria are uncalibrated against human use studies. In other words, picking on balanced F for a second, it may turn out that for some applications it's better to have higher precision, while for others it's better to have higher recall. Reporting a balanced F removes our ability to judge this. Sure, one can report precision, recall and F (and people often do this), but this doesn't give us a good sense of the trade-off. For instance, if I report P=70, R=50, F=58, can I conclude that I could just as easily get P=50, R=70, F=58 or P=R=F=58 using the same system but tweaked differently? Likely not. But this seems to be the sort of conclusion we like to draw, especially when we compare across systems by using balanced F as a summary.
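
To make the arithmetic concrete, here is a minimal sketch of the general F measure, the weighted harmonic mean of precision and recall, of which balanced F is the beta = 1 case. The function name `f_beta` and the choice of beta = 2 are illustrative assumptions, not something taken from any particular evaluation script.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 favors precision.
    beta = 1 gives the balanced F discussed above.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# The three operating points from the text share one balanced F ...
points = [(70, 50), (50, 70), (58, 58)]
balanced = [round(f_beta(p, r)) for p, r in points]

# ... but they separate as soon as the weighting is unbalanced
# (beta = 2 is an arbitrary recall-heavy choice for illustration).
recall_heavy = [round(f_beta(p, r, beta=2.0), 1) for p, r in points]

print(balanced)
print(recall_heavy)
```

All three points round to a balanced F of 58, yet under a recall-heavy weighting they are clearly ordered. The single balanced number hides exactly this difference.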

The issue is that it's essentially impossible for any single metric to capture everything we need to know about the performance of a system. This holds even further up the line, in applications like MT. The sort of translations that are required to do cross-lingual IR, for instance, are of a different nature than those that are required to put a translation in front of a human. (I'm told that for cross-lingual IR, it's hard to beat just doing "query expansion" using Model 1 translation tables.)

I don't think the solution is to proliferate error metrics, as has seemingly been popular recently. The problem is that once you start to apply 10 different metrics to a single problem (something I'm guilty of myself), you actually cease to be able to understand the results. It's reasonable for someone to develop a sufficiently deep intuition about a single metric, or two metrics, or maybe even three metrics, to be able to look at numbers and have an idea what they mean. I feel that this is pretty much impossible with ten very diverse metrics. (And even if possible, it may just be a waste of time.)

One solution is to evaluate at different "cutoffs" à la precision/recall curves, or ROC curves. The problem is that while this is easy for thresholded binary classifiers (just change the threshold), it is less clear for other classifiers, much less complex applications. For instance, in my named entity tagger, I can trade off precision against recall by postprocessing the weights and increasing the "bias" toward the "out of entity" tag. While this is an easy hack to accomplish, there's nothing to guarantee that this is actually doing the right thing. In other words, I might be able to do much better were I to directly optimize some sort of unbalanced F. For a brain teaser, how might one do this in Pharaoh? (Solutions welcome in comments!)
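
For the easy case, a thresholded binary classifier, the sweep can be sketched as follows. The scores and gold labels are made up for illustration, and the helper names `precision_recall_at` and `pr_curve` are hypothetical. This covers only the simple case, not the tagger or Pharaoh.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive iff score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    # Convention (an assumption): with no positive predictions, precision is 1.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pr_curve(scores, labels):
    """One (threshold, precision, recall) triple per distinct score, high to low."""
    return [(t, *precision_recall_at(scores, labels, t))
            for t in sorted(set(scores), reverse=True)]


# Made-up classifier scores and gold labels, purely for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = pr_curve(scores, labels)

for threshold, precision, recall in curve:
    print(f"t={threshold:.1f}  P={precision:.2f}  R={recall:.2f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases along the sweep, while precision is free to move either way. There is no equally obvious single knob for a structured predictor, which is the difficulty described above.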

Another option is to force systems to produce more than a first-best output. In the limit, if you can get every possible output together with a probability, you can compute something like expected loss. This is good, but limits you to probabilistic classifiers, which makes life really hard in structure land where things quickly become #P-hard or worse to normalize. Alternatively, one could produce ranked lists (up to, say, 100 best) and then look at something like precision at 5, 10, 20, 40, etc. as they do in IR. But this presupposes that your algorithm can produce k-best lists. Moreover, it doesn't answer the question of how to optimize for producing k-best lists.
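
Both ideas are easy to state in code once a k-best list exists. The sketch below assumes a ranked list of outputs with non-negative scores. Renormalizing those scores over the list is an approximation that ignores all probability mass outside the k-best, and the names `precision_at_k` and `expected_loss` are illustrative.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are in the relevant set."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def expected_loss(kbest, loss, gold):
    """Expected loss under a k-best list of (output, score) pairs.

    Scores are renormalized over the list itself, which is an
    approximation: mass outside the k-best is ignored.
    """
    total = sum(score for _, score in kbest)
    return sum(score / total * loss(output, gold) for output, score in kbest)


def zero_one(output, gold):
    """0/1 loss: 0 if the output matches the gold answer, else 1."""
    return 0.0 if output == gold else 1.0


# Made-up ranked list and relevance judgments.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
p_at = {k: precision_at_k(ranked, relevant, k) for k in (1, 2, 5)}

# Made-up 3-best list with unnormalized scores.
kbest = [("PER", 5.0), ("ORG", 3.0), ("O", 2.0)]
risk = expected_loss(kbest, zero_one, gold="ORG")

print(p_at)
print(risk)
```

Here the first-best output ("PER") is wrong, so first-best accuracy is zero, while the expected loss still credits the mass the system put on the correct answer.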

I don't think there's a one-size-fits-all answer. Depending on your application and your system, some of the above options may work. Some may not. I think the important thing to keep in mind is that it's entirely possible (and likely) that different approaches will be better at different points of trade-off.
