06 February 2006

The Art of Loss Functions

There are roughly four types of loss functions that are used in NLP research.


  1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.

  2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.

  3. Automatic correlation-driven functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.

  4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.
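To make a type (4) function concrete, here is a minimal sketch of an f-score for named-entity recognition. The function name `span_f1` and the representation of an entity as a `(start, end, label)` tuple are assumptions made for illustration, not a standard scorer; real evaluations usually rely on an established script.

```python
def span_f1(gold, predicted):
    """F-score over exact-match entity spans.

    Each entity is a hypothetical (start, end, label) tuple; a prediction
    counts as correct only if all three fields match a gold entity.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


gold = [(0, 2, "PER"), (5, 6, "LOC")]
predicted = [(0, 2, "PER"), (3, 4, "ORG")]
print(span_f1(gold, predicted))
```

Note that nothing in this definition refers to a downstream task: it only needs annotated data up front, which is what places it at level (4).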


Note that the difference between (3) and (4) changes over time: there are many (4)s that could easily become (3)s given a few rounds of experimentation. (I apologize if I called something a (4) that should be a (3).)

It is important to always keep in mind that our goal is to improve (1). The standard line is: I can't compute (1) so I approximate it with (2). I can't optimize (2) so I approximate it with (3). (Or, I don't know how to define (2) so I approximate it with (4).)

I strongly feel that the higher up on this list one is, the harder it is to actually define the evaluation metric. It is very easy to define things at the (4) level because there are essentially no requirements on such a loss function other than (A) hard to game and (B) intuitively reasonable. It is a bit harder to define things at the (3) level because an additional criterion is added: (C) must be able to convince people that this new loss function approximates an established (1) or (2). Defining something at the (2) level is shockingly difficult and (1) is virtually impossible.
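One common way to argue for criterion (C) is to score several systems with both the automatic metric and a human evaluation, then check how well the two agree. Below is a minimal sketch using Pearson correlation. The helper `pearson` is written out by hand for self-containment, and the numbers in the usage example are invented purely for illustration; they are not results from any real evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five hypothetical systems (illustration only).
automatic_metric = [0.21, 0.25, 0.30, 0.33, 0.41]
human_judgment = [2.1, 2.6, 2.9, 3.4, 3.8]
print(pearson(automatic_metric, human_judgment))
```

A high correlation over a handful of systems is suggestive rather than conclusive, which is part of why moving a (4) to a (3) takes real experimental effort.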

There are two questions that are crucial to progress. The first is: what is the minimum acceptable evaluation? We cannot reasonably require (1) in all cases. In many cases, I think it is absolutely reasonable to require (2), e.g., for journal papers for which a working definition of (2) is known. Conference papers can usually get by with (3) or (4). I would never accept (4) when a (3) is known. But when no (3) is known (consider named entity recognition), (4) is all we have.

It's unclear if this is a good breakdown or not. I could argue never to accept (4): that you must show your results improve something humans care about, at least approximately. But this seems limiting. First, it means that if we want to solve a subproblem, we have to formally show that solving it will be useful before we actually solve it. This is a good thing to do a bit of internally, but doing a sophisticated analysis to the point of publishable results just to move something from (4) to (3) is a lot of effort for (often) little gain, especially in the context of a larger research agenda.

I could also argue that for many results, (4) is always sufficient. The argument here is that, from a machine learning perspective, if I can optimize (4), I can optimize anything else you give me. While I've made this argument personally, it's just not true. A more complex loss function might be much harder to optimize (consider optimizing Bleu-1 without brevity penalty versus full Bleu-4). Moreover, it may turn out that whatever biases my model has are good for one (4) and not for another. This is an okay (but not great) argument in an ML conference, but I wouldn't buy it in NLP.
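To see how different Bleu-1 without brevity penalty is from full Bleu-4, here is a simplified sketch. It assumes a single reference, sentence-level scoring, and no smoothing, so it is not the official Bleu implementation; the function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=4, brevity_penalty=True):
    """Unsmoothed, single-reference, sentence-level Bleu (a simplification)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    score = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if brevity_penalty and len(candidate) < len(reference):
        score *= math.exp(1 - len(reference) / len(candidate))
    return score


reference = "the cat sat on the mat".split()
candidate = ["the"]
print(bleu(candidate, reference, max_n=1, brevity_penalty=False))  # Bleu-1, no penalty
print(bleu(candidate, reference))                                  # full Bleu-4
```

Under this sketch the one-word output "the" scores perfectly on Bleu-1 without brevity penalty and zero on full Bleu-4, so a model tuned for the former need not do well on the latter.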

So how do other people rank these things? How much pressure is there to bring a (4) to a (3)? And how often should we do (2)?

7 comments:

William said...

after reading this post, went to a guest lecture from a guy from nuance who does asr for medical transcription. he mentioned they wanted to move away from wer and towards measuring productivity gain directly, in terms of time spent by the human transcriptionist in correcting the asr output.

so they might be in a situation to jump immediately from type 4 to type 1.
