06 February 2006

The Art of Loss Functions

There are roughly four types of loss functions that are used in NLP research.

  1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.

  2. The human-evaluation function. Typical examples are fluency/adequecy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.

  3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean-average-precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.

  4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Note that the difference between (3) and (4) changes over time: there are many (4)s that could easily become (3)s given a few rounds of experimentation. (I apologize if I called something a (4) that should be a (3).)

It is important to always keep in mind that our goal is to improve (1). The standard line is: I can't compute (1) so I approximate it with (2). I can't optimize (2) so I approximate it with (3). (Or, I don't know how to define (2) so I approximate it with (4).)

I strongly feel that the higher up on this list one is, the harder it is to actually define the evaluation metric. It is very easy to define things at the (4) level because there are essentially no requirements on such a loss function other than (A) hard to game and (B) intuitively reasonable. It is a bit harder to define things at the (3) level because an additional critereon is added: (C) must be able to convince people that this new loss function approximates an established (1) or (2). Defining something at the (2) level is shockingly difficult and (1) is virtually impossible.

There are two questions that are crutial to progress. The first is: what is the minimum acceptable evaluation. We cannot reasonably require (1) in all cases. In many cases, I think it absolutely reasonable to require (2); eg. journal papers for which a working definition of (2) is known. Conference papers can usually get by with (3) or (4). I would never accept (4) when a (3) is known. But when not (consider named entity recognition), it's all we have.

It's unclear if this is a good breakdown or not. I could argue never to accept (4): that you must show your results improve something humans care about, at least approximately. But this seems limiting. First, it means that if we want to solve a subproblem, we have to formally show that solving it will be useful before we actually solve it. This is a good thing to do a but of internally, but to do a sophisticated analysis to the point of publishable results to move something from (4) to (3) is a lot of effort for (often) little gain. Especially in the context of a larger research agenda.

I could also argue that for many results, (4) is always sufficient. The argument here is that, from a machine learning perspective, if I can optimize (4), I can optimize anything else you give me. While I've made this argument personally, it's just not true. A more complex loss function might be much harder to optimize (consider optimizing Bleu-1 without brevity penalty versus full Bleu-4). Moreover, it may turn out that whatever biases my model has are good for one (4) and not for another. This is an okay (but not great) argument in an ML conference, but I wouldn't buy it in NLP.

So how do other people rank these things? How much pressure is there to bring a (4) to a (3)? And how often should we do (2)?


Anonymous said...

after reading this post, went to a guest lecture from a guy from nuance who does asr for medical transcription. he mentioned they wanted to move away from wer and towards a measuring productivity gain directly in terms of time spent by the human transcriptionist in correcting the asr output.

so they might be in a situation to jump immediately from type 4 to type 1.

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花