26 May 2006

Here's a Loss, There's a Loss

There are several different notions of "loss" that one encounters in the machine learning literature. The definition I stick with is: the end criterion on which we will be evaluated. (The other notions usually concern how one approximates this loss for learning: for instance, the standard loss for classification is 0/1, but this is approximated by hinge loss or log loss or ...) For a large number of problems, there are quite a few "reasonable" losses. For chunking problems, there are 0/1 loss (did you get everything correct?), Hamming loss (how many individual predictions did you get correct?), and F-score (over chunks). For MT, there are WER, PER, HED, NIST, BLEU, etc. For summarization, there are all the ROUGE variants.
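
To make these concrete, here is a minimal sketch (my own toy illustration, not code from any particular toolkit) of how the three chunking losses behave on a pair of BIO-tagged sequences; the tag sequences and helper names are made up:

    def zero_one_loss(gold, pred):
        """1 if any tag differs, else 0 (did you get everything correct?)."""
        return 0 if gold == pred else 1

    def hamming_loss(gold, pred):
        """Fraction of individual tag predictions that are wrong."""
        return sum(g != p for g, p in zip(gold, pred)) / len(gold)

    def chunks(tags):
        """Extract (start, end, type) spans from a BIO tag sequence."""
        spans, start = set(), None
        for i, tag in enumerate(tags + ["O"]):   # sentinel closes the last span
            if start is not None and not tag.startswith("I-"):
                spans.add((start, i, tags[start][2:]))
                start = None
            if tag.startswith("B-"):
                start = i
        return spans

    def chunk_f1(gold, pred):
        """F-score over whole chunks: a chunk counts only if span and type match."""
        g, p = chunks(gold), chunks(pred)
        if not g and not p:
            return 1.0
        correct = len(g & p)
        precision = correct / len(p) if p else 0.0
        recall = correct / len(g) if g else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
    pred = ["B-NP", "I-NP", "O", "B-NP", "O"]
    print(zero_one_loss(gold, pred))   # 1: not everything is correct
    print(hamming_loss(gold, pred))    # 0.2: one of five tags is wrong
    print(chunk_f1(gold, pred))        # 0.5: one of two gold chunks recovered exactly

The point is just that the same prediction can look quite different under the three losses.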

Some of these losses are easier to optimize than others (easier from the learning perspective, that is). The question that arises is: if I can optimize one, should I worry about any of the others? A positive answer would make our lives much easier, because we could ignore some of the complexities. A negative answer would mean that we, as machine learning people, would have to keep "chasing" the people inventing the loss functions.

It seems the answer is "yes and no." Or, more specifically, if the problem is sufficiently easy, then the answer is no. In all other cases, the answer is yes.

Arguments in favor of "no":

  • All automatically evaluated loss functions are by definition approximations to the true loss. What does one more step of approximation hurt?
  • The loss functions are rarely that different: I can often closely approximate loss B with a weighted version of loss A, and weighting is much easier than changing the loss entirely (see the sketch after this list).
  • Empirical evidence in sequence labeling/chunking problems and parsing shows that optimizing 0/1, Hamming, accuracy, or F-score can all do very well.
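
As a sketch of the "weighted version of loss A" point above (my own toy construction, not a published recipe), a Hamming loss that up-weights errors on chunk-bearing tokens can act as a crude, easier-to-optimize stand-in for chunk F-score; the chunk_weight value is an arbitrary assumption:

    def weighted_hamming(gold, pred, chunk_weight=3.0):
        """Per-token loss where errors on chunk tokens (B-/I-) cost more
        than errors on O tokens; chunk_weight is an arbitrary choice."""
        total, worst = 0.0, 0.0
        for g, p in zip(gold, pred):
            w = chunk_weight if g != "O" else 1.0
            worst += w
            if g != p:
                total += w
        return total / worst   # normalized to [0, 1]

    gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
    print(weighted_hamming(gold, ["B-NP", "I-NP", "O", "O", "O"]))           # ~0.46: dropping a chunk is expensive
    print(weighted_hamming(gold, ["B-NP", "I-NP", "B-NP", "B-NP", "I-NP"]))  # ~0.08: an error on an O token is cheap

Tuning a single weight like this is much easier than building an F-score-aware learner, which is exactly the appeal of this argument.
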
Arguments in favor of "yes":
  • If the answer were not "yes", reranking would not be such a popular technique (though it also has other advantages, such as the features it can use).
  • Empirical evidence in MT and summarization suggests that optimizing different measures produces markedly different results (though how this translates into human evaluation is questionable).
  • If effort has been put into finding an automatic criterion that correlates closely with human judgment, we should take advantage of it if we want to do well on human judgments.
There is a fourth argument for "yes", which comes from the binary classification literature. If we solve hundreds of different classification problems, some of which are harder than others, we can directly compare how well one loss tracks another. The standard picture looks like the one below:

On the x-axis is one loss function (say, hinge loss) and on the y-axis is another (say, squared loss). Each dot corresponds to a different learning problem: those in the bottom left are easy, those in the top right are hard. What we commonly observe is that for very easy problems, it doesn't matter which loss you use (getting low error on one directly implies low error on the other). However, as the problems get harder, it makes a bigger and bigger difference which loss we use.
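
For concreteness, here is one way such a picture could be generated (entirely my own construction, not the source of the figure: the synthetic problems, the ridge-regression fit, and the noise levels are all assumptions): build binary problems of increasing label noise, fit the same simple linear classifier to each, and scatter held-out hinge loss against held-out squared loss.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    points = []
    for noise in np.linspace(0.0, 0.45, 20):      # one "learning problem" per noise level
        w_true = rng.normal(size=10)
        X = rng.normal(size=(2000, 10))
        y = np.sign(X @ w_true)
        flip = rng.random(2000) < noise           # harder problems = more flipped labels
        y[flip] *= -1
        X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]
        w = np.linalg.solve(X_tr.T @ X_tr + np.eye(10), X_tr.T @ y_tr)   # ridge fit
        margin = y_te * (X_te @ w)
        hinge = np.maximum(0.0, 1.0 - margin).mean()
        squared = ((1.0 - margin) ** 2).mean()
        points.append((hinge, squared))

    hinges, squares = zip(*points)
    plt.scatter(hinges, squares)
    plt.xlabel("hinge loss")
    plt.ylabel("squared loss")
    plt.show()

The low-noise problems land in the bottom left with both losses small; the interesting question is how much the two losses diverge as the dots move up and to the right.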

A confounding factor that argues even further for "yes" is that in the binary classification example, the losses used are ones for which zero loss on one always implies zero loss on the other. This is very much not the case for the sorts of losses we encounter in NLP.

So, my personal belief is that if we can optimize the more complicated loss, we always should, unless the problem is so easy that it is unlikely to matter (e.g., part-of-speech tagging). For any sufficiently hard problem, it is likely to make a big difference. That said, I would love to see results that repeat the experiment Franz did in his MT paper, but with human evaluations as well.

8 comments:

Kevin said...

The plot is very interesting! Do you have a specific reference for it?

I absolutely agree with you, Hal--definitely optimize on the loss you care about, unless there are computational reasons to do otherwise.

hal said...

The plot was made up, and right now I'm having trouble digging up a paper that has a similar one. I heard about this through folklore, not through a paper, though. One could create a similar plot based on the KDD CUP 2004 data; there is an associated paper that has some similar plots (see the right columns of the graphs), but they're not exactly the same. I know I've seen something like this published before, but can't seem to dig it up.

Just to play devil's advocate: what we optimize is rarely what we care about, since we care most often about some extrinsic non-automatic metric. We're so far from this anyway that if there is any computational reason to do otherwise (and sometimes these are quite severe!), maybe it doesn't really matter.

I'm not sure if anyone has formalized this, but it seems intuitively plausible that it is harder (from, say, a sample complexity perspective, ignoring computational complexity) to optimize a more complex loss function, and so we might lose there, too.

hal said...

I put two sets of plots up based on the cited paper... These don't show the effect quite as strongly as one would want because they're only based on two dozen or so systems. The first compares accuracy (x-axis) to everything else and the second compares ROC (x-axis) to everything else.
