There are roughly four types of loss functions that are used in NLP research:
- The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
- The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly: they require humans in the loop.
- Automatic correlation-driven functions. Typical examples are Bleu, Rouge, word error rate and mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically, some effort has been put into showing correlation between these and something higher up the list.
- Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.
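As a concrete illustration of one of these intuition-driven metrics, here is a minimal sketch of span-level F-score as used in chunking and named-entity recognition (the spans and labels are made up for the example):

```python
def f_score(gold, predicted, beta=1.0):
    """Span-level F-score: harmonic mean (beta=1) of precision and recall."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that exactly match, label included
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy example: (start, end, label) entity spans.
gold = [(0, 2, "PER"), (5, 7, "ORG"), (9, 10, "LOC")]
pred = [(0, 2, "PER"), (5, 7, "LOC")]
print(f_score(gold, pred))  # precision 1/2, recall 1/3, so F1 = 2/5
```

Note that nothing in this definition connects it to any downstream task; it is easy to state, hard to game, and intuitively reasonable, which is exactly what makes it a (4).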
Note that the difference between (3) and (4) changes over time: there are many (4)s that could easily become (3)s given a few rounds of experimentation. (I apologize if I called something a (4) that should be a (3).)
It is important to always keep in mind that our goal is to improve (1). The standard line is: I can't compute (1) so I approximate it with (2). I can't optimize (2) so I approximate it with (3). (Or, I don't know how to define (2) so I approximate it with (4).)
I strongly feel that the higher up on this list one goes, the harder it is to actually define the evaluation metric. It is very easy to define things at the (4) level, because there are essentially no requirements on such a loss function other than that it be (A) hard to game and (B) intuitively reasonable. It is a bit harder to define things at the (3) level, because an additional criterion is added: (C) one must be able to convince people that the new loss function approximates an established (1) or (2). Defining something at the (2) level is shockingly difficult, and (1) is virtually impossible.
There are two questions that are crucial to progress. The first is: what is the minimum acceptable evaluation? We cannot reasonably require (1) in all cases. In many cases, I think it absolutely reasonable to require (2); e.g., for journal papers on problems for which a working definition of (2) is known. Conference papers can usually get by with (3) or (4). I would never accept a (4) when a (3) is known; but when no (3) exists (consider named-entity recognition), a (4) is all we have.
It's unclear if this is a good breakdown or not. I could argue that we should never accept (4): that you must show your results improve something humans care about, at least approximately. But this seems limiting. First, it means that if we want to solve a subproblem, we have to formally show that solving it will be useful before we actually solve it. Doing a bit of this internally is a good thing, but carrying out a sophisticated analysis to the point of publishable results, just to move something from (4) to (3), is a lot of effort for (often) little gain, especially in the context of a larger research agenda.
I could also argue that for many results, (4) is always sufficient. The argument here is that, from a machine learning perspective, if I can optimize (4), I can optimize anything else you give me. While I've made this argument personally, it's just not true. A more complex loss function might be much harder to optimize (consider optimizing Bleu-1 without brevity penalty versus full Bleu-4). Moreover, it may turn out that whatever biases my model has are good for one (4) and not for another. This is an okay (but not great) argument in an ML conference, but I wouldn't buy it in NLP.
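To make the Bleu example concrete, here is a toy sketch of Bleu (assuming a single reference and uniform n-gram weights; real implementations handle multiple references and smoothing). A short hypothesis can have perfect 1-gram precision, yet the full metric's brevity penalty and higher-order n-grams change the optimization landscape entirely:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(ref, hyp, n):
    """Clipped n-gram precision: each reference n-gram credits at most its count."""
    ref_counts, hyp_counts = Counter(ngrams(ref, n)), Counter(ngrams(hyp, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

def bleu(ref, hyp, max_n=4):
    """Single-reference Bleu: geometric mean of n-gram precisions x brevity penalty."""
    precisions = [modified_precision(ref, hyp, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on".split()  # correct but too short
print(modified_precision(ref, hyp, 1))  # 1.0: Bleu-1 w/o brevity penalty is perfect
print(bleu(ref, hyp))                   # exp(-0.5) ~ 0.607: brevity is punished
```

A system tuned to the first quantity is rewarded for emitting short, safe output; tuning to the second is a genuinely different (and harder) problem, which is the point.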
So how do other people rank these things? How much pressure is there to bring a (4) to a (3)? And how often should we do (2)?