In fact, I remember back in 2005 I visited MIT and gave a talk on what eventually became the BayeSum paper (incidentally, probably one of my favorite papers I've written, though according to friends not exactly the best written... drat). I was comparing in the talk against some baselines, but Regina Barzilay very rightfully asked me: do any of these baselines have access to the same information that my proposed approach does? At the time I gave this talk the answer was no. In the actual paper the answer is yes. I think Regina was really on to something here, and this one question asked in my talk has had a profound impact on how I think about evaluation since then. For that, I take a small time-out and say: Thank you, Regina.
Like all such influential comments, my interpretation of Regina's question has changed over time, and this post is essentially about my current thinking on this issue, and how it relates to this "does not compare against a strong enough baseline" reviewer critique that is basically a way to kill any paper.
If we're going to ask the question about whether an evaluation strategy is "good" or "bad" we have to ask ourselves why are we doing this evaluation thing in the first place. My answer always goes back to my prima facie question when I read/review papers: what did I learn from this paper? IMO, the goal of an evaluation should be to help isolate what I learned.
Let's go back to the BayeSum paper. There are two things I could have been trying to demonstrate in this paper. (A) I could have been trying to show that some new external source of knowledge is useful; (B) I could have been trying to show that some new technique is useful. In the case of BayeSum, the answer was more like (B), which is why Regina's comment was spot on. I was trying to claim the approach was good, but I hadn't disentangled the new approach from the data, and so you could have easily believed that the improvement over the baseline was due to the added source of data rather than the new technique.
In many cases it's not that cut and dried, because a (B)-style new technique might enable the use of (A)-style new data in a way that wasn't possible before. In that case, the thing that an author might be trying to convince me of is that the combination is good. That's fine, but I still think it's worth disentangling these two sources of improvement as much as possible. This is often tough because in the current NLP atmosphere, in which we're obsessed with shiny new techniques, it's not appealing to show that the new data gets you 90% of the gain and the new technique is only 10% on top of that. But this is another issue.
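To make the disentangling concrete, here is a minimal sketch of the bookkeeping for a 2x2 ablation (with and without the new data, with and without the new technique). The function name `decompose_gain` and the scores in the usage lines are invented for illustration; they are not from BayeSum or any real system.

```python
from typing import Dict


def decompose_gain(baseline: float, data_only: float,
                   technique_only: float, both: float) -> Dict[str, float]:
    """Split the total gain of a 2x2 ablation into three parts.

    baseline       -- old technique, old data
    data_only      -- old technique, given access to the new data
    technique_only -- new technique, without the new data
    both           -- new technique with the new data
    """
    total = both - baseline
    from_data = data_only - baseline
    from_technique = technique_only - baseline
    # Whatever is left over is gain that only shows up when the two are combined.
    interaction = total - from_data - from_technique
    return {
        "total": total,
        "data": from_data,
        "technique": from_technique,
        "interaction": interaction,
    }


# Made-up scores, chosen so the data accounts for 90% of the gain.
parts = decompose_gain(baseline=20.0, data_only=29.0,
                       technique_only=20.5, both=30.0)
for name, value in parts.items():
    print(f"{name:>11}: {value:+.1f}")
```

The `data_only` cell is the one Regina's question asks for: a baseline with access to the same information as the proposed approach. When the old technique genuinely cannot use the new data, that cell is empty, and the honest claim is about the combination.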
So evaluations are to help us learn something. Let's return now to the complaint that "you didn't compare against a strong enough baseline." Aside from parroting what's been said many times in the past, what is the point of such a complaint, beyond Regina's challenge (which I hope I've made clear I agree with)? The issue, as best I can understand it, is that it is based on the following logic:
- Assumption: if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.
And, like the title of this blog post suggests, I would like to put forth the idea that this assumption is often ridiculous, or at least that there's plentiful evidence against it.
I'm going to pick on machine translation now just to give a concrete example, but I don't think this phenomenon is limited to MT in any way. The basic story is I start with some MT system like Moses or cdec or whatever. I add some features to it and performance goes up. The reviewer's claim is that if my baseline MT system wasn't already sufficiently strong, then any improvement I see from my proposed technique could just be solving a problem that would already have been solved if I had just tuned Moses better.
There's truth to this claim, but there's also untruth. A famous recent example is the Devlin et al. neural network MT paper. Let me be clear: I think this paper is great and I 100% believe the results that they presented. I'm not attacking this paper in any way; I'm choosing it simply as a representative example. One of the results they show is some insane 8 bleu point gain over a very strong baseline. And I completely believe that this was a very strong baseline. And that the 8 bleu point improvement was real. And that everything is great.
Okay, so any paper that leads to an 8 bleu point gain over a very strong baseline is going to get reimplemented by lots of people, and this has happened. Has anyone else gotten an 8 bleu point gain? Not that I've heard. I've heard numbers along the lines of 1 to 2 bleu points, but it's very plausible I haven't heard the whole story.
So what's going on here?
The answer is simply that the assumption I wrote above is false. We've assumed that since they got 8 points on a strong baseline, I'll get at least as much on my baseline (which is likely weaker than theirs).
One problem is that "strong" isn't a total order. Different systems might get similar bleu scores, but this doesn't mean that they get them in the same way. Something like the neural networks stuff clearly solved a major problem in the BBN strong baseline system, but this problem clearly wasn't as major a problem in some other strong baseline systems.
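Here is a toy sketch of why the gain need not transfer. It assumes, purely for illustration, that a system's score is some ceiling minus the points it loses to each kind of error, and that a new technique removes a fixed fraction of one kind of error. The error categories, the numbers, and the helper names are all invented; this is not a model of the BBN system or of any real MT system.

```python
from typing import Dict

CEILING = 40.0  # hypothetical best achievable score


def score(losses: Dict[str, float]) -> float:
    """Score = ceiling minus the points lost to each error category."""
    return CEILING - sum(losses.values())


def apply_fix(losses: Dict[str, float], category: str,
              fraction: float) -> Dict[str, float]:
    """Return a copy of `losses` with `fraction` of one category's loss removed."""
    fixed = dict(losses)
    fixed[category] = losses.get(category, 0.0) * (1.0 - fraction)
    return fixed


systems = {
    # Two equally "strong" baselines that lose their points in different places.
    "strong_a": {"reordering": 10.0, "lexical_choice": 2.0},
    "strong_b": {"reordering": 2.0, "lexical_choice": 10.0},
    # A weaker baseline whose main problem is not the one being fixed.
    "weaker_c": {"reordering": 3.0, "lexical_choice": 12.0},
}

gains = {}
for name, losses in systems.items():
    before = score(losses)
    after = score(apply_fix(losses, "reordering", 0.75))
    gains[name] = after - before
    print(f"{name}: {before:.2f} -> {after:.2f} (gain {gains[name]:+.2f})")
# strong_a: 28.00 -> 35.50 (gain +7.50)
# strong_b: 28.00 -> 29.50 (gain +1.50)
# weaker_c: 25.00 -> 27.25 (gain +2.25)
```

In this toy setup the two strong baselines have identical scores but see very different gains, and the weaker baseline gains far less than `strong_a`, which is exactly what the assumption above says should not happen.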
Does this make the results in the Devlin paper any less impressive or important? No, of course not. I learned a lot from that paper. But one thing I didn't learn is "if you apply this approach to any system that's weaker than our strong baseline, you will get 8 bleu points." That's just not a claim that their results substantiate, and the only reason people seem to believe that this should be true is because of the faulty assumption above.
So does this mean that comparing to strong baselines is unimportant and everyone should go back to comparing their MT system against a word-for-word model 1 baseline?
Of course not. There are lots of ways to be better than such a baseline, and so "beating" it does not teach me anything. I always tell students not to get too pleased when they get state-of-the-art performance on some standard task: someone else will beat them next year. If the only thing that I learn from their paper is that they win on task X, then next year there's nothing to learn from that paper. The paper has to teach me something else to have any sort of lasting effect: what is the generalizable knowledge?
The point is that an evaluation is not an end in itself. An evaluation is there to teach you something, or to substantiate something that I want to teach you. If I want to show you that X is important, then I should show you an experiment that isolates X to the best of my ability and demonstrates an improvement, preferably also with an error analysis that shows that what I claim my widget is doing is actually what it's doing.