I can probably count on my fingers the number of papers I've submitted for which a reviewer hasn't complained about a baseline in some way. I don't mean to imply that all of those complaints are invalid: many of them were 100% on target, about things I'd either been lazy about or simply hadn't seen a priori.
In fact, I remember back in 2005 I visited MIT and gave a talk on what eventually became the BayeSum paper (incidentally, probably one of my favorite papers I've written, though according to friends not exactly the best written... drat). I was comparing in the talk against some baselines, but Regina Barzilay very rightly asked me: do any of these baselines have access to the same information that my proposed approach does? At the time I gave this talk the answer was no. In the actual paper the answer is yes. I think Regina was really on to something here, and this one question asked in my talk has had a profound impact on how I think about evaluation since then. For that, let me take a small time-out and say: thank you, Regina.
Like all such influential comments, my interpretation of Regina's question has changed over time. This post is essentially about my current thinking on the issue, and how it relates to the "does not compare against a strong enough baseline" reviewer critique, which is basically a way to kill any paper.
If we're going to ask whether an evaluation strategy is "good" or "bad", we have to ask ourselves why we're doing this evaluation thing in the first place. My answer always goes back to my prima facie question when I read/review papers: what did I learn from this paper? IMO, the goal of an evaluation should be to help isolate what I learned.
Let's go back to the BayeSum paper. There are two things I could have been trying to demonstrate in this paper. (A) I could have been trying to show that some new external source of knowledge is useful; (B) I could have been trying to show that some new technique is useful. In the case of BayeSum, the answer was more like (B), which is why Regina's comment was spot on. I was trying to claim the approach was good, but I hadn't disentangled the new approach from the data, and so you could have easily believed that the improvement over the baseline was due to the added source of data rather than the new technique.
In many cases it's not that cut and dried, because a (B)-style new technique might enable the use of (A)-style new data in a way that wasn't possible before. In that case, the thing that an author might be trying to convince me of is that the combination is good. That's fine, but I still think it's worth disentangling these two sources of information as much as possible. This is often tough because, in the current NLP atmosphere in which we're obsessed with shiny new techniques, it's not appealing to show that the new data gets you 90% of the gain and the new technique is only 10% on top of that. But that's another issue.
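To make that disentangling concrete, here's a minimal sketch of the kind of 2x2 ablation I have in mind. Everything in it (the evaluate function, the condition labels, the dummy scores) is a hypothetical placeholder rather than anything from the BayeSum experiments:

```python
import random

# A minimal sketch of a 2x2 ablation: cross {old, new} technique with
# {old, old+new} data so the contribution of the technique and of the data
# can be read off separately. All names and numbers here are placeholders.

def evaluate(technique, data):
    """Placeholder: train and score a system built with `technique` on `data`.
    In a real experiment this would return BLEU/ROUGE/whatever metric you use."""
    return round(random.uniform(30.0, 40.0), 1)  # dummy score so the sketch runs

conditions = [
    ("old technique", "old data"),        # the plain baseline
    ("old technique", "old + new data"),  # gain attributable to the data alone
    ("new technique", "old data"),        # gain attributable to the technique alone
    ("new technique", "old + new data"),  # the full system
]

for technique, data in conditions:
    score = evaluate(technique, data)
    print(f"{technique:15s} | {data:16s} | {score}")
```

The two middle rows are the ones Regina's question is really about: they show how much of the combined improvement comes from the data and how much from the technique.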
So evaluations are to help us learn something. Let's return now to the "you didn't compare against a strong enough baseline" complaint. Aside from parroting what's been said many times in the past, what is the point of such a complaint, beyond Regina's challenge (which I hope I've made clear I agree with)? The issue, as best I can understand it, is that the complaint rests on the following logic:
- Assumption: if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.
And, like the title of this blog post suggests, I would like to put forth the idea that this assumption is often ridiculous, or at least that there's plentiful evidence against it.
I'm going to pick on machine translation now just to give a concrete example, but I don't think this phenomenon is limited to MT in any way. The basic story is that I start with some MT system like Moses or cdec or whatever. I add some features to it and performance goes up. The claim is that if my baseline MT system wasn't already sufficiently strong, then any improvement I see from my proposed technique could just be solving a problem that would already have been solved had I simply tuned Moses better.
There's truth to this claim, but there's also untruth. A famous recent example is the Devlin et al. neural network MT paper. Let me be clear: I think this paper is great and I 100% believe the results that they presented. I'm not attacking this paper in any way; I'm choosing it simply as a representative example. One of the results they show is some insane 8 bleu point gain over a very strong baseline. And I completely believe that this was a very strong baseline. And that the 8 bleu point improvement was real. And that everything is great.
Okay, so any paper that leads to an 8 bleu point gain over a very strong baseline is going to get reimplemented by lots of people, and this has happened. Has anyone else gotten an 8 bleu point gain? Not that I've heard. I've heard numbers along the lines of 1 to 2 bleu points, but it's very plausible I haven't heard the whole story.
So what's going on here?
The answer is simply that the assumption I wrote above is false. We've assumed that since they got 8 points on a strong baseline, I'll get at least as much on my baseline (which is likely weaker than theirs).
One problem is that "strong" isn't a total order. Different systems might get similar bleu scores, but this doesn't mean that they get them in the same way. Something like the neural network stuff clearly solved a major problem in the BBN strong baseline system, but that problem evidently wasn't as big a deal in some other strong baseline systems.
Does this make the results in the Devlin paper any less impressive or important? No, of course not. I learned a lot from that paper. But one thing I didn't learn is "if you apply this approach to any system that's weaker than our strong baseline, you will get 8 bleu points." That's just not a claim that their results substantiate, and the only reason people seem to believe it should be true is the faulty assumption above.
So does this mean that comparing to strong baselines is unimportant and everyone should go back to comparing their MT system against a word-for-word model 1 baseline?
Of course not. There are lots of ways to be better than such a baseline, and so "beating" it doesn't teach me anything. I always tell students not to get too pleased when they get state-of-the-art performance on some standard task: someone else will beat them next year. If the only thing I learn from a paper is that it wins on task X, then next year there's nothing left to learn from it. The paper has to teach me something else to have any lasting effect: what is the generalizable knowledge?
The point is that an evaluation is not an end in itself. An evaluation is there to teach you something, or to substantiate something that I want to teach you. If I want to show you that X is important, then I should show you an experiment that isolates X to the best of my ability and demonstrates an improvement, preferably also with an error analysis that shows that what I claim my widget is doing is actually what it's doing.
8 comments:
Great thoughts, Hal. On the industry side of things, we struggle with the evaluation of proprietary data sets. It's often the case that we don't have the luxury of performing baseline experiments at all. Often, we're handcuffed by some kind of decontextualized *accuracy* number that hardly means anything at all, yet customers focus on it. Customers often want that one magic number that tells them how *good* a system is.
Great post, Hal. I agree with most of what you said, except that your conclusion about the Devlin paper seems a bit off to me. You seem to argue that they were comparing against a strong baseline, but one that had different weaknesses than everyone else's systems. The problematic part is the magnitude of the improvement.
Let's assume that your argument is true. Then the implications are that:
-- BBN must now have a system that is at least 4 bleu points better than anything else out there (after allowing for the baseline to be 4 bleu points below the current state of the art, which is quite a bit). I find that hard to believe.
-- If all other systems indeed get only a 1 to 2 bleu point improvement and the baseline of the paper is indeed strong (i.e., close to the state of the art), then the baseline in the paper must have strengths that are very different from the strengths of all other systems out there. That means the baseline is quite interesting and innovative -- the more interesting part of the system.
One thing I would agree with is that improvements over a simple baseline using a simple and principled approach are worthy in themselves, as they reduce overall complexity in the system (for example, the "Natural Language Processing (almost) from Scratch" paper). But that does not seem to be the case you're arguing here.
'if your approach improves things against a strong baseline, then it will also improve against a weaker baseline, perhaps by more.'
For sure, there's no reason to expect the above to be correct. There are plenty of MT examples: e.g., to get the best use out of a monster LM, you need an MT decoder that can apply the LM to a really large search space, and that decoder probably would have yielded a relatively strong baseline anyway. But, even so, it's easier to improve a weak baseline than a strong baseline. So it's still reasonable for reviewers to request strong baselines.
The bigger issue is why the literature is full of so many weak baselines. Building a strong baseline in the way that BBN did is *really hard*. Few researchers could do it, even if they wanted to. It's not simply a case of "running Moses better".
Arguably, this is why strong baselines make for better research. The skills required to build a strong baseline are the same skills needed to get new and different models to work well.
It's +6 bleu for the BBN NN paper, not +8. Still, the point stands.
One of the main problems here is the assumption that weak baselines are useful. If you have a system X that is stronger than Y, you probably want to make X even stronger.
If somebody beats your results and invents a way to create an even stronger system, that's great. This is the way to make progress. Improving weak systems doesn't normally lead to progress, and improvements demonstrated on weak systems often don't add up.
Of course, certain improvements over weak baselines do teach important lessons. Furthermore, a weak baseline may represent a new direction eventually producing much stronger systems.
However, I suspect these are relatively infrequent cases.
Thanks for a great post, Hal. I'm getting back into research having been in industry for a few years. May I ask what might be a very basic question...
What do you mean by "baseline"? Is it a dataset that everyone agrees demonstrates a class of problem well?
Thank you!
@Chris: interesting, thanks! I'm vaguely aware of this, and it seems like industries often have their own (potentially uninteresting, just like ours :P) numbers they care about.
@Ves: that's a really fascinating deconstruction! i guess another interpretation is that the particular architecture that they ended up selecting for was a function of their input system, and then the fact that it doesn't work as well for other input systems is not so surprising. i somehow imagine this has more to do with idiosyncrasies of systems and bleu and other things than real improvement, but idk.
@RB1: yes, i agree. i guess my main point is that improving on a baseline---strong or weak---is really not the point.
@Jon: thanks for the correction!!!
@itman: see reply to @RB1 :). i agree that finding new hills to climb is relatively rare, but i think this is also at least partially due to our oversubscription to the notion that greedy hillclimbing is a good approach, especially for, say, grad students or untenured faculty, in whom most of this potential probably resides anyway. yes, i agree that weak baselines don't necessarily tell us anything useful, and we should avoid weak baselines for the same reasons we should avoid strong baselines: they're not the end, they're a means.
@rob: i more mean techniques (which are often applied to a dataset).
Hal, I'm curious as to what you think about Feynman's position that your baseline needs to be internal to the work that you're reporting, rather than being someone else's work. Here's how his position is characterized by Wikipedia:
"An example of cargo cult science is an experiment that uses another researcher's results in lieu of an experimental control. Since the other researcher's conditions might differ from those of the present experiment in unknown ways, differences in the outcome might have no relation to the independent variable under consideration."
https://en.wikipedia.org/wiki/Cargo_cult_science