29 July 2008

What Makes a Fair Comparison

This issue seems to come up, implicitly or explicitly, in a wide variety of circumstances. The general setup is this. Suppose that for some task, we have the "de facto" approach "A." I come up with a new approach "B" that I want to argue is better than "A." The catch is that, for whatever reason, I have to limit B in some way (perhaps in terms of the amount of data I train it on). So, is the fair comparison to a similarly limited version of A, or to the full-blown A?

Let's instantiate this with two examples. (Yes, these are examples of papers that have been published; you can see if you can figure out which ones they are. I'm just not sufficiently creative to make something up right now.)

  1. Let's say I'm doing syntactic language modeling. In this case, the baseline would be (say) a trigram language model. My model requires parse trees for training, so I train it on the Penn Treebank. Now, when I compare to an n-gram model, is the fair comparison to an n-gram model trained on the Treebank, or to one trained on (say) ten years' worth of WSJ text?
  2. Now let's say I'm doing MT. My baseline is Moses, and I build some fancy MT system that can only train on sentences of length 10 or less. Should I compare to Moses trained only on sentences of length 10 or less, or to Moses trained on all sentences? (A sketch of what this length restriction looks like in practice follows the list.)
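To make the "limit B's training data" setup concrete, here is a minimal sketch (in Python, with purely hypothetical file names) of how one might build the length-restricted training set from the MT example: keep only the sentence pairs where both sides have at most ten tokens. The same restricted corpus is what you would hand to Moses if you go the gimped-baseline route.

```python
# Hypothetical helper for the MT example: build the "gimped" training set by
# keeping only sentence pairs whose source and target sides are both short.
def filter_by_length(src_path, tgt_path, out_src_path, out_tgt_path, max_len=10):
    kept = total = 0
    with open(src_path) as src, open(tgt_path) as tgt, \
         open(out_src_path, "w") as out_src, open(out_tgt_path, "w") as out_tgt:
        for s, t in zip(src, tgt):
            total += 1
            if len(s.split()) <= max_len and len(t.split()) <= max_len:
                out_src.write(s)
                out_tgt.write(t)
                kept += 1
    print(f"kept {kept} of {total} sentence pairs")

# File names here are purely illustrative:
# filter_by_length("train.fr", "train.en", "train.len10.fr", "train.len10.en")
```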
The obvious response is "report both." But this just defers the problem: instead of you deciding which comparison is appropriate, you are leaving the decision up to someone else. Ultimately, someone has to decide.

The advantage of comparing to a gimped version of A is that it tells you: if this is the only data available to me, this is how much better I do than A. Of course, in no real-world situation (for many problems, including the two listed above) will this really be the only data you have. Plus, if you compare to a non-gimped A, you'll almost always lose by a ridiculously large margin.

On the other hand, comparing against a non-gimped A is a bit unfair. Chances are that quite some time has gone into optimizing (say, for speed) the algorithms behind A. Should I, as a developer of new models, also have to be a hard-core optimizer in order to make things work? I'm thinking back to the introduction of SVMs twenty years ago. Back then, SVM training would take thousands of times longer than naive Bayes. Today, (linear) SVM training really isn't that much slower (maybe a small constant factor longer).
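If you want to sanity-check that speed claim yourself, a rough (and entirely unscientific) timing sketch along these lines runs in a few seconds with scikit-learn; the synthetic count data, sizes, and particular classifiers are just assumptions for illustration, and real text-classification numbers will differ.

```python
# Rough timing sketch, not a benchmark: multinomial naive Bayes vs. a linear
# SVM, both from scikit-learn, on synthetic bag-of-words-style count data.
import time
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(5000, 1000))  # fake word-count features
y = rng.integers(0, 2, size=5000)         # fake binary labels

for name, clf in [("naive Bayes", MultinomialNB()),
                  ("linear SVM", LinearSVC())]:
    start = time.perf_counter()
    clf.fit(X, y)
    print(f"{name}: {time.perf_counter() - start:.2f} seconds")
```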

Yet from the perspective of a consumer, there's something fundamentally unpleasant about having to gimp an existing system: you have to ask yourself, why should I care?

I suppose this is where human judgement comes in. If I can reasonably imagine that the new system B might possibly be scaled up and, if so, I think it would continue to do well, then I'm not unhappy with a gimped comparison. For example, I can probably buy the syntactic language modeling example above (and, to a lesser degree, the MT example). I have a harder time with grammar induction on 10-word sentences, because my prior belief is that 10-word sentences are syntactically quite different from real sentences.

(Incidentally, although this issue didn't come up in my recent reviewing, I suppose that if I were reviewing a paper that had this issue, it wouldn't hurt if the authors eased me through this imagination process. For instance, in grammar induction, maybe you can show statistics demonstrating that the distribution of context-free rules is not so dissimilar between all sentences and short sentences. This almost never happens, but I think it would be useful both during reviewing and just for posterity.)
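Here is a rough sketch of the kind of statistic I have in mind for the grammar induction case, using the small Penn Treebank sample that ships with NLTK. The ten-word cutoff and the use of total variation distance as the summary number are just illustrative choices, not anything from a published comparison.

```python
# Compare the distribution of CFG productions in short sentences (<= 10 words)
# against all sentences, using the Penn Treebank sample bundled with NLTK.
from collections import Counter
import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)

def rule_distribution(trees):
    counts = Counter(str(p) for t in trees for p in t.productions())
    total = sum(counts.values())
    return {rule: c / total for rule, c in counts.items()}

all_trees = list(treebank.parsed_sents())
short_trees = [t for t in all_trees if len(t.leaves()) <= 10]

p_all = rule_distribution(all_trees)
p_short = rule_distribution(short_trees)

# Total variation distance between the two production distributions.
support = set(p_all) | set(p_short)
tv = 0.5 * sum(abs(p_all.get(r, 0.0) - p_short.get(r, 0.0)) for r in support)
print(f"{len(short_trees)} of {len(all_trees)} sentences are short; "
      f"TV distance between rule distributions = {tv:.3f}")
```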

1 comment:

Kevin Duh said...

I agree that we should ideally have both evaluations. If we only have time to do one, though, I would prefer comparing to the gimped A, since it is more informative from a research standpoint to reduce the number of free variables and understand how the methods really differ. Also, I can spend the time to do more detailed error analysis.

That said, if A is too gimped, I guess that would raise questions about whether B is practical at all.
