I occasionally--more than I would like--run in to the following problem.  I am working on some task for which there is prior work (shocking!).  I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable.  Typically this involves comparing fairly directly against said prior work.  Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference in systems be what I am trying to propose.  Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use an identical feature set to them.  Another easy example: I am working on some task; I want to use the same training/dev/test set as the prior work.
The problem is: what if they did it wrong (in my opinion).  There are many ways for doing things wrong and it would be hard for me to talk about specifics without pointing fingers.  But just for concreteness, let's take the following case relating to the final example in the above paragraph.  Say that we're doing POS tagging and the only prior work POS tagging paper used as test data every tenth sentence in WSJ, rather than the last 10%.  This is (fairly clearly) totally unfair.  However, it leaves me with the following options:
- I can repeat their bad idea and test on every tenth sentence.
- I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
- I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
- I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
- I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.
(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry.
(3) just seems intellectually dishonest.
(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)
(1) also seems intellectually dishonest.
At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.
I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead.
 

15 comments:
In too many situations, it is nearly impossible to replicate a complete test setup in order to evaluate a new algorithm against previously published results. Even in the best written papers, there are experimental details left out. Every decision along the way can affect the final outcome: from the minutiae of corpus preprocessing to differing interpretations of something as simple as "the last 10%", from your example.
The only really "fair" option is to re-implement previously published methods and compare on a completely level playing field. I am suspicious of results that simply compare against performance numbers from previous papers without re-running the experiments. In this light, (1) and (4) would lead to possibly bogus comparisons. (5) is really the only option if you must compare with the previous work.
As an added benefit, re-implementing someone else's algorithm is almost always enlightening.
I think the issue is what do you want your scope of your work to be. Are you trying to show that their technique was flawed and you have a better testing technique? or are you trying to introduce your own work.
If you are trying to do both, and can in the time/space you have, that would be great. If the two issues are large enough, maybe you should address them separately. Alternately you could list their prior work as only PARTIALLY applicable with an explanation of why.
If I were reviewing the paper (and say that I knew the previous work), then only 4 and 5 would be acceptable in my mind. Anything else smacks of naivete or disingenuousness.
This exact issue came up in a small way with some work I did once on document classification. The standard was to classify every tenth document in the corpus, but the corpus had many very nearly duplicated articles. That meant that there was a really good chance that the articles in the test set were also in the training set. My final solution was to use one year for training and the next for test.
I was lucky in that the major system to compare with (UMASS) was willing to run their system on my comparison so (5) worked for me.
Continually testing on the same subsection of a corpus is quite misleading for two reasons: the high variance across sections and the resulting multiple comparisons problem.
What I find is that when I do 10-fold cross-validation on problems like part-of-speech tagging, spelling correction, Chinese word segmentation, named-entity recognition or classification, is that the difference between folds implies substantial variance of performance estimates. This variation across folds is often greater than inter-system differences on the "official" test section.
The multiple comparisons problem is exacerbated by studies that make the error of using the training data as development data. The whole field is falling into this trap by repeatedly reporting results on Section 23 of the Penn Treebank. We're overtraining to a particular test set.
I think the answer is "it depends".
I'd be inclined to say that (5) is *in general* a bad idea, for scientific rather than practical reasons. The problem is that, in real life, it's almost always impossible to do a previous approach justice when re-implementing---and what would motivate an author to do that anyhow? That suggests choice (6), re-implement and run on both the last 10% and every tenth sentence. At least then you can show your re-implementation is up to snuff (maybe).
Another difficulty I regularly have is comparing against previous work when their corpus is not publicly available. In that case, I'm essentially forced to re-implement their method, but I have almost no way of knowing that I got it right.
I wish I could do (2) or (3). It is a waste of valuable space in the 8 page limit to spend several paragraphs explaining deficiencies of the previous approach to evaluation, since doing so does not actually add to the intellectual contribution of my paper. I would rather use those inches to talk about qualitative evaluation, but a complete quantitative evaluation is crucial for paper acceptance. Since I know my evaluation method is superior, I would consider it a waste of time to do (5), unless the original author agrees to run his/her system. I agree that (4) is the best option, assuming that there are only minor differences in the evaluation methods. If the format of the output representation is substantially different, then it's comparing apples and oranges, since one representation may not be evaluable (evaluatable?) under the other metric. It is a problem if reviewers do not agree that the difference in output representation is actually important.
Not to answer the question, but addressing the underlying problem: I think this question emphasises the need for "reproducible research" (e.g. see http://epub.wu-wien.ac.at/dyn/virlib/wp/eng/mediate/epub-wu-01_c75.pdf?ID=epub-wu-01_c75). The notion is, that in computational sciences, the author should make available all the code and data needed to allow any reader to reproduce exactly the results claimed in the paper. Or as Claerbout's Principle puts it, "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."
Naive question - why is 'every tenth sentence' so much worse than 'last 10%'?
It isn't that naive a question, else metrics like that wouldn't be used so often.
The problem is that your test set is too similar to the training set and thus the results you get give you an unreasonably optimistic view of how your system will perform on new text. A similar example with newswire shows this problem even more starkly. Since so many articles are updates of previous articles, the test set will actually have many very near duplicates of items in the training set. This would make, for example, nearest neighbor methods work unreasonably well.
If you choose instead a test set from the last 10% (of sentences) or the succeeding year (with articles) then you will get much lower rate of these fortuitous accidents and your estimates of accuracy will be much more realistic.
In the original posting, however, this case is merely emblematic of the underlying problem of what to do when the standard evaluation methodology is flawed. The particular flaw isn't as important as the general problem.
This is annoying -- for some reason Blogger stopped emailing me notifications of comments to posts, so I didn't know people had written!
I pretty much agree with all the comments... One thing I will say in favor of reimplementing someone else's stuff, to echo one of the comments, is that it's often quite educational. Also, I think the best way to improve on something is to actually see where it breaks down. This is impossible unless you have a baseline system.
Andrew: continuing on Ted's reply... one big issue in, say, POS tagging is the issue of unknown words. Almost all errors are on unknown words for state-of-the-art taggers. But rare words often appear multiple times in the same article. So if one of the times happens to be in the "training" part, then this word will no longer be unknown. So it isn't fair.
I admire your thoughtful analysis of this scenario, and I know you are being realistic and being pragmatic about things as they are now...
But, still, I have to ask myself how in the world did we reach a point in this business where we accept the premise that we need to do re-implementations to carry out comparative studies and reproduce results? I have done this, I suppose we all have, but I'm increasingly amazed that we put up with it.
We aren't like a group of botanists studying a particular plant that grows only in a certain place in a special kind of soil after a particular sort of rain has fallen (so that you must physically be THERE in that place/time, etc. to do the experiments).
We usually use computer programs and data that run on at least a few other computers besides our own, and if we are so inclined we have lots of ways to make code and data available to anybody in the world within seconds.
Reproduction of experiments should be easy in this environment ... if the original experimenter will just give us a hand with that.
So, here's another option - let's just avoid this mess.
6) Reject the original paper because the authors failed to make the code and data available to allow easy and convenient reproduction and verification of their results.
I know this isn't realistic right away, but maybe it's something to shoot for over the longer term. Maybe by 2010 let's say. :)
What exactly is the problem with using every 10th sentence rather than the last 10%?
To repeat in new words, using every 10th sentence is that you run twin risks of having duplicates and near duplicates appear in training and test, and also having changes over time be under-estimated.
The basic problem that is incurred is that your training data looks too much like your test data to give you realistic estimates of accuracy.
This wasn't the main thrust of this post, but I had an idea for a new post--why not invite people to post "gotchas", common mistakes seen in papers/submissions with maybe just a short, brief explanation of why it's a bad idea. Just a thought.
You really ought to moderate comments on this blog. The spammers have figured you for a sucker, rightly so since you seem to just approve any comment at all.
Post a Comment