I occasionally--more than I would like--run in to the following problem. I am working on some task for which there is prior work (shocking!). I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable. Typically this involves comparing fairly directly against said prior work. Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference in systems be what I am trying to propose. Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use an identical feature set to them. Another easy example: I am working on some task; I want to use the same training/dev/test set as the prior work.
The problem is: what if they did it wrong (in my opinion). There are many ways for doing things wrong and it would be hard for me to talk about specifics without pointing fingers. But just for concreteness, let's take the following case relating to the final example in the above paragraph. Say that we're doing POS tagging and the only prior work POS tagging paper used as test data every tenth sentence in WSJ, rather than the last 10%. This is (fairly clearly) totally unfair. However, it leaves me with the following options:
- I can repeat their bad idea and test on every tenth sentence.
- I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
- I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
- I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
- I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.
(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry.
(3) just seems intellectually dishonest.
(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)
(1) also seems intellectually dishonest.
At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.
I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead.