22 May 2008

Continuing bad ideas

I occasionally--more than I would like--run into the following problem. I am working on some task for which there is prior work (shocking!). I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable. Typically this involves comparing fairly directly against said prior work. Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference between systems be what I am trying to propose. Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use a feature set identical to theirs. Another easy example: I am working on some task; I want to use the same training/dev/test split as the prior work.

The problem is: what if (in my opinion) they did it wrong? There are many ways to do things wrong, and it would be hard for me to talk about specifics without pointing fingers. But just for concreteness, let's take a case relating to the final example in the above paragraph. Say that we're doing POS tagging, and the only prior POS tagging paper used every tenth sentence of the WSJ as test data, rather than the last 10%. This is (fairly clearly) totally unfair. However, it leaves me with the following options:

  1. I can repeat their bad idea and test on every tenth sentence.
  2. I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
  3. I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
  4. I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
  5. I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.
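
To make the two protocols concrete, here is a minimal sketch of both splits (the function and variable names are mine, purely illustrative):

```python
def every_tenth_split(sentences):
    """The disputed protocol: test on every 10th sentence, train on the rest."""
    test = [s for i, s in enumerate(sentences) if i % 10 == 9]
    train = [s for i, s in enumerate(sentences) if i % 10 != 9]
    return train, test

def last_tenth_split(sentences):
    """The contiguous protocol: test on the final 10% of the corpus."""
    cut = len(sentences) - len(sentences) // 10
    return sentences[:cut], sentences[cut:]

sentences = [f"sent-{i}" for i in range(100)]
train1, test1 = every_tenth_split(sentences)
train2, test2 = last_tenth_split(sentences)
# Both are 90/10 splits; only the second keeps runs of adjacent
# sentences (i.e., whole documents) out of the training data.
```

Both functions yield a 90/10 split; the difference the post worries about is entirely in which sentences end up adjacent to training material.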
I think that if all else is equal, (5) is the best option. It's scientifically sound, truthful, and gives results that are actually comparable. Unfortunately, it's also the most work, because it involves reimplementing a complete other system, which in many cases may verge on prohibitive. (I suppose another option would be to contact the authors and ask them to rerun things the correct way -- I imagine some people would respond positively to this, though probably not all.)

(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really distrust the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results obtained in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dried.

(3) just seems intellectually dishonest.

(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)

(1) also seems intellectually dishonest.

At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.

I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead?

23 comments:

Jon Elsas said...

In too many situations, it is nearly impossible to replicate a complete test setup in order to evaluate a new algorithm against previously published results. Even in the best written papers, there are experimental details left out. Every decision along the way can affect the final outcome: from the minutiae of corpus preprocessing to differing interpretations of something as simple as "the last 10%", from your example.

The only really "fair" option is to re-implement previously published methods and compare on a completely level playing field. I am suspicious of results that simply compare against performance numbers from previous papers without re-running the experiments. In this light, (1) and (4) would lead to possibly bogus comparisons. (5) is really the only option if you must compare with the previous work.

As an added benefit, re-implementing someone else's algorithm is almost always enlightening.

Dalamar Taurog said...

I think the issue is what you want the scope of your work to be. Are you trying to show that their technique was flawed and that you have a better testing methodology, or are you trying to introduce your own work?

If you are trying to do both, and can within the time/space you have, that would be great. If the two issues are large enough, maybe you should address them separately. Alternatively, you could list their prior work as only PARTIALLY applicable, with an explanation of why.

Ted Dunning ... apparently Bayesian said...

If I were reviewing the paper (and say that I knew the previous work), then only 4 and 5 would be acceptable in my mind. Anything else smacks of naivete or disingenuousness.

This exact issue came up in a small way with some work I did once on document classification. The standard was to classify every tenth document in the corpus, but the corpus had many very nearly duplicated articles. That meant that there was a really good chance that the articles in the test set were also in the training set. My final solution was to use one year for training and the next for test.

I was lucky in that the major system to compare with (UMASS) was willing to run their system on my comparison so (5) worked for me.

Bob Carpenter said...

Continually testing on the same subsection of a corpus is quite misleading for two reasons: the high variance across sections and the resulting multiple comparisons problem.

What I find when I do 10-fold cross-validation on problems like part-of-speech tagging, spelling correction, Chinese word segmentation, named-entity recognition, or classification is that the difference between folds implies substantial variance in performance estimates. This variation across folds is often greater than the inter-system differences on the "official" test section.

The multiple comparisons problem is exacerbated by studies that make the error of using the training data as development data. The whole field is falling into this trap by repeatedly reporting results on Section 23 of the Penn Treebank. We're overtraining to a particular test set.
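
Carpenter's point about fold-to-fold variance can be illustrated with a toy sketch (the data here is invented purely to show the mechanism, not drawn from any real tagger):

```python
import statistics

def fold_accuracies(gold, pred, k=10):
    """Accuracy on each of k contiguous folds, to expose per-fold variance."""
    n = len(gold)
    accs = []
    for f in range(k):
        lo, hi = f * n // k, (f + 1) * n // k
        correct = sum(g == p for g, p in zip(gold[lo:hi], pred[lo:hi]))
        accs.append(correct / (hi - lo))
    return accs

# A system whose errors all fall in one region of the corpus:
gold = ["N"] * 100
pred = ["N"] * 90 + ["V"] * 10
accs = fold_accuracies(gold, pred, k=10)
spread = statistics.pstdev(accs)
# A single held-out section would report either 100% or 0% for this
# system, depending on which fold it happened to be; the spread across
# folds is the quantity the comment argues we should be looking at.
```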

J. Scott Olsson said...

I think the answer is "it depends".

I'd be inclined to say that (5) is *in general* a bad idea, for scientific rather than practical reasons. The problem is that, in real life, it's almost always impossible to do a previous approach justice when re-implementing---and what would motivate an author to do that anyhow? That suggests choice (6), re-implement and run on both the last 10% and every tenth sentence. At least then you can show your re-implementation is up to snuff (maybe).

Another difficulty I regularly have is comparing against previous work when their corpus is not publicly available. In that case, I'm essentially forced to re-implement their method, but I have almost no way of knowing that I got it right.

mike said...

I wish I could do (2) or (3). It is a waste of valuable space within an 8-page limit to spend several paragraphs explaining the deficiencies of the previous approach to evaluation, since doing so does not actually add to the intellectual contribution of my paper. I would rather use those inches to talk about qualitative evaluation, but a complete quantitative evaluation is crucial for paper acceptance. Since I know my evaluation method is superior, I would consider it a waste of time to do (5), unless the original author agrees to run his/her system. I agree that (4) is the best option, assuming that there are only minor differences in the evaluation methods. If the format of the output representation is substantially different, then it's comparing apples and oranges, since one representation may not be evaluable under the other metric. It is a problem if reviewers do not agree that the difference in output representation is actually important.

Ross said...

Not to answer the question, but addressing the underlying problem: I think this question emphasises the need for "reproducible research" (e.g. see http://epub.wu-wien.ac.at/dyn/virlib/wp/eng/mediate/epub-wu-01_c75.pdf?ID=epub-wu-01_c75). The notion is, that in computational sciences, the author should make available all the code and data needed to allow any reader to reproduce exactly the results claimed in the paper. Or as Claerbout's Principle puts it, "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures."

Andrew Clegg said...

Naive question - why is 'every tenth sentence' so much worse than 'last 10%'?

Ted Dunning ... apparently Bayesian said...

It isn't that naive a question, else metrics like that wouldn't be used so often.

The problem is that your test set is too similar to the training set and thus the results you get give you an unreasonably optimistic view of how your system will perform on new text. A similar example with newswire shows this problem even more starkly. Since so many articles are updates of previous articles, the test set will actually have many very near duplicates of items in the training set. This would make, for example, nearest neighbor methods work unreasonably well.

If you instead choose a test set from the last 10% (of sentences) or the succeeding year (with articles), then you will get a much lower rate of these fortuitous accidents, and your estimates of accuracy will be much more realistic.

In the original posting, however, this case is merely emblematic of the underlying problem of what to do when the standard evaluation methodology is flawed. The particular flaw isn't as important as the general problem.
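
One cheap way to detect the near-duplicates described above is word-shingle overlap; this is a generic sketch, not the method of any system mentioned in the thread:

```python
def shingles(text, k=3):
    """Set of k-word shingles: a crude fingerprint for near-duplicate detection."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_duplicate(a, b, threshold=0.5, k=3):
    """Flag two texts whose shingle sets have high Jaccard overlap."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return False
    return len(sa & sb) / len(sa | sb) >= threshold

story = "shares of acme corp rose three percent on strong earnings today"
update = "shares of acme corp rose four percent on strong earnings today"
unrelated = "the committee met on tuesday to discuss the new budget proposal"
# A newswire update overlaps heavily with the original story; an
# interleaved split would likely put one in train and one in test.
```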

hal said...

This is annoying -- for some reason Blogger stopped emailing me notifications of comments to posts, so I didn't know people had written!

I pretty much agree with all the comments... One thing I will say in favor of reimplementing someone else's stuff, to echo one of the comments, is that it's often quite educational. Also, I think the best way to improve on something is to actually see where it breaks down. This is impossible unless you have a baseline system.

Andrew: continuing on Ted's reply... one big issue in, say, POS tagging is unknown words. For state-of-the-art taggers, almost all errors are on unknown words. But rare words often appear multiple times in the same article. So if one of those occurrences happens to fall in the "training" part, then the word is no longer unknown. So it isn't fair.
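
The unknown-word leakage described here can be measured directly; a minimal sketch using an invented toy article:

```python
def unknown_word_rate(train_sents, test_sents):
    """Fraction of test tokens that never appear in the training data."""
    vocab = {w for sent in train_sents for w in sent}
    test_tokens = [w for sent in test_sents for w in sent]
    return sum(w not in vocab for w in test_tokens) / len(test_tokens)

# Toy article: a rare name ("Kasparov") recurs in adjacent sentences.
sents = [["The", "game", "began"],
         ["The", "crowd", "watched"],
         ["Kasparov", "moved", "first"],
         ["Then", "Kasparov", "resigned"]]

# The interleaved split leaks the rare name into training...
interleaved_rate = unknown_word_rate(sents[0::2], sents[1::2])
# ...while the contiguous split keeps the whole second half unseen.
contiguous_rate = unknown_word_rate(sents[:2], sents[2:])
```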

Ted said...

I admire your thoughtful analysis of this scenario, and I know you are being realistic and being pragmatic about things as they are now...

But, still, I have to ask myself how in the world did we reach a point in this business where we accept the premise that we need to do re-implementations to carry out comparative studies and reproduce results? I have done this, I suppose we all have, but I'm increasingly amazed that we put up with it.

We aren't like a group of botanists studying a particular plant that grows only in a certain place in a special kind of soil after a particular sort of rain has fallen (so that you must physically be THERE in that place/time, etc. to do the experiments).

We usually use computer programs and data that run on at least a few other computers besides our own, and if we are so inclined we have lots of ways to make code and data available to anybody in the world within seconds.

Reproduction of experiments should be easy in this environment ... if the original experimenter will just give us a hand with that.

So, here's another option - let's just avoid this mess.

6) Reject the original paper because the authors failed to make the code and data available to allow easy and convenient reproduction and verification of their results.

I know this isn't realistic right away, but maybe it's something to shoot for over the longer term. Maybe by 2010 let's say. :)

Anonymous said...

What exactly is the problem with using every 10th sentence rather than the last 10%?

Ted Dunning ... apparently Bayesian said...

To repeat in new words: the problem with using every 10th sentence is that you run the twin risks of having duplicates and near-duplicates appear in both training and test, and of having changes over time be under-estimated.

The basic problem is that your training data looks too much like your test data to give you realistic estimates of accuracy.

Anonymous said...

This wasn't the main thrust of this post, but I had an idea for a new post--why not invite people to post "gotchas", common mistakes seen in papers/submissions with maybe just a short, brief explanation of why it's a bad idea. Just a thought.

Ted Dunning ... apparently Bayesian said...

You really ought to moderate comments on this blog. The spammers have figured you for a sucker, rightly so since you seem to just approve any comment at all.
