10 March 2007

Reproducible Results

In an ideal world, it would be possible to read a paper, go out and implement the proposed algorithm, and obtain the same results. In the real world, this isn't possible. For one, if by "paper" we mean "conference paper," there's often just not enough space to spell out all the details. Even how you do tokenization can make a big difference! It seems reasonable that there should be sufficient detail in a journal paper to achieve essentially the same results, since there's (at least officially) not a space issue. On the other hand, no one really publishes in journals in our subfamily of CS.
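
As a tiny illustration of the tokenization point, here is a hypothetical snippet (not taken from any paper discussed here) showing how two perfectly reasonable tokenization schemes disagree on the same sentence, which is enough to change token counts and every downstream feature computed from them:

    import re

    sentence = "Mr. O'Neill didn't like the U.S. results."

    # Scheme A: split on whitespace only.
    tokens_a = sentence.split()

    # Scheme B: also split off punctuation and clitics (a crude approximation).
    tokens_b = re.findall(r"\w+|[^\w\s]", sentence)

    # The two schemes produce different numbers of tokens for the same input.
    print(len(tokens_a), tokens_a)
    print(len(tokens_b), tokens_b)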

The next thing one can do is to release the software associated with a paper. I've tried to do this in a handful of cases, but it can be a nontrivial exercise. There are a few problems. First, there's the question of how polished the software you put out should be. Probably my most polished is megam (for learning classifiers) and the least polished is DPsearch (code from my AI stats paper). Writing up all the docs for megam and so on took a very nontrivial amount of effort, so I hope that people can actually use it. I have less hope for DPsearch --- you'd really have to know what you're doing to rip the guts out of it.

Nevertheless, I have occasionally received copies of code like my DPsearch from other people (i.e., unpolished code) and have still been able to use them successfully, albeit only for ML stuff, not for NLP stuff. ML stuff is nice because, for the most part, it's self-contained. NLP stuff often isn't: first you run a parser, then you have to have wordnet installed, then you have to have 100MB of data files, then you have to run scripts X, Y, and Z before you can finally run the program. The work I did for my thesis is a perfect example of this: instead of building all the important features into the main body of code I wrote, about half of them were implemented as Perl scripts that would essentially add "columns" to a CoNLL-style input format. At the end, the input was something like 25-30 columns wide, and if any were missing or out of order, bad things would happen. As a result, it's a completely nontrivial exercise for me to release this beast. The only real option I can conceive of would be to remove the unimportant scripts, get the important ones back into the real code, and then release that. But then there's no way the results would match exactly those from the paper/thesis.
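
To make the column-adding setup concrete, here is a hypothetical stand-in (in Python rather than Perl, and not one of the actual thesis scripts) for one of those feature scripts: it reads CoNLL-style tab-separated lines, appends a single extra column, and refuses to run if the input isn't the width it expects --- exactly the kind of fragility described above.

    import sys

    EXPECTED_COLUMNS = 12  # assumed width at this (hypothetical) stage of the pipeline

    def add_is_capitalized_column(line):
        line = line.rstrip("\n")
        if not line:                      # blank line = sentence boundary in CoNLL format
            return line
        cols = line.split("\t")
        if len(cols) != EXPECTED_COLUMNS:
            raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(cols)}: {line!r}")
        word = cols[0]                    # assume the surface form is the first column
        cols.append("CAP" if word[:1].isupper() else "lower")
        return "\t".join(cols)

    if __name__ == "__main__":
        for raw in sys.stdin:
            print(add_is_capitalized_column(raw))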

I don't know of a solution to this problem. I suppose it depends on what your goal is. One goal is just to figure out some implementation details so that you can use them yourself. For this, it would be perfectly acceptable in, say, my thesis situation, to just put up the code (perhaps the scripts too) and leave it at that. There would be an implicit contract that you couldn't really expect too much from it (i.e., you shouldn't expect to run it).

A second goal is to use someone else's code as a baseline system to compare against. This need is lessened when common data is available, because you can compare to published results. But often you don't care about the common data and really want to see how the system works on other data. Or you want to qualitatively compare your output to a baseline. This seems harder to deal with. If code goes up to solve this problem, it needs to be runnable. And it needs to achieve pretty much the same results as published, otherwise funny things happen ("so and so reported scores of X but we were only able to achieve Y using their code", where Y < X). This looks bad, but is actually quite understandable in many cases. Maybe the solution here is, modulo copyright restrictions and licensing problems (ahem, LDC), to just put up your model's output as well. This doesn't solve the direct problem, but maybe helps a bit. It also lets people see where your model screws up, so they can attempt to fix those problems.

3 comments:

Anonymous said...

The problem, as you state in your last sentence, is "It also lets people see where your model screws up"...

Now seriously, most of the time, even when you start coding in a very structured way after careful design, and you even put time into documentation, you get (bad) results and then just fix one small thing. Then you change something else that might improve results. Then you realize that another pre-processing step might be needed, so you write a quick script (Perl, Python, whatever) to do it. The submission date is approaching and you really don't have the time to bundle it all together....
Eventually, not only is the code not publishable, but even reusing it yourself takes a while until you figure out what needs to be done...

Anonymous said...

The problem with "research" software distro mainly derives from one single factor: lack of automation. Take Hal's anecdote of "pre"-processing CoNLL data with cumulative Perl scripts, or Oren's comment that sometimes you make quick local changes. As long as there is a single (one-touch) top-level script/makefile that runs everything, a diligent reader can trace through where the code's going. And there's no way for the columns to get out of order, either.
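
A minimal sketch of that one-touch idea, with made-up script names and paths (Python standing in for a makefile), might look like this: a single top-level driver that runs every preprocessing and training step in a fixed order and stops the moment any step fails.

    import subprocess

    # Each entry is one stage of the (hypothetical) pipeline, in the order it must run.
    STEPS = [
        ["perl", "scripts/parse.pl", "data/train.txt", "data/train.parsed"],
        ["perl", "scripts/add_wordnet_features.pl", "data/train.parsed", "data/train.wn"],
        ["perl", "scripts/add_extra_columns.pl", "data/train.wn", "data/train.final"],
        ["./train_model", "data/train.final", "models/final.model"],
    ]

    def main():
        for step in STEPS:
            print("running:", " ".join(step))
            subprocess.run(step, check=True)   # check=True aborts the whole run on failure

    if __name__ == "__main__":
        main()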

The second biggest factor that will help with software distribution is for people to write more readable code. Not more comments. I don't care if there are any comments in code if it's written to be read. That means variable names that are consistent and make sense, code broken down into subroutines with reasonable names, etc. Comments get out of date as often as they're useful in living code.

The third issue is learning to read software. Researchers don't get much practice at this, as they're usually writing their own one-off software rather than having to write re-usable software as part of a group. Sasha Caskey taught me to read code while we were integrating JavaScript into SpeechWorks' semantic parser; pair programming is a great way to learn this kind of thing (and also forces you to write code more cleanly, too).

I'd recommend everyone pick up a copy of Hunt and Thomas's "The Pragmatic Programmer", read it, and follow their advice in your next project. Beck's "Extreme Programming" is also worth a read. Both are highly applicable to research programming.

As a Java example in NLP, check out our BioCreative and CoNLL submissions in LingPipe's CVS sandbox.

For a C++ example in collaborative filtering, check out Timely Development's Netflix code, which is an online SVD algorithm handling missing data. It's reconstructed from Simon Funk's algorithm sketch based on Genevieve Gorrell's paper. Ironically, I find Timely's code the easiest of all three sources to read.
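
For readers who haven't seen it, the core of that Funk-style approach can be sketched in a few lines (a rough illustration in Python, not Timely Development's actual C++ code, with made-up hyperparameters): factor the ratings matrix with stochastic gradient updates computed only on the observed (user, item, rating) triples, so missing entries never enter the objective.

    import numpy as np

    def funk_svd(observed, n_users, n_items, n_factors=20, lr=0.01, reg=0.02, n_epochs=30):
        """SGD matrix factorization over observed ratings only (Funk-style sketch)."""
        rng = np.random.default_rng(0)
        U = rng.normal(0.0, 0.1, (n_users, n_factors))
        V = rng.normal(0.0, 0.1, (n_items, n_factors))
        for _ in range(n_epochs):
            for u, i, r in observed:            # missing entries are simply never touched
                err = r - U[u] @ V[i]
                u_old = U[u].copy()
                U[u] += lr * (err * V[i] - reg * U[u])
                V[i] += lr * (err * u_old - reg * V[i])
        return U, V

    # Toy usage: 3 users, 3 items, 5 observed ratings; predict an unobserved cell.
    obs = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 2.0), (2, 2, 1.0)]
    U, V = funk_svd(obs, n_users=3, n_items=3)
    print("predicted rating for user 2, item 0:", U[2] @ V[0])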

P.S. This same thread just made the rounds at the Nodalpoint bioinformatics blog.

nikita said...

I'm a master's student in computer science. I'm interested in trying to create a compiler for natural languages like English. Is it possible? Are there other interesting research problems to work on in NLP? Please help me.
nikita
