Apologies for the long silence -- life got in the way. Part of life was AIstats, which I'll blog about shortly, but I'm waiting for them to get the proceedings online so I can (easily) link to papers.
A common (often heated) argument in the scientific community is the "blind" versus "not-blind" review process -- i.e., whether reviewers can see the identity of the authors. Many other bloggers have talked about this in the past (here and here, for instance). I don't want to talk about that issue right now, but rather about one specific consequence of it that I don't think actually needs to be a consequence. But it often is. And I think that, more than anything else, it serves to hurt research.
Recent anecdote: less than a month ago, I was talking over email with my friend John Blitzer at UPenn about domain adaptation stuff. We both care about this problem and have worked on different angles of it. Initially we were talking about his NIPS paper, but the discussion diffused into more general aspects of the problem. At the time I was working furiously on a somewhat clever, but overall incredibly simple, approach to domain adaptation (it managed to make it into ACL this year -- draft here). The interesting thing about this approach was that it completely went against the theoretical bounds in his paper (essentially because those are crafted to be worst case). It doesn't contradict the bounds, of course, but it shows that good adaptation is possible in many cases where the bounds give no information.
Of course I told him about this, right? Well, actually, no. I didn't. In retrospect this was stupid and a mistake, but it's a stupid mistake I've made over and over again. Why did I do it? Because at the time I knew the paper was on its way to ACL -- perhaps I had even submitted it at that point; I cannot remember. And if he happened to be one of the reviewers, he would obviously know it was mine, and the "double blind" aspect of ACL would be obviated. Or -- perhaps worse -- he would have marked it as a conflict of interest (though I don't think it really was one) because he knew my identity. And then I would have lost the opinion of a pretty smart guy.
All this is to say: I think there's often a temptation not to talk about ongoing research because of the double-blind rule. But this is ridiculous, because the people who are close to you in research area are exactly the ones you should be talking to! I don't think the solution has anything to do with reversing double-blind (though that would solve it, too). I think the solution is just to realize that in many cases we will know the authorship of a paper we review, and we shouldn't try so hard to hide this. Hiding it only hinders progress. We should talk to whomever we want about whatever we want, regardless of whether that person may or may not later review a paper on the topic. (As a reviewer, their identity is hidden anyway, so who cares!)
(Briefly, about conflict of interest: I used to be somewhat liberal in declaring a COI for a paper if I knew who the author was. This is a bad definition of COI, since it means I have a COI with nearly every paper in my area(s). A true COI should be when I have something to gain from the paper being published -- e.g., it is written by my advisor, my student, or a very, very close colleague, i.e., one with whom I publish regularly, though even that seems a bit of a stretch.)
31 March 2007
10 March 2007
Reproducible Results
In an ideal world, it would be possible to read a paper, go out and implement the proposed algorithm, and obtain the same results. In the real world, this isn't possible. For one, if by "paper" we mean "conference paper," there's often just not enough space to spell out all the details. Even how you do tokenization can make a big difference! It seems reasonable that there should be sufficient detail in a journal paper to achieve essentially the same results, since there's (at least officially) not a space issue. On the other hand, no one really publishes in journals in our subfamily of CS.
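To make the tokenization point concrete, here's a tiny sketch (the example sentence and the two tokenizers are invented for illustration, not tied to any particular corpus or toolkit) of how two perfectly reasonable tokenizers disagree -- which is already enough to shift vocabulary sizes and downstream scores:

```python
import re

# Invented example sentence; the point is only that two reasonable
# tokenizers carve it up differently.
sentence = "Dr. Smith didn't like the U.S. results (p < 0.05)."

ws_tokens = sentence.split()                         # whitespace only
punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)  # also split off punctuation

print(len(ws_tokens), ws_tokens)        # "U.S." and "didn't" stay whole
print(len(punct_tokens), punct_tokens)  # "U.S." -> U . S . ; "didn't" -> didn ' t
```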
The next thing one can do is to release the software associated with a paper. I've tried to do this in a handful of cases, but it can be a non-trivial exercise. There are a few problems. First, there's the question of how polished the software you put out should be. Probably my most polished is megam (for learning classifiers) and the least polished is DPsearch (code from my AIstats paper). It was a very nontrivial amount of effort to write up all the docs for megam and so on; as a result, I hope that people can use it. I have less hope for DPsearch -- you'd really have to know what you're doing to rip the guts out of it.
Nevertheless, I have occasionally received copies of code like my DPsearch from other people (i.e., unpolished code) and have still been able to use it successfully, albeit only for ML stuff, not for NLP stuff. ML stuff is nice because, for the most part, it's self-contained. NLP stuff often isn't: first you run a parser, then you have to have WordNet installed, then you have to have 100MB of data files, then you have to run scripts X, Y and Z before you can finally run the program. The work I did for my thesis is a perfect example of this: instead of building all the important features into the main body of code I wrote, about half of them were implemented as Perl scripts that would essentially add "columns" to a CoNLL-style input format. At the end, the input was something like 25-30 columns wide, and if any were missing or out of order, bad things would happen. As a result, it's a completely nontrivial exercise for me to release this beast. The only real conceivable option would be to remove the non-important scripts, fold the important ones back into the real code, and then release that. But then there's no way the results would match exactly those from the paper/thesis.
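For what it's worth, about the only defensive measure for a pipeline like that is a loud sanity check on the column layout. Here's a minimal sketch of what I mean -- the column names below are made up for illustration (the real pipeline had 25-30 of them, produced by separate scripts, in a fixed order):

```python
import sys

# Hypothetical column layout -- stand-ins for the 25-30 columns the real
# pipeline expected, each produced by a separate preprocessing script.
EXPECTED_COLUMNS = ["word", "pos", "chunk", "ne", "parse-path", "wn-supersense"]

def check_conll_columns(path, expected=EXPECTED_COLUMNS):
    """Fail loudly if any token line has missing or extra columns.
    (Out-of-order columns, of course, this cannot catch.)"""
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line:  # blank line = sentence boundary
                continue
            cols = line.split("\t")
            if len(cols) != len(expected):
                sys.exit("%s:%d: expected %d columns (%s), got %d"
                         % (path, lineno, len(expected),
                            ", ".join(expected), len(cols)))

if __name__ == "__main__":
    check_conll_columns(sys.argv[1])
    print("column layout looks OK")
```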
I don't know of a solution to this problem. I suppose it depends on what your goal is. One goal is just to figure out some implementation details so that you can use them yourself. For this, it would be perfectly acceptable in, say, my thesis situation, to just put up the code (perhaps the scripts too) and leave it at that. There would be an implicit contract that you couldn't really expect too much from it (i.e., you shouldn't expect to run it).
A second goal is to use someone else's code as a baseline system to compare against. This need is lessened when common data is available, because you can compare to published results. But often you don't care about the common data and really want to see how it works on other data. Or you want to qualitatively compare your output to a baseline's. This seems harder to deal with. If code goes up to solve this problem, it needs to be runnable. And it needs to achieve pretty much the same results as published, otherwise funny things happen ("so and so reported scores of X but we were only able to achieve Y using their code", where Y < X). This looks bad, but is actually quite understandable in many cases. Maybe the solution here is, modulo copyright restrictions and licensing problems (ahem, LDC), to just put up your model's output as well. This doesn't solve the direct problem, but maybe helps a bit. It also lets people see where your model screws up, so they can attempt to fix those problems.
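Along those lines, if the released output is in some simple format (say, one predicted label per line alongside a gold file -- a made-up setup here, just for illustration), then even a trivial scoring script lets someone confirm they can reproduce the published number before comparing against your system:

```python
# Sketch: score a released output file against a gold file, assuming the
# simplest imaginable format (one label per line, blank lines skipped).
# The format and file names are hypothetical.
import sys

def accuracy(gold_path, pred_path):
    with open(gold_path) as g, open(pred_path) as p:
        pairs = [(a.strip(), b.strip()) for a, b in zip(g, p) if a.strip()]
    return sum(1 for gold, pred in pairs if gold == pred) / float(len(pairs))

if __name__ == "__main__":
    print("accuracy = %.4f" % accuracy(sys.argv[1], sys.argv[2]))
```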