At this year's ICML, they gave out a "10 year" award to a paper published in an ICML-related venue from 1998. This year it went to a COLT 1998 paper by Avrim Blum and Tom Mitchell: Combining Labeled and Unlabeled Data with Co-Training. While I'm not super familiar with what might have been a contender, I have to say that I definitely think this is a good choice.
For those unfamiliar with the idea of co-training, you should really read the paper. There's also a Wikipedia entry that describes it as:
Co-training is a semi-supervised learning technique that requires two views of the data. It was introduced by Avrim Blum and Tom Mitchell. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Ideally, the two views are conditionally independent (i.e., the two feature sets of each instance are conditionally independent given the class) and each view is sufficient (i.e., the class of an instance can be accurately predicted from each view alone). Co-training first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data.
This is a good summary of the algorithm, but what is left off is that, as far as I know, co-training was one of the first methods (if not the first) for which theoretical analysis showed that semi-supervised learning might help. My history is a bit rough, so anyone should feel free to correct me if I'm wrong.
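To make the loop in the Wikipedia summary above concrete, here is a minimal sketch in Python. It is not the procedure from the Blum and Mitchell paper: the nearest-centroid classifier is just a toy stand-in for "any classifier", the data is synthetic (two views drawn independently given the class), and every name and parameter here (`co_train`, `rounds`, `per_round`, and so on) is made up for illustration.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestCentroid:
    """Toy stand-in for "any classifier"; needs at least two classes."""

    def fit(self, X, y):
        self.centroids = {}
        for c in set(y):
            rows = [x for x, lab in zip(X, y) if lab == c]
            self.centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        # Returns (label, margin); margin is how much closer the winning
        # centroid is than the runner-up, used here as the confidence.
        ranked = sorted((sq_dist(x, m), c) for c, m in self.centroids.items())
        return ranked[0][1], ranked[1][0] - ranked[0][0]


def fit_views(labeled):
    """Train one classifier per view on the current labeled set."""
    ys = [y for _, y in labeled]
    return [NearestCentroid().fit([x[v] for x, _ in labeled], ys) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=10, per_round=2):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}  # pool index -> pseudo-label
        for v, clf in enumerate(fit_views(labeled)):
            # Each view's classifier nominates its most confident pool examples.
            scored = [(clf.predict(x[v]), i) for i, x in enumerate(pool)]
            scored.sort(key=lambda t: t[0][1], reverse=True)
            for (label, _), i in scored[:per_round]:
                picked.setdefault(i, label)
        # Nominated examples join the labeled set with their predicted labels.
        labeled.extend((pool[i], lab) for i, lab in picked.items())
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return fit_views(labeled), labeled


def sample(rng, label):
    """Synthetic example: two 2-D views, independent given the class."""
    center = 0.0 if label == 0 else 6.0
    return tuple([rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(2))


def accuracy(clfs, examples):
    # Score with the first view's classifier alone.
    return sum(clfs[0].predict(x[0])[0] == y for x, y in examples) / len(examples)


rng = random.Random(0)
seed_set = [(sample(rng, y), y) for y in (0, 0, 1, 1)]
pool = [sample(rng, i % 2) for i in range(40)]
test_set = [(sample(rng, i % 2), i % 2) for i in range(100)]
clfs, grown = co_train(seed_set, pool)
print(len(grown), accuracy(clfs, test_set))
```

Note that this sketch already has the confidence ranking and the incrementally growing training set, which is exactly the machinery that makes the implemented algorithm harder to analyze than a cleaner idealization.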
Another aspect of co-training that is cool for readers of this blog is that to a reasonable degree, it has its roots in a 1995 ACL paper by David Yarowsky: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, which, as far as I know, was really the first paper to introduce the notion of having two views of data (although I don't think David described it as such).
All in all, the co-training paper is great. In fact, if you don't believe me that I think it's great, check out my EMNLP 2008 paper. My analysis (and, to some degree, my algorithm) is based heavily on the co-training analysis.
Which brings me to what I really want to discuss. That is, I have a strong feeling that if the co-training paper were reviewed today, it would be blasted for the theoretical analysis. (Indeed, I had the same fear for my EMNLP paper; though since it was EMNLP and not, say, COLT, I don't think the problem was as severe.) The problem with the co-training paper is that the theoretical result is for an algorithm that is only superficially related to the actual algorithm they implement. In particular, the actual algorithm they implement uses notions of confidence, a steadily increasing training set size, incremental additions and so on. It's vastly more complicated than the algorithm they analyze. My recent experience as both an author and reviewer at places like NIPS and ICML is that this is pretty much a non-starter these days.
In fact, the algorithm is so different that it took three years for an analysis of something even remotely close to come out. In NIPS 2001, Sanjoy Dasgupta, Michael Littman and David McAllester published a paper that actually tries to analyze something closer to the "real" co-training algorithm. They get pretty close. And this analysis paper is a full NIPS paper that basically just proves one (main) theorem.
(A similar set of events happened with David Yarowsky's paper. He didn't include any theoretical analysis, but there has been subsequent work, for instance by Steve Abney, that tries to understand the Yarowsky algorithm theoretically. And again we see that an analysis of the exact original algorithm is a bit out of grasp.)
I'm sure other people will disagree--which is fine--but my feeling about this matter is that there's nothing wrong with proving something interesting about an algorithm that is not quite exactly what you implement. The danger, of course, is that you get an incorrect intuition. For instance, in the case of co-training, maybe it really was all these "additions" that made the algorithm work, and the whole notion of having two views was useless. This seems to have turned out not to be the case, but it would have been hard to tell from the paper alone. For instance, the co-training paper doesn't report results on the actual algorithm analyzed: presumably it doesn't work very well or there would be no need for the more complex variant (I've never tried to implement it). On the other hand, if it had taken Avrim and Tom three extra years to prove something stronger before publishing, then the world would have had to wait three extra years for this great paper.
The approach I took in my EMNLP paper, which, at least as of now, I think is reasonable, is to just flat out acknowledge that the theory doesn't really apply to the algorithm that was implemented. (Actually, in the case of the EMNLP paper, I did implement both the simple and the complex and the simple wasn't too much worse, but the difference was enough to make it worth--IMO--using the more complex one.)