Shared tasks have been increasing in popularity over the past half decade. These are effectively competitions (though perhaps that word is rightfully disdained) for building systems that perform well on a given task, for a specific data set. Typically a lot of stuff is given to you for free: the data, variously preprocessing steps, evaluation scripts, etc. Anywhere from a handful of people to dozens enter these shared tasks. Probably the most well known are the CoNLL shared tasks, but they have also taken place in other workshops (eg., the two SMT workshops and many others). Goverment-run competitions (eg., GALE, ACE, DUC (to some degree) and others) are somehow similar, with the added bonus that money is often contingent on performance, but for the most part, I'll be talking about the community-driven shared tasks. (I'll note that shared tasks exist in other communities, but not to the extent that they exist in NLP, to my knowledge.)
I think there are both good and bad things about having these shared tasks, and a lot depends on how they are run. Perhaps some analysis (and discussion?) can serve to help future shared task organizers make decisions about how to run these things.
Many pros of shared tasks are perhaps obvious:
- Increases community attention to the task.
- Often leads to development or convergence of techniques by getting lots of people together to talk about the same problem.
- Significantly reduces the barrier of entry to the task (via the freely available, preprocessed data and evaluation scripts).
- (Potentially) enables us to learn what works and what doesn't work for the task.
- Makes a standardized benchmark against which future algorithms can be compared.
In my opinion, (4) seems like it should be the real reason to do these things. I think the reason why people don't tend to learn as much as might be possible about what does and does not work is that there's very little systematization in the shared tasks. At the very least, almost everyone will use (A) a different learning algorithm and (B) a different feature set. This means that it's often very hard to tell -- when someone does well -- whether it was the learning or the features.
Unfortunately (were it not the case!) there are some cons associated with shared tasks, generally closely tied to corresponding pros.
- May artificially bloat the attention given to one particular task.
- Usefulness of results is sometimes obscured by multiple dimensions of variability.
- Standardization can lead to inapplicability of certain options that might otherwise work well.
- Leads to repeated testing on the same data.
(2) and (3) are, unfortunately, almost opposed. You can, for instance, fix the feature set and only allow people to vary the learning. Then we can see who does learning best. Aside from the obvious problem here, there's an additional problem that another learning algorithm might do better, if it had different features. Alternatively, you could fix the learning and let people do feature engineering. I think this would actually be quite interesting. I've thought for a while about putting out a version of Searn for a particular task and just charge people with coming up with better features. This might be especially interesting if we did it for, say, both Searn and Mallet (the UMass CRF implementation) so we can get a few more points of comparison.
To be more concrete about (3), a simple example is in machine translation. The sort of preprocessing (eg., tokenization) that is good for one MT (eg., a phrase-based system) may be very different from the preprocessing that is good for another (eg., syntax-based). One solution here is to give multiple versions of the data (raw, preprocessed, etc.), but then this makes the (2) situation worse: how can we tell who is doing best, and is it just because they have a darn good tokenizer (don't under-estimate the importance of this!).
(4) doesn't really need any extra discussion.
My personal take-away from putting some extra thought into this is that it can be very beneficial to have shared tasks, if we set at the beginning what are the goals. If our goal is to understand what features are important, maybe we should consider fixing the learning to a small set of algorithms. If our goal is learning, do the opposite. If we want both, maybe ask people to do feature ablation and/or try with a few different learning techniques (this is perhaps too much burden, though). I think we should definitely keep the (3) of low barrier of entry: to me, this is one of the biggest pros. I think the SMT workshops did a phenomenal job here, for a task as complex as MT. And, of course, we should choose the tasks carefully.