(Guest post by Kevin Duh. Thanks, Kevin!)
At the NAACL SSL-NLP Workshop recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP. The panel discussion consisted of Hal Daume, David McClosky, and Andrew Goldberg as panelists and audience input from Jason Eisner, Tom Mitchell, and many others. Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.
Three motivations for a shared task
A1. Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way. Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.
A2. Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated? Should we would evaluate the same method for both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)? Should we evaluate the same method under different ratios of labeled/unlabeled data? Currently there is no standard methodology for evaluating SSLNLP results, which means that the completeness/quality of experimental sections in research papers varies considerably.
A3. Encourage more research in the area: A shared task can potentially lower the barrier of entry to SSLNLP, especially if it involves pre-processed data and community support network. This will make it easier for researchers in outside fields, or researchers with smaller budgets to contribute their expertise to the field. Furthermore, a shared task can potentially direct the community research efforts in a particular subfield. For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified as new, promising areas to focus on in the panel discussion. A shared task along those lines may help us rally the community behind these efforts.
Arguments against the above points
B1. Fair comparison: Nobody really argues against fair comparison of methods. The bigger question, however, is whether there exist a *common* dataset or task where everyone is interested in. At the SSLNLP Workshop, for example, we had papers in a wide range of areas ranging from information extraction to parsing to text classification to speech. We also had papers where the need for unlabeled data is intimately tied in to particular components of a larger system. So, a common dataset is good, but what dataset can we all agree upon?
B2. Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently from a shared task through, e.g. an influential paper or gradual recognition of its importance by reviewers. Further, one may argue: what makes you think that your particular evaluation methodology is the best? What makes you think people will adopt it generally, both inside and outside of the shared task?
B3. Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts. However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm). This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge? On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?
A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task. The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data). We did not have enough time during the panel discussion to go through the details, but here are some suggestions:
- We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc. In choosing a task, we can pick a representative task in each class, such as name-entity recognition for sequence prediction, dependency parsing for tree prediction, etc. This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.
- If participants are allowed to pre-process their own data, the evaluation might require participant to submit a supervised system along with their semi-supervised system, using the same feature set and setup, if possible. This may make it easier to learn from results if there are differences in pre-processing.
- There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer. This may lower the barrier of entry for new participants, as well as establish a common baseline result.