I'll skip discussion of multitask learning for now and go directly to the unlabeled data question.
It's instructive to compare what NLPers do with unlabeled data to what MLers do with unlabeled data. In machine learning, there are a few "standard" approaches to boosting supervised learning with unlabeled data:
- Use the unlabeled data to construct a low dimensional manifold; use this manifold to "preprocess" the training data.
- Use the unlabeled data to construct a kernel.
- Use the unlabeled data during training time to implement the "nearby points get similar labels" intuition.
In NLP, on the other hand, the common approaches look more like:
- (A) Use the unlabeled data to cluster words; use these word clusters as features for training.
- (B) Use the unlabeled data to bootstrap new training instances based on highly precise patterns (regular expressions, typically).
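To make the bootstrapping item concrete, here is a minimal sketch of what "highly precise patterns" might look like in practice. Everything specific in it is an assumption made for illustration: the honorific pattern, the `PERSON` label, the toy sentences, and the function name `bootstrap_person_names` are all hypothetical, and a real system would use many patterns and usually iterate.

```python
import re

# Hypothetical high-precision pattern (an assumption for illustration):
# a capitalized word immediately following an honorific is taken to be a
# person name. It will miss most names, but should rarely be wrong.
SEED_PATTERN = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. ([A-Z][a-z]+)")


def bootstrap_person_names(unlabeled_sentences):
    """Return (sentence, name, label) triples for every pattern match."""
    new_instances = []
    for sentence in unlabeled_sentences:
        for match in SEED_PATTERN.finditer(sentence):
            new_instances.append((sentence, match.group(1), "PERSON"))
    return new_instances


# Toy unlabeled data, made up for this example.
unlabeled = [
    "Yesterday Dr. Smith presented the results.",
    "The results were presented yesterday.",
    "We spoke with Mrs. Jones and Mr. Brown afterwards.",
]

# These automatically labeled instances would be added to the training set.
bootstrapped = bootstrap_person_names(unlabeled)
for sentence, name, label in bootstrapped:
    print(f"{label}\t{name}\t{sentence}")
```

The point of the design is the trade-off: the pattern gives up recall entirely in exchange for precision, so the new instances can be treated (roughly) as if they were hand labeled.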
The paper that's been getting a lot of attention recently -- on both sides -- is the work by Ando and Zhang. As I see it, this is one way of formalizing common NLP practice into a form digestible by ML people. This is great, because maybe it means the two camps will be brought closer together. The basic idea is to take "A" style NLP learning (clustering words and using the clusters as features), but instead of clustering as we normally think of it (words based on contexts), to learn classifiers that predict known "free" aspects of the unlabeled data (is this word capitalized?). Probably the biggest (acknowledged) shortcoming of this technique is that a human has to come up with these secondary classification problems. Can we try to do that automatically?
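To illustrate what a "free" secondary classification problem looks like, here is a minimal sketch that builds the "is this word capitalized?" problem from raw text and fits a deliberately crude predictor to it. This is not Ando and Zhang's implementation: their method does more with the secondary predictors than this, and the feature templates, the count-based voting "classifier", the toy sentences, and all function names below are my own assumptions for illustration.

```python
from collections import Counter, defaultdict


def auxiliary_examples(unlabeled_sentences):
    """Build (context_features, is_capitalized) pairs from raw text.

    The label is free: it is read off the token itself. The token is then
    left out of the features, so a predictor has to rely on context alone.
    """
    examples = []
    for sentence in unlabeled_sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
            next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
            features = (f"prev={prev_tok}", f"next={next_tok}")
            examples.append((features, token[0].isupper()))
    return examples


def train_counts(examples):
    """Count, per feature, how often the hidden token was capitalized.

    A stand-in (assumption) for a real linear classifier.
    """
    counts = defaultdict(Counter)
    for features, label in examples:
        for feat in features:
            counts[feat][label] += 1
    return counts


def predict(counts, features):
    """Predict capitalization by summing the label counts of each feature."""
    votes = Counter()
    for feat in features:
        votes.update(counts.get(feat, Counter()))
    return votes[True] > votes[False]


# Toy unlabeled text, lowercased at sentence start to keep the example clean.
unlabeled = [
    "we met Dr. Smith today",
    "she thanked Dr. Jones yesterday",
    "they met him today",
]
examples = auxiliary_examples(unlabeled)
counts = train_counts(examples)

print(predict(counts, ("prev=dr.", "next=tomorrow")))
print(predict(counts, ("prev=they", "next=him")))
```

Note that no human labeled anything here; the only human input is the choice of the secondary problem itself, which is exactly the shortcoming mentioned above.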
But beyond standard notions of supervised -> semi-supervised -> unsupervised, I think that working in the structured domain offers us a much more interesting continuum. Maybe instead of having some unlabeled data, we have some partially labeled data. Or data that isn't labeled quite how we want (William Cohen recently told me of a data set where they have raw bio texts and lists of proteins that are important in these texts, but they want to actually find the names in unlabeled [i.e., no lists] texts). Or maybe we have some labeled data but then want to deploy a system and get user feedback (good/bad translation/summary). This is a form of weak feedback that we'd ideally like to use to improve our system. I think that investigating these avenues is also a very promising direction.