08 April 2006

Unlabeled Structured Data

I'll skip discussion of multitask learning for now and go directly for the unlabeled data question.

It's instructive to compare what NLPers do with unlabeled data to what MLers do with unlabeled data. In machine learning, there are a few "standard" approaches to boosting supervised learning with unlabeled data:

  1. Use the unlabeled data to construct a low dimensional manifold; use this manifold to "preprocess" the training data.
  2. Use the unlabeled data to construct a kernel.
  3. Use the unlabeled data during training time to implement the "nearby points get similar labels" intuition (a toy sketch of this idea follows the list).
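
To make item 3 concrete, here is a toy sketch of the "nearby points get similar labels" idea. The use of scikit-learn's LabelSpreading and the made-up two-blob data are purely my own choices for illustration, not anything from a particular paper:

```python
# A toy instance of "nearby points get similar labels": only four points are
# labeled, the rest are marked -1, and labels are spread along the data graph.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.RandomState(0)

# Two synthetic blobs of 100 points each (made-up data, for illustration only).
X = np.vstack([rng.randn(100, 2) + [0, 0],
               rng.randn(100, 2) + [5, 5]])
y = np.full(200, -1)            # -1 means "unlabeled" to scikit-learn
y[:2] = 0                       # two labeled points from the first blob
y[100:102] = 1                  # two labeled points from the second blob

model = LabelSpreading(kernel='rbf', gamma=0.5)
model.fit(X, y)                 # propagates labels to the unlabeled points

print(model.transduction_[:5])      # inferred labels for the first few points
print(model.transduction_[100:105])
```
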
There are basically two common uses of unlabeled data in NLP (that I can think of):
  1. Use the unlabeled data to cluster words; use these word clusters as features for training.
  2. Use the unlabeled data to bootstrap new training instances based on highly precise patterns (regular expressions, typically); a rough sketch of this appears after the list.
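
Here is that rough sketch of pattern-based bootstrapping. The sentences and the pattern are invented for illustration; the point is just that a high-precision pattern run over unlabeled text yields new (noisy) training instances:

```python
# A rough sketch of bootstrapping from a high-precision pattern: harvest new
# entity names from unlabeled text, then feed them back as (noisy) training
# data. Sentences and the pattern are invented for illustration.
import re

unlabeled_sentences = [
    "He moved to cities such as Boston and Seattle last year.",
    "Cities such as Austin are growing quickly.",
    "The protein p53 regulates the cell cycle.",
]

# A (hopefully) high-precision pattern: "cities such as X (and Y)"
pattern = re.compile(r"[Cc]ities such as ([A-Z][a-z]+)(?: and ([A-Z][a-z]+))?")

harvested = {"Boston"}            # seed instance(s)
for sentence in unlabeled_sentences:
    for match in pattern.finditer(sentence):
        harvested.update(group for group in match.groups() if group)

# These names would now become new training instances for a supervised tagger.
print(harvested)                  # {'Boston', 'Seattle', 'Austin'}
```
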
Why the discrepancy? I think it's partially that the ML techniques are relatively new and were developed without language problems in mind. For instance, most manifold learning techniques work in mid-dimensional, continuous spaces, not uber-high-dimensional, sparse, discrete spaces. Moreover, most of the ML-style techniques scale as O(N^3), where N = the number of unlabeled points; this is clearly far, far too expensive for any reasonably sized data set. Conversely, the NLP semi-supervised techniques are very tied to their domain and don't generalize well.

The paper that's been getting a lot of attention recently -- on both sides -- is the work by Ando and Zhang. As I see it, this is one way of formalizing the common practice in NLP into a form digestible by ML people. This is great, because maybe it means the two camps will be brought closer together. The basic idea is to take the first style of NLP learning (word clustering), but instead of clustering as we normally think of it (grouping words based on their contexts), they learn classifiers that predict known "free" aspects of the unlabeled data (is this word capitalized?). Probably the biggest (acknowledged) shortcoming of this technique is that a human has to come up with these auxiliary classification problems. Can we try to do that automatically?
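
Here's a toy version of what one such auxiliary problem might look like: predict whether a token is capitalized from its neighboring words. The feature and library choices are mine for illustration; in the actual method, many such problems are trained and shared structure is extracted from their learned weight vectors.

```python
# A toy version of one "auxiliary problem": predict a free property of the
# unlabeled data (is this token capitalized?) from its context words.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

unlabeled_text = "John saw Mary in Boston while reading a book".split()

examples, targets = [], []
for i, word in enumerate(unlabeled_text):
    context = {
        "prev": unlabeled_text[i - 1].lower() if i > 0 else "<s>",
        "next": unlabeled_text[i + 1].lower() if i + 1 < len(unlabeled_text) else "</s>",
    }
    examples.append(context)
    targets.append(int(word[0].isupper()))   # the "free" label comes from the data itself

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(examples)
aux_classifier = LogisticRegression().fit(X, targets)

# In the full method, the weights of many such classifiers are combined to
# derive features for the real supervised task.
print(aux_classifier.coef_.shape)
```
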

But beyond standard notions of supervised -> semi-supervised -> unsupervised, I think that working in the structured domain offers us a much more interesting continuum. Maybe instead of having some unlabeled data, we have some partially labeled data. Or data that isn't labeled quite how we want: William Cohen recently told me of a data set where they have raw bio texts and lists of the proteins that are important in those texts, but what they actually want is to find the protein names in unlabeled (i.e., list-free) texts. Or maybe we have some labeled data but then want to deploy a system and get user feedback (good/bad translation/summary). This is a form of weak feedback that we'd ideally like to use to improve our system. I think that investigating these avenues is also a very promising direction.
