08 April 2006

Unlabeled Structured Data

I'll skip discussion of multitask learning for now and go directly for the unlabeled data question.

It's instructive to compare what NLPers do with unlabeled data to what MLers do with unlabeled data. In machine learning, there are a few "standard" approaches to boosting supervised learning with unlabeled data:

  1. Use the unlabeled data to construct a low dimensional manifold; use this manifold to "preprocess" the training data.
  2. Use the unlabeled data to construct a kernel.
  3. Use the unlabeled data during training time to implement the "nearby points get similar labels" intuition.
There are basically two common uses of unlabeled data in NLP (that I can think of):
  1. Use the unlabeled data to cluster words; use these word clusters as features for training.
  2. Use the unlabeled data to bootstrap new training instances based on highly precise patterns (regular expressions, typically).
Why the discrepancy? I think it partially that the ML techniques are relatively new and developed without thinking about language problems. For instance, most manifold learning techniques work in mid-dimensional, continuous space, not uber-high-dimensional sparse discrete space. Moreover, most of the ML-style techniques scale as O(N^3), where N=# of unlabeled points. This is clearly far far too expensive for any reasonable data set. For the converse, NLP semi-sup techniques are very tied to their domain, and don't generalize well.

The paper that's been getting a lot of attention recently -- on both sides -- is the work by Ando and Zhang. As I see it, this is one way of formalizing the common practice in NLP to a form digestible by ML people. This is great, because maybe it means the two camps will be brought closer together. The basic idea is to take "A" style NLP learning, but instead of clustering as we normally think of clustering (words based on contexts), they try to learn a classifier that predicts known "free" aspects of the unlabeled data (is this word capitalized?). Probably the biggest (acknowledged) shortcoming of this technique is that a human has to come up with these secondary classification problems. Can we try to do that automatically?

But beyond standard notions of supervised -> semi-supervised -> unsupervised, I think that working in the structured domain offers us a much more interesting continuum. Maybe instead of having some unlabeled data, we have some partially labeled data. Or data that isn't labeled quite how we want (William Cohen recently told me of a data set where they have raw bio texts and lists of proteins that are important in these texts, but they want to actually find the names in unlabeled [i.e., no lists] texts.) Or maybe we have some labeled data but then want to deploy a system and get user feedback (good/bad translation/summary). This is a form of weak feedback that we'd ideally like to use to improve our system. I think that investigating these avenues is also a very promising direction.

5 comments:

. said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

Adi said...

Oes Tsetnoc one of the ways in which we can learn seo besides Mengembalikan Jati Diri Bangsa. By participating in the Oes Tsetnoc or Mengembalikan Jati Diri Bangsa we can improve our seo skills. To find more information about Oest Tsetnoc please visit my Oes Tsetnoc pages. And to find more information about Mengembalikan Jati Diri Bangsa please visit my Mengembalikan Jati Diri Bangsa pages. Thank you So much.

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex