25 January 2006

Domain Adaptation

We NLPers face this problem all the time: we have training data from one domain/genre but really want to work in another. E.g., the treebank is WSJ text, but we care about email or web pages or whatever. We'd like to be able to intelligently use our annotated WSJ text to get a good statistical model for a different domain.

I've been working on this problem for a while now and have a partial solution. I'm most interested in the case that we have lots of annotated "out of domain" (OOD) data and a little annotated "in domain" (ID) data. Domain Adaptation for Statistical Classifiers is a paper that's been accepted to JAIR that presents one way to model this problem. The key idea is to model the OOD/ID data distributions as mixtures. There are three mixture components: a "truly ID" distribution, a "truly OOD" distribution and a "general" distribution. We say the OOD data comes from a mixture of "truly OOD" and "general," while the ID data comes from a mixture of "truly ID" and "general." The learning task is to tear apart our data sets to figure out what "general" (and hence relevant to the ID task) information there is in the OOD data.
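To make the generative story concrete, here's a toy sketch of it in Python. This is my own illustration with made-up 1-D Gaussian "distributions" and hypothetical names; the paper works with conditional models over real NLP features, not this:

```python
# Toy sketch of the three-component generative story: each OOD example is
# drawn either from a "truly OOD" distribution or a shared "general" one;
# each ID example from "truly ID" or "general". (Illustration only.)
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(n, p_general, general_dist, specific_dist):
    """Draw n examples; each comes from general_dist with prob p_general,
    otherwise from the domain-specific distribution."""
    from_general = rng.random(n) < p_general
    xs = np.where(from_general, general_dist(n), specific_dist(n))
    return xs, from_general

# Hypothetical 1-D feature distributions, purely for illustration.
general   = lambda n: rng.normal(0.0, 1.0, n)   # shared across domains
truly_ood = lambda n: rng.normal(+3.0, 1.0, n)  # out-of-domain only
truly_id  = lambda n: rng.normal(-3.0, 1.0, n)  # in-domain only

ood_x, ood_is_general = sample_domain(2000, 0.5, general, truly_ood)  # lots of OOD
id_x,  id_is_general  = sample_domain(100,  0.5, general, truly_id)   # a little ID

# Learning then amounts to recovering, for each OOD example, the latent
# probability that it came from the "general" component, i.e. the part of
# the OOD data that is actually relevant in-domain.
```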

The framework, when applied to maximum entropy models, gives relatively simple update equations for model parameters. The derivations take a bit of thought, but are not insane. The approach is a partial solution because it's limited to maximum entropy models. I think the problem should be amenable to a more learning theoretic analysis, but haven't had time to make much headway here.
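To give a feel for what such updates could look like, here is a rough EM-flavoured sketch with logistic regression (maxent) components. It is my own simplified reconstruction, with the "truly ID" component dropped for brevity; it is not the derivation from the paper, just the general weighted-refit flavour:

```python
# EM-style sketch (simplified, not the paper's updates): infer, per OOD
# example, the responsibility of the "general" component vs. the "truly
# OOD" one, then refit each maxent model on responsibility-weighted data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prob_of_label(model, X, y):
    """P(y_i | x_i) under a fitted sklearn logistic regression."""
    proba = model.predict_proba(X)
    cols = np.searchsorted(model.classes_, y)
    return proba[np.arange(len(y)), cols]

def em_adapt(X_ood, y_ood, X_id, y_id, iters=20, pi=0.5):
    # Initialise: "general" on everything, "truly OOD" on OOD data only.
    general = LogisticRegression(max_iter=1000).fit(
        np.vstack([X_ood, X_id]), np.concatenate([y_ood, y_id]))
    specific = LogisticRegression(max_iter=1000).fit(X_ood, y_ood)
    for _ in range(iters):
        # E-step: posterior that each OOD example was generated by the
        # general component rather than the truly-OOD one.
        pg = pi * prob_of_label(general, X_ood, y_ood)
        ps = (1.0 - pi) * prob_of_label(specific, X_ood, y_ood)
        resp = pg / (pg + ps)
        # M-step: refit with responsibility weights; ID examples always
        # count fully toward the general model in this simplified sketch.
        general = LogisticRegression(max_iter=1000).fit(
            np.vstack([X_ood, X_id]),
            np.concatenate([y_ood, y_id]),
            sample_weight=np.concatenate([resp, np.ones(len(y_id))]))
        specific = LogisticRegression(max_iter=1000).fit(
            X_ood, y_ood, sample_weight=(1.0 - resp) + 1e-6)
        pi = resp.mean()
    return general  # the model you'd actually use in-domain
```

In this sketch, an OOD example contributes to the general (in-domain-relevant) model only to the degree that the general model explains its label better than the truly-OOD model does.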

I'm a bit surprised this problem hasn't gotten more attention in the NLP community (a similar problem -- speaker adaptation -- exists in the speech community). Or perhaps I've missed it in NLP. It seems like we, as NLPers, should really care about this issue.

1 comment:

hal said...

i've recently talked to drew bagnell about some work he's done on a related problem: robust supervised learning. the idea in robust learning is that you don't trust that your test distribution is identical to your training distribution, but the difference is bounded (in a KL sense). this isn't the same as classifier migration, but they're related. i think there might be some room for marrying the techniques.
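for intuition, here's a minimal sketch (mine, not drew's actual method) of the exponential-tilting trick that falls out of a KL-constrained worst case over training distributions:

```python
# Minimal sketch: if an adversary can move the test distribution away from
# the (uniform) training distribution by at most a fixed KL divergence, the
# worst-case reweighting of the training set exponentially up-weights
# high-loss examples; training against these weights hedges against shift.
import numpy as np

def worst_case_weights(losses, temperature=1.0):
    """Adversarial example weights proportional to exp(loss / T), normalised.
    A smaller temperature corresponds to a looser KL budget (a more
    pessimistic adversary)."""
    w = np.exp((losses - losses.max()) / temperature)  # stabilised
    return w / w.sum()
```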
