18 May 2008

Adaptation versus adaptability

Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain, produce a classifier (say) that has low expected loss on new data drawn from the target domain. (For clarity: we typically assume that it is the data distribution that changes between domains, not the task; if the task changed, that would be standard multi-task learning.)
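To pin the setup down a bit (my notation, as a sketch rather than anything canonical):

    \text{Given } S = \{(x_i, y_i)\}_{i=1}^{n} \sim D_S \text{ (labeled source data), and either}
    \text{(a) } T_\ell = \{(x_j, y_j)\}_{j=1}^{m} \sim D_T \text{ with } m \ll n, \text{ or (b) } T_u = \{x_j\}_{j=1}^{M} \sim D_T \text{ (unlabeled),}
    \text{find } h \in \mathcal{H} \text{ (approximately) minimizing } \mathbb{E}_{(x,y) \sim D_T}\bigl[\ell(h(x), y)\bigr].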

Obviously I think this is a fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it seems important and is also relatively unsolved (though certainly lots of partial solutions exist).

One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:

  1. Perhaps we don't just want a classifier that will do well on the "target" domain, but one that will also do well on any of the source domains.
  2. Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
  3. Continuing (2), perhaps it might not even come from one of the domains we've seen before!
  4. Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
  5. Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.
I am referring to the union of these issues as adaptability, rather than adaptation (the latter implies that it's a do-it-once sort of thing; the former that it's more online). All of these points raise the question that I often try to ask myself when defining problems: who is the client?

Here's one possible client: Google (or Yahoo or MSN or whatever). Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (e.g., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction than induction.) However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. And if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!

(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. However, there are markedly different types of documents in there. Some are "news." Some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper but that didn't work is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation. This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)
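In rough code, the experiment had something like the following shape (a re-creation with scikit-learn-style tooling for concreteness; this is not the original code, and the parameter choices here are made up):

    # Cluster WSJ documents into pseudo-domains via bag-of-words, then
    # assign test documents to the nearest cluster at prediction time.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def make_pseudo_domains(train_docs, test_docs, k=5):
        # Bag-of-words (TF-IDF) representation of each document.
        vec = TfidfVectorizer(max_features=20000)
        X_train = vec.fit_transform(train_docs)
        X_test = vec.transform(test_docs)

        # Flat clustering of the training documents; each cluster is then
        # treated as a separate "domain" for the adaptation step.
        km = KMeans(n_clusters=k, random_state=0).fit(X_train)
        train_domains = km.labels_

        # At test time we have to guess a domain: nearest centroid.
        test_domains = km.predict(X_test)
        return train_domains, test_domains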

Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: given data partitioned into two bags, are these two bags drawn from the same underlying distribution? Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try every possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).
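One cheap way to operationalize "degree of relatedness" (a sketch in the spirit of the classifier-based tests mentioned above, not a calibrated hypothesis test): train a classifier to tell the two bags apart and see how far above chance it gets. Accuracy near 0.5 suggests the bags look like the same domain; accuracy near 1.0 suggests trivially separable domains.

    # Proxy for "are these two bags the same domain?": held-out accuracy
    # of a classifier trained to distinguish them.  Assumes scikit-learn.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def domain_separability(bag_a, bag_b):
        docs = list(bag_a) + list(bag_b)
        labels = np.array([0] * len(bag_a) + [1] * len(bag_b))
        X = CountVectorizer(max_features=20000).fit_transform(docs)
        scores = cross_val_score(LogisticRegression(max_iter=1000),
                                 X, labels, cv=5, scoring="accuracy")
        return scores.mean()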

9 comments:

Ben said...

I tried your frustratingly easy domain adaptation technique using a classifier that tries to learn something out of unknown values (boostexter). You can think of it as defining extra binary features to handle the unknown case. While interesting, it just doesn't work. Adaboost gets completely confused and fails to learn anything useful.
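For concreteness, the augmentation being applied here is roughly the following (a Python sketch rather than the actual boostexter setup; the feature naming scheme is made up):

    # Each feature gets a shared copy plus a domain-specific copy, so the
    # learner can decide per-feature whether to share across domains.
    def augment(features, domain):
        """features: dict of feature name -> value; domain: e.g. 'source'."""
        out = {}
        for name, value in features.items():
            out["shared:" + name] = value       # fires in every domain
            out[domain + ":" + name] = value    # fires only in this domain
        return out

    # augment({"word=bank": 1.0}, "target")
    # -> {"shared:word=bank": 1.0, "target:word=bank": 1.0}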

Anyways, I also sometimes think about the problem of handling unseen domains and I was wondering if it is possible, given a test example, to find a "best" set of training examples for classifying the test example. This would enable some kind of domain blending, or soft domain adaptation.

Anonymous said...

This is a huge problem on the acoustic side of speech recognition in both scale and complexity.

The acoustic adaptation problem cross-cuts gender (male/female), age (young/old), broad dialect (Mexican/Castilian), finer dialect (Yucatan/Mexico City), speaking rate (fast/slow), "emotion" (angry/excited/calm/sleepy) and last and mostly least, "domain" (flight reservations/movie information). There's also channel adaptation (cell phone in car vs. land line at home vs. speaker phone in conference room).

Of course, you can keep going down to the speaker and speaker's situation (talking to mom/talking to the bank/talking to friend from school).

The SpeechWorks recognizers were installed with fixed vocabularies, but could adapt the acoustic models to the range of callers using the system over time. This reportedly reduced error rates as much as 50% over the first few months of use.

The trick was using the (confidence ranked) feedback you got on confirmation prompts ("Did you say Los Angeles?"). The underlying model for speech adaptation is typically nothing more complex than Gaussian mixtures.
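The flavor of update involved is something like the following (a generic textbook-style MAP re-estimation of mixture means, not SpeechWorks' actual recipe; the names and prior-strength value here are made up):

    # MAP adaptation of Gaussian mixture means from confidence-weighted
    # frames of confirmed utterances.
    import numpy as np

    def map_adapt_means(means, posteriors, frames, confidences, tau=16.0):
        """means: (K, D) prior component means; posteriors: (N, K) per-frame
        component responsibilities; frames: (N, D) acoustic features;
        confidences: (N,) per-frame confidence weights; tau: prior strength."""
        w = posteriors * confidences[:, None]   # (N, K) weighted responsibilities
        counts = w.sum(axis=0)                  # soft count per component
        sums = w.T @ frames                     # (K, D) weighted feature sums
        # Interpolate between the prior mean and the adaptation-data mean.
        return (tau * means + sums) / (tau + counts)[:, None]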

What doesn't tend to work is some up-front classification of callers which routes them to the appropriate model. The data sharing across models tends to be too helpful to throw away.

Anonymous said...

Your comment about continuous domain-variability reminds me of some OCR style adaptation I worked on for my Ph.D. We had a rudimentary Gaussian model for the style distribution; a single style is assumed to be drawn from it to generate a given test set (which can be as small as two samples) to be classified.

This is an interesting problem with nice connections to semi-supervised learning.

Ref: "Analytical Results on Bayesian Style ...", PAMI July 2007.

Gabriel said...

I've been interested in the domain adaptation issue for a while now, in the context of extractive summarization (so a binary classification problem). One of the obvious approaches you mention in 'Frustratingly Easy Domain Adaptation' is the Predict method, where the target data has source predictions as an additional feature. I'm curious if you've seen anything working in the reverse direction, i.e. using the bit of labeled training data in the target domain to augment or filter the source data. I tried a couple of simple approaches off the top of my head, like reclassifying the source datapoints and then training on that, and adding a target prediction as an additional feature in the source data, then training on both source and target (the additional feature in the target domain simply being the gold-standard class label). Both yielded mixed results.
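A rough sketch of the second variant, just to be concrete (hypothetical code with a stand-in classifier, not what I actually ran):

    # Train a model on the small labeled target set, add its prediction as
    # an extra feature column on the source data (and the gold label as
    # that same feature on the target data), then train on the union.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def reverse_predict(source_X, source_y, target_X, target_y):
        tgt_clf = LogisticRegression(max_iter=1000).fit(target_X, target_y)
        src_aug = np.hstack([source_X, tgt_clf.predict(source_X)[:, None]])
        tgt_aug = np.hstack([target_X, np.asarray(target_y)[:, None]])
        X = np.vstack([src_aug, tgt_aug])
        y = np.concatenate([source_y, target_y])
        return LogisticRegression(max_iter=1000).fit(X, y)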

Unknown said...

I agree with you on question (5). Often, it's unclear where the divides between domains begin and end. Visiting Google, Yahoo, and MSN, I didn't have a great answer for this, although I also speculated that some kind of hierarchical clustering model might work.

But I also think there are some notions of domain that *are* clear. 1990s Wall Street Journal vs. MEDLINE abstracts is one example. Another example that is very prevalent in China is Chinese Treebank (Xinhua) vs. community question answering. baidu zhidao (百度知道) is a great repository for useful information, but it looks nothing like the Xinhua data that we have annotated.

I would guess that adaptation, if it could be done well, would be most useful for these kinds of situations. I think this includes judicious labeling of target domain data, as well as developing better models.

hal said...

This is annoying -- for some reason Blogger stopped emailing me notifications of comments to posts, so I didn't know people had written!

Benob -- That's very frustrating about AdaBoost... I assume you're using a decision stump as a base learner? Maybe the greedy feature selection is just bad in this case, since there are so many redundant features?

Bob -- Like always, thanks for the amazing insight!

Harsha -- Thanks for the pointer :).

Gabriel -- I don't quite follow... sounds moderately interesting, though :).

John (John B, I presume!): I guess one way to go about it is: they are different if you can classify them away from each other (ala your H-delta-H measure :P). Of course, even H-delta-H is a continuous measure, so you'd inevitably need to threshold, or do hierarchical clustering with H-delta-H as a dissimilarity, or something like that.


Unknown said...

I have been working on domain adaptation problems recently (that's why I just came across your interesting post on DA problems in this blog :) ). After reading the post and the comments about DA, some questions puzzle me.
In a real-world application, say object recognition, when we use some features x to train a classifier that predicts a label y, we often add regularization terms to the objective function of the classification task, hoping that regularization prevents overfitting and gives better generalization. Here is my question: is there any link between regularization and domain adaptation? From the regularization perspective, we hope to classify test samples well even when they are somewhat perturbed relative to the training distribution; if the perturbation grows large enough, it may amount to what we would call a "different" distribution, and it can be ambiguous to decide whether two distributions are different. So are regularization techniques, in some sense, a solution to domain adaptation problems? Or am I misunderstanding which type of problem regularization is meant to solve? This is a point that really confuses me.