18 May 2008

Adaptation versus adaptability

Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain, produce a classifier (say) that has low expected loss on new data drawn from the target domain. (For clarity: we typically assume that it is the data distribution that changes between domains, not the task; if the task changed, that would be standard multi-task learning.)

Obviously I think this is a fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it seems important and yet is relatively unsolved (though certainly lots of partial solutions exist).

One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:

  1. Perhaps we don't just want a classifier that will do well on the "target" domain, but one that will also do well on any of the source domains.
  2. Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from (one simple way to handle this is sketched just after this list).
  3. Continuing (2), perhaps it might not even come from one of the domains we've seen before!
  4. Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
  5. Continuing (4), perhaps the notion of "domain" is too ill-defined to be treated as a discrete entity and should really be something more continuous.
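
To make (2) concrete: one simple way to handle a test point of unknown provenance is to train one classifier per domain plus a domain classifier, and blend the per-domain predictions by the domain posterior. This is purely my illustration (scikit-learn, with a shared binary label space assumed), not a method from any particular paper:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_blended(X_by_domain, y_by_domain):
        """Point (2) as code: one classifier per training domain, plus a
        classifier that predicts which domain an example came from."""
        domains = sorted(X_by_domain)
        per_domain = {
            d: LogisticRegression(max_iter=1000).fit(X_by_domain[d], y_by_domain[d])
            for d in domains}
        X_all = np.vstack([X_by_domain[d] for d in domains])
        d_all = np.concatenate([[i] * len(X_by_domain[d])
                                for i, d in enumerate(domains)])
        domain_clf = LogisticRegression(max_iter=1000).fit(X_all, d_all)
        return domains, per_domain, domain_clf

    def predict_blended(x, domains, per_domain, domain_clf):
        """p(y|x) = sum_d p(d|x) p(y|x,d): soft blending, so we never
        have to commit to a single guessed domain for the test point."""
        x = x.reshape(1, -1)
        post = domain_clf.predict_proba(x)[0]           # p(d|x)
        return sum(post[i] * per_domain[d].predict_proba(x)[0]
                   for i, d in enumerate(domains))
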
I am referring to the union of these issues as adaptability, rather than adaptation (the latter implies that it's a do-it-once sort of thing; the former, that it's more online). All of these points raise the question that I often try to ask myself when defining problems: who is the client?

Here's one possible client: Google (or Yahoo or MSN or whatever); anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete), and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (e.g., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason I'm not sure of (2) vs. (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2), because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction than induction.) However, in this case, you are now also definitely in (4), because no one is going to sit around and label all of the web as to what "domain" it is in. And if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!

(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of it as a single domain. However, there are markedly different types of documents in there: some are "news," some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper, but that didn't work, is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation (a rough sketch of this pipeline follows). This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)
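
For concreteness, that experiment would look something like the sketch below (scikit-learn; the number of clusters and the other details are my guesses, not what the paper actually ran). The predict step at the end is exactly where the test-time domain-classification issue shows up:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import CountVectorizer

    def cluster_into_domains(train_docs, test_docs, k=5):
        """Treat BOW clusters of WSJ documents as pseudo-domains.
        train_docs/test_docs are lists of raw document strings;
        k is a guess at the number of latent subdomains."""
        vec = CountVectorizer(max_features=50000)
        X_train = vec.fit_transform(train_docs)
        km = KMeans(n_clusters=k, n_init=10).fit(X_train)
        train_domains = km.labels_                          # pseudo-domain per training doc
        test_domains = km.predict(vec.transform(test_docs)) # test-time domain assignment
        return train_domains, test_domains
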

Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: given data partitioned into two bags, are these two bags drawn from the same underlying distribution? Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try every possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).
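
One concrete proxy for that "degree of relatedness" is to train a classifier to separate the two bags and see how well it does, which is essentially the proxy A-distance idea of Ben-David et al. A minimal sketch, with all the specific choices (logistic regression, 5-fold cross-validation) being mine:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def separability(X_a, X_b):
        """Classifier-based two-sample test: label bag A as 0 and bag B
        as 1, then measure cross-validated accuracy at telling them
        apart. Chance accuracy (0.5) means the bags look alike; 1.0
        means they are trivially distinguishable. Returns a score in
        [0, 1] rather than a hard yes/no answer."""
        X = np.vstack([X_a, X_b])
        y = np.concatenate([np.zeros(len(X_a)), np.ones(len(X_b))])
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              X, y, cv=5).mean()
        return max(0.0, 2.0 * (acc - 0.5))
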

7 comments:

Benob said...

I tried your frustratingly easy domain adaptation technique using a classifier that tries to learn something out of unknown values (BoosTexter). You can think of it as defining extra binary features to handle the unknown case. While interesting, it just doesn't work. AdaBoost gets completely confused and fails to learn anything useful.
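
For readers who haven't seen the paper: the augmentation itself just makes multiple copies of the feature space (one shared copy plus one per domain), which is presumably what confuses a greedy stump-based learner, since every informative feature appears in several redundant copies. A minimal numpy sketch of the mapping (the function and its signature are mine):

    import numpy as np

    def augment(X, domain_idx, n_domains):
        """Frustratingly-easy feature augmentation: each example keeps a
        shared copy of its features plus a copy in its own domain's
        slot; all other domains' slots stay zero. domain_idx indexes
        this batch's domain in [0, n_domains)."""
        n, d = X.shape
        out = np.zeros((n, d * (1 + n_domains)))
        out[:, :d] = X                          # shared copy
        lo = d * (1 + domain_idx)
        out[:, lo:lo + d] = X                   # domain-specific copy
        return out
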

Anyways, I also sometimes think about the problem of handling unseen domains and I was wondering if it is possible, given a test example, to find a "best" set of training examples for classifying the test example. This would enable some kind of domain blending, or soft domain adaptation.

Bob Carpenter said...

This is a huge problem on the acoustic side of speech recognition in both scale and complexity.

The acoustic adaptation problem cross-cuts gender (male/female), age (young/old), broad dialect (Mexican/Castilian), finer dialect (Yucatan/Mexico City), speaking rate (fast/slow), "emotion" (angry/excited/calm/sleepy) and, last and mostly least, "domain" (flight reservations/movie information). There's also channel adaptation (cell phone in car vs. land line at home vs. speaker phone in conference room).

Of course, you can keep going down to the speaker and speaker's situation (talking to mom/talking to the bank/talking to friend from school).

The SpeechWorks recognizers were installed with fixed vocabularies, but could adapt the acoustic models to the range of callers using the system over time. This reportedly reduced error rates as much as 50% over the first few months of use.

The trick was using the (confidence ranked) feedback you got on confirmation prompts ("Did you say Los Angeles?"). The underlying model for speech adaptation is typically nothing more complex than Gaussian mixtures.
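
Bob's description is deliberately high-level, but as a very rough schematic, confidence-weighted adaptation of Gaussian mixture means might look like the sketch below (relevance-MAP style; the function, the weighting scheme, and tau are all my assumptions, not a description of the SpeechWorks system):

    import numpy as np

    def map_adapt_means(means, frames, resp, conf, tau=16.0):
        """Relevance-MAP update of Gaussian mixture means.
        means: (K, D) prior component means
        frames: (T, D) acoustic feature frames
        resp:  (T, K) per-frame component responsibilities
        conf:  (T,)   per-frame confidence from confirmation feedback
        Each mean moves toward the confidence-weighted data mean; tau
        controls how much data it takes to overcome the prior."""
        w = resp * conf[:, None]          # down-weight low-confidence frames
        n_k = w.sum(axis=0)               # (K,) effective counts
        x_k = w.T @ frames                # (K, D) weighted feature sums
        return (tau * means + x_k) / (tau + n_k)[:, None]
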

What doesn't tend to work is some up-front classification of callers which routes them to the appropriate model. The data sharing across models tends to be too helpful to throw away.

Harsha said...

Your comment about continuous domain-variability reminds me of some OCR-style adaptation I worked on for my Ph.D. We had a rudimentary Gaussian model for the style distribution from which one style is assumed to be drawn when generating a given test set to be classified (which can be as small as two samples).

This is an interesting problem with nice connections to semi-supervised learning.

Ref: "Analytical Results on Bayesian Style ...", PAMI July 2007.

Gabriel said...

I've been interested in the domain adaptation issue for a while now, in the context of extractive summarization (so a binary classification problem). One of the obvious approaches you mention in 'Frustratingly Easy Domain Adaptation' is the Predict method, where the target data has source predictions as an additional feature. I'm curious if you've seen anything working in the reverse direction, i.e. using the bit of labeled training data in the target domain to augment or filter the source data. I tried a couple of simple approaches off the top of my head, like reclassifying the source datapoints and then training on that, and adding a target prediction as an additional feature in the source data, then training on both source and target (the additional feature in the target domain simply being the gold-standard class label). Both yielded mixed results.
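
For reference, the forward ("Predict") direction Gabriel describes looks roughly like the sketch below (logistic regression and binary labels are my choices); his reverse direction would run the same pipeline with the roles of source and target swapped:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pred_baseline(X_src, y_src, X_tgt, y_tgt, X_test):
        """'Predict' baseline: train on the source domain, then append
        the source model's predicted probability as one extra feature
        when training (and testing) on the target domain."""
        src = LogisticRegression(max_iter=1000).fit(X_src, y_src)
        add_pred = lambda X: np.hstack([X, src.predict_proba(X)[:, [1]]])
        tgt = LogisticRegression(max_iter=1000).fit(add_pred(X_tgt), y_tgt)
        return tgt.predict(add_pred(X_test))
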

John said...

I agree with you on question (5). Often, it's unclear where the divides among domains begin and end. When I visited Google, Yahoo, and MSN, I didn't have a great answer for this, although I also speculated that some kind of hierarchical clustering model might work.

But I also think there are some notions of domain that *are* clear. 1990s Wall Street Journal vs. MEDLINE abstracts is one example. Another example that is very prevalent in China is Chinese Treebank (Xinhua) vs. community question answering. Baidu Zhidao (百度知道) is a great repository of useful information, but it looks nothing like the Xinhua data that we have annotated.

I would guess that adaptation, if it could be done well, would be most useful for these kinds of situations. I think this includes judicious labeling of target domain data, as well as developing better models.

hal said...

This is annoying -- for some reason Blogger stopped emailing me notifications of comments to posts, so I didn't know people had written!

Benob -- That's very frustrating about AdaBoost... I assume you're using a decision stump as a base learner? Maybe the greedy feature selection is just bad in this case, since there are so many redundant features?

Bob -- Like always, thanks for the amazing insight!

Harsha -- Thanks for the pointer :).

Gabriel -- I don't quite follow... sounds moderately interesting, though :).

John (John B, I presume!): I guess one way to go about it is: they are different if you can classify them apart from each other (à la your H-delta-H measure :P). Of course, even H-delta-H is a continuous measure, so you'd inevitably need to threshold, or do hierarchical clustering with H-delta-H as a dissimilarity, or something like that.

Julius' Blog said...

I have been working on domain adaptation problems recently (which is why I just saw your interesting words on DA problems in this blog :) ). After reading the articles and comments about DA, some questions puzzle me.
In a real-world application, say object recognition, when we use some features x to train a classifier to predict a label y, we often embed regularization terms in the objective function, hoping that regularization prevents over-fitting and yields better generalization. Here is my question: is there any link between regularization and domain adaptation? From the regularization perspective, we hope to classify test samples well even when they are somewhat perturbed relative to the training distribution; if the perturbation grows large enough, it amounts to what we would call a "different" distribution, and it can be ambiguous to decide whether two distributions really are different. So are regularization techniques in some sense a solution to domain adaptation problems? Or have I misunderstood which type of problem regularization is meant to solve? This point really confuses me.
