30 November 2007

Domain adaptation vs. transfer learning

The standard classification setting is an input distribution p(X) and a label distribution p(Y|X). Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distribution stays the same, but the labels change. The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. (Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.)
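
As a toy illustration of the distinction, here is a minimal sketch. Everything in it is an assumption made up for the example: the one-dimensional inputs, the uniform distributions, the thresholds 0.5 and 0.3, and the function names. It evaluates the source labeling rule on a "DA target" (p(X) moves, rule unchanged) and on a "TL target" (p(X) unchanged, rule changes).

```python
import random

def source_rule(x):
    # Hypothetical labeling function shared by the source and the DA target.
    return x > 0.5

def shifted_rule(x):
    # Hypothetical changed labeling function used by the TL target.
    return x > 0.3

def error(rule, data):
    # Fraction of (x, y) pairs on which rule(x) disagrees with y.
    return sum(rule(x) != y for x, y in data) / len(data)

rng = random.Random(0)
n = 10_000

# Source: x ~ Uniform(0, 1), labeled by source_rule.
source = [(x, source_rule(x)) for x in (rng.uniform(0, 1) for _ in range(n))]

# DA target: p(X) moves to Uniform(0.5, 1.5); the labeling rule is unchanged.
da_target = [(x, source_rule(x)) for x in (rng.uniform(0.5, 1.5) for _ in range(n))]

# TL target: p(X) is unchanged; the labeling rule changes.
tl_target = [(x, shifted_rule(x)) for x in (rng.uniform(0, 1) for _ in range(n))]

da_error = error(source_rule, da_target)  # zero: the rule itself did not change
tl_error = error(source_rule, tl_target)  # roughly the mass of x between 0.3 and 0.5
print(da_error, tl_error)
```

This is the idealized picture only: a classifier learned from finite source data need not survive the change in p(X) as cleanly as the true rule does here.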

As a brief aside, we actually need to be a bit more specific about the domain adaptation case. In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. (Many people have pointed this out; I think I first heard it from Shai Ben-David a few years ago.) It is because of this that theoretical work in domain adaptation has been required to make stronger assumptions. A reasonable one is the assumption that (within a particular concept class, i.e., space of possible classifiers) there exists one that doesn't do too badly on either the source or the target domain. This is a stronger assumption than "p(Y|X) doesn't change", but it actually enables us to do stuff. (Though, see (*) below for a bit more discussion on this assumption.)

Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. In DA, the two "sides" are the source domain and the target domain. In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). In all cases, we want something that does well on the target. Let's enumerate the four possibilities:
  1. Source labeled, target labeled (S+T+)
  2. Source labeled, target only unlabeled (S+T-)
  3. Source only unlabeled, target labeled (S-T+)
  4. Source only unlabeled, target only unlabeled (S-T-)
We can immediately throw away S-T- because this is basically an unsupervised learning problem.

The typical assumption in TL is S+T+. That is, we have labeled data for both tasks. (Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.)

In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on).

Now, I think it's fair to say that any of the T- settings are impossible for TL. Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. (I.e., in DA we assume a single good classifier exists, but this assumption doesn't make sense in TL.)

This raises the question: in TL and DA, does the S-T+ setting make any sense?

For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. Usually we want to do DA so that we don't have to label (much) target data. However, one could make a semi-supervised sort of argument here. Maybe it's just hard to come by target data, labeled or otherwise. In this case, we'd like to use a bunch of unlabeled source data to help out. (Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) From a more theoretical perspective, I don't really see anything wrong with it. In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here.

For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). Why do I think this makes sense? Essentially, the TL assumption says that the labeling function can change arbitrarily, but the underlying distribution can't. If this is true, and we have labeled target data, I see no reason why we would need labeled source data. That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same.

What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. I've actually heard (e.g., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. Maybe some assumption on the bound of the KL divergence or some such thing.
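
One hedged way to write the two ends of this spectrum down is sketched below. The notation is introduced here purely for illustration: \epsilon_S(h) and \epsilon_T(h) are the source and target error of a classifier h from the concept class \mathcal{H}, p_1 and p_2 are the labeling distributions of the two tasks, and \lambda and \delta are small constants. The KL version also quietly assumes the two tasks share a label space, which need not hold in TL.

```latex
% DA-style assumption: a single classifier does well on both sides.
\exists\, h \in \mathcal{H} : \quad \epsilon_S(h) + \epsilon_T(h) \le \lambda

% A weaker, TL-style assumption: the two labeling distributions
% are close on average over the shared input distribution.
\mathbb{E}_{x \sim p(X)} \Big[ \mathrm{KL}\big( p_1(Y \mid x) \,\|\, p_2(Y \mid x) \big) \Big] \le \delta
```

Taking \delta large recovers the "labels can change arbitrarily" view; taking it small says the source labels carry real information about the target labels.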

How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well-studied cases) and for TL the only interesting one is S-T+. This is actually quite surprising, given that similar techniques have been used for both.

(*) I think one exception to this assumption occurs in microarray analysis in computational biology. One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. What is typically done in compbio is to do a fairly heavy-handed normalization of the data, usually by some complex rank-ordering and binning process. A lot of information is lost in this transformation, but it at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. One can think of not doing the normalization step and instead thinking of this as a DA problem. However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. (You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.)
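
To make the "rank-ordering and binning" idea concrete, here is a deliberately simple sketch. It is not the procedure any particular compbio pipeline uses; the function name, the bin count, and the expression values are all made up for the example. Each measurement is replaced by the bin that its rank falls into, so two platforms that report the same genes on very different scales end up with identical normalized profiles.

```python
def rank_bin_normalize(values, n_bins=4):
    """Replace each value by the index (0 .. n_bins-1) of the bin its rank falls in.

    Ties are broken by position, which is an arbitrary choice for this sketch.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical expression levels for the same four genes on two platforms
# that report on very different scales.
platform_a = [2.0, 8.0, 4.0, 16.0]
platform_b = [210.0, 850.0, 400.0, 1700.0]

norm_a = rank_bin_normalize(platform_a)
norm_b = rank_bin_normalize(platform_b)
print(norm_a, norm_b)
```

The information loss is visible here too: after normalization, nothing remains of how far apart the expression levels were, only their order.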


Anonymous said...

I am confused, in transfer learning where P(X) does not change, how is S-T+ any different from semi-supervised learning?

The statement "That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution." is a bit perplexing. I always thought the opposite; otherwise, why do we care about P(Y|X) for S (like you suggest)? When I think of TL I think of cases like the following: I have a large set of data labeled with definition A of a gene, and a small set of data labeled with definition B of a gene. Both annotations are drawn from the same corpus, e.g., Medline. For my problem, I am interested in extracting mentions of genes under definition B. Clearly the labeled corpus A is going to help. Even though the definitions A and B are not identical, they are at least informative about each other. So how can I use the data annotated with A to help learn a classifier to predict definition B? But maybe this is more multi-task learning? I certainly don't think of this as DA since P(X) hasn't changed.

hal said...

ryan --

for the first statement, you're right, it's not :).

i guess the issue is that in transfer learning we want to think of the two tasks in terms of how related they are. maybe the right way to think of this is as some sort of mutual information... what's the entropy of label2 given x, versus given x and label1. there's obviously (in most cases) useful information here. but then you get this compounding problem: since i don't actually know label1, i have to integrate it out when trying to predict label2, which removes any notion of lower entropy.
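
a small sketch of the entropy comparison, under made-up assumptions: x is a fair coin, label1 is a fair coin independent of x, and label2 agrees with label1 90% of the time. the joint table, the function name, and the numbers are all hypothetical.

```python
from collections import defaultdict
from math import log2

def conditional_entropy(joint, target, given):
    """H(target | given) in bits for a joint stored as {(x, y1, y2): prob}.

    `target` and `given` are tuples of positions into each key,
    e.g. target=(2,), given=(0,) computes H(label2 | x).
    """
    p_given = defaultdict(float)
    p_both = defaultdict(float)
    for outcome, p in joint.items():
        g = tuple(outcome[i] for i in given)
        t = tuple(outcome[i] for i in target)
        p_given[g] += p
        p_both[(g, t)] += p
    return -sum(p * log2(p / p_given[g]) for (g, t), p in p_both.items() if p > 0)

# Hypothetical joint distribution over (x, label1, label2).
joint = {}
for x in (0, 1):
    for y1 in (0, 1):
        for y2 in (0, 1):
            joint[(x, y1, y2)] = 0.5 * 0.5 * (0.9 if y2 == y1 else 0.1)

h_given_x = conditional_entropy(joint, target=(2,), given=(0,))
h_given_x_and_label1 = conditional_entropy(joint, target=(2,), given=(0, 1))
information_gain = h_given_x - h_given_x_and_label1  # I(label1; label2 | x)
print(h_given_x, h_given_x_and_label1, information_gain)
```

summing label1 out of the table is exactly what the first call does, so without an observed label1 you are back at the entropy of label2 given x alone.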

Kevin Duh said...

Hal, I like your categorization of various scenarios (e.g. S+T-). However, I think the words "domain adaptation", "transfer learning", etc. are just terms people use to roughly describe their training scenario and these might not correspond to your definitions.

Here's a question similar to Ryan's: If you can call S-T+ semi-supervised learning, can you call S+T- transductive learning?

I have a more fundamental question that puzzles me: how different do the training and test distributions have to be before you begin to call the problem a domain adaptation or transfer learning problem? For instance, is WSJ->Brown corpus a DA problem? If so, is WSJ section1->WSJ section22 a DA problem? If so, is WSJ sentence#2202->WSJ sentence#2203 a DA problem? We seem to draw the line between DA/TL and traditional supervised learning somewhat arbitrarily. Can we draw the line in a more principled way, or even more, do we even need to draw this line?

Ani said...

Hi Hal,
I may be wrong, but I feel that if p(x) changes, the only way p(y|x) will not be affected is if the distribution of y does not depend on x. So how are DA and TL different?
What's your solution to the microarray normalization problem, if you are thinking of one?

Abhishek said...

Hi Hal,

Have you come across NLP, specifically named entity recognition, applied to the legal domain?
