The standard classification setting is a input distribution p(X) and a label distribution p(Y|X). Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stays the same, but the labels change. The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. (Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.)
As a brief aside, we actually need to be a bit more specific about the domain adaptation case. In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. (Many people have pointed this out, I think I first heard it from Shai Ben-David a few years ago.) It is because of the assumption that theoretical work in domain adaptation has been required to make stronger assumptions. A reasonable one is the assumption that (within a particular concept class---i.e., space of possible classifiers), there exists one that doesn't do too bad on either the source or the target domain. This is a stronger assumption that the "p(Y|X) doesn't change", but actually enables us to do stuff. (Though, see (*) below for a bit more discussion on this assumption.)Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. In DA, the two "sides" are the source domain and the target domain. In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). In all cases, we want something that does well on the target. Let's enumerate the four possibilities:
- Source labeled, target labeled (S+T+)
- Source labeled, target only unlabeled (S+T-)
- Source only unlabeled, target labeled (S-T+)
- Source only unlabeled, target only unlabeled (S-T-)
The typical assumption in TL is S+T+. That is, we have labeled data for both tasks. (Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.)
In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on).
Now, I think it's fair to say that any of the T- settings are impossible for TL. Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. (I.e., unlike the case in DA where we assume a single good classifier exists, this assumption doesn't make sense in TL.)
This begs the question: in TL and DA, does the S-T+ setting make any sense?
For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. Usually we want to do DA so that we don't have to label (much) target data. However, one could make a semi-supervised sort of argument here. Maybe it's just hard to come by target data, labeled or otherwise. In this case, we'd like to use a bunch of unlabeled source data to help out. (Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) From a more theoretical perspective, I don't really see anything wrong with it. In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here.
For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). Why do I think this makes sense? Essentially, the TL assumption basically says that the labeling function can change arbitrarily, but the underlying distribution can't. If this is true, and we have labeled target data, I see no reason why we would need labeled source data. That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same.
What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. I've actually heard (eg., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. Maybe some assumption on the bound of the KL divergence or some such thing.
How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well studied cases) and for TL the only interesting one is S-T+. This is actually quite surprising, given that similar techniques have been used for both.
(*) I think one exception to this assumption occurs in microarray analysis in computational biology. One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. What is typically done in compbio is to do a fairly heavy handed normalization of the data, usually by some complex rank-ordering and binning process. A lot of information is lost in this transformation, but at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. One can think of not doing the normalization step and instead thinking of this as a DA problem. However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. (You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.)