Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain; do the following. Produce a classifier (say) that has low expected loss on new data drawn from the target domain. (For clarity: we typically assume that it is the data distribution that changes between domains, not the task; that would be standard multi-task learning.)
Obviously I think this is an fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it both seems important and is also relatively unsolved (though certainly lots of partial solutions exist).
One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:
- Perhaps we don't want just a classifier that will do well on the "target" domain, but will really do well on any of the source domains as well.
- Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
- Continuing (2), perhaps it might not even come from one of the domains we've seen before!
- Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
- Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.
Here's one possible client: Google (or Yahoo or MSN or whatever). Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (eg., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction that induction.) However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. However, if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!
(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. However, there are markedly different types of documents in there. Some are "news." Some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper but that didn't work is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation. This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)
Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: give data partitioned into two bags, are these two bags drawn from the same underlying distribution. Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try ever possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).