natural language processing blog: domain adaptation

Showing posts with label domain adaptation. Show all posts

19 August 2010

Multi-task learning: should our hypothesis classes be the same?

It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks. In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used.

I probably would have gone on not even considering this unspoken assumption, until a few years ago I saw a couple papers that challenged it, albeit indirectly. One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. What they see is that different features are useful for spotting entity spans, and for labeling entity types.

That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).

Now, to some degree this is not surprising. If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?

This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?

One way is to punt: you can just concatenate your feature vectors and cross your fingers. Or, more nuanced, you can have some set of shared features and some set of features unique to each task. This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.

An alternative approach is to let the two classifiers "talk" via unlabeled data. Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree."

A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. Under the assumption that we have access to examples labeled for both tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), then one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.

I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.

01 April 2010

Classification weirdness, regression simplicity

In the context of some work on multitask learning, we came to realize that classification is kind of weird. Or at least linear classification. It's not that it's weird in a way that we didn't already know: it's just sort of a law of unexpected consequences.

If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions. A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized models), the best solution is usually an infinite norm weight vector that's pointing in the right direction.

This is definitely not true of (linear) regression. Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance. By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.

Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.

I think this is stuff that we all know.

Where this can bite you in unexpected ways is the following. In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B" where "look like" is really the place where you get to be creative and define things how you feel best.

This is all well and good in the regression setting. A magnitude 5 weight in regression means a magnitude 5 weight in regression. But not so in classification. Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing. Or at least it means something that has to do more with the difficulty of the problem and how you chose to set your regularization parameter, rather than something to do with the task itself.

Perhaps we should be looking for definitions of "look like" that are insensitive to things like magnitude. Sure you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.

Perhaps this is a partial explanation of some negative transfer. One thing that you see, when looking at the literature in DA and MTL, is that all the tasks are typically of about the same difficulty. My expectation is that if you have two tasks that are highly related, but one is way harder than the other, is going to lead to negative transfer. Why? Because the easy task will get low norm weights, and the hard task will get high norm weights. The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task. In a sense, we actually want the opposite to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy! (Yes, I know that being Bayesian might help here since you'd get a lot of uncertainty around those high norm weight vectors!)

18 May 2008

Adaptation versus adaptability

Domain adaptation is, roughly, the following problem: given labeled data drawn from one or more source domains, and either (a) a little labeled data drawn from a target domain or (b) a lot of unlabeled data drawn from a target domain; do the following. Produce a classifier (say) that has low expected loss on new data drawn from the target domain. (For clarity: we typically assume that it is the data distribution that changes between domains, not the task; that would be standard multi-task learning.)

Obviously I think this is an fun problem (I publish on it, and blog about it reasonably frequently). It's fun to me because it both seems important and is also relatively unsolved (though certainly lots of partial solutions exist).

One thing I've been wondering recently, and I realize as I write this that the concept is as yet a bit unformed in my head, is whether the precise formulation I wrote in the first paragraph is the best one to solve. In particular, I can imagine many ways to tweak it:

Perhaps we don't want just a classifier that will do well on the "target" domain, but will really do well on any of the source domains as well.
Continuing (1), perhaps when we get a test point, we don't know which of the domains we've trained on it comes from.
Continuing (2), perhaps it might not even come from one of the domains we've seen before!
Perhaps at training time, we just have a bunch of data that's not neatly packed into "domains." Maybe one data set really comprises five different domains (think: Brown, if it weren't labeled by the source) or maybe two data sets that claim to be different domains really aren't.
Continuing (4), perhaps the notion of "domain" is too ill-defined to be thought of as a discrete entity and should really be more continuous.

I am referring to union of these issues as adaptability, rather than adaptation (the latter implies that it's a do-it-once sort of thing; the former that it's more online). All of these points beg the question that I often try to ask myself when defining problems: who is the client.

Here's one possible client: Google (or Yahoo or MSN or whatever). Anyone who has a copy of the web lying around and wants to do some non-trivial NLP on it. In this case, the test data really comes from a wide variety of different domains (which may or may not be discrete) and they want something that "just works." It seems like for this client, we must be at (2) or (3) and not (1). There may be some hints as to the domain (eg., if the URL starts with "cnn.com" then maybe--though not necessarily--it will be news; it could also be a classified ad). The reason why I'm not sure of (2) vs (3) is that if you're in the "some labeled target data" setting of domain adaptation, you're almost certainly in (3); if you're in the "lots of unlabeled target data" setting, then you may still be in (2) because you could presumably use the web as your "lots of unlabeled target data." (This is a bit more like transduction that induction.) However, in this case, you are now also definitely in (4) because no one is going to sit around and label all of the web as to what "domain" it is in. However, if you're in the (3) setting, I've no clue what your classifier should do when it gets a test example from a domain that (it thinks) it hasn't seen before!

(4) and (5) are a bit more subtle, primarily because they tend to deal with what you believe about your data, rather than who your client really is. For instance, if you look at the WSJ section of the Treebank, it is tempting to think of this as a single domain. However, there are markedly different types of documents in there. Some are "news." Some are lists of stock prices and their recent changes. One thing I tried in the Frustratingly Easy paper but that didn't work is the following. Take the WSJ section of the Treebank and run document clustering on it (using bag of words). Treat each of these clusters as a separate domain and then do adaptation. This introduces the "when I get a test example I have to classify it into a domain" issue. At the end of the day, I couldn't get this to work. (Perhaps BOW is the wrong clustering representation? Perhaps a flat clustering is a bad idea?)

Which brings us to the final question (5): what really is a domain? One alternative way to ask this is: give data partitioned into two bags, are these two bags drawn from the same underlying distribution. Lots of people (especially in theory and databases, though also in stats/ML) have proposed solutions for answering this question. However, it's never going to be possible to give a real "yes/no" answer, so now you're left with a "degree of relatedness." Furthermore, if you're just handed a gigabyte of data, you're not going to want to try ever possible split into subdomains. One could try some tree-structured representation, which seems kind of reasonable to me (perhaps I'm just brainwashed because I've been doing too much tree stuff recently).

30 November 2007

Domain adaptation vs. transfer learning

The standard classification setting is a input distribution p(X) and a label distribution p(Y|X). Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stays the same, but the labels change. The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. (Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.)

As a brief aside, we actually need to be a bit more specific about the domain adaptation case. In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. (Many people have pointed this out, I think I first heard it from Shai Ben-David a few years ago.) It is because of the assumption that theoretical work in domain adaptation has been required to make stronger assumptions. A reasonable one is the assumption that (within a particular concept class---i.e., space of possible classifiers), there exists one that doesn't do too bad on either the source or the target domain. This is a stronger assumption that the "p(Y|X) doesn't change", but actually enables us to do stuff. (Though, see (*) below for a bit more discussion on this assumption.)

Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. In DA, the two "sides" are the source domain and the target domain. In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). In all cases, we want something that does well on the target. Let's enumerate the four possibilities:

Source labeled, target labeled (S+T+)
Source labeled, target only unlabeled (S+T-)
Source only unlabeled, target labeled (S-T+)
Source only unlabeled, target only unlabeled (S-T-)

We can immediately throw away S-T- because this is basically an unsupervised learning problem.

The typical assumption in TL is S+T+. That is, we have labeled data for both tasks. (Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.)

In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on).

Now, I think it's fair to say that any of the T- settings are impossible for TL. Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. (I.e., unlike the case in DA where we assume a single good classifier exists, this assumption doesn't make sense in TL.)

This begs the question: in TL and DA, does the S-T+ setting make any sense?

For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. Usually we want to do DA so that we don't have to label (much) target data. However, one could make a semi-supervised sort of argument here. Maybe it's just hard to come by target data, labeled or otherwise. In this case, we'd like to use a bunch of unlabeled source data to help out. (Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) From a more theoretical perspective, I don't really see anything wrong with it. In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here.

For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). Why do I think this makes sense? Essentially, the TL assumption basically says that the labeling function can change arbitrarily, but the underlying distribution can't. If this is true, and we have labeled target data, I see no reason why we would need labeled source data. That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same.

What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. I've actually heard (eg., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. Maybe some assumption on the bound of the KL divergence or some such thing.

How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well studied cases) and for TL the only interesting one is S-T+. This is actually quite surprising, given that similar techniques have been used for both.

(*) I think one exception to this assumption occurs in microarray analysis in computational biology. One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. What is typically done in compbio is to do a fairly heavy handed normalization of the data, usually by some complex rank-ordering and binning process. A lot of information is lost in this transformation, but at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. One can think of not doing the normalization step and instead thinking of this as a DA problem. However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. (You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.)

02 January 2007

Learning when test and train inputs have different distributions -- NIPS workshop

I spent the second day of workshops at NIPS (while not skiing) attending the Learning when test and train inputs have different distributions workshop. This is closely related (or really just a different name) for the domain adaptation problem I've been interested in for quite some time. Unfortunately, I can't easily come across the list of papers (there were some good ones!) which means that my memory may be lacking at some parts. Here are some points I took away.

Statisticians have worked on this problem for a long time. If you provide insurance, you're going to want a predictor to say whether to give a new person a policy or not. You have lots of data on people and whether they made any claims. Unfortunately, the training data (people you have information on) is limited to those to whom you gave policies. So the test distribution (entire popular) differs from the training distribution (people to whom you gave policies). Some guy (can't remember his name right now) actually won a Nobel prize in Economics for a partial solution to this problem, which was termed "covariate shift" (because statisticians call our "inputs" "covariates" and they are changing).

There seems to be a strong desire to specify models which (though see the comment below) can be characterized as "train p(y|x) and test p(y|x) are the same, but train p(x) and test p(x) differ." In other words, the "labeling function" is the same, but the distribution over inputs is different. This is probably the clearest way to differentiate domain adaptation from multitask learning (for the latter, we typically assume p(x) stays the same but p(y|x) changes). I'm not sure that this is really a hugely important distinction. It may be to obtain interesting theoretical results, but my sense is that in the problems I encounter, both p(x) and p(y|x) are changing, but hopefully not by "too much." An interesting point made along these lines by Shai Ben David that I spent a bunch of time thinking about several years ago was that from a theoretical perspective, assuming p(y|x) is the same is a vacuous assumption, because you can always take two radically different p(x) and q(x), add a feature that indicates which (p vs. q) the data point came from, and call this the "global p(x)". In fact, in some sense, this is all you need to do to solve multitask learning, or domain adaptation: just add a feature saying which distribution the input is from, and learn a single model. I've been doing some experiments recently and, while you can do better than this in practice with standard learning models, it's not such a bad approach.

There were several other talks I liked. It seemed that the results (theoretically) were of the form "if p(y|x) is the same and p(x) and p(y) differ only by a 'little' then doing naive things for learning can do nicely." My favorite formalization of p(x) and p(y) differ a little was the Shai Ben David/John Blitzer approach of saying that they differ slightly if there is a single hyperplane that does well (has low error) on both problems. The restriction to hyperplanes is convenient for what they do later, but in general it seems that having a single hypothesis from some class that will do well on both problems is the general sense of what "p(y|x) is the same" is really supposed to mean. I also enjoyed a talk by Alex Smola on essentially learning the differences between the input distributions and using this to your advantage.

In general, my sense was that people are really starting to understand this problem theoretically, but I really didn't see any practical results that convinced me at all. Most practical results (modulo the Ben David/Blitzer, which essentially cites John's old work) were very NIPSish, in the sense that they were on unrealistic datasets (sorry, but it's true). I wholeheartedly acknowledge that its somewhat difficult to get your hands on good data for this problem, but there is data out there. And it's plentiful enough that it should no longer be necessary to make artificial or semi-artificial data for this problem. (After all, if there weren't real data out there, we wouldn't be working on this problem...or at least we shouldn't :P.)

natural language processing blog