natural language processing blog: 2007/11

30 November 2007

Domain adaptation vs. transfer learning

The standard classification setting is a input distribution p(X) and a label distribution p(Y|X). Roughly speaking, domain adaptation (DA) is the problem that occurs when p(X) changes between training and test. Transfer learning (TL) is the problem that occurs when p(Y|X) changes between training and test. In other words, in DA the input distribution changes but the labels remain the same; in TL, the input distributions stays the same, but the labels change. The two problems are clearly quite similar, and indeed you see similar (if not identical) techniques being applied to both problems. (Incidentally, you also see papers that are really solving one of the two problems but claim to be solving the other.)

As a brief aside, we actually need to be a bit more specific about the domain adaptation case. In particular, if p(X) changes then we can always encode any alternative labeling function by "hiding" some extra information in p(X). In other words, under the model that p(X) changes, the assumption that p(Y|X) doesn't change is actually vacuous. (Many people have pointed this out, I think I first heard it from Shai Ben-David a few years ago.) It is because of the assumption that theoretical work in domain adaptation has been required to make stronger assumptions. A reasonable one is the assumption that (within a particular concept class---i.e., space of possible classifiers), there exists one that doesn't do too bad on either the source or the target domain. This is a stronger assumption that the "p(Y|X) doesn't change", but actually enables us to do stuff. (Though, see (*) below for a bit more discussion on this assumption.)

Now, beyond the DA versus TL breakdown, there is a further breakdown: for which sides of the problem do we have labeled or unlabeled data. In DA, the two "sides" are the source domain and the target domain. In TL, the two sides are task 1 (the "source" task) and task 2 (the "target" task). In all cases, we want something that does well on the target. Let's enumerate the four possibilities:

Source labeled, target labeled (S+T+)
Source labeled, target only unlabeled (S+T-)
Source only unlabeled, target labeled (S-T+)
Source only unlabeled, target only unlabeled (S-T-)

We can immediately throw away S-T- because this is basically an unsupervised learning problem.

The typical assumption in TL is S+T+. That is, we have labeled data for both tasks. (Typically, it is actually assumed that we have one data set that is labeled for both problems, but I don't feel that this is a necessary assumption.)

In DA, there are two standard settings: S+T+ (this is essentially the side of DA that I have worked on) and S+T- (this is essentially the side of DA that John Blitzer has worked on).

Now, I think it's fair to say that any of the T- settings are impossible for TL. Since we're assuming that the label function changes and can change roughly arbitrarily, it seems like we just have to have some labeled target data. (I.e., unlike the case in DA where we assume a single good classifier exists, this assumption doesn't make sense in TL.)

This begs the question: in TL and DA, does the S-T+ setting make any sense?

For DA, the S-T+ setting is a bit hard to argue for from a practical perspective. Usually we want to do DA so that we don't have to label (much) target data. However, one could make a semi-supervised sort of argument here. Maybe it's just hard to come by target data, labeled or otherwise. In this case, we'd like to use a bunch of unlabeled source data to help out. (Though I feel that in this case, we're probably reasonably likely to already have some labeled source.) From a more theoretical perspective, I don't really see anything wrong with it. In fact, any DA algorithm that works in the S+T- setting would stand a reasonable chance here.

For TL, I actually think that this setting makes a lot of sense, despite the fact that I can't come up with a single TL paper that does this (of course, I don't follow TL as closely as I follow DA). Why do I think this makes sense? Essentially, the TL assumption basically says that the labeling function can change arbitrarily, but the underlying distribution can't. If this is true, and we have labeled target data, I see no reason why we would need labeled source data. That is, we're assuming that knowing the source label distribution tells us nothing about the target label distribution. Hence, the only information we should really be able to get out of the source side is information about the underlying distribution p(X), since this is the only thing that stays the same.

What this suggests is that if having labeled source data in TL is helpful, then maybe the problem is really more domain adaptation-ish. I've actually heard (eg., at AI-Stats this year) a little muttering about how the two tasks used in TL work are often really quite similar. There's certainly nothing wrong with this, but it seems like if this is indeed true, then we should be willing to make this an explicit assumption in our model. Perhaps not something so severe as in DA (there exists a good classifier on both sides), but something not so strong as independence of labeling distributions. Maybe some assumption on the bound of the KL divergence or some such thing.

How I feel at this point is basically that for DA the interesting cases are S+T+ and S+T- (which are the well studied cases) and for TL the only interesting one is S-T+. This is actually quite surprising, given that similar techniques have been used for both.

(*) I think one exception to this assumption occurs in microarray analysis in computational biology. One of the big problems faced in this arena is that it is very hard to combine data from microarrays taken using different platforms (the platform is essentially the manufacturer of the actual device) or in different experimental conditions. What is typically done in compbio is to do a fairly heavy handed normalization of the data, usually by some complex rank-ordering and binning process. A lot of information is lost in this transformation, but at least puts the different data sets on the same scale and renders them (hopefully) roughly comparable. One can think of not doing the normalization step and instead thinking of this as a DA problem. However, due to the different scales and variances of the gene expression levels, it's not clear that a "single good classifier" exists. (You also have a compounded problem that not all platforms measure exactly the same set of genes, so you get lots of missing data.)

23 November 2007

NIPS 2007 Pre-prints Up (and WhatToSee-d)

NIPS 2007 preprints are online, with a big warning that they're not final versions. Be that as it may, I've indexed them on WhatToSee for your perusal. (More info on WhatToSee)

19 November 2007

Translation out of English

If you look at MT papers published in the *ACL conferences and siblings, I imagine you'll find a cornucopia of results for translating into English (usually, from Chinese or Arabic, though sometimes from German or Spanish or French, if the corpus used is from EU or UN or de-news). The parenthetical options are usually just due to the availability of corpora for those languages. The former two are due to our friend DARPA's interest in translating from Chinese and Arabic into English. There are certainly groups out there who work on translation with other target languages; the ones that come most readily to mind are Microsoft (which wants to translate its product docs from English into a handful of "commercially viable" languages), and a group that works on translation into Hungarian, which seems to be quite a difficult proposition!

Maybe I'm stretching here, but I feel like we have a pretty good handle on translation into English at this point. Our beloved n-gram language models work beautifully on a languages with such a fixed word order and (to a first approximation) no morphology. Between the phrase-based models that have dominated for a few years, and their hierarchical cousins (both with and without WSJ as input), I think we're doing a pretty good job on this task.

I think the state of affairs in translation out of English is much worse off. In particular, I think the state of affairs for translation from a morphologically-poor language to a morphologically-rich language. (Yes, I will concede that, in comparison to English, Chinese is morphologically-poorer, but I think the difference is not particularly substantial.)

Why do I think this is an interesting problem? For one, I think it challenges a handful of preconceived notions about translation. For instance, my impression is that while language modeling is pretty darn good in English, it's pretty darn bad in languages with complex morphology. There was a JHU workshop in 2002 on speech recognition for Arabic, a large component of which was on language modeling. From their final report, "All these morphology-based language models yielded slight but consistent reductions in word error rate when combined with standard word-based language models." I don't want to belittle that work---it was fantastic. But one would hope for more than slight reduction, given how badly word based ngram models work in Arabic.

Second, one of my biggest pet-peeves about MT (well, at least, why I think MT is easier than most people usually think of it) is that interpretation doesn't seem to be a key component. That is, the goal of MT is just to map from one language to another. It is still up to the human reading the (translated) document to do interpretation. One place where this (partially) falls down is when there is less information in the source language than you need to produce grammatical sentences in the target language. This is not a morphological issue per se (for instance, in the degree to which context plays a role for interpretation in Japanese is significantly higher than in English---directly translated sentences would often not be interpretable in English), but really an issue of information-poor to information-rich translation. It just so happens that a lot of this information is often marked in morphology, which languages like English lack.

That said, there is at least one good reason why one might not work on translation out of English. For me, at least, I don't speak another language well enough to really figure out what's going on in translations (i.e., I would be bad at error analysis). The language other than English that I speak best is Japanese. But in Japanese I could probably only catch gross translation errors, nothing particularly subtle. Moreover, Japanese is not what I would call a morphologically rich language. I would imagine Japanese to English might actually be harder than the other way, due to the huge amount of dropping (pro-drop and then some) that goes on in Japanese.

If I spoke Arabic, I think English to Arabic translation would be quite interesting. Not only do we have huge amounts of data (just flip all our "Arabic to English" data :P), but Arabic has complex, but well-studied morphology (even in the NLP literature). As cited above, there's been some progress in language modeling for Arabic, but I think it's far from solved. Finally, one big advantage of going out of English is that, if we wanted, we have a ton of very good tools we could throw at the source language: parsers, POS taggers, NE recognition, coreference systems, etc. Such things might be important in generating, eg., gender and number morphemes. But alas, my Arabic is not quite up to par.

(p.s., I recognize that there's no reason English even has to be one of the languages; it's just that most of our parallel data includes English and it's a very widely spoken language and so it seems at least not unnatural to include it. Moreover, from the perspective of "information poor", it's pretty close to the top!)

15 November 2007

NetFlix "solved" (the small version)

See the post on Hunch. Maybe that last 1.5% might benefit not from fancy ML, but from fancy (or even stupid) NLP. "But what data do we have to run the NLP on," my friends may ask. How about stuff like this, or if you're adventurous like this? (Had I enough time, I might give it a whirl, but alas...)

p.s., If any of the above links encourage copyright violations, then I'm not actually advocating their use.

12 November 2007

Understanding Model Predictions

One significant (IMO) issue that we face when attempting to do some sort of feature engineering is trying to understand not only what sorts of errors our model is making, but why. This is an area in which pattern-based methods seem to have a leg up on classification-based methods. If an error occurs, we know exactly what pattern fired to yield that error, and we can look back at where the pattern came from and (perhaps) see what went wrong.

I've been trying to think for a while what the best way to do this in a classification style system is, especially when the output is structured. I think I have a way, but in retrospect it seems so obvious that I feel someone must have tried it before. Note that unlike some past blog posts, I haven't actually tried doing this yet, but I think maybe the idea is promising.

Suppose we learn a system to solve some task, and we have some held out dev data on which we want to do, say, feature engineering. What we'd like to do is (a) find common varieties of errors and (b) figure out why these errors are common. I'm going to assume that (a) is solved... for instance in tagging problems you could look at confusion matrices (though note that in something like MT or summarization, where the structure of the output space changes based on past decisions, this may be nontrivial).

Let's say we're doing POS tagging and we've observed some error class that we'd like to understand better, so we can add new features that will fix it. One way of thinking about this is that our classifier predicted X in some context where it should have produced Y. The context, of course, is described by a feature set. So what we want to do, essentially, is look back over the training data for similar contexts, but where the correct answer was X (what our current model predicted). In the case of linear classifiers, it seems something as simple as cosine difference over feature vectors may be a sufficiently good metric to use.