06 April 2011

Seeding, transduction, out-of-sample error and the Microsoft approach...

My former master's student Adam Teichert (now at JHU) did some work on inducing part-of-speech taggers using typological information.  We wanted to compare the usefulness of using small amounts of linguistic information with small amounts of lexical information in the form of seeds.  (Other papers give seeds different names, like initial dictionaries or prototypes or whatever... it's all the same basic idea.)

The basic result was that if you don't use seeds, then typological information can help a lot.  If you do use seeds, then your baseline performance jumps from around 5% to about 40%, and using typological information on top of that isn't really all that beneficial.

This was a bit frustrating, and it led us to think more about the problem.  The way we got seeds was to look at the Wikipedia page about Portuguese (for instance) and use its example list of words for each tag.  An alternative, popular way is to use labeled data and extract the few most frequent words for each part-of-speech type.  The two lists aren't identical, but there is definitely quite a bit of overlap between the words Wikipedia lists as examples of determiners and the most frequent determiners (the overlap is especially strong for closed-class words).

In terms of end performance, there are two reasons seeds can help.  The first, which is the interesting case, is that knowing that "the" is a determiner helps you find other determiners (like "a") and perhaps also nouns (for instance, by knowing that determiners often precede nouns in Portuguese).  The second, which is the uninteresting case, is that now every time you see one of your seeds, you pretty much always get it right.  In other words, just by specifying seeds, especially by frequency (or approximately by frequency, à la Wikipedia), you're basically ensuring that you get roughly 90% accuracy (the rest lost to ambiguity) on some large fraction of the corpus (again, especially for closed-class words, which have short tails).
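To make that second, uninteresting effect concrete, here's a minimal sketch (the toy corpus, seed list, and function names are all invented for illustration) of how much of a corpus gets tagged "for free" just by labeling every seed occurrence with its seed tag:

```python
# Toy illustration of the "free accuracy" from seeds: tag every occurrence of
# a seed word with its seed tag and see how much of the corpus that covers.
# The corpus format and the seed dictionary here are made up for illustration.

def seed_coverage(corpus, seeds):
    """corpus: list of (word, gold_tag) pairs; seeds: dict word -> seed tag."""
    covered = [(w, t) for w, t in corpus if w in seeds]
    correct = sum(1 for w, t in covered if seeds[w] == t)
    return len(covered) / len(corpus), correct / max(len(covered), 1)

corpus = [("the", "DET"), ("dog", "NOUN"), ("bit", "VERB"), ("the", "DET"),
          ("man", "NOUN"), ("that", "DET"), ("that", "PRON"), ("ran", "VERB")]
seeds = {"the": "DET", "that": "DET"}   # frequency-style, closed-class seeds

frac, acc = seed_coverage(corpus, seeds)
print(f"fraction of tokens covered by seeds: {frac:.0%}")   # 50%
print(f"accuracy on covered tokens:          {acc:.0%}")    # 75% (one ambiguous "that")
```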

This phenomenon is mentioned in the text (but not the tables :P) of, for instance, Haghighi & Klein's 2006 NAACL paper on prototype-driven POS tagging, wherein they say: "Adding prototypes ... gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over [a baseline system with no prototypes]."  Their improved system remedies this and achieves better accuracy on non-prototypes as well as prototypes (aka seeds).
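Here's a hedged sketch of the corresponding evaluation split (the function and variable names are mine, not the paper's): report accuracy over all tokens and, separately, over tokens whose word type is not in the seed/prototype list.

```python
# Sketch of the evaluation split behind the quote above: overall accuracy
# versus accuracy restricted to non-seed (non-prototype) tokens.
# Names are hypothetical, not taken from Haghighi & Klein.

def split_accuracy(words, gold, pred, seed_words):
    correct = [g == p for g, p in zip(gold, pred)]
    overall = sum(correct) / len(correct)
    non_seed = [c for w, c in zip(words, correct) if w not in seed_words]
    non_seed_acc = sum(non_seed) / max(len(non_seed), 1)
    return overall, non_seed_acc
```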

This is very similar to the idea of transductive learning in machine learning land.  Transduction is an alternative to semi-supervised learning.  The setting is that you get a bunch of data, some of which is labeled and some of which is unlabeled.  Your goal is simply to label the unlabeled data.  You need not "induce" the labeling function (though many approaches do, in passing).
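For concreteness, here's a tiny (and entirely toy) sketch of the transductive setting, using scikit-learn's LabelSpreading as just one convenient instantiation: unlabeled points are marked with -1, and the output of interest is the labeling of exactly those points, not a classifier for future data.

```python
# Toy transductive setup: we only want labels for the unlabeled points we
# already have.  LabelSpreading is one concrete way to do this; the data here
# is a made-up one-dimensional example.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, -1, -1, 1, -1, -1])   # -1 marks the unlabeled points

model = LabelSpreading(kernel="rbf", gamma=20).fit(X, y)
print(model.transduction_)   # predicted labels for these very points
```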

The interesting thing is that learning with seeds is very similar to transductive learning, though perhaps with a somewhat stronger assumption of noise on the "labeled" part.  The irony is that in machine learning land, you would never report "combined training and test accuracy" -- this would be ridiculous.  Yet this is what we seem to like to do in NLP land.  This is itself related to an old idea in machine learning wherein you rate yourself only on test examples that you didn't see at training time.  This is your out-of-sample error, and it is obviously much harder than your standard generalization error.  (The famous no-free-lunch theorems come from an out-of-sample analysis.)  The funny thing about out-of-sample error is that sometimes you prefer not to get more training examples, because you then know you won't be tested on them!  If you were getting them right already, more training examples just hurt you.  (Perhaps you should be allowed to see x and say "no, I don't want to see y"?)
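A minimal sketch of the out-of-sample flavor of evaluation (all names invented): only score test items whose input never appeared in the training set, which is the analogue of reporting accuracy on non-seed tokens.

```python
# Sketch of out-of-sample evaluation: ignore test items whose input was
# already seen at training time.  Names and data format are made up.

def out_of_sample_accuracy(train_inputs, test_pairs, predict):
    """test_pairs: list of (x, y); predict: a function mapping x to a label."""
    seen = set(train_inputs)
    unseen = [(x, y) for x, y in test_pairs if x not in seen]
    if not unseen:
        return None   # nothing genuinely out-of-sample to score
    return sum(predict(x) == y for x, y in unseen) / len(unseen)
```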

I think the key question is: what are we trying to do?  If we're trying to build good taggers (i.e., we're engineers), then overall accuracy is what we care about, and including "seed" performance in our evaluations makes sense.  But when we're talking about 45% tagging accuracy (like Adam and I were), that's a pretty pathetic claim.  If we're trying to understand learning algorithms and study their performance on real data (i.e., we're scientists), then accuracy on non-seeds is perhaps more interesting.  (Please don't jump on me for the engineer/scientist distinction: it's obviously much more subtle than this.)

This also reminds me of something Eric Brill said to me when I was working with him as a summer intern in MLAS at Microsoft (back when MLAS existed and back when Eric was in MLAS....).  We were working on web search stuff.  His comment was that he really didn't care about doing well on the 1000 most frequent queries.  Microsoft could always hire a couple of annotators to manually do a good job on those queries.  And in fact, this is what is often done.  What we care about is the heavy tail, where there are too many somewhat common things to have humans annotate them all.  It's precisely the same situation here.  I can easily get 1000 seeds for a new language.  Do I actually care how well I do on those, or do I care how well I do on the other 20000+ things?

2 comments:

Bob Carpenter said...

I would also add that statisticians often only evaluate on the training data. They call it "data analysis".

I agree that held-out results can be misleading because of the dominance of high-frequency items, be they words, phrases, dependencies, or discourse structures. "Real" language isn't stationary, either, so that nice 1990s training data we all keep evaluating on is getting staler and staler.

As the data gets more complex, it gets harder to tell what's in the training data and what's not. For instance, if I evaluate held-out language modeling performance for LDA, it's not over a new set of tokens. If I evaluate a collaborative filtering engine, it's usually not over completely new users and items. So I think machine learning people do a bit of "evaluating on the training data", at least in a near neighbor kind of way. Otherwise, how could we generalize or get anywhere at all?

hal said...

@Bob: Yeah, I pretty much agree. We like to think our data is i.i.d., but of course it never is "in the real world," if only because times change. Jimmy Lin made this point very well a few days ago in a talk about some of the work he's been doing at Twitter. They were tracking performance and noticed a large jump in performance at a point in time that happened to be essentially exactly when the recent earthquake hit Japan. Trying to tease apart all these variables and actually ascertain whether your system is doing better is hard. (Though you can always do the thing where different users get different versions of the system.)