Someone (I can't remember who anymore, even though it was just a couple of days ago!) pointed out to me that Google Scholar seems to be doing weird things with citations. In particular, it seems to think that the citation relation is symmetric (at least in some cases).
Here's an easy example. Look up Khalid El-Arini's 2009 paper "Turning down the noise in the blogosphere" on Google Scholar (or just follow that link). Apparently it's been cited by 24 papers. Let's look at who cites it. Apparently the inventor of LDA, back in 2003, in addition to inventing LDA, also invented a time machine so that he could cite Khalid's paper!
The weird thing is that this doesn't seem to be a systematic error! It only happens sometimes.
Oh well, I won't complain -- it just makes my H index look better :)
Seeding, transduction, out-of-sample error and the Microsoft approach...
My former master's student Adam Teichert (now at JHU) did some work on inducing part-of-speech taggers using typological information. We wanted to compare the usefulness of small amounts of typological (linguistic) information with small amounts of lexical information in the form of seeds. (Other papers give seeds different names, like initial dictionaries or prototypes or whatever... it's all the same basic idea.)
The basic result was that if you don't use seeds, then typological information can help a lot. If you do use seeds, then your baseline performance jumps from around 5% to about 40%, and using typological information on top of that isn't really that beneficial.
This was a bit frustrating, and led us to think more about the problem. The way we got seeds was to look at the Wikipedia page about Portuguese (for instance) and use its example list of words for each tag. Another popular way is to use labeled data and extract the few most frequent words for each part-of-speech tag. The two lists aren't identical, but there is definitely quite a bit of overlap between the words that Wikipedia lists as examples of determiners and the most frequent determiners (this overlap is especially strong for closed-class words).
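To make the "most frequent words per tag" recipe concrete, here's a minimal sketch in Python (the function name extract_seeds and the toy corpus are mine, for illustration only; this is not the exact procedure from Adam's experiments):

```python
# Minimal sketch: extract seeds as the k most frequent words per tag
# from a small labeled corpus of (word, tag) pairs.
from collections import Counter, defaultdict

def extract_seeds(tagged_tokens, k=3):
    """Return {tag: [the k most frequent words observed with that tag]}."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[tag][word.lower()] += 1
    return {tag: [w for w, _ in c.most_common(k)] for tag, c in counts.items()}

# Toy labeled data; a real setup would use a treebank (or Wikipedia's examples).
corpus = [("the", "DET"), ("dog", "NOUN"), ("saw", "VERB"),
          ("a", "DET"), ("cat", "NOUN"), ("the", "DET"),
          ("cat", "NOUN"), ("ran", "VERB")]
print(extract_seeds(corpus, k=2))
# {'DET': ['the', 'a'], 'NOUN': ['cat', 'dog'], 'VERB': ['saw', 'ran']}
```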
In terms of end performance, there are two reasons seeds can help. The first, which is the interesting case, is that knowing that "the" is a determiner helps you find other determiners (like "a") and perhaps also nouns (for instance, by knowing that determiners often precede nouns in Portuguese). The second, which is the uninteresting case, is that every time you see one of your seeds, you pretty much always get it right. In other words, just by specifying seeds, especially by frequency (or approximately by frequency, à la Wikipedia), you're basically ensuring that you get around 90% accuracy (the remainder being due to ambiguity) on some large fraction of the corpus (again, especially for closed-class words, which have short tails).
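Here's a rough illustration of that second, uninteresting effect, under made-up assumptions (a hypothetical free_accuracy helper, a hand-built seed dictionary, and a toy corpus): tag every seed token with its single seed tag and measure how much of the corpus you cover, and how accurate you are on just those tokens.

```python
# Minimal sketch (hypothetical numbers): how much accuracy a seed list buys
# "for free" by always assigning a seed word its seed tag.
def free_accuracy(tagged_tokens, seed_tag):
    """seed_tag maps word -> the single tag we'd always assign it."""
    covered = correct = 0
    for word, gold in tagged_tokens:
        if word.lower() in seed_tag:
            covered += 1
            correct += (seed_tag[word.lower()] == gold)
    n = len(tagged_tokens)
    return covered / n, correct / max(covered, 1)

seed_tag = {"the": "DET", "a": "DET", "of": "ADP", "and": "CONJ"}
corpus = [("the", "DET"), ("sound", "NOUN"), ("of", "ADP"),
          ("music", "NOUN"), ("and", "CONJ"), ("the", "DET"),
          ("hills", "NOUN"), ("a", "DET"), ("lot", "NOUN")]
coverage, acc_on_seeds = free_accuracy(corpus, seed_tag)
print(f"seed coverage: {coverage:.0%}, accuracy on seed tokens: {acc_on_seeds:.0%}")
# With frequent closed-class seeds, coverage is a large chunk of the corpus and
# accuracy on those tokens is near-perfect, which inflates overall numbers.
```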
This phenomenon is mentioned in the text (but not the tables :P) of, for instance, Haghighi & Klein's 2006 NAACL paper on prototype-driven POS tagging, wherein they say: "Adding prototypes ... gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over [a baseline system with no prototypes]." Their improved system remedies this and achieves better accuracy on non-prototypes as well as prototypes (aka seeds).
This is very similar to the idea of transductive learning in machine learning land. Transduction is an alternative to semi-supervised learning. The setting is that you get a bunch of data, some of which is labeled and some of which is unlabeled. Your goal is simply to label the unlabeled data. You need not "induce" the labeling function (though many approaches do, in passing).
The interesting thing is that learning with seeds is very similar to transductive learning, though perhaps with a somewhat stronger assumption of noise on the "labeled" part. The irony is that in machine learning land, you would never report "combined training and test accuracy" -- that would be ridiculous. Yet this is what we seem to like to do in NLP land. This is itself related to an old idea in machine learning wherein you rate yourself only on test examples that you didn't see at training time. This is your out-of-sample error, and it is obviously much harder than your standard generalization error. (The famous no-free-lunch theorems come from an out-of-sample analysis.) The funny thing about out-of-sample error is that sometimes you'd prefer not to get more training examples: once you've seen an example at training time, you know you won't be tested on it, so if you were getting it right already, seeing it just hurts you. (Perhaps you should be allowed to see x and say "no, I don't want to see its label y"?)
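As a sketch of this evaluation point (with a made-up tagger output and a hypothetical split_accuracy helper, not anyone's actual system), you can report accuracy both over all tokens and over non-seed tokens only; the second number is the one that behaves like an out-of-sample error.

```python
# Minimal sketch (hypothetical tagger output): report accuracy on all tokens
# and on non-seed tokens only -- the latter is closer to out-of-sample error.
def split_accuracy(gold, predicted, seed_words):
    seen = [(g, p) for (w, g), p in zip(gold, predicted) if w.lower() in seed_words]
    unseen = [(g, p) for (w, g), p in zip(gold, predicted) if w.lower() not in seed_words]
    acc = lambda pairs: sum(g == p for g, p in pairs) / max(len(pairs), 1)
    return acc(seen + unseen), acc(unseen)

gold = [("the", "DET"), ("dogs", "NOUN"), ("bark", "VERB"), ("a", "DET"), ("lot", "NOUN")]
pred = ["DET", "NOUN", "NOUN", "DET", "ADV"]   # made-up tagger mistakes
overall, non_seed = split_accuracy(gold, pred, {"the", "a"})
print(f"all tokens: {overall:.0%}, non-seed tokens only: {non_seed:.0%}")
# all tokens: 60%, non-seed tokens only: 33%
```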
I think the key question is: what are we trying to do? If we're trying to build good taggers (i.e., we're engineers), then overall accuracy is what we care about, and including "seed" performance in our evaluations makes sense. But when we're talking about 45% tagging accuracy (like Adam and I were), this is a pretty pathetic claim. If instead we're trying to understand learning algorithms and study their performance on real data (i.e., we're scientists), then accuracy on non-seeds is perhaps more interesting. (Please don't jump on me for the engineer/scientist distinction: it's obviously much more subtle than this.)
This also reminds me of something Eric Brill said to me when I was working with him as a summer intern in MLAS at Microsoft (back when MLAS existed and back when Eric was in MLAS...). We were working on web search stuff. His comment was that he really didn't care about doing well on the 1000 most frequent queries. Microsoft could always hire a couple of annotators to manually do a good job on those queries. And in fact, this is what is often done. What we care about is the heavy tail, where there are too many somewhat-common things to have humans annotate them all. This is precisely the situation here. I can easily get 1000 seeds for a new language. Do I actually care how well I do on those, or do I care how well I do on the other 20,000+ things?