natural language processing blog: coreference

04 November 2008

Using machine learning to answer scientific questions

I tend to prefer research (and the papers that result from it) that attempt to answer some sort of question about natural language. And "can I build a system that beats your system" does not count as a question. (Of course, I don't always obey this model myself.) For instance, the original IBM MT papers were essentially trying to answer the question: can we build a model of translation directly from parallel data? Later work in phrase-based and syntax-based translation basically started out by asking the question: is syntactic structure (or local lexical structure) a useful abstraction for enabling translation? It seems the answer to all these questions is "yes."

A less broad type of question that one can ask might be categorized as: is some type of feature relevant for some task. For instance, are lexical semantics features derived via a hand-crafted resource (such as WordNet) useful for the task of coreference resolution? What about lexical semantics features derived automatically from the web? This type of question is the sort of thing that I looked at in an EMNLP 2005 paper. The answers seemed to be "no" and "yes" respectively.

There are several issues with trying to answer such questions. One issue is that typically what you're actually looking at is the question: can I figure out a way to turn WordNet into a feature vector that is useful for the task of coreference resolution? This is probably a partial explanation for why our community seems not to like negative results to such questions: maybe you just weren't sufficiently clever in encoding WordNet as features. Or maybe WordNet features are useful but there is some other set of features that's more useful that's just swamping the benefits of WordNet.

My first question is whether we're even going about this the right way. The usual approach is to take some baseline system, add WordNet features, and see if predictive performance goes up (i.e., performance on test data). This seems like a bit of a round-about way to attack the problem. After all, this problem of "does some feature have an influence on the target concept" is a classical and very well studied area of statistics: most people have probably seen ANOVA (analysis of variance), but there are many many more ways to try to address this question. And, importantly, they don't hinge on the notions of predictive performance. (Which almost immediately ties us in to "my system is better than yours.")

Now, I'm not claiming that ANOVA (or some variant thereof) is the right thing to do. Our problems are often quite different from those encountered in classical statistics. We have millions of covariates (features) and often complex structured outputs. I just wonder if there's some hint of a good idea in those decades of statistical research.

The other way to go about looking at this question, which at the time I wrote the EMNLP paper I thought was pretty good, is the following. The question is: why do I care if WordNet features are useful for prediction. Very rarely will they actually hurt performance, so who cares if they don't help? (Besides those people who actually want to know something about how language works.... though in this case perhaps it tells us more about how WordNet works.) The perspective we took was that someone out there is going to implement a coreference system from scratch (perhaps for a new language, perhaps for a new domain) and they want to know what features they should spend time writing and which ones they can ignore. This now is a question about predictive performance, and we did a form of backwards feature selection to try to answer it. If you look at Figure 2 in the paper, basically what you see is that you'd better implement string-match features, lexical features, count-based features, and knowledge-based features. On top of this, you can get another half point with inference features and another half again using gazetteers. But beyond that, you're probably wasting your time. (At the time of the EMNLP paper, one of this big "sells" was the count-based features, which are impossible in the standard pairwise-decision models that most people apply to this problem. I still think this is an interesting result.)

At any rate, I still think this ("what should I spend time implementing") is a reasonable way to look at the problem. But if you're moving to a new domain or new language, most likely all of your time/money is going to be spent annotating data, not implementing features. So maybe you should just go and implement everything you can think of anyway. And besides, if you want to make an argument about moving to different languages or different domains or even different amounts of training data, you should make those arguments specifically and do the experiments to back them up. For instance, how does Figure 2 in the EMNLP paper change when you have half the data, or a tenth the data? Maybe the lexical features don't matter as much and perhaps here WordNet might kick in?

natural language processing blog

04 November 2008

Using machine learning to answer scientific questions

About Me

Labels

My Blog List

Blog Archive