
07 April 2010

When Maximum Likelihood Doesn't Make Sense

One of the most fun AHA moments in ML or stats is when you see that for multinomial distributions, your naive idea of relative frequency corresponds (through a bunch of calculus) to maximum likelihood estimation. Ditto for means of Gaussians, though that's never quite as amazing because Gaussians seem like sort of odd beasts anyway. It's nice when our intuitions match what stats tells us.

I'll often watch a DVD, and then leave it in the DVD player until I want to watch something else. Usually it will return to the menu screen and the timer display will go through 0:00, 0:01, 0:02, ..., until it hits the length of the title screen loop, and then go back to 0:00. What I always wonder when I glance up at it is "what is the actual length of the title screen loop?" That is, what is the highest value it'll count up to?

Being a good statistician, I set up a model. First I play the frequentist game. Since I glance up at pretty arbitrary times, my observations can be modeled as draws from a uniform distribution on 0 up to some upper bound U, which is the quantity I want to infer.

Supposing that I only make one observation, say 0:27, I want a maximum likelihood estimate of U. It's easy to see that U=27 is the maximum likelihood solution. Anything less than 27 would render my observation impossible. For any U>=27, my observation has probability 1/(U+1), so the ML solution is precisely U=27. Note that even if I make five observations a <= b <= c <= d <= e, the ML solution is still U=e. Huh. The math works. The model is as correct as I could really imagine it. But this answer doesn't really seem reasonable to me. Somehow I expect the true loop length to be at least a bit bigger than anything I've observed.
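
To spell the calculation out: if I make n observations x_1, ..., x_n, each uniform on {0, 1, ..., U}, the likelihood is

    L(U) = \prod_{i=1}^{n} p(x_i | U) = (U+1)^{-n}   if U >= \max_i x_i   (and 0 otherwise),

which is strictly decreasing in U wherever it is nonzero, so it is maximized exactly at U = \max_i x_i.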

Perhaps I need a prior. That'll solve it. I'll add a prior p(U) and then do maximum a posteriori. Well, it's easy to see that if p(U) is unimodal, and its mode is less than my (highest) observation, then the MAP solution will still be the (highest) observation. If its mode is more than my (highest) observation, then the MAP solution gets pulled up toward the mode of p(U) instead. If I think about this a bit, it's hard for me to justify a multi-modal p(U), and also hard for me to be happy with a system in which my prior essentially completely determines my solution.

Or I could be fully Bayesian and look at a posterior distribution p(U | data). This will just be a left-truncated version of my prior, again, not very satisfying.

(Note that the "Maximum Likelihood Sets" idea, which I also like quite a bit, doesn't fix this problem either.)

It also really bugs me that only my largest observation really affects the answer. If I get one hundred observations, most of them centered around 0:10, and then one at 0:30, then I'd guess that 0:30 or 0:31 or 0:32 is probably the right answer. But if I observe a bunch of stuff spread roughly uniformly between 0:00 and 0:30, then I'd be more willing to believe something larger.

I don't really have a solution to this dilemma. Perhaps you could argue that my difficulties arise from the use of a Uniform distribution -- namely, that were I to use another distribution, these problems wouldn't really arise. I don't think this is quite true, as described below:

We actually run into this problem in NLP fairly often. I observe a billion documents and see, in these 1b documents, 100k unique words, many of which are singletons. But I'm not willing to believe that when I look at document 1,000,000,001, there won't be a new unseen word. Sure, it's less likely that I'll see a new unique word than in document 1001, where it is almost guaranteed, but there's still a relatively large probability that some new word will show up in that next document.
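
As an aside, one standard way to put a number on "the chance of something new" (not something the argument above relies on, just an illustration) is the Good-Turing missing-mass estimate: the probability that the next token is a previously unseen word is roughly the fraction of tokens whose word occurred exactly once. A minimal Python sketch, with a made-up toy corpus standing in for the billion documents:

    from collections import Counter

    # Toy corpus; in the setting above this would be all tokens in the
    # first billion documents.  The numbers here are invented.
    tokens = "a a a b b c d d e f".split()

    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_singletons = sum(1 for c in counts.values() if c == 1)

    # Good-Turing estimate of the probability mass on unseen words:
    # (number of word types seen exactly once) / (total number of tokens).
    p_unseen = n_singletons / n_tokens
    print(p_unseen)   # 0.3 for this toy corpus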

The whole point is that somehow we expect random samples to be representative of the underlying distribution. We don't expect them to define the underlying distribution. I don't actually expect to ever see the corner cases, unless I've seen everything else.


UPDATE: The Bayesian approach actually does do something reasonable. Here is some Matlab code for computing posteriors with a uniform prior, and results with an upper bound of 100 and various observations are plotted below:
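
A rough Python/numpy equivalent of that computation (not the original Matlab; the example observation sets below are invented for illustration, and the prior is uniform on 0..100 as above):

    import numpy as np

    def posterior(obs, max_u=100):
        """Posterior over the loop length U given glances at the timer.
        Prior: uniform on {0, ..., max_u}.  Likelihood: each glance is
        uniform on {0, ..., U}, i.e. has probability 1/(U+1) when U >= obs."""
        U = np.arange(max_u + 1)
        post = np.where(U >= max(obs), (1.0 / (U + 1.0)) ** len(obs), 0.0)
        return post / post.sum()

    # Invented observation sets, echoing the examples discussed above:
    for obs in ([27], [10, 8, 12, 11, 30], [3, 9, 16, 22, 30]):
        p = posterior(obs)
        mean_u = (np.arange(len(p)) * p).sum()
        print(obs, "posterior mean of U: %.1f" % mean_u)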


04 November 2008

Using machine learning to answer scientific questions

I tend to prefer research (and the papers that result from it) that attempt to answer some sort of question about natural language. And "can I build a system that beats your system" does not count as a question. (Of course, I don't always obey this model myself.) For instance, the original IBM MT papers were essentially trying to answer the question: can we build a model of translation directly from parallel data? Later work in phrase-based and syntax-based translation basically started out by asking the question: is syntactic structure (or local lexical structure) a useful abstraction for enabling translation? It seems the answer to all these questions is "yes."

A less broad type of question that one can ask might be categorized as: is some type of feature relevant for some task? For instance, are lexical semantics features derived via a hand-crafted resource (such as WordNet) useful for the task of coreference resolution? What about lexical semantics features derived automatically from the web? This type of question is the sort of thing that I looked at in an EMNLP 2005 paper. The answers seemed to be "no" and "yes," respectively.

There are several issues with trying to answer such questions. One issue is that typically what you're actually looking at is the question: can I figure out a way to turn WordNet into a feature vector that is useful for the task of coreference resolution? This is probably a partial explanation for why our community seems not to like negative results to such questions: maybe you just weren't sufficiently clever in encoding WordNet as features. Or maybe WordNet features are useful, but some other, more useful set of features is swamping their benefit.

My first question is whether we're even going about this the right way. The usual approach is to take some baseline system, add WordNet features, and see if predictive performance goes up (i.e., performance on test data). This seems like a bit of a roundabout way to attack the problem. After all, the problem of "does some feature have an influence on the target concept" is a classical and very well-studied area of statistics: most people have probably seen ANOVA (analysis of variance), but there are many, many more ways to try to address this question. And, importantly, they don't hinge on notions of predictive performance (which almost immediately ties us back into "my system is better than yours").

Now, I'm not claiming that ANOVA (or some variant thereof) is the right thing to do. Our problems are often quite different from those encountered in classical statistics. We have millions of covariates (features) and often complex structured outputs. I just wonder if there's some hint of a good idea in those decades of statistical research.
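
For concreteness, here is one classical-flavored option (a sketch only, not something proposed above, and run on synthetic data): a nested-model likelihood-ratio test that asks whether a block of extra features significantly improves the fit of a logistic regression, with no appeal to held-out predictive performance.

    import numpy as np
    from scipy.stats import chi2
    from sklearn.linear_model import LogisticRegression

    def loglik(model, X, y):
        # Log-likelihood of the labels under a fitted logistic regression.
        p = model.predict_proba(X)
        return np.log(p[np.arange(len(y)), y]).sum()

    # Synthetic stand-ins: X_base are "baseline" features, X_extra plays the
    # role of a candidate feature block (e.g. WordNet-derived features).
    rng = np.random.RandomState(0)
    n = 500
    X_base = rng.randn(n, 10)
    X_extra = rng.randn(n, 5)
    y = (X_base[:, 0] + 0.5 * rng.randn(n) > 0).astype(int)

    X_full = np.hstack([X_base, X_extra])
    m_base = LogisticRegression(C=1e6, max_iter=1000).fit(X_base, y)  # ~unregularized
    m_full = LogisticRegression(C=1e6, max_iter=1000).fit(X_full, y)

    # Twice the gain in log-likelihood is asymptotically chi-squared with
    # degrees of freedom equal to the number of added parameters.
    lr_stat = 2 * (loglik(m_full, X_full, y) - loglik(m_base, X_base, y))
    p_value = chi2.sf(lr_stat, df=X_extra.shape[1])
    print("LR statistic %.2f, p = %.3f" % (lr_stat, p_value))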

The other way to go about looking at this question, which at the time I wrote the EMNLP paper I thought was pretty good, is the following. The question is: why do I care whether WordNet features are useful for prediction? Very rarely will they actually hurt performance, so who cares if they don't help? (Besides those people who actually want to know something about how language works... though in this case perhaps it tells us more about how WordNet works.) The perspective we took was that someone out there is going to implement a coreference system from scratch (perhaps for a new language, perhaps for a new domain) and they want to know what features they should spend time writing and which ones they can ignore. This is now a question about predictive performance, and we did a form of backwards feature selection to try to answer it. If you look at Figure 2 in the paper, basically what you see is that you'd better implement string-match features, lexical features, count-based features, and knowledge-based features. On top of this, you can get another half point with inference features and another half again using gazetteers. But beyond that, you're probably wasting your time. (At the time of the EMNLP paper, one of the big "sells" was the count-based features, which are impossible in the standard pairwise-decision models that most people apply to this problem. I still think this is an interesting result.)
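
As a rough illustration of the "what should I spend time implementing" view (this is not the paper's actual setup; the feature groups and data below are synthetic stand-ins), backwards selection over feature groups can be sketched like this: repeatedly drop the group whose removal hurts cross-validated performance the least, and watch how far performance falls.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-ins for feature groups; real groups would be things
    # like string-match, lexical, count-based, or knowledge-based features.
    rng = np.random.RandomState(0)
    n = 400
    groups = {
        "string-match": rng.randn(n, 3),
        "lexical": rng.randn(n, 5),
        "count-based": rng.randn(n, 2),
    }
    y = (groups["string-match"][:, 0] + 0.3 * rng.randn(n) > 0).astype(int)

    def score(active):
        # Cross-validated accuracy using only the active feature groups.
        X = np.hstack([groups[g] for g in active])
        return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

    active = list(groups)
    while len(active) > 1:
        # Drop the group whose removal hurts accuracy the least.
        candidates = {g: score([h for h in active if h != g]) for g in active}
        dropped = max(candidates, key=candidates.get)
        print("dropping %-12s -> accuracy %.3f" % (dropped, candidates[dropped]))
        active.remove(dropped)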

At any rate, I still think this ("what should I spend time implementing") is a reasonable way to look at the problem. But if you're moving to a new domain or new language, most likely all of your time/money is going to be spent annotating data, not implementing features. So maybe you should just go and implement everything you can think of anyway. And besides, if you want to make an argument about moving to different languages or different domains or even different amounts of training data, you should make those arguments specifically and do the experiments to back them up. For instance, how does Figure 2 in the EMNLP paper change when you have half the data, or a tenth the data? Maybe the lexical features don't matter as much and perhaps here WordNet might kick in?

29 May 2008

Measures of correlation

As NLPers, we're often tasked with measuring the similarity of things, where the things are usually words or phrases or something like that. The most standard example is measuring collocations. Namely, for every pair of words, do they co-locate (appear next to each other) more often than they "should"? There are lots of statistics that are used to measure collocation, but I'm not aware of any in-depth comparison of these. If anyone actually still reads this blog over the summer and would care to comment, it would be appreciated!

The most straightforward measure, in my mind, is mutual information. If we have two words "X" and "Y", then the mutual information is: MI(X;Y) = \sum_x \sum_y p(X=x,Y=y) \log [p(X=x,Y=y) / (p(X=x)p(Y=y))] (sorry, my LaTeX plugin seems to have died). In the case of collocation, the values x and y can take are usually just "does occur" and "does not occur." Here, we're basically asking ourselves: do X and Y occur together more often than chance would predict? I.e., how far do they deviate from statistical independence? The mutual information statistic gives, for every X/Y pair (every pair of words), a score. We can then sort all word pairs by this score and find the strongest collocations.
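
A minimal sketch of this computation in Python, treating X and Y as binary occurs/does-not-occur indicators over, say, sentences or fixed-width windows (the co-occurrence counts below are invented):

    import math

    def mutual_information(n11, n10, n01, n00):
        # n11: windows with both words; n10: only X; n01: only Y; n00: neither.
        N = n11 + n10 + n01 + n00
        px = [(n01 + n00) / N, (n11 + n10) / N]   # p(X=0), p(X=1)
        py = [(n10 + n00) / N, (n11 + n01) / N]   # p(Y=0), p(Y=1)
        pxy = {(1, 1): n11 / N, (1, 0): n10 / N, (0, 1): n01 / N, (0, 0): n00 / N}
        total = 0.0
        for x in (0, 1):
            for y in (0, 1):
                if pxy[(x, y)] > 0:
                    total += pxy[(x, y)] * math.log(pxy[(x, y)] / (px[x] * py[y]))
        return total

    # Invented counts out of 10,000 windows; higher score = stronger association.
    print(mutual_information(30, 70, 50, 9850))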

One issue with MI seems to be that it doesn't really care if pairs are common or not. Very infrequent (typically noisy) pairs seem to pop to the front. One way to fix this would be to add an additional "count" term to the front of the mutual information to weight high-frequency pairs more. This is quite similar to the RlogF measure that's locally quite popular.

The next set of methods seems to be based mostly on the idea of assuming a generative model for the data and doing hypothesis testing. I.e., you have one model that assumes the two words are independent (the null hypothesis), and one that assumes they are not. Typically this is implemented as multinomials over words, and then a classical statistical test is applied to the estimated maximum likelihood parameters of the data. You can use the Dice coefficient, the t-score, chi-squared, or even something simpler like the log-likelihood ratio.
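
Of the tests just listed, the log-likelihood ratio is particularly easy to write down for the 2x2 occurs/does-not-occur setting; here is a short sketch (the counts are invented, and the null model is independence of the two words):

    import math

    def g_squared(n11, n10, n01, n00):
        # Log-likelihood-ratio (G^2) statistic for a 2x2 co-occurrence table:
        # 2 * sum over cells of observed * log(observed / expected), where the
        # expected counts come from the independence (null) model.
        N = n11 + n10 + n01 + n00
        row = [n11 + n10, n01 + n00]   # totals for X = occurs, X = does not occur
        col = [n11 + n01, n10 + n00]   # totals for Y = occurs, Y = does not occur
        obs = {(0, 0): n11, (0, 1): n10, (1, 0): n01, (1, 1): n00}
        g = 0.0
        for i in (0, 1):
            for j in (0, 1):
                expected = row[i] * col[j] / N
                if obs[(i, j)] > 0:
                    g += obs[(i, j)] * math.log(obs[(i, j)] / expected)
        return 2.0 * g

    # Invented counts; compare against a chi-squared threshold with 1 degree of freedom.
    print(g_squared(30, 70, 50, 9850))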

Although I'm not completely familiar with these techniques, they seem like they would also be sensitive to noise in the MLE, especially for low-frequency terms (of which we know there are a lot). It seems plausible to just directly do a Bayesian hypothesis test, rather than a classical one, but I don't know if anyone has done this.

In writing this, I just came across collocations.de, which, among other things, has a nice survey of these results from an ESSLLI tutorial in 2003. They seem to conclude that for extracting PP/verb collocations, t-score and frequency seem to work best, and for extracting adjective/noun collocations, log-likelihood, t-score, Fisher, and p-values seem to work best. The intersection just contains t-score, so maybe that's the way to go. Of course, these things are hard to validate, because so much depends on what you're trying to do.