26 February 2006

Evaluation Criterea: Correlation versus Generalization

When developing a "third level" loss function (i.e., an automatic metric), one often shows that the ranking or the scores of the automatic function correlates well with the ranking or scores of the human-level function.

I'm not sure this is the best thing to do. The problem is that what correlation tell us is just that: on the data sample we have, our automatic evaluation function (AE) correlates well with the human evaluation function (HE). But we don't actually care (directly) about the correlation on past data. We are about the ability to make future predictions of quality. Namely, we often want to know:

If I improve my system by 1 point AE, will this improve my HE score?

This is a generalization question, not a correlation question. But we know how to analyze generalization questions: Machine learning theorists do it all the time!

So, what do we know: if we want AE to predict HE well in the generalization sense, then it is (with high probability) sufficient that AE predicts HE well on our observed (training) data, and that AE is not "too complex." Usually this latter holds. AE is often chosen to be as simple as possible, otherwise people probably won't buy it. Usually it's chosen from a small finite set, which means we can do really simple PAC-style analyses. So this should be enough, right: we have good prediction of the training data (the correlation) and a small hypothesis class, so we'll get good generalization. Thus, correlation is enough.

Well, not quite.

Correlation+small hypothesis class is enough only when the "training data" is i.i.d. And our problem assumptions kill both of the "i"s. First, our data is not independent. In summarization (and to a perhaps slightly lesser extent in MT), the systems are very very similar to eachother. In fact, when people submit multiple runs, they are hugely similar. Second, our data is not identically distributed to the test data. The whole point of this exercise was to answer the question: if I improve my AE will by HE improve? But, at least if I have the top system, if I improve my AE too much, my system is no longer from the same distribution of systems that generated the training data. So the data is hardly i.i.d. and easy generalization bounds cannot be had.

So what can we do about this? Well, recognizing it is a start. More practically, it might be worth subsampling from systems when deriving correlation results. The systems should be chosen to be (a) as diverse as possible and (b) as high performance as possible. The diversity helps with independence. The high performance means that we're not geting supurb correlation on crummy systems and no correlation on the great systems, leading to poor generalization as systems get better. This doesn't fix the problem, but it might help. And, of course, whenever possible we should actually run HE.

22 February 2006

Structured Prediction 3: What's to Come

Different researchers have different priorities, and not all those listed here are things I personally plan to work on. There are many open questions about structured prediction that I'd like an answer to. Hopefully we'll make some headway at the atomic learning workshop, and I'll of course post after that.

  1. Move to more complex, "intractable" structures.
  2. More with less: improving with only weak feedback.
  3. Learning structure: the analogue of learning graphical models.
  4. Learning from heterogenous data.
  5. Measuring sample complexity.
  6. Trading off expressivity with tractability.
  7. Integrated prediction and data mining.
  8. Computational complexity versus sample complexity.
My short term goal is [1] and [6] (and I've worked a bit on [4], but not for SP). Andrew McCallum seems to be working hard on [1] (though using different machinery) and [7]. I think Ben Taskar has some new work on [1] and is probably closer to [5] than any of us. Thorsten Joachims has done interesting work in [1] and [2] (the latter not for SP though), but I think (hope) he might bring some of that across. I don't really know who (if anyone!) is working on the other topics.

The question that drives me is: to what extent can we let the NLPer forget about the fact that there's machine learning going on in the background. I think that, as of now, at least from my perspective, the answer is that the NLPer must be able to structure a search space for his problem, and devise a way for turning the "true output" into an "optimal policy" by breaking it down into regions. The rest is purely trivial crank turning (no math required). The answers to the questions above will help in this endeavor. A very basic question, hinted on before, is: when do we even need to consider structure? If we don't, life is much easier for the NLPer. But we need to be able to say when we do and when we don't need structure. And when we do, we want it to be as painless as possible.

14 February 2006

Tutorial: Bayesian Techniques for NLP

I'm giving a tutorial on Bayesian methods for NLP at HLT-NAACL 2006. I gave a similar tutorial about a year ago here at ISI. This gave me a pretty good idea of what I want to keep in and what I want to cut out. The topics I intend to cover are, roughly:

  1. Bayesian paradigm: priors, posteriors, normalization, etc.
  2. Graphical models, expectation maximization, non-bayesian inference techniques
  3. Common statistical distributions: uniform, binomial/multinomial, beta/dirichlet
  4. Simple inference: integration, summaring, monte carlo
  5. Advanced inference: MCMC, Laplace, Variational
  6. Survey of popular models: LDA, Topics and Syntax, Words and Pictures
  7. Pointers to literature
All of this is, of course, cast in the context of NLP problems: all discrete distributions, language applications, etc., that hopefully both NLP and IR people will find interesting (maybe even some speech people, too).

Does anyone have anything they'd really like to hear that's not on the list? Or anything that's on the list that they don't care about? Keep in mind several constraints: 3 hours (minus coffee time), generally accessible, focused on NLP applications, and something I know something about. (For instance, I covered expectation propagation in the tutorial last year, but decided to cut it for this to give more time to other issues.) Note that I am also preparing a written tutorial that covers roughly the same material.

13 February 2006

Structured Prediction 2.5: Features

I decided to add an additional post on structured prediction before the final one. This is also highly related to the post on joint inference.

One can claim that when doing structured prediction, the features X never have longer range than the loss function, where X is either "need" or "should." To this day I struggle to understand this issue: I don't believe it is as simple as one would like.

Consider our old favorite: sequence labeling under Hamming (per label) loss. The arguement essentially goes: we don't care (i.e., our loss function doesn't care) about how well our labels fit together. All it cares about is getting each right individually. By essentially the same argument as the joint inference post, one can then see that one should train a single classifier that does not use any "Markov-style" features.

On the other hand, I know that I have always seen Markov-style features to help, and I believe that most people who work on SP would agree.

Just as in the joint inference case, I think that if you want to be guaranteed the same information to train a classifier on, you need to move to a cross-product feature space. Unlike John, I do fear this. I fear this for several reasons. First, we already have millions of features. Going to 10^12 features is just not practical (blah blah kernels blah blah). This can also lead to severe generalization issues, and I don't think the state of the art in machine learning is capable of dealing with this. It's trikcy, though, because I largely believe the joint inference argument, and largely don't believe this argument. Yet they are essentially the same.

But I think there might be a more fundamental issue here, beyond cross-product feature spaces, generalization bounds and so on. This is the issue of: if we know a problem has structure, even if the loss function (even the true one) doesn't reflect this, should we make use of this structure. The answer to "Why?" is that it might make it easier on our models to learn. The answer to "Why not?" is that by doing so we require more "human in the loop" than otherwise.

06 February 2006

The Art of Loss Functions

There are roughly four types of loss functions that are used in NLP research.

  1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.

  2. The human-evaluation function. Typical examples are fluency/adequecy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.

  3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean-average-precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.

  4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Note that the difference between (3) and (4) changes over time: there are many (4)s that could easily become (3)s given a few rounds of experimentation. (I apologize if I called something a (4) that should be a (3).)

It is important to always keep in mind that our goal is to improve (1). The standard line is: I can't compute (1) so I approximate it with (2). I can't optimize (2) so I approximate it with (3). (Or, I don't know how to define (2) so I approximate it with (4).)

I strongly feel that the higher up on this list one is, the harder it is to actually define the evaluation metric. It is very easy to define things at the (4) level because there are essentially no requirements on such a loss function other than (A) hard to game and (B) intuitively reasonable. It is a bit harder to define things at the (3) level because an additional critereon is added: (C) must be able to convince people that this new loss function approximates an established (1) or (2). Defining something at the (2) level is shockingly difficult and (1) is virtually impossible.

There are two questions that are crutial to progress. The first is: what is the minimum acceptable evaluation. We cannot reasonably require (1) in all cases. In many cases, I think it absolutely reasonable to require (2); eg. journal papers for which a working definition of (2) is known. Conference papers can usually get by with (3) or (4). I would never accept (4) when a (3) is known. But when not (consider named entity recognition), it's all we have.

It's unclear if this is a good breakdown or not. I could argue never to accept (4): that you must show your results improve something humans care about, at least approximately. But this seems limiting. First, it means that if we want to solve a subproblem, we have to formally show that solving it will be useful before we actually solve it. This is a good thing to do a but of internally, but to do a sophisticated analysis to the point of publishable results to move something from (4) to (3) is a lot of effort for (often) little gain. Especially in the context of a larger research agenda.

I could also argue that for many results, (4) is always sufficient. The argument here is that, from a machine learning perspective, if I can optimize (4), I can optimize anything else you give me. While I've made this argument personally, it's just not true. A more complex loss function might be much harder to optimize (consider optimizing Bleu-1 without brevity penalty versus full Bleu-4). Moreover, it may turn out that whatever biases my model has are good for one (4) and not for another. This is an okay (but not great) argument in an ML conference, but I wouldn't buy it in NLP.

So how do other people rank these things? How much pressure is there to bring a (4) to a (3)? And how often should we do (2)?

02 February 2006

Joint Inference

Joint inference is becoming really popular these days. There's even a workshop coming up on the topic that you should submit a paper to! (That is, unless your paper is more related to the computationally hard NLP workshop I'm running with Ryan McDonald and Fernando Pereira).

The problem with joint inference is that while it seems like a great idea at the outset -- getting rid of pipelining, etc. -- in theory it's not always the best choice. I have a squib on the topic, titled Joint Inference is not Always Optimal and I welcome any and all comments/criticisms/etc.