## 28 July 2006

### Loss versus Conditional Probability

There was a talk in the session I chaired at ACL about directly optimizing CRFs to produce high F-scores for problems like NE tagging and chunking. The technique is fairly clever and is based on the observation that you can use dynamic programming techniques very similar to those used for max log-probability in CRFs to do max F-score.

The details are not particularly important, but during the question phase, Chris Manning asked the following question: Given that F-score is not really motivated (i.e., is a type 4 loss), should we really be trying to optimize it? Conditional probability seems like a completely reasonable thing to want to maximize, given that we don't know how the tags will be used down the pipeline. (It seems Chris is also somewhat biased by a paper he subsequently had at EMNLP talking about sampling in pipelines.)

I think Chris' point is well taken. Absent any other information, conditional probability seems like a quite plausible thing to want to optimize, since given the true conditional probabilities, we can plug in any loss function at test time and do minimum Bayes risk decoding (in theory, at least).

On the other hand, there is an interesting subtlety here. Conditional probability of what? The standard CRF optimization tries to maximize conditional probability of the entire sequence. The "alternative objective" of Kakade, Teh and Roweis optimizes the sum of the conditional probabilities of each label. These are two quite different criteria, and which one should we choose? In fact, neither really seems appropriate. Conditional probability of the sequence doesn't make sense because it would rather improve a bad label from probability 0.01 to 0.1 than improve a bad label from 0.4 to 0.6 and thus get it right. But summed conditional probability of labels doesn't make sense in NE tagging tasks because always assigning probability 0.9 to "not an entity" will do quite well. This is essentially the "accuracy versus f-score" problem, where, when few elements are actually "on," accuracy is a pretty terrible metric.
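To make the contrast concrete, here is a brute-force sketch on a tiny toy chain model (the scores and function name are mine, purely for illustration, not from either paper): it computes both the conditional probability of the whole sequence and the per-position label marginals, which are the quantities the two objectives are built from.

```python
from itertools import product

import numpy as np

def seq_logprob_and_marginals(node, edge, y):
    """Brute-force inference over all label sequences of a tiny chain CRF.

    node: (T, K) per-position label scores (log-potentials);
    edge: (K, K) transition scores; y: true label sequence of length T.
    Returns (log P(y|x), [P(y_t | x) for each position t]).
    """
    T, K = node.shape
    weights = {}
    for seq in product(range(K), repeat=T):
        s = sum(node[t, seq[t]] for t in range(T))
        s += sum(edge[seq[t], seq[t + 1]] for t in range(T - 1))
        weights[seq] = np.exp(s)
    Z = sum(weights.values())
    seq_logprob = np.log(weights[tuple(y)] / Z)
    marginals = [sum(w for seq, w in weights.items() if seq[t] == y[t]) / Z
                 for t in range(T)]
    return seq_logprob, marginals
```

The standard CRF objective maximizes `seq_logprob`; the Kakade et al. alternative is built from the per-position `marginals` instead, which is why the two can disagree about which errors are worth fixing.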

If we take Chris' advice and desire a conditional probability, it seems what we really want is direct conditional probability over the chunks! But how do we formulate this, and how do we optimize it? My impression is that a direct modification of the technique in the paper Chris was asking about would actually enable us to do exactly that. So, while the authors of this paper were focusing on optimizing F-score, I think they've also given us a way to optimize conditional chunk probabilities (actually this should be easier than F-score because there are fewer forward/backward dependencies), similar to what Kakade et al. did for conditional label probabilities.
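As a rough illustration of what "conditional chunk probability" means, here is a brute-force sketch (my own toy construction, not the paper's algorithm, which would use dynamic programming): the marginal probability of a chunk is the total probability of all well-formed BIO tag sequences that realize exactly that chunk.

```python
from itertools import product

import numpy as np

TAGS = ["O", "B", "I"]  # single entity type, for simplicity

def chunk_probability(node, edge, span):
    """P(a chunk exactly covers positions span=(i, j) | x), by brute force.

    node: (T, 3) per-position tag scores; edge: (3, 3) transition scores.
    A chunk over [i, j] means position i is B, positions i+1..j are I,
    and the chunk ends at j (position j+1, if any, is not I).
    """
    T = node.shape[0]
    i, j = span

    def valid(seq):
        # BIO well-formedness: I may only follow B or I.
        return all(not (seq[t] == 2 and (t == 0 or seq[t - 1] == 0))
                   for t in range(T))

    def has_chunk(seq):
        if seq[i] != 1:
            return False
        if any(seq[t] != 2 for t in range(i + 1, j + 1)):
            return False
        return j + 1 >= T or seq[j + 1] != 2

    Z = chunk_mass = 0.0
    for seq in product(range(3), repeat=T):
        if not valid(seq):
            continue
        s = sum(node[t, seq[t]] for t in range(T))
        s += sum(edge[seq[t], seq[t + 1]] for t in range(T - 1))
        w = np.exp(s)
        Z += w
        if has_chunk(seq):
            chunk_mass += w
    return chunk_mass / Z
```

Training to maximize the log of this quantity for the true chunks would be the chunk-level analogue of what Kakade et al. did for labels; in a real implementation the sum over consistent sequences would be computed with a constrained forward-backward pass rather than enumeration.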

Kevin Duh said...

This is a very thought-provoking post. I was at the talk too but didn't make this connection. It's interesting that the critical question "What do we optimize?" isn't always clear in our problems. It'll be really interesting if someone could empirically try the various optimization criteria for chunking/tagging and see how that REALLY affects the later stages in the pipeline. (Of course, then we need some goodness measure for the final stage too...)

Anonymous said...

Hi Hal,

Thanks for this thoughtful post. It would be great if you could mention, in a separate post, some of the interesting papers you saw at the conference, for those of us who could not make it.

Anonymous said...

I totally agree with Chris on this.

We're using the confidence scores as counts in a corpus that we use for data mining and information retrieval of genes by name.

It's easy to convert a forward-backward lattice of tag probabilities to those of chunks. With a BIO-encoding of chunks as tags, check out Culotta and McCallum's Confidence Estimation for Information Extraction, somehow only accepted as a poster.

We used a Begin-Middle-End-Whole encoding of chunkings as taggings in LingPipe, and it makes it a whole lot easier to do extraction. It pulls out n-best chunks (or n-best whole analyses) with conditional probability scores at 330K/second.
We just ran it over all of MEDLINE.
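For concreteness, a Begin-Middle-End-Whole encoding looks something like the following toy sketch (my own illustration of the scheme, not LingPipe's code; "O" marks tokens outside any chunk):

```python
def bmew_encode(length, chunks):
    """Encode chunk spans as Begin/Middle/End/Whole tags, with O for outside.

    chunks: list of (start, end) inclusive token spans, assumed
    non-overlapping. Single-token chunks get the dedicated W tag, which
    is what distinguishes this scheme from plain BIO.
    """
    tags = ["O"] * length
    for i, j in chunks:
        if i == j:
            tags[i] = "W"
        else:
            tags[i] = "B"
            for t in range(i + 1, j):
                tags[t] = "M"
            tags[j] = "E"
    return tags
```

Because every chunk boundary is marked explicitly on both ends, reading chunks (and their probabilities) back off a tag lattice becomes a purely local operation.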

For what it's worth, pulling back most likely sequences vs. most likely tags is not always the same for POS, but the scores are always very close in my experience. We have tutorials on POS with confidence and entity extraction with confidence.

hal said...

Kevin -- I've wanted to do just that for parsing, perhaps with a summarization, EDT and MT system, but the overhead for trying such an experiment is daunting (not to mention the issue of engineering around syntax). Incidentally, Alex Fraser has done just this for alignments.

Bob -- I think I agree with Chris too, to a large degree. I'll have to read the Culotta and McCallum paper...in general I'm not a huge fan of these encodings for sequence segmentation (preferring direct segmentation models), but the paper sounds interesting.
