25 August 2006

Doing Named Entity Recognition? Don't optimize for F1

(Guest post by Chris Manning. Thanks Chris!)

Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition. Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models. It's not just that it's a type 4 loss (a simple, intuition-driven measure like accuracy): In most cases such measures are reasonable enough for what they are, but using F1 for NER has an under-appreciated dysfunctional character. You wouldn't want to optimize for it!

This post explains what I was really thinking about when I made the comment that Hal referred to previously (fortunately, I didn't try to say all this after the talk!). I agree with Hal: the paper was a very nice piece of work technically. I just think that the authors, Jun Suzuki et al., chose a bad peak to climb.

Everyone is familiar with the F1 measure for simple classification decisions. You draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall. But now think about Named Entity Recognition. You're chugging through text, and every now-and-again there is an entity, which your system recognizes or doesn't or fantasizes. I will use the notation word/GOLD/GUESS throughout, with O denoting the special background class of not-an-entity. So there are stretches of plain text (drove/O/O along/O/O a/O/O narrow/O/O road/O/O). These are the non-coding regions of NER. Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/O Palo/LOC/LOCAlto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).

Things look good up until here: those events map naturally on to the false negatives (fn), true positives (tp), false negatives (fp), and false positives (fp) of the simple classification case. The problem is that there are other events that can happen. A system can notice that there is an entity but give it the wrong label (I/O/O live/O/O in/O/O Palo/LOC/ORG Alto/LOC/ORG ./O/O). A system can notice that there is
an entity but get its boundaries wrong (Unless/O/PERS Karl/PERS/PERS Smith/PERS/PERS resigns/O/O). Or it can make both mistakes at once (Unless/O/ORG Karl/PERS/ORG Smith/PERS/ORG resigns/O/O). I'll call these events a labeling error (le), a boundary error (be), and a label-boundary error (lbe).

I started thinking along these lines just as an intuitive, natural way to characterize happenings in NER output, where entities are sparse occurrences in stretches of background text. But you can make it formal (I wrote a Perl script!). Moving along the sequence, the subsequence boundaries are: (i) at start and end of document, (ii) anywhere there is a change to or from a word/O/O token from or to a token where either guess or gold is not O, and (iii) anywhere that both systems change their class assignment simultaneously, regardless of whether they agree. If you chop into subsequences like that, each can be assigned to one of the above seven classes.

Now, the thing to notice is that for the first 4 event types, you are either correct or you get 1 demerit, assessed to either precision or recall. In the simple classification case, that's the end of the story and the F1 measure is sensible. But when doing precision and recall over subsequences, there are these other three event types. Each of them is assessed a minimum of 2 demerits, with both precision and recall being hit. Therefore, it is fairly clear that optimizing for F1 in this context will encourage a system to do the following: if I'm moderately uncertain of either the class label or the boundaries of the entity, because a mistake would cost me a minimum of 2 demerits, I'm better off proposing no entity, which will cost me only 1 demerit.

(Two notes:

(i) As I've defined events, the possible demerits for an event in the last three classes is unbounded, though in practice 2 is the most common case. For example, this lbe event would be assessed 4 demerits (3 to precision, and 1 to recall): Smith/ORG/PERS and/ORG/O Newcomb/ORG/PERS and/ORG/O Co./ORG/ORG.

(ii) Despite my title, the problem here isn't with the F measure per se, as Bob Moore emphasized to me at a coffee break during ACL 2006 (thanks!). The problem would occur with any measure that combines precision and recall and which is increasing in both arguments, such as the simple arithmetic mean of precision and recall.)

Observe that this behavior is the opposite of the way things were meant to work: people adopted F1 in IR rather than using accuracy because accuracy gives high scores to a system that returns no documents, which obviously isn't useful. But, here, optimizing for F1 is encouraging a system to not mark entities.

Now let's look at some data. I used this event classification system on the output of my NER system on the CoNLL 2003 shared task English testa data. Here is how many events of each type there were:

tn 5583
tp 4792
fn 118
fp 120
le 472
be 102
lbe 75

Note in particular that over 2/3 of the errors are in those 3 extra categories that are multiply penalized. The ratios of classes vary with the task. For example, in biological NER, you tend to get many more boundary errors. But in my experience it is always the case that lots of the errors are in the last 3 classes.

Moreover, some of the errors in the le and be classes are not that bad, and sometimes even reflect subtle judgement calls and human annotator inconsistency in the gold standard. For instance, in the GENIA data you can find both regulation/O of/O human/DNA interleukin-2/DNA gene/DNA expression and transduction/O to/O the/O human/O IL-2/DNA gene/DNA, where it is unclear whether to include human in the name of the gene. Or in a newswire phrase like the Leeds stadium, it's not always very clear whether Leeds should be tagged ORG as a reference to the football team or LOC as a reference to the city. In almost any imaginable task, you would prefer systems that made these errors to ones that missed such entities entirely. In other words, the F1 measure is punishing more severely mistakes that should be punished less according to reasonable intuitions of task utility.

Has this been noticed before? I think so. The ACE program has a system for giving partial credit. But most ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants.... Do these observations undermine the last decade of work in NER? I don't think so. It turns out that there are lots of measures that are pretty okay providing you do not specifically optimize for them, but are dysfunctional if you do. A well-known example is traditional readability measures.

p.s. As I finish writing this guest post, it's occurred to me that I think this is the first nlpers post with some actual natural language examples in it. If you're reading this post, I guess that at least shows that such content isn't actively filtered out!

hal said...

Very interesting -- I'm glad you put so much effort into thinking about this so the rest of us don't have to :).

I'd like first to comment a bit about the ACE measure (restricted the NE tagging). Here, essentially what you do is create a bipartite graph. One half is the "true mentions" and the other half is the "system detected mentions." We then do a matching between them (ignore how for a second) and compute a final score based on this matching. For instance, you get docked some points if types don't match across links. Matches that don't overlap by some character-based minimum amount (eg., 90% of the characters) are disallowed. You also get penalized for misses (unmatched elements on the truth side) or false alarms (unmatched elements on the system side). The way the matching is actually computed is so as to MAXIMIZE your end score: this is a straightforward bipartite matching problem and can be solved efficiently.

The major thing that I think the ACE score misses that Chris talks about is the issue of span. According to ACE, either spans overlap or they don't. 90% (IIRC) of characters in common is a match, 89.9% is not. This, for instance, would not allow for many of Chris' examples. Moreover, it doesn't fix the problem that only getting "human" in "human/DNA interleukin-2/DNA gene/DNA" is probably much worse than only getting "gene." But to know that, eg., "gene" is the head, or than last names are more important (generally) than first, we would need to actually annotate this, or come up with heuristics (likely akin to Mike Collin's head finding heuristics for head-finding in NPs).

A second issue is believability, which Chris also mentioned in terms of "ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants." I think there are two reasons for this.

(1) Complex loss functions with weird constants are just not believable. As I talked about before, we really want loss function than generalize, not that just fit the data. If we fit the data using a lot of features and weights, but this is not intuitive (our "prior" says "crazy!"), we're not going to believe it will generalize. I don't think the ACE metric is all that bad in this respect (I've seen far worse). But I have had lengthy conversations with hardcore ML people who think that even something like BLEU is too complex.

(2) Practically speaking, complex loss functions are hard to optimize. Hamming loss is just so easy in comparison to really anything else (even F1 is hard, as is noted by the technical difficulty in the paper that brought up this conversation). Directly optimizing ACE or BLEU is quite difficult and the sorts of techniques that we (as machine learning folk) use are not really up to snuff here.

One question -- and I don't really think this has an answer (but I'll ask it anyway) -- this brings up is: how can we tell if a loss function is good? Of course, if we can compare it to real evaluations, this is great, but what if we can't (eg., in parsing or NE tagging). Is F1 good for parsing for instance? Is the ACE metric good, or at least better than F (in some formal sense)? Is there a way to tell whether a metric is easy to game?

Okay, a long post gets a long reply, but I'll stop here :). Thanks again to Chris for putting this together.

Anonymous said...

The "problem" Chris raises can be attributed to focusing on first-best solutions. I like to think of the problem of entity extraction as more like search and less like first-best classification (though there is clearly a deep connection between search and classification). Downstream applications typically focus on gathering/tracking information about some known entities (e.g. protein p53 and things it regulates in humans), or on mining data to discover relationships or patterns (e.g. "species=human, regulator:p53, regulated: human insuline-like growth factor II").

Suppose you have a system that can return multiple mentions and spans from a text input and give them scores that are comparable across docs. We like to use conditional probabilities p(mention of type T from position n to n+k|text) here, because they're easy to combine with downstream processing like social network analysis and because they have the desirable cross-document comparability. For instance, here's some output drawn from LingPipe's Named Entity Tutorial (trained from NCBI's GeneTag corpus):

p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells. Mentions by confidence: "p53":0.9999, "p4 promoter":0.7328, "insulin-like growth factor II gene":0.6055, "human insulin-like growth factor II gene":0.3817, "active P4 promoter":0.1395, "P4":0.0916, "active P4":0.0088, "insulin-like growth factor II":0.0070, "human insulin-like growth factor II":0.0044, ... The numbers are conditional probability estimates of the phrase being a gene given the input text (this is actually done by span, not by phrase, but the above is easier to read with the limited formatting available in this blog). You can see that it's just not sure if "human" should be in there, just as in Chris's example.

A ranking of entities along with a reference gold standard allows you to draw a precision-recall curve, compute mean-average precision (MAP), precision-recall breakeven point (BEP), compute precision-at-n-documents, and so on. For instance, for genes, we can do 99.5% recall at 10% precision for search, or tighten precision down to 99% for mining applications.

Perhaps even more importantly, we can combine information across mentions. This was originally done statistically by Mark Craven, I believe. Most simply, we can combine scores from all mentions of "P4 promoter" and estimate its total count in a corpus. This allows very high precision extraction if we take high estimated count events. We can also use this kind of output as the basis of rescoring by downstream models such as relation extractors (as doen by Dan Roth) and coreference resolvers (Heng Ji and Ralph Grishman), though that's typically done with whole-sentence n-best (which LingPipe also does) rather than per-entity n-best.

Historical note: The whole alignment/partial-credit thing goes back to MUC. It's possible to remove the alignment part of it and just give partial credit for overlaps with the same type or exact span matches with different types and to break all that down by type and what not. But it's still not very easy to understand the final numbers. As Chris's figures show (and they're typical), most errors get partial credit, so this scoring often comes close to halving error rates. DARPA's performance figure reverse engineering is interesting from both a technical and sociological/organizational/marketing perspective.

hal said...

While I agree that it is often important to not just provide a single-best output but either an n-best list or a distribution over outputs, I feel that this is something of a side issue. (Of course, if we're trying to produce probabilities, then the whole "optimizing F1" is irrelevant since we'll instead be optimizing for conditional probability.) It would of course be possible to extract n-best lists from a system trained using Suzuki et al.'s technique, but if the loss being optimized is fundamentally wrong (as Chris contests) then doing so won't help us.

In the end (*), what we want is a loss function for NE rec such that low loss implies good inputs for systems further down the pipeline (since there are only very few imaginable scenarios in which NER is the final stage). It may turn out, as Chris contends, that conditional probability is a pretty darn good option. But whether we use an n-best list or not seems tangential to me.

(*) Of course, it may be possible to directly optimize NER performance on the basis of final task performance, without inventing a new loss function. In fact, I hope this is true :).

Anonymous said...

Think about this when deciding whether first-best is enough: when's the last time you've pressed Google's "I'm Felling Lucky" button? A system that's limited to a first-best state-of-the-art 80-85% recall is going to miss a lot of entities. I believe a relevant task eval is the precision you can deliver at 99.5% recall. Or what recall you can get at 95% precision.

Optimizing for first-best can actually hurt performance at different precision/recall points. Our own long-distance models are more attenuated than the more local ones; they're about 10% absolute better on F measure, but much worse on precision-at-100 docs, area-under-ROC curve, or MAP-type measures.

Assuming that partial matches are desirable, as Chris did, assumes that downstream processes have the means to correct noisy first-best output. I think it's easier to combine and rescore than it is to correct, but that's just a hunch.

Scoring by cross-entropy estimates of the gold standard versus a system model makes a lot of sense. The score is just the sum of the estimated log probs for the reference annotations. It works not only for NE, but for just about any task you can imagine. Not only for evaluation, but for things like EM training. It also makes sense with either joint or conditional estimates.

The advantage of scoring by cross-entropy is that developers don't need to write decoders -- just scorers. The problem for a bakeoff scoring by cross-entropy is that it assumes systems are properly normalized probabilistic systems, which rules out SVMs, TF/IDF, heuristic pattern filters (unless they can be integrated with rescoring), etc.

Scoring ranked n-best entities a la TREC, as I suggested, eliminates normalization errors (and cheating possibilities). But it requires ranked output, and for best performance, n-best. I believe downstream systems require n-best anyway, so this doesn't seem like much of an imposition.

hal said...

I more-or-less agree that first best is not always the best way to evaluate a system, especially one that is going to be embedded within a larger system. There are many options for how to embed, the simplest of which are probably n-best lists or samples from the posterior (see Chris' paper).

All I'm saying is that I think that the issue of F1 being suboptimal for optimization is independent of the one-best versus many distinction. If the loss you're optimizing is fundamentally broken, then you're essentially leaving it up to chance that it's somehow "close enough" that within an n-best list, you'll get something reasonable. But this is a strong assumption. I think that if you believe what Chris is saying, then you must believe it whether you are producing single best outputs or n-best outputs.

Abhishek said...

Given that I have recognized the boundaries of named entities in a large text corpora, are you aware of any unsupervised technique that classifies these Entity candidates?

hal said...

Abhishek: I'd take a look at Collins and Singer's bootstrapping approach for that. (Plus things that cite it.)

Anonymous said...

I'm a little slow off the bat, but:

I am interested in some clarifications of Chris's error classificatiion. The described segmentation method would leave some cases accounted for very strangely (though they may be rare in English NER).

For example:

Chrisopher|I-PER|I-PER Manning|I-PER|B-PER will be marked as a single BE, presumably. Will Chrisopher|I-PER|I-PER Manning|I-PER|I-ORG be an LBE?

With this segmentation, if a system tags an entire sentence with one tag, or with many tags which continually overlap with gold-standard entities but whose boundaries never coincide, it will be counted as a single segment, and judged as a maximum of 1 error. Surely this is just as useless a result.

I am basically not convinced that we have a clear way of counting boundary errors.

Instead, perhaps, we could give a score for each correct and predicted annotation, in a way that simple cases (exact matches, boundary errors, etc.) will award two agreeing points:

Unless/O/B-PERS Karl/B-PERS/B-PERS Smith/B-PERS/B-PERS resigns/O/O

Here correct Karl Smith corresponds to predicted Unless Karl Smith. From the perspective of either annotation, this is a boundary error.

Chrisopher|I-PER|I-PER Manning|I-PER|B-PER

This might be scored as two boundary errors from the perspective of the predictions, and one (label-)boundary error from the perspective of the correct entity.

But this scoring is biased against false positives and false negatives, and probably has a number of other faults.

Anonymous said...

When computing precision, recall, and f-measure, do you exclude non-coding regions of the text? (eg, "drove/O/O along/O/O a/O/O"?)

Seems if you include non-coding regions, a set of evaluation documents with very few entities will produce artificially high results. However technically, a classification result of "not an entity" is still a classifier result (capable of false positive, negative, true positive, etc). Any thoughts on this? Should non-coding tokens be included or not included?

Anonymous said...

Anonymous said...

I am grateful to you for this great content.aöf thanks radyo dinle cool hikaye very nice sskonlycinsellik very nice ehliyet turhoq home free kadın last go korku jomax med olsaoy hikaye lesto go müzik dinle free only film izle love aşk 09sas mp3 indir

Anonymous said...
Anonymous said...

to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants." I think there are two reasons for this.
Dissertation Writing | Essay Writing | Research Paper Writing

term paper guide said...

Great blog, people get lots of information keep on posting this type of attractive articles.

custom paper guide

college term paper

Research Papers said...

I'll be glad to hear any other news from you.
Essay Writing | Buy Research Paper | College Essays

Writing a Research Paper said...

Many institutions limit access to their online information. Making this information available will be an asset to all.

Anonymous said...

Student can get term papers and custom term papers help online through many websites.

Anonymous said...

Wonderful blog, i recently come to your blog through Google excellent knowledge keep on posting you guys.

Dissertation services
Dissertation writing

Joe said...

Thank you for great article. Where else could anyone get that kind of information in such a perfect way of presentation.
Oes Tsetnoc

Joe said...

this is great information that i know a lot of people are interested in.
Kerja Keras Adalah Energi Kita | Hosting Murah | Kerja Keras Adalah Energi Kita

Methey said...

I like the way you explain things.

Babies | Bayi

Baby said...

Custom Essay said...

Hi,
I personally like your post; you have shared good insights and experiences. This post will really help beginners, although it is basic but, it will help others in great deal in future. Keep it up.

College Research Paper said...

I appreciate the work of all people who share information with others.

Unknown said...

Hi,
It must've taken you a bit of time, so thanks for taking the time to do so, I appreciate it, and this post is just great.
Coursework help

Essay Paper said...

Hi,

Anonymous said...

Thank you very much for a kind of the hottest data just about this post ! You have to ground your dissertation, I think. Because some thesis writing services do such things and you are able accomplish really good format thesis too.

PDF BOOKS said...

Thanks for writing this informative article.

pdf books
free pdf books
online pdf books

Hey,

Custom Term Papers said...

Nice post! Very complete and detail information. That’s what I need! Well done!
Custom Term Paper

Dissertation said...

Great, I know I've wanted a larger test ACE a few times, glad to see someone created one!

Unknown said...
This comment has been removed by the author.
Unknown said...
This comment has been removed by the author.
Unknown said...

that is a great kind of information peoples easily atract with it
Thesis Writing | Dissertation Writing | Essay Writing | Assignment Writing

Unknown said...
This comment has been removed by the author.
Unknown said...
This comment has been removed by the author.
Anonymous said...

I am interested in is that Chris mistake classificatiion some clarification. Segmentation method will described in some cases, very strange, but they may be in English net enrollment rate of the rare.

Thesis Help | Dissertation Help | Essay Help | Assignment Help

Amber Melton said...

Good post! Very comprehensive and detailed information. This is what I need! Well done!
Social Media Marketing Press Release Submission Social Media Tips Press Release Submission

Many institutions limit access to their online information. Making this information available will be an asset to all.

Anonymous said...
college essays said...

Great idea..thanks for sharing..I will forward it to my friends...
Term Papers | Custom Essays | Research Papers

Anonymous said...

Great Article as share good stuff with good ideas and concepts, lots of great information and inspiration, both of which we all need, thanks for all the enthusiasm to offer such helpful information here
link building service | social bookmark submission| Article Submission Service | Manually directory Submission Service | press release distribution

hayda said...

What a helpful post really will be coming back to this time and time again. mirc . chat . chat sohbet . mirc sohbet . cinsellik . cinsel sohbet . cinsellik sohbet . ask sohbet .Thanks ..

daoes said...

These kind of post are always inspiring and I prefer to read quality content Car Insurance
so Zynga Poker Bot | Top health plans happy to find many good point here in the post, writing is simply great, thank you for the post

Logo designs said...

Interesting post, Keep up the good work:
Logo Designs , Logo Design

www.4xindia.com said...

A lot of great info and ideas in this post, thanks for bringing this to us!
Jenny Parker

logo design - logoinn said...

What a helpful post really will be coming back to this time and time again.
Logo Designs | Logo Design

Thesis Help said...

Great thesis help for the students for thesis paper.

Essay Thesis said...

Helpful post. Very informative. thesis statement

tattoos designs said...

Nice post, thanks for sharing this wonderful and useful information with us.

Unknown said...

No matter what product you purchase from us north face jacketsWe are offering you a wide range ofnorth face outletquestion，Some color combinations seem to never get old north face outerwearBut within the same community north face coatsbecause it features just the right amount of north face uk，look at another good paoduct such as Dior totes，A little of these are given below.ugg bootswas a very well-known French fashionable boot
cheap ugg boots，because of the wisdom of brilliant featuresdiscount ugg boots，which you are buying is unique and original classic ugg boots，When possible, they obtain materials from domestic suppliers ugg classic tall boots，which makes it exclusive and uniquebabyliss，I like the other two designs as wel Benefit GHD，A stroll around the park with the GHD IV Salon Styler，They're also used to buy GHD Mini Styler，A stroll around the park with GHD Precious gift，They're also used to help prevent
GHD Rare Styler，Following the success of the initial gray ghd，The extremity of the sole is gold ghd
And those who buy this ghd Instyler，people who work outside and so wearisome Kiss ghd，I would recommend moving up to pink ghd，If your own walks or intentions for this pure black ghd，I wouldn’t hesitate to recommend the pure white ghd，our price is very reasonable purple ghd

aweseome blogger said...

Everyone is familiar with the F1 measure for simple classification decisions.
cara meninggikan badanYou draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall.

aweseome blogger said...

Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition. Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models. tinggi badan

speed dating nyc said...

Viko said...

What a great style. Very informative one, I hope you will continue your research.
I can offer you term paper
on this subject. Thank you.

logo design said...

Nice article
logo design
logo designs
logos

aweseome blogger said...

Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/O Palo/LOC/LOCAlto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).peninggi badan

KATIE DOYLE said...

The blogger is Huge network for blogging i get lots of interesting information from here, hope blogger will modify and increase attributes to make it simpler.

Dissertation Writing Help | Dissertation Structure

Jony said...

We can recommend for women and men about Breast Enlargement Pills and How to make penis bigger.

Unknown said...

Custom Logo Design

Unknown said...

Great article..! i like the way of your writing and description.

Aria Kerry said...

This might be scored as two boundary errors from the perspective of the predictions, and one (label-)boundary error from the perspective of the correct entity.
website design | web design | free website design | flyer design

Unknown said...

Wonderful blog, i recently come to your blog through Google excellent knowledge keep on posting you guys,thanks for sharing this post.

louis said...

Your blog is so nice.I am impressed with your vivid expression.I will
vuitton
handbags
neverfull | louis vuitton mahina
bookmarked you…keep up the good work!!!!

essay writing said...

i like your blog so much its my honor to give my comments on it great work man....

website designing company said...

I am very much pleased with the contents you have mentioned. I enjoyed every little bit part of it. It contains truly information. I want to thank you for this informative read; I really appreciate sharing this great.