(Guest post by Chris Manning. Thanks Chris!)
Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition. Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models. It's not just that it's a type 4 loss (a simple, intuition-driven measure like accuracy): In most cases such measures are reasonable enough for what they are, but using F1 for NER has an under-appreciated dysfunctional character. You wouldn't want to optimize for it!
This post explains what I was really thinking about when I made the comment that Hal referred to previously (fortunately, I didn't try to say all this after the talk!). I agree with Hal: the paper was a very nice piece of work technically. I just think that the authors, Jun Suzuki et al., chose a bad peak to climb.
Everyone is familiar with the F1 measure for simple classification decisions. You draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall. But now think about Named Entity Recognition. You're chugging through text, and every now-and-again there is an entity, which your system recognizes or doesn't or fantasizes. I will use the notation word/GOLD/GUESS throughout, with O denoting the special background class of not-an-entity. So there are stretches of plain text (drove/O/O along/O/O a/O/O narrow/O/O road/O/O). These are the non-coding regions of NER. Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/O Palo/LOC/LOCAlto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).
Things look good up until here: those events map naturally on to the false negatives (fn), true positives (tp), false negatives (fp), and false positives (fp) of the simple classification case. The problem is that there are other events that can happen. A system can notice that there is an entity but give it the wrong label (I/O/O live/O/O in/O/O Palo/LOC/ORG Alto/LOC/ORG ./O/O). A system can notice that there is
an entity but get its boundaries wrong (Unless/O/PERS Karl/PERS/PERS Smith/PERS/PERS resigns/O/O). Or it can make both mistakes at once (Unless/O/ORG Karl/PERS/ORG Smith/PERS/ORG resigns/O/O). I'll call these events a labeling error (le), a boundary error (be), and a label-boundary error (lbe).
I started thinking along these lines just as an intuitive, natural way to characterize happenings in NER output, where entities are sparse occurrences in stretches of background text. But you can make it formal (I wrote a Perl script!). Moving along the sequence, the subsequence boundaries are: (i) at start and end of document, (ii) anywhere there is a change to or from a word/O/O token from or to a token where either guess or gold is not O, and (iii) anywhere that both systems change their class assignment simultaneously, regardless of whether they agree. If you chop into subsequences like that, each can be assigned to one of the above seven classes.
Now, the thing to notice is that for the first 4 event types, you are either correct or you get 1 demerit, assessed to either precision or recall. In the simple classification case, that's the end of the story and the F1 measure is sensible. But when doing precision and recall over subsequences, there are these other three event types. Each of them is assessed a minimum of 2 demerits, with both precision and recall being hit. Therefore, it is fairly clear that optimizing for F1 in this context will encourage a system to do the following: if I'm moderately uncertain of either the class label or the boundaries of the entity, because a mistake would cost me a minimum of 2 demerits, I'm better off proposing no entity, which will cost me only 1 demerit.
(i) As I've defined events, the possible demerits for an event in the last three classes is unbounded, though in practice 2 is the most common case. For example, this lbe event would be assessed 4 demerits (3 to precision, and 1 to recall): Smith/ORG/PERS and/ORG/O Newcomb/ORG/PERS and/ORG/O Co./ORG/ORG.
(ii) Despite my title, the problem here isn't with the F measure per se, as Bob Moore emphasized to me at a coffee break during ACL 2006 (thanks!). The problem would occur with any measure that combines precision and recall and which is increasing in both arguments, such as the simple arithmetic mean of precision and recall.)
Observe that this behavior is the opposite of the way things were meant to work: people adopted F1 in IR rather than using accuracy because accuracy gives high scores to a system that returns no documents, which obviously isn't useful. But, here, optimizing for F1 is encouraging a system to not mark entities.
Now let's look at some data. I used this event classification system on the output of my NER system on the CoNLL 2003 shared task English testa data. Here is how many events of each type there were:
Note in particular that over 2/3 of the errors are in those 3 extra categories that are multiply penalized. The ratios of classes vary with the task. For example, in biological NER, you tend to get many more boundary errors. But in my experience it is always the case that lots of the errors are in the last 3 classes.
Moreover, some of the errors in the le and be classes are not that bad, and sometimes even reflect subtle judgement calls and human annotator inconsistency in the gold standard. For instance, in the GENIA data you can find both regulation/O of/O human/DNA interleukin-2/DNA gene/DNA expression and transduction/O to/O the/O human/O IL-2/DNA gene/DNA, where it is unclear whether to include human in the name of the gene. Or in a newswire phrase like the Leeds stadium, it's not always very clear whether Leeds should be tagged ORG as a reference to the football team or LOC as a reference to the city. In almost any imaginable task, you would prefer systems that made these errors to ones that missed such entities entirely. In other words, the F1 measure is punishing more severely mistakes that should be punished less according to reasonable intuitions of task utility.
Has this been noticed before? I think so. The ACE program has a system for giving partial credit. But most ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants.... Do these observations undermine the last decade of work in NER? I don't think so. It turns out that there are lots of measures that are pretty okay providing you do not specifically optimize for them, but are dysfunctional if you do. A well-known example is traditional readability measures.
p.s. As I finish writing this guest post, it's occurred to me that I think this is the first nlpers post with some actual natural language examples in it. If you're reading this post, I guess that at least shows that such content isn't actively filtered out!