04 October 2007

F-measure versus Accuracy

I had a bit of a revelation a few years ago. In retrospect, it's obvious. And I'm hoping someone else out there hasn't realized this because otherwise I'll feel like an idiot. The realization was that F-measure (for a binary classification problem) is not invariant under label switching. That is, if you just change which class it is that you call "positive" and which it is that you call "negative", then your overall F-measure will change.

What this means is that you have to be careful, when using F-measure, about how you choose which class is the "positive" class.

On the other hand, the simple "accuracy" metric is (of course) invariant under label switching. So when using accuracy, you needn't worry about which class you consider "positive."
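To make the asymmetry concrete, here is a minimal sketch with made-up confusion-matrix counts (the numbers are purely illustrative, not from any real experiment). It computes F and accuracy, swaps which class is called "positive," and computes them again.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure (F1) from confusion-matrix counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of all instances that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts, chosen only for illustration.
tp, fp, fn, tn = 30, 10, 20, 140

f_orig = f_measure(tp, fp, fn)
acc_orig = accuracy(tp, fp, fn, tn)

# Switching labels: old negatives become positives, so TP<->TN and FP<->FN.
tp_s, fp_s, fn_s, tn_s = tn, fn, fp, tp

f_swapped = f_measure(tp_s, fp_s, fn_s)
acc_swapped = accuracy(tp_s, fp_s, fn_s, tn_s)

print(f"F:        {f_orig:.3f} -> {f_swapped:.3f}")      # F changes
print(f"accuracy: {acc_orig:.3f} -> {acc_swapped:.3f}")  # accuracy does not
```

Accuracy only ever sees TP + TN, which the swap leaves alone. F never looks at TN, so moving the big cell in or out of its view changes the answer.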

In the olden days, when people pretty much just used F-measure to analyze things like retrieval quality, this wasn't a problem. It was "clear" which class was positive (good documents) and which was negative. (Or maybe it wasn't...) But now, when people use F-measure to compare results on a large variety of tasks, it makes sense to ask: when is accuracy the appropriate measure and when is F the appropriate measure?

I think that, if you were to press anyone on an immediate answer to this question, they would say that they favor F when one of the classes is rare. That is, if one class occurs only in 1% of the instances, then a classifier that always reports "the other class" will get 99% accuracy, but terrible F.
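The 1% example is easy to check by hand, but a small sketch makes it explicit. The data set below is synthetic (10 "rare" and 990 "common" instances, names invented for illustration), and F is taken to be 0 when there are no true positives, which is a common convention rather than anything forced by the definition.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def f_measure(tp, fp, fn):
    """Balanced F-measure; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic data: one class covers 1% of the instances.
y_true = ["rare"] * 10 + ["common"] * 990
# A classifier that always reports "the other class".
y_pred = ["common"] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive="rare")
acc = (tp + tn) / len(y_true)
f_rare_positive = f_measure(tp, fp, fn)

# Same predictions, but now the common class is called "positive".
tp2, fp2, fn2, _ = confusion_counts(y_true, y_pred, positive="common")
f_common_positive = f_measure(tp2, fp2, fn2)

print(f"accuracy = {acc:.3f}")
print(f"F (rare class positive)   = {f_rare_positive:.3f}")
print(f"F (common class positive) = {f_common_positive:.3f}")
```

Note that F is only "terrible" here when the rare class is the one called positive; with the labels the other way around, the same do-nothing classifier gets an F close to 1.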

I'm going to try to convince you that while rarity is a reasonable heuristic, there seems to be something deeper going on.

Suppose I had a bunch of images of people drinking soda (from a can) and your job was to classify whether they were drinking Coke or Pepsi. I think it would be hard to argue that F is a better measure here than accuracy: how would I choose which one is "positive"? Now, suppose the task were to distinguish between Coke and Diet Dr. Pepper. Coke is clearly going to be the majority class here (by a long shot), but I still feel that F is just the wrong measure. Accuracy still seems to make more sense. Now, suppose the task were to distinguish between Coke and "anything else." All of a sudden, F is much more appealing, even though Coke probably isn't much of a minority (maybe 30%).

What seems to be important here is the notion of X versus not-X, rather than X versus Y. In other words, the question seems to be: does the "not-X" space make sense?

Let's consider named entity recognition (NER). Despite the general suggestion that F is a bad metric for NER, I would argue that it makes more sense than accuracy. Why? Because it just doesn't make sense to try to specify what "not a name" is. For instance, consider the string "Bill Clinton used to be the president; now it's Bush." Clearly "Bill Clinton" is a person. But is "Clinton used"? Not really. What about "Bill the"? Or "Bill Bush"? I think a substantial part of the problem here is not that names are rare, but that it's just not reasonable to develop an algorithm that finds all not-names. They're just not well defined.

This suggests that F is a more appropriate measure for NER than, at least, accuracy.

One might argue--and I did initially myself--that this is an artifact of the fact that names are often made up of multiple words, and so there's a segmentation issue. (The same goes for the computer vision problem of trying to draw bounding boxes around humans in images, for which, again, F seems to make more sense.)

But I think I've been convinced that this isn't actually the key issue. It seems, again, that what it boils down to is that it makes sense to ask one to find Entities, but it doesn't make sense to ask one to find non-Entities, in the same way it doesn't make sense to ask one to find non-Cokes.

(Of course, in my heart of hearts I believe that you need to use a real--i.e., type 4--evaluation metric, but if you're stuck without this, then perhaps this yields a reasonable heuristic.)


  1. I think F-measure makes sense only if precision and recall make sense. At least that's how I think about it.

  2. The ROC curve and the AUC (Area Under the ROC Curve) metric can be used for cases where one of the classes is rare, as a more robust alternative to the F-measure.

    When there is no clear definition for members of the population for the "negative" class, a Free-Response ROC curve (and the derivative metrics) can be used instead.

  3. Although F-measure is only defined in terms of true positives (TP), false positives (FP) and false negatives (FN), in a real evaluation, true negatives (TN) come into play as follows.

    Suppose you have an algorithm that is 90% accurate on positive cases. Then for every 100 positive test cases, you find 90 TPs and 10 FNs. Next, suppose the algorithm's only 80% accurate on negative cases. That means, for every 100 negative test cases, you get 80 TNs and 20 FPs.

    Now let's see what happens to precision and recall as the balance of positive and negative test cases changes. With 100 positive and 100 negative test cases, you get TP=90, FN=10, TN=80, FP=20, for precision=TP/(TP+FP)=90/(90+20) and recall=TP/(TP+FN)=90/(90+10).

    Next consider 100 positive and 1000 negative cases. This leads to TP=90, FN=10, TN=800, FP=200. Recall remains 90/100, but precision is now a measly 90/(90+200). As the number of negative cases grows, the rejection accuracy remains the same, 80%, but the precision drops precipitously.
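The arithmetic above can be sketched in a few lines. The 90% and 80% per-class accuracies are the ones from the comment; the function name, its parameters, and the extra 10,000-negative case are added here for illustration.

```python
def precision_recall(n_pos, n_neg, pos_acc=0.9, neg_acc=0.8):
    """Precision and recall for a classifier with fixed per-class accuracy.

    pos_acc: fraction of positive cases labeled positive (90% in the comment).
    neg_acc: fraction of negative cases labeled negative (80% in the comment).
    """
    tp = pos_acc * n_pos
    fn = n_pos - tp
    tn = neg_acc * n_neg
    fp = n_neg - tn
    return tp / (tp + fp), tp / (tp + fn)


# Hold the positives at 100 and grow the pool of negatives.
results = {}
for n_neg in (100, 1000, 10000):
    precision, recall = precision_recall(100, n_neg)
    results[n_neg] = (precision, recall)
    print(f"negatives={n_neg:>6}  precision={precision:.3f}  recall={recall:.3f}")
```

Recall stays at 0.9 throughout, while precision goes from roughly 0.82 to 0.31 to 0.04, even though nothing about the classifier itself has changed.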

    In search, it's basically impossible to precisely measure recall in a large doc set. With sampling, you can sometimes get close. TREC restricts its evals to the top N documents returned by the participants. Docs not returned by anyone may be relevant, but are not considered in their evaluations. Thus we don't really know corpus-level recall, just recall on the set of docs returned in the top N (typically 1000) of at least one search engine.

    In any case, point estimates are not so useful in non-bakeoff contexts. What's more useful is precision/recall curves (not area or average or maximum summary statistics). That's because some apps need high precision at low recall (e.g. typical web search) and others need high precision at high recall (e.g. intelligence analysts).

    PS: Since we're on the topic of F, it's interesting to note its relation to the traditional Jaccard measure:

    F = 2*TP / (2*TP + FP + FN)

    Jaccard = TP/(TP + FP + FN)
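The relation between the two can be written out directly. Assuming TP + FP + FN > 0 and writing J for Jaccard, dividing the numerator and denominator of F by TP + FP + FN gives:

```latex
\begin{align*}
J &= \frac{TP}{TP + FP + FN}, \\
F &= \frac{2\,TP}{2\,TP + FP + FN}
   = \frac{2\,TP}{TP + (TP + FP + FN)}
   = \frac{2J}{1 + J}, \\
J &= \frac{F}{2 - F}.
\end{align*}
```

Each is a monotone function of the other, so they always rank systems the same way; F is never smaller than Jaccard.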

  4. An alternative to F measure for unbalanced classes is kappa:

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)

    which is normally used for interannotator agreement. This basically scales accuracy to reflect the imbalanced labels without ignoring the TN class.
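As a rough sketch, and assuming the comment means Cohen's kappa (the usual two-rater agreement statistic, with the classifier and the gold labels playing the two raters), it can be computed from the same four counts. The function name and the example counts are illustrative.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa between predicted and true labels for a binary task."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # plain accuracy
    # Agreement expected by chance, from the marginal label frequencies.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one label is ever used by both
    return (observed - expected) / (1 - expected)


# Always predicting the majority class on a 1%-positive data set:
# 99% accuracy, but no agreement beyond chance.
kappa_majority = cohens_kappa(tp=0, fp=0, fn=10, tn=990)

# A balanced test set: 90% accurate on positives, 80% on negatives.
kappa_balanced = cohens_kappa(tp=90, fp=20, fn=10, tn=80)

print(f"kappa (always-majority) = {kappa_majority:.3f}")
print(f"kappa (balanced case)   = {kappa_balanced:.3f}")
```

Unlike F, this quantity is unchanged if you swap which class is called positive, since the formula is symmetric in TP/TN and FP/FN.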

  5. The word "class" has a meaning here that I don't understand. I tend to think of "class" as simply equivalent to "label". What is your definition?

  6. I will admit to not fully understanding the retrieval problem, but I wonder whether the proportion of true positives captured with the X% predicted most likely would not be useful? My thought is that, of all items returned, only so-many (X percent) can usefully be reviewed, so why not try to maximize the quality of those items?

    Well, it was just a thought.

    -Will Dwinnell
    Data Mining in MATLAB

  7. Will: What you're talking about is called precision at N, which is the accuracy on the first N documents. It's widely used in search evaluation, and is part of the TREC ad hoc search evals. Popular values of N are 5, 10 or 100. If you're feeling lucky, choose N=1.
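A minimal sketch of precision at N, assuming the results come as a ranked list of 0/1 relevance judgments (best-ranked first). The function name and the example ranking are made up for illustration. Dividing by N even when fewer than N results come back is one common convention, not the only one.

```python
def precision_at_n(ranked_relevance, n):
    """Fraction of the top-n ranked results that are relevant.

    ranked_relevance: 0/1 judgments, ordered from best-ranked to worst.
    Divides by n even if fewer than n results were returned (an assumption).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(ranked_relevance[:n]) / n


# Hypothetical judgments for the top 10 results of one query.
ranking = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]

p_at_1 = precision_at_n(ranking, 1)
p_at_5 = precision_at_n(ranking, 5)
p_at_10 = precision_at_n(ranking, 10)

print(f"P@1={p_at_1:.2f}  P@5={p_at_5:.2f}  P@10={p_at_10:.2f}")
```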

  8. I'm not understanding something else then. What's wrong with the label "not a coke"?

  9. i don't think there's a problem with the label not-a-coke... i just don't think that accuracy is an appropriate measure for this problem. i think what it boils down to is something like open versus closed sets... tomorrow, someone can come out with a new type of soda and this changes what the not-a-coke class looks like. this can't happen, for instance, in the coke-versus-sprite case. this seems to be what--psychologically--is making me prefer P/R/F in the X-versus-notX cases. basically, it's somewhat meaningless to have a classifier that can detect not-X because not-X can change. so the important thing to measure is if you can reliably spot Xs in a sea of otherness.

  10. Nice to see the different measures for precision and recall. Now I know what not to optimize my algorithm for.

  11. I need to do eval on a rule-based chunk parser that is tuned to resume data (it needs to identify things like job title, company name and date range). I will go with classification-style testing methods rather than true parsing evals - there aren't any real dependencies at work here. The task is in fact much more like NER than parsing. Does anybody have advice or experience with this sort of data? Thanks.......