04 April 2008

More complaining about automatic evaluation

I remember a few years ago complaining about automatic evaluation at conference was the thing to do. (Ironically, so was writing papers about automatic evaluation!) Things are saner now on both sides. While what I'm writing here is interpretable as a gripe, it's really intended as a "did anyone else notice this" because it's somewhat subtle.

The evaluation metric I care about is Rouge, designed for summarization. The primary difference between Rouge and Bleu is that Rouge is recall-oriented while Bleu is precision-oriented. The way Rouge works is as follows. Pick an ngram size. Get a single system summary H and a single reference summary R (we'll get to multiple references shortly). Let |H| denote the size of bag the defined by H and let |H^R| denote the bag intersection. Namely, the number of times some n-gram is allowed to appear in H^R is the min of the number of times it appears in H and R. Take this number and divide by |R|. This is the ngram recall for our system on this one example.

To extend this to more than one summary, we simple average the Rouges at each individual summary.

Now, suppose we have multiple references, R_1, R_2, ..., R_K. In the original Rouge papers and implementation, we compute the score for a single sentence as the max over the references of the Rouge on that individual reference. In other words, our score is the score against a single reference, where that reference is chosen optimistically.

In later Rouge paper and implementation, this changed. In the single-reference case, our score was |H^R|/|R|. In the multiple reference setting, it is |H^(R_1 + R_2 + ... + R_K)|/|R_1 + R_2 + ... + R_K|, where + denotes bag union. Apparently this makes the evaluation more stable.

(As an aside, there is no notion of a "too long" penalty because all system output is capped at some fixed length, eg., 100 words.)

Enough about how Rouge works. Let's talk about how my DUC summarization system worked back in 2006. First, we run BayeSum to get a score for each sentence. Then, based on the score and about 10 other features, we perform sentence extraction, optimized against Rouge. Many of these features are simple patterns; the most interesting (for this post) is my "MMR-like" feature.

MMR (Maximal Marginal Relevance) is a now standard technique in summarization that aims to allow your sentence extractor to extract sentences that aren't wholly redundant. The way it works is as follows. We score each sentence. We pick as our first sentence the sentence with the highest score. We the rescore each sentence to a weighted linear combination of the original score and minus the similarity between the proposed second sentence and its similarity to the first. Essentially, we want to punish redundancy, weighted by some parameter a.

This parameter is something that I tune in max-Rouge training. What I found was that at the end of the day, the value of a that is found by the system is always negative, which means that instead of disfavoring redundancy, we're actually favoring it. I always took this as a notion that human summaries really aren't that diverse.

The take-home message is that if you can opportunistically pick one good sentence to go in your summary, the remaining sentences you choose should be as similar to that one was possible. It's sort of an exploitation (not exploration) issue.

The problem is that I don't think this is true. I think it's an artifact, and probably a pretty bad one, of the "new" version of Rouge with multiple references. In particular, suppose I opportunistically choose one good sentence. It will match a bunch of ngrams in, say, reference 1. Now, suppose as my second sentence I choose something that is actually diverse. Sure, maybe it matches something diverse in one of the references. But maybe not. Suppose instead that I pick (roughly) the same sentence that I chose for sentence 1. It won't re-match against ngrams from reference 1, but if it's really an important sentence, it will match the equivalent sentence in reference 2. And so on.

So this is all nice, but does it happen? It seems so. Below, I've taken all of the systems from DUC 2006 and plotted (on X) their human-graded Non-Redundancy scores (higher means less redundant) against (on Y) their Rouge-2 scores.

Here, we clearly see (though there aren't even many data points) that high non-redundacy means low Rouge-2. Below is Rouge-SU4, which is another version of the metric:

Again, we see the same trend. If you want high Rouge scores, you had better be redundant.

The point here is not to gripe about the metric, but to point out something that people may not be aware of. I certainly wasn't until I actually started looking at what my system was learning. Perhaps this is something that deserves some attention.


  1. Hal,

    We noticed exactly the same thing in our DUC2007 submission. We used a generic optimizer to pick the feature weights by doing max-delta ROUGE training (we pick that sentence that maximizes the delta in the ROUGE of the summary before and after the addition of the sentence). We found that we were ranked in the top 10 when ranked according to all three ROUGE metrics (1, 2 and SU4) but ranked almost last (27/30) when our summaries were evaluated on the non-redundancy metric.

    Do you think that the "old" definition of ROUGE would not suffer from the same issue ?

    PS: Funnily, the older ROUGE paper[1] still has the newer formula[2] even though it states that it picks the best reference.

    [1] Lin, Chin-Yew. 2004a. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

    [2] Lin, Chin-Yew. 2004b. Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?. In Proceedings of the NTCIR Workshop 4, Tokyo, Japan, June 2 - June 4, 2004.

  2. You should also note that human evaluation redundancy , as defined by DUC, is the linguistic redundancy: whether a pronoun should replace a person name... This does not cover content redundancy.

    For me, the fact that redundancy is favored by ROUGE only shows our failure to merge the content of two sentences giving different details about the same thing.


  3. I am so happy to get some hero gold and the hero online gold is given by my close friend who tells me that the hero online money is the basis to enter into the game. Therefore, I should buy hero gold with the spare money and I gain some hero money from other players.

  4. 酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

  5. 艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作兼差打工假日兼職兼職工作酒店兼差兼差打工兼差日領工作晚上兼差工作酒店工作酒店上班酒店打工兼職兼差兼差工作酒店上班等,想了解酒店相關工作特種行業內容,想兼職工作日領假日兼職兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

    一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已,對水水們的上班安全一點保障都沒有!艾葳酒店經紀公司的水水們上班時全程媽咪作陪,不需擔心!只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

    等相關服務 幫您快速的實現您的夢想~!!

  6. hi friends I am very happy because I found a excelent blog like More complaining about automatic evaluation"...
    your blog is very profesional thanks a lot

  7. Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

    sesli sohbet
    sesli chat
    sesli sohbet sitesi
    sesli chat sitesi
    sesli sohpet
    kamerali sohbet
    kamerali chat
    webcam sohbet