natural language processing blog: More complaining about automatic evaluation

I remember a few years ago complaining about automatic evaluation at conference was the thing to do. (Ironically, so was writing papers about automatic evaluation!) Things are saner now on both sides. While what I'm writing here is interpretable as a gripe, it's really intended as a "did anyone else notice this" because it's somewhat subtle.

The evaluation metric I care about is Rouge, designed for summarization. The primary difference between Rouge and Bleu is that Rouge is recall-oriented while Bleu is precision-oriented. The way Rouge works is as follows. Pick an ngram size. Get a single system summary H and a single reference summary R (we'll get to multiple references shortly). Let |H| denote the size of bag the defined by H and let |H^R| denote the bag intersection. Namely, the number of times some n-gram is allowed to appear in H^R is the min of the number of times it appears in H and R. Take this number and divide by |R|. This is the ngram recall for our system on this one example.

To extend this to more than one summary, we simple average the Rouges at each individual summary.

Now, suppose we have multiple references, R_1, R_2, ..., R_K. In the original Rouge papers and implementation, we compute the score for a single sentence as the max over the references of the Rouge on that individual reference. In other words, our score is the score against a single reference, where that reference is chosen optimistically.

In later Rouge paper and implementation, this changed. In the single-reference case, our score was |H^R|/|R|. In the multiple reference setting, it is |H^(R_1 + R_2 + ... + R_K)|/|R_1 + R_2 + ... + R_K|, where + denotes bag union. Apparently this makes the evaluation more stable.

(As an aside, there is no notion of a "too long" penalty because all system output is capped at some fixed length, eg., 100 words.)

Enough about how Rouge works. Let's talk about how my DUC summarization system worked back in 2006. First, we run BayeSum to get a score for each sentence. Then, based on the score and about 10 other features, we perform sentence extraction, optimized against Rouge. Many of these features are simple patterns; the most interesting (for this post) is my "MMR-like" feature.

MMR (Maximal Marginal Relevance) is a now standard technique in summarization that aims to allow your sentence extractor to extract sentences that aren't wholly redundant. The way it works is as follows. We score each sentence. We pick as our first sentence the sentence with the highest score. We the rescore each sentence to a weighted linear combination of the original score and minus the similarity between the proposed second sentence and its similarity to the first. Essentially, we want to punish redundancy, weighted by some parameter a.

This parameter is something that I tune in max-Rouge training. What I found was that at the end of the day, the value of a that is found by the system is always negative, which means that instead of disfavoring redundancy, we're actually favoring it. I always took this as a notion that human summaries really aren't that diverse.

The take-home message is that if you can opportunistically pick one good sentence to go in your summary, the remaining sentences you choose should be as similar to that one was possible. It's sort of an exploitation (not exploration) issue.

The problem is that I don't think this is true. I think it's an artifact, and probably a pretty bad one, of the "new" version of Rouge with multiple references. In particular, suppose I opportunistically choose one good sentence. It will match a bunch of ngrams in, say, reference 1. Now, suppose as my second sentence I choose something that is actually diverse. Sure, maybe it matches something diverse in one of the references. But maybe not. Suppose instead that I pick (roughly) the same sentence that I chose for sentence 1. It won't re-match against ngrams from reference 1, but if it's really an important sentence, it will match the equivalent sentence in reference 2. And so on.

So this is all nice, but does it happen? It seems so. Below, I've taken all of the systems from DUC 2006 and plotted (on X) their human-graded Non-Redundancy scores (higher means less redundant) against (on Y) their Rouge-2 scores.

Here, we clearly see (though there aren't even many data points) that high non-redundacy means low Rouge-2. Below is Rouge-SU4, which is another version of the metric:

Again, we see the same trend. If you want high Rouge scores, you had better be redundant.

The point here is not to gripe about the metric, but to point out something that people may not be aware of. I certainly wasn't until I actually started looking at what my system was learning. Perhaps this is something that deserves some attention.

5 comments:

DesiLinguist04 April, 2008 16:17
Hal,

We noticed exactly the same thing in our DUC2007 submission. We used a generic optimizer to pick the feature weights by doing max-delta ROUGE training (we pick that sentence that maximizes the delta in the ROUGE of the summary before and after the addition of the sentence). We found that we were ranked in the top 10 when ranked according to all three ROUGE metrics (1, 2 and SU4) but ranked almost last (27/30) when our summaries were evaluated on the non-redundancy metric.

Do you think that the "old" definition of ROUGE would not suffer from the same issue ?

Nitin
PS: Funnily, the older ROUGE paper[1] still has the newer formula[2] even though it states that it picks the best reference.

[1] Lin, Chin-Yew. 2004a. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

[2] Lin, Chin-Yew. 2004b. Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?. In Proceedings of the NTCIR Workshop 4, Tokyo, Japan, June 2 - June 4, 2004.
Ben15 April, 2008 14:04
You should also note that human evaluation redundancy , as defined by DUC, is the linguistic redundancy: whether a pronoun should replace a person name... This does not cover content redundancy.

For me, the fact that redundancy is favored by ROUGE only shows our failure to merge the content of two sentences giving different details about the same thing.

Benoit.
Anonymous12 May, 2009 10:35
酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店酒店兼差PRETTY GIRL酒店公關酒店小姐彩色爆米花酒店兼職,酒店工作彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀彩色爆米花
Anonymous20 September, 2009 20:02
艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差、打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店當日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作、兼差打工、假日兼職、兼職工作、酒店兼差、兼差、打工兼差、日領工作、晚上兼差工作、酒店工作、酒店上班、酒店打工、兼職、兼差、兼差工作、酒店上班等,想了解酒店相關工作和特種行業內容,想兼職工作日領、假日兼職、兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚，無業績壓力，無脫秀無喝酒壓力，高層次會員制客源，工作輕鬆，可日領、現領。
一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已，對水水們的上班安全一點保障都沒有！艾葳酒店經紀公司的水水們上班時全程媽咪作陪，不需擔心！只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職、缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??
艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,
等相關服務幫您快速的實現您的夢想~!!
generic viagra21 May, 2010 11:31
hi friends I am very happy because I found a excelent blog like More complaining about automatic evaluation"...
your blog is very profesional thanks a lot

04 April 2008

More complaining about automatic evaluation

5 comments: