
04 April 2008

More complaining about automatic evaluation

I remember that a few years ago, complaining about automatic evaluation at conferences was the thing to do. (Ironically, so was writing papers about automatic evaluation!) Things are saner now on both sides. While what I'm writing here could be read as a gripe, it's really intended as a "did anyone else notice this?", because the issue is somewhat subtle.

The evaluation metric I care about is Rouge, designed for summarization. The primary difference between Rouge and Bleu is that Rouge is recall-oriented while Bleu is precision-oriented. The way Rouge works is as follows. Pick an ngram size. Get a single system summary H and a single reference summary R (we'll get to multiple references shortly). Let |H| denote the size of the bag defined by H, and let |H^R| denote the size of the bag intersection; namely, the number of times some ngram is allowed to appear in H^R is the min of the number of times it appears in H and in R. Take |H^R| and divide by |R|. This is the ngram recall for our system on this one example.

To extend this to more than one summary, we simply average the Rouge scores of the individual summaries.

Now, suppose we have multiple references, R_1, R_2, ..., R_K. In the original Rouge papers and implementation, the score for a single example is the max over the references of the Rouge score against that individual reference. In other words, our score is the score against a single reference, where that reference is chosen optimistically.

In later Rouge papers and implementations, this changed. In the single-reference case, our score was |H^R|/|R|. In the multiple-reference setting, it is |H^(R_1 + R_2 + ... + R_K)| / |R_1 + R_2 + ... + R_K|, where + denotes bag union. Apparently this makes the evaluation more stable.

(As an aside, there is no notion of a "too long" penalty, because all system output is capped at some fixed length, e.g., 100 words.)
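
To make the two variants concrete, here is a minimal sketch (in Python, not the official Rouge toolkit) of ngram recall against a pooled bag of references, plus the original max-over-references variant. Tokenization, stemming, stopword handling, and jackknifing are all ignored, and the function names are mine.

```python
# Minimal sketch of Rouge-n recall as described above -- not the official script.
from collections import Counter

def ngrams(tokens, n):
    """Bag (multiset) of ngrams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(hypothesis, references, n=2):
    """Newer multi-reference behavior: |H^(R_1 + ... + R_K)| / |R_1 + ... + R_K|."""
    H = ngrams(hypothesis, n)
    R = Counter()
    for ref in references:
        R += ngrams(ref, n)          # '+' here is the bag union from the text
    overlap = sum((H & R).values())  # bag intersection: min of the two counts
    return overlap / max(sum(R.values()), 1)

def rouge_n_recall_max(hypothesis, references, n=2):
    """Original multi-reference behavior: best score against any single reference."""
    return max(rouge_n_recall(hypothesis, [ref], n) for ref in references)
```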

Enough about how Rouge works. Let's talk about how my DUC summarization system worked back in 2006. First, we run BayeSum to get a score for each sentence. Then, based on the score and about 10 other features, we perform sentence extraction, optimized against Rouge. Many of these features are simple patterns; the most interesting (for this post) is my "MMR-like" feature.

MMR (Maximal Marginal Relevance) is a now-standard technique in summarization that aims to let your sentence extractor avoid picking sentences that are wholly redundant. The way it works is as follows. We score each sentence. We pick as our first sentence the one with the highest score. We then rescore each remaining sentence as a weighted linear combination of its original score and (negatively) its similarity to the sentence we've already picked. Essentially, we want to punish redundancy, weighted by some parameter a.
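
As a rough sketch of the idea (my paraphrase, not the exact feature in the DUC system), greedy MMR-style extraction with a redundancy weight a might look like this; score and similarity stand in for whatever base relevance score and sentence-similarity measure you already have.

```python
# MMR-style greedy sentence extraction, sketched; a > 0 punishes redundancy.
def mmr_extract(sentences, score, similarity, a=0.5, k=5):
    selected, remaining = [], list(sentences)
    while remaining and len(selected) < k:
        def mmr_score(s):
            if not selected:
                return score(s)
            redundancy = max(similarity(s, t) for t in selected)
            return score(s) - a * redundancy  # a < 0 (what training found) rewards redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```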

This parameter is something that I tune in max-Rouge training. What I found is that, at the end of the day, the value of a found by the system is always negative, which means that instead of disfavoring redundancy, we're actually favoring it. I always took this as an indication that human summaries really aren't that diverse.

The take-home message is that if you can opportunistically pick one good sentence to go in your summary, the remaining sentences you choose should be as similar to that one as possible. It's sort of an exploitation (not exploration) issue.

The problem is that I don't think this is true. I think it's an artifact, and probably a pretty bad one, of the "new" version of Rouge with multiple references. In particular, suppose I opportunistically choose one good sentence. It will match a bunch of ngrams in, say, reference 1. Now, suppose as my second sentence I choose something that is actually diverse. Sure, maybe it matches something diverse in one of the references. But maybe not. Suppose instead that I pick (roughly) the same sentence that I chose for sentence 1. It won't re-match against ngrams from reference 1, but if it's really an important sentence, it will match the equivalent sentence in reference 2. And so on.
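
Here is a toy illustration of that effect, reusing the rouge_n_recall sketch from above; the sentences are made up purely for illustration. The "redundant" summary says its one good sentence twice and still picks up the matching ngrams pooled in from the second reference, while the "diverse" summary adds new content that happens to match nothing.

```python
ref1 = "the economy collapsed after the banks failed".split()
ref2 = "the economy collapsed when several large banks failed".split()
good = "the economy collapsed after the banks failed".split()
other = "unemployment rose sharply in rural areas last year".split()

redundant = good + good   # repeat the one good sentence
diverse = good + other    # add genuinely new (but unmatched) content

print(rouge_n_recall(redundant, [ref1, ref2], n=2))  # higher
print(rouge_n_recall(diverse, [ref1, ref2], n=2))    # lower
print(rouge_n_recall_max(redundant, [ref1, ref2], n=2),
      rouge_n_recall_max(diverse, [ref1, ref2], n=2))  # tied under the old max scoring
```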

So this is all nice, but does it happen? It seems so. Below, I've taken all of the systems from DUC 2006 and plotted (on X) their human-graded Non-Redundancy scores (higher means less redundant) against (on Y) their Rouge-2 scores.

[Scatter plot: human-judged Non-Redundancy (X) versus Rouge-2 (Y) for the DUC 2006 systems.]
Here, we clearly see (though there aren't very many data points) that high non-redundancy means low Rouge-2. Below is the same plot for Rouge-SU4, another version of the metric:

[Scatter plot: human-judged Non-Redundancy (X) versus Rouge-SU4 (Y) for the DUC 2006 systems.]
Again, we see the same trend. If you want high Rouge scores, you had better be redundant.

The point here is not to gripe about the metric, but to point out something that people may not be aware of. I certainly wasn't until I actually started looking at what my system was learning. Perhaps this is something that deserves some attention.

19 March 2008

What to do with a million summaries?

Let's pretend.

Let's pretend that someone gave you one million document/summary pairs. If you like single-document, pretend they're single-document; if you like multi-document, pretend they're multi-document.

For those of us who work on summarization, this seems like it would be a pretty cool gift. Most of the corpora we're used to using have, at best, a few hundred such pairs. Some (e.g., for headline generation) have more, but then I didn't allow you to pretend that these were headlines (I can get you billions of those, if you really want them).

So here's the question I pose: what would you do with a million summaries?

I actually have a hard time answering this question. Sure, we have whatever our favorite sentence extraction method is. If it's learned at all, it's probably learned over a dozen or so features: position, length, similarity to the query (if a query exists), similarity to the document centroid, etc. This would probably be optimized against some automatic measure like one of the many flavors of Rouge. Fitting a dozen parameters on a corpus of 100 examples is probably pretty reasonable, and we have some results suggesting that we've gone about as far as we can go with sentence extraction (at least with respect to the DUC corpora); see, for instance, section 6.6 of my thesis. There, we see that we're pretty much matching oracle performance at sentence extraction on the DUC04 data when evaluated using Rouge (I've independently seen other people present similar results, so I think it's replicable). (Yes, I'm aware there are caveats here: the use of Rouge, the fixation on one corpus, etc.)
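
For concreteness, the kind of extractor I have in mind is nothing more than a linear score over a handful of features, with weights tuned against Rouge; something like the sketch below, where the feature functions are illustrative stand-ins rather than the actual features from any particular system.

```python
# A bare-bones linear sentence scorer of the sort described above.
def sentence_features(sentence, position, centroid_sim, query_sim):
    return {
        "position": 1.0 / (1.0 + position),               # earlier sentences score higher
        "length": min(len(sentence.split()) / 25.0, 1.0),
        "centroid_sim": centroid_sim,                     # similarity to document centroid
        "query_sim": query_sim,                           # similarity to the query, if any
    }

def linear_score(features, weights):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```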

But now we have a million doc/sum pairs. Fitting a dozen parameters on a million examples seems a bit wasteful. It also seems a bit wasteful to continue to focus on sentence extraction in this case. Why? Well, first, we've already "solved" this problem (the quotes indicate the caveats hinted at above). Second, I have a seriously hard time thinking of too many more high level features that I could possibly tune (my best entry ever into DUC---I think we came in second or third, depending on the scoring---had about 25, many of which ended up getting very very small weights).

So, being me, my next thought is: do word alignment on the summaries, like what they do in machine translation. It turns out that somebody has already tried this, with a reasonable amount of success. In all seriousness, if I were to try something like this again, I think I would throw out the "phrase" issue and deal with words; probably also consider throwing out the HMM issue and do something akin to Model 1. The key difference is that I would continue to include the additional features, like stem-identity, WordNet, etc. I might also throw in some word clustering just for fun.
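
For what it's worth, a bare Model 1 style aligner for doc/summary pairs is only a few lines of EM; here is a minimal sketch (lexical translation table only, with none of the extra features like stem identity or WordNet that I'd actually want to include).

```python
# IBM Model 1 style EM for aligning summary words to document words (sketch).
from collections import defaultdict

def model1_em(pairs, iterations=10):
    """pairs: list of (doc_tokens, summary_tokens).
    Returns t[s][d], the probability of summary word s given document word d."""
    t = defaultdict(lambda: defaultdict(lambda: 1e-3))  # near-uniform initialization
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)
        for doc, summ in pairs:                  # E-step: expected alignment counts
            doc = ["<NULL>"] + doc               # allow summary words to stay unaligned
            for s in summ:
                norm = sum(t[s][d] for d in doc)
                for d in doc:
                    frac = t[s][d] / norm
                    counts[s][d] += frac
                    totals[d] += frac
        for s in counts:                         # M-step: renormalize
            for d in counts[s]:
                t[s][d] = counts[s][d] / totals[d]
    return t
```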

So let's say that the alignments worked. Now what? We could decode, of course, by intersecting the learned alignment model with a language model. I think this would be a really bad idea, essentially because I don't think there's enough information in the alignments to actually produce new summaries; just enough to get reasonable alignments.

So now we've got a million document/summary pairs that have been aligned. What now?

You could say "learn to create abstracts," but I'm actually not particularly thrilled with this idea, either. (Why? Well, there's a long story, but the basic problem is that if you ask humans to write summaries, they're lazy. What this means is that they do a lot of word copying, at least up to the stem. If you look at the alignments paper, there are results showing that over half of the words in a summary are aligned to identical words (stems) in the document, even with a human doing the alignment. The upshot is that if you're using a lexically-based scoring method, like Rouge, the odds are against you whenever you change a word, because chances are the human writing the summary didn't change it.)

You could suggest trying to learn to do compression, which is probably what I'd look at most seriously. But I also don't think we have a very good understanding of this. In the Searn paper, we show how to use Searn to compute compressions, but to be honest it's really slow, and I don't think it scales to a million doc/sum pairs. Still, I suppose I would probably start with something like that.

But that's what I would do. What would you do?

(Incidentally, if you want to ask me: "do you have a million summaries?" the answer is "no." I have about 120,000. But there are some complications with this data. Maybe I'll post about it in the near future.)

03 February 2008

The behemoth, PubMed

The friend I crashed with while attending SODA is someone I've known since we were five years old. (Incidentally, there's someone in the NLP world whom I've known from even earlier... small world.) Anyway, the friend I stayed with is just finishing med school at UCSF and will soon be staying there for residency. His specialty is neurosurgery, and his interests are in neural pathologies. He spent some time doing research on Alzheimer's disease, effectively by studying mice (there's something I feel sort of bad about finding slightly amusing about mice with Alzheimer's disease). Needless to say, in the process of doing that research, he made nearly daily use of PubMed. (For those of you who don't know, PubMed is like the ACL Anthology, but with hundreds of thousands of papers, with new ones being added by the truckload daily, and with a bunch of additional things, like ontologies and data sets.)

There are two things I want to talk about regarding PubMed. I think both of them admit very interesting problems that we, as NLPers, are qualified to tackle. I think the most important thing, however, is opening and maintaining a wide channel of communication. There seems to be less interaction than there should be between people who do (for instance) biomedical informatics (we have a fairly large group here) and what I'll term mainstream NLPers. Sure, there have been BioNLP workshops at ACLs, but I really think both communities would be well served by interacting more. And for those of you who don't want to work on BioNLP because it's "just a small domain problem," let me assure you: it is not easy... don't think of it in the same vein as a true "sublanguage" -- it is quite broad.

I suppose I should give a caveat that my comments below are based on a sample size of one (my friend), so it may not be totally representative. But I think it generalizes.

Search in PubMed, from what I've heard, is good in the same ways that web search is good and bad in the same ways that web search is bad. It is good when you know what you're looking for (i.e., you know the name for it) and bad otherwise. One of the most common sorts of queries my friend wants to do is something like "show me all the research on proteins that interact in some way with XXX in the context of YYY," where XXX is (e.g.) a gene and YYY is (e.g.) a disease. The key is that we don't know which proteins these are, and so it's hard to query for them directly. I know that this is something the folks at Penn (and probably elsewhere) are working on, and I get the impression that a good solution to this problem would make lots and lots of biologists much happier (and more productive). One thing that was particularly interesting, however, is that he was pretty averse to using structured queries like the one I gave above. He effectively wants to search for "XXX YYY" and have the system realize that XXX is a gene, YYY is a disease, and that it's "obvious" that what he wants is proteins that interact with (or even, for instance, pathways that contain) XXX in the context of disease YYY. On the other hand, if YYY were another gene, then he'd probably be looking for diseases or pathways that are regulated by both XXX and YYY. It's a bit complex, but I don't think this is particularly beyond our means.
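
Just to make that kind of query interpretation concrete, here is a toy sketch: guess the entity types of the two terms and pick a structured-query template accordingly. The entity typer and the templates are hypothetical placeholders, not a real PubMed interface.

```python
# Toy mapping from a two-term keyword query to an implicit structured query.
TEMPLATES = {
    ("gene", "disease"): "proteins or pathways involving {0} in the context of {1}",
    ("gene", "gene"):    "diseases or pathways regulated by both {0} and {1}",
}

def interpret(term1, term2, entity_type):
    """entity_type: a function mapping a term to 'gene', 'disease', etc."""
    key = (entity_type(term1), entity_type(term2))
    template = TEMPLATES.get(key)
    if template is None:
        return "keyword search: {0} {1}".format(term1, term2)  # fall back to plain search
    return template.format(term1, term2)
```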

The other thing I want to talk about is summarization. PubMed actually archives a fairly substantial collection of human-written summaries. These fall into one of two categories. The first, called "systematic reviews," is more or less what we would think of as summaries; however, they are themselves quite long and complex, and they're really not anything like sentence extracts. The second, called "meta-analyses," is really not like summaries at all. In a meta-analysis, an author will consider a handful of previously published papers on, say, the effects of smoking on lifespan. He will take the data and results published in those individual papers and actually do novel statistical analyses on them to see how well the conclusions hold.

From a computational perspective, the automatic creation of meta-analyses would essentially be impossible until we have machines that can actually run experiments in the lab. Systematic reviews, on the other hand, while totally outside the scope of our current technology, are things we could hope to do. And they give us lots of training data. There are somewhere around ten to forty thousand systematic reviews on PubMed, each about 20 pages long, and each with references back to papers, almost all of which are themselves in PubMed. Finding systematic reviews older than a few years ago (when they began being tagged explicitly) has actually sprouted a tiny cottage industry. And PubMed nicely makes all of its data available for download, without our having to crawl, which makes life much easier for us.

My friend warns that it might not be a good idea to use all systematic reviews, but only those from top journals (they tend to be less biased and better written). However, insofar as I don't think we'd have any hope of producing something even as good as a systematic review from the worst journal in the world, I'm not sure this matters much. Maybe all it says is that, for evaluation, we should be careful to evaluate against the top.

Now, I should point out that people in biomedical informatics have actually been working on the summarization problem too. From what I can tell, the majority of the effort there is on rule-based systems that build on top of other rule-based systems that extract genes, pathways, and relations. People at the National Library of Medicine, Rindflesch and Fiszman, use SemRep to do summarization, and they have tried applying it to some medical texts. Two other people I know doing this kind of work are Kathy McKeown and Greg Whalen, both at Columbia. The Columbia group has access to a medically informed NLP concept extractor called MedLEE, which gives them a leg up on the low-level processing details. If you search for 'summarization medical OR biomedical' in Google Scholar, you'll get a raft of hits (~9000).

Now, don't get me wrong -- I'm not saying that this is easy -- but for summarization folks who are constantly looking for "natural artifacts" of their craft, this is an enormous repository.

03 May 2007

Future DUC Plans

(Guest post by Lucy Vanderwende -- Thanks, Lucy!)

There was a general sigh of relief upon hearing the proposal for the next DUC to occur in November 2008, giving us all a nice long lead time for system development. The change comes about by co-locating DUC with TREC, which predictably occurs once a year, in November, rather than the erratic schedule that DUC has followed in the past few years. As a result, system submissions will no longer be due at the same time as the deadline for submission to a major NLP conference: another deep sigh of relief. Perhaps we will see more DUC workshop papers extended and published as conference publications, now that there is more time to do system analysis after the DUC results have been made available. Did you know that we should send NIST a reference to every paper published that uses the DUC data?

Co-locating with TREC has another change in store for DUC (though not yet set in stone). DUC will, likely, have two tracks: summarization and question-answering (the QA track will move out of TREC and into DUC). At the DUC2007 workshop, Hoa Trang Dang gave us an overview of the types of questions and question templates that are currently the focus of TREC QA. Workshop participants were very interested, and the opportunities for synergy between the two communities seem plentiful.

For summarization, DUC will continue to have a main task and a pilot task. NIST's proposal is for the 2008 main task to be Update Summarization (producing short, ~100-word, multi-document summaries in the context of a set of earlier articles); everyone at the workshop was excited by this proposal, and several groups have already worked on update summarization since it was the pilot task for 2007. NIST's proposal for the 2008 pilot task is summarization of opinions from blogs; this is less certain, and there are some remaining questions, for example whether it would be single-blog or multiple-blog summarization, summarization of the initial blog post or the post plus comments, and what the summary size ought to be. There was a lot of enthusiasm for this topic, especially because there has been a TREC track on blogs in which, for the last two years, they have tagged blogs as positive or negative with respect to opinions, i.e., there's at least some training data.

For more information, check back with the main website for DUC.