19 March 2008

What to do with a million summaries?

Let's pretend.

Let's pretend that someone gave you one million document/summary pairs. If you like single-document, pretend they're single-document; if you like multi-document, pretend they're multi-document.

For those of us who work on summarization, this seems like it would be a pretty cool gift. Most of the corpora we're used to using have, at best, a few hundred such pairs. Some (e.g., for headline generation) have more, but I didn't allow you to pretend that these were headlines (I can get you billions of those, if you really want them).

So here's the question I pose: what would you do with a million summaries?

I actually have a hard time answering this question. Sure, we have whatever our favorite sentence extraction method is. If it's learned at all, it's probably learned over a dozen or so features: position, length, similarity to query (if a query exists), similarity to document centroid, etc. This would probably be optimized against some automatic measure like one of the many flavors of Rouge. Fitting a dozen parameters on a corpus of 100 examples is probably pretty reasonable, and we have some results that suggest we've gone about as far as we can go with sentence extraction (at least with respect to the DUC corpora); see, for instance, section 6.6 of my thesis. There, we see that we're pretty much matching oracle performance at sentence extraction on DUC04 data when evaluated using Rouge (I've independently seen other people present similar results, so I think it's replicable). (Yes, I'm aware there are caveats here: the use of Rouge, the fixation on one corpus, etc.)
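
To make the "dozen or so features" concrete, here's a minimal sketch of that kind of linear sentence extractor. The features and weights below are invented for this example, not taken from any actual DUC system; a real one would fit the weights against Rouge or some other target.

```python
# Illustrative feature-based sentence extractor: a handful of per-sentence
# features combined with a small weight vector. Features and weights are
# invented for this sketch, not taken from any actual DUC system.
import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(c * b[w] for w, c in a.items())
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def extract(sentences, query="", k=2):
    centroid = bow(" ".join(sentences))
    q = bow(query)
    weights = {"position": 1.0, "length": 0.3, "centroid_sim": 2.0, "query_sim": 2.0}
    scored = []
    for i, s in enumerate(sentences):
        feats = {
            "position": 1.0 / (i + 1),                 # earlier sentences score higher
            "length": min(len(s.split()), 30) / 30.0,  # capped, normalized length
            "centroid_sim": cosine(bow(s), centroid),
            "query_sim": cosine(bow(s), q) if query else 0.0,
        }
        scored.append((sum(weights[f] * v for f, v in feats.items()), i, s))
    top = sorted(scored, reverse=True)[:k]
    top.sort(key=lambda x: x[1])                       # restore document order
    return [s for _, _, s in top]
```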

But now we have a million doc/sum pairs. Fitting a dozen parameters on a million examples seems a bit wasteful. It also seems a bit wasteful to continue to focus on sentence extraction in this case. Why? Well, first, we've already "solved" this problem (the quotes indicate the caveats hinted at above). Second, I have a seriously hard time thinking of many more high-level features that I could possibly tune (my best entry ever into DUC, which I think came in second or third depending on the scoring, had about 25 features, many of which ended up getting very, very small weights).

So, being me, my next thought is: do word alignment on the summaries, like what they do in machine translation. It turns out that somebody has already tried this, with a reasonable amount of success. In all seriousness, if I were to try something like this again, I think I would throw out the "phrase" issue and deal only with words; I'd probably also consider throwing out the HMM machinery and doing something akin to Model 1. The key difference is that I would continue to include the additional features, like stem identity, WordNet, etc. I might also throw in some word clustering just for fun.
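
If you haven't seen Model 1 before, here's roughly what the lexical core of that would look like over document/summary word pairs: plain EM, no NULL word, and none of the stem/WordNet/cluster features I just mentioned (those would be the interesting part).

```python
# Plain IBM Model 1 EM over (document words, summary words) pairs: learn
# t[(s, d)] ~ P(summary word s | document word d). No NULL word, no HMM,
# and none of the extra stem/WordNet/cluster features discussed above.
from collections import defaultdict

def model1(pairs, iterations=10):
    t = defaultdict(lambda: 1.0)  # uniform-ish initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for doc, summ in pairs:                # E-step: expected alignment counts
            for s in summ:
                z = sum(t[(s, d)] for d in doc)
                for d in doc:
                    c = t[(s, d)] / z
                    count[(s, d)] += c
                    total[d] += c
        for (s, d), c in count.items():        # M-step: renormalize per document word
            t[(s, d)] = c / total[d]
    return t

pairs = [("the senate approved the budget".split(), "senate passes budget".split())]
t = model1(pairs)
```

Running this over lots of pairs just gives you translation tables t[(summary word, document word)], which is all an alignment model really is at this level.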

So let's say that the alignments worked. Now what? We could decode, of course, by intersecting the learned alignment model with a language model. I think this would be a really bad idea, essentially because I don't think there's enough information in the alignments to actually produce new summaries; just enough to get reasonable alignments.

So now we've got a million document/summary pairs that have been aligned. What now?

You could say "learn to create abstracts", but I'm actually not particularly thrilled with this idea, either. (Why? Well, there's a long story, but the basic problem is that if you ask humans to write summaries, they're lazy. What this means is that they do a lot of word copying, at least up to the stem. If you look in the alignments paper, there are some results that say that over half of the words in the summary are aligned to identical words (stems) in the document, even with a human doing the alignment. What this means is that if you're using a lexically-based scoring method, like Rouge, odds are against you if you ever change a word because chances are the human writing the summary didn't change it.)
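
To see why, here's a back-of-the-envelope version of the problem with unigram Rouge: the moment you paraphrase a word the human copied verbatim, your recall drops, no matter how good the paraphrase is. (The sentences and numbers below are just a toy example.)

```python
# Toy ROUGE-1 recall: unigram overlap with the reference summary.
from collections import Counter

def rouge1_recall(candidate, reference):
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / sum(ref.values())

ref = "the senate approved the budget on friday"
print(rouge1_recall("the senate approved the budget on friday", ref))       # 1.0
print(rouge1_recall("the senate passed the spending plan on friday", ref))  # 0.625
```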

You could suggest trying to learn to do compression, which is probably what I'd look at most seriously. But I also don't think we have a really good understanding of this. In the Searn paper, we show how to use Searn to compute compressions, but to be honest it's really slow and I don't think it's really scalable to 1 million doc/sum pairs. But I suppose I would probably start with something like that.
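
Just to be clear about what I mean by compression (and this is emphatically not Searn, just a toy): treat it as word deletion under a length budget. A real system would score deletions jointly, for importance and grammaticality, rather than word by word as in this sketch.

```python
# Toy compression-as-deletion: keep the highest-scoring words under a length
# budget, preserving original word order. The salience function here is a
# crude stand-in for whatever a learned model would provide.
def compress(sentence, budget, salience):
    words = sentence.split()
    if len(words) <= budget:
        return sentence
    keep = sorted(range(len(words)), key=lambda i: salience(words[i]), reverse=True)[:budget]
    return " ".join(words[i] for i in sorted(keep))

stop = {"the", "a", "an", "of", "on", "in", "to", "and"}
salience = lambda w: 0.0 if w.lower() in stop else float(len(w))
print(compress("the senate approved the budget on friday after a long debate", 6, salience))
```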

But that's what I would do. What would you do?

(Incidentally, if you want to ask me: "do you have a million summaries?" the answer is "no." I have about 120,000. But there are some complications with this data. Maybe I'll post about it in the near future.)

3 comments:

todd. said...

I don't work on summarization so there's a good chance I'm about to ask a stupid question. But I did a summer project on summarization as an undergraduate, and so I often wonder how various ML techniques would work when applied to summarization.

My question, then, is: has anyone had much success using a topic model like latent Dirichlet allocation for summarization? The LDA papers always say it'd be great for that, and it's easy to see how this could be true, but I couldn't find any papers where this had been done. (Admittedly, I only spent 5-10 minutes with Google trying.)

Knowing pretty much nothing about what works for summarization, it seems to me that if I had 1M summaries, that would be the avenue I'd want to research.

hal said...

I don't think it's a dumb question. All I can say for certain is that I've tried something like this and couldn't really get it to work (though I didn't put a huge amount of effort into it). The tack that I took was to take BayeSum and augment it with LDA-style topics. BayeSum is already quite LDA-like... imagine each document constructed from 3 topics: a stop-word topic, a document-specific topic, and a query-specific topic.

There are two places you could add more canonical LDA-style topics here. The first is to allow the query to contain >1 topic; the second is to allow the document to. The latter is totally unhelpful in query-focused summarization because we don't care about the topics in the document that don't deal with the query. The former is pretty unhelpful because queries are already quite a bit more specific than LDA topics turn out to be.
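
(For the curious, here's a cartoon of that three-way mixture; this is just the intuition, not the actual BayeSum model, and the word distributions are made up.)

```python
# Cartoon of the three-way mixture: each word comes from a general/stop-word
# distribution, a document-specific one, or a query-specific one. This is the
# intuition only, not the actual BayeSum model.
import random

def generate_word(general, doc_topic, query_topic, mix=(0.4, 0.4, 0.2)):
    source = random.choices([general, doc_topic, query_topic], weights=mix)[0]
    words, probs = zip(*source.items())
    return random.choices(words, weights=probs)[0]

general = {"the": 0.5, "of": 0.3, "and": 0.2}
doc_topic = {"senate": 0.5, "budget": 0.5}
query_topic = {"vote": 0.6, "friday": 0.4}
print(generate_word(general, doc_topic, query_topic))
```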

It may be useful for non-query-focused summarization, but I have my own concerns about that.


Joseph Turian said...

I've been musing about this question since you posted it.

Let's say we are interested in inducing an underlying representation for text. Rather than using standard unsupervised learning techniques to learn an embedding for each text in isolation, we can add an additional term to the objective: the embedding for a document and the embedding for its summary should be close. This extra component in the training objective helps ensure that the representation for a document actually captures the summary's information.
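
Concretely, something like the following toy objective (just a sketch; the names and the exact form are made up): reconstruct bag-of-words vectors through a linear embedding, plus a penalty pulling each document's embedding toward its summary's.

```python
# Toy version of the objective: a linear embedding W that reconstructs
# bag-of-words vectors, plus a term pulling each document's embedding toward
# its summary's embedding. Names and the exact form are illustrative only.
import numpy as np

def loss(W, docs, sums, lam=1.0):
    d_emb, s_emb = docs @ W, sums @ W              # (n, k) embeddings
    recon = ((d_emb @ W.T - docs) ** 2).sum()      # reconstruct the documents
    closeness = ((d_emb - s_emb) ** 2).sum()       # paired doc/summary should agree
    return recon + lam * closeness

rng = np.random.default_rng(0)
docs, sums = rng.random((5, 50)), rng.random((5, 50))
W = 0.01 * rng.normal(size=(50, 10))
print(loss(W, docs, sums))
```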

What is the story with the 120K summaries?
