Let's pretend that someone gave you one million document/summary pairs. If you like single-document, pretend they're single-document; if you like multi-document, pretend they're multi-document.
For those of us who work on summarization, this seems like it would be a pretty cool gift. Most of the corpora we're used to using have, at best, a few hundred such pairs. Some (eg., for headline generation) have more, but then I didn't allow you to pretend that these were headlines (I can get you billions of these, if you really want them).
So here's the question I pose: what would you do with a million summaries?
I actually have a hard time answering this question. Sure, we have whatever our favorite sentence extraction method is. If it's learned at all, it's probably learned over a dozen or so features: position, length, similarity to query (if a query exists), similarity to document centroid, etc. This would probably be optimized against some automatic measure like one of the many flavors of Rouge. Fitting a dozen parameters on a corpus of 100 examples is probably pretty reasonable and we have some results that suggest that we've gone about as far as we can go with sentence extraction (at least with respect to the DUC corpora); see, for instance, section 6.6 of my thesis. Here, we see that we're pretty much matching oracle performance at sentence extraction on DUC04 data when evaluated using Rouge(I've independently seen other people present similar results, so I think it's replicable). (Yes, I'm aware there are caveats here: the use of Rouge, the fixation to one corpus, etc.)
But now we have a million doc/sum pairs. Fitting a dozen parameters on a million examples seems a bit wasteful. It also seems a bit wasteful to continue to focus on sentence extraction in this case. Why? Well, first, we've already "solved" this problem (the quotes indicate the caveats hinted at above). Second, I have a seriously hard time thinking of too many more high level features that I could possibly tune (my best entry ever into DUC---I think we came in second or third, depending on the scoring---had about 25, many of which ended up getting very very small weights).
So, being me, my next thought is: do word alignment on the summaries, like what they do in machine translation. It turns out that somebody has already tried this, with a reasonable amount of success. In all seriousness, if I were to try something like this again, I think I would throw out the "phrase" issue and deal with words; probably also consider throwing out the HMM issue and do something akin to Model 1. The key difference is that I would continue to include the additional features, like stem-identity, WordNet, etc. I might also throw in some word clustering just for fun.
So let's say that the alignments worked. Now what? We could decode, of course, by intersecting the learned alignment model with a language model. I think this would be a really bad idea, essentially because I don't think there's enough information in the alignments to actually produce new summaries; just enough to get reasonable alignments.
So now we've got a million document/summary pairs that have been aligned. What now?
You could say "learn to create abstracts", but I'm actually not particularly thrilled with this idea, either. (Why? Well, there's a long story, but the basic problem is that if you ask humans to write summaries, they're lazy. What this means is that they do a lot of word copying, at least up to the stem. If you look in the alignments paper, there are some results that say that over half of the words in the summary are aligned to identical words (stems) in the document, even with a human doing the alignment. What this means is that if you're using a lexically-based scoring method, like Rouge, odds are against you if you ever change a word because chances are the human writing the summary didn't change it.)
You could suggest trying to learn to do compression, which is probably what I'd look at most seriously. But I also don't think we have a really good understanding of this. In the Searn paper, we show how to use Searn to compute compressions, but to be honest it's really slow and I don't think it's really scalable to 1 million doc/sum pairs. But I suppose I would probably start with something like that.
But that's what I would do. What would you do?
(Incidentally, if you want to ask me: "do you have a million summaries?" the answer is "no." I have about 120,000. But there are some complications with this data. Maybe I'll post about it in the near future.)
19 March 2008