The Document Understanding Conference features a yearly summarization competition. For the past few years, the task has been query-focused summarization of clusters of (essentially entirely) news documents. There will be a pilot task next year, and based on comments made during DUC 2006, it appears it will be one of the following:
1. Multidocument, (probably) query-focused summarization of blog posts.
2. Multidocument summarization of news, with respect to known information.
The idea in (1) is that there are several "novel" aspects one has to deal with. First, blog posts are out of domain for most parsers, etc., which means we'll get noisy input, though not as noisy as speech. Second, although the blog posts (the blogs would be from the TREC blog collection) will essentially all focus on news topics (sadly, NLPers is not in the corpus), they are almost certainly more emotionally fueled than vanilla news. The identification of sentiment and opinion, which are both in vogue these days, will potentially become more useful.
The idea in (2) is that in most real world situations, the user who desires the summary has some background information on the topic. The summarization engine would be handed a collection of 5-10 documents that the user has presumably read, then 5-10 new documents to be summarized. The novel aspect of this task is, essentially, detecting novelty.
Personally, I think both are potentially interesting, though not without their drawbacks. The biggest potential problem I see with the blogs idea is that I think we're reentering the phase of not being able to achieve any sort of human agreement without fairly strict guidelines. It's unclear if, say, two viewpoints are expressed, how a summary should reflect these. The biggest problem I see with idea (2) is that it is very reminiscent of some TREC-style tasks, like TDT, and I'm not sure it calls for anything more than essentially doing normal query-focused summarization with an MMR-style term to account for "known information." That's not to say these aren't worth exploring -- I think both are quite interesting -- but, as always, we should be careful.
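To make the "MMR-style term" concrete, here is a minimal sketch of what such a baseline might look like. Everything in it is an assumption for illustration: the function names (`bow`, `cosine`, `mmr_update_summary`), the bag-of-words cosine similarity, and the trade-off weight `lam` are mine, not anything DUC has specified. The only twist on ordinary MMR is that the already-read sentences are placed in the redundancy pool before selection starts.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector as a Counter (illustrative tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counters; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mmr_update_summary(query, known, candidates, k=2, lam=0.5):
    """Greedily pick up to k candidate sentences by an MMR-style score.

    Assumption: `known` sentences (from documents the user has read) are
    treated exactly as if they had already been selected, so candidates
    that repeat them are penalized.
    """
    q = bow(query)
    seen = [bow(s) for s in known]  # redundancy pool starts with known info
    pool = list(candidates)
    chosen = []
    while pool and len(chosen) < k:

        def score(s):
            v = bow(s)
            # Redundancy is the closest match among known + already chosen.
            redundancy = max((cosine(v, t) for t in seen), default=0.0)
            return lam * cosine(v, q) - (1 - lam) * redundancy

        best = max(pool, key=score)
        chosen.append(best)
        seen.append(bow(best))
        pool.remove(best)
    return chosen
```

Called with an empty `known` list, this reduces to plain query-focused MMR; passing the sentences from the already-read documents is the only change needed for the "known information" setting, which is roughly why I suspect the task may not demand much more than this.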