Summarization is one of the canonical NLP problems, but of all the big ones (MT, speech, IR, IE, etc.) it is in my opinion the most difficult (let the flames begin!). The reason I think it's so hard is that it's unclear what a summary is. When one cannot define a problem well, it is impossible to solve, and impossible ⊆ difficult. There has been an enormous amount of effort to define specific summarization problems that we can solve, a comparable amount of effort on figuring out how to measure how good a solution is, and a lot of effort on building models/systems that can solve these problems. We've tried many things; see the history of DUC for a small, biased, incomplete set.
That said, I think the field has a lot of potential, but not necessarily in trying to mimic what a human would do when asked to produce a summary. I'm interested in doing summarization-like tasks that a human would never be able to do reliably. Here are some examples to give a flavor of what I'm talking about:
- Let me go to a scientific search engine like Rexa or CiteSeer and ask for a summary of "reinforcement learning." The system should know what papers I've read, what papers I've written, the relationships among all the papers in its database, some analysis of which authors are good, and so on. What it produces for me must be markedly different from what it would produce for Satinder Singh.
- Let me go to Amazon and ask about new digital cameras. Maybe I'm interested in spending $200-$300. It should know that I've never bought a digital camera before, or that I've bought 4. I want a summary of important specifications, user comments and so on.
- Let me go to my own inbox and ask for a summary of what I've been discussing with John recently.
One can imagine many more similar tasks, but these are three obvious ones. The nice thing about these is that even partial solutions would be enormously useful (to me, at least...my mom might not care about the first). These are also things that people really can't do well. If someone asks me for something like the first one but, say, on structured prediction instead of reinforcement learning, I can give a summary, but it will be heavily biased. It is worse for the second, where I can basically only produce anecdotal evidence, and virtually impossible for the third.
The most important outstanding issue is how to measure success at such problems. I cannot imagine how to do this without user studies, but people probably felt the same way about MT evaluation a few years ago. How about now? Still, given the amount of personalization in these tasks, I suspect automatic evaluation would be harder here. The most important things to measure in user studies are probably subjective satisfaction, how many times the user had to reformulate or repeat a search, and so on. One could also take a TREC-style approach to comparative pairwise evaluation: mark system 2 down if it missed something system 1 found that a human judged important.
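To make the pairwise idea concrete, here is a minimal sketch in Python. It assumes a human assessor has already marked a set of "important" nuggets and that each system's output has been mapped to the nuggets it covers; the function names and the toy data are hypothetical, not an established evaluation protocol.

```python
# Pairwise, nugget-style comparison: each system is penalized for important
# nuggets that the other system covered but it missed.

def pairwise_penalties(nuggets_a, nuggets_b, important):
    """Return (penalty for system A, penalty for system B)."""
    a, b, imp = set(nuggets_a), set(nuggets_b), set(important)
    penalty_a = len((b - a) & imp)  # important nuggets B found that A missed
    penalty_b = len((a - b) & imp)  # important nuggets A found that B missed
    return penalty_a, penalty_b

# Toy usage: nuggets are just identifiers an assessor assigned to content units.
sys1 = {"q-learning", "policy gradient", "TD(lambda)"}
sys2 = {"q-learning", "reward shaping"}
important = {"policy gradient", "reward shaping", "TD(lambda)"}

print(pairwise_penalties(sys1, sys2, important))  # -> (1, 2)
```

The nice property of this setup is that the human only judges importance of nuggets, not full summaries, so the same judgments can be reused to compare many system pairs.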
There are also tons of subproblems that can be pulled out of this tangle of tasks: most notably, personalization methods, social network analysis, redundancy identification, coherence, information presentation (UI) techniques, generation of multimodal outputs (tables, graphs, etc.), dealing with imperfect input (googling for "reinforcement learning" also returns irrelevant documents), opinion identification, processing of ungrammatical input, anti-spam, and so on. I'm not a huge proponent of solving subtasks that haven't been shown to be necessary, but it is sometimes a helpful approach; we just have to keep the big picture in mind.