The vast majority of work on summarization is at the text level. A document (or collection of documents) comes in as text, and a summary goes out as text. Although I'm certainly not one of the ones pushing summarization of speech, I think it's quite an interesting task. (Here, by speech summarization, I pretty much mean speech in, speech out ... though similar arguments could be made for text in, speech out applications.) A lot of my conclusion comes from personal experience, but I also had a really great conversation with Alan Black (who works on speech synthesis problems).
Let me motivate this by an personal anecdote. My mom tends to leave really long voicemail messages. Like, really long. Usually she overruns the alloted time and has to call back. It's not uncommon for her messages to span three mailbox slots. (Sorry, mom, if you're reading this!) The most interesting thing about these messages is that they really don't contain that much information. In fact, there's usually only one or two sentences in the whole thing that really matter (at least from an "information" sense).
The problem is that you can't "skim" voicemail. If I get an equally long email (say, one that would take me 3 minutes to read), I can fairly easily skim it to get the gist. Especially with some tailored highlighting, I'm able to absorb much more text per second than speech per second. Word on the street (or at least around the croissant table at ACL) is that studies have been done that show that people can get at information from text just as fast if its presented in a summary as if its not summarized at all, essentially because we're really good at skimming (I knew high school English taught me something!). (Incidentally, if anyone has a reference for this, let me know. I believe I saw it ages ago, but I can't remember where and I'd love to have it.)
So one solution to the "Mom's message" problem is to run automatic speech recognition and turn it into an email. This has the added advantage that I check my email much more frequently than I check my voicemail, it gives me a record of the message after I've deleted the speech signal to save space, and allows for indexing. But there are also times when I don't have easy access to a computer and I really do want listen to the voicemail. This is where speech summarization becomes interesting.
Anyway, if you're now convinced that speech summarization is interesting, well, I don't really know what to tell you since I really haven't looked at this problem. The work I am aware of (and this is a very biased sample...I apologize that I missed something) is:
- Summarizing Speech Without Text Using Hidden Markov Models by Sameer Maskey and Julia Hirshberg
- Sentence-extractive automatic speech summarization and evaluation techniques by Makoto Hirohata, Yosuke Shinnaka, Koji Iwano and Sadaoki Furui (Furui has been working in this area for a while...I can't find this exact paper, but his lab has others)
- Evaluation of Sentence Selection for Speech Summarization by Xiaodan Zhu and Gerald Penn