21 May 2006

Getting Started In: Summarization

I'll kick off the "Getting Started In" series with summarization, since it is near and dear to my heart.

Warning: I should point out at the get-go that these posts are not intended to be comprehensive or present a representative sample of work in an area. These are blog posts, not survey papers, tutorials or anything of that sort. If I don't cite a paper, it's not because I don't think it's any good. It's also probably not a good idea to base a career move on what I say.

Summarization is the task of taking a long utterance (or a collection of long utterances) and producing a short utterance. The most common is when utterance=document, but there has also been work on speech summarization too. There are roughly three popular types of summarization: headline generation, sentence extraction and sentence compression (or document compression). In sentence extraction, a summary is created by selecting a bunch of sentences from the original document(s) and gluing them together. In headline generation, a very short summary (10 words or so) is created by selecting a bunch of words from the original document(s) and gluing them together. In sentence (document) compression a summary is created by dropping words and phrases from a sentence (document).

One of the big problems in summarization is that if you give two humans the same set of documents and ask them to write a summary, they will do wildly different things. This happens because (a) its unclear what information is important and (b) each human has different background knowledge. This is partially alleviated by moving to a task-specific setting, like the query-focused summarization model, which has grown increasingly popular in the past few years. In the query-focused setting, a document (or document collection) is provided along with a user query that serves to focus the summary on a particular topic. This doesn't fix problem (b), but goes a long way to fixing (a), and human agreement goes up dramatically. Another advantage to the query-focused setting is that, at least with news documents, the most important information is usually presented first (this is actually stipulated by many newspaper editor's guidelines). This means that producing a summary by just taking leading sentences often does incredibly well.

A related problem is that of evaluation. The best option is to do a human evaluation, preferably in some simulated real world setting. A reasonable alternative is to ask humans to write reference summaries and compare system output to these. The collection of Rouge metrics has been designed to automate this comparison (Rouge is essentially a collection of similarity metrics for matching human to system summaries). Overall, however, evaluation is a long-standing and not-well-solved problem in summarization.

Techniques for summarization vary by summary type (extraction, headline, compression).

The standard recipe for sentence extraction works as follows. A summary is created by first extracting the "best" sentence according to a scoring module. The second sentence is selected by finding the next-best sentence according to the scoring module, minus some redundancy penalty (we don't want to extract the same information over and over again). This process ends when the summary is sufficiently long. The scoring component assigns to each sentence in the original document collection a score that says how important it is (in query-focused summarization, for instance, the word overlap between the sentence and the query would be a start). The redundancy component typically computes the similarity between a sentence and the previously extracted sentences.

Headline generation and sentence compression have not yet reached a point of stability in the same way that sentence extraction has. A very popular and successful approach to headline generation is to train a hidden Markov model something like what you find in statistical machine translation (similar to IBM model 1, for those familiar with it). For sentence compression, one typically parses a sentence and then attempts to summarize it by dropping words and phrases (phrases = whole constituents).

Summarization has close ties to question answering and information retrieval; in fact, having some limited background in standard IR techniques (tf-idf, vector space model, etc.) are pretty much necessary in order to understand what goes on in summarization.

Here are some papers/tutorials/slides worth reading to get one started (I'd also recommend Inderjeet Mani's book, Automatic Summarization if you don't mind spending a little money):

  1. Background IR material
  2. Sentence extraction
    1. Marcu's ACL tutorial
    2. Multi-Document Summarization By Sentence Extraction
    3. Automated Multi-document Summarization in NeATS
    4. A Trainable Document Summarizer
  3. Headline generation: Automatic Headline Generation for Newspaper Stories
  4. Sentence compression: Statistics-Based Summarization --- Step One: Sentence Compression
  5. Other:
    1. Discussion: What might be in a summary?
    2. Automatic evaluation: ROUGE: a Package for Automatic Evaluation of Summaries
    3. Recent fun stuff: Cut and paste based text summarization, Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment and A Statistical Approach for Automatic Speech Summarization

6 comments:

Scott A. said...

I don't have anything {insightful} to say just yet, but thanks for the interesting links and book recommendations! :-)

Anonymous said...

Thanks the intro and the links to getting start.

. said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex