I'll kick off the "Getting Started In" series with summarization, since it is near and dear to my heart.
Warning: I should point out from the get-go that these posts are not intended to be comprehensive or to present a representative sample of work in an area. These are blog posts, not survey papers, tutorials or anything of that sort. If I don't cite a paper, it's not because I don't think it's any good. It's also probably not a good idea to base a career move on what I say.
Summarization is the task of taking a long utterance (or a collection of long utterances) and producing a short one. The most common setting is utterance = document, but there has also been work on speech summarization. There are roughly three popular types of summarization: sentence extraction, headline generation and sentence (or document) compression. In sentence extraction, a summary is created by selecting a handful of sentences from the original document(s) and gluing them together. In headline generation, a very short summary (ten words or so) is created by selecting words from the original document(s) and gluing them together. In sentence (document) compression, a summary is created by dropping words and phrases from a sentence (document).
One of the big problems in summarization is that if you give two humans the same set of documents and ask them to write a summary, they will do wildly different things. This happens because (a) it's unclear what information is important and (b) each human has different background knowledge. This is partially alleviated by moving to a task-specific setting, like the query-focused summarization model, which has grown increasingly popular in the past few years. In the query-focused setting, a document (or document collection) is provided along with a user query that serves to focus the summary on a particular topic. This doesn't fix problem (b), but it goes a long way toward fixing (a), and human agreement goes up dramatically. A related quirk is that, at least in news documents, the most important information is usually presented first (this is actually stipulated by many newspaper editors' guidelines). This means that producing a summary by just taking leading sentences often does incredibly well.
A related problem is that of evaluation. The best option is a human evaluation, preferably in some simulated real-world setting. A reasonable alternative is to ask humans to write reference summaries and compare system output against them. The ROUGE family of metrics was designed to automate this comparison (ROUGE is essentially a collection of similarity metrics for matching human summaries to system summaries). Overall, however, evaluation remains a long-standing and not-well-solved problem in summarization.
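To make the comparison concrete, here is a sketch of the simplest member of the family, unigram recall (ROUGE-1 recall): what fraction of the reference summary's words show up in the system summary, with counts clipped so a repeated system word can't be credited twice. This is a toy version, assuming whitespace tokenization and a single reference:

```python
from collections import Counter

def rouge_1_recall(reference, system):
    """ROUGE-1 recall: fraction of reference unigrams that also
    appear in the system summary, with clipped counts."""
    ref_counts = Counter(reference.lower().split())
    sys_counts = Counter(system.lower().split())
    overlap = sum(min(count, sys_counts[word])
                  for word, count in ref_counts.items())
    return overlap / max(1, sum(ref_counts.values()))
```

The real ROUGE package adds longer n-grams, skip-bigrams, longest common subsequence and multi-reference handling, but the core idea is this word-matching recall.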
Techniques for summarization vary by summary type (extraction, headline, compression).
The standard recipe for sentence extraction works as follows. A scoring module assigns each sentence in the original document collection a score that says how important it is (in query-focused summarization, for instance, word overlap between the sentence and the query would be a start). A summary is created by first extracting the best sentence according to this score. The second sentence is the next-best sentence, minus some redundancy penalty (we don't want to extract the same information over and over again); the redundancy component typically computes the similarity between a candidate sentence and the previously extracted sentences. This process repeats until the summary is sufficiently long.
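The recipe above can be sketched as greedy selection with a relevance-minus-redundancy objective (this is essentially the MMR idea). The particular choices here — word overlap with the query for relevance, Jaccard similarity for redundancy — are placeholder assumptions for illustration, not any specific system's:

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(1, len(sa | sb))

def extract_summary(sentences, query, n_sentences=2, penalty=0.7):
    """Greedy sentence extraction: repeatedly pick the sentence whose
    relevance to the query, minus a redundancy penalty against the
    already-selected sentences, is highest."""
    selected, candidates = [], list(sentences)
    while candidates and len(selected) < n_sentences:
        def mmr_score(s):
            relevance = jaccard(s, query)
            redundancy = max((jaccard(s, t) for t in selected),
                             default=0.0)
            return relevance - penalty * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The redundancy penalty is what keeps the second sentence from simply restating the first; with `penalty=0`, this degenerates to picking the top-scoring sentences independently.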
Headline generation and sentence compression have not yet reached the point of stability that sentence extraction has. A very popular and successful approach to headline generation is to train a hidden Markov model, much like what you find in statistical machine translation (similar to IBM model 1, for those familiar with it). For sentence compression, one typically parses a sentence and then attempts to summarize it by dropping words and phrases (phrases = whole constituents).
Summarization has close ties to question answering and information retrieval; in fact, some limited background in standard IR techniques (tf-idf, the vector space model, etc.) is pretty much necessary in order to understand what goes on in summarization.
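For readers without that IR background, the two ideas mentioned fit in a few lines: tf-idf weights each word by how often it appears in a document, discounted by how many documents it appears in, and the vector space model compares documents by the cosine of the angle between their weight vectors. A minimal sketch (assuming whitespace tokenization and raw term frequency; real systems use smoothed variants):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse tf-idf vector (word -> weight) per document:
    term frequency times log inverse document frequency."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

This cosine-over-tf-idf similarity is exactly the kind of machinery that shows up in both the scoring and redundancy components of extractive summarizers.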
Here are some papers/tutorials/slides worth reading to get one started (I'd also recommend Inderjeet Mani's book, Automatic Summarization if you don't mind spending a little money):
- Background IR material
- Sentence extraction
- Marcu's ACL tutorial
- Multi-Document Summarization By Sentence Extraction
- Automated Multi-document Summarization in NeATS
- A Trainable Document Summarizer
- Headline generation: Automatic Headline Generation for Newspaper Stories
- Sentence compression: Statistics-Based Summarization --- Step One: Sentence Compression
- Discussion: What might be in a summary?
- Automatic evaluation: ROUGE: a Package for Automatic Evaluation of Summaries
- Recent fun stuff: Cut and paste based text summarization, Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment and A Statistical Approach for Automatic Speech Summarization