19 March 2008

What to do with a million summaries?

Let's pretend.

Let's pretend that someone gave you one million document/summary pairs. If you like single-document, pretend they're single-document; if you like multi-document, pretend they're multi-document.

For those of us who work on summarization, this seems like it would be a pretty cool gift. Most of the corpora we're used to using have, at best, a few hundred such pairs. Some (e.g., for headline generation) have more, but then I didn't allow you to pretend that these were headlines (I can get you billions of these, if you really want them).

So here's the question I pose: what would you do with a million summaries?

I actually have a hard time answering this question. Sure, we have whatever our favorite sentence extraction method is. If it's learned at all, it's probably learned over a dozen or so features: position, length, similarity to query (if a query exists), similarity to document centroid, etc. This would probably be optimized against some automatic measure like one of the many flavors of Rouge. Fitting a dozen parameters on a corpus of 100 examples is probably pretty reasonable, and we have some results that suggest that we've gone about as far as we can go with sentence extraction (at least with respect to the DUC corpora); see, for instance, section 6.6 of my thesis. There, we see that we're pretty much matching oracle performance at sentence extraction on DUC04 data when evaluated using Rouge (I've independently seen other people present similar results, so I think it's replicable). (Yes, I'm aware there are caveats here: the use of Rouge, the fixation on one corpus, etc.)
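To make the "dozen or so features" setup concrete, here is a minimal sketch of a linear sentence extractor over four of the features named above. Everything in it is an assumption for illustration: the function names, the exact feature definitions (for example, position as 1/(i+1)), and the weights, which in practice would be tuned against something like Rouge rather than set by hand.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased alphanumeric tokens; a stand-in for real preprocessing.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sentence_features(sentences, query=None):
    bags = [Counter(tokenize(s)) for s in sentences]
    centroid = sum(bags, Counter())  # bag of words for the whole document
    qbag = Counter(tokenize(query)) if query else Counter()
    feats = []
    for i, bag in enumerate(bags):
        feats.append({
            "position": 1.0 / (i + 1),           # earlier sentences score higher
            "length": float(sum(bag.values())),  # number of tokens
            "centroid_sim": cosine(bag, centroid),
            "query_sim": cosine(bag, qbag),      # 0.0 when there is no query
        })
    return feats

def extract(sentences, weights, k=1, query=None):
    # Score each sentence with a weighted feature sum and keep the top k,
    # returned in their original document order.
    feats = sentence_features(sentences, query)
    scores = [sum(weights.get(n, 0.0) * v for n, v in f.items()) for f in feats]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]
```

With only a handful of weights like these, it is easy to see why a hundred examples is enough to fit them.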

But now we have a million doc/sum pairs. Fitting a dozen parameters on a million examples seems a bit wasteful. It also seems a bit wasteful to continue to focus on sentence extraction in this case. Why? Well, first, we've already "solved" this problem (the quotes indicate the caveats hinted at above). Second, I have a seriously hard time thinking of many more high-level features that I could possibly tune (my best entry ever into DUC---I think we came in second or third, depending on the scoring---had about 25, many of which ended up getting very, very small weights).

So, being me, my next thought is: do word alignment on the summaries, like what they do in machine translation. It turns out that somebody has already tried this, with a reasonable amount of success. In all seriousness, if I were to try something like this again, I think I would throw out the "phrase" issue and deal with words; probably also consider throwing out the HMM issue and do something akin to Model 1. The key difference is that I would continue to include the additional features, like stem-identity, WordNet, etc. I might also throw in some word clustering just for fun.
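For readers who haven't seen it, here is a rough sketch of what "something akin to Model 1" could look like on document/summary pairs: plain EM over word-to-word translation probabilities, treating the document as the source and the summary as the target. This is a textbook-style IBM Model 1, not the model from the alignments paper; it leaves out the additional features (stem identity, WordNet, clustering) mentioned above, and the names `train_model1`, `align`, and `NULL` are made up for illustration.

```python
from collections import defaultdict

NULL = "<null>"  # hypothetical empty source word for unaligned summary words

def train_model1(pairs, iterations=5):
    """pairs: non-empty list of (doc_tokens, summary_tokens).
    Returns t where t[(s, d)] approximates p(summary word s | document word d)."""
    summary_vocab = {s for _, summ in pairs for s in summ}
    uniform = 1.0 / len(summary_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts under the current t.
        for doc, summ in pairs:
            src = [NULL] + list(doc)
            for s in summ:
                z = sum(t[(s, d)] for d in src)
                for d in src:
                    c = t[(s, d)] / z
                    count[(s, d)] += c
                    total[d] += c
        # M-step: renormalize the counts per document word.
        t = defaultdict(float, {(s, d): c / total[d] for (s, d), c in count.items()})
    return t

def align(doc, summ, t):
    # For each summary word, pick the most probable document word (or NULL).
    src = [NULL] + list(doc)
    return [max(src, key=lambda d: t[(s, d)]) for s in summ]
```

On real documents the source side is hundreds of words long, so each summary word has far more candidates than in translation; that is part of why the extra lexical features matter.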

So let's say that the alignments worked. Now what? We could decode, of course, by intersecting the learned alignment model with a language model. I think this would be a really bad idea, essentially because I don't think there's enough information in the alignments to actually produce new summaries; just enough to get reasonable alignments.

So now we've got a million document/summary pairs that have been aligned. What now?

You could say "learn to create abstracts", but I'm actually not particularly thrilled with this idea, either. (Why? Well, there's a long story, but the basic problem is that if you ask humans to write summaries, they're lazy. What this means is that they do a lot of word copying, at least up to the stem. If you look in the alignments paper, there are some results that say that over half of the words in the summary are aligned to identical words (stems) in the document, even with a human doing the alignment. What this means is that if you're using a lexically-based scoring method, like Rouge, odds are against you if you ever change a word because chances are the human writing the summary didn't change it.)
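If you want to check the copying claim on your own data without doing full alignment, a crude proxy is the fraction of summary tokens whose stem also appears somewhere in the document. The sketch below is only that: `crude_stem` is a deliberately naive suffix stripper I'm assuming for illustration (a real experiment would use a proper stemmer), and matching a stem anywhere in the document overcounts relative to a real alignment.

```python
import re

def crude_stem(word):
    # Naive suffix stripping; an assumption, not a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokens(text):
    return re.findall(r"[a-z]+", text.lower())

def copy_rate(document, summary):
    # Fraction of summary tokens whose stem occurs anywhere in the document.
    doc_stems = {crude_stem(w) for w in tokens(document)}
    summ = tokens(summary)
    if not summ:
        return 0.0
    copied = sum(1 for w in summ if crude_stem(w) in doc_stems)
    return copied / len(summ)
```

A high copy rate on human summaries is exactly what makes a lexical metric like Rouge punish any system that rewords.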

You could suggest trying to learn to do compression, which is probably what I'd look at most seriously. But I also don't think we have a really good understanding of this. In the Searn paper, we show how to use Searn to compute compressions, but to be honest it's really slow and I don't think it's really scalable to 1 million doc/sum pairs. But I suppose I would probably start with something like that.

But that's what I would do. What would you do?

(Incidentally, if you want to ask me: "do you have a million summaries?" the answer is "no." I have about 120,000. But there are some complications with this data. Maybe I'll post about it in the near future.)

Comments:

todd. said...

I don't work on summarization so there's a good chance I'm about to ask a stupid question. But I did a summer project on summarization as an undergraduate, and so I often wonder how various ML techniques would work when applied to summarization.

My question, then, is: has anyone had much success using a topic model like latent Dirichlet allocation for summarization? The LDA papers always say it'd be great for that, and it's easy to see how this could be true, but I couldn't find any papers where this had been done. (Admittedly, I only spent 5-10 minutes with Google trying.)

Knowing pretty much nothing about what works for summarization, it seems to me that if I had 1M summaries, that would be the avenue I wanted to research.

hal said...

I don't think it's a dumb question. All I can say for certain is that I've tried something like this and couldn't really get it to work (though I didn't put a huge amount of effort into it). The tack that I took was to take BayeSum and augment it with LDA-style topics. BayeSum is already quite LDA-like... imagine each document constructed from 3 topics: a stop word topic, a document-specific topic, and a query-specific topic.

There are two places you can add more canonical LDA-style topics here. The first is to allow the query to contain >1 topic; the second is to allow the document to. The latter is totally unhelpful in query-focused summarization because we don't care what the topics are in the document that don't deal with the query. The former is pretty unhelpful because queries are already quite a bit more specific than LDA topics turn out to be.

It may be useful for non-query-focused summarization, but I have my own concerns about that.

Joseph Turian said...

I've been musing about this question since you posted it.

Let's say we are interested in inducing an underlying representation for text. Rather than using standard unsupervised learning techniques to learn an embedding for each text in isolation, we can add an additional term to the objective: the embedding for the document and its summary should be close. This extra component in the training helps ensure that the representation for a document actually captures the summary's information.

What is the story with the 120K summaries?
