## 26 March 2008

### N-best lists and duplication

One thing that continually irks me about (some varieties of) n-best lists is duplicates. Duplicates typically arise due to an "arg max" approximation in a structured search process. In other words, we have some model $p(\text{out} | \text{in})$ that decomposes into $\sum_{\text{hid}} p(\text{hid} | \text{in}) p(\text{out} | \text{hid}, \text{in})$. Here, "in" is the input, "out" is the desired output and "hid" is some hidden/latent variable. For example, in phrase-based MT, "in" is French, "out" is English and "hid" might be the phrasal segmentation. The problem that arises is that we typically do not compute $p(\text{out} | \text{in})$ by summing over hidden states; instead, we try to find the "out"/"hid" pair that maximizes the joint probability $p(\text{out}, \text{hid} | \text{in})$.

Now, the first problem with this is that we're not actually after the out/hid pair that maximizes the joint; we're after the out that maximizes the marginal $p(\text{out} | \text{in})$. But this is largely a separate issue.
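A toy illustration of the difference (all probabilities made up): the joint arg max can pick an output with one strong derivation, even though another output carries more total mass spread across several derivations.

```python
# Made-up joint p(out, hid | in): output "B" has one strong derivation;
# output "A" splits its mass across two derivations.
joint = {
    ("B", "h1"): 0.4,
    ("A", "h2"): 0.3,
    ("A", "h3"): 0.3,
}

# arg max over (out, hid) pairs picks "B"...
best_pair = max(joint, key=joint.get)

# ...but the marginal arg max picks "A", since 0.3 + 0.3 > 0.4.
marginal = {}
for (out, _hid), p in joint.items():
    marginal[out] = marginal.get(out, 0.0) + p
best_out = max(marginal, key=marginal.get)

print(best_pair[0], "vs", best_out)  # B vs A
```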

The issue that I care about is that when we look at an N-best list, we get the top N out/hid pairs, rather than the top N outputs. This leads to "duplication" in the N-best list, where many of the top N are actually the same output, but with different latent variables.
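Concretely, deduplicating means grouping list entries by their output string. Here is a minimal sketch (the sentences, segmentations, and scores are all hypothetical) that collapses a list of (output, hidden, log-probability) triples and approximately marginalizes over the latent variable by log-sum-exp of the surviving scores -- a lower bound on the true marginal, since most derivations never make the list:

```python
import math
from collections import defaultdict

def collapse_nbest(nbest):
    # nbest: list of (output, hidden, log_prob) triples, best first.
    # Group scores by output string; the same output typically appears
    # under many different latent derivations.
    scores = defaultdict(list)
    for out, _hid, logp in nbest:
        scores[out].append(logp)

    # Approximately marginalize over the latent variable with a
    # log-sum-exp of the scores that survived in the list.
    def logsumexp(lps):
        m = max(lps)
        return m + math.log(sum(math.exp(lp - m) for lp in lps))

    collapsed = [(out, logsumexp(lps)) for out, lps in scores.items()]
    collapsed.sort(key=lambda pair: -pair[1])
    return collapsed

# Toy 5-best list: three entries are the same output string under
# different (hypothetical) phrasal segmentations.
nbest = [
    ("the cat sat",  "seg-a", -1.0),
    ("the cat sat",  "seg-b", -1.3),
    ("a cat sat",    "seg-c", -1.4),
    ("the cat sat",  "seg-d", -2.0),
    ("the cat sits", "seg-e", -2.5),
]
unique = collapse_nbest(nbest)
print(len(unique))  # 3 unique outputs survive from 5 entries
```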

One might ask: is this a big problem? I suspect so. Below is a figure that shows how many unique outputs we get in a top-5000 list from phrase-based translation (French to English). There are 2000 sentences being translated (along the x-axis) and each produces some number of unique outputs (along the y-axis). The sentences are sorted by the number of uniques they produce. So the ones on the left produce almost all 5000 unique outputs, but the ones on the right produce only one or two unique outputs. (Both axes are log-scaled.)
What we observe is that the median number of unique outputs in the top 5000 list is only 56! The mean is only 178. Over 10% of them have ten or fewer uniques. Only six have over 2500 uniques.

This does not bode well for any system that intends to pipeline the N-best list into some other module (whether a straightforward reranker or something more complex, like a summarization or search system).

One may ask: does the number of unique outputs correlate with the input sentence length? The answer is: sort of. The following figure plots input sentence length along the x-axis and the number of unique translations along the y-axis.

We see that there is some weak correlation. Pretty much all of the examples that lead to large numbers of uniques are short sentences (length 10 or less). "Average" sentences of length 30-50 seem to produce around 250 uniques, and long sentences (>100) tend to produce only a handful of outputs. Since average sentences are, in many ways, the most interesting, this is a bit frustrating. I don't particularly care about getting lots of translations for 10-word-long sentences because they're kind of boring anyway.

So what can be done about this? Here are a few things that come to mind, though I'm not sure any is a complete solution:

• Generate more than 5000 total and hope to get more than 50 unique. Sure, this works (at the expense of computation and memory), but a 1% return is really quite bad. If we could even get, say, a 10% or 20% return I would be much much happier.
• Instead of using n-best lists, generate samples from the posterior $p(\text{out}, \text{hid} | \text{in})$ (here, I assume that marginalization is still too much to ask for). Probably you'd want to take the 1-best as well, since there's no guarantee that the MAP would show up in a finite sample. I'm also not sure we know how to do this (efficiently) for arbitrary models.
• Try to optimize the n-best list for diversity, instead of simply optimality. You can write this down, but it seems hard to implement and may be computationally less pleasant than just generating 50,000 total.
• Improve hypothesis recombination inside the decoder. Here, my worry would be that it wouldn't be until the final step when we could combine everything, in which case this boils down to the "generate 50,000" approach.
I'm sure there are other ways, but these are what come to mind. How do other people deal with this? What size n-best lists are popular?
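The first option can at least be automated: instead of fixing the list size, keep pulling derivations from the decoder until enough unique strings appear (or a pull budget runs out). A sketch, assuming a hypothetical best-first stream of (output, hidden, log-prob) triples:

```python
def k_unique(derivations, k, max_pulls=50000):
    # derivations: best-first iterable of (out, hid, logp) triples
    # (a hypothetical decoder interface). Stop once k unique output
    # strings are seen, or after max_pulls total derivations.
    seen, uniques = set(), []
    pulls = 0
    for out, hid, logp in derivations:
        pulls += 1
        if out not in seen:
            seen.add(out)
            uniques.append((out, hid, logp))
        if len(uniques) == k or pulls == max_pulls:
            break
    return uniques

# A stream that is 90% duplicates still yields k uniques eventually.
stream = [(f"output-{i % 10}", f"hid-{i}", -0.01 * i) for i in range(5000)]
top = k_unique(stream, k=10)
print(len(top))  # 10
```

The budget matters: as the figure above shows, some sentences would exhaust any reasonable `max_pulls` long before reaching k uniques.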

Chris said...

For applications where greater diversity is necessary, I've found Liang Huang's suggestion for extracting "unique" n-best lists to be invaluable. Basically, when recursively searching through the hypergraph representing a parse forest for the n-best derivations of a parallel tree (in the context of, for example, a CKY-based Hiero-style decoder), by keeping track at each node of the unique strings spanned by that node in a hash table, the next unique string, rather than just the next best string, can easily be requested. He describes this in an AMTA paper from a couple of years ago, A Syntax-Directed Translator with Extended Domain of Locality. Anyway, a comparable modification should be possible with standard left-to-right phrase-based decoders since they use the same kind of dynamic programming search to reconstruct the (n-) best path(s).
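Stripped of the hypergraph machinery, the per-node bookkeeping Chris describes amounts to filtering a best-first derivation stream through a hash set of strings already emitted. A minimal sketch (the node-level stream is hypothetical):

```python
def next_unique(derivations):
    # derivations: best-first iterable of (string, score) pairs for one
    # node/span. The hash set records strings already yielded, so each
    # request returns the next *unique* string rather than merely the
    # next-best derivation.
    seen = set()
    for s, score in derivations:
        if s not in seen:
            seen.add(s)
            yield s, score

stream = [("a b", -1.0), ("a b", -1.1), ("a c", -1.2), ("a b", -1.3)]
print(list(next_unique(stream)))  # [('a b', -1.0), ('a c', -1.2)]
```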

Bob Carpenter said...

This problem's huge (literally and figuratively) in speech recognition where the input is acoustic vectors, the output is a sequence of words, and the hidden units are the temporal segmentation points of the words, the phonemes within them, and the states of the HMMs making up the phonemes within them. That is, you don't care in the end if the word "Chicago" started at 1.01 seconds into the utterance or 1.02 seconds.

The scale is pretty bad: typically 100 feature vectors/second consisting of 30-40 dimensional acoustic representations which are evaluated against tens of thousands of 16- or 32-way Gaussian mixtures representing phoneme segments. That's the emission model of an HMM which uses phonotactic and language models in the transition model.

Speech researchers have at least evaluated the third and fourth approaches: (3) try to build more diversity into n-best list extraction, typically with customized heuristic "n-best" extractors, and (4) try to collapse the hypotheses within the recognizer, typically by adopting some fuzzy notion of boundary equivalence at the language model decoding phase.

Finite state transducers can help with some of this, because they naturally do the kind of space reduction collapsing you're talking about.

Kevin said...

I've also been irked by this issue for a while, but I am not sure whether this is a problem in practice. For example, suppose you're re-ranking an N-best list with K unique translations (K < N) vs. re-ranking a K-best list with duplicates filtered: which is better? (We do the former in the UW MT system.)

Here's what I'll argue without evidence: although the N-best list has many duplicate translations, the features of these duplicates may be distinct. This gives the re-ranker learning algorithm more flexibility in the optimization (or, if you want to counter me, you might say flexibility = many local optima).

Anyway, my point is that the "duplicates" in the N-best list are not really duplicates in feature space. They are duplicates (equivalence classes) only with respect to the loss function (e.g. BLEU, WER). It is not exactly clear whether the duplicate issue is good, bad, or neutral.

On a related note, do more duplicates imply more confidence? I wonder whether sentence-level BLEU scores correlate with the percentage of duplicates in the N-best list?

Jenny said...

You largely dismiss the first issue, of getting out the maximal out/hid pair, when really your second problem should be setting off huge warning bells about the severity of the first problem. Given the number of different ways one can construct an output, based on the hidden state, who knows just how "best" the (condensed) n-best lists you extract really are.

Jenny said...

Also, if you generate samples from p(out,hid|in) and you just discard the hid, then you have effectively generated samples of out and can approximate the marginal. Unless I'm misunderstanding what you meant by "I assume that marginalization is still too much to ask for." This is the approach that I would take (and have taken), since you actually know what you are getting.
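Jenny's suggestion, sketched in code: draw joint samples, throw away the latent variable, and the empirical output frequencies estimate the marginal. The toy posterior below is made up; `sampler` stands in for whatever posterior sampler the model admits.

```python
import random
from collections import Counter

def estimate_marginal(sampler, n_samples=20000):
    # sampler() returns one (out, hid) draw from p(out, hid | in).
    # Discarding hid leaves samples of out, whose empirical
    # frequencies approximate the marginal p(out | in).
    counts = Counter(out for out, _hid in (sampler() for _ in range(n_samples)))
    return {out: c / n_samples for out, c in counts.items()}

# Made-up toy posterior: "A" splits its mass over two derivations.
pairs = [("B", "h1"), ("A", "h2"), ("A", "h3")]
weights = [0.4, 0.3, 0.3]
random.seed(0)
est = estimate_marginal(lambda: random.choices(pairs, weights)[0])
print(est["A"])  # close to the true marginal 0.6
```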

hal said...

bob -- do you have any pointers to the heuristic n-best extractors? it indeed seems like life is probably worse for them than for us, due to the fact that they have much longer input sequences and hence many more latent variables.

kevin -- well, if the *computation* time is the same for K unique as for N non-unique containing K unique, then i'd prefer the N, essentially for the reason jenny points out: that you can (approximately) marginalize. however, the problem i see is that while for some sentences you get nearly all N unique, for many you only get 1 or 2 unique out of 5000, so how to choose N?

jenny -- i'm not trying to dismiss the first issue (max versus sum), but i just think that it's somewhat tangential. regardless of whether you do n-best or sampling, you can still (approximately) marginalize out 'hid' by summing, so i think that the sampling-versus-nbest is also a bit of a red herring. (my statement about marginalization being too much to ask for was in reference to exact marginalization, like you would get if you were living in FSM world.)

that is: i think the fact that we're max-ing instead of sum-ing is a reasonably big problem; i just also think that since exact marginalization is pretty much impossible, we're going to have to do something about duplicates, regardless of whether we sample or sum.

incidentally, i know it's pretty trivial and well known how to set up a gibbs or MH sampler when our output space is a linear chain or tree; does anyone know how to do it in general for "weird" output structures, like we see in translation or ASR (the "weirdness" here is just that the length is not fixed). i can imagine a "correct" Gibbs sampler that probably wouldn't work because it would get trapped in local maxima really easily; i can also imagine doing particle filtering/SMC, but this is maybe overkill and comes with its own issues.

Bob Carpenter said...

Lidia Mangu's thesis and the related papers explored the issue recently. Here's a link to a representative paper:

http://citeseer.ist.psu.edu/390181.html

There are lots and lots of papers out there around this topic.

For the FSM composition approach, check out any of the Mohri/Pereira/Riley speech papers, such as:

http://citeseer.ist.psu.edu/629641.html

Can you compute the total output probability so you can get conditionals? If so, you can figure out how much probability mass your n-best or compact representation thereof covers. If not, even with sampling, you can't tell how representative the posterior is that you've sampled.

Some models, e.g. LDA used for word similarity or doc similarity, allow you to sample the value of interest without getting a very dense posterior sample of clusterings.

If you can reduce your model to a hypergraph, you can use inside-outside like algorithms. Does anyone know any general techniques past that? The compact representations for dynamic programming seem tightly related to the posterior sampler or n-best iterator.

Anuj said...

Sir, I am a high school student working on NLP. I have started at a very basic level; can you tell me some books, websites, etc. where I can get the required knowledge for implementing NLP in a practical manner? I have tried MIT OCW and other such websites, but I want to learn more. Could you suggest something?
