One thing that continually irks me about (some varieties of) n-best lists is duplicates. Duplicates typically arise due to an "arg max" approximation in a structured search process. In other words, we have some model p(out | in) that decomposes into a sum over hidden states: p(out | in) = sum_hid p(out, hid | in). Here, "in" is the input, "out" is the desired output and "hid" is some hidden/latent variable. For example, in phrase-based MT, "in" is French, "out" is English and "hid" might be the phrasal segmentation. The problem that arises is that we typically do not compute p(out | in) by summing over hidden states; instead, we try to find the "out"/"hid" pair that maximizes the joint probability p(out, hid | in).
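To make the max-versus-sum gap concrete, here is a toy, entirely made-up joint distribution in which the best single out/hid pair and the best marginal output disagree:

```python
from collections import defaultdict

# Hypothetical toy values of p(out, hid | in) for one fixed input.
# The single best (out, hid) pair belongs to output "a", but output
# "b" wins once the hidden variable is summed out.
joint = {
    ("a", 1): 0.30,
    ("b", 1): 0.28,
    ("b", 2): 0.27,
    ("a", 2): 0.15,
}

# The "arg max" approximation: best joint (out, hid) pair.
best_pair = max(joint, key=joint.get)

# The objective we actually want: best marginal p(out | in).
marginal = defaultdict(float)
for (out, hid), p in joint.items():
    marginal[out] += p
best_out = max(marginal, key=marginal.get)

print(best_pair[0])  # "a" -- what the decoder returns
print(best_out)      # "b" -- what we actually want (0.55 vs 0.45)
```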

Now, the first problem with this is that we're not actually after the out/hid pair that maximizes the joint; we're after the out that maximizes the marginal p(out | in). But this is largely a separate issue.

The issue that I care about is that when we look at an N-best list, we get the top N out/hid pairs, rather than the top N outputs. This leads to "duplication" in the N-best list, where many of the top N are actually the same output, but with different latent variables.
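As a concrete sketch (with made-up output strings, segmentations and scores), collapsing such a list amounts to grouping entries by output string and log-summing the scores of the duplicates:

```python
import math
from collections import defaultdict

def logaddexp(a, b):
    """Stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def collapse_nbest(nbest):
    """Collapse an (output, hidden, logprob) n-best list into unique
    outputs, summing the probability mass of duplicate entries."""
    merged = defaultdict(lambda: float("-inf"))
    for out, hid, lp in nbest:
        merged[out] = logaddexp(merged[out], lp)
    return sorted(merged.items(), key=lambda kv: -kv[1])

# Three derivations, only two distinct output strings.
nbest = [
    ("the cat sat", "seg-A", math.log(0.20)),
    ("the cat sat", "seg-B", math.log(0.15)),
    ("a cat sat",   "seg-A", math.log(0.25)),
]
uniques = collapse_nbest(nbest)
print(len(uniques))   # 2
print(uniques[0][0])  # "the cat sat" (mass 0.35 beats 0.25)
```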

One might ask: is this a big problem? I suspect so. Below is a figure that shows how many unique outputs we get in a top-5000 list from phrase-based translation (French to English). There are 2000 sentences being translated (along the x-axis) and each produces some number of unique outputs (along the y-axis). The sentences are sorted by the number of uniques they produce. So the ones on the left produce almost all 5000 unique outputs, but the ones on the right produce only one or two unique outputs. (Both axes are log-scaled.)

What we observe is that the median number of unique outputs in the top-5000 list is only 56! The mean is only 178. Over 10% of them have ten or fewer uniques. Only six have over 2500 uniques.

This does not bode well for any system that intends to pipeline the N-best list into some other module (whether it is a straightforward reranker or something more complex; for instance, like a summarization system or search system).

One may ask: does the number of unique outputs correlate with the input sentence length? The answer is: sort of. The following figure plots input sentence length along the x-axis and number of unique translations along the y-axis.

We see that there is some weak correlation. Pretty much all of the examples that lead to large numbers of uniques are short sentences (length 10 or less). "Average" sentences of length 30-50 seem to produce around 250 uniques, and long sentences (>100) tend to produce only a handful of outputs. Since average sentences are, in many ways, the most interesting, this is a bit frustrating. I don't particularly care about getting lots of translations for 10-word-long sentences because they're kind of boring anyway.

So what can be done about this? Here are a few things that come to mind, though I'm not sure any is a complete solution:

- Generate more than 5000 total and hope to get more than 50 unique. Sure, this works (at the expense of computation and memory), but a 1% return is really quite bad. If we could even get, say, a 10% or 20% return I would be much much happier.
- Instead of using n-best lists, generate samples from the posterior (here, I assume that marginalization is still too much to ask for). Probably you'd want to take the 1-best as well, since there's no guarantee that the MAP would show up in a finite sample. I'm also not sure we know how to do this (efficiently) for arbitrary models.
- Try to optimize the n-best list for diversity, instead of simply optimality. You can write this down, but it seems hard to implement and may be computationally less pleasant than just generating 50,000 total.
- Improve hypothesis recombination inside the decoder. Here, my worry would be that it wouldn't be until the final step when we could combine everything, in which case this boils down to the "generate 50,000" approach.
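The first option can be sketched as a thin filtering wrapper around a best-first derivation iterator; `unique_nbest` and the toy stream below are hypothetical, assuming the decoder can lazily yield (output, hidden, score) triples in score order:

```python
def unique_nbest(derivations, k):
    """Brute-force remedy: pull (output, hidden, score) triples from a
    best-first iterator and keep only the first occurrence of each
    output string, stopping once k unique outputs have been found."""
    seen, uniques = set(), []
    for out, hid, score in derivations:
        if out not in seen:
            seen.add(out)
            uniques.append((out, hid, score))
            if len(uniques) == k:
                break
    return uniques

# Toy best-first stream containing a duplicate output "a".
stream = iter([
    ("a", 1, 0.30), ("a", 2, 0.20), ("b", 1, 0.10), ("c", 1, 0.05),
])
print([o for o, _, _ in unique_nbest(stream, 2)])  # ['a', 'b']
```

With a lazy extractor this only generates as many total derivations as it takes to reach k uniques, but as the figure above shows, for the worst sentences that can mean enumerating thousands of duplicates per unique.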

## 11 comments:

For applications where greater diversity is necessary, I've found Liang Huang's suggestion for extracting "unique" n-best lists to be invaluable. Basically, when recursively searching through the hypergraph representing a parse forest for the n-best derivations (in the context of, for example, a CKY-based Hiero-style decoder), by keeping track at each node of the unique strings spanned by that node in a hash table, the next unique string, rather than just the next best string, can easily be requested. He describes this in an AMTA paper from a couple years ago, "Syntax-Directed Translation with Extended Domain of Locality." Anyway, a comparable modification should be possible with standard left-to-right phrase-based decoders since they use the same kind of dynamic programming search to reconstruct the (n-)best path(s).

This problem's huge (literally and figuratively) in speech recognition, where the input is acoustic vectors, the output is a sequence of words, and the hidden units are the temporal segmentation points of the words, the phonemes within them, and the states of the HMMs making up the phonemes. That is, you don't care in the end if the word "Chicago" started at 1.01 seconds into the utterance or 1.02 seconds.

The scale is pretty bad: typically 100 feature vectors/second consisting of 30-40 dimensional acoustic representations which are evaluated against tens of thousands of 16- or 32-way Gaussian mixtures representing phoneme segments. That's the emission model of an HMM which uses phonotactic and language models in the transition model.

Speech researchers have at least evaluated the third and fourth approaches: (3) try to build more diversity into n-best list extraction, typically with customized heuristic "n-best" extractors, and (4) try to collapse the hypotheses within the recognizer, typically by adopting some fuzzy notion of boundary equivalence at the language model decoding phase.

Finite state transducers can help with some of this, because they naturally do the kind of space reduction collapsing you're talking about.

I've also been irked by this issue for a while, but I am not sure whether this is a problem in practice. For example, suppose you're re-ranking an N-best list with K unique translations (K < N) vs. re-ranking a K-best list with duplicates filtered: which is better? (We do the former in the UW MT system.)

Here's what I'll argue without evidence: although the N-best list has many duplicate translations, the features of these duplicates may be distinct. This gives the re-ranker learning algorithm more flexibility in the optimization (or if you want to counter me, you might say flexibility = many local optima).

Anyway, my point is that the "duplicates" in the N-best list are not really duplicates in feature space. They are duplicates (equivalence classes) only with respect to the loss function (e.g. BLEU, WER). It is not exactly clear whether the duplicate issue is good, bad, or neutral.

On a related note, do more duplicates imply more confidence? I wonder whether sentence-level BLEU scores correlate with the percentage of duplicates in the N-best list.

You largely dismiss the first issue, of getting out the maximal out/hid pair, when really your second problem should be setting off huge warning bells about the severity of the first problem. Given the number of different ways one can construct an output, based on the hidden state, who knows just how "best" the (condensed) n-best you extract really are.

Also, if you generate samples from p(out,hid|in) and you just discard the hid, then you have effectively generated samples of out and can approximate the marginal. Unless I'm misunderstanding what you meant by "I assume that marginalization is still too much to ask for." This is the approach that I would take (and have taken), since you actually know what you are getting.
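A minimal sketch of that sample-and-discard approach, with a made-up toy posterior (`toy_joint_sampler`) standing in for a real sampler:

```python
import random
from collections import Counter

def approx_best_output(sample_joint, n_samples, rng):
    """Approximate arg max_out p(out | in): draw samples of (out, hid)
    from the joint posterior, discard hid, take the modal output."""
    counts = Counter(sample_joint(rng)[0] for _ in range(n_samples))
    return counts.most_common(1)[0][0]

# Toy joint posterior: the single best (out, hid) pair has out = "a",
# but "b" carries more total mass (0.55 vs 0.45).
def toy_joint_sampler(rng):
    r = rng.random()
    if r < 0.30:
        return ("a", 1)
    if r < 0.58:
        return ("b", 1)
    if r < 0.85:
        return ("b", 2)
    return ("a", 2)

best = approx_best_output(toy_joint_sampler, 10000, random.Random(0))
print(best)  # almost surely "b"
```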

bob -- do you have any pointers to the heuristic n-best extractors? it indeed seems like life is probably worse for them than for us, due to the fact that they have much longer input sequences and hence many more latent variables.

kevin -- well, if the *computation* time is the same for K unique as for N non-unique containing K unique, then i'd prefer the N, essentially for the reason jenny points out: that you can (approximately) marginalize. however, the problem i see is that while for some sentences you get nearly all N unique, for many you only get 1 or 2 unique out of 5000, so how to choose N?

jenny -- i'm not trying to *dismiss* the first issue (max versus sum), but i just think that it's somewhat tangential. regardless of whether you do n-best or sampling, you can still (approximately) marginalize out 'hid' by summing, so i think that the sampling-versus-nbest question is also a bit of a red herring. (my statement about marginalization being too much to ask for was in reference to exact marginalization, like you would get if you were living in FSM world.) that is: i think the fact that we're max-ing instead of sum-ing is a reasonably big problem; i just also think that since exact marginalization is pretty much impossible, we're going to have to do *something* about duplicates, regardless of whether we sample or sum.

incidentally, i know it's pretty trivial and well known how to set up a gibbs or MH sampler when our output space is a linear chain or tree; does anyone know how to do it in general for "weird" output structures, like we see in translation or ASR (the "weirdness" here is just that the length is not fixed)? i can imagine a "correct" Gibbs sampler that probably wouldn't work because it would get trapped in local maxima really easily; i can also imagine doing particle filtering/SMC, but this is maybe overkill and comes with its own issues.

Lidia Mangu's thesis and the related papers explored the issue recently. Here's a link to a representative paper:

http://citeseer.ist.psu.edu/390181.html

There are lots and lots of papers out there around this topic.

For the FSM composition approach, check out any of the Mohri/Pereira/Riley speech papers, such as:

http://citeseer.ist.psu.edu/629641.html

Can you compute the total output probability so you can get conditionals? If so, you can figure out how much probability mass your n-best or compact representation thereof covers. If not, even with sampling, you can't tell how representative the posterior is that you've sampled.

Some models, e.g. LDA used for word similarity or doc similarity, allow you to sample the value of interest without getting a very dense posterior sample of clusterings.

If you can reduce your model to a hypergraph, you can use inside-outside like algorithms. Does anyone know any general techniques past that? The compact representations for dynamic programming seem tightly related to the posterior sampler or n-best iterator.
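As a minimal sketch of that inside computation (the node/edge encoding here is hypothetical, assuming an acyclic hypergraph with a probability on each hyperedge and nodes given in topological order):

```python
def inside(nodes, edges):
    """Inside probabilities on an acyclic hypergraph.
    nodes: list of node ids in topological order (leaves first).
    edges: dict node -> list of (weight, tail_nodes); a node with no
    incoming edges is a leaf with inside probability 1.
    Returns dict node -> total probability of all derivations of it."""
    beta = {}
    for v in nodes:
        incoming = edges.get(v, [])
        if not incoming:
            beta[v] = 1.0  # leaf
        else:
            total = 0.0
            for w, tails in incoming:
                p = w
                for t in tails:
                    p *= beta[t]  # product over the edge's tail nodes
                total += p        # sum over incoming hyperedges
            beta[v] = total
    return beta

# Tiny example: goal node "z" derivable two ways from leaves x, y.
edges = {"z": [(0.5, ["x", "y"]), (0.25, ["x"])]}
beta = inside(["x", "y", "z"], edges)
print(beta["z"])  # 0.75
```

The inside value at the goal node is exactly the total probability mass needed to normalize, so you can then ask how much of it an n-best list (or compact representation) actually covers.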

Sir, I am a high school student working on NLP. I have started at a very basic level; can you tell me some books, websites, etc. where I can get the knowledge required for implementing NLP in a practical manner? I have tried MIT OCW and other such websites, but I want to learn more. Could you suggest something?

