18 December 2007

Particle filtering versus beam search

I had a very interesting discussion at NIPS with Vikash Mansinghka about beam search and particle filtering. Much of what is below is a result of this conversation, and he deserves much of the credit.

In a very real sense, particle filtering can be seen (for many models) as a sort of stochastic beam search. Just to set the stage, let's talk about generic beam search:

1. Initialize a min-queue "beam" to contain a single initial hypothesis.
2. While the beam contains a non-complete hypothesis:
   A. Take h, the lowest-scoring hypothesis, from the beam
   B. For each h' in the neighborhood of h:
      I. Add h' to the beam, with some score
   C. If the beam exceeds a certain capacity, drop the highest-scoring elements
3. Return the lowest-scoring hypothesis from the beam

The key variant in beam search algorithms is the score function that's used. The most naive is just path cost---how much did we have to pay to get to the current hypothesis? The "A*" way is to use path cost + admissible heuristic cost, where the admissible heuristic is an underestimate of the cost remaining to complete the hypothesis.
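
For concreteness, here is a minimal sketch of this loop in Python; `neighbors`, `is_complete`, and `score` are hypothetical stand-ins for a problem-specific search space:

    def beam_search(init, neighbors, is_complete, score, capacity=100):
        # Generic beam search. `neighbors`, `is_complete`, and `score` are
        # problem-specific callbacks; lower score is better.
        beam = [init]
        while any(not is_complete(h) for h in beam):
            # 2A. take h, the lowest-scoring non-complete hypothesis
            h = min((x for x in beam if not is_complete(x)), key=score)
            beam.remove(h)
            # 2B. add each h' in the neighborhood of h, with its score
            beam.extend(neighbors(h))
            # 2C. over capacity? drop the highest-scoring (worst) elements
            beam = sorted(beam, key=score)[:capacity]
        # 3. return the lowest-scoring hypothesis
        return min(beam, key=score)

Plugging in score(h) = path_cost(h) gives the naive variant; score(h) = path_cost(h) + heuristic(h), with an admissible underestimate as the heuristic, gives the A* flavor.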

There are (of course) other variants: sometimes we expand the whole beam in one step; sometimes we use multiple beams; etc. We'll come back to these shortly.

Particle filtering (using terminology as similar to beam search as possible), on the other hand, looks something like this:

1. Initialize a max-queue "beam" to contain a single initial hypothesis.
2. While the beam contains a non-complete hypothesis:
   A. Take h, the highest-scoring hypothesis, from the beam
   B. Sample h' from the neighborhood of h
      I. Add h' to the beam, with some score
   C. If the beam exceeds a certain capacity, randomly drop elements
   D. If the "effective sample size" of the beam is too low, resample elements
3. Return the highest-scoring hypothesis from the beam

To map beam search to particle filtering, use negative log probabilities for scores. (This turns max search into min search.)
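
And here is the same skeleton with the stochastic choices swapped in: a rough sketch, assuming `sample_neighbor` draws h' from a proposal distribution and `weight` returns an unnormalized posterior weight (both hypothetical):

    import random

    def particle_filter(init, sample_neighbor, is_complete, weight,
                        capacity=100, ess_threshold=50.0):
        # Particle filtering written as "stochastic beam search".
        # `sample_neighbor` and `weight` are problem-specific callbacks;
        # higher weight is better.
        beam = [init]
        while any(not is_complete(h) for h in beam):
            # 2A. take h, the highest-scoring non-complete hypothesis
            h = max((x for x in beam if not is_complete(x)), key=weight)
            beam.remove(h)
            # 2B. sample h' from the neighborhood of h
            beam.append(sample_neighbor(h))
            # 2C. over capacity? drop elements at random
            while len(beam) > capacity:
                beam.pop(random.randrange(len(beam)))
            # 2D. resample if the effective sample size is too low
            w = [weight(x) for x in beam]
            ess = sum(w) ** 2 / sum(wi * wi for wi in w)
            if ess < ess_threshold:
                beam = random.choices(beam, weights=w, k=len(beam))
        # 3. return the highest-scoring hypothesis
        return max(beam, key=weight)

(A textbook SMC implementation extends all particles in lockstep and carries importance weights incrementally; this version follows the pseudocode above to keep the analogy with beam search visible.)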

In a sense, the major thing that differs between PF and BS is that in PF everything happens randomly -- we're always sampling from a proposal distribution, rather than greedily taking the "best" hypotheses at each step.

There are, however, some finer points: what proposal distribution is used in step 2B of PF, and what resampling (step 2D) actually does.

In general, the ideal proposal distribution is the true posterior, but this is pretty much always intractable (otherwise we wouldn't be using approximations at all). The most common choice is the transition probability, which corresponds exactly to using path cost in beam search. However, just as we know that a heuristic can help in beam search, this suggests that a proposal distribution in PF that incorporates something like a future-cost heuristic is going to be helpful. I'm sure this has been done, but I don't think it's nearly as common as it perhaps should be, given how helpful it is in BS.
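
As a hedged sketch of what that might look like: weight each candidate extension by its transition probability times an exponentiated future-mass estimate (all four arguments here are hypothetical, problem-specific pieces, not an established recipe):

    import math, random

    def heuristic_proposal(h, extensions, log_transition, log_future):
        # Sample an extension h' of h with probability proportional to
        # transition(h -> h') * exp(log_future(h')), where log_future is a
        # guess at the log probability mass remaining below h' -- the PF
        # analogue of scoring by path cost + heuristic in A*.
        logits = [log_transition(h, e) + log_future(e) for e in extensions]
        m = max(logits)                               # stabilize the softmax
        probs = [math.exp(l - m) for l in logits]
        return random.choices(extensions, weights=probs, k=1)[0]

(In a full SMC algorithm the particle's importance weight would also need to be corrected by the probability of the draw; that bookkeeping is omitted here.)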

What step 2D in PF does is measure whether the samples we have in the beam have fallen out of the true posterior---in other words, whether we are "off the mark." There's a straightforward way to compute this (see the Wikipedia page on particle filters). If we are off the mark, we resample all the elements in our beam, essentially trying to get back on the right path. My experience with PF is that this is pretty essential to getting things to work, but as far as I know there's no analogue in BS land.
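
For concreteness: with unnormalized particle weights w_1, ..., w_N, the usual estimate is

    ESS = (w_1 + ... + w_N)^2 / (w_1^2 + ... + w_N^2)

which equals N when the weights are uniform and tends toward 1 as a single particle dominates. A common rule of thumb is to resample when the ESS drops below N/2.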

It may be that these two things (resampling and heuristic) are two different ways of solving the same problem and that doing both is not so helpful. But I tend to doubt it.

One other thing that I've never seen done in PF is maintaining multiple beams. Admittedly this makes the resampling step harder (maybe... I haven't worked through it), but it may be helpful. For instance, one thing we found in the PF for the coalescent was that the PF likes to merge clusters early, with no regard for how hard life is going to be later. We use resampling to try to fix this, but another way would be to use multiple beams. This is commonly done in, e.g., machine translation, where a single beam doesn't do well, so we maintain one beam for each number of foreign (or English) words covered.
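
In MT-style stack decoding, the "multiple beams" are just a map from number-of-words-covered to a beam. A hypothetical sketch, assuming `expand`, `covered`, and `score` are problem-specific callbacks and each expansion covers at least one more word:

    from collections import defaultdict

    def multi_beam_decode(init, expand, covered, score, sent_len, capacity=100):
        # One beam per number of source words covered, as in MT stack
        # decoding; lower score is better.
        beams = defaultdict(list)
        beams[covered(init)].append(init)
        for n in range(sent_len):
            for h in list(beams[n]):
                for h2 in expand(h):
                    beams[covered(h2)].append(h2)
            # prune every beam back to capacity after expanding level n
            for m in list(beams):
                beams[m] = sorted(beams[m], key=score)[:capacity]
        return min(beams[sent_len], key=score)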

So what's the point? I think that by recognizing the similarity between BS and PF, we can see ways to potentially improve both. For instance, the following seem promising:
  1. Use resampling techniques in beam search (see the sketch after this list)
  2. Use multiple beams in particle filtering
  3. Use A*-style heuristics in the proposal distribution for particle filtering
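
To make idea 1 concrete, here is a speculative sketch of a resampling-style pruning step for beam search; the temperature knob and the exp(-score/T) weighting are assumptions of mine, not established practice:

    import math, random

    def resample_beam(beam, score, capacity, temperature=1.0):
        # Instead of deterministically dropping the worst hypotheses, keep
        # `capacity` survivors sampled (with replacement) with probability
        # proportional to exp(-score / temperature).
        m = min(score(h) for h in beam)           # stabilize the exponentials
        w = [math.exp(-(score(h) - m) / temperature) for h in beam]
        return random.choices(beam, weights=w, k=capacity)

Low temperatures concentrate the sample on the best-scoring hypotheses, though unlike deterministic pruning the sample can contain duplicates, which matters for the "k-best but distinct" point Vikash raises in the comments below.
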
An alternative suggestion would be for NLPers to pay more attention to particle filtering. It's a nice, clean probabilistic analogue to a technique that we love (or at least are forced to use).

Comments:

Kevin said...

Nice post, Hal! There's definitely a connection here worth exploring.

Question: I don't really understand what you mean by multiple beams and why PF doesn't usually have it. I thought in PF we sample from each particle individually, which leads to the ability to estimate distributions with multiple modes---doesn't that correspond to multiple beams?

Mark Johnson said...

IMHO the main advantage of particle filtering is that it works in situations where it's possible to sample but it's not possible to find the n-best, so beam search would be intractable.

Lots of Bayes net and MRF problems are like this, and one of the wonderful things about sampling is that it is tractable even if you can't compute the argmax or the partition function.

I keep my hand on my wallet when people ask me to accept random lower probability solutions in place of higher probability ones for no real reason. So I suspect that beam search will be better if it is tractable, all else equal.

Vikash Mansinghka said...

Shameless plug: Along with Keith Bonawitz, Dan Roy, and Josh Tenenbaum, I am finishing up a tech report on this basic connection. (Many very helpful insights, along with the whole multiple beam thing, were provided by Hal at NIPS, and his AISTATS paper last year was definitely influential.) We'll include more details, and are applying the technique to several classic inference problems (MRFs; Bayes Nets; and simple nonparametric models, like our AClass paper at AISTATS last year).

Now, a few comments:

- In the analogy I see, 'path length so far' is analogous to particle weight, and 'cost to go' is analogous to proposal probability (though proposal probability needn't be an underestimate of the true conditional posterior, just an approximation). Beaming is analogous to resampling.

- I can A* search with beam but no heuristic, heuristic but no beam, neither, or both. Similarly, I can SMC with resampling but crappy (e.g. uniform) proposals, good proposals (e.g. expensively computed via lookahead, so they're close to exact conditional posteriors) but no resampling, or both.

Specifically: exact optimization involves extending all partial paths (no beaming), though considering extensions in order of length + estimated-cost-to-go lets you avoid some, losing nothing if your estimate is a lower bound and saving time the tighter it is. Similarly, exact filtering involves summing over all latent state trajectories (no resampling), though a good proposal (and consideration of states in decreasing order of weight * proposal probability) might allow you to get most of the mass quickly.

Sampling instead of summing adds variance (making inference approximate) but lets you dodge exponential time/space. Beaming for optimization guarantees polynomial time/space and improves bounded-time performance, at the cost of error (with or without a heuristic, though the two can interact to make the error worse). Similarly, resampling adds bias (and can interact with bad proposals, leading to degeneration), but attempts to reduce expansion of low-probability paths, in some cases improving solution quality for bounded time/space.

- The analogy between beaming and resampling is a bit subtle. In A* search, you can compare partial paths of different depths. This might interact badly with beaming (hence the idea of multiple beams). In an SMC algorithm for, e.g., tracking, you can extend particles at different rates (i.e., incorporate different numbers of datapoints or timepoints into particles), but when you resample, you can only resample particles that have the same datapoints incorporated. This is because SMC (really, recursive unnormalized importance sampling) depends on resampling as a bootstrap, which is only valid if all the particles are drawn from the same distribution (i.e., share the same unknown normalizing constant Z).

However, multiple beams (each consisting of partial paths of the same depth) and multiple resampling (bagging particles according to how many datapoints from some ordered list they've incorporated, and only resampling among particles with the same number) both should go through fine. Ultimately this is because determining how much credit to assign to a partial solution is both harder and easier when you sample: harder in that you need a sense of the probability mass left to go, easier in that you don't need strict underestimates for future credit.

Furthermore, in A*+beam, k-best really means "k-best but distinct," since the determinism of A* means two partial paths with the same edge sequence will extend the same way under the algorithm (so there is no point representing them twice in the beam). This is not the case for SMC: two particles that are the same up to time t might extend very differently.

- It would be very interesting to see how to recover A* as a zero temperature limit of an appropriately designed SMC algorithm. Developing this case (along with other classic, clever A*ish ideas like dependency directed backjumping) and supporting resampling for different length partial paths (or particles with different Zs), even approximately, seems like an important avenue for future work.

Re Mark: I don't think I agree. First off, in inference (including learning), I care about finding typical posterior solutions (or solutions from regions of high mass). Optimization (as opposed to sampling + averaging) can lead to all kinds of weird atypicalities, including overfitting in the learning setting (no model averaging). Second, even in decision-making (e.g. based on expected utility), it often turns out that "noisy optimization" (e.g. sampling from an 'annealed' expected-utility surface over actions) is better than choosing the MEU action -- this guards against model error and builds in exploration. Even if one really wants to optimize, pretending to try to sample (at least for a while) often helps avoid local minima -- compare/contrast MCMC, MCMC+annealing, and hillclimbing as pure optimization methods.

I'd be very interested to hear about situations where pure optimization (both in terms of the problem setup and in terms of the local decisions made by an algorithm) is clearly better. I'm also very interested in comments from readers about problems they'd like to see the approach applied to (or, more generally, what kinds of experiments would help illustrate the tradeoffs discussed in this post). NLP examples would be especially welcome.

hal said...

kevin -- i don't think PF has an equivalent to multiple beams... this would, in PF, amount to having multiple sets of particles, each covering a different number of data points, which is not often done.

mark -- i would argue that there are very few problems for which we can actually find a true n-best... even parsing, which is technically polynomial, is usually done with huge amounts of pruning and thresholding.

the issue of taking lower probability scores is a compelling one... anyone who knows me knows that i pretty much feel the same way. i think there are a few answers to this. (a) if you believe jenny, chris and andrew's results (i'm not expressing a personal opinion one way or the other), then you would believe that in a pipelined system you're better off sampling than finding a k-best output. (b) even if we're doing something like beamed A* with an admissible heuristic, then we're still crossing our fingers that our heuristic is "good enough" -- adding a bit of randomness might not hurt (and could possibly help). (c) we may still be able to borrow some useful things from the PF literature, like resampling, and adapt it to beam search.

vikash -- very nice comment ;)
