29 September 2006

Doing DP Clustering? Don't Sample!

Dirichlet process techniques are increasingly popular tools for Bayesian analysis. There's not enough space here to describe how they work, so I'll assume you know. With the exception of the Variational DP techniques that Dave Blei and Michael Jordan developed, one typically uses MCMC techniques to perform sampling from the posterior distribution and then uses these samples to compute properties of interest. In many cases, the properties of interest are simply the cluster assignments. Since it's unclear how to use multiple samples over the cluster assignments to generate a single one (except, perhaps, by some MBR method), one typically just chooses the single cluster assignment from the sample that has maximal marginal likelihood. This is, of course, not really Bayesian, but it still seems a reasonable thing to do.

For this post, I'm going to consider a model in which we have a likelihood term F(x | theta) and a mean prior G0, where we first draw G from DP(G0,alpha) and then theta from G and then x from theta. In particular, I will assume G0 and F are conjugate, which means that in the Gibbs sampler, we can analytically integrate out both G and theta and draw only for the cluster assignments (this is Neal's algorithm 3). This algorithm is nice because it converges much faster than ones that also have to sample over theta. (I know there are other good algorithms...MH works, as do split/merge proposals.)

The potential worry is that if all you want is the cluster assignment that maximizes the marginal likelihood, is a Gibbs sampler a good search algorithm? The answer is no.

Let's consider another way to search. We arbitrarily order the observed data, then label left-to-right over this ordering. When we get to x_n, we'll have already created k clusters, and then there are k+1 possible choices. It is straightforward to compute a partial marginal likelihood for each of these possibilities, which leads to a direct implementation for breadth-first search. But we can (often) do better. If F is in the exponential family and G0 is its natural prior, we can construct a reasonably good admissible heuristic and apply A* search (I'm working on writing up the details, and I'll make the code available shortly...it's quite simple to implement, but proving that the heuristic is admissible is a bit involved and I'm not 100% sure it's true for all exponential family members or just specific ones).

Here are some artificial experiments. I generated ten documents over a vocabulary of 40 words, based on three clusters drawn from a symmetric Dirichlet with parameter alpha=4, approximately 40 words per document (distributed Poisson). I then use a DP with G0=Dirichlet(2) and F=Multinomial, with the scale parameter on the DP=2. I ran two Gibbs samplers (one initializing with a single large cluster, one initializing with many singleton clusters), and three search algorithms on this data. The first search was full A*. The second was beamed A* with a beam of 5. The last was A* with an inadmissible, but in some sense tigher, heuristic, that's even easier to compute. The results are in the following graph:



The y axis is negative log probability (lower is better) and the x axis is time in seconds. This is all in matlab and I didn't do anything fancy to make either algorithm especially fast. The horizonal lines are, top to bottom, heuristic A*, beam A* and full A*. The timing are, <0.1s, 1.6s and 2.1s, respectively (variances over 5 runs with different permutations of the input are shown in parens). So the search does significantly better (attains a higher marginal likelihood than the sampler) in very little time (even after 30 seconds, the Gibbs sampler is still really far from getting down to even the performance of heuristic A*).

So that's for small data. It turns out that the heuristic isn't good enough to work on huge data sets, unless you use a really small beam, which hampers performance (I'm investigating this). But if we use the inadmissible heuristic, we can handle large-ish data sets fairly easily. I ran the same experiment, but with 1000 docs over a vocabulary of 400 words, with 20 clusters, 1000 words per document and a symmetric Dirichlet prior with alpha=4. The Gibbs here actually sucks. Within about ten iterations, it gets suck with a neg log lik of 724842 and 16 clusters (about 50 seconds per Gibbs iteration). The heuristic A* takes about 2300 seconds total and ends with a neg log like of 707020 (and 20 clusters), quite significantly better. Combining heuristic A* with beam search (beam=100) leads to a neg log lik of 707260 (slightly worse, but no where near as bad as Gibbs) in only 1800 seconds.

(Incidentally, the Gibbs gets stuck because the documents are so long, that the marginal posterior likelihoods completely dwarf the vanilla marginals, so it essentially never moves out of a local maximum. With shorter documents, this doesn't happen as much.)

I'm still working on this a bit...I'm using DP clustering enough that this result make a huge difference for me. I think there's a lot of room for improvement, even over what I have so far. I'm also applying it to real data to see if it still helps. But overall, it looks like this is a fairly promising direction (which is actually quite surprising, because clustering is typically not something that we would typically attack in a "left-to-right" fashion).

8 comments:

Anonymous said...

Have a look at:

Kenichi Kurihara, Max Welling and Nikos Vlassis (2006)
Accelerated Variational DP mixture Models
NIPS 2006

For fast inference.

dave blei said...

very interesting hal! i'm looking forward to hearing about the details of the heuristic(s). for the purists who may not be satisfied with MAP, this could also be used as a starting point for gibbs sampling.

hal said...

incidentally, is finding the MAP solution a known NP-hard problem? any references?

. said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

酒店上班請找艾葳 said...

艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作兼差打工假日兼職兼職工作酒店兼差兼差打工兼差日領工作晚上兼差工作酒店工作酒店上班酒店打工兼職兼差兼差工作酒店上班等,想了解酒店相關工作特種行業內容,想兼職工作日領假日兼職兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚,無業績壓力,無脫秀無喝酒壓力,高層次會員制客源,工作輕鬆,可日領現領
一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已,對水水們的上班安全一點保障都沒有!艾葳酒店經紀公司的水水們上班時全程媽咪作陪,不需擔心!只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??
艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,
等相關服務 幫您快速的實現您的夢想~!!

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex