29 September 2006

Doing DP Clustering? Don't Sample!

Dirichlet process techniques are increasingly popular tools for Bayesian analysis. There's not enough space here to describe how they work, so I'll assume you know. With the exception of the Variational DP techniques that Dave Blei and Michael Jordan developed, one typically uses MCMC techniques to perform sampling from the posterior distribution and then uses these samples to compute properties of interest. In many cases, the properties of interest are simply the cluster assignments. Since it's unclear how to use multiple samples over the cluster assignments to generate a single one (except, perhaps, by some MBR method), one typically just chooses the single cluster assignment from the sample that has maximal marginal likelihood. This is, of course, not really Bayesian, but it still seems a reasonable thing to do.

For this post, I'm going to consider a model in which we have a likelihood term F(x | theta) and a mean prior G0, where we first draw G from DP(G0,alpha) and then theta from G and then x from theta. In particular, I will assume G0 and F are conjugate, which means that in the Gibbs sampler, we can analytically integrate out both G and theta and draw only for the cluster assignments (this is Neal's algorithm 3). This algorithm is nice because it converges much faster than ones that also have to sample over theta. (I know there are other good algorithms...MH works, as do split/merge proposals.)

The potential worry is that if all you want is the cluster assignment that maximizes the marginal likelihood, is a Gibbs sampler a good search algorithm? The answer is no.

Let's consider another way to search. We arbitrarily order the observed data, then label left-to-right over this ordering. When we get to x_n, we'll have already created k clusters, and then there are k+1 possible choices. It is straightforward to compute a partial marginal likelihood for each of these possibilities, which leads to a direct implementation for breadth-first search. But we can (often) do better. If F is in the exponential family and G0 is its natural prior, we can construct a reasonably good admissible heuristic and apply A* search (I'm working on writing up the details, and I'll make the code available shortly...it's quite simple to implement, but proving that the heuristic is admissible is a bit involved and I'm not 100% sure it's true for all exponential family members or just specific ones).

Here are some artificial experiments. I generated ten documents over a vocabulary of 40 words, based on three clusters drawn from a symmetric Dirichlet with parameter alpha=4, approximately 40 words per document (distributed Poisson). I then use a DP with G0=Dirichlet(2) and F=Multinomial, with the scale parameter on the DP=2. I ran two Gibbs samplers (one initializing with a single large cluster, one initializing with many singleton clusters), and three search algorithms on this data. The first search was full A*. The second was beamed A* with a beam of 5. The last was A* with an inadmissible, but in some sense tigher, heuristic, that's even easier to compute. The results are in the following graph:

The y axis is negative log probability (lower is better) and the x axis is time in seconds. This is all in matlab and I didn't do anything fancy to make either algorithm especially fast. The horizonal lines are, top to bottom, heuristic A*, beam A* and full A*. The timing are, <0.1s, 1.6s and 2.1s, respectively (variances over 5 runs with different permutations of the input are shown in parens). So the search does significantly better (attains a higher marginal likelihood than the sampler) in very little time (even after 30 seconds, the Gibbs sampler is still really far from getting down to even the performance of heuristic A*).

So that's for small data. It turns out that the heuristic isn't good enough to work on huge data sets, unless you use a really small beam, which hampers performance (I'm investigating this). But if we use the inadmissible heuristic, we can handle large-ish data sets fairly easily. I ran the same experiment, but with 1000 docs over a vocabulary of 400 words, with 20 clusters, 1000 words per document and a symmetric Dirichlet prior with alpha=4. The Gibbs here actually sucks. Within about ten iterations, it gets suck with a neg log lik of 724842 and 16 clusters (about 50 seconds per Gibbs iteration). The heuristic A* takes about 2300 seconds total and ends with a neg log like of 707020 (and 20 clusters), quite significantly better. Combining heuristic A* with beam search (beam=100) leads to a neg log lik of 707260 (slightly worse, but no where near as bad as Gibbs) in only 1800 seconds.

(Incidentally, the Gibbs gets stuck because the documents are so long, that the marginal posterior likelihoods completely dwarf the vanilla marginals, so it essentially never moves out of a local maximum. With shorter documents, this doesn't happen as much.)

I'm still working on this a bit...I'm using DP clustering enough that this result make a huge difference for me. I think there's a lot of room for improvement, even over what I have so far. I'm also applying it to real data to see if it still helps. But overall, it looks like this is a fairly promising direction (which is actually quite surprising, because clustering is typically not something that we would typically attack in a "left-to-right" fashion).


Anonymous said...

Have a look at:

Kenichi Kurihara, Max Welling and Nikos Vlassis (2006)
Accelerated Variational DP mixture Models
NIPS 2006

For fast inference.

Anonymous said...

very interesting hal! i'm looking forward to hearing about the details of the heuristic(s). for the purists who may not be satisfied with MAP, this could also be used as a starting point for gibbs sampling.

hal said...

incidentally, is finding the MAP solution a known NP-hard problem? any references?

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花