19 February 2009

Mixing of Gibbs samplers for topic (and mixture) models

This is just a quick note, because it's something that had somehow never occurred to me until a few days ago. If you've ever talked to me in person about the standard (collapsed) Gibbs samplers for topic models, you know that I get concerned that these things don't mix. And it's not just the generic "how do we know?" worry that can be levied at (almost) any sampler, but a very specific one: we know for certain that our Gibbs samplers for topic models aren't mixing. How do we know? Because, for the most part, they don't mode switch, at least once you have a reasonable amount of data. You can see this quite easily by simply watching the topic assignments. (The standard collapsed samplers for Gaussian or multinomial mixture models exhibit the same problem.)
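
Here's one rough way to make "watching the topic assignments" concrete (a sketch only; nkw_a and nkw_b stand for hypothetical (K, V) topic-word count matrices saved at two well-separated iterations): find the permutation of topics that best aligns the two samples. If that permutation always comes back as the identity, the chain has never hopped to a relabeled copy of the mode it is sitting in.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def best_topic_alignment(nkw_a, nkw_b):
        """Permutation of topics that best aligns two saved samples."""
        a = nkw_a / nkw_a.sum(axis=1, keepdims=True)  # topic-word distributions, sample A
        b = nkw_b / nkw_b.sum(axis=1, keepdims=True)  # topic-word distributions, sample B
        cost = -(a @ b.T)                             # negative topic-to-topic similarity
        _, perm = linear_sum_assignment(cost)         # best one-to-one matching
        return perm                                   # identity => no label switching observed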

The reason this always bugged me is that we now have definitive evidence that these samplers aren't mixing in the sense of mode hopping, which makes me worry that they might also fail to mix in other, harder-to-detect ways.

Well, worry no more. Or at least worry less. The mode hopping is a red herring.

Maybe the following observation is obvious, but I'll say it anyway.

Let's take our topic-model Gibbs sampler and introduce a new Metropolis-Hastings step. This MH step simply takes two topic indices (say topics i and j) and swaps them, picking i and j uniformly at random from the (K choose 2) possibilities. It's easy to verify that the acceptance probability for this move is one (the proposal probabilities q cancel, and the target probabilities p are identical because, with the usual symmetric priors, the model is invariant to relabeling the topics), which means it's really more like a Gibbs step than an MH step (in the sense that Gibbs steps are MH steps that are always accepted).
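
In code, this extra step amounts to a relabeling. Here's a minimal sketch, assuming a collapsed LDA sampler whose state is a vector z of token-level topic assignments plus the usual count arrays (hypothetically named ndk for document-topic counts, nkw for topic-word counts, and nk for topic totals):

    import numpy as np

    def swap_topics_step(z, ndk, nkw, nk, K, rng):
        """Relabel two topics chosen uniformly at random.

        With symmetric priors the model is invariant to permuting topic
        labels, so the acceptance probability is exactly 1: always swap.
        """
        i, j = rng.choice(K, size=2, replace=False)  # uniform over the K-choose-2 pairs
        zi, zj = (z == i), (z == j)
        z[zi], z[zj] = j, i                # relabel the token assignments
        ndk[:, [i, j]] = ndk[:, [j, i]]    # swap document-topic columns
        nkw[[i, j], :] = nkw[[j, i], :]    # swap topic-word rows
        nk[[i, j]] = nk[[j, i]]            # swap topic totals
        return z, ndk, nkw, nk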

The observation is that (a) this doesn't actually change what the sampler is doing in any real, meaningful way -- that is, re-indexing things is irrelevant; but (b) you now cannot claim that the sampler isn't mode switching. It's mode switching a lot.

Sure, it might still have poor mixing properties for other reasons, but at least now there isn't this elephant-in-the-room reason it can't possibly be mixing.

So this is a totally "useless" observation in practice (sometimes we actually like to exploit the fact that the sampler isn't label-switching), but it might matter theoretically. For instance, if you want to prove a fast mixing rate for these samplers (assuming you believe they actually mix fast, which I'm somewhat inclined to believe), you're not going to be able to prove anything without making this trivial change to the sampler, because without it they really don't mix fast.

12 comments:

  1. This is yet another instance of the problem of non-identifiability, which has received less attention in the field than it probably should (except for some recent work by Percy Liang et al. and older work by Stu Geman et al.). The technical question is how to define mixing (or convergence of EM procedures, another notorious case) modulo non-identifiable permutations of hidden-variable values. I had a long chat with Stu Geman about this last summer, and he expressed frustration that the problem had defied his and his students' repeated attempts to come up with an analysis method that is not sidetracked by non-identifiability.

  2. What kind of mixing are you worried about? We know that the topic IDs aren't identified, but their multinomial distributions are better behaved, as Griffiths and Steyvers showed with their greedy, KL-based alignment of topics.

    From my reading of the advice surrounding BUGS (e.g., its manual or Gelman et al.'s books), the usual thing to do is to monitor a derived statistic that is identified. For instance, if a skill parameter and a difficulty parameter in an item-response model aren't identified individually, but their difference (difficulty - skill) is, you track that difference for mixing across chains.

    What should mix well (because it's identified) in LDA are the posterior inferences that sum out the topic identifiers: for instance, similarity of documents as measured by their topic distributions, similarity of words as measured by topic co-occurrence, or likelihood estimates for new documents.
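
    One such identified quantity, sketched in Python (theta here is a hypothetical (D, K) matrix of per-document topic proportions from a single sample): permuting the topic labels only permutes the columns of theta, which leaves document-document similarities unchanged, so they can be tracked for mixing across chains.

        import numpy as np

        def doc_similarity(theta):
            """Pairwise document similarity from per-document topic proportions.

            theta: (D, K), row d is document d's distribution over topics.
            Invariant to permuting the K topic labels, hence identified.
            """
            return theta @ theta.T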

  3. In related work, Christian Robert orders his mixture components by their corresponding proportions. This is described in detail in Robert and Casella's book about MCMC.
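
    A rough sketch of that relabeling, assuming hypothetical traces weights (an (S, K) array of sampled mixing proportions over S iterations) and means (a matching (S, K) array of component parameters): sort each sample's components by proportion, so the traces become comparable across iterations despite label switching.

        import numpy as np

        def relabel_by_proportion(weights, means):
            """Sort each sample's mixture components by their proportions."""
            order = np.argsort(-weights, axis=1)  # descending by mixing proportion
            return (np.take_along_axis(weights, order, axis=1),
                    np.take_along_axis(means, order, axis=1))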
