19 February 2009

Mixing of Gibbs samplers for topic (and mixture) models

This is just a quick note, because it's something that had somehow never occurred to me until a few days ago. If you've ever talked to me in person about the standard (collapsed) Gibbs samplers for topic models, you know that I get concerned that these things don't mix. And it's not just the generic "how do we know?" worry that can be levied at (almost) any sampler, but a very specific one: we know for certain that our Gibbs samplers for topic models aren't mixing. How do we know? Because, for the most part, they don't mode switch, at least once you have a reasonable amount of data. You can see this quite easily by simply watching the topic assignments. (The standard collapsed samplers for Gaussian or multinomial mixture models exhibit the same problem.)
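
Here's one rough way to make "watching the topic assignments" concrete (a sketch only; nkw_a and nkw_b stand for hypothetical (K, V) topic-word count matrices saved at two well-separated iterations): find the permutation of topics that best aligns the two samples. If that permutation always comes back as the identity, the chain has never hopped to a relabeled copy of the mode it is sitting in.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def best_topic_alignment(nkw_a, nkw_b):
        """Permutation of topics that best aligns two saved samples."""
        a = nkw_a / nkw_a.sum(axis=1, keepdims=True)  # topic-word distributions, sample A
        b = nkw_b / nkw_b.sum(axis=1, keepdims=True)  # topic-word distributions, sample B
        cost = -(a @ b.T)                             # negative topic-to-topic similarity
        _, perm = linear_sum_assignment(cost)         # best one-to-one matching
        return perm                                   # identity => no label switching observed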

The reason this always bugged me is that we now have definitive evidence that these samplers aren't mixing in the sense of mode hopping, which makes me worry that they might also fail to mix in other, harder-to-detect ways.

Well, worry no more. Or at least worry less. The mode hopping is a red herring.

Maybe the following observation is obvious, but I'll say it anyway.

Let's take our topic-model Gibbs sampler and introduce a new Metropolis-Hastings step. This MH step simply takes two topic indices (say topics i and j) and swaps them, picking i and j uniformly at random from the (K choose 2) possibilities. It's easy to verify that the acceptance probability for this move is one (the proposal probabilities q cancel, and the target probabilities p are identical because, with the usual symmetric priors, the model is invariant to relabeling the topics), which means it's really more like a Gibbs step than an MH step (in the sense that Gibbs steps are MH steps that are always accepted).
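
In code, this extra step amounts to a relabeling. Here's a minimal sketch, assuming a collapsed LDA sampler whose state is a vector z of token-level topic assignments plus the usual count arrays (hypothetically named ndk for document-topic counts, nkw for topic-word counts, and nk for topic totals):

    import numpy as np

    def swap_topics_step(z, ndk, nkw, nk, K, rng):
        """Relabel two topics chosen uniformly at random.

        With symmetric priors the model is invariant to permuting topic
        labels, so the acceptance probability is exactly 1: always swap.
        """
        i, j = rng.choice(K, size=2, replace=False)  # uniform over the K-choose-2 pairs
        zi, zj = (z == i), (z == j)
        z[zi], z[zj] = j, i                # relabel the token assignments
        ndk[:, [i, j]] = ndk[:, [j, i]]    # swap document-topic columns
        nkw[[i, j], :] = nkw[[j, i], :]    # swap topic-word rows
        nk[[i, j]] = nk[[j, i]]            # swap topic totals
        return z, ndk, nkw, nk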

The observation is that (a) this doesn't actually change what the sampler is doing in any real, meaningful way -- that is, re-indexing things is irrelevant; but (b) you now cannot claim that the sampler isn't mode switching. It's mode switching a lot.

Sure, it might still have poor mixing properties for other reasons, but at least now there isn't this elephant-in-the-room reason it can't possibly be mixing.

So this is a totally "useless" observation in practice (sometimes we actually like to exploit the fact that the sampler isn't label-switching), but it might matter theoretically. For instance, if you want to prove a fast mixing rate for these samplers (assuming you believe they actually mix fast, which I'm somewhat inclined to believe), you're not going to be able to prove anything without making this trivial change to the sampler, because without it they really don't mix fast.

12 comments:

  1. This is yet another instance of the problem of non-identifiability, which has received less attention in the field than it probably should (except for some recent work by Percy Liang et al. and older work by Stu Geman et al.). The technical question is how to define mixing (or convergence of EM procedures, another notorious case) modulo non-identifiable permutations of hidden-variable values. I had a long chat with Stu Geman about this last summer, and he expressed frustration that the problem had defied his and his students' repeated attempts to come up with an analysis method that is not sidetracked by non-identifiability.

  2. What kind of mixing are you worried about? We know that the topic IDs aren't identified, but their multinomial distributions are better behaved, as Griffiths and Steyvers showed with their greedy, KL-based alignment of topics.

    From my reading of the advice surrounding BUGS (e.g., its manual or Gelman et al.'s books), the usual thing to do is to monitor a derived statistic that is identified. For instance, if a skill parameter and a difficulty parameter in an item-response model aren't identified individually, but their difference (difficulty - skill) is, you track that difference for mixing across chains.

    What should mix well (because it's identified) in LDA are the posterior inferences that sum out the topic identifiers: for instance, similarity of documents as measured by their topic distributions, similarity of words as measured by topic co-occurrence, or likelihood estimates for new documents.
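
    One such identified quantity, sketched in Python (theta here is a hypothetical (D, K) matrix of per-document topic proportions from a single sample): permuting the topic labels only permutes the columns of theta, which leaves document-document similarities unchanged, so they can be tracked for mixing across chains.

        import numpy as np

        def doc_similarity(theta):
            """Pairwise document similarity from per-document topic proportions.

            theta: (D, K), row d is document d's distribution over topics.
            Invariant to permuting the K topic labels, hence identified.
            """
            return theta @ theta.T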

  3. In related work, Christian Robert orders his mixture components by their corresponding proportions. This is described in detail in Robert and Casella's book about MCMC.
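
    A rough sketch of that relabeling, assuming hypothetical traces weights (an (S, K) array of sampled mixing proportions over S iterations) and means (a matching (S, K) array of component parameters): sort each sample's components by proportion, so the traces become comparable across iterations despite label switching.

        import numpy as np

        def relabel_by_proportion(weights, means):
            """Sort each sample's mixture components by their proportions."""
            order = np.argsort(-weights, axis=1)  # descending by mixing proportion
            return (np.take_along_axis(weights, order, axis=1),
                    np.take_along_axis(means, order, axis=1))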
