natural language processing blog: 2009/02

This is just a quick note because it's something that had somehow never occurred to me until a few days ago. If you've ever talked to me in person about the standard (collapsed) Gibbs samplers for topic models, you know that I get concerned that these things don't mix. And it's not just the generic "how do we know" feeling that can get levied at (almost) any sampler, but a very specific we know for certain that our Gibbs samplers for topic models aren't mixing. How do we know? Because, for the most part, they don't mode switch. You can figure this out quite easily by simply watching the topic assignments. (The standard collapsed samplers for Gaussian or Multinomial mixture models exhibit the same problem.) At least if you have a reasonable amount of data.

The reason this always bugged me is because we now have definitive evidence that these things aren't mixing in the sense of mode hopping, which leads me to worry that they might also not be mixing in other, less easy to detect ways.

Well, worry no more. Or at least worry less. The mode hopping is a red herring.

Maybe the following observation is obvious, but I'll say it anyway.

Let's take our topic model Gibbs sampler and introduce a new Metropolis-Hastings step. This MH step simply takes two topic indices (say topics i and j) and swaps them. It picks i and j uniformly at random from the (K choose 2) possibilities. It's easy to verify that the acceptance probability for this move will be one (the qs will cancel and the ps are identical), which means that it's really more like a Gibbs step than an MH step (in the sense that Gibbs steps are MH steps that are always accepted).

The observation is that (a) this doesn't actually change what the sampler is doing in any real, meaningful way -- that is, re-indexing things is irrelevant; but (b) you now cannot claim that the sampler isn't mode switching. It's mode switching a lot.

Sure, it might still have poor mixing properties for other reasons, but at least now there isn't this elephant-in-the-room reason it can't possibly be mixing.

So this is a totally "useless" practical observation (sometimes we like to try to exploit the fact that it's not mixing), but from a theoretical perspective it might be interesting. For instance, if you want to prove something about a fast mixing rate for these samplers (if you believe they are actually mixing fast, which I'm somewhat keen to believe), then you're not going to be able to prove anything if you don't make this trivial change to the sampler (because without it they're not really mixing fast). So it might have interesting theoretical consequences.

I used to know a fair amount about math; in fact, I even applied to a few grad schools to do logic (long story, now is not the time). I was never involved in actual math research (primarily because of the necessary ramp-up time -- thank goodness this doesn't exist as much in CS!), but I did get a sense from my unofficial undergrad advisor how things worked. The reason I started thinking of this is because I recently made my way about halfway through a quite dense book on long games by a prof from grad school (I cross-enrolled at UCLA for a few semesters). The basic idea of a countable game is that there is a fixed subset A of [0,1] (subset of the reals) and two players alternate play. In each move, they play an integer from 0 to 9. They play for a countably infinite number of moves, essentially writing down (in decimal form) a real number. If, at the "end", this number is in A, then player 1 wins; otherwise, player 2 wins. Both players know A. The set A is said to be determined if there is a strategy that will force a win; it is undetermined otherwise. A long game is the obvious generalization where you play for longer than countable time. The details don't matter.

This led me to think, as someone who's moved over to working a lot on machine learning: is there an analogous question for online learning? There are several ways to set this up and I won't bore you with the details (I doubt any reader here really cares), but depending on how you set it up, you can prove several relatively trivial, but kind of cute, results (I say trivial because they took me on the order of hours, which means that someone who knows what they're doing probably would see them immediately). I basically did this as a mental exercise, not for any real reason.

But it got me thinking: obviously machine learning people wouldn't care about this because it's too esoteric and not at all a realistic setting (even for COLTers!). I strongly doubt that logicians would care either, but for a totally different reason. From my interaction, they would be interested if and only if two things were satisfied: (a) the result showed some interesting connection between a new model and existing models; (b) the proofs were non-trivial and required some new insight that could be then applied to other problems. Obviously this is not my goal in life, so I've dropped it.

This led to me introspect: what is is that we as a community need in order to find some result interesting? What about other fields that I claim to know a bit about?

Let's take algorithms for a minute. Everything here is about big-O. Like the math types, a result without an interesting proof is much less interesting than a result with an interesting proof, though if you start reading CS theory blogs, you'll find that there's a bit of divide in the community on whether this is good or not. But my sense (which could be totally broken) is that if you have a result with a relatively uninteresting proof that gets you the same big-O running time as the current state of the art, you're in trouble.

I think it's interesting to contrast this with what happens in both NLP and ML. Big-O works a bit differently here. My non-technical description of big-O to someone who knows nothing is that it measure "order of magnitude" improvements. (Okay, O(n log n) versus O(n log log n) is hard to call an order of magnitude, but you get the idea.) An equivalent on the experimental side would seem to be something like: you cut the remaining error on a problem by half or more. In other words, if state of the art is 60% accuracy, then an order of magnitude improvement would be 80% accuracy or better. At 80% it would be 90%. At 90% it would be 95% and so on. 90.5% to 91.2% is not order of magnitude.

I actually like this model for looking at experimental results. Note that this has absolutely nothing to do with statistical significance. It's kind of like reading results graphs with a pair of thick glasses on (for those who don't wear glasses) or no glasses on (for those who wear think glasses). I think the justification is that for less than order of magnitude improvement, it's really just hard to say whether the improvement is due to better tweaking or engineering or dumb luck in how some feature was implemented or what. For order of magnitude improvement, there almost has to be something interesting going on.

Now, I'm not proposing that a paper isn't publishable if it doesn't have order of magnitude improvement. Very few papers would be published this way. I'm just suggesting that improving the state of the art not be -- by itself -- a reason for acceptance unless it's an order of magnitude improvement. That is, you'd either better have a cool idea, be solving a new problem, analyzing the effect of some aspect of a problem that's important, etc., or work on a well trod task and get a big-O improvement.

What I'm saying isn't novel, of course... the various exec boards at the ACL conferences have been trying to find ways to get more "interesting" papers into the conferences for (at least) a few years. This is just a concrete proposal. Obviously it requires buy-in at least of area chairs and probably reviewers. And there are definitely issues with it. Like any attempt to make reviewing non-subjective, there are obviously corner cases (eg., you have a sort-of-interesting idea and an almost-order-of-magnitude improvement). You can't mechanize the reviewing process. But frankly, when I see paper reviews that gush over tiny improvements in the state of the art in an otherwise drab paper, I just get annoyed :).

natural language processing blog

19 February 2009

Mixing of gibbs samplers for topic (and mixture) models

04 February 2009

Beating the state of the art, big-O style