I finished K-means and Gaussian mixture models in class last week or maybe the week before. I've previously discussed the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.
There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs. (I say "typical" because you can modify these algorithms fairly extensively as you see fit.) The first difference is that GMMs just have more parameters. The parameters of K-means are typically the cluster assignments ("z") and the means ("mu"). The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and the cluster covariances ("Sigma"). The GMM is just a richer model. Of course, you can restrict it so that all clusters are isotropic and all prior probabilities are equal, in which case you've effectively removed this difference (or you can add these things into K-means). The second difference is that GMMs operate under the regime of "soft assignments," meaning that points aren't wedded to clusters: they only prefer (in a probabilistic sense) some clusters to others. This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.
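To make the contrast concrete, here's a minimal sketch on a toy setup of my own (the data, names, and dimensions are illustrative, not anything from class): the GMM E-step uses the priors (pi) and covariances (Sigma) to produce soft responsibilities, while K-means keeps only the means and makes hard assignments.

```python
# Toy illustration (my own setup): GMM soft responsibilities vs. K-means hard assignments.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                          # a few 2-D points
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])             # means: shared by K-means and GMM
pi = np.array([0.5, 0.5])                            # GMM only: class prior probabilities
Sigma = np.stack([np.eye(2), np.eye(2)])             # GMM only: cluster covariances

# GMM: soft assignments (responsibilities), one row per point, one column per cluster
dens = np.stack([multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(2)], axis=1)
resp = pi * dens
resp /= resp.sum(axis=1, keepdims=True)

# K-means: hard assignment of each point to its nearest mean
z = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

print(resp.round(2))   # soft: each point spreads its mass over clusters
print(z)               # hard: each point commits to one cluster
```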
One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value. This leads to something that has much more of a K-means feel.
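Concretely, the hard E-step just clamps the soft responsibilities to a one-hot argmax; a minimal sketch with toy numbers of my own:

```python
# Hard ("Viterbi") E-step: clamp each point's posterior over clusters to its
# most likely value instead of keeping the full distribution.
# `resp` stands for any N x K matrix of soft responsibilities (toy values here).
import numpy as np

resp = np.array([[0.7, 0.3], [0.2, 0.8], [0.55, 0.45]])    # soft E-step output
hard = np.zeros_like(resp)
hard[np.arange(len(resp)), resp.argmax(axis=1)] = 1.0      # one-hot argmax
# The M-step is unchanged: it just uses `hard` in place of `resp`,
# which is what gives hard EM its K-means feel.
print(hard)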
This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult. (Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in this circle is that pegging is much ado about nothing.) My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay." (This is purely from a practical perspective.) The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points. (Yes, this is hand-wavy and it's easy to construct examples where it fails. But it seems to work many times.) Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).
Nevertheless, a particular trusted friend has told me he always regrets it when he can't do full EM and has to do hard EM: he's never seen a case where full EM doesn't help. (Or maybe "rarely" would be more fair.) Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).
Now let's think about approximate inference in graphical models. Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z"). For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments. (Note that this is departing slightly from typical notation for EM.) Suppose I want to do inference in such a model. Here are three things I can do:
- Just run max-product. That is, maximize p(X,Z) rather than p(X).
- Just run sum-product. That is, compute expectations over X and Z, rather than just over Z.
- Run EM, by alternating something like sum-product on Z with something like max-product on X. (A toy sketch of all three options follows this list.)
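Here's a minimal sketch of all three options on a toy model of my own devising (not a reproduction of the experiments described below): a 1-D mixture of two unit-variance Gaussians, where the two means are the X nodes (discretized on a coarse grid so the max and sum can be computed exactly) and the per-point assignments are the Z nodes.

```python
# Toy setup (mine, for illustration): X nodes = two means on a grid, Z nodes = assignments.
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])
grid = np.linspace(-5, 5, 41)        # candidate values for each mean
sigma = 1.0

# log N(x_n; grid_g, sigma^2) for every grid value g and data point n (up to a constant)
ll = -0.5 * ((data[None, :] - grid[:, None]) / sigma) ** 2           # shape G x N

# Scores for every pair of means (m0, m1), handling Z two ways:
# max over each z_n (every point picks its better cluster) vs. sum over each z_n.
pair_max = np.maximum(ll[:, None, :], ll[None, :, :]).sum(axis=2)    # G x G
pair_sum = np.logaddexp(ll[:, None, :], ll[None, :, :]).sum(axis=2)  # G x G

# Option 1: max-product analogue -- maximize p(X, Z) jointly.
i, j = np.unravel_index(pair_max.argmax(), pair_max.shape)
print("max-product means:", grid[i], grid[j])

# Option 2: sum-product analogue -- marginalize everything (Z, and the other
# mean under a uniform grid prior), then take each mean's marginal mode.
print("sum-product means:",
      grid[logsumexp(pair_sum, axis=1).argmax()],
      grid[logsumexp(pair_sum, axis=0).argmax()])

# Option 3: EM -- expectations (sum) over Z, maximization over X.
mu = np.array([-1.0, 1.0])
for _ in range(50):
    comp = -0.5 * ((data[None, :] - mu[:, None]) / sigma) ** 2       # 2 x N
    resp = np.exp(comp - logsumexp(comp, axis=0, keepdims=True))     # soft E-step
    mu = (resp * data).sum(axis=1) / resp.sum(axis=1)                # M-step
print("EM means:", mu)
```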
What do you think happens? Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.
When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.
When most of the nodes are Z (marginalization) nodes, EM does best and sum-product does almost as well. But max-product also does almost as well.
This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature), and how you initialize these models (e.g., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not). Max-product is just unbeatable.
In a sense, from a practical perspective, this is nice. It says: if you have a mixed model, just run max-product and you'll do just as well as if you had done something more complicated (like EM). But it's also frustrating: we should be getting some leverage out of marginalizing over the nodes that we should marginalize over. Especially in the high temperature case, where there is lots of uncertainty in the model, max-product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).
Relating this back to K-means versus GMM, for the case where the models are the same (a GMM restricted so that the priors are uniform and the covariances are fixed and isotropic), the analogy is that, as far as the means go, it doesn't matter which one you use. Even if there's lots of uncertainty in the data. Of course, you may get much better assignments from the GMM (or you may not, I don't really know). But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter. At all. Ever.
Part of me finds this hard to believe (and note that I haven't actually run the K-means versus GMM experiments), but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them. Shouldn't someone have noticed this before, though? For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?
(Yes, I'm aware of some theoretical results, e.g., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and it contradicts many of our experimental studies, where sum-product + rounding just flat out sucks. Maybe there are other results out there that we just haven't been able to dig up.)