17 November 2009

K-means vs GMM, sum-product vs max-product

I finished K-means and Gaussian mixture models in class last week or maybe the week before. I've previously discussed the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.

There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs. (I say "typical" because you can modify these algorithms fairly extensively as you see fit.) The first difference is that GMMs just have more parameters. The parameters of K-means are typically the cluster assignments ("z") and the means ("mu"). The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and cluster covariances ("Sigma"). The GMM model is just richer. Of course, you can restrict it so all clusters are isotropic and all prior probabilities are even, in which case you've effectively removed this difference (or you can add these things into K-means). The second difference is that GMMs operate under the regime of "soft assignments," meaning that points aren't wed to clusters: they only prefer (in a probabilistic sense) some clusters to others. This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.

One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value. This leads to something that has much more of a K-means feel.

This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult. (Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in this circle is that pegging is much ado about nothing.) My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay." (This is purely from a practical perspective.) The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points. (Yes, this is hand-wavy and it's easy to construct examples where it fails. But it seems to work many times.) Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).

Nevertheless, a particular trusted friend has told me he's always remiss when he can't do full EM and has to do hard EM: he's never seen a case where it doesn't help. (Or maybe "rarely" would be more fair.) Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).

Now let's think about approximate inference in graphical models. Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z"). For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments. (Note that this is departing slightly from typical notation for EM.) Suppose I want to do inference in such a model. Here are three things I can do:
  1. Just run max-product. That is, maximize p(X,Z) rather than p(X).
  2. Just run sum-product. That is, compute expectations over X and Z, rather than just over Z.
  3. Run EM, by alternating something like sum-product on Z and something like max-product onX.
Of these, only (3) is really doing the "right thing." Further, let's get away from the notion of p(X) not correlating with some extrinsic evaluation by just measuring ourselves against exact inference. (Yes, this limits us to relatively small models with 10 or 20 binary nodes.)

What do you think happens? Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.

When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.

Whe most of the nodes are Z (marginalization) nodes, then EM does best and sum-product does almost the same. But max product also does almost the same.

This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature) and how you initialize these models (eg., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not). Max product is just unbeatable.

In a sense, from a practical perspective, this is nice. It says: if you have a mixed model, just run max product and you'll do just as well as if you had done something more complicated (like EM). But it's also frustrating: we should be getting some leverage out of marginalizing over the nodes that we should marginalize over. Especially in the high temperature case, where there is lots of uncertainty in the model, max product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).

Likening this back to K-means versus GMM, for the case where the models are the same (GMM restricted to not have priors or covariances), the analogy is that as far as the means go, it doesn't matter which one you use. Even if there's lots of uncertainty in the data. Of course, you may get much better assignments from GMM (or you may not, I don't really know). But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter. At all. Ever.

Part of me finds this hard to believe, and note that I haven't actually run experiments with K-means and GMM, but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them. Shouldn't someone have noticed this before, though? For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?

(Yes, I'm aware of some theoretical results, eg., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and contradicts many of our experimental studies where sum-product + rounding just flat out sucks. Maybe there are other results out there that we just haven't been able to dig up.)

17 comments:

  1. I've found decreasing robustness from Gibbs sampling to full EM to hard-EM (or max-product) in several NLP apps from clustering to semi-supervised learning to unsupervised stemming.

    As David McKay's book chapter on EM/clustering points out, EM has a tendency to blow up around zero variance estimates.

    I compare full EM and Gibbs with a hierarchical model for gold-standard inference from multiply annotated data in a LingPipe blog post invoking MacKay's terminology, EM Goes Kaboom.

    McCallum, Nigam and Mitchell, in their paper Semi-Supervised Text Classification Using EM, discuss the difference between hard and soft EM and introduce a typical kind of compromise with an annealed "temperature" parameter. Technically, this is an exponent on "raw" probability estimates that get renormalized before the M step. By driving the temperature from cool (0) to hot (infinity), estimates are pushed from uniform to winner-take-all.

    ReplyDelete
  2. hi, this article is fabiluous.
    Sir , iam a M-tech student of cognitive science from rajasthan university India. so sir could you assist me some data finding about machine learning.
    this is my e-mail id grace_hemdani8@yahoo.co.in.
    I will be very thankful to you.

    ReplyDelete
  3. Who knows where to download XRumer 5.0 Palladium?
    Help, please. All recommend this program to effectively advertise on the Internet, this is the best program!

    ReplyDelete
  4. You do have a point here :) I admire the stuff you post and the quality information you offer in your blog! Keep up the good work dude. Please come visit my site Business Trade Guide of Fresno California CA when you got time.

    ReplyDelete
  5. You do have a point here :) I admire the stuff you post and the quality information you offer in your blog! Keep up the good work dude. Please come visit my site Directory Fresno City when you got time.

    ReplyDelete
  6. "I say your blog is very nice. Gisele Bundchen, Kate Moss, Adriana Lima & Co owe success, wealth and fame of her beauty and her charisma. They are among the most sought-after supermodels in the world are on the catwalks of Paris, Milan and New York home and wrap their flawless bodies in the noblest of noble designer creations.

    http://www.model.de
    ."

    ReplyDelete
  7. Very nice information. Thanks for this. Please come visit my site Aurora Community Video Library when you got time.

    ReplyDelete
  8. Very nice information. Thanks for this. Please come visit my site Aurora Colorado CO Directory when you got time.

    ReplyDelete
  9. I like your article, really interesting! My point is also very good, I hope you'll like:chi flat iron are a very popular choice of hair straightener.New Balance,new Blance shoes,new Blance Outlet are some of the most comfortable and stylish shoes on the market today. The designer has a whole range of shoes for all types of athletes. five finger shoes,vibram five fingers,Five fingers shoes give women the feeling of walking barefoot while still keeping the feet protected.

    ReplyDelete