17 November 2009

K-means vs GMM, sum-product vs max-product

I finished K-means and Gaussian mixture models in class last week or maybe the week before. I've previously discussed the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.

There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs. (I say "typical" because you can modify these algorithms fairly extensively as you see fit.) The first difference is that GMMs just have more parameters. The parameters of K-means are typically the cluster assignments ("z") and the means ("mu"). The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and cluster covariances ("Sigma"). The GMM model is just richer. Of course, you can restrict it so all clusters are isotropic and all prior probabilities are even, in which case you've effectively removed this difference (or you can add these things into K-means). The second difference is that GMMs operate under the regime of "soft assignments," meaning that points aren't wed to clusters: they only prefer (in a probabilistic sense) some clusters to others. This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.
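To make the "soft assignment" point concrete, here is a minimal numpy/scipy sketch (the function names and array shapes are my own, not from any particular library) of the two E-steps side by side: K-means weds each point to its nearest mean, while the GMM only computes responsibilities r[n,k] proportional to pi_k N(x_n; mu_k, Sigma_k):

```python
import numpy as np
from scipy.stats import multivariate_normal

def kmeans_assignments(X, mu):
    """K-means 'E-step': each point is wed to its nearest mean."""
    d2 = ((X[:, None, :] - np.asarray(mu)[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def gmm_responsibilities(X, pi, mu, Sigma):
    """GMM E-step: r[n, k] proportional to pi[k] * N(X[n]; mu[k], Sigma[k])."""
    r = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                  for k in range(len(pi))], axis=1)
    return r / r.sum(axis=1, keepdims=True)  # normalize so each row sums to 1
```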

One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value. This leads to something that has much more of a K-means feel.
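In code terms, the only change from the soft E-step sketched above is to clamp each row of responsibilities to a one-hot vector at its argmax (again just a sketch, reusing the hypothetical gmm_responsibilities helper and numpy import from above):

```python
def hard_responsibilities(X, pi, mu, Sigma):
    """'Hard EM' / 'Viterbi EM' E-step: clamp each point to its most likely cluster."""
    r = gmm_responsibilities(X, pi, mu, Sigma)
    hard = np.zeros_like(r)
    hard[np.arange(len(r)), r.argmax(axis=1)] = 1.0  # one-hot at the argmax
    return hard
```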

This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult. (Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in this circle is that pegging is much ado about nothing.) My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay." (This is purely from a practical perspective.) The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points. (Yes, this is hand-wavy and it's easy to construct examples where it fails. But it seems to work many times.) Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).

Nevertheless, a particular trusted friend has told me he's always disappointed when he can't do full EM and has to settle for hard EM: he's never seen a case where full EM doesn't help. (Or maybe "rarely" would be more fair.) Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).

Now let's think about approximate inference in graphical models. Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z"). For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments. (Note that this is departing slightly from typical notation for EM.) Suppose I want to do inference in such a model. Here are three things I can do:

  1. Just run max-product. That is, maximize p(X,Z) rather than p(X).
  2. Just run sum-product. That is, compute expectations over X and Z, rather than just over Z.
  3. Run EM, by alternating something like sum-product on Z and something like max-product on X.
Of these, only (3) is really doing the "right thing." Further, let's get away from the notion of p(X) not correlating with some extrinsic evaluation by just measuring ourselves against exact inference. (Yes, this limits us to relatively small models with 10 or 20 binary nodes.)
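To pin down what each option is even aiming at, here is a brute-force sketch on the toy scale we actually evaluate at (a handful of binary nodes, so exact enumeration is feasible). It is not a message-passing implementation, and the names are mine; it just computes, from an unnormalized joint p(X,Z) given as a table, the target of max-product (joint MAP), of sum-product plus per-node rounding, and of the "right thing" that EM is approximating (the marginal MAP, argmax_X sum_Z p(X,Z)):

```python
import numpy as np

def compare_strategies(joint, n_x, n_z):
    """joint: unnormalized p(X, Z) over binary nodes, an array of shape (2,)*(n_x+n_z),
    with the first n_x axes being X (maximize) and the remaining n_z axes being Z."""
    # (1) max-product target: argmax over X and Z jointly, then keep only the X part
    idx = np.unravel_index(np.argmax(joint), joint.shape)
    x_joint_map = idx[:n_x]

    # (2) sum-product target: the marginal of each X node, rounded to its argmax
    x_rounded = tuple(
        int(np.argmax(joint.sum(axis=tuple(a for a in range(n_x + n_z) if a != i))))
        for i in range(n_x)
    )

    # (3) the "right thing" (what EM approximates): argmax_X of p(X) = sum_Z p(X, Z)
    pX = joint.sum(axis=tuple(range(n_x, n_x + n_z)))
    x_marginal_map = np.unravel_index(np.argmax(pX), pX.shape)

    return x_joint_map, x_rounded, x_marginal_map

# tiny usage example: 2 X nodes, 2 Z nodes, random potentials
rng = np.random.default_rng(0)
joint = np.exp(rng.normal(size=(2, 2, 2, 2)))
print(compare_strategies(joint, n_x=2, n_z=2))
```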

What do you think happens? Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.

When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.

When most of the nodes are Z (marginalization) nodes, then EM does best and sum-product does almost the same. But max-product also does almost the same.

This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature) and how you initialize these models (eg., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not). Max product is just unbeatable.

In a sense, from a practical perspective, this is nice. It says: if you have a mixed model, just run max product and you'll do just as well as if you had done something more complicated (like EM). But it's also frustrating: we should be getting some leverage out of marginalizing over the nodes that we should marginalize over. Especially in the high temperature case, where there is lots of uncertainty in the model, max product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).

Tying this back to K-means versus GMM, for the case where the models are effectively the same (a GMM restricted to uniform priors and shared isotropic covariances), the analogy is that as far as the means go, it doesn't matter which one you use. Even if there's lots of uncertainty in the data. Of course, you may get much better assignments from the GMM (or you may not, I don't really know). But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter. At all. Ever.

Part of me finds this hard to believe, and note that I haven't actually run experiments with K-means and GMM, but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them. Shouldn't someone have noticed this before, though? For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?

(Yes, I'm aware of some theoretical results, eg., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and contradicts many of our experimental studies where sum-product + rounding just flat out sucks. Maybe there are other results out there that we just haven't been able to dig up.)

02 March 2009

Mixture models: clustering or density estimation

My colleague Suresh Venkatasubramanian is running a seminar on clustering this semester. Last week we discussed EM and mixtures of Gaussians. I almost skipped because it's a relatively old-hat topic for me (how many times have I given this lecture?!), and I had some grant stuff going out that day. But I decided to show up anyway. I'm glad I did.

We discussed a lot of interesting things, but something that had been bugging me for a while finally materialized in a way about which I can be precise. I basically have two (purely qualitative) issues with mixture of Gaussians as a clustering method. (No, I'm not actually suggesting there's anything wrong with using it in practice.) My first complaint is that many times, MoG is used to get the cluster assignments, or to get soft-cluster assignments... but this has always struck me as a bit weird because then we should be maximizing over the cluster assignments and doing expectations over everything else. Max Welling has done some work related to this in the Bayesian setting. (I vaguely remember that someone else did basically the same thing at basically the same time, but can't remember any more who it was.)

But my more fundamental question is this. When we start dealing with MoG, we usually say something like... suppose we have a density F which can be represented as F = pi_1 F_1 + pi_2 F_2 + ... + pi_K F_K, where the pis give a convex combination of "simpler" densities F_k. This question arose in the context of density estimation (if my history is correct) and the maximum likelihood solution via expectation maximization was developed to solve the density estimation problem. That is, the ORIGINAL goal in this case was to do density estimation; the fact that "cluster assignments" were produced as a byproduct was perhaps not the original intent.
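(Just to have the object in front of us: a minimal sketch of evaluating such a mixture density, with Gaussian components purely for illustration; nothing here is specific to any particular package's mixture-model API.)

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, pis, components):
    """Evaluate F(x) = sum_k pi_k * F_k(x) for a one-dimensional mixture."""
    x = np.asarray(x, dtype=float)
    return sum(p * comp.pdf(x) for p, comp in zip(pis, components))

# e.g. a two-component Gaussian mixture with weights 0.3 and 0.7
f = mixture_density(np.linspace(-5, 5, 11), [0.3, 0.7], [norm(-1, 1), norm(2, 0.5)])
```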

I can actually give a fairly simple example to try to make this point visually. Here is some data generated by a mixture of uniform distributions. And I'll even tell you that K=2 in this case. There are 20,000 points if I recall correctly:



Can you tell me what the distribution is? Can you give me the components? Can you give me cluster assignments?

The problem is that I've constructed this to be non-identifiable. Here are two ways of writing down the components. (I've drawn this in 2D, but only pay attention to the x dimension.) They give rise to exactly the same distribution. One is two equally weighted components, one uniform on the range (-3,1) and the other uniform on the range (-1,3). The other is two components, one with 3/4 weight uniform on the range (-3,3) and one with 1/4 weight uniform on the range (-1,1).
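A quick numerical check of that claim, just as a sketch (scipy's uniform is parameterized by a left endpoint and a width):

```python
import numpy as np
from scipy.stats import uniform

xs = np.linspace(-3, 3, 601)

# decomposition 1: equal weights on U(-3, 1) and U(-1, 3)
f1 = 0.5 * uniform(-3, 4).pdf(xs) + 0.5 * uniform(-1, 4).pdf(xs)

# decomposition 2: 3/4 weight on U(-3, 3), 1/4 weight on U(-1, 1)
f2 = 0.75 * uniform(-3, 6).pdf(xs) + 0.25 * uniform(-1, 2).pdf(xs)

print(np.abs(f1 - f2).max())  # ~0: the two mixtures define the same density
```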

I could imagine some sort of maximum likelihood parameter estimation giving rise to either of these (EM is hard to get to work here because once a point is outside the bounds of a uniform, it has probability zero). They both correctly recover the distribution, but would give rise to totally different (and sort of weird) cluster assignments.

I want to quickly point out that this is a very different issue from the standard "non-identifiability in mixture models issue" that has to do with the fact that any permutation of cluster indices gives rise to the same model.

So I guess that all this falls under the category of "if you want X, go for X." If you want a clustering, go for a clustering -- don't go for density estimation and try to read off clusters as a by-product. (Of course, I don't entirely believe this, but I still think it's worth thinking about.)

21 August 2007

Topic modeling: syntactic versus semantic

Topic modeling has turned into a bit of a cottage industry in the NLP/machine learning world. Most of it seems to stem from latent Dirichlet allocation, though this of course built on previous techniques, the most well-known of which is latent semantic analysis. At the end of the day, such "topic models" really look more like dimensionality reduction techniques (eg., note the similarity to multinomial PCA); however, in practice, they're often used as (perhaps soft) clustering methods. Words are mapped to topics; topics are used as features; this is fed into some learning algorithm.
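As a throwaway illustration of that pipeline (not any particular paper's setup; the documents and labels below are placeholders), something like the following, using scikit-learn's LDA as the topic model and the inferred topic proportions as features for a downstream classifier:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# placeholder data; any labeled text collection works here
documents = ["the film was a moving drama", "students met the teacher at school"]
labels = [0, 1]

pipeline = make_pipeline(
    CountVectorizer(),                                       # words -> counts
    LatentDirichletAllocation(n_components=2, random_state=0),  # counts -> topic proportions
    LogisticRegression(),                                    # topics as features for a learner
)
pipeline.fit(documents, labels)
```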

One thing that's interested me for a while is how, when viewed as clustering algorithms, these topic models compare with more standard word clustering algorithms from the NLP community. For instance, the Brown clustering technique (built into SRILM) clusters words based on context. (Lots of other word clustering techniques exist, but they pretty much all cluster based on local context, where local is either positionally local or local in a syntactic tree.)

I think the general high level story is that "topic models" go for semantics while "clustering models" go for syntax. That is, clustering models will tend to cluster words together that appear in similar local contexts, while topic models will cluster words together that appear in a similar global context. I've even heard stories that when given a choice of using POS tags as features in a model versus Brown clusters, it really doesn't make a difference.

I think this sentiment is a bit unfair to clustering models. Saying that context-based clustering models only find syntactically similar words is just not true. Consider the example clusters from the original LDA paper (the top portion of Figure 8). If we look up "film" ("new" seems odd) in CBC, we get: movie, film, comedy, drama, musical, thriller, documentary, flick, etc. (I left out multiword entries). The LDA list contains: new, film, show, music, movie, play, musical, best, actor, etc. In the CBC list we never get things like "actor" or "york" (presumably "york" is why "new" appeared in the LDA topic), "love" or "theater", but it's unclear if this is good or not. Perhaps with more topics, these things would have gone into separate topics.

If we look up "school", we get: hospital, school, clinic, center, laboratory, lab, library, institute, university, etc. Again, this is a different sort of list than the LDA list, which contains: school, students, schools, education, teachers, high, public, teacher, bennett, manigat, state, president, etc.

It seems like the syntactic/semantic distinction is not quite right. In some sense, with the first list, LDA is being more liberal in what it considers film-like, with CBC being more conservative. OTOH, with the "school" list, CBC seems to be more liberal.

I realize, of course, that this is comparing apples and oranges... the data sets are different, the models are different, the preprocessing is different, etc. But it's still pretty clear that both sorts of models are getting at the same basic information. It would be cool to see some work that tried to get leverage from both local context and global context, but perhaps this wouldn't be especially beneficial since these approaches -- at least looking at these two lists -- don't seem to produce results that are strongly complementary. I've also seen questions abound regarding how to get topics out of topic models that are "disjoint" in some sense... this is something CBC does automatically. Perhaps a disjoint-LDA could leverage these ideas.