17 November 2009

K-means vs GMM, sum-product vs max-product

I finished K-means and Gaussian mixture models in class last week or maybe the week before. I've previously discussed the fact that these two are really solving different problems (despite being structurally so similar), but today's post is about something different.

There are two primary differences between the typical presentation of K-means and the typical presentation of GMMs. (I say "typical" because you can modify these algorithms fairly extensively as you see fit.) The first difference is that GMMs just have more parameters. The parameters of K-means are typically the cluster assignments ("z") and the means ("mu"). The parameters of a GMM are typically these (z and mu) as well as the class prior probabilities ("pi") and cluster covariances ("Sigma"). The GMM model is just richer. Of course, you can restrict it so all clusters are isotropic and all prior probabilities are even, in which case you've effectively removed this difference (or you can add these things into K-means). The second difference is that GMMs operate under the regime of "soft assignments," meaning that points aren't wed to clusters: they only prefer (in a probabilistic sense) some clusters to others. This falls out naturally from the EM formalization, where the soft assignments are simply the expectations of the assignments under the current model.

One can get rid of the second difference by running "hard EM" (also called "Viterbi EM" in NLP land), where the expectations are clamped at their most likely value. This leads to something that has much more of a K-means feel.
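To make the soft/hard distinction concrete, here is a minimal sketch in the restricted setting above (two clusters, equal priors, unit variances, so the only parameters are the means); the data, initialization, and iteration count are my own illustrative choices:

```python
import math
import random

random.seed(1)

# Two well-separated 1-D clusters; a stand-in for real data.
data = [random.gauss(-2, 1) for _ in range(200)] + \
       [random.gauss(+2, 1) for _ in range(200)]

def em(data, hard, iters=50):
    """EM for a 2-component, unit-variance, equal-prior GMM.
    hard=True clamps responsibilities to their argmax (Viterbi EM)."""
    mu = [-1.0, 1.0]  # same initialization for both variants
    for _ in range(iters):
        # E-step: responsibilities under unit-variance Gaussians.
        resp = []
        for x in data:
            w = [math.exp(-0.5 * (x - m) ** 2) for m in mu]
            s = sum(w)
            r = [wi / s for wi in w]
            if hard:  # clamp to the most likely cluster
                k = r.index(max(r))
                r = [1.0 if j == k else 0.0 for j in range(2)]
            resp.append(r)
        # M-step: responsibility-weighted means.
        for j in range(2):
            tot = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / tot
    return mu

print(em(data, hard=False))  # soft EM means
print(em(data, hard=True))   # hard EM (K-means-like) means
```

With hard=True and this restricted model, the M-step is exactly the K-means mean update; the only difference between the two runs is whether the E-step keeps the soft responsibilities.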

This "real EM" versus "hard EM" distinction comes up a lot in NLP, where computing exact expectations is often really difficult. (Sometimes you get complex variants, like the "pegging" approaches in the IBM machine translation models, but my understanding from people who run in this circle is that pegging is much ado about nothing.) My general feeling has always been "if you don't have much data, do real EM; if you have tons of data, hard EM is probably okay." (This is purely from a practical perspective.) The idea is that when you have tons and tons of data, you can approximate expectations reasonably well by averaging over many data points. (Yes, this is hand-wavy and it's easy to construct examples where it fails. But it seems to work many times.) Of course, you can get pedantic and say "hard EM sucks: it's maximizing p(x,z) but I really want to maximize p(x)" to which I say: ho hum, who cares, you don't actually care about p(x), you care about some extrinsic evaluation metric which, crossing your fingers, you hope correlates with p(x), but for all I know it correlates better with p(x,z).

Nevertheless, a particular trusted friend has told me he's always disappointed when he can't do full EM and has to settle for hard EM: he's never seen a case where full EM doesn't help. (Or maybe "rarely" would be more fair.) Of course, this comes at a price: for many models, maximization (search) can be done in polynomial time, but computing expectations can be #P-hard (basically because you have to enumerate -- or count -- over every possible assignment).

Now let's think about approximate inference in graphical models. Let's say I have a graphical model with some nodes I want to maximize over (call them "X") and some nodes I want to marginalize out (call them "Z"). For instance, in GMMs, the X nodes would be the means, covariances and cluster priors; the Z nodes would be the assignments. (Note that this is departing slightly from typical notation for EM.) Suppose I want to do inference in such a model. Here are three things I can do:

  1. Just run max-product. That is, maximize p(X,Z) rather than p(X).
  2. Just run sum-product. That is, compute expectations over X and Z, rather than just over Z.
  3. Run EM, by alternating something like sum-product on Z and something like max-product on X.

Of these, only (3) is really doing the "right thing." Further, let's get away from the notion of p(X) not correlating with some extrinsic evaluation by just measuring ourselves against exact inference. (Yes, this limits us to relatively small models with 10 or 20 binary nodes.)
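On a model small enough to enumerate, the three options can be stated exactly. Here is a toy sketch where a random joint table stands in for an arbitrary mixed model (the sizes and names are mine, not the post's actual experimental setup):

```python
import itertools
import random

random.seed(0)

# 3 binary X (maximize) nodes and 3 binary Z (marginalize) nodes;
# p(x, z) is an arbitrary normalized random table.
nX, nZ = 3, 3
p = {xz: random.random() for xz in itertools.product([0, 1], repeat=nX + nZ)}
norm = sum(p.values())
p = {k: v / norm for k, v in p.items()}

# (1) "max-product": maximize p(x, z) jointly, then read off x.
x_joint = max(p, key=p.get)[:nX]

# (2) "sum-product": marginal of each X node, rounded to its argmax.
def node_marginal(i, b):
    return sum(v for xz, v in p.items() if xz[i] == b)

x_round = tuple(max((0, 1), key=lambda b, i=i: node_marginal(i, b))
                for i in range(nX))

# (3) the "right thing" (marginal MAP): argmax over x of sum_z p(x, z).
px = {}
for xz, v in p.items():
    px[xz[:nX]] = px.get(xz[:nX], 0.0) + v
x_right = max(px, key=px.get)

print(x_joint, x_round, x_right)
```

On real graphical models the enumeration in (3) is the part that blows up, which is exactly why the approximate variants (1) and (2) are tempting.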

What do you think happens? Well, first, things vary as a function of the number of X nodes versus Z nodes in the graph.

When most of the nodes are X (maximization) nodes, then max-product does best and EM basically does the same.

When most of the nodes are Z (marginalization) nodes, then EM does best and sum-product does almost as well. But max-product also does almost as well.

This is an effect that we've been seeing regularly, regardless of what the models look like (chains or lattices), what the potentials look like (high temperature or low temperature), and how you initialize these models (e.g., in the chain case, EM converges to different places depending on initialization, while sum- and max-product do not). Max-product is just unbeatable.

In a sense, from a practical perspective, this is nice. It says: if you have a mixed model, just run max-product and you'll do just as well as if you had done something more complicated (like EM). But it's also frustrating: we should be getting some leverage out of marginalizing over the nodes that we should marginalize over. Especially in the high temperature case, where there is lots of uncertainty in the model, max-product should start doing worse and worse (note that when we evaluate, we only measure performance on the "X" nodes -- the "Z" nodes are ignored).

Likening this back to K-means versus GMM, for the case where the models are the same (GMM restricted to not have priors or covariances), the analogy is that as far as the means go, it doesn't matter which one you use. Even if there's lots of uncertainty in the data. Of course, you may get much better assignments from GMM (or you may not, I don't really know). But if all you really care about at the end of the day are the Xs (the means), then our experience with max-product suggests that it just won't matter. At all. Ever.

Part of me finds this hard to believe, and note that I haven't actually run experiments with K-means and GMM, but the results in the graphical model cases are sufficiently strong and reproducible that I'm beginning to trust them. Shouldn't someone have noticed this before, though? For all the effort that's gone into various inference algorithms for graphical models, why haven't we ever noticed that you just can't beat max-product?

(Yes, I'm aware of some theoretical results, e.g., the Wainwright result that sum-product + randomized rounding is a provably good approximation to the MAP assignment, but this result actually goes the other way, and contradicts many of our experimental studies where sum-product + rounding just flat out sucks. Maybe there are other results out there that we just haven't been able to dig up.)


Bob Carpenter said...

I've found decreasing robustness from Gibbs sampling to full EM to hard-EM (or max-product) in several NLP apps from clustering to semi-supervised learning to unsupervised stemming.

As David MacKay's book chapter on EM/clustering points out, EM has a tendency to blow up around zero variance estimates.

I compare full EM and Gibbs with a hierarchical model for gold-standard inference from multiply annotated data in a LingPipe blog post invoking MacKay's terminology, EM Goes Kaboom.

Nigam, McCallum, and Mitchell, in their paper Semi-Supervised Text Classification Using EM, discuss the difference between hard and soft EM and introduce a typical kind of compromise with an annealed "temperature" parameter. Technically, this is an exponent on "raw" probability estimates that get renormalized before the M step. By driving the temperature from cool (0) to hot (infinity), estimates are pushed from uniform to winner-take-all.
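[Ed.: the exponent-and-renormalize compromise described above can be sketched in a few lines; the function and variable names are illustrative, not from the paper:]

```python
def temper(probs, beta):
    """Raise raw probability estimates to the power beta and renormalize.
    beta = 0 gives uniform (fully soft); beta -> infinity approaches
    winner-take-all (hard EM); beta = 1 leaves the estimates alone."""
    w = [p ** beta for p in probs]
    s = sum(w)
    return [wi / s for wi in w]

print(temper([0.7, 0.2, 0.1], 0.0))  # uniform
print(temper([0.7, 0.2, 0.1], 1.0))  # essentially unchanged
print(temper([0.7, 0.2, 0.1], 8.0))  # near winner-take-all
```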
