My colleague Suresh Venkatasubramanian is running as seminar on clustering this semester. Last week we discussed EM and mixture of Gaussians. I almost skipped because it's a relatively old hat topic for me (how many times have I given this lecture?!), and had some grant stuff going out that day. But I decided to show up anyway. I'm glad I did.

We discussed a lot of interesting things, but something that had been bugging me for a while finally materialized in a way about which I can be precise. I basically have two (purely qualitative) issues with mixture of Gaussians as a clustering method. (No, I'm not actually suggesting there's anything wrong with using it in practice.) My first complaint is that many times, MoG is used to get the cluster assignments, or to get soft-cluster assignments... but this has always struck me as a bit weird because then we should be maximizing over the cluster assignments and doing expectations over everything else. Max Welling has done some work related to this in the Bayesian setting. (I vaguely remember that someone else did basically the same thing at basically the same time, but can't remember any more who it was.)

But my more fundamental question is this. When we start dealing with MoG, we usually say something like... suppose we have a density F which can be represented at F = pi_0 F_0 + pi_1 F_1 + ... + pi_K F_K, where the pis give a convex combination of "simpler" densities F_k. This question arose in the context of density estimation (if my history is correct) and the maximum likelihood solution via expectation maximization was developed to solve the density estimation problem. That is, the ORIGINAL goal in this case was to do density estimation; the fact that "cluster assignments" were produced as a byproduct was perhaps not the original intent.

I can actually give a fairly simple example to try to make this point visually. Here is some data generated by a mixture of uniform distributions. And I'll even tell you that K=2 in this case. There are 20,000 points if I recall correctly:

Can you tell me what the distribution is? Can you give me the components? Can you give me cluster assignments?

The problem is that I've constructed this to be non-identifiable. Here are two ways of writing down the components. (I've drawn this in 2D, but only pay attention to the x dimension.) They give rise to exactly the same distribution. One is equally weighted components, one uniform on the range (-3,1) and one uniform on the range (-1,3). The other is to have to components, one with 2/3 weight on the range (-3,3) and one with 1/3 weight on the range (-1,1).

I could imagine some sort of maximum likelihood parameter estimation giving rise to either of these (EM is hard to get to work here because once a point is outside the bounds of a uniform, it has probability zero). They both correctly recover the distribution, but would give rise to totally different (and sort of weird) cluster assignments.

I want to quickly point out that this is a very different issue from the standard "non-identifiability in mixture models issue" that has to do with the fact that any permutation of cluster indices gives rise to the same model.

So I guess that all this falls under the category of "if you want X, go for X." If you want a clustering, go for a clustering -- don't go for density estimation and try to read off clusters as a by-product. (Of course, I don't entirely believe this, but I still think it's worth thinking about.)

## 02 March 2009

### Mixture models: clustering or density estimation

Posted by hal at 3/02/2009 09:13:00 PM

Labels: clustering, machine learning

Subscribe to:
Post Comments (Atom)

## 22 comments:

nice pic. maybe in another post you can expand on the EM vs ME issue

I usually also think that if you want X, go for it. But the question of identifiability is not existing (i mean solved) under classical hypothesis, in the gaussian case, right ?

Nice thing with gaussian mixtures is that the theoretical framework under which you fall makes things much easier than many clustering algorithms...

Here's the abstract from Dempster, Laird and Rubin (1977), in which they describe EM as producing maximum likelihood estimates in the face of missing data (where the missing data in the mixture case is the mixture assignment):

A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived.

Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.

I believe that the problem of non-identifiability is well-understood in the literature, e.g., On the exponential value of labeled samples by Castelli and Cover. Of course if the mixture is not identifiable, it is hard to do anything at all.

But I think that the following question is at the heart of the problem:

given a probability density p, what is a good clustering?

If we cannot answer this (at least under some conditions), there is little hope to get anything more than a heuristic from a finite sample.

misha b

On a related note -- I think we often forget that the EM theorem says that EM converges to a local max in likelihood, and says nothing about convergence of the parameters. If the parameters are nonidentifiable there's no reason to expect them to converge (I've seen this with multinomial mixture models).

Have you considered the relation between EM with Gaussian Mixtures and the original k-means clustering algorithm (isodata), where k-means can can be seen as a limiting form of EM-GM. Also, see an old paper by Mike Kearns on this topic, where they consider a variant where points are assigned probabilistically rather than either hard or soft.

there is little hope to get anything more than a heuristic from a finite sample.

Assignment | Coursework | Thesis

where they consider a variant where points are assigned probabilistically rather than either hard or soft.

Dissertation | Essay

Классный фильм на кинозоуне.

электронная почта без регистрации

I agree with the issue with mixture of Gaussians as a clustering method. I like your ideas for the future.

jacksonville beach cosmetic dentist

one day i went shopping outside,and in an ed hardy store,I found some kinds of ed hardy i love most they are Your website is really good Thank you for the information

ed hardyed hardyed hardy clothinged hardy clothinged hardy shoesed hardy shoesdon ed hardydon ed hardyed hardy clothesed hardy clothesed hardy bagsed hardy bagsed hardy swimweared hardy swimweared hardy jeansed hardy jeansed hardy mensed hardy mens Thank you for the informationReally trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbetsesli chat

sesli sohbet siteleri

sesli chat siteleri sesli sohbetsesli chat

sesli sohbet siteleri

sesli chat siteleri

SesliChat

cılgın sohbet

güzel kızlar

bekar kızlar

dul bayanlar

seviyeli insanlar

yarışma

canlı müzik

izdivac

en güzel evlilik

hersey burada

sesliparti

seslisohbet odalari

Sesli adresi

Sesli Chat

SesliChat Siteleri

Sesli Chat sitesi

SesliChat sitesi

SesliSohbet

Sesli Sohbet

Sesli Sohbet Sitesi

SesliSohbet Sitesi

SesliSohbet Siteleri

Muhabbet Sitesi

kamerali chat

Görüntülü Sohbet

Hasret gülleri

Çet sitesi

SesliSohbet

Sesli Sohbet

Canli sohbet

Turkce sohbet

Kurtce Sohbet

Kurtce Chat

Kurtce Muhabbet

Kurtce Sohbet

Kurdish Chat

SesliChat

Sesli Chat

SesliSanal

Guncel Haber

sohbet Sitesi

Chat sitesi..

Topic is very good, Thank you very much for your information, nice job keep it up.

qualitative dissertation | qualitative dissertations | dissertation writing

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet

seslisohbet

sesli chat

seslichat

sesli sohbet sitesi

sesli chat sitesi

sesli sohpet

kamerali sohbet

kamerali chat

webcam sohbetReally trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet

seslisohbet

sesli chat

seslichat

sesli sohbet sitesi

sesli chat sitesi

sesli sohpet

kamerali sohbet

kamerali chat

webcam sohbet

Hi,

This is an excellent and informative work, this will really helpful for me in future. I like the way you start and then conclude your thoughts. I completely agree with the points you have shared here about estimating the density or clustering. Thanks for this information. I really appreciate your work, keep it up.

Essay Writing Service

HP dv9700 battery

HP F4809A Battery

HP nc8000 battery

HP nc8230 battery

HP pavilion zd8000 battery

HP f2024b battery

HP f4812a battery

HP Pavilion ZV5000 battery

HP Pavilion DV1000 battery

HP Pavilion ZD7000 Battery

HP Pavilion DV2000 battery

HP Pavilion DV4000 Battery

HP Pavilion dv6000 Battery

HP Pavilion DV9000 Battery

HP F4098A battery

HP pavilion zx6000 battery

HP omnibook xe4400 battery

HP omnibook xe4500 battery

HP omnibook xe3 battery

Notebook NX9110 battery

IBM 02K6821 battery

IBM 02K7054 battery

IBM 08K8195 battery

IBM 08K8218 battery

IBM 92P1089 battery

IBM Thinkpad 390 Series battery

IBM Thinkpad 390X battery

IBM ThinkPad Z61m Battery

IBM 02K7018 Battery

IBM thinkpad t41p battery

IBM THINKPAD T42 Battery

IBM ThinkPad R60 Battery

IBM ThinkPad T60 Battery

IBM ThinkPad T41 Battery

IBM ThinkPad T43 Battery

IBM ThinkPad X40 Battery

Thinkpad x24 battery

ThinkPad G41 battery

IBM thinkpad r52 battery

Thinkpad x22 battery

IBM thinkpad t42 battery

IBM thinkpad r51 battery

Thinkpad r50 battery

IBM thinkpad r32 battery

Thinkpad x41 battery

SONY VGP-BPS2 Battery

SONY VGP-BPS2C Battery

SONY VGP-BPS5 battery

SONY VGP-BPL2C battery

SONY VGP-BPS2A battery

SONY VGP-BPS2B battery

SONY PCGA-BP1N battery

SONY PCGA-BP2E battery

SONY PCGA-BP2NX battery

SONY PCGA-BP2S battery

SONY PCGA-BP2SA battery

SONY PCGA-BP2T battery

SONY PCGA-BP2V battery

SONY PCGA-BP4V battery

SONY PCGA-BP71 battery

SONY PCGA-BP71A battery

SONY VGP-BPL1 battery

SONY VGP-BPL2 battery

Awesome Tips!

I am glad to found such useful post. I really increased my knowledge after read your post which will be beneficial for me.

Thanks for valuable sharing!

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..

seslisohbet

seslichat

sesli sohbet

sesli chat

sesli

sesli site

görünlütü sohbet

görüntülü chat

kameralı sohbet

kameralı chat

sesli sohbet siteleri

sesli chat siteleri

görüntülü sohbet siteleri

görüntülü chat siteleri

kameralı sohbet siteleri

canlı sohbet

sesli muhabbet

görüntülü muhabbet

kameralı muhabbet

seslidunya

seslisehir

sesli sex

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..

seslisohbet

seslichat

sesli sohbet

sesli chat

sesli

sesli site

görünlütü sohbet

görüntülü chat

kameralı sohbet

kameralı chat

sesli sohbet siteleri

sesli chat siteleri

sesli muhabbet siteleri

görüntülü sohbet siteleri

görüntülü chat siteleri

görüntülü muhabbet siteleri

kameralı sohbet siteleri

kameralı chat siteleri

kameralı muhabbet siteleri

canlı sohbet

sesli muhabbet

görüntülü muhabbet

kameralı muhabbet

birsesver

birses

seslidunya

seslisehir

sesli sex

Post a Comment