## 02 March 2009

### Mixture models: clustering or density estimation

My colleague Suresh Venkatasubramanian is running as seminar on clustering this semester. Last week we discussed EM and mixture of Gaussians. I almost skipped because it's a relatively old hat topic for me (how many times have I given this lecture?!), and had some grant stuff going out that day. But I decided to show up anyway. I'm glad I did.

We discussed a lot of interesting things, but something that had been bugging me for a while finally materialized in a way about which I can be precise. I basically have two (purely qualitative) issues with mixture of Gaussians as a clustering method. (No, I'm not actually suggesting there's anything wrong with using it in practice.) My first complaint is that many times, MoG is used to get the cluster assignments, or to get soft-cluster assignments... but this has always struck me as a bit weird because then we should be maximizing over the cluster assignments and doing expectations over everything else. Max Welling has done some work related to this in the Bayesian setting. (I vaguely remember that someone else did basically the same thing at basically the same time, but can't remember any more who it was.)

But my more fundamental question is this. When we start dealing with MoG, we usually say something like... suppose we have a density F which can be represented at F = pi_0 F_0 + pi_1 F_1 + ... + pi_K F_K, where the pis give a convex combination of "simpler" densities F_k. This question arose in the context of density estimation (if my history is correct) and the maximum likelihood solution via expectation maximization was developed to solve the density estimation problem. That is, the ORIGINAL goal in this case was to do density estimation; the fact that "cluster assignments" were produced as a byproduct was perhaps not the original intent.

I can actually give a fairly simple example to try to make this point visually. Here is some data generated by a mixture of uniform distributions. And I'll even tell you that K=2 in this case. There are 20,000 points if I recall correctly:

Can you tell me what the distribution is? Can you give me the components? Can you give me cluster assignments?

The problem is that I've constructed this to be non-identifiable. Here are two ways of writing down the components. (I've drawn this in 2D, but only pay attention to the x dimension.) They give rise to exactly the same distribution. One is equally weighted components, one uniform on the range (-3,1) and one uniform on the range (-1,3). The other is to have to components, one with 2/3 weight on the range (-3,3) and one with 1/3 weight on the range (-1,1).

I could imagine some sort of maximum likelihood parameter estimation giving rise to either of these (EM is hard to get to work here because once a point is outside the bounds of a uniform, it has probability zero). They both correctly recover the distribution, but would give rise to totally different (and sort of weird) cluster assignments.

I want to quickly point out that this is a very different issue from the standard "non-identifiability in mixture models issue" that has to do with the fact that any permutation of cluster indices gives rise to the same model.

So I guess that all this falls under the category of "if you want X, go for X." If you want a clustering, go for a clustering -- don't go for density estimation and try to read off clusters as a by-product. (Of course, I don't entirely believe this, but I still think it's worth thinking about.)

Suresh Venkatasubramanian said...

nice pic. maybe in another post you can expand on the EM vs ME issue

Anonymous said...

I usually also think that if you want X, go for it. But the question of identifiability is not existing (i mean solved) under classical hypothesis, in the gaussian case, right ?

Nice thing with gaussian mixtures is that the theoretical framework under which you fall makes things much easier than many clustering algorithms...

Anonymous said...

Here's the abstract from Dempster, Laird and Rubin (1977), in which they describe EM as producing maximum likelihood estimates in the face of missing data (where the missing data in the mixture case is the mixture assignment):

A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived.

Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.

Anonymous said...

I believe that the problem of non-identifiability is well-understood in the literature, e.g., On the exponential value of labeled samples by Castelli and Cover. Of course if the mixture is not identifiable, it is hard to do anything at all.

But I think that the following question is at the heart of the problem:

given a probability density p, what is a good clustering?

If we cannot answer this (at least under some conditions), there is little hope to get anything more than a heuristic from a finite sample.

misha b

Mark Johnson said...

On a related note -- I think we often forget that the EM theorem says that EM converges to a local max in likelihood, and says nothing about convergence of the parameters. If the parameters are nonidentifiable there's no reason to expect them to converge (I've seen this with multinomial mixture models).

bj said...

Have you considered the relation between EM with Gaussian Mixtures and the original k-means clustering algorithm (isodata), where k-means can can be seen as a limiting form of EM-GM. Also, see an old paper by Mike Kearns on this topic, where they consider a variant where points are assigned probabilistically rather than either hard or soft.

Anonymous said...

Many years ago, I already know the silkroad online game, but at that time I do not play, Recently due to my friends introduce, I began to play the game, but I need the silkroad gold, my friend look me the first time to play so send me some sro gold, but I still need more silkroad online gold, so I want to buy.

Anonymous said...

Anonymous said...

Anonymous said...

there is little hope to get anything more than a heuristic from a finite sample.
Assignment | Coursework | Thesis

Anonymous said...

where they consider a variant where points are assigned probabilistically rather than either hard or soft.
Dissertation | Essay

Anonymous said...

Классный фильм на кинозоуне.
электронная почта без регистрации

Anonymous said...
gamefan12 said...

I agree with the issue with mixture of Gaussians as a clustering method. I like your ideas for the future.
jacksonville beach cosmetic dentist

Unknown said...

one day i went shopping outside,and in an ed hardy store,I found some kinds of ed hardy i love most they are Your website is really good Thank you for the information ed hardy ed hardy ed hardy clothing ed hardy clothing ed hardy shoes ed hardy shoes don ed hardy don ed hardy ed hardy clothes ed hardy clothes ed hardy bags ed hardy bags ed hardy swimwear ed hardy swimwear ed hardy jeans ed hardy jeans ed hardy mens ed hardy mens Thank you for the information

Unknown said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

Unknown said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chat
sesli sohbet siteleri

sesli chat siteleri sesli sohbetsesli chat
sesli sohbet siteleri
sesli chat siteleri
SesliChat
cılgın sohbet
güzel kızlar
bekar kızlar
dul bayanlar
seviyeli insanlar
yarışma
canlı müzik
izdivac
en güzel evlilik
sesliparti
seslisohbet odalari
Sesli Chat
SesliChat Siteleri
Sesli Chat sitesi
SesliChat sitesi
SesliSohbet
Sesli Sohbet
Sesli Sohbet Sitesi
SesliSohbet Sitesi
SesliSohbet Siteleri
Muhabbet Sitesi
kamerali chat
Görüntülü Sohbet
Hasret gülleri
Çet sitesi
SesliSohbet
Sesli Sohbet
Canli sohbet
Turkce sohbet
Kurtce Sohbet
Kurtce Chat
Kurtce Muhabbet
Kurtce Sohbet
Kurdish Chat
SesliChat
Sesli Chat
SesliSanal
Guncel Haber
sohbet Sitesi
Chat sitesi..

dissertation said...

Topic is very good, Thank you very much for your information, nice job keep it up.
qualitative dissertation | qualitative dissertations | dissertation writing

Unknown said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbetReally trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbet

Custom Essay Writing said...

Hi,
This is an excellent and informative work, this will really helpful for me in future. I like the way you start and then conclude your thoughts. I completely agree with the points you have shared here about estimating the density or clustering. Thanks for this information. I really appreciate your work, keep it up.

Essay Writing Service

Tammy said...
This comment has been removed by the author.
combattery84 said...
combattery84 said...
free cna training said...

Awesome Tips!
I am glad to found such useful post. I really increased my knowledge after read your post which will be beneficial for me.

Thanks for valuable sharing!

Anonymous said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Unknown said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex