natural language processing blog: 2009/03

07 March 2009

n-gram words an language Ordering model with

N-gram language models have been fairly successful at the task of distinguishing homophones, in the context of speech recognition. In machine translation (and other tasks, such as summarization, headline generation, etc.), this is not their job. Their job is to select fluent/grammatical sentences, typically ones which have undergone significant reordering. In a sense, they have to order words. A large part of the thesis of my academic sibling, Radu Soricut, had to do with exploring how well ngram language models can reorder sentences. Briefly, they don't do very well. This is something that our advisor, Daniel Marcu, likes to talk about when he gives invited talk; he shows a 15 word sentence and the preferred reorderings by a ngram LM and they're total hogwash, even though audience members can fairly quickly solve the exponential time problem of reordering the words to make a good sounding sentence. (As an aside, Radu found that if you add in a syntactic LM, things get better... if you don't want to read the whole thesis, just skip forward to section 8.4.2.)

Let's say we like ngram models. They're friendly for many reasons. What could we do to make them more word-order sensitive? I'm not claiming that none of these things have been tried; just that I'm not aware of them having been tried :).

Discriminative training. There's lots of work on discriminative training of language models, but, from what I've seen, it usually has to do with trying to discriminate true sentences from fake sentences, where the fake sentences are generated by some process (eg., an existing MT or speech system, a trigram LM, etc.). The alternative is to directly train a language model to order words. Essentially think of it as a structured prediction problem and try to predict the 8th word based on (say) the two previous. The correct answer is the actual 8th word; the incorrect answer is any other word in the sentence. Words that don't appear in the sentence are "ignored." This is easy to implement and seems to do something reasonable (on a small set of test data).
Add syntactic features to words, eg., via cluster-based language models. My thought here is to look at syntactic features of words (for instance, CCG-style lexicon information) and use these to create descriptors of the words; these can then be clustered (eg., use tree-kernel-style-features) to give a cluster LM. This is similar to how people have added CCG/supertag information to phrase-based MT, although they don't usually do the clustering step. The advantage to clustering is then you (a) get generalization to new words and (b) it fits in nicely with the cluster LM framework.

These both seem like such obvious ideas that they must have been tried... maybe they didn't work? Or maybe I just couldn't dig up papers. Or maybe they're just not good ideas so everyone else dismissed them :).

02 March 2009

Mixture models: clustering or density estimation

My colleague Suresh Venkatasubramanian is running as seminar on clustering this semester. Last week we discussed EM and mixture of Gaussians. I almost skipped because it's a relatively old hat topic for me (how many times have I given this lecture?!), and had some grant stuff going out that day. But I decided to show up anyway. I'm glad I did.

We discussed a lot of interesting things, but something that had been bugging me for a while finally materialized in a way about which I can be precise. I basically have two (purely qualitative) issues with mixture of Gaussians as a clustering method. (No, I'm not actually suggesting there's anything wrong with using it in practice.) My first complaint is that many times, MoG is used to get the cluster assignments, or to get soft-cluster assignments... but this has always struck me as a bit weird because then we should be maximizing over the cluster assignments and doing expectations over everything else. Max Welling has done some work related to this in the Bayesian setting. (I vaguely remember that someone else did basically the same thing at basically the same time, but can't remember any more who it was.)

But my more fundamental question is this. When we start dealing with MoG, we usually say something like... suppose we have a density F which can be represented at F = pi_0 F_0 + pi_1 F_1 + ... + pi_K F_K, where the pis give a convex combination of "simpler" densities F_k. This question arose in the context of density estimation (if my history is correct) and the maximum likelihood solution via expectation maximization was developed to solve the density estimation problem. That is, the ORIGINAL goal in this case was to do density estimation; the fact that "cluster assignments" were produced as a byproduct was perhaps not the original intent.

I can actually give a fairly simple example to try to make this point visually. Here is some data generated by a mixture of uniform distributions. And I'll even tell you that K=2 in this case. There are 20,000 points if I recall correctly:

Can you tell me what the distribution is? Can you give me the components? Can you give me cluster assignments?

The problem is that I've constructed this to be non-identifiable. Here are two ways of writing down the components. (I've drawn this in 2D, but only pay attention to the x dimension.) They give rise to exactly the same distribution. One is equally weighted components, one uniform on the range (-3,1) and one uniform on the range (-1,3). The other is to have to components, one with 2/3 weight on the range (-3,3) and one with 1/3 weight on the range (-1,1).

I could imagine some sort of maximum likelihood parameter estimation giving rise to either of these (EM is hard to get to work here because once a point is outside the bounds of a uniform, it has probability zero). They both correctly recover the distribution, but would give rise to totally different (and sort of weird) cluster assignments.

I want to quickly point out that this is a very different issue from the standard "non-identifiability in mixture models issue" that has to do with the fact that any permutation of cluster indices gives rise to the same model.

So I guess that all this falls under the category of "if you want X, go for X." If you want a clustering, go for a clustering -- don't go for density estimation and try to read off clusters as a by-product. (Of course, I don't entirely believe this, but I still think it's worth thinking about.)

What programming language(s) do you use for large-ish projects?
	C/C#/C++/Objective-C
	D
	Eiffel
	Erlang
	F#
	Haskell
	Java
	Lisp
	Matlab
	OCaml/SML/ML
	Perl
	Python
	R
	Ruby
	Scala
	Sheme
	Smalltalk
	Other

Free polls from Pollhost.com

natural language processing blog

22 March 2009

Programming Language of Choice

07 March 2009

n-gram words an language Ordering model with

02 March 2009

Mixture models: clustering or density estimation