03 May 2006

Supervised Hidden Variable Models

Hidden variable models have been extremely successful in unsupervised NLP problems, such as the famous alignment models for machine translation. Recently, two parsing papers and a textual entailment paper have applied hidden variable techniques to supervised problems. I find this approach appealing because it allows one to learn hidden variable models that are (hopefully) optimal for the end goal. This contrasts with, for example, unsupervised word alignment, where it is unclear whether the same alignment is good for two different tasks, for instance translation and projection.

The question that interests me is "what exactly are hidden variables in these models?" As a simple example, let's take the following problem: we have an English sentence and an Arabic sentence and we want to know if they're aligned. Dragos has a paper that shows that we can do a good job of making this determination if we do the following: construct an alignment (using, e.g., GIZA++) and then compute features over the alignment (length of aligned subphrases, number of unaligned words, etc.). One can imagine similar models for textual entailment, paraphrase identification, etc.

I would ideally like to learn the alignments as part of training the classifier, treating them essentially as hidden variables in the supervised problem.
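
One common way to write this down (a sketch in my own notation, not taken from the papers mentioned above): let x be the sentence pair, y the binary "parallel or not" label, a a hidden alignment, f a hypothetical feature function over all three, and w a weight vector. A log-linear model then sums the alignment out:

```latex
p(y \mid x; w) \;=\; \sum_{a} p(y, a \mid x; w)
\;=\; \frac{\sum_{a} \exp\bigl(w \cdot f(x, a, y)\bigr)}
           {\sum_{y'} \sum_{a'} \exp\bigl(w \cdot f(x, a', y')\bigr)}
```

Training would maximize the log of this quantity over labeled pairs, so an alignment is only ever rewarded through its effect on the label.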

So what happens in such a model? We learn to produce alignments and then define features over the alignments. From an information-theoretic perspective, we needn't bother with the alignments: we can just define features on the original sentence pair. Perhaps these features will be somewhat unnatural, but I don't think so. To build the alignments, we'll have features like "the word 'man' is aligned to the word 'XXX'" (where 'XXX' is Arabic for 'man'). Then on top of this we'd have features that count how many words are successfully aligned. It's somewhat unclear whether this is much more than just including all word-pair features in a pure classification technique.

The key difference appears to be that we impose a strong constraint on the alignments: for instance, that they are one-to-one, one-to-many, or something similar. This significantly reduces the number of features that are fed into the classifier.
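
To make the feature-count point concrete, here is a toy sketch. Everything in it is an assumption for illustration: the greedy aligner stands in for a real one such as GIZA++, the `score` table is made up, and the `F_...` tokens are placeholders for foreign words. It compares the number of raw word-pair features with the number of links that survive a one-to-one constraint.

```python
def all_pair_features(src, tgt):
    """Every (source word, target word) pair: what a pure classifier would see."""
    return {(s, t) for s in src for t in tgt}


def one_to_one_alignment(src, tgt, score):
    """Greedy one-to-one alignment (illustrative stand-in for a real aligner).

    Repeatedly takes the highest-scoring remaining pair, using each source
    and target position at most once. Pairs with score <= 0 are never linked.
    """
    candidates = sorted(
        (
            (score.get((s, t), 0.0), i, j)
            for i, s in enumerate(src)
            for j, t in enumerate(tgt)
        ),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    used_src, used_tgt, links = set(), set(), []
    for sc, i, j in candidates:
        if sc <= 0.0:
            break  # everything after this point is unscored
        if i in used_src or j in used_tgt:
            continue  # would violate the one-to-one constraint
        used_src.add(i)
        used_tgt.add(j)
        links.append((i, j))
    return sorted(links)


# Hypothetical toy data; the scores are invented for the example.
src = ["the", "man", "sleeps"]
tgt = ["F_man", "F_sleeps"]
score = {("man", "F_man"): 0.9, ("sleeps", "F_sleeps"): 0.8, ("the", "F_man"): 0.1}

pairs = all_pair_features(src, tgt)
links = one_to_one_alignment(src, tgt, score)
unaligned_src = len(src) - len(links)

print(len(pairs))      # 6 word-pair features without any constraint
print(links)           # [(1, 0), (2, 1)]
print(unaligned_src)   # 1 ("the" is left unaligned)
```

The unconstrained feature set grows with the product of the sentence lengths, while a one-to-one alignment can never contain more links than the shorter sentence has words.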

At the same time, the hidden variable models look a lot like introducing a hidden layer into a neural network! Consider a drastic simplification of the parallel sentence detection problem (which is, in the end, just a binary classification problem). Namely, consider the XOR problem. As we all know, we can't solve this with a linear classifier. But if we add a hidden variable to the mix, we can easily find a solution (e.g., the hidden variable indicates left-versus-right and then, conditional on this, we learn two separate classifiers, one pointing up and one pointing down). This is exactly what a two-layer neural net (i.e., one hidden layer) would do!
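
A minimal sketch of the XOR argument, with the caveat that nothing here is learned: the hidden variable and both conditional classifiers are set by hand, and all function names are made up for the example. The brute-force search over a small weight grid only illustrates the limit of a single linear classifier; it is not a proof.

```python
from itertools import product

# The four XOR points: the label is 1 exactly when the two inputs differ.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def linear_predict(w1, w2, b, x):
    """Plain linear classifier: threshold a weighted sum of the inputs."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def best_linear_accuracy(grid):
    """Most XOR points that any single linear classifier on the grid gets right."""
    return max(
        sum(linear_predict(w1, w2, b, x) == y for x, y in XOR)
        for w1, w2, b in product(grid, repeat=3)
    )


def hidden_variable_predict(x):
    """Hidden variable h picks left vs. right; each side has its own linear classifier."""
    h = "left" if x[0] < 0.5 else "right"
    if h == "left":
        return linear_predict(0.0, 1.0, -0.5, x)  # "pointing up": fires when x2 is high
    return linear_predict(0.0, -1.0, 0.5, x)      # "pointing down": fires when x2 is low


grid = [i / 2 for i in range(-4, 5)]  # weights and bias in {-2.0, -1.5, ..., 2.0}
print(best_linear_accuracy(grid))                               # 3 out of 4
print(all(hidden_variable_predict(x) == y for x, y in XOR))     # True
```

In a real hidden variable model the left-versus-right split would itself be learned, which is where the resemblance to a hidden unit comes from.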

From this perspective, it seems one can argue that hidden variable models are simply increasing the model complexity of the underlying learner. This comes at a cost: increased model complexity means we either need more data or stronger prior information. Hidden variable models --- more precisely, the constraints present in hidden variable models --- seem to provide this prior information.


Anonymous said...

HMMs as used in speech recognition typically hide the following two items during training: (a) the state [typically begin, middle, and end states for each phoneme or tri-phone], and (b) mixture state for acoustic emissions. Training data includes acoustic streams paired with lexical token sequences; sometimes with phonemic level transcription and/or alignment, both of which leave most of the fine-grained state alignment still hidden.

hal said...

Yup, roughly the same thing happens in the original word-based MT systems (the phrase- and syntax-based newcomers are much more complicated). The question is: what does this buy us?

Nobuyuki Shimizu said...

Isn't it the unlabeled data? Sort of like Ando & Zhang
( http://www-cs-students.stanford.edu/%7Etzhang/papers/jmlr05_semisup.pdf ) that was mentioned a bit ago?

By the way, I left a pragmatics example in the past comment.

Kevin Duh said...

I think there should be a distinction between hidden variables in generative vs. discriminative models.

In discriminative models like neural networks, the hidden variables can be used to increase model complexity/flexibility, as you said. In fact, the hidden layer can be thought of as an implicit feature mapping similar to the high-dimensional kernels used in SVMs and other kernel machines.

However, the role of hidden variables in generative models is different, and applications such as the IBM models for word alignment and HMMs for speech recognition fall under this category. In generative models, obtaining an accurate model of the underlying phenomenon is extremely important, since an incorrect model trained via maximum likelihood results in poor classification. Thus, hidden variables are introduced to build more parsimonious and accurate models. This is the same reason statisticians do hierarchical and mixture modeling--often introducing a hidden variable or two will (a) give a more accurate generative model, and (b) allow fewer parameters or simpler parametric distributions.

In the speech recognition research in our lab, we try to think of all sorts of interesting variables to add to our dynamic Bayesian networks so as to model the speech process more accurately. Some of them are hidden, and some of them are deterministic given other variables. In effect, this kind of work is analogous to feature engineering for discriminative models. So I think hidden variables are important in generative models, just as features are important in discriminative models.

That being said, I know there's an effort to develop discriminative models that utilize hidden variables. I think they have a different goal, though--e.g. using hidden variables to model missing features/labels. Anyone know more on this?

hal said...

kevin -- I agree that hidden units in NNs are different, especially in the generative world. That is, in graphical models, hidden variables have a meaning. Typically they do not in NNs. But I have a feeling that when you do generative-style hidden variables in discriminative techniques, like the cited Koo and Collins paper, what you end up with is a lot like a restricted hidden unit in a neural network. I want to understand better what this means.

In essence, your closing question is exactly what I'm interested in.
