natural language processing blog: Bayesian Methods for NLP (summary)

15 December 2005

Bayesian Methods for NLP (summary)

I recently co-organized a BayesNLP workshop with Yee Whye. Here's a brief summary of a subset of the talks and the discussions that took place.

Topic models and language modeling: Many of the papers and much of the discussion concerned topic models (TMs), language models (LMs), or both. Much of the topic-model discussion was about how to introduce Markov-style dependencies. There are essentially three ways to do this: (1) make word i dependent on topic i and word (i-1); (2) make topic i dependent on topic (i-1); (3) both. Basically this comes out in how you structure the "beta" language models (in LDA terminology). There is a trade-off between the number of parameters (|vocab| * (# topics)^2 versus |vocab|^2 * (# topics)) and the ability to fit data. I think the general consensus is that if you have a lot of data (which you should!) then you should use the more expressive models.
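To get a feel for the trade-off, here is a rough sketch that just counts table entries in the "beta" distributions. The function name, the dictionary keys, and the vocabulary and topic sizes are all made up for illustration, and the pairing of each count with a particular model structure is my reading of the two formulas, not something pinned down at the workshop.

```python
def beta_table_sizes(vocab_size: int, num_topics: int) -> dict:
    """Count entries in the word-emission ("beta") tables.

    These are raw table sizes; sum-to-one constraints are ignored.
    Key names are hypothetical labels, not standard terminology.
    """
    return {
        # plain LDA: one distribution over words per topic
        "topic_only": vocab_size * num_topics,
        # assumed reading: beta indexed by a pair of topics
        "topic_pair": vocab_size * num_topics ** 2,
        # beta indexed by (topic, previous word)
        "topic_and_prev_word": vocab_size ** 2 * num_topics,
    }


# Illustrative sizes only.
sizes = beta_table_sizes(vocab_size=10_000, num_topics=50)
for name, n in sizes.items():
    print(f"{name:>20}: {n:,}")
```

With any realistic vocabulary, conditioning on the previous word is by far the more expensive choice, which is why the amount of data matters so much here.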

The major problem with these models is that they are often evaluated by their perplexity on test data. These perplexities are significantly higher than those obtained by people in the speech community, which raises the "why should I care?" question (see this other entry). There are several potential answers: (1) topics can be embedded in a task (say, MT), leading to better performance; (2) topics can enable new tasks (such as browsing Science repositories); (3) topics can be compared with what humans do, in a CogSci manner.
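For anyone who hasn't computed one, perplexity is just the exponentiated average negative log-probability the model assigns to each test token. A minimal sketch, assuming you already have the per-token probabilities in hand (the function name and the toy numbers are mine, not from any of the talks):

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability per test token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    total_log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-total_log_prob / len(token_probs))


# A model that gives every test token probability 1/4 is as
# "confused" as a uniform choice among four words.
uniform_ppl = perplexity([0.25] * 4)
```

Lower is better, and a model that assigns zero probability to any test token has infinite perplexity (here `math.log(0)` would raise an error), which is one reason smoothing matters so much for these comparisons.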

This topic led into some incomplete discussion on what sorts of problems we might want to work on in the future. I don't think there was a solid decision made. In terms of what applications might be interesting, I think the agreement was that Bayesian techniques are most useful in problems for which there is insufficient data to fit all parameters well. Since "there's no data like more data" has become a mantra in NLP, this seems like it would include every problem! My opinion is that Bayesian methods will turn out to be most useful for largely unsupervised tasks, where my prior knowledge can be encoded as structure. I think there's lots of room to grow into new application domains (similar to some stuff Andrew McCallum has been working on in social network analysis). Introducing new tasks makes evaluation difficult, which can make publication difficult (your eight pages have to cover both technique and evaluation), but I think it's the right way for the community to head.

I also really liked Yee Whye's talk (which happened to propose basically the same model as a paper by Goldwater, Griffiths and Johnson at this same NIPS), where he gave an interpretation of Kneser-Ney (KN) smoothing as a nonparametric Bayesian model with a Poisson-Dirichlet prior. Unlike previous attempts to explain why KN works, this one actually gives superior results to interpolated KN (though it loses to modified interpolated KN). Shaojun talked about integrating a whole bunch of stuff (Markov models, grammars and topics) into a language model using directed Markov fields as an "interface" language. This was really cute and they seem to be doing really well (going against the above comment that it's hard to get comparable perplexities). I believe there's an upcoming CL paper on this topic.
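For reference, here is a small sketch of the baseline being discussed: classic interpolated KN for bigrams with a single fixed discount. To be clear, this is not the Bayesian model from the talk, just the standard smoothing method it reinterprets. The function name, the discount of 0.75, and the toy corpus are my own choices for illustration.

```python
from collections import Counter, defaultdict


def train_kn_bigram(tokens, discount=0.75):
    """Return prob(word, context) under interpolated Kneser-Ney (bigrams)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()     # c(u): how often u appears as a context
    followers = defaultdict(set)  # distinct words seen after u
    preceders = defaultdict(set)  # distinct contexts seen before w
    for (u, w), count in bigrams.items():
        context_total[u] += count
        followers[u].add(w)
        preceders[w].add(u)
    num_bigram_types = len(bigrams)

    def prob(word, context):
        # Continuation probability: in how many distinct contexts
        # does `word` appear, relative to all bigram types?
        p_cont = len(preceders.get(word, ())) / num_bigram_types
        total = context_total.get(context, 0)
        if total == 0:
            return p_cont  # unseen context: use the lower-order model
        discounted = max(bigrams.get((context, word), 0) - discount, 0.0) / total
        # Mass removed by discounting, redistributed via p_cont.
        backoff_mass = discount * len(followers[context]) / total
        return discounted + backoff_mass * p_cont

    return prob


tokens = "the cat sat on the mat the cat ate".split()
prob = train_kn_bigram(tokens)
mass = sum(prob(w, "the") for w in set(tokens))  # should be (close to) 1
```

The key trick is that the lower-order distribution counts distinct contexts rather than raw frequency, so a word that is frequent but only ever follows one other word gets little backoff probability.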

If anyone else took part in the BNLP workshop and would like to comment, you're more than welcome.