I think it's fair to say that I am a fan of the Bayesian lifestyle. I have at least a handful of papers with "Bayesian" in the title, and not in the misleading "I used Bayes' rule on a noisy-channel model" sense.

It's probably also fair to say that I'm a fan of NLP.

So... let's think about it. A fan of Bayes, a fan of NLP. He must do topic modeling (i.e., LDA-style models), right? Well, no. Not really. Admittedly I have done something that looks like topic modeling (for query-focused summarization), but never really topic modeling for topic modeling's sake.

The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is that it's notoriously difficult to evaluate. (I got away with this in the summarization example because I could evaluate the summaries directly.) The purpose of this post is to discuss how one can try to evaluate topic models.

At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). For a very simple example, let's take LDA. Here, we have p(w) = ∫∫ p(θ|a) p(φ|b) [Π_n Σ_{z_n} p(z_n|θ) p(w_n|z_n,φ)] dθ dφ. In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z is an indicator selecting a mixture component. For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z and just treat this as p(w|a) = ∫ p(z|a) q(w|z) dz, and let's assume p and q are not conjugate (if they were, life would be too easy).
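To make the collapsed model concrete, here is a minimal numpy sketch (every number and distribution below is made up for illustration) of p(w|a) = ∫ p(z|a) q(w|z) dz when z is discrete: the marginal likelihood can be computed exactly by summing over z, or approximated by sampling z ~ p(z|a) and averaging q(w|z).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy collapsed model (all quantities invented for illustration):
# z is one of K discrete "parameter" settings drawn with probability p(z|a);
# given z, each word in the document is Multinomial over V vocabulary items.
K, V = 3, 5
p_z_given_a = np.array([0.5, 0.3, 0.2])          # p(z|a)
q_w_given_z = rng.dirichlet(np.ones(V), size=K)  # q(w|z), one row per z

doc = np.array([0, 2, 2, 4, 1])                  # a "document" as word ids

# Exact marginal likelihood: p(w|a) = sum_z p(z|a) * prod_n q(w_n|z)
per_z = q_w_given_z[:, doc].prod(axis=1)
exact = (p_z_given_a * per_z).sum()

# Monte Carlo estimate: sample z ~ p(z|a), average the document likelihood
zs = rng.choice(K, size=100_000, p=p_z_given_a)
mc = q_w_given_z[zs][:, doc].prod(axis=1).mean()
```

With a discrete z the exact sum is trivial, of course; the point is that the Monte Carlo version is what survives when the sum or integral over z is intractable.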

Now, we want to "evaluate" these models. That is, we want to check to see if they really are good models of documents. (Or, more precisely, are they better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") Well, the natural thing to do would be to take some held-out data and evaluate p(w|a) according to the model. Whichever model assigns higher probability to the held-out data is probably better.
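As a reminder of the usual convention, held-out log-likelihoods are typically reported as per-word perplexity so that held-out sets of different sizes are comparable; a tiny sketch with made-up numbers:

```python
import numpy as np

# Comparing two hypothetical models by held-out log-likelihood, reported as
# per-word perplexity (the log-likelihood totals below are invented).
loglik_model_A = -4210.0   # total held-out log-likelihood, model A
loglik_model_B = -3980.0   # total held-out log-likelihood, model B
n_heldout_words = 1000

perp_A = np.exp(-loglik_model_A / n_heldout_words)
perp_B = np.exp(-loglik_model_B / n_heldout_words)

# lower perplexity <=> higher held-out probability
better = "B" if perp_B < perp_A else "A"
```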

At this point, we need to take a moment to talk about inference. There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. Each gives you something totally different. In the Monte Carlo camp, we'll typically get a set of R-many (possibly weighted) samples from the joint distribution p(a,z,w). We can easily "throw out" some of the components to arrive at a conditional distribution on whatever parameters we want. In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of a given the training data: i.e., a value of a that maximizes p(a|w). (This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out a as well.)

Now, back to evaluation. The issue that comes up is that in order to evaluate -- that is, in order to compute p(w|a) -- we have to do more inference. In particular, we have to marginalize over the zs for the held-out data. In the MC camp, this would mean taking our samples to describe a posterior distribution on a given w (marginalizing out z) and then using this posterior to evaluate the held-out likelihood. This would involve another run of a sampler to marginalize out the zs for the new data. In the deterministic camp, we may have an ML-II point estimate of the hyperparameters a, but we still need to marginalize out z, which usually means basically running inference again (e.g., running EM on the test data).
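As a sketch of what "running inference again" looks like, here is a toy fold-in procedure in numpy: the topic-word distributions φ are held fixed (as if learned in training), and a few EM steps fit the held-out document's mixing proportions θ. A point estimate of θ is a stand-in for the full marginalization over z, and all sizes and distributions below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V = 4, 20
phi = rng.dirichlet(np.ones(V), size=K)  # fixed K x V topic-word distributions
doc = rng.integers(0, V, size=50)        # held-out document as word ids

theta = np.full(K, 1.0 / K)              # start from uniform proportions
for _ in range(50):
    # E-step: responsibilities r[n, k] = p(z_n = k | w_n, theta)
    r = theta * phi[:, doc].T            # shape (N, K), unnormalized
    r /= r.sum(axis=1, keepdims=True)
    # M-step: theta proportional to expected topic counts
    theta = r.sum(axis=0) / len(doc)

# held-out log-likelihood of the document under the fitted theta
heldout_loglik = np.log(phi[:, doc].T @ theta).sum()
```

Each EM iteration is guaranteed not to decrease the document's likelihood, so the fitted θ scores the document at least as well as the uniform starting point.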

All of this is quite unfortunate. In both cases, re-running a sampler or re-running EM is going to be computationally expensive. Life is probably slightly better in the deterministic camp, where you usually get a fairly reasonable approximation to the evidence. In the MC camp, life is pretty bad. We can run this sampler, but (a) it is usually going to have pretty high variance and (b) (even worse!) it's just plain hard to evaluate the evidence in a sampler. At least I don't know of any really good ways, and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!).

So, what recourse do we have?

One reasonable standard thing to do is to "hold out" data in a different way. For instance, instead of holding out 10% of my documents, I'll hold out 10% of my words in each document. The advantage here is that since the parameters z are typically document-specific, I will obtain them for every document in the process of normal inference. This means that (at least part of) the integration in computing p(w|a) disappears and is usually tractable. The problem with this approach is that in many cases, it's not really in line with what we want to evaluate. Typically we want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. (There are other issues, too, though these are less irksome to me in a topic-modeling setting.)
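A sketch of this word-level held-out split, with a smoothed unigram model standing in for whatever per-document inference your topic model would actually do (all numbers invented):

```python
import numpy as np

# Hold out words rather than documents: split each document's tokens 90/10,
# fit per-document parameters on the observed 90%, score the held-out 10%.
rng = np.random.default_rng(2)
V = 20
doc = rng.integers(0, V, size=100)     # toy document of word ids

perm = rng.permutation(len(doc))
n_obs = int(0.9 * len(doc))
observed, heldout = doc[perm[:n_obs]], doc[perm[n_obs:]]

# A smoothed unigram "model" estimated from the observed portion alone
# (a stand-in for inferring document-specific topic proportions).
counts = np.bincount(observed, minlength=V) + 0.1   # add-0.1 smoothing
probs = counts / counts.sum()

heldout_loglik = np.log(probs[heldout]).sum()
```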

Another standard thing to do is to throw the latent variables into some sort of classification problem. That is, take (e.g.) the 20newsgroups data set, training and test combined. Run your topic model and get document-level parameters. Use these as features in, say, a logistic regression and see how well you do. This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. The problem is that this estimate is cloaked behind classification. Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture.
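A sketch of this extrinsic evaluation using scikit-learn, on synthetic count data (two made-up word distributions stand in for two newsgroup classes; with real data you would plug in actual 20newsgroups counts):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
V, n_per_class = 30, 100

# Two synthetic "classes": class 0 favors the first 15 words, class 1 the rest.
dist0 = rng.dirichlet(np.concatenate([np.full(15, 5.0), np.full(15, 0.5)]))
dist1 = rng.dirichlet(np.concatenate([np.full(15, 0.5), np.full(15, 5.0)]))
X = np.vstack([rng.multinomial(60, d, size=n_per_class) for d in (dist0, dist1)])
y = np.repeat([0, 1], n_per_class)

# Document-topic proportions from the topic model become classifier features.
theta = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(X)

clf = LogisticRegression(max_iter=1000).fit(theta[::2], y[::2])  # even rows: train
accuracy = clf.score(theta[1::2], y[1::2])                       # odd rows: test
```

Accuracy well above chance says the topic features retain class-relevant structure; the caveat from the text applies, since that is all it says.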

The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their Pachinko Allocation paper. (Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the empirical likelihood estimate of Diggle and Gratton.)

The key idea in empirical likelihood is to replace our statistically simple but computationally complex model p with a statistically complex but computationally simple model q. We then evaluate likelihood according to q instead of p. Here's how I think of it. Let's say that whatever inference I did on training data allowed me to obtain a method for sampling from the distribution p(w|a). In most cases, we'll have this. If we have an ML-II estimate of a, we just follow the topic model's generative story; if we have samples over a, we just use those samples. Easy enough.

So, what we're going to do is generate a ton of faux documents from the posterior. On each of these documents, we estimate some simpler model. For instance, we might simply estimate a Multinomial (bag of words) on each faux document. We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. The nice thing here is that evaluating the likelihood of test data under a (mixture of) multinomials is really easy. We just take our test documents, compute their likelihood under each of these faux multinomials, average, and voila -- we're done!
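Here is a minimal numpy sketch of the whole empirical-likelihood recipe, with a toy mixture standing in for the fitted topic model (all sizes and distributions are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
K, V, R, doc_len = 3, 20, 500, 40

# A toy fitted model we can sample documents from (stand-in for your topic model).
p_z = np.array([0.5, 0.3, 0.2])
q_w = rng.dirichlet(np.ones(V), size=K)

# 1. Generate R faux documents by following the generative story, and
#    estimate a smoothed multinomial from each one.
faux_probs = np.empty((R, V))
for r in range(R):
    z = rng.choice(K, p=p_z)
    words = rng.choice(V, size=doc_len, p=q_w[z])
    counts = np.bincount(words, minlength=V) + 0.01   # light smoothing
    faux_probs[r] = counts / counts.sum()

# 2. Score a test document under the evenly weighted mixture of the R
#    faux multinomials: average its likelihood across components.
test_doc = rng.choice(V, size=doc_len, p=q_w[0])
log_liks = np.log(faux_probs[:, test_doc]).sum(axis=1)  # (R,) per component
m = log_liks.max()                                      # log-mean-exp for stability
test_logprob = m + np.log(np.exp(log_liks - m).mean())
```

The log-mean-exp step matters in practice: per-component document likelihoods underflow quickly, so the averaging has to happen in log space.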

This method is, of course, not without its own issues. For one, a multinomial might not be a good model to use. For instance, if my topic model says anything about word order, then I might want to estimate simple n-gram language models instead. The estimates might also have high variance -- how many faux documents do I need? Some sort of kernel smoothing can help here, but then that introduces additional bias. I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see.

But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of HBC.