I think it's fair to say that I am a fan of the Bayesian lifestyle. I have at least a handful of papers with "Bayesian" in the title, and not in the misleading "I used Bayes' rule on a noisy-channel model" sense.
It's probably also fair to say that I'm a fan of NLP.
So... let's think about it. A fan of Bayes, a fan of NLP. He must do
topic modeling (i.e., LDA-style models), right? Well, no. Not really. Admittedly I have done something that looks like topic modeling (for
query-focused summarization), but never really topic modeling for topic modeling's sake.
The main reason I have stayed away from the ruthless, yet endlessly malleable, world of topic modeling is that it's notoriously difficult to evaluate. (I got away with this in the summarization example because I could evaluate the summaries directly.) The purpose of this post is to discuss
how one can try to evaluate topic models.
At the end of the day, most topic models are just probabilistic models over documents (although sometimes they are models over collections of documents). For a very simple example, let's take LDA. Here, we have p(w|a,b) = ∫ p(β|b) ∫ p(θ|a) ∏_n Σ_{z_n} p(z_n|θ) p(w_n|z_n,β) dθ dβ. In the particular case of LDA, the first two "p"s are Dirichlet, and the last two "p"s are Multinomial, where z_n is an indicator selecting a mixture component for word n. For simplicity, though, let's just collapse all the hyperparameters into a single variable a and the true parameters into a single parameter z, and treat this as p(w|a) = ∫ p(z|a) q(w|z) dz, and let's assume p and q are not conjugate (if they were, life would be too easy).
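To make that a bit more concrete, here's a minimal sketch of what sampling a document from the collapsed model p(w|a) = ∫ p(z|a) q(w|z) dz might look like. Everything here is made up for illustration -- the dimensions, the Dirichlet standing in for p(z|a), and the fixed topic matrix folded into q -- but this is the generative story we'll need again later:

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, doc_len = 5, 1000, 100                       # hypothetical: topics, vocab size, tokens per doc
    a = np.full(K, 0.1)                                # collapsed hyperparameters (Dirichlet concentration)
    topics = rng.dirichlet(np.full(V, 0.1), size=K)    # stand-in for the rest of the "true parameters"

    def sample_document():
        z = rng.dirichlet(a)                           # z ~ p(z|a): document-level parameters
        word_probs = z @ topics                        # q(w|z): per-token word distribution, topic indicators summed out
        return rng.multinomial(doc_len, word_probs)    # bag-of-words count vector for one document

    # note: once the per-token indicators are summed out, the Dirichlet on z is no longer
    # conjugate to this likelihood -- which is exactly why inference (and evaluation) is hard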
Now, we want to "evaluate" these models. That is, we want to check to see if they really are good models of documents. (Or, more precisely, are they better models of documents than whatever we are comparing against... perhaps I've proposed a new version of p and q that I claim is more "life-like.") Well, the natural thing to do would be to take some held-out data and evaluate p(w|a) according to the model. Whichever model assigns higher probability to the held-out data is probably better.
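In practice, "assigns higher probability" usually gets reported as held-out log-likelihood per token, or equivalently perplexity. Assuming we somehow managed to get log p(w|a) for each held-out document (the hard part, as we're about to see), the bookkeeping is just:

    import numpy as np

    def perplexity(heldout_log_probs, heldout_token_counts):
        # heldout_log_probs[d] = log p(w_d|a) under the model, however we computed it
        # heldout_token_counts[d] = number of tokens in held-out document d
        total_log_prob = np.sum(heldout_log_probs)
        total_tokens = np.sum(heldout_token_counts)
        return np.exp(-total_log_prob / total_tokens)   # lower is better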
At this point, we need to take a moment to talk about
inference. There's the whole Monte Carlo camp and there's the whole deterministic (variational, Laplace, EP, etc.) camp. Each gives you something totally different. In the Monte Carlo camp, we'll typically get a set of
R-many (possibly weighted) samples of (a, z) from the joint posterior p(a,z|w). We can easily "throw out" some of the components to arrive at a marginal distribution over whatever parameters we want. In the deterministic camp, one of the standard things we might get is a type-II maximum likelihood estimate of
a given the training data: i.e., a value of
a that maximizes the marginal likelihood p(w|a). (This is the empirical Bayes route -- some deterministic approximations will allow you to be fully Bayes and integrate out
a as well.)
Now, back to evaluation. The issue that comes up is that in order to evaluate -- that is, in order to compute p(w|a) for the held-out w -- we have to do more inference. In particular, we have to marginalize over the
zs for the heldout data. In the MC camp, this would mean taking our samples to describe a posterior distribution on
a given
w (marginalizing out
z) and then using this posterior to evaluate the heldout likelihood. This would involve
another run of a sampler to marginalize out the
zs for the new data. In the deterministic camp, we may have an ML-II point estimate of the hyperparameters
a, but we still need to marginalize out
z, which usually means basically running inference again (e.g., running EM on the test data).
All of this is quite unfortunate. In both cases, re-running a sampler or re-running EM is going to be computationally expensive. Life is probably slightly better in the deterministic camp, where you usually get a fairly reasonable approximation to the evidence. In the MC camp, life is pretty bad. We can run this sampler, but (a) it is usually going to have pretty high variance and (b) (even worse!) it's just plain hard to evaluate the evidence from a sampler. At least I don't know of any really good ways, and I've looked reasonably extensively (though "corrections" to my naivete are certainly welcome!).
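To see concretely why this hurts, here's a sketch of one naive way to get at p(w|a): draw z from p(z|a) (using whatever point estimate or samples of a training gave us) and average the held-out likelihood q(w|z) over those draws. In expectation the inner average really is p(w|a), but in anything high-dimensional almost none of the prior draws of z land near the region that actually explains the held-out document, so the estimate bounces around wildly from run to run. The setup (dimensions, Dirichlet prior, multinomial likelihood, fake topics) is the same hypothetical one as above:

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(0)
    K, V = 5, 1000
    a = np.full(K, 0.1)
    topics = rng.dirichlet(np.full(V, 0.1), size=K)      # pretend these came out of training

    def log_q(word_counts, z):
        # log multinomial likelihood of a bag-of-words vector, dropping the count coefficient
        return word_counts @ np.log(z @ topics)

    def naive_log_marginal(word_counts, num_samples=1000):
        # log p(w|a) ~= log [ (1/S) sum_s q(w|z_s) ],   z_s ~ p(z|a)
        zs = rng.dirichlet(a, size=num_samples)
        log_liks = np.array([log_q(word_counts, z) for z in zs])
        return logsumexp(log_liks) - np.log(num_samples)

    heldout_doc = rng.multinomial(100, rng.dirichlet(a) @ topics)
    print(naive_log_marginal(heldout_doc))               # re-run a few times: the variance is the problem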
So, what recourse do we have?
One reasonable standard thing to do is to "hold out" data in a different way. For instance, instead of holding out 10% of my
documents, I'll hold out 10% of my
words in each document. The advantage here is that since the parameters
z are typically document-specific, I will obtain them for every document in the process of normal inference. This means that (at least part of) the integration in computing
p(w|a) disappears and is usually tractable. The problem with this approach is that in many cases, it's not really in line with what we
want to evaluate. Typically we
want to evaluate how well this model models totally new documents, not "parts" of previously seen documents. (There are
other issues, too, though these are less irksome to me in a topic-modeling setting.)
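For completeness, here's roughly what that word-holdout evaluation looks like under the same hypothetical setup: split each document's tokens 90/10, get a point estimate of that document's parameters z from the observed 90% (below, a crude EM that I'm using purely as a stand-in for whatever inference you actually run), and then score the held-out 10% directly -- no integral over z required:

    import numpy as np

    rng = np.random.default_rng(1)
    K, V = 5, 1000
    a = np.full(K, 0.1)
    topics = rng.dirichlet(np.full(V, 0.1), size=K)    # pretend these came out of training

    def fit_doc_params(counts, iters=50):
        # crude EM point estimate of a document's mixing weights z, topics held fixed
        z = np.full(K, 1.0 / K)
        for _ in range(iters):
            resp = z[:, None] * topics                 # K x V responsibilities (unnormalized)
            resp /= resp.sum(axis=0, keepdims=True)
            z = resp @ counts
            z /= z.sum()
        return z

    def heldout_word_loglik(observed_counts, heldout_counts):
        z = fit_doc_params(observed_counts)            # uses only the observed 90%
        return heldout_counts @ np.log(z @ topics)     # p(held-out words | z): no further integration

    # toy example: one document, 90/10 token split
    true_z = rng.dirichlet(a)
    observed = rng.multinomial(180, true_z @ topics)
    heldout = rng.multinomial(20, true_z @ topics)
    print(heldout_word_loglik(observed, heldout) / heldout.sum())   # per-token log-likelihood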
Another standard thing to do is to throw the latent variables into some sort of classification problem. That is, take (eg) the 20newsgroups data set, training and test combined. Run your topic model and get document-level parameters. Use
these as features for, say, a logistic regression classifier and see how well you do. This definitely gets around the "test on new data" problem, is not really cheating (in my mind), and does give you an estimate. The problem is that this estimate is cloaked behind classification. Maybe there's no natural classification task associated with your data, or maybe classification washes out precisely the distinctions your model is trying to capture.
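As one concrete instantiation of that protocol, here's a sketch using scikit-learn: pool 20 Newsgroups, get document-level parameters from some topic model (I'm using sklearn's variational LDA purely as a stand-in for whatever model you actually care about), and see how predictive those parameters are of the newsgroup label. The feature count, topic count, and cross-validation setup are arbitrary choices on my part:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # 20 Newsgroups, train and test pooled
    data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
    counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(data.data)

    # stand-in topic model: swap in the model you're actually evaluating
    doc_topics = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(counts)

    # the "evaluation": how well do the document-level parameters predict the newsgroup?
    clf = LogisticRegression(max_iter=1000)
    print(cross_val_score(clf, doc_topics, data.target, cv=5).mean())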
The final method I want to talk about I learned from Wei Li and Andrew McCallum and is (briefly!) described in their
Pachinko Allocation paper. (Though I recall Andrew telling me that the technique stems---like so many things---from stats; namely, it is the
empirical likelihood estimate of Diggle and Gratton.)
The key idea in empirical likelihood is to replace our statistically simple but computationally complex model
p with a statistically complex but computationally simple model
q. We then evaluate likelihood according to
q instead of
p. Here's how I think of it. Let's say that whatever inference I did on training data allowed me to obtain a method for
sampling from the distribution
p(w|a). In most cases, we'll have this. If we have an ML-II estimate of
a, we just follow the topic model's generative story; if we have samples over
a, we just use those samples. Easy enough.
So, what we're going to do is generate a
ton of faux documents from the posterior. On each of these documents, we estimate some simpler model. For instance, we might simply estimate a Multinomial (bag of words) on each faux document. We can consider this to now be a giant mixture of multinomials (evenly weighted), where the number of mixture components is the number of faux documents we generated. The nice thing here is that evaluating likelihood of test data under a (mixture of) multinomials is
really easy. We just take our test documents, compute their likelihood under each of these faux multinomials, average across the mixture, and voila -- we're done!
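Here's the whole pipeline in sketch form, again under the same made-up LDA-ish setup as before (the number of faux documents, their length, and the add-0.01 smoothing are all arbitrary choices on my part; the smoothing is just there so test words a faux document happened not to contain don't get probability zero):

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(2)
    K, V, R = 5, 1000, 500
    a = np.full(K, 0.1)
    topics = rng.dirichlet(np.full(V, 0.1), size=K)      # pretend these came out of training

    def sample_faux_document(length=100):
        z = rng.dirichlet(a)                             # follow the generative story for p(w|a)
        return rng.multinomial(length, z @ topics)

    # fit one (smoothed) multinomial per faux document
    faux_counts = np.array([sample_faux_document() for _ in range(R)])
    faux_probs = (faux_counts + 0.01) / (faux_counts + 0.01).sum(axis=1, keepdims=True)
    log_faux_probs = np.log(faux_probs)                  # R x V

    def empirical_log_likelihood(test_counts):
        # log q(w) = log [ (1/R) sum_r Multinomial(w | faux multinomial r) ]
        per_component = log_faux_probs @ test_counts     # length-R vector of per-component log-likelihoods
        return logsumexp(per_component) - np.log(R)

    test_doc = sample_faux_document()                    # in real life: an actual held-out document
    print(empirical_log_likelihood(test_doc) / test_doc.sum())   # per-token log-likelihood under q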
This method is, of course, not without its own issues. For one, a multinomial might not be a good model to use. For instance, if my topic model says anything about word order, then I might want to estimate simple
n-gram language models instead. The estimates might also have high variance -- how many faux documents do I need? Some sort of kernel smoothing can help here, but then that introduces additional bias. I haven't seen anyone do any evaluation of this for topic-model things, but it would be nice to see.
But overall, I find this method the least offensive (ringing praise!) and, in fact, it's what is implemented as part of
HBC.