When I was working on what turned into an old short paper (Markov Random Topic Fields) I decided it might be pedagogically interesting to keep a journal of what I was doing. This journal ended when I ran out of steam and I never got back to it. My whole original idea was, after the paper got published, post everything: the journal, the code, the paper, the paper reviews, etc. It's now been 6 years and that's not going to happen, but in case anyone finds it interesting, here in the report.
Anyway, here is the report. I'm posting this so that perhaps new students can see that things don't ever work the first time, faculty still have trouble getting their code to work, etc.
The progress of a research idea ============= DAY 1 ============= * Idea Want to do a form of topic modeling, but where there is meta information. There are ways to do this, eg., Supervised LDA or Dirichlet-Multinomial Regression. These both operate on a *feature* level. For some tasks, it is more natural to operate over a graph. Along these lines, there's Pachinko Allocation, but this posits a graph over the vocabulary, not over the documents. (Plus, it is a DAG, which doesn't make sense for our application.) Question: how can we augment a standard topic model (eg., LDA), with an underlying graph, where we assume topics vary smoothly over the graph? * Technology What technology exists for statistical modeling over graphs? Sounds like a Markov Random Field. So let's marry topic models (LDA) with MRFs, to give a "Topical Markov Random Field" (TMRF). We think of LDA a generating documents by first choosing a topic mixture \theta, and then choosing topics z=k for each word w, where w is drawn from a multinomial \beta_k. Where can a graph fit in this? The first idea is to put an MRF over \theta. * MRF over \theta If we have an MRF over theta, then two issues arise. First, we almost certainly can't collapse out theta as we might like. Okay, we'll live with that. Second, from an MRF perspective, what do the potential functions look like? The simplest idea is to use pairwise potentials of the form e^{-dist}, where dist is the distance between two thetas that touch on the graph. What Distance metric should we use? We could use Bhattacharyya, Hellinger, Euclidean, LogitEuclidean, etc. Let's start with Hellinger. What about a variance? We could have lengths in the graph that are either latent or known. Let's say they're latent and our potentials have the form e^{-dist/l}, where l is the length (so that if you're far away, distance doesn't matter. ** Getting data together We have about 1000 docs and three graphs over those docs. We get them in a reasonable format and then subsample about 400 of the docs. We do this both for speed and to make sure we don't overfit the model on the data too much. ============= DAY 2 ============= ** Implementation We implement this with an option to use or not use graphs (so we can tell if they're helping). We collapse out \beta, but not \theta in both cases, and we compute log posteriors. We run first on some simple test data (testW) from HBC and find that it seems to be doing something kind of reasonable. We then run on some real data and it puts everything in one cluster after about 20-30 Gibbs iterations. Debugging: First, we turn off all graph stuff (sampling lengths) and things are still broken. Then we initialize optimally and things are still broken. Then we turn off resampling \theta and things are still broken. The problem is that I'm using the collapsed \beta incorrectly when sampling z. I fix it and things work as expected (i.e., not everything goes in the same cluster). ** Evaluating So now the code seems to be working, so we want to evaluate. We run a model with an without a graph (where the graph is something we expect will help). The posterior probabilities coming out of the two different models are all over the place. So we do the standard trick of holding out 20% of the data as "test" data and then evaluating log likelihood on the test. Here, we do the "bad" thing and just use 20% of the words in each document (so that we already have \theta for all the documents). Not great, but easy to implement. This time, no bugs. At this point, it's a pain to recompile for every configuration change and we'd like to be able to run a bunch of configs simultaneously. So we add a simple command line interface. In order to evaluate, we plot either posteriors or held-out likelihoods (usually the latter) as a function of iteration using xgraph (interacts nicely with the shell and I'm used to it). Things now seem mixed. There's very little difference when you're not using a graph between sampling \theta from the true posterior versus using an MH proposal (this is good for us, since we have to use MH). There is also little difference between the baseline LDA model and the MRF models. We turn off sampling of the lengths and just fix them at one. For the three graphs we're trying, only one seems to be doing any better than the baseline LDA model. ** Optimization Now that we're running experiments, we find that things are taking way too long to run. So we do some optimization. First, we cache the sum of all \beta_k posteriors. This helps. Second, we flip \beta from \beta_{k,w} to \beta_{w,k}, which we've heard helps. It doesn't. We put it back the other way. All the time is being spent in resample_z, so we waste a half day trying to figure out if there's a way to only resample a subset of the zs. For instance, track how often they change and only resample those that change a lot (probabilistically). This hurts. Resampling those with high entropy hurts. I think the issue is that there are three types of zs. (1) those that change a lot because they have high uncertainty but are rare enough that they don't really matter, (2) those that change a lot and do matter, (3) those that just don't change very much. Probably could do something intelligent, but have wasted too much time already. In order to really evaluate speed, we add some code that prints out timing. We do one more optimization that's maybe not very common. resample_z loops over docs, then words, then topics. For each word, the loop over topics is to compute p(z=k). But if the next word you loop over is the same word (they are both "the"), then you don't need to recompute all the p(z=k)s -- you can cache them. We do this, and then sort the words. This gives about a 20% speedup with no loss in performance (posterior or held-out likelihood). ** Evaluating again Since we had some success fixing the lengths at 1, we try fixing them at 20. Now that one graph is doing noticably better than the baseline and the other two slightly better. We try 5 and 10 and 50 and nothing much seems to happen. 20 seems like a bit of a sweet spot. ============= DAY 3 ============= ** A more rigorous evaluation Running with lengths fixed at 20 seems to work, but there's always variance due to randomness (both in the sampling and in the 80/20 split) that we'd like to account for. So we run 8 copies of each of the four models (8 just because we happen to have an 8 core machine, so we can run them simultaneously). Now, we require more complex graphing technology than just xgraph. We'll probably eventually want to graph things in matlab, but for now all we really care about it how things do over time. So we write a small perl script to extract scores every 50 iterations (we've switched from 200 to 300 just to be safe) and show means and stddevs for each of the models. While we're waiting for this to run, we think about... * MRF over z? Our initial model which may or may not be doing so well (we're waiting on some experiments) assumes an MRF over \th. Maybe this is not the best place to put it. Can we put it over z instead? Why would we want to do this? There are some technological reasons: (1) this makes the MRF discrete and we know much better how to deal with discrete MRFs. (2) we can get rid of the MH step (though this doesn't seem to be screwing us up much). (3) we no longer have the arbitrary choice of which distance function to use. There are also some technological reasons *not* to do it: it seems like it would be computationally much more burdensome. But, let's think if this makes sense in the context of our application. We have a bunch of research papers and our graphs are authorship, citations, time or venue. These really do feel like graphs over *documents* not *words*. We could turn them in to graphs over words by, say, connecting all identical terms across documents, encouraging them to share the same topic. This could probably be done efficiently by storing an inverted index. On the other hand, would this really capture much? My gut tells me that for a given word "foo", it's probably pretty rare that "foo" is assigned to different topics in different documents. (Or at least just as rare as it being assigned to different topics in the same document.) Note that we could evaluate this: run simple LDA, and compute the fraction of times a word is assigned it's non-majority topic across all the data, versus just across one documents. I suspect the numbers would be pretty similar. The extreme alternative would be to link all words, but this is just going to be computationally infeasible. Moreover, this begins to look just like tying the \thetas. So for now, we put this idea on the back burner... * MRF over \theta, continued... We're still waiting for these experiments to run (they're about half of the way there). In thinking about the graph over z, though, it did occur to me that maybe you have to use far more topics than I've been using to really reap any benefits here. We begin running with 8, and then bumped it up to 20. But maybe we really need to run with a lot more. So, I log in to two other machines and run just one copy with 50, 100, 150 and 200 topics, just for vanilla LDA. The larger ones will be slow, but we'll just have to go do something else for a while... ============= DAY 4 ============= * MRF over \theta, re-continued... Both experiments finish and we see that: (1) with 20 topics and lengths fixed at 10, there is no difference between raw LDA and TMRF. (2) More topics is better. Even 200 topics doesn't seem to be overfitting. Going 20, 50, 100, 150, 200, we get hidden log likelihoods of -1.43, -1.40, -1.38, -1.36, -1.36 (all e+06). The significance (from the first experiments) seems to be around .002 (e+06), so these changes (even the last, which differs by 0.005) seem to be real. Since we weren't able to overfit, we also run with 300 and 400, and wait some more... ...and they finish and still aren't overfitting. We get -1.35 and -1.35 respectively (differing by about 0.004, again significant!). This is getting ridiculous -- is there a bug somewhere? Everything we've seen in LDA-like papers shows that you overfit when you have a ton of topics. Maybe this is because our documents are really long? ** Model debugging One thing that could be going wrong is that when we hide 20% of the words, and evaluate on that 20%, we're skewing the evaluation to favor long documents. But long documents are probably precisely those that need/want lots of topics. Our mean document length is 2300, but the std is over 2500. The shortest document has 349 words, the longest has 37120. So, instead of hiding 20%, we try hiding a fixed number, which is 20% of the mean, or 460. ============= DAY 5 ============= ** Read the papers, stupid! At this point, we've done a large number of evals, both with 20% hid, and 460 words/doc hid (actually, the min of this and doclen/2), and 50-1600 (at *2) topics. We do actually see a tiny bit of overfitting at 800 or 1600 documents. Then we do something we should have done a long time ago: go back and skim through some LDA papers. We look at the BNJ 2003 JMLR paper. We see that one of the selling points of LDA over LSI is that it *doesn't* overfit! Aaargh! No wonder we haven't been able to get substantial overfitting. However, we also notice something else: aside from dropping 50 stop words (we've been dropping 100), on one data set they don't prune rare words at all, and on the other they prune df=1 words only. We've been pruning df<=5 or <=10 words (can't remember which). Perhaps what's going on is that the reason the graphs aren't helping is just because there vocabulary (around 3k) isn't big enough for them to make a difference! We recreate the text, pruning only the df=1 words. This leads to a vocab of about 10k (which means inference will be ~3 times slower). We run at 50, 100, 200 and 400 and we actualy see a tiny bit of overfitting (maybe) on 400. We accidentally only ran 100 iterations, so it's a bit hard to tell, but at the very least there's no *improvement* for going from 200 topics to 400 topics. Strangely (and I'd have to think about this more before I understand it), running on the new text is actually about 5-20% *faster*, despite the larger vocabulary! ** Re-running with graphs At this point, we're ready to try running with graphs again. Despite the fact that it's slow, we settle on 200 topics (this took about 3.5 hours without graphs, so we will be waiting a while). We also want to run for more iterations, just to see what's going to happen: we do 200 again. And again there's not much difference. One of the graphs seems to be just barely one std above everyone else, but that's nothing to write home about. ============= DAY 6 ============= * Abstracts only? At this point, things are not looking so spectacular. Perhaps the problem is that the documents themselves are so big that there's really not much uncertainty. This is reflected, to some degree, by the lack of variance in the predictive perplexities. So we rebuild the data on abstracts only. This makes running significantly faster (yay). We run 5 copies each of 10, 20, 40, 80, 160 and 320 topics. 40 is a clear winner. 80 and above overfit fairly badly. Now, we turn on graphs and get the following results (5 runs): 40-top-nog.default -69239.4 (86.2629700392932) 40-top-nog.auth -68920.0 (111.863756418243) 40-top-nog.cite -68976.4 (174.920839238783) 40-top-nog.year -69174.2 (133.769204228776) If we compare default (not graphs) to auth (best), we see that we get a 2-3 std separation. This is looking promising, FINALLY! Also, if we plot the results, it looks like auth and cite really dominate. Year is fairly useless. It suggests that, maybe, we just need more data to see more of a difference. * Getting More Data There are two ways we could get more data. First, we could crawl more. Second, we could switch over to, say, PubMed or ACM. This would work since we only need abstracts, not full texts. These sites have nice interfaces, so we start downloading from ACM. ============= DAY 7 ============= Okay, ACM is a pain. And it's really not that complete, so we switch over to CiteSeer (don't know why we didn't think of this before!). We seed with docs from acl, cl, emnlp, icml, jmlr, nips and uai. We notice that CiteSeer is apparently using some crappy PDF extractor (it misses ligatures! ugh!) but it's not worth (yet!) actually downloading the pdfs to do the extraction ourselves ala Braque. From these seeds, we run 10 iterations of reference crawling, eventually ending up with just over 44k documents. We extract a subset comprising 9277 abstracts, and six graphs: author, booktitle, citation, url, year and time (where you connect to year-1 and year+1). The 9k out of 44k are those that (a) have >=100 characters "reasonable" in the abstract and (b) have connections to at least 5 other docs in *all* the graphs. (We're no longer favoring citations.) We then begin the runs again....