28 August 2007

The Hierarchical Bayes Compiler

I've been working for a while on a compiler for statistical models. (Those of you who knew me back in the day will know that I used to be a programming languages/compiler nut.) The basic idea is to have a language in which you can naturally express hierarchical Bayesian models and either simulate them directly or compile them into your language of choice (in this case, right now, the only language you can choose is C :P). The key difference between HBC (the Hierarchical Bayes Compiler) and tools like WinBUGS is that you're actually supposed to be able to use HBC to do large-scale simulations. Also, although right now it only supports Gibbs sampling, message passing and stochastic EM variants are in the works. It also almost knows how to automatically collapse out variables (i.e., automatic Rao-Blackwellization), but this is still a bit buggy.

Anyway, you can see more information on the official website. I'd really appreciate bug reports from anyone who runs into problems. It's a pretty complex bit of machinery and some of it is still hacked together rather than done properly (especially type checking), so I expect there are bugs to be found.

So just to whet your appetite if you haven't yet clicked on the above link, here's a complete implementation of LDA including hyperparameter estimation in HBC:


alpha ~ Gam(0.1,1)
eta ~ Gam(0.1,1)
beta_{k} ~ DirSym(eta, V) , k \in [1,K]
theta_{d} ~ DirSym(alpha, K) , d \in [1,D]
z_{d,n} ~ Mult(theta_{d}) , d \in [1,D] , n \in [1,N_{d}]
w_{d,n} ~ Mult(beta_{z_{d,n}}) , d \in [1,D] , n \in [1,N_{d}]
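
If you want to run this, the invocation looks roughly like the following. This is just a sketch pieced together from the flags that come up in the comments below (simulate mode, --loadD for discrete data, --define for constants); the file names lda.hier and testW are made up:

hbc simulate --loadD testW w V D N --define K 10 lda.hier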

21 comments:

david blei said...

this looks amazing, hal. any plans to add variational methods to the toolkit? i think that this would be nearly as straightforward as gibbs, a la vibes.

hal said...

dave -- in short, yes. in long, there are a few things on the TODO list that are higher right now. basically, the list as it stands right now is something like: (1) automatic variable collapsing; (2) better support for gaussians (right now only constant-diagonal covariance matrices are allowed); (3) maximizations; (4) variational. somewhere in there is also targeting other languages. if someone wants to help me target some other language, i'd really appreciate it: i plan to do ocaml, just because i use that a lot, but it would also be nice to have java... unfortunately, i don't remember java, so if someone wants to tell me what the important differences between java and C are, it should be an easy port.

hal said...

i received the following question by email:

I gather from the Web page that the ability to generate production-performance code distinguishes HBC from most other stochastic languages, but I as an ignorant observer do wonder as a serious question, how does HBC relate to IBAL, AutoBayes, and Dyna? (Any others?) It seems that IBAL and Dyna are more focused on discrete distributions (such as over algebraic data types) and exact inference -- is that right?


this is pretty much right. autobayes, umacs, bugs, etc. are all interpreted and hence pretty slow. some allow integration into R, which adds some flexibility, but R (and matlab) is pretty bad at dealing with text. so the big "selling point" is supposed to be that HBC generates code in your language of choice (which must be C right now). in comparison, dyna and ibal focus on exact inference, the former essentially over automata, the latter over algebraic data types.

but the basic point is that it should be somewhat difficult to actually implement a better sampler directly in C than the one produced by HBC. indeed, if you see places in the generated C code that could be sped up significantly (and that gcc doesn't already optimize, like constant lifting and simple dead code elimination), let me know.
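
to make this concrete, here's a hand sketch -- emphatically not hbc's actual output, with all the names made up -- of the kind of inner loop the generated gibbs sampler has to execute for lda's z variables:

#include <stdlib.h>

/* draw an index in [0,K) with probability proportional to post[k] */
static int sample_mult(double *post, int K) {
    double sum = 0.0, u;
    int k;
    for (k = 0; k < K; k++) sum += post[k];
    u = sum * (rand() / (RAND_MAX + 1.0));
    for (k = 0; k < K - 1; k++) {
        u -= post[k];
        if (u < 0.0) break;
    }
    return k;
}

/* resample z[d][n] proportional to theta[d][k] * beta[k][w[d][n]] */
static void resample_z(int d, int n, int K, int **w, int **z,
                       double **theta, double **beta, double *post) {
    int k;
    for (k = 0; k < K; k++)
        post[k] = theta[d][k] * beta[k][w[d][n]];
    z[d][n] = sample_mult(post, K);
}

the point is just that there's very little room between this and what you'd write by hand.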

Anonymous said...

What're the licensing terms? I didn't see anything on the site. If it's Apache-style open source, I'd love to help out generating Java if the system's much more scalable than WinBUGS.

Could you run a pLSA-like mixture model over the Netflix data? That's 20K films, 500K users, 100M integer ratings in {1,2,3,4,5}. Ratings could be generated by mixtures of around 50-100 Gaussians or by similarly sized mixtures of 5-dimensional multinomials for the discrete output case.

I'm thinking pLSA needs not only to be generalized in the obvious way with Bayesian priors, but also to generate films, for the Netflix case where not every film is rated by every user. The generative process would first select a topic, then a film from a per-topic multinomial. In addition to completing the generative model, this has the obvious predictive advantage of letting films associate more strongly with topics, so if you're a sci-fi nut and a documentary nut, you don't rate Star Trek a 1 if you happen to draw the documentary topic. I found this helpful in an EM estimator I implemented in Java for pLSA to run on the Netflix data.

How much memory and time would this take? And once the model's estimated, how fast would predictions be (estimating ratings given film and user)?

Anonymous said...

Nine formulas are worth hundreds of words. Here's a stab at expressing a hierarchical Bayesian model slightly generalizing pLSA for collaborative filtering.

alpha ~ Gam(x1,y1)
beta ~ Gam(x2,y2)
delta ~ Gam(x3,y3)
theta_{u} ~ DirSym(alpha,K) , u \in [1,M]
phi_{k} ~ DirSym(beta,N) , k \in [1,K]
psi_{k} ~ DirSym(delta,5) , k \in [1,K]
z_{u,n} ~ Mult(theta_{u}) , u \in [1,M] , n \in [1,P_{u}]
f_{u,n} ~ Mult(phi_{z_{u,n}}) , u \in [1,M] , n \in [1,P_{u}]
r_{u,n} ~ Mult(psi_{z_{u,n}}) , u \in [1,M] , n \in [1,P_{u}]

Here's a key:

x1,y1,x2,y2,x3,y3: hyperparameters

K: num topics (around 50)
N: num films (around 20K)
M: num users (around 500K)
P_{u}: num films for user u

alpha: prior dispersion of topics per user
beta: prior dispersion of films per topic
delta: prior dispersion of ratings per topic+film

theta_{u} : multinomial over topics 1..K for user u
phi_{k} : multinomial over films 1..N for topic k
psi_{k} : multinomial over ratings 1..5 for topic k

z_{u,n} : topic chosen for user u's n-th film
f_{u,n} : film chosen for user u's n-th film
r_{u,n} : rating for user u's n-th film

Exercise for the reader: replace the rating multinomial psi_{k} and prior delta to convert this to the Gaussian-style pLSA variant.

Anonymous said...

Oops, missed a subscript: the rating multinomials should be per topic and film, not just per topic:

psi_{k,p} ~ DirSym(delta,5) , k \in [1,K], p \in [1,N]

r_{u,n} ~ Mult(psi_{z_{u,n},f_{u,n}}) , u \in [1,M] , n \in [1,P_{u}]

Key:

psi_{k,p} : multinomial over ratings 1..5 for film p in topic k

hal said...

re license: i was hoping to avoid this issue because it just irks me. anyway, it's completely free for any purpose...i just added this to the web page.

bob -- i don't think memory would be a problem (i find that even on a fair amount of data, the memory usage is quite small). the bigger issue is that the speed of convergence for your model might be slow, at least until collapsing is implemented.

but it's pretty fun, right? very easy to try new models... i hope it's fruitful.

anyway, i'd love help with java. i'm pretty much unavailable for the next two weeks, but after that let's get it done. it should be a fairly trivial modification of the C generation.

Yoav said...

This looks pretty neat, but unfortunately I hardly understand it.

What are some good pointers for learning Bayesian inference, to the point where I could understand this discussion (and similar models)?

Anonymous said...

For a machine learning perspective, see Bishop's book "Pattern Recognition and Machine Learning". For a simulation-oriented statistics perspective, see Gelman et al.'s "Bayesian Data Analysis". Both of these assume you are comfortable with probability theory and the common distributions in calculus/matrix notation. If you need to bone up on that, I'd recommend Larsen and Marx's "Intro to Mathematical Statistics", but there are lots of other mathematical intros out there.

Anonymous said...

I think of myself as a decent Java programmer. If you need someone to code some stuff, I'd be willing to invest some time in it. This would also give me the opportunity to get more familiar with Bayesian models ;)

Anonymous said...

This is great! I'm playing with hbc right now.

In response to your comment about other versions: I'd be very excited about an OCaml version, and would be even more excited about a Python or Ruby one. OpenBayes for Python looks good, so together these would be powerful tools... but again, this is just my selfish perspective, since I know these languages better than Haskell. But in any case, just wanted to let you know there's major interest in OCaml, Python and Ruby versions, and I'd definitely hack on it and add features.

MelvinBerkovitz said...

I think this:

hbc --simulate --loadM X x N dim --define K 2 mix_gauss.hier

is a typo; it should be "simulate" without the "--" prefix.

Unknown said...

hal --

this looks very cool. i'm looking forward to playing with hbc, and understanding how it works.

here's a simple challenge (that i think is hard to do in most packages for bayesian modeling): write a pcfg, and achieve the inference performance of cyk (without extraordinary tinkering). this is hard first because representing a pcfg is often awkward, and second because generic inference algorithms don't seem to leverage the independence properties that dynamic programming does.

btw, as someone who uses haskell and ocaml for doing serious bayesian modeling work, which would you recommend as a 'replacement' for everyday matlab use?

best,

-NG

hal said...

regarding getting it working in other languages...

there are essentially two things that need to be done. first, the code generator needs to be written; i can do this with a small amount of input. second, the equivalent of stats.[ch] needs to be ported. the closer the type signatures can match, the easier it will be to generate the code. note, however, that if your language is reasonable and your arrays store their length, then obviously this parameter doesn't need to be passed in.
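
to give a flavor of what i mean by matching signatures, here's a made-up sketch -- these are not the real stats.h declarations -- of the sort of interface involved; in C the array lengths get passed explicitly:

/* hypothetical, simplified signatures -- not the actual stats.h */
double sample_Gam(double shape, double rate);          /* gamma draw */
void   sample_DirSym(double *th, double alpha, int K); /* fill th[0..K-1] with a symmetric dirichlet draw */
int    sample_Mult(double *th, int K);                 /* index in [0,K) drawn from multinomial th */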

so, here's the deal i'll strike. if you want a code generator in language X, give me a stats.X and i'll make the code generator. two people have expressed interest in java; maybe you can team up.


(ps. and yes, it should be simulate, not --simulate; it's fixed now.)

Anonymous said...

stats.c isn't so bad, but samplib.c, on which it depends, is a killer. It's 2500 lines of densely written sampling algorithms (gamma, beta, exponential, Poisson, normal, etc.).

Does anyone know of good sampling code in Java? Porting samplib.c would be tough because of all the gotos, which Java doesn't support. The algorithms would have to be significantly refactored. I don't think this'd be possible without some good test sets. There are good refs and decent inline comments in some of the algorithms, so it at least looks possible.
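
For concreteness, here's roughly what a goto-free refactoring could look like, sketched in C: the Marsaglia-Tsang (2000) gamma sampler for shape >= 1. sample_normal and sample_uniform are assumed helpers here, not samplib functions.

#include <math.h>

double sample_normal(void);   /* assumed helper: standard normal draw */
double sample_uniform(void);  /* assumed helper: uniform (0,1) draw */

/* Gamma(shape, 1) for shape >= 1, after Marsaglia & Tsang (2000);
   for shape < 1, draw with shape+1 and scale by pow(u, 1/shape). */
double sample_gamma(double shape) {
    double d = shape - 1.0 / 3.0;
    double c = 1.0 / sqrt(9.0 * d);
    for (;;) {
        double x, v, u;
        do {                      /* keep v = (1 + c*x)^3 positive */
            x = sample_normal();
            v = 1.0 + c * x;
        } while (v <= 0.0);
        v = v * v * v;
        u = sample_uniform();
        if (u < 1.0 - 0.0331 * x * x * x * x)  /* fast squeeze test */
            return d * v;
        if (log(u) < 0.5 * x * x + d * (1.0 - v + log(v)))  /* full test */
            return d * v;
    }
}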

There is standard normal sampling in java.util.Random, but I don't know how good the implementation is.

Jakarta Commons Math has inverse cumulative probability methods for all its distributions, and these could be used with a uniform [0,1] sampler from Random to sample from a gamma, but I'm guessing that's not going to be the best way to sample. And not all the distros are even implemented (e.g., beta's missing an Impl).

Yoav said...

Thanks, Bob, for your reading list. I will give it a shot.

As for targeting other languages -- putting aside the coolness factor of compilation to many different languages, I think that given the dependence on tight C code, and the inherent slowness of at least Python/Ruby if not Java, it might be better to stick with generating cross-platform C code and concentrate on the FFI of the host languages (SWIG for Python, JNI for Java, etc.). This would require users to have gcc, but at least the code would run at a reasonable speed.

Alternatively, one could just provide the stats lib via the FFI, which would remove the compilation burden from users while making it easier to support different languages AND letting the generated code run at a reasonable speed.

Anonymous said...

Bob, for sampling in Java one can use the cern.jet.random libraries; for example, the Gamma class can be used to sample from the Dirichlet.

See:
http://dsd.lbl.gov/~hoschek/colt/

Another option is the SSJ library:
http://www.iro.umontreal.ca/~simardr/ssj/indexe.html

Joseph Coffey said...

This looks great!
So I'm making a first stab at a factorial HMM, and for some reason it's complaining that pi is not quantified. I'm not sure what this means, but I suspect I need some sort of type annotation....

D \in I

A_{k,d} ~ DirSym(1.0, K) , k \in [1,K], d \in [1,D]
B_{k,d} ~ DirSym(1.0, V) , k \in [1,K], d \in [1,D] , v \in [1,V]
z_{n,d} ~ Mult(A_{z_{(n-1),d},d}) , n \in [1,N], d \in [1,D]

alpha ~ Gam(0.1,1)
beta ~ DirSym(alpha,D)
pi_{n} ~ Mult(beta) ,n \in [1,N]

w_{n} ~ Mult(B_{z_{n,pi_{n}},pi_{n}} ) , n \in [1,N]
--# --define K 5
--# --define D 10
--# --define N 30
--# --define V 3
--# --dumpcore

hal said...

Jeffrey -- This bug is fixed now; see version 0.3 on the web page.

hal said...

Oops -- I meant Joseph, not Jeffrey :).

Jahui said...

I was trying to load 3-D data with HBC, but I ran into a compile error saying "too many dimension". Here's the HBC script, modified from LDA:

alpha ~ Gam(0.1,1)
eta ~ Gam(0.1,1)
beta_{k} ~ DirSym(eta, V), k \in [1,K]
theta_{d} ~ DirSym(alpha, K), d \in [1,D]
z_{d,n} ~ Mult(theta_{d}), d \in [1,D] , n \in [1,N_{d}]
w_{d,n,m} ~ Mult(beta_{z_{d,n}}), d \in [1,D] , n \in [1,N_{d}], m \in [1, M_{n}]

--# --define K 10
--# --loadD testW w V D N M;
--# --collapse beta
--# --collapse theta

Thanks!
