28 August 2007

The Hierarchical Bayes Compiler

I've been working for a while on a compiler for statistical models. (Those of you who knew me back in the day will know that I used to be a programming languages/compiler nut.) The basic idea is to have a language in which you can naturally express hierarchical Bayesian models and either simulate them directly or compile them into your language of choice (in this case, right now, the only language you can choose in C :P). The key differentiation between HBC (the Hierarchical Bayes Compiler) and tools like WinBugs is that you're actually supposed to be able to use HBC to do large-scale simulations. Also, although right now it only supports Gibbs sampling, message passing and stochastic EM variants are in the works. It also almost knows how to automatically collapse out variables (i.e., automatic Rao-Blackwellization), but this is still a bit buggy.

Anyway, you can see more information on the official website. I'd really appreciate people who are interested in sending me bug reports if you encounter any problems. It's a pretty complex bit of machinery and some of it is still hacked together rather than done properly (especially type checking), so I expect to find some bugs there.

So just to whet your appetite if you haven't yet clicked on the above link, here's a complete implementation of LDA including hyperparameter estimation in HBC:

alpha ~ Gam(0.1,1)
eta ~ Gam(0.1,1)
beta_{k} ~ DirSym(eta, V) , k \in [1,K]
theta_{d} ~ DirSym(alpha, K) , d \in [1,D]
z_{d,n} ~ Mult(theta_{d}) , d \in [1,D] , n \in [1,N_{d}]
w_{d,n} ~ Mult(beta_{z_{d,n}}) , d \in [1,D] , n \in [1,N_{d}]

21 August 2007

Topic modeling: syntactic versus semantic

Topic modeling has turned into a bit of a cottage industry in the NLP/machine learning world. Most seems to stem from latent Dirichlet allocation, though this of course built on previous techniques; the most well-known of which is latent semantic analysis. At the end of the day, such "topic models" really look more like dimensionality reduction techniques (eg., the similarity to multinomial PCA); however, in practice, they're often used as (perhaps soft) clustering methods. Words are mapped to topics; topics are used as features; this is fed into some learning algorithm.

One thing that's interested me for a while is that when viewed as clustering algorithms, how these topic models compare with more standard word clustering algorithms from the NLP community. For instance, the Brown clustering technique (built into SRILM) that clusters words based on context. (Lots of other word clustering techniques exist, but they pretty much all cluster based on local context; where local is either positionally local or local in a syntactic tree.)

I think the general high level story is that "topic models" go for semantics while "clustering models" go for syntax. That is, clustering models will tend to cluster words together that appear in similar local context, while topic models will cluster words together that appear in a similar global context. I've even heard stories that when given a choice of using POS tags as features in a model versus Brown clusters, it really don't make a difference.

I think this sentiment is a bit unfair to clustering models. Saying that context-based clustering models only find syntactically similar words is just not true. Consider the example clusters from the original LDA paper (the top portion of Figure 8). If we look up "film" ("new" seems odd) in CBC, we get: movie, film, comedy, drama, musical, thriller, documentary, flick, etc. (I left out multiword entries). The LDA list contains: new, film, show, music, movie, play, musical, best, actor, etc. We never get things like "actor" or "york" (presumably this is why "new" appeared), "love" or "theater", but it's unclear if this is good or not. Perhaps with more topics, these things would have gone into separate topics.

If we look up "school", we get: hospital, school, clinic, center, laboratory, lab, library, institute, university, etc. Again, this is a different sort of list than the LDA list, which contains: school, students, schools, education, teachers, high, public, teacher, bennett, manigat, state, president, etc.

It seems like the syntactic/semantic distinction is not quite right. In some sense, with the first list, LDA is being more liberal in what it considers film-like, with CBC being more conservative. OTOH, with the "school" list, CBC seems to be more liberal.

I realize, of course, that this is comparing apples and oranges... the data sets are different, the models are different, the preprocessing is different, etc. But it's still pretty clear that both sort of models are getting at the same basic information. It would be cool to see some work that tried to get leverage from both local context and global context, but perhaps this wouldn't be especially beneficial since these approaches---at least looking at these two lists---don't seem to produce results that are strongly complementary. I've also seen questions abound regarding getting topics out of topic models that are "disjoint" in some sense...this is something CBC does automatically. Perhaps a disjoint-LDA could leverage these ideas.

08 August 2007

Conferences: Costs and Benefits

I typically attend two or three conferences per year; usually NIPS (which has been in Vancouver since I started attending), and an ACL-related one; the third is typically a second ACL-related conference or ICML, depending on the year. Typically two of these are domestic, one is international. Domestic conferences cost me about $1500 and international ones vary, but Prague weighed in at around $4000. This means that my travel costs (just for myself!) are about $5500-$7000 per year. Moreover, this takes 2-3 weeks of my year (more than 5% of my non-vacation time). When I was a student, this question never entered my mind (I seemed to have a nearly endless supply of money); now, I find myself wondering: are conferences worth the time and money investment?

I'll focus on international conferences because these are the biggest sink in terms of both money and time. In particular, I'll consider Prague, which hosted both ACL and EMNLP. Here's what I feel like I gained from this trip:

  1. I saw some interesting papers presented.
  2. I saw some interesting invited talks (Tom Mitchell's stands out for me).
  3. I had semi-deep hallway conversations with 3 or 4 people.
  4. I had non-deep hallway conversations with probably ~20 people.
  5. I gave two presentations. (The implication is that this may make me "more famous" and that this is a good thing for some reason :P.)
  6. I saw an area of the world that I hadn't yet been to.
  7. I spent a not insignificant amount of time socializing with ~20 friends who I pretty much only see at conferences.
So the question is, was this worth just over $4 grand and 10 days of my life that could have been spent doing research (or taking a vacation)?

I have mixed feelings.
  1. does not seem compelling -- for conferences close to me that I do not attend, I still read proceedings. Sure, sometimes presentations are helpful and there's a bit of a serendipity aspect, but overall, I'd say this is something I could do in a day in the park with a copy of the proceedings.
  2. is important. Especially when the invited talks are good and aren't just a long version of some paper presentation---i.e., when you can get a good sense of the overall research direction and the important long term results---I feel like these things are worth something.
  3. is important. Some people say that hallway conversations are the most important; maybe it's just me, but it's pretty rare for me to have hallway conversations that are sufficiently deep to be really meaningful in the long run, but I'd say around 3 per conferences is something that you can hope for. At least with these, they seem to have either led to collaboration or at least new ideas to try out in my own work.
  4. provides good social networking... I don't feel like these really change how I think about problems (and I think the same is true for the people I had such conversations with). The only important thing here is if you find out about what new problems other people are working on, you can learn about new areas that may interest you.
  5. is nebulous to me; I feel like the key purpose in conference talks is advertisement. It's a bit unclear what I'm advertising for---citations, perhaps?---but hopefully something I've done will save someone else some time, or will give them ideas of something to try or something along these lines. But this is highly correlated with (1), which suggests that it's actually not particularly useful.
  6. shouldn't be underestimated, but if I compare taking a vacation with going to a conference, they're very different. In particular, even at a conference where I a priori intend to spend a bunch of time touristing, I never seem able to accomplish this as much as I would like. Of course, $4k out of grant money versus $4k out of personal money is very different.
  7. also shouldn't be underestimated, but like (6) is maybe best accomplished in other ways.
Based on this, I feel like overall the main benefits to going to a conference are: seeing invited talks, having deep hallway conversations, and a minor bit of socializing and serendipity.

The thing that occurred to me recently is that it's actually possible to achieve these things without going to conferences. In particular, consider the following model. I invite one or two "famous types" to my university to give invited talks. Each of these would cost maybe $2000, but a lot (if not all) of this would be subsidized by the department. So I get invited talks for (nearly) free; for safety, even say it costs me $1k. I now have $3k left. With this $3k I tour around the country and spend a few days at different labs/universities and meet with people. If I know someone well enough at a lab, I can probably stay with them, which means my only real cost is airfare (assuming their university doesn't want to invite me and pay for it) and incidentals. For domestic flights, it's hard to imagine that I wouldn't be able to pull off each of this for around $750. And that eats up the $4k.

What do I get out of this model? Well, I'd give talks at four universities. This is not quite as broad an audience as at a conference, but they're more focused and my talk can be longer. Instead of having semi-deep hallway conversations, at the very least I get 4 very deep office conversations, which is potentially much more useful. I get one or two invited talks per year, by people that I choose (modulo availability).

What do I lose? I lose seeing other papers presented, which I don't think is too serious. I lose socializing and touristing (in foreign countries). This is too bad, but is perhaps better served by a legitimate vacation. The only other big thing I lose is conversations with multiple people simultaneously (eg., in Prague, one of my "good" conversations was with Ryan McDonald and Joakim Nivre... this would not be possible under my proposed model). I also lose seeing questions and answers asked at talks, which are occasionally quite interesting, but also something that I'm willing to live with out.

Overall, I think the biggest thing I would lose is a sense of community, which I think is hard to quantify and yet still important. Though, I'm also not proposing that I would never go to a conference, but that maybe 2-3 per year is overkill for the benefits obtained (especially for expensive destinations). If I went to one domestic (I count Canada as domestic) conference per year and visited 2-3 other sites, I'm not sure that I'd be any worse off. (Of course, the fact that I'm in the States helps here... you probably couldn't get away with this model outside of US/Canada.)

01 August 2007

Explanatory Models

I am frequently asked the question: why does your application for solving XXX make such and such an error? (You can easily replace "your" with any possessive noun and the statement remains valid.)

My standard answer is to shrug and say "who knows."

This is quite different from, for instance, work in pattern matching for information extraction (many other citations are possible). In this setting, when the system makes an error, one can ask the system "what pattern caused this error." You can then trace the pattern back to the source documents from which it came and obtain some understanding for what is going on.

This is frequently not the case for your generic sequence labeling algorithm. If, say, a CRF misses a person name, what can you do about it? Can you understand why it made the error. More generally, if a model of any variety errs, can it say anything about why this error came to be.

One way to approach this problem would be to try to inspect the weights of the learned algorithm, but there are so many trade-offs going on internally that I've never been able to do this successfully (by hand, at least --- perhaps a clever tool could help, but I'm not sure). An alternative that I've been thinking about recently (but probably won't work on because I haven't enough time right now) is instead to pose the question as: what is the minimal change to the input required so that I would have made the decision correctly.

I guess one way to think about this is consider the case that a POS system misses tagging "Fred" in the sentence "Fred is not happy" as an NNP and instead calls it a "VBD." Presumably we have a bunch of window features about Fred that give us its identify, prefixes and suffixes, etc. Perhaps if "Fred" had been "Harry" this wouldn't have happened because "Fred" has the spelling feature "-ed." (Ok, clearly this is a very crafted example, but you get the idea.)

The question is: how do you define minimal change in features. If we're in an HMM (where we're willing to assume feature independence), then I don't think this is a big problem. But in the case where "Fred" ends in "-ed", it also ends in "-d" and both of these make it look more like a VBD. Such an explanatory system would ideally like to know that if "-d" weren't there, then neither would be "-d" and use this for computing the minimal change. It would also have to know that certain features are easier to change than others. For instance, if it has only ever seen the word "Xavier" in the training data as an NNP, then it could also suggest that if the word were "Xavier" instead of "Fred" than this error would not have happened. But this is sort of silly, because it gives us no information. (I'm working under the assumption that we want to do this so that we can add/remove features to our model to help correct for errors [on development data :P].)

It seems like neither of these problems is insurmountable. Indeed, just looking at something like feature frequency across the entire training set would give you some sense of which features are easy to change, as well as which ones are highly correlated. (I guess you could even do this on unlabeled data.)

I feel like it's possible to create a methodology for doing this for a specific problem (eg., NE identification), but I'd really like to see some work on a more generic framework that can be applied to a whole host of problems (why did my MT system make that translation?). Perhaps something already exists and I just haven't seen it.