27 December 2007

Those Darn Biologists...

I've recently been talking to a handful of biologists. The lab/bench kind, not the computational kind. As an experiment in species observation, they intrigue me greatly. I'm going to stereotype, and I'm sure it doesn't hold universally, but I think it's pretty true, and several (non-biologist) friends have confirmed this (one of whom is actually married to a biologist). The reason they intrigue me is that they are wholly uninterested in techniques that allow you to do X better once they already have a solution for X.

This is a bit of an exaggeration, but not a huge one (I feel).

It is perhaps best served by example. Take dimensionality reduction. This is a very very well studied problem in the machine learning community. There are a bajillion ways to do it. What do biologists use? PCA. Still. And it's not that people have tried other things and they've failed. Other things have worked. Better. But no one cares. Everyone still uses PCA.

Part of this is probably psychological. After using PCA for a decade, people become comfortable with it and perhaps able to anticipate a bit of what it will do. Changing techniques would erase this.

Part of this is probably cultural. If you're trying to publish a bio paper (i.e., a paper in a bio journal) and you're doing dimensionality reduction and you aren't using PCA, you're probably going to really have to convince the reviewers that there's a good reason not to use PCA and that PCA just plain wouldn't work.

Now, that's not to say that biologists are luddites. They seem perfectly happy to accept new solutions to new problems. For instance, still in the dimensionality reduction setting, non-negative matrix factorization techniques are starting to show their heads in microarray settings. But it's not because they're better at solving the same old problem. What was observed was that if you have not just one, but a handful of microarray data sets, taken under slightly different experimental conditions, then the variance captured by PCA is the variance across experiments, not the variance that you actually care about. It has been (empirically) observed that non-negative matrix factorization techniques don't suffer from this problem. Note that it's only fairly recently that we've had such a surplus of microarray data that this has even become an issue. So here we have an example of a new problem needing a new solution.
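
To make that concrete, here's a minimal sketch of the kind of diagnostic involved, on purely synthetic data (the gamma noise, the batch offset, and every name below are made up for illustration; this is nobody's real pipeline): pool two "experiments," run PCA and NMF from scikit-learn, and check how strongly each component tracks the experiment label rather than anything you actually care about. I make no claim about what NMF prints here; the point is just the check itself.

import numpy as np
from sklearn.decomposition import PCA, NMF

rng = np.random.default_rng(0)
n_genes, n_per_batch = 200, 50

# Non-negative "expression" data shared across two experiments, plus a
# batch-specific shift for the second experiment (purely synthetic).
signal = rng.gamma(shape=2.0, scale=1.0, size=(2 * n_per_batch, n_genes))
batch = np.repeat([0, 1], n_per_batch)
X = signal + 3.0 * batch[:, None]

pca_scores = PCA(n_components=2).fit_transform(X)
nmf_scores = NMF(n_components=2, init="nndsvda", max_iter=500).fit_transform(X)

# How strongly does each component correlate with the experiment label?
# With an offset this blunt, the top PCA component is essentially the batch indicator.
for name, scores in [("PCA", pca_scores), ("NMF", nmf_scores)]:
    corrs = [abs(np.corrcoef(scores[:, k], batch)[0, 1]) for k in range(2)]
    print(name, [round(c, 2) for c in corrs])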

Let's contrast this with the *ACL modus operandi.

There were 132 papers in ACL 2007. How many of them do you think introduced a new problem?

Now, I know I'm being a bit unfair. I'm considering dimensionality reduction across different platforms (or experimental conditions) as a different problem from dimensionality reduction on a single platform. You could argue that these really are the same problem. I just kind of disagree.

Just as biologists tend to be wary of new techniques, so we tend to be wary of new problems. I think anyone who has worked on a new problem and tried to get it published in a *ACL has probably gone through an experience that left a few scars. I know I have. It's just plain hard to introduce a new problem.

The issue is that if you work on an age-old problem, then all you have to do is introduce a new technique and then show that it works better than some old technique. And convince the reviewers that the technique is interesting (though this can even be done away with if the "works better" part is replaced with "works a lot better").

On the other hand, if you work on a new problem, you have to do a lot. (1) You have to introduce the new problem and convince us that it is interesting. (2) You have to introduce a new evaluation metric and convince us that it is reasonable. (3) You have to introduce a new technique and show that it's better than whatever existed before.

The problem is that (1) is highly subjective, so a disinterested reviewer can easily kill such a paper with a "this problem isn't interesting" remark. (2) is almost impossible to do right, especially the first time around. Again, this is something that's easy to pick on as a reviewer and instantly kill a paper. One might argue that (3) is either irrelevant (being a new problem, there isn't anything that existed before) or no harder than in the "age-old problem" case. I've actually found this not to be true. If you work on an age-old problem, everyone knows what the right baseline is. If you work on a new problem, not only do you have to come up with your own technique, but you also have to come up with reasonable baselines. I've gotten (on multiple occasions) reviews for "new problems" papers along the lines of "what about a baseline that does XXX." The irony is that while XXX is often creative and very interesting, it's not really a baseline and would actually probably warrant its own paper!

That said, I didn't write this post to complain about reviewing, but rather to point out a marked difference between how our community works and how another community (that is also successful) works. Personally, I don't think either is completely healthy. I feel like there is merit in figuring out how to do something better. But I also feel that there is merit in figuring out what new problems are and how to solve them. The key problem with the latter is that the common reviewer complaints I cited above (problem not interesting or metric not useful) often are real issues with "new problem" papers. And I'm not claiming that my cases are exceptions. And once you accept a "new problem" paper on a problem that really isn't interesting, you've opened Pandora's box and you can expect to see a handful of copycat papers next year (I could, but won't, give a handful of examples of this happening in the past 5 years). And it's really hard to reject the second paper on a topic as being uninteresting, given that the precedent has been set.

18 December 2007

Particle filtering versus beam search

I had a very interesting discussion at NIPS with Vikash Mansingka about beam search and particle filtering. Much of what is below is a result of this conversation and he deserves much credit.

In a very real sense, particle filtering can be seen (for many models) as a sort of stochastic beam search. Just to set the stage, let's talk about generic beam search:

1. Initialize a min-queue "beam" to contain a single initial hypothesis.
2. While the beam contains non-complete hypotheses:
   A. Take h, the lowest-scoring hypothesis from the beam
   B. For each h' in the neighborhood of h
      I. Add h' to the beam, with some score
   C. If the beam exceeds a certain capacity, drop high-scoring elements
3. Return the lowest-scoring hypothesis from the beam
The key variant in beam search algorithms is the score function that's used. The most naive is just path cost---how much did we have to pay to get to the current hypothesis? The "A*" way is to use path cost + admissible heuristic cost, where the admissible heuristic is an underestimate of the amount of cost remaining to complete.
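
For concreteness, here's a minimal Python sketch of the generic procedure above. Everything problem-specific (initial, neighbors, is_complete, heuristic) is a hypothetical hook you'd supply yourself; with heuristic fixed at zero you get plain path-cost search, and with an admissible underestimate you get the A* flavor.

import heapq

def beam_search(initial, neighbors, is_complete, heuristic=lambda h: 0.0,
                capacity=10):
    # Each beam entry is (score, path_cost, hypothesis); lower score is better.
    beam = [(heuristic(initial), 0.0, initial)]
    while any(not is_complete(h) for _, _, h in beam):
        # 2A: take the lowest-scoring incomplete hypothesis off the beam.
        beam.sort(key=lambda entry: entry[0])
        i = next(j for j, (_, _, h) in enumerate(beam) if not is_complete(h))
        _, cost, h = beam.pop(i)
        # 2B: add every neighbor, scored by path cost (+ heuristic, for A*).
        for h2, step_cost in neighbors(h):
            new_cost = cost + step_cost
            beam.append((new_cost + heuristic(h2), new_cost, h2))
        # 2C: prune back down to capacity, dropping high-scoring elements.
        beam = heapq.nsmallest(capacity, beam, key=lambda entry: entry[0])
    # 3: return the lowest-scoring (complete) hypothesis.
    return min(beam, key=lambda entry: entry[0])[2]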

There are (of course) other variants: sometimes we expand the whole beam in one step; sometimes we use multiple beams; etc. We'll come back to these shortly.

Particle filtering (using terminology as similar to the beam search as possible), on the other hand, looks something like:
1. Initialize a max-queue "beam" to contain a single initial hypothesis.
2. While the beam contains non-complete hypotheses:
   A. Take h, the highest-scoring hypothesis from the beam
   B. Sample h' from the neighborhood of h
      I. Add h' to the beam, with some score
   C. If the beam exceeds a certain capacity, randomly drop elements
   D. If the "effective sample size" of the beam is too low, resample elements
3. Return the highest-scoring hypothesis from the beam
To map beam search to particle filtering, use negative log probabilities for scores. (This turns max search into min search.)
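
Here's a correspondingly minimal sketch of the particle filter loop, written in the usual "extend every particle each pass" style rather than popping one hypothesis at a time. The hooks are again hypothetical: propose(h, rng) samples a successor and returns it with a multiplicative weight update, and is_complete(h) is as before. This is a sketch of the recipe above, not anyone's reference implementation; the step 2D check is discussed below.

import numpy as np

def particle_filter(initial, propose, is_complete, capacity=100,
                    ess_fraction=0.5, rng=np.random.default_rng(0)):
    particles, weights = [initial], np.array([1.0])
    while any(not is_complete(h) for h in particles):
        # 2A/2B: extend each incomplete particle by sampling from the proposal.
        new_particles, new_weights = [], []
        for h, w in zip(particles, weights):
            if is_complete(h):
                new_particles.append(h)
                new_weights.append(w)
                continue
            h2, w_update = propose(h, rng)
            new_particles.append(h2)
            new_weights.append(w * w_update)
        particles, weights = new_particles, np.array(new_weights)
        # 2C: over capacity -> drop at random (not greedily, as beam search would).
        if len(particles) > capacity:
            keep = rng.choice(len(particles), size=capacity, replace=False)
            particles = [particles[i] for i in keep]
            weights = weights[keep]
        # 2D: if the effective sample size is too low, resample with replacement.
        norm = weights / weights.sum()
        if 1.0 / np.sum(norm ** 2) < ess_fraction * len(particles):
            idx = rng.choice(len(particles), size=len(particles), p=norm)
            particles = [particles[i] for i in idx]
            weights = np.ones(len(particles))
    # 3: return the highest-weight (complete) particle.
    return particles[int(np.argmax(weights))]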

In a sense, the major thing that differs between PF and BS is that in PF everything happens randomly -- we're always sampling from a proposal distribution, rather than greedily taking the "best" hypotheses at each step.

There are, however, some minor issues: what proposal distribution is used in step 2B of PF, and what resampling (step 2D) actually does.

In general, the proposal distribution should be the true posterior probability, but this is pretty much always intractable (otherwise we wouldn't be using approximations at all). The most common thing to do is to use a transition probability, which corresponds exactly to using a path cost in beam search. However, just as we know that using a heuristic can help in beam search, this suggests that using a proposal distribution in PF that contains something like a future-cost heuristic is going to be helpful. I'm sure this has been done, but I don't think it's nearly as common as it perhaps should be, given how helpful it is in BS.
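
Concretely, one way to fold a future-cost estimate into the proposal over a discrete neighborhood (a sketch under made-up names: successors, transition_prob, heuristic, and lam are all hypothetical) is to reweight the model's transition probabilities by exp(-lam * heuristic), sample from that, and attach an importance weight so the filter still targets the same distribution. Something with this shape could serve as the propose hook in the sketch above.

import numpy as np

def heuristic_proposal(h, successors, transition_prob, heuristic, lam=1.0,
                       rng=np.random.default_rng(0)):
    # Candidate extensions and their transition probabilities under the model.
    cands = list(successors(h))
    p = np.array([transition_prob(h, h2) for h2 in cands])
    # Proposal q(h'|h) proportional to p(h'|h) * exp(-lam * heuristic(h')).
    q = p * np.exp(-lam * np.array([heuristic(h2) for h2 in cands]))
    q = q / q.sum()
    i = rng.choice(len(cands), p=q)
    # Importance weight corrects for sampling from q rather than from p.
    return cands[i], p[i] / q[i]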

What step 2D in PF does is measure whether the samples we have in the beam have fallen out of the true posterior---in other words, whether we're "off the mark." There's a straightforward way to compute this: for normalized particle weights w_i, the effective sample size is 1 / sum_i w_i^2, and we worry when it drops well below the number of particles. If we are off the mark, we resample all the elements in our beam, essentially trying to get back on the right path. My experience with PF is that this is pretty essential to getting things to work, but as far as I know there's no analogue in BS land.

It may be that these two things (resampling and heuristic) are two different ways of solving the same problem and that doing both is not so helpful. But I tend to doubt it.

One other thing that I've never seen done in PF is to maintain multiple beams. Admittedly this makes the resampling step harder (maybe...I haven't worked through it), but it may be helpful. One thing, for instance, that we found in the PF for the coalescent was that the PF tends to like to merge clusters early with no regard for how hard life is going to be later. We use resampling to try to fix this, but another way would be to use multiple beams. This is commonly done in, e.g., machine translation, where a single beam doesn't do well, so we maintain one beam for each number of foreign (or English) words covered.
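
For what it's worth, the MT-style bookkeeping is tiny. Here's a sketch with hypothetical hypothesis objects carrying a covered count (number of source words translated) and a score: one stack per coverage level, each pruned independently.

from collections import defaultdict

# stacks[k] is the beam ("stack") of hypotheses covering exactly k source words.
stacks = defaultdict(list)

def add_hypothesis(stacks, hyp, capacity=100):
    stack = stacks[hyp.covered]
    stack.append(hyp)
    stack.sort(key=lambda h: h.score)   # lower score = better, as before
    del stack[capacity:]                # prune each stack on its own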

So what's the point? I think that by recognizing the similarity between BS and PF, we can see ways to potentially improve both. For instance, the following seem promising:
  1. Use resampling techniques in beam search
  2. Use multiple beams in particle filtering
  3. Use A*-style heuristics in the proposal distribution for particle filtering
An alternative suggestion would be for NLPers to pay more attention to particle filtering. It's a nice, clean probabilistic analogue to a technique that we love (or at least are forced to use).

12 December 2007

NIPS retrospective

I got back from NIPS on Sunday; sadly I got sick on the last day. Despite this, it was by far my most productive conference ever. Admittedly, I made a point to try to make it productive (being somewhat geographically isolated means that it's in my best interest), but even so, I had a fantastic time. There are several ideas that I plan on blogging on in the near future, but for now, I'll focus on the actual conference program. The standard disclaimer applies: I didn't see everything and just because something's not listed here doesn't mean I didn't enjoy it. I'll try to limit my commentary to papers that are relevant to the NLP audience, but I may stray. I'll conclude with a bit of my sense of what was in and out this year.

  1. Kristina Toutanova and Mark Johnson had a topic-model style paper for doing semisupervised POS tagging. There are two new things here. First, they actually model context (yay!) which we know is important in language. Second, they introduce a notion of a confusion class (what possible POS tags can a word get) and actually model this. The latter is something that makes sense in retrospect, but is not obvious to think of (IMO). They actually get good results (which is non-trivial for topic models in this context).
  2. Alex Bouchard-Cote and three more of the Berkeley gang had a paper on language change. If you saw their paper at ACL, they're attacking a pretty similar problem, by looking at phonological divergences between a handful of languages. I'd love to be able to reconcile their approach with what I've been working on---I use a huge database of typological knowledge; they use word forms...I'd love to use both, but I really like working with thousands of languages and it's hard to find corpora for all of these :).
  3. John Blitzer had a good paper (with other Penn folks) on learning bounds for domain adaptation. If you care at all about domain adaptation, you should read this paper.
  4. Alex Kulesza and Fernando Pereira had a great paper on what happens if you try to do structured learning when the underlying prediction (inference) algorithm is approximate. They show that things can go very wrong in widely different ways. This is a bit of a negative results paper, but it gives us (a) a strong sense that proceeding with a generic learning algorithm on top of an approximate inference algorithm is not okay and (b) a few ideas of what can actually go wrong.
  5. Yee Whye Teh, Kenichi Kurihara and Max Welling continue their quest toward collapsed variational models for all problems in the world by solving the HDP. (For those who don't know, you can think of this as LDA with a potentially infinite number of topics, but of course the real case is more interesting.)
  6. Ali Rahimi and Benjamin Recht presented a very cool analysis of kernel methods when your kernel is of the form K(x,z) = f(x-z). This is the case for, e.g., the RBF kernel. This is something that I've wanted to try for a while, but to be honest my poor recollection of my analysis training left me a bit ill-equipped. Essentially they apply a Fourier analysis to such kernels and then represent the kernel product K(x,z) as a dot product in a new input space, where kernelization is no longer required. The upshot is that you can now effectively do kernel learning in a linear model in primal space, which is very nice when the number of examples is large. (A small sketch of the construction appears just after this list.)
  7. David Heckerman gave a really nice invited talk on graphical models for HIV vaccine design. What he was particularly interested in was analyzing whether standard methods of determining correlation between events (e.g., "do I have HIV given that I have some genomic signal?") work well when the data points are not independent, but rather belong to, e.g., some genetic population. This is effectively what Lyle Campbell and I were doing in our ACL paper last year on typological features. Curing HIV might be a bit more of an interesting application, though :). Essentially what David and his colleagues do is to build a hierarchy on top of the data points and then let this explain some of the interactions and then figure out what isn't explained by this hierarchy. In the linguistic sense, this is effectively separating historical from areal divergences. I'll have to read up more on what they do, precisely.
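
To make item 6 a bit more concrete, here's a minimal sketch of the random-feature idea for the RBF kernel K(x,z) = exp(-gamma * ||x - z||^2). This is my paraphrase of the standard construction, not code from the paper: draw random frequencies from a Gaussian, map each input to cosine features, and the dot product of the resulting feature vectors approximates the kernel.

import numpy as np

def random_fourier_features(X, gamma=1.0, n_features=500, seed=0):
    # Approximate exp(-gamma * ||x - z||^2) by an explicit map:
    # phi(x) = sqrt(2/D) * cos(x W + b), with W ~ N(0, 2*gamma) entrywise
    # and b ~ Uniform(0, 2*pi). Then phi(x) . phi(z) ~= K(x, z).
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Quick check on random data: exact kernel matrix vs. feature dot products.
X = np.random.default_rng(1).normal(size=(5, 3))
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Phi = random_fourier_features(X, gamma=1.0)
print(np.abs(K_exact - Phi @ Phi.T).max())  # approximation error; shrinks as n_features grows
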
There were several themes that stuck out at NIPS this year, though they aren't represented in the above list. Probably the biggest one is recommender systems. Just search the titles for "matrix" and you'll come up with a handful of such systems. I have to believe that this is spurred, at least in part, by the NetFlix challenge. Another theme was deep belief networks, which even had a rogue workshop dedicated to them. I have a feeling we'll be seeing more and more of these things. A final theme was randomized algorithms, particularly as applied to large scale learning. I think this is a great direction, especially at NIPS, where, historically, things have been less focused on the large scale.

My only serious quibble with the program this year is that I saw a handful of papers (and at least one that got an oral and one a spotlight) that really fell down in terms of evaluation. That is, these papers all proposed a new method for solving task X and then proceeded to solve it. The experiments, however, contained only comparisons to algorithms that did not have access to all the information that the proposed method had. I'll give an example of something that would fall in this category but that I have not seen appear anywhere: I build a structured prediction algorithm and compare only against algorithms that make independent classification decisions. I'm not a huge quibbler over comparing against the absolute state of the art, but it really really irks me when there are no comparisons to algorithms that use the same information, especially when standard (read: they are in any reasonable machine learning textbook) algorithms exist.