## 29 September 2006

### Doing DP Clustering? Don't Sample!

Dirichlet process techniques are increasingly popular tools for Bayesian analysis. There's not enough space here to describe how they work, so I'll assume you know. With the exception of the Variational DP techniques that Dave Blei and Michael Jordan developed, one typically uses MCMC techniques to perform sampling from the posterior distribution and then uses these samples to compute properties of interest. In many cases, the properties of interest are simply the cluster assignments. Since it's unclear how to use multiple samples over the cluster assignments to generate a single one (except, perhaps, by some MBR method), one typically just chooses the single cluster assignment from the sample that has maximal marginal likelihood. This is, of course, not really Bayesian, but it still seems a reasonable thing to do.

For this post, I'm going to consider a model in which we have a likelihood term F(x | theta) and a mean prior G0, where we first draw G from DP(G0,alpha) and then theta from G and then x from theta. In particular, I will assume G0 and F are conjugate, which means that in the Gibbs sampler, we can analytically integrate out both G and theta and draw only for the cluster assignments (this is Neal's algorithm 3). This algorithm is nice because it converges much faster than ones that also have to sample over theta. (I know there are other good algorithms...MH works, as do split/merge proposals.)

The potential worry is that if all you want is the cluster assignment that maximizes the marginal likelihood, is a Gibbs sampler a good search algorithm? The answer is no.

Let's consider another way to search. We arbitrarily order the observed data, then label left-to-right over this ordering. When we get to x_n, we'll have already created k clusters, and then there are k+1 possible choices. It is straightforward to compute a partial marginal likelihood for each of these possibilities, which leads to a direct implementation for breadth-first search. But we can (often) do better. If F is in the exponential family and G0 is its natural prior, we can construct a reasonably good admissible heuristic and apply A* search (I'm working on writing up the details, and I'll make the code available shortly...it's quite simple to implement, but proving that the heuristic is admissible is a bit involved and I'm not 100% sure it's true for all exponential family members or just specific ones).

Here are some artificial experiments. I generated ten documents over a vocabulary of 40 words, based on three clusters drawn from a symmetric Dirichlet with parameter alpha=4, approximately 40 words per document (distributed Poisson). I then use a DP with G0=Dirichlet(2) and F=Multinomial, with the scale parameter on the DP=2. I ran two Gibbs samplers (one initializing with a single large cluster, one initializing with many singleton clusters), and three search algorithms on this data. The first search was full A*. The second was beamed A* with a beam of 5. The last was A* with an inadmissible, but in some sense tigher, heuristic, that's even easier to compute. The results are in the following graph:

The y axis is negative log probability (lower is better) and the x axis is time in seconds. This is all in matlab and I didn't do anything fancy to make either algorithm especially fast. The horizonal lines are, top to bottom, heuristic A*, beam A* and full A*. The timing are, <0.1s, 1.6s and 2.1s, respectively (variances over 5 runs with different permutations of the input are shown in parens). So the search does significantly better (attains a higher marginal likelihood than the sampler) in very little time (even after 30 seconds, the Gibbs sampler is still really far from getting down to even the performance of heuristic A*).

So that's for small data. It turns out that the heuristic isn't good enough to work on huge data sets, unless you use a really small beam, which hampers performance (I'm investigating this). But if we use the inadmissible heuristic, we can handle large-ish data sets fairly easily. I ran the same experiment, but with 1000 docs over a vocabulary of 400 words, with 20 clusters, 1000 words per document and a symmetric Dirichlet prior with alpha=4. The Gibbs here actually sucks. Within about ten iterations, it gets suck with a neg log lik of 724842 and 16 clusters (about 50 seconds per Gibbs iteration). The heuristic A* takes about 2300 seconds total and ends with a neg log like of 707020 (and 20 clusters), quite significantly better. Combining heuristic A* with beam search (beam=100) leads to a neg log lik of 707260 (slightly worse, but no where near as bad as Gibbs) in only 1800 seconds.

(Incidentally, the Gibbs gets stuck because the documents are so long, that the marginal posterior likelihoods completely dwarf the vanilla marginals, so it essentially never moves out of a local maximum. With shorter documents, this doesn't happen as much.)

I'm still working on this a bit...I'm using DP clustering enough that this result make a huge difference for me. I think there's a lot of room for improvement, even over what I have so far. I'm also applying it to real data to see if it still helps. But overall, it looks like this is a fairly promising direction (which is actually quite surprising, because clustering is typically not something that we would typically attack in a "left-to-right" fashion).

## 23 September 2006

### Humor is Hard

Several months ago I became temporarily interested in trying to automatically identify if entries in online discussions are informative, interesting, humorous, etc. (This was somewhat in the context of a summarization sort of system, but the problem seems more generic.) It turns out that in the comments section of slashdot, people manually tag comments into such categories. I spent a few weeks crawling slashdot (eventually getting my IP banned because this is apparently not allowed) and grabbed a few thousand stories and associated comments. I spent a few hours building a straightforward classifier based on the comment labels. It turns out one of the hardest sorts of comments to classify correctly are the funny ones.

In general, I think identifying humor (or attempted humor) is a very hard problem. It seems to almost require a substantial amount of world knowledge and inference capabilities, since humorous comments are rarely signalled by straightforward lexical cues (though having three exclamation points or a smiley is a good indicator, these actually occur surprisingly rarely).

To get a sense of why this is so hard, let's look at some examples. These are grabbed from slashdot two days ago (the 21st).

In one article titled Motorola Unveils Phone Vending Machines (which talks about how you can buy cell phones from vending machines and they they are delivered by robotic arm rather than dropping ala sodas), we have the following comments marked humorous: "can i use the cell phones I want to purchases to purchases the cell phone I am purchasing?" and "I have a hard enough time trying to pull a big old stuffed animal out with those robotic arms much less a tiny tiny phone. At 50 bucks a pop rather than 50 cents, I'm going to waste a lot of money."

In another article about Googling for ATM Master Passwords, we have the following comments. "[Subj: The default password is...] I thought it was up, up, down, down, left, right, left, right, B, A, Start ..." (for those not of my generation, this is the cheat code for the NES game Contra and several other Konami games). Additionally, in response to "Whoever makes these ATMs deserves all the bad publicity that they get." someone comments "Might it be Diebold, by any chance?"

Finally, in commenting about the article Fish Work as Anti-terror Agents (which discusses how fish like the bluegill help detect poisonous substances in water supplies), we get comments like "In Australia, we have stingrays guarding us from pests." and "How do we know this isn't a red herring by some terroist group?" and finally "Does this mean we can carry water bottles on planes again -- if they have bluefish swimming in them?"

You may take issue with the degree to which these comments are funny, but regardless of whether they actually are funny, the certainly were intended to be funny.

What I find fascinating about all these examples is that they're essentially playing the game of drawing surprising comparisons between the article at hand and other common knowledge. For instance, the "robotic arms" comment is based on our shared experience of failing at fairs to get stuffed animals. The stingray comment is in regards to Steve Irwin's recent death, and the waterbottle joke is in reference to the new airline policies. While some (eg., the waterbottle joke) are perhaps easy to identify because they seem "off topic" somehow, other ones (like the Diebold comment or the stingray comment) really are on topic for the article, but just play against some alternative story that we're all expected to know.

I'm not sure what my conclusion is, but if you're out there looking for a really hard text classification problem for which it at least seems that a lot of knowledge and inference is required, you may find humor detection fun.

## 17 September 2006

### Statistical NLP is not NLP but just Statistics?

bact' brings up an interesting point, perhaps more provocative than my original (intended-to-be provocative) pseudo-question. To quote, he says:

and some also said,
statistical natural language processing is not language processing at all, only statistics :P

My impression is that the only sense in which this sentence is true is if you insist that what goes on inside the black box of statistical NLP is somehow explaining what goes on inside our heads.  I see it as essentially parallel to the argument against "neural-style" machine learning.  Some neural networks people used to claim (some still do, I hear) that what happens in an artificial neural net is essentially the same as what goes on in our minds.  My impression (though this is now outside what I really know for sure) is that most cognitive scientists would strongly disagree with this claim.  I get the sense that the majority of people who use NNets in practice use them because they work well, not out of some desire to mimic what goes on in our heads.

I feel the same is probably true for most statistical NLP.  I don't know of anyone who would claim that when people parse sentences they do chart parsing (I know some people claim something more along the lines of incremental parsing actually does happen and this seems somewhat plausible to me).  Or that when people translate sentences they apply IBM Model 4 :).

On the other hand, the alternative to statistical NLP is essentially rule-based NLP.  I have an equally hard time believing that we behave simply as rule processing machines when parsing or translating, and that we efficiently store and search through millions of rules in order to do processing.  In fact, I think I have a harder time believing this than believing the model 4 story :P.

Taking a step back, it seems that there are several goals one can have with dealing with language on a computer.  One can be trying to carry out tasks that have to do with language, which I typically refer to as NLP.  Alternatively, one can be trying to model how humans work with language.  I would probably call this CogNLP or something like that.  One could instead try to use computers and language data to uncover "truths" about language.  This is typically considered computational linguistics.  I don't think any of these goals is a priori better than the others, but they are very different.  My general feeling is that NLPers cannot solve all problems, CogNLPers don't really know what goes on in our minds and CLers are a long way from understanding how language functions.  Given this, I think it's usually best to confine a particular piece of work to one of the fields, since trying to solve two or three at a time is likely going to basically be impossible.

## 08 September 2006

### Multilingual = Not Lingual at All?

There has been a trend for quite some time now toward developing algorithms and techniques to be applicable to a wide range of languages. Examples include parsing (witness the recent CoNLL challenge), machine translation, named entity recognition, etc. I know that in at least one or two of my own papers, I have claimed (without any experimental substantiation, of course :P) that there is no reason why the exact same system could not be run on languages other than English, provided a sufficient amount of labeled training data (and a native speaker who can deal with the annoying tokenization/normalization issues in the non-English language).

I get the feeling that a large part of the surge is blowback against older NLP systems, for which hundreds of/or thousands of human hours were put into writing language-specific grammars and rules and lexicons. The replacement idea is to spend thouse hundreds of/or thousands of hours annotating data, and then repeatedly reusing this data to solve different problems (or to try to come up with better solutions to an existing problem, despite the associated fears in doing so).

I think that, overall, this is a good trend. The problem that I see is that it is potentially limiting. In order to develop a system that could plausibly be applied to (nearly) any language, one has to resort to features that are universal across all languages. This is fine, but for the most part the only universal features we know of that are reasonably computable are things like "language is made up of words and words are sort of semanticy units on their own" (of course, this misses a lot of compounds in German and is hard to do in Chinese without spaces) and "words sometimes have prefixes and suffixes and these are syntactically useful" (oops, Arabic has infixes) and "capitalization is often a good indicator of something proper-noun-like" (except for German where many common nouns are capitalized or Japanese where there isn't case marking). These are sometimes compounded "adjacent words carry semantic meaning." But all in all, these features are relatively weak from the perspective of "language understanding."

This distinction seems analogous to the "domain independent" versus "domain specific" one that we've also seen. If you are willing to limit yourself to a specific domain (eg., counter-terrorism), you can probably do a pretty good job doing reasonably deep understanding. On the other hand, if you want to work at the other end---applicable across all domains---there's little you can do because you're better off going for shallow with complete coverage rather than deep but sparse. Where I think that the domain specific people have it right is that they actually do take advantage of being in a specific domain. I know that when I work on a problem that's language specific (eg., summarization or coreference), I've only seldom taken advantage of the fact that the language is English. Sure, for summarization I've occasionally made use of an English parser and for coreference I've made use of mined data that's specific to English, but overall, I treat it as pretty much "any old language." This would probably be fine if I then ran my system on Arabic and Chinese and Portuguese and showed that it worked. But I don't. This seems to tell me that I'm missing something: that I have not been clear about my goal. Maybe I should take a hint from the domain specific people and decide which side of the language independent camp I want to be on.

(The one counterargument that I will use to save face is that applying to other languages is often a lot of relatively needless work...you often have a pretty good idea of what's going to happen and I'd be surprised if people have strongly believed they've built something that's reasonably language independent and it turns out not to be.)