29 September 2006

Doing DP Clustering? Don't Sample!

Dirichlet process techniques are increasingly popular tools for Bayesian analysis. There's not enough space here to describe how they work, so I'll assume you know. With the exception of the Variational DP techniques that Dave Blei and Michael Jordan developed, one typically uses MCMC techniques to perform sampling from the posterior distribution and then uses these samples to compute properties of interest. In many cases, the properties of interest are simply the cluster assignments. Since it's unclear how to use multiple samples over the cluster assignments to generate a single one (except, perhaps, by some MBR method), one typically just chooses the single cluster assignment from the sample that has maximal marginal likelihood. This is, of course, not really Bayesian, but it still seems a reasonable thing to do.

For this post, I'm going to consider a model in which we have a likelihood term F(x | theta) and a mean prior G0, where we first draw G from DP(G0,alpha) and then theta from G and then x from theta. In particular, I will assume G0 and F are conjugate, which means that in the Gibbs sampler, we can analytically integrate out both G and theta and draw only for the cluster assignments (this is Neal's algorithm 3). This algorithm is nice because it converges much faster than ones that also have to sample over theta. (I know there are other good algorithms...MH works, as do split/merge proposals.)

The potential worry is that if all you want is the cluster assignment that maximizes the marginal likelihood, is a Gibbs sampler a good search algorithm? The answer is no.

Let's consider another way to search. We arbitrarily order the observed data, then label left-to-right over this ordering. When we get to x_n, we'll have already created k clusters, and then there are k+1 possible choices. It is straightforward to compute a partial marginal likelihood for each of these possibilities, which leads to a direct implementation for breadth-first search. But we can (often) do better. If F is in the exponential family and G0 is its natural prior, we can construct a reasonably good admissible heuristic and apply A* search (I'm working on writing up the details, and I'll make the code available shortly...it's quite simple to implement, but proving that the heuristic is admissible is a bit involved and I'm not 100% sure it's true for all exponential family members or just specific ones).

Here are some artificial experiments. I generated ten documents over a vocabulary of 40 words, based on three clusters drawn from a symmetric Dirichlet with parameter alpha=4, approximately 40 words per document (distributed Poisson). I then use a DP with G0=Dirichlet(2) and F=Multinomial, with the scale parameter on the DP=2. I ran two Gibbs samplers (one initializing with a single large cluster, one initializing with many singleton clusters), and three search algorithms on this data. The first search was full A*. The second was beamed A* with a beam of 5. The last was A* with an inadmissible, but in some sense tigher, heuristic, that's even easier to compute. The results are in the following graph:



The y axis is negative log probability (lower is better) and the x axis is time in seconds. This is all in matlab and I didn't do anything fancy to make either algorithm especially fast. The horizonal lines are, top to bottom, heuristic A*, beam A* and full A*. The timing are, <0.1s, 1.6s and 2.1s, respectively (variances over 5 runs with different permutations of the input are shown in parens). So the search does significantly better (attains a higher marginal likelihood than the sampler) in very little time (even after 30 seconds, the Gibbs sampler is still really far from getting down to even the performance of heuristic A*).

So that's for small data. It turns out that the heuristic isn't good enough to work on huge data sets, unless you use a really small beam, which hampers performance (I'm investigating this). But if we use the inadmissible heuristic, we can handle large-ish data sets fairly easily. I ran the same experiment, but with 1000 docs over a vocabulary of 400 words, with 20 clusters, 1000 words per document and a symmetric Dirichlet prior with alpha=4. The Gibbs here actually sucks. Within about ten iterations, it gets suck with a neg log lik of 724842 and 16 clusters (about 50 seconds per Gibbs iteration). The heuristic A* takes about 2300 seconds total and ends with a neg log like of 707020 (and 20 clusters), quite significantly better. Combining heuristic A* with beam search (beam=100) leads to a neg log lik of 707260 (slightly worse, but no where near as bad as Gibbs) in only 1800 seconds.

(Incidentally, the Gibbs gets stuck because the documents are so long, that the marginal posterior likelihoods completely dwarf the vanilla marginals, so it essentially never moves out of a local maximum. With shorter documents, this doesn't happen as much.)

I'm still working on this a bit...I'm using DP clustering enough that this result make a huge difference for me. I think there's a lot of room for improvement, even over what I have so far. I'm also applying it to real data to see if it still helps. But overall, it looks like this is a fairly promising direction (which is actually quite surprising, because clustering is typically not something that we would typically attack in a "left-to-right" fashion).

23 September 2006

Humor is Hard

Several months ago I became temporarily interested in trying to automatically identify if entries in online discussions are informative, interesting, humorous, etc. (This was somewhat in the context of a summarization sort of system, but the problem seems more generic.) It turns out that in the comments section of slashdot, people manually tag comments into such categories. I spent a few weeks crawling slashdot (eventually getting my IP banned because this is apparently not allowed) and grabbed a few thousand stories and associated comments. I spent a few hours building a straightforward classifier based on the comment labels. It turns out one of the hardest sorts of comments to classify correctly are the funny ones.

In general, I think identifying humor (or attempted humor) is a very hard problem. It seems to almost require a substantial amount of world knowledge and inference capabilities, since humorous comments are rarely signalled by straightforward lexical cues (though having three exclamation points or a smiley is a good indicator, these actually occur surprisingly rarely).

To get a sense of why this is so hard, let's look at some examples. These are grabbed from slashdot two days ago (the 21st).

In one article titled Motorola Unveils Phone Vending Machines (which talks about how you can buy cell phones from vending machines and they they are delivered by robotic arm rather than dropping ala sodas), we have the following comments marked humorous: "can i use the cell phones I want to purchases to purchases the cell phone I am purchasing?" and "I have a hard enough time trying to pull a big old stuffed animal out with those robotic arms much less a tiny tiny phone. At 50 bucks a pop rather than 50 cents, I'm going to waste a lot of money."

In another article about Googling for ATM Master Passwords, we have the following comments. "[Subj: The default password is...] I thought it was up, up, down, down, left, right, left, right, B, A, Start ..." (for those not of my generation, this is the cheat code for the NES game Contra and several other Konami games). Additionally, in response to "Whoever makes these ATMs deserves all the bad publicity that they get." someone comments "Might it be Diebold, by any chance?"

Finally, in commenting about the article Fish Work as Anti-terror Agents (which discusses how fish like the bluegill help detect poisonous substances in water supplies), we get comments like "In Australia, we have stingrays guarding us from pests." and "How do we know this isn't a red herring by some terroist group?" and finally "Does this mean we can carry water bottles on planes again -- if they have bluefish swimming in them?"

You may take issue with the degree to which these comments are funny, but regardless of whether they actually are funny, the certainly were intended to be funny.

What I find fascinating about all these examples is that they're essentially playing the game of drawing surprising comparisons between the article at hand and other common knowledge. For instance, the "robotic arms" comment is based on our shared experience of failing at fairs to get stuffed animals. The stingray comment is in regards to Steve Irwin's recent death, and the waterbottle joke is in reference to the new airline policies. While some (eg., the waterbottle joke) are perhaps easy to identify because they seem "off topic" somehow, other ones (like the Diebold comment or the stingray comment) really are on topic for the article, but just play against some alternative story that we're all expected to know.

I'm not sure what my conclusion is, but if you're out there looking for a really hard text classification problem for which it at least seems that a lot of knowledge and inference is required, you may find humor detection fun.

17 September 2006

Statistical NLP is not NLP but just Statistics?

bact' brings up an interesting point, perhaps more provocative than my original (intended-to-be provocative) pseudo-question. To quote, he says:

and some also said,
statistical natural language processing is not language processing at all, only statistics :P

My impression is that the only sense in which this sentence is true is if you insist that what goes on inside the black box of statistical NLP is somehow explaining what goes on inside our heads.  I see it as essentially parallel to the argument against "neural-style" machine learning.  Some neural networks people used to claim (some still do, I hear) that what happens in an artificial neural net is essentially the same as what goes on in our minds.  My impression (though this is now outside what I really know for sure) is that most cognitive scientists would strongly disagree with this claim.  I get the sense that the majority of people who use NNets in practice use them because they work well, not out of some desire to mimic what goes on in our heads.

I feel the same is probably true for most statistical NLP.  I don't know of anyone who would claim that when people parse sentences they do chart parsing (I know some people claim something more along the lines of incremental parsing actually does happen and this seems somewhat plausible to me).  Or that when people translate sentences they apply IBM Model 4 :).

On the other hand, the alternative to statistical NLP is essentially rule-based NLP.  I have an equally hard time believing that we behave simply as rule processing machines when parsing or translating, and that we efficiently store and search through millions of rules in order to do processing.  In fact, I think I have a harder time believing this than believing the model 4 story :P.

Taking a step back, it seems that there are several goals one can have with dealing with language on a computer.  One can be trying to carry out tasks that have to do with language, which I typically refer to as NLP.  Alternatively, one can be trying to model how humans work with language.  I would probably call this CogNLP or something like that.  One could instead try to use computers and language data to uncover "truths" about language.  This is typically considered computational linguistics.  I don't think any of these goals is a priori better than the others, but they are very different.  My general feeling is that NLPers cannot solve all problems, CogNLPers don't really know what goes on in our minds and CLers are a long way from understanding how language functions.  Given this, I think it's usually best to confine a particular piece of work to one of the fields, since trying to solve two or three at a time is likely going to basically be impossible.

08 September 2006

Multilingual = Not Lingual at All?

There has been a trend for quite some time now toward developing algorithms and techniques to be applicable to a wide range of languages. Examples include parsing (witness the recent CoNLL challenge), machine translation, named entity recognition, etc. I know that in at least one or two of my own papers, I have claimed (without any experimental substantiation, of course :P) that there is no reason why the exact same system could not be run on languages other than English, provided a sufficient amount of labeled training data (and a native speaker who can deal with the annoying tokenization/normalization issues in the non-English language).

I get the feeling that a large part of the surge is blowback against older NLP systems, for which hundreds of/or thousands of human hours were put into writing language-specific grammars and rules and lexicons. The replacement idea is to spend thouse hundreds of/or thousands of hours annotating data, and then repeatedly reusing this data to solve different problems (or to try to come up with better solutions to an existing problem, despite the associated fears in doing so).

I think that, overall, this is a good trend. The problem that I see is that it is potentially limiting. In order to develop a system that could plausibly be applied to (nearly) any language, one has to resort to features that are universal across all languages. This is fine, but for the most part the only universal features we know of that are reasonably computable are things like "language is made up of words and words are sort of semanticy units on their own" (of course, this misses a lot of compounds in German and is hard to do in Chinese without spaces) and "words sometimes have prefixes and suffixes and these are syntactically useful" (oops, Arabic has infixes) and "capitalization is often a good indicator of something proper-noun-like" (except for German where many common nouns are capitalized or Japanese where there isn't case marking). These are sometimes compounded "adjacent words carry semantic meaning." But all in all, these features are relatively weak from the perspective of "language understanding."

This distinction seems analogous to the "domain independent" versus "domain specific" one that we've also seen. If you are willing to limit yourself to a specific domain (eg., counter-terrorism), you can probably do a pretty good job doing reasonably deep understanding. On the other hand, if you want to work at the other end---applicable across all domains---there's little you can do because you're better off going for shallow with complete coverage rather than deep but sparse. Where I think that the domain specific people have it right is that they actually do take advantage of being in a specific domain. I know that when I work on a problem that's language specific (eg., summarization or coreference), I've only seldom taken advantage of the fact that the language is English. Sure, for summarization I've occasionally made use of an English parser and for coreference I've made use of mined data that's specific to English, but overall, I treat it as pretty much "any old language." This would probably be fine if I then ran my system on Arabic and Chinese and Portuguese and showed that it worked. But I don't. This seems to tell me that I'm missing something: that I have not been clear about my goal. Maybe I should take a hint from the domain specific people and decide which side of the language independent camp I want to be on.

(The one counterargument that I will use to save face is that applying to other languages is often a lot of relatively needless work...you often have a pretty good idea of what's going to happen and I'd be surprised if people have strongly believed they've built something that's reasonably language independent and it turns out not to be.)

26 August 2006

NLG on NPR

I heard a story on NPR while driving today that referenced a recent Financial Times story about the use of natural language generation technology for producing financial stories. The financial company, Thomson, is now generating some of its stories using NLG. The NLG technology, based on the little bit of information I gathered from the NPR story and reading the associated article, seems to be straightforward template filling. This is not terribly surprising, given the stagnant nature of financial articles (grab any such article from WSJ and you see things like "The DOW went up ____ percent (___ points) because of ___." Most of this can be easily gathered from databases.

One might suspect that the final "___" (the reason) would be hard to come up with automatically, but apparently this is highly standardized as well. According to the NPR story following the one on the NLG technology, the number of reasons that fill that blank is nearly closed class ("profits" or "middle east" or "terrorism" or "bin Laden videotape" or "strong trading" or ...). And, based on some hear-say evidence, the reasons are largely bogus anyway (this is not surprising: summarizing one billion trades as being driven by a bin Laden videotape is highly suspect).

An amusing comment was made when the journalist at NPR asked the Thomson rep if his job was in danger. The response was that it was, only if he believed that his abilities topped out at filling out templates.

Anyway, if anyone knows more about the technology, and if there's anything interesting in it, it would be fun to know. Beyond that, I just thought it was fun that NLP technology made a spot on NPR.

25 August 2006

Doing Named Entity Recognition? Don't optimize for F1

(Guest post by Chris Manning. Thanks Chris!)

Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition. Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models. It's not just that it's a type 4 loss (a simple, intuition-driven measure like accuracy): In most cases such measures are reasonable enough for what they are, but using F1 for NER has an under-appreciated dysfunctional character. You wouldn't want to optimize for it!

This post explains what I was really thinking about when I made the comment that Hal referred to previously (fortunately, I didn't try to say all this after the talk!). I agree with Hal: the paper was a very nice piece of work technically. I just think that the authors, Jun Suzuki et al., chose a bad peak to climb.

Everyone is familiar with the F1 measure for simple classification decisions. You draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall. But now think about Named Entity Recognition. You're chugging through text, and every now-and-again there is an entity, which your system recognizes or doesn't or fantasizes. I will use the notation word/GOLD/GUESS throughout, with O denoting the special background class of not-an-entity. So there are stretches of plain text (drove/O/O along/O/O a/O/O narrow/O/O road/O/O). These are the non-coding regions of NER. Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/O Palo/LOC/LOCAlto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).

Things look good up until here: those events map naturally on to the false negatives (fn), true positives (tp), false negatives (fp), and false positives (fp) of the simple classification case. The problem is that there are other events that can happen. A system can notice that there is an entity but give it the wrong label (I/O/O live/O/O in/O/O Palo/LOC/ORG Alto/LOC/ORG ./O/O). A system can notice that there is
an entity but get its boundaries wrong (Unless/O/PERS Karl/PERS/PERS Smith/PERS/PERS resigns/O/O). Or it can make both mistakes at once (Unless/O/ORG Karl/PERS/ORG Smith/PERS/ORG resigns/O/O). I'll call these events a labeling error (le), a boundary error (be), and a label-boundary error (lbe).

I started thinking along these lines just as an intuitive, natural way to characterize happenings in NER output, where entities are sparse occurrences in stretches of background text. But you can make it formal (I wrote a Perl script!). Moving along the sequence, the subsequence boundaries are: (i) at start and end of document, (ii) anywhere there is a change to or from a word/O/O token from or to a token where either guess or gold is not O, and (iii) anywhere that both systems change their class assignment simultaneously, regardless of whether they agree. If you chop into subsequences like that, each can be assigned to one of the above seven classes.

Now, the thing to notice is that for the first 4 event types, you are either correct or you get 1 demerit, assessed to either precision or recall. In the simple classification case, that's the end of the story and the F1 measure is sensible. But when doing precision and recall over subsequences, there are these other three event types. Each of them is assessed a minimum of 2 demerits, with both precision and recall being hit. Therefore, it is fairly clear that optimizing for F1 in this context will encourage a system to do the following: if I'm moderately uncertain of either the class label or the boundaries of the entity, because a mistake would cost me a minimum of 2 demerits, I'm better off proposing no entity, which will cost me only 1 demerit.

(Two notes:

(i) As I've defined events, the possible demerits for an event in the last three classes is unbounded, though in practice 2 is the most common case. For example, this lbe event would be assessed 4 demerits (3 to precision, and 1 to recall): Smith/ORG/PERS and/ORG/O Newcomb/ORG/PERS and/ORG/O Co./ORG/ORG.

(ii) Despite my title, the problem here isn't with the F measure per se, as Bob Moore emphasized to me at a coffee break during ACL 2006 (thanks!). The problem would occur with any measure that combines precision and recall and which is increasing in both arguments, such as the simple arithmetic mean of precision and recall.)

Observe that this behavior is the opposite of the way things were meant to work: people adopted F1 in IR rather than using accuracy because accuracy gives high scores to a system that returns no documents, which obviously isn't useful. But, here, optimizing for F1 is encouraging a system to not mark entities.

Now let's look at some data. I used this event classification system on the output of my NER system on the CoNLL 2003 shared task English testa data. Here is how many events of each type there were:

tn 5583
tp 4792
fn 118
fp 120
le 472
be 102
lbe 75

Note in particular that over 2/3 of the errors are in those 3 extra categories that are multiply penalized. The ratios of classes vary with the task. For example, in biological NER, you tend to get many more boundary errors. But in my experience it is always the case that lots of the errors are in the last 3 classes.

Moreover, some of the errors in the le and be classes are not that bad, and sometimes even reflect subtle judgement calls and human annotator inconsistency in the gold standard. For instance, in the GENIA data you can find both regulation/O of/O human/DNA interleukin-2/DNA gene/DNA expression and transduction/O to/O the/O human/O IL-2/DNA gene/DNA, where it is unclear whether to include human in the name of the gene. Or in a newswire phrase like the Leeds stadium, it's not always very clear whether Leeds should be tagged ORG as a reference to the football team or LOC as a reference to the city. In almost any imaginable task, you would prefer systems that made these errors to ones that missed such entities entirely. In other words, the F1 measure is punishing more severely mistakes that should be punished less according to reasonable intuitions of task utility.

Has this been noticed before? I think so. The ACE program has a system for giving partial credit. But most ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants.... Do these observations undermine the last decade of work in NER? I don't think so. It turns out that there are lots of measures that are pretty okay providing you do not specifically optimize for them, but are dysfunctional if you do. A well-known example is traditional readability measures.

p.s. As I finish writing this guest post, it's occurred to me that I think this is the first nlpers post with some actual natural language examples in it. If you're reading this post, I guess that at least shows that such content isn't actively filtered out!

24 August 2006

Machine learning has failed[?.]

When I was visiting Singapore, Lan Man gave a talk at NUS right before mine (for extra fun, try to find her web site without using my link!). The work she presented is, for the purposes of this discussion, essentially a feature weighting technique for document classification.

More specifically, let's consider the standard bag of words model for text classification. Here, we have a feature set that is exactly the size of the vocabulary. The easiest thing to do is to use binary features (word present or not), but other things seem to work better. For instance, the number of times the word appears (term frequency, or tf) often works better. This is not so surprising. The problem with just using tf is that words like "the" appear a lot and they're somehow less "important." So instead, what we often do is tf*idf, where idf is the "inverse document frequency." There are a few variants of idf (that use logs, etc.) but the basic idea is to count how in many documents each word appears at least once. Then, we divide the number of documents by this count. So the idf value is always at least one, and is one only for words that appear in all documents (eg., "the"). What Lan Man was proposing was essentially a different value to use instead of idf (I don't recall the details, but it was similarly an easy-to-compute function of the corpus). The result is essentially that binary does worst, tf does next best, followed by tf*idf and then tf*her_proposal worked best (for most data sets).

The importance of using tf*idf arose in IR, where people started computing cosine similarities between word vectors. Here, downweighting things like "the" was of utmost importance, due to how cosine works. But it makes much less sense in supervised learning. That is, for linear classifiers (which are the norm, and which she used for many of her experiments), it really shouldn't matter if we apply a constant scaling factor to each feature. The learned feature weight itself is what is supposed to determine the importance of a feature. If we learn weights for tf*idf, we could easily convert them into just-as-good weights for tf by multiplying each weight by the corresponding df value.

I can think of a few explanations of this phenomenon, but none is at all satisfying (eg., by changing the scaling, we may be increasing the size of the margin while reducing the maximum norm vector size, which, for most generalization bounds, would be a good thing. Another answer is an interplay with regularization -- it becomes less "expensive" to have an overall high weight -- learned weight times df -- for infrequent informative words than for irrelevant words.). The problem is that this is a problem! It means that even for applications as simple as text categorization, not only is feature engineering an important issue, but feature weighting seems also to be.

(Incidentally, I'm curious if other fields -- eg., vision -- have similar techniques to "idf." For instance, one could imagine weighting pixel values by "idf"s of how often a pixel is on in a dataset. I wonder if this helps. If not, it may be because for smallish images, the number of features is small already. I wonder if there is an algorithm design for machine learning that would get around this problem.)

18 August 2006

Change of Notation

Ed Hovy has a particular label he assigns to a large variety of work in NLP: it performs a change of notation. The canonical example is POS tagging, where we are changing from one notation (words) to another notation (POS tags). He also applies this notion to most forms of parsing (though this is a bit more like adding notation), machine translation (one notation is Arabic, the other English), etc. My impression (though it's sometimes hard to tell where Ed really stands on such things) is that solving a task with a change-of-notation technique is suboptimal. I've never fully grasped what Ed was trying to say specifically until recently (and I make no promises I didn't get it wrong, but this is my interpretation).

I think what Ed means by change of notation is that what is being done is pure symbol manipulation. That is, we have a bunch of symbols (typically words) and we just push them around and change them, but the symbols don't actually "mean" anything. There's no sense that the symbol "chair" is grounded to an actual chair. Or, as a logician might say, the manipulation is purely syntactic, not semantic. There is no model in which statements are evaluated. My interpretation is that by "notation" Ed simply means "syntax" (the mathematical notion of syntax, not the linguistic one).

On the other hand, it's hard not to say that everything we do, by definition, must be symbol manipulation. Computer science is all about playing with symbols. There may be a sense in which the word "chair" is grounded to an actual chair, but this actual chair, once inside the computer, will again be a symbol!

Despite this, I think that what one can think about is something like "how large is the connection between my model and the real world?" For most applications, the connection is probably limited to the specific form of training data. This is probably as close to the smallest connection possible. Sometimes people attempt to integrate ontological information, either handcrafted (via something like WordNet) or automatically derived. This is, in some sense, attempting to open the "program to world" pipe a bit more. And there is some work that explicitly focuses on learning grounding, but I'm not aware of many actual uses for this yet. Additionally, some applications cannot help but to be tied more closely to the real world (eg., robot navigation). But for the majority of tasks, the connection is quite small.

13 August 2006

Human in the Loop Learning

What we, as NLPers, typically do vis-a-vis machine learning is what might be called "Human in the Loop" machine learning. Why? Typically, we develop a model and set of features, train it on some training data and evaluate it on some development data. We then augment/change the model and features, retrain and retest on the dev data. Only once we are satisfied with our dev results do we run it on the test data and report scores in a paper. This is "human in the loop" because in each iteration we're using our human knowledge to, say, add some additional features that will hopefully correct for errors on the development data.

To many machine learning people, this is not ideal: the whole point of machine learning is that the human is not in the loop. One can liken the "adding more features" to something akin to adding new sensors to a robot: when it is deployed on Mars, we cannot easily send a person to duct-tape on a new sensor.

But I'm not sure that this argument really applies here. We are, after all, trying to either learn new things about language or build systems that work. For both of these goals, retwiddling models and adding features is the right thing to do. If we could have an automated system to help us with this, that would be great. But I don't think we should shy away from keeping ourselves in the loop.

Perhaps another reason why machine learning tends to shy away from HITLL is that it goes against the long-standing models of supervised learning, which typically begin by saying something like "Let D be a distribution over pairs of inputs x and outputs y and let L be a labeled training set drawn iid from D." Here, the inputs (and hence the distribution) specify exactly what information we have access to in terms of features. This model essentially takes features for granted, as simply part of the learning problem. This is quite reasonable from the ML perspective, but is somehow lacking if one wants to talk about feature engineering, etc. There may be another, more apt model out there, but everything I can think of reduces trivially to the standard supervised model, unless inappropriate assumptions are made.

04 August 2006

Future DUC Tasks

The Document Understanding Conference features a yearly summarization competition. For the past few years, the task has been query-focused summarization of clusters of (essentially entirely) news documents. There will be a pilot task next year and based on comments made during DUC 2006, it appears it will be one of the following:

  1. Multidocument, (probably) query-focused summarization of blog posts.
  2. Multidocument summarization of news, with respect to known information.

The idea in (1) is that there are several "novel" aspects one has to deal with.  First, blog posts are out of domain for most parsers, etc., which means we'll get noisy input but not as noisy as speech.  Second, although the blog posts (the blogs would be from the TREC blog collection) will essentially all focus on news topics (saldy, NLPers is not in the corpus), they are almost certainly more emotionally fueled than vanilla news.  The identification of sentiment and opinion, which are both in vogue these days, will potentially become more useful.

The idea in (2) is that in most real world situations, the user who desires the summary has some background information on the topic.  The idea is that the summarization engine would be handed a collection of 5-10 documents that the user has presumably read, then 5-10 new documents to be summarized.  The novel aspect of this task is, essentially, detecting novelty.

Personally, I think both are potentially interesting, though not without their drawbacks.  The biggest potential problem I see with the blogs idea is that I think we're reentering the phase of not being able to achieve any sort of human agreement without fairly strict guidelines.  It's unclear if, say, two viewpoints are expressed, how a summary should reflect these.  The biggest problem I see with idea (2) is that it is very reminiscent of some TREC-style tasks, like TDT, and I'm not sure that doing anything more than essentially doing normal query-focused summarization with an MMR-style term to account for "known information."  That's not to say these aren't worth exploring -- I think both are quite interesting -- but, as always, we should be careful.

28 July 2006

Loss versus Conditional Probability

There was a talk in the session I chaired at ACL about directly optimizing CRFs to produce low F-scores for problems like NE tagging and chunking. The technique is fairly clever and is based on the observation that you can use very similar dynamic programming techniques to do max F-score as to do max log-probability in CRFs.

The details are not particularly important, but during the question phase, Chris Manning asked the following question: Given that F-score is not really motivated (i.e., is a type 4 loss), should we really be trying to optimize it? Conditional probability seems like a completely reasonable thing to want to maximize, given that we don't know how the tags will be used down the pipeline. (It seems Chris is also somewhat biased by a paper he subsequently had at EMNLP talking about sampling in pipelines.)

I think Chris' point is well taken. Absent any other information, conditional probably seems like a quite plausible thing to want to optimize, since given the true conditional probabilities, we can plug in any loss function at test time and do minimum Bayes risk (in theory, at least).

On the other hand, there is an interesting subtlety here. Conditional probability of what? The standard CRF optimization tries to maximize conditional probability of the entire sequence. The "alternative objective" of Kakade, Teh and Roweis optimizes the sub of the conditional probabilities of each label. These are two quite different criterea, and which one should we choose? In fact, neither really seems appropriate. Conditional probability of the sequence doesn't make sense because it would rather improve a bad label from probability 0.01 to 0.1 than improve a bad label from 0.4 to 0.6 and thus get it right. But summed conditional probability of labels doesn't make sense in NE tagging tasks because always assigning probability 0.9 to "not an entity" will do quite well. This is essentially the "accuracy versus f-score" problem, where, when few elements are actually "on," accuracy is a pretty terrible metric.

If we take Chris' advice and desire a conditional probability, it seems what we really want is direct conditional probability over the chunks! But how do we formulate this and how do we optimize it? My impression is that a direct modification of the paper Chris was asking about would actually enable us to do exactly that. So, while the authors of this paper were focusing on optimizing F-score, I think they've also given us a way to optimize conditional chunk probabilities (actually this should be easier than F-score because there are fewer forward/backward dependencies), similar to what Kakade et al. did for conditional label probabilities.

25 July 2006

Preprocessing vs. Model

Back from Sydney now, and despite being a bit jetlagged, I thought I'd begin with an COLING/ACL-related post. There was a paper presented by Ben Wellington that discusses whether current syntax-driven models for translation can capture the sorts of reorderings/alignments observed in hand-aligned data. I'm not looking at the paper, but what I remember from the talk is that something like 5-10% of sentences are missed assuming an ITG-like model, without gapping.

At the end of the talk, Dekai Wu commented that they are aware of this for ITG and that an analysis of where it happens is essentially is "weird" extrapositions that occur in English, such as wh-movement, topical preposing, adverbial movements, etc. Apparently such things occur in roughly 5-10% of sentences (from whatever corpora were being used). This talk was near the end of the conference, so my buffer was near full, but my recollection is that the suggestion (implied or explicit) was that English could (should?) be "repaired" by unextraposing such examples, at which time a model like ITG could be directly applied to nearly all the data.

To me this seems like a strange suggestion. ITG is a very simple, very easy to understand model of translation; more natural, in many regards, than, say, the IBM models. (Okay, maybe in all regards.) One nice thing about the model is that it is largely un-hackish. Given this, the last thing I'd really want to do is to fix English by applying some preprocessing step to it, so that the output could then be fed into this nice pretty model.

There's an obvious generalization to this question, hinted at in the title of this post. It seems that in many problems (translation with ITG being but one example), we can either choose a theoretically clean but maybe not completely appropriate model plus some preprocessing, or a somewhat messier but maybe 99.9% recall model with no preprocessing. I suppose one could argue that this is the long sought after integration of rule-based and statistically-based NLP, but this seems uncompelling, because it's not really an integration. I suppose that I'd almost always prefer the latter sort of solution, but that's not to say the first isn't also useful. For instance, in the ITG case, after identifying cases where the ITG assumption fails, we could try to minimally enhance the ITG model to be able to capture these corner cases.

15 July 2006

Where are Interesting Learning Problems in NLP?

I just spent a few days visiting Yee Whye and NUS (photos to prove it!). YW and I talked about many things, but one that stood out as a ripper is attempting to answer the question: as a machine learning person (i.e., YW), what problems are there in NLP that are likely to be amenable to interesting machine learning techniques (i.e., Bayesian methods, or non-parametric methods, etc.). This was a question we tried to answer at the workshop last year, but I don't think we reached a conclusion.

At first, we talked about some potential areas, mostly focusing around problems for which one really needs to perform some sort of inference at a global scale, rather than just locally.  I think that this answer is something of a pearler, but not the one I want to dwell on.

Another potential partial answer arose, which I think bears consideration: it will not be on any problem that is popular right now.  Why?  Well, what we as NLPers usually do these days is use simple (often hueristic) techniques to solve problems.  And we're doing a sick job at it, for the well studied tasks (translation, factoid QA, ad hoc search, parsing, tagging, etc.).  The hunch is that one of the reasons such problems are so popular these days is because such techniques work so bloody well.  Given this, you'd have to be a flamin' galah to try to apply something really fancy to solve one of these tasks.

This answer isn't incompatible with the original answer (globalness).  After all, most current techniques only use local information.  There is a push toward "joint inference" problems and reducing our use of pipelining, but this tends to be at a fairly weak level.

This is not to say that Bayesian techniques (or, fancy machine learning more generally) is not applicable to problems with only local information, but there seems to be little need to move toward integrating in large amounts of global uncertainty.  Of course, you may disagree and if you do, no wuckers.

p.s., I'm in Australia for ACL, so I'm trying to practice my Aussie English.

13 July 2006

Conference Formats

There are two conference formats I'm familiar with. One is typified by ACL and ICML; the other by NIPS. The ACL format is a roughly 3 day schedule, with three parallel sessions, which gives time for somewhere around 80 presentations, each 20 mins + 5 for questions. ACL has smallish poster sessions that are typically associated with "short papers." My impression is that many people do not attend the poster sessions.

In contrast, NIPS typically accepts many more papers (120ish per year), but the vast majority are presented as posters. There is only one track at the conference, with several invited talks, a few paper presentations (maybe 15 total) that last 20 minutes. They also have "spotlight presentations", which are 2 slide, 2 minute talks designed to draw attention to particularly interesting posters. The poster sessions run from (roughly) 6 until 10 every night and are attended by everyone.

My feeling is that both formats have advantages and disadvantages.  I personally like the NIPS format, essentially because I feel that at the end of the conference, I have seen more of what is there (because I spend so much time at posters).  This also means that I'm completely exhausted at the end of NIPS.  The major disadvantage to the NIPS format seems to be that I'm used to using post-conference time as dinner socialization time, and this seems to happen somewhat less at NIPS (this is confirmed by a friend who is more in the "in crowd" at NIPS than I am).  I think that it might be a little intimidating for junior researchers to go up to a "famous" poster, but on the other hand, this forces you to immediately get to know everyone.  The "lots of posters" format also means that the decision about the number of papers to accept at NIPS is essentially limited by physical space, rather than length of the conference.  Ken Church seems to think we should be accepting more papers, anyway.

Are there other major disadvantages to having a single-track conference, with the majority of papers presented as posters?  I don't expect to see a shift in how our conferences are held, but I'd like to understand all the pros and cons.

Visa Problems entering USA?

Some people are trying to do something about it (contacting congressmen, etc.). If you've had problems entering the US for conferences, you should contact them.

07 July 2006

What is Natural Language Understanding?

I've read lots of papers and seen lots of talks that justify a task as being useful to doing "natural language understanding." The problem I have here is that I really have no idea what NLU actually is. I think the standard line is that NLU is the process of transforming natural language text into "some representation" that is "easy" for a computer to "manipulate." Of course, the quoted words are so vacuous in meaning that I can easily imagine reasonable definitions for them that make this task either trivial or essentially impossible.

What I find unfortunate here is that, without trying hard to pin down definitions, this seems like a completely reasonable goal for NLP technology; in fact, my impression is that up until 10 or 20 years ago, this actually was one of the major goals of the field. According to the anthology, roughly 25% of the papers in 1979 contained the terms "NLU" or "language understanding", compared to 20% in the 1980s, 10% in the 90s and 7% in the 2000s. One possible explanation of the dwindling of this topic is that publication has become increasingly driven by experimental results, and if one cannot pin down a definition of a task, one cannot reliable compare results across multiple papers.

The recent push on textual entailment bears some semblance to NLU, but is simultaneously better defined and more restricted (though I admit to having heard some grumbling that some more work needs to be done to get textual entailment to be a very well defined task). There also continues to be a modicum of work on database filling and natural langauge database queries, though certainly not at the same rate as before.

03 July 2006

My Thesis Path

(Minor note: the latest version of the EDT section didn't get folded in properly to the version of the thesis that went up yesterday; this is fixed now.)

Many people have asked me how I settled on a thesis topic, I think largely because they are trying to find their own paths. My story is (at least to me) a bit odd and circuitous, but I'll tell it anyway.

When I came to USC, I knew I wanted to do NLP and, more specifically, I wanted to do summarization. Working with Daniel was a natural choice. I was particularly interested in coming up with good models for getting around the sentence extraction paradigm that dominates the field. My first work was on extending Kevin and Daniel's sentence compression work to the document level by using discourse information. This worked reasonably well. My next milestone was to try to leverage alignment information of document/abstract pairs in order to learn complex transformations. This led to the segment HMM model for doc/abs alignment, that met with reasonable success (considering how darned hard this problem is).

At that point, it became clear that trying to do a full abstractive system just wasn't going to work. So I started looking at interesting subproblems, for instance sentence fusion. Unfortunately, humans cannot reliably do this task, so I threw that out, along with a few other ideas. Around the same time, I began noticing that to have any sort of reasonably interesting model that did more than sentence extraction, I really was going to need to run a coreference resolution system. So I went to the web (this was back in 2003) and found one to download.

Except I didn't because there weren't any publicly available.

So I went to build my own. I wanted to do is "right" in the sense that I wanted a principled, joint system that could be trained and run efficiently. I read a lot of literature and didn't find anything. So then I read a lot of machine learning literature (I was beginning to get into ML fairly hardcore at this point) to find a framework that I could apply. I couldn't find anything.

So I decided to build my own thing and came up with LaSO, which is essentially a formalization and tweak on Mike Collins and Brian Roark's incremental perceptron. My thesis proposal used LaSO to solve the EDT problem, and I presented LaSO at ICML 2005. Also at ICML 2005 I met John Langford, having previously noticed that what I was doing with LaSO looked something like some recent work he had done on reinforcement learning. We had a long dinner and conversation and, after a few visits to Chicago, many emails, and lots of phone calls, came up with Searn. With both LaSO and Searn, I always had in the back of my mind that I didn't want to make any assumptions that would render it inapplicable to MT, since everyone else at ISI only does MT :).

At about the same time, I started thinking about how to do a better job of sentence compression and came up with the vine-growth model that eventually made it into my thesis. This was really the first point that I started thinking about summarization again, and, in the evolution of LaSO to Searn, it now became possible to do learning in this model.

So, in a sense, I had come full circle. I started with summarization, lost hope and interest, began to get more interested in EDT and machine learning, and then finally returned to summarization, this time with a new hammer.

02 July 2006

Thesis is Done!

I'm happy to announce that my thesis, Practical Structured Learning Techniques for Natural Language Processing, is now done (which is good, because it needs to be handed in by 11:59pm on July 5th). Many thanks to everyone who supported me through this (long) process; I really appreciate it.

27 June 2006

Beating an N-Gram

Our entry into the NIST MT eval this year has a recapitalization component, currently being done by a large language model (500mb, gzipped) together with SRILM's "disambig" tool. I was recently conscripted to try to improve this. After about a week's worth of work, using a bunch of tools and rules and a bunch of other stuff (though not 100% tuned), I've eeked out a 0.25 increase in BLEU scores on the 2003 and 2004 test data for three of our MT systems (phrase-based, syntax-based and Hiero). This could easily be upped to maybe 0.5 with a few more hours of work (there are tokenization differences between my training and test data).

The story that it's hard to beat an n-gram language model is fairly ubiquitous. In fact, my most useful features for classification are based on the n-best outputs of disambig using the large LM. There are older results that suggest that once you have enough data, using stupid techniques is sufficient. This seems to be more true for NLP than for many other areas I know about, though perhaps this is because there exist NLP problems for which it is even possible to get a "sufficient" amount of data.

While this story is true for English, I'm not sure that it would hold for other languages. My guess is that n-gram models work significantly worse in languages with more free word order and (though this usually comes as a package deal) stronger morpology. As a counter-argument, though, some of my friends who have contact with people who do commercial speech recognition in languages other than English do actually use vanilla n-gram models in those other languages. I don't know whether this is for lack of an alternative or just because they remain so good even outside of English.

I don't know the secret sauce to beating an n-gram. They appear to work so well because language (where, by "language" I mean "English") is really not that ambiguous, in a Shannon-experiment sense. Moreover, raw English words seem to really be just about right when it comes to information content. Sure, we can get some mileage by looking at suffixes or bigrams, but, comparing to, say, German (where one word can correspond to 2-3 English words), English really seems to strike the right balance of amount of information in a word to sparsity (where "right" = "right for NLP"). I'm sure other languages fall fairly close by and I recall seeing a study comparing Shannon-style tests in multiple languages (anyone know a pointer?), but, if pressed, this is my guess as to why n-grams work so well. To beat them, it seems we are almost forced to look at problems with less data, or problems which are significantly noisier.

19 June 2006

Having an Impact

Having just completed my thesis, I've been thinking a lot about what directions I should persue, post-graduation. While I'll probably stick with some of the stuff I've been working on in the past, this is also an opportunity to diversify. This opens the question: what are promising areas to work on? One important qualification (though by no means the only one) is impact: I want to work on things that are important, either in the sense that they become tools used by others for a diverse set of problems, or change the way people think about or look at a particular area.

Rather than applying my own model for what work has been important, one can look back at what other people think have been the most influential papers in NLP, as well as what papers have won best paper awards recently (with the caveat that the latter is probably less indicative). The story seems to be that the most impactful papers, at least as measured by this criteria, are either MT papers, parsing papers or machine learning papers. This makes some historical sense: MT is sort of the quintessential NLP application, while parsing is essentially the language-for-language's sake task.

This analysis, unfortunately, puts me in a sticky situation. I've already been doing machine learning stuff, which I hope will turn out to be useful for other people. I'd ideally like to get back into really languagey stuff in the future, and this leaves parsing at MT. I have little interest in working on parsing (nothing against parsing: it's just not my thing, and there are lots of people who can probably do it better than me), so now I'm stuct with MT. MT is, without a doubt, an interesting topic. But the field is so crowded right now, it seems difficult to find a sufficiently large and interesting niche to carve out. I personally think out-of-English translation is promising, especially conjoined with morphological research (pretty much required when going into a highly inflected language), but there are a large number of barriers to entry to this problem (evaluation, my American-esque inability to speak any foreign language sufficiently reliably to do error analysis (except Japanese, but there's so little data there), etc.).

But regardless of my personal interests and limitations, the fact that the field is so dominated by two problems seems unfortunate. There are 44 bins that ACL allowed you to select for your topic this year and while many of them largely overlap, there are many that do not. Why, for instance, does research in discourse, inference, phonology, question answering, semantics, and word sense disambiguation have comparatively little weight in the field as a whole? Why are there six sessions on MT (plus one multilingual), six on parsing (plus two on grammars), but only two on semantics and discourse and one on language modeling at ACL this year? And beyond that, there seem to be lots of interesting problems that no one is working on (or at least that no one publishes at ACL). I do not doubt that these are interesting problems, but it seems that there is a severe skew that may not be healthy to the community.