27 June 2006

Beating an N-Gram

Our entry into the NIST MT eval this year has a recapitalization component, currently handled by a large language model (500MB, gzipped) together with SRILM's "disambig" tool. I was recently conscripted to try to improve this. After about a week's worth of work, using a mix of classification features, rules and other machinery (though not 100% tuned), I've eked out a 0.25 increase in BLEU score on the 2003 and 2004 test data for three of our MT systems (phrase-based, syntax-based and Hiero). This could easily be upped to maybe 0.5 with a few more hours of work (there are tokenization differences between my training and test data).
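
For concreteness, the basic shape of n-gram-based recapitalization is sketched below. This is a minimal toy, not our actual pipeline: the bigram counts are made up, and the variant generation and scoring merely stand in for what disambig does with the real (much larger) LM.

    # Minimal sketch of n-gram-based recapitalization (truecasing).
    # A toy cased-bigram table stands in for the large gzipped LM that
    # SRILM's disambig tool consults; the counts below are invented.
    from collections import defaultdict
    from math import log

    BIGRAMS = defaultdict(lambda: 1, {
        ("<s>", "The"): 50, ("The", "president"): 20,
        ("president", "said"): 30, ("<s>", "the"): 5,
    })

    def variants(token):
        """Possible case forms for a lowercased MT-output token."""
        return {token, token.capitalize(), token.upper()}

    def recase(tokens):
        """Viterbi search over case variants under the bigram score."""
        best = {"<s>": (0.0, [])}          # state = previous cased form
        for tok in tokens:
            new_best = {}
            for prev, (score, hist) in best.items():
                for v in variants(tok):
                    s = score + log(BIGRAMS[(prev, v)])
                    if v not in new_best or s > new_best[v][0]:
                        new_best[v] = (s, hist + [v])
            best = new_best
        return max(best.values())[1]

    print(recase("the president said".split()))  # ['The', 'president', 'said']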

The story that it's hard to beat an n-gram language model is fairly ubiquitous. In fact, my most useful features for classification are based on the n-best outputs of disambig using the large LM. There are older results that suggest that once you have enough data, using stupid techniques is sufficient. This seems to be more true for NLP than for many other areas I know about, though perhaps this is because there exist NLP problems for which it is even possible to get a "sufficient" amount of data.
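
To make "features based on the n-best outputs" slightly more concrete, here is the flavor of thing I have in mind. This is purely illustrative: the n-best format and the particular feature names are hypothetical, not the ones in our system.

    from collections import Counter

    def nbest_case_features(position, nbest):
        """Features for one token position from a disambig-style n-best list.

        nbest: list of (cased_token_sequence, model_score) pairs, best first.
        """
        forms = Counter(hyp[position] for hyp, _ in nbest)
        top_form = nbest[0][0][position]
        total = sum(forms.values())
        return {
            "top_is_capitalized": top_form[:1].isupper(),
            "frac_capitalized": sum(c for f, c in forms.items()
                                    if f[:1].isupper()) / total,
            "num_distinct_forms": len(forms),
        }

    # Example: two hypotheses over a three-word sentence.
    nbest = [(["The", "white", "house"], -12.3),
             (["The", "White", "House"], -12.9)]
    print(nbest_case_features(2, nbest))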

While this story is true for English, I'm not sure that it would hold for other languages. My guess is that n-gram models work significantly worse in languages with freer word order and (though this usually comes as a package deal) richer morphology. As a counter-argument, though, some friends of mine who have contact with people doing commercial speech recognition in languages other than English report that vanilla n-gram models are what get used there too. I don't know whether this is for lack of an alternative or just because n-grams remain so good even outside of English.

I don't know the secret sauce to beating an n-gram. They appear to work so well because language (where, by "language", I mean "English") is really not that ambiguous, in a Shannon-experiment sense. Moreover, raw English words seem to carry just about the right amount of information. Sure, we can get some mileage by looking at suffixes or bigrams, but, compared to, say, German (where one word can correspond to 2-3 English words), English really seems to strike the right balance between the amount of information in a word and sparsity (where "right" = "right for NLP"). I'm sure other languages fall fairly close by, and I recall seeing a study comparing Shannon-style tests in multiple languages (anyone know a pointer?), but, if pressed, this is my guess as to why n-grams work so well. To beat them, it seems we are almost forced to look at problems with less data, or problems which are significantly noisier.
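
For anyone who hasn't seen the Shannon-experiment framing: the quantity at issue is roughly the average number of bits needed to guess the next word. A back-of-the-envelope version (unigram-only, toy corpus, no held-out data) looks like the sketch below; a real estimate would of course use a proper LM and a separate test set.

    from collections import Counter
    from math import log2

    corpus = "the cat sat on the mat and the dog sat on the log".split()
    counts = Counter(corpus)
    total = len(corpus)

    # Average bits per word of the corpus under its own unigram
    # distribution -- a very crude stand-in for a Shannon game.
    bits = -sum(c * log2(c / total) for c in counts.values()) / total
    print(f"{bits:.2f} bits per word")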

19 June 2006

Having an Impact

Having just completed my thesis, I've been thinking a lot about what directions I should pursue post-graduation. While I'll probably stick with some of the stuff I've been working on in the past, this is also an opportunity to diversify. This raises the question: what are promising areas to work on? One important qualification (though by no means the only one) is impact: I want to work on things that are important, either in the sense that they become tools used by others for a diverse set of problems, or in the sense that they change the way people think about or look at a particular area.

Rather than applying my own model of what work has been important, one can look back at what other people think have been the most influential papers in NLP, as well as at what papers have won best paper awards recently (with the caveat that the latter is probably less indicative). The story seems to be that the most impactful papers, at least by these criteria, are either MT papers, parsing papers or machine learning papers. This makes some historical sense: MT is sort of the quintessential NLP application, while parsing is essentially the language-for-language's-sake task.

This analysis, unfortunately, puts me in a sticky situation. I've already been doing machine learning stuff, which I hope will turn out to be useful for other people. I'd ideally like to get back into really languagey stuff in the future, and this leaves parsing and MT. I have little interest in working on parsing (nothing against parsing: it's just not my thing, and there are lots of people who can probably do it better than me), so now I'm stuck with MT. MT is, without a doubt, an interesting topic. But the field is so crowded right now that it seems difficult to find a sufficiently large and interesting niche to carve out. I personally think out-of-English translation is promising, especially conjoined with morphological research (pretty much required when translating into a highly inflected language), but there are a large number of barriers to entry to this problem (evaluation; my American-esque inability to speak any foreign language reliably enough to do error analysis (except Japanese, but there's so little data there); etc.).

But regardless of my personal interests and limitations, the fact that the field is so dominated by two problems seems unfortunate. There are 44 bins that ACL allowed you to select for your topic this year, and while many of them largely overlap, many do not. Why, for instance, does research in discourse, inference, phonology, question answering, semantics, and word sense disambiguation have comparatively little weight in the field as a whole? Why are there six sessions on MT (plus one multilingual) and six on parsing (plus two on grammars), but only two on semantics and discourse and one on language modeling at ACL this year? And beyond that, there seem to be lots of interesting problems that no one is working on (or at least that no one publishes at ACL). I do not doubt that MT and parsing are interesting problems, but there is a severe skew here that may not be healthy for the community.

15 June 2006

Approximating the Problem or the Solution

A while back I came across a paper that (in a completely separate context) argues for approximating problems in lieu of approximating solutions. This idea has a nice analogue in NLP: should we (A) choose a simple model for which we can do exact inference, or (B) choose a complex model that is closer to the truth but for which exact inference is intractable? (A) is approximating the problem, while (B) is approximating the solution.

It seems that all signs point to (B). In almost every interesting case I know of, it helps (or at the very least doesn't hurt) to move to more complex, more expressive models, even if this renders learning or search intractable. This story is well known in word alignment (e.g., GIZA) and MT (e.g., model 4 decoding), but also has simpler examples in parsing (cf. McDonald), sequence labeling (cf. Sutton), relation extraction (cf. Culotta), as well as pretty much any area in which "joint inference" has been shown to be helpful.

One sobering example here is the story in word alignment, where one cannot go and directly use, say, model 4 for computing alignments, but must first follow a strict recipe: run a few iterations of model 1, followed by a few of model 2, followed by some HMM iterations, then model 4 (skipping model 3 altogether). The problem is that learning model 4 parameters directly falls into local minima too easily, so one must initialize intelligently, using the output of each previous stage. My guess is that this result will continue to hold for training (though perhaps not predicting) with more and more complex models. This is unfortunate, and there may be ways of coming up with learning algorithms that automatically initialize themselves by some mechanism for simplifying their own structure (this seems like a fun open question, somewhat related to recent work by Smith).
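
In code, the recipe is roughly the staged warm-starting loop below. This is a schematic sketch only: the TinyModel class, its em_step, and the iteration counts are invented placeholders, not GIZA++'s actual models or settings.

    # Schematic of the staged training recipe for IBM-style alignment.
    # TinyModel is a dummy stand-in; the real stages are GIZA++'s
    # Model 1, Model 2, the HMM model, and Model 4 (Model 3 skipped).

    class TinyModel:
        def __init__(self, name, init=None):
            self.name = name
            # Warm-start from the previous stage's parameters if given;
            # Model 1 starts from scratch (uniform translation tables).
            self.params = dict(init) if init else {}

        def em_step(self, bitext):
            # Placeholder for one expectation-maximization pass: here we
            # just accumulate co-occurrence counts.
            for src, tgt in bitext:
                for s in src.split():
                    for t in tgt.split():
                        self.params[(s, t)] = self.params.get((s, t), 0) + 1

    SCHEDULE = [("model1", 5), ("model2", 5), ("hmm", 5), ("model4", 3)]

    def train_aligner(bitext):
        params = None
        for name, iterations in SCHEDULE:
            model = TinyModel(name, init=params)
            for _ in range(iterations):
                model.em_step(bitext)
            params = model.params      # initialize the next, richer model
        return params

    bitext = [("das haus", "the house"), ("das buch", "the book")]
    print(len(train_aligner(bitext)))  # number of (src, tgt) pairs seen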

Aside from a strong suggestion as to how to design models and inference procedures (i.e., to ignore tractability in favor of expressiveness), there may be something interesting to say here about human language processing. If it is indeed true that, for the most part, we can computationally move to more complex models, forgoing tractable search, then it is not implausible to imagine that humans do the same thing. My knowledge in this area is sparse, but my general understanding is that various models of human language processing are disfavored because they would be too computationally difficult. But if, as in old-school AI, we believe that humans just have a really good innate search algorithm, then this observation might lead us to believe that we ourselves have very complex, intractable "models" in our heads.

13 June 2006

Incremental Improvements

I usually think of incremental improvements as the result of taking an existing result X and twiddling X a little to yield Y. This can happen in system development (adding syntax to my MT system helped), theory (Y is a straightforward corollary of X), etc. Incremental improvements are often uninteresting, unless your research area happens to be exactly that of X (i.e., if I work on MT, maybe I will try adding syntax now). But in a broader sense, everything is an incremental improvement. It is vanishingly unlikely that a paper will come along that is not incremental in some sense.

I think what often ends up making a paper more or less successful is how many incremental improvements it contains. For instance, the original CRF paper was clearly successful. Yet, mathematically, CRFs are only incrementally different from MEMMs. Experimentally, they only perform slightly better. Algorithmically, the optimization is only slightly more clever than inference in HMMs. Theoretically, they can solve only a slightly broader family of problems. But importantly, all of these advantages come in a bundle. The CRF is a success largely because a single, relatively simple formalism made (incremental) improvements in at least four areas. I think you can make similar arguments about other "must reads" in NLP, such as Collins' original parsing paper.
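
To make "only incrementally different" concrete: both models score a label sequence with the same local feature functions f_k; the only mathematical change is where the normalization sits (per position for the MEMM, once over the whole sequence for the CRF). In LaTeX:

    % MEMM: locally normalized, one softmax per position
    P_{\mathrm{MEMM}}(y \mid x) = \prod_{t=1}^{T}
      \frac{\exp\left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)}
           {\sum_{y'} \exp\left(\sum_k \lambda_k f_k(y_{t-1}, y', x, t)\right)}

    % CRF: the same potentials, normalized once over all label sequences
    P_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)}
      \exp\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)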

If true, this might lead one to make research and accept/reject decisions on the basis of the number of areas in which improvements are made. This would certainly cut down on the number of papers published, but I also feel that many purely incremental papers do end up being useful, if only as steps along a path. For instance, MEMMs are themselves far more incremental, building on HMMs, maximum entropy models and some older work by both Brill and Roth. The important question is whether CRFs would have been discovered had MEMMs not been used and exhibited problems (e.g., the label-bias problem). Of course, such observations are easy in retrospect: I don't know how to identify the useful incremental steps without already knowing where the path leads.

(Incremental improvements also serve a second role: they are advertisements for the original technique, for those who missed it the first time around.)

09 June 2006

NAACL: The post-hoc short-list

NAACL is over, and while the majority of the conference took place in hallways for me this time around (lots of old and new friends made their way to NYC), I did get a chance to see a few papers that I enjoyed. I'm short-listing some of them here (including workshop papers), though I encourage others to list those that they liked, since one person can have at most a 33% recall.

  • Effective Self-Training for Parsing (McClosky, Charniak + Johnson). Train a parser A, run A on 2m sentences, use A's output to train a new parser B, and B does better on test data (a sketch of this basic loop appears after the list). This is a somewhat surprising story, and one that has plenty of counterexamples in the literature. These guys got it to work. The key was to use the Charniak + Johnson reranking parser as parser A, rather than just the Charniak parser. This raises the question: why does self-training work with the reranker but not without it? Unfortunately, the paper doesn't answer this question, though talking to Eugene, they are interested in ideas why.

    One obvious idea is that the epsilon difference in performance between reranking and not reranking happens to occur at a phase transition. I find this unlikely. My guess (which I've suggested and they seem interested in exploring) is the following. The unreranked parser produces systematic errors, so self-training just reinforces those errors, leading to worse performance. In contrast, the reranked parser has a high variance of errors (this is intuitively reasonable if you think about how reranking works), and so its errors come across as "noise" when you retrain on the large corpus. It's easy to test this claim: simply remove enough training data from the reranked parser so that its performance is comparable to the vanilla parser, and see if you still get self-training improvements. If so, I may be right. One could then do a more exhaustive error analysis. Other people have suggested other explanations, which I think the Brown folks will also look into. I'm very curious.

  • Prototype-driven Learning for Sequence Models (Haghighi + Klein). This received the best student paper award (congrats Aria!). The idea in this paper is that we don't want to have to label entire sequences to do normal supervised learning. Instead, for each possible label, we list a small number of prototypical words for that label. They then use a constrained MRF, together with distributional similarity features, to learn a sequence labeling model. I think this idea is very interesting, and I am in general very interested in ways of using "not quite right" data to learn to solve problems.

    My only concern with this approach is that it is quite strict to require that all occurrences of prototypical words get their associated tag. This is perhaps okay for something like POS tagging, but for more general problems, I feel that words are too ambiguous for this to work. (And, for completeness, since I asked this question at the talk, I would have liked to have seen at least one baseline model that made use of the prototypes, even in a naive way.)

  • Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies (D Smith + Eisner, SMT Workshop). This paper makes the argument that the strict synchronicity assumptions made by SCFGs, for either projection or translation, are too strong (others, like Hwa, have argued the same). They analyze the sorts of transformations required to explain real parallel data (does a parent/child pair end up as a parent/child pair after translation, or is sisterhood maintained?). They propose a translation model that can account for much more varied transformations than standard SCFGs can. It's a workshop paper, and there's no decoder yet, but talking to David afterwards, they are actively working on this and it may actually be safe to hold one's breath.

  • Computational Challenges in Parsing by Classification (Turian + Melamed, CHPJISLP Workshop). This is a history-based parsing-as-classification model, where one essentially parses in a bottom-up, sequential fashion. A lot of this paper is devoted to how to scale up training decision trees as classifiers (laziness, parallelization and sampling provide a 100-fold speedup), but something I found extremely interesting (though not shocking once you hear it) is that right-to-left search outperformed left-to-right. The obvious explanation is that English is largely right-branching, so by searching right-to-left, most decisions remain strongly local. They achieve quite impressive results, and there's some obvious way in which this algorithm could be Searnified.
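
As promised above, here is the McClosky et al. self-training loop in sketch form. The DummyParser class and its train/parse interface are hypothetical stand-ins for the Charniak (+ Johnson reranking) parser, not any real API; the point is only the shape of the procedure.

    # Sketch of parser self-training.  DummyParser is a stand-in with the
    # hypothetical interface assumed here (train on trees, parse strings).

    class DummyParser:
        def train(self, trees):
            self.grammar = list(trees)          # pretend to estimate a grammar
        def parse(self, sentence):
            return ("S", sentence)              # pretend to return a tree

    def self_train(ParserClass, treebank, unlabeled_sentences):
        parser_a = ParserClass()
        parser_a.train(treebank)                            # supervised step
        auto_trees = [parser_a.parse(s) for s in unlabeled_sentences]
        parser_b = ParserClass()
        parser_b.train(treebank + auto_trees)               # gold + self-labeled
        return parser_b

    parser = self_train(DummyParser,
                        treebank=[("S", "the cat sat")],
                        unlabeled_sentences=["the dog sat"])
    print(len(parser.grammar))  # 2: one gold tree + one self-labeled tree
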
So that's my short-short list. I'd love to hear what papers other people liked (so I know what to read!). That said, I heard many rumors of lots of really cool sounding ACL and EMNLP papers, so I'm really looking forward to Sydney!