[Guest Post by Adam Lopez... thanks, Adam! Hal's comment: you may remember that a while ago I proposed the idea of conference area chairs posting summaries of their areas; well, Adam is the first to take me up on this idea... I still think it's a good idea, so anyone else who wants to do so in the future, let me know!]
Conferences can be exhausting, and back-to-back conferences can be really exhausting, so I want to convince you to pace yourself and save some energy for EMNLP at the end of the week, because we have some really interesting MT papers. I'll focus mainly on oral presentations, because unlike poster sessions, the parallel format of the oral sessions entails a hard choice between mutually exclusive options, and part of my motivation is to help you make that choice. That being said, there are many interesting papers at the poster session, so do take a look at them!
MT is a busy research area, and we have a really diverse set of papers covering the whole spectrum of ideas: from blue sky research on novel models, formalisms, and algorithms, to the hard engineering problems of wringing higher accuracy and speed out of a mature, top-scoring NIST system. I occasionally feel that my colleagues on far reaches of either side of this spectrum are too dismissive of work on the other side; we need both if we're going to improve translation.
Outside the Box
Before giving you a guided tour through that spectrum, I want to highlight one paper that I found thought-provoking, but hard to classify. Zaidan & Callison-Burch question a basic assumption underlying most machine learning approaches to NLP: that we must optimize on an easily computable approximation to the true loss function. They ask: why not optimize for human judgement? They design a metric that uses judgements on small snippets of a target sentence (defined by a spanning nonterminal in a parse tree of the aligned source sentence) and figure how many judgements they would need to collect (using Amazon Mechanical Turk) to cover an iteration of MERT, exploiting the fact that these snippets reoccur repeatedly during optimization. How hard is this exactly? I would say, in terms of this scale of loss functions, that their metric is a 2. Yet, it turns out to be cheap and fast to compute. The paper doesn't report results of an actual optimization run, but it's in the works... hopefully you'll learn more at the conference.
Connecting Theory and Practice
A few papers combine deep theoretical insight with convincing empirical results. Hopkins & Langmead improve on cube pruning, a popular approximate search technique for structured models with non-local features (i.e. translation with an integrated language model). They move cube pruning from its ad hoc roots to a firm theoretical basis by constructing a reduction to A* search, connecting it to classical AI search literature. This informs the derivation of new heuristics for a syntax-based translation model, including an admissible heuristic to perform exact cube pruning. It's still globally approximate, but exact for the local prediction problem that cube pruning solves (i.e., what are the n-best state splits of an item, given the n-best input states from previous deductions?). Amazingly, this is only slightly slower than the inexact version and improves the accuracy of a strong baseline on a large-scale Arabic-English task.
Li & Eisner show how to compute a huge number of statistics efficiently over a combinatorially large number of hypotheses represented in a hypergraph. The statistics include expected hypothesis length, feature expectation, entropy, cross-entropy, KL divergence, Bayes risk, variance of hypothesis length, gradient of entropy and Bayes risk, covariance and Hessian matrix. It's beautifully simple: they recast the quantities of interest as semirings and run the inside (or inside-outside) algorithm. As an example application, they perform minimum risk training on a small Chinese-English task, reporting gains in accuracy. For a related paper on minimum risk techniques, see the poster by Pauls et al.
Novel Modeling and Learning Approaches
Tromble & Eisner also connect translation to theory by way of a novel model, framing reordering as an instance of the linear ordering problem: given a matrix of pairwise ordering preferences between all words in a sentence, can we find a permutation that optimizes the global score? This is NP-hard, but they give a reasonable approximation based on ITG, with some clever dynamic programming tricks to make it work. Then they show how to learn the matrix and use it to reorder test sentences prior to translation, improving over the lexicalized reordering model of Moses on German-English.
However, most of the new models at EMNLP are syntax-based. In the last few years, syntax-based modeling has focused primarily on variants of synchronous context-free grammar (SCFG). This year there's a lot of work investigating more expressive formalisms.
Two papers model translation with restricted variants of synchronous tree-adjoining grammar (STAG). Carreras & Collins model syntax atop phrase pairs with a parser using sister adjunction (as in their 2008 parser). The model resembles a synchronous version of Markov grammar, which also connects it to recent dependency models of translation (e.g. Shen et al. 2008, Galley et al. 2009, Gimpel & Smith below, and Hassan et al. in the poster session). Decoding is NP-complete, and devising efficient beam search is a key point in the paper. The resulting system outperforms Pharaoh on German-English. DeNeefe & Knight model target-side syntax via synchronous tree insertion grammar (STIG). It's similar to synchronous tree substitution grammar (STSG; previously realized in MT as GHKM) with added left- and right-adjunction operations to model optional arguments. They show how to reuse a lot of the STSG machinery via a grammar transformation from STIG to STSG, and the results improve on a strong Arabic-English baseline.
Gimpel & Smith use a relatively new formalism: quasi-synchronous dependency grammar (QDG). In quasi-synchronous grammar, the generation of a target syntax tree is conditioned on (but not necessarily isomorphic to) a source syntax tree. Formally, each target node can be annotated with any source node. Since in dependency grammar the nodes are words, their QDG model resembles a word-to-word model. Decoding with QDG was not obvious given past work, and is one of several novel contributions of the paper. Another is the idea that all possible biphrases can fire an associated feature, regardless of overlap. Kääriäinen makes this idea central. Instead of reasoning over the latent derivations of a generative model, his model directly optimizes a feature-based representation of the target sentence, where the features consist of any biphrase in the training set (per standard heuristics). This raises some new problems -- such as how to find the target sentence given the optimal feature vector -- which are solved with dynamic programming. The decoder doesn't quite beat Moses when used with a language model, but it's an order of magnitude faster!
Three other papers operate on STSG models, with an emphasis on learning techniques. Cohn & Blunsom reformulate tree-to-string STSG induction as a problem in non-parametric Bayesian inference, extending their TSG model for monolingual parsing, and removing the dependence on heuristics over noisy GIZA++ word alignments. The model produces more compact rules, and outperforms GHKM on a Chinese-English task. This is a hot topic: check out Liu & Gildea's poster for an alternative Bayesian formulation of the same problem and language pair. Galron et al. look at tree-to-tree STSG (from a Data-Oriented Parsing perspective), with an eye towards discriminatively learning STSG rules to optimize for translation accuracy.
Bayesian inference also figures in the model of Chung & Gildea, who aim at bilingually-informed segmentation of a source language. The model is like IBM Model 1, except that the source positions are actually substrings of the source instead of single positions. Reasoning over the substring boundaries makes it resemble an HMM, and they use a sparse prior to avoid overfitting. Tokenizing new text uses the marginal distribution on source language segmentations, and this performs almost as well as a supervised segmenter on Chinese, and better on Korean, in end-to-end translation.
SCFG models aren't completely forgotten: Zhang & Li offer a new twist on reordering in binary-branching SCFG. Given a source parse, we could train a maximum entropy classifier to decide whether any binary production should be inverted; this requires a lot of computation over sparse vectors. They instead represent the features implicitly using a tree convolution kernel, showing nice gains in Chinese-English.
On the algorithmic side, Levenberg & Osborne look at language modeling under the condition that we have unbounded data streams in both source and target language, bounded computation, and the desire to bias our language model towards more recent language use without constantly retraining it. They accomplish this with online perfect hashing (extending previous work) in a succinct data structure that supports deletions, showing that they can draw on recent information in both the source and the target to incrementally update the model while keeping a bounded memory footprint.
Bai et al. focus on the problem of acquiring multiword expressions (i.e. idioms), showing why typical word alignment methods fail, and using a combination of statistical association measures and heuristics to fix the problem, with small gains in Chinese-English.
Decoding
Since SCFG models have become mainstream, there's been a greater emphasis on decoding. Following a recent strand of research on grammar transformations for SCFG, Xiao et al. observe that, in the space of possible transformations, many will pair source yields with huge numbers of target yields, which compete during decoding and thus result in more search errors. The trick is to select a transform that distributes target yields more evenly across source yields. They pose this as an optimization problem and give a greedy algorithm; the resulting grammar is reliably better under a variety of conditions on a Chinese-English task. Meanwhile, Zhang et al. engineer more efficient STSG decoding for the case in which the source is a parse forest and source units are tree fragments. The trick is to encode translation rules in the tree path equivalent of a prefix tree. On Chinese-English this improves decoding speed and ultimately translation accuracy, because the decoder can consider larger fragments much more efficiently. Finally, see Finch & Sumita's comprehensive poster on bidirectional phrase-based decoding for a huge number of language pairs.
Onwards and Upwards
The align/extract/MERT pipeline popularized by Moses and other NIST-style systems is incredibly hard to improve, but several papers manage just that.
Hermjakob's word aligner starts from lexical translation parameters learned by a statistical alignment model. Then, following some fairly general observations on different linguistic classes of words, it uses some well-motivated heuristics to fix a whole bunch of little things that many more principled models ignore: the different behavior of content words (improved via careful manipulation of pointwise mutual information) and function words (improved via constraints from parse structure) is treated along with careful handling of numbers, transliterations, and morphology to give strong improvements in Arabic-English.
Liu et al. then extract phrases by relaxing standard heuristic constraints. Given a posterior probability for every alignment point, they simply calculate the probability that a phrase would be extracted, and use this as their count in the typical frequency-based estimate. It's efficient and improves Chinese-English.
Three papers incorporate new feature types into strong baseline translation models, following a recent trend. Shen et al. devise some clever local features using source-side context, derivation span length, and dependency modeling to make impressive improvements on an already impressive baseline system in both Chinese-English and Arabic-English. Matsoukas et al. then show how a mixed-genre system can effectively be adapted for a particular target domain, by using a small amount data to tune weights tied to genre and collection types in the training corpus, again with strong results in Arabic-English. Mauser et al. take their previous triplet lexicon model (a probabilistic feature using an outside source word as additional conditioning context) and move it from a reranking step into the decoding step, with a nice experimental treatment showing improvements in large-scale Chinese-English and Arabic-English.
If you've seen the latest NIST results, you know that system combination gives huge improvements. Check out posters by He & Toutanova, Duan et al., and Feng et al. to learn the latest techniques. Last but not least, if you need a strategy for language pairs with very little parallel data, the poster by Nakov & Ng will interest you.
Thanks
EMNLP was the first time I've been area chair for a conference, and it was really rewarding to work with such great volunteers and< to see the great papers that were selected (I should note here that I included two papers not on my track that I'm quite familiar with -- the ones from Edinburgh). It was also very enlightening, but that's another story. Many thanks to Hal for offering this forum to share the results!