(guest post) Not always is a major conference a short train ride away, so I went down to Manchester even though I had no real business being at COLING this year. Liang Huang gave an interesting tutorial where he tied together a number of natural language problems and types of dynamic programming solutions for it. A nice treatment of this area, with examples from tree-based statistical machine translation, of course. There were also a lot of very strong papers, so let's take a look at them.
Syntax-Based SMT is alive: Currently one of the most exciting areas of statistical machine translation is the development of tree-based models, with all the open question: Real syntax in the source, in the target? Which grammar formalism? How do we parameterize the models? And all this efficient, please! When using real syntax on both sides, many of the rules in standard phrase-based models do not match anymore the syntactic annotations, so they have to be dropped. So, to accommodate more rules, Zhang et al. propose to extend the synchronous grammar formalism to allow multi-headed rules. He et al. build maximum entropy models for rule selection in hierarchical phrase models, similar to recent work by Carpuat and Wu in phrase-based models. Additional source-side context may of course be a syntactic parse tree. Xiong et al. use features about syntactic spans in their translation model and maximum entropy reordering model in a BTG framework. Zhang et al. present an algorithm to extract rules from a word-aligned corpus in linear time. The trick is to build up a tree structure that compactly encodes all possible rules. Storing all these rules is a real problem, too, it may take up tera-bytes, but Lopez presents a handy solution: suffix arrays. He also shows that rules that cover longer spans do help in hierarchical models, even if long phrases do not matter much in phrase-based models. Finally, Zollmann et al. present detailed results from their time at Google where they compared their syntax-augmented model against a hierarchical and standard phrase model -- so, syntax has arrived at Google at last.
Back to the Basics: Statistical machine translation systems have evolved into a fairly long pipeline of steps, and many of them deserve more attention. For instance the phrase segmentation of the input. Does it matter? The standard approach uses uniform segmentation, and maybe a phrase count feature. Blackwood et al. suggest to build an explicit phrasal segmentation model, which takes the form of a smoothed bigram phrase model. Moore and Quirk examine everybody's favorite: minimum error rate training (MERT), the step used to fine-tune weights given to various model components. This is typically done with a random starting point, and then hillclimbing to a better weight setting to optimize BLEU. Since this method easily gets stuck in local minmima, this is repeated several times. Moore and Quirk promise better merting (yes, that is a verb!) with fewer random restarts and better selection of the starting points through a random walk. But also the n-best list of candidate translations used during merting or re-ranking may be improved. Chen et al. suggest to add additional entries using a confusion network or by concatenating n-grams found in candidate translations. Speaking of building of confusion networks, they are also used in combining different system outputs (which is the current key method to win competitions, even though bad-mouthed as intellectually bankrupt at the recent NIST evaluation meeting). Ayan et al. suggest that better confusion networks and better system combination results may be achieved, when considering synonyms as found in Wordnet for matching up words when aligning the different outputs. Zwarts and Dras show that system combination could also be done by building a classifier that chooses one of the outputs for each sentence.
Reordering: Two papers on reordering, one of the big unsolved problems in statistical machine translation. Elming shows some advances in the first-reorder-into-lattice-then-decode approach, specifically how this lattice should be scored: he argues for scoring the final output to check which rules were implicitly followed (either by applying the rule or using a phrase translation that has it internalized). Zhang et al. propose that different reordering models should be built for different types of sentences, such as statements vs. questions.
Neat and surprising: We typically train our system on corpora that we find on web sites of multilingual institutions, such as the UN or the EU. When using such data, does it matter what the original source language was? I doubt it, but then van Halteren shows that he can detect the original source language in English Europarl documents with over 96 percent accuracy.
Connecting with our rule-based friends: An effective but simple method to combine a rule-based system like Systran with a statistical model is but first translating with Systran and then learn a statistical model to translate Systran English into real English. Based on your biases, you can call this rule-based preprocessing, serial combination, or statistical post-editing. Ueffing et al. show that some information from the rule-based system may help the statistical component, such as annotation of names or other reliable markup.
Aligning and building up dictionaries: What can you do with a large monolingual source language corpus, or a target language corpus, or a conventional bilingual dictionary? Wu et al. present various methods and compare them. Tsunakawa et al. use a method using a pivot language (English) to build a Japanese-Chinese dictionary. Given a syntactically parsed parallel corpus, Zhechev and Way compare different methods how to extract subtrees. Macken et al. extract technical term translations from a domain-specific parallel corpus. Lardilleux and Lepage propose an iterative phrase alignment method that first matches short sentences and phrases, and then subtracts the known alignments from longer phrases to extract the remainder, basically pigeon-holing.
Also: A paper on a Bayesian method for Chinese word segmentation by Xu et al.; a paper of transliteration, by Malik et al.; a paper on evaluation, especially if quality of reference translations matters (it does not), by Hamon and Mostefa; and a new grammar formalism for translation by Søgaard.
What's missing? No paper on word alignment or large-scale discriminative training, but there is always Hawaii.