29 September 2008

Statistical Machine Translation Papers at COLING

(guest post) Not always is a major conference a short train ride away, so I went down to Manchester even though I had no real business being at COLING this year. Liang Huang gave an interesting tutorial in which he tied together a number of natural language problems and the types of dynamic programming solutions for them. A nice treatment of this area, with examples from tree-based statistical machine translation, of course. There were also a lot of very strong papers, so let's take a look at them.

Syntax-Based SMT is alive: Currently one of the most exciting areas of statistical machine translation is the development of tree-based models, with all the open questions: Real syntax in the source, in the target? Which grammar formalism? How do we parameterize the models? And all this efficiently, please! When using real syntax on both sides, many of the rules in standard phrase-based models no longer match the syntactic annotations, so they have to be dropped. To accommodate more rules, Zhang et al. propose extending the synchronous grammar formalism to allow multi-headed rules. He et al. build maximum entropy models for rule selection in hierarchical phrase models, similar to recent work by Carpuat and Wu in phrase-based models. Additional source-side context may of course be a syntactic parse tree. Xiong et al. use features about syntactic spans in their translation model and maximum entropy reordering model in a BTG framework. Zhang et al. present an algorithm to extract rules from a word-aligned corpus in linear time. The trick is to build up a tree structure that compactly encodes all possible rules. Storing all these rules is a real problem, too -- it may take up terabytes -- but Lopez presents a handy solution: suffix arrays (a toy sketch follows below). He also shows that rules that cover longer spans do help in hierarchical models, even if long phrases do not matter much in phrase-based models. Finally, Zollmann et al. present detailed results from their time at Google, where they compared their syntax-augmented model against a hierarchical and a standard phrase model -- so, syntax has arrived at Google at last.
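The suffix-array trick is easy to convey in a few lines. Below is a minimal sketch (my own toy illustration, not Lopez's implementation): instead of extracting and storing every rule up front, index the source side of the training corpus once and look phrases up on demand at decoding time. The naive construction here is quadratic; real systems build the index in (near) linear time.

    # Toy suffix-array index over a tokenized corpus.
    def build_suffix_array(tokens):
        """Corpus positions, sorted by the suffix starting at each one."""
        return sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def find_phrase(tokens, sa, phrase):
        """Binary-search for every corpus position where `phrase` occurs."""
        m = len(phrase)
        lo, hi = 0, len(sa)
        while lo < hi:  # lower bound: first suffix with prefix >= phrase
            mid = (lo + hi) // 2
            if tokens[sa[mid]:sa[mid] + m] < phrase:
                lo = mid + 1
            else:
                hi = mid
        start, hi = lo, len(sa)
        while lo < hi:  # upper bound: first suffix with prefix > phrase
            mid = (lo + hi) // 2
            if tokens[sa[mid]:sa[mid] + m] <= phrase:
                lo = mid + 1
            else:
                hi = mid
        return sorted(sa[i] for i in range(start, lo))

    corpus = "the cat sat on the mat and the cat ran".split()
    sa = build_suffix_array(corpus)
    print(find_phrase(corpus, sa, "the cat".split()))  # [0, 7]

The point is that the index costs memory proportional to the corpus, not to the (astronomically larger) set of extractable rules.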

Back to the Basics: Statistical machine translation systems have evolved into a fairly long pipeline of steps, and many of them deserve more attention. For instance, the phrase segmentation of the input: does it matter? The standard approach uses uniform segmentation, and maybe a phrase count feature. Blackwood et al. suggest building an explicit phrasal segmentation model, which takes the form of a smoothed bigram phrase model. Moore and Quirk examine everybody's favorite: minimum error rate training (MERT), the step used to fine-tune the weights given to the various model components. This is typically done by picking a random starting point and then hillclimbing to a better weight setting to optimize BLEU. Since this method easily gets stuck in local optima, it is repeated several times. Moore and Quirk promise better merting (yes, that is a verb!) with fewer random restarts and better selection of the starting points through a random walk (a cartoon of the basic loop follows below). The n-best list of candidate translations used during merting or re-ranking may also be improved: Chen et al. suggest adding entries using a confusion network or by concatenating n-grams found in candidate translations. Speaking of confusion networks, they are also used in combining different system outputs (currently the key method to win competitions, even though it was bad-mouthed as intellectually bankrupt at the recent NIST evaluation meeting). Ayan et al. suggest that better confusion networks and better system combination results may be achieved by considering synonyms from WordNet when aligning the different outputs. Zwarts and Dras show that system combination could also be done by building a classifier that chooses one of the outputs for each sentence.
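For readers who have never merted, the heart of the procedure really is just hillclimbing with random restarts over the component weights. Here is a cartoon of that loop: `bleu(weights)` is a stand-in for scoring the dev-set 1-best translations under the given weights, and real MERT uses Och's exact line search along the piecewise-constant error surface rather than these blind coordinate steps.

    import random

    def hillclimb(bleu, w, step=0.1, max_iters=100):
        """Greedy coordinate-wise hillclimbing from starting point w."""
        best = bleu(w)
        for _ in range(max_iters):
            improved = False
            for i in range(len(w)):
                for delta in (step, -step):
                    cand = w[:]
                    cand[i] += delta
                    score = bleu(cand)
                    if score > best:
                        w, best, improved = cand, score, True
            if not improved:
                break  # stuck in a local optimum
        return w, best

    def mert(bleu, dim, restarts=20):
        """Random restarts, because a single hillclimb stalls locally."""
        best_w, best_s = None, float("-inf")
        for _ in range(restarts):
            w0 = [random.uniform(-1.0, 1.0) for _ in range(dim)]
            w, s = hillclimb(bleu, w0)
            if s > best_s:
                best_w, best_s = w, s
        return best_w, best_s

Moore and Quirk's contribution, in these terms, is to spend the restart budget more wisely: fewer of them, started from better places.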

Reordering: Two papers on reordering, one of the big unsolved problems in statistical machine translation. Elming shows some advances in the first-reorder-into-lattice-then-decode approach, specifically how this lattice should be scored: he argues for scoring the final output to check which rules were implicitly followed (either by applying the rule or using a phrase translation that has it internalized). Zhang et al. propose that different reordering models should be built for different types of sentences, such as statements vs. questions.

Neat and surprising: We typically train our systems on corpora that we find on the web sites of multilingual institutions, such as the UN or the EU. When using such data, does it matter what the original source language was? I would have doubted it, but van Halteren shows that he can detect the original source language of English Europarl documents with over 96 percent accuracy.

Connecting with our rule-based friends: A simple but effective method to combine a rule-based system like Systran with a statistical model is to first translate with Systran and then learn a statistical model to translate Systran English into real English. Depending on your biases, you can call this rule-based preprocessing, serial combination, or statistical post-editing. Ueffing et al. show that some information from the rule-based system may help the statistical component, such as annotation of names or other reliable markup.

Aligning and building up dictionaries: What can you do with a large monolingual source language corpus, or a target language corpus, or a conventional bilingual dictionary? Wu et al. present various methods and compare them. Tsunakawa et al. use a pivot language (English) to build a Japanese-Chinese dictionary. Given a syntactically parsed parallel corpus, Zhechev and Way compare different methods for extracting subtrees. Macken et al. extract technical term translations from a domain-specific parallel corpus. Lardilleux and Lepage propose an iterative phrase alignment method that first matches short sentences and phrases, and then subtracts the known alignments from longer phrases to extract the remainder -- basically pigeonholing (a toy sketch below).
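The subtraction step is simple enough to show on toy data. The sketch below is my own cartoon of the idea, on hypothetical French-English tokens; the real method iterates this over a whole corpus, shortest pairs first, and handles far messier cases.

    def remove_span(tokens, span):
        """Delete the first contiguous occurrence of `span`, or None."""
        for i in range(len(tokens) - len(span) + 1):
            if tokens[i:i + len(span)] == span:
                return tokens[:i] + tokens[i + len(span):]
        return None

    def subtract(pair, known):
        """Subtract an already-aligned phrase pair from a longer sentence
        pair; the leftovers on both sides become a new candidate pair."""
        src = remove_span(pair[0], known[0])
        tgt = remove_span(pair[1], known[1])
        return (src, tgt) if src and tgt else None

    pair = ("le chat noir".split(), "the black cat".split())
    known = ("noir".split(), "black".split())
    print(subtract(pair, known))  # (['le', 'chat'], ['the', 'cat'])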

Also: a paper on a Bayesian method for Chinese word segmentation by Xu et al.; a paper on transliteration by Malik et al.; a paper on evaluation, especially on whether the quality of reference translations matters (it does not), by Hamon and Mostefa; and a new grammar formalism for translation by Søgaard.

What's missing? No paper on word alignment or large-scale discriminative training, but there is always Hawaii.

21 September 2008

Co-training, 10 years later

At this year's ICML, they gave out a "10 year" award to a paper published in an ICML-related venue from 1998. This year it went to a COLT 1998 paper by Avrim Blum and Tom Mitchell: Combining Labeled and Unlabeled Data with Co-Training. While I'm not super familiar with what might have been a contender, I have to say that I definitely think this is a good choice.

For those unfamiliar with the idea of co-training, you should really read the paper. There's also a Wikipedia entry that describes it as:

Co-training is a semi-supervised learning technique that requires two views of the data. It was introduced by Avrim Blum and Tom Mitchell. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Ideally, the two views are conditionally independent (i.e., the two feature sets of each instance are conditionally independent given the class) and each view is sufficient (i.e., the class of an instance can be accurately predicted from each view alone). Co-training first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data.
This is a good summary of the algorithm, but what is left off is that---as far as I know---co-training was one of the first methods (if not the first) for which theoretical analysis showed that semi-supervised learning might help. My history is a bit rough, so anyone should feel free to correct me if I'm wrong.
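To make the algorithm concrete, here is a bare-bones sketch of the loop. This is my simplified, pooled rendering, not Blum and Mitchell's exact procedure (which replenishes a small unlabeled pool and adds fixed numbers of positive and negative examples per view per round); `clf1` and `clf2` stand for any scikit-learn-style classifiers with `predict_proba`, and `X1`/`X2`, `U1`/`U2` are the two views of the labeled and unlabeled data as numpy arrays.

    import numpy as np

    def cotrain(clf1, clf2, X1, X2, y, U1, U2, rounds=10, k=5):
        for _ in range(rounds):
            if len(U1) == 0:
                break
            clf1.fit(X1, y)
            clf2.fit(X2, y)
            p1 = clf1.predict_proba(U1)
            p2 = clf2.predict_proba(U2)
            # pick the k unlabeled examples some view is most sure about
            conf = np.maximum(p1.max(axis=1), p2.max(axis=1))
            picked = np.argsort(-conf)[:k]
            # pseudo-label each with its more confident view's prediction
            use1 = p1.max(axis=1) >= p2.max(axis=1)
            labels = np.where(use1, p1.argmax(axis=1), p2.argmax(axis=1))
            X1 = np.vstack([X1, U1[picked]])
            X2 = np.vstack([X2, U2[picked]])
            y = np.concatenate([y, labels[picked]])
            keep = np.setdiff1d(np.arange(len(U1)), picked)
            U1, U2 = U1[keep], U2[keep]
        clf1.fit(X1, y)
        clf2.fit(X2, y)
        return clf1, clf2

The essential move is visible even in this stripped-down version: each view's confident guesses become training data under the other view, which is exactly where the conditional-independence assumption earns its keep.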

Another aspect of co-training that is cool for readers of this blog is that, to a reasonable degree, it has its roots in a 1995 ACL paper by David Yarowsky: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, which, as far as I know, was really the first paper to introduce the notion of having two views of data (although I don't think David described it as such).

All in all, the co-training paper is great. In fact, if you don't believe me that I think it's great, check out my EMNLP 2008 paper. My analysis (and, to some degree, my algorithm) is based heavily on the co-training analysis.

Which brings me to what I really want to discuss. That is, I have a strong feeling that if the co-training paper were reviewed today, it would be blasted for the theoretical analysis. (Indeed, I had the same fear for my EMNLP paper; though since it was EMNLP and not, say, COLT, I don't think the problem was as severe.) The problem with the co-training paper is that the theoretical result is for an algorithm that is only superficially related to the actual algorithm they implement. In particular, the actual algorithm they implement uses notions of confidence, a steadily increasing training set size, incremental additions, and so on. It's vastly more complicated than the algorithm they analyze. My recent experience as both an author and a reviewer at places like NIPS and ICML is that this is pretty much a non-starter these days.

In fact, the algorithm is so different that it took three years for an analysis of something even remotely close to come out. In NIPS 2001, Sanjoy Dasgupta, Michael Littman and David McAllester published a paper that actually tries to analyze something closer to the "real" co-training algorithm. They get pretty close. And this analysis paper is a full NIPS paper that basically just proves one (main) theorem.

(A similar set of events happened with David Yarowsky's paper. He didn't include any theoretical analysis, but there has been subsequent work, for instance by Steve Abney, to try to understand the Yarowsky algorithm theoretically. And again we see that an analysis of the exact original algorithm remains a bit out of reach.)

I'm sure other people will disagree--which is fine--but my feeling about this matter is that there's nothing wrong with proving something interesting about an algorithm that is not quite exactly what you implement. The danger, of course, is if you get an incorrect intuition. For instance, in the case of co-training, maybe it really was all these "additions" that made the algorithm work, and the whole notion of having two views was useless. This seems to have turned out not to be the case, but it would be hard to tell. For instance, the co-training paper doesn't report results on the actual algorithm analyzed: presumably it doesn't work very well or there would be no need for the more complex variant (I've never tried to implement it). On the other hand, if it had taken Avrim and Tom three extra years to prove something stronger before publishing, then the world would have had to wait three extra years for this great paper.

The approach I took in my EMNLP paper, which, at least as of now, I think is reasonable, is to just flat out acknowledge that the theory doesn't really apply to the algorithm that was implemented. (Actually, in the case of the EMNLP paper, I did implement both the simple and the complex and the simple wasn't too much worse, but the difference was enough to make it worth--IMO--using the more complex one.)