N-gram language models have been fairly successful at the task of distinguishing homophones, in the context of speech recognition. In machine translation (and other tasks, such as summarization, headline generation, etc.), this is not their job. Their job is to select fluent/grammatical sentences, typically ones which have undergone significant reordering. In a sense, they have to order words. A large part of the thesis of my academic sibling, Radu Soricut, had to do with exploring how well ngram language models can reorder sentences. Briefly, they don't do very well. This is something that our advisor, Daniel Marcu, likes to talk about when he gives invited talk; he shows a 15 word sentence and the preferred reorderings by a ngram LM and they're total hogwash, even though audience members can fairly quickly solve the exponential time problem of reordering the words to make a good sounding sentence. (As an aside, Radu found that if you add in a syntactic LM, things get better... if you don't want to read the whole thesis, just skip forward to section 8.4.2.)
Let's say we like ngram models. They're friendly for many reasons. What could we do to make them more word-order sensitive? I'm not claiming that none of these things have been tried; just that I'm not aware of them having been tried :).
- Discriminative training. There's lots of work on discriminative training of language models, but, from what I've seen, it usually has to do with trying to discriminate true sentences from fake sentences, where the fake sentences are generated by some process (eg., an existing MT or speech system, a trigram LM, etc.). The alternative is to directly train a language model to order words. Essentially think of it as a structured prediction problem and try to predict the 8th word based on (say) the two previous. The correct answer is the actual 8th word; the incorrect answer is any other word in the sentence. Words that don't appear in the sentence are "ignored." This is easy to implement and seems to do something reasonable (on a small set of test data).
- Add syntactic features to words, eg., via cluster-based language models. My thought here is to look at syntactic features of words (for instance, CCG-style lexicon information) and use these to create descriptors of the words; these can then be clustered (eg., use tree-kernel-style-features) to give a cluster LM. This is similar to how people have added CCG/supertag information to phrase-based MT, although they don't usually do the clustering step. The advantage to clustering is then you (a) get generalization to new words and (b) it fits in nicely with the cluster LM framework.