07 March 2009

n-gram words an language Ordering model with

N-gram language models have been fairly successful at the task of distinguishing homophones in the context of speech recognition. In machine translation (and other tasks, such as summarization, headline generation, etc.), this is not their job. Their job is to select fluent/grammatical sentences, typically ones which have undergone significant reordering. In a sense, they have to order words. A large part of the thesis of my academic sibling, Radu Soricut, had to do with exploring how well ngram language models can reorder sentences. Briefly, they don't do very well. This is something that our advisor, Daniel Marcu, likes to talk about when he gives invited talks; he shows a 15-word sentence and the reorderings preferred by an ngram LM, and they're total hogwash, even though audience members can fairly quickly solve the exponential time problem of reordering the words to make a good sounding sentence. (As an aside, Radu found that if you add in a syntactic LM, things get better... if you don't want to read the whole thesis, just skip forward to section 8.4.2.)

Let's say we like ngram models. They're friendly for many reasons. What could we do to make them more word-order sensitive? I'm not claiming that none of these things have been tried; just that I'm not aware of them having been tried :).

  1. Discriminative training. There's lots of work on discriminative training of language models, but, from what I've seen, it usually has to do with trying to discriminate true sentences from fake sentences, where the fake sentences are generated by some process (e.g., an existing MT or speech system, a trigram LM, etc.). The alternative is to directly train a language model to order words. Essentially, think of it as a structured prediction problem and try to predict the 8th word based on (say) the two previous words. The correct answer is the actual 8th word; the incorrect answer is any other word in the sentence. Words that don't appear in the sentence are "ignored." This is easy to implement and seems to do something reasonable (on a small set of test data).
  2. Add syntactic features to words, e.g., via cluster-based language models. My thought here is to look at syntactic features of words (for instance, CCG-style lexicon information) and use these to create descriptors of the words; these can then be clustered (e.g., using tree-kernel-style features) to give a cluster LM. This is similar to how people have added CCG/supertag information to phrase-based MT, although they don't usually do the clustering step. The advantage of clustering is that (a) you get generalization to new words and (b) it fits in nicely with the cluster LM framework.
These both seem like such obvious ideas that they must have been tried... maybe they didn't work? Or maybe I just couldn't dig up papers. Or maybe they're just not good ideas so everyone else dismissed them :).
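
The first idea can be sketched as a small perceptron-style ranker. This is only a rough illustration under assumptions not spelled out above: indicator features over the previous two words, a plain perceptron update, and a greedy left-to-right decoder. The names `features`, `train`, and `greedy_order` are hypothetical.

```python
from collections import defaultdict

def features(prev2, prev1, cand):
    """Indicator features for placing `cand` after the context (prev2, prev1)."""
    return [("tri", prev2, prev1, cand), ("bi", prev1, cand), ("uni", cand)]

def score(weights, prev2, prev1, cand):
    return sum(weights[f] for f in features(prev2, prev1, cand))

def train(sentences, epochs=5):
    """Perceptron training: at each position, the gold next word competes
    only against the other words of the same sentence."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for sent in sentences:
            padded = ["<s>", "<s>"] + sent
            candidates = sorted(set(sent))  # words outside the sentence are ignored
            for i in range(2, len(padded)):
                prev2, prev1, gold = padded[i - 2], padded[i - 1], padded[i]
                pred = max(candidates, key=lambda w: score(weights, prev2, prev1, w))
                if pred != gold:
                    for f in features(prev2, prev1, gold):
                        weights[f] += 1.0
                    for f in features(prev2, prev1, pred):
                        weights[f] -= 1.0
    return weights

def greedy_order(weights, bag):
    """Order a bag of words by repeatedly picking the best-scoring remaining word."""
    remaining = list(bag)
    out = ["<s>", "<s>"]
    while remaining:
        best = max(sorted(set(remaining)),
                   key=lambda w: score(weights, out[-2], out[-1], w))
        remaining.remove(best)
        out.append(best)
    return out[2:]
```

Greedy decoding is chosen here only to keep the sketch short; a beam or other search over orderings could be swapped in without changing the training loop.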

39 comments:

Mark Johnson said...

What I'm about to suggest may not have any practical application, but here goes anyway.

If you want to discriminatively train e.g. a logistic regression model to identify one permutation from the set of all possible permutations of a fixed string, you face the problem of calculating the partition function (i.e., the sum of the scores of all the permutations). Since the number of permutations grows exponentially with sentence length, exhaustive enumeration of all possible permutations very rapidly becomes impossible.

It turns out that our statistician friends have been thinking about this problem for a while; a search for "Mallows model" or "generalized Mallows model" should get you started. Most of this work focuses on ranking problems, but I've been wondering whether Mallows models could be extended to include e.g. bigram features between adjacent elements in the permutation (i.e., costs that depend on pairwise adjacencies).
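
For concreteness, here is a minimal sketch of the standard Mallows model with Kendall tau distance, where the probability of a permutation is proportional to exp(-theta * d(permutation, reference)). It compares a brute-force partition function (a sum over all n! permutations) with the known closed-form product for this particular distance. The function names are hypothetical, the items are assumed to be distinct, and theta is assumed to be strictly positive. Nothing here covers the bigram-feature extension wondered about above.

```python
import math
from itertools import permutations

def kendall_tau(perm, ref):
    """Number of item pairs that `perm` orders differently from `ref`."""
    pos = {item: i for i, item in enumerate(ref)}
    ranks = [pos[item] for item in perm]
    return sum(1 for i in range(len(ranks))
               for j in range(i + 1, len(ranks)) if ranks[i] > ranks[j])

def partition_brute_force(ref, theta):
    """Sum of exp(-theta * distance) over all n! permutations (small n only)."""
    return sum(math.exp(-theta * kendall_tau(p, ref)) for p in permutations(ref))

def partition_closed_form(n, theta):
    """Closed-form normalizer for Kendall tau distance; requires theta > 0."""
    z = 1.0
    for j in range(1, n + 1):
        z *= (1.0 - math.exp(-j * theta)) / (1.0 - math.exp(-theta))
    return z

def mallows_prob(perm, ref, theta):
    """Probability of `perm` under a Mallows model centered on `ref`."""
    return math.exp(-theta * kendall_tau(perm, ref)) / partition_closed_form(len(ref), theta)
```

The brute-force version is only feasible for very short strings, which is the problem described above; the closed form exists because the Kendall tau distance decomposes into independent per-item terms.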

hal said...

Quick comment... the Mallows model idea is definitely interesting. Regarding the training, you could always use a structured prediction algorithm that doesn't require you to sum or argmax...

Bob Carpenter said...

Typical n-gram LMs suffer from both locality (due to length and sparsity of n-grams - just like HMMs) and label bias (due to backoff/interpolated smoothing - just like HMMs).

CRFs tackled label bias with a sentence-wide partition function rather than computing each tag locally. Could something like that work for LMs? I've seen people just sample negative data (often from an n-best list from a simpler process) rather than use all n! other orderings.

Presumably heuristic search could tackle the n! problem at run time.
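
One possible form of such a heuristic search is a beam search over partial orderings. The sketch below assumes a hypothetical `bigram_logprob` dictionary mapping (previous word, word) pairs to log-probabilities, with a fixed `floor` score standing in for real smoothing; it does not score an end-of-sentence symbol.

```python
def beam_search_order(bag, bigram_logprob, beam_size=5, floor=-10.0):
    """Approximately find the highest-scoring ordering of `bag` under a bigram model."""
    # Each hypothesis is (score, words placed so far, words still to place).
    beam = [(0.0, ("<s>",), tuple(sorted(bag)))]
    for _ in range(len(bag)):
        candidates = []
        for score, seq, remaining in beam:
            for k, word in enumerate(remaining):
                # `remaining` is sorted, so equal words are adjacent; expand each once.
                if k > 0 and remaining[k - 1] == word:
                    continue
                step = bigram_logprob.get((seq[-1], word), floor)
                candidates.append((score + step, seq + (word,),
                                   remaining[:k] + remaining[k + 1:]))
        # Keep only the best `beam_size` partial orderings.
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = candidates[:beam_size]
    best_score, best_seq, _ = beam[0]
    return list(best_seq[1:]), best_score
```

With a beam of size k, each step expands at most k times n hypotheses, so the cost is polynomial in sentence length instead of n!, at the price of possibly missing the best ordering.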

Anonymous said...

Could PCFGs be a good solution to the word ordering problem? I mean: Collect a corpus, and tag each word with its part-of-speech. Then train a PCFG on the sequences of POS tags. Then tag each word in a target sentence with its POS, and output the most likely POS sequence wrt the PCFG. Now there may be several word orderings per POS ordering, but will that be a big problem?

Rachel Cotterill said...

Seems to me that n-POS-tags (as opposed to n-words) would be a better way of approaching a word-ordering task. Assuming you could reliably POS-tag each individual word in the test cases, which sounds non-trivial. Interesting questions.

Harr said...

Following up on Mark Johnson's comment, we (Regina Barzilay's group at MIT) did some recent work on applying the Mallows model to document level structuring. We have a NAACL paper on that stuff:

http://people.csail.mit.edu/harr/papers/naacl2009.pdf

Bob Moore said...

I know that this comment is a month late, but the thought just occurred to me. What about just using the likelihood ratio of the ngram model to the unigram model? That would factor out the contribution of the words themselves and leave only the contribution due to the order of the words.
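
A minimal sketch of this ratio, assuming add-one smoothed bigram and unigram models estimated from a toy corpus (the class name `BigramVsUnigram` and the smoothing choice are illustrative assumptions, not something specified above):

```python
import math
from collections import Counter

class BigramVsUnigram:
    """Scores a sentence by log P_bigram(sentence) - log P_unigram(sentence)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.contexts = Counter()
        for sent in corpus:
            padded = ["<s>"] + sent
            self.unigrams.update(sent)
            for prev, word in zip(padded, padded[1:]):
                self.bigrams[(prev, word)] += 1
                self.contexts[prev] += 1
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams) + 1  # one extra slot for unseen words

    def log_unigram(self, sent):
        return sum(math.log((self.unigrams[w] + 1) / (self.total + self.vocab_size))
                   for w in sent)

    def log_bigram(self, sent):
        padded = ["<s>"] + sent
        return sum(math.log((self.bigrams[(p, w)] + 1) / (self.contexts[p] + self.vocab_size))
                   for p, w in zip(padded, padded[1:]))

    def order_score(self, sent):
        """Log likelihood ratio of the bigram model to the unigram model."""
        return self.log_bigram(sent) - self.log_unigram(sent)
```

Note that for a fixed bag of words the unigram term is identical for every permutation, so the ratio ranks permutations of one sentence exactly as the bigram score does; the normalization only changes things when comparing sentences built from different words.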
