Comments on natural language processing blog: "Beating an N-Gram"

Anonymous (2006-07-13 23:34):

Wouldn't also the fact that character n-grams have a *tiny* vocabulary (say, around 100) contribute significantly to the fact that the different smoothing techniques don't make that much of a
difference?

Anonymous (2006-07-10 16:15):

Character n-grams are not just short word n-grams. Let me explain why.

First, there's not only the prefix/suffix issue Hal brought up, but also the cross-word issue. A character 8-gram can span several short words: "out of a" is only 8 characters.

Second, the usual application of token n-grams includes (a) case normalization, (b) punctuation removal, and (c) whitespace removal. That's why these had to be reinserted for IBM's "Entropy of English" paper (early 90s, CL). I asked Joshua Goodman about his and Stanley Chen's papers, and they never modeled the generation of text.

Doing the same thing before running a character language model *increases* cross-entropy on both NY Times Gigaword article bodies and MEDLINE citation abstracts.

Most token LMs are defective in that they assume a fixed vocabulary size. In practice, this means they don't sum to 1.0 over arbitrary sequences of tokens (or over sequences of tokens of the same length). In short, they're *cheating* if you view them as arbitrary text analyzers, and thus they can't be used for tasks such as compression. We back our token models off to character models rather than to a constant (as in Kneser-Ney, Dirichlet, Witten-Bell, etc., applied to tokens).

Google has found that the one-new-token-per-document rule holds into the billions of documents, so there doesn't seem to be a natural upper bound on vocabulary size. This problem is even worse (in theory) in highly inflected languages like Turkish or Basque. And it's problematic for Chinese, where the best word segmentation systems are only about 97% accurate.

Character n-grams under all the usual smoothing schemes converge to the ML estimate with enough data.
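The kind of character-level model under discussion can be sketched in a few lines. The following is a minimal illustration of Witten-Bell smoothing interpolated down through shorter contexts to a uniform base over the character vocabulary; the class name, the uniform base case, and the recursive interpolation are my own simplifications for exposition, not anyone's actual system:

```python
# Minimal character n-gram LM with Witten-Bell smoothing.
# Each context interpolates its ML estimate with the next-shorter
# context; the empty context backs off to a uniform distribution
# over the (tiny) character vocabulary.
import math
from collections import Counter

class CharLM:
    def __init__(self, n, text):
        self.n = n
        self.vocab = sorted(set(text))
        self.counts = Counter()    # all k-gram counts, k = 1..n
        self.context = Counter()   # context occurrence counts
        self.followers = {}        # context -> distinct next chars
        for k in range(1, n + 1):
            for i in range(len(text) - k + 1):
                gram = text[i:i + k]
                self.counts[gram] += 1
                ctx = gram[:-1]
                self.context[ctx] += 1
                self.followers.setdefault(ctx, set()).add(gram[-1])

    def prob(self, ctx, ch):
        # Witten-Bell: weight the lower-order model by the number of
        # distinct continuations T(ctx) seen after this context.
        if ctx == "":
            lower = 1.0 / len(self.vocab)   # uniform base case
        else:
            lower = self.prob(ctx[1:], ch)
        c_ctx = self.context.get(ctx, 0)
        if c_ctx == 0:
            return lower                    # unseen context: back off fully
        t = len(self.followers[ctx])
        c = self.counts.get(ctx + ch, 0)
        return (c + t * lower) / (c_ctx + t)

    def bits_per_char(self, text):
        # Cross-entropy in bits per character, the usual metric.
        total = 0.0
        for i in range(len(text)):
            ctx = text[max(0, i - self.n + 1):i]
            total += -math.log2(self.prob(ctx, text[i]))
        return total / len(text)
```

By construction the smoothed distribution sums to one over the vocabulary for any context, which is the property the fixed-vocabulary token models above lack over arbitrary text.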
So the more data they see, the less they "waste" on unseen events. And because they typically smooth over n-gram levels, their estimates of unseen tokens are much better (partly the prefix/suffix issue).

hal (2006-06-29 16:18):

Wow, Bob said a lot :).

I have to admit something of a personal bias: it irks me that vanilla relative frequency works so darn well on so many things, and I'm really curious *why* this is. This is maybe a separate issue, though.

I think "no data like more data" is fairly well accepted, though I would generally add "*from the right domain*" at the end. The free-word versus free-phrase order distinction is an important one, and one I hadn't previously really thought about.

I find it utterly *fascinating* that character LMs work so well. Word n-grams I can buy, because words are sort of whole semantic units. But I wonder if a high-ish order character n-gram is essentially just mimicking a word n-gram, with a little smoothing over prefixes and suffixes? I.e., if you stem, or just chop words off at four characters, is there a significant difference between the character-based n-gram and the word n-gram? It seems that the character-based n-gram *has* to be wasting probability mass on things that just don't occur, probably more so than a word n-gram would. But maybe this is okay for the purposes of classification.

In some sense, it is nice that character-based models can go somewhere. This reminds me of attempts to do computer vision by looking only at pixels (a la Yann LeCun, http://yann.lecun.com/).

Are there any similar results for freer word order languages than English and Chinese?
For instance, Russian or Hebrew or Arabic?

Anonymous (2006-06-29 15:39):

If you look closely at the literature, you'll find a lot of the "improvements" on n-gram models are improvements over lousy baseline models (low order, bad smoothing, small amounts of data, etc.).

We just built a very large character n-gram model for spelling checking, trained on 50GB of domain-specific data. It works great as the source model in a noisy-channel spelling corrector. Results from MSN (Brill, Moore, Cucerzan, etc.) show roughly the same thing.

Case restoration as done by IBM's TrueCasing system is just another noisy-channel spelling corrector.

As to free word order languages: most of them are not free *word* order but free *phrase* order. Much of the power of n-grams comes from the phrasal level. They break down in English when related terms are separated by long appositives, as in "The President, who blah blah blah, declared ...", where you don't have a chance of predicting "declared" on the basis of "President" with an n-gram. In fact, you wind up predicting it based on something it's less related to. Attempts to use parsing to sort these issues out have had mixed success at best. Parsers just don't provide very good language models, probably because most decisions, even a parser's, are local, even in languages with more flexible phrase order.

As to "English is just right", I think that's a perspective derived from a field where most of the work is in English. Sentence detection, for instance, is a pain because of our overloading of the period -- not a problem in Chinese.
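The noisy-channel setup mentioned above amounts to choosing the candidate w that maximizes P(w) * P(x|w). The following toy version, with a unigram source model and a fixed log-penalty per edit as the channel model, is an illustrative simplification of the idea, not the large character-LM system described:

```python
# Toy noisy-channel spelling corrector: score each candidate word w
# by log P(w) (source model) + log P(x|w) (channel model), where the
# channel is crudely approximated by a fixed penalty per edit.
import math
from collections import Counter

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(observed, corpus, edit_log_penalty=-4.0):
    counts = Counter(corpus.lower().split())
    total = sum(counts.values())
    best, best_score = observed, -math.inf
    for w in counts:
        source = math.log(counts[w] / total)              # log P(w)
        channel = edit_log_penalty * edit_distance(observed, w)  # ~log P(x|w)
        if source + channel > best_score:
            best, best_score = w, source + channel
    return best
```

A real system would replace the unigram source with the large character LM and the per-edit penalty with learned error probabilities; the decision rule is the same.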
As to scaling, check out my paper from last year's ACL workshop: "Scaling High-Order Character Language Models to Gigabytes" (http://www.colloquial.com/carp/Publications/acl05soft-carpenter.pdf). I basically showed that as the data gets up into the 1GB range for 8-grams, the smoothing you use is irrelevant: Dirichlet, Kneser-Ney, and what we went with, Witten-Bell, were all within three decimal places after that much data. That's pretty much in line with the Brill and Banko conclusion that more data is better. N-grams are great because you can just keep piling in more data usefully.

Character LMs are tough to beat for classification. Both Bill Teahan and Ian Witten's work on compression-based models and Fuchun Peng's recent Waterloo thesis have shown this in a wide variety of settings. I just rebuilt Pang and Lee's sentiment analysis, and character n-grams did as well as their SVM-based classifiers.

Chinese character LMs work very well, too. We used them to do word segmentation for the latest SIGHAN (Teahan did this first in his CL paper of about 2000). And using words rather than tokens worked very well for Chinese named-entity recognition. Scores were about the same as for English (86% balanced F measure on person/location/organization).

hal (2006-06-27 16:46):

I should mention that by "n-gram model" I mean a very well-tuned n-gram model with interpolated Kneser-Ney smoothing. Obviously a lot of work has gone into getting smoothing to work out well, and it is possible that this is where the majority of the power is coming from, though, when not scrambling for tiny improvements, I tend to doubt it.
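For concreteness, interpolated Kneser-Ney at the bigram level discounts each observed bigram count and redistributes the held-out mass according to a continuation probability (how many distinct contexts a word follows) rather than raw unigram frequency. A minimal sketch, using a single fixed discount rather than the count-dependent discounts of the full method:

```python
# Interpolated Kneser-Ney bigram sketch with a fixed discount d.
# P(w | v) = max(c(v,w) - d, 0) / c(v) + lambda(v) * P_cont(w),
# where P_cont(w) is the fraction of bigram types ending in w.
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])             # c(v)
    continuations = Counter(w for (_, w) in bigrams)  # distinct left contexts of w
    n_bigram_types = len(bigrams)
    followers = Counter(v for (v, _) in bigrams)      # distinct continuations of v

    def prob(w, prev):
        p_cont = continuations[w] / n_bigram_types    # continuation probability
        c_prev = context_counts[prev]
        if c_prev == 0:
            return p_cont                             # unseen context
        discounted = max(bigrams[(prev, w)] - d, 0) / c_prev
        lam = d * followers[prev] / c_prev            # mass held out by discounting
        return discounted + lam * p_cont
    return prob
```

The continuation distribution is the distinctive Kneser-Ney ingredient: a word like "Francisco" may be frequent yet follow very few contexts, so it gets little of the backed-off mass.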