It's fair to say that perplexity is the de facto standard for evaluating language models. Perplexity comes under the usual attacks (what does it mean? does it correlate with something we care about? etc.), but here I want to attack it for a more pernicious reason: it locks us in to probabilistic models.
Background
Language modeling---or more specifically, history-based language modeling (as opposed to full sentence models)---is the task of predicting the next word in a text given the previous words. For instance, given the history "Mary likes her coffee with milk and", a good language model might predict "sugar" and a bad language model might predict "socks." (This is related to the notion of cloze probability.)
It's quite clear that there is no "right answer" to any of these prediction problems. As an extreme example, given the history " The", there are any number of possible words that could come next. There's just no way to know what the "right" answer is, whether you're a machine or a person.
This is probably the strongest justification for a perplexity-like measure. Since there's no "right" answer, we'll let our learned model propose a probability distribution over all possible next words. We say that this model is good if it assigns high probability to "sugar" and low probability to "socks."
Perplexity just measures the cross entropy between the empirical distribution (the distribution of words that actually appear) and the predicted distribution (what your model likes), averaged over the number of words and then exponentiated, after throwing out unseen (OOV) words.
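In symbols (the standard formulation, modulo exactly how OOVs and sentence boundaries are counted), for test words $w_1, \dots, w_N$:
\[
\mathrm{PPL} \;=\; \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_{\text{model}}(w_i \mid w_1, \dots, w_{i-1})\right)
\]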
The Issue
The issue here is that in order to compute perplexity, your model must produce a probability distribution. Historically we've liked probability distributions because they can be combined with other probability distributions according to the rules of probability (e.g., Bayes' rule or the chain rule). Of course, we threw that out a long time ago when we realized that combining things in, for instance, log-linear models worked a lot better in practice, provided you had a bit of data to tune the weights of the log-linear model.
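To make the contrast concrete (textbook formulations, with my notation, not anything specific to this post): the Bayes-rule / noisy-channel combination multiplies properly normalized distributions, while a log-linear model just sums weighted scores of feature functions $h_k$, with weights $\lambda_k$ tuned on held-out data:
\[
\hat{e} \;=\; \arg\max_e\, p(e \mid f) \;=\; \arg\max_e\, p(f \mid e)\, p(e)
\qquad\text{vs.}\qquad
\hat{e} \;=\; \arg\max_e\, \sum_k \lambda_k\, h_k(e, f).
\]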
So the issue in my mind is that there's plenty of good technology out there for making predictions that does not produce probability distributions. I think it's really unfortunate that non-probabilistic approaches don't get to play the language modeling game because they produce the "wrong" sort of output (according to the evaluation, but not according to the real world). I'm not saying there aren't good reasons to like probabilistic models, just that alternatives are good. And right now those alternatives cannot compete. (For instance, Roark, Saraclar and Collins [2007] don't use perplexity at all and just go for the word error rate of a speech recognizer built around their perceptron-based language model.)
When I Ran Into This
I was curious about building a language model using vw (Vowpal Wabbit), in the context of another project, and also to stress-test multiclass classification algorithms that scale well with respect to the number of classes.
As soon as I ran it, I discovered the issue: it produced results in the form of error rates. As I recall (it was a while ago), the error rate was somewhere in the 60s or 70s. I had absolutely no idea whether this was good or not. It seemed reasonable.
To get a sense of how standard language models fare, I decided to train a language model using srilm and evaluate it according to error rate. To make my life easier, I just ran it on the WSJ portion of the Penn Treebank. I used the first 48k sentences as train and the last 1208 sentences as test. I trained a 5-gram Kneser-Ney smoothed language model and evaluated both perplexity and error rate (the latter required a bit of effort; if anyone wants the scripts, let me know and I'll post them, but basically I just take the LM's prediction to be the highest-probability word given the context).
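For concreteness, here's a minimal sketch of that error-rate computation (not my actual scripts); it assumes a hypothetical `lm.logprob(word, context)` accessor over the trained 5-gram model, e.g. something backed by the ARPA file that srilm writes out:

```python
# Minimal sketch of the next-word error-rate evaluation (hypothetical LM API).
# `lm.logprob(word, context)` is assumed to return log p(word | context) under
# the trained 5-gram model; `vocab` is the training vocabulary.

def next_word_error_rate(test_sentences, vocab, lm, order=5):
    errors, total = 0, 0
    for sentence in test_sentences:
        history = ["<s>"]
        for gold in sentence:
            context = tuple(history[-(order - 1):])
            # the model's "prediction" is just its highest-probability next word
            guess = max(vocab, key=lambda w: lm.logprob(w, context))
            errors += (guess != gold)
            total += 1
            history.append(gold)
    return errors / total
```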
The language model I built had a perplexity (ppl1 in srilm) of 236.4, which seemed semi-reasonable, though of course pretty crappy. There was an OOV rate of 2.5% (ignored in the perplexity calculation).
The overall error rate for this model was 75.2%. This means that it was guessing only about a quarter of the words correctly. (Note that this includes the 2.5% of errors mandated by OOVs.)
I also tried another version, where all the model had to do was put the words in the right order. In other words, it knows ahead of time the set of words in the sentence and just has to pick among those words (roughly 20 per sentence, on average), rather than from the full vocabulary (43k types). [This is maybe semi-reasonable for MT.] The error rate under this setting was 66.8%. Honestly, I expected it would be a lot better.
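The only change relative to the sketch above is the candidate set for the argmax: it ranges over the words of the gold sentence instead of the whole vocabulary. Hypothetically, the guess line becomes:

```python
# Restricted setting: the model knows the sentence's word set ahead of time and
# only has to put those words in order (drop-in for the `guess` line above).
guess = max(set(sentence), key=lambda w: lm.logprob(w, context))
```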
Note that if you always guess ","---the most frequent type in this data---your error rate is 95.3% (i.e., commas account for only about 4.7% of the test tokens).
So why was it only moderately helpful (< 10% absolute improvement) to tell the language model what the set of possible words was? Basically because the model was always guessing really high-probability unigrams. Below are the twenty most frequently predicted words when the model made an error, together with their counts. They're basically all stop words. (This is in the unrestricted setting.)
rank  count  word
   1  14722  ,
   2   1393  .
   3   1298  the
   4    512  and
   5    485  in
   6    439  of
   7    270  to
   8    163  ''
   9    157  a
  10    108  is
  11     54  's
  12     52  have
  13     49  said
  14     41  ``
  15     38
  16     38  for
  17     38  are
  18     34  %
  19     33  $
  20     31  be
The same list for the restricted setting is virtually identical, basically because most of these words are available in the average sentence:
rank  count  word
   1  10194  ,
   2   5357  .
   3   1274  the
   4    274  of
   5    251  in
   6    232  to
   7    230  and
   8    193  a
   9     51
  10     36  for
  11     28  's
  12     25  ''
  13     24  said
  14     24  is
  15     21  that
  16     21  ``
  17     16  it
  18     15  from
  19     14  be
  20     13  by
Oh well.
OK, so I don't have a "good" counter-proposal to perplexity. Error rate certainly has many issues of its own. You could use IR-like metrics, like recall @ 10, mean average precision, etc., which are all questionable in their own ways.
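For what it's worth, those rank-based metrics fall out of the same setup; here's a minimal sketch (again using the hypothetical `lm.logprob` accessor from above):

```python
# Sketch of rank-based metrics: rank the vocabulary by the model's score at each
# position and record where the gold word lands.

def rank_metrics(test_sentences, vocab, lm, order=5, k=10):
    hits_at_k, reciprocal_ranks, total = 0, 0.0, 0
    for sentence in test_sentences:
        history = ["<s>"]
        for gold in sentence:
            context = tuple(history[-(order - 1):])
            ranked = sorted(vocab, key=lambda w: -lm.logprob(w, context))
            # OOV gold words get a rank just past the end of the vocabulary
            rank = ranked.index(gold) + 1 if gold in ranked else len(vocab) + 1
            hits_at_k += (rank <= k)
            reciprocal_ranks += 1.0 / rank
            total += 1
            history.append(gold)
    return {"recall@%d" % k: hits_at_k / total, "MRR": reciprocal_ranks / total}
```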
I would just, in general, like it if we could evaluate language models without being handcuffed by probabilities.
Comments
It seems the problem here isn't really with how we could better evaluate LMs but what we want LMs to do in the first place. If we want to characterize the distribution over sentences (perhaps conditional on something, like a sentence in another language that you want to translate, or an image you want to describe), then perplexity/cross entropy/etc are obvious evaluations. On the other hand, if we want to get away from perplexity as an evaluation, we need to start by asking for different things from language models. What might these alternatives look like? One possible formulation might be LMs that assign acceptability judgements (* or no *), and this would motivate accuracy or AUC or something. Or maybe a ranking model of the next word in a context?
All that being said, as someone who is perfectly content with probabilistic LMs, I do think there is a lot about evaluation methodology to criticize. The standard practice of fixing vocabulary sizes and leaving out all of the interesting, high-information words from the test set seems to be in no danger of being fixed or even acknowledged (although a few of us do care). And, since standard methodology makes comparing across training sets fraught, analysis of generalization from different amounts of data is virtually unheard of in the LM literature, which is sad considering this is probably one of the most interesting questions in the language sciences.
Have you read Noah Smith's paper on adversarial evaluation? Basically, it proposes that your model is good insofar as it can distinguish between a genuine sample of natural language, and a subtly altered version of the sample.
Of course, "good" is here relative to the state-of-the-art machine that produces the altered versions of text. This defines a cycle of interdependent technical advances.
One motivation for this kind of evaluation is that it is not dependent on probabilities.
Hi Hal,
Guess you are aware of these, but just in case! A recent proposal that tries to move away from perplexity is the sentence completion task by MSR:
http://research.microsoft.com/pubs/163344/semco.pdf
The idea is that given a sentence with one missing word, the model has to pick the original word out of five candidates. It is similar to the word error rate, but the sentences and the candidates are chosen so that there is only one correct answer.
Some more ideas for evaluation beyond perplexity can be found in the syntactic LM paper from Berkeley:
http://www.cs.berkeley.edu/~adpauls/PAPERS/acl2012.pdf
In section 5.3, Pauls and Klein evaluate their model on three pseudo-negative classification tasks, i.e. telling a "good" sentence from an artificially constructed "bad" one.
One nice way to evaluate your language model would be via the Shannon game, i.e., have your model guess the next symbol (word, character, etc.) until it gets it right. Shannon used the number of guesses (i.e., the position of the correct word in a ranked list) to establish the entropy of the language; you could use it to compare the quality of two models. This is related to your 'error rate' measure, but presumably generally more informative (as tag accuracy is usually more informative than sentence accuracy). And applicable to non-probabilistic LMs.
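A minimal sketch of that guessing game, assuming only a hypothetical `rank_candidates(context)` that returns the model's guesses best-first (no probabilities required):

```python
# Sketch of the Shannon guessing game: count how many guesses the model needs
# before hitting the right next word. `rank_candidates(context)` is a
# hypothetical function yielding candidate words, best guess first, and is
# assumed to eventually produce the gold word (e.g. by covering the vocabulary).

def mean_guess_count(test_sentences, rank_candidates, order=5):
    total_guesses, total_words = 0, 0
    for sentence in test_sentences:
        history = ["<s>"]
        for gold in sentence:
            context = tuple(history[-(order - 1):])
            for position, candidate in enumerate(rank_candidates(context), start=1):
                if candidate == gold:
                    total_guesses += position
                    break
            total_words += 1
            history.append(gold)
    return total_guesses / total_words
```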
Note that this procedure could be used to evaluate in Chris' open vocabulary scenario, by asking for a prediction at the individual (utf8) symbol granularity. Here I think one important difference between the probabilistic and non-probabilistic models may lie in the ease of marginalizing from predictions over multi-symbol strings (words) to single symbols. In the absence of this, I think models can get arbitrarily large.
of course, it all depends what task the model is being built for, which is why I do generally like extrinsic evaluation better, even if it muddies the water a bit.
interesting post!
I've had word prediction accuracy/error correlate well with word error rate (for keyboard input in Swype). It's a handy way to evaluate without a full system. It can lead to weirdness if you optimize to the metric though, leading to models that favor lots of contexts at the cost of good breadth.
All in all, I'm a lot more comfortable with WER metrics than PP though.
Having a probabilistic language model is nice if you want to combine it with something else, such as using it in a decoder with an acoustic model or using it in a classifier with a distribution over categories. In both cases, you can in theory apply Bayes's rule to the result and make predictions. It also composes neatly with probabilistic transducers, like pronunciation models, which can then be used for applications like transliteration.
Having said all that, the coupling of acoustic and language models is pretty weak due to how poor acoustic models are predictively compared to language models. And compared to word or character level features, the distribution over categories rarely contributes much in the way of classification power for inputs of any size.
@Unknown (1) LingPipe, by the way, supports properly normalized language models at the character (code point) or byte (encoding) level. It also provides models at the character level that can be used for smoothing word-level models. Character-based models tend to provide lower entropy for a given memory profile in my experience.
@Unknown (2) Shannon's solutions are always so elegant. It's an awesome strategy to get a probabilistic evaluation for a non-probabilistic language model. Like say, humans, who can't report their probabilities. Or non-probabilistic computer systems. Is that how Roark et al. evaluated?
Hal --- is your concern more that we're optimizing the wrong criterion by minimizing entropy?
@Chris: one option is to ask a language model to put a bag of words (BOW) in the correct order. for problems like text-to-text generation (or more specifically MT) this seems like a pretty reasonable desideratum.
@Chris: I agree fixed vocab is a huge problem with LMs, but I think it's also a bit of an artifact of perplexity. because we insist on probabilistic models, we have to be really really clever to make sure to get the math right to enable LMs that scale beyond the fixed vocab. doing so might be a lot easier without this requirement of sum-to-one.
@Amber: I hadn't seen that paper, thanks. Perturbation is a reasonable strategy and aligns with my first suggestion to Chris.
@Unknown: I like the Shannon idea... it's going to be pretty similar to mean reciprocal rank (some transformation thereof).
@lingpipe (Bob? :P): yeah, probabilistic semantics are nice, but Bayes' rule only applies when the models are exact and they never are, so people put exponents and various other kludges on them, which then begs the question of why we want probabilistic models in the first place.
@lingpipe: I'm not concerned that it's the "wrong criterion," but just that by choosing perplexity we've a priori chosen the class of models that we're willing to consider, and I think that's bad.
@Hal: Exactly --- the exponentiation is a crude calibration for failed independence assumptions in the acoustic models.
The nice thing about using entropy as a measure is that it's a reasonable gauge of overall model calibration (one bad estimate can kill you, which is why I think most models are conservative). That's what Shannon was trying to evaluate by looking at humans vs. n-grams, I suspect.
Perplexity is for inflating performance improvements for publication; see Josh Goodman's "A Bit of Progress in Language Modeling" paper, which should be required reading for anyone doing language modeling (unless there's a better, more recent alternative).
@Chris and Hal: As to perturbation, I think Jason Eisner was using that ages ago to provide discriminative training data for models (I can't recall which kinds offhand).
@Chris and Hal: Frank Wood's sequence memoizer has some neat ideas based on non-parametric Bayesian models for unbounded vocabulary modeling.
@Hal: It's indeed Bob here. I should've never chosen the LingPipe WordPress login. All my posts on the LingPipe blog reverted to Breck when I transferred ownership of the domain.
I agree that there are problems in evaluating language models and in the way we apply perplexity measure.
My PhD thesis (only with preliminary results so far) goes in this direction. That is why I would be highly interested in reproducing your experiments and having access to the scripts.
Thanks for sharing your insights!
Since it wasn't mentioned: there is a rather infamous paper by Chen et al. (http://www.cs.cmu.edu/afs/.cs.cmu.edu/Web/People/roni/papers/eval-metrics-bntuw-9802.pdf) suggesting that perplexity is not a good metric.
I think Brants et al. also go in your direction, using stupid backoff and stating something like "It's not a probability distribution, but what works works": http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.1126&rep=rep1&type=pdf