Comments on natural language processing blog: "Perplexity versus error rate for language modeling" (9 comments)

Rene Pickhardt (http://www.rene-pickhardt.de), 2014-06-03 04:17:

I agree that there are problems in evaluating language models and in the way we apply the perplexity measure.

My PhD thesis (so far with only preliminary results) goes in this direction. That is why I would be highly interested in reproducing your experiments and having access to the scripts.

Thanks for sharing your insights!

Since it was not mentioned yet: there is a rather infamous paper by Chen suggesting that perplexity is not a good metric: http://www.cs.cmu.edu/afs/.cs.cmu.edu/Web/People/roni/papers/eval-metrics-bntuw-9802.pdf

I think Brants et al. also go in your direction with stupid backoff, stating something like "it's not a probability distribution, but what works works": http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.1126&rep=rep1&type=pdf

Anonymous, 2014-05-20 15:15:

@Hal: Exactly --- the exponentiation is a crude calibration for failed independence assumptions in the acoustic models.

The nice thing about using entropy as the measure is that it's a reasonable gauge of overall model calibration (one bad estimate can kill you, which is why I think most models are conservative). That's what Shannon was trying to evaluate by looking at humans vs. n-grams, I suspect.

Perplexity is for inflating performance improvements for publication; see Josh Goodman's "bit of progress" paper (http://research.microsoft.com/en-us/um/redmond/groups/srg/papers/2001-joshuago-tr72.pdf), which should be required reading for anyone doing language modeling (unless there's a better, more recent alternative).

@Chris and Hal: As to perturbation, I think Jason Eisner was using that ages ago to provide discriminative training data for models (I can't recall which kinds offhand).

@Chris and Hal: Frank Wood's sequence memoizer has some neat ideas based on non-parametric Bayesian models for unbounded vocabulary modeling.

@Hal: It's indeed Bob here. I should never have chosen the LingPipe WordPress login. All my posts on the LingPipe blog reverted to Breck when I transferred ownership of the domain.
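For readers who haven't met the exponent Bob mentions: speech decoders typically scale the language model log probability before adding it to the acoustic score, because the acoustic model's frame-level independence assumptions make its likelihoods overconfident. A minimal sketch of that kind of log-linear combination follows; the function, the weights, and the numbers are illustrative, not taken from any particular recognizer.

```python
import math

def combined_score(log_p_acoustic, log_p_lm, num_words,
                   lm_scale=12.0, word_insertion_penalty=0.0):
    """Score used to rank ASR hypotheses (illustrative, not from a real toolkit).

    lm_scale is the 'exponentiation': adding lm_scale * log P_lm is the same as
    multiplying by P_lm ** lm_scale, which compensates for overconfident acoustic
    likelihoods. The combined score is no longer a normalized probability.
    """
    return (log_p_acoustic
            + lm_scale * log_p_lm
            + word_insertion_penalty * num_words)

# Two hypotheses with identical acoustic scores are separated almost
# entirely by the scaled language model term.
hyp_good_lm = combined_score(-120.0, math.log(1e-4), num_words=5)
hyp_bad_lm = combined_score(-120.0, math.log(1e-6), num_words=5)
print(hyp_good_lm > hyp_bad_lm)  # True
```

Because the scale is tuned to minimize word error rate rather than derived from the models themselves, hypotheses are no longer ranked by a Bayesian posterior, which is the kludge being discussed.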
hal, 2014-05-20 13:01:

@Chris: one option is to ask a language model to put a bag of words (BOW) in the correct order. For problems like text-to-text generation (or more specifically MT) this seems like a pretty reasonable desideratum.

@Chris: I agree fixed vocab is a huge problem with LMs, but I think it's also a bit of an artifact of perplexity. Because we insist on probabilistic models, we have to be really, really clever to get the math right to enable LMs that scale beyond the fixed vocab. Doing so might be a lot easier without this requirement of sum-to-one.

@Amber: I hadn't seen that paper, thanks. Perturbation is a reasonable strategy and aligns with my first suggestion to Chris.

@Unknown: I like the Shannon idea... it's going to be pretty similar to mean reciprocal rank (some transformation thereof).

@lingpipe (Bob? :P): yeah, probabilistic semantics are nice, but Bayes' rule only applies when the models are exact, and they never are, so people put exponents and various other kludges on them, which then begs the question of why we want probabilistic models in the first place.

@lingpipe: I'm not concerned that it's the "wrong criteria," but just that by choosing perplexity we've a priori chosen the class of models that we're willing to consider, and I think that's bad.

Anonymous, 2014-05-19 13:48:

Having a probabilistic language model is nice if you want to combine it with something else, such as using it in a decoder with an acoustic model or using it in a classifier with a distribution over categories. In both cases, you can in theory apply Bayes's rule to the result and make predictions. It also composes neatly with probabilistic transducers, like pronunciation models, which can then be used for applications like transliteration.

Having said all that, the coupling of acoustic and language models is pretty weak due to how poor acoustic models are predictively compared to language models. And compared to word- or character-level features, the distribution over categories rarely contributes much in the way of classification power for inputs of any size.

@Unknown (1): LingPipe, by the way, supports properly normalized language models at the character (code point) or byte (encoding) level. It also provides models at the character level that can be used for smoothing word-level models. Character-based models tend to provide lower entropy for a given memory profile, in my experience.

@Unknown (2): Shannon's solutions are always so elegant. It's an awesome strategy to get a probabilistic evaluation for a non-probabilistic language model. Like, say, humans, who can't report their probabilities. Or non-probabilistic computer systems. Is that how Roark et al. evaluated?

Hal --- are your concerns more that we're optimizing the wrong criteria by minimizing entropy?
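To make the "classifier with a distribution over categories" composition concrete, here is a toy sketch of Bayes's rule with one language model per category acting as the class-conditional likelihood. The add-alpha unigram character models below are stand-ins for real LMs, and none of this is LingPipe code.

```python
import math
from collections import Counter

def train_char_lm(texts, alpha=0.5):
    """Add-alpha smoothed unigram character model: a crude stand-in for a real LM."""
    counts = Counter(ch for t in texts for ch in t)
    total = sum(counts.values())
    vocab = set(counts) | {"<unk>"}
    return {ch: (counts.get(ch, 0) + alpha) / (total + alpha * len(vocab))
            for ch in vocab}

def log_p(lm, text):
    # Unseen characters fall back to the <unk> probability.
    return sum(math.log(lm.get(ch, lm["<unk>"])) for ch in text)

def classify(text, lms, priors):
    # Bayes's rule up to a constant: argmax_c  log P(c) + log P(text | c)
    return max(priors, key=lambda c: math.log(priors[c]) + log_p(lms[c], text))

lms = {"en": train_char_lm(["the cat sat", "a dog ran"]),
       "de": train_char_lm(["der hund lief", "die katze sass"])}
priors = {"en": 0.5, "de": 0.5}
print(classify("the dog sat", lms, priors))  # -> 'en'
```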
Keith, 2014-05-18 00:28:

I've had word prediction accuracy/error correlate well with word error rate (for keyboard input in Swype). It's a handy way to evaluate without a full system. It can lead to weirdness if you optimize to the metric, though, leading to models that favor lots of contexts at the cost of good breadth.

All in all, I'm a lot more comfortable with WER metrics than PP, though.

Unknown, 2014-05-17 09:27:

One nice way to evaluate your language model would be via the Shannon game, i.e., have your model guess the next symbol (word, character, etc.) until it gets it right. Shannon used the number of guesses (i.e., the position of the correct word in a ranked list) to establish the entropy of the language; you could use it to compare the quality of the two models. This is related to your 'error rate' measure, but presumably generally more informative (as tag accuracy is usually more informative than sentence accuracy). And it is applicable to non-probabilistic LMs.

Note that this procedure could be used to evaluate in Chris's open-vocabulary scenario, by asking for a prediction at the individual (UTF-8) symbol granularity. Here I think one important difference between the probabilistic and non-probabilistic models may lie in the ease of marginalizing from predictions over multi-symbol strings (words) to single symbols. In the absence of this, I think models can get arbitrarily large.

Of course, it all depends what task the model is being built for, which is why I do generally like extrinsic evaluation better, even if it muddies the water a bit.

Interesting post!
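A rough sketch of the Shannon-game procedure described in the comment above: the model produces a ranked list of candidates at each position, and the rank of the true symbol is the number of guesses. The model interface and the toy frequency-ordered vocabulary are assumptions for illustration, and the log2-of-rank summary is only a rank-based stand-in for Shannon's actual entropy bounds.

```python
import math

def shannon_game(sentences, rank_next_words):
    """rank_next_words(context) -> list of candidate words, best guess first.

    Returns the average guess count and the average log2 of the rank, a
    rank-based stand-in for per-word surprisal."""
    guesses, log_ranks = [], []
    for sent in sentences:
        for i, word in enumerate(sent):
            ranking = rank_next_words(tuple(sent[:i]))
            rank = ranking.index(word) + 1 if word in ranking else len(ranking) + 1
            guesses.append(rank)
            log_ranks.append(math.log2(rank))
    return sum(guesses) / len(guesses), sum(log_ranks) / len(log_ranks)

# Toy "model": a fixed frequency-ordered vocabulary that ignores context.
VOCAB_BY_FREQUENCY = ["the", "cat", "sat", "on", "mat"]
avg_guesses, avg_log2_rank = shannon_game(
    [["the", "cat", "sat", "on", "the", "mat"]],
    lambda context: VOCAB_BY_FREQUENCY)
print(avg_guesses, avg_log2_rank)
```

Nothing here requires the ranker to be probabilistic, which is the point about applicability to non-probabilistic LMs (and to humans).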
andreas vlachos, 2014-05-17 04:51:

Hi Hal,

Guess you are aware of these, but just in case! A recent proposal that tries to move away from perplexity is the sentence completion task by MSR:

http://research.microsoft.com/pubs/163344/semco.pdf

The idea is that, given a sentence with one missing word, the model has to pick the original word out of five candidates. It is similar to the word error rate, but the sentences and the candidates are chosen so that there is only one correct answer.

Some more ideas for evaluation beyond perplexity can be found in the syntactic LM paper from Berkeley:

http://www.cs.berkeley.edu/~adpauls/PAPERS/acl2012.pdf

In section 5.3, Pauls and Klein evaluate their model on three pseudo-negative classification tasks, i.e., telling a "good" sentence from an artificially constructed "bad" one.

Amber O'Hearn, 2014-05-16 18:41:

Have you read Noah Smith's paper on adversarial evaluation (http://arxiv.org/abs/1207.0245)? Basically, it proposes that your model is good insofar as it can distinguish between a genuine sample of natural language and a subtly altered version of the sample.

Of course, "good" here is relative to the state-of-the-art machine that produces the altered versions of text. This defines a cycle of interdependent technical advances.

One motivation for this kind of evaluation is that it is not dependent on probabilities.
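In the cheapest version of this family of evaluations, close to hal's perturbation suggestion earlier in the thread, one measures how often a model prefers a genuine sentence to a mechanically corrupted copy; Smith's proposal replaces the trivial corruption below with a trained adversary. The corruption and scoring interfaces here are assumptions for illustration, not taken from the paper.

```python
import random

def swap_adjacent(sentence, rng):
    """Cheap 'perturbation': swap one randomly chosen adjacent pair of words."""
    words = sentence[:]
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return words

def preference_accuracy(sentences, score, rng=None):
    """Fraction of cases where the model scores the genuine sentence above its
    perturbed copy. Works for any scorer, probabilistic or not."""
    rng = rng or random.Random(0)
    wins = 0
    for sent in sentences:
        corrupted = swap_adjacent(sent, rng)
        wins += score(sent) > score(corrupted)
    return wins / len(sentences)

# Toy scorer: number of adjacent pairs seen in a tiny "training" corpus.
SEEN_BIGRAMS = {("the", "cat"), ("cat", "sat"), ("sat", "on"),
                ("on", "the"), ("the", "mat")}
score = lambda s: sum((a, b) in SEEN_BIGRAMS for a, b in zip(s, s[1:]))
print(preference_accuracy([["the", "cat", "sat", "on", "the", "mat"]], score))
```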
Chris, 2014-05-16 17:22:

It seems the problem here isn't really how we could better evaluate LMs, but what we want LMs to do in the first place. If we want to characterize the distribution over sentences (perhaps conditional on something, like a sentence in another language that you want to translate, or an image you want to describe), then perplexity/cross-entropy/etc. are the obvious evaluations. On the other hand, if we want to get away from perplexity as an evaluation, we need to start by asking for different things from language models. What might these alternatives look like? One possible formulation might be LMs that assign acceptability judgements (* or no *), and this would motivate accuracy or AUC or something. Or maybe a ranking model of the next word in a context?

All that being said, as someone who is perfectly content with probabilistic LMs, I do think there is a lot about evaluation methodology to criticize. The standard practice of fixing vocabulary sizes and leaving out all of the interesting, high-information words from the test set seems to be in no danger of being fixed or even acknowledged (http://googleresearch.blogspot.com/2014/04/a-billion-words-because-todays-language.html), although a few of us do care (http://aclweb.org/anthology/N/N13/N13-1140.pdf, http://dl.acm.org/citation.cfm?id=146685, http://www.cs.utoronto.ca/~ilya/pubs/2011/LANG-RNN.pdf). And, since standard methodology makes comparing across training sets fraught, analysis of generalization from different amounts of data is virtually unheard of in the LM literature, which is sad considering this is probably one of the most interesting questions in the language sciences.
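One way to see the force of the fixed-vocabulary complaint is that the reported perplexity depends on how many test tokens get collapsed into an <unk> symbol that the model is allowed to predict cheaply. A toy illustration with an add-one unigram model; the corpus and vocabulary sizes are made up for the example.

```python
import math
from collections import Counter

def perplexity(test_tokens, train_tokens, vocab_size):
    """Add-one smoothed unigram model over the `vocab_size` most frequent
    training words; everything else is collapsed to a single <unk> token."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    mapped_train = [w if w in vocab else "<unk>" for w in train_tokens]
    mapped_test = [w if w in vocab else "<unk>" for w in test_tokens]
    counts = Counter(mapped_train)
    denom = len(mapped_train) + len(vocab) + 1  # +1 for <unk>
    log_prob = sum(math.log((counts[w] + 1) / denom) for w in mapped_test)
    return math.exp(-log_prob / len(mapped_test))

train = "the cat sat on the mat while the dog ran in the park".split()
test = "the dog sat in the unfamiliar park".split()
# Shrinking the vocabulary funnels rare (high-information) test words into
# <unk>, and the reported perplexity goes down, not up.
for size in (12, 6, 3):
    print(size, round(perplexity(test, train, size), 2))
```

Shrinking the vocabulary makes the reported number look better even though the model says less and less about the actual words, which is one reason cross-paper and cross-training-set comparisons are so fraught.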