07 May 2007

Non-linear models in NLP

If you talk to (many) "real" machine learning people, they express profound disbelief that almost everything we do in NLP is based on linear classifiers (maxent/logistic regression, SVMs, the perceptron, etc.). We only rarely use kernels, and while decision trees used to be popular, they seem to have fallen out of favor; very few people use them for large-scale applications. (Our NYU friends are an exception.)

There are two possible explanations for this: (1) we really only need linear models; (2) we're too lazy to use anything other than linear models (or, alternatively, non-linear models don't scale). My experience tells me that for most of our sequence-y problems (parsing, tagging, etc.), there's very little to be gained by moving to, e.g., quadratic SVMs. I even tried doing NE tagging with boosted decision trees under Searn, because I really wanted it to work nicely, but it failed. I've also pondered the idea of making small decision trees with perceptrons at the leaves, so as to account for small amounts of non-linearity. Using default DT construction technology (e.g., information gain), this doesn't seem to help either. (Ryan McDonald has told me that other people have tried something similar and it hasn't worked for them either.) Perhaps this is because IG is the wrong metric to use (there exist DTs for regression with linear models at the leaves, typically learned so as to maximize the "linearness" of the underlying data, but this is computationally too expensive).
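
To make the "small decision trees with perceptrons at the leaves" idea concrete, here's a minimal sketch in Python (scikit-learn), assuming a generic feature-matrix classification setup. It illustrates the architecture only; it is not the Searn/NE-tagging experiment described above, and the class and variable names are illustrative.

    # Sketch: route examples with a shallow, information-gain-style tree, then
    # let a linear perceptron soak up whatever non-linearity remains per leaf.
    # Illustrative only -- names and setup are assumptions, not the original code.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import Perceptron

    class TreeOfPerceptrons:
        def __init__(self, depth=2):
            self.router = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
            self.leaf_models = {}  # leaf id -> fitted Perceptron, or a constant label

        def fit(self, X, y):
            self.router.fit(X, y)
            leaves = self.router.apply(X)            # leaf index for each example
            for leaf in np.unique(leaves):
                mask = leaves == leaf
                if len(np.unique(y[mask])) == 1:     # pure leaf: memorize the label
                    self.leaf_models[leaf] = y[mask][0]
                else:                                # impure leaf: train a linear model
                    self.leaf_models[leaf] = Perceptron().fit(X[mask], y[mask])
            return self

        def predict(self, X):
            leaves = self.router.apply(X)
            preds = []
            for x, leaf in zip(X, leaves):
                model = self.leaf_models[leaf]
                if isinstance(model, Perceptron):
                    preds.append(model.predict(x.reshape(1, -1))[0])
                else:
                    preds.append(model)
            return np.array(preds)

The point of the design is that the tree only has to carve the space into a few regions, each of which still gets a plain linear classifier; if the splitting criterion (IG) picks splits that don't improve linear separability within the leaves, the whole thing buys you nothing, which is consistent with the negative results above.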

One counter-example is the gains people have gotten from latent variable models (e.g., Koo and Collins), which are essentially non-linearified linear models. In a somewhat similar vein, one could consider "edge" features in CRFs (or any structured prediction technique) to be non-linear features, but this is perhaps stretching it.

Part of this may be because, over time, we've adapted to using features that don't need to be non-linearified. If we went back and treated each character in a word as a single feature, and then required the learning algorithm to recover the important features (like "word is 'Bush'") on its own, then clearly non-linearity would be required. This is essentially what vision people do, and exactly the setting where things like deep belief networks really shine. But so long as we're subjecting our learning algorithms to featuritis (John Langford's term), perhaps there's really not much to gain.
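
As a tiny illustration of that contrast (my own toy example, not from the post): a character-level representation forces the learner to conjoin many raw features just to recover something like "word is 'Bush'", whereas the engineered representation hands it that fact as a single feature a linear model can weight directly.

    def char_features(word):
        # raw, position-indexed characters: a linear model would need
        # conjunctions of c0=B, c1=u, c2=s, c3=h to recover "word is 'Bush'"
        return {f"c{i}={ch}": 1 for i, ch in enumerate(word)}

    def engineered_features(word):
        # the kind of features NLPers typically hand-build (featuritis)
        return {
            f"word={word}": 1,
            f"lower={word.lower()}": 1,
            f"suffix3={word[-3:]}": 1,
            "is_capitalized": int(word[0].isupper()),
        }

    print(char_features("Bush"))        # {'c0=B': 1, 'c1=u': 1, 'c2=s': 1, 'c3=h': 1}
    print(engineered_features("Bush"))  # includes 'word=Bush': 1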

Comments:

Anonymous said...

I question Hal's presupposition that "almost everything we do in NLP is based on linear classifiers".

To the extent that we build taggers, chunkers, parsers, machine translators, speech recognizers, noisy channel spelling correctors, latent topic clusterers, social network analyzers, etc., we're not using linear models, at least in the usual sense. Some of these applications may have linear components, of course.

About the only threads of research I can think of that look even partially like linear modeling to the rest of the world would be the log-linear models of max ent and CRFs, simple TF/IDF models in information retrieval, the intriguing joint models of Dan Roth and crew at Illinois, and the use of singular value decomposition (SVD) for linear smoothing.

I have more understanding of the shock of the statisticians, which derives from machine learning's wham-bam focus on maximum a posteriori (MAP) models and first-best evaluations (0/1 loss) rather than Bayesian uncertainty estimates and probabilistic evaluation (log likelihood loss). I think this explains the prevalence of discriminative, non-probabilistic techniques like perceptrons and SVMs, which also happen to be linear.
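
A quick numerical illustration of the two evaluations being contrasted (the numbers are made up): 0/1 loss only asks whether the MAP label is correct, while log-likelihood loss also charges the model for how little probability it put on the truth.

    import math

    predicted_dist = {"NN": 0.4, "VB": 0.35, "JJ": 0.25}  # model's distribution over tags
    true_tag = "NN"

    map_tag = max(predicted_dist, key=predicted_dist.get)
    zero_one_loss = 0 if map_tag == true_tag else 1       # first-best / MAP evaluation
    log_loss = -math.log(predicted_dist[true_tag])        # probabilistic evaluation

    print(zero_one_loss)       # 0     -- the argmax happens to be right
    print(round(log_loss, 3))  # 0.916 -- but only 0.4 of the mass is on the truth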

hal said...

Hrm... maybe I'm thinking of a different sense of "linear", but all of the examples you list in your second paragraph I would consider linear models (well, not topic models, and perhaps not some social network models). If you think of almost any parser, the score for a tree is a product of a bunch of smaller scores, each of which is an exponentiated linear function, so the overall log score is still a linear function.
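
Concretely (with made-up numbers, just to illustrate that claim): if each local part is scored by exp(w . f(part)), then the log score of the whole tree is w . (sum of the parts' feature vectors), which is still linear in w.

    import numpy as np

    w = np.array([0.5, -1.0, 2.0])               # shared weight vector (toy values)
    part_feats = [np.array([1, 0, 1]),           # feature vector of each local rule/edge
                  np.array([0, 1, 0]),
                  np.array([1, 1, 0])]

    log_score_by_parts = sum(w @ f for f in part_feats)   # sum of local linear scores
    log_score_global = w @ sum(part_feats)                # one linear score on summed features

    assert np.isclose(log_score_by_parts, log_score_global)   # identical: linear in w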

I think the only examples where this doesn't hold are really the unsupervised systems... but even then, we typically assume that each latent component yields a linear function (certainly true in MT and speech). So while the overall function isn't linear, it's a mixture of linear functions.

I guess the primary comparison I'm drawing is to vision and robotics problems, where they *never* use linear kernels for anything because they suck terribly, unless you do massive feature engineering (e.g., wavelets).

Kevin Duh said...

I'm personally a big fan of linear models, due to their simplicity of implementation, speed, ease of use, etc. I think the reason NLPers often get away with linear models is that our features are often discrete (or even binary), and the simple method of generating a bunch of feature combinations (thereby explicitly constructing polynomial-type kernels) works pretty well. I don't know whether this is the best possible solution for language data, but it certainly is reasonable.
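
For instance (a toy sketch of the feature-combination trick, not anyone's actual system): with binary features, adding all pairwise conjunctions by hand is the explicit version of a degree-2 polynomial kernel.

    from itertools import combinations

    def add_conjunctions(active_feats):
        # active_feats: set of active binary feature names; add all pairwise ANDs
        expanded = set(active_feats)
        for a, b in combinations(sorted(active_feats), 2):
            expanded.add(f"{a}&{b}")
        return expanded

    print(sorted(add_conjunctions({"cap", "word=Bush", "prev=President"})))
    # ['cap', 'cap&prev=President', 'cap&word=Bush', 'prev=President',
    #  'prev=President&word=Bush', 'word=Bush']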

Anonymous said...

Hello, my name is Wanee. I'm a university student from Malaysia. My group needs to build an NLP system, so I want to know which language (Java, Prolog, or anything else) is suitable for building this system. Please send your suggestion to my email, wanee_snar1803@yahoo.com. Thank you very much.

Anonymous said...

I think I see what Hal's saying now -- as long as we make naive independence assumptions that lead to our adding log probabilities, our models look linear in some sense. For instance, PCFGs model the log probability of a tree as the sum of the log probabilities of its local trees. HMMs model the log probability of an underlying state sequence and output as a sum of the log probabilities of the transitions and emissions.

But as soon as we start asking interesting probabilistic questions, the models no longer look linear. For instance, the probability of a token having a given tag in an HMM part-of-speech model requires forward/backward; the probability of a category spanning a text chunk in a PCFG requires inside/outside. Both of these require sums of probabilities (not sums of log probabilities), so the overall probability estimates are not linear.
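
A tiny toy HMM (two tags, two word types; made-up numbers) makes the distinction concrete: the log probability of one full tag sequence is a plain sum of local log scores, but a single token's tag marginal needs a sum over all paths, i.e. log-sum-exps, which are no longer linear in those log scores.

    import numpy as np

    log_init = np.log(np.array([0.5, 0.5]))        # P(tag_0)
    log_trans = np.log(np.array([[0.7, 0.3],       # P(tag_t | tag_{t-1})
                                 [0.4, 0.6]]))
    log_emit = np.log(np.array([[0.9, 0.1],        # P(word | tag), two word types
                                [0.2, 0.8]]))
    words = [0, 1]                                 # a two-word "sentence"

    # linear: log P(tags=[0,1], words) is just a sum of local log scores
    one_path = (log_init[0] + log_emit[0, words[0]]
                + log_trans[0, 1] + log_emit[1, words[1]])

    # non-linear: P(tag at position 1 = 1 | words) needs forward (and, in
    # general, backward) sums over paths, i.e. log-sum-exps of the local scores
    alpha0 = log_init + log_emit[:, words[0]]
    alpha1 = np.logaddexp.reduce(alpha0[:, None] + log_trans, axis=0) + log_emit[:, words[1]]
    marginal = np.exp(alpha1[1] - np.logaddexp.reduce(alpha1))

    print(one_path, marginal)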

As soon as we do the kernel trick (explicitly or implicitly), we're also non-linear in the underlying features. For instance, we might model feature F1 and feature F2, and then introduce a feature F1and2 which is on only if both F1 and F2 are on. A linear model fit in such an enhanced feature space (with conjunctions, or generalizations to products if features are more than indicators) would then not be linear in Hal's sense (I'm guessing).

One more thing: I've been reading Gelman and Hill's new multi-level regression book (which I'd highly recommend), and was thinking in terms of numerical predictions. In this sense, NLP tends not to model continuous variables, as are found in, say, the Netflix task.

hal said...

i think we're on the same page now :). except i don't follow the "interesting questions" bit. regardless of what question we ask, the model is still linear. sure, the answer may not be: looking for a tag marginal won't be linear. but by the same token, looking for the probability of a sequence squared won't be linear either (because it's squared), and that's not a property of the model.

yes, the kernel trick throws us into non-linear land, but how many people actually use kernels? and i myself have sometimes used conjunctions, etc., but these i found by hand rather than automatically (so i still consider the underlying model linear). if the learner found the conjunctions itself, i would consider it non-linear.

i think i'll post on the 0/1 versus probabilistic evaluation issue soon :).
