21 January 2009

Featuritis in NLP

Featuritis (term from John Langford) is (in my mind) the process of throwing a ton of features into a learning system without thinking about it.

Long gone are the days when one had to select a small number of useful features for supervised learning problems. Now that maxent has replaced naive Bayes and CRFs have replaced HMMs, we are free to throw a gigantic number of features into our learning problems with little to no repercussions (beyond a constant computation factor). The one key exception to this is MT, where the number of features is typically kept small because the algorithms we currently use for feature weight tuning (eg., MERT) scale badly in the number of features. I know there is lots of work to get around this, but I think it's fair to say that it is still not standard practice.
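To be concrete, the featuritis recipe in a linear model looks roughly like the sketch below: fire a pile of indicator features for each token and give every (feature, label) pair its own weight. The feature templates and the maxent-style scoring are illustrative assumptions, not any particular system's feature set.

    from collections import defaultdict

    def token_features(tokens, i):
        """String-valued indicator features for position i (add more at will)."""
        w = tokens[i]
        return [
            "w=" + w,
            "lower=" + w.lower(),
            "pre3=" + w[:3],
            "suf2=" + w[-2:],
            "suf3=" + w[-3:],
            "shape=" + "".join("X" if c.isupper() else "x" if c.isalpha()
                               else "d" if c.isdigit() else c for c in w),
            "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
            "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
        ]

    def score(weights, label, tokens, i):
        """Linear (maxent/CRF-style) score: sum the weights of fired (feature, label) pairs."""
        return sum(weights[(f, label)] for f in token_features(tokens, i))

    weights = defaultdict(float)  # weights would come from maxent/CRF training
    print(score(weights, "NN", "The cat sat".split(), 1))

The point is that adding yet another template is a one-line change, and the learner mostly shrugs it off.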

I think this is related to the fact that we cherish linear models.

That is, I think that a significant reason that featuritis has flourished is because linear models are pretty good at coping with it; and a reason that linear models have flourished is because they are computationally cheap and can always be "fixed up" by taking a featuritis approach.

I think a great point of contrast is the work that's been done in the machine learning community on using neural networks for solving NLP tasks. This work basically shows that if you're willing to give your machine learning algorithm much more power, you can kind of forget about representation. That is, just give yourself a feature for every word in your vocabulary (as you might, for instance, in language modeling), throw these through a convolution, then through a multilayer neural network and train it in a multitask fashion, making use of (essentially) Ando and Zhang-style auxiliary problems (from, eg., Wikipedia text) to do semi-supervised learning. And you do as well as a featuritis approach.
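Roughly, that pipeline looks like the sketch below: a lookup table of word vectors, a window-based layer standing in for the convolution, and a small multilayer net on top. Layer sizes, initialization, and the tanh nonlinearity are assumptions for illustration, not the exact published architecture.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, EMB, WINDOW, HIDDEN, TAGS = 30000, 50, 5, 100, 10

    E  = rng.normal(scale=0.1, size=(VOCAB, EMB))            # one vector per word
    W1 = rng.normal(scale=0.1, size=(HIDDEN, WINDOW * EMB))  # window -> hidden
    b1 = np.zeros(HIDDEN)
    W2 = rng.normal(scale=0.1, size=(TAGS, HIDDEN))          # hidden -> tag scores
    b2 = np.zeros(TAGS)

    def forward(word_ids, i):
        """Score the tags for position i from a window of word ids (0 = padding word)."""
        pad = WINDOW // 2
        padded = [0] * pad + list(word_ids) + [0] * pad
        x = E[padded[i:i + WINDOW]].reshape(-1)   # concatenate the window's vectors
        h = np.tanh(W1 @ x + b1)
        return W2 @ h + b2

    # In the multitask setup, E (and perhaps W1) is shared across tasks and trained
    # on the auxiliary problems; only W2 and b2 are specific to the task at hand.
    print(forward([17, 42, 7, 99, 3], 2))         # unnormalized scores for TAGS tags

The only "feature engineering" left is deciding that the input is word identities; everything else is weights.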

I suppose this is the standard "prior knowledge versus data" issue that comes up over and over again. Either I can put more prior knowledge into my system (eg., adding more features that I think are going to be useful) or I can put more data into my system. The nuance seems to be that I cannot make only this trade-off. When I add more data to my system, I also have to change my learning model: a simple linear approach no longer cuts it. The linear model on a simple feature space just doesn't have the representational power to learn what I would like it to learn. So I have to go to a more complex function class and therefore need more data to reliably estimate parameters.

So why isn't everyone using neural nets? Well, to some degree we've been conditioned not to like them. Seeing cool positive results makes it a bit enticing to forget why we were conditioned not to like them in the first place. To me, there are basically three advantages that linear models have over multilayer neural nets. The first is that there is very little model selection to do: with a neural net, since I have little experience, I have no idea how to choose the network architecture. The second is training efficiency. Linear models are just ridiculously fast to train, and neural nets (despite all the progress over the past 20 years) are still darn slow. (Although, at least neural nets are fast to predict with; unlike, say, kernel machines.) The third is non-convexity, which means that we probably have to do lots of random restarts.

I doubt the third issue (non-convexity) carries much weight in the NLP community. We're such fans of algorithms like EM (also non-convex) and Gibbs sampling (to which notions of convexity don't even apply) that I can't imagine this is the thing that's stopping us.

The first issue (choosing network structure) is roughly analogous to choosing a feature representation. I think the difference is that when I say "I add a feature that pulls out a two-character suffix of the word," I can see exactly how this might affect learning and why it might be useful. When I say that I add a new node in a hidden layer of a network, I have no idea really what's going to happen.

The second issue (speed) is actually probably non-trivial. When I'm training a relatively simple classifier or sequence labeler, I kind of expect it to be able to train in a matter of minutes or hours, not days or weeks. The primary issue doesn't seem to be that the representation itself makes things so much slower to train, but that (judging from experimental results) you really have to do the multitask learning (with tons of auxiliary problems) to make this work. This suggests that maybe what should be done is just to fix an input representation (eg., the word identities) and then have someone train some giant multitask network on this (perhaps a few of varying sizes) and then share them in a common format. Then, when I want to learn my specific task, I don't have to do the whole multitask thing and can just use that learned network structure and weights as an initial configuration for my network.
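A minimal sketch of that sharing idea, assuming the shared weights are simply shipped as arrays (the file name, the .npz format, and the variable names are made up for illustration; in practice you'd also ship the vocabulary mapping):

    import numpy as np

    def save_shared_representation(E, W1, b1, path="shared_repr.npz"):
        """Ship the learned lookup table and first layer in one file."""
        np.savez(path, E=E, W1=W1, b1=b1)

    def init_task_network(path, n_tags, seed=0):
        """Start a task-specific net from the shared weights; only the output layer is fresh."""
        shared = np.load(path)
        E, W1, b1 = shared["E"], shared["W1"], shared["b1"]
        rng = np.random.default_rng(seed)
        W2 = rng.normal(scale=0.1, size=(n_tags, W1.shape[0]))  # task-specific layer
        b2 = np.zeros(n_tags)
        return E, W1, b1, W2, b2  # fine-tune all of these, or freeze E and W1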

At the end of the day, you're still going to have to futz with something. You'll either stick with your friendly linear model and futz with features, or you'll switch over to the neural networks side and futz with network structure and/or auxiliary problem representation. It seems that, at least as of now, futzing is unavoidable. At least network structure futzing is somewhat automatable (see lots of work in the 80s and 90s), but this isn't the whole package.

(p.s., I don't mean to imply that there isn't other modern work in NLP that uses neural networks; see, for instance, Titov and Henderson, ACL 2007.)

Comments:

roschler said...

"futzing is unavoidable"

I think that is the truest statement I have ever heard about AI and perhaps software in general. :)

Yoav said...

Good points, interesting post!

But I don't believe this neural net approach is the way to go.

In fact, I think that giving a feature to every word in the vocabulary is in itself the core of the featuritis in NLP.

For the most part, word identities should not matter to a high-level NLP task. If word identities matter, then we are probably doomed with any supervised model -- we will never get a big enough corpus to estimate all the parameters. But using them makes things much easier for the so-called model designer: "ok, now let's also add the lexical items and see how it goes". Empirically, it usually improves the results. But this improvement is due to maybe 10% or so of the lexical items. The rest of them have no effect at all (but at least they usually don't hurt the performance). I believe we can do much much better than that. What we need instead are linguistically motivated features that can generalize well from our small training corpora to our test corpora, even with a relatively different vocabulary. I wish I knew how to find them ;)

I think that a lot of what's going on in the neural-net / Ando-and-Zhang approaches is that a part of the model is implicitly creating such generalizable features from the lexical level. This is very cool, but you cannot really tell what is going on, and are probably stuck with what the model gave you. The best you can do is change your model structure in the hope of getting a better intermediate feature representation. It might be better to aim directly at the generalizable feature representation: one simple model will find the representation, another one will use it. And again, I wish I knew how to do that ;)

Anonymous said...

Here's a nice t-SNE visualization of the Collobert & Weston features (with 2,500 words): http://www.cs.toronto.edu/~hinton/turian.png

Anonymous said...

Yoav - you just read my mind :)

hal said...

Yoav -- I guess it depends on your goals. On the one hand, with my machine learning hat on, I'd like the dumbest possible feature representation (words or characters) and then do something fancy on that. With my linguist hat on, I'd agree with you. At least in so far as you can have these general features that really will work on any problem, not just one specific problem (or style thereof). I totally agree that what's going on in the multitask stuff is that it's trying to learn a better representation; the hope would be that this would enable you to use small training corpora even with a relatively different vocabulary.

Anonymous said...

This post was like deja vu for the conversation I had with Joseph Turian a week or so ago.

A feature of the "featureless" approach is that it infers the word feature vectors (aka "lookup table") jointly with the features for the task(s). Collobert and Weston showed it was possible to use a common set of word representations for different tasks over the same data.

This is very cool, but perhaps not that surprising in retrospect (what good idea is?). Automatic clustering for feature extraction had already been used in most of these tasks. Chen and Goodman gave us a glimpse of the future when they showed automatic clustering of words given contexts was more useful than Penn Treebank part-of-speech tags.

Collobert and Weston's tasks seem highly complementary in that we know good named-entity extraction helps with part-of-speech tagging and vice versa.

I don't think it'd work to use a common vector word representation for different data genres, such as text messages or MEDLINE citations. I'd expect to at least see another layer for the domain-specific representations.

It seems to me that deep belief network training relies on good initializations. Could a common representation for words help with that, or will the network shape make that impractical?

Collobert and Weston do a little futzing with tokenization, with pruning (most frequent 30K words only), and with case normalization (they lowercase but keep a capitalization feature). I didn't see any LM entropy results in the unified paper, so I'm not sure how they're evaluating or how this fancy method compares to Chen and Goodman's simpler clustered results.

Anonymous said...

Your point about EM is absolutely correct. Shockingly, it is not hard to construct a (by no means degenerate) example where EM fitting does not work well for a mixture of two reasonably separated Gaussians in one dimension. Why EM would be preferable to a gradient descent-based method is far from clear to me.

Misha B.

Anonymous said...

Misha-- I think the main reason to prefer EM to gradient descent is simplicity. Fitting a mixture of Gaussians by gradient descent is perfectly doable, but is very messy compared to EM. (You need to constrain the covariance matrices you are optimizing over to be positive definite, for example.) On the other hand, stochastic gradient descent can be used online, which can be a huge advantage with Big Data.

I don't believe there is any theoretical work that suggests that EM will tend to get trapped in local minima less often than gradient descent. There might be some folklore about this. My personal experience is that local minima are a real problem in practice for both.
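For concreteness, here is a minimal sketch of EM for the case under discussion, a two-component mixture of 1-D Gaussians; the data, initialization, and iteration count are arbitrary. The closed-form M-step is exactly the simplicity advantage: no step sizes, and the positive-variance constraint holds for free.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(2.0, 1.0, 200)])

    pi, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

    for _ in range(100):
        # E-step: responsibilities r[n, k] = p(component k | x_n)
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for the mixing weights, means, and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

    print(pi, mu, var)  # should land near (0.5, 0.5), (-2, 2), (1, 1) from this start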

Anonymous said...

Justin, it seems that it is a "widely believed fact" that EM tends to find correct values of the parameters. I think part of it stems from the confusion with MLE, which is known to be a consistent estimator, but is computationally intractable.

I am also unaware of any theoretical or experimental work suggesting that EM outperforms gradient descent, although, as you say, EM is simpler to implement. In my view a key issue for both methods is initialization and that is something which is not well-understood at all.

However, considering that EM does not always work well for two Gaussians in one dimension (even just for mean estimation) suggests that in higher dimensions and more complicated situations we really have very little idea what's going on.

Misha B

Jurgen Van Gael said...

There is some work on doing more fancy optimization (conjugate gradient) instead of just simple EM (http://www.cs.toronto.edu/~rsalakhu/papers/emecg.pdf). It turns out that CG is not always a clear winner over standard EM.

Bob: do you have the reference for that Chen and Goodman paper?

bj said...

Interesting post, a few comments.

Suppose on the one hand we have a lot of data and train an enormous non-linear deep (and why not wide as well) neural network with many layers, many hidden units, that works very well. Each hidden unit is doing something, but because of the millions of parameters, it is difficult or impossible to determine how it is precisely implementing its solution. This approach, actually, reminds me more of the brain, where for one reason or another the right environmental pressure (in the artificial case, a loss function) encouraged a complex entity to arrive at some reasonable (perhaps fast and frugal) solution. We could spend our time trying to interpret this solution, but we could also spend our time trying to interpret the brain, as they both offer solutions. Perhaps the best reason to try to interpret such a large artificial system is its measurability (unlike the brain, it is easy to know the values of each link in an NN for a particular input pattern).

On the other hand, a large feature-engineered linear model is such that it can work well, and with proper regularization, we are able to throw features at it with relative impunity. We could look at the surviving features and verify an initial assumption we might have about which features might be useful for a problem (which is why this approach is of interest from the linguistic, or any domain-specific or scientific hypothesis-verification, perspective: it allows us to verify, using machine learning criteria, hypotheses about that data or about human phenomena). Such an approach can also work really well on a problem, but it is inherently limited by the set of features that it is given (so if you run out of ideas for features, and your goal is strictly to get something to work, the non-linear approach might be better).

I see there being room for both approaches actually. From the domain side, lots of features in a linear model can work, and it is easy to understand. From the engineering get-it-to-work-no-matter-what-and-I-don't-care-if-I-have-no-idea-what-it-is-doing-other-than-the-fact-that-it-is-asymptotically-consistent perspective, a large nonlinear neural network is useful (since it might achieve useful practical and/or theoretical results). Also, since it is measurable, if we can figure out how to interpret it, it might help us to discover techniques to figure out how the brain works --- I'm not aware yet of any research on data-mining for patterns on large learned machine learning systems.

On an unrelated note, are HMMs really dead? Did CRFs really kill them? :-)

hal said...

Bob: I totally agree with everything :). I didn't mean to claim that the same representation would work across domains, but at least (maybe) for different tasks in the same (or similar) domains.

Misha: Re EM versus GD, the thing that was beaten into my head when I was a grad student was that one of the cool things about EM is that you don't have to do annoying things like choose step sizes or deal with momentum or conjugate directions or anything like that.

Justin: I think if you reparameterize so that Sigma = U'U and then do GD on U, it's equally straightforward to do GD as EM. You can also do online EM instead of stochastic GD. I think it's largely an empirical question which is better. I started looking at this a while ago for IBM model 1 but then got bored and never really followed it up (partially because it doesn't seem especially publishable and it was too much work just for a blog post).
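A minimal sketch of that reparameterization, fitting a single Gaussian's covariance by gradient descent on U with Sigma = U'U so positive (semi)definiteness holds by construction; the data, step size, and iteration count are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]], size=500)
    N, D = X.shape

    mu = X.mean(axis=0)                   # fix the mean at its MLE for brevity
    Sbar = (X - mu).T @ (X - mu) / N      # sample covariance = the MLE target
    U = np.eye(D)                         # Sigma = U'U starts at the identity

    lr = 0.1
    for _ in range(2000):
        Sigma = U.T @ U
        Sinv = np.linalg.inv(Sigma)
        G = 0.5 * (Sinv - Sinv @ Sbar @ Sinv)  # d(average NLL) / d(Sigma)
        U -= lr * 2.0 * U @ G                  # chain rule through Sigma = U'U

    print(U.T @ U)   # should be close to Sbar
    print(Sbar)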

Jurgen: It's *the* Chen and Goodman "analysis of smoothing" paper, I believe.

JB: Yeah, I pretty much agree. Depending on your goals, both are totally reasonable approaches.

Anonymous said...

I misremembered what was where. Chen and Goodman's classic paper doesn't talk about clustering, but Goodman's overview does:

http://arxiv.org/abs/cs/0108005v1

He uses a discrete clustering into a partition, not any kind of principal components-like representation. The clustering is tuned for the LM task, though.

Mobile Searcher said...

The problem with both the featuritis approach *and* the superior learning algorithm is that they don't result in a lot of insights and create models that are largely incomprehensible. That is fine for a certain set of system-building activities (natural language engineering), but not if we are wearing a scientific hat and want to learn about language. More research on the elegant and effective integration of rule-based and statistical methods is due, and Markov Template Models are one promising line of investigation (see also the forthcoming workshop on NLP & software engineering at NAACL'09 in this regard).
