31 January 2009

Boosting, neural networks and deep learning

I'm on a neural networks kick of late, apparently.

The following was pointed out to me recently. Maybe other people think it's obvious, and it's obvious once you hear it, but I'd never seen it that way. Suppose we're doing boosting and the weak learner that we choose is a thresholded linear model (e.g., perceptron, SVM, logistic regression). Then what we get out of boosting is basically a neural network with a single hidden layer. The hidden units themselves are the weak learners, and the "alphas" that arise as part of the boosting procedure (e.g., AdaBoost) are the weights connecting the hidden units to the output. (Thanks to Antonio Paiva for pointing this out to me.) So, in a sense, boosting a linear model can be seen as a kind-of-unusual method for training a two layer network. Essentially the number of boosting iterations that you run determines the number of hidden units (i.e., the network architecture).
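
A minimal sketch of the correspondence (my own illustration, not code from anyone's paper): AdaBoost where each weak learner is a thresholded linear model, written so that the learned ensemble is literally a one-hidden-layer network with sign activations. The weak learner below (a weighted least-squares fit followed by a threshold) is just a placeholder; any weighted linear classifier would play the same role.

import numpy as np

def fit_linear_classifier(X, y, sample_weights):
    # Placeholder weak learner: weighted least squares on labels in {-1,+1},
    # thresholded by sign() at prediction time. Any weighted linear
    # classifier (perceptron, logistic regression, ...) would do here.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    W = np.diag(sample_weights)
    w = np.linalg.pinv(Xb.T @ W @ Xb) @ (Xb.T @ W @ y)
    return w                                        # last entry is the bias

def adaboost_linear(X, y, n_rounds):
    # y in {-1,+1}. Returns (hidden_units, alphas): exactly the parameters
    # of a one-hidden-layer net -- the hidden units are the weak learners,
    # the alphas are the hidden-to-output weights.
    n = X.shape[0]
    D = np.full(n, 1.0 / n)                         # example weights
    hidden_units, alphas = [], []
    for _ in range(n_rounds):                       # rounds = hidden units
        w = fit_linear_classifier(X, y, D)
        preds = np.sign(X @ w[:-1] + w[-1])         # this round's hidden unit
        eps = np.clip(np.sum(D[preds != y]), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)       # its output-layer weight
        D *= np.exp(-alpha * y * preds)
        D /= D.sum()
        hidden_units.append(w)
        alphas.append(alpha)
    return hidden_units, alphas

def predict(hidden_units, alphas, X):
    # Forward pass of the equivalent two layer network.
    H = np.column_stack([np.sign(X @ w[:-1] + w[-1]) for w in hidden_units])
    return np.sign(H @ np.array(alphas))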

Okay, that's great. But let's say we don't care about two layer networks; we care about deep learning. For a 3-layer network you could simply boost boosted-linear-models. For a 4-layer network you could boost boosted-boosted-linear models. And so on.

Is there an alternative?

Thinking purely procedurally, let's say my weak learner is a linear model. I start boosting and I've got 5 linear models trained in the standard AdaBoost manner. Now I have a choice. Should I train a 6th linear model to throw into the standard boosting set? Or should I treat the 5 boosted linear models as a new base classifier and boost against the combination? If I choose the latter, I've now gone from two layers to three layers.
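
For concreteness, here's one way the second option could look, building on adaboost_linear and predict from the sketch above (again, my own sketch of the idea, not a worked-out algorithm). The outer loop is ordinary AdaBoost whose "weak learner" is itself a small boosted ensemble of linear models, so each outer round adds one second-layer unit. Passing the outer example weights to the inner booster by resampling, and the particular inner/outer round split, are choices I'm making up purely for illustration.

import numpy as np

def boosted_boosting(X, y, outer_rounds, inner_rounds):
    # Outer AdaBoost loop; each weak learner is an inner boosted ensemble
    # of linear models (a whole "layer" of hidden units at once).
    n = X.shape[0]
    D = np.full(n, 1.0 / n)
    ensembles, betas = [], []
    for _ in range(outer_rounds):
        # Approximate training under the outer weights D by resampling.
        idx = np.random.choice(n, size=n, p=D)
        units, alphas = adaboost_linear(X[idx], y[idx], inner_rounds)
        preds = predict(units, alphas, X)           # the combined classifier
        eps = np.clip(np.sum(D[preds != y]), 1e-10, 1 - 1e-10)
        beta = 0.5 * np.log((1 - eps) / eps)        # weight on this ensemble
        D *= np.exp(-beta * y * preds)
        D /= D.sum()
        ensembles.append((units, alphas))
        betas.append(beta)
    return ensembles, betas

def predict_deep(ensembles, betas, X):
    # Final prediction: the sign of a weighted vote over boosted ensembles,
    # i.e., one more layer stacked on top of the two-layer nets.
    votes = np.column_stack([predict(u, a, X) for (u, a) in ensembles])
    return np.sign(votes @ np.array(betas))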

Why might it be a good idea to boost against the 5 collectively? Well, if you believe the whole deep learning propaganda, then it's a good idea because deep = good. From a more theoretical perspective, you might hope that the extra level of recursion gets you an increased rate of improvement in the error rate. I.e., the recursion could potentially lead to stronger boosting results than standard linear boosting. Of course, this is just a hunch: I haven't at all tried to figure out whether it would actually work in theory. But it seems plausible. For instance, in neural network theory, we know that a 2-layer network can approximate any (reasonable) function, but you might need an exponential number of hidden units; the number of required hidden units goes down if you make deeper networks (under assumptions).

Of course, maybe someone's already done this, maybe it doesn't work, and maybe it's a stupid idea :).

4 comments:

Peter Turney said...

The algorithm you describe reminds me a bit of Cascade Correlation.

Yoshua Bengio said...

The connection between boosted linear classifiers and one-hidden-layer neural networks has been studied in the past. See, for example, the NIPS'2005 paper on Convex Neural Networks:
http://www.iro.umontreal.ca/%7Elisa/publications/index.php?page=publication&kind=single&ID=19

which shows that by successively adding the optimal linear classifier (as a weak learner) one could in principle find the global optimum of a 1-hidden layer neural net. Unfortunately, finding the optimal linear classifier is NP-hard. But more importantly, I agree that 1-hidden layer nets are not satisfying because they are too shallow.

I'm not sure I really understood your idea with 5 linear models and a 6th one added in such a way as to increase depth. To increase depth, the 6th one should see in its inputs the outputs of the previous ones. As noted by Peter Turney, this would turn the algorithm into something similar to Cascade Correlation. But looking back at the goals and motivations of deep learning algorithms, something important would be missing: the discovery of intermediate representations that capture the variations in the input. And something else might be missing: the unsupervised component. Purely supervised greedy addition of layers does not work as well as unsupervised addition of layers, according to the experiments described in the NIPS'2006 paper by Bengio et al. on Greedy Layerwise Training of Deep Networks (http://www.iro.umontreal.ca/%7Elisa/publications/index.php?page=publication&kind=single&ID=190). The authors conjectured that purely supervised greedy addition of layers was simply too greedy, not keeping enough of the input information (the part that cannot be exploited yet by a linear classifier but could be exploited later when more layers are added).


-- Yoshua Bengio

Joseph Turian said...

Another work which uses boosting to add units to an MLP is Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space.

As Yoshua indicated, though, the main difficulty in inducing a deep architecture is that a purely supervised signal does not seem sufficient to guide the optimization to a good minimum.

hal said...

Yoshua's comment is definitely apt and one that I had sort of forgotten about when writing. The unsupervised aspect of getting things rolling in deep nets definitely seems to be super important. I don't know how to fix this :).
