27 September 2007

Bootstrapping

There are many bootstrapping algorithms, but they all have (roughly) the same general form. Suppose we want to solve a binary classification problem. We do the following:

  1. Build (by hand) a classifier that predicts positive with high precision (and low recall)
  2. Build (by hand) a classifier that predicts negative with high precision (and low recall)
  3. Apply (1) and (2) to some really huge data set leading to a smaller labeled set (that has high precision)
  4. Train a new classifier on the output of (3)
  5. Apply the new classifier to the original data set and go to (4)
Note that the hand-built classifiers can alternatively be replaced by hand-labeling a small number of "representative" examples.
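
To make the shape of the loop concrete, here is a minimal sketch in Python; the seed rules, the train function, and the confidence threshold are placeholders of my own rather than pieces of any particular system.

    # Minimal sketch of the generic bootstrapping loop above. seed_pos_rule and
    # seed_neg_rule stand in for the hand-built high-precision classifiers in
    # steps (1) and (2); train() returns a model with a predict_proba() method.
    # All names are illustrative.
    def bootstrap(unlabeled, seed_pos_rule, seed_neg_rule, train,
                  n_iters=10, threshold=0.9):
        # Step (3): apply the hand-built rules to get a small,
        # high-precision labeled set.
        labeled = [(x, 1) for x in unlabeled if seed_pos_rule(x)]
        labeled += [(x, 0) for x in unlabeled if seed_neg_rule(x)]

        model = None
        for _ in range(n_iters):
            # Step (4): train a new classifier on the current labeled pool.
            model = train(labeled)
            # Step (5): relabel the original data, keeping only confident
            # predictions, and repeat.
            labeled = []
            for x in unlabeled:
                p = model.predict_proba(x)  # estimated P(positive | x)
                if p >= threshold:
                    labeled.append((x, 1))
                elif p <= 1.0 - threshold:
                    labeled.append((x, 0))
        return model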

I've never actually used such an algorithm before, but I hear that they work pretty well.

There's always been one thing that's surprised me, though. And that's the fact that they work pretty well.

There are two reasons why the fact that these algorithms work surprises me. First, there is often a danger that the classifier learned in (4) simply memorizes what the rule-based classifier does. Second, there is a domain adaptation problem.

To be more explicit, let me give an example. Suppose I want to classify sentences as subjective or objective. This is a well-studied problem to which bootstrapping algorithms have been applied. The first two steps involve inventing rule-based classifiers that get high precision on the "predict-subjective" and "predict-objective" tasks. One way to do this is to create word lists. Get a list of highly subjective words and a list of highly objective words (okay, the latter is hard and you have to be clever to do it, but let's suppose we could actually do it). Now, to make our rule-based classifier, we might say: if a sentence contains at least two positive (i.e., subjective) words and no negative (i.e., objective) words, it's positive; if a sentence contains at least two negative words and no positive words, it's negative; else, punt.
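
As a sketch, the rule might look like this, with toy placeholder word lists (remember, actually building a good objective list is the hard part):

    # Sketch of the word-lookup rule. SUBJECTIVE_WORDS and OBJECTIVE_WORDS are
    # toy stand-ins for the real (hard-to-build) word lists.
    SUBJECTIVE_WORDS = {"wonderful", "terrible", "love", "hate", "awful"}
    OBJECTIVE_WORDS = {"reported", "announced", "measured", "percent"}

    def rule_label(sentence):
        words = set(sentence.lower().split())
        n_pos = len(words & SUBJECTIVE_WORDS)  # "positive" = subjective list
        n_neg = len(words & OBJECTIVE_WORDS)   # "negative" = objective list
        if n_pos >= 2 and n_neg == 0:
            return "positive"
        if n_neg >= 2 and n_pos == 0:
            return "negative"
        return None  # punt: leave the sentence unlabeled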

Now, we apply this word-lookup-based classifier on a large data set and extract some subset of the sentences as labeled. In the very simplest case, we'll extract bag-of-words features from these sentences and train up a simple classifier.
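
In code, that first round might look roughly like the following; I'm using scikit-learn purely for convenience, and rule_label refers to the rule sketch above.

    # Sketch of the first training round: bag-of-words features over the
    # sentences the rule managed to label. scikit-learn is just a convenient
    # choice here; any linear classifier would do.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # corpus is the large unlabeled set of sentences.
    labeled = [(s, rule_label(s)) for s in corpus]
    labeled = [(s, y) for s, y in labeled if y is not None]
    sentences, labels = zip(*labeled)

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)

    clf = LogisticRegression()
    clf.fit(X, labels)

    # Then (step 5) relabel the whole corpus and iterate.
    predictions = clf.predict(vectorizer.transform(corpus))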

This is where things go haywire for my understanding.

First, by using a bag-of-words model, there is always the possibility that the classifier we learn will simply mimic the rule-based classifiers on the training data and do random stuff on everything else. In other words, since the information used to create the rule-based classifier exists in the training data, there's no guarantee that the classifier will actually learn to generalize. Essentially what we hope is that there are other, stronger, and simpler signals in the data that the classifier can learn instead.

One might think that to get around this problem we would remove from the bag of words any of the words from the initial word list, essentially forcing the classifier to learn to generalize. But people don't seem to do this. In fact, in several cases that I've seen, people actually include extra features of the form "how many positive (resp. negative) words appear in this sentence." But now we have a single feature that will allow us to replicate the rule-based classifier. This scares me greatly.
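
Concretely, the worrying features are something like this (continuing the sketch from above):

    # The extra count features that worry me: from this pair alone, the
    # learner can reconstruct the seed rule almost exactly.
    def count_features(sentence):
        words = sentence.lower().split()
        return {
            "num_positive_words": sum(w in SUBJECTIVE_WORDS for w in words),
            "num_negative_words": sum(w in OBJECTIVE_WORDS for w in words),
        }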

Second, our goal from this process is to learn a classifier that has high accuracy on the distribution of "all English sentences (perhaps from some particular domain)." But the classifier we train in the first iteration is trained on the distribution of "all English sentences (perhaps...) such that the sentence contains >=2 positive and 0 negative, or >=2 negative and 0 positive words." The fact that this will generalize is totally not obvious.

And yet these things seem to work pretty well. I just don't quite understand why.

9 comments:

Bob Carpenter said...

I find the best way to understand the common approach to "bootstrapping" in NLP (not to be confused with Efron's statistical bootstrap) is to think of it as a highly quantized EM with annealing.

Specifically, you build an initial model, ideally with high precision. Then you run EM iterations with quantization. By that I mean that you quantize the expectations with a high threshold (e.g. [0.0,0.9] to 0 and (0.9,1.0] to 1). As you go, that threshold presumably gets lowered, making the whole thing a kind of annealing.
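
In rough pseudocode (all names made up), that loop would look something like this:

    # Rough sketch of quantized EM with an annealed threshold: expectations
    # above the threshold snap to 1, those below (1 - threshold) snap to 0,
    # the rest are dropped, and the threshold is lowered each round.
    def quantized_em(seed_labeled, unlabeled, train,
                     thresholds=(0.95, 0.9, 0.8, 0.7)):
        labeled = list(seed_labeled)
        model = None
        for t in thresholds:                 # the annealing schedule
            model = train(labeled)           # M-step on the quantized labels
            labeled = list(seed_labeled)     # always keep the seeds
            for x in unlabeled:
                p = model.predict_proba(x)   # E-step: estimated P(class 1 | x)
                if p > t:
                    labeled.append((x, 1))   # quantize the expectation to 1
                elif p < 1.0 - t:
                    labeled.append((x, 0))   # quantize the expectation to 0
        return model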

There are some good refs related to this issue: Nigam, McCallum and Mitchell's Semi-Supervised Text Classification Using EM, and Neal and Hinton's A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. The former discusses annealing and the latter discusses winner-take-all versions of EM, which I'm suggesting modifying so that the winner only takes all if it wins by a large enough margin.

Of course, this assumes an underlying classifier that can infer conditional estimates of categories given inputs. I suppose you can do this with Yarowsky-style word list classifiers by recasting them as decision trees and estimating probabilities.

hal said...

The EM connection makes sense, but I didn't mention it because it doesn't really address either of my concerns, or at best partially addresses only the first one :).

In a sense bootstrapping is a family of algorithms for solving semi-sup learning. I guess my concern---especially concern #2---is that we know that we can do much better than semi-sup learning when different domains are involved. (Thanks to John Blitzer for teaching us this.)

Bob Carpenter said...

The short answer is that bootstrapping works for the same reason naive Bayes works -- tokens in docs are highly correlated with topics, and with even a few example docs, we can build a better-than-chance classifier.

Problem 1 worries that we might never get off the ground (continuing the "bootstrap" analogy). The reason this doesn't happen is that from a small set of seed words, we'll pick up some documents. From even a small set of docs, we can train a classifier that is not embarrassing on 0/1 loss at high confidence. If this doesn't happen, we're grounded, so the annealing param needs to be set right here. If we do pick up more docs, we can build up flight speed by inferring that the words in those docs are associated with the categories. Many of these inferences will be wrong, but the reason it works is the same as for naive Bayes -- we're accumulating lots of little pieces of evidence in a voting scheme and the method's fairly low variance. These new docs are then enough to push some more relevant docs over the threshold, and then we're flying.

Problem 2 questions whether we can estimate a good distribution over all words. The answer is that just like for naive Bayes, it doesn't matter much if what we care about is 0/1 loss. The method's surprisingly robust because all the words tend to provide evidence. The lack of modeling correlation is why naive Bayes does so poorly on log (cross-entropy) loss compared to models that model dependency better (make fewer independence assumptions).

Unfortunately, like EM, we're left with a residual problem 3: we may get stuck in a local optimum. This indeed seems to happen in practice. Luckily, annealing through high precision examples helps push the initial steps of EM in the right direction. And if it doesn't we can just choose a different seed set.

Mark Johnson said...

I confess I'm puzzled as to why these bootstrapping methods seem to work better than EM. The informal story seems to be that with EM, the unlabeled data overwhelms the labeled data, which seems a reasonable enough explanation of why EM goes wrong. But then the question is: why doesn't this happen with bootstrapping, especially if it is just a kind of approximation to EM as Bob suggests.

Steven Abney has thought a lot about these things, and has a Computational Linguistics article and new book on this topic (I haven't seen the book yet) which may be worth looking at.

hal said...

hrm... the statement "bootstrapping works for the same reason naive Bayes works" doesn't make sense to me. afaik, there's nothing that says that i couldn't bootstrap, say, with an svm or maxent model. but in these models, there's a huge chance that you'll just memorize the data and not generalize... i can easily construct distributions on which this will happen.

i think problem 1 is deeper than not getting off the ground: (to continue the analogy) i think we can actually start digging a hole. why doesn't this happen?

i think i'm starting to understand problem 2 and this may be kind of cool. essentially what bob seems to be saying (buried somewhere)---which i agree with---is that while D^0 (the distribution over sentences from the output of our rules) and D (the true distribution over sentences) may be different in an "adaptation" sense, by doing this "incrementally add a few more examples" thing, we're essentially constructing a sequence of distributions D^0, D^1, D^2, ..., D^T, where D^T = D.

looking at this from the domain adaptation perspective, this is actually quite interesting. i don't think anyone's looked at the problem in this way before. there may be some new D.A. algorithm lurking in there somewhere.

mark -- yes, this is weird. i don't have a good answer. one thing that i can suggest based on some work a student of mine is doing right now is that maybe what's going wrong is that when we usually do EM, we do it on a naive bayes model, which is very poorly calibrated for predicting probabilities (as bob points out). by doing this thresholding in bootstrapping, we may be effectively trying to remedy this problem....

Eric Yeh said...

Regarding the generalization problem, aren't there enough similarities between bootstrapping and boosting to help explain the apparent lack of overfitting? In both cases, it appears you've got a weighted combination of classifiers that are specialized to certain areas of the problem space, and (at least in boosting) adding in additional features apparently does not lead immediately to overfitting.

Fernando Pereira said...

"... I hear that they work pretty well". Don't trust everything you hear. As far as I know, no bootstrapping algorithm has ever been shown to work widely beyond its first reported application. In other words, the bootstrap parameters have been seriously overfitted to the initial application. Bootstrapping is intuitively a very cool idea, but we are missing a sketchy theoretical understanding of under what conditions it would work. Until we do, bootstrapping methods will be one-offs from which we can learn little.

Harsha said...

Bootstrapping, self-training, semi-supervision, etc. have been tried for at least four decades (some OCR research from the '60s already had the same idea).

The main problem with bootstrapping a 'single' classifier is that the samples it classifies with high precision are not uniformly distributed in the feature space. That is, the errors a classifier makes are near the decision boundary, and if we throw those samples away, the remaining samples don't add much to the existing classifier.

The novelty of some of the latest approaches (like co-training) is that they assume there are two classifiers whose errors are 'uniformly distributed' with respect to each other. In NLP we can often find a natural analogue because some intrinsic properties of a token (like its identity) are usually independent of its context given its label.
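
A bare-bones sketch of that co-training idea (two views, each classifier handing its confident labels to a shared pool; all names are illustrative):

    # Bare-bones co-training sketch in the spirit of Blum & Mitchell: two
    # classifiers trained on different views of the data, each contributing
    # its confident predictions to a shared labeled pool.
    def cotrain(labeled, unlabeled, view1, view2, train,
                n_rounds=10, threshold=0.9):
        pool = list(labeled)                 # (example, label) pairs
        for _ in range(n_rounds):
            c1 = train([(view1(x), y) for x, y in pool])
            c2 = train([(view2(x), y) for x, y in pool])
            still_unlabeled = []
            for x in unlabeled:
                decision = None
                # Each classifier labels the examples it is confident about;
                # the other one learns from them on the next round.
                for clf, view in ((c1, view1), (c2, view2)):
                    p = clf.predict_proba(view(x))
                    if p >= threshold:
                        decision = (x, 1)
                    elif p <= 1.0 - threshold:
                        decision = (x, 0)
                if decision is not None:
                    pool.append(decision)
                else:
                    still_unlabeled.append(x)
            unlabeled = still_unlabeled
        return c1, c2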

Reza said...

This is an article on the analysis of semi-supervised learning along the same lines as Abney's analysis.
