Comments on natural language processing blog: Bootstrapping

Anonymous (2007-10-03 14:40):

<A HREF="http://www.cs.sfu.ca/fas-info/cs/people/GradStudents/ghaffar1/personal/publications/uai_yar.pdf" REL="nofollow">This</A> is an article on the analysis of semi-supervised learning along the same lines as Abney's
analysis.

Anonymous (2007-10-02 08:55):

Bootstrapping, self-training, semi-supervision, etc. have been tried for at least four decades (some OCR research from the '60s already had the same idea).

The main problem with bootstrapping with a single classifier was that the samples classified with high precision by the classifier are not uniformly distributed in the feature space. That is, the errors made by a classifier are near the decision boundary, and if we throw those samples away, the remaining samples don't add much to the existing classifier.

The novelty of some of the latest approaches (like co-training) is that they assume there are two classifiers whose errors are "uniformly distributed" with respect to each other. In NLP we can often find a natural analogue, because some intrinsic properties of a token (like its identity) are usually independent of its context given its label.

Fernando Pereira (2007-09-28 16:02):

"... I hear that they work pretty well." Don't trust everything you hear. As far as I know, no bootstrapping algorithm has ever been shown to work widely beyond its first reported application. In other words, the bootstrap parameters have been seriously overfitted to the initial application. Bootstrapping is intuitively a very cool idea, but we lack even a sketchy theoretical understanding of the conditions under which it should work.
Until we do, bootstrapping methods will be one-offs from which we can learn little.

Eric (2007-09-28 10:33):

Regarding the generalization problem, aren't there enough similarities between bootstrapping and boosting to help explain the apparent lack of overfitting? In both cases you have a weighted combination of classifiers that are specialized on certain areas of the problem space, and (at least in boosting) adding in additional features apparently does not lead immediately to overfitting.

hal (2007-09-28 07:57):

hrm... the statement "bootstrapping works for the same reason naive Bayes works" doesn't make sense to me. afaik, there's nothing that says i couldn't bootstrap with, say, an SVM or maxent model. but with those models, there's a huge chance that you'll just memorize the data and not generalize... i can easily construct distributions on which this will happen.

i think problem 1 is deeper than not getting off the ground: (to continue the analogy) i think we can actually start digging a hole. why doesn't this happen?

i think i'm starting to understand problem 2, and this may be kind of cool.
essentially what bob seems to be saying (buried somewhere) --- which i agree with --- is that while D^0 (the distribution over sentences from the output of our rules) and D (the true distribution over sentences) may be different in an "adaptation" sense, by doing this "incrementally add a few more examples" thing, we're essentially constructing a sequence of distributions D^0, D^1, D^2, ..., D^T, where D^T = D.

looking at this from the domain adaptation perspective, this is actually quite interesting. i don't think anyone has looked at the problem in this way before. there may be some new D.A. algorithm lurking in there somewhere.

mark -- yes, this is weird. i don't have a good answer. one thing i can suggest, based on some work a student of mine is doing right now, is that maybe what's going wrong is that when we usually do EM, we do it on a naive Bayes model, which is very poorly calibrated for predicting probabilities (as bob points out). by doing this thresholding in bootstrapping, we may effectively be remedying that problem....

Anonymous (2007-09-27 21:01):

I confess I'm puzzled as to why these bootstrapping methods seem to work better than EM. The informal story seems to be that with EM, the unlabeled data overwhelms the labeled data, which seems a reasonable enough explanation of why EM goes wrong.
But then the question is: why doesn't this happen with bootstrapping, especially if it is just a kind of approximation to EM, as Bob suggests?

<A HREF="http://www.vinartus.net/spa/publications.html" REL="nofollow">Steven Abney</A> has thought a lot about these things, and has a <A HREF="http://www.vinartus.net/spa/03c-v7.pdf" REL="nofollow">Computational Linguistics article</A> and a <A HREF="http://www.crcpress.com/shopping_cart/products/product_detail.asp?sku=C5599" REL="nofollow">new book</A> on this topic (I haven't seen the book yet) which may be worth looking at.

Anonymous (2007-09-27 16:14):

The short answer is that bootstrapping works for the same reason naive Bayes works: tokens in documents are highly correlated with topics, and with even a few example documents we can build a better-than-chance classifier.

Problem 1 worries that we might never get off the ground (continuing the "bootstrap" analogy). The reason this doesn't happen is that from a small set of seed words we'll pick up some documents. From even a small set of documents, we can train a classifier that is not embarrassing on 0/1 loss at high confidence. If this doesn't happen, we're grounded, so the annealing parameter needs to be set right here. If we do pick up more documents, we can build up flight speed by inferring that the words in those documents are associated with the categories. Many of these inferences will be wrong, but the method works for the same reason naive Bayes does: we're accumulating lots of little pieces of evidence in a voting scheme, and the method is fairly low-variance. These new documents are then enough to push some more relevant documents over the threshold, and then we're flying.

Problem 2 questions whether we can estimate a good distribution over all words.
The answer is that, just as for naive Bayes, it doesn't matter much if what we care about is 0/1 loss. The method is surprisingly robust because all the words tend to provide evidence. The lack of correlation modeling is why naive Bayes does so poorly on log (cross-entropy) loss compared to models that capture dependencies better (i.e., make fewer independence assumptions).

Unfortunately, as with EM, we're left with a residual problem 3: we may get stuck in a local optimum. This does indeed seem to happen in practice. Luckily, annealing through high-precision examples helps push the initial steps of EM in the right direction. And if it doesn't, we can just choose a different seed set.

hal (2007-09-27 11:29):

The EM connection makes sense, but I didn't mention it because it doesn't really address either of my concerns, or at best partially addresses only the first one :).

In a sense, bootstrapping is a family of algorithms for solving semi-supervised learning. I guess my concern --- especially concern #2 --- is that we know we can do much better than semi-supervised learning when different domains are involved. (Thanks to John Blitzer for teaching us this.)

Anonymous (2007-09-27 11:17):

I find the best way to understand the common approach to "bootstrapping" in NLP (not to be confused with Efron's statistical <A HREF="http://www-stat.stanford.edu/software/bootstrap/index.html" REL="nofollow">bootstrap</A>) is to think of it as a highly quantized EM with annealing.

Specifically, you build an initial model, ideally with high precision. Then you run EM iterations with quantization.
By that I mean you quantize the expectations with a high threshold (e.g., map [0.0, 0.9] to 0 and (0.9, 1.0] to 1). As you go, that threshold presumably gets lowered, making the whole thing a kind of annealing.

There are some good references related to this issue: Nigam, McCallum and Mitchell's <A HREF="http://www.cs.umass.edu/~mccallum/papers/semisup-em.pdf" REL="nofollow">Semi-Supervised Text Classification Using EM</A>, and Neal and Hinton's <A HREF="http://citeseer.ist.psu.edu/neal98view.html" REL="nofollow">A View of the EM Algorithm That Justifies Incremental, Sparse, and Other Variants</A>. The former discusses annealing, and the latter discusses winner-take-all versions of EM, which I'm suggesting modifying so that the winner takes all only when it wins by a large enough margin.

Of course, this assumes an underlying classifier that can infer conditional estimates of categories given inputs. I suppose you could do this with Yarowsky-style <A HREF="http://citeseer.ist.psu.edu/yarowsky94decision.html" REL="nofollow">word list classifiers</A> by recasting them as decision trees and estimating probabilities.
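None of the commenters gives code, but the "quantized EM with annealing" view in this last comment is concrete enough to sketch. The following is a hypothetical toy illustration, not anyone's actual system: the naive Bayes trainer, the tiny corpus, the class names, and the threshold schedule `(0.85, 0.7)` are all invented for the example.

```python
# A toy sketch of "bootstrapping as quantized EM with annealing":
# seed a multinomial naive Bayes model with a few labeled documents,
# then repeatedly add unlabeled documents whose predicted label clears
# a confidence threshold, lowering (annealing) the threshold each
# round. Corpus, labels, and schedule are invented for illustration.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) -> (log priors, log likelihoods, vocab)."""
    vocab = {w for toks, _ in docs for w in toks}
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for toks, label in docs:
        class_counts[label] += 1
        word_counts[label].update(toks)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / denom) for w in vocab}
    return log_prior, log_like, vocab

def predict(model, toks):
    """Return (best label, posterior probability of that label)."""
    log_prior, log_like, vocab = model
    scores = {c: lp + sum(log_like[c][w] for w in toks if w in vocab)
              for c, lp in log_prior.items()}
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    best = max(scores, key=scores.get)
    return best, math.exp(scores[best] - top) / total

def bootstrap(seed, unlabeled, thresholds):
    """Quantized EM: expectations above the threshold are rounded up to a
    hard label (the document joins the training set); everything else is
    rounded down to 'still unlabeled'. The threshold anneals downward."""
    labeled, pool = list(seed), list(unlabeled)
    for thresh in thresholds:            # annealing schedule
        model = train_nb(labeled)
        still_unlabeled = []
        for toks in pool:
            label, conf = predict(model, toks)
            if conf > thresh:            # quantize: confident enough -> 1
                labeled.append((toks, label))
            else:                        # not confident -> 0, retry next round
                still_unlabeled.append(toks)
        pool = still_unlabeled
    return train_nb(labeled)

seed = [(["goal", "match"], "sports"), (["vote", "senate"], "politics")]
unlabeled = [["goal", "goal", "match", "striker"],
             ["vote", "vote", "senate", "filibuster"],
             ["striker", "striker"],     # only words unseen in the seed
             ["filibuster", "filibuster"]]
model = bootstrap(seed, unlabeled, thresholds=(0.85, 0.7))
label, conf = predict(model, ["striker", "goal"])  # 'striker' learned by bootstrapping
```

On this toy data the all-unseen documents sit at confidence 0.5 in the first round (exactly the "grounded" case the earlier comment describes), get absorbed only after the mixed documents have expanded the vocabulary, and are ultimately classified correctly; with a single overly strict threshold the loop never gets off the ground.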