There are many bootstrapping algorithms, but they all have (roughly) the same general form. Suppose we want to solve a binary classification problem. We do the following:
- Build (by hand) a classifier that predicts positive with high precision (and low recall)
- Build (by hand) a classifier that predicts negative with high precision (and low recall)
- Apply (1) and (2) to some really huge data set leading to a smaller labeled set (that has high precision)
- Train a new classifier on the output of (3)
- Apply the new classifier to the original data set and go to (4)
I've never actually used such an algorithm before, but I hear that they work pretty well.
There's always been one thing that's surprised me, though. And that's the fact that they work pretty well.
There are two reasons why the fact that these algorithms work surprises me. First, there is often a danger that the classifier learned in (4) simply memorizes what the rule-based classifier does. Second, there is a domain adaptation problem.
To be more explicit, let me give an example. Suppose I want to classify sentences as subjective or objective. This is a well studied problem to which bootstrapping algorithms have been applied. The first two steps involve inventing rule-based classifiers that get high precision on the "predict-subjective" and "predict-objective" tasks. One way to do this is to create word lists. Get a list of highly subjective words and a list of highly objective words (okay this latter is hard and you have to be clever to do it, but let's suppose that we could actually do it). Now, to make our rule based classifier, we might say: if a sentence contains at least two positive words and no negative words it's positive; if a sentence contains at least two negative words and no positive words it's negative; else, punt.
Now, we apply this word-lookup-based classifier on a large data set and extract some subset of the sentences as labeled. In the very simplest case, we'll extract bag-of-words features from these sentences and train up a simple classifier.
This is where things go haywire for my understanding.
First, by using a bag-of-words model, there is always the possibility that the classifier we learn will simply mimic the rule-based classifiers on the training data and do random stuff on everything else. In other words, since the information used to create the rule-based classifier exists in the training data, there's no guarantee that the classifier will actually learn to generalize. Essentially what we hope is that there are other, stronger, and simpler signals in the data that the classifier can learn instead.
One might think that to get around this problem we would remove from the bag of words any of the words from the initial word list, essentially forcing the classifier to learn to generalize. But people don't seem to do this. In fact, in several cases that I've seen, people actually include extra features of the form "how many positive (resp. negative) words appear in this sentence." But now we have a single feature that will allow us to replicate the rule-based classifier. This scares me greatly.
Second, our goal from this process is to learn a classifier that has high accuracy on the distribution of "all English sentences (perhaps from some particular domain)." But the classifier we train in the first iteration is trained on the distribution of "all English sentences (perhaps...) such that the sentence contains >=2 positive and 0 negative, or >=2 negative and 0 positive words." The fact that this will generalize is totally not obvious.
And yet these things seem to work pretty well. I just don't quite understand why.