11 October 2011

Active learning: far from solved

As Daniel Hsu and John Langford pointed out recently, there has been a lot of progress in active learning. It has reached the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it's very careful about what work it performs. That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC). This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3), and then a discussion of why I think (a) researchers don't tend to make much use of active learning and (b) the problem is far from solved. (Point (a) will lead to point (b).)

For those who know what QBU and QBC are, skip this paragraph. The idea with QBU is exactly what you think: when choosing the next point to ask for the label of, choose the one on which your current model is maximally uncertain. If you're using a probabilistic model, this means something like "probability is closest to 0.5," or, in the non-binary case, something like "highest entropy of p(y|x)." If you're using something like an SVM, perhaps margin (aka distance to the hyperplane) is a reasonable measure of uncertainty. In QBC, the idea is still to query on uncertain points, but the uncertainty is computed by the amount of disagreement among a committee of classifiers, for instance, classifiers trained in a bootstrap manner on whatever data you have previously labeled.
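
To make those concrete, here is a minimal sketch of both strategies in Python with scikit-learn. The logistic regression models, the entropy-based uncertainty measure, and the vote-entropy disagreement measure are illustrative choices of mine, not the only (or canonical) ones:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.utils import resample

    def qbu_pick(model, pool_X):
        # Query-by-uncertainty: choose the pool point with the highest
        # entropy of p(y|x) under the current model.
        probs = model.predict_proba(pool_X)
        entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        return int(np.argmax(entropy))

    def qbc_pick(labeled_X, labeled_y, pool_X, n_committee=5):
        # Query-by-committee: train a committee on bootstrap samples of
        # the labeled data, then choose the pool point with the highest
        # vote entropy, i.e. the most disagreement.
        votes = []
        for _ in range(n_committee):
            bx, by = resample(labeled_X, labeled_y, stratify=labeled_y)
            votes.append(LogisticRegression(max_iter=1000).fit(bx, by).predict(pool_X))
        votes = np.stack(votes)  # shape: (committee size, pool size)
        vote_entropy = np.zeros(votes.shape[1])
        for label in np.unique(votes):
            frac = np.mean(votes == label, axis=0)
            vote_entropy -= frac * np.log(frac + 1e-12)
        return int(np.argmax(vote_entropy))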

One of the issues with QBU and QBC, and really a lot of the classic methods for active learning, is that you end up with a biased set of training data. This makes it really hard to prove anything about how well your algorithm is going to do on future test examples, because you've intentionally selected examples that are not random (and probably not representative). One of the "obvious in retrospect" ideas that's broken this barrier is to always train your classifier on all examples: the label for those that you've queried on is given by the human, and the label for those that you haven't queried on is given by your model from the previous iteration. Thus, you are always training on an iid sample from the distribution you care about (at least from a p(x) perspective). This observation, plus a lot of other work, leads to some of the breakthroughs that John mentions.
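
In code, the self-labeling trick amounts to about two lines. A sketch (the oracle_y array, standing in for the human annotator and consulted only at the queried indices, is hypothetical):

    from sklearn.linear_model import LogisticRegression

    def self_labeled_retrain(prev_model, X, oracle_y, queried):
        # Train on ALL of X: queried points get the human's label, and
        # everything else gets the previous model's prediction, so the
        # x's remain an iid sample from p(x) rather than a biased subset.
        y_train = prev_model.predict(X)
        y_train[queried] = oracle_y[queried]  # queried = indices chosen by QBU/QBC/etc.
        return LogisticRegression(max_iter=1000).fit(X, y_train)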

An easy empirical observation is that not many people (in my sphere) actually use active learning. In fact, the only case that I know of was back in 2004 where IBM annotated extra coreference data for the Automatic Content Extraction (ACE) challenge using active learning. Of course people use it to write papers about active learning, but that hardly counts. (Note that the way that previously learned taggers, for instance the Brill Tagger, were used in the construction of the Penn Treebank does not fall under the auspices of active learning, at least as I'm thinking about it here.)

It is worth thinking about why this is. I think that the main issue is that you end up with a biased training set. If you use QBC or QBU, this is very obvious. If you use one of the new approaches that self-label the rest of the data to ensure that you don't have a biased training set, then of course p(x) is unbiased, but p(y|x) is very biased by whatever classifier you are using.

I think the disconnect is the following. The predominant view of active learning is that the goal is a classifier. The data that is labeled is a byproduct that will be thrown away once the classifier exists.

The problem is that this view flies in the face of the whole point of active learning: that labeling is expensive. If labeling is so expensive, we should be able to reuse this data so that the cost is amortized. That is, yes, of course we care about a classifier. But just as much, we care about having a data set (or "corpus" in the case of NLP).

Consider, for instance, the Penn Treebank. The sorts of techniques that are good at parsing now were just flat-out not available (and perhaps not even conceivable) back in the 1990s when the Treebank was being annotated. If we had done active learning for the Treebank under a non-lexicalized, non-parent-annotated PCFG that gets 83% accuracy (maybe worse, because we didn't know how to smooth well back then), how well would this data set work for modern-day state-splitting grammars with all sorts of crazy Markovization and whatnot going on?

The answer is: I have no idea. I have never seen an experiment that looks at this issue. And it would be so easy! Run your standard active learning algorithm with one type of classifier. Plot your usual active-versus-passive learning curves. Now, using the same sequence of data, train another classifier. Plot that learning curve. Does it still beat passive selection? By how much? And then, of course, can we say anything formal about how well this will work?
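
To be explicit about what I mean, here is a sketch of that experiment (QBU as the selector, logistic regression as the "first" classifier, and a random forest standing in for the "later" one are all stand-ins of my choosing):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    def qbu_sequence(X, y, seed_size=10, budget=None, seed=0):
        # Record the order in which QBU (driven by logistic regression)
        # queries the pool; assumes the random seed set happens to
        # contain at least two classes.
        rng = np.random.default_rng(seed)
        order = list(rng.choice(len(X), size=seed_size, replace=False))
        remaining = [i for i in range(len(X)) if i not in order]
        while remaining and (budget is None or len(order) < budget):
            model = LogisticRegression(max_iter=1000).fit(X[order], y[order])
            probs = model.predict_proba(X[remaining])
            entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
            order.append(remaining.pop(int(np.argmax(entropy))))
        return order

    def curve(make_model, X, y, order, X_test, y_test, step=50):
        # Learning curve: held-out accuracy as a function of how many
        # of the actively chosen examples the model has seen.
        return [(n, make_model().fit(X[order[:n]], y[order[:n]]).score(X_test, y_test))
                for n in range(step, len(order) + 1, step)]

    # Same query sequence, two different classifiers:
    # order = qbu_sequence(X_pool, y_pool)
    # old = curve(lambda: LogisticRegression(max_iter=1000), X_pool, y_pool, order, X_test, y_test)
    # new = curve(lambda: RandomForestClassifier(), X_pool, y_pool, order, X_test, y_test)

Comparing both curves to a passive baseline (the same code with a randomly shuffled order) answers the "does it still beat passive selection?" question directly.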

There are tons of ways that this problem can arise. For instance, when I don't have much data I might use a generative model, and then when I have lots of data I might use a discriminative model. Or, as I get more data, I add more features. Or someone finds awesome features for my problem 5 years later. Or new machine learning techniques are developed. Or anything. I don't want my data to become obsolete when this happens.

I am happy to acknowledge that this is a very hard problem. In fact, I suspect that there's some sort of no-free-lunch theorem lurking in here. Essentially, if the inductive biases of the classifier that you use to do the active learning and the classifier you train at the end are too different, then you could do (arbitrarily?) badly. But in the real world, our hypothesis classes aren't all that different, or perhaps you can assume you're using a universal function approximator or a universal kernel or something. Assume whatever you want to start with, but I think it's an interesting question.

And then, while we're on the topic of active learning, I'd also like to see whether an active learning algorithm's performance asymptotes before all your training data is exhausted. That is, the usual model in active learning experiments is that you have 1000 training examples because that's what someone labeled. You then do active learning up to 1000 examples, and of course at that point everything has been labeled, so active learning performance coincides precisely with passive learning performance. But this is a poor reflection of many problems in the world, where new inputs are almost always free. I want the Banko and Brill paper for active learning... perhaps it's out there, and if you've seen it, I'd love a pointer. I ran a couple of experiments along these lines (nothing rigorous), but it actually seemed that active learning from a smaller pool was better, perhaps because you have fewer outliers (I was using QBU). These results are by no means conclusive, though, so don't cite me as saying this :).
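
The pool-size question can be probed with the same machinery: fix the labeling budget, vary only the pool size, and see whether the extra unlabeled data helps or just contributes more outliers. A sketch (X_all, y_all, X_test, and y_test are placeholders; qbu_sequence and curve are the hypothetical helpers from the earlier sketch):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    budget = 200  # fixed number of queries; only the pool size varies
    rng = np.random.default_rng(0)
    for pool_size in (500, 2000, 10000):
        idx = rng.choice(len(X_all), size=pool_size, replace=False)
        order = qbu_sequence(X_all[idx], y_all[idx], budget=budget)
        n, acc = curve(lambda: LogisticRegression(max_iter=1000),
                       X_all[idx], y_all[idx], order, X_test, y_test)[-1]
        print(pool_size, n, acc)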

At any rate, I agree that active learning has come a long way. I would humbly suggest that the goal of simply building a classifier is not in the real spirit of trying to save money. If you wanted to save money, you would save your data and share it (modulo lawyers). In the long run, passive learning currently seems much less expensive than active learning to me.

17 comments:

  1. Look at our EMNLP 04 paper, where we explicitly looked at bias and data reuse in active learning.

  2. @All: the link is http://homepages.inf.ed.ac.uk/miles/papers/emnlp04.pdf

    @Miles: thanks, very cool. so basically your results support the conclusion that changing models after active learning is a bad idea :). i'm not super shocked, but it's nice to see!!!

  3. Jason Catlett and I looked at selecting examples using uncertainty sampling based on a naive Bayes model, and using them to train a decision tree with C4.5. (Yeah, this was the olden days...):

    Lewis, D. and Catlett, J. Heterogeneous uncertainty sampling for supervised learning. ICML 1994, pp. 148--156.

    We found an uncertainty sample of size 1000 was as good as a random sample of size 10000. This was a simple text classification problem with low class frequencies, so really any approach that managed to oversample positive examples was likely to do well.

    For what it's worth, I've found active learning being used, formally or informally, in quite a few text classification settings. It's starting to take off in particular in e-discovery.

  4. Very interesting post! With regards to whether active learning asymptotes before all your data is tagged, we did this with word segmentation (this didn't make it into our paper for ACL this year due to space constraints). The setting is adapting a Japanese word segmenter from the general domain to the medical domain using a 20M character unlabeled corpus and actually having an annotator label the data:

    http://www.phontron.com/images/tagaccuracies.png

    You can see that it's leveling off at 10000 examples, about 0.5% of the total. And for what it's worth, I've used it to create data for my word segmenter, but the same caveats still stand about whether the data is reusable.

  5. @hal yep. like many things, this is obvious in hindsight really. another point worth mentioning is that if you have a bad model and a good model (and both for the same task), random sampling for the good model can produce better results than active learning for the bad one.

  6. @daviddlewis @hal One of the reasons AL is so popular and "example learning" --which is the same thing-- isn't, is the catchy name. Truth in advertising!

  7. This comment has been removed by the author.

  8. Thought-provoking post, but I think your claim that actively selected data is useless in the light of future, potentially better techniques might be a bit off.

    Say you use active learning to select M instances for labeling now. Then, as a new technique, new features, or whatever comes along, why not use active learning again, using the M labeled instances as the seed set, to label N more instances? We then only need M+N to be less than P, the number of passively labeled instances required for the unbiased passive learner to reach the same performance. It seems likely that some instances are wasted when we seed with the M actively selected instances compared to seeding with M randomly sampled instances, but perhaps this waste is acceptable for the benefit of only having to label M instances to start with.

    I think this would be an interesting experiment to run. There might even be some theory to be developed on the relationship between the hypothesis space used when actively selecting the first M instances and the one used when labeling the N subsequent instances, and so on.

    Another perspective on this issue is whether you can come up with smarter ways of actively selecting instances for labeling than just letting the current intrinsic state of the learner decide. I would claim that this is something we do regularly. When we identify that our parser is bad at parsing questions, for example, we go out and label some questions, sampled in some way, and retrain our parser. If only we could get the learner to actively make these kinds of higher-level decisions...

    /Oscar Täckström

  9. @oooh!: Great point on the M+N<P thing; I had overlooked that! That said, there is probably more than one new algorithm, so you really need M+kN<P (ok, you can do it sequentially and maybe do better...), but I feel like you're still eventually going to lose. But I agree that it would be interesting to look at.

    Isn't the questions thing more of a domain adaptation issue? The only reason you care that it's bad at parsing questions is because you all of a sudden care about questions, and these are underrepresented in the Treebank, so any reasonable active learning approach _wouldn't_ select them anyway?

  10. I would characterize the key insight differently.

    There are in essence two ideas in the new active learning theory work. The first is that we can (information-wise) track a version space even when there is unknown and arbitrarily structured noise. The second is that we can do this with an importance-weighting approach which is computationally tractable. Using these techniques, we can keep track of the bias in P(y|x) explicitly, allowing us to deal with it.

    With regards to active learning on big data, the theory says that you can't asymptote out, because there is an explicit guarantee of competing with supervised learning on the same number of unlabeled datapoints processed.

    With regards to dataset transfer, the IWAL approach does allow some safe explicit transfer. However, it's effectively at a reduced dataset size, as you might expect intuitively. And the dataset size shrinks as the algorithm that you transfer to behaves more differently from the one gathering the data.

    With regards to active learning in practice, I think it depends a great deal on the mode that you are working in. If you are working on learning from well-prepared, large, widely used datasets, active learning basically doesn't exist, except perhaps in an oversampling-of-rare-events sense, as per Fei-Fei's image dataset.

    But much of machine learning in companies isn't of that form---instead it's creating little datasets that probably no one else will use to solve some specific problem, for example building a porn-page recognizer. Here, I think, active learning is often attempted, sometimes quite successfully.

  11. @oooh! @hal: with regard to the M+N<P argument, note that you shouldn't use the number of annotated examples to measure cost, as the actual cost is far from linear in the number of examples: the first examples of any annotation effort are by far the most expensive (you need to hire and train annotators, and they need to practice), while large parts of P may be redundant or easy, and thus very fast/cheap for the human annotator (as well as for the AL learner, which will not select them to begin with). This is true for AL in general, and even more so in light of the M+N argument.

  12. Katrin Tomanek also considers data reuse in her PhD dissertation.

    https://eldorado.tu-dortmund.de/bitstream/2003/27172/1/dissertation_katrin_tomanek_final_color.pdf

  13. A couple of years ago, we ran a small survey on the nature of annotation projects in language processing in general, and the adoption of active learning in particular; it turned out, not surprisingly, that active learning wasn't widely used:

    http://soda.swedish-ict.se/3613/

  14. The Sanjoy Dasgupta paper from NIPS a while back gave an interesting view on active learning.

  15. John's right that our little company does active learning with one-off datasets we're not allowed to redistribute.

    We're often working in the context of customers that have a pile of data, with lots of outliers. If you just select by uncertainty, you wind up training on very atypical examples, which don't help much with accuracy.

    The second issue is that we're never evaluating by micro- or macro-F or classification accuracy. Instead, we almost always have a minimum precision requirement and need to maximize recall at that precision (for instance, in question answering or spelling correction for customers that don't want to "look stupid" or have human backup in cases of uncertainty). Sometimes it's a high-recall requirement, as when we worked with intelligence analysts and biologists (both of whom are highly dedicated, pain-tolerant users who'll sift through lots of false positives to find what they want but won't tolerate very many false negatives).

    For high-precision classifiers, what we do is set up a search engine over the data and let our customers fish for positive examples (this is related to Dave Lewis's comment). Then we train a classifier and use it to classify the entire data set. Next, we take the highest-confidence classifications and have them annotated by humans.

    I posted roughly these same comments on John Langford's blog post "The End of the Beginning of Active Learning", and he explained how you could use their setup with weighted rewards to recreate roughly the same behavior in selecting examples by expected improvement in the metric you care about.

    One of the things that's going on is that with natural language data, there's a huge feature space. A few training examples don't cover the space very well, so the weights for unseen features are all zero (assuming you're doing some kind of regression or linear margin learning). Active learning the way we do it essentially does what Yarowsky and crew used to call "bootstrapping", which iteratively grows the set of positive examples.

  16. Another issue with active learning is that the labeling efficiency gains usually aren't huge -- say a factor of 2 or so (I'd love to see more efficient versions).

    The effort to set up active learning can often dwarf that efficiency gain if you're only labeling a few hundred examples. In most of what we do, annotators can easily label 100 examples/hour, whereas it's rather more complicated to set up servers inside customers' sites that provide active learning that beats the roughly one or two person-days it takes to label enough data.

    When coupled with fear of data selection bias, it hardly seems worth it in many cases. We only bother with large-scale ongoing problems.

  17. I like the idea of data reuse. This is related to the shared, open dataset perspective that has fueled many research projects. However, for many practical applications the data may not be shareable, and it may not be reusable internally. There is also a more insidious problem related to active data collection vis-à-vis experimental data: the bias in the data cuts both ways. Tools like cross-validation become invalid when considering how a model would perform in the wild. There are a few papers out there on active data collection specifically for evaluation (as opposed to model training), where the goal is to get a confident assessment of a model's performance on an entire data set from as small a sample (and therefore budget) as possible.

    Regarding your last comment, there are at least 10 papers that discuss what I call "early halting" in active learning. For good examples, see:
    http://web.mit.edu/seyda/www/Papers/CIKM07_LearningontheBorder.pdf
    and:
    http://dl.acm.org/citation.cfm?id=1596384
    This work seems to be frequently duplicated, since it's often just a tangential section in an otherwise unrelated paper.

    I have personally spent a tremendous amount of effort trying to get active learning to work for practical problems, but I haven't had too much success. This includes trying many of the very recent techniques; more classic, simple techniques; density-sensitive techniques (cf. Nguyen & Smeulders); ensembles of active learning heuristics; and crazy schemes that are way too complicated to make it into any research paper.

    A recent KDD Explorations paper backs up these difficulties: http://dl.acm.org/citation.cfm?id=1964906
