I really enjoyed Mark Dredze's talk at EMNLP on multiclass confidence-weighted (CW) algorithms, where they take their CW binary predictors and extend them in two (basically equivalent) ways to a multiclass/structured setting (warning: I haven't read the paper!). Mark did a great job presenting, as always, and dryly suggested that we should all throw away our perceptrons and our MIRAs and SVMs and just switch to CW permanently. It was a pretty compelling case.
Now, I'm going to pick on basically every "yet another classifier" paper I've read in the past ten years (read: ever). I'm not trying to point fingers, but just trying to better understand why I, personally, haven't yet switched to using these things and continue to use either logistic regression or averaged perceptron for all of my classification needs (aside from the fact that I am rather fond of a particular software package for doing these things -- note, though, that it does support PA and maybe soon CW if I decide to spend 10 minutes implementing it!).
Here's the deal. Let's look at SVM versus logreg. Whether this is actually true or not, I have this gut feeling that logreg is much less sensitive to hyperparameter selection than are SVMs. This is not at all based on any science, and the experience it's based on is somewhat unfair (comparing megam to libSVM, for instance, which use very different optimization methods, and libSVM doesn't do early stopping while megam does). However, I've heard from at least two other people that they have the same internal feelings. In other words, here's a caricature of how I believe logreg and SVM behave:
That is, if you really tune the regularizer (lambda) well, then SVMs will win out. But for the majority of the settings, they're either the same or logreg is a bit better.
As a result, what do I do? I use logreg with lambda=1. That's it. No tuning, no nothing.
(Note that, as I said before, I haven't ever run experiments to verify this. I think it would be a moderately interesting thing to try, to see if it really holds up when all else -- e.g., the optimization algorithm, early stopping, implementation, choice of regularizer (L1, L2, KL, etc.), and so on -- is held constant... maybe it's not true. But if it is, then it's an interesting theoretical question: hinge loss and log loss don't look that different, despite the fact that John seems to not like how log loss diverges: why should this be true?)
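To make that experiment concrete, here is a minimal sketch using scikit-learn (not one of the packages mentioned here; the dataset, grid, and solver settings are all illustrative assumptions): sweep the regularization strength for logistic regression and a linear SVM and see how flat each accuracy curve is.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=20,
                           random_state=0)

# C is the inverse regularization strength; C=1 plays the role of the
# "lambda=1, no tuning" default described above.
grid = np.logspace(-3, 3, 13)

for name, make_clf in [("logreg", lambda c: LogisticRegression(C=c, max_iter=2000)),
                       ("linear SVM", lambda c: LinearSVC(C=c, max_iter=20000))]:
    accs = np.array([cross_val_score(make_clf(c), X, y, cv=5).mean() for c in grid])
    at_default = accs[np.argmin(np.abs(grid - 1.0))]
    print(f"{name:10s}  best={accs.max():.3f}  at C=1: {at_default:.3f}  "
          f"spread over grid={accs.max() - accs.min():.3f}")
```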
This is also why I use averaged perceptron: there aren't any hyperparameters to select. It just runs.
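For reference, a bare-bones sketch of the binary averaged perceptron (numpy only; +/-1 labels assumed). The epoch count is the one knob left, a point that comes up in the comments below.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Binary averaged perceptron; entries of y are assumed to be +1/-1."""
    n, d = X.shape
    w = np.zeros(d)       # current weight vector
    w_sum = np.zeros(d)   # running sum of weight vectors, for the final average
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if y[i] * X[i].dot(w) <= 0:   # mistake (or zero margin): update
                w = w + y[i] * X[i]
            w_sum += w                    # accumulate after every example
    return w_sum / (epochs * n)           # predict with sign(x.dot(w_avg))
```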
What I'd really like to see in future "yet another classifier" papers is an analysis of sensitivity to hyperparameter selection. You could provide graphs and stuff, but these get hard to read. I like numbers. I'd like a single number that I can look at. Here are two concrete proposals for what such a number could be (note: I'm assuming you're also going to provide performance numbers at the best possible selection of hyperparameters from development data or cross validation... I'm talking about something in addition):
- Performance at a default setting of the hyperparameter. For instance, SVM-light uses something like the average inverse norm of the data vectors as the C parameter. Or you could just use 1, like I do for logreg. In particular, suppose you're testing your algorithm on 20 data sets from UCI. Pick a single regularization parameter (or parameter selection scheme, a la SVM-light) to use for all of them and report results using that value. If this is about the same as the "I carefully tuned" setting, I'm happy. If it's way worse, I'm not so happy.
- Performance within a range. Let's say that if I do careful hyperparameter selection then I get an accuracy of X. How large is the range of hyperparameters for which my accuracy is at least X*0.95? I.e., if I'm willing to suffer 5% multiplicative loss, how lazy can I be about hyperparameter selection? For this, you'll probably need to grid out your performance and then do empirical integration to approximate it (a rough sketch of this computation follows the list). Of course, you'll need to choose a bounded range for your hyperparameter (usually zero will be a lower bound, but you'll have to pick an upper bound, too -- but this is fine: as a practitioner, if you don't give me an upper bound, I'm going to be somewhat unhappy).
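Here is a rough sketch of how both proposed numbers could be computed from a grid of held-out accuracies (numpy; the function name, the evenly-spaced-grid assumption, and the default 5% tolerance are illustrative choices, not prescribed above).

```python
import numpy as np

def hyperparam_summary(lambdas, accuracies, default=1.0, tol=0.05):
    """Two single-number summaries of hyperparameter sensitivity.

    lambdas: grid of hyperparameter values, assumed evenly spaced over the
             chosen bounded range; accuracies: held-out accuracy at each value.
    """
    lambdas = np.asarray(lambdas)
    accuracies = np.asarray(accuracies)
    best = accuracies.max()
    # (1) performance at a fixed default setting of the hyperparameter
    at_default = accuracies[np.argmin(np.abs(lambdas - default))]
    # (2) fraction of the bounded range whose accuracy is within a 5%
    #     multiplicative loss of the best -- crude empirical integration,
    #     here just the fraction of grid points in the "good" region
    frac_good = np.mean(accuracies >= (1 - tol) * best)
    return at_default, frac_good
```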
(As an aside, Mark, if you're reading this, I can imagine the whole CW thing getting a bit confused if you're using feature hashing: have you tried this? Or has someone else?)
Comments:
From my limited experience with perceptrons, they are statistically significantly worse than logreg/SVMs, at least for sequence labeling tasks. It would be nice to see a third curve for averaged perceptron in your caricature (of course it will be a flat line, as there is no hyper-parameter).
The same anon as above.
Since there is no free lunch, any ideas why averaged perceptrons enjoy comparable, if not better, performance than logreg/SVMs, even though there is nothing to tune?
@anons: In my experience on many tasks, avg perceptron always lags both batch optimizers with reasonable loss functions and regularization, and more refined stochastic gradient or online methods. Always. Sometimes by a lot.
@Hal: Feature hash collisions are just disjunctions of the original features, so they should do nothing bad to CW algorithms.
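To make the disjunction point concrete, here is a toy feature-hashing sketch (plain Python; the feature names and bucket count are made up): colliding features share a single bucket, hence a single weight and, for CW, a single confidence.

```python
def hash_features(feats, n_buckets=8):
    """Map a dict of named feature values into a fixed-size hashed vector."""
    vec = [0.0] * n_buckets
    for name, value in feats.items():
        vec[hash(name) % n_buckets] += value  # collisions simply add up
    return vec

x = hash_features({"word=bank": 1.0, "suffix=ing": 1.0})
# If "word=bank" and "suffix=ing" land in the same bucket, the learner keeps a
# single weight (and, for CW, a single confidence) for their disjunction.
```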
These opinions always make me happy because I'm too lazy to try gazillions of classifiers and settings.
It's easy enough to draw the curves for held out performance (or in-sample log likelihood if you're a statistician).
Friedman et al. have an interesting pathwise coordinate descent that they claim fits a dense and broad range of priors as fast or faster than a single reasonably high variance prior. They start with a very low variance prior and then gradually increase it, warm starting from the previous solution. Here's their glmnet R package documentation.
Interesting post.
In some sense, SVM and logistic regression are solving two different types of problem: classification and class probability estimation, respectively. The predictions from a logistic regressor can be easily turned into classifications via a threshold but there is no consistent way to turn the scores output by an SVM into probability estimates.
I don't think this observation explains the difference in sensitivity you have suspected but it is worth keeping in mind when comparing log loss and hinge loss since the former is proper for probability estimation while the latter is not.
doesn't perceptron have a learning rate?
@Hal: I was thinking about your questions sitting on a beautiful wild beach on Saturday (I had to do something with my frontal cortex...). One way to evaluate sensitivity to hyper-parameters would be to estimate mean and stdev test error where the training data is split randomly into alpha training examples and (1-alpha) held-out hyper-parameter setting examples, for varying alpha. If SVM is indeed more sensitive than logreg, you'd see worse statistics for alpha sufficiently close to 0 with SVM than with logreg.
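A rough sketch of that protocol (scikit-learn; the dataset, grid, and restriction to logistic regression are illustrative assumptions -- the same loop would be run with the SVM to make the comparison):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=50, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = np.logspace(-3, 3, 7)
for alpha in (0.9, 0.7, 0.5, 0.3):
    errs = []
    for seed in range(10):
        # alpha of the pool for training, (1 - alpha) for hyperparameter setting
        X_tr, X_dev, y_tr, y_dev = train_test_split(
            X_pool, y_pool, train_size=alpha, random_state=seed)
        best_c = max(grid, key=lambda c: LogisticRegression(C=c, max_iter=2000)
                     .fit(X_tr, y_tr).score(X_dev, y_dev))
        model = LogisticRegression(C=best_c, max_iter=2000).fit(X_tr, y_tr)
        errs.append(1.0 - model.score(X_test, y_test))
    print(f"alpha={alpha:.1f}  test error {np.mean(errs):.3f} +/- {np.std(errs):.3f}")
```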
anon: well, as fernando says, they're usually not quite as good.
fp: re hashing, i guess the thing i'm worried about is becoming overly confident on one feature in the disjunction and then using this confidence for another feature in that disjunction that you really shouldn't be confident about. i don't really doubt that it will still work: i think i just need a slightly tweaked intuition.
bob: yeah, i do the same thing in megam: the -tune will do this for you. it seems to work fine, but my experience is that for logreg, it really doesn't matter that much. and yes, graphs are nice, but if i'm comparing a dozen algorithms, it would be nice to have a single number.
mark: indeed!
anon2: yes, but it's irrelevant.
fp2: interesting thought... why not just keep the number of training examples fixed, though, and vary the number of held-out examples? the only complaint i might have is "maybe if i only have two held-out examples, i should just use lambda=1 rather than try to fit it"?
- The (averaged) MIRA/PA algorithm doesn't have a tunable parameter other than the number of iterations, and seems to produce robust results.
- The thing I like about the CW algorithm is that it should be able to provide better probabilistic estimates than logreg. I never really trusted the probabilities produced by logreg models as indicators of confidence in the prediction. At least from the paper, CW seems to address this nicely.
- What worries me about CW is the rare features -- I think they will get low variance and a high weight, which might cause a real overfitting problem, especially since, if I understand correctly, the confidence of the feature weight is taken into account in training but is ignored at test time.
yoav: some of the mira variants have a "C" parameter. is there any reason that CW *should* produce probabilistic estimates? are these estimates consistent? (sorry -- i haven't read all the CW papers, but i don't remember seeing any claims along these lines.)
Mark, I see these algorithms as being more similar than different.
In my experience, SVM and L2-regularized logistic regression have given nearly identical performance on a range of high dimensional text and email spam classification tasks. (Linear classifiers, binary classification tasks.) I haven't noticed any major differences in sensitivity to parameter tuning.
Part of this can be explained by the fact that linear SVM and L2-regularized logistic regression can be viewed as solving nearly identical optimization problems:
min(w): ||w||_2 + c*Loss(w, D)
For SVM, Loss(w, D) is hinge loss and for logistic regression Loss(w, D) is logistic loss. When you plot hinge loss and logistic loss they are extremely close to each other: some difference at the hinge point of SVM loss, and logistic loss asymptotes to 0 instead of reaching 0.
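A quick numeric look at the two losses as functions of the margin m = y * (w . x) (plain numpy; the grid of margins is an arbitrary illustrative choice):

```python
import numpy as np

m = np.linspace(-2, 4, 7)             # margins from badly wrong to confidently right
hinge = np.maximum(0.0, 1.0 - m)      # SVM hinge loss
logistic = np.log1p(np.exp(-m))       # logistic loss (natural-log units)
for mi, h, l in zip(m, hinge, logistic):
    print(f"margin={mi:+.1f}  hinge={h:.3f}  logistic={l:.3f}")
# Both grow roughly linearly for violated margins; they differ mainly around the
# hinge's corner at m=1 and for large margins, where hinge is exactly 0 while
# logistic loss only approaches 0.
```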
For me, the main differences are that SVM has the benefit of giving slightly sparser solutions (because hinge loss for many examples will be exactly 0 while logistic loss will only ever approach 0), while logistic regression can give very slightly better models in near-noiseless high dimensional data sets for roughly the same reasons.
As I said, I haven't noticed any particular differences in parameter sensitivity between these two algorithms, both in batch packages like libSVM, SVM-light, liblinear, and trirls, and in some of my own implementations of both learners with stochastic methods.
I've heard that this near-equivalence may not be the case for L1-norm regularization (although I haven't tried it myself), where the flatness of SVM/hinge loss may cause some odd interactions and make things more parameter-sensitive.
Interested to hear your thoughts.
@hal: yes, CW can produce probabilistic output, although I'm not sure it can be done efficiently. Check section 7 in the multi-class CW paper.
I've heard a few people say that averaged perceptron lags CRFs/max-margin methods, but I have a different perception: I'm not sure if that's a) because of a difference of opinion about what constitutes a "significant/big" difference or b) because I've not run comparisons on the correct problems. Does anybody have citations to results on sequence learning tasks where there is a big difference?
As a couple of examples: in the paper by Fei Sha and Fernando at NAACL 2003 on NP chunking, CRFs gave a 0.3% gain over the perceptron; in the Koo et al. paper at EMNLP 07 on matrix-tree training for dependency parsing, the gain was 79.05% -> 79.82% when going from perceptron to CRF or max-margin methods. These gains may be statistically significant but in my mind are small, and the perceptron is far more efficient (partly because of the lack of a regularizer constant to be tuned).
@Mike: those differences may not be huge, but as you say they are significant, and that deserves explanation. As for not having to tune a regularization parameter, for averaged perceptron (AP) you have to tune the number of iterations, and there are several other delicate issues (instance scaling, shuffling between iterations) that need attention. The main benefit of AP or PA for structured outputs is the algorithmic simplification of not having to compute marginals/normalization.
In the CRAIG gene predictor, we found that AP did very poorly, while the PA variant used in the published results did much better. For a variety of practical reasons (approximate inference/pruning/lack of space for the forward-backward lattice), it was impractical to use a batch learner for that problem, but I have no reason to doubt that a batch learner would do at least as well as PA.
Hal's original point was about classification, though, where inference cost is not an issue. In those cases, I cannot think of many situations where AP (or PA) would be preferable to a learner that is provably convergent (at some reasonable rate) to the training objective. The CW algorithm(s) are a potential exception in that they use a bit of second-order information very effectively to compete with batch learners in at least some situations. Koby, Mehryar and I have an AISTATS paper where we develop a batch version of CW that has a nice theoretical interpretation but that unfortunately is quite hard to make work well in practice. This just tells me that there's still a lot that we don't understand even for binary classification.
@mike, fp:
I've been playing a little with RNA folding lately, and there also PA does much better than AP.
Re the differences between AP and CRF/max-margin in NLP applications, I agree with Mike: the numbers he quotes are tiny, and probably meaningless. I'm not saying that CRF/max-margin are not learning something that AP missed. They probably do, as they consistently produce just-a-little-higher results. From a learning perspective, the difference is significant. But if your concern is not pure learning but the higher-level tasks (chunking, parsing...) in a real-world setting (e.g., when you actually care to apply your parser/chunker to text and use the results, not just report the official test-set results in a paper), then these numbers give no convincing reason to prefer the other methods over the AP.
The significance tests results don't tell you much about the real-world difference in performance, probably because their assumptions do not hold in this case, especially when you have models with many lexical parameters (new text is practically never distributed the same as the test-set, in a sense every application of the system to new text is a small-scale adaptation scenario).
A better way to compare the performance of two nlp systems would be to apply both of them on some new text (say, this year's wsj articles), and see where and by how much the outputs differ.
While this is not as easy as running evalb on section 23, this is also not very hard to do (we used this methodology in our last EMNLP paper, "on the role of lexical features in sequence labeling"). My guess is that under such a test, there will be no observable difference between the AP and the other methods on the tasks Mike mentioned.
@Fernando:
It's true that the number of iterations of training for the AP should be validated, but this comes at very little additional computational cost, because the different models are computed during one training run over the data. This is in contrast with CRFs/MM, where different models have to be trained for different regularizer constants (warm starts may help a little here). I've never worried about shuffling/instance scaling; the numbers I've quoted where perceptron is close to other methods don't use these, and it would be interesting to see what effect they have.
It seems that AP is close to other methods on several NLP problems -- POS tagging, NP chunking, parsing. (Another example, from Ryan and Fernando's paper at ACL 05: on English dependency parsing, AP = 90.6, MIRA = 90.9; on Czech, AP = 82.9, MIRA = 83.3.)
It also seems that CRFs/max-margin in general give a small gain, and it sounds like on some problems they give a big gain. So the message would appear to be that if you can afford to do it, training a CRF/MM model is the safe option. But I'd say there is a trade-off here, in that the AP is more efficient. (Perhaps MIRA hits a sweet spot, in being more robust than the perceptron, but being just as efficient?)
Another issue for structured problems, and another possible trade-off: in contrast to CRFs at least, online algorithms such as the AP lead to relatively sparse solutions, when "negative" features are used (features which do not appear on correct structures in the training set, but which do appear on incorrect structures). Including negative features can give a gain on some problems in my experience, but will often increase the number of features quite dramatically. The AP (and algorithms such as MIRA) only give non-zero weight to features that are seen on correct training examples, or incorrect examples that have achieved the argmax during decoding. CRFs run into trouble in this case. As one example, we used negative features in a paper on discriminative TAG parsing last year, but I don't think we could have done this with a CRF.