Comments on natural language processing blog: Overfitting

2015-10-12 08:04:
@Unknown: efficiency is dependent on the loss itself. Minimizing square loss to obtain class estimates for binary and multi-class classification is identical to minimizing the Brier scoring rule. The Brier rule is a proper scoring rule (see e.g. Buja et al., 2005) and thus consistent, and although I am not aware of a pointer to the literature, I would assume that it is the asymptotically most efficient estimator for the squared-error loss on the probabilities.
-- Anonymous

2015-10-08 21:10:
@Rajhans Samdani: Your definition related to SRM is my favorite. Basically, I have overfit if I regret my model choice compared to if I had chosen a simpler model. The situation is subtle because sometimes (e.g., when fitting linear models) we think we are not in the SRM regime, because the model has "fixed complexity". But in fact, when using SGD, every gradient step can enlarge the model class by allowing the parameters to grow.

@yisong yue: I like Pedro's paper, but I prefer the more general analysis of Gareth James ("Variance and Bias for General Loss Functions", Machine Learning, 51(2), 2003).

@hal: While minimizing square loss for classification does produce class probability estimates, I believe it is inefficient statistically. Maybe someone else can confirm?
-- Unknown

2015-10-01 16:58:
The predictor depends both on the particular training set (of size N) and on the method used to find a suitable model from the model class. For each predictor there is a true risk (expected loss) and a finite-sample risk on the test set.

If we consider repeating the training experiment, we can consider a predictor S_N that is the infinite weighted mixture of all predictors obtained from each possible training set of size N, weighted accordingly. Likewise we can define an expected finite-sample risk function that takes any new data set and outputs the expected risk achieved by S_N. This definition removes the dependence on the particular training set realization but retains the finite-sample estimation error.

One possible definition of overfitting is then this: given a single training set of size N and a predictor F obtained from these samples, overfitting happens when the test risk R[F] is larger than the test risk R[S], or exceeds the upper quantile of the distribution of R[S].

(Using one fixed test set is a way to estimate R[F], and bootstrap resampling is a way to estimate R[S]. The above definition is not convenient to evaluate in practice, but I think it is close to a useful definition.)
-- Anonymous

2015-09-20 13:12:
Appreciate the timely discussion, which has prompted me to create a slide for a recent lecture called "zero overfitting not desirable".
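[Editor's note: the bootstrap estimate of R[S] mentioned above can be sketched as follows. This is a toy illustration with a least-squares line fit, not from the thread; all variable names and the choice of a linear model are mine.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise. One training set of size N, one test set.
N = 40
x_train = rng.uniform(-1, 1, N)
y_train = 2 * x_train + rng.normal(0, 0.5, N)
x_test = rng.uniform(-1, 1, 500)
y_test = 2 * x_test + rng.normal(0, 0.5, 500)

def fit(x, y):
    # Least-squares line fit; returns (slope, intercept).
    return np.polyfit(x, y, 1)

def risk(coef, x, y):
    # Squared-error risk of the fitted line on a data set.
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

# R[F]: test risk of the predictor trained on the actual training set.
R_F = risk(fit(x_train, y_train), x_test, y_test)

# Bootstrap approximation to the distribution of R[S]: refit on
# resampled training sets and evaluate each refit on the test set.
boot_risks = []
for _ in range(200):
    idx = rng.integers(0, N, N)
    boot_risks.append(risk(fit(x_train[idx], y_train[idx]), x_test, y_test))
boot_risks = np.array(boot_risks)

# Flag overfitting (in the commenter's sense) if R[F] exceeds the
# upper quantile of the bootstrap risk distribution.
q90 = np.quantile(boot_risks, 0.9)
overfit_flag = R_F > q90
```

As the commenter notes, this is awkward to evaluate in practice, since it requires refitting the model once per resample.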
(I liked Bob's way of talking about this.)
-- Charles Sutton

2015-09-17 16:08:
@hal @yisong now i think that bias/variance for (probabilistic-output) classification corresponds to "calibration/reliability" and "refinement" in the calibration literature. an unbiased probabilistic classifier means that when it says 80% chance of rain, it really does rain 80% of the time in the test set. obviously unbiasedness is not the only interesting thing, since you could always predict the marginal probability and get an unbiased predictor -- just like in regression you can predict the mean to be perfectly unbiased but not that useful.
-- Brendan O'Connor

2015-09-14 10:00:
@Yaroslav: For an application that brings real money to a company, the difference between 99.1% accuracy and 99.2% accuracy could be worth tens or hundreds of millions of dollars. So overfitting is a real issue.
-- Yisong Yue

2015-09-11 14:54:
@Yaroslav: couldn't disagree more. Overfitting is certainly less of an issue when data is cheap, but I believe that in the big picture of applications this is a rather rare situation. Even when data is relatively inexpensive, the more real money your application brings in, the more important even tiny improvements in accuracy become. Making such improvements reliably requires a lot of data (note that in general the number of samples necessary scales quadratically with the inverse of the improvement in the error).
-- Vitaly

2015-09-11 13:20:
In some sense "overfitting" is an academic phenomenon. For an application that brings real money, a company would be motivated to pay human subjects to collect a few hundred million training examples, at which point there is no overfitting (see arxiv.org/pdf/1312.6082).
-- Yaroslav Bulatov

2015-09-11 13:03:
@Vitaly: yes duh :). sorry, brain not working!

I think I've boiled this down to the following for me:

- When I think of "overfit", to me it means "more than fit", and then the case I mentioned to Bob above with two hypothetical classifiers cannot possibly be overfit because it's not even fit.

- The alternative definition is that "overfit" means positive generalization error.

@Rajhans: yeah, that's basically what I have in my head, but the requirement of a chain of models seems really problematic.
-- hal

2015-09-11 12:57:
... of course, I meant to write
Loss(T, f_{T, lambda}) < Loss(T, f_{T, lambda'})
-- Anonymous

2015-09-11 12:46:
Great discussion going on here!
I think one angle that is missing between some of the attempts at defining model-specific overfitting and the desired intuitive definition from your graphs is the SRM-ish angle. In particular, what your graphs present is a sequence of models from hypothesis families following a total order (assuming that the hypothesis class with smaller regularization contains the one with larger regularization), whereas one can say that Moritz and co.'s definition considers a model at a specific point on the x axis.

How about the following, more procedural, definition of overfitting in the sense you present in your graphs (and Alexandre talks about in the sense of comparing two hypotheses):
Consider a total order of hypothesis classes H_{lambda}, where H_{lambda} \subset H_{lambda'} if lambda' > lambda (so lambda is sort of akin to the inverse of the regularization param). Let D be the data distribution and T be some training data drawn from D.

Let f_{T, lambda} = argmin_{h \in H_{lambda}} Loss(T, h)

Then f_{T, lambda} is said to "overfit" if \exists lambda' < lambda such that

f_{T, lambda} < f_{T, lambda'}

BUT

E_{D}(Loss(f_{T, lambda})) > E_{D}(Loss(f_{T, lambda'})).

This is a fairly trivial definition and I'm not sure if it is useful. But I guess it formalizes some of the stuff that has been said in "English".
-- Anonymous

2015-09-11 12:26:
@yoav i don't know much at all about the literature on training classifiers by minimizing squared loss and was interested in just piecing together the subliteratures out there on it.
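[Editor's note: Rajhans's nested-classes definition above can be illustrated with a toy example. This is my own sketch, assuming numpy; here H_{lambda} is the class of polynomials of degree lambda, so the classes are nested as required.]

```python
import numpy as np

rng = np.random.default_rng(1)

# Small training sample T from y = sin(2x) + noise; a large held-out
# sample stands in for the distribution D.
x_tr = rng.uniform(-1, 1, 15)
y_tr = np.sin(2 * x_tr) + rng.normal(0, 0.3, 15)
x_te = rng.uniform(-1, 1, 2000)
y_te = np.sin(2 * x_te) + rng.normal(0, 0.3, 2000)

def mse(coef, x, y):
    return float(np.mean((np.polyval(coef, x) - y) ** 2))

train_mse, test_mse = {}, {}
for lam in [1, 3, 9]:  # nested classes: degree-1 subset degree-3 subset degree-9
    coef = np.polyfit(x_tr, y_tr, lam)   # f_{T, lambda} = empirical minimizer
    train_mse[lam] = mse(coef, x_tr, y_tr)
    test_mse[lam] = mse(coef, x_te, y_te)

# Because the classes are nested, training loss is non-increasing in lambda.
# f_{T, 9} "overfits" in the sense above if it beats some smaller class on
# training loss but loses to it on the held-out proxy for E_D[Loss].
overfits = (train_mse[9] < train_mse[3]) and (test_mse[9] > test_mse[3])
```

The chain-of-models requirement hal objects to shows up directly here: the verdict for degree 9 is only defined relative to the smaller classes in the chain.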
i know more about sq loss for probabilistic binary classification evaluation; it certainly admits the bias/variance breakdown just like linear regression, and thus may be a way to understand overfitting, and i like the calibration/refinement breakdown too (which is related in some way i'd like to know sometime). this is also known as the Brier score and is used a lot in the forecasting literature (such as in meteorology or human forecasting evaluations).

@hal -- thanks for the references. there's some neural-networks speech paper i saw once that compared sq loss versus log-loss but i forget what it was now. also there's this "linear probability model" thing from econ: training with sq loss, and then they argue better semantics for the coefficients.
-- Brendan O'Connor

2015-09-11 11:51:
@Hal: Thanks! True error is the exact sum of the training error and the generalization error (relative to the training set). So there is nothing lost in this expression. Generalization error is of course not known exactly, so we rely on the test set instead to estimate the true error (and that's the place where we need to bound things). One way to relate your argument to our paper is to note that even for the test set we need to balance the utility (or how many and what kind of estimates we can obtain) with overfitting to the test set (or, equivalently, the accuracy of those estimates).
-- Vitaly

2015-09-11 08:39:
@Dallas: I agree to some degree, though ".... all of the models *are overfitting*" seems like a circuitous argument.

In thinking about it more, what's wrong with saying something like:

(1) f is overfit to Train if Error(f, Train) << Error(f*, Train), where f* is the classifier in your hypothesis class that minimizes Error(f*, Distribution)?

(2) The same, but not limited to a hypothesis class; thereby, f is overfit to Train if Error(f, Train) << Error(f*, Train), where f* is the Bayes-optimal classifier?

Of course these are horribly not-possible-to-compute :), but I think they get much more at my internal notion of overfitting (exploiting stuff that's specific to Train but doesn't generalize) rather than "train better than test".

@Yoav: it's known/easy to show that if f* minimizes squared error, then f*(x) = E_{y~D|x}[y], so minimizing squared error estimates expectations. In the case that the labels are binary {0,1}, this is exactly doing class posterior probability estimation. A standard way to avoid outlier problems (this is what vw does, for instance) is to clip predictions at the min/max seen in the training data. I.e., if you give me some f with arbitrary range, I replace it with f'(x) = min(1, max(0, f(x))), and then do all my computation from there. There's some discussion here (http://hunch.net/~jl/projects/reductions/tutorial/paper/chapter.pdf) and here (http://hunch.net/?p=547).

@S: there's an issue with 10-fold cross validation for estimating variance, namely that it is biased: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf ; see also http://jmlr.org/papers/volume6/markatou05a/markatou05a.pdf

@Vitaly: THANKS! Totally agree. If you picked only based on generalization error, you'd pick a random-guessing hypothesis, which is clearly UNDERfit! That would have zero generalization error :).
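[Editor's note: hal's point about squared loss and posterior probabilities can be checked numerically. This is my own toy sketch: with a discrete context feature, the per-context constant that minimizes squared error is just the per-context mean of y, i.e. the empirical P(y=1 | x).]

```python
import numpy as np

rng = np.random.default_rng(2)

# Three discrete contexts x in {0, 1, 2} with known posteriors P(y=1 | x).
true_posterior = np.array([0.1, 0.5, 0.9])
x = rng.integers(0, 3, 30000)
y = (rng.uniform(size=x.size) < true_posterior[x]).astype(float)

# Minimizing squared error over per-context constants yields the
# per-context mean of y: the empirical posterior estimate.
f = np.array([y[x == c].mean() for c in range(3)])

# The clipping hal describes, f'(x) = min(1, max(0, f(x))), is a no-op
# here (means of 0/1 labels already lie in [0, 1]) but shown anyway.
f_clipped = np.clip(f, 0.0, 1.0)
```

With a richer model class the estimates are no longer exact group means, but the minimizer still approximates E[y | x], which is the property hal is pointing at.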
I guess it wasn't clear enough in my post, but I think the whole idea of looking at the reusable holdout is totally awesome, and the way you all do it is really insightful. I love this work. In a sense, I guess what my concern boils down to is something like the following: we know that training error + generalization error bounds the true error, but perhaps there's something real and practical being lost in the fact that this is a bound?
-- hal

2015-09-10 23:37:
Thank you for the thought-provoking post, Hal, and your thoughts on the definition of overfitting used in our paper. I'd like to first clarify the definition of overfitting we use. Basically, a model overfits to a dataset S if it has a "large" generalization error, that is, the difference between the empirical error/loss on S and the error/loss on the true distribution that generated the data (or future data samples). A properly used test set gives a valid estimate of the true error, but it is also easy to obtain a model that overfits to the test set (for example, by doing too much parameter tuning using the test set).

As Yisong and Bob pointed out, the ultimate goal of learning is minimizing the true error. The true error is the sum of the training error and the generalization error, and our paper certainly does not advocate picking a hypothesis by minimizing only the generalization error. We do focus on the question of how to obtain statistically valid estimates of the true error of hypotheses when the test set is reused multiple times adaptively. This is equivalent to ensuring that we do not overfit to the *test set*.
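[Editor's note: the adaptive test-set reuse Vitaly warns about can be simulated. This is a toy demo of the failure mode, in the spirit of the examples in the Dwork et al. paper but not their algorithm; all names and sizes are mine.]

```python
import numpy as np

rng = np.random.default_rng(3)

# Pure noise: random +/-1 features, labels independent of everything.
n, d = 200, 2000
X_hold = rng.choice([-1.0, 1.0], size=(n, d))
y_hold = rng.choice([-1.0, 1.0], size=n)
X_fresh = rng.choice([-1.0, 1.0], size=(n, d))
y_fresh = rng.choice([-1.0, 1.0], size=n)

# Adaptive reuse: pick the features most correlated with the labels
# *on the holdout itself*, then build a voting classifier from them.
# (An odd number of voters avoids ties in the vote sum.)
corr = X_hold.T @ y_hold / n
top = np.argsort(-np.abs(corr))[:51]
weights = np.sign(corr[top])

def acc(X, y):
    return float(np.mean(np.sign(X[:, top] @ weights) == y))

acc_reused = acc(X_hold, y_hold)   # looks impressive
acc_fresh = acc(X_fresh, y_fresh)  # reveals there was never any signal
```

Because the feature selection queried the holdout adaptively, the holdout accuracy is no longer a valid estimate of true error, which is exactly the gap the reusable holdout is designed to close.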
Our "reusable holdout" algorithm does not restrict in any way the use of the training set and does not aim to prevent overfitting to the training set.
-- Vitaly

2015-09-10 22:06:
Thanks for the post. I just started with the Dwork et al. paper you referenced. I always wondered why one cannot use the average accuracy after 10-fold CV, and the standard deviation of accuracy over the 10 folds, as a measure of overfitting (instead of comparing training set vs. test set error, that is). I am generally asking from the practical point of view of a non-ML researcher who applies ML to some other problems.
-- S

2015-09-10 02:38:
@hal @brendan Can you elaborate on the squared loss for classification? This is new and surprising to me, as my understanding was that it is not a good metric (you pay a very high price for outliers, for example).
-- Yoav

2015-09-09 22:25:
An interesting "statistical physics definition" of overfitting, in the context of the network clustering problem, is the existence of a rugged energy (negative log-likelihood) landscape, as in spin glasses: http://www.pnas.org/content/111/51/18144.short
-- 张潘

2015-09-09 20:52:
I think the problem is that there are two different meanings of "overfitting" which are being conflated. I would say that your phrasing masks Dwork's definition just a bit, which is (again roughly) that a function f overfits to some dataset S if the empirical average of f on S is far from the expected value of f on the true underlying distribution (which is implicitly a sign of high variance).

All of the models in your post *are* overfitting to the training data at their optimal values, in that the error rates on the training set are far from the true error rates (as estimated by the test sets). As Yisong pointed out, overfitting to the training data (to some extent) is okay, and useful, as it allows you to trade off bias and variance in order to train a model with better performance in general. The only purpose of the development set, by contrast, is to get an accurate estimate of generalizable performance, so overfitting on that is inherently a problem, which is of course what Dwork et al. are concerned about.

What you're talking about, on the other hand, is basically "overfitting too much", which is what we commonly mean when we talk about overfitting; perhaps that's what we need a new name for.
-- Dallas

2015-09-09 20:21:
@Bob: yeah, my thinking went along these lines too. I think the problem is, as always, finite sample. I totally agree that "argmin_h risk(h)" is going to have identical training and test errors. Would that we had that :). But in the finite-sample case, like Alexandre, if you hand me h1 and h2, and h1 has matching train/test error while h2 has lower training *and* lower test errors, then it's going to be hard for me to claim that h2 is overfit. In particular, to the degree that the test sample is representative of the true distribution, h2 is clearly a better match to "argmin_h risk(h)".

@Yisong: yes, of course you're right. Thanks for the pointer to Pedro's paper; I hadn't seen that!

@Brendan: yeah, agree. Squared loss for classification (which is basically all I use anymore) has the nice property that the minimizer is the estimator of posterior probability, which is pretty much what you want most times. (Among other nice properties!)
-- hal

2015-09-09 19:46:
Since i've been thinking more and more about squared-error metrics for classification problems (like how we used the Brier score in our EMNLP calibration paper http://arxiv.org/abs/1508.05154 ... also apropos since john myles white is here, and all his comments on what they call the "linear prob model" in econ, i guess), I wonder what it would mean to just directly apply squared-error bias/variance analysis to a classification setting with labels and predictions on {0,1}, or if or how it corresponds to the calibration/refinement tradeoffs in the calibration literature (sorry to keep going on about stuff in my paper, it's just on my mind).
-- Brendan O'Connor

2015-09-09 19:38:
My favorite definition of overfitting is relative: given two models, A and B, A is overfit with respect to B if it has lower training but higher holdout loss. Then what we mean by overfitting is when either training a model more or regularizing it less leads to better training but worse holdout performance.
-- Alexandre Passos

2015-09-09 18:33:
Actually, the bias/variance tradeoff is model-agnostic... it's just not loss-agnostic. You can estimate the bias and variance of decision-tree regression w.r.t. squared loss via bootstrapping. However, providing useful definitions of the bias/variance tradeoff for other loss functions has proven problematic. (cf. http://homes.cs.washington.edu/~pedrod/papers/aaai00.pdf)
-- Yisong Yue

2015-09-09 12:22:
I think from a theoretical perspective, Dwork et al. have the right definition. Suppose you have the best possible parameter set, given the structure of the model and the true data distribution. If the training and test sets are both random samples of the true distribution, then it should be just as likely that the test error is lower than the training error as the other way round.
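[Editor's note: Yisong's bootstrap suggestion can be sketched as follows. This is my own toy version, using a polynomial regressor in place of a decision tree and assuming numpy; for squared loss the empirical bootstrap quantities decompose exactly as MSE = bias^2 + variance.]

```python
import numpy as np

rng = np.random.default_rng(4)

def truth(x):
    return np.sin(3 * x)

# One "real" training set; bootstrap resamples stand in for fresh draws.
n = 60
x = rng.uniform(-1, 1, n)
y = truth(x) + rng.normal(0, 0.3, n)
x_eval = np.linspace(-1, 1, 100)

# Collect predictions of the model refit on each bootstrap resample.
preds = []
for _ in range(300):
    idx = rng.integers(0, n, n)
    coef = np.polyfit(x[idx], y[idx], 5)
    preds.append(np.polyval(coef, x_eval))
preds = np.array(preds)                      # shape (300, 100)

mean_pred = preds.mean(axis=0)
bias_sq = float(np.mean((mean_pred - truth(x_eval)) ** 2))
variance = float(np.mean(preds.var(axis=0)))
# Average squared error of the bootstrap predictors against the truth;
# for squared loss this equals bias_sq + variance exactly.
mse = float(np.mean((preds - truth(x_eval)) ** 2))
```

The exact additive decomposition is what makes squared loss special here; as Yisong notes (citing Domingos), extending it usefully to other losses is the hard part.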
The fact that we systematically observe the training error to be lower means that we *are* always overfitting the training data.

However, avoiding overfitting is not our only, or even primary, goal. After all, a binary classifier that simply always says "yes" or always says "no" will have no overfitting, but will have a terrible error rate. Accepting some overfitting is a price we pay for learning something from the training data.

In practical terms, the situation we want to avoid is learning a model M1 when there is a systematic way of learning another model M2 that has higher training error than M1, but lower expected test error. In this case, it is tempting to say that with M1 the capacity of the model is being used to predict idiosyncrasies of the training data, instead of predicting aspects of the true distribution.

On the other hand, we want to prefer M1 to M2 if M1 has *both* lower training error and lower expected test error, even if the gap between training and test error is greater for M1 than for M2. In that case, I would say that M1 overfits more than M2, but it is still a better model.
-- Bob Moore

2015-09-09 10:46:
@John, @Yisong: I really like both of these suggestions, and the bias/variance trade-off has definitely been useful to me for thinking about problems, though computationalizing it is not always easy (decision trees?). I guess the main advantage I see with the Dwork et al. definition is that it's completely model-agnostic. In fact, it doesn't even need to be a computational model -- you could apply their definition to a human, mouse, or fruitfly :).

@Jon: I'm not sure I understand. What does "more difficult" mean, and how does it show in the results? Do you just mean that the dev distribution is different than train? That's not the case here, because I permute the dataset ahead of time.
-- hal