08 September 2015

Overfitting

Many students, in response to their assigned reading for today's undergrad ML class, asked me for a formal definition of overfitting. The fact that I don't really have one irked me, especially this this basically is the problem in machine learning. This blog post is an extended version of the answer I gave in class.

Intuitively, overfitting is when your learned model is doing much better on training data than it will do on test data from the same distribution. This happens because your model learns to fit random idiosyncrasies in the training data rather than model the true underlying trends.

I then pointed to a recent high-profile paper by Cynthia Dwork and colleagues on Holdout Reuse (appeared behind a paywall in Science and something similar seems to be in NIPS too). This paper presents some really cool ideas that use machinery from differential privacy to allow one to effectively reuse the same heldout ("development") data multiple times, without worrying about overfitting the holdout. But that's not what I wanted to highlight; I wanted to highlight the formal definition Dwork et al. give for overfitting. They (roughly) say that a model has overfit the training data when its test error is larger than its training error (by at least some small amount).

I feel like this definition captures some of the essence of overfitting, but it also misses what I would call my "everyday experience" with using machine learning in practical, real world settings. And my everyday experience is that you training error is always lower than your test error. To confirm to myself that I wasn't misremembering my experiences, I ran a few experiments on some simple classification tasks. Below is a plot of train/dev/test error on Reuters RCV1:



Here, the x-axis is regularization strength (this is all with libsvm, C=2^-xvalue) and the y-axis is error rate. There are error bars on dev/test but they're so small you can't really see them.

The thing I want to highlight in this figure is that the best performance, measured by dev or test performance, occurs when the regularization is a bit below 0. But the point at which the gap between dev performance and training performance starts to inflate is much earlier, maybe around -4. The title of the figure tells us that the test error achieved when the training error drops below dev error is 9.48%; but when dev bottomw out the test error is 6.08%. That's leaving 33% relative error on the table.

RCV1 isn't unique. Below are the same plots for MNIST on several binary splits:

For all three splits, some easier, some harder, the relative improvement is always around 20-30% by not using the definition provided by Dwork et al.

Having somewhat convinced myself that I'm not crazy, the immediate question is: what is a better definition of overfitting that captures my everyday experience? I completely agree that the definition quoted above is a necessary condition, but somehow it does not seem sufficient.

Even though this was my experience, I felt a bit flummoxed because I had a hard time pointing out exactly what I felt was missing in the definition above.

Here's what I ended up settling on. Overfitting typically happens, as stated above, because there are random idiosyncrasies in the training data that don't generalize. This is a finite sample effect. If I give you a dataset with 1000 training examples, and I add 100 completely random features, at least one of these features is going to correlate somewhat with the training labels. Overfitting occurs when the model chooses to use one of these completely random features, which helps drive the training error down, but hurts at test time.

But there's a "throwing the baby out with the bathwater" effect going on here. Although there may be completely random features that accidentally correlate slightly with the true label, there are also weak but nevertheless useful features that correlate a small amount with the true label. On a finite sample it's going to be virtually impossible to distinguish between these two.

By adopting the "training error < test error" rubric for overfitting, we're forcing ourselves to throw out both types of features. This is a trade-off, and empirically at least, it appears that this is a losing trade-off. It seems better to keep around a few of these weak-but-useful features in exchange for the risk of also keeping around a few truly-useless-and-misleading features. If this weren't the case, I would have a hard time explaining my "everyday experience."

So do I have a better suggestion?

Not really.

One really nice property of the Dwork et al. definition is that it's testable for a single model. That is, I hand you a learned function h and you can run h on some training data and some dev data and you can tell me if it's overfit. I don't have to draw curves like those above. I don't even have to know anything about how the model was built, and I don't have to sweep hyperparameters (which may or may not be sweepable).

I really like all of those properties. And I believe that if we came up with an alternative definition of overfitting that obeys these properties, we can reuse much of the awesome work in the Dwork et al. paper.

But I also feel like the definition above is overly conservative and it's risky to use.

28 comments:

John Myles White said...

Why not define overfitting in frequentist terms using a model family's decision-theoretic risk? If you associate every model in your model family with a capacity parameter, C, then you can define overfitting in terms of the relationship between model capacity, C, and decision-theoretic risk, R. For a specific example where model capacity is interpreted as kernel bandwidth, see the sections on nonparametric regression in Wasserman's book, "All of Nonparametric Statistics".

The nice thing about using risk to define underfitting and overfitting is that you don't need to invoke concepts like test sets or training sets at all. A model is underfit if its capacity is less than it could be without increasing risk; a model is overfit if its capacity is more than it could be without increasing risk.

Yisong Yue said...

I think a more appropriate way to think about overfitting is via the bias-variance tradeoff. This is easier to think about when doing regression, because the bias-variance tradeoff is easy to analyze.

Models that overfit are low bias and high variance. As you decrease the regularization, models will overfit more. But so long as the increase in test error due to higher variance is less than the decrease in test error to due lower bias, then you're OK overfitting to some degree.

See Slides 40-46 in my <a href='http://www.yisongyue.com/courses/cs155/lectures/Lecture_01.pdf">lecture slides</a> (sorry to self-advertise).

Jonathan Graehl said...

Hal, I buy your weak-good vs random-bad feature tradeoff, but isn't it also possible that the dev set is just more difficult than the train set? I suppose you could draw dev and train from the same distribution if you wanted to use "oops, dev error is getting worse than train" stopping criteria that you've identified as dangerous (often too pessimistic). Did you do this already?

hal said...

@John, @Yisong: I really like both of these suggestions, and the bias/variance trade-off has definitely been useful to me to think about problems, though computationalizing it is not always easy (decision trees?). I guess the main advantage I see with the Dwork et al. definition is that it's completely model agnostic. In fact, it doesn't even need to be a computational model -- you could apply their definition to a human mouse or fruitfly :).

@Jon: I'm not sure I understand. What does "more difficult" mean and how does it show in the results? Do you just mean that the dev distribution is different than train? That's not the case here because I permute the dataset ahead of time.

Bob Moore said...

I think from a theoretical perspective, Dwork et al. have the right definition. Suppose you have the best possible parameter set, given the structure of the model and the true data distribution. If the training and test sets are both random samples of the true distribution, then it should be just as likely that the test error is lower than the training error as the other way round. The fact that we systematically observe the training error to be lower means that we *are* always overfitting the training data.

However, avoiding overfitting is not our only, or even primary, goal. After all, a binary classifier that simply always says "yes" or always says "no" will have no overfitting, but will have a terrible error rate. Accepting some overfitting is a price we pay for learning something from the training data.

In practical terms, the situation we want to avoid is learning a model M1, when there is a systematic way of learning another model M2 that has higher training error than M1, but lower expected test error. In this case, it is tempting to say that with M1 the capacity of the model is being used to predict idiosyncrasies of the training data, instead of predicting aspects of the true distribution.

On the other hand, we want to prefer M1 to M2 if M1 has *both* lower training error and expected test error, even if the gap between training and test error is greater for M1 than M2. In that case, I would say that M1 overfits more than M2, but it is still a better model.

Yisong Yue said...

Actually, the bias/variance tradeoff is model agnostic... it's just not loss agnostic. You can estimate the bias and variance of decision tree regression w.r.t. squared loss via bootstrapping. However, providing useful definitions of the bias/variance tradeoff for other loss functions has proven problematic. (cf. http://homes.cs.washington.edu/~pedrod/papers/aaai00.pdf )

Alexandre Passos said...

My favorite definition of overfitting is relative: given two models, A and B, A is overfit with respect to B if it has lower training but higher holdout loss. Then what we mean by overfitting is when either training a model more or regularizing it less leads to better training but worse holdout performance.

Brendan O'Connor said...

Since i've been thinking more and more about squared error metrics for classification problems (like how we used Brier score in our EMNLP calibration paper http://arxiv.org/abs/1508.05154 ... also apropos since john myles white is here and all his comments on they call it "linear prob model" in econ i guess), I wonder what it would mean to just directly apply squared error bias/variance analysis to a classification setting with labels and predictions on {0,1}. or if or how it corresponds to the calibration/refinement tradeoffs in the calibration literature (sorry to keep going on about stuff in my paper, it's just on my mind )

hal said...

@Bob: yeah, my thinking went along these lines too. I think the problem is, like always, finite sample. I totally agree that "argmin_h risk(h)" is going to have identical training and test errors. Would that we had that :). But in the finite sample case, like Alexandre, if you hand me h1 and h2 and for h1 you have matching train/test error and h2 has lower training *and* lower test errors, then it's going to be hard for me to claim that h2 is overfit. In particular, to the degree that the test sample is representative of the true distribution, h2 is clearly a better match to "argmin_h risk(h)".

@Yisong: yes, of course you're right. Thanks for the pointer to Pedro's paper; I hadn't seen that!

@Brendan: yeah, agree. squared loss for classification (which is basically all I use anymore) has the nice property that the minimizer is the estimator of posterior probability, which is pretty much what you want most times. (among other nice properties!)

Dallas said...

I think the problem is that there are two different meanings of "overfitting" which are being conflated. I would say that your phrasing masks Dwork's definition just a bit, which is (again roughly) that a function f overfits to some dataset S, if the empirical average of f on S is far from the expected value of f on the true underlying distribution (which is implicitly a sign of high variance).

All of the models in your post *are* overfitting to the training data at their optimal values, in that the error rates on the training set are far from the true error rates (as estimated by the test sets). As Yisong pointed out, overfitting to the training data (to some extent) is okay, and useful, as it allows you to trade off bias and variance in order to train a model with better performance in general. The only purpose of the development set, by contrast, is to get an accurate estimate of generalizable performance, so overfitting on that is inherently a problem, which is of course what Dwork et al are concerned about.

What you're talking about, on the other hand, is basically "overfitting too much", which is what we commonly mean when we talk about overfitting; perhaps that's what we need a new name for.

张潘 said...

An interesting "statistical physics definition" of overfitting, in the context of network clustering problem, is the existence of
rugged energy (negative log-likelihood) landscape, as in spin glasses. http://www.pnas.org/content/111/51/18144.short

Yoav said...

@hal @brendan Can you elaborate on the squared loss for classification? this is new and surprising to me, as my understanding was that it is not a good metric (you pay a very high price for outliers, for example).

S said...

Thanks for the post. I just started with the Dwork et.al. paper you referenced. I always wondered why one cannot use the average accuracy after 10 fold CV and the standard deviation of accuracy in the 10-folds as a measure of overfitting? (instead of comparing training set vs test set error that is). I am generally asking from the practical point of view of a non-ML researcher who applies ML for some other problems.

Vitaly said...

Thank you for the thought-provoking post, Hal and your thoughts on the definition of overfitting used in our paper.
I'd like to first clarify the definition of overfitting we use. Basically a model overfits to a dataset S if it has a "large" generalization error, that is, the difference between the empirical error/loss on S and the error/loss on the true distribution that generated the data (or future data samples). Properly used test set gives a valid estimate of the true error but it is also easy to obtain a model that also overfits to the test set (for example by doing too much parameter tuning using the test set).

As Yisong and Bob pointed out, the ultimate goal of learning is minimizing the true error. The true error is the sum of the training error and the generalization error and our paper certainly does not advocate picking a hypothesis by minimizing only the generalization error. We do focus on the question of how to obtain statistically valid estimates of the true error of hypotheses when the test set is reused multiple times adaptively. This is equivalent to ensuring that we do not overfit to the *test set*. Our "reusable holdout" algorithm does not restrict in any way the use of the training set and does not aim to prevent overfitting to the training set.

hal said...

@Dallas: I agree to some degree though ".... all of the models *are overfitting*" seems like a circuitous argument.

In thinking about it more, what's wrong with saying something like:

(1) f is overfit to train if Error(f, Train) << Error(f*, Train), where f* is classifier in your hypothesis class that minimizes Error(f*, Distribution)?

(2) The same, but not limited to a hypothesis class; thereby, f is overfit to train if Error(f, Train) << Error(f*, Train) where f* is the Bayes optimal classifier?

Of course these are horribly not-possible-to-compute :), but I think they get much more at my internal notion of overfitting (exploiting stuff that's specific to Train but doesn't generalize) rather than "train better than test".


@Yoav: it's known/easy to show that if f* minimizes squared error, then f*(x) = E_{y~D|x}[y], so minimizing squared error estimates expectations. In the case that the labels are binary {0,1} then this is exactly doing class posterior probability estimation. A standard way to avoid outlier problems (this is what vw does, for instance) is to clip predictions are the min/max seen in the training data. I.e., if you give me some f with arbitrary range, I replace it with f'(x) = min(1, max(0, f(x))), and then do all my computation from there. There's some discussion here (http://hunch.net/~jl/projects/reductions/tutorial/paper/chapter.pdf) and here (http://hunch.net/?p=547).


@S: there's an issue with 10 fold cross validation for estimating variance: http://www.jmlr.org/papers/volume5/grandvalet04a/grandvalet04a.pdf namely it is biased; see also http://jmlr.org/papers/volume6/markatou05a/markatou05a.pdf


@Vitaly: THANKS! Totally agree. If you picked only based on generalization error, you'd pick a random guessing hypothesis which is clearly UNDERfit! That would have zero generalization error :). I guess it wasn't clear enough in my post, but I think the whole idea of looking at reusable holdout is totally awesome and the way you all do it is really insightful. I love this work. In a sense, I guess what my concern boils down to is something like the following. We know that training error + generalization error bounds the true error, but perhaps there's something real and practical being lost in the fact that this is a bound?

Vitaly said...

@Hal: Thanks! True error is the exact sum of the training error and the generalization error (relative to the training set). So there is nothing lost is this expression. Generalization error is of course not known exactly so we rely on the test set instead to estimate the true error (and that's the place where we need to bound things). One way to relate your argument to our paper is to note that even for the test set we need to balance the utility (or how many and what kind of estimates we can obtain) with overfitting to the test set (or, equivalently, the accuracy of those estimates).

Brendan O'Connor said...

@yoav i dont know much at all about the literature on training classifiers by minimizing squared loss and was interested in just piecing together the subliteratures out there on it. i know more about sq loss for probabilistic binary classification evaluation; it certainly admits the bias/variance breakdown just like linear regression and thus may be a way to understand overfitting, and i like the calibration/refinement breakdown too (which is related in some way i'd like to know sometime). this is also known as Brier score and is used a lot in the forecasting literature (such as in metereology or human forecasting evaluations).

@hal -- thanks for the references .. there's some neural networks speech paper i saw once that compared sq loss versus log-loss but i forget what it was now. also there's this "linear probability model" thing from econ maybe; training with sq loss then they argue better semantics for the coefficients.

Rajhans Samdani said...

Great discussion going on here!
I think one angle that is missing between some of the attempts at defining model-specific overfitting and the desired intuitive definition from your graphs is the SRM-ish angle. In particular, what your graphs present is a sequence of models from hypothesis families following a total order (assuming that the hypothesis class with smaller regularization contains one with larger regularization), whereas one can say that Moritz and co.'s definition considers model at a specific point on the x axis.

How about the following, more procedural, definition of overfitting in the sense you present in your graphs (and Alexandre talks about in the sense of comparing two hypothesis):
Consider a total order of hypothesis classes: H_{lambda} where H_{lambda} \subset H_{lambda'} if lambda' > lambda (so lambda is sort of akin to inverse of the regularization param). Let D be the data distribution and T be some training data drawn from D.

Let f_{T, lambda} = argmin_{h \ in H_{lambda}} Loss(T, h)

Then f_{T, lambda} is said to "overfit" if \exists lambda' < lambda such that

f_{T, lambda} < f_{T, lambda'}

BUT

E_{D}(Loss(f(T, lambda))) > E_{D}(Loss(f(T, lambda'))).

This is a fairly trivial definition and not sure if it is useful. But I guess it formalizes some of the stuff that has been said in "English".

Rajhans Samdani said...

... of course, I meant to write
Loss(T, f_{T, lambda}) < Loss(T, f_{T, lambda'})

hal said...

@Vitaly: yes duh :). sorry, brain not working!

I think I've boiled this down the the following for me:

- When I think of "overfit" to me it means "more than fit" and then the case I mentioned to Bob above with two hypothetical classifiers cannot possibly be overfit because it's not even fit.

- The alternative definition is that "overfit" means positive generalization error.



@Rajhans: yeah, that's basically what I have in my head, but the requirement of a chain of models seems really problematic.

Yaroslav Bulatov said...

In some sense "overfitting" is an academic phenomenon. For an application that brings real money a company would be motivated to pay human subjects to collect a few hundred million training examples at which point there's no overfitting (ie arxiv.org/pdf/1312.6082),

Vitaly said...

@Yaroslav: couldn't disagree more. Overfitting is certainly less of an issue when data is cheap. I believe that in the big picture of applications this a rather rare situation.
Even when data is relatively inexpensive, the more real money your application brings in, the more important even tiny improvements in accuracy become. Making such improvements reliably requires a lot of data (note that in general number of samples necessary scales quadratically with the inverse of the improvements in the error).

Yisong Yue said...

@Yaroslav: For an application that brings real money to a company, the difference between 99.1% accuracy and 99.2% accuracy could be worth tens or hundreds of millions of dollars. So overfitting is a real issue.

Brendan O'Connor said...

@hal @yisong now i think that bias/variance for (probabilistic-output) classification corresponds to "calibration/reliability" and "refinement" in the calibration literature. an unbiased probabilistic classifier means that when it says 80% chance of rain, that in the test set it really does rain 80% of the time. obviously unbiasedness is not the only interesting thing, since you could always predict the marginal probability and get an unbiased predictor. just like in regression you can predict the mean to be perfectly unbiased but not that useful.

Charles Sutton said...

Appreciate the timely discussion, which has prompted me to create a slide for a recent lecture called "zero overfitting not desirable". (I liked Bob's way of talking about this.)

Sebastian Nowozin said...

The predictor depends both on the particular training set (of size N) and on the method used to find a suitable model from the model class. For each predictor there is a true risk (expected loss) and a finite sample risk on the test set.

If we consider repeating the training experiment we can consider a predictor S_N that is the infinite weighted mixture of all predictors obtained from each possible training set of size N, weighted accordingly.
Likewise we can define an expected finite sample risk function that takes any new data set and outputs the expected risk achieved by S_N. This definition removes the dependence on the particular training set realization but retains the finite sample estimation error.

One possible definition of overfitting is then this: given a single training set of size N and a predictor F obtained from these samples, overfitting happens when the test risk R[F] is larger than the test risk R[S] or below upper quantile of the distribution of R[S].

(Using one fixed test set is a way to estimate R[F], and bootstrap resampling is a way to estimate R[S]. The above definition is not convenient to evaluate in practice but I think it is close to a useful definition.)

Unknown said...

@Rajhans Samdani: Your definition related to SRM is my favorite. Basically, I have overfit if I regret my model choice compared to if I had chosen a simpler model. The situation is subtle because sometimes (e.g., when fitting linear models), we think we are not in the SRM regime, because the model has "fixed complexity". But in fact, when using SGD, every gradient step can enlarge the model class by allowing the parameters to grow.

@yisong yue: I like Pedro's paper, but I prefer the more general analysis of Gareth James ("Variance and Bias for General Loss Functions", Machine Learning, 51(2), 2003).

@hal: While minimizing square loss for classification does produce class probability estimates, I believe it is inefficient statistically. Maybe someone else can confirm?

Sebastian Nowozin said...

@Unknown: efficiency is dependent on the loss itself. Minimizing square loss to obtain class estimates for binary and multi-class classification is identical to minimizing the Brier scoring rule. The Brier rule is a proper scoring rule (see e.g. Buja et al., 2005) and thus consistent, and although I am not aware of a pointer to the literature, I would assume that it is the asymptotically most efficient estimator for the squared-error loss on the probabilities.