Many students, in response to their assigned reading for today's undergrad ML class, asked me for a formal definition of overfitting. The fact that I don't really have one irked me, especially this this basically is the problem in machine learning. This blog post is an extended version of the answer I gave in class.
Intuitively, overfitting is when your learned model is doing much better on training data than it will do on test data from the same distribution. This happens because your model learns to fit random idiosyncrasies in the training data rather than model the true underlying trends.
I then pointed to a recent high-profile paper by Cynthia Dwork and colleagues on Holdout Reuse (appeared behind a paywall in Science and something similar seems to be in NIPS too). This paper presents some really cool ideas that use machinery from differential privacy to allow one to effectively reuse the same heldout ("development") data multiple times, without worrying about overfitting the holdout. But that's not what I wanted to highlight; I wanted to highlight the formal definition Dwork et al. give for overfitting. They (roughly) say that a model has overfit the training data when its test error is larger than its training error (by at least some small amount).
I feel like this definition captures some of the essence of overfitting, but it also misses what I would call my "everyday experience" with using machine learning in practical, real world settings. And my everyday experience is that you training error is always lower than your test error. To confirm to myself that I wasn't misremembering my experiences, I ran a few experiments on some simple classification tasks. Below is a plot of train/dev/test error on Reuters RCV1:
Here, the x-axis is regularization strength (this is all with libsvm, C=2^-xvalue) and the y-axis is error rate. There are error bars on dev/test but they're so small you can't really see them.
The thing I want to highlight in this figure is that the best performance, measured by dev or test performance, occurs when the regularization is a bit below 0. But the point at which the gap between dev performance and training performance starts to inflate is much earlier, maybe around -4. The title of the figure tells us that the test error achieved when the training error drops below dev error is 9.48%; but when dev bottomw out the test error is 6.08%. That's leaving 33% relative error on the table.
RCV1 isn't unique. Below are the same plots for MNIST on several binary splits:
Having somewhat convinced myself that I'm not crazy, the immediate question is: what is a better definition of overfitting that captures my everyday experience? I completely agree that the definition quoted above is a necessary condition, but somehow it does not seem sufficient.
Even though this was my experience, I felt a bit flummoxed because I had a hard time pointing out exactly what I felt was missing in the definition above.
Here's what I ended up settling on. Overfitting typically happens, as stated above, because there are random idiosyncrasies in the training data that don't generalize. This is a finite sample effect. If I give you a dataset with 1000 training examples, and I add 100 completely random features, at least one of these features is going to correlate somewhat with the training labels. Overfitting occurs when the model chooses to use one of these completely random features, which helps drive the training error down, but hurts at test time.
But there's a "throwing the baby out with the bathwater" effect going on here. Although there may be completely random features that accidentally correlate slightly with the true label, there are also weak but nevertheless useful features that correlate a small amount with the true label. On a finite sample it's going to be virtually impossible to distinguish between these two.
By adopting the "training error < test error" rubric for overfitting, we're forcing ourselves to throw out both types of features. This is a trade-off, and empirically at least, it appears that this is a losing trade-off. It seems better to keep around a few of these weak-but-useful features in exchange for the risk of also keeping around a few truly-useless-and-misleading features. If this weren't the case, I would have a hard time explaining my "everyday experience."
So do I have a better suggestion?
One really nice property of the Dwork et al. definition is that it's testable for a single model. That is, I hand you a learned function h and you can run h on some training data and some dev data and you can tell me if it's overfit. I don't have to draw curves like those above. I don't even have to know anything about how the model was built, and I don't have to sweep hyperparameters (which may or may not be sweepable).
I really like all of those properties. And I believe that if we came up with an alternative definition of overfitting that obeys these properties, we can reuse much of the awesome work in the Dwork et al. paper.
But I also feel like the definition above is overly conservative and it's risky to use.