Last semester from grad ML, I totally revamped stuff and started using the awesome book Understanding Machine Learning by Shai Ben-David and Shai Shalev-Shwartz (I'll try to review this in a later post). Naturally, one of the things we talked about was subgradients and subgradient descent.
I imagine that for many of you, if I asked you to define a subgradient of a convex function f at x informally (let's stick in one dimension for now), you would say something like "it's any line that that makes contact with f at x and everywhere else lies below f." This is the definition given in UML, and the definition we always see in pictures, like the one in ciml:
So then on the midterm I asked a question that was intended to be a freebie: give me the subderivatives of the ramp loss function. Ramp loss is like hinge loss (shown above), but where the negative part gets clamped at 1. Formally, it's ramp(x) = min(1, hinge(x)), where hinge(x) = max(0, 1-x).
Turns out this wasn't a freebie. The problem, obviously in retrospect, is that ramp is not convex, and therefore doesn't have subderivatives! Or at least it doesn't have subderivatives for any x less than zero, and it's (only) subderivative for x≥0 is zero.
Several students point this out, and several students just went through the motions and computed what-we-might-normally-call subderivatives anyway. And by "what-we-might-normally-call" I of course mean what I had originally intended, and what I would have done myself if I had been a student in this course, and also what pretty much any auto-diff toolkit would do.
And yet, it's clearly wrong according to the definition of subgradients that we all know and use everyday in our very non-convex neural networks, when we do things like relu or hardtanh units.
It turns out (not surprisingly) that subdifferentials of non-convex functions has been studied (extensively) since the 1970s. Murdukhovich and Shao give a brief history in their paper on Banach spaces. Unfortunately, I don't actually understand most of that paper (or its citations).
I did manage to find one set of slides that I could understand, though, by Adil Bagirov for a talk on Subgradient methods in nonsmooth nonconvexoptimization! Basically the idea that Adil proposes is that we can use a gradient-ified version of a quasisecant and most/some subgradient-like methods still go through and make sense with this generalized notion.
Why am I posting this? Because it caused my brain to reconfigure itself when I was forced to think about this by the very smart students in my class! Am I going to teach quasisecants in the future? Probably not. But I am going to explicitly point out that the standard definition of subgradient doesn't work for nonconvex functions (or, more specifically, it works, but you get an empty set in a lot of cases) and that there are generalizations but that I don't think we've really figured this all out (as a community).
If anyone has other pointers that I can use in the future, I'd love to see them!