30 June 2016

Subgradient descent, or something....

Last semester, for grad ML, I totally revamped stuff and started using the awesome book Understanding Machine Learning by Shai Ben-David and Shai Shalev-Shwartz (I'll try to review this in a later post). Naturally, one of the things we talked about was subgradients and subgradient descent.

I imagine that for many of you, if I asked you to define a subgradient of a convex function f at x informally (let's stick to one dimension for now), you would say something like "it's any line that makes contact with f at x and everywhere else lies below f." This is the definition given in UML, and the definition we always see in pictures, like the one in ciml.
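
If you'd rather see that inequality written down than drawn, here's a minimal numeric sanity check on hinge loss at its kink; the helper name and the grid of test points are mine, purely for illustration:

```python
# g is a subgradient of f at x0 iff f(x) >= f(x0) + g * (x - x0) for every x.
# Here we just test that inequality on a grid, for hinge loss at its kink x0 = 1.

def hinge(x):
    return max(0.0, 1.0 - x)

def is_subgradient(f, g, x0, xs, tol=1e-12):
    return all(f(x) >= f(x0) + g * (x - x0) - tol for x in xs)

xs = [i / 100.0 for i in range(-300, 301)]   # test points in [-3, 3]
print(is_subgradient(hinge, -1.0, 1.0, xs))  # True: slope -1 stays below hinge everywhere
print(is_subgradient(hinge, -0.5, 1.0, xs))  # True: anything in [-1, 0] works at the kink
print(is_subgradient(hinge, -2.0, 1.0, xs))  # False: too steep, pokes above hinge for x < 1
```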

Okay, so this is all great, and one of the fun things you can do is derive algorithms like (stochastic) subgradient descent, which involve picking any subgradient and then using that as if it were a gradient in a standard gradient descent procedure. Yeah, you run into some speed limits for optimization rates, but it basically works.
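
For concreteness, here's a bare-bones one-dimensional sketch of that recipe on hinge loss; the starting point and the 1/t step sizes are arbitrary choices of mine, not anything from the course:

```python
def hinge(x):
    return max(0.0, 1.0 - x)

def hinge_subgradient(x):
    # Any element of the subdifferential will do; at the kink x = 1 we happen to pick 0.
    return -1.0 if x < 1.0 else 0.0

x = -2.0                   # arbitrary starting point
for t in range(1, 101):
    g = hinge_subgradient(x)
    x -= (1.0 / t) * g     # treat the subgradient exactly like a gradient
print(x, hinge(x))         # x ends up past 1, where the loss is 0
```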

So then on the midterm I asked a question that was intended to be a freebie: give me the subderivatives of the ramp loss function. Ramp loss is like hinge loss (shown above), but where the negative part gets clamped at 1. Formally, it's ramp(x) = min(1, hinge(x)), where hinge(x) = max(0, 1-x).
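
In code, just to pin down the shapes:

```python
def hinge(x):
    return max(0.0, 1.0 - x)      # 0 for x >= 1, slope -1 for x < 1

def ramp(x):
    return min(1.0, hinge(x))     # same, but clamped at 1, so it's flat for x <= 0
```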

Turns out this wasn't a freebie. The problem, obvious in retrospect, is that ramp is not convex, and therefore doesn't have subderivatives! Or at least it doesn't have subderivatives for any x less than zero, and its (only) subderivative for x≥0 is zero.

Several students pointed this out, and several others just went through the motions and computed what-we-might-normally-call subderivatives anyway. And by "what-we-might-normally-call" I of course mean what I had originally intended, what I would have done myself if I had been a student in this course, and also what pretty much any auto-diff toolkit would do.
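
To be concrete about what "going through the motions" produces, here's roughly the piecewise "derivative" I'd expect — my guess at generic behavior, not a claim about any particular toolkit:

```python
def ramp_naive_derivative(x):
    # Differentiate ramp piece by piece and break ties at the kinks arbitrarily.
    if x <= 0.0:
        return 0.0     # flat region where ramp(x) = 1
    elif x < 1.0:
        return -1.0    # linear region where ramp(x) = 1 - x
    else:
        return 0.0     # flat region where ramp(x) = 0
```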

And yet, it's clearly wrong according to the definition of subgradients that we all know and use everyday in our very non-convex neural networks, when we do things like relu or hardtanh units.

It turns out (not surprisingly) that subdifferentials of non-convex functions have been studied (extensively) since the 1970s. Mordukhovich and Shao give a brief history in their paper on Banach spaces. Unfortunately, I don't actually understand most of that paper (or its citations).

I did manage to find one set of slides that I could understand, though, by Adil Bagirov, for a talk on "Subgradient methods in nonsmooth nonconvex optimization"! Basically, the idea Adil proposes is that we can use a gradient-ified version of a quasisecant, and most/some subgradient-like methods still go through and make sense with this generalized notion.

Why am I posting this? Because it caused my brain to reconfigure itself when I was forced to think about this by the very smart students in my class! Am I going to teach quasisecants in the future? Probably not. But I am going to explicitly point out that the standard definition of subgradient doesn't work for nonconvex functions (or, more specifically, it works, but you get an empty set in a lot of cases) and that there are generalizations but that I don't think we've really figured this all out (as a community).

If anyone has other pointers that I can use in the future, I'd love to see them!

6 comments:

Tuomas said...

How about "local subgradient" for what you appear to want to use? Instead of having the subgradient below the function everywhere (globally), it only needs to be below it everywhere within a small ball of radius epsilon.
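
For instance, on ramp loss at x0 = 0.5 (checking the inequality on a grid, with a made-up helper):

```python
def ramp(x):
    return min(1.0, max(0.0, 1.0 - x))

def is_local_subgradient(f, g, x0, eps, n=1000, tol=1e-12):
    # Test the subgradient inequality only on [x0 - eps, x0 + eps].
    xs = [x0 - eps + 2.0 * eps * i / n for i in range(n + 1)]
    return all(f(x) >= f(x0) + g * (x - x0) - tol for x in xs)

print(is_local_subgradient(ramp, -1.0, 0.5, eps=0.1))  # True: works inside a small ball
print(is_local_subgradient(ramp, -1.0, 0.5, eps=2.0))  # False: the global definition fails here
```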

Tim Vieira said...

Ha! I've ranted on this so many times!

Especially in the context of (structured) ramp loss and why you want to use the Convex-Concave Procedure (subgradient descent does not work on ramp loss).

Extra rants related to this one:

* The difference of general convex functions is not convex... neither are general linear combinations... sigh...

* Ramp loss (much like 0/1 loss) is scale invariant, so the usual L1 and L2 regularization penalties do not make sense!

Unknown said...

When you say "ramp loss is like hinge loss, except the negative part gets clamped at 1", don't you mean that the positive part (where x is negative) gets clipped at 1? Also, wouldn't the ramp only have subderivatives for x≥1 instead of x≥0?

André Martins said...

For functions f that are almost everywhere differentiable (convex or not), there's the notion of Clarke's generalized derivatives (or Jacobians if we consider maps R^n -> R^m). This is basically the convex hull of the set of limit points of gradients \nabla f(x_t) for sequences (x_t) that converge to x. If f is convex, this becomes the subdifferential of f (i.e. the set of subgradients of f as defined in your post).

Clarke, Frank H. Optimization and Nonsmooth Analysis. New York, Wiley, 1983.
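
(A crude 1-D illustration of that construction on ramp loss; sampling slopes just to the left and right of x is enough here because ramp is piecewise linear, and the helper names are made up:)

```python
def ramp_derivative(x):
    # ramp is differentiable everywhere except x = 0 and x = 1
    return -1.0 if 0.0 < x < 1.0 else 0.0

def clarke_interval(x, eps=1e-6):
    # In 1-D, the convex hull of the limiting slopes is just an interval.
    slopes = [ramp_derivative(x - eps), ramp_derivative(x + eps)]
    return (min(slopes), max(slopes))

print(clarke_interval(1.0))   # (-1.0, 0.0) at one kink
print(clarke_interval(0.0))   # (-1.0, 0.0) at the other kink, where no global subgradient exists
print(clarke_interval(0.5))   # (-1.0, -1.0): just the ordinary derivative
```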

André Martins said...

Here's the correct reference for the above:

F.H. Clarke. Generalized gradients and applications. Trans. Amer. Math. Soc., 205:247–262, Apr. 1975.

http://www.ams.org/journals/tran/1975-205-00/S0002-9947-1975-0367131-6/S0002-9947-1975-0367131-6.pdf