30 June 2016

Last semester from grad ML, I totally revamped stuff and started using the awesome book Understanding Machine Learning by Shai Ben-David and Shai Shalev-Shwartz (I'll try to review this in a later post). Naturally, one of the things we talked about was subgradients and subgradient descent.

I imagine that for many of you, if I asked you to define a subgradient of a convex function f at x informally (let's stick in one dimension for now), you would say something like "it's any line that that makes contact with f at x and everywhere else lies below f." This is the definition given in UML, and the definition we always see in pictures, like the one in ciml:

Okay, so this is all great, and one of the fun things you can do is derive algorithms like (stochastic) subgradient descent, which involve picking any subgradient and then using that as if it were a gradient in a standard gradient descent procedure. Yeah, you run into some speed limits for optimization rates, but it basically works.

So then on the midterm I asked a question that was intended to be a freebie: give me the subderivatives of the ramp loss function. Ramp loss is like hinge loss (shown above), but where the negative part gets clamped at 1. Formally, it's ramp(x) = min(1, hinge(x)), where hinge(x) = max(0, 1-x).

Turns out this wasn't a freebie. The problem, obviously in retrospect, is that ramp is not convex, and therefore doesn't have subderivatives! Or at least it doesn't have subderivatives for any x less than zero, and it's (only) subderivative for x≥0 is zero.

Several students point this out, and several students just went through the motions and computed what-we-might-normally-call subderivatives anyway. And by "what-we-might-normally-call" I of course mean what I had originally intended, and what I would have done myself if I had been a student in this course, and also what pretty much any auto-diff toolkit would do.

And yet, it's clearly wrong according to the definition of subgradients that we all know and use everyday in our very non-convex neural networks, when we do things like relu or hardtanh units.

It turns out (not surprisingly) that subdifferentials of non-convex functions has been studied (extensively) since the 1970s. Murdukhovich and Shao give a brief history in their paper on Banach spaces. Unfortunately, I don't actually understand most of that paper (or its citations).

I did manage to find one set of slides that I could understand, though, by Adil Bagirov for a talk on Subgradient methods in nonsmooth nonconvexoptimization! Basically the idea that Adil proposes is that we can use a gradient-ified version of a quasisecant and most/some subgradient-like methods still go through and make sense with this generalized notion.

Why am I posting this? Because it caused my brain to reconfigure itself when I was forced to think about this by the very smart students in my class! Am I going to teach quasisecants in the future? Probably not. But I am going to explicitly point out that the standard definition of subgradient doesn't work for nonconvex functions (or, more specifically, it works, but you get an empty set in a lot of cases) and that there are generalizations but that I don't think we've really figured this all out (as a community).

If anyone has other pointers that I can use in the future, I'd love to see them!

24 June 2016

Language bias and black sheep

Tolga Bolukbasi and colleagues recently posted an article about bias in what is learned with word2vec, on the standard Google News crawl (h/t Jack Clark). Essentially what they found is that word embeddings reflect stereotypes regarding gender (for instance, "nurse" is closer to "she" than "he" and "hero" is the reverse) and race ("black male" is closest to "assaulted" and "white male" to "entitled"). This is not hugely surprising, and it's nice to see it confirmed. The authors additionally present a method for removing those stereotypes with no cost (as measured with analogy tasks) to accuracy of the embeddings. This also shows up on twitter embeddings related to hate speech.

There have been a handful of reactions to this work, some questioning the core motivation, essentially variants of "if there are biases in the data, they're there for a reason, and removing them is removing important information." The authors give a nice example in the paper (web search; two identical web pages about CS; one mentions "John" and the other "Mary"; query for "computer science" ranks the "John" one higher because of embeddings; appeal to a not-universally-held-belief that this is bad).

I'd like to take a step back and argue the the problem is much deeper than this. The problem is that even though we all know that strong Sapir-Whorf is false, we seem to want it to be true for computational stuff.

At a narrow level, the issue here is the question of what does a word "mean." I don't think anyone would argue that "nurse" means "female" or that "computer scientist" means "male." And yet, these word embeddings, which claim to be capturing meaning, are clearly capturing this non-meaning-effect. So then the argument becomes one of "well ok, nurse doesn't mean female, but it is correlated in the real world."

Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false.

The "black sheep problem" is that if you were to try to guess what color most sheep were by looking and language data, it would be very difficult for you to conclude that they weren't almost all black. [This example came up in discussions at the 2011 JHU summer research program and is due to Meg Mitchell. Note: I later learned (see comments below) that Ben van Durme also discusses it in his 2010 dissertation, where he terms it "reporting bias" (see sec 3.7)] In English, "black sheep" outnumbers "white sheep" about 25:1 (many "black sheep"s are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep. This happens with other pairs, too; for example "white cloud" versus "red cloud." In English, red cloud wins 1.1:1 (there's a famous Sioux named "Red Cloud"); in Korean, white cloud wins 1.2:1, but four-leaf clover wins 2:1 over three-leaf clover. [Thanks to Karl Stratos and Kota Yamaguchi for helping with the multilingual examples.]

This is all to say that co-occurance frequencies of words definitely do not reflect co-occurance frequencies of things in the real world. And the fact that the correlation can both both ways means that just trying to model a "default" as something that doesn't appear won't work. (Also, computer vision doesn't really help: there are many many pictures of black sheep out there because of photographer bias.)

We observed a related phenomena when working on plot units. We were trying to extract "patient polarity verbs" (this idea has now been expanded and renamed "implicit sentiment": a much better name). The idea is that we want to know what polarity verbs inflict on their arguments. If I "feed" you, is this good or bad for you? For me? If I "punch" you, likewise. We focused on patients because action verbs are almost always good for the agent.

In order to accomplish this, we started with a seed list of "do-good-ers" and "wrong-do-ers." For instance, "the devil" was a wrong do-er, and so we can extract things that the devil does, and assume that these are (on average) bad for their patients. The problem was the "do-good-ers" don't do good, or at least they don't do good in the news. One of our do-good-ers was "firefighter". Firefighters are awesome. Even stereotyped, this is arguably a very positive social good, heroic profession. But in the news, what do firefighters do? Bad things. Is this because most firefighters do bad things in the world? Of course not. It's because news is especially poignant when stereotypically good people do bad things.

This comes up in translation too, especially when looking at looking at domain adaptation effects. For instance, our usual example for French to English translation is that in Hansards, "enceinte" transates as "room" but in EMEA (medical domain), it translates as "pregnant." What does this have to do with things like gender bias? In Canadian Hansards, "merde" translates mostly as "shit" and sometimes as "crap." In movie subtitles, it's very frequently "fuck." (I suspect translation direction is a confounder here.) This is essentially a form of intensification (or detensification, depending on direction). It is not hard to imagine similar intensifications happening between racial descriptions and racial slurs, or between gender descriptions and sexist slurs, depending on where the data came from.