Tolga Bolukbasi and colleagues recently posted an article about bias in what is learned with word2vec, trained on the standard Google News crawl (h/t Jack Clark). Essentially what they found is that word embeddings reflect stereotypes regarding gender (for instance, "nurse" is closer to "she" than to "he," and "hero" is the reverse) and race ("black male" is closest to "assaulted" and "white male" to "entitled"). This is not hugely surprising, and it's nice to see it confirmed. The authors additionally present a method for removing those stereotypes with no cost (as measured with analogy tasks) to the accuracy of the embeddings. Similar effects also show up in Twitter embeddings related to hate speech.
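To make the kind of probe they describe concrete, here is a minimal sketch using gensim's pretrained Google News vectors; the exact similarity numbers depend on which vectors you load, and this is only an illustration of the measurement, not the authors' actual experiment.

```python
# Minimal sketch: probe gender associations in pretrained word2vec vectors.
# Assumes gensim can download the Google News model; numbers vary by model.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # large download (~1.6GB)

for word in ["nurse", "hero"]:
    print(word,
          "she:", round(float(model.similarity(word, "she")), 3),
          "he:", round(float(model.similarity(word, "he")), 3))
```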
There have been a handful of reactions to this work, some questioning the core motivation, essentially variants of "if there are biases in the data, they're there for a reason, and removing them is removing important information." The authors give a nice example in the paper: a web search setting with two identical web pages about CS, one mentioning "John" and the other "Mary"; a query for "computer science" ranks the "John" page higher because of the embeddings, together with an appeal to the not-universally-held belief that this is bad.
I'd like to take a step back and argue that the problem is much deeper than this. The problem is that even though we all know that strong Sapir-Whorf is false, we seem to want it to be true for computational stuff.
At a narrow level, the issue here is the question of what a word "means." I don't think anyone would argue that "nurse" means "female" or that "computer scientist" means "male." And yet these word embeddings, which claim to be capturing meaning, are clearly capturing this non-meaning effect. So then the argument becomes one of "well, ok, nurse doesn't mean female, but it is correlated in the real world."
Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false.
The "black sheep problem" is that if you were to try to guess what color most sheep were by looking and language data, it would be very difficult for you to conclude that they weren't almost all black. [This example came up in discussions at the 2011 JHU summer research program and is due to Meg Mitchell. Note: I later learned (see comments below) that Ben van Durme also discusses it in his 2010 dissertation, where he terms it "reporting bias" (see sec 3.7)] In English, "black sheep" outnumbers "white sheep" about 25:1 (many "black sheep"s are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep. This happens with other pairs, too; for example "white cloud" versus "red cloud." In English, red cloud wins 1.1:1 (there's a famous Sioux named "Red Cloud"); in Korean, white cloud wins 1.2:1, but four-leaf clover wins 2:1 over three-leaf clover. [Thanks to Karl Stratos and Kota Yamaguchi for helping with the multilingual examples.]
This is all to say that co-occurrence frequencies of words definitely do not reflect co-occurrence frequencies of things in the real world. And the fact that the correlation can go both ways means that just trying to model a "default" as something that doesn't appear won't work. (Also, computer vision doesn't really help: there are many many pictures of black sheep out there because of photographer bias.)
We observed a related phenomenon when working on plot units. We were trying to extract "patient polarity verbs" (this idea has now been expanded and renamed "implicit sentiment," a much better name). The idea is that we want to know what polarity verbs inflict on their arguments. If I "feed" you, is this good or bad for you? For me? If I "punch" you, likewise. We focused on patients because action verbs are almost always good for the agent.
In order to accomplish this, we started with a seed list of "do-good-ers" and "wrong-do-ers." For instance, "the devil" was a wrong-do-er, and so we can extract things that the devil does, and assume that these are (on average) bad for their patients. The problem was that the "do-good-ers" don't do good, or at least they don't do good in the news. One of our do-good-ers was "firefighter." Firefighters are awesome. Even stereotyped, this is arguably a very positive, heroic, socially good profession. But in the news, what do firefighters do? Bad things. Is this because most firefighters do bad things in the world? Of course not. It's because news is especially poignant when stereotypically good people do bad things.
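A rough sketch of the seed-list extraction idea, assuming spaCy with its en_core_web_sm model; the seed words and example sentences here are purely illustrative, not our actual lists or data.

```python
# Sketch: collect verbs whose grammatical subject is a seeded "do-gooder"
# or "wrong-doer", as weak evidence of the polarity they inflict on patients.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
SEEDS = {"devil": "bad", "firefighter": "good"}  # illustrative seeds only

def seeded_verbs(texts):
    verb_polarity = Counter()
    for doc in nlp.pipe(texts):
        for tok in doc:
            if tok.dep_ == "nsubj" and tok.lemma_.lower() in SEEDS:
                verb_polarity[(tok.head.lemma_, SEEDS[tok.lemma_.lower()])] += 1
    return verb_polarity

print(seeded_verbs(["The devil tormented the villagers.",
                    "A firefighter rescued the child."]))
```

Of course, as the firefighter example shows, the counts you get this way reflect what gets reported, not what the seeds actually do in the world.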
This comes up in translation too, especially when looking at domain adaptation effects. For instance, our usual example for French-to-English translation is that in Hansards, "enceinte" translates as "room," but in EMEA (the medical domain), it translates as "pregnant." What does this have to do with things like gender bias? In the Canadian Hansards, "merde" translates mostly as "shit" and sometimes as "crap." In movie subtitles, it's very frequently "fuck." (I suspect translation direction is a confounder here.) This is essentially a form of intensification (or detensification, depending on direction). It is not hard to imagine similar intensifications happening between racial descriptions and racial slurs, or between gender descriptions and sexist slurs, depending on where the data came from.