Tolga Bolukbasi and colleagues recently posted an article about bias in what word2vec learns when trained on the standard Google News crawl (h/t Jack Clark). Essentially what they found is that word embeddings reflect stereotypes regarding gender (for instance, "nurse" is closer to "she" than to "he," and "hero" is the reverse) and race ("black male" is closest to "assaulted" and "white male" to "entitled"). This is not hugely surprising, and it's nice to see it confirmed. The authors additionally present a method for removing those stereotypes with no cost (as measured on analogy tasks) to the accuracy of the embeddings. The same effect also shows up in twitter embeddings related to hate speech.
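(If you want to poke at this yourself, here's a minimal sketch, not the authors' code, of how you might probe these associations with gensim, assuming you've downloaded the pretrained GoogleNews vectors; the probe words are just illustrative.)

```python
# Sketch: compare cosine similarity of a few words to "she" vs. "he"
# in the off-the-shelf GoogleNews word2vec vectors.
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

for word in ["nurse", "hero", "programmer"]:
    print(word,
          "sim(she)=%.3f" % vecs.similarity(word, "she"),
          "sim(he)=%.3f" % vecs.similarity(word, "he"))
```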
There have been a handful of reactions to this work, some questioning the core motivation, essentially variants of "if there are biases in the data, they're there for a reason, and removing them is removing important information." The authors give a nice example in the paper (web search; two identical web pages about CS; one mentions "John" and the other "Mary"; a query for "computer science" ranks the "John" page higher because of embeddings; appeal to the not-universally-held belief that this is bad).
I'd like to take a step back and argue that the problem is much deeper than this. The problem is that even though we all know that strong Sapir-Whorf is false, we seem to want it to be true for computational stuff.
At a narrow level, the issue here is the question of what a word "means." I don't think anyone would argue that "nurse" means "female" or that "computer scientist" means "male." And yet these word embeddings, which claim to be capturing meaning, are clearly capturing this non-meaning effect. So then the argument becomes one of "well, ok, nurse doesn't mean female, but it is correlated in the real world."
Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false.
The "black sheep problem" is that if you were to try to guess what color most sheep were by looking and language data, it would be very difficult for you to conclude that they weren't almost all black. [This example came up in discussions at the 2011 JHU summer research program and is due to Meg Mitchell. Note: I later learned (see comments below) that Ben van Durme also discusses it in his 2010 dissertation, where he terms it "reporting bias" (see sec 3.7)] In English, "black sheep" outnumbers "white sheep" about 25:1 (many "black sheep"s are movie references); in French it's 3:1; in German it's 12:1. Some languages get it right; in Korean it's 1:1.5 in favor of white sheep. This happens with other pairs, too; for example "white cloud" versus "red cloud." In English, red cloud wins 1.1:1 (there's a famous Sioux named "Red Cloud"); in Korean, white cloud wins 1.2:1, but four-leaf clover wins 2:1 over three-leaf clover. [Thanks to Karl Stratos and Kota Yamaguchi for helping with the multilingual examples.]
This is all to say that co-occurrence frequencies of words definitely do not reflect co-occurrence frequencies of things in the real world. And the fact that the correlation can go both ways means that just trying to model a "default" as something that doesn't appear won't work. (Also, computer vision doesn't really help: there are many, many pictures of black sheep out there because of photographer bias.)
We observed a related phenomenon when working on plot units. We were trying to extract "patient polarity verbs" (this idea has since been expanded and renamed "implicit sentiment": a much better name). The idea is that we want to know what polarity verbs inflict on their arguments. If I "feed" you, is this good or bad for you? For me? If I "punch" you, likewise. We focused on patients because action verbs are almost always good for the agent.
In order to accomplish this, we started with a seed list of "do-good-ers" and "wrong-do-ers." For instance, "the devil" was a wrong-do-er, and so we could extract things that the devil does and assume that these are (on average) bad for their patients. The problem was that the "do-good-ers" don't do good, or at least they don't do good in the news. One of our do-good-ers was "firefighter." Firefighters are awesome. Even stereotyped, this is arguably a very positive, heroic, socially good profession. But in the news, what do firefighters do? Bad things. Is this because most firefighters do bad things in the world? Of course not. It's because news is especially poignant when stereotypically good people do bad things.
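To give a flavor of the mechanism (this is a toy sketch, much simplified relative to what we actually did, and the triples are made up):

```python
# Sketch: bootstrap noisy verb polarity from seed agents. The seed
# agent's polarity is assumed to transfer to what the verb does to its
# patient -- which is exactly where news reporting bias bites, because
# "firefighter" shows up with bad verbs surprisingly often.
from collections import defaultdict

SEEDS = {"firefighter": +1, "devil": -1}  # +1 = do-good-er, -1 = wrong-do-er

# (agent, verb, patient) triples, e.g. from a dependency-parsed news corpus
triples = [
    ("firefighter", "rescue", "child"),
    ("firefighter", "assault", "bystander"),
    ("devil", "torment", "sinner"),
    ("devil", "tempt", "man"),
]

verb_polarity = defaultdict(float)
for agent, verb, patient in triples:
    if agent in SEEDS:
        verb_polarity[verb] += SEEDS[agent]

print(dict(verb_polarity))  # positive = (noisily) good for the patient
```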
This comes up in translation too, especially when looking at domain adaptation effects. For instance, our usual example for French-to-English translation is that in Hansards, "enceinte" translates as "room," but in EMEA (the medical domain), it translates as "pregnant." What does this have to do with things like gender bias? In Canadian Hansards, "merde" translates mostly as "shit" and sometimes as "crap." In movie subtitles, it's very frequently "fuck." (I suspect translation direction is a confounder here.) This is essentially a form of intensification (or detensification, depending on direction). It is not hard to imagine similar intensifications happening between racial descriptions and racial slurs, or between gender descriptions and sexist slurs, depending on where the data came from.
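To make the domain effect concrete, here is a toy sketch of domain-conditioned lexical translation estimates; the aligned pairs below are invented, but real ones would come from word alignments over Hansards versus subtitle data.

```python
# Sketch: estimate p(english | "merde") separately per domain from
# word-aligned pairs. The pairs and counts are made up for illustration.
from collections import Counter

aligned = {
    "hansards":  [("merde", "shit")] * 6 + [("merde", "crap")] * 4,
    "subtitles": [("merde", "fuck")] * 7 + [("merde", "shit")] * 3,
}

for domain, pairs in aligned.items():
    counts = Counter(e for f, e in pairs if f == "merde")
    total = sum(counts.values())
    print(domain, {e: round(c / total, 2) for e, c in counts.items()})
```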
9 comments:
Great post. We need more critical thinking, and application of it, like this.
I answered a Quora post with a similar angle, a point that I feel needs to be taken to heart in much of the application and research in NLP. There is no escaping the real-world consequences for NLP now that, as I believe, it has truly taken hold in commerce and consumer goods, and very likely legal areas. (Besides a lot of bad journalism and hyped-up marketing, a solid base has taken hold, and NLP isn't going away.) In fact, I just got done consulting with a retired police officer and federal agent about how to approach the problem of a "sexual predator program that locates likely sexual predators through social media." A false-positive rate of 0.0001 (0.01%) on a population of 1 million people means you have now labeled 100 innocent people. Does/should law enforcement deal with this kind of error rate? And if so, how? Both legally and via the community (i.e., local or county police departments do not have the resources to surveil 100 potential sexual predators in their city).
"In short, I think, sentiment analysis is not constrained by the technology, but by human behavior in general. Sentiment analysis works well if applied under the right conditions.... The question is, are those conditions so constrained by bias that it's even worth the effort using such a technique."
https://www.quora.com/How-could-AI-NLP-sentiment-analysis-be-used-to-predict-whether-or-not-people-will-judge-you-based-on-something-you-say/answer/Joshua-Bowles
I feel that the issue is more about people throwing bold, unsubstantiated claims about "[w]ord embeddings, trained only on word co-occurrence in text corpora, capture .. meanings" (from Bolukbasi et al., which I enjoyed reading.) Instead, if one were specific and precise about what word embeddings do (capturing co-occurrence relationships among words), the whole story'd be far from dramatic, because those biases are likely visible from co-occurrence statistics.
In fact, it's a bias not in the word embeddings but in the text on which those vectors were estimated. For instance, take two examples: "she is a nurse" vs. "he is a nurse," and "she is a programmer" vs. "he is a programmer." According to the Google N-gram viewer, "she is a nurse" is far more frequent than "he is a nurse" (https://goo.gl/6kkg0R), and "he is a programmer" is far more frequent than "she is a programmer" (in fact, "she is a programmer" doesn't even show up: https://goo.gl/LgsvvT). But I don't think anyone is particularly going to be surprised and claim that n-gram language models are biased, likely because no one claims that n-gram statistics tables capture the meanings of words.
Sapir Whorf?
https://en.wikipedia.org/wiki/Linguistic_relativity
Yeah, this is a fascinating problem for me. But generally I thought "typical" values for things are within some top N, although often not the first; Cho, I'm bummed that "she is a programmer" doesn't even show up, that contradicts my theory that typical values are "somewhere", if not immediately present.
I think one way to approach this is to separate "ground truth" from "data truth".
This was a first step we took in the "Seeing Through the Human Reporting Bias" paper (http://arxiv.org/abs/1512.06974), but that just scratches the surface. Ideally we have a wide range of hidden states corresponding to "ground truth". These can be used to help us estimate "data truth", maximizing the likelihood of the observed data; and then we can check out what we learn in the hidden ground truth.
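As a toy illustration of the separation (all numbers made up, and much simpler than the model in the paper): what gets mentioned is ground truth filtered through a reporting probability.

```python
# Sketch: "data truth" = "ground truth" x reporting probability.
p_sheep_is_black = 0.05          # hidden "ground truth" (made up)
p_report_color_if_black = 0.50   # black is notable, so color gets mentioned
p_report_color_if_white = 0.01   # white is the default, usually unmentioned

p_mention_black = p_sheep_is_black * p_report_color_if_black
p_mention_white = (1 - p_sheep_is_black) * p_report_color_if_white

print(p_mention_black / p_mention_white)  # "data truth" ratio comes out > 1
```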
Hi Hal! This is Adam Kalai, one of the authors of the paper. It's great to read such a thoughtful post from someone who has been working in this domain for so long.
I love your example that "black sheep" is 25 times more frequent than "white sheep." Interestingly, I took the publicly available word2vec embedding trained on Google News, normalized all vectors to unit length, and found that "sheep" is closer to "white" than to "black"! Equivalently, "sheep" has a positive inner product with the vector difference of "white" and "black."
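In code, the check is roughly the following sketch (assuming the same public GoogleNews file; the exact number depends on the vectors used):

```python
# Sketch: is "sheep" closer to "white" or "black" in the GoogleNews vectors?
import numpy as np
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def unit(v):
    return v / np.linalg.norm(v)

sheep, white, black = (unit(vecs[w]) for w in ("sheep", "white", "black"))
print(np.dot(sheep, white - black))  # positive => "sheep" is closer to "white"
```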
Now, I'm not saying all word embeddings are perfect by any means, but the fact that they can overcome such strong co-occurrence biases suggests they are doing something beyond normalization of word counts. Are they totally random? I don't think so. We find that in many cases (but not all), their gender biases seem to match popular gender stereotypes despite the aptly named "black sheep" problem.
We plan to put out a full version of the paper very soon that will discuss issues like this in a bit more detail, but I love hearing your thoughts.
Great post Hal.
It's not super important, but you can actually get the sheep thing correct, with enough text, if you do more of a frontal assault on the fact you're verifying. Google hit counts:
"most sheep are white" -- 64 hits
"most sheep are black" -- 2 hits
(Sidenote: neither of the "most sheep are black" hits actually asserts that most sheep are black. One uses modality ("to say most sheep are black is to be at once inaccurate and untrue") and the other is a clipped phrase ("most sheep are black face sheep").)
[cross-posting from FBook link by Hal] This was previously called "Reporting Bias", Van Durme thesis 2010 + Gordon and Van Durme 2013 @ AKBC. I've been referring to it in talks/lectures as the "Blinking and Breathing" problem: you don't talk about blinking and breathing but those events largely dominate situations involving people, as compared to reported events like "laugh", "murder", etc.
Approaches that try to filter/normalize/... language fail to recognize what language is. A model based on co-occurrences in written language will never be equal to an ontological truth. There is no inference step in w2v that could lead to a "true" representation, and certainly not via a small transformation of the embedding space along some "bad" axis (it is no surprise the authors didn't show any text examples after the "bias" reduction).
Instead of filtering and removing information, people should enrich the model so it is more useful. What about adding geographical and temporal modes to w2v? Would that not be far more interesting?