A while ago I created this image for thinking about how machine learning systems tend to get deployed. In this figure, for Chapter 2 of CIML, the left column shows a generic decision being made, and the right column shows an example of this decision in the case of advertising placement on a search engine we’ve built.
The purpose of the image at the time was basically to think about different types of “approximation” error, where we have some real world goal (e.g., increase revenue) and design a machine learning system to achieve it. The point here, which echoes a lot of what Martin Zinkevich (who knows much more about this than I do) writes about in the Rules of Machine Learning, is that it’s important to recognize that there’s a whole lot of stuff that goes around any machine learning system, and each piece puts an upper bound on what you can achieve.
A year or two ago, in talking to Suresh Venkatasubramanian, we realized that it’s also perhaps an interesting way to think about different places that “discrimination” might come into a system (I’ll avoid the term “bias” because it’s overloaded here with the “bias/variance tradeoff”). By “discrimination” I simply mean a situation in which some subpopulation is disadvantaged. Below are some thoughts on how this might happen.
To make things more interesting (and navel-gaze-y), I’m going to switch from the example of ad display to paper recommendations in a hypothetical arxiv rss-feed-based paper recommendation system. To be absolutely clear, this is very much a contrived, simplified thought example, and not meant to be reflective of any decisions anyone or any company has made. (For the purposes of this example, I will assume all papers on arxiv are in English.)
In stating this goal, we are explicitly making a value judgment of what matters. In this case, one part of this value judgment is that it’s only new papers that are interesting, potentially disadvantaging authors who have high quality older work. It also advantages people who put their papers on arxiv, which is not a uniform slice of the research population.
By deciding to build an iPhone app, we have privileged iPhone users over users of other phones, which likely correlates both with country of residence and economic status of the user. By designing the mechanism such that extracted paper information is shown and a judgment is collected immediately, we are possibly advantaging papers (and thereby the authors of those papers) whose contributions can be judged quickly, or which seem flashy (click-bait-y). Similarly, since human flash judgments may focus on less relevant features, we may be biasing toward authors who are native English speakers, because things like second language errors may disproportionately affect quick judgments.
I actually don’t have anything to say on this one. Open to ideas 😊.
There are obviously repercussions to this decision, but I’m not sure any are discriminatory. Had we chosen a more sophisticated exploration policy, this could possibly run into discrimination issues because small populations might get “explored on” more, potentially disadvantaging them.
By choosing to record author and institution (for instance), we are opening up the possibility of discrimination against certain authors or institutions. But because many techniques for addressing discrimination in machine learning assume that you have access to some notion of protected category, we are also opening up the possibility of remedying it. Similarly, by recording the abstract, we are (in a similar but different way to before) opening up the possibility of discrimination by degree of English proficiency.
A major source of potential discrimination here comes from the features we use of the current user. If the current user, for instance, only has a small number of papers from which we can learn about the topics they care about, then the system will plausibly work worse for them than for someone with lots of papers (and therefore a more robust user profile).
This is a fairly easy representation to understand. Because we’ve chosen a bag of embeddings, this could plausibly underperform on topics/areas where the keywords are separated by spaces (e.g., I once heard a story that someone who works mostly on dependency parsing tends to get lots of papers on decision tree models suggested to them for review by TPMS, because of the overlap of the word “tree”). It’s not clear to me that there are obvious discrimination issues here, but there could be.
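As a toy illustration of the “tree” anecdote, here is a minimal sketch. It assumes the simplest possible order-free representation, raw token counts compared by cosine similarity, which is not what TPMS or a learned embedding model actually does. The profile and paper titles are made up for the example.

```python
from collections import Counter
import math

def bag(text):
    """Order-free token counts: word order and multi-word phrases are lost."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two token-count bags."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical reviewer profile and candidate papers.
reviewer = bag("dependency tree parsing")
paper_decision_tree = bag("decision tree pruning")
paper_translation = bag("neural machine translation")

sim_tree = cosine(reviewer, paper_decision_tree)
sim_other = cosine(reviewer, paper_translation)
print(f"decision tree paper: {sim_tree:.2f}, translation paper: {sim_other:.2f}")
```

The decision tree paper scores higher purely because of the shared token “tree”, even though “dependency tree” and “decision tree” are unrelated as phrases.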
In choosing to evaluate our system based only on average 0/1 loss over the run, we are potentially missing the opportunity to even observe systematic bias. An alternative would be to evaluate 0/1 error as a function of various potentially confounding variables, like institution prestige, how prolific the author is, some measure of how native the author’s English is, etc. Similarly, we could break the error down by features of the user, for the same reasons. Finally, considering not just overall error but false positives and false negatives separately can often reveal discriminatory structure that is not otherwise obvious.
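To make the breakdown concrete, here is a minimal sketch of a per-group evaluation. The group names, the record format `(group, y_true, y_pred)`, and the data are all hypothetical; in the arxiv example, a group might be a bucket of some author or reader attribute, and a label of 1 might mean “the reader liked the paper”.

```python
from collections import defaultdict

def error_breakdown(records):
    """Per-group 0/1 error, false positive rate, and false negative rate.

    Each record is (group, y_true, y_pred) with labels in {0, 1}.
    """
    counts = defaultdict(lambda: {"n": 0, "err": 0, "fp": 0, "fn": 0, "pos": 0, "neg": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        c["n"] += 1
        c["err"] += int(y_true != y_pred)
        if y_true == 1:
            c["pos"] += 1
            c["fn"] += int(y_pred == 0)  # liked, but predicted not liked
        else:
            c["neg"] += 1
            c["fp"] += int(y_pred == 1)  # not liked, but predicted liked
    return {
        group: {
            "error": c["err"] / c["n"],
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for group, c in counts.items()
    }

# Made-up predictions for two hypothetical author groups.
records = [
    ("group_a", 1, 1), ("group_a", 0, 1), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 1), ("group_b", 0, 0),
]
report = error_breakdown(records)
for group, stats in sorted(report.items()):
    print(group, stats)
```

In this made-up data both groups have the same 0/1 error, but all of group_a’s mistakes are false positives and all of group_b’s are false negatives, which is exactly the kind of structure an average over the whole run would hide.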
I don’t think this analysis is perfect, and some things don’t really apply, but I found it to be a useful thought exercise.
One very interesting thing about considering discrimination in this setting is that there are two opportunities to mess up: on the content provider (author) side and on the content consumer (reader) side. This comes up in other places too: should your music/movie/etc. recommender just recommend popular things to you, or should it be fair to content providers who are less well known? (Thanks to Fernando Diaz for this example.)