A while ago I created this image for thinking about how machine learning systems tend to get deployed. In this figure, made for Chapter 2 of CIML, the left column shows a generic decision being made, and the right column shows an example of that decision in the case of advertising placement on a search engine we’ve built.
The purpose of the image at the time was basically to think about different types of “approximation” error: we have some real-world goal (e.g., increase revenue) and design a machine learning system to achieve it.
The point here, which echoes a lot of what Martin Zinkevich (who knows much more about this than I do) writes about in his Rules of Machine Learning, is that it’s important to recognize that there’s a whole lot of stuff that goes on around any machine learning system, and each piece puts an upper bound on what you can achieve.
A year or two ago, talking with Suresh Venkatasubramanian, we realized that this is also perhaps an interesting way to think about the different places where “discrimination” might come into a system (I’ll avoid the term “bias” because it’s overloaded here with the “bias/variance tradeoff”). By “discrimination” I simply mean a situation in which some subpopulation is disadvantaged. Below are some thoughts on how this might happen.
To make things more interesting (and navel-gaze-y), I’m
going to switch from the example of ad display to paper recommendations in a
hypothetical arxiv rss-feed-based paper recommendation system. To be absolutely
clear, this is very much a contrived, simplified thought example, and not meant
to be reflective of any decisions anyone or any company has made. (For the
purposes of this example, I will assume all papers on arxiv are in English.)
Walking through the decisions in the figure one at a time, the first is the real-world goal: recommend interesting new papers from the arxiv feed to each user. In stating this goal, we are explicitly making a value judgment about what matters. In this case, one part of that value judgment is that only new papers are interesting, potentially disadvantaging authors who have high-quality older work. It also advantages people who put their papers on arxiv, which is not a uniform slice of the research population.
The second decision is the real-world mechanism: build an iPhone app that shows key information extracted from each paper and immediately collects a thumbs-up/down judgment. By deciding to build an iPhone app, we have privileged iPhone users over users of other phones, which likely correlates with both the user’s country of residence and their economic status. By designing the mechanism so that extracted paper information is shown and a judgment is collected immediately, we are possibly advantaging papers (and thereby the authors of those papers) whose contributions can be judged quickly, or which seem flashy (click-bait-y). Similarly, since quick human judgments may focus on less relevant features, we may be biasing toward authors who are native English speakers, because things like second-language errors may disproportionately affect those judgments.
The third decision is to set up a learning problem: online prediction of thumbs-up/down for papers. I actually don’t have anything to say on this one. Open to ideas 😊.
The fourth decision is how to collect data: simply deploy the system and record users’ judgments, with no clever exploration policy. There are obviously repercussions to this decision, but I’m not sure any are discriminatory. Had we chosen a more sophisticated exploration policy, we could possibly run into discrimination issues, because small populations might get “explored on” more, potentially disadvantaging them.
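To make that contrast concrete, here is a minimal, hypothetical sketch of two data-collection policies the app might use; the `model.predict_proba` interface and everything else here is an assumption for illustration, not a description of any real system.

```python
import random

def random_policy(paper, user):
    # Simplest deployment: guess thumbs-up/down at random while collecting data.
    # Every (user, paper) pair is treated identically, so no group is singled out.
    return random.random() < 0.5

def uncertainty_exploration_policy(paper, user, model):
    # A "smarter" exploration policy: explore where the model is least confident.
    # Model uncertainty tends to be highest for under-represented users and
    # papers, so small subpopulations may get "explored on" disproportionately.
    p = model.predict_proba(paper, user)  # assumed model interface
    if abs(p - 0.5) < 0.2:                # model is unsure about this pair
        return random.random() < 0.5      # explore: random guess
    return p > 0.5                        # exploit: trust the model
```

The second policy gathers more informative data, but the cost of being experimented on falls unevenly across the population.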
Next is the decision of what data to record about each paper we show. By choosing to record author and institution (for instance), we are opening up the possibility of discrimination against certain authors or institutions; but, because many techniques for addressing discrimination in machine learning assume that you have access to some notion of protected category, we are also opening up the possibility of remedying it. Similarly, by recording the abstract, we are (in a similar but different way than before) opening up the possibility of discrimination by degree of English proficiency.
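As a concrete sketch of what such a record might look like (all field names here are illustrative assumptions, not a real schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LoggedInteraction:
    # One record per (user, paper) pair shown by the hypothetical app.
    paper_id: str
    user_id: str
    title: str
    abstract: str            # opens the door to proxies like English proficiency
    authors: List[str]       # enables discrimination against specific authors...
    institutions: List[str]  # ...but is also what many fairness methods need
    thumbs_up: bool          # the judgment collected by the app
```

The same fields that create the risk (authors, institutions) are the ones you would need in order to audit or correct for it.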
We also have to decide which features of the current user to use, and this is a major source of potential discrimination. If the current user, for instance, has only a small number of papers from which we can learn about the topics they care about, then the system will plausibly work worse for them than for someone with lots of papers (and therefore a more robust user profile).
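A minimal sketch of why profile size matters, assuming (purely for illustration) that a user profile is just the average of the feature vectors of papers the user has thumbed up:

```python
import numpy as np

def user_profile(liked_paper_vectors):
    # liked_paper_vectors: feature vectors of papers this user has thumbed up.
    if len(liked_paper_vectors) == 0:
        return None  # cold start: we know nothing about this user's topics
    # With only a handful of papers, this mean is a noisy estimate of the
    # user's interests, so recommendations will plausibly be worse for them.
    return np.mean(liked_paper_vectors, axis=0)
```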
Then there is the data representation; say we’ve chosen a bag of word embeddings over the paper text. This is a fairly easy representation to understand. Because it is a bag of embeddings, it could plausibly underperform in topics/areas whose key terms are multi-word phrases that get split apart at the spaces (e.g., I heard a story once that someone who works mostly on dependency parsing tends to get lots of papers on decision tree models suggested to them to review by TPMS, because of the overlap of the word “tree”). It’s not clear to me that there are obvious discrimination issues here, but there could be.
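For concreteness, here is a minimal sketch of a bag-of-embeddings representation, assuming some pretrained `word_vec` lookup (a hypothetical dict from word to vector):

```python
import numpy as np

def bag_of_embeddings(text, word_vec, dim=300):
    # Average the embeddings of the individual words in the text.
    vecs = [word_vec[w] for w in text.lower().split() if w in word_vec]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)  # word order and multi-word phrases are lost

# "dependency tree" and "decision tree" each contribute the same "tree"
# vector once split on spaces, which is exactly the TPMS-style confusion
# described above.
```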
Finally, there is the evaluation. In choosing to evaluate our system based only on average 0/1 loss over the run, we are potentially missing the opportunity to even observe systematic bias. An alternative would be to evaluate 0/1 error as a function of various confounding variables, like institution prestige, author prolificity, some measure of the nativeness of the language, etc. We could similarly break the error down by features of the user, for similar reasons. Finally, separating out false positives and false negatives, rather than looking only at overall error, can often reveal discriminatory structure that is not otherwise obvious.
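A minimal sketch of that kind of disaggregated evaluation (the group labels and numbers are made up): report error, false-positive rate, and false-negative rate per group instead of a single average.

```python
import numpy as np

def per_group_report(y_true, y_pred, groups):
    # Break 0/1 error, FPR, and FNR down by a grouping variable
    # (e.g., an institution prestige bucket) instead of averaging over everyone.
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    for g in sorted(set(groups.tolist())):
        m = groups == g
        err = float(np.mean(y_true[m] != y_pred[m]))
        neg, pos = (y_true[m] == 0), (y_true[m] == 1)
        fpr = float(np.mean(y_pred[m][neg] == 1)) if neg.any() else float("nan")
        fnr = float(np.mean(y_pred[m][pos] == 0)) if pos.any() else float("nan")
        print(f"group={g}: error={err:.3f}  FPR={fpr:.3f}  FNR={fnr:.3f}")

# Example with made-up numbers:
per_group_report(
    y_true=[1, 0, 1, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 0],
    groups=["big-lab", "big-lab", "small-lab", "small-lab", "small-lab", "big-lab"],
)
```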
----------------------------
I don’t think this analysis is perfect, and some things don’t
really apply, but I found it to be a useful thought exercise.
One very interesting thing about thinking about discrimination in this setting is that there are two opportunities to mess up: on the content provider (author) side and on the content consumer (reader) side.
This comes up in other places too: should your music/movie/etc. recommender
just recommend popular things to you or should it be fair to content providers
who are less well known? (Thanks to Fernando
Diaz for this example.)
----------------------------
A reader comments: Regarding decision #3, one could argue that there is an opportunity for discrimination here in the way that a “key outcome” is extracted from a larger, more complex process. In this case, that process is peer review, which does produce thumbs-up/thumbs-down decisions on papers, but also produces other outcomes that are less easily grasped or modeled. For example, a reviewer for a conference may reject a paper as not appropriate for a particular venue, but still like the paper, and write a review with advice on how it could be revised and submitted elsewhere. Or, frustrations among reviewers who see the review forms as somehow inappropriate for the submitted papers may lead to changes in the review process; this kind of thing, too, is a possible outcome of peer review that gets abstracted away by modeling it only as a process that assigns 0 or 1 to papers. The point is simply that there are many possible “learning problems” one might define with respect to peer review, none of which alone is the “real” or “right” problem, but only some of which are amenable to modeling using machine learning.