16 July 2018

Yet another list of things we can do to have more diverse sets of invited speakers

I am super reticent to write this because so many people have written similar things. Yet despite that, we still have things like an initial workshop program with 17 invited speakers of whom 16 are men (at ICML 2018) and a panel where 3 of the 5 panelists are or were previously in the same group at Stanford (at NAACL 2018). And I (and others) still hear things like "tell me what to do." So, here it is, I'm saying what we should do. (And I'm saying this mostly so that the others, who are usually women, don't have to keep dealing with this and can hopefully point here.)

Like any such list, this is incomplete. I'd love suggestions for improvements. I'm also taking for granted that you want a diverse set of invited speakers. If you don't, go read up on that first.

I think it's also important to recognize that we all mess up. I (Hal) have messed up many times. The important thing is to learn. In order to learn, we need a signal, and so if someone points out that your event did not live up to what-should-be-expectations, accept that. Don't get defensive. Realize that this learning signal is super rare, and so apply some inverse propensity weighting and do a big learning update.

This list was created by first brainstorming ideas (largely based on my own failures), then going to some excellent sources and adding/modifying. You should also read those sources (which mostly focus on panels, not invited speakers, and also mostly focus on gender).
Okay, so here's my list, broken down into "phases" of the workshop organization process.
  • Before you begin inviting folks
    • Have a goal, write it down, and routinely evaluate yourself against that goal. If you succeed, great. If you fail, learn (also great!).
    • Have co-organizers with different social networks than you have, or you'll all only think of the same people.
    • Start with an initial diverse list of potential speakers that's mostly (well more than half) women and/or minorities, covering different geographic regions, different universities, and different levels of "seniority". You need to start with well more than half because (a) you should expect many to say no, and because (b) many of your contributed papers are highly likely to be presented by abled white guys from privileged institutions in the US, so if the event is to be even remotely balanced you need to compensate early.
    • Scan through the proceedings of recent conferences for people that aren't immediately on your radar.
    • If you can't come up with a long enough list that's also diverse, then maybe consider whether your topic is just "you and your buddies," and perhaps think about if you can expand your topic in an interesting direction to cast a broader net.
    • If you can't come up with such a list, maybe your criteria for who to invite are unrealistic, or already very white-male-biased. For instance, a criterion like "I only want full profs from US universities" comes with a lot of historical/social baggage.
    • Ask everyone you know for suggested names. Check out existing resources like the WiML and Widening NLP directories, but also realize that there are many forms of diversity that may not be well covered in these.
    • Once you have a long list of potential speakers with many women or minorities on it, ensure that you're not just inviting women to talk about "soft" topics and men about "technical" topics.
  • In the invitation process
    • In the invitation letter to speakers, offer to cover childcare for the speaker (regardless of who it is) either at the workshop or at their home city. Women in particular often take the majority of child rearing responsibilities; this may help tip the scales, but will also help everyone who has kids.
    • In each invitation that you send to men, or to people who are not under-represented, explicitly ask for suggestions of additional speakers (who are not white men) you could invite, and do so in the initial invitation (i.e., not just when they decline).
    • Invite speakers from under-represented or historically excluded groups very early before they become even more overcommitted. But also give them an easy out to say no.
    • When you start sending invitations out, invite the abled white guys at privileged institutions slowly and later. That way, if you have trouble getting a diverse set of speakers, you’re not already overcommitted.
  • Dealing with challenges
    • If the diversity of your event is being hurt by the fact that potential speakers cannot travel, consider allowing one or two people to speak remotely.
    • If you do find yourself overcommitted to a non-diverse speaker group, it may be time to eat crow, apologize to a few of them, and say directly that you were aiming for a diverse slate of speakers, but you messed up, and you would like to know if they would be willing to step down to allow room for someone else to speak in their stead.
    • Go back to your goals. How are you doing? What can you do better next time?

Finally, if you're a guy, or otherwise hold significant privilege, even if you're not organizing a workshop try to help people who are. You should have a go-to list of alternate speakers that you can provide when colleagues ask you for ideas of who to invite or when you get invited yourself. You can have an inclusion rider for giving talks and being on panels, and perhaps also for putting your name on workshops as a co-organizer. I promise, people will appreciate the help!

12 June 2018

Many opportunities for discrimination in deploying machine learning systems


A while ago I created this image, for Chapter 2 of CIML, for thinking about how machine learning systems tend to get deployed. In the figure, the left column shows a generic decision being made, and the right column shows an example of that decision in the case of advertising placement on a search engine we've built.


The purpose of the image at the time was basically to think about different types of "approximation" error: we have some real-world goal (e.g., increase revenue) and design a machine learning system to achieve it. The point here, which echoes a lot of what Martin Zinkevich (who knows much more about this than I do) writes about in his Rules of Machine Learning, is that it's important to recognize that there's a whole lot of stuff that goes on around any machine learning system, and each piece puts an upper bound on what you can achieve.

A year or two ago, in talking to Suresh Venkatasubramanian, we realized that it’s also perhaps an interesting way to think about different places that “discrimination” might come into a system (I’ll avoid the term “bias” because it’s overloaded here with the “bias/variance tradeoff”). By “discrimination” I simply mean a situation in which some subpopulation is disadvantaged. Below are some thoughts on how this might happen.

To make things more interesting (and navel-gaze-y), I’m going to switch from the example of ad display to paper recommendations in a hypothetical arxiv rss-feed-based paper recommendation system. To be absolutely clear, this is very much a contrived, simplified thought example, and not meant to be reflective of any decisions anyone or any company has made. (For the purposes of this example, I will assume all papers on arxiv are in English.)



1.       We start with a real-world goal: helping people filter newly uploaded papers on arxiv.

In stating this goal, we are explicitly making a value judgment of what matters. In this case, one part of this value judgment is that it’s only new papers that are interesting, potentially disadvantaging authors who have high quality older work. It also advantages people who put their papers on arxiv, which is not a uniform slice of the research population.

2.       We now need a real-world mechanism to achieve our goal: an iPhone app that shows extracted information from a paper that users can thumbs-up or thumbs-down (or swipe left/right as you prefer).

By deciding to build an iPhone app, we have privileged iPhone users over users of other phones, which likely correlates both with country of residence and economic status of the user. By designing the mechanism such that extracted paper information is shown and a judgment is collected immediately, we are possibly advantaging papers (and thereby the authors of those papers) whose contributions can be judged quickly, or which seem flashy (click-bait-y). Similarly, since human flash judgments may focus on less relevant features, we may be biasing toward authors who are native English speakers, because things like second language errors may disproportionately affect quick judgments.

3.       Next, set up a learning problem: online prediction of thumbs-up/down for papers displayed to the user (essentially a contextual bandit learning problem).

I actually don’t have anything to say on this one. Open to ideas 😊.

4.       Next, we define a mechanism for collecting data: we will deploy a system and use epsilon-greedy exploration to collect data.

There are obviously repercussions to this decision, but I’m not sure any are discriminatory. Had we chosen a more sophisticated exploration policy, this could possibly run into discrimination issues because small populations might get “explored on” more, potentially disadvantaging them.
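Concretely, epsilon-greedy selection over the candidate papers might look something like the following sketch; here `scores` is assumed to hold the model's current predicted thumbs-up scores for each candidate (the names are illustrative).

```python
import random
from typing import List

def epsilon_greedy_choice(scores: List[float], epsilon: float = 0.1) -> int:
    """Pick the index of the paper to display.

    With probability epsilon, explore uniformly over all candidates;
    otherwise exploit by showing the highest-scoring paper.
    """
    if random.random() < epsilon:
        return random.randrange(len(scores))                    # explore
    return max(range(len(scores)), key=lambda i: scores[i])     # exploit
```

One property worth noting: every candidate gets at least epsilon divided by the number of candidates as its probability of being shown, which is part of why the discrimination worries here feel milder than for a cleverer exploration policy that deliberately concentrates exploration on uncertain (often small) subpopulations.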

5.       From this data collection procedure, we decide what data to log: paper title, authors, institution, abstract, and rating (thumbs up/down).

By choosing to record author and institution (for instance), we are opening up the possibility of discrimination against certain authors or institutions; but because many techniques for addressing discrimination in machine learning assume that you have access to some notion of protected category, we are also opening up the possibility of remedying it. Similarly, by recording the abstract, we are opening the possibility of discrimination by degree of English proficiency (a similar-but-different issue to the one raised in (2)).
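A sketch of what one logged interaction might look like under this decision (the field names are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LoggedInteraction:
    title: str
    authors: List[str]   # recording this enables author-level discrimination -- and also auditing for it
    institution: str     # likewise for institution-level effects
    abstract: str        # free text whose style (e.g., English proficiency) a model can pick up on
    thumbs_up: bool      # the collected rating
```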

6.       Given this logged data, we have to choose a data representation: for this we’ll take text from title, authors, institution and abstract, and then have features for the current user of the system (e.g., the same features from their own papers), and a binary signal for thumbs up/down.

A major source of potential discrimination here comes from the features we use of the current user. If the current user, for instance, only has a small number of papers from which we can learn about the topics they care about, then the system will plausibly work worse for them than for someone with lots of papers (and therefore a more robust user profile).

7.       Next we choose a model family: for simplicity and because we have a “cold start” problem, instead of going the full neural network route, we’ll just use a bag-of-embeddings representation for the paper being considered, and a bag-of-embeddings representation for the user, and combine them with cross-product features into a linear model.

This is a fairly easy representation to understand. Because we've chosen a bag of embeddings, this could plausibly underperform on topics/areas whose key terms are multi-word phrases that get split apart into individual words (e.g., I heard a story once that someone who works mostly on dependency parsing tends to get lots of decision-tree papers suggested to them to review by TPMS, because of the overlap on the word "tree"). It's not clear to me that there are obvious discrimination issues here, but there could be.
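In rough numpy terms, the scoring might look something like the sketch below; the embedding table, dimensionality, and weight matrix are placeholder assumptions, not a claim about how such a system is actually built.

```python
import numpy as np

def bag_of_embeddings(tokens, embeddings, dim=50):
    """Average the embeddings of the tokens we know; all-zeros if none match."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def score(paper_tokens, user_tokens, embeddings, W):
    """Linear model over cross-product features of the paper and user embeddings."""
    p = bag_of_embeddings(paper_tokens, embeddings)   # tokens from title, authors, institution, abstract
    u = bag_of_embeddings(user_tokens, embeddings)    # tokens from the user's own papers
    return float(u @ W @ p)  # equivalent to a linear model over np.outer(u, p).ravel()
```

The "tree" story above is visible directly in this sketch: averaging embeddings token by token means a user whose papers are full of the word "tree" (in the parsing sense) can look similar to decision-tree papers, purely through that one shared word.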

8.       Selecting the training data: in this case, the training data is simply coming online, so the discussion in (3) applies.

9.       We now train the model and select hyperparameters: again, this is an online setting, so there's really no train/test split and this question doesn't really apply (though see the comment about exploration in #4).

10.   The model is then used to make predictions: again, online, doesn’t really apply.

11.   Lastly, we evaluate error: here, the natural choice is 0/1 loss on whether the thumbs up/down was predicted correctly or not.

In choosing to evaluate our system based only on average 0/1 loss over the run, we are potentially missing the opportunity to even observe systematic bias. An alternative would be to evaluate 0/1 error as a function of various confounding variables, like institution prestige, author prolificity, some measure of how native the English is, etc. The error could similarly be broken down by features of the user, for the same reasons. Finally, separating out false positives and false negatives, rather than looking only at overall error, can often reveal discriminatory structure that is not otherwise obvious.
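A sketch of such a breakdown, computing per-group error along with false-positive and false-negative rates (the grouping variable, e.g. institution, is whatever potential confound you want to audit; the exact record layout is just an assumption for illustration):

```python
from collections import defaultdict

def error_breakdown(records):
    """records: iterable of (group, predicted, actual) triples with boolean thumbs-up values.

    Returns, per group: overall 0/1 error, false-positive rate (among true thumbs-down),
    and false-negative rate (among true thumbs-up).
    """
    counts = defaultdict(lambda: {"n": 0, "err": 0, "fp": 0, "fn": 0, "pos": 0, "neg": 0})
    for group, pred, actual in records:
        c = counts[group]
        c["n"] += 1
        c["err"] += int(pred != actual)
        c["pos"] += int(actual)
        c["neg"] += int(not actual)
        c["fp"] += int(pred and not actual)
        c["fn"] += int(actual and not pred)
    return {
        group: {
            "error_rate": c["err"] / c["n"],
            "false_pos_rate": c["fp"] / c["neg"] if c["neg"] else float("nan"),
            "false_neg_rate": c["fn"] / c["pos"] if c["pos"] else float("nan"),
        }
        for group, c in counts.items()
    }
```

A single averaged error can look fine even when one of these per-group numbers is badly off, which is exactly the kind of structure the aggregate metric hides.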

----------------------------

I don’t think this analysis is perfect, and some things don’t really apply, but I found it to be a useful thought exercise.

One thing very interesting about thinking about discrimination in this setting is that there are two opportunities to mess up: on the content provider (author) side and on the content consumer (reader) side. This comes up in other places too: should your music/movie/etc. recommender just recommend popular things to you or should it be fair to content providers who are less well known? (Thanks to Fernando Diaz for this example.)