There are a handful of definitions of "fairness" lying around, of which the most common is disparate impact: the rate at which you hire members of a protected category should be at least 80% of the rate you hire members not of that category. (Where "hire" is, for our purposes, a prediction problem, and 80% is arbitrary.) DI has all sorts of issues, as do many other notions of fairness, but all the ones I've seen rely on a pre-ordained notion of "protected category".
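To make the 80% rule concrete, here's a tiny sketch of the disparate impact ratio (the decisions and group labels are made up, and the 0.8 threshold is the arbitrary one from above):

```python
# Minimal sketch of the four-fifths (80%) disparate impact check.
# Decisions and group labels are invented for illustration.

def disparate_impact_ratio(decisions, groups, protected, reference):
    """Ratio of positive-decision rates: protected group vs. reference group."""
    def rate(g):
        members = [d for d, grp in zip(decisions, groups) if grp == g]
        return sum(members) / len(members)
    return rate(protected) / rate(reference)

decisions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # 1 = "hire" (or "recommend")
groups    = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

ratio = disparate_impact_ratio(decisions, groups, protected="B", reference="A")
print(f"DI ratio: {ratio:.2f} (flag if below 0.8)")
```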
I've been thinking a lot about something many NLP/ML people have thought about in their musing/navel-gazing hours: something like a recommender system for papers. In fact, Percy Liang and I built something like this a few years ago (called Braque), but it's now defunct, and its job wasn't really to recommend, but rather to do offline search. Recommendation was always lower down the TODO list. I know others have thought about this a lot because over the last 10 years I've seen a handful of proposals and postdoc ads go out on this topic, though I don't really know of any solutions.
A key property that such a "paper recommendation system" should have is that it be fair.
But what does fair mean in this context, where the notion of "protected category" is at best unclear and at worst a bad idea? And to whom should it be fair?
Below are some thoughts, but they are by no means complete and not even necessarily good :P.
In order to talk about fairness of predictions, we have to first define what is being predicted. To make things concrete, I'll go with the following: the prediction is whether the user wants to read the entire paper or not. For instance, a user might be presented with a summary or the abstract of the paper, and the "ground truth" decision is whether they choose to read the rest of the paper.
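As a rough sketch, the data underlying this prediction problem might look like the following (field names are invented for illustration, not from any real system):

```python
# One way to represent the prediction problem framed above: the user is shown
# an abstract, and the ground-truth label is whether they read the full paper.
from dataclasses import dataclass

@dataclass
class Impression:
    paper_id: str
    abstract_shown: str
    recommended: bool      # did the system surface this paper?
    read_full_paper: bool  # ground truth: did the user go on to read it?

# A recommender is then a scoring function over (user, paper) pairs, and
# "accuracy" is agreement between `recommended` and `read_full_paper`.
```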
The most obvious fairness concept is authorship fairness: that whether a paper is recommended or not should be independent of who the authors are (and what institutions they're from). On the bright side, a rule like this attempts to break the rich-get-richer effect, and means that even non-famous authors' papers get seen. On the dark side, authorship is actually a useful feature for determining how much I (as a reader) trust a result. Realistically, though, no recommender system is going to model whether a result is trustworthy: just that someone finds a paper interesting enough to read beyond the abstract. (Though the two are correlated.)
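One way to operationalize this (a sketch only; the author-fame bucketing is entirely made up, and is exactly the kind of "protected category" that's hard to pin down here) is to compare recommendation rates across author groups:

```python
# Sketch of an authorship-fairness audit: recommendation rates should look
# roughly the same regardless of who wrote the paper.
from collections import defaultdict

def recommendation_rate_by_group(papers, group_fn):
    """papers: iterable of (recommended: bool, metadata: dict) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for recommended, meta in papers:
        g = group_fn(meta)
        totals[g] += 1
        hits[g] += int(recommended)
    return {g: hits[g] / totals[g] for g in totals}

# e.g. group_fn = lambda meta: "famous" if meta["max_author_hindex"] > 40 else "other"
# Large gaps between the returned rates would violate this notion of fairness:
# recommendation is not independent of authorship.
```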
A second obvious but difficult notion of fairness is that performance of the recommender system should not be a function of, e.g., how "in domain" the paper is. For example, if our recommender system relies on generating parse trees (I know, comical, but suppose...), and parsing works way better on NLP papers than ML papers, this shouldn't yield markedly worse recommendations for ML papers. Or similarly, if the underlying NLP fares worse on English prose that is slightly non-standard, or slightly non-native (for whatever you choose to be "native"), this should not systematically bias against such papers.
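A rough way to audit this (domain tags and the example numbers are invented) is simply to break accuracy down by domain:

```python
# Sketch of a per-domain performance audit for the recommender.

def accuracy_by_domain(examples):
    """examples: iterable of (domain, predicted_recommend, actually_read)."""
    correct, total = {}, {}
    for domain, pred, gold in examples:
        total[domain] = total.get(domain, 0) + 1
        correct[domain] = correct.get(domain, 0) + int(pred == gold)
    return {d: correct[d] / total[d] for d in total}

# If this returns something like {"NLP": 0.85, "ML": 0.62}, then upstream
# components (parsing, handling of non-standard prose, ...) are plausibly
# biasing the system against one kind of paper.
```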
A third notion of fairness might have to do with the underlying popularity of topics. I'm not sure how to formalize this, but suppose there are only two topics anyone ever writes papers about: deep learning and discourse. There are far more DL papers than discourse papers, but a notion of fairness might establish that they be recommended at similar rates.
This strong rule seems somewhat dubious to me: if there are lots of papers on DL then probably there are lots of readers, and so probably DL papers should be recommended more. (Of course it could be that there exists an area where tons of papers get written and none get read, in which case this wouldn't be true.)
A weaker version of this rule might state conditions on one-sided error rates. Suppose that every time a discourse paper is recommended, it is read (high precision), but that only about half of the recommended DL papers get read (low precision). Such a situation might be considered unfair to discourse papers because tons of DL papers get recommended when they shouldn't, but not so for discourse papers.
Now, one might argue that this is going to be handled by just maximizing accuracy (aka click-through rate), but this is not the case if the number of people who are interested in discourse is dwarfed by the number interested in DL. Unless otherwise constrained, a system might completely forgo performance on those interested in discourse in favor of those interested in DL.
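A sketch of the weaker, one-sided check (topic labels and numbers invented, mirroring the DL-vs-discourse example) is to compare the precision of recommendations per topic:

```python
# Sketch: per-topic precision of the recommender, i.e., of the papers we
# recommended in each topic, how many were actually read?

def precision_by_topic(recommendations):
    """recommendations: iterable of (topic, was_read) over recommended papers only."""
    read, shown = {}, {}
    for topic, was_read in recommendations:
        shown[topic] = shown.get(topic, 0) + 1
        read[topic] = read.get(topic, 0) + int(was_read)
    return {t: read[t] / shown[t] for t in shown}

# e.g. {"discourse": 0.95, "DL": 0.50}: DL papers get recommended far too
# liberally while discourse papers don't, which is the one-sided disparity a
# fairness constraint might bound, even if overall click-through is maximized
# by ignoring the small discourse audience.
```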
This is all fine, except that the world doesn't consist of just DL papers and just discourse papers (and nary a paper in the intersection, sorry Yi and Jacob :P). So what can we do then?
Perhaps a strategy is to say: I should not be able to predict the accuracy of recommendation on a specific paper, given its contents. That is: just because I know that a paper includes the words "discourse" and "RST" shouldn't tell me anything about what the error rate is on this paper. (Of course it does tell me something about the recommendations I would make on this paper.) You'd probably need to soften this with some empirical confidence intervals to handle the fact that many papers will have very few observations. You could also think about making a requirement/goal like this simultaneously on both false positives and false negatives.
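Here's one possible instantiation of that test (assuming scikit-learn; the bag-of-words features, the model, and the "AUC near 0.5" criterion are all just illustrative choices, not a canonical definition): fit an auxiliary classifier from paper contents to a per-paper error indicator, and check that it does no better than chance.

```python
# Sketch: can an auxiliary classifier predict where the recommender errs,
# given only the paper's contents? Cross-validated AUC near 0.5 means "no".
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def error_predictability_auc(texts, recommender_was_wrong):
    """texts: list of paper contents; recommender_was_wrong: list of 0/1 flags."""
    X = CountVectorizer(min_df=2).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, recommender_was_wrong, cv=5, scoring="roc_auc")
    return scores.mean()

# AUC well above 0.5 means that words like "discourse" or "RST" do carry
# information about the system's error rate; the spread across folds is a
# crude stand-in for the confidence intervals mentioned above.
```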
A related issue is that of bubbles. I've many times been told that something in one of my (pre-neural-net) papers had already been done in neural-nets land ten years earlier; I've many times told, or wanted to tell, someone the opposite. Both of these are failures of exploration: not out of malice, but just out of lack of time. If a user chooses to read papers if and only if they're on DL, should a system continue to recommend non-DL papers to them? If so, why? This directly contradicts the notion of optimizing for accuracy.
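For what it's worth, the standard band-aid here is to force a bit of exploration; a minimal sketch (epsilon-greedy, with an invented split into on- and off-topic candidate pools) looks like this, and it quite explicitly gives up a little accuracy to do so:

```python
# Sketch of epsilon-greedy exploration: mostly exploit the learned scores, but
# occasionally recommend something outside the user's usual reading.
import random

def recommend(scored_candidates, off_topic_candidates, epsilon=0.1):
    """scored_candidates: list of (score, paper), sorted best-first."""
    if off_topic_candidates and random.random() < epsilon:
        return random.choice(off_topic_candidates)   # deliberate exploration
    return scored_candidates[0][1]                   # exploit the model
```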
Overall, I'm not super convinced by any of these thoughts, certainly not enough to even try to really formalize them.
2 comments:
The issue of bubbles is real, and becomes compounded by the review process. In practice, "peer review" means "people assigned to review this paper agree that it meets certain standards of scientific rigor and significance." Of course, in different communities, those standards can vary widely -- what's the relative importance of theory vs. experiments, which datasets and metrics are accepted, which baselines are used, which related work is read and cited? Once a community has enough researchers on program committees to reliably review each other's work, it can reinforce its own standards independent of the larger research community. This is partly good -- work is reviewed by the people most likely to appreciate and value its strengths -- and partly bad -- research can become stagnant when its assumptions are never challenged. I don't really have a solution to this, other than trying to assign a somewhat diverse set of reviewers to each paper.
Feynman had an anecdote about learning calculus from an old book that emphasized differentiating under the integral sign, a technique that wasn't well covered anymore. Other researchers were much stronger at calculus in general than he was, but when a problem stumped everybody he'd be able to solve it with that trick (since the rest had already tried the other techniques). This got him a reputation as a calculus virtuoso.
I think a population-level measure is going to be a more natural place to formulate and justify diversity metrics. Some degree of shared reading within a group/department/population is valuable for communication, setting shared expectations of basic knowledge in a field. But beyond that, mining obscure programs or tangential fields is a promising source of novel insights that peers won't have. If an idea from field A combined with an idea from field B provides great insight, how long will recommended reading and conversation take to combine the ideas in one mind?