05 July 2016

Rating the quality of reviews, after the fact

Groan groan groan reviewers are horrible people. Not you and me. Those other reviewers over there!

tldr: In general we actually don't think our reviews are that bad, though of course it's easy to remember the bad ones. Author perception of review quality is colored by, but not determined by, the overall accept/reject decision and/or the overall score that review gave to the paper.


NIPS ran an experiment a bunch of years ago (I can't dig it up any more, so now it's basically an urban legend) in which they asked authors, at (I think) author-feedback time, to rate their reviews. The anecdotal result was a near-perfect correlation between "this is a good review" and "this review gave my paper a high score." Of course this is not super surprising, even if you get rid of "emotions": presumably I like my paper, so any review that doesn't like it must be flawed.

For NAACL 2013, we ran a similar experiment, but we asked authors for their responses several months after the fact (actually, even after the conference had taken place), by which point hopefully emotions had cooled a bit and they could look back at their reviews with a sort of fond recollection. We showed the contact author of each paper the original review text for each of their reviews, but not the original scores, and asked, on a standard 1-5 Likert scale, how helpful and how informative each review was.

Because this was long after the fact, the response rate was of course not 100%, and it was also biased toward authors of papers that were accepted. We got responses from 128 authors covering a total of 138 papers (some authors were the contact author for more than one paper) and about 397 reviews (roughly three per paper: most papers had three reviews, some short papers had only two, and some papers had four).

All the plots below are restricted to this set of 138 papers, not to the full set of about 500.

First, let's get a sense of the data. Here are the overall results for this entire set of 138 papers:


(Note that the numbers add up to 397, not 138, because the counts are per review, not per paper.) The first row shows the accept/reject split. Since NAACL 2013 had an acceptance rate between 25% and 30%, the survey results are obviously biased toward accepted papers, but we still have a healthy response rate from authors of rejected papers.

Overall, the vast majority (~80%) of reviews were considered both informative and helpful (score 4 or 5) according to the authors. So yes, we need to do something about the 20% of reviews that got a 1, 2 or 3 on the Likert scale, but we're actually not doing that horribly. (Modulo sample selection bias.) The papers themselves were considered overwhelmingly appropriate and clear. The overall score distribution matches (roughly) the overall score distribution for the entire conference.
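For concreteness, here is a minimal sketch of the kind of tally behind that ~80% figure. The survey data itself isn't public, so the file name and schema (one row per review, with the author's 1-5 Likert ratings in columns "helpful" and "informative") are hypothetical.

import pandas as pd

# Hypothetical export of the survey: one row per review, with the author's
# 1-5 Likert ratings in columns "helpful" and "informative".
reviews = pd.read_csv("naacl2013_review_survey.csv")

for col in ["helpful", "informative"]:
    frac_good = (reviews[col] >= 4).mean()
    print(f"{col}: {frac_good:.0%} of reviews rated 4 or 5")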

Let's look at what happens when we restrict to accepted papers or to rejected papers:

Comparing these, we definitely see a bit of the NIPS effect. For accepted papers, the reviews were considered overwhelmingly informative and helpful (scores of 4 or 5 in 85% or more of cases). However, even for rejected papers, the reviews were still considered largely informative and helpful (~73% of cases were 4s and 5s). Not surprisingly, accepted papers fare quite well on the individual score metrics, in particular overall score (duh!).

We can alternatively condition the analysis on the overall paper score rather than the final accept/reject decision. Here's how that looks:

That's not substantially different.
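Under the same hypothetical schema, both conditionings are one-line group-bys: once on the paper's accept/reject decision, and once on the overall score that the review itself gave (which, recall, the authors never saw). The column names "decision" and "overall" are assumptions.

# Fraction of "good" (4-5) informativeness ratings...
good = reviews["informative"] >= 4

# ...conditioned on the accept/reject decision for the paper,
print(good.groupby(reviews["decision"]).mean())

# ...and conditioned instead on the review's own 1-5 overall score.
print(good.groupby(reviews["overall"]).mean())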

So what makes the difference between good (informativeness 4 or 5) and bad (informativeness 1 or 2) reviews?

On average, the "good" reviews were about 15% longer than the bad reviews (on average 320 characters versus 280 characters).
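A sketch of that length comparison, under the same hypothetical schema (the post says characters, though the comments below indicate word counts were what was actually measured, so both are shown; the "text" column is an assumption):

# Mean length of "good" (informativeness 4-5) vs "bad" (1-2) reviews,
# in characters and in whitespace-tokenized words.
labeled = reviews[reviews["informative"] != 3]
is_good = labeled["informative"] >= 4
print(labeled["text"].str.len().groupby(is_good).mean())              # characters
print(labeled["text"].str.split().str.len().groupby(is_good).mean())  # words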

Somewhat surprisingly, a linear classifier on bag-of-words features can distinguish "good" from "bad" reviews with about 90% accuracy. However, the features it puts high weight on are basically the features that distinguish positive reviews from negative reviews: it is essentially exploiting the correlation between informativeness and acceptance, rather than picking up on informativeness on its own.
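The post doesn't say which linear classifier was used, so here is one plausible instantiation of that experiment (bag of words plus logistic regression, via scikit-learn). The "text" column and the preprocessing are assumptions; the 90% figure and the sentiment-like top features are the post's findings, not the output of this sketch.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# "Good" = informativeness 4 or 5, "bad" = 1 or 2; the 3s are excluded,
# following the definitions in the post.
labeled = reviews[reviews["informative"] != 3]
y = (labeled["informative"] >= 4).astype(int)

# Plain bag-of-words features over the (hypothetical) review text column.
vec = CountVectorizer(lowercase=True, min_df=2)
X = vec.fit_transform(labeled["text"])

clf = LogisticRegression(max_iter=1000)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Inspect the highest-weighted words: per the post, these tend to look like
# positive-review vocabulary, i.e. the classifier is really tracking
# acceptance rather than informativeness.
clf.fit(X, y)
vocab = np.array(vec.get_feature_names_out())
print(vocab[np.argsort(clf.coef_[0])[-20:]])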

3 comments:

crash-dev said...

280 characters! I hope you meant words, but even those seem pretty short.

Unknown said...

For PLDI'16, Emery Berger, the PC Chair, prodded the PC toward having reviews that are at least 500 words long. The PC basically listened. I would be curious to see if this affected authors' perceptions of review quality.

hal said...

Yes, I meant words :).

The problem is that you can easily hit X number of words just by pointing out typos or missing citations to one's own work :/.