Groan groan groan reviewers are horrible people. Not you and me. Those other reviewers over there!
tldr: In general we actually don't think our reviews are that bad, though of course it's easy to remember the bad ones. Author perception of review quality is colored by, but not determined by, the overall accept/reject decision and/or the overall score that review gave to the paper.
NIPS did an experiment a bunch of years ago (can't dig it up any more, now it's an urban legend) where they asked reviewers at, I think, the time of author feedback, to rate the reviews. The anecdotal story was that there was almost perfect correlation between "this is a good review" and "this review gave my paper a high score." Of course this is not super surprising, even if you get rid of "emotions," because presumably I like my paper and so any review that doesn't like it is flawed.
For NAACL 2013, we did a similar experiment, but we asked authors for their responses several months after the fact (actually, even after the conference had taken place), at which point hopefully emotions had cooled a bit and they could look back at their reviews with a sort of fond recollection. We presented the contact author of each paper with the original review text for each of their reviews, but did not show them the original scores. We asked them on a standard Likert scale how helpful this review was, and how informative this review was.
Because this was long after the fact, response rate was of course not 100%, and it was also biased toward authors of papers that were accepted. We got responses from 128 authors on a total of 138 papers (some papers had same contact authors), covering a total of about 397 reviews (roughly one per paper, but some short papers only had two, and some papers had four).
All the plots below are restricted to this set of 138 papers, not to the full set of about 500.
First, let's get a sense of the data. Here are the overall results for this entire set of 138 papers:
(Note that the numbers add up to 397, not 138, because this is counting per-review not per-paper.) The first row shows the accept/reject ratio. Since NAACL 2013 had an acceptance rate between 25% and 30%, obviously survey results are biased toward accepted papers, but we still have a healthy response rate from rejected papers.
Overall, the vast majority (~80%) of reviews were considered both informative and helpful (score 4 or 5) according to the authors. So yes, we need to do something about the 20% of reviews that got a 1, 2 or 3 on the Likert scale, but we're actually not doing that horribly. (Modulo sample selection bias.) The papers themselves were considered overwhelmingly appropriate and clear. The overall score distribution matches (roughly) the overall score distribution for the entire conference.
Let's look at what happens if we look only at accepted or rejected paper:
We can alternatively condition the analysis on the overall paper score rather than the final accept/reject decision. Here's how that looks:
So what makes the difference between good (informativeness 4 or 5) and bad (informativeness 1 or 2) reviews?
On average, the "good" reviews were about 15% longer than the bad reviews (on average 320 characters versus 280 characters).
Somewhat surprisingly, a linear classifier on bag of words data and distinguish with 90% accuracy "good" from "bad" reviews, but the features it gives high weight to are basically features that look like positive versus negative reviews, rather essentially exploiting of the correlation between informativeness and acceptance, rather than informativeness on its own.