27 August 2010

Calibrating Reviews and Ratings

NIPS decisions are going out soon, and then we're done with submitting and reviewing for a blessed few months. Except for journals, of course.

If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).

One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. Conferences try to adjust for this; NIPS, for instance, anchors its 1-10 rating scale with descriptions like "8 = Top 50% of papers accepted to NIPS." Even so, some people are simply harsher than others in scoring, and it seems to be the area chair's job to calibrate for this. (For instance, I know I tend to be fairly harsh -- I probably give only one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. I have friends who never give a one -- except when something is just plain wrong -- and often give 5s. Perhaps I should be nicer; I know CS tends to be harder on itself than other fiends.) As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it makes calibration easier.
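One crude way an area chair could calibrate is to normalize each reviewer's scores against that reviewer's own mean and spread, so a harsh 3 and a generous 5 can land in the same place. A minimal sketch in Python (reviewers, papers, and scores all invented):

```python
from statistics import mean, stdev

# Hypothetical raw scores (1-5 scale) from two reviewers with very
# different baselines: one harsh, one generous.
scores = {
    "harsh_reviewer":    {"paper_A": 2, "paper_B": 3, "paper_C": 1},
    "generous_reviewer": {"paper_A": 4, "paper_B": 5, "paper_C": 4},
}

def calibrate(reviewer_scores):
    """Z-score one reviewer's ratings against their own mean and spread."""
    vals = list(reviewer_scores.values())
    mu, sigma = mean(vals), stdev(vals)
    return {paper: (s - mu) / sigma for paper, s in reviewer_scores.items()}

calibrated = {r: calibrate(s) for r, s in scores.items()}
# After calibration both reviewers agree that paper_B is their best paper,
# even though the harsh reviewer's raw 3 looked worse than the generous
# reviewer's raw 4s.
```

This only works once a reviewer has rated enough papers for their mean and spread to be meaningful -- which is exactly the argument for fewer reviewers with more reviews each.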

There's also the issue of areas. Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?

A similar phenomenon occurs in sentiment analysis, as Franz Och pointed out to me at ACL this year. The example he gives is very nice. If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars and a 79% recommendation. On the other hand, if you look up In'N'Out Burger, an LA-based burger chain (which, having grown up in LA, I admit was one of my favorite places to eat in high school, back when I ate stuff like that), you see another 4.0/5.0 stars and a 95% recommendation.

So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. And we expect this to work?!

Probably the main issue here is calibrating for expectations. As a teacher, I figured out quickly that managing student expectations is a big part of getting good teaching reviews. If you go to In'N'Out expecting a Big Mac, you'll be pleasantly surprised. If you go to The French Laundry expecting a meal worth selling your soul (and your children's souls, etc.) for, then you'll probably be disappointed (though I can't really say: I've never been).

Hotels.com deals with a similar problem by showing you ratings for the hotel you're looking at alongside rating statistics for other hotels within a 10-mile radius (or something like that). You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price is. For "$", In'N'Out is probably near the top, and for "$$$$", The French Laundry probably is.
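The price-band idea is easy to sketch: only rank a restaurant against others in the same "$" bucket, so each place competes against its own peers. All names and star counts below are invented for illustration:

```python
# Toy illustration of ranking restaurants within price bands rather than
# globally. Every number here is made up for the example.
restaurants = [
    ("In'N'Out Burger",    "$",    4.0),
    ("Generic Fast Food",  "$",    2.5),
    ("Taco Stand",         "$",    3.0),
    ("The French Laundry", "$$$$", 4.0),
    ("Stuffy Bistro",      "$$$$", 3.5),
    ("Tourist Trap",       "$$$$", 2.0),
]

def rank_within_band(data):
    """Group restaurants by price band and sort each band by stars."""
    bands = {}
    for name, band, stars in data:
        bands.setdefault(band, []).append((stars, name))
    return {
        band: [name for stars, name in sorted(entries, reverse=True)]
        for band, entries in bands.items()
    }

ranked = rank_within_band(restaurants)
# Both 4.0-star places come out on top -- of their *own* band -- which is
# a more honest reading of the identical raw scores.
```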

(Anticipating comments, I don't think this is just an "aspect" issue. I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)

I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.), there simply isn't a total order on the "quality" of the objects you're looking at. (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) There is maybe a situation-dependent order, and the distance to a hotel, the "$" rating, or area classes are heuristics for describing this "situation." But without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.
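To make the "situation-dependent order" concrete, here's a toy sketch in which an item's score is a weighted sum of features, with the weights supplied by the situation; change the situation and the preferred item flips, so no single total order exists. Every item, feature, and weight below is invented:

```python
# Toy model: an item's utility depends on the situation, so there is no
# single total order over items. All features and weights are invented.
ITEMS = {
    "In'N'Out Burger":    {"food_quality": 0.6, "speed": 0.9, "cheap": 0.95},
    "The French Laundry": {"food_quality": 1.0, "speed": 0.1, "cheap": 0.0},
}

SITUATIONS = {  # feature weights describing what matters right now
    "road_trip_lunch":    {"food_quality": 0.2, "speed": 0.6, "cheap": 0.2},
    "anniversary_dinner": {"food_quality": 0.9, "speed": 0.0, "cheap": 0.1},
}

def preferred(a, b, situation):
    """Return whichever of a, b scores higher under the situation's weights."""
    w = SITUATIONS[situation]
    score = lambda item: sum(w[f] * ITEMS[item][f] for f in w)
    return a if score(a) >= score(b) else b
```

Under "road_trip_lunch" the burger chain wins; under "anniversary_dinner" it loses -- exactly the kind of flip that a single global star rating cannot express.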


  1. This was an issue in the Netflix Prize competition and in basically any system based on ordinal or scalar ratings.

    There are some standard techniques for adjusting for these biases. On the ordinal scale of 1-5 reviews, I'd strongly recommend checking out Uebersax and Grove's uber-cool paper,

    Uebersax, J.S. and W.M. Grove. 1993. A latent trait finite mixture model for the analysis of rating agreement. Biometrics.

  2. @Bob: thanks for the comment -- I totally agree that this is an important aspect; I was going to mention it and point to the max-margin matrix factorization work, since they seem to get big benefits out of having per-user biases.

    I think there's more to it than that, though. I would be perfectly happy giving 4.5/5 ratings to both, say, Killer Shrimp (a slightly dive-y shrimp joint in LA) and Le Bernardin (probably the best seafood restaurant in the world). I'm just comparing them against different standards.

  3. I've recently started to think of this problem as a domain adaptation one where the domains are not necessarily well defined.

    In my view, absolute recommendations/ratings don't really make sense; what people are actually looking for are rankings over subsets of the possible things (burger places, fast food joints, romantic dinner places, etc., with added location information, since the same food quality in Hickville will probably end up better rated than in NYC or SF). Each such "natural" subset of things defines a domain where you can actually compare ratings, and a classifier/ranking-regressor adapted for one such setting will not naturally generalize to others.

    So I think this is finer grained than a pairwise preference classification task, since I can conceivably imagine that whether A is better than B depends on the other things actually being considered. Say we have a posh pizza place with good, expensive pizzas, and a cheap pizza delivery place that also lets you eat there with the delivery guys. If we think of picking the best of these two in a context where a romantic meal is involved (say, lots of other restaurants in the set of things to be ranked), the posh place should win; but if you are looking for cheap pizza (that is, when the set is filled with delivery and eat-in pizza places and nothing else), the cheaper place might look better, delivering the same value for less money.

    The thing is, it seems to be hard to find data where this sort of hypothesis can be tested, but playing with restaurant reviews I do get the feeling that there are domain adaptation problems hidden inside the obvious ranking task.

  4. Is the "Recommended" rating, though, representing the "goodness" of the place or its popularity? I'd assume the latter, especially given the data you mention.

    In n' out is more accessible than the Laundry for the "average person". Once you look at the number that way, I don't think there are any calibration problems. I'm not sure about reliably inferring a "goodness" of a place from these types of ratings.

  5. Thanks for picking up on the calibration issue, which is very important and widely ignored.
    As I pointed out earlier this year (#SocMed10 WS @ NAACL; slide 33), people doing sentiment analysis assume a decision boundary at zero based on good/bad word counts, which is arbitrary; they should instead calibrate their a priori sentiment scores by correlating them with human perceptions.

    Your restaurant example seems to indicate sub-category specific scaling (the good old "big mouse < small elephant" textbook example comes to mind).

  6. In mature scientific fields, conferences are events where members of a community get together and talk about what they have been doing over the last year. Very little reviewing or gate keeping is needed.

    Reviewing for conferences like NIPS mostly shows that there isn't much of a stable community, and that's pretty sad. And the obsession with ratings and grading really leaves a sour taste.

    If gate-keeping is needed because the community is ill-defined, stop worrying about numerical ratings. Instead, ask yourself the questions: is this paper relevant to the conference, is it of acceptable quality, and should it be oral or written.

    And if you can't accommodate all the papers that meet those criteria, the conference and the field should split.

    In short, when worrying about putting numerical ratings on the same scale, you're simply asking the wrong question.

  7. The problem with direct comparisons between In and Out versus the French Laundry is pretty easily handled if you view recommendations as entirely personal and contextual rather than universal. Thus, for most people you absolutely should recommend a burger joint over a fancy restaurant, for others you should never recommend the burger joint. Even for people for whom a burger is not normally considered food, there are moments when In and Out is the right recommendation (suppose you are headed south from LA and need a place to get a few calories, some caffeine and to off-load that last load of caffeine ... French Laundry is a bad recommendation in that case, In and Out in Capistrano is a good one).

    Personalization is the whole point of social algorithms and a total order is never much good for recommendations.

  8. "I know CS tends to be harder on itself than other fiends." I found the small fiends vs. fields typo amusing.

    Also, Oprah's book choices are maybe not as bad as you seem to think!

  9. A similar issue pops up in information retrieval: which document (or image, or whatever) best satisfies the information need?

    Even taking into account the context, one finds that absolute metrics like ratings can be hard to calibrate. But relative metrics can be much more reliable.

    For example, you can introduce objectives like: given this context, the user should prefer The French Laundry over In'N'Out (or vice versa).

  10. Read a blog post recently on this topic in the context of sentiment for hotel reviews. One way of viewing the disconnect is that people are rating how each option compared to their expectation. Sure 99.9% of the reviewers would choose FL over In'N'Out in a taste test, but people expected more of FL than it could deliver. It's also not a crazy set of scores when you consider value: Is FL providing a meal 50x better than In'N'Out to match the price difference?
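The per-user bias adjustment mentioned in the comments above (the point about max-margin matrix factorization getting big wins from per-user biases) can be sketched in its simplest form: model each rating as a global mean plus a user bias plus an item bias, with biases fit by averaging residuals. The data below is invented:

```python
# Simplest bias model behind the "per-user biases" comment: predict a
# rating as  mu + b_user + b_item,  fitting biases by averaging residuals.
# Users, restaurants, and stars are all invented for illustration.
ratings = [  # (user, item, stars)
    ("harsh", "Killer Shrimp", 3), ("harsh", "Le Bernardin", 4),
    ("nice",  "Killer Shrimp", 5), ("nice",  "Le Bernardin", 5),
]

mu = sum(stars for _, _, stars in ratings) / len(ratings)  # global mean

def bias(key_index):
    """Average residual (stars - mu) grouped by user (0) or item (1)."""
    groups = {}
    for record in ratings:
        groups.setdefault(record[key_index], []).append(record[2] - mu)
    return {k: sum(v) / len(v) for k, v in groups.items()}

user_bias, item_bias = bias(0), bias(1)
# The "harsh" user gets a negative bias, the "nice" user a positive one,
# so their raw scores become comparable after subtracting the biases.
```

This is only a first-order correction, of course; it recenters each user's scale but does nothing about the situation- or category-dependence discussed in the other comments.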