26 September 2012

Sure, you can do that....

I'll warn in advance that this is probably one of the more controversial posts I've written, but realize that my goal is really to raise questions, not necessarily give answers.  It's just more fun to write strong rhetoric :).

Let me write down a simple Markov Chain:
  1. Download some data from the web
  2. Call part of that data the input and part of it the label
  3. Train a classifier on bag of words and get 84% accuracy
  4. Submit a paper to *ACL
  5. Go to 1
Such papers exist in the vision community, too, where you replace "bag of words" with "SIFT features" and "*ACL" with "CVPR/ICCV."  In that community (according to my one informant :P), such papers are called "data porn."  Turns out this is actually a term from journalism, one definition of which is "where journalists look for big, attention grabbing numbers or produce visualisations of data that add no value to the story."
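
(Just to make the cartoon concrete: steps 1 through 3 really are only a few lines of code these days.  Here's a minimal sketch in Python with scikit-learn -- the file name and column names are made up purely for illustration.)

    # A minimal sketch of steps 1-3; "scraped_from_the_web.csv" and its columns
    # are hypothetical stand-ins for whatever data you downloaded.
    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("scraped_from_the_web.csv")    # step 1: download some data
    texts, labels = data["text"], data["label"]       # step 2: call part of it the input, part the label

    tr_text, te_text, tr_y, te_y = train_test_split(texts, labels, test_size=0.2, random_state=0)

    vectorizer = CountVectorizer()                     # step 3: bag of words...
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(tr_text), tr_y)
    preds = clf.predict(vectorizer.transform(te_text))
    print("accuracy:", accuracy_score(te_y, preds))    # ...and report whatever number pops out

The point isn't that this pipeline is wrong; it's that running it on some arbitrary scrape doesn't, by itself, teach us much.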

There's a related paper that looks at this issue in one specific setting: predicting political outcomes.  On arXiv back at the end of April, we got this wonderful, and wonderfully titled, paper:
"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data by Daniel Gayo-Avello
The thing I especially like about this paper is that it's not a complaint (like this blog post!) but rather a thoughtful critique of how one could do this sort of research right.  This includes actually looking at what has been done before (political scientists have been studying this issue for a long time, and perhaps we should see what they have to say), asking what we can do to make our results more reproducible, and so on.

For me, personally, this goes back to my main reviewing criteria: "what did I learn from this paper?"  The problem is that in the extreme, cartoon version of a data porn paper (my 1-4 list above), the answer is that I learned that machine learning works pretty well, even when asked to do arbitrary tasks.  Well, actually I already knew that.  So I didn't really learn anything.

Now, of course, many data porn-esque papers aren't actually that bad.  There are many things one can do (and people often do do) that make these results interesting:
  • Picking a real problem -- i.e., one that someone else might actually care about.  There's a temptation (that I suffer from, too) to say "well, people are interested in X, and X' is kind of like X, so let's try to predict X' and call it a day."  For example, in the context of looking at scientific articles, it's popular in many communities to predict future citation counts, because we think they might be indicative of something else.  I've certainly been guilty of this.  But where this work can get interesting is if you're able to say "yes, I can collect data for X' and train a model there, but I'll actually evaluate it in terms of X, which is the thing that is actually interesting."
  • Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before.  That's great!  Teach me what they've learned!  How, qualitatively, do your results compare to their hypotheses?  If they agree, then great.  If they disagree, then explain to me why this would happen: is there something your model can see that they cannot?  What's going on?
  • On the other hand, once you pick a real problem, there's a huge advantage: other people have looked at this and can help you design your model!  Whether you're doing something straightforward like linear classification/regression (with feature engineering) or something more in vogue, like building some complex Bayesian model, you need information sources (preferably beyond bag of words!), and all this past work can give you insights here.  Teach me how to think about the relationship between the input and the output, not just the fact that one exists.
In some sense, these things are obvious.  And of course I'm not saying that it's not okay to define new problems: that's part of what makes the world fun.  But I think it's prudent to be careful.

One attitude is "eh, such papers will die a natural death after people realize what's going on, they won't garner citations, no harm done."  I don't think this is altogether wrong.  Yes, maybe they push out better papers, but there's always going to be that effect, and it's very hard to evaluate "better."

The thing I'm more worried about is the impression that such work gives of our community to others.  For instance, I'm sure we've all seen papers published in other venues that do NLP-ish things poorly (Joshua Goodman has his famous example in physics, but there are tons more).  The thing I worry about is that we're doing ourselves a disservice as a community by trying to claim that we're doing something interesting in other people's spaces without trying to understand and acknowledge what they're doing.

NLP obviously has a lot of potential impact on the world, especially in the social sciences and humanities, but really anywhere that we want to deal with text.  I'd like to see us set ourselves up to succeed there, by working on real problems and making actual scientific contributions, in terms of new knowledge gathered and related to what was previously known.

15 September 2012

Somehow I totally missed NIPS workshops!

I don't know how it happened or when it happened, but at some point the NIPS workshops were posted, papers are due about a week from now, and I completely missed it!  The list of workshops is here:

    http://nips.cc/Conferences/2012/Program/schedule.php?Session=Workshops

Since my job as a blogger is to express my opinion about things you don't want to hear my opinion about, I wish they'd select fewer workshops.  I've always felt that NIPS workshops are significantly better than *ACL workshops because they tend to be workshops and not mini-conferences (where "mini" is a euphemism for non-selective :P).  At NIPS workshops, people go and really talk about problems, and it's really the best people and the best work in the area.  And while, yes, it's nice to be supportive of lots of areas, what ends up happening is that people jump between workshops because there are too many that interest them, and then you lose this community feeling.  This is especially troubling when workshops are already competing with skiing :).

Anyway, with that behind me, there are a number of workshops that NLP folks might find interesting.
With the deadlines so close I don't imagine anyone's going to be submitting stuff that they just started, but if you have things that already exist, NIPS is fun and it would be great to see more NLPers there!