26 September 2012

Sure, you can do that....

I'll warn in advance that this is probably one of the more controversial posts I've written, but realize that my goal is really to raise questions, not necessarily give answers.  It's just more fun to write strong rhetoric :).

Let me write down a simple Markov Chain:

  1. Download some data from the web
  2. Call part of that data the input and part of it the label
  3. Train a classifier on bag of words and get 84% accuracy
  4. Submit a paper to *ACL
  5. Go to 1
Such papers exist in the vision community, too, where you replace "bag of words" with "SIFT features" and "*ACL" with "CVPR/ICCV."  In that community (according to my one informant :P), such papers are called "data porn."  Turns out this is actually a term from journalism, in which one definition is "where journalists look for big, attention grabbing numbers or produce visualisations of data that add no value to the story."

There's a related paper that looks at this issue in one specific setting: predicting political outcomes.  On Arxiv back at the end of April, we got this wonderful, and wonderfully titled paper:
"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper" -- A Balanced Survey on Election Prediction using Twitter Data by Daniel Gayo-Avello
The thing I especially like about this paper is that it's not a complaint (like this blog post!) but rather a thoughtful critique of how one could do this sort of research right.  This includes actually looking at what has been done before (political scientists have been studying this issue for a long time, and perhaps we should see what they have to say), asking what we can do to make our results more reproducible, and so on.

For me, personally, this goes back to my main reviewing criterion: "what did I learn from this paper?"  The problem is that in the extreme, cartoon version of a data porn paper (my 1-4 list above), the answer is that I learned that machine learning works pretty well, even when asked to do arbitrary tasks.  Well, actually I already knew that.  So I didn't really learn anything.

Now, of course, many data porn-esque papers aren't actually that bad.  There are many things one can do (and people often do do) that make these results interesting:
  • Picking a real problem -- i.e., one that someone else might actually care about.  There's a temptation (that I suffer from, too) of saying "well, people are interested in X, and X' is kind of like X, so let's try to predict X' and call it a day."  For example, in the context of looking at scientific articles, it's a joy in many communities to predict future citation counts because we think that might be indicative of something else.  I've certainly been guilty of this.  But where this work can get interesting is if you're able to say "yes, I can collect data for X' and train a model there, but I'll actually evaluate it in terms of X, which is the thing that is actually interesting."
     
  • Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before.  That's great!  Teach me what they've learned!  How, qualitatively, do your results compare to their hypotheses?  If they agree, then great.  If they disagree, then explain to me why this would happen: is there something your model can see that they cannot?  What's going on?
  • On the other hand, once you pick a real problem, there's a huge advantage: other people have looked at this and can help you design your model!  Whether you're doing something straightforward like linear classification/regression (with feature engineering) or something more in vogue, like building some complex Bayesian model, you need information sources (preferably beyond bag of words!) and all this past work can give you insights here.  Teach me how to think about the relationship between the input and the output, not just the fact that one exists.
In some sense, these things are obvious.  And of course I'm not saying that it's not okay to define new problems: that's part of what makes the world fun.  But I think it's prudent to be careful.

One attitude is "eh, such papers will die a natural death after people realize what's going on, they won't garner citations, no harm done."  I don't think this is altogether wrong.  Yes, maybe they push out better papers, but there's always going to be that effect, and it's very hard to evaluate "better."

The thing I'm more worried about is the impression that such work gives others of our community.  For instance, I'm sure we've all seen papers published in other venues that do NLP-ish things poorly (Joshua Goodman has his famous example in physics, but there's tons more).  The thing I worry about is that we're doing ourselves a disservice as a community by claiming that we're doing something interesting in other people's spaces, without trying to understand and acknowledge what they're doing.

NLP obviously has a lot of potential impact on the world, especially in the social and humanities space, but really anywhere that we want to deal with text.  I'd like to see ourselves set up to succeed there, by working on real problems and making actual scientific contributions, in terms of new knowledge gathered and related to what was previously known.

19 comments:

Fred said...

Great post. The temptation to do something easy/pick the low-hanging fruit is, I think, encouraged by the "must get something into ALL of the annual conferences" culture.

In re: people claiming to do interesting work in fields outside their own, I can only point to Shalizi & Tozier's "A Simple Model of the Evolution of Simple Models of Evolution" (http://arxiv.org/abs/adap-org/9910002) as the ultimate takedown.

Emily M. Bender said...

+1 on "Once you pick a real problem, there's an additional challenge: other people (perhaps social scientists, political scientists, humanities researchers, etc.) have probably looked at this in lots of different lights before."

I'd add: talk to those people! Collaboration is good and leads to more interesting research. (And Linguistics is one of those social science/humanities fields that NLPers should often keep in mind.)

Brendan O'Connor said...

I love the Gayo-Avello review paper. It is the only paper that, when talking about our polling paper, mentions our negative result :)

Brendan O'Connor said...

By the way, I always thought the Goodman response was kind of overblown. Naive Bayes presupposes tokenization, which LZW -- a kind of character-level ngram model -- does not. (His complaints about physics venues being bad are fine as far as cranky territorial grandstanding goes, but they don't offer a reasonable substantive critique. But I haven't read the note for a while, opting instead to write ill-informed blog comments.)

bulbul said...

If this post really winds up being controversial, then there's something wrong in NLP, because you are 100% right.
I think the chief problem here is that many NLP (and digital humanities) researchers fail to paint a broader picture of their research, i.e. how it helps other people who work within the same area or with the same data. I see that time and time again and I think there are two reasons for that: first, doing actual background research ("Teach me what they've learned!") is really tough even for those familiar with the topic, let alone for somebody who is new to it. The obvious solution would be for scholars to team up, but that's easier said than done. Second, there's the attitude of some researchers, especially those coming from a pure math background, who are almost unwilling to spoil their elegant results with real-life data. They are not in the majority, there may not even be that many of them, but I've run into more than a few.

Also: 'data porn', hehe.

Dani said...

Hi Hal,

First, let me say I'm glad you have found the paper interesting. It was a product of frustration with authors cherry-picking references.

I'm not sure you are really interested in it but later I produced another preprint trying to go beyond "good advice" to something more solid (http://arxiv.org/abs/1206.5851).

Of course I have to say hi to Brendan. Really glad you like the paper even when I pay attention to the only thing you hadn't found a correlation with! :P

With regards to the comment saying that this post should not be controversial since it's right I agree *but* my impression is that something is wrong (not only in NLP which is not my field) but in almost any venue since we are over-emphasizing (close to hype level) promising and positive results while not trying our best on difficult problems even when failing is a real chance.

That said, I don't have a clue about how (as a community) we could convince researchers to publish more negative results (just to avoid revisiting failures).

The main problem with encouraging negative results as a perfectly valid kind of paper is that they are trivial to obtain :(

What I'd be really interested in is research that tried everything and still does not work.

Anyway, as always a nice post Hal, flattered to be mentioned in it.

Best, Dani

Jacob Eisenstein said...

I agree with almost everything everybody has said, so maybe this post isn't so controversial after all :)

But I've spent a lot of time trying to build collaborations with sociolinguists, and I want to add a few words about why this is hard.

Modern NLP has an absolute mania about making predictions on held out data. I think most NLPers agree that shared prediction tasks are a big reason for all the progress that the field has made. But we have to understand that this is not how many other scientists measure research, especially social scientists.

So there is a deep mismatch about what research should be about. A dialectologist might be able to help me design features that reduce the error of my geographical prediction software by 50%, but that isn't going to be interesting to her unless it fits into some larger theoretical model about how language & geography work. Moreover, such theoretical models can almost never be compared directly on predictive grounds, so the success of her features in my predictive task doesn't really count as ammo in the fights she wants to win. It's rarely possible to tell whether the "better" model was really the better explanation, or was merely easier to convert into an algorithmic implementation. For NLPers, whether a model can be converted into code is an important feature; for social scientists it may not be.

When I started working on language and geography, I tried to do the right thing and read relevant papers from that literature. It was hard for me to understand what they were about; the methods were clear, but the high-level goals weren't. I think that NLP-plus-social-science is only going to be meaningful -- as either social science or NLP -- when we develop the theoretical chops to understand what social scientists are actually arguing about. Without this understanding, even collaborations will not be fruitful, because we will have relegated ourselves to overqualified tech support.

p.s. Shameless plug: I just posted my first sociolinguistics + NLP collaborative paper to my website. We plan to submit to a linguistics journal; comments welcome.

Brendan O'Connor said...

Jacob, nice comment. On

"Modern NLP has an absolute mania about making predictions on held out data. I think most NLPers agree that shared prediction tasks are a big reason for all the progress that the field has made. But we have to understand that this is not how many other scientists measure research, especially social scientists."

This is why I'm so negative about *ACL as a community to publish social-science-relevant NLP work. Some people, like Hal in this blog post, sometimes say they want to see more engagement with social science theories, but I find it hard to see the majority of NLP researchers (and reviewers) caring. Maybe the problem is a lack of work that does have lots of meaningful social-science engagement, and if there were more then everyone would like it. But it seems risky to me. Hopefully I'm wrong.

Brendan O'Connor said...

BTW, example that contradicts my cynicism: Yano, Smith and Wilkerson at NAACL has very nice engagement with hypotheses from the political science literature (helps to have a polisci coauthor!)

Noah said...

It's possible that you'd consider some of my papers "data porn." But one of the valuable things I learned from this kind of work is that language/text is a lot more interesting when you put it into a social context. I've done a fair amount of this work in direct collaboration with living, breathing social scientists who were extremely surprised by what you and I might see as banal applications of ML to text data. The final predictive numbers aren't necessarily the point. In some cases, new data open up new questions that haven't been considered before. Or new measurements (the study of new data requires new methods -- NLP is a growing set of methods). Or a data-driven way of approaching a question that hadn't been treated that way before. Social scientists are going to figure out how to use text as data, whether we join in or not -- NLPers might as well join the fun. In the cross-disciplinary meetings I've been going to for the past few years (there's one next week at Harvard), there continues to be a shared excitement and curiosity from both sides: "what can we do with text data?" Porn is just documentation of very exciting times.

Also, I think Jacob's point about cross-field differences about what research should be about needs to be taken very seriously. This matters for practical reasons, too (consider that your social scientist friend wants to get tenure in his/her department, just like you do). And fields move at very different speeds (this 2009 paper on volatility has a related journal paper that's more about testing a hypothesis ... it's just a much longer process to publish it).

At the risk of a tasteless analogy ... dabbling in data porn may be great preparation for sexy new applications of NLP that are on the horizon.

Yoav said...

Maybe we NLP researchers should leave the social-science work to social-scientists, and focus more on creating good tools which they actually find useful (the "meta questions")? One thing that comes to mind which is sorely needed are methods for meaningful analysis and interpretation of the predictive features. Another is good confidence-calibrated and/or abstaining classifiers for high-dimensional and sparse spaces.

Also, given that linear models with bag-of-words features perform extremely well regardless of the dataset (which is really cool, but also old-news), maybe we should come up with better ways of measuring significance for text classification tasks.

Fernando Pereira said...

Two comments: 1) "absolute mania about making predictions on held out data" is a problem whenever a field's practitioners are oversold on magic ML black boxes. In my experience, what biomedical researchers or Web search engineers want from ML is an understanding of which meaningful features are relevant to an outcome rather than black-box predictions that they can't explain or easily combine with other sources of evidence; 2) the best antidote to data porn is having a problem to solve that real users care about (we are still hiring ;))

hal said...

Wow, lots of awesome comments -- I can barely keep up!

I essentially agree with all that's been said. One thing that's come up in several comments is basically in reference to the fact that many people in our community (and others) seem to think that "making predictions is doing science." There absolutely are cultural differences across fields that can be very hard to bridge, but frankly I think those "other fields" are right to not put much stock in our "predictions on held-out data" view of the world (@Jacob, @Brendan, @Noah, @Yoav all mention this).

As a concrete example, I've been spending the past 1.5 years working with a social scientist and published nothing as a result. I have to say that honestly, in these 1.5 years, it's been really hard for me not to throw in the towel and write a couple of data porn papers just to psych myself (and my students) into thinking that we're making progress: I'm pretty certain I could get them accepted. This alone is a problem, when the social structure of the community is pushing one to do bad work.

In my case, it's definitely been hard to find that balance between the theories that exist on "their side" and the technology that exists on "our side." But now, 1.5 years later, I think I kind of understand. For instance, there are very specific hypotheses ("when ABC happens, we expect language to change in XYZ ways") where the challenge is to essentially figure out how to detect ABC and XYZ. (And perhaps extend XYZ to XYZ++ because maybe we can discover something new, too!) But at its heart, this is really just a hypothesis testing problem. It has nothing to do with making predictions. If we want to do science, let's do it in the way that science is done.

Now, like everything, it's not enough to design a new test tube and say "p value < 0.01" -- we actually have to look closely at that test tube and why it's doing what it's doing, to (try to) ensure that it's actually picking up on XYZ and not some accidental correlate of ABC. But that's what data analysis is all about. That said, I'm sympathetic to the problem @Jacob mentions: when testing between two models, there's always the question of whether you simply implemented one better than the other.
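As a rough illustration of the hypothesis-testing framing (a minimal sketch, not anything from the actual collaboration): suppose we had some per-document measurement of XYZ, and a split of documents into those written when ABC held and those written when it did not. A permutation test asks how often a random relabeling of the documents produces a gap in means at least as large as the one observed. The function name, the numbers, and the choice of "difference in means" as the test statistic are all assumptions made up for this example.

```python
import random


def permutation_test(group_a, group_b, num_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value: the fraction of random relabelings whose
    absolute gap in means is at least as large as the observed gap.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a

    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        gap = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so float rounding does not hide exact ties.
        if gap >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


# Hypothetical XYZ measurements (e.g., rate of some linguistic marker per document).
xyz_when_abc = [0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.34, 0.32]
xyz_otherwise = [0.21, 0.19, 0.24, 0.22, 0.20, 0.23, 0.18, 0.25]

p_value = permutation_test(xyz_when_abc, xyz_otherwise)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value here only says the gap is unlikely under random relabeling; it says nothing about whether the measurement is picking up XYZ rather than some accidental correlate of ABC, which is exactly the test-tube caveat above.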

So I guess I'm arguing that instead of trying to foist our practices on others, perhaps we should take seriously how they evaluate success in their own field. I truly believe that the *ACL community will be receptive (I'm not yet totally cynical). Plus, you can always do the thing they do in medicine, where you write the CS paper and Med paper separately. But if you're going to do this, the CS paper better actually have CS contributions!

And as for the comment made by several people that it's easier to do this if you have a living breathing collaborator on the "other side": Yes. Duh. :)

@Noah: I think we're on the same page that people are going to start using our technology and we want to embrace it. I just want to embrace it in a way that we get taken seriously outside our community.

@Fernando: +1. Except for the hiring thing: come to UMD instead!

Jacob Eisenstein said...

@Hal: glad to know that I'm not the only one who has had trouble making these collaborations work. Interdisciplinarity is easy to say (hmm... sort of) but harder to do.

@Yoav: The reason not to just leave social science to the social scientists is that language is fundamentally social, and some of us have the intuition that the next generation of NLP systems will need to engage deeply with the social dimension. So it's less a question of building tools for "their" research, but rather, figuring out what social science means for NLP.

In my opinion, it may not be possible to do NLP on social media data without taking a theoretical stance on the underlying social science issues. This guides the decision of what data to gather, what to model, and what applications to try to provide. (Just think about the sociolinguistic theory embodied by the predictive text-entry software on your phone!) So the question is whether we are going to think about social science issues in an informed way, or whether we are going to use "common sense," which in social science is often inaccurate and regressive.

Anyway, this is miles away from the kind of paper that Hal was originally kvetching about. I think there are two different questions: what NLP researchers should do, and where they should publish it. Noah and Fernando are kind of saying the same thing: cool NLP research comes from problems driven by real users. By engaging with social scientists, Noah gets a new set of users that might motivate some cool new models. If the collaboration applies existing NLP technology to get new answers in the social science domain, that's great -- just like an application of NLP to a new Google product is great -- but as Hal & others have said, maybe *ACL isn't the right place.

It seems the machine learning community must have wrestled with these same issues, regarding applications of ML to fields like NLP... And in 2012, I still can't figure out whether I should submit an application paper to ICML!

hal said...

quick comment: although I cited the polisci paper, I hope it didn't come across like I was specifically targeting work in polisci or socsci!

Raphael Cohen said...

In my experience it is very hard to get a paper about someone else's problem (in my case a medical / biology problem) into ACL. Most of the biomedical NLP stuff is only published in the Bio workshop instead of the main conference.
There is also the other Markov model:
1. Download some data from the web
2. Call part of that data the input and part of it the label
3. Produce a highly complex and possibly impractical/slow learning method and get 85% accuracy (over the SVM's 84%)
4. Submit a paper to ACL
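
Whether a one-point gap like 85% versus 84% means anything depends on how large the test set is and on how often the two systems disagree. One standard check (a sketch under assumed numbers, not something from the comment above) is an exact McNemar test, which looks only at the test items that exactly one of the two classifiers gets right. The counts below are hypothetical, chosen to match a 1000-item test set with 850 versus 840 correct; `math.comb` needs Python 3.8 or later.

```python
from math import comb


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar test for two classifiers on the same test set.

    only_a_correct: items classifier A gets right and B gets wrong.
    only_b_correct: items classifier B gets right and A gets wrong.
    Under the null hypothesis, each disagreement is a fair coin flip.
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the classifiers never disagree, so there is no evidence
    k = min(only_a_correct, only_b_correct)
    # Probability of k or fewer "heads" in n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical breakdown of a 1000-item test set:
# 820 items both get right, 30 only the new method, 20 only the SVM.
p_value = mcnemar_exact(only_a_correct=30, only_b_correct=20)
print(f"p-value for 85% vs 84%: {p_value:.3f}")
```

With disagreement counts like these, the one-point improvement would not clear the usual 0.05 bar; a much larger test set, or far fewer items that only the baseline gets right, would be needed for the same gap to count.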

lingpipe said...

I'd argue that a large part of science IS about making predictions. One of the main reasons Newton's laws are so breathtaking is their predictive power. No matter how beautiful they are, if they didn't predict everything from tides to cannonballs, why would anyone have cared?

Everyone loves CRFs, right? How did they become so popular? Making good predictions. I'm not really reading the NLP literature any more, but last time I checked, nobody cared that they produce better calibrated probabilities. Same story for logistic regression versus naive Bayes.

I'd just like to see people start taking predictive probability into account -- one way to do this is to move from 0/1 loss to log loss. Of course, that's a no-go for straight-up SVMs, because they don't make probabilistic predictions (yes, I know one can estimate a function mapping margin distances to probabilities).
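
To make the difference between the two losses concrete, here is a small sketch with made-up numbers (the labels, probabilities, and function names are all assumptions for illustration). Both hypothetical classifiers make exactly one mistake, so 0/1 loss cannot tell them apart, but log loss penalizes the one that was confidently wrong.

```python
import math


def zero_one_loss(labels, probs, threshold=0.5):
    """Fraction of items misclassified after thresholding P(label = 1)."""
    errors = sum((p >= threshold) != (y == 1) for y, p in zip(labels, probs))
    return errors / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(label = 1)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) cannot occur
        total -= math.log(p) if y == 1 else math.log(1 - p)
    return total / len(labels)


labels = [1, 1, 1, 0, 0]
# Both classifiers get the third item wrong, with very different confidence.
hedged = [0.70, 0.80, 0.40, 0.30, 0.20]
overconfident = [0.99, 0.99, 0.01, 0.01, 0.01]

for name, probs in [("hedged", hedged), ("overconfident", overconfident)]:
    print(name, zero_one_loss(labels, probs), round(log_loss(labels, probs), 3))
```

The overconfident classifier comes out worse under log loss even though it is more confident on the four items it gets right, because its single error is made at probability 0.01.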

I'm on the fence about black-box classifiers, etc. I've usually found it useful to look at what the features are doing in my systems, but I'm not as obsessed about it as traditional statisticians, and now that I've drunk a Big Gulp portion of the Bayesian KoolAid, I wouldn't be caught dead calculating p-values for regression coefficients.

One of the things Andrew Gelman's been struggling with in writing the 3rd Edition of Bayesian Data Analysis is how to move model evaluation from things like Bayes factors and DIC to something more related to predictive performance like WAIC. The Bayesians have always paid attention to predictions in a decision-theoretic context (see, for instance, Berger's classic book or the decision theory chapter in BDA).

P.S. Speaking of NLP porn, if I see another word cloud generated by some variant of LDA, I'm going to scream!

Ana.PG said...

It is only fitting that I find this post while in the middle of a "crawling and annotating" session ... As an applied scientist guilty of a couple of "data porn"-type papers, I do agree they don't belong in conferences meant to showcase research or new systems (and I think reviewing helps there). However, I think I am not the only one for whom the novelty of such papers wore thin pretty fast - while I am trying to work on new data, I think the "data analysis" part should not be _the_ paper. There should be some actual task a user or a customer cares deeply about or a new algorithm - otherwise, we're basically doing (bad) journalism, which may be fun and interesting, but it's not a paper...

Ted said...

This is an interesting discussion, and I regret joining it so pitifully late. But, let me plunge on since I've been stewing about reviewing lately (again) and this all somehow seems to fit into that...

It does seem fair to say that if you can annotate up some apparently novel data (automatically or manually), run it through some moderately interesting learning algorithm, and evaluate the results in some not unreasonable way, all without any horrible flaws, then your chances of acceptance seem to be fairly good.

This happens, I think, because too much credit is given for novelty of data sources or problems, and not nearly enough weight or thought is applied to potential significance. I have puzzled over why this might be the case, and have a theory at least for the moment...

We are great admirers of technique, which given our field is perhaps natural. So we like it if a barn, a town square, a church, and a mountain are all painted in the same somewhat methodical and careful way, and we view each of these works as "novel" to some extent because, well, it's a painting of a different thing, even when it's all a bit factory-like.

And this ties into my real question and concern, and that is why so many *ACL papers make barely a whisper in terms of impact. And so I think my painting analogy is a possible explanation - there is just too much emphasis on and reward given to careful application of technique without regard to what problem might be solved or what this paper might actually end up enabling. We appreciate the work, but we don't really feel it - it doesn't change us or how we do anything, it just looks kind of nice and then we go on with whatever we were already doing.

I might even argue that data porn might be the wrong term, since while many people might wish to view porn, they might not actually propose to create it. Or if they do they won't share it. One hopes. :) I would describe porn as a high-read low-write business, whereas the papers we are discussing are most likely low-read high-write (maybe even write only). So we have a lot of people (potentially) writing papers that relatively few of us will really read, and yet somehow some of these papers find homes in most of our big events, which inspires more of the same, which leads to a fair number of *ACL papers that just don't have much impact.

We may need to re-think how we define both novelty and significance, I think. I might suggest we dispense with both "scores" and ask questions of reviewers like "will this paper change how *you* do your work or what you work on?"