12 June 2009

NAACL-HLT 2009 Retrospective

I hope this post will be a small impetus to get other people to post comments about papers they saw at NAACL (and associated workshops) that they really liked.

As usual, I stayed for the whole conference, plus workshops. As usual, I also hit that day -- about halfway through the first workshop day -- where I was totally burnt out and was wondering why I always stick around for the entire week. That's not to say anything bad about the workshops specifically (there definitely were good ones going on, in fact, see some comments below), but I was just wiped.

Anyway, I saw a bunch of papers and missed even more. I don't think I saw any papers that I actively didn't like (or wondered how they got in), including short papers, which I think is fantastic. Many thanks to all the organizers (Mari Ostendorf for organizing everything, Mike Collins, Lucy Vanderwende, Doug Oard and Shri Narayanan for putting together a great program, James Martin and Martha Palmer for local arrangements -- which were fantastic -- and all the other organizers whom we -- i.e., the NAACL board -- sadly didn't get a chance to thank publicly).

Here are some things I thought were interesting:

  1. Classifier Combination Techniques Applied to Coreference Resolution (Vemulapalli, Luo, Pitrelli and Zitouni). This was a student research workshop paper; in fact, it was the one that I was moderating (together with Claire Cardie). The student author, Smita, performed this work while at IBM, though her main research is on similar techniques applied to really cool-sounding problems in recognizing stuff that happens in the classroom. Somehow classifier combination (and system combination in general) came up a lot at this conference (mostly in the hallways, where someone was begrudgingly admitting to working on something as dirty as system combination). I used to think system combination was yucky, but I don't really feel that way anymore. Yes, it would be nice to have one huge monolithic system that does everything, but that's often infeasible. My main complaint with system combination stuff is that in many cases I don't really understand why it's helping, which means that unless it's applied to a problem I really care about (of which there are few), it's hard for me to take anything away. But I think it's interesting. Getting back to Smita's paper, the key thing she did to make this work is introduce the notion of alignments between different clusterings, which seemed like a good idea. The results probably weren't as good as they were hoping for, but still interesting. My only major pointers as a panelist were to try using different systems, rather than bootstrapped versions of the same system, and to take a look at the literature on consensus clustering, which is fairly relevant for this problem.

  2. Graph-based Learning for Statistical Machine Translation (Alexandrescu and Kirchhoff). I'd heard of some of this work before in small group meetings with Andrei and Kathrin, but this is the first time I'd seen the results they presented. This is an MT paper, but really it's about how to do graph-based semi-supervised learning in a structured prediction context, when you have some wacky metric (read: BLEU) on which you're evaluating. Computation is a problem, but we should just hire some silly algorithms people to figure this out for us. (Besides, there was a paper last year at ICML -- I'm too lazy to dig it up -- that showed how to do graph-based stuff on billions of examples.)

  3. Intersecting Multilingual Data for Faster and Better Statistical Translations (Chen, Kay and Eisele). This is a very simple idea that works shockingly well. Had I written this paper, "Frustrating" would probably have made it into the title. Let's say we want an English to French phrase table. Well, we do phrase table extraction and we get something giant and ridiculous (have you ever looked at those phrase pairs?) that takes tons of disk space and memory, and makes translation slow (it's like the "grammar constant" in parsing that means that O(n^3) for n=40 is impractical). Well, just make two more phrase tables, English to German and German to French, and intersect. And voilà, you have tiny phrase tables and even slightly better performance. (A toy sketch of the intersection idea appears after this list.) The only big caveat seems to be that they estimate all these things on Europarl. What if your data sets are disjoint? I'd be worried that you'd end up with nothing in the resulting phrase table except the/le and sometimes/quelquefois (okay, I just used that example because I love that word).

  4. Quadratic Features and Deep Architectures for Chunking (Turian, Bergstra and Bengio). I definitely have not drunk the deep architectures kool-aid, but I still think this sort of stuff is interesting. The basic idea here stems from some work Bergstra did for modeling vision, where they replaced a linear classifier (y = w'x) with a low-rank approximation to a quadratic classifier (y = w'x + sqrt[(a'x)^2 + (b'x)^2 + ... ]). Here, the a, b, ... vectors are all estimated as part of the learning process (e.g., by stochastic gradient descent). If you use a dozen of them, you get some quadratic-style features, but without the expense of doing, say, an implicit (or worse, explicit) quadratic kernel. My worry (that I asked about during the talk) is that you obviously can't initialize these things to zero or else you're in a local minimum, so you have to do some randomization, and maybe that makes training these things a nightmare. Joseph reassured me that they have initialization methods that make my worries go away (see his comment below, and the sketch after this list). If I have enough time, maybe I'll give it a whirl.

  5. Exploring Content Models for Multi-Document Summarization (Haghighi and Vanderwende). This combines my two favorite things: summarization and topic models. My admittedly biased view was they started with something similar to BayeSum and then ran a marathon. There are a bunch of really cool ideas in here for content-based summarization.

  6. Global Models of Document Structure using Latent Permutations (Chen, Branavan, Barzilay and Karger). This is a really cool idea (previously mentioned in a comment on this blog) based on using generalized Mallows models for permutation modeling (incidentally, see a just-appeared JMLR paper for some more stuff related to permutations!). The idea is that documents on a similar topic (e.g., "cities") tend to structure their information in similar ways, which is modeled as a permutation over "things that could be discussed." It's really cool looking, and I wonder if something like this could be used in conjunction with the paper I talk about below on summarization for scientific papers (9, below). One concern raised during the questions that I also had was how well this would work for things not as standardized as cities, where maybe you want to express preferences of pairwise ordering, not overall permutations. (Actually, you can do this, at least theoretically: a recent math visitor here, Mark Huber, has some papers on exact sampling from permutations under such partial order constraints using coupling from the past.) The other thing I was thinking during that talk that I thought would be totally awesome would be to do a hierarchical Mallows model. Someone else asked this question, and Harr said they're thinking about this. Oh, well... I guess I'm not the only one :(.

  7. Dan Jurafsky's invited talk was awesome. It appealed to me in three ways: as someone who loves language, as a foodie, and as an NLPer. You just had to be there. I can't do it justice in a post.

  8. More than Words: Syntactic Packaging and Implicit Sentiment (Greene and Resnik). This might have been one of my favorite papers of the conference. The idea is that how you say things can express your point of view as much as what you say. They look specifically at effects like passivization in English, where you might say something like "The truck drove into the crowd" rather than "The soldier drove the truck into the crowd." The missing piece here seems to be identifying the "whodunnit" in the first sentence. This is like figuring out subjects in languages that like to drop subjects (like Japanese). Could probably be done; maybe it has been (I know it's been worked on in Japanese; I don't know about English).

  9. Using Citations to Generate Surveys of Scientific Paradigms (Mohammad, Dorr, Egan, Hassan, Muthukrishnan, Qazvinian, Radev and Zajic). I really really want these guys to succeed. They basically study how humans and machines create summaries of scientific papers when given either the text of the paper, or citation snippets to the paper. The idea is to automatically generate survey papers. This is actually an area I've toyed with getting into for a while. The summarization aspect appeals to me, and I actually know and understand the customer very well. The key issue I would like to see addressed is how these summaries vary across different users. I've basically come to the conclusion that in summarization, if you don't pay attention to the user, you're sunk. This is especially true here. If I ask for a summary of generalization bound stuff, it's going to look very different than if Peter Bartlett asks for it.

  10. Online EM for Unsupervised Models (Liang and Klein). If you want to do online EM, read this paper. On the other hand, you're going to have to worry about things like learning rate and batch size (think Pegasos); I put a tiny sketch of the stepwise update right after this list. I was thinking about stuff like this a year or two ago and was wondering how this would compare to doing SGD on the log likelihood directly and not doing EM at all. Percy says that asymptotically they're the same, but who knows what they're like in the real world :). I think it's interesting, but I'm probably not going to stop doing vanilla EM.
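To make the intersection idea in #3 concrete, here's a toy Python sketch of how I picture it: represent each phrase table as a dictionary, build a pivoted English-French table by composing English-German with German-French, and keep only the direct phrase pairs that the pivot route also supports. (This is just my reading of the idea; the paper's actual extraction and scoring are surely more involved.)

```python
# Toy sketch of pivot-based phrase table intersection; not the paper's exact method.
# A phrase table is just: source phrase -> set of candidate target phrases.

def pivot_table(en_de, de_fr):
    """Compose EN->DE and DE->FR tables through the German pivot."""
    en_fr = {}
    for en, de_phrases in en_de.items():
        targets = set()
        for de in de_phrases:
            targets |= de_fr.get(de, set())
        if targets:
            en_fr[en] = targets
    return en_fr

def intersect_tables(direct, pivoted):
    """Keep only direct EN->FR pairs that the pivot route also supports."""
    out = {}
    for en, fr_phrases in direct.items():
        kept = fr_phrases & pivoted.get(en, set())
        if kept:
            out[en] = kept
    return out

# Hypothetical entries, just to show the pruning effect.
en_de = {"the": {"der", "die"}, "sometimes": {"manchmal"}}
de_fr = {"der": {"le"}, "die": {"la"}, "manchmal": {"quelquefois", "parfois"}}
direct_en_fr = {"the": {"le", "la", "de"}, "sometimes": {"quelquefois"}}

print(intersect_tables(direct_en_fr, pivot_table(en_de, de_fr)))
# -> {'the': {'le', 'la'}, 'sometimes': {'quelquefois'}}
```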
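And for #4, here's a minimal numpy sketch of the low-rank quadratic scoring function, with a toy hinge-loss SGD step bolted on. The initialization constants follow Joseph's comment below; the loss, learning rate and everything else are placeholders, not the authors' actual setup.

```python
import numpy as np

class LowRankQuadraticClassifier:
    """Score x as w'x + b + sqrt(sum_i (a_i'x)^2): a cheap stand-in for quadratic features."""

    def __init__(self, dim, n_filters=12, qscale=0.01, seed=0):
        rng = np.random.RandomState(seed)
        v = 1.0 / np.sqrt(dim)                           # fan-in based scale (see Joseph's comment)
        self.w = rng.uniform(-v, v, size=dim)            # linear weights
        self.A = rng.uniform(-v * qscale, v * qscale,    # the quadratic filters a_1, ..., a_k
                             size=(n_filters, dim))
        self.b = 0.0                                     # bias starts at zero

    def score(self, x):
        quad = np.sqrt(np.sum(np.dot(self.A, x) ** 2) + 1e-12)
        return float(np.dot(self.w, x) + self.b + quad)

    def sgd_step(self, x, y, lr=0.1):
        """One hinge-loss SGD step for a label y in {-1, +1} (purely illustrative)."""
        if y * self.score(x) < 1.0:
            Ax = np.dot(self.A, x)
            quad = np.sqrt(np.sum(Ax ** 2) + 1e-12)
            self.w += lr * y * x
            self.b += lr * y
            self.A += lr * y * np.outer(Ax / quad, x)    # gradient of the sqrt term w.r.t. A
```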
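Finally, for #10: my understanding of the stepwise variant is that you do an E-step on one example (or a mini-batch), interpolate the running sufficient statistics toward that example's statistics with a decaying stepsize, and re-do the M-step from the running statistics. Here's a toy 1-D two-Gaussian version; every detail (schedule, initialization, model) is illustrative rather than the paper's setup.

```python
import numpy as np

def stepwise_em_1d(xs, alpha=0.7, iters=2):
    """Toy stepwise (online) EM for a 2-component 1-D Gaussian mixture.
    Running sufficient statistics are interpolated with stepsize eta_t = (t+2)**(-alpha)."""
    rng = np.random.RandomState(0)
    mu = np.array([xs.min(), xs.max()], dtype=float)     # crude initialization
    var = np.full(2, np.var(xs) + 1e-3)
    mix = np.full(2, 0.5)
    s_n = np.ones(2)                                     # running counts
    s_x = mu.copy()                                      # running sums
    s_xx = var + mu ** 2                                 # running sums of squares
    t = 0
    for _ in range(iters):
        for x in rng.permutation(xs):
            # E-step for this single example: posterior over the two components
            logp = np.log(mix) - 0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
            post = np.exp(logp - logp.max())
            post /= post.sum()
            # stepwise interpolation of the sufficient statistics
            eta = (t + 2.0) ** (-alpha)
            s_n = (1 - eta) * s_n + eta * post
            s_x = (1 - eta) * s_x + eta * post * x
            s_xx = (1 - eta) * s_xx + eta * post * x * x
            # M-step from the running statistics
            mix = s_n / s_n.sum()
            mu = s_x / s_n
            var = s_xx / s_n - mu ** 2 + 1e-6
            t += 1
    return mix, mu, var

# toy usage: two well-separated clusters
data = np.concatenate([np.random.normal(-3, 1, 200), np.random.normal(3, 1, 200)])
print(stepwise_em_1d(data))
```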

I then spent some time at workshops.

I spent the first morning in the Computational Approaches to Linguistic Creativity workshop, which was just a lot of fun. I really liked all of the morning talks: if you love language and want to see stuff somewhat off the beaten path, you should definitely read these. I went by the Semantic Evaluation Workshop for a while and learned that the most-frequent-sense baseline is hard to beat. Moreover, there might be something to this discourse thing after all: Marine tells us that translators don't like to use multiple translations when one will do (akin to the one-sense-per-discourse observation). The biggest question in my head here is how much the direction of translation matters (e.g., when this heuristic is violated, is it violated by the translator or by the original author?). Apparently this is under investigation. But it's cool because it says that even MT people shouldn't just look at one sentence at a time!

Andrew McCallum gave a great, million-mile-an-hour invited talk on joint inference at CoNLL. I'm pretty interested in this whole joint inference business, which also played a big role in Jason Eisner's invited talk (that I sadly missed) at the semi-supervised learning workshop. To me, the big question is: what happens if you don't actually care about some of the tasks? In a probabilistic model, I suppose you'd marginalize them out... but how should you train? In a sense, since you don't care about them, it doesn't make sense to have a real loss associated with them. But if you don't put a loss, what are you doing? Again, in probabilistic land you're saved because you're just modeling a distribution, but this doesn't answer the whole question.

Al Aho gave a fantastically entertaining talk in the machine translation workshop about unnatural language processing. How the heck they managed to get Al Aho to give an invited talk is beyond me, but I suspect we owe Dekai some thanks for this. He pointed to some interesting work that I wasn't familiar with, both in raw parsing (e.g., how to parse errorful strings with a CFG when you want to find the closest string, in edit distance, that the CFG can parse) and in natural language/programming language interfaces. (In retrospect, the first result is perhaps obvious had I actually thought much about it, though probably not so back in 1972: you can represent edit distance by a lattice and then parse the lattice, which we know is efficient.)

Anyway, there were other things that were interesting, but those are the ones that stuck in my head somehow (note, of course, that this list is unfairly biased toward my friends... what can I say? :P).

So, off to ICML on Sunday. I hope to see many of you there!

18 comments:

Joseph Turian said...

Dan Klein was also asking about how to initialize the weights with the quadratic filters setup. Here is the technique I used for chunking. (James Bergstra has a slightly different setup for his vision experiments.)

For a particular layer of training weights, let the fanin be the number of units in the input. For simple logistic regression (only one layer), the fanin is simply the number of input dimensions.

Let v = 1/sqrt(fanin)

We initialize the bias terms to be 0.
We initialize the weights uniformly in the range [-v, +v].
We initialize the quadratic filters uniformly in the range [-v*qscale, +v*qscale]. qscale = 0.01 worked fine for us.

For future reference, when doing SGD training it helps to scale the learning rate by v. This only makes a difference if there is more than one layer (e.g., a single-hidden-layer net).
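In rough numpy terms (the shapes here are just for illustration, not our exact code):

```python
import numpy as np

def init_layer(fanin, fanout, n_filters, qscale=0.01, rng=np.random):
    """Initialize one layer's parameters following the recipe above."""
    v = 1.0 / np.sqrt(fanin)
    bias = np.zeros(fanout)                               # biases start at 0
    weights = rng.uniform(-v, v, size=(fanout, fanin))    # weights uniform in [-v, +v]
    qfilters = rng.uniform(-v * qscale, v * qscale,       # quadratic filters, scaled by qscale
                           size=(fanout, n_filters, fanin))
    lr_scale = v                                          # scale the SGD learning rate by v
    return weights, bias, qfilters, lr_scale
```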

See you chez nous, at ICML!

Unknown said...

Re #8, you write "passivization in English, where you might say something like 'The truck drove into the crowd' rather than 'The soldier drove the truck into the crowd.'" Actually, the passive here would be "The truck was driven into the crowd." I'm not sure there's a name for the sentence "The truck drove into the crowd"; it's similar in some ways to what was called a Middle Voice in Greek. Pustejovsky (who BTW gave a poster at NAACL) referred to pairs like "John broke the vase" and "The vase broke" as causative/unaccusative pairs. Beth Levin wrote about the "drive" class verbs, but I don't have access to her book.

hal said...

Oops, yes... it's definitely more like a causative than a passive. Thanks for catching the error.

BTW, I also realized I totally missed the best papers, both of which were really great. I'll edit the post in a few minutes.

Chris Brew said...

Alan Ramsey gave a good paper at the end of the Linguistic Creativity workshop. It was called "Sorry seems to be the hardest word", and consisted of a series of examples of things that "I am sorry that X" could mean. This has the golden property that it draws attention to facts about language that are a challenge to both the stats-heavy and the logic-heavy subgroups of ACL. There seems to be a need for common-sense reasoning here. The temptation is to dismiss this as hopelessly retro and do something else instead. That would be wrong.

L. Venkata Subramaniam said...

I just came across this article. Though I didn't attend NAACL-HLT in 2009, I did read some of the papers you have written about here. Very interesting!

Also, this idea of having a retrospective blog post is a great one. I am hoping to attend NAACL-HLT in 2010 and hope to contribute more to your post then :)
