
07 July 2011

Introducing Braque, your paper discovery friend

(Shameless plug/advertisement follows.)

Want to be informed of new interesting papers that show up online?

Tired of trolling conference proceedings to find that one gem?

Want to make sure interested parties hear about your newest results?

Want to know when a new paper comes out that cites you?

Braque (http://braque.cc) can help.

Braque is a news service for research papers (currently focusing primarily on NLP and ML, though it needn't be that way).  You can create channels that provide email or RSS feeds for topics you care about. You can add your own publications page as a resource to Braque so it knows to crawl your papers and send them out to interested parties.

Braque is something I built ages ago with Percy Liang, but it's finally more or less set up after my move. Feel free to email me questions and comments or (preferably) use the online comment system.

As a bit of warning: Braque is neither a paper search engine nor a paper archive.  And please be a bit forgiving if you go there immediately after this post shows up and it's a bit slow.... we only have one server :).

ps., yes, Braque is sort of like WhatToSee on crack.

24 July 2010

ACL 2010 Retrospective

ACL 2010 finished up in Sweden a week ago or so. Overall, I enjoyed my time there (the local organization was great, though I think we got hit with unexpected heat, so those of us who didn't feel like booking a room at the Best Western -- hah! why would I have done that?! -- had no A/C and my room was about 28-30 every night).

But you don't come here to hear about sweltering nights, you come to hear about papers. My list is actually pretty short this time. I'm not quite sure why that happened. Perhaps NAACL sucked up a lot of the really good stuff, or I went to the wrong sessions, or something. (Though my experience was echoed by a number of people (n=5) I spoke to after the conference.) Anyway, here are the things I found interesting.

  • Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, by Matthew Gerber and Joyce Chai (this was the Best Long Paper award recipient). This was by far my favorite paper of the conference. For all you students out there (mine included!), pay attention to this one. It was great because they looked at a fairly novel problem, in a fairly novel way, put clear effort into doing something (they annotated a bunch of data by hand), developed features that were significantly more interesting than the usual off-the-shelf set, and got impressive results on what is clearly a very hard problem. Congratulations to Matthew and Joyce -- this was a great paper, and the award is highly deserved.

  • Challenge Paper: The Human Language Project: Building a Universal Corpus of the World’s Languages, by Steven Abney and Steven Bird. Basically this would be awesome if they can pull it off -- a giant structured database with stuff from tons of languages. Even just having tokenization in tons of languages would be useful for me.

  • Extracting Social Networks from Literary Fiction, by David Elson, Nicholas Dames and Kathleen McKeown. (This was the IBM best student paper.) Basically they construct networks of characters from British fiction and try to analyze some literary theories in terms of those networks, and find that there might be holes in the existing theories. My biggest question, as someone who's not a literary theorist, is why did those theories exist in the first place? The analysis was over 80 or so books, surely literary theorists have read and pondered all of them.

  • Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish, by Reyyan Yeniterzi and Kemal Oflazer. You probably know that I think translating morphology and translating out of English are both interesting topics, so it's perhaps no big surprise that I liked this paper. The other thing I liked about this paper is that they presented things that worked, as well as things that might well have worked but didn't.

  • Learning Common Grammar from Multilingual Corpus, by Tomoharu Iwata, Daichi Mochihashi and Hiroshi Sawada. I wouldn't go so far as to say that I thought this was a great paper, but I would say there is the beginning of something interesting here. They basically learn a coupled PCFG in Jenny Finkel hierarchical-Bayes style, over multiple languages. The obvious weakness is that languages don't all have the same structure. If only there were an area of linguistics that studies how they differ.... (Along similar lines, see
    Phylogenetic Grammar Induction, by Taylor Berg-Kirkpatrick and Dan Klein, which has a similar approach/goal.)

  • Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation, by Michael Bloodgood and Chris Callison-Burch. The "trend" referenced in the title is that active learning always asymptotes depressingly early. They have turkers translate bits of sentences in context (i.e., in a whole sentence, translate the highlighted phrase) and get a large bang-for-the-buck. Right now they're looking primarily at out-of-vocabulary stuff, but there's a lot more to do here.

A few papers that I didn't see, but other people told me good things about:

At any rate, I guess that's a reasonably long list. There were definitely good things, but with a fairly heavy tail. If you have anything you'd like to add, feel free to comment. (As an experiment, I've turned comment moderation on as a way to try to stop the spam... I'm not sure I'll do it indefinitely; I hadn't turned it on before because I always thought/hoped that Google would just start doing spam detection and/or putting hard captchas up or something to try to stop spam, but sadly they don't seem interested.)

28 June 2010

ICML 2010 Retrospective

Just got back from Israel for ICML, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. I'm very glad I spent some time afterwards out of Haifa, though.

Overall, I saw a lot of really good stuff. The usual caveats apply (I didn't see everything, it's a biased sample, blah blah blah). Here are some things that stood out:

Structured Output Learning with Indirect Supervision (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth). This was probably one of my favorite papers of the conference, even though I had learned some about the work when I visited UIUC a few months ago. Let's say you're trying to do word alignment, and you have a few labeled examples of alignments. But then you also have a bunch of parallel data. What can you do? You can turn the parallel data into a classification problem: are these two sentences translations of each other? You can pair random sentences to get negative examples. A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem! (Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.) Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad? Ming-Wei reassured him that it wasn't. In thinking about it after the fact, I wonder if it is actually really important that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment. I suspect this might be the case.

A Language-based Approach to Measuring Scholarly Impact (S. Gerrish, D. Blei). The idea here is that without using citation structure, you can model influence in large document collections. The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before. The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P). It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce each other. That is, you might regularly cite the original EM paper even if your paper has almost nothing to do with it. (The example from the talk was the Penn Treebank paper: one that has a bajillion citations, but hasn't lexically affected how people talk about research.)

Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Siddiqi, G. Gordon, A. Smola). This received one of the best paper awards. While I definitely liked this paper, actually what I liked more was that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this). That paper was A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem. You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM! In the case that the data was actually generated by an HMM, you get good performance and good guarantees. Unfortunately, if the data was not generated by an HMM, then the theory doesn't work and the practice does worse than EM. That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction. At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis. (Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.)
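
As a toy illustration of the "count pairs, then decompose" flavor (this is emphatically not the actual Hsu, Kakade and Zhang algorithm, which works with triples of observations and recovers observable operators), here is a hedged numpy sketch that builds the observation-bigram matrix from hypothetical integer-coded sequences and keeps its top singular subspace as a stand-in for the hidden states:

    # Toy sketch only: count observation bigrams, factor the co-occurrence
    # matrix with an SVD, and keep a rank-k subspace as a surrogate for the k
    # hidden states. The real spectral algorithm uses triples of observations
    # and is more careful about normalization.
    import numpy as np

    def bigram_matrix(sequences, vocab_size):
        """A[i, j] = empirical probability that observation j follows observation i."""
        A = np.zeros((vocab_size, vocab_size))
        for seq in sequences:
            for prev, curr in zip(seq[:-1], seq[1:]):
                A[prev, curr] += 1.0
        return A / max(A.sum(), 1.0)

    def spectral_state_subspace(sequences, vocab_size, num_states):
        P21 = bigram_matrix(sequences, vocab_size)
        U, S, Vt = np.linalg.svd(P21)
        return U[:, :num_states], S[:num_states]   # top-k left singular directions

    # hypothetical data: sequences of integer observation ids
    rng = np.random.default_rng(0)
    seqs = [rng.integers(0, 5, size=30).tolist() for _ in range(20)]
    U, S = spectral_state_subspace(seqs, vocab_size=5, num_states=2)
    print("top singular values:", np.round(S, 3))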

Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process (N. Bartlett, D. Pfau, F. Wood). This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process. (A restaurant is just the set of counts for a given context.) You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm). You can do this intelligently (via a heuristic) or just stupidly: pick them at random. Turns out it doesn't matter. The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added. The chance that something that occurs a lot keeps getting picked to be thrown out is low. There's some connection to using approximate counting for language modeling, but the Bartlett et al. paper is being even stupider than we were being!
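
A minimal sketch of the memory-bounding mechanism (plain counts stand in for the actual Pitman-Yor seating arrangements, and the cap and eviction policy below are just illustrative):

    # Keep per-context count tables in a dict capped at MAX_RESTAURANTS; when
    # the cap is hit, evict an entire context ("restaurant") chosen uniformly
    # at random. Plain counts stand in for the real Pitman-Yor seating
    # arrangements; only the memory-bounding mechanism is shown.
    import random
    from collections import defaultdict

    MAX_RESTAURANTS = 1000
    restaurants = {}   # context (tuple of words) -> {word: count}

    def observe(context, word):
        if context not in restaurants:
            if len(restaurants) >= MAX_RESTAURANTS:
                victim = random.choice(list(restaurants.keys()))  # random eviction
                del restaurants[victim]
            restaurants[context] = defaultdict(int)
        restaurants[context][word] += 1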

Learning efficiently with approximate inference via dual losses (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson). Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss). The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization. You now have a dual minimization, which you can solve simultaneously, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs. Moreover, you can "warm start" your inference for the next round. It's a very nice idea. I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping. However, most of the graphs they were using weren't very large, so maybe as you move toward harder problems, the speed-ups will be more obvious.

Deep learning via Hessian-free optimization (J. Martens). Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day. The "obvious" take away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products). But the less obvious take away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong. For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propagation. The party line that I've already heard is that the layer-wise training is very important to getting the network near a "good" local optimum (whatever that means). But if James' story holds out, this seems to not be true: he doesn't do any clever initialization and still finds good local optima!
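
The "efficient matrix-vector products" aside can be made concrete: you never form the Hessian H, you only ever need products H v, which you can approximate with a finite difference of gradients (real implementations use an exact double-backprop / R-operator pass). A small hedged sketch on a least-squares toy problem, where the approximation can be checked against H = X'X:

    # You never build the Hessian H; you only need H @ v, approximated here by
    # a finite difference of gradients: H v ~ (grad(w + eps*v) - grad(w)) / eps.
    import numpy as np

    def loss_grad(w, X, y):
        # gradient of the least-squares loss 0.5 * ||Xw - y||^2
        return X.T @ (X @ w - y)

    def hessian_vector_product(w, v, X, y, eps=1e-5):
        return (loss_grad(w + eps * v, X, y) - loss_grad(w, X, y)) / eps

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(50, 10)), rng.normal(size=50)
    w, v = rng.normal(size=10), rng.normal(size=10)
    print(np.allclose(hessian_vector_product(w, v, X, y),
                      (X.T @ X) @ v, atol=1e-3))   # H = X'X for this loss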

A theoretical analysis of feature pooling in vision algorithms (Y.-L. Boureau, J. Ponce, Y. LeCun). Yes, that's right: a vision paper. Why should you read this paper? Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: Sift features), you get something that looks like a multiset of features (hrm.... sounds familiar). These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not. (Sound familiar yet?) The analysis is: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes? Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ratio of between-class variances and overall variances) to see how these regimes work out. And they look basically how they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped).

Some other papers that I liked, but don't want to write too much about:

Some papers that other people said they liked were:
Hope to see you at ACL!

07 June 2010

NAACL 2010 Retrospective

I just returned from NAACL 2010, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles. (That's me trying to deride downtown LA as being nothing like real LA.)

Overall I was pleased with the program. I saw a few talks that changed (a bit) how I think about some problems. There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level. Of course I spend a great deal of time not at talks, but no longer feel bad about doing so.

On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks. I think Hoifung did a great job of targeting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself). I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)." That's not exactly right, but I think it's pretty close. You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P. At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly. (I'm also interested in trying Andrew McCallum's Factorie system, which, to some degree, trades ease of use for added functionality. But honestly, I don't really have time to just try things these days: students have to do that for me.)

One of my favorite papers of the conference was one that I hadn't even planned to go see! It is Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi. (I saw it basically because by the end of the conference, I was too lazy to switch rooms after the previous talk.) There are two things I really like about this paper. The first is that the type of sentiment they're going after is really broad. Example sentences included things that I'd love to look up, but apparently were only in the slides... but definitely more than "I love this movie." The example in the paper is "Tylenol prevents cancer," which is a nice positive case.

The basic idea is that some words give you sentiment. For instance, by itself, "cancer" is probably negative. But then some words flip polarity. Like "prevents." Or negation. Or other things. They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words. The flipping words are allowed to flip any sentiment below them in the dependency tree. Cool idea! Of course, I have to nit-pick the paper a bit. It probably would be better to allow arguments/adjuncts to flip polarity, too. Otherwise, negation (which is usually a leaf) will never flip anything. And adjectives/adverbs can't flip either (eg., going from "happy" to "barely happy"). But overall I liked the paper.
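
Here's a stripped-down sketch of just the compositional flipping idea (not their CRF with hidden variables): each node in a made-up dependency-tree format carries a prior polarity, reverser nodes flip everything beneath them, and the sentence polarity is read off at the root:

    # Every node carries a prior polarity in {-1, 0, +1}; "reverser" nodes flip
    # everything beneath them; sentence polarity is read off at the root.
    def subtree_polarity(node):
        """node = {'polarity': int, 'reverser': bool, 'children': [...]}  (made-up format)"""
        total = node['polarity'] + sum(subtree_polarity(c) for c in node['children'])
        return -total if node['reverser'] else total

    # "Tylenol prevents cancer": 'cancer' is negative, 'prevents' reverses it.
    sentence = {'polarity': 0, 'reverser': True, 'children': [           # prevents
        {'polarity': 0, 'reverser': False, 'children': []},              # Tylenol
        {'polarity': -1, 'reverser': False, 'children': []},             # cancer
    ]}
    print("positive" if subtree_polarity(sentence) > 0 else "negative")  # positive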

A second thing I learned is that XOR problems do exist in real life, which I had previously questioned. The answer came (pretty much unintentionally) from the paper The viability of web-derived polarity lexicons by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald. I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation. But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text. Voila! XOR! (Because negation XORs terms.)
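
To make the XOR point concrete, here is a tiny synthetic demo: if the label behaves like (has positive term) XOR (has negation), a perceptron over the raw indicator features can never fit all four cases, while adding the interaction feature makes the problem linearly separable. This is purely illustrative and has nothing to do with their actual lexicon construction:

    # label = has_positive XOR has_negation. No linear separator fits all four
    # cases, so the perceptron tops out at 3/4 accuracy on the raw features;
    # with the interaction feature it reaches 4/4.
    import itertools

    def perceptron_best_accuracy(examples, epochs=50):
        dim = len(examples[0][0])
        w, b, best = [0.0] * dim, 0.0, 0.0
        for _ in range(epochs):
            for x, y in examples:
                if (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) != (y == 1):
                    sign = 1 if y == 1 else -1
                    w = [wi + sign * xi for wi, xi in zip(w, x)]
                    b += sign
            acc = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
                      for x, y in examples) / len(examples)
            best = max(best, acc)
        return best

    cases = [((p, n), p ^ n) for p, n in itertools.product([0, 1], repeat=2)]
    with_interaction = [((p, n, p * n), y) for (p, n), y in cases]
    print("raw features:    ", perceptron_best_accuracy(cases))             # at most 0.75
    print("with interaction:", perceptron_best_accuracy(with_interaction))  # 1.0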

I truly enjoyed Owen Rambow's poster on The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece. If you've ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your language, but unless you have a theory of that language (in the model-theoretic sense) then whatever you do is meaningless. In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything. Take home message: pay attention to the meaning of your symbols!

In the category of "things everyone should know about", there was Painless unsupervised learning with features by Taylor Berg-Kirkpatrick, Alexandre Bouchard Côté, John DeNero and Dan Klein. The idea is that you can replace your multinomials in an HMM (or other graphical model) with little maxent models. Do EM in this for unsupervised learning and you can throw in a bunch of extra features. I would have liked to have seen a comparison against naive Bayes with the extra features, but my prior belief is sufficiently strong that I'm willing to believe that it's helpful. The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful. Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.
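
The core move can be sketched in a few lines: replace the multinomial emission p(x|z) with a little maxent model p(x|z) proportional to exp(w_z . f(x)) and, inside the M-step, fit w_z by gradient ascent on the expected-count-weighted log-likelihood. The feature map and counts below are made up, and the E-step (forward-backward) is omitted, so read this as a hedged sketch of the idea rather than their training code:

    # Emission model: p(x|z) proportional to exp(w_z . f(x)). The M-step fits
    # w_z to match the (normalized) expected counts from the E-step by gradient
    # ascent; the feature matrix and counts here are made up.
    import numpy as np

    def emission_probs(w_z, F):
        """F: (num_observation_types, num_features); returns p(x|z) for every x."""
        scores = F @ w_z
        scores -= scores.max()              # numerical stability
        p = np.exp(scores)
        return p / p.sum()

    def m_step(expected_counts_z, F, lr=0.05, iters=500):
        q = expected_counts_z / expected_counts_z.sum()   # target distribution
        w_z = np.zeros(F.shape[1])
        for _ in range(iters):
            p = emission_probs(w_z, F)
            w_z += lr * (F.T @ (q - p))     # gradient of sum_x q_x log p(x|z)
        return w_z

    rng = np.random.default_rng(0)
    F = rng.normal(size=(8, 4))             # 8 observation types, 4 features each
    counts = rng.random(8) * 10             # pretend these came from the E-step
    print(np.round(emission_probs(m_step(counts, F), F), 3))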

I didn't see the presentation for From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing by Valentin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it. The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples. I really really really tried to get something like this to work for unsearn, but nothing helped and most things hurt. (I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (eg., human babies learn words in a particular order), etc.) I'm thrilled it actually works.

Again, I didn't see Discriminative Learning over Constrained Latent Representations by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!). This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.

I usually ask people at the end of conferences what papers they liked. Here are some papers that were spoken highly of by my fellow NAACLers. (This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list. Other than that, I think I've included everything that was specifically mentioned to me.)

  1. Optimal Parsing Strategies for Linear Context-Free Rewriting Systems by Daniel Gildea.

  2. Products of Random Latent Variable Grammars by Slav Petrov.

  3. Joint Parsing and Alignment with Weakly Synchronized Grammars by David Burkett, John Blitzer and Dan Klein.

  4. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia by Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee.

  5. Type-Based MCMC by Percy Liang, Michael I. Jordan and Dan Klein.

I think I probably have two high level "complaints" about the program this year. First, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers. I apologize if I've just described your paper, but these papers really rub me the wrong way. I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that. Now, if my sentence described your paper, but your paper additionally had a really interesting analysis that helps us understand something about language, then you rock! Second, I saw a lot of presentations where speakers were somewhat embarrassingly unaware of very prominent, very relevant prior work. (In none of these cases was the prior work my own: it was work that's much more famous.) Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not. Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.

Okay, I just ranted, so let's end on a positive note. I'm leaving the conference knowing more than when I went, and I had fun at the same time. Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong. Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of you for writing these papers!

06 November 2009

Getting Started In: Bayesian NLP

This isn't so much a post in the "GSI" series, but just two links that recently came out. Kevin Knight and Philip Resnik both just came out with tutorials for Bayesian NLP. They're both excellent, and almost entirely non-redundant. I highly recommend reading both. And I thank Kevin and Philip from the bottom of my heart, since I'd been toying with the idea of writing such a thing (for a few years!) and they've saved me the effort. I'd probably start with Kevin's and then move on to Philip's (which is more technically meaty), but either order is really fine.

Thanks again to both of them. (And if you haven't read Kevin's previous workbook on SMT -- which promises free beer! -- I highly recommend that, too.)

07 September 2009

ACL and EMNLP retrospective, many days late

Well, ACL and EMNLP are long gone. And sadly I missed one day of each due either to travel or illness, so most of my comments are limited to Mon/Tue/Fri. C'est la vie. At any rate, here are the papers I saw or read that I really liked.

  • P09-1010 [bib]: S.R.K. Branavan; Harr Chen; Luke Zettlemoyer; Regina Barzilay
    Reinforcement Learning for Mapping Instructions to Actions

    and

    P09-1011 [bib]: Percy Liang; Michael Jordan; Dan Klein
    Learning Semantic Correspondences with Less Supervision

    These papers both address what might roughly be called the grounding problem, or at least trying to learn something about semantics by looking at data. I really really like this direction of research, and both of these papers were really interesting. Since I really liked both, and since I think the directions are great, I'll take this opportunity to say what I felt was a bit lacking in each. In the Branavan paper, the particular choice of reward was both clever and a bit of a kludge. I can easily imagine that it wouldn't generalize to other domains: thank goodness those Microsoft UI designers happened to call the Start Button something like UI_STARTBUTTON. In the Liang paper, I worry that it relies too heavily on things like lexical match and other very domain specific properties. They also should have cited Fleischman and Roy, which Branavan et al. did, but which many people in this area seem to miss out on -- in fact, I feel like the Liang paper is in many ways a cleaner and more sophisticated version of the Fleischman paper.

  • P09-1054 [bib]: Yoshimasa Tsuruoka; Jun’ichi Tsujii; Sophia Ananiadou
    Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

    This paper is kind of an extension of the truncated gradient approach to learning l1-regularized models that John, Lihong and Tong had last year at NIPS. The paper did a great job of motivating why optimizing with L1 penalties is hard. The first observation is that L1 regularizers optimized by gradient steps like to "step over zero." This is essentially the observation in truncated gradient and frankly kind of an obvious one (I always thought this is how everyone optimized these models, though of course John, Lihong and Tong actually proved something about it). The second observation, which goes into this current paper, is that you often end up with a lot of non-zeros simply because you haven't run enough gradient steps since the last increase. They have a clever way of accumulating these penalties lazily and applying them at the end. It seems to do very well, is easy to implement, etc. But they can't (or haven't) proved anything about it. (I've sketched the cumulative-penalty trick in code right after this list.)

  • P09-1057 [bib]: Sujith Ravi; Kevin Knight
    Minimized Models for Unsupervised Part-of-Speech Tagging

    I didn't actually see this paper (I think I was chairing a session at the time), but I know about it from talking to Sujith. Anyone who considers themselves a Bayesian in the sense of "let me put a prior on that and it will solve all your ills" should read this paper. Basically they show that sparse priors don't give you things that are sparse enough, and that by doing some ILP stuff to minimize dictionary size, you can get tiny POS tagger models that do very well.

  • D09-1006: [bib] Omar F. Zaidan; Chris Callison-Burch
    Feasibility of Human-in-the-loop Minimum Error Rate Training

    Chris told me about this stuff back in March when I visited JHU and I have to say I was totally intrigued. Adam already discussed this paper in an earlier post, so I won't go into more details, but it's definitely a fun paper.

  • D09-1011: [bib] Markus Dreyer; Jason Eisner
    Graphical Models over Multiple Strings

    This paper is just fun from a technological perspective. The idea is to have graphical models, but where nodes are distributions over strings represented as finite state automata. You do message passing, where your messages are now automata and you get to do all your favorite operations (or at least all of Jason's favorite operations) like intersection, composition, etc. to compute beliefs. Very cool results.

  • D09-1024: [bib] Ulf Hermjakob
    Improved Word Alignment with Statistics and Linguistic Heuristics

    Like the Haghighi coreference paper below, here we see how to do word alignment without fancy math!

  • D09-1120: [bib] Aria Haghighi; Dan Klein
    Simple Coreference Resolution with Rich Syntactic and Semantic Features

    How to do coreference without math! I didn't know you could still get papers accepted if they didn't have equations in them!
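
As promised in the Tsuruoka et al. bullet above, here is a rough sketch of the cumulative-penalty idea: take an ordinary gradient step, then clip each weight toward zero by however much of the total accumulated L1 penalty it hasn't yet absorbed. Learning-rate schedules and the sparse per-feature bookkeeping are simplified, so treat this as an illustration rather than their exact algorithm:

    # Rough sketch of SGD with a cumulative L1 penalty: u tracks the total
    # penalty any weight could have received so far, q[i] tracks what weight i
    # has actually absorbed, and each touched weight is clipped toward zero by
    # the difference. Not the paper's exact algorithm (schedules simplified).
    import numpy as np

    def sgd_l1_cumulative(grad_fn, data, dim, lam=0.1, lr=0.1, epochs=5):
        w, q, u = np.zeros(dim), np.zeros(dim), 0.0
        for _ in range(epochs):
            for example in data:
                w -= lr * grad_fn(w, example)       # ordinary, unregularized step
                u += lr * lam
                for i in np.nonzero(w)[0]:          # apply pending penalty lazily
                    z = w[i]
                    if w[i] > 0:
                        w[i] = max(0.0, w[i] - (u + q[i]))
                    else:
                        w[i] = min(0.0, w[i] + (u - q[i]))
                    q[i] += w[i] - z
        return w

    # toy usage: logistic-loss gradient on (x, y) pairs with y in {-1, +1}
    def logistic_grad(w, xy):
        x, y = xy
        return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

    rng = np.random.default_rng(0)
    data = [(rng.normal(size=20), rng.choice([-1, 1])) for _ in range(200)]
    print("nonzero weights:", np.count_nonzero(sgd_l1_cumulative(logistic_grad, data, dim=20)))
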
In general, here's a trend I've seen in both ACL and EMNLP this year. It's the "I find a new data source and write a paper about it" trend. I don't think this trend is either good or bad: it simply is. A lot of these data sources are essentially Web 2.0 sources, though some are not. Some are Mechanical Turk'd sources. Some are the Penn Discourse Treebank (about which there were a ridiculous number of papers: it's totally unclear to me why everyone all of a sudden thinks discourse is cool just because there's a new data set -- what was wrong with the RST treebank that it turned everyone off from discourse for ten years?! Okay, that's being judgmental and I don't totally feel that way. But I partially feel that way.)

12 June 2009

NAACL-HLT 2009 Retrospective

I hope this post will be a small impetus to get other people to post comments about papers they saw at NAACL (and associated workshops) that they really liked.

As usual, I stayed for the whole conference, plus workshops. As usual, I also hit that day -- about halfway through the first workshop day -- where I was totally burnt out and was wondering why I always stick around for the entire week. That's not to say anything bad about the workshops specifically (there definitely were good ones going on, in fact, see some comments below), but I was just wiped.

Anyway, I saw a bunch of papers and missed even more. I don't think I saw any papers that I actively didn't like (or wondered how they got in), including short papers, which I think is fantastic. Many thanks to all the organizers (Mari Ostendorf for organizing everything, Mike Collins, Lucy Vanderwende, Doug Oard and Shri Narayanan for putting together a great program, James Martin and Martha Palmer for local arrangements -- which were fantastic -- and all the other organizers who sadly we -- i.e., the NAACL board -- didn't get a chance to thank publicly).

Here are some things I thought were interesting:

  1. Classifier Combination Techniques Applied to Coreference Resolution (Vemulapalli, Luo, Pitrelli and Zitouni). This was a student research workshop paper; in fact, it was the one that I was moderating (together with Claire Cardie). The student author, Smita, performed this work while at IBM; though her main research is on similar techniques applied to really cool sounding problems in recognizing stuff that happens in the classroom. Somehow classifier combination, and general system combination, issues came up a lot at this conference (mostly in the hallways where someone was begrudgingly admitting to working on something as dirty as system combination). I used to think system combination was yucky, but I don't really feel that way anymore. Yes, it would be nice to have one huge monolithic system that does everything, but that's often infeasible. My main complaint with system combination stuff is that in many cases I don't really understand why it's helping, which means that unless it's applied to a problem I really care about (of which there are few), it's hard for me to take anything away. But I think it's interesting. Getting back to Smita's paper, the key thing she did to make this work is introduce the notion of alignments between different clusterings, which seemed like a good idea. The results probably weren't as good as they were hoping for, but still interesting. My only major pointers as a panelist were to try using different systems, rather than bootstrapped versions of the same system, and to take a look at the literature on consensus clustering, which is fairly relevant for this problem.

  2. Graph-based Learning for Statistical Machine Translation (Alexandrescu and Kirchhoff). I'd heard of some of this work before in small group meetings with Andrei and Kathrin, but this is the first time I'd seen the results they presented. This is an MT paper, but really it's about how to do graph-based semi-supervised learning in a structured prediction context, when you have some wacky metric (read: BLEU) on which you're evaluating. Computation is a problem, but we should just hire some silly algorithms people to figure this out for us. (Besides, there was a paper last year at ICML -- I'm too lazy to dig it up -- that showed how to do graph-based stuff on billions of examples.)

  3. Intersecting Multilingual Data for Faster and Better Statistical Translations (Chen, Kay and Eisele). This is a very simple idea that works shockingly well. Had I written this paper, "Frustrating" would probably have made it into the title. Let's say we want an English to French phrase table. Well, we do phrase table extraction and we get something giant and ridiculous (have you ever looked at those phrase pairs?) that takes tons of disk space and memory, and makes translation slow (it's like the "grammar constant" in parsing that means that O(n^3) for n=40 is impractical). Well, just make two more phrase tables, English to German and German to French and intersect. And voila, you have tiny phrase tables and even slightly better performance. The only big caveat seems to be that they estimate all these things on Europarl. What if your data sets are disjoint: I'd be worried that you'd end up with nothing in the resulting phrase table except the/le and sometimes/quelquefois (okay, I just used that example because I love that word).

  4. Quadratic Features and Deep Architectures for Chunking (Turian, Bergstra and Bengio). I definitely have not drunk the deep architectures kool-aid, but I still think this sort of stuff is interesting. The basic idea here stems from some work Bergstra did for modeling vision, where they replaced a linear classifier (y = w'x) with a low rank approximation to a quadratic classifier (y = w'x + sqrt[(a'x)^2 + (b'x)^2 + ... ]). (There's a short sketch of this scoring function right after this list.) Here, the a,b,... vectors are all estimated as part of the learning process (eg., by stochastic gradient descent). If you use a dozen of them, you get some quadratic style features, but without the expense of doing, say, an implicit (or worse, explicit) quadratic kernel. My worry (that I asked about during the talk) is that you obviously can't initialize these things to zero or else you're in a local minimum, so you have to do some randomization and maybe that makes training these things a nightmare. Joseph reassured me that they have initialization methods that make my worries go away. If I have enough time, maybe I'll give it a whirl.

  5. Exploring Content Models for Multi-Document Summarization (Haghighi and Vanderwende). This combines my two favorite things: summarization and topic models. My admittedly biased view was they started with something similar to BayeSum and then ran a marathon. There are a bunch of really cool ideas in here for content-based summarization.

  6. Global Models of Document Structure using Latent Permutations (Chen, Branavan, Barzilay and Karger). This is a really cool idea (previously mentioned in a comment on this blog) based on using generalized Mallow's models for permutation modeling (incidentally, see a just-appeared JMLR paper for some more stuff related to permutations!). The idea is that documents on a similar topic (eg., "cities") tend to structure their information in similar ways, which is modeled as a permutation over "things that could be discussed." It's really cool looking, and I wonder if something like this could be used in conjunction with the paper I talk about below on summarization for scientific papers (9, below). One concern raised during the questions that I also had was how well this would work for things not as standardized as cities, where maybe you want to express preferences of pairwise ordering, not overall permutations. (Actually, you can do this, at least theoretically: a recent math visitor here, Mark Huber, has some papers on exact sampling from permutations under such partial order constraints using coupling from the past.) The other thing that I was thinking during that talk that I thought would be totally awesome would be to do a hierarchical Mallow's model. Someone else asked this question, and Harr said they're thinking about this. Oh, well... I guess I'm not the only one :(.

  7. Dan Jurafsky's invited talk was awesome. It appealed to me in three ways: as someone who loves language, as a foodie, and as an NLPer. You just had to be there. I can't do it justice in a post.

  8. More than Words: Syntactic Packaging and Implicit Sentiment (Greene and Resnik). This might have been one of my favorite papers of the conference. The idea is that how you say things can express your point of view as much as what you say. They look specifically at effects like passivization in English, where you might say something like "The truck drove into the crowd" rather than "The soldier drove the truck into the crowd." The missing piece here seems to be identifying the "whodunnit" in the first sentence. This is like figuring out subjects in languages that like to drop subjects (like Japanese). Could probably be done; maybe it has been (I know it's been worked on in Japanese; I don't know about English).

  9. Using Citations to Generate Surveys of Scientific Paradigms (Mohammad, Dorr, Egan, Hassan, Muthukrishan, Qazvinian, Radev and Zajic). I really really want these guys to succeed. They basically study how humans and machines create summaries of scientific papers when given either the text of the paper, or citation snippets to the paper. The idea is to automatically generate survey papers. This is actually an area I've toyed with getting in to for a while. The summarization aspect appeals to me, and I actually know and understand the customer very well. The key issue I would like to see addressed is how these summaries vary across different users. I've basically come to the conclusion that in summarization, if you don't pay attention to the user, you're sunk. This is especially true here. If I ask for a summary of generalization bound stuff, it's going to look very different than if Peter Bartlett asks for it.

  10. Online EM for Unsupervised Models (Liang and Klein). If you want to do online EM read this paper. On the other hand, you're going to have to worry about things like learning rate and batch size (think Pegasos). I was thinking about stuff like this a year or two ago and was wondering how this would compare to doing SGD on the log likelihood directly and not doing EM at all. Percy says that asymptotically they're the same, but who knows what they're like in the real world :). I think it's interesting, but I'm probably not going to stop doing vanilla EM.
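
Here is the scoring function from item 4 written out as a sketch; in the paper the extra directions are learned jointly with w by stochastic gradient descent, whereas below they are just random so you can see the shape of the model and why only a handful of extra vectors need to be stored:

    # y = w'x + sqrt((a1'x)^2 + ... + (ad'x)^2): a linear term plus a low-rank
    # surrogate for quadratic features. Only the d extra direction vectors are
    # stored, never an explicit quadratic expansion. In the paper the rows of A
    # are learned along with w by SGD; here they are random for illustration.
    import numpy as np

    def low_rank_quadratic_score(x, w, A):
        return w @ x + np.sqrt(np.sum((A @ x) ** 2))

    rng = np.random.default_rng(0)
    dim, rank = 100, 12
    x, w = rng.normal(size=dim), rng.normal(size=dim)
    A = rng.normal(size=(rank, dim))        # the a, b, ... vectors from the text
    print(low_rank_quadratic_score(x, w, A))
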
I then spent some time at workshops.

I spent the first morning in the Computational Approaches to Linguistic Creativity workshop, which was just a lot of fun. I really liked all of the morning talks: if you love language and want to see stuff somewhat off the beaten path, you should definitely read these. I went by the Semantic Evaluation Workshop for a while and learned that the most frequent sense baseline is hard to beat. Moreover, there might be something to this discourse thing after all: Marine tells us that translators don't like to use multiple translations when one will do (akin to the one sense per discourse observation). The biggest question in my head here is how much the direction of translation matters (eg., when this heuristic is violated, is it violated by the translator, or the original author)? Apparently this is under investigation. But it's cool because it says that even MT people shouldn't just look at one sentence at a time!

Andrew McCallum gave a great, million-mile-an-hour invited talk on joint inference in CoNLL. I'm pretty interested in this whole joint inference business, which also played a big role in Jason Eisner's invited talk (that I sadly missed) at the semi-supervised learning workshop. To me, the big question is: what happens if you don't actually care about some of the tasks. In a probabilistic model, I suppose you'd marginalize them out... but how should you train? In a sense, since you don't care about them, it doesn't make sense to have a real loss associated with them. But if you don't put a loss, what are you doing? Again, in probabilistic land you're saved because you're just modeling a distribution, but this doesn't answer the whole question.

Al Aho gave a fantastically entertaining talk in the machine translation workshop about unnatural language processing. How the heck they managed to get Al Aho to give an invited talk is beyond me, but I suspect we owe Dekai some thanks for this. He pointed to some interesting work that I wasn't familiar with, both in raw parsing (eg., how to parse errorful strings with a CFG when you want to find the closest (in edit distance) string that is parseable by a CFG) and natural language/programming language interfaces. (In retrospect, the first result is perhaps obvious had I actually thought much about it, though probably not so back in 1972: you can represent edit distance by a lattice and then parse the lattice, which we know is efficient.)

Anyway, there were other things that were interesting, but those are the ones that stuck in my head somehow (note, of course, that this list is unfairly biased toward my friends... what can I say? :P).

So, off to ICML on Sunday. I hope to see many of you there!

15 December 2008

Interesting NIPS papers, take 1

I just got back from NIPS. Kevin Duh was nice enough to forward his "top N" list of NIPS papers; I'll post my own shortly. Thanks Kevin!

"Large Margin Taxonomy Embedding for Document Categorization" - Kilian Weinberger, Olivier Chapelle (Yahoo)
- Suppose you have a multi-class classification problem where you need to assign documents to different nodes in the topic hierarchy. While there are hierarchical classifiers for solving this problem, the authors instead propose to embed the taxonomy in a continuous space and use regression. The idea is as follows: from a taxonomy, we can compute distances between nodes that characterize the loss of classifying one node when the true class is the other. This pairwise distance matrix is used by multidimensional scaling to create a set of prototype vectors in Euclidean space, one for each class. Then, we train a multi-output regression that maps training samples to these prototype vectors. However, the problem with this two-stage approach is that the prototypes are computed without regard to the training data, so the solution may be suboptimal for classification. The paper then introduces an objective function that combines both steps: essentially, we constrain the mapping of the training samples and the prototypes to have a large margin.
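
A hedged sketch of the two-stage pipeline described above (MDS on the taxonomy distances to get class prototypes, then regression onto them, then nearest-prototype prediction). The paper's actual contribution, the joint large-margin objective, is not reproduced here, and the taxonomy distances and data are synthetic:

    # Two-stage baseline: (1) MDS turns the class-to-class taxonomy distances
    # into one prototype point per class; (2) a ridge regression maps documents
    # toward their class prototype; prediction = nearest prototype. The joint
    # large-margin version in the paper is not shown. Everything is synthetic.
    import numpy as np
    from sklearn.manifold import MDS
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    num_classes, dim, n = 4, 50, 200

    D = np.array([[0, 1, 2, 2],             # made-up taxonomy distances
                  [1, 0, 2, 2],
                  [2, 2, 0, 1],
                  [2, 2, 1, 0]], dtype=float)
    prototypes = MDS(n_components=2, dissimilarity="precomputed",
                     random_state=0).fit_transform(D)    # one point per class

    X = rng.normal(size=(n, dim))
    y = rng.integers(0, num_classes, size=n)
    reg = Ridge().fit(X, prototypes[y])                  # regress docs onto prototypes

    Z = reg.predict(X[:5])
    pred = np.argmin(((Z[:, None, :] - prototypes[None]) ** 2).sum(-1), axis=1)
    print(pred, y[:5])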

"Learning taxonomies by Dependence Maximization" - Matthew Blaschko, Arthur Gretton (MPI)
- Our goal is to cluster a dataset and provide a taxonomy that shows the relationship between clusters. The standard solutions are agglomerative/divisive hierarchical clustering. This paper proposes an alternative solution which allows us to use kernels (and is thus related to spectral clustering). The idea is based on a kernel measure of dependence: roughly speaking, if K is the kernel matrix of the original data and L is the kernel matrix of the resulting clustering, the objective max_{L} trace(K*L) measures the dependence between samples and clusters and is thus a viable clustering objective. The method gets a taxonomy by formulating L=PYP' where P is a partition matrix (maps cluster to samples) and Y is a positive semi-definite matrix that encodes relationships between clusters.

"Fast Prediction on a Tree" - Mark Herbster, Massimiliano Pontil, Sergio Rojas (UCL)
- Graph-based semi-supervised learning needs to scale well with the number of unlabeled samples in order to be truly useful in large data scenarios. This paper presents a method to improve the computational scalability of Laplacian-based methods: First, convert the data graph to a tree (using, e.g. a maximum spanning tree algorithm). Second, they show a fast way to compute the pseudo-inverse of the graph/tree Laplacian in O(m^2 + mS), where m is the number of labeled samples and S is the tree diameter. This Laplacian pseudo-inverse corresponds to a kernel, and so one can use, say, a kernel perceptron to predict on test points. Experiments show that tree approximations to the graph did not deteriorate accuracy, while drastically increasing speed.
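
A sketch of the overall shape of the method: approximate the data graph by its minimum spanning tree, use the pseudo-inverse of the tree Laplacian as a kernel, and run a kernel perceptron on the labeled nodes. The paper's point is the fast O(m^2 + mS) computation of that pseudo-inverse; here np.linalg.pinv just stands in for it, and the data is synthetic:

    # Approximate the data graph by its MST, use pinv of the tree Laplacian as
    # a kernel, and run a kernel perceptron on the labeled nodes. np.linalg.pinv
    # stands in for the paper's fast O(m^2 + mS) computation.
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree, laplacian

    rng = np.random.default_rng(0)
    n = 60
    X = rng.normal(size=(n, 2))
    y = np.sign(X[:, 0] + 1e-9)                               # synthetic labels in {-1, +1}

    dist = np.sqrt(((X[:, None, :] - X[None]) ** 2).sum(-1))  # full distance graph
    T = minimum_spanning_tree(csr_matrix(dist))               # tree approximation
    A = ((T + T.T).toarray() > 0).astype(float)               # symmetric adjacency
    K = np.linalg.pinv(laplacian(A))                          # tree-Laplacian pseudo-inverse

    labeled = rng.choice(n, size=15, replace=False)
    alpha = np.zeros(n)                                       # kernel perceptron
    for _ in range(20):
        for i in labeled:
            if np.sign(alpha @ K[:, i]) != y[i]:
                alpha[i] += y[i]

    print("accuracy on all nodes:", (np.sign(K @ alpha) == y).mean())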

"Unlabeled data: Now it helps, now it doesn't" - Aarti Singh, Rob Nowak, Jerry Zhu (Wisconsin)
- This is an interesting theoretical paper that analyzes when unlabeled data helps under the cluster assumption. First, the authors argue that asymptotic analysis is unsuitable for analyzing the difference between supervised learning and SSL, and instead use finite-sample analysis and minimax bounds. Let n be the number of labeled samples, m the number of unlabeled samples, d the feature dimension, and g the margin between two classes (can be positive or negative). The proof is of the form: suppose a clairvoyant supervised learner with full knowledge of the underlying density p(x) has error less than e2(n), and a supervised learner has error greater than e1(n). Then, the error of SSL is no more than e2(n) + O(some function of m). Thus, if O(some function of m) is negligible (and this depends on the exact values of m,d,g,n), then SSL will improve over supervised learning; otherwise, no. In words, the cases where SSL helps are as follows: if the margin g is relatively large compared to the average spacing between labeled points (n^{-1/d}), then supervised learning can discover p(x) accurately and works just as well as SSL. However, if g is small relative to the spacing between labeled points, but large relative to the spacing between unlabeled points (m^{-1/d}), then SSL will beat any supervised learner. In the case that the margin is negative, if -g is larger than (m^{-1/d}), then SSL also wins.

"DiscLDA: Discriminative learning for dimensionality reduction and classification" - Simon Lacoste-Julien, Fei Sha, Michael Jordan (Berkeley/USC)
- Unsupervised topic models have become popular methods for finding latent structures in text documents. These are usually trained by max likelihood, but this may be suboptimal if our final goal is classification. This paper considers the problem of introducing labeled data (e.g. topic labels) into topic models. Recall in Latent Dirichlet Allocation (LDA), for each document, we first draw a (k-dimensional) topic mixture from a Dirichlet prior. Then we draw words according to p(word|topic)p(topic|topic-mixture). We can view each document as a point on the topic simplex. The idea here is to introduce a transformation T on the topic simplex, so that documents with the same label will be mapped close together.

"Modeling the effects of memory on human online sentence processing with particle filters" - Roger Levy (UCSD), Florencia Realia, Tom Griffiths (Berkeley)
- Humans comprehend sentences in an online manner: it is believed that we do incremental parsing as we hear words one at a time. Thus, garden-path sentences are able to catch us off-guard. Moreover, the longer the sentence is before a disambiguation point is reached, the harder it is for humans to recover (digging-in effect). This is a psycholinguistics paper that seeks to explain garden-path and digging-in by a novel particle-filter based PCFG parser: essentially, whenever a word is received, a partial parse is sampled. The number of "incorrect" particles increase with sentence length (modeling digging-in), and the number of particles used correlates with the memory constraints of the brain.

"Tighter bounds for structured estimation" - Olivier Chapelle, et. al. (Yahoo/Stanford/NICTA)
- A common approach in optimizing difficult loss functions is to minimize a convex upper bound instead (e.g. hinge loss in SVM's). However, these losses are often loose. In particular, outliers often suffer large loss, so the general classifier accuracy may be sacrificed since the optimizer focuses on these extremely difficult points. The idea here is to use a non-convex, but tighter upper bound. They adopt a ramp-loss for the structured prediction problem and use the convex-concave procedure to solve it.
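
The loss itself is easy to write down; the ramp loss below is one common form, a difference of two hinges that caps the penalty on badly misclassified points (the convex-concave optimization is not shown):

    # ramp(z) = max(0, 1 - z) - max(0, -1 - z): same as the hinge near the
    # decision boundary, but capped at 2 so outliers cannot dominate the loss.
    import numpy as np

    def hinge(z):
        return np.maximum(0.0, 1.0 - z)

    def ramp(z):
        return hinge(z) - np.maximum(0.0, -1.0 - z)

    margins = np.array([-5.0, -1.0, 0.0, 0.5, 2.0])   # z = y * score
    print("hinge:", hinge(margins))                   # [6.  2.  1.  0.5 0. ]
    print("ramp: ", ramp(margins))                    # [2.  2.  1.  0.5 0. ]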

21 September 2008

Co-training, 10 years later

At this year's ICML, they gave out a "10 year" award to a paper published in an ICML-related venue from 1998. This year it went to a COLT 1998 paper by Avrim Blum and Tom Mitchell: Combining Labeled and Unlabeled Data with Co-Training. While I'm not super familiar with what might have been a contender, I have to say that I definitely think this is a good choice.

For those unfamiliar with the idea of co-training, you should really read the paper. There's also a wikipedia entry that describes it as:

Co-training is a semi-supervised learning technique that requires two views of the data. It was introduced by Avrim Blum and Tom Mitchell. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Ideally, the two views are conditionally independent (i.e., the two feature sets of each instance are conditionally independent given the class) and each view is sufficient (i.e., the class of an instance can be accurately predicted from each view alone). Co-training first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data.
This is a good summary of the algorithm, but what is left off is that---as far as I know---co-training was one of the first (if not the first) method for which theoretical analysis showed that semi-supervised learning might help. My history is a bit rough, so anyone should feel free to correct me if I'm wrong.
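
For concreteness, here is a minimal sketch of the loop summarized in the quote: one classifier per view, each labeling the unlabeled examples it is most confident about and feeding them back into the shared labeled pool. The data is synthetic, the per-round counts are arbitrary, and the real algorithm is more careful about growth schedules and class balance:

    # One classifier per view; each round, each classifier pseudo-labels the
    # unlabeled examples it is most confident about. Synthetic data, arbitrary
    # per-round counts; the real algorithm balances classes and grows the pool
    # more carefully.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n, d = 400, 10
    y_true = rng.integers(0, 2, size=n)
    view1 = rng.normal(size=(n, d)) + 1.5 * y_true[:, None]   # two redundant views
    view2 = rng.normal(size=(n, d)) + 1.5 * y_true[:, None]

    seed = list(np.where(y_true == 0)[0][:5]) + list(np.where(y_true == 1)[0][:5])
    pseudo = {int(i): int(y_true[i]) for i in seed}           # labeled pool
    pool = set(range(n)) - set(pseudo)

    for _ in range(10):                                       # co-training rounds
        idx = sorted(pseudo)
        labels = np.array([pseudo[i] for i in idx])
        for view in (view1, view2):
            clf = LogisticRegression(max_iter=1000).fit(view[idx], labels)
            cand = sorted(pool)
            if not cand:
                break
            conf = clf.predict_proba(view[cand]).max(axis=1)
            for j in np.argsort(-conf)[:5]:                   # 5 most confident per view
                i = cand[j]
                pseudo[i] = int(clf.predict(view[[i]])[0])
                pool.discard(i)

    print("examples labeled (true or pseudo):", len(pseudo))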

Another aspect of co-training that is cool for readers of this blog is that to a reasonable degree, it has its roots in a 1995 ACL paper by David Yarowsky: Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, which, as far as I know, was really the first paper to introduce the notion of having two views of data (although I don't think David described it as such).

All in all, the co-training paper is great. In fact, if you don't believe me that I think it's great, check out my EMNLP 2008 paper. My analysis (and, to some degree, algorithm) are based heavily on the co-training analysis.

Which brings me to what I really want to discuss. That is, I have a strong feeling that if the co-training paper were reviewed today, it would be blasted for the theoretical analysis. (Indeed, I had the same fear for my EMNLP paper; though since it was EMNLP and not, say, COLT, I don't think the problem was as severe.) The problem with the co-training paper is that the theoretical result is for an algorithm that is only superficially related to the actual algorithm they implement. In particular, the actual algorithm they implement uses notions of confidence, and steadily increasing training set size, and incremental additions and so on. It's vastly more complicated than the algorithm they analyze. My recent experience as both an author and reviewer at places like NIPS and ICML is that this is pretty much a non-starter these days.

In fact, the algorithm is so different that it took three years for an analysis of something even remotely close to come out. In NIPS 2001, Sanjoy Dasgupta, Michael Littman and David McAllester published a paper that actually tries to analyze something closer to the "real" co-training algorithm. They get pretty close. And this analysis paper is a full NIPS paper that basically just proves one (main) theorem.

(A similar set of events happened with David Yarowsky's paper. He didn't include any theoretical analysis, but there has been subsequent work, for instance by Steve Abney to try to understand the Yarowsky algorithm theoretically. And again we see that an analysis of the exact original algorithm is a bit out of grasp.)

I'm sure other people will disagree--which is fine--but my feeling about this matter is that there's nothing wrong with proving something interesting about an algorithm that is not quite exactly what you implement. The danger, of course, is if you get an incorrect intuition. For instance, in the case of co-training, maybe it really was all these "additions" that made the algorithm work, and the whole notion of having two views was useless. This seems to have turned out not to be the case, but it would be hard to tell. For instance, the co-training paper doesn't report results on the actual algorithm analyzed: presumably it doesn't work very well or there would be no need for the more complex variant (I've never tried to implement it). On the other hand, if it had taken Avrim and Tom three extra years to prove something stronger before publishing, then the world would have had to wait three extra years for this great paper.

The approach I took in my EMNLP paper, which, at least as of now, I think is reasonable, is to just flat out acknowledge that the theory doesn't really apply to the algorithm that was implemented. (Actually, in the case of the EMNLP paper, I did implement both the simple and the complex and the simple wasn't too much worse, but the difference was enough to make it worth--IMO--using the more complex one.)

22 January 2008

An NLPer in the Algorithmists Court

I just returned from SODA (the Symposium on Discrete Algorithms). Obviously (I suppose), I didn't have a paper there, but I was interested in learning new things. Moreover, it was in SF which is a short cheap flight and admits free housing by crashing with one of my very close friends. My fellow professorial newbie, Suresh Venkatasubramanian, ran a seminar this past Fall on approximate high-dimensional geometry that I attended. This is a topic that is excruciatingly close to many areas of machine learning, and I suspect that there will be a lot of cross-pollination in the next few years. I'll try to get around at some point to post about things I learned in the seminar. Some are more ML-related, some more NLP. Anyway, I wanted to (a) see what a theory conference was like and (b) learn about the latest, greatest techniques. (Sadly, I had to miss the last day because I teach early Tuesday mornings.)

Below are a few of the papers I saw that I liked enough to put them on my "to read" list. I'm not quite sure what my qualification for "liked" is, so take this with a huge grain of salt. If you want more authoritative opinions, see either Suresh's blog or one of the other theory blogs. I also warn you guys that most of these things really have nothing to do with NLP.

The first invited talk was by Bonnie Berger, talking about computational biology. Most of the topics were pretty standard: RNA/protein structure prediction, protein-protein interaction graphs, etc. There are some surface connections to NLP (e.g., some other people use stochastic CFGs to do structure prediction, though she does not). The twist that she is taking is to try to solve these problems simultaneously for more than two organisms (typically: yeast, worm, fly, mouse, human; these are the organisms for which we have the most data). The rationale is based on the hypothesis that if a structure is biologically important, then it is (roughly) conserved across species. I feel like a lot of linguistics is based on roughly the same hypothesis, and this is probably a topic I'll try to blog about in the near(ish) future.

The second invited talk was by Persi Diaconis, who I've now seen give two talks and I just love him (I wish I had been able to take probability from him as an undergrad). At a very high level, his talk was a bit of evangelizing functional combinatorics, which is essentially a paradigm for understanding combinatorial problems that unifies a large body of disparate problems in terms of the analysis of symmetric polynomials (which have, themselves, a rich theory). The particular example he gave was a relationship between carrying (like, what you do when you add numbers with pen and paper) and shuffling. I confess I learned a lot more about shuffling (which seems to be a bit of a love for Persi), but it was still pretty interesting. Not enough for me to learn functional combinatorics, but still interesting.

One of my favorite papers was Fast dimension reduction using Rademacher series on dual BCH codes by Nir Ailon and Edo Liberty. Johnson-Lindenstrauss gives us a method of embedding N points in D dimensions into K<<D dimensions that preserves l2 distances with high probability, using random projection matrices (lots of follow on work, too). One problem is that these methods require storing a D*K matrix and each projection takes D*K operations. The result of this paper is that we can get the same results using much smaller space and time for each projection (or, in certain boundary cases, a matched space/time complexity). The key idea is to force oneself to use a random diagonal K-matrix composed with "something else" that gives us the desired property. The paper analyzes what properties the "something else" must have, and then constructs such a (deterministic) matrix based on Fourier codes. I thought it was both interesting theoretically and likely to be quite practical (though time will tell on the latter).
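
A sketch in the spirit of that construction, using the simpler "random signs, fast orthogonal transform, subsample" recipe rather than the paper's actual dual-BCH matrix: the sign flip costs O(D), the transform O(D log D), and no D*K matrix is ever stored (the explicit Hadamard matrix below is only for brevity; a real implementation would use the recursive transform):

    # "Random signs + fast orthogonal transform + random subsample" recipe.
    # The explicit Hadamard matrix is only for brevity; a real implementation
    # applies the O(D log D) recursive transform and never stores a D x K matrix.
    import numpy as np
    from scipy.linalg import hadamard

    def make_fast_jl(d, k, rng):
        signs = rng.choice([-1.0, 1.0], size=d)        # the random diagonal
        H = hadamard(d) / np.sqrt(d)                   # orthonormal transform (d a power of 2)
        coords = rng.choice(d, size=k, replace=False)  # keep k random coordinates
        return lambda x: np.sqrt(d / k) * (H @ (signs * x))[coords]

    rng = np.random.default_rng(0)
    project = make_fast_jl(d=1024, k=64, rng=rng)
    x, y = rng.normal(size=1024), rng.normal(size=1024)
    print(np.linalg.norm(x - y))                       # roughly preserved below
    print(np.linalg.norm(project(x) - project(y)))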

Another paper I really liked was Declaring independence via the sketching of sketches by Piotr Indyk and Andrew McGregor. The problem they are tackling is determining whether two variables are correlated, when the variables come in pairs in a stream. (For those who don't know streaming, think of it as an algorithm that gets one data point at a time and has to do something quickly using small space.) The two quantities that are relevant are the joint distribution over the pairs and the product-of-marginals distribution over the pairs. If these are very similar, the two items in the pair are likely to be independent. They measure distance in three ways: Euclidean (l2) distance, variational (l1) distance, and mutual information. For l2, they have a very clever analysis that is straightforward to understand if you know anything about sketching (it's basically a natural extension of previous results by Indyk). They are able to get pretty good results. They have similar results for the other metrics. Again, I can already think of a handful of possible applications of this stuff. It also leaves open an interesting question: what if you get k-tuples (instead of pairs) and you want to extract the M most correlated pairs? I suspect you can do this efficiently using a combination of this result with known results on frequent item sets. This would have immediate application to Bayes net structure learning, among other things.
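To make the l2 quantity concrete, here is a toy simulation of the sign-sketch idea as I understand it; it uses fully random signs rather than 4-wise independent hashing and pays no attention to space, so it is intuition only, not their algorithm:

```python
# Toy simulation: random signs for the two coordinates of each pair give an
# unbiased estimate of  sum_{i,j} (joint(i,j) - marg_x(i) * marg_y(j))^2.
import numpy as np

rng = np.random.default_rng(1)
n, alphabet = 100_000, 20

# A stream of (x, y) pairs with mild dependence: y copies x 30% of the time.
x = rng.integers(alphabet, size=n)
y = np.where(rng.random(n) < 0.3, x, rng.integers(alphabet, size=n))

def sketch_estimate(copies=200):
    ests = []
    for _ in range(copies):
        s = rng.choice([-1.0, 1.0], size=alphabet)   # signs for x symbols
        t = rng.choice([-1.0, 1.0], size=alphabet)   # signs for y symbols
        joint = np.mean(s[x] * t[y])                 # one pass: three running sums
        ests.append((joint - np.mean(s[x]) * np.mean(t[y])) ** 2)
    return np.mean(ests)

# Exact (empirical) squared l2 distance between joint and product of marginals.
J = np.zeros((alphabet, alphabet))
np.add.at(J, (x, y), 1.0 / n)
exact = np.sum((J - np.outer(J.sum(axis=1), J.sum(axis=0))) ** 2)
print(f"sketch estimate: {sketch_estimate():.5f}   exact: {exact:.5f}")
```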

Venkatesan Guruswami, James Lee and Alexander Razborov had a paper on Almost Euclidean subspaces of l_1^N via expander codes (I can't find a link because Google doesn't like latex codes) that I thought was both interesting and very well presented. This paper is a bit harder for me to get across (much less understand!). At a high level, their observation is that for a lot of metric embedding problems, the only known way to get a good embedding (i.e., one that preserves distances or norms) is to use a randomized construction (and then prove that it works with high probability). Their result is a deterministic, constructive embedding matrix for embedding l1 into l2 (under some qualifications). The other plus of this method is that the embedding matrix is sparse, which means that the images of the vectors under the embedding are sparse. There are also interesting connections to compressed sensing, which are mentioned in the paper.

There were other papers that I liked, but that I'm not going to try to summarize. Timothy Chan had an impressive (if somewhat ugly) result On the bichromatic k-set problem, which is effectively the problem of trying to figure out how many bad halfspace classifiers there are for labeled data (translated into machine learning lingo). Hubert Chan, Anupam Gupta and Kunal Talwar had a nice result on Ultra-low-dimensional embeddings for doubling metrics (I could only find a previous version that appeared in a NIPS workshop) that shows that if your metric doesn't have nearly uniform volume (characterized by the doubling dimension) then you can embed into Euclidean space with low distortion (this is somewhat surprising: classic results of Bourgain show that in general this is not possible). Despite the niceness of their result, the thing I found most interesting in the talk was a reference to a construction due to Satish Rao on Small distortion and volume preserving embeddings for planar and Euclidean metrics that they are basically extending.

Overall, I enjoyed the conference, despite getting pretty lost in a bunch of the talks. I wish I could have stayed for the last day, since there are a handful of papers that I think would probably be interesting. But I suppose that's what this 1300-page proceedings they dumped on me is for. In particular, the ones that look interesting from titles and skimming are: Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm (actually, I've read the TR version of this paper and it's very good); Sampling algorithms and coresets for l_p regression (only considers the N>>D case, which is a bit limiting); and Linked decompositions of networks and power of choice in Polya urns, which has interesting connections to (shockingly) Polya urn schemes. In particular, they show that if you have a Polya urn where in each step you sample two urns proportional to their occupancy, but then insert into the smaller of the two, then you get a balanced distribution. This is in contrast to typical Polya urn schemes that show up in Dirichlet processes and Pitman-Yor processes, where you get power-law distributions. (The "power of two" reference is to hashing strategies where you hash by two independent hash functions and insert into the less occupied of the two bins.)
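If you want to see the balancing effect for yourself, here is a quick, unoptimized simulation of the scheme as described above (my toy parameters, not anything from the paper):

```python
# Sample two urns proportional to occupancy and add the new ball to the
# SMALLER of the two; compare with the standard Polya urn (add to the single
# sampled urn), which concentrates mass on a few lucky urns.
import numpy as np

rng = np.random.default_rng(2)
n_urns, n_balls = 50, 100_000

def simulate(two_choices):
    counts = np.ones(n_urns)                     # each urn starts with one ball
    for _ in range(n_balls):
        p = counts / counts.sum()
        if two_choices:
            a, b = rng.choice(n_urns, size=2, p=p)
            counts[a if counts[a] <= counts[b] else b] += 1
        else:
            counts[rng.choice(n_urns, p=p)] += 1
    return np.sort(counts)[::-1] / counts.sum()  # occupancy shares, descending

print("standard urn, top 5 shares:  ", np.round(simulate(False)[:5], 3))
print("two-choice urn, top 5 shares:", np.round(simulate(True)[:5], 3))
```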

(p.s., I apologize if I misquoted the exact results in any of these papers; most of this post is based on memory and very short notes I took.)

15 May 2007

Whence JCLR?

Journal publication is not too popular for NLPers -- we tend to be a conference-driven bunch. While I couldn't care less about some arguments for journals (e.g., the folks on tenure committees like them), I do feel that they serve a purpose beyond simply acting as an archive (which things like arxiv.org, the ACL anthology, citeseer, rexa, etc. do anyway). In particular, a journal paper is often a place where you get to really set the stage for your problem, describe your algorithms so that they're actually reimplementable, and go into serious error analysis. Certainly not every paper that appears in a *ACL should continue on to the journal path, but many times a handful of papers could be merged.

One significant problem is that we're currently really limited in our choice of publication venues. Computational Linguistics (MIT Press) is definitely the place to publish a journal paper if you can. Unfortunately, CL only puts out four issues per year, each with about 4-5 papers. Sure, there aren't hundreds of good papers per year, but I have to believe there are more than 16-20. Moreover, I don't feel that CL actually mirrors the *ACL proceedings -- there are many papers published in CL that I don't think match the general sensitivities of *ACL. In addition to the small number of CL papers, the turnaround time is quite slow. I was personally very impressed with my turnaround time two years ago (date of submission -> date of publication was about a year) and I know that Robert Dale (who's editing now) has done a lot to try to improve this. But still, a year is a long time. And I've heard of papers that take several years to get through. Finally, it's not open. I hate pay-for-access journals almost as much as I hate pay-for-access conference proceedings. Sure, if you attend an *ACL you get it for "free" and most universities have agreements, but this is more a principle thing than a practical thing.

Things were similar in machine learning land about six years ago (though in fact I think they were worse). The big journal there was Machine Learning (published by Springer). It had roughly the same problems, to the extent that a large fraction of the editorial board resigned to found the Journal of Machine Learning Research (JMLR). JMLR has since become very successful, publishes dozens of papers per year, and has incredibly quick turnaround (I have seen a journal version of a NIPS paper appear in JMLR before NIPS even happens). The creation of JMLR was greatly assisted by the SPARC group, which helps fledgling journals get off the ground.

I would love to see a similar thing happen in the NLP community. I, personally, cannot make this happen (I don't have enough weight to throw around), but in talking to colleagues (the majority of whom also don't have enough weight) this seems to be something that many people would be in favor of. I don't think it has to be a "JCLR is better than CL" sort of thing; I think it's very possible for both to co-exist, essentially serving slightly different purposes for slightly different communities. In particular, aside from fast turnaround and online publication, some things that I would love to see happen with such a journal are: strongly encouraged sharing of code/data (if one could build in some sort of copyright protection for private data, this would be even better, since it would let more people share); a built-in board for paper discussion (probably with membership); and the ability for authors to easily submit addenda.

A while back I went through the SPARC suggestions of how to begin such a thing and it's very non-trivial. But it's doable. And I'd be willing to help. The biggest thing that would be required would be a bunch of people with white hair who are willing to commit body-and-soul to such a move.

09 May 2007

WhatToSee

I've been using a small number of Perl scripts for helping me decide what papers to read, and I just put them up on the web for people to play with. See http://hal3.name/WhatToSee. It's currently seeded with a few recent years of ACL, NAACL, ICML and NIPS. Yes, it would be more useful if conferences were to publish the proceedings before the conference, but don't complain to me about that. Feel free to submit additional indices if you want (an index has to be an online webpage that contains links pointing to online .PDF files -- if no such page exists, you can always create one on your own webpage and point to that). There are probably ways to break it, but hopefully that won't happen frequently.
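For anyone curious what such an index boils down to, here is a hypothetical sketch (in Python, not the actual Perl scripts) of the only thing that really needs to be extracted from it -- links that resolve to .pdf files:

```python
# Not the actual WhatToSee scripts -- just a sketch of what an "index" is:
# a page whose links point at .pdf files.
import re
import urllib.request

def pdf_links(index_url):
    html = urllib.request.urlopen(index_url).read().decode("utf-8", "ignore")
    hrefs = re.findall(r'href="([^"]+\.pdf)"', html, flags=re.IGNORECASE)
    base = index_url.rsplit("/", 1)[0] + "/"
    # Naively resolve relative links against the index page's directory.
    return [h if h.startswith("http") else base + h for h in hrefs]

# Hypothetical usage (made-up URL):
# for url in pdf_links("http://example.org/acl2006/index.html"):
#     print(url)
```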

05 February 2007

Bag of Words citation

I was recently asked by a colleague if I knew what the first paper was that used the bag-of-words model. I'm pretty certain it would be an IR paper, but I have no idea what it would be. Manning+Schutze and Jurafsky+Martin don't have it. I know tf-idf is due to Sparck Jones, but presumably BOW existed before that. The vector space model is often credited to Salton, which is probably the earliest thing I know of, but my guess is that BOW predated even that. Anyone know a citation?

03 February 2007

To err is human, but what about researchers?

Errors happen and sometimes get into papers. A recent example is the JAIR paper I had with Daniel on Domain Adaptation last year. I actually didn't catch the error myself -- it was caught by someone who was reimplementing the technique. And it's not an insignificant error: essentially, the update equation for the generative parameters is completely botched. If you look through the derivation in the Appendix, it's clear where the error crept in.

Thankfully, this sort of error is essentially a typo. That is, the error was introduced when I was typing up the paper, not when I was doing the research. Why this is important is that it means that the implementation reflects the correct updates: only the paper has the mistake. This means that the experimental results from the paper are valid, contingent on the fact that you rederive the updates yourself, or just ask me what they should be.

I'm writing this post because it's somewhat unclear what to do when such a thing arises. One temptation is to do nothing. I have to admit that I was completely embarrassed when this was pointed out to me. There was a part of me that wanted to ignore it. It seems that this is the wrong approach for a variety of reasons, not the least of which is to make sure that correct information does get out. The question, to some degree, is exactly how to do this. I have a blog, which means I can write an entry like this. I can also put an errata on my web page that points out the errors (I'm writing this up as we "speak"). Given that this is a pub in an online journal, I believe I am able to submit updates, or at least additional appendices, which means that the "official version" can probably be remedied.

But what about conference pubs? If this had appeared in ACL and I didn't have a blog, the situation would be somewhat different (ironically, an earlier version with the correct updates had been rejected from ACL because the derivations were omitted for space and two reviewers couldn't verify them). Also, what if someone hadn't pointed it out to me? I certainly wouldn't have noticed -- that paper was behind me. But then anyone who noticed the errors might dismiss the results on the grounds that the implementation was presumably also incorrect (it's not inconceivable that an erroneous implementation can still get good results). This would not be good either, because the idea in the paper (any paper with such errors) might actually be interesting.

False things are published all the time. The STOC/FOCS community (i.e., the theory community) has a handful of examples... for them, errors are easy to identify because you can prove the opposite of any theorem. I recall hearing of a sequence of several papers that incrementally used results from a previous one, but the first was in error, putting the rest in error (I also recall hearing that many of the subsequent results could be salvaged, despite the ancestral mistake).

I don't know if there's a good solution, given our publication mechanisms (essentially, publish-once-then-appear-in-the-anthology). But I'm pretty sure mine is not the first paper with such errors. At least I hope not :).

28 January 2007

Good News on ACL Reviews

I'm reviewing for ACL again this year (in the machine learning subcomponent). A couple of days ago, I received my notice to start bidding on papers (more on bidding below). The email came with the following note:

Naturally, reviewers have been chosen to assess papers based on their own expertise and outlook. Having said this, we are aware that ACL has sometimes been perceived, especially in recent years, as overemphasizing the pursuit of small incremental improvements of existing methods, perhaps at the expense of exciting new developments. (ACL is solid but boring, is what some people would say.) While we believe that it would be counterproductive to change course radically -- We certainly would not want to sacrifice solidity! -- we would like to encourage you, as a reviewer, to look out particularly for what's novel and interesting, even if this means accepting a paper that has one or two flaws, for example because it has not been evaluated as rigourously as you would like. (It is for you to judge when a flaw becomes a genuine problem.)
I think this is fantastic! (Would someone who is reviewing---i.e., on the PC---for another area confirm or deny that all areas got such a message, or was it just ML?) One difficulty I always have as a reviewer is that I assign scores to different categories (originality, interest, citations, etc.) and then am asked to come up with a meta-score that summarizes all these scores. But I'm not given any instruction on how to weigh the different components. What this note seems to be doing is saying "weigh interest higher than you usually would." In the past two years or so, I've been trying to do this. I think that when you start out reviewing, it's tempting to pick apart little details in a paper, rather than focusing on the big picture. It's been a conscious (and sometimes difficult) process for me to get over this. This explicit note is nice to see because it is essentially saying that my own internal process is a good one (or, at least, whoever wrote it thinks it's a good one).

I also think---in comparison to other conferences I've PCed for or reviewed for---that ACL does a really good job of moderating the bidding process. (For those unfamiliar with bidding... when a paper gets submitted, some area chair picks it up. All papers under an area chair are shown---title plus abstract---to the reviewers in that area. Reviewers can bid "I want to review this," "I don't want to review this," "I am qualified to review this," or "Conflict of interest." There is then some optimization strategy to satisfy reviewers' preferences/constraints.) In comparison to ECML and NIPS in the past, the ACL strategy of dividing into area chairs seems to be a good thing. For ECML, I got a list of about 500 papers to select from, and I had to rank them 1-5 (or 1-10, I don't remember). This was a huge hassle.
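For the curious, here is a toy version of the kind of bid-driven assignment that might happen afterwards; this is purely illustrative (I have no idea what the real conference software does), with bids scored 2 = want, 1 = qualified, 0 = no preference, and conflicts excluded:

```python
# A toy, greedy bid-driven assignment: not the real system, just to make the
# "optimization strategy" step concrete.
def assign(bids, reviewers_per_paper=3, max_load=6):
    """bids[paper][reviewer] -> bid score (2/1/0), or None for a conflict."""
    load = {r: 0 for paper_bids in bids.values() for r in paper_bids}
    assignment = {}
    for paper, paper_bids in bids.items():
        candidates = [r for r, s in paper_bids.items()
                      if s is not None and load[r] < max_load]
        # Prefer higher bids; break ties toward less-loaded reviewers.
        candidates.sort(key=lambda r: (-paper_bids[r], load[r]))
        assignment[paper] = candidates[:reviewers_per_paper]
        for r in assignment[paper]:
            load[r] += 1
    return assignment

# Made-up example with two papers and three reviewers.
bids = {"P1": {"alice": 2, "bob": 1, "carol": None},
        "P2": {"alice": 0, "bob": 2, "carol": 1}}
print(assign(bids, reviewers_per_paper=2, max_load=2))
```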

Of the conferences I'm familiar with, ACL seems to have a pretty decent policy. While I would be thrilled to see them introduce an "author feedback" step, everything else seems to work pretty well. In the past, I've only once gotten into a real argument over a paper with other reviewers --- most of the time, all the reviewer scores have tended to be within +/- 1 or 2 (out of ten) of each other. And for the times when there is an initial disagreement, it is usually resolved quickly (e.g., one reviewer points out some major accomplishment, or major flaw, in the paper that another reviewer missed).

25 January 2007

Error Analysis

I was recently asked if I thought that it would be a good idea if our conferences were to explicitly require an error analysis to be performed and reported in papers. While this is perhaps a bit extreme (more on this later), there are at least two reasons why this would be desirable.

  1. When multiple techniques exist for solving the same problem, and they get reasonably close scores, is this because they are making the same sort of errors or different sorts?
  2. If someone were to build on your paper and try to improve it, where should they look?
There's an additional aspect that comes up, especially once you're in a sort of supervisory role. It's often hard to get students to actually look at outputs, and forcing this as part of the game early on is a good idea. I was the same as a student (and continue to be the same now) -- only two or three out of a dozen or so papers of mine contain an error analysis.
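As a minimal illustration of the first question above, this is the sort of five-minute script I have in mind (hypothetical file names, one label per line):

```python
# Given gold labels and two systems' outputs, how much do their ERRORS overlap?
def load(path):
    with open(path) as f:
        return [line.strip() for line in f]

gold, sys_a, sys_b = load("gold.txt"), load("system_a.txt"), load("system_b.txt")

err_a = {i for i, (g, p) in enumerate(zip(gold, sys_a)) if g != p}
err_b = {i for i, (g, p) in enumerate(zip(gold, sys_b)) if g != p}

print("A errors:", len(err_a), " B errors:", len(err_b),
      " shared:", len(err_a & err_b))
print("error overlap (Jaccard):", len(err_a & err_b) / len(err_a | err_b))

# Eyeball a few cases that only system A gets wrong.
for i in sorted(err_a - err_b)[:5]:
    print(i, "gold:", gold[i], "A:", sys_a[i], "B:", sys_b[i])
```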

This situation reminds me a bit of an excellent talk I saw a few years ago (at ACL or EMNLP in Barcelona, I think) by Mitch Marcus talking about some parsing stuff. I don't really remember much of his talk, except that he kept flashing a single slide that read "Look at the data, stupid." His argument was essentially that we're not going to be able to model what we want to model unless we really understand what's going on in the data representing the phenomena we're trying to study.

An exercise that's also good from this perspective is to do some data annotation yourself. This is perhaps even more painful than doing an error analysis, but it really drives home the difficulties in the task.

Getting back to the point at hand, I don't think it's feasible or even necessarily advisable to require all papers to include an error analysis. But I also think that more papers should contain error analyses than actually do (including some of my own). In the universal struggle to fit papers within an 8-page limit, things have to get cut. It seems that the error analysis is the first thing to get cut (in that it gets cut before the paper is even written -- typically by not being performed).

But, at least for me, when I read a paper, I want to know after the fact what I have learned. Occasionally it's a new learning technique. Or occasionally it's some new useful features. Or sometimes it's a new problem. But if you were to take the most popular problems that I don't work on (MT, parsing, language modeling, ASR, etc.), I really have no idea what the open problems are. I can guess (I think names in MT are hard, as is ordering; I think probably attachment and conjunctions in parsing; I have little idea in LM and ASR), but I'm sure that people who work on these problems (and I really mean work: like, you care about getting better systems, not just getting papers) know. So it would be great to see it in papers.

18 January 2007

Comments on: Mark-up Barking Up the Wrong Tree

The "Last Words" article in the Dec 2006 issue of Computational Linguistics is by Annie Zaenen from PARC. (I hope that everyone can access this freely, but I sadly suspect it is not so... I'm half tempted to reproduce it, since I think it's really worth reading, but I don't want to piss off the Gods at MIT press too much.)

The main point I got from the article is that we really need to pay attention to how annotation is done. A lot of our exuberance for annotating is due to the success of machine learning approaches on the Treebank, so we have since gone out and annotated probably hundreds of corpora for dozens of other tasks. The article focuses on coreference, but I think most of the claims apply broadly. The first point made is that the Treebank annotation was controlled and done by experts (linguists). Many other annotation efforts are not: they are done without real standards and without deep analysis of the task. The immediate problem, then, is that a learning algorithm that "succeeds" on the annotated data is not necessarily solving the right task.

There was a similar story that my ex-office-mate Alex Fraser ran across in machine translation; specifically, in evaluating alignments for machine translation. The basic problem was two-fold. First, the dataset that everyone used (the French-English data from Aachen) was essentially broken, due largely to its distinction between "sure" and "possible" links -- almost every word pair was possibly linked. Second, the standard evaluation metric (alignment error rate, or AER) was itself broken. Together, these made results on this dataset virtually useless. The conclusion is essentially: don't use the Aachen data and don't use AER. That is, don't use them if you want improved MT performance, i.e., if you expect higher alignment performance to imply higher MT performance. (If this sounds familiar, it's perhaps because I mentioned it before.)
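For concreteness, here is the usual AER definition and a toy illustration (my made-up numbers, not Alex's analysis) of why a near-total set of possible links makes the metric nearly meaningless:

```python
# AER as usually defined (Och and Ney), with S = sure links, P = possible
# links, and S a subset of P:
#     AER(A; S, P) = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
# Toy numbers: when nearly every pair is marked "possible", wildly different
# alignments all look perfect.
def aer(A, S, P):
    return 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))

S = {(i, i) for i in range(3)}                       # 3 sure links
P = {(i, j) for i in range(10) for j in range(10)}   # everything marked "possible"

A_minimal = set(S)        # output only the sure links
A_everything = set(P)     # link every single word pair
print(aer(A_minimal, S, P), aer(A_everything, S, P))   # both come out 0.0
```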

I should say I largely agree with the article. Where I differ (perhaps only by epsilon) is that the article seems to pick on annotation for machine learning, but I really don't see any reason why the fact that we're using machine learning matters. The issue is really one of evaluation: we need to know that at the end of the day, when we compute a number, that number matters. We can compute a number intrinsically or extrinsically. In the extrinsic case, we are golden, assuming the extrinsic task is real (turtles upon turtles). In the intrinsic case, the situation is fishy. We have to make sure both that our annotations mean something and that our method of computing error rate means something (a la the error metric types and the use of F for named entities). While I've argued on this blog that the error metric is important, the CL article argues that the annotation is important. I think that as someone who is on the machine learning side, this is easy to forget.