05 July 2014

My ACL 2014 picks...

Usual caveats: didn't see all papers, blah blah blah. Also look for #acl14nlp on twitter -- lots of papers were mentioned there too!

  • A Tabular Method for Dynamic Oracles in Transition-Based Parsing; Yoav Goldberg, Francesco Sartorio, Giorgio Satta.
    Joakim Nivre, Ryan McDonald and I tried Searnifying MaltParser back in 2007 and never got it to work. Perhaps this is because we didn't have dynamic oracles and thought that a silly approximate oracle would be good enough. Guess not. Yoav, Francesco and Giorgio have a nice technique for efficiently computing the best still-achievable dependency parse given some partial, possibly incorrect, parse prefix.
  • Joint Incremental Disfluency Detection and Dependency Parsing; Matthew Honnibal, Mark Johnson
    The basic idea is to do shift-reduce dependency parsing, but allow "rewinds" when a disfluency is predicted. I like that they didn't just go with the most obvious model and actually thought about what might be a good way to solve this problem. Concretely: given "Please book a flight to Boston uh to Denver...", you parse "to Boston" as usual, but when you hit the "uh" you remove the old arcs. You do it this way because detecting the disfluent segment ("to Boston") is much easier once you hit "uh" than when you first hit "to Boston."
  • Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors; Marco Baroni; Georgiana Dinu; Germán Kruszewski
    This paper is summarized best by its own statement, which should win it the award for most honest paper ever: "...we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding [neural network embeddings], despite the almost complete lack of a proper comparison.... Our secret wish was to discover that it is all hype... Instead, we found that the [embeddings] are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."
  • Learning to Automatically Solve Algebra Word Problems; Nate Kushman; Luke Zettlemoyer; Regina Barzilay; Yoav Artzi
    An algebra word problem is something like "I have twice as many dimes as nickels and have $2.53. How many nickels do I have?" Of course usually they actually have an answer. They have a nice, fairly linguistically unstructured approach (i.e., no CCG) for mapping word problems to algebraic formulae and then solving those formulae. Code/data available.
  • Grounded Compositional Semantics for Finding and Describing Images with Sentences; Richard Socher, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng
    This is the follow-on to Richard's workshop paper on text <-> images from this past NIPS. They fixed the main bug in that paper (the use of l2 error, which admits a trivial and uninteresting globally optimal solution) and get nice results. If you're in the langvis space, it's worth a read, even if you don't like neural networks :).
  • From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions; Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier
    I really like the "visual denotations" idea here. Basically you say something like "the set of worlds in which this sentence is true is the set of images in which this sentence is true (i.e., roughly the sentence is entailed by the image)." You can then measure similarity between sentences based on denotations.
  • Kneser-Ney Smoothing on Expected Counts; Hui Zhang; David Chiang
    I didn't actually see this talk or read the paper, but lots of people told me in hallways that this is a very nice result. Basically we like KN smoothing, but it only works for integral counts, which means it's hard to incorporate into something like EM, which produces fractional counts. This paper solves this problem.
  • Linguistic Structured Sparsity in Text Categorization; Dani Yogatama; Noah A. Smith
    Also didn't see this one, but skimmed the paper. The reason I really like this paper is because they took a well known technique in ML land (structured sparsity) and applied it to NLP, but in an interesting way. I.e., it wasn't just "apply X to Y" but rather find a very linguistically clever/interesting way of mapping X to a problem that we care about. Very cool work.
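To make the arithmetic in my dime/nickel example above concrete (this is just my toy example worked by hand, not the Kushman et al. system, which learns the text-to-equation mapping), the equations it would bottom out in look like this:

```python
from fractions import Fraction

# Toy version of the word problem above: d = 2n (twice as many dimes
# as nickels) and 10d + 5n = 253 cents. Substituting gives 25n = 253.
n = Fraction(253, 25)
d = 2 * n

print(n, d)  # 253/25 506/25 -- not whole numbers, so (as joked above)
             # this particular problem has no valid answer
```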
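The visual denotations idea from the Young et al. paper is easy to sketch: represent a sentence by the set of images it truthfully describes, then compare sets. (The sentences, image ids, and the use of Jaccard below are my own illustration; the paper defines its own similarity metrics over denotations.)

```python
# Toy "visual denotation" similarity: each sentence is represented by
# the set of images (here, just ids) for which it is true. Note the
# entailment intuition: the more specific sentence's denotation is a
# subset of the more general sentence's denotation.
denotation = {
    "a dog runs on grass":     {"img1", "img2", "img5"},
    "an animal runs outdoors": {"img1", "img2", "img5", "img7"},
    "a cat sleeps on a couch": {"img3"},
}

def jaccard(s1, s2):
    a, b = denotation[s1], denotation[s2]
    return len(a & b) / len(a | b)

print(jaccard("a dog runs on grass", "an animal runs outdoors"))  # 0.75
print(jaccard("a dog runs on grass", "a cat sleeps on a couch"))  # 0.0
```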
Overall I really liked the conference, thanks to everyone who helped put it together. I can't help but notice that about half of my picks above were actually TACL papers. I suspect this will be more and more true over time.

Please add comments with your favorite papers that I missed!


Nikos said...

Regarding word embeddings, the predictive approach has been known to work better for some time now.
I was hoping the paper would compare the neural embeddings with CCA embeddings, which are known to be optimal under an HMM generative assumption. Given last year's NAACL tutorial on spectral learning for NLP, I would expect the NLP crowd to be largely familiar with these methods and to include them as baselines.

Regarding structured sparsity, Dani gave a similar talk at ICML, focusing on the case where only a small number of sentences are informative in a review or other piece of text. This and a couple of other ICML talks confirmed my suspicion that ADMM is much more valuable as a method to derive fast custom solvers for non-standard optimization problems, rather than a parallel optimization strategy.

hal said...

@Nikos: on word embeddings, perhaps this is worth a separate post, but do you have refs? I don't follow _that_ closely, but I was kind of under the impression that the jury was still out. Totally agree about CCA, though my impression was that (a) deep networks came along and basically killed CCA (for sociological reasons, perhaps) and (b) CCA-based HMM stuff never really showed itself to be better than vanilla HMM stuff in terms of actual end tasks (the same is true of LDA stuff AFAIK). I could be wrong, though.

Thanks for the ICML pointers -- any other suggestions? (Since obviously I couldn't make it this year...)

Nikos said...

I have not done a thorough comparison on word embeddings myself, but I had done some casual experiments using precomputed CCA embeddings (google for "eigenwords") and a silly analogical inference script for "A is to B as C is to " (which was just 2-nearest neighbor around the embedding of B-A+C). As you might expect, you find the same stuff as the neural embeddings, e.g. "France is to Paris as England is to ", returns London.
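For the curious, that "silly analogical inference script" amounts to something like the following (the tiny hand-made vectors are mine, purely for illustration; a real run would load precomputed eigenwords or word2vec vectors):

```python
import math

# Hand-made 3-d "embeddings", stand-ins for real precomputed vectors.
emb = {
    "france":  (1.0, 0.0, 0.2),
    "paris":   (1.0, 1.0, 0.2),
    "england": (0.0, 0.0, 0.9),
    "london":  (0.0, 1.0, 0.9),
}

def analogy(a, b, c):
    """A is to B as C is to ?  Nearest neighbor of B - A + C,
    excluding the query words themselves."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(emb[w], target))

print(analogy("france", "paris", "england"))  # london
```

Excluding the query words is what the 2-nearest-neighbor trick buys you: the literal nearest neighbor of B - A + C is very often B or C itself.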

Not sure what exactly you're looking for in references; for my purposes I use the two-step CCA procedure from ICML 2012. Under an HMM assumption, the past and the future are conditionally independent views of the current state, so the old multiview CCA results [pdf] apply. The NAACL 2013 tutorial contains many other references.

Though I haven't actually tried it, a word representation evaluation dataset was just released by MSR Asia, and it might help resolve the issue of whose embedding is better. See the last talk at this ICML workshop.

For ICML, I mostly agree with John's list and Paul's list. Ken Tran has written a summary of the best papers. A couple of interesting ones not in those lists are the Admixture of Poisson MRFs (cool pictures) and the Elementary Estimators for High-Dimensional Linear Regression, both from Pradeep's group. The latter is a very simple/cheap way to do sparse modeling with the same guarantees as L1.

An interesting new workshop was AutoML (which became the hyperparameter optimization workshop this year, but aspires to be much more than that).

hal said...

@Nikos: thanks for the pointers. the WordRep thing is interesting, but I'm really more curious about what's useful for downstream tasks, rather than for the somewhat made-up evaluation techniques they propose. I wonder if hosting a shared evaluation for generic word representation learning would interest people...

and thanks for the icml pointers! automl looked really cool and I wanted to submit, but the ICML/ACL timing this year sucked :(

Anonymous said...

As to Kneser-Ney, I like Yee Whye Teh's derivation as approximate Pitman-Yor:


Superficially, it makes sense when you think about Kneser-Ney as additive smoothing allowing negatives and what a Dirichlet parameter less than 1 means in the way of "prior counts".

I really like Frank Wood's work with Yee Whye that formulated very general n-gram-like models using Pitman-Yor; here's a summary:


hal said...

@bob (?): yup, I like Yee Whye's stuff too, and also Sharon, Tom and Mark's roughly simultaneous derivation: http://homepages.inf.ed.ac.uk/sgwater/papers/nips05.pdf (NIPS 2005, "Interpolating Between Types and Tokens by Estimating Power-Law Generators"). I haven't read the Zhang and Chiang paper in depth, but I vaguely remember them saying that theirs isn't just a simple application of the PY results.

MaNaaL said...

Hi Hal,

Regarding the "Don't count, predict!" paper: there is upcoming work which suggests exactly the opposite. It has been shown that count-based vectors, when constructed carefully, perform better than or competitively with the word2vec vectors on a variety of tasks.

Also, the authors only compared word2vec vectors against the count-based vectors. There is another set of "predict" vectors, like the recurrent neural language model ones, which are much worse than both word2vec and the count-based vectors.

Let's wait till EMNLP 2014. :)