ACL/EMNLP just concluded. Overall, I thought both conferences were a success, though by now I am quite ready to return home. Prague was very nice. I especially enjoyed Tom Mitchell's invited talk on linking fMRI experiments to language. They actually use lexical semantic information to identify which words people are thinking about while their brains are being scanned. Scary mind-reading stuff going on here. I think this is a very interesting avenue of research---probably not one I'll follow myself, but one that I'm really happy someone is pursuing.
This is a really long post, but I hope people (both those who attended and those who didn't) find it useful. Here are some highlights, more or less by theme. (As usual, there are lots of papers I'm not mentioning, most of them because I didn't see them.)
Machine Translation:
The overall theme at the Stat-MT workshop was that it's hard to translate out of domain. I didn't see any conclusive evidence that we've figured out how to do this well. Since domain adaptation is a topic I'm interested in, I'd have liked to see something. Probably the most interesting paper I saw here was about
using dependency order templates to improve translation, but I'm not convinced that this is actually helping much with the domain issues: the plots (e.g., Fig. 7.1) seem to indicate that the improvement is independent of domain when compared to the treelet system. They do beat the phrase-based system, though. There was a cute talk about trying
character-based models for MT. It doesn't seem like this does much of interest, and it only really works for related languages, but it was nice to see someone try. The idea was echoed later at EMNLP, but none of the talks there really stood out for me. The only other theme I picked up at Stat-MT (I didn't stay all day) was that a lot of people are doing some form of syntactic MT now. Phrase-based seems to be on its way out (modulo the next paragraph).
There were also a lot of talks using Philipp Koehn's new
Moses translation system, at Stat-MT as well as at ACL and EMNLP. I won't link you to all of them because they all tried very similar things, but Philipp's
own paper is probably a good reference. The idea is to do factored translation (à la factored language modeling) by splitting both the input words and the output words into factors (e.g., lemma + morphology) and translating each independently. The plus is that most of your algorithms for phrase-based translation remain the same, and you can still use max-BLEU training. These are also the cons. It seems to me (being more on the MT side) that what we need to do is rid ourselves of max-BLEU and just switch to a purely discriminative approach with tons of features, rather than a linear combination of simple generative models.
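In case the factored setup is unclear, here's a toy sketch of the idea. Every name in it is hypothetical, and real factored models in Moses operate on phrases inside the decoder, not isolated words; this just shows the decompose-translate-recombine shape:

```python
# Toy sketch of factored translation (all names hypothetical).
def translate_factored(word, analyzer, lemma_table, morph_table, generator):
    # split the source word into factors, e.g. "houses" -> ("house", "NN+pl")
    lemma, morph = analyzer.lemma(word), analyzer.morph(word)
    # translate each factor independently, each with its own table...
    tgt_lemma = lemma_table[lemma]
    tgt_morph = morph_table[morph]
    # ...then a generation step fuses the target factors into a surface form
    return generator.surface(tgt_lemma, tgt_morph)
```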
There were also a lot of word-alignment talks. The most conclusive, in my mind, was
Alex Fraser's (though I should be upfront about bias: he was my officemate for 5 years). He actually introduced a
new generative alignment model (i.e., one that does not have "IBM" in the name) that accounts for phrases directly in the model (no more symmetrization, either). And it helps quite a bit. There was also a
paper on alignments tuned for syntax by John DeNero and Dan Klein that I liked (I tried something similar previously in summarization, but I think their model makes more sense). (A second bias: I have known John since I was 4 years old.)
The Google folks had a paper on training language models on tera-word corpora. The clever trick here is that if your goal is an LM in a linear model (see below), it doesn't matter if it's normalized or not. This makes the estimation much easier. They also (not too surprisingly) find that when you have lots of words, backoff doesn't matter as much. Now, if only the Google n-gram corpus included low counts. (This paper, as well as many other Google papers both within and without MT, makes a big deal of how to do all this computation in a map-reduce framework. Maybe it's just me, but I'd really appreciate
not reading about this anymore. To a functional programming languages guy, map-reduce is just map and fold. When I've written my code as a map-fold operation, I don't put that in my papers... should I?)
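To make the unnormalized-LM trick concrete, here's a minimal sketch of the kind of backoff score I believe they use (the fixed 0.4 penalty is the constant I remember from the paper; treat the rest as my reconstruction, not their code):

```python
# counts: dict mapping word tuples of every order to corpus counts;
# total: the total number of word tokens (for the unigram base case).
def sb_score(ngram, counts, total):
    """Relative frequency if the n-gram was seen, else back off one order
    with a fixed penalty. The result is not a probability -- but as one
    feature in a linear model, it doesn't need to be."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return 0.4 * sb_score(ngram[1:], counts, total)
```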
The talk that probably got the most attention at EMNLP was on
WSD improving machine translation by Marine Carpuat and Dekai Wu. I think Marine and Dekai must have predicted a bit of difficulty from the audience, because the talk was put together a bit tongue-in-cheek, but overall it came across very well. The keys to getting WSD to help are: (a) integrate it in the decoder, (b) do WSD on phrases, not just words, and (c) redefine the WSD task :). Okay, (c) is not quite fair. What they do is essentially train a classifier to do phrase prediction, rather than just using a t-table or a phrase table. (Actually, Berger and the Della Pietras did this back in the 90s, but for word translation.) Daniel Marcu complimented them nicely, saying that he would go back to LA and tell the group that he had seen a very nice talk about training a classifier to predict phrases based on more global information, but that he might not mention it was called WSD. They actually had a backup slide prepared for this exact question. Personally, I wouldn't have called it WSD if I had written the paper, but I don't think it's necessarily wrong to. I liken it to David Chiang's
Hiero system: is it syntactic? If you say yes, I think you have to admit that Marine and Dekai's system uses WSD. Regardless of what you call it, I think this paper may have quite a bit of impact.
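To make "phrase prediction as classification" concrete, here's a hypothetical sketch; the features and the learner are stand-ins of mine, not what Marine and Dekai actually used:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(sent, i, j, window=3):
    """Features for the source phrase sent[i:j] plus surrounding words --
    exactly the context a bare phrase table throws away."""
    feats = {"phrase=" + " ".join(sent[i:j]): 1.0}
    for k in range(max(0, i - window), i):
        feats["L:" + sent[k]] = 1.0
    for k in range(j, min(len(sent), j + window)):
        feats["R:" + sent[k]] = 1.0
    return feats

def train_phrase_classifier(examples):
    """examples: (source sentence, span start, span end, target phrase).
    At decoding time, rank candidate target phrases by classifier score
    instead of by a static phrase-table entry."""
    vec = DictVectorizer()
    X = vec.fit_transform([context_features(s, i, j) for s, i, j, _ in examples])
    y = [tgt for _, _, _, tgt in examples]
    clf = LogisticRegression(max_iter=200).fit(X, y)
    return vec, clf
```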
(p.s., MT people, listen up. When you have a model of the form "choose the translation by an argmax over a sum over features of a weight times a feature value", please stop referring to it as a log-linear model. If nothing gets exponentiated or normalized, the "log" isn't doing anything: it's just a linear model.)

Machine Learning:
Daisuke Okanohara and Jun'ichi Tsujii presented a paper on
learning discriminative language models with pseudo-negative samples. Where do they get their negative samples? They're just "sentences" produced by a trigram language model! I find it hard to believe no one has done this before because it's so obvious in retrospect, but I liked it. However, I think they underplay the similarity to the
whole-sentence maxent language models from 2001. Essentially, when one trains a WSMELM, one has to sample to compute the partition function, and the samples are generated by a base language model, typically a trigram. If you're willing to interpret those samples as negative examples, you end up with something quite similar.
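The data construction is the whole trick, so here's a sketch. The binary perceptron is my simplification (the authors used a fancier online large-margin learner), but the positive/pseudo-negative setup is the point:

```python
from collections import defaultdict

def ngram_feats(sent, n=3):
    """Bag of n-grams over a padded sentence."""
    padded = ["<s>"] * (n - 1) + list(sent) + ["</s>"]
    feats = defaultdict(float)
    for i in range(len(padded) - n + 1):
        feats[tuple(padded[i:i + n])] += 1.0
    return feats

def train(real_sents, sample_from_trigram, epochs=5):
    """real_sents: corpus sentences (positives); sample_from_trigram: a
    callable drawing a 'sentence' from a trigram LM (pseudo-negatives)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for pos in real_sents:
            for sent, y in ((pos, +1), (sample_from_trigram(), -1)):
                feats = ngram_feats(sent)
                if y * sum(w[f] * v for f, v in feats.items()) <= 0:
                    for f, v in feats.items():
                        w[f] += y * v  # mistake-driven update
    return w
```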
There were two papers on an application of the matrix-tree theorem to dependency parsing, one by
the MIT crowd, the other by the
Smiths. A clever application (by both sets of authors) of the matrix-tree theorem essentially allows you to efficiently (in cubic time) compute marginals over dependency links in non-projective trees. I think both papers are good, and if this is related to your area, it's worth reading both. My only nitpick is the too-general title of the MIT paper :).
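Roughly, the construction looks like this. This is my reconstruction (the indexing and root-handling conventions differ between the two papers), but it shows where the cubic time comes from: one determinant/inverse of a Laplacian-like matrix.

```python
import numpy as np

def edge_marginals(rho, theta):
    """Marginals over dependency links via the matrix-tree theorem.
    rho[m]      = exp-score of word m being the root
    theta[h, m] = exp-score of the edge h -> m (diagonal ignored)
    Cubic time: a single matrix inverse dominates."""
    n = len(rho)
    theta = theta.copy()
    np.fill_diagonal(theta, 0.0)
    # Laplacian: column sums on the diagonal, negated scores off it
    L = np.diag(theta.sum(axis=0)) - theta
    L[0, :] = rho                  # first row carries the root scores
    B = np.linalg.inv(L)           # det(L) is the sum over all trees
    mu = np.zeros((n, n))          # mu[h, m] = marginal of edge h -> m
    for h in range(n):
        for m in range(n):
            if h != m:
                mu[h, m] = theta[h, m] * ((m != 0) * B[m, m] - (h != 0) * B[m, h])
    return mu, rho * B[:, 0]       # edge marginals, root marginals
```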
John Blitzer, Mark Dredze and Fernando Pereira had a very nice paper on an application of
SCL (a domain adaptation technique) to
sentiment classification. Sentiment is definitely a hot topic (fad topic?) right now, but it's cool to see some fancy learning stuff going on there. If you know SCL, you know roughly what they did, but the paper is a good read.
One of my favorite papers at ACL was on
Dirichlet process models for coreference resolution by Aria Haghighi and Dan Klein. I'd say you should probably read this paper, even if it's not your area.
One of my favorite papers at EMNLP was on
bootstrapping for dependency parsing by David Smith and Jason Eisner. They use a clever application of Rényi entropy to obtain a reasonable bootstrapping algorithm. I wasn't aware of this, but during the question period it was pointed out that these entropy-based measures can apparently do funky things (make you overconfident in wrong predictions). But I think that's at least somewhat true of pretty much all semi-supervised and bootstrapping models.
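For reference, the quantity itself is simple; how they wire it into the bootstrapping objective is in the paper, not this snippet:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """H_a(p) = log(sum_i p_i**a) / (1 - a). Tends to Shannon entropy as
    a -> 1 and to -log(max_i p_i) (min-entropy) as a -> infinity."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))   # Shannon limit
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))
```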
Random other stuff:
I learned about system combination from the BBN talk. The idea here is to get lots of outputs from lots of models and try to combine them in a meaningful way. The high-level approach for translation is to align all the outputs using some alignment technique, then choose one as a pivot. For each aligned phrase in the pivot, try replacing it with the corresponding phrase from one of the other outputs. It's kind of crazy that this works, but it helps by at least a few BLEU points (which I'm told is a lot). On principle I don't like the idea; it seems like just a whole lot of engineering. But if you're in the "get good results" game, it seems like as good a strategy as anything else. (I'm also curious: although currently quite ad hoc, this seems a lot like using an error-correcting output code. Does anyone know if it has been formalized as such? Do you get any gains out of that?)
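Here's the pivot-substitution idea as I understood it from the talk, as a very rough sketch. The align and lm_score helpers are assumed, outputs are token lists, and a real system would search over swaps jointly (and with more than an LM score) rather than greedily:

```python
def combine(outputs, align, lm_score):
    """outputs: token lists from several systems for one source sentence."""
    pivot = outputs[0]                       # one system's output as skeleton
    best, best_score = pivot, lm_score(pivot)
    for other in outputs[1:]:
        # align() yields ((i, j), repl): a pivot span i:j and the phrase
        # the other system produced for the same source material
        for (i, j), repl in align(pivot, other):
            cand = pivot[:i] + repl + pivot[j:]
            if lm_score(cand) > best_score:  # keep a swap only if it helps
                best, best_score = cand, lm_score(cand)
    return best
```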
My final blurb is to plug a paper by MSR on
single-document summarization. Yes, that's right, single-document. And they beat the baseline. The cool thing about this paper is that they use the "highlights" posted with many CNN news articles as training data. Not only are these not extracts, they're also "out of order." My sense from talking to the authors is that most of the time a single highlight corresponds to one sentence, but simplified. I actually downloaded a bunch of this data a month or so ago (it's annoying -- CNN says that you can only download 50 articles per day and you have to do it "manually"; it's unclear whether this is actually enforceable or whether it would fall under fair use, but I can understand that, from Microsoft's perspective, it's better to be safe than sorry). I was waiting to collect a few more months of data and then release it for people to use, so check back later. (I couldn't quite tell if MSR was going to release their version or not... if so, we should probably talk and make sure we don't duplicate effort.)
Wrapping Up:
Since we're ending on a summarization note, here's a challenge: create a document summarization system that will generate the above post from the data in the anthology. (Okay, to be fair, you can exclude the information that's obtained from conversations and questions. But if we start videotaping ACL, then that should be allowable too.)
I put
pictures from Prague up on my web site; feel free to email me if you want a high-res version of any of them. Also, if I was talking to you at the conference about sequential Monte Carlo, email me -- I have new info for you, but I can't remember who you are :).