02 July 2007

ACL and EMNLP 2007 Report

ACL/EMNLP just concluded. Overall, I thought both conferences were a success, though by now I am quite ready to return home. Prague was very nice. I especially enjoyed Tom Mitchell's invited talk on linking fMRI experiments to language. They actually use lexical semantic information to identify which words people are thinking about while their brains are being scanned. Scary mind-reading stuff going on here. I think this is a very interesting avenue of research---probably not one I'll follow myself, but one that I'm really happy someone is pursuing.

This is a really long post, but I hope people (both who attended and who didn't) find it useful. Here are some highlights, more or less by theme (as usual, there are lots of papers I'm not mentioning, most of them because I didn't see them):

Machine Translation:

The overall theme at the Stat-MT workshop was that it's hard to translate out of domain. I didn't see any conclusive evidence that we've figured out how to do this well. Since domain adaptation is a topic I'm interested in, I'd like to have seen something. Probably the most interesting paper I saw here was about using dependency order templates to improve translation, but I'm not convinced that it's actually helping much with the domain issues: the plots (e.g., Fig 7.1) seem to indicate that the improvement is independent of domain when compared to the treelet system. They do do better than phrases, though. There was a cute talk about trying character-based models for MT. It doesn't seem like this does much of interest, and it only really works for related languages, but it was nice to see someone try. This idea was echoed later in EMNLP, but none of the talks there really stood out for me. The only other theme that I picked up at Stat-MT (I didn't stay all day) was that a lot of people are doing some form of syntactic MT now. Phrase-based seems to be on its way out (modulo the next paragraph).

There were also a lot of talks using Philipp Koehn's new Moses translation system, both at Stat-MT as well as at ACL and EMNLP. I won't link you to all of them because they all tried very similar things, but Philipp's own paper is probably a good reference. The idea is to do factored translation (à la factored language modeling) by splitting both the input words and the output words into factors (e.g., lemma+morphology) and translating each independently. The plus is that most of your algorithms for phrase-based translation remain the same, and you can still use max-Bleu training. These are also the cons. It seems to me (being more on the MT side) that what we need to do is rid ourselves of max-Bleu, and then just switch to a purely discriminative approach with tons of features, rather than a linear combination of simple generative models.

There were also a lot of word-alignment talks. The most conclusive, in my mind, was Alex Fraser's (though I should be upfront about bias: he was my officemate for 5 years). He actually introduced a new generative alignment model (i.e., one that does not have "IBM" in the name) that accounts for phrases directly in the model (no more symmetrization, either). And it helps quite a bit. There was also a paper on alignments tuned for syntax by John DeNero and Dan Klein that I liked (I tried something similar previously in summarization, but I think their model makes more sense). (A second bias: I have known John since I was 4 years old.)

The Google folks had a paper on training language models on tera-word corpora. The clever trick here is that if your goal is an LM in a linear model (see below), it doesn't matter whether it's normalized or not. This makes the estimation much easier. They also (not too surprisingly) find that when you have lots of words, backoff doesn't matter as much. Now, if only the Google n-gram corpus included low counts. (This paper, as well as many other Google papers both within and without MT, makes a big deal of how to do all this computation in a map-reduce framework. Maybe it's just me, but I'd really appreciate not reading this anymore. As a functional programming languages guy, map-reduce is just map and fold. When I've written my code as a map-fold operation, I don't put it in my papers... should I?)
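
(For the record, here's roughly what I mean, as a toy sketch in plain Python rather than anything Google-scale: bigram counting written as a map over sentences followed by a fold over the partial counts. The tiny corpus and everything else here is made up purely for illustration.)

```python
from collections import Counter
from functools import reduce

def map_counts(sentence):
    """Map step: emit bigram counts for a single sentence."""
    toks = sentence.split()
    return Counter(zip(toks, toks[1:]))

def fold_counts(total, partial):
    """Fold step: merge one node's partial counts into the running total."""
    total.update(partial)
    return total

corpus = ["the cat sat on the mat", "the dog sat on the log"]
bigram_counts = reduce(fold_counts, map(map_counts, corpus), Counter())
print(bigram_counts[("sat", "on")])  # -> 2
```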

The talk that probably got the most attention at EMNLP was on WSD improving machine translation by Marine Carpuat and Dekai Wu. I think Marine and Dekai must have predicted a bit of difficulty from the audience, because the talk was put together a bit tongue in cheek, but overall it came across very well. The key to getting WSD to help is: (a) integrate it in the decoder, (b) do WSD on phrases, not just words, and (c) redefine the WSD task :). Okay, (c) is not quite fair. What they do is essentially train a classifier to do phrase prediction, rather than just using a t-table or a phrase-table. (Actually, Berger and the Della Pietras did this back in the 90s, but for word translation.) Daniel Marcu nicely commented that he would go back to LA and tell the group that he saw a very nice talk about training a classifier to predict phrases based on more global information, but that he might not mention that it was called WSD. They actually had a backup slide prepared for this exact question. Personally, I wouldn't have called it WSD if I had written the paper. But I don't think it's necessarily wrong to. I liken it to David Chiang's Hiero system: is it syntactic? If you say yes, I think you have to admit that Marine and Dekai's system uses WSD. Regardless of what you call it, I think this paper may have quite a bit of impact.
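
To make the "classifier that predicts phrases" idea a bit more concrete, here is a tiny sketch of the general flavor (my own toy reconstruction, not their system): for an ambiguous source word or phrase, train a classifier over its candidate target phrases using source-side context features, and use its predictions in place of raw phrase-table probabilities. The features, the made-up French/English toy data, and the use of scikit-learn are all my own illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(tokens, i):
    """Source-side context features for the (single-word) "phrase" at position i."""
    return {
        "word=" + tokens[i]: 1.0,
        "prev=" + (tokens[i - 1] if i > 0 else "<s>"): 1.0,
        "next=" + (tokens[i + 1] if i + 1 < len(tokens) else "</s>"): 1.0,
    }

# Toy training data: (source sentence, position of the ambiguous word, its translation).
train = [
    ("the river bank eroded".split(), 2, "rive"),
    ("my bank account number".split(), 1, "banque"),
    ("the steep bank of the river".split(), 2, "rive"),
    ("the bank raised interest rates".split(), 1, "banque"),
]

vec = DictVectorizer()
X = vec.fit_transform([context_features(toks, i) for toks, i, _ in train])
y = [target for _, _, target in train]
clf = LogisticRegression().fit(X, y)

# At decoding time, score the candidate target phrases given the source context.
feats = vec.transform([context_features("sat by the bank of the river".split(), 3)])
for target, prob in zip(clf.classes_, clf.predict_proba(feats)[0]):
    print(target, round(float(prob), 3))
```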

(p.s., MT people, listen up. When you have a model of the form "choose the translation by an argmax over a sum over features of a weight times a feature value", please stop referring to it as a log-linear model. It's just a linear model.)
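
In code, the model in question is nothing more than the following toy sketch (the feature names, weights, and candidates are made up). Exponentiating and normalizing the score would turn it into a log-linear probability model, but the argmax is the same either way:

```python
def score(weights, features):
    """Linear score: a weighted sum of feature values."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def decode(weights, candidates):
    """Choose the translation with the highest linear score."""
    return max(candidates, key=lambda c: score(weights, c["features"]))

weights = {"lm": 0.5, "tm": 0.3, "length": -0.1}
candidates = [
    {"text": "the house is small", "features": {"lm": -2.1, "tm": -1.3, "length": 4}},
    {"text": "the small house",    "features": {"lm": -2.5, "tm": -0.9, "length": 3}},
]
print(decode(weights, candidates)["text"])
```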

Machine Learning:

Daisuke Okanohara and Jun'ichi Tsujii presented a paper on learning discriminative language models with pseudo-negative samples. Where do they get their negative samples? They're just "sentences" produced by a trigram language model! I find it hard to believe no one has done this before because it's so obvious in retrospect, but I liked it. However, I think they underplay the similarity to the whole-sentence maxent language models from 2001. Essentially, when one trains a WSMELM, one has to do sampling to compute the partition function. The samples are actually generated by a base language model, typically a trigram. If you're willing to interpret the partition function as a collection of negative examples, you end up with something quite similar.
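
The basic recipe, as I understand it (this is my own toy reconstruction, not their actual setup, and their classifier is fancier than anything suggested here): sample "sentences" from a trigram model trained on real text, call those negatives, call the real sentences positives, and hand the lot to a binary classifier.

```python
import random
from collections import defaultdict

def train_trigram(corpus):
    """Trigram continuation lists with <s> padding, for sampling pseudo-negatives."""
    model = defaultdict(list)
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            model[(a, b)].append(c)
    return model

def sample_sentence(model, max_len=20):
    """Sample a pseudo-negative "sentence" from the trigram model."""
    hist, out = ("<s>", "<s>"), []
    while len(out) < max_len:
        nxt = random.choice(model[hist])
        if nxt == "</s>":
            break
        out.append(nxt)
        hist = (hist[1], nxt)
    return " ".join(out)

corpus = ["the cat sat on the mat", "the dog chased the cat", "a dog sat on a log"]
trigram = train_trigram(corpus)
positives = [(s, 1) for s in corpus]
negatives = [(sample_sentence(trigram), 0) for _ in corpus]
# positives + negatives is now training data for any binary classifier.
print(negatives)
```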

There were two papers on an application of the matrix-tree theorem to dependency parsing, one by the MIT crowd, the other by the Smiths. A clever application (by both sets of authors) of the m-t theorem essentially allows you to efficiently (cubic time) compute marginals over dependency links in non-projective trees. I think both papers are good and if this is related to your area, it's worth reading both. My only nitpick is the too-general title of the MIT paper :).
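
For the curious, the core computation both papers exploit looks roughly like the following numpy sketch (the simpler multi-root variant, with made-up random scores; the papers properly handle details like the single-root constraint that I'm glossing over): build a Laplacian-style matrix whose determinant is the partition function, and read the edge marginals off its inverse.

```python
import numpy as np

def edge_marginals(scores, root_scores):
    """Edge marginals for a non-projective dependency model via the directed
    matrix-tree theorem (multi-root variant); O(n^3) because of the inverse."""
    A = np.exp(scores)                  # A[h, m]: weight of edge head h -> modifier m
    np.fill_diagonal(A, 0.0)
    r = np.exp(root_scores)             # r[m]: weight of attaching word m to the root
    L = np.diag(r + A.sum(axis=0)) - A  # Laplacian with the root row/column struck out
    Linv = np.linalg.inv(L)             # log Z = log det(L); marginals are its gradient
    mu = A * (np.diag(Linv)[None, :] - Linv.T)   # mu[h, m] = P(edge h -> m)
    mu_root = r * np.diag(Linv)                  # mu_root[m] = P(edge root -> m)
    return mu, mu_root

rng = np.random.default_rng(0)
n = 4
mu, mu_root = edge_marginals(rng.normal(size=(n, n)), rng.normal(size=n))
print(mu_root + mu.sum(axis=0))         # each word has exactly one head: all ones
```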

John Blitzer, Mark Dredze and Fernando Pereira had a very nice paper on an application of SCL (a domain adaptation technique) to sentiment classification. Sentiment is definitely a hot topic (fad topic?) right now, but it's cool to see some fancy learning stuff going on there. If you know SCL, you know roughly what they did, but the paper is a good read.

One of my favorite papers at ACL was on dirichlet process models for coreference resolution by Aria Haghighi and Dan Klein. I'd say you should probably read this paper, even if it's not your area.

One of my favorite papers at EMNLP was on bootstrapping for dependency parsing by David Smith and Jason Eisner. They use a clever application of Rényi entropy to obtain a reasonable bootstrapping algorithm. I was not aware of this, but during the question period it was raised that these entropy-based measures can apparently do funky things (make you too confident in wrong predictions). But I think this is at least somewhat true for pretty much all semi-supervised or bootstrapping models.
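
To unpack the entropy bit a little (this is my own toy illustration, not their algorithm, which as I understand it works the entropy into the training objective rather than just thresholding it): Rényi entropy generalizes Shannon entropy, and low entropy of a model's output distribution can serve as a confidence score for deciding which automatically labeled examples to add back as training data. The posteriors below are invented.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p); the alpha -> 1 limit is Shannon entropy."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

def pick_confident(posteriors, alpha=0.5, k=2):
    """Indices of the k examples whose output distributions have the lowest
    Rényi entropy, i.e., the ones the model is most confident about."""
    ranked = sorted(range(len(posteriors)), key=lambda i: renyi_entropy(posteriors[i], alpha))
    return ranked[:k]

# Toy posteriors over candidate parses for four unlabeled sentences.
posteriors = [
    [0.96, 0.02, 0.02],   # peaked: looks confident (possibly confidently wrong)
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
print(pick_confident(posteriors))      # -> [0, 2]
```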

Random other stuff:

I learned about system combination from the BBN talk. The idea here is to get lots of outputs from lots of models and try to combine the outputs in a meaningful way. The high-level approach for translation is to align all the outputs using some alignment technique. Now, choose one as a pivot. For each aligned phrase in the pivot, try replacing it with the corresponding phrase from one of the other outputs. It's kinda crazy that this works, but it helps at least a few Bleu points (which I'm told is a lot). On principle I don't like the idea. It seems like just a whole lot of engineering. But if you're in the "get good results" game, it seems like just as good a strategy as anything else. (I'm also curious: although currently quite ad hoc, this seems a lot like doing an error-correcting output code. Does anyone know if it has been formalized as such? Do you get any gains out of this?)
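
Here's the flavor in a toy sketch (which conveniently skips the genuinely hard part, namely producing the alignment between outputs in the first place): assume the system outputs have already been aligned slot-by-slot against the pivot, and for each slot keep the pivot's phrase unless more systems agree on some other phrase. The example outputs are invented.

```python
from collections import Counter

def combine(aligned_outputs):
    """Toy system combination: aligned_outputs[i][j] is system i's phrase for slot j
    (system 0 is the pivot). For each slot, keep the pivot's phrase unless more
    systems vote for some other phrase."""
    pivot = aligned_outputs[0]
    combined = []
    for j, pivot_phrase in enumerate(pivot):
        votes = Counter(system[j] for system in aligned_outputs)
        best, count = votes.most_common(1)[0]
        combined.append(best if count > votes[pivot_phrase] else pivot_phrase)
    return " ".join(combined)

outputs = [
    ["the president", "said", "on tuesday"],      # pivot system
    ["the president", "stated", "on tuesday"],
    ["president", "stated", "tuesday"],
]
print(combine(outputs))   # -> "the president stated on tuesday"
```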

My final blurb is to plug a paper by MSR on single-document summarization. Yes, that's right, single-document. And they beat the baseline. The cool thing about this paper is that they use the "highlights" put up on many CNN news articles as training. Not only are these not extracts, but they're also "out of order." My sense from talking to the authors is that most of the time a single highlight corresponds to one sentence, but is simplified. I actually downloaded a bunch of this data a month ago or so (it's annoying -- CNN says that you can only download 50 per day and you have to do it "manually" -- it's unclear that this is actually enforceable or if it would fall under fair use, but I can understand from Microsoft's perspective it's better safe than sorry). I was waiting to collect a few more months of this data and then release it for people to use, so check back later. (I couldn't quite tell if MSR was going to release their version or not... if so, we should probably talk and make sure that we don't waste effort.)

Wrapping Up:

Since we're ending on a summarization note, here's a challenge: create a document summarization system that will generate the above post from the data in the anthology. (Okay, to be fair, you can exclude the information that's obtained from conversations and questions. But if we start videotaping ACL, then that should be allowable too.)

I put pictures from Prague up on my web site; feel free to email me if you want a high-res version of any of them. Also, if I was talking to you at the conference about sequential Monte Carlo, email me -- I have new info for you, but I can't remember who you are :).

22 comments:

hal said...

Just noticed that Fernando has already listed his favorite papers (or at least the ones he's reading).

Fernando Pereira said...

Regarding Smith and Eisner's paper, I'm puzzled by the following. In structured problems with many features, it seems to be easy to set the parameters to take the entropy down to zero by exploiting features that are on for specific combinations of inputs and outputs. I asked David about this and he said it wasn't a problem because their algorithm did early stopping. But then the regularization is really in the early stopping, not in the entropy term.

Yoav said...

I don't quite understand the fuss about Smith and Eisner's bootstrapping paper -- it might have a non-trivial application of something-or-other, but overall the results are useless.

To be honest, I must say I didn't fully read the paper (yet?), but I was at the presentation, and I have skimmed it again just to be sure I remembered correctly. Feel free to thrash me if my analysis is wrong.

From what I understand, they show how to improve horrible, useless parsers trained on 100 sentences into slightly less horrible but still useless parsers by complicated self-training.

They also show that this is not additive -- when the initial parser was trained on 1000 sentences, the initial scores were much higher and the improvement from the same self-training procedure was minimal (see Table 2).

That is, their self training method works only in situations in which the gains from it are meaningless.

Furthermore, if I understand correctly (please correct me if I am wrong about this one), increasing the alpha value to infinity (which improves the results) essentially means "choose all the sentences regardless of their confidence". So, if I am indeed correct about that, their clever method amounts to just choosing all the sentences (which was also shown to work in an ACL paper by Roi Reichart and Ari Rappoport).

To sum it up, this is exactly the kind of paper I would NOT like to see in ACL/EMNLP.

hal said...

To be fair, I didn't see the Smith+Eisner talk (it was opposite a session I was chairing) -- only skimmed the paper (in retrospect, perhaps I shouldn't have said it was my favorite given this :P).

Regarding Fernando's comment: we've experienced the same thing, but not in the context of entropy. When doing semi-supervised learning on structured problems with lots of features, we have found that the probability assigned to labels for the unlabeled data typically approaches zero or one very quickly. This doesn't happen in something like Nigam et al.'s EM-based document classification because they basically only use word ids. But it seems that overfitting is a big problem.

Anonymous said...

I too enjoyed the Smith and Eisner paper.

My major concern with all the work on semi-supervised parsing is that they might be proceeding under a false assumption. That is, we can get our hands on a small amount of labeled data, but not a large enough amount to train state-of-the-art classifiers. Note that this is also true of the Reichart and Rappoport self-training paper, but not the original McClosky et al. paper.

I believe this assumption is false, since it has been documented that training annotators for complex problems such as deep syntactic parsing is very time consuming. As a result, generating thousands of sentences should not be that much more difficult than generating a few hundred. My observations of the annotators on the PennBioIE project confirm this -- it took a while to get any annotated data, but once we got a little, we got a lot.

Anonymous said...

Nice pictures from Prague, but the river is definitely not called "the Charles river", but "Vltava" (Moldau in German);-)) Let's meet on the Columbus river next year;-)

Anonymous said...

From your Google comment, I think you might be underestimating the importance of what they do with large datasets. Working with large datasets is not a trivial thing to do. Say you have half a terabyte of text and would like to use it as a corpus. You can't just do it on one machine. In fact, even a dozen would be prohibitively slow. And in the general case, parallel algorithms are non-trivial and there's a lot of room for improvement. That's what the Google folks are trying to do.

hal said...

Ryan: That's a very interesting point. I've heard similar things echoed by many other people, and think it may be wise for us to reconsider.

Another semi-supervised learning issue that was brought up at the conference was in response to a paper on semi-supervised DOP CRFs. Fernando commented that a trend that we see a lot is that if you start from a non-state-of-the-art baseline, semi-sup learning can help. But if you start from the state-of-the-art, it doesn't help. (The conclusion, I believe, being that we should always start from the state-of-the-art baseline.)

I think there is one issue that this doesn't address: namely, the engineering difficulty issue. Specifically, if it requires a ton of work (usually feature engineering, but sometimes algorithm tuning as well) to get to the state-of-the-art, and I could also get to the same place just as easily with a generic learning algorithm, I'm not wholly convinced that the generic learning algorithm isn't a valuable contribution. Sure it would be ideal to have algorithms that always improve, but we all know that's not going to happen.

Regarding Google: Hrm. I may be a bit unfair to the Google folks, but I don't think horribly so. I guess there are two cases. Let's take any learning algorithm that works by some form of gradient descent. This can be trivially parallelized on any reasonable parallel architecture by having each node compute local gradients and then having a master sum them up and take the step (perhaps with distributed step-size computations). Similarly for EM: E can be performed independently on lots of processors; M can be hosted by one. (This easily maps to map-reduce.) These are things that are so trivial and so obvious that I really don't think that they're worth talking about, even if they're necessary to make things work. There are exceptions. For instance, when SVMs are trained with SMO, it's totally not obvious how to distribute this, so papers which do so are interesting. But (just to pick on the Google paper a bit more) spending 1.5 pages (sections 5-5.3) on distributed n-gram counts is just a waste of space. I completely agree that there are non-trivial issues (such as the integration with MT) and these deserve mention. (Note that I'm not saying that I don't like this paper---I think it's quite interesting---but just that certain things can go nearly unsaid.)
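
(Just to be concrete about the level of triviality I mean, here's a toy sketch of the gradient-descent case, with the "nodes" faked as a list of data shards on one machine and least squares standing in for whatever the real model is -- everything in it is made up for illustration:)

```python
import numpy as np
from functools import reduce

def local_gradient(shard, w):
    """'Map' step, run on one node: gradient of the squared error on that node's shard."""
    X, y = shard
    return X.T @ (X @ w - y)

def distributed_step(shards, w, lr=0.1):
    """'Reduce' step, run on the master: sum the local gradients, then take one step."""
    grad = reduce(np.add, (local_gradient(s, w) for s in shards))
    n_total = sum(len(y) for _, y in shards)
    return w - lr * grad / n_total

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
shards = []
for _ in range(4):                       # pretend each shard lives on a different machine
    X = rng.normal(size=(50, 2))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(200):
    w = distributed_step(shards, w)
print(w)                                 # converges to roughly [2, -1]
```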

Anonymous said...

What do you think of the neural network paper for SRL, "Fast Semantic Extraction Using a Novel Neural Network Architecture"? It is the first time in ages that we have seen an NN paper at ACL. I wasn't there; please tell us your personal perspective as well as the audience's. They say NNs are very fast in training and in testing, which is true, but what are the disadvantages here?

Anonymous said...

Hi, I saw that you commented on Wu's EMNLP paper. Would just like to add a note that we have a paper called "Word Sense Disambiguation Improves Statistical Machine Translation" in ACL-07. The authors are Yee Seng Chan, Hwee Tou Ng and David Chiang. Cheers.

Yoav said...

Regarding only improving the state of the art: I feel that in many cases "improving upon the state of the art" actually means "a slightly different over-fitting of the WSJ", so learning a new way of doing something is worthwhile even if it gets slightly below state-of-the-art results.

This is especially true for semi-supervised learning -- if the unsupervised part stems from a different source (i.e., not from the WSJ), we are almost guaranteed not to overfit, and so results that are just below the state of the art can be expected to transfer much better to a different domain.

Having said that, you still have to remain somewhat near the state of the art -- there is no justification for an algorithm that uses clever tricks and whatnot, but still gets you many levels below the state of the art without any chance of improving (e.g., current semi-supervised parsing with a small seed).

hal said...

anon3: I didn't see the NN paper. In general, I'm not an anti-neural nets guy. At least not totally. I think deep belief nets are definitely interesting, and think that NNs for multitask learning are very natural. But I'm also definitely not pro-NNs, since nonconvexity always scares me and at least historically they're slow to train.

anon4: thanks! i missed it.

yoav: i more or less agree, with one caveat: if you want to show that your technique generalizes better across domains, you'd better actually show this. of course, it's possible to do semisup learning on tasks *other than* WSJ-based data :). but i definitely think ryan's comment is worth serious consideration, because i too have found it to be true.

Anonymous said...

There are not many insights here, but you can find some pics:
http://www.notes.co.il/oren/33912.asp

Fernando Pereira said...

"Fernando commented that a trend that we see a lot is that if you start from a non-state-of-the-art baseline, semi-sup learning can help. But if you start from the state-of-the-art, it doesn't help. (The conclusion, I believe, being that we should always start from the state-of-the-art baseline.)" I've seen many cases in which a semi-supervised method that seemed to do well with artificially restricted labeled data is unable to improve over the best supervised baseline when more unlabeled data is added. (Notable exceptions include McClosky et al's parser self-training and Ando and Zhang's ASO results.) One possible explanation for that common behavior is that most semi-supervised methods are based on strict capacity control (strong regularization), which prevents significant learning improvements when labeled training data gets sufficiently large.

Anonymous said...

Who won the best paper award? I haven't found it on the conference website.

Anonymous said...

The best paper award for this year's ACL went to Y. W. Wong and R. J. Mooney, "Learning synchronous grammars for semantic parsing with lambda calculus". The best paper for EMNLP was J. Clarke and M. Lapata, "Modelling Compression with Discourse Constraints".

Mark Dredze said...

In reading over the comments quickly, I noticed the comments about semi-sup methods improving over state of the art baselines. There are a few issues here.
1) Ryan is correct in that the setting is not very realistic for parsing. Training annotators takes a lot of time and once that is done, things go pretty fast. A fast annotator can do a lot of labeling in a week, usually enough to do better than methods that use lots of unlabeled data and a very small amount of labeled data.
2) For other problems, this setting makes sense. Some tasks, such as document classification, don't take much time to train an annotator. In these settings, it may be more realistic to assume a small amount of annotations. Also, I think there are some settings where you just can't get many labels, such as learning from user feedback in real world settings.
3) There are problems where the state of the art doesn't use much labeled data. For some problems there is far less data than we'd like, so semi-sup learning can improve even though it may not with large amounts of labeled data.

In general, improving state of the art results with large amounts of labeled data is the ideal goal of semi-sup learning. In practice, it usually only works when you have a small amount of data.

Anonymous said...

As to the small amount of manually annotated training data for parsing, I think something is being missed here.

Suppose that you want to apply a statistical parser to the Web -- well, this is a very heterogeneous environment, and as we know, the performance of parsers degrades in the adaptation scenario.

Even if we already have good annotators, we won't be able to "use" them to create a large training set for every new subdomain of the Web.

Thus, I think a small training dataset is the common situation. Even if the annotators can parse many sentences, for each subdomain they will parse only a small number.

This is the scenario we addressed in our work.

Anonymous said...

Hi Roi,

I think this is a reasonable point that I hadn't thought of. Two things that pop into my head:

1. The cost of training annotators to annotate N sentences total in M domains (say N/M sentences in each domain) is probably significantly more than the cost to train annotators to annotate N sentences in a single domain. This is because there are domain specific lexicons and syntactic structures that annotators must become familiar with, e.g., the complex dependencies within compound nouns in scientific text.

2. Even if you only have N/M sentences in one domain, you still have N-(N/M) additional sentences of annotated data from the other domains. It has been shown that there are very simple, but very effective techniques for using this out-of-domain data to improve performance, e.g., Hal's ACL paper. Thus, one might argue that any semi-supervised parsing method for the small labeled corpus setting (i.e., due to the situation you describe) should show improvements that are either additive to simple adaptation techniques, or are at least comparable.

Anonymous said...

Hi Ryan,

Thanks for your comments.


1. I agree with you. I just think it is a common situation. In real-world applications you will sometimes need to invest this time.

2. As to other semi-supervised techniques: of course, self-training is not the only way. Steedman and his group showed improvements with co-training, and Becker and Osborne with active learning. It is just interesting that self-training also works (and it seems to give very good results when compared to other, more complicated techniques).

As to Hal's ACL paper, I recall that it is designed for classifiers. In structured output domains such as constituency parsing, things are much more complicated. Indeed, classifiers achieve good results for parsing only on short sentences (such as with WSJ10). In dependency parsing things are different (as you have shown), but I am focused here on constituency.

Actually, it would be great if Hal could suggest an extension of his method to constituency parsing (or in general to complex domains that are learned with generative models).

As far as I know, nobody has ever shown how to use annotated data from several domains in an optimal way in the context of statistical parsing.

If the data comes from very different domains (which is again the common situation in applications) I think this is not trivial at all.

Anonymous said...

Hi Roi,

That is true. Most of the effective adaptation techniques I know assume discriminative classifiers. In Blitzer et al., EMNLP 2006, there are some experiments adapting a parser to the biomedical domain with a small amount of seed data, but again for discriminative parsers.

Some simple things can be done for generative constituent parsing such as corpus mixing or MAP adaptation, e.g., the Roark and Bacchiani NAACL 2003 paper. Though, as you point out in your paper, they assume the in-domain labeled seed set is very large.
