natural language processing blog: ICML/UAI/COLT 2008 Retrospective

I know it's a bit delayed, but better late than never, eh? I have to say: I thought ICML/UAI/COLT this year was fantastic. The organizers all did a fantastic job. I can't possibly list everything I liked, but here are some highlights:

The tutorials were incredible. I'm reaching that point where I often skip out on tutorials because there's nothing new enough. This time, there was an embarrassment of riches: I couldn't go to everything I wanted to. In the end, I went to the Smola+Gretton+Fukumizu tutorial on distributional embeddings and the Krause+Guestrin tutorial on submodularity. Both were fantastic. My only complaint is that while all the plenary talks at ICML/UAI/COLT were video taped, the tutorials were not! Anyway, I will blog separately about both of these tutorials in the near future; also check out their web pages: the slides are probably reasonably self-explanatory.

The first day began with an invited talk by John Winn, who does (at least used to do) probabilistic modeling for computer vision. I'm not sure if this was the intended take-away message, but the thing I liked best about the talk is that he listed about a dozen things that you would have to model, were you to model "vision." He then talked about problems and techniques in terms of which of these you model, which of them you fix, and which of them you treat as "noise." I think this is a great way to think about modeling any kind of complex real world phenomenon (hint hint!).

My favorite morning talk was Beam Sampling for the Infinite Hidden Markov Model by Jurgen Van Gael, Yunus Saatci, Yee Whye Teh, and Zoubin Ghahramani. This is a really clever use of slice sampling, which everyone should know about! The idea is to resample entire sequences of hidden states in one go. The trouble with this in an infinite model is that I don't know how to do forward-backward over an infinite state space. The trick is to slice the probability mass on the lattice and only sample over those state/observation pairs that stick out above the slice. This is effectively a fully Bayesian way to do beam-like methods, and is certainly no limited to infinite HMMs.

That afternoon, there was the joint award session, with a "ten year" award (great idea!) going to Blum and Mitchell's original co-training paper from COLT 1998. This is clearly a very influential paper, and one I will blog about separately. I particularly liked two of the best paper award recipients:

Percy Liang and Michael Jordan for An Asymptotic Analysis of Generative, Discriminative, and Pseudolikelihood Estimators. The idea here is to look at what various estimators are like in the limit of infinite data. The conclusion (as I read it) is that if your model is correct (i.e., the true distribution is realized by a certain set of parameters in your model family), then you're asymptotically better using generative models. If the model is even a tiny tiny bit incorrect, then you're better using discriminative. The key weakness of this analysis seems to be that, since we're looking only at asymptotics, there's no notion of a "sortof-wrong" model -- the wrongness gets blown up in the limit.
Shai Shalev-Shwartz and Nati Srebo for SVM Optimization: Inverse Dependence on Training Set Size. The idea here is that having lots of data should be a good thing, not something that makes us say "ugh, now my SVM will take forever to run." The key to getting an inverse dependence on the size of the training set is to sort of change the problem. The idea is that if you have 1,000 training points, you may get 88% accuracy and it may take an hour. If you have 1,000,000 training points, we're not going to ask that we find something with 90% accuracy, but rather still aim for 88% accuracy but try to get there faster. The analysis is based on the idea that if we have lots more data, we can be a bit looser with finding the exact best model parameters, which is what usually takes so long. We can afford to have not exactly the best parameters, since we're still going for 90% accuracy, not 88%.

Although I didn't see the talk, I enjoyed the poster on Training Structural SVMs when Exact Inference is Intractable by Thomas Finley and Thorsten Joachims. They investigate the difference between approximate inference methods that are overgenerating versus undergenerating. The former is exemplified by a conversion from an IP to an LP. The latter by, say, search. I don't think there are huge cut and dry situations where one is better than the other, but I think this is a really good step to understanding these models better.

Tuesday morning, there was a great talk on Efficient Projections onto the L1-Ball for Learning in High Dimensions by John Duchi, Shai Shalev-Shwartz, Yoram Singer and Tushar Chandra. Here, there's trying to extend the Pegasos algorithm to do L1 regularized SVMs. The difficult step is, after taking a sub-gradient step, projecting yourself back on to the L1 ball. They have a very clever algorithm for doing this that makes use of red-black trees to maintain a notion of when a dimension was last updated (necessary in order to get complexity that depends on the sparseness of feature vectors). The only worry I have about this approach is that I'm very happy dealing with arrays for everything, but once someone tells me that my algorithm needs to use a tree structure, I start worrying. Not because of big-O issues, but just because trees mean pointers, pointers mean pointer chasing, and pointer chasing (especially out of cache) is a nightmare for actual efficiency.

All the papers in the Tuesday 10:40am online learning section I found interesting. Confidence-weighted linear classification (Mark Dredze, Koby Crammer and Fernando Pereira) presents a technique for keeping track of how confident you are about your weights for various features (if you squint it looks very PAC-Bayes) and they get very good online performance with only one or two passes over the data. Francesco Orabona, Joseph
Keshet and Barbara Caputo had a paper on memory-bounded online perceptrons called the Projectron. The idea is simple and effective; however, the big take-away I had from this paper has to do with the baseline they compare against. If you only allow your perceptron to maintain, say, 100 support vectors, just use the first 100 points that the perceptron updates on. This does pretty much as well as most of the past work in this problem! How did this not get noticed before? Finally, Sham Kakade, Shai Shalev-Shwartz and Ambuj Tewari talked about bandit problems; this is interesting work, but still seems far from practical. The idea of naively picking an unknown label with uniform probability over a gigantic set of labels is just not going to work in the real world. But I think it's a really good start.

The entire NLP section on the last afternoon was good. David Chen and Ray Mooney talked about how to generate sportscasts from RoboCup trials; Ronan Collobert and Jason Weston talked about using neural nets to solve every NLP problem in the world (more on this in another post); Jason Wolfe, Aria Haghighi and Dan Klein talked about how to parallelize the M step in addition to the E step; and then Percy Liang, me, and Dan Klein talked about whether its reasonable or not to get rid of features in structured prediction (more on this in another post).

The next day was workshop day. The two workshops I went to were the Prior Knowledge in Text workshop (that I co-organized with Marc Dymetman, Guillame Bouchard and Yee Whye Teh), and the Nonparametric Bayes workshop (where I talked a tiny bit about HBC). I'll relate the news of our own workshop at a later date; you can see some about the NPBayes workshop here.

Then came UAI and COLT. I must admit by this time I was a bit conferenced out, so I missed a bunch of sessions. However, there were still some papers that stood out for me.

In UAI, David Mimno and Andrew McCallum presented Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression, essentially a conditional variant of LDA wherein the distribution over hyperparameters is given by a generalized linear model and can use arbitrary features. One audience question that came up (David, if you're reading, maybe you can answer this!) had to do with putting the GLM on the "alpha" hyperparameter. The fact that there was such a big improvement over baseline suggests that LDA-like models are highly sensitive to the setting of their hyperparameters: this is a bit surprising. I (like the questioner) was a bit surprised that tweaking a hyperparameter could have such a big influence! Nevertheless, very cool.

The best paper went to David Sontag, Talya Meltzer, Amir Globerson, Tommi Jaakkola and Yair Weiss for Tightening LP Relaxations for MAP using Message Passing. The idea is to take your marginal polytope initially defined by simple single-node marginal constraints and iteratively add pairwise, 3-wise, etc., constraints. There are some implementation details that allow you to do warm restarts. They managed to scale really amazingly well. This furthers my confidence that LPs are a really great way to do MAP inference in graphical models.

Kuzman Ganchev, Joao Graca, John Blitzer and Ben Taskar talked about Multi-View Learning over Structured and Non-Identical Outputs. I see this as a bit of an extension to the "alignment by agreement" and the "agreement-based learning" work, but now the difference is that the output spaces don't have to "match up."

Marina Meila and Le Bao had a poster on Estimation and clustering with infinite rankings, where they give a nonparametric model over rankings, focusing on properties of the distribution. Kurt Miller, Tom Griffiths and Mike Jordan had an extension to the Indian Buffet Process that intentionally removes exchangeability in order to model cases where we know the data is not exchangeable. Chong Wang, Dave Blei and David Heckerman had a nice result on how to do topic modeling over continuous time (modeled as Brownian motion). The cool thing is that by going continuous, they're able to get a more efficient algorithm that for previous work that functioned in discrete time steps.

At COLT, there were also a good number of good papers. Shai Ben-David, Tyler Lu and David Pal ask: Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning. I'll ruin the suspense: No, it does not. (At least not without strong assumptions about the label distribution.) Nina Balcan, Avrim Blum and Nati Srebo
show that learning with similarity functions is even better than we thought.

One paper at COLT that I especially liked because I wasn't aware of much of the previous work they cited was by Liwei Wang, Masashi Sugiyama, Cheng Yang, Zhi-Hua Zhou and Jufu Feng On the Margin Explanation of Boosting Algorithms. The history here is roughly like this. Boosting came along and worked pretty well. People started trying to explain it from the perspective of margin-based analysis. Then, Breiman came along and described an algorithm arc-gv that provably generates a better margin than AdaBoost (and does so in practice as well), yet works worse! This paper attempts to re-analyze boosting from the perspective of an "Equilibrium Margin," which provides sharper bounds and results that actually agree with what is observed empirically.

7 comments:

Anonymous said...: Nice overview! I also noticed there was no ACL overview this year -- does this mean the interesting stuff is drifting from ACL to ICML?; 22 July, 2008 14:02
hal said...: No... it means that I only attended like 5 talks at ACL this year other than the ones in the summarization summary, and didn't feel like I could give it a fair shake down.

If someone else went to ACL and saw lots of talks and wanted to write a post, I'd be happy to "augment" it with what I saw!; 22 July, 2008 17:09
Brendan O'Connor said...: That was great, thanks! I'm pretty interested to hear your thoughts on the neural networks paper -- I can't imagine such a system to be at all practical at all to develop, but it sure sounds fun...; 24 July, 2008 11:36
Mark Dredze said...: Thanks for the shout-out. I didn't make it to ICML this year but I'm glad to see some much interesting stuff. I'll add some of these to my reading pile.; 24 July, 2008 21:51
Anonymous said...: I want to tell you, I know a website sell the gaia gold, and in this website all gaia online gold is very cheap, there had many customer, if you want to buy gaia gold, I suggest you to come here then to buy the cheap gaia gold; 16 April, 2009 00:16
Anonymous said...: 酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店酒店兼差PRETTY GIRL酒店公關酒店小姐彩色爆米花酒店兼職,酒店工作彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀彩色爆米花; 12 May, 2009 10:31
gamefan12 said...: I agree with you all the way. This was definitely fantastic. I would love to see this more in the future.
ponte vedra cosmetic dentist; 15 April, 2010 10:18

natural language processing blog

21 July 2008

ICML/UAI/COLT 2008 Retrospective

7 comments:

About Me

Labels

My Blog List

Blog Archive