natural language processing blog

11 January 2011

NIPS 2010 Retrospective

Happy New Year and I know I've been silent but I've been busy. But no teaching this semester (YAY!) so maybe you'll see more posts.

At any rate, I'm really late to the table, but here are my comments about this past year's NIPS. Before we get to that, I hope that everyone knows that this coming NIPS will be in Granada, and then for (at least) the next five years will be in Tahoe. Now that I'm not in ski-land, it's nice to have a yearly ski vacation ... erm I mean scientific conference.

But since this was the last year of NIPS in Vancouver, I thought I'd share a conversation that occurred this year at NIPS, with participants anonymized. (I hope everyone knows to take this in good humor: I'm perfectly happy to poke fun at people from the States, too...). The context is that one person in a large group, which was going to find lunch, had a cell phone with a data plan that worked in Canada:

A: Wow, that map is really taking a long time to load.
B: I know. It's probably some socialized Canadian WiFi service.
C: No, it's probably just slow because every third bit has to be a Canadian bit?
D: No no, it's because every bit has to be sent in both English and French!

Okay it's not that funny, but it was funny at the time. (And really "B" is as much a joke about the US as it was about Canada :P.)

But I'm sure you are here to hear about papers, not stupid Canada jokes. So here's my take.

The tutorial on optimization by Stephen Wright was awesome. I hope this shows up on videolectures soon. (Update: it has!) I will make it required reading / watching for students. There's just too much great stuff in it to go in to, but how about this: momentum is the same as CG! Really?!?! There's tons of stuff that I want to look more deeply into, such as robust mirror descent, some work by Candes about SVD when we don't care about near-zero SVs, regularized stochastic gradient (Xiao) and sparse eigenvector work. Lots of awesome stuff. My favorite part of NIPS.

Some papers I saw that I really liked:

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): Formalizes boosting in a multiclass setting. The crux is a clever generalization of the "weak learning" notion from binary. The idea is that a weak binary classifier is one that has a small advantage over random guessing (which, in the binary case, gives 50/50). Generalize this and it works.

Structured sparsity-inducing norms through submodular functions (Francis Bach): I need to read this. This was one of those talks where I understood the first half and then got lost. But the idea is that you can go back-and-forth between submodular functions and sparsity-inducing norms.

Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher): The title says it all! It's an alternative construction to the Polya urn scheme and also to the stick-breaking scheme.

A Reduction from Apprenticeship Learning to Classification (Umar Syed, Robert Schapire): Right up my alley, some surprising results about apprenticeship learning (aka Hal's version of structured prediction) and classification. Similar to a recent paper by Stephane Ross and Drew Bagnell on Efficient Reductions for Imitation Learning.

Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): When you have complex combinatorial spaces (think traveling salesman), how can you construct generic variational inference algorithms?

Implicit Differentiation by Perturbation (Justin Domke): This is a great example of a paper that I never would have read, looked at, seen, visited the poster of, known about etc., were it not for serendipity at conferences (basically Justin was the only person at his poster when I showed up early for the session, so I got to see this poster). The idea is if you have a graphical model, and some loss function L(.) which is defined over the marginals mu(theta), where theta are the parameters of the model, and you want to optimize L(mu(theta)) as a function of theta. Without making any serious assumptions about the form of L, you can actually do gradient descent, where each gradient computation costs two runs of belief propagation. I think this is amazing.

Probabilistic Deterministic Infinite Automata (David Pfau, Nicholas Bartlett, Frank Wood): Another one where the title says it all. DP-style construction of infinite automata.

Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): The idea here is to define a regression function over a graph. It should be regularized in a sensible way. Very LASSO-esque model, as you might expect given the author list :).

Other papers I saw that I liked but not enough to write mini summaries of:

Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy)
Tree-Structured Stick Breaking for Hierarchical Data (Ryan Adams, Zoubin Ghahramani, Michael Jordan)
Categories and Functional Units: An Infinite Hierarchical Model for Brain Activations (Danial Lashkari, Ramesh Sridharan, Polina Golland)
Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)
Joint Analysis of Time-Evolving Binary Matrices and Associated Documents (Eric Wang, Dehong Liu, Jorge Silva, David Dunson, Lawrence Carin)
Learning Efficient Markov Networks (Vibhav Gogate, William Webb, Pedro Domingos)
Tree-Structured Stick Breaking for Hierarchical Data (Ryan Adams, Zoubin Ghahramani, Michael Jordan)
Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher)
Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)

Two students who work with me (though one isn't actually mine :P), who went to NIPS also shared their favorite papers. The first is a list from Avishek Saha:

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire)

Repeated Games against Budgeted Adversaries (Jacob Abernethy, Manfred Warmuth)

Non-Stochastic Bandit Slate Problems (Satyen Kale, Lev Reyzin, Robert Schapire)

Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)

Learning Bounds for Importance Weighting (Corinna Cortes, Yishay Mansour, Mehryar Mohri)

Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)

The second list is from Piyush Rai, who apparently aimed for recall (though not with a lack of precision) :P:

Online Learning: Random Averages, Combinatorial Parameters, and Learnability (Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari): defines several complexity measures for online learning akin to what we have for the batch setting (e.g., radamacher averages, covering numbers etc).

Online Learning in The Manifold of Low-Rank Matrices (Uri Shalit, Daphna Weinshall, Gal Chechik): nice general framework applicable in a number of online learning settings. could also be used for online multitask learning.

Fast global convergence rates of gradient methods for high-dimensional statistical recovery (Alekh Agarwal, Sahand Negahban, Martin Wainwright): shows that the properties of sparse estimation problems that lead to statistical efficiency also lead to computational efficiency which explains the faster practical convergence of gradient methods than what the theory guarantees.

Copula Processes (Andrew Wilson, Zoubin Ghahramani): how do you determine the relationship between random variables which could have different marginal distributions (say one has gamma and the other has gaussian distribution)? copula process gives an answer to this.

Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): usually undirected graph structure learning involves a set of random variables y drawn from a distribution p(y). but what if y depends on another variable x? this paper is about learning the graph structure of the distribution p(y|x=x).

Structured sparsity-inducing norms through submodular functions (Francis Bach): standard sparse recovery uses l1 norm as a convex proxy for the l0 norm (which constrains the number of nonzero coefficients to be small). this paper proposes several more general set functions and their corresponding convex proxies, and links them to known norms.

Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum): an interesting paper -- what if in an online learning setting you could abstain from making a prediction on some of the training examples and just say "i don't know"? on others, you may or may not make the correct prediction. lies somewhere in the middle of always predicting right or wrong (i.e., standard mistake driven online learning) versus the recent work on only predicting correctly or otherwise saying "i don't know".

Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): cool paper. applicable to lots of settings.

A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): we know that boosting in binary case requires "slightly better than random" weak learners. this paper characterizes conditions on the weak learners for the multi-class case, and also gives a boosting algorithm.

Multitask Learning without Label Correspondences (Novi Quadrianto, Alexander Smola, Tiberio Caetano, S.V.N. Vishwanathan, James Petterson): usually mtl assumes that the output space is the same for all the tasks but in many cases this may not be true. for instance, we may have two related prediction problems on two datasets but the output spaces for both may be different and may have some complex (e.g., hierarchical, and potentially time varying) output spaces. the paper uses a mutual information criteria to learn the correspondence between the output spaces.

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty (Yi Zhang, Jeff Schneider): presents a general multitask learning framework and many recently proposed mtl models turn out to be special cases. models both feature covariance and task covariance matrices.

Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions (Achintya Kundu, vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal): the title says it all. :) multiple kernel learning is usually applied in classification setting but due to the applicability of the proposed method for a wide variety of loss functions, one can possibly also use it for unsupervised learning problems as well (e.g., spectral clustering, kernel pca, etc).

Getting lost in space: Large sample analysis of the resistance distance (Ulrike von Luxburg, Agnes Radl, Matthias Hein): large sample analysis of the commute distance: shows a rather surprising result that commute distance between two vertices in the graph if the graph is "large" and nodes represent high dimensional variables is meaningless. the paper proposes a correction and calls it "amplified commute distance".

A Bayesian Approach to Concept Drift (Stephen Bach, Mark Maloof): gives a bayesian approach for segmenting a sequence of observations such that each "block" of observations has the same underlying concept.

MAP Estimation for Graphical Models by Likelihood Maximization (Akshat Kumar, Shlomo Zilberstein): they show that you can think of an mrf as a mixture of bayes nets and then the map problem on the mrf corresponds to solving a form of the maximum likelihood problem on the bayes net. em can be used to solve this in a pretty fast manner. they say that you can use this methods with the max-product lp algorithms to yield even better solutions, with a quicker convergence.

Energy Disaggregation via Discriminative Sparse Coding (J. Zico Kolter, Siddharth Batra, Andrew Ng): about how sparse coding could be used to save energy. :)

Semi-Supervised Learning with Adversarially Missing Label Information (Umar Syed, Ben Taskar): standard ssl assumes that labels for the unlabeled data are missing at random but in many practical settings this isn't actually true.this paper gives an algorithm to deal with the case when the labels could be adversarially missing.

Multi-View Active Learning in the Non-Realizable Case (Wei Wang, Zhi-Hua Zhou): shows that (under certain assumptions) exponential improvements in the sample complexity of active learning are still possible if you have a multiview learning setting.

Self-Paced Learning for Latent Variable Models (M. Pawan Kumar, Benjamin Packer, Daphne Koller): an interesting paper, somewhat similar in spirit to curriculum learning. basically, the paper suggests that in learning a latent variable model, it helps if you provide the algorithm easy examples first.

More data means less inference: A pseudo-max approach to structured learning (David Sontag, Ofer Meshi, Tommi Jaakkola, Amir Globerson): a pseudo-max approach to structured learning: this is somewhat along the lines of the paper on svm's inverse dependence on training size from icml a couple of years back. :)

Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning (Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman): selecting the most uncertain example in a pool based active learning can be expensive if the number of candidate examples is very large. this paper suggests some hashing tricks to expedite the search.

Active Instance Sampling via Matrix Partition (Yuhong Guo): frames batch mode active learning as a matrix partitioning problems and proposes local optimization technique for the matrix partitioning problem.

A Discriminative Latent Model of Image Region and Object Tag Correspondence (Yang Wang, Greg Mori): it's kind of doing correspondence lda on image+captions but they additionally infer the correspondences between tags and objects in the images, and show that this gives improvements over corr-lda.

Factorized Latent Spaces with Structured Sparsity (Yangqing Jia, Mathieu Salzmann, Trevor Darrell): a multiview learning algorithm that uses sparse coding to learn shared as well as private features of different views of the data.

Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy): extends lda for the case when you have access to features for each word in the vocabulary

18 November 2010

Crowdsourcing workshop (/tutorial) decisions

Everyone at conferences (with multiple tracks) always complains that there are time slots with nothing interesting, and other time slots with too many interesting papers. People have suggested crowdsourcing this, enabling parcipants to say -- well ahead of the conference -- which papers they'd go to... then let an algorithm schedule.

I think there are various issues with this model, but don't want to talk about it. What I do want to talk about is applying the same ideas to workshop acceptance decisions. This comes up because I'm one of the two workshop chairs for ACL this year, and because John Langford just pointed to the ICML call for tutorials. (I think what I have to say applies equally to tutorials as to workshops.)

I feel like a workshop (or tutorial) is successful if it is well attended. This applies both from a monetary perspective, as well as a scientific perspective. (Note, though, that I think that small workshops can also be successful, especially if they are either fostering a small community, bring people in, or serving other purposes. That is to say, size is not all that matters. But it is a big part of what matters.)

We have 30-odd workshop proposals for three of us to sort through (John Carroll and I are the two workshop chairs for ACL, and Marie Candito is the workshop chair for EMNLP; workshops are being reviewed jointly -- which actually makes the allocation process more difficult). The idea would be that I could create a poll, like the following:

Are you going to ACL? Yes, maybe, no
Are you going to EMNLP? Yes, maybe, no
If workshop A were offered at a conference you were going to, would you go to workshop A?
If workshop B...
And so on

This gives you two forms of information. First it can help estimate expected attendance (though we ask proposers to estimate that, too, and I think they do a reasonable job if you skew their estimates down by about 10%). But more importantly, it gives correlations between workshops. This lets you be sure that you're not scheduling things on top of each other that people might want to go to. Some of these are obvious (for instance, if we got 10 MT workshop proposals... which didn't actually happen but is moderately conceivable :P), but some are not. For instance, maybe people who care about annotation also care about ML, but maybe not? I actually have no idea.

Of course we're not going to do this this year. It's too late already, and it would be unfair to publicise all the proposals, given that we didn't tell proposers in advance that we would do so. And of course I don't think this should exclusively be a popularity contest. But I do beleive that popularity should be a factor. And it should probably be a reasonably big factor. Workshop chairs could then use the output of an optimization algorithm as a starting point, and use this as additional data for making decisions. Especially since two or three people are being asked to make decisions that cover--essentially--all areas of NLP, this actually seems like a good idea to me.

I actually think something like this is more likely to actually happen at a conference like ICML than ACL, since ICML seems (much?) more willing to try new things than ACL (for better or for worse).

But I do think it would be interesting to try to see what sort of response you get. Of course, just polling on this blog wouldn't be sufficient: you'd want to spam, perhaps all of last year's attendees. But this isn't particularly difficult.

Is there anything I'm not thinking of that would make this obviously not work? I could imagine someone saying that maybe people won't propose workshops/tutorials if the proposals will be made public? I find that a bit hard to swallow. Perhaps there's a small embarassment factor if you're public and then don't get accepted. But I wouldn't advocate making the voting results public -- they would be private to the organizers / workshop chairs.

I guess -- I feel like I'm channeling Fernando here? -- that another possible issue is that you might not be able to decide which workshops you'd go to without seeing what papers are there and who is presenting. This is probably true. But this is also the same problem that the workshop chairs face anyway: we have to guess that good enough papers/people will be there to make it worthwhile. I doubt I'm any better at guessing this than any other random NLP person...

So what am I forgetting?

09 November 2010

Managing group papers

Every time a major conference deadline (ACL, NIPS, EMNLP, ICML, etc...) comes around, we usually have a slew of papers (>=3, typically) that are getting prepared. I would say on average 1 doesn't make it, but that's par for the course.

For AI-Stats, whose deadline just passed, I circulated student paper drafts to all of my folks to solicit comments at any level that they desired. Anywhere from not understanding the problem/motivation to typos or errors in equations. My experience was that it was useful, both from the perspective of distributing some of my workload and getting an alternative perspective, to keeping everyone abreast of what everyone else is working on.

In fact, it was so successful that two students suggested to me that I require more-or-less complete drafts of papers at least one week in advance so that this can take place. How you require something like this is another issue, but the suggestion they came up with was that I'll only cover conference travel if this occurs. It's actually not a bad idea, but I don't know if I'm enough of a hard-ass (or perceived as enough of a hard-ass) to really pull it off. Maybe I'll try it though.

The bigger question is how to manage such a thing. I was thinking of installing some conference management software locally (eg., HotCRP, which I really like) and giving students "reviewer" access. Then, they could upload their drafts, perhaps with an email circulated when a new draft is available, and other students (and me!) could "review" them. (Again, perhaps with an email circulated -- I'm a big fan of "push" technology: I don't have time to "pull" anymore!)

The only concern I have is that it would be really nice to be able to track updates, or to have the ability for authors to "check off" things that reviewers suggested. Or to allow discussion. Or something like that.

I'm curious if anyone has ever tried anything like this and whether it was successful or not. It seems like if you can get a culture of this established, it could actually be quite useful.

21 October 2010

Comparing Bounds

This is something that's bothered me for quite a while, and I don't know of a good answer. I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar). There are (at least?) two issues with comparing bounds, the first is the obvious "these are both upper bounds, what does it mean to compare them?" The second is the slightly less obvious "but your empirical losses may be totally different" issue. It's actually the second one that I want to talk about, but I have much less of a good internal feel about it.

Let's say that I'm considering two learning approaches. Say it's SVMs versus logistic regression. Both regularized. Or something. Doesn't really matter. At the end of the day, I'll have a bound that looks roughly like:

expected test error <= empirical training error + f( complexity / N)

Here, f is often "sqrt", but could really be any function. And N is the number of data points.

Between two algorithms, both "f" and "complexity" can vary. For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (eg., O(D log D)). Or one might have a square root. Who knows. Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds.

At the end of the day, we of course want to say "my algorithm is better than your algorithm." (What else is there in life?) The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."

Here's where two issues crop up. The first is that our bound is just an upper bound. For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100." Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5. In this way, the bounds can be misleading.

My general approach with this issue is to squint, as I do for experimental results. I don't actually care about constant factors: I just care about things like "what does the dependence on D look like." Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me. Beyond that I don't really care. I especially don't care if the proof techniques are quite similar. For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers. They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.

(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)

The other issue I don't have as good a feeling for. The issue is that we're entirely ignoring the "empirical training error" question. In fact, this is often measured differently between different algorithms! For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..." Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss.

Now I really don't know what to do.

We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data. The problem is that the source training error might be totally incomparable to the (source + target) training error.

But the issue is for sure more general. For instance, what if your training error is measured in squared error? Now this can be huge when hinge loss is still rather small. In fact, your squared error could be quadratically large in your hinge loss. Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not. (Neither does log loss.)

This worries me greatly, much more than the issue of comparing upper bounds.

Does this bother everyone, or is it just me? Is there a good way to think about this that gets your out of this conundrum?

05 October 2010

My Giant Reviewing Error

I try to be a good reviewer, but like everything, reviewing is a learning process. About five years ago, I was reviewing a journal paper and made an error. I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter.

I was reviewing a paper, which I thought was overall pretty strong. I thought there was an interesting connection to some paper from Alice Smith (not the author's real name) in the past few years and mentioned this in my review. Not a connection that made the current paper irrelevant, but something the authors should probably talk about. In the revision response, the authors said that they had looked to try to find Smith's paper, but could figure out which one I was talking about, and asked for a pointer. I spend the next five hours looking for the reference and couldn't find it myself. It turns out that actually I was thinking of a paper by Bob Jones, so I provided that citation. But the Jones paper wasn't even as relevant as it seemed at the time I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely.

Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper! I remember that happening to me!

But, sadly, this is not true. I get reviews like this all the time, and I feel it's one of the most irresponsible things reviewers can do. In fact, I don't think a single reviewing cycle has passed where I don't get a review like this. The problem with such reviews is that it enables a reviewer to make whatever claim they want, without any expectation that they have to back it up. And the claims are usually wrong. They're not necessarily being mean (I wasn't trying to be mean), but sometimes they are.

Here are some of the most ridiculous cases I've seen. I mention these just to show how often this problem occurs. These are all on papers of mine.

One reviewer wrote "This idea is so obvious this must have been done before." This is probably the most humorous example I've seen, but the reviewer was clearly serious. And no, this was not in a review for one of the the "frustratingly easy" papers.
In a NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times. Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing. (Maybe it's true and we just were looking in the wrong place, but that still isn't helpful.) It turned out to strongly seem that this was basically their way of saying "you are not one of us."
In a paper on technique X for task A, we were told hands down that it's well known that technique Y works better, with no citations. The paper was rejected, we went and implemented Y, and found that it worked worse on task A. We later found one paper saying that Y works better than X on task B, for B fairly different from A.
In another paper, we were told that what we were doing had been done before and in this case a citation was provided. The citation was to one of our own papers, and it was quite different by any reasonable metric. At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.
We were told that we missed an enormous amount of related work that could be found by a simple web search. I've written such things in reviews, often saying something like "search for 'non-parametric Bayesian'" or something like that. But here, no keywords were provided. It's entirely possible (especially when someone moves into a new domain) that you can miss a large body of related work because you don't know how to find in: that's fine -- just tell me how to find it if you don't want to actually provide citations.

There are other examples I could cite from my own experience, but I think you get the idea.

I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem. It's really just an issue of laziness. If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong. But I was lazy. Luckily this didn't really adversely affect the outcome of the acceptance of this paper (journals are useful in that way -- authors can push back -- and yes, I know you can do this in author responses too, but you really need two rounds to make it work in this case).

I've really really tried ever since my experience above to not ever do this again. And I would encourage future reviewers to try to avoid the temptation to do this: you may find your memory isn't as good as you think. I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims.

24 September 2010

ACL / ICML Symposium?

ACL 2011 ends on June 24, in Portland (that's a Friday). ICML 2011 begins on June 28, near Seattle (the following Tuesday). This is pretty much as close to a co-location as we're probably going to get in a long time. A few folks have been discussing the possibility of having a joint NLP/ML symposium in between. The current thought is to have it on June 27 at the ICML venue (for various logistical reasons). There are buses and trains that run between the two cities, and we might even be able to charter some buses.

One worry is that it might only attract ICML folks due to the weekend between the end of ACL and the beginning of said symposium. As a NLPer/MLer, I believe in data. So please provide data by filling out the form below and, if you wish, adding comments.

If you woudn't attend any, you don't need to fill out the poll :).

The last option is there if you want to tell me "I'm going to go to ACL, and I'd really like to go to the symposium, but the change in venue and the intervening weekend is too problematic to make it possible."

15 September 2010

Very sad news....

I heard earlier this morning that Fred Jelinek passed away last night. Apparently he had been working during the day: a tenacious aspect of Fred that probably has a lot to do with his many successes.

Fred is probably most infamous for the famous "Every time I fire a linguist the performace of the recognizer improves" quote, which Jurafsky+Martin's textbook says is actually supposed to be the more innocuous "Anytime a linguist leaves the group the recognition rate goes up." And in Fred's 2009 ACL Lifetime Achievement Award speech, he basically said that such a thing never happened. I doubt that will have any effect on how much the story is told.

Fred has had a remarkable influence on the field. So much so that I won't attempt to list anything here: you can find all about him all of the internet. Let me just say that the first time I met him, I was intimidated. Not only because he was Fred, but because I knew (and still know) next to nothing about speech, and the conversation inevitably turned to speech. Here's roughly how a segment of our conversation went:

Hal: What new projects are going on these days?
Fred: (Excitedly.) We have a really exciting new speech recognition problem. We're trying to map speech signals directly to fluent text.
Hal: (Really confused.) Isn't that the speech recognition problem?
Fred: (Playing the "teacher role" now.) Normally when you transcribe speech, you end up with a transcrit that includes disfluencies like "uh" and "um" and also false starts [Ed note: like "I went... I went to the um store"].
Hal: So now you want to produce the actual fluent sentence, not the one that was spoken?
Fred: Right.

Apparently (who knew) in speech recognition you try to transcribe disfluencies and are penalized for missing them! We then talked for a while about how they were doing this, and other fun topics.

A few weeks later, I got a voicemail on my home message machine from Fred. That was probably one of the coolest things that have ever happened to me in life. I actually saved it (but subsequently lost it, which saddens me greatly). The content is irrelevant: the point is that Fred -- Fred! -- called me -- me! -- at home! Amazing.

I'm sure that there are lots of other folks who knew Fred better than me, and they can add their own stories in comments if they'd like. Fred was a great asset to the field, and I will certainly miss his physical presense in the future, though his work will doubtless continue to affect the field for years and decades to come.

13 September 2010

AIStats 2011 Call for Papers

The full call, and some changes to the reviewing process. The submission deadline is Nov 1, and the conference is April 11-13, in Fort Lauderdale, Florida. Promises to be warm :).

The changes the the reviewing process are interesting. Basically the main change is that the author response is replaced by a journal-esque "revise and resubmit." That is, you get 2 reviews, edit your paper, submit a new version, and get a 3rd review. The hope is that this will reduce author frustration from the low bandwidth of author response. Like with a journal, you'll also submit a "diff" saying what you've changed. I can see this going really well: the third reviewer will presumably see a (much) better than the first two. The disadvantage, which irked me at ICML last year, is that it often seemed like the third reviewer made the deciding call, and I would want to make sure that the first two reviewers also get updated. I can also see it going poorly: authors invest even more time in "responding" and no one listens. That will be increased frustration :).

The other change is that there'll be more awards. I'm very much in favor of this, and I spend two years on the NAACL exec trying to get NAACL to do the same thing, but always got voted down :). Oh well. The reason I think it's a good idea is two-fold. First, I think we're bad at selecting single best papers: a committee decision can often lead to selecting least offensive papers rather than ones that really push the boundary. I also think there are lots of ways for papers to be great: they can introduce new awesome algorithms, have new theory, have a great application, introduce a cool new problem, utilize a new linguistic insight, etc., etc., etc... Second, best papers are most useful at promotion time (hiring, and tenure), where you're being compared with people from other fields. Why should our field put our people at a disadvantage by not awarding great work that they can list of their CVs?

Anyway, it'll be an interesting experiment, and I encourage folks to submit!

07 September 2010

Manifold Assumption versus Margin Assumption

[This post is based on some discussions that came up while talking about manifold learning with Ross Whitaker and Sam Gerber, who had a great manifold learning paper at ICCV last year.]

There are two assumptions that are often used in statistical learning (both theory and practice, though probably more of the latter), especially in the semi-supervised setting. Unfortunately, they're incompatible.

The margin assumption states that your data are well separated. Usually it's in reference to linear, possibly kernelized, classifiers, but that need not be the case. As most of us know, there are lots of other assumptions that boil down to the same thing, such as the low-weight-norm assumption, or the Gaussian prior assumption. At the end of the day, it means your data looks like what you have on the left, below, not what you have on the right.

The manifold assumption that is particularly popular in semi-supervised learning, but also shows up in supervised learning, says that your data lie on a low dimensional manifold embedded in a higher dimensional space. One way of thinking about this is saying that your features cannot covary arbitrarily, but the manifold assumption is quite a bit stronger. It usually assumes a Reimannian (i.e., locally Euclidean) structure, with data points "sufficiently" densely sampled. In other words, life looks like the left, not the right, below:

Okay, yes, I know that the "Bad" one is a 2D manifold embedded in 2D, but that's only because I can't draw 3D images :). And anyway, this is a "weird" manifold in the sense that at one point (where the +s and -s meet), it drops down to 1D. This is fine in math-manifold land, but usually not at all accounted for in ML-manifold land.

The problem, of course, is that once you say "margin" and "manifold" in the same sentence, things just can't possibly work out. You'd end up with a picture like:

This is fine from a margin perspective, but it's definitely not a (densely sampled) manifold any more.

In fact, almost by definition, once you stick a margin into a manifold (which is okay, since you'll define margin Euclideanly, and manifolds know how to deal with Euclidean geometry locally), you're hosed.

So I guess the question is: who do you believe?

31 August 2010

Online Learning Algorithms that Work Harder

It seems to be a general goal in practical online learning algorithm development to have the updates be very very simply. Perceptron is probably the simplest, and involves just a few adds. Winnow takes a few multiplies. MIRA takes a bit more, but still nothing hugely complicated. Same with stochastic gradient descent algorithms for, eg., hinge loss.

I think this maybe used to make sense. I'm not sure that it makes sense any more. In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data. There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting).

Why do I feel this way?

Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning. This is what John Langford observed in his Vowpal Wabbit implementation. He has to do multiple passes. He deals with the I/O bottleneck by creating an I/O friendly, proprietary version of the input file during the first past, and then careening through it on subsequent passes.

In this case, basically what John is seeing is that I/O is too slow. Or, phrased differently, learning is too fast :). I never thought I'd say that, but I think it's true. Especially when you consider that just having two threads is a pretty low requirement these days, it would be nice to put 8 or 16 threads to good use.

But I think the problem is actually quite a bit more severe. You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized. In particular, someone has already gone through the effort of computing all your features for you. Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update.

So in the real world, you're probably going to be computing your features and learning on the fly. (Or at least that's what I always do.) In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features.

One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes. In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P). I've never tried this myself because I am lazy. Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially).

A similar problem crops up in structured problems. Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients. (This is similar to the problems you run into when trying to parallelize the structured perceptron.)

Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per-example and required fewer passes; I hope someone will invent one for me!

27 August 2010

Calibrating Reviews and Ratings

NIPS decision are going out soon, and then we're done with submitting and reviewing for a blessed few months. Except for journals, of course.

If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).

One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. Conferences try to adjust for this, for instance NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that. Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this. (For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. I have friends who never give a one -- except in the case of something just being wrong -- and often give 5s. Perhaps I should be nicer; I know CS tends to be harder on itself than other fiends.) As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration.

There's also the issue of areas. Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?

A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och. The example he gives is very nice. If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars, and a 79% recommendation. On the other hand, if you look up In'N'Out Burger, a LA-based burger chain (which, having grown up in LA, was admittedly one of my favorite places to eat in high school, back when I ate stuff like that) you see another 4.0/5.0 stars and a 95% recommendation.

So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. And we expect this to work?!

Probably the main issue here is calibrating for expectations. As a teacher, I've figured out quickly that managing student expectations is a big part of getting good teaching reviews. If you go to In'N'Out, and have expectations for a Big Mac, you'll be pleasantly surprised. If you go to The French Laundry with expectations of having a meal worth selling your soul, your children's souls, etc., for, then you'll probably be disappointed (though I can't really say: I've never been).

One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something). You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price. For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is.

(Anticipating comments, I don't think this is just an "aspect" issue. I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)

I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.) there simply isn't a total order on the "quality" of the objects you're looking at. (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) There is maybe a situation-depend order, and the distance to hotel, or "$" rating, or area classes are heuristics for describing this "situation." Bit without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.

23 August 2010

Finite State NLP with Unlabeled Data on Both Sides

(Can you tell, by the recent frequency of posts, that I'm try not to work on getting ready for classes next week?)

[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.]

The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways. Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :).

The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part. For instance, to do speech to speech transliteration from English to Japanese, you might build a component system like:

English speech --A--> English phonemes --B--> Japanese phonemes --C--> Japanese speech --D--> Japanese speech LM

You'll need a language model (D) for Japanese speech, that can be trained just on acoustic Japanese signals, then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A) and parallel English phonemes/Japanese phonemes (for B). [Plus, of course, if you're missing any of these, EM comes to your rescue!]

Let's take a simpler example, though the point I want to make applies to long chains, too.

Suppose I want to just do translation from French to English. I build an English language model (off of monolingual English text) and then an English-to-French transducer (remember that in the noisy channel, things flip direction). For the E2F transducer, I'll need parallel English/French text, of course. The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences.

But, presumably, I also have lots of monolingual French text. Forgetting math for a moment, which seems to suggest that this can't help me, we can ask: why should this help?

Well, it probably won't help with my English language model, but it should be able to help with my transducer. Why? Because my transducer is supposed to give me p(f|e). If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening.

More generally, I feel like the following two operations should probably give roughly the same probabilities:

Drawing an English sentence from the language model p(e).
Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.

If you buy this, then perhaps one thing you could do is to try to learn a transducer q(f|e) that has low KL divergence between 1 and 2, above. If you work through the (short) make, and throw away terms that are independent of the transducer, then you end up wanting to minimize [ sum_e p(e) log sum_f q(f|e) ]. Here, the sum over f is a finite sum over GigaFrench, and the sum over e is an infinite sum over positive probability English sentences given my the English LM p(e).

One could then apply something like posterior regularization (Kuzman Ganchev, Graça and Taskar) to do the learning. There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models.

Anyway, it seems like an interesting idea. I'm definitely not aware if anyone has tried it.

21 August 2010

Readers kill blogs?

I try to avoid making meta-posts, but the timing here was just too impeccable for me to avoid a short post on something that's been bothering me for a year or so.

On the one hand, yesterday, Aleks stated that the main reason he blogs is to see comments. (Similarly, Lance also thinks comments are a very important part of having an "open" blog.)
On the other hand, people are more an more moving to systems like Google Reader, as re-blogged by Fernando, also yesterday.

I actually complete agree with both points. The problem is that I worry that they are actually fairly opposed. I comment much less on other people's blogs now that I use reader, because the 10 second overhead of clicking on the blog, being redirected, entering a comment, blah blah blah, is just too high. Plus, I worry that no one (except the blog author) will see my comment, since most readers don't (by default) show comments in with posts.

Hopefully the architects behind readers will pick up on this and make these things (adding and viewing comments, within the reader -- yes, I realize that it's then not such a "reader") easier. That is, unless they want to lose out to tweets!

Until then, I'd like to encourage people to continue commenting here.

19 August 2010

Multi-task learning: should our hypothesis classes be the same?

It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks. In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used.

I probably would have gone on not even considering this unspoken assumption, until a few years ago I saw a couple papers that challenged it, albeit indirectly. One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. What they see is that different features are useful for spotting entity spans, and for labeling entity types.

That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).

Now, to some degree this is not surprising. If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?

This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?

One way is to punt: you can just concatenate your feature vectors and cross your fingers. Or, more nuanced, you can have some set of shared features and some set of features unique to each task. This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.

An alternative approach is to let the two classifiers "talk" via unlabeled data. Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree."

A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. Under the assumption that we have access to examples labeled for both tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), then one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.

I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.

02 August 2010

Why Discourse Structure?

I come from a strong lineage of discourse folks. Writing a parser for Rhetorical Structure Theory was one of the first class projects I had when I was a grad student. Recently, with the release of the Penn Discourse Treebank, there has been a bit of a flurry of interest in this problem (I had some snarky comments right after ACL about this). I've also talked about why this is a hard problem, but never really about why it is an interesting problem.

My thinking about discourse has changed a lot over the years. My current thinking about it is in an "interpretation as abduction" sense. (And I sincerely hope all readers know what that means... if not, go back and read some classic papers by Jerry Hobbs.) This is a view I've been rearing for a while, but I finally started putting it into words (probably mostly Jerry's words) in a conversation at ACL with Hoifung Poon and Joseph Turian (I think it was Joseph... my memory fades quickly these days :P).

This view is that discourse is that thing that gives you an interpretation above and beyond whatever interpretations you get from a sentence. Here's a slightly refined version of the example I came up with on the fly at ACL:

I only like traveling to Europe. So I submitted a paper to ACL.
I only like traveling to Europe. Nevertheless, I submitted a paper to ACL.

What does the hearer (H) infer from these sentences. Well, if we look at the sentences on their own, then H infers something like Hal-likes-travel-to-Europe-and-only-Europe, and H infers something like Hal-submitted-a-paper-to-ACL. But when you throw discourse in, you can derive two additional bits of information. In example (1), you can infer ACL-is-in-Europe-this-year and in (2) you can infer the negation of that.

Pretty amazing stuff, huh? Replacing a "so" with a "nevertheless" completely changes this interpretation.

What does this have to do with interpretation as abduction? Well, we're going to assume that this discourse is coherent. Given that assumption, we have to ask ourselves: in (1), what do we have to assume about the world to make this discourse coherent? The answer is that you have to assume that ACL is in Europe. And similarly for (2).

Of course, there are other things you could assume that would make this discourse coherent. For (1), you could assume that I have a rich benefactor who likes ACL submissions and will send me to Europe every time I submit something to ACL. For (2), you could assume that I didn't want my paper to get in, but I wanted a submission to get reviews, and so I submitted a crappy paper. Or something. But these fail the Occam's Razor test. Or, perhaps they are a priori simply less likely (i.e., you have to assume more to get the same result).

Interestingly, I can change the interpretation of (2), for instance, by adding a third sentence to the discourse: "I figured that it would be easy to make my way to Europe after going to Israel." Here, we would abduce that ACL is in Israel, and that I'm willing to travel to Israel on my way to Europe. For you GOFAI folks, this would be something like non-monotonic reasoning.

Whenever I talk about discourse to people who don't know much about it, I always get this nagging sense of "yes, but why do I care that you can recognize that sentence 4 is background to sentence 3, unless I want to do summarization?" I hope that this view provides some alternative answer to that question. Namely, that there's some information you can get from sentences, but there is additional information in how those sentences are glued together.

Of course, one of the big problems we have is that we have no idea how to represent sentence-level interpretations, or at least some ideas but no way to get there in the general case. In the sentence-level case, we've seen some progress recently in terms of representing semantics in a sort of substitutability manner (ala paraphrasing), which is nice because the representation is still text. One could ask if something similar might be possible at a discourse level. Obviously you could paraphrase discourse connectives, but that's missing the point. What else could you do?

24 July 2010

ACL 2010 Retrospective

ACL 2010 finished up in Sweden a week ago or so. Overall, I enjoyed my time there (the local organization was great, though I think we got hit with unexpected heat, so those of us who didn't feel like booking a room at the Best Western -- hah! why would I have done that?! -- had no A/C and my room was about 28-30 every night).

But you don't come here to hear about sweltering nights, you come to hear about papers. My list is actually pretty short this time. I'm not quite sure why that happened. Perhaps NAACL sucked up a lot of the really good stuff, or I went to the wrong sessions, or something. (Though my experience was echoed by a number of people (n=5) I spoke to after the conference.) Anyway, here are the things I found interesting.

Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, by Matthew Gerber and Joyce Chai (this was the Best Long Paper award recipient). This was by far my favorite paper of the conference. For all you students out there (mine included!), pay attention to this one. It was great because they looked at a fairly novel problem, in a fairly novel way, put clear effort into doing something (they annotated a bunch of data by hand), developed features that were significantly more interesting than the usual off-the-shelf set, and got impressive results on what is clearly a very hard problem. Congratulations to Matthew and Joyce -- this was a great paper, and the award is highly deserved.
Challenge Paper: The Human Language Project: Building a Universal Corpus of the World’s Languages, by Steven Abney and Steven Bird. Basically this would be awesome if they can pull it off -- a giant structured database with stuff from tons of languages. Even just having tokenization in tons of languages would be useful for me.
Extracting Social Networks from Literary Fiction, by David Elson, Nicholas Dames and Kathleen McKeown. (This was the IBM best student paper.) Basically they construct networks of characters from British fiction and try to analyze some literary theories in terms of those networks, and find that there might be holes in the existing theories. My biggest question, as someone who's not a literary theorist, is why did those theories exist in the first place? The analysis was over 80 or so books, surely literary theorists have read and pondered all of them.
Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish, by Reyyan Yeniterzi and Kemal Oﬂazer. You probably know that I think translating morphology and translating out of English are both interesting topics, so it's perhaps no big surprise that I liked this paper. The other thing I liked about this paper is that they presented things that worked, as well as things that might well have worked but didn't.
Learning Common Grammar from Multilingual Corpus, by Tomoharu Iwata, Daichi Mochihashi and Hiroshi Sawad. I wouldn't go so far as to say that I thought this was a great paper, but I would say there is the beginning of something interesting here. They basically learn a coupled PCFG in Jenny Finkel hierarchical-Bayes style, over multiple languages. The obvious weakness is that languages don't all have the same structure. If only there were an area of linguistics that studies how they differ.... (Along similar lines, see

Phylogenetic Grammar Induction, by Taylor Berg-Kirkpatrick and Dan Klein, which has a similar approach/goal.)
Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation, by Michael Bloodgood and Chris Callison-Burch. The "trend" referenced in the title is that active learning always asymptotes depressingly early. They have turkers translate bits of sentences in context (i.e., in a whole sentence, translate the highlighted phrase) and get a large bang-for-the-buck. Right now they're looking primarily at out-of-vocabulary stuff, but there's a lot more to do here.

A few papers that I didn't see, but other people told me good things about:

“Was It Good? It Was Provocative.” Learning the Meaning of Scalar Adjectives, by Marie-Catherine de Marneffe, Christopher D. Manning and Christopher Pott.
Unsupervised Ontology Induction from Text, by Hoifung Poon and Pedro Domingos.
Improving the Use of Pseudo-Words for Evaluating Selectional Preferences, by Nathanael Chambers and Daniel Jurafsky.
Learning to Follow Navigational Directions, by Adam Vogel and Daniel Jurafsky.
Compositional Matrix-Space Models of Language, by Sebastian Rudolph and Eugenie Giesbrecht. (This was described to me as "thought provoking" though not necessarily more.)
Top-Down K-Best A* Parsing, by Adam Pauls, Dan Klein and Chris Quirk.

At any rate, I guess that's a reasonably long list. There were definitely good things, but with a fairly heavy tail. If you have anything you'd like to add, feel free to comment. (As an experiment, I've turned comment moderation on as a way to try to stop the spam... I'm not sure I'll do it indefinitely; I hadn't turned it on before because I always thought/hoped that Google would just start doing spam detection and/or putting hard captcha's up or something to try to stop spam, but sadly they don't seem interested.)

28 June 2010

ICML 2010 Retrospective

Just got back from Israel for ICML, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. I'm very glad I spent some time afterwards out of Haifa, though.

Overall, I saw a lot of really good stuff. The usual caveats apply (I didn't see everything it's a biased sample, blah blah blah). Here are some things that stood out:

Structured Output Learning with Indirect Supervision (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth). This was probably one of my favorite papers of the conference, even though I had learned some about the work when I visited UIUC a few months ago. Let's say you're trying to do word alignment, and you have a few labeled examples of alignments. But then you also have a bunch of parallel data. What can you do? You can turn the parallel data into a classification problem: are these two sentences translations of each other. You can pair random sentences to get negative examples. A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem! (Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.) Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad? Ming-Wei reassured him that it wasn't. In thinking about it after the fact, I wonder if it is actually really importantant that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment. I suspect this might be the case.

A Language-based Approach to Measuring Scholarly Impact (S. Gerrish, D. Blei). The idea here is that without using citation structure, you can model influence in large document collections. The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before. The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P). It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce eachother. That is, you might regularly cite the original EM even if your paper has almost nothing to do with it. (The example from the talk was then Penn Treebank paper: one that has a bajillion citations, but hasn't lexically affected how people talk about research.)

Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola). This received one of the best paper awards. While I definitely liked this paper, actually what I liked more what that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this). That paper was A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem. You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM! In the case that the data was actually generated by an HMM, you get good performance and good guarantees. Unfortunately, if the data was not generated by an HMM, then the theory doesn't work and the practice does worse than EM. That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction. At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis. (Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.)

Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process (N. Bartlett, D. Pfau, F. Wood). This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process. (A restaurant is just the set of counts for a given context.) You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm). You can do this intelligently (via a heuristic) or just stupidly: pick them at random. Turns out it doesn't matter. The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added. The chance that something that occurs a lot keeps getting picked to be thrown out is low. There's some connection to using approximate counting for language modeling, but the Bartlett et al. paper is being even stupider than we were being!

Learning efficiently with approximate inference via dual losses (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson). Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss). The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization. You now have a dual minimization, which you can solve simultaneously, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs. Moreover, you can "warm start" your inference for the next round. It's a very nice idea. I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping. However, most of the graphs they were using weren't very large, so maybe as yo move toward harder problems, the speed-ups will be more obvious.

Deep learning via Hessian-free optimization (J. Martens). Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day. The "obvious" take away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products). But the less obvious take away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong. For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propogation. The party line that I've already heard is that the layer-wise training is very important to getting the network near a "good" local optimum (whatever that means). But if James' story holds out, this seems to not be true: he doesn't do any clever initialization and still find good local optima!

A theoretical analysis of feature pooling in vision algorithms (Y.-L. Boureau, J. Ponce, Y. LeCun). Yes, that's right: a vision paper. Why should you read this paper? Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: Sift features), you get something that looks like a multiset of features (hrm.... sounds familiar). These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not. (Sound familiar yet?) The analysis is: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes. Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ration of between-class variances and overall variances) to see how these regimes work out. And they look basically how they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped).

Some other papers that I liked, but don't want to write too much about:

Learning Programs: A Hierarchical Bayesian Approach (P. Liang, M. Jordan, D. Klein). Structured models over programs are very hard; this paper gives one approach to modeling them.
Budgeted Nonparametric Learning from Data Streams (R. Gomes, A. Krause). Shows that a clustering problem and a Gaussian process problem are submodular, goes from there.
Internal Rewards Mitigate Agent Boundedness (J. Sorg, S. Singh, R. Lewis). Exactly what the title says.
The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data (J. Vogt, S. Prabhakaran, T. Fuchs, V. Roth). Been wanting to do something like this for a while, but they did it better than I would have!
Sparse Gaussian Process Regression via L_1 Penalization (F. Yan, Y. Qi). Very interesting way to get sparsity in a GP basically by changing your approximating distribution.

Some papers that other people said they liked were:

Multi-Class Pegasos on a Budget (Z. Wang, K. Crammer, S. Vucetic)
Risk minimization, probability elicitation, and cost-sensitive SVMs (H. Masnadi-Shirazi, N. Vasconcelos)
Asymptotic Analysis of Generative Semi-Supervised Learning (J. Dillon, K. Balasubramanian, G. Lebanon)

Hope to see you at ACL!

07 June 2010

NAACL 2010 Retrospective

I just returned from NAACL 2010, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles. (That's me trying to deride downtown LA as being nothing like real LA.)

Overall I was pleased with the program. I saw a few talks that changed (a bit) how I think about some problems. There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level. Of course I spend a great deal of time not at talks, but no longer feel bad about doing so.

On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks. I think Hoifung did a great job of targetting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself). I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)." That's not exactly right, but I think it's pretty close. You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P. At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly. (I'm also interested in trying Andrew McCallum's Factorie system, which, to some degree, trades easy of use for added functionality. But honestly, I don't really have time to just try things these days: students have to do that for me.)

One of my favorite papers of the conference was one that I hadn't even planned to go see! It is Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi. (I saw it basically because by the end of the conference, I was too lazy to switch rooms after the prvious talk.) There are two things I really like about this paper. The first is that the type of sentiment they're going after is really broad. Example sentences included things that I'd love to look up, but apparently were only in the slides... but definitely more than "I love this movie." The example in the paper is "Tylenol prevents cancer," which is a nice positive case.

The basic idea is that some words give you sentiment. For instance, by itself, "cancer" is probably negative. But then some words flip polarity. Like "prevents." Or negation. Or other things. They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words. The flipping words are allowed to flip any sentiment below them in the dependency tree. Cool idea! Of course, I have to nit-pick the paper a bit. It probably would be better to allow arguments/adjuncts to flip polarity, too. Otherwise, negation (which is usually a leaf) will never flip anything. And adjectives/adverbs can't flip either (eg., going from "happy" to "barely happy"). But overall I liked the paper.

A second thing I learned is that XOR problems do exist in real life, which I had previously questioned. The answer came (pretty much unintentionally) from the paper The viability of web-derived polarity lexicons by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald. I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation. But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text. Voila! XOR! (Because negation XORs terms.)

I truly enjoyed Owen Rambow's poster on The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece. If you're ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your languge, but unless you have a theory of that language (in the model-theoretic sense) then whatever you do is meaningless. In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything. Take home message: pay attention to the meaning of your symbols!

In the category of "things everyone should know about", there was Painless unsupervised learning with features by Taylor Berg-Kirkpatrick, Alexandre Bouchard Côté, John DeNero and Dan Klein. The idea is that you can replace your multinomails in an HMM (or other graphical model) with little maxent models. Do EM in this for unsuperviesd learning and you can throw in a bunch of extra features. I would have liked to have seen a comparison against naive Bayes with the extra features, but my prior belief is sufficiently strong that I'm willing to believe that it's helpful. The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful. Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.

I didn't see the presentation for From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing by Valetin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it. The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples. I really really really tried to get something like this to work for unsearn, but nothing helped and most things hurn. (I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (eg., human babies learn words in a particular order), etc.) I'm thrilled it actually works.

Again, I didn't see Discriminative Learning over Constrained Latent Representations by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!). This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.

I usually ask people at the end of conferences what papers they liked. Here are some papers that were spoken highly of by my fellow NAACLers. (This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list. Other than that, I think I've included everything that was specifically mentioned to me.)

Optimal Parsing Strategies for Linear Context-Free Rewriting Systems by Daniel Gildea.
Products of Random Latent Variable Grammars by Slav Petrov.
Joint Parsing and Alignment with Weakly Synchronized Grammars by David Burkett, John Blitzer and Dan Klein.
For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia by Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee.
Type-Based MCMC by Percy Liang, Michael I. Jordan and Dan Klein.

I think I probably have two high level "complaints" about the program this year. First, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers. I apologize if I've just described your paper, but these papers really rub me the wrong way. I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that. Now, if my sentence described your paper, but your paper additionally had a really interesting analysis that helps us understand something about language, then you rock! Second, I saw a lot of presentations were speakers were somewhat embarassingly unaware of very prominent very relevant prior work. (In none of these cases was the prior work my own: it was work that's much more famous.) Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not. Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.

Okay, I just ranted, so let's end on a positive note. I'm leaving the conference knowing more than when I went, and I had fun at the same time. Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong. Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of you for writing these papers!

Which would you attend?
	Only ACL
	Only ICML
	ACL and the symposium
	ICML and the symposium
	Only ACL and ICML
	All three
	Only ACL, because of convenience

pollcode.com free polls