By which I mean NIPS, and the attendant exhaustion of 14-hour days. (P.S.: if you're at NIPS, see the meta-comment I have at the end of this post, even if you skip the main body :P.)
Today I went to two tutorials: one by Yee Whye Teh and Peter Orbanz (who is starting shortly at Columbia) on non-parametric Bayesian stuff, and one by Naftali Tishby on information theory in learning and control. They were streamed online; I'm sure the videos will show up at some point on some web page, but I can't find them right now (incidentally, and shamelessly, I think NAACL should have video tutorials, too -- actually my dear friend Chris wants that too, and since Kevin Knight has already promised that ACL approves all my blog posts, I suppose I can only additionally promise that I will do everything I can to keep at least a handful of MT papers appearing in each NAACL despite the fact that no one really works on it anymore :P). Then there were spotlights followed by posters, passed (as well as sat down) hors d'oeuvres, free wine/sangria/beer/etc, and friends and colleagues.
The first half of the Teh/Orbanz tutorial is roughly what I would categorize as "NP Bayes 101" -- stuff that everyone should know, with the addition of some pointers to results about consistency, rates of convergence of the posterior, etc. The second half included a lot of stuff that's recently become interesting, in particular topics like completely random measures, coagulation/fragmentation processes, and the connection between gamma processes (an example of a completely random measure) and Dirichlet processes (which we all know and love/hate).
One of the more interesting things toward the end was what I would roughly characterize as variants of de Finetti's theorem on exchangeable objects. What follows is from memory, so please forgive errors; you can look it up in the tutorial. De Finetti's theorem states that if p(X1, X2, ..., Xn, ...) is exchangeable, then p has a representation as a mixture model, with (perhaps) infinite-dimensional mixing coefficients. This is a fairly well-known result, and was apparently part of the initial reason Bayesians started looking into non-parametrics.
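Concretely, and reconstructing the standard statement from memory rather than from the slides: exchangeability means p(X1, ..., Xn) = p(X_pi(1), ..., X_pi(n)) for every n and every permutation pi, and the representation theorem says p(X1, ..., Xn) = integral over G of [ prod_i q(Xi | G) ] dP(G); that is, the Xi are iid draws from a random distribution G (possibly infinite-dimensional), mixed under a prior P on G.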
The generalizations (due to people like Kingman, Pitman, Aldous, etc.) describe what happens for other types of data (i.e., other than plain exchangeable sequences). For instance, if a sequence of data is block-exchangeable (think of a time series, which is obviously not exchangeable, but which you could conceivably cut into contiguous pieces that are themselves exchangeable), then it has a representation as a mixture of Markov chains. For graph-structured data, if the nodes are exchangeable (i.e., all that matters is the pattern of edges, not precisely which nodes they happen to connect), then this also has a mixture parameterization, though I've forgotten the details.
The Tishby tutorial started off with some very interesting connections between information theory, statistics, and machine learning, essentially from the point of view of hypothesis testing. The first half of the tutorial centered around information bottleneck, which is a very beautiful idea. You should all go read about it if you don't know it already.
What actually really struck me was a comment that Tishby made somewhat off-hand, and I'm wondering if anyone can help me out with a reference. The statement has to do with the question "why KL?" His answer had two parts. For the first part, consider mutual information (which is closely related to KL). MI has the property that if X -> Y -> Z is a Markov chain, then the amount of information that Z gives you about X is at most the amount of information that Y gives you about X. In other words, if you think of Z as a "processed" version of Y, then this processing cannot give you more information. This property is more general than just MI, and I believe anything that obeys it is a Csiszar divergence. The second part is the part that I'm not so sure of. It originated with the observation that if you take the log of a product, you get a sum. This is really nice because you can apply results like the central limit theorem to the resulting sum. (Many of the results in the first half of his tutorial hinged on this additivity.) The claim was something like: the only divergences that have this additivity are Bregman divergences. (This is not obvious to me, and it's actually not entirely obvious what the right definition of additivity is, so if someone wants to help out, please do so!) But the connection is that MI and KL are the "intersection" of Bregman divergences and Csiszar divergences. In other words, if you want the decreasing-information property and you want the additivity property, then you MUST use information-theoretic measures.
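To pin the two properties down (these are the standard statements, filled in by me rather than quoted from the tutorial): for a Markov chain X -> Y -> Z, the data-processing inequality gives I(X; Z) <= min(I(X; Y), I(Y; Z)); and KL is additive over independent products, KL(P1 x P2 || Q1 x Q2) = KL(P1 || Q1) + KL(P2 || Q2), which is exactly the log-of-a-product-becomes-a-sum property. Correspondingly, I(X1, X2; Y1, Y2) = I(X1; Y1) + I(X2; Y2) whenever the pair (X1, Y1) is independent of the pair (X2, Y2).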
I confess that roughly the middle third of the talk went above my head, but I did learn about an interesting connection between the Gaussian information bottleneck and CCA: basically they're the same, up to a trimming of the eigenvalues. This is in a 2005 JMLR paper by Amir Globerson and others. In the context of this, actually, Tishby made a very offhand comment that I couldn't quite parse as to whether it was a theorem or a hope. Basically the context was that when working with Gaussian-distributed random variables, you can do information bottleneck "easily," but it's hard for other distributions. So what do we do? We do a kernel mapping into a high-dimensional space (they use an RBF kernel) where the data will look "more Gaussian." As I said, I couldn't quite parse whether this is "where the data will provably look more Gaussian" or "where we hope that maybe by dumb luck the data will look more Gaussian" or something in between. If anyone knows the answer, again, I'd love to know. And if you're here at NIPS and can answer either of these two questions to my satisfaction, I'll buy you a glass of wine (or beer, but why would you want beer? :P).
Anyway, that's my report for day one of NIPS!
p.s. I made the silly decision of taking a flight from Granada to Madrid at 7a on Monday 19 Dec. This is way too early to take a bus, and I really don't want to take a bus Sunday night. Therefore, I will probably take a cab. I think it will be about 90 euros. If you also were silly and booked early morning travel on Monday and would like to share said cab, please email me (me AT hal3 DOT name).
12 December 2011
14 October 2011
You need a job and I have $$$
If you're an NLP or ML person and graduating in the next six months or so and are looking for a postdoc position with very very open goals, read on. The position would be at UMD College Park, in the greater DC area, with lots of awesome people around (as well as JHU and other universities a short drive/train away).
This position could start as early as January 2012, probably more likely around June 2012 and could be as late as September 2012 for the right person. Even if you're not graduating until the one-year-from-now time frame, please contact me now! I'm looking more for a brilliant, hard-working, creative person than anyone with any particular skills. That said, you probably know what sort of problems I tend to work on, so it would be good if you're at least interested in things roughly in that space (regardless of whether you've worked on them before or not).
The position would be for one year, with an extension to two if things are working out well for both of us (not subject to funding).
If you're interested, please email me at postdoc@hal3.name with the following information:
- Inline in the email:
- Your PhD institution and advisor, thesis title, and expected graduation date.
- Links to the two or three most awesome papers you have, together with titles and venue.
- Link to your homepage.
- A list of three references (names, positions and email addresses).
- Attached to the email:
- A copy of your CV, in PDF format.
- A brief (one page) research statement that focuses mostly on what problem(s) you'd most like to work on in a postdoc position with me. Also in PDF format.
11 October 2011
Active learning: far from solved
As Daniel Hsu and John Langford pointed out recently, there has been a lot of recent progress in active learning. This is to the point where I might actually be tempted to suggest some of these algorithms to people to use in practice, for instance the one John has that learns faster than supervised learning because it's very careful about what work it performs. That is, in particular, I might suggest that people try it out instead of the usual query-by-uncertainty (QBU) or query-by-committee (QBC). This post is a brief overview of what I understand of the state of the art in active learning (paragraphs 2 and 3) and then a discussion of why I think (a) researchers don't tend to make much use of active learning and (b) why the problem is far from solved. (a will lead to b.)
For those who know what QBU and QBC are, skip this paragraph. The idea with QBU is exactly what you think: when choosing the next point to ask for the label of, choose the one on which your current model is maximally uncertain. If you're using a probabilistic model, this means something like "probability is closest to 0.5," or, in the non-binary case, something like "highest entropy of p(y|x)." If you're using something like an SVM, perhaps margin (aka distance to the hyperplane) is a reasonable measure of uncertainty. In QBC, the idea is still to query on uncertain points, but the uncertainty is computed by the amount of disagreement among a committee of classifiers, for instance, classifiers trained in a bootstrap manner on whatever data you have previously labeled.
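To make QBU concrete, here's a minimal sketch (a toy illustration under my own assumptions, not anyone's reference implementation): entropy of p(y|x) as the uncertainty score, with scikit-learn's LogisticRegression and a made-up pool standing in for your real model and data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def qbu_select(model, X_pool):
    """Query-by-uncertainty: return the index of the pool point whose
    predicted label distribution has the highest entropy."""
    probs = model.predict_proba(X_pool)                       # (n_pool, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return int(np.argmax(entropy))

# Toy usage with made-up data; QBC would instead train a small committee
# (e.g., on bootstrap resamples of the labeled data) and query the point
# on which the committee members disagree the most.
rng = np.random.RandomState(0)
X_lab, y_lab = rng.randn(20, 5), rng.randint(0, 2, 20)
X_pool = rng.randn(100, 5)
model = LogisticRegression().fit(X_lab, y_lab)
print("next point to query:", qbu_select(model, X_pool))
```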
One of the issues with QBU and QBC and really a lot of the classic methods for active learning is that you end up with a biased set of training data. This makes it really hard to prove anything about how well your algorithm is going to do on future test examples, because you've intentially selected examples that are not random (and probably not representative). One of the "obvious in retrospect" ideas that's broken this barrier is to always train your classifier on all examples: the label for those that you've queried on is given by the human, and the label for those that you haven't queried on is given by your model from the previous iteration. Thus, you are always training on an iid sample from the distribution you care about (at least from a p(x) perspective). This observation, plus a lot of other work, leads to some of the breakthroughs that John mentions.
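And here's a minimal sketch of that "train on everything" idea as I've just described it (the idea only, not the actual algorithms behind the results John points to): human labels where we queried, the previous model's predictions everywhere else.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def retrain_on_everything(prev_model, X_all, human_labels):
    """Train on the full pool: human labels where we queried, the previous
    model's predictions everywhere else, so p(x) stays an iid sample."""
    y_train = prev_model.predict(X_all).copy()    # self-labels for every point
    for i, y in human_labels.items():             # overwrite with human answers
        y_train[i] = y
    return LogisticRegression().fit(X_all, y_train)

# toy usage: pretend we queried labels for pool points 3 and 17
rng = np.random.RandomState(1)
X_all = rng.randn(50, 5)
prev_model = LogisticRegression().fit(rng.randn(10, 5), rng.randint(0, 2, 10))
new_model = retrain_on_everything(prev_model, X_all, {3: 1, 17: 0})
```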
An easy empirical observation is that not many people (in my sphere) actually use active learning. In fact, the only case that I know of was back in 2004 where IBM annotated extra coreference data for the Automatic Content Extraction (ACE) challenge using active learning. Of course people use it to write papers about active learning, but that hardly counts. (Note that the way that previously learned taggers, for instance the Brill Tagger, were used in the construction of the Penn Treebank does not fall under the auspices of active learning, at least as I'm thinking about it here.)
It is worth thinking about why this is. I think that the main issue is that you end up with a biased training set. If you use QBC or QBU, this is very obvious. If you use one of the new approaches that self-label the rest of the data to ensure that you don't have a biased training set, then of course p(x) is unbiased, but p(y|x) is very biased by whatever classifier you are using.
I think the disconnect is the following. The predominant view of active learning is that the goal is a classifier. The data that is labeled is a byproduct that will be thrown away once the classifier exists.
The problem is that this view flies in the face of the whole point of active learning: that labeling is expensive. If labeling is so expensive, we should be able to reuse this data so that the cost is amortized. That is, yes, of course we care about a classifier. But just as much, we care about having a data set (or "corpus" in the case of NLP).
Consider, for instance, the Penn Treebank. The sorts of techniques that are good at parsing now were just flat-out not available (and perhaps not even conceivable) back in the 1990s when the Treebank was being annotated. If we had done active learning for the Treebank under a non-lexicalized, non-parent-annotated PCFG that gets 83% accuracy, maybe worse because we didn't know how to smooth well, then how well would this data set work for modern-day state-splitting grammars with all sorts of crazy Markovization and whatnot going on?
The answer is: I have no idea. I have never seen an experiment that looks at this issue. And it would be so easy! Run your standard active learning algorithm with one type of classifier. Plot your usual active-versus-passive learning curves. Now, using the same sequence of data, train another classifier. Plot that learning curve. Does it still beat passive selection? By how much? And then, of course, can we say anything formal about how well this will work?
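Here's roughly the shape of the experiment I have in mind, sketched with scikit-learn classifiers standing in for "one type of classifier" and "another classifier" and a synthetic dataset standing in for your corpus; the particular models and numbers are of course placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for an annotation project: a 600-point pool plus a
# 400-point held-out test set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_pool, y_pool, X_test, y_test = X[:600], y[:600], X[600:], y[600:]

def qbu_order(make_model, seed_size=20, batch=25):
    """Order in which entropy-based QBU queries the pool under make_model()."""
    order = list(range(seed_size))
    rest = list(range(seed_size, len(y_pool)))
    while rest:
        model = make_model().fit(X_pool[order], y_pool[order])
        probs = model.predict_proba(X_pool[rest])
        ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
        picks = set(np.argsort(-ent)[:batch])            # query a batch at a time
        order += [r for j, r in enumerate(rest) if j in picks]
        rest = [r for j, r in enumerate(rest) if j not in picks]
    return order

def curve(make_model, order, steps=(25, 50, 100, 200, 400, 600)):
    """Test accuracy of make_model() trained on growing prefixes of `order`."""
    return [round(make_model().fit(X_pool[order[:n]], y_pool[order[:n]])
                  .score(X_test, y_test), 3) for n in steps]

order_A = qbu_order(lambda: LogisticRegression(max_iter=1000))  # selected under model A
passive = list(range(len(y_pool)))                              # pool is pre-shuffled
print("model B on A's active order:", curve(GaussianNB, order_A))
print("model B on passive order:   ", curve(GaussianNB, passive))
```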
There are tons of ways that this problem can arise. For instance, when I don't have much data I might use a generative model, and then when I have lots of data I might use a discriminative model. Or, as I get more data, I add more features. Or someone finds awesome features 5 years later for my problem. Or new machine learning techniques are developed. Or anything. I don't want my data to become obsolete when this happens.
I am happy to acknowledge that this is a very hard problem. In fact, I suspect that there's some sort of no-free-lunch theorem lurking in here. Essentially, if the inductive biases of the classifier that you use to do the active learning and the classifier you train at the end are too different, then you could do (arbitrarily?) badly. But in the real world, our hypothesis classes aren't all that different, or perhaps you can assume you're using a universal function approximator or a universal kernel or something. Assume what you want to start with, but I think it's an interesting question.
And then, while we're on the topic of active learning, I'd also like to see whether an active learning algorithm's performance asymptotes before all your training data is exhausted. That is, the usual model in active learning experiments is that you have 1000 training examples because that's what someone labeled. You then do active learning up to 1000 examples, and of course at that point, everything has been labeled, so active learning performance coincides precisely with passive learning performance. But this is a poor reflection of many problems in the world, where new inputs are almost always free. I want the Banko and Brill paper for active learning... perhaps it's out there, and if you've seen it, I'd love a pointer. I ran a couple experiments along these lines (nothing concrete), but it actually seemed that active learning from a smaller pool was better, perhaps because you have fewer outliers (I was using QBU). But these results are by no means concrete, so don't cite me as saying this :).
At any rate, I agree that active learning has come a long way. I would humbly suggest that the goal of simply building a classifier is not in the real spirit of trying to save money. If you wanted to save money, you would save your data and share it (modulo lawyers). In the long run, passive learning currently seems much less expensive than active learning to me.
29 September 2011
A technique for me is a task for you
Originally in the context of Braque and now in the context of FUSE, I've thought a bit about understanding the role of techniques and tasks in scientific papers (admittedly, mostly NLP and ML, which I realize are odd and biased). I worked with Sandeep Pokkunuri, an MS student at Utah, on the following problem: given a paper (title, abstract, full text), determine what task is being solved and what technique is being used to solve it. For instance, for a paper like "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," the task would be "segmenting and labeling sequence data" and the technique would be "conditional random fields."
You can actually go a long way just looking for simple patterns in paper titles, like "TECH for TASK" or "TASK by TECH" and a few things like that (after doing some NP chunking and clean-up). From there you can get a good list of seed tasks and techniques, and could conceivably bootstrap your way out from there. We never got a solid result out of this, and sadly I moved and Sandeep graduated and it never went anywhere. What we wanted to do was automatically generate tables of "for this TASK, here are all the TECHs that have been applied (and maybe here are some results), and oh, by the way, maybe applying these other TECHs would make sense." Or vice versa: this TECH has been applied to blah blah blah tasks. You might even be able to tell which TECHs are better for which types of tasks, but that's quite a bit more challenging.
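For flavor, here's roughly the level of pattern matching I mean, with crude regular expressions standing in for the NP chunking and clean-up, applied to the CRF title from above.

```python
import re

# Crude surface patterns standing in for the chunking/clean-up step:
# "TECH for TASK" and "TASK by/via/using/with TECH".
patterns = [
    re.compile(r"^(?P<tech>.+?)\s+for\s+(?P<task>.+)$", re.I),
    re.compile(r"^(?P<task>.+?)\s+(?:by|via|using|with)\s+(?P<tech>.+)$", re.I),
]

def extract(title):
    for pat in patterns:
        m = pat.match(title)
        if m:
            return m.group("task").strip(), m.group("tech").strip()
    return None

title = ("Conditional Random Fields: Probabilistic Models "
         "for Segmenting and Labeling Sequence Data")
print(extract(title))
# ('Segmenting and Labeling Sequence Data', 'Conditional Random Fields: Probabilistic Models')
```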
At any rate, a sort of "obvious in retrospect" thing that we noticed was that what I might consider a technique, you might consider a task. And you can construct a chain, typically all the way back to math. For instance, I might consider movie recommendations a task. To solve recommendations, I apply the technique of sparse matrix factorization. But then to you, sparse matrix factorization is a task and to solve it, you apply the technique of compressive sensing. But to Scott Tanner, compressive sensing is a task, and he applies the technique of smoothed analysis (okay this is now false, but you get the idea). But to Daniel Spielman, smoothed analysis is the task, and he applies the technique of some other sort of crazy math. And then eventually you get to set theory (or some might claim you get to category theory, but they're weirdos :P).
(Note: I suspect the same thing happens in other fields, like bio, chem, physics, etc., but I cannot offer such an example because I don't know those areas. Although not so obvious, I do think it holds in math: I use the proof technique of Shelah35 to prove blah -- there, both theorems and proof techniques are objects.)
At first, this was an annoying observation. It meant that our ontology of the world into tasks and techniques was broken. But it did imply something of a richer structure than this simple ontology. For instance, one might posit as a theory of science and technology studies (STS, a subfield of social science concerned with related things) that the most basic thing that matters is that you have objects (things of study) and an appliedTo relationship. So recommender systems, matrix factorization, compressive sensing, smoothed analysis, set theory, etc., are all objects, and they are linked by appliedTo edges.
You can then start thinking about what sort of properties appliedTo might have. It's certainly not a function (many things can be applied to any X, and any Y can be applied to many things). I'm pretty sure it should be antireflexive (you cannot apply X to solve X). It should probably also be antisymmetric (if X is applied to Y, probably Y cannot be applied to X). Transitivity is not so obvious, but I think you could argue that it might hold: if I apply gradient descent to an optimization problem, and my particular implementation of gradient descent uses line search, then I kind of am applying line search to my problem, though perhaps not directly. (I'd certainly be interested to hear of counter-examples if any come to mind!)
If this is true, then what we're really talking about is something like a directed acyclic graph, which at least at a first cut seems like a reasonable model for this world. Probably you can find exceptions to almost everything I've said, but that's why you need statistical models or other things that can deal with "noise" (aka model misspecification).
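As a toy illustration (the object names come from the example chain above; the edges are entirely made up by me), you can store appliedTo as edges of a directed graph and check the acyclicity that the DAG story requires:

```python
# The appliedTo relation from the example chain above, as a directed graph
# (edges invented for illustration); has_cycle checks acyclicity.
applied_to = {
    "sparse matrix factorization": ["movie recommendation"],
    "compressive sensing": ["sparse matrix factorization"],
    "smoothed analysis": ["compressive sensing"],
    "set theory": ["smoothed analysis"],
}

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2          # unseen / on the DFS stack / finished
    color = {}
    def visit(u):
        color[u] = GRAY
        for v in graph.get(u, []):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and visit(v)):
                return True               # back edge found: a cycle
        color[u] = BLACK
        return False
    return any(color.get(u, WHITE) == WHITE and visit(u) for u in graph)

print("appliedTo is a DAG:", not has_cycle(applied_to))   # True for this toy graph
```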
Actually something more like a directed acyclic hypergraph might make sense, since often you simultaneously apply several techniques in tandem to solve a problem. For instance, I apply subgradient descent and L1 regularization to my binary classification problem -- the fact that these two are being applied together rather than separately seems important somehow.
Not that we've gone anywhere with modeling the world like this, but I definitely think there are some interesting questions buried in this problem.
26 September 2011
Four months without blogs
As you've noticed, I haven't posted in a while. I've also not been reading blogs. The number of unread posts in my reader is now 462. Clearly I'm not going to go back and read all 462 posts that I missed. I will claim that this was an experiment to see what a (nearly) blog-free world is like.
I actually found that I missed both the reading and the writing, so now (especially since I've switched over to public transportation and so have about an hour of transit time to kill) I'm going to go back to reading while being transported and blogging when I have time.
I figured I'd return to blogging by saying a bit about a recent experience. Less than a month ago I had the honor of serving on Jurgen Van Gael's Ph.D. examination committee. Jurgen did an excellent job and, as perhaps expected, passed. But what I want to talk about is how the UK model (or at least the Cambridge model) is different from the US model.
In the UK, the examination is done by two faculty members, one internal (this was Stephen Clark) and one external (that was me). It does not involve the advisor/supervisor, though this person can sit in the room without speaking :). There is no public presentation and the process we followed was basically to go through the dissertation chapter-by-chapter, ask clarification questions, perhaps some things to get Jurgen to think on his toes, and so on. This took about two hours.
Contrast this to the (prototypical) US model, where a committee consists of 5 people, perhaps one external (either external to CS or to the university, depending on how your institution sets it up), and includes the advisor. The defense is typically a 45-minute public presentation followed by questions from the committee in a closed-room environment with the student.
Having been involved, now, in both types, I have to say they each have their pros and cons. I think the lack of a public presentation in the UK model is a bit of a shame, though of course students could decide to do these anyway. But it's nice to have something official for parents or spouses to come to if they'd like. However, in the US, the public presentation, plus the larger committee, probably leads to the situation, which students often joke about, that not even their committee reads their dissertation. You can always fall back on the presentation, much like students skip class reading when they know that the lecture will cover it all. When it was just me, Stephen and Jurgen, there's really no hiding in the background :).
I also like how in the UK model, you can skip over the easy stuff and really spend time talking with the student about the deep material. I found myself much more impressed with how well Jurgen knows his stuff after the examination than before, and this is not a feeling I usually get with US students because their defense is typically quite high-level. And after 45 minutes of a presentation, plus 15 minutes of audience questions, the last thing anyone wants to do is sit around for another two hours examining the details of the dissertation chapter-by-chapter.
Regarding the issue of having the advisor there or not, I don't have a strong preference. The one thing I will say is that having the advisor absent removes the potential for weird politics. For instance, I have seen one or two defenses in which an advisor tends to answer questions for the student, without the student first attempting an answer. If I were on these committees, with a relatively senior advisor, it might be politically awkward to ask them not to do this. Luckily this issue hasn't come up for me, but I could imagine it happening.
Obviously I don't really expect anyone's policies to change, and I'm not even sure that they should, but I like thinking about things that I've grown used to taking for granted. Plus, after having gone through the UK model, I think I will grill students a bit more during the Q/A time. And if this means that fewer students ask me to be on their committees, then there's more time to blog :).
07 July 2011
Introducing Braque, your paper discovery friend
(Shameless plug/advertisement follows.)
Want to be informed of new interesting papers that show up online?
Tired of trolling conference proceedings to find that one gem?
Want to make sure interested parties hear about your newest results?
Want to know when a new paper comes out that cites you?
Braque (http://braque.cc) can help.
Braque is a news service for research papers (currently focusing primarily on NLP and ML, though it needn't be that way). You can create channels that provide email or RSS feeds for topics you care about. You can add your own publications page as a resource to Braque so it knows to crawl your papers and send them out to interested parties.
Braque is something I built ages ago with Percy Liang, but it's finally more or less set up after my move. Feel free to email me questions and comments or (preferably) use the online comment system.
As a bit of warning: Braque is neither a paper search engine nor a paper archive. And please be a bit forgiving if you go there immediately after this post shows up and it's a bit slow.... we only have one server :).
ps., yes, Braque is sort of like WhatToSee on crack.
06 July 2011
The conference(s) post: ACL and ICML
I'm using ACL/ICML as an excuse to jumpstart my resumed, hopefully regular, posting. The usual "I didn't see/read everything" applies to all of this. My general feeling about ACL (which was echoed by several other participants) was that the program was quite strong, but there weren't many papers that really stood out as especially great. Here are some papers I liked and some attached thoughts, from ACL:
P11-1002 [bib]: Sujith Ravi; Kevin Knight
Deciphering Foreign Language
This paper is about building MT systems without parallel data. There's been a bunch of work in this area. The idea here is that if I have English text, I can build an English LM. If you give me some French text and I hallucinate an F2E MT system, then its output had better score high on the English LM.
P11-1020 [bib] [dataset]: David Chen; William Dolan
Collecting Highly Parallel Data for Paraphrase Evaluation
Although this paper is about paraphrasing, the fun part is the YouTube stuff they did. Read it and see :).
P11-1060 [bib]: Percy Liang; Michael Jordan; Dan Klein
Learning Dependency-Based Compositional Semantics
This paper is along the lines of semantic parsing stuff that various people (Ray Mooney, Luke Zettlemoyer/Mike Collins, etc.) have been doing. It's a nice compositional model that is learned online.
P11-1099 [bib]: Vanessa Wei Feng; Graeme Hirst
Classifying arguments by scheme
This paper is about argumentation (in the "debate" sense) and identifying different argumentation types. There are some nice correlations with discourse theory, but in a different context.
P11-2037 [bib]: Shu Cai; David Chiang; Yoav Goldberg
Language-Independent Parsing with Empty Elements
I'm really glad to see that people are starting to take this problem seriously again. This falls under the category of "if you've ever actually tried to use a parser to do something then you need this."
Okay so that's not that many papers, but I did "accidentally" skip some sections. So you're on your own for the rest.
For ICML, I actually felt it was more of a mixed bag. Here are some things that stood out as cool:
Minimum Probability Flow Learning
Jascha Sohl-Dickstein; Peter Battaglino; Michael DeWeese
This is one that I need to actually go read, because it seems too good to be true. If computing a partition function ever made you squirm, read this paper.
Tree-Structured Infinite Sparse Factor Model
XianXing Zhang; David Dunson; Lawrence Carin
This is trying to do factor analysis with tree factors; they use a "multiplicative gamma process" to accomplish it. This is something we tried to do a while ago, but could never really figure out how to do it.
Sparse Additive Generative Models of Text
Jacob Eisenstein; Amr Ahmed; Eric Xing
The idea here is that if you're learning a model of text, don't re-learn the same "general background" distribution over and over again. Then learn class- or topic-specific stuff as a sparse amendment to that background.
OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning
Arvind Sujeeth; HyoukJoong Lee; Kevin Brown; Tiark Rompf; Hassan Chafi; Michael Wu; Anand Atreya; Martin Odersky; Kunle Olukotun
Two words: MATLAB KILLER.
Six more words: Most authors ever on ICML paper.
Generalized Boosting Algorithms for Convex Optimization
Alexander Grubb; Drew Bagnell
Suppose you want to boost something that's non-smooth? Now you can do it. Has nice applications in imitation learning, which is I suppose why I like it.
Learning from Multiple Outlooks
Maayan Harel; Shie Mannor
This is a nice approach based on distribution mapping to the problem of multiview learning when you don't have data with parallel views. (I'm not sure that we need a new name for this task, but I still like the paper.)
Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher; Cliff Chiung-Yu Lin; Andrew Ng; Chris Manning
This is basically about learning compositional semantics for vector space models of text, something that I think is really interesting and understudied (Mirella Lapata has done some stuff). The basic idea is that if "red" is embedded at position x, and "sparrow" is embedded at y, then the embedding of the phrase "red sparrow" should be at f([x y]) where f is some neural network. Trained to get good representations for parsing.
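If it helps to see the shape of the composition, here's a tiny numpy sketch of that f, with made-up dimensions and untrained weights (the real model, of course, learns them):

```python
import numpy as np

rng = np.random.RandomState(0)
d = 4                                  # toy embedding dimension
W = 0.1 * rng.randn(d, 2 * d)          # composition weights (untrained here)
b = np.zeros(d)

def compose(x, y):
    # phrase embedding = f([x; y]), with f = tanh(W [x; y] + b)
    return np.tanh(W @ np.concatenate([x, y]) + b)

red, sparrow = rng.randn(d), rng.randn(d)
red_sparrow = compose(red, sparrow)    # same dimension as the word vectors
print(red_sparrow.shape)               # (4,)
```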
Please reply in comments if you had other papers you liked!!!
07 May 2011
CI Fellows Program, again
Are you graduating and interested in doing a fun postdoc in your area of choosing on your project of choosing? Apply to be a NSF CI Fellow!
12 April 2011
Google scholar "cited by" changed recently???
Someone (can't remember who anymore, even though it was just a couple days ago!) pointed out to me that Google scholar seems to be doing weird things with citations. In particular, it seems to think that the citation relation is symmetric (at least in some cases).
Here's an easy example. Look up Khalid El-Arini's 2009 paper "Turning down the noise in the blogosphere" on Google scholar (or just follow that link). Apparently it's been cited by 24 papers. Let's look at who cites it. Apparently the authors of the 2003 LDA paper, in addition to inventing LDA, also invented a time machine so that they could cite Khalid's paper!
The weird thing is that this doesn't seem to be a systematic error! It only happens sometimes.
Oh well, I won't complain -- it just makes my H index look better :)
06 April 2011
Seeding, transduction, out-of-sample error and the Microsoft approach...
My former master's student Adam Teichert (now at JHU) did some work on inducing part-of-speech taggers using typological information. We wanted to compare the usefulness of using small amounts of linguistic information with small amounts of lexical information in the form of seeds. (Other papers give seeds different names, like initial dictionaries or prototypes or whatever... it's all the same basic idea.)
The basic result was that if you don't use seeds, then typological information can help a lot. If you do use seeds, then your baseline performance jumps from like 5% to about 40%, and then using typological information on top of this isn't really that beneficial.
This was a bit frustrating, and led us to think more about the problem. The way we got seeds was to look at the Wikipedia page about Portuguese (for instance) and use its example list of words for each tag. A popular alternative is to use labeled data and extract the few most frequent words for each part-of-speech type. They're not identical, but there is definitely quite a bit of overlap between the words that Wikipedia lists as examples of determiners and the most frequent determiners (this correlation is especially strong for closed-class words).
In terms of end performance, there are two reasons seeds can help. The first, which is the interesting case, is that knowing that "the" is a determiner helps you find other determiners (like "a") and perhaps also nouns (for instance, by knowing that determiners often precede nouns in Portuguese). The second, which is the uninteresting case, is that now every time you see one of your seeds, you pretty much always get it right. In other words, just by specifying seeds, especially by frequency (or approximately by frequency, à la Wikipedia), you're basically ensuring that you get 90% accuracy (modulo ambiguity) on some large fraction of the corpus (again, especially for closed-class words, which have short tails).
This phenomenon is mentioned in the text (but not the tables :P), for instance, in Haghighi & Klein's 2006 NAACL paper on prototype-driven POS tagging, wherein they say: "Adding prototypes ... gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over [a baseline system with no prototypes]." Their improved system remedies this and achieves better accuracy on non-prototypes as well as prototypes (aka seeds).
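The bookkeeping for reporting results this way is trivial; here's a toy sketch (the token triples and seed list are invented) that splits accuracy into seed and non-seed tokens:

```python
# `tagged` is a list of (word, gold_tag, predicted_tag) triples and `seeds`
# is the set of seed words; both are made up here.
def split_accuracy(tagged, seeds):
    def acc(triples):
        return sum(g == p for _, g, p in triples) / max(len(triples), 1)
    seed_toks = [t for t in tagged if t[0].lower() in seeds]
    rest_toks = [t for t in tagged if t[0].lower() not in seeds]
    return {"all": acc(tagged), "seeds": acc(seed_toks), "non-seeds": acc(rest_toks)}

seeds = {"the", "a", "of"}
tagged = [("the", "DET", "DET"), ("dog", "NOUN", "VERB"),
          ("a", "DET", "DET"), ("barks", "VERB", "VERB")]
print(split_accuracy(tagged, seeds))   # {'all': 0.75, 'seeds': 1.0, 'non-seeds': 0.5}
```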
This is very similar to the idea of transductive learning in machine learning land. Transduction is an alternative to semi-supervised learning. The setting is that you get a bunch of data, some of which is labeled and some of which is unlabeled. Your goal is to simply label the unlabeled data. You need not "induce" the labeling function (though many approaches do, in passing).
The interesting thing is that learning with seeds is very similar to transductive learning, though perhaps with a somewhat stronger assumption of noise on the "labeled" part. The irony is that in machine learning land, you would never report "combined training and test accuracy" -- this would be ridiculous. Yet this is what we seem to like to do in NLP land. This is itself related to an old idea in machine learning wherein you rate yourself only on test examples that you didn't see at training time. This is your out-of-sample error, and is obviously much harder than your standard generalization error. (The famous no-free-lunch theorems come from an out-of-sample analysis.) The funny thing about out-of-sample error is that sometimes you prefer not to get more training examples, because once you see an example at training time you know you won't be tested on it! If you were getting it right already, this just hurts you. (Perhaps you should be allowed to see x and say "no, I don't want to see y"?)
I think the key question is: what are we trying to do? If we're trying to build good taggers (i.e., we're engineers), then overall accuracy is what we care about, and including "seed" performance in our evaluations makes sense. But when we're talking about 45% tagging accuracy (like Adam and I were), then this is a pretty pathetic claim. In the case that we're trying to understand learning algorithms and study their performance on real data (i.e., we're scientists), then accuracy on non-seeds is perhaps more interesting. (Please don't jump on me for the engineer/scientist distinction: it's obviously much more subtle than this.)
This also reminds me of something Eric Brill said to me when I was working with him as a summer intern in MLAS at Microsoft (back when MLAS existed and back when Eric was in MLAS....). We were working on web search stuff. His comment was that he really didn't care about doing well on the 1000 most frequent queries. Microsoft could always hire a couple annotators to manually do a good job on these queries. And in fact, this is what is often done. What we care about is the heavy tail, where there are too many somewhat common things to have humans annotate them all. This is precisely the same situation here. I can easily get 1000 seeds for a new language. Do I actually care how well I do on those, or do I care how well I do on the other 20000+ things?
The basic result was that if you don't use seeds, then typological information can help a lot. If you do use seeds, then your baseline performance jumps from about 5% to about 40%, and using typological information on top of this isn't really that beneficial.
This was a bit frustrating, and led us to think more about the problem. The way we got seeds was to look at the Wikipedia page about Portuguese (for instance) and use its example list of words for each tag. An alternative popular way is to use labeled data and extract the few most frequent words for each part of speech type. They're not identical, but there is definitely quite a bit of overlap between the words that Wikipedia lists as examples of determiners and the most frequent determiners (this correlation is especially strong for closed-class words).
In terms of end performance, there are two reasons seeds can help. The first, which is the interesting case, is that knowing that "the" is a determiner helps you find other determiners (like "a") and perhaps also nouns (for instance, by knowing that determiners often precede nouns in Portuguese). The second, which is the uninteresting case, is that now every time you see one of your seeds, you pretty much always get it right. In other words, just by specifying seeds, especially by frequency (or approximately by frequency, a la Wikipedia), you're basically ensuring that you get 90% accuracy (due to ambiguity) on some large fraction of the corpus (again, especially for closed-class words, which have short tails).
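To make the uninteresting case concrete, here is a toy sketch (mine, not from any of the papers discussed) of the accuracy floor you get just from tagging seed tokens with their seed label; the word list and the 90% per-seed figure are purely illustrative:

from collections import Counter

# Toy illustration: if seeds are the most frequent word types, tagging every seed token
# with its seed label already locks in decent accuracy on a big slice of the corpus,
# before the learner generalizes to anything else.
def seed_accuracy_floor(tokens, seed_tags, per_seed_accuracy=0.9):
    """Return (fraction of tokens covered by seeds, accuracy floor from seeds alone)."""
    counts = Counter(tokens)
    covered = sum(c for w, c in counts.items() if w in seed_tags)
    coverage = covered / len(tokens)
    return coverage, coverage * per_seed_accuracy

tokens = "the dog saw the cat and the cat saw a dog".split()
seeds = {"the": "DET", "a": "DET", "and": "CONJ"}  # made-up closed-class seeds
print(seed_accuracy_floor(tokens, seeds))  # ~45% of tokens are seed types, so ~41% accuracy "for free"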
This phenomenon is mentioned in the text (but not the tables :P), for instance, in Haghighi & Klein's 2006 NAACL paper on prototype-driven POS tagging, wherein they say: "Adding prototypes ... gave an accuracy of 68.8% on all tokens, but only 47.7% on non-prototype occurrences, which is only a marginal improvement over [a baseline system with no prototypes]." Their improved system remedies this and achieves better accuracy on non-prototypes as well as prototypes (aka seeds).
This is very similar to the idea of transductive learning in machine learning land. Transduction is an alternative to semi-supervised learning. The setting is that you get a bunch of data, some of which is labeled and some of which is unlabeled. Your goal is to simply label the unlabeled data. You need not "induce" the labeling function (though many approaches do, in passing).
The interesting thing is that learning with seeds is very similar to transductive learning, though perhaps with a somewhat stronger assumption of noise on the "labeled" part. The irony is that in machine learning land, you would never report "combined training and test accuracy" -- this would be ridiculous. Yet this is what we seem to like to do in NLP land. This is itself related to an old idea in machine learning wherein you rate yourself only on test examples that you didn't see at training time. This is your out-of-sample error, and is obviously much harder than your standard generalization error. (The famous no-free-lunch theorems are from an out-of-sample analysis.) The funny thing about out-of-sample error is that sometimes you prefer not to get more training examples, because once you've seen an example at training time, you know you won't be tested on it! If you were getting it right already, this just hurts you. (Perhaps you should be allowed to see x and say "no I don't want to see y"?)
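To be concrete about the reporting issue, here is a minimal sketch (my own, not tied to any particular paper) of the two numbers one could report for a seeded tagger: accuracy over all tokens versus accuracy restricted to tokens whose word type is not a seed:

# Minimal sketch of "overall" vs. "non-seed" token accuracy for a seeded tagger.
def tagging_accuracies(words, gold_tags, pred_tags, seed_words):
    """Return (accuracy on all tokens, accuracy on tokens whose type is not a seed)."""
    all_correct = sum(p == g for p, g in zip(pred_tags, gold_tags))
    non_seed = [(p, g) for w, p, g in zip(words, pred_tags, gold_tags) if w not in seed_words]
    non_seed_correct = sum(p == g for p, g in non_seed)
    return all_correct / len(words), non_seed_correct / max(len(non_seed), 1)

words = ["the", "dog", "barked"]
gold = ["DET", "NOUN", "VERB"]
pred = ["DET", "NOUN", "NOUN"]  # the seed "the" is right essentially by construction
print(tagging_accuracies(words, gold, pred, {"the"}))  # (0.67, 0.5): that gap is what gets hidden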
I think the key question is: what are we trying to do? If we're trying to build good taggers (i.e., we're engineers), then overall accuracy is what we care about, and including "seed" performance in our evaluations makes sense. But when we're talking about 45% tagging accuracy (like Adam and I were), then this is a pretty pathetic claim. In the case that we're trying to understand learning algorithms and study their performance on real data (i.e., we're scientists), then accuracy on non-seeds is perhaps more interesting. (Please don't jump on me for the engineer/scientist distinction: it's obviously much more subtle than this.)
This also reminds me of something Eric Brill said to me when I was working with him as a summer intern in MLAS at Microsoft (back when MLAS existed and back when Eric was in MLAS....). We were working on web search stuff. His comment was that he really didn't care about doing well on the 1000 most frequent queries. Microsoft could always hire a couple annotators to manually do a good job on these queries. And in fact, this is what is often done. What we care about is the heavy tail, where there are too many somewhat common things to have humans annotate them all. This is precisely the same situation here. I can easily get 1000 seeds for a new language. Do I actually care how well I do on those, or do I care how well I do on the other 20000+ things?
10 March 2011
Postdoc Position at CLIP (@UMD)
Okay, now is where I take serious unfair advantage of having this blog. We have a postdoc opening. See the official ad below for details:
A postdoc position is available in the Computational Linguistics and
Information Processing (CLIP) Laboratory in the Institute for Advanced
Computer Studies at University of Maryland. We are seeking a talented
researcher in natural language processing, with strong interests in
the processing of scientific literature.
A successful candidate should have a strong NLP background with a
track record of top-tier research publications. A Ph.D. in computer
science and strong organizational and coordination skills are a must.
In addition to pursuing original research in scientific literature
processing, the ideal candidate will coordinate the efforts of the
other members of that project. While not necessary, experience in one
or more of the following areas is highly advantageous: summarization,
NLP or data mining for scientific literature, machine learning, and
the use of linguistic knowledge in computational systems.
Additionally, experience with large-data NLP and system building will
be considered favorably.
The successful candidate will work closely with current CLIP faculty,
especially Bonnie Dorr, Hal Daume III and Ken Fleischmann, while
interacting with a large team involving NLP researchers across several
other prominent institutions. The duration of the position is one
year, starting Summer or Fall 2011, and is potentially extendible.
CLIP is a dynamic interdisciplinary computational linguistics
program with faculty from across the university, and major research
efforts in machine translation, information retrieval, semantic
analysis, generation, and development of large-scale statistical
language processing tools.
Please send a CV and names and contact information of 3 referees,
preferably by e-mail, to:
Jessica Touchard
jessica AT cs DOT umd DOT edu
Department of Computer Science
A.V. Williams Building, Room 1103
University of Maryland
College Park, MD 20742
Specific questions about the position may be addressed to Hal Daume
III at hal AT umiacs DOT umd DOT edu.
08 March 2011
Some thoughts on supplementary materials
Having the option of authors submitting supplementary materials is becoming popular in NLP/ML land. NIPS was one of the first conferences I submit to that allowed this; I think ACL allowed it this past year, at least for specific types of materials (code, data), and EMNLP is thinking of allowing it at some point in the near future.
Here is a snippet of the NIPS call for papers (see section 5) that describes the role of supplementary materials:
"In addition to the submitted PDF paper, authors can additionally submit supplementary material for their paper... Such extra material may include long technical proofs that do not fit into the paper, image, audio or video sample outputs from your algorithm, animations that describe your algorithm, details of experimental results, or even source code for running experiments. Note that the reviewers and the program committee reserve the right to judge the paper solely on the basis of the 8 pages, 9 pages including citations, of the paper; looking at any extra material is up to the discretion of the reviewers and is not required."
(Emphasis mine.) Now, before everyone goes misinterpreting what I'm about to say, let me make it clear that in general I like the idea of supplementary materials, given our current publishing model.
You can think of the emphasized part of the call as a form of reviewer protection. It basically says: look, we know that reviewers are overloaded; if your paper isn't very interesting, the reviewers aren't required to read the supplement. (As an aside, I feel the same thing happens with pages 2-8 given page 1 in a lot of cases :P.)
I think it's good to have such a form of reviewer protection. What I wonder is whether it also makes sense to add a form of author protection. In other words, the current policy -- which seems only explicitly stated in the case of NIPS, but seems to be generally understood elsewhere, too -- is that reviewers are protected from overzealous authors. I think we need to have additional clauses that protect authors from overzealous reviewers.
Why? Already I get annoyed with reviewers who seem to think that extra experiments, discussion, proofs or whatever can somehow magically fit in an already crammed 8 page paper. A general suggestion to reviewers is that if you're suggesting things to add, you should also suggest things to cut.
This situation is exacerbated infinity-fold with the "option" of supplementary material. There now is no length-limit reason why an author couldn't include everything under the sun. And it's too easy for a reviewer just to say that XYZ should have been included because, well, it could just have gone in the supplementary material!
So what I'm proposing is that supplementary material clauses should have two forms of protection. The first being the existing one, protecting reviewers from overzealous authors. The second being the reverse, something like:
"Authors are not obligated to include supplementary materials. The paper should stand on its own, excluding any supplement. Reviewers must take into account the strict 8 page limit when evaluating papers."
Or something like that: the wording isn't quite right. But without this, I fear that supplementary materials will, in the limit, simply turn into an arms race.
02 March 2011
Grad school survey, revisited
You may recall a while ago I ran a survey on where people applied to grad school. Obviously I've been sitting on these results for a while now, but I figured since it's that time of year when people are choosing grad schools, that I would say how things turned out. Here's a summary of things that people thought were most important (deciding factor), and moderately important (contributing factor, in parens):
- Academic Program
- Specialty degree programs in my research area, 48%
- (Availability of interesting courses, 16%)
- (Time to completion, 4%)
- Application Process
- Nothing
- Faculty Member(s)
- Read research papers by faculty member, 44%
- Geographic Area
- (Outside interests/personal preference, 15%)
- Recommendations from People
- Professors in technical area, 45%
- (Teachers/academic advisors, 32%)
- (Technical colleagues, 20%)
- Reputation
- ... of research group, 61%
- ... of department/college, 50%
- (Ranking of university, 35%)
- (Reputation of university, 34%)
- Research Group
- Research group works on interesting problems, 55%
- Many faculty in a specialty area (eg., ML), 44%
- (Many faculty/students in general area (eg., AI), 33%)
- (Research group publishes a lot, 26%)
- Web Presence
- (Learned about group via web search, 37%)
- (Learned about dept/univ via web search, 24%)
- General
- Funding availability, 49%
- (High likelihood of being accepted, 12%)
- (Size of dept/university, 5%)
28 February 2011
What are your plans between ACL and ICML?
I'll tell you what they should be: attending the Symposium on Machine Learning in Speech and Language Processing, jointly sponsored by IMLS, ICML and ISCA, that I'm co-organizing with Dan Roth, Geoff Zweig and Joseph Keshet (the exact date is June 27, in the same venue as ICML in Bellevue, Washington). So far we've got a great list of invited speakers from all of these areas, including Mark Steedman, Stan Chen, Yoshua Bengio, Lawrence Saul, Sanjoy Dasgupta and more. (See the web page for more details.) We'll also be organizing some sort of day trips (together with the local organizers of ICML) for people who want to join! You should also consider submitting papers (deadline is April 15).
I know I said a month ago that I would blog more. I guess that turned out to be a lie. The problem is that I only have so much patience for writing, and I've been spending a lot of time writing non-blog things recently. I decided to use my time off from teaching to do something far more time consuming than teaching. This has been a wondrously useful exercise for me and I hope that, perhaps starting in 2012, other people can take advantage of this work.
17 January 2011
Parsing with Transformations
I remember when I took my first "real" Syntax class, where by "real" I mean "Chomskyan." It was at USC in Fall 2001, taught by Roumyana Pancheva. It was hard as hell but I loved it. However, as a computationally minded guy, I remember snickering to myself the whole time we were talking about movements that get you from deep structure to surface structure. This stuff was all computationally ridiculous.
But why was it computationally ridiculous? It was ridiculous because my mindset, and I think the mindset of most computational folks at the time, was that of n^3 CKY or Earley style parsing: exact parsing in a context-free manner. This whole idea of transformations would kill anything like that in a very bad way.
However, there's been a recent shift in attitudes. Sure, people still do their n^3 parsing, but of course none of it is exact anyway (due to pruning). But more than that, things like linear-time parsing algorithms, as popularized by people like Joakim Nivre, Kenji Sagae, Brian Roark and Joseph Turian, have proved very useful. They work well, are incredibly efficient, and are easy to implement. They're also a bit more psychologically plausible (as Eugene Charniak said recently, "we don't know what people are doing, but they're definitely not doing CKY").
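To back up the "easy to implement" claim, here is a toy greedy arc-standard dependency parser; the transition system is the standard one, but the scoring function is a hypothetical stand-in for whatever trained classifier you would actually plug in (a sketch, not code from any of the parsers just mentioned):

from collections import deque

# Toy greedy arc-standard dependency parsing: linear time, a handful of lines.
# score(stack, buffer, action, words) is a hypothetical stand-in for a trained classifier.
def parse_arc_standard(words, score):
    stack, buffer, arcs = [], deque(range(len(words))), []
    while buffer or len(stack) > 1:
        actions = (["SHIFT"] if buffer else []) + (["LEFT", "RIGHT"] if len(stack) >= 2 else [])
        act = max(actions, key=lambda a: score(stack, buffer, a, words))
        if act == "SHIFT":
            stack.append(buffer.popleft())
        elif act == "LEFT":                    # second-from-top becomes a dependent of the top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))      # (head index, dependent index)
        else:                                  # RIGHT: top becomes a dependent of second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs                                # whatever is left on the stack is the root

# A dummy scorer that prefers shifting, then right-attachment, giving a right-branching analysis.
dummy = lambda stack, buffer, act, words: {"SHIFT": 2, "RIGHT": 1, "LEFT": 0}[act]
print(parse_arc_standard("put your head in a noose".split(), dummy))

Every word is shifted once and reduced at most once, so the loop runs at most 2n times.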
So I'm led to wonder: could we actually do parsing in a transformational grammar using all the new stuff we know about (for instance) left-to-right parsing?
One thing that stands in our way, of course, is the stupid Penn Treebank, which was annotated only with very simple transformations (mostly noun phrase movements) and not really "deep" transformations as most Chomskyan linguists would recognize them.
But I think you could still do it. It would end up being partially unsupervised, but at least from a minimum description length perspective, I can either spend weights learning more special cases, or I can learn general transformational rules. It would take some thought and effort to write it out and figure out how to actually optimize such a thing, but I bet it could be done in a semester.
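Here is the kind of back-of-the-envelope MDL accounting I have in mind; every number below is made up purely for illustration, and the point is only the shape of the tradeoff:

# Made-up MDL bookkeeping: a grammar costs bits for its rules plus bits to encode each
# sentence's derivation under it. A transformational rule is a one-time cost, but if it
# replaces many construction-specific rules, the total description length can drop.
def description_length(bits_per_rule, n_rules, bits_per_derivation, n_sentences):
    return bits_per_rule * n_rules + bits_per_derivation * n_sentences

special_cases = description_length(40, 500, 22, 10000)     # lots of question/command-specific rules
transformational = description_length(40, 120, 23, 10000)  # fewer rules plus a few transformations
print(special_cases, transformational)  # 240000 vs 234800: compact grammar wins if rule savings beat derivation overhead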
So then the question is: aside from smaller models (potentially), is there any other reason to do it?
I can think of at least one: parsing non-declarative sentences. Since almost all sentences in the Treebank are declarative, parsers do pretty crappy when tested on other things. Slav Petrov had a paper at EMNLP 2010 on parsing questions. Here is the abstract, which says pretty much everything:
... We show that dependency parsers have more difficulty parsing questions than constituency parsers. In particular, deterministic shift-reduce dependency parsers ... drop to 60% labeled accuracy on a question test set. We propose an uptraining procedure in which a deterministic parser is trained on the output of a more accurate, but slower, latent variable constituency parser (converted to dependencies). Uptraining with 100K unlabeled questions achieves results comparable to having 2K labeled questions for training. With 100K unlabeled and 2K labeled questions, uptraining is able to improve parsing accuracy to 84%, closing the gap between in-domain and out-of-domain performance.
Now, at least in principle, if you can parse declarative sentences, you should be able to parse questions. At least if you know about some basic syntactic transformations in English. (As an aside, the "uptraining" idea is almost exactly the same as the structure compilation idea that Percy, Dan and I had at ICML 2008, though Slav and colleagues apply it to a domain adaptation problem, while we just did simple semi-supervised learning.)
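Just to pin down the recipe, here is a minimal sketch of uptraining as I read that abstract; the parser objects and their parse/fit methods are hypothetical stand-ins, not the actual systems from the paper:

# Minimal sketch of uptraining: train a fast parser on gold trees plus the output of a
# slower, more accurate parser run over unlabeled in-domain text (here, questions).
# slow_parser.parse and fast_parser.fit are hypothetical interfaces.
def uptrain(fast_parser, slow_parser, labeled_sents, labeled_trees, unlabeled_sents):
    auto_trees = [slow_parser.parse(s) for s in unlabeled_sents]  # e.g., 100K unlabeled questions
    train_x = list(labeled_sents) + list(unlabeled_sents)
    train_y = list(labeled_trees) + auto_trees
    fast_parser.fit(train_x, train_y)  # the deterministic parser learns to mimic the slow one
    return fast_parser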
We have observed similar effects in the parsing of commands, such as "Put your head in a noose", where parsers -- even constituency ones -- really really want "Put" to be a noun! Again, if you know simple transformations -- like subject dropping -- you should be able to parse commands if you can already parse declaratives.
As with any generalization, the hope is that by realizing the generalization, you don't need to store so many specific cases. So if you can learn that commands and questions are simple transformations of declarative sentences, and you can learn to parse declaratives, you should be able to handle the other cases.
(Anticipating comments: yes, I know you could try to pre-transform your data, like they do in MT, but that's quite inelegant. And yes, I know you could probably take the treebank and turn a lot of the sentences into commands or questions to create a new data set. But that's kind of missing the point: I don't want to just handle commands or questions... I want to handle anything, even things that I might not have anticipated.)
11 January 2011
NIPS 2010 Retrospective
Happy New Year and I know I've been silent but I've been busy. But no teaching this semester (YAY!) so maybe you'll see more posts.
At any rate, I'm really late to the table, but here are my comments about this past year's NIPS. Before we get to that, I hope that everyone knows that this coming NIPS will be in Granada, and then for (at least) the next five years will be in Tahoe. Now that I'm not in ski-land, it's nice to have a yearly ski vacation ... erm I mean scientific conference.
But since this was the last year of NIPS in Vancouver, I thought I'd share a conversation that occurred this year at NIPS, with participants anonymized. (I hope everyone knows to take this in good humor: I'm perfectly happy to poke fun at people from the States, too...). The context is that one person in a large group, which was going to find lunch, had a cell phone with a data plan that worked in Canada:
A: Wow, that map is really taking a long time to load.
B: I know. It's probably some socialized Canadian WiFi service.
C: No, it's probably just slow because every third bit has to be a Canadian bit?
D: No no, it's because every bit has to be sent in both English and French!
Okay, it's not that funny, but it was funny at the time. (And really "B" is as much a joke about the US as it was about Canada :P.)
But I'm sure you are here to hear about papers, not stupid Canada jokes. So here's my take.
The tutorial on optimization by Stephen Wright was awesome. I hope this shows up on videolectures soon. (Update: it has!) I will make it required reading / watching for students. There's just too much great stuff in it to go into, but how about this: momentum is the same as CG! Really?!?! There's tons of stuff that I want to look more deeply into, such as robust mirror descent, some work by Candes about SVD when we don't care about near-zero SVs, regularized stochastic gradient (Xiao), and sparse eigenvector work. Lots of awesome stuff. My favorite part of NIPS.
Some papers I saw that I really liked:
A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): Formalizes boosting in a multiclass setting. The crux is a clever generalization of the "weak learning" notion from binary. The idea is that a weak binary classifier is one that has a small advantage over random guessing (which, in the binary case, gives 50/50). Generalize this and it works.
Structured sparsity-inducing norms through submodular functions (Francis Bach): I need to read this. This was one of those talks where I understood the first half and then got lost. But the idea is that you can go back-and-forth between submodular functions and sparsity-inducing norms.
Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher): The title says it all! It's an alternative construction to the Polya urn scheme and also to the stick-breaking scheme.
A Reduction from Apprenticeship Learning to Classification (Umar Syed, Robert Schapire): Right up my alley, some surprising results about apprenticeship learning (aka Hal's version of structured prediction) and classification. Similar to a recent paper by Stephane Ross and Drew Bagnell on Efficient Reductions for Imitation Learning.
Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): When you have complex combinatorial spaces (think traveling salesman), how can you construct generic variational inference algorithms?
Implicit Differentiation by Perturbation (Justin Domke): This is a great example of a paper that I never would have read, looked at, seen, visited the poster of, known about etc., were it not for serendipity at conferences (basically Justin was the only person at his poster when I showed up early for the session, so I got to see this poster). The idea is: you have a graphical model and some loss function L(.) defined over the marginals mu(theta), where theta are the parameters of the model, and you want to optimize L(mu(theta)) as a function of theta. Without making any serious assumptions about the form of L, you can actually do gradient descent, where each gradient computation costs two runs of belief propagation. I think this is amazing. (I sketch my understanding of the trick right after this list.)
Probabilistic Deterministic Infinite Automata (David Pfau, Nicholas Bartlett, Frank Wood): Another one where the title says it all. DP-style construction of infinite automata.
Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): The idea here is to define a regression function over a graph. It should be regularized in a sensible way. Very LASSO-esque model, as you might expect given the author list :).
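Since I flagged it above, here is my rough reconstruction of the perturbation trick, from memory of the poster: marginals stands in for a belief propagation routine mapping parameters to a flat array of marginals, loss_grad for the gradient of L with respect to those marginals, and the step size is arbitrary, so treat this as a sketch rather than the paper's actual algorithm:

# Sketch of a gradient computation for L(mu(theta)) using two inference calls.
# Because d mu / d theta is symmetric (it is the Hessian of a log-partition function),
# the Jacobian-transpose-vector product we need equals a directional derivative of mu,
# which a finite difference approximates with one extra, perturbed inference run.
# theta and the returned marginals are assumed to be numpy-style arrays.
def grad_wrt_params(theta, marginals, loss_grad, r=1e-4):
    mu = marginals(theta)               # first belief propagation run
    v = loss_grad(mu)                   # dL/dmu at the current marginals
    mu_plus = marginals(theta + r * v)  # second run with perturbed parameters
    return (mu_plus - mu) / r           # approximates dL/dtheta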
Other papers I saw that I liked but not enough to write mini summaries of:
Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy)
Tree-Structured Stick Breaking for Hierarchical Data (Ryan Adams, Zoubin Ghahramani, Michael Jordan)
Categories and Functional Units: An Infinite Hierarchical Model for Brain Activations (Danial Lashkari, Ramesh Sridharan, Polina Golland)
Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)
Joint Analysis of Time-Evolving Binary Matrices and Associated Documents (Eric Wang, Dehong Liu, Jorge Silva, David Dunson, Lawrence Carin)
Learning Efficient Markov Networks (Vibhav Gogate, William Webb, Pedro Domingos)
Construction of Dependent Dirichlet Processes based on Poisson Processes (Dahua Lin, Eric Grimson, John Fisher)
Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)
Two students who work with me (though one isn't actually mine :P), who went to NIPS also shared their favorite papers. The first is a list from Avishek Saha:
A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire)
Repeated Games against Budgeted Adversaries (Jacob Abernethy, Manfred Warmuth)
Non-Stochastic Bandit Slate Problems (Satyen Kale, Lev Reyzin, Robert Schapire)
Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum)
Learning Bounds for Importance Weighting (Corinna Cortes, Yishay Mansour, Mehryar Mohri)
Supervised Clustering (Pranjal Awasthi, Reza Bosagh Zadeh)
The second list is from Piyush Rai, who apparently aimed for recall (though not with a lack of precision) :P:
Online Learning: Random Averages, Combinatorial Parameters, and Learnability (Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari): defines several complexity measures for online learning akin to what we have for the batch setting (e.g., Rademacher averages, covering numbers, etc).
Online Learning in The Manifold of Low-Rank Matrices (Uri Shalit, Daphna Weinshall, Gal Chechik): nice general framework applicable in a number of online learning settings. could also be used for online multitask learning.
Fast global convergence rates of gradient methods for high-dimensional statistical recovery (Alekh Agarwal, Sahand Negahban, Martin Wainwright): shows that the properties of sparse estimation problems that lead to statistical efficiency also lead to computational efficiency which explains the faster practical convergence of gradient methods than what the theory guarantees.
Copula Processes (Andrew Wilson, Zoubin Ghahramani): how do you determine the relationship between random variables which could have different marginal distributions (say one has gamma and the other has gaussian distribution)? copula process gives an answer to this.
Graph-Valued Regression (Han Liu, Xi Chen, John Lafferty, Larry Wasserman): usually undirected graph structure learning involves a set of random variables y drawn from a distribution p(y). but what if y depends on another variable x? this paper is about learning the graph structure of the conditional distribution p(y | X = x).
Structured sparsity-inducing norms through submodular functions (Francis Bach): standard sparse recovery uses l1 norm as a convex proxy for the l0 norm (which constrains the number of nonzero coefficients to be small). this paper proposes several more general set functions and their corresponding convex proxies, and links them to known norms.
Trading off Mistakes and Don't-Know Predictions (Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum): an interesting paper -- what if in an online learning setting you could abstain from making a prediction on some of the training examples and just say "i don't know"? on others, you may or may not make the correct prediction. lies somewhere in the middle of always predicting right or wrong (i.e., standard mistake driven online learning) versus the recent work on only predicting correctly or otherwise saying "i don't know".
Variational Inference over Combinatorial Spaces (Alexandre Bouchard-Cote, Michael Jordan): cool paper. applicable to lots of settings.
A Theory of Multiclass Boosting (Indraneel Mukherjee, Robert Schapire): we know that boosting in binary case requires "slightly better than random" weak learners. this paper characterizes conditions on the weak learners for the multi-class case, and also gives a boosting algorithm.
Multitask Learning without Label Correspondences (Novi Quadrianto, Alexander Smola, Tiberio Caetano, S.V.N. Vishwanathan, James Petterson): usually mtl assumes that the output space is the same for all the tasks but in many cases this may not be true. for instance, we may have two related prediction problems on two datasets but the output spaces for both may be different and may have some complex (e.g., hierarchical, and potentially time varying) output spaces. the paper uses a mutual information criteria to learn the correspondence between the output spaces.
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty (Yi Zhang, Jeff Schneider): presents a general multitask learning framework and many recently proposed mtl models turn out to be special cases. models both feature covariance and task covariance matrices.
Efficient algorithms for learning kernels from multiple similarity matrices with general convex loss functions (Achintya Kundu, Vikram Tankasali, Chiranjib Bhattacharyya, Aharon Ben-Tal): the title says it all. :) multiple kernel learning is usually applied in a classification setting but due to the applicability of the proposed method for a wide variety of loss functions, one can possibly also use it for unsupervised learning problems as well (e.g., spectral clustering, kernel pca, etc).
Getting lost in space: Large sample analysis of the resistance distance (Ulrike von Luxburg, Agnes Radl, Matthias Hein): large sample analysis of the commute distance: shows a rather surprising result that commute distance between two vertices in the graph if the graph is "large" and nodes represent high dimensional variables is meaningless. the paper proposes a correction and calls it "amplified commute distance".
A Bayesian Approach to Concept Drift (Stephen Bach, Mark Maloof): gives a bayesian approach for segmenting a sequence of observations such that each "block" of observations has the same underlying concept.
MAP Estimation for Graphical Models by Likelihood Maximization (Akshat Kumar, Shlomo Zilberstein): they show that you can think of an mrf as a mixture of bayes nets and then the map problem on the mrf corresponds to solving a form of the maximum likelihood problem on the bayes net. em can be used to solve this in a pretty fast manner. they say that you can use this methods with the max-product lp algorithms to yield even better solutions, with a quicker convergence.
Energy Disaggregation via Discriminative Sparse Coding (J. Zico Kolter, Siddharth Batra, Andrew Ng): about how sparse coding could be used to save energy. :)
Semi-Supervised Learning with Adversarially Missing Label Information (Umar Syed, Ben Taskar): standard ssl assumes that labels for the unlabeled data are missing at random but in many practical settings this isn't actually true. this paper gives an algorithm to deal with the case when the labels could be adversarially missing.
Multi-View Active Learning in the Non-Realizable Case (Wei Wang, Zhi-Hua Zhou): shows that (under certain assumptions) exponential improvements in the sample complexity of active learning are still possible if you have a multiview learning setting.
Self-Paced Learning for Latent Variable Models (M. Pawan Kumar, Benjamin Packer, Daphne Koller): an interesting paper, somewhat similar in spirit to curriculum learning. basically, the paper suggests that in learning a latent variable model, it helps if you provide the algorithm easy examples first.
More data means less inference: A pseudo-max approach to structured learning (David Sontag, Ofer Meshi, Tommi Jaakkola, Amir Globerson): a pseudo-max approach to structured learning: this is somewhat along the lines of the paper on svm's inverse dependence on training size from icml a couple of years back. :)
Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning (Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman): selecting the most uncertain example in a pool based active learning can be expensive if the number of candidate examples is very large. this paper suggests some hashing tricks to expedite the search.
Active Instance Sampling via Matrix Partition (Yuhong Guo): frames batch mode active learning as a matrix partitioning problem and proposes a local optimization technique for the matrix partitioning problem.
A Discriminative Latent Model of Image Region and Object Tag Correspondence (Yang Wang, Greg Mori): it's kind of doing correspondence lda on image+captions but they additionally infer the correspondences between tags and objects in the images, and show that this gives improvements over corr-lda.
Factorized Latent Spaces with Structured Sparsity (Yangqing Jia, Mathieu Salzmann, Trevor Darrell): a multiview learning algorithm that uses sparse coding to learn shared as well as private features of different views of the data.
Word Features for Latent Dirichlet Allocation (James Petterson, Alexander Smola, Tiberio Caetano, Wray Buntine, Shravan Narayanamurthy): extends lda to the case when you have access to features for each word in the vocabulary.