18 November 2010

Crowdsourcing workshop (/tutorial) decisions

Everyone at conferences (with multiple tracks) always complains that there are time slots with nothing interesting, and other time slots with too many interesting papers.  People have suggested crowdsourcing this, enabling parcipants to say -- well ahead of the conference -- which papers they'd go to... then let an algorithm schedule.

I think there are various issues with this model, but don't want to talk about it.  What I do want to talk about is applying the same ideas to workshop acceptance decisions.  This comes up because I'm one of the two workshop chairs for ACL this year, and because John Langford just pointed to the ICML call for tutorials.  (I think what I have to say applies equally to tutorials as to workshops.)

I feel like a workshop (or tutorial) is successful if it is well attended.  This applies both from a monetary perspective, as well as a scientific perspective.  (Note, though, that I think that small workshops can also be successful, especially if they are either fostering a small community, bring people in, or serving other purposes.  That is to say, size is not all that matters.  But it is a big part of what matters.)

We have 30-odd workshop proposals for three of us to sort through (John Carroll and I are the two workshop chairs for ACL, and Marie Candito is the workshop chair for EMNLP; workshops are being reviewed jointly -- which actually makes the allocation process more difficult).  The idea would be that I could create a poll, like the following:

  1. Are you going to ACL?  Yes, maybe, no
  2. Are you going to EMNLP?  Yes, maybe, no
  3. If workshop A were offered at a conference you were going to, would you go to workshop A?
  4. If workshop B...
  5. And so on
This gives you two forms of information.  First it can help estimate expected attendance (though we ask proposers to estimate that, too, and I think they do a reasonable job if you skew their estimates down by about 10%).  But more importantly, it gives correlations between workshops.  This lets you be sure that you're not scheduling things on top of each other that people might want to go to.  Some of these are obvious (for instance, if we got 10 MT workshop proposals... which didn't actually happen but is moderately conceivable :P), but some are not.  For instance, maybe people who care about annotation also care about ML, but maybe not?  I actually have no idea.

Of course we're not going to do this this year.  It's too late already, and it would be unfair to publicise all the proposals, given that we didn't tell proposers in advance that we would do so.  And of course I don't think this should exclusively be a popularity contest.  But I do beleive that popularity should be a factor.  And it should probably be a reasonably big factor.  Workshop chairs could then use the output of an optimization algorithm as a starting point, and use this as additional data for making decisions.  Especially since two or three people are being asked to make decisions that cover--essentially--all areas of NLP, this actually seems like a good idea to me.

I actually think something like this is more likely to actually happen at a conference like ICML than ACL, since ICML seems (much?) more willing to try new things than ACL (for better or for worse).

But I do think it would be interesting to try to see what sort of response you get.  Of course, just polling on this blog wouldn't be sufficient: you'd want to spam, perhaps all of last year's attendees.  But this isn't particularly difficult.

Is there anything I'm not thinking of that would make this obviously not work?  I could imagine someone saying that maybe people won't propose workshops/tutorials if the proposals will be made public?  I find that a bit hard to swallow.  Perhaps there's a small embarassment factor if you're public and then don't get accepted.  But I wouldn't advocate making the voting results public -- they would be private to the organizers / workshop chairs.

I guess -- I feel like I'm channeling Fernando here? -- that another possible issue is that you might not be able to decide which workshops you'd go to without seeing what papers are there and who is presenting.  This is probably true.  But this is also the same problem that the workshop chairs face anyway: we have to guess that good enough papers/people will be there to make it worthwhile.  I doubt I'm any better at guessing this than any other random NLP person...

So what am I forgetting?

09 November 2010

Managing group papers

Every time a major conference deadline (ACL, NIPS, EMNLP, ICML, etc...) comes around, we usually have a slew of papers (>=3, typically) that are getting prepared.  I would say on average 1 doesn't make it, but that's par for the course.

For AI-Stats, whose deadline just passed, I circulated student paper drafts to all of my folks to solicit comments at any level that they desired.  Anywhere from not understanding the problem/motivation to typos or errors in equations.  My experience was that it was useful, both from the perspective of distributing some of my workload and getting an alternative perspective, to keeping everyone abreast of what everyone else is working on.

In fact, it was so successful that two students suggested to me that I require more-or-less complete drafts of papers at least one week in advance so that this can take place.  How you require something like this is another issue, but the suggestion they came up with was that I'll only cover conference travel if this occurs.  It's actually not a bad idea, but I don't know if I'm enough of a hard-ass (or perceived as enough of a hard-ass) to really pull it off.  Maybe I'll try it though.

The bigger question is how to manage such a thing.  I was thinking of installing some conference management software locally (eg., HotCRP, which I really like) and giving students "reviewer" access.  Then, they could upload their drafts, perhaps with an email circulated when a new draft is available, and other students (and me!) could "review" them.  (Again, perhaps with an email circulated -- I'm a big fan of "push" technology: I don't have time to "pull" anymore!)

The only concern I have is that it would be really nice to be able to track updates, or to have the ability for authors to "check off" things that reviewers suggested.  Or to allow discussion.  Or something like that.

I'm curious if anyone has ever tried anything like this and whether it was successful or not.  It seems like if you can get a culture of this established, it could actually be quite useful.

21 October 2010

Comparing Bounds

This is something that's bothered me for quite a while, and I don't know of a good answer.  I used to think it was something that theory people didn't worry about, but then this exact issue was brought up by a reviewer of a theory-heavy paper that we have at NIPS this year (with Avishek Saha and Abhishek Kumar). There are (at least?) two issues with comparing bounds, the first is the obvious "these are both upper bounds, what does it mean to compare them?"  The second is the slightly less obvious "but your empirical losses may be totally different" issue.  It's actually the second one that I want to talk about, but I have much less of a good internal feel about it.

Let's say that I'm considering two learning approaches.  Say it's SVMs versus logistic regression.  Both regularized.  Or something.  Doesn't really matter.  At the end of the day, I'll have a bound that looks roughly like:

        expected test error <= empirical training error + f( complexity / N)

Here, f is often "sqrt", but could really be any function.  And N is the number of data points.

Between two algorithms, both "f" and "complexity" can vary.  For instance, one might have a linear dependence on the dimensionality of the data (i.e., complexity looks like O(D), where D is dimensionality) and the other might have a superlinear dependence (eg., O(D log D)).  Or one might have a square root.  Who knows.  Sometimes there's an inf or sup hiding in there, too, for instance in a lot of the margin bounds.

At the end of the day, we of course want to say "my algorithm is better than your algorithm."  (What else is there in life?)  The standard way to say this is that "my f(complexity / N) looks better than your f'(complexity' / N)."

Here's where two issues crop up.  The first is that our bound is just an upper bound.  For instance, Alice could come up to me and say "I'm thinking of a number between 1 and 10" and Bob could say "I'm thinking of a number between 1 and 100."  Even though the bound is lower for Alice, it doesn't mean that Alice is actually thinking of a smaller number -- maybe Alice is thinking of 9 and Bob of 5.  In this way, the bounds can be misleading.

My general approach with this issue is to squint, as I do for experimental results.  I don't actually care about constant factors: I just care about things like "what does the dependence on D look like."  Since D is usually huge for problems I care about, a linear or sublinear dependence on D looks really good to me.  Beyond that I don't really care.  I especially don't care if the proof techniques are quite similar.  For instance, if they both use Rademacher complexities, then I'm more willing to compare them than if one uses Rademacher complexities and the other uses covering numbers.  They somehow feel more comparable: I'm less likely to believe that the differences are due to the method of analysis.

(You can also get around this issue with some techniques, like Rademacher complexities, which give you both upper and lower bounds, but I don't think anyone really does that...)

The other issue I don't have as good a feeling for.  The issue is that we're entirely ignoring the "empirical training error" question.  In fact, this is often measured differently between different algorithms!  For instance, for SVMs, the formal statement is more like "expected 0/1 loss on test <= empirical hinge loss on training + ..."  Whereas for logistic regression, you might be comparing expected 0/1 loss with empirical log loss.

Now I really don't know what to do.

We ran into this issue because we were trying to compare some bounds between EasyAdapt and a simple model trained just on source data.  The problem is that the source training error might be totally incomparable to the (source + target) training error.

But the issue is for sure more general.  For instance, what if your training error is measured in squared error?  Now this can be huge when hinge loss is still rather small.  In fact, your squared error could be quadratically large in your hinge loss.  Actually it could be arbitrarily larger, since hinge goes to zero for any sufficiently correct classification, but squared error does not.  (Neither does log loss.)

This worries me greatly, much more than the issue of comparing upper bounds.

Does this bother everyone, or is it just me?  Is there a good way to think about this that gets your out of this conundrum?

05 October 2010

My Giant Reviewing Error

I try to be a good reviewer, but like everything, reviewing is a learning process.  About five years ago, I was reviewing a journal paper and made an error.  I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter.

I was reviewing a paper, which I thought was overall pretty strong.  I thought there was an interesting connection to some paper from Alice Smith (not the author's real name) in the past few years and mentioned this in my review.  Not a connection that made the current paper irrelevant, but something the authors should probably talk about.  In the revision response, the authors said that they had looked to try to find Smith's paper, but could figure out which one I was talking about, and asked for a pointer.  I spend the next five hours looking for the reference and couldn't find it myself.  It turns out that actually I was thinking of a paper by Bob Jones, so I provided that citation.  But the Jones paper wasn't even as relevant as it seemed at the time I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely.

Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper!  I remember that happening to me!

But, sadly, this is not true.  I get reviews like this all the time, and I feel it's one of the most irresponsible things reviewers can do.  In fact, I don't think a single reviewing cycle has passed where I don't get a review like this.  The problem with such reviews is that it enables a reviewer to make whatever claim they want, without any expectation that they have to back it up.  And the claims are usually wrong.  They're not necessarily being mean (I wasn't trying to be mean), but sometimes they are.

Here are some of the most ridiculous cases I've seen.  I mention these just to show how often this problem occurs.  These are all on papers of mine.

  • One reviewer wrote "This idea is so obvious this must have been done before."  This is probably the most humorous example I've seen, but the reviewer was clearly serious.  And no, this was not in a review for one of the the "frustratingly easy" papers.
  • In a NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times.  Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing.  (Maybe it's true and we just were looking in the wrong place, but that still isn't helpful.)  It turned out to strongly seem that this was basically their way of saying "you are not one of us."
  • In a paper on technique X for task A, we were told hands down that it's well known that technique Y works better, with no citations.  The paper was rejected, we went and implemented Y, and found that it worked worse on task A.  We later found one paper saying that Y works better than X on task B, for B fairly different from A.
  • In another paper, we were told that what we were doing had been done before and in this case a citation was provided.  The citation was to one of our own papers, and it was quite different by any reasonable metric.  At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.
  • We were told that we missed an enormous amount of related work that could be found by a simple web search.  I've written such things in reviews, often saying something like "search for 'non-parametric Bayesian'" or something like that.  But here, no keywords were provided.  It's entirely possible (especially when someone moves into a new domain) that you can miss a large body of related work because you don't know how to find in: that's fine -- just tell me how to find it if you don't want to actually provide citations.
There are other examples I could cite from my own experience, but I think you get the idea.

I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem.  It's really just an issue of laziness.  If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong.  But I was lazy.  Luckily this didn't really adversely affect the outcome of the acceptance of this paper (journals are useful in that way -- authors can push back -- and yes, I know you can do this in author responses too, but you really need two rounds to make it work in this case).

I've really really tried ever since my experience above to not ever do this again.  And I would encourage future reviewers to try to avoid the temptation to do this: you may find your memory isn't as good as you think.  I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims.

24 September 2010

ACL / ICML Symposium?

ACL 2011 ends on June 24, in Portland (that's a Friday). ICML 2011 begins on June 28, near Seattle (the following Tuesday). This is pretty much as close to a co-location as we're probably going to get in a long time. A few folks have been discussing the possibility of having a joint NLP/ML symposium in between. The current thought is to have it on June 27 at the ICML venue (for various logistical reasons).  There are buses and trains that run between the two cities, and we might even be able to charter some buses.

One worry is that it might only attract ICML folks due to the weekend between the end of ACL and the beginning of said symposium. As a NLPer/MLer, I believe in data. So please provide data by filling out the form below and, if you wish, adding comments.

If you woudn't attend any, you don't need to fill out the poll :).

The last option is there if you want to tell me "I'm going to go to ACL, and I'd really like to go to the symposium, but the change in venue and the intervening weekend is too problematic to make it possible."

Which would you attend?
Only ACL
ACL and the symposium
ICML and the symposium
Only ACL and ICML
All three
Only ACL, because of convenience

pollcode.com free polls

15 September 2010

Very sad news....

I heard earlier this morning that Fred Jelinek passed away last night.  Apparently he had been working during the day: a tenacious aspect of Fred that probably has a lot to do with his many successes.

Fred is probably most infamous for the famous "Every time I fire a linguist the performace of the recognizer improves" quote, which Jurafsky+Martin's textbook says is actually supposed to be the more innocuous "Anytime a linguist leaves the group the recognition rate goes up."  And in Fred's 2009 ACL Lifetime Achievement Award speech, he basically said that such a thing never happened.  I doubt that will have any effect on how much the story is told.

Fred has had a remarkable influence on the field.  So much so that I won't attempt to list anything here: you can find all about him all of the internet.  Let me just say that the first time I met him, I was intimidated.  Not only because he was Fred, but because I knew (and still know) next to nothing about speech, and the conversation inevitably turned to speech.  Here's roughly how a segment of our conversation went:

Hal: What new projects are going on these days?
Fred: (Excitedly.)  We have a really exciting new speech recognition problem.  We're trying to map speech signals directly to fluent text.
Hal: (Really confused.) Isn't that the speech recognition problem?
Fred: (Playing the "teacher role" now.)  Normally when you transcribe speech, you end up with a transcrit that includes disfluencies like "uh" and "um" and also false starts [Ed note: like "I went... I went to the um store"].
Hal: So now you want to produce the actual fluent sentence, not the one that was spoken?
Fred: Right.

Apparently (who knew) in speech recognition you try to transcribe disfluencies and are penalized for missing them!  We then talked for a while about how they were doing this, and other fun topics.

A few weeks later, I got a voicemail on my home message machine from Fred.  That was probably one of the coolest things that have ever happened to me in life.  I actually saved it (but subsequently lost it, which saddens me greatly).  The content is irrelevant: the point is that Fred -- Fred! -- called me -- me! -- at home!  Amazing.

I'm sure that there are lots of other folks who knew Fred better than me, and they can add their own stories in comments if they'd like.  Fred was a great asset to the field, and I will certainly miss his physical presense in the future, though his work will doubtless continue to affect the field for years and decades to come.

13 September 2010

AIStats 2011 Call for Papers

The full call, and some changes to the reviewing process. The submission deadline is Nov 1, and the conference is April 11-13, in Fort Lauderdale, Florida. Promises to be warm :).

The changes the the reviewing process are interesting.  Basically the main change is that the author response is replaced by a journal-esque "revise and resubmit."  That is, you get 2 reviews, edit your paper, submit a new version, and get a 3rd review.  The hope is that this will reduce author frustration from the low bandwidth of author response.  Like with a journal, you'll also submit a "diff" saying what you've changed.  I can see this going really well: the third reviewer will presumably see a (much) better than the first two.  The disadvantage, which irked me at ICML last year, is that it often seemed like the third reviewer made the deciding call, and I would want to make sure that the first two reviewers also get updated.  I can also see it going poorly: authors invest even more time in "responding" and no one listens.  That will be increased frustration :).

The other change is that there'll be more awards.  I'm very much in favor of this, and I spend two years on the NAACL exec trying to get NAACL to do the same thing, but always got voted down :).  Oh well.  The reason I think it's a good idea is two-fold.  First, I think we're bad at selecting single best papers: a committee decision can often lead to selecting least offensive papers rather than ones that really push the boundary.  I also think there are lots of ways for papers to be great: they can introduce new awesome algorithms, have new theory, have a great application, introduce a cool new problem, utilize a new linguistic insight, etc., etc., etc... Second, best papers are most useful at promotion time (hiring, and tenure), where you're being compared with people from other fields.  Why should our field put our people at a disadvantage by not awarding great work that they can list of their CVs?

Anyway, it'll be an interesting experiment, and I encourage folks to submit!

07 September 2010

Manifold Assumption versus Margin Assumption

[This post is based on some discussions that came up while talking about manifold learning with Ross Whitaker and Sam Gerber, who had a great manifold learning paper at ICCV last year.]

There are two assumptions that are often used in statistical learning (both theory and practice, though probably more of the latter), especially in the semi-supervised setting.  Unfortunately, they're incompatible.

The margin assumption states that your data are well separated.  Usually it's in reference to linear, possibly kernelized, classifiers, but that need not be the case.  As most of us know, there are lots of other assumptions that boil down to the same thing, such as the low-weight-norm assumption, or the Gaussian prior assumption.  At the end of the day, it means your data looks like what you have on the left, below, not what you have on the right.
The manifold assumption that is particularly popular in semi-supervised learning, but also shows up in supervised learning, says that your data lie on a low dimensional manifold embedded in a higher dimensional space.  One way of thinking about this is saying that your features cannot covary arbitrarily, but the manifold assumption is quite a bit stronger.  It usually assumes a Reimannian (i.e., locally Euclidean) structure, with data points "sufficiently" densely sampled.  In other words, life looks like the left, not the right, below:
Okay, yes, I know that the "Bad" one is a 2D manifold embedded in 2D, but that's only because I can't draw 3D images :).  And anyway, this is a "weird" manifold in the sense that at one point (where the +s and -s meet), it drops down to 1D.  This is fine in math-manifold land, but usually not at all accounted for in ML-manifold land.

The problem, of course, is that once you say "margin" and "manifold" in the same sentence, things just can't possibly work out.  You'd end up with a picture like:

This is fine from a margin perspective, but it's definitely not a (densely sampled) manifold any more.

In fact, almost by definition, once you stick a margin into a manifold (which is okay, since you'll define margin Euclideanly, and manifolds know how to deal with Euclidean geometry locally), you're hosed.

So I guess the question is: who do you believe?

31 August 2010

Online Learning Algorithms that Work Harder

It seems to be a general goal in practical online learning algorithm development to have the updates be very very simply.  Perceptron is probably the simplest, and involves just a few adds.  Winnow takes a few multiplies.  MIRA takes a bit more, but still nothing hugely complicated.  Same with stochastic gradient descent algorithms for, eg., hinge loss.

I think this maybe used to make sense.  I'm not sure that it makes sense any more.  In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data.  There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting).

Why do I feel this way?

Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning.  This is what John Langford observed in his Vowpal Wabbit implementation.  He has to do multiple passes.  He deals with the I/O bottleneck by creating an I/O friendly, proprietary version of the input file during the first past, and then careening through it on subsequent passes.

In this case, basically what John is seeing is that I/O is too slow.  Or, phrased differently, learning is too fast :).  I never thought I'd say that, but I think it's true.  Especially when you consider that just having two threads is a pretty low requirement these days, it would be nice to put 8 or 16 threads to good use.

But I think the problem is actually quite a bit more severe.  You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized.  In particular, someone has already gone through the effort of computing all your features for you.  Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update.

So in the real world, you're probably going to be computing your features and learning on the fly.  (Or at least that's what I always do.)  In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features.

One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes.  In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P).  I've never tried this myself because I am lazy.  Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially).

A similar problem crops up in structured problems.  Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients.  (This is similar to the problems you run into when trying to parallelize the structured perceptron.)

Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per-example and required fewer passes; I hope someone will invent one for me!

27 August 2010

Calibrating Reviews and Ratings

NIPS decision are going out soon, and then we're done with submitting and reviewing for a blessed few months. Except for journals, of course.

If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).

One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. Conferences try to adjust for this, for instance NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that. Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this. (For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. I have friends who never give a one -- except in the case of something just being wrong -- and often give 5s. Perhaps I should be nicer; I know CS tends to be harder on itself than other fiends.) As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration.

There's also the issue of areas. Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?

A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och. The example he gives is very nice. If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars, and a 79% recommendation. On the other hand, if you look up In'N'Out Burger, a LA-based burger chain (which, having grown up in LA, was admittedly one of my favorite places to eat in high school, back when I ate stuff like that) you see another 4.0/5.0 stars and a 95% recommendation.

So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. And we expect this to work?!

Probably the main issue here is calibrating for expectations. As a teacher, I've figured out quickly that managing student expectations is a big part of getting good teaching reviews. If you go to In'N'Out, and have expectations for a Big Mac, you'll be pleasantly surprised. If you go to The French Laundry with expectations of having a meal worth selling your soul, your children's souls, etc., for, then you'll probably be disappointed (though I can't really say: I've never been).

One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something). You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price. For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is.

(Anticipating comments, I don't think this is just an "aspect" issue. I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)

I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.) there simply isn't a total order on the "quality" of the objects you're looking at. (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) There is maybe a situation-depend order, and the distance to hotel, or "$" rating, or area classes are heuristics for describing this "situation." Bit without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.

23 August 2010

Finite State NLP with Unlabeled Data on Both Sides

(Can you tell, by the recent frequency of posts, that I'm try not to work on getting ready for classes next week?)

[This post is based partially on some conversations with Kevin Duh, though not in the finite state models formalism.]

The finite state machine approach to NLP is very appealing (I mean both string and tree automata) because you get to build little things in isolation and then chain them together in cool ways. Kevin Knight has a great slide about how to put these things together that I can't seem to find right now, but trust me that it's awesome, especially when he explains it to you :).

The other thing that's cool about them is that because you get to build them in isolation, you can use different data sets, which means data sets with different assumptions about the existence of "labels", to build each part. For instance, to do speech to speech transliteration from English to Japanese, you might build a component system like:

English speech --A--> English phonemes --B--> Japanese phonemes --C--> Japanese speech --D--> Japanese speech LM

You'll need a language model (D) for Japanese speech, that can be trained just on acoustic Japanese signals, then parallel Japanese speech/phonemes (for C), parallel English speech/phonemes (for A) and parallel English phonemes/Japanese phonemes (for B). [Plus, of course, if you're missing any of these, EM comes to your rescue!]

Let's take a simpler example, though the point I want to make applies to long chains, too.

Suppose I want to just do translation from French to English. I build an English language model (off of monolingual English text) and then an English-to-French transducer (remember that in the noisy channel, things flip direction). For the E2F transducer, I'll need parallel English/French text, of course. The English LM gives me p(e) and the transducer gives me p(f|e), which I can put together via Bayes' rule to get something proportional to p(e|f), which will let me translate new sentences.

But, presumably, I also have lots of monolingual French text. Forgetting math for a moment, which seems to suggest that this can't help me, we can ask: why should this help?

Well, it probably won't help with my English language model, but it should be able to help with my transducer. Why? Because my transducer is supposed to give me p(f|e). If I have some French sentence in my GigaFrench corpus to which my transducer assigns zero probability (for instance, max_e p(f|e) = 0), then this is probably a sign that something bad is happening.

More generally, I feel like the following two operations should probably give roughly the same probabilities:

  1. Drawing an English sentence from the language model p(e).
  2. Picking a French sentence at random from GigaFrench, and drawing an English sentence from p(e|f), where p(e|f) is the composition of the English LM and the transducer.
If you buy this, then perhaps one thing you could do is to try to learn a transducer q(f|e) that has low KL divergence between 1 and 2, above. If you work through the (short) make, and throw away terms that are independent of the transducer, then you end up wanting to minimize [ sum_e p(e) log sum_f q(f|e) ]. Here, the sum over f is a finite sum over GigaFrench, and the sum over e is an infinite sum over positive probability English sentences given my the English LM p(e).

One could then apply something like posterior regularization (Kuzman Ganchev, Graça and Taskar) to do the learning. There's the nasty bit about how to compute these things, but that's why you get to be friends with Jason Eisner so he can tell you how to do anything you could ever want to do with finite state models.

Anyway, it seems like an interesting idea. I'm definitely not aware if anyone has tried it.

21 August 2010

Readers kill blogs?

I try to avoid making meta-posts, but the timing here was just too impeccable for me to avoid a short post on something that's been bothering me for a year or so.

I actually complete agree with both points. The problem is that I worry that they are actually fairly opposed. I comment much less on other people's blogs now that I use reader, because the 10 second overhead of clicking on the blog, being redirected, entering a comment, blah blah blah, is just too high. Plus, I worry that no one (except the blog author) will see my comment, since most readers don't (by default) show comments in with posts.

Hopefully the architects behind readers will pick up on this and make these things (adding and viewing comments, within the reader -- yes, I realize that it's then not such a "reader") easier. That is, unless they want to lose out to tweets!

Until then, I'd like to encourage people to continue commenting here.

19 August 2010

Multi-task learning: should our hypothesis classes be the same?

It is almost an unspoken assumption in multitask learning (and domain adaptation) that you use the same type of classifier (or, more formally, the same hypothesis class) for all tasks. In NLP-land, this usually means that everything is a linear classifier, and the feature sets are the same for all tasks; in ML-land, this usually means that the same kernel is used for every task. In neural-networks land (ala Rich Caruana), this is enforced by the symmetric structure of the networks used.

I probably would have gone on not even considering this unspoken assumption, until a few years ago I saw a couple papers that challenged it, albeit indirectly. One was Factorizing Complex Models: A Case Study in Mention Detection by Radu (Hans) Florian, Hongyan Jing, Nanda Kambhatla and Imed Zitouni, all from IBM. They're actually considering solving tasks separately rather than jointly, but joint learning and multi-task learning are very closely related. What they see is that different features are useful for spotting entity spans, and for labeling entity types.

That year, or the next, I saw another paper (can't remember who or what -- if someone knows what I'm talking about, please comment!) that basically showed a similar thing, where a linear kernel was doing best for spotting entity spans, and a polynomial kernel was doing best for labeling the entity types (with the same feature sets, if I recall correctly).

Now, to some degree this is not surprising. If I put on my feature engineering hat, then I probably would design slightly different features for these two tasks. On the other hand, coming from a multitask learning perspective, this is surprising: if I believe that these tasks are related, shouldn't I also believe that I can do well solving them in the same hypothesis space?

This raises an important (IMO) question: if I want to allow my hypothesis classes to be different, what can I do?

One way is to punt: you can just concatenate your feature vectors and cross your fingers. Or, more nuanced, you can have some set of shared features and some set of features unique to each task. This is similar (the nuanced version, not the punting version) to what Jenny Finkel and Chris Manning did in their ACL paper this year, Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.

An alternative approach is to let the two classifiers "talk" via unlabeled data. Although motivated differently, this was something of the idea behind my EMNLP 2008 paper on Cross-Task Knowledge-Constrained Self Training, where we run two models on unlabeled data and look for where they "agree."

A final idea that comes to mind, though I don't know if anyone has tried anything like this, would be to try to do some feature extraction over the two data sets. That is, basically think of it as a combination of multi-view learning (since we have two different hypothesis classes) and multi-task learning. Under the assumption that we have access to examples labeled for both tasks simultaneously (i.e., not the settings for either Jenny's paper or my paper), then one could do a 4-way kernel CCA, where data points are represented in terms of their task-1 kernel, task-2 kernel, task-1 label and task-2 label. This would be sort of a blending of CCA-for-multiview-learning and CCA-for-multi-task learning.

I'm not sure what the right way to go about this is, but I think it's something important to consider, especially since it's an assumption that usually goes unstated, even though empirical evidence seems to suggest it's not (always) the right assumption.

02 August 2010

Why Discourse Structure?

I come from a strong lineage of discourse folks. Writing a parser for Rhetorical Structure Theory was one of the first class projects I had when I was a grad student. Recently, with the release of the Penn Discourse Treebank, there has been a bit of a flurry of interest in this problem (I had some snarky comments right after ACL about this). I've also talked about why this is a hard problem, but never really about why it is an interesting problem.

My thinking about discourse has changed a lot over the years. My current thinking about it is in an "interpretation as abduction" sense. (And I sincerely hope all readers know what that means... if not, go back and read some classic papers by Jerry Hobbs.) This is a view I've been rearing for a while, but I finally started putting it into words (probably mostly Jerry's words) in a conversation at ACL with Hoifung Poon and Joseph Turian (I think it was Joseph... my memory fades quickly these days :P).

This view is that discourse is that thing that gives you an interpretation above and beyond whatever interpretations you get from a sentence. Here's a slightly refined version of the example I came up with on the fly at ACL:

  1. I only like traveling to Europe. So I submitted a paper to ACL.
  2. I only like traveling to Europe. Nevertheless, I submitted a paper to ACL.
What does the hearer (H) infer from these sentences. Well, if we look at the sentences on their own, then H infers something like Hal-likes-travel-to-Europe-and-only-Europe, and H infers something like Hal-submitted-a-paper-to-ACL. But when you throw discourse in, you can derive two additional bits of information. In example (1), you can infer ACL-is-in-Europe-this-year and in (2) you can infer the negation of that.

Pretty amazing stuff, huh? Replacing a "so" with a "nevertheless" completely changes this interpretation.

What does this have to do with interpretation as abduction? Well, we're going to assume that this discourse is coherent. Given that assumption, we have to ask ourselves: in (1), what do we have to assume about the world to make this discourse coherent? The answer is that you have to assume that ACL is in Europe. And similarly for (2).

Of course, there are other things you could assume that would make this discourse coherent. For (1), you could assume that I have a rich benefactor who likes ACL submissions and will send me to Europe every time I submit something to ACL. For (2), you could assume that I didn't want my paper to get in, but I wanted a submission to get reviews, and so I submitted a crappy paper. Or something. But these fail the Occam's Razor test. Or, perhaps they are a priori simply less likely (i.e., you have to assume more to get the same result).

Interestingly, I can change the interpretation of (2), for instance, by adding a third sentence to the discourse: "I figured that it would be easy to make my way to Europe after going to Israel." Here, we would abduce that ACL is in Israel, and that I'm willing to travel to Israel on my way to Europe. For you GOFAI folks, this would be something like non-monotonic reasoning.

Whenever I talk about discourse to people who don't know much about it, I always get this nagging sense of "yes, but why do I care that you can recognize that sentence 4 is background to sentence 3, unless I want to do summarization?" I hope that this view provides some alternative answer to that question. Namely, that there's some information you can get from sentences, but there is additional information in how those sentences are glued together.

Of course, one of the big problems we have is that we have no idea how to represent sentence-level interpretations, or at least some ideas but no way to get there in the general case. In the sentence-level case, we've seen some progress recently in terms of representing semantics in a sort of substitutability manner (ala paraphrasing), which is nice because the representation is still text. One could ask if something similar might be possible at a discourse level. Obviously you could paraphrase discourse connectives, but that's missing the point. What else could you do?

24 July 2010

ACL 2010 Retrospective

ACL 2010 finished up in Sweden a week ago or so. Overall, I enjoyed my time there (the local organization was great, though I think we got hit with unexpected heat, so those of us who didn't feel like booking a room at the Best Western -- hah! why would I have done that?! -- had no A/C and my room was about 28-30 every night).

But you don't come here to hear about sweltering nights, you come to hear about papers. My list is actually pretty short this time. I'm not quite sure why that happened. Perhaps NAACL sucked up a lot of the really good stuff, or I went to the wrong sessions, or something. (Though my experience was echoed by a number of people (n=5) I spoke to after the conference.) Anyway, here are the things I found interesting.

  • Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, by Matthew Gerber and Joyce Chai (this was the Best Long Paper award recipient). This was by far my favorite paper of the conference. For all you students out there (mine included!), pay attention to this one. It was great because they looked at a fairly novel problem, in a fairly novel way, put clear effort into doing something (they annotated a bunch of data by hand), developed features that were significantly more interesting than the usual off-the-shelf set, and got impressive results on what is clearly a very hard problem. Congratulations to Matthew and Joyce -- this was a great paper, and the award is highly deserved.

  • Challenge Paper: The Human Language Project: Building a Universal Corpus of the World’s Languages, by Steven Abney and Steven Bird. Basically this would be awesome if they can pull it off -- a giant structured database with stuff from tons of languages. Even just having tokenization in tons of languages would be useful for me.

  • Extracting Social Networks from Literary Fiction, by David Elson, Nicholas Dames and Kathleen McKeown. (This was the IBM best student paper.) Basically they construct networks of characters from British fiction and try to analyze some literary theories in terms of those networks, and find that there might be holes in the existing theories. My biggest question, as someone who's not a literary theorist, is why did those theories exist in the first place? The analysis was over 80 or so books, surely literary theorists have read and pondered all of them.

  • Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish, by Reyyan Yeniterzi and Kemal Oflazer. You probably know that I think translating morphology and translating out of English are both interesting topics, so it's perhaps no big surprise that I liked this paper. The other thing I liked about this paper is that they presented things that worked, as well as things that might well have worked but didn't.

  • Learning Common Grammar from Multilingual Corpus, by Tomoharu Iwata, Daichi Mochihashi and Hiroshi Sawad. I wouldn't go so far as to say that I thought this was a great paper, but I would say there is the beginning of something interesting here. They basically learn a coupled PCFG in Jenny Finkel hierarchical-Bayes style, over multiple languages. The obvious weakness is that languages don't all have the same structure. If only there were an area of linguistics that studies how they differ.... (Along similar lines, see
    Phylogenetic Grammar Induction, by Taylor Berg-Kirkpatrick and Dan Klein, which has a similar approach/goal.)

  • Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation, by Michael Bloodgood and Chris Callison-Burch. The "trend" referenced in the title is that active learning always asymptotes depressingly early. They have turkers translate bits of sentences in context (i.e., in a whole sentence, translate the highlighted phrase) and get a large bang-for-the-buck. Right now they're looking primarily at out-of-vocabulary stuff, but there's a lot more to do here.
A few papers that I didn't see, but other people told me good things about:
At any rate, I guess that's a reasonably long list. There were definitely good things, but with a fairly heavy tail. If you have anything you'd like to add, feel free to comment. (As an experiment, I've turned comment moderation on as a way to try to stop the spam... I'm not sure I'll do it indefinitely; I hadn't turned it on before because I always thought/hoped that Google would just start doing spam detection and/or putting hard captcha's up or something to try to stop spam, but sadly they don't seem interested.)

28 June 2010

ICML 2010 Retrospective

Just got back from Israel for ICML, which was a great experience: I'd wanted to go there for a while and this was a perfect opportunity. I'm very glad I spent some time afterwards out of Haifa, though.

Overall, I saw a lot of really good stuff. The usual caveats apply (I didn't see everything it's a biased sample, blah blah blah). Here are some things that stood out:

Structured Output Learning with Indirect Supervision (M.-W. Chang, V. Srikumar, D. Goldwasser, D. Roth). This was probably one of my favorite papers of the conference, even though I had learned some about the work when I visited UIUC a few months ago. Let's say you're trying to do word alignment, and you have a few labeled examples of alignments. But then you also have a bunch of parallel data. What can you do? You can turn the parallel data into a classification problem: are these two sentences translations of each other. You can pair random sentences to get negative examples. A very clever observation is basically that the weight vector for this binary classifier should point in the same direction as the weight vector for the (latent variable) structured problem! (Basically the binary classifier should say "yes" only when there exists an alignment that renders these good translations.) Tom Dietterich asked a question during Q/A: these binary classification problems seem very hard: is that bad? Ming-Wei reassured him that it wasn't. In thinking about it after the fact, I wonder if it is actually really importantant that they're hard: namely, if they were easy, then you could potentially answer the question without bothering to make up a reasonable alignment. I suspect this might be the case.

A Language-based Approach to Measuring Scholarly Impact (S. Gerrish, D. Blei). The idea here is that without using citation structure, you can model influence in large document collections. The basic idea is that when someone has a new idea, they often introduce new terminology to a field that wasn't there before. The important bit is that they don't change all of science, or even all of ACL: they only change what gets talked about in their particular sub-area (aka topic :P). It was asked during Q/A what would happen if you did use citations, and my guess based on my own small forays in this area is that the two sources would really reinforce eachother. That is, you might regularly cite the original EM even if your paper has almost nothing to do with it. (The example from the talk was then Penn Treebank paper: one that has a bajillion citations, but hasn't lexically affected how people talk about research.)

Hilbert Space Embeddings of Hidden Markov Models (L. Song, B. Boots, S. Saddiqi, G. Gordon, A. Smola). This received one of the best paper awards. While I definitely liked this paper, actually what I liked more what that it taught me something from COLT last year that I hadn't known (thanks to Percy Liang for giving me more details on this). That paper was A spectral algorithm for learning hidden Markov models (D. Hsu, S. Kakade, T. Zhang) and basically shows that you can use spectral decomposition techniques to "solve" the HMM problem. You create the matrix of observation pairs (A_ij = how many times did I see observation j follow observation i) and then do some processing and then a spectral decomposition and, voila, you get parameters to an HMM! In the case that the data was actually generated by an HMM, you get good performance and good guarantees. Unfortunately, if the data was not generated by an HMM, then the theory doesn't work and the practice does worse than EM. That's a big downer, since nothing is ever generated by the model we use, but it's a cool direction. At any rate, the current paper basically asks what happens if your observations are drawn from an RKHS, and then does an analysis. (Meta-comment: as was pointed out in the Q/A session, and then later to me privately, this has fairly strong connections to some stuff that's been done in Gaussian Process land recently.)

Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process (N. Bartlett, D. Pfau, F. Wood). This paper shows that if you're building a hierarchical Pitman-Yor language model (think Kneser-Ney smoothing if that makes you feel more comfortable) in an online manner, then you should feel free to throw out entire restaurants as you go through the process. (A restaurant is just the set of counts for a given context.) You do this to maintain a maximum number of restaurants at any given time (it's a fixed memory algorithm). You can do this intelligently (via a heuristic) or just stupidly: pick them at random. Turns out it doesn't matter. The explanation is roughly that if it were important, and you threw it out, you'd see it again and it would get re-added. The chance that something that occurs a lot keeps getting picked to be thrown out is low. There's some connection to using approximate counting for language modeling, but the Bartlett et al. paper is being even stupider than we were being!

Learning efficiently with approximate inference via dual losses (O. Meshi, D. Sontag, T. Jaakkola, A. Globerson). Usually when you train structured models, you alternate between running inference (a maximization to find the most likely output for a given training instance) and running some optimization (a minimization to move your weight vector around to achieve lower loss). The observation here is that by taking the dual of the inference problem, you turn the maximization into a minimization. You now have a dual minimization, which you can solve simultaneously, meaning that when your weights are still crappy, you aren't wasting time finding perfect outputs. Moreover, you can "warm start" your inference for the next round. It's a very nice idea. I have to confess I was a bit disappointed by the experimental results, though: the gains weren't quite what I was hoping. However, most of the graphs they were using weren't very large, so maybe as yo move toward harder problems, the speed-ups will be more obvious.

Deep learning via Hessian-free optimization (J. Martens). Note that I neither saw this presentation nor read the paper (skimmed it!), but I talked with James about this over lunch one day. The "obvious" take away message is that you should read up on your optimization literature, and start using second order methods instead of your silly gradient methods (and don't store that giant Hessian: use efficient matrix-vector products). But the less obvious take away message is that some of the prevailing attitudes about optimizing deep belief networks may be wrong. For those who don't know, the usual deal is to train the networks layer by layer in an auto-encoder fashion, and then at the end apply back-propogation. The party line that I've already heard is that the layer-wise training is very important to getting the network near a "good" local optimum (whatever that means). But if James' story holds out, this seems to not be true: he doesn't do any clever initialization and still find good local optima!

A theoretical analysis of feature pooling in vision algorithms (Y.-L. Boureau, J. Ponce, Y. LeCun). Yes, that's right: a vision paper. Why should you read this paper? Here's the question they're asking: after you do some blah blah blah feature extraction stuff (specifically: Sift features), you get something that looks like a multiset of features (hrm.... sounds familiar). These are often turned into a histogram (basically taking averages) and sometimes just used as a bag: did I see this feature or not. (Sound familiar yet?) The analysis is: why should one of these be better and, in particular, why (in practice) do vision people see multiple regimes. Y-Lan et al. provide a simple, obviously broken, model (that assumes feature independence... okay, this has to sound familiar now) to look at the discriminability of these features (roughly the ration of between-class variances and overall variances) to see how these regimes work out. And they look basically how they do in practice (modulo one "advanced" model, which doesn't quite work out how they had hoped).

Some other papers that I liked, but don't want to write too much about:

Some papers that other people said they liked were:
Hope to see you at ACL!

07 June 2010

NAACL 2010 Retrospective

I just returned from NAACL 2010, which was simultaneously located in my home town of Los Angeles and located nowhere near my home town of Los Angeles. (That's me trying to deride downtown LA as being nothing like real LA.)

Overall I was pleased with the program. I saw a few talks that changed (a bit) how I think about some problems. There were only one or two talks I saw that made me wonder how "that paper" got in, which I think is an acceptable level. Of course I spend a great deal of time not at talks, but no longer feel bad about doing so.

On tutorials day, I saw Hoifung Poon's tutorial on Markov Logic Networks. I think Hoifung did a great job of targetting the tutorial at just the right audience, which probably wasn't exactly me (though I still quite enjoyed it myself). I won't try to describe MLNs, but my very brief summary is "language for compactly expressing complex factor graphs (or CRFs, if you prefer)." That's not exactly right, but I think it's pretty close. You can check back in a few months and see if there are going to be any upcoming "X, Y and Daume, 2011" papers using MLNs :P. At any rate, I think it's a topic worth knowing about, especially if you really just want to get a system up and running quickly. (I'm also interested in trying Andrew McCallum's Factorie system, which, to some degree, trades easy of use for added functionality. But honestly, I don't really have time to just try things these days: students have to do that for me.)

One of my favorite papers of the conference was one that I hadn't even planned to go see! It is Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables by Tetsuji Nakagawa, Kentaro Inui and Sadao Kurohashi. (I saw it basically because by the end of the conference, I was too lazy to switch rooms after the prvious talk.) There are two things I really like about this paper. The first is that the type of sentiment they're going after is really broad. Example sentences included things that I'd love to look up, but apparently were only in the slides... but definitely more than "I love this movie." The example in the paper is "Tylenol prevents cancer," which is a nice positive case.

The basic idea is that some words give you sentiment. For instance, by itself, "cancer" is probably negative. But then some words flip polarity. Like "prevents." Or negation. Or other things. They set up a model based on sentence level annotations with latent variables for the "polarity" words and for the "flipping" words. The flipping words are allowed to flip any sentiment below them in the dependency tree. Cool idea! Of course, I have to nit-pick the paper a bit. It probably would be better to allow arguments/adjuncts to flip polarity, too. Otherwise, negation (which is usually a leaf) will never flip anything. And adjectives/adverbs can't flip either (eg., going from "happy" to "barely happy"). But overall I liked the paper.

A second thing I learned is that XOR problems do exist in real life, which I had previously questioned. The answer came (pretty much unintentionally) from the paper The viability of web-derived polarity lexicons by Leonid Velikovich, Sasha Blair-Goldensohn, Kerry Hannan and Ryan McDonald. I won't talk much about this paper other than to say that if you have 4 billion web pages, you can get some pretty good sentimenty words, if you're careful to not blindly apply graph propagation. But at the end, they throw a meta classifier on the polarity classification task, whose features include things like (1) how many positive terms are in the text, (2) how many negative terms are in the text, (3) how many negations are in the text. Voila! XOR! (Because negation XORs terms.)

I truly enjoyed Owen Rambow's poster on The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece. If you're ever taken a class in mathematical logic, it is very easy for me to summarize this paper: parse trees (dependency or phrase structure) are your languge, but unless you have a theory of that language (in the model-theoretic sense) then whatever you do is meaningless. In more lay terms: you can always push symbols around, but unless you tie a semantics to those symbols, you're really not doing anything. Take home message: pay attention to the meaning of your symbols!

In the category of "things everyone should know about", there was Painless unsupervised learning with features by Taylor Berg-Kirkpatrick, Alexandre Bouchard Côté, John DeNero and Dan Klein. The idea is that you can replace your multinomails in an HMM (or other graphical model) with little maxent models. Do EM in this for unsuperviesd learning and you can throw in a bunch of extra features. I would have liked to have seen a comparison against naive Bayes with the extra features, but my prior belief is sufficiently strong that I'm willing to believe that it's helpful. The only sucky thing about this training regime is that training maxent models with (tens of) thousands of classes is pretty painful. Perhaps a reduction like tournaments or SECOC would help bring it down to a log factor.

I didn't see the presentation for From baby steps to leapfrog: How "Less is More" in unsupervised dependency parsing by Valetin Spitkovsky, Hiyan Alshawi and Dan Jurafsky, but I read it. The idea is that you can do better unsupervised dependency parsing by giving your learner progressively harder examples. I really really really tried to get something like this to work for unsearn, but nothing helped and most things hurn. (I only tried adding progressively longer sentences: other ideas, based on conversations with other folks, include looking at vocabulary size, part of speech (eg., human babies learn words in a particular order), etc.) I'm thrilled it actually works.

Again, I didn't see Discriminative Learning over Constrained Latent Representations by Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, but I learned about the work when I visited UIUC recently (thanks again for the invitation, Dan R.!). This paper does exactly what you would guess from the title: learns good discriminative models when you have complex latent structures that you know something about a priori.

I usually ask people at the end of conferences what papers they liked. Here are some papers that were spoken highly of by my fellow NAACLers. (This list is almost unadulterated: one person actually nominated one of the papers I thought shouldn't have gotten in, so I've left it off the list. Other than that, I think I've included everything that was specifically mentioned to me.)

  1. Optimal Parsing Strategies for Linear Context-Free Rewriting Systems by Daniel Gildea.

  2. Products of Random Latent Variable Grammars by Slav Petrov.

  3. Joint Parsing and Alignment with Weakly Synchronized Grammars by David Burkett, John Blitzer and Dan Klein.

  4. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia by Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil and Lillian Lee.

  5. Type-Based MCMC by Percy Liang, Michael I. Jordan and Dan Klein.
I think I probably have two high level "complaints" about the program this year. First, I feel like we're seeing more and more "I downloaded blah blah blah data and trained a model using entirely standard features to predict something and it kind of worked" papers. I apologize if I've just described your paper, but these papers really rub me the wrong way. I feel like I just don't learn anything from them: we already know that machine learning works surprisingly well and I don't really need more evidence of that. Now, if my sentence described your paper, but your paper additionally had a really interesting analysis that helps us understand something about language, then you rock! Second, I saw a lot of presentations were speakers were somewhat embarassingly unaware of very prominent very relevant prior work. (In none of these cases was the prior work my own: it was work that's much more famous.) Sometimes the papers were cited (and it was more of a "why didn't you compare against that" issue) but very frequently they were not. Obviously not everyone knows about all papers, but I recognized this even for papers that aren't even close to my area.

Okay, I just ranted, so let's end on a positive note. I'm leaving the conference knowing more than when I went, and I had fun at the same time. Often we complain about the obvious type I errors and not-so-obvious type II errors, but overall I found the program strong. Many thanks to the entire program committee for putting together an on-average very good set of papers, and many thanks to all of you for writing these papers!

29 April 2010

Graduating? Want a post-doc? Let NSF pay!

I get many of emails of the form "I'm looking for a postdoc...." I'm sure that other, more senior, more famous people get lots of these. My internal reaction is always "Great: I wish I could afford that!" NSF has a solution: let them pay for it! This is the second year of the CI fellows program, and I know of two people who did it last year (one in NLP, one in theory). I think it's a great program, both for faculty and for graduates (especially since the job market is so sucky this year).

If you're graduating, you should apply (unless you have other, better, job options already). But beware, the deadline is May 23!!! Here's more info directly from NSF's mouth:

The CIFellows Project is an opportunity for recent Ph.D. graduates in computer science and closely related fields to obtain one- to two-year postdoctoral positions at universities, industrial research laboratories, and other organizations that advance the field of computing and its positive impact on society. The goals of the CIFellows project are to retain new Ph.D.s in research and teaching and to support intellectual renewal and diversity in the computing fields at U.S. organizations.
Every CIFellow application must identify 1-3 host mentors. Click here to visit a website where prospective mentors have posted their information. In addition, openings that have been posted over the past year (and may be a source of viable mentors/host organizations for the CIFellowships) are available here.
Good luck!

20 April 2010

((A => B) and (not B)) => (not A)

I remember back in middle school or high school -- added uncertainty so as not to date myself too much -- I first learned of the existence of file compression software. We used pkzip, mostly. I remember one of the first things I did was: compress myfile.exe to myfile.exe.zip. Then compress that to myfile.exe.zip.zip. Then myfile.exe.zip.zip.zip. I cannot tell you how disappointed I was when, at one stage, the file size went up after "compression."

We read papers all the time that demonstrate something roughly of the form "if A then B." The happens most obviously the closer you get to theory (when such statements can be made precise), but happens also in non-theoretical work. The point of this post is: if you believe A => B, then you have to ask yourself: which do I believe more? A, or not B?

The simplest case is the compression case. Let's say a weak compressor is one that always reduces a (non-empty) file's size by one bit. A strong compressor is one that cuts the file down to one bit. I can easily prove to you that if you give me a weak compressor, I can turn it into a strong compressor by running it N-1 times on files of size N. Trivial, right? But what do you conclude from this? You're certainly not happy, I don't think For what I've proved really is that weak compressors don't exist, not that strong compressors do. That is, you believe so strongly that a strong compressor is impossible, that you must conclude from (weak) => (strong) that (weak) cannot possibly exist.

Let's take the most obvious example from machine learning: boosting. The basic result of boosting is that I can turn a "weak classifier" into a "strong classifier." A strong classifier is one that (almost always) does a really good job at classification. A weak classifier is one that (always always) does a not completely crappy job at classification. That is, a strong classifier will get you 99.9% accuracy (most of the time) while I weak classifier will only get you 50.1% accuracy (most of the time). Boosting works by repeatedly applying the weak classifier to modified data sets in order ot get a strong classifier out.

In all the boosting papers, the results are presented as positive. That is: look, obviously we want strong classifiers. But weak classifiers look much more attainable. And voila, by boosting magic, we can turn the weak into the strong. This is an A => B setting: (existence of weak classifiers) => (existence of strong classifiers).

Sounds great, right? But let me ask you: do you believe more strongly that (weak classifiers exist) or more strongly that (strong classifiers do not exist)? For me, it's the latter. To some degree, no free lunch theorems apply here. This yields a totally different read of boosting results, namely: weak classifiers don't even exist!!!

More on the language side, one of the more classic experimentally results we have is that if you give me a really good semantic representation, I can do an awesome job doing natural language generation from those semantics. In order to do translation, for instance, I just have to generate the semantics of a French sentence and then I can generate a near-perfect English translation. (French semantics) => (Perfect translations). But do I believe mose strongly that we can get perfect French semantics or that we can not get perfect translation? Right now, almost certainly the latter.

My point is really one about critical reading. When you read a paper, things will be presented in one light. But that's often not the only light in which they can be interpreted. Apply your own interpretation.

12 April 2010

How I teach machine learning

I've had discussions about this with tons of people, and it seems like my approach is fairly odd. So I thought I'd blog about it because I've put a lot of thought into it over the past four offerings of the machine learning course here at Utah.

At a high level, if there is one thing I want them to remember after the semester is over it's the idea of generalization and how it relates to function complexity. That's it. Now, more operationally, I'd like them to learn SVMs (and kernels) and EM for generative models.

In my opinion, the whole tenor of the class is set by how it starts. Here's how I start.

  1. Decision trees. No entropy. No mutual information. Just decision trees based on classification accuracy. Why? Because the point isn't to teach them decision trees. The point is to get as quickly as possible to the point where we can talk about things like generalization and function complexity. Why decision trees? Because EVERYONE gets them. They're so intuitive. And analogies to 20 questions abound. We also talk about the who notion of data being drawn from a distribution and what it means to predict well in the future.

  2. Nearest neighbor classifiers. No radial basis functions, no locally weighted methods, etc. Why? Because I want to introduce the idea of thinking of data as points in high dimensional space. This is a big step for a lot of people, and one that takes some getting used to. We then do k-nearest neighbor and relate it to generalization, overfitting, etc. The punch line of this section is the idea of a decision boundary and the complexity of decision boundaries.

  3. Linear algebra and calculus review. At this point, they're ready to see why these things matter. We've already hinted at learning as some sort of optimization (via decision trees) and data in high dimensions, hence calculus and linear algebra. Note: no real probability here.

  4. Linear classifiers as methods for directly optimizing a decision boundary. We start with 0-1 loss and then move to perceptron. Students love perceptron because it's so procedural.
The rest follows mostly as almost every other machine learning course out there. But IMO these first four days are crucial. I've tried (in the past) starting with linear regression or linear classification and it's just a disaster. You spend too much time talking about unimportant stuff. The intro with error-based decision trees moving to kNN is amazingly useful.

The sad thing is that there are basically no books that follow any order even remotely like this. Except...drum roll... it's actually not far from what Mitchell's book does. Except he does kNN much later. It's really depressing how bad most machine learning books are from a pedagogical perspective... you'd think that in 12 years someone would have written something that works better.

On top of that, the most recent time I taught ML, I structured everything around recommender systems. You can actually make it all work, and it's a lot of fun. We actually did recommender systems for classes here at the U (I had about 90-odd students from AI the previous semester fill out ratings on classes they'd taken in the past). The data was a bit sparse, but I think it was a lot of fun.

The other thing I change most recently that I'm very happy with is that I have a full project on feature engineering. (It ties in to the course recommender system idea.) Why? Because most people who take ML, if they ever use it at all, will need to do this. It's maybe one of the most important things that they'll have to learn. We should try to teach it. Again, something that no one ever talks about in books.

Anyway, that's my set of tricks. If you have some that you particularly like, feel free to share!

07 April 2010

When Maximum Likelihood Doesn't Make Sense

One of the most fun AHA moments in ML or stats is when you see that for multinomial distributions, your naive idea of relative frequency corresponds (through a bunch of calculus) to maximum likelihood estimation. Ditto means of Gaussians, though that's never quite as amazing because Gaussians seems sort of odd beasts anyway. It's nice when our intuitions match what stats tells us.

I'll often watch a DVD, and then leave it in the DVD player until I want to watch something else. Usually it will return to the menu screen and the timer display will go through 0:00, 0:01, 0:02, ..., until it hits the length of the title screen loop, and then go back to 0:00. What I always wonder when I glance up at it is "what is the actual length of the title screen loop?" That is, what is the highest value it'll count up to.

Being a good statistician, I set up a model. First I play the frequentist game. Since I glance up and pretty arbitrary times, my observations can be modeled as a uniform random variable from 0 to some upper bound U, which is the quantity I want to infer.

Supposing that I only make one observation, say 0:27, I want a maximum likelihood estimate of U. It's easy to see that U=27 is the maximum likelihood solution. Anything less than 27 will render my observation impossible. For any U>=27, my observation has probability 1/(U+1), so the ML solution is precisely U=27. Note that even if I observe five observations a <= b <= c <= d <= e, the ML solution is still U=e. Huh. The math works. The model is as correct as I could really imagine it. But this answer doesn't really seem reasonable to me. Somehow I expect a number greater than my observation to come about.

Perhaps I need a prior. That'll solve it. I'll add a prior p(U) and then do maximum a posteriori. Well, it's easy to see that if p(U) is unimodal, and its mode is less than my (highest) observation, then the MAP solution will still be the (highest) observation. If it's mode is more than my (highest) observation, then the MAP solution will be the mode of p(U). If I think about this a bit, it's hard for me to justify a multi-modal p(U), and also hard for me to be happy with a system in which my prior essentially completely determines my solution.

Or I could be fully Bayesian and look at a posterior distribution p(U | data). This will just be a left-truncated version of my prior, again, not very satisfying.

(Note that the "Maximum Likelihood Sets" idea, which I also like quite a bit, doesn't fix this problem either.)

It also really bugs me that only my largest observation really affects the answer. If I get one hundred observations, most of them centered around 0:10, and then one at 0:30, then I'd guess that 0:30 or 0:31 or 0:32 is probably the right answer. But if I observe a bunch of stuff spread roughly uniformly between 0:00 and 0:30, then I'd be more willing to believe something larger.

I don't really have a solution to this dilemma. Perhaps you could argue that my difficulties arise from the use of a Uniform distribution -- namely, that were I to use another distribution, these problems wouldn't really arise. I don't think this is quite true, as described below:

We actually run in to this problem in NLP fairly often. I observe a billion documents and see, in this 1b documents, 100k unique words, many of which are singletons. But I'm not willing to believe that if I see that document 1,000,000,001, that there won't be a new unseen word. Sure it's less likely that I see a new unique word than in document 1001, where it is almost guaranteed, but there's still a relatively large probability that some new word will show up in that next document.

The whole point is that somehow we expect random samples to be representatives of the underlying distribution. We don't expect them to define the underlying distribution. I don't actually expect to ever see the corner cases, unless I've seen everything else.

UPDATE: The Bayesian approach actually does do something reasonable. Here is some Matlab code for computing posteriors with a uniform prior, and results with an upper bound of 100 and various observations are plotted below:

01 April 2010

Classification weirdness, regression simplicity

In the context of some work on multitask learning, we came to realize that classification is kind of weird. Or at least linear classification. It's not that it's weird in a way that we didn't already know: it's just sort of a law of unexpected consequences.

If we're doing linear (binary) classification, we all know that changing the magnitude of the weight vector doesn't change the predictions. A standard exercise in a machine learning class might be to show that if your data is linearly separable, then for some models (for instance, unregularized models), the best solution is usually an infinite norm weight vector that's pointing in the right direction.

This is definitely not true of (linear) regression. Taking a good (or even perfect) linear regressor and blowing up the weights by some constant will kill your performance. By adding a regularizer, what you're basically doing is just saying how big you want that norm to be.

Of course, by regression I simply mean minimizing something like squared error and by classification I mean something like 0/1 loss or hinge loss or logistic loss or whatever.

I think this is stuff that we all know.

Where this can bite you in unexpected ways is the following. In lots of problems, like domain adaptation and multitask learning, you end up making assumptions roughly of the form "my weight vector for domain A should look like my weight vector for domain B" where "look like" is really the place where you get to be creative and define things how you feel best.

This is all well and good in the regression setting. A magnitude 5 weight in regression means a magnitude 5 weight in regression. But not so in classification. Since you can arbitrarily scale your weight vectors and still get the same decision boundaries, a magnitude 5 weight kind of means nothing. Or at least it means something that has to do more with the difficulty of the problem and how you chose to set your regularization parameter, rather than something to do with the task itself.

Perhaps we should be looking for definitions of "look like" that are insensitive to things like magnitude. Sure you can always normalize all your weight vectors to unit norm before you co-regularize them, but that loses information as well.

Perhaps this is a partial explanation of some negative transfer. One thing that you see, when looking at the literature in DA and MTL, is that all the tasks are typically of about the same difficulty. My expectation is that if you have two tasks that are highly related, but one is way harder than the other, is going to lead to negative transfer. Why? Because the easy task will get low norm weights, and the hard task will get high norm weights. The high norm weights will pull the low norm weights toward them too much, leading to worse performance on the "easy" task. In a sense, we actually want the opposite to happen: if you have a really hard task, it shouldn't screw up everyone else that's easy! (Yes, I know that being Bayesian might help here since you'd get a lot of uncertainty around those high norm weight vectors!)

20 February 2010

Google 5gram corpus has unreasonable 5grams

In the context of something completely unrelated, I was looking for a fairly general pattern in the Google 1TB corpus. In particular, I was looking for verbs that are sort of transitive. I did a quick grep for 5grams of the form "the SOMETHING BLAHed the SOMETHING." Or, more specifically:

    grep -i '^the [a-z][a-z]* [a-z][a-z]*ed the [a-z]*'
I then took these, lower cased them, and then merged the counts. Here are the top 25, sorted and with counts:
     1  101500  the surveyor observed the use
2 30619 the rivals shattered the farm
3 27999 the link entitled the names
4 22928 the trolls ambushed the dwarfs
5 22843 the dwarfs ambushed the trolls
6 21427 the poet wicked the woman
7 15644 the software helped the learning
8 13481 the commission released the section
9 12273 the mayor declared the motion
10 11046 the player finished the year
11 10809 the chicken crossed the road
12 8968 the court denied the motion
13 8198 the president declared the bill
14 7890 the board approved the following
15 7848 the bill passed the house
16 7373 the fat feed the muscle
17 7362 the report presented the findings
18 7115 the committee considered the report
19 6956 the respondent registered the domain
20 6923 the chairman declared the motion
21 6767 the court rejected the argument
22 6307 the court instructed the jury
23 5962 the complaint satisfied the formal
24 5688 the lord blessed the sabbath
25 5486 the bill passed the senate
What the heck?! First of all, the first one is shocking, but maybe you could convince me. How about numbers 4 and 5? "The trolls ambushed the dwarfs" (and vice versa)? These things are the fourth and fifth most common five grams matching my pattern on the web? "The poet wicked the woman"? What does "wicked" even mean? And yet these all beat out "The bill passed the house" and "The court instructed the jury". But then #23: "The prince compiled the Mishna"??? (#30 is also funny: "the matrix reloaded the matrix" is an amusing segmentation issue.)

If we do a vanilla google search for the counts of some of these, we get:
     1     10900  the surveyor observed the use
4 7750 the trolls ambushed the dwarfs
5 7190 the dwarfs ambushed the trolls
6 ZERO! the poet wicked the woman
15 20200000 the bill passed the house
22 3600000 the court instructed the jury
This just flabbergasts me. I'm told that lots of people have expressed worries over the Google 1TB corpus, but have never actually heard anything myself... And never seen anything myself.

Does anyone have an explanation for these effects? How can I expect to get anything done with such ridiculous data!