Are you graduating and interested in doing a fun postdoc in an area of your choosing, on a project of your choosing? Apply to be an NSF CI Fellow!
07 May 2011
28 February 2011
What are your plans between ACL and ICML?
I'll tell you what they should be: attending the Symposium on Machine Learning in Speech and Language Processing, jointly sponsored by IMLS, ICML and ISCA, that I'm co-organizing with Dan Roth, Geoff Zweig and Joseph Keshet (the exact date is June 27, in the same venue as ICML in Bellevue, Washington). So far we've got a great list of invited speakers from all of these areas, including Mark Steedman, Stan Chen, Yoshua Bengio, Lawrence Saul, Sanjoy Dasgupta and more. (See the web page for more details.) We'll also be organizing some sort of day trips (together with the local organizers of ICML) for people who want to join! You should also consider submitting papers (deadline is April 15).
I know I said a month ago that I would blog more. I guess that turned out to be a lie. The problem is that I only have so much patience for writing and I've been spending a lot of time writing non-blog things recently. I decided to use my time off teaching to do something far more time consuming than teaching. This has been a wondrously useful exercise for me and I hope that, perhaps starting in 2012, other people can take advantage of this work.
Posted by hal at 2/28/2011 01:27:00 PM | 3 comments
18 November 2010
Crowdsourcing workshop (/tutorial) decisions
Everyone at conferences (with multiple tracks) always complains that there are time slots with nothing interesting, and other time slots with too many interesting papers. People have suggested crowdsourcing this, enabling participants to say -- well ahead of the conference -- which papers they'd go to... then let an algorithm schedule.
I think there are various issues with this model, but I don't want to talk about them here. What I do want to talk about is applying the same ideas to workshop acceptance decisions. This comes up because I'm one of the two workshop chairs for ACL this year, and because John Langford just pointed to the ICML call for tutorials. (I think what I have to say applies equally to tutorials as to workshops.)
I feel like a workshop (or tutorial) is successful if it is well attended. This applies both from a monetary perspective and a scientific perspective. (Note, though, that I think that small workshops can also be successful, especially if they are fostering a small community, bringing people in, or serving other purposes. That is to say, size is not all that matters. But it is a big part of what matters.)
We have 30-odd workshop proposals for three of us to sort through (John Carroll and I are the two workshop chairs for ACL, and Marie Candito is the workshop chair for EMNLP; workshops are being reviewed jointly -- which actually makes the allocation process more difficult). The idea would be that I could create a poll, like the following:
- Are you going to ACL? Yes, maybe, no
- Are you going to EMNLP? Yes, maybe, no
- If workshop A were offered at a conference you were going to, would you go to workshop A?
- If workshop B...
- And so on
Of course we're not going to do this this year. It's too late already, and it would be unfair to publicise all the proposals, given that we didn't tell proposers in advance that we would do so. And of course I don't think this should exclusively be a popularity contest. But I do believe that popularity should be a factor. And it should probably be a reasonably big factor. Workshop chairs could then use the output of an optimization algorithm as a starting point, and use this as additional data for making decisions. Especially since two or three people are being asked to make decisions that cover--essentially--all areas of NLP, this actually seems like a good idea to me.
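To make the "additional data" idea concrete, here's a hypothetical sketch of how poll responses might be turned into an expected-attendance ranking. The field names and weights are made up, and real decisions would obviously still weigh topic balance, quality and logistics:

```python
def rank_workshops_by_expected_attendance(responses):
    """Rank workshop proposals by a crude expected-attendance score.

    Each response is assumed to look like:
      {"attending": "yes" | "maybe" | "no",
       "workshops": set of proposal names the respondent would attend}
    """
    weight = {"yes": 1.0, "maybe": 0.5, "no": 0.0}
    scores = {}
    for resp in responses:
        w = weight[resp["attending"]]
        for workshop in resp["workshops"]:
            scores[workshop] = scores.get(workshop, 0.0) + w
    # Highest expected attendance first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy example with two poll respondents:
print(rank_workshops_by_expected_attendance([
    {"attending": "yes", "workshops": {"A", "B"}},
    {"attending": "maybe", "workshops": {"B"}},
]))  # [('B', 1.5), ('A', 1.0)]
```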
I actually think something like this is more likely to actually happen at a conference like ICML than ACL, since ICML seems (much?) more willing to try new things than ACL (for better or for worse).
But I do think it would be interesting to try to see what sort of response you get. Of course, just polling on this blog wouldn't be sufficient: you'd want to spam perhaps all of last year's attendees. But this isn't particularly difficult.
Is there anything I'm not thinking of that would make this obviously not work? I could imagine someone saying that maybe people won't propose workshops/tutorials if the proposals will be made public? I find that a bit hard to swallow. Perhaps there's a small embarrassment factor if you're public and then don't get accepted. But I wouldn't advocate making the voting results public -- they would be private to the organizers / workshop chairs.
I guess -- I feel like I'm channeling Fernando here? -- that another possible issue is that you might not be able to decide which workshops you'd go to without seeing what papers are there and who is presenting. This is probably true. But this is also the same problem that the workshop chairs face anyway: we have to guess that good enough papers/people will be there to make it worthwhile. I doubt I'm any better at guessing this than any other random NLP person...
So what am I forgetting?
Posted by hal at 11/18/2010 03:52:00 PM | 8 comments
Labels: community
09 November 2010
Managing group papers
Every time a major conference deadline (ACL, NIPS, EMNLP, ICML, etc...) comes around, we usually have a slew of papers (>=3, typically) that are getting prepared. I would say on average 1 doesn't make it, but that's par for the course.
For AI-Stats, whose deadline just passed, I circulated student paper drafts to all of my folks to solicit comments at any level that they desired: anywhere from not understanding the problem/motivation to typos or errors in equations. My experience was that it was useful, both for distributing some of my workload and getting an alternative perspective, and for keeping everyone abreast of what everyone else is working on.
In fact, it was so successful that two students suggested to me that I require more-or-less complete drafts of papers at least one week in advance so that this can take place. How you require something like this is another issue, but the suggestion they came up with was that I'll only cover conference travel if this occurs. It's actually not a bad idea, but I don't know if I'm enough of a hard-ass (or perceived as enough of a hard-ass) to really pull it off. Maybe I'll try it though.
The bigger question is how to manage such a thing. I was thinking of installing some conference management software locally (e.g., HotCRP, which I really like) and giving students "reviewer" access. Then, they could upload their drafts, perhaps with an email circulated when a new draft is available, and other students (and me!) could "review" them. (Again, perhaps with an email circulated -- I'm a big fan of "push" technology: I don't have time to "pull" anymore!)
The only concern I have is that it would be really nice to be able to track updates, or to have the ability for authors to "check off" things that reviewers suggested. Or to allow discussion. Or something like that.
I'm curious if anyone has ever tried anything like this and whether it was successful or not. It seems like if you can get a culture of this established, it could actually be quite useful.
Posted by hal at 11/09/2010 07:41:00 AM | 13 comments
05 October 2010
My Giant Reviewing Error
I try to be a good reviewer, but like everything, reviewing is a learning process. About five years ago, I was reviewing a journal paper and made an error. I don't want to give up anonymity in this post, so I'm going to be vague in places that don't matter.
I was reviewing a paper, which I thought was overall pretty strong. I thought there was an interesting connection to some paper from Alice Smith (not the author's real name) in the past few years and mentioned this in my review. Not a connection that made the current paper irrelevant, but something the authors should probably talk about. In the revision response, the authors said that they had looked to try to find Smith's paper, but couldn't figure out which one I was talking about, and asked for a pointer. I spent the next five hours looking for the reference and couldn't find it myself. It turns out that actually I was thinking of a paper by Bob Jones, so I provided that citation. But the Jones paper wasn't even as relevant as it seemed at the time I wrote the review, so I apologized and told the authors they didn't really need to cover it that closely.
Now, you might be thinking to yourself: aha, now I know that Hal was the reviewer of my paper! I remember that happening to me!
But, sadly, this is not true. I get reviews like this all the time, and I feel it's one of the most irresponsible things reviewers can do. In fact, I don't think a single reviewing cycle has passed where I don't get a review like this. The problem with such reviews is that it enables a reviewer to make whatever claim they want, without any expectation that they have to back it up. And the claims are usually wrong. They're not necessarily being mean (I wasn't trying to be mean), but sometimes they are.
Here are some of the most ridiculous cases I've seen. I mention these just to show how often this problem occurs. These are all on papers of mine.
- One reviewer wrote "This idea is so obvious this must have been done before." This is probably the most humorous example I've seen, but the reviewer was clearly serious. And no, this was not in a review for one of the "frustratingly easy" papers.
- In an NSF grant review for an educational proposal, we were informed by 4 of 7 reviewers (who each wrote about a paragraph) that our ideas had been done in SIGCSE several times. Before submitting, we had skimmed/read the past 8 years of SIGCSE and could find nothing. (Maybe it's true and we just were looking in the wrong place, but that still isn't helpful.) It strongly seemed that this was basically their way of saying "you are not one of us."
- In a paper on technique X for task A, we were told hands down that it's well known that technique Y works better, with no citations. The paper was rejected, we went and implemented Y, and found that it worked worse on task A. We later found one paper saying that Y works better than X on task B, for B fairly different from A.
- In another paper, we were told that what we were doing had been done before and in this case a citation was provided. The citation was to one of our own papers, and it was quite different by any reasonable metric. At least a citation was provided, but it was clear that the reviewer hadn't bothered reading it.
- We were told that we missed an enormous amount of related work that could be found by a simple web search. I've written such things in reviews, often saying something like "search for 'non-parametric Bayesian'" or something like that. But here, no keywords were provided. It's entirely possible (especially when someone moves into a new domain) that you can miss a large body of related work because you don't know how to find it: that's fine -- just tell me how to find it if you don't want to actually provide citations.
I'm posting this not to gripe (though it's always fun to gripe about reviewing), but to try to draw attention to this problem. It's really just an issue of laziness. If I had bothered trying to look up a reference for Alice Smith's paper, I would have immediately realized I was wrong. But I was lazy. Luckily this didn't really adversely affect the outcome of the acceptance of this paper (journals are useful in that way -- authors can push back -- and yes, I know you can do this in author responses too, but you really need two rounds to make it work in this case).
I've really really tried ever since my experience above to not ever do this again. And I would encourage future reviewers to try to avoid the temptation to do this: you may find your memory isn't as good as you think. I would also encourage area chairs and co-reviewers to push their colleagues to actually provide citations for otherwise unsubstantiated claims.
Posted by hal at 10/05/2010 11:14:00 AM | 9 comments
24 September 2010
ACL / ICML Symposium?
ACL 2011 ends on June 24, in Portland (that's a Friday). ICML 2011 begins on June 28, near Seattle (the following Tuesday). This is pretty much as close to a co-location as we're probably going to get in a long time. A few folks have been discussing the possibility of having a joint NLP/ML symposium in between. The current thought is to have it on June 27 at the ICML venue (for various logistical reasons). There are buses and trains that run between the two cities, and we might even be able to charter some buses.
One worry is that it might only attract ICML folks due to the weekend between the end of ACL and the beginning of said symposium. As an NLPer/MLer, I believe in data. So please provide data by filling out the form below and, if you wish, adding comments.
If you wouldn't attend any, you don't need to fill out the poll :).
The last option is there if you want to tell me "I'm going to go to ACL, and I'd really like to go to the symposium, but the change in venue and the intervening weekend is too problematic to make it possible."
Posted by hal at 9/24/2010 07:25:00 PM | 1 comment
15 September 2010
Very sad news....
I heard earlier this morning that Fred Jelinek passed away last night. Apparently he had been working during the day: a tenacious aspect of Fred that probably has a lot to do with his many successes.
Fred is probably most infamous for the famous "Every time I fire a linguist the performance of the recognizer improves" quote, which Jurafsky+Martin's textbook says is actually supposed to be the more innocuous "Anytime a linguist leaves the group the recognition rate goes up." And in Fred's 2009 ACL Lifetime Achievement Award speech, he basically said that such a thing never happened. I doubt that will have any effect on how much the story is told.
Fred has had a remarkable influence on the field. So much so that I won't attempt to list anything here: you can find all about him all over the internet. Let me just say that the first time I met him, I was intimidated. Not only because he was Fred, but because I knew (and still know) next to nothing about speech, and the conversation inevitably turned to speech. Here's roughly how a segment of our conversation went:
Hal: What new projects are going on these days?
Fred: (Excitedly.) We have a really exciting new speech recognition problem. We're trying to map speech signals directly to fluent text.
Hal: (Really confused.) Isn't that the speech recognition problem?
Fred: (Playing the "teacher role" now.) Normally when you transcribe speech, you end up with a transcript that includes disfluencies like "uh" and "um" and also false starts [Ed note: like "I went... I went to the um store"].
Hal: So now you want to produce the actual fluent sentence, not the one that was spoken?
Fred: Right.
Apparently (who knew) in speech recognition you try to transcribe disfluencies and are penalized for missing them! We then talked for a while about how they were doing this, and other fun topics.
A few weeks later, I got a voicemail on my home message machine from Fred. That was probably one of the coolest things that have ever happened to me in life. I actually saved it (but subsequently lost it, which saddens me greatly). The content is irrelevant: the point is that Fred -- Fred! -- called me -- me! -- at home! Amazing.
I'm sure that there are lots of other folks who knew Fred better than me, and they can add their own stories in comments if they'd like. Fred was a great asset to the field, and I will certainly miss his physical presence in the future, though his work will doubtless continue to affect the field for years and decades to come.
Posted by hal at 9/15/2010 08:42:00 AM | 6 comments
Labels: community
27 August 2010
Calibrating Reviews and Ratings
NIPS decisions are going out soon, and then we're done with submitting and reviewing for a blessed few months. Except for journals, of course.
If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).
One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. Conferences try to adjust for this, for instance NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that. Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this. (For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. I have friends who never give a one -- except in the case of something just being wrong -- and often give 5s. Perhaps I should be nicer; I know CS tends to be harder on itself than other fields.) As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration.
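For concreteness, here is a minimal sketch of one way an area chair (or a script) might adjust for harsh versus lenient reviewers: per-reviewer z-scoring of raw scores. This is just an illustration of the calibration idea, not what any conference actually does, and the data format is made up:

```python
import numpy as np

def calibrate_scores(reviews):
    """reviews: list of (reviewer, paper, raw_score) triples."""
    # Collect each reviewer's raw scores.
    by_reviewer = {}
    for reviewer, _, score in reviews:
        by_reviewer.setdefault(reviewer, []).append(score)
    # Per-reviewer mean and std (epsilon avoids divide-by-zero).
    stats = {r: (np.mean(s), np.std(s) + 1e-6) for r, s in by_reviewer.items()}
    # Replace each raw score with that reviewer's z-score, then average per paper.
    per_paper = {}
    for reviewer, paper, score in reviews:
        mu, sigma = stats[reviewer]
        per_paper.setdefault(paper, []).append((score - mu) / sigma)
    return {paper: float(np.mean(z)) for paper, z in per_paper.items()}
```

With few reviews per reviewer the per-reviewer statistics are noisy, which is exactly the argument above for fewer reviewers doing more reviews each.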
There's also the issue of areas. Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?
A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och. The example he gives is very nice. If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars, and a 79% recommendation. On the other hand, if you look up In'N'Out Burger, an LA-based burger chain (which, having grown up in LA, was admittedly one of my favorite places to eat in high school, back when I ate stuff like that) you see another 4.0/5.0 stars and a 95% recommendation.
So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. And we expect this to work?!
Probably the main issue here is calibrating for expectations. As a teacher, I've figured out quickly that managing student expectations is a big part of getting good teaching reviews. If you go to In'N'Out, and have expectations for a Big Mac, you'll be pleasantly surprised. If you go to The French Laundry with expectations of having a meal worth selling your soul, your children's souls, etc., for, then you'll probably be disappointed (though I can't really say: I've never been).
One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something). You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price. For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is.
(Anticipating comments, I don't think this is just an "aspect" issue. I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)
I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.) there simply isn't a total order on the "quality" of the objects you're looking at. (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) There is maybe a situation-dependent order, and the distance to hotel, or "$" rating, or area classes are heuristics for describing this "situation." But without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.
Posted by hal at 8/27/2010 11:14:00 AM | 10 comments
21 August 2010
Readers kill blogs?
I try to avoid making meta-posts, but the timing here was just too impeccable for me to avoid a short post on something that's been bothering me for a year or so.
- On the one hand, yesterday, Aleks stated that the main reason he blogs is to see comments. (Similarly, Lance also thinks comments are a very important part of having an "open" blog.)
- On the other hand, people are more and more moving to systems like Google Reader, as re-blogged by Fernando, also yesterday.
Hopefully the architects behind readers will pick up on this and make these things (adding and viewing comments, within the reader -- yes, I realize that it's then not such a "reader") easier. That is, unless they want to lose out to tweets!
Until then, I'd like to encourage people to continue commenting here.
Posted by hal at 8/21/2010 12:42:00 PM | 14 comments
Labels: community
29 April 2010
Graduating? Want a post-doc? Let NSF pay!
I get many emails of the form "I'm looking for a postdoc...." I'm sure that other, more senior, more famous people get lots of these. My internal reaction is always "Great: I wish I could afford that!" NSF has a solution: let them pay for it! This is the second year of the CI fellows program, and I know of two people who did it last year (one in NLP, one in theory). I think it's a great program, both for faculty and for graduates (especially since the job market is so sucky this year).
If you're graduating, you should apply (unless you have other, better, job options already). But beware, the deadline is May 23!!! Here's more info directly from NSF's mouth:
The CIFellows Project is an opportunity for recent Ph.D. graduates in computer science and closely related fields to obtain one- to two-year postdoctoral positions at universities, industrial research laboratories, and other organizations that advance the field of computing and its positive impact on society. The goals of the CIFellows project are to retain new Ph.D.s in research and teaching and to support intellectual renewal and diversity in the computing fields at U.S. organizations.
.....
Every CIFellow application must identify 1-3 host mentors. Click here to visit a website where prospective mentors have posted their information. In addition, openings that have been posted over the past year (and may be a source of viable mentors/host organizations for the CIFellowships) are available here.
Good luck!
Posted by hal at 4/29/2010 09:43:00 AM | 2 comments
07 February 2010
Blog heads east
I started this blog ages ago while still in grad school in California at USC/ISI. It came with me three point five years ago when I started as an Assistant Professor at the University of Utah. Starting some time this coming summer, I will take it even further east: to CS at the University of Maryland where I have just accepted a faculty offer.
These past (almost) four years at Utah have been fantastic for me, which has made this decision to move very difficult. I feel very lucky to have been able to come here. I've had enormous freedom to work in directions that interest me, great teaching opportunities (which have taught me a lot), and great colleagues. Although I know that moving doesn't mean forgetting one's friends, it does mean that I won't run in to them in hallways, or grab lunch, or afternoon coffee or whatnot anymore. Ellen, Suresh, Erik, John, Tom, Tolga, Ross, and everyone else here have made my time wonderful. I will miss all of them.
The University here has been incredibly supportive in every way, and I've thoroughly enjoyed my time here. Plus, having world-class skiing a half hour drive from my house isn't too shabby either. (Though my # of days skiing per year declined geometrically since I started: 30-something the first year, then 18, then 10... so far only a handful this year. Sigh.) Looking back, my time here has been great and I'm glad I had the opportunity to come.
That said, I'm of course looking forward to moving to Maryland also, otherwise I would not have done it! There are a number of great people there in natural language processing, machine learning and related fields. I'd like to think that UMD should be and is one of the go-to places for these topics, and am excited to be a part of it. Between Bonnie, Philip, Lise, Mary, Doug, Judith, Jimmy and the other folks in CLIP and related groups, I think it will be a fantastic place for me to be, and a fantastic place for all those PhD-hungry students out there to go! Plus, having all the great folks at JHU's CLSP a 45 minute drive away will be quite convenient.
A part of me is sad to be leaving, but another part of me is excited at new opportunities. The move will take place some time over the summer (carefully avoiding conferences), so if I blog less then, you'll know why. Thanks again to everyone who has made my life here fantastic.
Posted by hal at 2/07/2010 10:29:00 AM | 19 comments
26 January 2010
A machine learner's apology
Andrew Gelman recently announced an upcoming talk by John Lafferty. This reminded me of a post I've been meaning to write for ages (years, really) but haven't gotten around to. Well, now I'm getting around to it.
A colleague from Utah (not in ML) went on a trip and spent some time talking to a computational statistician, who will remain anonymous. But let's call this person Alice. The two were talking about various topics and at one point machine learning came up. Alice commented:
"Machine learning is just non-rigorous computational statistics."Or something to that effect.
A first reaction is to get defensive: no it's not! But Alice has a point. Some subset of machine learning, in particular the more Bayesian side, tends to overlap quite a bit with compstats, so much so that in some cases they're probably not really that differentiable. (That is to say, there's a subset of ML that's very very similar to a subset of compstats... you could probably fairly easily find some antipoles that are amazingly different.)
And there's a clear intended interpretation to the comment: it's not that we're not rigorous, it's that we're not rigorous in the way that computational statisticians are. To that end, let me offer a glib retort:
Computational statistics is just machine learning where you don't care about the computation.
In much the same way that I think Alice's claim is true, I think this claim is also true. The part of machine learning that's really strong on the CS side cares a lot about computation: how long, how much space, how many samples, etc., it will take to learn something. This is something that I've rarely seen in compstats, where the big questions really have to do with things like: is this distribution well defined, can I sample from it, etc., now let me run Metropolis-Hastings. (Okay, I'm still being glib.)
I saw a discussion on a theory blog recently that STOC/FOCS is about "THEORY of algorithms" while SODA is about "theory of ALGORITHMS" or something like that. (Given the capitalization, perhaps it was Bill Gasarch :)?) Likewise, I think it's fair to say that classic ML is "MACHINE learning" or "COMPUTATIONAL statistics" and classic compstats is "machine LEARNING" or "computational STATISTICS." We're really working on very similar problems, but the things that we value tend to be different.
Due to that, I've always found it odd that there's not more interaction between compstats and ML. Sure, there's some... but not very much. Maybe it's cultural, maybe it's institutional (conferences versus journals), maybe we really know everything we need to know about the other side and talking wouldn't really get us anywhere. But if it's just a sense of "I don't like you because you're treading on my field," then that's not productive (either direction), even if it is an initial gut reaction.
Posted by hal at 1/26/2010 08:00:00 AM | 13 comments
Labels: community
04 January 2010
ArXiV and NLP, ML and Computer Science
Arxiv is something of an underutilized resource in computer science. Indeed, many computer scientists seem not to even know it exists, despite it having been around for two decades now! On the other hand, it is immensely popular among (some branches of) mathematics and physics. This used to strike me as odd: arxiv is a computer service; why haven't computer scientists jumped on it? Indeed, I spent a solid day a few months ago putting all my (well almost all my) papers on arxiv. One can always point to "culture" for such things, but I suspect there are more rational reasons why it hasn't affected us as much as it has others.
I ran in to arxiv first when I was in math land. The following is a cartoon view of how (some branches of) math research gets published:
- Authors write a paper
- Authors submit paper to a journal
- Authors simultaneously post paper on arxiv
- Journal publishes (or doesn't publish) paper
And here is the corresponding cartoon view for computer science:
- Conference announces deadline
- One day before deadline, authors write a paper
- Conference publishes (or rejects) paper
So then why don't we do arxiv too? I think there are two reasons. First, we think that conference turn around is good enough -- we don't need anything faster. Second, it completely screws up our notions of blind review. If everyone simultaneously posted a paper on arxiv when submitting to a conference, we could no longer claim, at all, to be blind. (Please, I beg of you, do not start commenting about blind review versus non-blind review -- I hate this topic of conversation and it never goes anywhere!) Basically, we rely on our conferences to do both advertising and stamp of approval. Of course, the speed of conferences is mitigated by the fact that you sometimes have to go through two or three before your paper gets in, which can make it as slow, or slower than, journals.
In a sense, I think that largely because of the blind thing, and partially because conferences tend to be faster than journals, the classic usage of arxiv is not really going to happen in CS.
(There's one other potential use for arxiv, which I'll refer to as the tech-report effect. I've many times seen short papers posted on people's web pages either as tech-reports or as unpublished documents. I don't mean tutorial like things, like I have, but rather real semi-research papers. These are papers that contain a nugget of an idea, but for which the authors seem unwilling to go all the way to "make it work." One could imagine posting such things on arxiv. Unfortunately, I really dislike such papers. It's very much a "flag planting" move in my opinion, and it makes life difficult for people who follow. That is, if I have an idea that's in someone else's semi-research paper, do I need to cite them? Ideas are a dime a dozen: making it work is often the hard part. I don't think you should get to flag plant without going through the effort of making it work. But that's just me.)
However, there is one prospect that arxiv could serve that I think would be quite valuable: literally, as an archive. Right now, ACL has the ACL anthology. UAI has its own repository. ICML has a rather sad state of affairs where, from what I can tell, papers from ICML #### are just on the ICML #### web page and if that happens to go down, oh well. All of these things could equally well be hosted on arxiv, which has strong government support to be sustained, is open access, blah blah blah.
This brings me to a question for you all: how would you feel if all (or nearly all) ICML papers were to be published on arxiv? That is, if your paper is accepted, instead of uploading a camera-ready PDF to the ICML conference manager website, you instead uploaded to arxiv and then sent your arxiv DOI link to the ICML folks?
Obviously there are some constraints, so there would need to be an opt-out policy, but I'm curious how everyone feels about this....
Posted by hal at 1/04/2010 10:36:00 AM | 22 comments
Labels: community, conferences, poll
30 December 2009
Some random NIPS thoughts...
I missed the first two days of NIPS due to teaching. Which is sad -- I heard there were great things on the first day. I did end up seeing a lot that was nice. But since I missed stuff, I'll instead post some paper suggestions from one of my students, Piyush Rai, who was there. You can tell his biases from his selections, but that's life :). More of my thoughts after his notes...
Says Piyush:
There was an interesting tutorial by Gunnar Martinsson on using randomization to speed-up matrix factorization (SVD, PCA etc) of really really large matrices (by "large", I mean something like 10^6 x 10^6). People typically use Krylov subspace methods (e.g., the Lanczos algo) but these require multiple passes over the data. It turns out that with the randomized approach, you can do it in a single pass or a small number of passes (so it can be useful in a streaming setting). The idea is quite simple. Let's assume you want the top K evals/evecs of a large matrix A. The randomized method draws K *random* vectors from a Gaussian and uses them in some way (details here) to get a "smaller version" B of A on which doing SVD can be very cheap. Having got the evals/evecs of B, a simple transformation will give you the same for the original matrix A.
The rest of this post (after Piyush's notes) is random thoughts that occurred to me at NIPS. Maybe some of them will get other people's wheels turning? This was originally an email I sent to my students, but I figured I might as well post it for the world. But forgive the lack of capitalization :):
The success of many matrix factorization methods (e.g., the Lanczos) also depends on how quickly the spectrum decays (eigenvalues) and they also suggest ways of dealing with cases where the spectrum doesn't quite decay that rapidly.
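A rough numpy sketch of the randomized recipe described above (this is the simple two-pass variant; the single-pass, streaming version needs a bit more bookkeeping, and the function and parameter names here are made up):

```python
import numpy as np

def randomized_topk_svd(A, k, oversample=10, seed=0):
    """Approximate top-k SVD of A via a Gaussian random projection."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Draw k (plus a few extra) random Gaussian test vectors.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega                       # pass 1 over A: sample its range
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for that range
    B = Q.T @ A                         # pass 2: the "smaller version" of A
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)  # cheap SVD of B
    U = Q @ U_b                         # simple transformation back to A's space
    return U[:, :k], s[:k], Vt[:k, :]
```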
Some papers from the main conference that I found interesting:
Distribution Matching for Transduction (Alex Smola and 2 other guys): They use maximum mean discrepancy (MMD) to do predictions in a transduction setting (i.e., when you also have the test data at training time). The idea is to use the fact that we expect the output functions f(X) and f(X') to be the same or close to each other (X are training and X' are test inputs). So instead of using the standard regularized objective used in the inductive setting, they use the distribution discrepancy (measured by say D) of f(X) and f(X') as a regularizer. D actually decomposes over pairs of training and test examples so one can use a stochastic approximation of D (D_i for the i-th pair of training and test inputs) and do something like an SGD.
Semi-supervised Learning using Sparse Eigenfunction Bases (Sinha and Belkin from Ohio): This paper uses the cluster assumption of semi-supervised learning. They use unlabeled data to construct a set of basis functions and then use labeled data in the LASSO framework to select a sparse combination of basis functions to learn the final classifier.
Streaming k-means approximation (Nir Ailon et al.): This paper does an online optimization of the k-means objective function. The algo is based on the previously proposed kmeans++ algorithm.
The Wisdom of Crowds in the Recollection of Order Information. It's about aggregating rank information from various individuals to reconstruct the global ordering.
Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora (by some folks at gatech): The problem setting is interesting here. Here the "multi-instance" is a bit of a misnomer. It means that each example in turn can consist of several sub-examples (which they call instances). E.g., a document consists of several paragraphs, or a webpage consists of text, images, videos.
Construction of Nonparametric Bayesian Models from Parametric Bayes Equations (Peter Orbanz): If you care about Bayesian nonparametrics. :) It basically builds on the Kolmogorov consistency theorem to formalize and sort of gives a recipe for the construction of nonparametric Bayesian models from their parametric counterparts. Seemed to be a good step in the right direction.
Indian Buffet Processes with Power-law Behavior (YWT and Dilan Gorur): This paper actually does the exact opposite of what I had thought of doing for IBP. The IBP (akin to the Dirichlet process) encourages the "rich-gets-richer" phenomenon in the sense that a dish that has already been selected by a lot of customers is highly likely to be selected by future customers as well. This leads the expected number of dishes (and thus the latent features) to be something like O(alpha * log n). This paper tries to be even more aggressive and makes the relationship have a power-law behavior. What I wanted to do was a reverse behavior -- maybe more like a "socialist IBP" :) where the customers in IBP are sort of evenly distributed across the dishes.
persi diaconis' invited talk about reinforcing random walks... that is, you take a random walk, but every time you cross an edge, you increase the probability that you re-cross that edge (see coppersmith + diaconis, rolles + diaconis).... this relates to a post i had a while ago: nlpers.blogspot.com/2007/04/multinomial-on-graph.html ... i'm thinking that you could set up a reinforcing random walk on a graph to achieve this. the key problem is how to compute things -- basically what you want is to know for two nodes i,j in a graph and some n >= 0, whether there exists a walk from i to j that takes exactly n steps. seems like you could craft a clever data structure to answer this question, then set up a graph multinomial based on this, with reinforcement (the reinforcement basically looks like the additive counts you get from normal multinomials)... if you force n=1 and have a fully connected graph, you should recover a multinomial/dirichlet pair.
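a minimal sketch of just that sub-question (does a walk of exactly n steps from i to j exist?), ignoring the reinforcement part entirely -- boolean powers of the adjacency matrix answer it, and repeated squaring keeps it to O(log n) matrix products:

```python
import numpy as np

def walk_of_length_exists(adj, i, j, n):
    """True iff there is a walk of exactly n steps from node i to node j."""
    A = (np.asarray(adj) != 0).astype(np.int64)
    reach = np.eye(A.shape[0], dtype=np.int64)  # identity = length-0 walks
    power = A
    while n > 0:
        if n & 1:
            reach = (reach @ power > 0).astype(np.int64)
        power = (power @ power > 0).astype(np.int64)
        n >>= 1
    return bool(reach[i, j])

# two-node path graph 0-1: a 3-step walk from 0 to 1 exists (0-1-0-1)
print(walk_of_length_exists([[0, 1], [1, 0]], 0, 1, 3))  # True
```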
also from persi's talk, persi and some guy sergei (sergey?) have a paper on variable length markov chains that might be interesting to look at, perhaps related to frank wood's sequence memoizer paper from icml last year.
finally, also from persi's talk, steve mc_something from ohio has a paper on using common gamma distributions in different rows to set dependencies among markov chains... this is related to something i was thinking about a while ago where you want to set up transition matrices with stick-breaking processes, and to have a common, global, set of sticks that you draw from... looks like this steve mc_something guy has already done this (or something like it).
not sure what made me think of this, but related to a talk we had here a few weeks ago about unit tests in scheme, where they basically randomly sample programs to "hope" to find bugs... what about setting this up as an RL problem where your reward is high if you're able to find a bug with a "simple" program... something like 0 if you don't find a bug, or 1/|P| if you find a bug with program P. (i think this came up when i was talking to percy -- liang, the other one -- about some semantics stuff he's been looking at.) afaik, no one in PL land has tried ANYTHING remotely like this... it's a little tricky because of the infinite but discrete state space (of programs), but something like an NN-backed Q-learning might do something reasonable :P.
i also saw a very cool "survey of vision" talk by bill freeman... one of the big problems they talked about was that no one has a good p(image) prior model. the example given was that you usually have de-noising models like p(image)*p(noisy image|image) and you can weight p(image) by ^alpha... as alpha goes to zero, you should just get a copy of your noisy image... as alpha goes to infinity, you should end up getting a good image, maybe not the one you *want*, but an image nonetheless. this doesn't happen.
one way you can see that this doesn't happen is in the following task. take two images and overlay them. now try to separate the two. you *clearly* need a good prior p(image) to do this, since you've lost half your information.
i was thinking about what this would look like in language land. one option would be to take two sentences and randomly interleave their words, and try to separate them out. i actually think that we could solve this task pretty well. you could probably formulate it as a FST problem, backed by a big n-gram language model. alternatively, you could take two DOCUMENTS and randomly interleave their sentences, and try to separate them out. i think we would fail MISERABLY on this task, since it requires actually knowing what discourse structure looks like. a sentence n-gram model wouldn't work, i don't think. (although maybe it would? who knows.) anyway, i thought it was an interesting thought experiment. i'm trying to think if this is actually a real world problem... it reminds me a bit of a paper a year or so ago where they try to do something similar on IRC logs, where you try to track who is speaking when... you could also do something similar on movie transcripts.
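here's a toy sketch of the sentence version of that thought experiment -- explicit beam search over assignments of each word to one of two streams, scored by a bigram language model (the bigram_logprob(prev, word) interface is just assumed here; a real system would presumably be an FST with a much bigger LM):

```python
def separate_interleaved(words, bigram_logprob, beam=50):
    """Split one interleaved word sequence into two (hopefully fluent) streams."""
    # each hypothesis: (score, stream_a, stream_b); both streams start with <s>
    hyps = [(0.0, ("<s>",), ("<s>",))]
    for w in words:
        new_hyps = []
        for score, a, b in hyps:
            # assign w to stream A ...
            new_hyps.append((score + bigram_logprob(a[-1], w), a + (w,), b))
            # ... or to stream B
            new_hyps.append((score + bigram_logprob(b[-1], w), a, b + (w,)))
        # keep only the best few hypotheses
        new_hyps.sort(key=lambda h: h[0], reverse=True)
        hyps = new_hyps[:beam]
    _, a, b = hyps[0]
    return list(a[1:]), list(b[1:])
```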
hierarchical topic models with latent hierarchies drawn from the coalescent, kind of like hdp, but not quite. (yeah yeah i know i'm like a parrot with the coalescent, but it's pretty freaking awesome :P.)
That's it! Hope you all had a great holiday season, and enjoy your New Years (I know I'm going skiing. A lot. So there, Fernando! :)).
Posted by hal at 12/30/2009 04:18:00 PM | 18 comments
Labels: community, conferences, machine learning
07 November 2009
NLP as a study of representations
Ellen Riloff and I run an NLP reading group pretty much every semester. Last semester we covered "old school NLP." We independently came up with lists of what we consider some of the most important ideas (idea = paper) from pre-1990 (most are much earlier) and let students select which to present. There was a lot of overlap between Ellen's list and mine (not surprisingly). If people are interested, I can provide the whole list (just post a comment and I'll dig it up); update: the whole list of topics is now posted as a comment. The topics that were actually selected are here.
I hope the students have found this exercise useful. It gets you thinking about language in a way that papers from the 2000s typically do not. It brings up a bunch of issues that we no longer think about frequently. Like language. (Joking.) (Sort of.)
One thing that's really stuck out for me is how much "old school" NLP comes across essentially as a study of representations. Perhaps this is a result of the fact that AI -- as a field -- was (and, to some degree, still is) enamored with knowledge representation problems. To be more concrete, let's look at a few examples. It's already been a while since I read these last (I had meant to write this post during the spring when things were fresh in my head), so please forgive me if I goof a few things up.
I'll start with one I know well: Mann and Thompson's rhetorical structure theory paper from 1988. This is basically "the" RST paper. I think that when many people think of RST, they think of it as a list of ways that sentences can be organized into hierarchies. E.g., this sentence provides background for that one, and together they argue in favor of yet a third. But this isn't really where RST begins. It begins by trying to understand the communicative role of text structure. That is, when I write, I am trying to communicate something. Everything that I write (if I'm writing "well") is toward that end. For instance, in this post, I'm trying to communicate that old school NLP views representation as the heart of the issue. This current paragraph is supporting that claim by providing a concrete example, which I am using to try to convince you of my claim.
As a more detailed example, take the "Evidence" relation from RST. M+T have the following characterization of "Evidence." Herein, "N" is the nucleus of the relation, "S" is the satellite (think of these as sentences), "R" is the reader and "W" is the writer:
relation name: Evidence
constraints on N: R might not believe N to a degree satisfactory to W
constraints on S: R believes S or will find it credible
constraints on N+S: R's comprehending S increases R's belief of N
the effect: R's belief of N is increased
locus of effect: N
This is a totally different way of thinking about things than I think we see nowadays. I kind of liken it to how I tell students not to program. If you're implementing something moderately complex (say, forward/backward algorithm), first write down all the math, then start implementing. Don't start implementing first. I think nowadays (and sure, I'm guilty!) we see a lot of implementing without the math. Or rather, with plenty of math, but without a representational model of what it is that we're studying.
The central claim of the RST paper is that one can think of texts as being organized into elementary discourse units, and these are connected into a tree structure by relations like the one above. (Or at least this is my reading of it.) That is, they have laid out a representation of text and claimed that this is how texts get put together.
As a second example (this will be shorter), take Wendy Lehnert's 1982 paper, "Plot units and narrative summarization." Here, the story is about how stories get put together. The most interesting thing about the plot units model to me is that it breaks from how one might naturally think about stories. That is, I would naively think of a story as a series of events. The claim that Lehnert makes is that this is not the right way to think about it. Rather, we should think about stories as sequences of affect states. Effectively, an affect state is how a character is feeling at any time. (This isn't quite right, but it's close enough.) For example, Lehnert presents the following story:
When John tried to start his car this morning, it wouldn't turn over. He asked his neighbor Paul for help. Paul did something to the carburetor and got it going. John thanked Paul and drove to work.
The representation put forward for this story is something like: (1) negative-for-John (the car won't start), which leads to (2) motivation-for-John (to get it started), which leads to (3) positive-for-John (it's started), which then links back and resolves (1). You can also analyze the story from Paul's perspective, and then add links that go between the two characters showing how things interact. The rest of the paper describes how these relations work, and how they can be put together into more complex event sequences (such as "promised request bungled"). Again, a high level representation of how stories work from the perspective of the characters.
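Just to make the flavor of the representation concrete, here is a toy encoding of that analysis as a data structure (the class and relation names are mine, not Lehnert's notation):

```python
from dataclasses import dataclass, field

@dataclass
class AffectState:
    character: str
    polarity: str                 # "+" event, "-" event, or "M" mental state
    description: str
    links: list = field(default_factory=list)   # (relation, other AffectState)

car_wont_start = AffectState("John", "-", "car won't start")
wants_it_started = AffectState("John", "M", "wants the car started")
car_running = AffectState("John", "+", "car is running")

car_wont_start.links.append(("motivates", wants_it_started))
wants_it_started.links.append(("achieved by", car_running))
car_running.links.append(("resolves", car_wont_start))
```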
So now I, W, hope that you, R, have an increased belief in the title of the post.
Why do I think this is interesting? Because at this point, we know a lot about how to deal with structure in language. From a machine learning perspective, if you give me a structure and some data (and some features!), I will learn something. It can even be unsupervised if it makes you feel better. So in a sense, I think we're getting to a point where we can go back, look at some really hard problems, use the deep linguistic insights from two decades (or more) ago, and start taking a crack at things that are really deep. Of course, features are a big problem; as a very wise man once said to me: "Language is hard. The fact that statistical association mining at the word level made it appear easy for the past decade doesn't alter the basic truth. :-)." We've got many of the ingredients to start making progress, but it's not going to be easy!
Posted by hal at 11/07/2009 10:38:00 AM | 21 comments
Labels: community, discourse, linguistics, problems, structured prediction
09 September 2009
Where did you Apply to Grad School?
Ellen and I are interested (for obvious reasons) in how people choose what schools to apply to for grad school. Note that this is not the question of how you chose where to go. This is about what made the list of where you actually applied. We'd really appreciate if you'd fill out our 10-15 minute survey and pass it along to your friends (and enemies). If you're willing, please go here.
Posted by hal at 9/09/2009 03:15:00 PM | 15 comments
25 June 2009
Should there be a shared task for semi-supervised learning in NLP?
(Guest post by Kevin Duh. Thanks, Kevin!)
At the NAACL SSL-NLP Workshop recently, we discussed whether there ought to be a "shared task" for semi-supervised learning in NLP. The panel discussion consisted of Hal Daume, David McClosky, and Andrew Goldberg as panelists, with audience input from Jason Eisner, Tom Mitchell, and many others. Here we will briefly summarize the points raised and hopefully solicit some feedback from blog readers.
Three motivations for a shared task
A1. Fair comparison of methods: A common dataset will allow us to compare different methods in an insightful way. Currently, different research papers use different datasets or data-splits, making it difficult to draw general intuitions from the combined body of research.
A2. Establish a methodology for evaluating SSLNLP results: How exactly should a semi-supervised learning method be evaluated? Should we evaluate the same method for both low-resource scenarios (few labeled points, many unlabeled points) and high-resource scenarios (many labeled points, even more unlabeled points)? Should we evaluate the same method under different ratios of labeled/unlabeled data? Currently there is no standard methodology for evaluating SSLNLP results, which means that the completeness/quality of experimental sections in research papers varies considerably.
A3. Encourage more research in the area: A shared task can potentially lower the barrier of entry to SSLNLP, especially if it involves pre-processed data and a community support network. This will make it easier for researchers in outside fields, or researchers with smaller budgets, to contribute their expertise to the field. Furthermore, a shared task can potentially direct the community research efforts in a particular subfield. For example, "online/lifelong learning for SSL" and "SSL as joint inference of multiple tasks and heterogeneous labels" (a la Jason Eisner's keynote) were identified as new, promising areas to focus on in the panel discussion. A shared task along those lines may help us rally the community behind these efforts.
Arguments against the above points
B1. Fair comparison: Nobody really argues against fair comparison of methods. The bigger question, however, is whether there exists a *common* dataset or task that everyone is interested in. At the SSLNLP Workshop, for example, we had papers in a wide range of areas ranging from information extraction to parsing to text classification to speech. We also had papers where the need for unlabeled data is intimately tied in to particular components of a larger system. So, a common dataset is good, but what dataset can we all agree upon?
B2. Evaluation methodology: A consistent standard for evaluating SSLNLP results is nice to have, but this can be done independently from a shared task through, e.g. an influential paper or gradual recognition of its importance by reviewers. Further, one may argue: what makes you think that your particular evaluation methodology is the best? What makes you think people will adopt it generally, both inside and outside of the shared task?
B3. Encourage more research: It is nice to lower the barriers to entry, especially if we have pre-processed data and scripts. However, it has been observed in other shared tasks that often it is the pre-processing and features that matter most (more than the actual training algorithm). This presents a dilemma: If the shared task pre-processes the data to make it easy for anyone to join, will we lose the insights that may be gained via domain knowledge? On the other hand, if we present the data in raw form, will this actually encourage outside researchers to join the field?
Rejoinder
A straw poll at the panel discussion showed that people are generally in favor of looking into the idea of a shared task. The important question is how to make it work, and especially how to address counterpoints B1 (what task to choose) and B3 (how to prepare the data). We did not have enough time during the panel discussion to go through the details, but here are some suggestions:
- We can view NLP problems as several big "classes" of problems: sequence prediction, tree prediction, multi-class classification, etc. In choosing a task, we can pick a representative task in each class, such as named-entity recognition for sequence prediction, dependency parsing for tree prediction, etc. This common dataset won't attract everyone in NLP, but at least it will be relevant for a large subset of researchers.
- If participants are allowed to pre-process their own data, the evaluation might require participants to submit a supervised system along with their semi-supervised system, using the same feature set and setup, if possible. This may make it easier to learn from results if there are differences in pre-processing.
- There should also be a standard supervised and semi-supervised baseline (software) provided by the shared task organizer. This may lower the barrier of entry for new participants, as well as establish a common baseline result.
Posted by hal at 6/25/2009 02:40:00 PM | 27 comments
Labels: community, conferences, machine learning
29 May 2009
How to reduce reviewing overhead?
It's past most reviewing time (for the year), so somehow conversations I have with folks I visit tend to gravitate toward the awfulness of reviewing. That is, there is too much to review and too much garbage among it (of course, garbage can be slightly subjective). Reviewing plays a very important role but is a very fallible system, as everyone knows, both in terms of precision and recall. Sometimes there even seems to be evidence of abuse.
But this post isn't about good reviewing and bad reviewing. This is about whether it's possible to cut down on the sheer volume of reviewing. The key aspect of cutting down reviewing, to me, is that in order to be effective, the reduction has to be significant. I'll demonstrate by discussing a few ideas that have come up, and some notes about why I think they would or wouldn't work:
- Tiered reviewing (this was done at ICML this year). The model at ICML was that everyone was guaranteed two reviews, and only a third if your paper was "good enough." I applaud ICML for trying this, but as a reviewer I found it useless. First, it means that at most 1/3 of reviews are getting cut (assuming all papers are bad), but in practice it's probably more like 1/6 that get reduced. This means that if on average a reviewer would have gotten six papers to review, he will now get five, which is a very small decrease. Second, it comes with an additional swapping overhead: effectively I now have to review for ICML twice, which makes scheduling a much bigger pain. It's also harder for me to be self-consistent in my evaluations.
- Reject without review (this was suggested to me at dinner last night: if you'd like to de-anonymize yourself, feel free in comments). Give area chairs the power that editors of journals have to reject papers out of hand. This gives area chairs much more power (I actually think this is a good thing: area chairs are too often too lazy in my experience, but that's another post), so perhaps there would be a cap on the number of reject without reviews. If this number is less than about 20%, then my reviewing load will drop in expectation from 5 to 4, which, again, is not a big deal for me.
- Cap on submissions (again, a suggestion from dinner last night): authors may only submit one paper to any conference on which their name comes first. (Yes, I know, this doesn't work in theory land where authorship is alphabetical, but I'm trying to address our issues, not someone else's.) I've only twice in my life had two papers at a conference where my name came first, and maybe there was a third where I submitted two and one was rejected (I really can't remember). At NAACL this year, there are four such papers; at ACL there are two. If you assume these are equally distributed (which is probably a bad assumption, since the people who submit multiple first author papers at a conference probably submit stronger papers), then this is about 16 submissions to NAACL and 8 submissions to ACL. This is maybe 1-4% of submitted papers: again, something that won't really affect me as a reviewer (this, even less than the above two).
- Strong encouragement of short papers (my idea, finally, but with tweaks from others): right now I think short papers are underutilized, perhaps partially because they're seen (rightly or wrongly) as less significant than "full" papers. I don't think this need be the case. Short papers definitely take less time to review. A great "short paper tweak" that was suggested to me is to allow only 3 pages of text, but essentially arbitrarily many pages of tables/figures (probably not arbitrarily, but at least a few... plus, maybe just make data online). This would encourage experimental evaluation papers to be submitted as shorts (currently these papers typically just get rejected as being longs because they don't introduce really new ideas, and rejected as shorts because it's hard to fit lots of experiments in four pages). Many long papers that appear in ACL could easily be short papers (I would guesstimate somewhere around 50%), especially ones that have the flavor of "I took method X and problem Y and solved Y with X (where both are known)" or "I took known system X, did new tweak Y and got better results." One way to start encouraging short papers is to just have an option that reviewers can say something like "I will accept this paper as a short paper but not a long paper -- please rewrite" and then just have it accepted (with some area chair supervision) without another round of reviewing. The understanding would have to be that it would be poor form as an author to pull your paper out just because it got accepted short rather than accepted long, and so authors might be encouraged just to submit short versions. (This is something that would take a few years to have an effect, since it would be partially social.)
- Multiple reviewer types (an idea that's been in the ether for a while). The idea would be that you have three reviewers for each paper, but each serves a specific role. For instance, one would exclusively check technical details. The other two could then ignore these. Or maybe one would be tasked with "does this problem/solution make sense." This would enable area chairs (yes, again, more work for area chairs) to assign reviewers to do things that they're good at. You'd still have to review as many papers, but you wouldn't have to do the same detailed level of review for all of them.
- Require non-student authors on papers to review 3 times as many papers as they submit to any given conference, no exceptions ("three" because that's how many reviews each submission will get). I know several faculty who follow the model of "if there is a deadline, we will submit." I don't know how widespread this is. The idea is that even half-baked submissions will garner reviews that can help direct the research. I try to avoid offending people here, but that's what colleagues are for: please stop wasting my time as a reviewer by submitting papers like this. If they get rejected, you've wasted my time; if they get accepted, it's embarrassing for you (unless you take the time by camera-ready to make them good, which happens only some of the time). Equating "last author" with "senior person," there are two people at NAACL who have three papers and nine who have two. This means that those two people (who in expectation submitted 12 papers each -- probably not true, probably more like 4 or 5) should have reviewed 12-15 papers, and the nine should probably have reviewed 9-12 papers (the quota arithmetic is sketched after this list). I doubt they did. (I hope these two people know that I'm not trying to say they're evil in any way :P.) At ACL, there are four people with three papers (one is a dupe with a three from NAACL -- you know who you are!) and eight people with two. This would have the added benefit of having lots of reviews done by senior people (i.e., no crummy grad student reviews), with the potential downside that these people would gain more control over the community (which could be good or bad -- it's not a priori obvious that being able to do work that leads to many publications is highly correlated with being able to identify good work done by others).
- Make the job of the reviewer clearer. Right now, most reviews I read are trying to do two jobs at once. On the one hand, the reviewer justifies his rating to the area chair. On the other, he provides (what he sees as) useful feedback to the authors. I know that in my own reviewing I have cut down on the latter, largely in reaction to the "submit anything and everything" model that some people have. I'll typically give (what I hope is) useful feedback on papers I rate highly, largely because I have questions whose answers I am curious about, but for lower-ranked papers (rated 1-3) I tend to say things like "You claim X but your experiments only demonstrate Y," rather than "[that] ... and in order to show Y you should do Z." Perhaps I would revert to my old ways if I had less to review, but this was a choice I made about a year ago.
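To make the counting in the cap-on-submissions and review-quota bullets concrete, here's a tiny back-of-envelope sketch. The 25% acceptance rate is an assumption on my part (roughly what the "4 accepted papers is about 16 submissions" estimate implies), and the function names are just for illustration:

```python
# Back-of-envelope helpers for the arithmetic in the two bullets above.
# The ~25% acceptance rate is an assumption (roughly what "4 accepted papers
# is about 16 submissions" implies), not an official number.

ASSUMED_ACCEPT_RATE = 0.25

def expected_submissions(accepted, accept_rate=ASSUMED_ACCEPT_RATE):
    """Roughly how many submissions do `accepted` accepted papers imply?"""
    return accepted / accept_rate

def reviews_owed(submissions, quota=3):
    """Reviews owed under a 'review 3x what you submit' rule."""
    return quota * submissions

print(expected_submissions(4))           # ~16, cf. the NAACL first-author count
print(reviews_owed(4), reviews_owed(5))  # 12 15, cf. the "12-15 reviews" estimate
```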
I actually think that, taken together, some of these ideas could have a significant impact. For instance, I would imagine that 2 and 4 together would probably cut a 5-6 paper reviewing load down to a 3-4 paper load, and doing 6 on top of this would probably take the average person's load down by maybe one more. Overall, perhaps a 50% reduction in the number of papers to review, unless you're one of the types who submits lots of papers. I'd personally like to see it done!
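If you want to play with the combined effect yourself, here is a minimal sketch of that back-of-envelope estimate; every rate in it is a made-up illustrative guess, not data:

```python
# Hypothetical back-of-envelope: how ideas 2 (reject without review),
# 4 (more short papers) and 6 (submit-to-review quota) might combine.
# Every rate below is a made-up guess, chosen only to illustrate the arithmetic.

baseline_load = 5.5           # papers per reviewer today (the 5-6 estimate above)

reject_without_review = 0.20  # idea 2: area chairs desk-reject up to ~20%
long_to_short_shift = 0.30    # idea 4: fraction of longs that become shorts
short_review_cost = 0.5       # assume a short takes half the effort of a long
quota_shrinkage = 0.15        # idea 6: fewer half-baked submissions overall

load = baseline_load * (1 - reject_without_review)                           # fewer papers reviewed
load *= (1 - long_to_short_shift) + long_to_short_shift * short_review_cost  # cheaper shorts
load *= 1 - quota_shrinkage                                                  # smaller submission pool

print(f"estimated effective load: {load:.1f} papers (vs. {baseline_load})")
# With these made-up numbers, the effective load comes out around 3.2 papers,
# i.e. roughly the ~50% reduction guessed at above.
```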
Posted by hal at 5/29/2009 07:33:00 PM | 26 comments
Labels: community, conferences
22 March 2009
Programming Language of Choice
Some of you know that I am (or at least used to be) a bit of a programming language snob. In fact, on several occasions I've met (in NLP or ML land) someone who recognizes my name from PL land and is surprised that I'm not actually a PL person. My favorite story: after teaching machine learning for the second time, I had Ken Shan, a friend from my PL days, visit. I announced his visit and got an email from a student who had taken ML from me saying:
I _knew_ your name was familiar! I learned a ton about Haskell from your tutorial, for what's worth.. Great read back in my freshman year in college. (Belatedly) Thanks for writing it!
And it's not like my name is particularly common!
At any rate (and, admittedly, this is a somewhat HBC-related question), I'm curious what programming language(s) other NLP folks tend to use. I've tried to include the subset of the programming language shootout list that I think is most likely to be used, but if you need to write in another choice, feel free to do so in a comment. You can select as many as you like, but please try to vote only for those that you actually use regularly, and that you actually use for large projects. E.g., I use Perl a lot, but only for O(100)-line programs... so I wouldn't select Perl.
Posted by hal at 3/22/2009 09:13:00 AM | 29 comments
04 February 2009
Beating the state of the art, big-O style
I used to know a fair amount about math; in fact, I even applied to a few grad schools to do logic (long story, now is not the time). I was never involved in actual math research (primarily because of the necessary ramp-up time -- thank goodness this doesn't exist as much in CS!), but I did get a sense from my unofficial undergrad advisor of how things worked. The reason I started thinking of this is that I recently made my way about halfway through quite a dense book on long games by a prof from grad school (I cross-enrolled at UCLA for a few semesters). The basic idea of a countable game is that there is a fixed subset A of [0,1] (a subset of the reals) and two players alternate play. On each move, a player plays an integer from 0 to 9. They play for a countably infinite number of moves, essentially writing down (in decimal form) a real number. If, at the "end," this number is in A, then player 1 wins; otherwise, player 2 wins. Both players know A. The set A is said to be determined if there is a strategy that will force a win; it is undetermined otherwise. A long game is the obvious generalization where you play for longer than countable time. The details don't matter.
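Restating that countable-game setup in symbols (nothing here beyond what's described above):

```latex
% The countable game above, restated in symbols.
Players 1 and 2 alternately choose digits
$a_1, a_2, a_3, \ldots \in \{0, 1, \ldots, 9\}$, which at the ``end''
determine the real number
\[
  x \;=\; \sum_{i \ge 1} a_i \, 10^{-i} \;\in\; [0,1] .
\]
Player 1 wins iff $x \in A$; otherwise player 2 wins. The set $A$ is
\emph{determined} if one of the two players has a strategy that forces a win,
and undetermined otherwise.
```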
This led me to think, as someone who's moved over to working a lot on machine learning: is there an analogous question for online learning? There are several ways to set this up and I won't bore you with the details (I doubt any reader here really cares), but depending on how you set it up, you can prove several relatively trivial, but kind of cute, results (I say trivial because they took me on the order of hours, which means that someone who knows what they're doing probably would see them immediately). I basically did this as a mental exercise, not for any real reason.
But it got me thinking: obviously machine learning people wouldn't care about this because it's too esoteric and not at all a realistic setting (even for COLTers!). I strongly doubt that logicians would care either, but for a totally different reason. From my interactions, they would be interested if and only if two things were satisfied: (a) the result showed some interesting connection between a new model and existing models; and (b) the proofs were non-trivial and required some new insight that could then be applied to other problems. Obviously this is not my goal in life, so I've dropped it.
This led me to introspect: what is it that we as a community need in order to find some result interesting? What about other fields that I claim to know a bit about?
Let's take algorithms for a minute. Everything here is about big-O. As with the math types, a result without an interesting proof is much less interesting than one with an interesting proof, though if you start reading CS theory blogs, you'll find that there's a bit of a divide in the community on whether this is good or not. But my sense (which could be totally broken) is that if you have a result with a relatively uninteresting proof that gets you the same big-O running time as the current state of the art, you're in trouble.
I think it's interesting to contrast this with what happens in both NLP and ML. Big-O works a bit differently here. My non-technical description of big-O to someone who knows nothing is that it measures "order of magnitude" improvements. (Okay, O(n log n) versus O(n log log n) is hard to call an order of magnitude, but you get the idea.) An equivalent on the experimental side would seem to be something like: you cut the remaining error on a problem by half or more. In other words, if the state of the art is 60% accuracy, then an order-of-magnitude improvement would be 80% accuracy or better. At 80% it would be 90%. At 90% it would be 95%, and so on. Going from 90.5% to 91.2% is not an order of magnitude.
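Written down as a rule of thumb (my own informal phrasing of the above, not a standard definition): an improvement counts as "order of magnitude" if it removes at least half of the remaining error. A tiny sketch:

```python
# The "order of magnitude" rule of thumb above, as a tiny predicate.
# This is just an informal criterion, not a standard definition.

def order_of_magnitude_improvement(old_acc, new_acc, eps=1e-9):
    """True if new_acc removes at least half of the error left by old_acc.

    `eps` only guards against floating-point wobble at exact boundaries.
    """
    old_err = 1.0 - old_acc
    new_err = 1.0 - new_acc
    return new_err <= old_err / 2.0 + eps

# Matches the examples in the text:
assert order_of_magnitude_improvement(0.60, 0.80)        # 60% -> 80%
assert order_of_magnitude_improvement(0.80, 0.90)        # 80% -> 90%
assert order_of_magnitude_improvement(0.90, 0.95)        # 90% -> 95%
assert not order_of_magnitude_improvement(0.905, 0.912)  # 90.5% -> 91.2%
```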
I actually like this model for looking at experimental results. Note that this has absolutely nothing to do with statistical significance. It's kind of like reading results graphs with a pair of thick glasses on (for those who don't wear glasses) or no glasses on (for those who wear thick glasses). I think the justification is that for less than an order-of-magnitude improvement, it's really just hard to say whether the improvement is due to better tweaking or engineering or dumb luck in how some feature was implemented or what. For an order-of-magnitude improvement, there almost has to be something interesting going on.
Now, I'm not proposing that a paper isn't publishable if it doesn't have an order-of-magnitude improvement -- very few papers would get published under that rule. I'm just suggesting that improving the state of the art not be -- by itself -- a reason for acceptance unless it's an order-of-magnitude improvement. That is, you'd better either have a cool idea, be solving a new problem, be analyzing the effect of some important aspect of a problem, etc., or be working on a well-trodden task and getting a big-O-style improvement.
What I'm saying isn't novel, of course... the various exec boards at the ACL conferences have been trying to find ways to get more "interesting" papers into the conferences for (at least) a few years. This is just a concrete proposal. Obviously it requires buy-in at least from area chairs and probably reviewers. And there are definitely issues with it. Like any attempt to make reviewing less subjective, there are obviously corner cases (e.g., you have a sort-of-interesting idea and an almost-order-of-magnitude improvement). You can't mechanize the reviewing process. But frankly, when I see paper reviews that gush over tiny improvements in the state of the art in an otherwise drab paper, I just get annoyed :).
Posted by hal at 2/04/2009 08:22:00 AM | 21 comments
Labels: acl, community, conferences