26 June 2007

ACL Business Meeting Results

This afternoon here in Prague was the ACL business meeting. A few interesting points were brought up. As we all know, ACL will be in Columbus, OH next year. It will actually be joint with HLT, which means that (as I previously expected) there won't be a separate HLT next year. Combined with the fact that when ACL is in North America there is no NAACL, it looks like there will be only one North American conference next year (unless EMNLP--which is now officially a conference--chooses not to co-locate with ACL/HLT). The paper submission deadline looks to be around 11 Jan; calls will be out in September. EACL 2008 will be in Greece.

The new information: ACL 2009 will be in Singapore, which was one of my two guesses (the other being Beijing). This should be a really nice location, though I'm saddened since I've already been there.

A few changes have been proposed for ACL 2008 in terms of reviewing. None will necessarily happen, but for what it's worth I've added my opinion here. If you have strong feelings, you should contact the board, or perhaps Kathy McKeown, who is the conference chair.
  • Conditional accepts and/or author feedback. I'd be in favor of doing one of these, but not both (redundancy). I'd prefer author feedback.

  • Increased poster presence with equal footing in the proceedings, à la NIPS. I would also be in favor of this, because we are already at four tracks and too much is going on. Alternatively, we could reduce the number of accepted papers, which I actually don't think would be terrible, but going to posters seems like a safer solution. The strongest argument against this is a personality one: ACLers tend to ignore poster sessions. Something would have to be done about this. Spotlights may help.

  • Wildcards from senior members. The idea would be that "senior" (however defined) members would be able to play a single wildcard to accept an otherwise controversial paper. I'm probably not in favor of this, partially because it seems to introduce annoying political issues ("What? I'm not senior enough for you?" -- not that I would say that, since I'm not, but...), and partially because this seems to be essentially already the job of area chairs. There may be a problem here, but it seems that there are better, more direct solutions.

  • Something having to do with extra reviewing of borderline papers. I didn't quite get what was meant here; the proposal didn't seem to be to have fewer than three reviews, but rather to ask for more in cases of confusion. I would actually argue for something more extreme: have a single reviewer (maybe two) do an initial round of rejects, and then get three reviews only for those papers that have any chance at all of being accepted. I doubt this idea will fly, but it would be interesting to check in previous years how many accepted papers had one reviewer give a really bad score, and how many had two reviewers give a really bad score. If those numbers are really, really low, then the scheme should be safe (a sketch of the check is below). Anyone have access to this data?
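
If anyone does have that data, the check itself is trivial; here's roughly what I have in mind, where the data format, score scale and "really bad" threshold are all made up for illustration:

```python
# Sketch of the sanity check suggested above: among accepted papers,
# how many had at least one (or two) reviewers give a very low score?
# The data format, the 1-5 score scale and the threshold are made up.

BAD = 2.0  # "really bad" score on a hypothetical 1-5 scale

def count_accepted_with_bad_scores(papers, min_bad_reviews):
    """papers: list of (accepted: bool, scores: list of float)."""
    return sum(1 for accepted, scores in papers
               if accepted and sum(s <= BAD for s in scores) >= min_bad_reviews)

papers = [(True, [4.5, 2.0, 4.0]), (True, [3.5, 3.0, 4.0]), (False, [2.0, 1.5, 3.0])]
print(count_accepted_with_bad_scores(papers, 1))  # 1
print(count_accepted_with_bad_scores(papers, 2))  # 0
```
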
Finally, we talked about the "grassroots" efforts. The proposals were: archiving videos, augmenting the anthology to include link structure, augmenting the anthology with tech reports and journals (given permission from publishers), and ours to make CL open access. As for ours, the only big complaints concerned typesetting information, but several people did voice support, both in the meeting and in person. I remain hopeful!

I'll post more about technical content after the conference.

17 June 2007

3 Small Newses

(Yeah, I know, "news" isn't a count noun.)
  1. WhatToSee has been updated with ACL and EMNLP 2007, so figure out what talks you want to go to!
  2. Yoav has set up a State Of The Art wiki page (see previous blog post on this topic)... please contribute!
  3. The proposal for making CL an open-access journal has been accepted, so we get our 5 minutes of fame -- come by to support (or not). The business meeting is scheduled for 1:30pm on 26 June.

11 June 2007

First-best, Balanced F and All That

Our M.O. in NLP land is to evaluate our systems in a first-best setting, typically against a balanced F measure (balanced F means that precision and recall are weighed equally). Occasionally we see precision/recall curves, but this is typically in straightforward classification tasks, not in more complex applications.

Why is this (potentially) bad? Well, it's typically because our evaluation criteria are uncalibrated against human use studies. In other words, picking on balanced F for a second, it may turn out that for some applications it's better to have higher precision, while for others it's better to have higher recall. Reporting a balanced F removes our ability to judge this. Sure, one can report precision, recall and F (and people often do), but this doesn't give us a good sense of the trade-off. For instance, if I report P=70, R=50, F=58, can I conclude that I could just as easily get P=50, R=70, F=58 or P=R=F=58 using the same system tweaked differently? Likely not. But this seems to be the sort of conclusion we like to draw, especially when we compare across systems by using balanced F as a summary.
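
To make that concrete, here's a tiny sketch (plain Python, with the precision/recall pairs simply made up) of how very different operating points collapse to the same balanced F, while an unbalanced F-beta pulls them apart:

```python
# Sketch: how different P/R operating points look under F_beta.
# The (P, R) pairs below are made up for illustration.

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta > 1 favors recall; beta < 1 favors precision."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

for p, r in [(0.70, 0.50), (0.50, 0.70), (0.58, 0.58)]:
    print(f"P={p:.2f} R={r:.2f}  "
          f"F1={f_beta(p, r):.3f}  "
          f"F0.5={f_beta(p, r, 0.5):.3f}  "
          f"F2={f_beta(p, r, 2.0):.3f}")
```

All three operating points give F1 of about 0.58, but F0.5 and F2 rank them very differently.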

The issue is that it's essentially impossible for any single metric to capture everything we need to know about the performance of a system. This holds even further up the line, in applications like MT. The sort of translations required to do cross-lingual IR, for instance, are of a different nature than those required to put a translation in front of a human. (I'm told that for cross-lingual IR, it's hard to beat just doing "query expansion" using model 1 translation tables.)
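
For the curious, a minimal sketch of what that kind of query expansion might look like; the translation table here is a toy, hand-made stand-in for a real trained model 1 table:

```python
# Sketch of query expansion with a model-1-style lexical translation
# table p(target_word | source_word).  The table is a toy stand-in
# for a real trained model 1 table.

toy_ttable = {
    "bank": {"banque": 0.6, "rive": 0.3, "banc": 0.1},
    "loan": {"pret": 0.8, "emprunt": 0.2},
}

def expand_query(source_words, ttable, top_k=2):
    """Replace each source query word by its top_k translations,
    keeping the translation probability as a term weight."""
    expanded = []
    for w in source_words:
        translations = sorted(ttable.get(w, {}).items(),
                              key=lambda kv: -kv[1])[:top_k]
        expanded.extend(translations)
    return expanded

print(expand_query(["bank", "loan"], toy_ttable))
# [('banque', 0.6), ('rive', 0.3), ('pret', 0.8), ('emprunt', 0.2)]
```
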

I don't think the solution is to proliferate error metrics, as has been seemingly popular recently. The problem is that once you start to apply 10 different metrics to a single problem (something I'm guilty of myself), you actually cease to be able to understand the results. It's reasonable for someone to develop a sufficiently deep intuition about a single metric, or two metrics, or maybe even three metrics, to be able to look at numbers and have an idea what they mean. I feel that this is pretty impossible with ten very diverse metrics. (And even if possible, it may just be a waste of time.)

One solution is to evaluate at different "cutoffs," à la precision/recall curves or ROC curves. The problem is that while this is easy for thresholded binary classifiers (just change the threshold), it is less clear for other classifiers, much less for complex applications. For instance, in my named entity tagger, I can trade off precision against recall by post-processing the weights and increasing the "bias" toward the "out of entity" tag. While this is an easy hack to accomplish, there's nothing to guarantee that it's actually doing the right thing. In other words, I might be able to do much better were I to directly optimize some sort of unbalanced F. For a brain teaser, how might one do this in Pharaoh? (Solutions welcome in comments!)
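
To be concrete about the hack: roughly, the post-processing just adds a constant to the score of the out-of-entity tag before taking the argmax. The tag names, scores and bias value below are invented, so treat this as a sketch rather than my actual code:

```python
# Sketch of the precision/recall hack described above: add a constant
# bias to the "O" (out-of-entity) tag's score before picking the argmax.
# Raising the bias predicts fewer entities (higher precision, lower
# recall); lowering it does the opposite.

def predict_tags(token_scores, o_bias=0.0):
    """token_scores: list of dicts mapping tag -> score, one per token."""
    tags = []
    for scores in token_scores:
        biased = dict(scores)
        biased["O"] = biased.get("O", 0.0) + o_bias
        tags.append(max(biased, key=biased.get))
    return tags

# Example: one borderline token.
scores = [{"O": 1.0, "B-PER": 1.2, "B-ORG": 0.4}]
print(predict_tags(scores, o_bias=0.0))   # ['B-PER']  (recall-leaning)
print(predict_tags(scores, o_bias=0.5))   # ['O']      (precision-leaning)
```
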

Another option is to force systems to produce more than a first-best output. In the limit, if you can get every possible output together with a probability, you can compute something like expected loss. This is good, but it limits you to probabilistic classifiers, which makes life really hard in structure land, where things quickly become #P-hard or worse to normalize. Alternatively, one could produce ranked lists (up to, say, 100-best) and then look at something like precision at 5, 10, 20, 40, etc., as they do in IR. But this presupposes that your algorithm can produce k-best lists. Moreover, it doesn't answer the question of how to optimize for producing k-best lists.
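
A sketch of those two options (expected loss and precision at k), where the outputs, gold set and loss function are all invented for illustration:

```python
# Two sketches of going beyond first-best output; outputs, gold labels
# and the loss function are invented for illustration.

def expected_loss(outputs_with_probs, loss_fn):
    """Expected loss under a (normalized) distribution over outputs --
    the normalization is exactly what's hard in structured problems."""
    return sum(p * loss_fn(y) for y, p in outputs_with_probs)

def precision_at_k(ranked, gold, k):
    """Fraction of the top-k entries of a k-best list that are correct."""
    return sum(1 for y in ranked[:k] if y in gold) / float(k)

ranked = ["out3", "out7", "out1", "out9", "out2"]
gold = {"out1", "out2"}
for k in (1, 2, 5):
    print(f"P@{k} = {precision_at_k(ranked, gold, k):.2f}")

dist = [("out1", 0.5), ("out3", 0.3), ("out7", 0.2)]
print(expected_loss(dist, lambda y: 0.0 if y in gold else 1.0))  # 0.5
```
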

I don't think there's a one-size-fits-all answer. Depending on your application and your system, some of the above options may work. Some may not. I think the important thing to keep in mind is that it's entirely possible (and likely) that different approaches will be better at different points of trade-off.

05 June 2007

Tracking the State of the Art

I just received the following email from Yoav Goldberg:
I believe a resource well needed in the ACL community is a "state-of-the-art-repository", that is a public location in which one can find information about the current state-of-the-art results, papers and software for various NLP tasks (e.g. NER, Parsing, WSD, PP-Attachment, Chunking, Dependency Parsing, Summarization, QA, ...). This will help newcomers to the field to get the feel of "what's available" in terms of both tasks and available tools, and will allow active researchers to keep current on fields other than their own.

For example, I am currently quite up to date with what's going on with parsing, PoS tagging and chunking (and of course the CoNLL shared tasks are great when available, yet in many cases not updated enough), but I recently needed to do some Anaphora Resolution, and was quite lost as to where to start looking...

I think the ACL Wiki is an ideal platform for this, and if enough people will show some interest, I will create a "StateOfTheArt" page and start populating it. But, before I do that, I would like to (a) know if there really is an interest in something like this and (b) hear any comments you might have about it (how you think it should be organized, what should be the scope, how it can be advertised other than in this blog, etc).
I find this especially amusing because this is something that I'd been planning to blog about for a few weeks and just hadn't found the time! I think that this is a great idea. If we could start a community effect where every time you publish a paper with new results on a common task, you also publish those results on the wiki, it would make life a lot easier for everyone.

I would suggest that the pages essentially consist of a table with the following columns: paper reference (and link), scores in whatever the appropriate metric(s) are, and a brief description of any extra resources used. If people feel compelled, they would also be encouraged to write a paragraph summary under the table with a bit more detail.

I would certainly agree to use this, and to support the effort I would be happy to go back through all my old papers and post their results on this page. It would be nice if someone (Yoav perhaps???) could initialize pages for the main tasks, so that that burden is lifted.

I'm sure other suggestions would be taken to heart, so comment away!

01 June 2007

Open Access CL Proposal

Following up on the Whence JCLR discussion, Stuart Shieber, Fernando Pereira, Ryan McDonald, Kevin Duh and I have just submitted a proposal for an open access version of CL to the ACL exec committee, hopefully to be discussed in Prague. In the spirit of open access, you can read the official proposal as well as see discussion that led up to it on our wiki. Feel free to email me with comments/suggestions, post them here, or bring them with you to Prague!