05 June 2007

Tracking the State of the Art

I just received the following email from Yoav Goldberg:

I believe a much-needed resource in the ACL community is a "state-of-the-art repository": a public location where one can find information about the current state-of-the-art results, papers and software for various NLP tasks (e.g. NER, Parsing, WSD, PP-Attachment, Chunking, Dependency Parsing, Summarization, QA, ...). This would help newcomers to the field get a feel for "what's available" in terms of both tasks and tools, and would allow active researchers to keep current on fields other than their own.

For example, I am currently quite up to date with what's going on in parsing, PoS tagging and chunking (and of course the CoNLL shared tasks are great when available, though in many cases not updated often enough), but I recently needed to do some Anaphora Resolution, and was quite lost as to where to start looking...

I think the ACL Wiki is an ideal platform for this, and if enough people show interest, I will create a "StateOfTheArt" page and start populating it. But before I do that, I would like to (a) know whether there really is interest in something like this, and (b) hear any comments you might have about it (how you think it should be organized, what its scope should be, how it can be advertised other than on this blog, etc.).

I find this especially amusing because this is something I'd been planning to blog about for a few weeks and just haven't found the time! I think this is a great idea. If we could establish a community norm where every time you publish a paper with new results on a common task, you also post those results on the wiki, it would make life a lot easier for everyone.

I would suggest that the pages essentially consist of a table with the following columns: paper reference (and link), scores in whatever the appropriate metric(s) are, and a brief description of any extra resources used. If people feel compelled, they would also be encouraged to write a paragraph summary under the table with a bit more detail.
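
To make that concrete, here is a rough sketch (in Python, purely illustrative; every name and value below is a placeholder rather than a real entry) of the kind of record a single table row would capture:

    entry = {
        "paper": "Author et al. (year)",                 # paper reference
        "link": "http://aclweb.org/anthology/...",       # hypothetical link
        "scores": {"metric name": None},                 # whatever metric(s) are appropriate
        "extra_resources": "unlabeled data, gazetteers, none, ...",
        "summary": "optional paragraph with a bit more detail",
    }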

I would certainly use this, and to support the effort I would be happy to go back through all my old papers and post their results on these pages. It would be nice if someone (Yoav perhaps???) could initialize pages for the main tasks, so that burden is lifted.

I'm sure other suggestions would be taken to heart, so comment away!

Comments:

Peter Turney said...

As one of the people who initiated the ACL Wiki, I can say that this use of the wiki is entirely compatible with my vision for it. I've already done something along these lines here:

http://aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions

http://aclweb.org/aclwiki/index.php?title=SAT_Analogy_Questions

Panos Ipeirotis said...

This is a great idea. In general, wikis seem to be a great medium for keeping track of the "state of the art" in any field. The incentives are right for individual authors to post their own results in such a wiki, so this seems like a self-sustaining approach: whoever believes they have the best tool for some task can post their entry and gain visibility.

I was thinking of doing the same for survey papers that summarize the state of the art in a particular field. (See the related blog entry.)

One of the issues raised about maintaining such "state of the art" lists was the lack of support in current wikis for adding semantically meaningful links that connect the different papers, techniques, tools, and so on (e.g., tool A "complements" tool B, tool C "outperforms" tool D). Still, I believe this approach has potential.

Peter Turney said...

The mandate of the ACL Wiki is "to facilitate the sharing of information on all aspects of Computational Linguistics". Survey papers and state-of-the-art repositories fit the mandate perfectly. As they say at Wikipedia, "Be bold!"

http://en.wikipedia.org/wiki/Wikipedia:Be_bold_in_updating_articles

Fernando Pereira said...

One worry with this proposal is that published results do not define the state-of-the-art; reproduced results are what is needed. All too often, published results are not reproducible or very difficult to reproduce. I have seen instances of papers that were rejected because their results were not better than a "state-of-the-art" that no one could reproduce. At the very least, state-of-the-art status requires published code and data that will yield the state-of-the-art results. That is not the standard in our field yet.

Yoav said...

Good to know there is interest in this proposal!

I agree with Fernando that reproducible results are far more important than claimed results, and I think an "Available Software" column in the listing can go a long way toward addressing this issue.

Another issue that I would like to hear comments about before I bootstrap some pages in the wiki is how to deal with similar-yet-different tasks. Three instances of this are: (1) tasks that have a lot in common or that subsume each other (e.g. NP Bracketing vs. NP Chunking vs. Chunking); (2) different learning frameworks (e.g. rule-based vs. supervised vs. semi-supervised vs. unsupervised); and (3) languages other than English.

How should these be organized? Should they be considered the same task? Completely different tasks? Subtasks in some kind of hierarchy? Any other suggestions?

Anonymous said...

Funny this should come up. I am trying to do the very same thing Hal is discussing here, and let me tell you, it's a mess out there. What should I read up on while unemployed?

At UC Berkeley they parse really fast, and they have a great POS tagging demo. Wow. Should I quickly learn their techniques?

I am reading Bikel et al.'s NER paper. Great paper, but a thorough understanding of HMMs is required. After you get the main technique, does the remainder of the paper (endless smoothing formulae with lambdas) just contain lab-specific solutions, and is it worth wading through?

What is the basic knowledge an NLPer out on the job market needs, anyway?

I have tried to sift out what *I* think are the highlights of the past 10 years, and really, there are not that many. I said 'highlights'; that does not mean I frown on all the intense research as not being potential highlights.

One thing that struck me is that, for instance, Jurafsky and Gildea are working on lexical semantics, but this was tackled at BBN 14 years ago. Penelope Sibun, in what is almost an afterthought in Cutting's Xerox paper, claims good results relating arguments and assigning semantic roles.

I am lost. My interpretation of all this is: there is not all that much ground-breaking innovation, and if there is, we don't know it yet (or I don't know it yet).

I have tried to put all this together on a fledgling set of webpages. If you think that's a contribution to this conversation, great.

http://www.geocities.com/koos_wilt/TheLinguisticsPages/intro.html

Anonymous said...

koos wrote:

"Thing that struck me as that for instance, Jurakfski and Galdea are wporking on lexical semantics, but this was tackled at BBN 14 years ago. Penelope Sibun, in what is almost an afterthought in Cutting's Xerox paper, claims good results relating arguments and assigning semantic roles.

I am lost. My interpretation of all this is: there is not all that much ground-breaking innovation, and if there is, we don't know it yet (or I don't know it yet)."

I suggest you re-read the Cutting paper and the latest papers on semantic role labeling (perhaps the work of Pradhan et al.). The Cutting paper reports 80% accuracy on a coarse-grained classification task, whereas modern papers report 90%+ on a more fine-grained classification task.

I am not sure where you get the impression that there has been no ground-breaking research. What about machine translation? Systems have gone from language-specific and totally unusable to robust, language-general and very much useful (though many more improvements are still needed).

Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90%. This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds. Entity taggers are so robust today that they can often be (and are) used out-of-the-box for many real-world applications.

Similar improvements have been made in parsing, word sense disambiguation, generation, discourse analysis, relation extraction, co-reference resolution, etc.

It might be true that it is rare for a single paper to be considered "ground-breaking innovation". However, I think it is simplistic to expect that. Language is complex and difficult. Though we want our solutions to ultimately be as simple as possible, we should expect the path by which we reach those solutions to be complex and, as a result, incremental. When taken as a whole, I think it would be hard to argue that the body of research over the past 10 years has not been innovative.

An interesting take on incremental research can be seen in a post by Fernando Pereira.

hal said...

yoav -- available software is a big plus. i'm not sure how to handle the similar tasks -- a reasonably dense linking structure might be the way to go. imo, you should make it so that it is as easy as possible for people to add their info, even if this makes it slightly harder to find. if it's hard to enter, no one will and it will be useless. if it's easy to enter but hard(er) to find, then it's still better than combing 100s of papers, so there's still benefit.

koos/ryan: i think ryan is right. a lot of times it's somewhat hard to track progress because the problems are a bit amorphous. the same problem goes by different names; similar yet different problems by the same name. i would say that while there have been few papers over the past decade that alone have been amazingly groundbreaking, the sum progress is huge. i'm oversimplifying here, but 10 years ago things didn't work at all. today many things work well enough.

Anonymous said...

Ryan wrote in response to my posting: I suggest you re-read the Cutting paper and the latest papers on semantic role labeling (perhaps the work of Pradhan et al.). The Cutting paper reports 80% accuracy on a coarse-grained classification task, whereas modern papers report 90%+ on a more fine-grained classification task.

My reply: Thank you for your reaction (and man, do my typos look embarrassing). Please realize my post should be taken in the spirit of this discussion, which I interpret to be "how can we see the forest for the trees?"

Ryan wrote in response to my posting: I am not sure where you get the impression that there has been no ground-breaking research.

My reply: There are a number of reasons why I have that impression, the main one being *I* am having a hard time seeing the forest for the trees. (This implies others may not have a similarly hard time).

As an 'industrial linguist', but not one working at a major research lab, I find it hard to determine which particular line of research is important and will bear fruit in the (near) future. I might have formulated my anguish ( :) ) as a question very much in keeping with this particular topic: how will any serious researcher determine which papers and lines of research are the present-day equivalents of Church, Cutting et al., and Weischedel? In other words, I am not saying there is no progress per se (I did say that verbatim, but phrased it awkwardly); I am saying: what is the most effective way for an 'industrial linguist' to stay informed of significant research?

Ryan wrote in response to my posting: What about machine translation? Systems have gone from language-specific and totally unusable to robust, language-general and very much useful (though many more improvements are still needed).

My reply: I am all too happy to hear it, having done some actual work in MT. And yes, it used to be an intractable problem. My current interest, however, lies in working with other textual technologies.

Ryan wrote in response to my posting: Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90%. This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds. Entity taggers are so robust today that they can often be (and are) used out-of-the-box for many real-world applications.

My reply: I am aware of this, but, in a way, my awareness is too dim. And that's in keeping with the purpose of this particular conversation: how do we see the forest for the trees?

Ryan wrote in response to my posting: It might be true that it is rare for a single paper to be considered "ground-breaking innovation". However, I think it is simplistic to expect that. Language is complex and difficult. Though we want our solutions to ultimately be as simple as possible, we should expect the path by which we reach those solutions to be complex and, as a result, incremental. When taken as a whole, I think it would be hard to argue that the body of research over the past 10 years has not been innovative.

My reply: You are absolutely correct in the previous paragraph. Again, though, my question is: "How do we in the field, with CTOs and CEOs who expect results, effectively wade through the deluge of papers and information to keep up?" There are several possible routes to take, along each of which one could read and study incrementally.

Again, take my web visit to Berkeley as an example. The demo there is downright impressive. The tagger is incredibly fast, and the parser even faster. It's also accurate, and it deals with unseen data. Does this imply I should start reading their every research paper? Of course not, but then what *should* I read? Again, that seems to be what this conversation is supposed to address, correct?

Yoav said...

Ok, I created a new Wiki category called "State of the Art", with a link from the ACL Wiki main page. I populated it with skeletons for some core NLP tasks and started filling in some of the entries (for now some POS tagging and some parsing; more will follow soon).

Contributions and updates are of course welcome!

Anonymous said...

Thank you, Yoav - being a novice to this blog, could you tell me where this 'state of the art' link is?

-Koos

Yoav said...

koos -- it's not a part of the blog, but of the ACL wiki. Here's the url of the Wiki's main page: http://aclweb.org/aclwiki/index.php?title=Main_Page

Anonymous said...

Looks very useful, Yoav (and others). I have about 2-3 hours a week I could spend as a volunteer. Is there anything I could help with?

Anonymous said...

I checked out the new Wiki for results. In the POS tagging entry, I noticed Libin Shen et al.'s new tagging paper from ACL '07.

It reports an improvement from Toutanova et al.'s 97.24 to 97.33 on the same old sections of the treebank (test on sections 22-24). I can't afford the treebank, so I'm just estimating here, but there are about 1M words, and about 25 sections, so the test set is only about 120K words.

A simple binomial hypothesis test would put a one-sigma confidence interval at sqrt(.97 * (1 - .97) / 120,000), or 0.0005. The 95% confidence interval would be 2 sigma, or about .001, or about .1%, or just about the improvement noted in the paper.
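
In case anyone wants to check that arithmetic, here it is in a few lines of Python (the 120K-token test-set size is just my rough estimate from above):

    import math

    n = 120_000                         # rough test-set size (sections 22-24)
    p = 0.97                            # approximate tagging accuracy
    sigma = math.sqrt(p * (1 - p) / n)  # one-sigma binomial standard error
    print(f"one sigma      ~ {sigma:.4f}")            # ~0.0005
    print(f"95% half-width ~ {2 * sigma:.4f}")        # ~0.0010, i.e. about .1%
    print(f"reported gain  = {0.9733 - 0.9724:.4f}")  # 0.0009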

So is the result "significant"? No, it's not, because the confidence interval is still too fat. For it to be a true confidence interval, the test items would have to be sampled at random. But they're not -- they're all taken from sections 22-24 of the Treebank, in which there are all kinds of temporal and topical dependencies within the articles making up the corpus. For instance, the same phrase shows up again and again referring to a person, but the evals treat those occurrences as independent.
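
If one wanted to account for those dependencies, one option is to resample whole articles rather than individual tokens when comparing two systems. A rough sketch (the function and variable names here are made up for illustration, not taken from any of the papers above):

    import random

    def paired_article_bootstrap(gold, tags_a, tags_b, articles,
                                 n_boot=10_000, seed=0):
        """Estimate how often system B beats system A when whole articles,
        not tokens, are resampled -- so within-article correlation is kept.
        gold, tags_a, tags_b map an article id to its list of tags."""
        rng = random.Random(seed)
        b_wins = 0
        for _ in range(n_boot):
            sample = rng.choices(articles, k=len(articles))
            correct_a = sum(p == g for art in sample
                            for p, g in zip(tags_a[art], gold[art]))
            correct_b = sum(p == g for art in sample
                            for p, g in zip(tags_b[art], gold[art]))
            b_wins += (correct_b > correct_a)
        return b_wins / n_boot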

Another assumption is that we don't build gazillions of systems and then choose the best one post-hoc. The multi-way significance eval would be much stricter.

I don't mean to pick on Shen et al. I had the same reaction to Michael Collins's paper on improving his parser some fractional degree. And often reimplementations of the same "idea" have this much noise in them (e.g. Bikel's reimplementation of Collins's parser).

This is a problem with how our field understands (or rather misunderstands) significant improvements. I've had papers rejected for not evaluating on a "standard" test set, even when there wasn't one.

Finally, I'd like to make a plea for reporting memory and time along with results, ideally together with the amount of human effort spent on feature tweaking. When I'm shopping for a technique for a commercial app, these are overriding concerns that dwarf 0.001 improvements in accuracy on an "easy" test set that matches the training data. In that vein, I'd love to see results on words not in the training set.
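
Something along these lines is all the reporting I'm asking for (a sketch only; the evaluation harness and names are hypothetical, and any tagger callable would do):

    import time
    import tracemalloc

    def evaluate(tagger, train_vocab, test_sentences):
        """Report overall accuracy, accuracy on tokens unseen in training,
        wall-clock time, and peak memory.  `tagger` maps a token list to a
        tag list; `test_sentences` is a list of (tokens, gold_tags) pairs."""
        tracemalloc.start()
        start = time.time()
        total = correct = oov_total = oov_correct = 0
        for tokens, gold_tags in test_sentences:
            predicted = tagger(tokens)
            for tok, gold, pred in zip(tokens, gold_tags, predicted):
                total += 1
                correct += (gold == pred)
                if tok not in train_vocab:
                    oov_total += 1
                    oov_correct += (gold == pred)
        elapsed = time.time() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return {
            "accuracy": correct / total,
            "oov_accuracy": oov_correct / max(oov_total, 1),
            "seconds": elapsed,
            "peak_memory_mb": peak / 1e6,
        }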

Unknown said...

Wouldn't it also be nice to have information about the language for which the results were obtained? I'm new to this field, but I assume most results are language-dependent, and I can also imagine that there are languages for which performance will forever lag behind, for example, English. Moreover, I agree that reproducibility is crucial, so I think it would be nice to have an indication of whether and where the results have been reproduced.
