04 January 2010

ArXiV and NLP, ML and Computer Science

Arxiv is something of an underutilized resource in computer science. Indeed, many computer scientists seem not to even know it exists, despite it having been around for two decades now! On the other hand, it is immensely popular in (some branches of) mathematics and physics. This used to strike me as odd: arxiv is a computer service, so why haven't computer scientists jumped on it? Indeed, I spent a solid day a few months ago putting all my (well, almost all my) papers on arxiv. One can always point to "culture" for such things, but I suspect there are more rational reasons why it hasn't affected us as much as it has others.

I first ran into arxiv when I was in math land. The following is a cartoon view of how (some branches of) math research gets published:
  1. Authors write a paper
  2. Authors submit paper to a journal
  3. Authors simultaneously post paper on arxiv
  4. Journal publishes (or doesn't publish) paper
We can contrast this with how life goes in CS land:
  1. Conference announces deadline
  2. One day before deadline, authors write a paper
  3. Conference publishes (or rejects) paper
I think there are a few key differences that matter. Going up to the mathematician model, we can ask ourselves, why do they do #3? It's a way to get the results out without having to wait for a journal to come back with a go/no-go response. Basically in the mathematician model, arxiv is used for advertising while a journal is used for a stamp of approval (or correctness).

So then why don't we do arxiv too? I think there are two reasons. First, we think that conference turnaround is good enough -- we don't need anything faster. Second, it completely screws up our notions of blind review. If everyone simultaneously posted a paper on arxiv when submitting to a conference, we could no longer claim, at all, to be blind. (Please, I beg of you, do not start commenting about blind review versus non-blind review -- I hate this topic of conversation and it never goes anywhere!) Basically, we rely on our conferences to do both advertising and stamp of approval. Of course, the speed of conferences is mitigated by the fact that you sometimes have to go through two or three before your paper gets in, which can make them as slow as, or slower than, journals.

In a sense, I think that largely because of the blind thing, and partially because conferences tend to be faster than journals, the classic usage of arxiv is not really going to happen in CS.

(There's one other potential use for arxiv, which I'll refer to as the tech-report effect. I've many times seen short papers posted on people's web pages either as tech-reports or as unpublished documents. I don't mean tutorial-like things, like I have, but rather real semi-research papers. These are papers that contain a nugget of an idea, but for which the authors seem unwilling to go all the way to "make it work." One could imagine posting such things on arxiv. Unfortunately, I really dislike such papers. It's very much a "flag planting" move in my opinion, and it makes life difficult for people who follow. That is, if I have an idea that's in someone else's semi-research paper, do I need to cite them? Ideas are a dime a dozen: making it work is often the hard part. I don't think you should get to flag plant without going through the effort of making it work. But that's just me.)

However, there is one prospect that arxiv could serve that I think would be quite valuable: literally, as an archive. Right now, ACL has the ACL anthology. UAI has its own repository. ICML has a rather sad state of affairs where, from what I can tell, papers from ICML #### are just on the ICML #### web page and if that happens to go down, oh well. All of these things could equally well be hosted on arxiv, which has strong government support to be sustained, is open access, blah blah blah.

This brings me to a question for you all: how would you feel if all (or nearly all) ICML papers were to be published on arxiv? That is, if your paper is accepted, instead of uploading a camera-ready PDF to the ICML conference manager website, you instead uploaded to arxiv and then sent your arxiv DOI link to the ICML folks?

How do you feel about arxiving ICML?
No, please don't put my paper on arxiv.
I'm happy to have my paper on arxiv, but you should do it for me!
I'm happy to upload my paper to arxiv.

Obviously there are some constraints, so there would need to be an opt-out policy, but I'm curious how everyone feels about this....

22 comments:

  1. Could you be a little more specific on the boundary you perceive between "flag planting" and "making it work"?

    I'm all in favor of the Arxiv plus (open source) journal approach. I think it represents the best combination of (a) timeliness of sharing, (b) flag planting [no longer being an academic, it doesn't matter so much for me], and (c) long-term quality.

    Conferences are no longer the fastest route to sharing, are always a pain to justify outside of CS for tenure and promotion cases, and given the pressure on writers and reviewers Hal notes, not prone to produce as high quality a result as a journal article (or even a tech report). Conference paper length bounds are also problematic for ideas that are smaller or larger than the de facto minimal publishable unit of 8 pages.

    I love conferences. I'd just prefer they relinquish the role of gatekeeper and embrace the role of community builder. They used to work more that way in speech, though I hear they're tightening their acceptance rates, which I'm also told looks good on academic CVs.

    I'll let you draw your own conclusions about the role of single- or double-blind reviewing and their feasibility or desirability in an Arxiv/journal world.

  2. arxiv has had for a really long while (e.g., since 1994?) a cmp-lg component (now cs.CL):

    http://arxiv.org/list/cs.CL/recent

  3. Playing devil's advocate, what is the value of arXiv when I can just link a PDF from my web page and count on Google to rapidly index it?

  4. @Bob: Well, if I were a finger-pointing type this would be a lot easier. As a very small example, take my topic models on a graph paper. The idea is really straightforward. I tried to make it work for a long time until I realized that it just plain didn't work unless you put edge weights on the graph that could be inferred. I could easily have put up a tech report with the original idea, but there's no way I could have known that to "make it work" you'd really need to deal with edge weights properly. Then, on top of that, there's all the effort of getting the data, writing the code, debugging, blah blah blah.

    drago: sure, but hardly anyone uses it!

    regehr: permanence. your web page might change url, you might move, or retire or whatever, but your papers should live on! plus, there are lots of people (mostly senior :P) who don't bother updating their web pages anymore, and then if the ICML web page goes down, hasta la vista paper!

  5. About your comment on tech-reports: if we ignore the credit-related issues, these reports can be seen as small pieces of unfinished research that could help another person's research. For instance, suppose a researcher makes some observations while performing an experiment that he/she is not able to explain from a theoretical standpoint, perhaps due to a lack of background/resources. In this situation, do you think it would be reasonable to put those results into a tech-report?

  6. The proceedings of ICML have been archived in the ACM Digital Library for quite a while now, and they all get an ISBN. (I've been publication chair for this conference a couple of times, and I was surprised that ACM does check in detail those permission-to-publish forms!)

    Recently, it appears that full mirrors for the conference websites have been made available at the IMLS website. Certainly nothing like the books.nips website, but that's a start. UAI has recently started a major effort to make the whole series of proceedings available electronically, and perhaps ICML will follow.

    I remember several conversations in the past about making ICML papers automatically available at arxiv, but it seems quite a pain: we either have to format all source files so that they can be compiled by arxiv (hard to centrally coordinate with the authors), or we have to make a case for them to accept our PDFs (I have no idea how rigid they can be on rejecting PDFs. Perhaps somebody else can clarify).

  10. This is the first time I've heard about ArXiV. Thanks for the info.

  11. How about these science cartoons?

    There are many good ones on Vadlo search engine http://vadlo.com/cartoons.php?id=1.
