26 October 2006

Saving Read Papers, Revisited

So, why am I interested in how you save read papers? Well, I don't want to ruin the surprise yet. First, let's take a look at the (still incoming) results. The most popular method (roughly 60% of the population surveyed) is to save them locally. People have also pointed to some tools for archiving, though my guess is that these are probably underutilized. I'm actually a bit surprised more people don't use delicious, though I don't use it myself, so perhaps I shouldn't be surprised. (Incidentally, I fall into the majority class.)

The reason I'm curious is that I spend a nontrivial amount of time browsing people's web pages to see what papers they put up. Some of this has to do with the fact that I only follow about a dozen conferences with regularity, which means that something that appears in, say, CIKM, often falls off my radar. Moreover, it seems to be increasingly popular to simply put papers up on web pages before they are published formally. Whether this is good or not is a whole separate debate, but it is happening more and more. And I strongly believe that it will continue to increase in popularity. So, I have a dozen or so researchers whose web pages I visit once a month or so to see if they have any new papers out that I care about. And, just like the dozen conferences I follow, there are lots that fall off my radar here.

But this is (almost) exactly the sort of research problem I like to solve: we have too much information and we need it fed to us. I've recently been making a fairly obvious extension to my Bayesian query-focused summarization system that enables one to also account for "prior knowledge" (i.e., I've read such and such news stories -- give me a summary that updates me). I've been thinking about whether to try such a thing out on research articles. The basic idea would be to feed it your directory containing the papers you've read, and then it would periodically go around and surface new papers you're likely to find interesting. Such a thing could probably be hooked into something like delicious, though given the rather sparse showing here, it's unclear that would be worthwhile.
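To make the retrieval side of this concrete, here is a minimal sketch. It is not the Bayesian summarization model mentioned above; it just ranks unread papers by TF-IDF cosine similarity to the centroid of the papers you've already read, which is the simplest possible stand-in. The directory names and the use of scikit-learn are my assumptions for illustration.

```python
# A minimal sketch, NOT the Bayesian model itself: score unread papers by
# cosine similarity to the centroid of papers already read. Directory
# names and the TF-IDF representation are illustrative assumptions.
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def load_texts(directory):
    """Read every plain-text file in a directory; return (names, texts)."""
    names = sorted(os.listdir(directory))
    texts = []
    for name in names:
        with open(os.path.join(directory, name), errors="ignore") as f:
            texts.append(f.read())
    return names, texts

_, read_papers = load_texts("papers/read")            # hypothetical: papers you've read
cand_names, candidates = load_texts("papers/unread")  # hypothetical: newly crawled papers

vectorizer = TfidfVectorizer(stop_words="english")
read_vecs = vectorizer.fit_transform(read_papers)
cand_vecs = vectorizer.transform(candidates)

# Rank candidates by similarity to the average of everything already read,
# then print the ten most promising ones.
centroid = np.asarray(read_vecs.mean(axis=0))
scores = cosine_similarity(cand_vecs, centroid).ravel()
for idx in scores.argsort()[::-1][:10]:
    print(f"{scores[idx]:.3f}  {cand_names[idx]}")
```

A real version would obviously need a crawler for the researcher pages and conference sites, plus something smarter than a centroid (which is exactly where the "prior knowledge" model would slot in), but the plumbing is roughly this simple.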

Of course, it's a nontrivial undertaking to get such a thing actually running beyond my controlled research environment (my desktop), so I wanted to get a sense of whether anyone might actually be interested. Ross's comment really got my attention because it would probably be easier technologically if everything could be done online (so one wouldn't have to worry about cross-platform issues, etc.).

Anyway, this is something I've been thinking about for a while and it seems like a lot of the tools exist out there already.

20 October 2006

Saving Read Papers

I'm going to have a go at doing a mini-poll. Basically, I'm interested in whether or not papers you have read (and, presumably, find interesting) find their way into a permanent spot on your machine or your physical space. Please vote :).


How do you archive papers you have read?
I save most of them to a directory on my machine
I bookmark most of them in my browser
I print most of them and save them in a filing cabinet
I save them in some other way that allows easy electronic access
I save them in some other way that allows easy physical access
I don't save them

18 October 2006

The Shared Task Effect

Shared tasks have been increasing in popularity over the past half decade. These are effectively competitions (though perhaps that word is rightfully disdained) for building systems that perform well on a given task, for a specific data set. Typically a lot of stuff is given to you for free: the data, various preprocessing steps, evaluation scripts, etc. Anywhere from a handful of people to dozens enter these shared tasks. Probably the most well known are the CoNLL shared tasks, but they have also taken place in other workshops (e.g., the two SMT workshops and many others). Government-run competitions (e.g., GALE, ACE, DUC (to some degree) and others) are somewhat similar, with the added bonus that money is often contingent on performance, but for the most part, I'll be talking about the community-driven shared tasks. (I'll note that shared tasks exist in other communities, but not to the extent that they exist in NLP, to my knowledge.)

I think there are both good and bad things about having these shared tasks, and a lot depends on how they are run. Perhaps some analysis (and discussion?) can serve to help future shared task organizers make decisions about how to run these things.

Many pros of shared tasks are perhaps obvious:
  1. Increases community attention to the task.
  2. Often leads to development or convergence of techniques by getting lots of people together to talk about the same problem.
  3. Significantly reduces the barrier to entry for the task (via the freely available, preprocessed data and evaluation scripts).
  4. (Potentially) enables us to learn what works and what doesn't work for the task.
  5. Provides a standardized benchmark against which future algorithms can be compared.
Many of these are quite compelling. I think (3) and (5) are the biggest wins (with the caveat that it's dangerous to test against the same data set for an extended period of time). My impression (which may be dead wrong) is that, with respect to (1), there has been a huge surge of interest in semantic role labeling due to the CoNLL shared task. I can't comment on how useful (2) is, though it seems that there is at least quite a bit of potential there. I know there have been at least a handful of shared task papers that I've read that gave me an idea along the lines of "I should try that feature."

In my opinion, (4) seems like it should be the real reason to do these things. I think the reason why people don't tend to learn as much as might be possible about what does and does not work is that there's very little systematization in the shared tasks. At the very least, almost everyone will use (A) a different learning algorithm and (B) a different feature set. This means that it's often very hard to tell -- when someone does well -- whether it was the learning or the features.

Unfortunately (would that it were not so!) there are some cons associated with shared tasks, generally closely tied to corresponding pros.
  1. May artificially bloat the attention given to one particular task.
  2. Usefulness of results is sometimes obscured by multiple dimensions of variability.
  3. Standardization can lead to inapplicability of certain options that might otherwise work well.
  4. Leads to repeated testing on the same data.
Many of these are personal taste issues, but I think some argument can be made for them all. For (1), it is certainly true that having a shared task on X increases the amount of time the collective research community spends on X. If X is chosen well, this is often fine. But, in general, there are lots of really interesting problems to work on, and this increased focus might lead to narrowing. There's recently been something of a narrowing in our field, and there is certainly a correlation (though I make no claim of causation) with increased shared tasks.

(2) and (3) are, unfortunately, almost opposed. You can, for instance, fix the feature set and only allow people to vary the learning. Then we can see who does learning best. Aside from the obvious problem here, there's an additional problem that another learning algorithm might do better if it had different features. Alternatively, you could fix the learning and let people do feature engineering. I think this would actually be quite interesting. I've thought for a while about putting out a version of Searn for a particular task and just charging people with coming up with better features. This might be especially interesting if we did it for, say, both Searn and Mallet (the UMass CRF implementation) so we could get a few more points of comparison.

To be more concrete about (3), a simple example is in machine translation. The sort of preprocessing (e.g., tokenization) that is good for one MT system (e.g., a phrase-based one) may be very different from the preprocessing that is good for another (e.g., a syntax-based one). One solution here is to give multiple versions of the data (raw, preprocessed, etc.), but then this makes the (2) situation worse: how can we tell who is doing best, and whether it's just because they have a darn good tokenizer (don't underestimate the importance of this!)?

(4) doesn't really need any extra discussion.

My personal take-away from putting some extra thought into this is that it can be very beneficial to have shared tasks, if we establish at the outset what the goals are. If our goal is to understand what features are important, maybe we should consider fixing the learning to a small set of algorithms. If our goal is learning, do the opposite. If we want both, maybe ask people to do feature ablation and/or try a few different learning techniques (this is perhaps too much of a burden, though). I think we should definitely keep pro (3), the low barrier to entry: to me, this is one of the biggest pros. I think the SMT workshops did a phenomenal job here, for a task as complex as MT. And, of course, we should choose the tasks carefully.

11 October 2006

Two More Competitions

Busy week this is! Here are two more pointers.
Enjoy!

10 October 2006

Scaling and Data

In NLP, we often live in the idealized learning world where we have more data than we really know what to do with. The oft-cited Banko + Brill results are perhaps extreme in this regard (in the sense that we rarely have quite that much data), but we certainly have far more than most fields. The great thing about having lots of data is that large data sets support complex statistical analysis. As a stupid example, consider estimating a Gaussian. We estimate the mean and covariance (or generate a posterior over these quantities, if you prefer to be Bayesian). In a small data setting, we'd almost always approximate the Gaussian by either a diagonal or constant-diagonal covariance matrix, especially if the number of data points is less than the number of dimensions (true Bayesians might not do this, but this is probably tangential). But if we have billions of data points, there's likely enough information in there to reliably estimate quite a few parameters (or approximate their posteriors), and we can do the full covariance matrix estimation.

The problem is that the full covariance estimation is computationally really expensive. Not only do we have to play with O(D^2) parameters (D is the dimensionality), but we also have to perform complex operations on the data that typically scale at least as O(N^2) (N is the number of data points).
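As a toy illustration of the gap (with made-up sizes, not numbers from any real experiment): the diagonal approximation needs only D variances, while the full covariance is a D-by-D matrix whose storage and computation grow quadratically in D.

```python
# Toy illustration with made-up sizes: the diagonal covariance costs O(N*D)
# time and D parameters; the full covariance costs O(N*D^2) time and
# O(D^2) parameters, which is what becomes painful at scale.
import numpy as np

rng = np.random.default_rng(0)
N, D = 20_000, 200                    # illustrative sizes only
X = rng.standard_normal((N, D))

mean = X.mean(axis=0)

# Diagonal approximation: one variance per dimension.
diag_cov = X.var(axis=0)              # shape (D,)

# Full covariance: a D x D matrix; the centered matrix product below is
# the expensive part as D grows.
centered = X - mean
full_cov = centered.T @ centered / (N - 1)   # same as np.cov(X, rowvar=False)

print(diag_cov.shape, full_cov.shape)        # (200,) (200, 200)
```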

This is incredibly frustrating. We have the data to support a complex statistical analysis, but we don't have the computation time to actually perform the analysis. So we either throw out data to get the computation time down and do something more complex (which may now not be supported by the data) or, more often than not, do something simple on the large data set. Now, there is often nothing wrong with doing something simple, but if we cannot even try to do things that are more complex, then it's hard to say for sure whether simple is enough.

So then the question is: how can we scale? I only know a handful of answers to this question, but maybe other people can contribute some.
  1. Get a job at Google and just use a billion machines (and/or some really clever Google engineers, a la the Google SMT system). This is obviously not a very satisfying option for everyone.
  2. Subsample the data. This is also not very satisfying (and, perhaps, even worse than the first option).
  3. Use a randomized algorithm, such as what Deepak did in his thesis (see the sketch after this list). The message here is that if your complexity hinges on pairwise computations that look something like distance metrics, you can introduce randomization and do this in something like O(N) rather than O(N^2) time.
  4. Use smart data structures. Things like kd-trees are becoming increasingly popular in the ML community for solving pairwise problems. The idea is to recursively divide your data space (in an intelligent fashion) so that you can store sufficient statistics about what's under a node at that node itself. (I think one reason these haven't taken off in NLP is that they appear at first glance to be much better suited to real-valued mid-dimensional data, rather than sparse, discrete, super-high-dimensional data...is there an alternative for us?)
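For item (3), here is a minimal sketch of the general trick: signed random projections, one flavor of locality-sensitive hashing for cosine similarity. I'm not claiming this is Deepak's exact construction, and all the parameters are made up; the point is only that expensive exact comparisons are restricted to points whose bit signatures collide, rather than being done over all N(N-1)/2 pairs.

```python
# A minimal sketch of item (3): signed random projections, one flavor of
# locality-sensitive hashing for cosine similarity. All parameters here
# (n_bits, data sizes) are illustrative assumptions.
from collections import defaultdict
import numpy as np

rng = np.random.default_rng(0)
N, D, n_bits = 20_000, 100, 16
X = rng.standard_normal((N, D))

# Hash each point to a 16-bit signature: the signs of its projections onto
# 16 random hyperplanes. Points at small angles tend to share signatures.
hyperplanes = rng.standard_normal((D, n_bits))
signatures = X @ hyperplanes > 0

# Group points by signature; exact (expensive) comparisons then happen
# only within buckets instead of over all N*(N-1)/2 pairs.
buckets = defaultdict(list)
for i, sig in enumerate(signatures):
    buckets[sig.tobytes()].append(i)

within_bucket_pairs = sum(len(ix) * (len(ix) - 1) // 2 for ix in buckets.values())
print("buckets:", len(buckets),
      "pairs to check:", within_bucket_pairs,
      "vs brute force:", N * (N - 1) // 2)
```

Item (4) attacks the same pairwise bottleneck by spatial partitioning instead of hashing, which is part of why kd-trees shine on dense, moderate-dimensional data while hashing-style tricks tend to cope better with sparse, very-high-dimensional NLP representations.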
There may be other general solutions, but I'm not aware of them. As it stands, with the exception of Deepak and a few others, the solution appears to be basically to hire a bunch of smart people (and/or grad students) to do lots of engineering. But I'd prefer general principles.

06 October 2006

Resources for NLP

Just a quick pointer that was referred to me. In addition to the well known Stanford StatNLP link list, Francois-Régis Chaumartin also maintains a list of NLP resources and tools at proxem.com. Any other lists people find especially useful (I suppose this would lead to a meta-list :P)?

02 October 2006

I'll Take Movie Recommendations for $1m, Alex

If you feel like you have the world's greatest recommender system, you should enter the Netflix challenge for improving their movie recs. In addition to the possibility of winning a lot of money and achieving fame, you also get an order-of-magnitude larger data set for this task than has been available to date. (Note that in order to win, you have to improve performance over their system by 10%, which is a steep requirement.) I'll offer an additional reward: if you do this using NLP technology (by analyzing movie information, rather than just the review matrix), I'll sweeten the pot by $10.