28 January 2007

Good News on ACL Reviews

I'm reviewing for ACL again this year (in the machine learning subcomponent). A couple of days ago, I received my notice to start bidding on papers (more on bidding below). The email came with the following note:

Naturally, reviewers have been chosen to assess papers based on their own expertise and outlook. Having said this, we are aware that ACL has sometimes been perceived, especially in recent years, as overemphasizing the pursuit of small incremental improvements of existing methods, perhaps at the expense of exciting new developments. (ACL is solid but boring, is what some people would say.) While we believe that it would be counterproductive to change course radically -- We certainly would not want to sacrifice solidity! -- we would like to encourage you, as a reviewer, to look out particularly for what's novel and interesting, even if this means accepting a paper that has one or two flaws, for example because it has not been evaluated as rigourously as you would like. (It is for you to judge when a flaw becomes a genuine problem.)
I think this is fantastic! (Would someone who is reviewing---i.e., on the PC---for another area confirm or deny that all areas got such a message, or was it just ML?) One difficulty I always have as a reviewer is that I assign scores to different categories (originality, interest, citations, etc.) and then am asked to come up with a meta-score that summarizes all these scores. But I'm not given any instruction on how to weigh the different components. What this note seems to be doing is saying "weigh interest higher than you usually would." In the past two years or so, I've been trying to do this. I think that when you start out reviewing, it's tempting to pick apart little details on a paper, rather than focusing on the big picture. It's been a conscious (and sometimes difficult) process for me to get over this. This explicit note is nice to see because it is essentially saying that my own internal process is a good one (or, at least, whoever wrote it thinks it's a good one).

I also think---in comparison to other conferences I've PCed for or reviewed for---that ACL does a really good job of moderating the bidding process. (For those unfamiliar with bidding... when a paper gets submitted, some area chair picks it up. All papers under an area chair are shown---title plus abstract---to the reviewers in that area. Reviewers can bid "I want to review this," "I don't want to review this," "I am qualified to review this," or "Conflict of interest." There is then some optimization strategy to satisfy reviewers' preferences/constraints.) In comparison to ECML and NIPS in the past, the ACL strategy of dividing into area chairs seems to be a good thing. For ECML, I got a list of about 500 papers to select from, and I had to rank them 1-5 (or 1-10, I don't remember). This was a huge hassle.

Compared with most of the conferences that I'm familiar with, ACL seems to have a pretty decent policy. While I would be thrilled to see them introduce an "author feedback" step, everything else seems to work pretty well. In the past, I've only once gotten into a real argument over a paper with other reviewers --- most of the time, all the reviewer scores have tended to be +/- 1 or 2 (out of ten) of each other. And for the times when there is an initial disagreement, it is usually resolved quickly (e.g., one reviewer points out some major accomplishment, or major flaw, in the paper that another reviewer missed).

25 January 2007

Error Analysis

I was recently asked if I thought that it would be a good idea if our conferences were to explicitly require an error analysis to be performed and reported in papers. While this is perhaps a bit extreme (more on this later), there are at least two reasons why this would be desirable.

  1. When multiple techniques exist for solving the same problem, and they get reasonably close scores, is this because they are making the same sort of errors or different sorts?
  2. If someone were to build on your paper and try to improve it, where should they look?
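
To make the first question concrete, here is a minimal sketch of how one might check whether two systems with similar scores are making the same errors. The function name `error_overlap`, the toy tag sequences, and the choice of Jaccard overlap as the summary number are all assumptions made for illustration; nothing here comes from a particular paper or toolkit.

```python
def error_overlap(gold, pred_a, pred_b):
    """Compare which test items two systems get wrong.

    gold, pred_a, pred_b are equal-length sequences of labels.
    Returns the indices each system alone gets wrong, the indices both
    get wrong, and the Jaccard overlap of the two error sets.
    """
    errors_a = {i for i, (g, p) in enumerate(zip(gold, pred_a)) if g != p}
    errors_b = {i for i, (g, p) in enumerate(zip(gold, pred_b)) if g != p}
    shared = errors_a & errors_b
    union = errors_a | errors_b
    return {
        "only_a": sorted(errors_a - errors_b),
        "only_b": sorted(errors_b - errors_a),
        "shared": sorted(shared),
        # Convention assumed here: two error-free systems overlap fully.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy example (made-up tags): both systems score 4 out of 6.
gold = ["N", "V", "N", "D", "V", "N"]
sys_a = ["N", "N", "N", "D", "V", "V"]
sys_b = ["N", "V", "V", "D", "V", "V"]
report = error_overlap(gold, sys_a, sys_b)
print(report)
```

In this toy case the two systems have identical accuracy but share only one of their errors, which is exactly the kind of thing a single score hides. The indices in `only_a` and `only_b` are also a natural place to start looking at actual outputs.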
There's an additional aspect that comes up, especially once you're in a sort of supervisory role. It's often hard to get students to actually look at outputs, and forcing this as part of the game early on is a good idea. I was the same as a student (and continue to be the same now) -- only two or three out of a dozen or so papers of mine contain an error analysis.

This situation reminds me a bit of an excellent talk I saw a few years ago (at ACL or EMNLP in Barcelona, I think) by Mitch Marcus talking about some parsing stuff. I don't really remember much of his talk, except that he kept flashing a single slide that read "Look at the data, stupid." His argument was essentially that we're not going to be able to model what we want to model unless we really understand what's going on in the data representing the phenomena we're trying to study.

An exercise that's also good from this perspective is to do some data annotation yourself. This is perhaps even more painful than doing an error analysis, but it really drives home the difficulties in the task.

Getting back to the point at hand, I don't think it's feasible or even necessarily advisable to require all papers to include an error analysis. But I also think that more papers should contain error analyses than actually do (including some of my own). In the universal struggle to fit papers within an 8 page limit, things have to get cut. It seems that the error analysis is the first thing to get cut (in that it gets cut before the paper is even written -- typically by not being performed).

But, at least for me, when I read a paper, I want to know after the fact what I have learned. Occasionally it's a new learning technique. Or occasionally it's some new useful features. Or sometimes it's a new problem. But if you were to take the most popular problems out there that I don't work on (MT, parsing, language modeling, ASR, etc.), I really have no idea what problems are still out there. I can guess (I think names in MT are hard, as is ordering; I think probably attachment and conjunctions in parsing; I have little idea in LM and ASR), but I'm sure that people who work on these problems (and I really mean work: like, you care about getting better systems, not just getting papers) know. So it would be great to see it in papers.

18 January 2007

Comments on: Mark-up Barking Up the Wrong Tree

The "Last Words" article in the Dec 2006 issue of Computational Linguistics is by Annie Zaenen from PARC. (I hope that everyone can access this freely, but I sadly suspect it is not so... I'm half tempted to reproduce it, since I think it's really worth reading, but I don't want to piss off the Gods at MIT press too much.)

The main point I got from the article is that we really need to pay attention to how annotation is done. A lot of our exuberance for annotating is due to the success of machine learning approaches on the Treebank, so we have since gone out and annotated probably hundreds of corpora for dozens of other tasks. The article focuses on coreference, but I think most of the claims apply broadly. The first point made is that the Treebank annotation was controlled, and done by experts (linguists). Many other annotations are not done this way: they are done without real standards and without deep analysis of the task. The immediate problem, then, is that a learning algorithm that "succeeds" on the annotated data is not necessarily solving the right task.

There was a similar story that my ex-office-mate Alex Fraser ran across in machine translation; specifically, with evaluating alignments for machine translation. The basic problem was two-fold. First, the dataset that everyone used (the French-English data from Aachen) was essentially broken, due largely to its distinction between "sure" and "possible" links -- almost every word pair was possibly linked. Second, the evaluation metric (alignment error rate, or AER) was itself broken. Together, these made results on this dataset virtually useless. The conclusion is essentially: don't use the Aachen data and don't use AER. That is, don't use them if you want improved MT performance, i.e., if you expect higher alignment performance to imply higher MT performance. (If this sounds familiar, it's perhaps because I mentioned it before.)
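
For readers who haven't seen AER, here is a small sketch of the standard definition, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the hypothesized set of links, S the "sure" links, and P the "possible" links (with S a subset of P). The toy sentence pair and link sets below are invented for illustration and are not the Aachen data; the point is only to show one way that an overly generous "possible" set can make very different alignments look equally good.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    All three arguments are sets of (source_index, target_index) links.
    By convention every sure link is also a possible link.
    """
    possible = possible | sure
    if not hypothesis and not sure:
        return 0.0  # nothing to align and nothing proposed
    matched_sure = len(hypothesis & sure)
    matched_possible = len(hypothesis & possible)
    return 1.0 - (matched_sure + matched_possible) / (len(hypothesis) + len(sure))


# Toy 3-word sentence pair (made up): the diagonal links are "sure".
sure = {(0, 0), (1, 1), (2, 2)}
all_pairs = {(i, j) for i in range(3) for j in range(3)}

precise = set(sure)      # links exactly the sure pairs
dense = set(all_pairs)   # links every word to every word

# If nearly every pair is marked "possible", both alignments look perfect.
aer_precise_loose = alignment_error_rate(precise, sure, possible=all_pairs)
aer_dense_loose = alignment_error_rate(dense, sure, possible=all_pairs)

# With a tight "possible" set, the dense alignment is penalized.
aer_dense_tight = alignment_error_rate(dense, sure, possible=sure)

print(aer_precise_loose, aer_dense_loose, aer_dense_tight)
```

Under these toy assumptions, the metric cannot tell the careful alignment from the link-everything alignment once the "possible" set covers almost all pairs.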

I should say I largely agree with the article. Where I differ (perhaps only by epsilon) is that the article seems to pick on annotation for machine learning, but I really don't see any reason why the fact that we're using machine learning matters. The issue is really one of evaluation: we need to know that at the end of the day, when we compute a number, that number matters. We can compute a number intrinsically or extrinsically. In the extrinsic case, we are golden, assuming the extrinsic task is real (turtles upon turtles). In the intrinsic case, the situation is fishy. We have to make sure that both our annotations mean something and our method of computing error rate means something (à la the error metric types and the use of F for named entities). While I've argued on this blog that the error metric is important, the CL article argues that the annotation is important. I think that as someone who is on the machine learning side, this is easy to forget.

12 January 2007

Survey of Parser Usage

I know people use parsers a fair amount; I'm curious which parsers people use. Thus, a new poll :). What I'm interested in, in particular, is if you use a parser essentially as a preprocessing step before doing something "higher level." In other words, if you are Mark Johnson and you use the Charniak parser to build a better parser, this doesn't count (sorry, Mark!). I want to know if you use one to do, e.g., summarization or MT or IE or something other than parsing... (Also, I don't count shallow parsing in this context.) I look forward to responses!

Do you use a parser for your work?
Nope, no parsing in this neck of the woods!
Yes, one of the Brown parsers (Charniak, Johnson, etc.)
Yes, one of the Collins parsers (Collins, Bikel, etc.)
Yes, MiniPar
Yes, one of the CoNLL dependency parsers (Nivre, McDonald, etc.)
Yes, a rule-based parser.
Yes, something else entirely.

07 January 2007

What Irks Me about E-mail Customer Service

I hate dealing with customer service for large corporations, and it has little to do with outsourcing. I hate it because in the past few months, I have sent out maybe three or four emails to customer service peeps, at places like BofA, Chase, Comcast, Ebay, etc. Having worked in a form of customer service previously (I worked at the computer services help desk as an undergrad at CMU to earn some extra money), I completely understand what's going on. But "understand" does not imply "accept." I post this here not as a rant, but because I think there are some interesting NLP problems under the hood.

So what's the problem? What has happened in all these cases is that I have some problem that I want to solve, can't find information about it in the FAQ or help pages on the web site, and so I email customer service with a question. As an example, I wanted to contest an Ebay charge but was two days past the 60 day cutoff (this was over Thanksgiving). So I asked customer service if, given the holiday, they could waive the cutoff. As a reply I get a form email, clearly copied directly out of the FAQ page, saying that there is a 60 day cutoff for filing contests to charges. Well no shit.

So here's my experience from working at the help desk. When we got emails, we had the option of either replying by crafting an email, or replying by selecting a prewritten document from a database. This database was pretty huge -- many thousands of problems, neatly categorized and searchable. For emails for which the answer existed in the database, it took maybe 10 seconds to send the reply out.

What seems to be happening nowadays is that this is being taken to the extreme. A prewritten form letter is always used, regardless of whether it is appropriate or not. If it is a person doing this bad routing, that's a waste of 10 seconds of person time (probably more for these large companies). If it's a machine, it's no big deal from their perspective, but it makes me immediately hate the company with my whole heart.

But this seems to be a really interesting text categorization/routing problem. Basically, you have lots of normal classes (the prewritten letters) plus a "needs human attention" class. There's a natural precision/recall/$$$ trade-off, which is somewhat different and more complex than is standardly considered. But it's clearly an NLP/text categorization problem, and clearly one that should be worked on. I know from my friends at AT&T that they have something similar for routing calls, but my understanding is that this is quite different. Their routing happens based on short bits of information into a small number of categories. The customer service routing problem would presumably be based on lots of information in a large number of categories.
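
As a rough illustration of the trade-off, here is a minimal sketch of routing with a "needs human attention" fallback. It assumes some upstream classifier (not shown, and hypothetical) has already produced a probability for each prewritten letter, and that we can put rough costs on sending a wrong form letter versus having a person answer. The names `route`, `cost_wrong_letter`, and `cost_human`, the letter ids, and the numbers are all made up for the example.

```python
HUMAN = "needs_human_attention"


def route(letter_probs, cost_wrong_letter=5.0, cost_human=1.0):
    """Pick a prewritten letter, or hand the email to a person.

    letter_probs maps a form-letter id to the estimated probability that
    this letter actually answers the email (from a hypothetical upstream
    classifier). The best letter is sent only if its expected cost of
    being wrong is no larger than the cost of a human reply.
    """
    if not letter_probs:
        return HUMAN
    best_letter, prob = max(letter_probs.items(), key=lambda kv: kv[1])
    expected_cost_of_letter = (1.0 - prob) * cost_wrong_letter
    if expected_cost_of_letter <= cost_human:
        return best_letter
    return HUMAN


# A routine question: the classifier is confident, so a form letter goes out.
easy = route({"password_reset": 0.95, "billing_faq": 0.05})

# An unusual request: no letter is a confident match, so a person handles it.
hard = route({"cutoff_policy": 0.55, "refund_howto": 0.30, "billing_faq": 0.15})

print(easy, hard)
```

Raising `cost_wrong_letter` (say, to reflect customers who end up hating the company) pushes more email to people; raising `cost_human` does the opposite. A real system would need per-class costs and calibrated probabilities, which is part of what makes the problem more complex than the standard setup.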

You could argue that this is no different from providing a "help search" option on the Ebay web page. But I think it's different, if for no other reason than how it appears to the user. If the user thinks he is writing an email to a person, he will write a good email with full sentences and lots of information. If he's just "searching" then he'll only write a few keywords.

NAACL Accepted Papers

See here.

I probably won't be going to NAACL, so if afterwards someone wants to volunteer to post a few papers they especially liked, I'd appreciate it!

02 January 2007

Learning when test and train inputs have different distributions -- NIPS workshop

I spent the second day of workshops at NIPS (while not skiing) attending the Learning when test and train inputs have different distributions workshop. This is closely related to (or really just a different name for) the domain adaptation problem I've been interested in for quite some time. Unfortunately, I can't easily find the list of papers (there were some good ones!), which means that my memory may be lacking in some parts. Here are some points I took away.

Statisticians have worked on this problem for a long time. If you provide insurance, you're going to want a predictor to say whether to give a new person a policy or not. You have lots of data on people and whether they made any claims. Unfortunately, the training data (people you have information on) is limited to those to whom you gave policies. So the test distribution (entire population) differs from the training distribution (people to whom you gave policies). Some guy (can't remember his name right now) actually won a Nobel prize in Economics for a partial solution to this problem, which was termed "covariate shift" (because statisticians call our "inputs" "covariates" and they are changing).

There seems to be a strong desire to specify models which (though see the comment below) can be characterized as "train p(y|x) and test p(y|x) are the same, but train p(x) and test p(x) differ." In other words, the "labeling function" is the same, but the distribution over inputs is different. This is probably the clearest way to differentiate domain adaptation from multitask learning (for the latter, we typically assume p(x) stays the same but p(y|x) changes). I'm not sure that this is really a hugely important distinction. It may be to obtain interesting theoretical results, but my sense is that in the problems I encounter, both p(x) and p(y|x) are changing, but hopefully not by "too much." An interesting point made along these lines by Shai Ben David that I spent a bunch of time thinking about several years ago was that from a theoretical perspective, assuming p(y|x) is the same is a vacuous assumption, because you can always take two radically different p(x) and q(x), add a feature that indicates which (p vs. q) the data point came from, and call this the "global p(x)". In fact, in some sense, this is all you need to do to solve multitask learning, or domain adaptation: just add a feature saying which distribution the input is from, and learn a single model. I've been doing some experiments recently and, while you can do better than this in practice with standard learning models, it's not such a bad approach.
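
Here is a minimal sketch of the "just add a feature saying which distribution the input is from" idea. The source describes a single p-vs-q indicator; the one-hot encoding over several domains below is an assumption that generalizes it slightly. The function names, domain names, and toy feature vectors are invented for illustration, and the learner itself is left out: the pooled data could be handed to any standard model.

```python
def add_domain_indicator(features, domain, domains):
    """Append a one-hot indicator of the source domain to a feature vector."""
    indicator = [1.0 if d == domain else 0.0 for d in domains]
    return list(features) + indicator


def pool(datasets):
    """Merge per-domain datasets into one training set.

    datasets maps a domain name to a list of (features, label) pairs.
    Every example keeps its original features and gains indicator
    features marking which domain it came from.
    """
    domains = sorted(datasets)  # fixed order so indicator positions are stable
    pooled = []
    for domain in domains:
        for features, label in datasets[domain]:
            pooled.append((add_domain_indicator(features, domain, domains), label))
    return pooled


# Toy data (made up): two input features, binary labels, two domains.
data = {
    "news": [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
    "blogs": [([1.0, 1.0], 0)],
}
pooled = pool(data)
for features, label in pooled:
    print(features, label)
```

With a linear model, the indicator only shifts the bias per domain; a learner that can form interactions between the indicator and the original features could in principle fit a different p(y|x) for each domain, which is where the "single global distribution" argument comes from.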

There were several other talks I liked. It seemed that the results (theoretically) were of the form "if p(y|x) is the same and p(x) and q(x) differ only by a 'little' then doing naive things for learning can do nicely." My favorite formalization of p(x) and q(x) differing a little was the Shai Ben David/John Blitzer approach of saying that they differ slightly if there is a single hyperplane that does well (has low error) on both problems. The restriction to hyperplanes is convenient for what they do later, but in general it seems that having a single hypothesis from some class that will do well on both problems is the general sense of what "p(y|x) is the same" is really supposed to mean. I also enjoyed a talk by Alex Smola on essentially learning the differences between the input distributions and using this to your advantage.

In general, my sense was that people are really starting to understand this problem theoretically, but I really didn't see any practical results that convinced me at all. Most practical results (modulo the Ben David/Blitzer, which essentially cites John's old work) were very NIPSish, in the sense that they were on unrealistic datasets (sorry, but it's true). I wholeheartedly acknowledge that it's somewhat difficult to get your hands on good data for this problem, but there is data out there. And it's plentiful enough that it should no longer be necessary to make artificial or semi-artificial data for this problem. (After all, if there weren't real data out there, we wouldn't be working on this problem...or at least we shouldn't :P.)