
07 December 2008

Two Reviewer Comments that Stuck with Me

Over the years I've gotten a number of positive reviews and negative reviews. I usually remember few of the details of any of them more than a few months after the good or bad news. Two reviewer quotes, however, have really stuck with me. They were both in overall positive reviews, though one of them is more negative sounding. Obviously I don't know who wrote them, but they've actually had a strong impact on most of my research and my own reviewing since:

  1. "Other approaches don't have to be bad in order for your approach to be good."
  2. "If I were working in this area, I would want to know about these results."
You can guess what I might have written that had caused this reviewer to make the first comment. And actually I've come to think of these two comments as basically saying the same thing.

I feel like we too often see research in a competitive light. There are two things that can cause this. The first is funding. Short-term funding (in the US) is essentially a zero-sum game, which means that I can win only if you lose. (There are models where this is less true, but usually that has other -- not necessarily desirable -- outcomes.)
The second is the scooping effect: many times when we have (what we think is) a cool idea, we want to "beat" everyone else to the punch. I recall a comment John Langford made on his blog (which I cannot seem to find now because I can't figure out what search terms to use) quite some time ago along these lines... saying that for many problems he doesn't care who finds a solution so long as the solution is found. I usually have mixed feelings. For some topics, I definitely feel this way. Indeed, for some topics I actually don't want to be the person who figures it out, either because I don't feel I have the necessary skills or because I don't have the necessary time. For some topics, I do get a feeling of ownership and really want to be the person who does it. My thesis work was like that, as was my document/abstract alignment work. This is definitely highly personal: I know plenty of people who care a lot more about ownership than I do, and many who care a lot less.

What I took away from this comment is essentially the realization that we are all working toward some vague future goal, which has to do with computationalizing language processing (or some other topic, for the non-NLP audience). Progress is good. If I've done work that has something interesting and novel to say about this goal, then it's not bad -- and is often good -- that this builds on and improves on your work.

01 December 2008

Where did my (linear) model go wrong?

Back to research-related posts for a while :).

Let's say we're learning a predictor f and it doesn't get 100% test accuracy. We can ask ourselves: why not. Typically, the reason I ask myself this question is because I want to get closer to 100% test accuracy and want to know how I should go about it. Should I add more labeled data (what should it look like) or should I add more features (what should they look like) or am I just hosed?

In order to try to get a handle on this, we can ask ourselves where error can come from:

  1. Noise in the training data (which we have fit and is now screwing us up at test time)
  2. Noise in the test data (which is causing our otherwise correct predictions to look wrong)
  3. Insufficient representation (we just don't have the right/enough features)
  4. Insufficient examples (our training data wasn't sampled densely enough in some region)
These are not mutually exclusive. For instance, maybe by reducing/changing the feature set (3), it would turn out that we are sampled densely enough in a region of interest (4).

I'm going to use (binary) text categorization as a working example because it's simple yet exhibits NLP-like properties (ever growing feature sets, sparse feature representations, etc.). So let's say we train a model (perhaps linear) on a bag of words feature set and apply it on test data. Say we get 90% accuracy on dev data. Now what can we do?

Let's take a single example that we misclassified. Perhaps take the "worst" (i.e., most misclassified in a margin sense), though I don't know that it matters. Which of the four reasons above tells us why this example was misclassified?

By looking at the example (yes, by hand!) we can probably tell if it's a problem due to issue (2: Noise in the test data). We might like to try to automate this ascertainment, but honestly I'm at a loss for how to do this. Let's say we decide that the problem isn't due to test set noise. Now what?

Let's consider the following approach. We are going to retrain our classifier several times (okay, this is expensive, but we can worry about this later). What we're going to do is add this misclassified dev point into the training data with a varying cost. The currently trained model we will call f(0), since it is based on giving this dev point a cost of zero. We can then train f(c) for a large range of costs c and let C be some reasonable upper bound (i.e., we guarantee that C is big enough that f(C) correctly classifies this point -- for any reasonable classifier, there should be such a finite C). Go ahead and replace "correctly classifies" with "classifies with a sufficiently large margin" if you'd prefer; I doubt it matters.
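To make this a little more concrete, here is a minimal sketch of the cost sweep, assuming a scikit-learn-style linear classifier and using sample_weight to play the role of the per-example cost; the function name, the cost grid, and the choice of logistic regression are my own, not anything prescribed by the setup above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def find_min_cost(X_train, y_train, x_dev, y_dev, costs=np.logspace(-2, 3, 20)):
    """Return (c', f(0), f(c')) where c' is the smallest cost in the sweep at
    which the misclassified dev point becomes correctly classified."""
    f0 = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # f(0)

    X_aug = np.vstack([X_train, x_dev.reshape(1, -1)])
    y_aug = np.append(y_train, y_dev)
    for c in costs:
        w = np.append(np.ones(len(y_train)), c)   # cost c on the dev point only
        fc = LogisticRegression(max_iter=1000).fit(X_aug, y_aug, sample_weight=w)
        if fc.predict(x_dev.reshape(1, -1))[0] == y_dev:
            return c, f0, fc                       # found c'
    return None, f0, None                          # the sweep's upper bound wasn't C
```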

Now, we're going to look at two of these fs. We'll look at f(0) and f(c'), where c' is the minimal value of c such that this dev example becomes correctly classified. Let's say we now run these two fs on the training data. We know that f(0) will misclassify our "selected" dev point and that f(c') will not. The big question is what the two fs do on the other points.
  1. Suppose that f(c') doesn't make any (many?) more mistakes than f(0). That is, they're basically the same, just now classifying our selected point correctly. This suggests that the problem is (3) or (4) above (features or examples).
  2. Suppose that f(c') makes many more mistakes than f(0). Now, we see that in order to get this selected dev point correct, we have to pay by getting other things wrong (that we didn't before). This suggests that the problem is (1) or (3) above (noise or features). Importantly, it's not (4: examples).
At this point, we have separated causes (3 or 4) from (1 or 3), but haven't separated them completely.

Now, let's go back and do the same process for all of the dev points that were misclassified. What can happen?
  1. Almost all of the f(c')s make no more errors on other training points. Unless all of these erroneous dev points are markedly different from the training data (in which case we really have a domain adaptation problem), this is almost certainly a feature issue (3).
  2. Almost all of the f(c')s make many more errors on the other training points, and the set of training points on which they make these errors is usually the same. Then this is almost certainly noisy training data (or noisy test data).
  3. Almost all of the f(c')s make many more errors on the other training points, but the set of training points on which they err is substantially different each time. Then this is almost certainly a feature issue (3).
  4. Mixed results: some f(c')s make more errors on training, some don't. This is harder to say, but my gut tells me that this is probably a (4: example) issue.
There's a lot of hedging going on here: "almost", "many", "different", etc. Maybe you could formalize these things, but I'd rather get the intuition right first.
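Continuing the hypothetical sketch from above, the bookkeeping for these cases might look something like the following; the threshold for "many more errors" is an arbitrary placeholder, and interpreting the returned tallies is deliberately left to the reader, precisely because the intuition isn't formalized yet.

```python
def collect_evidence(X_train, y_train, X_dev, y_dev, many_more=5):
    """For each misclassified dev point, record which *additional* training
    points f(c') gets wrong relative to f(0). `many_more` is an arbitrary
    cutoff for "many more mistakes"."""
    f0 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    base_errors = set(np.where(f0.predict(X_train) != y_train)[0])

    extra_error_sets = []
    for i in np.where(f0.predict(X_dev) != y_dev)[0]:
        c, _, fc = find_min_cost(X_train, y_train, X_dev[i], y_dev[i])
        if fc is None:
            continue                               # couldn't flip this point
        errs = set(np.where(fc.predict(X_train) != y_train)[0]) - base_errors
        extra_error_sets.append(errs)

    n_harmless = sum(len(e) < many_more for e in extra_error_sets)
    return n_harmless, extra_error_sets            # the case analysis happens here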

(If you're using a margin based classifier, you might not have to exactly retrain each time. Koby Crammer's passive aggressive algorithm will essentially give you a closed form solution for "closest (in L2) weight vector to the current weight vector such that a given point is correctly classified," which could save some computational effort.)
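For completeness, the closed form that parenthetical is pointing at looks roughly like this; it is the textbook passive-aggressive step for a binary linear classifier, written from memory rather than lifted from Crammer et al., so treat it as a sketch.

```python
def pa_update(w, x, y):
    """Passive-aggressive step: the L2-closest weight vector to w that
    classifies (x, y) with margin at least 1, where y is in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss at the current w
    tau = loss / np.dot(x, x)                 # zero if the margin is already met
    return w + tau * y * x
```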

Note that I haven't actually tried this. I'd definitely be interested to, but wanted to think about it a bit before I shot off to implement something.

04 November 2008

Using machine learning to answer scientific questions

I tend to prefer research (and the papers that result from it) that attempt to answer some sort of question about natural language. And "can I build a system that beats your system" does not count as a question. (Of course, I don't always obey this model myself.) For instance, the original IBM MT papers were essentially trying to answer the question: can we build a model of translation directly from parallel data? Later work in phrase-based and syntax-based translation basically started out by asking the question: is syntactic structure (or local lexical structure) a useful abstraction for enabling translation? It seems the answer to all these questions is "yes."

A less broad type of question that one can ask might be categorized as: is some type of feature relevant for some task. For instance, are lexical semantics features derived via a hand-crafted resource (such as WordNet) useful for the task of coreference resolution? What about lexical semantics features derived automatically from the web? This type of question is the sort of thing that I looked at in an EMNLP 2005 paper. The answers seemed to be "no" and "yes" respectively.

There are several issues with trying to answer such questions. One issue is that typically what you're actually looking at is the question: can I figure out a way to turn WordNet into a feature vector that is useful for the task of coreference resolution? This is probably a partial explanation for why our community seems not to like negative results to such questions: maybe you just weren't sufficiently clever in encoding WordNet as features. Or maybe WordNet features are useful but there is some other set of features that's more useful that's just swamping the benefits of WordNet.

My first question is whether we're even going about this the right way. The usual approach is to take some baseline system, add WordNet features, and see if predictive performance goes up (i.e., performance on test data). This seems like a bit of a round-about way to attack the problem. After all, this problem of "does some feature have an influence on the target concept" is a classical and very well studied area of statistics: most people have probably seen ANOVA (analysis of variance), but there are many many more ways to try to address this question. And, importantly, they don't hinge on the notions of predictive performance. (Which almost immediately ties us in to "my system is better than yours.")
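As a toy illustration of the kind of classical machinery I have in mind (and emphatically not a claim that this is the right test for structured NLP problems): a nested-model likelihood-ratio test asks whether a block of features has any effect at all, with no appeal to held-out predictive accuracy. The feature blocks and the nearly-unregularized logistic regression here are assumptions of the sketch.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def feature_block_lr_test(X_base, X_block, y):
    """Likelihood-ratio test: do the extra features in X_block matter at all,
    beyond X_base? Uses (nearly) unregularized logistic regression so the
    fitted log-likelihoods are meaningful."""
    def neg_loglik(X):
        model = LogisticRegression(C=1e6, max_iter=5000).fit(X, y)
        return log_loss(y, model.predict_proba(X), normalize=False)

    stat = 2.0 * (neg_loglik(X_base) - neg_loglik(np.hstack([X_base, X_block])))
    dof = X_block.shape[1]                    # number of extra parameters
    return stat, chi2.sf(stat, dof)           # LR statistic and its p-value
```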

Now, I'm not claiming that ANOVA (or some variant thereof) is the right thing to do. Our problems are often quite different from those encountered in classical statistics. We have millions of covariates (features) and often complex structured outputs. I just wonder if there's some hint of a good idea in those decades of statistical research.

The other way to go about looking at this question, which at the time I wrote the EMNLP paper I thought was pretty good, is the following. The question is: why do I care if WordNet features are useful for prediction. Very rarely will they actually hurt performance, so who cares if they don't help? (Besides those people who actually want to know something about how language works.... though in this case perhaps it tells us more about how WordNet works.) The perspective we took was that someone out there is going to implement a coreference system from scratch (perhaps for a new language, perhaps for a new domain) and they want to know what features they should spend time writing and which ones they can ignore. This now is a question about predictive performance, and we did a form of backwards feature selection to try to answer it. If you look at Figure 2 in the paper, basically what you see is that you'd better implement string-match features, lexical features, count-based features, and knowledge-based features. On top of this, you can get another half point with inference features and another half again using gazetteers. But beyond that, you're probably wasting your time. (At the time of the EMNLP paper, one of the big "sells" was the count-based features, which are impossible in the standard pairwise-decision models that most people apply to this problem. I still think this is an interesting result.)
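Roughly, that "what should I spend time implementing" analysis amounts to backwards selection over feature groups rather than individual features. Here is a hedged sketch with a placeholder train_and_eval callback and an arbitrary tolerance; the group names in the docstring are just examples, and this is not the exact procedure from the EMNLP paper.

```python
def backwards_group_selection(groups, train_and_eval, tolerance=0.001):
    """Greedy backwards selection over feature groups. `groups` maps a group
    name (e.g. "string-match", "lexical", "count-based") to its features;
    `train_and_eval(active)` is a placeholder that trains on those groups and
    returns dev performance. Groups whose removal costs less than `tolerance`
    get dropped; what survives is what you'd better implement."""
    active = dict(groups)
    score = train_and_eval(active)
    while len(active) > 1:
        # Try removing each remaining group; keep the least damaging removal.
        trials = {name: train_and_eval({k: v for k, v in active.items() if k != name})
                  for name in active}
        name, new_score = max(trials.items(), key=lambda kv: kv[1])
        if score - new_score > tolerance:
            break                              # every remaining group earns its keep
        del active[name]
        score = new_score
    return active, score
```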

At any rate, I still think this ("what should I spend time implementing") is a reasonable way to look at the problem. But if you're moving to a new domain or new language, most likely all of your time/money is going to be spent annotating data, not implementing features. So maybe you should just go and implement everything you can think of anyway. And besides, if you want to make an argument about moving to different languages or different domains or even different amounts of training data, you should make those arguments specifically and do the experiments to back them up. For instance, how does Figure 2 in the EMNLP paper change when you have half the data, or a tenth the data? Maybe the lexical features don't matter as much and perhaps here WordNet might kick in?

22 May 2008

Continuing bad ideas

I occasionally--more than I would like--run into the following problem. I am working on some task for which there is prior work (shocking!). I'm getting ready to decide what experiments to run in order to convince readers of a potential paper that what I'm doing is reasonable. Typically this involves comparing fairly directly against said prior work. Which, in turn, means that I should replicate what this prior work has done as closely as possible, letting the only difference in systems be what I am trying to propose. Easy example: I'm doing machine learning for some problem and there is an alternative machine learning solution; I need to use an identical feature set to theirs. Another easy example: I am working on some task; I want to use the same training/dev/test set as the prior work.

The problem is: what if they did it wrong (in my opinion)? There are many ways of doing things wrong and it would be hard for me to talk about specifics without pointing fingers. But just for concreteness, let's take the following case relating to the final example in the above paragraph. Say that we're doing POS tagging and the only prior POS tagging paper used every tenth sentence in WSJ as test data, rather than the last 10%. This is (fairly clearly) totally unfair. However, it leaves me with the following options:

  1. I can repeat their bad idea and test on every tenth sentence.
  2. I can point out (in the paper) why this is a bad idea, and evaluate on the last 10%.
  3. I can not point this out in the paper and just ignore their work and still evaluate on the last 10%.
  4. I can point out why this is a bad idea, evaluate on both the last 10% (for "real" numbers) and every tenth sentence (for "comparison" numbers).
  5. I can point out why this is a bad idea, reimplement their system, and run both mine and theirs on the last 10%.
I think that if all else is equal, (5) is the best option. I think it's scientifically sound, truthful, and gives results that are actually comparable. Unfortunately, it's also the most work because it involves reimplementing a complete other system which in many cases may be verging on prohibitive. (I suppose another option would be to contact the authors and ask them to rerun in the correct way -- I imagine some people might respond positively to this, though probably not all.)

(4) is tempting, because it allows me to "break the bad habit" but also compare directly. The problem is that if I really disbelieve the prior methodology, then these comparison numbers are themselves dubious: what am I supposed to learn from results that compare in an unreasonable setting? If they're great (for me), does that actually tell me anything? And if they're terrible (for me), does that tell me anything? It seems to depend on the severity of the "bad idea", but it's certainly not cut and dry.

(3) just seems intellectually dishonest.

(2) at first glance seems inferior to (4), but I'm not entirely sure that I believe this. After all, I'm not sure that (4) really tells me anything that (2) doesn't. I suppose one advantage to (4) over (2) is that it gives me some sense of whether this "bad idea" really is all that bad. If things don't look markedly different in (2) and (4), then maybe this bad idea really isn't quite so bad. (Of course, measuring "markedly different" may be difficult for some tasks.)

(1) also seems intellectually dishonest.

At this point, to me, it seems like (4) and (5) are the winners, with (2) not hugely worse than (4). Thinking from the perspective of how I would feel reviewing such a paper, I would clearly prefer (5), but depending on how the rest of the paper went, I could probably tolerate (2) or (4) depending on how egregious the offense is. One minor issue is that as a writer, you have to figure that the author of this prior work is likely to be a reviewer, which means that you probably shouldn't come out too hard against this error. Which is difficult because in order to get other reviewers to buy (4), and especially (2), they have to buy that this is a problem.

I'm curious how other people feel about this. I think (5) is obviously best, but if (5) is impossible to do (or nearly so), what should be done instead.

20 September 2007

Mark-up Always the Wrong Tree?

Almost a year ago I responded to a very interesting article in CL. The substance of the article is that we have to be careful when we annotate data lest we draw incorrect conclusions. In this post I'm going to take a more extreme position. It's not necessarily one I agree with 100%, but I think it's worth more than just a brief consideration.

Proposition: mark-up is always a bad idea.

That is: we should never be marking up data in ways that it's not "naturally" marked up. For instance, part-of-speech tagged data does not exist naturally. Parallel French-English data does. The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing.

Here's why I think this is a reasonable position to take. In some sense, we're striving for machines that can do things that humans do. We have little to no external evidence that when humans (for instance) perform translation, they also perform part-of-speech tagging along the way. Moreover, as the CL article mentioned above nicely points out, it's very easy to confuse ourselves by using incorrect representations, or being lazy about annotating. We may be happy to speculate that humans build up some sort of syntactic representation of sentences inside their heads (and, yes, there is some psychological evidence for something that might correlate with this). But the fact is, simply, that all we can observe are the inputs and outputs of some processes (eg., translation) and that we should base all of our models on these observables.

Despite the fact that agreeing with this proposition makes much of my own work uninteresting (at least from the perspective of doing things with language), I find very few holes in the argument.

I think the first hole is just a "we're not there yet" issue. That is: in the ideal world, sure, I agree, but I don't think we yet have the technology to accomplish this.

The second hole, which is somewhat related, is that even if we had the technology, working on small problems based on perhaps-ill-conceived data will give us insight into important issues. For instance, many summarization people believe that coreference issues are a big problem. Sure, I can imagine an end-to-end summarization system that essentially treats coreference as a "latent variable" and never actually looks at hand-annotated coref data. On the other hand, I have no idea what this latent variable should look like, how it should be influenced, etc. The very process of working on these small problems (like "solving" coref on small annotated data sets) gives us an opportunity to better understand what goes into these problems.

The hole with the second hole :) is the following. If this is the only compelling reason to look at these sub-problems, then we should essentially stop working on them once we have a reasonable grasp. Not to be too hard on POS tagging, but I think we've pretty much established that we can do this task and we know more or less the major ins and outs. So we should stop doing it. (Similar arguments can be made for other tasks; eg., NE tagging in English.)

The final hole is that I believe that there exist tasks that humans don't do simply because they're too big. And these are tasks that computers can do. If we can force some humans to do these tasks, maybe it would be worthwhile. But, to be honest, I can't think of any such thing off the top of my head. Maybe I'm just too closed-minded.

08 August 2007

Conferences: Costs and Benefits

I typically attend two or three conferences per year; usually NIPS (which has been in Vancouver since I started attending), and an ACL-related one; the third is typically a second ACL-related conference or ICML, depending on the year. Typically two of these are domestic, one is international. Domestic conferences cost me about $1500 and international ones vary, but Prague weighed in at around $4000. This means that my travel costs (just for myself!) are about $5500-$7000 per year. Moreover, this takes 2-3 weeks of my year (more than 5% of my non-vacation time). When I was a student, this question never entered my mind (I seemed to have a nearly endless supply of money); now, I find myself wondering: are conferences worth the time and money investment?

I'll focus on international conferences because these are the biggest sink in terms of both money and time. In particular, I'll consider Prague, which hosted both ACL and EMNLP. Here's what I feel like I gained from this trip:

  1. I saw some interesting papers presented.
  2. I saw some interesting invited talks (Tom Mitchell's stands out for me).
  3. I had semi-deep hallway conversations with 3 or 4 people.
  4. I had non-deep hallway conversations with probably ~20 people.
  5. I gave two presentations. (The implication is that this may make me "more famous" and that this is a good thing for some reason :P.)
  6. I saw an area of the world that I hadn't yet been to.
  7. I spent a not insignificant amount of time socializing with ~20 friends who I pretty much only see at conferences.
So the question is, was this worth just over $4 grand and 10 days of my life that could have been spent doing research (or taking a vacation)?

I have mixed feelings.
  1. does not seem compelling -- for conferences close to me that I do not attend, I still read proceedings. Sure, sometimes presentations are helpful and there's a bit of a serendipity aspect, but overall, I'd say this is something I could do in a day in the park with a copy of the proceedings.
  2. is important. Especially when the invited talks are good and aren't just a long version of some paper presentation---i.e., when you can get a good sense of the overall research direction and the important long term results---I feel like these things are worth something.
  3. is important. Some people say that hallway conversations are the most important; maybe it's just me, but it's pretty rare for me to have hallway conversations that are sufficiently deep to be really meaningful in the long run, but I'd say around 3 per conference is something that you can hope for. At least with these, they seem to have either led to collaboration or at least to new ideas to try out in my own work.
  4. provides good social networking... I don't feel like these really change how I think about problems (and I think the same is true for the people I had such conversations with). The only important thing here is that if you find out what new problems other people are working on, you can learn about new areas that may interest you.
  5. is nebulous to me; I feel like the key purpose in conference talks is advertisement. It's a bit unclear what I'm advertising for---citations, perhaps?---but hopefully something I've done will save someone else some time, or will give them ideas of something to try or something along these lines. But this is highly correlated with (1), which suggests that it's actually not particularly useful.
  6. shouldn't be underestimated, but if I compare taking a vacation with going to a conference, they're very different. In particular, even at a conference where I a priori intend to spend a bunch of time touristing, I never seem able to accomplish this as much as I would like. Of course, $4k out of grant money versus $4k out of personal money is very different.
  7. also shouldn't be underestimated, but like (6) is maybe best accomplished in other ways.
Based on this, I feel like overall the main benefits to going to a conference are: seeing invited talks, having deep hallway conversations, and a minor bit of socializing and serendipity.

The thing that occurred to me recently is that it's actually possible to achieve these things without going to conferences. In particular, consider the following model. I invite one or two "famous types" to my university to give invited talks. Each of these would cost maybe $2000, but a lot (if not all) of this would be subsidized by the department. So I get invited talks for (nearly) free; for safety, even say it costs me $1k. I now have $3k left. With this $3k I tour around the country and spend a few days at different labs/universities and meet with people. If I know someone well enough at a lab, I can probably stay with them, which means my only real cost is airfare (assuming their university doesn't want to invite me and pay for it) and incidentals. For domestic flights, it's hard to imagine that I wouldn't be able to pull off each of these trips for around $750. And that eats up the $4k.

What do I get out of this model? Well, I'd give talks at four universities. This is not quite as broad an audience as at a conference, but they're more focused and my talk can be longer. Instead of having semi-deep hallway conversations, at the very least I get 4 very deep office conversations, which is potentially much more useful. I get one or two invited talks per year, by people that I choose (modulo availability).

What do I lose? I lose seeing other papers presented, which I don't think is too serious. I lose socializing and touristing (in foreign countries). This is too bad, but is perhaps better served by a legitimate vacation. The only other big thing I lose is conversations with multiple people simultaneously (eg., in Prague, one of my "good" conversations was with Ryan McDonald and Joakim Nivre... this would not be possible under my proposed model). I also lose seeing questions and answers asked at talks, which are occasionally quite interesting, but also something that I'm willing to live without.

Overall, I think the biggest thing I would lose is a sense of community, which I think is hard to quantify and yet still important. Though, I'm also not proposing that I would never go to a conference, but that maybe 2-3 per year is overkill for the benefits obtained (especially for expensive destinations). If I went to one domestic (I count Canada as domestic) conference per year and visited 2-3 other sites, I'm not sure that I'd be any worse off. (Of course, the fact that I'm in the States helps here... you probably couldn't get away with this model outside of US/Canada.)

05 June 2007

Tracking the State of the Art

I just received the following email from Yoav Goldberg:

I believe a resource well needed in the ACL community is a "state-of-the-art-repository", that is a public location in which one can find information about the current state-of-the-art results, papers and software for various NLP tasks (e.g. NER, Parsing, WSD, PP-Attachment, Chunking, Dependency Parsing, Summarization, QA, ...). This will help newcomers to the field to get the feel of "what's available" in terms of both tasks and available tools, and will allow active researchers to keep current on fields other than their own.

For example, I am currently quite up to date with what's going on with parsing, PoS tagging and chunking (and of course the CoNLL shared tasks are great when available, yet in many cases not updated enough), but I recently needed to do some Anaphora Resolution, and was quite lost as for where to start looking...

I think the ACL Wiki is an ideal platform for this, and if enough people will show some interest, I will create a "StateOfTheArt" page and start populating it. But, before I do that, I would like to (a) know if there really is an interest in something like this and (b) hear any comments you might have about it (how you think it should be organized, what should be the scope, how it can be advertised other than in this blog, etc).

I find this especially amusing because this is something that I'd been planning to blog about for a few weeks and just haven't found the time! I think that this is a great idea. If we could start a community effect where every time you publish a paper with new results on a common task, you also publish those results on the wiki, it would make life a lot easier for everyone.

I would suggest that the pages essentially consist of a table with the following columns: paper reference (and link), scores in whatever the appropriate metric(s) are, brief description of extra resources used. If people feel compelled, they would also be encouraged to write a paragraph summary under the table with a bit more detail.

I would certainly agree to use this, and to support the effort I would be happy to go back through all my old papers and post their results on this page. It would be nice if someone (Yoav perhaps???) could initialize pages for the main tasks, so that that burden is lifted.

I'm sure other suggestions would be taken to heart, so comment away!

15 May 2007

Whence JCLR?

Journal publication is not too popular for NLPers -- we tend to be a conference driven bunch. While I could care less about some arguments for journals (eg., the folks on tenure committees like them), I do feel that they serve a purpose beyond simply acting as an archive (which things like arxiv.org, the ACL anthology, citeseer, rexa, etc. do anyway). In particular, a journal paper is often a place where you get to really set the stage for your problem, describe your algorithms so that they're actually reimplementable, and go in to serious error analysis. Certainly not every paper that appears in a *ACL should continue on to the journal path, but many times a handful of papers could be merged.

One significant problem is that we're currently really limited in our choice of publication venues. Computational Linguistics (MIT Press) is definitely the place to publish a journal paper if you can. Unfortunately, CL only puts out four issues per year, each with about 4-5 papers. Sure there aren't hundreds of good papers per year, but I have to believe there are more than 16-20. Moreover, I don't feel that CL actually mirrors the *ACL proceedings -- there are many papers published in CL that I don't think match with the general sensitivities of *ACL. In addition to the small number of CL papers, the turnaround time is quite slow. I was personally very impressed with my turnaround time two years ago (date of submission -> date of publication was about a year) and I know that Robert Dale (who's editing now) has done a lot to try to improve this. But still, a year is a long time. And I've heard of papers that take several years to get through. Finally, it's not open. I hate pay-for-access journals almost as much as I hate pay-for-access conference proceedings. Sure, if you attend an *ACL you get it for "free" and most universities have agreements, but this is more a principle thing than a practical thing.

Things were similar in machine learning land about six years ago (though in fact I think they were worse). The big journal there was Machine Learning (published by Springer). They had roughly the same problems, to the extent that a large fraction of the editorial board resigned to found the Journal of Machine Learning Research (JMLR). JMLR has since become very successful, publishes dozens of papers per year, and has incredibly quick turnaround (I have seen a journal version of a NIPS paper appear in JMLR before NIPS even happens). The creation of JMLR was greatly assisted by the SPARC group, which helps fledgling journals get off the ground.

I would love to see a similar thing happen in the NLP community. I, personally, cannot make this happen (I don't have enough weight to throw around), but in talking to colleagues (the majority of whom also don't have enough weight) this seems to be something that many people would be in favor of. I don't think it has to be a "JCLR is better than CL" sort of thing; I think it's very possible for both to co-exist, essentially serving slightly different purposes for slightly different communities. In particular, aside from fast turnaround and online pubs, some things that I would love to see happen with such a journal are: strongly encouraged sharing of code/data (if one could build in some sort of copyright protection for private data, this would be even better since it would let more people share); a built-in board for paper discussion (probably with membership); and the ability for authors to easily submit addenda.

A while back I went through the SPARC suggestions of how to begin such a thing and it's very non-trivial. But it's doable. And I'd be willing to help. The biggest thing that would be required would be a bunch of people with white hair who are willing to commit body-and-soul to such a move.

19 April 2007

Searn versus HMMs

Searn is my baby and I'd like to say it can solve (or at least be applied to) every problem we care about. This is of course not true. But I'd like to understand the boundary between reasonable and unreasonable. One big apparent weakness of Searn as it stands is that it appears applicable only to fully supervised problems. That is, we can't do hidden variables, we can't do unsupervised learning.

I think this is a wrong belief. It's something I've been thinking about for a long time and I think I finally understand what's going on. This post is about showing that you can actually recover forward-backward training of HMMs as an instance of Searn with a particular choice of base classifier, optimal policy, loss function and approximation method. I'll not prove it (I haven't even done this myself), but I think that even at a hand-waving level, it's sufficiently cool to warrant a post.

I'm going to have to assume you know how Searn works in order to proceed. The important aspect is essentially that we train on the basis of an optimal policy (which may be stochastic) and some loss function. Typically I've made the "optimal policy" assumption, which means that when computing the loss for a certain prediction along the way, we approximate the true expected loss with the loss given by the optimal policy. This makes things efficient, but we can't do it in HMMs.

So here's the problem set up. We have a sequence of words, each of which will get a label (for simplicity, say the labels are binary). I'm going to treat the prediction task as predicting both the labels and the words. (This looks a lot like estimating a joint probability, which is what HMMs do.) The search strategy will be to first predict the first label, then predict the first word, then predict the second label and so on. The loss corresponding to an entire prediction (of both labels and words) is just going to be the Hamming loss over the words, ignoring the labels. Since the loss doesn't depend on the labels (which makes sense because they are latent so we don't know them anyway), the optimal policy has to be agnostic about their prediction.

Thus, we set up the optimal policy as follows. For predictions of words, the optimal policy always predicts the correct word. For predictions of labels, the optimal policy is stochastic. If there are K labels, it predicts each with probability 1/K. Other optimal policies are possible and I'll discuss that later.
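Just to pin down what that stochastic optimal policy looks like in code, here is a tiny sketch; the state encoding (a position plus a "word"/"label" flag) is entirely made up for illustration.

```python
import random

def optimal_policy(position, kind, true_words, K=2):
    """Sketch of the optimal policy described above: deterministic on word
    positions, uniformly stochastic on label positions."""
    if kind == "word":
        return true_words[position]        # always emit the correct word
    return random.randrange(K)             # labels don't affect the loss: uniform 1/K
```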

Now, we have to use a full-blown version of Searn that actually computes expected losses as true expectations, rather than with an optimal policy assumption. Moreover, instead of sampling a single path from the current policy to get to a given state, we sample all paths from the current policy. In other words, we marginalize over them. This is essentially akin to not making the "single sample" assumption on the "left" of the current prediction.

So what happens in the first iteration? Well, when we're predicting the Nth word, we construct features over the current label (our previous prediction) and predict. Let's use a naive Bayes base classifier. But we're computing expectations to the left and right, so we'll essentially have "half" an example for predicting the Nth word from state 0 and half an example for predicting it from state 1. For predicting the Nth label, we compute features over the previous label only and again use a naive Bayes classifier. The examples thus generated will look exactly like a training set for the first maximization in EM (with all the expectations equal to 1/2). We then learn a new base classifier and repeat.

In the second iteration, the same thing happens, except now when we predict a label, there can be an associated loss due to messing up future word predictions. In the end, if you work through it, the weight associated with each example is given by an expectation over previous decisions and an expectation over future decisions, just like in forward-backward training. You just have to make sure that you treat your learned policy as stochastic as well.

So with this particular choice of optimal policy, loss function, search strategy and base learner, we recover something that looks essentially like forward-backward training. It's not identical because in true F-B, we do full maximization each time, while in Searn we instead take baby steps. There are two interesting things here. First, this means that in this particular case, where we compute true expectations, somehow the baby steps aren't necessary in Searn. This points to a potential area to improve the algorithm. Second, and perhaps more interesting, it means that we don't actually have to do full F-B. The Searn theorem holds even if you're not computing true expectations (you'll just wind up with higher variance in your predictions). So if you want to do, eg., Viterbi F-B but are worried about convergence, this shows that you just have to use step sizes. (I'm sure someone in the EM community must have shown this before, but I haven't seen it.)

Anyway, I'm about 90% sure that the above actually works out if you set about to prove it. Assuming its validity, I'm about 90% sure it holds for EM-like structured prediction problems in general. If so, this would be very cool. Or, at least I would think it's very cool :).

12 April 2007

Multinomial on a graph

I'm going to try something here to see if it works; my guess is not, but maybe it will at least spark some discussion. (I'm also testing LaTeX in Blogger.) In "text modeling" applications, we're usually dealing with lots of multinomials, especially multinomials over our vocabulary (i.e., unigram language models). These are parameterized by a vector $\theta$ of length equal to the size of the vocabulary. $\theta_i$ should be the probability of seeing word $i$. The probability of observing an entire document of length $N$, containing $x_1$ copies of word 1, $x_2$ copies of word 2, and so on, is:

$$p(x_1, \dots, x_V \mid \theta) = \frac{N!}{\prod_i x_i!} \prod_i \theta_i^{x_i}$$

What I'm interested in is extensions to this where we don't assume independence. In particular, suppose that we have an ontology. Let's say that our ontology is just some arbitrary graph. What I want is a distribution that essentially prefers to keep drawing values "locally" within the graph. That is, if I've already drawn "computer" then I'm not so surprised to also see "program."

One way of thinking about this is by having a sort of random walk model. Think about placing the individual $\theta_i$s on the graph and then allowing them to disperse. So that maybe there is some $i$ for which $\theta_i = 1$ (and the rest are zero). We certainly want our distribution to prefer draws for word $i$, but draws of word $j$ where $j$ is "near" $i$ should also be acceptable.

My first cut attempt at this is as follows. Let $G$ be a graph over our vocabulary; let $N(i)$ be a neighborhood of $i$ (maybe, all nodes within 10 steps); let $N^{-1}(i)$ be the inverse neighborhood (the set of all $j$ that have $i$ in $N(j)$). Let $d(i,j)$ be the distance between two nodes and let $\alpha$ be a dampening parameter (how fast probability diffuses on the graph). Define our probabilistic model as:

$$p(x_1, \dots, x_V \mid \theta, G) = \frac{N!}{\prod_i x_i!} \prod_i \Bigl[ \sum_{j \in N^{-1}(i)} \frac{\theta_j \, \alpha^{d(i,j)}}{Z_j} \Bigr]^{x_i}$$

where we let $Z_j = \sum_{k \in N(j)} \alpha^{d(j,k)}$.

The idea behind this model is as follows. When we observe word $i$, we have to assign a probability to it. This probability is given by any node $j$ that contains $i$ in its neighborhood. Each such node has some mass $\theta_j$ which it can distribute. It distributes this mass proportional to $\alpha^{d(i,j)}$ to all nodes $i \in N(j)$, where $Z_j$ serves to normalize this diffusion.
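To sanity-check this (as I've reconstructed it), here is a tiny numpy sketch that turns $\theta$ into the per-word probabilities inside the product, given a precomputed pairwise graph-distance matrix; the names and the distance-matrix convention are mine.

```python
import numpy as np

def graph_multinomial_probs(theta, D, alpha=0.5):
    """Per-word probabilities q_i = sum_j theta_j * alpha^d(i,j) / Z_j.
    D[i, j] is the graph distance from j to i, with np.inf outside N(j);
    assumes 0 < alpha < 1 so that alpha**inf == 0."""
    spread = alpha ** D                       # alpha^{d(i,j)}; zero outside the neighborhood
    Z = spread.sum(axis=0)                    # Z_j = sum_{k in N(j)} alpha^{d(j,k)}
    return spread @ (theta / Z)               # sums to 1 whenever theta does
```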

I'm pretty sure that the normalizing constant for this distribution is the same as the multinomial. After all, there is a one-to-one mapping between the multinomial parameters and the "graph multinomial" parameters, so we should still get to normalize.

Now, because I feel an incredible need to be Bayesian, I need a conjugate prior for this "graph multinomial" distribution. Since the graph multinomial is essentially just a reparameterization of the standard multinomial, it seems that a vanilla Dirichlet would be an appropriate prior. However, I cannot prove that it is conjugate -- in particular, I can't seem to come up with appropriate posterior updates.

So this is where y'all come in to save the day! I can't imagine that no one has come up with such a "graph multinomial" distribution -- I just don't know what keywords to search for. If someone has seen something at all like this, I'd love to know. Alternatively, if someone can tell me what the appropriate conjugate prior is for this distribution, plus the posterior update rules, that would also be fantastic.

10 March 2007

Reproducible Results

In an ideal world, it would be possible to read a paper, go out and implement the proposed algorithm, and obtain the same results. In the real world, this isn't possible. For one, if by "paper" we mean "conference paper," there's often just not enough space to spell out all the details. Even how you do tokenization can make a big difference! It seems reasonable that there should be sufficient detail in a journal paper to achieve essentially the same results, since there's (at least officially) not a space issue. On the other hand, no one really publishes in journals in our subfamily of CS.

The next thing one can do is to release the software associated with a paper. I've tried to do this in a handful of cases, but it can be a non-trivial exercise. There are a few problems. First, there's the question of how polished the software you put out should be. Probably my most polished is megam (for learning classifiers) and the least polished is DPsearch (code from my AI stats paper). It was a very nontrivial amount of effort to write up all the docs for megam and so on. As a result, I hope that people can use it. I have less hope for DPsearch --- you'd really have to know what you're doing to rip the guts out of it.

Nevertheless, I have occasionally received copies of code like my DPsearch from other people (i.e., unpolished code) and have still been able to use them successfully, albeit only for ML stuff, not for NLP stuff. ML stuff is nice because, for the most part, it's self-contained. NLP stuff often isn't: first you run a parser, then you have to have WordNet installed, then you have to have 100MB of data files, then you have to run scripts X, Y and Z before you can finally run the program. The work I did for my thesis is a perfect example of this: instead of building all the important features into the main body of code I wrote, about half of them were implemented as Perl scripts that would essentially add "columns" to a CoNLL-style input format. At the end, the input was like 25-30 columns wide, and if any were missing or out of order, bad things would happen. As a result, it's a completely nontrivial exercise for me to release this beast. The only real conceivable option would be to remove the non-important scripts, get the important ones back into the real code, and then release that. But then there's no way the results would match exactly those from the paper/thesis.

I don't know of a solution to this problem. I suppose it depends on what your goal is. One goal is just to figure out some implementation details so that you can use them yourself. For this, it would be perfectly acceptable in, say, my thesis situation, to just put up the code (perhaps the scripts too) and leave it at that. There would be an implicit contract that you couldn't really expect too much from it (i.e., you shouldn't expect to run it).

A second goal is to use someone else's code as a baseline system to compare against. This goal is lessened when common data is available, because you can compare to published results. But often you don't care about the common data and really want to see how it works on other data. Or you want to qualitatively compare your output to a baseline. This seems harder to deal with. If code goes up to solve this problem, it needs to be runnable. And it needs to achieve pretty much the same results as published, otherwise funny things happen ("so and so reported scores of X but we were only able to achieve Y using their code", where Y < X). This looks bad, but is actually quite understandable in many cases. Maybe the solution here is, modulo copyright restrictions and licensing problems (ahem, LDC), to just put up your model's output as well. This doesn't solve the direct problem, but maybe helps a bit. It also lets people see where your model screws up, so they can attempt to fix those problems.

03 February 2007

To err is human, but what about researchers?

Errors happen and sometimes get into papers. A recent example is the JAIR paper I had with Daniel on Domain Adaptation last year. I actually didn't catch the error myself -- it was caught by someone who was reimplementing the technique. And it's a totally not-insignificant error: essentially, the update equation for the generative parameters is completely botched. If you look through the derivation in the Appendix, it's clear where the error crept in.

Thankfully, this sort of error is essentially a typo. That is, the error was introduced when I was typing up the paper, not when I was doing the research. Why this is important is that it means that the implementation reflects the correct updates: only the paper has the mistake. This means that the experimental results from the paper are valid, contingent on the fact that you rederive the updates yourself, or just ask me what they should be.

I'm writing this post because it's somewhat unclear what to do when such a thing arises. One temptation is to do nothing. I have to admit that I was completely embarrassed when this was pointed out to me. There was a part of me that wanted to ignore it. It seems that this is the wrong approach for a variety of reasons, not the least of which is to make sure that correct information does get out. The question, to some degree, is exactly how to do this. I have a blog, which means I can write an entry like this. I can also put an errata on my web page that points out the errors (I'm writing this up as we "speak"). Given that this is a pub in an online journal, I believe I am able to submit updates, or at least additional appendices, which means that the "official version" can probably be remedied.

But what about conference pubs? If this had appeared in ACL and I didn't have a blog, the situation would be somewhat different (ironically, an earlier version with the correct updates had been rejected from ACL because the derivations were omitted for space and two reviewers couldn't verify them). Also, what if someone hadn't pointed it out to me? I certainly wouldn't have noticed -- that paper was behind me. But then anyone who noticed the errors might dismiss the results on the grounds that they could assume that the implementation was also incorrect (it's not inconceivable that an erroneous implementation can still get good results). This would also not be good because the idea in the paper (any paper with such errors) might actually be interesting.

False things are published all the time. The STOC/FOCS community (i.e., the theory community) has a handful of examples...for them, errors are easy to identify because you can prove the opposite of any theorem. I recall hearing of a sequence of several papers that incrementally used results from a previous one, but the first was in error, putting the rest in error (I also recall hearing that many of the subsequent results could be salvaged, despite the ancestral mistake).

I don't know if there's a good solution, given our publication mechanisms (essentially, publish-once-then-appear-in-the-anthology). But I'm pretty sure mine is not the first paper with such errors. At least I hope not :).