27 September 2007

Bootstrapping

There are many bootstrapping algorithms, but they all have (roughly) the same general form. Suppose we want to solve a binary classification problem. We do the following:
  1. Build (by hand) a classifier that predicts positive with high precision (and low recall)
  2. Build (by hand) a classifier that predicts negative with high precision (and low recall)
  3. Apply (1) and (2) to some really huge data set leading to a smaller labeled set (that has high precision)
  4. Train a new classifier on the output of (3)
  5. Apply the new classifier to the original data set, take its confident predictions as the new labeled set, and go to (4)
Note that the hand-built classifiers can alternatively be replaced by a hand-labeling of a small number of "representative" examples. (A rough sketch of this loop in code follows below.)
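
Concretely, the whole loop might look something like the following sketch (in Python). Everything in it is a stand-in of my own invention -- the seed rules, the learner, and the confidence threshold are placeholders for whatever a particular system actually uses, not a description of anyone's implementation.

def bootstrap(unlabeled, positive_rule, negative_rule, train, predict,
              n_iters=5, threshold=0.9):
    """Steps (1)-(5) above: seed with two high-precision hand-built rules,
    then repeatedly retrain on the new classifier's confident predictions."""
    # Steps (1)-(3): apply the hand-built rules to the big unlabeled pool
    # to get a small, high-precision labeled set.
    labeled = [(x, 1) for x in unlabeled if positive_rule(x)] + \
              [(x, 0) for x in unlabeled if negative_rule(x)]

    model = None
    for _ in range(n_iters):
        # Step (4): train a new classifier on the current labeled set.
        model = train(labeled)

        # Step (5): apply it to the original data set, keeping only confident
        # predictions as the labeled set for the next round.
        labeled = []
        for x in unlabeled:
            label, confidence = predict(model, x)
            if confidence >= threshold:
                labeled.append((x, label))
    return model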

I've never actually used such algorithms before, but I hear that they work pretty well.

There's always been one thing that's surprised me, though. And that's the fact that they work pretty well.

There are two reasons why the fact that these algorithms work surprises me. First, there is often a danger that the classifier learned in (4) simply memorizes what the rule-based classifier does. Second, there is a domain adaptation problem.

To be more explicit, let me give an example. Suppose I want to classify sentences as subjective or objective. This is a well-studied problem to which bootstrapping algorithms have been applied. The first two steps involve inventing rule-based classifiers that get high precision on the "predict-subjective" and "predict-objective" tasks. One way to do this is to create word lists. Get a list of highly subjective words and a list of highly objective words (okay, this latter is hard and you have to be clever to do it, but let's suppose that we could actually do it). Now, to make our rule-based classifier, we might say: if a sentence contains at least two subjective words and no objective words, call it subjective; if a sentence contains at least two objective words and no subjective words, call it objective; else, punt.
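
As a concrete (toy) sketch of such a rule, this is the kind of thing I have in mind. The word lists here are tiny made-up stand-ins -- building good lists is exactly the hard part -- so treat this purely as illustration.

# Toy word lists; real ones would be much larger and much harder to build.
SUBJECTIVE_WORDS = {"wonderful", "awful", "love", "hate", "beautiful"}
OBJECTIVE_WORDS = {"reported", "announced", "measured", "located"}

def rule_label(sentence):
    """Return 'subjective', 'objective', or None (punt), per the rule above."""
    tokens = sentence.lower().split()
    n_subj = sum(tok in SUBJECTIVE_WORDS for tok in tokens)
    n_obj = sum(tok in OBJECTIVE_WORDS for tok in tokens)
    if n_subj >= 2 and n_obj == 0:
        return "subjective"
    if n_obj >= 2 and n_subj == 0:
        return "objective"
    return None  # punt: this sentence contributes nothing to the labeled set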

Now, we apply this word-lookup-based classifier on a large data set and extract some subset of the sentences as labeled. In the very simplest case, we'll extract bag-of-words features from these sentences and train up a simple classifier.
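
For instance, something along these lines, shown with scikit-learn purely for brevity (the library postdates this post) and reusing the toy rule_label from the sketch above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_on_rule_labels(corpus):
    """corpus: a list of raw sentences (the really huge data set)."""
    # Keep only the sentences the rule-based classifier was willing to label.
    labeled = [(s, rule_label(s)) for s in corpus]
    labeled = [(s, y) for s, y in labeled if y is not None]
    sentences, labels = zip(*labeled)

    # Bag-of-words features over the rule-labeled subset.
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(sentences)

    classifier = MultinomialNB()
    classifier.fit(features, list(labels))
    return vectorizer, classifier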

This is where things go haywire for my understanding.

First, by using a bag-of-words model, there is always the possibility that the classifier we learn will simply mimic the rule-based classifiers on the training data and do random stuff on everything else. In other words, since the information used to create the rule-based classifier exists in the training data, there's no guarantee that the classifier will actually learn to generalize. Essentially what we hope is that there are other, stronger, and simpler signals in the data that the classifier can learn instead.

One might think that to get around this problem we would remove from the bag of words any of the words from the initial word lists, essentially forcing the classifier to learn to generalize. But people don't seem to do this. In fact, in several cases that I've seen, people actually include extra features of the form "how many subjective (resp. objective) words appear in this sentence." But now we have a single feature that will allow us to replicate the rule-based classifier. This scares me greatly.
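
To be concrete, the kind of feature I mean is just a direct count of word-list hits (reusing the toy lists from the earlier sketch). With these two numbers available, the learner can reproduce the seed rule exactly -- "n_subj >= 2 and n_obj == 0" -- rather than being forced to find other signals.

def count_features(sentence):
    """The worrying features: direct counts of word-list hits."""
    tokens = sentence.lower().split()
    return {
        "num_subjective_words": sum(tok in SUBJECTIVE_WORDS for tok in tokens),
        "num_objective_words": sum(tok in OBJECTIVE_WORDS for tok in tokens),
    }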

Second, our goal from this process is to learn a classifier that has high accuracy on the distribution of "all English sentences (perhaps from some particular domain)." But the classifier we train in the first iteration is trained on the distribution of "all English sentences (perhaps...) such that the sentence contains >=2 subjective and 0 objective words, or >=2 objective and 0 subjective words." The fact that this will generalize is totally not obvious.

And yet these things seem to work pretty well. I just don't quite understand why.

20 September 2007

Mark-up Always the Wrong Tree?

Almost a year ago I responded to a very interesting article in CL. The substance of the article is that we have to be careful when we annotate data lest we draw incorrect conclusions. In this post I'm going to take a more extreme position. It's not necessarily one I agree with 100%, but I think it's worth more than just a brief consideration.

Proposition: mark-up is always a bad idea.

That is: we should never be marking up data in ways that it's not "naturally" marked up. For instance, part-of-speech tagged data does not exist naturally. Parallel French-English data does. The crux of the argument is that if something is not a task that anyone performs naturally, then it's not a task worth computationalizing.

Here's why I think this is a reasonable position to take. In some sense, we're striving for machines that can do things that humans do. We have little to no external evidence that when humans (for instance) perform translation, they also perform part-of-speech tagging along the way. Moreover, as the CL article mentioned above nicely points out, it's very easy to confuse ourselves by using incorrect representations, or by being lazy about annotating. We may be happy to speculate that humans build up some sort of syntactic representation of sentences inside their heads (and, yes, there is some psychological evidence for something that might correlate with this). But the fact is, simply, that all we can observe are the inputs and outputs of some processes (e.g., translation) and that we should base all of our models on these observables.

Despite the fact that agreeing with this proposition makes much of my own work uninteresting (at least from the perspective of doing things with language), I find very few holes in the argument.

I think the first hole is just a "we're not there yet" issue. That is: in the ideal world, sure, I agree, but I don't think we yet have the technology to accomplish this.

The second hole, which is somewhat related, is that even if we had the technology, working on small problems based on perhaps-ill-conceived data will give us insight into important issues. For instance, many summarization people believe that coreference issues are a big problem. Sure, I can imagine an end-to-end summarization system that essentially treats coreference as a "latent variable" and never actually looks at hand-annotated coref data. On the other hand, I have no idea what this latent variable should look like, how it should be influenced, etc. The very process of working on these small problems (like "solving" coref on small annotated data sets) gives us an opportunity to better understand what goes into these problems.

The hole with the second hole :) is the following. If this is the only compelling reason to look at these sub-problems, then we should essentially stop working on them once we have a reasonable grasp. Not to be too hard on POS tagging, but I think we've pretty much established that we can do this task and we know more or less the major ins and outs. So we should stop doing it. (Similar arguments can be made for other tasks; e.g., NE tagging in English.)

The final hole is that I believe that there exist tasks that humans don't do simply because they're too big. And these are tasks that computers can do. If we can force some humans to do these tasks, maybe it would be worthwhile. But, to be honest, I can't think of any such thing off the top of my head. Maybe I'm just too closed-minded.

11 September 2007

Journals are on the Mind

Anyone who has ever read this blog before knows that I'm a huge supporter of moving our Computational Linguistics journal to open access. Those who talk to me know I'm in favor of more drastic measures, but I would be content with just this one small change. I'm not here to beat a horse (living or dead), but to say that this appears to be something on many people's minds. I just got an email about voting for new members of the ACL board. I think only members get to vote, but I don't see anything that says that everyone can't view the statements. The list of candidates with their statements is here.

What I find promising is how often open access CL is mentioned! Indeed, both candidates for VP-elect say something. Ido writes:
Among other possibilities, the proposal to make CL open access, with a shorter reviewing cycle, seems worth pursuing. Additionally, we can increase the annual capacity of journal papers and include shorter ones.
And Jan writes:
Third, ACL's role is to help the widest possible dissemination of the field's results, including novel ways such as electronic and open-access publications, training workshops and summer schools (to attract excellent "new blood").
The issue comes up again for one of the Exec members. Hwee Tou says "I believe some issues of current concern to the ACL membership include the role of open access journals..."

Obviously I'm not here to tell anyone who to vote for. Indeed, since both candidates for VP mention the issue, it's not really a dividing point! But I'm very very very pleased to see that this has become something of an important front.

05 September 2007

Word order and what-not

I've recently been looking at some linguistic issues related to word order phenomena. This is somewhat in line with the implicational universals stuff that I am working on with Lyle Campbell, but also somewhat tangential.

Here's a little background.

Most linguists tend to believe in things like nouns, verbs, adjectives, subjects and objects, genitives, adpositions (prepositions and postpositions), etc. Indeed, these are some of the basic building blocks of typological studies of word order. A common example is that languages that are OV (i.e., the object precedes the verb) are also postpositional (think Hindi or Japanese). On the other hand, VO languages are also prepositional (think English).

The general set of orders considered important is: V/O, Adj/N, Gen/N, PrepP/PostP, and a few others. However, one of these is underspecified. In particular, PrepP/PostP tells us nothing about the placement of the adpositional phrase itself with respect to the head it modifies.

For instance, in English, which is PrepP, the head precedes the PP ("The man *with* the axe", axe comes after man). Or, in Japanese, which is PostP, the head comes after the PP (glossed: "the axe *with* the man" -- meaning that the man has the axe). However, it is unclear whether other orders are possible. For instance, are there PrepP languages in which "with the axe" ("with" has to come before "the axe" in a PrepP language) precedes "the man", something like "with the axe the man"? Or, are there PostP languages in which "the axe-with" comes after the head ("the man the axe-with")? I certainly don't know of any, but I only know enough about maybe 4 or 5 languages out of around 7000 to tell. It seems like interpretation in such a language would be difficult, but of course that doesn't stop German from separating verbs and auxiliaries, and indeed Germans don't seem to have a hard time understanding each other.

A second thing that is left unanswered is how these relate to each other. Consider Gen/N and Adj/N. If you are a GenN + AdjN language, which comes first? In English, the Gen has to come first ("The man's happy brother", not "The happy man's brother" -- the latter means that it's the man, not the brother, who is happy). Is this order reversed in some languages, or are there any languages that allow both? Again, it seems like it would make interpretation difficult.

I've asked these two questions of the few typologists I know, and they haven't known the answers. I'm hoping that a query to the blogosphere (a word I hate) and a query to linguist-list will turn up something. I'll post anything I hear here.