20 November 2006

Feature engineering (with a coreference focus)

I gave a talk today at our small group meeting about feature engineering for coreference resolution problems. I think it is often underappreciated by people (myself occasionally included) who spend a lot of time working on fancy machine learning algorithms that the difference between success and failure often hinges more on clever features than on fancy learning. For those who want to look at the slides, OpenOffice and PDF versions are available for your perusal. I forewarn that they may be a bit hard to follow without the soundtrack, but they're sort of an "untold story" version of the coref paper we had at HLT/EMNLP 2005.

So what did we do that's clever (or, at least, clever as far as I knew at the time and know now)? Well, a few things. First, I tried extending the notion of Hobbs' distance to discourse trees. This actually works surprisingly well. The true referent is the first element according to syntactic Hobbs' distance in about 70-80% of cases, and we see (almost) the same thing at the discourse level. The only oddity is that you have to flip the nuclearity of the attribution relation; then you get about 80% holding, assuming perfect discourse trees. Of course, we never have perfect discourse trees, and the noise introduced by poor parsing is enough to make this feature more or less useless.

The second thing we tried was to use the name/instance data that Mike Fleischman gathered at ISI before going off to MIT. This data was gathered in the context of Q/A, to answer questions like "Who is the president of the U.S.?" using lookup tables. But it's great data for coref! It includes about two million examples of names paired with their occupations; you may find it interesting too. Mike describes it more in his ACL 2003 paper. These name/instance pairs make a huge difference in coref performance.

The next thing that helped a lot was a clever, non-standard use of gazetteers. What we do is take the ID of the gazetteer to which the anaphor belongs and pair it with the lexical item of the antecedent. Why? Well, suppose we have a gazetteer that contains words like "California," "Utah," and "New York." By pairing this list id (e.g., "list-5") with antecedents, we get features like "list-5 *AND* state" or "list-5 *AND* country," etc. We'll probably learn that the former is a good feature and the latter is not. We also have a feature that checks whether the two referents are on the same gazetteer but are not the same word. Why? This is a negative feature: "California" and "Utah" appear on the same list (states) but are not identical (string-wise), so they're probably not coreferent.
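Concretely, the pairing looks something like the following sketch (the lists, their ids, and the example mention pairs are made up for illustration; the real system used many more gazetteers):

```python
# A sketch of the gazetteer-pairing features. Lists and ids are hypothetical.
gazetteers = {
    "list-5": {"california", "utah", "new york"},   # hypothetical list of U.S. states
    "list-9": {"france", "japan", "russia"},        # hypothetical list of countries
}

def gazetteer_features(anaphor, antecedent):
    """Pair the id of every list containing the anaphor with the antecedent's
    lexical item; also fire a negative feature when both mentions are on the
    same list but are different strings."""
    a, b = anaphor.lower(), antecedent.lower()
    feats = []
    for list_id, entries in gazetteers.items():
        if a in entries:
            feats.append(f"{list_id}_AND_{b}")              # e.g. "list-5_AND_state"
        if a in entries and b in entries and a != b:
            feats.append(f"same-list-diff-word_{list_id}")  # negative evidence
    return feats

print(gazetteer_features("California", "state"))   # ['list-5_AND_state']
print(gazetteer_features("California", "Utah"))    # pairing feature plus the negative one
```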

We tried pretty hard to get WordNet to work, but it just kept not helping. WordNet distance is a terrible measure (e.g., "Japan" and "Russia" have distance 2 -- they are both under "country"). On the other hand, as in the gazetteer case, "nearby in WordNet but not the same word" is a reasonably good feature, though it turned out not to help much. Hypernym and hyponym features were the most promising, but we never actually got any benefit from them.
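For what it's worth, a "nearby in WordNet but not the same word" feature can be sketched with NLTK's WordNet interface along these lines; the distance threshold is an illustrative choice, not the value we used, and the exact distances depend on the WordNet version:

```python
# A rough sketch of the "nearby in WordNet but not the same word" feature,
# assuming NLTK and its WordNet data are installed. max_dist is illustrative.
from nltk.corpus import wordnet as wn

def wn_near_but_not_same(word1, word2, max_dist=4):
    if word1.lower() == word2.lower():
        return False
    best = None
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            sim = s1.path_similarity(s2)   # 1 / (1 + shortest hypernym-path length)
            if sim is not None:
                dist = 1.0 / sim - 1.0
                best = dist if best is None else min(best, dist)
    return best is not None and best <= max_dist

# e.g. check a pair like the one above; how close it comes out depends on
# the WordNet version and the threshold.
print(wn_near_but_not_same("Japan", "Russia"))
```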

The rest of the stuff in the talk made it fairly close to the front of the paper. The most surprising result was that count-based features help a lot! These features look at things like the entity-to-mention ratio, the number of entities detected so far, etc. They can't be included in a simple coref model that makes pairwise decisions, but for us it's easy.
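For concreteness, here's roughly the kind of thing I mean (a minimal sketch; the feature names and the representation of the partial clustering are made up for illustration):

```python
# A minimal sketch of count-based features over a partial coreference
# clustering: a list of entities, each entity being a list of mentions.
def count_features(partial_entities, num_mentions_so_far):
    feats = {}
    num_entities = len(partial_entities)
    feats["num-entities"] = num_entities
    feats["entity-to-mention-ratio"] = (
        num_entities / num_mentions_so_far if num_mentions_so_far else 0.0
    )
    feats["num-singleton-entities"] = sum(1 for e in partial_entities if len(e) == 1)
    return feats

# Two entities over three mentions seen so far.
print(count_features([["Bush", "he"], ["Washington"]], 3))
```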

The end result is that using just string-based features (standard "substring" and edit distance stuff), lexical features (word pairs), count-based features and Mike's data, we can get an ACE score of 87.6 using LaSO for the learning (this goes up by about 0.8 in most cases if you use Searn, but I don't have all the numbers). Adding in some inference to predict things like number and gender gets you up to 88.3. Gazetteers get you up to 88.7. After that, the rest of the features (regular expression patterns, word clusters, WordNet, discourse, and syntax) don't help much: all together, they get you up to 89.1, but that's not much bang for the buck.

My conclusion is basically that while learning makes a difference (Searn is a bit better than LaSO), the feature engineering makes a bigger one[*]. Many of the features we found useful were a priori not really that obvious. I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair). It's definitely a non-trivial enterprise, but it often pays off.

[*] This is assuming that your learning is robust enough to handle the features you want: for instance, the standard pairwise models could not handle the count-based features that helped us a lot.

17 November 2006

ICML 2007 webpage up

ICML 2007 will be held in Oregon, June 20-24; more information is on the newly published webpage. (Sadly, this overlaps with ACL in Prague, with a non-trivial flight in between.)

13 November 2006

Any Stem Cell Research?

I was listening to NPR a few weeks ago while pre-election news was still accosting my ears. There was apparently a measure on the ballot in a lot of states (though, shockingly, not mine) referencing stem cell research. The title of the measure was "Should any stem cell research be allowed?"

Okay, quick quiz: how did you first interpret that sentence? If you were in favor of certain varieties, but not all varieties of stem cell research, would you vote for it? In other words, does "any" mean "at least one type of" or "all types of"?

To me, and at least a few other people I've talked to, the preferred interpretation is "all"...it's also much easier to reach this interpretation if you put a bit of stress on "any." But it's also really easy to get the other interpretation by unstressing "any" or by putting another sentence in front of it (e.g., "Stem cell research in all forms is purely evil." :P).

The strange thing is that I'm really having a hard time coming up with other examples where the preference for "all" is so strong (even with the stress). It seems that the fact that stem cell research is a class of things, rather than a single thing, is important. It also seems that the presence of "should" is pretty much necessary. But even so, I really can't get anything else to come out with such a strong preference for "all".

So what does this have to do with NLP? Well, not much, other than that language is hard. Something like this would probably kill a textual entailment system, but given that it's somewhat ambiguous even to people (the degree of ambiguity is person-relative, though: a Brit here tells me that he has a really hard time getting the "all" interpretation at all), maybe there's just nothing that can be done about it.

07 November 2006

Why Speech Summarization?

The vast majority of work on summarization is at the text level. A document (or collection of documents) comes in as text, and a summary goes out as text. Although I'm certainly not one of the ones pushing summarization of speech, I think it's quite an interesting task. (Here, by speech summarization, I pretty much mean speech in, speech out ... though similar arguments could be made for text in, speech out applications.) A lot of my conclusion comes from personal experience, but I also had a really great conversation with Alan Black (who works on speech synthesis problems).

Let me motivate this with a personal anecdote. My mom tends to leave really long voicemail messages. Like, really long. Usually she overruns the allotted time and has to call back. It's not uncommon for her messages to span three mailbox slots. (Sorry, mom, if you're reading this!) The most interesting thing about these messages is that they really don't contain that much information. In fact, there are usually only one or two sentences in the whole thing that really matter (at least in an "information" sense).

The problem is that you can't "skim" voicemail. If I get an equally long email (say, one that would take me 3 minutes to read), I can fairly easily skim it to get the gist. Especially with some tailored highlighting, I'm able to absorb much more text per second than speech per second. Word on the street (or at least around the croissant table at ACL) is that studies have been done showing that people can get at information from text just as fast if it's presented in a summary as if it's not summarized at all, essentially because we're really good at skimming (I knew high school English taught me something!). (Incidentally, if anyone has a reference for this, let me know. I believe I saw it ages ago, but I can't remember where and I'd love to have it.)

So one solution to the "Mom's message" problem is to run automatic speech recognition and turn it into an email. This has the added advantages that I check my email much more frequently than I check my voicemail, that it gives me a record of the message after I've deleted the speech signal to save space, and that it allows for indexing. But there are also times when I don't have easy access to a computer and I really do want to listen to the voicemail. This is where speech summarization becomes interesting.

Anyway, if you're now convinced that speech summarization is interesting, well, I don't really know what to tell you, since I really haven't looked at this problem. The work I am aware of (and this is a very biased sample...I apologize if I've missed something) is:

Additionally, a quick search turned up a recent workshop at Interspeech on Speech Summarization, which probably has more links and information than I can easily produce myself.

01 November 2006

Getting Started In: Sequence Labeling

Okay, at long last it's time for another edition of "Getting Started In" (p.s., if you want to write one of these for your area of expertise, I'm very open to it!).

So, what is sequence labeling? It's the meta-problem that we face all the time in NLP tasks where we want to assign a single label to each element in a sequence. For us, a sequence is usually a sentence and a word is an element. The labels we are trying to assign are usually things like parts of speech, syntactic chunk labels (is this part of a noun phrase, verb phrase, etc.), named entity labels (is this a person?) and so on. Information extraction systems (e.g., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.

There are roughly two varieties of sequence labeling: (1) raw labeling and (2) joint segmentation and labeling. The latter is (IMO) more common. Raw labeling is something like POS tagging, where each element gets a single tag. Joint segmentation and labeling is where whole segments get the same label. For instance, in named entity recognition, a sentence like "Yesterday , George Bush gave a speech ." contains one named entity ("George Bush"). Here, we want to assign the label "PERSON" to the entire phrase "George Bush", not to individual words (for instance, if two named entities abut, we need to know where one ends and the next begins).

The easiest way to deal with segmentation problems is to transform them into raw labeling problems. The standard way to do this is the "BIO" encoding, where we label each word "B-X", "I-X" or "O". Here, "B-X" means "begin a phrase of type X", "I-X" means "continue a phrase of type X" and "O" means "not in a phrase." For instance, the Bush sentence would be labeled as: "Yesterday/O ,/O George/B-PER Bush/I-PER gave/O a/O speech/O ./O" One can now treat this as a raw labeling problem, perhaps being careful to avoid producing impossible sequences at test time (e.g., an "I-X" can only follow a "B-X" or another "I-X"). Other encodings are also possible. So for now, I'll concentrate on raw labeling and revisit the true joint segmentation problem at the end.
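Here's a small sketch of the BIO encoding, using the Bush sentence from above (the span representation is just one convenient choice):

```python
# Convert labeled spans over token indices into BIO tags.
def spans_to_bio(tokens, spans):
    """spans: list of (start, end_exclusive, label) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = "Yesterday , George Bush gave a speech .".split()
print(list(zip(tokens, spans_to_bio(tokens, [(2, 4, "PER")]))))
# -> George/B-PER Bush/I-PER, everything else O
```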

In raw labeling, we're faced with the problem of assigning a single label to each word in a sequence. Perhaps the easiest approach is to predict each label independently. Then we just have a collection of multiclass classification tasks (each label is a different class), and we can use whatever classifier we want. Despite its simplicity, this approach is actually quite effective for many problems.
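As a minimal sketch of this independent-classification baseline, using scikit-learn as a stand-in for "whatever classifier we want" (the toy features and training data here are purely illustrative):

```python
# Predict each tag independently with a multiclass classifier.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_feats(tokens, i):
    return {"w": tokens[i], "is-cap": tokens[i][0].isupper(), "pos-in-sent": i}

train_sents = [("Yesterday , George Bush gave a speech .".split(),
                ["O", "O", "B-PER", "I-PER", "O", "O", "O", "O"])]

X = [token_feats(toks, i) for toks, tags in train_sents for i in range(len(toks))]
y = [t for _, tags in train_sents for t in tags]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = "George gave a talk .".split()
print(clf.predict(vec.transform([token_feats(test, i) for i in range(len(test))])))
```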

Intuitively, we usually believe that the labels in these problems are not independent. For instance, in POS tagging, it's basically impossible to have a verb immediately following a determiner. We would therefore like to use some local sequence information to improve our performance. The "old school" approach to doing this is to use a hidden Markov model (HMM). I'll refer you to Manning+Schutze for details here (keeping in mind that what we're really using is just a Markov model...in these problems, there is nothing that's hidden). But the basic idea is that we have two probability distributions: a transition distribution (how likely is it that a verb will follow a determiner?) and an emission distribution (how likely is it that we'll see the word "the" given that we know this word should be a determiner?). If our transition probabilities are local, then the Viterbi algorithm will run efficiently at test time. This locality is referred to as the "Markov assumption." Specifically, a first-order Markov assumption says that the label at time t+2 is independent of the label at time t given the label at time t+1. The higher the Markov order you use, the harder decoding becomes (complexity-wise).
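To make the decoding side concrete, here is a compact Viterbi sketch for a first-order model. The log-probability tables are assumed given (in the visible Markov model case they would just be smoothed relative frequencies from training data), and the plain-dict representation is only for illustration:

```python
# Viterbi decoding for a first-order Markov model over tags.
def viterbi(words, tags, log_trans, log_emit, log_start):
    """log_trans[(prev, cur)], log_emit[(tag, word)], log_start[tag]."""
    V = [{t: log_start[t] + log_emit.get((t, words[0]), -1e9) for t in tags}]
    back = []
    for w in words[1:]:
        scores, ptrs = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p] + log_trans.get((p, t), -1e9))
            scores[t] = (V[-1][best_prev] + log_trans.get((best_prev, t), -1e9)
                         + log_emit.get((t, w), -1e9))
            ptrs[t] = best_prev
        V.append(scores)
        back.append(ptrs)
    # Backtrace from the best final tag.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```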

The potential problem with using HMMs is that the emission probabilities are of the form p(word | tag), where we'd really like to model p(tag | word). The latter is preferable because it's easier to include lots of overlapping features (capitalization, word identity, prefixes, suffixes, stems, etc.). The partial solution is to use a maximum entropy Markov model (MEMM), where we model p(tag | word) using a maximum entropy model, but keep everything else as in an HMM. MEMMs are only slightly more complex to train than HMMs, but work a whole lot better. At training time, we essentially include features having to do with the previous labels, but otherwise this is just as in the independent classifiers approach. Viterbi search still runs at test time, so we're limited to the same Markov order constraints as HMMs.
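A sketch of the MEMM idea, reusing the scikit-learn setup from the earlier sketch: the same kind of maxent classifier, but with the previous tag included as a feature. For brevity this decodes greedily left-to-right rather than with full Viterbi, and the features are again toy ones:

```python
# MEMM sketch: maxent classifier with the previous tag as a feature.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def memm_feats(tokens, i, prev_tag):
    return {"w": tokens[i], "is-cap": tokens[i][0].isupper(), "prev-tag": prev_tag}

def train_memm(sents):
    X, y = [], []
    for toks, tags in sents:
        for i in range(len(toks)):
            prev = tags[i - 1] if i > 0 else "<s>"   # gold previous tag at train time
            X.append(memm_feats(toks, i, prev))
            y.append(tags[i])
    vec = DictVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
    return vec, clf

def memm_decode(vec, clf, tokens):
    tags, prev = [], "<s>"
    for i in range(len(tokens)):
        prev = clf.predict(vec.transform([memm_feats(tokens, i, prev)]))[0]
        tags.append(prev)
    return tags
```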

The potential problem with MEMMs (noticing a trend here? :P) is that the models are trained against CORRECT previous labels. That is, when we create a classification example corresponding to the label at time t+1, we include features that depend on the label at time t; these are always correct at training time, but can be wrong at test time. This leads to the infamous "label bias" problem. The conditional random field (CRF) is essentially an answer to this problem. Instead of training to predict each label independently and then running Viterbi at test time, we train to get the whole sequence right. (This is a good instance of having identical training and testing situations.) Training CRFs is a bit of a bore, since each iteration of training requires one to run the forward-backward algorithm over each training instance. But CRFs do often perform better than plain MEMMs. The UMass group has been nice enough to release their CRF implementation.
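To see why each iteration is expensive, here is a sketch of the forward half of forward-backward for a linear chain; the score tables are assumed given, and the backward pass is analogous (together they give the marginals needed for the gradient):

```python
# Forward pass for a linear-chain model: computes log Z (the log normalizer),
# which a CRF needs for every training sentence on every iteration.
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_z(node_scores, trans_scores, tags):
    """node_scores[i][t]: score of tag t at position i;
       trans_scores[(p, t)]: score of tag p followed by tag t."""
    alpha = {t: node_scores[0][t] for t in tags}
    for i in range(1, len(node_scores)):
        alpha = {t: node_scores[i][t] +
                    log_sum_exp([alpha[p] + trans_scores.get((p, t), 0.0)
                                 for p in tags])
                 for t in tags}
    return log_sum_exp(list(alpha.values()))
```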

Once we get to CRFs, we can play a bunch of other games. For instance, we can essentially come up with a margin-based version of the CRF. This is roughly like moving from maxent to SVMs. The result is either max-margin Markov networks or SVMstruct, depending on exactly how you formulate the problem. The latter has a nice implementation available to play with.

So that's a whole lot of learning, but where the action really is is in the features. For me, there's essentially a standard feature set I always use for these problems, and then add to and subtract from as I see fit. The features I usually use are the following (with examples based on the word "George"): the exact word ("George"), the stem ("george"), prefixes of length 1-3 ("G", "Ge" and "Geo"), suffixes of length 1-3 ("e", "ge" and "rge") and some regexps. The most useful regexp transforms all capital letters to "A", all lower-case letters to "a", all digits to "0", and all punctuation to ".", and then collapses identical adjacent characters. So George -> Aaaaaa -> Aa. Finally, I use lists of people, places and organizations collected from various gazetteers and check whether each word falls in any of these lists. (Incidentally, these are available as part of my TagChunk program for solving joint tagging/chunking problems.) All of these features are typically applied in a window of +/- 1, 2 or 3 words around the given word. You can see exactly what I calculate in this perl script.
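Here is a sketch of that feature set in Python (a rough approximation of what the perl script computes; the lowercasing is a stand-in for a real stemmer, and the gazetteer lookups are omitted):

```python
# Standard word-level features: word, stem stand-in, affixes, word shape,
# all applied in a window around the current position.
import re

def shape(word):
    s = re.sub(r"[A-Z]", "A", word)
    s = re.sub(r"[a-z]", "a", s)
    s = re.sub(r"[0-9]", "0", s)
    s = re.sub(r"[^Aa0]", ".", s)
    return re.sub(r"(.)\1+", r"\1", s)        # collapse adjacent repeats: Aaaaaa -> Aa

def word_feats(word):
    feats = {"w": word, "stem": word.lower()}  # lowercasing stands in for a stemmer
    for k in (1, 2, 3):
        feats[f"pre{k}"] = word[:k]
        feats[f"suf{k}"] = word[-k:]
    feats["shape"] = shape(word)
    return feats

def window_feats(tokens, i, width=2):
    feats = {}
    for off in range(-width, width + 1):
        if 0 <= i + off < len(tokens):
            for k, v in word_feats(tokens[i + off]).items():
                feats[f"{k}@{off}"] = v
    return feats

print(word_feats("George")["shape"])   # -> "Aa"
```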

The final issue in features is what to do with the transition features. In particular, one can think of putting the lexical features on "nodes" or "edges". By putting them on nodes, I mean that you have features like "previous-tag-is-DT current-word-is-George current-case-is-Aa". But you can also do a conjunction thing and say "previous-tag-is-DT current-word-is-George-and-previous-tag-is-DT current-case-is-Aa-and-previous-tag-is-DT". This obviously blows up your feature space, but if you have enough data, it's almost always worth doing.
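In code, the conjunction trick is just this sort of thing (a sketch; the feature naming is arbitrary):

```python
# Conjoin every lexical feature with the previous tag ("edge" features),
# while keeping the plain "node" versions as well.
def add_prev_tag_conjunctions(feats, prev_tag):
    out = dict(feats)                                        # node features
    out[f"prev-tag-is-{prev_tag}"] = 1
    for k, v in feats.items():
        out[f"{k}={v}_AND_prev-tag-is-{prev_tag}"] = 1       # edge (conjoined) features
    return out

print(add_prev_tag_conjunctions({"w": "George", "shape": "Aa"}, "DT"))
```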

Now, getting back to the segmentation issue. The fact that there are multiple valid encodings for mapping segmentation -> raw labeling is probably a bad thing. It seems that, ideally, we'd like to learn to do the segmentation directly. This is really not that hard, and it comes with a bunch of benefits, especially in what features you can employ. For instance, you can check whether the whole string "George Bush" falls in a list of names, or notice that both words are capitalized. These are both great "phrase-based" features. More evidence has been published that treating these problems directly as segmentation problems is beneficial.
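For instance, segment-level features might be computed along these lines (a sketch; the name list is a made-up stand-in for a real gazetteer):

```python
# Phrase-based features over a candidate segment, of the kind a direct
# segmentation model can use but a word-at-a-time tagger cannot.
PERSON_LIST = {"george bush", "bill clinton"}   # stand-in for a real gazetteer

def segment_feats(tokens, start, end, label):
    phrase = tokens[start:end]
    return {
        f"len={end - start}_AND_label={label}": 1,
        f"all-caps-init_AND_label={label}": int(all(w[0].isupper() for w in phrase)),
        f"in-person-list_AND_label={label}": int(" ".join(phrase).lower() in PERSON_LIST),
    }

toks = "Yesterday , George Bush gave a speech .".split()
print(segment_feats(toks, 2, 4, "PER"))
```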