Okay, at long last it's time for another edition of "Getting Started In" (p.s., if you want to write one of these for your area of expertise, I'm very open to it!).
So, what is sequence labeling? It's the meta-problem that we face all the time in NLP tasks where we want to assign a single label to each element in a sequence. For us, a sequence is usually a sentence and a word is an element. The labels we are trying to assign are usually things like parts of speech, syntactic chunk labels (is this part of a noun phrase, verb phrase, etc.), named entity labels (is this a person?), and so on. Information extraction systems (e.g., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.
There are roughly two varieties of sequence labeling: (1) raw labeling and (2) joint segmentation and labeling. The latter is (IMO) more common. Raw labeling is something like POS tagging, where each element gets a single tag. Joint segmentation and labeling is where whole segments get the same label. For instance, in named entity recognition, a sentence like "Yesterday , George Bush gave a speech ." contains one named entity ("George Bush"). Here, we want to assign the label "PERSON" to the entire phrase "George Bush", not to individual words (for instance, if two named entities abut, we need to know where one ends and the next begins).
The easiest way to deal with segmentation problems is to transform them into raw labeling problems. The standard way to do this is the "BIO" encoding, where we label each word "B-X", "I-X" or "O". Here, "B-X" means "begin a phrase of type X", "I-X" means "continue a phrase of type X" and "O" means "not in a phrase." For instance, the Bush sentence would be labeled as: "Yesterday/O ,/O George/B-PER Bush/I-PER gave/O a/O speech/O ./O" One can now treat this as a raw labeling problem, perhaps being careful to avoid producing impossible sequences at test time (e.g., an "I-X" can only follow a "B-X" or another "I-X"). Other encodings are also possible. So for now, I'll concentrate on raw labeling and revisit the true joint segmentation problem at the end.
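To make the encoding concrete, here is a minimal sketch of mapping labeled spans onto BIO tags (the function name and the span format, start/end/type with an exclusive end, are my own choices for illustration):

```python
# Minimal sketch: convert labeled spans into BIO tags.
def spans_to_bio(tokens, spans):
    """tokens: list of words; spans: list of (start, end, type), end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first word of the phrase
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining words of the phrase
    return tags

tokens = "Yesterday , George Bush gave a speech .".split()
print(spans_to_bio(tokens, [(2, 4, "PER")]))
# ['O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O']
```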
In raw labeling, we're faced with the problem of assigning a single label to each word in a sequence. Perhaps the easiest approach is to predict each label independently. Then, we just have a collection of multiclass classification tasks (each label is a different class), and we can use whatever classifier we want. Despite its simplicity, this approach is actually quite effective for many problems.
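Here is a hedged sketch of that independent-classifier baseline using scikit-learn (my choice of library, not something prescribed in the post); each word becomes one multiclass example, with no sequence structure at all:

```python
# Sketch: per-word multiclass classification, ignoring label dependencies.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def word_features(tokens, i):
    w = tokens[i]
    return {"word": w.lower(), "is_cap": w[0].isupper(), "suffix3": w[-3:]}

# A single toy training sentence in BIO form.
train_sents = [("Yesterday , George Bush gave a speech .".split(),
                ["O", "O", "B-PER", "I-PER", "O", "O", "O", "O"])]

X = [word_features(toks, i) for toks, tags in train_sents for i in range(len(toks))]
y = [tag for toks, tags in train_sents for tag in tags]

clf = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X, y)
print(clf.predict([word_features("George spoke .".split(), 0)]))
```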
Intuitively, we usually believe that the labels in these problems are not independent. For instance, in POS tagging, it's basically impossible to have a verb immediately following a determiner. We would therefore like to use some local sequence information to improve our performance. The "old school" approach to doing this is to use a hidden Markov model (HMM). I'll refer you to Manning+Schutze for details here (keeping in mind that really what we're using is just a Markov model... in these problems, there is nothing that's hidden). The basic idea is that we have two probability distributions: a transition distribution (how likely is it that a verb will follow a determiner?) and an emission distribution (how likely is it that we'll see the word "the" given that we know this word should be a determiner?). If our transition probabilities are local, then the Viterbi algorithm will run efficiently at test time. This locality is referred to as the "Markov assumption." Specifically, a first-order Markov assumption says that the label at time t+2 is independent of the label at time t, given the label at time t+1. The higher the Markov order, the more expensive decoding becomes.
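For reference, here is a compact Viterbi sketch for a first-order model in log space; the probability tables are toy numbers invented purely to illustrate the dynamic program, not estimates from any real corpus:

```python
# Sketch of Viterbi decoding for a first-order (bigram) tag model.
# log_start[t], log_trans[prev][t], log_emit[t][word] are log-probabilities.
def viterbi(words, tags, log_start, log_trans, log_emit):
    unk = -20.0  # crude log-probability for unseen words
    best = [{t: log_start[t] + log_emit[t].get(words[0], unk) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(((p, best[i - 1][p] + log_trans[p][t]) for p in tags),
                              key=lambda x: x[1])
            best[i][t] = score + log_emit[t].get(words[i], unk)
            back[i][t] = prev
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN"]
log_start = {"DT": -0.1, "NN": -2.3}
log_trans = {"DT": {"DT": -3.0, "NN": -0.05}, "NN": {"DT": -1.0, "NN": -1.0}}
log_emit = {"DT": {"the": -0.1}, "NN": {"dog": -0.7}}
print(viterbi(["the", "dog"], tags, log_start, log_trans, log_emit))  # ['DT', 'NN']
```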
The potential problem with using HMMs is that the emission probabilities are of the form p(word | tag), where we'd really like to model p(tag | word). The latter is preferable because it's easier to include lots of overlapping features (capitalization, word identity, prefixes, suffixes, stems, etc.). The partial solution is to use a maximum entropy Markov model (MEMM), where we model p(tag | word) using a maximum entropy model, but keep everything else as in an HMM. MEMMs are only slightly more complex to train than HMMs, but work a whole lot better. At training time, we essentially include features having to do with the previous labels, but otherwise this is just as in the independent classifiers approach. Viterbi search still runs at test time, so we're limited to the same Markov order constraints as HMMs.
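In code, the only real change from the independent-classifier setup above is that the feature function also sees the previous tag (a sketch; the helper name and the "<START>" placeholder are mine):

```python
# Sketch of MEMM-style features: the word features plus the previous tag.
def memm_features(tokens, i, prev_tag):
    feats = word_features(tokens, i)   # reuse the word features from the sketch above
    feats["prev_tag"] = prev_tag       # use e.g. "<START>" at the start of a sentence
    return feats
```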
The potential problem with MEMMs (noticing a trend here? :P) is that the models are trained against the CORRECT previous labels. That is, when we create a classification example corresponding to the label at time t+1, we include features that depend on the label at time t. These features will always be correct at training time, but can be wrong at test time. This leads to the infamous "label bias" problem. The conditional random field (CRF) is essentially an answer to this problem. Instead of training to predict each label independently and then running Viterbi at test time, we train to get the whole sequence right. (This is a good instance of having identical training and testing situations.) Training CRFs is a bit of a bore, since each iteration of training requires one to run the forward-backward algorithm over each training instance. But CRFs do often perform better than plain MEMMs. The UMass group has been nice enough to release their CRF implementation.
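If you just want to try a linear-chain CRF today, one convenient option is the sklearn-crfsuite package (my suggestion, not the UMass implementation mentioned above); a minimal sketch, reusing the toy data and features from the earlier snippets:

```python
# Sketch: training a linear-chain CRF with sklearn-crfsuite.
# X is a list of sentences, each a list of per-token feature dicts;
# y is the corresponding list of tag sequences.
import sklearn_crfsuite

X_train = [[word_features(toks, i) for i in range(len(toks))]
           for toks, tags in train_sents]
y_train = [tags for toks, tags in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```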
Once we get to CRFs, we can play a bunch of other games. For instance, we can essentially come up with a margin-based version of the CRF. This is roughly like moving from maxent to SVMs. The result is either max-margin Markov networks or SVMstruct, depending on exactly how you formulate the problem. The latter has a nice implementation available to play with.
So that's a whole lot of learning, but the real action is in the features. For me, there's essentially a standard feature set I always use for these problems, and then add to and subtract from as I see fit. The features I usually use are the following (with examples based on the word "George"): the exact word ("George"), the stem ("george"), prefixes of length 1-3 ("G", "Ge" and "Geo"), suffixes of length 1-3 ("e", "ge" and "rge"), and some regexp-style features. The most useful of these transforms all capital letters to "A", all lower-case letters to "a", all numbers to "0", and all punctuation to "."; then we collapse identical adjacent letters, so George -> Aaaaaa -> Aa. Finally, I use lists of people, places and organizations collected from various gazetteers and check whether each word falls in any of these lists. (Incidentally, these are available as part of my TagChunk program for solving joint tagging/chunking problems.) All of these features are typically applied in a window of +/- 1, 2 or 3 words around the given word. You can see exactly what I calculate in this perl script.
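The word-shape feature described above is easy to reproduce; here is a small sketch (the function name is mine; note that it maps anything that isn't a letter or digit to "."):

```python
# Sketch of the "word shape" feature: A/a/0/. classes, with adjacent repeats collapsed.
import re

def word_shape(word):
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(".")
    return re.sub(r"(.)\1+", r"\1", "".join(shape))

print(word_shape("George"))   # Aaaaaa -> Aa
print(word_shape("12:30pm"))  # 00.00aa -> 0.0a
```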
The final issue in features is what to do with the transition features. In particular, one can think of putting the lexical features on "nodes" or "edges". By putting them on nodes, I mean that you have features like "previous-tag-is-DT current-word-is-George current-case-is-Aa". But you can also do a conjunction thing and say "previous-tag-is-DT current-word-is-George-and-previous-tag-is-DT current-case-is-Aa-and-previous-tag-is-DT". This obviously blows up your feature space, but if you have enough data, it's almost always worth doing.
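A sketch of that conjunction trick (the helper name and the feature-naming scheme are my own): every node feature is also emitted a second time, glued to the previous tag.

```python
# Sketch of "edge" features: conjoin each node feature with the previous tag.
def add_prev_tag_conjunctions(feats, prev_tag):
    conjoined = {f"{name}={value}&prev_tag={prev_tag}": 1
                 for name, value in feats.items()}
    conjoined["prev_tag=" + prev_tag] = 1
    return {**feats, **conjoined}

print(add_prev_tag_conjunctions({"word": "George", "shape": "Aa"}, "DT"))
```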
Now, getting back to the segmentation issue. The fact that there are multiple valid encodings for mapping a segmentation to a raw labeling is probably a bad thing. It seems that, ideally, we'd like to learn to do the segmentation directly. This is really not that hard, and it comes with a bunch of benefits, especially in terms of the features you can employ. For instance, you can check whether the whole string "George Bush" falls in a list of names, or notice that both words are capitalized. These are both great "phrase-based" features (see the sketch below). There is published evidence that treating these problems directly as segmentation problems is beneficial.
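As a rough illustration of what such phrase-based features might look like in a segment-level model (the gazetteer and helper here are toy placeholders, not part of any real system):

```python
# Sketch of phrase-level features for a segment-based model: features are
# computed over a whole candidate segment rather than a single word.
PERSON_LIST = {"george bush", "bill clinton"}   # toy gazetteer

def segment_features(tokens, start, end):
    phrase = " ".join(tokens[start:end])
    return {
        "phrase_in_person_list": phrase.lower() in PERSON_LIST,
        "all_words_capitalized": all(t[0].isupper() for t in tokens[start:end]),
        "segment_length": end - start,
    }

print(segment_features("Yesterday , George Bush gave a speech .".split(), 2, 4))
```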
Thanks, Hal. Your post is really helpful for a beginner like me.
For all of these models, there's a problem in defining what a token is. It can have a significant impact on the models, because the ones Hal's talking about are all local, and what falls in a local window depends on the shape of the tokens. This is doubly important when multiple models are used, as in stemmers, POS taggers and entity detectors. Of course, for Chinese, it introduces the requirement of a tokenizer, or the features need to work with single tokens per word, which actually works pretty well.
Hal's baseline feature set for CRFs is interesting for two reasons. One, it builds in heuristic knowledge of the language to which it's being applied through a stemmer. Two, it is highly Eurocentric in the treatment of capitalization and word windows. Finkel et al. noted that the features they used in the CRFs for English CoNLL (English news data with person, location and organization entity types) and those they used in BioCreative (English biomedical research abstracts with gene entity type) had very different performance cross-domain. Using news features on biomedical data, they scored a 75% F-measure. Their final highly tuned English genes biomedical-specific CRF scored 85% (including 1% or so from post-processing heuristics). These included a whole lot of dictionary features. I'd like to see how these would perform with something like Hal's features (excluding stemming -- standard English stemmers perform horribly on biomedical text).
I'd argue that using CRFs with very rich feature sets is actually a hybrid activity between building a system by hand and building one fully automatically. This is actually a good thing for those of us who believe in linguistics, as there are lots of opportunities for using more sophisticated features (as, e.g., Collins did in his PCFG models for the Penn treebank).
The baseline tag set using BIO is also interesting. It's very hard to use for forward-backward confidence, as shown by Culotta and McCallum. The begin-in-end-whole tagging we use is much easier in this regard and introduces some more tag context, which CRFs typically include through features. This also illustrates another way in which you can tune a sequence model -- the set of tags assigned is not carved in stone anywhere. I'm also thinking that CRFs with all these features are going to have very high variance as estimators and thus not necessarily be the best baseline for confidence-based extraction.
Finally, the game these days at the accuracy high end is moving to (a) longer-distance interactions, to include information such as topic and syntactic parse, and (b) larger joint models, with rescoring of entity extraction in the context of coreference or relation extraction. This usually involves either an n-best rescoring of a simpler system, or a more sophisticated sampling approach. I think this might be what Hal's getting at in his last paragraph.
In HMMs, the states are what are "hidden". They're hidden in the sense that they're not part of the input stream. Instead, you predict the best states given the input tokens.
This is clearer looking at HMMs as language models, where the probability of a string is the sum over all state sequences of the probability of the string being emitted by that sequence. Usually people make the Viterbi assumption and approximate this value by only looking at the best sequence. But this is only a good approximation in high-confidence settings; that is, when the first-best parse has substantial probability mass relative to the sum of the others.
I think what Hal means is that in fully supervised HMM training, the states are given along with the tokens as part of the training data. Very often, you don't know the states, or sometimes you don't know the alignments of states with inputs, as in training from unaligned speech data.
Manning and Schuetze's HMM presentation is highly non-standard, because they emit from arcs, not nodes. I'd recommend reading Jurafsky and Martin's presentation.
Most HMMs used in NLP don't actually model p(word|tag) properly, because of unknown words. They also don't model the whole output sequence because of spaces, and sometimes punctuation, case, or other normalizations.
Bob Carpenter said...
... Finkel et al. noted that the features they used in the CRFs for English CoNLL (English news data with person, location and organization entity types) and those they used in BioCreative (English biomedical research abstracts with gene entity type) had very different performance cross-domain. ... I'd like to see how these would perform with something like Hal's features
I believe that we used all of the features that Hal described in our English CoNLL feature set (except stemming features), so the answer would be around 75%. We also kept those features for the bio data, but just added more bio-specific features.
Wow, this got lots of comments (it seems quite a hard task to predict number of comments based on topic)! I totally agree with pretty much everything Bob's said ... hopefully this will make it even easier to get started in sequence labeling :).
I think a few things are worth highlighting:
I was being incredibly English-centric with my features (or at least Indo-European centric). One needs to do something different for languages like Chinese, Japanese and Arabic.
BIO is not necessarily the best: Bob points out "Begin/In/Out/Whole", where an additional tag is added for segments of length 1. There's also the "IO" encoding, which is the same as BIO except that you only use "begin" tags when two segments of the same type abut (e.g., "The/I-NP president/I-NP Mr./B-NP Bush/I-NP"). I think there are others, and there's a paper I recall by Roth and some students from a few years ago comparing them.
HMMs usually emit from states, not arcs, which makes the M+S presentation odd. I think Bob's right that J+M makes a better standard reference here, although there are plenty of online tutorials about HMMs that are also quite good.
(Minor point: the only reason I think "HMM" is a bit of a misnomer for sequence labeling is that at training time there are no hidden states and training is just counting... this makes the commonly used phrase "training an HMM by counting" weird, because what you're training is not really an HMM at that point. Of course, at test time there are hidden states (the labels), at which point I have no problem calling it an HMM. And when there really are hidden states at training time, then of course it's an HMM. This is really not a very important issue, though.)
Dear Hal,
My apologies for taking some of your time with this. I am currently working on text summarization using supervised learning techniques, and I want to use a CRF. Could you please help me choose a suitable CRF package that can learn from labeled data, where the labeled data consists of sentences labeled 1 (summary sentence) or 0 (not a summary sentence)?
Thanks. I am eagerly waiting for your kind reply.
Kind Regards,
Sadid