01 November 2006

Getting Started In: Sequence Labeling

Okay, at long last it's time for another edition of "Getting Started In" (p.s., if you want to write one of these for your area of expertise, I'm very open to it!).

So, what is sequence labeling? It's the meta problem that we face all the time in NLP tasks where we want to assign a single label to each element in a sequence. For us, a sequence is usually a sentence and a word is an element. The elements we are trying to assign are usually things like parts of speech, syntactic chunk labels (is this part of a noun phrase, verb phrase, etc.), named entity labels (is this a person?) and so on. Information extraction systems (i.e., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.

There are roughly two varieties of sequence labeling: (1) raw labeling and (2) joint segmentation and labeling. The latter is (IMO) more common. Raw labeling is something like POS tagging where each element gets a single tag. Joint segmentation and labeling is where whole segments get the same label. For instance, in named entity recognition, a sentence like "Yesterday , George Bush gave a speech ." contains example one named entity ("George Bush"). Here, we want to assign the label "PERSON" to the entire phrase "George Bush", not to individual words (for instance, if two named abut, we need to know where they separate).

The easiest way to deal with segmentation problems is to transform them into raw labeling problems. The standard way to do this is the "BIO" encoding, where we label each word by "B-X", "I-X" or "O". Here, "B-X" means "begin a phrase of type X", "I-X" means "continue a phrase of type X" and "O" means "not in a phrase." For instance, the Bush sentence would be labeled as: "Yesterday/O ,/O George/B-PER Bush/I-PER gave/O a/O speech/O ./O" Once can now treat this as a raw labeling problem, perhaps being careful to avoid producing impossible sequences at test time (eg., an "I-X" can only follow a "B-X" or another "I-X"). Other encodings are also possible. So for now, I'll concentrate on raw labeling and revisit the true joint segmentation problem at the end.

In raw labeling, we're faced with the problem of assigning a single label to each word in a sequence. Perhaps the easiest approach is to predict each label independently. Then, we just have a collection of multiclass classification tasks (each label is a different class). We can use whatever classifier we want for this. Despite it's simplicity, this approach is actually quite effective for many problems.

Intuitively, we usually believe that the labels in these problems are not independent. For instance, in POS tagging, it's basically impossible to have a verb immediately following a determiner. We would therefore like to use some local sequence information to improve our performance. The "old school" approach to doing this is to use a hidden Markov model (HMM). I'll refer you to Manning+Schutze for details here (keeping in mind that really what we're using is just a Markov model...in these problems, there is nothing that's hidden). But the basic idea here is that we have two probability distributions: a transition distribution (how likely is it that a verb will follow a determiner) and an emission distribution (how likely is it that we'll see the word "the" given that we know this word should be a determiner). If our transition probabilities are local, then the Viterbi algorithm will run efficiently at test time. This locality is referred to as the "Markov assumption." Specifically, a first-order Markov assumption says that the probability of label at time t+2 is independent of label at time t given the label at time t+1. The higher Markov order you use, the harder it is to decode (complexity-wise).

The potential problem with using HMMs is that the emission probabilities are of the form p(word | tag), where we'd really like to model p(tag | word). The latter is preferable because it's easier to include lots of overlapping features (capitalization, word identity, prefixes, suffixes, stems, etc.). The partial solution is to use a maximum entropy Markov model (MEMM), where we model p(tag | word) using a maximum entropy model, but keep everything else as in an HMM. MEMMs are only slightly more complex to train than HMMs, but work a whole lot better. At training time, we essentially include features having to do with the previous labels, but otherwise this is just as in the independent classifiers approach. Viterbi search still runs at test time, so we're limited to the same Markov order constraints as HMMs.

The potential problem with MEMMs (noticing a trend here? :P) is that when the models are trained, they are trained against CORRECT previous labels. That is, when we create a classification example corresponding to the label at time t+1, we include features that depend on the label at time t. But these will always be correct at training time, but can be wrong at test time. This leads to the infamous "label bias" problem. The conditional random field (CRF) is essentially an answer to this problem. Instead of training to predict each label independently, but then running Viterbi at test time, we train to get the whole sequence right. (This is a good instance of having identical training and testing situations.) Training CRFs is a bit of a bore, since each iteration of training requires one to run the forward-backward algorithm over each training instance. But CRFs do often perform better than plain MEMMs. The UMass group has been nice enough to release their CRF implementation.

Once we get to CRFs, we can play a bunch of other games. For instance, we can essentially come up with a margin-based version of the CRF. This is roughly like moving from maxent to SVMs. The result is either max-margin Markov networks or SVMstruct, depending on exactly how you formulate the problem. The latter has a nice implementation available to play with.

So that's a whole lot of learning, but where the action really is in the features. For me, there's essentially a standard feature set I always use for these problems, and then add and subtract as I see fit. The features I usually use are the following (with examples based on the word "George": the exact word ("George"), the stem ("george"), prefixes of length 1-3 ("G", "Ge" and "Geo"), suffixes of length 1-3 ("rge", "ge" and "e") and some regexps. The most useful I use is to transform all capital letters to "A", all lower-case to "a", all numbers to "0", and all punctuation to ".". Then, we collapse identical adjacent letters. So George -> Aaaaaa -> Aa. Finally, I use list of people, places and organizations collected from various gazetteers and check whether each word falls in any of these lists. (Incidentally, these are available as part of my TagChunk program for solving joint tagging/chunking problems.) All of these features are typically applied in a window of +/- 1,2 or 3 words around the given word. You can see exactly what I calculate in this perl script.

The final issue in features is what to do with the transition features. In particular, one can think of putting the lexical features on "nodes" or "edges". By putting them on nodes, I mean that you have features like "previous-tag-is-DT current-word-is-George current-case-is-Aa". But you can also do a conjunction thing and say "previous-tag-is-DT current-word-is-George-and-previous-tag-is-DT current-case-is-Aa-and-previous-tag-is-DT". This obviously blows up your feature space, but if you have enough data, it's almost always worth doing.

Now, getting back to the segmentation issue. The fact that there are multiple valid encodings for mapping segmentation -> raw labeling is probably a bad thing. It seems that, ideally, we'd like to learn to do the segmentation directly. This is really not that hard, and it comes with a bunch of benefits, especially in what features you can employ. For instance, you can check if the whole string "George Bush" falls in a list of names. Or you can notice that both words are capitalized. These are both great "phrase-based" features. More evidence has been published that treating these problems as segmentation problems directly is beneficial.

13 comments:

Muhua said...

Thanks, Hal. Your post is really helpful for a beginner like me.

Bob Carpenter said...

For all of these models, there's a problem in defining what a token is. It can have a significant impact on the models, because the ones Hal's talking about are all local, and what falls in a local window depends on the shape of the tokens. This is doubly important when multiple models are used, as in stemmers, POS taggers and entity detectors. Of course, for Chinese, it introduces the requirement of a tokenizer, or the features need to work with single tokens per word, which actually works pretty well.

Hal's baseline feature set for CRFs is interesting for two reasons. One, it builds in heuristic knowledge of the language to which it's being applied through a stemmer. Two, it is highly Eurocentric in the treatment of capitalization and word windows. Finkel et al. noted that the features they used in the CRFs for English CoNLL (English news data with person, location and organization entity types) and those they used in BioCreative (English biomedical research abstracts with gene entity type) had very different performance cross-domain. Using news features on biomedical data, they scored a 75% F-measure. Their final highly tuned English genes biomedical-specific CRF scored 85% (including 1% or so from post-processing heuristics). These included a whole lot of dictionary features. I'd like to see how these would perform with something like Hal's features (excluding stemming -- standard English stemmers perform horribly on biomedical text).

I'd argue that using CRFs with very rich feature sets is actually a hybrid activity between building a system by hand and building one fully automatically. This is actually a good thing for us that believe in linguistics, as there are lots of opportunities for using more sophisticated features (as, e.g., Collins did in his PCFG models for the Penn treebank).

The baseline tag set using BIO is also interesting. It's very hard to use for forward-backward confidence, as shown by Culotta and McCallum. The begin-in-end-whole tagging we use is much easier in this regard and introduces some more tag context, which CRFs typically include through features. This also illustrates another way in which you can tune a sequence model -- the set of tags assigned is not carved in stone anywhere. I'm also thinking that CRFs with all these features are going to have very high variance as estimators and thus not necessarily be the best baseline for confidence-based extraction.

Finally, the game these days at the accuracy high end is moving to (a) longer-distance interactions, to include information such as topic and syntactic parse, and (b) larger joint models, with rescoring of entity extraction in the context of coreference or relation extraction. This usually involves either an n-best rescoring of a simpler system, or a more sophisticated sampling approach. I think this might be what Hal's getting at in his last paragraph.

In HMMs, the states are what are "hidden". They're hidden in the sense that they're not part of the input stream. Instead, you predict the best states given the input tokens.

This is clearer looking at HMMs as language models, where the probability of a string is the sum over all state sequences of the probability of the string being emitted by that sequence. Usually people make the Viterbi assumption and approximate this value by only looking at the best sequence. But this is only a good approximation in high-confidence settings; that is, when the first-best parse has substantial probability mass relative to the sum of other.

I think what Hal means is that in fully supervised HMM training, the states are given along with the tokens as part of the training data. Very often, you don't know the states, or sometimes you don't know the alignments of states with inputs, as in training from unaligned speech data.

Manning and Schuetze's HMM presentation is highly non-standard, because they emit from arcs, not nodes. I'd recommend reading Jurafsky and Martin's presentation.

Most HMMs used in NLP don't actually model p(word|tag) properly, because of unknown words. They also don't model the whole output sequence because of spaces, and sometimes punctuation, case, or other normalizations.

Jenny Finkel said...

Bob Carpenter said...

... Finkel et al. noted that the features they used in the CRFs for English CoNLL (English news data with person, location and organization entity types) and those they used in BioCreative (English biomedical research abstracts with gene entity type) had very different performance cross-domain. ... I'd like to see how these would perform with something like Hal's features


I believe that we used all of the features that Hal described in our English CoNLL feature set (except stemming features), so the answer would be around 75%. We also kept those features for the bio data, but just added more bio-specific features.

hal said...

Wow, this got lots of comments (it seems quite a hard task to predict number of comments based on topic)! I totally agree with pretty much everything Bob's said ... hopefully this will make it even easier to get started in sequence labeling :).

I think a few things are worth highlighting:

I was being incredibly English-centric with my features (or at least Indo-European centric). One need to do something different for languages like Chinese, Japanese and Arabic.

BIO is not necessarily the best: Bob points out "Begin/In/Out/Whole" where an additional tag is added for sequences of length 1. There's also the "IO" encoding which is the same as BIO, except that you only use "begin" tags when two segments of the same type abut (eg., "The/I-NP president/I-NP Mr./B-NP Bush/I-NP"). I think there are others and there's a paper I recall by Roth and some students from a few years ago comparing them.

HMMs usually emit over states, not arcs which makes M+S odd. I think Bob's right that J+M makes a better standard reference here, although there are plenty of online tutorials about HMMs that are also quite good.

(Minor point: The only reason I think that "HMM" for sequence labeling is a bit of a misnomer is because at training time there are not hidden states and training is just counting...this makes the commonly used phrase "training and HMM by counting" weird because what you're training is not an HMM. Of course at test time, there are hidden states (the labels), at which point I have no problem calling it an HMM... And, of course, when there really are hidden states, then of course its an HMM... This is really not a very important issue though.)

sadid said...

Dear Hal,

At first my apology to take some of your time into this. I am
currently working on text summarization using supervised learning
techniques. I want to use CRF. So, could you please help me in choosing a
suitable CRF package which can learn from labeled data where the labeled
data is the sentences with label 1 meaning summary sentence and 0 meaning
not?

Thanks. I am eagerly waiting for your kind reply.

Kind Regards,

Sadid

梦中林 said...

When the Wow Gold wolf finally found the Buy Wow Goldhole in the chimney he crawled wow gold cheap down and KERSPLASH right into that kettle of water and that was cheapest wow gold the end of his troubles with the big bad wolf.
The next day the cheap wow gold little pig invited his mother over . She said "You see it is just as mygamegoldI told you. The way to get along in the world is to do world of warcraft gold things as well as you can." Fortunately for that little pig, he buy cheap wow gold learned that lesson. And he just k4gold lived happily ever after!

Anonymous said...

網頁設計,情趣用品,情趣用品,情趣用品,情趣用品
色情遊戲,寄情築園小遊戲,情色文學,一葉情貼圖片區,情惑用品性易購,情人視訊網,辣妹視訊,情色交友,成人論壇,情色論壇,愛情公寓,情色,舊情人,情色貼圖,色情聊天室,色情小說,做愛,做愛影片,性愛

免費視訊聊天室,aio交友愛情館,愛情公寓,一葉情貼圖片區,情色貼圖,情色文學,色情聊天室,情色小說,情色電影,情色論壇,成人論壇,辣妹視訊,視訊聊天室,情色視訊,免費視訊,免費視訊聊天,視訊交友網,視訊聊天室,視訊美女,視訊交友,視訊交友90739,UT聊天室,聊天室,豆豆聊天室,尋夢園聊天室,聊天室尋夢園,080聊天室,080苗栗人聊天室,女同志聊天室,上班族聊天室,小高聊天室 

AV,AV女優
視訊,影音視訊聊天室,視訊交友
視訊,影音視訊聊天室,視訊聊天室,視訊交友,視訊聊天,視訊美女

. said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

酒店上班請找艾葳 said...

艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作兼差打工假日兼職兼職工作酒店兼差兼差打工兼差日領工作晚上兼差工作酒店工作酒店上班酒店打工兼職兼差兼差工作酒店上班等,想了解酒店相關工作特種行業內容,想兼職工作日領假日兼職兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚,無業績壓力,無脫秀無喝酒壓力,高層次會員制客源,工作輕鬆,可日領現領
一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已,對水水們的上班安全一點保障都沒有!艾葳酒店經紀公司的水水們上班時全程媽咪作陪,不需擔心!只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??
艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,
等相關服務 幫您快速的實現您的夢想~!!

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

cilemsin42 said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chat
sesli sohbet siteleri

sesli chat siteleri sesli sohbetsesli chat
sesli sohbet siteleri
sesli chat siteleri
SesliChat
cılgın sohbet
güzel kızlar
bekar kızlar
dul bayanlar
seviyeli insanlar
yarışma
canlı müzik
izdivac
en güzel evlilik
hersey burada
sesliparti
seslisohbet odalari
Sesli adresi
Sesli Chat
SesliChat Siteleri
Sesli Chat sitesi
SesliChat sitesi
SesliSohbet
Sesli Sohbet
Sesli Sohbet Sitesi
SesliSohbet Sitesi
SesliSohbet Siteleri
Muhabbet Sitesi
kamerali chat
Görüntülü Sohbet
Hasret gülleri
Çet sitesi
SesliSohbet
Sesli Sohbet
Canli sohbet
Turkce sohbet
Kurtce Sohbet
Kurtce Chat
Kurtce Muhabbet
Kurtce Sohbet
Kurdish Chat
SesliChat
Sesli Chat
SesliSanal
Guncel Haber
sohbet Sitesi
Chat sitesi..

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex