12 July 2007

Multiclass learning as multitask learning

It's bugged me for a little while that when learning multiclass classifiers, the prior on the weights treats every class identically and independently (even when the classifier is learned directly and not through some one-versus-rest or all-versus-all reduction).

Why does this bug me? Consider our favorite task: shallow parsing (aka syntactic chunking). We want to identify base phrases, such as NPs, in running text. The standard way to do this is to encode the phrase labels as word labels and apply some sequence labeling algorithm. The standard encoding is BIO. A sentence like "The man ate a sandwich ." would be labeled "B-NP I-NP B-VP B-NP I-NP O", with "B-X" marking the beginning of a phrase of type X, "I-X" marking a word "inside" such a phrase, and "O" (assigned to ".") marking a word "outside" any chunk.
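
To make the encoding concrete, here is a minimal sketch; the chunk-span representation and the helper name bio_encode are just for illustration, not any standard tool:

```python
def bio_encode(words, chunks):
    """Convert labeled chunk spans into per-word BIO tags.

    chunks is a list of (start, end, phrase_type) spans, end exclusive;
    words not covered by any chunk get the tag "O".
    """
    tags = ["O"] * len(words)
    for start, end, phrase_type in chunks:
        tags[start] = "B-" + phrase_type
        for i in range(start + 1, end):
            tags[i] = "I-" + phrase_type
    return tags

words = ["The", "man", "ate", "a", "sandwich", "."]
chunks = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")]
print(bio_encode(words, chunks))
# ['B-NP', 'I-NP', 'B-VP', 'B-NP', 'I-NP', 'O']
```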

If we train, e.g., a CRF to recognize this, then (typically) it treats B-NP as completely independent of I-NP; just as independent as it is of "O". Clearly this is the wrong assumption a priori.

One way I have gotten around this problem is to explicitly parameterize my models with per-class features. That is, rather than having a feature like "current word is 'the'" and making K copies of it (one per output label), I explicitly conjoin features with labels, such as "current word is 'the' and label is 'B-NP'". This lets me also have features like "word=the and label is B-?" or "word=the and label is ?-NP", which get shared across different labels. (Shameless plug: megam can do this effortlessly.)
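
Here is a minimal sketch of what I mean by conjoined features with shared label pieces; the function and the feature-string conventions are invented for illustration and are not megam's actual input format:

```python
def conjoin(input_feature, label):
    """Conjoin an input feature with a BIO label and with its shared pieces.

    For label "B-NP" we emit three features: the fully specific conjunction,
    one that only sees the B/I part, and one that only sees the phrase type.
    The latter two are shared across labels, e.g. "B-NP" and "B-VP" share
    the "y=B-?" version of every input feature.
    """
    feats = [input_feature + "^y=" + label]
    if "-" in label:
        bio, phrase_type = label.split("-", 1)
        feats.append(input_feature + "^y=" + bio + "-?")      # shared across phrase types
        feats.append(input_feature + "^y=?-" + phrase_type)   # shared across B/I
    return feats

print(conjoin("w=the", "B-NP"))
# ['w=the^y=B-NP', 'w=the^y=B-?', 'w=the^y=?-NP']
```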

But I would rather not have to do this. One thing I could do is to make 2^K versions of each feature (K is still the number of labels), where each version is conjoined with some subset of the labels. But for large-K problems, this gets unwieldy. Pairwise versions would be tolerable, but then you couldn't get the "B-?" sort of features I want, since those cover more than two labels. There's also no obvious kernel solution here, because these are functions of the output label, not the input.

It seems like the right place for this to happen is in the prior (or the regularizer, if you're anti-probabilistic models). Let's say we have F features and K classes. In a linear model, we'll learn F*K weights (okay, really F*(K-1) for identifiability, but it's easier to think in terms of F*K). Suppose that a priori we know that classes j and k are related. Then we want the prior to favor w(:,j) being similar to w(:,k). There are a variety of ways to accomplish this: I think that something along the lines of a recent NIPS paper on multitask feature learning is a reasonable way to approach this.
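
To make the coupling concrete, here is a minimal sketch of two penalties of the kind I have in mind, assuming W is the F-by-K weight matrix and S is a known class-similarity matrix; these particular forms are just illustrations, not the construction from that paper:

```python
import numpy as np

def similarity_penalty(W, S):
    """Penalize differences between weight columns of classes we believe are
    related: sum over (j, k) of S[j, k] * ||W[:, j] - W[:, k]||^2.

    W is F x K (features by classes); S is K x K and nonnegative.
    """
    total = 0.0
    K = W.shape[1]
    for j in range(K):
        for k in range(K):
            diff = W[:, j] - W[:, k]
            total += S[j, k] * np.dot(diff, diff)
    return total

def joint_sparsity_penalty(W):
    """An L2,1-style penalty, sum over features f of ||W[f, :]||_2, which
    encourages each feature to be used by all classes or by none."""
    return np.sum(np.sqrt(np.sum(W ** 2, axis=1)))
```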

What this approach lacks in general is the notion that if classes j and k "share" some features (i.e., they have similar weights on them), then they're more likely to "share" other features as well. You could do something like task clustering to achieve this, but that seems less than ideal, since I'd really like to see a single unified algorithm.

Unfortunately, all attempts (on my part) to come up with a convex regularizer that has these properties have failed. I actually now think that it is probably impossible. The problem is essentially that there is bound to be some threshold controlling whether the model thinks classes j and k are similar: below this threshold, the regularizer will prefer w(:,j) and w(:,k) to each be independently close to zero; above it, it will prefer w(:,j) and w(:,k) to be close to each other (and also close to zero). This switching behavior is essentially the root of the non-convexity.
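
To see the issue with a toy penalty of my own (not one anybody has actually proposed): take two scalar weights and let the regularizer pick whichever is cheaper between "shrink both toward zero" and "pull them toward each other", r(u,v) = min(u^2 + v^2, (u - v)^2). A pointwise minimum of convex functions is generally not convex, and indeed the midpoint test fails:

```python
def r(u, v):
    # toy thresholded regularizer: either shrink both weights toward zero,
    # or pull them toward each other, whichever term is currently cheaper
    return min(u * u + v * v, (u - v) ** 2)

a = (1.0, -0.5)   # r(a) = 1.25 (shrink-to-zero term is cheaper)
b = (1.0, 0.5)    # r(b) = 0.25 (pull-together term is cheaper)
mid = (1.0, 0.0)  # midpoint of a and b

print(r(*mid), 0.5 * (r(*a) + r(*b)))    # 1.0 vs. 0.75
assert r(*mid) > 0.5 * (r(*a) + r(*b))   # convexity would require <=
```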

1 comment:

Vezhnick said...

What this approach lacks in general is the notion that if classes j and k "share" some features (i.e., they have similar weights on them), then they're more likely to "share" other features as well.

I'm not sure about that. Actually, if I understood them correctly, that is essentially what happens when you make L1 and L2 "fight". You have two competing tendencies in your regularizer: make the feature weight vector for each classifier sparse, and make the use of each feature across all classifiers dense. If your classifiers' (unregularized) weights are close, they will be drawn toward each other by the L2 part, and if they are far apart they will be kept separate because of the L1 part (moving them closer would, most likely, require using more features).
