13 February 2006

Structured Prediction 2.5: Features

I decided to add an additional post on structured prediction before the final one. This is also highly related to the post on joint inference.

One can claim that, when doing structured prediction, the features X never have longer range than the loss function, where X is either "need" or "should." To this day I struggle with this issue: I don't believe it is as simple as one would like.

Consider our old favorite: sequence labeling under Hamming (per-label) loss. The argument essentially goes: we don't care (i.e., our loss function doesn't care) how well our labels fit together; all it cares about is getting each label right individually. By essentially the same argument as in the joint inference post, one concludes that we should train a single classifier that does not use any "Markov-style" features.
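To make the distinction concrete, here is a minimal sketch (my own illustration, with made-up feature names, not anything from a particular system) of the two kinds of features. The Hamming-loss argument says a classifier using only the first kind should suffice:

```python
# Observation-only ("non-Markov") features: these look at the input
# sequence x around position i, but never at neighboring labels, so
# each position can be classified independently.
def emission_features(x, i, label):
    prev_word = x[i - 1] if i > 0 else "<s>"
    return {
        f"word={x[i]}|y={label}": 1.0,
        f"prev_word={prev_word}|y={label}": 1.0,
        f"suffix={x[i][-2:]}|y={label}": 1.0,
    }

# Label-pair ("Markov") features: these couple adjacent labels, which
# is exactly what the per-label-loss argument says we shouldn't need.
def markov_features(prev_label, label):
    return {f"trans={prev_label}->{label}": 1.0}
```

With only `emission_features`, decoding is a per-position argmax; adding `markov_features` forces joint (e.g., Viterbi) decoding, and empirically that is what seems to help.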

On the other hand, I have always seen Markov-style features help, and I believe that most people who work on SP would agree.

Just as in the joint inference case, I think that if you want to guarantee the classifier is trained on the same information, you need to move to a cross-product feature space. Unlike John, I do fear this, for several reasons. First, we already have millions of features; going to 10^12 features is just not practical (blah blah kernels blah blah). It can also lead to severe generalization issues, and I don't think the state of the art in machine learning is capable of dealing with this. It's tricky, though, because I largely believe the joint inference argument but largely don't believe this one, even though they are essentially the same.
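A rough sketch of where the blow-up comes from (my own illustration, with hypothetical feature names): to let an independent classifier see what a first-order Markov model sees, you cross every observation feature with every candidate neighboring label, and the feature space multiplies accordingly:

```python
from itertools import product

def cross_product_features(obs_feats, labels):
    """Pair each observation feature with each candidate previous label."""
    return {
        f"prev_y={y}&{name}": value
        for y, (name, value) in product(labels, obs_feats.items())
    }

obs = {"word=dog": 1.0, "suffix=og": 1.0}
labels = ["DET", "NOUN", "VERB"]
crossed = cross_product_features(obs, labels)
# |crossed| = |labels| * |obs_feats| -- with millions of base features
# and many labels (or longer label histories), the product is where
# numbers like 10^12 come from.
```

Each step of history you fold in multiplies the space by another factor of the label set size, so longer-range structure makes this worse, not better.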

But I think there might be a more fundamental issue here, beyond cross-product feature spaces, generalization bounds, and so on: if we know a problem has structure, even if the loss function (even the true one) doesn't reflect it, should we make use of that structure? The answer to "Why?" is that it might make learning easier for our models. The answer to "Why not?" is that doing so requires more "human in the loop" than otherwise.


hal said...

i realized that i would be remiss not to mention a paper by Punyakanok et al. that compares (theoretically and practically) the difference between global inference (tags depend on each other) and independent classifiers (they do not). i'm more interested in the theoretical side, and the paper does state a theorem that roughly says that if you have enough data, global inference is probably better. the biggest problem with the theorem is that it compares two upper bounds, rather than an upper and a lower. since the bounds aren't tight, it's unclear that it's actually saying anything. but it's a start, and given the importance of this question (to me), it's interesting.
