06 January 2006

Structured Prediction 1: What's Out There

Structured prediction (SP) -- the machine learning task of generating outputs with complex internal structure -- is near and dear to my heart. It is also a generalization of many NLP problems (summarization, MT, IE, coref, parsing, tagging, etc.). This is the first part of a multi-part post on structured prediction (first: what's out there; second: what I do; third: where we should go).

The most important aspect of the SP problem (aside from the data) is the loss function: l(y,y') tells us how much it costs us to produce y' when the correct answer is y. Our goal, as always, is to minimize the expected loss over the distribution that generated our training data. The common way of doing this is to posit a large set Y of possible outputs and to try to find a function f(x,y) on an input x and output y that ranks outputs: f(x,y) > f(x,z) if y is a better output than z. Once we've learned such a function, the only thing left is to find a way to search over all y for the one with the highest score. Aside from the parameterization of f (which is almost always taken to be linear in the feature set), the only thing that matters is how we measure one f against another. For notational convenience, let g(x,y) = exp(f(x,y)). Here are some common ways to score f:

  • Generative: f is good if g(x,y) / sum_{x',y'} g(x',y') is high for all (x,y) in the training data
  • Conditional: f is good if g(x,y) / sum_{y'} g(x,y') is high for all (x,y) in the training data
  • Discriminative-M: f is good if f(x,y) > f(x,y') + l(y,y') for all (x,y) in the training data and all y', where errors are punished equally
  • Discriminative-S: f is good if f(x,y) > f(x,y') for all (x,y) in the training data and all y', where errors are punished in proportion to l(y,y').
Assuming all optimizations are equally difficult (or easy), these techniques should be ranked according to how closely they match our true desire: low expected loss. To me, the Discriminative-S method is the closest. To see the distinction between D-S and D-M, consider the case where we can easily find a function that ranks the correct outputs higher than the incorrect outputs, but cannot distinguish between the incorrect outputs. From the perspective of expected loss, this is sufficient. From the perspective of D-S, this is also sufficient. But it's not good enough for D-M. D-M requires that if some really bad y' exists, then f(x,y) has to be much larger (by l(y,y')) than f(x,y'). I don't see a reason why we should care about this.
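
To make the contrast concrete, here is a small sketch (a toy of my own, not any particular system) that scores the true output and a few invented candidates with a linear f(x,y) = w*h(x,y), and reports the conditional objective next to the D-M and D-S constraint violations; the feature vectors, weights and losses are all made up for illustration.

```python
import numpy as np

def f(w, h):
    """Linear scoring function f(x,y) = w . h(x,y)."""
    return float(np.dot(w, h))

# invented feature vectors: the true output y and three wrong candidates y'
h_true = np.array([1.0, 0.0, 1.0])
candidates = {                       # name -> (feature vector h(x,y'), loss l(y,y'))
    "y1": (np.array([1.0, 1.0, 0.0]), 1.0),
    "y2": (np.array([0.0, 1.0, 1.0]), 2.0),
    "y3": (np.array([0.0, 1.0, 0.0]), 3.0),
}
w = np.array([1.0, -0.5, 0.5])       # a hypothetical weight vector

s_true = f(w, h_true)
s = {name: f(w, h) for name, (h, _) in candidates.items()}

# Conditional: how much (log) probability mass the model puts on the true output
log_Z = np.log(np.exp(s_true) + sum(np.exp(v) for v in s.values()))
print("conditional log p(y|x):", s_true - log_Z)

# Discriminative-M: the true output must beat each candidate by that candidate's loss
dm = max(0.0, max(s[n] + l - s_true for n, (_, l) in candidates.items()))
# Discriminative-S: a constant margin of 1, with violations weighted by the loss
ds = max(0.0, max(l * (s[n] + 1.0 - s_true) for n, (_, l) in candidates.items()))
print("D-M violation:", dm, "   D-S violation:", ds)
```

With these numbers the D-S constraints are all met (every wrong candidate trails the truth by at least 1), while D-M still reports a violation of 1.0 because the highest-loss candidate isn't beaten by a full margin of 3 -- exactly the distinction drawn above.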

Another way of looking at these methods is how we can optimize them. Let's suppose f is just w*h(x,y), where h(x,y) = sum_{r in y} h(x,r) is a decomposition over regions (adjacent tags in sequences, for instance), w is a weight vector and h provides a feature vector. We can compute the gradient of the optimization criteria (using exponentiated gradients for the discriminative methods) for each of the methods on a given input (x,y) as:

sum_{r in y} h(x,r) - sum_{y'} sum_{r' in y'} X h(x,r')

This is just the difference between the true features h(x,r) and the guessed features h(x,r'), where the guesses are weighted by X. The only difference between the above techniques is how we define X.
  • Conditional: X = (1/Z) exp [ sum_{r' in y'} w*h(x,r') ]
  • Discriminative-M: X = (1/Z) exp [ sum_{r' in y'} e C [ (t-1) l(y,r') + w*h(x,r') ] ]
  • Discriminative-S: X = (1/Z) exp [ sum_{r' in y'} e C l(y,r')^(t-1) w*h(x,r') ]
In all cases, Z is defined so that the sum is unity. In the latter two, e is a learning rate, C is the SVM slack hyperparameter and t is the iteration number. The conditional X can be interpreted as the posterior probability of this (incorrect) region under the current model. Going from conditional to D-M involves the insertion of the loss function, leading to a loss-augmented posterior. The (t-1) term essentially means that as more iterations are performed, the loss term becomes increasingly important. In the D-S version, we multiply by the loss rather than add it.
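
To see the machinery for the conditional case in code, here is a brute-force sketch of mine (the label set, transition features and weights are all invented) that enumerates every labeling of a tiny chain, computes the posterior weight X for each one, and returns the gradient as the true features minus the X-weighted guessed features.

```python
import itertools
import numpy as np

LABELS = ["A", "B"]

def region_features(x, y, r):
    """Indicator features for region r = the adjacent tag pair (y[r], y[r+1])."""
    feats = np.zeros(len(LABELS) * len(LABELS))
    feats[LABELS.index(y[r]) * len(LABELS) + LABELS.index(y[r + 1])] = 1.0
    return feats

def features(x, y):
    """h(x,y) = sum over regions r in y of h(x,r)."""
    return sum(region_features(x, y, r) for r in range(len(y) - 1))

def conditional_gradient(w, x, y_true):
    """Gradient of log p(y_true|x): true features minus posterior-weighted guesses."""
    candidates = [list(c) for c in itertools.product(LABELS, repeat=len(y_true))]
    scores = np.array([np.dot(w, features(x, y)) for y in candidates])
    X = np.exp(scores - scores.max())
    X /= X.sum()                                  # the posterior p(y'|x), i.e. X above
    guessed = sum(p * features(x, y) for p, y in zip(X, candidates))
    return features(x, y_true) - guessed

w = np.zeros(len(LABELS) ** 2)
x = ["tok1", "tok2", "tok3"]
print(conditional_gradient(w, x, ["A", "B", "A"]))
```

The enumeration over all |LABELS|^n sequences is, of course, exactly the sum that the dynamic programming trick at the end of the post avoids.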

What does this tell us? Well, it tells us exactly what we saw before: the gradient on the D-M model is high when the loss is high, even if the (non-loss-augmented) posterior is zero. This means that even if our model would essentially never produce some wrong region, the gradient at that region is still large whenever its loss is large. This contrasts with the D-S gradient, which, once the posterior disappears, doesn't care about the loss anymore.
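
Here is a tiny numerical illustration of that point (the scores and losses are made up, and the e, C and (t-1) factors are folded into constants): a candidate the model has all but ruled out keeps substantial weight under the D-M loss-augmented posterior, but essentially none under the conditional or D-S weightings.

```python
import numpy as np

# made-up model scores w*h(x,y') and losses l(y,y') for three wrong candidates;
# the third is nearly ruled out by the model but would be a very costly mistake
scores = np.array([2.0, 0.5, -4.0])
losses = np.array([1.0, 1.0, 8.0])

def normalize(v):
    """Exponentiate and renormalize so the weights sum to one (the 1/Z above)."""
    e = np.exp(v - v.max())
    return e / e.sum()

X_conditional = normalize(scores)           # plain posterior
X_disc_m = normalize(scores + losses)       # loss-augmented: the loss is added
X_disc_s = normalize(losses * scores)       # the loss multiplies the score instead

print("conditional:", X_conditional.round(3))   # third candidate: ~0
print("D-M        :", X_disc_m.round(3))        # third candidate: large weight
print("D-S        :", X_disc_s.round(3))        # third candidate: ~0 again
```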

The only tricky thing in any of these models is to compute the sum over y' efficiently (or, alternatively, to approximate it by computing the max over y'). In well-formed cases, like sequence labeling or parsing (under strict assumptions), this is possible using standard dynamic programming (or, sometimes, max-flow) methods.
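
For sequence labeling, that dynamic program is the usual forward recursion. Below is a minimal sketch of my own (with arbitrary random unary and transition scores) that computes log sum_{y'} exp(score(y')) for a linear chain in O(n K^2) time and checks it against brute-force enumeration.

```python
import itertools
import numpy as np

def chain_log_partition(unary, trans):
    """Forward recursion: unary is (n, K) per-position scores, trans is (K, K)."""
    alpha = unary[0].copy()                  # log-sum of scores of length-1 prefixes
    for t in range(1, unary.shape[0]):
        # alpha_t[j] = logsumexp_i(alpha_{t-1}[i] + trans[i, j]) + unary[t][j]
        scores = alpha[:, None] + trans + unary[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

# sanity check against enumerating all K^n sequences on a tiny problem
rng = np.random.default_rng(0)
n, K = 4, 3
unary, trans = rng.normal(size=(n, K)), rng.normal(size=(K, K))
brute = np.log(sum(
    np.exp(sum(unary[t][y[t]] for t in range(n)) +
           sum(trans[y[t], y[t + 1]] for t in range(n - 1)))
    for y in itertools.product(range(K), repeat=n)))
print(chain_log_partition(unary, trans), brute)   # the two numbers should agree
```

Replacing the log-sum-exp in the recursion with a max gives the alternative mentioned above: computing the max over y' instead of the sum.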

Incidentally, the "conditional" model refers to CRFs, the "discriminative-s" model refers to SVMISOs and the "discriminative-m" model refers to M3Ns.

3 comments:

Kevin Duh said...

Appreciate the post on structured prediction! I would like to understand this better myself. :)

Regarding D-M vs. D-S: I thought the purpose of f(x,y) > f(x,y') + l(y,y') in D-M is to impose a margin of size l(y,y') on your f(). You don't just want your f() to give higher scores to (x,y) than to (x,y') in your training data; you also want a large margin so it generalizes well.

hal said...

First, a minor comment: D-S also imposes a margin (I forgot to write it above). But it only imposes a constant margin of 1.

So the difference is that D-S scales the *slack variables* by the loss, while D-M scales the *margin* by the loss. To me, the former is much more intuitive.

Ankan said...

Chanced upon your blog while surfing about structured data... here is something you might find interesting (if you haven't seen it already):
http://nagoya.uchicago.edu/~dmcallester/colbounds.pdf

David proves some generalization bound theorems which suggest that M3Ns should be preferred over Hidden Markov SVMs... but the relationship is much more subtle. I haven't looked at D-S much, but I wonder how an approach like least squares would perform compared to the margin-based approaches in this setting.

Looking forward to more discussion.
