03 April 2006

What is Structured Prediction?

It is surprisingly difficult to define structured prediction concretely. The root of the question is: what does it mean for a problem to be structured? A very reasonable condition seems to be the following:

Condition 1: Output elements decompose into variable length vectors over a finite set.

This notion of decomposition is fairly ubiquitous (it subsumes things like sequence labeling, parsing, MT, protein folding, etc.). Unfortunately, it does not seem sufficient. Requiring only C1 means that problems like binary classification, multitask learning and other clearly non-structured problems fall under the heading of SP. It also doesn't stress the fact that there's some relation between the elements of these vectors.

There are, it seems, essentially two places to introduce dependence between elements in the output vectors: the features or the loss. This leads to two new conditions:

Condition 2-Feat: The feature function does not decompose over any vector representation of the output.

Condition 2-Loss: The loss function does not decompose over any vector representation of the output.

(For both of these, by "does not decompose" I mean that it is not invariant under permutations of the output vector. For example, in sequence labeling world, Hamming loss does decompose.)

In some sense, C2-Loss subsumes C2-Feat. This is because it is reasonable to believe that if the loss function doesn't decompose, then we will need/want to introduce features that span at least as much of the "nondecomposition" of the output vector as the loss. For instance, in sequence labeling under a second-order Hamming loss (accuracy on adjacent labels, which does not decompose), one would presumably want to use bigram-style features.
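To make the permutation test concrete, here is a small Python sketch (toy code with made-up function names) contrasting the two losses: Hamming loss is unchanged when the gold and predicted sequences are jointly permuted, while the second-order version generally is not.

import random

def hamming_loss(y, yhat):
    # decomposes: a sum of independent per-position terms
    return sum(yi != yhati for yi, yhati in zip(y, yhat))

def bigram_hamming_loss(y, yhat):
    # second-order Hamming: one unit of loss per adjacent pair that isn't exactly right
    return sum((y[i], y[i + 1]) != (yhat[i], yhat[i + 1]) for i in range(len(y) - 1))

y    = ["N", "N", "V", "D", "N"]
yhat = ["N", "V", "V", "D", "N"]

perm = list(range(len(y)))
random.shuffle(perm)
y_p    = [y[i] for i in perm]      # apply the SAME permutation to both sequences
yhat_p = [yhat[i] for i in perm]

print(hamming_loss(y, yhat) == hamming_loss(y_p, yhat_p))                # always True
print(bigram_hamming_loss(y, yhat) == bigram_hamming_loss(y_p, yhat_p))  # often False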

By mixing and matching these conditions, we obtain several different notions of structured prediction.

Condition 1 alone leads to fully decomposable problems and for these we can use independent classifiers ala Dan Roth and others.

Condition 1 and Condition 2-Feat is the standard interpretation of structured prediction and leads to, among other things, max-margin Markov nets.

Condition 1 and Condition 2-Loss is my preferred interpretation and leads to, as far as I know, no published work (though we have a paper at Snowbird on it, and something conditionally accepted at ICML).

Given that 1 and 2-Feat is standard, I have some desire to defend my interpretation. To me, the 1+2-Feat interpretation is specifying SP as a solution, not a problem. That is, the world doesn't force us to use structure: we simply do so because we believe structured features will be useful. On the other hand, when the loss function doesn't decompose, the world is essentially telling us that we had really better pay attention to the structure. We can't reasonably expect to do well otherwise. I like to define problems in terms of what the world gives us, not what we create for ourselves.

My interpretation successfully excludes problems like multiclass classification and multitask learning from the realm of structured prediction. Interestingly, it also excludes sequence labeling under Hamming loss and collective classification (ala statistical relational learning). I don't see this as necessarily a bad thing. Especially since this definition now includes problems I care about like MT, summarization, ASR, etc. under any reasonable loss function. (Any reasonable loss function will not decompose.)

19 comments:

Kevin Duh said...

I'd venture to give another definition:

SP are problems where:
- Condition 1b. We are predicting the values of multiple variables (elements)
- Condition 2-Variable: The output variables have dependencies or constraints
among each other

Condition 1b simply makes the distinction between predicting a single variable
(conventional classification) and predicting multiple variables. In practice,
even if our output is a single object (e.g. a parse tree), we can represent it by
multiple variables, each indicating one of its components (e.g. activated CFG
rules at certain positions).

Condition 1b differs from Condition 1 in the original post. I don't see why a
variable length vector is part of the definition, although this often happens to
be the case.

Condition 2-Variable simply states that there are dependencies between these
output variables--without them, there is nothing to exploit and SP would not
work better than individual independent classifiers. I agree that Condition
2-Loss is more general than Condition 2-Feature in that it specifies the
problem, rather than the solution. I think Condition 2-Variable is more general
than Condition 2-Feature in the same way. We may define different loss functions
(e.g. approximate loss functions) on the same problem based on the needs of our
solution, but the fact that some variables are coupled is an inherent property
of the problem.

One question I've been pondering is: When should one cast a problem as a SP, and
when should one not? It is easy to see that POS tagging where successive labels
have dependencies should be viewed as SP, but what about problems like MT, where
you're predicting a whole sequence of words and allowing for permutations? I
can't put my finger on it yet, but the fact that nobody has done MT as SP yet
means there's something non-trivial there (perhaps the argmax operation). In
general, can ALL problems be fruitfully solved by a SP approach?

hal said...

RE 1 vs 1b: You can always decompose the output into a vector. For instance, a vector of 0s and 1s as stored on my hard drive. 1 just makes the "multiple variables" more formal. I think this is a minor point and that we basically agree.

I thought a lot about 2-Variable. I think this is how most people would define the problem. The issue is that this just "reduces" the difficulty of defining structured prediction to the difficulty of defining "dependencies or constraints." Maybe you have something better in mind, but to me, a dependency is essentially a feature and a constraint is essentially part of the loss function. Basically, I want to be able to point at a problem and say yes/no about whether it is an SP problem. In order to do that, I need you to define what a dependency is. I don't think there are any concrete places to look for such a definition other than the features or the loss.

I'm not convinced that POS tagging where successive labels have dependencies should be viewed as SP. Why? Under Hamming loss, take all your training and test data and randomly permute the elements. Essentially nothing changes. You should be able to learn just as well in either case. This says to me that this is not an SP problem. Do the same for MT or speech and you're hosed.

There are many reasons (IMO) MT hasn't been solved using SP. The argmax is a big deal. Scaling is a big deal. The notion of a single correct output (as in tagging and parsing) versus just some score we're trying to optimize is a big deal. Hidden variables are a big deal. Luckily, most of these big deals (except the last) are essentially solved now. Anyone want to try it? :)

hal said...

Liang: under my mapping, constraint = loss, dependency = feature. Then it's fairly clear for any model what is what. I'm not really sure what Kevin means by "constraint" and "dependency" so I can't answer for that :). It seems much less clear.

hal said...

Right: loss is ROUGE; features are dependency things, redundancy, tf-idf-style things, etc.

I'm not saying Kevin's wrong. I'm hoping for some more discussion here. The question is a bit of a semantic/pedantic one, but I think it's interesting nonetheless to figure out what problem we're trying to solve.

The difference is in the loss function. For simple syntactic chunking, the loss (F-measure) does not decompose over tags, whereas Hamming does. This means that you cannot permute your sequences and expect to do well for chunking/named-entity tagging/etc., but you can for POS tagging.

Kevin Duh said...

Hal, I agree that we agree on Conditions 1 and 1b. Regarding Condition 2-Variable, I define dependency and constraints as follows:

- dependency: the value of one variable A affects the value of another variable B. It's like the lack of independence of events: i.e. P(A,B) != P(A)P(B).

- constraint: hard restrictions on what the assignments of all values should be. For example, in dependency parsing using SP, you may be predicting a vector of variables, where each variable corresponds to a word and its value indicates the head of that word. The constraint would require that the assignment form a *valid* dependency tree (e.g. no cycles).

Knowing the dependencies of your problem helps you define useful features. (ie. If A and B are dependent, then a feature f(A,B) would likely be useful). Knowing the constraints will help you design the decoder so that it doesn't enumerate over variable assignments that aren't valid.
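As a concrete illustration (a toy sketch; the is_valid_dependency_tree name is made up for this comment), the single-root/no-cycles constraint on the head-vector representation above might be checked like this:

def is_valid_dependency_tree(heads):
    # heads[i] = index of word i's head; -1 marks the root word
    # checks only the constraints discussed here: exactly one root, no cycles
    if sum(1 for h in heads if h < 0) != 1:
        return False
    for i in range(len(heads)):
        seen, j = set(), i
        while heads[j] >= 0:    # walk up the head chain toward the root
            if j in seen:       # revisiting a word means a cycle
                return False
            seen.add(j)
            j = heads[j]
    return True

print(is_valid_dependency_tree([-1, 0, 1]))  # chain rooted at word 0: True
print(is_valid_dependency_tree([-1, 2, 1]))  # words 1 and 2 point at each other: False

A decoder that only enumerates assignments passing such a check never wastes time on invalid trees.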

I think I'm beginning to appreciate what you mean by Condition 2-Loss, though. There definitely is a big difference between having F-score loss (as in chunking) vs. Hamming loss (as in tagging). Whether this difference makes a problem SP or non-SP, I'm not convinced yet.

That being said, I think you're right to put the loss function in your definition of SP, Hal. It definitely needs to be there, since we know 0/1 loss is not suitable for structured outputs. (e.g. say you have two parses, one mostly correct and the other mostly wrong--they should receive different losses.) I'll propose another condition:

Condition 3: There is a loss function that differentiates between incorrect outputs. In other words, not all incorrect outputs are equally bad.

hal said...

Defining dependency to be non-independence of probability is what I assumed you would use. But you'd really have to condition these probabilities on the input. Then you have P(A,B|X) != P(A|X)P(B|X). Now again it is unclear when such a statement would hold: once you condition on X, as in the joint inference case, all information is known.

How you define constraint is how I define loss: If you produce something that is not a tree, you get infinite loss. But I think it's a minor distinction.

Condition 3 also applies to regression, and even to approximations of 0/1 like log-, exp- and hinge-loss.

Yaroslav Bulatov said...

Kevin --
requiring only decomposition and statistical dependence doesn't seem to rule out things like multi-task learning. For instance, suppose the following structure:

Y1->X<-Y2

Y1 and Y2 are dependent after conditioning on X. However, we will still label Y1 and Y2 independently, in the sense that we have a labelling rule for Y1 and a labelling rule for Y2 which can be applied individually in any order. In essence, that's how all sequence labelling problems (under Hamming loss) can be viewed -- for a sequence of length n, after conditioning on x we have n independent classifiers.

Hal --
perhaps instead of talking about "structured vs. unstructured learning" we should talk about "structured vs. unstructured output spaces". Given a space of labels, the question is -- can we impose structure on that space to make the learning more effective?

This begs the question -- what is structure?

A good labelling system will assign similar y's to similar x's. In unstructured spaces, similarity between outputs is measured by 0-1 loss -- either two labels are the same or they aren't. Perhaps we can define a "structured output space" as a space in which we can define similarity between y's in a more detailed fashion.

This would include linear regression on R, which I think makes sense because we are treating R as an ordered field (extra structure) as opposed to a set of numbers.

This also begs the question -- what is the output space? Suppose we are doing POS tagging. Is our output space the set of POS tags, or the set of POS labellings? One is structured, the other isn't. Perhaps you could look at the loss to determine what to call the "output space". With Hamming loss it may make more sense to call the set of POS tags our output space, which would make this "learning in an unstructured output space".

hal said...

Yaroslav --

Sure, by "structured learning" I mean "structured prediction" or "structured output spaces." Clearly input has structure, but lots of work has gone to addressing that.

I don't see what makes "similar ys to similar xs" different from just standard binary classification. But it seems that what you're getting at is: structure comes from our minds -- we put it in the features.

Regression on R is interesting. I can see the argument for it being structured. But if our loss function is just l(y,y') = 1[y != y'] as opposed to (y-y')^2, then is it still structured? I would say no. But I'm not sure we should (want to) say it's structured in the first place. Modulo the field interpretation, does this mean that when we solve binary classification under 0/1 loss (which is clearly not structured) by approximating 0/1 with hinge or Huber loss, the problem becomes structured? One might argue that it is not, because the world did not impose hinge on us: we chose it ourselves. So are we imposing structure when it doesn't overtly exist, as in the linear chains for Hamming loss case? In order to rule such things out, one needs to say that the structure decomposes somehow, and R doesn't decompose in any reasonable algebraic sense (that I know of).

This is indeed a slippery concept.

Yaroslav Bulatov said...

I'm confused by your "structured <=> does not decompose" definition --

For instance, suppose Y \in {1,11} and you have to do binary classification. Your outputs are variable-length vectors, and your loss function doesn't decompose, so why isn't this a structured problem?

hal said...

yaroslav --

i was being fast and loose (i think you're swapping a forall with an exists). basically, what i'm saying is that something decomposes if there is a finite number M and an encoding of Y as a variable-length vector over an alphabet of size M (the length of the vector is a function of the input) such that the loss function is invariant under permutations of this new vector. to be structured, i say you do not decompose ==> you cannot find *ANY* encoding that decomposes.
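written out semi-formally (my notation; i'm assuming the permutation gets applied jointly to the true and predicted encodings): a loss $\ell$ decomposes iff there exist a finite $M$, an encoding $e : \mathcal{Y} \to \{1,\dots,M\}^*$ (with $|e(y)|$ determined by the input), and a function $L$ with $\ell(y,\hat{y}) = L(e(y), e(\hat{y}))$ such that $L(\sigma u, \sigma v) = L(u,v)$ for every permutation $\sigma$. structured = no such encoding exists.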

and, in particular, M doesn't have to be 2.

so the natural encoding of the multiclass problem, with M=11, yields a decomposable encoding for your problem.

hal said...

thinking about this some more, this may not be sufficient. if i accept condition 1 (we can write the structure as variable length vectors over an alphabet of size M), then i think the "does not decompose" never happens. that is, it can always be made to decompose.

why?

think primes.

basically, since there's no constraint on the size of the encoding vector, i'm just going to write out a representation of the original vector. here's how.

let p_1, ... be an enumeration of the primes.

suppose i have the vector y_1, ..., y_N. for each label m=1..M, i'm going to look at the positions it occurs in within y_1...y_N. let's say label 1 occurs in positions 2 and 5: then i'm going to write out p_2 * p_5 many 1s into my encoding (3*11 = 33 ones). let's say label 2 occurs in positions 1 and 3: then i'll also write out 2*5 = 10 twos, and so on. i can reconstruct the original vector by factoring the counts, and of course the counts are invariant under permutations. once i've reconstructed the original vector, i can just compute the original loss. so all losses decompose.
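here's a toy sketch of that construction in python (function names made up for this comment; labels assumed to be 1..M):

def first_primes(n):
    # first n primes by trial division (fine for a toy)
    ps = []
    c = 2
    while len(ps) < n:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def encode(y, M):
    # for each label m, multiply the primes at the positions where m occurs,
    # then emit that many copies of m; a multiset, hence permutation-invariant
    ps = first_primes(len(y))
    enc = []
    for m in range(1, M + 1):
        prod = 1
        for i, label in enumerate(y):
            if label == m:
                prod *= ps[i]
        if prod > 1:
            enc += [m] * prod
    return enc

def decode(enc, M, n):
    # factor each label's count back into the positions it came from
    from collections import Counter
    ps = first_primes(n)
    y = [None] * n
    for m, count in Counter(enc).items():
        for i, p in enumerate(ps):
            if count % p == 0:
                y[i] = m
                count //= p
    return y

y = [2, 1, 2, 3, 1]   # label 1 at positions 2 and 5 -> 3*11 = 33 ones, etc.
assert decode(encode(y, 3), 3, len(y)) == y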

the trick to making this work is that i don't limit how long the encoding vector is. but i don't know how it is reasonable to limit it. one might say something like "polynomial in the size of the input", which is reasonable because it's useless to have an encoding that's intractable. on the other hand, it's unclear how one measures the "size of the input" for arbitrary structured prediction problems.

i'd really appreciate people's thoughts here. i think the issue of decomposition is really important, but my current way of defining it seems to be missing something.

Yaroslav Bulatov said...

I think you can't tell if a problem is structured by just looking at features and loss function, you also have to make assumptions about the data generating process.

Suppose your algorithm has to predict sequence Y given sequence X under 0-1 loss. Is this a structured problem? Yes if X/Y are generated by a Markov process, no if X/Y are binary encodings of some atomic labels.

If we treat "structured" as a property of data generating process, defining it is easy (dependence between Y's)

hal said...

I'm not sure how this allows us to tell if something is structured or not. After all, we pretty much never know the data generating process. For almost any problem, I can come up with two reasonable DGPs (reasonable = neither seems really likely, but neither is straight-out wrong and I can't come up with something better), one which would render the problem structured and one which would not.

(There's also the additional question as to whether even if you knew the DGP gave structure, if the loss fn didn't require it, should you still solve it using structured techniques.)

Yaroslav Bulatov said...

Yes, I agree that we can't *really* know whether the problem is structured. If we can't rule out an unfactorizable multinomial distribution generating our outputs, we can't rule out the problem being unstructured.

Another interesting issue is what is an "unstructured technique".

Suppose our DGP is a Markov-1 HMM, and suppose the X's and Y's are observed during training, so I do standard max. lik. During testing I observe X1 X2 ... Xn, and predict each Yi using Forward-Backward. Is that a structured technique?
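Concretely, I mean something like this toy numpy sketch (my own variable names; no rescaling, so only suitable for short sequences):

import numpy as np

def forward_backward_decode(x, pi, A, B):
    # x: observation indices; pi[s]: initial probs; A[s,t]: transition s->t; B[s,o]: emission
    T, S = len(x), len(pi)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):                  # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):         # backward pass
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    gamma = alpha * beta                   # unnormalized posterior marginals
    return gamma.argmax(axis=1)            # each Y_t predicted on its own

# a two-state toy: sticky transitions, fairly reliable emissions
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
print(forward_backward_decode([0, 0, 1, 1], pi, A, B))   # -> [0 0 1 1]

Each Y_i is chosen by maximizing its own marginal, so the final predictions are made independently even though the marginals were computed jointly.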

Kevin Duh said...

(Sorry for the lack of response--I was out of blogworld for a while due to some deadlines)

Yaroslav--I understand what you mean by "requiring only decomposition and statistical dependence doesn't seem to rule out things like multi-task learning". Thanks for pointing that out.

Hal--didn't you say there's been talk about links between structured learning and multitask learning? Is it due to the above observation, or is there more to it?

Personally, I'm back to square one: what is structure? Does it arise from our encoding of the problem? From the loss function? From variable dependence? I'm not sure myself anymore... Man, I didn't expect this to be such a hard thing to define...

Yaroslav Bulatov said...

Kevin -- I thought a bit about it, and it seems that it's the *IN*dependence that makes the problem structured.

If Y\in {0,1}^n, and there are no independencies, can you do any better than treat this as multi-label classification?

hal said...

Yaroslav -- I would probably consider your F-B algorithm to certainly be a structured technique, but I'm not sure that this is a structured problem (it is easy to imagine that structured prediction techniques can solve non-structured problems). Given that you're doing F-B, it seems that you're trying to find the label that maximizes the marginal. This hints to me that you care about Hamming loss, which I'm hesitant to call structured. My general opinion is that if I can imagine that it is *possible* to do as well predicting independently as not, then the problem does not seem structured to me.

Suppose I took the UCI data set and concatenated all the examples, thus treating it as a "sequence labeling problem." The only reasonable loss function here would be Hamming loss (possibly weighted). But I would never consider this a structured prediction problem, given the fact that we know (assume) that the DGP produces each X_i independently. If the loss were bi-Hamming loss (Hamming over adjacent pairs), then I would consider this structured, even though the loss function doesn't really make sense. IOW, knowing the DGP helps design the loss, but doesn't make something structured or not. (This is somewhat like the fact that "colorless green ideas sleep furious" looks syntactically like a sentence even if it makes no sense semantically.)

Kevin -- I'm with you here about it being hard to define :). Regarding multitask learning, yes, essentially the observation is that we are trying to predict multiple things. I don't consider the two problems the same and conversations with Rich Caruana indicate that he doesn't, either (more on this later). One can easily imagine that multitask learning techniques can greatly improve the performance of SP algorithms, particularly search-based structured prediction algorithms :).

hal said...

erm, "furiously". sigh.
