02 February 2006

Joint Inference

Joint inference is becoming really popular these days. There's even a workshop coming up on the topic that you should submit a paper to! (That is, unless your paper is more related to the computationally hard NLP workshop I'm running with Ryan McDonald and Fernando Pereira).

The problem with joint inference is that while it seems like a great idea at the outset -- getting rid of pipelining, etc. -- in theory it's not always the best choice. I have a squib on the topic, titled Joint Inference is not Always Optimal and I welcome any and all comments/criticisms/etc.


Kevin Duh said...

I'll try to summarize your squib in information-theoretic and graphical model terms. Using the chunk/tag prediction problem, suppose X is the input sentence, Y is the tag, and Z is the chunk. Let's consider the single-output joint inference case, where we only worry about the loss on Z, and care less about the loss on Y. Your argument is that even if we can predict Y correctly, Y will give us no more information for predicting Z because X already contains all the information. In information-theorectic terms, you're saying that the conditional entropy

H(Z|X) = H(Z|X,Y).

Intuitively, this makes sense, because Y is derived from X, and Z is derived from the same X with no additional info.
This independence corresponds to the graphical model:

Y <-- X --> Z. (1)

Since we don't care about Y, we have the graphical model below. I call this "direct prediction", which is what you're advocating in the single-output case:

X --> Z. (2)

In joint inference, you usually see people draw this graphical model instead (see, e.g. Sutton's Factorial CRF paper):

X --> Y --> Z. (3)
X --------> Z.

(Forgive my ASCII art: Z is supposed to have two incoming arrows, one from Y and one from X.) Graphical model (3) is the joint model, and it's been shown to outperform a pipelined model. The question you raise, however, is:

Is (2) better than (3)?

Your argument is that (2) is better. I think you're right in theory, but I think (2) is not always better than (3) in practice (even with the correct implementation you mentioned). Here's my reasoning:

a) I agree that (1) and (2) are the true model of the data. Y gives no more information to Z given X. However, note that (3) is a larger set of probability models that includes (1), so (3) technically includes the correct model too.

b) Given unlimited data, both (2) and (3) should give you the same result. What will happen is that, since (2) is the truth, the learned model for (3) will effectively have a zero probability table for the Y-->Z dependency.

c) Now, here's why I think (3) may be better than (2) in practice sometimes. The function that maps directly from X to Z may be complex, whereas the function from X to Y and Y+X to Z may be a lot simpler. Given limited data, you may get better performance in (3) because less complex hypothesis spaces gives you better error bound (think VC dimension).

So, here's my personal conclusion:

Joint inference as in model (3) may not always be worse than (2), even though (2) is the correct model of the data. (2) potentially has a more complex hypothesis space, and that would be the reason to use (3). However, when this is not the case, (2) is better because it solves the problem directly. In fact, if we consider the Y's as additional features for X, then the models for (2) and (3) become the same.

My 2 cents worth. This is a thought-provoking paper and I look forward to more discussion!

hal said...

Kevin -- thanks for the IT/GM summary!

Some thoughts wrt your three points:

(a) Certainly you're right that (3) includes (2). But in practice might this not make it harder, not easier, to find the correct model since you're searching over a much larger space of possible distributions? (Though see the other point below.)

(b) I don't care about infinite data situations :). Nearly everything becomes the same as n->infty.

(c) I had a similar thought, the expressed differently. I think of it like: consider the limited data case. Then, we can think of the Y (say, POS tag) as being an efficient "summary" of the information around that position in X. Including this summary as a feature might turn out to make the prediction of Z easier.

For me, the jury is out on whether my (c) is reasonable. The reason I don't completely buy it is that it makes an implicit assumption that the underlying learning algorithm doesn't work well. If it worked well, the information theoretic argument would make this summary useless. But if the underlying learning algorithm is biased in some way, having such a summary could potentially be useful.

I think this is what you mean by (2) being a more complex model. Intuitively, (3) is more complex since it is a superset of (2). But you seem to suggest (as I did above) that one can be more feature-lean on the X->Z mapping given that we have an X->Y->Z mapping. I don't think you mean to suggest -- and I don't buy -- the suggestion that X->Y could be easy and Y->Z could be easy but X->Z could be hard.

I think this argument is badly in need of a theorem. In the assumption-free case, it cannot be better to learn (3) than (2) from a learning theoretic perspective. I really would like to know what assumptions are sufficient/necessary to show that (3) is better. What we'd seem to need to show is that (3) leads to lower sample complexity than (2), so that even if by using (3) we lose a bit on fitting the training data, we win in the complexity term of the generalization bounds. My guess is that the assumptions necessary to make this true would lie somewhere around a combination of: (A) the learner generalizes poorly with many features, (B) H(Z|X) is not much less than H(Z|Y) for the correct Y, and (C) one has neither too little nor too much data.

Does anyone see an obvious/clever/not-so-obvious solution to this? Or am I completely wrong? :)

Kevin Duh said...

Hal--I agree, this argument is badly in need of a theorem. I think it's reasonable to argue in favor of either model (2) or (3) now. The question of which model is better seems dependent on

a) data set size
b) the feature set (in particular, whether Y summarizes the features of X)
c) hypothesis function class
d) the dependence between Y and Z (I'm not sure how to characterize this formally)

The question we should ask is: given what scenario is (3) better than (2), and vice versa? Perhaps we will end up with the same conclusion you've proved in your paper, but this is an interesting exercise to examing things under a new light. I'll think about this some more and get back to you..

hal said...

I've attempted a really crude analysis of this, and the message seems to be: if the loss is sufficiently low with respect to the norm of the weight vector, then you might be able to do better with single-output joint inference. It's very unclear, though, and the analysis is bad for many reasons, but maybe it's a bit helpful.

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花