## 02 February 2006

### Joint Inference

Joint inference is becoming really popular these days. There's even a workshop coming up on the topic that you should submit a paper to! (That is, unless your paper is more related to the computationally hard NLP workshop I'm running with Ryan McDonald and Fernando Pereira).

The problem with joint inference is that while it seems like a great idea at the outset -- getting rid of pipelining, etc. -- in theory it's not always the best choice. I have a squib on the topic, titled Joint Inference is not Always Optimal and I welcome any and all comments/criticisms/etc.

Kevin said...

I'll try to summarize your squib in information-theoretic and graphical model terms. Using the chunk/tag prediction problem, suppose X is the input sentence, Y is the tag, and Z is the chunk. Let's consider the single-output joint inference case, where we only worry about the loss on Z, and care less about the loss on Y. Your argument is that even if we can predict Y correctly, Y will give us no more information for predicting Z because X already contains all the information. In information-theorectic terms, you're saying that the conditional entropy

H(Z|X) = H(Z|X,Y).

Intuitively, this makes sense, because Y is derived from X, and Z is derived from the same X with no additional info.
This independence corresponds to the graphical model:

Y <-- X --> Z. (1)

Since we don't care about Y, we have the graphical model below. I call this "direct prediction", which is what you're advocating in the single-output case:

X --> Z. (2)

In joint inference, you usually see people draw this graphical model instead (see, e.g. Sutton's Factorial CRF paper):

X --> Y --> Z. (3)
X --------> Z.

(Forgive my ASCII art: Z is supposed to have two incoming arrows, one from Y and one from X.) Graphical model (3) is the joint model, and it's been shown to outperform a pipelined model. The question you raise, however, is:

Is (2) better than (3)?

Your argument is that (2) is better. I think you're right in theory, but I think (2) is not always better than (3) in practice (even with the correct implementation you mentioned). Here's my reasoning:

a) I agree that (1) and (2) are the true model of the data. Y gives no more information to Z given X. However, note that (3) is a larger set of probability models that includes (1), so (3) technically includes the correct model too.

b) Given unlimited data, both (2) and (3) should give you the same result. What will happen is that, since (2) is the truth, the learned model for (3) will effectively have a zero probability table for the Y-->Z dependency.

c) Now, here's why I think (3) may be better than (2) in practice sometimes. The function that maps directly from X to Z may be complex, whereas the function from X to Y and Y+X to Z may be a lot simpler. Given limited data, you may get better performance in (3) because less complex hypothesis spaces gives you better error bound (think VC dimension).

So, here's my personal conclusion:

Joint inference as in model (3) may not always be worse than (2), even though (2) is the correct model of the data. (2) potentially has a more complex hypothesis space, and that would be the reason to use (3). However, when this is not the case, (2) is better because it solves the problem directly. In fact, if we consider the Y's as additional features for X, then the models for (2) and (3) become the same.

My 2 cents worth. This is a thought-provoking paper and I look forward to more discussion!

hal said...

Kevin -- thanks for the IT/GM summary!

Some thoughts wrt your three points:

(a) Certainly you're right that (3) includes (2). But in practice might this not make it harder, not easier, to find the correct model since you're searching over a much larger space of possible distributions? (Though see the other point below.)

(b) I don't care about infinite data situations :). Nearly everything becomes the same as n->infty.

(c) I had a similar thought, the expressed differently. I think of it like: consider the limited data case. Then, we can think of the Y (say, POS tag) as being an efficient "summary" of the information around that position in X. Including this summary as a feature might turn out to make the prediction of Z easier.

For me, the jury is out on whether my (c) is reasonable. The reason I don't completely buy it is that it makes an implicit assumption that the underlying learning algorithm doesn't work well. If it worked well, the information theoretic argument would make this summary useless. But if the underlying learning algorithm is biased in some way, having such a summary could potentially be useful.

I think this is what you mean by (2) being a more complex model. Intuitively, (3) is more complex since it is a superset of (2). But you seem to suggest (as I did above) that one can be more feature-lean on the X->Z mapping given that we have an X->Y->Z mapping. I don't think you mean to suggest -- and I don't buy -- the suggestion that X->Y could be easy and Y->Z could be easy but X->Z could be hard.

I think this argument is badly in need of a theorem. In the assumption-free case, it cannot be better to learn (3) than (2) from a learning theoretic perspective. I really would like to know what assumptions are sufficient/necessary to show that (3) is better. What we'd seem to need to show is that (3) leads to lower sample complexity than (2), so that even if by using (3) we lose a bit on fitting the training data, we win in the complexity term of the generalization bounds. My guess is that the assumptions necessary to make this true would lie somewhere around a combination of: (A) the learner generalizes poorly with many features, (B) H(Z|X) is not much less than H(Z|Y) for the correct Y, and (C) one has neither too little nor too much data.

Does anyone see an obvious/clever/not-so-obvious solution to this? Or am I completely wrong? :)

Kevin said...

Hal--I agree, this argument is badly in need of a theorem. I think it's reasonable to argue in favor of either model (2) or (3) now. The question of which model is better seems dependent on

a) data set size
b) the feature set (in particular, whether Y summarizes the features of X)
c) hypothesis function class
d) the dependence between Y and Z (I'm not sure how to characterize this formally)

The question we should ask is: given what scenario is (3) better than (2), and vice versa? Perhaps we will end up with the same conclusion you've proved in your paper, but this is an interesting exercise to examing things under a new light. I'll think about this some more and get back to you..

hal said...

I've attempted a really crude analysis of this, and the message seems to be: if the loss is sufficiently low with respect to the norm of the weight vector, then you might be able to do better with single-output joint inference. It's very unclear, though, and the analysis is bad for many reasons, but maybe it's a bit helpful.

. said...

Oes Tsetnoc one of the ways in which we can learn seo besides Mengembalikan Jati Diri Bangsa. By participating in the Oes Tsetnoc or Mengembalikan Jati Diri Bangsa we can improve our seo skills. To find more information about Oest Tsetnoc please visit my Oes Tsetnoc pages. And to find more information about Mengembalikan Jati Diri Bangsa please visit my Mengembalikan Jati Diri Bangsa pages. Thank you So much.

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex