29 July 2016

A quick comment on structured input vs structured output learning

When I think of structured input models, I typically think of things like kernels over discrete input spaces. For instance, the famous all-substrings kernel for which K(d1,d2) effectively counts the number of common substrings in two documents, without spending exponential time enumerating them all. Of course there are many more ways of thinking about structured inputs: tree-to-string machine translation has a tree structured input. RNNs (on the input side) are essentially structured input models for sequence structures.

When I think of structured output models, I typically think of things like CRFs, structured SVMs/M3Ns, multilabel predictors (those are borderline), various transition-based methods (eg., shift/reduce parsers), etc. Here, my internal model for the structure is essentially at prediction time: find a high scoring structure from this complicated discrete output space.

Perhaps this has been obvious to everyone-but-me for a decade, but I only recently came to the recognition that these are essentially the same, at least if you restrict the sort of models you're willing to consider. (In particular, if you ignore things like imitation learning/learning to search for a moment.)

In a pure structured input setting, you have some simple label space Y (let's assume it's the real numbers) and some complex input space X. Typically you want to learn a function f : X ➝ Y, which has low loss. In particular you want to minimize the expectation of loss(y, f(x)) over random draws of x,y. And the "interesting" thing is that x isn't just a vector, so you have to be clever.

In the pure structure output setting, in, for instance, the structured SVM/CRF setup, you have some input space X (which may or may not be structured) and some complex output space Y. As before, you want to learn a function f : X ➝ Y, which has low loss. However, in the most common setups, the way you accomplish this is that instead of directly learning f, you instead learn a scoring function s that scores x,y pairs based on how "good" that y is for the corresponding x. For a fixed scoring function s, you derive f according to the argmax rule: fs(x) := argmaxy s(x,y). In this way, you have effectively separated the learning problem (get a good s) from the structured problem (solve the argmax). [Whether this is good or not is up to debate; I'm personally on the "nay" side.] You then want to minimize something like the expectation of loss(y, argmaxy' s(x,y')) over random draws x,y.

The observation is that these two problems are essentially the same thing. That is, if you know how to do the structured input problem, then the structured output problem is essentially the same thing, as far as the learning problem goes. That is, if you can put structure in f(x) for structured input, you can just as well put structure in s(x,y) for structured output. Or, by example, if you can predict the fluency of an English sentence x as a structured input problem, you can predict the translation quality of a French/English sentence pair x,y in a structured output problem. This doesn't solve the argmax problem -- you have to do that separately -- but the underlying learning problem is essentially identical.

You see similar ideas being reborn these days with papers like David Belanger's ICML paper this year on energy networks. With this framework of think-of-structured-input-and-structured-output-as-the-same, basically what they're doing is building a structured score function that uses both the input and output simultaneously, and throwing these through a deep network. (Ok it's a bit more than that, but that's the cartoon.)

At any rate, maybe obvious to everyone but me, but I thought I'd write it down anyway :).

1 comment:

td said...

wasn't obvious to me. Nice writeup.