I've posted on structured prediction (SP) a few times before, and then some. This is another post, but from a different perspective. Besides, the most recent one was last May, so it's been a while. This post is prompted by a recent OpEd by John Lafferty and Larry Wasserman on challenges in machine learning in Statistica Sinica, where they mention SP as one of three major open questions in machine learning (the other two being semi-supervised learning and sparse learning in high dimensions).
I've been trying to think of a good way of ontologizing SP methods recently, primarily because I want to understand them better. The more I think about it, the more I feel like we really don't understand the important issues. This is partially evidenced by the observation that in the past six years (essentially, since the first CRF paper), something like 30 different algorithms have been proposed for this problem. The fact that there are so many proposed algorithms tells me that we don't really understand what's going on; otherwise we wouldn't need to keep coming up with new techniques. (I'm not innocent.)
To me, the strongest division between techniques is between techniques that require tractable inference in the underlying problem (eg, CRFs, M3Ns, SVM-struct, the structured perceptron, MIRA, MEMMs, HMMs, etc.) and those that don't (eg, the incremental perceptron, LaSO, Searn, the Liang/Lacoste-Julien/Klein/Taskar MT algorithm, local predictors ala Roth and colleagues, etc.). Whether this truly is the most important distinction is unclear, but to me it is. I think of the former set as a sort of "top down" or "technology-push" techniques, while the latter are a sort of "bottom up" or "application-pull" techniques. Both are entirely reasonable and good ways of looking at the problem, but as of now the connections between the two types are only weakly understood.
An alternative division is between generative techniques (HMMs), conditional methods (CRFs) and margin-based techniques (everything else, minus Searn, which is sort of non-committal with respect to this issue). I don't really think this is an important distinction, because, with the exception of generative methods being quite different, conditional methods and margin-based methods are essentially the same in my mind. (Yes, I understand there are important differences, but it seems that in the grand scheme of things, this is not such a relevant distinction.)
A related issue is whether the algorithms admit a kernelized version. Pretty much all of them do, and even when they don't, I don't see this as a very important issue.
There are other issues, some of which are mentioned in the Lafferty/Wasserman paper. One is consistency. (For those unfamiliar with the term, a consistent algorithm is essentially one that is guaranteed to find the optimal solution if given infinite amounts of data.) CRFs are consistent. M3Ns are not. Searn is not, even if you use a consistent classifier as the base learning algorithm. My sense is that to statisticians, things like consistency matter a lot. In practice, my opinion is that they're less important because we never have that much data.
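For concreteness, consistency is usually stated in terms of risk (this formalization and the notation are mine, not from the OpEd): writing $R(f)$ for the expected loss of a predictor $f$ and $\hat{f}_n$ for the predictor learned from $n$ examples, an algorithm is consistent if

$$ R(\hat{f}_n) \;\to\; \inf_f R(f) \quad \text{in probability as } n \to \infty, $$

i.e., given enough data it eventually does as well as the best achievable predictor.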
Even within the context of techniques that don't require tractability, there is a great deal of variation. To my knowledge, this family of techniques was essentially spawned by Collins and Roark with the incremental perceptron. My LaSO thing was basically just a minor tweak on the IP. Searn is rather different, but other published methods are largely minor variants on the IP and/or LaSO, depending on where you measure from. And I truly mean minor tweaks: the essential difference between LaSO and the IP is whether you restart your search at the truth when you err. Doing restarts tends to help, and it lets you prove a theorem. We later had a NIPS workshop paper that restarted, but allowed the algorithm to take a few steps off the right path before doing so. This helped experimentally and also admitted another theorem. I've seen other work that does essentially the same thing, but tweaks exactly how updates are done and/or how restarts happen. The fact that we're essentially trying all permutations tells me that many things are reasonable, and we're currently in the "gather data" phase of trying to figure out the best way to do things.
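To make that "restart at the truth" tweak concrete, here is a minimal Python sketch of one search-based training update (my own simplification, not code from either paper, and the two algorithms differ in other details too; initial_state, expand, features, is_on_true_path and true_state are hypothetical stand-ins you would supply for your particular structure):

```python
# Minimal sketch of a search-based perceptron update on one example (x, y).
# Hypothetical helpers (assumed, not defined here):
#   initial_state(x)       -> start state of the search over structures for x
#   expand(s)              -> successor states of partial state s
#   features(s)            -> feature vector of partial state s
#   is_on_true_path(s, y)  -> does s agree with the gold output y so far?
#   true_state(x, y, d)    -> the gold-consistent search state at depth d
import numpy as np

def train_on_example(w, x, y, mode="laso", beam_size=5, max_depth=100):
    beam = [initial_state(x)]
    for depth in range(1, max_depth + 1):
        candidates = [t for s in beam for t in expand(s)]
        if not candidates:
            break
        candidates.sort(key=lambda s: np.dot(w, features(s)), reverse=True)
        beam = candidates[:beam_size]
        # Error: nothing left on the beam is consistent with the gold output.
        if not any(is_on_true_path(s, y) for s in beam):
            gold = true_state(x, y, depth)              # where search "should" be
            w = w + features(gold) - features(beam[0])  # perceptron-style update
            if mode == "ip":
                return w    # incremental perceptron: early update, stop this example
            beam = [gold]   # LaSO: restart the search at the truth and keep going
    return w
```

Everything else (the beam, the scoring, the update) is identical; the one-line difference at the end is the tweak.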
You posed the question of whether there is a good ontology/classification for structured prediction techniques. I'd like to ask a different question: are there different classes of structured prediction problems? For instance, is predicting a sequence fundamentally different from predicting a tree?
For the "tractable" techniques like M3Ns and CRFs, it is considerably more difficult to directly extend models built for sequence prediction to tree prediction. For other techniques like the perceptron, all that is required is that some argmax is defined for the structure to be predicted. (Correct me if I'm wrong.)
This leads me to wonder, are there different classes of structured prediction problems, which will do well with different classes of structured prediction techniques?
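That matches my understanding: for perceptron-style learners, the structure only enters through the decoder and the features. As a rough illustration (my sketch, not anyone's published code; argmax_structure and features are hypothetical stand-ins):

```python
# Generic structured perceptron: the only structure-specific pieces are the
# decoder and the feature function, both assumed (hypothetical) here:
#   argmax_structure(w, x) -> highest-scoring structure for x under weights w
#                             (Viterbi for sequences, CKY for trees, beam search, ...)
#   features(x, y)         -> joint feature vector of input x and structure y
import numpy as np

def structured_perceptron(data, n_features, epochs=5):
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = argmax_structure(w, x)          # decode under current weights
            if y_hat != y_gold:
                w += features(x, y_gold) - features(x, y_hat)  # standard update
    return w
```

Nothing in the learning loop cares whether y is a sequence or a tree; the tractability question is pushed entirely into the argmax.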
You're probably right. I've thought about it a bit from this perspective too. It seems to me that there are basically two things that matter: (1) what structure does the loss function imply; (2) what structure do the features imply (or, what aspects of the structure do we believe are useful).
Eg., POS tagging... the loss function (typically Hamming loss) implies nothing about the structure. But our knowledge of language says there are at least local dependencies, and probably larger syntactic dependencies. The former are easier/more tractable to deal with, so we do a Markov model.
Eg2., machine translation... loss function (eg., Bleu) implies that we had better focus on getting 4-grams correct. Our knowledge says that things like ngram language models are useful, so we get Markov dependencies, and syntactic information is probably useful (so we get syntax). Luckily the Markov dependencies overlap with the 4-grams from Bleu, so we only have two issues to contend with.
It seems that the tractable/intractable model issue is really one of: (A) does our loss function lead to a tractable structure and (B) do our features? My sense --- and there's growing empirical support for this --- is that an impoverished feature set in a tractable model is almost always worse than a rich feature set in an intractable model.
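To illustrate point (1) with a toy example (mine, not from the discussion above): Hamming loss decomposes over positions and so imposes no structure, while even a crude 4-gram overlap count (Bleu-like in spirit only) couples windows of four adjacent decisions.

```python
# Toy comparison: a loss that decomposes per position vs. one that doesn't.

def hamming_loss(pred, gold):
    # Sum of independent per-position errors: the loss imposes no structure.
    return sum(p != g for p, g in zip(pred, gold))

def fourgram_matches(pred, gold):
    # Each term depends on a window of four adjacent tokens, so the loss
    # itself pushes you toward modeling higher-order (Markov-ish) structure.
    gold_4grams = {tuple(gold[i:i + 4]) for i in range(len(gold) - 3)}
    return sum(tuple(pred[i:i + 4]) in gold_4grams for i in range(len(pred) - 3))

pred = "the cat sat on the mat".split()
gold = "the cat sat on a mat".split()
print(hamming_loss(pred, gold))      # 1 (only position five differs)
print(fourgram_matches(pred, gold))  # 1 (only "the cat sat on" survives intact)
```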
I want to know about the work done on active learning for structured prediction.