Searn is my baby and I'd like to say it can solve (or at least be applied to) every problem we care about. This is of course not true. But I'd like to understand the boundary between reasonable and unreasonable. One big apparently weakness of Searn as it stands is that it appears applicable only to fully supervised problems. That is, we can't do hidden variables, we can't do unsupervised learning.

I think this is a wrong belief. It's something I've been thinking about for a long time and I think I finally understand what's going on. This post is about showing that you can actually recover forward-backward training of HMMs as an instance of Searn with a particular choice of base classifier, optimal policy, loss function and approximation method. I'll not prove it (I haven't even done this myself), but I think that even at a hand-waving level, it's sufficiently cool to warrant a post.

I'm going to have to assume you know how Searn works in order to proceed. The important aspect is essentially that we train on the basis of an optimal policy (which may be stochastic) and some loss function. Typically I've made the "optimal policy" assumption, which means that when computing the loss for a certain prediction along the way, we approximate the true expected loss with the loss given by the optimal policy. This makes things efficient, but we can't do it in HMMs.

So here's the problem set up. We have a sequence of words, each of which will get a label (for simplicity, say the labels are binary). I'm going to treat the prediction task as predicting

both the labels and the words. (This looks a lot like estimating a joint probability, which is what HMMs do.) The search strategy will be to first predict the first label, then predict the first word, then predict the second label and so on. The loss corresponding to an entire prediction (of both labels and words) is just going to be the Hamming loss over the words,

ignoring the labels. Since the loss doesn't depend on the labels (which makes sense because they are latent so we don't know them anyway), the optimal policy has to be agnostic about their prediction.

Thus, we set up the optimal policy as follows. For predictions of words, the optimal policy always predicts the correct word. For predictions of labels, the optimal policy is stochastic. If there are

K labels, it predicts each with probability

1/K. Other optimal policies are possible and I'll discuss that later.

Now, we have to use a full-blown version of Searn that actually computes expected losses as true expectations, rather than with an optimal policy assumption. Moreover, instead of sampling a single path from the current policy to get to a given state, we sample all paths from the current policy. In other words, we marginalize over them. This is essentially akin to not making the "single sample" assumption on the "left" of the current prediction.

So what happens in the first iteration? Well, when we're predicting the

Nth word, we construct features over the current label (our previous prediction) and predict. Let's use a naive Bayes base classifier. But we're computing expectations to the left and right, so we'll essentially have "half" an example for predicting the

Nth word from state

0 and half an example for predicting it from state

1. For predicting the

Nth label, we compute features over the previous label only and again use a naive Bayes classifier. The examples thus generated will look exactly like a training set for the first maximization in EM (with all the expectations equal to

1/2). We then learn a new base classifier and repeat.

In the second iteration, the same thing happens, except now when we predict a label, there can be an associated loss due to messing up future word predictions. In the end, if you work through it, the weight associated with each example is given by an expectation over previous decisions and an expectation over future decisions, just like in forward-backward training. You just have to make sure that you treat your learned policy as stochastic as well.

So with this particular choice of optimal policy, loss function, search strategy and base learner, we recover something that looks essentially like forward-backward training. It's not identical because in true F-B, we do full maximization each time, while in Searn we instead take baby steps. There are two interesting things here. First, this means that in this particular case, where we compute true expectations, somehow the baby steps aren't necessary in Searn. This points to a potential area to improve the algorithm. Second, and perhaps more interesting, it means that we don't actually have to do full F-B. The Searn theorem holds even if you're not computing true expectations (you'll just wind up with higher variance in your predictions). So if you want to do, eg., Viterbi F-B but are worried about convergence, this shows that you just have to use step sizes. (I'm sure someone in the EM community must have shown this before, but I haven't seen it.)

Anyway, I'm about 90% sure that the above actually works out if you set about to prove it. Assuming its validity, I'm about 90% sure it holds for EM-like structured prediction problems in general. If so, this would be very cool. Or, at least I would think it's very cool :).