
07 November 2009

NLP as a study of representations

Ellen Riloff and I run an NLP reading group pretty much every semester. Last semester we covered "old school NLP." We independently came up with lists of what we consider some of the most important ideas (idea = paper) from pre-1990 (most are much earlier) and let students select which to present. There was a lot of overlap between Ellen's list and mine (not surprisingly). If people are interested, I can provide the whole list (just post a comment and I'll dig it up). The whole list of topics is posted as a comment. The topics that were actually selected are here.

I hope the students have found this exercise useful. It gets you thinking about language in a way that papers from the 2000s typically do not. It brings up a bunch of issues that we no longer think about frequently. Like language. (Joking.) (Sort of.)

One thing that's really stuck out for me is how much "old school" NLP comes across essentially as a study of representations. Perhaps this is a result of the fact that AI -- as a field -- was (and, to some degree, still is) enamored with knowledge representation problems. To be more concrete, let's look at a few examples. It's already been a while since I read these last (I had meant to write this post during the spring when things were fresh in my head), so please forgive me if I goof a few things up.

I'll start with one I know well: Mann and Thompson's rhetorical structure theory paper from 1988. This is basically "the" RST paper. I think that when many people think of RST, they think of it as a list of ways that sentences can be organized into hierarchies. Eg., this sentence provides background for that one, and together they argue in favor of yet a third. But this isn't really where RST begins. It begins by trying to understand the communicative role of text structure. That is, when I write, I am trying to communicate something. Everything that I write (if I'm writing "well") is toward that end. For instance, in this post, I'm trying to communicate that old school NLP views representation as the heart of the issue. This current paragraph is supporting that claim by providing a concrete example, which I am using to try to convince you of my claim.

As a more detailed example, take the "Evidence" relation from RST. M+T have the following characterization of "Evidence." Herein, "N" is the nucleus of the relation, "S" is the satellite (think of these as sentences), "R" is the reader and "W" is the writer:

relation name: Evidence
constraints on N: R might not believe N to a degree satisfactory to W
constraints on S: R believes S or will find it credible
constraints on N+S: R's comprehending S increases R's belief of N
the effect: R's belief of N is increased
locus of effect: N
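
If it helps to see just how schema-like this is, here's a sketch of my own (definitely not anything from the paper) that transcribes the definition almost verbatim into a little data structure; the constraints and effects are just prose fields, because the theory is specifying a representation, not an algorithm:

```python
from dataclasses import dataclass

# A sketch of my own (not from M+T): an RST relation definition transcribed
# as data. The constraint/effect fields are prose; the point is the schema.
@dataclass
class RSTRelation:
    name: str
    constraints_on_N: str
    constraints_on_S: str
    constraints_on_N_plus_S: str
    effect: str
    locus_of_effect: str

evidence = RSTRelation(
    name="Evidence",
    constraints_on_N="R might not believe N to a degree satisfactory to W",
    constraints_on_S="R believes S or will find it credible",
    constraints_on_N_plus_S="R's comprehending S increases R's belief of N",
    effect="R's belief of N is increased",
    locus_of_effect="N",
)
```
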
This is a totally different way of thinking about things than what I think we see nowadays. I kind of liken it to the advice I give students about programming: if you're implementing something moderately complex (say, the forward/backward algorithm), first write down all the math, then start implementing. Don't start implementing first. I think nowadays (and sure, I'm guilty!) we see a lot of implementing without the math. Or rather, with plenty of math, but without a representational model of what it is that we're studying.

The central claim of the RST paper is that one can think of texts as being organized into elementary discourse units, and these are connected into a tree structure by relations like the one above. (Or at least this is my reading of it.) That is, they have laid out a representation of text and claimed that this is how texts get put together.

As a second example (this will be shorter), take Wendy Lehnert's 1982 paper, "Plot units and narrative summarization." Here, the story is about how stories get put together. The most interesting thing about the plot units model to me is that it breaks from how one might naturally think about stories. That is, I would naively think of a story as a series of events. The claim that Lehnert makes is that this is not the right way to think about it. Rather, we should think about stories as sequences of affect states. Effectively, an affect state is how a character is feeling at any time. (This isn't quite right, but it's close enough.) For example, Lehnert presents the following story:
When John tried to start his car this morning, it wouldn't turn over. He asked his neighbor Paul for help. Paul did something to the carburetor and got it going. John thanked Paul and drove to work.
The representation put forward for this story is something like: (1) negative-for-John (the car won't start), which leads to (2) motivation-for-John (to get it started), which leads to (3) positive-for-John (it's started), which then links back and resolves (1). You can also analyze the story from Paul's perspective, and then add links that go between the two characters showing how things interact. The rest of the paper describes how these relations work, and how they can be put together into more complex event sequences (such as "promised request bungled"). Again, a high level representation of how stories work from the perspective of the characters.
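
Just to make the shape of that representation concrete, here's a tiny sketch of the car story in my own toy encoding (not Lehnert's actual notation):

```python
# My own toy encoding (not Lehnert's notation) of the car story as affect
# states for John plus the links between them.
affect_states = {
    1: ("John", "negative", "the car won't start"),
    2: ("John", "motivation", "wants to get it started"),
    3: ("John", "positive", "the car is started"),
}

links = [
    (1, 2, "motivates"),   # the problem motivates the goal
    (2, 3, "actualizes"),  # the goal is achieved
    (3, 1, "resolves"),    # the positive state resolves the original problem
]

for src, dst, kind in links:
    print(affect_states[src], "--%s-->" % kind, affect_states[dst])
```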

So now I, W, hope that you, R, have an increased belief in the title of the post.

Why do I think this is interesting? Because at this point, we know a lot about how to deal with structure in language. From a machine learning perspective, if you give me a structure and some data (and some features!), I will learn something. It can even be unsupervised if it makes you feel better. So in a sense, I think we're getting to a point where we can go back, look at some really hard problems, use the deep linguistic insights from two decades (or more) ago, and start taking a crack at things that are really deep. Of course, features are a big problem; as a very wise man once said to me: "Language is hard. The fact that statistical association mining at the word level made it appear easy for the past decade doesn't alter the basic truth. :-)." We've got many of the ingredients to start making progress, but it's not going to be easy!

21 October 2009

Convex or not, my schizophrenic personality

Machine learning as a field has been very convex-happy for the past decade or so. So much so that when I saw a tutorial on submodular optimization in ML (one of the best tutorials I've seen), they said something along the lines of "submodularity will be for this decade what convexity was for the last decade." (Submodularity is cool and I'll post about it more in the future, but it's kind of a discrete analog of convexity. There's a NIPS workshop on the topic coming up.) This gives a sense of how important convexity has been.

There's also a bit of an undercurrent of "convexity isn't so great" from other sides of the ML community (roughly from the neural nets folks); see, for instance, Yann LeCun's talk Who's Afraid of Non-convex Loss Functions, a great and entertaining talk.

There's a part of me that loves convexity. Not having to do random restarts, being assured of global convergence, etc., all sounds very nice. I use logistic regression/maxent for almost all of my classification needs, have never run a neural network, and have only occasionally used SVMs (though of course they are convex, too). When I teach ML (as I'm doing now), I make a big deal about convexity: it makes life easy in many ways.

That said, almost none of my recent papers reflect this. In fact, in the structure compilation paper, we flat out say that non-linearity in the model (which leads to a non-convex loss function) is the major reason why CRFs outperform independent classifiers in structured prediction tasks! Moreover, whenever I start doing Bayesian stuff, usually solved with some form of MCMC, I've completely punted on everything convex. In a "voting with my feet" world, I couldn't care less about convexity! For the most part, if you're using EM or sampling or whatever, you don't care much about it either. Somehow we (fairly easily!) tolerate whatever negative effects there are of non-convex optimization.

I think one reason why such things don't bother us, as NLPers, as much as they bother the average machine learning person is that we are willing to invest some energy in intelligent initialization. This already puts us in a good part of the optimization landscape, and doing local hillclimbing from there is not such a big deal. A classic example is the "Klein and Manning" smart initializer for unsupervised parsing, where a small amount of human knowledge goes a long way above a random initializer.

Another style of initialization is the IBM alignment model style. IBM model 4 is, of course, highly non-convex and ridiculously difficult to optimize (the E step is intractable). So they do a smart initialization, using the output of model 3. Model 3, too, is highly non-convex (but not quite so much so), so they initialize with model 2. And so on, down to model 1, which is actually convex and fairly easy to optimize. This sequencing of simple models to complex models also happens in some statistical analysis, where you first fit first order effects and then later fit higher order effects. The danger, of course, is that you end up on a bad hill to climb, but overall this generally appears to be a bigger win than starting somewhere in the middle of a random swamp. (Of course, later, Bob Moore had this cute argument that even though model 1 is convex, we don't actually ever optimize it to the global optimum, so doing clever initialization for model 1 is also a good idea!)
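
The pattern itself is trivial to write down. Here's a minimal sketch (the model classes are stand-ins of my own invention, not actual alignment code), just to make the "each model initializes the next" idea concrete:

```python
# A minimal sketch of sequential initialization; EasyModel and HardModel are
# made-up stand-ins, not actual alignment models.
class EasyModel:
    """Think IBM model 1: convex-ish, happy to start from nothing."""
    def fit(self, data, init=None):
        return "easy-model estimate from %d sentence pairs" % len(data)

class HardModel:
    """Think IBM models 2-4: non-convex, so start from the previous model."""
    def fit(self, data, init=None):
        start = init if init is not None else "a random swamp"
        return "hard-model estimate refined from (%s)" % start

def sequential_fit(models, data):
    params = None
    for model in models:
        params = model.fit(data, init=params)  # each model starts where the last one left off
    return params

print(sequential_fit([EasyModel(), HardModel()], [("the dog", "le chien")]))
```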

These two ideas: clever initialization, and sequential initialization, seem like powerful ideas that I would like to see make their way into more complex models. For instance, in the original LDA paper, Dave Blei used an initialization where they pick some random documents as seeds for topics. As far as I know, no one really does this anymore (does anyone know why: does it really not matter?), but as we keep building more and more complex models, and lose hope that our off the shelf optimizer (or sampler) is going to do anything reasonable, we're probably going to need to get back to this habit, perhaps trying to formalize it in the meantime.
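
For what it's worth, that seeding trick is only a few lines; here's a sketch of my own (made-up smoothing and all, not Blei's actual code) of initializing topics from randomly chosen seed documents:

```python
import random
from collections import Counter

# A sketch of my own (not Blei's code): initialize each topic's word counts
# from a randomly chosen "seed" document, plus a bit of smoothing.
def seed_topics(docs, num_topics, vocab, smoothing=1.0):
    topics = []
    for _ in range(num_topics):
        seed_doc = random.choice(docs)
        counts = Counter({w: smoothing for w in vocab})  # smoothed base counts
        counts.update(seed_doc)                          # plus the seed document's words
        topics.append(counts)
    return topics

docs = [["parsing", "trees", "grammar"], ["convex", "loss", "gradient"]]
vocab = {w for doc in docs for w in doc}
print(seed_topics(docs, num_topics=2, vocab=vocab))
```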

18 December 2007

Particle filtering versus beam search

I had a very interesting discussion at NIPS with Vikash Mansingka about beam search and particle filtering. Much of what is below is a result of this conversation and he deserves much credit.

In a very real sense, particle filtering can be seen (for many models) as a sort of stochastic beam search. Just to set the stage, let's talk about generic beam search:

1. Initialize a min-queue "beam" to contain a single initial hypothesis.
2. While beam contains non-complete hypothesis:
A. Take h, the lowest scoring hypothesis from the beam
B. For each h' in the neighborhood of h
I. Add h' to the beam, with some score
C. If the beam exceeds a certain capacity, drop high-scoring elements
3. Return the lowest scoring hypothesis from beam
The key thing that varies across beam search algorithms is the score function that's used. The most naive is just path cost---how much did we have to pay to get to the current hypothesis? The "A*" way is to use path cost + admissible heuristic cost, where the admissible heuristic is an underestimate of the amount of cost remaining to complete.

There are (of course) other variants: sometimes we expand the whole beam in one step; sometimes we use multiple beams; etc. We'll come back to these shortly.
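
For concreteness, here's the generic algorithm rendered as a minimal runnable sketch (the neighbors/score/is_complete interface is made up; score can be pure path cost, or path cost plus an admissible heuristic for the A* flavor):

```python
import heapq
from itertools import count

# A minimal sketch of generic beam search; neighbors, score and is_complete
# come from whatever problem you're solving. Scores are costs: lower is better.
def beam_search(initial, neighbors, score, is_complete, beam_size=10):
    tie = count()                                  # tiebreaker so hypotheses never get compared
    beam = [(score(initial), next(tie), initial)]  # min-queue of (score, _, hypothesis)
    while not is_complete(beam[0][2]):             # stop once the best hypothesis is complete
        _, _, h = heapq.heappop(beam)              # take the lowest scoring hypothesis
        for h2 in neighbors(h):                    # add its neighborhood to the beam
            heapq.heappush(beam, (score(h2), next(tie), h2))
        if len(beam) > beam_size:                  # drop high-scoring (i.e. bad) elements
            beam = heapq.nsmallest(beam_size, beam)
            heapq.heapify(beam)
    return beam[0][2]                              # lowest scoring hypothesis
```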

Particle filtering (using as similar terminology to the beam search as possible), on the other hand, looks something like:
1. Initialize a max-queue "beam" to contain a single initial hypothesis.
2. While beam contains non-complete hypothesis:
A. Take h, the highest scoring hypothesis from the beam
B. Sample h' from the neighborhood of h
I. Add h' to the beam, with some score
C. If the beam exceeds a certain capacity, randomly drop elements
D. If the "effective sample size" of the beam is too low, resample elements
3. Return the highest scoring hypothesis from beam
To map particle filtering onto beam search, use negative log probabilities as scores. (This turns the max search into a min search.)
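
To make the parallelism concrete, here's my own toy rendering of the PF loop (propose and weight stand in for the proposal distribution and the importance weight; a sketch, not anyone's real particle filter):

```python
import random

# A toy sketch of the particle filtering loop above: propose(h) samples h'
# from the neighborhood of h, weight(h) is an unnormalized importance weight.
def particle_filter(initial, propose, weight, is_complete, num_particles=100):
    particles = [initial] * num_particles
    while not all(is_complete(h) for h in particles):
        # step 2B: everything happens randomly -- extend via the proposal
        particles = [h if is_complete(h) else propose(h) for h in particles]
        # step 2D: if the effective sample size is too low, resample
        w = [weight(h) for h in particles]
        total = sum(w)
        probs = [wi / total for wi in w]
        ess = 1.0 / sum(p * p for p in probs)      # effective sample size
        if ess < num_particles / 2.0:
            particles = random.choices(particles, weights=probs, k=num_particles)
    return max(particles, key=weight)              # highest scoring hypothesis
```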

In a sense, the major thing that differs between PF and BS is that in PF everything happens randomly -- we're always sampling from a proposal distribution, rather than greedily taking the "best" hypotheses at each step.

There are, however, some minor issues: what proposal distribution is used in step 2B of PF, and what resampling (step 2D) does.

In general, the proposal distribution should be the true posterior probability, but this is pretty much always intractable (otherwise we wouldn't be using approximations at all). The most common thing to do is to use a transition probability, which corresponds exactly to using a path cost in beam search. However, just as we know that using a heuristic can help in beam search, this suggests that using a proposal distribution in PF that contains something like a future cost heuristic is going to be helpful. I'm sure this has been done, but I don't think it's nearly as common as it perhaps should be, given how helpful it is in BS.

What step 2D in PF does is to measure whether the samples that we have in the beam have fallen out of the true posterior---in other words, are we "off the mark." There's a straightforward way to compute this (see the Wikipedia page on particle filters). If we are off the mark, we resample all the elements in our beam, essentially trying to get back on the right path. My experience with PF is that this is pretty essential to getting things to work, but as far as I know there's no analogue in BS land.

It may be that these two things (resampling and heuristic) are two different ways of solving the same problem and that doing both is not so helpful. But I tend to doubt it.

One other thing that I've never seen done in PF is to maintain multiple beams. Admittedly this makes the resampling step harder (maybe...I haven't worked through it), but it may be helpful. One thing, for instance, that we found in the PF for the coalescent was that the PF tends to like to merge clusters early with no regard for how hard life is going to be later. We use resampling to try to fix this, but another way would be to use multiple beams. This is commonly done in, eg., machine translation, where a single beam doesn't do well, so we maintain one beam for each number of foreign (or English) words covered.

So what's the point? I think that by recognizing the similarity between BS and PF, we can see ways to potentially improve both. For instance, the following seem promising:
  1. Use resampling techniques in beam search
  2. Use multiple beams in particle filtering
  3. Use A*-style heuristics in the proposal distribution for particle filtering
An alternative suggestion would be for NLPers to pay more attention to particle filtering. It's a nice, clean probabilistic analogue to a technique that we love (or at least are forced to use).

19 April 2007

Searn versus HMMs

Searn is my baby and I'd like to say it can solve (or at least be applied to) every problem we care about. This is of course not true. But I'd like to understand the boundary between reasonable and unreasonable. One big apparent weakness of Searn as it stands is that it appears applicable only to fully supervised problems. That is, we can't do hidden variables, we can't do unsupervised learning.

I think this is a wrong belief. It's something I've been thinking about for a long time and I think I finally understand what's going on. This post is about showing that you can actually recover forward-backward training of HMMs as an instance of Searn with a particular choice of base classifier, optimal policy, loss function and approximation method. I'll not prove it (I haven't even done this myself), but I think that even at a hand-waving level, it's sufficiently cool to warrant a post.

I'm going to have to assume you know how Searn works in order to proceed. The important aspect is essentially that we train on the basis of an optimal policy (which may be stochastic) and some loss function. Typically I've made the "optimal policy" assumption, which means that when computing the loss for a certain prediction along the way, we approximate the true expected loss with the loss given by the optimal policy. This makes things efficient, but we can't do it in HMMs.

So here's the problem set up. We have a sequence of words, each of which will get a label (for simplicity, say the labels are binary). I'm going to treat the prediction task as predicting both the labels and the words. (This looks a lot like estimating a joint probability, which is what HMMs do.) The search strategy will be to first predict the first label, then predict the first word, then predict the second label and so on. The loss corresponding to an entire prediction (of both labels and words) is just going to be the Hamming loss over the words, ignoring the labels. Since the loss doesn't depend on the labels (which makes sense because they are latent so we don't know them anyway), the optimal policy has to be agnostic about their prediction.

Thus, we set up the optimal policy as follows. For predictions of words, the optimal policy always predicts the correct word. For predictions of labels, the optimal policy is stochastic. If there are K labels, it predicts each with probability 1/K. Other optimal policies are possible and I'll discuss that later.
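
In toy code, the optimal policy I have in mind looks something like this (a sketch of my own, not anything from the Searn paper itself):

```python
import random

# A toy sketch of the stochastic optimal policy described above: word slots
# get the true word, label slots get a uniformly random label (one of K).
def optimal_policy(slot_type, true_word=None, K=2):
    if slot_type == "word":
        return true_word              # always predict the correct word
    if slot_type == "label":
        return random.randrange(K)    # agnostic: uniform over the K labels
    raise ValueError(slot_type)

# the search order is label, word, label, word, ...
sentence = ["the", "dog", "barks"]
prediction = []
for w in sentence:
    prediction.append(optimal_policy("label"))
    prediction.append(optimal_policy("word", true_word=w))
print(prediction)
```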

Now, we have to use a full-blown version of Searn that actually computes expected losses as true expectations, rather than with an optimal policy assumption. Moreover, instead of sampling a single path from the current policy to get to a given state, we sample all paths from the current policy. In other words, we marginalize over them. This is essentially akin to not making the "single sample" assumption on the "left" of the current prediction.

So what happens in the first iteration? Well, when we're predicting the Nth word, we construct features over the current label (our previous prediction) and predict. Let's use a naive Bayes base classifier. But we're computing expectations to the left and right, so we'll essentially have "half" an example for predicting the Nth word from state 0 and half an example for predicting it from state 1. For predicting the Nth label, we compute features over the previous label only and again use a naive Bayes classifier. The examples thus generated will look exactly like a training set for the first maximization in EM (with all the expectations equal to 1/2). We then learn a new base classifier and repeat.
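
Here's a toy construction of my own (not the actual algorithm) showing what those first-iteration examples look like; the point is just that every word prediction shows up once per label with weight 1/2, exactly like the expected counts in the first maximization step of EM:

```python
from collections import defaultdict

# A toy construction (mine): in the first iteration, each word prediction
# appears once per possible previous label, with weight 1/K (here K = 2).
K = 2
words = "the dog barks".split()

examples = []  # (previous_label, observed_word, weight)
for w in words:
    for prev_label in range(K):
        examples.append((prev_label, w, 1.0 / K))

# "training" the naive Bayes emission table = accumulating weighted counts
emission_counts = defaultdict(float)
for prev_label, w, weight in examples:
    emission_counts[(prev_label, w)] += weight

print(dict(emission_counts))  # every (label, word) pair gets count 0.5
```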

In the second iteration, the same thing happens, except now when we predict a label, there can be an associated loss due to messing up future word predictions. In the end, if you work through it, the weight associated with each example is given by an expectation over previous decisions and an expectation over future decisions, just like in forward-backward training. You just have to make sure that you treat your learned policy as stochastic as well.

So with this particular choice of optimal policy, loss function, search strategy and base learner, we recover something that looks essentially like forward-backward training. It's not identical because in true F-B, we do full maximization each time, while in Searn we instead take baby steps. There are two interesting things here. First, this means that in this particular case, where we compute true expectations, somehow the baby steps aren't necessary in Searn. This points to a potential area to improve the algorithm. Second, and perhaps more interesting, it means that we don't actually have to do full F-B. The Searn theorem holds even if you're not computing true expectations (you'll just wind up with higher variance in your predictions). So if you want to do, eg., Viterbi F-B but are worried about convergence, this shows that you just have to use step sizes. (I'm sure someone in the EM community must have shown this before, but I haven't seen it.)

Anyway, I'm about 90% sure that the above actually works out if you set about to prove it. Assuming its validity, I'm about 90% sure it holds for EM-like structured prediction problems in general. If so, this would be very cool. Or, at least I would think it's very cool :).

13 February 2007

I Don't Think We Understand Structured Prediction

I've posted on structured prediction a few times before and then some. This is another post, but with a different perspective. Besides, the most recent time was last May, so it's been a while. This post is prompted by a recent OpEd by John Lafferty and Larry Wasserman on Challenges in Machine Learning in Statistica Sinica, where they mention SP as one of three major open questions in machine learning (the other two are semi-sup and sparse learning in high dimensions).

I've been trying to think of a good way of ontologizing SP methods recently, primarily because I want to understand them better. The more I think about it, the more I feel like we really don't understand the important issues. This is partially evidenced by an observation that in the past six years (essentially, since the first CRF paper), there have been like 30 different algorithms proposed for this problem. The fact that there are so many proposed algorithms tells me that we don't really understand what's going on, otherwise we wouldn't need to keep coming up with new techniques. (I'm not innocent.)

To me, the strongest division between techniques is between techniques that require tractable inference in the underlying problem (eg, CRFs, M3Ns, SVM-struct, structured perceptron, MIRA, MEMMs, HMMs, etc.) and those that don't (eg, incremental perceptron, LaSO, Searn, the Liang/Lacoste-Julien/Klein/Taskar MT algorithm, local predictors ala Roth and colleagues, etc.). Whether this truly is the most important distinction is unclear, but to me it is. I think of the former set as a sort of "top down" or "technology-push" techniques, while the latter are a sort of "bottom up" or "application-pull" techniques. Both are entirely reasonable and good ways of going at looking at the problem, but as of now the connections between the two types are only weakly understood.

An alternative division is between generative techniques (HMMs), conditional methods (CRFs) and margin-based techniques (everything else, minus Searn which is sort of non-committal with respect to this issue). I don't really think this is an important distinction, because with the exception of generative being quite different, conditional methods and margin-based methods are essentially the same in my mind. (Yes, I understand there are important differences, but it seems that in the grand scheme of things, this is not such a relevant distinction.)

A related issue is whether the algorithms admit a kernelized version. Pretty much all do, and even if not, I also don't see this as a very important issue.

There are other issues, some of which are mentioned in the Lafferty/Wasserman paper. One is consistency. (For those unfamiliar with the term, a consistent algorithm is essentially one that is guaranteed to find the optimal solution if given infinite amounts of data.) CRFs are consistent. M3Ns are not. Searn is not, even if you use a consistent classifier as the base learning algorithm. My sense is that to statisticians, things like consistency matter a lot. In practice, my opinion is that they're less important because we never have that much data.

Even within the context of techniques that don't require tractability, there is a great deal of variation. To my knowledge, this family of techniques was essentially spawned by Collins and Roark with the incremental perceptron. My LaSO thing was basically just a minor tweak on the IP. Searn is rather different, but other published methods are largely other minor variants on the IP and/or LaSO, depending on where you measure from. And I truly mean minor tweaks: the essential difference between LaSO and IP is whether you restart your search at the truth when you err. Doing restarts tends to help, and it lets you prove a theorem. We later had a NIPS workshop paper that restarted, but allowed the algorithm to take a few steps off the right path before doing so. This helped experimentally and also admitted another theorem. I've seen other work that does essentially the same thing, but tweaks how updates are done exactly and/or how restarts happen. The fact that we're essentially trying all permutations tells me that many things are reasonable, and we're currently in the "gather data" part of trying to figure out the best way to do things.