natural language processing blog: 2009/04

I've been slow at posting because I wanted to post on this current topic for a while, but wanted to work out some more details before doing so. Well, it didn't happen. So I'm writing sans details.

Let's think for a minute about non-linear dimensionality reduction, aka manifold learning. If we compare what, say, ISOMAP does with what laplacian eigenmaps (LE) does, they're really quite similar. In both cases, we construct a graph over our data points, usually a kNN graph or something, sometimes with edge weights, sometimes not. We then attempt to embed the points in a low dimensional space to minimize some sort of distortion. In ISOMAP, the distortion is based on the shortest path distance between the two points in our constructed graph. In LE, the distance is measures according to the Laplacian of the graph. The notion of a Laplacian of a graph is basically a discrete version of the more standard notion of the differential operator by the same name (that comes up in most standard analysis courses). In continuous land, it is the divergence of the differential, which essentially means that measures some form of diffusion (and has its original applications in physics). In discrete land, where we live, it can be thought of as a measure of flow on a graph.

The key difference, then, between ISOMAP and LE is whether you measure distances according to shortest path or to flow, which has a very "average path" feel to it. The advantage to LE is that it tends to be less susceptible to noise, which is easy to understand when viewed this way.

Now, getting back to NLP stuff, we often find ourselves doing some sort of shortest path search. It's well known that the much loved Viterbi algorithm is exactly an application of dynamic programming to search in the lattice defined by an HMM. This extends in well known ways to other structures. Of course, Viterbi search isn't the only game in town. There are two other popular approaches to "decoding" in HMMs. One is marginal decoding, where we individually maximize the probability of each node. The other is minimum Bayes risk decoding. Here, we take a user-supplied risk (aka loss) function and try to find the output that minimizes the expected risk, where the probabilities are given by our current model. MBR has been shown to outperform Viterbi in many applications (speech, MT, tagging, etc.). If your risk (loss) function is 0/1 loss, then these recover the same solution.

What I'm curious about is whether this is a connection here. I'm not exactly sure how the construction would go -- I'm thinking something like a graph defined over the lattice vertices with edges that reflect the loss function -- but there definitely seems to be a similarity between MBR and average path, which is approximately equal to a Laplacian operation (aka spectral operation). The reason I think this would be interesting is because a lot is known about spectral computations and we might be able to use this knowledge to our advantage (coming up with MBR algorithms is sometimes a bit painful). An additional complication (or bit of fun, if you see it that way) is that there are at least three standard ways to generalize the notion of a Laplacian to a hypergraph, which is what we would really need to do. Perhaps we can help pick out the "right" one through the MBR connection.

natural language processing blog

26 April 2009

Viterbi search, minimum Bayes risk and laplacian eigenmaps

24 April 2009

...and here it is for ACL

23 April 2009

NAACL program up