26 April 2009

Viterbi search, minimum Bayes risk and laplacian eigenmaps

I've been slow at posting because I wanted to post on this current topic for a while, but wanted to work out some more details before doing so. Well, it didn't happen. So I'm writing sans details.

Let's think for a minute about non-linear dimensionality reduction, aka manifold learning. If we compare what, say, ISOMAP does with what Laplacian eigenmaps (LE) does, they're really quite similar. In both cases, we construct a graph over our data points, usually a kNN graph or something, sometimes with edge weights, sometimes not. We then attempt to embed the points in a low dimensional space to minimize some sort of distortion. In ISOMAP, the distortion is based on the shortest path distance between pairs of points in our constructed graph. In LE, the distance is measured according to the Laplacian of the graph. The Laplacian of a graph is basically a discrete version of the more standard differential operator by the same name (the one that comes up in most standard analysis courses). In continuous land, it is the divergence of the gradient, which essentially means it measures some form of diffusion (and it has its original applications in physics). In discrete land, where we live, it can be thought of as a measure of flow on a graph.
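
To make the comparison concrete, here is a minimal sketch of both constructions on a toy point cloud; the data, the choice of k, and the dense numpy/scipy routines are all illustrative assumptions, not anything from a real manifold-learning package:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

# Toy data: a noisy 2-D spiral (purely illustrative).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.5, 3 * np.pi, 200))
X = np.c_[t * np.cos(t), t * np.sin(t)] + 0.1 * rng.normal(size=(200, 2))

# Shared first step: a symmetric kNN graph (k chosen arbitrarily).
k = 8
D = cdist(X, X)
W = np.zeros_like(D)
for i, nbrs in enumerate(np.argsort(D, axis=1)[:, 1:k + 1]):
    W[i, nbrs] = 1.0
W = np.maximum(W, W.T)

# ISOMAP-style distortion: shortest-path distances on the graph,
# then classical MDS (assumes the kNN graph is connected).
G = shortest_path(W * D, method="D", directed=False)
n = len(X)
H = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * H @ (G ** 2) @ H
mds_vals, mds_vecs = np.linalg.eigh(B)
isomap_coord = mds_vecs[:, -1] * np.sqrt(mds_vals[-1])   # top MDS coordinate

# LE-style distortion: bottom non-constant eigenvector of L = D - W.
deg = W.sum(axis=1)
L = np.diag(deg) - W
le_vals, le_vecs = np.linalg.eigh(L)
le_coord = le_vecs[:, 1]                                  # Fiedler vector
```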

The key difference, then, between ISOMAP and LE is whether you measure distances according to shortest path or to flow, which has a very "average path" feel to it. The advantage to LE is that it tends to be less susceptible to noise, which is easy to understand when viewed this way.

Now, getting back to NLP stuff, we often find ourselves doing some sort of shortest path search. It's well known that the much loved Viterbi algorithm is exactly an application of dynamic programming to search in the lattice defined by an HMM. This extends in well known ways to other structures. Of course, Viterbi search isn't the only game in town. There are two other popular approaches to "decoding" in HMMs. One is marginal decoding, where we individually maximize the marginal probability of each node. The other is minimum Bayes risk (MBR) decoding: we take a user-supplied risk (aka loss) function and try to find the output that minimizes the expected risk, where the expectation is taken under our current model. MBR has been shown to outperform Viterbi in many applications (speech, MT, tagging, etc.). If your risk function is 0/1 loss over entire output sequences, MBR recovers the Viterbi solution; with a per-node 0/1 (Hamming) loss, it recovers marginal decoding.
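
To see the three decoders side by side, here is a toy sketch on a tiny made-up HMM; the numbers are arbitrary, and everything is brute-force enumerated rather than done with the usual dynamic programs, purely so the definitions stay transparent:

```python
import numpy as np
from itertools import product

# Tiny made-up HMM: 2 states, 3 observation symbols, all numbers arbitrary.
pi = np.array([0.6, 0.4])                          # initial state probs
A = np.array([[0.7, 0.3], [0.4, 0.6]])             # transition probs
E = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])   # emission probs
obs = [0, 1, 2, 2]

def seq_prob(states):
    p = pi[states[0]] * E[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * E[states[t], obs[t]]
    return p

# Brute-force enumeration of all label sequences (fine for a toy example;
# Viterbi and forward-backward do this with dynamic programming).
all_seqs = list(product(range(2), repeat=len(obs)))
probs = np.array([seq_prob(s) for s in all_seqs])
probs /= probs.sum()

# Viterbi / MAP decoding: the single most probable sequence.
viterbi = all_seqs[int(np.argmax(probs))]

# Marginal decoding: maximize each position's marginal independently.
marginals = np.zeros((len(obs), 2))
for s, p in zip(all_seqs, probs):
    for t, state in enumerate(s):
        marginals[t, state] += p
marginal = tuple(np.argmax(marginals, axis=1))

# MBR decoding: minimize expected loss under the model, here Hamming loss.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

mbr = min(all_seqs, key=lambda y: sum(p * hamming(y, s)
                                      for s, p in zip(all_seqs, probs)))

# With per-position 0/1 (Hamming) loss, MBR coincides with marginal decoding;
# with sequence-level 0/1 loss it coincides with Viterbi.
print(viterbi, marginal, mbr)
```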

What I'm curious about is whether there is a connection here. I'm not exactly sure how the construction would go -- I'm thinking something like a graph defined over the lattice vertices with edges that reflect the loss function -- but there definitely seems to be a similarity between MBR and average path, which is approximately a Laplacian (aka spectral) operation. The reason I think this would be interesting is that a lot is known about spectral computations, and we might be able to use this knowledge to our advantage (coming up with MBR algorithms is sometimes a bit painful). An additional complication (or bit of fun, if you see it that way) is that there are at least three standard ways to generalize the notion of a Laplacian to a hypergraph, which is what we would really need to do. Perhaps we can help pick out the "right" one through the MBR connection.
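
For reference, one of those standard constructions is the normalized hypergraph Laplacian of Zhou et al. (a clique-expansion-style operator); a minimal sketch on a made-up incidence matrix might look like this:

```python
import numpy as np

# Made-up hypergraph: 5 vertices, 3 hyperedges.
# Incidence matrix H: H[v, e] = 1 if vertex v belongs to hyperedge e.
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
w = np.array([1.0, 2.0, 1.0])     # hyperedge weights (arbitrary)

d_v = H @ w                       # vertex degrees
d_e = H.sum(axis=0)               # hyperedge sizes

Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
De_inv = np.diag(1.0 / d_e)
W = np.diag(w)

# Normalized hypergraph Laplacian (Zhou et al. style):
#   Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
Delta = np.eye(H.shape[0]) - Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt

# Its bottom eigenvectors give an embedding/clustering of the vertices,
# exactly as with an ordinary graph Laplacian.
evals, evecs = np.linalg.eigh(Delta)
```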

Comments:

  1. Slightly OT, but if you think about what the analogue of averaged shortest paths would be, electrical flow is what comes to mind, with resistance acting as an edge cost (you might have to exponentiate or otherwise transform the costs to make it work quite right).

    Another observation is that since the Laplacian captures various kinds of cuts (see the spectral clustering work), you'd want to show that MBR acts like it's finding a good cut or a good flow of some kind. Maybe a dual formulation would help reveal such structure. (A small sketch of the resistance idea is below.)
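
A minimal sketch of that electrical analogy, assuming a small made-up graph and treating edge weights as conductances (1/resistance), with the effective resistance read off the Laplacian's pseudoinverse:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

# Small made-up graph; edge weight = conductance = 1 / resistance.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
Lp = np.linalg.pinv(L)                    # Moore-Penrose pseudoinverse

def effective_resistance(u, v):
    # R_eff(u, v) = (e_u - e_v)^T L^+ (e_u - e_v): an "averaged path" distance.
    e = np.zeros(len(W))
    e[u], e[v] = 1.0, -1.0
    return e @ Lp @ e

# Shortest-path distance with resistance (1 / weight) as the edge cost.
R = np.zeros_like(W)
R[W > 0] = 1.0 / W[W > 0]
sp = shortest_path(R, directed=False)

print(effective_resistance(0, 3), sp[0, 3])   # resistance distance <= shortest path
```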

  2. Hmm, this is interesting... but here's a question:

    In decoding we ultimately want to compute a path, whether that path is found by shortest distance or by average/MBR "distance". In manifold learning, the actual path between two points is not that important, since the emphasis is on the value or distance of that path (or average path). Do you think the spectral operators could still work for decoding then? (This isn't a rhetorical question; I don't know the answer.)

    Another question: I'm curious, what are the standard ways to generalize the Laplacian to a hypergraph?

  3. samuel gerber (20 May, 2009 15:23)

    Laplacian eigenmaps is equivalent to embedding the commute times of a random walk on the graph [Qiu and Hancock].

    Commute time can be interpreted in terms of an electrical circuit; thus LE behaves like an averaged shortest path on a graph whose edges represent resistors. (A small sketch of that equivalence is below.)
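
A minimal sketch of that equivalence on a small made-up graph: the commute-time embedding weights the Laplacian's eigenvectors by 1/sqrt(eigenvalue) (a scaling that plain LE omits), and squared distances in it equal graph volume times effective resistance:

```python
import numpy as np

# Small made-up graph (same spirit as the sketch above).
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
vol = W.sum()                       # graph volume (sum of degrees)

# Commute-time embedding: eigenvectors of L weighted by 1/sqrt(eigenvalue).
evals, evecs = np.linalg.eigh(L)
nz = evals > 1e-9                   # drop the constant/zero eigenvector
phi = np.sqrt(vol) * evecs[:, nz] / np.sqrt(evals[nz])

# Squared Euclidean distance in this embedding equals the commute time,
# which is vol * effective resistance (computed via the pseudoinverse).
Lp = np.linalg.pinv(L)
u, v = 0, 3
commute = np.sum((phi[u] - phi[v]) ** 2)
resistance = Lp[u, u] + Lp[v, v] - 2 * Lp[u, v]
assert np.isclose(commute, vol * resistance)
```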

  4. In MBR, we won't necessarily compute a path. For instance, consider POS tagging or NE chunking. If we take something like F-measure loss balanced toward recall, then if there are two outputs with probability 0.4 each, we'll output them both to improve our recall at the cost of precision. We do this all the time for high recall named-entity chunking.

    In practice, I haven't found per-tag decoding for POS tagging or for chunking to be much different in 0/1 performance from regular Viterbi; we have both implemented in LingPipe's HMM and chunker decoders. Kakade, Teh and Roweis, on the other hand, found that training an MEMM with that loss had a huge effect. (A toy sketch of the high-recall thresholding idea is below.)
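
A toy sketch of that thresholding idea; the candidate chunks, probabilities, and threshold are all made up, and this is not LingPipe's actual API:

```python
# Made-up candidate chunks with posterior (marginal) probabilities.
# Each key: (text, start, end, type); the numbers are purely illustrative.
candidates = {
    ("ACME Corp", 0, 2, "ORG"): 0.40,
    ("Corp", 1, 2, "ORG"): 0.40,
    ("ACME", 0, 1, "ORG"): 0.15,
}

def high_recall_chunks(cands, threshold=0.3):
    # Under a loss balanced toward recall, the MBR-style decision is roughly
    # "emit every chunk whose marginal clears a low threshold", even when the
    # chunks overlap and could not co-occur in any single Viterbi path.
    return [chunk for chunk, p in cands.items() if p >= threshold]

print(high_recall_chunks(candidates))   # emits both 0.4-probability chunks
```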
