19 March 2006

Viterbi Search is Not Magic

This post is about the use of Viterbi search in NLP applications. Skip the next paragraph if you recall readily how VS works.

The easiest formulation of VS is for a sequence labeling problem: given an input x, find a label vector y1,...,yN that maximizes the score [ f(x,y0,y1) + f(x,y1,y2) + ... + f(x,y{N-1},yN) ]. (Here, y0 is some special <s> symbol.) When there are K possible values for each label, VS solves this by filling in a dynamic programming matrix a of size (N+1)*K, where a(0,k) is zero if k is <s> and negative infinity otherwise. Recursively, a(n+1,k') is [ max_k [ a(n,k) + f(x,k,k') ] ]. The largest a(N+1, . ) gives the highest score; by storing backpointers (which k was the largest) in another matrix, we can recover the whole sequence once we reach the end. This runs in time O(NK^2), as opposed to O(N^K) as a naive implementation would run.

I am becoming somewhat notorious for not using Viterbi search, even for sequence labeling problems (ironic, given that I am part of the Viterbi School of Engineering). The purpose of this post is to show that this is not unreasonable.

My evolving feeling on this has two angles:

  1. Viterbi is good for something like an HMM where we can't model the whole input as features and so we need to talk back-and-forth via Viterbi. But when we get to use any feature at any time (as in any modern model), this advantage goes away.
  2. I think people ascribe some "magic" to Viterbi. At least in NLP, the Viterbi algorithm and its generalizations (inside-outside, etc.) are treated as stand alone units. They're not. They're simply based on the observation that under certain constraints, when executing search, one can do efficient memoization. If you throw away the "magic" of Viterbi and treat it as jsut another search algorithm, you see that there's nothing special there.
I have a hard time convincing people of 2. I'm not quite sure why. I almost liken it to trying to convince a classically trained logician the ways of constructive logic are right (being one of the former, I'm not yet convinced). The problem is that we're so used to seeing VS as this very special array-filling operation that we forget what's really going on. We could just as easily do search and store stuff in a hash table. It wouldn't be quite as efficient, but complexity wise (assuming O(1) access to the table), it would be the same.

The "magic" of Viterbi is that it efficiently integrates all information from the different sources assuming that the conditional independences hold and assuming that all conditional probabilities are correctly estimated. But we know this is not the case.

That said, I think there is something to be said for later decisions being able to influence earlier ones. But I don't take this as given. And I don't think it's always the case. This is a big open question for me. By using a more complex search algorithm, you're forcing more decisions to be made but are (we hope) making these decisions easier. This is a trade-off. One angle is not a priori better than the other.

2 comments:

Anonymous said...

Hi, can I have a look at the Viterbi algorithm? Or do you know any web that show the Viterbi algorithm? Thank you... ^^

Anonymous said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花