29 March 2006

Heuristics

Deepak brings up a great discussion topic that really needs its own post. Hopefully I'll catch this here before anyone continues to reply there (don't). The basic question is this:

Why cannot people simply use heuristicy and hackery approaches that have been proven over years to work well?

Deepak's point, which I think is well taken and should not be forgotten, is that a simple "hacky" thing (I don't intend "hacky" to be condescending...Deepak used it first!) often only does at most epsilon worse than a mathematically compelling technique, and sometimes even better. I think you can make a stronger argument. Hacky things allow us to essentailly encode any information we want into a system. We don't have to find a mathematically convenient way of doing so (though things are better now that we don't use generative models for everything).

I think there are several responses to these arguments, but I don't think anything topples the basic point. But before I go into those, I want to say that I think there's a big difference between simple techniques and heuristicy techniques. I am of course in favor of the former. The question is: why shouldn't I just use heuristics. Here are some responses.

1. That's just how I am. I studied math for too long. It's purely a personal choice. This is not compelling on a grand scale, but at a personal level, it is unlikely one will do good research if one is not interested in the topic and approach.
2. We don't want to just solve one problem. Heuristic techniques are fantastic for solving a very particular problem. But they don't generalize to other similar problems. The extent to which they don't generalize (or the difficulty to force generalization) of course varies.
3. Similar to #2, I want a technique that can be described more briefly than simply the source code. Heuristic techniques by definition cannot. Why? I'd like to know what's really going on. Maybe there's some underlying principle that can be applied to other problems.
4. Lasting impact. Heuristics rarely have lasting (scientific) impact because they're virtually impossible to reproduce. Of course, a lasting impact for something mathematically clever but worthless is worse than worthless.
There are probably others that other people can come up with.

That's my general perspective. In response to specific issues Deepak brought up:
...a more complicated model gives very little improvements and generally never scales easily.
I think it's worthwhile separating the problem from the solution. The issue with a lot of techniques not scaling is very true. This is why I don't want to use them :). I want to make my own techniques that do scale and that make as few assumptions as possible (regarding features, loss, data, etc.). I think we're on our way to this. Together with John and Daniel, we have some very promising results.
...working on such (SP) problems means getting married more to principles of Machine Learning than trying to make any progress towards real world problems.
I, personally, want to solve real-world problems. I cannot speak for others.
...a lot of lot of smart (young?) people in the NLP/ML community simply cannot admit the fact that simple techniques go a long way...
This is very unfortunate if true. I, for one, believe that simple techniques do go a long way. And I'm not at all in favor of using a steamroller to kill a fly. But I just don't feel like they can or will go all the way in any scalable manner (scalable in O(person time) not O(computation time)). I would never (anymore!) build a system for answering factoid questions like "how tall is the empire state building?" that is not simple. But what about the next level "how much taller is the empire state building than the washington monument?" Okay, now things are interesting, but this is still a trivial question to answer. Google identifies this as the most relevant page. And it does contain the necessary information. And I can imagine building heuristics that could answer "how much Xer is X that Y" based on asking two separate questions. But where does this process end? I will work forever on this problem. (Obviously this is just a simple stupid example which means you can find holes in it, but I still believe the general point.)

25 March 2006

Yesterday was Structured Learning Day

I'm out in Chicago at the atomic learning workshop and yesterday was structured prediction day. There were a lot of cool talks on pretty much all the major SP techniques: Charles talked about efficient training of CRFs, Yasemin talked about semi-supervised learning in M3Ns and SVMISOs, Ben talked about efficient training of M3Ns for combinatorial-style problems (like word alignment), Drew talked about planning as a SP problem, Vasin talked about the tradeoffs of local vs. global inference in SP, and I of course talked about search.

I really enjoyed Drew's talk. He is considering the problem of robot navigation, which is typically handled using A* search and some very crafty hand-written heuristic rules for generating the cost map over which A* search runs. He wants to get rid of the crafty part and replace it with learning. The idea is simple and cute: use observed features the robot receives to learn to produce a cost map so that when A* runs it finds the right path. Here, he defines the right path by having a human remote control the robot and drive it to the correct place. He casts this as a max-margin problem and solves it using very fast subgradient methods. There's no paper on this yet, but if all goes well there will be soon. I think there's a lot of stuff here that could strongly influence how we think about SP.

There were many questions raised at the workshop that I deserve significant thought. These include:

1. What is structured learning?
2. How does structured learning related to multi-task learning (ala Rich Caruana and others)?
3. What sort of techniques are there for dealing with un- or partially-labeled data for structured learning?
4. What can be done about featuritis (i.e., throwing in all possible features)? Also: Is this a legitimate concern?
5. How important is it what loss function we optimize?
6. When is local learning enough (ala Dan Roth) and when must you do "global learning?"
7. What training techniques scale sufficiently?
Almost all of these questions deserve separate posts, and of course they are non-independent. Modulo feedback, I'll probably start at the top and work my way down. (Of course, if John beats me to it, I'll just post on his blog and cross-link here.)

19 March 2006

Viterbi Search is Not Magic

This post is about the use of Viterbi search in NLP applications. Skip the next paragraph if you recall readily how VS works.

The easiest formulation of VS is for a sequence labeling problem: given an input x, find a label vector y1,...,yN that maximizes the score [ f(x,y0,y1) + f(x,y1,y2) + ... + f(x,y{N-1},yN) ]. (Here, y0 is some special <s> symbol.) When there are K possible values for each label, VS solves this by filling in a dynamic programming matrix a of size (N+1)*K, where a(0,k) is zero if k is <s> and negative infinity otherwise. Recursively, a(n+1,k') is [ max_k [ a(n,k) + f(x,k,k') ] ]. The largest a(N+1, . ) gives the highest score; by storing backpointers (which k was the largest) in another matrix, we can recover the whole sequence once we reach the end. This runs in time O(NK^2), as opposed to O(N^K) as a naive implementation would run.

I am becoming somewhat notorious for not using Viterbi search, even for sequence labeling problems (ironic, given that I am part of the Viterbi School of Engineering). The purpose of this post is to show that this is not unreasonable.

My evolving feeling on this has two angles:

1. Viterbi is good for something like an HMM where we can't model the whole input as features and so we need to talk back-and-forth via Viterbi. But when we get to use any feature at any time (as in any modern model), this advantage goes away.
2. I think people ascribe some "magic" to Viterbi. At least in NLP, the Viterbi algorithm and its generalizations (inside-outside, etc.) are treated as stand alone units. They're not. They're simply based on the observation that under certain constraints, when executing search, one can do efficient memoization. If you throw away the "magic" of Viterbi and treat it as jsut another search algorithm, you see that there's nothing special there.
I have a hard time convincing people of 2. I'm not quite sure why. I almost liken it to trying to convince a classically trained logician the ways of constructive logic are right (being one of the former, I'm not yet convinced). The problem is that we're so used to seeing VS as this very special array-filling operation that we forget what's really going on. We could just as easily do search and store stuff in a hash table. It wouldn't be quite as efficient, but complexity wise (assuming O(1) access to the table), it would be the same.

The "magic" of Viterbi is that it efficiently integrates all information from the different sources assuming that the conditional independences hold and assuming that all conditional probabilities are correctly estimated. But we know this is not the case.

That said, I think there is something to be said for later decisions being able to influence earlier ones. But I don't take this as given. And I don't think it's always the case. This is a big open question for me. By using a more complex search algorithm, you're forcing more decisions to be made but are (we hope) making these decisions easier. This is a trade-off. One angle is not a priori better than the other.

16 March 2006

Snowbird Accepted Papers Up

The list of accepted papers for the Snowbird Learning Workshop is up. I'm interested to see:

• Training CRFs with SMD (Vishwanathan, Shraudolph, Schmidt + Murphy)
• Sub-gradient method structured learning (Bagnell, Ratliff + Zinkevich)
• AdaBoost is MaxEnt over the error distribution (Haffner)
• Structural correspondences for transfer learning (Blitzer, Crammer, McDonald + Pereira)
• Rule-extraction from SVMs (Braun + Buhmann)
• Prior beliefs in SSL (Luxburg + Ben-David)
• Model compression (Caruana, Bucila + Niculescu-Mizil)

10 March 2006

Is X useful for Y?

This post has two parts, both addressing the issue of showing that one task (X) is useful for another task (Y); eg., syntax is useful for MT, or WSD is useful for IR, or ....

The first question is: on whom is the onus of making such an argument. (I'm presupposing that it makes sense to make such an argument.) There are three options: (1) a person who does X; (2) a person who does Y; a third party. Arguments in favor of 1: if I work on a task, I am likely to want to justify its importance. For 2: I know the most about Y, so if X is useful, I can probably get it to work; also, other (2)s are more likely to buy my results. For 3: ???.

One could argue (3) to be (more) unbiased, but I'll make an idealized assumption that (1) and (2) care about furthering science, not some personal research program. Given that, I think the best case scenario is to have a joint authored paper between a (2) and a (1); that failing, I'd probably prefer a paper by a (2). I'd prefer it not from a potential bias perspective, but because a (2) is more likely to produce a Y-system that's state-of-the-art, so showing that X improves on this is going to be more convincing. Of course if a (1) can do that, that's fine too.

The second issue that I see is that in a lot of cases it's not cut and dry. Take for example a recent paper that showed that syntax is useful for EDT. I believe this paper is probably right, given the maturity of the system into which syntax was added. However, take my EDT system. I worked on this for about 15 months not using syntax. I used all sorts of crazy features. If I add syntax (which I've done), it improves things a little. Not a lot. And only for coref, not for mention tagging. Why? Likely, I've engineered around having syntax. If I had added syntax at the beginning, maybe I could have done away with many of the other features I have. We see the same thing in IR: if some complex NLP technique seems to improve IR systems but is expensive to run, then often people will keep tearing it apart until they find some tiny little feature that's doing all the important work. If so, can we still say that the original task helped?

I think, given this, there is a better way to measure the usefulness of X to Y other than: X improved Y's performance. Consider someone setting out to build an EDT system. They want to know whether they should include syntax or not. The real question is: assuming I get comparable performance, is it easier to include syntax or to engineer around it. I don't know how to measure this automatically (lines of code is perhaps a reasonable surrogate, assuming the same coder writes both), but it seems like such a measure would be much more useful and telling than whether or not performance on an arbitrary system goes up.

08 March 2006

HLT-NAACL Program Up

The program for HLT-NAACL 2006 has been posted (I didn't submit anything). I'm partcularly looking forward to:

• A fast finite-state relaxation method (Tromble + Eisner)
• Name matching in English and Arabic (Freeman, Condon + Ackerman)
• Do we need phrases? (Quirk + Menezes)
• Using syntax for false entailment (Snow, Vanderwende + Menezes)
• Grammatical machine translation (Riezler + Maxwell)
• SRL, Wornet and Wikipedia for Coreference (Ponzetto + Strube)
I'm sure others will be good too. Anyone see anything else especially intruiging?