29 March 2006

Heuristics

Deepak brings up a great discussion topic that really needs its own post. Hopefully I'll catch this here before anyone continues to reply there (don't). The basic question is this:

Why cannot people simply use heuristicy and hackery approaches that have been proven over years to work well?

Deepak's point, which I think is well taken and should not be forgotten, is that a simple "hacky" thing (I don't intend "hacky" to be condescending...Deepak used it first!) often does at most epsilon worse than a mathematically compelling technique, and sometimes even better. I think you can make a stronger argument. Hacky things allow us to encode essentially any information we want into a system. We don't have to find a mathematically convenient way of doing so (though things are better now that we don't use generative models for everything).

I think there are several responses to these arguments, but I don't think anything topples the basic point. But before I go into those, I want to say that I think there's a big difference between simple techniques and heuristicy techniques. I am of course in favor of the former. The question is: why shouldn't I just use heuristics? Here are some responses.
  1. That's just how I am. I studied math for too long. It's purely a personal choice. This is not compelling on a grand scale, but at a personal level, it is unlikely one will do good research if one is not interested in the topic and approach.
  2. We don't want to just solve one problem. Heuristic techniques are fantastic for solving a very particular problem. But they don't generalize to other similar problems. The extent to which they don't generalize (or the difficulty of forcing them to) of course varies.
  3. Similar to #2, I want a technique that can be described more briefly than its source code. Heuristic techniques, by definition, cannot be. Why do I care? I'd like to know what's really going on. Maybe there's some underlying principle that can be applied to other problems.
  4. Lasting impact. Heuristics rarely have lasting (scientific) impact because they're virtually impossible to reproduce. Of course, a lasting impact for something mathematically clever but worthless is worse than worthless.
There are probably others that other people can come up with.

That's my general perspective. In response to specific issues Deepak brought up:
...a more complicated model gives very little improvements and generally never scales easily.
I think it's worthwhile separating the problem from the solution. The issue that a lot of techniques don't scale is very real. This is why I don't want to use them :). I want to make my own techniques that do scale and that make as few assumptions as possible (regarding features, loss, data, etc.). I think we're on our way to this. Together with John and Daniel, we have some very promising results.
...working on such (SP) problems means getting married more to principles of Machine Learning than trying to make any progress towards real world problems.
I, personally, want to solve real-world problems. I cannot speak for others.
...a lot of lot of smart (young?) people in the NLP/ML community simply cannot admit the fact that simple techniques go a long way...
This is very unfortunate if true. I, for one, believe that simple techniques do go a long way, and I'm not at all in favor of using a steamroller to kill a fly. But I just don't feel like they can or will go all the way in any scalable manner (scalable in O(person time), not O(computation time)). I would never (anymore!) build a system for answering factoid questions like "how tall is the empire state building?" that is not simple. But what about the next level: "how much taller is the empire state building than the washington monument?" Okay, now things are interesting, but this is still a trivial question to answer. Google identifies this as the most relevant page, and it does contain the necessary information. And I can imagine building heuristics that could answer "how much Xer is X than Y" based on asking two separate questions. But where does this process end? I would work forever on this problem. (Obviously this is just a simple, stupid example, which means you can find holes in it, but I still believe the general point.)
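
To make the kind of heuristic I have in mind concrete, here is roughly what such a rule might look like, sketched in Python. The factoid_qa function is just a stand-in for whatever factoid system you already have, the pattern is one rule out of the endless set you would need, and it quietly assumes the two answers come back as numbers in comparable units.

    import re

    def factoid_qa(question):
        # Stand-in for an existing factoid QA system, e.g. one that can
        # already answer "how tall is the empire state building?"
        raise NotImplementedError

    # One heuristic rule: answer "how much Xer is A than B?" by asking two
    # factoid questions and subtracting the (numeric) answers.
    COMPARATIVE = re.compile(r"how much (\w+)er is (.+) than (.+)\?", re.IGNORECASE)

    def answer(question):
        match = COMPARATIVE.match(question.strip())
        if match is None:
            return factoid_qa(question)   # fall back to the factoid system
        attribute, entity_a, entity_b = match.groups()
        value_a = factoid_qa("how %s is %s?" % (attribute, entity_a))
        value_b = factoid_qa("how %s is %s?" % (attribute, entity_b))
        return value_a - value_b          # assumes comparable numeric answers

Every new comparative form ("which is heavier," "how much more expensive") needs another rule like this, which is exactly the "where does this process end?" worry.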

10 comments:

  1. I have a few thoughts here.

    First, I think there's a difference between "simple" approaches and "memorization" approaches. If I have a summarization system and google wants to use it, I don't care if they precompute and store everything or compute things on the fly. (In practice, you'd probably want to trade off storage for runtime and do some smart indexing.) The question is: where do these original summaries come from? Similarly, I don't really think phrase-based MT is the Wal-Mart approach. Perhaps more so than, say, Interlingua-based MT. But, to me, Wal-Mart MT would be "store all parallel sentences; when a new F sentence comes in, find the closest match in your corpus and output the translation." Phrase-based MT is (IMO) actually *too* complicated :). Or, at least the current models we have for it are. (More specifically, its complexity is in the wrong places.)
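
    Just so we're arguing about the same thing, here is a minimal sketch of what I mean by Wal-Mart MT (the word-overlap similarity is only a placeholder, and a real system would at least want smarter indexing):

        def walmart_mt(new_sentence, parallel_corpus):
            # parallel_corpus: list of (foreign_sentence, english_translation) pairs.
            # Return the stored translation of the closest foreign sentence.
            def overlap(a, b):
                # crude similarity: word-set overlap (a placeholder)
                a_words, b_words = set(a.split()), set(b.split())
                return len(a_words & b_words) / float(max(len(a_words | b_words), 1))
            best_foreign, best_english = max(
                parallel_corpus, key=lambda pair: overlap(new_sentence, pair[0]))
            return best_english

    That, to me, is the memorization end of the spectrum; phrase-based MT already carries a lot more machinery than this.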

    I think regarding the math/empirical results/etc., everyone has his or her own threshold for believability: the minimum evidence they require to believe a technique is useful (i.e., decide to try to apply it to one's own problem). For some people this is "you can prove a nice theorem about it and I agree with the assumptions"; to some people this is "it improves state of the art performance on some benchmark"; to some it is "easy to implement"; to others it is "psychologically plausible."

    I fall somewhere between the "theorem" statement and the "empirical" statement. I will forgive a small lack in one for strong evidence in the other. Why? None of these pieces of evidence is ever going to be fully sufficient. I see both of them as sanity checks. The "empirical" part tells me that it works on one problem (essentially, that the bias is correct for some problem), while the "theorem" part tells me that it's probably not completely insane to believe it will also work (reasonably well) on another problem. I'm curious, Deepak: what is your threshold? :)

    I agree that one model for everything is not going to be realized for a long time (I hesitate to even say that it is possible). I disagree that a one-model-for-many-things approach is not going to be realized for a long time. I currently have an approach for structured prediction that works really well on sequence labeling tasks, coreference resolution and automatic summarization. I can see how (though I don't have the time) to apply it to MT, parsing and other NLP tasks, as well as a few things in vision and biology. It's also very easy to implement, assuming you have access to some pre-implemented binary classifier.
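
    I won't describe that approach here, but to give a flavor of what "easy to implement on top of a pre-existing classifier" can mean, here is a generic sketch (not the approach I'm referring to, and deliberately oversimplified) of reducing sequence labeling to per-position classification; the features function and the classifier are whatever you already have lying around:

        def make_training_examples(sequences, features):
            # sequences: list of (words, gold_labels) pairs.
            # features(words, i, prev_label) returns a feature vector for position i.
            examples = []
            for words, gold in sequences:
                prev = "<s>"
                for i in range(len(words)):
                    examples.append((features(words, i, prev), gold[i]))
                    prev = gold[i]
            return examples  # feed these to any off-the-shelf classifier learner

        def greedy_decode(classify, features, words):
            # classify: a trained classifier mapping a feature vector to a label.
            output, prev = [], "<s>"
            for i in range(len(words)):
                label = classify(features(words, i, prev))
                output.append(label)
                prev = label
            return output

    The point is only that once prediction is phrased as a sequence of classification decisions, the classifier and its features do most of the work.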

    Wrapping up for now, I think that many of us probably agree on most things (reusing the same data sets, etc.). I think this is a really useful discussion, but for the purposes of continuing it, it might be worth trying to semi-formally define what we mean by a Wal-Mart approach and a Saks approach. Based on comments, I get the impression everyone has a slightly different opinion.

  2. Incidentally, the Hand paper (which I largely agree with: in fact, several posts so far have talked about many of the issues pointed out in that paper) seems to focus on improved machine learning techniques, which I think many people here associate with "overly complex" techniques. But I don't really think that anything said there does not also apply to the increasing complexity of more "practical" solutions, such as rule-based translation (for specific domains), or any other "practicality"-focused solution. This is the typical 90/10 rule, where one year will get you 90% of the way and you need ten more years to get the last 10%. That last 10% will involve tons of complexity, regardless of how you solve it and regardless of whether the improvements are "theoretical" or practical ones.

  3. We built Hal's Wal-Mart MT system last year and entered it in the NIST eval. See the ACL 2005 Workshop on Building and Using Parallel Texts for details.

  4. William: is that paper online anywhere?

    I completely agree about the problem of improving on stationary test sets and will probably post about this topic at some point. But this is really a concern with the treebank-based evals. Everyone else (MT, summarization, IE, QA, etc.) has yearly evals where the data changes. There's also the question of whether these improvements are real.

    I agree about making assumptions, but the hacky approach is also making assumptions. They're just never written down. A big assumption made, for instance, is: the data I'm hacking on is the only data I ever need to do well on. This assumption scares me. A lot.

    Your definitions about Wal-mart and Saks agree with what I have in my head. And I think I (largely, at least) agree about working on fun new problems. Where I disagree is on the claim that for many interesting problems we can get 98% (or even just 90%) of the way there in a year of work (hackily or correctly). For many things it seems more like we can get 50% of the way there easily. And then the question is whether we're satisfied with 50% or not. (How you define the %ages is not so important. In fact, for many problems for which current systems report "95% accuracy," this is an artifact of the cost function, not of "real system performance.") For instance, though QA systems report high accuracies on TREC-style data, we're clearly very far from actually solving the QA problem.
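
    As a cartoon of the cost-function artifact (hypothetical numbers): on a task where 95% of the decisions are trivially the majority class, a system that does nothing useful still "scores" 95%:

        # Hypothetical illustration: 95 easy "no" decisions, 5 hard "yes" ones.
        gold = ["no"] * 95 + ["yes"] * 5
        predictions = ["no"] * 100            # a system that does nothing useful
        accuracy = sum(p == g for p, g in zip(predictions, gold)) / float(len(gold))
        print(accuracy)                       # 0.95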

    One answer to your thesis at the end is: yes, we have a zillion different problems. This is precisely why I don't want to hack up a solution to each one individually!

  5. I'll jump into the conversation with my 2 cents worth:

    Last summer I was working on all sorts of ways to train a dependency parser. Being a person who likes math, I studied all sorts of complicated models. In the end I discovered that many training methods gave similar results and that the real impact came from defining good features. One may call such feature engineering hacky, but now I have a deeper appreciation for those who have the linguistic insight to come up with simple features for a particular problem.

    I liked Deepak's statement: "Why work on 2% gain problem over 10 years when we have zillion different problems to work on". Can you tell me what the zillion different problems are, though? (I'm in thesis-searching mode now... :) It would be cool if we could identify those new problems--it seems that it's often easier to work on the same problem (and use the same test set) than to identify a new important area. Finding a problem requires true innovation; solving a problem can be done with either hacky or complex solutions.

  6. Kevin: I doubt anyone would be really surprised at your situation with the parser. I think, given this experience, that the request to the machine learning people is for an SP framework that enables us to incorporate whatever kinds of features we want and not have to worry about messy things like efficiency and learning and the like. You just want it to work.

    Ross: Perhaps a second issue (and this relates to Kevin's request for topics) is that the academic system rewards staying inside the box and not branching out to "odd" problems. This means that the only possible impact is by incremental improvements on the state of the art.

    It does seem a bit strong to say that any step off the shortest path is equivalent to noise. I fear that such arguments, while often reasonable, are on a very slippery slope. My concern is that it's very easy to conclude that anything done to improve an existing system should not be done. But clearly progress has been made on lots of problems this way. As long as you're not doing human-in-the-loop learning (i.e., running your system against test data, adding more features, running against the same test data, and so on), I feel like you're fairly safe. It is common practice to either change the test set every year or at least to do the feature engineering against a held-out development set. Sure, you're at a slight risk of overfitting, but this seems better to me than never actually solving any problem.
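
    In code terms, the "fairly safe" protocol I have in mind is nothing more than this sketch (fit and evaluate stand in for whatever training and scoring routines you already have):

        import random

        def tune_then_test(corpus, candidate_feature_sets, fit, evaluate):
            # Do all feature engineering against a held-out development set;
            # touch the test set exactly once, at the very end.
            random.shuffle(corpus)
            n = len(corpus)
            train = corpus[: int(0.8 * n)]
            dev = corpus[int(0.8 * n): int(0.9 * n)]
            test = corpus[int(0.9 * n):]
            best_score, best_features = float("-inf"), None
            for feature_set in candidate_feature_sets:   # the feature-hacking loop
                score = evaluate(fit(train, feature_set), dev)
                if score > best_score:
                    best_score, best_features = score, feature_set
            final_model = fit(train + dev, best_features)
            return evaluate(final_model, test)           # reported once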

  7. Ross: I really like your post. The first point is essentially a "trying to get to the moon by climbing a tree" argument. I think it's often really hard to stop climbing the tree, even if you seem to be hitting the top, because it's a long way down. I think regardless of whether you are practically oriented or scientifically oriented, this can be a challenge, essentially because it can take a while to climb back up. What is the best mechanism to avoid this in the academic world?

  8. The (one page) review of Braben's book is worth the short time to read.

  9. Hey Ross and Hal--thanks for the reference to the Braben book. I like his quote: "An expert opinion is one thing; the consensus of experts is another." (ps: This reminds me of a common phenomenon in training mixtures of experts in machine learning--you want lots of diversity among your experts to achieve good accuracy!) I think this discussion has gone from the debate on heuristics specifically to a more general forum on what it takes to do good research, to be innovative, etc. I'll summarize some of my own take-home messages so far:

    - There are Wal-mart style and Saks approaches to solving problems. Some think that simple Wal-mart style approaches get you most of the way. However, it seems that we all have different opinions on what approach is Wal-mart and what is not.

    - Quoting Hal: "Everyone has his or her own threshold for ... the minimum evidence they require to believe a technique is useful... For some people this is 'you can prove a nice theorem about it and I agree with the assumptions'; to some people this is 'it improves state of the art performance on some benchmark'; to some it is 'easy to implement'; to others it is 'psychologically plausible.'"

    - It is difficult to find new problems; further, peer-review systems tend to disfavor thinking out-of-the-box, so both new problems and new solutions have less chance to be pursued. Innovation comes from dissent; it comes from the feeling "Why do we have to do things like this?". I think this is true whether we have a Wal-mart or Saks approach to solutions.
