Why can't people simply use heuristic, hacky approaches that have been proven over the years to work well?
Deepak's point, which I think is well taken and should not be forgotten, is that a simple "hacky" thing (I don't intend "hacky" to be condescending...Deepak used it first!) often does at most epsilon worse than a mathematically compelling technique, and sometimes even better. I think you can make a stronger argument. Hacky things allow us to essentially encode any information we want into a system. We don't have to find a mathematically convenient way of doing so (though things are better now that we don't use generative models for everything).
I think there are several responses to these arguments, but I don't think anything topples the basic point. Before I go into those, I want to say that I think there's a big difference between simple techniques and heuristic techniques. I am of course in favor of the former. The question is: why shouldn't I just use heuristics? Here are some responses.
- That's just how I am. I studied math for too long. It's purely a personal choice. This is not compelling on a grand scale, but at a personal level, it is unlikely one will do good research if one is not interested in the topic and approach.
- We don't want to just solve one problem. Heuristic techniques are fantastic for solving a very particular problem, but they don't generalize to other similar problems. The extent to which they don't generalize (or the difficulty of forcing them to generalize) of course varies.
- Similar to the previous point, I want a technique that can be described more briefly than its source code; heuristic techniques, by definition, cannot be. Why? Because I'd like to know what's really going on. Maybe there's some underlying principle that can be applied to other problems.
- Lasting impact. Heuristics rarely have lasting (scientific) impact because they're virtually impossible to reproduce. Of course, a lasting impact for something mathematically clever but worthless is worse than worthless.
That's my general perspective. In response to specific issues Deepak brought up:
"...a more complicated model gives very little improvements and generally never scales easily." I think it's worthwhile separating the problem from the solution. The point that a lot of techniques don't scale is very true. This is why I don't want to use them :). I want to make my own techniques that do scale and that make as few assumptions as possible (regarding features, loss, data, etc.). I think we're on our way to this. Together with John and Daniel, we have some very promising results.
"...working on such (SP) problems means getting married more to principles of Machine Learning than trying to make any progress towards real-world problems." I, personally, want to solve real-world problems. I cannot speak for others.
"...a lot of smart (young?) people in the NLP/ML community simply cannot admit the fact that simple techniques go a long way..." This is very unfortunate if true. I, for one, believe that simple techniques do go a long way, and I'm not at all in favor of using a steamroller to kill a fly. But I just don't feel like they can or will go all the way in any scalable manner (scalable in O(person time), not O(computation time)). I would never (anymore!) build a system that is not simple for answering factoid questions like "How tall is the Empire State Building?" But what about the next level: "How much taller is the Empire State Building than the Washington Monument?" Okay, now things are interesting, but this is still a trivial question to answer. Google identifies this as the most relevant page, and it does contain the necessary information. And I can imagine building heuristics that could answer "how much Xer is X than Y" by asking two separate questions (a sketch follows below). But where does this process end? I will work forever on this problem. (Obviously this is just a simple, stupid example, which means you can find holes in it, but I still believe the general point.)
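To make that concrete, here is a minimal sketch (in Python) of the two-question heuristic I have in mind. Everything in it is a stand-in: the regex template, the hypothetical `FACTOIDS` lookup table, and the hard-coded heights are purely illustrative, not a real QA system.

```python
# Heuristic decomposition: answer "how much Xer is X than Y?" by asking two
# factoid questions ("how X is ...?") and subtracting. The lookup table
# stands in for whatever factoid QA backend you already have.

import re

# Hypothetical factoid backend: (entity, attribute) -> numeric answer.
# Heights are approximate, in meters, for illustration only.
FACTOIDS = {
    ("the empire state building", "tall"): 381.0,
    ("the washington monument", "tall"): 169.0,
}

def answer_factoid(entity: str, attribute: str) -> float:
    """Stand-in for answering 'how {attribute} is {entity}?'"""
    return FACTOIDS[(entity.lower().strip(), attribute)]

def answer_comparative(question: str) -> float:
    """Heuristic for 'how much Xer is X than Y?' via two factoid queries."""
    m = re.match(r"how much (\w+)er is (.+) than (.+)\?", question.lower())
    if m is None:
        raise ValueError("question does not match the comparative template")
    attribute, x, y = m.groups()
    return answer_factoid(x, attribute) - answer_factoid(y, attribute)

if __name__ == "__main__":
    q = "How much taller is the Empire State Building than the Washington Monument?"
    print(answer_comparative(q), "meters")  # -> 212.0 meters
```

And the brittleness is exactly the point: "which is taller, X or Y?", "how much heavier...", or any rephrasing outside the template means writing yet another special case, which is where the "where does this process end?" worry comes from.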