27 December 2007

Those Darn Biologists...

I've recently been talking to a handful of biologists. The lab/bench kind, not the computational kind. As an experiment in species observation, they intrigue me greatly. I'm going to stereotype, and I'm sure it doesn't hold universally, but I think it's pretty true and several (non-biologist) friends have confirmed this (one of whom is actually married to a biologist). The reason they intrigue me is because they are wholly uninterested in techniques that allow you to do X better, once they already have a solution for X.

This is a bit of an exaggeration, but not a huge one (I feel).

It is perhaps best served by example. Take dimensionality reduction. This is a very very well studied problem in the machine learning community. There are a bajillion ways to do it. What do biologists use? PCA. Still. And it's not that people have tried other things and they've failed. Other things have worked. Better. But no one cares. Everyone still uses PCA.

Part of this is probably psychological. After using PCA for a decade, people become comfortable with it and perhaps able to anticipate a bit of what it will do. Changing techniques would erase this.

Part of this is probably cultural. If you're trying to publish a bio paper (i.e., a paper in a bio journal) and you're doing dimensionality reduction and you aren't using PCA, you're probably going to really have to convince the reviewers that there's a good reason not to use PCA and that PCA just plain wouldn't work.

Now, that's not to say that biologists are luddites. They seem perfectly happy to accept new solutions to new problems. For instance, still in the dimensionality reduction setting, non-negative matrix factorization techniques are starting to show their heads in microarray settings. But it's not because they're better at solving the same old problem. What was observed was that if you have not just one, but a handful of microarray data sets, taken under slightly different experimental conditions, then the variance captured by PCA is the variance across experiments, not the variance that you actually care about. It has been (empirically) observed that non-negative matrix factorization techniques don't suffer from this problem. Note that it's only fairly recently that we've had such a surplus of microarray data that this has even become an issue. So here we have an example of a new problem needing a new solution.

Let's contrast this with the *ACL modus operandi.

There were 132 papers in ACL 2007. How many of them do you think introduced a new problem?

Now, I know I'm being a bit unfair. I'm considering dimensionality reduction across different platforms (or experimental conditions) as a different problem from dimensionality reduction on a single platform. You could argue that these really are the same problem. I just kind of disagree.

Just as biologists tend to be wary of new techniques, so we tend to be wary of new problems. I think anyone who has worked on a new problem and tried to get it published in a *ACL has probably gone through an experience that left a few scars. I know I have. It's just plain hard to introduce a new problem.

The issue is that if you work on an age-old problem, then all you have to do is introduce a new technique and then show that it works better than some old technique. And convince the reviewers that the technique is interesting (though this can even be done away with if the "works better" part is replaced with "works a lot better").

On the other hand, if you work on a new problem, you have to do a lot. (1) You have to introduce the new problem and convince us that it is interesting. (2) You have to introduce a new evaluation metric and convince us that it is reasonable. (3) You have to introduce a new technique and show that it's better than whatever existed before.

The problem is that (1) is highly subjective, so a disinterested reviewer can easily kill such a paper with a "this problem isn't interesting" remark. (2) is almost impossible to do right, especially the first time around. Again, this is something that's easy to pick on as a reviewer and instantly kill a paper. One might argue that (3) is either irrelevant (being a new problem, there isn't anything that existed before) or no harder than in the "age old problem" case. I've actually found this not to be true. If you work on an age old problem, everyone knows what the right baseline is. If you work on a new problem, not only do you have to come up with your own technique, but you also have to come up with reasonable baselines. I've gotten (on multiple occasions) reviews for "new problems" papers along the lines of "what about a baseline that does XXX." The irony is that while XXX is often creative and very interesting, it's not really a baseline and would actually probably warrant it's own paper!

That said, I didn't write this post to complain about reviewing, but rather to point out a marked difference between how our community works and how another community (that is also successful) works. Personally, I don't think either is completely healthy. I feel like there is merit in figuring out how to do something better. But I also feel that there is merit in figuring out what new problems are and how to solve them. The key problem with the latter is that the common reviewer complaints I cited above (problem not interesting or metric not useful) often are real issues with "new problem" papers. And I'm not claiming that my cases are exceptions. And once you accept a "new problem" paper that is on the problem that really isn't interesting, you've opened Pandora's box and you can expect to see a handful of copy cat papers next year ( could, but won't, give a handful of examples of this happening in the past 5 years). And it's really hard to reject the second paper on a topic as being uninteresting, given that the precedent has been set.

15 comments:

Suresh said...

This is interesting. It would be a little tricky to get a paper into FOCS/STOC/SODA where the main contribution was to improve some bound on a problem (unless the improvement was substantial, came after a lot of effort, introduced a new technique, etc).

On the other hand, although introducing a new problem can be tricky, lots of new problems get introduced into algorithms conferences, if done the right way. So there's a slightly different sensibility even within CS (to the extent that *ACL is "CS")

Bob Carpenter said...

Lab biologists care about solving biology problems, not machine learning, algorithms or statistics problems. They'll be hired, get tenure, get grants, get papers published, and win Nobel prizes for innovations in biology.

Time they spend learning about state-of-the-art machine learning and statistics is time they can't spend learning about biology.

Clustering and factorization techniques like PCA are exploratory data analysis tools used to formulate hypotheses. Biologists don't publish much based on data analysis; they have to go back in the lab and verify what they found in the statistics on the bench.

As an academic computational linguist, writing an ACL or NIPS paper is the end product. Parsing a section of the Penn treebank or participating in CoNLL is considered real work.

If, like me, you have to solve real problems for real customers, you'll find yourselves using bigram language models, HMMs, naive Bayes, and other robust techniques even if they're not "state of the art". That's largely because they work well without a lot of feature engineering and run fast in the field in small memory footprints.

Fernando Pereira said...

I've been working closely with lab biologists and I've started to publish with them. We should distinguish between "methods" papers and "results" papers. Our two main papers so far (one published, the other in print) are on new gene prediction methods. The reviews were pretty much like what I would expected from ACL reviewers, except that they were much more detailed, and required substantial work to answer in full. The first paper was published in a results-oriented journal (PLoS Comp Bio), the other has been accepted by a methods journal (Bioinformatics). Some of our most recent work may lead to specific, experimentally confirmed biological results, which we will submit to the appropriate results journals. Even methods papers may make it into results-oriented journals if the new method is seen by the editor and reviewers as changing the game in a particular area. This has been the case with recent work on discriminative structured gene prediction methods by the Rätsch lab at Max Plank Tübingen and by my group (publications in PLoS Comp Bio), and by the Batzoglou lab at Stanford and the Galagan lab at the Broad institute (publications in Genome Research).

Suresh said...

I should also add that in many areas of CS, the same holds: namely, certain techniques take hold and root themselves, even if newer techniques come along with improvements: the argument again is that the focus is on the problem, rather than the technique, and so you need a dramatic improvement, or a technique snuck in via a new problem.

More generally, this goes to the fact that large parts of ML, and large parts of algorithms for that matter, are driven by techniques rather than problems, whereas in the outside "problem-centric" world, there's a much bigger focus on generalist solutions.

OiOi said...

I agree with Bob Capenter.
People have different values. I mean they will attach more weightage to a method/technique depending on how the method or technique fits into their overall scheme of things.
If the goal is a research paper then you could try and do a 'mashup' of sorts to increase performance from 99 to 99.5.

In practice I would prefer to use Peter Drucker's value scheme 'It is more important to do the right thing than doing the thing right'.So if this value is used project delivery would get 8 of 10. POS tagging of all varieties will probably get 3.
In one of your earlier blogs you talked about 'confusion matrices'? for POS tagging.
Question is how much marginal value (To the goals) does a
marginal improvement add?

Igor said...

Hal,

I'd have to agree with you but you have to look at it from a different perspective. If you are not in Machine Learning, how can one figure out which of these 25 dimension reduction techniques ( http://www.cs.unimaas.nl/l.vandermaaten/Laurens_van_der_Maaten/Matlab_Toolbox_for_Dimensionality_Reduction.html )
is the best for the type of problems you are most interested in. I can see why using PCA is a safe bet and would convince people in the field and reviewers that the finding to be published is not a by-product of the new algorithm used.

Igor.
http://nuit-blanche.blogspot.com

From French to Japanese said...

Your blog is great! Very informative! We will be sure to link to you!

Judy and Harold

Anonymous said...

Ultima Online Gold, UO Gold, crestingwait
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
buy uo gold
lotro gold
wow gold
warhammer gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
buy aoc gold
Age of Conan Gold, AOC Gold

. said...

酒店經紀PRETTY GIRL 台北酒店經紀人 ,禮服店 酒店兼差PRETTY GIRL酒店公關 酒店小姐 彩色爆米花酒店兼職,酒店工作 彩色爆米花酒店經紀, 酒店上班,酒店工作 PRETTY GIRL酒店喝酒酒店上班 彩色爆米花台北酒店酒店小姐 PRETTY GIRL酒店上班酒店打工PRETTY GIRL酒店打工酒店經紀 彩色爆米花

酒店上班請找艾葳 said...

艾葳酒店經紀公司提供專業的酒店經紀, 酒店上班小姐,八大行業,酒店兼職,傳播妹,或者想要打工兼差打工,兼差,八大行業,酒店兼職,想去酒店上班, 日式酒店,制服酒店,ktv酒店,禮服店,整天穿得水水漂漂的,還是想去制服店日領上班小姐,水水們如果想要擁有打工工作、晚上兼差工作兼差打工假日兼職兼職工作酒店兼差兼差打工兼差日領工作晚上兼差工作酒店工作酒店上班酒店打工兼職兼差兼差工作酒店上班等,想了解酒店相關工作特種行業內容,想兼職工作日領假日兼職兼差打工、或晚班兼職想擁有鋼琴酒吧又有保障的工作嗎???又可以現領請找專業又有保障的艾葳酒店經紀公司!

艾葳酒店經紀是合法的公司工作環境高雅時尚,無業績壓力,無脫秀無喝酒壓力,高層次會員制客源,工作輕鬆,可日領現領
一般的酒店經紀只會在水水們第一次上班和領薪水時出現而已,對水水們的上班安全一點保障都沒有!艾葳酒店經紀公司的水水們上班時全程媽咪作陪,不需擔心!只提供最優質的酒店上班,酒店上班,酒店打工環境、上班條件給水水們。心動嗎!? 趕快來填寫你的酒店上班履歷表

水水們妳有缺現領、有兼職缺錢便服店的煩腦嗎?想到日本留學缺錢嗎?妳是傳播妹??想要擁有高時薪又輕鬆的賺錢,酒店和,假日打工,假日兼職賺錢的機會嗎??想實現夢想卻又缺錢沒錢嗎!??
艾葳酒店台北酒店經紀招兵買馬!!徵專業的酒店打工,想要去酒店的水水,想要短期日領,酒店日領,禮服酒店,制服店,酒店經紀,ktv酒店,便服店,酒店工作,禮服店,酒店小姐,酒店經紀人,
等相關服務 幫您快速的實現您的夢想~!!

Adi said...

Oes Tsetnoc one of the ways in which we can learn seo besides Mengembalikan Jati Diri Bangsa. By participating in the Oes Tsetnoc or Mengembalikan Jati Diri Bangsa we can improve our seo skills. To find more information about Oest Tsetnoc please visit my Oes Tsetnoc pages. And to find more information about Mengembalikan Jati Diri Bangsa please visit my Mengembalikan Jati Diri Bangsa pages. Thank you So much.

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..
sesli sohbetsesli chatkamerali sohbetseslisohbetsesli sohbet sitelerisesli chat siteleriseslichatsesli sohpetseslisohbet.comsesli chatsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet
seslisohbetsesli sohbetkamerali sohbetsesli chatsesli sohbetkamerali sohbet

seldamuratim said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it to a few friends of mine that I know would enjoy reading..

sesli sohbet
seslisohbet
sesli chat
seslichat
sesli sohbet sitesi
sesli chat sitesi
sesli sohpet
kamerali sohbet
kamerali chat
webcam sohbet

DiSCo said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
kameralı sohbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
seslidunya
seslisehir
sesli sex

Sesli Chat said...

Really trustworthy blog. Please keep updating with great posts like this one. I have booked marked your site and am about to email it

to a few friends of mine that I know would enjoy reading..
seslisohbet
seslichat
sesli sohbet
sesli chat
sesli
sesli site
görünlütü sohbet
görüntülü chat
kameralı sohbet
kameralı chat
sesli sohbet siteleri
sesli chat siteleri
sesli muhabbet siteleri
görüntülü sohbet siteleri
görüntülü chat siteleri
görüntülü muhabbet siteleri
kameralı sohbet siteleri
kameralı chat siteleri
kameralı muhabbet siteleri
canlı sohbet
sesli muhabbet
görüntülü muhabbet
kameralı muhabbet
birsesver
birses
seslidunya
seslisehir
sesli sex