03 February 2008

The behemoth, PubMed

The friend I crashed with while attending SODA is someone I've known since we were five years old. (Incidentally, there's someone in the NLP world whom I've actually known from even earlier...small world.) Anyway, the friend I stayed with is just finishing med school at UCSF and will soon be staying there for residency. His specialty is neurosurgery, and his interests are in neural pathologies. He spent some time doing research on Alzheimer's disease, effectively by studying mice (there's something I feel sort of bad about finding slightly amusing about mice with Alzheimer's disease). Needless to say, in the process of doing research, he made nearly daily use of PubMed. (For those of you who don't know, PubMed is like the ACL anthology, but with hundreds of thousands of papers, with new ones being added by the truckload daily, and with a bunch of additional things, like ontologies and data sets.)

There are two things I want to talk about regarding PubMed. I think both of these admit very interesting problems that we, as NLPers, are qualified to tackle. I think the most important thing, however, is opening and maintaining a wide channel of communication. There seems to be less interaction than there should be between people who do (for instance) bio-medical informatics (we have a fairly large group here) and what I'll term mainstream NLPers. Sure, there have been BioNLP workshops at ACLs, but I really think that both communities would be well-served to interact more. And for those of you who don't want to work on BioNLP because it's "just a small domain problem", let me assure you: it is not easy... don't think of it in the same vein as a true "sublanguage" -- it is quite broad.

I suppose I should give a caveat that my comments below are based on a sample size of one (my friend), so it may not be totally representative. But I think it generalizes.

Search in PubMed, from what I've heard, is good in the same ways that web search is good and bad in the same ways that web search is bad. It is good when you know what you're looking for (i.e., you know the name for it) and bad otherwise. One of the most common sorts of queries that my friend wants to do is something like "show me all the research on proteins that interact in some way with XXX in the context of YYY" where XXX is (e.g.) a gene and YYY is (e.g.) a disease. The key is that we don't know which proteins these are and so it's hard to query for them directly. I know that this is something that the folks at Penn (and probably elsewhere) are working on, and I get the impression that a good solution to this problem would make lots and lots of biologists much happier (and more productive). One thing that was particularly interesting, however, is that he was pretty averse to using structured queries like the one I gave above. He effectively wants to search for "XXX YYY" and have it realize that XXX is a gene, YYY is a disease, and that it's "obvious" that what he wants is proteins that interact with (or even, for instance, pathways that contain) XXX in the context of disease YYY. On the other hand, if YYY were another gene, then he'd probably be looking for diseases or pathways that are regulated by both XXX and YYY. It's a bit complex, but I don't think this is something particularly beyond our means.
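To make the "XXX YYY" idea a bit more concrete, here is a minimal sketch of how the entity types of two bare query terms could select the implied information need. Everything in it is an assumption for illustration: the tiny lexicon, the function name `interpret`, and the two hard-coded rules are hypothetical, and a real system would have to draw its types from curated resources and cope with ambiguous names.

```python
# Hypothetical toy lexicon mapping lowercased terms to entity types.
# A real system would consult curated gene and disease databases instead.
ENTITY_TYPES = {
    "app": "gene",
    "psen1": "gene",
    "alzheimer's disease": "disease",
}


def interpret(term_a, term_b):
    """Guess the information need behind a two-term query from the term types.

    Returns a plain-English description of the implied structured query,
    or None when a term is unknown or no rule applies (in which case one
    would fall back to ordinary keyword search).
    """
    type_a = ENTITY_TYPES.get(term_a.lower())
    type_b = ENTITY_TYPES.get(term_b.lower())
    if type_a is None or type_b is None:
        return None
    # Rule 1: one gene and one disease, in either order.
    if {type_a, type_b} == {"gene", "disease"}:
        gene, disease = (term_a, term_b) if type_a == "gene" else (term_b, term_a)
        return (f"proteins or pathways that interact with {gene} "
                f"in the context of {disease}")
    # Rule 2: two genes.
    if type_a == type_b == "gene":
        return f"diseases or pathways regulated by both {term_a} and {term_b}"
    return None


print(interpret("APP", "Alzheimer's disease"))
print(interpret("APP", "PSEN1"))
```

The hard part, of course, is not the rule table but the typing step: deciding that a string is a gene or a disease in the first place.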

The other thing I want to talk about is summarization. PubMed actually archives a fairly substantial collection of human-written summaries. These fall into one of two categories. The first, called "systematic reviews", are more or less what we would think of as summaries. However, they are themselves quite long and complex. They're really not anything like sentence extracts. The second, called "meta analyses", are really not like summaries at all. In a meta analysis, an author will consider a handful of previously published papers on, say, the effects of smoking on lifespan. He will take the data and results published in these individual papers, and actually do novel statistical analyses on them to see how well the conclusions hold.

From a computational perspective, the automatic creation of meta analyses would essentially be impossible, until we have machines that can actually run experiments in the lab. "Systematic reviews", on the other hand, while totally outside the scope of our current technology, are things we could hope to do. And they give us lots of training data. There are somewhere around ten to forty thousand systematic reviews on PubMed, each about 20 pages long, and each with references back to papers, almost all of which are themselves in PubMed. Finding systematic reviews older than a few years ago (when they began being tagged explicitly) has actually sprouted a tiny cottage industry. And PubMed nicely makes all of their data available for download, without having to crawl, something that makes life much easier for us.
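For anyone who wants to poke at this data before committing to a bulk download, NCBI also exposes a query interface (E-utilities). Below is a small sketch that only builds an ESearch URL restricted to systematic reviews on a topic; it does not send the request. The helper name `systematic_review_query` is made up for illustration, and the `systematic review[pt]` search tag is my assumption about how these reviews are tagged, so check it against the current PubMed documentation before relying on it.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def systematic_review_query(topic, retmax=20):
    """Build (but do not send) an ESearch URL for systematic reviews on a topic.

    The publication-type tag in the search term is an assumption; older
    reviews may not carry an explicit tag at all.
    """
    params = {
        "db": "pubmed",
        "term": f"({topic}) AND systematic review[pt]",
        "retmax": retmax,      # maximum number of PubMed IDs to return
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


print(systematic_review_query("alzheimer disease", retmax=5))
```

Fetching the URL returns a list of PubMed IDs, which could then be used to pull the review records and, in turn, the papers they cite.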

My friend warns that it might not be a good idea to use all systematic reviews, but only those from top journals. (They tend to be less biased, and better written.) However, in so far as I don't think we'd even have hope of getting something as good as a systematic review from the worst journal in the world, I'm not sure this matters much. Maybe all it says is that for evaluation, we should be careful to evaluate against the top.

Now, I should point out that people in biomedical informatics have actually been working on the summarization problem too. From what I can tell, the majority of effort there is on rule-based systems that build on top of more rule-based systems that extract genes, pathways and relations. People at the National Library of Medicine, Rindflesch and Fiszman, use SemRep to do summarization, and they have tried applying it to some medical texts. Two other people that I know are doing this kind of work are Kathy McKeown and Greg Whalen, both at Columbia. The Columbia group has access to a medically informed NLP concept extractor called MedLEE, which gives them a leg up on the low-level processing details. If you search for 'summarization medical OR biomedical' in Google Scholar, you'll get a raft of hits (~9000).

Now, don't get me wrong -- I'm not saying that this is easy -- but for summarization folks who are constantly looking for "natural artifacts" of their craft, this is an enormous repository.

20 comments:

Jurgen Van Gael said...

Interesting post!

Sorry to go off-topic a bit but this whole pubmed business always gets me thinking about machine learning repositories. We have the UCI repository for datasets and lately we have the machine learning open source repository but I still have the feeling we are missing out on so much. Wouldn't it make sense for our community to have an arXiv or some other central location with easy access to any paper you could possibly care about. It should be a piece of cake to augment the repository with citation tracking, (blog-comment like) discussions, summaries, meta-analysis, datasets, software and probably many more things?

In the mean time, if I (ever?!?) publish, I'll certainly use a blog to do some of the above.

Fernando Pereira said...

Re your first comment (the biomed researcher wants to just type "XXX YYY"): I've heard the same from my biomed collaborators, and several of us at Penn have been working on an approach to giving them what they want, which is in review at the moment (fingers crossed).

james t said...

Very interesting post. This is another off-topic message, but does anyone here know anything about this new search engine ManagedQ.com?

They've got this really cool statistical noun phrase extraction algorithm built into their search engine. The coolest part is it actually finds relevant NPs. Think automatic tagging detection.

It looks like they've managed to incorporate reasonable stop word detection with some type of simple dictionary file to extract key phrases.

Try it: ManagedQ. I found it quite interesting.

Bob Carpenter said...

Bio-medical text is a great problem for many reasons. First, the research biologists are desperate for a solution. Second, there's lots of open source text and an unimaginable amount of structured data. Third, there are conference outlets and funding for the work.

In terms of summarization, geneticists often want to find out about 100 genes they found through differential micro-array or landscape experiments. Minimally, a system needs to (1) find all the mentions of each gene, and (2) somehow summarize sets of articles. There are existing tools to do this with ontologies, like Cytoscape and GeneSpring, but nothing that ties in with the literature.

MEDLINE is the NLM's repository of citations. Entrez is their top-level search application:

http://www.ncbi.nlm.nih.gov/sites/gquery

PubMed is the piece of this search application that searches MEDLINE citations.

MEDLINE contains more than text. It also contains hand-curated MeSH (medical subject heading) terminology links.

The rest of Entrez links extensively back into MEDLINE. For instance, check out Entrez Gene (from above link), which is a database indexed by gene and species. Each entry contains aliases, text descriptions of the gene, descriptions of the gene's function, links to related genes (either by homology or interaction), and links into GO (the Gene Ontology, which contains an extensive multiple inheritance hierarchy for genetic function, process and location).

Many biologists are tracking specific diseases, in which case you'll want to check out OMIM (Online Mendelian Inheritance in Man), which is a catalogue of diseases and genes, with many gene variants listed. The beauty of OMIM is that it's text with citations back into MEDLINE.

You'll also need KEGG (Kyoto Encyclopedia of Genes and Genomes, a set of "disease graphs" with known pathways and sub-pathways):

http://www.genome.jp/kegg/

And don't forget PubMed Central -- the repository of full text articles (it's part of Entrez).

The search problem's quite a bit more complex than gene plus disease, but limiting to that, it's still nearly impossible even if you know the curated names of the diseases and genes. Too many of them have common names, like ACT or TO, to use simple techniques for search. The listed sets of aliases are woefully incomplete. Context is really critical.
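To illustrate why context matters for symbols like TO, here is a deliberately naive sketch that accepts a mention only when the exact-case symbol appears near a biology cue word. The cue list, the window size, and the function name `likely_gene_mentions` are all assumptions invented for this example; a real tagger would need far richer context than a handful of keywords.

```python
import re

# Hypothetical cue words suggesting that a nearby token is a gene mention.
CONTEXT_CUES = {"gene", "protein", "expression", "mutation", "encodes"}


def likely_gene_mentions(text, symbol, window=5):
    """Return token positions where `symbol` plausibly refers to a gene.

    Matching is case-sensitive, so the common word "to" never matches the
    symbol "TO". A match counts only if a cue word occurs within `window`
    tokens on either side.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok != symbol:
            continue
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if any(c.lower() in CONTEXT_CUES for c in context):
            hits.append(i)
    return hits


print(likely_gene_mentions("We went to the lab. Mutation of TO was observed in mice.", "TO"))
print(likely_gene_mentions("Send it TO the office", "TO"))
```

Even this toy version shows the trade-off: tighten the cues and you miss real mentions, loosen them and the common-word readings flood back in.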

The other big problem is genericity. Gene names and disease names are more like product names than human names. They refer to families, variations, and so on.

Very often papers are about whole families of genes. For example, "the EXT family" contains EXT1, EXT2, EXTL1, EXTL2 and EXTL3. Then there are subparts (e.g. conserved regions under homologies or protein motifs), homologues in other species (e.g. mouse ext1 [they like lower case]), mutated forms (most often described rather than named). Oh, and then there's the metonymy problem -- genes and the proteins they produce tend to have the same names.
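A rough sketch of the family problem, using only the EXT example above: a query for the family should expand to its members, and a lower-case mouse symbol should map back to the same family. The table and the helper names `expand` and `family_of` are hypothetical; subparts, described mutations and the gene/protein metonymy are not handled at all.

```python
# Family table built from the EXT example; a real table would be curated.
FAMILIES = {
    "EXT": ["EXT1", "EXT2", "EXTL1", "EXTL2", "EXTL3"],
}


def expand(term):
    """Expand a family name to its member symbols; other terms pass through."""
    key = term.upper()
    return FAMILIES.get(key, [key])


def family_of(symbol):
    """Return the family containing `symbol`, ignoring case, or None."""
    key = symbol.upper()
    for family, members in FAMILIES.items():
        if key in members:
            return family
    return None


print(expand("EXT"))
print(family_of("ext1"))
```

Case folding is itself a lossy choice here, since it throws away the species hint that the lower-case spelling carries.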

Mark Dredze said...

To echo Bob's post, there are a truckload of bio text resources of all shapes and kinds. Many learning methods and techniques can find a relevant task in the vast array of these resources. I did a project with bio related search a few years back and was overwhelmed by the size and depth of these materials.

Also, like Hal said, people use this stuff. With a lot of NLP problems, the goal is to provide better tools to the average user. However, in bio these aren't "average" users, these are bio researchers, doctors, academics, etc. They are accustomed to research and use state of the art tools all the time. I think it's easier for this group to take, use and understand newer technologies.

babaluma said...

There is a recent Web application, Semantic MEDLINE, which attempts to help the researcher answer the type of question you mention in the post. It is based on SemRep technology, with automatic abstraction summarization on top of it.
It currently works only with canned queries, but will eventually work with any query (once we finish processing all of MEDLINE with SemRep).

YoYoYo said...

Along the lines of "Semantic Medline" mentioned above, we are working on mapping drug-disease relationships to find effective treatments at CureHunter.com.

We also have an interactive visual medical dictionary tech demo up too.

For example, a search for "obesity" shows a network linking "insulin" and "exercise" therapy.

Please give it a try if you get the chance.
