03 February 2008

The behemoth, PubMed

The friend I crashed with while attending SODA is someone I've known since we were five years old. (Incidentally, there's someone in the NLP world whom I've known for even longer...small world.) Anyway, the friend I stayed with is just finishing med school at UCSF and will soon be staying there for residency. His specialty is neurosurgery, and his interests are in neural pathologies. He spent some time doing research on Alzheimer's disease, effectively by studying mice (there's something I feel sort of bad about finding slightly amusing about mice with Alzheimer's disease). Needless to say, in the process of doing research, he made nearly daily use of PubMed. (For those of you who don't know, PubMed is like the ACL anthology, but with hundreds of thousands of papers, with new ones being added by the truckload daily, and with a bunch of additional things, like ontologies and data sets.)

There are two things I want to talk about regarding PubMed. I think both of these admit very interesting problems that we, as NLPers, are qualified to tackle. I think the most important thing, however, is opening and maintaining a wide channel of communication. There seems to be less interaction between people who do (for instance) bio-medical informatics (we have a fairly large group here) and what I'll term as mainstream NLPers. Sure, there have been BioNLP workshops at ACLs, but I really think that both communities would be well-served to interact more. And for those of you who don't want to work on BioNLP because it's "just a small domain problem", let me assure you: it is not easy... don't think of it in the same vein as a true "sublanguage" -- it is quite broad.

I suppose I should give a caveat that my comments below are based on a sample size of one (my friend), so it may not be totally representative. But I think it generalizes.

Search in PubMed, from what I've heard, is good in the same ways that web search is good and bad in the same ways that web search is bad. It is good when you know what you're looking for (i.e., you know the name for it) and bad otherwise. One of the most common sorts of queries that my friend wants to do is something like "show me all the research on proteins that interact in some way with XXX in the context of YYY" where XXX is (e.g.) a gene and YYY is (e.g.) a disease. The key is that we don't know which proteins these are, and so it's hard to query for them directly. I know that this is something that the folks at Penn (and probably elsewhere) are working on, and I get the impression that a good solution to this problem would make lots and lots of biologists much happier (and more productive). One thing that was particularly interesting, however, is that he was pretty averse to using structured queries like the one I gave above. He effectively wants to search for "XXX YYY" and have it realize that XXX is a gene, YYY is a disease, and that it's "obvious" that what he wants is proteins that interact with (or even, for instance, pathways that contain) XXX in the context of disease YYY. On the other hand, if YYY were another gene, then he'd probably be looking for diseases or pathways that are regulated by both XXX and YYY. It's a bit complex, but I don't think this is something particularly beyond our means.
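To make the "XXX YYY" idea a bit more concrete, here is a toy sketch of the type-driven interpretation described above: look up what kind of thing each term is, then pick the implied question. Everything in it is an assumption for illustration. The tiny lexicons, the `entity_type` and `interpret` helpers, and the relation labels are hypothetical, and a real system would need proper entity recognition rather than set lookups.

```python
# Toy lexicons (illustrative only); a real system would draw on
# curated resources and context-sensitive entity recognition.
GENES = {"APP", "PSEN1", "APOE"}
DISEASES = {"alzheimer's disease", "parkinson's disease"}


def entity_type(term):
    """Guess whether a query term names a gene, a disease, or neither."""
    if term.upper() in GENES:
        return "gene"
    if term.lower() in DISEASES:
        return "disease"
    return "unknown"


def interpret(term_a, term_b):
    """Map an unstructured two-term query to the question it implies."""
    types = (entity_type(term_a), entity_type(term_b))
    if types == ("gene", "disease"):
        # Proteins (or pathways) that interact with the gene, in the
        # context of the disease.
        return {"find": ["proteins", "pathways"],
                "relation": "interacts_with",
                "anchor": term_a,
                "context": term_b}
    if types == ("disease", "gene"):
        # Same question; the user just typed the terms in the other order.
        return interpret(term_b, term_a)
    if types == ("gene", "gene"):
        # Diseases or pathways regulated by both genes.
        return {"find": ["diseases", "pathways"],
                "relation": "regulated_by_both",
                "anchor": [term_a, term_b],
                "context": None}
    # Unknown types: fall back to ordinary keyword search.
    return {"find": ["documents"],
            "relation": "keyword_match",
            "anchor": [term_a, term_b],
            "context": None}


if __name__ == "__main__":
    print(interpret("APP", "Alzheimer's disease"))
    print(interpret("APP", "PSEN1"))
```

The hard part, of course, is not the dispatch table but deciding reliably that a given string is a gene or a disease in the first place.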

The other thing I want to talk about is summarization. PubMed actually archives a fairly substantial collection of human-written summaries. These fall into one of two categories. The first, called "systematic reviews", are more or less what we would think of as summaries. However, they are themselves quite long and complex. They're really not anything like sentence extracts. The second, called "meta analyses", are really not like summaries at all. In a meta analysis, an author will consider a handful of previously published papers on, say, the effects of smoking on lifespan. He will take the data and results published in these individual papers, and actually do novel statistical analyses on them to see how well the conclusions hold.
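For a rough sense of what such a statistical analysis can look like, here is a minimal sketch of one standard textbook technique, fixed-effect inverse-variance pooling, which combines per-study effect estimates into a single estimate weighted by precision. This is only an assumed example: real meta analyses use a range of methods, and the function name and the numbers in the usage lines are made up for illustration.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Combine per-study effect estimates by inverse-variance weighting.

    Each study is weighted by 1 / SE^2, so more precise studies
    count for more in the pooled estimate.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    # Made-up effect sizes and standard errors from three hypothetical studies.
    effect, se = pool_fixed_effect([0.30, 0.10, 0.25], [0.10, 0.20, 0.15])
    print(f"pooled effect = {effect:.3f}, standard error = {se:.3f}")
```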

From a computational perspective, the automatic creation of meta analyses would essentially be impossible, until we have machines that can actually run experiments in the lab. "Systematic reviews", on the other hand, while totally outside the scope of our current technology, are things we could hope to do. And they give us lots of training data. There are somewhere around ten to forty thousand systematic reviews on PubMed, each about 20 pages long, and each with references back to papers, almost all of which are themselves in PubMed. Finding systematic reviews from more than a few years ago (before they began being tagged explicitly) has actually sprouted a tiny cottage industry. And PubMed nicely makes all of their data available for download, without having to crawl, something that makes life much easier for us.

My friend warns that it might not be a good idea to use all systematic reviews, but only those from top journals. (They tend to be less biased, and better written.) However, insofar as I don't think we'd even have a hope of getting something as good as a systematic review from the worst journal in the world, I'm not sure this matters much. Maybe all it says is that for evaluation, we should be careful to evaluate against the top.

Now, I should point out that people in biomedical informatics have actually been working on the summarization problem too. From what I can tell, the majority of effort there is on rule-based systems that build on top of more rule-based systems that extract genes, pathways and relations. People at the National Library of Medicine, Rindflesch and Fiszman, use SemRep to do summarization, and they have tried applying it to some medical texts. Two other people that I know are doing this kind of work are Kathy McKeown and Greg Whalen, both at Columbia. The Columbia group has access to a medically informed NLP concept extractor called MedLEE, which gives them a leg up on the low-level processing details. If you search for 'summarization medical OR biomedical' in Google Scholar, you'll get a raft of hits (~9000).

Now, don't get me wrong -- I'm not saying that this is easy -- but for summarization folks who are constantly looking for "natural artifacts" of their craft, this is an enormous repository.


  1. Interesting post!

    Sorry to go off-topic a bit, but this whole PubMed business always gets me thinking about machine learning repositories. We have the UCI repository for datasets and lately we have the machine learning open source repository, but I still have the feeling we are missing out on so much. Wouldn't it make sense for our community to have an arXiv or some other central location with easy access to any paper you could possibly care about? It should be a piece of cake to augment the repository with citation tracking, (blog-comment-like) discussions, summaries, meta-analyses, datasets, software and probably many more things.

    In the meantime, if I (ever?!?) publish, I'll certainly use a blog to do some of the above.

  2. Re your first comment (the biomed researcher wants to just type "XXX YYY"): I've heard the same from my biomed collaborators, and several of us at Penn have been working on an approach to giving them what they want, which is in review at the moment (cross fingers).

  3. Very interesting post. This is another off-topic message, but does anyone here know anything about this new search engine ManagedQ.com?

    They've got this really cool statistical noun phrase extraction algorithm built into their search engine. The coolest part is it actually finds relevant NPs. Think automatic tagging detection.

    It looks like they've managed to incorporate reasonable stop word detection with some type of simple dictionary file to extract key phrases.

    Try it: ManagedQ. I found it quite interesting.

  4. Bio-medical text is a great problem for many reasons. First, the research biologists are desperate for a solution. Second, there's lots of open source text and an unimaginable amount of structured data. Third, there are conference outlets and funding for the work.

    In terms of summarization, geneticists often want to find out about 100 genes they found through differential micro-array or landscape experiments. Minimally, a system needs to (1) find all the mentions of each gene, and (2) somehow summarize sets of articles. There are existing tools to do this with ontologies, like Cytoscape and GeneSpring, but nothing that ties in with the literature.

    MEDLINE is the NLM's repository of citations. Entrez is their top-level search application:


    PubMed is the piece of this search application that searches MEDLINE citations.

    MEDLINE contains more than text. It also contains hand-curated MeSH (medical subject heading) terminology links.

    The rest of Entrez links extensively back into MEDLINE. For instance, check out Entrez Gene (from above link), which is a database indexed by gene and species. Each entry contains aliases, text descriptions of the gene, descriptions of the gene's function, links to related genes (either by homology or interaction), and links into GO (the Gene Ontology, which contains an extensive multiple inheritance hierarchy for genetic function, process and location).

    Many biologists are tracking specific diseases, in which case you'll want to check out OMIM (Online Mendelian Inheritance in Man), which is a catalogue of diseases and genes, with many gene variants listed. The beauty of OMIM is that it's text with citations back into MEDLINE.

    You'll also need KEGG (Kyoto Encyclopedia of Genes and Genomes, a set of "disease graphs" with known pathways and sub-pathways):


    And don't forget PubMed Central -- the repository of full text articles (it's part of Entrez).

    The search problem's quite a bit more complex than gene plus disease, but limiting to that, it's still nearly impossible even if you know the curated names of the diseases and genes. Too many of them have common names, like ACT or TO, to use simple techniques for search. The listed sets of aliases are woefully incomplete. Context is really critical.

    The other big problem is genericity. Gene names and disease names are more like product names than human names. They refer to families, variations, and so on.

    Very often papers are about whole families of genes. For example, "the EXT family" contains EXT1, EXT2, EXTL1, EXTL2 and EXTL3. Then there are subparts (e.g. conserved regions under homologies or protein motifs), homologues in other species (e.g. mouse ext1 [they like lower case]), mutated forms (most often described rather than named). Oh, and then there's the metonymy problem -- genes and the proteins they produce tend to have the same names.

  5. To echo Bob's post, there are a truckload of bio text resources of all shapes and kinds. Many learning methods and techniques can find a relevant task in the vast array of these resources. I did a project with bio related search a few years back and was overwhelmed by the size and depth of these materials.

    Also, like Hal said, people use this stuff. With a lot of NLP problems, the goal is to provide better tools to the average user. However, in bio these aren't "average" users; these are bio researchers, doctors, academics, etc. They are accustomed to research and use state-of-the-art tools all the time. I think it's easier for this group to take, use and understand newer technologies.

  6. This comment has been removed by the author.

  7. There is a recent Web application, Semantic MEDLINE, which attempts to help the researcher answer the type of question you mention in the post. It is based on SemRep technology, with automatic abstraction summarization on top of it.
    It currently works only with canned queries, but will eventually work with any query (once we finish processing the entire MEDLINE with SemRep).

  8. Along the lines of "Semantic Medline" mentioned above, we are working on mapping drug-disease relationships to find effective treatments at CureHunter.com.

    We also have an interactive visual medical dictionary tech demo up too.

    For example, a search for "obesity" shows a network linking "insulin" and "exercise" therapy.

    Please give it a try if you get the chance.

  9. This comment has been removed by the author.
