Pharma IT: NEW TECHNOLOGIES
A variety of computational techniques can be used to evaluate more literature than is humanly possible and present a concise overview of the most important concepts in a given biomedical topic. There is a high demand these days for approaches that can accelerate the annotation and summarization of concepts related to genes, proteins, and other entities in the biomedical sciences.
Unfortunately, these algorithms struggle with the ambiguity of many of the terms found in the literature. In addition, large chunks of relevant text are kept securely behind the firewalls of traditional publishers and are not easily mined. It is impossible to rely on computational text mining as a sole and definitive source for facts. Consequently, human intervention is necessary. The collaborative effort of millions of bright minds, each with specific expertise, could capitalize on the input of the entire scientific community. Evidence from the immensely successful Wikipedia concept shows that even the rather embryonic editing software of the current Wiki platform is sufficient, in principle, to offer a forum capable of fostering a valuable knowledge repository.1
Fusing Two New Technologies
People need good reasons to adopt new technologies, so the marriage of two new technologies from very different fields, computational text analysis (via Knowlet space, discussed below) and the relational Wiki database environment, needs to present a very clear added-value proposition to attract the prospective user. Enter Wiki for Professionals, which introduces an environment where scientists can combine online knowledge discovery with annotation. Scientists will recognize the site's computational text analysis methods as tools to evaluate literature and distill pertinent information. In addition, continuous analysis of the advancing concept space in published biomedical literature quickly alerts experts to the latest changes in their fields of interest.
Knowlet technology, provided to the Wiki environment by Knewco, Inc., allows for the construction of a "dynamic ontology" that can reflect changes in the concept space on an almost real-time basis. Existing text mining methodologies extract functional information by document classification based on the relationships among a limited set of gene ontology (GO) terms and manually assigned medical subject headings; by associating literature with proteins and GO terms using a dictionary; or by sequence homology clustering with a GO-annotated protein for the transfer of GO terms.2-4
In contrast, Knowlet technology mines the relationships among the biological entities, as found in the Unified Medical Language System (UMLS) and UniProtKB/Swiss-Prot (UP/SP) databases, from the entire MedLine literature database to produce a condensed version. This is an impossible task for a single scientist or even for a larger dedicated annotation team. Each identified biological entity (i.e., source concept, see Figure 1, below) has a defined number of relationships with a number of target concepts. Every concept-to-concept relationship is characterized by a value derived from three main categories of relationships: factual (F) statements found in scientific databases, the co-occurrence (C) of two concepts in a text, and an associative (A) parameter based on the conceptual overlap of the two concepts.
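As a minimal sketch of how such a three-part relationship value might be represented, the class below combines the factual (F), co-occurrence (C), and associative (A) categories into a single score. The field names, concept identifiers, and weights are illustrative assumptions, not the actual Knowlet implementation.

```python
from dataclasses import dataclass

@dataclass
class ConceptRelation:
    """Sketch of a Knowlet-style source-to-target relationship,
    scored on three categories of evidence (illustrative only)."""
    source: str          # e.g., a UP/SP or UMLS concept identifier
    target: str
    factual: float       # F: statement found in a curated database
    cooccurrence: float  # C: the two concepts co-occur in MedLine text
    associative: float   # A: indirect conceptual overlap of the two profiles

    def strength(self, wf: float = 1.0, wc: float = 0.5, wa: float = 0.25) -> float:
        # Weighted combination of the three evidence categories;
        # the weights here are placeholders, not published values.
        return wf * self.factual + wc * self.cooccurrence + wa * self.associative

rel = ConceptRelation("P04637", "C0007621", factual=1.0, cooccurrence=0.8, associative=0.3)
print(round(rel.strength(), 3))  # 1.475
```

Weighting factual evidence above co-occurrence, and co-occurrence above association, reflects the article's ordering of evidence quality; any real system would tune such weights empirically.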
Strategic Starting Point
The annotation component, the WikiProteins environment, is jump-started by the import of authoritative sources.5 Selected records from UMLS, UP/SP, GO, and IntAct have already been imported. Authoritative sources are "read-only" in the Wiki, meaning the entries can't be edited, and community annotation will be supplementary, performed on copies of the original records. Scientists will be able to add new textual commentary or edit existing commentary, for example to nuance a description or to document newly observed biological functions. It will also be possible to establish links between two Wiki records to illustrate pertinent relationships.
All modifications are attributed to the participating scientist, and certain granting agencies have indicated that they are interested in using the resulting contributions as indices of scholarly achievement. Eventually, the most mature stage of the community data can be combined with the latest version of the authoritative source data, and acknowledgment will be given for the scientific contribution. Gradually, more authoritative sources will be added. In principle, all high quality resources that describe interactions between biologically meaningful concepts can benefit from inclusion in this environment, which would expand their information base via the input of experts in the community.
The Knowlet of individual concepts-and thus the entire Knowlet space of over one million concepts from UMLS, 250,000 proteins from UP/SP, 24,000 GO terms, and seven million abstracts from MedLine (see Figure 2, above)-will be upgraded with an ever-growing number of relationships, derived both from new publications and from annotations in the Wiki. Each new authoritative resource also enriches the Knowlet space. As seen in a proteomics study of the nucleolus, the aggregated conceptual information from MedLine and GO alone produced correct predictions of other proteins related to this biological system. The inclusion of more resources in the Knowlet space will only enhance and refine predictions.6
Semantic support software for the Knowlet technology will regularly index the changes in the Wiki environment and will send alerts concerning new co-occurrences between the Wiki article's source concept and new target concepts added to the Wiki page. Algorithms will be progressively implemented to predict more precisely the factual information from co-occurrence data. These data will then be published in the Wiki for community fine-tuning and peer review, which will expand the Wiki as a database of biomedical facts.
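The co-occurrence alerting described above can be sketched as a simple scan of an updated Wiki page: find dictionary terms that now appear alongside the page's source concept but are not yet recorded as targets in its Knowlet. The function name, dictionary, and example text are all hypothetical.

```python
import re

def find_new_cooccurrences(page_text, source_concept, known_targets, dictionary):
    """Return dictionary terms that co-occur with the source concept
    in the updated page text but are not yet known Knowlet targets
    (a simplified sketch, not the actual semantic support software)."""
    found = set()
    lowered = page_text.lower()
    if source_concept.lower() not in lowered:
        return found  # source concept not mentioned; no pairs to report
    for term in dictionary:
        if term == source_concept or term in known_targets:
            continue
        # Whole-word, case-insensitive match of the candidate target term
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered):
            found.add(term)
    return found

text = "TP53 regulates apoptosis and interacts with MDM2 in the nucleolus."
alerts = find_new_cooccurrences(text, "TP53", {"apoptosis"},
                                ["TP53", "apoptosis", "MDM2", "nucleolus"])
print(sorted(alerts))  # ['MDM2', 'nucleolus']
```

Each alert would then be queued for the community fine-tuning and peer review the article describes, so that a mere co-occurrence can be promoted to (or rejected as) a factual relationship.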
Only about 40% of the protein-protein interactions documented in databases can be found as co-occurring concepts in MedLine abstracts, indicating that the majority of interactions are cited only in the body of the full-text article (according to A. Botelho Bovo, personal communication). The Wiki will allow authors to enter the established relationships between biomedical entities not found in abstracts, and it will let them receive recognition for their findings (see Figure 3, left).
Dr. Chichester is a post-doctoral researcher and Dr. Mons is an associate professor, both in the department of human and clinical genetics, Leiden University Medical Center, The Netherlands. Contact them at
1. Giles J. Internet encyclopaedias go head to head. Nature. 2005;438(7070):900-901.
2. Raychaudhuri S, Chang JT, Sutphin PD, et al. Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res. 2002;12(1):203-214.
3. Koike A, Niwa Y, Takagi T. Automatic extraction of gene/protein biological functions from biomedical text. Bioinformatics. 2005;21(7):1227-1236. Epub 2004 Oct 27.
4. Xie H, Wasserman A, Levine Z, et al. Large-scale protein annotation through gene ontology. Genome Res. 2002;12(5):785-794.
5. Giles J. Key biology databases go wiki. Nature. 2007;445(7129):691.
6. Schuemie M, Chichester C, Lisacek F, et al. Assignment of protein function and discovery of novel nucleolar proteins based on automatic analysis of MEDLINE. Proteomics. 2007;7(6):921-931.