Chapter 1. Introduction
The Exponential Growth of Biomedical Research Data
The current capabilities of our biomedical research enterprise, exemplified by the
completion of the Human Genome Project, enable researchers to quickly and routinely
survey the contents of entire molecular and cellular systems. This capability is generating
a revolution in biomedical research in various profound ways. One significant change is
the availability of staggering amounts of genomic and functional genomic data gathered
at a whole genome or whole cell scale. As a result of such tremendous technological
breakthroughs, the challenge for biomedical research is being shifted from experimental
data generation to the organization, curation and interpretation of these data (Lander ES
et al, 2001; Meldrum D et al, 2000).
Biomedical research literature can be considered to be a knowledgebase that
represents the most complete record of our research enterprise. Reflecting the geometric
growth of available experimental data, the publication rate in biomedicine is also
increasing exponentially. There are currently more than 17 million biomedical articles
already represented in the National Library of Medicine’s biomedical literature database
MEDLINE, including more than 3 million articles published within the last 5 years alone and
2,000 per day in 2006 (Hunter L et al, 2006; MEDLINE). Keeping abreast of this large
and ever-expanding body of information, and tracking and using what is relevant to their
interests, is increasingly daunting for researchers, especially for new investigators. For
example, neuroblastoma is a common pediatric tumor but is considered
to be quite rare overall, with approximately 600 new cases diagnosed in the US each
year. However, there are almost 25,000 research articles describing neuroblastoma,
making it virtually impossible for a new investigator to systematically assess historical
research on this topic.
Furthermore, researchers increasingly need to engage with
research fields outside their core competence. The commonly used PubMed system,
which provides a convenient query interface for MEDLINE, offers keyword search
and some concept mapping to help researchers narrow down the information they are
looking for (PubMed). However, it lacks the precision (positive predictive
value), recall (sensitivity), granularity, and relevance ranking capabilities that many
typical but complex research queries require. One of the most common demands that
general-purpose systems such as PubMed fail to satisfy is the ability to extract and
compile specific knowledge or facts out of literature records. For example, there is no
provision in PubMed-like systems to determine which genes have been studied thus far in
relation to a certain type of malignancy, other than to read through the set of articles
identified by PubMed using keywords defining the concepts “gene” and “cancer” (or the
type of cancer of interest), and then identifying the particular genes one article at a time.
As the literature grows exponentially, this process will become not only more
time-consuming but also less reliable at identifying the right articles. Consequently, the gap
between what is recognized and what is currently known is widening (Wren JD et al,
2004). Biomedical text mining techniques can help researchers meet this challenge by
developing automated systems to extract the relevant information out of the text and
organize it into a structured knowledgebase.
Data Integration Opportunities in Cancer Research
The general challenge of biomedical literature knowledge extraction is
compounded in cancer research, where there is an acute need to more systematically identify
linkages between genomic data and malignant phenotypes. Characterization of the
molecular aberrations responsible for the onset and progression of malignancy is a major
goal for cancer researchers, and genomic components of the aberrations, ranging from
base pair variance to chromosome deletion, are crucial determinants in this regard.
Despite the existence of some locus-, mutation- and disease-specific resources, there is
currently no central cancer knowledge database in the public domain integrating genomic
findings with phenotypic observations of tumors (Cairns J et al, 2000; Freimer N et al,
2003). While high-throughput screening efforts increasingly allow researchers to identify
genome-wide mutational profiles for specific tumors, this information is largely diffusely
distributed and is mostly catalogued in a semi-structured manner throughout the
biomedical literature. Such decentralization is holding back the efforts towards making
rapid and comprehensive inferences of the genomic basis of malignancy onset and
progression in a manner that incorporates cumulative knowledge. Ideally, researchers and
clinicians would likely benefit from a comprehensive cancer knowledgebase that
consolidates experimental work (genome-level investigation), clinical observations
(descriptions of phenotype) and patient outcome (efficacy of treatment). Because the
biomedical literature represents a large proportion of this information, which is both
critically reviewed and eventually objective in its presentation of cancer research
information, means for more adequately extracting, normalizing and relating such diverse
collections of information in literature are crucial to solving this data integration problem
in cancer research.
Named Entity Recognition
Text mining technology has been increasingly
applied in biomedical research to help meet the above-mentioned challenges.
There have been significant efforts from both computational linguists and
bioinformaticists within the past 5 years to develop automated biomedical text mining
(BTM) systems (Jensen LJ et al, 2006). BTM tasks include named entity recognition
(NER), information extraction (IE), document retrieval (DR), and literature-based
discovery (LBD). NER, which serves as the basis for most other BTM undertakings, is
the process of identifying mentions of biomedical entities (objects, such as genes and
diseases) in the text. Named entity recognition can at first appear deceptively straightforward,
but it has emerged as a challenging and considerable task in BTM research. NER
begins with the classification and definition of biomedical entities, which easily
consumes a tremendous amount of effort because of the complexity and lack of
standardization of biomedical entities.
The process of identifying references to biomedical objects in text is usually split
into two steps: the identification of mentions of specific entity instances in text, such as
“the p53 gene” or “acute lymphoblastic leukemia”; and the assignment of these mentions
to a standard referent (normalization), such as classifying “the p53 gene” as a mention of
the official gene symbol “TP53”, or “ALL” as “acute lymphoblastic leukemia”. Many
biomedical entities either lack controlled vocabularies that can act as sufficient
nomenclature standards, or the instances in text are not expressed with the standards due
to historical reasons. Therefore, normalization is absolutely necessary for equating entity
values as appropriate, or placing values into a hierarchical or ontological framework (e.g.,
“ALL” as a form of “leukemia”). Much BTM research to date has focused upon molecular
entities that tend to be more discretely definable, such as genes and protein-protein
interactions, than phenotypic entities, which are harder to classify semantically
(BioCreAtIvE; McDonald R et al, 2005; Settles BA 2005; Zhou G et al, 2005).
NER methods include both rule-based and machine-learning approaches. Rule-
based approaches use sets of “rules”, alone or in combination, that pre-state signature
grammatical and especially character and word-based patterns within a string of text
being considered, and then return Boolean values as an output. For example, a rule to
identify a gene name could be “This word is a gene if it contains the consecutive letters
‘KIAA’, all of which are capitalized”. There can be some allowance for lexical
variations, such as capitalization, stemming, or punctuation, and some or all rules might
compare the text being considered to a term list, such as a pre-compiled list of known
tumor types. However, the performance of the approach cannot rely on the completeness of
the dictionary-type list in terms of both depth (the completeness of the entity unique
identifiers) and breadth (the completeness of the synonyms for each unique identifier)
because for most biomedical entities, the term lists are always changing and never
complete. For complexly formulated text, rule-based approaches typically require
considerable thought and exquisite biological knowledge. Advantages of this approach
are relatively high precision without the requirement for generating extensive training
material. However, disadvantages include high false negative rates, a performance
plateau that is increasingly difficult to overcome, and, for complex and heterogeneous
text, a tendency to generate low recall. Most first-generation systems and many domain-
focused current systems utilize rule-based approaches; when coupled with a term list, this
approach accomplishes both steps of the overall NER task at one time. However, rule-
based systems have enjoyed only modest success for biomedical applications, likely
because their performances have plateaued below rates acceptable for wide use by
researchers, or their application domains have been overly narrow (Hanisch D et al,
2005; Fundel K et al, 2005; Chang JT et al, 2004; Finkel J et al, 2005).
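The two kinds of rule described above, a character-pattern rule and a term-list lookup, can be sketched in a few lines. This is a minimal illustration rather than any published system: the function names and the tiny term list are hypothetical, and the only rule encoded is the "KIAA" example given in the text.

```python
import re

# Hypothetical term list; a real list would be far larger and still incomplete.
TUMOR_TERMS = {"neuroblastoma", "acute lymphoblastic leukemia"}

# Rule from the text: a word is a gene name if it contains the
# consecutive capital letters "KIAA" (the match is case-sensitive).
KIAA_RULE = re.compile(r"KIAA")


def is_gene(word):
    """Boolean rule: True if the word contains the capitalized string 'KIAA'."""
    return KIAA_RULE.search(word) is not None


def find_tumor_terms(text):
    """Term-list lookup that tolerates capitalization variants only."""
    lowered = text.lower()
    return sorted(term for term in TUMOR_TERMS if term in lowered)


gene_hit = is_gene("KIAA0101")
gene_miss = is_gene("TP53")
tumors = find_tumor_terms("Patients with Neuroblastoma were studied.")
```

The rule accepts "KIAA0101" but rejects "TP53", which is a false negative of the kind discussed above: a single precise rule offers high precision but misses every gene name that falls outside its pattern.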
Given the limitations of rule-based systems, a number of machine-learning
algorithms have been applied to improve the first step of the NER task. Generally, these
algorithms consider and then define sets of features within and surrounding entity
mentions that co-associate with the mentions. These can include orthographic features of
the text (e.g., suffixes, particular sequential combinations of characters or words,
capitalization patterns, etc.) and domain-specific features (e.g., term lists). For example,
the suffix “-ase” usually indicates a protein name, and the noun phrase immediately
preceding the word “gene” is often a gene name. Machine-learning approaches have
several advantages: at their purest, they require no domain knowledge; they can consider
thousands or millions of features simultaneously; they can provide confidence scores for
predictions; and they can consider the entire feature space simultaneously. However, the
success of machine-learning approaches is dependent upon two critical and costly factors.
First, ML systems depend on the availability, quality, and representativeness of a set of
manually generated training material from which to “learn” features, a process that
requires considerable effort and does not generalize effectively. Second, the most
effective systems incorporate biological knowledge, either in the form of domain-specific
rules or as domain-specific features (such as specialized
lexicons), which is likewise costly to implement (McDonald R et al, 2004; Collier N et al,
2000; Tanabe L et al, 2002).
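As an illustration of the orthographic and contextual features mentioned above, the following sketch builds a small feature dictionary for each token. The feature names and the example sentence are assumptions made for this illustration; a real system would use many more features and would pass them to a learning algorithm, which is not shown here.

```python
def token_features(tokens, i):
    """Orthographic and contextual features for the token at position i."""
    word = tokens[i]
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        # The suffix "-ase" often indicates a protein name.
        "ends_with_ase": word.lower().endswith("ase"),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_all_caps": word.isupper(),
        # A token directly before the word "gene" is often a gene name.
        "next_is_gene": next_word == "gene",
    }


# Illustrative sentence, already split into tokens.
tokens = ["The", "p53", "gene", "encodes", "a", "kinase"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Each dictionary describes one token, so "p53" receives `has_digit` and `next_is_gene`, while "kinase" receives `ends_with_ase`. The learning algorithm, not the developer, then decides how much weight each feature deserves.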
It is critical that humans establish gold-standard examples before machines
can learn from them. To reduce annotation ambiguity and disagreement, it is
crucial to define the target biomedical entities explicitly. Currently, most NER
systems adopt some version of pre-established conceptual definitions, which annotators
may apply with very different standards. We have instead invested considerable
effort in an iterative annotation process to develop literature-based definitions that draw
both the conceptual and textual boundaries.
Step 2 work (normalization) is syntactically easier since the identification of
textual boundaries is not necessary. However, it poses significant semantic challenges,
because non-unique synonyms have to be disambiguated to determine the intended referent.
In addition, a comprehensive thesaurus-like dictionary is necessary in order to match the
raw entity mentions to their unique identifiers. Classification techniques, rule-based
systems, and pattern-matching algorithms have been utilized to solve this issue, and some
approaches also use contextual information to disambiguate the synonyms (Chen L
et al, 2005).
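A minimal sketch of dictionary-based normalization with context-based disambiguation follows. The thesaurus entries, the context cue words, and the function name are hypothetical and chosen only for illustration; the scoring rule (count of shared cue words) is one simple possibility, not the method of any cited system.

```python
# Hypothetical thesaurus: each surface form maps to candidate referents,
# and each candidate carries context words that support that reading.
THESAURUS = {
    "p53": [("TP53", set())],
    "ALL": [("acute lymphoblastic leukemia", set())],
    "CAT": [
        ("catalase", {"peroxide", "oxidative"}),
        ("chloramphenicol acetyltransferase", {"reporter", "plasmid"}),
    ],
}


def normalize(mention, context_words):
    """Map a raw mention to one standard referent, or None if unresolved."""
    candidates = THESAURUS.get(mention, [])
    if len(candidates) == 1:
        return candidates[0][0]  # unambiguous synonym
    context = {w.lower() for w in context_words}
    # Score each candidate by how many of its cue words occur in the context.
    scored = [(len(cues & context), name) for name, cues in candidates]
    best = max(scored, default=(0, None))
    return best[1] if best[0] > 0 else None


unambiguous = normalize("p53", [])
resolved = normalize("CAT", ["a", "reporter", "plasmid", "was", "used"])
unresolved = normalize("CAT", ["the", "cells", "were", "grown"])
```

An unambiguous synonym is mapped directly, whereas an ambiguous one is resolved only when the surrounding words support one reading. Returning `None` in the unresolved case keeps the sketch from guessing, at the cost of recall.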
Ideally, BTM systems extract and synthesize “facts” out of the literature that
combine entity mentions with relationships between and among the mentions established
in the literature. This work depends on NER results; that is, the relationships between the
entities can only be extracted once the individual entities have been identified. Although
biomedically oriented research in this area is not as advanced as NER, BTM researchers
have recently been increasing their efforts on these challenges.
The most straightforward yet powerful approach is co-occurrence. This approach
identifies the relationships between the involved biomedical entities based on their co-
occurrence in the articles, or by considering how close mentions are to each other within
a document. The assumption taken by the co-occurrence method is that if two (or more)
entity instances are co-mentioned in one single text record (or defined subset, such as a
sentence or a paragraph), these instances have some type of underlying biological
relationship. As it is possible that entity instances can coincidentally co-occur, systems
commonly use some parameters to rank the relationships, such as the frequency and
location of their co-occurrence. If two entity instances are repeatedly co-mentioned
in close proximity, it is more likely that they are related. This approach tends to
perform with better recall but at the expense of precision because it has no intelligent
means for distinguishing specific from general relationships. For example, if the
information to be extracted is the causal relationship between gene A and disease
diagnostic labels, this approach will recognize relationships of any kind between gene A
and relevant diseases, including but not limited to direct or causal relationships. In order
to improve precision, some co-occurrence-based IE systems include additional
approaches, such as combining with a customized text-categorization system to
preferentially identify relevant articles or sentences. Co-occurrence-based IE systems are
usually used as exploratory tools making inferential calls since they can identify both
direct and indirect relationships between entity instances (Jessen TK et al, 2001; Alako
BT et al, 2005).
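Sentence-level co-occurrence counting, with frequency used as the ranking parameter, can be sketched as follows. The function name and the example sentences are illustrative assumptions (the sentences are not quotations from the literature), and entity detection is reduced to naive substring matching, where a real system would use NER output.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences, genes, diseases):
    """Count the sentences in which a gene term and a disease term both appear."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        # Naive substring matching stands in for a real NER step.
        found_genes = [g for g in genes if g.lower() in text]
        found_diseases = [d for d in diseases if d.lower() in text]
        counts.update(product(found_genes, found_diseases))
    return counts


# Illustrative sentences written only to exercise the function.
sentences = [
    "MYCN amplification is frequent in neuroblastoma.",
    "MYCN expression was measured in neuroblastoma cell lines.",
    "TP53 mutations are rare in neuroblastoma.",
    "TP53 is mutated in many tumors.",
]
counts = cooccurrence_counts(sentences, ["MYCN", "TP53"], ["neuroblastoma"])
ranked = counts.most_common()
```

Both gene-disease pairs are reported as related even though the third sentence describes a rarity rather than an association, which illustrates why co-occurrence favors recall over precision.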
Another approach is to take advantage of natural language processing (NLP)
methodology that combines syntactic and semantic analysis of text. In this approach,
individual tokens in text are often first identified and then assigned part-of-speech labels,
in a process that has been automated with high accuracy. Then a nested tree-like
structure (either top-down or bottom-up) is developed in order to determine the
relationships between noun phrases or beyond, such as subject and object. After an
NER process is applied for assigning semantic labels to specific words and phrases, either
rule-based or machine-learning based processes can be used to extract relationships
between entity mentions. Although the syntactic parsing and the semantic labeling have
been carried out as separate steps by most NLP-based IE systems, results indicate that
better performance can be obtained by integrating the two steps, due in part to the often
complex relationships of biomedical entity mentions. This NLP-based approach can
achieve better precision, but lower recall, largely because of increased challenges in
identifying relationships across sentences. These approaches are also labor-intensive,
since either sophisticated expert-defined extraction rules or manually annotated training
corpora are required (Rzhetsky A et al, 2004; Daraselia N et al, 2004; Yakushiji A et al,
Although some research has addressed n-ary relationships between a
set of biomedical entities, most IE systems currently classify binary relationships between
same-type entities. These systems most commonly focus on entities and relationships that
are easier to define, such as protein-protein/gene-protein interactions, protein
phosphorylation, other specific relations between genomic entities such as cellular
localizations of proteins, or interactions between proteins and chemicals. Few IE
systems have yet to be designed for relating phenotypic attributes, such as gene-disease
relationships (Temkin et al, 2003; McDonald R et al, 2005).
High-performance systems that can extract many types of relationships and also
distinguish among relationships beyond the sentence level are not yet achievable. This is
due largely to three contributing factors. First, biomedical text is complex and highly
variable in its structure and presentation. Second, many complicating factors need to be
considered, including co-reference (e.g., the use of pronouns), ambiguity in intent, and
variability in formulation. Finally, systems need to incorporate various components
simultaneously (e.g., tokenizers, POS taggers, NER systems, parsers, disambiguators),
each of which contributes some measure of error that combines to significantly degrade
finalized output (Ding J et al, 2002).
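The compounding of component errors can be illustrated with simple arithmetic. The per-stage accuracies below are hypothetical values chosen for illustration, not measurements, and the calculation assumes that stage errors are independent, which real pipelines only approximate.

```python
import math

# Hypothetical per-stage accuracies; these are not measured values.
stage_accuracy = {
    "tokenizer": 0.99,
    "pos_tagger": 0.97,
    "ner": 0.85,
    "parser": 0.80,
    "disambiguator": 0.90,
}

# Assuming independent errors, the chance that every stage handles a given
# item correctly is the product of the individual stage accuracies.
end_to_end = math.prod(stage_accuracy.values())
```

Under these assumptions the end-to-end figure is roughly 0.59, well below the accuracy of the weakest single stage (0.80).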
DR systems typically identify and rank documents pertaining to a certain topic
from a large collection of text. Topics of interest might be derived from user-supplied
search terms or from pre-selecting specified types of documents. Most DR systems
feature keyword search capabilities; advanced keyword searching allows users to input a
combination of search terms and/or to perform advanced functions, such as including
logical operations or imposing limits on terms. Systems then commonly retrieve
documents containing or excluding certain terms that match the search criteria. This
method often retrieves irrelevant articles, and relevance-ranking functions are often
absent or primitive. More sophisticated DR systems go beyond this by applying distance
metrics, such as a vector-space model. With this model, every document is represented as
a vector, which is determined by measuring text-based features and/or document
metadata, such as a list of frequency-based weighted terms identified in each document.
The query vector, which is determined by the relative importance of each query term, is
then compared to document vectors to relevance rank the documents. The comparison
between document vectors can also calculate document similarity. PubMed is a well-
known DR system that is highly adapted for use as a query interface for MEDLINE.
PubMed uses both keyword searching and a vector model (Glenisson P et al, 2003).
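The vector-space model described above can be sketched with frequency-based term weights and cosine similarity. This is a generic illustration and not a description of how PubMed is implemented: the smoothed inverse-document-frequency formula is one common variant, and the function names and toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency; one common variant among several.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]


# Toy tokenized documents, for illustration only.
docs = [
    ["mycn", "amplification", "neuroblastoma"],
    ["fish", "oil", "viscosity"],
    ["neuroblastoma", "gene", "expression"],
]
ranking = rank(["mycn", "neuroblastoma"], docs)
```

The same `cosine` function applied to two document vectors gives a document-similarity score, as noted in the text.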
Advanced DR systems integrate NER or other NLP methods in order to more
accurately assess document content and identify documents that contain certain
biomedical entity mentions. FABLE, MedMiner and Textpresso are examples of systems
that make retrieval decisions by extracting and considering knowledge from gene/protein
mentions in the documents (FABLE; Tanabe L et al, 1999; Muller HM et al, 2004).
An ultimate goal of BTM is to assist with literature-based discovery. LBD can be
defined as a process that discovers testable novel hypotheses by inferring implicit
knowledge in biomedical literature. An early and often-cited example of LBD arose when a
researcher recognized a connection between two unrelated bodies of biomedical text: one describing
Raynaud’s disease, in which patients suffer from vasoconstriction, high blood viscosity
and platelet aggregability, and one describing fish oil, indicating that besides its capability of
causing vasodilation, its active ingredient can also lower blood viscosity and platelet
aggregation. This connection was formed completely through extensive reading of the
literature, and later the relationship was proved experimentally. The model used in this
seminal example was very simple: if A leads to B, and B leads to C, then it is plausible
that A could lead to C. Based on this closed discovery process (to connect two previously
known relations), this researcher subsequently discovered a novel association between
migraine and magnesium deficiency (also proved experimentally) as well as additional
successes (Swanson DR 1986; Swanson DR 1988; Swanson DR 1990).
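The closed discovery model (A leads to B, B leads to C) can be sketched as a search for shared intermediate terms. The relation pairs below are a toy encoding of the example in the text, and the function name is hypothetical.

```python
# Relations as (source, target) pairs; a toy encoding of the example above.
fish_oil_literature = {
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("fish oil", "vasodilation"),
}
raynaud_literature = {
    ("blood viscosity", "Raynaud's disease"),
    ("platelet aggregation", "Raynaud's disease"),
    ("vasoconstriction", "Raynaud's disease"),
}


def linking_terms(a, c, a_relations, c_relations):
    """Return intermediate terms B such that (a, B) and (B, c) are both known."""
    from_a = {b for src, b in a_relations if src == a}
    into_c = {b for b, dst in c_relations if dst == c}
    return sorted(from_a & into_c)


links = linking_terms("fish oil", "Raynaud's disease",
                      fish_oil_literature, raynaud_literature)
```

Only literally identical terms are linked, so "vasodilation" and "vasoconstriction" are not connected even though they are biologically related, which is one reason entity normalization matters for LBD.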
More challenging LBDs might arise from an open discovery process, which
attempts to derive relationships between two entities of interest through implicit
relationships in literature. For example, the process of identifying candidate genes for a
certain disease is an open discovery process. One example of this process would be to
first identify gene mentions co-occurring in the literature (gene set A) with mentions of a
disease of interest, next identify gene mentions (gene set B) co-occurring with known
disease genes, and then consider the overlap between the two sets of gene mentions as
candidate genes for the disease. This approach makes two assumptions: gene
set B is functionally related to known disease genes, and gene set A has some sort of
relationship with the disease. One potential problem for this approach is that there are many
types of direct and indirect relationships identified in such a process, including the high
likelihood that a substantial number of false positives are generated. NLP-based IE can
certainly help narrow down the relationship types, but further research is needed to
improve the performance of such models. More fundamentally, the literature inevitably
contains conflicting and inaccurate statements, which are impossible for an automated
algorithm to adjudicate (Weeber M et al, 2005).
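The candidate-gene procedure described above reduces to building two gene sets from co-occurrence pairs and taking their overlap. In this minimal sketch the gene and disease identifiers are placeholders and the function name is hypothetical.

```python
def open_discovery(gene_disease_pairs, gene_gene_pairs, disease, known_genes):
    """Return the genes that fall in both gene set A and gene set B."""
    # Gene set A: genes co-occurring with the disease of interest.
    set_a = {gene for gene, d in gene_disease_pairs if d == disease}
    # Gene set B: genes co-occurring with any known disease gene.
    set_b = set()
    for g1, g2 in gene_gene_pairs:
        if g1 in known_genes:
            set_b.add(g2)
        if g2 in known_genes:
            set_b.add(g1)
    return sorted(set_a & set_b)


# Placeholder identifiers, for illustration only.
gene_disease_pairs = [("G1", "D"), ("G2", "D"), ("G3", "other")]
gene_gene_pairs = [("K1", "G2"), ("G3", "K1"), ("G1", "G4")]
candidates = open_discovery(gene_disease_pairs, gene_gene_pairs, "D", {"K1"})
```

Here only G2 satisfies both conditions. Because the inputs are raw co-occurrence pairs, any false positive in either set propagates to the candidate list, as noted above.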
It is likely that more reliable inference of novel hypotheses and research
directions from the literature will be achieved by integrating BTM results with other data
types, including curated data sets and experimental data. Expert curation and
experimental evidence provide verification, filtering, and relevance ranking capabilities
based on information derived from real biological relationships between entities. For
example, researchers have made novel discoveries by transferring text-mined
relationships of a protein to its orthologous proteins based on sequence-similarity
searches. The integration of BTM results with functional genomic data such as
microarray data has helped researchers rank significant genes as well as develop novel
hypotheses based on both experimental data and previously known knowledge in a large
scale, automated fashion (Yandell MD et al, 2002; Raychaudhuri S et al, 2002; Glenisson
P et al, 2004).
Along with the rapid expansion of experimental data, the exponential increase of
the biomedical research text makes it more and more difficult for researchers to track and
utilize the information relevant to their interests, especially for domains outside their
core competence. Automated text mining systems can process the unstructured
information in the literature into a structured, queryable knowledgebase. This dissertation
research has developed well-performing automated entity extractors based on refined
manual annotation with iteratively defined, literature-based entity definitions in the domain
of genomic variation of malignancy. A co-occurrence-based information extraction process was
applied and integrated with microarray expression data in the pursuit of determining
neuroblastoma research candidate genes. Both functional pathway analysis and RT-PCR
experiments validated the contribution of text mining. This thesis demonstrates that in
addition to systematic curation of the textual information, biomedical text mining also
has inferential capability, especially when combined with experimental data.
Introduction to the Thesis
Using the genomics of malignancy as a test bed, this thesis has touched upon
every aspect of BTM outlined above. Work regarding the BTM process developed and
employed will be discussed in detail in Chapter 2 and Chapter 3. This thesis has also
established important work regarding information extraction in this domain, which has
been applied to research regarding the pediatric tumor neuroblastoma (Chapter 3 and
Chapter 4). Integration of BTM-extracted information with expression array analytical
results to discover candidate genes for neuroblastoma research will be discussed in detail
in Chapter 4.
Chapter 2. Defining Biomedical Entities for Named Entity Recognition
Mark A. Mandel
Peter S. White
The performance of machine-learning based named entity recognition is highly
dependent upon the quality of the training data, which is commonly generated by manual
annotation of biomedical text representative of the target domain. The development of
robust definitions of biomedical entities of interest is crucial for highly accurate
recognition but is often neglected by text-mining applications. While the conceptual and
syntactic complexities of biomedical entities often generate ambiguities in assigning text
mentions to particular entity classes, entity definitions that exhibit as distinct semantic
and textual boundaries as possible are desired. We have created a highly generalizable
process for developing entity definitions specifying both conceptual limits and detailed
textual ranges for target biomedical entities. This process utilizes representative text and
manual annotators to initially define and iteratively refine definitions. The process was
tested within the knowledge domain of genomic variation of malignancy. This work
describes in detail the different types of challenges faced and the corresponding solutions
devised during the definition process. The resulting entity definitions were used to
annotate a training corpus for the development of automated entity extraction algorithms
and for use by the research community. We conclude that manual annotation consistency
is useful for the success of later biomedical text mining tasks, and that explicit, boundary-
defined entity definitions can assist with achieving this goal.
Automated information extraction techniques can assist in the acquisition,
management and curation of data. A necessary first step is the ability to automatically
recognize biomedical entities in text, also known as named entity recognition (NER).
Development of named entity extractors for biomedical literature has progressed rapidly
in recent years. For example, a number of machine-learning algorithms currently exist for
identifying gene name instances in text (Collier N et al, 2000; Tanabe L et al, 2002;
GENIA; Hanisch D et al, 2005). However, a major shortcoming of many approaches is
that they often minimize efforts to define biomedical entities in an explicit fashion.
Rather, the tendency is often to ignore this step by adapting or refining existing semantic
standards as the target entities’ conceptual definitions, leaving interpretive details to
manual annotators. Additionally, existing standards often provide little or none of the
semantic depth required to establish concept boundaries with enough rigidity to provide
highly accurate extraction. This tends to create outstanding consistency problems in later
steps when training automated extractors and utilizing the extracted entity mentions for
particular applications, because non-literature based conceptual definitions often generate
significant annotation ambiguity problems due to the semantic as well as syntactic
complexities of biomedical entities in the literature. As a result, automated systems
derived from such definitions tend to perform more poorly. For biologists in particular, high
precision is a necessary prerequisite for widespread acceptance of automated tools, in
order to establish a level of reliability acceptable to users.
Strongly believing in the importance of establishing well-defined, literature-based
entity definitions with clear boundaries specially designed for biomedical NER practice,
the Biomedical Information Extraction Group at the University of Pennsylvania (Penn
BioIE) has developed an iterative annotation process designed to establish a set of
“precise” entity definitions. These definitions are meant to clarify the conceptual
boundaries both semantically and syntactically, while also striking a balance between the
requirements of researchers, annotators, and computational scientists. This paper will
first describe the annotation process developed by the Penn BioIE group, and then
introduce the necessities and challenges of defining biomedical entities with specific
examples in the literature.
2. Overview of manual annotation process and entity classification
Figure 2-1. The processes of developing entity definitions and extractors
Figure 2-1 demonstrates the iterative process developed for establishing and
refining entity definitions, first through manual annotations and then in developing
extractors based on the manually annotated training data. The process begins with the
creation of an initial definition that establishes the general concept and scope of an entity
class, which is supplied by one or a group of domain experts. Commonly, existing
standards and resources are explored and, if deemed suitable, adopted as nuclei for the
process. Subsequently, the domain experts play the role of adjudicating definition
discrepancies. Manual annotators are then trained with the initial versions of the entity
definitions, from which they manually annotate the selected training corpora. Invariably,
as the annotators encounter the wide diversity of semantic representations of specific
concepts, a need for iterative refinement of the entity definitions emerges. Often, text
encounters require major revisions or even restructuring of definitions to accommodate
such heterogeneity. Accordingly, definitions are continually refined during the analysis of
annotated texts and annotation disambiguation. The Penn BioIE group established
frequent communication forums where the emerging definitions and identified exceptions
were fully discussed among annotators and researchers. Communication modalities
included weekly face-to-face meetings, email lists, and live chat. After annotation was
completed, entity extractors were developed by implementing machine-learning
algorithms utilizing probability models (we used Conditional Random Fields); the
manually annotated texts were utilized as both training and testing data for these
algorithms. Comparison of the annotations produced by the automatic extractors and
human annotators allows for evaluation of the extractor performance.
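One common way to carry out such a comparison is exact-match scoring of annotated spans, sketched below. The span offsets, the labels assigned to them, and the function name are hypothetical, and exact matching on both boundaries and entity class is only one possible criterion; it is not necessarily the one used in this work.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if true_positives else 0.0
    return precision, recall, f1


# Hypothetical character offsets from one abstract.
human = {(4, 7, "Gene"), (20, 33, "Malignancy"), (40, 48, "Variation")}
extractor = {(4, 7, "Gene"), (20, 30, "Malignancy")}
precision, recall, f1 = span_scores(human, extractor)
```

The Malignancy span counts as an error because its right boundary differs from the human annotation, which shows how directly the textual boundaries in an entity definition affect measured performance.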
The target knowledge domain we chose was “Genomic Variation of Malignancy”,
conceptualized as a relationship among three entity classes: Gene, Variation and
Malignancy. As shown in Figure 2-2, the Gene and Variation entities comprise genomic
components of cancer while the Malignancy entity covers phenotypic aspects of
malignancy, including malignancy diagnostic labels and a number of malignancy
Figure 2-2. Entity classification scheme for the domain of genomic variation of malignancy
A total of 1442 MEDLINE abstracts were selected for exploration and annotation
in this study, one subset of which contained many different malignancy types to establish
breadth, and a second subset of which mentioned only one major malignancy
(neuroblastoma) to establish depth. As diagrammed in Figure 2-1, the manual annotation
process was first applied to the corpus with an electronic annotation tool, WordFreak
(http://sourceforge.net/projects/wordfreak). After the entity definitions were refined and
stabilized, the manually annotated data were then used to develop entity and attribute
extractors (McDonald RT et al, 2004; Jin Y et al, 2006). These automated extractors
performed with state-of-the-art accuracy, in part due to the careful design and
management of our annotation process. In the following paragraphs, we will discuss the
challenges we have encountered during the manual annotation process, and why we
believe that consistent entity definitions are critical for the success of later steps in
biomedical text mining.
3. The challenges of defining biomedical entities
Although we began this task believing we had clear ideas of what information
each entity should cover, it quickly proved challenging to develop detailed working
definitions. Our a priori notions of entity definition adequacy were that definitions
establish distinct and defensible boundaries both conceptually and textually, therefore
providing guidance to the annotators both semantically and syntactically. Solid entity
definitions are an essential foundation for the subsequent steps of developing machine-
learning algorithms and utilizing the extracted information for specific applications. First,
the performance of entity extractors is highly dependent not only on the selection of the
underlying algorithms, but also on the quality of the training data, which are entirely
based on the entity definitions. If the annotators cannot identify specific entity mentions
consistently on the basis of the definitions, it is hard to imagine that automated extractors
can replicate this task reliably. More importantly, without clear definitions, researchers
will certainly run into problems when trying to utilize the extracted mentions, since it will
be difficult to know the precise boundaries of the gathered information.
As mentioned earlier, we initially defined three major entities in the knowledge
domain of genomic variation of malignancy, based on existing ontological categories and
concepts. However, we quickly found that ontology-based definitions often don’t
precisely reflect what has been conceptualized throughout the biomedical texts
contributed by researchers worldwide. For example, the NCI Thesaurus defines a gene as:
“A functional unit of heredity which occupies a specific position (locus) on a particular
chromosome, is capable of reproducing itself exactly at each cell division, and directs the
formation of a protein or other product.” If annotators use this definition for identifying
gene mentions in the text, they could quickly be confused by questions such as
whether promoters should be included, how gene family names should be treated, and
how pronoun references to genes should be handled. Thus, we found the need to invoke text-based
working entity definitions, which were most effectively determined as annotators
proceeded with the entity recognition task in the training corpus. Every new mention of
an entity and every new context for a mention provided a test for the pre-developed entity
definition. If a definition could not explicitly lead the annotators to a “correct”, or at least
consistent decision in each case, the problematic mention required further examination,
interpretation, and possibly, refinement of the definition. Through such an iterative
process, we were able to develop fine-tuned entity definitions that provided distinct
boundaries both for semantic scope and contextual range.
The challenges that we encountered in refining our definitions can be grouped
into four categories: conceptual, syntactic, syntactic/semantic ambiguity, and inter-
annotator agreement. In the following paragraphs we will illustrate these types and give
examples of our devised solutions and their limits.
3.1 Conceptual definition challenges
As discussed earlier, an entity definition has to clarify both conceptual and textual
boundaries. Initial versions of our definitions were completely conceptual, based on our
understanding of biomedical categories. Surprisingly, more than half of the annotators’
difficulties with definitions fell into this category during the annotation process, and most
of them were reasonable, as illustrated in the following sections on the four
most common challenges in this category. This reflects the semantic complexity and
diversity of biomedical entities, which often cannot be easily defined without some
3.1.1 Sub-classification of entities
Based on the classification scheme stated above, our target knowledge domain
was initially divided into three major conceptual classes: gene, genomic variation, and
malignancy. However, this broad conceptual classification was far from sufficient for the
generation of highly accurate extractors. For example, according to the conceptual
definition, the malignancy concept covers all phenotypic information of cancer, including
a tumor’s diagnostic type, the tumor’s anatomical location and cellular composition, and
its differentiation status. Each of these types of information is presented in a variable
and often bewildering array of syntactic and contextual patterns, which increases entropy
and thus erodes the ability of machine-learning approaches to classify mentions. If
instead we further classify the mentions into sub-categories such as those described
above and annotate them as such, entropy is reduced and extractor performance can be
expected to improve. However, a major disadvantage of this approach is that
sub-categorization introduces considerable additional annotation effort. Thus, the annotation
process requires first the establishment of a level of entity granularity that balances the
cost of manual annotation with the application value of the extracted data.
There are countless ways to further divide entities into their underlying
components. For our purpose, we decided to let the level of granularity be generated by
the annotation process. By beginning with broad classes and subdividing them as needed,
we considered that we would eventually approach an optimal balance between effort and
effectiveness. We considered it to be critical to determine how the text strings represented
subcategories in the real world of biomedical literature. Therefore we divided our
annotation efforts into two stages: data gathering and data classification, as demonstrated
in Figure 2-3 with a genomic variation entity example.
Figure 2-3. The text-based two-stage entity sub-classification process
In the example illustrated by Figure 2-3, annotation of our initial concept of
“Genomic Variation” proceeded through a preliminary stage, which we named “Data
Gathering”, before the concept was divided into sub-categories. In this stage, all textual
mentions falling within or partially within our initial concept definition were annotated
regardless of syntax. When sufficient information was gathered, sub-categories were
defined based on their semantic and syntactic representations. In addition, by proceeding
with this exercise, the annotators became familiar with the concepts, definitions, and
emerging challenges of the tasks. By employing this method, the sub-classification
scheme began to approximate how the concepts were actually presented in the text.
3.1.2 Levels of specificity
Textual entity mentions referring to the same semantic types can range from very
general to quite specific, and not all levels of detail may be appropriate for a particular
project. A gene mention may refer to a specific gene instance in a single cell of a sample,
or to the wild type or a specific variation of the gene; or it may refer to gene families,
super families and generalized classes, which represent classes of genes. For instance,
“MAPK10” or “mitogen-activated protein kinase 10” is a family member of “MAPK”,
which itself belongs to a higher level family “protein kinase”. We made the decision to
include all levels of information for the gene entity except for the most general level such
as “gene”. That is, in the above example, all three levels of gene mentions are legitimate
and should be annotated as such.
The decision was based on a couple of considerations. First of all, gene class
information is valuable information to extract in later steps; although we don’t know
which specific gene it refers to, it does help us narrow down to a class of genes. Second,
if we only include the mentions describing genes at the instance level (the level that can
lead to a specific genomic element), we have to draw a line between gene classes and
instances. Because textual mentions for gene classes and instances are sometimes
interchangeable (researchers tend to use gene class names referring to gene instance
names and vice versa), it will be quite difficult for the automated extractors to distinguish
between the two. Finally, we excluded gene mentions at the most general level, which
contain no information content or application value to extract. In other words, all
information-containing levels of mentions are included.
3.1.3 Conceptual overlaps between entities
An ideal entity classification scheme should result in independent information
categories without any conceptual overlaps. Unfortunately, the subjective and adaptive
nature of biological objects makes this ideal difficult to achieve, especially
when defining two different but related entities. Even a basic concept such as “organism”
is difficult to define when considering entities such as viruses and viroids, self-replicating
machines with attributes necessary but not necessarily sufficient to qualify as life forms.
Because our gene and genomic variation concepts both fall within the genomic domain
and are closely associated, we were very careful to make a clear distinction. Eventually,
our gene entity evolved to encompass solely the names of genes and their downstream
products (i.e., RNAs and proteins), while the genomic variation entity covered specific
descriptions of genomic element variations.
Although our definitions of gene and genomic variation managed to eventually
establish a reasonable boundary between them, for other entities, we found it sometimes
impossible to avoid the conceptual overlapping problem. We encountered such problems
when trying to make a clear division between the entity classes symptom and disease. The
symptom entity was designed to capture subjective or objective evidence of disease, such
as headache, diarrhea or hyperglycemia, while the disease entity captured specific
pathological processes with a characteristic set of symptoms, such as Long QT Syndrome
or lung cancer. As with most cases, the distinction is often clear to domain experts unless
considerable scrutiny is requested, as it appears to be simple common sense that these
concepts represent two distinct and non-overlapping sets of information. However, when
presented with the broad contextual variation in use and, often, semantic intent, it actually
becomes quite difficult to draw a clear boundary between the two. We quickly found that
many terms can be considered as both symptoms and diseases, depending both upon
intent and the level of domain knowledge available. For example, “arrhythmia” itself is a
disease entity mention, representing a pathological process, but it is usually used as a
diagnostic label of a disease (symptom), such as Long QT Syndrome. We certainly did not
want two entity types that overlapped heavily with each other, since that would make
the classification unnecessary. That was not the case for the symptom and disease entity
types: their overlapping mentions amounted to less than approximately 10% overall. Most
conceptually overlapping mentions cannot be put into either category without reading the
text. We left it to the annotators to determine the authors’ intent based on the context, and
over time they became quite good at minimizing disagreement.
3.1.4 Domain-specific clarification
As biological entities tend to be conceptually subjective, we often found it to be
quite challenging and labor-intensive to establish consistent conceptual boundaries. The
process of defining the gene entity is a good example to illustrate this challenge. Initially,
we considered the task of defining a “gene” to be a straightforward task, as this concept is
considered by biologists to be a rather discrete object. The HUGO Gene Nomenclature
Committee (HGNC), the nomenclature body tasked with establishing official names for
human genes, defines a gene as “a DNA segment that contributes to phenotype/function.
In the absence of demonstrated function a gene may be characterized by sequence,
transcription or homology”. Building on that, our gene entity was initially defined as the
nominal reference to a gene or its downstream product in biomedical text. However, as
annotations moved forward, annotators raised more and more questions, forcing us to
make difficult determinations on the boundaries as illustrated below.
An example of biological complexity is the many ways that a gene can contribute
to phenotype. Typically, genes functionally impact biological processes through their
downstream products, proteins. However, there are DNA segments on the genome which
are able to affect phenotype by regulating how genes are expressed in particular
biological contexts. Promoter and enhancer regions, which are distinct segments of DNA
(often far) removed from the DNA segment that directly contributes to an RNA and/or
protein product, are one such example. These elements control whether and when the gene
itself is expressed. Although biologists disagree whether promoters should be considered
as genes or components of particular genes, annotators are required to make a decision on
the gene entity boundary limits. In this case, we considered our application domain to be
the most important determinant, as the main focus of our gene entity was to capture those
“traditional genes” that could be directly and consistently associated with a protein. Thus,
we limited our scope of genes to include only what we considered to be biologically
functional DNA segments which are translated into protein products.
There are many more cases that required further clarification of the gene entity
conceptual definition, such as how to deal with segments and multiplexes of
genes/RNAs/proteins. We realized that consistency was more valuable than trying to
establish universal truth, the former of which we considered to be the key to developing
well-performing automated extractors and increasing the application value of extracted
3.2 Syntactic definition challenges
Even with precise conceptual definitions, we found that guidelines needed to be made
regarding the textual boundaries of the entity mentions. Although many of these were
syntactic nuances, they were not necessarily trivial as sources of annotator disagreement. In
order to build consistent automated extractors, we determined that detailed annotation
guidelines were required to make manual annotations consistent between different
annotators. We designed our guidelines to be practical and based on actual contexts,
specifying to the annotators exactly what to do under any uncertain circumstances that we
3.2.1 Associating a text string to an entity mention
There are many different ways to associate a text string with an entity mention in
biomedical literature. In order to harvest consistent training data for developing
high-performing automated extractors, we needed to define a series of rules specifying how to
select text strings in the literature as legitimate entity mentions. We allowed entity
references to include more than one word, including punctuation, but not to cross
Although the majority of the entity mentions were nouns, not all of them were.
For some entity mentions such as variation type, other part-of-speech forms were not
uncommon. For example, for genomic variation types that would likely be normalized as
the forms “insertion”, “deletion”, or “translocation”, those variation type mentions were
usually expressed as verbs: “inserted”, “deleted”, or “translocated”. Moreover,
malignancy attribute mentions were nearly always adjectives, such as “well-
differentiated”, “hereditary”, and “malignant”.
All modifiers in a noun phrase mention were included as part of
the mention, not only because the modifiers can provide very useful information to be
extracted, but also because some modifiers are indispensable parts of the standard terms. We
observed that this decision made it easier for both manual annotators and
machine-learning extractors to operate, since it was difficult to define boundaries on which
modifiers to include in noun phrases. However, modifiers were not included for other
part-of-speech phrases, in order not to complicate the issue. For example, in the noun
phrase malignancy type mention “malignant squamous cell carcinoma”, both “malignant”
and “squamous cell” are the modifiers of “carcinoma”, and both provide very useful
information. “Squamous cell carcinoma” is also a commonly employed name of a type of
cancer. Our experience determined that it was difficult for annotators and impossible for
automatic extractors to draw consistent boundaries between modifiers on what should be
included as part of the legitimate mentions.
Lastly, we found it necessary to make entity-specific rules for some biological
entities. For example, the gene entity mentions commonly appeared in the text as “The
mycn gene…”, necessitating a decision as to whether the article “The” and the noun
“gene” should be included as part of the entity mention. We reasoned that the decision
should depend on how the extracted information was to be further processed and utilized.
Accordingly, we decided to include neither word, since all the extracted gene mentions
were to be subsequently mapped and normalized to official gene symbols.
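The effect of this entity-specific rule can be sketched in a few lines of Python. The regular expressions and the function name trim_gene_mention are hypothetical and were written only for this illustration; in our process the rule was applied by human annotators following the guidelines, not by a script.

```python
import re

# Hypothetical trimming patterns for gene mentions (illustration only).
LEADING_ARTICLE = re.compile(r"^(?:the|a|an)\s+", re.IGNORECASE)
TRAILING_GENERIC_NOUN = re.compile(r"\s+genes?$", re.IGNORECASE)


def trim_gene_mention(candidate: str) -> str:
    """Drop a leading article and a trailing generic noun 'gene' from a candidate string."""
    trimmed = LEADING_ARTICLE.sub("", candidate.strip())
    return TRAILING_GENERIC_NOUN.sub("", trimmed)


examples = ["The mycn gene", "the Ras gene", "K-Ras"]
trimmed_examples = [trim_gene_mention(text) for text in examples]
```

Keeping only the name itself leaves a string that can be looked up directly when mentions are later mapped to official gene symbols.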
3.2.2 Co-reference issue
Often a single entity is referred to in different ways in the same text, a situation
known as co-reference. Besides its standardized form, an entity instance can also be
referred to by aliases, acronyms, descriptions or pronoun references. For example, the
mycn gene has at least 10 aliases in the literature, including “n-myc”, “oded”, and “v-myc
avian myelocytomatosis viral related oncogene, neuroblastoma derived”. Moreover,
researchers commonly engineer their own acronyms as self-convenient but non-standard
and often unique aliases. Co-reference is generally recognized as a challenging task for
entity recognition and information extraction. To deal with this issue in manual
annotation, we have classified this problem into the following four categories and made
corresponding decisions for each of them.
A. Extended form vs. acronym
Regular expression: ___ ___ ___ (___)
• …mitogen-activated protein kinase (MAPK)… -- gene entity mention
• …squamous cell carcinoma (SCC)… -- malignancy type entity mention
Our decision: Tag both the extended form and abbreviated form of the entity mention.
For the above examples, “MAPK” is co-referential with “mitogen-activated protein
kinase”, and “SCC” is co-referential with “squamous cell carcinoma”. Both extended
forms and acronyms would be tagged as corresponding entity instances in our system.
Our rationale: Both forms are interchangeable descriptions of entity mentions, and they
should be treated equally.
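To make the pattern concrete, the sketch below locates a parenthesized acronym and the extended form that precedes it, and returns them as two separate spans. The heuristic (matching the acronym letters against the initials of the preceding words, with hyphens treated as word breaks) is an assumption made for this illustration; it is not the procedure followed by the annotators or by our extractors.

```python
import re
from typing import List, Tuple

# A short capitalized token inside parentheses, e.g. "(MAPK)".
ACRONYM_IN_PARENS = re.compile(r"\(([A-Z][A-Za-z0-9]{1,9})\)")
WORD_PIECE = re.compile(r"[A-Za-z0-9]+")

Span = Tuple[int, int, str]  # (start offset, end offset, text)


def find_extended_and_acronym(text: str) -> List[Span]:
    """Return separate spans for each extended form and its acronym."""
    spans: List[Span] = []
    for match in ACRONYM_IN_PARENS.finditer(text):
        acronym = match.group(1)
        before = text[:match.start()].rstrip()
        # Word pieces before the parenthesis; hyphens split words into pieces.
        pieces = [(m.group(0), m.start()) for m in WORD_PIECE.finditer(before)]
        if len(pieces) < len(acronym):
            continue
        candidate = pieces[-len(acronym):]
        initials = "".join(word[0] for word, _ in candidate)
        if initials.lower() != acronym.lower():
            continue
        long_start = candidate[0][1]
        spans.append((long_start, len(before), before[long_start:]))
        spans.append((match.start(1), match.end(1), acronym))
    return spans


sentence = ("Activation of mitogen-activated protein kinase (MAPK) "
            "was seen in squamous cell carcinoma (SCC).")
found = find_extended_and_acronym(sentence)
```

Each pair yields two spans rather than one combined string, which mirrors the decision to tag the extended form and the acronym individually.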
B. Alias description
Regular expression: …Y…X… or …Y (X)…
• TrkA (NTRK1)…
• The N-myc gene, or MYCN…
Our decision: NTRK1 and MYCN are official name designations of the TrkA and N-myc
genes, and here they are being co-referenced accordingly. We decided to tag all different
expression forms of the entity instances, including standard/official nomenclatures,
aliases or descriptions. Like acronyms and their extended forms, these various names are
also tagged individually: in the first example, we tagged “TrkA” and “NTRK1”
separately and without the parentheses, not the combined string “TrkA (NTRK1)”.
Our rationale: Researchers often use unofficial nomenclatures for entity mentions, so we
can’t just annotate standard descriptions. However, they should be normalized later.
C. General vs. specific
Regular expression: X, a (the) Y…
• C-Kit, a tyrosine kinase which plays an important role, …
• K-Ras is an oncogene. The Ras gene…
Our decision: In the examples above, the gene family name “Ras” and the superfamily
name “tyrosine kinase” are used to co-refer to the gene family instances “K-Ras” and “C-
Kit”. In such situations, our annotation guideline treated the general terms and more
specific terms completely independently, regardless of the co-referential relationship
between them. That is, depending on the conceptual definition, if the term was a
legitimate mention, it was tagged as an entity mention no matter what levels of specificity
it had. For those examples, since the gene entity definition included both gene instances
and family names, all four terms were tagged as gene entity mentions. We did not,
however, tag “oncogene”, nor did we extend the tag on “Ras” to include the following
word “gene”. These words, at the highest level of generality, convey no taggable
Our rationale: Based on our decision on tagging all information-containing levels of
mentions and specifically for the examples listed, all gene instances, gene families and
superfamilies are determined legitimate mentions.
D. Pronoun reference
Regular expression: …X…PRONOUN (It, This, etc.)…
• K-Ras is an oncogene. It is mutated in…
• Five point mutations were found in the MYC gene, and they were next to each
Our decision: In the two examples, “It” is co-referential to “K-Ras”, and “they” is co-
referential to “point mutations”. We generally did not annotate pronouns, although they
may refer to legitimate entity mentions.
Our rationale: Pronoun co-reference is a challenging problem in text mining research,
which involves cross-sentence, whole-record level of relation extraction. Without deeper
parsing of the text, there is no value in extracting the pronoun itself.
3.2.3 Structural overlap between entity mentions
Entities can overlap not only conceptually, but also literally, with their textual
mentions in the literature. Annotation guidelines were developed for the following
A. Entity within entity – tag within tag
This refers to the situation in which one entity mention is completely included in the
textual range of another. As the two intertwined entity mentions could belong to either
the same or different entities, we divided this category of problem into two sub-
categories. If the two mentions were in the same entity, only the subsuming entity
mention was tagged. For example, in “mitogen-activated protein kinase kinase kinase”,
there exist 7 distinct gene entity mentions: mitogen-activated protein; mitogen-activated
protein kinase; mitogen-activated protein kinase kinase; mitogen-activated protein kinase
kinase kinase; and three mentions of “kinase”. While this type of a situation was a source
of confusion among new annotators, we considered it both unnecessary and costly to tag
all possible mention permutations. As the mention with the largest range was always the
one being discussed, only the outermost mention was tagged as a gene
mention. In fact, this situation led to the adoption of a more generalized guiding
principle, where the annotation should reflect the author intent whenever possible
(although exceptions were encountered, such as poorly written abstracts where the intent
from the context occasionally and obviously differed from the actual word or phrase
If two completely overlapping mentions instead belonged to different entity types,
we annotated both. These mentions were usually related, and they both often provided
valuable information. Some entities, such as malignancy attributes, often appeared as part
of another entity mention. For instance, “colon cancer” is a malignancy type mention, and
“colon” is a malignancy site mention. “Hirschsprung disease 1” is another example, in which
“Hirschsprung disease” is a disease mention while the whole phrase is a gene mention.
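The two nesting rules can be summarized in a short Python sketch: a mention is dropped only when a different mention of the same entity type contains it, while nested mentions of different types are both kept. The Span structure and the label strings are hypothetical and serve only to illustrate the rule.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical annotated span: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def drop_nested_same_type(spans: List[Span]) -> List[Span]:
    """Keep a span unless another span with the same label contains it."""
    kept = []
    for span in spans:
        subsumed = any(
            other != span
            and other.label == span.label
            and other.start <= span.start
            and span.end <= other.end
            for other in spans
        )
        if not subsumed:
            kept.append(span)
    return kept


# Same entity type: only the outermost gene mention survives.
gene_text = "mitogen-activated protein kinase kinase kinase"
inner = "mitogen-activated protein kinase"
first_kinase = gene_text.index("kinase")
gene_spans = [
    Span(0, len(gene_text), "gene"),
    Span(0, len(inner), "gene"),
    Span(first_kinase, first_kinase + len("kinase"), "gene"),
]
kept_gene = drop_nested_same_type(gene_spans)

# Different entity types: "colon cancer" and "colon" are both kept.
cancer_spans = [Span(0, 12, "malignancy-type"), Span(0, 5, "malignancy-site")]
kept_cancer = drop_nested_same_type(cancer_spans)
```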
B. Entity co-identity – double tagging
This category represents the situation in which two entity mentions share the exact
same text. We annotated the same text twice with the two corresponding labels under
such circumstances. For example, in the phrase “deletion of the K-ras gene”, “K-ras” was
tagged as both a gene entity mention and a variation-location mention.
C. Discontinuous mentions – chaining
Sometimes mentions of several entities of the same type shared a common
substring. When written together in the text, the common part only occurred once for the
first or last mention, and other mentions were only represented with the different parts.
For example, in the text “H-, K-, and N-ras…”, there are really three gene mentions:
“H-ras”, “K-ras” and “N-ras”, but a limitation of our annotation software prevented tagging
of discontinuous mentions as one parent mention (in the example above, only “N-ras”
could be tagged). For the other two discontinuous mentions, we developed a chaining
procedure through which annotators were able to link the component parts (“H-” and
“K-” with “ras”) by inserting comments into the annotation in a standard format.
Chaining was strictly limited to a single sentence in order not to complicate issues
for subsequent syntactic parsing of sentences. Employing the same logic, entity mentions
were not allowed to span different sentences.
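The intent of chaining, recovering each full mention from a fragment and the shared part, can be illustrated with the Python sketch below. It handles only the coordinated-prefix pattern of the example within a single sentence; the regular expression is an assumption made for this illustration and does not reproduce the comment format that annotators actually used.

```python
import re
from typing import List

# Pattern for coordinated prefixes that share a final part, e.g. "H-, K-, and N-ras".
CHAIN = re.compile(
    r"\b((?:[A-Za-z0-9]+-,\s*)+)"  # leading fragments such as "H-, K-, "
    r"(?:(?:and|or)\s+)?"          # optional conjunction
    r"([A-Za-z0-9]+-)"             # prefix of the complete last mention, "N-"
    r"([A-Za-z0-9]+)"              # shared part, "ras"
)
FRAGMENT = re.compile(r"[A-Za-z0-9]+-")


def expand_chained_mentions(sentence: str) -> List[str]:
    """Rebuild every full mention from its fragment and the shared part."""
    expanded: List[str] = []
    for match in CHAIN.finditer(sentence):
        prefixes = FRAGMENT.findall(match.group(1)) + [match.group(2)]
        expanded.extend(prefix + match.group(3) for prefix in prefixes)
    return expanded


mentions = expand_chained_mentions("Mutations in H-, K-, and N-ras were screened.")
```

Because the pattern cannot match across a period, the expansion stays within one sentence, in line with the restriction on chaining.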
3.3 Syntactical vs. Semantic – ambiguity challenges
We considered ambiguity in mentions to be the most common and difficult
challenge in our annotation experience, as it truly reflects the limitation of human-
invented texts in fully communicating author intent. In biomedical text, we found it not
uncommon that an identical text string could represent completely different concepts, and
the frequency of ambiguity appeared to be much higher than for non-biological text. In
the following paragraphs, we will use mainly gene entity examples to illustrate the
elusive nature of this problem.
We found ambiguity to occur both within and outside gene entities. Genes have a
tradition of being independently named, with poor adherence to or awareness of
standards. People tended to make up new acronyms for gene names, as a result of
which there are more gene names than available combinations of letters and numbers for
short-character symbols/aliases. Thus, there are many similarities between aliases just by
chance. Since each gene has multiple non-unique aliases with one unique gene symbol,
there exists a very serious internal ambiguity problem among the aliases. Based on our
calculation, for human genes alone, as many as 3% of genes share
aliases, and the numbers are higher if other species are included. Also, many species
have traditions of naming the genes the same, especially mouse and human (Chen L et al,
2005). For example, p90 is the common alias shared by the distinct gene symbols CANX
and TFRC. As a protein naming convention, p90 actually refers to the protein with
molecular weight 90. Therefore, it is not surprising that there are two proteins with the
When such gene mentions appear in literature, (often quite distant) context is the
only way to clarify which gene is in discussion, although sometimes it offers no
assistance. Another type of within gene entity ambiguity that we recognized was the
frequent apparent inability to distinguish a gene from its downstream products, based
purely on the text string of the mention. Although initially, our gene entity was designed
to capture only the nomenclatures of functional genomic elements, we soon discovered
that researchers were frequently using the same referents to represent a gene and also its
RNA and protein products in the literature. Without looking at the context, a gene
mention “mycn” had almost an equal probability to refer to a gene or its downstream
product, and both the gene and its mRNA were referred to as being “expressed” to create
a mRNA or a protein product, respectively. In addition, authors also tended to obscure
the conceptual boundaries between a gene and its downstream products. For example,
while a given protein X performs biological functions, we found it common that the
corresponding gene X was being described as performing this action. It became apparent
that while researchers were personally clear regarding distinctions, their descriptions did
not adequately convey these distinctions. In fact, in several cases, we found it impossible
to determine whether certain gene mentions referred to a gene or its RNA or protein
products even when considering the entire article. This overwhelming ambiguity problem
finally prompted us to reach the decision to include genes’ downstream products when
annotating gene entity mentions. In the end, we created a single entity class, gene, but also
included labels for partially subdividing its mentions, while allowing for the fact that
mentions cannot always be divided perfectly into the 3 classes. If it was not clear in the text whether
a mention referred to a gene or a protein, the mention was annotated as “gene.generic”, as
opposed to “gene.gene/RNA” or “gene.protein”.
Besides the challenges mentioned above, it was common to encounter gene entity
mentions that were easily confused with objects belonging to other entity types. This
is because genes have been named with a wide variety of methods, from the use of lay
language to the invention of specialized and often clever acronyms. For example, “Cat”
is an official gene symbol for the gene catalase, while it could also be used to refer to a
kind of animal. “NB” is the acronym of a well-known pediatric cancer neuroblastoma,
but it is also an official name of a gene locus putatively located on chromosome 1p36.
This cross-entity ambiguity problem was also commonly seen for other entity classes,
such as variation type. As an example, “Insertion” and “deletion” are well-defined
variation type mentions, but they are also frequently used to denote biological or clinical
actions. Regardless of the types of the ambiguity problems, the task for our manual
annotators was to make their best calls to identify the intended reference of the text
strings and annotate them as such. Sometimes annotators needed to take the entire abstract
or, rarely, the entire article, into consideration in order to determine what particular
mentions truly represented. Depending on the nature of the biomedical entities and how
representative the training data was, the subsequent automatic extractors were able to
disambiguate problematic text strings to a certain degree by taking local contextual features
3.4 Annotator perceptions
Even if perfect entity definitions and annotation guidelines could somehow be
created, there would still be variations among human annotators in understanding and
applying them during the annotation process, and we certainly encountered lively
discussion regarding some topics. Usually, manual annotation is done by different
annotators in order to get more files done within a shorter period of time, but the
downside is that it introduces more inconsistencies between annotators. Even with only
one annotator, there will be variability in application of guidelines.
We took two approaches to deal with this problem. First, annotators were told to
discuss anything unclear, and we promoted frequent discussion to determine a consistent
path. Second, a dual, sequential-pass manual annotation process was developed and
applied to better adjudicate different annotators’ work and produce training data as
consistent as possible. During this process, every document was annotated de novo by
one annotator and then checked by a second, more experienced and consistent
annotator, who was charged with identifying and revising any first-pass annotations
considered to be incorrect. Edited items were then subject to
review by the group, and senior annotators used this editing process as an opportunity for
educating less experienced annotators if repeated error patterns were identified.
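One simple way to surface the items edited in the second pass is to compare the two sets of annotations for a document, as in the Python sketch below. The tuple representation, the sample offsets and the overlap-based agreement figure are assumptions for illustration; the text above does not specify how edits were tallied.

```python
from typing import Dict, Iterable, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, label), hypothetical


def summarize_second_pass(first_pass: Iterable[Annotation],
                          second_pass: Iterable[Annotation]) -> Dict[str, float]:
    """Count annotations kept, removed and added by the second-pass annotator."""
    first, second = set(first_pass), set(second_pass)
    kept = first & second
    union = first | second
    return {
        "kept": len(kept),
        "removed": len(first - second),
        "added": len(second - first),
        # Share of all annotations from either pass that both passes agree on.
        "agreement": len(kept) / len(union) if union else 1.0,
    }


# Invented annotations for one document.
first_pass = [(0, 4, "gene"), (10, 18, "variation-type"), (25, 30, "gene")]
second_pass = [(0, 4, "gene"), (10, 18, "variation-type"),
               (25, 36, "gene"), (40, 52, "malignancy-type")]
summary = summarize_second_pass(first_pass, second_pass)
```

The removed and added items are the natural candidates for group review, and recurring patterns among them point to guideline passages that need clarification.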
3.5 Publication-based errors
Typographical and grammatical errors, though infrequent, are inevitable, and
some of them were observed in entity mentions during our process. Because of
copyright considerations, we were not authorized to change the text in such
cases; instead, we skipped tagging these mentions and added comments.
As a result of the generation and application of these carefully refined entity
definitions and annotation guidelines, 1442 MEDLINE abstracts were manually
annotated. Of these, 1157 files have been made publicly available (release 0.9, BioIE web
site). Since the release, the data has been widely used by the biomedical text mining
community for a variety of purposes, including entity recognition and normalization, and
its usage is likely to increase (Cohen KB et al, 2005).
Because of the consistency of the training data across the corpus, the developed
entity and attribute extractors perform with high precision and recall rates. Table 2-1
indicates the performance of three entity extractors built with this data (McDonald RT et
al, 2004; Jin Y et al, 2006).
Entity            Precision   Recall   F-measure
Gene              0.864       0.787    0.824
Variation Type    0.8556      0.7990   0.8263
Location          0.8695      0.7722   0.8180
State-Initial     0.8430      0.8286   0.8357
State-Sub         0.8035      0.7809   0.7920
Overall           0.8541      0.7870   0.8192
Malignancy type   0.8456      0.8218   0.8335
Table 2-1: Entity extractor performance on evaluation data
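The F-measure column of Table 2-1 can be cross-checked against the precision and recall columns. The short sketch below assumes that the reported F-measure is the balanced F-measure (the harmonic mean of precision and recall); the function name f_measure and the dictionary layout are illustrative choices, not part of the original work.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied row by row from Table 2-1
TABLE_2_1 = {
    "Gene": (0.864, 0.787, 0.824),
    "Variation Type": (0.8556, 0.7990, 0.8263),
    "Location": (0.8695, 0.7722, 0.8180),
    "State-Initial": (0.8430, 0.8286, 0.8357),
    "State-Sub": (0.8035, 0.7809, 0.7920),
    "Overall": (0.8541, 0.7870, 0.8192),
    "Malignancy type": (0.8456, 0.8218, 0.8335),
}

for entity, (p, r, reported) in TABLE_2_1.items():
    computed = f_measure(p, r)
    # Differences below 0.001 are attributable to rounding in the table.
    print(f"{entity:16s} computed={computed:.4f} reported={reported:.4f}")
```

Under this assumption, every row of the table agrees with the recomputed value to within rounding.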
Manual annotation is an indispensable step to create training data for developing
machine-learning automated extractors. In order to generate extractors that perform with
accuracies high enough to be acceptable to the biomedical research community,
consistently annotated training data is a prerequisite. Although we did not formally prove
it, our experience has been that investment in developing literature-based entity
definitions and annotation guidelines yields far better extracted information with distinct
conceptual boundaries, which in turn increases the opportunity for practical application.
We have concluded that, rather than trying to construct unifying definitions that maximize
acceptance and minimize contention amongst domain experts, a consistent and
generally arguable definition was preferable when making decisions to specify entity
boundaries and magnitudes. More important for us was to consider how the extracted
information would be used and, once that was determined, how to maintain consistency
throughout the training corpus.
Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures.
Bioinformatics, 21: 248-256. (2005).
Cohen KB, Fox L, Ogren PV, Hunter L: Corpus design for biomedical natural language
processing. Proceedings of the ACL-ISMB workshop on linking biological literature,
ontologies and databases, pp. 38-45. Association for Computational Linguistics. (2005).
Collier N, Nobata C, Tsujii J: Extracting the names of genes and gene products with a
hidden Markov model. In Proceedings of the 18th International Conference on
Computational Linguistics, Saarbrucken, Germany. (2000).
GENIA: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ (2004).
Hanisch D, Fundel K, Mevissen HT, Zimmer R, Fluck J: ProMiner: rule-based protein
and gene entity recognition. BMC Bioinformatics. 6: S14. (2005).
Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC,
Winters RS, White PS: Automated recognition of malignancy mentions in biomedical
literature. BMC Bioinformatics, 7: 492. (2006).
McDonald RT, Winters RS, Mandel M, Jin Y, White PS, Pereira F: An entity tagger for
recognizing acquired genomic variations in cancer literature. Bioinformatics 22(20):
Penn BioIE: http://bioie.ldc.upenn.edu/index.jsp
Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text,
Bioinformatics, 18:1124-1132. (2002).
Chapter 3. Automated Recognition of Malignancy Mentions in Biomedical Literature
Ryan T. McDonald
Mark A. Mandel
Mark Y. Liberman
Fernando C. N. Pereira
R. Scott Winters
Peter S. White
Published: BMC Bioinformatics, 7:492, 2006
Background: The rapid proliferation of biomedical text makes it increasingly
difficult for researchers to identify, synthesize, and utilize developed knowledge in their
fields of interest. Automated information extraction procedures can assist in the
acquisition and management of this knowledge. Previous efforts in biomedical text
mining have focused primarily upon named entity recognition of well-defined molecular
objects such as genes, but less work has been performed to identify disease-related
objects and concepts. Furthermore, promise has been tempered by an inability to
efficiently scale approaches in ways that minimize manual efforts and still perform with
high accuracy. Here, we have applied a machine-learning approach previously successful
for identifying molecular entities to a disease concept to determine if the underlying
probabilistic model effectively generalizes to unrelated concepts with minimal manual
intervention for model retraining.
Results: We developed a named entity recognizer (MTag), an entity tagger for
recognizing clinical descriptions of malignancy presented in text. The application uses
the machine-learning technique Conditional Random Fields with additional domain-
specific features. MTag was tested with 1,010 training and 432 evaluation documents
pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83
recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using
string matching of text with a neoplasm term list, MTag performed with a much higher
recall rate (94.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns.
Application of MTag to all MEDLINE abstracts yielded the identification of 580,002
unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an
extensive lexicon of malignancy mentions as a feature set for extraction had minimal
impact in performance.
Conclusions: Together, these results suggest that the identification of disparate
biomedical entity classes in free text may be achievable with high accuracy and only
moderate additional effort for each new application domain.
The biomedical literature collectively represents the acknowledged historical
perception of biological and medical concepts, including findings pertaining to disease-
related research. However, the rapid proliferation of this information makes it
increasingly difficult for researchers and clinicians to peruse, query, and synthesize it for
biomedical knowledge gain. Automated information extraction methods, which have
recently been increasingly concentrated upon biomedical text, can assist in the acquisition
and management of this data. Although text mining applications have been successful in
other domains and show promise for biomedical information extraction, issues of
scalability impose significant impediments to broad use in biomedicine. Particular
challenges for text mining include the requirement for highly specified extractors in order
to generate accuracies sufficient for users; considerable effort by highly trained computer
scientists with substantial input by biomedical domain experts to develop extractors; and
a significant body of manually annotated text—with comparable effort in generating
annotated corpora—for training machine-learning extractors. In addition, the high
number and wide diversity of biomedical entity types, along with the high complexity of
biomedical literature, make auto-annotation of multiple biomedical entity classes a
difficult and labor-intensive task.
Most biomedical text mining efforts to date have focused upon molecular object
(entity) classes, especially the identification of gene and protein names. Automated
extractors for these tasks have improved considerably in the last few years [1-13]. We
recently extended this focus to include genomic variations. Although there have
been efforts to apply automated entity recognition to the identification of phenotypic and
disease objects [15-17], these systems are broadly focused and often do not perform as
well as those utilizing more recently-evolved machine-learning techniques for such tasks
as gene/protein name recognition. Recently, Skounakis and colleagues have applied a
machine-learning algorithm to extract gene-disorder relations, while van Driel and
co-workers have made attempts to extract phenotypic attributes from Online Mendelian
Inheritance in Man. However, more extensive work on medical entity class
recognition is necessary because it is an important prerequisite for utilizing text
information to link molecular and phenotypic observations, thus improving the
association between laboratory research and clinical applications described in the
literature.
In the current work, we explore scalability issues relating to entity extractor
generality and development time, and also determine the feasibility of efficiently
capturing disease descriptions. We first describe an algorithm for automatically
recognizing a specific disease entity class: malignant disease labels. This algorithm,
MTag, is based upon the probability model Conditional Random Fields (CRFs) that has
been shown to perform with state-of-the-art accuracy for entity extraction tasks [5, 14].
CRF extractors consider a large number of syntactic and semantic features of text
surrounding each putative mention [20, 21]. MTag was trained and evaluated on
MEDLINE abstracts and compared with a baseline vocabulary matching method. An
MTag output format that provides HTML-visualized markup of malignant mentions was
developed. Finally, we applied MTag to the entire collection of MEDLINE abstracts to
generate an annotated corpus and an extensive vocabulary of malignancy mentions.
Manually annotated text from a corpus of 1,442 MEDLINE abstracts was used to
train and evaluate MTag. Abstracts were derived from a random sampling of two
domains: articles pertaining to the pediatric tumor neuroblastoma and articles describing
genomic alterations in a wide variety of malignancies. Two separate training experiments
were performed, either with or without the inclusion of malignancy-specific features,
which were the addition of a lexicon of malignancy mentions and a list of indicative
suffixes. In each case, MTag was trained with the same randomly selected 1,010 training
documents and then evaluated with a separate set of 432 documents pertaining to cancer
genomics. The extractor took approximately 6 hours to train on a 733 MHz PowerPC G4
with 1 GB SDRAM. Once trained, MTag can annotate a new abstract in a matter of
seconds.
For evaluation purposes, manual annotations were treated as gold-standard files
(assuming 100% annotation accuracy). We first evaluated the MTag model with all
biological feature sets included. Our experiments resulted in 0.846 precision, 0.831 recall,
and 0.838 F-measure on the evaluation set. Additionally, the two subset corpora
(neuroblastoma-specific and genome-specific) were tested separately. As expected, the
extractor performed with higher accuracy with the more narrowly defined corpus
(neuroblastoma) than with the corpus more representative for various malignancies
(genome-specific). The neuroblastoma corpus performed with 0.88 precision, 0.87 recall,
and 0.88 F-measure, while the genome-specific corpus performed with 0.77 precision,
0.69 recall, and 0.73 F-measure. These results likely reflect the increased challenge of
identifying mentions of malignancy in a document set demonstrating a more diverse
collection of mentions.
To determine the impact of the biological feature sets we included to provide domain
specificity, we excluded these feature sets to create a generic MTag. This extractor was
then trained and evaluated using the identical set of files used to train the biological
MTag version. Somewhat surprisingly, the extractor performed with similar accuracy
with the generic model, resulting in 0.851 precision, 0.818 recall, and 0.834 F-measure
on the evaluation set. These results suggested that at least for this class of entities, the
extractor performs the task of identifying malignancy mentions efficiently without the
use of a specialized lexicon.
Extraction versus string matching
We next determined performance of MTag relative to a baseline system that could be
easily employed. For the baseline system, the NCI neoplasm ontology, a term list of
5,555 malignancies, was used as a lexicon to identify malignancy mentions. Lexicon
terms were individually queried against text by case-insensitive exact string matching. A
subset of 39 abstracts randomly selected from the testing set, which together contained
202 malignancy mentions, were used to compare the automated extractor and baseline
results. MTag identified 190 of the 202 mentions correctly (94.1%), while the NCI list
identified only 85 mentions (42.1%), all of which were also identified by the extractor.
We also determined the performance of string matching that instead used the set of
malignancy mentions identified in the manually curated training set annotations (1,010
documents) as a matching lexicon. This system identified 79 of 202 mentions (39.1%).
Combining the manually-derived lexicon with the NCI lexicon yielded 124 of 202
mentions (61.4%).
A closer analysis of the 68 malignancy mentions missed by the string matching with
combined lists but positively identified by MTag determined two general subclasses of
additional malignant mentions. The majority of MTag-unique mentions were lexical or
modified variations of malignancies present either in the training data or in the NCI
lexicon, such as minor variations in spelling and form (e.g., “leukaemia” versus
“leukemia”), and acronyms (e.g., “AML” in place of “acute myeloid leukemia”). More
importantly, a substantial minority of mentions identified only by MTag were instances
of the extractor determining new mentions of malignancies that were, in many cases,
neither obvious nor represented in readily available lexicons. For example, “temporal
lobe benign capillary haemangioblastoma” and “parietal lobe ganglioglioma” are neither
in the NCI list nor in the training set per se, nor approximated as such by a lexical variant.
This suggests that MTag contributes a significant learning component.
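The string-matching baseline described in this section can be illustrated with a minimal sketch. It assumes that the lexicon is a plain list of terms and adds word-boundary guards so that a term does not match inside a longer word; the text specifies only case-insensitive exact string matching, so these guards, the function names, and the toy lexicon are illustrative assumptions rather than the procedure actually used.

```python
import re


def build_matcher(lexicon):
    """Compile one case-insensitive pattern covering every lexicon term."""
    terms = sorted({t.strip() for t in lexicon if t.strip()}, key=len, reverse=True)
    if not terms:
        raise ValueError("lexicon contains no usable terms")
    # Longest terms come first so "acute myeloid leukemia" wins over "leukemia".
    alternation = "|".join(re.escape(t) for t in terms)
    # (?<!\w) and (?!\w) stop a match from starting or ending inside a word
    # (an assumption added for this sketch).
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_mentions(text, matcher):
    """Return (start, end, matched_text) for every lexicon hit in the text."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


# Toy lexicon standing in for the 5,555-term NCI list (hypothetical content).
toy_lexicon = ["leukemia", "acute myeloid leukemia", "neuroblastoma"]
matcher = build_matcher(toy_lexicon)
text = "Acute myeloid leukemia (AML) and leukaemia differ from Neuroblastoma."
hits = find_mentions(text, matcher)
print([h[2] for h in hits])
```

In this toy example the spelling variant “leukaemia” and the acronym “AML” are both missed, which mirrors the two kinds of lexical variation that the string-matching baseline failed to capture.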
Application to MEDLINE
MTag was then used to extract mentions of malignancy from all MEDLINE
abstracts through 2005. Extraction took 1,642 CPU-hours (68.4 CPU-days; 2.44 days on
our 28-CPU cluster) to process 15,433,668 documents. A total of 9,153,340 redundant
mentions and 580,002 unique mentions (ignoring case) were identified. Interestingly, the
ratio of unique new mentions identified relative to the number of abstracts analyzed was
relatively uniform, ranging from a rate of 0.183 new mentions per abstract for the first
0.1% of documents to a rate of 0.038 new mentions per abstract for the last 1% of
documents. This indicated that a substantial rate of new mentions was being maintained
throughout the extraction process.
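The run-time figures quoted above follow from simple arithmetic, reproduced in the sketch below. It assumes that the work was divided evenly across the 28 CPUs with no scheduling overhead; the variable names are illustrative.

```python
# Figures reported in the text
cpu_hours = 1_642
documents = 15_433_668
cpus = 28
total_mentions = 9_153_340
unique_mentions = 580_002

cpu_days = cpu_hours / 24                        # total CPU time in days
wall_days = cpu_days / cpus                      # elapsed days, assuming ideal parallelism
seconds_per_doc = cpu_hours * 3600 / documents   # average CPU seconds per abstract
mentions_per_doc = total_mentions / documents    # average mentions per abstract
unique_fraction = unique_mentions / total_mentions

print(f"CPU-days:              {cpu_days:.1f}")
print(f"Days on {cpus} CPUs:       {wall_days:.2f}")
print(f"CPU seconds/abstract:  {seconds_per_doc:.3f}")
print(f"Mentions per abstract: {mentions_per_doc:.3f}")
print(f"Unique/total mentions: {unique_fraction:.4f}")
```

The first two values reproduce the 68.4 CPU-days and 2.44 days reported in the text; the remaining values are derived here and do not appear in the original.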
The 25 mentions found in the greatest number of abstracts by MTag are listed in
Table 1. Six of these phrases (pulmonary, fibroblasts, neoplastic, neoplasm
metastasis, extramural, and abdominal) did not match our definition of malignancy. Of
these, only “extramural” is not frequently associated with malignancy descriptions and is
likely the result of containing character n-grams that are generally indicative of
malignancy mentions. The remaining five phrases are likely the result of the extractor
failing to properly define mention boundaries in certain cases (e.g., tagging “neoplasm”
rather than “brain neoplasm”), or alternatively, shared use of an otherwise indicative
character string (e.g., “opl” in “brain neoplasm” and “neoplastic”) between a true positive
and a false positive.
For comparison, we also determined the corresponding number of articles identified
both by keyword searching of PubMed and by exact string matching of MEDLINE for
each of the 19 most common true malignancy types (Table 1). Overall, MTag’s
comparative recall was 1.076 versus PubMed keyword searching and 0.814 versus string
matching. As PubMed keyword searching uses concept mapping to relate keywords to
related concepts, thus providing query expansion, the document retrieval totals derived
from this approach do not strictly compare to MTag’s approach. Furthermore, the exact
string totals would be inflated relative to the MTag totals, as for example the phrase
“myeloid leukemia” would be counted both for this category and for a category
“leukemia” with exact string matching, but would only be counted for the former phrase
by MTag. To adjust for these discrepancies, for MTag document totals listed in Table 1,
we included documents that were tagged with malignancy mentions that were both strict
syntactic parents and biological children of the phrase used. For example, we included
articles identified by MTag with the phrase “small-cell lung cancer” within the total for
the phrase “lung cancer”.
Comparison of these totals between MTag articles and PubMed keyword searching
revealed that MTag provided high recall for most malignancies. Interestingly, there are
three malignancy mention instances (“carcinoma”, “sarcoma”, “melanoma”) that have
more MTag-identified articles than for PubMed keyword searches. This suggests that a
more formalized normalization of MTag-derived mentions might assist both with
efficiency and recall if employed in concert with the manual annotation procedure
currently employed by MEDLINE. Furthermore, MTag’s document recall compared
quite favorably to exact string matching. Only two of the 25 malignancy mentions
yielded fewer than 60% as many articles via MTag as via PubMed exact string matching
(“bone neoplasms” and “lung cancer”). In these two cases, the concept-mapping PubMed
search identifies articles with a broader range beyond the search terms. For example,
a PubMed search for the term “lung cancer” identifies articles describing “lung
neoplasms”, while for “bone neoplasms”, articles focusing on related concepts such as
“osteoma” and “sphenoid meningioma” are identified by PubMed. Generally, MTag
recall would be expected to improve further after a subsequent normalization process that
maps equivalent phrases to a standard referent.
To assess document-level precision, we randomly selected 100 abstracts identified by
MTag each for the malignancies “breast cancer” and “adenocarcinoma”. Manual
evaluation of these abstracts showed that all of the articles were directly describing the
respective malignancies. Finally, we evaluated both the 250 most frequently mentioned
malignancies as well as a random set of 250 extracted malignancy mentions from the all-
MEDLINE-extracted set. For the frequently occurring mentions, 72.06% were considered
to be true malignancies; this set corresponds to 0.043% of all malignancy mentions. For
the random set, 78.93% were true malignancies. This suggests that such extracted
mention sets might serve as a first-pass exhaustive lexicon of malignancy mentions.
Comparison of the entire set of unique mentions with the NCI neoplasm list showed that
1,902 of the 5,555 NCI terms (34.2%) were represented in the extracted literature.
MTag is platform independent, written in Java, and requires Java 1.4.2 or higher to
run. The software is freely available under the GNU General Public License at
has been engineered to directly accept files downloaded from PubMed and formatted in
MEDLINE format as input. MTag provides output options of text or HTML file versions
of the extractor results. The text file repeats the input file with recognized malignancy
mentions appended at the end of the file. The HTML file provides markup of the original
abstract with color-highlighted malignancy mentions, as shown in Figure 1.
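The HTML output option can be illustrated with a short sketch that wraps each recognized mention in a colored span element. It assumes that mentions are available as character offsets into the abstract; the function name, the color, and the markup are hypothetical and do not reproduce MTag’s actual output format.

```python
import html


def highlight_mentions(text, spans, color="#ffe082"):
    """Return HTML in which each (start, end) character span is color-highlighted."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # this sketch simply skips overlapping spans
        pieces.append(html.escape(text[cursor:start]))
        mention = html.escape(text[start:end])
        pieces.append(f'<span style="background-color:{color}">{mention}</span>')
        cursor = end
    pieces.append(html.escape(text[cursor:]))
    return "".join(pieces)


abstract = "MYCN amplification in neuroblastoma & lung cancer."
mentions = ["neuroblastoma", "lung cancer"]
# Offsets are looked up here for illustration; a tagger would supply them directly.
spans = [(abstract.find(m), abstract.find(m) + len(m)) for m in mentions]
marked_up = highlight_mentions(abstract, spans)
print(marked_up)
```

Escaping the text before inserting markup keeps characters such as “&” in the abstract from being interpreted as HTML.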
We have adapted an entity extraction approach that has been shown to be successful
for recognition of molecular biological entities and have shown that it also performs with
high accuracy for disease labels. It is evident that an F-measure of 0.83 is not sufficient as
a stand-alone approach for curation tasks, such as the de novo population of databases.
However, such an approach provides highly enriched material for manual curators to
utilize further. As was determined by our comparisons with lexical string matching and
PubMed-based approaches, our extraction method demonstrated substantial improvement
and efficiency over commonly employed methods for document retrieval. Furthermore,
MTag appeared to be accurately predicting malignancy mentions by learning and
exploiting syntactic patterns encountered in the training corpus.
Analysis of mis-annotations would likely suggest additional features and/or heuristics
that could boost performance considerably. For example, anatomical and histological
descriptions were frequent among MTag false positive mentions. Incorporation of
lexicons for these entity types as negative features within the MTag model would likely
increase precision. Our training set also does not include a substantial number of
documents that do not contain mentions of malignancy; recent unpublished work from
our group suggests that inclusion of such documents significantly impacts extractor
performance in a positive manner.
Unlike the first iteration of our CRF model, the MTag application required only a
modest effort in retraining and customization time (several weeks vs. several months;
see Methods). To our surprise, the addition of biological features,
including an extensive lexicon for malignancy mentions, provided very little boost to the
recall rate. This provides evidence that our general CRF model is flexible, broadly
applicable, and if these results hold true for additional entity types, might lessen the need
for creating highly specified extractors. In addition, the need for extensive domain-
specific lexicons, which do not readily exist for many disease attributes, might be
obviated. If so, one approach to comprehensive text mining of biomedical literature might
be to employ a series of modular extractors, each of which is quickly generated and then
trained for a particular entity or relation class. Conversely, it is important to note that the
entity class of malignancy possesses a relatively discrete conceptualization relative to
certain other phenotypic and disease concepts. Further adaptation of our extractor model
for more variably described entity types, such as morphological and developmental
descriptions of neoplasms, is underway. However, the finding that biological feature
addition provided minimal gain in accuracy suggests that further improvements may be
more difficult to obtain than by merely identifying and adding additional domain-specific
features. Significantly, challenges in rapid generation of annotations for extractor
training, as well as procedures for efficient and accurate entity normalization, still
remain.
When combined with expert evaluation of output, extractors can assist with
vocabulary building for targeted entity classes. To demonstrate feasibility, we extracted
mentions of malignancy for all pre-2006 MEDLINE abstracts. Our results indicate that
MTag can generate such a vocabulary readily and with moderate computational resources
and expertise. With manual intervention, this list could be linked to the underlying
literature records and also integrated with other ontological and database resources, such
as the Gene Ontology, UMLS, caBIG, or tumor-specific databases [23-25]. Since
normalization of disease-descriptive term lists requires considerable specialized
expertise, the role of an extractor in this setting more appropriately serves as an
information harvester. However, this role is important, as such supervised lists are often
not readily available, due in part to the variability in which phenotypic and disease
descriptions can be described, and in part to the lack of nomenclature standards in many
Finally, to our knowledge, MTag is one of the first directed efforts to automatically
extract entity mentions in a disease-oriented domain with high accuracy. Therefore,
applications such as MTag could contribute to the extraction and integration of
unstructured, medically-oriented information, such as physician notes and physician-
dictated letters to patients and practitioners. Future work will include determining how
well similar extractors perform for identifying mentions of malignant attributes with
greater (e.g. tumor histology) and lesser (e.g. tumor clinical stage) semantic and syntactic
MTag can automatically identify and extract mentions of malignancy with high
accuracy from biomedical text. Generation of MTag required only moderate
computational expertise, development time, and domain knowledge. MTag substantially
outperformed information retrieval methods using specialized lexicons. MTag also
demonstrated the ability to assist with the generation of a literature-based vocabulary for
all neoplasm mentions, which is of benefit for data integration procedures requiring
normalization of malignancy mentions. Parallel iteration of the core algorithm used for
MTag could provide a means for more systematic annotation of unstructured text,
involving the identification of many entity types; and application to phenotypic and
medical classes of information.
Our task was to develop an automated method that would accurately identify and
extract strings of text corresponding to a clinician’s or researcher’s reference to cancer
(malignancy). Our definition of the extent of the label “malignancy” was generally the
full noun phrase encompassing a mention of a cancer subtype, such that “neuroblastoma”,
“localized neuroblastoma”, and “primary extracranial neuroblastoma” were considered to
be distinct mentions of malignancy. Directly adjacent prepositional phrases, such as
“cancer <of the lung>”, were not allowed, as these constructions often denoted ambiguity
as to exact type. Within these confines, the task included identification of all variable
descriptions of particular malignancies, such as the forms “squamous cell carcinoma”
(histological observation) or “lung cancer” (anatomical location), both of which are
underspecified forms of “lung squamous cell carcinoma”. Our formal definition of the
semantic type “malignancy” can be found at the Penn BioIE website.
In order to train and test the extractor with both depth and breadth of entity mention,
we combined two corpora for testing. The first corpus concentrated upon a specific
malignancy (neuroblastoma) and consisted of 1,000 randomly selected abstracts
identified by querying PubMed with the query terms “neuroblastoma” and “gene”. The
second corpus consisted of 600 abstracts previously selected as likely containing gene
mutation instances for genes commonly mutated in a wide variety of malignancies. These
sets were combined to create a single corpus of 1,442 abstracts, after eliminating 158
abstracts that appeared to be non-topical, had no abstract body, or were not written in
English. This set was manually annotated for tokenization, part-of-speech assignments,
and malignancy named entity recognition, the latter in strict adherence to our pre-
established entity class definition [27, 28]. Sequential dual pass annotations were
performed on all documents by experienced annotators with biomedical knowledge, and
discrepancies were resolved through forum discussions. A total of 7,303 malignancy
mentions were identified in the document set. These annotations are available in corpus
release v0.9 from our BioIE website.
Based on the manually annotated data, an automatic malignancy mention extractor
(MTag) was developed using the probability model Conditional Random Fields (CRFs).
We have previously demonstrated that this model yields state-of-the-art accuracy
for recognition of molecular named entity classes [5, 14]. CRFs model the conditional
probability of a tag sequence given an observation sequence. We denote by O an
observation sequence, or a sequence of tokens in the text, and by t a corresponding tag
sequence in which each tag labels the corresponding token with either Malignancy
(meaning that the token is part of a malignancy mention) or Other. CRFs are log-linear
models based on a set of feature functions, f_i(t_j, t_{j-1}, O), which map predicates on
observation/tag-transition pairs to binary values: the function value is 1.0 when the
observation predicate holds and the current tag is Malignancy; otherwise (o.w.) it is 0. A
particular advantage of this model is that it allows the effects of many potentially
informative features to be simultaneously weighed. Consider, for example, a feature that
takes the value 1.0 when the token “cancer” is tagged with label Malignancy and “lung”
is the previous token, and 0 otherwise. Features such as this
would likely receive a high weight, as they represent informative associations between
observation predicates and their corresponding labels.
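For reference, the model described above can be written in the standard linear-chain CRF form. This is a sketch of the textbook formulation, not a reproduction of the original typeset equation: the weights \lambda_i, the normalizer Z(O), and the token notation o_j are assumed symbols that do not appear in the text.

```latex
% Assumed notation: \lambda_i are learned feature weights, Z(O) is the
% normalizing constant, and o_j is the j-th token of the observation sequence O.
P(t \mid O) = \frac{1}{Z(O)}
  \exp\Big( \sum_{j} \sum_{i} \lambda_i \, f_i(t_j, t_{j-1}, O) \Big),
\qquad
Z(O) = \sum_{t'} \exp\Big( \sum_{j} \sum_{i} \lambda_i \, f_i(t'_j, t'_{j-1}, O) \Big)

% Indicator feature for the "lung cancer" example discussed in the text
f_i(t_j, t_{j-1}, O) =
\begin{cases}
  1.0 & \text{if } o_{j-1} = \text{``lung''},\ o_j = \text{``cancer''},\ t_j = \text{Malignancy} \\
  0   & \text{o.w.}
\end{cases}
```

In this form, a large positive weight \lambda_i on the example feature raises the probability of any tag sequence that labels “cancer” as Malignancy when it follows “lung”.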
Our CRF algorithm considers many textual features when it makes decisions on
classifying whether a word comprises all or part of a malignancy mention. Word-based
features included whether a word has been identified as being a malignancy mention by
manual annotation of text used as training material. The frequency of each string of 2, 3,
or 4 adjacent characters (character n-grams) within each word of the training text was
calculated, and the differential frequency of each n-gram within words manually tagged
as being malignancy mentions, relative to the overall frequency of these strings in the
overall text, was considered as a series of features. Orthographic features included the
usage and distribution of punctuation, alternative spellings, and case usage. Domain-
specific features comprised a lexicon of 5,555 malignancies and a regular expression for
tokens containing the suffix –oma. In total, MTag incorporated 80,294 unique features.
All observation predicates, either with or without the biological predicates, were then
applied over all labels, applying a token window of (-1, 1) to create the final set of
features. The MALLET toolkit was used as the implementation of CRFs to build our
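The kinds of observation predicates described in this section can be sketched as follows. This is an illustration under simplifying assumptions, not the MTag feature set: character n-grams are emitted as binary presence indicators rather than differential frequencies, the lexicon is treated as a set of single lower-cased tokens, and all function and predicate names are hypothetical.

```python
import re

OMA_SUFFIX = re.compile(r"oma$", re.IGNORECASE)  # tokens ending in the suffix -oma


def char_ngrams(word, sizes=(2, 3, 4)):
    """All character n-grams of length 2, 3, and 4 within a word."""
    w = word.lower()
    return {w[i:i + n] for n in sizes for i in range(len(w) - n + 1)}


def token_predicates(token, lexicon):
    """Word-based, orthographic, and domain-specific predicates for one token."""
    preds = {f"word={token.lower()}"}
    preds |= {f"ngram={g}" for g in char_ngrams(token)}
    if token[:1].isupper():
        preds.add("init_cap")
    if token.isupper():
        preds.add("all_caps")
    if any(not ch.isalnum() for ch in token):
        preds.add("has_punct")
    if OMA_SUFFIX.search(token):
        preds.add("suffix_oma")      # domain-specific: indicative suffix
    if token.lower() in lexicon:
        preds.add("in_lexicon")      # domain-specific: lexicon membership
    return preds


def windowed_predicates(tokens, lexicon, window=(-1, 1)):
    """Attach each token's predicates to its neighbors within the token window."""
    per_token = [token_predicates(t, lexicon) for t in tokens]
    features = []
    for j in range(len(tokens)):
        feats = set()
        for offset in range(window[0], window[1] + 1):
            k = j + offset
            if 0 <= k < len(tokens):
                feats |= {f"{p}@{offset}" for p in per_token[k]}
        features.append(feats)
    return features


feats = windowed_predicates(["lung", "carcinoma"], lexicon={"carcinoma"})
print(sorted(f for f in feats[1] if not f.startswith("ngram=")))
```

Each resulting predicate would then be paired with the candidate labels to form the binary feature functions over which the CRF weights are learned.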
The evaluation set of 432 abstracts comprised 2,031 sentences containing mentions
of malignancy and 3,752 sentences without mentions, as determined by manual
assessment of entity content. The predicted malignancy mention was considered correctly
identified if, and only if, the predicted and manually labeled tags were exactly the same
in content and both boundary determinations. The performance of MTag was calculated
according to the following metrics: Precision (number of entities predicted correctly
divided by the total number of entities predicted), Recall (number of entities predicted
correctly divided by the total number of entities identified manually), and F-measure
(2*Precision*Recall/(Precision+Recall)).
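The exact-match scoring described above can be sketched as follows. The sketch assumes that each mention is represented as a (document, start, end) triple and that F-measure is the balanced harmonic mean of precision and recall; the function name and the toy data are illustrative only.

```python
def evaluate(predicted, gold):
    """Exact-match precision, recall, and F-measure over mention spans."""
    predicted, gold = set(predicted), set(gold)
    # A prediction counts as correct only if both boundaries match exactly.
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data: (document id, start offset, end offset)
gold = {("doc1", 10, 23), ("doc1", 40, 51), ("doc2", 5, 16)}
pred = {("doc1", 10, 23), ("doc1", 44, 51), ("doc2", 5, 16), ("doc2", 30, 38)}

p, r, f = evaluate(pred, gold)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```

Note that the prediction ("doc1", 44, 51) overlaps a gold mention but is scored as an error, because its left boundary differs; this is the strictness implied by requiring identical content and boundaries.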
List of Abbreviations Used
CRF, conditional random field
YJ implemented the algorithm to develop MTag and drafted the manuscript. RM
developed the core algorithm and assisted in the implementation. KL developed the
software interface. MM supervised the manual annotation for extractor training and
testing. SC assisted with the tagging of MEDLINE and analysis of the results. ML
oversaw the linguistic aspects of the project. FP developed the theoretical underpinnings
of the algorithm and oversaw the computational aspects of the project. RW participated in
algorithm design and the manual annotation procedure. PW oversaw the biological
aspects of the project, provided overall direction, and finalized the manuscript. All
authors read and approved the final manuscript.
The authors thank members of the University of Pennsylvania Biomedical
Information Extraction Group; Kevin Murphy for annotations, discussions and technical
assistance; the National Library of Medicine for access to MEDLINE; and Richard
Wooster for corpus provision. This work was supported in part by NSF grant ITR
0205448 (to ML), a pilot project grant from the Penn Genomics Institute (to PW), and the
David Lawrence Altschuler Endowed Chair in Genomics and Computational Biology (to
1. Collier N, Takeuchi K: Comparison of character-level and part of speech
features for name recognition in biomedical texts. J Biomed Inform 2004,
2. Finkel J, Dingare S, Manning CD, Nissim M, Alex B, Grover C: Exploring the
boundaries: gene and protein identification in biomedical text. BMC
Bioinformatics 2005, 6 Suppl 1:S5.
3. Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U,
Scheffer T: Systematic feature evaluation for gene name recognition. BMC
Bioinformatics 2005, 6 Suppl 1:S9.
4. Kinoshita S, Cohen KB, Ogren PV, Hunter L: BioCreAtIvE Task1A: entity
identification with a stochastic tagger. BMC Bioinformatics 2005, 6 Suppl
5. McDonald R, Pereira F: Identifying gene and protein mentions in text using
conditional random fields. BMC Bioinformatics 2005, 6 Suppl 1:S6.
6. Mitsumori T, Fation S, Murata M, Doi K, Doi H: Gene/protein name
recognition based on support vector machine using dictionary as features.
BMC Bioinformatics 2005, 6 Suppl 1:S8.
7. Tamames J: Text Detective: a rule-based system for gene annotation in
biomedical texts. BMC Bioinformatics 2005, 6 Suppl 1:S10.
8. Tanabe L, Wilbur WJ: Tagging gene and protein names in biomedical text.
Bioinformatics 2002, 18:1124-1132.
9. Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ: GENETAG: a tagged
corpus for gene/protein named entity recognition. BMC Bioinformatics 2005, 6
10. Temkin JM, Gilder MR: Extraction of protein interaction information from
unstructured text using a context-free grammar. Bioinformatics 2003,
11. Torii M, Kamboj S, Vijay-Shanker K: Using name-internal and contextual
features to classify biological terms. J Biomed Inform 2004, 37:498-511.
12. Yeh A, Morgan A, Colosimo M, Hirschman L: BioCreAtIvE Task 1A: gene
mention finding evaluation. BMC Bioinformatics 2005, 6 Suppl 1:S2.
13. Zhou G, Shen D, Zhang J, Su J, Tan S: Recognition of protein/gene names from
text using an ensemble of classifiers. BMC Bioinformatics 2005, 6 Suppl 1:S7.
14. McDonald RT, Winters RS, Mandel M, Jin Y, White PS, Pereira F: An entity
tagger for recognizing acquired genomic variations in cancer literature.
Bioinformatics 2004, 20:3249-3251.
15. Chen L, Friedman C: Extracting phenotypic information from the literature
via natural language processing. Medinfo 2004, 11:758-762.
16. Friedman C, Hripcsak G, DuMouchel W, Johnson SB, Clayton PD: Natural
language processing in an operational clinical information system. Natural
Language Engineering 1995, 1:1-28.
17. Hahn U, Romacker M, Schulz S: MEDSYNDIKATE--a natural language
system for the extraction of medical information from findings reports. Int J
Med Inform 2002, 67:63-74.
18. Skounakis M, Craven M, Ray S: Hierarchical Hidden Markov Models for
information extraction. Proceedings of the 18th International Joint Conference
on Artificial Intelligence: 2003; Acapulco, Mexico; 2003.
19. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-
mining analysis of the human phenome. Eur J Hum Genet 2006, 14:535-542.
20. Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic
Models for Segmenting and Labeling Sequence Data. Proceedings of
ICML-01: 2001; 2001: 282-289.
21. McCallum A: Efficiently Inducing Features of Conditional Random Fields.
UAI '03, Proceedings of the 19th Conference in Uncertainty in Artificial
Intelligence: 2003: Morgan Kaufmann; 2003: 403-410.
22. Malignancy type definitions
23. The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006,
24. Bodenreider O: The Unified Medical Language System (UMLS): integrating
biomedical terminology. Nucleic Acids Res 2004, 32:D267-270.
25. Kakazu KK, Cheung LW, Lynne W: The Cancer Biomedical Informatics Grid
(caBIG): pioneering an expansive network of information and tools for
collaborative cancer research. Hawaii Med J 2004, 63:273-275.
26. Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, Schein A,
Ungar L, Winters S, White P: Integrated annotation for biomedical
information extraction. Proc of BioLink 2004 2004.
27. Kulick S, Liberman M, Palmer M, Schein A: Shallow semantic annotation of
biomedical corpora for information extraction. Proc ISMB 2003.
28. Penn BioIE corpus release v0.9 [http://bioie.ldc.upenn.edu]
29. MALLET: A Machine Learning for Language Toolkit
30. Bruder E, Passera O, Harms D, Leuschner I, Ladanyi M, Argani P, Eble JN,
Struckmann K, Schraml P, Moch H: Morphologic and molecular
characterization of renal cell carcinoma in children and young adults. Am J
Surg Pathol 2004, 28:1117-1132.
Figure 3-1. Example of the HTML output of MTag for an annotated abstract. Malignancy type
mentions identified by MTag are shown in bold, italicized, and blue text.
Chapter 4. A Text Mining Approach for Identifying Genes Implicated
in Neuroblastoma Tumorigenesis
Garrett M. Brodeur
Peter S White
The pediatric tumor neuroblastoma can be classified into two subtypes that
commonly exhibit distinctly different clinical outcomes, and which appear to correlate
with the differential activation of either the NTRK1 or NTRK2 neurotrophin signaling
pathways. Previously, we generated neuroblastoma cell lines that constitutively express
either the receptor tyrosine kinase NTRK1 or NTRK2 in an otherwise identical
background. Microarray expression profiling of the cell line models after introduction of
either NTRK1 ligand (NGF) or NTRK2 ligand (BDNF) gave rise to 751 genes
differentially expressed between the two cell lines. We developed a method to
re-prioritize the differentially expressed gene list by extracting and integrating
information regarding genes differentially mentioned in biomedical text articles between
NTRK1 and NTRK2, using a highly specific entity recognition and normalization process.
This process identified twenty-two genes that were both differentially expressed and
differentially mentioned in the literature. The 22 genes were compared to the larger set
of differentially expressed genes to determine whether each group’s genes were enriched
for protein pathways considered to be critical for neuroblast development. Results
demonstrated that text mining, alone or integrated with the microarray data, was capable
of further enriching the differentially expressed gene set for pathway-relevant genes.
Expression levels for 11 of the 22 genes were verified by real-time expression analysis.
One of the eleven genes, EFNB3, validated the biological utility of the text mining
process, while another, TYRO3, suggested the inferential power of the process. We
conclude that biomedical text mining can help interpret high-throughput data analysis by
integrating previously known information.
Neuroblastoma is the most common pediatric extracranial solid tumor, accounting
for approximately 9% of all childhood cancers. Neuroblastoma is derived from primitive
cells of the developing sympathetic nervous system. Progression of the disease is
markedly variable, ranging from spontaneous regression of metastatic disease in a small
minority of infants to metastatic disease that grows relentlessly, despite even the most
intensive multimodality therapy, in many children over one year of age (Brodeur GM
2003). Based both upon these observations and a number of tumor classification studies
using a wide range of biological and clinical factors, the presence of at least two
biological subtypes with distinct clinical outcomes has been proposed. Previous studies
have suggested that expression of the neurotrophin receptor NTRK1 (TrkA) is strongly
correlated with favorable outcomes, while expression of NTRK2 (TrkB) conversely
indicates an unfavorable outcome (Nakagawara A et al, 1992; 1993; 1994; Suzuki T et al,
1993; Kogner P et al, 1993; Borrello MG et al, 1993). The high binding-affinity ligands
for NTRK1 and NTRK2 receptors are nerve growth factor (NGF) and brain-derived
neurotrophic factor (BDNF) respectively. The NTRK1 and NTRK2 ligands, receptors,
and, to the extent they are known, the downstream signal transduction pathways are
highly similar in structure and composition. However, it has been well-established that
the NGF/NTRK1 signaling pathway mediates cellular differentiation and/or programmed
cell death in vitro, while the BDNF/NTRK2 pathway enhances neuroblastoma cell
survival (Eggert A et al, 2000; 2002; Ho et al, 2002). It is evident that these two signaling
pathways must activate certain non-overlapping effector molecules and downstream
targets, but the molecules that account for the distinct biological behaviors have not yet
been elucidated. Therefore, further characterization of the differential molecular
responders activated by the two similar neurotrophin signaling pathways might help us
understand the mechanisms responsible for the different phenotypic behaviors of the two
neuroblastoma subtypes, as well as identify possible clinical intervention targets.
Array-based gene expression analysis is a recent, commonly employed, and
increasingly effective strategy for identifying differentially active transcripts in a
systematic fashion. However, array methods are well known to suffer from limited
positive predictive value, due in part to the large number of genes being surveyed, and in
part to limitations in the correlation between gene expression and biological activity.
Although single-gene transcript surveillance systems such as real time PCR (RT-PCR)
are more reliable ways to identify differentially expressed genes, as well as to validate
array-based findings, employing these more sensitive techniques to identify more
promising candidates is cost- and effort-prohibitive for most laboratories. Instead,
researchers typically first undertake a high-throughput array-based screen and then select
a small subset of the most differentially expressed genes for validation and further study.
However, this process requires researchers to make subjective decisions that often rely on
their own knowledge rather than more objective methods that consider additional
knowledge sources regarding genes of interest for prioritization.
Biomedical literature is the most complete and up-to-date reservoir of discovered
biomedical knowledge. While this knowledge source is immediately attractive, from an
information content standpoint, for discovery tasks such as the identification of genes
implicated in human diseases, the unstructured nature of biomedical text hinders
approaches that would use this information systematically for prioritization tasks.
However, biomedical text mining (BTM) techniques developed by us and others have recently
demonstrated success in extracting target information out of text (Jin Y et al, 2006;
McDonald RT et al, 2004; Rzhetsky A et al, 2004; Hanisch D et al, 2005; BioCreAtIvE).
Effective use of such techniques could provide a large and structured data set of extracted
information that would allow more comprehensive synthesis of published biomedical
knowledge than current, ad hoc methods used by most researchers for literature
awareness. However, BTM techniques are costly to implement and typically yield results
that are inadequately sensitive if applied generally; thus, these systems have been slow to
gain acceptance among biomedical researchers.
In contrast, we and others have had considerable success constructing BTM
applications that are limited in scope but are highly tuned to a particular practical task.
With a previously developed named entity recognition (NER) system, we were able to
identify human gene mentions in literature with high accuracy rates, normalize these to
standard referents, and apply this system to the entire body of MEDLINE documents. In
the current study, we applied this system to help address a particular biomedical research
challenge, the identification of candidate genes associated with a particular differential
signaling paradigm. Our NER system was used to identify MEDLINE articles
differentially “expressing” NTRK1 or NTRK2 relative to each other, and then to identify
other genes co-mentioned in these articles. The BTM results were then combined with
microarray expression analysis results generated in an in vitro expression system where
either NTRK1 or NTRK2 was induced. The combined analysis provided a means to re-
calculate relevance of genes that showed evidence of differential expression in both the
experimental and computational systems. Finally, we experimentally validated and
characterized the plausibility of predicted candidates.
Materials and Methods
Microarray expression profiling
Full-length NTRK1 and NTRK2 were cloned into the retroviral expression vector
pLNCX and transfected into the Trk-null human neuroblastoma cell line SH-SY5Y as
previously described (Eggert A et al, 2000). The NTRK1 and NTRK2 over-expressing
cell lines were serum-starved overnight and treated with NGF or BDNF, respectively, at
37°C for treatment times from 0 to 12 hours. Total RNA was prepared using the RNeasy
Mini kit (Qiagen Inc., Valencia, CA) from NTRK1 and NTRK2-expressing cells exposed
either to 100 ng/ml of NGF or 20 ng/ml of BDNF at time points 0, 1.5, 4, or 12 hrs of
treatment. Microarray experiments were performed with strict adherence to the
manufacturer’s instructions (Affymetrix; Santa Clara, CA). Purified biotin-labeled cRNA
was fragmented, heated to 99°C for 5 min, and then hybridized at 45°C for 16 hours to
HG-U133A arrays. Each data point was sampled with 3 technical and 1 biological
duplicates. Expression intensity value signals corresponding to relative gene expression
were calculated by the Affymetrix MAS v5.0 software package. Intensity values were
then normalized (per gene) to the median of each gene’s expression across the entire
experiment to account for chip-to-chip variation and to facilitate comparisons, using the
RMA express software package (UC Berkeley, CA).
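The per-gene median normalization described above can be illustrated with a short Python sketch. It is not the procedure implemented by the software packages named in this section. It assumes positive, linear-scale intensities and that "normalized to the median" means dividing by the median, and the function name and toy values are hypothetical.

```python
from statistics import median


def normalize_per_gene(intensities):
    """Divide each gene's intensities by that gene's median across all arrays.

    `intensities` maps a gene symbol to a list of positive signal values, one
    per array in the experiment. Division (rather than subtraction on a log
    scale) is an assumption of this sketch.
    """
    normalized = {}
    for gene, values in intensities.items():
        center = median(values)
        normalized[gene] = [value / center for value in values]
    return normalized


# Toy intensities for illustration only.
toy = {"GENE_A": [100.0, 200.0, 400.0], "GENE_B": [50.0, 50.0, 25.0]}
scaled = normalize_per_gene(toy)
```

After this step every gene has a median of 1 across the experiment, so values above or below 1 can be compared between genes with very different absolute signal levels.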
Statistical analysis of differential gene expression
Normalized gene expression values were imported to the microarray data analysis
toolkit Multiple Experiment Viewer (MEV) v4.0 (TIGR, Rockville, MD). Paired
significance analysis of microarrays (SAM) was used to calculate differentially expressed
genes between NTRK1 and NTRK2-expressing cell lines. One hundred permutations
were used for multiple testing corrections during the process, and the false discovery rate
was kept at zero.
Text mining analysis
The gene mentions of all pre-2006 MEDLINE abstracts were extracted with a
previously developed named entity recognition (NER) process that uses the machine-
learning technique conditional random fields to build a statistically based entity
recognition model (Jin Y et al, 2006). A previously established rule-based normalization
process was then applied to the extracted gene mentions, which paired human gene
mentions with their corresponding official HGNC gene symbols to serve as standard
referents (Fang H et al, 2006). All genes co-mentioned in a MEDLINE abstract with
NTRK1 or NTRK2 were selected and co-occurrence frequencies were calculated. Genes
were considered to be differentially expressed in the literature if their co-occurrence
frequencies differed at least 5-fold between NTRK1 and NTRK2.
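The co-occurrence counting and 5-fold cut-off can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study. Each abstract is assumed to be already reduced to a set of normalized gene symbols, the function names are hypothetical, and a gene with zero co-occurrences on one side is assumed to qualify whenever it co-occurs at least once on the other side.

```python
from collections import Counter


def cooccurrence_counts(abstracts, target):
    """Count, for each gene, the abstracts that mention it together with target.

    `abstracts` is an iterable of sets of normalized gene symbols, one set per
    abstract (a hypothetical stand-in for the NER and normalization output).
    """
    counts = Counter()
    for genes in abstracts:
        if target in genes:
            counts.update(genes - {target})
    return counts


def differentially_mentioned(abstracts, a="NTRK1", b="NTRK2", fold=5):
    """Return (genes favoring a, genes favoring b) under a fold cut-off.

    Assumption: a gene favors `a` when it co-occurs with `a` at least once and
    at least `fold` times as often as with `b`.
    """
    with_a = cooccurrence_counts(abstracts, a)
    with_b = cooccurrence_counts(abstracts, b)
    candidates = (set(with_a) | set(with_b)) - {a, b}
    favors_a = {g for g in candidates if with_a[g] > 0 and with_a[g] >= fold * with_b[g]}
    favors_b = {g for g in candidates if with_b[g] > 0 and with_b[g] >= fold * with_a[g]}
    return favors_a, favors_b


# Toy abstracts for illustration only; these are not counts from MEDLINE.
abstracts = (
    [{"NTRK1", "EFNB3"}] * 5
    + [{"NTRK1", "NTRK2", "BDNF"}]
    + [{"NTRK2", "BDNF"}] * 6
    + [{"NTRK1", "TP53"}, {"NTRK2", "TP53"}]
)
favors_ntrk1, favors_ntrk2 = differentially_mentioned(abstracts)
```

In the toy data, a gene mentioned equally often with both receptors falls into neither set, which is the intended behavior of a fold-based cut-off.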
Statistical pathway analysis
Functional pathway analysis was performed through the Ingenuity pathway
analysis toolkit (Ingenuity, Redwood City, CA). Neuroblastoma related pathways were
pre-selected and the numbers of pathway-associated genes were determined for different
gene groups. Direct comparisons between groups were made by applying the
hypergeometric statistical test in order to determine the enrichment values of
neuroblastoma-relevant genes for the gene group integrating text mining results. The
Bonferroni step–down correction was used to calculate the multiple-test corrected P-
values for the statistical comparisons.
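The enrichment test and multiple-testing correction can be sketched with the Python standard library. This is an illustration, not the calculation performed by the Ingenuity toolkit. It assumes a one-sided (upper-tail) hypergeometric test, assumes that the Bonferroni step-down correction refers to Holm's procedure, and uses hypothetical function names. The counts are those reported in Table 4-1 for Groups A and D, and the sketch is not expected to reproduce the reported P-values exactly.

```python
from math import comb


def hypergeom_upper_tail(population, population_hits, sample, sample_hits):
    """P(X >= sample_hits) when `sample` genes are drawn without replacement."""
    upper = min(sample, population_hits)
    favorable = sum(
        comb(population_hits, k) * comb(population - population_hits, sample - k)
        for k in range(sample_hits, upper + 1)
    )
    return favorable / comb(population, sample)


def holm_step_down(p_values):
    """Holm's step-down Bonferroni adjustment, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Counts from Table 4-1: (genes in Group A, genes in Group D) per pathway.
table_4_1 = {
    "CD": (1979, 12), "CGP": (2251, 3), "CCSI": (1492, 7),
    "CM": (1068, 7), "NSDF": (897, 9), "CAO": (755, 11),
}
raw = {
    name: hypergeom_upper_tail(10459, in_a, 22, in_d)
    for name, (in_a, in_d) in table_4_1.items()
}
adjusted = dict(zip(raw, holm_step_down(list(raw.values()))))
```

Exact integer arithmetic with `math.comb` avoids the underflow that floating-point factorials would cause for a population of this size.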
NTRK1 and NTRK2-expressing cell lines and total RNA extractions were
prepared as described above. Extracted RNAs were reverse transcribed and amplified into
cDNAs using the TaqMan high-capacity archive kit (Applied Biosystems, Foster City,
CA). Primers and probes for each of 11 selected genes, as well as all other assay reagents,
were obtained with the TaqMan Gene Expression Assay kit (Applied Biosystems, Foster
City, CA). The TaqMan relative quantification procedure on a TaqMan 7500 instrument
was applied to determine the amount of each cDNA, with the housekeeping gene
GAPDH as the endogenous control. Each data point had 3 technical replicates.
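Relative quantification against an endogenous control is commonly computed with the comparative Ct (2^-ΔΔCt) method. The text does not state which calculation was used, so the following Python sketch is an assumption offered for illustration. The function name and the Ct values are hypothetical.

```python
from statistics import mean


def relative_expression(target_ct, control_ct, calib_target_ct, calib_control_ct):
    """Fold change of a target gene relative to a calibrator sample (2^-ddCt).

    Each argument is a list of technical-replicate Ct values. The control gene
    (for example GAPDH) normalizes for the amount of input cDNA, and the
    calibrator is the reference condition (for example the 0 hr time point).
    """
    delta_ct_sample = mean(target_ct) - mean(control_ct)
    delta_ct_calibrator = mean(calib_target_ct) - mean(calib_control_ct)
    return 2 ** -(delta_ct_sample - delta_ct_calibrator)


# Made-up Ct values: a gene at a later time point versus the 0 hr calibrator.
fold = relative_expression(
    target_ct=[24.0, 24.1, 23.9],
    control_ct=[18.0, 18.0, 18.0],
    calib_target_ct=[25.0, 25.1, 24.9],
    calib_control_ct=[18.0, 18.0, 18.0],
)
```

A target that crosses threshold one cycle earlier than in the calibrator, with an unchanged control, corresponds to roughly a two-fold increase in expression.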
Results and Discussion
Microarray-based differential gene expression analysis
To screen for differential responders to the NGF/NTRK1 and BDNF/NTRK2 pathways,
NTRK1- and NTRK2-expressing neuroblastoma (NB) cell lines were generated, and
expression profiles were obtained by microarray after NGF or BDNF exposure,
respectively. Using the parameters specified in the Methods section, statistical analysis
identified that across different time points, 751 known genes on the microarray chips
were differentially expressed between NTRK1 and NTRK2-expressing cell lines after
NGF or BDNF exposure. Specifically, 468 genes were found to be differentially over-
expressed in NTRK1 expressing cell lines relative to NTRK2-expressing cell lines, while
283 genes were observed with opposite expression behaviors (Figure 4-1). The 468 genes
(gene set 1) and 283 genes (gene set 2) are listed in the attached appendix A.
Integration of text mining analysis
To prioritize the array-determined differentially expressed genes based on their
functional relevance to NTRK1 and NTRK2 pathways, we applied our previously developed
gene mention extractor and rule-based normalizer to acquire all the gene symbols
co-mentioned with either NTRK1 or NTRK2. Among them, 514 genes were preferentially
associated with NTRK1 (co-occurring at least 5 times more often with NTRK1 than with
NTRK2), and 157 genes with NTRK2 (Figure 4-1). Both the 514 genes (gene set 3) and the
157 genes (gene set 4) are listed in Appendix A. We identified a total of 22 genes that
were differentially expressed in the same manner by both the expression array and BTM
methods. Of these, eighteen were overexpressed in the NTRK1 cell line on the chip and
preferentially associated with NTRK1 in text, and four were overexpressed in the NTRK2
cell line on the chip and preferentially associated with NTRK2 in text (Figure 4-1). We
selected the eight most overexpressed of the 18 NTRK1-associated genes, along with
three of the four NTRK2-associated genes, for experimental validation. We chose 5 as
the cut-off in order to limit the overlapping genes to a manageable set of higher-ranked
genes for the subsequent RT-PCR experiment. If the cut-off is lowered to 2, the numbers
of genes preferentially associated with NTRK1 or NTRK2 increase to 632 and 182,
respectively, and the number of overlapping genes increases to 31.
On chips                         Overlapped   In literature
468 genes up in NTRK1,           18 genes     514 genes preferentially
down in NTRK2 cell line                       associated with NTRK1
283 genes up in NTRK2,           4 genes      157 genes preferentially
down in NTRK1 cell line                       associated with NTRK2
Out of 10,459 known genes                     671 genes were preferentially
on the chips, 751 genes were                  associated with either NTRK1
found differentially expressed                or NTRK2 in literature
Figure 4-1. Differentially expressed genes on chips and preferentially associated genes in literature
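The integration step summarized in Figure 4-1 amounts to intersecting, for each receptor, the genes over-expressed on the chip with the genes preferentially associated with that receptor in text. The Python sketch below illustrates the idea with tiny sets. The function name is hypothetical, and the GENE_* symbols are placeholders rather than study data.

```python
def integrate(expression_up, literature_assoc):
    """Intersect expression and literature gene sets for each receptor.

    Both arguments map a receptor name to a set of gene symbols. A gene is
    kept only when both methods point to the same receptor.
    """
    return {
        receptor: expression_up[receptor] & literature_assoc.get(receptor, set())
        for receptor in expression_up
    }


# Tiny illustrative sets; GENE_A to GENE_D are placeholders, not study data.
expression_up = {
    "NTRK1": {"EFNB3", "TYRO3", "GENE_A"},
    "NTRK2": {"CAMK4", "GENE_B"},
}
literature_assoc = {
    "NTRK1": {"EFNB3", "TYRO3", "GENE_B", "GENE_C"},
    "NTRK2": {"CAMK4", "GENE_D"},
}
overlap = integrate(expression_up, literature_assoc)
```

GENE_B is dropped because the two methods disagree on its direction, which mirrors the requirement that genes be differentially represented in the same manner by both approaches.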
Functional pathway analysis
In order to explore the potential relevance of the derived gene lists to
neuroblastoma, we determined whether these sets were preferentially enriched for
biological pathways that were known to be critical for tumorigenesis and tumor
progression. The following four gene list groups were involved in this comparison:
Group A: The overall gene set: all 10,459 genes represented on the expression
arrays.
Group B: Out of Group A, the set of 751 genes differentially expressed
(biologically) in neuroblastoma cell lines constitutively expressing NTRK1 or NTRK2
and induced with the corresponding ligand.
Group C: Out of Group A, the 550 genes that were differentially represented in
the literature between NTRK1 and NTRK2.
Group D: The 22 genes that were consistently differentially expressed, either for
NTRK1 or NTRK2, by both techniques.
Functional pathways assigned to each gene in the above groups were identified
with the Ingenuity pathway analysis toolkit. We concentrated on six specific pathways
considered to be highly relevant to neurotrophic factor signaling in neuroblasts: cell
death, cell growth and proliferation, cell-to-cell signaling and interaction, cell
morphology, nervous system development and function, and cellular assembly and
organization. For each functional group, the number and the proportion of genes assigned
to each of those six pathways were calculated (Table 4-1).
Pathway   Group A        Group B       Group C       Group D
          (N=10,459)     (N=751)       (N=550)       (N=22)
CD        1979, 18.9%    153, 20.4%    309, 56.2%    12, 54.5%
CGP       2251, 21.5%    154, 20.5%    304, 55.3%    3, 13.6%
CCSI      1492, 14.3%    57, 9.98%     186, 33.8%    7, 31.8%
CM        1068, 10.2%    85, 11.3%     219, 39.8%    7, 31.8%
NSDF      897, 8.58%     108, 19.6%    148, 26.9%    9, 40.9%
CAO       755, 7.22%     103, 13.7%    115, 20.9%    11, 50%
Table 4-1. The number and proportion of genes in each gene group associated with selected
pathways. CD, cell death; CGP, cell growth and proliferation; CCSI, cell-to-cell signaling and
interaction; CM, cell morphology; NSDF, nervous system development and function; CAO,
cellular assembly and organization.
As shown in Table 4-1, when compared to the overall set of genes that were
surveyed for expression levels (Group A), the subset of 751 genes identified as being
significantly differentially expressed by expression array analysis alone (Group B) was
slightly or moderately enriched for four pathways (CD, CM, NSDF, and CAO) and was
actually reduced in the other two pathways (CGP and CCSI). Conversely, the set of genes
differentially mentioned in text (Group C) was highly enriched for all six relevant
pathways relative to the overall set and the expression array-alone set. Correspondingly,
the set of genes differentially expressed in both the microarray and text mining
experiments (Group D) was highly enriched for five of the six pathways; only the CGP
pathway did not show enrichment. To illustrate the Ingenuity-determined genes that are
relevant for the selected pathways, all the genes in the Group C subsets are listed in
Appendix B.
Pathway   Group B   Group C   Group D
CD        0.152     0.0166    <0.001
CGP       0.746     0.0216    0.728
CCSI      0.999     0.0227    0.009
CM        0.146     0.0109    0.001
NSDF      <0.001    <0.001    <0.001
CAO       <0.001    <0.001    <0.001
Table 4-2. Significance testing for six relevant protein pathways. Shown are P-values calculated in
comparisons between Groups B, C, or D relative to group A for each of the six pathways. Pathway
abbreviations are listed in Table 4-1.
To assess the statistical significance of the six selected pathway gene
enrichments for the three subset groups compared to the overall gene Group A, a
hypergeometric test was applied and the corresponding P-values were calculated (Table
4-2). The results show that both the text-mining Group C (all 6 pathways) and the
combined analysis Group D (5 out of 6 pathways) gene sets were significantly enriched
relative to the overall set for the selected pathways. Interestingly, the expression
array Group B gene set was enriched only for the NSDF and CAO pathways. To
determine whether the combined analysis Group D gene subset was further enriched
relative to the expression array Group B gene set, Group B was used as a reference set to
directly test whether Group D showed significant enrichment (Table 4-3).
Table 4-3. Significance testing for six relevant protein pathways. Shown are P-values calculated in a
comparison between Group D relative to group B for each of the six pathways. Pathway
abbreviations are listed in Table 4-1. The Bonferroni step-down correction was applied to account
for multiple testing.
Table 4-3 shows that the P-values for 5 out of 6 pathways are significant,
demonstrating the relevant gene enrichment capability of the integrated analysis method
compared to expression array analysis alone. This experiment suggests that at least in this
experimental paradigm, our text mining process is capable of enriching gene sets for
genes that are members of functional pathways critical for tumorigenic and tumor
progression processes in neuroblastoma.
RT-PCR Experimental Validation
To verify the genes identified by both text mining and
expression analysis, we selected 11 genes for further validation of expression using
RT-PCR. As in the expression array experiments, gene expression levels were
measured at four time points in cell lines stably expressing transfected NTRK1 or
NTRK2, after applying the corresponding neurotrophic factors to the media. Generally,
the RT-PCR results confirmed and more precisely defined the expression level
differences observed between the NTRK1 and NTRK2 expressing cell lines by the
microarray analysis. Specifically, expression level differences were concordant for 10 of
11 genes (Table 4-4). The gene GNAS was the lone outlier; GNAS was identified as
preferentially over-expressed in NTRK2-induced cell lines relative to NTRK1-induced
lines by RT-PCR, but the opposite was true in both the expression array and text mining
analyses.
Gene      Microarray   Literature   RT-PCR
TBC1D8    ↑NTRK2*      ↑NTRK2       ↑NTRK2
VSNL1     ↑NTRK2       ↑NTRK2       ↑NTRK2
CAMK4     ↑NTRK2       ↑NTRK2       ↑NTRK2
RPS6KA1   ↑NTRK1       ↑NTRK1       ↑NTRK1
EFNB3     ↑NTRK1       ↑NTRK1       ↑NTRK1
B3GAT1    ↑NTRK1       ↑NTRK1       ↑NTRK1
GNAS      ↑NTRK1       ↑NTRK1       ↑NTRK2
NEFH      ↑NTRK1       ↑NTRK1       ↑NTRK1
NEFL      ↑NTRK1       ↑NTRK1       ↑NTRK1
INA       ↑NTRK1       ↑NTRK1       ↑NTRK1
TYRO3     ↑NTRK1       ↑NTRK1       ↑NTRK1
Table 4-4. Differential behavior of 11 highly differentially expressed genes, as determined by three
methods.
* The designation ↑NTRK2 indicates that the overall expression level of this gene is higher in
NTRK2-expressing, BDNF-induced cell lines than in NTRK1-expressing, NGF-induced cell lines for
the “Microarray” and “RT-PCR” columns. For the “Literature” column, it indicates that this gene is
preferentially associated with NTRK2 rather than NTRK1 in biomedical text. The converse
holds for the ↑NTRK1 designation.
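The concordance summary reported for Table 4-4 can be recomputed directly from the direction calls in the table. The short Python sketch below transcribes those calls; the function name is hypothetical, and the code only tallies agreement between the three columns.

```python
# Direction calls transcribed from Table 4-4: (microarray, literature, RT-PCR).
calls = {
    "TBC1D8": ("NTRK2", "NTRK2", "NTRK2"),
    "VSNL1": ("NTRK2", "NTRK2", "NTRK2"),
    "CAMK4": ("NTRK2", "NTRK2", "NTRK2"),
    "RPS6KA1": ("NTRK1", "NTRK1", "NTRK1"),
    "EFNB3": ("NTRK1", "NTRK1", "NTRK1"),
    "B3GAT1": ("NTRK1", "NTRK1", "NTRK1"),
    "GNAS": ("NTRK1", "NTRK1", "NTRK2"),
    "NEFH": ("NTRK1", "NTRK1", "NTRK1"),
    "NEFL": ("NTRK1", "NTRK1", "NTRK1"),
    "INA": ("NTRK1", "NTRK1", "NTRK1"),
    "TYRO3": ("NTRK1", "NTRK1", "NTRK1"),
}


def split_by_concordance(calls):
    """Separate genes whose three calls agree from genes with any disagreement."""
    agree, disagree = [], []
    for gene, (microarray, literature, rt_pcr) in calls.items():
        if microarray == literature == rt_pcr:
            agree.append(gene)
        else:
            disagree.append(gene)
    return agree, disagree


agree, disagree = split_by_concordance(calls)
print(f"{len(agree)} of {len(calls)} concordant; outliers: {disagree}")
# prints: 10 of 11 concordant; outliers: ['GNAS']
```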
The objective of this study was to identify immediate-to-early response genes
expressed differentially between the two NTRK signaling pathways that might explain
the different growth behaviors of NTRK1- and NTRK2-expressing cell lines. Thus, we
characterized the RT-PCR-based expression differences more closely. One gene that
exhibited a striking and rapid expression induction was EFNB3. As demonstrated in
Figure 4-2, RT-PCR data show that the expression level of EFNB3 was substantially
up-regulated in NTRK1-expressing cell lines, with a two-fold increase in expression
observed from 0 to 4 hours after NGF application. Subsequently, by 12 hours expression
had decreased to the original level. Conversely, in the NTRK2-expressing cell line, the
activation of signaling by BDNF had little effect on the expression level of EFNB3 in
Figure 4-2. EFNB3 RT-PCR gene expression patterns in NTRK1 (blue) and NTRK2 (pink)-
expressing cell lines. Error bars are not shown. Variation for each data point was less than ±5‰.
EFNB3 (ephrin-B3) belongs to a family of ligands that bind to Eph family receptor
tyrosine kinases and has been implicated in axon guidance and other patterning processes
during vertebrate nervous system development (Bergemann AD et al, 1998). Remarkably,
previous studies have demonstrated that EFNB3 exhibits growth-suppressive activity
against neuroblastoma cells in vitro. Along with NTRK1, EFNB3 has been identified as a
gene whose expression is preferentially and significantly associated with low tumor stage
and favorable clinical outcomes in neuroblastoma primary tumors (Tang XX et al, 1999,
2000, 2004). The RT-PCR experiment shown in Figure 4-2 revealed the different responses
of EFNB3 expression after the activation of NTRK1 and NTRK2 signaling pathways.
The up-regulation of EFNB3 mRNA in the NTRK1-expressing cell line indicates that
NGF/NTRK1 signaling directly or indirectly activates the expression of EFNB3, while
BDNF/NTRK2 signaling has no substantial effect in this time range.
Figure 4-3. TYRO3 RT-PCR gene expression patterns in NTRK1 (blue) and NTRK2 (pink)-
expressing cell lines. Error bars are not shown. Variation for each data point was less than ±5‰.
Another gene with sizable differential expression was TYRO3. As seen in Figure 4-
3, TYRO3 expression was up-regulated by 20% in response to NGF-NTRK1 signal
transduction but remained unchanged in BDNF-NTRK2 signaling from 0 to 1.5 hours
after neurotrophin application. After 1.5 hours, TYRO3 expression decreased in both cell
lines, but the expression level differential actually continued to increase between the two
cell lines to 50% by 12 hours. TYRO3 is a trans-membrane receptor tyrosine kinase that
is activated by the ligand GAS6. The exact biological function of this signaling pathway
is yet to be determined. However, prior studies indicate that GAS6 promotes human fetal
oligodendrocyte survival and maturation by receptor activation and downstream
signaling, via the PI3-kinase/Akt pathway, in the absence of cell proliferation (Shankar
SL et al, 2003). Additional evidence suggests that GAS6 may contribute to cell adhesion,
immune responsiveness, and osteoclastic bone resorption through the MAPK signaling
pathway (Crosier KE et al, 1997; Heiring C et al, 2004).
Additionally, both the light and heavy neurofilament polypeptides (NEFL and NEFH)
were up-regulated in NTRK1-expressing cell lines but down-regulated in the
NTRK2-expressing cell line early after neurotrophin application (0 to 1.5 hr). These expression
changes might be expected to lead to changes in the cytoskeleton associated with
differential cellular growth and differentiation status between the two cell lines. Indeed,
addition of NGF induces neurite outgrowth in many neuroblastoma cell lines, and neurite
outgrowth has been shown to be positively correlated with neurofilament expression in
neuroblastoma (Linnala A et al, 1998). Finally, because of time constraints, we performed
only 3 technical replicates for each data point in the RT-PCR validation. Ideally,
biological replicates with independently extracted RNAs from different batches of
transfected cell lines should be analyzed in order to minimize the possibility of errors.
Researchers are confronted with a constant acceleration in the generation of
accumulated biomedical knowledge captured both in structured, readily generated forms
such as whole genome expression profiles, and from unstructured information
exemplified by biomedical literature. As such, researchers are increasingly in need of
novel means to capture, manage, and productively synthesize this information for specific
biomedical applications. Systematic data mining approaches such as the text mining tools
illustrated in this study can assist with ranking tasks using previously discovered but
disparate facts. This study was designed to integrate literature-based knowledge with the
analysis of high-throughput array data. Our results suggest that an unbiased
text mining-based method is capable not only of enriching for genes relevant to a
particular biological process, but also of providing a relevance ranking that may be
significant for identifying plausible candidate genes involved in differential processes.
The EFNB3 gene co-occurred with NTRK1 in the literature in five articles but did
not co-occur with NTRK2 at all. According to our hypothesis, this differential association
in biomedical text can be a strong indication that EFNB3 might play a specific role in
differential signaling between NTRK1 and NTRK2. In this case, the EFNB3 results can
be taken as a validation of the precision of the methods employed, but it is an expected
result both in terms of the literature reference and our verification of published
expression correlations between NTRK1 and neuroblastoma. However, the previously
published reports did not examine NTRK2 expression. Thus, our approach provided an
example of literature-based discovery by generating a higher relevance ranking for
EFNB3 as a differential signaling candidate than the expression array data alone
indicated. Further experimentation is required to determine a potential
role for EFNB3 in neuroblast differentiation. The fact that there was only 1 co-occurring
paper showing the indirect association of TYRO3 with NTRK1 indicates the lack of
previous investigation of TYRO3 in normal and malignant neuroblast development or
neurotrophin signaling pathways. However, the possible roles of TYRO3 in cell
proliferation and survival as well as its differential responses to NTRK1 signaling
demonstrated by RT-PCR make further studies worthwhile. To put the power of text mining
into perspective, among the 1576 genes co-occurring with NTRK1 and the 3882 articles
describing NTRK1, it would be difficult to identify EFNB3 (5 co-occurring papers) by
manual effort, and even harder to identify TYRO3 (only 1 co-occurring paper).
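The differential co-occurrence comparison described above can be illustrated with a minimal sketch. It assumes each article has already been reduced to a set of normalized gene symbols; the article identifiers, the toy data, and the function names are hypothetical and are not taken from the system used in this study. The ranking shown uses only the difference in article counts, which is a simplification of the relevance ranking discussed in the text.

```python
from collections import defaultdict


def cooccurrence_counts(article_genes, anchor):
    """Count, per gene, the articles in which it appears together with `anchor`."""
    counts = defaultdict(int)
    for genes in article_genes.values():
        if anchor in genes:
            for gene in genes:
                if gene != anchor:
                    counts[gene] += 1
    return dict(counts)


def differential_candidates(article_genes, anchor_a, anchor_b):
    """Rank genes by how unevenly they co-occur with the two anchor genes.

    Returns (gene, count_with_a, count_with_b) tuples, largest difference first;
    ties are broken alphabetically so the output is deterministic.
    """
    with_a = cooccurrence_counts(article_genes, anchor_a)
    with_b = cooccurrence_counts(article_genes, anchor_b)
    genes = (set(with_a) | set(with_b)) - {anchor_a, anchor_b}
    rows = [(g, with_a.get(g, 0), with_b.get(g, 0)) for g in genes]
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))


# Hypothetical toy corpus: article id -> set of gene symbols mentioned.
toy_articles = {
    "article_1": {"NTRK1", "EFNB3"},
    "article_2": {"NTRK1", "EFNB3", "TYRO3"},
    "article_3": {"NTRK2", "BDNF"},
    "article_4": {"NTRK1", "NTRK2", "BDNF"},
}

ranking = differential_candidates(toy_articles, "NTRK1", "NTRK2")
for gene, n_a, n_b in ranking:
    print(gene, n_a, n_b)
```

In this toy corpus EFNB3 ranks first because it co-occurs only with NTRK1, mirroring the pattern reported for the real literature.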
Since the text mining processes employed in this study are highly task-specific
and perform with high accuracy, we demonstrated that even a relatively straightforward
text mining application, when combined with molecular data analyses, appears to make
better predictions than expression data alone. This process is easily scaled to large
numbers of genes, so that many gene
interactions could be simultaneously surveyed for larger data sets or combinations of data
sets. Thus, with little additional effort, one could use the literature to "pre-annotate" all
gene probes so that they could be sorted by literature findings with ease. If additional
entity classes are added, the capabilities multiply geometrically. For example, we can
create an information matrix integrating genes with malignancy attribute classes. Then
the gene-clinical stage relation would tell us the gene sets associated with early and late
stages in addition to knowing the gene-gene associations.
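As a rough illustration of such an information matrix, the sketch below counts gene-attribute co-occurrences across annotated articles and then uses those counts to sort array probes. The record layout, the attribute labels, the probe names, and all function names are assumptions made for this example only.

```python
from collections import Counter


def build_association_matrix(annotated_articles):
    """Count articles in which a gene and a malignancy attribute are both mentioned."""
    matrix = Counter()
    for article in annotated_articles:
        for gene in set(article["genes"]):
            for attribute in set(article["attributes"]):
                matrix[(gene, attribute)] += 1
    return matrix


def genes_for_attribute(matrix, attribute):
    """List (gene, article_count) pairs for one attribute, strongest support first."""
    hits = [(gene, n) for (gene, attr), n in matrix.items() if attr == attribute]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


def sort_probes_by_literature(probe_to_gene, matrix, attribute):
    """Order array probes by literature support of their gene for an attribute."""
    def support(probe):
        return matrix[(probe_to_gene[probe], attribute)]
    return sorted(probe_to_gene, key=lambda probe: (-support(probe), probe))


# Hypothetical annotated articles and probe annotations.
toy_annotations = [
    {"genes": ["GENE_A", "GENE_B"], "attributes": ["early stage"]},
    {"genes": ["GENE_A"], "attributes": ["early stage"]},
    {"genes": ["GENE_C"], "attributes": ["late stage"]},
    {"genes": ["GENE_B", "GENE_C"], "attributes": ["late stage"]},
]
toy_probes = {"probe_1": "GENE_C", "probe_2": "GENE_A", "probe_3": "GENE_B"}

stage_matrix = build_association_matrix(toy_annotations)
print(genes_for_attribute(stage_matrix, "early stage"))
print(sort_probes_by_literature(toy_probes, stage_matrix, "early stage"))
```

Adding a further entity class only requires another key component in the matrix, which is one way to read the remark that capabilities multiply as classes are added.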
Co-occurrence-based information extraction can be further improved in a variety of
ways such as using proximity-based measures. Generally, article-level co-occurrence can
achieve high recall rates but lacks the ability to distinguish different types of relations or
to adequately rank such associations by relevance. For example, when we extracted all
genes co-occurring with NTRK1, genes related both directly and indirectly to NTRK1 were
extracted equally. As NLP-based information extraction methods continue to advance, it
is likely that deeper computational understanding of the syntactic and semantic
representations of text will lead to more successful and precise biomedical applications.
Recent work in identifying and extracting entity relations shows promise in this regard
(Jenssen TK et al, 2001; Rzhetsky A. et al, 2004).
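One way to picture a proximity-based refinement is to require that two gene mentions share a sentence, or to measure the token distance between them, rather than accepting any pair found in the same abstract. The sketch below is only an illustration under simplifying assumptions: sentence splitting is a naive regular expression, gene mentions are matched as exact tokens, and the example abstract is invented.

```python
import re


def tokenize(text):
    """Split text into alphanumeric tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def sentence_cooccurs(text, gene_a, gene_b):
    """True if both genes are mentioned within at least one sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        if gene_a in tokens and gene_b in tokens:
            return True
    return False


def min_token_distance(text, gene_a, gene_b):
    """Smallest token distance between mentions, or None if either is absent."""
    tokens = tokenize(text)
    pos_a = [i for i, tok in enumerate(tokens) if tok == gene_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == gene_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


# Invented abstract used only to exercise the functions.
toy_abstract = (
    "NTRK1 expression correlated with EFNB3 levels. "
    "TYRO3 was measured separately."
)

print(sentence_cooccurs(toy_abstract, "NTRK1", "EFNB3"))
print(sentence_cooccurs(toy_abstract, "NTRK1", "TYRO3"))
print(min_token_distance(toy_abstract, "NTRK1", "EFNB3"))
```

Both pairs co-occur at the article level here, but only the NTRK1 and EFNB3 pair shares a sentence, which is the kind of distinction article-level counting cannot make.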
Bergemann AD et al: Ephrin-B3, a ligand for the receptor EphB3, expressed at the
midline of the developing neural tube. Oncogene 16(4):471-80. (1998).
BioCreAtIvE: Critical Assessment of Information Extraction systems in Biology.
Borrello MG et al: TRK and TET protooncogene expression in human neuroblastoma
specimens: high-frequency of TRK expression in non-advanced stages. Intl. J. Cancer.
54: 540-545. (1993).
Brodeur GM: Neuroblastoma: biological insights into a clinical enigma. Nature Rev.
Cancer 3: 203-216. (2003).
Crosier KE et al: New insights into the control of cell growth; the role of the Axl family.
Pathology 29: 131-135. (1997).
Eggert A et al: Expression of the neurotrophin receptor TrkA down-regulates expression
and function of angiogenic stimulators in SH-SY5Y neuroblastoma cells. Cancer Res. 62:
Eggert A et al: Molecular dissection of TrkA signal transduction pathways mediating
differentiation in human neuroblastoma cells. Oncogene 19: 2043-2051. (2000).
Fang H et al: Human Gene Name Normalization Using Text Matching with
Automatically Extracted Synonym Dictionaries. BioNLP (2006).
Hanisch D et al: Rule-based protein and gene entity recognition. BMC Bioinformatics, 6,
Heiring C et al: Ligand recognition and homophilic interactions in Tyro3. J. Bio. Chem.
279(8): 6952-6958. (2004).
Ho R et al: Resistance to chemotherapy mediated by TrkB in neuroblastomas. Cancer
Res. 62: 6462-6466. (2002).
Jenssen TK et al: A literature network of human genes for high-throughput analysis of
gene expression. Nature Genet. 28: 21-28. (2001).
Jin Y et al: Automated recognition of malignancy mentions in biomedical literature.
BMC Bioinformatics 7: 492. (2006).
Kogner P et al: Coexpression of messenger RNA for TRK protooncogene and low
affinity nerve growth factor receptor in neuroblastoma with favorable prognosis. Cancer
Res. 53: 2044-2050. (1993).
Linnala A et al: Neuronal differentiation in SH-SY5Y human neuroblastoma cells
induces synthesis and secretion of tenascin and upregulation of a integrin receptors. J.
Neurosci. Res. 49: 53-63. (1998).
McDonald RT et al: An entity tagger for recognizing acquired genomic variations in
cancer literature. Bioinformatics 22(20): 3249-3251. (2004).
Nakagawara A et al: Inverse relationship between trk expression and N-myc
amplification in human neuroblastomas. Cancer Res. 52: 1364-1368. (1992).
Nakagawara A et al: Association between high levels of expression of the Trk gene and
favorable outcome in human neuroblastomas. N. Engl. J. Med. 328: 847-854. (1993).
Nakagawara A et al: Expression and function of TRK-B and BDNF in human
neuroblastomas. Mol. Cell. Biol. 14: 759-767. (1994).
Rzhetsky A et al: GeneWays: a system for extracting, analyzing, visualizing, and
integrating molecular pathway data. J. Biomed. Inform. 37: 43-53. (2004).
Shankar SL et al: The growth arrest-specific gene product Gas6 promotes the survival of
human oligodendrocytes via a phosphatidylinositol 3-kinase-dependent pathway. J
Neurosci. 23(10):4208-18. (2003).
Suzuki T et al: Lack of high-affinity nerve growth factor receptors in aggressive
neuroblastomas. J. Natl. Cancer Inst. 85: 377-384. (1993).
Tang XX et al: High level expression of EPHB6, EFNB2, and EFNB3 is associated with
low tumor stage and high TrkA expression in human neuroblastomas. Clin. Cancer Res.
5: 1491-1496. (1999).
Tang XX et al: Implications of EPHB6, EFNB2, and EFNB3 expressions in human
neuroblastoma. PNAS 97(20): 10936-10941. (2000).
Tang XX et al: Favorable neuroblastoma genes and molecular therapeutics of
neuroblastoma. Clin. Cancer Res. 10: 5837-5844. (2004).
Chapter 5. General Conclusions and Future Directions
The increasing demand for transforming unstructured biomedical research
literature into a form amenable to computational analysis provides both opportunities and
challenges for biomedical text mining. This dissertation started with a basic aspect of
BTM research, the definition of target biomedical entities. The complexity and criticality
of this endeavor have been underappreciated by the text mining community, which has
largely approached this problem from a computational linguistics perspective. Through
an extensive and iterative process, literature-based definitions were developed as they
emerged from a consensus-building process by annotators and domain experts. In
addition to the semantic challenges caused by the conceptual complexity of biomedical
entities, syntactical challenges were also dealt with by establishing specific annotation
guidelines in order to define distinct textual boundaries for each entity class. Using this
process, entity classes for genes, RNAs, and proteins; genomic variations; types of
malignancy; and phenotypic and clinical attributes of malignancy were carefully
established with distinct boundaries semantically and syntactically. Training data
generated through manual annotation in select corpora with those refined definitions
allowed the development of automated NER extractors, based on machine learning
algorithms, with accuracy rates satisfactory for specialized application by biomedical
researchers. Entity mentions were then extracted from pre-2006 MEDLINE abstracts and
normalized to unique identifiers through a rule-based computational procedure. Finally,
this thesis focused on BTM’s discovery capabilities by integrating text mining results
with high throughput data analysis to prioritize genes involved in differential cell
developmental signaling in neuroblastoma. Protein pathway analysis showed that the
addition of literature-based information was able to effectively re-prioritize functionally
relevant genes identified by microarray expression analysis. Experimental validation of
these results demonstrated that these re-prioritized genes were verifiable candidates
worthy of additional experimental characterization. This text mining-integrated method
provides researchers with a systematic and objective way to analyze experimental data and
to better hypothesize targets for subsequent research based upon previously discovered
and published knowledge.
With the steadily accelerating pace of biotechnological development and
knowledge accumulation, there is an increasing need for well-performing BTM
systems for a variety of purposes, including information extraction, document
retrieval and literature-based discovery. As end users struggling to manage and
synthesize an overwhelming amount of research information, biologists would be prudent
to collaborate closely with computer scientists on every front, including the adaptation of
BTM research to assist with solving biomedical problems. This dissertation has focused
upon investigations that attempt to build BTM systems with biological input
infused throughout the process. Accordingly, as an essential building block of many
BTM tasks, the development of our named entity recognition system incorporated
biological perspectives, which has been instrumental for the success of biomedical
applications built upon this process, such as our successful gene-centric information
retrieval system FABLE (FABLE).
The performance of entity extractors developed by our approach depends heavily
on the quality and quantity of training data. We have spent a substantial amount of time
creating manually annotated corpora in order to develop high-performance extractors.
However, further research is needed to determine the scope and size of training data
that would make the process most cost-effective. Normalization algorithms that
incorporate disambiguation schemes are also desired for improving entity recognition
performance since it is difficult for a pure rule-based approach to solve the problem of
ambiguous matches between mentions and unique identifiers. Effective disambiguation
approaches would likely need to survey distant contextual information in order to
determine the correct match (Chen L et al, 2005).
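To make the ambiguity problem concrete, the following sketch pairs a simple rule-based lookup with a context-overlap tie-breaker. It is not the normalization procedure used in this dissertation: the lexicon, the identifiers, the context profiles, and the scoring rule are all hypothetical, and a real system would need far richer contextual evidence.

```python
import re


def normalize_key(mention):
    """Rule-based key: drop whitespace, hyphens, underscores, slashes; lowercase."""
    return re.sub(r"[\s\-_/]+", "", mention).lower()


def resolve(mention, context, lexicon, profiles):
    """Map a mention to one identifier, using context words to break ambiguity.

    Returns None when the mention is unknown or the context does not clearly
    favor a single candidate.
    """
    candidates = lexicon.get(normalize_key(mention), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    words = set(re.findall(r"[a-z0-9]+", context.lower()))
    scored = sorted(
        ((len(words & profiles.get(c, set())), c) for c in candidates),
        reverse=True,
    )
    best_score, best_id = scored[0]
    if best_score == 0 or best_score == scored[1][0]:
        return None  # no evidence, or a tie between the top candidates
    return best_id


# Hypothetical lexicon (normalized name -> candidate ids) and context profiles.
toy_lexicon = {"trka": ["ID:1"], "abc1": ["ID:2", "ID:3"]}
toy_profiles = {
    "ID:2": {"kinase", "neuron"},
    "ID:3": {"transporter", "lipid"},
}

print(resolve("Trk-A", "", toy_lexicon, toy_profiles))
print(resolve("ABC1", "a lipid transporter in liver", toy_lexicon, toy_profiles))
print(resolve("ABC1", "no informative words here", toy_lexicon, toy_profiles))
```

The pure lookup handles spelling variants such as "Trk-A", while the ambiguous symbol is resolved only when the surrounding text supplies distinguishing words.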
Deeper parsing of the entity relations is another natural extension of this thesis
research. With the incorporation of linguistic analysis that includes deeper syntactic and
semantic processing (such as the parse tree and semantic role labeling systems developed
at Penn), entity relationships could be further mined with more precision and granularity.
For example, extraction of specific causal relationships between genes and malignancy
types from biomedical literature would be an important advance in application.
As the BTM tasks described above mature, it will be possible to
construct a structured and queryable cancer knowledgebase integrating the most complete
and up-to-date genomic, phenotypic and clinical information from the published
biomedical record. Further interpretation of experimental data against such a
knowledgebase should lead to more reliable and frequent literature-based discovery and
hypothesis generation.
Alako BT, Veldhoven A, Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster
G: CoPub Mapper: mining MEDLINE based on search term co-publication. BMC
Bioinformatics. 6:51. (2005).
BioCreAtIvE: Critical Assessment of Information Extraction systems in Biology.
Cairns J: The interface between molecular biology and cancer research. Mutat Res, 462:
Chang JT, Schutze H, Altman RB: GAPSCORE: finding gene and protein names one
word at a time. Bioinformatics, 20: 216-225 (2004).
Chen L, Friedman C: Extracting phenotypic information from the literature via natural
language processing. Medinfo, 11(Pt 2):758-762. (2004).
Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures.
Bioinformatics, 21: 248-256. (2005).
Cohen KB, Fox L, Ogren, PV, Hunter L: Corpus design for biomedical natural language
processing. Proceedings of the ACL-ISMB workshop on linking biological literature,
ontologies and databases, pp. 38-45. Association for Computational Linguistics. (2005).
Collier, N., Nobata, C. and Tsujii, J: Extracting the names of genes and gene products
with a hidden Markov model. In Proceedings of the 18th International Conference on
Computational Linguistics (COLING’2000), Saarbrucken, Germany. (2000).
Collier N, Takeuchi K: Comparison of character-level and part of speech features for
name recognition in biomedical texts. J Biomed. Inform. 37(6):423-435. (2004).
Daraselia N, Yuryev A, Egorov S, Novichkova S, Nikitin A, Mazo I: Extracting human
protein interactions from MEDLINE using a full-sentence parser. Bioinformatics, 20:
DiGiacomo RA, Kremer JM, Shah DM: Fish-oil dietary supplementation in patients with
Raynaud's phenomenon: a double-blind, controlled, prospective study. Am. J. Med.
Ding J, Berleant D, Nettleton D, Wurtelle E: Mining Medline: abstracts, sentences, or
phrases? Pac. Symp. Biocomput. 7: 326-337. (2002).
Finkel J, Dingare S, Manning CD, Nissim M, Alex B, Grover C. Exploring the
boundaries: gene and protein identification in biomedical text. BMC Bioinformatics, 6
Suppl 1:S5. (2005).
Friedman C, Hripcsak G, DuMouchel W, Johnson SB, Clayton PD: Natural language
processing in an operational clinical information system. Natural Language Engineering,
Freimer N, Sabatti C: The human phenome project. Nature Genet, 34: 15-21. (2003).
Fundel K, Guttler D, Zimmer R, Apostolakis JA: Simple approach for protein name
identification: prospects and limits. BMC Bioinformatics, 6, S15 (2005).
The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34(Database
GENIA: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ (2004).
Glenisson P, Anta P, Mathys J, Moreau Y, De Moor B: Evaluation of the vector space
representation in text-based gene clustering. Pac. Symp. Biocomput. 8: 391-402. (2003).
Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B: TXTGate:
profiling gene groups with text-based information. Genome Biol. 5: R43. (2004).
Hahn U, Romacker M, Schulz S: MEDSYNDIKATE--a natural language system for the
extraction of medical information from findings reports. Int J Med Inform 2002,
Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L, Leser U, Scheffer T:
Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6 Suppl
Hanisch D, Fundel K, Mevissen, HT, Zimmer R, Fluck JP: Rule-based protein and gene
entity recognition. BMC Bioinformatics, 6, S14. (2005).
Hunter L, Cohen KB: Biomedical language processing: what’s beyond PubMed? Mol.
Cell 21:589-594. (2006).
Jensen LJ, Saric J, Bork P: Literature mining for the biologist: from information retrieval
to biological discovery. Nature Rev. Genet. 7: 119-129. (2006).
Jenssen TK, Lagreid A, Komorowski J, Hovig E: A literature network of human genes for
high-throughput analysis of gene expression. Nature Genet. 28: 21-28. (2001).
Jin Y, McDonald RT, Lerman K, Mandel MA, Carroll S, Liberman MY, Pereira FC,
Winters RS, White PS: Automated recognition of malignancy mentions in
biomedical literature. BMC Bioinformatics, 7: 492. (2006).
Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, Schein A, Ungar L,
Winters S, White P: Integrated annotation for biomedical information extraction. Proc of
Lafferty J, McCallum A, Pereira F: Conditional Random Fields: Probabilistic Models for
Segmenting and Labeling Sequence Data. In: Proceedings of ICML-01: 282-289. (2001).
Lander ES, Linton LM, Birren B, Nusbaum C, et al: Initial sequencing and analysis of the
human genome. Nature, 409: 860-921, (2001).
Malignancy type definitions:
McDonald RT, Winters RS, Mandel, Jin Y, White PS and Pereira F. An entity tagger for
recognizing acquired genomic variations in cancer literature. Bioinformatics 22(20):
McDonald RT, Pereira FN: Identifying gene and protein mentions in text using
conditional random fields. BMC Bioinformatics, 6 Suppl 1:S6. (2005).
McDonald RT, Pereira F, Kulick, Winters RS, Jin Y, White P: Simple Algorithms for
Complex Relation Extraction with Applications to Biomedical IE. 43rd Annual Meeting
of the Association for Computational Linguistics, (2005).
Meldrum D: Automation for genomics, part two: sequencers, microarrays, and future
trends. Genome Res, 10:1081-1092, (2000).
Mitsumori T, Fation S, Murata M, Doi K, Doi H: Gene/protein name recognition based
on support vector machine using dictionary as features. BMC Bioinformatics. 6 Suppl
Muller HM, Kenny EE, Sternberg PW: Textpresso: an ontology-based information
retrieval and extraction system for biological literature. PloS Biol. 2, e309. (2004).
Novichkova, S., Egorov, S. and Daraselia, N. MedScan, a natural language processing
engine for MEDLINE abstracts. Bioinformatics, 19:1699-1706. (2003).
Penn BioIE corpus release v0.9 [http://bioie.ldc.upenn.edu]
Raychaudhuri S, Schutze H, Altman RB: Using text analysis to identify functionally
coherent gene groups. Genome Res. 12: 1582-1590. (2002).
Rzhetsky A, Iossifov I, Koike T, Krauthammer M, Kra P, Morris M, Yu H, Duboue PA,
Weng W, Wilbur JW, Hatzivassiloglou V, Friedman C: GeneWays: a system for
extracting, analyzing, visualizing, and integrating molecular pathway data. J. Biomed.
Inform. 37: 43-53. (2004).
Settles B: ABNER: an open source tool for automatically tagging genes, proteins and other entity
names in text. Bioinformatics, 21: 3191-3192 (2005).
Swanson DR: Fish oil, Raynaud's syndrome, and undiscovered public knowledge.
Perspect. Biol. Med., 30:7-18. (1986).
Swanson DR: Migraine and magnesium: eleven neglected connections. Perspect. Biol.
Med. 31: 526-557. (1988).
Swanson DR: Somatomedin C and arginine: implicit connections between mutually
isolated literatures. Perspect. Biol. Med. 33: 157-186. (1990).
Tamames J: Text Detective: a rule-based system for gene annotation in biomedical texts.
BMC Bioinformatics, 6 Suppl 1:S10. (2005).
Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN: MedMiner: an internet
text-mining tool for biomedical information, with application to gene expression
profiling. BioTech. 27: 1210-1217. (1999).
Tanabe L, Wilbur W: Tagging gene and protein names in biomedical text,
Bioinformatics, 18:1124-1132. (2002).
Tanabe L, Xie N, Thom LH, Matten W, Wilbur WJ: GENETAG: a tagged corpus for
gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3. (2005).
Temkin JM, Gilder MR: Extraction of protein interaction information from unstructured
text using a context-free grammar. Bioinformatics 19:2046-2053. (2003).
Torii M, Kamboj S, Vijay-Shanker K: Using name-internal and contextual features to
classify biological terms. J Biomed Inform 37(6):498-511. (2004).
van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining
analysis of the human phenome. Eur J Hum Genet 14(5):535-542. (2006).
Weeber M, Kors JA, Mons B: Online tools to support literature-based discovery in the
life sciences. Brief. In Bioinfo. 6: 277-286. (2005).
Wren JD, Bekeredjian, R., Stewart JA, Shohet, RV and Garner HR: Knowledge
discovery by automated identification and ranking of implicit relationships.
Bioinformatics, 20:389-398. (2004).
Yakushiji A, Tateisi Y, Miyao Y, Tsujii J: Event extraction from biomedical papers using
a full parser. Pac. Symp. Biocomput. 6: 408-419. (2001).
Yandell MD, Majoros WH: Genomics and natural language processing. Nat. Rev. Genet.,
Zhou G, Shen D, Zhang J, Su J, Tan S: Recognition of protein/gene names from text
using an ensemble of classifiers. BMC Bioinformatics 6 Suppl 1:S7. (2005).