VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
LE HOANG QUYNH
A HYBRID APPROACH
TO FINDING PHENOTYPE CANDIDATES
IN GENETIC TEXT
MASTER THESIS
Hanoi – 2012

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
LE HOANG QUYNH
A HYBRID APPROACH
TO FINDING PHENOTYPE CANDIDATES
IN GENETIC TEXT
Major : Computer Science
Code : 60 48 01
MASTER THESIS
Supervisor: Assoc.Prof. Ha Quang Thuy
Hanoi – 2012

A hybrid approach to finding phenotype
candidates in genetic texts
Le Hoang Quynh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Associate Professor. Ha Quang Thuy
A thesis submitted in fulfillment of the requirements
for the degree of
Master of Science in Computer Science
November 2012

ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and to the best of my knowledge
it contains no materials previously published or written by another person, or substantial
proportions of material which have been accepted for the award of any other degree
or diploma at the University of Engineering and Technology (UET/Coltech) or any other
educational institution, except where due acknowledgement is made in the thesis. Any
contribution made to the research by others, with whom I have worked at the University
of Engineering and Technology and the National Institute of Informatics (Tokyo, Japan)
or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual
content of this thesis is the product of my own work, except to the extent that assistance
from others in the project’s design and conception or in style, presentation and linguistic
expression is acknowledged.’
Hanoi, November 10th, 2012
Signed ........................................
Le Hoang Quynh

ABSTRACT
Named entity recognition (NER) has been extensively studied for the names of
genes and gene products, but few solutions have been proposed for phenotypes.
Phenotype terms are expected to play a key role in inferring gene function in complex
heritable diseases but are intrinsically difficult to analyse due to their complex
semantics and scale. In contrast to previous approaches, we evaluate state-of-the-art
techniques involving the fusion of machine learning on a rich feature set with
evidence from extant domain knowledge sources. The techniques are validated on two
gold standard collections, including a novel annotated collection of 112 abstracts
derived from a systematic search of the Online Mendelian Inheritance in Man database
for auto-immune diseases. Encouragingly, the hybrid model outperforms an HMM, a
CRF and a pure knowledge-based method, achieving an F1 of 75.37 for bodily feature
(BF) entities and a micro-average F1 of 84.01 for the whole system.
Publications:
• Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named
Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In Inter-
national Conference on Asian Language Processing 2010. Page 170-173. Harbin, China;
December 28-30, 2010, DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2010.73
• Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan and Quang-
Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named En-
tity Recognition and Person Property Extraction in Vietnamese Text. In Proceedings
of International Conference on Asian Language Processing 2011. Page 115-118. DOI:
http://doi.ieeecomputersociety.org/10.1109/IALP.2011.37
• Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe, Martin Hall-May
and Dietrich Rebholz-Schuhmann. A hybrid approach to finding phenotype candidates
in genetic text. In The 24th conference on Computational Linguistics (COLING 2012).
Accepted as long paper.

ACKNOWLEDGEMENTS
First and foremost, I would like to express my deep gratitude to my supervisor,
Assoc.Prof. Ha Quang Thuy, for his patient guidance and continuous support
throughout the years. He has always been there when I needed help, and has
responded to my queries helpfully and promptly.
I would like to express my gratitude to the National Institute of Informatics (NII,
Tokyo, Japan) for giving me the opportunity to work at NII through the NII International
Internship program. I also give my sincere thanks and appreciation to
Assoc.Prof. Nigel H. Collier, my internship supervisor at NII, for his great support.
I would like to thank all my teachers at the University of Engineering and
Technology (VNU), who have given me so much knowledge and experience.
I also want to thank my colleagues at the Knowledge and Technology laboratory
(UET, VNU) and my classmates for their enthusiastic and prompt help.
I sincerely acknowledge the Vietnam National University, NAFOSTED and the
QG.10.38 project for their financial support of my master's study.
Thanks also to all my friends, who are always by my side and cheer me on.
Finally, this thesis would not have been possible without the support and love
of my family. Thank you, mother and father. Thanks to my brother and sister, and
to my nephew. And thank you, my beloved husband. Again, thank you; I love all of
you so much ♥.

Table of Contents
1 Introduction
1.1 Motivation and problem definition
1.2 Phenotype definition
1.3 The challenges of phenotype entity recognition
2 Related works
2.1 Useful resources
2.1.1 GENIA and JNLPBA corpora
2.1.2 The online mendelian inheritance in man
2.1.3 The human phenotype ontology
2.1.4 The mammalian phenotype ontology
2.1.5 The unified medical language system
2.1.6 KMR corpus
2.2 Related researches
2.2.1 Baseline method: Khordad et al. (2011)
3 Methods
3.1 Schema
3.2 Annotated data sources
3.3 Proposed model
3.3.1 Pre-processing
3.3.2 Machine learning labeler
3.3.3 Knowledge-based labeler
3.3.4 Merge results
4 Experimental results and evaluation
4.1 Metrics
4.2 Experiments on the KMR corpus
4.3 Experiments on the Phenominer corpus
4.4 Discussion
4.4.1 Discussion on corpora
4.4.2 Discussion on results
5 Conclusion

List of Figures
2.1 A visual example of HPO hierarchical structure
2.2 A visual example of MP hierarchical structure
2.3 Khordad et al. (2011)’s system block diagram
3.1 An informal overview of bodily feature entity
3.2 Phenotype tagging architecture
3.3 Brat rapid annotation tool example
4.1 Column chart shows the experimental results on KMR corpus
4.2 Column chart shows the experimental results of BF entities on Phenominer corpus
4.3 Column chart shows the experimental results of GGP entities on Phenominer corpus

List of Tables
3.1 Referential semantics and scoping of mentions by entity type
3.2 List of auto-immune disease used to collect Phenominer corpus
3.3 Feature sets used in the machine learning labeler
3.4 Features exploited by the two learner models
4.1 Results for BF entity on the KMR corpus using models with partial matching
4.2 Results for each entity on the Phenominer corpus using models with partial matching
4.3 Sources of error by the Hybrid system on the KMR corpus
4.4 Sources of error by Khordad et al.’s system on the Phenominer corpus
4.5 Sources of error by the Hybrid system on the Phenominer corpus

List of Abbreviations
BF Bodily feature
CRF Conditional Random Field
GGP Gene and gene product
HMM Hidden Markov Model
HPO the Human Phenotype Ontology
KB Knowledge-based
ML Machine learning
MP the Mammalian Phenotype Ontology
NE Named entity
NER Named entity recognition

Chapter 1
Introduction
1.1 Motivation and problem definition
During the last decade biomedicine has developed tremendously. Every day many
biomedical papers are published and a great amount of information is produced.
Due to the rapidly increasing amount of biomedical literature available on the Web,
biomedical information extraction is becoming more and more important.
Biomedical named entity recognition (NER) is a subtask of biomedical information
extraction; it is a fundamental step that can affect the results of other
tasks. Biomedical NER is a computational technique used to identify and classify
strings of text (mentions) that designate important concepts in biomedicine. As the
first stage in the integrated semantic linking of knowledge between literature and
structured databases, it is critically important to maximize the effectiveness of this
step.
This thesis focuses on the analysis and identification of a new class of entity:
phenotypes. Following Hoehndorf et al. (2010), phenotypes are important for the analysis
of the molecular mechanisms underlying disease; they are also expected to play a key
role in inferring gene function in complex heritable diseases. Two considerations motivate
our work: (1) the database curation community has expressed a wish for full
text entity indexing and the inclusion of phenotypes (Dowell et al., 2009; Hirschman
et al., 2012), and (2) biomedicine is rapidly moving towards full-scale integration of
data, opening up the possibility of understanding complex heritable diseases caused by
genes. Association studies involving phenotypes are considered important to making
progress (Lage et al., 2007; Wu et al., 2008). The ultimate goal of the work we present
here is to allow relations mined from sentences such as the one we annotated below
to feed into novel hypothesis generation procedures. From Ex 1, the reader can easily
infer a relation between ‘IgG1 disorder’ and three genes/gene products marked as
GGP.
Ex 1. Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE
([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher prevalence
of high titre [rheumatoid factor]GGP and [antinuclear antibody]GGP, but a lower
prevalence of [anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30
U/ml. (Source PMCID: PMC1003566).
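The bracketed notation in Ex 1 can be read mechanically. The following is a minimal sketch, assuming that every mention is written as `[text]TYPE` with an upper-case type label and that mentions are not nested inside one another; the function names are illustrative and are not part of any tool described in this thesis.

```python
import re

# Matches "[mention text]TYPE"; the type is a run of capital letters.
# Assumption: mentions are flat (no bracket inside another bracket).
MENTION = re.compile(r"\[([^\[\]]+)\]([A-Z]+)")


def extract_mentions(annotated):
    """Return (text, entity_type) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in MENTION.finditer(annotated)]


def strip_annotation(annotated):
    """Recover the plain sentence by dropping brackets and type labels."""
    return MENTION.sub(lambda m: m.group(1), annotated)


example = (
    "Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE "
    "([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher "
    "prevalence of high titre [rheumatoid factor]GGP and "
    "[antinuclear antibody]GGP, but a lower prevalence of "
    "[anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30 U/ml."
)

mentions = extract_mentions(example)
ggp = [text for text, label in mentions if label == "GGP"]
phenotypes = [text for text, label in mentions if label == "PHENOTYPE"]
```

Here `ggp` holds the three gene/gene product mentions and `phenotypes` holds ‘IgG1 disorder’, which are the arguments of the relation the reader is expected to infer.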
1.2 Phenotype definition
Unlike genes or anatomic structures, phenotypes and their traits are complex
concepts and do not constitute a homogeneous class of objects (i.e. a natural kind).
Traits such as ‘eye colour’, ‘blood group’, ‘hemoglobin concentration’ or ‘facial gri-
macing’ describe morphological structures, physiological processes and behaviours.
When qualities or quantities of traits are used to describe a specific organism then
we have phenotypic descriptions, e.g. ‘blue eyes’, ‘blood group AB’, ‘not having
between 13 and 18 gm/dl hemoglobin concentration’.
Until recently, there has been little effort to provide data integration standards
for phenotypes. This means that phenotypic descriptions tend to be author/study
specific and biological results may go undiscovered if the terms used lie outside an
author’s immediate research area (Bard and Rhee, 2004). In some studies, the concept is
simply called ‘phenotypic information’ and the authors do not give any specific
definition for it (Hoehndorf et al., 2010). In the CSI-OMIM system (Cohen et al., 2011),
phenotypes are considered to be genetic terms including clinical signs and symptoms.
Freimer and Sabatti (2003) describe phenotypes as referring to ‘any morphologic,
biochemical, physiological or behavioral characteristic of an organism. . . . All
phenotypic characteristics represent the expression of particular genotypes combined with
the effects of specific environmental influences’. Khordad et al. (2011) define
phenotypes as ‘genetically-determined observable characteristics of a cell or organism,
including the result of any test that is not a direct test of the genotype. . . . A
phenotype of an organism is determined by the interaction of its genetic constitution and
the environment’.

Our definition of phenotype is taken from the formal analysis of Scheuermann
et al. (2009).
Definition: A phenotype entity is a (combination of) bodily feature(s)
of an organism determined by the interaction of its genetic make-up and
environment.
Examples include: [lack of kidney], [abnormal cell migration], [absent ankle
reflexes] as well as more complex cases such as [no abnormality in his heart],
[unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].
However, Scheuermann et al. (2009) also define a symptom as ‘a bodily feature of a
patient that is observed by the patient or clinician and suspected of being caused
by a disease’. Causality (or context) thus introduces an ambiguity:
a term may be a symptom in some contexts but refer to a phenotype in others, and
many symptoms may also be phenotypes. It is therefore important to recognize that this
phenotype definition requires us to know the underlying cause. Since causality is
often difficult to establish using narrow contextual evidence of the sort used in NER,
it seems reasonable that we focus here on identifying bodily features themselves, i.e.
phenotype candidates, and then determine causality in another stage of processing.
Definition: A bodily feature (BF) entity is a mention of a bodily quality
in an organism. It is considered a phenotype candidate.
Our definition of bodily features requires two caveats: (1) in contrast to Khordad
et al. (2011), we did not apply a granular cut-off at the level of the cell, and (2) because
of the diversity of bodily features across organisms, we decided to focus our
definition of this entity on mouse as a model organism and human as the most
important species.
1.3 The challenges of phenotype entity recognition
Unlike NER in the newswire domain, NER in the biomedical domain remains
a perplexing challenge. Biomedical NEs in general do not follow any nomenclature,
and can consist of long compound words or short abbreviations. Some even
contain various symbols or spelling variations. We summarize some challenges for BF
NER below (some of them are difficulties of NER in the biomedical domain mentioned
by Lin et al. (2004)):

• Unknown word identification: Unknown words are used extremely often.
Unknown words can be acronyms, abbreviations, or words containing hyphens,
digits, letters, and Greek letters. Moreover, the use of numerous synonyms and
homonyms makes recognition more difficult.
• Named entity boundary identification: The boundary of an NE can be a regular
English word, unknown word, Roman numeral, or digit. A BF can apply at all
levels of anatomical granularity, from chemical structures to cells and organs,
making it difficult to know where to draw a boundary. Additionally, nested
NEs (an NE embedded in another NE) further complicate this problem: a BF
can contain a GGP, a disease and even an organism.
• Named entity classification: Once an NE is identified, it is then classified into a
category such as GGP, anatomy, BF, and so on. Ambiguity and inconsistency
are often encountered at this stage. NEs with the same orthographical features
may fall into different categories (for example, there is considerable ambiguity between
BF and disease). In addition, BF entities are intrinsically more difficult to
analyze due to their complex semantics, scale and structure:
• Semantically, a BF can be an abnormal (in a disordered disposition) or normal
(in an ordered disposition) feature of humans or mice; it may or may not be a clinically
relevant characteristic of a human/mouse disease.
• The lack of standard nomenclatures, together with extensive and growing
vocabularies and the lack of naming agreement before a standard name is accepted,
makes BF recognition more difficult.
• BFs can be found with complex structure in various forms; sometimes even
biologists do not agree on the boundary of a BF. A BF may contain modifiers
(for example, quantifications that are either specific (e.g. ‘18 gm/dl’) or relative
(e.g. ‘normal’ or ‘increased’)); negations can be used to indicate the lack of an
anatomy/GGP or normal/abnormal qualities of an anatomy/GGP (for example:
[not having kidney], [not having between 13 and 18 gm/dl hemoglobin
concentration]), but they can also show that a human or mouse does not have a BF (for
example: there is [no abnormality in his heart], she has a [fever] but doesn’t
have a [cough]); conjoined cases happen when two or more BFs share one head
noun.
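The surface cues listed above (hyphens, digits, Greek letters, acronyms, Roman numerals) are the kind of orthographic evidence an NER system can test for on each token. The following is a minimal sketch under our own assumptions: the feature names and the example token ‘TNF-α’ are illustrative only and are not the feature set used later in this thesis.

```python
import re

# Any character from the Unicode "Greek and Coptic" block.
GREEK = re.compile(r"[\u0370-\u03FF]")
# Well-formed upper-case Roman numerals; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[IVXLCDM]+$)M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def orthographic_features(token):
    """Map one token to a dictionary of boolean surface features."""
    return {
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
        "has_greek": bool(GREEK.search(token)),
        "all_caps": token.isalpha() and token.isupper(),
        "mixed_case": token != token.lower() and token != token.upper(),
        "roman_numeral": bool(ROMAN.match(token)),
    }


features = {
    tok: orthographic_features(tok)
    for tok in ["IgG1", "anti-dsDNA", "TNF-α", "SLE", "IV", "kidney"]
}
```

An ordinary word such as ‘kidney’ triggers none of these features, whereas ‘IgG1’ and ‘anti-dsDNA’ each trigger several, which is why such cues help to flag unknown biomedical words. They do not, by themselves, resolve the boundary or classification problems.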

Given the motivation and challenges of phenotype recognition, the key contributions
of this thesis are: (1) to provide an operational semantics for identifying
phenotype candidates in text, (2) to introduce a set of guidelines and an annotated
corpus based on a selection of 19 clinically significant auto-immune diseases from
the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), one of
the most widely used gene-disease databases, and (3) to propose a new named entity
solution that uses statistical inference and external manually crafted resources,
in order to mitigate linguistic variation whilst still meeting the conceptual
expectations of biologists.
The remainder of this thesis is organized as follows. Chapter 2 presents
related research and useful resources. Chapter 3 describes
our Phenominer corpus version 1.0 and the proposed method for phenotype candidate
recognition. Chapter 4 presents the experimental results, evaluation and discussion.
Finally, chapter 5 gives the conclusions.

Chapter 2
Related works
The motivation and challenges described in chapter 1 have led to a
variety of proposed solutions involving a wide range of resources. In this chapter,
we review some useful resources in section 2.1: the GENIA and
JNLPBA corpora, the Online Mendelian Inheritance in Man (OMIM), the Human
Phenotype Ontology (HPO), the Mammalian Phenotype Ontology (MP), the Unified
Medical Language System (UMLS), etc. Then, in section 2.2, we introduce
related research in biomedical entity recognition and describe the method of
Khordad et al. (2011), which serves as our baseline method for BF.
2.1 Useful resources
Using available resources helps us not only to take advantage of knowledge from
other research but also to reduce effort. Many resources are now
used in bio-informatics. Among these, linguistically annotated corpora such as GENIA
(Tateisi et al., 2000; Kim et al., 2003) and curated resources such as OMIM
(Hamosh et al., 2005) have proven to be
central to the NER solution. However, due to the size of the vocabularies involved,
annotated corpora by themselves do not provide a complete solution. Researchers
have therefore also looked at the rich availability of formally structured biomedical
knowledge (ontologies) such as the Unified Medical Language System (UMLS)
(Bodenreider et al., 2002), the Human Phenotype Ontology (Robinson and Mundlos,
2010), the Mammalian Phenotype Ontology (Smith and Eppig, 2009), the Gene
Ontology (Gene Ontology Consortium, 2000), etc.
2.1.1 GENIA and JNLPBA corpora
The GENIA corpus version 3.0 (Kim et al., 2003) was formed from a controlled search
on MEDLINE using the MeSH terms ‘human’, ‘blood cells’ and ‘transcription
factors’. From this search, 2000 abstracts (20,546 sentences, more than 400,000 words)
were selected. This corpus has been released with linguistically rich annotations
including sentence boundaries, term boundaries, term classifications, semi-structured
coordinated clauses, recovered ellipsis in terms, etc. Entities are hand-annotated into
36 classes such as DNA, RNA, cell line, cell type and protein (almost 100,000
annotations).
The JNLPBA data set came from the GENIA version 3.02 corpus. It is the training
set for the Bio-Entity recognition task at JNLPBA (Kim et al., 2004). In this shared
task, the organizers simplified the 36 classes of the GENIA corpus and used only the
classes protein, DNA, RNA, cell line and cell type.
The GENIA and JNLPBA corpora are important for two major reasons: first,
they provide the largest single source of annotated training data for the NE task in
molecular biology, and second, the breadth of their classification. Following Kim et al.
(2004), although the number of classes in the GENIA/JNLPBA corpora is a fraction of the
classes contained in major taxonomies, it is still the largest class set that has been
attempted so far for the named entity recognition task. Moreover, the GENIA corpus
can also be used for other biomedical tasks, such as POS tagging.
2.1.2 The online mendelian inheritance in man
The Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) is a
continuously updated catalog of human genes and genetic disorders and traits, with
particular focus on the molecular relationship between genetic variation and pheno-
typic expression (genotype and phenotype). The full text and referenced overviews
in OMIM contain information on many mendelian disorders and over 12,000 genes.
Derived from the biomedical literature, OMIM is written and edited at Johns
Hopkins University with input from scientists and physicians around the world. Each
OMIM entry has a full text summary of a genetically determined phenotype and/or
gene and has numerous links to other genetic databases such as DNA and protein
sequence, PubMed references, general and locus-specific mutation databases, HUGO
nomenclature, MapViewer, GeneTests, patient support groups and many others.
Within an OMIM entry, there is a field called ‘Clinical Synopsis’, which is a list of
the clinical features of the disorder that appear in this entry or in its references.
There are over 4500 clinical synopses in OMIM; they are an important resource for
research on phenotypes.
OMIM is an easy and straightforward portal to the burgeoning information in human
genetics; it is now distributed electronically by the National Center for
Biotechnology Information¹. Over five decades OMIM has achieved great success; it is one of
the most important information sources about human genes and genetic phenotypes
(Cohen et al., 2011; Robinson and Mundlos, 2010).
Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic
features in its clinical synopsis section, which makes it inappropriate for data
mining usage. In section 2.1.3, we introduce the HPO, which is constructed using
OMIM.
2.1.3 The human phenotype ontology
The Human Phenotype Ontology (HPO)² is a standardized, controlled vocabulary
that allows phenotypic information to be described in an unambiguous fashion in
medical publications and databases (Robinson and Mundlos, 2010).
The HPO was originally constructed using data from OMIM by merging synonyms
and creating a hierarchical structure between terms according to their semantics.
The hierarchical structure in the HPO represents the subclass relationship; figure
2.1 illustrates the hierarchical structure of the HPO with the example of ‘atrioventricular
septal defect’ [HP:0010439] (example from Robinson and Mundlos (2010)).
The HPO currently contains over 9500 unique terms (more than 15000 synonyms)
describing human phenotypic features (statistics from 2012).
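The subclass hierarchy described above can be pictured as a graph in which each term points to its direct superclasses. The following is a minimal sketch, assuming that a term may have more than one parent; the identifiers `EX:0001` to `EX:0004` are hypothetical placeholders, not real HPO terms, and the functions are illustrative rather than part of any HPO tool.

```python
# Hypothetical toy ontology: child term -> list of direct parent terms.
IS_A = {
    "EX:0004": ["EX:0002", "EX:0003"],  # a term with two direct parents
    "EX:0003": ["EX:0001"],
    "EX:0002": ["EX:0001"],
    "EX:0001": [],                      # the root has no parent
}


def ancestors(term, is_a):
    """Collect every superclass reachable through subclass links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_subclass_of(term, candidate, is_a):
    """True if `candidate` is a direct or indirect superclass of `term`."""
    return candidate in ancestors(term, is_a)


all_parents = ancestors("EX:0004", IS_A)
```

With such a structure, a mention matched to a specific term can also be counted as an instance of each of its more general ancestors.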
Nevertheless, according to Khordad et al. (2011), the HPO is not complete, and they
report several problems in finding phenotype names in it:
(1) some acronyms and abbreviations are not available in the HPO;
(2) although the HPO contains synonyms of phenotypes, there are still some
synonyms that are not included in the HPO;
(3) in some cases adjectives and other modifiers are added to phenotype names,
making it difficult to find these phenotype names in the ontology;
(4) new phenotypes are being continuously introduced to the biomedicine world;
the HPO is being constantly refined, corrected, and expanded manually, but this process
is not fast enough, nor can the inclusion of new phenotypes be guaranteed.
Thus, although the HPO is a very useful resource, it is not sufficient on its own for
phenotype recognition; we should use it only as an additional resource.
¹ http://www.ncbi.nlm.nih.gov/omim/
² http://www.human-phenotype-ontology.org/
2.1.4 The mammalian phenotype ontology
The Mammalian Phenotype Ontology (MP) (Smith and Eppig, 2009) has been
applied to mouse phenotype descriptions in MGI³, RGD⁴, OMIA⁵ and elsewhere.
Use of this ontology allows comparisons of data from diverse sources, can facilitate
comparisons across mammalian species, assists in identifying appropriate experimental
disease models, and aids in the discovery of candidate disease genes and
molecular signaling pathways.
Like the HPO, the Mammalian Phenotype Ontology (MP) is a standardized,
hierarchically structured vocabulary. The highest level terms describe physiological
systems, survival, and behavior. The physiological systems branch into morphological
and physiological phenotype terms at the next node level. An example of the
hierarchical tree for the term ‘opisthotonus’ [MP:0002880] is shown in figure 2.2
(example from Smith and Eppig (2009)).
MP has about 9000 unique terms (about 24000 synonyms) of mouse abnormal
phenotype descriptions (statistics from 2012).
2.1.5 The unified medical language system
The Unified Medical Language System (UMLS) (Bodenreider et al., 2002) is a set
of files and software that brings together many health and biomedical vocabularies
and standards. The UMLS has three tools, called the Knowledge Sources:
the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools.
• The Metathesaurus is a very large, multi-purpose, and multi-lingual vocabulary
database that contains information about biomedical and health related
concepts, their various names, and the relationships among them. It contains
more than 1.8 million concepts drawn from more than 100 source vocabularies.
• The Metathesaurus is linked to the Semantic Network: all concepts in the
Metathesaurus are assigned to at least one semantic type from the Semantic
Network.
• MetaMap is a well-known tool in the UMLS SPECIALIST Lexicon and Lexical
Tools. It is a highly configurable application that maps biomedical text to
the UMLS Metathesaurus: MetaMap tokenizes the input text and chunks it into
phrases; each phrase is mapped to a set of candidate UMLS concepts; a word
sense disambiguation step then chooses the best candidate
with respect to the surrounding text.
However, the UMLS Semantic Network does not contain Phenotype as a semantic type,
so it alone is not adequate to distinguish between phenotypes and other objects in
text. In addition, some phenotype names do not exist in the UMLS Metathesaurus
at all. Nevertheless, the UMLS and its knowledge sources may still be useful for
phenotype recognition in some ways.
³ Mouse Genome Informatics Database: http://www.informatics.jax.org/
⁴ Rat Genome Database: http://rgd.mcw.edu
⁵ Online Mendelian Inheritance in Animals: http://omia.angis.org.au/
2.1.6 KMR corpus
We refer to the manually annotated corpus of Khordad et al. (2011) as the ‘KMR corpus’. It is
a collection of 3784 tokens (120 sentences) with 110 annotated phenotype mentions.
Sentences in the KMR corpus were taken from 4 PubMed papers from the year 2009 in
the area of human genetics. Annotation was conducted with reference to the HPO,
so that a term was tagged as a phenotype if it was in the HPO or if it was not in the
HPO but its definition showed that it was caused by a genotype.
It is not a well-known corpus and has only been used in the research of Khordad et al.
(2011). However, since annotated corpora for phenotypes are currently scarce, it is still a
valuable choice. We will use this corpus for testing and analyzing our proposed
model.
Above, we have introduced only the most typical resources useful for our research.
In addition to them, many other bio-informatics resources can be used,
such as the Medical Subject Headings⁶ and a gene list containing more than 9
million genes⁷.
⁶ MeSH: http://www.nlm.nih.gov/mesh/meshhome.html
⁷ Created by the National Center for Biotechnology Information, U.S. National Library of Medicine

2.2 Related researches
Named entity recognition in the biomedical domain has been extensively studied
and, as a consequence, many methods have been proposed. Some methods, like
MetaMap, are generic and find many kinds of entities in text. Other
methods are specialized to recognize particular types of entities. However, these
techniques tend to emphasize finding the names of genes, gene products, cells, diseases
and chemicals (Fukuda et al., 1998; Rindflesch et al., 1999; Collier et al., 2000;
Kazama et al., 2002; Zhou et al., 2003; Settles, 2004; Kim et al., 2004; Leaman and
Gonzalez, 2008). So far, only a small number of studies have addressed phenotypes,
and they are often based primarily on available resources or rule-based methods.
Whilst other authors have tried similar approaches for other entity types, none have
tried both machine learning and external resource lookup for a class as rich and
semantically complex as phenotypes.
In this section, we describe the method proposed by Khordad et al. (2011), which
is used as our baseline method for comparison in the experiments.
2.2.1 Baseline method: Khordad et al. (2011)
The system built in Khordad et al. (2011) is based on MetaMap and makes
use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from an
initial basic system that uses only these pre-existing tools, five rules that capture
stylistic and linguistic properties of this type of literature are proposed to enhance
the performance of the NER tool. A block diagram of the processing in Khordad et al. (2011)’s
system is shown in figure 2.3. The system performs the following steps:
• (1) MetaMap chunks the input text into phrases and assigns the UMLS se-
mantic types associated with each noun phrase.
• (2) The Disorder Recognizer analyzes the MetaMap output to find phenotypes
and phenotype candidates. This is the most important part of the method. It
is based primarily on the idea that a phenotype must belong to certain
UMLS semantic types. The UMLS Semantic Network contains 133 semantic
types, which are grouped into 15 more general Semantic Groups. Among
these, the Semantic Group Disorders contains 12 semantic types that are
close to the meaning of phenotype: Acquired Abnormality, Anatomical
Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease
or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning,
Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function,
and Sign or Symptom. In this step, phrases that do not belong to this
semantic group are rejected.
However, some semantic types in this semantic group may include concepts
that are not phenotypes. The 7 problematic semantic types are: Finding,
Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning,
Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction.
Therefore, if a phrase is assigned to one of these semantic types, it is treated
as a phenotype candidate, to be confirmed or rejected in step (3);
otherwise, it is accepted as a phenotype.
• (3) Phenotype candidates from the previous step are searched for in the HPO
using OBO-Edit (see footnote 8). Candidates that are found in the HPO are
recognized as phenotypes.
• (4) The Result Merger merges the phenotypes found by the Disorder Recognizer
and OBO-Edit and outputs the final list of phenotypes found
in the input text.
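The four steps above can be sketched in code. This is only a minimal illustration under stated assumptions, not the implementation of Khordad et al. (2011): the MetaMap output is assumed to be already available as (phrase, semantic type) pairs with a single semantic type per phrase, the HPO lookup is reduced to membership in a hypothetical set of lowercase term names, and all function and variable names are invented for this sketch.

```python
# The 12 semantic types of the UMLS Semantic Group "Disorders" (step 2).
DISORDER_TYPES = {
    "Acquired Abnormality", "Anatomical Abnormality",
    "Cell or Molecular Dysfunction", "Congenital Abnormality",
    "Disease or Syndrome", "Experimental Model of Disease", "Finding",
    "Injury or Poisoning", "Mental or Behavioral Dysfunction",
    "Neoplastic Process", "Pathologic Function", "Sign or Symptom",
}

# The 7 types that may also contain concepts that are not phenotypes.
PROBLEMATIC_TYPES = {
    "Finding", "Disease or Syndrome", "Experimental Model of Disease",
    "Injury or Poisoning", "Sign or Symptom", "Pathologic Function",
    "Cell or Molecular Dysfunction",
}


def recognize_phenotypes(phrases, hpo_terms):
    """Return the phenotypes among `phrases`.

    phrases   -- list of (text, semantic_type) pairs; stands in for the
                 MetaMap output of step 1 (assumption: one type per phrase)
    hpo_terms -- set of lowercase term names; stands in for the HPO lookup
                 of step 3 (assumption: exact string match)
    """
    confirmed, candidates = [], []
    for text, sem_type in phrases:
        if sem_type not in DISORDER_TYPES:
            continue                    # step 2: reject non-Disorders phrases
        if sem_type in PROBLEMATIC_TYPES:
            candidates.append(text)     # step 2: needs confirmation
        else:
            confirmed.append(text)      # step 2: accepted directly

    # Step 3: keep only the candidates that are found in the HPO.
    from_hpo = [t for t in candidates if t.lower() in hpo_terms]

    # Step 4: merge both lists, dropping duplicates; directly accepted
    # phrases come first, then HPO-confirmed candidates.
    return list(dict.fromkeys(confirmed + from_hpo))


# Toy input, invented for illustration only.
toy_phrases = [
    ("cleft palate", "Congenital Abnormality"),
    ("seizures", "Sign or Symptom"),
    ("hospital admission", "Finding"),
    ("BRCA1", "Gene or Genome"),
]
toy_hpo = {"seizures"}
result = recognize_phenotypes(toy_phrases, toy_hpo)
```

In this toy run, "cleft palate" is accepted directly, "seizures" is a candidate confirmed by the lookup, "hospital admission" is a candidate that is not confirmed, and "BRCA1" is rejected in step 2. The five additional rules of the enhanced system are not modelled here.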
This model was tested on a small corpus, KMR (described in section 2.1.6), annotated
by its authors. The reported results are a precision of 97.58, a recall of 88.32 and an F1 of 92.71.
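As a quick consistency check, F1 is the harmonic mean of precision and recall. The short sketch below (the function name is chosen for illustration) recomputes it from the two rounded figures quoted above. The result is within 0.01 of the reported F1; the small gap presumably comes from rounding of the published precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall as quoted in the text.
f1 = f1_score(97.58, 88.32)
print(round(f1, 2))
```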
8 OBO-Edit: the OBO ontology editor: http://oboedit.org/

Figure 2.1: A visual example of HPO hierarchical structure (HP:0010439)

Figure 2.2: A visual example of MP hierarchical structure (MP:0002880)

Figure 2.3: Khordad et al. (2011)’s system block diagram

Chapter 3
Methods
In this chapter, we first analyze in detail the two entity types employed in this study,
gene/gene product (GGP) and bodily feature (BF) (section 3.1). Then, in
section 3.2, we introduce our Phenominer corpus version 1.0, which is built around
19 auto-immune diseases; this corpus can be used for phenotype recognition as well
as for other biomedical problems. Finally, section 3.3 describes our proposed hybrid
model for recognizing BF and GGP entities. The model consists of three main parts:
a machine learning labeler, a knowledge-based labeler and a result merging module.
3.1 Schema
We employed two types of entity in our study: gene/gene product (GGP) and
bodily feature (BF).
GGP is proposed because (1) a subset of these entities are useful for applica-
tions that explore gene-phenotype relations, and (2) it allows us to compare our
results against the many biomedical NER studies of the past, e.g. Kim et al. (2004);
Rebholz-Schuhmann et al. (2010). Because of space limitations we will not provide a
rigidly formal definition or a taxonomic analysis (Beisswanger et al., 2008). Future
work will explore the relationships between these and other entity types.
In line with BioTop (Beisswanger et al., 2008), GGP is relatively straightforward
to define by the conjunction of (BioTop ID Nucleic Acid Structure) and (BioTop ID
Peptide Structure).
Definition: A gene/gene product (GGP) entity is a mention of one
of three major macro-molecules DNA, RNA or protein. DNA and RNA
are nucleic acid sequences containing the genetic instructions used in
the development and function of an organism. Proteins are polypeptide
sequences, or parts of polypeptide sequences, folded into structures that
facilitate biological function.
Examples include: [cryoglobulins], [anticardiolipin antibodies], [AFM044xg3],
[chromosome 17q], [CC16 protein].
As mentioned in chapter 1, in this thesis we use the bodily feature (BF) entity,
defined below, as our notion of phenotype candidate.
Definition: A bodily feature (BF) entity is a mention of a bodily quality
in an organism.
Examples include: [lack of kidney], [abnormal cell migration], [absent ankle
reflexes], as well as more complex cases such as [no abnormality in his heart],
[unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].
Figure 3.1 gives an informal overview of the bodily feature entity. It illustrates
some forms of BF observed while surveying the data: structural attributes,
qualitative attributes, functional attributes and process attributes.
Figure 3.1: An informal overview of bodily feature entity
• Structural attributes indicate any presence or absence of a physical component
(Anatomy or GGP).
For example: [having five fingers], [lack of kidney], [Peritoneal mesothelioma],
[missing one finger]
• Qualitative attributes describe qualities of physical components in an organism.
In simple cases, they have the form: Anatomy/GGP has (or does not have) a certain
quality. Qualities can describe any measurable characteristic, such as location,
color, size or mass, and even underspecified qualities of a human/mouse
body component. Most qualitative phenotypes mention a physical
component term, i.e. anatomy/GGP, but some do not (although
there is usually a hidden relation to a physical component).
For example: [black hair], [not having between 13 and 18 gm/dl hemoglobin
concentration], [adult female height 130-157 cm], [conjoined fingers]
• Functional attributes relate to the functions and dispositions of anatomy
(Hoehndorf et al., 2010). Intuitively, the functions of an anatomical entity
establish the reason (or cause) for its existence, while its dispositions
determine its capabilities and potentials. For example, endocrine pancreatic
cells have the function of producing insulin, and normally have a disposition
to produce insulin. In general, a functional attribute indicates the lack or
abnormality of an anatomical function.
For example: [facial grimacing], [sleepy facial expression], [reading disability],
[hypotension], [deaf]
• Process attributes represent characteristics of processes themselves. They
include characteristics of physiological processes, metabolic processes, biological
pathways, chemical reactions, gene-related processes, gene expression, etc.
Expressions of process attributes sometimes have a complex structure, but
following the discussion of phenotypes as processes in physiology (Hoehndorf et al.,
2012) we include some mentions of processes within the scope of our annotation
schema.
For example: [defective DNA repair after ultraviolet radiation damage],
[abnormality of metabolism], [proliferation of BAF-32 cells]
• The cases above are the most common forms of BF, but there are many
others that we cannot enumerate or group into classes. For example,
some non-measurable characteristics of a body component are
experienced only by the patient (human or mouse), such as pain or itchiness.
These characteristics cannot be objectively measured or observed
by others. Characteristics of this kind are complex and often have several
variants; in this work, they are also considered BF.
For example: [primary sunburn], [headache], [stress]
Table 3.1: Referential semantics and scoping of mentions by entity type
                            BF          GGP
specific reference          Yes         Yes
generic reference           Yes (1)     Yes
under-specified reference   No          No
modifiers                   Yes (2, 3)  No
conjunctions                Yes (4)     Yes (4)
processes                   Yes (5)     No
negation                    Yes (6)     No
Notes on annotation:
(1) An entity may be referred to by a generic name. Such expressions may be
anaphoric (i.e., refer to other mentions in the context), and are sometimes too vague
or descriptive to be called a named entity. However, because their information content
is valuable, generic names should be annotated in such cases. For example, [gene],
[gene expression], [asthma phenotype].
(2) Quantitative modifiers are included, e.g. [having five fingers], as well as spatial
modifiers, e.g. [abnormality in left hand].
(3) Qualitative modifiers are included. For example, physical components: [black hair],
underspecified ranges: [normal height], locational modifiers: [low set ears], and level
modifiers: [quite small fingers].
(4) Where there is elision of the head, e.g. [IA/H5 virus], annotate the whole
expression. Otherwise annotate each expression separately, e.g. [IA virus] and [H5
virus].
(5) However, we exclude finite verb forms, infinitive verb forms with ‘to’, verbs in a
progressive or perfect aspect, verb phrases, clauses or sentences, and any phrase with
a relative clause or complement clause.
(6) If the negation appears in a noun phrase with an anatomical entity then we
generally allow it, e.g. [absent ankle reflexes], [no left kidney].
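Table 3.1 can also be read as a small lookup structure, which may be convenient when checking annotations programmatically. The following is a minimal sketch of one possible encoding; the dictionary layout and the names `SCOPING`, `is_in_scope` and `allowed_properties` are assumptions made for illustration and are not part of the annotation guidelines. The per-cell conditions in the notes above are not modelled.

```python
# Table 3.1 as a lookup: property -> {entity type -> within annotation scope?}
SCOPING = {
    "specific reference":        {"BF": True,  "GGP": True},
    "generic reference":         {"BF": True,  "GGP": True},
    "under-specified reference": {"BF": False, "GGP": False},
    "modifiers":                 {"BF": True,  "GGP": False},
    "conjunctions":              {"BF": True,  "GGP": True},
    "processes":                 {"BF": True,  "GGP": False},
    "negation":                  {"BF": True,  "GGP": False},
}


def is_in_scope(entity_type, prop):
    """True if mentions with property `prop` are annotated for `entity_type`."""
    return SCOPING[prop][entity_type]


def allowed_properties(entity_type):
    """List the table rows marked 'Yes' for the given entity type."""
    return [prop for prop, row in SCOPING.items() if row[entity_type]]


bf_props = allowed_properties("BF")
ggp_props = allowed_properties("GGP")
```

For instance, `is_in_scope("BF", "negation")` is true while `is_in_scope("GGP", "negation")` is false, mirroring the last row of the table.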