VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY
LE HOANG QUYNH
A HYBRID APPROACH
TO FINDING PHENOTYPE CANDIDATES
IN GENETIC TEXT
MASTER THESIS
Hanoi – 2012
Major : Computer Science
Code : 60 48 01
Supervisor: Assoc.Prof. Ha Quang Thuy
A hybrid approach to finding phenotype
candidates in genetic texts
Le Hoang Quynh
Faculty of Information Technology
University of Engineering and Technology
Vietnam National University, Hanoi
Supervised by
Associate Professor. Ha Quang Thuy
A thesis submitted in fulfillment of the requirements
for the degree of
Master of Science in Computer Science
November 2012
ORIGINALITY STATEMENT
‘I hereby declare that this submission is my own work and that, to the best of my
knowledge, it contains no materials previously published or written by another person,
or substantial proportions of material which have been accepted for the award of any
other degree or diploma at University of Engineering and Technology (UET/Coltech) or
any other educational institution, except where due acknowledgement is made in the
thesis. Any contribution made to the research by others, with whom I have worked at
University of Engineering and Technology and National Institute of Informatics (Tokyo,
Japan) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the
intellectual content of this thesis is the product of my own work, except to the extent
that assistance from others in the project’s design and conception or in style,
presentation and linguistic expression is acknowledged.’
Hanoi, November 10th, 2012
Signed ........................................
Le Hoang Quynh
ABSTRACT
Named entity recognition (NER) has been extensively studied for the names of
genes and gene products, but few solutions have been proposed for phenotypes.
Phenotype terms are expected to play a key role in inferring gene function in complex
heritable diseases but are intrinsically difficult to analyse due to their complex
semantics and scale. In contrast to previous approaches, we evaluate state-of-the-art
techniques involving the fusion of machine learning on a rich feature set with
evidence from extant domain knowledge sources. The techniques are validated on two
gold standard collections, including a novel annotated collection of 112 abstracts
derived from a systematic search of the Online Mendelian Inheritance in Man database
for auto-immune diseases. Encouragingly, the hybrid model outperforms an HMM, a
CRF and a pure knowledge-based method, achieving an F1 of 75.37 for BF and a
micro-average F1 of 84.01 for the whole system.
Publications:
• Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named
Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In Inter-
national Conference on Asian Language Processing 2010. Page 170-173. Harbin, China;
December 28-30, 2010, DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2010.73
• Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan and Quang-
Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named En-
tity Recognition and Person Property Extraction in Vietnamese Text. In Proceedings
of International Conference on Asian Language Processing 2011. Page 115-118. DOI:
http://doi.ieeecomputersociety.org/10.1109/IALP.2011.37
• Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe, Martin Hall-May
and Dietrich Rebholz-Schuhmann. A hybrid approach to finding phenotype candidates
in genetic text. In The 24th International Conference on Computational Linguistics
(COLING 2012). Accepted as long paper.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deep gratitude to my supervisor,
Assoc.Prof. Ha Quang Thuy, for his patient guidance and continuous support
throughout the years. He was always there when I needed help, and responded to my
queries helpfully and promptly.
I would like to express my gratitude to the National Institute of Informatics (NII,
Tokyo, Japan) for giving me the opportunity to work at NII through the NII International
Internship program. I also sincerely thank Assoc.Prof. Nigel H. Collier, my internship
supervisor at NII, for his great support.
I would like to thank all my teachers at the University of Engineering and
Technology (VNU), who gave me much knowledge and experience.
I also want to thank my colleagues at the Knowledge and Technology laboratory
(UET, VNU) and my classmates for their enthusiastic and prompt help.
I sincerely acknowledge the Vietnam National University, NAFOSTED and the
QG.10.38 project for their financial support of my master's study.
And thanks to all my friends, who are always by my side and cheer me on.
Finally, this thesis would not have been possible without the support and love
of my family. Thank you, mother and father. Thanks to my brother and sister, and to
my nephew. And thank you, my beloved husband. Again, thank you; I love all of
you so much ♥.
Table of Contents
1 Introduction
1.1 Motivation and problem definition
1.2 Phenotype definition
1.3 The challenges of phenotype entity recognition
2 Related works
2.1 Useful resources
2.1.1 GENIA and JNLPBA corpora
2.1.2 The online mendelian inheritance in man
2.1.3 The human phenotype ontology
2.1.4 The mammalian phenotype ontology
2.1.5 The unified medical language system
2.1.6 KMR corpus
2.2 Related researches
2.2.1 Baseline method: Khordad et al. (2011)
3 Methods
3.1 Schema
3.2 Annotated data sources
3.3 Proposed model
3.3.1 Pre-processing
3.3.2 Machine learning labeler
3.3.3 Knowledge-based labeler
3.3.4 Merge results
4 Experimental results and evaluation
4.1 Metrics
4.2 Experiments on the KMR corpus
4.3 Experiments on the Phenominer corpus
4.4 Discussion
4.4.1 Discussion on corpora
4.4.2 Discussion on results
5 Conclusion
List of Figures
2.1 A visual example of HPO hierarchical structure
2.2 A visual example of MP hierarchical structure
2.3 Khordad et al. (2011)’s system block diagram
3.1 An informal overview of bodily feature entity
3.2 Phenotype tagging architecture
3.3 Brat rapid annotation tool example
4.1 Column chart shows the experimental results on KMR corpus
4.2 Column chart shows the experimental results of BF entities on Phenominer corpus
4.3 Column chart shows the experimental results of GGP entities on Phenominer corpus
List of Tables
3.1 Referential semantics and scoping of mentions by entity type
3.2 List of auto-immune disease used to collect Phenominer corpus
3.3 Feature sets used in the machine learning labeler
3.4 Features exploited by the two learner models
4.1 Results for BF entity on the KMR corpus using models with partial matching
4.2 Results for each entity on the Phenominer corpus using models with partial matching
4.3 Sources of error by the Hybrid system on the KMR corpus.
4.4 Sources of error by Khordad et al.’s system on the Phenominer corpus.
4.5 Sources of error by the Hybrid system on the Phenominer corpus.
List of Abbreviations
BF Bodily feature
CRF Conditional Random Field
GGP Gene and gene product
HMM Hidden Markov Model
HPO the Human Phenotype Ontology
KB Knowledge-based
ML Machine learning
MP the Mammalian Phenotype Ontology
NE Named entity
NER Named entity recognition
Chapter 1
Introduction
1.1 Motivation and problem definition
During the last decade biomedicine has developed tremendously. Every day many
biomedical papers are published and a great amount of information is produced.
Due to the rapidly increasing amount of biomedical literature available on the Web,
biomedical information extraction is becoming more and more important.
Biomedical named entity recognition (NER) is a subtask of biomedical information
extraction; it is a fundamental step and can affect the results of other
tasks. Biomedical NER is a computational technique used to identify and classify
strings of text (mentions) that designate important concepts in biomedicine. As the
first stage in the integrated semantic linking of knowledge between literature and
structured databases, it is critically important to maximize the effectiveness of this
step.
This thesis focuses on the analysis and identification of a new class of entity:
phenotypes. Following Hoehndorf et al. (2010), phenotypes are important for the analysis
of the molecular mechanisms underlying disease; they are also expected to play a key
role in inferring gene function in complex heritable diseases. Two considerations motivate
our work: (1) the database curation community has expressed a wish for full
text entity indexing and the inclusion of phenotypes (Dowell et al., 2009; Hirschman
et al., 2012), and (2) biomedicine is rapidly moving towards full-scale integration of
data, opening up the possibility to understand complex heritable diseases caused by
genes. Association studies involving phenotypes are considered important to making
progress (Lage et al., 2007; Wu et al., 2008). The ultimate goal of the work we present
here is to allow relations mined from sentences such as the one annotated below
to feed into novel hypothesis generation procedures. From Ex 1, the reader can easily
infer a relation between ‘IgG1 disorder’ and three genes/gene products marked as
GGP.
Ex 1. Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE
([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher prevalence
of high titre [rheumatoid factor]GGP and [antinuclear antibody]GGP, but a lower
prevalence of [anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30
U/ml. (Source PMCID: PMC1003566).
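To make the annotation style of Ex 1 concrete, the following minimal Python sketch reads the bracketed mentions and lists the phenotype/GGP pairs that co-occur in the sentence. It is an illustration only: the plain-text rendering (a mention in square brackets followed directly by an upper-case label), the regular expression and the function names are assumptions made here, and pairing by sentence co-occurrence is only a simple stand-in for relation mining, not the method developed in this thesis.

```python
import re

# Assumed plain-text rendering of Ex 1: a mention in square brackets
# immediately followed by an upper-case entity label.
MENTION = re.compile(r"\[([^\[\]]+)\]([A-Z]+)")


def extract_mentions(annotated):
    """Return (mention text, entity label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in MENTION.finditer(annotated)]


def pair_phenotypes_with_ggps(mentions):
    """Pair every PHENOTYPE mention with every GGP mention of one sentence."""
    phenotypes = [text for text, label in mentions if label == "PHENOTYPE"]
    ggps = [text for text, label in mentions if label == "GGP"]
    return [(p, g) for p in phenotypes for g in ggps]


example = (
    "Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE "
    "([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher "
    "prevalence of high titre [rheumatoid factor]GGP and "
    "[antinuclear antibody]GGP, but a lower prevalence of "
    "[anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30 U/ml."
)

mentions = extract_mentions(example)
pairs = pair_phenotypes_with_ggps(mentions)
for phenotype, ggp in pairs:
    print(phenotype, "<->", ggp)
```

The sketch pairs the single phenotype mention with each of the three GGP mentions, which is the relation a reader is expected to infer from the sentence; deciding whether such a pair is a true association is left to later processing.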
1.2 Phenotype definition
Unlike genes or anatomic structures, phenotypes and their traits are complex
concepts and do not constitute a homogeneous class of objects (i.e. a natural kind).
Traits such as ‘eye colour’, ‘blood group’, ‘hemoglobin concentration’ or ‘facial gri-
macing’ describe morphological structures, physiological processes and behaviours.
When qualities or quantities of traits are used to describe a specific organism then
we have phenotypic descriptions, e.g. ‘blue eyes’, ‘blood group AB’, ‘not having
between 13 and 18 gm/dl hemoglobin concentration’.
Until recently, there has been little effort to provide data integration standards
for phenotypes. This means that phenotypic descriptions tend to be author/study
specific and biological results may go undiscovered if the terms used lie outside an
author’s immediate research area (Bard and Rhee, 2004). In some studies, the concept is
simply called ‘phenotypic information’ and the authors do not give any specific
definition for it (Hoehndorf et al., 2010). In the CSI-OMIM system (Cohen et al., 2011),
phenotypes are considered to be genetic terms including clinical signs and symptoms.
Freimer and Sabatti (2003) describe phenotypes as referring to ‘any morphologic,
biochemical, physiological or behavioral characteristic of an organism. ... All
phenotypic characteristics represent the expression of particular genotypes combined with
the effects of specific environmental influences’. Khordad et al. (2011) define
phenotypes as ‘genetically-determined observable characteristics of a cell or organism,
including the result of any test that is not a direct test of the genotype. ... A
phenotype of an organism is determined by the interaction of its genetic constitution and
the environment’.
Our definition of phenotype is taken from the formal analysis of Scheuermann
et al. (2009).
Definition: A phenotype entity is a (combination of) bodily feature(s)
of an organism determined by the interaction of its genetic make-up and
environment.
Examples include [lack of kidney], [abnormal cell migration] and [absent ankle
reflexes], as well as more complex cases such as [no abnormality in his heart],
[unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].
However, Scheuermann et al. (2009) also define a symptom as ‘a bodily feature of a
patient that is observed by the patient or clinician and suspected of being caused
by a disease’. Causality (or context) therefore introduces an ambiguity:
a term may be a symptom in some contexts but refer to a phenotype in others, and
many symptoms may be phenotypes. Thus, it is important to recognize that this
phenotype definition requires us to know the underlying cause. Since causality is
often difficult to establish using narrow contextual evidence of the sort used in NER,
it seems reasonable to focus here on identifying the bodily features themselves, i.e.
phenotype candidates, and to determine causality in another stage of processing.
Definition: A bodily feature (BF) entity is a mention of a bodily quality
in an organism. It is considered to be a phenotype candidate.
Our definition of bodily features requires two caveats: (1) in contrast to Khordad
et al. (2011), we did not apply a granular cut-off at the level of the cell, and (2) because
of the diversity of bodily features across organisms, we decided to focus our
definition of this entity on mouse as a model organism and human as the most
important species.
1.3 The challenges of phenotype entity recogni-
tion
Unlike NER in the newswire domain, NER in the biomedical domain remains
a perplexing challenge. Biomedical NEs in general do not follow any nomenclature,
and can comprise long compound words or short abbreviations. Some even
contain various symbols or spelling variations. We summarize some challenges for BF
NER below (some of them are difficulties of NER in the biomedical domain mentioned
by Lin et al. (2004)):
• Unknown word identification: Unknown words are used extremely often.
They can be acronyms, abbreviations, or words containing hyphens,
digits, letters, and Greek letters. Moreover, the use of numerous synonyms and
homonyms makes recognition more difficult.
• Named entity boundary identification: The boundary of an NE can be a regular
English word, an unknown word, a Roman numeral, or a digit. A BF can apply at all
levels of anatomical granularity, from chemical structures to cells and organs,
making it difficult to know where to draw a boundary. Additionally, nested
NEs (an NE embedded in another NE) further complicate this problem: a BF
can contain a GGP, a disease and even an organism.
• Named entity classification: Once an NE is identified, it is then classified into a
category such as GGP, anatomy, BF, and so on. Ambiguity and inconsistency
are often encountered at this stage. NEs with the same orthographical features
may fall into different categories (for example, there is considerable ambiguity between
BF and disease).
In addition, BF entities are intrinsically more difficult to
analyze due to their complex semantics, scale and structure:
• Semantically, a BF can be an abnormal (in a disordered disposition) or normal
(in an ordered disposition) feature of humans or mice; it may or may not be a clinically
relevant characteristic of a human/mouse disease.
• The lack of a standard nomenclature, the extensive and growing set of names, and
the lack of naming agreement before a standard name is accepted make
BF recognition more difficult.
• BFs can have a complex structure in various forms; sometimes even
biologists do not agree on the boundary of a BF. A BF may contain modifiers
(for example, quantifications that are either specific (e.g. ‘18 gm/dl’) or
relative (e.g. ‘normal’ or ‘increased’)); negations can be used to indicate the lack of an
anatomy/GGP or normal/abnormal qualities of an anatomy/GGP (for example:
[not having kidney], [not having between 13 and 18 gm/dl hemoglobin
concentration]), but they can also show that a human or mouse does not have a BF (for
example: there is [no abnormality in his heart], she has a [fever] but doesn’t
have a [cough]); conjoined cases occur when two or more BFs share one head
noun.
Given the motivation and challenges of phenotype recognition, the key contributions
of this thesis are: (1) to provide an operational semantics for identifying
phenotype candidates in text, (2) to introduce a set of guidelines and an annotated
corpus based on a selection of 19 clinically significant auto-immune diseases from
the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), one of
the most widely used gene-disease databases, and (3) to propose a new
named entity solution that uses statistical inference and external manually crafted
resources in order to mitigate linguistic variation whilst still meeting the
conceptual expectations of biologists.
The remainder of this thesis is organized as follows. The second chapter
presents related research and useful resources. The third chapter describes
our Phenominer corpus version 1.0 and the proposed method for phenotype candidate
recognition. Experimental results, evaluation and discussion are given in the fourth
chapter. Finally, the fifth chapter presents the conclusions.
Chapter 2
Related works
The motivation and challenges described in chapter 1 have led to a
variety of proposed solutions involving a wide range of resources. In this chapter,
we review some useful resources in section 2.1: the GENIA and
JNLPBA corpora, the Online Mendelian Inheritance in Man (OMIM), the Human
Phenotype Ontology (HPO), the Mammalian Phenotype Ontology (MP), the Unified
Medical Language System (UMLS), etc. Then, in section 2.2, we introduce
related research in biomedical entity recognition and describe Khordad et al. (2011)
as our baseline method for BF.
2.1 Useful resources
Using available resources helps us not only to take advantage of knowledge from
other research but also to reduce effort. Many resources are now
used in bio-informatics. Among these, linguistic resources such as GENIA (Tateisi
et al., 2000; Kim et al., 2003) and OMIM (Hamosh et al., 2005) have proven to be
central to the NER solution. However, due to the size of the vocabularies involved,
annotated corpora by themselves do not provide a complete solution. Researchers
have therefore also looked at the rich availability of formally structured biomedical
knowledge (ontologies) such as the Unified Medical Language System (UMLS)
(Bodenreider et al., 2002), the Human Phenotype Ontology (Robinson and Mundlos,
2010), the Mammalian Phenotype Ontology (Smith and Eppig, 2009), the Gene
Ontology (Gene Ontology Consortium, 2000), etc.
2.1.1 GENIA and JNLPBA corpora
GENIA corpus version 3.0 (Kim et al., 2003) was formed from a controlled search
on MEDLINE using the MeSH terms ‘human’, ‘blood cells’ and ‘transcription
factors’. From this search, 2000 abstracts (20,546 sentences, more than 400,000 words)
were selected. This corpus has been released with linguistically rich annotations
including sentence boundaries, term boundaries, term classifications, semi-structured
coordinated clauses, recovered ellipsis in terms, etc. Entities are hand-annotated into
36 classes covering DNA, RNA, cell line, cell type and protein (almost 100,000
annotations).
The JNLPBA data set came from the GENIA version 3.02 corpus. It is the training
set for the Bio-Entity recognition task at JNLPBA (Kim et al., 2004). In this shared
task, the organizers simplified the 36 classes of the GENIA corpus and used only the
classes protein, DNA, RNA, cell line and cell type.
The GENIA and JNLPBA corpora are important for two major reasons: first,
they provide the largest single source of annotated training data for the NE task in
molecular biology, and second, the breadth of their classification. According to Kim et al.
(2004), although the number of classes in the GENIA/JNLPBA corpora is a fraction of the
classes contained in major taxonomies, it is still the largest class set that has been
attempted so far for the named entity recognition task. Moreover, the GENIA corpus
can also be used for other biomedical tasks, such as POS tagging.
2.1.2 The online mendelian inheritance in man
The Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) is a
continuously updated catalog of human genes and genetic disorders and traits, with
particular focus on the molecular relationship between genetic variation and phenotypic
expression (genotype and phenotype). The full text and referenced overviews
in OMIM contain information on many Mendelian disorders and over 12,000 genes.
Derived from the biomedical literature, OMIM is written and edited at Johns
Hopkins University with input from scientists and physicians around the world. Each
OMIM entry has a full text summary of a genetically determined phenotype and/or
gene and has numerous links to other genetic databases such as DNA and protein
sequence, PubMed references, general and locus-specific mutation databases, HUGO
nomenclature, MapViewer, GeneTests, patient support groups and many others.
Within an OMIM entry, there is a field called ‘Clinical Synopsis’, which is a list of
the clinical features of the disorder that appear in this entry or in its references.
There are over 4500 clinical synopses in OMIM; they are an important resource for
research on phenotypes.
OMIM is an easy and straightforward portal to the burgeoning information in human
genetics; it is now distributed electronically by the National Center for
Biotechnology Information [1]. Over five decades OMIM has achieved great success; it is one of
the most important information sources about human genes and genetic phenotypes
(Cohen et al., 2011; Robinson and Mundlos, 2010).
Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic
features in its clinical synopsis section, which makes it inappropriate for data
mining. In section 2.1.3, we introduce the HPO, which was constructed using
OMIM.
2.1.3 The human phenotype ontology
The Human Phenotype Ontology (HPO) [2] is a standardized, controlled vocabulary
that allows phenotypic information to be described in an unambiguous fashion in
medical publications and databases (Robinson and Mundlos, 2010).
The HPO was originally constructed using data from OMIM by merging synonyms
and creating the hierarchical structure between terms according to their semantics.
The hierarchical structure in the HPO represents the subclass relationship; figure
2.1 illustrates the hierarchical structure of the HPO using the example of ‘atrioventricular
septal defect’ [HP:0010439] (example from Robinson and Mundlos (2010)).
The HPO currently contains over 9500 unique terms (more than 15000 synonyms)
describing human phenotypic features (statistics from 2012).
[1] http://www.ncbi.nlm.nih.gov/omim/
[2] http://www.human-phenotype-ontology.org/
Nevertheless, according to Khordad et al. (2011), the HPO is not complete, and they report
several problems in finding phenotype names in it:
(1) some acronyms and abbreviations are not available in the HPO;
(2) although the HPO contains synonyms of phenotypes, there are still some
synonyms that are not included in the HPO;
(3) in some cases adjectives and other modifiers are added to phenotype names,
making it difficult to find these phenotype names in the ontology;
(4) new phenotypes are continuously being introduced to the biomedical world;
the HPO is constantly being refined, corrected, and expanded manually, but this process
is not fast enough, nor can the inclusion of new phenotypes be guaranteed.
Thus, although the HPO is a very useful resource, using it alone is not enough for
phenotype recognition; it should be used only as an additional resource.
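The lookup problems listed above can be illustrated with a small Python sketch of dictionary matching against an ontology-derived lexicon. Everything in it is a toy assumption: only the term ‘atrioventricular septal defect’ and its identifier come from the text, while the synonym, the acronym ‘AVSD’, the modifier ‘severe’ and all function names are made up for illustration and do not reproduce the content of the HPO.

```python
import re

# Toy lexicon standing in for an ontology term list. Only the first entry
# (term and identifier) is taken from the text; the synonym is hypothetical.
TOY_LEXICON = {
    "atrioventricular septal defect": "HP:0010439",
    "av septal defect": "HP:0010439",  # hypothetical synonym
}


def normalize(text):
    """Lower-case the text and keep alphanumeric tokens only."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def exact_lookup(phrase, lexicon=TOY_LEXICON):
    """Return the identifier if the whole normalized phrase is a lexicon entry."""
    return lexicon.get(normalize(phrase))


def longest_contained_term(phrase, lexicon=TOY_LEXICON):
    """Return the longest token span of the phrase that is a lexicon entry."""
    tokens = normalize(phrase).split()
    for size in range(len(tokens), 0, -1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in lexicon:
                return candidate, lexicon[candidate]
    return None


# Exact match succeeds for a listed term.
print(exact_lookup("Atrioventricular septal defect"))
# Problem (1): an acronym that is not listed is missed.
print(exact_lookup("AVSD"))
# Problem (3): an added modifier defeats exact matching.
print(exact_lookup("severe atrioventricular septal defect"))
# A span search recovers the embedded term, but not the full mention boundary.
print(longest_contained_term("severe atrioventricular septal defect"))
```

The span search shows why a lexicon alone is insufficient: it recovers the listed term inside a modified mention, but it cannot supply missing acronyms or synonyms, and it cannot decide whether the modifier belongs to the mention.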
2.1.4 The mammalian phenotype ontology
The Mammalian Phenotype Ontology (MP) (Smith and Eppig, 2009) has been
applied to mouse phenotype descriptions in MGI [3], RGD [4], OMIA [5] and elsewhere.
Use of this ontology allows comparisons of data from diverse sources, can facilitate
comparisons across mammalian species, assists in identifying appropriate experimental
disease models, and aids in the discovery of candidate disease genes and
molecular signaling pathways.
Like the HPO, the MP is a standardized,
hierarchically structured vocabulary. The highest-level terms describe physiological
systems, survival, and behavior. The physiological systems branch into morphological
and physiological phenotype terms at the next node level. An example
hierarchical tree for the term ‘opisthotonus’ [MP:0002880] is shown in figure 2.2
(example from Smith and Eppig (2009)).
MP has about 9000 unique terms (about 24000 synonyms) describing abnormal mouse
phenotypes (statistics from 2012).
2.1.5 The unified medical language system
The Unified Medical Language System (UMLS) (Bodenreider et al., 2002) is a set
of files and software that brings together many health and biomedical vocabularies
and standards. The UMLS has three tools, called the Knowledge Sources: the
Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools.
• The Metathesaurus is a very large, multi-purpose, and multi-lingual vocabulary
database that contains information about biomedical and health-related
concepts, their various names, and the relationships among them. It contains
more than 1.8 million concepts drawn from more than 100 source vocabularies.
• The Metathesaurus is linked to the Semantic Network: every concept in the
Metathesaurus is assigned at least one semantic type from the Semantic
Network.
• MetaMap is a well-known tool in the UMLS SPECIALIST Lexicon and Lexical
Tools. It is a highly configurable application that maps biomedical text to
the UMLS Metathesaurus: MetaMap tokenizes the input text and chunks it into
phrases, maps each phrase to a set of candidate UMLS concepts, and then applies a
word sense disambiguation step that chooses the best candidate
with respect to the surrounding text.
However, the UMLS Semantic Network does not contain Phenotype as a semantic type,
so it alone is not adequate to distinguish between phenotypes and other objects in
text. In addition, some phenotype names do not exist in the UMLS Metathesaurus
at all. Nevertheless, the UMLS and its knowledge sources may still be useful for
phenotype recognition in some ways.
[3] Mouse Genome Informatics Database: http://www.informatics.jax.org/
[4] Rat Genome Database: http://rgd.mcw.edu
[5] Online Mendelian Inheritance in Animals: http://omia.angis.org.au/
2.1.6 KMR corpus
We refer to the manually annotated corpus of Khordad et al. (2011) as the ‘KMR corpus’. It is
a collection of 3784 tokens (120 sentences) with 110 annotated phenotype mentions.
Sentences in the KMR corpus were taken from 4 PubMed papers from the year 2009 in
the area of human genetics. Annotation was conducted with reference to the HPO,
so that a term was tagged as a phenotype if it was in the HPO or if it was not in the
HPO but its definition showed that it was caused by a genotype.
It is not a well-known corpus and has only been used in the work of Khordad et al. (2011).
However, annotated corpora for phenotypes are currently lacking, so it is still a
valuable choice. We will use this corpus for testing and analyzing our proposed
model.
Above, we have introduced only some of the most typical useful resources for our
research. In addition to them, there are many other resources for bio-informatics
that can be used, such as Medical Subject Headings (MeSH) [6], a gene list containing
more than 9 million genes [7], etc.
[6] MeSH: http://www.nlm.nih.gov/mesh/meshhome.html
[7] Created by National Center for Biotechnology Information, U.S. National Library of Medicine
2.2 Related researches
Named entity recognition in the biomedical domain has been extensively studied
and, as a consequence, many methods have been proposed. Some methods, like
MetaMap, are generic and find many kinds of entities in text. Other
methods are specialized to recognize particular types of entities. However, these
techniques tend to emphasize finding the names of genes, gene products, cells, diseases
and chemicals (Fukuda et al., 1998; Rindflesch et al., 1999; Collier et al., 2000;
Kazama et al., 2002; Zhou et al., 2003; Settles, 2004; Kim et al., 2004; Leaman and
Gonzalez, 2008). So far, only a small number of studies have addressed phenotypes,
and they are often based primarily on available resources or rule-based methods.
Whilst other authors have tried similar approaches for other entity types, none have
tried both machine learning and external resource lookup for a class as rich and
semantically complex as phenotypes.
In this section, we describe the method proposed by Khordad et al. (2011), which
is used as our baseline method for comparison in the experiments.
2.2.1 Baseline method: Khordad et al. (2011)
The system built in Khordad et al. (2011) is based on MetaMap and makes
use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from an
initial basic system that uses only these pre-existing tools, five rules that capture
stylistic and linguistic properties of this type of literature are proposed to enhance
the performance of the NER tool. A block diagram of the processing in Khordad et al. (2011)’s
system is shown in figure 2.3. The system performs the following steps:
• (1) MetaMap chunks the input text into phrases and assigns the UMLS se-
mantic types associated with each noun phrase.
• (2) The Disorder Recognizer analyzes the MetaMap output to find phenotypes
and phenotype candidates. This is the most important part of this method,
it based primarily on the idea that phenotype must belong to some certain
UMLS semantic types. The UMLS Semantic Network contains 133 Semantic
Types which are categorized into 15 Semantic Groups that are more general.
In which, the Semantic Group Disorders contains 12 semantic types that are
close to the meaning of phenotype, they are: Acquired Abnormality, Anatomical
Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease
2.2. Related researches 12
or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning,
Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function,
Sign or Symptom. In this step, phrases that do not belong to this semantic group
are rejected.
However, several semantic types in this semantic group may include concepts
that are not phenotypes. The 7 problematic semantic types are: Finding,
Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning,
Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction.
Therefore, a phrase assigned to one of these semantic types is considered a
phenotype candidate, to be confirmed or rejected in step (3);
any other phrase in the group is accepted as a phenotype.
• (3) Phenotype candidates from the previous step are searched in the HPO using
OBO-Edit⁸. Phenotype candidates that are found in the HPO are recognized
as phenotypes.
• (4) The Result Merger merges the phenotypes found by the Disorder Recognizer and
OBO-Edit and outputs the final list of phenotypes found
in the input text.
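The decision logic of steps (2) to (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Khordad et al. (2011): the `Phrase` class stands in for MetaMap output, the HPO lookup is reduced to membership in a set of lower-cased term names instead of a search with OBO-Edit, and a phrase carrying any problematic semantic type is treated as a candidate even if it also carries a non-problematic one. The five additional rules are not modelled.

```python
from dataclasses import dataclass

# The 12 semantic types of the UMLS Semantic Group "Disorders" listed in step (2).
DISORDER_TYPES = frozenset({
    "Acquired Abnormality", "Anatomical Abnormality",
    "Cell or Molecular Dysfunction", "Congenital Abnormality",
    "Disease or Syndrome", "Experimental Model of Disease", "Finding",
    "Injury or Poisoning", "Mental or Behavioral Dysfunction",
    "Neoplastic Process", "Pathologic Function", "Sign or Symptom",
})

# The 7 types that may also contain concepts that are not phenotypes.
PROBLEMATIC_TYPES = frozenset({
    "Finding", "Disease or Syndrome", "Experimental Model of Disease",
    "Injury or Poisoning", "Sign or Symptom", "Pathologic Function",
    "Cell or Molecular Dysfunction",
})


@dataclass(frozen=True)
class Phrase:
    """Hypothetical stand-in for one noun phrase returned by MetaMap."""
    text: str
    semantic_types: frozenset


def recognize(phrases, hpo_terms):
    """Return the texts accepted as phenotypes, in input order."""
    accepted = []
    for phrase in phrases:
        disorder_types = phrase.semantic_types & DISORDER_TYPES
        if not disorder_types:
            continue  # step (2): not in the Disorders group, rejected
        if disorder_types & PROBLEMATIC_TYPES:
            # Phenotype candidate: step (3) confirms it against the HPO.
            # Assumption: hpo_terms is a set of lower-cased HPO term names.
            if phrase.text.lower() in hpo_terms:
                accepted.append(phrase.text)
        else:
            accepted.append(phrase.text)  # accepted directly in step (2)
    return accepted  # step (4): merged final list


demo_phrases = [
    Phrase("cleft palate", frozenset({"Congenital Abnormality"})),
    Phrase("headache", frozenset({"Sign or Symptom"})),
    Phrase("fracture", frozenset({"Injury or Poisoning"})),
    Phrase("insulin", frozenset({"Amino Acid, Peptide, or Protein"})),
]
demo_hpo = {"headache"}  # toy lookup table, not real HPO content
print(recognize(demo_phrases, demo_hpo))  # ['cleft palate', 'headache']
```

In the toy run, "cleft palate" is accepted directly, "headache" is a candidate confirmed by the lookup, "fracture" is a candidate that is not found, and "insulin" is rejected because it has no Disorders semantic type.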
This model was tested on a small corpus, KMR (described in section 2.1.6), annotated
by its authors. The reported results are a precision of 97.58, a recall of 88.32 and an F1 of 92.71.
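As a quick consistency check on these figures, F1 is the harmonic mean of precision and recall. The short sketch below (the function name `f1_score` is ours, chosen for illustration) recomputes it from the reported precision and recall; the result agrees with the reported 92.71 to within the rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Values reported for the baseline on the KMR corpus (percentages).
f1 = f1_score(97.58, 88.32)
print(f"F1 = {f1:.2f}")
```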
⁸ OBO-Edit: the OBO ontology editor: http://oboedit.org/
Figure 2.1: A visual example of HPO hierarchical structure (HP:0010439)
Figure 2.2: A visual example of MP hierarchical structure (MP:0002880)
Figure 2.3: Khordad et al. (2011)’s system block diagram
Chapter 3
Methods
In this chapter, we first analyze in detail the two entity types employed in this study:
gene/gene product (GGP) and bodily feature (BF) (section 3.1). Then, in
section 3.2, we introduce our Phenominer corpus version 1.0, which is built on
19 auto-immune diseases; this corpus can be used in phenotype recognition as well
as in other biomedical problems. Finally, section 3.3 describes our proposed hybrid
model for BF and GGP entity recognition. The model consists of three main parts:
a machine learning labeler, a knowledge-based labeler and a result-merging module.
3.1 Schema
We employed two types of entity in our study: gene/gene product (GGP) and
bodily feature (BF).
GGP is proposed because (1) a subset of these entities are useful for applica-
tions that explore gene-phenotype relations, and (2) it allows us to compare our
results against the many biomedical NER studies of the past, e.g. Kim et al. (2004);
Rebholz-Schuhmann et al. (2010). Because of space limitations we will not provide a
rigidly formal definition or a taxonomic analysis (Beisswanger et al., 2008). Future
work will explore the relationships between these and other entity types.
In line with BioTop (Beisswanger et al., 2008), GGP is relatively straightforward
to define by the conjunction of (BioTop ID Nucleic Acid Structure) and (BioTop ID
Peptide Structure).
Definition: A gene/gene product (GGP) entity is a mention of one
of three major macro-molecules DNA, RNA or protein. DNA and RNA
are nucleic acid sequences containing the genetic instructions used in
the development and function of an organism. Proteins are polypeptide
sequences, or parts of polypeptide sequences, folded into structures that
facilitate biological function.
Examples include: [cryoglobulins], [anticardiolipin antibodies], [AFM044xg3],
[chromosome 17q], [CC16 protein].
As mentioned in chapter 1, in this thesis, we use the definition of bodily feature
(BF) as Phenotype candidate.
Definition: A bodily feature (BF) entity is a mention of a bodily quality
in an organism.
Examples include: [lack of kidney], [abnormal cell migration], [absent ankle
reflexes] as well as more complex cases such as [no abnormality in his heart],
[unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].
Figure 3.1 gives an informal overview of the bodily feature entity. It shows
the forms of BF observed in our survey of the data: structural attributes,
qualitative attributes, functional attributes and process attributes.
Figure 3.1: An informal overview of bodily feature entity
• Structural attributes indicate any presence or absence of a physical component
(Anatomy or GGP).
For example: [having five fingers], [lack of kidney], [Peritoneal mesothelioma],
[missing one finger]
• Qualitative attributes show qualities of physical components in an organism. In
simple cases, they have the form: Anatomy/GGP has (or does not have) a certain
quality. Qualities can describe any measurable characteristic such as location,
color, size, mass, etc. and even underspecified qualities of a human/mouse
body component. Most qualitative phenotypes contain a mention of a physical
component term, i.e. anatomy/GGP, but some do not (although
there is usually a hidden relation to a physical component).
For example: [black hair], [not having between 13 and 18 gm/dl hemoglobin
concentration], [adult female height 130-157 cm], [conjoined fingers]
• Functional attributes are related to the functions and dispositions of anatomy
(Hoehndorf et al., 2010). Intuitively, the functions of an anatomical entity establish
the reason (or cause) that it exists, while its dispositions determine its
capabilities and potentials. For example, the endocrine pancreatic cells have
the function of producing insulin, and normally have a disposition to produce
insulin. In general, a functional attribute indicates the lack or abnormality of an
anatomical function.
For example: [facial grimacing], [sleepy facial expression], [reading disability],
[hypotension], [deaf]
• Process attributes represent characteristics of processes themselves. They
include characteristics of physiological processes, metabolic processes, biological
pathways, chemical reactions, gene-related processes, gene expression, etc. The
expression of a process attribute sometimes has a complex structure, but
following the discussion of phenotypes as processes in physiology (Hoehndorf et al.,
2012) we include some mentions of processes within the scope of our annotation
schema.
For example: [defective DNA repair after ultraviolet radiation damage], [ab-
normality of metabolism], [proliferation of BAF-32 cells]
• The cases above are the most common kinds of BF, but there are many
other cases that we cannot enumerate or group into classes. For example,
there are non-measurable characteristics of a body component that are
experienced by the patient (human or mouse), such as pain or itchiness.
These characteristics cannot be objectively measured or observed
by others. This kind of characteristic is complex and often has several
variants; in this work, such characteristics are also considered BF.
For example: [primary sunburn], [headache], [stress]
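The informal grouping above can be written down as a small data structure, which may help when inspecting annotated mentions. The sketch below is purely illustrative: the names `BFAttribute` and `BFMention` are hypothetical, the thesis does not state that the annotation scheme records the attribute category on each mention, and the assignment of each example follows the bullet in which it appears.

```python
from dataclasses import dataclass
from enum import Enum


class BFAttribute(Enum):
    """Informal BF categories from Figure 3.1, plus a catch-all."""
    STRUCTURAL = "structural"
    QUALITATIVE = "qualitative"
    FUNCTIONAL = "functional"
    PROCESS = "process"
    OTHER = "other"  # e.g. subjective experiences such as pain or itchiness


@dataclass(frozen=True)
class BFMention:
    """Hypothetical record for one annotated bodily feature mention."""
    text: str
    attribute: BFAttribute


# One example per category, taken from the bullets above.
EXAMPLES = [
    BFMention("lack of kidney", BFAttribute.STRUCTURAL),
    BFMention("black hair", BFAttribute.QUALITATIVE),
    BFMention("reading disability", BFAttribute.FUNCTIONAL),
    BFMention("abnormality of metabolism", BFAttribute.PROCESS),
    BFMention("headache", BFAttribute.OTHER),
]


def group_by_attribute(mentions):
    """Map every category to the list of mention texts assigned to it."""
    groups = {attribute: [] for attribute in BFAttribute}
    for mention in mentions:
        groups[mention.attribute].append(mention.text)
    return groups


for attribute, texts in group_by_attribute(EXAMPLES).items():
    print(attribute.value, texts)
```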
Table 3.1: Referential semantics and scoping of mentions by entity type
                             BF        GGP
specific reference           Yes       Yes
generic reference            Yes¹      Yes
under-specified reference    No        No
modifiers                    Yes²,³    No
conjunctions                 Yes⁴      Yes⁴
processes                    Yes⁵      No
negation                     Yes⁶      No
Notes on annotation:
¹ An entity may be referred to by a generic name. Such expressions may be
anaphoric (i.e., refer to other mentions in the context), and they are sometimes too
vague or descriptive to be called a named entity. However, because their information
content is valuable, the generic name should be annotated in such cases. For example,
[gene], [gene expression], [asthma phenotype].
² Quantitative modifiers are included, e.g. [having five fingers], as well as spatial
modifiers, e.g. [abnormality in left hand].
³ Qualitative modifiers are included. For example, physical components: [black hair],
underspecified ranges: [normal height], locational modifiers: [low set ears], and level
modifiers: [quite small fingers].
⁴ Where there is elision of the head, e.g. [IA/H5 virus], annotate the whole
expression. Otherwise annotate each expression separately, e.g. [IA virus] and [H5
virus].
⁵ However, we exclude finite verb forms, infinitive verb forms with ‘to’, verbs in a
progressive or perfect aspect, verb phrases, clauses or sentences, and any phrase with
a relative clause or complement clause.
⁶ If the negation appears in a noun phrase with an anatomical entity then we
generally allow it, e.g. [absent ankle reflexes], [no left kidney].
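For use in an annotation-checking script, Table 3.1 can be encoded as a simple lookup. The following is a minimal sketch under our own naming (`SCOPE_RULES` and `is_in_scope` are hypothetical); it records only the Yes/No cells of the table and does not attempt to encode the conditions spelled out in the notes.

```python
# Yes/No cells of Table 3.1, keyed by row label and then by entity type.
SCOPE_RULES = {
    "specific reference":        {"BF": True,  "GGP": True},
    "generic reference":         {"BF": True,  "GGP": True},
    "under-specified reference": {"BF": False, "GGP": False},
    "modifiers":                 {"BF": True,  "GGP": False},
    "conjunctions":              {"BF": True,  "GGP": True},
    "processes":                 {"BF": True,  "GGP": False},
    "negation":                  {"BF": True,  "GGP": False},
}


def is_in_scope(entity_type: str, feature: str) -> bool:
    """Return True if mentions of entity_type may include the given feature."""
    try:
        return SCOPE_RULES[feature][entity_type]
    except KeyError:
        raise ValueError(
            f"unknown feature or entity type: {feature!r}, {entity_type!r}"
        ) from None


print(is_in_scope("BF", "negation"))   # True
print(is_in_scope("GGP", "negation"))  # False
```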
Holdier Curriculum Vitae (April 2024).pdf
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 

A hybrid approach to finding phenotype candidates in genetic text.pdf

  • 1. VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY LE HOANG QUYNH A HYBRID APPROACH TO FINDING PHENOTYPE CANDIDATES IN GENETIC TEXT MASTER THESIS Hanoi – 2012
  • 2. VIETNAM NATIONAL UNIVERSITY, HANOI UNIVERSITY OF ENGINEERING AND TECHNOLOGY LE HOANG QUYNH A HYBRID APPROACH TO FINDING PHENOTYPE CANDIDATES IN GENETIC TEXT Major : Computer Science Code : 60 48 01 MASTER THESIS Supervisor: Assoc.Prof. Ha Quang Thuy Hanoi – 2012
  • 3. A hybrid approach to finding phenotype candidates in genetic texts Le Hoang Quynh Faculty of Information Technology University of Engineering and Technology Vietnam National University, Hanoi Supervised by Associate Professor. Ha Quang Thuy A thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science November 2012
  • 5. ORIGINALITY STATEMENT ‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at University of Engineering and Technology (UET/Coltech) or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at University of Engineering and Technology and National Institute of Informatics (Tokyo, Japan) or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’ Hanoi, November 10th, 2012 Signed ........................................ Le Hoang Quynh
  • 6. ABSTRACT
    Named entity recognition (NER) has been extensively studied for the names of genes and gene products, but there are few proposed solutions for phenotypes. Phenotype terms are expected to play a key role in inferring gene function in complex heritable diseases but are intrinsically difficult to analyse due to their complex semantics and scale. In contrast to previous approaches, we evaluate state-of-the-art techniques involving the fusion of machine learning on a rich feature set with evidence from extant domain knowledge sources. The techniques are validated on two gold standard collections, including a novel annotated collection of 112 abstracts derived from a systematic search of the Online Mendelian Inheritance in Man database for auto-immune diseases. Encouragingly, the hybrid model outperforms an HMM, a CRF and a pure knowledge-based method to achieve an F1 of 75.37 for BF and a micro-average F1 of 84.01 for the whole system.
    Publications:
    • Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In International Conference on Asian Language Processing 2010. Pages 170-173. Harbin, China; December 28-30, 2010. DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2010.73
    • Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan and Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In Proceedings of International Conference on Asian Language Processing 2011. Pages 115-118. DOI: http://doi.ieeecomputersociety.org/10.1109/IALP.2011.37
    • Nigel Collier, Mai-Vu Tran, Hoang-Quynh Le, Anika Oellrich, Ai Kawazoe, Martin Hall-May and Dietrich Rebholz-Schuhmann. A hybrid approach to finding phenotype candidates in genetic text. In The 24th conference on Computational Linguistics (COLING 2012). Accepted as long paper.
  • 7. ACKNOWLEDGEMENTS
    First and foremost, I would like to express my deep gratitude to my supervisor, Assoc.Prof. Ha Quang Thuy, for his patient guidance and continuous support throughout the years. He is always there when I need help, and responds to queries helpfully and promptly.
    I would like to express my gratitude to the National Institute of Informatics (NII - Tokyo, Japan) for giving me the great opportunity to work at NII in the NII International Internship program. I sincerely give my thanks and appreciation to Assoc.Prof. Nigel H. Collier, my internship supervisor at NII, for his great support.
    I would like to thank all my teachers at the University of Engineering and Technology (VNU), who have given me much knowledge and experience. I also want to thank my colleagues at the Knowledge and Technology laboratory (UET, VNU) and my classmates for their enthusiastic and prompt help.
    I sincerely acknowledge the Vietnam National University, NAFOSTED and the QG.10.38 project for their financial support of my master's study. Thanks also to all my friends, who are always by my side and cheer me on.
    Finally, this thesis would not have been possible without the support and love of my family. Thank you, mother and father. Thanks, brother and sister, and thanks to my nephew. And thank you, my beloved husband. Again, thank you and love all of you so much ♥.
  • 8. Table of Contents
    1 Introduction
      1.1 Motivation and problem definition
      1.2 Phenotype definition
      1.3 The challenges of phenotype entity recognition
    2 Related works
      2.1 Useful resources
        2.1.1 GENIA and JNLPBA corpora
        2.1.2 The online mendelian inheritance in man
        2.1.3 The human phenotype ontology
        2.1.4 The mammalian phenotype ontology
        2.1.5 The unified medical language system
        2.1.6 KMR corpus
      2.2 Related researches
        2.2.1 Baseline method: Khordad et al. (2011)
    3 Methods
      3.1 Schema
      3.2 Annotated data sources
      3.3 Proposed model
        3.3.1 Pre-processing
        3.3.2 Machine learning labeler
        3.3.3 Knowledge-based labeler
        3.3.4 Merge results
    4 Experimental results and evaluation
      4.1 Metrics
      4.2 Experiments on the KMR corpus
  • 9. Table of Contents (continued)
      4.3 Experiments on the Phenominer corpus
      4.4 Discussion
        4.4.1 Discussion on corpora
        4.4.2 Discussion on results
    5 Conclusion
  • 10. List of Figures
    2.1 A visual example of HPO hierarchical structure
    2.2 A visual example of MP hierarchical structure
    2.3 Khordad et al. (2011)’s system block diagram
    3.1 An informal overview of bodily feature entity
    3.2 Phenotype tagging architecture
    3.3 Brat rapid annotation tool example
    4.1 Column chart shows the experimental results on KMR corpus
    4.2 Column chart shows the experimental results of BF entities on Phenominer corpus
    4.3 Column chart shows the experimental results of GGP entities on Phenominer corpus
  • 11. List of Tables
    3.1 Referential semantics and scoping of mentions by entity type
    3.2 List of auto-immune disease used to collect Phenominer corpus
    3.3 Feature sets used in the machine learning labeler
    3.4 Features exploited by the two learner models
    4.1 Results for BF entity on the KMR corpus using models with partial matching
    4.2 Results for each entity on the Phenominer corpus using models with partial matching
    4.3 Sources of error by the Hybrid system on the KMR corpus
    4.4 Sources of error by Khordad et al.’s system on the Phenominer corpus
    4.5 Sources of error by the Hybrid system on the Phenominer corpus
  • 12. List of Abbreviations
    BF    Bodily feature
    CRF   Conditional Random Field
    GGP   Gene and gene product
    HMM   Hidden Markov Model
    HPO   the Human Phenotype Ontology
    KB    Knowledge-based
    ML    Machine learning
    MP    the Mammalian Phenotype Ontology
    NE    Named entity
    NER   Named entity recognition
  • 13. Chapter 1 Introduction
    1.1 Motivation and problem definition
    During the last decade biomedicine has developed tremendously. Every day many biomedical papers are published and a great amount of information is produced. Due to the rapidly increasing amount of biomedical literature available on the Web, biomedical information extraction is becoming more and more important.
    Biomedical named entity recognition (NER) is a subtask of biomedical information extraction; it is a fundamental step and can affect the results of other tasks. Biomedical NER is a computational technique used to identify and classify strings of text (mentions) that designate important concepts in biomedicine. As the first stage in the integrated semantic linking of knowledge between literature and structured databases, it is critically important to maximize the effectiveness of this step.
    This thesis focuses on the analysis and identification of a new class of entity: phenotypes. According to Hoehndorf et al. (2010), phenotypes are important for the analysis of the molecular mechanisms underlying disease; they are also expected to play a key role in inferring gene function in complex heritable diseases. Two thoughts motivate our work: (1) the database curation community has expressed a wish for full text entity indexing and the inclusion of phenotypes (Dowell et al., 2009; Hirschman et al., 2012), and (2) biomedicine is rapidly moving towards full-scale integration of data, opening up the possibility to understand complex heritable diseases caused by genes. Association studies involving phenotypes are considered important to making progress (Lage et al., 2007; Wu et al., 2008). The ultimate goal of the work we present
  • 14. here is to allow relations mined from sentences such as the one we annotated below to feed into novel hypothesis generation procedures. From Ex 1, the reader can easily infer a relation between ‘IgG1 disorder’ and three genes/gene products marked as GGP.
    Ex 1. Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE ([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher prevalence of high titre [rheumatoid factor]GGP and [antinuclear antibody]GGP, but a lower prevalence of [anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30 U/ml. (Source PMCID: PMC1003566).
    1.2 Phenotype definition
    Unlike genes or anatomic structures, phenotypes and their traits are complex concepts and do not constitute a homogeneous class of objects (i.e. a natural kind). Traits such as ‘eye colour’, ‘blood group’, ‘hemoglobin concentration’ or ‘facial grimacing’ describe morphological structures, physiological processes and behaviours. When qualities or quantities of traits are used to describe a specific organism then we have phenotypic descriptions, e.g. ‘blue eyes’, ‘blood group AB’, ‘not having between 13 and 18 gm/dl hemoglobin concentration’.
    Until recently, there has been little effort to provide data integration standards for phenotypes. This means that phenotypic descriptions tend to be author/study specific, and biological results may go undiscovered if the terms used lie outside an author’s immediate research area (Bard and Rhee, 2004). In some studies, phenotypes are simply referred to as ‘phenotypic information’ and the authors do not give any specific definition (Hoehndorf et al., 2010). In the CSI-OMIM system (Cohen et al., 2011), phenotypes are considered to be genetic terms including clinical signs and symptoms. Freimer and Sabatti (2003) describe phenotypes as referring to ‘any morphologic, biochemical, physiological or behavioral characteristic of an organism. . . . All phenotypic characteristics represent the expression of particular genotypes combined with the effects of specific environmental influences’. Khordad et al. (2011) define phenotypes as ‘genetically-determined observable characteristics of a cell or organism, including the result of any test that is not a direct test of the genotype. ... A phenotype of an organism is determined by the interaction of its genetic constitution and the environment’.
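The bracketed annotation style of Ex 1 can be read back into (mention, entity type) pairs with a few lines of code. The following is only a minimal sketch: the regular expression and the function names `extract_mentions` and `strip_annotations` are illustrative assumptions, not part of the thesis tooling, and the pattern handles flat (non-nested) annotations only.

```python
import re

# Assumed inline format: "[mention text]LABEL", with LABEL in upper case,
# as in Ex 1. Nested brackets are NOT supported by this simple pattern.
ANNOTATION = re.compile(r"\[([^\[\]]+)\]([A-Z]+)")

EXAMPLE = (
    "Among [patients]ORGANISM with [systemic lupus erythematosus]DISEASE "
    "([SLE]DISEASE), those with the [IgG1 disorder]PHENOTYPE have a higher "
    "prevalence of high titre [rheumatoid factor]GGP and "
    "[antinuclear antibody]GGP, but a lower prevalence of "
    "[anti-double-stranded DNA (anti-dsDNA) antibodies]GGP above 30 U/ml."
)


def extract_mentions(annotated):
    """Return (mention, label) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(annotated)]


def strip_annotations(annotated):
    """Return the plain sentence with brackets and labels removed."""
    return ANNOTATION.sub(lambda m: m.group(1), annotated)


if __name__ == "__main__":
    for mention, label in extract_mentions(EXAMPLE):
        print(f"{label:10s} {mention}")
    print(strip_annotations(EXAMPLE))
```

Grouping the extracted pairs by label gives one PHENOTYPE mention and three GGP mentions for this sentence, which are the candidates between which a later relation extraction stage would look for links.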
  • 15. Our definition of phenotype was taken from the formal analysis in Scheuermann et al. (2009).
    Definition: A phenotype entity is a (combination of) bodily feature(s) of an organism determined by the interaction of its genetic make-up and environment.
    Examples include: [lack of kidney], [abnormal cell migration], [absent ankle reflexes] as well as more complex cases such as [no abnormality in his heart], [unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].
    But Scheuermann et al. (2009) also define symptom as ‘a bodily feature of a patient that is observed by the patient or clinician and suspected of being caused by a disease’. We can see an ambiguity introduced by causality (or context) here: a term may be a symptom in some contexts but refer to a phenotype in others, and many symptoms may be phenotypes. Thus, it is important to recognize that this phenotype definition requires us to know the underlying cause. Since causality is often difficult to establish using narrow contextual evidence of the sort used in NER, it seems reasonable that we focus here on identifying bodily features themselves, i.e. phenotype candidates, and then determine causality in another stage of processing.
    Definition: A bodily feature (BF) entity is a mention of a bodily quality in an organism. It is considered a phenotype candidate.
    Our definition of bodily features requires two caveats: (1) in contrast to Khordad et al. (2011) we did not apply a granular cut-off at the level of the cell, and (2) because of the diversity of bodily features across organisms we took a decision to focus our definition of this entity on mouse as a model organism and human as the most important species.
    1.3 The challenges of phenotype entity recognition
    Unlike NER in the newswire domain, NER in the biomedical domain remains a perplexing challenge. Biomedical NEs in general do not follow any nomenclature, and can be comprised of long compound words or short abbreviations. Some even contain various symbols or spelling variations. We summarize some challenges for BF NER below (some of them are difficulties of NER in the biomedical domain mentioned by Lin et al. (2004)):
  • 16. • Unknown word identification: There is extensive use of unknown words. Unknown words can be acronyms, abbreviations, or words containing hyphens, digits, letters, and Greek letters. Moreover, the use of numerous synonyms and homonyms makes recognition more difficult.
    • Named entity boundary identification: The boundary of an NE can be a regular English word, unknown word, Roman numeral, or digit. A BF can apply at all levels of anatomical granularity from chemical structures to cells and organs, making it difficult to know where to draw a boundary. Additionally, nested NEs (an NE embedded in another NE) further complicate this problem: a BF can contain a GGP, a disease and even an organism.
    • Named entity classification: Once an NE is identified, it is then classified into a category such as GGP, anatomy, BF, and so on. Ambiguity and inconsistency are often encountered at this stage. NEs with the same orthographical features may fall into different categories (for example, there is considerable ambiguity between BF and disease).
    In addition, BF entities are intrinsically more difficult to analyze due to their complex semantics, scale and structure:
    • Semantically, a BF can be an abnormal (in a disordered disposition) or normal (in an ordered disposition) feature of humans or mice; it can be a clinically relevant characteristic of a human/mouse disease or not.
    • The lack of standard nomenclatures, the extensive and growing vocabulary, and the lack of naming agreement prior to a standard name being accepted make the problem of BF recognition more difficult.
    • BFs can be found with complex structure in various forms; sometimes even biologists do not agree on the boundary of a BF. A BF may contain modifiers (for example, quantifications that are either specific (e.g. ‘18 gm/dl’) or relative (e.g. ‘normal’ or ‘increased’)); negations can be used to indicate the lack of an anatomy/GGP or normal/abnormal qualities of an anatomy/GGP (for example: [not having kidney], [not having between 13 and 18 gm/dl hemoglobin concentration]), but they can also show that a human or mouse does not have a BF (for example: there is [no abnormality in his heart], she has a [fever] but doesn’t have a [cough]); conjoined cases happen when two or more BFs share one head noun.
  • 17. Given the motivation and challenges of phenotype recognition, the key contributions of this thesis are: (1) to provide an operational semantics for identifying phenotype candidates in text, (2) to introduce a set of guidelines and an annotated corpus based on a selection of 19 clinically significant auto-immune diseases from the Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005), one of the most widely used gene-disease databases, and (3) to mitigate linguistic variation whilst still meeting the conceptual expectations of biologists, by proposing a new named entity solution that uses statistical inference and external manually crafted resources.
    The remainder of this thesis is organized as follows. Chapter 2 presents related research and useful resources. Chapter 3 describes our Phenominer corpus version 1.0 and the proposed method for phenotype candidate recognition. Chapter 4 presents the experimental results, evaluation and discussion. Finally, chapter 5 gives the conclusions.
  • 18. Chapter 2 Related works
    The motivation and challenges mentioned in chapter 1 have led to a variety of proposed solutions involving a wide range of resources. In this chapter, we review some useful resources in section 2.1: the GENIA and JNLPBA corpora, the Online Mendelian Inheritance in Man (OMIM), the Human Phenotype Ontology (HPO), the Mammalian Phenotype Ontology (MP), the Unified Medical Language System (UMLS), etc. Then, in section 2.2, we introduce some related research in biomedical entity recognition and describe Khordad et al. (2011) as our baseline method for BF.
    2.1 Useful resources
    Using available resources helps us not only to take advantage of knowledge from other research but also to reduce effort. Many resources are now used in bio-informatics. Among these, linguistically annotated corpora such as GENIA (Tateisi et al., 2000; Kim et al., 2003) and resources such as OMIM (Hamosh et al., 2005) have proven to be central to the NER solution. However, due to the size of the vocabularies involved, annotated corpora by themselves do not provide a complete solution. Researchers have therefore also looked at the rich availability of formally structured biomedical knowledge (ontologies) such as the Unified Medical Language System (UMLS) (Bodenreider et al., 2002), the Human Phenotype Ontology (Robinson and Mundlos, 2010), the Mammalian Phenotype Ontology (Smith and Eppig, 2009), the Gene Ontology (Gene Ontology Consortium, 2000), etc.
  • 19. 2.1.1 GENIA and JNLPBA corpora
    GENIA corpus version 3.0 (Kim et al., 2003) was formed from a controlled search on MEDLINE using the MeSH terms ‘human’, ‘blood cells’ and ‘transcription factors’. From this search, 2000 abstracts (20,546 sentences, more than 400,000 words) were selected. This corpus has been released with linguistically rich annotations including sentence boundaries, term boundaries, term classifications, semi-structured coordinated clauses, recovered ellipsis in terms, etc. Entities are hand annotated into 36 classes of DNA, RNA, cell line, cell type and protein (almost 100,000 annotations). The JNLPBA data set came from the GENIA version 3.02 corpus. It is a training set for the Bio-Entity recognition task at JNLPBA (Kim et al., 2004). In this shared task, the 36 classes of the GENIA corpus were simplified and only the classes protein, DNA, RNA, cell line and cell type were used.
    The GENIA and JNLPBA corpora are important for two major reasons: first, they provide the largest single source of annotated training data for the NE task in molecular biology, and second, the breadth of their classification. According to Kim et al. (2004), although the number of classes in the GENIA/JNLPBA corpora is a fraction of the classes contained in major taxonomies, it is still the largest class set that has been attempted so far for the named entity recognition task. Moreover, the GENIA corpus can also be used for other biomedical tasks, such as POS tagging.
    2.1.2 The online mendelian inheritance in man
    The Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) is a continuously updated catalog of human genes and genetic disorders and traits, with particular focus on the molecular relationship between genetic variation and phenotypic expression (genotype and phenotype). The full text and referenced overviews in OMIM contain information on many mendelian disorders and over 12,000 genes. Derived from the biomedical literature, OMIM is written and edited at Johns Hopkins University with input from scientists and physicians around the world. Each OMIM entry has a full text summary of a genetically determined phenotype and/or gene and has numerous links to other genetic databases such as DNA and protein sequence, PubMed references, general and locus-specific mutation databases, HUGO nomenclature, MapViewer, GeneTests, patient support groups and many others. Within an OMIM entry, there is a field called ‘Clinical Synopsis’ which is a list of
  • 20. the clinical features of the disorder that appear in this entry or in its references. There are over 4500 clinical synopses in OMIM; they are an important resource for research on phenotypes.
    OMIM is an easy and straightforward portal to the burgeoning information in human genetics, and it is now distributed electronically by the National Center for Biotechnology Information [1]. Over five decades OMIM has achieved great success; it is one of the most important information sources about human genes and genetic phenotypes (Cohen et al., 2011; Robinson and Mundlos, 2010).
    Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic features in its clinical synopsis section, which makes it inappropriate for data mining. In section 2.1.3, we introduce the HPO, which is constructed using OMIM.
    2.1.3 The human phenotype ontology
    The Human Phenotype Ontology (HPO) [2] is a standardized, controlled vocabulary that allows phenotypic information to be described in an unambiguous fashion in medical publications and databases (Robinson and Mundlos, 2010). The HPO was originally constructed using data from OMIM by merging synonyms and creating a hierarchical structure between terms according to their semantics. The hierarchical structure in the HPO represents the subclass relationship; figure 2.1 illustrates the hierarchical structure of the HPO with the example of ‘atrioventricular septal defect’ [HP:0010439] (the example comes from Robinson and Mundlos (2010)). The HPO currently contains over 9500 unique terms (more than 15000 synonyms) describing human phenotypic features (statistics from 2012).
    Nevertheless, according to Khordad et al. (2011), the HPO is not complete, and several problems arise when finding phenotype names in it: (1) some acronyms and abbreviations are not available in the HPO; (2) although the HPO contains synonyms of phenotypes, there are still some synonyms that are not included in the HPO; (3) in some cases adjectives and other modifiers are added to phenotype names, making it difficult to find these phenotype names in the ontology; (4) new phenotypes are being continuously introduced to the biomedicine world,
    [1] http://www.ncbi.nlm.nih.gov/omim/
    [2] http://www.human-phenotype-ontology.org/
  • 21. HPO is being constantly refined, corrected, and expanded manually, but this process is not fast enough nor can the inclusion of new phenotypes be guaranteed.
    Thus, although the HPO is a very useful resource, using it alone is not enough for phenotype recognition; we should use it only as an additional resource.
    2.1.4 The mammalian phenotype ontology
    The Mammalian Phenotype Ontology (MP) (Smith and Eppig, 2009) has been applied to mouse phenotype descriptions in MGI [3], RGD [4], OMIA [5] and elsewhere. Use of this ontology allows comparisons of data from diverse sources, can facilitate comparisons across mammalian species, assists in identifying appropriate experimental disease models, and aids in the discovery of candidate disease genes and molecular signaling pathways.
    Similar to the HPO, the Mammalian Phenotype Ontology (MP) is a standardized, hierarchically structured vocabulary. The highest level terms describe physiological systems, survival, and behavior. The physiological systems branch into morphological and physiological phenotype terms at the next node level. An example of the hierarchical tree for the term ‘opisthotonus’ [MP:0002880] is shown in figure 2.2 (the example comes from Smith and Eppig (2009)). MP has about 9000 unique terms (about 24000 synonyms) of mouse abnormal phenotype descriptions (statistics from 2012).
    2.1.5 The unified medical language system
    The Unified Medical Language System (UMLS) (Bodenreider et al., 2002) is a set of files and software that brings together many health and biomedical vocabularies and standards. The UMLS has three tools, which are called the Knowledge Sources: the Metathesaurus, the Semantic Network, and the SPECIALIST Lexicon and Lexical Tools.
    • The Metathesaurus is a very large, multi-purpose, and multi-lingual vocabulary database that contains information about biomedical and health related concepts, their various names, and the relationships among them. It contains more than 1.8 million concepts that come from more than 100 source vocabularies.
    [3] Mouse Genome Informatics Database: http://www.informatics.jax.org/
    [4] Rat Genome Database: http://rgd.mcw.edu
    [5] Online Mendelian Inheritance in Animals: http://omia.angis.org.au/
  • 22. • The Metathesaurus is linked to the Semantic Network: all concepts in the Metathesaurus are assigned to at least one semantic type from the Semantic Network.
    • MetaMap is a well-known tool in the UMLS SPECIALIST Lexicon and Lexical Tools. It is a highly configurable application that maps biomedical text to the UMLS Metathesaurus: MetaMap tokenizes the input text and chunks it into phrases; each phrase is mapped to a set of candidate UMLS concepts; a word sense disambiguation step then chooses the best candidate with respect to the surrounding text.
    However, the UMLS Semantic Network does not contain Phenotype as a semantic type, so it alone is not adequate to distinguish between phenotypes and other objects in text. In addition, some phenotype names do not exist in the UMLS Metathesaurus at all. Nevertheless, the UMLS and its Knowledge Sources may be useful for phenotype recognition in some ways.
    2.1.6 KMR corpus
    We refer to the manually annotated corpus of Khordad et al. (2011) as the ‘KMR corpus’. It is a collection of 3784 tokens (120 sentences) with 110 annotated phenotype mentions. Sentences in the KMR corpus were taken from 4 PubMed papers from the year 2009 in the area of human genetics. Annotation was conducted with reference to the HPO, so that a term was tagged as a phenotype if it was in the HPO, or if it was not in the HPO but its definition showed that it was caused by a genotype.
    It is not a well-known corpus and has only been used in the research of Khordad et al. (2011). However, since annotated corpora for phenotypes are currently lacking, it is still a valuable choice. We will use this corpus for testing and analyzing our proposed model.
    Above, we have introduced only some of the most typical useful resources for our research. In addition to them, there are many other resources for bio-informatics that can be used, such as the Medical Subject Headings [6], a gene list containing more than 9 million genes [7], etc.
    [6] MeSH: http://www.nlm.nih.gov/mesh/meshhome.html
    [7] Created by National Center for Biotechnology Information, U.S. National Library of Medicine
2.2 Related researches
Named entity recognition in the biomedical domain has been studied extensively and, as a consequence, many methods have been proposed. Some methods, like MetaMap, are generic and find many kinds of entities in text. Others are specialized to recognize particular types of entities. However, these techniques tend to emphasize finding the names of genes, gene products, cells, diseases and chemicals (Fukuda et al., 1998; Rindflesch et al., 1999; Collier et al., 2000; Kazama et al., 2002; Zhou et al., 2003; Settles, 2004; Kim et al., 2004; Leaman and Gonzalez, 2008). So far, only a small number of studies have addressed phenotypes, and they are often based primarily on available resources or rule-based methods. Whilst other authors have tried similar approaches for other entity types, none have combined machine learning and external resource lookup for a class as rich and semantically complex as phenotypes. In this section, we describe the method proposed by Khordad et al. (2011), which is used as our baseline method for comparison in the experiments.
2.2.1 Baseline method: Khordad et al. (2011)
The system built in Khordad et al. (2011) is based on MetaMap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from an initial basic system that uses only these pre-existing tools, five rules that capture stylistic and linguistic properties of this type of literature are proposed to enhance the performance of their NER tool. A block diagram of the processing in Khordad et al. (2011)’s system is shown in Figure 2.3. The system performs the following steps:
• (1) MetaMap chunks the input text into phrases and assigns the UMLS semantic types associated with each noun phrase.
• (2) The Disorder Recognizer analyzes the MetaMap output to find phenotypes and phenotype candidates.
This is the most important part of the method. It is based primarily on the idea that a phenotype must belong to certain UMLS semantic types. The UMLS Semantic Network contains 133 semantic types, which are categorized into 15 more general semantic groups. Among them, the semantic group Disorders contains 12 semantic types that are close to the meaning of phenotype: Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Experimental Model of Disease, Finding, Injury or Poisoning, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, Sign or Symptom. In this step, phrases that do not belong to this semantic group are rejected. However, a number of semantic types in this group may include concepts that are not phenotypes. The 7 problematic semantic types are: Finding, Disease or Syndrome, Experimental Model of Disease, Injury or Poisoning, Sign or Symptom, Pathologic Function, and Cell or Molecular Dysfunction. Therefore, if a phrase is assigned to one of these semantic types, it is considered a phenotype candidate, to be confirmed or rejected as a phenotype in step (3); otherwise, it is accepted as a phenotype.
• (3) Phenotype candidates from the previous step are searched for in the HPO using OBO-Edit (footnote 8). Candidates that are found in the HPO are recognized as phenotypes.
• (4) The Result Merger merges the phenotypes found by the Disorder Recognizer and OBO-Edit and produces the output: the final list of phenotypes found in the input text.
This model was tested on the small KMR corpus (described in section 2.1.6), annotated by its authors. The reported results are a precision of 97.58, a recall of 88.32 and an F1 of 92.71.
8 OBO-Edit: the OBO ontology editor: http://oboedit.org/
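Steps (2) and (3) above can be illustrated with a minimal sketch. It is not the implementation of Khordad et al. (2011): the function name `classify_phrase`, the representation of the HPO as a plain set of lowercase term names (the original system queried the HPO through OBO-Edit), and the handling of phrases that carry several semantic types are all assumptions made for illustration only.

```python
from typing import Iterable, Set

# The 12 UMLS semantic types of the Disorders semantic group, as listed in the text.
DISORDER_TYPES = {
    "Acquired Abnormality", "Anatomical Abnormality",
    "Cell or Molecular Dysfunction", "Congenital Abnormality",
    "Disease or Syndrome", "Experimental Model of Disease", "Finding",
    "Injury or Poisoning", "Mental or Behavioral Dysfunction",
    "Neoplastic Process", "Pathologic Function", "Sign or Symptom",
}

# The 7 semantic types the text describes as problematic.
CANDIDATE_TYPES = {
    "Finding", "Disease or Syndrome", "Experimental Model of Disease",
    "Injury or Poisoning", "Sign or Symptom", "Pathologic Function",
    "Cell or Molecular Dysfunction",
}


def classify_phrase(phrase: str, semantic_types: Iterable[str],
                    hpo_terms: Set[str]) -> str:
    """Return 'phenotype' or 'rejected' for one phrase (hypothetical helper).

    Assumption: a phrase with at least one non-problematic Disorders type is
    accepted directly; the source does not say how mixed cases are handled.
    """
    disorder_types = set(semantic_types) & DISORDER_TYPES
    if not disorder_types:
        return "rejected"          # step (2): outside the Disorders group
    if disorder_types - CANDIDATE_TYPES:
        return "phenotype"         # step (2): unproblematic semantic type
    # Step (3): the phrase is only a candidate, so confirm it against the HPO.
    # Assumption: hpo_terms holds lowercase HPO term names.
    if phrase.strip().lower() in hpo_terms:
        return "phenotype"
    return "rejected"
```

For example, under these assumptions `classify_phrase("Seizures", ["Sign or Symptom"], {"seizures"})` is confirmed through the lookup, while a phrase typed only as Finding and absent from the term set is rejected.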
Figure 2.1: A visual example of HPO hierarchical structure (HP:0010439)
Figure 2.2: A visual example of MP hierarchical structure (MP:0002880)
Figure 2.3: Khordad et al. (2011)’s system block diagram
Chapter 3
Methods
In this chapter, we first analyze in detail the two entity types employed in this study: gene/gene product (GGP) and bodily feature (BF) (section 3.1). Then, in section 3.2, we introduce version 1.0 of our Phenominer corpus, which is built around 19 auto-immune diseases; this corpus can be used for phenotype recognition as well as other biomedical problems. Finally, section 3.3 describes our proposed hybrid model for recognizing BF and GGP entities. The model consists of three main parts: a machine learning labeler, a knowledge-based labeler and a result merging module.
3.1 Schema
We employed two types of entity in our study: gene/gene product (GGP) and bodily feature (BF).
GGP is proposed because (1) a subset of these entities is useful for applications that explore gene-phenotype relations, and (2) it allows us to compare our results against the many biomedical NER studies of the past, e.g. Kim et al. (2004); Rebholz-Schuhmann et al. (2010). Because of space limitations we will not provide a rigidly formal definition or a taxonomic analysis (Beisswanger et al., 2008). Future work will explore the relationships between these and other entity types. In line with BioTop (Beisswanger et al., 2008), GGP is relatively straightforward to define by the conjunction of (BioTop ID Nucleic Acid Structure) and (BioTop ID Peptide Structure).
Definition: A gene/gene product (GGP) entity is a mention of one of three major macro-molecules: DNA, RNA or protein. DNA and RNA are nucleic acid sequences containing the genetic instructions used in the development and function of an organism. Proteins are polypeptide sequences, or parts of polypeptide sequences, folded into structures that facilitate biological function.
Examples include: [cryoglobulins], [anticardiolipin antibodies], [AFM044xg3], [chromosome 17q], [CC16 protein].
As mentioned in chapter 1, in this thesis we use the bodily feature (BF) definition for phenotype candidates.
Definition: A bodily feature (BF) entity is a mention of a bodily quality in an organism.
Examples include: [lack of kidney], [abnormal cell migration], [absent ankle reflexes], as well as more complex cases such as [no abnormality in his heart], [unfavorable serum lipid levels] and [susceptibility to ulcerative colitis].
Figure 3.1 is an informal overview of the bodily feature entity. It visually describes some forms of BF obtained from surveying the data: structural attributes, qualitative attributes, functional attributes and process attributes.
Figure 3.1: An informal overview of bodily feature entity
• Structural attributes indicate the presence or absence of a physical component (Anatomy or GGP). For example: [having five fingers], [lack of kidney], [Peritoneal mesothelioma], [missing one finger]
• Qualitative attributes show qualities of physical components in an organism. In simple cases, they have the form: Anatomy/GGP has (or does not have) a certain quality. Qualities can describe any measurable characteristic such as location, color, size, mass, etc., and even underspecified qualities of a human/mouse body component. Most qualitative phenotypes contain a mention of a physical component term, i.e. anatomy/GGP, but some do not (although there is usually a hidden relation to a physical component). For example: [black hair], [not having between 13 and 18 gm/dl hemoglobin concentration], [adult female height 130-157 cm], [conjoined fingers]
• Functional attributes are related to the functions and dispositions of anatomy (Hoehndorf et al., 2010). Intuitively, the functions of an anatomical entity establish the reason (or cause) that it exists, while its dispositions determine its capabilities and potentials. For example, the endocrine pancreatic cells have a function to produce insulin, and normally have a disposition to produce insulin. In general, a functional attribute indicates the lack or abnormality of an anatomical function. For example: [facial grimacing], [sleepy facial expression], [reading disability], [hypotension], [deaf]
• Process attributes represent characteristics of the processes themselves. They include characteristics of physiological processes, metabolic processes, biological pathways, chemical reactions, gene-related processes, gene expression, etc. The expression of a process attribute sometimes has a complex structure, but following the discussion of phenotypes as processes in physiology (Hoehndorf et al., 2012) we include some mentions of processes within the scope of our annotation schema. For example: [defective DNA repair after ultraviolet radiation damage], [abnormality of metabolism], [proliferation of BAF-32 cells]
• The cases above are the most common forms of BF, but there are many other cases that we cannot list or group into classes. For example, there are some non-measurable characteristics of a body component that are experienced by the patient (human or mouse) themselves, such as pain or itchiness. These characteristics cannot be objectively measured or observed by others. This kind of characteristic is complex and often has several variants; in this work, they are also considered BF. For example: [primary sunburn], [headache], [stress]

Table 3.1: Referential semantics and scoping of mentions by entity type

                             BF           GGP
specific reference           Yes          Yes
generic reference            Yes (1)      Yes
under-specified reference    No           No
modifiers                    Yes (2, 3)   No
conjunctions                 Yes (4)      Yes (4)
processes                    Yes (5)      No
negation                     Yes (6)      No

Notes on annotation:
(1) An entity may be referred to with a generic name. Such expressions may be anaphoric (i.e., refer to other mentions in the context), and are sometimes too vague or descriptive to be called a named entity. But because their information content is valuable, generic names should be annotated in such cases. For example, [gene], [gene expression], [asthma phenotype].
(2) Quantitative modifiers are included, e.g. [having five fingers], as well as spatial modifiers, e.g. [abnormality in left hand].
(3) Qualitative modifiers are included. For example, physical components: [black hair], underspecified ranges: [normal height], locational modifiers: [low set ears], and level modifiers: [quite small fingers].
(4) Where there is elision of the head, e.g. [IA/H5 virus], annotate the whole expression. Otherwise annotate each expression separately, e.g. [IA virus] and [H5 virus].
(5) However, we exclude finite verb forms, infinitive verb forms with ‘to’, verbs in a progressive or perfect aspect, verb phrases, clauses or sentences, and any phrase with a relative clause or complement clause.
(6) If the negation appears in a noun phrase with an anatomical entity then we generally allow it, e.g. [absent ankle reflexes], [no left kidney].
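As an illustration of how the scoping decisions in Table 3.1 could be consulted programmatically, for example in an annotation checking script, the table can be encoded as a simple lookup. This is only a sketch: the names `SCOPE` and `is_in_scope` are hypothetical, and the boolean values deliberately omit the conditions attached to some entries by the notes on annotation.

```python
# Table 3.1 encoded as a lookup: phenomenon -> entity type -> in scope?
# Entries that carry a note in the table still depend on the conditions
# given in the notes on annotation; this sketch does not model them.
SCOPE = {
    "specific reference":        {"BF": True,  "GGP": True},
    "generic reference":         {"BF": True,  "GGP": True},   # BF: note 1
    "under-specified reference": {"BF": False, "GGP": False},
    "modifiers":                 {"BF": True,  "GGP": False},  # BF: notes 2, 3
    "conjunctions":              {"BF": True,  "GGP": True},   # both: note 4
    "processes":                 {"BF": True,  "GGP": False},  # BF: note 5
    "negation":                  {"BF": True,  "GGP": False},  # BF: note 6
}


def is_in_scope(entity_type: str, phenomenon: str) -> bool:
    """Return whether a phenomenon may be part of a mention of entity_type.

    Raises KeyError for an unknown phenomenon or entity type, so that
    typos in the caller are not silently treated as 'out of scope'.
    """
    return SCOPE[phenomenon][entity_type]
```

For instance, `is_in_scope("BF", "negation")` is true while `is_in_scope("GGP", "negation")` is false, mirroring the last row of the table.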