A Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaP
Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh,
Neda Alipanah and Hyeon-Eui Kim
Division of Biomedical Informatics
UC San Diego
AMIA 2013, Washington DC, 11/18/2013
Roadmap
• Background
o dbGaP
o Challenges in using dbGaP
o pFINDR program
• Phenotype standardization in dbGaP
• PhenDisco system
• Performance evaluation
NCBI’s database of Genotypes and Phenotypes (dbGaP)
“dbGaP was developed to
archive and distribute the
results of studies that have
investigated the interaction
of genotype and phenotype”
As of 11/14/2013:
- 411 top-level studies
- 139,238 phenotype variables
- 2,816 datasets
- 3,895 analyses
What topics are dbGaP users focusing on?
Most frequent phenotype topics from the dbGaP data requests
Genetic Disease
Congenital Abnormality (8.6%)
Cardiovascular Disease (8.1%)
14,287 data requests from the dbGaP website
http://www.ncbi.nlm.nih.gov/gap
pFINDR (phenotype Finding IN Data Repositories)
• Funded by NHLBI/NIH
• To facilitate dbGaP use by improving accuracy
and completeness of search returns
– Standardize existing phenotype variables
– Make study-related information searchable
Related work
• PhenX: defined the 287 most important phenotypes and manually mapped them into 16 dbGaP studies
• eMERGE: standardizes EMR phenotypes
o Semi-automatic process: phenotypes are first automatically mapped into standardized vocabularies, then the outputs are returned to and curated by users
Our approach
Using NLP to standardize phenotype variables in dbGaP
Integrating NLP components into a new phenotype
search tool for dbGaP
[Figure: system architecture]
dbGaP: free text search and structured (advanced) search; unsorted, flat list results returned to the data user
Standardization & annotation: Study Description Annotator, Phenotype Variable Annotator, Ontology Mapper
sdGaP: contains standardized phenotype variables
PhenDisco: Query Parser and Ranking Algorithms; free text search and structured search; ranked results
Phenotype Variable Standardization Pipeline
Phenotype variables → Tagger → Categorizer
Tagger: identify topic and subject of information
Categorizer: identify semantic category of phenotypes
Variable | Topic | Subject | Category
Gender of the participant | Gender | participant | Demographics
CIGARETTES/DAY, EXAM 1 | medical examination | study subject | Smoking History; Healthcare Activity Finding
Weight in kg. at baseline | weighing patient | study subject | Clinical Attributes
AGE OF LIVING MOTHER | Age | mother (person) | Demographics
Topic: Main theme of phenotype variables
Subject of information: Bearer of the variable
Phenotype Variable Standardization Pipeline
Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
Tagger → Categorizer
• Normalization
o Spell out abbreviations and shorthand expressions
o Drop question numbers and other unimportant characters
• MetaMap Processing
o Generate CUIs, concept names, and semantic types
• Semantic Role Assignment
o Role identification based on semantic types and keywords
• Topic Filtering
o Keep concepts that match SNOMED-CT clinical findings
o Remove problematic concepts
• Variable Categorization
o Categorization based on semantic types and keywords
Two domain experts selected 15 semantic categories based on semantic types from MetaMap, including Demographics, Medical History, Clinical Attribute, Medication, and Lab Tests.
Creating an abbreviation list from dbGaP
bp blood pressure
bmi body mass index
bpm beats per minute
bw body weight
dbp diastolic blood pressure
hbp high blood pressure
htn hypertension
hr heart rate
Ht height
lb pounds
rr respiration rate
sbp systolic blood pressure
temp temperature
TPR temperature, pulse, respiration
wt weight
yr year
vs vital signs
We compiled and reviewed a list of abbreviations in dbGaP. The original list contained 50 abbreviations; the latest version contains 520.
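The abbreviation step could be sketched roughly as below. This is only an illustration, not the PhenDisco code: the dictionary holds a handful of entries copied from the list above, and the function name `expand_abbreviations` and the whole-word, case-insensitive matching are assumptions.

```python
import re

# A few entries copied from the abbreviation list above; the full
# 520-entry list is not reproduced here.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "dbp": "diastolic blood pressure",
    "sbp": "systolic blood pressure",
    "hr": "heart rate",
    "ht": "height",
    "wt": "weight",
    "yr": "year",
}


def expand_abbreviations(description: str) -> str:
    """Replace whole-word abbreviations with their long forms (case-insensitive).

    Hypothetical helper: the matching strategy is an assumption.
    """
    def replace(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)

    # Each alphabetic run is looked up as a whole word, so "bp" inside
    # "dbp" is never expanded on its own.
    return re.sub(r"[A-Za-z]+", replace, description)


print(expand_abbreviations("SBP at exam 1"))
```

A plain lookup like this cannot resolve ambiguous short forms (for example, "hr" could also mean "hour"), which is one reason a reviewed list matters.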
Remove number with exceptions
Rule | Example
1. If # comes after "type", keep the number | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
2. If # comes after "grade", keep the number | grade 1 Dupuytren's disease
3. If # comes after "stage", keep the number | stage 1 chronic kidney disease
4. If # comes after "bipolar", keep the number | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
5. If # comes after "class", keep the number | class I and II Newcastle disease; class 1 Newcastle disease
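A minimal sketch of the number-removal rules, assuming whitespace tokenization and that "number" means a standalone run of Arabic digits. The function name `drop_numbers` is hypothetical and the real system may handle more cases.

```python
# Keywords after which a number is kept, taken from the five rules above.
KEEP_AFTER = ("type", "grade", "stage", "bipolar", "class")


def drop_numbers(description: str) -> str:
    """Drop standalone Arabic numbers unless they directly follow a keyword."""
    kept = []
    previous = ""
    for token in description.split():
        if token.isdigit() and previous.lower() not in KEEP_AFTER:
            previous = token
            continue  # unprotected number: drop it
        kept.append(token)
        previous = token
    return " ".join(kept)


print(drop_numbers("77 type 1 diabetes"))
```

Roman numerals such as "I" and "II" are not digit tokens, so this sketch leaves them in place everywhere rather than only after the listed keywords.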
Rules for tagger and categorizer: examples
Rule for Tagger:
IF CandidatePreferred contains "gender" or "sex"
AND SemType = organism attribute
THEN Topic = Gender
Rule for Categorizer:
IF Topic concepts = Pharmacologic Substance
OR variable description contains the word “medication”
THEN Type = Medication
Two domain experts reviewed 300 randomly selected unique phenotype variables and created rules from them.
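The two example rules could be expressed in code along the following lines. This is a sketch under assumptions: the `Concept` class, its field names, and the functions `tag_topic` and `categorize` are hypothetical, and semantic types are compared as spelled-out strings rather than in MetaMap's own output format.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    """One mapped concept; field names mirror the wording of the rules."""
    candidate_preferred: str  # preferred concept name
    sem_type: str             # semantic type, spelled out


def tag_topic(concept: Concept) -> Optional[str]:
    """Tagger rule: 'gender' or 'sex' plus organism attribute gives Topic = Gender."""
    name = concept.candidate_preferred.lower()
    is_attribute = concept.sem_type.lower() == "organism attribute"
    if ("gender" in name or "sex" in name) and is_attribute:
        return "Gender"
    return None


def categorize(topic_concepts: List[Concept], description: str) -> Optional[str]:
    """Categorizer rule: a pharmacologic substance topic, or the word
    'medication' in the description, gives Medication."""
    has_drug = any(
        c.sem_type.lower() == "pharmacologic substance" for c in topic_concepts
    )
    words = re.findall(r"[a-z]+", description.lower())
    if has_drug or "medication" in words:
        return "Medication"
    return None


print(tag_topic(Concept("Gender", "Organism Attribute")))
print(categorize([], "Current medication: aspirin"))
```

Returning `None` when no rule fires leaves room for further rules to be tried in sequence; how the actual system orders or combines its rules is not stated here.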
Example of tagging
Original: 77 age mom diagnosed– stroke (tia)
Normalized: age mother diagnosed stroke (tia)
MetaMap:
• C0001779: Age [Organism Attribute]
• C0026591: Mother (Mother (person)) [Family Group]
• C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
• ‘Diagnosed’
Result:
• TOPIC: Age (C0001779)
• Subject of Information: Mother (C0026591)
Phenotype Variable Standardization Pipeline
Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
Tagger → Categorizer
135,608 variables
116,957 phenotypes mapped to Topic
104,172 phenotypes mapped to Category
Evaluation:
- Random sample of 500 unique phenotypes
- Reviewed by 3 domain experts
73% accuracy for topic
71% accuracy for category
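For orientation, the mapping counts can be turned into coverage percentages. This sketch assumes that the 135,608 variables are the denominator for both counts, which the slide does not state explicitly. Coverage here is a different quantity from the 73% and 71% accuracy figures, which come from the expert-reviewed sample.

```python
total_variables = 135_608
mapped_to_topic = 116_957
mapped_to_category = 104_172


def coverage(mapped: int, total: int) -> float:
    """Percentage of variables that received a mapping."""
    return 100.0 * mapped / total


print(f"Topic coverage:    {coverage(mapped_to_topic, total_variables):.1f}%")
print(f"Category coverage: {coverage(mapped_to_category, total_variables):.1f}%")
```

Under that assumption, roughly 86.2% of variables received a topic and roughly 76.8% received a category.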
Semantic category of phenotypes in dbGaP
(as of July 1, 2013)
Mapping Failures
a. Unprocessed by MetaMap
14 c first arm other
vessel max l lat obliq
b. Lexical problems: items with incorrect lexical forms, including typos
hba1 c collection date month 057
fateat gm
c. ID variables or other administrative variables
form number
f124 documentation used form
d. Not covered by our rules
years treated pet for fleas
what is your first language
PhenDisco: Put-it-all-together
[Figure: end-to-end flow]
dbGaP → NLP tools + MetaMap, information model mapping → standardized phenotypes → sdGaP
Free text → Query parser → sdGaP → relevant studies → BM25 ranking algorithm → ranked studies
Doan S, Lin KW, Conway M, Ohno-Machado et al. "PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
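The slide names BM25 as the ranking algorithm without giving details. The sketch below shows the textbook Okapi BM25 formula on toy data to illustrate the idea. It is not the PhenDisco implementation: the parameter values `k1 = 1.5` and `b = 0.75` are common defaults chosen here as an assumption, and the study descriptions are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Hypothetical study descriptions, made up for illustration.
studies = {
    "study_a": "copd genetics in smokers".split(),
    "study_b": "macular degeneration cohort".split(),
    "study_c": "copd copd lung function".split(),
}
scores = bm25_scores(["copd"], studies)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print([doc_id for doc_id, _ in ranked])
```

On this toy data the study that mentions the query term twice ranks first and the study that never mentions it scores zero.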
PhenDisco system
• Search
o Term auto-complete
o Synonym expansion
• Display
o Keyword highlighting
o Ranking by relevance
o Filter by study metadata
o Cross-link related studies
• Export to Excel
o Selected study metadata
o Selected phenotype variables
• Advanced Search
o Search by titles, platform, study
PhenDisco vs dbGaP Entrez
Basic Search (as of July 7, 2013)
Query | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure: dbGaP 0.64, PhenDisco 0.89
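The F-measure row can be checked with a short calculation. The sketch assumes the balanced F-measure (F1), the harmonic mean of precision and recall, applied to the average recall and precision from the table. That reading reproduces the reported values, though the slide does not say exactly how the averages were combined.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average recall and precision from the table, as fractions.
dbgap = f_measure(precision=0.4661, recall=1.0)
phendisco = f_measure(precision=0.9571, recall=0.8333)
print(f"dbGaP: {dbgap:.2f}, PhenDisco: {phendisco:.2f}")
# prints "dbGaP: 0.64, PhenDisco: 0.89"
```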
Summary & Future work
• A rule-based approach is a simple yet efficient way to standardize phenotype variables in dbGaP
• Integration with machine learning methods will be investigated
• Identification of similar variables is in progress!
Acknowledgements
• Lucila Ohno-Machado (PI)
• Other PhenDisco team members:
o Mike Conway
o Alex Hsieh
o Stephanie Feudjido Feupe
o Asher Garland
o Mindy Ross
o Xiaoqian Jiang
o Jing Zhang
• Early contributors
o Wendy Chapman
o Melissa Tharp
o Jihoon Kim
• Collaborator:
o Hua Xu
• SAB member and NHLBI officers
• Funding: UH2HL108785 from NHLBI/NIH
Questions?
Project Homepage: http://pfindr.net
PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1
Contact:
sondoan@ucsd.edu
hyk038@ucsd.edu
Source code and database of PhenDisco are
publicly available

More Related Content

Similar to Rule-Based NLP System for Tagging and Categorizing Phenotype Variables

NLP tutorial at AIME 2020
NLP tutorial at AIME 2020NLP tutorial at AIME 2020
NLP tutorial at AIME 2020Rui Zhang
 
PadminiNarayanan-Intro-2018.pptx
PadminiNarayanan-Intro-2018.pptxPadminiNarayanan-Intro-2018.pptx
PadminiNarayanan-Intro-2018.pptxDESMONDEZIEKE1
 
Clinical Tools - Faculty Development
Clinical Tools - Faculty DevelopmentClinical Tools - Faculty Development
Clinical Tools - Faculty DevelopmentRobin Featherstone
 
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationGuide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationChris Southan
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Syed Muhammad Ali Hasnain
 
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...Gabe Rudy
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GenomeInABottle
 
TIGA: Target Illumination GWAS Analytics
TIGA: Target Illumination GWAS AnalyticsTIGA: Target Illumination GWAS Analytics
TIGA: Target Illumination GWAS AnalyticsJeremy Yang
 
Searching with point of care tools sept 2010
Searching with point of care tools sept 2010Searching with point of care tools sept 2010
Searching with point of care tools sept 2010Robin Featherstone
 
Enzymes as drug targets: curated pharmacological information in the 'Guide to...
Enzymes as drug targets: curated pharmacological information in the 'Guide to...Enzymes as drug targets: curated pharmacological information in the 'Guide to...
Enzymes as drug targets: curated pharmacological information in the 'Guide to...Guide to PHARMACOLOGY
 
Using Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS VariantsUsing Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS VariantsGolden Helix Inc
 
Genome science intermine
Genome science intermineGenome science intermine
Genome science intermineELIXIR UK
 
Towards comprehensive syntactic and semantic annotations of the clinical narr...
Towards comprehensive syntactic and semantic annotations of the clinical narr...Towards comprehensive syntactic and semantic annotations of the clinical narr...
Towards comprehensive syntactic and semantic annotations of the clinical narr...Jinho Choi
 
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...David Peyruc
 
The Monarch Initiative: From Model Organism to Precision Medicine
The Monarch Initiative: From Model Organism to Precision MedicineThe Monarch Initiative: From Model Organism to Precision Medicine
The Monarch Initiative: From Model Organism to Precision Medicinemhaendel
 
Pharmacogenetics
PharmacogeneticsPharmacogenetics
PharmacogeneticsLarry Baum
 
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...Remedy Informatics
 

Similar to Rule-Based NLP System for Tagging and Categorizing Phenotype Variables (20)

NLP tutorial at AIME 2020
NLP tutorial at AIME 2020NLP tutorial at AIME 2020
NLP tutorial at AIME 2020
 
PadminiNarayanan-Intro-2018.pptx
PadminiNarayanan-Intro-2018.pptxPadminiNarayanan-Intro-2018.pptx
PadminiNarayanan-Intro-2018.pptx
 
Clinical Tools - Faculty Development
Clinical Tools - Faculty DevelopmentClinical Tools - Faculty Development
Clinical Tools - Faculty Development
 
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and EducationGuide to PHARMACOLOGY: a web-Based Compendium for Research and Education
Guide to PHARMACOLOGY: a web-Based Compendium for Research and Education
 
Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...Quantifying the content of biomedical semantic resources as a core for drug d...
Quantifying the content of biomedical semantic resources as a core for drug d...
 
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
2015 TriCon - Clinical Grade Annotations - Public Data Resources for Interpre...
 
PhenDisco: Phenotype Discovery System for the Database of Genotypes and Pheno...
PhenDisco: Phenotype Discovery System for the Database of Genotypes and Pheno...PhenDisco: Phenotype Discovery System for the Database of Genotypes and Pheno...
PhenDisco: Phenotype Discovery System for the Database of Genotypes and Pheno...
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517GIAB Integrating multiple technologies to form benchmark SVs 180517
GIAB Integrating multiple technologies to form benchmark SVs 180517
 
TIGA: Target Illumination GWAS Analytics
TIGA: Target Illumination GWAS AnalyticsTIGA: Target Illumination GWAS Analytics
TIGA: Target Illumination GWAS Analytics
 
Searching with point of care tools sept 2010
Searching with point of care tools sept 2010Searching with point of care tools sept 2010
Searching with point of care tools sept 2010
 
Enzymes as drug targets: curated pharmacological information in the 'Guide to...
Enzymes as drug targets: curated pharmacological information in the 'Guide to...Enzymes as drug targets: curated pharmacological information in the 'Guide to...
Enzymes as drug targets: curated pharmacological information in the 'Guide to...
 
Using Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS VariantsUsing Public Access Clinical Databases to Interpret NGS Variants
Using Public Access Clinical Databases to Interpret NGS Variants
 
Genome science intermine
Genome science intermineGenome science intermine
Genome science intermine
 
Towards comprehensive syntactic and semantic annotations of the clinical narr...
Towards comprehensive syntactic and semantic annotations of the clinical narr...Towards comprehensive syntactic and semantic annotations of the clinical narr...
Towards comprehensive syntactic and semantic annotations of the clinical narr...
 
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...
tranSMART Community Meeting 5-7 Nov 13 - Session 5: Advancing tranSMART Analy...
 
The Monarch Initiative: From Model Organism to Precision Medicine
The Monarch Initiative: From Model Organism to Precision MedicineThe Monarch Initiative: From Model Organism to Precision Medicine
The Monarch Initiative: From Model Organism to Precision Medicine
 
Pharmacogenetics
PharmacogeneticsPharmacogenetics
Pharmacogenetics
 
Eisai-2
Eisai-2Eisai-2
Eisai-2
 
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...
Ontology-Driven Clinical Intelligence: A Path from the Biobank to Cross-Disea...
 

Recently uploaded

Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencySheetal Arora
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bSérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfrohankumarsinghrore1
 

Recently uploaded (20)

Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 

Rule-Based NLP System for Tagging and Categorizing Phenotype Variables

  • 1. A Rule-Based NLP System in Tagging andA Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaPCategorizing Phenotype Variables in dbGaP Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh, Neda Alipanah and Hyeon-Eui Kim Division of Biomedical Informatics UC San Diego AMIA 2013, Washington DC, 11/18/2013
  • 2. RoadmapRoadmap • Background o dbGaP o Challenges in using dbGaP o pFINDR program • Phenotype standardization in dbGaP • PhenDisco system • Performance evaluation 2
  • 3. NCBI’s database of Genotypes and Phenotypes (dbGaP)NCBI’s database of Genotypes and Phenotypes (dbGaP) “dbGaP was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype” Until 11/14/2013: -411 top-level studies -139,238 phenotypes variables -2,816 datasets -3,895 analyses 3
  • 4. What topics dbGaP users are focusing on?What topics dbGaP users are focusing on? Most frequent phenotype topics from the dbGaP data requests Genetic Disease Congenital Abnormality (8.6%) Cardiovascu lar Disease (8.1%) 14,287 data requests from dbGaP Website
  • 8. pFINDRpFINDR (phenotype Finding IN Data Repositories)(phenotype Finding IN Data Repositories) 8 • Funded by NHLBI/NIH • To facilitate dbGaP use by improving accuracy and completeness of search returns – Standardized existing phenotype variables – Searchable study related information
  • 9. Related workRelated work 9  PhenX: Define 287 most important phenotypes and manually mapped into 16 dbGaP studies  eMERGE: Standardize EMR phenotypes Semi-automatic process: First phenotypes are automatically mapped into standardized vocabularies , then outputs are returned and curated by users
  • 10. Our approachOur approach Using NLP to standardize phenotype variables in dbGaP Integrating NLP components into a new phenotype search tool for dbGaP dbGaP Free text search Structured (advanced) search Unsorted, flat list results Data user Study Description Annotator Phenotyp e Variable Annotator Ontolog y Mapper Query Parser Ranking Algorithms Standardization & annotation sdGaP PhenDisco sdGaP contains standardized phenotype variables Ranked results Structured Search Free text search
  • 11. Phenotype Variable Standardization PipelinePhenotype Variable Standardization Pipeline Identify topic and subject of information Identify semantic category of phenotypes Phenotype variables Phenotype variables TaggerTagger CategorizerCategorizer Variable Topic Subject Category Gender of the participant Gender participant Demographics CIGARETTES/DAY, EXAM 1 medical examination study subject Smoking History; Healthcare Activity Finding Weight in kg. at baseline weighing patient study subject Clinical Attributes AGE OF LIVING MOTHER Age mother (person) Demographics
  • 12. Phenotype Variable Standardization PipelinePhenotype Variable Standardization Pipeline Identify topic and subject of information Identify semantic category of phenotypes Variable Topic Subject Category Gender of the participant Gender participant Demographics CIGARETTES/DAY, EXAM 1 medical examination study subject Smoking History; Healthcare Activity Finding Weight in kg. at baseline weighing patient study subject Clinical Attributes AGE OF LIVING MOTHER Age mother (person) Demographics Phenotype variables Phenotype variables TaggerTagger CategorizerCategorizer Topic: Main theme of phenotype variables Subject of information: Bearer of the variable
  • 13. Phenotype Variable Standardization PipelinePhenotype Variable Standardization Pipeline Identify topic and subject of information Identify semantic category of phenotypes Phenotype variables Phenotype variables TaggerTagger CategorizerCategorizer Variable Topic Subject Category Gender of the participant Gender participant Demographics CIGARETTES/DAY, EXAM 1 medical examination study subject Smoking History; Healthcare Activity Finding Weight in kg. at baseline weighing patient study subject Clinical Attributes AGE OF LIVING MOTHER Age mother (person) Demographics
  • 14. Phenotype Variable Standardization Pipeline
    Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization (components: Tagger, Categorizer)
    o Normalization: spell out abbreviations and shorthand expressions; drop question numbers and other unimportant characters
    o MetaMap Processing: generate CUIs, concept names, and semantic types
    o Semantic Role Assignment: semantic types and keyword-based role identification
    o Topic Filtering: keep concepts that match SNOMED-CT clinical findings; remove problematic concepts
    o Variable Categorization: semantic types and keyword-based categorization
    Two domain experts selected 15 semantic categories based on semantic types from MetaMap, e.g. Demographics, Medical History, Clinical Attribute, Medication, Lab Tests.
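The "drop question numbers and other unimportant characters" part of normalization can be sketched as follows. This is a minimal illustration, not the PhenDisco implementation: the function name `strip_question_number` and the two regular expressions are assumptions, and the input string is the example variable shown on the tagging slide.

```python
import re


def strip_question_number(description: str) -> str:
    """Drop a leading question number and stray dash characters (hypothetical helper)."""
    # Remove a leading item number such as "77 " or "12. " (assumed numbering styles).
    text = re.sub(r"^\s*\d+[.)]?\s+", "", description)
    # Replace en and em dashes, which carry no meaning here, with a space.
    text = re.sub(r"[\u2013\u2014]", " ", text)
    # Collapse the runs of whitespace left behind.
    return re.sub(r"\s+", " ", text).strip()


print(strip_question_number("77 age mom diagnosed\u2013 stroke (tia)"))
# prints: age mom diagnosed stroke (tia)
```

A rule this blunt would also strip a meaningful leading number (for example a description beginning with "2 hour"), so a real system would need exceptions.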
  • 15. Creating an abbreviation list from dbGaP
    Abbreviation | Expansion
    bp | blood pressure
    bmi | body mass index
    bpm | beats per minute
    bw | body weight
    dbp | diastolic blood pressure
    hbp | high blood pressure
    htn | hypertension
    hr | heart rate
    Ht | height
    lb | pounds
    rr | respiration rate
    sbp | systolic blood pressure
    temp | temperature
    TPR | temperature, pulse, respiration
    wt | weight
    yr | year
    vs | vital signs
    We compiled and reviewed a list of abbreviations found in dbGaP. The original list contained 50 abbreviations; the latest version contains 520.
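A dictionary lookup is one simple way to apply such an abbreviation list. The sketch below uses a handful of entries from the slide; the function name `expand_abbreviations`, the case-insensitive matching, and the whole-word tokenization are assumptions made for illustration, not details confirmed by the deck.

```python
import re

# A few entries taken from the slide; the full list is much longer.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "sbp": "systolic blood pressure",
    "dbp": "diastolic blood pressure",
    "hr": "heart rate",
    "wt": "weight",
    "yr": "year",
}

_WORD = re.compile(r"[A-Za-z]+")


def expand_abbreviations(description: str) -> str:
    """Replace whole alphabetic tokens that appear in the abbreviation list."""

    def replace(match: re.Match) -> str:
        word = match.group(0)
        # Assumption: matching ignores case; unknown words are left untouched.
        return ABBREVIATIONS.get(word.lower(), word)

    return _WORD.sub(replace, description)


print(expand_abbreviations("SBP at exam 1"))
# prints: systolic blood pressure at exam 1
```

Matching whole tokens only avoids rewriting fragments of longer words, at the cost of missing abbreviations glued to digits or punctuation.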
  • 16. Remove numbers, with exceptions
    Rule | Examples
    1. If a number follows "type", keep the number | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
    2. If a number follows "grade", keep the number | grade 1 Dupuytren's disease
    3. If a number follows "stage", keep the number | stage 1 chronic kidney disease
    4. If a number follows "bipolar", keep the number | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
    5. If a number follows "class", keep the number | class I and II Newcastle disease; class 1 Newcastle disease
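The exception rules above amount to "drop a number unless the preceding word is on a keep list". A minimal sketch under that reading follows; the function name `remove_numbers` and the whitespace tokenization are assumptions, and only Arabic-digit tokens are treated as numbers (Roman numerals such as "I" or "II" are ordinary words here and are never removed).

```python
# Keywords after which a number is meaningful and must be kept (from the slide).
KEEP_AFTER = {"type", "grade", "stage", "bipolar", "class"}


def remove_numbers(description: str) -> str:
    """Drop standalone digit tokens unless they follow one of the KEEP_AFTER words."""
    kept = []
    previous = ""
    for token in description.split():
        is_number = token.isdigit()
        if is_number and previous.lower() not in KEEP_AFTER:
            previous = token
            continue  # unprotected number: drop it
        kept.append(token)
        previous = token
    return " ".join(kept)


print(remove_numbers("type 2 diabetes"))  # prints: type 2 diabetes
print(remove_numbers("month 057"))        # prints: month
```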
  • 17. Rules for tagger and categorizer: examples
    o Rule for Tagger: IF CandidatePreferred contains "gender" or "sex" AND SemType = organism attribute THEN Topic = Gender
    o Rule for Categorizer: IF Topic concepts = Pharmacologic Substance OR the variable description contains the word "medication" THEN Type = Medication
    Two domain experts reviewed 300 randomly selected unique phenotype variables and created the rules from them.
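The two example rules translate almost directly into code. The sketch below is illustrative only: the `Concept` class, the function names, and the choice to return `None` when no rule fires are assumptions; the field names mirror the MetaMap terms used on the slide (CandidatePreferred, SemType).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    preferred_name: str  # MetaMap "CandidatePreferred"
    semantic_type: str   # MetaMap semantic type, e.g. "Organism Attribute"


def tag_topic(concepts: List[Concept]) -> Optional[str]:
    """Tagger rule: a gender/sex concept typed as organism attribute gives Topic = Gender."""
    for concept in concepts:
        name = concept.preferred_name.lower()
        is_attribute = concept.semantic_type.lower() == "organism attribute"
        if is_attribute and ("gender" in name or "sex" in name):
            return "Gender"
    return None  # no rule fired


def categorize(topic_concepts: List[Concept], description: str) -> Optional[str]:
    """Categorizer rule: pharmacologic topic concept or the word 'medication' gives Medication."""
    has_drug = any(
        c.semantic_type.lower() == "pharmacologic substance" for c in topic_concepts
    )
    mentions_medication = re.search(r"\bmedication\b", description.lower()) is not None
    if has_drug or mentions_medication:
        return "Medication"
    return None  # no rule fired


print(tag_topic([Concept("Gender", "Organism Attribute")]))  # prints: Gender
print(categorize([], "Current medication use"))              # prints: Medication
```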
  • 18. Example of tagging
    Original variable: 77 age mom diagnosed– stroke (tia)
    Normalized: age mother diagnosed stroke (tia)
    MetaMap output:
    o C0001779: Age [Organism Attribute]
    o C0026591: Mother (Mother (person)) [Family Group]
    o C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
    o 'Diagnosed'
    Result:
    o TOPIC: Age (C0001779)
    o Subject of Information: Mother (C0026591)
  • 19. Phenotype Variable Standardization Pipeline
    Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization (components: Tagger, Categorizer)
    o 135,608 variables
    o 116,957 phenotypes mapped to Topic
    o 104,172 phenotypes mapped to Category
    Evaluation:
    o Random sample of 500 unique phenotypes
    o Reviewed by 3 domain experts
    o 73% accuracy for topic
    o 71% accuracy for category
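The mapped counts on this slide can be expressed as coverage percentages. The short calculation below only restates the slide's numbers; the helper name `coverage` is an illustrative assumption.

```python
TOTAL_VARIABLES = 135_608
MAPPED_TO_TOPIC = 116_957
MAPPED_TO_CATEGORY = 104_172


def coverage(mapped: int, total: int) -> float:
    """Percentage of variables that received a mapping."""
    return 100.0 * mapped / total


print(f"Topic coverage: {coverage(MAPPED_TO_TOPIC, TOTAL_VARIABLES):.1f}%")
print(f"Category coverage: {coverage(MAPPED_TO_CATEGORY, TOTAL_VARIABLES):.1f}%")
# prints: Topic coverage: 86.2%
# prints: Category coverage: 76.8%
```

Coverage (how many variables were mapped at all) is distinct from the 73% and 71% accuracy figures, which were measured on the expert-reviewed sample.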
  • 20. Semantic category of phenotypes in dbGaP (as of July 1, 2013)
  • 21. Mapping Failures
    a. Unprocessed by MetaMap
       o 14 c
       o first arm other vessel max l lat obliq
    b. Lexical problems: items with incorrect lexical forms, including typos
       o hba1 c collection date
       o month 057
       o fateat gm
    c. ID variables or administrative variables
       o form number f124
       o documentation used form
    d. Not covered by our rules
       o years treated pet for fleas
       o what is your first language
  • 22. PhenDisco: Put-it-all-together
    Diagram: free text → query parser (NLP tools + MetaMap) → relevant studies from sdGaP → BM25 ranking algorithm → ranked studies. dbGaP → information model mapping → standardized phenotypes in sdGaP.
    Doan S, Lin KW, Conway M, Ohno-Machado et al. "PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
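The slide names BM25 as the ranking algorithm without giving details. For readers unfamiliar with it, here is a minimal sketch of the textbook Okapi BM25 score. Everything specific is an assumption: the function name `bm25_scores`, whitespace tokenization, the parameter values `k1=1.2` and `b=0.75` (common defaults), and the non-negative IDF variant; the deck does not say which variant or parameters PhenDisco uses.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, documents: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each document against the query with textbook Okapi BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # IDF variant that stays non-negative even for very common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1; b controls length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


studies = [
    "copd lung function study",
    "breast cancer breast density",
    "schizophrenia cohort",
]
scores = bm25_scores("breast cancer", studies)
best = max(range(len(studies)), key=scores.__getitem__)
print(studies[best])
# prints: breast cancer breast density
```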
  • 23. PhenDisco system
    Search
    o Term auto-complete
    o Synonym expansion
    o Search by titles, platform, study
    Display
    o Keyword highlighting
    o Ranking by relevance
    o Filter by study metadata
    o Cross-link related studies
    Export to Excel
    o Selected study metadata
    o Selected phenotype variables
  • 25. PhenDisco vs dbGaP Entrez (basic search, as of July 7, 2013)
    Query | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
    COPD | 100% | 41.67% | 80.00% | 100%
    "macular degeneration" AND white | 100% | 42.86% | 100% | 85.71%
    "breast cancer" AND "breast density" | 100% | 66.67% | 50.00% | 100%
    schizophrenia | 100% | 46.88% | 86.67% | 92.86%
    cardiomyopathy | 100% | 35.00% | 100% | 100%
    Average | 100% | 46.61% | 83.33% | 95.71%
    Average F-measure | dbGaP: 0.64 | PhenDisco: 0.89
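The reported average F-measures are consistent with the harmonic mean of the averaged precision and recall, F = 2PR / (P + R). The check below assumes that is how they were computed (the slide does not say so explicitly); the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall from the table, as fractions.
print(round(f_measure(0.4661, 1.0), 2))     # dbGaP, prints: 0.64
print(round(f_measure(0.9571, 0.8333), 2))  # PhenDisco, prints: 0.89
```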
  • 26. Summary & Future work
    o A rule-based approach is a simple yet efficient way to standardize phenotype variables in dbGaP
    o Integration with machine learning methods will be investigated
    o Identification of similar variables is in progress
  • 27. Acknowledgements
    o Lucila Ohno-Machado (PI)
    o Other PhenDisco team members: Mike Conway, Alex Hsieh, Stephanie Feudjido Feupe, Asher Garland, Mindy Ross, Xiaoqian Jiang, Jing Zhang
    o Early contributors: Wendy Chapman, Melissa Tharp, Jihoon Kim
    o Collaborator: Hua Xu
    o SAB members and NHLBI officers
    o Funding: UH2HL108785 from NHLBI/NIH
  • 28. Questions?
    o Project homepage: http://pfindr.net
    o PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1
    o Contact: sondoan@ucsd.edu, hyk038@ucsd.edu
    o Source code and database of PhenDisco are publicly available

Editor's Notes

  1. Mention GWAS studies
  2. NLP analysis of study descriptions for users who obtained data from dbGaP. Topic analysis only; need more about demographics
  3. Mention more about “cancer” and “cardiovascular”
  4. Based on 4 users: all were excited about having a resource like dbGaP but very confused about how to use it. All four (100%) asked for an explanation of why certain studies are returned and, ideally, for results sorted by relevance.
  5. Based on 4 users: all were excited about having a resource like dbGaP but very confused about how to use it. All four (100%) asked for an explanation of why certain studies are returned and, ideally, for results sorted by relevance.
  6. Examples for variable mapping
  7. Note that this is the full list; our experiment used only about 50 abbreviations
  8. DIVER pipeline: XML data dictionary as input → XML parsing and extraction (variable descriptions) → normalization (lexical variants, abbreviations, etc.) → MetaMap run → key concepts with their CUIs and semantic types → model fitting (semantic analysis) based on the key concepts and semantic types → metadata annotation and labeling
  9. Examples for variable mapping
  10. Duplicate. More focus on NLP. Among 135,608 phenotype variables: 104,172 mapped (>= 1 category); 31,473 not mapped (most are IDs such as subject IDs). This contains duplicated information
  11. Queries for data show consistency in interest in disease, ethnicity/race, and therapeutics (drugs)
  12. Rule development – manual process: review of the core concepts and their annotated information model classes
  13. Remove figure of Erin Smith