This document describes PhenDisco, a rule-based natural language processing (NLP) system for standardizing phenotype variables in the database of Genotypes and Phenotypes (dbGaP). The system uses a pipeline that tags variables with topics and subjects, normalizes descriptions, assigns semantic roles, categorizes variables, and maps them to standardized ontologies. It achieved an accuracy of 73% for topic identification and 71% for categorization. PhenDisco provides improved search and access to dbGaP compared to the existing dbGaP Entrez system, demonstrating higher recall and precision. Future work will integrate machine learning and identify similar variables to further enhance phenotype discovery in dbGaP.
Rule-Based NLP System for Tagging and Categorizing Phenotype Variables
1. A Rule-Based NLP System in Tagging and Categorizing Phenotype Variables in dbGaP
Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh,
Neda Alipanah and Hyeon-Eui Kim
Division of Biomedical Informatics
UC San Diego
AMIA 2013, Washington DC, 11/18/2013
2. Roadmap
• Background
o dbGaP
o Challenges in using dbGaP
o pFINDR program
• Phenotype standardization in dbGaP
• PhenDisco system
• Performance evaluation
3. NCBI’s database of Genotypes and Phenotypes (dbGaP)
“dbGaP was developed to
archive and distribute the
results of studies that have
investigated the interaction
of genotype and phenotype”
As of 11/14/2013:
- 411 top-level studies
- 139,238 phenotype variables
- 2,816 datasets
- 3,895 analyses
4. What topics are dbGaP users focusing on?
Most frequent phenotype topics from the dbGaP data requests
• Genetic Disease Congenital Abnormality (8.6%)
• Cardiovascular Disease (8.1%)
14,287 data requests from
dbGaP Website
8. pFINDR (phenotype Finding IN Data Repositories)
• Funded by NHLBI/NIH
• To facilitate dbGaP use by improving accuracy
and completeness of search returns
– Standardized existing phenotype variables
– Searchable study-related information
9. Related work
PhenX: defined the 287 most important phenotypes and manually mapped them to 16 dbGaP studies
eMERGE: standardizes EMR phenotypes
Semi-automatic process: phenotypes are first automatically mapped to standardized vocabularies, then the outputs are returned to users and curated by them
10. Our approach
Using NLP to standardize phenotype variables in dbGaP
Integrating NLP components into a new phenotype
search tool for dbGaP
[Diagram: dbGaP vs PhenDisco]
• dbGaP: free text search, structured (advanced) search; unsorted, flat list results
• Standardization & annotation: Study Description Annotator, Phenotype Variable Annotator, Ontology Mapper
• sdGaP: contains standardized phenotype variables
• PhenDisco: Query Parser, Ranking Algorithms; free text search, structured search; ranked results
• Data user
11. Phenotype Variable Standardization Pipeline
Phenotype variables → Tagger → Categorizer
• Tagger: identify topic and subject of information
• Categorizer: identify semantic category of phenotypes
Variable | Topic | Subject | Category
Gender of the participant | Gender | participant | Demographics
CIGARETTES/DAY, EXAM 1 | medical examination | study subject | Smoking History; Healthcare Activity Finding
Weight in kg. at baseline | weighing patient | study subject | Clinical Attributes
AGE OF LIVING MOTHER | Age | mother (person) | Demographics
12. Phenotype Variable Standardization Pipeline
Topic: Main theme of phenotype variables
Subject of information: Bearer of the variable
14. Phenotype Variable Standardization Pipeline
Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
• Normalization
o Spell out abbreviations and shorthand expressions
o Drop question numbers and other unimportant characters
• MetaMap Processing
o Generate CUIs, concept names, semantic types
• Semantic Role Assignment
o Role identification based on semantic types and keywords
• Topic Filtering
o Keep concepts that match SNOMED-CT clinical findings
o Remove problematic concepts
• Variable Categorization
o Categorization based on semantic types and keywords
Components: Tagger, Categorizer
Two domain experts selected 15 semantic categories based on semantic types from MetaMap, including Demographics, Medical History, Clinical Attribute, Medication, and Lab Tests.
15. Creating an abbreviation list from dbGaP
bp blood pressure
bmi body mass index
bpm beats per minute
bw body weight
dbp diastolic blood pressure
hbp high blood pressure
htn hypertension
hr heart rate
Ht height
lb pounds
rr respiration rate
sbp systolic blood pressure
temp temperature
TPR temperature, pulse, respiration
wt weight
yr year
vs vital signs
We compiled and reviewed a list of abbreviations in dbGaP. The original version contained 50 abbreviations; the latest version contains 520.
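The abbreviation list feeds the normalization step. A minimal sketch of how such a lookup could be applied is shown below; the dictionary holds only a few entries from the slide, and the function name `expand_abbreviations` and the whole-word, case-insensitive matching are assumptions for illustration, not the PhenDisco implementation.

```python
import re

# A few entries copied from the abbreviation list above (the full list is larger).
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "sbp": "systolic blood pressure",
    "dbp": "diastolic blood pressure",
    "hr": "heart rate",
    "ht": "height",
    "wt": "weight",
    "yr": "year",
}


def expand_abbreviations(description):
    """Replace whole-word abbreviations with their long forms (case-insensitive)."""
    def replace(match):
        token = match.group(0)
        # Unknown words are returned unchanged.
        return ABBREVIATIONS.get(token.lower(), token)

    # Each maximal run of letters is treated as one word.
    return re.sub(r"[A-Za-z]+", replace, description)


print(expand_abbreviations("SBP at exam 1"))
print(expand_abbreviations("wt in kg"))
```

A dictionary lookup like this cannot resolve ambiguous abbreviations (for example "hr" as "hour"), which is one reason a reviewed list matters.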
16. Remove number with exceptions
Rule | Examples
1. If a number follows "type", keep the number | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
2. If a number follows "grade", keep the number | grade 1 Dupuytren's disease
3. If a number follows "stage", keep the number | stage 1 chronic kidney disease
4. If a number follows "bipolar", keep the number | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
5. If a number follows "class", keep the number | class I and II Newcastle disease; class 1 Newcastle disease
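The number-removal rules can be sketched as a token filter. This is an illustrative approximation under stated assumptions: the function name `drop_numbers` is hypothetical, tokens are split on whitespace, and only Arabic-digit tokens are considered for removal (Roman numerals such as "I" or "II" are left untouched).

```python
# Keywords after which a number carries meaning and must be kept (from the rules above).
KEEP_NUMBER_AFTER = {"type", "grade", "stage", "bipolar", "class"}


def drop_numbers(description):
    """Remove standalone digit tokens unless they follow one of the exception keywords."""
    tokens = description.split()
    kept = []
    for i, token in enumerate(tokens):
        previous = tokens[i - 1].lower() if i > 0 else ""
        if token.isdigit() and previous not in KEEP_NUMBER_AFTER:
            continue  # e.g. a leading question number
        kept.append(token)
    return " ".join(kept)


print(drop_numbers("77 type 2 diabetes"))
print(drop_numbers("stage 1 chronic kidney disease"))
```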
17. Rules for tagger and categorizer: examples
Rule for Tagger:
IF CandidatePreferred contains "gender" or "sex"
AND SemType = organism attribute
THEN Topic = Gender
Rule for Categorizer:
IF Topic concepts = Pharmacologic Substance
OR variable description contains the word “medication”
THEN Type = Medication
Two domain experts reviewed 300 randomly selected unique phenotype variables and created the rules from them.
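The two example rules translate directly into predicates over MetaMap-style concepts. The sketch below is one possible encoding: the `Concept` class, its field names, and the functions `tag_topic` and `categorize` are hypothetical, and semantic types are compared by their spelled-out names.

```python
import re
from dataclasses import dataclass


@dataclass
class Concept:
    preferred_name: str  # corresponds to MetaMap's CandidatePreferred
    semantic_type: str   # spelled-out semantic type, e.g. "Organism Attribute"


def tag_topic(concept):
    """Tagger rule: gender/sex concepts typed as organism attribute get Topic = Gender."""
    name = concept.preferred_name.lower()
    is_attribute = concept.semantic_type.lower() == "organism attribute"
    if ("gender" in name or "sex" in name) and is_attribute:
        return "Gender"
    return None


def categorize(topic_concepts, description):
    """Categorizer rule: pharmacologic substances or the word 'medication' give Medication."""
    has_drug = any(
        c.semantic_type.lower() == "pharmacologic substance" for c in topic_concepts
    )
    words = re.findall(r"[a-z]+", description.lower())
    if has_drug or "medication" in words:
        return "Medication"
    return None


print(tag_topic(Concept("Gender", "Organism Attribute")))
print(categorize([], "Current medication use"))
```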
18. Example of tagging
Original variable: 77 age mom diagnosed– stroke (tia)
Normalized: age mother diagnosed stroke (tia)
MetaMap output:
• C0001779: Age [Organism Attribute]
• C0026591: Mother (Mother (person)) [Family Group]
• C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
• ‘Diagnosed’
Result:
• TOPIC: Age (C0001779)
• Subject of Information: Mother (C0026591)
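To make the role assignment in this example concrete, the sketch below maps semantic types to roles and reproduces the topic and subject shown on the slide. The role table is an assumption chosen to fit this single example; the slides do not state the actual rule set, and `assign_roles` is a hypothetical name.

```python
# Hypothetical role table: which semantic type fills which role (assumed, not from the slides).
ROLE_BY_SEMANTIC_TYPE = {
    "Organism Attribute": "topic",
    "Family Group": "subject",
}

# MetaMap concepts from the example: (CUI, name, semantic type).
concepts = [
    ("C0001779", "Age", "Organism Attribute"),
    ("C0026591", "Mother (person)", "Family Group"),
    ("C0038454", "Cerebrovascular accident", "Disease or Syndrome"),
]


def assign_roles(concepts):
    """Assign each role to the first concept whose semantic type maps to it."""
    roles = {}
    for cui, name, semantic_type in concepts:
        role = ROLE_BY_SEMANTIC_TYPE.get(semantic_type)
        if role and role not in roles:
            roles[role] = (cui, name)
    return roles


print(assign_roles(concepts))
```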
19. Phenotype Variable Standardization Pipeline
Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
• 135,608 variables
• 116,957 phenotypes mapped to Topic
• 104,172 phenotypes mapped to Category
Components: Tagger, Categorizer
Evaluation:
- Random sample of 500 unique phenotypes
- Reviewed by 3 domain experts
73% accuracy for topic
71% accuracy for category
20. Semantic category of phenotypes in dbGaP
(as of July 1, 2013)
21. Mapping Failures
a. Unprocessed by MetaMap
14 c first arm other
vessel max l lat obliq
b. Lexical problems: items with incorrect lexical forms, including typos
hba1 c collection date month 057
fateat gm
c. ID variables or some administrative variables
form number
f124 documentation used form
d. Not covered by our rules
years treated pet for fleas
what is your first language
22. PhenDisco: Put-it-all-together
[Diagram labels: Free text; Query parser; sdGaP; Relevant studies; Ranked studies; NLP tools + MetaMap; Information model mapping; dbGaP; Standardized phenotypes; BM25 ranking algorithm]
Doan S, Lin KW, Conway M, Ohno-Machado et al. "PhenDisco: Phenotype Discovery System for the Database of Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
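The slide names BM25 as the ranking algorithm. A minimal sketch of Okapi BM25 scoring is given below for orientation; the parameter values `k1=1.5` and `b=0.75`, the smoothed non-negative IDF variant, and the toy study texts are assumptions and do not reflect PhenDisco's actual configuration or data.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy study descriptions (invented for illustration only).
studies = [
    "chronic obstructive pulmonary disease cohort".split(),
    "breast cancer and breast density study".split(),
    "cardiomyopathy genetics".split(),
]
scores = bm25_scores(["breast", "cancer"], studies)
ranked = sorted(range(len(studies)), key=lambda i: scores[i], reverse=True)
print(ranked[0], scores)
```

Studies sharing no query term score zero, so the ranked list places matching studies first, in contrast to an unsorted flat list.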
23. PhenDisco system
Search
o Term auto-complete
o Synonym expansion
Display
o Keyword highlighting
o Ranking by relevance
o Filter by study metadata
o Cross-link related studies
Export to Excel
o Selected study metadata
o Selected phenotype variables
Search by titles, platform, study
25. PhenDisco vs dbGaP Entrez
Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure: dbGaP 0.64, PhenDisco 0.89
(as of July 7, 2013)
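For readers who want to check the F-measure row, the sketch below computes the balanced F-measure (harmonic mean of precision and recall) from the average precision and recall in the table. That the slide derived its F-measures from the averaged values is an assumption; it is consistent with the reported 0.64 and 0.89.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall taken from the table above, as fractions.
dbgap = f_measure(precision=0.4661, recall=1.0)
phendisco = f_measure(precision=0.9571, recall=0.8333)

print(round(dbgap, 2), round(phendisco, 2))  # 0.64 0.89
```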
26. Summary & Future work
• A rule-based approach is a simple yet efficient
way to standardize phenotype variables in
dbGaP
• Integration with machine learning methods will be
investigated
• Identification of similar variables is in progress!
27. Acknowledgements
• Lucila Ohno-Machado (PI)
• Other PhenDisco team members:
o Mike Conway
o Alex Hsieh
o Stephanie Feudjido Feupe
o Asher Garland
o Mindy Ross
o Xiaoqian Jiang
o Jing Zhang
• Early contributors
o Wendy Chapman
o Melissa Tharp
o Jihoon Kim
• Collaborator:
o Hua Xu
• SAB member and NHLBI officers
• Funding: UH2HL108785 from NHLBI/NIH
NLP analysis of study descriptions for users who obtained data from dbGaP
Topic analysis only, need more about Demographics
Mention more about “cancer” and “cardiovascular”
Based on 4 users: all were excited about having a resource like dbGaP but very confused about its usage. The one requirement requested by 100% of them was an explanation of why certain studies are selected, ideally with results sorted by relevancy.
Examples for variable mapping
Note that this is the full list; our experiment used only about 50 abbreviations.
DIVER pipeline
XML data dictionary as input → XML parsing and extraction (variable descriptions) → normalization (lexical variants, abbreviations, etc.) → MetaMap run → key concepts with their CUIs and semantic types → model fitting (semantic analysis) based on the key concepts and semantic types → metadata annotation and labeling
Duplicate
More focus on NLP
Among 135,608 phenotype variables:
- 104,172 mapped (to >= 1 category)
- 31,473 not mapped (most are IDs such as subject IDs)
This contains duplicated information
Queries for data show consistency in interest in disease, ethnicity/race, and therapeutics (drugs)
Rule development – manual process: review of the core concepts and their annotated information model classes