Rule-Based NLP System for Tagging and Categorizing Phenotype Variables

A Rule-Based NLP System in Tagging and
Categorizing Phenotype Variables in dbGaP
Son Doan, Ko-Wei Lin, Rebecca Walker, Seena Farzaneh,
Neda Alipanah and Hyeon-Eui Kim
Division of Biomedical Informatics
UC San Diego
AMIA 2013, Washington DC, 11/18/2013

Roadmap
• Background
o dbGaP
o Challenges in using dbGaP
o pFINDR program
• Phenotype standardization in dbGaP
• PhenDisco system
• Performance evaluation

NCBI’s database of Genotypes and Phenotypes (dbGaP)
“dbGaP was developed to
archive and distribute the
results of studies that have
investigated the interaction
of genotype and phenotype”
As of 11/14/2013:
- 411 top-level studies
- 139,238 phenotype variables
- 2,816 datasets
- 3,895 analyses

What topics are dbGaP users focusing on?
Most frequent phenotype topics from the dbGaP data requests
Genetic Disease Congenital Abnormality (8.6%)
Cardiovascular Disease (8.1%)
14,287 data requests from
dbGaP Website

http://www.ncbi.nlm.nih.gov/gap


pFINDR (phenotype Finding IN Data Repositories)
• Funded by NHLBI/NIH
• To facilitate dbGaP use by improving accuracy
and completeness of search returns
– Standardized existing phenotype variables
– Searchable study-related information

Related work
• PhenX: Defined the 287 most important phenotypes and
manually mapped them into 16 dbGaP studies
• eMERGE: Standardize EMR phenotypes
Semi-automatic process: First, phenotypes are
automatically mapped into standardized
vocabularies; then outputs are returned and curated
by users

Our approach
Using NLP to standardize phenotype variables in dbGaP
Integrating NLP components into a new phenotype
search tool for dbGaP
dbGaP
Free text search
Structured (advanced) search
Unsorted, flat list results
Data user
Study Description Annotator
Phenotype Variable Annotator
Ontology Mapper
Query Parser
Ranking Algorithms
Standardization & annotation
sdGaP
PhenDisco
sdGaP contains standardized phenotype variables
Ranked results
Structured Search
Free text search

Phenotype Variable Standardization Pipeline
Phenotype variables → Tagger → Categorizer
Tagger: Identify topic and subject of information
Categorizer: Identify semantic category of phenotypes
Variable | Topic | Subject | Category
Gender of the participant | Gender | participant | Demographics
CIGARETTES/DAY, EXAM 1 | medical examination | study subject | Smoking History; Healthcare Activity Finding
Weight in kg. at baseline | weighing patient | study subject | Clinical Attributes
AGE OF LIVING MOTHER | Age | mother (person) | Demographics

Topic: Main theme of phenotype variables
Subject of information: Bearer of the variable

Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
Normalization
• Spell out abbreviations and shorthand expressions
• Drop question numbers and other unimportant characters
MetaMap Processing
• Generate CUIs, concept names, semantic types
Semantic Role Assignment
• Semantic types and keyword-based role identification
Topic Filtering
• Keep concepts that match SNOMED-CT clinical findings
• Remove problematic concepts
Variable Categorization
• Semantic types and keyword-based categorization
Two domain experts selected 15 semantic categories
based on semantic types from MetaMap, including
Demographics, Medical History, Clinical Attribute,
Medication, and Lab Tests

Creating an abbreviation list from dbGaP
bp blood pressure
bmi body mass index
bpm beats per minute
bw body weight
dbp diastolic blood pressure
hbp high blood pressure
htn hypertension
hr heart rate
Ht height
lb pounds
rr respiration rate
sbp systolic blood pressure
temp temperature
TPR temperature, pulse, respiration
wt weight
yr year
vs vital signs
We compiled and reviewed a list of abbreviations in
dbGaP. The original list contained 50 abbreviations;
the latest version contains 520 abbreviations.
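
The abbreviation expansion step can be sketched as a dictionary lookup over word tokens. This is a minimal illustration, not the PhenDisco implementation: the function name `expand_abbreviations` and the case-insensitive, word-level matching are assumptions, and the table holds only a few entries from the list above.

```python
import re

# A small excerpt of the abbreviation list shown on the slide.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "bmi": "body mass index",
    "hr": "heart rate",
    "ht": "height",
    "sbp": "systolic blood pressure",
    "wt": "weight",
    "yr": "year",
}


def expand_abbreviations(description, table=ABBREVIATIONS):
    """Replace each alphabetic token found in the table with its expansion.

    Matching is case-insensitive (an assumption); unknown tokens are kept.
    """
    def replace(match):
        token = match.group(0)
        return table.get(token.lower(), token)

    return re.sub(r"[A-Za-z]+", replace, description)


print(expand_abbreviations("Ht and wt at yr 2"))
```

A plain lookup like this cannot tell an abbreviation from an ordinary word that happens to be spelled the same, which is one reason a manually reviewed list matters.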

Remove number with exceptions
Rule | Example
1. If # after "type", keep this number | type 1 diabetes; type I diabetes; type 2 diabetes; Glycogen storage disease type I; type 1 hypersensitivity diseases
2. If # after "grade", keep this number | grade 1 Dupuytren's disease
3. If # after "stage", keep this number | stage 1 chronic kidney disease
4. If # after "bipolar", keep this number | bipolar 1 disorder; bipolar I disorder; bipolar II disorder; bipolar 2 disorder
5. If # after "class", keep this number | class I and II Newcastle disease; class 1 Newcastle disease
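
The number-removal rules can be sketched as a token filter that drops standalone numbers unless the preceding word is one of the five keywords. This is a minimal sketch under assumptions: the name `remove_numbers` and whitespace tokenization are illustrative, and only Arabic numerals are treated as numbers, so Roman numerals such as "I" and "II" pass through unchanged.

```python
# Keywords after which a number carries meaning and must be kept
# (taken from the five rules on the slide).
KEEP_AFTER = {"type", "grade", "stage", "bipolar", "class"}


def remove_numbers(description):
    """Drop standalone numeric tokens except those following a keyword."""
    tokens = description.split()
    kept = []
    for i, token in enumerate(tokens):
        is_number = token.isdigit()
        protected = i > 0 and tokens[i - 1].lower() in KEEP_AFTER
        if is_number and not protected:
            continue  # e.g. a question number or other stray index
        kept.append(token)
    return " ".join(kept)


print(remove_numbers("77 type 2 diabetes"))
```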

Rules for tagger and categorizer: examples
Rule for Tagger
IF CandidatePreferred contains "gender" or "sex"
AND SemType = organism attribute
THEN Topic = Gender
Rule for Categorizer
IF Topic concepts = Pharmacologic Substance
OR Variable description contains the word “medication”
THEN Type = Medication
Two domain experts reviewed and created rules from 300
randomly selected unique phenotype variables
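
The two example rules translate directly into conditionals. The sketch below is illustrative only: the `Concept` class and the function names are hypothetical, the field names mirror the rule text rather than any actual MetaMap output format, and word-level matching for "medication" is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Concept:
    candidate_preferred: str  # preferred concept name, as named in the rule
    semtype: str              # semantic type, e.g. "Organism Attribute"


def tag_topic(concept):
    """Tagger rule: gender/sex organism attributes get Topic=Gender."""
    name = concept.candidate_preferred.lower()
    if ("gender" in name or "sex" in name) and \
            concept.semtype.lower() == "organism attribute":
        return "Gender"
    return None  # other tagger rules would go here


def categorize(topic_concepts, description):
    """Categorizer rule: drug concepts or the word 'medication'."""
    words = re.findall(r"[a-z]+", description.lower())
    is_drug = any(c.semtype == "Pharmacologic Substance" for c in topic_concepts)
    if is_drug or "medication" in words:
        return "Medication"
    return None  # other categorizer rules would go here
```

For example, `tag_topic(Concept("Gender", "Organism Attribute"))` returns `"Gender"`, and `categorize([], "Current medication use")` returns `"Medication"`.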

Example of tagging
Variable description: 77 age mom diagnosed– stroke (tia)
After normalization: age mother diagnosed stroke (tia)
MetaMap output:
• C0001779: Age [Organism Attribute]
• C0026591: Mother (Mother (person)) [Family Group]
• C0038454: Stroke (Cerebrovascular accident) [Disease or Syndrome]
• ‘Diagnosed’
Result:
• TOPIC: Age (C0001779)
• Subject of Information: Mother (C0026591)
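
The role assignment in this example can be reproduced with a very small sketch. The slide does not state how Age is chosen over Stroke as the topic, so the logic here is an assumption: concepts whose semantic type is Family Group become the subject of information, and the first remaining concept in MetaMap order becomes the topic. The names `SUBJECT_SEMTYPES` and `assign_roles` are hypothetical.

```python
# Assumption: these semantic types mark the bearer of the variable.
SUBJECT_SEMTYPES = {"Family Group"}


def assign_roles(concepts):
    """concepts: list of (cui, name, semtype) tuples in MetaMap order."""
    subject = next((c for c in concepts if c[2] in SUBJECT_SEMTYPES), None)
    # Assumption: the first non-subject concept is taken as the topic.
    topic = next((c for c in concepts if c[2] not in SUBJECT_SEMTYPES), None)
    return {"topic": topic, "subject": subject}


# Concepts from the slide's example.
concepts = [
    ("C0001779", "Age", "Organism Attribute"),
    ("C0026591", "Mother", "Family Group"),
    ("C0038454", "Stroke", "Disease or Syndrome"),
]
roles = assign_roles(concepts)
print(roles["topic"][1], "/", roles["subject"][1])
```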

Variable Descriptions → Normalization → MetaMap Processing → Semantic Role Assignment → Topic Filtering → Variable Categorization
135,608 variables
116,957 phenotypes mapped to Topic
104,172 phenotypes mapped to Category
Evaluation:
- Random sample of 500 unique phenotypes
- Reviewed by 3 domain experts
73% accuracy for topic
71% accuracy for category

Semantic category of phenotypes in dbGaP
(as of July 1, 2013)

Mapping Failures
a. Unprocessed by MetaMap
14 c first arm other
vessel max l lat obliq
b. Lexical problem: items with incorrect lexical forms,
including typos
hba1 c collection date month 057
fateat gm
c. ID variables or some administrative variables
form number
f124 documentation used form
d. Not covered by our rules
years treated pet for fleas
what is your first language

PhenDisco: Put-it-all-together
Free text
Query parser
sdGaP
Relevant studies
Ranked studies
BM25 ranking algorithm
NLP tools + MetaMap
Information model mapping
dbGaP
Standardized phenotypes
Doan S, Lin KW, Conway M, Ohno-Machado et al. "PhenDisco: Phenotype Discovery System for the Database of
Genotypes and Phenotypes (dbGaP)", JAMIA, 2013, doi:10.1136/amiajnl-2013-001882.
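
For readers unfamiliar with BM25, the following is a minimal sketch of the standard Okapi BM25 scoring function. It is not PhenDisco's code: the parameter values `k1=1.5` and `b=0.75` are common defaults chosen here as assumptions, the IDF uses the non-negative "+1" variant, and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in documents for term in set(d))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized study descriptions.
docs = [
    ["copd", "lung", "function"],
    ["breast", "cancer", "density"],
    ["copd", "copd", "smoking"],
]
scores = bm25_scores(["copd"], docs)
```

Sorting studies by descending score gives a ranked result list, in contrast to the unsorted flat list described earlier for dbGaP.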

PhenDisco system
Search
o Term auto-complete
o Synonym expansion
Display
o Keyword highlighting
o Ranking by relevance
o Filter by study metadata
o Cross-link related studies
Export to Excel
o Selected study metadata
o Selected phenotype variables
Search by titles, platform, study

PhenDisco vs dbGaP Entrez
Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
“macular degeneration” AND white | 100% | 42.86% | 100% | 85.71%
“breast cancer” AND “breast density” | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%
Average F-measure | dbGaP: 0.64 | PhenDisco: 0.89
(as of July 7, 2013)
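
The reported average F-measures are consistent with the harmonic mean of the averaged recall and precision. The slide does not say which F-measure variant was used, so the balanced F1 definition below is an assumption.

```python
def f_measure(recall, precision):
    """Balanced F-measure (F1): harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Averaged values from the table, expressed as fractions.
dbgap_f = f_measure(recall=1.0, precision=0.4661)
phendisco_f = f_measure(recall=0.8333, precision=0.9571)
print(round(dbgap_f, 2), round(phendisco_f, 2))  # prints 0.64 0.89
```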

Summary & Future work
• A rule-based approach is a simple yet efficient
way to standardize phenotype variables in
dbGaP
• Integration with machine learning methods will be
investigated
• Identification of similar variables is in progress!

Acknowledgements
• Lucila Ohno-Machado (PI)
• Other PhenDisco team members:
o Mike Conway
o Alex Hsieh
o Stephanie Feudjido Feupe
o Asher Garland
o Mindy Ross
o Xiaoqian Jiang
o Jing Zhang
• Early contributors
o Wendy Chapman
o Melissa Tharp
o Jihoon Kim
• Collaborator:
o Hua Xu
• SAB member and NHLBI officers
• Funding: UH2HL108785 from NHLBI/NIH

Questions?
Project Homepage: http://pfindr.net
PhenDisco: http://pfindr-data.ucsd.edu/_PhDVer1
Contact:
sondoan@ucsd.edu
hyk038@ucsd.edu
Source code and database of PhenDisco are
publicly available
