Disease phenotypes are descriptions of clinically observable or measurable traits that characterize a target disease and its associated patient cohort of interest (e.g., using HbA1c measurements, medical codes, and other criteria to identify patients with type II diabetes). As health data become increasingly digitized through the use of electronic health records (EHRs), data-driven phenotyping has emerged as a new discipline that aims to quickly identify disease-specific cohorts from large datasets and gain insights into disease dynamics through ever-changing real-world evidence. In this context, the word "phenotype" effectively takes on new semantics as "computable phenotype," which generally refers to any clinical pattern inferred from EHRs (and often genomic data as well) that can be used to make assertions about patients and their clinical conditions.
EHRs, however, contain noisy, practice-based patient data collected primarily for healthcare delivery, and therefore present a large representational gap for biomedical research tasks such as disease phenotyping. As a result, most computational phenotyping methods today are rule-based: clinical experts pre-specify, based on their domain knowledge, a set of phenotyping rules in terms of narrative descriptions, logical expressions, and workflows that capture the pathology and relevant medical observations of a disease cohort.
Rule-based methods often involve a long development cycle subject to site-dependent interpretations of the phenotyping algorithm, along with knowledge engineering and programming exercises that can stretch beyond several months to phenotype merely a single disease. To achieve better scalability while generalizing to more complex diseases -- where phenotype definitions are unclear but relevant medical concepts and patterns can be learned statistically from large-scale EHR data -- a large portion of my recent work in this area has focused on developing automated, statistical machine learning-based phenotyping methodologies.
In these slides, I will present an overview of health data-driven disease phenotyping with a focus on one example project – bulk learning – an EHR-based, multi-disease phenotyping framework.
In essence, bulk learning uses a hierarchical learning approach, combined with a medical ontology, to derive diagnostic components that collectively serve as phenotyping rules for a group of infectious diseases. As a multiple-disease phenotyping framework, it works in a fashion similar to a medical diagnosis setting (albeit through statistical means), where relevant medical concepts such as microbiology lab tests and blood chemistry tests, among others, are used as supporting evidence, with degrees of confidence estimated statistically from data, for determining positive cases while probabilistically ruling out the negatives.
2. Phenotypes and phenotyping
• Physically observable traits of genotypes (and their interactions with environments)
• Health data-derived clinical patterns (e.g. patient embeddings, graph basis)
• Disease descriptors and phenotyping rules
3. Utilities of phenotypes
• Phenotyping takes high-dimensional patient data (e.g. EHR) and maps it to medical concepts
• Learned phenotyping rules help to define target disease cohorts.
• Phenotypes can help answer different research questions:
– Descriptive: What is the trend of a lab value (e.g. eGFR)? What are the underlying temporal patterns?
– Predictive: Will this patient develop a comorbid condition X given the history we know?
– Prescriptive: Which medical interventions are more likely to be useful for this patient, considering cost-effectiveness?
• Patient segmentation
• Population representativeness (e.g. for clinical trials)
• Recommender system
4. From ML Perspective …
´ Prediction
´ Model Interpretation
´Reusable feature representation
´phenotyping rules
5. Data and Methods
• Data-driven phenotyping
– Data sources
• Clinical data: EHR, clinical notes, pathology reports, claims
• Genomic data: protein sequences, gene expressions, HPO
• Environment data: air pollution exposure profile
– Two main methodologies
• Rule-based approach (e.g. CKD staging)
– e.g. eMERGE: phekb.org/network-associations/emerge
• Probabilistic approach via ML, statistics
6. Example Projects
• (EHR-based) Bulk learning: a multi-disease phenotyping framework for infectious diseases
• EHR sequencing: developing an alternative representation mirroring the pipeline of genomic sequencing + sequence pattern recognition
• Combinations of air toxics associated with childhood asthma
• Protein function predictions: Gene Ontology
• HPO: multiway associations between genes/proteins, phenotypes, and target diseases
7. Bulk Learning: Workflow
• Simultaneous phenotyping on multiple diseases (to be discussed shortly)
[Figure: bulk learning workflow. Raw features from four example base models (microbiology, antibiotic, blood test, urine test) feed logistic units that produce level-1 abstract features, which in turn feed individual level-1 local units per disease and a level-1 global unit.]
1. Define feature groups using medical ontology
– 1a. Gather EHR data according to medical concepts
– 1b. Use the Medical Entities Dictionary to delineate feature scopes
– 1c. Apply feature selection within each concept group
2. Compute base models
3. Compute meta models (via ensemble learning)
– 3a. Per-disease ensembles: compute local level-1 models
– 3b. Cross-disease ensemble: compute a global level-1 model
8. Key Concepts
• Multi-disease phenotyping
– Integration between model stacking and ontology
• Feature discovery, engineering, and representation learning
– Training data preparation and "matching" (80%+ of dev time)
– Surrogate labels (e.g. ICD) vs annotated labels (gold standard)
– Disease prediction
• Converting a multi-class problem into a set of binary classification problems (why?) and predicting the degree of "association" between X (patient) and y (disease)
• Small sample size in model evaluation via "2D model stacking"
– Data augmentation via semi-supervised learning
• Diagnostic concept models
– Model explainability
– Using medical ontology for feature discovery and for defining concept-specific base models
9. Modeling Diagnostic Concepts
• Different infectious diseases share the same set of diagnostic concept units
• Infectious diseases
– Lab tests
• Microorganism, blood, urine, body tissue, stool
– Medications
• Antibiotic, antiviral, anthelmintic
• Build statistical models for each diagnostic component and combine them appropriately (see the sketch below)
– Ensemble learning
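A minimal sketch of this per-concept modeling step, assuming logistic regression base models (the framework's guideline is probabilistic classifiers; the feature matrices and labels below are simulated placeholders):

```python
# Sketch: one probabilistic base model per diagnostic concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for EHR features already grouped by concept:
# each concept group gets its own feature matrix for the same patients.
n_patients = 500
concept_features = {
    "microbiology": rng.normal(size=(n_patients, 30)),
    "antibiotic":   rng.normal(size=(n_patients, 20)),
    "blood_test":   rng.normal(size=(n_patients, 25)),
    "urine_test":   rng.normal(size=(n_patients, 10)),
}
y = rng.integers(0, 2, size=n_patients)  # surrogate labels (e.g. one ICD code)

# One classifier per diagnostic concept.
base_models = {
    concept: LogisticRegression(max_iter=1000).fit(X, y)
    for concept, X in concept_features.items()
}

# Each base model emits a probability, interpreted as a level-1
# diagnostic feature (degree of support from that concept).
level1_features = np.column_stack(
    [base_models[c].predict_proba(X)[:, 1] for c, X in concept_features.items()]
)
print(level1_features.shape)  # (500, 4): one column per diagnostic concept
```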
10. MLOps: Data Processing
• Data ingestion & preparation
• Bulk learning set (scoping)
• Ontology-based feature engineering
• Curation of training data (case vs control)
– How to select an appropriate control group?
11. Bulk Learning Set (1)
• Diseases of the same class, e.g. infectious diseases
• The set of target infectious diseases is represented by 100 ICD codes
– Why 100 codes?
– Code selection strategy?
• Systematic methods: e.g. use CCS to map the target diseases to their corresponding codes
• (Random) selection by sample size (100 out of ~1500)
13. Training Data Preparation
• 100 ICD codes correspond to 100 labels (e.g. 038.1: staphylococcal septicemia)
– ICDs are surrogate labels
– Other "free" labels? E.g. keywords in clinical notes, pathology reports
• Which part of the clinical record is of interest?
– Choose a window (w) and stay consistent; e.g. from 60 days prior to 30 days following the first mention of a given ICD code (e.g. 038.1)
• Each case needs a control
– Keep the control data (negative class) as similar as possible to the case data (positive class)
– Active variables
– Matching via a similarity metric (e.g. the Jaccard index; see the sketch below)
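A minimal sketch of this matching step, assuming each patient record is reduced to its set of active variables within the window; the variable names and candidate controls below are hypothetical:

```python
# Sketch: pick the control whose active-variable set is most similar
# to the case's, by Jaccard index.
def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of active variables."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Each record is reduced to the set of variables observed ("active")
# within the chosen window w.
case = {"wbc", "blood_culture", "urine_culture", "vancomycin"}
candidate_controls = {
    "p1": {"wbc", "blood_culture", "metformin"},
    "p2": {"wbc", "blood_culture", "urine_culture", "lipid_panel"},
    "p3": {"hba1c", "egfr"},
}

best = max(candidate_controls, key=lambda p: jaccard(case, candidate_controls[p]))
print(best, jaccard(case, candidate_controls[best]))  # p2 0.6
```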
15. Visualizing Active Variables: Similar Cases
Toxic shock syndrome (040.82),
Staphylococcus infection of unspecified site (041.11, 041.10),
Gram-negative organism infection (041.85), unspecified bacterial infection (041.89, 041.9),
Pseudomonas infection of unspecified site (041.7),
unspecified Streptococcus infection (041.00),
Proteus infection (041.6), Streptococcus infection of unspecified site (041.09)
16. Using Medical Ontology to Group Features
• Snapshot of the Medical Entities Dictionary (http://med.dmi.columbia.edu)
17. Feature Screening & Selection (1)
• Ontology-based feature candidate selection (or scoping)
– This results in the initial feature set
• Each base model represents a diagnostic concept (e.g., microbiology) through these variables
– Each code (disease) is associated with N diagnostic concepts (base models) (e.g. N=4 in the paper) => each model is associated with variables from the same concept class (e.g. microbiology); a small sketch follows
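A minimal sketch of ontology-based scoping, assuming a simple variable-to-concept lookup table; the mapping below is hypothetical (in the framework this role is played by the Medical Entities Dictionary):

```python
# Sketch: partition raw EHR variables into concept-specific feature scopes.
from collections import defaultdict

# Hypothetical mapping from raw EHR variables to diagnostic concepts.
variable_to_concept = {
    "blood_culture":  "microbiology",
    "wound_culture":  "microbiology",
    "vancomycin_rx":  "antibiotic",
    "cefazolin_rx":   "antibiotic",
    "wbc_count":      "blood_test",
    "hemoglobin":     "blood_test",
    "urine_nitrite":  "urine_test",
}

def group_features(raw_variables):
    """Group raw variables by the diagnostic concept they map to."""
    groups = defaultdict(list)
    for var in raw_variables:
        concept = variable_to_concept.get(var)
        if concept is not None:  # variables outside the ontology scope are dropped
            groups[concept].append(var)
    return dict(groups)

print(group_features(["wbc_count", "blood_culture", "urine_nitrite", "unknown_var"]))
# {'blood_test': ['wbc_count'], 'microbiology': ['blood_culture'],
#  'urine_test': ['urine_nitrite']}
```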
18. Feature Screening & Selection (2)
• Within each feature group, how do we select the most relevant clinical variables?
– Variable screening: most active variables
• Active in >= 80% of the training data
• This facilitates the matching process that generates the control dataset
• This leads to 747 variables for microbiology, 567 for antibiotics, 710 for blood tests, and 202 for urine tests
• In the model training stage
– BoLASSO: resampling + LASSO (see the sketch below)
– Results depend on the dataset, which determines the shrinkage
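A minimal sketch of BoLASSO-style screening: run LASSO on bootstrap resamples and keep the variables selected in (nearly) every run. The shrinkage parameter, the 90% selection threshold, and the data below are all assumptions for illustration:

```python
# Sketch: bootstrap-stabilized LASSO variable selection (BoLASSO-style).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 50
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only 2 true signals

n_boot, alpha, keep_rate = 100, 0.1, 0.9
selected = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)                  # bootstrap resample
    coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
    selected += (coef != 0)

# Keep variables that survive LASSO shrinkage in >= 90% of resamples.
stable = np.where(selected / n_boot >= keep_rate)[0]
print(stable)  # typically [0 1]
```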
22. Model Training
• Base models (diagnostic concepts)
– Base models map raw features to probabilities interpreted as diagnostic (level-1) features
• Indicator features, plus additional variables used only at level 1
• Higher-level models (level 1 and level 2)
– Local models: one model per ICD/disease
– Global model and evaluation with the gold standard
• Surrogate labels vs true labels (the latter typically few)
• Local models predict ICDs (but ICDs are not the gold standard)
23. Model Stacking
• Recall: Why inspect multiple (infectious) diseases?
– Use multiple diseases as a substrate and identify their common elements
– Shared feature representation: each condition has a different weight distribution over diagnostic components
• Next: an example stacking architecture
24. [Figure: example stacking architecture. Level 0: the microbiology, antibiotic, blood test, and urine test models (plus other phenotypic models, e.g. antiviral) produce level-0 probabilities and indicators. Level 1: per-disease models (e.g. ICD-9 codes 112.3, 009.0, 137.0, 054.2) produce level-1 probabilities. Level 2: a global model evaluated against the ICD-9 annotation set.]
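A minimal sketch of the two-tier stacking mechanics suggested by the figure, assuming logistic regression at every level and simulated data throughout (the exact architecture and feature sets in the study may differ):

```python
# Sketch: level-0 probabilities + indicators -> level-1 local model
# -> level-2 global model evaluated on a small gold-standard set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500

# Level-0 outputs: per-concept probabilities plus data-presence indicators
# (microbiology, antibiotic, blood test, urine test).
level0_probs = rng.uniform(size=(n, 4))
indicators = rng.integers(0, 2, size=(n, 4))           # 1 = concept data present
level1_input = np.hstack([level0_probs, indicators])   # 8 features, as in the paper

# Level-1 local model: one per ICD code, trained on surrogate labels.
icd_label = rng.integers(0, 2, size=n)
local_model = LogisticRegression(max_iter=1000).fit(level1_input, icd_label)
level1_prob = local_model.predict_proba(level1_input)[:, 1]

# Level-2 global model: pooled across diseases, fit against the small
# gold-standard annotation set (83 cases in the study; simulated here).
gold_idx = rng.choice(n, size=83, replace=False)
gold_y = rng.integers(0, 2, size=83)
global_model = LogisticRegression(max_iter=1000).fit(
    level1_prob[gold_idx].reshape(-1, 1), gold_y
)
```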
26. Local vs Global Models
• Can build a meta-model (level 1) for each disease using the probability scores from the base models (e.g. microbiology, antibiotic, etc.)
– Some diseases may not have sufficient data points
– Evaluation on labeled data (typically small) can still be a challenge
• Global model
– Combine cases across diseases to form a "global" model (in contrast to the disease-specific models, referred to as local models)
– In the combined model, the individuality of diseases is lost => only positive or negative cases
– Small annotation set (83 cases in the experiment)
• 54 cases sampled from positive examples corresponding to 54 distinct ICD-9 codes; 29 cases from the negative class
• Semi-supervised learning to generate virtual annotations (see the sketch below)
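A minimal sketch of self-training as one way to generate virtual annotations; the confidence threshold and the data are assumptions, and the study's exact semi-supervised procedure may differ:

```python
# Sketch: self-training on abstract (level-1) features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_labeled = rng.normal(size=(83, 8))       # small gold-standard set
y_labeled = rng.integers(0, 2, size=83)
X_unlabeled = rng.normal(size=(1000, 8))   # unlabeled pool

model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
probs = model.predict_proba(X_unlabeled)[:, 1]

# Accept only high-confidence predictions as "virtual" labels, then retrain.
confident = (probs > 0.9) | (probs < 0.1)  # threshold is an assumption
X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
y_aug = np.concatenate([y_labeled, (probs[confident] > 0.5).astype(int)])
model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```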
27. [Figure: the stacking architecture from Slide 7, repeated, showing the four example base models (microbiology, antibiotic, blood test, urine test), their raw features and logistic units, and the global level-1 units.]
28. Interpretation
• Now that we have predictive scores (and indicators) as features, how do we use them?
• Think of these probabilities as degrees of confidence …
– A subset of lab tests provides better explanations for particular cases
• Candida esophagitis → (M: 0.9, B: 0.5, U: 0.2, A: 0.6)
• Venereal disease → (M: 0.8, B: 0.6, U: 0.9, A: 0.6)
• Can add other high-level features: indicators, entropy, variance (see the sketch below)
– We published results considering only 8 features (4 model-specific + 4 indicators)
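A minimal sketch of deriving such high-level features from the four concept probabilities, using the Candida esophagitis scores above; the 0.5 indicator threshold is an assumption:

```python
# Sketch: entropy, variance, and indicator features over concept scores.
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a Bernoulli probability; high = uncertain."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# (M, B, U, A) scores for the Candida esophagitis example on the slide.
scores = np.array([0.9, 0.5, 0.2, 0.6])

features = {
    "mean_entropy": binary_entropy(scores).mean(),  # how uncertain each concept is
    "variance":     scores.var(),                   # how much the concepts disagree
    "indicator":    (scores > 0.5).astype(int),     # thresholded per-concept votes
}
print(features)
```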
29. Model Evaluation
• How well does the model predict ICDs (using separate test data)? (See the sketch after this list.)
– Base models
– Local level-1 and global level-1: predict ICDs
– Global level-2: predicts the gold standard (83 labeled cases)
• How well does the level-1 model capture individual diseases (i.e. using only diagnostic features instead of raw features)?
• How well does the model predict annotated data (associated with "true labels")?
– (Binarized) ICD becomes a candidate feature among the abstract features (e.g. probability scores, indicators)
• OHDSI-verified dataset (OHDSI: https://ohdsi.org/)
– Annotated data consist of randomly selected cases in which ICD-9 coding errors were corrected
– Data annotation and coding are two independent processes
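A minimal sketch of this evaluation, assuming AUC as the metric (a common choice for probabilistic classifiers; the paper's exact metrics may differ) with simulated scores and labels:

```python
# Sketch: evaluate each stacking tier against its respective label source.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y_icd = rng.integers(0, 2, size=200)    # surrogate (ICD) labels, held-out
y_gold = rng.integers(0, 2, size=83)    # gold-standard annotation set

level1_scores = rng.uniform(size=200)   # local/global level-1 predictions
level2_scores = rng.uniform(size=83)    # global level-2 predictions

print("level-1 vs ICD :", roc_auc_score(y_icd, level1_scores))
print("level-2 vs gold:", roc_auc_score(y_gold, level2_scores))
```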
36. Data Fusion
• What happens when a patient has multiple clinical visits for the same disease at different times?
• Missing data for a subset of base models
– Indicator variables, etc.
• Temporal alignment across different base models
– e.g. aligning x_i in the microbiology model with x_j in the antibiotic model according to their timestamps
• Other issues with multiple training instances (X)
– The time window may be too big to assume the disease state (y) stays the same
– Certain variables can assume different values within the chosen window w = [-60, 30]; which one to choose? (See the sketch after this list.)
• Most recent? Average? Min/max?
– Most representative instance (e.g. by aligning probability scores with the label after the training stage at the base level)
– More sophisticated methods? E.g. interpolation over time
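A minimal sketch of the aggregation options above for one variable within w = [-60, 30]; the measurements are hypothetical:

```python
# Sketch: collapse multiple in-window measurements to one training value.
import pandas as pd

# Timestamped measurements of one variable for one patient, as days
# relative to the first mention of the ICD code.
obs = pd.DataFrame({"day": [-75, -40, -10, 5, 45],
                    "wbc": [6.1, 9.8, 12.3, 11.0, 7.2]})

in_window = obs[(obs["day"] >= -60) & (obs["day"] <= 30)]

candidates = {
    "most_recent": in_window.loc[in_window["day"].idxmax(), "wbc"],
    "average":     in_window["wbc"].mean(),
    "min":         in_window["wbc"].min(),
    "max":         in_window["wbc"].max(),
}
print(candidates)  # {'most_recent': 11.0, 'average': 11.03..., 'min': 9.8, 'max': 12.3}
```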
37. Abstract Feature Representation: Design Choices
• Related work in constructing high-level features
– PCA, unsupervised feature learning, manifold learning, etc.
• Design choices
– Data characteristics
– Interpretability
• Deep neural networks
– Linear combination
– Non-linear transformation (e.g. sigmoid, rectifier, etc.)
• Their typical feature sets: continuous, dense, and "homogeneous"
– Image pixels
– Time series of lab measurements
– word2vec
• EHR data, however, are very different
– Sparse and incomplete
– Consist of many different types (binary, categorical, continuous, etc.)
– Features associated with multiple concepts
38. Future as a Multi-Disease Phenotyping Framework …
• Summary
– Bulk learning is a framework with at least the following system choices:
• The bulk learning set (of target conditions) => base models
• Classification algorithms (guideline: probabilistic, well-calibrated classifiers)
• Stacking architecture (multiple tiers => levels of abstraction)
• Strategy for combining individual (local) disease models into a global model
– Advantage: can use a small annotated sample for model construction and evaluation within the abstract feature space (e.g. level-1 data)
• 83 clinical cases were labeled in this study
– Challenge: models involving the interaction between abstract features and ICD-9 do not generalize well into regions of the data where the ICD-9 coding was incorrect
• Multiple types of surrogate labels
• Possible extensions
– Semi-supervised learning
– Active learning
– Other surrogate labels
[Figure: the stacking architecture from Slide 7, repeated.]
39. Thank You