Disease phenotypes are descriptions of clinically observable or measurable traits that characterize a target disease and its associated patient cohort of interest (e.g., using HbA1c measurements, medical codes, and other criteria to identify patients with type II diabetes). As health data become increasingly digitized through the use of electronic health records (EHRs), data-driven phenotyping has emerged as a new discipline that aims to quickly identify disease-specific cohorts from large datasets and gain insights into disease dynamics through ever-changing real-world evidence. In this context, the word "phenotype" effectively takes on new semantics as "computable phenotype," which generally refers to any clinical pattern inferred from EHRs (and often genomic data as well) that can be used to make assertions about patients and their clinical conditions.
EHRs, however, contain noisy, practice-based patient data collected primarily for healthcare delivery, and therefore present a large representational gap for biomedical research tasks such as disease phenotyping. As a result, most computational phenotyping methods today are rule-based: clinical experts pre-specify, based on their domain knowledge, a set of phenotyping rules in terms of narrative descriptions, logical expressions, and workflows that capture the pathology and relevant medical observations of a disease cohort.
Rule-based methods often involve a long development cycle subject to site-dependent interpretations of the phenotyping algorithm, along with knowledge engineering and programming exercises that can stretch beyond several months to phenotype merely a single disease. To achieve better scalability while generalizing to more complex diseases -- where phenotype definitions are unclear but relevant medical concepts and patterns can be learned statistically from large-scale EHR data -- a large portion of my recent work in this area has focused on developing automated, statistical machine learning-based phenotyping methodologies.
In these slides, I will present an overview of health data-driven disease phenotyping with a focus on one example project – bulk learning – an EHR-based, multi-disease phenotyping framework.
In essence, bulk learning uses a hierarchical learning approach, combined with a medical ontology, to derive diagnostic components that collectively serve as phenotyping rules for a group of infectious diseases. As a multiple-disease phenotyping framework, it works in a fashion similar to a medical diagnosis setting (albeit through statistical means), where relevant medical concepts such as microbiology lab tests and blood chemistry tests, among others, are used as supporting evidence, with degrees of confidence estimated statistically from data, for determining positive cases while probabilistically ruling out the negatives.
2. Phenotypes and phenotyping
• Physically observable traits of genotypes (and their interactions with environments)
• Health data-derived clinical patterns (e.g. patient embeddings, graph basis)
• Disease descriptors and phenotyping rules
3. Utilities of phenotypes
• Phenotyping takes high-dimensional patient data (e.g. EHR) and maps it to medical concepts
• Learned phenotyping rules help to define target disease cohorts.
• Phenotypes can help answer different research questions:
– Descriptive: What is the trend of a lab value (e.g. eGFR)? What are the underlying temporal patterns?
– Predictive: Will this patient develop a comorbid condition X given the history we know?
– Prescriptive: Which medical interventions are more likely to be useful for this patient, considering cost-effectiveness?
• Patient segmentation
• Population representativeness (e.g. for clinical trials)
• Recommender system
4. From ML Perspective …
´ Prediction
´ Model Interpretation
´Reusable feature representation
´phenotyping rules
5. Data and Methods
• Data-driven phenotyping
– Data sources
• Clinical data: EHR, clinical notes, pathology reports, claims
• Genomic data: protein sequences, gene expressions, HPO
• Environment data: air pollution exposure profile
– Two main methodologies
• Rule-based approach (e.g. CKD staging)
– e.g. eMERGE: phekb.org/network-associations/emerge
• Probabilistic approach via ML, statistics
6. Example Projects
• (EHR-based) Bulk learning: a multi-disease phenotyping framework for infectious diseases
• EHR sequencing: developing an alternative representation mirroring the pipeline of genomic sequencing + sequence pattern recognition
• Combinations of air toxics associated with childhood asthma
• Protein function predictions: Gene Ontology
• HPO: multiway associations between genes/proteins, phenotypes, and target diseases
7. Bulk Learning: Workflow
• Simultaneous phenotyping on multiple diseases (to be discussed shortly)
[Figure: bulk learning workflow. Raw features from four example base models (microbiology, antibiotic, blood test, urine test) feed logistic units that produce level-1 abstract features, which in turn feed individual level-1 local units per disease and a level-1 global unit.]
1. Define feature groups using medical ontology
– 1a. Gather EHR data according to medical concepts
– 1b. Use the Medical Entities Dictionary to delineate feature scopes
– 1c. Apply feature selection within each concept group
2. Compute base models
3. Compute meta models (via ensemble learning)
– 3a. Per-disease ensembles: compute local level-1 models
– 3b. Cross-disease ensemble: compute a global level-1 model
8. Key Concepts
• Multi-disease phenotyping
– Integration between model stacking and ontology
• Feature discovery, engineering, and representation learning
– Training data preparation and "matching" (80%+ of dev time)
– Surrogate labels (e.g. ICD) vs annotated labels (gold standard)
– Disease prediction
• Converting a multi-class problem into a set of binary classification problems (why?) and predicting the degree of "association" between X (patient) and y (disease)
• Small sample size in model evaluation via "2D model stacking"
– Data augmentation via semi-supervised learning
• Diagnostic concept models
– Model explainability
– Using medical ontology for feature discovery and for defining concept-specific base models
9. Modeling Diagnostic Concepts
• Different infectious diseases share the same set of diagnostic concept units
• Infectious diseases
– Lab tests
• Microorganism, blood, urine, body tissue, stool
– Medications
• Antibiotic, antiviral, anthelmintic
• Build statistical models for each diagnostic component and combine them appropriately (see the sketch below)
– Ensemble learning
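A minimal sketch of this per-concept modeling step, assuming logistic regression base models (the framework's guideline is probabilistic classifiers; the feature matrices and labels below are simulated placeholders):

```python
# Sketch: one probabilistic base model per diagnostic concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for EHR features already grouped by concept:
# each concept group gets its own feature matrix for the same patients.
n_patients = 500
concept_features = {
    "microbiology": rng.normal(size=(n_patients, 30)),
    "antibiotic":   rng.normal(size=(n_patients, 20)),
    "blood_test":   rng.normal(size=(n_patients, 25)),
    "urine_test":   rng.normal(size=(n_patients, 10)),
}
y = rng.integers(0, 2, size=n_patients)  # surrogate labels (e.g. one ICD code)

# One classifier per diagnostic concept.
base_models = {
    concept: LogisticRegression(max_iter=1000).fit(X, y)
    for concept, X in concept_features.items()
}

# Each base model emits a probability, interpreted as a level-1
# diagnostic feature (degree of support from that concept).
level1_features = np.column_stack(
    [base_models[c].predict_proba(X)[:, 1] for c, X in concept_features.items()]
)
print(level1_features.shape)  # (500, 4): one column per diagnostic concept
```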
10. MLOps: Data Processing
• Data ingestion & preparation
• Bulk learning set (scoping)
• Ontology-based feature engineering
• Curation of training data (case vs control)
– How to select an appropriate control group?
11. Bulk Learning Set (1)
• Diseases of the same class, e.g. infectious diseases
• The set of target infectious diseases is represented by 100 ICD codes
– Why 100 codes?
– Code selection strategy?
• Systematic methods: e.g. use CCS to map the target diseases to their corresponding codes
• (Random) selection by sample size (100 out of ~1500)
13. Training Data Preparation
• 100 ICD codes correspond to 100 labels (e.g. 038.1: staphylococcal septicemia)
– ICDs are surrogate labels
– Other "free" labels? E.g. keywords in clinical notes, pathology reports
• Which part of the clinical record is of interest?
– Choose a window (w) and stay consistent; e.g. from 60 days prior to 30 days following the first mention of a given ICD code (e.g. 038.1)
• Each case needs a control
– Keep the control data (negative class) as similar as possible to the case data (positive class)
– Active variables
– Matching via a similarity metric (e.g. the Jaccard index; see the sketch below)
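A minimal sketch of this matching step, assuming each patient record is reduced to its set of active variables within the window; the variable names and candidate controls below are hypothetical:

```python
# Sketch: pick the control whose active-variable set is most similar
# to the case's, by Jaccard index.
def jaccard(a: set, b: set) -> float:
    """Jaccard index between two sets of active variables."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Each record is reduced to the set of variables observed ("active")
# within the chosen window w.
case = {"wbc", "blood_culture", "urine_culture", "vancomycin"}
candidate_controls = {
    "p1": {"wbc", "blood_culture", "metformin"},
    "p2": {"wbc", "blood_culture", "urine_culture", "lipid_panel"},
    "p3": {"hba1c", "egfr"},
}

best = max(candidate_controls, key=lambda p: jaccard(case, candidate_controls[p]))
print(best, jaccard(case, candidate_controls[best]))  # p2 0.6
```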
15. Visualizing Active Variables: Similar Cases
Toxic shock syndrome (040.82),
Staphylococcus infection of unspecified site (041.11, 041.10),
Gram-negative organism infection (041.85), unspecified bacterial infection (041.89, 041.9),
Pseudomonas infection of unspecified site (041.7),
unspecified Streptococcus infection (041.00),
Proteus infection (041.6), Streptococcus infection of unspecified site (041.09)
16. Using Medical Ontology to Group Features
• Snapshot of the Medical Entities Dictionary (http://med.dmi.columbia.edu)
17. Feature Screening & Selection (1)
• Ontology-based feature candidate selection (or scoping)
– This results in the initial feature set
• Each base model represents a diagnostic concept (e.g., microbiology) through these variables
– Each code (disease) is associated with N diagnostic concepts (base models) (e.g. N=4 in the paper) => each model is associated with variables from the same concept class (e.g. microbiology); a small sketch follows
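A minimal sketch of ontology-based scoping, assuming a simple variable-to-concept lookup table; the mapping below is hypothetical (in the framework this role is played by the Medical Entities Dictionary):

```python
# Sketch: partition raw EHR variables into concept-specific feature scopes.
from collections import defaultdict

# Hypothetical mapping from raw EHR variables to diagnostic concepts.
variable_to_concept = {
    "blood_culture":  "microbiology",
    "wound_culture":  "microbiology",
    "vancomycin_rx":  "antibiotic",
    "cefazolin_rx":   "antibiotic",
    "wbc_count":      "blood_test",
    "hemoglobin":     "blood_test",
    "urine_nitrite":  "urine_test",
}

def group_features(raw_variables):
    """Group raw variables by the diagnostic concept they map to."""
    groups = defaultdict(list)
    for var in raw_variables:
        concept = variable_to_concept.get(var)
        if concept is not None:  # variables outside the ontology scope are dropped
            groups[concept].append(var)
    return dict(groups)

print(group_features(["wbc_count", "blood_culture", "urine_nitrite", "unknown_var"]))
# {'blood_test': ['wbc_count'], 'microbiology': ['blood_culture'],
#  'urine_test': ['urine_nitrite']}
```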
18. Feature Screening & Selection (2)
• Within each feature group, how do we select the most relevant clinical variables?
– Variable screening: most active variables
• Active in >= 80% of the training data
• This facilitates the matching process that generates the control dataset
• This leads to 747 variables for microbiology, 567 for antibiotics, 710 for blood tests, and 202 for urine tests
• In the model training stage
– BoLASSO: resampling + LASSO (see the sketch below)
– Results depend on the dataset, which determines the shrinkage
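A minimal sketch of BoLASSO-style screening: run LASSO on bootstrap resamples and keep the variables selected in (nearly) every run. The shrinkage parameter, the 90% selection threshold, and the data below are all assumptions for illustration:

```python
# Sketch: bootstrap-stabilized LASSO variable selection (BoLASSO-style).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 50
X = rng.normal(size=(n, p))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only 2 true signals

n_boot, alpha, keep_rate = 100, 0.1, 0.9
selected = np.zeros(p)
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)                  # bootstrap resample
    coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
    selected += (coef != 0)

# Keep variables that survive LASSO shrinkage in >= 90% of resamples.
stable = np.where(selected / n_boot >= keep_rate)[0]
print(stable)  # typically [0 1]
```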
22. Model Training
• Base models (diagnostic concepts)
– Base models map raw features to probabilities interpreted as diagnostic (level-1) features
• Indicator features, plus additional variables used only at level 1
• Higher-level models (level 1 and level 2)
– Local models: one model per ICD/disease
– Global model and evaluation with the gold standard
• Surrogate labels vs true labels (the latter typically few)
• Local models predict ICDs (but ICDs are not the gold standard)
23. Model Stacking
• Recall: Why inspect multiple (infectious) diseases?
– Use multiple diseases as a substrate and identify their common elements
– Shared feature representation: each condition has a different weight distribution over diagnostic components
• Next: an example stacking architecture
24. [Figure: example stacking architecture. Level 0: the microbiology, antibiotic, blood test, and urine test models (plus other phenotypic models, e.g. antiviral) produce level-0 probabilities and indicators. Level 1: per-disease models (e.g. ICD-9 codes 112.3, 009.0, 137.0, 054.2) produce level-1 probabilities. Level 2: a global model evaluated against the ICD-9 annotation set.]
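A minimal sketch of the two-tier stacking mechanics suggested by the figure, assuming logistic regression at every level and simulated data throughout (the exact architecture and feature sets in the study may differ):

```python
# Sketch: level-0 probabilities + indicators -> level-1 local model
# -> level-2 global model evaluated on a small gold-standard set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500

# Level-0 outputs: per-concept probabilities plus data-presence indicators
# (microbiology, antibiotic, blood test, urine test).
level0_probs = rng.uniform(size=(n, 4))
indicators = rng.integers(0, 2, size=(n, 4))           # 1 = concept data present
level1_input = np.hstack([level0_probs, indicators])   # 8 features, as in the paper

# Level-1 local model: one per ICD code, trained on surrogate labels.
icd_label = rng.integers(0, 2, size=n)
local_model = LogisticRegression(max_iter=1000).fit(level1_input, icd_label)
level1_prob = local_model.predict_proba(level1_input)[:, 1]

# Level-2 global model: pooled across diseases, fit against the small
# gold-standard annotation set (83 cases in the study; simulated here).
gold_idx = rng.choice(n, size=83, replace=False)
gold_y = rng.integers(0, 2, size=83)
global_model = LogisticRegression(max_iter=1000).fit(
    level1_prob[gold_idx].reshape(-1, 1), gold_y
)
```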
26. Local vs Global Models
• Can build a meta-model (level 1) for each disease using the probability scores from the base models (e.g. microbiology, antibiotic, etc.)
– Some diseases may not have sufficient data points
– Evaluation on labeled data (typically small) can still be a challenge
• Global model
– Combine cases across diseases to form a "global" model (in contrast to the disease-specific models, referred to as local models)
– In the combined model, the individuality of diseases is lost => only positive or negative cases
– Small annotation set (83 cases in the experiment)
• 54 cases sampled from positive examples corresponding to 54 distinct ICD-9 codes; 29 cases from the negative class
• Semi-supervised learning to generate virtual annotations (see the sketch below)
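A minimal sketch of self-training as one way to generate virtual annotations; the confidence threshold and the data are assumptions, and the study's exact semi-supervised procedure may differ:

```python
# Sketch: self-training on abstract (level-1) features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_labeled = rng.normal(size=(83, 8))       # small gold-standard set
y_labeled = rng.integers(0, 2, size=83)
X_unlabeled = rng.normal(size=(1000, 8))   # unlabeled pool

model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
probs = model.predict_proba(X_unlabeled)[:, 1]

# Accept only high-confidence predictions as "virtual" labels, then retrain.
confident = (probs > 0.9) | (probs < 0.1)  # threshold is an assumption
X_aug = np.vstack([X_labeled, X_unlabeled[confident]])
y_aug = np.concatenate([y_labeled, (probs[confident] > 0.5).astype(int)])
model = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```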
27. [Figure: the stacking architecture from Slide 7, repeated, showing the four example base models (microbiology, antibiotic, blood test, urine test), their raw features and logistic units, and the global level-1 units.]
28. Interpretation
• Now that we have predictive scores (and indicators) as features, how do we use them?
• Think of these probabilities as degrees of confidence …
– A subset of lab tests provides better explanations for particular cases
• Candida esophagitis → (M: 0.9, B: 0.5, U: 0.2, A: 0.6)
• Venereal disease → (M: 0.8, B: 0.6, U: 0.9, A: 0.6)
• Can add other high-level features: indicators, entropy, variance (see the sketch below)
– We published results considering only 8 features (4 model-specific + 4 indicators)
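A minimal sketch of deriving such high-level features from the four concept probabilities, using the Candida esophagitis scores above; the 0.5 indicator threshold is an assumption:

```python
# Sketch: entropy, variance, and indicator features over concept scores.
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in bits) of a Bernoulli probability; high = uncertain."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# (M, B, U, A) scores for the Candida esophagitis example on the slide.
scores = np.array([0.9, 0.5, 0.2, 0.6])

features = {
    "mean_entropy": binary_entropy(scores).mean(),  # how uncertain each concept is
    "variance":     scores.var(),                   # how much the concepts disagree
    "indicator":    (scores > 0.5).astype(int),     # thresholded per-concept votes
}
print(features)
```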
29. Model Evaluation
• How well does the model predict ICDs (using separate test data)? (See the sketch after this list.)
– Base models
– Local level-1 and global level-1: predict ICDs
– Global level-2: predicts the gold standard (83 labeled cases)
• How well does the level-1 model capture individual diseases (i.e. using only diagnostic features instead of raw features)?
• How well does the model predict annotated data (associated with "true labels")?
– (Binarized) ICD becomes a candidate feature among the abstract features (e.g. probability scores, indicators)
• OHDSI-verified dataset (OHDSI: https://ohdsi.org/)
– Annotated data consist of randomly selected cases in which ICD-9 coding errors were corrected
– Data annotation and coding are two independent processes
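A minimal sketch of this evaluation, assuming AUC as the metric (a common choice for probabilistic classifiers; the paper's exact metrics may differ) with simulated scores and labels:

```python
# Sketch: evaluate each stacking tier against its respective label source.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y_icd = rng.integers(0, 2, size=200)    # surrogate (ICD) labels, held-out
y_gold = rng.integers(0, 2, size=83)    # gold-standard annotation set

level1_scores = rng.uniform(size=200)   # local/global level-1 predictions
level2_scores = rng.uniform(size=83)    # global level-2 predictions

print("level-1 vs ICD :", roc_auc_score(y_icd, level1_scores))
print("level-2 vs gold:", roc_auc_score(y_gold, level2_scores))
```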
36. Data Fusion
• What happens when a patient has multiple clinical visits for the same disease at different times?
• Missing data for a subset of base models
– Indicator variables, etc.
• Temporal alignment across different base models
– e.g. aligning x_i in the microbiology model with x_j in the antibiotic model according to their timestamps
• Other issues with multiple training instances (X)
– The time window may be too big to assume the disease state (y) stays the same
– Certain variables can assume different values within the chosen window w = [-60, 30]; which one to choose? (See the sketch after this list.)
• Most recent? Average? Min/max?
– Most representative instance (e.g. by aligning probability scores with the label after the training stage at the base level)
– More sophisticated methods? E.g. interpolation over time
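A minimal sketch of the aggregation options above for one variable within w = [-60, 30]; the measurements are hypothetical:

```python
# Sketch: collapse multiple in-window measurements to one training value.
import pandas as pd

# Timestamped measurements of one variable for one patient, as days
# relative to the first mention of the ICD code.
obs = pd.DataFrame({"day": [-75, -40, -10, 5, 45],
                    "wbc": [6.1, 9.8, 12.3, 11.0, 7.2]})

in_window = obs[(obs["day"] >= -60) & (obs["day"] <= 30)]

candidates = {
    "most_recent": in_window.loc[in_window["day"].idxmax(), "wbc"],
    "average":     in_window["wbc"].mean(),
    "min":         in_window["wbc"].min(),
    "max":         in_window["wbc"].max(),
}
print(candidates)  # {'most_recent': 11.0, 'average': 11.03..., 'min': 9.8, 'max': 12.3}
```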
37. Abstract Feature Representation: Design Choices
• Related work in constructing high-level features
– PCA, unsupervised feature learning, manifold learning, etc.
• Design choices
– Data characteristics
– Interpretability
• Deep neural networks
– Linear combination
– Non-linear transformation (e.g. sigmoid, rectifier, etc.)
• Their typical feature sets: continuous, dense, and "homogeneous"
– Image pixels
– Time series of lab measurements
– word2vec
• EHR data, however, are very different
– Sparse and incomplete
– Consist of many different types (binary, categorical, continuous, etc.)
– Features associated with multiple concepts
38. Future as a Multi-Disease Phenotyping Framework …
• Summary
– Bulk learning is a framework with at least the following system choices:
• The bulk learning set (of target conditions) => base models
• Classification algorithms (guideline: probabilistic, well-calibrated classifiers)
• Stacking architecture (multiple tiers => levels of abstraction)
• Strategy for combining individual (local) disease models into a global model
– Advantage: can use a small annotated sample for model construction and evaluation within the abstract feature space (e.g. level-1 data)
• 83 clinical cases were labeled in this study
– Challenge: models involving the interaction between abstract features and ICD-9 do not generalize well into regions of the data where the ICD-9 coding was incorrect
• Multiple types of surrogate labels
• Possible extensions
– Semi-supervised learning
– Active learning
– Other surrogate labels
[Figure: the stacking architecture from Slide 7, repeated.]
39. Thank You