Data-Driven Disease Phenotyping
Po-Hsiang (Barnett) Chiu
Phenotypes and phenotyping
Physically observable traits of genotypes (and their interactions with environments)
Health data-derived clinical patterns (e.g. patient embeddings, graph basis)
Disease descriptors and phenotyping rules
Utilities of phenotypes
• Phenotyping takes high-dimensional patient data (e.g. EHR)
and maps it to medical concepts
• Learned phenotyping rules help to define target disease
cohorts.
• Phenotypes can help answer different research questions:
– Descriptive: What's the trend of the lab values (e.g. eGFR)?
What are the underlying temporal patterns?
– Predictive: Will this patient develop a comorbid condition X
given the history we know?
– Prescriptive: What medical interventions are more likely to be
useful for this patient, considering cost-effectiveness?
• Patient segmentation
• Population representativeness (e.g. for clinical trials)
• Recommender system
From ML Perspective …
• Prediction
• Model interpretation
• Reusable feature representation
• Phenotyping rules
Data and Methods
• Data-driven phenotyping
– Data sources
• Clinical data (EHR-based): EHR, clinical notes, pathology reports, claims
• Genomic data: protein sequences, gene expressions, HPO
• Environment data: air pollution exposure profile
– Two main methodologies
• Rule-based approach (e.g. CKD staging)
– e.g. eMERGE: phekb.org/network-associations/emerge
• Probabilistic approach via ML, statistics
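As an illustration of the rule-based approach, the sketch below stages CKD from a single eGFR value using the standard KDIGO GFR categories (G1 to G5). The function name is hypothetical, and the rule is deliberately simplified: a real phenotyping rule would also need chronicity (abnormality lasting more than 3 months) and usually albuminuria, both omitted here.

```python
def ckd_gfr_category(egfr: float) -> str:
    """Map an eGFR value (mL/min/1.73 m^2) to a KDIGO GFR category.

    Illustrative only: ignores the chronicity and albuminuria criteria
    that a complete CKD phenotyping rule would require.
    """
    if egfr < 0:
        raise ValueError("eGFR must be non-negative")
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"
```

A rule like this is transparent and easy to audit, which is the main appeal of the rule-based approach; the probabilistic approach trades some of that transparency for the ability to learn from data.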
Example Projects
• (EHR-based) Bulk learning: a multi-disease
phenotyping framework for infectious diseases
• EHR sequencing: developing an alternative
representation mirroring the pipeline of genomic
sequencing + sequence pattern recognition
• Combinations of air toxics associated with
childhood asthma
• Protein function predictions: Gene ontology
• HPO: Multiway associations between
genes/proteins, phenotypes, target diseases
Bulk Learning: Workflow
• Simultaneous phenotyping on multiple diseases (to be
discussed shortly)
[Figure: bulk learning workflow. Raw features (f11, f12, ..., f41) are grouped into four example base models (microbiology, antibiotic, blood test, urine test), built from logistic units. Their outputs are the level-1 abstract features, which feed the individual level-1 local units (one per disease i) and the level-1 global unit.]
1. Define Feature Groups Using Medical Ontology
1a. Gather EHR data according to medical concepts
1b. Use Medical Entities Dictionary to delineate feature scopes
1c. Apply feature selection within each concept group
2. Compute Base Models
3. Compute Meta Models (via Ensemble Learning)
3a. Per-disease ensembles: compute local level-1 models
3b. Cross-disease ensemble: compute a global level-1 model
Key Concepts
• Multi-disease phenotyping
– Integration between model stacking and ontology
• Feature discovery, engineering, and representation learning
– Training data preparation and “matching” (80%+ dev time)
– Surrogate labels (e.g. ICD) vs annotated labels (gold standard)
– Disease prediction
• Converting multi-class problem into a set of binary classification problems
(why?) and predict the degree of “association” between X (patient) and y
(disease)
• Small sample size in model evaluation via “2D model stacking”
– Data augmentation via semi-supervised learning
• Diagnostic concept models
– Model explainability
– Using medical ontology for feature discovery and for defining
concept-specific base models
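One way to read "converting a multi-class problem into a set of binary classification problems" is one-vs-rest labeling, where each ICD code gets its own binary label per patient. The sketch below shows that conversion on hypothetical toy data; the function name and patient records are assumptions made for illustration. A plausible motivation, not stated on the slide, is that a patient can carry several codes at once, so the classes are not mutually exclusive.

```python
from typing import Dict, List, Set


def one_vs_rest_labels(
    patient_codes: Dict[str, Set[str]], target_codes: List[str]
) -> Dict[str, Dict[str, int]]:
    """For each target code, return a binary label per patient (1 = code present)."""
    return {
        code: {pid: int(code in codes) for pid, codes in patient_codes.items()}
        for code in target_codes
    }


# Hypothetical toy data: patient id -> set of ICD-9 codes on record.
patients = {
    "p1": {"038.1", "041.11"},
    "p2": {"112.3"},
    "p3": set(),
}
labels = one_vs_rest_labels(patients, ["038.1", "112.3"])
```

Each entry of `labels` can then be used to train one binary classifier whose score is read as the degree of association between the patient and that disease.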
Modeling Diagnostic Concepts
• Different infectious diseases share the same set
of diagnostic concept units
• Infectious diseases
– Lab tests
• Microorganism, blood, urine, body tissues, stool
– Medications
• Antibiotic, antiviral, anthelmintic
• Build statistical models for each diagnostic
component and combine them appropriately
– Ensemble learning
MLOps: Data Processing
• Data ingestion & preparation
• Bulk learning set (scoping)
• Ontology-based feature engineering
• Curation of training data (case vs control)
• How to select an appropriate control group?
Bulk Learning Set (1)
• Diseases of the same class e.g. infectious
diseases
• The set of target infectious diseases is
represented by 100 ICD codes
– Why 100 codes?
– Code selection strategy?
• Systematic methods: e.g. Use CCS to map the target
diseases to their corresponding codes
• (Random) selection by sample size (100 out of ~1500)
Bulk Learning Set (2)
Training Data Preparation
• 100 ICD codes correspond to 100 labels (e.g. 038.1: Staphylococcal
septicemia)
– ICDs are surrogate labels
– Other “free” labels? E.g. keywords in clinical notes, pathology reports
• Which part of the clinical record is of interest?
– Choose a window (w) and stay consistent; e.g. from 60 days before
to 30 days after the first mention of a given ICD code (e.g. 038.1).
• Each case needs a control
– Try to keep control data (negative class) as similar as possible to the
case data (positive class)
– Active variables
– Matching via similarity metric (e.g. Jaccard index)
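The matching step can be sketched as follows, assuming each patient window is summarized by its set of active variables (the tests and medications observed in the window). The variable names, candidate pool, and the greedy "pick the most similar candidate" strategy are assumptions for illustration; the slides only name the Jaccard index as an example metric.

```python
from typing import Dict, Optional, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: size of intersection over size of union (1.0 if both empty)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def best_control(
    case_vars: Set[str], candidates: Dict[str, Set[str]]
) -> Tuple[Optional[str], float]:
    """Pick the candidate whose active-variable set is most similar to the case."""
    best_id, best_score = None, -1.0
    for cid, cvars in candidates.items():
        score = jaccard(case_vars, cvars)
        if score > best_score:
            best_id, best_score = cid, score
    return best_id, best_score


# Hypothetical active variables for one case and two candidate controls.
case = {"blood_culture", "wbc", "vancomycin"}
pool = {
    "c1": {"blood_culture", "wbc"},
    "c2": {"urinalysis"},
}
match_id, score = best_control(case, pool)
```

Matching on active variables keeps the control close to the case in terms of which measurements exist, so a classifier has to rely on the values rather than on the mere presence of a test.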
Visualizing Active Variables: Diverse Cases
Cysticercosis (123.1), Candidiasis (112.3), Meningococcal meningitis (036.0),
Dengue (061), Lyme disease (088.81), RSV (079.6), pneumococcal pneumonia (481),
Herpes zoster (053.19), Listeriosis (027.0), Salmonella gastroenteritis (003.0)
UpSet: caleydo.org
Visualizing Active Variables: Similar Cases
Toxic shock syndrome (040.82),
Staphylococcus infection of unspecified site (041.11, 041.10),
Gram-negative organism infection (041.85), Unspecified bacterial infection (041.89, 041.9),
Pseudomonas infection of unspecified site (041.7),
Unspecified streptococcus infection (041.00),
Proteus (041.6), Streptococcus infection of unspecified site (041.09)
Using Medical Ontology to Group Features
• Snapshot of Medical Entities Dictionary
(http://med.dmi.columbia.edu)
Feature Screening & Selection (1)
• Ontology based feature candidate selection
(or scoping)
– This results in the initial set
• Each base model represents a diagnostic
concept (e.g., microbiology) through these
variables
– Each code (disease) is associated with N diagnostic
concepts (base models) (e.g. N=4 in the paper) =>
each model is associated with variables from the
same concept class (e.g. microbiology)
Feature Screening & Selection (2)
• Within each feature group, how to select the
most relevant clinical variables?
– Variable screening: Most active variables
• Active in >= 80% of the training data
• This facilitates a matching process that generates control
dataset
• This leads to 747 variables for microbiology, 567 for
antibiotic, 710 for blood test, and 202 for urine test
• In the model training stage
– BoLASSO: resampling + LASSO
– Results depend on the dataset, which determines the
shrinkage
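The variable screening rule (keep variables that are active in at least 80% of the training data) can be sketched as below. This is a minimal reading of the slide, assuming each training record is represented by the set of variables observed in it; the function name and toy records are hypothetical, and the BoLASSO step that follows in model training is not shown.

```python
from collections import Counter
from typing import Iterable, List, Set


def screen_active_variables(
    records: Iterable[Set[str]], min_fraction: float = 0.8
) -> List[str]:
    """Keep variables that are active (observed) in at least min_fraction of records."""
    records = list(records)
    if not records:
        return []
    counts = Counter(var for rec in records for var in rec)
    n = len(records)
    return sorted(v for v, c in counts.items() if c / n >= min_fraction)


# Hypothetical training records for one feature group.
train = [
    {"wbc", "crp", "blood_culture"},
    {"wbc", "crp"},
    {"wbc", "blood_culture"},
    {"wbc", "crp"},
    {"wbc", "crp"},
]
selected = screen_active_variables(train)
```

Here `wbc` (5 of 5 records) and `crp` (4 of 5) pass the threshold, while `blood_culture` (2 of 5) is screened out.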
Example Variables for Each Phenotypic Model
Data Distributions
Number of unique patients in the foreground; training set sizes in the background
MLOps: Modeling
• Model training
• Model evaluation
• Error analysis
Model Training
• Base models (diagnosis concepts)
– Base models map raw features to probabilities
interpreted as diagnostic (level-1) features
• Indicator features, and additional variables used only at level
1
• Higher-level models (level 1 and level 2)
– Local models: one model per ICD/disease
– Global model and evaluation with gold standard
• Surrogate labels vs true labels (typically small)
• Local models predict ICDs (but they are not the gold
standard)
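The two-tier mapping can be sketched as a forward pass through logistic units: each base model turns its own raw features into a probability, and a local model combines those probabilities into a score for one ICD code. All weights, feature values, and names below are hypothetical; in practice the parameters are learned, and the sketch does not claim to reproduce the exact classifiers used in the study.

```python
import math
from typing import Dict, List


def logistic_unit(x: List[float], w: List[float], b: float) -> float:
    """Sigmoid of a linear combination of the inputs."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical base-model parameters: (weights, bias) per diagnostic concept.
base_params = {
    "microbiology": ([1.5, -0.5], -0.2),
    "antibiotic": ([0.8], 0.0),
    "blood_test": ([0.3, 0.3], -0.1),
    "urine_test": ([1.0], -1.0),
}


def level1_features(raw: Dict[str, List[float]]) -> List[float]:
    """Map each concept's raw features to a probability (the level-1 features)."""
    return [logistic_unit(raw[name], w, b) for name, (w, b) in base_params.items()]


# Hypothetical raw features for one patient window, grouped by concept.
raw = {
    "microbiology": [1.0, 0.0],
    "antibiotic": [1.0],
    "blood_test": [0.5, 0.2],
    "urine_test": [0.0],
}
probs = level1_features(raw)

# Hypothetical local model for one ICD code, taking the four probabilities as input.
local_score = logistic_unit(probs, w=[2.0, 1.0, 0.5, 0.5], b=-2.0)
```

Because the local model sees only four probabilities rather than hundreds of raw variables, it can be fitted and inspected with far fewer labeled examples.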
Model Stacking
• Recall: Why inspect multiple (infectious) diseases?
– Use multiple diseases as a substrate and identify their common elements
– Shared feature representation: Each condition has different weight
distribution over diagnostic components
• Next: An example stacking architecture
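[Figure: example stacking architecture. Level 0: phenotypic base models (Microbiology Model, Antibiotic Model, Blood Test Model, Urine Test Model, plus other phenotypic models, e.g. Antiviral). Level 1: level-0 probabilities + indicators, shown for example ICD-9 codes 112.3, 009.0, 137.0, and 054.2. Level 2: level-1 probabilities + ICD-9, evaluated against the annotation set.]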
Local vs Global Models
• Can build a meta-model (level 1) for each disease using the
probability scores from base models (e.g. microbiology, antibiotic,
etc.)
– Some diseases may not have sufficient data points
– Evaluation for labeled data (typically small) can still be a challenge
• Global model
– Combine cases across diseases to form a “global” model (in contrast to
the disease-specific models, referred to as local models).
– In the combined model, individuality of diseases is lost => only positive
or negative cases
– Small annotation set (83 different cases in the experiment)
• 54 cases sampled from positive examples corresponding to 54 distinct ICD-9
codes; 29 cases from the negative
• Semi-supervised learning to generate virtual annotations
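Forming the global model's training set can be sketched as pooling the per-disease datasets in the abstract feature space. The data layout (a feature vector of four base-model probabilities plus a binary label) and the function name are assumptions for illustration; the point is only that the ICD code is dropped, so rows are distinguished by positive versus negative.

```python
from typing import Dict, List, Tuple

Row = Tuple[List[float], int]  # (abstract feature vector, binary label)


def pool_global(per_disease: Dict[str, List[Row]]) -> List[Row]:
    """Concatenate per-disease (local) datasets; the ICD code itself is dropped."""
    pooled: List[Row] = []
    for _code, rows in sorted(per_disease.items()):
        pooled.extend(rows)
    return pooled


# Hypothetical local datasets keyed by ICD-9 code.
local_sets = {
    "112.3": [([0.9, 0.6, 0.5, 0.2], 1), ([0.1, 0.2, 0.4, 0.1], 0)],
    "038.1": [([0.8, 0.7, 0.9, 0.3], 1)],
}
global_set = pool_global(local_sets)
```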
[Figure: global model diagram. Raw features (f11, f12, ..., f41) feed four example base models (microbiology, antibiotic, blood test, urine test), built from logistic units, whose outputs feed the global unit.]
Interpretation
• Now we have predictive scores (and indicators) as
features, how do we use them?
• Think of these probabilities as degrees of confidence …
– A subset of lab tests provides better explanations for particular
cases
• Candida Esophagitis => (M: 0.9, B: 0.5, U: 0.2, A: 0.6)
• Venereal Disease => (M: 0.8, B: 0.6, U: 0.9, A: 0.6)
• Can add other high-level features: indicator, entropy,
variance
– The published results consider only 8
features (4 model-specific + 4 indicators)
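A possible construction of the 8-feature vector is sketched below. It assumes, as a guess not confirmed by the slides, that each indicator flags whether the corresponding base model had data for the case, and that a missing score is replaced by a neutral fill value of 0.5. The entropy helper illustrates one of the optional high-level features mentioned above.

```python
import math
from typing import Dict, List, Optional

CONCEPTS = ["microbiology", "antibiotic", "blood_test", "urine_test"]


def abstract_features(
    scores: Dict[str, Optional[float]], fill: float = 0.5
) -> List[float]:
    """Build [4 probabilities] + [4 indicators].

    Assumption: an indicator is 1.0 when the base model had data for this
    case and 0.0 otherwise; missing scores get a neutral fill value.
    """
    probs = [scores.get(c) for c in CONCEPTS]
    indicators = [0.0 if p is None else 1.0 for p in probs]
    filled = [fill if p is None else p for p in probs]
    return filled + indicators


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) score; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Scores loosely based on the Candida esophagitis example, with the urine
# score treated as missing purely for illustration.
x = abstract_features(
    {"microbiology": 0.9, "antibiotic": 0.6, "blood_test": 0.5, "urine_test": None}
)
```

A score near 0.5 has entropy near 1 bit, so entropy can serve as a per-model uncertainty feature alongside the raw probability.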
Model Evaluation
• How well does the model predict ICDs (using separate test data)?
– Base models
– Local level-1, Global level-1: predict ICDs
– Global level-2: predict gold standard (83 labeled cases)
• How well does the level-1 model capture individual diseases (i.e.
using only diagnostic features instead of raw features)?
• How well does the model predict annotated data (assoc. with “true
labels”)?
– The (binarized) ICD code becomes a candidate feature among the abstract
features (e.g. probability scores, indicators)
• OHDSI-verified dataset (OHDSI: https://ohdsi.org/)
– Annotated data consist of randomly selected cases in which errors of
ICD-9 coding are corrected
– Data annotations and coding procedures are two independent
processes
Base Level Performances
Local Level-1 Models
127.4 Enterobiasis
047.8 (Other) viral meningitis
009.1 Gastroenteritis ...
053.9 Herpes zoster
117.9 Mycoses
Global Level-1 Models
Global “level-2” Model Predicting Annotated Data
Data Augmentation
• Semi-supervised learning and virtual
annotation set
– Cluster assumption
– Similarity metric
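Under the cluster assumption, an unlabeled case that lies very close to an annotated case is likely to share its label. The sketch below generates virtual annotations with a nearest-neighbor rule in the abstract feature space. The choice of Euclidean distance, the threshold, and the toy vectors are all assumptions; the slides do not specify which similarity metric or propagation scheme was used.

```python
import math
from typing import List, Optional, Sequence, Tuple

Labeled = Tuple[Sequence[float], int]  # (abstract feature vector, gold label)


def virtual_labels(
    labeled: List[Labeled],
    unlabeled: List[Sequence[float]],
    max_dist: float = 0.3,
) -> List[Optional[int]]:
    """Copy the label of the nearest annotated case if it is within max_dist.

    Cases with no sufficiently close annotated neighbor stay unlabeled (None).
    """
    if not labeled:
        raise ValueError("need at least one annotated case")
    out: List[Optional[int]] = []
    for u in unlabeled:
        dist, label = min((math.dist(u, x), y) for x, y in labeled)
        out.append(label if dist <= max_dist else None)
    return out


# Hypothetical annotated cases and unlabeled cases (four base-model scores each).
annotated = [([0.9, 0.8, 0.7, 0.6], 1), ([0.1, 0.2, 0.1, 0.1], 0)]
pool = [[0.85, 0.8, 0.75, 0.6], [0.9, 0.0, 0.0, 0.0]]
virtual = virtual_labels(annotated, pool)
```

The threshold controls the trade-off: a tight value yields few but reliable virtual annotations, while a loose value yields more annotations with more label noise.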
Data Fusion
• What happens when a patient has multiple clinical visits for the same
disease at different times?
• Missing data for a subset of base models
– Indicator variables, etc.
• Temporal alignment across different base models
– e.g. aligning xi in the microbiology model with xj in the antibiotic model
according to their timestamps
• Other issues with multiple training instances (X)
– The time window may be too big to assume the disease state (y) stays the
same
– Certain variables could assume different values within the chosen window
w: [-60, 30]; which one to choose?
• Most recent? Average? Min/Max?
– Most representative instance (e.g. by aligning probability scores with the label
after the training stage at the base level)
– More sophisticated methods? E.g. interpolation by time
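The simpler options for resolving repeated measurements (most recent, average, min, max) can be sketched as follows, using the window of 60 days before to 30 days after the index date. The function names, the aggregator table, and the toy observations are assumptions for illustration; the representative-instance and interpolation options are not shown.

```python
from datetime import date, timedelta
from statistics import mean
from typing import List, Optional, Tuple

Obs = Tuple[date, float]  # (observation date, value)


def in_window(
    obs: List[Obs], index_date: date, before: int = 60, after: int = 30
) -> List[Obs]:
    """Return observations within [index - before, index + after], sorted by date."""
    lo = index_date - timedelta(days=before)
    hi = index_date + timedelta(days=after)
    return sorted((d, v) for d, v in obs if lo <= d <= hi)


AGGREGATORS = {
    "most_recent": lambda vals: vals[-1],  # values arrive sorted by date
    "mean": mean,
    "min": min,
    "max": max,
}


def aggregate(
    obs: List[Obs], index_date: date, how: str = "most_recent"
) -> Optional[float]:
    """Collapse repeated measurements in the window into one value (None if empty)."""
    vals = [v for _, v in in_window(obs, index_date)]
    return AGGREGATORS[how](vals) if vals else None


# Hypothetical lab values around a first ICD mention on 2020-03-01.
index = date(2020, 3, 1)
wbc = [
    (date(2019, 12, 1), 5.0),   # outside the window (too early)
    (date(2020, 1, 15), 11.0),
    (date(2020, 3, 10), 14.0),
    (date(2020, 5, 1), 7.0),    # outside the window (too late)
]
latest = aggregate(wbc, index, "most_recent")
average = aggregate(wbc, index, "mean")
```

Whichever rule is chosen, applying it consistently across cases and controls matters more than the rule itself, since the base models are trained on the aggregated values.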
Abstract Feature Representation: Design Choices
• Related work in constructing high-level features
– PCA, unsupervised feature learning, manifold learning, etc.
• Design choices
– Data characteristics
– Interpretability
• Deep Neural Network
– Linear combination
– Non-linear transformation (e.g. sigmoid, rectifier, etc.)
• Feature set: continuous, dense, and “homogeneous”
– Image pixels
– Time series of lab measurements
– word2vec
• EHR data however are very different
– sparse and incomplete
– consist of many different types (binary, categorical, continuous, etc.)
– Features associated with multiple concepts
Future As a Multi-Disease Phenotyping Framework …
• Summary
– Bulk learning is a framework with at least the following system choices
• The bulk learning set (of target conditions) => base models
• Classification algorithms (guideline: probabilistic classifiers + well-calibrated)
• Stacking architecture (multiple tiers => levels of abstractions)
• Strategy for combining individual (local) disease models to a global model
– Advantage: Can use a small annotated sample for model construction and
evaluation within the abstract feature space (e.g. level-1 data)
• 83 clinical cases were labeled in this study
– Challenge: The model involving the interaction between abstract features and
ICD-9 does not generalize well to the region of the data where the ICD-9
coding was incorrect
• Multiple types of surrogate labels
• Possible extensions
– Semi-supervised learning
– Active learning
– Other surrogate labels
[Figure: bulk learning stacking diagram with base models, local units, and global unit]
Thank You

Data-driven Disease Phenotyping and Bulk Learning

  • 2. Phenotypes and phenotyping Physically observable traits of genotypes (and their interactions with environments) Health data-derived clinical patterns (e.g. patient embeddings, graph basis) Disease descriptors and phenotyping rules
  • 3. Utilities of phenotypes • Phenotyping takes high-dimensional patient data (e.g. EHR) and maps it to medical concepts • Learned phenotyping rules help to define target disease cohorts. • Phenotypes can help answer different research questions: – Descriptive: What's the trend of the lab values (e.g. eGFR)? What are the underlying temporal patterns? – Predictive: Will this patient develop a comorbid condition X given the history we know? – Prescriptive: What medical interventions are more likely to be useful for this patient? Considering cost effectiveness • Patient segmentation • Population representativeness (e.g. for clinical trials) • Recommender system
  • 4. From ML Perspective … ´ Prediction ´ Model Interpretation ´Reusable feature representation ´phenotyping rules
  • 5. Data and Methods • Data-driven phenotyping – Data sources • Clinical data (EHR-based): EHR, clinical notes, pathology reports, claims • Genomic data: protein sequences, gene expressions, HPO • Environment data: air pollution exposure profile – Two main methodologies • Rule-based approach (e.g. CKD staging) – e.g. eMERGE: phekb.org/network-associations/emerge • Probabilistic approach via ML, statistics
  • 6. Example Projects • (EHR-based) Bulk learning: a multi-disease phenotyping framework for infectious diseases • EHR sequencing: developing an alternative representation mirroring the pipeline of genomic sequencing + sequence pattern recognition • Combinations of air toxics associated with childhood asthma • Protein function predictions: Gene ontology • HPO: Multiway associations between genes/proteins, phenotypes, target diseases
  • 7. Bulk Learning: Workflow • Simultaneous phenotyping on multiple diseases (to be discussed shortly) ⌃ ⌃ ⌃ ⌃ ⌃ m1 a1 b1 u1 m1 (1) m1 (i) a1 (i) b1 (i) u1 (i) ⌃ m1g a1g b1g u1g local2 (i) global2 (i) (i-1) (i+1) a1 (1) b1 (1) u1 (1) (i-1) (i) (i+1) logistic units raw features microbiology antibiotic blood test urine test 2. Compute Base Models Level-1 Global Unit Individual Level-1 Local Units Level-1 abstract features f11 f12 f1j f21 f2j f31 f41 f3j Four Example Base Models 3. Compute Meta Models (via Ensemble Learning) 1. Define Feature Groups Using Medical Ontology 1a. Gather EHR data according to medical concepts 1b. Use Medical Entities Dictionary to delineate feature scopes 1c. Apply feature selection within each concept group 3a. Per-disease ensembles: compute local level-1 models 3b. Cross-disease ensemble: compute a global level-1 model Global level-1 features
  • 8. Key Concepts • Multi-disease phenotyping – Integration between model stacking and ontology • Feature discovery, engineering, and representation learning – Training data preparation and “matching” (80%+ dev time) – Surrogate labels (e.g. ICD) vs annotated labels (gold standard) – Disease prediction • Converting multi-class problem into a set of binary classification problems (why?) and predict the degree of “association” between X (patient) and y (disease) • Small sample size in model evaluation via “2D model stacking” – Data augmentation via semi-supervised learning • Diagnostic concept models – Model explanability – Using medical ontology for feature discovery and for defining concept- specific base models
  • 9. Modeling Diagnostic Concepts • Different infectious diseases share the same set of diagnostic concept units • Infectious diseases – Lab tests • Microorganism, blood, urine, body tissues, stool – Medications • Antibiotic, antivirus, anthelmintic • Build statistical models for each diagnostic component and combine them appropriately – Ensemble learning
  • 10. MLOps: Data Processing • Data Ingestion & Preparation • Bulk learning set (scoping) • Ontology-based feature engineering • Curation of training data (case vs control) • How to select an appropriate control group?
  • 11. Bulk Learning Set (1) • Diseases of the same class, e.g. infectious diseases • The set of target infectious diseases is represented by 100 ICD codes – Why 100 codes? – Code selection strategy? • Systematic methods: e.g. use CCS to map the target diseases to their corresponding codes • (Random) selection by sample size (100 out of ~1500)
  • 13. Training Data Preparation • 100 ICD codes correspond to 100 labels (e.g. 038.1: Staphylococcal septicemia) – ICDs are surrogate labels – Other “free” labels? E.g. keywords in clinical notes, pathology reports • Which part of the clinical record is of interest? – Choose a window (w) and stay consistent; e.g. from 60 days before to 30 days after the first mention of a given ICD code (e.g. 038.1) • Each case needs a control – Try to keep control data (negative class) as similar as possible to the case data (positive class) – Active variables – Matching via similarity metric (e.g. Jaccard index)
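Case-control matching with the Jaccard index can be sketched as below. This is an assumption-laden illustration rather than the method used in the study: it treats each patient as the set of variables active inside the window, and it uses a simple greedy one-to-one pairing. The names `jaccard`, `match_controls`, and all patient and variable identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A n B| / |A u B|; defined here as 1.0 for two empty sets."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def match_controls(cases, candidates):
    """Greedily pair each case with the most similar unused candidate control.

    Both arguments map a patient ID to that patient's set of active variables.
    Returns case ID -> (control ID, Jaccard similarity).
    """
    unused = dict(candidates)
    matches = {}
    for case_id, case_vars in cases.items():
        if not unused:
            break  # ran out of candidate controls
        best_id = max(unused, key=lambda cid: jaccard(case_vars, unused[cid]))
        matches[case_id] = (best_id, jaccard(case_vars, unused[best_id]))
        del unused[best_id]  # each control is used at most once
    return matches


# Hypothetical patients described by their active variables.
cases = {
    "case_1": {"blood_culture", "wbc", "vancomycin"},
    "case_2": {"urine_culture", "wbc"},
}
candidates = {
    "ctrl_a": {"blood_culture", "wbc", "cefazolin"},
    "ctrl_b": {"urine_culture", "wbc", "creatinine"},
    "ctrl_c": {"glucose"},
}
matches = match_controls(cases, candidates)
```

Greedy pairing depends on the order in which cases are visited; an optimal assignment would need a dedicated matching algorithm, which is outside this sketch.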
  • 14. Visualizing Active Variables: Diverse Cases Cysticercosis (123.1), Candidiasis (112.3), Meningococcal meningitis (036.0), Dengue (061), Lyme disease (088.81), RSV (079.6), Pneumococcal pneumonia (481), Herpes zoster (053.19), Listeriosis (027.0), Salmonella gastroenteritis (003.0) UpSet: caleydo.org
  • 15. Visualizing Active Variables: Similar Cases Toxic shock syndrome (040.82), Staphylococcus infection of unspecified site (041.11, 041.10), Gram-negative organism infection (041.85), Unspecified bacterial infection (041.89, 041.9), Pseudomonas infection of unspecified site (041.7), Unspecified streptococcus infection (041.00), Proteus (041.6), Streptococcus infection of unspecified site (041.09)
  • 16. Using Medical Ontology to Group Features • Snapshot of Medical Entities Dictionary (http://med.dmi.columbia.edu)
  • 17. Feature Screening & Selection (1) • Ontology-based feature candidate selection (or scoping) – This results in the initial set • Each base model represents a diagnostic concept (e.g., microbiology) through these variables – Each code (disease) is associated with N diagnostic concepts (base models) (e.g. N=4 in the paper) => Each model is associated with variables from the same concept class (e.g. microbiology)
  • 18. Feature Screening & Selection (2) • Within each feature group, how to select the most relevant clinical variables? – Variable screening: most active variables • Active in >= 80% of the training data • This facilitates a matching process that generates the control dataset • This leads to 747 variables for microbiology, 567 for antibiotic, 710 for blood test, and 202 for urine test • In the model training stage – BoLASSO: resampling + LASSO – Results depend on the dataset, which determines the shrinkage
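The screening step can be illustrated with a small sketch. It assumes that a variable counts as "active" in a record when a value was actually recorded for it (not `None`), and it keeps variables whose active fraction meets the threshold; both the interpretation and the name `screen_active_variables` are assumptions, and the lab names are toy data. BoLASSO itself is not shown here.

```python
def screen_active_variables(records, threshold=0.8):
    """Keep variables that are active in at least `threshold` of the records.

    records: list of dicts mapping a variable name to its value; a missing key
    or a None value means the variable is inactive in that record.
    """
    if not records:
        return []
    counts = {}
    for record in records:
        for name, value in record.items():
            if value is not None:
                counts[name] = counts.get(name, 0) + 1
    n = len(records)
    return sorted(name for name, c in counts.items() if c / n >= threshold)


# Hypothetical training records for one concept group (blood tests).
records = [
    {"wbc": 11.2, "crp": 40.0, "lactate": 2.1},
    {"wbc": 9.8, "crp": 12.0},
    {"wbc": 14.0, "crp": 55.0, "lactate": None},
    {"wbc": 7.5, "crp": 8.0, "lactate": 1.0},
    {"wbc": 10.1},
]
selected = screen_active_variables(records)
```

Here `wbc` is active in 5 of 5 records and `crp` in 4 of 5, so both pass the 80% cut, while `lactate` (2 of 5) is dropped.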
  • 19. Example Variables for Each Phenotypic Model
  • 20. Data Distributions Number of unique patients in the foreground; training set sizes in the background
  • 21. MLOps: Modeling • Model training • Model evaluation • Error analysis
  • 22. Model Training • Base models (diagnosis concepts) – Base models map raw features to probabilities interpreted as diagnostic (level-1) features • Indicator features, and additional variables used only at level 1 • Higher-level models (level 1 and level 2) – Local models: one model per ICD/disease – Global model and evaluation with gold standard • Surrogate labels vs true labels (typically small) • Local models predict ICDs (but they are not the gold standard)
  • 23. Model Stacking • Recall: Why inspect multiple (infectious) diseases? – Use multiple diseases as a substrate and identify their common elements – Shared feature representation: each condition has a different weight distribution over diagnostic components • Next: An example stacking architecture
  • 24. [Figure: stacking architecture with Level 0, Level 1, and Level 2. Labels: Microbiology Model, Antibiotic Model, Blood Test Model, Urine Test Model, + Other Phenotypic Models (e.g. Antiviral); Level-0 Probabilities + Indicators; ICD-9 codes 112.3, 009.0, 137.0, 054.2; Level-1 Probabilities; ICD-9; Annotation Set]
  • 26. Local vs Global Models • Can build a meta-model (level 1) for each disease using the probability scores from base models (e.g. microbiology, antibiotic, etc.) – Some diseases may not have sufficient data points – Evaluation on labeled data (typically small) can still be a challenge • Global model – Combine cases across diseases to form a “global” model (in contrast to the disease-specific models, referred to as local models). – In the combined model, individuality of diseases is lost => only positive or negative cases – Small annotation set (83 different cases in the experiment) • 54 cases sampled from positive examples corresponding to 54 distinct ICD-9 codes; 29 cases from the negative examples • Semi-supervised learning to generate virtual annotations
  • 27. [Figure: Four Example Base Models (microbiology, antibiotic, blood test, urine test): raw features, logistic units, global units]
  • 28. Interpretation • Now that we have predictive scores (and indicators) as features, how do we use them? • Think of these probabilities as degrees of confidence … – A subset of lab tests provides better explanations for particular cases • Candida Esophagitis => (M: 0.9, B: 0.5, U: 0.2, A: 0.6) • Venereal Disease => (M: 0.8, B: 0.6, U: 0.9, A: 0.6) • Can add other high-level features: indicator, entropy, variance – The published result considers only 8 features (4 model-specific + 4 indicators)
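A rough sketch of how such an abstract feature vector might be assembled from the four base-model scores (M, B, U, A). Several choices here are assumptions not stated on the slide: the indicator is taken to mean "this base model had data for the patient", a missing score is filled with 0.5, entropy is the binary entropy of each score averaged over observed scores, and variance is the population variance. All function names are hypothetical.

```python
import math
from statistics import pvariance

CONCEPTS = ("M", "B", "U", "A")  # microbiology, blood, urine, antibiotic


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) variable; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def level1_features(scores, fill=0.5):
    """Build the 8-feature vector: 4 probabilities followed by 4 indicators.

    scores maps a concept to a probability, or to None when the base model
    had no data; the indicator is 1 when a score is present, else 0.
    """
    probs, indicators = [], []
    for concept in CONCEPTS:
        p = scores.get(concept)
        indicators.append(0 if p is None else 1)
        probs.append(fill if p is None else p)
    return probs + indicators


def summary_features(scores):
    """Optional extra features over the observed scores (at least one needed)."""
    observed = [p for p in scores.values() if p is not None]
    return {
        "variance": pvariance(observed),
        "mean_entropy": sum(binary_entropy(p) for p in observed) / len(observed),
    }


candida = {"M": 0.9, "B": 0.5, "U": 0.2, "A": 0.6}
candida_vector = level1_features(candida)
candida_summary = summary_features(candida)
partial_vector = level1_features({"M": 0.8, "B": None, "U": 0.9, "A": 0.6})
```

The fill value of 0.5 is a neutral placeholder; the indicator lets a downstream model tell a genuine 0.5 score apart from a missing one.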
  • 29. Model Evaluation • How well does the model predict ICDs (using separate test data)? – Base models – Local level-1, Global level-1: predict ICDs – Global level-2: predict gold standard (83 labeled cases) • How well does the level-1 model capture individual diseases (i.e. using only diagnostic features instead of raw features)? • How well does the model predict annotated data (assoc. with “true labels”)? – The (binarized) ICD becomes a candidate feature among abstract features (e.g. probability scores, indicators) • OHDSI-verified dataset (OHDSI: https://ohdsi.org/) – Annotated data consist of randomly selected cases in which errors of ICD-9 coding are corrected – Data annotations and coding procedures are two independent processes
  • 32. Global Level-1 Models • 127.4 Enterobiasis • 047.8 (Other) viral meningitis • 009.1 Gastroenteritis • ... • 053.9 Herpes zoster • 117.9 Mycoses
  • 33. Global “level-2” Model Predicting Annotated Data
  • 35. Data Augmentation • Semi-supervised learning and virtual annotation set – Cluster assumption – Similarity metric
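One way to read "virtual annotation" under the cluster assumption is nearest-neighbor label propagation: an unlabeled case that lies close to an annotated case inherits its label. The sketch below is only that reading, not the procedure used in the study; Euclidean distance, the `max_distance` cutoff, and the function names are assumptions, and the feature vectors are toy level-1 scores.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def virtual_annotations(labeled, unlabeled, max_distance=0.2):
    """Assign each unlabeled point the label of its nearest labeled neighbor.

    labeled:   list of (feature_vector, label) pairs from the annotation set.
    unlabeled: list of feature vectors.
    Points farther than max_distance from every labeled point are skipped,
    since the cluster assumption gives no support for labeling them.
    """
    virtual = []
    for point in unlabeled:
        features, label = min(labeled, key=lambda item: euclidean(point, item[0]))
        if euclidean(point, features) <= max_distance:
            virtual.append((point, label))
    return virtual


# Hypothetical level-1 feature vectors (M, B, U, A scores).
labeled = [
    ((0.9, 0.5, 0.2, 0.6), 1),
    ((0.1, 0.2, 0.1, 0.1), 0),
]
unlabeled = [
    (0.85, 0.55, 0.25, 0.6),  # close to the positive case
    (0.5, 0.4, 0.2, 0.3),     # far from both, left unlabeled
    (0.15, 0.2, 0.1, 0.05),   # close to the negative case
]
virtual = virtual_annotations(labeled, unlabeled)
```

The distance cutoff trades coverage for label quality: a larger value yields more virtual annotations but weakens the cluster assumption behind them.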
  • 36. Data Fusion • What happens when a patient has multiple clinical visits for the same disease at different times? • Missing data for a subset of base models – Indicator variables, etc. • Temporal alignment across different base models – e.g. aligning xi in the microbiology model with xj in the antibiotic model according to their timestamps • Other issues with multiple training instances (X) – The time window may be too big to assume the disease state (y) stays the same – Certain variables could assume different values within the chosen window w: [-60, 30]; which one to choose? • Most recent? Average? Min/Max? – Most representative instance (e.g. by aligning probability scores with the label after the training stage at the base level) – More sophisticated methods? E.g. interpolation by time
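The simple aggregation options for a variable with several values inside the window w: [-60, 30] can be sketched as follows. The helper names, the inclusive window bounds, and the toy measurements are assumptions for illustration; the deck does not say which strategy was chosen.

```python
from datetime import date, timedelta


def values_in_window(measurements, index_date, before=60, after=30):
    """Return (date, value) pairs within [index - before, index + after] days."""
    start = index_date - timedelta(days=before)
    end = index_date + timedelta(days=after)
    return sorted((d, v) for d, v in measurements if start <= d <= end)


def aggregate(measurements, index_date, how="most_recent"):
    """Collapse repeated measurements of one variable into a single value."""
    window = values_in_window(measurements, index_date)
    if not window:
        return None  # variable is inactive in this window
    values = [v for _, v in window]
    if how == "most_recent":
        return window[-1][1]  # latest measurement inside the window
    if how == "mean":
        return sum(values) / len(values)
    if how == "min":
        return min(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical lab values around the first mention of an ICD code.
index_date = date(2015, 3, 1)
measurements = [
    (date(2014, 12, 1), 9.0),   # 90 days before: outside the window
    (date(2015, 1, 15), 11.0),
    (date(2015, 3, 10), 15.0),
    (date(2015, 3, 25), 12.5),
    (date(2015, 4, 15), 20.0),  # 45 days after: outside the window
]
most_recent = aggregate(measurements, index_date, "most_recent")
average = aggregate(measurements, index_date, "mean")
```

Returning `None` for an empty window pairs naturally with an indicator variable that marks the value as missing.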
  • 37. Abstract Feature Representation: Design Choices • Related work in constructing high-level features – PCA, unsupervised feature learning, manifold learning, etc. • Design choices – Data characteristics – Interpretability • Deep Neural Network – Linear combination – Non-linear transformation (e.g. sigmoid, rectifier, etc.) • Feature set: continuous, dense, and “homogeneous” – Image pixels – Time series of lab measurements – word2vec • EHR data, however, are very different – sparse and incomplete – consist of many different types (binary, categorical, continuous, etc.) – Features associated with multiple concepts
  • 38. Future As a Multi-Disease Phenotyping Framework … • Summary – Bulk learning is a framework with at least the following system choices • The bulk learning set (of target conditions) => base models • Classification algorithms (guideline: probabilistic classifiers + well-calibrated) • Stacking architecture (multiple tiers => levels of abstraction) • Strategy for combining individual (local) disease models into a global model – Advantage: Can use a small annotated sample for model construction and evaluation within the abstract feature space (e.g. level-1 data) • 83 clinical cases were labeled in this study – Challenge: The model involving the interaction between abstract features and ICD-9 does not generalize well to the region of the data where the ICD-9 coding was incorrect • Multiple types of surrogate labels • Possible extensions • [Figure: stacking diagram annotated with: Semi-supervised learning, Active learning, Other surrogate labels]
  • 39. Thank You