Day 1 (Lecture 3): Predictive Analytics in Healthcare

Predictive Analytics in Healthcare: Building Successful Pipelines
Danielle Belgrave
Researcher
Healthcare Machine Learning
@DaniCMBelg

My Research
Latent Variable Modelling
Missing Data
Longitudinal Data Analysis
[Figure: longitudinal measurements for Patients 1 to 4; x-axis: Time (hours), 0 to 24]
Multidisciplinary
Causality
[Figure: diagram with nodes X, Y and Z]
Patient-Centric Approach

Roadmap: Predictive Analytics for Healthcare
Part 1
• Drivers of predictive analytics in healthcare
Part 2
• Evaluation of data-driven healthcare
Part 3
• Important concepts in study design
Part 4
• Top 10 Data Science Skills

Part 1: Drivers of Predictive Analytics in healthcare
The Key Stakeholders in Health
Population, Patient/Person, Pharmaceuticals, Providers

The Person at the Centre of Healthcare
Patient/Person
Machine Learning has the capacity to transform healthcare
- Understanding physiological changes over time
- Forecasting of progression or onset of disease
- Personalising treatment strategies

Population Data-driven Healthcare
Population
Elucidates average effects and deviations from average effects
Policy recommendations
Health education
Outreach
Research for disease detection and injury prevention
Reduce healthcare inequalities
What we as a society do collectively to assure the conditions in which people can be healthy

The Pharmaceutical Perspective: Drug Discovery and Therapeutics
Idea → Basic Research → Clinical Trials (Phase I, Phase II, Phase III) → Regulatory Approval → Patient Care

Data Protection and
Connected Care: The Provider
and Regulator Perspective
Providers

Predictive Analytics for Healthcare
Part 1
• Drivers of predictive analytics in healthcare
Part 2
• Evaluation of data-driven healthcare
Part 3
Part 4

Current State of Healthcare Data
DATA: TSUNAMI. Vast data volume, velocity, variety
METHODS & MODELS: BLIZZARD. Supra-linear growth in papers & tools
EXPERTISE: DROUGHT. Human expertise to make sense of the growing data
Myth 1: ‘Big data’ are a solution
Myth 2: We have the models needed to turn healthcare data into decision support algorithms
Myth 3: Clinicians will continue to be the main source of data

Machine Learning has the Potential to
Disrupt and Impact Healthcare…

…But it’s Important to Remember the Beginnings
The study of the distribution and determinants of health-related states or events in specific populations & the application of this study to the control of health problems

The Beginnings of Data-Driven Health
Florence Nightingale (1820–1910)
Data visualisation: death toll of the Crimean War
Army data: 16,000/18,000 deaths not due to battle wounds, but to preventable diseases, spread by poor sanitation

Anchor Data Science in The Power of Observation
Contextual phenomena: cholera incidence
Ecological design: compare cholera rates by region
Cohort design: compare cholera rates in exposed and non-exposed individuals

R.A. Fisher and the Principles of Experimental Design
1. Randomisation: Unbiased allocation of treatments to different experimental plots
2. Replication: Repetition of the treatment on more than one experimental plot
3. Error control: Measures for reducing the error variance
Why do these 2 plants differ in growth?

Part 1
Part 2
Part 3
• Important concepts in study design
Part 4

Principles of Study Design
Need to set up a study to answer a research question
Design is the most important aspect of a study and perhaps the most neglected
The study design should match the research question
So that we don’t end up collecting useless data, or failing to record the principal outcome
No matter how good an algorithm is, if the study design is inadequate (garbage in) for answering the research question, we’ll get garbage out

Good Predictive Analytics for Healthcare starts with Good Study Design
Three questions:
1. What is the question or hypothesis I’m trying to investigate?
2. Is the data I have adequate to address this question?
3. How can I design a study that addresses these questions?

Types of Study Design
Non-Experimental: Observational Studies
  Descriptive
    Case Reports
    Case Series
    Cross-Sectional or Prevalence Study
  Analytical
    Case-control
    Cohort Study
Experimental: Intervention Studies
  Randomised Clinical Trial
  Non-randomised / Field / Community Trial

Case Report
Profile of a single patient is reported in detail by one or more clinicians

Case Series
An individual case report that has been expanded to include a number of patients with a given disease

Cross-sectional study
• All information collected at the same time point
• Snapshots of health state
• Designed to obtain information from samples regarding prevalence, distributions and interrelations
• E.g. survey
• There are no typical formats: they are designed or modified to meet the needs of the researcher or fit the topic of research

Cross-sectional Study
Source Population → Ineligible / Eligible
Eligible → No Participation / Participation
Participation → four groups, all measured at the same time:
  Exposed & Bad outcome
  Exposed & Good outcome
  Unexposed & Bad outcome
  Unexposed & Good outcome

Observational Studies
• Analytical
• Main objective is to test a hypothesis about the relationship between exposure to a risk factor and disease or other health outcome
• A measure of association is estimated
• The magnitude, precision and statistical significance of the association are determined

Cohort Study
• Identify a group of subjects
• Follow them over time
• Compare event rate in:
• Subjects exposed to risk factors and
• Those not exposed to risk factors
• Can be both retrospective and prospective
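
The comparison of event rates between exposed and unexposed subjects is commonly summarised as a relative risk. The sketch below is only an illustration: the function name and the counts are hypothetical and do not come from the lecture.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the event rate in exposed subjects to the rate in unexposed subjects."""
    risk_exposed = events_exposed / n_exposed
    risk_unexposed = events_unexposed / n_unexposed
    return risk_exposed / risk_unexposed


# Hypothetical cohort: 30 of 200 exposed and 15 of 300 unexposed subjects have the event.
rr = relative_risk(30, 200, 15, 300)
print(f"Relative risk: {rr:.2f}")
```

A relative risk above 1 means the event is more frequent among exposed subjects; on its own it says nothing about chance, bias or confounding.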

Cohort Study
Source Population → Ineligible / Eligible
Eligible → Exposed / Unexposed
Exposed → No Participation / Participation → Bad Outcome / Good Outcome
Unexposed → No Participation / Participation → Bad Outcome / Good Outcome

Important Concept: Randomisation
Definition: The process by which allocation of subjects to treatment groups is done by chance, without the ability to predict who is in what group
Aims:
- To prevent statistical bias in allocating subjects to treatment groups
- To achieve comparability between the groups
- To ensure samples representative of the general population

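Methods of Randomisation
Simple Random Sampling: [Figure: a sample drawn at random from a population numbered 1 to 12]
Permuted Block Randomisation: [Figure: allocation sequences to arms A and B, e.g. AABBAABB]
Stratified Random Sampling: [Figure: population divided into strata, with a sample drawn from each stratum]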
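
As one possible illustration of permuted block randomisation, the sketch below shuffles the arm labels within each block so that the arms stay balanced after every complete block. The function name, block size and seed are assumptions made for the example.

```python
import random


def permuted_block_allocation(n_subjects, block_size=4, arms=("A", "B"), seed=0):
    """Allocate subjects to arms in shuffled blocks, keeping arms balanced within each block."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed only so the example is reproducible
    allocation = []
    while len(allocation) < n_subjects:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)  # random order within the block
        allocation.extend(block)
    return allocation[:n_subjects]


allocation = permuted_block_allocation(16)
print("".join(allocation))
```

In a real trial the sequence would be generated and concealed independently of the recruiting clinicians, since a fixed and known block size makes the last allocation in each block predictable.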

Part 1
Part 2
Part 3
Part 4
• Top 10 Data Science Skills

Skill #1: ROC Curves
Useful Tool for Diagnostic Test Evaluation

Truth              | Test result: No disease      | Test result: Disease
No disease (D = 0) | ✓ Specificity                | X Type I error (False +), α
Disease (D = 1)    | X Type II error (False −), β | ✓ Power = 1 − β; Sensitivity

Specific Example: How well does the IgE “score” for the Ara h 2 peanut component predict/diagnose peanut allergy?
[Figure: distributions of IgE response to Ara h 2 for patients with peanut allergy and patients without peanut allergy]

Threshold: IgE = 3.15
Call patients below the threshold “non-peanut allergic”; call patients above the threshold “peanut allergic”

Some definitions ...
True Positives: patients with peanut allergy above the threshold (IgE = 3.15)
False Positives: patients without peanut allergy above the threshold
True Negatives: patients without peanut allergy below the threshold
False Negatives: patients with peanut allergy below the threshold

ROC curve
[Figure: True Positive Rate (sensitivity), 0% to 100%, plotted against False Positive Rate (1 − specificity), 0% to 100%, for IgE response to Ara h 2 in patients with and without peanut allergy]
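
The ROC curve can be traced by sweeping the threshold over the observed scores and recording the true and false positive rates at each step. The sketch below is a minimal pure-Python version; the IgE values and labels are made up for illustration and are not the lecture's data.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per candidate threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nobody is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical IgE responses; 1 = peanut allergic, 0 = not allergic.
ige = [0.2, 0.8, 1.5, 2.9, 3.4, 4.1, 6.0, 9.5]
allergic = [0, 0, 0, 1, 0, 1, 1, 1]

points = roc_points(ige, allergic)
area = auc(points)
print(points)
print(f"AUC = {area:.4f}")
```

Each point corresponds to one choice of threshold, so moving the threshold right or left moves along the curve, trading sensitivity against specificity.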

Skill #2: Power and Sample Size Calculation
https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f
$$ n = \left(\frac{r+1}{r}\right) \frac{\sigma^2 \left(Z_\beta + Z_{\alpha/2}\right)^2}{\text{difference}^2} $$
n = sample size in the case group
r = ratio of controls to cases
σ = standard deviation of the outcome variable
Z_β represents the desired power (typically 0.84 for 80% power)
Z_{α/2} represents the level of statistical significance (typically 1.96)
difference = effect size (the difference in means)
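
A minimal sketch of the sample size formula, assuming a two-sided test and using the standard normal quantiles from Python's `statistics.NormalDist`. The function name and the example inputs (standard deviation 1, difference 0.5) are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def cases_needed(sigma, difference, r=1.0, power=0.80, alpha=0.05):
    """Sample size in the case group for detecting a difference in means."""
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a two-sided 5% level
    n = ((r + 1) / r) * sigma ** 2 * (z_beta + z_alpha) ** 2 / difference ** 2
    return ceil(n)  # round up to a whole subject


n_equal = cases_needed(sigma=1.0, difference=0.5)            # one control per case
n_two_to_one = cases_needed(sigma=1.0, difference=0.5, r=2)  # two controls per case
print(n_equal, n_two_to_one)
```

Recruiting more controls per case reduces the number of cases required, with diminishing returns as r grows.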

Skill #3: Data Visualisation
https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f

Skill #4: A/B Testing
Patient Group
Population is split into 2 groups by random allocation
Intervention
Control
Outcomes for both
groups are measured
= Cured = Still Diseased
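
One common way to compare the proportion cured in the two groups is a two-proportion z-test; the slide does not name a specific test, so this choice is an assumption. The counts below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(cured_a, n_a, cured_b, n_b):
    """Two-sided z-test for a difference in cure rates between two groups."""
    p_a, p_b = cured_a / n_a, cured_b / n_b
    pooled = (cured_a + cured_b) / (n_a + n_b)  # cure rate under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical outcome: 60 of 100 cured under intervention, 45 of 100 under control.
z, p_value = two_proportion_z_test(60, 100, 45, 100)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```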

Skill #5 Linear Regression
• Linear regression allows us to look at the linear relationship between one normally distributed interval predictor and one normally distributed interval outcome variable
• Example: We wish to look at the relationship between FEV1 and IgE at age 5
• In other words, predicting FEV1 from IgE

Linear Regression: Example
• We see that the relationship between FEV1 and Total IgE at age 5 is negative (B = -0.011)
• Based on the p-value (0.027 < 0.05), we would conclude this relationship is statistically significant.
• Hence, we would say there is a statistically significant negative linear relationship between FEV1 and Total IgE at age 5.
Coefficients (Dependent Variable: FEV1, age 5)
Model 1                | Unstandardized B | Std. Error | Standardized Beta | t      | Sig.
(Constant)             | 1.146            | .033       |                   | 35.014 | .000
Total serum IgE, age 5 | -.011            | .005       | -.221             | -2.242 | .027
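
For a single predictor, the unstandardized coefficients come from ordinary least squares. The sketch below shows the closed-form estimates on made-up data that lie exactly on a line; it is not the cohort data behind the table, and the function name is hypothetical.

```python
def simple_linear_regression(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up predictor and outcome values (illustration only).
ige = [0, 10, 20, 30, 40]
fev1 = [1.15, 1.04, 0.93, 0.82, 0.71]

intercept, slope = simple_linear_regression(ige, fev1)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The slope is the expected change in the outcome per unit increase in the predictor, which is how the B column of the table is read.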

Skill #6: Logistic Regression
Asthma/wheeze symptom between 0 to 8 years * Ever antibiotic in first 12 months
        | Antibiotic in first 12 months: No | Yes         | Total
Control | 165 (39.1%)                       | 257 (60.9%) | 422 (100.0%)
Case    | 98 (20.5%)                        | 380 (79.5%) | 478 (100.0%)
Total   | 263 (29.2%)                       | 637 (70.8%) | 900 (100.0%)
Odds of developing asthma in antibiotic group are:
(380/637)/(257/637) = 380/257 = 1.48
Odds of developing asthma in non-antibiotic group are:
(98/263)/(165/263) = 98/165 = 0.59
The ratio of the odds for antibiotic users to the odds for non-antibiotic users is:
(380/257)/(98/165) = (380*165)/(257*98) = 2.49
The odds for antibiotic users are 2.49 times the odds for non-users
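
The arithmetic above can be reproduced directly from the four cell counts of the cross-tabulation. The variable names are chosen for this illustration only.

```python
# Cell counts from the cross-tabulation (cases = asthma/wheeze, controls = none).
cases_antibiotic, cases_no_antibiotic = 380, 98
controls_antibiotic, controls_no_antibiotic = 257, 165

odds_antibiotic = cases_antibiotic / controls_antibiotic           # 380/257
odds_no_antibiotic = cases_no_antibiotic / controls_no_antibiotic  # 98/165
odds_ratio = odds_antibiotic / odds_no_antibiotic

print(f"odds (antibiotic)    = {odds_antibiotic:.2f}")
print(f"odds (no antibiotic) = {odds_no_antibiotic:.2f}")
print(f"odds ratio           = {odds_ratio:.2f}")
```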

What are the Odds?
Variables in the Equation
Step 1   | B     | S.E. | Wald   | df | Sig. | Exp(B) | 95.0% C.I. for Exp(B): Lower | Upper
evrab12m | .912  | .151 | 36.506 | 1  | .000 | 2.489  | 1.852                        | 3.347
Constant | -.521 | .128 | 16.688 | 1  | .000 | .594   |                              |
Variable(s) entered on step 1: evrab12m.
The intercept of -0.52 is the log odds for non-antibiotic users since this is the reference group. exp(-0.52) = 0.59
The coefficient of evrab12m is the log of the odds ratio between the antibiotic users and non-antibiotic users: exp(0.912) = 2.49
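
With a single binary predictor, the logistic regression estimates can be recovered from the 2×2 counts, which makes the link between the table and the odds explicit. The sketch below assumes the usual Wald standard error for a log odds ratio (square root of the summed reciprocal cell counts); variable names are illustrative.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Cell counts from the asthma/antibiotic cross-tabulation.
cases_yes, cases_no = 380, 98
controls_yes, controls_no = 257, 165

intercept = log(cases_no / controls_no)                  # log odds in the reference group
coefficient = log((cases_yes / controls_yes) / (cases_no / controls_no))  # log odds ratio
std_error = sqrt(1 / cases_yes + 1 / controls_yes + 1 / cases_no + 1 / controls_no)

z = NormalDist().inv_cdf(0.975)                          # about 1.96
ci_lower = exp(coefficient - z * std_error)
ci_upper = exp(coefficient + z * std_error)

print(f"Constant B = {intercept:.3f}, Exp(B) = {exp(intercept):.3f}")
print(f"evrab12m B = {coefficient:.3f}, S.E. = {std_error:.3f}, Exp(B) = {exp(coefficient):.3f}")
print(f"95% C.I. for Exp(B): {ci_lower:.3f} to {ci_upper:.3f}")
```

The values agree with the B, S.E., Exp(B) and confidence interval columns for evrab12m and the Constant B and Exp(B) to the precision shown.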

Skill #7: Causal Reasoning
The questions that motivate most studies in the health, social and behavioral sciences are not associational but causal in nature.
Before an association is assessed for the possibility that it is causal, other explanations such as chance, bias and confounding have to be excluded
Require some knowledge of the data-generating process - cannot be computed from the data alone, nor from distributions governing data
Aim: to infer dynamics of beliefs under changing conditions, for example, changes induced by treatments or external interventions.
Pearl, Judea. "Causal inference in statistics: An overview." Statistics Surveys 3 (2009): 96-146.

Bradford-Hill Principles of Causality
Plausibility
Does causation make sense
Consistency
Cause associated with disease in
different population and studies
Temporality
Cause precedes disease
Strength
Cause strongly associated with disease
Specificity
Does the cause lead to a specific effect
Dose-Response
Greater exposure to cause,
higher the risk of disease

Example: Personalisation of Cancer Treatment
[Figure: causal diagram with nodes Prognostic biomarker (risk factor), Genetic Marker, Tumor Size, Treatment, Outcome (Survival) and U]

Skill #8: Missing Data
Missing Completely At Random (MCAR)
The probability of data being missing does not depend on the observed or unobserved data
e.g. $\text{logit}(p_{it}) = \theta_0$
Missing At Random (MAR)
The probability of data being missing does not depend on the unobserved data, conditional on the observed data
e.g. Children with missing wheeze data have better lung function
e.g. $\text{logit}(p_{it}) = \theta_0 + \theta_1 t_i$ or $\text{logit}(p_{it}) = \theta_0 + \theta_2 y_0$
Missing Not At Random (MNAR)
The probability of data being missing does depend on the unobserved data, conditional on the observed data.
e.g. Children with missing lung function have better lung function
e.g. $\text{logit}(p_{it}) = \theta_0 + \theta_3 y_{it}$
Alexina Mason. “Bayesian methods for modelling non-random missing data mechanisms in longitudinal studies” PhD Thesis (2009)
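
A small simulation can show why the mechanism matters. The sketch below draws an outcome y, deletes values with probability given by a logit model, and compares the mean of what remains with the true mean. The parameter values (θ0 = −1, θ3 = 2), the sample size and the seed are arbitrary assumptions chosen for illustration.

```python
import random
from math import exp


def inv_logit(z):
    """Inverse of the logit function."""
    return 1 / (1 + exp(-z))


def observed_mean(y, prob_missing, rng):
    """Mean of the values that remain after deleting each y with probability prob_missing(y)."""
    kept = [yi for yi in y if rng.random() >= prob_missing(yi)]
    return sum(kept) / len(kept)


rng = random.Random(42)
y = [rng.gauss(0, 1) for _ in range(20000)]  # simulated standardised outcome
true_mean = sum(y) / len(y)

# MCAR: logit(p) = theta0, the same missingness probability for everyone.
mcar_mean = observed_mean(y, lambda yi: inv_logit(-1.0), rng)
# MNAR: logit(p) = theta0 + theta3 * y, higher values are more likely to be missing.
mnar_mean = observed_mean(y, lambda yi: inv_logit(-1.0 + 2.0 * yi), rng)

print(f"true mean {true_mean:.3f}, MCAR {mcar_mean:.3f}, MNAR {mnar_mean:.3f}")
```

Under MCAR the observed mean stays close to the truth, whereas under MNAR the complete cases understate it, mirroring the example of children with missing lung function having better lung function.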

The Challenge of Missing Data
Missing data is a common problem in healthcare data and can produce biased parameter estimates
Reasons for missingness may be informative for estimating model parameters
Bayesian models: coherent approach to incorporating uncertainty by assigning prior distributions
Mason, Alexina, Nicky Best, Sylvia Richardson, and Ian Plewis. "Strategy for modelling non-random missing data mechanisms in observational studies using Bayesian methods." Journal of Official Statistics (2010)

Skill #9: Latent Variable Modelling using Probabilistic Programming
To identify subgroups of complex disease risk
Treatment outcome explained by distinctive underlying mechanism
Foundation of Stratified Medicine
Seeking better-targeted interventions

Identifying Disease Endotypes
Parsimonious description of the data inferred from what is observed

Defining Asthma
Phenotypes: Observable Manifestations of Disease
Endotypes Discovery: Different Diseases With Different Causes
[Figure labels: Asthma; Poor Lung Function; Wheeze; Allergy; Asthma Medication; Asthma Symptoms; Exacerbations; Grow out of Asthma; Asthma Late in Childhood; Respond to treatment; Don’t Respond to treatment; Severity]

Probabilistic Programming: Finding Patterns in Data
The basic components of a probabilistic reasoning system
Probabilistic model: The probabilistic model expresses general knowledge about a situation
Evidence: The evidence contains specific information about a situation
Queries: The queries express the things that will help you make a decision
Inference algorithm: The inference algorithm uses the model to answer queries given evidence
Answer: The answers to queries are framed as probabilities of different outcomes
Adapted from Pfeffer, Avi. "Practical probabilistic programming." International Conference on Inductive Logic Programming. Springer Berlin Heidelberg, 2010.
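
The components can be illustrated with a toy example in which the model is a prior and a likelihood, the evidence is one observation, and the inference algorithm is Bayes' rule by enumeration. All probabilities and names below are invented for illustration; a real probabilistic programming system would handle far richer models.

```python
# Probabilistic model: general knowledge (hypothetical numbers).
prior = {"asthma": 0.10, "no asthma": 0.90}
likelihood = {
    "asthma": {"wheeze": 0.70, "no wheeze": 0.30},
    "no asthma": {"wheeze": 0.10, "no wheeze": 0.90},
}


def posterior(prior, likelihood, evidence):
    """Inference algorithm: Bayes' rule by enumerating every hidden state."""
    joint = {state: prior[state] * likelihood[state][evidence] for state in prior}
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}


# Evidence: this child wheezes. Query: how likely is asthma?
answer = posterior(prior, likelihood, "wheeze")
print(answer)
```

The answer is a probability distribution over the hidden states rather than a single label, which is what allows uncertainty to be carried into the decision.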

Skill #10: Interdisciplinary Research
1. Team Science: Discoveries about healthcare, not hypothesised a priori, have been made by experts explaining structure learned from data by algorithms tuned by those experts
2. Heuristic blend of biostatistics and machine learning for principled problem-led healthcare research
3. An ML approach to extracting knowledge from information in healthcare requires persistent integration of Data, Methods and Expertise

Problem-led vs Data-driven Health
Think deeply about the clinical context. Find solutions which are specific to the problem.
Good science is about merging different schools of thought for developing the bigger picture.
Data-driven approach + Domain Knowledge = Problem-led approach with the patient at the centre
Danielle Belgrave, John Henderson, Angela Simpson, Iain Buchan, Christopher Bishop, and Adnan Custovic. "Disaggregating asthma: Big investigation versus big data." Journal of Allergy and Clinical Immunology 139, no. 2 (2017): 400-407.

Take Home Message
Problem-Led Patient-Centred Research

Healthcare ML at MSR Cambridge
Javier Alvarez-Valle, Danielle Belgrave, Chris Bishop, Laurence Bourn, David Carter, Richard Lowe, Hannah Murfet, Jay Nanavati, Kenton O’Hara, Konstantina Palla, Anton Schwaighofer, Kenji Takeda, Ivan Tarapov, Stefan Wijnen, Aditya Nori, Pratik Ghosh, Anja Thieme, Tim Regan, Isabel Chien, Jan Steumer, Antonio Criminisi, Sebastian Tschiatschek

Cohort Study: Example
• A long-term follow-up study of doctors began in 1950 (Doll and Hill)
• Questionnaire sent to all men and women on British Medical Register and
Resident in the UK
• Asked about age and smoking habits
• Replies obtained from 34,440 men
• Subsequent follow-up focused on men
• Further questionnaires sent at intervals and asked about changes in smoking
habit
• During 1951-1971 10,074 had died

Missing Completely At Random
[Figure: graphical model for individual i. Model of Interest: x_i, β, µ_i, σ², y_i. Model of Missingness: θ, p_i, m_i]
$\text{logit}(p_{it}) = \theta_0$

Missing At Random
[Figure: graphical model for individual i. Model of Interest: x_i, β, µ_i, σ², y_i. Model of Missingness: θ, p_i, m_i]
$\text{logit}(p_{it}) = \theta_0 + \theta_1 x_i$

Missing Not At Random
[Figure: graphical model for individual i. Model of Interest: x_i, β, µ_i, σ², y_i. Model of Missingness: θ, p_i, m_i]
$\text{logit}(p_{it}) = \theta_0 + \theta_3 y_{it}$

Prognostic Biomarker (Risk Factor)
A biological measurement made before treatment to indicate long-term outcome for patients either untreated or receiving standard treatment
[Figure: diagram with nodes Prognostic biomarker (risk factor), Random Allocation (Treatment) and Outcomes]
Dunn, Graham, Richard Emsley, Hanhua Liu, and Sabine Landau. "Integrating biomarker information within trials to evaluate treatment mechanisms and efficacy for personalised medicine." Clinical Trials 10, no. 5 (2013): 709-719.

Predictive Biomarker (Moderator)
A variable that changes the impact of treatment on the outcome. A biological measurement made before treatment to identify patients likely or unlikely to benefit from a particular treatment
[Figure: diagram with nodes Predictive biomarker (Moderator), Random Allocation (Treatment) and Outcomes]

Mediator
A mechanism by which one variable affects another variable. Omitted common causes (hidden confounding) should always be considered as a possible explanation for associations that might be interpreted as causal
[Figure: diagram with nodes Random Allocation (Treatment), Mediator, Outcomes and U]

Efficacy and mechanism evaluation: Causal Framework for investigating who medications work for
[Figure: diagram with nodes Prognostic biomarker (risk factor), Random Allocation, Predictive biomarker (moderator), Mediator, Outcomes and U]

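Moving the Threshold: right
[Figure: threshold moved right from IgE = 3.15 through 5.15, 7.15 and 9.15; patients left of the threshold are labelled “−”, patients right of it “+”]

Moving the Threshold: left
[Figure: threshold moved left from IgE = 3.15 through 2.15, 1.15 and 0.15; patients left of the threshold are labelled “−”, patients right of it “+”]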
