Workshop on "Building Successful Pipelines for Predictive Analytics in Healthcare" delivered by Danielle Belgrave, PhD, Researcher at Microsoft Research, Cambridge, UK.
2. Latent Variable Modelling
My Research
Missing DataLongitudinal Data Analysis
0 2 4 6 8 10 12 14 16 18 20 22 24
Time (hours)
Patient 1
Patient 2
Patient 3
Patient 4
MultidisciplinaryCausality
X Y
Z
Patient-Centric Approach
3. Roadmap: Predictive Analytics for Healthcare
Part 1
• Drivers ofpredictive analyticsin healthcare
Part 2
• Evaluationof data-driven healthcare
Part3
• Important concepts in study design
Part 4
• Top 10 DataScience Skills
4. Part 1: Drivers of Predictive Analytics in healthcare
PopulationPatient/Person Pharmaceuticals Providers
TheKey Stakeholders inHealth
5. ThePerson at the Centre of Healthcare
Patient/Person
MachineLearninghas the capacityto
transformhealthcare
- Understandingphysiological changes
overtime
- Forecastingofprogression or onsetof
disease
- Personalisingtreatmentstrategies
6. Population Data-driven Healthcare
Population
Elucidatesaverage effectsand deviations
fromaverage effects
Policy recommendations
Healtheducation
Outreach
Researchfor diseasedetectionand injury
prevention
Reducehealthcareinequalities
Whatwe as asociety do collectively
toassurethe conditionsin which
people can be healthy
7. The Pharmaceutical Perspective:
Drug Discoveryand Therapeutics
Idea
Basic
Research
Clinical
Trails
Phase I Phase II Phase III Regulatory
Approval
Patient
Care
9. Predictive Analytics forHealthcare
Part 1
• Drivers of predictive analyticsin healthcare
Part 2
• Evaluation of data-driven healthcare
Part3
• Important concepts in study design
Part 4
• Top 10 DataScience Skills
10. DATA EXPERTISE
METHODS &
MODELS
Vast data volume,
velocity, variety
TSUNAMI
Supra-linear growth in
papers & tools
BLIZZARD
Human expertise to make
sense of the growing data
DROUGHT
Myth 1: ‘Big data’ are a solution
Myth 2: We have the models needed to turn healthcare data into decision support algorithms
Myth 3: Clinicians will continue to be the main source of data
Current Stateof Healthcare Data
12. …But it’s Important to Remember the Beginnings
The study of the distribution and determinantsof healthrelated statesor events in
specific populations & theapplications of this study to thecontrol of healthproblems
13. Data visualisation: death toll of
the Crimean War
Army data: 16,000/18,000
deaths notdue to battle
wounds, but to preventable
diseases, spread by poor
sanitation
The Beginnings of Data-Driven Health
Florence Nightingale(1820– 1910)
14. Anchor Data Science in
The Power ofObservation
Contextualphenomena:cholera
incidence
Ecologicaldesign:compare
cholera rates by region
Cohort design:comparecholera
ratesin exposedand non-
exposedindividuals
15. R.A. Fisher and the
Principles of Experimental Design
1.Randomisation:Unbiasedallocationof
treatmentsto
differentexperimentalplot
2.Replication:repetitionofthe treatment
tomorethan
oneexperimentalplot
3.Error control: Measurefor reducing the
error ofvariance
Why do these 2
plants differ in growth?
16. Predictive Analytics forHealthcare
Part 1
• Drivers of predictive analyticsin healthcare
Part 2
• Evaluationof data-driven healthcare
Part3
• Important conceptsin study design
Part 4
• Top 10 DataScience Skills
17. Principles of Study Design
Needtosetup a studytoanswer aresearch question
Designmost importantaspect of a studyand perhaps themostneglected
Thestudydesign should matchresearch question
So that we don’t endup collecting useless data orthe principle outcome ends up not being recorded
Nomatter howgoodanalgorithmis,if the studydesign is inadequate
(garbage in) for answeringthe research question,we’ll get garbage out
18. Good Predictive Analyticsfor Healthcare
starts with Good Study Design
Threequestions:
1. Whatis thequestionor hypothesis I’mtrying toinvestigate
2. Is thedataI have adequateinorder to address thisquestion
3. How can I design astudythataddresses thesequestions
19. Types of Study Design
Non-Experimental
Observational
Studies
Descriptive
Case Reports
Case Series
Cross-Sectional or
Prevalence Study
Analytical
Case-control
Cohort Study
Experimental
Intervention
Studies
Randomised
Clinical Trial
Non-randomized/
Field/ Community
Trial
24. Types of Study Design
Non-Experimental
Observational
Studies
Descriptive
Case Reports
Case Series
Cross-Sectional or
Prevalence Study
Analytical
Case-control
Cohort Study
Experimental
Intervention
Studies
Randomised
Clinical Trial
Non-randomized/
Field/ Community
Trial
25. Observational Studies
• Analytical
• Mainobjective is to test hypothesis of relationship between exposure to risk factor and
disease or other healthoutcome
• A measure of associationis estimated
• The magnitude, precision and statisticalsignificanceif the associationis determined
26. Cohort Study
• Identifya group ofsubjects
• Followthemover time
• Compare eventrate in:
• Subjects exposed to risk factors and
• Those not exposedto risk factor
• Can be both retrospectiveand prospective
28. Types of Study Design
Non-Experimental
Observational
Studies
Descriptive
Case Reports
Case Series
Cross-Sectional or
Prevalence Study
Analytical
Case-control
Cohort Study
Experimental
Intervention
Studies
Randomised
Clinical Trial
Non-randomized/
Field/ Community
Trial
29. Important Concept: Randomisation
Definition:Theprocess by which allocationofsubjectstotreatmentgroups
is doneby chance,withouttheabilitytopredict who is in whatgroup
Aims:-
-To prevent statisticalbias in allocating subjects to treatment groups
- To achieve comparability between the groups
- To ensure samples representative of the general population
31. Predictive Analytics forHealthcare
Part 1
• Drivers of predictive analyticsin healthcare
Part 2
• Evaluationof data-driven healthcare
Part3
• Important concepts in study design
Part 4
• Top10 Data Science Skills
32. Skill #1: ROC Curves
No disease Disease
No disease
(D = 0)
Specificity
X
Type I error (False
+)
Disease
(D = 1)
X
Type II error (False -)
Power 1 - ;
Sensitivity
UsefulTool forDiagnosticTestEvaluation
33. Specific Example:How welldoesIgE“score” for Ara h 2peanutcomponent
predict/ diagnosepeanutallergy
IgE response to Ara h2
Ptswith
peanut allergy
Ptswithout the
peanut allergy
34. IgE response to Ara h2
Call these patients “non-peanut allergic” Call these patients “peanut allergic”
Threshold
IgE= 3.15
35. Call these patients “non-peanut allergic” Call these patients “peanut allergic”
withoutpeanut allergy
with peanutallergy
TruePositives
Some definitions ...
IgE response to Ara h2
IgE= 3.15
37. True negatives
Call these patients “non-peanut allergic” Call these patients “peanut allergic”
withoutpeanut allergy
with peanutallergy
IgE response to Ara h2
IgE= 3.15
38. False negatives
Call these patients “non-peanut allergic” Call these patients “peanut allergic”
withoutpeanut allergy
with peanutallergy
IgE response to Ara h2
IgE= 3.15
40. Skill #2: Power and Sample Size Calculation
https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f
𝑛 =
𝑟 + 1
𝑟
𝜎2
𝑍 𝛽 + 𝑍 𝛼
2
2
𝑑𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒2
r=ratio ofcontrols tocases
Samplesize in case group
Standard deviation ofthe outcome
variable
Representsthe desired power
(typically0.84 for 80% power)
Effect Size
(the difference in means)
Representsthelevelof statistical
significance
(typically1.96)
41. Skill #3: Data Visualisation
https://towardsdatascience.com/5-quick-and-easy-data-visualizations-in-python-with-code-a2284bae952f
42. Skill #4: A/B Testing
Patient Group
Population is splitinto 2
groups byrandom allocation
Intervention
Control
Outcomes for both
groups are measured
= Cured = Still Diseased
43. Skill #5 Linear Regression
• Linear regression allowsus tolookat thelinear relationshipbetweenone normally
distributedinterval predictorand onenormallydistributedinterval outcomevariable
• Example:Wewish tolookatthe relationshipbetweenFEV1and IgEatage5
• In otherwords, predictingFEV1from IgE
44. Linear Regression: Example
• WeseethattherelationshipbetweenFEVand TotalIgEat age 5 isnegative(-0.11)
• Based on thep-value(<0.05), wewouldconclude thisrelationship is statistically
significant.
• Hence, wewould say thereisa statisticallysignificantnegativelinearrelationship
betweenreading andwriting.
Coefficientsa
1.146 .033 35.014 .000
-.011 .005 -.221 -2.242 .027
(Constant)
Total serum IgE, age 5
Model
1
B Std. Error
Unstandardized
Coefficients
Beta
Standardized
Coefficients
t Sig.
Dependent Variable: FEV1, age 5a.
45. Skill #6: Logistic Regression
Asthm a/wheeze sym ptom between 0 to 8 years * Ever antibiotic in first 12 m onths
165 257 422
39.1% 60.9% 100.0%
98 380 478
20.5% 79.5% 100.0%
263 637 900
29.2% 70.8% 100.0%
Control
Case
Total
No Yes
Antibiotic in first 12 months
Total
Odds of developing asthma inantibiotic group are:
(380/637)/(257/637)= 380/257= 1.48
Odds of developing asthma innon-antibiotic group are:
(98/263)/(165/263)= 98/165= 0.59
Theratio of odds for antibiotic users to the odds of non-antibiotic
users is:
(380/257)/(98/165)= (380*165)/(257*98)= 2.49
Theodds for antibiotic users is 1.49times the odds for non-users
46. What are the Odds?
Variables in the Equation
.912 .151 36.506 1 .000 2.489 1.852 3.347
-.521 .128 16.688 1 .000 .594
evrab12m
Constant
Step
1
a
B S.E. Wald df Sig. Exp(B) Lower Upper
95.0%C.I.for EXP(B)
Variable(s) entered on step 1: evrab12m.a.
Theinterceptof-0.52is thelogoddsfor non-antibioticusers since
this isthereference group.log(-0.52)=0.59
Thecoefficientofevrab12mis thelogoftheodds ratiobetweenthe
antibioticusers andnon-antibioticusers
47. Skill #7: CausalReasoning
Thequestions that motivate most studies in the health, social and behavioral sciences
arenotassociational butcausal in nature.
Before an association is assessed for the possibility that it is causal, other explanations
such as chance,bias and confounding have to beexcluded
Require some knowledge of the data-generating process - cannot be computed from
the data alone, nor from distributions governing data
Aim: to infer dynamics of beliefs under changing conditions, for example, changes
induced by treatments orexternal interventions.
Pearl, Judea."Causal inferencein statistics: An overview." Statistics surveys3 (2009): 96-146.
48. Bradford-Hill Principles of Causality
Plausibility
Does causation make sense
Consistency
Cause associated with disease in
different population and studies
Temporality
Cause precedes disease
Strength
Cause strongly associated with disease
Specificity
Does the cause lead toa specific effect
Dose-Response
Greater exposure to cause,
higher the risk of disease
50. Skill #8: Missing Data
MissingCompletelyAtRandom(MCAR)
The probability of data being missing does not depend on the observed or unobserved data
e.g. logit(pit) = θ0
MissingAtRandom(MAR)
The probability of data being missing does not depend on the unobserved data, conditional on the observed data
e.g. Children with missing wheeze data have better lung function
e.g. logit(pit) = θ0 + θ1ti or logit(pit) = θ0 + θ2y0
MissingNotAt Random(MNAR)
The probability of data being missing does depend on the unobserved data, conditional on the observed data.
e.g. Children with missing lung function have better lung function
e.g. logit(pit) = θ0 + θ3yit
Alexina Mason. “Bayesian methods for modelling non-randommissing data mechanisms in longitudinal studies” PhD Thesis (2009)
51. The Challenge of Missing Data
Missing data is a common problem in healthcaredata and can produce biased
parameter estimates
Reasons for missingness may be informativefor estimatingmodel parameters
Bayesian models: coherent approach to incorporating uncertaintyby assigning
prior distributions
Mason, Alexina, NickyBest, SylviaRichardson, and IANPLEWIS. "Strategy for modelling non-randommissing data mechanisms in observational studies using
Bayesian methods." Journalof Official Statistics (2010)
52. Skill #9: LatentVariable Modelling using Probabilistic
Programming
To identifysubgroups ofcomplexdiseaserisk
Treatmentoutcomeexplainedby distinctiveunderlyingmechanism
FoundationofStratifiedMedicine
Seekingbetter-targetedinterventions
55. Probabilistic Programming:
Finding Patterns in Data
Inference
Algorithm
Probabilistic
model
Probabilistic reasoning system
The probabilistic
model expresses
general knowledge
about a situation
The inference
algorithm uses the
model to answer
queries given evidence
The answers to queries
are framed as
probabilities of
different outcomes
The basic components of a probabilistic reasoning system
Answer
Queries
Evidence
The evidence contains
specific information
about a situation
The queries express
the things that will
help you make a
decisionAdapted from Pfeffer,Avi. "Practical probabilistic programming." International ConferenceonInductive Logic Programming. Springer Berlin Heidelberg, 2010.
56. 1. Team Science:Discoveries about healthcare, not hypothesised a priori, have been made by experts
explaining structure learned from data by algorithms tuned by those experts
2. Heuristic blend of biostatistics and machine-learningfor principled problem-led healthcare research
3. An ML approach to extracting knowledge from information in healthcare requires persistent integration of
Data
Methods
Expertise
Skill #10: Interdisciplinary Research
59. Healthcare ML at MSRCambridge
Javier Alvarez-Valle Danielle Belgrave Chris Bishop Laurence Bourn David Carter Richard Lowe
Hannah Murfet Jay Nanavati Kenton O’Hara Konstantina Palla Anton Schwaighofer
Kenji Takeda Ivan Tarapov Stefan Wijnen
Aditya NoriPratik Ghosh
Anja Thieme
Tim Regan
Isabel Chien
Jan Steumer
Antonio Criminisi
Sebastian Tschiatschek
61. Cohort Study: Example
• A long-term follow-up study of doctors began in 1950 (Doll and Hill)
• Questionnaire sent to all men and women on British Medical Register and
Resident in the UK
• Asked about age and smoking habits
• Replies obtained from 34,440 men
• Subsequent follow-up focused on men
• Further questionnaires sent at intervals and asked about changes in smoking
habit
• During 1951-1971 10,074 had died
62. Missing Completely At Random
𝑚𝑖
𝑝𝑖σ²
β
µ𝑖
θ
𝑦𝑖
Individual i
Model of Interest
Model of
Missingness
𝑥𝑖
logit(pit) = θ0
65. Prognostic Biomarker (Risk Factor)
A biological measurementmade before treatment toindicatelong-term
outcome for patientseither untreated or receiving standardoutcome
Prognosticbiomarker (risk
factor)
Random Allocation
(Treatment) Outcomes
Dunn,Graham, RichardEmsley, Hanhua Liu, and Sabine Landau. "Integratingbiomarkerinformation within trials to evaluatetreatment mechanisms andefficacy for
personalised medicine." ClinicalTrials10, no. 5 (2013): 709-719.
66. Predictive Biomarker (Moderator)
A variablethat changesthe impact of treatment on the outcome. A
biologicalmeasurement made before treatmentto identifypatientslikely
or unlikelyto benefit from a particulartreatment
Predictive biomarker
(Moderator)
Random Allocation
(Treatment) Outcomes
67. Mediator
A mechanism by which one variableaffectsanother variable.Omitted
common causes (hidden confounding)shouldalwaysbe considered as a
possibleexplanationfor associationsthat might be interpreted as causal
Random Allocation
(Treatment)
Mediator
Outcomes
U
68. Efficacyand mechanism evaluation: Causal
Framework forinvestigating who
medications workfor
Prognosticbiomarker (risk
factor)
Random Allocation
Predictive biomarker
(moderator)
Mediator
Outcomes
U
69. ‘‘-’’ ‘‘+’’
Moving the Threshold: right
withoutpeanut allergy
with peanutallergy
IgE response to Ara h2
3.15 7.155.15 9.15
70. ‘‘-’’ ‘‘+’’
Moving the Threshold: left
IgE response to Ara h2
withoutpeanut allergy
with peanutallergy
3.152.151.150.15