Statistics and ML 21Oct22 sel.pptx

Oct 21, 2022
Statistics and machine learning:
friends or foes?
Ewout W. Steyerberg, PhD
Professor of Clinical Biostatistics and
Medical Decision Making
Dept of Biomedical Data Sciences
Leiden University Medical Center
Thanks to many, including Ben van Calster, Leuven;
Maarten van Smeden, Utrecht

Statistics and machine learning:
friends or foes?
21-Oct-22
• Introduction for debate
• Friction points: foes
• Commonalities between statistics and ML: friends

Statistics and Machine Learning (ML)
In medical research, “artificial intelligence” usually just means “machine learning” or
“algorithm”

Machine learning in medical research

Machine learning and AI everywhere

IBM Watson winning Jeopardy! (2011)

Dr Watson

Dr Watson lessons

Dr Watson lesson 1

Dr Watson lesson 2

Dr Watson lesson 3

Friction points between statistics and ML: foes
1. ML claims to be new and supersede statistics
2. ML claims any data is relevant
3. ML makes promises it cannot keep

“Everything is an ML method”

Language
Statistics                 | Machine learning
Covariates                 | Features
Outcome variable           | Target
Model                      | Network, graphs
Parameters                 | Weights
Model for discrete var.    | Classifier
Model for continuous var.  | Regression
Log-likelihood             | Cross-entropy loss
Multinomial regression     | Softmax
Measurement error          | Noise
Subject/observation        | Sample/instance
Dummy coding               | One-hot encoding
Measurement invariance     | Concept drift
Prediction                 | Supervised learning
Latent variable modeling   | Unsupervised learning
Fitting                    | Learning
Prediction error           | Error
Sensitivity                | Recall
Positive predictive value  | Precision
Contingency table          | Confusion matrix
Measurement error model    | Noise-aware ML
Structural equation model  | Gaussian Bayesian network
Gold standard              | Ground truth
Derivation–validation      | Training–test
Experiment                 | A/B test
https://www.analyticsvidhya.com/glossary-of-common-statistics-and-machine-learning-terms/
https://developers.google.com/machine-learning/glossary
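
A few of the term pairs above are the same quantity under two names. The sketch below is only an illustration: the outcomes, predicted probabilities and the 0.5 threshold are made-up assumptions, and the function names are hypothetical. It shows that mean cross-entropy loss is the negative Bernoulli log-likelihood divided by N, that a two-class softmax reduces to the logistic function, and that recall and precision are sensitivity and positive predictive value read off the same contingency table.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities (the multinomial logistic link)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y given probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def cross_entropy_loss(y, p):
    """Mean binary cross-entropy: the negative log-likelihood divided by N."""
    return -log_likelihood(y, p) / len(y)

def contingency_table(y_true, y_pred):
    """Return (tp, fp, fn, tn), i.e. the cells of the confusion matrix."""
    tp = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 1)
    fn = sum(1 for t, q in zip(y_true, y_pred) if t == 1 and q == 0)
    tn = sum(1 for t, q in zip(y_true, y_pred) if t == 0 and q == 0)
    return tp, fp, fn, tn

# Illustrative (made-up) outcomes and predicted probabilities.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
y_hat = [1 if pi >= 0.5 else 0 for pi in p]  # assumed threshold of 0.5

tp, fp, fn, tn = contingency_table(y, y_hat)
sensitivity = tp / (tp + fn)  # "recall" in ML terms
ppv = tp / (tp + fp)          # "precision" in ML terms

print("log-likelihood:", log_likelihood(y, p))
print("cross-entropy :", cross_entropy_loss(y, p))
print("sensitivity / recall:", sensitivity)
print("PPV / precision     :", ppv)
```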

Where to place Machine Learning?
https://codeburst.io/statistics-a-machine-learning-essential-ee537695786b

ML has developed from statistics
ML as part of statistics
Statistics as part of ML
ML:
models that fall roughly outside the traditional regression types of analysis:
• Decision trees (and descendants: XGBoost, ...)
• Support vector machines (SVMs)
• Neural networks (including deep learning)

2. ML claims any data is relevant
Typical context: Electronic Health Records (EHR); large administrative data sets
Uncover patterns in data that are there but remained hidden
Strong point of EHR: large N, large sets of features
Weak point of EHR: ‘quality’
- Selection of patients
- Start point definition
- End point definition
- Selective measurement
- Missing values
- …

More data is better? Lessons from meta-analysis
Meta-analysis:
Risk of bias assessment
Respect the clustered nature of the data
Personal protective equipment for preventing highly infectious diseases

3. ML makes promises it cannot keep
“Uncover patterns in data that are there but remained hidden”
Unsupervised learning
Clustering is unstable and determined by the optimization criterion
Supervised learning
Promise: trees / neural networks predict better than regression

Supervised learning example
Example from Maarten van Smeden, Utrecht; @MaartenvSmeden

Predicting mortality – the media

Findings not convincing
Cox, #4, 30 vars, max c = 0.793
RF (random forest), #7, 600 vars, c = 0.797
Elastic net, #9, 600 vars, c = 0.801
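
For readers less familiar with the c statistic compared here, a minimal sketch follows. It assumes a binary outcome without censoring, which is a simplification: the models above are survival models, for which a censoring-aware concordance index would be used. The function name and the data are hypothetical.

```python
def c_statistic(y, p):
    """Concordance (c) statistic for binary outcomes.

    Probability that a randomly chosen subject with the event received a
    higher predicted risk than a randomly chosen subject without it.
    Tied predictions count as half. Ignores censoring (an assumption).
    """
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))

# Illustrative (made-up) outcomes and predicted risks.
y = [1, 0, 1, 1, 0, 0, 1, 0]
p = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
print(c_statistic(y, p))  # 14 of 16 event/non-event pairs concordant: 0.875
```

A value of 0.5 means no discrimination and 1.0 perfect discrimination, so the three values on this slide differ only in the third decimal.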

RF showed poor calibration
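
One common way to quantify calibration is to regress the observed outcome on the logit of the predicted risk; this is not necessarily how calibration was assessed in the study shown. The sketch below assumes a binary outcome and predicted risks strictly between 0 and 1, and fits intercept and slope jointly by Newton-Raphson. All names are hypothetical. A slope below 1 suggests predictions that are too extreme, as is typical of overfitted models.

```python
import math

def expit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def calibration_intercept_slope(y, p, n_iter=50, tol=1e-10):
    """Fit logit(P(y = 1)) = a + b * logit(p) by Newton-Raphson.

    Returns (a, b). Perfect calibration corresponds to a = 0 and b = 1.
    """
    x = [logit(pi) for pi in p]
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, x):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * xi   # score for the slope
            h00 += w               # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b
```

Calling `calibration_intercept_slope(y, p)` with validation outcomes `y` and predicted risks `p` returns the two estimates; the data need both events and non-events and some spread in `p` for the fit to be defined.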

Machine learning vs conventional modeling
“We found that random forests did not outperform Cox models despite their inherent ability to
accommodate nonlinearities and interactions. …
Elastic nets achieved the highest discrimination performance …, demonstrating
the ability of regularisation to select relevant variables and optimise model coefficients in an EHR context.”

Systematic review on ML vs classic modeling

Commonalities between statistics and ML: friends
4. Research question is key
5. Complex data structures require innovative approaches
6. Some problems are really hard

4. Research question is key
From easy to hard questions
- Exploratory / descriptive
- Prediction / classification
- Causal

4. Research questions
Separate
- Exploratory: data mining
“enjoy the results, because you will never see these results again”
- Descriptive: patterns in the data to learn about nature;
hypothesis generating; biomarkers – disease
ML provides more flexibility; less interpretability?
- Prediction: machine learning / trees often perform poorly
ML may provide benefits in specific circumstances?

Van der Ploeg et al. BMC Med Res Methodol 2014;14:137.

ML good for prediction?
Large N, small p
“Natural flexibility”?
Versus non-linear terms / interactions in regression?

ML good for treatment selection rules?
High hopes
“The incorporation of new data modalities such as single-cell profiling, along with techniques that
rapidly find effective drug combinations will likely be instrumental in improving cancer care.”

Statistics good for treatment selection rules?

https://hbiostat.org/blog/post/path/index.html

Alternatives
1) Risk-based methods (11 papers) use only prognostic factors to define patient subgroups,
relying on the mathematical dependency of the absolute risk difference on baseline risk;
2) Treatment effect modeling methods (9 papers) use both prognostic factors and treatment effect modifiers,
including penalization or separate data sets for subgroup identification and effect estimation;
3) Optimal treatment regime methods (12 papers) focus primarily on treatment effect modifiers
to classify the trial population into those who benefit from treatment and those who do not
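
The "mathematical dependency of the absolute risk difference on baseline risk" in the risk-based approach can be illustrated with a small calculation. The sketch assumes a constant relative effect on the odds scale; the odds ratio of 0.5 and the baseline risks are hypothetical values chosen only for illustration, as are the function names.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Risk under treatment, assuming a constant odds ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)

def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Absolute risk difference between control and treatment."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)

# Same relative effect (hypothetical OR = 0.5), different baseline risks.
for risk in (0.02, 0.10, 0.20, 0.50):
    arr = absolute_risk_reduction(risk, odds_ratio=0.5)
    print(f"baseline risk {risk:.2f}: absolute risk reduction {arr:.3f}")
```

Under these assumptions the absolute benefit rises from about 1 percentage point at a baseline risk of 2% to about 17 percentage points at 50%, even though the relative effect is identical. This is why prognostic factors alone can define subgroups with different absolute benefit.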

5. Complex data structures require innovative approaches
Examples of successful ML
- Image analysis: Deep Learning (DL)
- Radiology, pathology, dermatology, ophthalmology, gastroenterology, cardiology,
…
- Free text: natural language processing (NLP)
- Mining electronic health records, building blocks for prediction, …
- Pharmacovigilance in social media

6. Some problems are really hard
Prediction
Small N, small p → regression
Small N, large p → hopeless
Large N, small p → regression
Large N, large p → ?
Treatment selection
Balance bias – precision
Causal interpretation

Summary 21 Oct 2022
1. ML is not really new and needs to liaise with statistics
2. Data quality and bias: design is key, learn from clinical epidemiology
3. Don’t make too many promises
4. Research questions relate to description, prediction and causality
5. Recognized power for specific complex data structures
6. Work on the truly hard problems together
