AI, electronic records, health
and why machine learning on dirty data opens new doors
Gaël Varoquaux
1 From clinical studies to electronic
health records
G Varoquaux 2
1 Clinical studies are hard
Covid 19 vaccines
Moderna:
- 2 months lab development
- 8 months clinical studies for approval
30 000 volunteers
Pfizer–BioNTech:
- 3 months lab development
- 7 months clinical studies for approval
10 000 volunteers
Experimentation on humans (slow cycles, high risks)
Conclusions must hold across heterogeneous individuals
1 Real-life evidence versus clinical trials
Do vaccines prevent spread?
Question ill-suited to intervention & requiring huge samples
Is the AstraZeneca vaccine applicable to people over 65 years old?
Fragile people (elderly) were excluded from the clinical trial
An external validity problem [Colnet... 2020]
Evidence from real-world observational data: following
individuals as they get, or not, the treatment [Dagan... 2021].
1 Electronic Health Records – source of real-life data
Patient records (anything available, really)
Claims databases, accounting, measurement history, doctors’ notes
Great longitudinal coverage
AP-HP (Paris hospitals)
39 hospitals
8 million patients a year
Great population coverage
Free data
1 Electronic Health Records: dirty data challenges
Missing values
Uneven data on patients, across hospital sites
Data not measured because not applicable, or for lack of time in an emergency...
Much higher rate of missingness than in clinical studies (often 80%)
Non-normalized information
Manual input, different conventions
“Diabetes Type 2” — “Diabetes Mellitus, Type 2” — “DM2”
1 Electronic Health Records: observational data ≠ experiments
Treated & non treated patients
are not comparable
Naive conclusions
on treatment efficacy
Causal inference techniques
Settings
- Treatment T (∈ {0, 1})
- Outcome Y
Potential outcome Y (T) (treated or not)
- Covariates X (condition of patient)
Need unconfoundedness: {Y(1), Y(0)} ⊥⊥ T | X
Given the covariates X, the potential outcomes of a patient
do not depend on whether the patient was actually treated
Accounting for covariates to
compensate for differences
1 AI for causal inference [Funk... 2011]
Unconfoundedness: {Y(1), Y(0)} ⊥⊥ T | X
Outcome regression
Model Y(T) = f(X, T)
f can be learned with an “AI”
(statistical machine learning)
Reweighting
Match the covariate distributions
of treated and non-treated patients
Learn P(T|X) with an “AI”
More generally:
machine learning as non-parametric statistical estimator
1 Remaining agenda: Machine learning can model this “dirty data”
1 From clinical studies to electronic health records
2 Learning on non-normalized data
3 Learning with missing values
2 Learning on non-normalized data
[Cerda... 2018, Cerda and Varoquaux 2020]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Data expressed with categories
in non-standardized form
“Dirty categories”
2 Dirty categories break standard statistical practice
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
OneHotEncoder not suitable
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
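Two of these failure modes can be seen on a toy example. The sketch below uses scikit-learn's OneHotEncoder on a handful of job titles taken from the slide; the train/test split is an assumption made for the illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["Master Police Officer"],
                  ["Police Officer III"],
                  ["Bus Operator"],
                  ["Bus Operator"]])
test = np.array([["Police Officer II"]])  # category never seen in training

# handle_unknown="ignore" avoids an error on unseen categories,
# but encodes them as a row of zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train).toarray()
X_test = encoder.transform(test).toarray()

# Overlapping categories: the two police titles share no information,
# their one-hot vectors are orthogonal.
overlap = X_train[0] @ X_train[1]

print(X_train.shape)  # one column per unique string
print(overlap)        # 0.0
print(X_test)         # all zeros: the new category carries no signal
```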
2 Standard approach: data curation
Database normalization, feature engineering
Employee Position Title  ⇒  Position        Rank
Master Police Officer    ⇒  Police Officer  Master
Social Worker III        ⇒  Social Worker   III
Police Officer II        ⇒  Police Officer  II
Social Worker II         ⇒  Social Worker   II
Police Officer III       ⇒  Police Officer  III
Merging entities: deduplication & record linkage
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
Hard to make automatic and turn-key
Our view: supervised learning on dirty categories
The statistical question should inform curation:
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
2 Simple fix: Adding similarities to one-hot encoding
One-hot encoding
         London  Londres  Paris
Londres  0       1        0
London   1       0        0
Paris    0       0        1
Similarity encoding [Cerda... 2018]
         London  Londres  Paris
Londres  0.3     1.0      0.0
London   1.0     0.3      0.0
Paris    0.0     0.0      1.0
The encoded matrix is X ∈ R^(n×p); the value 0.3 is the string similarity between “Londres” and “London”
Similarities link categories and give a meaningful encoding to new categories
= Prototype methods
How to choose a small number of prototypes?
The right prototypes may not be in training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
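The encoding in the table above can be sketched in a few lines. This is an illustrative version that assumes a Jaccard similarity on character 3-grams of the lower-cased, space-padded string; the function names `ngrams`, `similarity` and `similarity_encode` are hypothetical and are not the dirty-cat API.

```python
import numpy as np

def ngrams(string, n=3):
    """Set of character n-grams of the lower-cased, space-padded string."""
    padded = f" {string.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def similarity_encode(values, prototypes):
    """One row per value, one column per prototype category."""
    return np.array([[similarity(v, p) for p in prototypes] for v in values])

prototypes = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris"], prototypes)
print(X.round(2))
```

With this particular definition, “Londres” and “London” share 3 of their 10 distinct 3-grams, which gives the 0.3 of the table, and a string never seen in training still receives a non-trivial row as long as it shares substrings with a prototype.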
2 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
2 GaP Encoder, a latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
Each string is cut into overlapping 3-grams (3-gram 1, 3-gram 2, 3-gram 3, ...), illustrated on the string “London”
Model strings as a linear combination of substrings
[Figure: the matrix of 3-gram occurrences in the entries “police”, “officer”, “pol off”, “polis”, “policeman”, “policier” is approximated by the product of a documents × topics matrix and a topics × 3-grams matrix]
What substrings
are in a latent
category
What latent categories
are in an entry
2 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
Inferring plausible feature names
[Figure: activations of the latent categories for the entries listed below. Inferred feature names, partly truncated in the source: “assistant, library”; “equipment, operator”; “...tration, specialist”; “...ts worker, warehouse”; “...g, program, manager”; “mechanic, community”; “...er, rescuer, rescue”; “correction, officer”]
Categories:
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Senior Engineer Technician
Financial Programs Manager
Capital Projects Manager
Mechanic Technician II
Master Police Officer
Police Sergeant
2 Un-blackboxify: Data science with dirty categories
Retrieving insight from machine-learning on non-curated data
Feature importances
Given a fitted model:
from sklearn.inspection import permutation_importance
r = permutation_importance(model, X_val, y_val,
                           n_repeats=30,
                           random_state=0)
What characteristics of an employee are important to explain salary?
[Figure: permutation importances (horizontal axis, ticks at 0.0, 0.1, 0.2) of the inferred feature names:]
Information, Technology, Technologist
Officer, Office, Police
Liquor, Clerk, Store
School, Health, Room
Environmental, Telephone, Capital
Lieutenant, Captain, Chief
Income, Assistance, Compliance
Manager, Management, Property
[Cerda and Varoquaux 2020]
2 Dirty categories in practice
Software
DirtyCat: Dirty category software
http://dirty-cat.github.io
from dirty_cat import GapEncoder
gap_encoder = GapEncoder()
transformed_values = gap_encoder.fit_transform(df)
Practical tip
Gradient-boosted trees work very well on tabular data
sklearn.ensemble.HistGradientBoostingRegressor
[Cerda... 2018, Cerda and Varoquaux 2020]
3 Learning with missing values
[Josse... 2019]
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA          Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I
3 Classic statistics missing-values framework [Josse... 2019]
Model a) a complete data-generating process
Model b) a random process occluding entries
Missing at random situation (MAR)
Probability of missingness does not depend on unobserved values.
Theorem [Rubin 1976]: under MAR, the maximum likelihood of model a) can
be obtained on observed data while ignoring the unobserved values.
Justification for imputation of missing values
Missing Not at Random situation (MNAR)
Missingness not ignorable
Hard: need model of missing-values mechanism
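The difference between the two situations can be illustrated by simulation. In the sketch below the Gaussian data, the logistic missingness mechanisms and the regression imputation are all assumptions chosen for the example; the helper `imputed_mean` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)                                 # always observed
x2 = 0.5 * x1 + np.sqrt(0.75) * rng.normal(size=n)      # true mean 0, cor(x1, x2) = 0.5

# MAR: the probability that x2 is missing depends only on the observed x1
mar_mask = rng.random(n) < 1 / (1 + np.exp(-2 * x1))
# MNAR: the probability that x2 is missing depends on x2 itself
mnar_mask = rng.random(n) < 1 / (1 + np.exp(-2 * x2))

def imputed_mean(x1, x2, mask):
    """Mean of x2 after imputing missing entries by regression on x1."""
    observed = ~mask
    slope, intercept = np.polyfit(x1[observed], x2[observed], deg=1)
    x2_imputed = np.where(mask, slope * x1 + intercept, x2)
    return x2_imputed.mean()

complete_case_mar = x2[~mar_mask].mean()    # biased: ignores why data are missing
mar_estimate = imputed_mean(x1, x2, mar_mask)    # close to the true mean, 0
mnar_estimate = imputed_mean(x1, x2, mnar_mask)  # remains biased

print(f"complete cases (MAR): {complete_case_mar:.3f}")
print(f"imputation under MAR: {mar_estimate:.3f}")
print(f"imputation under MNAR: {mnar_estimate:.3f}")
```

Under MAR the relation between x2 and x1 is the same in observed and missing rows, so imputation from x1 is justified; under MNAR it is not, and a model of the missingness mechanism would be needed.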
[Figure: three panels showing the same data when Complete, under MAR, and under MNAR]
3 Supervised learning with missing values
Difficulties
Half-discrete input space (NA ∪ R)
Complex predictor even in simple settings (linear + MAR)
[Le Morvan... 2020b]
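Y = β₁* X₁ + β₂* X₂ + β₀*, with cor(X₁, X₂) = 0.5.
If X₂ is missing, the coefficient
of X₁ should compensate for
the missingness of X₂.
With d features, up to 2^d sets of slopes (one per missing-values pattern)
[Figure: when X₂ is missing, part of the effect of X₂ is accounted for by X₁ and the rest is lost]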
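The compensation effect can be checked numerically. The sketch below assumes unit-variance Gaussian features with correlation 0.5, arbitrary coefficient values and values of X₂ missing completely at random; under these assumptions the best slope on X₁ when X₂ is missing is β₁* + 0.5 β₂*.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0, beta1, beta2 = 1.0, 2.0, 3.0  # assumed values for the illustration

# Unit-variance features with cor(X1, X2) = 0.5
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=n)
y = beta0 + beta1 * X[:, 0] + beta2 * X[:, 1] + rng.normal(scale=0.1, size=n)

# X2 is missing for half of the rows (completely at random here)
x2_missing = rng.random(n) < 0.5

# Pattern "everything observed": ordinary least squares on (X1, X2)
design = np.column_stack([np.ones((~x2_missing).sum()), X[~x2_missing]])
coef_full = np.linalg.lstsq(design, y[~x2_missing], rcond=None)[0]

# Pattern "X2 missing": the best linear predictor only sees X1
slope_x1_only, intercept_x1_only = np.polyfit(X[x2_missing, 0], y[x2_missing], deg=1)

print("slopes with X2 observed:", coef_full[1:].round(2))    # close to (2, 3)
print("slope on X1 with X2 missing:", round(slope_x1_only, 2))  # close to 2 + 0.5 * 3
```

Each missing-values pattern calls for its own set of slopes, which is why a single linear model on imputed data is in general not the optimal predictor.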
3 NeuMiss network: adapted neural architecture [Le Morvan... 2020a]
Derive theoretical forms of optimal predictors
Approximate them with functions learnable by neural networks
Tailored architecture which learns all slopes jointly
[Figure: R2 score minus Bayes rate (vertical axis, from 0.00 down to −0.10) against the number of parameters (horizontal axis, 10^3 to 10^4), on train and test sets, for MLP Deep, MLP Wide and NeuMiss; legend: network depth 1, 3, 5, 7, 9 and width 1d, 3d, 10d, 30d, 50d]
NeuMiss needs fewer samples to approximate well and predict well
Also suitable for MNAR settings
AI, electronic health records, health
Electronic health records open new doors for cheaper studies
AI provides statistical estimators and information extraction
Dirty categories
Non-normalized data
Latent categories via string forms
Dirty category software:
http://dirty-cat.github.io
Supervised learning with missing data
Also suitable for MNAR
Broader picture: supervised learning without cleaning
http://project.inria.fr/dirtydata
Acknowledgements
Dirty categories
Patricio Cerda and Balázs Kégl
Missing data
Julie Josse, Erwan Scornet, Marine Le Morvan, Nicolas Prost
Electronic Health records
AP-HP, Alexandre Gramfort, Marc Lavielle, Lihu Chen,
Fabian Suchanek, Thomas Moreau, Antoine Neuraz...
4 References I
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical
variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty
categorical variables. Machine learning, 2018.
B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse,
and S. Yang. Causal inference methods for combining randomized trials and
observational studies: a review. arXiv preprint arXiv:2011.08047, 2020.
N. Dagan, N. Barda, E. Kepten, O. Miron, S. Perchik, M. A. Katz, M. A. Hernán,
M. Lipsitch, B. Reis, and R. D. Balicer. BNT162b2 mRNA Covid-19 vaccine in a
nationwide mass vaccination setting. New England Journal of Medicine, 2021.
M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and
M. Davidian. Doubly robust estimation of causal effects. American journal of
epidemiology, 173(7):761–767, 2011.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of
supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
4 References II
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss
networks: differentiable programming for supervised learning with missing values.
In Advances in Neural Information Processing Systems 33, 2020a.
M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear
predictor on linearly-generated data with missing values: non consistency and
solutions. AISTATS, 2020b.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

More Related Content

More from Gael Varoquaux

Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
Gael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 

More from Gael Varoquaux (20)

Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data science
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsity
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizations
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Simple big data, in Python
Simple big data, in PythonSimple big data, in Python
Simple big data, in Python
 
Succeeding in academia despite doing good_software
Succeeding in academia despite doing good_softwareSucceeding in academia despite doing good_software
Succeeding in academia despite doing good_software
 
Building a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetBuilding a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budget
 

Recently uploaded

ALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
Madan Karki
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
drjose256
 
Microkernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemMicrokernel in Operating System | Operating System
Microkernel in Operating System | Operating System
Sampad Kar
 
Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdf
Kamal Acharya
 

Recently uploaded (20)

Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Lab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docxLab Manual Arduino UNO Microcontrollar.docx
Lab Manual Arduino UNO Microcontrollar.docx
 
Introduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AIIntroduction to Artificial Intelligence and History of AI
Introduction to Artificial Intelligence and History of AI
 
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdfALCOHOL PRODUCTION- Beer Brewing Process.pdf
ALCOHOL PRODUCTION- Beer Brewing Process.pdf
 
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
Module-III Varried Flow.pptx GVF Definition, Water Surface Profile Dynamic Eq...
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
 
5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...5G and 6G refer to generations of mobile network technology, each representin...
5G and 6G refer to generations of mobile network technology, each representin...
 
Microkernel in Operating System | Operating System
Microkernel in Operating System | Operating SystemMicrokernel in Operating System | Operating System
Microkernel in Operating System | Operating System
 
Operating System chapter 9 (Virtual Memory)
Operating System chapter 9 (Virtual Memory)Operating System chapter 9 (Virtual Memory)
Operating System chapter 9 (Virtual Memory)
 
Multivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptxMultivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptx
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 
Online crime reporting system project.pdf
Online crime reporting system project.pdfOnline crime reporting system project.pdf
Online crime reporting system project.pdf
 
Supermarket billing system project report..pdf
Supermarket billing system project report..pdfSupermarket billing system project report..pdf
Supermarket billing system project report..pdf
 
Electrical shop management system project report.pdf
Electrical shop management system project report.pdfElectrical shop management system project report.pdf
Electrical shop management system project report.pdf
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded Systems
 
Introduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of ArduinoIntroduction to Arduino Programming: Features of Arduino
Introduction to Arduino Programming: Features of Arduino
 
Lesson no16 application of Induction Generator in Wind.ppsx
Lesson no16 application of Induction Generator in Wind.ppsxLesson no16 application of Induction Generator in Wind.ppsx
Lesson no16 application of Induction Generator in Wind.ppsx
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
How to Design and spec harmonic filter.pdf
How to Design and spec harmonic filter.pdfHow to Design and spec harmonic filter.pdf
How to Design and spec harmonic filter.pdf
 

AI, electronic records, and health

  • 1. AI, electronic records, health Gaël Varoquaux and why machine-learning on dirty data opens new doors
  • 2. 1 From clinical studies to electronic health records G Varoquaux 2
  • 3. 1 Clinical studies are hard Covid 19 vaccines Moderna: - 2 months lab development - 8 months clinical studies for approval 30 000 volunteers Pfizer–BioNTech: - 3 months lab development - 7 months clinical studies for approval 10 000 volunteers Experimentation on humans (slow cycles, high risks) Conclusion across individual heterogeneity G Varoquaux 3
  • 4. 1 Real-life evidence versus clinical trials Do vaccines prevent spread? Question ill-suited to intervention & requiring huge samples Is the Astrazeneca vaccine applicable to people above 65 years-old? Fragile people (elderly) were excluded from the clinical trial An external validity problem [Colnet... 2020] Evidence from real-world observational data: following individuals as they get, or not, the treatment [Dagan... 2021]. G Varoquaux 4
  • 5. 1 Electronic Health Records – source of real-life data Patient records (anything available, really) Claims databases, accounting, measurement history, doctors’ notes Great longitudinal coverage AP-HP (Paris hospitals) 39 hospitals 8 millions patients a year Great population coverage Free data G Varoquaux 5
  • 6. 1 Electronic Health Records: dirty data challenges Missing values Uneven data on patients, across hospital sites Data not measured because not applicable, no time in face of urgency... Much larger rate of missingness than in clinical studies (often 80%) G Varoquaux 6
  • 7. 1 Electronic Health Records: dirty data challenges Missing values Uneven data on patients, across hospital sites Data not measured because not applicable, no time in face of urgency... Much larger rate of missingness than in clinical studies (often 80%) Non normalized information Manual input, different conventions “Diabetes Type 2” — “Diabetes Mellitus, Type 2” — “DM2” G Varoquaux 6
  • 8. 1 Electronic Health Records: observational data 6= experiments Treated & non treated patients are not comparable Naive conclusions on treatment efficacy G Varoquaux 7
  • 9. 1 Electronic Health Records: observational data 6= experiments Treated & non treated patients are not comparable Naive conclusions on treatment efficacy Causal inference techniques Settings - Treatment T (∈ {0, 1}) - Outcome Y Potential outcome Y (T) (treated or not) - Covariates X (condition of patient) Need unconfoundedness {Y (1), Y (0)} |= T | X Potential outcomes Y of patients do not depend on whether they have really been treated or not Accounting for covariates to compensate for differences G Varoquaux 7
  • 10. 1 AI for causal inference [Funk... 2011] Unconfoundedness {Y (1), Y (0)} |= T | X Outcome regression Model Y (T) = f (X, T) f can be learned with an “AI” (statistical machine learning) G Varoquaux 8
  • 11. 1 AI for causal inference [Funk... 2011] Unconfoundedness {Y (1), Y (0)} |= T | X Outcome regression Model Y (T) = f (X, T) f can be learned with an “AI” (statistical machine learning) Reweighting Match the covariate distribution of treated and non treated Treated Non treated Learn P(T|X) with an “AI” G Varoquaux 8
  • 12. 1 AI for causal inference [Funk... 2011] Unconfoundedness {Y (1), Y (0)} |= T | X Outcome regression Model Y (T) = f (X, T) f can be learned with an “AI” (statistical machine learning) Reweighting Match the covariate distribution of treated and non treated Treated Non treated Learn P(T|X) with an “AI” More generally: machine learning as non-parametric statistical estimator G Varoquaux 8
  • 13. 1 Remaining agenda: Machine learning can model this “dirty data” 1 From clinical studies to electronic health records 2 Learning on non-normalized data 3 Learning with missing values G Varoquaux 9
  • 14. 2 Learning on non-normalized data [Cerda... 2018, Cerda and Varoquaux 2020] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Data expressed with categories in non-standardized form “Dirty categories” G Varoquaux 10
  • 15. 2 Dirty categories break standard statistical practice Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I OneHotEncoder not suitable Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 11
  • 16. 2 Standard approach: data curation. Database normalization and feature engineering:
        Employee Position Title    ⇒   Position          Rank
        Master Police Officer          Police Officer    Master
        Social Worker III              Social Worker     III
        Police Officer II              Police Officer    II
        Social Worker II               Social Worker     II
        Police Officer III             Police Officer    III
  • 17. 2 Standard approach: data curation. Database normalization and feature engineering: Employee Position Title (Master Police Officer; Social Worker III; ...) ⇒ Position and Rank (Police Officer, Master; Social Worker, III; ...). Merging entities: deduplication & record linkage, to output a “clean” database. Example, Company name: Pfizer Inc.; Pfizer Pharmaceuticals LLC; Pfizer International LLC; Pfizer Limited; Pfizer Corporation Hong Kong Limited; Pfizer Pharmaceuticals Korea Limited; ... Difficult without supervision.
  • 18. 2 Standard approach: data curation. Database normalization and feature engineering: Employee Position Title (Master Police Officer; Social Worker III; ...) ⇒ Position and Rank (Police Officer, Master; Social Worker, III; ...). Merging entities: deduplication & record linkage, to output a “clean” database. Example, Company name: Pfizer Inc.; Pfizer Pharmaceuticals LLC; ... Hard to make automatic and turn-key. Our view: supervised learning on dirty categories. The statistical question should inform curation: Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea.
  • 20. 2 Simple fix: Adding similarities to one-hot encoding
        One-hot encoding:
                   London   Londres   Paris
        Londres    0        1         0
        London     1        0         0
        Paris      0        0         1
        Similarity encoding [Cerda... 2018]:
                   London   Londres   Paris
        Londres    0.3      1.0       0.0
        London     1.0      0.3       0.0
        Paris      0.0      0.0       1.0
    The encoded matrix is $X \in \mathbb{R}^{n \times p}$. Annotations on the slide: new categories; link categories; string distance(Londres, London). Similarity encoding = prototype methods. How to choose a small number of prototypes? The right prototypes may not be in the training set: “big cat”, “fat cat”, “big dog”, “fat dog”. Estimate prototypes.
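The idea of replacing the 0/1 entries of one-hot encoding with string similarities can be sketched as follows. The slide does not say which string distance is used, so the 3-gram Jaccard similarity below is an assumption chosen for illustration, and the function names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) are hypothetical. With this choice, (Londres, London) scores 2/7 ≈ 0.29, close to the 0.3 shown on the slide.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a string (no padding, case-sensitive)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Strings shorter than n: fall back to exact matching
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

def similarity_encode(values, prototypes, n=3):
    """One row per value, one column per prototype category."""
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]

prototypes = ["London", "Londres", "Paris"]
values = ["Londres", "London", "Paris", "Londen"]  # "Londen" is a new category
for value, row in zip(values, similarity_encode(values, prototypes)):
    print(value, [round(s, 2) for s in row])
```

A category unseen at training time, such as the hypothetical “Londen”, still gets a non-zero encoding through its similarity to known prototypes, whereas one-hot encoding would map it to an all-zero row.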
  • 21. 2 Substring information. Drug Name: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol. Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III.
  • 22. 2 GaP Encoder, a latent category model [Cerda and Varoquaux 2020]. Topic model on sub-strings (GaP: Gamma-Poisson factorization): each string is split into overlapping 3-grams (3-gram 1, 3-gram 2, 3-gram 3, ...). Model strings as a linear combination of substrings. [Figure: the matrix of 3-gram counts for the entries “police officer”, “pol off”, “polis”, “policeman”, “policier” is factorized into two matrices: what latent categories (topics) are in an entry, and what substrings are in a latent category.]
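The factorization on this slide can be illustrated with a small stand-in. The sketch below is not the GaP encoder of [Cerda and Varoquaux 2020]: it swaps in a plain non-negative matrix factorization fitted with multiplicative updates for the Kullback-Leibler loss (which corresponds to a Poisson likelihood without the Gamma prior), and all names (`ngram_counts`, `factorize`, the example strings) are assumptions made for the example.

```python
import math
import random
from collections import Counter

EPS = 1e-12  # numerical guard against division by zero

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(strings, n=3):
    """Matrix V of n-gram counts: one row per string, one column per n-gram."""
    vocab = sorted({g for s in strings for g in char_ngrams(s, n)})
    column = {g: j for j, g in enumerate(vocab)}
    V = [[0.0] * len(vocab) for _ in strings]
    for i, s in enumerate(strings):
        for gram, count in Counter(char_ngrams(s, n)).items():
            V[i][column[gram]] = float(count)
    return V, vocab

def product(W, H):
    k, m = len(H), len(H[0])
    return [[sum(row[a] * H[a][j] for a in range(k)) for j in range(m)]
            for row in W]

def kl_loss(V, WH):
    """Generalized Kullback-Leibler divergence between V and W @ H."""
    loss = 0.0
    for v_row, p_row in zip(V, WH):
        for v, p in zip(v_row, p_row):
            loss += p - v
            if v > 0:
                loss += v * math.log(v / max(p, EPS))
    return loss

def factorize(V, n_topics=2, n_iter=200, seed=0):
    """V ~ W @ H with W (entries x topics) and H (topics x n-grams) >= 0."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(n_topics)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(n_topics)]
    losses = [kl_loss(V, product(W, H))]
    for _ in range(n_iter):
        # Update W: what latent categories are in each entry
        WH = product(W, H)
        R = [[V[i][j] / (WH[i][j] + EPS) for j in range(m)] for i in range(n)]
        W = [[W[i][a] * sum(H[a][j] * R[i][j] for j in range(m))
              / (sum(H[a]) + EPS) for a in range(n_topics)] for i in range(n)]
        # Update H: what substrings are in each latent category
        WH = product(W, H)
        R = [[V[i][j] / (WH[i][j] + EPS) for j in range(m)] for i in range(n)]
        H = [[H[a][j] * sum(W[i][a] * R[i][j] for i in range(n))
              / (sum(W[i][a] for i in range(n)) + EPS) for j in range(m)]
             for a in range(n_topics)]
        losses.append(kl_loss(V, product(W, H)))
    return W, H, losses

strings = ["police officer", "police aide", "bus operator", "equipment operator"]
V, vocab = ngram_counts(strings)
W, H, losses = factorize(V)
for s, row in zip(strings, W):
    print(s, [round(w, 2) for w in row])
```

Each row of `W` is a low-dimensional, non-negative encoding of one string, and each row of `H` lists how strongly every 3-gram belongs to a latent category. The result depends on the random initialization, so the printed activations should be read as one possible solution.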
  • 23. 2 String models of latent categories [Cerda and Varoquaux 2020]. Encodings that extract latent categories. [Figure: activations of the latent categories (columns, labels cropped) for the categories (rows): Legislative Analyst II; Legislative Attorney; Equipment Operator I; Transit Coordinator; Bus Operator; Senior Architect; Senior Engineer Technician; Financial Programs Manager; Capital Projects Manager; Mechanic Technician II; Master Police Officer; Police Sergeant.]
  • 24. 2 String models of latent categories [Cerda and Varoquaux 2020]. Inferring plausible feature names. [Figure: same activations, with columns labeled by inferred feature names (some labels cropped): “assistant, library”; “equipment, operator”; “…tration, specialist”; “…ts worker, warehouse”; “…g, program, manager”; “mechanic, community”; “…er, rescuer, rescue”; “correction, officer”. Rows (categories): Legislative Analyst II; Legislative Attorney; Equipment Operator I; Transit Coordinator; Bus Operator; Senior Architect; Senior Engineer Technician; Financial Programs Manager; Capital Projects Manager; Mechanic Technician II; Master Police Officer; Police Sergeant.]
  • 25. 2 Un-blackboxify: Data science with dirty categories [Cerda and Varoquaux 2020]. Retrieving insight from machine learning on non-curated data. Feature importances, given a fitted model:
        from sklearn.inspection import permutation_importance
        r = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)
  • 26. 2 Un-blackboxify: Data science with dirty categories [Cerda and Varoquaux 2020]. What characteristics of an employee are important to explain salary? [Figure: bar chart of permutation importances (axis from 0.0 to 0.2) for the inferred feature names: “Information, Technology, Technologist”; “Officer, Office, Police”; “Liquor, Clerk, Store”; “School, Health, Room”; “Environmental, Telephone, Capital”; “Lieutenant, Captain, Chief”; “Income, Assistance, Compliance”; “Manager, Management, Property”.]
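To make the notion of permutation importance concrete, here is a from-scratch sketch of the principle: shuffle one feature column at a time and record how much the validation score drops. It is not scikit-learn's implementation; the function names, the toy model and the toy data are hypothetical, and the score used is the negative mean squared error.

```python
import random

def neg_mse(model, X, y):
    """Score of a fitted model: negative mean squared error."""
    return -sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)

def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Mean drop in score when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = neg_mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - neg_mse(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical fitted model that only uses the first feature
def model(row):
    return 2.0 * row[0]

X_val = [[float(i), float((i * 7) % 5)] for i in range(8)]
y_val = [2.0 * row[0] for row in X_val]
print(permutation_importance_sketch(model, X_val, y_val))
```

The feature that the model ignores gets an importance of zero, since shuffling it leaves the predictions unchanged, while the feature it relies on gets a positive importance.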
  • 27. 2 Dirty categories in practice [Cerda... 2018, Cerda and Varoquaux 2020]. Software: DirtyCat, dirty category software, http://dirty-cat.github.io
        from dirty_cat import GapEncoder
        gap_encoder = GapEncoder()
        transformed_values = gap_encoder.fit_transform(df)
    Practical tip: gradient-boosted trees work very well on tabular data (sklearn.ensemble.HistGradientBoostingRegressor).
  • 28. 3 Learning with missing values [Josse... 2019]
        Gender   Date Hired   Employee Position Title
        M        09/12/1988   Master Police Officer
        F        NA           Social Worker IV
        M        07/16/2007   Police Officer III
        F        02/05/2007   Police Aide
        M        01/13/2014   Electrician I
        M        04/28/2002   Bus Operator
        M        NA           Bus Operator
        F        06/26/2006   Social Worker III
        F        01/26/2000   Library Assistant I
        M        NA           Library Assistant I
  • 30. 3 Classic statistics missing-values framework [Josse... 2019]. Model a): a complete data-generating process. Model b): a random process occluding entries. Missing At Random situation (MAR): the probability of missingness does not depend on unobserved values. Theorem [Rubin 1976]: under MAR, the maximum likelihood of model a) can be obtained on the observed data while ignoring the unobserved values. This is the justification for imputation of missing values. Missing Not At Random situation (MNAR): missingness is not ignorable. Hard: needs a model of the missing-values mechanism.
  • 31. 3 Classic statistics missing-values framework [Josse... 2019]. Model a): a complete data-generating process. Model b): a random process occluding entries. Missing At Random situation (MAR): the probability of missingness does not depend on unobserved values. Missing Not At Random situation (MNAR): missingness is not ignorable. Hard: needs a model of the missing-values mechanism. [Figure: three scatter plots of the same two-dimensional data, with axes from −2 to 2: Complete, MAR, MNAR.]
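The difference between the two mechanisms can be sketched by simulation. The setup below is hypothetical: two correlated Gaussian features, where `x1` is always observed and `x2` may be hidden. In the MAR variant the mask depends only on the observed `x1`; in the MNAR variant (self-masking) it depends on the value of `x2` that ends up hidden. The threshold and the function names are assumptions made for the example.

```python
import random

def simulate(n=1000, seed=0):
    """Complete data: pairs (x1, x2) of correlated Gaussian features."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        x2 = 0.5 * x1 + rng.gauss(0, 1)
        data.append((x1, x2))
    return data

def mask_mar(data, threshold=0.5):
    """MAR: x2 is hidden when the always-observed x1 exceeds the threshold."""
    return [(x1, None if x1 > threshold else x2) for x1, x2 in data]

def mask_mnar(data, threshold=0.5):
    """MNAR (self-masking): x2 is hidden when x2 itself exceeds the threshold."""
    return [(x1, None if x2 > threshold else x2) for x1, x2 in data]

def observed_mean(rows):
    """Mean of x2 over the rows where it is observed."""
    seen = [x2 for _, x2 in rows if x2 is not None]
    return sum(seen) / len(seen)

data = simulate()
print("complete:", observed_mean(data))
print("MAR:     ", observed_mean(mask_mar(data)))
print("MNAR:    ", observed_mean(mask_mnar(data)))
```

In the MNAR variant every observed value of `x2` lies below the threshold, so the mean of the observed values underestimates the mean of the complete data, and the masking rule cannot be read off the observed features alone.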
  • 32. 3 Supervised learning with missing values. Difficulties: half-discrete input space ($\mathrm{NA} \cup \mathbb{R}$); complex predictor even in simple settings (linear + MAR) [Le Morvan... 2020b]. Example: $Y = \beta^\star_1 X_1 + \beta^\star_2 X_2 + \beta^\star_0$ with $\mathrm{cor}(X_1, X_2) = 0.5$. If $X_2$ is missing, the coefficient of $X_1$ should compensate for the missingness of $X_2$. With $d$ features, there are up to $2^d$ sets of slopes. [Figure annotations: effect of $X_2$ lost; effect of $X_2$ accounted for by $X_1$.]
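The compensation mentioned on this slide can be worked out explicitly in a simplified case. The derivation below is a sketch under assumptions that are not stated on the slide: $(X_1, X_2)$ is jointly Gaussian with zero means and unit variances, and the missingness of $X_2$ is independent of the data, so that $\mathbb{E}[X_2 \mid X_1] = 0.5\, X_1$.

```latex
\begin{align*}
\mathbb{E}[Y \mid X_1,\ X_2 \text{ missing}]
  &= \beta^\star_0 + \beta^\star_1 X_1 + \beta^\star_2\, \mathbb{E}[X_2 \mid X_1] \\
  &= \beta^\star_0 + \beta^\star_1 X_1 + 0.5\, \beta^\star_2\, X_1 \\
  &= \beta^\star_0 + \left(\beta^\star_1 + 0.5\, \beta^\star_2\right) X_1
\end{align*}
```

The slope on $X_1$ therefore changes from $\beta^\star_1$ to $\beta^\star_1 + 0.5\, \beta^\star_2$ when $X_2$ is missing. Each missingness pattern has its own set of slopes, which is where the count of up to $2^d$ comes from.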
  • 35. 3 NeuMiss network: adapted neural architecture [Le Morvan... 2020a]. Derive theoretical forms of optimal predictors. Approximate them with functions learnable by neural networks. Tailored architecture which learns all slopes jointly. [Figure: R2 score minus Bayes rate (axis from 0.00 to −0.10) against number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep, MLP Wide and NeuMiss; legend: network depth 1, 3, 5, 7, 9; width 1 d, 3 d, 10 d, 30 d, 50 d.] NeuMiss needs fewer samples to approximate well and predict well. Also suitable for MNAR settings.
  • 37. AI, electronic health records, health. Electronic health records open new doors for cheaper studies. AI provides statistical estimators and information extraction. Dirty categories: non-normalized data; latent categories via string forms; dirty category software: http://dirty-cat.github.io
  • 38. AI, electronic health records, health. Electronic health records open new doors for cheaper studies. AI provides statistical estimators and information extraction. Dirty categories: latent categories via string forms; dirty category software: http://dirty-cat.github.io. Supervised learning with missing data: also suitable for MNAR. Broader picture: supervised learning without cleaning, http://project.inria.fr/dirtydata
  • 39. Acknowledgements. Dirty categories: Patricio Cerda and Balázs Kégl. Missing data: Julie Josse, Erwan Scornet, Marine Le Morvan, Nicolas Prost. Electronic health records: AP-HP, Alexandre Gramfort, Marc Lavielle, Lihu Chen, Fabian Suchanek, Thomas Moreau, Antoine Neuraz...
  • 40. 4 References I
    P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
    P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018.
    B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047, 2020.
    N. Dagan, N. Barda, E. Kepten, O. Miron, S. Perchik, M. A. Katz, M. A. Hernán, M. Lipsitch, B. Reis, and R. D. Balicer. BNT162b2 mRNA Covid-19 vaccine in a nationwide mass vaccination setting. New England Journal of Medicine, 2021.
    M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian. Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7):761–767, 2011.
    J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
  • 41. 4 References II
    M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020a.
    M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020b.
    D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.