Dirty data science machine learning on non-curated data
Gaël Varoquaux,
Industry challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster
On some dirty-data problems,
progress in machine learning
can ease the pain
Talk outline
1 What models cannot fit
2 Learning with missing values
3 Machine learning on dirty categories
G Varoquaux 3
1 What models cannot fit
Outside of statistics’ comfort zone (X ∈ ℝ^{n×p})
1 The full life-cycle of a data-science project
Framing the domain question
Finding and understanding the data
Assembling and reshaping it
Designing an AI / statistical model *
Evaluating model performance
Inspecting the model for unwanted behavior
Bringing the model to stakeholders / production
*: what we think is cool
1 Understanding the data, between human and machine
Age: 60, 26, 38, 139, 52, 86, 17, 48
Just numbers
Age: 60, 26, 38, ?? 139, 52, 86, 17, 48
Numbers with a meaning
1 Understanding the data, between human and machine
Age   Name                      Born in   Activity
60    Bono                      Ireland   Singer
26    Justin Bieber             Canada    Singer
38    Giselle Knowles-Carter?   USA       Singer
139   Pablo Picasso             Spain     Painter
52    Céline Dion               Canada    Singer
86    Léonard Cohen             Canada    Singer
17    Greta Thunberg            Sweden    Activist
48    Justin Trudeau            Sweden    Politician
? Beyonce
A numerical column expresses a quantity, with a corresponding scale...
Recognized entries shed light on the numbers
They can be used to bring in additional information (features)
And find errors
Knowledge representation, relational algebra
1 Assembling data, of different natures and sources
Age   Name           Position
60    John Doe       Electrician
48    Jane Austen    Senior Professor
52    Jack Daniels   Professor
Position           Salary
Electrician        35 lizards
Professor          13 horses
Senior Professor   1 dragon
To model the link between age and salary, a join is necessary
Databases: to maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy.
Statistics: needs samples and features: multiple observations of the same kind
⇒ data is denormalized in 1 table
Age   Name          Position           Salary       Coffees/day
60    John Doe      Electrician        35 lizards   2
48    Jane Austen   Senior Professor   1 dragon     128
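A minimal pandas sketch of the join, assuming the two normalized tables are already loaded as DataFrames. The variable names are illustrative and not from the talk.

```python
import pandas as pd

# Normalized tables: people and the salary attached to each position
people = pd.DataFrame({
    "Age": [60, 48, 52],
    "Name": ["John Doe", "Jane Austen", "Jack Daniels"],
    "Position": ["Electrician", "Senior Professor", "Professor"],
})
salaries = pd.DataFrame({
    "Position": ["Electrician", "Professor", "Senior Professor"],
    "Salary": ["35 lizards", "13 horses", "1 dragon"],
})

# Denormalize: one row per person, with the salary of their position.
# validate= raises if a position appears twice in the salary table.
denormalized = people.merge(
    salaries, on="Position", how="left", validate="many_to_one"
)
print(denormalized)
```

A left join keeps every person even when the position has no match, so unmatched rows surface as missing salaries instead of vanishing silently.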
1 Aggregations – long vs wide tables
Person ID   Measure type     Value
12345       Blood Pressure   139
45673       Sugar Level      113
12345       Heart Rate       71
45673       Blood Pressure   84
Long table: flexible data representation
Person ID   Blood Pressure   Sugar Level   Heart Rate
12345       139              NA            71
45673       84               113           NA
Wide table: amenable to statistics on Person
Long to wide in Pandas: unstack, pivot
Also: count coffees per day per person from coffee-machine logs
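A short sketch of both operations in pandas. The long table reuses the values from the slide; the coffee-machine log is made up for illustration.

```python
import pandas as pd

long = pd.DataFrame({
    "Person ID": [12345, 45673, 12345, 45673],
    "Measure type": ["Blood Pressure", "Sugar Level", "Heart Rate", "Blood Pressure"],
    "Value": [139, 113, 71, 84],
})

# Long to wide: one row per person, one column per measure type.
# Measures that were never taken show up as NaN.
wide = long.pivot(index="Person ID", columns="Measure type", values="Value")

# Aggregation: hypothetical coffee-machine log, one row per coffee served
logs = pd.DataFrame({
    "person": ["a", "a", "b"],
    "day": ["2021-01-01", "2021-01-01", "2021-01-01"],
})
coffees_per_day = logs.groupby(["person", "day"]).size()
```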
1 Data wrangling: assembling unfamiliar sources
Relational algebra:
joins
aggregations (# coffees a day)
selections (finding the data)
Challenges:
understanding the data store
and domain logic
errors in the data
(correspondences in names)
Age   Name             Country   Position         Coffees/day
48    Justin Trudeau   Canada    Prime minister   3000
NA    Gaël Varoquaux   NA        NA               NA
In health:
Assembling information across large
electronic health records systems
1 Systematic errors: data require external checks
Measurement biases:
Volunteer bias: more women volunteer in medical studies
Selection bias: healthy people seldom go to the hospital (causal inference)
Survival bias: data loss related to the process under study (survival models)
Partly addressed by machine-learning models for
dataset shift (transfer learning) if you know the bias.
Brings us back to understanding the data
Data-science is much more than fitting a statistical model
Data require assembling information
Different data sources = different conventions
Measurements come with errors and biases
These challenges require domain knowledge and data wrangling
2 Learning with missing values
[Josse... 2019]
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA Library Assistant I
Why doesn’t the #$@! machine learning toolkit work?!
Machine learning models need entries in a vector space (or at least
a metric space).
NA ∉ ℝ
More than an implementation problem
Categorical entries are discrete anyhow
For missing values in categorical variables, create a
special category “missing”.
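In pandas this can be as small as a `fillna`. The column below is a made-up example.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
titles = pd.Series(["Bus Operator", None, "Police Aide"], dtype="object")

# Treat missingness as a category of its own
filled = titles.fillna("missing")
```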
Rest of talk on NA in numerical variables
2 Classic statistics points of view
Model a) a distribution fθ for the complete data x
Model b) a random process gφ occluding entries (mask m)
Missing at random situation (MAR)
for non-observed values, the probability of missingness does not depend
on this non-observed value. Proper definition in [Josse... 2019]
observed(x', m_i) = observed(x_i, m_i) ⇒ g_φ(m_i | x') = g_φ(m_i | x_i)
Theorem [Rubin 1976], in MAR, maximizing likelihood for observed data
while ignoring (marginalizing) the unobserved values gives maximum
likelihood of model a).
Missing Completely at random situation (MCAR)
Missingness is independent from data
Missing Not at Random situation (MNAR)
Missingness not ignorable
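A small numpy simulation of the two extreme mechanisms, as a sketch. The 30% missing rate and the 0.5 threshold are arbitrary choices for illustration, and the MNAR mechanism shown (self-masking of large values) is only one possible case.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))

# MCAR: the mask is drawn independently of the data
mcar_mask = rng.random(x.shape) < 0.3
x_mcar = np.where(mcar_mask, np.nan, x)

# MNAR (self-masking): a value is hidden because it is large
mnar_mask = x > 0.5
x_mnar = np.where(mnar_mask, np.nan, x)

# Under MCAR the observed mean stays close to the true mean;
# under this MNAR mechanism it is biased downwards.
print(x[:, 0].mean(), np.nanmean(x_mcar[:, 0]), np.nanmean(x_mnar[:, 0]))
```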
[Figure: three scatter plots of the same data: Complete, MCAR, MNAR]
But
There isn’t always an unobserved value
Age of spouse of singles?
Machine-learning’s goal is not to maximize likelihoods
2 Imputation
Fill in information
Gender   Date Hired    Employee Position Title
M        09/12/1988    Master Police Officer
F        NA → 2000     Social Worker IV
M        07/16/2007    Police Officer III
M        01/13/2014    Electrician I
M        04/28/2002    Bus Operator
M        NA → 2012     Bus Operator
F        06/26/2006    Social Worker III
F        01/26/2000    Library Assistant I
M        NA → 2014     Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?
2 Imputation and prediction with test-time missing values
Settings: y = f (x) + ε
Theorem [Josse... 2019]
f : trained predictor achieving Bayes risk on full data
Conditional multiple imputation achieves Bayes risk on test set
with missing data (in MAR settings)
f*_mult-imput(x̃) = E_{X_m | X_o = x_o}[f(X_m, x_o)]
Notations: x̃ ∈ (ℝ ∪ {NA})^p: data at hand
xo: observed values
xm: unobserved values
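The expectation above can be approximated by averaging predictions over several imputation draws. The sketch below is one way to do it with scikit-learn, not the procedure used in the paper: it assumes Gaussian data, a linear `f`, and `IterativeImputer(sample_posterior=True)` as the conditional sampler. The helper `multiple_imputation_predict` is hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
X_full = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n)
y = X_full @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

# f: predictor trained on complete data
f = LinearRegression().fit(X_full, y)

# Training data with MCAR holes, used only to fit the imputer
X_train_na = X_full.copy()
X_train_na[rng.random(X_full.shape) < 0.2] = np.nan

# Test data where the second feature is sometimes missing
X_test = X_full[:50].copy()
X_test[rng.random(50) < 0.5, 1] = np.nan


def multiple_imputation_predict(f, X_train, X_test, n_draws=20):
    """Average f over several draws of the missing values given the observed ones."""
    predictions = []
    for seed in range(n_draws):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        imputer.fit(X_train)
        predictions.append(f.predict(imputer.transform(X_test)))
    return np.mean(predictions, axis=0)


y_pred = multiple_imputation_predict(f, X_train_na, X_test)
```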
2 Imputation procedures that work out of sample
Mean imputation: special case of univariate imputation
Replace NA by the mean of the feature
sklearn.impute.SimpleImputer
Conditional imputation
Modeling one feature as a function of others
Possible implementation:
iteratively predict one feature as a function of the others
Classic implementations in R: MICE, missForest
sklearn.impute.IterativeImputer
bad computational scalability
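Both imputers follow the usual fit/transform pattern, which is what makes them usable out of sample: statistics are learned on the train set and reused on the test set. A minimal sketch on toy arrays (the data are made up):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])
X_test = np.array([[np.nan, 4.0], [5.0, np.nan]])

# Mean imputation: the means come from the train set only
mean_imputer = SimpleImputer(strategy="mean").fit(X_train)
X_test_mean = mean_imputer.transform(X_test)

# Conditional imputation: each feature is regressed on the others
iterative_imputer = IterativeImputer(random_state=0).fit(X_train)
X_test_iter = iterative_imputer.transform(X_test)
```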
Classic statistics point of view
Mean imputation is disastrous, because it distorts the distribution
“Congeniality” conditions: good imputation must preserve data properties
used by later analysis steps
2 Constant imputation for supervised learning
Theorem [Josse... 2019]
For a powerful learner (universally consistent) imputing both train
and test with the mean of train is consistent
i.e. it converges to the best possible prediction
Intuition
The learner “recognizes” imputed entries and compensates at test
time
Constant imputation breaks simple models (e.g. linear models)
[Morvan... 2020]
2 Imputation for supervised learning
Simulation: MCAR + Gradient boosting
[Figure: r2 score vs. sample size (10^2 to 10^4) for Mean and Iterative imputation. Panels: Convergence; Small sample size]
Notebook: github – @nprost / supervised missing
Conclusions: IterativeImputer is useful for small sample sizes
2 Imputation is not enough: predictive missingness
Pathological case [Josse... 2019]
y depends only on whether data is missing or not
e.g. tax fraud detection
theory: MNAR = “Missing Not At Random”
Imputing makes prediction impossible
Solution
Add a missingness indicator: extra feature to predict
...SimpleImputer(add_indicator=True)
...IterativeImputer(add_indicator=True)
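What the indicator does, on a toy array: the imputed columns come first, followed by one 0/1 column per feature that had missing values at fit time.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
# Columns 0-1: imputed features; columns 2-3: missingness indicators
print(X_out)
```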
Simulation: y depends indirectly on missingness censoring
[Figure: r2 score vs. sample size (10^2 to 10^4) for Mean, Mean+indicator, Iterative, Iterative+indicator. Panels: Convergence; Small sample size]
Notebook: github – @nprost / supervised missing
Adding a mask is crucial
Iterative imputation can be detrimental
2 Tree models with missing values
MIA (Missing Incorporated Attribute)
[Josse... 2019]
[Figure: decision tree in which each split (x10 < -1.5 ?, x2 < 2 ?, x7 < 0.3 ?, x1 < 0.5 ?) routes missing values to one of its branches (Yes/Missing or No/Missing); one leaf reads: Predict +1.3]
sklearn.ensemble.HistGradientBoostingClassifier
The learner readily
handles missing values
2 Tree models with missing values (MCAR)
Simulation: MCAR + Gradient boosting
[Figure: r2 score vs. sample size (10^2 to 10^4) for Inside trees, Mean, Iterative. Panels: Convergence; Small sample size]
Notebook: github – @nprost / supervised missing
2 Tree models with missing values (censored)
Simulation: MCAR + Gradient boosting
[Figure: r2 score vs. sample size (10^2 to 10^4) for Inside trees, Mean, Iterative, Mean+indicator, Iterative+indicator. Panels: Convergence; Small sample size]
Notebook: github – @nprost / supervised missing
2 Neural networks with missing values
Gradient-based optimization of continuous models
Difficulty: Half-discrete input space (NA ∪ R)
Y = β*_1 X_1 + β*_2 X_2 + β*_0
cor(X_1, X_2) = 0.5.
If X_2 is missing, the coefficient of X_1 should compensate for the missingness of X_2.
up to 2^d sets of slopes
[Figure: effect of X_2 lost vs. effect of X_2 accounted for by X_1]
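A numerical check of the compensation effect, as a sketch. The coefficient values are arbitrary, and unit variances are assumed so that the covariance equals the correlation of 0.5. When X_2 is unavailable, the best slope on X_1 alone becomes β_1 + 0.5 β_2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n)

beta1, beta2, beta0 = 1.0, 2.0, 0.5  # illustrative values
Y = beta1 * X[:, 0] + beta2 * X[:, 1] + beta0

# Least-squares slope when only X_1 is available
slope_x1_only = np.polyfit(X[:, 0], Y, deg=1)[0]
expected = beta1 + 0.5 * beta2
print(slope_x1_only, expected)
```

With d features, each of the up to 2^d missingness patterns calls for its own set of compensated slopes, which is what makes the problem hard for a plain linear model.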
2 NeuMiss network: adapted neural architecture [Le Morvan... 2020]
Neural networks that approximate optimal predictors (functions of Σ^{-1}).
Tailored architecture which learns all slopes jointly
[Figure: R2 score − Bayes rate (0.00 to −0.10) vs. number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep (network depth 1, 3, 5, 7, 9), MLP Wide (width 1d, 3d, 10d, 30d, 50d) and NeuMiss]
NeuMiss needs less data
Also suitable for MNAR settings
Learning with missing values
Imputation is motivated only in MAR settings
Rather than a sophisticated imputation,
use a powerful supervised learner
sklearn’s HistGradientBoostingClassifier
readily models missing values
Can work in MNAR settings
Different regime than standard statistics
3 Machine learning on dirty categories
[Cerda... 2018, Cerda and Varoquaux 2020]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
3 Categorical entries in a statistical model
[Table: the Employee Position Title column expanded into one 0/1 indicator column per category (Master Police Officer, Social Worker IV, Police Officer II, ...); each row has a 1 only in the column of its own category]
One-hot encoding: X ∈ ℝ^{n×p}
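A minimal scikit-learn sketch of one-hot encoding on a few of the titles. `handle_unknown="ignore"` is one possible choice for categories unseen at fit time: they are encoded as an all-zero row.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

titles = np.array([
    ["Master Police Officer"],
    ["Social Worker IV"],
    ["Police Officer III"],
    ["Bus Operator"],
    ["Bus Operator"],
])

encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(titles).toarray()  # one column per distinct title

# A title never seen during fit carries no information at all
X_new = encoder.transform([["Police Officer II"]]).toarray()
```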
3 Non-normalized categorical entries in a statistical model
Break OneHotEncoder:
Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in test set
3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]
High-cardinality categories
Represent each category by the average target y
Police Officer II → average salary of police officers II
[Figure: distribution of y (employee salary, 40 000 to 140 000) per category: Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
Embedding closeby categories with the same y can help
building a simple decision function.
DirtyCat: Dirty category software:
http://dirty-cat.github.io
from dirty_cat import TargetEncoder
target_encoder = TargetEncoder()
transformed_values = target_encoder.fit_transform(df, y)  # y: prediction target
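The idea itself fits in a few lines of pandas. This is a bare-bones sketch, not the DirtyCat implementation: it has no smoothing for rare categories, and the fallback to the global mean for unseen categories is an assumption. The salaries are made up.

```python
import pandas as pd

train = pd.DataFrame({
    "title": ["Police Officer II", "Police Officer II", "Library Aide", "Manager II"],
    "salary": [60000.0, 70000.0, 40000.0, 120000.0],
})

# One number per category: the average target in the train set
category_means = train.groupby("title")["salary"].mean()
global_mean = train["salary"].mean()


def target_encode(titles):
    """Map each title to its train-set mean salary (global mean if unseen)."""
    return titles.map(category_means).fillna(global_mean)


encoded = target_encode(pd.Series(["Police Officer II", "Architect III"]))
```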
3 Data curation
Feature engineering (database normalization)
Employee Position Title  ⇒  Position         Rank
Master Police Officer       Police Officer   Master
Social Worker III           Social Worker    III
Police Officer II           Police Officer   II
Social Worker II            Social Worker    II
Police Officer III          Police Officer   III
Merging entities (deduplication & record linkage)
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
Potentially suboptimal
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Hard to make automatic and turn-key
Harder than supervised learning
Our goal: supervised learning on dirty categories
The statistical question should inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
3 Adding similarities to one-hot encoding
One-hot encoding: X ∈ ℝ^{n×p}
          London   Londres   Paris
Londres   0        1         0
London    1        0         0
Paris     0        0         1
new categories?
link categories?
Similarity encoding [Cerda... 2018]
          London   Londres   Paris
Londres   0.3      1.0       0.0
London    1.0      0.3       0.0
Paris     0.0      0.0       1.0
string distance(Londres, London)
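A from-scratch sketch of similarity encoding with a 3-gram similarity. The padding with one space on each side and the Jaccard-style ratio are assumptions made here; the DirtyCat implementation may count n-grams differently.

```python
import numpy as np


def ngrams(s, n=3):
    """Set of n-grams of s, padded with one space on each side (an assumption)."""
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Shared n-grams divided by the total number of distinct n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def similarity_encode(values, prototypes):
    """One column per prototype, filled with similarities instead of 0/1."""
    return np.array(
        [[ngram_similarity(v, p) for p in prototypes] for v in values]
    )


prototypes = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris", "Londonn"], prototypes)
print(X.round(2))
```

Unlike one-hot encoding, the unseen entry "Londonn" still gets a non-zero representation, close to that of "London".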
3 Some string similarities
Levenshtein
Number of edits on one string to match the other
Jaro-Winkler
d_jaro(s1, s2) = m / (3|s1|) + m / (3|s2|) + (m − t) / (3m)
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters
e.g. in "London...": 3-gram1 = "Lon", 3-gram2 = "ond", 3-gram3 = "ndo", ...
similarity = #n-grams in common / #n-grams in total
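For reference, the Levenshtein distance can be computed with the standard dynamic program below (insertions, deletions and substitutions each cost 1). This is a plain textbook sketch, not code from the talk.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("London", "Londres"))
```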
3 Python implementation: DirtyCat
DirtyCat: Dirty category software:
http://dirty-cat.github.io
from dirty_cat import SimilarityEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
3 Dirty categories blow up dimension
New words in
natural language
X ∈ ℝ^{n×p}, p is large
Statistical problems
Computational problems
3 Tackling the high cardinality
Similarity encoding, one-hot encoding
= Prototype methods
How to choose a small number of prototypes?
All training-set ⇒ huge dimensionality
Most frequent?
Maybe the right prototypes ∉ training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
3 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
3 Modeling substrings [Cerda and Varoquaux 2020]
Model on sub-strings
(GaP: Gamma-Poisson factorization)
e.g. in "London...": 3-gram1 = "Lon", 3-gram2 = "ond", 3-gram3 = "ndo", ...
Models strings as a combination of substrings
[Table: presence of 3-grams (columns such as pol, lic, ice, ce_, _of, off, fic, cer, er_) in each entry (rows)]
police      11111000000000
officer     00000011111111
pol off     10000001100000
polis       11100000000000
policeman   11111100000000
policier    11111000000000
sklearn.feature_extraction.text
CountVectorizer
analyzer: 'word', 'char', 'char_wb'
HashingVectorizer: fast, stateless
TfidfVectorizer: normalize counts
3 Latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
Models strings as a linear combination of substrings
[Figure: the entries × 3-grams matrix (police, officer, pol off, polis, policeman, policier) is factorized into two matrices: topics × 3-grams (what substrings are in a latent category) and documents × topics (what latent categories are in an entry)]
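A rough stand-in for this factorization using scikit-learn only. Note the swap: this is non-negative matrix factorization with a Kullback-Leibler loss on 3-gram counts, which is related to Poisson factorization but is not the Gamma-Poisson model of the paper. The number of topics (3) and the list of entries are arbitrary choices.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = [
    "Master Police Officer", "Police Sergeant", "Police Officer III",
    "Senior Architect", "Senior Engineer Technician", "Mechanic Technician II",
    "Bus Operator", "Equipment Operator I",
]

# Entries x 3-grams count matrix
counts = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(entries)

# Stand-in for GaP: NMF with a KL (Poisson-like) loss
nmf = NMF(
    n_components=3,
    beta_loss="kullback-leibler",
    solver="mu",
    init="nndsvda",
    max_iter=500,
    random_state=0,
)
activations = nmf.fit_transform(counts)  # what latent categories are in an entry
topics = nmf.components_                 # what substrings are in a latent category
```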
3 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: activations of the latent categories (columns) for each of the categories below (rows)]
Categories:
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Senior Engineer Technician
Financial Programs Manager
Capital Projects Manager
Mechanic Technician II
Master Police Officer
Police Sergeant
3 String models of latent categories [Cerda and Varoquaux 2020]
Inferring plausible feature names
[Figure: activations of the latent categories, with columns labelled by inferred feature names: …stant, library; …ment, operator; …on, specialist; …ker, warehouse; …ogram, manager; …nic, community; …escuer, rescue; …ction, officer. Rows: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
3 Data science with dirty categories
[Figure: permutation importances (0.0 to 0.2) of the inferred feature names]
Information, Technology, Technologist
Officer, Office, Police
Liquor, Clerk, Store
School, Health, Room
Environmental, Telephone, Capital
Lieutenant, Captain, Chief
Income, Assistance, Compliance
Manager, Management, Property
Learning does not require clean entities
Model continuous similarities across entries
Sub-string models can capture these
Requires a powerful statistical model (Gradient-boosted trees)
Explainable machine-learning techniques to give insight
@GaelVaroquaux
Machine learning with dirty data
What models cannot fit
Dirty categories
Missing values
Understanding and formatting data is unavoidable
Master these aspects
Powerful machine-learning models can cope with dirtiness
- If it is well represented (representing similarities and missingness)
- If they have supervision information
4 References I
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical
variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty
categorical variables. Machine Learning, 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of
supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss
networks: differentiable programming for supervised learning with missing values.
In Advances in Neural Information Processing Systems 33, 2020.
D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical
attributes in classification and prediction problems. ACM SIGKDD
Explorations Newsletter, 3(1):27–32, 2001.
M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor
on linearly-generated data with missing values: non consistency and solutions.
AISTATS, 2020.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.

More Related Content

What's hot

Support Vector Machine without tears
Support Vector Machine without tearsSupport Vector Machine without tears
Support Vector Machine without tears
Ankit Sharma
 
Machine Learning and Applications
Machine Learning and ApplicationsMachine Learning and Applications
Machine Learning and Applications
Geeta Arora
 
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Edureka!
 
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Simplilearn
 
svm classification
svm classificationsvm classification
svm classification
Akhilesh Joshi
 
Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning Introduction
YounesCharfaoui
 
Tutorial of topological_data_analysis_part_1(basic)
Tutorial of topological_data_analysis_part_1(basic)Tutorial of topological_data_analysis_part_1(basic)
Tutorial of topological_data_analysis_part_1(basic)
Ha Phuong
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning Programmers
Kimikazu Kato
 
Fuzzy c-means clustering for image segmentation
Fuzzy c-means  clustering for image segmentationFuzzy c-means  clustering for image segmentation
Fuzzy c-means clustering for image segmentation
Dharmesh Patel
 
genetic algorithms-artificial intelligence
 genetic algorithms-artificial intelligence genetic algorithms-artificial intelligence
genetic algorithms-artificial intelligenceKarunakar Singh Thakur
 
Introduction to Statistical Machine Learning
Introduction to Statistical Machine LearningIntroduction to Statistical Machine Learning
Introduction to Statistical Machine Learning
mahutte
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Salah Amean
 
Data Science Full Course | Edureka
Data Science Full Course | EdurekaData Science Full Course | Edureka
Data Science Full Course | Edureka
Edureka!
 
002.decision trees
002.decision trees002.decision trees
002.decision trees
hoangminhdong
 
Latent Dirichlet Allocation
Latent Dirichlet AllocationLatent Dirichlet Allocation
Latent Dirichlet Allocation
Sangwoo Mo
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
Chapter 4 Classification
Chapter 4 ClassificationChapter 4 Classification
Chapter 4 Classification
Khalid Elshafie
 
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Machine Learning vs Deep Learning vs Artificial Intelligence | ML vs DL vs AI...
Simplilearn
 
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Random Forest Tutorial | Random Forest in R | Machine Learning | Data Science...
Edureka!
 
Deep Learning for Artificial Intelligence (AI)
Deep Learning for Artificial Intelligence (AI)Deep Learning for Artificial Intelligence (AI)
Deep Learning for Artificial Intelligence (AI)
Er. Shiva K. Shrestha
 

Dirty data science machine learning on non-curated data

  • 1. Dirty data science machine learning on non-curated data Gaël Varoquaux,
  • 2. Dirty data science machine learning on non-curated data Gaël Varoquaux,
  • 3. Industry challenges to data science www.kaggle.com/ash316/novice-to-grandmaster
  • 4. Industry challenges to data science www.kaggle.com/ash316/novice-to-grandmaster On some dirty-data problems, progress in machine learning can ease the pain
  • 5. Talk outline: 1 What models cannot fit; 2 Learning with missing values; 3 Machine learning on dirty categories. G Varoquaux 3
  • 6. 1 What models cannot fit: outside of statistics’ comfort zone ($X \in \mathbb{R}^{n \times p}$)
  • 7. 1 The full life-cycle of a data-science project: Framing the domain question; Finding and understanding the data; Assembling and reshaping it; Designing an AI / statistical model*; Evaluating model performance; Inspecting the model for unwanted behavior; Bringing the model to stakeholders / production. (*: what we think is cool)
  • 8. 1 Understanding the data, between human and machine Age 60 26 38 139 52 86 17 48 Just numbers G Varoquaux 6
  • 9. 1 Understanding the data, between human and machine Age 60 26 38 ?? 139 52 86 17 48 Numbers with a meaning A numerical column expresses a quantity, with a corresponding scale... G Varoquaux 6
  • 10. 1 Understanding the data, between human and machine
      Age  Name
      60   Bono
      26   Justin Bieber
      38   Giselle Knowles-Carter*
      139  Pablo Picasso
      52   Céline Dion
      86   Léonard Cohen
      17   Greta Thunberg
      48   Justin Trudeau
      * Beyonce
      A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers
  • 12. 1 Understanding the data, between human and machine
      Age  Name                     Born in  Activity
      60   Bono                     Ireland  Singer
      26   Justin Bieber            Canada   Singer
      38   Giselle Knowles-Carter*  USA      Singer
      139  Pablo Picasso            Spain    Painter
      52   Céline Dion              Canada   Singer
      86   Léonard Cohen            Canada   Singer
      17   Greta Thunberg           Sweden   Activist
      48   Justin Trudeau           Sweden   Politician
      * Beyonce
      A numerical column expresses a quantity, with a corresponding scale... Recognized entries shed light on the numbers. They can be used to bring in additional information (features). And find errors. Knowledge representation, relational algebra
  • 13. 1 Assembling data, of different natures and sources
      Age  Name          Position
      60   John Doe      Electrician
      48   Jane Austen   Senior Professor
      52   Jack Daniels  Professor

      Position          Salary
      Electrician       35 lizards
      Professor         13 horses
      Senior Professor  1 dragon

      To model the link between age and salary, a join is necessary.
      Databases: to maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy.
      Statistics: needs samples and features: multiple observations of the same kind ⇒ data is denormalized in 1 table.

      Age  Name         Position          Salary      Coffees/day
      60   John Doe     Electrician       35 lizards  2
      48   Jane Austen  Senior Professor  1 dragon    128
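The join described on this slide can be sketched with pandas. This is an illustration only: the frame names (`people`, `salaries`) are hypothetical, and the use of `merge` with `validate="many_to_one"` is an assumption about how one might guard against duplicated positions, not something the talk prescribes.

```python
import pandas as pd

# Normalized "database" view: two tables linked by the Position key.
people = pd.DataFrame({
    "Age": [60, 48, 52],
    "Name": ["John Doe", "Jane Austen", "Jack Daniels"],
    "Position": ["Electrician", "Senior Professor", "Professor"],
})
salaries = pd.DataFrame({
    "Position": ["Electrician", "Professor", "Senior Professor"],
    "Salary": ["35 lizards", "13 horses", "1 dragon"],
})

# Denormalized "statistics" view: one row per person, with the salary
# brought in by a left join. validate= raises if a position appears
# twice in the salary table, which would silently duplicate people.
joined = people.merge(salaries, on="Position", how="left",
                      validate="many_to_one")
print(joined)
```

A left join keeps every person even when no salary matches, so unmatched keys show up as missing values rather than as silently dropped rows.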
  • 14. 1 Aggregations – long vs wide tables
      Long table (flexible data representation):
      Person ID  Measure type    Value
      12345      Blood Pressure  139
      45673      Sugar Level     113
      12345      Heart Rate      71
      45673      Blood Pressure  84

      Wide table (amenable to statistics on Person):
      Person ID  Blood Pressure  Sugar Level  Heart Rate
      12345      139             NA           71
      45673      84              113          NA

      Long to wide in Pandas: unstack, pivot. Also: count coffees per day per person from coffee-machine logs
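A minimal sketch of the long-to-wide reshaping named on this slide, using the slide's own values. The column names (`person_id`, `measure`, `value`) and the tiny coffee-machine log are made up for illustration.

```python
import pandas as pd

# Long table: one row per (person, measure type).
long_table = pd.DataFrame({
    "person_id": [12345, 45673, 12345, 45673],
    "measure": ["Blood Pressure", "Sugar Level", "Heart Rate", "Blood Pressure"],
    "value": [139, 113, 71, 84],
})

# Wide table: one row per person, one column per measure type.
# Combinations absent from the long table become NaN.
wide = long_table.pivot(index="person_id", columns="measure", values="value")

# Same reshaping with unstack on a two-level index.
wide_unstack = long_table.set_index(["person_id", "measure"])["value"].unstack()

# Aggregation: count coffees per day per person from (hypothetical) logs.
logs = pd.DataFrame({
    "person": ["a", "a", "b"],
    "day": ["mon", "mon", "mon"],
})
coffees_per_day = logs.groupby(["person", "day"]).size()
print(wide)
print(coffees_per_day)
```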
  • 15. 1 Data wrangling: assembling unfamiliar sources Relational algebra: joins, aggregations (# coffees a day), selections (finding the data). Challenges: understanding the data store and domain logic; errors in the data (correspondences in names).
      Age  Name            Country  Position        Coffees/day
      48   Justin Trudeau  Canada   Prime minister  3000
      NA   Gaël Varoquaux  NA       NA              NA
  • 16. 1 Data wrangling: assembling unfamiliar sources Relational algebra: joins, aggregations (# coffees a day), selections (finding the data). Challenges: understanding the data store and domain logic; errors in the data (correspondences in names). In health: assembling information across large electronic health records systems
  • 20. 1 Systematic errors: data require external checks Measurement biases: Volunteer bias More women volunteer in medical studies Selection bias Healthy people seldom go to the hospital (causal inference) Survival bias Data loss related to the process under study (survival models) Partly addressed by machine-learning models for dataset shift (transfer learning) if you know the bias. Brings us back to understanding the data G Varoquaux 10
  • 21. Data-science is much more than fitting a statistical model Data require assembling information Different data sources = different conventions Measurements come with errors and biases These challenges require domain knowledge and data wrangling G Varoquaux 11
  • 22. 2 Learning with missing values [Josse... 2019]
      Gender  Date Hired  Employee Position Title
      M       09/12/1988  Master Police Officer
      F       NA          Social Worker IV
      M       07/16/2007  Police Officer III
      F       02/05/2007  Police Aide
      M       01/13/2014  Electrician I
      M       04/28/2002  Bus Operator
      M       NA          Bus Operator
      F       06/26/2006  Social Worker III
      F       01/26/2000  Library Assistant I
      M       NA          Library Assistant I
  • 24. Why doesn’t the #$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). $\mathrm{NA} \notin \mathbb{R}$. More than an implementation problem. Categorical entries are discrete anyhow: for missing values in categorical variables, create a special category “missing”. Rest of talk on NA in numerical variables
  • 25. 2 Classic statistics points of view Model a) a distribution $f_\theta$ for the complete data $x$. Model b) a random process $g_\phi$ occluding entries (mask $m$). Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]: $\mathrm{observed}(x', m_i) = \mathrm{observed}(x_i, m_i) \Rightarrow g_\phi(m_i \mid x') = g_\phi(m_i \mid x_i)$. Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
  • 28. 2 Classic statistics points of view Model a) a distribution $f_\theta$ for the complete data $x$. Model b) a random process $g_\phi$ occluding entries (mask $m$). Missing at random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]. Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a). Missing Completely at random situation (MCAR): missingness is independent from data. Missing Not at Random situation (MNAR): missingness not ignorable. [Figure: three panels, Complete, MCAR, MNAR] But: there isn’t always an unobserved value (age of spouse of singles?). Machine-learning’s goal is not to maximize likelihoods
  • 29. 2 Imputation: fill in information
      Gender  Date Hired  Employee Position Title
      M       09/12/1988  Master Police Officer
      F       NA → 2000   Social Worker IV
      M       07/16/2007  Police Officer III
      M       01/13/2014  Electrician I
      M       04/28/2002  Bus Operator
      M       NA → 2012   Bus Operator
      F       06/26/2006  Social Worker III
      F       01/26/2000  Library Assistant I
      M       NA → 2014   Library Assistant I
      Large statistical literature. Procedures and results focused on in-sample settings. How about completing the test set with the train set? What to do with the prediction target y?
  • 30. 2 Imputation and prediction with test-time missing values Settings: $y = f(x) + \varepsilon$. Theorem [Josse... 2019]: for $f$ a trained predictor achieving Bayes risk on full data, conditional multiple imputation achieves Bayes risk on test set with missing data (in MAR settings): $f^\star_{\text{mult imput}}(\tilde{x}) = \mathbb{E}_{X_m \mid X_o = x_o}[f(X_m, x_o)]$. Notations: $\tilde{x} \in (\mathbb{R} \cup \{\mathrm{NA}\})^p$: data at hand; $x_o$: observed values; $x_m$: unobserved values
  • 33. 2 Imputation procedures that work out of sample Mean imputation: special case of univariate imputation. Replace NA by the mean of the feature: sklearn.impute.SimpleImputer. Conditional imputation: modeling one feature as a function of others. Possible implementation: iteratively predict one feature as a function of the others. Classic implementations in R: MICE, missForest. sklearn.impute.IterativeImputer: bad computational scalability. Classic statistics point of view: mean imputation is disastrous, because it distorts the distribution. “Congeniality” conditions: good imputation must preserve data properties used by later analysis steps
  • 35. 2 Constant imputation for supervised learning Theorem [Josse... 2019]: for a powerful learner (universally consistent), imputing both train and test with the mean of train is consistent, i.e. it converges to the best possible prediction. Intuition: the learner “recognizes” imputed entries and compensates at test time. Constant imputation breaks simple models (e.g. linear models) [Morvan... 2020]
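The procedure in the theorem, imputing both train and test with the mean of train, can be sketched in plain Python. The helper names (`column_means`, `impute`) and the toy data are hypothetical; the point is only that the statistics come from the training set and are reused unchanged on the test set.

```python
import math

def column_means(rows):
    """Per-column mean over the observed (non-NaN) entries of the rows."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        means.append(sum(observed) / len(observed))
    return means

def impute(rows, means):
    """Replace each NaN by the mean stored for its column."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]

nan = float("nan")
X_train = [[1.0, 10.0], [3.0, nan], [nan, 30.0]]
X_test = [[nan, nan], [5.0, 50.0]]

# Means are estimated on the training set only...
train_means = column_means(X_train)
# ...and applied to both sets, so an imputed entry takes the same value
# at train and test time and the learner can compensate for it.
X_train_imputed = impute(X_train, train_means)
X_test_imputed = impute(X_test, train_means)
print(train_means, X_test_imputed)
```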
  • 36. 2 Imputation for supervised learning Simulation: MCAR + Gradient boosting. [Figure: r2 score as a function of sample size (10^2 to 10^4) for Mean and Iterative imputation; panels: small sample size, convergence] Notebook: github – @nprost / supervised missing. Conclusions: IterativeImputer is useful for small sample sizes
  • 38. 2 Imputation is not enough: predictive missingness Pathological case [Josse... 2019]: y depends only on whether data is missing or not, e.g. tax fraud detection. Theory: MNAR = “Missing Not At Random”. Imputing makes prediction impossible. Solution: add a missingness indicator, an extra feature to predict: ...SimpleImputer(add_indicator=True), ...IterativeImputer(add_indicator=True). Simulation: y depends indirectly on missingness (censoring). [Figure: r2 score as a function of sample size (10^2 to 10^4) for Mean, Mean + indicator, Iterative, Iterative + indicator; panels: small sample size, convergence] Notebook: github – @nprost / supervised missing. Adding a mask is crucial. Iterative imputation can be detrimental
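As an illustration of the missingness indicator, here is a small sketch with scikit-learn's `SimpleImputer` and its `add_indicator` option. The toy arrays are made up; the output layout described in the comments (imputed features first, then one indicator column per feature that had missing values during `fit`) is the library's documented behavior as far as I know.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 6.0],
])

# Mean imputation plus a binary mask: the output holds the imputed
# features followed by one 0/1 column per feature with missing values.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_train_aug = imputer.fit_transform(X_train)

# At test time the training means and the same indicator columns are reused.
X_test_aug = imputer.transform(np.array([[np.nan, 2.0]]))
print(X_train_aug)
print(X_test_aug)
```

The extra columns let a downstream learner use the fact that a value was missing, which the imputed value alone hides.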
  • 39. 2 Tree models with missing values MIA (Missing Incorporated Attribute) [Josse... 2019]. [Figure: decision tree with splits "x10 < -1.5?", "x2 < 2?", "x7 < 0.3?", "x1 < 0.5?"; at each split, missing values are sent down one of the two branches (Yes/Missing vs No, or Yes vs No/Missing); leaf: Predict +1.3] sklearn.ensemble.HistGradientBoostingClassifier: the learner readily handles missing values
  • 40. 2 Tree models with missing values (MCAR) Simulation: MCAR + Gradient boosting. [Figure: r2 score as a function of sample size (10^2 to 10^4) for Inside trees, Mean, Iterative; panels: small sample size, convergence] Notebook: github – @nprost / supervised missing
  • 41. 2 Tree models with missing values (censored) Simulation: MCAR + Gradient boosting. [Figure: r2 score as a function of sample size (10^2 to 10^4) for Inside trees, Mean, Iterative, Mean + indicator, Iterative + indicator; panels: small sample size, convergence] Notebook: github – @nprost / supervised missing
  • 42. 2 Neural networks with missing values Gradient-based optimization of continuous models. Difficulty: half-discrete input space ($\mathrm{NA} \cup \mathbb{R}$). $Y = \beta^\star_1 X_1 + \beta^\star_2 X_2 + \beta^\star_0$, $\mathrm{cor}(X_1, X_2) = 0.5$. If $X_2$ is missing, the coefficient of $X_1$ should compensate for the missingness of $X_2$: up to $2^d$ sets of slopes. [Figure: effect of $X_2$ lost vs effect of $X_2$ accounted for by $X_1$]
  • 45. 2 NeuMiss network: adapted neural architecture [Le Morvan... 2020] Neural networks that approximate optimal predictors (functions of $\Sigma^{-1}$). Tailored architecture which learns all slopes jointly. [Figure: R2 score minus Bayes rate (0.00 to −0.10) as a function of the number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep, MLP Wide and NeuMiss; network depth 1, 3, 5, 7, 9; width 1d, 3d, 10d, 30d, 50d] NeuMiss needs less data. Also suitable for MNAR settings
  • 46. Learning with missing values Imputation is motivated only in MAR settings. Rather than a sophisticated imputation, use a powerful supervised learner: sklearn’s HistGradientBoostingClassifier readily models missing values. Can work in MNAR settings. Different regime than standard statistics
  • 47. 3 Machine learning on dirty categories [Cerda... 2018, Cerda and Varoquaux 2020] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I G Varoquaux 27
  • 48. 3 Categorical entries in a statistical model Employee Position Title: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I. [Figure: one-hot encoding matrix, one 0/1 column per category (Master Police Officer, Social Worker IV, Police Officer II, ...)] One-hot encoding: $X \in \mathbb{R}^{n \times p}$
  • 49. 3 Non-normalized categorical entries in a statistical model Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Break OneHotEncoder Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 29
  • 51. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001] High-cardinality categories: represent each category by the average target y. Police Officer II → average salary of police officer II. [Figure: job titles placed along y, the employee salary (40000 to 140000): Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II] Embedding nearby categories with the same y can help build a simple decision function.
  • 52. 3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001] High-cardinality categories: represent each category by the average target y. Police Officer II → average salary of police officer II. DirtyCat: Dirty category software: http://dirty-cat.github.io
      from dirty_cat import TargetEncoder
      target_encoder = TargetEncoder()
      transformed_values = target_encoder.fit_transform(df)
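To make the idea concrete, here is a stripped-down target encoder in plain Python. It is a sketch of the principle only, not the dirty_cat implementation: the function names and the salaries are invented, and the smoothing toward the overall mean used for rare categories in practical target encoders is deliberately left out. Unseen categories fall back to the global mean, which is one simple convention among others.

```python
from collections import defaultdict

def fit_target_encoding(categories, y):
    """Map each category to the mean target observed for it (no smoothing)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, y):
        sums[category] += target
        counts[category] += 1
    global_mean = sum(y) / len(y)
    means = {category: sums[category] / counts[category] for category in sums}
    return means, global_mean

def transform_target_encoding(categories, means, global_mean):
    """Encode categories; unseen ones get the global mean (an assumption)."""
    return [means.get(category, global_mean) for category in categories]

# Hypothetical training data: job titles and salaries.
titles = ["Police Officer II", "Police Officer II", "Library Aide", "Manager III"]
salaries = [60000.0, 70000.0, 35000.0, 120000.0]

means, global_mean = fit_target_encoding(titles, salaries)
encoded = transform_target_encoding(
    ["Police Officer II", "Architect III"], means, global_mean)
print(encoded)
```

Each category becomes a single number, so the encoded column stays one-dimensional however many distinct titles there are.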
  • 53. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III Police Officer II Social Worker II Police Officer III ⇒ Position Rank Police Officer Master Social Worker III Police Officer II Social Worker II Police Officer III G Varoquaux 31
  • 54. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC Pfizer International LLC Pfizer Limited Pfizer Corporation Hong Kong Limited Pfizer Pharmaceuticals Korea Limited ... Difficult without supervision Potentially suboptimal Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea G Varoquaux 31
  • 55. 3 Data curation Database normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC ... Hard to make automatic and turn-key Harder than supervised learning G Varoquaux 31
  • 56. Our goal: supervised learning on dirty categories The statistical question should inform curation Pfizer Corporation Hong Kong = ? Pfizer Pharmaceuticals Korea G Varoquaux 32
  • 57. 3 Adding similarities to one-hot encoding
      One-hot encoding ($X \in \mathbb{R}^{n \times p}$; new categories? link categories?):
               London  Londres  Paris
      Londres  0       1        0
      London   1       0        0
      Paris    0       0        1

      Similarity encoding [Cerda... 2018], entries given by string distance(Londres, London):
               London  Londres  Paris
      Londres  0.3     1.0      0.0
      London   1.0     0.3      0.0
      Paris    0.0     0.0      1.0
  • 58. 3 Some string similarities Levenshtein: number of edits on one string to match the other. Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$, with $m$ the number of matching characters and $t$ the number of character transpositions. n-gram similarity: an n-gram is a group of n consecutive characters (illustrated on the slide by the successive 3-grams of "London..."); $\text{similarity} = \frac{\#\,\text{n-grams in common}}{\#\,\text{n-grams in total}}$
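The n-gram similarity and the similarity encoding of the previous slide can be sketched in a few lines of plain Python. Two assumptions are made here: "n-grams in total" is read as the number of distinct n-grams in either string, and no padding is added around the strings. The function names are hypothetical and the dirty_cat implementation may differ in these details.

```python
def ngrams(string, n=3):
    """Set of groups of n consecutive characters in the string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Shared n-grams divided by distinct n-grams in either string."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = grams_a | grams_b
    if not total:  # both strings shorter than n
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(total)

def similarity_encode(values, prototypes, n=3):
    """One row per value, one column per prototype category."""
    return [[ngram_similarity(value, prototype, n) for prototype in prototypes]
            for value in values]

prototypes = ["London", "Londres", "Paris"]
encoded = similarity_encode(["Londres", "London", "Paris"], prototypes)
for row in encoded:
    print([round(similarity, 1) for similarity in row])
```

With these conventions "London" and "Londres" share 2 of 7 distinct 3-grams, about 0.3, in line with the similarity-encoding table, while a one-hot encoding would treat them as unrelated.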
  • 59. 3 Python implementation: DirtyCat DirtyCat: Dirty category software: http://dirty-cat.github.io
      from dirty_cat import SimilarityEncoder
      similarity_encoder = SimilarityEncoder(similarity='ngram')
      transformed_values = similarity_encoder.fit_transform(df)
  • 62. 3 Dirty categories blow up dimension New words in natural language. $X \in \mathbb{R}^{n \times p}$, $p$ is large: statistical problems, computational problems
  • 64. 3 Tackling the high cardinality Similarity encoding, one-hot encoding = prototype methods. How to choose a small number of prototypes? All training-set ⇒ huge dimensionality. Most frequent? Maybe the right prototypes $\notin$ training set: “big cat”, “fat cat”, “big dog”, “fat dog”. Estimate prototypes
  • 65. 3 Substring information Drug Name alcohol ethyl alcohol isopropyl alcohol polyvinyl alcohol isopropyl alcohol swab 62% ethyl alcohol alcohol 68% alcohol denat benzyl alcohol dehydrated alcohol Employee Position Title Police Aide Master Police Officer Mechanic Technician II Police Officer III Senior Architect Senior Engineer Technician Social Worker III G Varoquaux 38
  • 66. 3 Modeling substrings [Cerda and Varoquaux 2020] Model on sub-strings (GaP: Gamma-Poisson factorization). Models strings as a combination of substrings. [Figure: binary matrix of 3-gram occurrences; rows: police officer, pol off, polis, policeman, policier; columns: er_, cer, fic, off, _of, ce_, ice, lic, pol] sklearn.feature_extraction.text: CountVectorizer (analyzer: 'word', 'char', 'char_wb'); HashingVectorizer (fast, stateless); TfidfVectorizer (normalize counts)
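The substring counts behind this slide can be obtained with scikit-learn's `CountVectorizer`. The sketch below uses the five entries from the figure; the choice of character 3-grams via `analyzer="char"` and `ngram_range=(3, 3)` is an assumption made to mirror the figure, not a setting stated in the talk.

```python
from sklearn.feature_extraction.text import CountVectorizer

entries = ["police officer", "pol off", "polis", "policeman", "policier"]

# Character 3-grams; with analyzer="char" spaces are part of the n-grams.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)  # sparse matrix, entries x 3-grams

# vocabulary_ maps each 3-gram to its column index.
vocabulary = vectorizer.vocabulary_
pol_column = counts.toarray()[:, vocabulary["pol"]].tolist()
print(counts.shape, pol_column)
```

Every entry contains the 3-gram "pol", so that column links the five spellings even though no two strings are identical.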
  • 67. 3 Latent category model [Cerda and Varoquaux 2020] Topic model on sub-strings (GaP: Gamma-Poisson factorization). Models strings as a linear combination of substrings. [Figure: the binary matrix of 3-gram occurrences (rows: police officer, pol off, polis, policeman, policier; columns: er_, cer, fic, off, _of, ce_, ice, lic, pol) is factorized into two matrices: what latent categories (topics) are in an entry, and what substrings are in a latent category]
  • 68. 3 String models of latent categories [Cerda and Varoquaux 2020] Encodings that extract latent categories. [Figure: activations of the latent categories (columns, with labels cut off in the image: brary, rator, alist, house, nager, unity, escue, ficer) for the entries (rows): Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
  • 69. 3 String models of latent categories [Cerda and Varoquaux 2020] Inferring plausible feature names. [Figure: same activations, with columns labelled by inferred feature names (cut off in the image: "stant, library"; "ment, operator"; "on, specialist"; "ker, warehouse"; "ogram, manager"; "nic, community"; "escuer, rescue"; "ction, officer") and rows: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
  • 70. 3 Data science with dirty categories [Figure: permutation importances (0.0 to 0.2) of the inferred feature names: "Information, Technology, Technologist"; "Officer, Office, Police"; "Liquor, Clerk, Store"; "School, Health, Room"; "Environmental, Telephone, Capital"; "Lieutenant, Captain, Chief"; "Income, Assistance, Compliance"; "Manager, Management, Property"]
  • 71. Learning does not require clean entities Model continuous similarities across entries. Sub-string models can capture these. Requires a powerful statistical model (gradient-boosted trees). Explainable machine-learning techniques to give insight
  • 72. @GaelVaroquaux Machine learning with dirty data: What models cannot fit, Dirty categories, Missing values. Understanding and formatting data is unavoidable: master these aspects. Powerful machine-learning models can cope with dirtiness - if it is well represented (representing similarities and missingness) - if they have supervision information
  • 73. 4 References I
      P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
      P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018.
      J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
      M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020.
      D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.
  • 74. 4 References II
      M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020.
      D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.