Dirty data science: machine learning on non-curated data

Gaël Varoquaux

Industry challenges to data science
www.kaggle.com/ash316/novice-to-grandmaster

On some dirty-data problems, progress in machine learning can ease the pain

Talk outline
1 What models cannot fit
2 Learning with missing values
3 Machine learning on dirty categories

1 What models cannot fit
Outside of statistics’ comfort zone ($X \in \mathbb{R}^{n \times p}$)

1 The full life-cycle of a data-science project
Framing the domain question
Finding and understanding the data
Assembling and reshaping it
Designing an AI / statistical model (*)
Evaluating model performance
Inspecting the model for unwanted behavior
Bringing the model to stakeholders / production
(*): what we think is cool

1 Understanding the data, between human and machine
Age
60
26
38
139
52
86
17
48
Just numbers

Age
60
26
38
?? 139
52
86
17
48
Numbers with a meaning
A numerical column expresses a quantity, with a corresponding scale...

Age Name
60 Bono
26 Justin Bieber
38 Giselle Knowles-Carter?
139 Pablo Picasso
52 Céline Dion
86 Léonard Cohen
17 Greta Thunberg
48 Justin Trudeau
? Beyonce
Recognized entries shed light on the numbers

Age  Name                    Born in  Activity
60   Bono                    Ireland  Singer
26   Justin Bieber           Canada   Singer
38   Giselle Knowles-Carter  USA      Singer
139  Pablo Picasso           Spain    Painter
52   Céline Dion             Canada   Singer
86   Léonard Cohen           Canada   Singer
17   Greta Thunberg          Sweden   Activist
48   Justin Trudeau          Sweden   Politician
?    Beyonce
They can be used to bring in additional information (features)
And find errors
Knowledge representation, relational algebra

1 Assembling data, of different natures and sources
Age  Name          Position
60   John Doe      Electrician
48   Jane Austen   Senior Professor
52   Jack Daniels  Professor
Position          Salary
Electrician       35 lizards
Professor         13 horses
Senior Professor  1 dragon
To model the link between age and salary, a join is necessary
Databases: to maintain consistency and minimize storage, data are normalized: multiple tables are used to minimize redundancy.
Statistics: needs samples and features: multiple observations of the same kind
⇒ data is denormalized in 1 table
Age  Name         Position          Salary      Coffees/day
60   John Doe     Electrician       35 lizards  2
48   Jane Austen  Senior Professor  1 dragon    128
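The join described on this slide can be sketched with pandas. This is a minimal illustration rather than code from the talk: the DataFrame and column names are assumptions chosen to mirror the toy tables.

```python
import pandas as pd

# Toy tables mirroring the slide; names are illustrative assumptions.
people = pd.DataFrame({
    "Age": [60, 48, 52],
    "Name": ["John Doe", "Jane Austen", "Jack Daniels"],
    "Position": ["Electrician", "Senior Professor", "Professor"],
})
salaries = pd.DataFrame({
    "Position": ["Electrician", "Professor", "Senior Professor"],
    "Salary": ["35 lizards", "13 horses", "1 dragon"],
})

# Left join: keep every person and look up the salary of their position.
denormalized = people.merge(salaries, on="Position", how="left")
print(denormalized)
```

A left join keeps people whose position has no salary entry, which surfaces as a missing value instead of a silently dropped row.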

1 Aggregations – long vs wide tables
Long table: flexible data representation
Person ID  Measure type    Value
12345      Blood Pressure  139
45673      Sugar Level     113
12345      Heart Rate      71
45673      Blood Pressure  84
Wide table: amenable to statistics on Person
Person ID  Blood Pressure  Sugar Level  Heart Rate
12345      139             NA           71
45673      84              113          NA
Long to wide in Pandas: unstack, pivot
Also: count coffees per day per person from coffee-machine logs
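A possible pandas sketch of both operations on this slide: reshaping long to wide with `pivot`, and aggregating a log into counts per person and day. The column names and the coffee log are made up for illustration.

```python
import pandas as pd

long_table = pd.DataFrame({
    "person_id": [12345, 45673, 12345, 45673],
    "measure_type": ["Blood Pressure", "Sugar Level", "Heart Rate", "Blood Pressure"],
    "value": [139, 113, 71, 84],
})

# Long to wide: one row per person, one column per measure type.
# Measures that were never recorded for a person become NaN.
wide_table = long_table.pivot(index="person_id", columns="measure_type", values="value")
print(wide_table)

# Hypothetical coffee-machine log: one row per coffee served.
coffee_log = pd.DataFrame({
    "person_id": [12345, 12345, 45673],
    "timestamp": pd.to_datetime(
        ["2020-01-01 08:00", "2020-01-01 14:30", "2020-01-01 09:15"]
    ),
})
coffee_log["day"] = coffee_log["timestamp"].dt.date

# Aggregation: number of coffees per person and per day.
coffees_per_day = coffee_log.groupby(["person_id", "day"]).size()
print(coffees_per_day)
```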

1 Data wrangling: assembling unfamiliar sources
Relational algebra:
joins
aggregations (# coffees a day)
selections (finding the data)
Challenges:
understanding the data store
and domain logic
errors in the data
(correspondences in names)
Age  Name            Country  Position        Coffees/day
48   Justin Trudeau  Canada   Prime minister  3000
NA   Gaël Varoquaux  NA       NA              NA
In health:
Assembling information across large
electronic health records systems

1 Systematic errors: data require external checks
Measurement biases:
Volunteer bias: more women volunteer in medical studies
Selection bias: healthy people seldom go to the hospital (causal inference)
Survival bias: data loss related to the process under study (survival models)
Partly addressed by machine-learning models for dataset shift (transfer learning) if you know the bias.
Brings us back to understanding the data

Data science is much more than fitting a statistical model
Data require assembling information
Different data sources = different conventions
Measurements come with errors and biases
These challenges require domain knowledge and data wrangling

2 Learning with missing values
[Josse... 2019]
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA          Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I


Why doesn’t the #$@! machine learning toolkit work?!
Machine learning models need entries in a vector space (or at least
a metric space).
$\mathrm{NA} \notin \mathbb{R}$
More than an implementation problem
Categorical entries are discrete anyhow
For missing values in categorical variables, create a special category “missing”.
Rest of talk on NA in numerical variables

2 Classic statistics points of view
Model a) a distribution $f_\theta$ for the complete data $x$
Model b) a random process $g_\phi$ occluding entries (mask $m$)
Missing at random situation (MAR)
For non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019]:
$\mathrm{observed}(x', m_i) = \mathrm{observed}(x_i, m_i) \Rightarrow g_\phi(m_i \mid x') = g_\phi(m_i \mid x_i)$
Theorem [Rubin 1976]: in MAR, maximizing the likelihood for the observed data while ignoring (marginalizing) the unobserved values gives the maximum likelihood of model a).

Missing Completely at random situation (MCAR)
Missingness is independent from data
Missing Not at Random situation (MNAR)
Missingness not ignorable
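To make the two mechanisms concrete, here is a small NumPy simulation. It is only a sketch under assumptions not taken from the talk: the data are standard Gaussian, the MCAR rate is 30%, and the MNAR mechanism is self-censoring above an arbitrary threshold of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))

# MCAR: entries of the first column are masked with probability 0.3,
# independently of any value in the data.
mcar_mask = rng.random(1000) < 0.3

# MNAR (self-censoring): the first column is masked when its own,
# unobserved, value is large. The threshold is an arbitrary choice.
mnar_mask = x[:, 0] > 0.5

x_mcar = x.copy()
x_mcar[mcar_mask, 0] = np.nan
x_mnar = x.copy()
x_mnar[mnar_mask, 0] = np.nan

# Under MCAR the observed mean stays close to the true mean (0);
# under MNAR it is biased because the large values are hidden.
mean_mcar = np.nanmean(x_mcar[:, 0])
mean_mnar = np.nanmean(x_mnar[:, 0])
print(mean_mcar, mean_mnar)
```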


[Figure: three panels titled Complete, MCAR and MNAR]
But
There isn’t always an unobserved value
Age of spouse of singles?
Machine-learning’s goal is not to maximize likelihoods

2 Imputation
Fill in information
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA → 2000   Social Worker IV
M       07/16/2007  Police Officer III
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA → 2012   Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA → 2014   Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?

2 Imputation and prediction with test-time missing values
Setting: $y = f(x) + \varepsilon$
Theorem [Josse... 2019]
$f$: trained predictor achieving Bayes risk on full data
Conditional multiple imputation achieves Bayes risk on test set with missing data (in MAR settings):
$f^\star_{\text{mult imput}}(\tilde{x}) = \mathbb{E}_{X_m \mid X_o = x_o}[f(X_m, x_o)]$
Notations: $\tilde{x} \in (\mathbb{R} \cup \{\mathrm{NA}\})^p$: data at hand
$x_o$: observed values
$x_m$: unobserved values

2 Imputation procedures that work out of sample
Mean imputation: a special case of univariate imputation
Replace NA by the mean of the feature
sklearn.impute.SimpleImputer
Conditional imputation
Modeling one feature as a function of others
Possible implementation: iteratively predict one feature as a function of the others
Classic implementations in R: MICE, missForest
sklearn.impute.IterativeImputer
Bad computational scalability
Classic statistics point of view
Mean imputation is disastrous, because it distorts the distribution
“Congeniality” conditions: good imputation must preserve data properties used by later analysis steps
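A minimal sketch of out-of-sample imputation with the two scikit-learn imputers named above: both are fitted on the train set only and then applied to the test set. The toy arrays are made up for illustration.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn and needs this import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X_train = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [np.nan, 10.0],
])
X_test = np.array([[np.nan, 8.0], [2.5, np.nan]])

# Mean imputation: the means are learned on the train set only,
# then reused to fill in the test set.
mean_imputer = SimpleImputer(strategy="mean").fit(X_train)
X_test_mean = mean_imputer.transform(X_test)

# Conditional imputation: each feature is predicted from the others.
iterative_imputer = IterativeImputer(random_state=0).fit(X_train)
X_test_iter = iterative_imputer.transform(X_test)

print(X_test_mean)
print(X_test_iter)
```

Fitting on the train set and calling `transform` on the test set is what makes the procedure usable at prediction time.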


2 Constant imputation for supervised learning
For a powerful learner (universally consistent), imputing both train and test with the mean of train is consistent, i.e. it converges to the best possible prediction
Intuition: the learner “recognizes” imputed entries and compensates at test time
Constant imputation breaks simple models (e.g. linear models) [Morvan... 2020]

2 Imputation for supervised learning
Simulation: MCAR + Gradient boosting
[Figure: r2 score versus sample size (10^2 to 10^4) for Mean and Iterative imputation; panels: Convergence, Small sample size]
Notebook: github – @nprost / supervised missing
Conclusions: IterativeImputer is useful for small sample sizes


2 Imputation is not enough: predictive missingness
Pathological case [Josse... 2019]
y depends only on whether data is missing or not
e.g. tax fraud detection
theory: MNAR = “Missing Not At Random”
Imputing makes prediction impossible
Solution
Add a missingness indicator: extra feature to predict
...SimpleImputer(add_indicator=True)
...IterativeImputer(add_indicator=True)
Simulation: y depends indirectly on missingness (censoring)
[Figure: r2 score versus sample size (10^2 to 10^4) for Mean, Mean + indicator, Iterative and Iterative + indicator; panels: Convergence, Small sample size]
Adding a mask is crucial
Iterative imputation can be detrimental
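What the missingness indicator adds can be seen on a tiny array. This sketch uses `SimpleImputer` with `add_indicator=True`; the input values are made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 6.0],
])

# add_indicator=True appends one binary column per feature that had
# missing values during fit, flagging which entries were imputed.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
```

The first two output columns hold the mean-imputed features and the last two hold the mask, so a downstream learner can still tell imputed entries from observed ones.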

2 Tree models with missing values
MIA (Missing Incorporated Attribute) [Josse... 2019]
[Figure: decision tree with splits x10 < -1.5?, x2 < 2?, x7 < 0.3? and x1 < 0.5?; at each split, missing values are sent down one of the branches (Yes/Missing or No/Missing); one leaf predicts +1.3]
sklearn.ensemble.HistGradientBoostingClassifier
The learner readily handles missing values

2 Tree models with missing values (MCAR)
[Figure: r2 score versus sample size (10^2 to 10^4) for Inside trees, Mean and Iterative; panels: Convergence, Small sample size]

2 Tree models with missing values (censored)
[Figure: r2 score versus sample size (10^2 to 10^4) for Inside trees, Mean, Iterative, Mean + indicator and Iterative + indicator; panels: Convergence, Small sample size]

2 Neural networks with missing values
Gradient-based optimization of continuous models
Difficulty: half-discrete input space ($\mathrm{NA} \cup \mathbb{R}$)
$Y = \beta^\star_1 X_1 + \beta^\star_2 X_2 + \beta^\star_0$, with $\mathrm{cor}(X_1, X_2) = 0.5$.
If $X_2$ is missing, the coefficient of $X_1$ should compensate for the missingness of $X_2$.
Up to $2^d$ sets of slopes
[Figure: effect of $X_2$ lost versus effect of $X_2$ accounted for by $X_1$]
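As a worked illustration, assume (this is not stated on the slide) that $X_1$ and $X_2$ are jointly Gaussian with zero mean and unit variance, and that $X_2$ is missing completely at random. The best prediction from $X_1$ alone then folds part of the effect of $X_2$ into the slope of $X_1$:

```latex
\begin{align*}
\mathbb{E}[Y \mid X_1]
  &= \beta^\star_0 + \beta^\star_1 X_1 + \beta^\star_2 \, \mathbb{E}[X_2 \mid X_1] \\
  &= \beta^\star_0 + \beta^\star_1 X_1 + \beta^\star_2 \, \mathrm{cor}(X_1, X_2) \, X_1 \\
  &= \beta^\star_0 + \left( \beta^\star_1 + 0.5 \, \beta^\star_2 \right) X_1 .
\end{align*}
```

Each missingness pattern calls for its own set of slopes, hence up to $2^d$ of them for $d$ features.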

2 NeuMiss network: adapted neural architecture [Le Morvan... 2020]
Neural networks that approximate optimal predictors (functions of $\Sigma^{-1}$).
Tailored architecture which learns all slopes jointly


[Figure: R2 score minus Bayes rate versus number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep, MLP Wide and NeuMiss; network depth from 1 to 9, width from 1 d to 50 d]
NeuMiss needs less data
Also suitable for MNAR settings

Learning with missing values
Imputation is motivated only in MAR settings
Rather than a sophisticated imputation, use a powerful supervised learner
sklearn’s HistGradientBoostingClassifier readily models missing values
Can work in MNAR settings
A different regime than standard statistics

3 Machine learning on dirty categories
[Cerda... 2018, Cerda and Varoquaux 2020]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I

3 Categorical entries in a statistical model
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
One-hot encoding: one indicator column per category (Master Police Officer, Social Worker IV, Police Officer II, ...), $X \in \mathbb{R}^{n \times p}$

3 Non-normalized categorical entries in a statistical model
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Non-normalized entries break OneHotEncoder:
Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in test set
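A short sketch of one-hot encoding and of the "new categories in test set" problem, using scikit-learn's `OneHotEncoder`. The column name `title` and the sample entries are illustrative; `handle_unknown="ignore"` is one possible workaround, not a recommendation from the talk.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"title": [
    "Master Police Officer",
    "Social Worker IV",
    "Police Officer III",
    "Bus Operator",
    "Bus Operator",
]})
test = pd.DataFrame({"title": ["Police Officer II"]})  # unseen at train time

# With the default handle_unknown="error", transforming an unseen
# category raises an error; "ignore" encodes it as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train).toarray()
X_test = encoder.transform(test).toarray()

print(encoder.categories_[0])
print(X_train)
print(X_test)
```

The all-zero row shows what is lost: "Police Officer II" shares nothing with "Police Officer III" in this representation, even though the strings overlap.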

3 Forgotten baseline: TargetEncoder [Micci-Barreca 2001]
High-cardinality categories
Represent each category by the average target y
Police Officer II → average salary of Police Officer II
[Figure: employee salary y (40 000 to 140 000) for the categories Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
Embedding close-by categories with the same y can help building a simple decision function.

DirtyCat: dirty category software:
http://dirty-cat.github.io
from dirty_cat import TargetEncoder
target_encoder = TargetEncoder()
# df: categorical columns; y: prediction target used to compute the per-category averages
transformed_values = target_encoder.fit_transform(df, y)
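The idea of target encoding can also be written by hand with pandas. This is a simplified sketch, not the DirtyCat implementation: it uses the plain per-category average without the smoothing discussed in [Micci-Barreca 2001], and the fallback to the global mean for unseen categories is an assumption. The data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Police Officer II", "Police Officer II", "Manager I", "Library Aide"],
    "salary": [60000.0, 70000.0, 110000.0, 40000.0],
})

# Average target per category, learned on the training data.
category_means = df.groupby("title")["salary"].mean()
global_mean = df["salary"].mean()


def target_encode(titles):
    """Map each category to its average target; unseen ones get the global mean."""
    return titles.map(category_means).fillna(global_mean)


encoded = target_encode(df["title"])
encoded_new = target_encode(pd.Series(["Architect III"]))
print(encoded.tolist(), encoded_new.tolist())
```

In practice the averages should be computed on training folds only, since they use the target.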

3 Data curation
Database normalization
Feature engineering
Social Worker III
Police Officer II
Social Worker II
Police Officer III
⇒
Position        Rank
Police Officer  Master
Social Worker   III
Police Officer  II
Social Worker   II
Police Officer  III
Merging entities
Deduplication & record linkage
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
Potentially suboptimal
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Hard to make automatic and turn-key
Harder than supervised learning

Our goal: supervised learning on dirty categories
The statistical question should
inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea

3 Adding similarities to one-hot encoding
One-hot encoding, $X \in \mathbb{R}^{n \times p}$
         London  Londres  Paris
Londres  0       1        0
London   1       0        0
Paris    0       0        1
New categories? Link categories?
Similarity encoding [Cerda... 2018]
         London  Londres  Paris
Londres  0.3     1.0      0.0
London   1.0     0.3      0.0
Paris    0.0     0.0      1.0
0.3: string distance(Londres, London)

3 Some string similarities
Levenshtein
Number of edits on one string needed to match the other
Jaro-Winkler
$d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$
$m$: number of matching characters
$t$: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters, e.g. in “London...”: 3-gram 1 = “Lon”, 3-gram 2 = “ond”, 3-gram 3 = “ndo”
similarity = (#n-grams in common) / (#n-grams in total)
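Two of these similarities are short enough to sketch in plain Python. The function names are made up for illustration, and the n-gram similarity below treats n-grams as sets (common over union), which is one reading of the ratio on the slide and not necessarily the exact variant used in DirtyCat.

```python
def char_ngrams(string, n=3):
    """Set of all groups of n consecutive characters in the string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Number of n-grams in common divided by the total number of distinct n-grams."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


print(ngram_similarity("London", "Londres"))
print(levenshtein("London", "Londres"))
```

On "London" and "Londres" the 3-gram similarity is 2/7, about 0.3, in line with the similarity-encoding table shown earlier.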

3 Python implementation: DirtyCat
DirtyCat: dirty category software:
http://dirty-cat.github.io
from dirty_cat import SimilarityEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)

3 Dirty categories blow up dimension
New words in
natural language
$X \in \mathbb{R}^{n \times p}$, p is large
Statistical problems
Computational problems


3 Tackling the high cardinality
Similarity encoding, one-hot encoding
= Prototype methods
How to choose a small number of prototypes?
All training-set ⇒ huge dimensionality
Most frequent?
Maybe the right prototypes $\notin$ training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes

3 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Police Aide
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III

3 Modeling substrings [Cerda and Varoquaux 2020]
Model on sub-strings
(GaP: Gamma-Poisson factorization)
e.g. in “London...”: 3-gram 1 = “Lon”, 3-gram 2 = “ond”, 3-gram 3 = “ndo”
Models strings as a combination of substrings
[Figure: binary matrix indicating which 3-grams (pol, lic, ice, ce_, _of, off, fic, cer, er_) occur in each entry: police, officer, pol off, polis, policeman, policier]
sklearn.feature_extraction.text
CountVectorizer, analyzer: 'word', 'char', 'char_wb'
HashingVectorizer: fast, stateless
TfidfVectorizer: normalize counts

3 Latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
Models strings as a linear combination of substrings
[Figure: the matrix of 3-gram counts per entry (police, officer, pol off, polis, policeman, policier) is factorized into two matrices: what latent categories (topics) are in an entry, and what substrings are in a latent category]
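The shape of such a factorization can be sketched with scikit-learn's `NMF` as a stand-in. This is not the Gamma-Poisson model of [Cerda and Varoquaux 2020], only an analogous non-negative factorization of the 3-gram counts; the number of components (2) is an arbitrary choice for this toy example.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = ["police", "officer", "pol off", "polis", "policeman", "policier"]

# Entries x 3-grams count matrix.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# Stand-in factorization: counts ~ activations @ substrings.
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=1000)
activations = nmf.fit_transform(counts)  # what latent categories are in an entry
substrings = nmf.components_             # what substrings are in a latent category

print(activations.round(2))
```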

3 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: activations of the latent categories (columns) for example entries (rows): Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Financial Programs Manager, Capital Projects Manager, Police Sergeant]

3 String models of latent categories [Cerda and Varoquaux 2020]
Inferring plausible feature names
[Figure: the same entries (rows) against latent categories (columns), each column labeled with feature names inferred from its substrings, such as library, operator, specialist, warehouse, manager, community, rescue, officer]

3 Data science with dirty categories
Permutation importances (0.0 to 0.2) of the inferred feature names:
Information, Technology, Technologist
Officer, Office, Police
Liquor, Clerk, Store
School, Health, Room
Environmental, Telephone, Capital
Lieutenant, Captain, Chief
Income, Assistance, Compliance
Manager, Management, Property

Learning does not require clean entities
Model continuous similarities across entries
Sub-string models can capture these
Requires a powerful statistical model (Gradient-boosted trees)
Explainable machine-learning techniques to give insight

@GaelVaroquaux
Machine learning with dirty data
What models cannot fit
Dirty categories
Missing values
Understanding and formatting data is unavoidable
Master these aspects
Powerful machine-learning models can cope with dirtiness
- If it is well represented (representing similarities and missingness)
- If they have supervision information

4 References
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020.
D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.
M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
