Machine learning on non curated data
Dirty data made easy (in Python)
Gaël Varoquaux
With scikit-learn, machine learning is easy and fun
The problem is getting the data into the learner
www.kaggle.com/ash316/novice-to-grandmaster
Machine learning
Let $X \in \mathbb{R}^{n \times p}$
or a numpy array
Real life: often a pandas DataFrame
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA          Social Worker IV
M       07/16/2007  Police Officer III
F       02/05/2007  Police Aide
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA          Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA          Library Assistant I
sklearn.compose.ColumnTransformer
Apply different preprocessing per column
Dirty Categories
Missing values
Talk outline
1 Column transforming
2 Encoding dirty categories
3 Learning with missing values
Python + scikit-learn
data mining research
statistics research
1 Column transforming
Pandas in, numpy out
(preprocessing)
1 Dataframes to numbers
df = pd.read_csv('employee_salary.csv')
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       06/26/2006  Social Worker III
M       07/16/2007  Police Officer III
F       01/26/2000  Library Assistant I
Convert all values to numerical
Gender: One-hot encode
one_hot_enc = sklearn.preprocessing.OneHotEncoder()
one_hot_enc.fit_transform(df[['Gender']])
Gender (M)  Gender (F)  ...
1           0
0           1
1           0
0           1
Date: use pandas’ datetime support
dates = pd.to_datetime(df['Date First Hired'])
# the values hold the dates as integers (nanoseconds since the epoch for datetime64[ns])
dates.values.astype(float)
1 Transformers: fit & transform
Separating fitting from transforming
Avoids data leakage
Can be used in a Pipeline and with cross_val_score
One-hot encoder
one_hot_enc.fit(df[['Gender']])
X = one_hot_enc.transform(df[['Gender']])
1) store which categories are present
2) encode the data accordingly
Better than pd.get_dummies because the columns are defined from the train set, and do not change with the test set
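The difference between the two approaches can be sketched on a toy split. The data below is hypothetical, and `handle_unknown="ignore"` is one possible choice for categories that only show up in the test set, not something the slides prescribe.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy train/test split; "X" only appears in the test set
train = pd.DataFrame({"Gender": ["M", "F", "M", "F"]})
test = pd.DataFrame({"Gender": ["F", "X"]})

# fit: store which categories are present in the train set
one_hot_enc = OneHotEncoder(handle_unknown="ignore")
one_hot_enc.fit(train[["Gender"]])

# transform: encode any data with the columns learned at fit time
X_train = one_hot_enc.transform(train[["Gender"]]).toarray()
X_test = one_hot_enc.transform(test[["Gender"]]).toarray()

# pd.get_dummies derives its columns from whatever data it is given,
# so train and test can end up with different columns
dummies_train = pd.get_dummies(train["Gender"])
dummies_test = pd.get_dummies(test["Gender"])
```

With the encoder, train and test share the same two columns and the unseen "X" becomes a row of zeros; with `pd.get_dummies` the test frame gets its own column set.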
For dates: FunctionTransformer
def date2num(date_str):
    out = pd.to_datetime(date_str).values.astype(float)
    return out.reshape((-1, 1))  # 2D output
date_trans = preprocessing.FunctionTransformer(
    func=date2num, validate=False)
X = date_trans.transform(df['Date First Hired'])
1 ColumnTransformer: assembling
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
From DataFrame to array with heterogeneous preprocessing & feature engineering
Benefit: model selection on dataframe
model = make_pipeline(column_trans,
                      HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
2 Encoding dirty categories
PhD work of Patricio Cerda [Cerda... 2018]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
2 The problem of dirty categories
Dirty categories break OneHotEncoder:
Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in test set
2 Data curation: database normalization
Feature engineering
Employee Position Title      Position        Rank
Master Police Officer    ⇒   Police Officer  Master
Social Worker III        ⇒   Social Worker   III
Police Officer II        ⇒   Police Officer  II
Social Worker II         ⇒   Social Worker   II
Police Officer III       ⇒   Police Officer  III
Merging entities: deduplication & record linkage
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
Potentially suboptimal: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Hard to make automatic and turn-key
Harder than supervised learning
Our goal: supervised learning on dirty categories
The statistical question should inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
2 Adding similarities to one-hot encoding
One-hot encoding
          London  Londres  Paris
Londres   0       1        0
London    1       0        0
Paris     0       0        1
$X \in \mathbb{R}^{n \times p}$
new categories?
link categories?
Similarity encoding [Cerda... 2018]
          London  Londres  Paris
Londres   0.3     1.0      0.0
London    1.0     0.3      0.0
Paris     0.0     0.0      1.0
Each entry is a string similarity, e.g. 0.3 for (Londres, London)
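The similarity-encoding table can be reproduced with a few lines of plain Python. This is a minimal sketch, assuming a 3-gram similarity defined as the number of shared 3-grams over the total number of distinct 3-grams, with no padding of the strings; the helper names are hypothetical and this is not the DirtyCat implementation.

```python
import numpy as np


def ngrams(string, n=3):
    """Set of groups of n consecutive characters in a string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_similarity(s1, s2, n=3):
    """Number of n-grams in common divided by the number of n-grams in total."""
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    if not g1 and not g2:
        # Both strings are shorter than n: fall back to exact matching
        return float(s1 == s2)
    return len(g1 & g2) / len(g1 | g2)


def similarity_encode(values, prototypes, n=3):
    """One row per value, one column per prototype category."""
    return np.array([[ngram_similarity(v, p, n) for p in prototypes]
                     for v in values])


categories = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris"], categories)
```

"London" and "Londres" share the 3-grams "Lon" and "ond" out of 7 distinct ones, which gives 2/7, the 0.3 of the table once rounded. A category that was never seen, such as a misspelled "Londn", still gets a non-zero similarity to "London" instead of an all-zero row.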
2 Some string similarities
Levenshtein
Number of edits on one string to match the other
Jaro-Winkler
$d_{jaro}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters
Example with “London”: 3-gram 1 = Lon, 3-gram 2 = ond, 3-gram 3 = ndo, ...
similarity = (# n-grams in common) / (# n-grams in total)
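The Levenshtein definition above can be sketched with the classic dynamic-programming recurrence. The normalization into a similarity between 0 and 1 is one common choice and an assumption here, not necessarily the one used in the experiments.

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn s1 into s2."""
    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1, s2):
    """Assumed normalization: 1 for identical strings, 0 for fully different."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1 - levenshtein(s1, s2) / longest
```

For instance, "London" becomes "Londres" with two substitutions and one insertion, so the distance is 3.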
2 Python implementation: DirtyCat
DirtyCat: dirty category software:
http://dirty-cat.github.io
from dirty_cat import SimilarityEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Represent each category by the average target y
For example Police Officer III
→ average salary of Police Officer III
[Figure: employee salary y (x-axis, 40000 to 140000) for the categories Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
Embedding nearby categories with the same y can help build a simple decision function.
from dirty_cat import TargetEncoder
target_encoder = TargetEncoder()
# the encoder is supervised: it needs the target y to compute the averages
transformed_values = target_encoder.fit_transform(df, y)
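The idea behind target encoding can be sketched with pandas alone. This is a simplified illustration on made-up salaries: it uses plain per-category means, without the shrinkage toward the global mean that [Micci-Barreca 2001] describes for rare categories, and falls back to the global mean for categories unseen at train time (an assumption of this sketch).

```python
import pandas as pd

# Hypothetical toy data
train = pd.DataFrame({
    "title": ["Police Officer III", "Police Officer III",
              "Bus Operator", "Bus Operator", "Manager II"],
    "salary": [70000.0, 74000.0, 40000.0, 44000.0, 120000.0],
})
test = pd.DataFrame({"title": ["Bus Operator", "Architect III"]})

# Learn the average target per category on the train set only
category_means = train.groupby("title")["salary"].mean()
global_mean = train["salary"].mean()

# Encode: each category is replaced by its average target
train_encoded = train["title"].map(category_means)
# "Architect III" was never seen: use the global mean instead
test_encoded = test["title"].map(category_means).fillna(global_mean)
```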
2 Experimental results: prediction performance
Average rank on 7 datasets
                       Linear model  Gradient-boosted trees
One-hot encoding       4.7           6.0
Target encoding        5.3           4.3
Similarity encoding
  Jaro-Winkler         3.4           3.6
  Levenshtein          3.1           3.0
  3-gram               1.1           1.9
Best: similarity encoding with 3-gram similarity [Cerda... 2018]
Also, gradient-boosted trees work much better
2 Dirty categories blow up dimension
Wow, lots of datasets!
New words in
natural language
$X \in \mathbb{R}^{n \times p}$, p is large
Statistical problems
Computational problems
2 Tackling the high cardinality
Similarity encoding, one-hot encoding
= Prototype methods
How to choose a small number
of prototypes?
All training-set ⇒ huge dimensionality
Most frequent?
Maybe the right prototypes are not in the training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
2 n-grams grow, but there is redundancy
Natural
language
2 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
2 Latent category model
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
Example with “London”: 3-gram 1 = Lon, 3-gram 2 = ond, 3-gram 3 = ndo, ...
Models strings as a linear combination of substrings
[Figure: binary matrix of substring occurrences. Rows (strings): police, officer, pol off, polis, policeman, policier. Columns (3-grams): pol, lic, ice, ce_, _of, off, fic, cer, er_]
[Figure: the matrix of substring counts is factorized into two matrices, one indexed by documents and topics, the other by topics and 3-grams]
What substrings are in a latent category
What latent categories are in an entry
2 String models of latent categories
Encodings that extract latent categories
[Figure: activations of the latent categories (column labels partly cut off in the extraction: library, operator, …ecialist, …arehouse, manager, …ommunity, rescue, officer) for the following categories]
Legislative Analyst II
Legislative Attorney
Equipment Operator I
Transit Coordinator
Bus Operator
Senior Architect
Senior Engineer Technician
Financial Programs Manager
Capital Projects Manager
Mechanic Technician II
Master Police Officer
Police Sergeant
2 String models of latent categories
Inferring plausible feature names
Inferred feature names (some are cut off on the left in the figure):
…untant, assistant, library
…nator, equipment, operator
administration, specialist
…t, craftsworker, warehouse
crossing, program, manager
…ician, mechanic, community
…refighter, rescuer, rescue
…ional, correction, officer
[Figure: activations of these inferred features for the same categories, from Legislative Analyst II to Police Sergeant]
2 Data science with dirty categories
[Figure: permutation importances (x-axis from 0.0 to 0.2) of the inferred feature names:]
Information, Technology, Technologist
Officer, Office, Police
Liquor, Clerk, Store
School, Health, Room
Environmental, Telephone, Capital
Lieutenant, Captain, Chief
Income, Assistance, Compliance
Manager, Management, Property
3 Learning with missing values
[Josse... 2019]
Why doesn’t the #$@! machine learning
toolkit work?!
Machine learning models need entries in a vector
space (or at least a metric space).
$\mathrm{NA} \notin \mathbb{R}$
More than an implementation problem
Categorical variables are discrete anyhow
For missing values in categorical variables,
create a special category “missing”.
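A minimal sketch of this recipe on a hypothetical column: the missing entries are replaced by the string "missing", which the one-hot encoder then treats as a category like any other.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column with one missing entry
df = pd.DataFrame({"Gender": ["M", np.nan, "F", "M"]})

# Create the special category "missing"
filled = df[["Gender"]].fillna("missing")

# "missing" now gets its own column in the encoding
one_hot_enc = OneHotEncoder()
X = one_hot_enc.fit_transform(filled).toarray()
categories = list(one_hot_enc.categories_[0])
```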
Rest of talk on NA in numerical variables
3 Classic statistics points of view
Model a) a complete data-generating process
Model b) a random process occluding entries
Missing at random situation (MAR)
for non-observed values, the probability of missingness does not depend on this non-observed value.
Proper definition in [Josse... 2019]
Theorem [Rubin 1976]: in MAR, maximizing likelihood for observed data while ignoring (marginalizing) the unobserved values gives maximum likelihood of model a).
Missing Completely at Random situation (MCAR)
Missingness is independent from data
Missing Not at Random situation (MNAR)
Missingness not ignorable
[Figure: three panels showing the same data, titled Complete, MCAR, and MNAR]
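The difference between MCAR and MNAR can be simulated in a few lines. This is a one-dimensional sketch with made-up parameters (30% of entries hidden for MCAR, values above 0.5 censored for MNAR), not the simulation behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)  # complete data, true mean 0

# MCAR: each entry is hidden with the same probability,
# independently of the data
mcar_mask = rng.random(n) < 0.3
x_mcar = np.where(mcar_mask, np.nan, x)

# MNAR: missingness depends on the unobserved value itself
# (here, large values are censored)
mnar_mask = x > 0.5
x_mnar = np.where(mnar_mask, np.nan, x)

# Mean of the observed values in each situation
mean_mcar = np.nanmean(x_mcar)
mean_mnar = np.nanmean(x_mnar)
```

Under MCAR the mean of the observed values stays close to the true mean; under MNAR it is biased downward, which is why the missingness cannot be ignored.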
But
There isn’t always an unobserved value
Age of spouse of singles?
We are not trying to maximize likelihoods
The #$@! machine learning toolkit still
doesn’t work?!
3 Imputation
Fill in information
Gender  Date Hired  Employee Position Title
M       09/12/1988  Master Police Officer
F       NA → 2000   Social Worker IV
M       07/16/2007  Police Officer III
M       01/13/2014  Electrician I
M       04/28/2002  Bus Operator
M       NA → 2012   Bus Operator
F       06/26/2006  Social Worker III
F       01/26/2000  Library Assistant I
M       NA → 2014   Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?
3 Imputation procedures that work out of sample
Mean imputation: special case of univariate imputation
Replace NA by the mean of the feature
sklearn.impute.SimpleImputer
Conditional imputation
Modeling one feature as a function of others
Possible implementation:
iteratively predict one feature as a function of the others
Classic implementations in R: MICE, missForest
sklearn.impute.IterativeImputer
new in 0.21!!
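Both imputers follow the fit / transform split, which is what makes them usable out of sample. A minimal sketch on synthetic data (all values made up); note that `IterativeImputer` is still flagged experimental in scikit-learn and needs the explicit `enable_iterative_imputer` import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # feature 2 depends on feature 0
X[rng.random(X.shape) < 0.2] = np.nan           # hide 20% of the entries
X_train, X_test = X[:150], X[150:]

# Mean imputation: the means are learned on the train set only
mean_imputer = SimpleImputer(strategy="mean").fit(X_train)
X_test_mean = mean_imputer.transform(X_test)

# Conditional imputation: each feature is modeled from the others
iter_imputer = IterativeImputer(random_state=0).fit(X_train)
X_test_iter = iter_imputer.transform(X_test)
```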
Classic statistics point of view
Mean imputation is disastrous, because it distorts the distribution
[Figure: plot of mean-imputed data]
“Congeniality” conditions: good imputation must preserve data properties used by later analysis steps
3 Imputation for supervised learning
Theorem [Josse... 2019]
For a powerful learner (universally consistent)
imputing both train and test with the mean of
train is consistent
i.e. it converges to the best possible prediction
Intuition
The learner “recognizes” imputed entries and
compensates at test time
Simulation: MCAR + Gradient boosting
[Figure, panel “Convergence”: r2 score (0.65 to 0.80) against sample size (10^2 to 10^4) for Mean and Iterative imputation]
[Figure, panel “Small sample size”: r2 score (0.725 to 0.775) for Iterative and Mean imputation]
Notebook: github: @nprost / supervised_missing
Conclusions:
IterativeImputer is useful for small sample sizes
3 Imputation is not enough
Pathological case [Josse... 2019]
y depends only on whether data is missing or not
e.g. tax fraud detection
theory: MNAR = “Missing Not At Random”
Imputing makes prediction impossible
Solution
Add a missingness indicator: extra feature to predict
...SimpleImputer(add_indicator=True)
...IterativeImputer(add_indicator=True)
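The pathological case can be reproduced with a toy simulation. The setup is an assumption of this sketch: a single synthetic feature, a target equal to the missingness mask, and a linear model chosen for simplicity rather than the learners used in the talk.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 1))
missing = rng.random(n) < 0.3
x[missing, 0] = np.nan
y = missing.astype(float)  # y depends only on whether the value is missing

X_train, X_test = x[:1500], x[1500:]
y_train, y_test = y[:1500], y[1500:]

# Mean imputation alone hides the information that predicts y
without = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
r2_without = without.fit(X_train, y_train).score(X_test, y_test)

# add_indicator=True appends a binary "was missing" column
with_indicator = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True), LinearRegression())
r2_with = with_indicator.fit(X_train, y_train).score(X_test, y_test)
```

In this setup the model with the indicator recovers y almost exactly, while the model fed only the imputed values has an r2 score close to zero.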
Simulation: y depends indirectly on missingness (censoring in the data)
[Figure, panel “Convergence”: r2 score (0.75 to 0.95) against sample size (10^2 to 10^4) for Mean, Mean + indicator, Iterative, Iterative + indicator]
[Figure, panel “Small sample size”: r2 score (0.8 to 0.9) for Iterative + indicator, Iterative, Mean + indicator, Mean]
Notebook: github: @nprost / supervised_missing
Adding a mask is crucial
Iterative imputation can be detrimental
@GaelVaroquaux
Learning on dirty data
Prepare data via ColumnTransformer
Use HistGradientBoosting
Dirty categories
Statistical modeling of non-curated categorical data
Give us your dirty data
Similarity encoding
robust solution that enables statistical models
Dirty category software:
http://dirty-cat.github.io
Supervised learning with missing data
Mean imputation + missing indicator
Much more results in [Josse... 2019]
http://project.inria.fr/dirtydata
Ongoing research
Acknowledgements
Dirty categories
Patricio Cerda and Balazs Kegl
Missing data
Julie Josse, Erwan Scornet, Nicolas Prost
Implementation in scikit-learn
thanks to scikit-learn consortium partners
4 References I
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for
learning with dirty categorical variables. 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the
consistency of supervised learning with missing values. arXiv
preprint arXiv:1902.06931, 2019.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction
problems. ACM SIGKDD Explorations Newsletter, 3(1):
27–32, 2001.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):
581–592, 1976.

More Related Content

More from Gael Varoquaux

Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
Gael Varoquaux
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
Gael Varoquaux
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
Gael Varoquaux
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
Gael Varoquaux
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Gael Varoquaux
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
Gael Varoquaux
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
Gael Varoquaux
 
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and LimitationsEstimating Functional Connectomes: Sparsity’s Strength and Limitations
Estimating Functional Connectomes: Sparsity’s Strength and Limitations
Gael Varoquaux
 
On the code of data science
On the code of data scienceOn the code of data science
On the code of data science
Gael Varoquaux
 
Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
Gael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questions
Gael Varoquaux
 
Social-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsitySocial-sparsity brain decoders: faster spatial sparsity
Social-sparsity brain decoders: faster spatial sparsity
Gael Varoquaux
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016
Gael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
Gael Varoquaux
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizations
Gael Varoquaux
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
Gael Varoquaux
 
Simple big data, in Python
Simple big data, in PythonSimple big data, in Python
Simple big data, in Python
Gael Varoquaux
 
Succeeding in academia despite doing good_software
Succeeding in academia despite doing good_softwareSucceeding in academia despite doing good_software
Succeeding in academia despite doing good_software
Gael Varoquaux
 
Building a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budgetBuilding a cutting-edge data processing environment on a budget
Building a cutting-edge data processing environment on a budget
Gael Varoquaux
 

More from Gael Varoquaux (20)

Similarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variablesSimilarity encoding for learning on dirty categorical variables
Similarity encoding for learning on dirty categorical variables
 
Machine learning for functional connectomes
Machine learning for functional connectomesMachine learning for functional connectomes
Machine learning for functional connectomes
 
Towards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imagingTowards psychoinformatics with machine learning and brain imaging
Towards psychoinformatics with machine learning and brain imaging
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
A tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imagingA tutorial on Machine Learning, with illustrations for MR imaging
A tutorial on Machine Learning, with illustrations for MR imaging
 
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imagingScikit-learn and nilearn: Democratisation of machine learning for brain imaging
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging
 
Computational practices for reproducible science
Computational practices for reproducible scienceComputational practices for reproducible science
Computational practices for reproducible science
 
Coding for science and innovation
Coding for science and innovationCoding for science and innovation
Coding for science and innovation
 


Machine learning on non curated data

  • 1. Machine learning on non-curated data: dirty data made easy (in Python). Gaël Varoquaux
  • 3. With scikit-learn, machine learning is easy and fun. The problem is getting the data into the learner.
  • 4. Reference: www.kaggle.com/ash316/novice-to-grandmaster
  • 5. Machine learning: let $X \in \mathbb{R}^{n \times p}$, or a numpy array.
  • 6. Real life: the data often comes as a pandas dataframe.
      Gender  Date Hired  Employee Position Title
      M       09/12/1988  Master Police Officer
      F       NA          Social Worker IV
      M       07/16/2007  Police Officer III
      F       02/05/2007  Police Aide
      M       01/13/2014  Electrician I
      M       04/28/2002  Bus Operator
      M       NA          Bus Operator
      F       06/26/2006  Social Worker III
      F       01/26/2000  Library Assistant I
      M       NA          Library Assistant I
  • 7. Same dataframe: sklearn.compose.ColumnTransformer applies different preprocessing per column.
  • 8. Same dataframe: dirty categories.
  • 9. Same dataframe: missing values.
  • 10. Talk outline
      1 Column transforming (Python + scikit-learn)
      2 Encoding dirty categories (data mining research)
      3 Learning with missing values (statistics research)
  • 11. 1 Column transforming. Pandas in, numpy out (preprocessing).
  • 12. 1 Dataframes to numbers
          df = pd.read_csv('employee_salary.csv')
      Gender  Date Hired  Employee Position Title
      M       09/12/1988  Master Police Officer
      F       06/26/2006  Social Worker III
      M       07/16/2007  Police Officer III
      F       01/26/2000  Library Assistant I
      Convert all values to numerical.
  • 13. Gender: one-hot encode
          one_hot_enc = sklearn.preprocessing.OneHotEncoder()
          one_hot_enc.fit_transform(df[['Gender']])
      Gender (M)  Gender (F)  ...
      1           0
      0           1
      1           0
      0           1
  • 14. Date: use pandas' datetime support
          dates = pd.to_datetime(df['Date First Hired'])
          # the values hold the time elapsed since the epoch
          # (in nanoseconds for datetime64[ns], not in seconds)
          dates.values.astype(float)
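The two conversions above can be tried end to end on a toy table. This is a minimal sketch, assuming the four rows shown on the slide stand in for `employee_salary.csv`. The explicit `format="%m/%d/%Y"` argument is an addition for illustration and is not on the slide.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for employee_salary.csv: the four rows shown on the slide.
df = pd.DataFrame({
    "Gender": ["M", "F", "M", "F"],
    "Date First Hired": ["09/12/1988", "06/26/2006", "07/16/2007", "01/26/2000"],
    "Employee Position Title": [
        "Master Police Officer",
        "Social Worker III",
        "Police Officer III",
        "Library Assistant I",
    ],
})

# One-hot encode Gender. The encoder returns a sparse matrix by default,
# so convert it to a dense array to look at it.
one_hot_enc = OneHotEncoder()
gender_encoded = one_hot_enc.fit_transform(df[["Gender"]]).toarray()
# Column order follows one_hot_enc.categories_, which is sorted: F, then M.

# Parse the dates, then cast the timestamps to floats
# (time elapsed since the Unix epoch, in the resolution of the datetime dtype).
dates = pd.to_datetime(df["Date First Hired"], format="%m/%d/%Y")
dates_numeric = dates.values.astype(float)
```

Note that the encoder orders its columns alphabetically (F before M), so the column order differs from the illustration on the slide.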
  • 15. 1 Transformers: fit & transform
      Separating fitting from transforming avoids data leakage, and lets the transformer be used in a Pipeline and in cross_val_score.
      One-hot encoder:
          one_hot_enc.fit(df[['Gender']])
          X = one_hot_enc.transform(df[['Gender']])
      1) fit stores which categories are present; 2) transform encodes the data accordingly.
      Better than pd.get_dummies because the columns are defined from the train set and do not change with the test set.
  • 16. For dates: FunctionTransformer
          def date2num(date_str):
              out = pd.to_datetime(date_str).values.astype(float)
              return out.reshape((-1, 1))  # 2D output

          date_trans = preprocessing.FunctionTransformer(
              func=date2num, validate=False)
          X = date_trans.transform(df['Date First Hired'])
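The point about `pd.get_dummies` can be checked on a small example. This is a sketch with made-up train and test frames in which one category happens to be absent from the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up split: the category "F" happens to be absent from the test set.
train = pd.DataFrame({"Gender": ["M", "F", "M", "F"]})
test = pd.DataFrame({"Gender": ["M", "M"]})

# fit() stores the categories seen in the train set;
# transform() then always emits one column per stored category.
one_hot_enc = OneHotEncoder()
one_hot_enc.fit(train[["Gender"]])
X_train = one_hot_enc.transform(train[["Gender"]]).toarray()
X_test = one_hot_enc.transform(test[["Gender"]]).toarray()

# pd.get_dummies has no fitted state: it builds its columns
# from whatever values are present in the frame it is given.
dummies_train = pd.get_dummies(train["Gender"])
dummies_test = pd.get_dummies(test["Gender"])
```

Here `X_train` and `X_test` both have two columns, while `dummies_test` has a single column and no longer lines up with `dummies_train`.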
  • 17. 1 ColumnTransformer: assembling
      Applies different transformers to different columns. These can be complex pipelines.
          column_trans = compose.make_column_transformer(
              (one_hot_enc, ['Gender', 'Employee Position Title']),
              (date_trans, 'Date First Hired'),
          )
          X = column_trans.fit_transform(df)
      From DataFrame to array with heterogeneous preprocessing & feature engineering.
  • 18. Benefit: model selection on the dataframe
          model = make_pipeline(column_trans, HistGradientBoostingClassifier())
          scores = cross_val_score(model, df, y)
  • 19. 2 Encoding dirty categories. PhD work of Patricio Cerda [Cerda... 2018]
      Employee Position Title: Master Police Officer, Social Worker IV, Police Officer III, Police Aide, Electrician I, Bus Operator, Bus Operator, Social Worker III, Library Assistant I, Library Assistant I
  • 20. 2 The problem of dirty categories: they break OneHotEncoder
      Overlapping categories: "Master Police Officer", "Police Officer III", "Police Officer II"...
      High cardinality: 400 unique entries in 10 000 rows
      Rare categories: only 1 "Architect III"
      New categories in the test set
  • 21. 2 Data curation. Database normalization, feature engineering:
      Employee Position Title   ⇒   Position          Rank
      Master Police Officer          Police Officer    Master
      Social Worker III              Social Worker     III
      Police Officer II              Police Officer    II
      Social Worker II               Social Worker     II
      Police Officer III             Police Officer    III
  • 22. Merging entities: deduplication & record linkage. Output a "clean" database.
      Company name: Pfizer Inc.; Pfizer Pharmaceuticals LLC; Pfizer International LLC; Pfizer Limited; Pfizer Corporation Hong Kong Limited; Pfizer Pharmaceuticals Korea Limited; ...
      Difficult without supervision. Potentially suboptimal: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
  • 23. Data curation is hard to make automatic and turn-key. It is harder than supervised learning.
  • 24. Our goal: supervised learning on dirty categories. The statistical question should inform curation: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
  • 25. 2 Adding similarities to one-hot encoding
      One-hot encoding, $X \in \mathbb{R}^{n \times p}$:
                 London  Londres  Paris
      Londres    0       1        0
      London     1       0        0
      Paris      0       0        1
      New categories? Link categories?
      Similarity encoding [Cerda... 2018], each entry being a string distance such as string_distance(Londres, London):
                 London  Londres  Paris
      Londres    0.3     1.0      0.0
      London     1.0     0.3      0.0
      Paris      0.0     0.0      1.0
  • 26. 2 Some string similarities
      Levenshtein: number of edits on one string needed to match the other.
      Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m - t}{3m}$, where $m$ is the number of matching characters and $t$ the number of character transpositions.
      n-gram similarity: an n-gram is a group of n consecutive characters. $\text{similarity} = \frac{\#\,\text{n-grams in common}}{\#\,\text{n-grams in total}}$
      [Figure: the successive 3-grams of an example string]
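The n-gram similarity and the resulting encoding can be sketched in a few lines of plain Python. This is an illustrative simplification, not the dirty_cat implementation: it counts distinct 3-grams of the lower-cased strings without padding, and the helper names `ngrams`, `ngram_similarity` and `similarity_encode` are made up for this example.

```python
import numpy as np


def ngrams(string, n=3):
    """Set of distinct groups of n consecutive characters (lower-cased)."""
    s = string.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Number of n-grams in common divided by the number of n-grams in total."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = len(grams_a | grams_b)
    return len(grams_a & grams_b) / total if total else 0.0


def similarity_encode(values, prototypes, n=3):
    """One row per value, one column per prototype category."""
    return np.array(
        [[ngram_similarity(v, p, n) for p in prototypes] for v in values]
    )


prototypes = ["London", "Londres", "Paris"]
X = similarity_encode(["Londres", "London", "Paris"], prototypes)
```

"london" and "londres" share the 3-grams "lon" and "ond" out of 7 distinct ones, so their similarity is 2/7 ≈ 0.29, in line with the 0.3 of the table above. A string that was never seen at fit time still gets a meaningful row, which one-hot encoding cannot provide.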
  • 27. 2 Python implementation: DirtyCat
      DirtyCat, dirty category software: http://dirty-cat.github.io
          from dirty_cat import SimilarityEncoder
          similarity_encoder = SimilarityEncoder(similarity='ngram')
          transformed_values = similarity_encoder.fit_transform(df)
  • 28. 2 Other approach: TargetEncoder [Micci-Barreca 2001]
      Represent each category by the average target y. For example: Police Officer III → average salary of a Police Officer III.
      [Figure: categories placed along the axis "y: Employee salary" (40000 to 140000): Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
  • 29. Embedding nearby categories with the same y can help build a simple decision function.
  • 30. DirtyCat, dirty category software: http://dirty-cat.github.io
          from dirty_cat import TargetEncoder
          target_encoder = TargetEncoder()
          transformed_values = target_encoder.fit_transform(df)
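The idea of target encoding can be illustrated with plain pandas. This is a deliberately bare sketch and not the dirty_cat `TargetEncoder`: the salaries are made up, there is no smoothing of the per-category means toward the global mean, and falling back to the global mean for unseen categories is an assumption made for this example.

```python
import pandas as pd

# Made-up salaries, for illustration only.
train = pd.DataFrame({
    "title": [
        "Police Officer III", "Police Officer III",
        "Bus Operator", "Bus Operator",
        "Manager II",
    ],
    "salary": [70000.0, 80000.0, 40000.0, 50000.0, 120000.0],
})
test = pd.DataFrame({"title": ["Bus Operator", "Architect III"]})

# "Fit": average target per category, computed on the train set only.
category_means = train.groupby("title")["salary"].mean()
global_mean = train["salary"].mean()


def target_encode(titles):
    """Replace each category by its mean target; unseen ones get the global mean."""
    return titles.map(category_means).fillna(global_mean)


encoded_train = target_encode(train["title"])
encoded_test = target_encode(test["title"])
```

With these numbers "Bus Operator" is encoded as 45000 and the unseen "Architect III" as the global mean, 72000. In practice, rare categories make the raw per-category mean noisy, which is why practical encoders usually shrink it toward the global mean.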
  • 31. 2 Experimental results: prediction performance
      Average rank on 7 datasets (lower is better):
                                      Linear model  Gradient-boosted trees
      One-hot encoding                4.7           6.0
      Target encoding                 5.3           4.3
      Similarity enc. (Jaro-Winkler)  3.4           3.6
      Similarity enc. (Levenshtein)   3.1           3.0
      Similarity enc. (3-gram)        1.1           1.9
      Best: similarity encoding with 3-gram similarity [Cerda... 2018]. Also, gradient-boosted trees work much better.
  • 32. 2 Dirty categories blow up dimension. Wow, lots of datasets!
  • 33. New words in natural language.
  • 34. $X \in \mathbb{R}^{n \times p}$, p is large: statistical problems, computational problems.
  • 35. 2 Tackling the high cardinality. Similarity encoding, one-hot encoding = prototype methods. How to choose a small number of prototypes?
  • 36. All of the training set ⇒ huge dimensionality. Most frequent? Maybe the right prototypes ∉ training set: "big cat", "fat cat", "big dog", "fat dog". Estimate prototypes.
  • 37. 2 n-grams grow, but there is redundancy. Natural language.
  • 38. 2 Substring information
      Drug Name: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol
      Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III
  • 39. 2 Latent category model. Topic model on sub-strings (GaP: Gamma-Poisson factorization). Models strings as a linear combination of substrings.
      [Figure: binary matrix of 3-gram occurrences, with the strings "police officer", "pol off", "polis", "policeman", "policier" as rows and 3-grams such as "er_", "cer", "fic", "off", "_of", "ce_", "ice", "lic", "pol" as columns]
  • 40. [Figure: the 3-gram matrix factorized into two matrices, documents × topics and topics × 3-grams] The factorization gives: what substrings are in a latent category, and what latent categories are in an entry.
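The factorization of a 3-gram count matrix can be sketched with scikit-learn. Note the substitution: this uses `NMF` with a Kullback-Leibler loss as a readily available stand-in, not the Gamma-Poisson estimator of the talk. The strings are taken from the slides, and the number of latent categories (2) is an arbitrary choice for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Strings taken from the slides.
entries = [
    "police officer", "pol off", "polis", "policeman", "policier",
    "bus operator", "equipment operator", "transit coordinator",
]

# Count the character 3-grams of each entry: rows are entries, columns 3-grams.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# Stand-in for the Gamma-Poisson model: non-negative factorization of the counts,
# counts ≈ activations @ components.
nmf = NMF(
    n_components=2,
    beta_loss="kullback-leibler",
    solver="mu",
    init="random",
    random_state=0,
    max_iter=500,
)
activations = nmf.fit_transform(counts)  # what latent categories are in an entry
components = nmf.components_             # what 3-grams are in a latent category
```

`activations` can then be used as a low-dimensional encoding of the entries, with as many columns as latent categories rather than one per distinct string.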
  • 41. 2 String models of latent categories. Encodings that extract latent categories.
      [Figure: activations of the latent categories for the entries Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant]
  • 42. Inferring plausible feature names.
      [Figure: same entries, with inferred feature names on the other axis (left-truncated in the extraction): "untant, assistant, library"; "nator, equipment, operator"; "administration, specialist"; "t, craftsworker, warehouse"; "crossing, program, manager"; "ician, mechanic, community"; "refighter, rescuer, rescue"; "ional, correction, officer"]
  • 43. 2 Data science with dirty categories
      [Figure: permutation importances (axis from 0.0 to 0.2) of the inferred feature names: "Information, Technology, Technologist"; "Officer, Office, Police"; "Liquor, Clerk, Store"; "School, Health, Room"; "Environmental, Telephone, Capital"; "Lieutenant, Captain, Chief"; "Income, Assistance, Compliance"; "Manager, Management, Property"]
  • 44. 3 Learning with missing values [Josse... 2019]
      Gender  Date Hired  Employee Position Title
      M       09/12/1988  Master Police Officer
      F       NA          Social Worker IV
      M       07/16/2007  Police Officer III
      F       02/05/2007  Police Aide
      M       01/13/2014  Electrician I
      M       04/28/2002  Bus Operator
      M       NA          Bus Operator
      F       06/26/2006  Social Worker III
      F       01/26/2000  Library Assistant I
      M       NA          Library Assistant I
  • 45. Why doesn't the #$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space), and NA ∉ ℝ. This is more than an implementation problem.
  • 46. Categorical variables are discrete anyhow: for missing values in categorical variables, create a special category "missing". The rest of the talk is on NA in numerical variables.
  • 47. 3 Classic statistics points of view
      Model a): a complete data-generating process. Model b): a random process occluding entries.
      Missing At Random situation (MAR): for non-observed values, the probability of missingness does not depend on this non-observed value. Proper definition in [Josse... 2019].
      Theorem [Rubin 1976]: in MAR, maximizing the likelihood for the observed data while ignoring (marginalizing) the unobserved values gives the maximum likelihood of model a).
  • 48. Missing Completely At Random situation (MCAR): missingness is independent from the data. Missing Not At Random situation (MNAR): missingness is not ignorable.
  • 49. [Figure: three panels titled Complete, MCAR, MNAR]
  • 50. But there isn't always an unobserved value (age of spouse of singles?), and we are not trying to maximize likelihoods.
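The difference between the MCAR and MNAR mechanisms can be simulated in a few lines. This is a one-dimensional sketch with made-up settings: standard normal data, a 30% missingness rate for MCAR, and, for MNAR, a censoring rule that hides every value above a hypothetical threshold of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)  # complete data

# MCAR: every entry is masked with the same probability, whatever its value.
mcar_mask = rng.random(x.shape) < 0.3
x_mcar = np.where(mcar_mask, np.nan, x)

# MNAR: whether an entry is missing depends on the unobserved value itself.
# Here, values above the threshold are never observed (censoring).
threshold = 0.5
mnar_mask = x > threshold
x_mnar = np.where(mnar_mask, np.nan, x)

# Under MCAR the observed values keep roughly the distribution of the
# complete data; under MNAR the observed mean is biased downward.
mean_complete = x.mean()
mean_mcar = np.nanmean(x_mcar)
mean_mnar = np.nanmean(x_mnar)
```

Under this censoring rule, ignoring the missing entries gives a biased picture of the data, which is why MNAR missingness is called non-ignorable.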
  • 51. The #$@! machine learning toolkit still doesn't work?!
  • 52. 3 Imputation: fill in information
      Gender  Date Hired  Employee Position Title
      M       09/12/1988  Master Police Officer
      F       NA → 2000   Social Worker IV
      M       07/16/2007  Police Officer III
      M       01/13/2014  Electrician I
      M       04/28/2002  Bus Operator
      M       NA → 2012   Bus Operator
      F       06/26/2006  Social Worker III
      F       01/26/2000  Library Assistant I
      M       NA → 2014   Library Assistant I
      Large statistical literature, but procedures and results focus on in-sample settings. How about completing the test set with the train set? What to do with the prediction target y?
  • 53. 3 Imputation procedures that work out of sample
      Mean imputation (a special case of univariate imputation): replace NA by the mean of the feature. sklearn.impute.SimpleImputer
  • 54. Conditional imputation: model one feature as a function of the others. Possible implementation: iteratively predict one feature as a function of the others. Classic implementations in R: MICE, missForest. sklearn.impute.IterativeImputer, new in 0.21!!
  • 55. Classic statistics point of view: mean imputation is disastrous, because it distorts the distribution. "Congeniality" conditions: a good imputation must preserve the data properties used by later analysis steps.
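Both imputers follow the fit/transform convention, which is what makes them usable out of sample. This is a minimal sketch on made-up arrays: the imputers are fitted on the train set only and then applied to the test set. `IterativeImputer` is still flagged as experimental in scikit-learn, hence the explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Made-up data: the second feature is roughly twice the first.
X_train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan]])
X_test = np.array([[np.nan, 10.0], [5.0, np.nan]])

# Mean imputation: the means are learned on the train set (2.5 and 4.0)
# and reused as-is to complete the test set.
mean_imputer = SimpleImputer(strategy="mean").fit(X_train)
X_test_mean = mean_imputer.transform(X_test)

# Conditional imputation: each feature is modeled as a function of the others,
# again fitted on the train set only.
iterative_imputer = IterativeImputer(random_state=0).fit(X_train)
X_test_iter = iterative_imputer.transform(X_test)
```

Mean imputation fills the second test row with 4.0, ignoring that its first feature is 5.0, whereas the conditional imputer can use the relation between the two features.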
  • 56. 3 Imputation for supervised learning
      Theorem [Josse... 2019]: for a powerful learner (universally consistent), imputing both train and test with the mean of the train set is consistent, i.e. it converges to the best possible prediction.
      Intuition: the learner "recognizes" imputed entries and compensates at test time.
  • 57. Simulation: MCAR + gradient boosting
      [Figure: r2 score as a function of sample size (10^2 to 10^4) for Mean and Iterative imputation; panels: Convergence, Small sample size]
      Notebook: github – @nprost / supervised missing
      Conclusion: IterativeImputer is useful for small sample sizes.
  • 58. 3 Imputation is not enough
      Pathological case [Josse... 2019]: y depends only on whether data is missing or not, e.g. tax fraud detection. In theory terms: MNAR = "Missing Not At Random". Imputing makes prediction impossible.
      Solution: add a missingness indicator, an extra feature to predict from.
          ...SimpleImputer(add_indicator=True)
          ...IterativeImputer(add_indicator=True)
  • 59. Simulation: y depends indirectly on missingness (censoring in the data)
      [Figure: r2 score as a function of sample size (10^2 to 10^4) for Mean, Mean + indicator, Iterative, Iterative + indicator; panels: Convergence, Small sample size]
      Notebook: github – @nprost / supervised missing
      Adding a mask is crucial. Iterative imputation can be detrimental.
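The effect of `add_indicator=True` is easiest to see on the pathological case itself. This is a toy sketch with made-up data in which the target `y_train` is exactly the missingness pattern of the single feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up pathological case: y is 1 exactly where the feature is missing.
X_train = np.array([[1.0], [np.nan], [3.0], [np.nan]])
y_train = np.array([0, 1, 0, 1])

# Mean imputation plus a missingness indicator: the output has the imputed
# feature in the first column and a 0/1 mask in the second.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X_train)
```

After imputation the first column is [1, 2, 3, 2], which says little about the missingness pattern once new observed values come close to the mean. The second column is [0, 1, 0, 1], so it carries the missingness pattern unambiguously to the learner.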
  • 60. @GaelVaroquaux: Learning on dirty data. Prepare data via ColumnTransformer. Use HistGradientBoosting.
  • 61. Dirty categories: statistical modeling of non-curated categorical data. Give us your dirty data. Similarity encoding: a robust solution that enables statistical models. Dirty category software: http://dirty-cat.github.io
  • 62. Supervised learning with missing data: mean imputation + missing indicator. Many more results in [Josse... 2019]. http://project.inria.fr/dirtydata Ongoing research.
  • 63. Acknowledgements
      Dirty categories: Patricio Cerda and Balázs Kégl
      Missing data: Julie Josse, Erwan Scornet, Nicolas Prost
      Implementation in scikit-learn thanks to the scikit-learn consortium partners
  • 64. 4 References
      P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. 2018.
      J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
      D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1): 27–32, 2001.
      D. B. Rubin. Inference and missing data. Biometrika, 63(3): 581–592, 1976.