Machine learning on non curated data
Dirty data made easy (in Python )
Gaël Varoquaux
With scikit-learn, machine learning is easy and fun
The problem is getting the data into the learner
www.kaggle.com/ash316/novice-to-grandmaster
Machine learning
Let X ∈ R^{n×p}
or a numpy array
In real life: often a pandas DataFrame
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA Library Assistant I
sklearn.compose.ColumnTransformer
Apply different preprocessing per column
Dirty Categories
Missing values
Talk outline
1 Column transforming
2 Encoding dirty categories
3 Learning with missing values
Python + scikit-learn
data mining research
statistics research
1 Column transforming
Pandas in, numpy out
(preprocessing)
1 Dataframes to numbers
df = pd.read_csv('employee_salary.csv')
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F 06/26/2006 Social Worker III
M 07/16/2007 Police Officer III
F 01/26/2000 Library Assistant I
Convert all values to numerical
Gender: One-hot encode
one_hot_enc = sklearn.preprocessing.OneHotEncoder()
one_hot_enc.fit_transform(df[['Gender']])
Gender (M) Gender (F) ...
1 0
0 1
1 0
0 1
Date: use pandas’ datetime support
dates = pd.to_datetime(df['Date First Hired'])
# the values hold nanoseconds since the epoch
dates.values.astype(float)
1 Transformers: fit & transform
Separating fitting from transforming
Avoids data leakage
Can be used in a Pipeline and cross_val_score
One-hot encoder
one_hot_enc.fit(df[['Gender']])
X = one_hot_enc.transform(df[['Gender']])
1) store which categories are present
2) encode the data accordingly
Better than pd.get_dummies: the columns are defined
from the train set, and do not change with the test set
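A minimal sketch (on hypothetical toy data) of why fitting on the train set matters: the encoder's columns are fixed at fit time, and with `handle_unknown='ignore'` a category never seen in training is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['M'], ['F'], ['M']])
X_test = np.array([['F'], ['X']])   # 'X' never appears in the train set

# Columns are fixed at fit time, from the train set only
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)

encoded = enc.transform(X_test).toarray()
print(encoded)  # the unseen 'X' becomes an all-zero row
```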
For dates: FunctionTransformer
def date2num(date_str):
    out = pd.to_datetime(date_str).values.astype(np.float64)
    return out.reshape((-1, 1))  # 2D output

date_trans = preprocessing.FunctionTransformer(
    func=date2num, validate=False)
X = date_trans.transform(df['Date First Hired'])
1 ColumnTransformer: assembling
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
From DataFrame to array with heterogeneous
preprocessing & feature engineering
Benefit: model selection on dataframe
model = make_pipeline(column_trans,
                      HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
2 Encoding dirty categories
PhD work of Patricio Cerda [Cerda... 2018]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
2 The problem of dirty categories
Breaks OneHotEncoder
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
2 Data curation Database normalization
Feature engineering
Employee Position Title
Master Police Officer
Social Worker III
Police Officer II
Social Worker II
Police Officer III
⇒
Position Rank
Police Officer Master
Social Worker III
Police Officer II
Social Worker II
Police Officer III
Merging entities Deduplication & record linkage
Output a “clean” database Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult
without
supervision
Potentially
suboptimal
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Hard to make automatic and turn-key
Harder than supervised learning
Our goal: supervised learning on dirty categories
The statistical question
should inform curation
Pfizer Corporation Hong Kong
=?
Pfizer Pharmaceuticals Korea
2 Adding similarities to one-hot encoding
One-hot encoding
London Londres Paris
Londres 0 1 0
London 1 0 0
Paris 0 0 1
X ∈ R^{n×p}
new categories?
link categories?
Similarity encoding [Cerda... 2018]
London Londres Paris
Londres 0.3 1.0 0.0
London 1.0 0.3 0.0
Paris 0.0 0.0 1.0
string distance(Londres, London)
2 Some string similarities
Levenshtein
Number of edits to transform one string into the other
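As a concrete reference, the Levenshtein distance can be computed with the classic dynamic program (a sketch for illustration, not the implementation used in the paper):

```python
def levenshtein(s1, s2):
    """Number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein('London', 'Londres'))  # 3 edits
```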
Jaro-Winkler
d_jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters
e.g. 3-grams of “London”: Lon, ond, ndo, ...
similarity = (# n-grams in common) / (# n-grams in total)
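The n-gram similarity above can be sketched as a set overlap over character 3-grams (one common variant; the exact definition used in the paper may differ slightly, e.g. it may use counts rather than sets):

```python
def ngrams(s, n=3):
    # distinct character n-grams of a string
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    # ratio of shared n-grams to total distinct n-grams
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)

print(ngram_similarity('London', 'Londres'))  # 2 shared / 7 total ≈ 0.29
```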
2 Python implementation: dirty_cat
dirty_cat, dirty-category software:
http://dirty-cat.github.io
from dirty_cat import SimilarityEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Represent each category by the average target y
For example Police Officer III
→ the average salary of Police Officer III
[Plot: average employee salary (y, from 40 000 to 140 000) per position title, ranging from Crossing Guard to Manager II]
Embedding close-by categories with the same
y can help build a simple decision function.
from dirty_cat import TargetEncoder
target_encoder = TargetEncoder()
transformed_values = target_encoder.fit_transform(df, y)
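The core idea of target encoding can be sketched with a pandas groupby on toy data (this ignores the empirical-Bayes shrinkage of [Micci-Barreca 2001] that a real implementation adds for rare categories):

```python
import pandas as pd

df = pd.DataFrame({
    'Employee Position Title': ['Bus Operator', 'Bus Operator',
                                'Police Aide', 'Police Aide',
                                'Manager I'],
    'salary': [40000, 42000, 35000, 37000, 90000],
})

# Each category is represented by the mean target of its rows
means = df.groupby('Employee Position Title')['salary'].mean()
encoded = df['Employee Position Title'].map(means)
print(encoded.tolist())  # [41000.0, 41000.0, 36000.0, 36000.0, 90000.0]
```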
2 Experimental results: prediction performance
Average rank on 7 datasets (lower is better)

                      Linear model   Gradient-boosted trees
One-hot encoding          4.7              6.0
Target encoding           5.3              4.3
Similarity encoding:
  Jaro-Winkler            3.4              3.6
  Levenshtein             3.1              3.0
  3-gram                  1.1              1.9

Best: similarity encoding with the 3-gram similarity [Cerda... 2018]
Also, gradient-boosted trees work much better
2 Dirty categories blow up dimension
Wow, lots of datasets!
New words in natural language
X ∈ R^{n×p}, p is large
Statistical problems
Computational problems
2 Tackling the high cardinality
Similarity encoding, one-hot encoding
= Prototype methods
How to choose a small number
of prototypes?
All training-set entries? ⇒ huge dimensionality
Most frequent?
Maybe the right prototypes ∉ the training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
2 n-grams grow, but there is redundancy
[Figure: growth of the number of distinct n-grams in natural language]
2 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
2 Latent category model
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
Models strings as a linear combination of substrings
[Figure: binary entry-by-3-gram count matrix; rows: police, officer, pol off, polis, policeman, policier; columns: 3-grams er_, cer, fic, off, _of, ce_, ice, lic, pol]
[Figure: the count matrix factorizes as (entries × latent categories) · (latent categories × 3-grams): which latent categories are in an entry, and which substrings are in a latent category]
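The Gamma-Poisson factorization is close in spirit to non-negative matrix factorization of the string-by-n-gram count matrix; a minimal sketch with scikit-learn, where NMF stands in for the GaP model (toy titles, not the talk's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

titles = ['Master Police Officer', 'Police Officer III',
          'Police Sergeant', 'Bus Operator',
          'Equipment Operator I', 'Library Assistant I']

# String-by-3-gram count matrix
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
counts = vectorizer.fit_transform(titles)

# Factorize: entries ≈ activations @ topics, all non-negative
nmf = NMF(n_components=2, max_iter=500, random_state=0)
activations = nmf.fit_transform(counts)   # latent categories per entry
topics = nmf.components_                  # 3-grams per latent category

print(activations.shape, topics.shape)
```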
2 String models of latent categories
Encodings that extract latent categories
[Figure: job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) scored against latent categories named library, operator, specialist, warehouse, manager, community, rescue, officer]
2 String models of latent categories
Inferring plausible feature names
[Figure: the same job titles with inferred feature names per latent category, e.g. accountant/assistant/library, coordinator/equipment/operator, administration/specialist, craftsworker/warehouse, crossing/program/manager, technician/mechanic/community, firefighter/rescuer/rescue, correctional/correction/officer]
2 Data science with dirty categories
[Figure: permutation importances (0.0–0.2) of inferred feature names: Information/Technology/Technologist, Officer/Office/Police, Liquor/Clerk/Store, School/Health/Room, Environmental/Telephone/Capital, Lieutenant/Captain/Chief, Income/Assistance/Compliance, Manager/Management/Property]
3 Learning with missing values
[Josse... 2019]
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA Library Assistant I
Why doesn’t the #$@! machine learning
toolkit work?!
Machine learning models need entries in a vector
space (or at least a metric space).
NA ∉ ℝ
More than an implementation problem
Categorical variables are discrete anyhow
For missing values in categorical variables,
create a special category “missing”.
Rest of the talk: NA in numerical variables
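The “missing” category trick is a one-liner with pandas; a minimal sketch on toy data:

```python
import pandas as pd

titles = pd.Series(['Bus Operator', None, 'Police Aide', None],
                   name='Employee Position Title')

# Treat missingness in a categorical column as a category of its own
filled = titles.fillna('missing')
print(filled.tolist())
# ['Bus Operator', 'missing', 'Police Aide', 'missing']
```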
3 Classic statistics points of view
Model a) a complete data-generating process
Model b) a random process occluding entries
Missing At Random (MAR)
For non-observed values, the probability of missingness
does not depend on the non-observed value itself.
Proper definition in [Josse... 2019]
Theorem [Rubin 1976]: under MAR, maximizing the likelihood of the
observed data while ignoring (marginalizing over) the unobserved
values gives the maximum likelihood of model a).
Missing Completely At Random (MCAR)
Missingness is independent of the data
Missing Not At Random (MNAR)
Missingness is not ignorable
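For illustration (not from the talk), MCAR and MNAR missingness can be simulated by masking a complete dataset; the MCAR mask is drawn independently of the values, while the MNAR mask depends on them:

```python
import numpy as np

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(1000, 2))

# MCAR: the mask is drawn independently of the values
mask = rng.random(X_complete.shape) < 0.2
X_mcar = X_complete.copy()
X_mcar[mask] = np.nan

# MNAR (for contrast): large values of the first column go missing
X_mnar = X_complete.copy()
X_mnar[X_mnar[:, 0] > 1.0, 0] = np.nan

print(np.isnan(X_mcar).mean())  # close to the 0.2 masking rate
```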
[Figure: scatter plots of the complete data and of the observed data under MCAR and MNAR missingness]
But
There isn’t always an underlying unobserved value
(the age of the spouse, for singles?)
We are not trying to maximize likelihoods
The #$@! machine learning toolkit still
doesn’t work?!
3 Imputation
Fill in information
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA–2000 Social Worker IV
M 07/16/2007 Police Officer III
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA–2012 Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA–2014 Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?
3 Imputation procedures that work out of sample
Mean imputation (a special case of univariate imputation)
Replace NA by the mean of the feature
sklearn.impute.SimpleImputer
Conditional imputation
Modeling one feature as a function of others
Possible implementation:
iteratively predict each feature as a function of the others
Classic implementations in R: MICE, missforest
sklearn.impute.IterativeImputer
new in 0.21!!
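Out-of-sample imputation in a sketch: fit the imputer on the train set, then reuse the train statistics on the test set (toy data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)  # learns the train mean: (1 + 3) / 2 = 2

filled = imputer.transform(X_test)
print(filled)
# [[2.], [5.]]: the test NA is filled with the *train* mean
```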
Classic statistics point of view
Mean imputation is disastrous, because it distorts the distribution
[Figure: scatter plot of mean-imputed data]
“Congeniality” conditions: good imputation must
preserve the data properties used by later analysis steps
3 Imputation for supervised learning
Theorem [Josse... 2019]
For a powerful learner (universally consistent)
imputing both train and test with the mean of
train is consistent
i.e., it converges to the best possible prediction
Intuition
The learner “recognizes” imputed entries and
compensates at test time
Simulation: MCAR + gradient boosting
[Figure: r² score vs sample size for mean and iterative imputation; both converge, iterative imputation is better at small sample sizes]
Notebook: github – @nprost / supervised_missing
Conclusion:
IterativeImputer is useful for small sample sizes
3 Imputation is not enough
Pathological case [Josse... 2019]
y depends only on whether data is missing or not
e.g. tax-fraud detection
theory: MNAR = “Missing Not At Random”
Imputing makes prediction impossible
Solution
Add a missingness indicator: an extra feature to predict from
...SimpleImputer(add_indicator=True)
...IterativeImputer(add_indicator=True)
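`add_indicator=True` appends one mask column per feature that had missing values, so the learner sees both the imputed value and the missingness itself (a sketch on toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

# One indicator column is appended for each feature with missing values
imputer = SimpleImputer(strategy='mean', add_indicator=True)
out = imputer.fit_transform(X)

print(out)
# column 0 imputed with its mean (2.0); column 2 is the missingness mask
```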
Simulation: y depends indirectly on missingness (censoring in the data)
[Figure: r² score vs sample size for mean and iterative imputation, with and without the missingness indicator; the indicator variants reach higher scores]
Notebook: github – @nprost / supervised_missing
Adding a mask is crucial
Iterative imputation can be detrimental
@GaelVaroquaux
Learning on dirty data
Prepare data via ColumnTransformer
Use HistGradientBoosting
Dirty categories
Statistical modeling of non-curated categorical data
Give us your dirty data
Similarity encoding
robust solution that enables statistical models
Dirty category software:
http://dirty-cat.github.io
Supervised learning with missing data
Mean imputation + missing indicator
Much more results in [Josse... 2019]
http://project.inria.fr/dirtydata
Ongoing research
Acknowledgements
Dirty categories
Patricio Cerda and Balázs Kégl
Missing data
Julie Josse, Erwan Scornet, Nicolas Prost
Implementation in scikit-learn
thanks to scikit-learn consortium partners
4 References I
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for
learning with dirty categorical variables. 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the
consistency of supervised learning with missing values. arXiv
preprint arXiv:1902.06931, 2019.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction
problems. ACM SIGKDD Explorations Newsletter, 3(1):
27–32, 2001.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):
581–592, 1976.

Machine learning on non curated data

  • 1.
    Machine learning onnon curated data Dirty data made easy (in Python ) Ga¨el Varoquaux,
  • 2.
    Machine learning onnon curated data Dirty data made easy (in Python ) Ga¨el Varoquaux,
  • 3.
    With scikit-learn, machinelearning is easy and fun The problem is getting the data into the learner
  • 4.
    With scikit-learn, machinelearning is easy and fun The problem is getting the data into the learner www.kaggle.com/ash316/novice- to-grandmaster
  • 5.
    Machine learning Let X∈ Rn×p or a numpy array
  • 6.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I
  • 7.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I sklearn.compose.Column Transformer Apply different preprocessing per columns
  • 8.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I Dirty Categories
  • 9.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I Missing values
  • 10.
    Talk outline 1 Columntransforming 2 Encoding dirty categories 3 Learning with missing values Python + scikit-learn data mining research statistics research G Varoquaux 4
  • 11.
    1 Column transforming Pandasin, numpy out (preprocessing) G Varoquaux 5
  • 12.
    1 Dataframes tonumbers df = pd.read csv(’employee_salary.csv’) Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III F 01/26/2000 Library Assistant I Convert all values to numerical G Varoquaux 6
  • 13.
    1 Dataframes tonumbers df = pd.read csv(’employee_salary.csv’) Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III F 01/26/2000 Library Assistant I Convert all values to numerical Gender: One-hot encode one hot enc = sklearn. preprocessing .OneHotEncoder() one hot enc. fit transform (df[[’Gender’]]) Gender (M) Gender (F) ... 1 0 0 1 1 0 0 1G Varoquaux 6
  • 14.
    1 Dataframes tonumbers df = pd.read csv(’employee_salary.csv’) Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III F 01/26/2000 Library Assistant I Convert all values to numerical Gender: One-hot encode Date: use pandas’ datetime support d a t e s = pd. t o d a t e t i m e ( df [’Date First Hired ’]) # the values hold the data in secs d a t e s . v a l u e s . a s t y p e (float) G Varoquaux 6
  • 15.
    1 Transformers: fit& transform Separating fitting from transforming Avoids data leakage Can be used in a Pipeline and cross val score One-hot encoder one hot enc. fit (df[[’Gender’]]) X = one hot enc.transform(df[[’Gender’]]) 1) store which categories are present 2) encode the data accordingly Better than pd.get dummies because columns are defined from train set, and do not change with test set G Varoquaux 7
  • 16.
    1 Transformers: fit& transform Separating fitting from transforming Avoids data leakage Can be used in a Pipeline and cross val score For dates: FunctionTransformer def date2num ( d a t e s t r ): out = pd. t o d a t e t i m e ( d a t e s t r ). v a l u e s . a s t y p e (np.float) return out . r e s h a p e ((-1, 1)) # 2D output d a t e t r a n s = p r e p r o c e s s i n g . F u n c t i o n T r a n s f o r m e r ( func =date2num , v a l i d a t e = F a l s e ) X = d a t e t r a n s . t r a n s f o r m ( df [’Date First Hired ’] G Varoquaux 7
  • 17.
    1 ColumnTransformer: assembling Appliesdifferent transformers to columns These can be complex pipelines c o l u m n t r a n s = compose . m a k e c o l u m n t r a n s f o r m e r ( ( one hot enc , [’Gender ’, ’Employee Position Title ’]), ( d a t e t r a n s , ’Date First Hired ’), ) X = c o l u m n t r a n s . f i t t r a n s f o r m ( df ) From DataFrame to array with heteroge- neous preprocessing & feature engineering G Varoquaux 8
  • 18.
    1 ColumnTransformer: assembling Appliesdifferent transformers to columns These can be complex pipelines c o l u m n t r a n s = compose . m a k e c o l u m n t r a n s f o r m e r ( ( one hot enc , [’Gender ’, ’Employee Position Title ’]), ( d a t e t r a n s , ’Date First Hired ’), ) X = c o l u m n t r a n s . f i t t r a n s f o r m ( df ) From DataFrame to array with heteroge- neous preprocessing & feature engineering Benefit: model selection on dataframe model = make pipeline(column trans, HistGradientBoostingClassifier) scores = cross val score(model, df, y) G Varoquaux 8
  • 19.
    2 Encoding dirtycategories PhD word of Patricio Cerda [Cerda... 2018] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I
  • 20.
    2 The problemof dirty categories Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Break OneHotEncoder Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 10
  • 21.
    2 Data curationDatabase normalization Feature engineering Employee Position Title Master Police Officer Social Worker III Police Officer II Social Worker II Police Officer III ⇒ Position Rank Police Officer Master Social Worker III Police Officer II Social Worker II Police Officer III G Varoquaux 11
  • 22.
    2 Data curationDatabase normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC Pfizer International LLC Pfizer Limited Pfizer Corporation Hong Kong Limited Pfizer Pharmaceuticals Korea Limited ... Difficult without supervision Potentially suboptimal Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea G Varoquaux 11
  • 23.
    2 Data curationDatabase normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC ... Hard to make automatic and turn-key Harder than supervised learning G Varoquaux 11
  • 24.
    Our goal: supervisedlearning on dirty categories The statistical question should inform curation Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea G Varoquaux 12
  • 25.
    2 Adding similaritiesto one-hot encoding One-hot encoding London Londres Paris Londres 0 1 0 London 1 0 0 Paris 0 0 1 X ∈ Rn×p new categories? link categories? Similarity encoding [Cerda... 2018] London Londres Paris Londres 0.3 1.0 0.0 London 1.0 0.3 0.0 Paris 0.0 0.0 1.0 string distance(Londres, London) G Varoquaux 13
  • 26.
    2 Some stringsimilarities Levenshtein Number of edit on one string to match the other Jaro-Winkler djaro(s1, s2) = m 3|s1| + m 3|s2| + m−t 3m m: number of matching characters t: number of character transpositions n-gram similarity n-gram: group of n consecutive characters 3-gram1 L 3-gram2 on 3-gram3 do... similarity = #n-gram in comon #n-gram in total G Varoquaux 14
  • 27.
2 Python implementation: dirty_cat
Dirty category software: http://dirty-cat.github.io

from dirty_cat import SimilarityEncoder

similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Represent each category by the average target y.
For example: Police Officer III → average salary of Police Officer III.
[Plot: employee salary y, from 40 000 to 140 000, per position: Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Embedding close-by categories with the same y can help build a simple decision function.
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Dirty category software: http://dirty-cat.github.io

from dirty_cat import TargetEncoder

target_encoder = TargetEncoder()
transformed_values = target_encoder.fit_transform(df, y)

(Note: fitting a target encoder requires the target y.)
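Conceptually, target encoding is just a per-category mean of y. A minimal pandas sketch, with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    'position': ['Police Officer III', 'Police Officer III', 'Bus Operator'],
    'salary':   [60_000, 70_000, 50_000],
})
# Each category is replaced by the mean target over that category
per_category_mean = df.groupby('position')['salary'].mean()
encoded = df['position'].map(per_category_mean)
print(encoded.tolist())  # [65000.0, 65000.0, 50000.0]
```

In practice [Micci-Barreca 2001] shrinks the per-category mean toward the global mean for rare categories, to avoid overfitting them.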
2 Experimental results: prediction performance
Average rank on 7 datasets (lower is better):

                            Linear model   Gradient-boosted trees
  One-hot encoding              4.7               6.0
  Target encoding               5.3               4.3
  Similarity encoding:
    Jaro-Winkler                3.4               3.6
    Levenshtein                 3.1               3.0
    3-gram                      1.1               1.9

Best: similarity encoding with the 3-gram similarity [Cerda... 2018].
Also, gradient-boosted trees work much better.
2 Dirty categories blow up dimension
New words in natural language; wow, lots of datasets!
X ∈ R^{n×p} with p large: statistical problems, computational problems.
2 Tackling the high cardinality
Similarity encoding and one-hot encoding are prototype methods. How to choose a small number of prototypes?
The whole training set ⇒ huge dimensionality. The most frequent categories? Maybe the right prototypes ∉ the training set ("big cat", "fat cat", "big dog", "fat dog").
⇒ Estimate prototypes.
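The "most frequent" baseline mentioned above is a one-liner; a toy sketch with the slide's example strings:

```python
from collections import Counter

categories = ['big cat', 'fat cat', 'big cat', 'big dog',
              'fat dog', 'big cat', 'fat cat']
# Keep the k most frequent categories as prototypes;
# every entry is then encoded only by its similarity to them
k = 2
prototypes = [cat for cat, _ in Counter(categories).most_common(k)]
print(prototypes)  # ['big cat', 'fat cat']
```

This illustrates the limitation the slide points out: a better prototype (e.g. plain "cat") may not even appear in the training set, which motivates estimating prototypes instead.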
2 n-grams grow, but there is redundancy
Natural language
2 Substring information
DrugName: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol
Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III
2 Latent category model
Topic model on substrings (GaP: Gamma-Poisson matrix factorization), e.g. on 3-grams.
Models strings as a linear combination of substrings: the count matrix of entries ("police officer", "pol off", "polis", "policeman", "policier") × 3-grams ("er_", "cer", "fic", "off", "_of", "ce_", "ice", "lic", "pol") is factorized into
  (entries × latent categories) × (latent categories × 3-grams):
which latent categories are in an entry, and which substrings are in a latent category.
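The factorization can be sketched with scikit-learn building blocks. Here NMF stands in for the Gamma-Poisson model of the slide, and the example entries are made up:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = ['police officer', 'master police officer',
           'bus operator', 'equipment operator']

# Entries x 3-gram count matrix (char_wb pads words with spaces)
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# Non-negative low-rank factorization:
#   activations:  which latent categories are in an entry
#   components_:  which 3-grams are in a latent category
nmf = NMF(n_components=2, random_state=0)
activations = nmf.fit_transform(counts)
print(activations.shape)  # (4, 2)
```

With more data, one topic concentrates on "officer"-like 3-grams and the other on "operator"-like ones, which is what makes the inferred feature names of the next slides readable.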
2 String models of latent categories
Encodings that extract latent categories.
[Plot: loadings of each job title (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) on latent categories: library, operator, specialist, warehouse, manager, community, rescue officer]
2 String models of latent categories
Inferring plausible feature names.
[Plot: the same job titles against inferred feature names, each a triplet of words, e.g. "accountant, assistant, library"; "coordinator, equipment, operator"; "administration, specialist"; "craftsworker, warehouse"; "crossing, program, manager"; "technician, mechanic, community"; "firefighter, rescuer, rescue"; "correction, officer"]
2 Data science with dirty categories
[Plot: permutation importances (0.0 to 0.2) of the inferred feature names: Information, Technology, Technologist; Officer, Office, Police; Liquor, Clerk, Store; School, Health, Room; Environmental, Telephone, Capital; Lieutenant, Captain, Chief; Income, Assistance, Compliance; Manager, Management, Property]
3 Learning with missing values [Josse... 2019]
Gender   Date Hired   Employee Position Title
M        09/12/1988   Master Police Officer
F        NA           Social Worker IV
M        07/16/2007   Police Officer III
F        02/05/2007   Police Aide
M        01/13/2014   Electrician I
M        04/28/2002   Bus Operator
M        NA           Bus Operator
F        06/26/2006   Social Worker III
F        01/26/2000   Library Assistant I
M        NA           Library Assistant I
    Why doesn’t the#$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). NA /∈ R More than an implementation problem G Varoquaux 26
  • 46.
    Why doesn’t the#$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). NA /∈ R More than an implementation problem Categorical are discrete anyhow For missing values in categorical variables, create a special categorie ”missing”. Rest of talk on NA in numerical variables G Varoquaux 26
  • 47.
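The "missing" category trick for categorical variables, sketched with pandas and scikit-learn (the column name is made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'position': ['Bus Operator', None, 'Social Worker III']})

# Treat missingness itself as a category before one-hot encoding
df_filled = df.fillna('missing')
one_hot = OneHotEncoder().fit_transform(df_filled).toarray()
print(one_hot.shape)  # 3 rows, 3 categories: Bus Operator, Social Worker III, missing
```

Because the category is discrete, "missing" carries information just like any other level, with no imputation needed.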
3 Classic statistics points of view
Model a): a complete data-generating process. Model b): a random process occluding entries.
Missing At Random (MAR): for non-observed values, the probability of missingness does not depend on the non-observed value. Proper definition in [Josse... 2019].
Theorem [Rubin 1976]: under MAR, maximizing the likelihood of the observed data while ignoring (marginalizing) the unobserved values gives the maximum likelihood of model a).
Missing Completely At Random (MCAR): missingness is independent of the data.
Missing Not At Random (MNAR): missingness is not ignorable.
[Plot: a complete bivariate sample, and the same sample under MCAR and MNAR occlusion]
But: there isn't always an unobserved value (the age of the spouse of singles?), and we are not trying to maximize likelihoods.
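A tiny simulation illustrating MCAR vs. MNAR; the distribution and the censoring threshold are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# MCAR: each entry is dropped with a probability independent of its value
mcar_observed = x[rng.random(10_000) >= 0.3]

# MNAR (censoring): the large values are the ones that go missing
mnar_observed = x[x <= 1.0]

# Under MCAR the observed mean stays unbiased; under MNAR it does not
print(round(mcar_observed.mean(), 2), round(mnar_observed.mean(), 2))
```

The observed mean under censoring is systematically below the true mean of 0, which is why MNAR missingness is "not ignorable".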
The #$@! machine learning toolkit still doesn't work?!
3 Imputation: fill in information
Gender   Date Hired   Employee Position Title
M        09/12/1988   Master Police Officer
F        NA → 2000    Social Worker IV
M        07/16/2007   Police Officer III
M        01/13/2014   Electrician I
M        04/28/2002   Bus Operator
M        NA → 2012    Bus Operator
F        06/26/2006   Social Worker III
F        01/26/2000   Library Assistant I
M        NA → 2014    Library Assistant I
Large statistical literature; procedures and results focused on in-sample settings.
How about completing the test set with the train set? What to do with the prediction target y?
3 Imputation procedures that work out of sample
Mean imputation (special case of univariate imputation): replace NA by the mean of the feature.
  sklearn.impute.SimpleImputer
Conditional imputation: model one feature as a function of the others. A possible implementation: iteratively predict each feature as a function of the others. Classic implementations in R: MICE, missForest.
  sklearn.impute.IterativeImputer, new in 0.21!
Classic statistics point of view: mean imputation is disastrous, because it distorts the distribution.
[Plot: distribution of a feature before and after mean imputation]
"Congeniality" conditions: a good imputation must preserve the data properties used by later analysis steps.
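Both procedures in scikit-learn, on a made-up toy matrix; note that IterativeImputer is experimental and must be enabled explicitly:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Univariate: each NA becomes its column's mean (here 3.0 and 4.0)
X_mean = SimpleImputer(strategy='mean').fit_transform(X)

# Conditional: iteratively regress each feature on the others
X_iter = IterativeImputer(random_state=0).fit_transform(X)
print(X_mean[1, 0], X_mean[2, 1])  # 3.0 4.0
```

Both transformers learn their fill-in values on the train set, so the same fitted object can complete a test set out of sample.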
3 Imputation for supervised learning
Theorem [Josse... 2019]: for a powerful learner (universally consistent), imputing both train and test with the mean of the train set is consistent, i.e. it converges to the best possible prediction.
Intuition: the learner "recognizes" imputed entries and compensates at test time.
Simulation, MCAR + gradient boosting: [Plot: r² score as a function of sample size (10² to 10⁴) for mean vs. iterative imputation, and r² score at convergence]
Notebook: github – @nprost / supervised missing
Conclusion: IterativeImputer is useful for small sample sizes.
3 Imputation is not enough
Pathological case [Josse... 2019]: y depends only on whether data is missing or not, e.g. tax-fraud detection. In theory this is MNAR, "Missing Not At Random"; imputing makes prediction impossible.
Solution: add a missingness indicator, an extra feature to predict from:
  SimpleImputer(add_indicator=True)
  IterativeImputer(add_indicator=True)
Simulation, y depends indirectly on missingness (censoring in the data): [Plot: r² score as a function of sample size for mean and iterative imputation, with and without the indicator]
Notebook: github – @nprost / supervised missing
Adding a mask is crucial; iterative imputation can be detrimental.
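The missingness indicator in action: with add_indicator=True, SimpleImputer appends one binary column per feature that had missing values, so the learner can see which entries were imputed:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# add_indicator=True appends a mask column flagging imputed entries
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
# [[1. 0.]
#  [2. 1.]
#  [3. 0.]]
```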
@GaelVaroquaux: Learning on dirty data
Prepare data via ColumnTransformer; use HistGradientBoosting.
Dirty categories: statistical modeling of non-curated categorical data. Give us your dirty data!
Similarity encoding: a robust solution that enables statistical models.
Dirty category software: http://dirty-cat.github.io
@GaelVaroquaux: Learning on dirty data
Prepare data via ColumnTransformer; use HistGradientBoosting.
Dirty categories: give us your dirty data! Similarity encoding; dirty category software: http://dirty-cat.github.io
Supervised learning with missing data: mean imputation + a missing indicator. Many more results in [Josse... 2019].
Ongoing research: http://project.inria.fr/dirtydata
Acknowledgements
Dirty categories: Patricio Cerda and Balázs Kégl
Missing data: Julie Josse, Erwan Scornet, Nicolas Prost
Implementation in scikit-learn thanks to the scikit-learn consortium partners
4 References
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1): 27–32, 2001.
D. B. Rubin. Inference and missing data. Biometrika, 63(3): 581–592, 1976.