Machine learning on non curated data
Dirty data made easy (in Python )
Gaël Varoquaux
With scikit-learn, machine learning is easy and fun
The problem is getting the data into the learner
www.kaggle.com/ash316/novice-to-grandmaster
Machine learning
Let X ∈ R^{n×p}
or a numpy array
In real life: often a pandas DataFrame
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA Library Assistant I
sklearn.compose.ColumnTransformer
Apply different preprocessing per column
Dirty Categories
Missing values
Talk outline
1 Column transforming
2 Encoding dirty categories
3 Learning with missing values
Python + scikit-learn
data mining research
statistics research
1 Column transforming
Pandas in, numpy out
(preprocessing)
1 Dataframes to numbers
df = pd.read_csv('employee_salary.csv')
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F 06/26/2006 Social Worker III
M 07/16/2007 Police Officer III
F 01/26/2000 Library Assistant I
Convert all values to numerical
Gender: One-hot encode
one_hot_enc = sklearn.preprocessing.OneHotEncoder()
one_hot_enc.fit_transform(df[['Gender']])
Gender (M) Gender (F) ...
1 0
0 1
1 0
0 1
Date: use pandas’ datetime support
dates = pd.to_datetime(df['Date First Hired'])
# the values hold nanoseconds since the epoch
dates.values.astype(float)
1 Transformers: fit & transform
Separating fitting from transforming
Avoids data leakage
Can be used in a Pipeline and cross_val_score
One-hot encoder
one_hot_enc.fit(df[['Gender']])
X = one_hot_enc.transform(df[['Gender']])
1) store which categories are present
2) encode the data accordingly
Better than pd.get_dummies: the columns are defined
from the train set, and do not change with the test set
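A minimal sketch (on hypothetical toy data) of why fitting on the train set matters: the encoder's columns are fixed at fit time, and with `handle_unknown='ignore'` a category never seen in training is encoded as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([['M'], ['F'], ['M']])
X_test = np.array([['F'], ['X']])   # 'X' never appears in the train set

# Columns are fixed at fit time, from the train set only
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)

encoded = enc.transform(X_test).toarray()
print(encoded)  # the unseen 'X' becomes an all-zero row
```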
For dates: FunctionTransformer
def date2num(date_str):
    out = pd.to_datetime(date_str).values.astype(np.float64)
    return out.reshape((-1, 1))  # 2D output

date_trans = preprocessing.FunctionTransformer(
    func=date2num, validate=False)
X = date_trans.transform(df['Date First Hired'])
1 ColumnTransformer: assembling
Applies different transformers to columns
These can be complex pipelines
column_trans = compose.make_column_transformer(
    (one_hot_enc, ['Gender', 'Employee Position Title']),
    (date_trans, 'Date First Hired'),
)
X = column_trans.fit_transform(df)
From DataFrame to array with heterogeneous
preprocessing & feature engineering
Benefit: model selection on dataframe
model = make_pipeline(column_trans,
                      HistGradientBoostingClassifier())
scores = cross_val_score(model, df, y)
2 Encoding dirty categories
PhD work of Patricio Cerda [Cerda... 2018]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
2 The problem of dirty categories
Breaks OneHotEncoder
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
2 Data curation Database normalization
Feature engineering
Employee Position Title
Master Police Officer
Social Worker III
Police Officer II
Social Worker II
Police Officer III
⇒
Position Rank
Police Officer Master
Social Worker III
Police Officer II
Social Worker II
Police Officer III
Merging entities Deduplication & record linkage
Output a “clean” database Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult
without
supervision
Potentially
suboptimal
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Hard to make automatic and turn-key
Harder than supervised learning
Our goal: supervised learning on dirty categories
The statistical question
should inform curation
Pfizer Corporation Hong Kong
=?
Pfizer Pharmaceuticals Korea
2 Adding similarities to one-hot encoding
One-hot encoding
London Londres Paris
Londres 0 1 0
London 1 0 0
Paris 0 0 1
X ∈ R^{n×p}
new categories?
link categories?
Similarity encoding [Cerda... 2018]
London Londres Paris
Londres 0.3 1.0 0.0
London 1.0 0.3 0.0
Paris 0.0 0.0 1.0
string distance(Londres, London)
2 Some string similarities
Levenshtein
Number of edits to transform one string into the other
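As a concrete reference, the Levenshtein distance can be computed with the classic dynamic program (a sketch for illustration, not the implementation used in the paper):

```python
def levenshtein(s1, s2):
    """Number of single-character insertions, deletions and
    substitutions needed to turn s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        cur = [i]
        for j, c2 in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein('London', 'Londres'))  # 3 edits
```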
Jaro-Winkler
d_jaro(s1, s2) = (1/3) · (m/|s1| + m/|s2| + (m − t)/m)
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters
e.g. 3-grams of “London”: Lon, ond, ndo, ...
similarity = (# n-grams in common) / (# n-grams in total)
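The n-gram similarity above can be sketched as a set overlap over character 3-grams (one common variant; the exact definition used in the paper may differ slightly, e.g. it may use counts rather than sets):

```python
def ngrams(s, n=3):
    # distinct character n-grams of a string
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(s1, s2, n=3):
    # ratio of shared n-grams to total distinct n-grams
    g1, g2 = ngrams(s1, n), ngrams(s2, n)
    return len(g1 & g2) / len(g1 | g2)

print(ngram_similarity('London', 'Londres'))  # 2 shared / 7 total ≈ 0.29
```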
2 Python implementation: dirty_cat
dirty_cat, dirty-category software:
http://dirty-cat.github.io
from dirty_cat import SimilarityEncoder
similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Represent each category by the average target y
For example Police Officer III
→ the average salary of Police Officer III
[Plot: average employee salary (y, from 40 000 to 140 000) per position title, ranging from Crossing Guard to Manager II]
Embedding close-by categories with the same
y can help build a simple decision function.
from dirty_cat import TargetEncoder
target_encoder = TargetEncoder()
transformed_values = target_encoder.fit_transform(df, y)
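The core idea of target encoding can be sketched with a pandas groupby on toy data (this ignores the empirical-Bayes shrinkage of [Micci-Barreca 2001] that a real implementation adds for rare categories):

```python
import pandas as pd

df = pd.DataFrame({
    'Employee Position Title': ['Bus Operator', 'Bus Operator',
                                'Police Aide', 'Police Aide',
                                'Manager I'],
    'salary': [40000, 42000, 35000, 37000, 90000],
})

# Each category is represented by the mean target of its rows
means = df.groupby('Employee Position Title')['salary'].mean()
encoded = df['Employee Position Title'].map(means)
print(encoded.tolist())  # [41000.0, 41000.0, 36000.0, 36000.0, 90000.0]
```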
2 Experimental results: prediction performance
Average rank on 7 datasets (lower is better)

                      Linear model   Gradient-boosted trees
One-hot encoding          4.7              6.0
Target encoding           5.3              4.3
Similarity encoding:
  Jaro-Winkler            3.4              3.6
  Levenshtein             3.1              3.0
  3-gram                  1.1              1.9

Best: similarity encoding with the 3-gram similarity [Cerda... 2018]
Also, gradient-boosted trees work much better
2 Dirty categories blow up dimension
Wow, lots of datasets!
New words in natural language
X ∈ R^{n×p}, p is large
Statistical problems
Computational problems
2 Tackling the high cardinality
Similarity encoding, one-hot encoding
= Prototype methods
How to choose a small number
of prototypes?
All training-set entries? ⇒ huge dimensionality
Most frequent?
Maybe the right prototypes ∉ the training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
2 n-grams grow, but there is redundancy
[Figure: growth of the number of distinct n-grams in natural language]
2 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
2 Latent category model
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
Models strings as a linear combination of substrings
[Figure: binary entry-by-3-gram count matrix; rows: police, officer, pol off, polis, policeman, policier; columns: 3-grams er_, cer, fic, off, _of, ce_, ice, lic, pol]
[Figure: the count matrix factorizes as (entries × latent categories) · (latent categories × 3-grams): which latent categories are in an entry, and which substrings are in a latent category]
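The Gamma-Poisson factorization is close in spirit to non-negative matrix factorization of the string-by-n-gram count matrix; a minimal sketch with scikit-learn, where NMF stands in for the GaP model (toy titles, not the talk's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

titles = ['Master Police Officer', 'Police Officer III',
          'Police Sergeant', 'Bus Operator',
          'Equipment Operator I', 'Library Assistant I']

# String-by-3-gram count matrix
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
counts = vectorizer.fit_transform(titles)

# Factorize: entries ≈ activations @ topics, all non-negative
nmf = NMF(n_components=2, max_iter=500, random_state=0)
activations = nmf.fit_transform(counts)   # latent categories per entry
topics = nmf.components_                  # 3-grams per latent category

print(activations.shape, topics.shape)
```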
2 String models of latent categories
Encodings that extract latent categories
[Figure: job titles (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) scored against latent categories named library, operator, specialist, warehouse, manager, community, rescue, officer]
2 String models of latent categories
Inferring plausible feature names
[Figure: the same job titles with inferred feature names per latent category, e.g. accountant/assistant/library, coordinator/equipment/operator, administration/specialist, craftsworker/warehouse, crossing/program/manager, technician/mechanic/community, firefighter/rescuer/rescue, correctional/correction/officer]
2 Data science with dirty categories
[Figure: permutation importances (0.0–0.2) of inferred feature names: Information/Technology/Technologist, Officer/Office/Police, Liquor/Clerk/Store, School/Health/Room, Environmental/Telephone/Capital, Lieutenant/Captain/Chief, Income/Assistance/Compliance, Manager/Management/Property]
3 Learning with missing values
[Josse... 2019]
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA Library Assistant I
Why doesn’t the #$@! machine learning
toolkit work?!
Machine learning models need entries in a vector
space (or at least a metric space).
NA ∉ ℝ
More than an implementation problem
Categorical variables are discrete anyhow
For missing values in categorical variables,
create a special category “missing”.
Rest of the talk: NA in numerical variables
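The “missing” category trick is a one-liner with pandas; a minimal sketch on toy data:

```python
import pandas as pd

titles = pd.Series(['Bus Operator', None, 'Police Aide', None],
                   name='Employee Position Title')

# Treat missingness in a categorical column as a category of its own
filled = titles.fillna('missing')
print(filled.tolist())
# ['Bus Operator', 'missing', 'Police Aide', 'missing']
```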
3 Classic statistics points of view
Model a) a complete data-generating process
Model b) a random process occluding entries
Missing At Random (MAR)
For non-observed values, the probability of missingness
does not depend on the non-observed value itself.
Proper definition in [Josse... 2019]
Theorem [Rubin 1976]: under MAR, maximizing the likelihood of the
observed data while ignoring (marginalizing over) the unobserved
values gives the maximum likelihood of model a).
Missing Completely At Random (MCAR)
Missingness is independent of the data
Missing Not At Random (MNAR)
Missingness is not ignorable
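For illustration (not from the talk), MCAR and MNAR missingness can be simulated by masking a complete dataset; the MCAR mask is drawn independently of the values, while the MNAR mask depends on them:

```python
import numpy as np

rng = np.random.default_rng(0)
X_complete = rng.normal(size=(1000, 2))

# MCAR: the mask is drawn independently of the values
mask = rng.random(X_complete.shape) < 0.2
X_mcar = X_complete.copy()
X_mcar[mask] = np.nan

# MNAR (for contrast): large values of the first column go missing
X_mnar = X_complete.copy()
X_mnar[X_mnar[:, 0] > 1.0, 0] = np.nan

print(np.isnan(X_mcar).mean())  # close to the 0.2 masking rate
```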
[Figure: scatter plots of the complete data and of the observed data under MCAR and MNAR missingness]
But
There isn’t always an underlying unobserved value
(the age of the spouse, for singles?)
We are not trying to maximize likelihoods
The #$@! machine learning toolkit still
doesn’t work?!
3 Imputation
Fill in information
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA–2000 Social Worker IV
M 07/16/2007 Police Officer III
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA–2012 Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA–2014 Library Assistant I
Large statistical literature
Procedures and results focused on in-sample settings
How about completing the test set with the train set?
What to do with the prediction target y?
3 Imputation procedures that work out of sample
Mean imputation (a special case of univariate imputation)
Replace NA by the mean of the feature
sklearn.impute.SimpleImputer
Conditional imputation
Modeling one feature as a function of others
Possible implementation:
iteratively predict each feature as a function of the others
Classic implementations in R: MICE, missforest
sklearn.impute.IterativeImputer
new in 0.21!!
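Out-of-sample imputation in a sketch: fit the imputer on the train set, then reuse the train statistics on the test set (toy data for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan]])
X_test = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)  # learns the train mean: (1 + 3) / 2 = 2

filled = imputer.transform(X_test)
print(filled)
# [[2.], [5.]]: the test NA is filled with the *train* mean
```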
Classic statistics point of view
Mean imputation is disastrous, because it distorts the distribution
[Figure: scatter plot of mean-imputed data]
“Congeniality” conditions: good imputation must
preserve the data properties used by later analysis steps
3 Imputation for supervised learning
Theorem [Josse... 2019]
For a powerful learner (universally consistent)
imputing both train and test with the mean of
train is consistent
i.e., it converges to the best possible prediction
Intuition
The learner “recognizes” imputed entries and
compensates at test time
Simulation: MCAR + gradient boosting
[Figure: r² score vs sample size for mean and iterative imputation; both converge, iterative imputation is better at small sample sizes]
Notebook: github – @nprost / supervised_missing
Conclusion:
IterativeImputer is useful for small sample sizes
3 Imputation is not enough
Pathological case [Josse... 2019]
y depends only on whether data is missing or not
e.g. tax-fraud detection
theory: MNAR = “Missing Not At Random”
Imputing makes prediction impossible
Solution
Add a missingness indicator: an extra feature to predict from
...SimpleImputer(add_indicator=True)
...IterativeImputer(add_indicator=True)
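`add_indicator=True` appends one mask column per feature that had missing values, so the learner sees both the imputed value and the missingness itself (a sketch on toy data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

# One indicator column is appended for each feature with missing values
imputer = SimpleImputer(strategy='mean', add_indicator=True)
out = imputer.fit_transform(X)

print(out)
# column 0 imputed with its mean (2.0); column 2 is the missingness mask
```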
Simulation: y depends indirectly on missingness (censoring in the data)
[Figure: r² score vs sample size for mean and iterative imputation, with and without the missingness indicator; the indicator variants reach higher scores]
Notebook: github – @nprost / supervised_missing
Adding a mask is crucial
Iterative imputation can be detrimental
@GaelVaroquaux
Learning on dirty data
Prepare data via ColumnTransformer
Use HistGradientBoosting
Dirty categories
Statistical modeling of non-curated categorical data
Give us your dirty data
Similarity encoding
robust solution that enables statistical models
Dirty category software:
http://dirty-cat.github.io
Supervised learning with missing data
Mean imputation + missing indicator
Much more results in [Josse... 2019]
http://project.inria.fr/dirtydata
Ongoing research
Acknowledgements
Dirty categories
Patricio Cerda and Balázs Kégl
Missing data
Julie Josse, Erwan Scornet, Nicolas Prost
Implementation in scikit-learn
thanks to scikit-learn consortium partners
4 References I
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for
learning with dirty categorical variables. 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the
consistency of supervised learning with missing values. arXiv
preprint arXiv:1902.06931, 2019.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction
problems. ACM SIGKDD Explorations Newsletter, 3(1):
27–32, 2001.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):
581–592, 1976.

Machine learning on non curated data

  • 1.
    Machine learning onnon curated data Dirty data made easy (in Python ) Ga¨el Varoquaux,
  • 2.
    Machine learning onnon curated data Dirty data made easy (in Python ) Ga¨el Varoquaux,
  • 3.
    With scikit-learn, machinelearning is easy and fun The problem is getting the data into the learner
  • 4.
    With scikit-learn, machinelearning is easy and fun The problem is getting the data into the learner www.kaggle.com/ash316/novice- to-grandmaster
  • 5.
    Machine learning Let X∈ Rn×p or a numpy array
  • 6.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I
  • 7.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I sklearn.compose.Column Transformer Apply different preprocessing per columns
  • 8.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I Dirty Categories
  • 9.
    Machine learning Let X∈ Rn×p or a numpy array Real life often as pandas dataframe Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F NA Social Worker IV M 07/16/2007 Police Officer III F 02/05/2007 Police Aide M 01/13/2014 Electrician I M 04/28/2002 Bus Operator M NA Bus Operator F 06/26/2006 Social Worker III F 01/26/2000 Library Assistant I M NA Library Assistant I Missing values
  • 10.
    Talk outline 1 Columntransforming 2 Encoding dirty categories 3 Learning with missing values Python + scikit-learn data mining research statistics research G Varoquaux 4
  • 11.
    1 Column transforming Pandasin, numpy out (preprocessing) G Varoquaux 5
  • 12.
    1 Dataframes tonumbers df = pd.read csv(’employee_salary.csv’) Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III F 01/26/2000 Library Assistant I Convert all values to numerical G Varoquaux 6
  • 13.
    1 Dataframes tonumbers df = pd.read csv(’employee_salary.csv’) Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III F 01/26/2000 Library Assistant I Convert all values to numerical Gender: One-hot encode one hot enc = sklearn. preprocessing .OneHotEncoder() one hot enc. fit transform (df[[’Gender’]]) Gender (M) Gender (F) ... 1 0 0 1 1 0 0 1G Varoquaux 6
  • 14.
    1 Dataframes tonumbers df = pd.read csv(’employee_salary.csv’) Gender Date Hired Employee Position Title M 09/12/1988 Master Police Officer F 06/26/2006 Social Worker III M 07/16/2007 Police Officer III F 01/26/2000 Library Assistant I Convert all values to numerical Gender: One-hot encode Date: use pandas’ datetime support d a t e s = pd. t o d a t e t i m e ( df [’Date First Hired ’]) # the values hold the data in secs d a t e s . v a l u e s . a s t y p e (float) G Varoquaux 6
  • 15.
    1 Transformers: fit& transform Separating fitting from transforming Avoids data leakage Can be used in a Pipeline and cross val score One-hot encoder one hot enc. fit (df[[’Gender’]]) X = one hot enc.transform(df[[’Gender’]]) 1) store which categories are present 2) encode the data accordingly Better than pd.get dummies because columns are defined from train set, and do not change with test set G Varoquaux 7
  • 16.
    1 Transformers: fit& transform Separating fitting from transforming Avoids data leakage Can be used in a Pipeline and cross val score For dates: FunctionTransformer def date2num ( d a t e s t r ): out = pd. t o d a t e t i m e ( d a t e s t r ). v a l u e s . a s t y p e (np.float) return out . r e s h a p e ((-1, 1)) # 2D output d a t e t r a n s = p r e p r o c e s s i n g . F u n c t i o n T r a n s f o r m e r ( func =date2num , v a l i d a t e = F a l s e ) X = d a t e t r a n s . t r a n s f o r m ( df [’Date First Hired ’] G Varoquaux 7
  • 17.
    1 ColumnTransformer: assembling Appliesdifferent transformers to columns These can be complex pipelines c o l u m n t r a n s = compose . m a k e c o l u m n t r a n s f o r m e r ( ( one hot enc , [’Gender ’, ’Employee Position Title ’]), ( d a t e t r a n s , ’Date First Hired ’), ) X = c o l u m n t r a n s . f i t t r a n s f o r m ( df ) From DataFrame to array with heteroge- neous preprocessing & feature engineering G Varoquaux 8
  • 18.
    1 ColumnTransformer: assembling Appliesdifferent transformers to columns These can be complex pipelines c o l u m n t r a n s = compose . m a k e c o l u m n t r a n s f o r m e r ( ( one hot enc , [’Gender ’, ’Employee Position Title ’]), ( d a t e t r a n s , ’Date First Hired ’), ) X = c o l u m n t r a n s . f i t t r a n s f o r m ( df ) From DataFrame to array with heteroge- neous preprocessing & feature engineering Benefit: model selection on dataframe model = make pipeline(column trans, HistGradientBoostingClassifier) scores = cross val score(model, df, y) G Varoquaux 8
  • 19.
    2 Encoding dirtycategories PhD word of Patricio Cerda [Cerda... 2018] Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I
  • 20.
    2 The problemof dirty categories Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Break OneHotEncoder Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 10
  • 21.
    2 Data curationDatabase normalization Feature engineering Employee Position Title Master Police Officer Social Worker III Police Officer II Social Worker II Police Officer III ⇒ Position Rank Police Officer Master Social Worker III Police Officer II Social Worker II Police Officer III G Varoquaux 11
  • 22.
    2 Data curationDatabase normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC Pfizer International LLC Pfizer Limited Pfizer Corporation Hong Kong Limited Pfizer Pharmaceuticals Korea Limited ... Difficult without supervision Potentially suboptimal Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea G Varoquaux 11
  • 23.
    2 Data curationDatabase normalization Feature engineering Employee Position Title Master Police Officer Social Worker III ... ⇒ Position Rank Police Officer Master Social Worker III ... Merging entities Deduplication & record linkage Output a “clean” database Company name Pfizer Inc. Pfizer Pharmaceuticals LLC ... Hard to make automatic and turn-key Harder than supervised learning G Varoquaux 11
  • 24.
    Our goal: supervisedlearning on dirty categories The statistical question should inform curation Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea G Varoquaux 12
  • 25.
    2 Adding similaritiesto one-hot encoding One-hot encoding London Londres Paris Londres 0 1 0 London 1 0 0 Paris 0 0 1 X ∈ Rn×p new categories? link categories? Similarity encoding [Cerda... 2018] London Londres Paris Londres 0.3 1.0 0.0 London 1.0 0.3 0.0 Paris 0.0 0.0 1.0 string distance(Londres, London) G Varoquaux 13
  • 26.
    2 Some stringsimilarities Levenshtein Number of edit on one string to match the other Jaro-Winkler djaro(s1, s2) = m 3|s1| + m 3|s2| + m−t 3m m: number of matching characters t: number of character transpositions n-gram similarity n-gram: group of n consecutive characters 3-gram1 L 3-gram2 on 3-gram3 do... similarity = #n-gram in comon #n-gram in total G Varoquaux 14
  • 27.
2 Python implementation: dirty_cat
Dirty category software: http://dirty-cat.github.io

from dirty_cat import SimilarityEncoder

similarity_encoder = SimilarityEncoder(similarity='ngram')
transformed_values = similarity_encoder.fit_transform(df)
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Represent each category by the average target y.
For example: Police Officer III → average salary of Police Officer III.
[Plot: employee salary y, from 40 000 to 140 000, per position: Crossing Guard, Liquor Store Clerk I, Library Aide, Police Cadet, Public Safety Reporting Aide I, Administrative Specialist II, Management and Budget Specialist III, Manager III, Manager I, Manager II]
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Embedding close-by categories with the same y can help build a simple decision function.
2 Other approach: TargetEncoder [Micci-Barreca 2001]
Dirty category software: http://dirty-cat.github.io

from dirty_cat import TargetEncoder

target_encoder = TargetEncoder()
transformed_values = target_encoder.fit_transform(df, y)

(Note: fitting a target encoder requires the target y.)
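Conceptually, target encoding is just a per-category mean of y. A minimal pandas sketch, with made-up column names and values:

```python
import pandas as pd

df = pd.DataFrame({
    'position': ['Police Officer III', 'Police Officer III', 'Bus Operator'],
    'salary':   [60_000, 70_000, 50_000],
})
# Each category is replaced by the mean target over that category
per_category_mean = df.groupby('position')['salary'].mean()
encoded = df['position'].map(per_category_mean)
print(encoded.tolist())  # [65000.0, 65000.0, 50000.0]
```

In practice [Micci-Barreca 2001] shrinks the per-category mean toward the global mean for rare categories, to avoid overfitting them.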
2 Experimental results: prediction performance
Average rank on 7 datasets (lower is better):

                            Linear model   Gradient-boosted trees
  One-hot encoding              4.7               6.0
  Target encoding               5.3               4.3
  Similarity encoding:
    Jaro-Winkler                3.4               3.6
    Levenshtein                 3.1               3.0
    3-gram                      1.1               1.9

Best: similarity encoding with the 3-gram similarity [Cerda... 2018].
Also, gradient-boosted trees work much better.
2 Dirty categories blow up dimension
New words in natural language; wow, lots of datasets!
X ∈ R^{n×p} with p large: statistical problems, computational problems.
2 Tackling the high cardinality
Similarity encoding and one-hot encoding are prototype methods. How to choose a small number of prototypes?
The whole training set ⇒ huge dimensionality. The most frequent categories? Maybe the right prototypes ∉ the training set ("big cat", "fat cat", "big dog", "fat dog").
⇒ Estimate prototypes.
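The "most frequent" baseline mentioned above is a one-liner; a toy sketch with the slide's example strings:

```python
from collections import Counter

categories = ['big cat', 'fat cat', 'big cat', 'big dog',
              'fat dog', 'big cat', 'fat cat']
# Keep the k most frequent categories as prototypes;
# every entry is then encoded only by its similarity to them
k = 2
prototypes = [cat for cat, _ in Counter(categories).most_common(k)]
print(prototypes)  # ['big cat', 'fat cat']
```

This illustrates the limitation the slide points out: a better prototype (e.g. plain "cat") may not even appear in the training set, which motivates estimating prototypes instead.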
2 n-grams grow, but there is redundancy
Natural language
2 Substring information
DrugName: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol
Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III
2 Latent category model
Topic model on substrings (GaP: Gamma-Poisson matrix factorization), e.g. on 3-grams.
Models strings as a linear combination of substrings: the count matrix of entries ("police officer", "pol off", "polis", "policeman", "policier") × 3-grams ("er_", "cer", "fic", "off", "_of", "ce_", "ice", "lic", "pol") is factorized into
  (entries × latent categories) × (latent categories × 3-grams):
which latent categories are in an entry, and which substrings are in a latent category.
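The factorization can be sketched with scikit-learn building blocks. Here NMF stands in for the Gamma-Poisson model of the slide, and the example entries are made up:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

entries = ['police officer', 'master police officer',
           'bus operator', 'equipment operator']

# Entries x 3-gram count matrix (char_wb pads words with spaces)
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# Non-negative low-rank factorization:
#   activations:  which latent categories are in an entry
#   components_:  which 3-grams are in a latent category
nmf = NMF(n_components=2, random_state=0)
activations = nmf.fit_transform(counts)
print(activations.shape)  # (4, 2)
```

With more data, one topic concentrates on "officer"-like 3-grams and the other on "operator"-like ones, which is what makes the inferred feature names of the next slides readable.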
2 String models of latent categories
Encodings that extract latent categories.
[Plot: loadings of each job title (Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant) on latent categories: library, operator, specialist, warehouse, manager, community, rescue officer]
2 String models of latent categories
Inferring plausible feature names.
[Plot: the same job titles against inferred feature names, each a triplet of words, e.g. "accountant, assistant, library"; "coordinator, equipment, operator"; "administration, specialist"; "craftsworker, warehouse"; "crossing, program, manager"; "technician, mechanic, community"; "firefighter, rescuer, rescue"; "correction, officer"]
2 Data science with dirty categories
[Plot: permutation importances (0.0 to 0.2) of the inferred feature names: Information, Technology, Technologist; Officer, Office, Police; Liquor, Clerk, Store; School, Health, Room; Environmental, Telephone, Capital; Lieutenant, Captain, Chief; Income, Assistance, Compliance; Manager, Management, Property]
3 Learning with missing values [Josse... 2019]
Gender   Date Hired   Employee Position Title
M        09/12/1988   Master Police Officer
F        NA           Social Worker IV
M        07/16/2007   Police Officer III
F        02/05/2007   Police Aide
M        01/13/2014   Electrician I
M        04/28/2002   Bus Operator
M        NA           Bus Operator
F        06/26/2006   Social Worker III
F        01/26/2000   Library Assistant I
M        NA           Library Assistant I
    Why doesn’t the#$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). NA /∈ R More than an implementation problem G Varoquaux 26
  • 46.
    Why doesn’t the#$@! machine learning toolkit work?! Machine learning models need entries in a vector space (or at least a metric space). NA /∈ R More than an implementation problem Categorical are discrete anyhow For missing values in categorical variables, create a special categorie ”missing”. Rest of talk on NA in numerical variables G Varoquaux 26
  • 47.
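The "missing" category trick for categorical variables, sketched with pandas and scikit-learn (the column name is made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'position': ['Bus Operator', None, 'Social Worker III']})

# Treat missingness itself as a category before one-hot encoding
df_filled = df.fillna('missing')
one_hot = OneHotEncoder().fit_transform(df_filled).toarray()
print(one_hot.shape)  # 3 rows, 3 categories: Bus Operator, Social Worker III, missing
```

Because the category is discrete, "missing" carries information just like any other level, with no imputation needed.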
3 Classic statistics points of view
Model a): a complete data-generating process. Model b): a random process occluding entries.
Missing At Random (MAR): for non-observed values, the probability of missingness does not depend on the non-observed value. Proper definition in [Josse... 2019].
Theorem [Rubin 1976]: under MAR, maximizing the likelihood of the observed data while ignoring (marginalizing) the unobserved values gives the maximum likelihood of model a).
Missing Completely At Random (MCAR): missingness is independent of the data.
Missing Not At Random (MNAR): missingness is not ignorable.
[Plot: a complete bivariate sample, and the same sample under MCAR and MNAR occlusion]
But: there isn't always an unobserved value (the age of the spouse of singles?), and we are not trying to maximize likelihoods.
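A tiny simulation illustrating MCAR vs. MNAR; the distribution and the censoring threshold are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# MCAR: each entry is dropped with a probability independent of its value
mcar_observed = x[rng.random(10_000) >= 0.3]

# MNAR (censoring): the large values are the ones that go missing
mnar_observed = x[x <= 1.0]

# Under MCAR the observed mean stays unbiased; under MNAR it does not
print(round(mcar_observed.mean(), 2), round(mnar_observed.mean(), 2))
```

The observed mean under censoring is systematically below the true mean of 0, which is why MNAR missingness is "not ignorable".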
The #$@! machine learning toolkit still doesn't work?!
3 Imputation: fill in information
Gender   Date Hired   Employee Position Title
M        09/12/1988   Master Police Officer
F        NA → 2000    Social Worker IV
M        07/16/2007   Police Officer III
M        01/13/2014   Electrician I
M        04/28/2002   Bus Operator
M        NA → 2012    Bus Operator
F        06/26/2006   Social Worker III
F        01/26/2000   Library Assistant I
M        NA → 2014    Library Assistant I
Large statistical literature; procedures and results focused on in-sample settings.
How about completing the test set with the train set? What to do with the prediction target y?
3 Imputation procedures that work out of sample
Mean imputation (special case of univariate imputation): replace NA by the mean of the feature.
  sklearn.impute.SimpleImputer
Conditional imputation: model one feature as a function of the others. A possible implementation: iteratively predict each feature as a function of the others. Classic implementations in R: MICE, missForest.
  sklearn.impute.IterativeImputer, new in 0.21!
Classic statistics point of view: mean imputation is disastrous, because it distorts the distribution.
[Plot: distribution of a feature before and after mean imputation]
"Congeniality" conditions: a good imputation must preserve the data properties used by later analysis steps.
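Both procedures in scikit-learn, on a made-up toy matrix; note that IterativeImputer is experimental and must be enabled explicitly:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan],
              [5.0, 6.0]])

# Univariate: each NA becomes its column's mean (here 3.0 and 4.0)
X_mean = SimpleImputer(strategy='mean').fit_transform(X)

# Conditional: iteratively regress each feature on the others
X_iter = IterativeImputer(random_state=0).fit_transform(X)
print(X_mean[1, 0], X_mean[2, 1])  # 3.0 4.0
```

Both transformers learn their fill-in values on the train set, so the same fitted object can complete a test set out of sample.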
3 Imputation for supervised learning
Theorem [Josse... 2019]: for a powerful learner (universally consistent), imputing both train and test with the mean of the train set is consistent, i.e. it converges to the best possible prediction.
Intuition: the learner "recognizes" imputed entries and compensates at test time.
Simulation, MCAR + gradient boosting: [Plot: r² score as a function of sample size (10² to 10⁴) for mean vs. iterative imputation, and r² score at convergence]
Notebook: github – @nprost / supervised missing
Conclusion: IterativeImputer is useful for small sample sizes.
3 Imputation is not enough
Pathological case [Josse... 2019]: y depends only on whether data is missing or not, e.g. tax-fraud detection. In theory this is MNAR, "Missing Not At Random"; imputing makes prediction impossible.
Solution: add a missingness indicator, an extra feature to predict from:
  SimpleImputer(add_indicator=True)
  IterativeImputer(add_indicator=True)
Simulation, y depends indirectly on missingness (censoring in the data): [Plot: r² score as a function of sample size for mean and iterative imputation, with and without the indicator]
Notebook: github – @nprost / supervised missing
Adding a mask is crucial; iterative imputation can be detrimental.
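The missingness indicator in action: with add_indicator=True, SimpleImputer appends one binary column per feature that had missing values, so the learner can see which entries were imputed:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])

# add_indicator=True appends a mask column flagging imputed entries
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
# [[1. 0.]
#  [2. 1.]
#  [3. 0.]]
```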
@GaelVaroquaux: Learning on dirty data
Prepare data via ColumnTransformer; use HistGradientBoosting.
Dirty categories: statistical modeling of non-curated categorical data. Give us your dirty data!
Similarity encoding: a robust solution that enables statistical models.
Dirty category software: http://dirty-cat.github.io
@GaelVaroquaux: Learning on dirty data
Prepare data via ColumnTransformer; use HistGradientBoosting.
Dirty categories: give us your dirty data! Similarity encoding; dirty category software: http://dirty-cat.github.io
Supervised learning with missing data: mean imputation + a missing indicator. Many more results in [Josse... 2019].
Ongoing research: http://project.inria.fr/dirtydata
Acknowledgements
Dirty categories: Patricio Cerda and Balázs Kégl
Missing data: Julie Josse, Erwan Scornet, Nicolas Prost
Implementation in scikit-learn thanks to the scikit-learn consortium partners
4 References
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. 2018.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
D. Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1): 27–32, 2001.
D. B. Rubin. Inference and missing data. Biometrika, 63(3): 581–592, 1976.