Similarity encoding for learning
on dirty categorical variables
Gaël Varoquaux, with Patricio Cerda and Balázs Kégl
Agenda today
Bring to light a problem
Show that statistical learning can solve it
Machine learning
Let X ∈ R^{n×p}
The data
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F 11/19/1989 Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M 03/02/2008 Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M 11/22/2010 Library Assistant I
A data cleaning problem?
A feature engineering problem?
G Varoquaux 2
The problem of “dirty categories”
Non-curated categorical entries
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
Dirty categories in the wild
Employee Salaries: salary information for employees
of Montgomery County, Maryland.
Employee Position Title
Master Police Officer
Social Worker IV
...
Open Payments: payments by health care
companies to medical doctors or hospitals.
Company name Frequency
Pfizer Inc. 79,073
Pfizer Pharmaceuticals LLC 486
Pfizer International LLC 425
Pfizer Limited 13
Pfizer Corporation Hong Kong Limited 4
Pfizer Pharmaceuticals Korea Limited 3
...
Medical charges: patient discharges: utilization,
payment, and hospital-specific charges across 3 000
US hospitals.
...
Nothing like this on the UCI machine-learning repository
Dirty categories in the wild
[Figure: number of categories vs. number of rows for each dataset
(beer reviews, road safety, traffic violations, midwest survey,
open payments, employee salaries, medical charges), with reference
curves at 100, sqrt(n), and 5 log2(n) categories.]
Mechanisms creating dirty categories
Typos
Open-ended entries
Merging different data sources
Our goal: a statistical view of supervised
learning on dirty categories
The statistical question
should inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Rest of the talk:
1 Related approaches
2 Similarity encoding
3 Empirical study
1 Related approaches
Database cleaning
Natural language processing
Machine learning
1 A database cleaning point of view
Recognizing / merging entities
Record linkage:
matching across different (clean) tables
Deduplication/fuzzy matching:
matching in one dirty table
Techniques [Fellegi and Sunter 1969]
Supervised learning (known matches)
Clustering
Expectation Maximization to learn a metric
Outputs a “clean” database
1 A natural language processing point of view
Stemming / normalization
Set of (handcrafted) rules
Need to be adapted to new language / new domains
Semantics
Relate different discrete objects
Formal semantics (entity resolution in knowledge bases)
Distributional semantics:
“a word is characterized by the company it keeps”
Character-level NLP
For entity resolution [Klein... 2003]
For semantics [Bojanowski... 2017]
“London” & “Londres” may carry different information
1 A machine-learning point of view
High-cardinality categorical data
Encoding each category blows up the dimension
Target encoding [Micci-Barreca 2001]
Represent each category by
a simple statistical link to the target y
e.g. E[y | X_i = C_k]
1D real-number embedding for a categorical column
Bring close categories with same link to y
Great for tree-based machine-learning [Dorogush...]
But fails on unseen categories
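Target encoding can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation; the `prior_weight` smoothing knob is an assumption, shrinking rare categories toward the global mean in the spirit of [Micci-Barreca 2001].

```python
from collections import defaultdict

def target_encode(categories, y, prior_weight=10.0):
    # Replace each category by a smoothed estimate of E[y | X = category].
    # `prior_weight` (a hypothetical knob) shrinks rare categories
    # toward the global mean of y.
    global_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    encoding = {c: (sums[c] + prior_weight * global_mean)
                   / (counts[c] + prior_weight) for c in counts}
    # Unseen categories fall back to the global mean: this is exactly
    # where target encoding carries no category-specific information.
    return encoding, global_mean

encoding, fallback = target_encode(
    ["Bus Operator", "Bus Operator", "Social Worker III"],
    [40000.0, 42000.0, 50000.0], prior_weight=0.0)
```

The fallback to the global mean makes the failure mode concrete: a category seen only at test time gets a constant, uninformative value.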
2 Similarity encoding
[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]
1. One-hot encoding maps categories to vector spaces
2. String similarities capture information
2 Adding similarities to one-hot encoding
One-hot encoding
London Londres Paris
Londres 0 1 0
London 1 0 0
Paris 0 0 1
X ∈ Rn×p
p grows fast
new categories?
link categories?
Similarity encoding
London Londres Paris
Londres 0.3 1.0 0.0
London 1.0 0.3 0.0
Paris 0.0 0.0 1.0
(0.3 = string similarity between "Londres" and "London")
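The table above can be reproduced with a minimal sketch, assuming a set-based Jaccard similarity over character 3-grams (the paper's exact similarity may weight shared n-grams differently):

```python
def ngram_similarity(a, b, n=3):
    # Jaccard similarity between the sets of character n-grams.
    ga = {a.lower()[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b.lower()[i:i + n] for i in range(len(b) - n + 1)}
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def similarity_encode(entries, reference):
    # One column per reference category (as in one-hot encoding),
    # but filled with string similarities instead of 0/1.
    return [[ngram_similarity(e, r) for r in reference] for e in entries]

X = similarity_encode(["Londres", "London", "Paris"],
                      reference=["London", "Londres", "Paris"])
# Exact matches get 1.0, "Londres" vs "London" a partial similarity
# (2 shared trigrams out of 7, i.e. about 0.3), "Paris" stays at 0.
```

A new category at test time still gets a meaningful vector: its similarities to the reference categories, instead of an all-zero one-hot row.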
2 Some string similarities
Levenshtein
Number of edit operations on one string to match
the other
Jaro-Winkler
d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m - t)/(3m)
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters
similarity = (# n-grams in common) / (# n-grams in total)
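The Levenshtein distance is the classic dynamic program (a sketch; dedicated libraries provide much faster implementations):

```python
def levenshtein(s1, s2):
    # Minimum number of single-character insertions, deletions and
    # substitutions turning s1 into s2 (row-by-row dynamic programming).
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # delete c1
                               current[j - 1] + 1,             # insert c2
                               previous[j - 1] + (c1 != c2)))  # substitute
        previous = current
    return previous[-1]
```

For example, "Londres" is 3 edits away from "London" (two substitutions and one deletion).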
3 Empirical study
3 Datasets with dirty categories
Dataset             # rows   # categories   Least frequent      Prediction type
                                            category (count)
medical charges     160k     100            613                 regression
employee salaries   9.2k     385            1                   regression
open payments       100k     973            1                   binary clf
midwest survey      2.8k     1009           1                   multiclass clf
traffic violations  100k     3043           1                   multiclass clf
road safety         10k      4617           1                   binary clf
beer reviews        10k      4634           1                   multiclass clf
7 datasets! All open
3 Experiments
Cross-validation & measure prediction
Focus on prediction rather than in-sample statistics
Easier non-parametric evaluation
Amenable to high dimension
3 Results: gradient boosted trees
[Figure: prediction scores on each dataset (medical charges, employee
salaries, open payments, midwest survey, traffic violations, road
safety, beer reviews), comparing similarity encoding (3-gram,
Levenshtein ratio, Jaro-Winkler) with target, one-hot, and hash
encoding, using gradient boosted trees. Average rankings across
datasets span 1.6 (best) to 5.9; the similarity encodings rank best.]
3 Results: ridge
[Figure: same comparison with a linear model. Average rankings across
datasets span 1.0 (best) to 6.0.]
Best: similarity encoding with 3-gram similarity
3 Results: different learner
[Figure: one-hot encoding vs. 3-gram similarity encoding for four
learners (Random Forest, Gradient Boosting, Ridge CV, Logistic CV)
on all seven datasets; average rankings 2.7, 2.4, 2.3, 2.0.]
3 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
<s_i, s_j>_sim = sum_{l=1}^{k} sim(s_i, s^(l)) sim(s_j, s^(l))
where the sum runs over the k reference categories s^(l)
The categories in the train set shape the similarity
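This kernel is just the inner product of the encoded vectors, with any string similarity plugged in. A sketch (`exact_match` is a toy similarity added here for illustration):

```python
def encoding_kernel(si, sj, reference, sim):
    # Inner product between the similarity-encoded vectors of si and sj:
    # the similarity implicitly defined by the encoding. The train-set
    # `reference` categories shape this kernel.
    return sum(sim(si, s) * sim(sj, s) for s in reference)

def exact_match(a, b):
    # Toy similarity: with it, the kernel degenerates to one-hot's.
    return 1.0 if a == b else 0.0
```

With `exact_match`, two identical strings outside the reference set still get kernel value 0, which makes visible how the train-set categories shape the similarity.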
[Figure: same benchmark with Bag of 3-grams and MDV added to the
encodings compared. Average rankings across datasets span 1.1 (best)
to 7.3.]
Similarity encoding outperforms a plain feature map capturing string
similarities (bag of 3-grams)
3 Too high dimensions
X ∈ R^{n×p}, but p is large
Statistical problems
Computational problems
Interpretation problems
Reducing the dimension
Random projections: "cheap PCA"
Only most-frequent categories as prototypes
K-means on strings to select prototypes
Similar to deduplication,
without hard assignment
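Two of these reductions can be sketched directly; this is illustrative, and both the prototype rule and the plain Gaussian projection are assumptions about the exact variants used:

```python
import random
from collections import Counter

def most_frequent_prototypes(categories, d):
    # Keep only the d most frequent categories as reference columns.
    return [c for c, _ in Counter(categories).most_common(d)]

def random_projection(X, d, seed=0):
    # "Cheap PCA": multiply the n x p encoded matrix by a random
    # Gaussian p x d matrix (Johnson-Lindenstrauss style).
    rng = random.Random(seed)
    p = len(X[0])
    R = [[rng.gauss(0.0, 1.0 / d ** 0.5) for _ in range(d)]
         for _ in range(p)]
    return [[sum(row[k] * R[k][j] for k in range(p)) for j in range(d)]
            for row in X]

prototypes = most_frequent_prototypes(
    ["Bus Operator", "Bus Operator", "Police Aide", "Electrician I",
     "Bus Operator", "Police Aide"], d=2)
Xp = random_projection([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], d=2)
```

Either way the encoded matrix ends up n x d with d fixed in advance, instead of growing with the number of distinct categories.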
3 Reducing the dimension
[Figure: prediction scores for employee salaries (k=355), open
payments (k=910), midwest survey (k=644), traffic violations
(k=2588), road safety (k=3988), and beer reviews (k=4015), where k is
the cardinality of the categorical variable. One-hot and 3-gram
similarity encoding, reduced to d = 30, 100, 300 or kept full, via
random projections, most frequent categories, K-means prototypes, or
deduplication with K-means. Average rankings across datasets span
2.0 (best) to 16.3.]
Hashing n-grams (for speed and collisions)
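Hashing n-grams can be sketched as feature hashing of character trigrams. This is an assumption about the exact scheme; `zlib.crc32` stands in for whatever hash function the implementation uses:

```python
import zlib

def hashed_ngram_vector(s, n=3, d=32):
    # Hash each character n-gram into one of d buckets: the dimension
    # stays fixed regardless of vocabulary size, at the price of
    # collisions between distinct n-grams.
    v = [0] * d
    for i in range(len(s) - n + 1):
        v[zlib.crc32(s[i:i + n].encode("utf-8")) % d] += 1
    return v

v = hashed_ngram_vector("Master Police Officer", n=3, d=16)
```

No vocabulary has to be stored, so new strings at test time hash into the same d buckets for free.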
@GaelVaroquaux
Learning on dirty categories
Dirty categories
Statistical models of non-curated categorical data
Give us your dirty data
Machine learning can help
Similarity encoding
Robust solution (dominates one-hot)
Enables statistical models
More to come
Dirty category software:
http://dirty-cat.github.io
4 References I
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching
word vectors with subword information. Transactions of the
Association for Computational Linguistics, 5:135–146, 2017.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for
learning with dirty categorical variables. Machine Learning,
pages 1–18, 2018.
A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient
boosting with categorical features support.
I. P. Fellegi and A. B. Sunter. A theory for record linkage.
Journal of the American Statistical Association, 64:1183,
1969.
4 References II
D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named
entity recognition with character-level models. In
Proceedings of the seventh conference on Natural language
learning at HLT-NAACL 2003-Volume 4, pages 180–183.
Association for Computational Linguistics, 2003.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction
problems. ACM SIGKDD Explorations Newsletter, 3(1):
27–32, 2001.