Similarity encoding for learning
on dirty categorical variables
Gaël Varoquaux, with Patricio Cerda and Balázs Kégl
Agenda today
Bring to light a problem
Show that statistical learning can solve it
Machine learning
Let X ∈ R^{n×p}
The data
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F 11/19/1989 Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M 03/02/2008 Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M 11/22/2010 Library Assistant I
A data cleaning problem?
A feature engineering problem?
G Varoquaux 2
The problem of “dirty categories”
Non-curated categorical entries
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
Dirty categories in the wild
Employee Salaries: salary information for employees
of Montgomery County, Maryland.
Employee Position Title
Master Police Officer
Social Worker IV
...
Open Payments: payments by health care
companies to medical doctors or hospitals.
Company name Frequency
Pfizer Inc. 79,073
Pfizer Pharmaceuticals LLC 486
Pfizer International LLC 425
Pfizer Limited 13
Pfizer Corporation Hong Kong Limited 4
Pfizer Pharmaceuticals Korea Limited 3
...
Medical charges: patient discharges: utilization,
payment, and hospital-specific charges across 3 000
US hospitals.
...
Nothing like this on the UCI machine-learning repository
Dirty categories in the wild
[Figure: number of categories vs. number of rows for each dataset
(beer reviews, road safety, traffic violations, midwest survey,
open payments, employee salaries, medical charges), with reference
curves at 100, sqrt(n), and 5 log2(n) categories.]
Mechanisms creating dirty categories
Typos
Open-ended entries
Merging different data sources
Our goal: a statistical view of supervised
learning on dirty categories
The statistical question
should inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
Rest of the talk:
1 Related approaches
2 Similarity encoding
3 Empirical study
1 Related approaches
Database cleaning
Natural language processing
Machine learning
1 A database cleaning point of view
Recognizing / merging entities
Record linkage:
matching across different (clean) tables
Deduplication/fuzzy matching:
matching in one dirty table
Techniques [Fellegi and Sunter 1969]
Supervised learning (known matches)
Clustering
Expectation Maximization to learn a metric
Outputs a “clean” database
1 A natural language processing point of view
Stemming / normalization
Set of (handcrafted) rules
Need to be adapted to new language / new domains
Semantics
Relate different discrete objects
Formal semantics (entity resolution in knowledge bases)
Distributional semantics:
“a word is characterized by the company it keeps”
Character-level NLP
For entity resolution [Klein... 2003]
For semantics [Bojanowski... 2017]
“London” & “Londres” may carry different information
1 A machine-learning point of view
High-cardinality categorical data
Encoding each category blows up the dimension
Target encoding [Micci-Barreca 2001]
Represent each category by
a simple statistical link to the target y
e.g. E[y | X_i = C_k]
1D real-number embedding for a categorical column
Bring close categories with same link to y
Great for tree-based machine-learning [Dorogush...]
But fails on unseen categories
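Target encoding can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's implementation; the `prior_weight` smoothing knob is an assumption, shrinking rare categories toward the global mean in the spirit of [Micci-Barreca 2001].

```python
from collections import defaultdict

def target_encode(categories, y, prior_weight=10.0):
    # Replace each category by a smoothed estimate of E[y | X = category].
    # `prior_weight` (a hypothetical knob) shrinks rare categories
    # toward the global mean of y.
    global_mean = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, target in zip(categories, y):
        sums[c] += target
        counts[c] += 1
    encoding = {c: (sums[c] + prior_weight * global_mean)
                   / (counts[c] + prior_weight) for c in counts}
    # Unseen categories fall back to the global mean: this is exactly
    # where target encoding carries no category-specific information.
    return encoding, global_mean

encoding, fallback = target_encode(
    ["Bus Operator", "Bus Operator", "Social Worker III"],
    [40000.0, 42000.0, 50000.0], prior_weight=0.0)
```

The fallback to the global mean makes the failure mode concrete: a category seen only at test time gets a constant, uninformative value.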
2 Similarity encoding
[P. Cerda, G. Varoquaux, & B. Kégl, Machine Learning 2018]
1. One-hot encoding maps categories to vector spaces
2. String similarities capture information
2 Adding similarities to one-hot encoding
One-hot encoding
London Londres Paris
Londres 0 1 0
London 1 0 0
Paris 0 0 1
X ∈ Rn×p
p grows fast
new categories?
link categories?
Similarity encoding
London Londres Paris
Londres 0.3 1.0 0.0
London 1.0 0.3 0.0
Paris 0.0 0.0 1.0
(0.3 = string similarity between "Londres" and "London")
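The table above can be reproduced with a minimal sketch, assuming a set-based Jaccard similarity over character 3-grams (the paper's exact similarity may weight shared n-grams differently):

```python
def ngram_similarity(a, b, n=3):
    # Jaccard similarity between the sets of character n-grams.
    ga = {a.lower()[i:i + n] for i in range(len(a) - n + 1)}
    gb = {b.lower()[i:i + n] for i in range(len(b) - n + 1)}
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def similarity_encode(entries, reference):
    # One column per reference category (as in one-hot encoding),
    # but filled with string similarities instead of 0/1.
    return [[ngram_similarity(e, r) for r in reference] for e in entries]

X = similarity_encode(["Londres", "London", "Paris"],
                      reference=["London", "Londres", "Paris"])
# Exact matches get 1.0, "Londres" vs "London" a partial similarity
# (2 shared trigrams out of 7, i.e. about 0.3), "Paris" stays at 0.
```

A new category at test time still gets a meaningful vector: its similarities to the reference categories, instead of an all-zero one-hot row.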
2 Some string similarities
Levenshtein
Number of edit operations on one string to match
the other
Jaro-Winkler
d_jaro(s1, s2) = m/(3|s1|) + m/(3|s2|) + (m - t)/(3m)
m: number of matching characters
t: number of character transpositions
n-gram similarity
n-gram: group of n consecutive characters
similarity = (# n-grams in common) / (# n-grams in total)
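The Levenshtein distance is the classic dynamic program (a sketch; dedicated libraries provide much faster implementations):

```python
def levenshtein(s1, s2):
    # Minimum number of single-character insertions, deletions and
    # substitutions turning s1 into s2 (row-by-row dynamic programming).
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1,                # delete c1
                               current[j - 1] + 1,             # insert c2
                               previous[j - 1] + (c1 != c2)))  # substitute
        previous = current
    return previous[-1]
```

For example, "Londres" is 3 edits away from "London" (two substitutions and one deletion).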
3 Empirical study
3 Datasets with dirty categories
Dataset             # rows   # categories   Least frequent      Prediction type
                                            category (count)
medical charges     160k     100            613                 regression
employee salaries   9.2k     385            1                   regression
open payments       100k     973            1                   binary clf
midwest survey      2.8k     1009           1                   multiclass clf
traffic violations  100k     3043           1                   multiclass clf
road safety         10k      4617           1                   binary clf
beer reviews        10k      4634           1                   multiclass clf
7 datasets! All open
3 Experiments
Cross-validation & measure prediction
Focus on prediction rather than in-sample statistics
Easier non-parametric evaluation
Amenable to high dimension
3 Results: gradient boosted trees
[Figure: prediction scores on each dataset (medical charges, employee
salaries, open payments, midwest survey, traffic violations, road
safety, beer reviews), comparing similarity encoding (3-gram,
Levenshtein ratio, Jaro-Winkler) with target, one-hot, and hash
encoding, using gradient boosted trees. Average rankings across
datasets span 1.6 (best) to 5.9; the similarity encodings rank best.]
3 Results: ridge
[Figure: same comparison with a linear model. Average rankings across
datasets span 1.0 (best) to 6.0.]
Best: similarity encoding with 3-gram similarity
3 Results: different learner
[Figure: one-hot encoding vs. 3-gram similarity encoding for four
learners (Random Forest, Gradient Boosting, Ridge CV, Logistic CV)
on all seven datasets; average rankings 2.7, 2.4, 2.3, 2.0.]
3 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
<s_i, s_j>_sim = sum_{l=1}^{k} sim(s_i, s^(l)) sim(s_j, s^(l))
where the sum runs over the k reference categories s^(l)
The categories in the train set shape the similarity
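This kernel is just the inner product of the encoded vectors, with any string similarity plugged in. A sketch (`exact_match` is a toy similarity added here for illustration):

```python
def encoding_kernel(si, sj, reference, sim):
    # Inner product between the similarity-encoded vectors of si and sj:
    # the similarity implicitly defined by the encoding. The train-set
    # `reference` categories shape this kernel.
    return sum(sim(si, s) * sim(sj, s) for s in reference)

def exact_match(a, b):
    # Toy similarity: with it, the kernel degenerates to one-hot's.
    return 1.0 if a == b else 0.0
```

With `exact_match`, two identical strings outside the reference set still get kernel value 0, which makes visible how the train-set categories shape the similarity.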
[Figure: same benchmark with Bag of 3-grams and MDV added to the
encodings compared. Average rankings across datasets span 1.1 (best)
to 7.3.]
Similarity encoding outperforms a plain feature map capturing string
similarities (bag of 3-grams)
3 Too high dimensions
X ∈ R^{n×p}, but p is large
Statistical problems
Computational problems
Interpretation problems
Reducing the dimension
Random projections: "cheap PCA"
Only most-frequent categories as prototypes
K-means on strings to select prototypes
Similar to deduplication,
without hard assignment
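Two of these reductions can be sketched directly; this is illustrative, and both the prototype rule and the plain Gaussian projection are assumptions about the exact variants used:

```python
import random
from collections import Counter

def most_frequent_prototypes(categories, d):
    # Keep only the d most frequent categories as reference columns.
    return [c for c, _ in Counter(categories).most_common(d)]

def random_projection(X, d, seed=0):
    # "Cheap PCA": multiply the n x p encoded matrix by a random
    # Gaussian p x d matrix (Johnson-Lindenstrauss style).
    rng = random.Random(seed)
    p = len(X[0])
    R = [[rng.gauss(0.0, 1.0 / d ** 0.5) for _ in range(d)]
         for _ in range(p)]
    return [[sum(row[k] * R[k][j] for k in range(p)) for j in range(d)]
            for row in X]

prototypes = most_frequent_prototypes(
    ["Bus Operator", "Bus Operator", "Police Aide", "Electrician I",
     "Bus Operator", "Police Aide"], d=2)
Xp = random_projection([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], d=2)
```

Either way the encoded matrix ends up n x d with d fixed in advance, instead of growing with the number of distinct categories.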
3 Reducing the dimension
[Figure: prediction scores for employee salaries (k=355), open
payments (k=910), midwest survey (k=644), traffic violations
(k=2588), road safety (k=3988), and beer reviews (k=4015), where k is
the cardinality of the categorical variable. One-hot and 3-gram
similarity encoding, reduced to d = 30, 100, 300 or kept full, via
random projections, most frequent categories, K-means prototypes, or
deduplication with K-means. Average rankings across datasets span
2.0 (best) to 16.3.]
Hashing n-grams (for speed and collisions)
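Hashing n-grams can be sketched as feature hashing of character trigrams. This is an assumption about the exact scheme; `zlib.crc32` stands in for whatever hash function the implementation uses:

```python
import zlib

def hashed_ngram_vector(s, n=3, d=32):
    # Hash each character n-gram into one of d buckets: the dimension
    # stays fixed regardless of vocabulary size, at the price of
    # collisions between distinct n-grams.
    v = [0] * d
    for i in range(len(s) - n + 1):
        v[zlib.crc32(s[i:i + n].encode("utf-8")) % d] += 1
    return v

v = hashed_ngram_vector("Master Police Officer", n=3, d=16)
```

No vocabulary has to be stored, so new strings at test time hash into the same d buckets for free.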
@GaelVaroquaux
Learning on dirty categories
Dirty categories
Statistical models of non-curated categorical data
Give us your dirty data
Machine learning can help
Similarity encoding
Robust solution (dominates one-hot)
Enables statistical models
More to come
Dirty category software:
http://dirty-cat.github.io
4 References I
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching
word vectors with subword information. Transactions of the
Association for Computational Linguistics, 5:135–146, 2017.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for
learning with dirty categorical variables. Machine Learning,
pages 1–18, 2018.
A. V. Dorogush, V. Ershov, and A. Gulin. CatBoost: gradient
boosting with categorical features support.
I. P. Fellegi and A. B. Sunter. A theory for record linkage.
Journal of the American Statistical Association, 64:1183,
1969.
4 References II
D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named
entity recognition with character-level models. In
Proceedings of the seventh conference on Natural language
learning at HLT-NAACL 2003-Volume 4, pages 180–183.
Association for Computational Linguistics, 2003.
D. Micci-Barreca. A preprocessing scheme for high-cardinality
categorical attributes in classification and prediction
problems. ACM SIGKDD Explorations Newsletter, 3(1):
27–32, 2001.