Real-life data seldom comes in the ideal form for statistical learning. This talk focuses on high-dimensional problems for signals and discrete entities: when dealing with many correlated signals or entities, it is useful to extract representations that capture these correlations.
Matrix factorization models provide simple but powerful representations. They are used in recommender systems across discrete entities such as users and products, and to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products, or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing, and show how this encoding interpolates between one-hot encoding and techniques used in character-level natural language processing [2].
[1] A. Mensch, J. Mairal, B. Thirion, G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing 66(1):113-128.
[2] P. Cerda, G. Varoquaux, B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning (2018): 1-18.
Accelerating Pseudo-Marginal MCMC using Gaussian Processes - Matt Moores
The grouped independence Metropolis-Hastings (GIMH) and Markov chain within Metropolis (MCWM) algorithms are pseudo-marginal methods used to perform Bayesian inference in latent variable models. These methods replace intractable likelihood calculations with unbiased estimates within Markov chain Monte Carlo algorithms. The GIMH method has the posterior of interest as its limiting distribution, but suffers from poor mixing if it is too computationally intensive to obtain high-precision likelihood estimates. The MCWM algorithm has better mixing properties, but less theoretical support. In this paper we accelerate the GIMH method by using a Gaussian process (GP) approximation to the log-likelihood and train this GP using a short pilot run of the MCWM algorithm. Our new method, GP-GIMH, is illustrated on simulated data from a stochastic volatility and a gene network model. Our approach produces reasonable estimates of the univariate and bivariate posterior distributions, and the posterior correlation matrix in these examples with at least an order of magnitude improvement in computing time.
Faster Practical Block Compression for Rank/Select Dictionaries - Rakuten Group, Inc.
We present faster practical encoding and decoding procedures for block compression. Such encoding and decoding procedures are important to efficiently support rank/select queries on compressed bit vectors. This paper was presented at the 24th International Symposium on String Processing and Information Retrieval (SPIRE 2017) in Palermo, Italy.
Deep Convolutional GANs - meaning of latent space - Hansol Kang
DCGAN does not merely apply conv nets to GANs; it also finds meaning in the latent space.
Review of the DCGAN paper and a PyTorch-based implementation.
Review of issues raised in the VAE seminar.
my github: https://github.com/messy-snail/GAN_PyTorch
[References]
https://github.com/znxlwm/pytorch-MNIST-CelebA-GAN-DCGAN
https://github.com/taeoh-kim/Pytorch_DCGAN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
OPTEX MATHEMATICAL MODELING AND MANAGEMENT SYSTEM - Jesus Velasquez
OPTEX MATHEMATICAL MODELING AND MANAGEMENT SYSTEM is a META-FRAMEWORK for Mathematical Programming.
It is oriented towards the design, implementation, and setup of decision support systems based on mathematical programming, with special emphasis on the development of end-user apps:
- The algebraic formulation is independent of any programming language
- The models can be connected to any data server
- Apps can thereby be generated with multiple commercial or non-commercial technologies, according to clients' needs
GAN Explained Simply (What is this? Gum? It's GAN.) - Hansol Kang
Review of the original GAN paper and a PyTorch-based implementation.
Comparison of deep-learning development environments and languages.
[References]
Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
Wang, Su. "Generative Adversarial Networks (GAN): A Gentle Introduction."
Understanding Generative Adversarial Networks from a novice graduate student's perspective (https://jaejunyoo.blogspot.com/)
Mastering GANs (Generative Adversarial Networks) in one hour (https://www.slideshare.net/NaverEngineering/1-gangenerative-adversarial-network)
Framework comparison (https://deeplearning4j.org/kr/compare-dl4j-torch7-pylearn)
The 5 most suitable programming languages for AI development (http://www.itworld.co.kr/news/109189#csidxf9226c7578dd101b41d03bfedfec05e)
What is Git? And what is GitHub? (https://www.slideshare.net/ianychoi/git-github-46020592)
A git concept guide for svn power users (https://www.slideshare.net/einsub/svn-git-17386752)
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME - HONGJOO LEE
A 45-minute talk about collecting home-network performance measures, analyzing and forecasting time-series data, and building an anomaly detection system.
In this talk, we go through the whole process of data mining and knowledge discovery. First we write a script to run a speed test periodically and log the metric. Then we parse the log data, convert it into a time series, and visualize the data over a certain period.
Next we conduct some data analysis: finding trends, forecasting, and detecting anomalous data. Several statistical and deep-learning techniques are used for the analysis: ARIMA (Autoregressive Integrated Moving Average) and LSTM (Long Short-Term Memory).
LSGAN - SIMPle (Simple Idea Meaningful Performance Level up) - Hansol Kang
LSGAN uses an MSE loss instead of the original GAN loss to generate more realistic data.
Review of the LSGAN paper and a PyTorch-based implementation.
[References]
Mao, Xudong, et al. "Least squares generative adversarial networks." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Multinomial Logistic Regression with Apache Spark - DB Tsai
Logistic regression can be used not only for modeling binary outcomes but also, with some extension, multinomial outcomes. In this talk, DB will walk through the basic idea of binary logistic regression step by step, and then extend it to the multinomial case. He will show how easy it is with Spark to parallelize this iterative algorithm by utilizing the in-memory RDD cache to scale horizontally (in the number of training samples). However, there is a mathematical limitation on scaling vertically (in the number of training features), while many recent applications, from document classification to computational linguistics, are of this type. He will talk about how to address this problem with the L-BFGS optimizer instead of the Newton optimizer.
Bio:
DB Tsai is a machine learning engineer working at Alpine Data Labs. He has recently been working with the Spark MLlib team to add support for the L-BFGS optimizer and multinomial logistic regression upstream. He also led the Apache Spark development at Alpine Data Labs. Before joining Alpine Data Labs, he worked on large-scale optimization of optical quantum circuits at Stanford as a PhD student.
Speaker: Hwalsuk Lee (Naver Clova)
Date: November 2017
(Current) NAVER Clova Vision
(Current) TFKR organizer
Overview:
Recently, the center of gravity of deep-learning research has been shifting rapidly from supervised to unsupervised learning.
In computer vision in particular, the research trend is moving from recognition techniques, which are supervised and find the information present in an image, to generation techniques, which are unsupervised and generate images carrying specific information.
This seminar briefly reviews the working principles of VAEs (variational autoencoders) and GANs (generative adversarial networks), the two pillars of generation techniques, and shares results from the main related papers.
The lecture is organized so that, even without prior knowledge of deep learning, one can understand the concepts of VAE and GAN, the two methodologies for training generative models, and grasp the current level of the technology.
Full paper: https://arxiv.org/pdf/1804.02339.pdf
We propose and analyze a novel adaptive step-size variant of the Davis-Yin three-operator splitting, a method that can solve optimization problems composed of a sum of a smooth term, for which we have access to its gradient, and an arbitrary number of potentially non-smooth terms, for which we have access to their proximal operators. The proposed method leverages local information on the objective function, allowing for larger step sizes while preserving the convergence properties of the original method. It only requires two extra function evaluations per iteration and does not depend on any step-size hyperparameter besides an initial estimate. We provide a convergence-rate analysis of this method, showing a sublinear convergence rate for general convex functions and linear convergence under stronger assumptions, matching the best known rates of its non-adaptive variant. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the adaptive step-size strategy.
A walk through the intersection between machine learning and mechanistic mode... - JuanPabloCarbajal3
Talk at EURECOM, France.
It overviews regression in several of its forms: regularized, constrained, and mixed. It builds the bridge between machine learning and dynamical models.
Distributed Coordinate Descent for Logistic Regression with Regularization - Ilya Trofimov
Logistic regression with L1 and L2 regularization is a widely used technique for solving classification and class-probability estimation problems. With the numbers of both features and examples growing rapidly in fields like text mining and clickstream data analysis, parallelization and the use of cluster architectures become important. We present a novel algorithm for fitting regularized logistic regression in a distributed environment. The algorithm splits data between nodes by features, uses coordinate descent on each node, and uses line search to merge results globally. A convergence proof is provided. A modification of the algorithm addresses the slow-node problem. We empirically compare our program with several state-of-the-art approaches that rely on different algorithmic and data-splitting methods. Experiments demonstrate that our approach is scalable and superior when training on large and sparse datasets.
----------------------------------------------------------
Machine Learning: Prospects and Applications
5-8 October 2015, Berlin, Germany
Typically quantifying uncertainty requires many evaluations of a computational model or simulator. If a simulator is computationally expensive and/or high-dimensional, working directly with a simulator often proves intractable. Surrogates of expensive simulators are popular and powerful tools for overcoming these challenges. I will give an overview of surrogate approaches from an applied math perspective and from a statistics perspective with the goal of setting the stage for the "other" community.
Basic concepts of deep learning, explaining its structure and the backpropagation method, and understanding autograd in PyTorch. (+ Data parallelism in PyTorch)
Dual-time Modeling and Forecasting in Consumer Banking (2016) - Aijun Zhang
Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure with horizontal axis representing the calendar time and the vertical axis representing the lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
Among other fields, we share the potential applications in quantitative risk management, and demonstrate a large-scale credit risk analysis powered by big data computing.
Distributed solution of stochastic optimal control problem on GPUs - Pantelis Sopasakis
Stochastic optimal control problems arise in many applications and are, in principle, large-scale, involving up to millions of decision variables. Their applicability in control applications is often limited by the availability of algorithms that can solve them efficiently and within the sampling time of the controlled system. In this paper we propose a dual accelerated proximal gradient algorithm which is amenable to parallelization, and demonstrate that its GPU implementation affords high speed-up values (with respect to a CPU implementation) and greatly outperforms well-established commercial optimizers such as Gurobi.
We approach the screening problem - i.e. detecting which inputs of a computer model significantly impact the output - from a formal Bayesian model selection point of view. That is, we place a Gaussian process prior on the computer model and consider the $2^p$ models that result from assuming that each of the subsets of the $p$ inputs affects the response. The goal is to obtain the posterior probabilities of each of these models. In this talk, we focus on the specification of objective priors on the model-specific parameters and on convenient ways to compute the associated marginal likelihoods. These two problems, normally seen as unrelated, have challenging connections, since the priors proposed in the literature are specifically designed to have posterior modes on the boundary of the parameter space, hence precluding the application of approximate integration techniques based on e.g. Laplace approximations. We explore several ways of circumventing this difficulty, comparing different methodologies on synthetic examples taken from the literature.
Authors: Gonzalo Garcia-Donato (Universidad de Castilla-La Mancha) and Rui Paulo (Universidade de Lisboa)
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
Simulators play a major role in analyzing multi-modal transportation networks. As their complexity increases, optimization becomes an increasingly challenging task. Current calibration procedures often rely on heuristics, rules of thumb, and sometimes brute-force search. Alternatively, we provide a statistical method which combines a distributed Gaussian-process Bayesian optimization method with dimensionality-reduction techniques and structural improvement. Our framework is sample-efficient and supported by theoretical analysis and an empirical study. We demonstrate it on the problem of calibrating a multi-modal transportation network of the city of Bloomington, Illinois. Finally, we discuss directions for further research.
Evaluating machine learning models and their diagnostic value - Gael Varoquaux
Model evaluation is, in my opinion, the most overlooked step of the machine-learning pipeline. Reliably estimating a model's performance for a given purpose is crucial and difficult. In this talk, I first discuss choosing a metric informative for the application, stressing the importance of class prevalence in classification settings. I then discuss procedures to estimate generalization performance, drawing a distinction between evaluating a learning procedure and evaluating a prediction rule, and discussing how to give confidence intervals on the performance estimates.
Measuring mental health with machine learning and brain imaging - Gael Varoquaux
The study of mental health relies vastly on behavior testing and questionnaires. I discuss how machine learning on large brain-imaging cohorts can open new avenues for markers of mental health. My claims are that the challenge lies in the amount of diagnosed conditions rather than in the heterogeneity of the conditions, and that we should turn to proxy labels. I discuss another fundamental challenge to this agenda: the external and construct validity of brain-imaging-based markers.
A tutorial on machine learning to build prediction models with missing values.
The slides cover both theoretical results (statistical learning) and practical advice, with a focus on implementation in Python with scikit-learn.
Dirty data science: machine learning on non-curated data - Gael Varoquaux
These slides are a one-hour course on machine learning with non-curated data.
According to industry surveys, the number-one hassle of data scientists is cleaning the data to analyze it. Here, I survey what kinds of "dirtiness" force time-consuming cleaning. We then cover two specific aspects of dirty data: non-normalized entries and missing values. I show how, for these two problems, machine-learning practice can be adapted to work directly on a data table without curation. The normalization problem can be tackled by adapting methods from natural language processing. The missing-values problem leads us to revisit classic statistical results in the setting of supervised learning.
Representation learning in limited-data settings - Gael Varoquaux
A 4-hour didactic course on simple notions of representations and how to use them in limited-data settings:
- A supervised-learning point of view, giving intuitions and math on what representations are and why they matter
- Building simple unsupervised learning models to extract representations: from matrix decompositions for signals to embeddings of entities
- Evaluating models in limited-data settings, often a bottleneck
This slide deck was given as a course at the 2021 DeepLearn summer school.
Better neuroimaging data processing: driven by evidence, open communities, an... - Gael Varoquaux
My current thoughts about methods validity and design in brain imaging.
Data processing is a significant part of a neuroimaging study, and the choice of corresponding methods and tools is crucial. I will give an opinionated view on a path to building better data processing for neuroimaging, taking examples from endeavors that I contributed to: defining standards for functional-connectivity analysis, the nilearn neuroimaging tool, and the scikit-learn machine-learning toolbox, an industry standard with a million regular users. I will cover not only the technical process (statistics, signal processing, software engineering) but also the epistemology of methods development. Methods govern our results; they are more than a technical detail.
Functional-connectome biomarkers to meet clinical needs? - Gael Varoquaux
Extracting functional-connectome biomarkers with machine learning: a talk in the symposium on how current predictive connectivity models meet clinicians' needs.
This talk is a bit provocative: it first sets out a vision, before bringing a few technical suggestions.
Atlases of cognition with large-scale human brain mapping - Gael Varoquaux
Cognitive neuroscience uses neuroimaging to identify brain systems engaged in specific cognitive tasks. However, unequivocally linking brain systems with cognitive functions is difficult: each task probes only a small number of facets of cognition, while brain systems are often engaged in many tasks. We develop a new approach to generate a functional atlas of cognition, demonstrating brain systems selectively associated with specific cognitive functions. This approach relies upon an ontology that defines specific cognitive functions and the relations between them, along with an analysis scheme tailored to this ontology. Using a database of thirty neuroimaging studies, we show that this approach provides a highly specific atlas of mental functions, and that it can decode the mental processes engaged in new tasks.
Similarity encoding for learning on dirty categorical variables - Gael Varoquaux
For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. "Dirty" non-curated data gives rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in prediction in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high-cardinality, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperforms classic encoding approaches.
Machine learning for functional connectomes - Gael Varoquaux
A tutorial on using machine-learning for functional-connectomes, for instance on resting-state fMRI. This is typically useful for population imaging: comparing traits or conditions across subjects.
Towards psychoinformatics with machine learning and brain imaging - Gael Varoquaux
Informatics in the psychological sciences brings fascinating challenges, as mental processes and pathologies have fuzzy definitions and are hard to quantify. Brain imaging brings rich data on the neural substrate of these concepts, yet the link is non-trivial.
The goal of this presentation is to put forward basic ideas of "psychoinformatics": using advanced processing of brain images to better quantify the elements of psychology.
It discusses how machine learning can bridge brain images to behavior: to describe better mental processes involved in brain activity, or to extract biomarkers of pathologies, individual traits, or cognition.
A tutorial on Machine Learning, with illustrations for MR imaging - Gael Varoquaux
Machine learning builds predictive models from data. It is massively used on medical images these days, for a variety of applications ranging from segmentation to diagnosis.
This is an introductory tutorial to machine learning, giving intuitions from the statistical point of view. It introduces the methodology, the concepts behind the central models, the validation framework, and some caveats to look out for.
It also discusses some applications to drawing conclusions from brain imaging, and uses these applications to highlight various technical aspects of running machine-learning models on high-dimensional data such as medical imaging.
Scikit-learn and nilearn: Democratisation of machine learning for brain imaging - Gael Varoquaux
This talk describes our efforts to bring easily usable machine learning to brain mapping. It covers both the questions that machine learning can answer and two software packages developed to facilitate machine learning and its application to neuroimaging.
Computational practices for reproducible science - Gael Varoquaux
Reconciling bleeding-edge scientific results and reproducible research may seem a conundrum in our fast-paced high-pressure academic world. I discuss the practices that I found useful in computational work. At a high level, it is important to navigate the space between rapid experimentation and industrial-grade software development. I advocate adopting more and more software-engineering best practices as a project matures. I will also discuss how to turn the computational work into libraries, and to ensure the quality of the resulting libraries. And I conclude on how those libraries need to fit in the larger picture of the exercise of research to give better science.
Slides for my keynote at Scipy 2017
https://youtu.be/eVDDL6tgsv8
Computing has been driving forward a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as doing Python libraries for science. What are the ingredients that the scientists need? What technical and project-management choices drove the success of projects I've been involved with? How do these demands and offers shape our ecosystem?
In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.
Estimating Functional Connectomes: Sparsity's Strength and Limitations - Gael Varoquaux
Talk given at the OHBM 2017 education course.
I present the challenges and techniques of estimating meaningful brain functional connectomes from fMRI: why sparsity in the inverse covariance leads to models that can be interpreted as interactions between regions.
Then I discuss the limitations of sparse estimators and introduce shrinkage as an alternative. Finally, I discuss how to compare multiple functional connectomes.
Data science calls for rapid experimentation and building intuitions from the data. Yet, data science also underpins crucial decisions and operational logic. Writing production-ready and robust statistical analysis without cognitive overhead may seem a conundrum. I will explore simple, and less simple, practices for fast turn around and consolidation of data-science code. I will discuss how these considerations led to the design of scikit-learn, that enables easy machine learning yet is used in production. Finally, I will mention some scikit-learn gems, new or forgotten.
Scientist meets web dev: how Python became the language of data - Gael Varoquaux
Python started as a scripting language, but now it is the new trend everywhere and in particular for data science, the latest rage of computing. It didn’t get there by chance: tools and concepts built by nerdy scientists and geek sysadmins provide foundations for what is said to be the sexiest job: data scientist.
In this talk I give a personal perspective on the progress of the scientific Python ecosystem, from numerical physics to data mining. What made Python suitable for science; Why the cultural gap between scientific Python and the broader Python community turned out to be a gold mine; And where this richness might lead us.
The talk will discuss low-level and high-level technical aspects, such as how the Python world makes it easy to move large chunks of numbers across code. It will touch upon current technical details that make scikit-learn and joblib stand out.
Machine learning and cognitive neuroimaging: new tools can answer new questions - Gael Varoquaux
Machine learning is geared towards prediction. However, aside from diagnosis or prognosis in the clinic, cognitive neuroimaging strives to uncover insights from the data rather than to minimize prediction error. I review various inferences on brain function that have been drawn using pattern-recognition techniques, focusing on decoding. In particular, I discuss using generalization as a test for information, multivariate analysis to interpret overlapping activation patterns, and decoding for principled reverse inference. For each, I give a statistical view and a cognitive-imaging view.
Talk given at PRNI 2016 for the paper https://arxiv.org/pdf/1606.06439v1.pdf
Abstract: Spatially sparse predictors are good models for brain decoding: they give accurate predictions and their weight maps are interpretable, as they focus on a small number of regions. However, the state of the art, based on total variation or graph-net, is computationally costly. Here we introduce sparsity in the local neighborhood of each voxel with social sparsity, a structured shrinkage operator. We find that, on brain-imaging classification problems, social sparsity performs almost as well as total-variation models and better than graph-net, for a fraction of the computational cost. It also very clearly outlines predictive regions. We give details of the model and the algorithm.
2. Settings: Very high dimensionality
- signals (images, spectra)
- many entities (customers, products)
- non-standardized categories (typos, variants)
Exploit links & redundancy across features
4. 1 Factorizing huge matrices
with A. Mensch, J. Mairal, B. Thirion [Mensch... 2016, 2017]
[Figure: factorization of a samples × features matrix, Y = E · S + N]
Challenge: scalability
1 Intuitions  2 Experiments  3 Algorithms  4 Proof
5. 1 Real-world data: recommender systems
[Figure: sparse users × products matrix of product ratings, Y = E · S + N]
Product ratings: millions of entries; hundreds of thousands of products and users; a large sparse matrix
6. 1 Real-world data: brain imaging
Brain activity at rest: 1000 subjects with ∼ 100-10 000 samples each; images of dimensionality > 100 000
A dense matrix, large both ways
[Figure: time × voxels matrix, Y = E · S + N]
7. 1 Scalable solvers for matrix factorizations
Large matrices = terabytes of data
$\operatorname*{argmin}_{E,S}\;\|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$
8. 1 Scalable solvers for matrix factorizations
Alternating minimization: data access, dictionary update, code computation
[Figure: data matrix, with samples seen at t, seen at t+1, and unseen at t]
Large matrices = terabytes of data
9. 1 Scalable solvers for matrix factorizations
Rewrite as an expectation [Mairal... 2010]:
$\operatorname*{argmin}_{E}\;\sum_i \min_s\;\|Y_i - E\,s^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(s) \;=\; \operatorname*{argmin}_{E}\;\mathbb{E}\,f(E)$
⇒ Optimize on approximations (sub-samples)
10. 1 Scalable solvers for matrix factorizations
Online matrix factorization: stream columns for the data access; then dictionary update and code computation
[Figure: data matrix, with samples seen at t, seen at t+1, and unseen at t]
11. 1 Scalable solvers for matrix factorizations
Online matrix factorization [Mairal... 2010]: 159 h run time on 2 terabytes of data; 12 h run time on 100 gigabytes of data
12. 1 Scalable solvers for matrix factorizations - SOMF
New subsampling algorithm: stream columns for the data access, subsample rows for the code computation
[Figure: data matrix, with samples seen at t, seen at t+1, and unseen at t]
Subsampled Online Matrix Factorization = SOMF
13. 1 Scalable solvers for matrix factorizations - SOMF
Online matrix factorization [Mairal... 2010]: 159 h run time on 2 terabytes of data; 12 h run time on 100 gigabytes of data
SOMF [Mensch... 2017]: 13 h run time on 1 terabyte of data: a ×10 speed-up
Subsampled Online Matrix Factorization = SOMF
14. 1 Experimental results: resting-state fMRI
[Figure: test objective value (×10⁵) vs. time (100 s to 24 h) on HCP (3.5 TB), comparing SGD (best step-size), online matrix factorization, and the proposed SOMF (r = 12)]
SOMF = Subsampled Online Matrix Factorization
15. 1 Experimental results: large images
[Figure: test objective value vs. time on four problems (ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB; Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB), comparing OMF, SOMF with subsampling ratios r = 4, 6, 8, 12, 24, and best step-size SGD]
SOMF = Subsampled Online Matrix Factorization
16. 1 Experimental results: recommender system
SOMF = Subsampled Online Matrix Factorization
17. 1 Algorithm: online matrix factorization prior art
Stream samples $x_t$ [Mairal... 2010]:
1. Compute code:
$\alpha_t = \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\;\|x_t - D_{t-1}\,\alpha\|_2^2 + \lambda\,\Omega(\alpha)$
2. Update the surrogate function:
$g_t(D) = \frac{1}{t}\sum_{i=1}^{t}\|x_i - D\,\alpha_i\|_2^2 = \operatorname{trace}\big(\tfrac{1}{2}\,D^\top D\,A_t - D^\top B_t\big)$
$A_t = (1-\tfrac{1}{t})\,A_{t-1} + \tfrac{1}{t}\,\alpha_t\alpha_t^\top \qquad B_t = (1-\tfrac{1}{t})\,B_{t-1} + \tfrac{1}{t}\,x_t\alpha_t^\top$
3. Minimize surrogate:
$D_t = \operatorname*{argmin}_{D\in\mathcal{C}}\;g_t(D), \qquad \nabla g_t = D\,A_t - B_t$
18. 1 Algorithm: online matrix factorization prior art
(same steps as above)
$g_t(D)$ is a surrogate majorizing $\sum_x l(x, D)$: the stored $\alpha_i$ is used, and not $\alpha^\star$
⇒ Stochastic Majorization-Minimization
No nasty hyper-parameters
19. 1 Algorithm: online matrix factorization prior art
(same steps as above, with complexities)
1. Compute code: complexity depends on p
2. Update the surrogate function: O(p)
3. Minimize surrogate: O(p)
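To make steps 1-3 concrete, here is a minimal NumPy sketch of this online loop. It is my own illustration of the idea, not the authors' code: the ridge closed form stands in for a sparse-coding solver, and lam, the number of atoms k, and the unit-norm constraint set are assumptions.

import numpy as np

def online_mf(stream, p, k, lam=0.1, n_steps=1000, seed=0):
    # Online matrix factorization in the spirit of [Mairal... 2010]:
    # stream yields samples x_t in R^p; k is the number of dictionary atoms.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)       # project atoms onto the constraint set C
    A = np.zeros((k, k))                 # surrogate statistic A_t
    B = np.zeros((p, k))                 # surrogate statistic B_t
    for t, x in enumerate(stream, start=1):
        # 1. Compute code (ridge closed form instead of a sparse solver)
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Update the surrogate statistics as running averages
        w = 1.0 / t
        A = (1 - w) * A + w * np.outer(alpha, alpha)
        B = (1 - w) * B + w * np.outer(x, alpha)
        # 3. Minimize the surrogate by block coordinate descent over atoms,
        #    using the gradient grad g_t(D) = D A_t - B_t
        for j in range(k):
            if A[j, j] > 1e-12:
                D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))
        if t >= n_steps:
            break
    return D

For a data matrix Y of shape (n, p), D = online_mf(iter(Y), p=Y.shape[1], k=50) streams over the rows.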
20. 1 Sub-sample features
Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$; dimension: p → s
Use only $M_t x_t$ in the computations → complexity in O(s)
[Figure: p × n data matrix; columns are streamed, masked rows are ignored]
Modify all steps to work on s features: code computation, surrogate update, surrogate minimization
22. 1 Sub-sample features - variance reduction
Original online MF:
1. Code computation:
$\alpha_t = \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\;\|x_t - D_{t-1}\,\alpha\|_2^2 + \lambda\,\Omega(\alpha)$
2. Surrogate aggregation:
$A_t = \frac{1}{t}\sum_{i=1}^{t}\alpha_i\alpha_i^\top \qquad B_t = B_{t-1} + \frac{1}{t}\,(x_t\alpha_t^\top - B_{t-1})$
3. Surrogate minimization:
$D^j \leftarrow p^{\perp}_{\mathcal{C}_r^j}\!\big(D^j - \tfrac{1}{(A_t)_{j,j}}\,(D\,A_t^j - B_t^j)\big)$
Our algorithm:
1. Approximate code computation, masked:
$\beta_t^{(i)} \leftarrow (1-\gamma)\,\beta_{t-1}^{(i)} + \gamma\,D_{t-1}^\top M_t x^{(i)} \qquad G_t^{(i)} \leftarrow (1-\gamma)\,G_{t-1}^{(i)} + \gamma\,D_{t-1}^\top M_t D_{t-1}$
$\alpha_t \leftarrow \operatorname*{argmin}_{\alpha\in\mathbb{R}^k}\;\tfrac{1}{2}\,\alpha^\top G_t\,\alpha - \alpha^\top\beta_t + \lambda\,\Omega(\alpha)$
2. Surrogate aggregation, averaging:
$A_t = \tfrac{1}{w_t}\,\alpha_t\alpha_t^\top + (1-\tfrac{1}{w_t})\,A_{t-1}$
$P_t\bar B_t \leftarrow (1-w_t)\,P_t\bar B_{t-1} + w_t\,P_t x_t\alpha_t^\top \qquad P_t^{\perp}\bar B_t \leftarrow (1-w_t)\,P_t^{\perp}\bar B_{t-1} + w_t\,P_t^{\perp} x_t\alpha_t^\top$
3. Surrogate minimization:
$P_t D_t \leftarrow \operatorname*{argmin}_{D^r\in\mathcal{C}^r}\;\tfrac{1}{2}\operatorname{tr}(D^{r\top} D^r \bar A_t) - \operatorname{tr}(D^{r\top} P_t \bar B_t)$
[Figure: test objective function vs. time, comparing no subsampling, masked loss (a), and averaged estimators (c), at subsampling ratios r = 12 and r = 24]
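For intuition, here is a minimal sketch of the masked code-computation step with the averaged estimators above. It is my own illustration: the subsampling ratio, the fixed weight gamma, the unbiased rescaling, and the ridge penalty are simplifying assumptions, not the paper's exact estimators.

import numpy as np

def masked_code_step(x, D, G, beta, rng, s, gamma=0.9, lam=0.1):
    # One subsampled code computation: only s of the p features of x are
    # touched, so the cost is O(s) instead of O(p). G and beta are running
    # estimates of D^T D and D^T x, kept across iterations.
    p, k = D.shape
    mask = rng.choice(p, size=s, replace=False)   # rows kept by the mask M_t
    scale = p / s                                 # rescale the masked products
    D_s, x_s = D[mask], x[mask]
    G = (1 - gamma) * G + gamma * scale * (D_s.T @ D_s)
    beta = (1 - gamma) * beta + gamma * scale * (D_s.T @ x_s)
    # alpha = argmin_a 1/2 a^T G a - a^T beta + lam * Omega(a)  (ridge form)
    alpha = np.linalg.solve(G + lam * np.eye(k), beta)
    return alpha, G, beta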
23. 1 Why does it work?
Objective: $D^\star = \operatorname*{argmin}_{D\in\mathcal{C}}\;\sum_x l(x, D)$ where $l(x, D) = \min_\alpha f(x, D, \alpha)$
Algorithm (online matrix factorization): $g_t(D)$ majorizes $\sum_x l(x, D)$; the stored $\alpha_i$ is used, and not $\alpha^\star$
⇒ Stochastic Majorization-Minimization [Mairal 2013]
24. 1 Why does it work?
(objective and algorithm as above)
[Diagram: SMM = surrogate computation followed by full minimization]
25. 1 Stochastic Approximate Majorization-Minimization
(objective and algorithm as above)
[Diagram: SMM = surrogate computation + full minimization; SAMM = surrogate approximation + partial minimization]
26. [Figure: samples × features matrix, Y = E · S + N]
Massive matrix factorization via subsampling:
- Subsampling features ⇒ doubly stochastic
- 10× speed-ups on a fast algorithm
- Analysis via stochastic approximate majorization-minimization
- Conclusive on various high-dimensional problems
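As a practical note, the non-subsampled online algorithm of [Mairal... 2010] ships in scikit-learn, while the subsampled variant is available in the modl package cited at the end of the deck. A minimal usage sketch on toy data (the parameters are arbitrary):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

Y = np.random.default_rng(0).standard_normal((500, 64))   # toy data matrix
mbdl = MiniBatchDictionaryLearning(n_components=10, alpha=0.1,
                                   batch_size=32, random_state=0)
E = mbdl.fit_transform(Y)     # loadings, shape (500, 10)
S = mbdl.components_          # dictionary, shape (10, 64)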
27. [Figure: samples × features matrix, Y = E · S + N]
2 Encoding with similarities
with P. Cerda and B. Kégl [Cerda... 2018]
When categories create a huge dimensionality
28. 2 Encoding with similarities
Machine learning: let $X \in \mathbb{R}^{n\times p}$
The real world:
Gender | Date Hired | Employee Position Title
M | 09/12/1988 | Master Police Officer
F | 11/19/1989 | Social Worker IV
M | 07/16/2007 | Police Officer III
F | 02/05/2007 | Police Aide
M | 01/13/2014 | Electrician I
F | 06/26/2006 | Social Worker III
F | 01/26/2000 | Library Assistant I
M | 11/22/2010 | Library Assistant I
29. 2 Encoding with similarities
(same table as above)
A data cleaning problem? A feature engineering problem?
A problem of representations in high dimension
30. 2 The problem of “dirty categories”
Non-curated categorical entries (Employee Position Title): Master Police Officer; Social Worker IV; Police Officer III; Police Aide; Electrician I; Bus Operator; Bus Operator; Social Worker III; Library Assistant I; Library Assistant I
Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”...
High cardinality: 400 unique entries in 10 000 rows
Rare categories: only 1 “Architect III”
New categories in the test set
31. 2 Dirty categories in the wild
Employee Salaries: salary information for employees of Montgomery County, Maryland.
Employee Position Title: Master Police Officer, Social Worker IV, ...
32. 2 Dirty categories in the wild
Open Payments: payments by health care companies to medical doctors or hospitals.
Company name | Frequency
Pfizer Inc. | 79,073
Pfizer Pharmaceuticals LLC | 486
Pfizer International LLC | 425
Pfizer Limited | 13
Pfizer Corporation Hong Kong Limited | 4
Pfizer Pharmaceuticals Korea Limited | 3
...
33. 2 Dirty categories in the wild
Medical charges: patient discharges (utilization, payment, and hospital-specific charges) across 3 000 US hospitals.
...
Nothing on the UCI machine-learning data repository
34. 2 Dirty categories in the wild
Cardinality slowly increases with the number of rows
[Figure: number of categories (100 to 10 000) vs. number of rows (100 to 1M) for beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, and medical charges; reference curves 100, √n, and 5 log₂(n)]
This creates a high-dimensional learning problem
35. 2 Dirty categories in the wild
Our goal: a statistical view of supervised learning on dirty categories
The statistical question should inform curation: Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
36. 2 Related work: database cleaning
Recognizing / merging entities:
- Record linkage: matching across different (clean) tables
- Deduplication / fuzzy matching: matching within one dirty table
Techniques [Fellegi and Sunter 1969]: supervised learning (known matches); clustering; Expectation Maximization to learn a metric
Outputs a “clean” database
37. 2 Related work: natural language processing
Stemming / normalization: a set of (handcrafted) rules; needs to be adapted to new languages / new domains
38. 2 Related work: natural language processing
Stemming / normalization (as above)
Semantics: relating different discrete objects
- Formal semantics (entity resolution in knowledge bases)
- Distributional semantics: “a word is characterized by the company it keeps”
39. 2 Related work: natural language processing
Stemming / normalization, semantics (as above)
Character-level NLP:
- for entity resolution [Klein... 2003]
- for semantics [Bojanowski... 2017]
- “London” & “Londres” may carry different information
40. 2 Similarity encoding: a simple solution
Adding similarities to one-hot encoding:
1. One-hot encoding maps categories to vector spaces
2. String similarities capture information
41. 2 Similarity encoding: a simple solution
One-hot encoding:
        | London | Londres | Paris
Londres |   0    |    1    |   0
London  |   1    |    0    |   0
Paris   |   0    |    0    |   1
$X \in \mathbb{R}^{n\times p}$: p grows fast; new categories? link categories?
42. 2 Similarity encoding: a simple solution
One-hot encoding (as above): p grows fast; new categories? link categories?
Similarity encoding:
        | London | Londres | Paris
Londres |  0.3   |   1.0   |  0.0
London  |  1.0   |   0.3   |  0.0
Paris   |  0.0   |   0.0   |  1.0
with entries given by a string similarity, e.g. sim(Londres, London) = 0.3
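A minimal sketch of similarity encoding (my own illustration, not the paper's reference implementation): each string is represented by its vector of similarities to the categories seen at train time, here with a 3-gram similarity. With the (assumed) whitespace-padding convention below, it reproduces the 0.3 of the table above.

import numpy as np

def ngrams(s, n=3):
    s = ' ' + s.lower() + ' '               # pad to capture word boundaries
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / len(A | B)          # common n-grams / total n-grams

def similarity_encode(values, vocabulary):
    # Each string becomes its vector of similarities to the train categories.
    return np.array([[ngram_similarity(v, ref) for ref in vocabulary]
                     for v in values])

vocab = ['London', 'Londres', 'Paris']
print(similarity_encode(['Londres', 'London', 'Paris'], vocab).round(1))
# [[0.3 1.  0. ]
#  [1.  0.3 0. ]
#  [0.  0.  1. ]]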
43. 2 Some string similarities
Levenshtein: the number of edit operations on one string to match the other
Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m-t}{3m}$
where m is the number of matching characters and t the number of character transpositions
n-gram similarity: an n-gram is a group of n consecutive characters;
$\mathrm{similarity} = \frac{\#\,\text{n-grams in common}}{\#\,\text{n-grams in total}}$
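For completeness, a standard dynamic-programming sketch of the Levenshtein distance, together with a ratio in [0, 1]. The ratio definition here is an assumption; libraries normalize differently.

def levenshtein(a, b):
    # Number of single-character edits (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(a, b):
    # A similarity in [0, 1] derived from the edit distance
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)

print(levenshtein('Londres', 'London'))        # 3
print(levenshtein_ratio('Londres', 'London'))  # 0.571...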
44. 2 Empirical study
Datasets with dirty categories:
Dataset            | # of rows | # of categories | Count of least frequent category | Prediction type
medical charges    | 160k | 100  | 613 | regression
employee salaries  | 9.2k | 385  | 1   | regression
open payments      | 100k | 973  | 1   | binary clf
midwest survey     | 2.8k | 1009 | 1   | multiclass clf
traffic violations | 100k | 3043 | 1   | multiclass clf
road safety        | 10k  | 4617 | 1   | binary clf
beer reviews       | 10k  | 4634 | 1   | multiclass clf
7 datasets! All open
Experimental paradigm: cross-validation & measure prediction. Stupid simple.
45.–49. 2 Experiments: gradient boosted trees
[Figure: prediction scores per dataset (medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, beer reviews) for similarity encoding (3-gram, Levenshtein ratio, Jaro-Winkler), target encoding, one-hot encoding, and hash encoding]
Average ranking across datasets, in legend order: 1.6, 2.4, 2.9, 3.7, 4.6, 5.9
50. 2 Experiments: ridge
[Figure: the same comparison with a ridge model]
Average ranking across datasets, in legend order: 1.0, 2.9, 3.1, 4.4, 3.6, 6.0
Similarity encoding, with 3-gram similarity, ranks first
52. 2 This is just a string similarity?
What similarity is defined by our encoding? (a kernel)
$\langle s_i, s_j\rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\;\mathrm{sim}(s_j, s^{(l)})$
(a sum over the reference categories $s^{(l)}$)
The categories in the train set shape the similarity
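The induced similarity is simply an inner product between similarity-encoded vectors. A small self-contained check of the kernel formula above, reusing the 3-gram similarity from the earlier sketch:

import numpy as np

def ngrams(s, n=3):
    s = ' ' + s.lower() + ' '
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def sim(a, b):
    A, B = ngrams(a), ngrams(b)
    return len(A & B) / len(A | B)

def encoded_inner_product(si, sj, vocabulary):
    # <s_i, s_j>_sim = sum_l sim(s_i, s^(l)) * sim(s_j, s^(l))
    vi = np.array([sim(si, ref) for ref in vocabulary])
    vj = np.array([sim(sj, ref) for ref in vocabulary])
    return float(vi @ vj)

print(encoded_inner_product('London', 'Londres', ['London', 'Londres', 'Paris']))  # 0.6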
53. 2 This is just a string similarity?
(kernel as above)
[Figure: prediction scores per dataset, adding bag of 3-grams and MDV to the previous comparison; average rankings across datasets: 1.1, 3.1, 3.4, 4.1, 5.3, 6.4, 4.7, 7.3]
Similarity encoding >>> a feature map capturing string similarities
54. 2 Reducing the dimensionality
$X \in \mathbb{R}^{n\times p}$, but p is large:
statistical problems, computational problems
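One of the reduction strategies compared below, random projections, in a minimal scikit-learn sketch; the matrix and the target dimension d = 300 are arbitrary stand-ins:

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

X_sim = np.random.default_rng(0).random((10000, 4000))  # stand-in for an encoded matrix
proj = GaussianRandomProjection(n_components=300, random_state=0)
X_red = proj.fit_transform(X_sim)    # shape (10000, 300) instead of (10000, 4000)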
55.–59. 2 Reducing the dimensionality
[Figure: prediction scores per dataset (employee salaries, k = 355; open payments, k = 910; midwest survey, k = 644; traffic violations, k = 2588; road safety, k = 3988; beer reviews, k = 4015, where k is the cardinality of the categorical variable), comparing one-hot encoding and 3-gram similarity encoding, each either full or reduced to d = 30, 100, 300 via random projections, most frequent categories, k-means, or deduplication with k-means; average rankings across datasets shown in the figure]
Also compared: factorizing the one-hot encoding with Multiple Correspondence Analysis; hashing n-grams (for speed and collisions)
61. @GaelVaroquaux
Representations in high dimension: factorizations and similarities, for signals, entities, categories
Factorizations: costly in large-p, large-n; sub-sampling p gives huge speed-ups; analysis via stochastic approximate majorization-minimization
https://github.com/arthurmensch/modl
62. @GaelVaroquaux
Representations in high dimension: factorizations and similarities, for signals, entities, categories
Factorizations: https://github.com/arthurmensch/modl
Similarity encoding for categories: no separate deduplication / cleaning step; creates a category-aware metric space
https://dirty-cat.github.io
DirtyData project (hiring)
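A usage sketch of similarity encoding through the dirty-cat package linked above, assuming the SimilarityEncoder transformer as documented there at the time of the talk (scikit-learn transformer API, n-gram similarity by default):

import numpy as np
from dirty_cat import SimilarityEncoder   # see https://dirty-cat.github.io

titles = np.array([['Master Police Officer'],
                   ['Police Officer III'],
                   ['Social Worker IV']])
enc = SimilarityEncoder()        # assumed default: 3-gram string similarity
X = enc.fit_transform(titles)    # one column per category seen during fit
print(X.round(2))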
63. References I
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146, 2017.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969.
D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 180–183. Association for Computational Linguistics, 2003.
J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013.
64. References II
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML, 2016.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2017.