Hybrid-Parallel Parameter Estimation for Frequentist and
Bayesian Models
Parameswaran Raman
Ph.D. Dissertation Defense
University of California, Santa Cruz
December 2, 2019
Motivation

Data and Model grow hand in hand
e.g. Extreme Multi-class classification
e.g. Extreme Multi-label classification (http://manikvarma.org/downloads/XC/XMLRepository.html)
e.g. Topic Models
Challenges in Parameter Estimation
1 Storage limitations of Data and Model
2 Interdependence in parameter updates
3 Bulk-Synchronization is expensive
4 Synchronous communication is inefficient

Traditional distributed machine learning approaches fall short.
Hybrid-Parallel algorithms for parameter estimation!
Introduction

Outline
1 Introduction
2 Distributed Parameter Estimation
3 Thesis Roadmap
Multinomial Logistic Regression (KDD 2019)
Mixture Models (AISTATS 2019)
Latent Collaborative Retrieval (NIPS 2014)
4 Conclusion and Future Directions
5 Appendix
Regularized Risk Minimization

Goals in machine learning
We want to build a model using observed (training) data
Our model must generalize to unseen (test) data

min_θ L(θ) := λ R(θ)  [regularizer]  +  (1/N) Σ_{i=1}^N loss(x_i, y_i, θ)  [empirical risk]

X = {x_1, ..., x_N}, y = {y_1, ..., y_N} is the observed training data
θ are the model parameters
loss(·) quantifies the model's performance
the regularizer R(θ) avoids over-fitting (penalizes complex models)
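To make the template concrete, here is a minimal NumPy sketch of one familiar instance, l2-regularized logistic regression. The synthetic data, step size and regularization strength are illustrative placeholders, not anything used later in the talk.

```python
import numpy as np

def l2_logistic_objective(theta, X, y, lam):
    # L(theta) = lam * R(theta) + (1/N) * sum_i loss(x_i, y_i, theta),
    # with R(theta) = ||theta||^2 / 2 and the logistic loss.
    margins = y * (X @ theta)
    empirical_risk = np.mean(np.logaddexp(0.0, -margins))
    regularizer = 0.5 * lam * theta @ theta
    return regularizer + empirical_risk

def gradient_step(theta, X, y, lam, eta):
    # One full-gradient step on the regularized risk above.
    margins = y * (X @ theta)
    grad_loss = -(X.T @ (y / (1.0 + np.exp(margins)))) / X.shape[0]
    return theta - eta * (lam * theta + grad_loss)

# toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100))
theta = np.zeros(5)
for _ in range(100):
    theta = gradient_step(theta, X, y, lam=0.1, eta=0.5)
```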
Frequentist Models
Binary Classification/Regression
Multinomial Logistic Regression
Matrix Factorization
Latent Collaborative Retrieval
Polynomial Regression, Factorization Machines
Bayesian Models
e.g. Gaussian Mixture Models (GMM), Latent Dirichlet Allocation (LDA)

p(θ|X)  [posterior]  =  p(X|θ)  [likelihood]  ·  p(θ)  [prior]  /  ∫ p(X, θ) dθ  [marginal likelihood (model evidence)]

the prior plays the role of the regularizer R(θ)
the likelihood plays the role of the empirical risk
Focus on Matrix-Parameterized Models
[Figure: the Data is a matrix X of size N × D and the Model is a matrix θ of size D × K.]
What if these matrices do not fit in memory?
Distributed Parameter Estimation

Data parallel (e.g. L-BFGS)
[Figure: the Data X (N × D) is partitioned across workers; each worker holds a full copy of the Model θ (D × K).]

Model parallel (e.g. LC [Gopal et al., 2013])
[Figure: the Model θ (D × K) is partitioned across workers; each worker holds a full copy of the Data X (N × D).]
Good
Easy to implement using map-reduce
Scales as long as the Data or the Model fits in memory

Bad
Either the Data or the Model is replicated on each worker.
Data Parallel: each worker requires O(N × D / P) + O(K × D), and the O(K × D) model copy is the bottleneck
Model Parallel: each worker requires O(N × D) + O(K × D / P), and the O(N × D) data copy is the bottleneck
Question
Can we get the best of both worlds?

Hybrid Parallelism
[Figure: both the Data X (N × D) and the Model θ (D × K) are partitioned across workers.]
Benefits of Hybrid-Parallelism
1 One versatile method for all regimes of data and model parallelism
2 Independent parameter updates on each worker
3 Fully de-centralized and asynchronous optimization algorithms

Why Hybrid Parallelism?
[Figure: data size (MB) and parameter size (MB) on a log scale for LSHTC1-small, LSHTC1-large, ODP, YouTube8M-Video and Reddit-Full, compared against the maximum memory of a commodity machine.]
How do we achieve Hybrid Parallelism in machine learning models?
Double-Separability ⇒ Hybrid Parallelism
Double-Separability

Definition
Let {S_i}_{i=1}^m and {S'_j}_{j=1}^{m'} be two families of sets of parameters. A function f : ∏_{i=1}^m S_i × ∏_{j=1}^{m'} S'_j → R is doubly separable if there exist f_{ij} : S_i × S'_j → R for each i = 1, 2, ..., m and j = 1, 2, ..., m' such that:

f(θ_1, θ_2, ..., θ_m, θ'_1, θ'_2, ..., θ'_{m'}) = Σ_{i=1}^m Σ_{j=1}^{m'} f_{ij}(θ_i, θ'_j)
f(θ_1, ..., θ_m, θ'_1, ..., θ'_{m'}) = Σ_{i=1}^m Σ_{j=1}^{m'} f_{ij}(θ_i, θ'_j)

[Figure: an m × m' grid of interaction terms f_{ij}(θ_i, θ'_j); the f_{ij} corresponding to the highlighted diagonal blocks can be computed independently and in parallel.]
Direct Double-Separability
e.g. Matrix Factorization

L(w_1, w_2, ..., w_N, h_1, h_2, ..., h_M) = (1/2) Σ_{i=1}^N Σ_{j=1}^M (X_{ij} − ⟨w_i, h_j⟩)²

The objective function is trivially doubly-separable! [Yun et al., 2014]
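Below is a minimal NumPy sketch of why direct double-separability matters: each term (X_{ij} − ⟨w_i, h_j⟩)² touches only row parameter w_i and column parameter h_j, so SGD can sweep one "diagonal" of a P × P block partition at a time, and the P blocks inside a sweep could run on P workers without sharing any parameter. The partitioning, data and step size are illustrative assumptions.

```python
import numpy as np

def sgd_on_block(W, H, X, rows, cols, eta):
    # SGD on the f_ij terms of one block; touches only W[rows] and H[cols].
    for i in rows:
        for j in cols:
            err = X[i, j] - W[i] @ H[j]
            w_old = W[i].copy()
            W[i] += eta * err * H[j]
            H[j] += eta * err * w_old

def one_epoch(W, H, X, P, eta):
    # P sweeps; the P blocks inside each sweep are disjoint in rows and
    # columns, so they are safe to run in parallel on P workers.
    row_parts = np.array_split(np.arange(X.shape[0]), P)
    col_parts = np.array_split(np.arange(X.shape[1]), P)
    for shift in range(P):
        for p in range(P):
            sgd_on_block(W, H, X, row_parts[p], col_parts[(p + shift) % P], eta)

# toy usage: recover a rank-5 matrix
rng = np.random.default_rng(0)
N, M, K = 40, 30, 5
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, M))
W = 0.1 * rng.normal(size=(N, K))
H = 0.1 * rng.normal(size=(M, K))
for _ in range(50):
    one_epoch(W, H, X, P=4, eta=0.01)
```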
Others require Reformulations

original objective f(·)   →   Reformulate   →   Σ_{i=1}^m Σ_{j=1}^{m'} f_{ij}(θ_i, θ'_j)
Technical Contributions of my thesis
Study frequentist and Bayesian models whose parameter estimation is not directly doubly-separable, and
1 Propose reformulations that lead to Hybrid-Parallelism
2 Design asynchronous distributed optimization algorithms
Thesis Roadmap
Multinomial Logistic Regression (KDD 2019)
Multinomial Logistic Regression (MLR)

Given
Training data (x_i, y_i), i = 1, ..., N, where x_i ∈ R^D, with corresponding labels y_i ∈ {1, 2, ..., K}
N, D and K are large (N ≫ D ≫ K)

Goal
The probability that x_i belongs to class k is given by:

p(y_i = k | x_i, W) = exp(w_k^T x_i) / Σ_{j=1}^K exp(w_j^T x_i)

where W = {w_1, w_2, ..., w_K} denotes the parameters of the model.
The corresponding l2-regularized negative log-likelihood loss:

min_W  (λ/2) Σ_{k=1}^K ||w_k||²  −  (1/N) Σ_{i=1}^N Σ_{k=1}^K y_{ik} w_k^T x_i  +  (1/N) Σ_{i=1}^N log Σ_{k=1}^K exp(w_k^T x_i)

where λ is the regularization hyper-parameter. The log-sum-exp term couples all K classes, which makes model parallelism hard.
Reformulation into Doubly-Separable form

Log-concavity bound [GopYan13]
log(γ) ≤ a · γ − log(a) − 1,   ∀ γ, a > 0,
where a is a variational parameter. This bound is tight when a = 1/γ.
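To spell out the step the next slide takes, apply the bound with γ = Σ_{k=1}^K exp(w_k^T x_i) and a per-example variational parameter a_i:

\log\Big(\sum_{k=1}^{K} \exp(w_k^{\top} x_i)\Big) \;\le\; a_i \sum_{k=1}^{K} \exp(w_k^{\top} x_i) \;-\; \log(a_i) \;-\; 1, \qquad a_i > 0,

with equality at a_i = 1 / Σ_{k=1}^K exp(w_k^T x_i). Substituting this upper bound for each log-sum-exp term in the MLR objective gives the joint minimization over (W, A) below.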
Reformulating the objective of MLR

min_{W,A}  (λ/2) Σ_{k=1}^K ||w_k||²  +  (1/N) Σ_{i=1}^N [ − Σ_{k=1}^K y_{ik} w_k^T x_i + a_i Σ_{k=1}^K exp(w_k^T x_i) − log(a_i) − 1 ]

where a_i can be computed in closed form as:

a_i = 1 / Σ_{k=1}^K exp(w_k^T x_i)
Doubly-Separable Multinomial Logistic Regression (DS-MLR)

Doubly-Separable form

min_{W,A}  Σ_{i=1}^N Σ_{k=1}^K [ λ ||w_k||² / (2N)  −  y_{ik} w_k^T x_i / N  −  log(a_i) / (NK)  +  exp(w_k^T x_i + log a_i) / N  −  1 / (NK) ]
Stochastic Gradient Updates
Each term in the stochastic update depends only on data point i and class k.

w_k^{t+1} ← w_k^t − η_t K [ λ w_k^t − y_{ik} x_i + exp(w_k^{tT} x_i + log a_i) x_i ]

log a_i^{t+1} ← log a_i^t − η_t K [ exp(w_k^T x_i + log a_i^t) − 1/K ]
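A minimal single-machine NumPy sketch of these stochastic updates, assuming the (i, k) pairs are visited in random order and using the per-update scaling shown above; the synthetic data, step size and regularization value are placeholders.

```python
import numpy as np

def dsmlr_epoch(W, log_a, X, Y, lam, eta):
    """One sweep of the stochastic updates for w_k and log a_i.
    W: K x D class weights, log_a: length-N variational parameters,
    X: N x D data, Y: N x K one-hot labels."""
    N, D = X.shape
    K = W.shape[0]
    for i in np.random.permutation(N):
        for k in np.random.permutation(K):
            s = np.exp(W[k] @ X[i] + log_a[i])
            # stochastic step for w_k: regularizer, label term, soft-max term
            W[k] -= eta * K * (lam * W[k] - Y[i, k] * X[i] + s * X[i])
            # stochastic step for the variational parameter log a_i
            log_a[i] -= eta * K * (s - 1.0 / K)
    return W, log_a

# toy usage with random data
rng = np.random.default_rng(0)
N, D, K = 200, 20, 5
X = rng.normal(size=(N, D))
Y = np.eye(K)[rng.integers(0, K, size=N)]
W = np.zeros((K, D))
log_a = -np.log(K) * np.ones(N)   # a_i = 1/K is exact when W = 0
for _ in range(10):
    dsmlr_epoch(W, log_a, X, Y, lam=1e-3, eta=0.05)
```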
Access Pattern of updates: Stoch w_k, Stoch a_i
[Figure: (a) updating w_k only requires computing a_i; (b) updating a_i only requires accessing w_k and x_i.]
Updating a_i: Closed form instead of Stoch update

Closed-form update for a_i
a_i = 1 / Σ_{k=1}^K exp(w_k^T x_i)
Access Pattern of updates: Stoch w_k, Exact a_i
[Figure: (a) updating w_k only requires computing a_i; (b) updating a_i requires accessing the entire W, a synchronization bottleneck!]
Updating a_i: Avoiding bulk-synchronization

Closed-form update for a_i
a_i = 1 / Σ_{k=1}^K exp(w_k^T x_i)

Each worker computes a partial sum using the w_k it owns.
With P workers, the global sum is available after P rounds.
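A minimal NumPy sketch of the partial-sum idea, assuming P workers each own a contiguous slice of the classes; message passing is simulated by a plain loop over the P slices, and the final assert checks that after P rounds the accumulated denominator equals the exact one.

```python
import numpy as np

def compute_log_a(X_block, W_slices):
    # Accumulate sum_k exp(w_k^T x_i) over P rounds, one slice of classes per
    # round, so no worker ever needs the full W at once.
    partial = np.zeros(X_block.shape[0])
    for W_p in W_slices:                     # round p: classes owned by worker p
        partial += np.exp(X_block @ W_p.T).sum(axis=1)
    return -np.log(partial)                  # log a_i = -log sum_k exp(w_k^T x_i)

# toy usage: P = 4 workers, each owning K/P classes
rng = np.random.default_rng(0)
N, D, K, P = 64, 10, 8, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D))
log_a = compute_log_a(X, np.array_split(W, P))
assert np.allclose(log_a, -np.log(np.exp(X @ W.T).sum(axis=1)))
```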
Parallelization: Synchronous DSGD [Gemulla et al., 2011]
X and the local parameters A are partitioned horizontally (over data points 1, ..., N)
The global model parameters W are partitioned vertically (over classes 1, ..., K)
P = 4 workers work on mutually-exclusive blocks of A and W
Parallelization: Asynchronous NOMAD [Yun et al., 2014]
[Figure: four snapshots of the (A, W) grid; each worker keeps its own rows of A (and X) while column blocks of W circulate asynchronously among the workers, so no worker waits at a global barrier.]
Experiments: Datasets

Dataset         | # instances | # features | # classes | data (train + test) | parameters | sparsity (% nnz)
CLEF            | 10,000      | 80         | 63        | 9.6 MB + 988 KB     | 40 KB      | 100
NEWS20          | 11,260      | 53,975     | 20        | 21 MB + 14 MB       | 9.79 MB    | 0.21
LSHTC1-small    | 4,463       | 51,033     | 1,139     | 11 MB + 4 MB        | 465 MB     | 0.29
LSHTC1-large    | 93,805      | 347,256    | 12,294    | 258 MB + 98 MB      | 34 GB      | 0.049
ODP             | 1,084,404   | 422,712    | 105,034   | 3.8 GB + 1.8 GB     | 355 GB     | 0.0533
YouTube8M-Video | 4,902,565   | 1,152      | 4,716     | 59 GB + 17 GB       | 43 MB      | 100
Reddit-Full     | 211,532,359 | 1,348,182  | 33,225    | 159 GB + 69 GB      | 358 GB     | 0.0036

Table: Characteristics of the datasets used
Experiments: Single Machine
NEWS20, Data=35 MB, Model=9.79 MB: [plot: objective vs. time (secs) for DS-MLR, L-BFGS and LC]
CLEF, Data=10 MB, Model=40 KB: [plot: objective vs. time (secs) for DS-MLR, L-BFGS and LC]
LSHTC1-small, Data=15 MB, Model=465 MB: [plot: objective vs. time (secs) for DS-MLR, L-BFGS and LC]

Experiments: Multi Machine
LSHTC1-large, Data=356 MB, Model=34 GB, machines=4, threads=12: [plot: objective vs. time (secs) for DS-MLR and LC]
ODP, Data=5.6 GB, Model=355 GB, machines=20, threads=260: [plot: objective vs. time (secs) for DS-MLR]

Experiments: Multi Machine, Dense Dataset
YouTube8M-Video, Data=76 GB, Model=43 MB, machines=4, threads=260: [plot: objective vs. time (secs) for DS-MLR]

Experiments: Multi Machine (Nothing fits in memory!)
Reddit-Full, Data=228 GB, Model=358 GB, machines=40, threads=250: [plot: objective vs. time (secs) for DS-MLR]
211 million examples, 44 billion parameters (# features × # classes)
Thesis Roadmap
Mixture Models (AISTATS 2019)
Mixture of Exponential Families
k-means
Gaussian Mixture Models (GMM)
Latent Dirichlet Allocation (LDA)
Stochastic Mixed Membership Models
Generative Model

Draw Cluster Proportions π
p(π | α) = Dirichlet(α)

Draw Cluster Parameters for k = 1, ..., K
p(θ_k | n_k, ν_k) = exp( n_k ⟨ν_k, θ_k⟩ − n_k g(θ_k) − h(n_k, ν_k) )

Draw Observations for i = 1, ..., N
p(z_i | π) = Multinomial(π)
p(x_i | z_i, θ) = exp( ⟨φ(x_i, z_i), θ_{z_i}⟩ − g(θ_{z_i}) )

p(θ_k | n_k, ν_k) is conjugate to p(x_i | z_i = k, θ_k)
p(π | α) is conjugate to p(z_i | π)
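A minimal NumPy sketch of ancestral sampling from this generative model, instantiated for unit-variance Gaussian emissions (an assumed special case of the exponential-family notation above); all hyper-parameters are placeholders.

```python
import numpy as np

def sample_mixture(N, K, D, alpha, rng):
    pi = rng.dirichlet(alpha * np.ones(K))      # p(pi | alpha) = Dirichlet(alpha)
    theta = rng.normal(scale=5.0, size=(K, D))  # cluster parameters (Gaussian means here)
    z = rng.choice(K, size=N, p=pi)             # p(z_i | pi) = Multinomial(pi)
    x = theta[z] + rng.normal(size=(N, D))      # p(x_i | z_i, theta): unit-variance Gaussian emission
    return pi, theta, z, x

rng = np.random.default_rng(0)
pi, theta, z, x = sample_mixture(N=1000, K=4, D=2, alpha=1.0, rng=rng)
```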
Joint Distribution

p(π, θ, z, x | α, n, ν) = p(π | α) · Π_{k=1}^K p(θ_k | n_k, ν_k) · Π_{i=1}^N p(z_i | π) p(x_i | z_i, θ)

We want to estimate the posterior p(π, θ, z | x, α, n, ν).
Variational Inference [Blei et al., 2016]

Approximate p(π, θ, z | x, α, n, ν) with a simpler (factorized) distribution q:

q(π, θ, z | π̃, θ̃, z̃) = q(π | π̃) · Π_{k=1}^K q(θ_k | θ̃_k) · Π_{i=1}^N q(z_i | z̃_i)
Evidence Lower Bound (ELBO)

L(π̃, θ̃, z̃) = E_{q(π,θ,z | π̃,θ̃,z̃)}[ log p(π, θ, z, x | α, n, ν) ] − E_{q(π,θ,z | π̃,θ̃,z̃)}[ log q(π, θ, z | π̃, θ̃, z̃) ]
Variational Inference (VI)

M-step
max_{π̃} L(π̃, θ̃, z̃)
max_{θ̃} L(π̃, θ̃, z̃)
E-step
max_{z̃} L(π̃, θ̃, z̃)

Variational Inference: M-step
max_{π̃} L(π̃, θ̃, z̃):
π̃_k = α + Σ_{i=1}^N z̃_{i,k}

max_{θ̃} L(π̃, θ̃, z̃):
ñ_k = n_k + Σ_{i=1}^N z̃_{i,k}
ν̃_k = n_k · ν_k + Σ_{i=1}^N z̃_{i,k} · φ(x_i, k)

[Figure: the π̃ and θ̃ updates read the z̃_{i,k} of every data point i.]

Variational Inference: E-step
max_{z̃} L(π̃, θ̃, z̃):
z̃_{i,k} ∝ exp( ψ(π̃_k) − ψ(Σ_{k'=1}^K π̃_{k'}) + ⟨φ(x_i, k), E_{q(θ_k|θ̃_k)}[θ_k]⟩ − E_{q(θ_k|θ̃_k)}[g(θ_k)] )

[Figure: the z̃_i update reads π̃ and every θ̃_k.]
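A minimal batch-VI sketch of the E-step and M-step above for the Gaussian instantiation used earlier, with point estimates of the cluster means instead of the full exponential-family expectations; it is meant to show the access pattern (the M-step reads every z̃_i, the E-step reads π̃ and every component), not to reproduce the exact updates.

```python
import numpy as np
from scipy.special import digamma

def e_step(x, pi_tilde, mu):
    # z~_{i,k} proportional to exp( psi(pi~_k) - psi(sum_k' pi~_k') + likelihood term );
    # unit-variance Gaussian emissions with point estimates mu of the means.
    log_r = digamma(pi_tilde) - digamma(pi_tilde.sum()) \
            - 0.5 * ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)

def m_step(x, r, alpha):
    # pi~_k = alpha + sum_i z~_{i,k}; cluster parameters from the weighted
    # sufficient statistics sum_i z~_{i,k} * phi(x_i).
    pi_tilde = alpha + r.sum(axis=0)
    mu = (r.T @ x) / (r.sum(axis=0)[:, None] + 1e-12)
    return pi_tilde, mu

# toy usage (x could come from the sampling sketch above)
rng = np.random.default_rng(1)
K, D = 4, 2
x = np.vstack([rng.normal(loc=c, size=(250, D)) for c in (-4.0, -1.0, 2.0, 5.0)])
pi_tilde, mu = np.ones(K), rng.normal(size=(K, D))
for _ in range(50):
    r = e_step(x, pi_tilde, mu)
    pi_tilde, mu = m_step(x, r, alpha=1.0)
```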
Stochastic Variational Inference (SVI) [Hoffman et al., 2013]

M-step
max_{π̃} L(π̃, θ̃, z̃)
max_{θ̃} L(π̃, θ̃, z̃)
E-step
max_{z̃_i} L(π̃, θ̃, z̃_i, z̃_{-i})   (one data point i at a time)

More frequent updates to global variables ⇒ faster convergence
[Figure: the π̃ and θ̃ updates still touch all K components, while the E-step now touches only a single z̃_i.]
Extreme Stochastic Variational Inference (ESVI)

M-step
max_{π̃_k, π̃_{k'}} L( [π̃_1, ..., π̃_k, π̃_{k+1}, ..., π̃_{k'}, ..., π̃_K], θ̃, z̃ )
max_{θ̃_k, θ̃_{k'}} L( π̃, [θ̃_1, ..., θ̃_k, θ̃_{k+1}, ..., θ̃_{k'}, ..., θ̃_K], z̃ )
E-step
max_{z̃_{i,k}, z̃_{i,k'}} L( π̃, θ̃, z̃_{-i}, [z̃_{i,1}, ..., z̃_{i,k}, z̃_{i,k+1}, ..., z̃_{i,k'}, ..., z̃_{i,K}] )

Each update touches only a pair of components (k, k') of a single variable.
[Figure: the π̃, θ̃ and z̃ access patterns each reduce to the pair of components being updated.]

Key Observation
All of the above updates can be done in parallel!
Distributing Computation: NOMAD [Yun et al., 2014]
[Figure: each worker keeps its own rows of z̃ and x, while the components of θ̃ (and π̃) circulate asynchronously among the workers.]
Keeping π̃ in Sync
A single global π̃ travels among the machines and broadcasts the local delta updates.
Every machine p keeps a pair (π_p, π̄):
π_p is a local working copy
π̄ is a snapshot of the global π̃
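A minimal sketch of this delta-synchronization scheme: each machine keeps a local working copy π_p and a snapshot π̄ of the last global value it saw, and when the traveling global π̃ visits it absorbs only the locally accumulated delta. Message passing is simulated here by an ordinary loop over machines; everything else is an illustrative assumption.

```python
import numpy as np

class PiSync:
    """Local working copy pi_p plus a snapshot pi_bar of the last-seen global pi~."""
    def __init__(self, pi_global):
        self.pi_p = pi_global.copy()
        self.pi_bar = pi_global.copy()

    def local_update(self, delta_counts):
        self.pi_p += delta_counts             # local VI-style updates accumulate here

    def merge(self, pi_global):
        # called when the traveling global pi~ visits this machine
        pi_global += self.pi_p - self.pi_bar  # broadcast only the local delta
        self.pi_p = pi_global.copy()          # refresh the local copy ...
        self.pi_bar = pi_global.copy()        # ... and the snapshot
        return pi_global

# toy usage: global pi~ makes one round over P machines
rng = np.random.default_rng(0)
K, P = 8, 4
pi_global = np.ones(K)
machines = [PiSync(pi_global) for _ in range(P)]
for m in machines:
    m.local_update(rng.random(K))             # stand-in for sum_i z~_{i,k} increments
for m in machines:
    pi_global = m.merge(pi_global)
```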
Experiments: Datasets

Dataset  | # documents | # vocabulary | # words
NIPS     | 1,312       | 12,149       | 1,658,309
Enron    | 37,861      | 28,102       | 6,238,796
NY Times | 298,000     | 102,660      | 98,793,316
PubMed   | 8,200,000   | 141,043      | 737,869,083
UMBC-3B  | 40,599,164  | 3,431,260    | 3,013,004,127

Mixture Models: Gaussian Mixture Model (GMM), Latent Dirichlet Allocation (LDA)
Methods: Variational Inference (VI), Stochastic Variational Inference (SVI), Extreme Stochastic Variational Inference (ESVI)
Experiments: Single Machine (GMM)
TOY, machines=1, cores=1, topics=256: [plot: ELBO vs. time (secs) for ESVI, SVI and VI]
AP-DATA, machines=1, cores=1, topics=256: [plot: ELBO vs. time (secs) for ESVI, SVI and VI]

Experiments: Multi Machine (GMM)
NIPS, machines=16, cores=1, topics=256: [plot: ELBO vs. time (secs) for ESVI and VI]
NY Times, machines=16, cores=1, topics=256: [plot: ELBO vs. time (secs) for ESVI and VI]

Experiments: Single Machine Single Core (LDA)
Enron, machines=1, cores=1, topics=128: [plot: ELBO vs. time (secs) for ESVI, VI and SVI]
NY Times, machines=1, cores=1, topics=16: [plot: ELBO vs. time (secs) for ESVI, VI and SVI]

Experiments: Single Machine Multi Core (LDA)
Enron, machines=1, cores=16, topics=128: [plot: ELBO vs. time (secs) for ESVI, dist VI and stream SVI]
NY Times, machines=1, cores=16, topics=64: [plot: ELBO vs. time (secs) for ESVI, dist VI and stream SVI]

Experiments: Multi Machine Multi Core (LDA)
PUBMED, machines=32, cores=16, topics=128: [plot: ELBO vs. time (secs) for ESVI and dist VI]
UMBC-3B, machines=32, cores=16, topics=128: [plot: ELBO vs. time (secs) for ESVI and dist VI]

Experiments: Predictive Performance (LDA)
Enron, machines=1, cores=16, topics=32: [plot: test perplexity vs. time (secs) for ESVI and VI]
NY Times, machines=1, cores=16, topics=32: [plot: test perplexity vs. time (secs) for ESVI and VI]
Thesis Roadmap
Latent Collaborative Retrieval (NIPS 2014)
Binary Classification
[Plot: loss vs. margin for the 0-1 loss I(· < 0), the logistic loss σ(·) and the hinge loss.]
Convex objective functions are sensitive to outliers.
Robust Binary Classification
[Plot: σ(t) together with the transformed losses σ_1(t) := ρ_1(σ(t)) and σ_2(t) := ρ_2(σ(t)), where ρ_1(t) = log_2(t + 1) and ρ_2(t) = 1 − 1/log_2(t + 2).]
Bending the losses provides robustness.
Objectives
Can we use this to design robust losses for ranking?
Two scenarios:
Explicit: features and labels are observed (Learning to Rank)
Implicit: features and labels are latent (Latent Collaborative Retrieval)
Distributed optimization algorithms for both settings.
Explicit Setting: Learning to Rank
Scoring function: f(x, y) = ⟨φ(x, y), w⟩

Implicit Setting: Latent Collaborative Retrieval
Scoring function: f(x, y) = ⟨U_x, V_y⟩
Formulating a Robust Ranking Loss (RoBiRank)

Rank
rank_w(x, y) = Σ_{y' ∈ Y_x, y' ≠ y} I( f(x, y) − f(x, y') < 0 )

Objective Function
L(w) = Σ_{x ∈ X} Σ_{y ∈ Y_x} r_{xy} Σ_{y' ∈ Y_x, y' ≠ y} σ( f(x, y) − f(x, y') )
where σ is the logistic loss.
Applying the robust transformation ρ_2(t)

Objective Function
L(w) = Σ_{x ∈ X} Σ_{y ∈ Y_x} r_{xy} · ρ_2( Σ_{y' ∈ Y_x, y' ≠ y} σ( f(x, y) − f(x, y') ) )
where ρ_2(t) = 1 − 1/log_2(t + 2).

This loss is directly related to DCG (the evaluation metric for ranking):
minimizing L(w) ⇒ maximizing DCG
In practice, ρ_1 is easier to optimize:

Objective Function
L(w) = Σ_{x ∈ X} Σ_{y ∈ Y_x} r_{xy} · ρ_1( Σ_{y' ∈ Y_x, y' ≠ y} σ( f(x, y) − f(x, y') ) )
where ρ_1(t) = log_2(t + 1).
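A minimal NumPy sketch of this robust ranking loss for a single query x; the scores f(x, y), the relevances r_xy and the concrete choice σ(t) = log_2(1 + 2^{-t}) are illustrative assumptions.

```python
import numpy as np

def sigma(t):
    # logistic-type loss of the margin (assumed concrete choice)
    return np.log2(1.0 + np.exp2(-t))

def rho1(t):
    return np.log2(t + 1.0)

def robirank_query_loss(scores, rel):
    # sum_y r_xy * rho1( sum_{y' != y} sigma(f(x,y) - f(x,y')) ) for one query x
    loss = 0.0
    for y, r in enumerate(rel):
        if r == 0:
            continue
        margins = scores[y] - np.delete(scores, y)   # f(x,y) - f(x,y') for y' != y
        loss += r * rho1(sigma(margins).sum())
    return loss

# toy usage: 5 candidate items, two of them relevant
scores = np.array([2.0, 0.5, -1.0, 0.1, 1.5])
rel = np.array([1, 0, 0, 1, 0])
print(robirank_query_loss(scores, rel))
```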
Experiments: Single Machine
TD 2004: [plot: NDCG@k for k = 5, ..., 20; RoBiRank vs. RankSVM, LSRank, InfNormPush and IRPush]
TD 2004: [plot: NDCG@k for k = 5, ..., 20; RoBiRank vs. MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet and RandomForests]
RoBiRank shows good accuracy (NDCG) at the top of the ranking list.
How to make RoBiRank Hybrid-Parallel?
Consider RoBiRank in the implicit case (no features, no explicit labels):

Σ_{x ∈ X} Σ_{y ∈ Y_x} r_{xy} · ρ_1( Σ_{y' ∈ Y_x, y' ≠ y} σ( ⟨U_x, V_y⟩ − ⟨U_x, V_{y'}⟩ ) )

The inner sum over y' is expensive. How can we avoid this?
Linearization

Upper bound for ρ_1 [Bouchard07]
ρ_1(t) = log_2(t + 1) ≤ −log_2(ξ) + ( ξ · (t + 1) − 1 ) / log 2
The bound holds for any ξ > 0 and is tight when ξ = 1/(t + 1).
We can linearize the objective function using this upper bound:

Σ_{x ∈ X} Σ_{y ∈ Y_x} r_{xy} · [ −log_2(ξ_{xy}) + ( ξ_{xy} ( Σ_{y' ≠ y} σ( ⟨U_x, V_y⟩ − ⟨U_x, V_{y'}⟩ ) + 1 ) − 1 ) / log 2 ]

We can obtain stochastic gradient updates for U_x, V_y and ξ_{xy}
ξ_{xy} captures the importance of the (x, y) pair in the optimization
ξ_{xy} also has a closed-form solution
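A minimal sketch of one way such stochastic updates could look, sampling a single (x, y, y') triplet per step, using the standard logistic loss for σ and a one-sample estimate of the inner sum when refreshing ξ_xy in closed form. This is an illustration under those assumptions, not the exact update rule of the paper; all sizes and step sizes are placeholders.

```python
import numpy as np

LOG2 = np.log(2.0)

def sigma(t):
    return np.logaddexp(0.0, -t)        # logistic loss (assumed choice)

def sigma_grad(t):
    return -1.0 / (1.0 + np.exp(t))     # its derivative

def sgd_triplet(U, V, xi, x, y, y_neg, r_xy, n_other, eta):
    # one stochastic step on the linearized term for the pair (x, y),
    # with y' = y_neg sampled uniformly and the sum over y' rescaled by n_other
    margin = U[x] @ (V[y] - V[y_neg])
    g = r_xy * xi[x, y] / LOG2 * n_other * sigma_grad(margin)
    u_old = U[x].copy()
    U[x] -= eta * g * (V[y] - V[y_neg])
    V[y] -= eta * g * u_old
    V[y_neg] += eta * g * u_old
    # closed-form refresh of xi_xy at 1 / (t + 1), with t estimated from this sample
    xi[x, y] = 1.0 / (n_other * sigma(margin) + 1.0)

# toy usage
rng = np.random.default_rng(0)
n_users, n_items, K = 50, 100, 8
U = 0.1 * rng.normal(size=(n_users, K))
V = 0.1 * rng.normal(size=(n_items, K))
xi = np.ones((n_users, n_items))
for _ in range(10000):
    x, y, y_neg = rng.integers(n_users), rng.integers(n_items), rng.integers(n_items)
    if y != y_neg:
        sgd_triplet(U, V, xi, x, y, y_neg, r_xy=1.0, n_other=n_items - 1, eta=0.01)
```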
Distributing Computation
[Figure: how the computation is distributed across workers.]
Experiments: Multi Machine
Million Song Dataset (MSD): [plot: Mean Precision@1 vs. time (secs); Weston et al. (2012) vs. RoBiRank on 1, 4, 16 and 32 machines]
Scalability on the Million Song Dataset, machines = 1, 4, 16, 32: [plot: Mean Precision@1 vs. (# machines × seconds) for RoBiRank 1, 4, 16 and 32]
Conclusion and Future Directions

Conclusion and Key Takeaways
Data and Model grow hand in hand.
This creates challenges in Parameter Estimation.
We proposed:
Hybrid-Parallel formulations
Distributed, Asynchronous Algorithms
Case-studies:
Multinomial Logistic Regression
Mixture Models
Latent Collaborative Retrieval
Future Work: Other Hybrid-Parallel models
Infinite Mixture Models
DP Mixture Model [Blei and Jordan, 2006]
Pitman-Yor Mixture Model [Dubey et al., 2014]
Stochastic Mixed-Membership Block Models [Airoldi et al., 2008]
Deep Neural Networks [Wang et al., 2014]
Acknowledgements
Thanks to all my collaborators.
References

Parameswaran Raman, Sriram Srinivasan, Shin Matsushima, Xinhua Zhang, Hyokun Yun, S.V.N. Vishwanathan. Scaling Multinomial Logistic Regression via Hybrid Parallelism. KDD 2019.
Parameswaran Raman*, Jiong Zhang*, Shihao Ji, Hsiang-Fu Yu, S.V.N. Vishwanathan, Inderjit S. Dhillon. Extreme Stochastic Variational Inference: Distributed and Asynchronous. AISTATS 2019.
Hyokun Yun, Parameswaran Raman, S.V.N. Vishwanathan. Ranking via Robust Binary Classification. NIPS 2014.

Source Code:
DS-MLR: https://bitbucket.org/params/dsmlr
ESVI: https://bitbucket.org/params/dmixmodels
RoBiRank: https://bitbucket.org/d_ijk_stra/robirank
Appendix

Polynomial Regression

f(x_i) = w_0 + Σ_{j=1}^D w_j x_{ij} + Σ_{j=1}^D Σ_{j'=j+1}^D w_{jj'} x_{ij} x_{ij'}

x_i ∈ R^D is an example from the dataset X ∈ R^{N×D}
w_0 ∈ R, w ∈ R^D, W ∈ R^{D×D} are parameters of the model
The dense parameterization is not suited to sparse data
Computationally expensive: O(N × D²)
Factorization Machines

f(x_i) = w_0 + Σ_{j=1}^D w_j x_{ij} + Σ_{j=1}^D Σ_{j'=j+1}^D ⟨v_j, v_{j'}⟩ x_{ij} x_{ij'}

Factorized parameterization using V V^T instead of a dense W
Model parameters are w_0 ∈ R, w ∈ R^D, V ∈ R^{D×K}
v_j ∈ R^K denotes the latent embedding of the j-th feature
Computationally much cheaper: O(N × D × K)

The pairwise term can be rewritten so that it is linear in D:

f(x_i) = w_0 + Σ_{j=1}^D w_j x_{ij} + (1/2) Σ_{k=1}^K [ ( Σ_{d=1}^D v_{dk} x_{id} )² − Σ_{j=1}^D v_{jk}² x_{ij}² ]
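A minimal NumPy sketch of the rewritten score, which is where the O(N × D × K) cost comes from; the brute-force check against the original O(D²) pairwise sum uses random placeholder parameters.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    # f(x) = w0 + sum_j w_j x_j + 0.5 * sum_k [ (sum_d v_dk x_d)^2 - sum_j v_jk^2 x_j^2 ],
    # i.e. the pairwise term computed in O(D*K) instead of O(D^2).
    linear = w0 + w @ x
    s = V.T @ x                       # s_k = sum_d v_dk x_d (the a_ik term of the slides)
    pairwise = 0.5 * (s @ s - ((V ** 2).T @ (x ** 2)).sum())
    return linear + pairwise

# toy usage with a brute-force check against the original O(D^2) pairwise sum
rng = np.random.default_rng(0)
D, K = 12, 4
x = rng.normal(size=D)
w0, w, V = 0.5, rng.normal(size=D), 0.1 * rng.normal(size=(D, K))
brute = w0 + w @ x + sum((V[j] @ V[jp]) * x[j] * x[jp]
                         for j in range(D) for jp in range(j + 1, D))
assert np.isclose(fm_predict(x, w0, w, V), brute)
```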
L(w, V) = (1/N) Σ_{i=1}^N l( f(x_i), y_i ) + (λ_w / 2) ||w||²_2 + (λ_v / 2) ||V||²_2

λ_w and λ_v are regularization parameters for w and V
l(·) is a loss function depending on the task
Gradient Descent updates (first-order features)

w_j^{t+1} ← w_j^t − η ( Σ_{i=1}^N ∇_{w_j} l_i(w, V) + λ_w w_j^t )
          = w_j^t − η ( Σ_{i=1}^N G_i^t · ∇_{w_j} f(x_i) + λ_w w_j^t )
          = w_j^t − η ( Σ_{i=1}^N G_i^t · x_{ij} + λ_w w_j^t )                           (1)

where the multiplier G_i^t is given by

G_i^t = f(x_i) − y_i,                              if squared loss (regression)
      = −y_i / ( 1 + exp( y_i · f(x_i) ) ),        if logistic loss (classification)     (2)
Gradient Descent updates (second-order features)

v_{jk}^{t+1} ← v_{jk}^t − η ( Σ_{i=1}^N ∇_{v_{jk}} l_i(w, V) + λ_v v_{jk}^t )
            = v_{jk}^t − η ( Σ_{i=1}^N G_i^t · ∇_{v_{jk}} f(x_i) + λ_v v_{jk}^t )
            = v_{jk}^t − η ( Σ_{i=1}^N G_i^t · x_{ij} ( Σ_{d=1}^D v_{dk}^t x_{id} − v_{jk}^t x_{ij} ) + λ_v v_{jk}^t )   (3)

where the multiplier G_i^t is the same as before, and a_{ik} = Σ_{d=1}^D v_{dk}^t x_{id} is the synchronization term.
Access pattern of parameter updates for w_j and v_{jk}
[Figure: updating w_j requires computing G_i and updating v_{jk} requires computing a_{ik}; computing both G and A requires accessing all the dimensions j = 1, ..., D, which is the main synchronization bottleneck.]

Doubly-Separable Factorization Machines (DS-FACTO)
Avoiding bulk-synchronization: perform a round of incremental synchronization after all D updates.
Experiments: Single Machine
housing, machines=1, cores=1, threads=1: [plot: objective vs. time (secs) for DS-FACTO and LIBFM]
diabetes, machines=1, cores=1, threads=1: [plot: objective vs. time (secs) for DS-FACTO and LIBFM]
ijcnn1, machines=1, cores=1, threads=1: [plot: objective vs. time (secs) for DS-FACTO and LIBFM]

Experiments: Multi Machine
realsim, machines=8, cores=40, threads=1: [plot: objective vs. time (secs) for DS-FACTO]

Experiments: Scaling in terms of threads
realsim, varying cores and threads as 1, 2, 4, 8, 16, 32: [plot: speedup vs. number of threads/cores, compared with linear speedup]
Appendix: Alternative Reformulation for DS-MLR

Step 1: Introduce redundant constraints (new parameters A) into the original MLR problem

min_{W,A} L_1(W, A) = (λ/2) Σ_{k=1}^K ||w_k||² − (1/N) Σ_{i=1}^N Σ_{k=1}^K y_{ik} w_k^T x_i − (1/N) Σ_{i=1}^N log a_i

s.t.  a_i = 1 / Σ_{k=1}^K exp(w_k^T x_i)
Step 2: Turn the problem into an unconstrained min-max problem by introducing Lagrange multipliers β_i, ∀ i = 1, ..., N

min_{W,A} max_β L_2(W, A, β) = (λ/2) Σ_{k=1}^K ||w_k||² − (1/N) Σ_{i=1}^N Σ_{k=1}^K y_{ik} w_k^T x_i − (1/N) Σ_{i=1}^N log a_i
                               + (1/N) Σ_{i=1}^N Σ_{k=1}^K β_i a_i exp(w_k^T x_i) − (1/N) Σ_{i=1}^N β_i

Primal updates for W, A and a dual update for β (similar in spirit to dual-decomposition methods).
Step 3: Stare at the updates long enough!
When a_i^{t+1} is solved to optimality, it admits an exact closed-form solution given by a_i* = 1 / ( β_i Σ_{k=1}^K exp(w_k^T x_i) ).
The dual-ascent update for β_i is no longer needed, since the penalty is always zero if β_i is set to a constant equal to 1.

min_{W,A} L_3(W, A) = (λ/2) Σ_{k=1}^K ||w_k||² − (1/N) Σ_{i=1}^N Σ_{k=1}^K y_{ik} w_k^T x_i − (1/N) Σ_{i=1}^N log a_i
                      + (1/N) Σ_{i=1}^N Σ_{k=1}^K a_i exp(w_k^T x_i) − 1/N
Step 4: A simple re-write gives the Doubly-Separable form

min_{W,A} Σ_{i=1}^N Σ_{k=1}^K [ λ ||w_k||² / (2N) − y_{ik} w_k^T x_i / N − log(a_i) / (NK) + exp(w_k^T x_i + log a_i) / N − 1 / (NK) ]

  • 2. Motivation Data and Model grow hand in hand 2 / 138
  • 3. Data and Model grow hand in hand e.g. Extreme Multi-class classification 3 / 138
  • 4. Data and Model grow hand in hand e.g. Extreme Multi-label classification http://manikvarma.org/downloads/XC/XMLRepository.html 4 / 138
  • 5. Data and Model grow hand in hand e.g. Topic Models 5 / 138
  • 6. Challenges in Parameter Estimation 6 / 138
  • 7. Challenges in Parameter Estimation 1 Storage limitations of Data and Model 6 / 138
  • 8. Challenges in Parameter Estimation 1 Storage limitations of Data and Model 2 Interdependence in parameter updates 6 / 138
  • 9. Challenges in Parameter Estimation 1 Storage limitations of Data and Model 2 Interdependence in parameter updates 3 Bulk-Synchronization is expensive 6 / 138
  • 10. Challenges in Parameter Estimation 1 Storage limitations of Data and Model 2 Interdependence in parameter updates 3 Bulk-Synchronization is expensive 4 Synchronous communication is inefficient 6 / 138
  • 11. Challenges in Parameter Estimation 1 Storage limitations of Data and Model 2 Interdependence in parameter updates 3 Bulk-Synchronization is expensive 4 Synchronous communication is inefficient Traditional distributed machine learning approaches fall short. 6 / 138
  • 12. Challenges in Parameter Estimation 1 Storage limitations of Data and Model 2 Interdependence in parameter updates 3 Bulk-Synchronization is expensive 4 Synchronous communication is inefficient Traditional distributed machine learning approaches fall short. Hybrid-Parallel algorithms for parameter estimation! 6 / 138
  • 13. Introduction Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 7 / 138
  • 15. Introduction Regularized Risk Minimization Goals in machine learning We want to build a model using observed (training) data 8 / 138
  • 16. Introduction Regularized Risk Minimization Goals in machine learning We want to build a model using observed (training) data Our model must generalize to unseen (test) data 8 / 138
  • 17. Introduction Regularized Risk Minimization Goals in machine learning We want to build a model using observed (training) data Our model must generalize to unseen (test) data min θ L (θ) := λ R (θ) regularizer + 1 N N i=1 loss (xi , yi , θ) empirical risk 8 / 138
  • 18. Introduction Regularized Risk Minimization Goals in machine learning We want to build a model using observed (training) data Our model must generalize to unseen (test) data min θ L (θ) := λ R (θ) regularizer + 1 N N i=1 loss (xi , yi , θ) empirical risk X = {x1, . . . , xN}, y = {y1, . . . , yN} is the observed training data θ are the model parameters 8 / 138
  • 19. Introduction Regularized Risk Minimization Goals in machine learning We want to build a model using observed (training) data Our model must generalize to unseen (test) data min θ L (θ) := λ R (θ) regularizer + 1 N N i=1 loss (xi , yi , θ) empirical risk X = {x1, . . . , xN}, y = {y1, . . . , yN} is the observed training data θ are the model parameters loss (·) to quantify model’s performance 8 / 138
  • 20. Introduction Regularized Risk Minimization Goals in machine learning We want to build a model using observed (training) data Our model must generalize to unseen (test) data min θ L (θ) := λ R (θ) regularizer + 1 N N i=1 loss (xi , yi , θ) empirical risk X = {x1, . . . , xN}, y = {y1, . . . , yN} is the observed training data θ are the model parameters loss (·) to quantify model’s performance regularizer R (θ) to avoid over-fitting (penalizes complex models) 8 / 138
  • 22. Introduction Frequentist Models Binary Classification/Regression Multinomial Logistic Regression Matrix Factorization Latent Collaborative Retrieval Polynomial Regression, Factorization Machines 9 / 138
  • 24. Introduction Bayesian Models $p(\theta \mid X)$ (posterior) $= \dfrac{p(X \mid \theta) \cdot p(\theta)}{\int p(X, \theta)\, d\theta}$ (likelihood $\times$ prior, divided by the marginal likelihood / model evidence) The prior plays the role of the regularizer $R(\theta)$ The likelihood plays the role of the empirical risk 10 / 138
  • 25. Introduction Bayesian Models Gaussian Mixture Models (GMM) Latent Dirichlet Allocation (LDA) $p(\theta \mid X) = \dfrac{p(X \mid \theta) \cdot p(\theta)}{\int p(X, \theta)\, d\theta}$ The prior plays the role of the regularizer $R(\theta)$ The likelihood plays the role of the empirical risk 10 / 138
  • 26. Introduction Focus on Matrix Parameterized Models [figure: data matrix $X \in \mathbb{R}^{N \times D}$ and model matrix $\theta \in \mathbb{R}^{D \times K}$] 11 / 138
  • 27. Introduction Focus on Matrix Parameterized Models [figure: data matrix $X \in \mathbb{R}^{N \times D}$ and model matrix $\theta \in \mathbb{R}^{D \times K}$] What if these matrices do not fit in memory? 11 / 138
  • 28. Distributed Parameter Estimation Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 12 / 138
  • 29. Distributed Parameter Estimation Distributed Parameter Estimation 13 / 138
  • 30. Distributed Parameter Estimation Distributed Parameter Estimation Data parallel [figure: the data $X$ ($N \times D$) is partitioned across workers, the model $\theta$ ($D \times K$) is replicated] e.g. L-BFGS 13 / 138
  • 31. Distributed Parameter Estimation Distributed Parameter Estimation Data parallel [figure: $X$ partitioned, $\theta$ replicated] e.g. L-BFGS Model parallel [figure: the model $\theta$ is partitioned across workers, the data $X$ is replicated] e.g. LC [Gopal et al., 2013] 13 / 138
  • 32. Distributed Parameter Estimation Distributed Parameter Estimation Good Easy to implement using map-reduce Scales as long as Data or Model fits in memory 14 / 138
  • 33. Distributed Parameter Estimation Distributed Parameter Estimation Good Easy to implement using map-reduce Scales as long as Data or Model fits in memory Bad Either Data or the Model is replicated on each worker. 14 / 138
  • 34. Distributed Parameter Estimation Distributed Parameter Estimation Good Easy to implement using map-reduce Scales as long as Data or Model fits in memory Bad Either Data or the Model is replicated on each worker. Data Parallel: Each worker requires $O\!\left(\frac{N \times D}{P}\right) + O(K \times D)$ (the second term is the bottleneck) 14 / 138
  • 35. Distributed Parameter Estimation Distributed Parameter Estimation Good Easy to implement using map-reduce Scales as long as Data or Model fits in memory Bad Either Data or the Model is replicated on each worker. Data Parallel: Each worker requires $O\!\left(\frac{N \times D}{P}\right) + O(K \times D)$ (the second term is the bottleneck) Model Parallel: Each worker requires $O(N \times D)$ (bottleneck) $+ \, O\!\left(\frac{K \times D}{P}\right)$ 14 / 138
  • 36. Distributed Parameter Estimation Distributed Parameter Estimation Question Can we get the best of both worlds? 15 / 138
  • 37. Distributed Parameter Estimation Hybrid Parallelism [figure: both the data matrix $X$ ($N \times D$) and the model matrix $\theta$ ($D \times K$) are partitioned across workers] 16 / 138
  • 38. Distributed Parameter Estimation Benefits of Hybrid-Parallelism 1 One versatile method for all regimes of data and model parallelism 17 / 138
  • 39. Distributed Parameter Estimation Why Hybrid Parallelism? 18 / 138
  • 40. Distributed Parameter Estimation Why Hybrid Parallelism? 19 / 138
  • 41. Distributed Parameter Estimation Why Hybrid Parallelism? 20 / 138
  • 42. Distributed Parameter Estimation Why Hybrid Parallelism? 21 / 138
  • 43. Distributed Parameter Estimation Why Hybrid Parallelism? [chart: data size (MB) vs. parameter size (MB) on a log scale for LSHTC1-small, LSHTC1-large, ODP, YouTube8M-Video and Reddit-Full, compared against the max memory of a commodity machine] 22 / 138
  • 44. Distributed Parameter Estimation Benefits of Hybrid-Parallelism 1 One versatile method for all regimes of data and model parallelism 2 Independent parameter updates on each worker 23 / 138
  • 45. Distributed Parameter Estimation Benefits of Hybrid-Parallelism 1 One versatile method for all regimes of data and model parallelism 2 Independent parameter updates on each worker 3 Fully de-centralized and asynchronous optimization algorithms 24 / 138
  • 46. Distributed Parameter Estimation How do we achieve Hybrid Parallelism in machine learning models? 25 / 138
  • 48. Distributed Parameter Estimation Double-Separability Definition Let $\{S_i\}_{i=1}^{m}$ and $\{S'_j\}_{j=1}^{m'}$ be two families of sets of parameters. A function $f : \prod_{i=1}^{m} S_i \times \prod_{j=1}^{m'} S'_j \to \mathbb{R}$ is doubly separable if there exist $f_{ij} : S_i \times S'_j \to \mathbb{R}$ for each $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, m'$ such that: $f(\theta_1, \ldots, \theta_m, \theta'_1, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$ 27 / 138
  • 49. Distributed Parameter Estimation Double-Separability $f(\theta_1, \ldots, \theta_m, \theta'_1, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$ 28 / 138
  • 50. Distributed Parameter Estimation Double-Separability $f(\theta_1, \ldots, \theta_m, \theta'_1, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$ [figure: the $m \times m'$ grid of terms $f_{ij}(\theta_i, \theta'_j)$] 28 / 138
  • 51. Distributed Parameter Estimation Double-Separability $f(\theta_1, \ldots, \theta_m, \theta'_1, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$ [figure: the same grid with diagonal blocks highlighted] The $f_{ij}$ corresponding to highlighted diagonal blocks can be computed independently and in parallel 28 / 138
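To make the block structure concrete, here is a small illustrative sketch (my own, not code from the thesis): a doubly-separable objective is evaluated one block of index pairs at a time, and in each round the $P$ "diagonal" blocks touch disjoint parameters, so they could be processed by $P$ workers in parallel.

```python
# Illustrative block-wise evaluation of a doubly-separable objective.
import numpy as np

def f_ij(theta_i, theta_j_prime):
    # Any per-pair term works; here a squared inner-product residual.
    return (1.0 - theta_i @ theta_j_prime) ** 2

def block_sum(thetas, thetas_prime, rows, cols):
    # Partial objective restricted to a (rows x cols) block of index pairs.
    return sum(f_ij(thetas[i], thetas_prime[j]) for i in rows for j in cols)

m, d, P = 8, 3, 4                          # m parameters per family, P workers
rng = np.random.default_rng(0)
thetas = rng.normal(size=(m, d))
thetas_prime = rng.normal(size=(m, d))

row_blocks = np.array_split(np.arange(m), P)
col_blocks = np.array_split(np.arange(m), P)

# In round r, worker p owns row block p and column block (p + r) % P, so the
# P active blocks touch disjoint parameters and could run in parallel.
total = 0.0
for r in range(P):
    for p in range(P):
        total += block_sum(thetas, thetas_prime,
                           row_blocks[p], col_blocks[(p + r) % P])

# Sanity check against the naive double sum.
naive = block_sum(thetas, thetas_prime, range(m), range(m))
assert np.isclose(total, naive)
```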
  • 52. Distributed Parameter Estimation Direct Double-Separability e.g. Matrix Factorization 29 / 138
  • 53. Distributed Parameter Estimation Direct Double-Separability e.g. Matrix Factorization $L(w_1, \ldots, w_N, h_1, \ldots, h_M) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - \langle w_i, h_j \rangle \right)^2$ 29 / 138
  • 54. Distributed Parameter Estimation Direct Double-Separability e.g. Matrix Factorization $L(w_1, \ldots, w_N, h_1, \ldots, h_M) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( X_{ij} - \langle w_i, h_j \rangle \right)^2$ The objective function is trivially doubly-separable! [Yun et al 2014] 29 / 138
  • 55. Distributed Parameter Estimation Others require Reformulations original objective f (·) 30 / 138
  • 56. Distributed Parameter Estimation Others require Reformulations original objective f (·) Reformulate 30 / 138
  • 57. Distributed Parameter Estimation Others require Reformulations original objective $f(\cdot)$ Reformulate $\sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$ 30 / 138
  • 58. Distributed Parameter Estimation Technical Contributions of my thesis Study frequentist and Bayesian models whose parameter-estimation objectives are not directly doubly separable, 31 / 138
  • 59. Distributed Parameter Estimation Technical Contributions of my thesis Study frequentist and Bayesian models whose parameter-estimation objectives are not directly doubly separable, 1 Propose reformulations that lead to Hybrid-Parallelism 31 / 138
  • 60. Distributed Parameter Estimation Technical Contributions of my thesis Study frequentist and Bayesian models whose parameter-estimation objectives are not directly doubly separable, 1 Propose reformulations that lead to Hybrid-Parallelism 2 Design asynchronous distributed optimization algorithms 31 / 138
  • 61. Thesis Roadmap Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 32 / 138
  • 62. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 33 / 138
  • 63. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Multinomial Logistic Regression (MLR) Given Training data (xi , yi )i=1,...,N where xi ∈ RD corresp. labels yi ∈ {1, 2, . . . , K} N, D and K are large (N >>> D >> K) 34 / 138
  • 64. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Multinomial Logistic Regression (MLR) Given Training data $(x_i, y_i)_{i=1,\ldots,N}$ where $x_i \in \mathbb{R}^D$, corresp. labels $y_i \in \{1, 2, \ldots, K\}$ N, D and K are large (N >>> D >> K) Goal The probability that $x_i$ belongs to class $k$ is given by: $p(y_i = k \mid x_i, W) = \frac{\exp(w_k^\top x_i)}{\sum_{j=1}^{K} \exp(w_j^\top x_i)}$ where $W = \{w_1, w_2, \ldots, w_K\}$ denotes the parameters of the model. 34 / 138
  • 65. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Multinomial Logistic Regression (MLR) The corresponding $\ell_2$-regularized negative log-likelihood loss: $\min_W \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\, w_k^\top x_i + \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp(w_k^\top x_i)$ where $\lambda$ is the regularization hyper-parameter. 35 / 138
  • 66. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Multinomial Logistic Regression (MLR) The corresponding $\ell_2$-regularized negative log-likelihood loss: $\min_W \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\, w_k^\top x_i + \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp(w_k^\top x_i)$ (the log-sum-exp term makes model parallelism hard) where $\lambda$ is the regularization hyper-parameter. 36 / 138
  • 67. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Reformulation into Doubly-Separable form Log-concavity bound [GopYan13] $\log(\gamma) \le a \cdot \gamma - \log(a) - 1, \quad \forall \gamma, a > 0$, where $a$ is a variational parameter. This bound is tight when $a = \frac{1}{\gamma}$. 37 / 138
  • 68. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Reformulating the objective of MLR $\min_{W,A} \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 + \frac{1}{N} \sum_{i=1}^{N} \left[ -\sum_{k=1}^{K} y_{ik}\, w_k^\top x_i + a_i \sum_{k=1}^{K} \exp(w_k^\top x_i) - \log(a_i) - 1 \right]$ where $a_i$ can be computed in closed form as: $a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$ 38 / 138
  • 69. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Doubly-Separable Multinomial Logistic Regression (DS-MLR) Doubly-Separable form $\min_{W,A} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \frac{\lambda \|w_k\|^2}{2N} - \frac{y_{ik}\, w_k^\top x_i}{N} - \frac{\log a_i}{NK} + \frac{\exp(w_k^\top x_i + \log a_i)}{N} - \frac{1}{NK} \right]$ 39 / 138
  • 70. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Doubly-Separable Multinomial Logistic Regression (DS-MLR) Stochastic Gradient Updates Each term in the stochastic update depends only on data point $i$ and class $k$. 40 / 138
  • 71. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Doubly-Separable Multinomial Logistic Regression (DS-MLR) Stochastic Gradient Updates Each term in the stochastic update depends only on data point $i$ and class $k$. $w_k^{t+1} \leftarrow w_k^{t} - \eta_t K \left( \lambda w_k^{t} - y_{ik} x_i + \exp(w_k^\top x_i + \log a_i)\, x_i \right)$ 40 / 138
  • 72. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Doubly-Separable Multinomial Logistic Regression (DS-MLR) Stochastic Gradient Updates Each term in the stochastic update depends only on data point $i$ and class $k$. $w_k^{t+1} \leftarrow w_k^{t} - \eta_t K \left( \lambda w_k^{t} - y_{ik} x_i + \exp(w_k^\top x_i + \log a_i)\, x_i \right)$ $\log a_i^{t+1} \leftarrow \log a_i^{t} - \eta_t K \left( \exp(w_k^\top x_i + \log a_i) - \frac{1}{K} \right)$ 40 / 138
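A minimal single-machine sketch of these stochastic updates is given below. It is an illustration only: variable names such as `W`, `log_a`, `lam`, `eta` are mine, the released DS-MLR code is a distributed C++ implementation, and the regularizer gradient is taken as $\lambda w_k$, consistent with the doubly-separable objective above.

```python
# Hypothetical, single-machine sketch of the DS-MLR stochastic updates above.
import numpy as np

def dsmlr_epoch(X, Y, W, log_a, lam, eta):
    """One pass over (i, k) pairs in random order.
    X: (N, D) features, Y: (N, K) one-hot labels,
    W: (K, D) class weights, log_a: (N,) per-example variational parameters."""
    N, _ = X.shape
    K = W.shape[0]
    for i in np.random.permutation(N):
        for k in np.random.permutation(K):
            s = np.exp(W[k] @ X[i] + log_a[i])        # exp(w_k^T x_i + log a_i)
            grad_w = lam * W[k] - Y[i, k] * X[i] + s * X[i]
            W[k] -= eta * K * grad_w                  # stochastic update for w_k
            log_a[i] -= eta * K * (s - 1.0 / K)       # stochastic update for log a_i
    return W, log_a
```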
  • 73. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Access Pattern of updates: Stoch $w_k$, Stoch $a_i$ [figure] (a) Updating $w_k$ only requires computing $a_i$ (b) Updating $a_i$ only requires accessing $w_k$ and $x_i$. 41 / 138
  • 74. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Updating $a_i$: Closed form instead of Stoch update Closed-form update for $a_i$: $a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$ 42 / 138
  • 75. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Access Pattern of updates: Stoch $w_k$, Exact $a_i$ [figure] (a) Updating $w_k$ only requires computing $a_i$ (b) Updating $a_i$ requires accessing the entire $W$. Synchronization bottleneck! 43 / 138
  • 76. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Updating $a_i$: Avoiding bulk-synchronization Closed-form update for $a_i$: $a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$ 44 / 138
  • 77. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Updating $a_i$: Avoiding bulk-synchronization Closed-form update for $a_i$: $a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$ Each worker computes a partial sum using the $w_k$ it owns. 44 / 138
  • 78. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Updating $a_i$: Avoiding bulk-synchronization Closed-form update for $a_i$: $a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$ Each worker computes a partial sum using the $w_k$ it owns. P workers: After P rounds the global sum is available 44 / 138
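The sketch below (my own illustration, not the thesis code) shows the idea for the vector of all $a_i$: each of the $P$ workers owns a vertical slice of $W$ and contributes its partial sum of $\exp(w_k^\top x_i)$; once all $P$ slices have been folded in, the full denominator, and hence each $a_i$, is available without a bulk synchronization step.

```python
# Illustrative partial-sum accumulation for the closed-form a_i update.
import numpy as np

N, D, K, P = 6, 4, 8, 4
rng = np.random.default_rng(1)
X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D))
class_blocks = np.array_split(np.arange(K), P)   # worker p owns class_blocks[p]

partial = np.zeros(N)                            # running denominators, one per example
for r in range(P):                               # one worker's contribution per round
    owned = class_blocks[r]
    partial += np.exp(X @ W[owned].T).sum(axis=1)

a = 1.0 / partial
assert np.allclose(a, 1.0 / np.exp(X @ W.T).sum(axis=1))
```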
  • 79. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Synchronous DSGD [Gemulla et al., 2011] X and local parameters A are partitioned horizontally (1, . . . , N) 45 / 138
  • 80. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Synchronous DSGD [Gemulla et al., 2011] X and local parameters A are partitioned horizontally (1, . . . , N) Global model parameters W are partitioned vertically (1, . . . , K) 45 / 138
  • 81. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Synchronous DSGD [Gemulla et al., 2011] X and local parameters A are partitioned horizontally (1, . . . , N) Global model parameters W are partitioned vertically (1, . . . , K) P = 4 workers work on mutually-exclusive blocks of A and W 45 / 138
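A DSGD-style schedule can be written down in a few lines. The toy sketch below (assumed, simplified) only prints the block assignment: in sub-epoch $s$, worker $p$ works on example block $p$ and class block $(p + s) \bmod P$, so no two workers ever touch the same rows of $X$/$A$ or the same columns (classes) of $W$.

```python
# Toy DSGD-style schedule over P mutually exclusive (example block, class block) pairs.
P = 4
for s in range(P):                                   # P sub-epochs make one epoch
    assignments = [(p, (p + s) % P) for p in range(P)]
    print(f"sub-epoch {s}: worker -> (example block, class block): {assignments}")
```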
  • 82. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Asynchronous NOMAD [Yun et al., 2014] [figure: asynchronous NOMAD-style block assignment of A and W across workers] 46 / 138
  • 83. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Asynchronous NOMAD [Yun et al 2014] [figure: the block assignment after asynchronous migrations] 47 / 138
  • 84. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Asynchronous NOMAD [Yun et al 2014] [figure: the block assignment after further asynchronous migrations] 48 / 138
  • 85. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Parallelization: Asynchronous NOMAD [Yun et al 2014] [figure: the block assignment after further asynchronous migrations] 49 / 138
  • 86. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Datasets 50 / 138
  • 87. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Datasets
Dataset | # instances | # features | # classes | data (train + test) | parameters | sparsity (% nnz)
CLEF | 10,000 | 80 | 63 | 9.6 MB + 988 KB | 40 KB | 100
NEWS20 | 11,260 | 53,975 | 20 | 21 MB + 14 MB | 9.79 MB | 0.21
LSHTC1-small | 4,463 | 51,033 | 1,139 | 11 MB + 4 MB | 465 MB | 0.29
LSHTC1-large | 93,805 | 347,256 | 12,294 | 258 MB + 98 MB | 34 GB | 0.049
ODP | 1,084,404 | 422,712 | 105,034 | 3.8 GB + 1.8 GB | 355 GB | 0.0533
YouTube8M-Video | 4,902,565 | 1,152 | 4,716 | 59 GB + 17 GB | 43 MB | 100
Reddit-Full | 211,532,359 | 1,348,182 | 33,225 | 159 GB + 69 GB | 358 GB | 0.0036
Table: Characteristics of the datasets used 51 / 138
  • 88. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Single Machine NEWS20, Data=35 MB, Model=9.79 MB [plot: objective vs. time (secs) for DS-MLR, L-BFGS, LC] 52 / 138
  • 89. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Single Machine CLEF, Data=10 MB, Model=40 KB [plot: objective vs. time (secs) for DS-MLR, L-BFGS, LC] 53 / 138
  • 90. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Single Machine LSHTC1-small, Data=15 MB, Model=465 MB [plot: objective vs. time (secs) for DS-MLR, L-BFGS, LC] 54 / 138
  • 91. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Multi Machine LSHTC1-large, Data=356 MB, Model=34 GB, machines=4, threads=12 [plot: objective vs. time (secs) for DS-MLR, LC] 55 / 138
  • 92. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Multi Machine ODP, Data=5.6 GB, Model=355 GB, machines=20, threads=260 [plot: objective vs. time (secs) for DS-MLR] 56 / 138
  • 93. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Multi Machine Dense Dataset YouTube-Video, Data=76 GB, Model=43 MB, machines=4, threads=260 [plot: objective vs. time (secs) for DS-MLR] 57 / 138
  • 94. Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Experiments: Multi Machine (Nothing fits in memory!) Reddit-Full, Data=228 GB, Model=358 GB, machines=40, threads=250 [plot: objective vs. time (secs) for DS-MLR] 211 million examples, 44 billion parameters (# features × # classes) 58 / 138
  • 95. Thesis Roadmap Mixture Models (AISTATS 2019) Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 59 / 138
  • 96. Thesis Roadmap Mixture Models (AISTATS 2019) Mixture Of Exponential Families 60 / 138
  • 97. Thesis Roadmap Mixture Models (AISTATS 2019) Mixture Of Exponential Families k-means Gaussian Mixture Models (GMM) Latent Dirichlet Allocation (LDA) Stochastic Mixed Membership Models 60 / 138
  • 98. Thesis Roadmap Mixture Models (AISTATS 2019) Generative Model 61 / 138
  • 99. Thesis Roadmap Mixture Models (AISTATS 2019) Generative Model Draw Cluster Proportions $\pi$: $p(\pi \mid \alpha) = \text{Dirichlet}(\alpha)$ 61 / 138
  • 100. Thesis Roadmap Mixture Models (AISTATS 2019) Generative Model Draw Cluster Proportions $\pi$: $p(\pi \mid \alpha) = \text{Dirichlet}(\alpha)$ Draw Cluster Parameters for $k = 1, \ldots, K$: $p(\theta_k \mid n_k, \nu_k) = \exp\left( \langle n_k \cdot \nu_k, \theta_k \rangle - n_k \cdot g(\theta_k) - h(n_k, \nu_k) \right)$ 61 / 138
  • 101. Thesis Roadmap Mixture Models (AISTATS 2019) Generative Model Draw Cluster Proportions $\pi$: $p(\pi \mid \alpha) = \text{Dirichlet}(\alpha)$ Draw Cluster Parameters for $k = 1, \ldots, K$: $p(\theta_k \mid n_k, \nu_k) = \exp\left( \langle n_k \cdot \nu_k, \theta_k \rangle - n_k \cdot g(\theta_k) - h(n_k, \nu_k) \right)$ Draw Observations for $i = 1, \ldots, N$: $p(z_i \mid \pi) = \text{Multinomial}(\pi)$, $p(x_i \mid z_i, \theta) = \exp\left( \langle \phi(x_i, z_i), \theta_{z_i} \rangle - g(\theta_{z_i}) \right)$ 61 / 138
  • 102. Thesis Roadmap Mixture Models (AISTATS 2019) Generative Model Draw Cluster Proportions $\pi$: $p(\pi \mid \alpha) = \text{Dirichlet}(\alpha)$ Draw Cluster Parameters for $k = 1, \ldots, K$: $p(\theta_k \mid n_k, \nu_k) = \exp\left( \langle n_k \cdot \nu_k, \theta_k \rangle - n_k \cdot g(\theta_k) - h(n_k, \nu_k) \right)$ Draw Observations for $i = 1, \ldots, N$: $p(z_i \mid \pi) = \text{Multinomial}(\pi)$, $p(x_i \mid z_i, \theta) = \exp\left( \langle \phi(x_i, z_i), \theta_{z_i} \rangle - g(\theta_{z_i}) \right)$ $p(\theta_k \mid n_k, \nu_k)$ is conjugate to $p(x_i \mid z_i = k, \theta_k)$ $p(\pi \mid \alpha)$ is conjugate to $p(z_i \mid \pi)$ 61 / 138
  • 103. Thesis Roadmap Mixture Models (AISTATS 2019) Joint Distribution $p(\pi, \theta, z, x \mid \alpha, n, \nu) = p(\pi \mid \alpha) \cdot \prod_{k=1}^{K} p(\theta_k \mid n_k, \nu_k) \cdot \prod_{i=1}^{N} p(z_i \mid \pi) \cdot p(x_i \mid z_i, \theta)$ 62 / 138
  • 104. Thesis Roadmap Mixture Models (AISTATS 2019) Joint Distribution $p(\pi, \theta, z, x \mid \alpha, n, \nu) = p(\pi \mid \alpha) \cdot \prod_{k=1}^{K} p(\theta_k \mid n_k, \nu_k) \cdot \prod_{i=1}^{N} p(z_i \mid \pi) \cdot p(x_i \mid z_i, \theta)$ Want to estimate the posterior: $p(\pi, \theta, z \mid x, \alpha, n, \nu)$ 62 / 138
  • 105. Thesis Roadmap Mixture Models (AISTATS 2019) Variational Inference [Blei et al., 2016] Approximate $p(\pi, \theta, z \mid x, \alpha, n, \nu)$ with a simpler (factorized) distribution $q$: $q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z}) = q(\pi \mid \tilde{\pi}) \cdot \prod_{k=1}^{K} q(\theta_k \mid \tilde{\theta}_k) \cdot \prod_{i=1}^{N} q(z_i \mid \tilde{z}_i)$ 63 / 138
  • 106. Thesis Roadmap Mixture Models (AISTATS 2019) Variational Inference 64 / 138
  • 107. Thesis Roadmap Mixture Models (AISTATS 2019) Evidence Lower Bound (ELBO) $\mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z}) = \mathbb{E}_{q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z})}\left[ \log p(\pi, \theta, z, x \mid \alpha, n, \nu) \right] - \mathbb{E}_{q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z})}\left[ \log q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z}) \right]$ 65 / 138
  • 108. Thesis Roadmap Mixture Models (AISTATS 2019) Evidence Lower Bound (ELBO) $\mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z}) = \mathbb{E}_{q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z})}\left[ \log p(\pi, \theta, z, x \mid \alpha, n, \nu) \right] - \mathbb{E}_{q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z})}\left[ \log q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z}) \right]$ 65 / 138
  • 109. Thesis Roadmap Mixture Models (AISTATS 2019) Variational Inference (VI) M-step: $\max_{\tilde{\pi}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$, $\max_{\tilde{\theta}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$ E-step: $\max_{\tilde{z}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$ 66 / 138
  • 110. Thesis Roadmap Mixture Models (AISTATS 2019) Variational Inference: M-step $\max_{\tilde{\pi}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$: $\tilde{\pi}_k = \alpha + \sum_{i=1}^{N} \tilde{z}_{i,k}$ 67 / 138
  • 111. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{\pi}$ update [figure: variables touched by the $\tilde{\pi}$ update] 68 / 138
  • 112. Thesis Roadmap Mixture Models (AISTATS 2019) Variational Inference: M-step $\max_{\tilde{\theta}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$: $\tilde{n}_k = n_k + \sum_{i=1}^{N} \tilde{z}_{i,k}$, $\tilde{\nu}_k = n_k \cdot \nu_k + \sum_{i=1}^{N} \tilde{z}_{i,k} \cdot \phi(x_i, k)$ 69 / 138
  • 113. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{\theta}$ update [figure: variables touched by the $\tilde{\theta}$ update] 70 / 138
  • 114. Thesis Roadmap Mixture Models (AISTATS 2019) Variational Inference: E-step $\max_{\tilde{z}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$: $\tilde{z}_{i,k} \propto \exp\left( \psi(\tilde{\pi}_k) - \psi\!\left( \sum_{k'=1}^{K} \tilde{\pi}_{k'} \right) + \left\langle \phi(x_i, k), \mathbb{E}_{q(\theta_k \mid \tilde{\theta}_k)}[\theta_k] \right\rangle - \mathbb{E}_{q(\theta_k \mid \tilde{\theta}_k)}[g(\theta_k)] \right)$ 71 / 138
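For orientation, here is a structural sketch of one full VI round implementing the three updates above for a generic exponential-family mixture. It is a template, not the thesis implementation: `phi`, `e_theta`, and `e_g` are user-supplied callables for a concrete family (sufficient statistics, $\mathbb{E}_q[\theta_k]$, and $\mathbb{E}_q[g(\theta_k)]$ respectively), and all variable names are assumptions of mine.

```python
# Template for one VI round of the exponential-family mixture above.
import numpy as np
from scipy.special import digamma

def vi_step(X, z, alpha, n0, nu0, phi, e_theta, e_g):
    """X: (N, ...) data, z: (N, K) responsibilities, alpha: Dirichlet prior,
    n0: (K,) prior counts, nu0: (K, d) prior pseudo-means,
    phi(x, k) -> (d,), e_theta(n, nu) -> (d,), e_g(n, nu) -> scalar."""
    N, K = z.shape
    # M-step: cluster proportions and cluster-parameter posteriors
    pi_t = alpha + z.sum(axis=0)
    n_t = n0 + z.sum(axis=0)
    nu_t = np.array([n0[k] * nu0[k] + sum(z[i, k] * phi(X[i], k) for i in range(N))
                     for k in range(K)])
    # E-step: responsibilities
    log_r = np.empty((N, K))
    for i in range(N):
        for k in range(K):
            log_r[i, k] = (digamma(pi_t[k]) - digamma(pi_t.sum())
                           + phi(X[i], k) @ e_theta(n_t[k], nu_t[k])
                           - e_g(n_t[k], nu_t[k]))
    log_r -= log_r.max(axis=1, keepdims=True)      # normalize stably
    z_new = np.exp(log_r)
    z_new /= z_new.sum(axis=1, keepdims=True)
    return pi_t, n_t, nu_t, z_new
```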
  • 115. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{z}$ update [figure: variables touched by the $\tilde{z}$ update] 72 / 138
  • 116. Thesis Roadmap Mixture Models (AISTATS 2019) Stochastic Variational Inference (SVI) [Hoffman et al., 2013] M-step: $\max_{\tilde{\pi}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$, $\max_{\tilde{\theta}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$ E-step: $\max_{\tilde{z}_i} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z}_i, \tilde{z}_{-i})$ 73 / 138
  • 117. Thesis Roadmap Mixture Models (AISTATS 2019) Stochastic Variational Inference (SVI) [Hoffman et al., 2013] M-step: $\max_{\tilde{\pi}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$, $\max_{\tilde{\theta}} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z})$ E-step: $\max_{\tilde{z}_i} \mathcal{L}(\tilde{\pi}, \tilde{\theta}, \tilde{z}_i, \tilde{z}_{-i})$ More frequent updates to global variables $\Longrightarrow$ Faster convergence 73 / 138
  • 118. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{\pi}$ update [figure: variables touched by the $\tilde{\pi}$ update] 74 / 138
  • 119. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{\theta}$ update [figure: variables touched by the $\tilde{\theta}$ update] 75 / 138
  • 120. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{z}$ update [figure: variables touched by the $\tilde{z}$ update] 76 / 138
  • 121. Thesis Roadmap Mixture Models (AISTATS 2019) Extreme Stochastic Variational Inference (ESVI) M-step: $\max_{\tilde{\pi}_k, \tilde{\pi}_{k'}} \mathcal{L}\left( [\tilde{\pi}_1, \ldots, \tilde{\pi}_k, \tilde{\pi}_{k+1}, \ldots, \tilde{\pi}_{k'}, \ldots, \tilde{\pi}_K], \tilde{\theta}, \tilde{z} \right)$, $\max_{\tilde{\theta}_k, \tilde{\theta}_{k'}} \mathcal{L}\left( \tilde{\pi}, [\tilde{\theta}_1, \ldots, \tilde{\theta}_k, \tilde{\theta}_{k+1}, \ldots, \tilde{\theta}_{k'}, \ldots, \tilde{\theta}_K], \tilde{z} \right)$ E-step: $\max_{\tilde{z}_{i,k}, \tilde{z}_{i,k'}} \mathcal{L}\left( \tilde{\pi}, \tilde{\theta}, \tilde{z}_{-i}, [\tilde{z}_{i,1}, \ldots, \tilde{z}_{i,k}, \tilde{z}_{i,k+1}, \ldots, \tilde{z}_{i,k'}, \ldots, \tilde{z}_{i,K}] \right)$ 77 / 138
  • 122. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{\pi}$ update [figure: variables touched by the pairwise $\tilde{\pi}$ update] 78 / 138
  • 123. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{\theta}$ update [figure: variables touched by the pairwise $\tilde{\theta}$ update] 79 / 138
  • 124. Thesis Roadmap Mixture Models (AISTATS 2019) Access Pattern of Variables: $\tilde{z}$ update [figure: variables touched by the pairwise $\tilde{z}$ update] 80 / 138
  • 125. Thesis Roadmap Mixture Models (AISTATS 2019) Key Observation M-step: $\max_{\tilde{\pi}_k, \tilde{\pi}_{k'}} \mathcal{L}\left( [\tilde{\pi}_1, \ldots, \tilde{\pi}_k, \tilde{\pi}_{k+1}, \ldots, \tilde{\pi}_{k'}, \ldots, \tilde{\pi}_K], \tilde{\theta}, \tilde{z} \right)$, $\max_{\tilde{\theta}_k, \tilde{\theta}_{k'}} \mathcal{L}\left( \tilde{\pi}, [\tilde{\theta}_1, \ldots, \tilde{\theta}_k, \tilde{\theta}_{k+1}, \ldots, \tilde{\theta}_{k'}, \ldots, \tilde{\theta}_K], \tilde{z} \right)$ E-step: $\max_{\tilde{z}_{i,k}, \tilde{z}_{i,k'}} \mathcal{L}\left( \tilde{\pi}, \tilde{\theta}, \tilde{z}_{-i}, [\tilde{z}_{i,1}, \ldots, \tilde{z}_{i,k}, \tilde{z}_{i,k+1}, \ldots, \tilde{z}_{i,k'}, \ldots, \tilde{z}_{i,K}] \right)$ All the above updates can be done in parallel! 81 / 138
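The reason these pairwise updates parallelize is that each one touches only two coordinates. The toy sketch below (my simplification, not the ESVI code) shows the E-step flavor of this: the mass $\tilde{z}_{i,k} + \tilde{z}_{i,k'}$ is redistributed between clusters $k$ and $k'$ without looking at the other $K - 2$ coordinates, so the simplex constraint is preserved.

```python
# Pairwise redistribution of responsibility mass between two clusters.
import numpy as np

def pairwise_z_update(z_i, k, k_prime, log_score_k, log_score_kp):
    """log_score_* are unnormalized log responsibilities for the two clusters,
    e.g. the digamma/expectation terms from the E-step formula."""
    mass = z_i[k] + z_i[k_prime]                 # total mass over the pair stays fixed
    w = np.exp(log_score_k - np.logaddexp(log_score_k, log_score_kp))
    z_i[k], z_i[k_prime] = mass * w, mass * (1.0 - w)
    return z_i

z_i = np.array([0.1, 0.4, 0.3, 0.2])
z_i = pairwise_z_update(z_i, 1, 3, log_score_k=-0.2, log_score_kp=-1.0)
assert np.isclose(z_i.sum(), 1.0)                # still on the simplex
```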
  • 126. Thesis Roadmap Mixture Models (AISTATS 2019) Distributing Computation NOMAD [Yun et al., 2014] [figure: NOMAD-style assignment of $\tilde{z}$, $x$, $\tilde{\pi}$, $\tilde{\theta}$ across workers] 82 / 138
  • 127. Thesis Roadmap Mixture Models (AISTATS 2019) Distributing Computation NOMAD [Yun et al., 2014] [figure: the assignment after asynchronous migrations] 82 / 138
  • 128. Thesis Roadmap Mixture Models (AISTATS 2019) Distributing Computation NOMAD [Yun et al., 2014] [figure: the assignment after further asynchronous migrations] 82 / 138
  • 129. Thesis Roadmap Mixture Models (AISTATS 2019) Distributing Computation NOMAD [Yun et al., 2014] [figure: the assignment after further asynchronous migrations] 82 / 138
  • 130. Thesis Roadmap Mixture Models (AISTATS 2019) Keeping $\tilde{\pi}$ in Sync A single global $\tilde{\pi}$ travels among machines and broadcasts local delta updates. Every machine $p$ keeps $(\pi_p, \bar{\pi})$, where $\pi_p$ is its local working copy and $\bar{\pi}$ is a snapshot of the global $\tilde{\pi}$. 83 / 138
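A sketch of this delta-sync bookkeeping (hypothetical names, not the ESVI source) could look as follows: when the travelling global copy reaches a machine, the machine folds in only what it changed locally since the last visit and then refreshes both its snapshot and its working copy.

```python
# Illustrative delta-sync for the travelling global pi vector.
import numpy as np

def fold_in_delta(pi_global, pi_local, pi_snap):
    """pi_global: the travelling global copy, pi_local: this machine's working copy,
    pi_snap: snapshot of the global value at the last visit."""
    delta = pi_local - pi_snap           # what this machine changed since the last sync
    pi_global = pi_global + delta        # merge the local delta into the global copy
    pi_snap = pi_global.copy()           # new agreed baseline
    pi_local = pi_global.copy()          # keep working from the merged value
    return pi_global, pi_local, pi_snap
```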
  • 131. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Datasets
Dataset | # documents | # vocabulary | # words
NIPS | 1,312 | 12,149 | 1,658,309
Enron | 37,861 | 28,102 | 6,238,796
NY Times | 298,000 | 102,660 | 98,793,316
PubMed | 8,200,000 | 141,043 | 737,869,083
UMBC-3B | 40,599,164 | 3,431,260 | 3,013,004,127
Mixture Models: Gaussian Mixture Model (GMM), Latent Dirichlet Allocation (LDA) Methods: Variational Inference (VI), Stochastic Variational Inference (SVI), Extreme Stochastic Variational Inference (ESVI) 84 / 138
  • 132. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Single Machine (GMM) TOY, machines=1, cores=1, topics=256 [plot: ELBO vs. time (secs) for ESVI, SVI, VI] 85 / 138
  • 133. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Single Machine (GMM) AP-DATA, machines=1, cores=1, topics=256 [plot: ELBO vs. time (secs) for ESVI, SVI, VI] 86 / 138
  • 134. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Multi Machine (GMM) NIPS, machines=16, cores=1, topics=256 [plot: ELBO vs. time (secs) for ESVI, VI] 87 / 138
  • 135. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Multi Machine (GMM) NY Times, machines=16, cores=1, topics=256 [plot: ELBO vs. time (secs) for VI, ESVI] 88 / 138
  • 136. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Single Machine Single Core (LDA) Enron, machines=1, cores=1, topics=128 [plot: ELBO vs. time (secs) for ESVI, VI, SVI] 89 / 138
  • 137. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Single Machine Single Core (LDA) NY Times, machines=1, cores=1, topics=16 [plot: ELBO vs. time (secs) for ESVI, VI, SVI] 90 / 138
  • 138. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Single Machine Multi Core (LDA) Enron, machines=1, cores=16, topics=128 [plot: ELBO vs. time (secs) for ESVI, dist VI, stream SVI] 91 / 138
  • 139. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Single Machine Multi Core (LDA) NY Times, machines=1, cores=16, topics=64 [plot: ELBO vs. time (secs) for ESVI, dist VI, stream SVI] 92 / 138
  • 140. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Multi Machine Multi Core (LDA) PUBMED, machines=32, cores=16, topics=128 [plot: ELBO vs. time (secs) for ESVI, dist VI] 93 / 138
  • 141. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Multi Machine Multi Core (LDA) UMBC-3B, machines=32, cores=16, topics=128 [plot: ELBO vs. time (secs) for ESVI, dist VI] 94 / 138
  • 142. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Predictive Performance (LDA) Enron, machines=1, cores=16, topics=32 [plot: Test Perplexity vs. time (secs) for ESVI, VI] 95 / 138
  • 143. Thesis Roadmap Mixture Models (AISTATS 2019) Experiments: Predictive Performance (LDA) NY Times, machines=1, cores=16, topics=32 [plot: Test Perplexity vs. time (secs) for ESVI, VI] 96 / 138
  • 144. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 97 / 138
  • 145. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Binary Classification [plot: loss vs. margin for the 0-1 loss $I(\cdot < 0)$, the logistic loss $\sigma(\cdot)$, and the hinge loss] Convex objective functions are sensitive to outliers 98 / 138
  • 146. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Robust Binary Classification [plot: loss vs. $t$ for $\sigma(t)$, $\sigma_1(t) := \rho_1(\sigma(t))$, and $\sigma_2(t) := \rho_2(\sigma(t))$] where $\rho_1(t) = \log_2(t + 1)$ and $\rho_2(t) = 1 - \frac{1}{\log_2(t+2)}$ Bending the losses provides robustness 99 / 138
  • 147. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Objectives Can we use this to design robust losses for ranking? Two scenarios: Explicit: features, labels are observed (Learning to Rank) Implicit: features, labels are latent (Latent Collaborative Retrieval) Distributed Optimization Algorithms 100 / 138
  • 148. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Explicit Setting - Learning to Rank Scoring function: $f(x, y) = \langle \phi(x, y), w \rangle$ 101 / 138
  • 149. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Implicit Setting - Latent Collaborative Retrieval Scoring function: $f(x, y) = \langle U_x, V_y \rangle$ 102 / 138
  • 150. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Formulating a Robust Ranking Loss (RoBiRank) Rank $\operatorname{rank}_w(x, y) = \sum_{y' \in Y_x, y' \neq y} \mathbb{I}\left( f(x, y) - f(x, y') < 0 \right)$ 103 / 138
  • 151. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Formulating a Robust Ranking Loss (RoBiRank) Rank $\operatorname{rank}_w(x, y) = \sum_{y' \in Y_x, y' \neq y} \mathbb{I}\left( f(x, y) - f(x, y') < 0 \right)$ Objective Function $L(w) = \sum_{x \in X} \sum_{y \in Y_x} r_{xy} \sum_{y' \in Y_x, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right)$ where $\sigma$ is the logistic loss. 103 / 138
  • 152. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Formulating a Robust Ranking Loss (RoBiRank) Applying the robust transformation $\rho_2(t)$ Objective Function $L(w) = \sum_{x \in X} \sum_{y \in Y_x} r_{xy} \cdot \rho_2\left( \sum_{y' \in Y_x, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$ where $\rho_2(t) = 1 - \frac{1}{\log_2(t+2)}$ 104 / 138
  • 153. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Formulating a Robust Ranking Loss (RoBiRank) Applying the robust transformation $\rho_2(t)$ Objective Function $L(w) = \sum_{x \in X} \sum_{y \in Y_x} r_{xy} \cdot \rho_2\left( \sum_{y' \in Y_x, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$ where $\rho_2(t) = 1 - \frac{1}{\log_2(t+2)}$ Directly related to DCG (evaluation metric for ranking): minimizing $L(w) \Longrightarrow$ maximizing DCG 104 / 138
  • 154. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Formulating a Robust Ranking Loss (RoBiRank) In practice, $\rho_1$ is easier to optimize, Objective Function $L(w) = \sum_{x \in X} \sum_{y \in Y_x} r_{xy} \cdot \rho_1\left( \sum_{y' \in Y_x, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$ where $\rho_1(t) = \log_2(t + 1)$ 105 / 138
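As a reference point, a direct (non-distributed) evaluation of this $\rho_1$ objective for a single query $x$ might look like the sketch below. It is my own illustration: `sigma` is taken to be the base-2 logistic loss $\log_2(1 + 2^{-t})$, which is one common choice; the slides only say that $\sigma$ is the logistic loss.

```python
# Hypothetical reference evaluation of the rho_1 RoBiRank loss for one query.
import numpy as np

def sigma(t):
    return np.log2(1.0 + 2.0 ** (-t))            # assumed form of the logistic loss

def rho1(t):
    return np.log2(t + 1.0)

def robirank_loss_one_query(scores, relevance):
    """scores: (|Y_x|,) model scores f(x, y); relevance: (|Y_x|,) weights r_{xy}."""
    loss = 0.0
    for y, r in enumerate(relevance):
        if r == 0.0:
            continue
        margins = scores[y] - np.delete(scores, y)   # f(x, y) - f(x, y'), y' != y
        loss += r * rho1(sigma(margins).sum())
    return loss

print(robirank_loss_one_query(np.array([2.0, 0.5, -1.0]), np.array([1.0, 0.0, 0.0])))
```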
  • 155. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Experiments: Single Machine TD 2004 [plot: NDCG@k vs. k for RoBiRank, RankSVM, LSRank, InfNormPush, IRPush] RoBiRank shows good accuracy (NDCG) at the top of the ranking list 106 / 138
  • 156. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Experiments: Single Machine TD 2004 [plot: NDCG@k vs. k for RoBiRank, MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet, RandomForests] RoBiRank shows good accuracy (NDCG) at the top of the ranking list 107 / 138
  • 157. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) How to make RoBiRank Hybrid-Parallel? Consider RoBiRank with the implicit case (no features, no explicit labels). Implicit Case $\sum_{x \in X} \sum_{y \in Y_x} r_{xy} \cdot \rho_1\left( \sum_{y' \in Y_x, y' \neq y} \sigma\left( \langle U_x, V_y \rangle - \langle U_x, V_{y'} \rangle \right) \right)$ (the inner sum over all $y'$ is expensive) How can we avoid this? 108 / 138
  • 158. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Linearization Upper Bound for $\rho_1$ [Bouchard07] $\rho_1(t) = \log_2(t + 1) \le -\log_2 \xi + \frac{\xi \cdot (t + 1) - 1}{\log 2}$ The bound holds for any $\xi > 0$ and is tight when $\xi = \frac{1}{t+1}$ 109 / 138
  • 159. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Linearization We can linearize the objective function using this upper bound, $\sum_{x \in X} \sum_{y \in Y_x} r_{xy} \cdot \left[ -\log_2 \xi_{xy} + \frac{\xi_{xy} \left( \sum_{y' \neq y} \sigma\left( \langle U_x, V_y \rangle - \langle U_x, V_{y'} \rangle \right) + 1 \right) - 1}{\log 2} \right]$ Can obtain stochastic gradient updates for $U_x$, $V_y$ and $\xi_{xy}$ $\xi_{xy}$ captures the importance of the $(x, y)$-pair in the optimization $\xi_{xy}$ also has a closed-form solution 110 / 138
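For intuition, here is a rough single-step sketch of the resulting stochastic updates in the implicit setting. It is my own filling-in of sampling and step size: `sigma` is assumed to be the natural-log logistic loss so its derivative is simple, constant factors are absorbed into the step size, and the closed-form refresh of $\xi_{xy}$ is written out naively rather than maintained incrementally.

```python
# Rough sketch of one linearized stochastic step for the implicit (U, V, xi) model.
import numpy as np

def sigma_sum(margins):
    return np.log1p(np.exp(-margins)).sum()          # sum of ln(1 + exp(-margin))

def sgd_step(U, V, xi, x, y, y_neg, r_xy, eta):
    """x: query index, y: observed item, y_neg: one sampled negative item."""
    n_items = V.shape[0]
    margin = U[x] @ V[y] - U[x] @ V[y_neg]
    s = 1.0 / (1.0 + np.exp(margin))                 # magnitude of d/d(margin) of ln(1 + exp(-margin))
    scale = r_xy * xi[x, y] * (n_items - 1) * s      # rescale the single sampled y' term
    U[x]     -= eta * scale * (V[y_neg] - V[y])
    V[y]     += eta * scale * U[x]
    V[y_neg] -= eta * scale * U[x]
    # Closed-form refresh of xi_{xy}: tight value 1 / (t + 1) of the linearization
    # bound, with t the full sum over y' != y (computed naively here for clarity).
    margins_all = U[x] @ V[y] - V @ U[x]             # includes y' == y, whose term is ln 2
    t = sigma_sum(margins_all) - np.log(2.0)         # remove the y' == y contribution
    xi[x, y] = 1.0 / (t + 1.0)
    return U, V, xi
```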
  • 160. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Distributing Computation 111 / 138
  • 161. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Experiments: Multi Machine Million Song Dataset (MSD) [plot: Mean Precision@1 vs. time (secs) for Weston et al. (2012), RoBiRank 1, RoBiRank 4, RoBiRank 16, RoBiRank 32] 112 / 138
  • 162. Thesis Roadmap Latent Collaborative Retrieval (NIPS 2014) Experiments: Multi Machine Scalability on Million song dataset, machines=1, 4, 16, 32 [plot: Mean Precision@1 vs. # machines × seconds for RoBiRank 1, RoBiRank 4, RoBiRank 16, RoBiRank 32] 113 / 138
  • 163. Conclusion and Future Directions Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 114 / 138
  • 164. Conclusion and Future Directions Conclusion and Key Takeaways Data and Model grow hand in hand. 115 / 138
  • 165. Conclusion and Future Directions Conclusion and Key Takeaways Data and Model grow hand in hand. Challenges in Parameter Estimation. 115 / 138
  • 166. Conclusion and Future Directions Conclusion and Key Takeaways Data and Model grow hand in hand. Challenges in Parameter Estimation. We proposed: 115 / 138
  • 167. Conclusion and Future Directions Conclusion and Key Takeaways Data and Model grow hand in hand. Challenges in Parameter Estimation. We proposed: Hybrid-Parallel formulations 115 / 138
  • 168. Conclusion and Future Directions Conclusion and Key Takeaways Data and Model grow hand in hand. Challenges in Parameter Estimation. We proposed: Hybrid-Parallel formulations Distributed, Asynchronous Algorithms 115 / 138
  • 169. Conclusion and Future Directions Conclusion and Key Takeaways Data and Model grow hand in hand. Challenges in Parameter Estimation. We proposed: Hybrid-Parallel formulations Distributed, Asynchronous Algorithms Case-studies: Multinomial Logistic Regression Mixture Models Latent Collaborative Retrieval 115 / 138
  • 170. Conclusion and Future Directions Future Work: Other Hybrid-Parallel models Infinite Mixture Models DP Mixture Model [BleiJordan2006] Pittman-Yor Mixture Model [Dubey et al., 2014] 116 / 138
  • 171. Conclusion and Future Directions Future Work: Other Hybrid-Parallel models Infinite Mixture Models DP Mixture Model [BleiJordan2006] Pittman-Yor Mixture Model [Dubey et al., 2014] Stochastic Mixed-Membership Block Models [Airoldi et al., 2008] 116 / 138
  • 172. Conclusion and Future Directions Future Work: Other Hybrid-Parallel models Infinite Mixture Models DP Mixture Model [BleiJordan2006] Pittman-Yor Mixture Model [Dubey et al., 2014] Stochastic Mixed-Membership Block Models [Airoldi et al., 2008] Deep Neural Networks [Wang et al., 2014] 116 / 138
  • 173. Conclusion and Future Directions Acknowledgements Thanks to all my collaborators. 117 / 138
  • 174. Conclusion and Future Directions References Parameswaran Raman, Sriram Srinivasan, Shin Matsushima, Xinhua Zhang, Hyokun Yun, S.V.N. Vishwanathan. Scaling Multinomial Logistic Regression via Hybrid Parallelism. KDD 2019. Parameswaran Raman∗, Jiong Zhang∗, Shihao Ji, Hsiang-Fu Yu, S.V.N. Vishwanathan, Inderjit S. Dhillon. Extreme Stochastic Variational Inference: Distributed and Asynchronous. AISTATS 2019. Hyokun Yun, Parameswaran Raman, S.V.N. Vishwanathan. Ranking via Robust Binary Classification. NIPS 2014. Source Code: DS-MLR: https://bitbucket.org/params/dsmlr ESVI: https://bitbucket.org/params/dmixmodels RoBiRank: https://bitbucket.org/d_ijk_stra/robirank 118 / 138
  • 175. Appendix Outline 1 Introduction 2 Distributed Parameter Estimation 3 Thesis Roadmap Multinomial Logistic Regression (KDD 2019) Mixture Models (AISTATS 2019) Latent Collaborative Retrieval (NIPS 2014) 4 Conclusion and Future Directions 5 Appendix 119 / 138
  • 176. Appendix Polynomial Regression $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} w_{jj'} x_{ij} x_{ij'}$ $x_i \in \mathbb{R}^D$ is an example from the dataset $X \in \mathbb{R}^{N \times D}$ $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $W \in \mathbb{R}^{D \times D}$ are parameters of the model 120 / 138
  • 177. Appendix Polynomial Regression $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} w_{jj'} x_{ij} x_{ij'}$ Dense parameterization is not suited for sparse data Computationally expensive - $O(N \times D^2)$ 121 / 138
  • 178. Appendix Factorization Machines $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle x_{ij} x_{ij'}$ 122 / 138
  • 179. Appendix Factorization Machines $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle x_{ij} x_{ij'}$ Factorized Parameterization using $VV^\top$ instead of Dense $W$ 122 / 138
  • 180. Appendix Factorization Machines $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle x_{ij} x_{ij'}$ Factorized Parameterization using $VV^\top$ instead of Dense $W$ Model parameters are $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $V \in \mathbb{R}^{D \times K}$ 122 / 138
  • 181. Appendix Factorization Machines $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle x_{ij} x_{ij'}$ Factorized Parameterization using $VV^\top$ instead of Dense $W$ Model parameters are $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $V \in \mathbb{R}^{D \times K}$ $v_j \in \mathbb{R}^K$ denotes the latent embedding for the $j$-th feature 122 / 138
  • 182. Appendix Factorization Machines $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle x_{ij} x_{ij'}$ Factorized Parameterization using $VV^\top$ instead of Dense $W$ Model parameters are $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $V \in \mathbb{R}^{D \times K}$ $v_j \in \mathbb{R}^K$ denotes the latent embedding for the $j$-th feature Computationally much cheaper - $O(N \times D \times K)$ 122 / 138
  • 183. Appendix Factorization Machines $f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle x_{ij} x_{ij'} = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \frac{1}{2} \sum_{k=1}^{K} \left[ \left( \sum_{d=1}^{D} v_{dk} x_{id} \right)^2 - \sum_{j=1}^{D} v_{jk}^2 x_{ij}^2 \right]$ 123 / 138
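A minimal prediction routine using this $O(D \cdot K)$ identity could look like the sketch below (an illustration under assumed shapes, not the libFM or DS-FACTO code).

```python
# Minimal FM prediction via the (sum v x)^2 - sum v^2 x^2 identity above.
import numpy as np

def fm_predict(x, w0, w, V):
    """x: (D,) features, w0: scalar bias, w: (D,) linear weights, V: (D, K) factors."""
    s = V.T @ x                                          # (K,): sum_d v_{dk} x_d
    pairwise = 0.5 * (s ** 2 - (V ** 2).T @ (x ** 2)).sum()
    return w0 + w @ x + pairwise
```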
  • 184. Appendix Factorization Machines $L(w, V) = \frac{1}{N} \sum_{i=1}^{N} l\left( f(x_i), y_i \right) + \frac{\lambda_w}{2} \|w\|_2^2 + \frac{\lambda_v}{2} \|V\|_2^2$ $\lambda_w$ and $\lambda_v$ are regularizers for $w$ and $V$ $l(\cdot)$ is a loss function depending on the task 124 / 138
  • 185. Appendix Factorization Machines Gradient Descent updates (First-Order Features) $w_j^{t+1} \leftarrow w_j^t - \eta\left( \sum_{i=1}^{N} \nabla_{w_j} l_i(w, V) + \lambda_w w_j^t \right) = w_j^t - \eta\left( \sum_{i=1}^{N} G_i^t \cdot \nabla_{w_j} f(x_i) + \lambda_w w_j^t \right) = w_j^t - \eta\left( \sum_{i=1}^{N} G_i^t \cdot x_{ij} + \lambda_w w_j^t \right)$ (1) where the multiplier $G_i^t$ is given by $G_i^t = f(x_i) - y_i$ for squared loss (regression) and $G_i^t = \frac{-y_i}{1 + \exp(y_i \cdot f(x_i))}$ for logistic loss (classification) (2) 125 / 138
  • 186. Appendix Factorization Machines Gradient Descent updates (Second-Order Features) $v_{jk}^{t+1} \leftarrow v_{jk}^t - \eta\left( \sum_{i=1}^{N} \nabla_{v_{jk}} l_i(w, V) + \lambda_v v_{jk}^t \right) = v_{jk}^t - \eta\left( \sum_{i=1}^{N} G_i^t \cdot \nabla_{v_{jk}} f(x_i) + \lambda_v v_{jk}^t \right) = v_{jk}^t - \eta\left( \sum_{i=1}^{N} G_i^t \cdot \left( x_{ij} \sum_{d=1}^{D} v_{dk}^t x_{id} - v_{jk}^t x_{ij}^2 \right) + \lambda_v v_{jk}^t \right)$ (3) where the multiplier $G_i^t$ is the same as before, and the synchronization term is $a_{ik} = \sum_{d=1}^{D} v_{dk}^t x_{id}$. 126 / 138
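The sketch below (my own, with regularizer terms omitted) computes the shared multiplier $G_i$ and the synchronization terms $a_{ik}$ once per example and reuses them for every $w_j$ and $v_{jk}$ gradient, which is exactly where the synchronization issue highlighted next comes from.

```python
# Per-example FM gradients reusing the shared multiplier G_i and terms a_{ik}.
import numpy as np

def fm_gradients(x, y, w0, w, V, task="regression"):
    """x: (D,), y: scalar label, w0: scalar, w: (D,), V: (D, K); regularizers omitted."""
    a = V.T @ x                                              # (K,) synchronization terms a_{ik}
    f = w0 + w @ x + 0.5 * (a ** 2 - (V ** 2).T @ (x ** 2)).sum()
    G = (f - y) if task == "regression" else -y / (1.0 + np.exp(y * f))
    grad_w = G * x                                           # d loss_i / d w_j
    grad_V = G * (np.outer(x, a) - V * (x ** 2)[:, None])    # d loss_i / d v_{jk}
    return G, grad_w, grad_V
```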
  • 187. Appendix Factorization Machines Access pattern of parameter updates for $w_j$ and $v_{jk}$ [figure] (a) $w_j$ update (b) $v_{jk}$ update Figure: Updating $w_j$ requires computing $G_i$ and likewise updating $v_{jk}$ requires computing $a_{ik}$. 127 / 138
  • 188. Appendix Factorization Machines Access pattern of parameter updates for $w_j$ and $v_{jk}$ [figure] (a) computing $G_i$ (b) computing $a_{ik}$ Figure: Computing both $G$ and $A$ requires accessing all the dimensions $j = 1, \ldots, D$. This is the main synchronization bottleneck. 128 / 138
  • 189. Appendix Doubly-Separable Factorization Machines (DS-FACTO) Avoiding bulk-synchronization Perform a round of Incremental synchronization after all D updates 129 / 138
  • 190. Appendix Experiments: Single Machine housing, machines=1, cores=1, threads=1 [plot: objective vs. time (secs) for DS-FACTO, LIBFM] 130 / 138
  • 191. Appendix Experiments: Single Machine diabetes, machines=1, cores=1, threads=1 [plot: objective vs. time (secs) for DS-FACTO, LIBFM] 131 / 138
  • 192. Appendix Experiments: Single Machine ijcnn1, machines=1, cores=1, threads=1 [plot: objective vs. time (secs) for DS-FACTO, LIBFM] 132 / 138
  • 193. Appendix Experiments: Multi Machine realsim, machines=8, cores=40, threads=1 [plot: objective vs. time (secs) for DS-FACTO] 133 / 138
  • 194. Appendix Experiments: Scaling in terms of threads realsim, varying cores and threads as 1, 2, 4, 8, 16, 32 [plot: speedup vs. time (secs) on the realsim dataset, comparing linear speedup against varying # of threads and varying # of cores] 134 / 138
  • 195. Appendix Appendix: Alternative Reformulation for DS-MLR Step 1: Introduce redundant constraints (new parameters $A$) into the original MLR problem $\min_{W,A} L_1(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\, w_k^\top x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i$ s.t. $a_i = \frac{1}{\sum_{k=1}^{K} \exp(w_k^\top x_i)}$ 135 / 138
  • 196. Appendix Appendix: Alternative Reformulation for DS-MLR Step 2: Turn the problem into an unconstrained min-max problem by introducing Lagrange multipliers $\beta_i, \forall i = 1, \ldots, N$ $\min_{W,A} \max_{\beta} L_2(W, A, \beta) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\, w_k^\top x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \beta_i\, a_i \exp(w_k^\top x_i) - \frac{1}{N} \sum_{i=1}^{N} \beta_i$ 136 / 138
  • 197. Appendix Appendix: Alternative Reformulation for DS-MLR Step 2: Turn the problem into an unconstrained min-max problem by introducing Lagrange multipliers $\beta_i, \forall i = 1, \ldots, N$ $\min_{W,A} \max_{\beta} L_2(W, A, \beta) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\, w_k^\top x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \beta_i\, a_i \exp(w_k^\top x_i) - \frac{1}{N} \sum_{i=1}^{N} \beta_i$ Primal updates for $W$, $A$ and dual update for $\beta$ (similar in spirit to dual-decomposition methods). 136 / 138
  • 198. Appendix Appendix: Alternative Reformulation for DS-MLR Step 3: Stare at the updates long enough! When $a_i^{t+1}$ is solved to optimality, it admits an exact closed-form solution given by $a_i^* = \frac{1}{\beta_i \sum_{k=1}^{K} \exp(w_k^\top x_i)}$. The dual-ascent update for $\beta_i$ is no longer needed, since the penalty is always zero if $\beta_i$ is set to a constant equal to 1. $\min_{W,A} L_3(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik}\, w_k^\top x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} a_i \exp(w_k^\top x_i) - \frac{1}{N}$ 137 / 138
  • 199. Appendix Appendix: Alternative Reformulation for DS-MLR Step 4: Simple re-write Doubly-Separable form $\min_{W,A} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \frac{\lambda \|w_k\|^2}{2N} - \frac{y_{ik}\, w_k^\top x_i}{N} - \frac{\log a_i}{NK} + \frac{\exp(w_k^\top x_i + \log a_i)}{N} - \frac{1}{NK} \right]$ 138 / 138