Thesis Title: Hybrid-Parallel Parameter Estimation for Frequentist and Bayesian Models
Thesis Committee: S.V.N. Vishwanathan, Manfred Warmuth, David P. Helmbold
Abstract: Distributed algorithms in machine learning follow two main flavors: data parallel, where the data is distributed across multiple workers, and model parallel, where the model parameters are partitioned across multiple workers. The main limitation of the first approach is that the model parameters need to be replicated on every machine, which is problematic when the number of parameters is very large and cannot fit on a single machine. The drawback of the latter approach is that the data needs to be replicated on each machine.
In this defense talk, I will present Hybrid-Parallelism, an approach that allows us to partition both the data and the model parameters simultaneously. As a result, each worker only needs access to a subset of the data and a subset of the parameters while performing parameter updates. I will present case studies of two popular models: (1) Multinomial Logistic Regression, and (2) Mixtures of Exponential Families. In both cases, I will show how to exploit the access pattern of parameter updates to derive Hybrid-Parallel asynchronous algorithms.
Hybrid-Parallel Parameter Estimation for Frequentist and Bayesian Models
Parameswaran Raman
Ph.D. Dissertation Defense
University of California, Santa Cruz
December 2, 2019
Challenges in Parameter Estimation
1 Storage limitations of Data and Model
2 Interdependence in parameter updates
3 Bulk-Synchronization is expensive
4 Synchronous communication is inefficient
Traditional distributed machine learning approaches fall short.
Hybrid-Parallel algorithms for parameter estimation!
Introduction
Regularized Risk Minimization
Goals in machine learning
We want to build a model using observed (training) data
Our model must generalize to unseen (test) data
$\min_{\theta} \; L(\theta) := \underbrace{\lambda R(\theta)}_{\text{regularizer}} + \underbrace{\frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(x_i, y_i, \theta)}_{\text{empirical risk}}$
$X = \{x_1, \ldots, x_N\}$, $y = \{y_1, \ldots, y_N\}$ is the observed training data
$\theta$ are the model parameters
loss(·) quantifies the model's performance
The regularizer $R(\theta)$ avoids over-fitting (penalizes complex models)
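To make the objective concrete, here is a minimal Python/NumPy sketch (an illustration only, not the thesis implementation) that evaluates the regularized risk for a generic loss and regularizer; the ridge-regression choice at the bottom is just one hypothetical instantiation.

```python
import numpy as np

def regularized_risk(theta, X, y, loss, regularizer, lam):
    """Evaluate L(theta) = lam * R(theta) + (1/N) * sum_i loss(x_i, y_i, theta)."""
    N = X.shape[0]
    empirical_risk = sum(loss(X[i], y[i], theta) for i in range(N)) / N
    return lam * regularizer(theta) + empirical_risk

# Example instantiation: squared loss with an L2 regularizer (ridge regression).
squared_loss = lambda x, y, theta: 0.5 * (x @ theta - y) ** 2
l2 = lambda theta: 0.5 * np.dot(theta, theta)

rng = np.random.default_rng(0)
X, y, theta = rng.normal(size=(100, 5)), rng.normal(size=100), np.zeros(5)
print(regularized_risk(theta, X, y, squared_loss, l2, lam=0.1))
```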
Introduction
Bayesian Models
Examples: Gaussian Mixture Models (GMM), Latent Dirichlet Allocation (LDA)
$\underbrace{p(\theta \mid X)}_{\text{posterior}} = \dfrac{\overbrace{p(X \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{\int p(X, \theta) \, d\theta}_{\text{marginal likelihood (model evidence)}}}$
The prior plays the role of the regularizer $R(\theta)$
The likelihood plays the role of the empirical risk
Distributed Parameter Estimation
(Figure: data X of size N × D and model θ of size D × K.)
Data parallel: partition the data X across workers, e.g. L-BFGS
Model parallel: partition the model θ across workers, e.g. LC [Gopal et al., 2013]
Distributed Parameter Estimation
Good
Easy to implement using map-reduce
Scales as long as the Data or the Model fits in memory
Bad
Either the Data or the Model is replicated on each worker.
Data Parallel: each worker requires $O\left(\frac{N \times D}{P}\right) + \underbrace{O(K \times D)}_{\text{bottleneck}}$
Model Parallel: each worker requires $\underbrace{O(N \times D)}_{\text{bottleneck}} + O\left(\frac{K \times D}{P}\right)$
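The per-worker footprints above are easy to sanity-check numerically. The sketch below (Python, with made-up sizes) contrasts the two regimes with the hybrid one introduced next; the dense O(N × D) accounting follows the slide and ignores sparsity.

```python
def per_worker_memory(N, D, K, P):
    """Rough per-worker storage (number of floats) for each strategy."""
    data_parallel  = (N * D) / P + K * D        # data sharded, model replicated
    model_parallel = N * D + (K * D) / P        # model sharded, data replicated
    hybrid         = (N * D) / P + (K * D) / P  # both sharded
    return data_parallel, model_parallel, hybrid

# Hypothetical sizes: 1M examples, 100K features, 10K classes, 100 workers.
for name, floats in zip(("data-parallel", "model-parallel", "hybrid"),
                        per_worker_memory(10**6, 10**5, 10**4, 100)):
    print(f"{name:>14}: {floats * 8 / 1e9:.1f} GB")
```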
Benefits of Hybrid-Parallelism
1 One versatile method for all regimes of data and model parallelism
2 Independent parameter updates on each worker
3 Fully de-centralized and asynchronous optimization algorithms
Double-Separability
Definition
Let $\{S_i\}_{i=1}^{m}$ and $\{S'_j\}_{j=1}^{m'}$ be two families of sets of parameters. A function $f : \prod_{i=1}^{m} S_i \times \prod_{j=1}^{m'} S'_j \to \mathbb{R}$ is doubly separable if there exist $f_{ij} : S_i \times S'_j \to \mathbb{R}$ for each $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, m'$ such that:
$f(\theta_1, \theta_2, \ldots, \theta_m, \theta'_1, \theta'_2, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$
Double-Separability
$f(\theta_1, \theta_2, \ldots, \theta_m, \theta'_1, \theta'_2, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$
(Figure: m × m' grid of the terms $f_{ij}$.) The $f_{ij}$ corresponding to the highlighted diagonal blocks can be computed independently and in parallel.
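A toy illustration of why double-separability enables parallelism: the Python sketch below evaluates a doubly-separable function diagonal block by diagonal block. Within each round the blocks touch disjoint row- and column-parameters, so they could run on independent workers; the particular $f_{ij}$ used here is an arbitrary stand-in.

```python
import numpy as np

m = mp = 4                          # number of row-parameters and column-parameters
theta  = np.arange(1.0, m + 1)      # theta_1 .. theta_m
thetap = np.arange(1.0, mp + 1)     # theta'_1 .. theta'_m'

f_ij = lambda a, b: (a - b) ** 2    # any per-block term f_ij(theta_i, theta'_j)

total = 0.0
for r in range(mp):                 # m' rounds over shifted diagonals
    # Blocks (i, (i + r) mod m') touch disjoint theta_i and theta'_j,
    # so this inner loop could be executed by m independent workers.
    for i in range(m):
        j = (i + r) % mp
        total += f_ij(theta[i], thetap[j])

# Sanity check against the full double sum.
assert np.isclose(total, sum(f_ij(a, b) for a in theta for b in thetap))
print(total)
```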
Technical Contributions of my thesis
Study frequentist and Bayesian models which do not have direct double-separability in their parameter estimation:
1 Propose reformulations that lead to Hybrid-Parallelism
2 Design asynchronous distributed optimization algorithms
Thesis Roadmap: Multinomial Logistic Regression (KDD 2019)
Multinomial Logistic Regression (MLR)
Given
Training data $(x_i, y_i)_{i=1,\ldots,N}$ where $x_i \in \mathbb{R}^D$, with corresponding labels $y_i \in \{1, 2, \ldots, K\}$
N, D and K are large ($N \ggg D \gg K$)
Goal
The probability that $x_i$ belongs to class $k$ is given by:
$p(y_i = k \mid x_i, W) = \dfrac{\exp(w_k^T x_i)}{\sum_{j=1}^{K} \exp(w_j^T x_i)}$
where $W = \{w_1, w_2, \ldots, w_K\}$ denotes the parameters of the model.
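A minimal sketch of the class-probability computation (Python/NumPy). The max-subtraction is only the usual numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def mlr_probabilities(W, x):
    """p(y = k | x, W) = exp(w_k^T x) / sum_j exp(w_j^T x), for all k."""
    scores = W @ x                       # shape (K,): w_k^T x for each class k
    scores = scores - scores.max()       # stabilize exp() without changing the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))              # K = 5 classes, D = 3 features
x = rng.normal(size=3)
p = mlr_probabilities(W, x)
print(p, p.sum())                        # the probabilities sum to 1
```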
Multinomial Logistic Regression (MLR)
The corresponding $\ell_2$-regularized negative log-likelihood loss:
$\min_{W} \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i + \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp(w_k^T x_i)$
where $\lambda$ is the regularization hyper-parameter. The last (log-sum-exp) term makes model parallelism hard.
Reformulation into Doubly-Separable form
Log-concavity bound [GopYan13]
$\log(\gamma) \leq a \cdot \gamma - \log(a) - 1, \quad \forall \gamma, a > 0$
where $a$ is a variational parameter. The bound is tight when $a = \frac{1}{\gamma}$ (indeed, at $a = \frac{1}{\gamma}$ the right-hand side equals $1 + \log\gamma - 1 = \log\gamma$).
Reformulating the objective of MLR
$\min_{W, A} \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 + \frac{1}{N} \sum_{i=1}^{N} \left[ - \sum_{k=1}^{K} y_{ik} \, w_k^T x_i + a_i \sum_{k=1}^{K} \exp(w_k^T x_i) - \log(a_i) - 1 \right]$
where $a_i$ can be computed in closed form as:
$a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Doubly-Separable Multinomial Logistic Regression (DS-MLR)
Doubly-Separable form
$\min_{W, A} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \frac{\lambda \|w_k\|^2}{2N} - \frac{y_{ik} \, w_k^T x_i}{N} - \frac{\log a_i}{NK} + \frac{\exp(w_k^T x_i + \log a_i)}{N} - \frac{1}{NK} \right]$
Doubly-Separable Multinomial Logistic Regression (DS-MLR)
Stochastic Gradient Updates
Each term in the stochastic update depends only on data point i and class k.
$w_k^{t+1} \leftarrow w_k^t - \eta_t K \left( \lambda w_k^t - y_{ik} x_i + \exp(w_k^T x_i + \log a_i) \, x_i \right)$
$\log a_i^{t+1} \leftarrow \log a_i^t - \eta_t K \left( \exp(w_k^T x_i + \log a_i) - \frac{1}{K} \right)$
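A single-machine sketch of these stochastic updates (Python/NumPy): it samples one (i, k) pair at a time and applies the two updates above. The step size, iteration count and initialization are arbitrary choices for illustration, not the tuned settings used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, lam, eta = 200, 10, 5, 0.01, 0.01

X = rng.normal(size=(N, D))
labels = rng.integers(K, size=N)
Y = np.eye(K)[labels]                      # one-hot labels y_ik

W = np.zeros((K, D))                       # model parameters w_k
log_a = np.full(N, -np.log(K))             # variational parameters, a_i = 1/K initially

for t in range(20000):
    i, k = rng.integers(N), rng.integers(K)
    e = np.exp(W[k] @ X[i] + log_a[i])     # exp(w_k^T x_i + log a_i)
    # Stochastic updates for w_k and log a_i, one (i, k) term at a time.
    W[k]     -= eta * K * (lam * W[k] - Y[i, k] * X[i] + e * X[i])
    log_a[i] -= eta * K * (e - 1.0 / K)

# As the iterates converge, a_i should approach 1 / sum_k exp(w_k^T x_i).
i = 0
print(np.exp(log_a[i]), 1.0 / np.sum(np.exp(W @ X[i])))
```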
Access Pattern of updates: Stoch wk, Stoch ai
(Figure: access pattern over X, W, A.)
(a) Updating $w_k$ only requires computing $a_i$
(b) Updating $a_i$ only requires accessing $w_k$ and $x_i$.
Updating ai: Closed form instead of Stochastic update
Closed-form update for $a_i$:
$a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Access Pattern of updates: Stoch wk, Exact ai
(Figure: access pattern over X, W, A.)
(a) Updating $w_k$ only requires computing $a_i$
(b) Updating $a_i$ requires accessing the entire W. Synchronization bottleneck!
Updating ai: Avoiding bulk-synchronization
Closed-form update for $a_i$:
$a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Each worker computes a partial sum using the $w_k$ it owns.
With P workers: after P rounds the global sum is available.
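A small Python simulation of the idea: the class partial sums are accumulated one worker at a time, so no worker ever needs the whole W, and after P rounds the global normalizer, hence $a_i$, is available. Worker count and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, P = 8, 12, 4
W = rng.normal(size=(K, D))
x_i = rng.normal(size=D)

# Each worker owns a disjoint block of classes (rows of W).
blocks = np.array_split(np.arange(K), P)

partial_sum = 0.0
for p in range(P):                 # P rounds: the running sum is passed worker to worker
    owned = blocks[p]
    partial_sum += np.sum(np.exp(W[owned] @ x_i))

a_i = 1.0 / partial_sum
assert np.isclose(a_i, 1.0 / np.sum(np.exp(W @ x_i)))   # matches the global computation
print(a_i)
```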
Parallelization: Synchronous DSGD [Gemulla et al., 2011]
X and the local parameters A are partitioned horizontally (1, . . . , N)
The global model parameters W are partitioned vertically (1, . . . , K)
P = 4 workers work on mutually-exclusive blocks of A and W
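A sketch of a DSGD-style schedule (Python, simulation only): in each of the P sub-epochs the P workers are assigned mutually-exclusive (row block, column block) pairs, so they never touch the same block of A or W; the actual block update is elided. This is an illustration of the scheduling idea, not the thesis implementation.

```python
import numpy as np

N, K, P = 16, 8, 4
row_blocks = np.array_split(np.arange(N), P)   # partitions of X and A (over examples)
col_blocks = np.array_split(np.arange(K), P)   # partitions of W (over classes)

for epoch in range(P):
    # In sub-epoch `epoch`, worker p works on (row block p, column block (p + epoch) mod P).
    assignment = [(p, (p + epoch) % P) for p in range(P)]
    assert len({c for _, c in assignment}) == P   # column blocks are mutually exclusive
    for p, c in assignment:
        rows, cols = row_blocks[p], col_blocks[c]
        # worker p: run stochastic updates on A[rows] and W[cols] using X[rows, :]
        pass
    # barrier: all workers synchronize before moving to the next sub-epoch
```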
Parallelization: Asynchronous NOMAD [Yun et al., 2014]
(Figure sequence over four snapshots: the partitioned data together with the local parameters A and the global parameters W; ownership of blocks of W moves between workers asynchronously from one snapshot to the next.)
Thesis Roadmap: Mixture Models (AISTATS 2019)
Generative Model
Draw cluster proportions π:
$p(\pi \mid \alpha) = \mathrm{Dirichlet}(\alpha)$
Draw cluster parameters for k = 1, . . . , K:
$p(\theta_k \mid n_k, \nu_k) = \exp\left( n_k \langle \nu_k, \theta_k \rangle - n_k \, g(\theta_k) - h(n_k, \nu_k) \right)$
Draw observations for i = 1, . . . , N:
$p(z_i \mid \pi) = \mathrm{Multinomial}(\pi)$
$p(x_i \mid z_i, \theta) = \exp\left( \langle \phi(x_i, z_i), \theta_{z_i} \rangle - g(\theta_{z_i}) \right)$
$p(\theta_k \mid n_k, \nu_k)$ is conjugate to $p(x_i \mid z_i = k, \theta_k)$
$p(\pi \mid \alpha)$ is conjugate to $p(z_i \mid \pi)$
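To make the abstract exponential-family notation concrete, here is a Python sketch that samples from the generative model in its Gaussian-mixture instance (isotropic, known-variance Gaussians, used here only as an example); the hyper-parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, D, alpha = 3, 500, 2, 1.0

# Draw cluster proportions: pi ~ Dirichlet(alpha)
pi = rng.dirichlet(alpha * np.ones(K))

# Draw cluster parameters theta_k from the conjugate prior; for an isotropic
# Gaussian likelihood this amounts to a Gaussian prior on the cluster means.
mu = rng.normal(scale=5.0, size=(K, D))

# Draw observations: z_i ~ Multinomial(pi), x_i | z_i ~ N(mu_{z_i}, I)
z = rng.choice(K, size=N, p=pi)
x = mu[z] + rng.normal(size=(N, D))

print(pi, np.bincount(z, minlength=K))
```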
Joint Distribution
$p(\pi, \theta, z, x \mid \alpha, n, \nu) = p(\pi \mid \alpha) \cdot \prod_{k=1}^{K} p(\theta_k \mid n_k, \nu_k) \cdot \prod_{i=1}^{N} p(z_i \mid \pi) \, p(x_i \mid z_i, \theta)$
We want to estimate the posterior: $p(\pi, \theta, z \mid x, \alpha, n, \nu)$
Variational Inference [Blei et al., 2016]
Approximate $p(\pi, \theta, z \mid x, \alpha, n, \nu)$ with a simpler (factorized) distribution q:
$q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z}) = q(\pi \mid \tilde{\pi}) \cdot \prod_{k=1}^{K} q(\theta_k \mid \tilde{\theta}_k) \cdot \prod_{i=1}^{N} q(z_i \mid \tilde{z}_i)$
Keeping $\tilde{\pi}$ in Sync
A single global $\tilde{\pi}$ travels among machines, broadcasting local delta updates.
Every machine p keeps $(\pi_p, \bar{\pi})$:
$\pi_p$: local working copy
$\bar{\pi}$: snapshot of the global $\tilde{\pi}$
Thesis Roadmap: Latent Collaborative Retrieval (NIPS 2014)
Objectives
Can we use this to design robust losses for ranking?
Two scenarios:
Explicit: features and labels are observed (Learning to Rank)
Implicit: features and labels are latent (Latent Collaborative Retrieval)
Distributed Optimization Algorithms
Explicit Setting - Learning to Rank
Scoring function: $f(x, y) = \langle \phi(x, y), w \rangle$
Formulating a Robust Ranking Loss (RoBiRank)
Rank
$\mathrm{rank}_w(x, y) = \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \mathbb{1}\left[ f(x, y) - f(x, y') < 0 \right]$
Objective Function
$L(w) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right)$
where σ is the logistic loss.
Formulating a Robust Ranking Loss (RoBiRank)
Applying the robust transformation $\rho_2(t)$:
$L(w) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \rho_2\left( \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$
where $\rho_2(t) = 1 - \frac{1}{\log_2(t+2)}$
Directly related to DCG (the evaluation metric for ranking):
minimizing $L(w) \implies$ maximizing DCG
Formulating a Robust Ranking Loss (RoBiRank)
In practice, $\rho_1$ is easier to optimize:
$L(w) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \rho_1\left( \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$
where $\rho_1(t) = \log_2(t + 1)$
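A direct Python sketch of the RoBiRank objective for given scores (illustration only): it evaluates the $\rho_1$ version above, and switching `rho` to $\rho_2$ gives the DCG-related variant. The base-2 form of the logistic loss is an assumption (the slide only states that σ is the logistic loss), and the toy scores and relevances are made up.

```python
import numpy as np

sigma = lambda t: np.log2(1.0 + 2.0 ** (-t))   # logistic loss (base-2 convention; assumption)
rho1  = lambda t: np.log2(t + 1.0)
rho2  = lambda t: 1.0 - 1.0 / np.log2(t + 2.0)

def robirank_loss(scores, relevance, rho=rho1):
    """L = sum_x sum_y r_xy * rho( sum_{y' != y} sigma(f(x,y) - f(x,y')) )."""
    loss = 0.0
    for f_x, r_x in zip(scores, relevance):     # one query/user x at a time
        for y, r_xy in enumerate(r_x):
            t = sum(sigma(f_x[y] - f_x[yp]) for yp in range(len(f_x)) if yp != y)
            loss += r_xy * rho(t)
    return loss

scores    = [np.array([2.0, 0.5, -1.0])]        # f(x, y) for one x and three items y
relevance = [np.array([1.0, 0.0, 0.0])]         # r_xy
print(robirank_loss(scores, relevance, rho1), robirank_loss(scores, relevance, rho2))
```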
Experiments: Single Machine
(Figure, TD 2004 dataset: NDCG@k for k = 5, . . . , 20. Panel 1 compares RoBiRank against RankSVM, LSRank, InfNormPush and IRPush; panel 2 compares RoBiRank against MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet and RandomForests.)
RoBiRank shows good accuracy (NDCG) at the top of the ranking list.
How to make RoBiRank Hybrid-Parallel?
Consider RoBiRank in the implicit case (no features, no explicit labels).
Implicit Case
$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \rho_1\left( \sum_{y' \neq y} \sigma\left( \langle U_x, V_y \rangle - \langle U_x, V_{y'} \rangle \right) \right)$
The inner sum over $y'$ is expensive. How can we avoid this?
Linearization
Upper bound for $\rho_1$ [Bouchard07]
$\rho_1(t) = \log_2(t + 1) \leq -\log_2 \xi + \frac{\xi \cdot (t + 1) - 1}{\log 2}$
The bound holds for any $\xi > 0$ and is tight when $\xi = \frac{1}{t+1}$.
Linearization
We can linearize the objective function using this upper bound:
$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \left[ -\log_2 \xi_{xy} + \frac{\xi_{xy} \left( \sum_{y' \neq y} \sigma\left( \langle U_x, V_y \rangle - \langle U_x, V_{y'} \rangle \right) + 1 \right) - 1}{\log 2} \right]$
We can obtain stochastic gradient updates for $U_x$, $V_y$ and $\xi_{xy}$
$\xi_{xy}$ captures the importance of the $(x, y)$ pair in the optimization
$\xi_{xy}$ also has a closed-form solution
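A small numeric check of the linearization (Python): for any $\xi > 0$ the right-hand side upper-bounds $\rho_1(t)$, and plugging in $\xi = \frac{1}{t+1}$ recovers $\rho_1(t)$ exactly, which is what allows $\xi_{xy}$ to be updated in closed form.

```python
import numpy as np

rho1  = lambda t: np.log2(t + 1.0)
bound = lambda t, xi: -np.log2(xi) + (xi * (t + 1.0) - 1.0) / np.log(2.0)

t = 3.7
for xi in (0.05, 0.5, 1.0, 2.0):
    assert bound(t, xi) >= rho1(t) - 1e-12     # upper bound for any xi > 0

xi_star = 1.0 / (t + 1.0)                      # closed-form minimizer of the bound
print(rho1(t), bound(t, xi_star))              # equal: the bound is tight at xi*
```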
Conclusion and Future Directions
Conclusion and Key Takeaways
Data and Model grow hand in hand.
Challenges in Parameter Estimation.
We proposed:
Hybrid-Parallel formulations
Distributed, Asynchronous Algorithms
Case-studies:
Multinomial Logistic Regression
Mixture Models
Latent Collaborative Retrieval
Future Work: Other Hybrid-Parallel models
Infinite Mixture Models
DP Mixture Model [BleiJordan2006]
Pitman-Yor Mixture Model [Dubey et al., 2014]
Mixed-Membership Stochastic Block Models [Airoldi et al., 2008]
Deep Neural Networks [Wang et al., 2014]
Acknowledgements
Thanks to all my collaborators.
Appendix
Polynomial Regression
$f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} w_{jj'} \, x_{ij} x_{ij'}$
$x_i \in \mathbb{R}^D$ is an example from the dataset $X \in \mathbb{R}^{N \times D}$
$w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $W \in \mathbb{R}^{D \times D}$ are parameters of the model
The dense parameterization is not suited for sparse data
Computationally expensive - $O(N \times D^2)$
Factorization Machines
$f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle \, x_{ij} x_{ij'}$
Factorized parameterization using $VV^T$ instead of the dense W
Model parameters are $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $V \in \mathbb{R}^{D \times K}$
$v_j \in \mathbb{R}^K$ denotes the latent embedding for the j-th feature
Computationally much cheaper - $O(N \times D \times K)$, since the pairwise term can be rewritten as
$f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \frac{1}{2} \sum_{k=1}^{K} \left[ \left( \sum_{d=1}^{D} v_{dk} x_{id} \right)^2 - \sum_{j=1}^{D} v_{jk}^2 x_{ij}^2 \right]$
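A sketch of the FM prediction using the rewritten pairwise term, which costs O(D × K) per example instead of O(D²); the brute-force double loop is included only to check the identity, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Kf = 6, 3
w0, w, V = 0.1, rng.normal(size=D), rng.normal(size=(D, Kf))
x = rng.normal(size=D)

def fm_predict(x, w0, w, V):
    """f(x) via the O(D*K) identity: 0.5 * sum_k [ (sum_d v_dk x_d)^2 - sum_d v_dk^2 x_d^2 ]."""
    s = V.T @ x                                        # sum_d v_dk x_d, for each k
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise

# Brute-force O(D^2) evaluation of the pairwise term, for checking the identity.
brute = w0 + w @ x + sum(V[j] @ V[jp] * x[j] * x[jp]
                         for j in range(D) for jp in range(j + 1, D))
assert np.isclose(fm_predict(x, w0, w, V), brute)
print(fm_predict(x, w0, w, V))
```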
Factorization Machines
$L(w, V) = \frac{1}{N} \sum_{i=1}^{N} l\left( f(x_i), y_i \right) + \frac{\lambda_w}{2} \|w\|_2^2 + \frac{\lambda_v}{2} \|V\|_2^2$
$\lambda_w$ and $\lambda_v$ are regularizers for w and V
$l(\cdot)$ is a loss function depending on the task
Factorization Machines
Gradient Descent updates (First-Order Features)
$w_j^{t+1} \leftarrow w_j^t - \eta \sum_{i=1}^{N} \left( \nabla_{w_j} l_i(w, V) + \lambda_w w_j^t \right) = w_j^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot \nabla_{w_j} f(x_i) + \lambda_w w_j^t \right) = w_j^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot x_{ij} + \lambda_w w_j^t \right) \quad (1)$
where the multiplier $G_i^t$ is given by
$G_i^t = \begin{cases} f(x_i) - y_i, & \text{if squared loss (regression)} \\ \dfrac{-y_i}{1 + \exp(y_i \cdot f(x_i))}, & \text{if logistic loss (classification)} \end{cases} \quad (2)$
Factorization Machines
Gradient Descent updates (Second-Order Features)
$v_{jk}^{t+1} \leftarrow v_{jk}^t - \eta \sum_{i=1}^{N} \left( \nabla_{v_{jk}} l_i(w, V) + \lambda_v v_{jk}^t \right) = v_{jk}^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot \nabla_{v_{jk}} f(x_i) + \lambda_v v_{jk}^t \right) = v_{jk}^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot x_{ij} \left( \sum_{d=1}^{D} v_{dk}^t x_{id} - v_{jk}^t x_{ij} \right) + \lambda_v v_{jk}^t \right) \quad (3)$
where the multiplier $G_i^t$ is the same as before, and the synchronization term is $a_{ik} = \sum_{d=1}^{D} v_{dk}^t x_{id}$.
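A sketch of one batch gradient-descent pass over updates (1)-(3) for the squared-loss (regression) case, with the multipliers $G_i$ and the synchronization terms $a_{ik}$ precomputed as in the equations above; the learning rate, data and iteration count are placeholders, and $w_0$ is left fixed since the slides only show the $w_j$ and $v_{jk}$ updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Kf = 50, 6, 3
X, y = rng.normal(size=(N, D)), rng.normal(size=N)
w0, w, V = 0.0, np.zeros(D), 0.01 * rng.normal(size=(D, Kf))
lam_w, lam_v, eta = 0.01, 0.01, 0.01

def fm_predict_all(X, w0, w, V):
    A = X @ V                                           # a_ik = sum_d v_dk x_id
    f = w0 + X @ w + 0.5 * np.sum(A ** 2 - (X ** 2) @ (V ** 2), axis=1)
    return f, A

for step in range(100):
    f, A = fm_predict_all(X, w0, w, V)
    G = f - y                                           # multiplier G_i (squared loss)
    # Update (1): w_j <- w_j - eta * sum_i (G_i x_ij + lam_w w_j)
    w = w - eta * (X.T @ G + N * lam_w * w)
    # Update (3): v_jk <- v_jk - eta * sum_i (G_i x_ij (a_ik - v_jk x_ij) + lam_v v_jk)
    grad_V = X.T @ (G[:, None] * A) - (G @ (X ** 2))[:, None] * V + N * lam_v * V
    V = V - eta * grad_V

print(np.mean((fm_predict_all(X, w0, w, V)[0] - y) ** 2))   # training error after the pass
```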
Factorization Machines
Access pattern of parameter updates for $w_j$ and $v_{jk}$
(Figure: (a) $w_j$ update, (b) $v_{jk}$ update, over X, w, V, G, A.) Updating $w_j$ requires computing $G_i$, and likewise updating $v_{jk}$ requires computing $a_{ik}$.
(Figure: (a) computing $G_i$, (b) computing $a_{ik}$.) Computing both G and A requires accessing all the dimensions $j = 1, \ldots, D$. This is the main synchronization bottleneck.
Experiments: Scaling in terms of threads
(Figure, realsim dataset: speedup vs. time (secs) against linear speedup, varying the number of cores and the number of threads as 1, 2, 4, 8, 16, 32.)
Appendix: Alternative Reformulation for DS-MLR
Step 1: Introduce redundant constraints (new parameters A) into the original MLR problem
$\min_{W, A} \; L_1(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i$
$\text{s.t.} \quad a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Appendix: Alternative Reformulation for DS-MLR
Step 2: Turn the problem into an unconstrained min-max problem by introducing Lagrange multipliers $\beta_i$, $\forall i = 1, \ldots, N$
$\min_{W, A} \max_{\beta} \; L_2(W, A, \beta) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \beta_i \, a_i \exp(w_k^T x_i) - \frac{1}{N} \sum_{i=1}^{N} \beta_i$
Primal updates for W, A and a dual update for β (similar in spirit to dual-decomposition methods).
Appendix: Alternative Reformulation for DS-MLR
Step 3: Stare at the updates long enough!
When $a_i^{t+1}$ is solved to optimality, it admits an exact closed-form solution given by $a_i^* = \dfrac{1}{\beta_i \sum_{k=1}^{K} \exp(w_k^T x_i)}$.
The dual-ascent update for $\beta_i$ is no longer needed, since the penalty is always zero if $\beta_i$ is set to a constant equal to 1.
$\min_{W, A} \; L_3(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} a_i \exp(w_k^T x_i) - \frac{1}{N} \sum_{i=1}^{N} 1$
Appendix: Alternative Reformulation for DS-MLR
Step 4: Simple re-write
Doubly-Separable form
$\min_{W, A} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \frac{\lambda \|w_k\|^2}{2N} - \frac{y_{ik} \, w_k^T x_i}{N} - \frac{\log a_i}{NK} + \frac{\exp(w_k^T x_i + \log a_i)}{N} - \frac{1}{NK} \right]$