Thesis Title: Hybrid-Parallel Parameter Estimation for Frequentist and Bayesian Models
Thesis Committee: S.V.N. Vishwanathan, Manfred Warmuth, David P. Helmbold
Abstract: Distributed algorithms in machine learning follow two main flavors: data parallel, where the data is distributed across multiple workers, and model parallel, where the model parameters are partitioned across multiple workers. The main limitation of the first approach is that the model parameters need to be replicated on every machine, which is problematic when the number of parameters is very large and cannot fit on a single machine. The drawback of the latter approach is that the data needs to be replicated on each machine.
In this defense talk, I will present Hybrid-Parallelism, an approach that allows us to partition both the data and the model parameters simultaneously. As a result, each worker only needs access to a subset of the data and a subset of the parameters while performing parameter updates. I will present case studies of two popular models: (1) Multinomial Logistic Regression, and (2) Mixtures of Exponential Families. In both cases, I will show how to exploit the access pattern of parameter updates to derive Hybrid-Parallel asynchronous algorithms.
Hybrid-Parallel Parameter Estimation for Frequentist and Bayesian Models
Parameswaran Raman
Ph.D. Dissertation Defense
University of California, Santa Cruz
December 2, 2019
Challenges in Parameter Estimation
1 Storage limitations of Data and Model
2 Interdependence in parameter updates
3 Bulk-Synchronization is expensive
4 Synchronous communication is inefficient
Traditional distributed machine learning approaches fall short.
Hybrid-Parallel algorithms for parameter estimation!
Introduction
Regularized Risk Minimization
Goals in machine learning
We want to build a model using observed (training) data
Our model must generalize to unseen (test) data
$\min_{\theta} \; L(\theta) := \underbrace{\lambda R(\theta)}_{\text{regularizer}} + \underbrace{\frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(x_i, y_i, \theta)}_{\text{empirical risk}}$
$X = \{x_1, \ldots, x_N\}$, $y = \{y_1, \ldots, y_N\}$ is the observed training data
$\theta$ are the model parameters
loss(·) quantifies the model's performance
The regularizer $R(\theta)$ avoids over-fitting (penalizes complex models)
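To make the objective concrete, here is a minimal Python/NumPy sketch (an illustration only, not the thesis implementation) that evaluates the regularized risk for a generic loss and regularizer; the ridge-regression choice at the bottom is just one hypothetical instantiation.

```python
import numpy as np

def regularized_risk(theta, X, y, loss, regularizer, lam):
    """Evaluate L(theta) = lam * R(theta) + (1/N) * sum_i loss(x_i, y_i, theta)."""
    N = X.shape[0]
    empirical_risk = sum(loss(X[i], y[i], theta) for i in range(N)) / N
    return lam * regularizer(theta) + empirical_risk

# Example instantiation: squared loss with an L2 regularizer (ridge regression).
squared_loss = lambda x, y, theta: 0.5 * (x @ theta - y) ** 2
l2 = lambda theta: 0.5 * np.dot(theta, theta)

rng = np.random.default_rng(0)
X, y, theta = rng.normal(size=(100, 5)), rng.normal(size=100), np.zeros(5)
print(regularized_risk(theta, X, y, squared_loss, l2, lam=0.1))
```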
Introduction
Bayesian Models
Examples: Gaussian Mixture Models (GMM), Latent Dirichlet Allocation (LDA)
$\underbrace{p(\theta \mid X)}_{\text{posterior}} = \dfrac{\overbrace{p(X \mid \theta)}^{\text{likelihood}} \cdot \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{\int p(X, \theta) \, d\theta}_{\text{marginal likelihood (model evidence)}}}$
The prior plays the role of the regularizer $R(\theta)$
The likelihood plays the role of the empirical risk
Distributed Parameter Estimation
(Figure: data X of size N × D and model θ of size D × K.)
Data parallel: partition the data X across workers, e.g. L-BFGS
Model parallel: partition the model θ across workers, e.g. LC [Gopal et al., 2013]
Distributed Parameter Estimation
Good
Easy to implement using map-reduce
Scales as long as the Data or the Model fits in memory
Bad
Either the Data or the Model is replicated on each worker.
Data Parallel: each worker requires $O\left(\frac{N \times D}{P}\right) + \underbrace{O(K \times D)}_{\text{bottleneck}}$
Model Parallel: each worker requires $\underbrace{O(N \times D)}_{\text{bottleneck}} + O\left(\frac{K \times D}{P}\right)$
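The per-worker footprints above are easy to sanity-check numerically. The sketch below (Python, with made-up sizes) contrasts the two regimes with the hybrid one introduced next; the dense O(N × D) accounting follows the slide and ignores sparsity.

```python
def per_worker_memory(N, D, K, P):
    """Rough per-worker storage (number of floats) for each strategy."""
    data_parallel  = (N * D) / P + K * D        # data sharded, model replicated
    model_parallel = N * D + (K * D) / P        # model sharded, data replicated
    hybrid         = (N * D) / P + (K * D) / P  # both sharded
    return data_parallel, model_parallel, hybrid

# Hypothetical sizes: 1M examples, 100K features, 10K classes, 100 workers.
for name, floats in zip(("data-parallel", "model-parallel", "hybrid"),
                        per_worker_memory(10**6, 10**5, 10**4, 100)):
    print(f"{name:>14}: {floats * 8 / 1e9:.1f} GB")
```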
Benefits of Hybrid-Parallelism
1 One versatile method for all regimes of data and model parallelism
2 Independent parameter updates on each worker
3 Fully de-centralized and asynchronous optimization algorithms
Double-Separability
Definition
Let $\{S_i\}_{i=1}^{m}$ and $\{S'_j\}_{j=1}^{m'}$ be two families of sets of parameters. A function $f : \prod_{i=1}^{m} S_i \times \prod_{j=1}^{m'} S'_j \to \mathbb{R}$ is doubly separable if there exist $f_{ij} : S_i \times S'_j \to \mathbb{R}$ for each $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, m'$ such that:
$f(\theta_1, \theta_2, \ldots, \theta_m, \theta'_1, \theta'_2, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$
Double-Separability
$f(\theta_1, \theta_2, \ldots, \theta_m, \theta'_1, \theta'_2, \ldots, \theta'_{m'}) = \sum_{i=1}^{m} \sum_{j=1}^{m'} f_{ij}(\theta_i, \theta'_j)$
(Figure: m × m' grid of the terms $f_{ij}$.) The $f_{ij}$ corresponding to the highlighted diagonal blocks can be computed independently and in parallel.
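A toy illustration of why double-separability enables parallelism: the Python sketch below evaluates a doubly-separable function diagonal block by diagonal block. Within each round the blocks touch disjoint row- and column-parameters, so they could run on independent workers; the particular $f_{ij}$ used here is an arbitrary stand-in.

```python
import numpy as np

m = mp = 4                          # number of row-parameters and column-parameters
theta  = np.arange(1.0, m + 1)      # theta_1 .. theta_m
thetap = np.arange(1.0, mp + 1)     # theta'_1 .. theta'_m'

f_ij = lambda a, b: (a - b) ** 2    # any per-block term f_ij(theta_i, theta'_j)

total = 0.0
for r in range(mp):                 # m' rounds over shifted diagonals
    # Blocks (i, (i + r) mod m') touch disjoint theta_i and theta'_j,
    # so this inner loop could be executed by m independent workers.
    for i in range(m):
        j = (i + r) % mp
        total += f_ij(theta[i], thetap[j])

# Sanity check against the full double sum.
assert np.isclose(total, sum(f_ij(a, b) for a in theta for b in thetap))
print(total)
```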
Technical Contributions of my thesis
Study frequentist and Bayesian models which do not have direct double-separability in their parameter estimation:
1 Propose reformulations that lead to Hybrid-Parallelism
2 Design asynchronous distributed optimization algorithms
Thesis Roadmap: Multinomial Logistic Regression (KDD 2019)
Multinomial Logistic Regression (MLR)
Given
Training data $(x_i, y_i)_{i=1,\ldots,N}$ where $x_i \in \mathbb{R}^D$, with corresponding labels $y_i \in \{1, 2, \ldots, K\}$
N, D and K are large ($N \ggg D \gg K$)
Goal
The probability that $x_i$ belongs to class $k$ is given by:
$p(y_i = k \mid x_i, W) = \dfrac{\exp(w_k^T x_i)}{\sum_{j=1}^{K} \exp(w_j^T x_i)}$
where $W = \{w_1, w_2, \ldots, w_K\}$ denotes the parameters of the model.
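A minimal sketch of the class-probability computation (Python/NumPy). The max-subtraction is only the usual numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def mlr_probabilities(W, x):
    """p(y = k | x, W) = exp(w_k^T x) / sum_j exp(w_j^T x), for all k."""
    scores = W @ x                       # shape (K,): w_k^T x for each class k
    scores = scores - scores.max()       # stabilize exp() without changing the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))              # K = 5 classes, D = 3 features
x = rng.normal(size=3)
p = mlr_probabilities(W, x)
print(p, p.sum())                        # the probabilities sum to 1
```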
Multinomial Logistic Regression (MLR)
The corresponding $\ell_2$-regularized negative log-likelihood loss:
$\min_{W} \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i + \frac{1}{N} \sum_{i=1}^{N} \log \sum_{k=1}^{K} \exp(w_k^T x_i)$
where $\lambda$ is the regularization hyper-parameter. The last (log-sum-exp) term makes model parallelism hard.
Reformulation into Doubly-Separable form
Log-concavity bound [GopYan13]
$\log(\gamma) \leq a \cdot \gamma - \log(a) - 1, \quad \forall \gamma, a > 0$
where $a$ is a variational parameter. The bound is tight when $a = \frac{1}{\gamma}$ (indeed, at $a = \frac{1}{\gamma}$ the right-hand side equals $1 + \log\gamma - 1 = \log\gamma$).
Reformulating the objective of MLR
$\min_{W, A} \; \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 + \frac{1}{N} \sum_{i=1}^{N} \left[ - \sum_{k=1}^{K} y_{ik} \, w_k^T x_i + a_i \sum_{k=1}^{K} \exp(w_k^T x_i) - \log(a_i) - 1 \right]$
where $a_i$ can be computed in closed form as:
$a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Doubly-Separable Multinomial Logistic Regression (DS-MLR)
Doubly-Separable form
$\min_{W, A} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \frac{\lambda \|w_k\|^2}{2N} - \frac{y_{ik} \, w_k^T x_i}{N} - \frac{\log a_i}{NK} + \frac{\exp(w_k^T x_i + \log a_i)}{N} - \frac{1}{NK} \right]$
Doubly-Separable Multinomial Logistic Regression (DS-MLR)
Stochastic Gradient Updates
Each term in the stochastic update depends only on data point i and class k.
$w_k^{t+1} \leftarrow w_k^t - \eta_t K \left( \lambda w_k^t - y_{ik} x_i + \exp(w_k^T x_i + \log a_i) \, x_i \right)$
$\log a_i^{t+1} \leftarrow \log a_i^t - \eta_t K \left( \exp(w_k^T x_i + \log a_i) - \frac{1}{K} \right)$
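A single-machine sketch of these stochastic updates (Python/NumPy): it samples one (i, k) pair at a time and applies the two updates above. The step size, iteration count and initialization are arbitrary choices for illustration, not the tuned settings used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K, lam, eta = 200, 10, 5, 0.01, 0.01

X = rng.normal(size=(N, D))
labels = rng.integers(K, size=N)
Y = np.eye(K)[labels]                      # one-hot labels y_ik

W = np.zeros((K, D))                       # model parameters w_k
log_a = np.full(N, -np.log(K))             # variational parameters, a_i = 1/K initially

for t in range(20000):
    i, k = rng.integers(N), rng.integers(K)
    e = np.exp(W[k] @ X[i] + log_a[i])     # exp(w_k^T x_i + log a_i)
    # Stochastic updates for w_k and log a_i, one (i, k) term at a time.
    W[k]     -= eta * K * (lam * W[k] - Y[i, k] * X[i] + e * X[i])
    log_a[i] -= eta * K * (e - 1.0 / K)

# As the iterates converge, a_i should approach 1 / sum_k exp(w_k^T x_i).
i = 0
print(np.exp(log_a[i]), 1.0 / np.sum(np.exp(W @ X[i])))
```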
Access Pattern of updates: Stoch wk, Stoch ai
(Figure: access pattern over X, W, A.)
(a) Updating $w_k$ only requires computing $a_i$
(b) Updating $a_i$ only requires accessing $w_k$ and $x_i$.
Updating ai: Closed form instead of Stochastic update
Closed-form update for $a_i$:
$a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Access Pattern of updates: Stoch wk, Exact ai
(Figure: access pattern over X, W, A.)
(a) Updating $w_k$ only requires computing $a_i$
(b) Updating $a_i$ requires accessing the entire W. Synchronization bottleneck!
Updating ai: Avoiding bulk-synchronization
Closed-form update for $a_i$:
$a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Each worker computes a partial sum using the $w_k$ it owns.
With P workers: after P rounds the global sum is available.
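A small Python simulation of the idea: the class partial sums are accumulated one worker at a time, so no worker ever needs the whole W, and after P rounds the global normalizer, hence $a_i$, is available. Worker count and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, P = 8, 12, 4
W = rng.normal(size=(K, D))
x_i = rng.normal(size=D)

# Each worker owns a disjoint block of classes (rows of W).
blocks = np.array_split(np.arange(K), P)

partial_sum = 0.0
for p in range(P):                 # P rounds: the running sum is passed worker to worker
    owned = blocks[p]
    partial_sum += np.sum(np.exp(W[owned] @ x_i))

a_i = 1.0 / partial_sum
assert np.isclose(a_i, 1.0 / np.sum(np.exp(W @ x_i)))   # matches the global computation
print(a_i)
```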
Parallelization: Synchronous DSGD [Gemulla et al., 2011]
X and the local parameters A are partitioned horizontally (1, . . . , N)
The global model parameters W are partitioned vertically (1, . . . , K)
P = 4 workers work on mutually-exclusive blocks of A and W
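A sketch of a DSGD-style schedule (Python, simulation only): in each of the P sub-epochs the P workers are assigned mutually-exclusive (row block, column block) pairs, so they never touch the same block of A or W; the actual block update is elided. This is an illustration of the scheduling idea, not the thesis implementation.

```python
import numpy as np

N, K, P = 16, 8, 4
row_blocks = np.array_split(np.arange(N), P)   # partitions of X and A (over examples)
col_blocks = np.array_split(np.arange(K), P)   # partitions of W (over classes)

for epoch in range(P):
    # In sub-epoch `epoch`, worker p works on (row block p, column block (p + epoch) mod P).
    assignment = [(p, (p + epoch) % P) for p in range(P)]
    assert len({c for _, c in assignment}) == P   # column blocks are mutually exclusive
    for p, c in assignment:
        rows, cols = row_blocks[p], col_blocks[c]
        # worker p: run stochastic updates on A[rows] and W[cols] using X[rows, :]
        pass
    # barrier: all workers synchronize before moving to the next sub-epoch
```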
Parallelization: Asynchronous NOMAD [Yun et al., 2014]
(Figure sequence over four snapshots: the partitioned data together with the local parameters A and the global parameters W; ownership of blocks of W moves between workers asynchronously from one snapshot to the next.)
Thesis Roadmap: Mixture Models (AISTATS 2019)
Generative Model
Draw cluster proportions π:
$p(\pi \mid \alpha) = \mathrm{Dirichlet}(\alpha)$
Draw cluster parameters for k = 1, . . . , K:
$p(\theta_k \mid n_k, \nu_k) = \exp\left( n_k \langle \nu_k, \theta_k \rangle - n_k \, g(\theta_k) - h(n_k, \nu_k) \right)$
Draw observations for i = 1, . . . , N:
$p(z_i \mid \pi) = \mathrm{Multinomial}(\pi)$
$p(x_i \mid z_i, \theta) = \exp\left( \langle \phi(x_i, z_i), \theta_{z_i} \rangle - g(\theta_{z_i}) \right)$
$p(\theta_k \mid n_k, \nu_k)$ is conjugate to $p(x_i \mid z_i = k, \theta_k)$
$p(\pi \mid \alpha)$ is conjugate to $p(z_i \mid \pi)$
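To make the abstract exponential-family notation concrete, here is a Python sketch that samples from the generative model in its Gaussian-mixture instance (isotropic, known-variance Gaussians, used here only as an example); the hyper-parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N, D, alpha = 3, 500, 2, 1.0

# Draw cluster proportions: pi ~ Dirichlet(alpha)
pi = rng.dirichlet(alpha * np.ones(K))

# Draw cluster parameters theta_k from the conjugate prior; for an isotropic
# Gaussian likelihood this amounts to a Gaussian prior on the cluster means.
mu = rng.normal(scale=5.0, size=(K, D))

# Draw observations: z_i ~ Multinomial(pi), x_i | z_i ~ N(mu_{z_i}, I)
z = rng.choice(K, size=N, p=pi)
x = mu[z] + rng.normal(size=(N, D))

print(pi, np.bincount(z, minlength=K))
```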
Joint Distribution
$p(\pi, \theta, z, x \mid \alpha, n, \nu) = p(\pi \mid \alpha) \cdot \prod_{k=1}^{K} p(\theta_k \mid n_k, \nu_k) \cdot \prod_{i=1}^{N} p(z_i \mid \pi) \, p(x_i \mid z_i, \theta)$
We want to estimate the posterior: $p(\pi, \theta, z \mid x, \alpha, n, \nu)$
Variational Inference [Blei et al., 2016]
Approximate $p(\pi, \theta, z \mid x, \alpha, n, \nu)$ with a simpler (factorized) distribution q:
$q(\pi, \theta, z \mid \tilde{\pi}, \tilde{\theta}, \tilde{z}) = q(\pi \mid \tilde{\pi}) \cdot \prod_{k=1}^{K} q(\theta_k \mid \tilde{\theta}_k) \cdot \prod_{i=1}^{N} q(z_i \mid \tilde{z}_i)$
Keeping $\tilde{\pi}$ in Sync
A single global $\tilde{\pi}$ travels among machines, broadcasting local delta updates.
Every machine p keeps $(\pi_p, \bar{\pi})$:
$\pi_p$: local working copy
$\bar{\pi}$: snapshot of the global $\tilde{\pi}$
Thesis Roadmap: Latent Collaborative Retrieval (NIPS 2014)
Objectives
Can we use this to design robust losses for ranking?
Two scenarios:
Explicit: features and labels are observed (Learning to Rank)
Implicit: features and labels are latent (Latent Collaborative Retrieval)
Distributed Optimization Algorithms
Explicit Setting - Learning to Rank
Scoring function: $f(x, y) = \langle \phi(x, y), w \rangle$
Formulating a Robust Ranking Loss (RoBiRank)
Rank
$\mathrm{rank}_w(x, y) = \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \mathbb{1}\left[ f(x, y) - f(x, y') < 0 \right]$
Objective Function
$L(w) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right)$
where σ is the logistic loss.
Formulating a Robust Ranking Loss (RoBiRank)
Applying the robust transformation $\rho_2(t)$:
$L(w) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \rho_2\left( \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$
where $\rho_2(t) = 1 - \frac{1}{\log_2(t+2)}$
Directly related to DCG (the evaluation metric for ranking):
minimizing $L(w) \implies$ maximizing DCG
Formulating a Robust Ranking Loss (RoBiRank)
In practice, $\rho_1$ is easier to optimize:
$L(w) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \rho_1\left( \sum_{y' \in \mathcal{Y}_x, \, y' \neq y} \sigma\left( f(x, y) - f(x, y') \right) \right)$
where $\rho_1(t) = \log_2(t + 1)$
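A direct Python sketch of the RoBiRank objective for given scores (illustration only): it evaluates the $\rho_1$ version above, and switching `rho` to $\rho_2$ gives the DCG-related variant. The base-2 form of the logistic loss is an assumption (the slide only states that σ is the logistic loss), and the toy scores and relevances are made up.

```python
import numpy as np

sigma = lambda t: np.log2(1.0 + 2.0 ** (-t))   # logistic loss (base-2 convention; assumption)
rho1  = lambda t: np.log2(t + 1.0)
rho2  = lambda t: 1.0 - 1.0 / np.log2(t + 2.0)

def robirank_loss(scores, relevance, rho=rho1):
    """L = sum_x sum_y r_xy * rho( sum_{y' != y} sigma(f(x,y) - f(x,y')) )."""
    loss = 0.0
    for f_x, r_x in zip(scores, relevance):     # one query/user x at a time
        for y, r_xy in enumerate(r_x):
            t = sum(sigma(f_x[y] - f_x[yp]) for yp in range(len(f_x)) if yp != y)
            loss += r_xy * rho(t)
    return loss

scores    = [np.array([2.0, 0.5, -1.0])]        # f(x, y) for one x and three items y
relevance = [np.array([1.0, 0.0, 0.0])]         # r_xy
print(robirank_loss(scores, relevance, rho1), robirank_loss(scores, relevance, rho2))
```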
Experiments: Single Machine
(Figure, TD 2004 dataset: NDCG@k for k = 5, . . . , 20. Panel 1 compares RoBiRank against RankSVM, LSRank, InfNormPush and IRPush; panel 2 compares RoBiRank against MART, RankNet, RankBoost, AdaRank, CoordAscent, LambdaMART, ListNet and RandomForests.)
RoBiRank shows good accuracy (NDCG) at the top of the ranking list.
How to make RoBiRank Hybrid-Parallel?
Consider RoBiRank in the implicit case (no features, no explicit labels).
Implicit Case
$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \rho_1\left( \sum_{y' \neq y} \sigma\left( \langle U_x, V_y \rangle - \langle U_x, V_{y'} \rangle \right) \right)$
The inner sum over $y'$ is expensive. How can we avoid this?
Linearization
Upper bound for $\rho_1$ [Bouchard07]
$\rho_1(t) = \log_2(t + 1) \leq -\log_2 \xi + \frac{\xi \cdot (t + 1) - 1}{\log 2}$
The bound holds for any $\xi > 0$ and is tight when $\xi = \frac{1}{t+1}$.
Linearization
We can linearize the objective function using this upper bound:
$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}_x} r_{xy} \cdot \left[ -\log_2 \xi_{xy} + \frac{\xi_{xy} \left( \sum_{y' \neq y} \sigma\left( \langle U_x, V_y \rangle - \langle U_x, V_{y'} \rangle \right) + 1 \right) - 1}{\log 2} \right]$
We can obtain stochastic gradient updates for $U_x$, $V_y$ and $\xi_{xy}$
$\xi_{xy}$ captures the importance of the $(x, y)$ pair in the optimization
$\xi_{xy}$ also has a closed-form solution
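A small numeric check of the linearization (Python): for any $\xi > 0$ the right-hand side upper-bounds $\rho_1(t)$, and plugging in $\xi = \frac{1}{t+1}$ recovers $\rho_1(t)$ exactly, which is what allows $\xi_{xy}$ to be updated in closed form.

```python
import numpy as np

rho1  = lambda t: np.log2(t + 1.0)
bound = lambda t, xi: -np.log2(xi) + (xi * (t + 1.0) - 1.0) / np.log(2.0)

t = 3.7
for xi in (0.05, 0.5, 1.0, 2.0):
    assert bound(t, xi) >= rho1(t) - 1e-12     # upper bound for any xi > 0

xi_star = 1.0 / (t + 1.0)                      # closed-form minimizer of the bound
print(rho1(t), bound(t, xi_star))              # equal: the bound is tight at xi*
```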
Conclusion and Future Directions
Conclusion and Key Takeaways
Data and Model grow hand in hand.
Challenges in Parameter Estimation.
We proposed:
Hybrid-Parallel formulations
Distributed, Asynchronous Algorithms
Case-studies:
Multinomial Logistic Regression
Mixture Models
Latent Collaborative Retrieval
Future Work: Other Hybrid-Parallel models
Infinite Mixture Models
DP Mixture Model [BleiJordan2006]
Pitman-Yor Mixture Model [Dubey et al., 2014]
Mixed-Membership Stochastic Block Models [Airoldi et al., 2008]
Deep Neural Networks [Wang et al., 2014]
Acknowledgements
Thanks to all my collaborators.
Appendix
Polynomial Regression
$f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} w_{jj'} \, x_{ij} x_{ij'}$
$x_i \in \mathbb{R}^D$ is an example from the dataset $X \in \mathbb{R}^{N \times D}$
$w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $W \in \mathbb{R}^{D \times D}$ are parameters of the model
The dense parameterization is not suited for sparse data
Computationally expensive - $O(N \times D^2)$
Factorization Machines
$f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \sum_{j=1}^{D} \sum_{j'=j+1}^{D} \langle v_j, v_{j'} \rangle \, x_{ij} x_{ij'}$
Factorized parameterization using $VV^T$ instead of the dense W
Model parameters are $w_0 \in \mathbb{R}$, $w \in \mathbb{R}^D$, $V \in \mathbb{R}^{D \times K}$
$v_j \in \mathbb{R}^K$ denotes the latent embedding for the j-th feature
Computationally much cheaper - $O(N \times D \times K)$, since the pairwise term can be rewritten as
$f(x_i) = w_0 + \sum_{j=1}^{D} w_j x_{ij} + \frac{1}{2} \sum_{k=1}^{K} \left[ \left( \sum_{d=1}^{D} v_{dk} x_{id} \right)^2 - \sum_{j=1}^{D} v_{jk}^2 x_{ij}^2 \right]$
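A sketch of the FM prediction using the rewritten pairwise term, which costs O(D × K) per example instead of O(D²); the brute-force double loop is included only to check the identity, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Kf = 6, 3
w0, w, V = 0.1, rng.normal(size=D), rng.normal(size=(D, Kf))
x = rng.normal(size=D)

def fm_predict(x, w0, w, V):
    """f(x) via the O(D*K) identity: 0.5 * sum_k [ (sum_d v_dk x_d)^2 - sum_d v_dk^2 x_d^2 ]."""
    s = V.T @ x                                        # sum_d v_dk x_d, for each k
    pairwise = 0.5 * np.sum(s ** 2 - (V ** 2).T @ (x ** 2))
    return w0 + w @ x + pairwise

# Brute-force O(D^2) evaluation of the pairwise term, for checking the identity.
brute = w0 + w @ x + sum(V[j] @ V[jp] * x[j] * x[jp]
                         for j in range(D) for jp in range(j + 1, D))
assert np.isclose(fm_predict(x, w0, w, V), brute)
print(fm_predict(x, w0, w, V))
```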
Factorization Machines
$L(w, V) = \frac{1}{N} \sum_{i=1}^{N} l\left( f(x_i), y_i \right) + \frac{\lambda_w}{2} \|w\|_2^2 + \frac{\lambda_v}{2} \|V\|_2^2$
$\lambda_w$ and $\lambda_v$ are regularizers for w and V
$l(\cdot)$ is a loss function depending on the task
Factorization Machines
Gradient Descent updates (First-Order Features)
$w_j^{t+1} \leftarrow w_j^t - \eta \sum_{i=1}^{N} \left( \nabla_{w_j} l_i(w, V) + \lambda_w w_j^t \right) = w_j^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot \nabla_{w_j} f(x_i) + \lambda_w w_j^t \right) = w_j^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot x_{ij} + \lambda_w w_j^t \right) \quad (1)$
where the multiplier $G_i^t$ is given by
$G_i^t = \begin{cases} f(x_i) - y_i, & \text{if squared loss (regression)} \\ \dfrac{-y_i}{1 + \exp(y_i \cdot f(x_i))}, & \text{if logistic loss (classification)} \end{cases} \quad (2)$
Factorization Machines
Gradient Descent updates (Second-Order Features)
$v_{jk}^{t+1} \leftarrow v_{jk}^t - \eta \sum_{i=1}^{N} \left( \nabla_{v_{jk}} l_i(w, V) + \lambda_v v_{jk}^t \right) = v_{jk}^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot \nabla_{v_{jk}} f(x_i) + \lambda_v v_{jk}^t \right) = v_{jk}^t - \eta \sum_{i=1}^{N} \left( G_i^t \cdot x_{ij} \left( \sum_{d=1}^{D} v_{dk}^t x_{id} - v_{jk}^t x_{ij} \right) + \lambda_v v_{jk}^t \right) \quad (3)$
where the multiplier $G_i^t$ is the same as before, and the synchronization term is $a_{ik} = \sum_{d=1}^{D} v_{dk}^t x_{id}$.
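A sketch of one batch gradient-descent pass over updates (1)-(3) for the squared-loss (regression) case, with the multipliers $G_i$ and the synchronization terms $a_{ik}$ precomputed as in the equations above; the learning rate, data and iteration count are placeholders, and $w_0$ is left fixed since the slides only show the $w_j$ and $v_{jk}$ updates.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Kf = 50, 6, 3
X, y = rng.normal(size=(N, D)), rng.normal(size=N)
w0, w, V = 0.0, np.zeros(D), 0.01 * rng.normal(size=(D, Kf))
lam_w, lam_v, eta = 0.01, 0.01, 0.01

def fm_predict_all(X, w0, w, V):
    A = X @ V                                           # a_ik = sum_d v_dk x_id
    f = w0 + X @ w + 0.5 * np.sum(A ** 2 - (X ** 2) @ (V ** 2), axis=1)
    return f, A

for step in range(100):
    f, A = fm_predict_all(X, w0, w, V)
    G = f - y                                           # multiplier G_i (squared loss)
    # Update (1): w_j <- w_j - eta * sum_i (G_i x_ij + lam_w w_j)
    w = w - eta * (X.T @ G + N * lam_w * w)
    # Update (3): v_jk <- v_jk - eta * sum_i (G_i x_ij (a_ik - v_jk x_ij) + lam_v v_jk)
    grad_V = X.T @ (G[:, None] * A) - (G @ (X ** 2))[:, None] * V + N * lam_v * V
    V = V - eta * grad_V

print(np.mean((fm_predict_all(X, w0, w, V)[0] - y) ** 2))   # training error after the pass
```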
Factorization Machines
Access pattern of parameter updates for $w_j$ and $v_{jk}$
(Figure: (a) $w_j$ update, (b) $v_{jk}$ update, over X, w, V, G, A.) Updating $w_j$ requires computing $G_i$, and likewise updating $v_{jk}$ requires computing $a_{ik}$.
(Figure: (a) computing $G_i$, (b) computing $a_{ik}$.) Computing both G and A requires accessing all the dimensions $j = 1, \ldots, D$. This is the main synchronization bottleneck.
Experiments: Scaling in terms of threads
(Figure, realsim dataset: speedup vs. time (secs) against linear speedup, varying the number of cores and the number of threads as 1, 2, 4, 8, 16, 32.)
Appendix: Alternative Reformulation for DS-MLR
Step 1: Introduce redundant constraints (new parameters A) into the original MLR problem
$\min_{W, A} \; L_1(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i$
$\text{s.t.} \quad a_i = \dfrac{1}{\sum_{k=1}^{K} \exp(w_k^T x_i)}$
Appendix: Alternative Reformulation for DS-MLR
Step 2: Turn the problem into an unconstrained min-max problem by introducing Lagrange multipliers $\beta_i$, $\forall i = 1, \ldots, N$
$\min_{W, A} \max_{\beta} \; L_2(W, A, \beta) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \beta_i \, a_i \exp(w_k^T x_i) - \frac{1}{N} \sum_{i=1}^{N} \beta_i$
Primal updates for W, A and a dual update for β (similar in spirit to dual-decomposition methods).
Appendix: Alternative Reformulation for DS-MLR
Step 3: Stare at the updates long enough!
When $a_i^{t+1}$ is solved to optimality, it admits an exact closed-form solution given by $a_i^* = \dfrac{1}{\beta_i \sum_{k=1}^{K} \exp(w_k^T x_i)}$.
The dual-ascent update for $\beta_i$ is no longer needed, since the penalty is always zero if $\beta_i$ is set to a constant equal to 1.
$\min_{W, A} \; L_3(W, A) = \frac{\lambda}{2} \sum_{k=1}^{K} \|w_k\|^2 - \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, w_k^T x_i - \frac{1}{N} \sum_{i=1}^{N} \log a_i + \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} a_i \exp(w_k^T x_i) - \frac{1}{N} \sum_{i=1}^{N} 1$
Appendix: Alternative Reformulation for DS-MLR
Step 4: Simple re-write
Doubly-Separable form
$\min_{W, A} \; \sum_{i=1}^{N} \sum_{k=1}^{K} \left[ \frac{\lambda \|w_k\|^2}{2N} - \frac{y_{ik} \, w_k^T x_i}{N} - \frac{\log a_i}{NK} + \frac{\exp(w_k^T x_i + \log a_i)}{N} - \frac{1}{NK} \right]$