
"GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning."

In: Proceedings of the 8th ACM Conference on Recommender Systems (RecSys), 2014.

Abstract: "Matrix completion latent factors models are known to be an effective method to build recommender systems. Currently, stochastic gradient descent (SGD) is considered one of the best latent factor-based algorithm for matrix completion. In this paper we discuss GASGD, a distributed asynchronous variant of SGD for large-scale matrix completion, that (i) leverages data partitioning schemes based on graph partitioning techniques, (ii) exploits specific characteristics of the input data and (iii) introduces an explicit parameter to tune synchronization frequency among the computing nodes. We empirically show how, thanks to these features, GASGD achieves a fast convergence rate incurring in smaller communication cost with respect to current asynchronous distributed SGD implementations."


- 1. GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning. Fabio Petroni and Leonardo Querzoni, Cyber Intelligence and Information Security (CIS) Sapienza, Department of Computer, Control and Management Engineering Antonio Ruberti, Sapienza University of Rome. petroni|querzoni@dis.uniroma1.it. Foster City, Silicon Valley, 6th-10th October 2014.
- 2. Matrix Completion & SGD. [Figure: R ≈ P × Q, with a user vector P_i per user, an item vector Q_j per item, and '?' marking missing ratings.] LOSS(P, Q) = Σ_{(i,j)} (R_ij − P_i · Q_j)² + … Stochastic Gradient Descent works by taking steps proportional to the negative of the gradient of the LOSS. Stochastic = P and Q are updated for each given training case by a small step that follows the gradient on average. 2 of 18
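A minimal sketch of one such stochastic step, assuming a squared loss with L2 regularization; the learning rate `eta` and regularization weight `lam` are illustrative values, not taken from the slides:

```python
import numpy as np

def sgd_step(P, Q, i, j, r_ij, eta=0.01, lam=0.05):
    """One stochastic step for a single rating r_ij: move P[i] and Q[j]
    a small step against the gradient of (r_ij - P[i].Q[j])^2 + L2 terms."""
    err = r_ij - P[i] @ Q[j]
    grad_p = -2 * err * Q[j] + 2 * lam * P[i]
    grad_q = -2 * err * P[i] + 2 * lam * Q[j]
    P[i] -= eta * grad_p
    Q[j] -= eta * grad_q

# Toy example: one user, one item, latent dimension 4.
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(1, 4))
Q = rng.normal(scale=0.1, size=(1, 4))
before = abs(5.0 - P[0] @ Q[0])
for _ in range(200):
    sgd_step(P, Q, 0, 0, 5.0)
after = abs(5.0 - P[0] @ Q[0])
assert after < before  # the prediction moves toward the observed rating
```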
- 3. Scalability. ✗ Lengthy training stages; ✗ high computational costs; ✗ especially on large data sets; ✗ input data may not fit in main memory. Goal: efficiently exploit computer cluster architectures. 3 of 18
- 4. Distributed Asynchronous SGD. [Figure: R split into blocks R1..R4 across 4 computing nodes, each holding local replicas P1..P4 and Q1..Q4 of the factor vectors.] R is split; vectors are replicated; replicas are concurrently updated; replicas deviate inconsistently; synchronization is needed. 4 of 18
- 5. Bulk Synchronous Processing Model 5 of 18
- 6. Challenges 1/2. [Figure: R split into R1..R4 across 4 nodes.] 1. Load balance: ensure that computing nodes are fed with the same load. 2. Minimize communication: minimize vector replicas. 6 of 18
- 7. Challenges 2/2. 3. Tune synchronization frequency among computing nodes. Current implementations synchronize vector copies either continuously during the epoch (waste of resources) or once after every epoch (slow convergence). Epoch = a single iteration over the ratings. 7 of 18
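The two extremes above can be seen as endpoints of one schedule. A small sketch of how a per-epoch synchronization frequency f could place barrier points between local SGD steps (the function name and scheduling rule are illustrative, not from the paper):

```python
def sync_points(num_ratings, f):
    """Return the local step indices at which a bulk synchronization
    barrier fires, for f synchronizations per epoch.
    f = 1 reproduces once-per-epoch sync; f = num_ratings reproduces
    continuous synchronization after every update."""
    stride = max(1, num_ratings // f)   # local steps between barriers
    return list(range(stride - 1, num_ratings, stride))[:f]

# An epoch of 12 local ratings with f = 3 barriers:
print(sync_points(12, 3))  # -> [3, 7, 11]
```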
- 8. Contributions. ✓ We mitigate the load imbalance by proposing an input slicing solution based on graph partitioning algorithms; ✓ we show how to reduce the amount of shared data by properly leveraging known characteristics of the input dataset (its bipartite power-law nature); ✓ we show how to leverage the tradeoff between communication cost and algorithm convergence rate by tuning the frequency of the bulk synchronization phase during the computation. 8 of 18
- 9. Graph representation. The rating matrix R describes a bipartite graph: users and items are vertices, each rating an edge. [Figure: the nonzero entries of R drawn as a bipartite graph.] Real data: skewed power-law degree distribution. 9 of 18
- 10. Input partitioner. Vertex-cut performs better than edge-cut on power-law graphs. Assumption: the input data doesn't fit in main memory, hence a streaming algorithm. Balanced k-way vertex-cut graph partitioning: minimize replicas; balance edge load. 10 of 18
- 11. Balanced Vertex-Cut Streaming Algorithms. Hashing: pseudo-random edge assignment. Grid: shuffle and split the rating matrix into identical blocks. Greedy: [Gonzalez et al. 2012] and [Ahmed et al. 2013]. Bipartite-Aware Greedy Algorithm: real-world bipartite graphs are often significantly skewed, with one of the two sets much bigger than the other. By perfectly splitting the bigger set it is possible to achieve a smaller replication factor. GIP (Greedy, Item Partitioned); GUP (Greedy, User Partitioned). 11 of 18
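A GUP-style assignment can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: every user is pinned to a single partition (so users are perfectly split and never replicated), and only item vertices accumulate replicas; integer user ids and the modulo split are assumptions of the sketch.

```python
from collections import defaultdict

def gup_assign(edges, k):
    """Stream (user, item) rating edges into k partitions, GUP-style:
    users are never replicated; items replicate wherever their edges land.
    Returns the edge assignment, per-partition edge load, and the
    average item replication factor."""
    item_parts = defaultdict(set)   # partitions each item appears on
    load = [0] * k                  # edges per partition
    assignment = []
    for user, item in edges:        # single streaming pass
        p = user % k                # pin each user to one partition
        assignment.append((user, item, p))
        item_parts[item].add(p)
        load[p] += 1
    rep = sum(len(s) for s in item_parts.values()) / len(item_parts)
    return assignment, load, rep
```

With a perfectly split user side, the replication factor is driven entirely by how items spread across partitions, which is the quantity GIP/GUP aim to shrink.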
- 12. Evaluation: The Data Sets. [Figure: log-log degree distributions (number of vertices vs. degree) for items and users, on MovieLens 10M and Netflix.] 12 of 18
- 13. Experiments: Partitioning quality. [Figure: replication factor and load standard deviation (%) vs. number of partitions (8 to 256) for hashing, grid, greedy, GIP and GUP, on MovieLens and Netflix.] RF = Replication Factor; RSD = Relative Standard Deviation. 13 of 18
- 14. Synchronization frequency. [Figure: loss (×10^7) vs. epoch on Netflix, for f = 1 and f = 100, comparing grid, greedy, GUP and a centralized baseline.] f = synchronization frequency parameter: the number of synchronization steps during an epoch. Tradeoff between communication cost and convergence rate. 14 of 18
- 15. Evaluation: SSE and Communication cost. [Figure: SSE vs. synchronization frequency, and communication cost vs. synchronization frequency, for hashing, grid, greedy and GUP with |C| = 8 and |C| = 32 nodes, on MovieLens and Netflix.] SSE is measured between the ASGD variants' and the centralized SGD loss curves; CC = communication cost. 15 of 18
- 16. Communication cost. Notation: T = training set; U = users set; I = items set; V = U ∪ I; C = processing nodes; RF = replication factor; RF_U = users' RF; RF_I = items' RF. RF = (|U|·RF_U + |I|·RF_I) / |V|. For f = 1: CC ≈ 2(|U|·RF_U + |I|·RF_I) = 2|V|·RF. For f = |T|/|C|: CC ≈ |T|·(RF_U + RF_I). 16 of 18
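The two regimes can be checked numerically. The sizes and replication factors below are illustrative stand-ins (roughly Netflix-scale), not figures from the paper:

```python
# Worked check of the two communication-cost regimes:
# |U| users, |I| items, |T| ratings, with assumed replication factors.
U, I, T = 480_000, 18_000, 100_000_000
RF_U, RF_I = 1.0, 8.0            # assumption: users split, items replicated
V = U + I                        # |V| = |U| + |I|
RF = (U * RF_U + I * RF_I) / V   # overall replication factor

cc_f1 = 2 * (U * RF_U + I * RF_I)       # f = 1: sync once per epoch
assert abs(cc_f1 - 2 * V * RF) < 1e-6   # the two closed forms agree

cc_fmax = T * (RF_U + RF_I)             # f = |T|/|C|: continuous sync
assert cc_fmax > cc_f1                  # continuous sync costs far more
```

The gap between `cc_f1` and `cc_fmax` is what makes the frequency parameter worth tuning: cost scales with the vertex set at f = 1 but with the much larger edge set under continuous synchronization.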
- 17. Conclusions. Three distinct contributions aimed at improving the efficiency and scalability of ASGD: 1. we proposed an input slicing solution based on a graph partitioning approach that mitigates load imbalance among SGD instances (i.e. better scalability); 2. we further reduced the amount of shared data by exploiting specific characteristics of the training dataset, providing lower communication costs during the algorithm execution (i.e. better efficiency); 3. we introduced a synchronization frequency parameter driving a tradeoff that can be accurately leveraged to further improve the algorithm efficiency. 17 of 18
- 18. Thank you! Questions? Fabio Petroni, Rome, Italy. Current position: PhD Student in Computer Engineering, Sapienza University of Rome. Research Areas: Recommendation Systems, Collaborative Filtering, Distributed Systems. petroni@dis.uniroma1.it 21 of 21
