
GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning

F. Petroni and L. Querzoni:
"GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning."
In: Proceedings of the 8th ACM Conference on Recommender Systems (RecSys), 2014.

Abstract: "Matrix completion latent factors models are known to be an effective method to build recommender systems. Currently,
stochastic gradient descent (SGD) is considered one of the best latent factor-based algorithm for matrix completion. In this paper we discuss GASGD, a distributed asynchronous variant of SGD for large-scale matrix completion, that (i) leverages data partitioning schemes based on graph partitioning techniques, (ii) exploits specific characteristics of the input data and (iii) introduces an explicit parameter to tune synchronization frequency among the computing nodes. We empirically show how, thanks to these features, GASGD achieves a fast convergence rate incurring in smaller communication cost with respect to current asynchronous distributed SGD implementations."



  1. GASGD: Stochastic Gradient Descent for Distributed Asynchronous Matrix Completion via Graph Partitioning
     Fabio Petroni and Leonardo Querzoni
     Cyber Intelligence and Information Security (CIS) Sapienza, Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza University of Rome - petroni|querzoni@dis.uniroma1.it
     Foster City, Silicon Valley, 6th-10th October 2014
  2. Matrix Completion & SGD
     [Figure: the rating matrix R is approximated by the product of user vectors (P) and item vectors (Q); "?" marks a missing rating to predict.]
     LOSS(P, Q) = Σ_(i,j) (R_ij − P_i · Q_j)² + regularization terms
     • Stochastic Gradient Descent works by taking steps proportional to the negative of the gradient of the LOSS.
     • Stochastic: P and Q are updated for each training case by a small step that, on average, follows the direction of gradient descent.
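To make the update rule on slide 2 concrete, here is a minimal Python sketch (mine, not code from the paper) of one SGD pass over the observed ratings; the rank, learning rate and regularization weight are illustrative placeholders.

```python
import numpy as np

def sgd_epoch(ratings, P, Q, lr=0.01, reg=0.05):
    """One SGD pass over (user, item, rating) triples.

    Each observed rating moves the user vector P[u] and the item vector
    Q[i] a small step along the negative gradient of the squared error
    (r - P[u].Q[i])^2 plus an L2 penalty.
    """
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]        # prediction error on this rating
        p_u = P[u].copy()            # keep the old value for Q's update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# toy usage: 3 users, 4 items, rank-2 factors
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(3, 2))
Q = rng.normal(scale=0.1, size=(4, 2))
ratings = [(0, 1, 4.0), (1, 2, 3.0), (2, 0, 5.0)]
for _ in range(10):
    P, Q = sgd_epoch(ratings, P, Q)
```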
  3. Scalability
     ✗ Lengthy training stages;
     ✗ high computational costs;
     ✗ especially on large data sets;
     ✗ input data may not fit in main memory.
     • Goal: efficiently exploit computer cluster architectures.
  4. Distributed Asynchronous SGD
     [Figure: the rating matrix R is split into blocks R1-R4, each assigned to one of four computing nodes together with local copies P1,Q1 ... P4,Q4 of the factor vectors it needs.]
     • R is split;
     • factor vectors are replicated;
     • replicas are updated concurrently;
     • replicas deviate inconsistently;
     • periodic synchronization is required.
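As an illustration of why splitting R forces the factor vectors to be replicated, the sketch below (my own toy code; grid_assign is a placeholder rule, not the paper's grid scheme) assigns each rating to a node and then derives which user and item vectors every node must hold locally; any vector held by more than one node is a replica that will drift until synchronization.

```python
from collections import defaultdict

def grid_assign(u, i, num_nodes):
    """Toy placeholder rule mapping a rating (u, i) to a node."""
    return (u + i) % num_nodes

def replica_sets(ratings, num_nodes):
    """For each node, collect the user and item vectors it needs locally."""
    users = defaultdict(set)   # node -> user ids it must hold
    items = defaultdict(set)   # node -> item ids it must hold
    for u, i, _ in ratings:
        n = grid_assign(u, i, num_nodes)
        users[n].add(u)
        items[n].add(i)
    return users, items

ratings = [(0, 1, 4.0), (0, 2, 3.0), (1, 2, 5.0), (2, 0, 2.0)]
users, items = replica_sets(ratings, num_nodes=2)
# a user or item id appearing under several nodes is a replicated vector
```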
  5. Bulk Synchronous Processing Model
     [Figure: the bulk synchronous processing model - rounds of local computation followed by a communication phase and a barrier synchronization.]
  6. Challenges 1/2
     [Figure: the rating matrix R split into blocks R1-R4 across four computing nodes.]
     1. Load balance: ensure that computing nodes are fed with the same load.
     2. Minimize communication: minimize vector replicas.
  7. Challenges 2/2
     3. Tune the synchronization frequency among computing nodes.
     • Current implementations synchronize vector copies either:
       - continuously during the epoch (waste of resources); or
       - once after every epoch (slow convergence).
     • Epoch = a single iteration over the ratings.
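A single-process simulation sketch of the scheme in slides 4 to 7 (an assumed structure, not the authors' implementation): each node runs SGD over its own block of ratings, and the replicated vectors are reconciled by averaging at f evenly spaced synchronization points per epoch.

```python
import numpy as np

def distributed_epoch(node_ratings, P_reps, Q_reps, f=4, lr=0.01, reg=0.05):
    """One epoch of bulk-synchronous distributed SGD, simulated serially.

    node_ratings: one list of (user, item, rating) triples per node.
    P_reps, Q_reps: one dict per node mapping user/item id -> local vector.
    f: number of synchronization steps during the epoch.
    """
    chunks = [np.array_split(np.arange(len(r)), f) for r in node_ratings]
    for step in range(f):
        # local computation phase: each node updates only its own replicas
        for n, ratings in enumerate(node_ratings):
            for idx in chunks[n][step]:
                u, i, r = ratings[idx]
                pu, qi = P_reps[n][u], Q_reps[n][i]
                err = r - pu @ qi
                P_reps[n][u] = pu + lr * (err * qi - reg * pu)
                Q_reps[n][i] = qi + lr * (err * pu - reg * qi)
        # bulk synchronization phase: average every replica across its holders
        for reps in (P_reps, Q_reps):
            for key in set().union(*(d.keys() for d in reps)):
                holders = [n for n, d in enumerate(reps) if key in d]
                avg = np.mean([reps[n][key] for n in holders], axis=0)
                for n in holders:
                    reps[n][key] = avg
    return P_reps, Q_reps
```

Setting f = 1 recovers one synchronization per epoch, while a large f approaches continuous synchronization, which is the tradeoff the slide describes.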
  8. Contributions
     ✓ We mitigate the load imbalance by proposing an input slicing solution based on graph partitioning algorithms;
     ✓ we show how to reduce the amount of shared data by properly leveraging known characteristics of the input dataset (its bipartite power-law nature);
     ✓ we show how to leverage the tradeoff between communication cost and algorithm convergence rate by tuning the frequency of the bulk synchronization phase during the computation.
  9. Graph representation
     • The rating matrix describes a bipartite graph.
     [Figure: the non-zero entries of the rating matrix R viewed as edges of a bipartite user-item graph.]
     • Real data: skewed power-law degree distribution.
  10. Input partitioner
      • Vertex-cut performs better than edge-cut on power-law graphs.
      • Assumption: the input data does not fit in main memory.
      • Streaming algorithm.
      • Balanced k-way vertex-cut graph partitioning:
        - minimize replicas;
        - balance the edge load.
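Both objectives on slide 10 can be measured directly for any edge placement. Here is a small sketch (helper names are mine) computing the replication factor and the relative standard deviation of the per-partition edge load, the two metrics plotted later on slide 13.

```python
import statistics
from collections import defaultdict

def partition_quality(edge_assignment):
    """edge_assignment: iterable of ((user, item), partition_id) pairs.

    Returns (replication_factor, load_rsd_percent):
    - replication factor: average number of partitions holding a replica
      of a vertex (user or item);
    - load RSD: relative standard deviation of edges per partition, in %.
    """
    vertex_parts = defaultdict(set)   # vertex -> partitions with a replica
    load = defaultdict(int)           # partition -> number of edges
    for (u, i), p in edge_assignment:
        vertex_parts[("u", u)].add(p)
        vertex_parts[("i", i)].add(p)
        load[p] += 1
    rf = sum(len(s) for s in vertex_parts.values()) / len(vertex_parts)
    loads = list(load.values())
    rsd = 100 * statistics.pstdev(loads) / statistics.mean(loads)
    return rf, rsd

assignment = [((0, 1), 0), ((0, 2), 1), ((1, 2), 1), ((2, 0), 0)]
print(partition_quality(assignment))   # -> approximately (1.17, 0.0)
```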
  11. Balanced Vertex-Cut Streaming Algorithms
      • Hashing: pseudo-random edge assignment;
      • Grid: shuffle and split the rating matrix into identical blocks;
      • Greedy: [Gonzalez et al. 2012] and [Ahmed et al. 2013].
      Bipartite-Aware Greedy Algorithm
      • Real-world bipartite graphs are often significantly skewed: one of the two sets is much bigger than the other.
      • By perfectly splitting the bigger set it is possible to achieve a smaller replication factor.
      • GIP (Greedy - Item Partitioned)
      • GUP (Greedy - User Partitioned)
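A hedged sketch of a streaming greedy vertex-cut placement in the spirit of [Gonzalez et al. 2012]; the exact tie-breaking of the paper's greedy, GIP and GUP variants differs, so treat this as an assumption-laden illustration rather than the authors' algorithm.

```python
from collections import defaultdict

def greedy_vertex_cut(edge_stream, k):
    """Assign each rating (user, item) to one of k partitions.

    Prefer a partition that already holds replicas of both endpoints,
    then one holding either endpoint, then any partition; ties are
    always broken toward the smallest current edge load.
    """
    user_parts = defaultdict(set)   # user -> partitions with a replica
    item_parts = defaultdict(set)   # item -> partitions with a replica
    load = [0] * k
    assignment = []
    for u, i in edge_stream:
        both = user_parts[u] & item_parts[i]
        either = user_parts[u] | item_parts[i]
        candidates = both or either or set(range(k))
        p = min(candidates, key=lambda c: load[c])
        user_parts[u].add(p)
        item_parts[i].add(p)
        load[p] += 1
        assignment.append(((u, i), p))
    return assignment

# GIP/GUP additionally keep one of the two vertex sets (items or users,
# respectively) evenly split across partitions; see the paper for the rule.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 1)]
print(greedy_vertex_cut(edges, k=2))
```

Its output can be fed straight into partition_quality above to compare replication factor and load balance against hashing or grid placements.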
  12. Evaluation: The Data Sets
      [Figure: log-log degree distributions (number of vertices vs. degree) of items and users, for MovieLens 10M and Netflix.]
  13. Experiments: Partitioning quality
      [Figure: replication factor and load standard deviation (%) vs. number of partitions (8 to 256) for hashing, grid, greedy, GIP and GUP, on MovieLens and Netflix.]
      • RF = Replication Factor
      • RSD = Relative Standard Deviation (of the load)
  14. Synchronization frequency
      [Figure: loss vs. epoch on Netflix for grid, greedy, GUP and a centralized baseline, with f = 1 and f = 100.]
      • f = synchronization frequency parameter: the number of synchronization steps during an epoch.
      • Tradeoff between communication cost and convergence rate.
  15. Evaluation: SSE and Communication cost
      [Figure: SSE vs. synchronization frequency (grid, greedy, GUP at |C| = 8 and |C| = 32) and communication cost vs. synchronization frequency (hashing, grid, greedy, GUP), on MovieLens and Netflix.]
      • SSE = sum of squared errors between the loss curves of the ASGD variants and of centralized SGD.
      • CC = communication cost.
  16. Communication cost
      • T = the training set
      • U = the set of users
      • I = the set of items
      • V = U ∪ I
      • C = the set of processing nodes
      • RF = replication factor; RF_U = users' RF; RF_I = items' RF
      RF = (|U| · RF_U + |I| · RF_I) / |V|
      f = 1         →  CC ≈ 2 (|U| · RF_U + |I| · RF_I) = 2 |V| · RF
      f = |T| / |C| →  CC ≈ |T| (RF_U + RF_I)
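A back-of-the-envelope reading of the two extremes of the formula; all the sizes and replication factors below are made-up illustrative values, not measurements from the paper.

```python
# illustrative values only, not numbers from the paper
U, I, T = 480_000, 17_000, 100_000_000   # |U| users, |I| items, |T| ratings
C = 32                                   # |C| processing nodes
RF_U, RF_I = 2.0, 20.0                   # assumed average replication factors

V = U + I
RF = (U * RF_U + I * RF_I) / V                  # overall replication factor

cc_once_per_epoch = 2 * (U * RF_U + I * RF_I)   # f = 1
cc_every_rating = T * (RF_U + RF_I)             # f = |T| / |C|

print(f"RF = {RF:.2f}")
print(f"CC at f = 1:       {cc_once_per_epoch:.3g} vector exchanges per epoch")
print(f"CC at f = |T|/|C|: {cc_every_rating:.3g} vector exchanges per epoch")
```

Even with these rough numbers the gap between the two extremes spans orders of magnitude, which is why an intermediate f is worth tuning.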
  17. Conclusions
      Three distinct contributions aimed at improving the efficiency and scalability of ASGD:
      1. we proposed an input slicing solution based on a graph partitioning approach that mitigates the load imbalance among SGD instances (i.e. better scalability);
      2. we further reduce the amount of shared data by exploiting specific characteristics of the training dataset, which lowers the communication cost during the algorithm's execution (i.e. better efficiency);
      3. we introduced a synchronization frequency parameter driving a tradeoff that can be accurately leveraged to further improve the algorithm's efficiency.
  18. Thank you! Questions?
      Fabio Petroni - Rome, Italy
      Current position: PhD Student in Computer Engineering, Sapienza University of Rome
      Research Areas: Recommendation Systems, Collaborative Filtering, Distributed Systems
      petroni@dis.uniroma1.it
