
Comparing Pregel-Related Systems


Comparing open source implementations of Pregel and related systems.
Installed Hadoop and the Pregel-related systems.
Worked with datasets of varying sizes, from very small to very large; the large datasets have around 30 million vertices and 50 million edges.
Ran on 1-, 4-, and 8-node Amazon EC2 clusters.
Four algorithms: PageRank, shortest path, k-means, collaborative filtering.

Published in: Engineering


  1. Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems
     December 2, 2013
     Joshua Woo, Prashant Raghav, Vishnu Prathish
     David R. Cheriton School of Computer Science, University of Waterloo
  2. Outline
     ● Motivation
     ● Our Project
     ● Setup
     ● Preliminary Results
     ● Preliminary Analysis
     ● In-Progress
     ● References
  3. Motivation
     Recall: Pregel
     ● Large-scale graph processing system
     ● Fault-tolerant framework for graph algorithms
     ● MapReduce for graph operations?
     ● Vertex-centric model (“think like a vertex”)
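To make the vertex-centric model concrete, here is a minimal single-process sketch of a superstep loop. It is not code from Pregel or from any system in this comparison: the names `run_supersteps` and `propagate_max`, the message format, and the toy graph are all assumptions chosen for illustration. Each vertex sees only its own value and the messages sent to it in the previous superstep, and the run ends when every vertex has voted to halt and no messages are in flight.

```python
def run_supersteps(graph, values, compute, max_supersteps=100):
    """Toy synchronous superstep loop (illustrative, single process).

    graph:   dict mapping each vertex to a list of out-neighbours
    values:  dict mapping each vertex to its current value (updated in place)
    compute: hypothetical per-vertex function returning
             (new_value, [(target, message), ...], vote_to_halt)
    """
    inbox = {v: [] for v in graph}
    active = set(graph)
    superstep = 0
    while superstep < max_supersteps and (active or any(inbox.values())):
        outbox = {v: [] for v in graph}
        next_active = set()
        for v in graph:
            # A halted vertex is skipped unless a message reactivates it.
            if v not in active and not inbox[v]:
                continue
            new_value, messages, halt = compute(superstep, values[v], inbox[v], graph[v])
            values[v] = new_value
            for target, message in messages:
                outbox[target].append(message)
            if not halt:
                next_active.add(v)
        # Messages sent in this superstep are delivered in the next one.
        inbox = outbox
        active = next_active
        superstep += 1
    return values, superstep


def propagate_max(superstep, value, messages, neighbours):
    """Example vertex program: spread the largest value through the graph."""
    new_value = max([value] + messages)
    changed = superstep == 0 or new_value > value
    outgoing = [(n, new_value) for n in neighbours] if changed else []
    return new_value, outgoing, True  # always vote to halt


# Made-up three-vertex ring, used only to exercise the sketch.
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
final_values, supersteps = run_supersteps(ring, {"a": 3, "b": 6, "c": 2}, propagate_max)
```

On this ring every vertex ends with the value 6. Real systems distribute the vertices across workers and add fault tolerance; the sketch only shows the control flow of the model.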
  4. Motivation
     ● Pregel is proprietary
     ● Many open source graph processing systems
       ○ Pregel clones
       ○ Pregel-inspired
       ○ BSP
  5. Motivation
     ● Apache Hama
     ● Signal/Collect
     ● Apache Giraph
     ● GPS
     ● GraphLab
     ● Phoebus
     ● GoldenOrb
     ● HipG
     ● Mizan
  6. Motivation
     System           Impl. Language   Type
     Apache Hama      Java             Pure BSP framework
     Signal/Collect   Scala            Pregel inspired
     Apache Giraph    Java             Pregel clone
     GPS              Java             Advanced Pregel clone
     GraphLab         C++              Pregel inspired
     Phoebus          Erlang           Pregel clone
     GoldenOrb        Java             Pregel clone
     HipG             Java             Advanced Pregel clone
     Mizan            C++              Advanced Pregel clone
  7. Motivation
     ● How do these systems compare?
       ○ In terms of performance (runtime)?
       ○ In terms of memory footprint?
       ○ In terms of network utilization (num. messages)?
       ○ Variables:
         ■ Algorithm
         ■ Graph size (number of vertices)
         ■ Cluster size
  8. Our Project
     ● Compare at least 3 systems
       ○ Apache Hama - general BSP framework
       ○ Apache Giraph - Hadoop Map-only job, Facebook
       ○ GPS - +dynamic repartitioning, +multi vertex-centric
       ○ Signal/Collect - +edges, +async computations
       ○ GraphLab
       ○ Mizan
  9. Our Project
     ● Measure the runtime of at least two algorithms on each system
       ○ PageRank
         ■ Fixed number of supersteps = 30
       ○ Single Source Shortest Path (SSSP)
       ○ k-means clustering
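As an illustration of what "PageRank with a fixed number of supersteps" means, the following sketch iterates the rank update exactly 30 times on an in-memory graph. It is not the implementation benchmarked on Hama, Giraph, or GPS. The damping factor of 0.85 and the uniform initial rank are conventional choices assumed here, not values stated in the slides, and the function name `pagerank` is hypothetical.

```python
def pagerank(graph, supersteps=30, damping=0.85):
    """Fixed-iteration PageRank on a dict of vertex -> list of out-neighbours.

    damping=0.85 is an assumed conventional value. Vertices without
    out-edges simply send nothing, so their rank mass is not redistributed.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        incoming = {v: 0.0 for v in graph}
        for v, neighbours in graph.items():
            if neighbours:
                # Each vertex splits its current rank evenly over its out-edges.
                share = rank[v] / len(neighbours)
                for w in neighbours:
                    incoming[w] += share
        rank = {v: (1.0 - damping) / n + damping * incoming[v] for v in graph}
    return rank


# Made-up three-vertex cycle: by symmetry each vertex keeps rank 1/3.
cycle_ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Fixing the superstep count, rather than iterating to convergence, makes every system do the same amount of work on a given dataset, which is what makes the runtimes comparable.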
  10. Setup
      ● Experiments on AWS
        ○ Ubuntu 12.04 m1.medium EC2 instances
          ■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance
          ■ 8 GiB EBS volume per instance
        ○ Cluster sizes:
          ■ Single-node cluster
          ■ 4-node cluster
          ■ 8-node cluster
  11. Setup
      ● Experiments on AWS
        ○ 5 runs per dataset per algorithm per cluster
          ■ 35 runs per algorithm per cluster
          ■ 70 runs per cluster
          ■ 140 runs in total (single-node, 4-node)
            ● TODO: another 70 runs (8-node)
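The run counts on this slide follow from multiplying out the experiment grid. The short check below reproduces them. The figure of two algorithms is inferred from 70 / 35 and from the two algorithms with reported results (SSSP and PageRank), and the variable names are illustrative.

```python
runs_per_configuration = 5   # runs per dataset per algorithm per cluster
datasets = 7
algorithms = 2               # inferred: SSSP and PageRank
completed_clusters = 2       # single-node and 4-node

runs_per_algorithm_per_cluster = runs_per_configuration * datasets
runs_per_cluster = runs_per_algorithm_per_cluster * algorithms
runs_completed = runs_per_cluster * completed_clusters
runs_remaining = runs_per_cluster  # the 8-node cluster still to do
```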
  12. Setup
      ● Dataset
        ○ 7 datasets
          ■ tinyEWD: 8 vertices, 15 edges
          ■ mediumEWD: 250 vertices, 2,546 edges
          ■ 1000EWD: 1,000 vertices, 16,866 edges
          ■ rome99: 3,353 vertices, 8,870 edges
          ■ 10000EWD: 10,000 vertices, 16,866 edges
          ■ NYC: 264,346 vertices, 733,846 edges
          ■ largeEWD: 1,000,000 vertices, 15,172,126 edges
        ○ Source: http://algs4.cs.princeton.edu/44sp/
  13. Setup
      ● Systems
        ○ Hama
          ■ Hadoop 1.03.0
          ■ Hama 0.6.3
        ○ Giraph
          ■ Hadoop 0.20.203rc1
          ■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a)
        ○ GPS
          ■ Hadoop 0.20.203rc1
          ■ GPS (trunk@Revision 112)
  14. Setup
      ● Input Graph
        ○ Source files converted into format suitable for each system
          ■ Time for this conversion excluded from results:
            ● Conversion done before algorithms are run (pre-processing?)
            ● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
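A conversion step of this kind might look like the sketch below. It assumes the edge-weighted digraph text layout used by the algs4 `*EWD.txt` files (vertex count, then edge count, then one `v w weight` triple per edge). The output layout, one tab-separated adjacency line per vertex, is purely hypothetical: the slides do not say which input format each system was given, and `ewd_to_adjacency_lines` is an illustrative name.

```python
from collections import defaultdict


def ewd_to_adjacency_lines(text):
    """Convert edge-list text into one adjacency line per vertex.

    Assumed input: "V", then "E", then E triples "v w weight".
    Hypothetical output line: "<vertex>\t<target>:<weight> <target>:<weight> ..."
    """
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    edge_tokens = tokens[2:2 + 3 * num_edges]
    adjacency = defaultdict(list)
    for i in range(0, len(edge_tokens), 3):
        source, target, weight = edge_tokens[i:i + 3]
        # Weights stay as text so the conversion never alters their precision.
        adjacency[int(source)].append(f"{int(target)}:{weight}")
    # Emit every vertex, including those with no outgoing edges.
    return [f"{v}\t{' '.join(adjacency[v])}" for v in range(num_vertices)]


# Made-up sample input, not one of the benchmark datasets.
sample = "3\n2\n0 1 0.5\n1 2 0.25\n"
converted = ewd_to_adjacency_lines(sample)
```

Because the conversion is a single pass over the edge list and runs before the algorithm starts, it is reasonable to time it separately from the graph computation, as the slide describes.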
  15. Preliminary Results
      Average SSSP runtime on 4-node cluster (in seconds)
      Dataset      Hama       Giraph   GPS
      tinyEWD      14.17      41.60    14.40
      mediumEWD    16.36      44.00    36.00
      1000EWD      18.06      48.80    46.60
      rome99       22.95      66.00    50.00
      10000EWD     25.32      67.40    55.00
      NYC          165.01     267.00   310.00
      largeEWD     6,109.20   602.80   618.70
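One way to read the SSSP table is to ask, per dataset, which system was fastest and how far behind the slowest one was. The sketch below does that with the runtimes copied from the slide; the helper name `fastest_and_ratio` is illustrative and the calculation adds no data of its own.

```python
# Average SSSP runtimes in seconds on the 4-node cluster, from the slide.
sssp_seconds = {
    "tinyEWD":   {"Hama": 14.17,   "Giraph": 41.60,  "GPS": 14.40},
    "mediumEWD": {"Hama": 16.36,   "Giraph": 44.00,  "GPS": 36.00},
    "1000EWD":   {"Hama": 18.06,   "Giraph": 48.80,  "GPS": 46.60},
    "rome99":    {"Hama": 22.95,   "Giraph": 66.00,  "GPS": 50.00},
    "10000EWD":  {"Hama": 25.32,   "Giraph": 67.40,  "GPS": 55.00},
    "NYC":       {"Hama": 165.01,  "Giraph": 267.00, "GPS": 310.00},
    "largeEWD":  {"Hama": 6109.20, "Giraph": 602.80, "GPS": 618.70},
}


def fastest_and_ratio(row):
    """Return (fastest system, slowest system, slowest / fastest runtime)."""
    fastest = min(row, key=row.get)
    slowest = max(row, key=row.get)
    return fastest, slowest, row[slowest] / row[fastest]


summary = {name: fastest_and_ratio(row) for name, row in sssp_seconds.items()}
```

Hama is fastest on every dataset up to NYC, but on largeEWD it is roughly ten times slower than Giraph, which matches the scaling observation made later in the deck.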
  16. Preliminary Results
      SSSP runtime vs. graph size (num. vertices) (chart)
  17. Preliminary Results
      Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)
      Dataset      Hama       Giraph     GPS
      tinyEWD      29.36      49.40      58.57
      mediumEWD    30.26      53.40      60.42
      1000EWD      37.86      54.60      61.03
      rome99       29.35      56.20      61.80
      10000EWD     302.33     61.80      64.80
      NYC          1,001.24   134.40     68.69
      largeEWD     Failed     2,100.00   1,213.56
  18. Preliminary Results
      PageRank runtime vs. graph size (num. vertices) (chart)
  19. Preliminary Analysis
      ● A point of resource crunch
        ○ No significant change in performance until a point
      ● Hama does not scale well (vertices ~10^4)
      ● Giraph and GPS scale better
      ● In general, PageRank runtime > SSSP runtime
      ● GPS input reader does not guarantee true partitioning for large datasets
      ● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
  20. In-Progress
      ● Output validation
      ● Memory footprint
      ● Network utilization (num. messages)
      ● GraphLab and Signal/Collect
      ● Green-Marl?
        ○ (DSL) → [Compiler] → (Giraph, GPS)
  21. Questions?
  22. Extras
  23. Preliminary Results
      Number of supersteps for SSSP
      Dataset      Hama   Giraph   GPS
      tinyEWD      10     7        7
      mediumEWD    16     13       18
      1000EWD      27     25       23
      rome99       105    102      18
      10000EWD     85     80       64
      NYC          671    905      438
      largeEWD     806    670      730
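Unlike PageRank, SSSP has no fixed superstep count: it runs until no vertex receives a shorter distance. The sketch below shows one way to express that in vertex-centric style and to count supersteps. It is a single-process illustration, not any of the benchmarked implementations, and its counting convention (one superstep per round of message delivery) is an assumption, so its counts are not expected to match any column in the table above.

```python
import math


def sssp(graph, source):
    """Vertex-centric single-source shortest paths on a toy in-memory graph.

    graph: dict mapping each vertex to a list of (neighbour, weight) pairs.
    Returns (distances, number of supersteps executed).
    """
    dist = {v: math.inf for v in graph}
    inbox = {source: [0.0]}  # the source starts with a zero-distance message
    supersteps = 0
    while inbox:
        outbox = {}
        for v, candidates in inbox.items():
            best = min(candidates)
            if best < dist[v]:
                dist[v] = best
                # Only an improved vertex tells its neighbours about it.
                for w, weight in graph[v]:
                    outbox.setdefault(w, []).append(best + weight)
        inbox = outbox
        supersteps += 1
    return dist, supersteps


# Made-up three-vertex graph: the two-hop path 0 -> 1 -> 2 beats the direct edge.
toy = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.0)], 2: []}
distances, steps = sssp(toy, 0)
```

In this formulation the number of supersteps depends on how many hops improvements need to travel, so it is a property of the graph's structure and weights rather than of its vertex count alone.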
  24. Preliminary Results
      Number of supersteps for SSSP (chart)
  25. Really, really Preliminary
      PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
      Dataset      Native     Green-Marl generated
      tinyEWD      58.57      60.20
      mediumEWD    60.42      60.11
      1000EWD      61.03      62.30
      rome99       61.80      62.32
      10000EWD     64.80      65.78
      NYC          68.69      71.34
      largeEWD     1,213.56   -
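The native and Green-Marl-generated runtimes can be compared as a relative difference per dataset. The sketch below uses only the numbers from the slide; `None` stands in for the missing largeEWD measurement, and the name `relative_overhead` is illustrative.

```python
# PageRank runtimes on GPS in seconds: (native, Green-Marl generated).
gps_pagerank_seconds = {
    "tinyEWD":   (58.57, 60.20),
    "mediumEWD": (60.42, 60.11),
    "1000EWD":   (61.03, 62.30),
    "rome99":    (61.80, 62.32),
    "10000EWD":  (64.80, 65.78),
    "NYC":       (68.69, 71.34),
    "largeEWD":  (1213.56, None),  # no generated-code result on the slide
}


def relative_overhead(native, generated):
    """Fractional slowdown of generated code relative to native code."""
    return (generated - native) / native


overheads = {
    name: relative_overhead(native, generated)
    for name, (native, generated) in gps_pagerank_seconds.items()
    if generated is not None
}
```

On these six datasets the generated code is within a few percent of the native code, and on mediumEWD it is marginally faster. With single averaged figures and no variance reported, differences this small should be read cautiously.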
  26. Really, really Preliminary
      PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated (chart)
  27. References
      ● Our Project Proposal
      ● http://algs4.cs.princeton.edu/44sp/
      ● https://github.com/apache/hadoop-common
      ● https://github.com/apache/giraph
      ● https://subversion.assembla.com/svn/phd-projects/gps/trunk/
      ● http://ppl.stanford.edu/main/green_marl.html
