Comparison and Evaluation of Open Source 
Implementations of Pregel and Related Systems 
December 2, 2013 
Joshua Woo, Prashant Raghav, Vishnu Prathish 
David R. Cheriton School of Computer Science 
University of Waterloo
Outline 
● Motivation 
● Our Project 
● Setup 
● Preliminary Results 
● Preliminary Analysis 
● In-Progress 
● References
Motivation 
Recall: Pregel 
● Large-scale graph processing system 
● Fault-tolerant framework for graph 
algorithms 
● MapReduce for graph operations? 
● Vertex-centric model (“think like a vertex”)
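To make "think like a vertex" concrete, here is a minimal single-machine sketch of a superstep loop. It is not Pregel's or any listed system's API: run_supersteps, max_value and the dict-based graph are names assumed for illustration. The vertex program is the usual maximum-value propagation example, where each vertex keeps the largest value it has seen and goes quiet once nothing changes.

```python
from collections import defaultdict


def run_supersteps(graph, values, compute, max_supersteps=100):
    """Run a vertex program in bulk-synchronous supersteps (single process).

    graph:   dict mapping every vertex to a list of its out-neighbours
    values:  dict mapping every vertex to its initial value (updated in place)
    compute: function(vertex, value, messages, superstep)
             returning (new_value, value_to_send or None to halt)
    """
    inbox = defaultdict(list)
    active = set(graph)  # every vertex is active in superstep 0
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = defaultdict(list)
        for v in active:
            new_value, outgoing = compute(v, values[v], inbox.get(v, []), superstep)
            values[v] = new_value
            if outgoing is not None:
                for w in graph[v]:
                    outbox[w].append(outgoing)
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = outbox
        # A halted vertex is reactivated only when it receives a message.
        active = set(outbox)
        superstep += 1
    return values, superstep


def max_value(vertex, value, messages, superstep):
    """Hypothetical vertex program: propagate the largest value in the graph."""
    best = max([value] + messages)
    if superstep == 0 or best > value:
        return best, best  # value changed (or first step): tell the neighbours
    return value, None     # nothing new: vote to halt
```

Usage: run_supersteps({0: [1], 1: [2], 2: [0]}, {0: 3, 1: 6, 2: 2}, max_value) leaves every vertex holding 6. A real system distributes the vertices across workers and adds fault tolerance, which this sketch omits.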
Motivation 
● Pregel is proprietary 
● Many open source graph processing 
systems 
○ Pregel clones 
○ Pregel-inspired 
○ BSP
Motivation 
● Apache Hama 
● Signal/Collect 
● Apache Giraph 
● GPS 
● GraphLab 
● Phoebus 
● GoldenOrb 
● HipG 
● Mizan
Motivation 
System          Impl. Language  Type
Apache Hama     Java            Pure BSP framework
Signal/Collect  Scala           Pregel inspired
Apache Giraph   Java            Pregel clone
GPS             Java            Advanced Pregel clone
GraphLab        C++             Pregel inspired
Phoebus         Erlang          Pregel clone
GoldenOrb       Java            Pregel clone
HipG            Java            Advanced Pregel clone
Mizan           C++             Advanced Pregel clone
Motivation 
● How do these systems compare? 
○ In terms of performance (runtime)? 
○ In terms of memory footprint? 
○ In terms of network utilization (num. messages)? 
○ Variables: 
■ Algorithm 
■ Graph size (number of vertices) 
■ Cluster size
Our Project 
● Compare at least 3 systems 
○ Apache Hama - general BSP framework 
○ Apache Giraph - runs as a Hadoop map-only job, used at Facebook 
○ GPS - +dynamic repartitioning, +multi vertex-centric 
○ Signal/Collect - +edges, +async computations 
○ GraphLab 
○ Mizan
Our Project 
● Measure the runtime of at least two 
algorithms on each system 
○ PageRank 
■ Fixed number of supersteps = 30 
○ Single Source Shortest Path (SSSP) 
○ k-means clustering
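As a rough illustration of the first two algorithms, the sketch below runs PageRank for a fixed number of synchronous rounds and SSSP until no messages remain. It is a single-machine approximation, not the code run on Hama, Giraph or GPS. The damping factor of 0.85, the function names and the dict-based graph layout are assumptions, and each PageRank loop iteration stands in for one superstep.

```python
import math


def pagerank(graph, supersteps=30, damping=0.85):
    """PageRank with a fixed number of rounds (damping=0.85 is an assumption).

    graph: dict mapping every vertex to a list of its out-neighbours.
    Rank held by vertices without out-edges is simply dropped in this sketch.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(supersteps):
        incoming = {v: 0.0 for v in graph}
        for v, neighbours in graph.items():
            if neighbours:
                share = rank[v] / len(neighbours)
                for w in neighbours:
                    incoming[w] += share
        rank = {v: (1.0 - damping) / n + damping * incoming[v] for v in graph}
    return rank


def sssp(weighted_graph, source):
    """Message-driven single-source shortest paths (non-negative weights).

    weighted_graph: dict mapping every vertex to a list of (target, weight).
    Returns the distances and the number of supersteps until no vertex
    received a message, which depends on the graph's structure.
    """
    dist = {v: math.inf for v in weighted_graph}
    messages = {source: [0.0]}
    supersteps = 0
    while messages:
        next_messages = {}
        for v, received in messages.items():
            candidate = min(received)
            if candidate < dist[v]:
                dist[v] = candidate
                for w, weight in weighted_graph[v]:
                    next_messages.setdefault(w, []).append(candidate + weight)
        messages = next_messages
        supersteps += 1
    return dist, supersteps
```

The contrast matters for the measurements: PageRank does the same amount of work for 30 rounds on every graph, while the SSSP superstep count is determined by the input, so it varies from dataset to dataset.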
Setup 
● Experiments on AWS 
○ Ubuntu 12.04 m1.medium EC2 instances 
■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network 
performance 
■ 8 GiB EBS volume per instance 
○ Cluster sizes: 
■ Single-node cluster 
■ 4-node cluster 
■ 8-node cluster
Setup 
● Experiments on AWS 
○ 5 runs per dataset per algorithm per cluster 
■ 35 runs per algorithm per cluster 
■ 70 runs per cluster 
■ 140 runs in total (single-node, 4-node) 
● TODO: another 70 runs (8-node)
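The run counts above follow from simple multiplication. The sketch below reproduces them. The constant names are made up for illustration, and the figure of two algorithms is inferred from the slide's own 35 → 70 step.

```python
RUNS_PER_DATASET = 5   # repetitions of each experiment
NUM_DATASETS = 7       # tinyEWD ... largeEWD
NUM_ALGORITHMS = 2     # inferred from the slide: 35 runs x 2 = 70
CLUSTERS_DONE = 2      # single-node and 4-node
CLUSTERS_TODO = 1      # 8-node


def run_counts():
    """Return the run totals listed on the slide as a dict."""
    per_algorithm = RUNS_PER_DATASET * NUM_DATASETS
    per_cluster = per_algorithm * NUM_ALGORITHMS
    return {
        "per_algorithm_per_cluster": per_algorithm,
        "per_cluster": per_cluster,
        "done": per_cluster * CLUSTERS_DONE,
        "todo": per_cluster * CLUSTERS_TODO,
    }
```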
Setup 
● Dataset 
○ 7 datasets 
■ tinyEWD: 8 vertices 15 edges 
■ mediumEWD: 250 vertices 2,546 edges 
■ 1000EWD: 1,000 vertices 16,866 edges 
■ rome99: 3,353 vertices 8,870 edges 
■ 10000EWD: 10,000 vertices 123,462 edges 
■ NYC: 264,346 vertices 733,846 edges 
■ largeEWD: 1,000,000 vertices 15,172,126 edges 
○ Source: http://algs4.cs.princeton.edu/44sp/
Setup 
● Systems 
○ Hama 
■ Hadoop 1.03.0 
■ Hama 0.6.3 
○ Giraph 
■ Hadoop 0.20.203rc1 
■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a) 
○ GPS 
■ Hadoop 0.20.203rc1 
■ GPS (trunk@Revision 112)
Setup 
● Input Graph 
○ Source files converted into format suitable for each 
system 
■ Time for this conversion excluded from results: 
● Conversion done before algorithms are run (pre-processing?) 
● Negligible for largeEWD (1,000,000 vertices, 15,172,126 
edges)
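A sketch of what such a conversion step might look like. It assumes the source files use the edge-weighted digraph layout of the Princeton data (vertex count, edge count, then one "from to weight" triple per edge). The output layout, one JSON adjacency list per vertex, is only an illustration, since the slides do not state the exact input format each system was given. parse_ewd and to_adjacency_lines are hypothetical helper names.

```python
import json


def parse_ewd(text):
    """Parse an edge-weighted digraph: V, E, then E triples 'v w weight'."""
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    body = tokens[2:]
    if len(body) != 3 * num_edges:
        raise ValueError("edge count does not match the file header")
    # Create every vertex up front so vertices without out-edges are kept.
    adjacency = {v: [] for v in range(num_vertices)}
    for i in range(0, len(body), 3):
        v, w, weight = int(body[i]), int(body[i + 1]), float(body[i + 2])
        adjacency[v].append((w, weight))
    return adjacency


def to_adjacency_lines(adjacency, initial_value=0.0):
    """Emit one line per vertex: [id, value, [[target, weight], ...]].

    This line layout is an assumed example, not a format taken from the slides.
    """
    return [
        json.dumps([v, initial_value, [[w, wt] for w, wt in edges]])
        for v, edges in sorted(adjacency.items())
    ]
```

Usage: to_adjacency_lines(parse_ewd(open("tinyEWD.txt").read())) yields one line per vertex, ready to be written out and loaded into HDFS.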
Preliminary Results 
Average SSSP runtime on 4-node cluster (in seconds) 
Dataset      Hama       Giraph    GPS
tinyEWD      14.17      41.60     14.40
mediumEWD    16.36      44.00     36.00
1000EWD      18.06      48.80     46.60
rome99       22.95      66.00     50.00
10000EWD     25.32      67.40     55.00
NYC          165.01     267.00    310.00
largeEWD     6,109.20   602.80    618.70
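One way to read the SSSP table is to normalise each row by its fastest system. The sketch below does that with the averages copied from the slide. The helper names are assumptions and nothing here is a new measurement.

```python
# Average SSSP runtime on the 4-node cluster, in seconds (from the slide).
SSSP_RUNTIME_S = {
    "tinyEWD":   {"Hama": 14.17,   "Giraph": 41.60,  "GPS": 14.40},
    "mediumEWD": {"Hama": 16.36,   "Giraph": 44.00,  "GPS": 36.00},
    "1000EWD":   {"Hama": 18.06,   "Giraph": 48.80,  "GPS": 46.60},
    "rome99":    {"Hama": 22.95,   "Giraph": 66.00,  "GPS": 50.00},
    "10000EWD":  {"Hama": 25.32,   "Giraph": 67.40,  "GPS": 55.00},
    "NYC":       {"Hama": 165.01,  "Giraph": 267.00, "GPS": 310.00},
    "largeEWD":  {"Hama": 6109.20, "Giraph": 602.80, "GPS": 618.70},
}


def fastest(row):
    """Name of the system with the lowest runtime in one table row."""
    return min(row, key=row.get)


def slowdown_vs_fastest(row):
    """Each system's runtime divided by the best runtime in that row."""
    best = min(row.values())
    return {system: runtime / best for system, runtime in row.items()}
```

On these numbers Hama is the fastest system on the six smaller datasets, while on largeEWD Giraph is fastest and Hama is roughly ten times slower.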
Preliminary Results 
SSSP runtime vs. graph size (num. vertices)
Preliminary Results 
Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds) 
Dataset      Hama       Giraph     GPS
tinyEWD      29.36      49.40      58.57
mediumEWD    30.26      53.40      60.42
1000EWD      37.86      54.60      61.03
rome99       29.35      56.20      61.80
10000EWD     302.33     61.80      64.80
NYC          1,001.24   134.40     68.69
largeEWD     Failed     2,100.00   1,213.56
Preliminary Results 
PageRank runtime vs. graph size (num. vertices)
Preliminary Analysis 
● Resource-crunch point 
○ Runtime changes little as graphs grow, until a system hits a resource limit 
● Hama does not scale well beyond ~10^4 vertices 
● Giraph and GPS scale better 
● In general, PageRank runtime > SSSP runtime 
● GPS input reader does not guarantee true partitioning 
for large datasets 
● Which ‘knobs’ to keep constant? - Optimization vs. 
Comparability
In-Progress 
● Output validation 
● Memory footprint 
● Network utilization (num. messages) 
● GraphLab and Signal/Collect 
● Green-Marl? 
○ (DSL) → [Compiler] → (Giraph, GPS)
Questions?
Extras
Preliminary Results 
Number of supersteps for SSSP 
Dataset      Hama   Giraph   GPS
tinyEWD      10     7        7
mediumEWD    16     13       18
1000EWD      27     25       23
rome99       105    102      18
10000EWD     85     80       64
NYC          671    905      438
largeEWD     806    670      730
Preliminary Results 
Number of supersteps for SSSP
Really, really Preliminary 
PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated 
Dataset      Native     Green-Marl generated
tinyEWD      58.57      60.20
mediumEWD    60.42      60.11
1000EWD      61.03      62.30
rome99       61.80      62.32
10000EWD     64.80      65.78
NYC          68.69      71.34
largeEWD     1,213.56   -
Really, really Preliminary 
PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
References 
● Our Project Proposal 
● http://algs4.cs.princeton.edu/44sp/ 
● https://github.com/apache/hadoop-common 
● https://github.com/apache/giraph 
● https://subversion.assembla.com/svn/phd-projects/gps/trunk/ 
● http://ppl.stanford.edu/main/green_marl.html

More Related Content

What's hot

Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Databricks
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Ziemowit Jankowski
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Databricks
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
Sri Ambati
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
Josef Niedermeier
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Josef A. Habdank
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Alexey Zinoviev
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014soujavajug
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Databricks
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Spark Summit
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
Sigmoid
 
Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
Alexey Zinoviev
 
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
Avery Ching
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
Spark Summit
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
Apache Apex
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopDataWorks Summit
 

What's hot (20)

Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
Large-Scaled Telematics Analytics in Apache Spark with Wayne Zhang and Neil P...
 
Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes Case study- Real-time OLAP Cubes
Case study- Real-time OLAP Cubes
 
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ...
 
H2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas NykodymH2O World - GLM - Tomas Nykodym
H2O World - GLM - Tomas Nykodym
 
Giraph
GiraphGiraph
Giraph
 
NetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and VerticaNetFlow Data processing using Hadoop and Vertica
NetFlow Data processing using Hadoop and Vertica
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
Thorny path to the Large-Scale Graph Processing (Highload++, 2014)
 
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
Hadoop - Introduction to map reduce programming - Reunião 12/04/2014
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith Introduction to Spark R with R studio - Mr. Pragith
Introduction to Spark R with R studio - Mr. Pragith
 
Joker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data ScientistJoker'14 Java as a fundamental working tool of the Data Scientist
Joker'14 Java as a fundamental working tool of the Data Scientist
 
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
2014.02.13 (Strata) Graph Analysis with One Trillion Edges on Apache Giraph
 
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production ScaleGPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
GPU Support In Spark And GPU/CPU Mixed Resource Scheduling At Production Scale
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
SparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on HadoopSparkR: Enabling Interactive Data Science at Scale on Hadoop
SparkR: Enabling Interactive Data Science at Scale on Hadoop
 

Similar to Comparing pregel related systems

Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Databricks
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Ontico
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Kohei KaiGai
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
Martin Zapletal
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
samthemonad
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
Dori Waldman
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
Naukri.com
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
Nicolas Poggi
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data Platforms
INRIA-OAK
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
Data Con LA
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
Codemotion
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
Petr Zapletal
 
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
Red Hat Developers
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Modern Data Stack France
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
Codemotion Tel Aviv
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
Data Works MD
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profilepramodbiligiri
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
Lucian Neghina
 

Similar to Comparing pregel related systems (20)

Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU ClustersScalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
Scalable Acceleration of XGBoost Training on Apache Spark GPU Clusters
 
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
Thorny Path to the Large Scale Graph Processing, Алексей Зиновьев (Тамтэк)
 
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
Technology Updates of PG-Strom at Aug-2014 (PGUnconf@Tokyo)
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark[@NaukriEngineering] Apache Spark
[@NaukriEngineering] Apache Spark
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data Platforms
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
How To Get The Most Out Of Your Hibernate, JBoss EAP 7 Application (Ståle Ped...
 
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
Marcel Kornacker: Impala tech talk Tue Feb 26th 2013
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
RAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data ScienceRAPIDS – Open GPU-accelerated Data Science
RAPIDS – Open GPU-accelerated Data Science
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Big Data processing with Apache Spark
Big Data processing with Apache SparkBig Data processing with Apache Spark
Big Data processing with Apache Spark
 

Recently uploaded

H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
SupreethSP4
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
Investor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptxInvestor-Presentation-Q1FY2024 investor presentation document.pptx
Investor-Presentation-Q1FY2024 investor presentation document.pptx
AmarGB2
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 

Recently uploaded (20)

H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 

Comparing Pregel-related systems

  • 1. Comparison and Evaluation of Open Source Implementations of Pregel and Related Systems December 2, 2013 Joshua Woo, Prashant Raghav, Vishnu Prathish David R. Cheriton School of Computer Science University of Waterloo
  • 2. Outline ● Motivation ● Our Project ● Setup ● Preliminary Results ● Preliminary Analysis ● In-Progress ● References
  • 3. Motivation Recall: Pregel ● Large-scale graph processing system ● Fault-tolerant framework for graph algorithms ● MapReduce for graph operations? ● Vertex-centric model (“think like a vertex”)
  • 4. Motivation ● Pregel is proprietary ● Many open source graph processing systems ○ Pregel clones ○ Pregel-inspired ○ BSP
  • 5. Motivation ● Apache Hama ● Signal/Collect ● Apache Giraph ● GPS ● GraphLab ● Phoebus ● GoldenOrb ● HipG ● Mizan
  • 6. Motivation
        System            Impl. Language    Type
        Apache Hama       Java              Pure BSP framework
        Signal/Collect    Scala             Pregel inspired
        Apache Giraph     Java              Pregel clone
        GPS               Java              Advanced Pregel clone
        GraphLab          C++               Pregel inspired
        Phoebus           Erlang            Pregel clone
        GoldenOrb         Java              Pregel clone
        HipG              Java              Advanced Pregel clone
        Mizan             C++               Advanced Pregel clone
  • 7. Motivation ● How do these systems compare? ○ In terms of performance (runtime)? ○ In terms of memory footprint? ○ In terms of network utilization (num. messages)? ○ Variables: ■ Algorithm ■ Graph size (number of vertices) ■ Cluster size
  • 8. Our Project ● Compare at least 3 systems ○ Apache Hama - general BSP framework ○ Apache Giraph - Hadoop Map-only job, Facebook ○ GPS - +dynamic repartitioning, +multi vertex-centric ○ Signal/Collect - +edges, +async computations ○ GraphLab ○ Mizan
  • 9. Our Project ● Measure the runtime of at least two algorithms on each system ○ PageRank ■ Fixed number of supersteps = 30 ○ Single Source Shortest Path (SSSP) ○ k-means clustering
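To make the PageRank workload concrete, here is a minimal single-process sketch of PageRank run for a fixed number of synchronous rounds, in the vertex-centric style the slides describe. It is an illustration only, not the code run on Hama, Giraph, or GPS. The function name `pagerank`, the damping factor of 0.85, and the handling of vertices with no out-edges (their rank is simply not forwarded) are assumptions.

```python
def pagerank(out_edges, supersteps=30, damping=0.85):
    """Fixed-round PageRank sketch.

    out_edges: dict mapping each vertex to a list of its out-neighbours.
    damping:   assumed value; the slides do not state one.
    """
    vertices = set(out_edges)
    for targets in out_edges.values():
        vertices.update(targets)
    n = len(vertices)

    # Every vertex starts with the same rank.
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        # "Message passing": each vertex splits its rank among its out-neighbours.
        incoming = {v: 0.0 for v in vertices}
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = rank[v] / len(targets)
                for w in targets:
                    incoming[w] += share
        # "Compute": each vertex combines the messages it received.
        rank = {v: (1 - damping) / n + damping * incoming[v] for v in vertices}
    return rank


ranks = pagerank({0: [1], 1: [2], 2: [0]})
```

On a symmetric three-vertex cycle like the one above, every vertex should keep the same rank, which makes it a convenient sanity check before running larger inputs.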
  • 10. Setup ● Experiments on AWS ○ Ubuntu 12.04 m1.medium EC2 instances ■ 2 ECUs, 1 vCPU, 3.7 GiB memory, moderate network performance ■ 8 GiB EBS volume per instance ○ Cluster sizes: ■ Single-node cluster ■ 4-node cluster ■ 8-node cluster
  • 11. Setup ● Experiments on AWS ○ 5 runs per dataset per algorithm per cluster ■ 35 runs per algorithm per cluster ■ 70 runs per cluster ■ 140 runs in total (single-node, 4-node) ● TODO: another 70 runs (8-node)
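The run counts on this slide follow from simple multiplication. The short sketch below reproduces them; the variable names are illustrative, and the count of two algorithms (SSSP and PageRank) is inferred from 70 runs per cluster divided by 35 runs per algorithm.

```python
runs_per_dataset = 5    # runs per dataset per algorithm per cluster
num_datasets = 7        # tinyEWD ... largeEWD
num_algorithms = 2      # inferred: SSSP and PageRank
clusters_done = 2       # single-node and 4-node
clusters_planned = 3    # adds the 8-node cluster

runs_per_algorithm_per_cluster = runs_per_dataset * num_datasets       # 35
runs_per_cluster = runs_per_algorithm_per_cluster * num_algorithms     # 70
runs_so_far = runs_per_cluster * clusters_done                         # 140
runs_remaining = runs_per_cluster * (clusters_planned - clusters_done) # 70
```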
  • 12. Setup ● Dataset ○ 7 datasets ■ tinyEWD: 8 vertices 15 edges ■ mediumEWD: 250 vertices 2,546 edges ■ 1000EWD: 1,000 vertices 16,866 edges ■ rome99: 3,353 vertices 8,870 edges ■ 10000EWD: 10,000 vertices 16,866 edges ■ NYC: 264,346 vertices 733,846 edges ■ largeEWD: 1,000,000 vertices 15,172,126 edges ○ Source: http://algs4.cs.princeton.edu/44sp/
  • 13. Setup ● Systems ○ Hama ■ Hadoop 1.03.0 ■ Hama 0.6.3 ○ Giraph ■ Hadoop 0.20.203rc1 ■ Giraph (trunk@37bc2c80564b45d7e4ce95db76f5411a6b8bdb3a) ○ GPS ■ Hadoop 0.20.203rc1 ■ GPS (trunk@Revision 112)
  • 14. Setup ● Input Graph ○ Source files converted into format suitable for each system ■ Time for this conversion excluded from results: ● Conversion done before algorithms are run (pre-processing?) ● Negligible for largeEWD (1,000,000 vertices, 15,172,126 edges)
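As a rough illustration of the conversion step, the sketch below parses the EWD text layout used by the Princeton files (vertex count, edge count, then one `v w weight` triple per edge) and writes one adjacency line per vertex. The output layout (`vertex<TAB>neighbour:weight ...`) is hypothetical and is not the actual input format of Hama, Giraph, or GPS; the function names are likewise made up for this example.

```python
def parse_ewd(text):
    """Parse EWD text into {vertex: [(neighbour, weight), ...]}."""
    tokens = text.split()
    num_vertices, num_edges = int(tokens[0]), int(tokens[1])
    adjacency = {v: [] for v in range(num_vertices)}
    for i in range(num_edges):
        v, w, weight = tokens[2 + 3 * i: 5 + 3 * i]
        adjacency[int(v)].append((int(w), float(weight)))
    return adjacency


def to_adjacency_lines(adjacency):
    """Emit one line per vertex in a hypothetical adjacency-list layout."""
    lines = []
    for v in sorted(adjacency):
        targets = " ".join(f"{w}:{weight}" for w, weight in adjacency[v])
        # rstrip drops the trailing tab for vertices without out-edges.
        lines.append(f"{v}\t{targets}".rstrip())
    return lines


sample = "3\n2\n0 1 0.5\n1 2 0.25\n"
lines = to_adjacency_lines(parse_ewd(sample))
```

Because a pass like this runs once per dataset before any algorithm starts, timing it separately keeps it out of the measured runtimes, as the slide describes.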
  • 15. Preliminary Results Average SSSP runtime on 4-node cluster (in seconds)
        Dataset      Hama        Giraph    GPS
        tinyEWD      14.17       41.60     14.40
        mediumEWD    16.36       44.00     36.00
        1000EWD      18.06       48.80     46.60
        rome99       22.95       66.00     50.00
        10000EWD     25.32       67.40     55.00
        NYC          165.01      267.00    310.00
        largeEWD     6,109.20    602.80    618.70
  • 16. Preliminary Results SSSP runtime vs. graph size (num. vertices)
  • 17. Preliminary Results Average PageRank (30 supersteps) runtime on 4-node cluster (in seconds)
        Dataset      Hama        Giraph      GPS
        tinyEWD      29.36       49.40       58.57
        mediumEWD    30.26       53.40       60.42
        1000EWD      37.86       54.60       61.03
        rome99       29.35       56.20       61.80
        10000EWD     302.33      61.80       64.80
        NYC          1,001.24    134.40      68.69
        largeEWD     Failed      2,100.00    1,213.56
  • 18. Preliminary Results PageRank runtime vs. graph size (num. vertices)
  • 19. Preliminary Analysis ● A point of resource crunch ○ No significant change in performance until a point ● Hama does not scale well (vertices ~10^4) ● Giraph and GPS scale better ● In general, PageRank runtime > SSSP runtime ● GPS input reader does not guarantee true partitioning for large datasets ● Which ‘knobs’ to keep constant? - Optimization vs. Comparability
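One way to put a number on the scaling observations is to compare each system's runtime on the largest dataset with its runtime on the smallest. The sketch below does this for the 4-node SSSP averages reported in the results table; the growth-factor metric and the variable names are choices made for this example, not part of the original analysis.

```python
# Average SSSP runtimes in seconds on the 4-node cluster (from the results table).
sssp_runtime = {
    "Hama":   {"tinyEWD": 14.17, "largeEWD": 6109.20},
    "Giraph": {"tinyEWD": 41.60, "largeEWD": 602.80},
    "GPS":    {"tinyEWD": 14.40, "largeEWD": 618.70},
}

# Growth factor: how many times slower the largest graph is than the smallest.
growth = {
    system: times["largeEWD"] / times["tinyEWD"]
    for system, times in sssp_runtime.items()
}

for system, factor in sorted(growth.items(), key=lambda kv: kv[1]):
    print(f"{system}: {factor:.1f}x")
```

A small growth factor partly reflects a high fixed startup cost on tiny inputs, so this ratio is best read alongside the absolute runtimes rather than on its own.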
  • 20. In-Progress ● Output validation ● Memory footprint ● Network utilization (num. messages) ● GraphLab and Signal/Collect ● Green-Marl? ○ (DSL) → [Compiler] → (Giraph, GPS)
  • 23. Preliminary Results Number of supersteps for SSSP
        Dataset      Hama    Giraph    GPS
        tinyEWD      10      7         7
        mediumEWD    16      13        18
        1000EWD      27      25        23
        rome99       105     102       18
        10000EWD     85      80        64
        NYC          671     905       438
        largeEWD     806     670       730
  • 24. Preliminary Results Number of supersteps for SSSP
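To show why the superstep count for SSSP depends on the graph rather than being fixed, here is a minimal single-process sketch of vertex-centric SSSP in synchronous supersteps. It assumes non-negative edge weights and counts a superstep for every round in which at least one vertex receives a message. The real systems may count supersteps differently, so the numbers it produces are not expected to match the table. The name `pregel_sssp` is made up for this example.

```python
import math
from collections import defaultdict


def pregel_sssp(edges, source):
    """Vertex-centric SSSP sketch.

    edges: dict mapping vertex -> list of (neighbour, weight) pairs.
    Returns (distances, number_of_supersteps).
    """
    vertices = set(edges)
    for neighbours in edges.values():
        vertices.update(w for w, _ in neighbours)
    dist = {v: math.inf for v in vertices}

    # The source is activated with a candidate distance of 0.
    inbox = {source: [0.0]}
    supersteps = 0
    while inbox:  # halt once no messages are in flight
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                # Improved distance: record it and notify out-neighbours.
                dist[v] = best
                for w, weight in edges.get(v, []):
                    outbox[w].append(best + weight)
        inbox = outbox
        supersteps += 1
    return dist, supersteps


graph = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
distances, rounds = pregel_sssp(graph, "a")
```

In this sketch the number of rounds grows with the longest chain of distance improvements from the source, which is why sparse, road-like graphs can need many more supersteps than denser graphs of similar size.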
  • 25. Really, really Preliminary PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
        Dataset      Native      Green-Marl generated
        tinyEWD      58.57       60.20
        mediumEWD    60.42       60.11
        1000EWD      61.03       62.30
        rome99       61.80       62.32
        10000EWD     64.80       65.78
        NYC          68.69       71.34
        largeEWD     1,213.56    -
  • 26. Really, really Preliminary PageRank runtime (in seconds) on GPS: native vs. Green-Marl generated
  • 27. References ● Our Project Proposal ● http://algs4.cs.princeton.edu/44sp/ ● https://github.com/apache/hadoop-common ● https://github.com/apache/giraph ● https://subversion.assembla.com/svn/phd-projects/gps/trunk/ ● http://ppl.stanford.edu/main/green_marl.html