The Thorny Path to Large-Scale Graph Processing
Zinoviev Alexey
About 
• I am a <graph theory, machine learning, traffic jam prediction, BigData algorithms> scientist 
• But I'm a <Java, JavaScript, Android, NoSQL, Hadoop, Spark> programmer
BigData & Graph Theory 
Big Data of old times 
• Astronomy 
• Weather 
• Trading 
• Sea routes 
• Battles
And now ... 
• Web graph 
• Facebook friend network 
• Gmail email graph 
• EU road network 
• Citation graph 
• PayPal transaction graph
Graph                    | Vertices    | Edges       | Volume | Data per day
Web graph                | 1.5 * 10^12 | 1.2 * 10^13 | 100 PB | 300 TB
Facebook (friends graph) | 1.1 * 10^9  | 160 * 10^9  | 1 PB   | 15 TB
EU road graph            | 18 * 10^6   | 42 * 10^6   | 20 GB  | 50 MB
Road graph of this city  | 250 000     | 460 000     | 500 MB | 100 KB
Problems 
• Popularity rank (PageRank) 
• Determining popular users, news, jobs, etc. 
• Shortest paths 
• Max flow 
• How are users, groups connected? 
• Clustering, semi-clustering 
• Max clique, triangle closure, label propagation algorithms 
• Finding related people, groups, interests
Node Centrality Problem 
• Vertices with high impact 
• Removal of important vertices reduces reliability 
Cases: 
• Bioinformatics 
• Social connections 
• Road network 
• Spam detection 
• Recommendation system
Small World Problem 
Network                  | Avg. path length | Users | Edges
Facebook                 | 4.74             | 712 M | 69 G
Twitter                  | 3.67             | ----  | 5 G follows
MSN Messenger (1 month)  | 6.6              | 180 M | 1.3 G arcs
Large graph processing tools 
Think like a vertex… 
• The majority of graph algorithms are iterative and traverse the graph in some way 
• Classic MapReduce overheads: job startup/shutdown, reloading data from HDFS, shuffling 
• High complexity of reducing graph problems to the key-value model 
• Iterative algorithms turn into multiple chained M/R jobs, with the full state saved and re-read between iterations
Why not use MapReduce/Hadoop? 
• Example: PageRank, Google's famous algorithm for measuring the authority of a webpage based on the underlying network of hyperlinks 
• Defined recursively: each vertex distributes its authority to its neighbors in equal proportions
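For reference, the standard form of this recursion (damping factor d, N vertices; the notation is the usual one, not taken from the slide) is:

PR(v) = (1 - d) / N + d * Σ_{u -> v} PR(u) / outdeg(u)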
Google Pregel 
• Distributed system especially developed for large scale graph 
processing 
• Bulk Synchronous Parallel (BSP) as execution model 
• Supersteps are atomic units of parallel computation 
• Any superstep can be restarted from a checkpoint (need not be user 
defined) 
• A new superstep provides an opportunity for rebalancing of 
components among available resources
Superstep in BSP
Vertex-centric BSP 
• Each vertex has an id, a value, a list of its adjacent vertex ids and the 
corresponding edge values 
• Each vertex is invoked in each superstep, can recompute its value and 
send messages to other vertices, which are delivered over superstep 
barriers 
• Advanced features: termination votes, combiners, aggregators, topology mutations
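A minimal, self-contained Java sketch of this vertex-centric model, using a toy in-memory graph and PageRank as the per-vertex computation (illustrative only; this is not the Pregel or Giraph API):

```java
import java.util.*;

// Toy, single-machine illustration of vertex-centric BSP PageRank.
// Assumptions: unweighted directed graph as adjacency lists, a fixed number
// of supersteps, damping factor 0.85. Not the actual Pregel/Giraph API.
public class VertexCentricPageRank {

    public static void main(String[] args) {
        // vertex id -> outgoing neighbor ids (made-up 3-vertex graph)
        Map<Integer, List<Integer>> adj = Map.of(
            0, List.of(1, 2),
            1, List.of(2),
            2, List.of(0));

        int n = adj.size();
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);

        // Each superstep: every vertex sends rank/outdeg to its neighbors,
        // then recomputes its value from the received messages. The barrier
        // between the send and recompute phases is implicit here.
        for (int superstep = 0; superstep < 30; superstep++) {
            double[] incoming = new double[n];           // message "inbox"
            for (Map.Entry<Integer, List<Integer>> e : adj.entrySet()) {
                double share = rank[e.getKey()] / e.getValue().size();
                for (int target : e.getValue()) {
                    incoming[target] += share;           // sendMessage(target, share)
                }
            }
            for (int v = 0; v < n; v++) {
                rank[v] = (1 - d) / n + d * incoming[v]; // compute() over messages
            }
        }
        System.out.println(Arrays.toString(rank));
    }
}
```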
C++ API, Pregel
Apache Giraph 
Why Apache Giraph 
Pregel is proprietary, but: 
• Apache Giraph is an open source implementation of Pregel 
• Runs on standard Hadoop infrastructure 
• Computation is executed in memory 
• Can be a job in a pipeline (MapReduce, Hive) 
• Uses Apache ZooKeeper for synchronization
Why Apache Giraph 
• No locks: message-based communication 
• No semaphores: global synchronization 
• Iteration isolation: massively parallelizable
ZooKeeper in Apache Giraph 
ZooKeeper: responsible for 
computation state 
• Partition/worker mapping 
• Global state: superstep 
• Checkpoint paths, aggregator 
values, statistics
Master in Apache Giraph 
Master: responsible for coordination 
• Assigns partitions to workers 
• Coordinates synchronization 
• Requests checkpoints 
• Aggregates aggregator values 
• Collects health statuses
Worker in Apache Giraph 
Worker: responsible for vertices 
• Invokes active vertices' compute() functions 
• Sends, receives and assigns 
messages 
• Computes local aggregation 
values
Scaling Giraph to a trillion edges
Fault tolerance 
No single point of failure from Giraph threads 
• With multiple master threads, if the current master dies, a new 
one will automatically take over. 
• If a worker thread dies, the application is rolled back to a 
previously checkpointed superstep. 
• If a ZooKeeper server dies, the application can proceed as long as a quorum remains 
Hadoop single points of failure still exist (NameNode, JobTracker)
Worker Scalability, 250m nodes
Vertex scalability, 300 workers
Vertex/workers scalability
MapReduce vs Giraph 
6 machines, each with 2 x 8-core Opteron CPUs, 4 x 1 TB disks and 32 GB RAM; 1 Giraph worker per core 
Wikipedia page link graph (6 million vertices, 200 million edges) 
PageRank on Hadoop/Mahout 
• 10 iterations approx. 29 minutes 
• average time per iteration: approx. 3 minutes 
PageRank on Giraph 
• 30 iterations took approx. 15 minutes 
• average time per iteration: approx. 30 seconds 
10x performance improvement
Okapi 
• Apache Mahout for graphs 
• Graph-based recommenders: ALS, 
SGD, SVD++, etc. 
• Graph analytics: Graph 
partitioning, Community Detection, 
K-Core, etc.
Giraph’s killer
Spark 
• MapReduce in memory 
• Up to 50x faster than Hadoop 
• Support for Shark (like Hive), MLlib 
(Machine learning), GraphX (graph 
processing) 
• RDD is a basic building block 
(immutable distributed collections of 
objects)
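A minimal Java sketch of the RDD idea (the classes are Spark's Java API; the data and app name are made up):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

// An RDD is an immutable, partitioned collection; cache() keeps it in memory,
// so the iterations below reuse it instead of re-reading input from disk --
// the property that makes iterative graph algorithms cheaper than chained
// MapReduce jobs.
public class RddSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> degrees = sc.parallelize(Arrays.asList(3, 1, 4, 1, 5)).cache();

        for (int i = 0; i < 3; i++) {
            long highDegree = degrees.filter(x -> x > 2).count();  // lazy transform + action
            System.out.println("iteration " + i + ": " + highDegree);
        }
        sc.stop();
    }
}
```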
Spark in Hadoop old family
GraphX 
Supported algorithms 
● PageRank 
● Connected components 
● Label propagation 
● SVD++ 
● Strongly connected components 
● Triangle count
GraphChi 
• Asynchronous, disk-based version of GraphLab 
• Uses the Parallel Sliding Windows method 
• Very small number of non-sequential accesses to the disk 
• Works even when the graph does not fit in memory 
• Input graph is split into P disjoint intervals to balance edges, each associated with a shard 
• For Home deals ...
Road Networks 
Definition 
• Edge weights > 0 
• A few classes of roads 
• Lat/Lon attributes for each vertex 
• Subgraphs for cross-roads 
• Not as big as the web graph 
• Static
Shortest path problem
(Search-space illustrations: A*, full Dijkstra, bi-directional search)
We need a fast system! 
• Response < 10 ms (with high accuracy) 
• Shortest-path (SP) queries in O(n) 
• Preprocessing phase 
• Don't store all shortest paths: that's O(n^2) 
• Use geo attributes 
• Use compression and re-encoding for disk storage 
• The network is stable
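As a baseline for the EU road network numbers below, a plain Dijkstra query with no preprocessing might look like this (illustrative sketch; the graph and weights are made up):

```java
import java.util.*;

// Minimal Dijkstra baseline: O((V + E) log V) with a binary heap and no
// preprocessing. The speed-up techniques in the next table answer the same
// query orders of magnitude faster.
public class DijkstraBaseline {

    // edge of the adjacency list: target vertex and non-negative weight
    record Edge(int to, double w) {}

    static double[] shortestPaths(List<List<Edge>> adj, int source) {
        double[] dist = new double[adj.size()];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[source] = 0.0;

        // priority queue of {distance, vertex}, smallest distance first
        PriorityQueue<double[]> pq =
            new PriorityQueue<>(Comparator.comparingDouble((double[] a) -> a[0]));
        pq.add(new double[]{0.0, source});

        while (!pq.isEmpty()) {
            double[] top = pq.poll();
            double d = top[0];
            int u = (int) top[1];
            if (d > dist[u]) continue;               // stale queue entry
            for (Edge e : adj.get(u)) {
                if (d + e.w() < dist[e.to()]) {      // relax the edge
                    dist[e.to()] = d + e.w();
                    pq.add(new double[]{dist[e.to()], e.to()});
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        // tiny made-up road graph: 0 -> 1 -> 2, plus a slower direct 0 -> 2
        List<List<Edge>> adj = List.of(
            List.of(new Edge(1, 2.0), new Edge(2, 9.0)),
            List.of(new Edge(2, 3.0)),
            List.of());
        System.out.println(Arrays.toString(shortestPaths(adj, 0))); // [0.0, 2.0, 5.0]
    }
}
```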
EU Road network: average query time (µs)

Method    | Dijkstra  | ALT    | RE    | HH    | CH   | TN  | HL
Query, µs | 2 008 300 | 24 656 | 2 444 | 462.0 | 94.0 | 1.8 | 0.3

• ALT: [Goldberg & Harrelson 05], [Delling & Wagner 07] 
• RE: [Gutman 05], [Goldberg et al. 07] 
• HH: [Sanders & Schultes 06] 
• CH: [Geisberger et al. 08] 
• TN: [Geisberger et al. 08] 
• HL: [Abraham et al. 11]
A* with landmarks (ALT)
Reach (RE)
Transit nodes (TN) 
• Divide graph G into subgraphs G_i 
• Find R (a subset of G_i) for each G_i 
• All shortest paths out of G_i pass through R 
• Build pairs (v_i, r_k) for each v_i, where r_k is the closest transit node 
• Calculate shortest paths between transit nodes in R 
• Save them!
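A hedged sketch of the query this preprocessing enables: combine the saved vertex-to-access-node distances with the transit-node distance table D (the names AccessNode and D are assumptions for illustration, not from the talk):

```java
import java.util.List;

// Query-time use of the precomputed transit-node data.
public class TransitNodeQuery {

    // distance from a vertex to one of its transit (access) nodes,
    // saved during preprocessing
    record AccessNode(int transitNode, double dist) {}

    static double query(List<AccessNode> fromSource, List<AccessNode> toTarget, double[][] D) {
        double best = Double.POSITIVE_INFINITY;
        for (AccessNode a : fromSource) {
            for (AccessNode b : toTarget) {
                // d(s, a) + D[a][b] + d(b, t)
                best = Math.min(best,
                    a.dist() + D[a.transitNode()][b.transitNode()] + b.dist());
            }
        }
        return best; // correct for long-range queries; short local queries need a fallback search
    }
}
```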
TN + ALT
Special Cases 
Optimization problems 
• Unstable graph 
• Preprocessing phase is meaningless 
• How to invest $1B in a road network to minimize the time people lose in traffic jams 
• How to invest $1M in a road network to improve reliability before a flood
Last steps ... 
• I/O-Efficient Algorithms and Data Structures 
• Graphs and Memory Errors
Omsk
Novosibirsk
Novosibirsk, TN preprocessing
twitter + G+ + VK
