A Distributed Graph-Processing Library
Ahmet Emre Aladağ - AGMLab
26.08.2013

Apache Giraph

An introduction to Apache Giraph.

Published in: Career, Technology

  1. A Distributed Graph-Processing Library
     Ahmet Emre Aladağ, AGMLab, 26.08.2013
  2. What is Giraph?
     ● Library for large-scale graph processing.
     ● Runs on Apache Hadoop as map-only jobs.
     ● Follows the Bulk Synchronous Parallel (BSP) model.
     [Figure: vertex computation, with incoming and outgoing messages]
  3. Uses
     ● PageRank-variant iterative algorithms
     ● Graph clustering
       ○ Label propagation
       ○ Max clique
       ○ Triangle closure
       ○ Finding related people, groups, interests
     ● Shortest path
       ○ Single source, s-t, all to all
     ● Finding connected components
  4. Alternatives
     ● Map-Reduce jobs on Hadoop
       ○ Not a good fit for graph algorithms: too much overhead.
     ● Google Pregel
       ○ Requires its own infrastructure
       ○ Not publicly available
       ○ Master is a single point of failure.
     ● Message Passing Interface (MPI)
       ○ Not fault-tolerant
       ○ Too generic
  5. How Giraph differs
     ● Runs on an existing Hadoop cluster; no special infrastructure needed.
     ● Easy deployment with Amazon EMR
     ● Dynamic resource management
     ● Graph-oriented API
     ● Open source
     ● Fault tolerant: no single point of failure (SPOF) except the Hadoop NameNode and JobTracker
     ● Jython support
  6. Layers
  7. Mechanism
     Input → InputFormat/Reader → Computation → OutputFormat/Writer → Output
     ● Input sources: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j etc.
     ● Output targets: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j etc., GraphViz
     ● Data formats: adjacency matrix, id-value pairs, JSON
  8. InputFormat
     ● VertexInputFormat
       1;3.4
       2;6.1
       3;2.7
     ● EdgeInputFormat
       1;2
       2;3
       1;3
     [Figure: vertices 1, 2, 3 with values 3.4, 6.1, 2.7]
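The sample lines on the InputFormat slide can be read with a few lines of code. The sketch below is plain Python, not Giraph's Java `VertexInputFormat`/`EdgeInputFormat` API. It assumes that each vertex line is `id;value` and each edge line is `source;target`, which the slide suggests but does not state. The helper names `parse_vertices` and `parse_edges` are hypothetical.

```python
def parse_vertices(lines):
    """Parse 'id;value' lines into {vertex_id: value} (assumed layout)."""
    vertices = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vertex_id, value = line.split(";")
        vertices[int(vertex_id)] = float(value)
    return vertices


def parse_edges(lines):
    """Parse 'source;target' lines into {source: [targets]} (assumed layout)."""
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, target = line.split(";")
        adjacency.setdefault(int(source), []).append(int(target))
    return adjacency


# The sample records from the slide.
vertices = parse_vertices(["1;3.4", "2;6.1", "3;2.7"])
adjacency = parse_edges(["1;2", "2;3", "1;3"])
```

Keeping vertex values and edges in separate readers mirrors the slide's split between the two input formats, so either side could come from a different source.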
  9. Computation
     ● Supersteps are separated by barriers.
     ● Send messages to and receive messages from neighbors.
     ● Update the vertex value.
     ● Vote to halt, or wake up when a message arrives.
     [Figure: single-source shortest path example]
  10. Shortest-Path Computation Code (note: old API)
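The superstep model for single-source shortest path can be sketched as a small single-process simulation. This is plain Python under stated assumptions, not the Giraph code shown on the original slide: the function name `run_sssp`, the weighted adjacency layout, and the sample graph are all hypothetical. A vertex is treated as halted unless it has messages waiting, which stands in for "vote to halt or wake up".

```python
import math


def run_sssp(adjacency, source):
    """Simulate BSP supersteps for single-source shortest path.

    adjacency: {vertex: [(neighbor, edge_weight), ...]} (assumed layout)
    Returns (distances, number_of_supersteps).
    """
    vertices = set(adjacency)
    for edges in adjacency.values():
        vertices.update(neighbor for neighbor, _ in edges)

    distance = {vertex: math.inf for vertex in vertices}
    inbox = {source: [0.0]}  # the source starts with distance 0
    supersteps = 0

    while inbox:  # the job ends when no vertex has pending messages
        outbox = {}
        for vertex, messages in inbox.items():
            candidate = min(messages)
            if candidate < distance[vertex]:
                # Update the value and tell neighbors about the shorter path.
                distance[vertex] = candidate
                for neighbor, weight in adjacency.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(candidate + weight)
            # Otherwise the vertex stays halted until new messages arrive.
        inbox = outbox  # barrier: messages are delivered in the next superstep
        supersteps += 1

    return distance, supersteps


graph = {1: [(2, 1.0), (3, 4.0)], 2: [(3, 2.0)], 3: []}
distances, supersteps = run_sssp(graph, source=1)
```

In this sample graph, vertex 3 first learns the direct distance 4.0 and then improves it to 3.0 one superstep later, when the message routed through vertex 2 arrives.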
  11. Example: finding the maximum value
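The maximum-value example follows the same superstep pattern: every vertex first announces its value, and afterwards a vertex only speaks again when it learns a larger one. The simulation below is a minimal plain-Python sketch, not Giraph code. The function name `propagate_max` is hypothetical, and the undirected chain 1-2-3 is an assumed topology chosen so that the maximum can reach every vertex.

```python
def propagate_max(values, adjacency):
    """Simulate BSP supersteps that spread the maximum vertex value."""
    values = dict(values)  # work on a copy

    # Superstep 0: every vertex is active and sends its value to neighbors.
    inbox = {}
    for vertex, value in values.items():
        for neighbor in adjacency.get(vertex, []):
            inbox.setdefault(neighbor, []).append(value)

    while inbox:  # stop when all vertices have halted and no messages remain
        outbox = {}
        for vertex, messages in inbox.items():
            largest = max(messages)
            if largest > values[vertex]:
                # Learned a larger value: update and forward it.
                values[vertex] = largest
                for neighbor in adjacency.get(vertex, []):
                    outbox.setdefault(neighbor, []).append(largest)
            # Otherwise the vertex votes to halt (sends nothing).
        inbox = outbox  # barrier between supersteps
    return values


initial = {1: 3.4, 2: 6.1, 3: 2.7}
chain = {1: [2], 2: [1, 3], 3: [2]}  # assumed undirected chain 1-2-3
result = propagate_max(initial, chain)
```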
  12. Aggregators
     ● Shared variables among the workers.
     ● Each vertex computation can contribute a value to an aggregator (for example by adding or multiplying).
     ● Examples:
       ○ Holding the min/max value among all vertices
       ○ Holding the sum of the vertex values
       ○ Holding the average of the vertex values
       ○ Holding the sum of mean square errors and the standard deviation
     [Figure: computation at iteration k; vertices 1, 2, 3 with values 0.2, 0.6, 0.45 and aggregate 1.25]
  13. MasterCompute Class
     ● The master's compute() always runs before the workers' compute() in each superstep (like a pre-superstep hook).
       ○ In the vertex compute(): aggregate vertex values, e.g. the sum of values
       ○ In MasterCompute: average = sum / N
     ● Aggregators are registered here.
     ● You can set aggregator values here.
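The interplay between aggregators and the master can be sketched as follows. This is a plain-Python simulation, not Giraph's aggregator or `MasterCompute` API: the class `SumAggregator` and the function `run_job` are hypothetical names. The vertex values 0.2, 0.6 and 0.45 are the ones that appear on the Aggregators slide. Because the master phase runs first in each superstep, it sees the sum collected during the previous superstep.

```python
class SumAggregator:
    """A shared value that vertex computations can add to (illustrative)."""

    def __init__(self):
        self.value = 0.0

    def aggregate(self, amount):
        self.value += amount

    def reset(self):
        self.value = 0.0


def run_job(values, num_supersteps):
    """Return the averages the master computes at the start of each superstep."""
    sum_aggregator = SumAggregator()  # "registered" by the master
    averages = []
    for superstep in range(num_supersteps):
        # Master phase: runs before the vertices, sees the previous sum.
        if superstep > 0:
            averages.append(sum_aggregator.value / len(values))
        sum_aggregator.reset()
        # Vertex phase: each vertex contributes its value.
        for value in values.values():
            sum_aggregator.aggregate(value)
    return averages


vertex_values = {1: 0.2, 2: 0.6, 3: 0.45}
averages = run_job(vertex_values, num_supersteps=2)
```

Here the master obtains one average, 1.25 / 3, at the start of the second superstep.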
  14. Worker Context
     ● Allows for the execution of user code on a per-worker basis.
     ● There is one WorkerContext per worker.
     ● Provides methods for pre/post-superstep and pre/post-application operations.
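The order in which these per-worker hooks fire can be illustrated with a small simulation. The sketch is plain Python, not Giraph's `WorkerContext` class. The class `RecordingWorkerContext`, the driver `run_worker`, and the snake_case method names are hypothetical stand-ins for the pre/post-application and pre/post-superstep hooks the slide lists.

```python
class RecordingWorkerContext:
    """Records hook invocations so the call order can be inspected."""

    def __init__(self):
        self.events = []

    def pre_application(self):
        self.events.append("pre_application")

    def pre_superstep(self, superstep):
        self.events.append(f"pre_superstep {superstep}")

    def post_superstep(self, superstep):
        self.events.append(f"post_superstep {superstep}")

    def post_application(self):
        self.events.append("post_application")


def run_worker(context, num_supersteps):
    """Drive one worker: hooks wrap the application and each superstep."""
    context.pre_application()
    for superstep in range(num_supersteps):
        context.pre_superstep(superstep)
        # The worker's vertex computations would run here.
        context.post_superstep(superstep)
    context.post_application()


context = RecordingWorkerContext()
run_worker(context, num_supersteps=2)
```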
  15. Flexible Edge/Vertex Input
     ● Read edges/vertices from different sources.
     ● Multiple input resources
  16. Parallel Computing
     ● More map jobs (workers) = more parallelism
     ● To overcome the slowest-worker problem, multithreading is applied to input, computation and output.
     ● Linear speedup in CPU-bound applications such as k-means clustering, due to multithreading
     ● Take a set of entire machines and use multithreading to maximize resource utilization.
  17. Memory Optimization
     ● Vertices and edges are stored as serialized byte arrays.
     ● Uses FastUtil-based Java primitives.
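The idea behind storing edges as a serialized byte array, rather than as one object per edge, can be shown with Python's `struct` module. This is only an illustration of the technique. The record layout (a 64-bit target id followed by a 64-bit float weight) and the helper names are assumptions, and this is not Giraph's actual on-heap format.

```python
import struct

# Assumed record layout: little-endian int64 target id + float64 weight.
EDGE_FORMAT = "<qd"
EDGE_SIZE = struct.calcsize(EDGE_FORMAT)  # 16 bytes per edge


def serialize_edges(edges):
    """Pack (target_id, weight) pairs into one contiguous bytes object."""
    buffer = bytearray()
    for target, weight in edges:
        buffer += struct.pack(EDGE_FORMAT, target, weight)
    return bytes(buffer)


def iterate_edges(buffer):
    """Decode edges lazily, one fixed-size record at a time."""
    for offset in range(0, len(buffer), EDGE_SIZE):
        yield struct.unpack_from(EDGE_FORMAT, buffer, offset)


edges = [(2, 1.5), (3, 2.25), (7, 0.5)]
packed = serialize_edges(edges)
decoded = list(iterate_edges(packed))
```

A single buffer avoids per-edge object headers and pointers, which is the kind of saving the slide refers to.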
  18. Sharded Aggregators
     ● Each aggregator is randomly assigned to one of the workers.
     ● The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to the other workers.
     ● Aggregation responsibilities are balanced across all workers rather than bottlenecked by the master.
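The three steps above (assign an owner, gather and reduce, distribute) can be sketched in a single process. This is a plain-Python simulation, not Giraph's implementation: the function names, the seeded random assignment, and the sample partial values are all hypothetical.

```python
import random


def assign_owners(aggregator_names, num_workers, seed=0):
    """Randomly pick an owning worker for each aggregator."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return {name: rng.randrange(num_workers) for name in aggregator_names}


def sharded_aggregate(partials_by_worker, owners, reducers):
    """Reduce each aggregator at its owner, then hand every worker the result.

    partials_by_worker: list (indexed by worker) of {aggregator_name: partial}
    Returns (per-worker copies of the final values, aggregators owned per worker).
    """
    final = {}
    load = [0] * len(partials_by_worker)
    for name, owner in owners.items():
        # The owner gathers this aggregator's partial value from every worker.
        gathered = [partials[name] for partials in partials_by_worker]
        final[name] = reducers[name](gathered)
        load[owner] += 1
    # Distribution step: every worker receives all final values.
    return [dict(final) for _ in partials_by_worker], load


partials = [
    {"sum": 1.0, "max": 4.0},  # worker 0
    {"sum": 2.5, "max": 9.0},  # worker 1
    {"sum": 0.5, "max": 7.0},  # worker 2
]
owners = assign_owners(["sum", "max"], num_workers=3)
views, load = sharded_aggregate(partials, owners, {"sum": sum, "max": max})
```

The `load` list shows how many aggregators each worker ended up reducing, which is the work that would otherwise all fall on the master.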
  19. Performance
     ● PageRank on 1 trillion edges with 200 commodity machines: 4 minutes per iteration.
     ● K-means on 1 billion input vectors x 100 features into 10,000 centroids: 10 minutes.
     ● Linear scalability
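For a sense of scale, the PageRank figure can be turned into rough throughput numbers. The arithmetic below uses only the three numbers quoted on the slide and assumes, purely for illustration, that edges are spread evenly across machines. The results are back-of-the-envelope estimates, not measurements.

```python
edges = 10**12                 # 1 trillion edges (from the slide)
machines = 200                 # commodity machines (from the slide)
seconds_per_iteration = 4 * 60 # 4 minutes per iteration (from the slide)

# Assumption: edges are distributed evenly across the machines.
edges_per_machine = edges // machines

# Edges touched per second across the cluster and per machine.
aggregate_rate = edges / seconds_per_iteration
per_machine_rate = aggregate_rate / machines

print(f"{edges_per_machine:,} edges per machine")
print(f"{aggregate_rate:,.0f} edges/s across the cluster")
print(f"{per_machine_rate:,.0f} edges/s per machine")
```

Under these assumptions each machine holds about 5 billion edges and processes roughly 21 million edges per second.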
  20. Currently
     ● Version 1.0, on the way to 1.1
     ● Changing rapidly: backwards-incompatible changes
     ● Documentation not mature yet.
     ● More algorithms to be contributed.
     ● More data sources to be ported.
     ● http://giraph.apache.org for more info
  21. References
     ● Giraph: Large-scale graph processing infrastructure on Hadoop, 2011.
     ● Scaling Apache Giraph to a trillion edges, Avery Ching, Facebook, 2013.
     ● Scaling Apache Giraph, Nitay Joffe, Facebook, 2013.
     ● Giraph: http://giraph.apache.org
  22. Questions?
