An introduction to Apache Giraph.


- 1. A Distributed Graph-Processing Library
  Ahmet Emre Aladağ, AGMLab
  26.08.2013
- 2. What is Giraph?
  ● Library for large-scale graph processing.
  ● Runs on Apache Hadoop as map-only jobs.
  ● Follows the Bulk Synchronous Parallel (BSP) model.
  [Figure: vertex computation. A vertex receives incoming messages, computes, and sends outgoing messages.]
- 3. Uses
  ● PageRank-variant iterative algorithms
  ● Graph clustering
    ○ Label propagation
    ○ Max clique
    ○ Triangle closure
    ○ Finding related people, groups, interests
  ● Shortest path
    ○ Single source, s-t, all to all
  ● Finding connected components
- 4. Alternatives
  ● Map-Reduce jobs on Hadoop
    ○ Not a good fit for graph algorithms: too much overhead.
  ● Google Pregel
    ○ Requires its own infrastructure
    ○ Not publicly available
    ○ Master is a single point of failure.
  ● Message Passing Interface (MPI)
    ○ Not fault-tolerant
    ○ Too generic
- 5. How Giraph differs
  ● Runs on an existing Hadoop cluster; no special infrastructure is needed.
  ● Easy deployment with Amazon EMR
  ● Dynamic resource management
  ● Graph-oriented API
  ● Open source
  ● Fault tolerant; no single point of failure (SPOF) except the Hadoop NameNode and JobTracker
  ● Jython support
- 6. Layers
- 7. Mechanism
  Pipeline: Input → InputFormat/Reader → Computation → OutputFormat/Writer → Output
  Input sources:
  ● Accumulo
  ● HBase
  ● HCatalog
  ● HDFS
  ● Hive
  ● Neo4j etc.
  Output targets:
  ● Accumulo
  ● HBase
  ● HCatalog
  ● HDFS
  ● Hive
  ● Neo4j etc.
  ● GraphViz
  Data formats: adjacency matrix, id-value pairs, JSON
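- 8. InputFormat
  ● VertexInputFormat: one vertex per line, e.g. 1;3.4, 2;6.1, 3;2.7
  ● EdgeInputFormat: one edge per line, e.g. 1;2, 2;3, 1;3
  [Figure: the resulting graph with vertices 1, 2, 3 holding the values 3.4, 6.1, 2.7.]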
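The two record shapes above can be read with a few lines of code. The sketch below is plain Python, not a Giraph `VertexInputFormat` or `EdgeInputFormat` implementation; the function names are hypothetical, and reading each line as `id;value` or `source;target` is an assumption based on the sample records on the slide.

```python
def parse_vertex_lines(lines):
    """Parse 'id;value' records into {vertex_id: value} (assumed layout)."""
    vertices = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vertex_id, value = line.split(";")
        vertices[int(vertex_id)] = float(value)
    return vertices


def parse_edge_lines(lines):
    """Parse 'source;target' records into {source_id: [target_id, ...]}."""
    edges = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, target = line.split(";")
        edges.setdefault(int(source), []).append(int(target))
    return edges


# Sample records taken from the slide.
vertices = parse_vertex_lines(["1;3.4", "2;6.1", "3;2.7"])
edges = parse_edge_lines(["1;2", "2;3", "1;3"])
```

Keeping vertex values and edges in separate readers mirrors the split between the two input formats: either source can be swapped without touching the other.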
- 9. Computation
  ● Superstep barriers.
  ● Send/receive messages to/from neighbors.
  ● Update the vertex value.
  ● Vote to halt or wake up.
  [Figure: single-source shortest path example.]
- 10. Shortest-Path Computation Code (note: old API)
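A minimal plain-Python sketch of single-source shortest paths in the superstep style described above. It does not use the Giraph API (old or new); the function name, the graph layout, and the example weights are all assumptions made for illustration, and edge weights are assumed to be non-negative.

```python
import math


def single_source_shortest_paths(edges, source):
    """edges maps a vertex to a list of (target, weight) pairs."""
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(target for target, _ in targets)

    distance = {v: math.inf for v in vertices}
    inbox = {v: [] for v in vertices}
    superstep = 0

    # Run while it is the first superstep or any message is still in flight.
    while superstep == 0 or any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v in vertices:
            messages = inbox[v]
            if superstep > 0 and not messages:
                continue  # a halted vertex with no mail stays asleep
            # Only the source starts from distance 0, and only once.
            start = 0.0 if (v == source and superstep == 0) else math.inf
            candidate = min([start] + messages)
            if candidate < distance[v]:
                distance[v] = candidate
                # Tell each out-neighbor the distance reachable through v.
                for target, weight in edges.get(v, []):
                    outbox[target].append(candidate + weight)
            # The vertex now votes to halt; a new message wakes it up.
        inbox = outbox  # superstep barrier: messages arrive next superstep
        superstep += 1
    return distance


# Hypothetical example graph: 1 -> 2 (1.0), 1 -> 3 (4.0), 2 -> 3 (2.0).
graph = {1: [(2, 1.0), (3, 4.0)], 2: [(3, 2.0)], 3: []}
distances = single_source_shortest_paths(graph, source=1)
```

The loop ends when no vertex has sent a message, which corresponds to every vertex having voted to halt with nothing left to wake it.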
- 11. Ex: Finding the maximum value
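One way to sketch maximum-value propagation in the same superstep style: every vertex sends its value once, then forwards a value only when an incoming message raises its own. This is plain Python rather than Giraph code; the function name and the undirected chain 1-2-3 are assumptions, and the vertex values reuse the sample records from the InputFormat slide.

```python
def propagate_max(values, neighbors):
    """Return the per-vertex values after every vertex has learned the maximum."""
    values = dict(values)  # work on a copy
    inbox = {v: [] for v in values}
    superstep = 0

    while True:
        outbox = {v: [] for v in values}
        sent_any = False
        for v in values:
            messages = inbox[v]
            changed = False
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                changed = True
            # Send in the first superstep, or whenever the value grew.
            if superstep == 0 or changed:
                for n in neighbors.get(v, []):
                    outbox[n].append(values[v])
                    sent_any = True
            # Otherwise the vertex votes to halt.
        if not sent_any:
            break  # all vertices halted and no messages are in flight
        inbox = outbox  # superstep barrier
        superstep += 1
    return values


# Assumed undirected chain 1 - 2 - 3.
chain = {1: [2], 2: [1, 3], 3: [2]}
result = propagate_max({1: 3.4, 2: 6.1, 3: 2.7}, chain)
```

Every vertex ends with the global maximum only if the maximum can reach it along the edges, which holds for the connected chain used here.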
- 12. Aggregators
  ● Shared variables among the workers.
  ● Each vertex computation can add/multiply a value to aggregators.
  ● Examples:
    ○ Holding the min/max value among all vertices
    ○ Holding the sum of the vertex values
    ○ Holding the average of the vertex values
    ○ Holding the sum of mean square errors and the standard deviation
  [Figure: computation at iteration k. Vertices 1, 2, 3 with values 0.2, 0.6, 0.45 aggregate to the sum 1.25.]
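The idea behind sum and max aggregators can be sketched as follows: each worker folds the values of its own vertices into a local partial result, and the partial results are then merged. The class and function names are hypothetical and are not Giraph's aggregator classes; the split of the three vertices over two workers is an assumption.

```python
import math


class SumAggregator:
    def __init__(self):
        self.value = 0.0

    def aggregate(self, x):
        self.value += x

    def merge(self, other):
        self.value += other.value


class MaxAggregator:
    def __init__(self):
        self.value = -math.inf

    def aggregate(self, x):
        self.value = max(self.value, x)

    def merge(self, other):
        self.value = max(self.value, other.value)


def aggregate_across_workers(partitions, aggregator_class):
    """Aggregate locally on each worker, then merge the partial results."""
    total = aggregator_class()
    for partition in partitions:
        local = aggregator_class()
        for value in partition.values():
            local.aggregate(value)
        total.merge(local)
    return total.value


# Vertex values from the slide, split across two assumed workers.
partitions = [{1: 0.2, 2: 0.6}, {3: 0.45}]
total = aggregate_across_workers(partitions, SumAggregator)
largest = aggregate_across_workers(partitions, MaxAggregator)
```

Merging partial results is only valid because sum and max are commutative and associative, so the order in which workers report does not change the outcome.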
- 13. MasterCompute Class
  ● The master's compute() always runs before the workers' compute() in each superstep (like a pre-superstep hook).
    ○ In the vertex compute(): aggregate vertex values, e.g. the sum of values.
    ○ In MasterCompute: average = sum / N
  ● Aggregators are registered here.
  ● The master can set aggregator values.
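The ordering described above can be illustrated with a small simulation: in each superstep the master phase runs first and turns the sum aggregated during the previous superstep into an average, then the vertices contribute to the sum again. This is a plain-Python sketch with hypothetical names, not the Giraph `MasterCompute` API.

```python
import math


def run_supersteps(vertex_values, num_supersteps):
    """Return the average published by the master in each superstep."""
    n = len(vertex_values)
    sum_aggregator = 0.0  # filled by the vertices during the previous superstep
    average = None        # nothing has been aggregated before superstep 0
    history = []

    for superstep in range(num_supersteps):
        # Master phase: runs before any vertex computation in this superstep.
        if superstep > 0:
            average = sum_aggregator / n
        history.append(average)

        # Worker phase: every vertex adds its value to the sum aggregator.
        next_sum = 0.0
        for value in vertex_values.values():
            next_sum += value
        sum_aggregator = next_sum
    return history


history = run_supersteps({1: 0.2, 2: 0.6, 3: 0.45}, num_supersteps=3)
```

The average only becomes available in superstep 1, because the master can read an aggregated value only after the vertices have contributed to it.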
- 14. Worker Context
  ● Allows user code to be executed on a per-worker basis.
  ● There is one WorkerContext per worker.
  ● Provides methods for pre/post-superstep and pre/post-application operations.
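The order in which such per-worker hooks fire can be sketched like this. The class, the snake_case method names, and the `run_worker` driver are hypothetical stand-ins written in plain Python; they only mirror the pre/post superstep and application hooks named above.

```python
class RecordingWorkerContext:
    """Records every hook call so the ordering can be inspected."""

    def __init__(self):
        self.events = []

    def pre_application(self):
        self.events.append("pre_application")

    def pre_superstep(self, superstep):
        self.events.append(f"pre_superstep {superstep}")

    def post_superstep(self, superstep):
        self.events.append(f"post_superstep {superstep}")

    def post_application(self):
        self.events.append("post_application")


def run_worker(context, num_supersteps, compute):
    """Drive one worker: application hooks wrap the run, superstep hooks wrap each step."""
    context.pre_application()
    for superstep in range(num_supersteps):
        context.pre_superstep(superstep)
        compute(superstep)  # stands in for computing this worker's vertices
        context.post_superstep(superstep)
    context.post_application()


context = RecordingWorkerContext()
run_worker(context, 2, lambda step: context.events.append(f"compute {step}"))
```

Because there is one context per worker, state kept on it (here, the event list) is shared by all vertices handled by that worker but not across workers.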
- 15. Flexible Edge/Vertex Input
  ● Read edges and vertices from different sources.
  ● Multiple input sources are supported.
- 16. Parallel Computing
  ● More map jobs (workers) = more parallelism.
  ● To overcome the slowest-worker problem, multithreading is applied to input, computation, and output.
  ● Linear speedup in CPU-bound applications such as k-means clustering, due to multithreading.
  ● Take a set of entire machines and use multithreading to maximize resource utilization.
- 17. Memory Optimization
  ● Vertices and edges are stored as serialized byte arrays.
  ● Uses FastUtil-based collections of Java primitives.
- 18. Sharded Aggregators
  ● Each aggregator is randomly assigned to one of the workers.
  ● The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to the other workers.
  ● Aggregation responsibilities are balanced across all workers rather than bottlenecked by the master.
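The gather, aggregate, and distribute steps above can be sketched in plain Python. All names here are hypothetical, only sum aggregators are modelled, and a deterministic round-robin assignment replaces the random assignment from the slide so that the example is reproducible.

```python
def assign_owners(aggregator_names, num_workers):
    """Map each aggregator to an owning worker (round-robin stand-in for random)."""
    return {
        name: index % num_workers
        for index, name in enumerate(sorted(aggregator_names))
    }


def aggregate_sharded(worker_partials, owners):
    """worker_partials[i] holds worker i's partial value for every aggregator."""
    num_workers = len(worker_partials)
    owned = [{} for _ in range(num_workers)]

    # Gather and aggregate: each owner sums the partial values of its aggregators.
    for partials in worker_partials:
        for name, value in partials.items():
            shard = owned[owners[name]]
            shard[name] = shard.get(name, 0.0) + value

    # Distribute: every worker receives the final value of every aggregator.
    final = {}
    for shard in owned:
        final.update(shard)
    return [dict(final) for _ in range(num_workers)]


# Hypothetical aggregators and partial values on two workers.
worker_partials = [
    {"edges": 3.0, "error": 0.5, "updates": 1.0},
    {"edges": 4.0, "error": 0.25, "updates": 2.0},
]
owners = assign_owners(["edges", "error", "updates"], num_workers=2)
results = aggregate_sharded(worker_partials, owners)
```

With the owners spread across workers, no single node has to receive the partial values of every aggregator, which is the balancing effect the slide describes.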
- 19. Performance
  ● PageRank on 1 trillion edges with 200 commodity machines: 4 minutes per iteration.
  ● K-means on 1 billion input vectors with 100 features each, clustered into 10,000 centroids: 10 minutes.
  ● Linear scalability
- 20. Currently
  ● Version 1.0, on the way to 1.1
  ● Changing rapidly, with backwards-incompatible changes
  ● Documentation is not mature yet.
  ● More algorithms to be contributed.
  ● More data sources to be ported.
  ● See http://giraph.apache.org for more info.
- 21. References
  ● Giraph: Large-scale graph processing infrastructure on Hadoop, 2011.
  ● Scaling Apache Giraph to a trillion edges, Avery Ching, Facebook, 2013.
  ● Scaling Apache Giraph, Nitay Joffe, Facebook, 2013.
  ● Giraph: http://giraph.apache.org
- 22. Questions?
