Apache Giraph
An introduction to Apache Giraph.

Transcript

  • 1. A Distributed Graph-Processing Library. Ahmet Emre Aladağ (AGMLab), 26.08.2013
  • 2. What is Giraph? ● A library for large-scale graph processing. ● Runs on Apache Hadoop as map-only jobs. ● Uses the Bulk Synchronous Parallel (BSP) model. [Diagram: a vertex computation consuming incoming messages and emitting outgoing messages]
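To make the BSP model concrete, here is a minimal single-machine simulation (a hypothetical sketch, not Giraph's API) running PageRank: every superstep, each vertex consumes the messages sent in the previous superstep, updates its value, and sends new messages, with a barrier between supersteps.

```java
import java.util.*;

// Toy single-machine simulation of the BSP model (hypothetical, not Giraph API):
// each superstep, every vertex consumes messages from the previous superstep,
// updates its value, and sends new messages; a barrier separates supersteps.
public class BspPageRank {
    public static double[] run(int[][] outEdges, int supersteps) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        List<List<Double>> inbox = emptyBoxes(n);
        for (int step = 0; step < supersteps; step++) {
            List<List<Double>> outbox = emptyBoxes(n);
            for (int v = 0; v < n; v++) {
                // "compute": combine incoming messages into the new value
                if (step > 0) {
                    double sum = 0;
                    for (double m : inbox.get(v)) sum += m;
                    rank[v] = 0.15 / n + 0.85 * sum;
                }
                // send this vertex's rank share along all out-edges;
                // messages are only delivered in the NEXT superstep
                for (int dst : outEdges[v])
                    outbox.get(dst).add(rank[v] / outEdges[v].length);
            }
            inbox = outbox; // superstep barrier: swap message buffers
        }
        return rank;
    }
    private static List<List<Double>> emptyBoxes(int n) {
        List<List<Double>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) boxes.add(new ArrayList<>());
        return boxes;
    }
    public static void main(String[] args) {
        // 3-vertex cycle 0 -> 1 -> 2 -> 0: ranks stay uniform at 1/3
        double[] r = run(new int[][]{{1}, {2}, {0}}, 5);
        System.out.println(Arrays.toString(r));
    }
}
```

The double-buffered inboxes are the essence of BSP: a vertex never sees a message sent in the current superstep, which is what makes the model deterministic and easy to parallelize.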
  • 3. Uses ● PageRank-variant iterative algorithms ● Graph clustering ○ Label propagation ○ Max Clique ○ Triangle Closure ○ Finding related people, groups, interests. ● Shortest-Path ○ Single source, s-t, all to all ● Finding Connected Components
  • 4. Alternatives ● Map-Reduce jobs on Hadoop ○ Not a good fit for iterative graph algorithms: per-iteration overhead. ● Google Pregel ○ Requires its own infrastructure ○ Not publicly available ○ Master is a single point of failure. ● Message Passing Interface (MPI) ○ Not fault-tolerant ○ Too generic
  • 5. How Giraph differs ● You can use an existing Hadoop cluster; no need for special infrastructure. ● Easy deployment with Amazon EMR ● Dynamic resource management ● Graph-oriented API ● Open source ● Fault-tolerant, with no SPOF except the Hadoop namenode and jobtracker ● Jython support
  • 6. Layers
  • 7. Mechanism: InputFormat/Reader → Input → Computation → OutputFormat/Writer → Output. Input sources: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j, etc. Output targets: the same stores, plus formats such as GraphViz, adjacency matrix, id-value pairs, and JSON.
  • 8. InputFormat ● VertexInputFormat: lines of the form id;value (e.g. 1;3.4 2;6.1 3;2.7) ● EdgeInputFormat: lines of the form source;target (e.g. 1;2 2;3 1;3) [Diagram: the resulting three-vertex graph with vertex values 3.4, 6.1, and 2.7]
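As a sketch of what such readers do (plain Java, not Giraph's actual VertexInputFormat/EdgeInputFormat classes), parsing the slide's two sample inputs might look like:

```java
import java.util.*;

// Sketch of parsing the two input styles from the slide (plain Java,
// not Giraph's input-format classes): "id;value" lines for vertices
// and "source;target" lines for edges.
public class GiraphStyleInput {
    // "1;3.4" -> vertex 1 with value 3.4
    public static Map<Integer, Double> parseVertices(String[] lines) {
        Map<Integer, Double> values = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(";");
            values.put(Integer.parseInt(parts[0]), Double.parseDouble(parts[1]));
        }
        return values;
    }
    // "1;2" -> directed edge from vertex 1 to vertex 2
    public static Map<Integer, List<Integer>> parseEdges(String[] lines) {
        Map<Integer, List<Integer>> adjacency = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(";");
            adjacency.computeIfAbsent(Integer.parseInt(parts[0]), k -> new ArrayList<>())
                     .add(Integer.parseInt(parts[1]));
        }
        return adjacency;
    }
    public static void main(String[] args) {
        Map<Integer, Double> vertices =
            parseVertices(new String[]{"1;3.4", "2;6.1", "3;2.7"});
        Map<Integer, List<Integer>> edges =
            parseEdges(new String[]{"1;2", "2;3", "1;3"});
        System.out.println(vertices + " " + edges);
    }
}
```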
  • 9. Computation ● Superstep barriers. ● Send/receive messages to/from neighbors. ● Update the vertex value. ● Vote to halt, or wake up on an incoming message. Single-Source Shortest Path example.
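The four steps above can be sketched as a toy BSP single-source shortest-paths run on an unweighted graph (a hypothetical stand-in for the Giraph SSSP example, not actual Giraph API): a vertex computes only when it receives messages, re-broadcasts when its distance improves, and the job ends when no messages remain in flight.

```java
import java.util.*;

// Toy BSP single-source shortest paths on an unweighted graph
// (hypothetical sketch, not Giraph API). A vertex is woken by messages,
// updates its distance if an improvement arrives, notifies its neighbors,
// and then votes to halt; the job ends when no messages are in flight.
public class BspShortestPath {
    public static int[] run(int[][] outEdges, int source) {
        int n = outEdges.length;
        int[] dist = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        Map<Integer, Integer> inbox = new HashMap<>();
        inbox.put(source, 0); // superstep 0: the source "receives" distance 0
        while (!inbox.isEmpty()) {
            Map<Integer, Integer> outbox = new HashMap<>();
            for (Map.Entry<Integer, Integer> msg : inbox.entrySet()) {
                int v = msg.getKey(), d = msg.getValue();
                if (d < dist[v]) {   // improvement: update and notify neighbors
                    dist[v] = d;
                    for (int dst : outEdges[v])
                        outbox.merge(dst, d + 1, Math::min); // keep min message
                }
                // otherwise the vertex simply votes to halt (does nothing)
            }
            inbox = outbox; // superstep barrier
        }
        return dist;
    }
    public static void main(String[] args) {
        int[][] g = {{1, 2}, {3}, {3}, {}};
        System.out.println(Arrays.toString(run(g, 0))); // [0, 1, 1, 2]
    }
}
```

The `merge(..., Math::min)` call plays the role of a Giraph message combiner: redundant messages to the same vertex are reduced before delivery.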
  • 10. Shortest-Path Computation Code (note: the slide shows the old API)
  • 11. Example: finding the maximum value
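The maximum-value example can be simulated the same way (hypothetical sketch, not Giraph API): every vertex keeps the largest value it has seen, re-broadcasts only when its value changes, and votes to halt otherwise.

```java
import java.util.*;

// Toy BSP simulation of the classic maximum-value example (hypothetical
// code, not Giraph API): each vertex keeps the largest value seen so far,
// re-broadcasts only on change, and halts when no vertex changes.
public class BspMaxValue {
    public static double[] run(int[][] neighbors, double[] initial) {
        int n = neighbors.length;
        double[] value = initial.clone();
        Set<Integer> active = new HashSet<>();
        for (int v = 0; v < n; v++) active.add(v); // all vertices start active
        while (!active.isEmpty()) {
            // active vertices broadcast their current value
            Map<Integer, Double> inbox = new HashMap<>();
            for (int v : active)
                for (int dst : neighbors[v])
                    inbox.merge(dst, value[v], Math::max);
            // a vertex wakes up only if a strictly larger value arrives
            Set<Integer> next = new HashSet<>();
            for (Map.Entry<Integer, Double> msg : inbox.entrySet()) {
                int v = msg.getKey();
                if (msg.getValue() > value[v]) {
                    value[v] = msg.getValue();
                    next.add(v);
                }
            }
            active = next; // unchanged vertices stay halted
        }
        return value;
    }
    public static void main(String[] args) {
        int[][] ring = {{1}, {2}, {0}}; // directed cycle 0 -> 1 -> 2 -> 0
        System.out.println(Arrays.toString(run(ring, new double[]{3.4, 6.1, 2.7})));
        // every vertex converges to the global maximum, 6.1
    }
}
```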
  • 12. Aggregators ● Shared variables among the workers. ● Each vertex computation can contribute a value to an aggregator (e.g. by adding or multiplying). ● Examples: ○ Holding the min/max value among all vertices ○ Holding the sum of the vertex values ○ Holding the average of the vertex values ○ Holding the sum of squared errors and the standard deviation [Diagram: vertex values 0.2, 0.6, 0.45 aggregated into the sum 1.25 during the computation at iteration k]
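A toy model of the aggregator contract (hypothetical, not Giraph's Aggregator API) makes the timing explicit: contributions made during superstep k are reduced at the barrier and only become readable in superstep k+1.

```java
import java.util.*;
import java.util.function.DoubleBinaryOperator;

// Toy model of an aggregator (hypothetical, not Giraph API): during a
// superstep each vertex contributes a value; at the barrier the
// contributions are reduced with the aggregator's operator, and the
// result is readable by every vertex in the NEXT superstep.
public class ToyAggregator {
    private final DoubleBinaryOperator op;
    private final double identity;
    private double current;   // readable this superstep (previous result)
    private double pending;   // being accumulated for the next superstep

    public ToyAggregator(DoubleBinaryOperator op, double identity) {
        this.op = op;
        this.identity = identity;
        this.current = identity;
        this.pending = identity;
    }
    public void aggregate(double v) { pending = op.applyAsDouble(pending, v); }
    public double read() { return current; }
    public void barrier() { current = pending; pending = identity; } // superstep boundary

    public static void main(String[] args) {
        // sum aggregator over the slide's vertex values 0.2, 0.6, 0.45
        ToyAggregator sum = new ToyAggregator(Double::sum, 0.0);
        for (double v : new double[]{0.2, 0.6, 0.45}) sum.aggregate(v); // superstep k
        sum.barrier();
        System.out.println(sum.read()); // ~1.25, visible at superstep k+1
    }
}
```

Swapping in `Math::max` with identity `Double.NEGATIVE_INFINITY` gives a max aggregator; the pattern is the same for min, product, or error sums.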
  • 13. MasterCompute Class ● The master's compute() always runs before the workers' (like a pre-superstep hook). ○ In Computation: aggregate vertex values, e.g. a running sum. ○ In MasterCompute: average = sum / N. ● Aggregators are registered here. ● You can also set aggregator values here.
  • 14. WorkerContext ● Allows execution of user code on a per-worker basis. ● There is one WorkerContext per worker. ● Provides methods for pre/post-superstep and pre/post-application operations.
  • 15. Flexible Edge/Vertex Input ● Read edges/vertices from different sources. ● Multiple input sources can be combined.
  • 16. Parallel Computing ● More map tasks (workers) = more parallelism. ● To mitigate the slowest-worker problem, multithreading is applied to input, computation, and output. ● Near-linear speedup from multithreading in CPU-bound applications such as k-means clustering. ● Take a set of entire machines and use multithreading to maximize resource utilization.
  • 17. Memory Optimization ● Vertices and edges are stored as serialized byte arrays. ● Uses FastUtil-based collections of Java primitives instead of boxed objects.
  • 18. Sharded Aggregators ● Each aggregator is randomly assigned to one of the workers. ● The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to other workers. ● Aggregation responsibilities are balanced across all workers rather than bottlenecked by the master.
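The assignment step can be illustrated with a simple hash (an illustration of the idea, not Giraph's actual implementation): each aggregator name deterministically maps to an owning worker, so aggregation work is spread across workers instead of funneled through the master.

```java
// Illustration of the sharded-aggregator idea (not Giraph's actual
// implementation): each aggregator name is hashed to an owning worker.
// That worker gathers the aggregator's partial values from all workers,
// reduces them, and redistributes the result, so no single node owns
// every aggregator.
public class ShardedOwners {
    public static int owner(String aggregatorName, int numWorkers) {
        // Math.floorMod keeps the result non-negative for any hashCode
        return Math.floorMod(aggregatorName.hashCode(), numWorkers);
    }
    public static void main(String[] args) {
        String[] names = {"sum", "min", "max", "avg"};
        for (String name : names)
            System.out.println(name + " -> worker " + owner(name, 4));
    }
}
```

Because the assignment depends only on the name, every worker can compute the owner locally, with no coordination round needed.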
  • 19. Performance ● PageRank on 1 trillion edges with 200 commodity machines: 4 minutes/iteration. ● K-means on 1 billion input vectors × 100 features into 10,000 centroids: 10 minutes. ● Linear scalability
  • 20. Currently ● Version 1.0, on the way to 1.1 ● Changing rapidly: backwards-incompatible changes ● Documentation not mature yet. ● More algorithms to be contributed. ● More data sources to be ported. ● http://giraph.apache.org for more info
  • 21. References ● Giraph: Large-scale graph processing infrastructure on Hadoop, 2011. ● Avery Ching, Scaling Apache Giraph to a trillion edges, Facebook, 2013. ● Nitay Joffe, Scaling Apache Giraph, Facebook, 2013. ● Giraph: http://giraph.apache.org
  • 22. Questions?
