Learn how to process large graphs in Hadoop with Apache Giraph

Published in: Software, Technology, Education

1. Processing Large Graphs in Hadoop, by Dani Solà
2. Index
● The Problem
● Google's Pregel
● Example
● Apache Giraph
3. The Problem
● Processing graphs in MapReduce (MR) is not practical:
  – Most graph algorithms are iterative
  – Each iteration maps to a separate MR job
  – The whole computation takes too long if many iterations are required
  – Writing MR jobs for graph processing is not easy
4. Google's Pregel
● Framework for iterative processing of large graphs
● Inspired by the Bulk Synchronous Parallel model
● Computation is distributed among N+1 nodes
  – N workers that do the actual work
  – 1 master that synchronizes them
● Takes a vertex-centric approach
  – It is much easier to focus on the algorithm
● Paper: http://kowshik.github.io/JPregel/pregel_paper.pdf
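The N workers plus 1 master arrangement can be illustrated on a single machine. The following is only a sketch under loud assumptions: threads stand in for worker nodes, and a `CyclicBarrier` action stands in for the master's synchronization step. The class name `BspSketch` and its methods are hypothetical and are not part of Pregel or Giraph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-machine illustration of the BSP pattern, not Pregel itself.
final class BspSketch {

    /** Runs the given number of supersteps on N worker threads; returns how many barriers completed. */
    static int run(int workers, int supersteps) {
        AtomicInteger finishedSupersteps = new AtomicInteger();
        // The barrier action plays the master's role: it runs once per superstep,
        // only after every worker has arrived.
        CyclicBarrier barrier = new CyclicBarrier(workers, finishedSupersteps::incrementAndGet);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            Callable<Void> worker = () -> {
                for (int s = 0; s < supersteps; s++) {
                    // Local computation on this worker's share of the vertices would go here.
                    barrier.await(); // wait until all workers have finished superstep s
                }
                return null;
            };
            List<Future<Void>> results = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                results.add(pool.submit(worker));
            }
            for (Future<Void> r : results) {
                r.get(); // propagate any worker failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return finishedSupersteps.get();
    }

    public static void main(String[] args) {
        System.out.println(run(3, 4)); // number of completed supersteps
    }
}
```

No worker can start superstep s + 1 before every worker has finished superstep s, which is the property the master enforces in the real system.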
5. Pregel Main Concepts
● Computations are a sequence of supersteps
● Vertices are randomly distributed among nodes
● Vertices have values and directed edges to other vertices
● Vertices can send messages to other vertices
● Messages sent at superstep S are received at superstep S + 1
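The rule that messages sent at superstep S only become visible at superstep S + 1 can be modelled with two buffers that swap at the superstep boundary. This is a minimal sketch, not how Pregel or Giraph store messages internally; the class `SuperstepMailbox` and its method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical double-buffered mailbox; names are illustrative only.
final class SuperstepMailbox {
    private Map<Integer, List<String>> current = new HashMap<>(); // readable in this superstep
    private Map<Integer, List<String>> next = new HashMap<>();    // written in this superstep

    /** Queues a message for delivery in the following superstep. */
    void send(int targetVertex, String message) {
        next.computeIfAbsent(targetVertex, k -> new ArrayList<>()).add(message);
    }

    /** Returns the messages that were sent to this vertex in the previous superstep. */
    List<String> receive(int vertex) {
        return current.getOrDefault(vertex, Collections.<String>emptyList());
    }

    /** Superstep boundary: what was sent becomes readable, and the outgoing buffer is reset. */
    void advance() {
        current = next;
        next = new HashMap<>();
    }

    public static void main(String[] args) {
        SuperstepMailbox box = new SuperstepMailbox();
        box.send(2, "hello");               // sent during superstep S
        System.out.println(box.receive(2)); // still superstep S: prints []
        box.advance();
        System.out.println(box.receive(2)); // superstep S + 1: prints [hello]
    }
}
```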
6. Computation Life Cycle
● Initially, all vertices are active
● Inactive vertices activate again on receiving messages
● In each superstep, active vertices:
  – Receive messages from the previous superstep
  – Can change their value depending on their state
  – Can check the value of their neighbors
  – Can send messages to other vertices
  – Can vote to halt, becoming inactive
● When all vertices are inactive, computation ends
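The active/inactive transitions in this life cycle amount to a small state machine. The sketch below is an assumption-laden illustration: `VertexLifeCycle` and its methods are hypothetical names, and it models only the activity flag, not values or edges.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical model of a vertex's activity flag only.
final class VertexLifeCycle {
    private boolean active = true; // initially, every vertex is active

    /** The vertex declares it has no more work; it stays halted until a message arrives. */
    void voteToHalt() {
        active = false;
    }

    /** Called at the start of a superstep with the number of messages that arrived. */
    void beginSuperstep(int incomingMessages) {
        if (incomingMessages > 0) {
            active = true; // receiving messages reactivates a halted vertex
        }
    }

    boolean isActive() {
        return active;
    }

    /** The computation ends once no vertex is active. */
    static boolean computationEnded(Collection<VertexLifeCycle> vertices) {
        for (VertexLifeCycle v : vertices) {
            if (v.isActive()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        VertexLifeCycle a = new VertexLifeCycle();
        VertexLifeCycle b = new VertexLifeCycle();
        a.voteToHalt();
        b.voteToHalt();
        System.out.println(computationEnded(Arrays.asList(a, b))); // prints true
        b.beginSuperstep(1); // a message wakes b up again
        System.out.println(computationEnded(Arrays.asList(a, b))); // prints false
    }
}
```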
7. Ex: Shortest Path A → D
● Single-source shortest paths example
● Want to find the shortest path from A to D
● For simplicity, edges have value 1
8. Ex: Shortest Path A → D
● Values: A: 0, B: ∞, C: ∞, D: ∞, E: ∞
● Superstep 0: All vertices are active; A sends messages and halts
● Messages sent: 0+1, 0+1, 0+1
9. Ex: Shortest Path A → D
● Values: A: 0, B: 1, C: 1, D: ∞, E: 1
● Superstep 1: B, C and E get the messages and update their values
● Messages sent: 1+1, 1+1
10. Ex: Shortest Path A → D
● Values: A: 0, B: 1, C: 1, D: 2, E: 1
● Superstep 2: E gets a message from B, but doesn't change its value
11. Ex: Shortest Path A → D
● Values: A: 0, B: 1, C: 1, D: 2, E: 1
● Superstep 3: All vertices have halted and the computation ends
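The walkthrough above can be reproduced with a small vertex-centric simulation in plain Java. This is a sketch, not Giraph code. The edge set (A→B, A→C, A→E, B→E, C→D) is reconstructed from the messages described in the transcript. The C→D edge in particular is an assumption, since the transcript does not say which vertex sends the message that reaches D. The class name `PregelShortestPath` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical single-process simulation of the slide example.
final class PregelShortestPath {
    static final int INF = Integer.MAX_VALUE;

    /** Assumed edges for the example; C -> D is a guess, see the text above. */
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> g = new TreeMap<>();
        g.put("A", Arrays.asList("B", "C", "E"));
        g.put("B", Arrays.asList("E"));
        g.put("C", Arrays.asList("D"));
        g.put("D", Collections.<String>emptyList());
        g.put("E", Collections.<String>emptyList());
        return g;
    }

    /** Every vertex must appear as a key in edges. All edges have value 1. */
    static Map<String, Integer> run(Map<String, List<String>> edges, String source) {
        Map<String, Integer> value = new TreeMap<>();
        for (String v : edges.keySet()) {
            value.put(v, INF);
        }
        Map<String, List<Integer>> inbox = new HashMap<>();
        boolean first = true;
        // Stop when every vertex has halted and no messages are in flight.
        while (first || !inbox.isEmpty()) {
            Map<String, List<Integer>> outbox = new HashMap<>();
            // Superstep 0: all vertices are active. Later: only vertices with messages wake up.
            Set<String> active = first ? edges.keySet() : inbox.keySet();
            for (String v : active) {
                int best = (first && v.equals(source)) ? 0 : INF;
                for (int m : inbox.getOrDefault(v, Collections.<Integer>emptyList())) {
                    best = Math.min(best, m);
                }
                if (best < value.get(v)) {
                    value.put(v, best);
                    for (String n : edges.get(v)) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(best + 1);
                    }
                }
                // Each vertex implicitly votes to halt at the end of its compute step.
            }
            inbox = outbox; // messages sent now are received in the next superstep
            first = false;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(run(exampleGraph(), "A")); // prints {A=0, B=1, C=1, D=2, E=1}
    }
}
```

Under the assumed edges, the loop executes supersteps 0, 1 and 2 and then finds an empty inbox, which corresponds to the point labelled "Superstep 3" in the example.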
12. Apache Giraph
● Open-source implementation of Pregel
● Started by Yahoo, used by FB, LinkedIn, Twitter
● Built on top of Hadoop and ZooKeeper:
  – Mappers are used as nodes: N workers + 1 master
  – Master-worker coordination via ZooKeeper
  – Natively reads from and writes to HDFS
  – Natively reads and writes Writables
  – Can use counters, distributed cache, etc.
● https://giraph.apache.org/
13. Apache Giraph
● Pros:
  – Integrates well with Hadoop
  – Has many examples included
  – Much better tool for processing graphs than raw MR
● Cons:
  – Documentation could be better
  – Still evolving: API changes in Giraph 1.1.0
  – Not as widely used as other Hadoop projects
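14. Giraph Vertex API
    // Imports assume the Giraph 1.0.0 package layout.
    import java.io.IOException;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;

    // Type parameters, in order: Vertex ID Type, Vertex Value Type,
    // Edge Value Type, Message Value Type.
    // val, neighbor and value are placeholders, not defined here.
    public class MyVertex extends Vertex<IntWritable, IntWritable, NullWritable, IntWritable> {
      @Override
      public void compute(Iterable<IntWritable> msgs) throws IOException {
        long superstep = getSuperstep();  // Current superstep
        setValue(val);                    // Modifies vertex value
        sendMessage(neighbor, value);     // Sends message to a neighbor
        sendMessageToAllEdges(value);     // Sends message to all neighbors
      }
    }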
15. Giraph Vertex API
● Look at the shortest path source code:
  – SimpleShortestPathsVertex.java (v1.0.0)
16. Giraph Input/Output
● You can read vertex-oriented (adjacency list) or edge-oriented (pairs of vertices) files
● Many formats already available:
  – VertexInputFormat / VertexOutputFormat
  – HiveVertexInputFormat / HiveVertexOutputFormat
  – …
● You can read any other format by extending VertexInputFormat / EdgeInputFormat
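To make the difference between vertex-oriented and edge-oriented input concrete, here is a plain-Java sketch that parses both shapes into the same in-memory graph. It does not use Giraph's input format classes. The whitespace-separated line layouts and the class name `GraphTextFormats` are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical parsers; the line layouts are assumed, not a Giraph-defined format.
final class GraphTextFormats {

    /** Vertex-oriented: each line is a vertex followed by its neighbors, e.g. "A B C E". */
    static Map<String, List<String>> fromAdjacencyLines(List<String> lines) {
        Map<String, List<String>> graph = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = line.trim().split("\\s+");
            List<String> neighbors = graph.computeIfAbsent(tokens[0], k -> new ArrayList<>());
            for (int i = 1; i < tokens.length; i++) {
                neighbors.add(tokens[i]);
                graph.computeIfAbsent(tokens[i], k -> new ArrayList<>()); // register the target vertex
            }
        }
        return graph;
    }

    /** Edge-oriented: each line is one directed edge as a pair of vertices, e.g. "A B". */
    static Map<String, List<String>> fromEdgeLines(List<String> lines) {
        Map<String, List<String>> graph = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = line.trim().split("\\s+");
            graph.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(tokens[1]);
            graph.computeIfAbsent(tokens[1], k -> new ArrayList<>()); // register the target vertex
        }
        return graph;
    }

    public static void main(String[] args) {
        List<String> adjacency = Arrays.asList("A B C E", "B E", "C D", "D", "E");
        List<String> edgePairs = Arrays.asList("A B", "A C", "A E", "B E", "C D");
        // Both inputs describe the same graph, so the parsed maps are equal.
        System.out.println(fromAdjacencyLines(adjacency).equals(fromEdgeLines(edgePairs)));
    }
}
```

One practical difference: an edge-oriented file cannot express a vertex with no edges at all, while a vertex-oriented file can list it on a line of its own.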
17. Thanks!