Processing Large Graphs in Hadoop
Dani Solà
Learn how to process large graphs in Hadoop with Apache Giraph

Published in: Software, Technology, Education
  1. Processing Large Graphs in Hadoop
     Dani Solà
  2. Index
     ● The Problem
     ● Google's Pregel
     ● Example
     ● Apache Giraph
  3. The Problem
     ● Processing graphs in MapReduce (MR) is not practical:
       – Most graph algorithms are iterative
       – Each iteration is mapped to an MR job
       – Takes too long if many iterations are required
       – Writing MR code for graph processing is not easy
  4. Google's Pregel
     ● Framework for iterative processing of large graphs
     ● Inspired by the Bulk Synchronous Parallel model
     ● Computation is distributed among N+1 nodes
       – N workers that do the actual work
       – 1 master that synchronizes them
     ● Takes a vertex-centric approach
       – Makes it much easier to focus on the algorithm
     http://kowshik.github.io/JPregel/pregel_paper.pdf
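The N workers plus 1 master arrangement can be illustrated on a single machine with threads. This is only a sketch under stated assumptions: the class `BspBarrierDemo` and its `run` method are hypothetical, threads stand in for worker nodes, and the barrier action stands in for the master. It is not how Pregel itself is implemented.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-machine sketch of BSP synchronization; not Pregel code.
final class BspBarrierDemo {
    // Runs `workers` threads for `supersteps` rounds and returns how many
    // supersteps the "master" saw completed.
    static int run(int workers, int supersteps) {
        AtomicInteger completed = new AtomicInteger();
        // The barrier action plays the master: it runs once per round,
        // only after every worker has reached the barrier.
        CyclicBarrier barrier = new CyclicBarrier(workers, completed::incrementAndGet);

        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            threads[w] = new Thread(() -> {
                try {
                    for (int s = 0; s < supersteps; s++) {
                        // A real worker would process its vertices here.
                        barrier.await(); // wait until all workers finish superstep s
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("Supersteps completed: " + run(3, 4));
    }
}
```

No worker can start superstep S + 1 before all workers have finished superstep S, which is the property the master enforces.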
  5. Pregel Main Concepts
     ● Computations are a sequence of supersteps
     ● Vertices are randomly distributed among nodes
     ● Vertices have values and directed edges to other vertices
     ● Vertices can send messages to other vertices
     ● Messages sent at superstep S are received at superstep S + 1
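The placement and message-delivery rules above can be sketched in plain Java. The class `Mailboxes` and all of its methods are hypothetical names, not Pregel or Giraph API. The placement rule shown (hash of the vertex ID modulo the number of workers) is an assumption based on the default described in the Pregel paper; the slide only says "randomly".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Pregel or Giraph.
final class Mailboxes {
    // Messages that vertices can read in the current superstep.
    private Map<String, List<Integer>> current = new HashMap<>();
    // Messages sent in the current superstep, readable in the next one.
    private Map<String, List<Integer>> next = new HashMap<>();

    // Assumed placement rule: hash of the vertex ID modulo the worker count.
    static int workerFor(String vertexId, int workers) {
        return Math.floorMod(vertexId.hashCode(), workers);
    }

    // Queues a message for delivery in the next superstep.
    void send(String targetId, int message) {
        next.computeIfAbsent(targetId, k -> new ArrayList<>()).add(message);
    }

    // Returns the messages delivered to a vertex for the current superstep.
    List<Integer> received(String vertexId) {
        return current.getOrDefault(vertexId, Collections.emptyList());
    }

    // Boundary between superstep S and S + 1: sent messages become readable.
    void advance() {
        current = next;
        next = new HashMap<>();
    }
}
```

Keeping two maps means a message sent during superstep S is never visible before `advance()` is called, which mirrors the S to S + 1 delivery rule.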
  6. Computation Life Cycle
     ● Initially, all vertices are active
     ● Inactive vertices become active again when they receive messages
     ● In each superstep, active vertices:
       – Receive the messages sent in the previous superstep
       – Can change their value depending on their state
       – Can inspect their outgoing edges (neighbor values are learned only through messages)
       – Can send messages to other vertices
       – Can vote to halt, becoming inactive
     ● When all vertices are inactive, the computation ends
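The active/inactive bookkeeping in this life cycle can be modeled as a small state tracker. This is a minimal sketch; `ActivityTracker` and its method names are hypothetical and only illustrate the rules on the slide.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical bookkeeping for vertex activity; not Pregel or Giraph code.
final class ActivityTracker {
    private final Set<String> active = new HashSet<>();

    // Initially, all vertices are active.
    ActivityTracker(Set<String> allVertices) {
        active.addAll(allVertices);
    }

    // A vertex votes to halt and becomes inactive.
    void voteToHalt(String vertexId) {
        active.remove(vertexId);
    }

    // Receiving a message makes an inactive vertex active again.
    void onMessage(String vertexId) {
        active.add(vertexId);
    }

    boolean isActive(String vertexId) {
        return active.contains(vertexId);
    }

    // The computation ends when every vertex is inactive.
    boolean finished() {
        return active.isEmpty();
    }
}
```

Because `onMessage` reactivates the receiver at delivery time, checking `finished()` after delivery also covers the case where messages are still pending.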
  7. Ex: Shortest Path A → D
     ● Single source shortest paths example
     ● Want to find the shortest path from A to D
     ● For simplicity, edges have value 1
  8. Ex: Shortest Path A → D
     Superstep 0: All vertices active, A sends messages and halts
     Values: A: 0, B: ∞, C: ∞, D: ∞, E: ∞
     Messages sent (figure labels): 0+1, 0+1, 0+1
  9. Ex: Shortest Path A → D
     Superstep 1: B, C, E get the messages and update their values
     Values: A: 0, B: 1, C: 1, D: ∞, E: 1
     Messages sent (figure labels): 1+1, 1+1
  10. Ex: Shortest Path A → D
      Superstep 2: E gets a message from B, but doesn't change its value
      Values: A: 0, B: 1, C: 1, D: 2, E: 1
  11. Ex: Shortest Path A → D
      Superstep 3: All vertices have halted and the computation ends
      Values: A: 0, B: 1, C: 1, D: 2, E: 1
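The walkthrough above can be reproduced with a single-machine simulation in plain Java. This is a sketch, not Giraph code: `ShortestPathDemo` is a hypothetical class, and the edge list (A→B, A→C, A→E, B→E, C→D) is an assumption chosen to be consistent with the messages shown on the slides, since the original figure is not available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-machine simulation of the slides' walkthrough; not Giraph code.
final class ShortestPathDemo {
    static final int INF = Integer.MAX_VALUE;

    // Assumed edges: the slides show the messages, not the full edge list.
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("A", Arrays.asList("B", "C", "E"));
        g.put("B", Arrays.asList("E"));
        g.put("C", Arrays.asList("D"));
        g.put("D", Collections.<String>emptyList());
        g.put("E", Collections.<String>emptyList());
        return g;
    }

    // Returns each vertex's distance from `source`; every edge has value 1.
    static Map<String, Integer> run(Map<String, List<String>> graph, String source) {
        Map<String, Integer> dist = new LinkedHashMap<>();
        for (String v : graph.keySet()) {
            dist.put(v, INF);
        }
        Map<String, List<Integer>> inbox = new HashMap<>();
        boolean first = true; // superstep 0: all vertices are active
        while (first || !inbox.isEmpty()) {
            Map<String, List<Integer>> outbox = new HashMap<>();
            for (String v : graph.keySet()) {
                List<Integer> msgs = inbox.getOrDefault(v, Collections.<Integer>emptyList());
                // After superstep 0, only vertices that received messages run.
                if (!first && msgs.isEmpty()) {
                    continue;
                }
                int best = v.equals(source) ? 0 : INF;
                for (int m : msgs) {
                    best = Math.min(best, m);
                }
                if (best < dist.get(v)) {
                    dist.put(v, best);
                    for (String n : graph.get(v)) {
                        outbox.computeIfAbsent(n, k -> new ArrayList<>()).add(best + 1);
                    }
                }
                // Every vertex votes to halt at the end of its compute step.
            }
            inbox = outbox; // messages sent at S are received at S + 1
            first = false;
        }
        return dist;
    }

    public static void main(String[] args) {
        System.out.println(run(exampleGraph(), "A"));
    }
}
```

With the assumed edges, the final distances match the last slide of the example: A: 0, B: 1, C: 1, D: 2, E: 1. The loop stops once a superstep produces no messages, which corresponds to all vertices having halted.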
  12. Apache Giraph
      ● Open-source implementation of Pregel
      ● Started by Yahoo, used by Facebook, LinkedIn, Twitter
      ● Built on top of Hadoop & ZooKeeper:
        – Mappers are used as nodes: N workers + 1 master
        – Master-worker coordination via ZooKeeper
        – Natively reads from and writes to HDFS
        – Natively reads and writes Writables
        – Can use counters, distributed cache, etc.
      https://giraph.apache.org/
  13. Apache Giraph
      ● Pros:
        – Integrates well with Hadoop
        – Has many examples included
        – Much better tool for processing graphs than raw MR
      ● Cons:
        – Documentation could be better
        – Still evolving: API changes in Giraph 1.1.0
        – Not as widely used as other Hadoop projects
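  14. Giraph Vertex API

      import java.io.IOException;
      import org.apache.giraph.graph.Vertex;      // Giraph 1.0.0 package
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.NullWritable;

      public class MyVertex
          extends Vertex<IntWritable, IntWritable, NullWritable, IntWritable> {
        @Override
        public void compute(Iterable<IntWritable> msgs) throws IOException {
          // val, neighbor and value are placeholders, as on the slide
          long superstep = getSuperstep();  // Current superstep
          setValue(val);                    // Modifies vertex value
          sendMessage(neighbor, value);     // Sends message to a neighbor
          sendMessageToAllEdges(value);     // Sends message to all neighbors
        }
      }

      Type parameters of Vertex, in order: vertex ID type, vertex value type,
      edge value type, message value type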
  15. Giraph Vertex API
      ● Look at the shortest path source code:
        – SimpleShortestPathsVertex.java (v1.0.0)
  16. Giraph Input/Output
      ● You can read vertex-oriented (adjacency list) or edge-oriented (pairs of vertices) files
      ● Many formats already available:
        – VertexInputFormat / VertexOutputFormat
        – HiveVertexInputFormat / HiveVertexOutputFormat
        – …
      ● You can easily read any format by extending VertexInputFormat / EdgeInputFormat
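To make the difference between vertex-oriented and edge-oriented input concrete, here is a plain-Java sketch that parses both layouts into the same in-memory graph. The class `GraphText` and the whitespace-separated line formats are assumptions made for illustration; they are not Giraph input format classes and do not describe any file format that Giraph ships with.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parsers for illustration; not Giraph input format classes.
final class GraphText {
    // Vertex-oriented input: each line is "vertex neighbor1 neighbor2 ...".
    static Map<String, List<String>> fromAdjacencyList(List<String> lines) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.trim().split("\\s+");
            List<String> neighbors = graph.computeIfAbsent(parts[0], k -> new ArrayList<>());
            for (int i = 1; i < parts.length; i++) {
                neighbors.add(parts[i]);
            }
        }
        return graph;
    }

    // Edge-oriented input: each line is "source target".
    static Map<String, List<String>> fromEdgeList(List<String> lines) {
        Map<String, List<String>> graph = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.trim().split("\\s+");
            graph.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
            // Make sure the target also exists as a vertex, even without out-edges.
            graph.computeIfAbsent(parts[1], k -> new ArrayList<>());
        }
        return graph;
    }
}
```

In the vertex-oriented layout, a vertex with no outgoing edges needs its own line; in the edge-oriented layout it appears only as a target, so the parser has to create it.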
  17. Thanks!