Processing Edges on Apache Giraph
 

    Presentation Transcript

    • Processing Over a Billion Edges on Apache Giraph
      Hadoop Summit 2012
      Avery Ching, Software Engineer
      6/14/2012
    • Agenda
      1. Motivation and Background
      2. Giraph Concepts/API
      3. Example Applications
      4. Architecture Overview
      5. Recent/Future Improvements
    • What is Apache Giraph?
      • Loose implementation of Google's Pregel that runs as a map-only job on Hadoop
      • "Think like a vertex": a vertex can send messages to any other vertex in the graph, using the bulk synchronous parallel programming model
      • An in-memory scalable system*
        ▪ Will be enhanced with out-of-core messages/vertices to handle larger problem sets
    • What (social) graphs are we targeting?
      • 3/2012: LinkedIn has 161 million users
      • 6/2012: Twitter discloses 140 million MAU
      • 4/2012: Facebook declares 901 million MAU
    • Example applications
      • Ranking
        ▪ Popularity, importance, etc.
      • Label propagation
        ▪ Location, school, gender, etc.
      • Community
        ▪ Groups, interests
    • Bulk synchronous parallel
      • Supersteps
        ▪ A global epoch followed by a global barrier; within an epoch, components do concurrent computation and send messages
      • Point-to-point messages (i.e. vertex to vertex)
        ▪ Sent during a superstep from one component to another, then delivered in the following superstep
      • Computation is complete when all components are complete
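      The superstep/barrier cycle above can be sketched as a plain sequential loop. This is an illustration of the model only, not Giraph's runtime code; SketchVertex and BspRunner are made-up names for the sketch:

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Hypothetical vertex contract for this sketch only: compute() reads the
        // messages delivered at the previous barrier and returns its outgoing
        // messages keyed by target vertex id.
        interface SketchVertex<I, M> {
          I getId();
          boolean isHalted();
          Map<I, List<M>> compute(List<M> messages);
        }

        class BspRunner<I, M> {
          // Runs supersteps until every vertex has voted to halt.  (Reactivating a
          // halted vertex when a message arrives is omitted for brevity.)
          void run(List<SketchVertex<I, M>> vertices) {
            Map<I, List<M>> inbox = new HashMap<>();
            while (vertices.stream().anyMatch(v -> !v.isHalted())) {
              Map<I, List<M>> nextInbox = new HashMap<>();
              // Superstep: each non-halted vertex computes on last superstep's
              // messages (conceptually concurrent; sequential here for clarity).
              for (SketchVertex<I, M> v : vertices) {
                if (v.isHalted()) {
                  continue;
                }
                Map<I, List<M>> sent =
                    v.compute(inbox.getOrDefault(v.getId(), List.of()));
                sent.forEach((target, msgs) ->
                    nextInbox.computeIfAbsent(target, t -> new ArrayList<>())
                             .addAll(msgs));
              }
              // Global barrier: messages sent now are delivered next superstep.
              inbox = nextInbox;
            }
          }
        }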
    • [Diagram: a superstep of computation and communication across processors over time, ending at a global barrier]
    • MapReduce -> Giraph
      "Think like a vertex", not a key-value pair!

      MapReduce:
        public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
          void map(KEYIN key, VALUEIN value, Context context)
              throws IOException, InterruptedException;
        }

      Giraph:
        public class Vertex<I extends WritableComparable,
                            V extends Writable,
                            E extends Writable,
                            M extends Writable> {
          void compute(Iterator<M> msgIterator);
        }
    • Basic Giraph API: methods available to compute()
      Immediate effect/access:
        I getVertexId()
        V getVertexValue()
        void setVertexValue(V vertexValue)
        Iterator<I> iterator()
        E getEdgeValue(I targetVertexId)
        boolean hasEdge(I targetVertexId)
        boolean addEdge(I targetVertexId, E edgeValue)
        E removeEdge(I targetVertexId)
        void voteToHalt()
        boolean isHalted()
      Takes effect next superstep:
        void sendMsg(I id, M msg)
        void sendMsgToAllEdges(M msg)
        void addVertexRequest(BasicVertex<I, V, E, M> vertex)
        void removeVertexRequest(I vertexId)
        void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
        void removeEdgeRequest(I sourceVertexId, I destVertexId)
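      A small, hypothetical example of how these calls combine (not from the deck; it follows the signatures as listed on this slide, which may differ in later Giraph versions, and omits Giraph imports just as the deck's examples do): each message names a neighbor whose edge should be dropped.

        import java.util.Iterator;
        import org.apache.hadoop.io.FloatWritable;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;

        // Hypothetical example, not from the deck.  Each incoming message carries
        // the id of a neighbor whose edge should be removed; the mutation is only
        // a request, so it takes effect in the next superstep (see table above).
        public class EdgeDropVertex extends EdgeListVertex<
            LongWritable, IntWritable, FloatWritable, LongWritable> {
          @Override
          public void compute(Iterator<LongWritable> msgIterator) {
            while (msgIterator.hasNext()) {
              LongWritable target = msgIterator.next();
              if (hasEdge(target)) {
                removeEdgeRequest(getVertexId(), target);  // applied next superstep
              }
            }
            voteToHalt();  // a later message would wake this vertex up again
          }
        }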
    • Why not implement Giraph with multiple MapReduce jobs?
      • Too much disk, no in-memory caching, a superstep becomes a job!
      [Diagram: input format -> map tasks -> intermediate files -> reduce tasks -> output format, with input splits 0-3 and outputs 0-1]
    • Giraph is a single map-only job in Hadoop
      • Hadoop is purely a resource manager for Giraph; all communication is done through Netty-based IPC
      [Diagram: vertex input format -> map tasks -> vertex output format, with input splits 0-3 and outputs 0-1]
    • Maximum vertex value implementation

        public class MaxValueVertex extends EdgeListVertex<
            IntWritable, IntWritable, IntWritable, IntWritable> {
          @Override
          public void compute(Iterator<IntWritable> msgIterator) {
            boolean changed = false;
            while (msgIterator.hasNext()) {
              IntWritable msgValue = msgIterator.next();
              if (msgValue.get() > getVertexValue().get()) {
                setVertexValue(msgValue);
                changed = true;
              }
            }
            if (getSuperstep() == 0 || changed) {
              sendMsgToAllEdges(getVertexValue());
            } else {
              voteToHalt();
            }
          }
        }
    • [Diagram: maximum vertex value example; vertices starting at 5, 1, and 2 on two processors all converge to 5 over supersteps separated by barriers]
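      Tracing the code above on the diagram's starting values (5, 1, and 2): in superstep 0 every vertex sends its value to its neighbors; in each later superstep a vertex that receives a value larger than its own adopts it and forwards it, and otherwise votes to halt; after a few supersteps nothing changes, all vertices have halted, and every vertex holds the maximum value, 5.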
    • Page rank implementation

        public class SimplePageRankVertex extends EdgeListVertex<
            LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
          @Override
          public void compute(Iterator<DoubleWritable> msgIterator) {
            if (getSuperstep() >= 1) {
              double sum = 0;
              while (msgIterator.hasNext()) {
                sum += msgIterator.next().get();
              }
              setVertexValue(new DoubleWritable(
                  (0.15f / getNumVertices()) + 0.85f * sum));
            }
            if (getSuperstep() < 30) {
              long edges = getNumOutEdges();
              sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
            } else {
              voteToHalt();
            }
          }
        }
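      As a quick, hypothetical walk-through of the update above (numbers invented for illustration): with 1,000 vertices in the graph and incoming messages of 0.1, 0.2, and 0.3, the new vertex value is 0.15/1000 + 0.85 * 0.6 = 0.51015, and a vertex with 3 out-edges would then send 0.51015 / 3 ≈ 0.17 along each edge.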
    • Giraph In MapReduce
    • Giraph components
      • Master – application coordinator
        ▪ One active master at a time
        ▪ Assigns partition owners to workers prior to each superstep
        ▪ Synchronizes supersteps
      • Worker – computation and messaging
        ▪ Loads the graph from input splits
        ▪ Does the computation/messaging for its assigned partitions
      • ZooKeeper
        ▪ Maintains global application state
    • Graph distribution
      • Master graph partitioner
        ▪ Create initial partitions, generate partition owner changes between supersteps
      • Worker graph partitioner
        ▪ Determine which partition a vertex belongs to
        ▪ Create/modify the partition stats (can split/merge partitions)
      • Default is hash partitioning (hashCode())
        ▪ Range-based partitioning is also possible on a per-type basis
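      To make the default hash partitioning concrete, a vertex id can be mapped to a partition roughly as below. This is an illustrative sketch only, not Giraph's partitioner code; the class and method names are made up:

        import org.apache.hadoop.io.LongWritable;

        // Illustrative sketch of hash partitioning: a vertex id is mapped to one
        // of the available partitions via its hashCode().
        public class HashPartitionSketch {
          // Returns the partition index for a vertex id, given partitionCount > 0.
          public static int partitionFor(LongWritable vertexId, int partitionCount) {
            // Math.abs(Integer.MIN_VALUE) stays negative, so mask the sign bit.
            return (vertexId.hashCode() & Integer.MAX_VALUE) % partitionCount;
          }

          public static void main(String[] args) {
            // Example: vertex 12345 with 8 partitions.
            System.out.println(partitionFor(new LongWritable(12345L), 8));
          }
        }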
    • [Diagram: graph distribution example; a master assigns eight partitions (with their stats) across four workers, each of which loads/stores, computes, and exchanges messages for its two partitions]
    • Customizable fault tolerance
      • No single point of failure from Giraph threads
        ▪ With multiple master threads, if the current master dies, a new one will automatically take over
        ▪ If a worker thread dies, the application is rolled back to a previously checkpointed superstep; the next superstep will begin with the new number of workers
        ▪ If a ZooKeeper server dies, as long as a quorum remains, the application can proceed
      • Hadoop single points of failure still exist
        ▪ Namenode, jobtracker
        ▪ Restarting manually from a checkpoint is always possible
    • Master thread fault tolerance
      [Diagram: before a failure, master 0 is "active" and masters 1 and 2 are "spare"; after master 0 fails, master 1 becomes "active" and master 2 stays "spare"]
      • One active master, with spare masters taking over in the event of an active master failure
      • All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails
      • "Active" master implemented as a queue in ZooKeeper
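      One standard way to implement such a master queue with ZooKeeper is sequential ephemeral znodes, where the candidate holding the lowest sequence number acts as the active master. The sketch below only illustrates that general pattern; it is not Giraph's code, and the /giraph/masters path and MasterCandidate class are invented for the example:

        import java.util.Collections;
        import java.util.List;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        // Illustrative leader-queue sketch: each master candidate registers an
        // ephemeral sequential znode; the candidate owning the lowest sequence
        // number acts as the active master.  If it dies, its znode disappears
        // and the next candidate in the queue becomes active.
        public class MasterCandidate {
          private final ZooKeeper zk;
          private final String myNode;

          public MasterCandidate(ZooKeeper zk)
              throws KeeperException, InterruptedException {
            this.zk = zk;
            // Hypothetical path; the parent znode must already exist.
            this.myNode = zk.create("/giraph/masters/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
          }

          // True if this candidate currently owns the lowest sequence number.
          public boolean isActiveMaster()
              throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren("/giraph/masters", false);
            Collections.sort(children);
            return myNode.endsWith(children.get(0));
          }
        }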
    • Worker thread fault tolerance
      [Diagram: a worker failure during superstep i+2 restarts the application from the superstep i+1 checkpoint; a failure after the superstep i+3 checkpoint completes restarts from superstep i+3, and the application then completes]
      • A single worker death fails the superstep
      • Application reverts to the last committed superstep automatically
        ▪ Master detects worker failure during any superstep with a ZooKeeper "health" znode
        ▪ Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
    • Optional features
      • Combiners
        ▪ Similar to MapReduce combiners
        ▪ Users implement a combine() method that can reduce the number of messages sent and received
        ▪ Run on both the client side (memory, network) and server side (memory)
      • Aggregators
        ▪ Similar to MPI aggregation routines (i.e. max, min, sum, etc.)
        ▪ Commutative and associative operations that are performed globally
        ▪ Examples include global communication, monitoring, and statistics
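      For example, a combiner for the maximum-value vertex could collapse all messages bound for one vertex into a single maximum before they are sent or buffered. The combiner interface has changed across Giraph versions, so the combine() signature below is only a hedged sketch of the idea:

        import org.apache.hadoop.io.IntWritable;

        // Hedged sketch of a message combiner for the maximum-value example;
        // the real Giraph combiner interface and signature may differ by version.
        public class MaxValueCombinerSketch {
          // Collapses every message bound for one vertex into a single maximum,
          // reducing what is sent over the network and buffered at the receiver.
          public IntWritable combine(IntWritable targetVertexId,
                                     Iterable<IntWritable> messages) {
            int max = Integer.MIN_VALUE;
            for (IntWritable message : messages) {
              max = Math.max(max, message.get());
            }
            return new IntWritable(max);
          }
        }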
    • Recent Netty IPC implementation
      • Big improvement over the Hadoop RPC implementation
      • 10-39% overall performance improvement
      • Still need more Netty tuning
      [Chart: run time in seconds for Netty vs. Hadoop RPC, and % improvement, at 10, 30, and 50 workers]
    • Recent benchmarks
      • Test cluster of 80 machines
        ▪ Facebook Hadoop (https://github.com/facebook/hadoop-20)
        ▪ 72 cores, 64+ GB of memory
      • org.apache.giraph.benchmark.PageRankBenchmark
        ▪ 5 supersteps
        ▪ No checkpointing
        ▪ 10 edges per vertex
    • [Chart: worker scalability; run time in seconds versus number of workers (10 to 50)]
    • [Chart: edge scalability; run time in seconds versus number of edges (1 to 5 billion)]
    • [Chart: worker/edge scalability; run time in seconds as workers and edges (billions) grow together, at 10, 30, and 50 workers]
    • Apache Giraph has graduated as of 5/2012
      • Incubated for less than a year (entered incubator 9/12)
      • Committers from HortonWorks, Twitter, LinkedIn, Facebook, TrendMicro, and various schools (VU Amsterdam, TU Berlin, Korea University)
      • Released 0.1 as of 2/6/2012; will release 0.2 within a few months
    • Future improvements
      • Out-of-core messages/graph
        ▪ Under memory pressure, dump messages/portions of the graph to local disk
        ▪ Ability to run applications without having all needed memory
      • Performance improvements
        ▪ Netty is a good step in the right direction, but messaging performance still needs tuning since it takes up a majority of the time
        ▪ Scale back the use of ZooKeeper to health registration only, rather than implementing aggregators and coordination
    • More future improvements
      • Adding a master#compute() method
        ▪ Arbitrary master computation that sends results to workers prior to a superstep, to simplify certain algorithms
        ▪ GIRAPH-127
      • Handling skew
        ▪ Some vertices have a large number of edges; we need to break them up and handle them differently to provide better scalability
    • (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
    • Sessions will resume at 4:30pm