Processing Edges on Apache Giraph
 

    Presentation Transcript

    • Processing Over a Billion Edges on Apache Giraph
      Hadoop Summit 2012
      Avery Ching, Software Engineer
      6/14/2012
    • Agenda
      1. Motivation and Background
      2. Giraph Concepts/API
      3. Example Applications
      4. Architecture Overview
      5. Recent/Future Improvements
    • What is Apache Giraph?
      • Loose implementation of Google's Pregel that runs as a map-only job on Hadoop
      • "Think like a vertex": a vertex can send messages to any other vertex in the graph, using the bulk synchronous parallel programming model
      • An in-memory scalable system*
        ▪ Will be enhanced with out-of-core messages/vertices to handle larger problem sets
    • What (social) graphs are we targeting?
      • 3/2012: LinkedIn has 161 million users
      • 6/2012: Twitter discloses 140 million MAU
      • 4/2012: Facebook declares 901 million MAU
    • Example applications
      • Ranking
        ▪ Popularity, importance, etc.
      • Label propagation
        ▪ Location, school, gender, etc.
      • Community
        ▪ Groups, interests
    • Bulk synchronous parallel
      • Supersteps
        ▪ A global epoch followed by a global barrier; within an epoch, components do concurrent computation and send messages
      • Point-to-point messages (i.e. vertex to vertex)
        ▪ Sent during a superstep from one component to another, then delivered in the following superstep
      • Computation is complete when all components are complete
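      The superstep/barrier cycle above can be sketched as a plain sequential loop. This is an illustration of the model only, not Giraph's runtime code; SketchVertex and BspRunner are made-up names for the sketch:

        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Hypothetical vertex contract for this sketch only: compute() reads the
        // messages delivered at the previous barrier and returns its outgoing
        // messages keyed by target vertex id.
        interface SketchVertex<I, M> {
          I getId();
          boolean isHalted();
          Map<I, List<M>> compute(List<M> messages);
        }

        class BspRunner<I, M> {
          // Runs supersteps until every vertex has voted to halt.  (Reactivating a
          // halted vertex when a message arrives is omitted for brevity.)
          void run(List<SketchVertex<I, M>> vertices) {
            Map<I, List<M>> inbox = new HashMap<>();
            while (vertices.stream().anyMatch(v -> !v.isHalted())) {
              Map<I, List<M>> nextInbox = new HashMap<>();
              // Superstep: each non-halted vertex computes on last superstep's
              // messages (conceptually concurrent; sequential here for clarity).
              for (SketchVertex<I, M> v : vertices) {
                if (v.isHalted()) {
                  continue;
                }
                Map<I, List<M>> sent =
                    v.compute(inbox.getOrDefault(v.getId(), List.of()));
                sent.forEach((target, msgs) ->
                    nextInbox.computeIfAbsent(target, t -> new ArrayList<>())
                             .addAll(msgs));
              }
              // Global barrier: messages sent now are delivered next superstep.
              inbox = nextInbox;
            }
          }
        }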
    • [Diagram: a superstep of computation and communication across processors over time, ending at a global barrier]
    • MapReduce -> Giraph
      "Think like a vertex", not a key-value pair!

      MapReduce:
        public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
          void map(KEYIN key, VALUEIN value, Context context)
              throws IOException, InterruptedException;
        }

      Giraph:
        public class Vertex<I extends WritableComparable,
                            V extends Writable,
                            E extends Writable,
                            M extends Writable> {
          void compute(Iterator<M> msgIterator);
        }
    • Basic Giraph API: methods available to compute()
      Immediate effect/access:
        I getVertexId()
        V getVertexValue()
        void setVertexValue(V vertexValue)
        Iterator<I> iterator()
        E getEdgeValue(I targetVertexId)
        boolean hasEdge(I targetVertexId)
        boolean addEdge(I targetVertexId, E edgeValue)
        E removeEdge(I targetVertexId)
        void voteToHalt()
        boolean isHalted()
      Takes effect next superstep:
        void sendMsg(I id, M msg)
        void sendMsgToAllEdges(M msg)
        void addVertexRequest(BasicVertex<I, V, E, M> vertex)
        void removeVertexRequest(I vertexId)
        void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
        void removeEdgeRequest(I sourceVertexId, I destVertexId)
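      A small, hypothetical example of how these calls combine (not from the deck; it follows the signatures as listed on this slide, which may differ in later Giraph versions, and omits Giraph imports just as the deck's examples do): each message names a neighbor whose edge should be dropped.

        import java.util.Iterator;
        import org.apache.hadoop.io.FloatWritable;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;

        // Hypothetical example, not from the deck.  Each incoming message carries
        // the id of a neighbor whose edge should be removed; the mutation is only
        // a request, so it takes effect in the next superstep (see table above).
        public class EdgeDropVertex extends EdgeListVertex<
            LongWritable, IntWritable, FloatWritable, LongWritable> {
          @Override
          public void compute(Iterator<LongWritable> msgIterator) {
            while (msgIterator.hasNext()) {
              LongWritable target = msgIterator.next();
              if (hasEdge(target)) {
                removeEdgeRequest(getVertexId(), target);  // applied next superstep
              }
            }
            voteToHalt();  // a later message would wake this vertex up again
          }
        }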
    • Why not implement Giraph with multiple MapReduce jobs?
      • Too much disk, no in-memory caching, a superstep becomes a job!
      [Diagram: input format -> map tasks -> intermediate files -> reduce tasks -> output format, with input splits 0-3 and outputs 0-1]
    • Giraph is a single map-only job in Hadoop
      • Hadoop is purely a resource manager for Giraph; all communication is done through Netty-based IPC
      [Diagram: vertex input format -> map tasks -> vertex output format, with input splits 0-3 and outputs 0-1]
    • Maximum vertex value implementation

        public class MaxValueVertex extends EdgeListVertex<
            IntWritable, IntWritable, IntWritable, IntWritable> {
          @Override
          public void compute(Iterator<IntWritable> msgIterator) {
            boolean changed = false;
            while (msgIterator.hasNext()) {
              IntWritable msgValue = msgIterator.next();
              if (msgValue.get() > getVertexValue().get()) {
                setVertexValue(msgValue);
                changed = true;
              }
            }
            if (getSuperstep() == 0 || changed) {
              sendMsgToAllEdges(getVertexValue());
            } else {
              voteToHalt();
            }
          }
        }
    • [Diagram: maximum vertex value example; vertices starting at 5, 1, and 2 on two processors all converge to 5 over supersteps separated by barriers]
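      Tracing the code above on the diagram's starting values (5, 1, and 2): in superstep 0 every vertex sends its value to its neighbors; in each later superstep a vertex that receives a value larger than its own adopts it and forwards it, and otherwise votes to halt; after a few supersteps nothing changes, all vertices have halted, and every vertex holds the maximum value, 5.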
    • Page rank implementation

        public class SimplePageRankVertex extends EdgeListVertex<
            LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
          @Override
          public void compute(Iterator<DoubleWritable> msgIterator) {
            if (getSuperstep() >= 1) {
              double sum = 0;
              while (msgIterator.hasNext()) {
                sum += msgIterator.next().get();
              }
              setVertexValue(new DoubleWritable(
                  (0.15f / getNumVertices()) + 0.85f * sum));
            }
            if (getSuperstep() < 30) {
              long edges = getNumOutEdges();
              sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
            } else {
              voteToHalt();
            }
          }
        }
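      As a quick, hypothetical walk-through of the update above (numbers invented for illustration): with 1,000 vertices in the graph and incoming messages of 0.1, 0.2, and 0.3, the new vertex value is 0.15/1000 + 0.85 * 0.6 = 0.51015, and a vertex with 3 out-edges would then send 0.51015 / 3 ≈ 0.17 along each edge.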
    • Giraph In MapReduce
    • Giraph components
      • Master – application coordinator
        ▪ One active master at a time
        ▪ Assigns partition owners to workers prior to each superstep
        ▪ Synchronizes supersteps
      • Worker – computation and messaging
        ▪ Loads the graph from input splits
        ▪ Does the computation/messaging for its assigned partitions
      • ZooKeeper
        ▪ Maintains global application state
    • Graph distribution
      • Master graph partitioner
        ▪ Create initial partitions, generate partition owner changes between supersteps
      • Worker graph partitioner
        ▪ Determine which partition a vertex belongs to
        ▪ Create/modify the partition stats (can split/merge partitions)
      • Default is hash partitioning (hashCode())
        ▪ Range-based partitioning is also possible on a per-type basis
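      To make the default hash partitioning concrete, a vertex id can be mapped to a partition roughly as below. This is an illustrative sketch only, not Giraph's partitioner code; the class and method names are made up:

        import org.apache.hadoop.io.LongWritable;

        // Illustrative sketch of hash partitioning: a vertex id is mapped to one
        // of the available partitions via its hashCode().
        public class HashPartitionSketch {
          // Returns the partition index for a vertex id, given partitionCount > 0.
          public static int partitionFor(LongWritable vertexId, int partitionCount) {
            // Math.abs(Integer.MIN_VALUE) stays negative, so mask the sign bit.
            return (vertexId.hashCode() & Integer.MAX_VALUE) % partitionCount;
          }

          public static void main(String[] args) {
            // Example: vertex 12345 with 8 partitions.
            System.out.println(partitionFor(new LongWritable(12345L), 8));
          }
        }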
    • [Diagram: graph distribution example; a master assigns eight partitions (with their stats) across four workers, each of which loads/stores, computes, and exchanges messages for its two partitions]
    • Customizable fault tolerance
      • No single point of failure from Giraph threads
        ▪ With multiple master threads, if the current master dies, a new one will automatically take over
        ▪ If a worker thread dies, the application is rolled back to a previously checkpointed superstep; the next superstep will begin with the new number of workers
        ▪ If a ZooKeeper server dies, as long as a quorum remains, the application can proceed
      • Hadoop single points of failure still exist
        ▪ Namenode, jobtracker
        ▪ Restarting manually from a checkpoint is always possible
    • Master thread fault tolerance
      [Diagram: before a failure, master 0 is "active" and masters 1 and 2 are "spare"; after master 0 fails, master 1 becomes "active" and master 2 stays "spare"]
      • One active master, with spare masters taking over in the event of an active master failure
      • All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails
      • "Active" master implemented as a queue in ZooKeeper
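      One standard way to implement such a master queue with ZooKeeper is sequential ephemeral znodes, where the candidate holding the lowest sequence number acts as the active master. The sketch below only illustrates that general pattern; it is not Giraph's code, and the /giraph/masters path and MasterCandidate class are invented for the example:

        import java.util.Collections;
        import java.util.List;
        import org.apache.zookeeper.CreateMode;
        import org.apache.zookeeper.KeeperException;
        import org.apache.zookeeper.ZooDefs;
        import org.apache.zookeeper.ZooKeeper;

        // Illustrative leader-queue sketch: each master candidate registers an
        // ephemeral sequential znode; the candidate owning the lowest sequence
        // number acts as the active master.  If it dies, its znode disappears
        // and the next candidate in the queue becomes active.
        public class MasterCandidate {
          private final ZooKeeper zk;
          private final String myNode;

          public MasterCandidate(ZooKeeper zk)
              throws KeeperException, InterruptedException {
            this.zk = zk;
            // Hypothetical path; the parent znode must already exist.
            this.myNode = zk.create("/giraph/masters/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
          }

          // True if this candidate currently owns the lowest sequence number.
          public boolean isActiveMaster()
              throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren("/giraph/masters", false);
            Collections.sort(children);
            return myNode.endsWith(children.get(0));
          }
        }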
    • Worker thread fault tolerance
      [Diagram: a worker failure during superstep i+2 restarts the application from the superstep i+1 checkpoint; a failure after the superstep i+3 checkpoint completes restarts from superstep i+3, and the application then completes]
      • A single worker death fails the superstep
      • Application reverts to the last committed superstep automatically
        ▪ Master detects worker failure during any superstep with a ZooKeeper "health" znode
        ▪ Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
    • Optional features
      • Combiners
        ▪ Similar to MapReduce combiners
        ▪ Users implement a combine() method that can reduce the number of messages sent and received
        ▪ Run on both the client side (memory, network) and server side (memory)
      • Aggregators
        ▪ Similar to MPI aggregation routines (i.e. max, min, sum, etc.)
        ▪ Commutative and associative operations that are performed globally
        ▪ Examples include global communication, monitoring, and statistics
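      For example, a combiner for the maximum-value vertex could collapse all messages bound for one vertex into a single maximum before they are sent or buffered. The combiner interface has changed across Giraph versions, so the combine() signature below is only a hedged sketch of the idea:

        import org.apache.hadoop.io.IntWritable;

        // Hedged sketch of a message combiner for the maximum-value example;
        // the real Giraph combiner interface and signature may differ by version.
        public class MaxValueCombinerSketch {
          // Collapses every message bound for one vertex into a single maximum,
          // reducing what is sent over the network and buffered at the receiver.
          public IntWritable combine(IntWritable targetVertexId,
                                     Iterable<IntWritable> messages) {
            int max = Integer.MIN_VALUE;
            for (IntWritable message : messages) {
              max = Math.max(max, message.get());
            }
            return new IntWritable(max);
          }
        }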
    • Recent Netty IPC implementation
      • Big improvement over the Hadoop RPC implementation
      • 10-39% overall performance improvement
      • Still need more Netty tuning
      [Chart: run time in seconds for Netty vs. Hadoop RPC, and % improvement, at 10, 30, and 50 workers]
    • Recent benchmarks
      • Test cluster of 80 machines
        ▪ Facebook Hadoop (https://github.com/facebook/hadoop-20)
        ▪ 72 cores, 64+ GB of memory
      • org.apache.giraph.benchmark.PageRankBenchmark
        ▪ 5 supersteps
        ▪ No checkpointing
        ▪ 10 edges per vertex
    • [Chart: worker scalability; run time in seconds versus number of workers (10 to 50)]
    • [Chart: edge scalability; run time in seconds versus number of edges (1 to 5 billion)]
    • [Chart: worker/edge scalability; run time in seconds as workers and edges (billions) grow together, at 10, 30, and 50 workers]
    • Apache Giraph has graduated as of 5/2012
      • Incubated for less than a year (entered incubator 9/12)
      • Committers from HortonWorks, Twitter, LinkedIn, Facebook, TrendMicro, and various schools (VU Amsterdam, TU Berlin, Korea University)
      • Released 0.1 as of 2/6/2012; will release 0.2 within a few months
    • Future improvements
      • Out-of-core messages/graph
        ▪ Under memory pressure, dump messages/portions of the graph to local disk
        ▪ Ability to run applications without having all needed memory
      • Performance improvements
        ▪ Netty is a good step in the right direction, but messaging performance still needs tuning since it takes up a majority of the time
        ▪ Scale back the use of ZooKeeper to health registration only, rather than implementing aggregators and coordination
    • More future improvements
      • Adding a master#compute() method
        ▪ Arbitrary master computation that sends results to workers prior to a superstep, to simplify certain algorithms
        ▪ GIRAPH-127
      • Handling skew
        ▪ Some vertices have a large number of edges; we need to break them up and handle them differently to provide better scalability
    • (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
    • Sessions will resume at 4:30pm