Processing Edges on Apache Giraph

  1. Processing Over a Billion Edges on Apache Giraph
     Hadoop Summit 2012
     Avery Ching, Software Engineer
     6/14/2012
  2. Agenda
     1. Motivation and Background
     2. Giraph Concepts/API
     3. Example Applications
     4. Architecture Overview
     5. Recent/Future Improvements
  3. What is Apache Giraph?
     •  Loose implementation of Google's Pregel that runs as a map-only job on Hadoop
     •  "Think like a vertex" model: a vertex can send messages to any other vertex in the graph, using the bulk synchronous parallel programming model
     •  An in-memory, scalable system*
        ▪  Will be enhanced with out-of-core messages/vertices to handle larger problem sets
  4. What (social) graphs are we targeting?
     •  3/2012: LinkedIn has 161 million users
     •  6/2012: Twitter discloses 140 million MAU
     •  4/2012: Facebook declares 901 million MAU
  5. Example applications
     •  Ranking
        ▪  Popularity, importance, etc.
     •  Label propagation
        ▪  Location, school, gender, etc.
     •  Community
        ▪  Groups, interests
  6. Bulk synchronous parallel
     •  Supersteps
        ▪  A global epoch, followed by a global barrier, in which components do concurrent computation and send messages
     •  Point-to-point messages (i.e. vertex to vertex)
        ▪  Sent during a superstep from one component to another, then delivered in the following superstep
     •  Computation is complete when all components complete
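The superstep/barrier cycle can be pictured with a toy, single-threaded loop. This is only a sketch of the BSP execution model, not Giraph's internals; the ToyVertex interface and message types are invented for illustration.

    import java.util.*;

    public class BspSketch {
      interface ToyVertex {
        long getId();
        boolean isHalted();
        // Compute with last superstep's messages; put outgoing messages in outbox.
        void compute(List<Long> messages, Map<Long, List<Long>> outbox);
      }

      static void run(Collection<ToyVertex> vertices) {
        Map<Long, List<Long>> inbox = new HashMap<>();
        boolean anyActive = true;
        while (anyActive) {                                   // one loop iteration == one superstep
          Map<Long, List<Long>> outbox = new HashMap<>();
          anyActive = false;
          for (ToyVertex v : vertices) {                      // concurrent across workers in a real system
            List<Long> msgs = inbox.getOrDefault(v.getId(), Collections.<Long>emptyList());
            if (!v.isHalted() || !msgs.isEmpty()) {
              v.compute(msgs, outbox);                        // may send messages and/or vote to halt
            }
            if (!v.isHalted()) {
              anyActive = true;
            }
          }
          inbox = outbox;                                     // global barrier: deliver messages next superstep
        }
      }
    }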
  7. [Diagram: a superstep of computation and communication across processors over time, ending at a global barrier]
  8. MapReduce -> Giraph
     "Think like a vertex", not a key-value pair!

     MapReduce:
       public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
         void map(KEYIN key, VALUEIN value, Context context)
             throws IOException, InterruptedException;
       }

     Giraph:
       public class Vertex<I extends WritableComparable,
                           V extends Writable,
                           E extends Writable,
                           M extends Writable> {
         void compute(Iterator<M> msgIterator);
       }
  9. Basic Giraph API
     Methods available to compute():

     Immediate effect/access:
       I getVertexId()
       V getVertexValue()
       void setVertexValue(V vertexValue)
       Iterator<I> iterator()
       E getEdgeValue(I targetVertexId)
       boolean hasEdge(I targetVertexId)
       boolean addEdge(I targetVertexId, E edgeValue)
       E removeEdge(I targetVertexId)
       void voteToHalt()
       boolean isHalted()

     Next superstep:
       void sendMsg(I id, M msg)
       void sendMsgToAllEdges(M msg)
       void addVertexRequest(BasicVertex<I, V, E, M> vertex)
       void removeVertexRequest(I vertexId)
       void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
       void removeEdgeRequest(I sourceVertexId, I destVertexId)
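To make the table concrete, here is a small, hedged example of a compute() method against this 0.1-era API. The EdgeListVertex base class and generics follow the slides, but package names may differ between Giraph releases; the vertex class itself is invented for illustration. On the first superstep each vertex sends its own value to every neighbor whose edge value is positive, then votes to halt.

    import java.util.Iterator;

    import org.apache.giraph.graph.EdgeListVertex;
    import org.apache.hadoop.io.IntWritable;

    public class ForwardIfPositiveVertex extends EdgeListVertex<
        IntWritable, IntWritable, IntWritable, IntWritable> {
      @Override
      public void compute(Iterator<IntWritable> msgIterator) {
        if (getSuperstep() == 0) {
          for (Iterator<IntWritable> targets = iterator(); targets.hasNext(); ) {
            IntWritable targetId = targets.next();
            if (getEdgeValue(targetId).get() > 0) {   // immediate read of an edge value
              sendMsg(targetId, getVertexValue());    // delivered in the next superstep
            }
          }
        }
        voteToHalt();                                 // woken up again only if messages arrive
      }
    }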
  10. Why not implement Giraph with multiple MapReduce jobs?
      •  Too much disk, no in-memory caching, and a superstep becomes a job!
      [Diagram: input format -> map tasks -> intermediate files -> reduce tasks -> output format, with splits 0-3 feeding outputs 0-1]
  11. Giraph is a single map-only job in Hadoop
      •  Hadoop is purely a resource manager for Giraph; all communication is done through Netty-based IPC
      [Diagram: vertex input format -> map tasks -> vertex output format, with splits 0-3 feeding outputs 0-1]
  12. Maximum vertex value implementation

      public class MaxValueVertex extends EdgeListVertex<
          IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        public void compute(Iterator<IntWritable> msgIterator) {
          boolean changed = false;
          while (msgIterator.hasNext()) {
            IntWritable msgValue = msgIterator.next();
            if (msgValue.get() > getVertexValue().get()) {
              setVertexValue(msgValue);
              changed = true;
            }
          }
          if (getSuperstep() == 0 || changed) {
            sendMsgToAllEdges(getVertexValue());
          } else {
            voteToHalt();
          }
        }
      }
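For context, a hedged sketch of how a vertex class like MaxValueVertex might be submitted as the single map-only job from the previous slide. The GiraphJob method names follow the 0.1-era API from memory and may differ in other releases; the input/output format classes are hypothetical placeholders.

    import org.apache.giraph.graph.GiraphJob;

    public class MaxValueRunner {
      public static void main(String[] args) throws Exception {
        GiraphJob job = new GiraphJob("MaxValueVertex");
        job.setVertexClass(MaxValueVertex.class);                    // the compute() logic above
        job.setVertexInputFormatClass(MyVertexInputFormat.class);    // hypothetical input format
        job.setVertexOutputFormatClass(MyVertexOutputFormat.class);  // hypothetical output format
        job.setWorkerConfiguration(30, 30, 100.0f);                  // min/max workers, % responded
        System.exit(job.run(true) ? 0 : 1);                          // runs as map tasks only
      }
    }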
  13. Maximum vertex value
      [Diagram: the example running on two processors; vertices starting at values 5, 1, and 2 all converge to 5 over supersteps separated by barriers]
  14. Page rank implementation

      public class SimplePageRankVertex extends EdgeListVertex<
          LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
        public void compute(Iterator<DoubleWritable> msgIterator) {
          if (getSuperstep() >= 1) {
            double sum = 0;
            while (msgIterator.hasNext()) {
              sum += msgIterator.next().get();
            }
            setVertexValue(
                new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
          }
          if (getSuperstep() < 30) {
            long edges = getNumOutEdges();
            sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
          } else {
            voteToHalt();
          }
        }
      }
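For intuition, the update in compute() is the standard damped PageRank formula PR(v) = 0.15/N + 0.85 * sum(incoming contributions), where each neighbor sends its current value divided by its out-degree. As a small illustrative example (numbers invented): with N = 3 vertices and incoming messages 0.2 and 0.3, the new value would be 0.15/3 + 0.85 * 0.5 = 0.475.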
  15. Giraph In MapReduce
  16. Giraph components
      •  Master: application coordinator
         ▪  One active master at a time
         ▪  Assigns partition owners to workers prior to each superstep
         ▪  Synchronizes supersteps
      •  Worker: computation and messaging
         ▪  Loads the graph from input splits
         ▪  Does the computation/messaging for its assigned partitions
      •  ZooKeeper
         ▪  Maintains global application state
  17. Graph distribution
      •  Master graph partitioner
         ▪  Creates initial partitions, generates partition owner changes between supersteps
      •  Worker graph partitioner
         ▪  Determines which partition a vertex belongs to
         ▪  Creates/modifies the partition stats (can split/merge partitions)
      •  Default is hash partitioning (hashCode())
         ▪  Range-based partitioning is also possible on a per-type basis
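A minimal sketch of the idea behind the default hash partitioning: the partition that owns a vertex is derived from the vertex id's hashCode(). The class and method below are illustrative only, not Giraph's actual partitioner interface.

    import org.apache.hadoop.io.LongWritable;

    public class HashPartitionSketch {
      // Which partition owns a vertex: hashCode() modulo the number of partitions.
      static int partitionOf(LongWritable vertexId, int numPartitions) {
        return Math.abs(vertexId.hashCode() % numPartitions);
      }

      public static void main(String[] args) {
        // A vertex id always maps to the same partition, regardless of which worker asks.
        System.out.println(partitionOf(new LongWritable(42L), 8));   // value in [0, 8)
      }
    }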
  18. Graph distribution example
      [Diagram: a master coordinating workers 0-3, each loading/storing, computing, and messaging for two of partitions 0-7 and reporting per-partition stats]
  19. Customizable fault tolerance
      •  No single point of failure from Giraph threads
         ▪  With multiple master threads, if the current master dies, a new one automatically takes over
         ▪  If a worker thread dies, the application is rolled back to a previously checkpointed superstep; the next superstep begins with the new number of workers
         ▪  If a ZooKeeper server dies, as long as a quorum remains, the application can proceed
      •  Hadoop single points of failure still exist
         ▪  Namenode, jobtracker
         ▪  Restarting manually from a checkpoint is always possible
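Checkpointing frequency is set through job configuration. The property name below (giraph.checkpointFrequency) is an assumption based on later Giraph releases and may differ in the version described here; it is shown only to illustrate how often a rollback point is written.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointConfigSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Assumed property name: checkpoint every 2 supersteps so a worker failure
        // only rolls the application back to the most recent checkpointed superstep.
        conf.setInt("giraph.checkpointFrequency", 2);
        System.out.println(conf.get("giraph.checkpointFrequency"));
      }
    }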
  20. Master thread fault tolerance
      [Diagram: before and after failure of active master 0; spare master 1 becomes active, master 2 remains spare, with active master state held in ZooKeeper]
      •  One active master, with spare masters taking over in the event of an active master failure
      •  All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails
      •  "Active" master implemented as a queue in ZooKeeper
  21. Worker thread fault tolerance
      [Diagram: timeline of supersteps i through i+3, with and without checkpoints; after each worker failure the application restarts from the last checkpointed superstep until it completes]
      •  A single worker death fails the superstep
      •  The application reverts to the last committed superstep automatically
         ▪  The master detects worker failure during any superstep with a ZooKeeper "health" znode
         ▪  The master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
  22. Optional features
      •  Combiners
         ▪  Similar to MapReduce combiners
         ▪  Users implement a combine() method that can reduce the number of messages sent and received
         ▪  Run on both the client side (memory, network) and server side (memory)
      •  Aggregators
         ▪  Similar to MPI aggregation routines (i.e. max, min, sum, etc.)
         ▪  Commutative and associative operations that are performed globally
         ▪  Examples include global communication, monitoring, and statistics
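The combiner idea can be sketched without Giraph's actual interface: before the messages addressed to one destination vertex go over the wire (or into the server-side message store), collapse them into a single message. This is only valid for commutative and associative operations, such as the maximum used by MaxValueVertex; the class and method below are invented for illustration.

    import java.util.List;

    import org.apache.hadoop.io.IntWritable;

    public class MaxCombinerSketch {
      // Collapse all messages destined for one vertex into a single message.
      static IntWritable combine(List<IntWritable> messagesForOneVertex) {
        int max = Integer.MIN_VALUE;
        for (IntWritable msg : messagesForOneVertex) {
          max = Math.max(max, msg.get());   // only the largest value matters to MaxValueVertex
        }
        return new IntWritable(max);        // one message delivered instead of many
      }
    }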
  23. Recent Netty IPC implementation
      •  Big improvement over the Hadoop RPC implementation
      •  10-39% overall performance improvement
      •  Still need more Netty tuning
      [Chart: run time in seconds for Netty vs. Hadoop RPC, and % improvement, at 10, 30, and 50 workers]
  24. Recent benchmarks
      •  Test cluster of 80 machines
         ▪  Facebook Hadoop (https://github.com/facebook/hadoop-20)
         ▪  72 cores, 64+ GB of memory
      •  org.apache.giraph.benchmark.PageRankBenchmark
         ▪  5 supersteps
         ▪  No checkpointing
         ▪  10 edges per vertex
  25. Worker scalability
      [Chart: run time in seconds vs. number of workers, from 10 to 50]
  26. Edge scalability
      [Chart: run time in seconds vs. number of edges, from 1 to 5 billion]
  27. Worker / edge scalability
      [Chart: run time in seconds and edges in billions vs. number of workers (10 to 50), scaling workers and edges together]
  28. Apache Giraph has graduated as of 5/2012
      •  Incubated for less than a year (entered the incubator in 2011)
      •  Committers from HortonWorks, Twitter, LinkedIn, Facebook, TrendMicro, and various schools (VU Amsterdam, TU Berlin, Korea University)
      •  Released 0.1 as of 2/6/2012; 0.2 will be released within a few months
  29. Future improvements
      •  Out-of-core messages/graph
         ▪  Under memory pressure, dump messages/portions of the graph to local disk
         ▪  Ability to run applications without having all needed memory
      •  Performance improvements
         ▪  Netty is a good step in the right direction, but messaging performance still needs tuning since it takes up a majority of the run time
         ▪  Scale back the use of ZooKeeper to health registration only, rather than also implementing aggregators and coordination
  30. More future improvements
      •  Adding a master#compute() method
         ▪  Arbitrary master computation that sends results to workers prior to a superstep, to simplify certain algorithms
         ▪  GIRAPH-127
      •  Handling skew
         ▪  Some vertices have a very large number of edges; they need to be broken up and handled differently to provide better scalability
  31. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
  32. Sessions will resume at 4:30pm
