
Processing Edges on Apache Giraph


  1. Processing Over a Billion Edges on Apache Giraph. Hadoop Summit 2012. Avery Ching, Software Engineer. 6/14/2012
  2. Agenda
     1 Motivation and Background
     2 Giraph Concepts/API
     3 Example Applications
     4 Architecture Overview
     5 Recent/Future Improvements
  3. What is Apache Giraph?
•  Loose implementation of Google’s Pregel that runs as a map-only job on Hadoop
•  “Think like a vertex”: a vertex can send messages to any other vertex in the graph using the bulk synchronous parallel programming model
•  An in-memory scalable system*
   ▪  Will be enhanced with out-of-core messages/vertices to handle larger problem sets
  4. What (social) graphs are we targeting?
•  3/2012: LinkedIn has 161 million users
•  6/2012: Twitter discloses 140 million MAU
•  4/2012: Facebook declares 901 million MAU
  5. Example applications
•  Ranking
   ▪  Popularity, importance, etc.
•  Label Propagation
   ▪  Location, school, gender, etc.
•  Community
   ▪  Groups, interests
  6. Bulk synchronous parallel
•  Supersteps
   ▪  A global epoch in which components do concurrent computation and send messages, followed by a global barrier
•  Point-to-point messages (i.e. vertex to vertex)
   ▪  Sent during a superstep from one component to another and then delivered in the following superstep
•  Computation is complete when all components complete
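To make the superstep/barrier mechanics concrete, here is a minimal standalone Java sketch of the BSP message-delivery rule (this is an illustration of the model, not Giraph code; the vertex count and routing are made up): messages queued in one superstep only become visible to their targets after the barrier.

    import java.util.*;

    public class BspSketch {
      public static void main(String[] args) {
        int numVertices = 4;
        // Messages produced in the current superstep; delivered only after the barrier.
        Map<Integer, List<Integer>> outbox = new HashMap<>();

        // Superstep 0: every vertex sends its id to vertex (id + 1) % numVertices.
        for (int v = 0; v < numVertices; v++) {
          outbox.computeIfAbsent((v + 1) % numVertices, k -> new ArrayList<>()).add(v);
        }

        // Global barrier: the outbox becomes the next superstep's inbox.
        Map<Integer, List<Integer>> inbox = outbox;

        // Superstep 1: each vertex consumes the messages sent during superstep 0.
        for (int v = 0; v < numVertices; v++) {
          List<Integer> msgs = inbox.getOrDefault(v, Collections.emptyList());
          System.out.println("vertex " + v + " received " + msgs);
        }
      }
    }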
  7. [Figure: BSP execution over time: processors compute and communicate within a superstep, then wait at a global barrier before the next superstep begins]
  8. MapReduce -> Giraph
“Think like a vertex”, not a key-value pair!

MapReduce:
public class Mapper<
    KEYIN,
    VALUEIN,
    KEYOUT,
    VALUEOUT> {
  void map(KEYIN key,
           VALUEIN value,
           Context context)
      throws IOException, InterruptedException;
}

Giraph:
public class Vertex<
    I extends WritableComparable,
    V extends Writable,
    E extends Writable,
    M extends Writable> {
  void compute(Iterator<M> msgIterator);
}
  9. Basic Giraph API
Methods available to compute():

Immediate effect/access:
I getVertexId()
V getVertexValue()
void setVertexValue(V vertexValue)
Iterator<I> iterator()
E getEdgeValue(I targetVertexId)
boolean hasEdge(I targetVertexId)
boolean addEdge(I targetVertexId, E edgeValue)
E removeEdge(I targetVertexId)
void voteToHalt()
boolean isHalted()

Next superstep:
void sendMsg(I id, M msg)
void sendMsgToAllEdges(M msg)
void addVertexRequest(BasicVertex<I, V, E, M> vertex)
void removeVertexRequest(I vertexId)
void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
void removeEdgeRequest(I sourceVertexId, I destVertexId)
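As a hedged illustration of that split, the fragment below is a hypothetical compute() body (not from the deck; exception declarations are omitted and the Edge(targetId, value) constructor is an assumption for illustration) that updates its own value immediately, while its messages and edge mutation only take effect in the next superstep.

    // Hypothetical compute() fragment illustrating the table above: value updates
    // take effect immediately, while messages and graph mutations are applied in
    // the next superstep. The Edge(targetId, value) constructor is an assumption.
    @Override
    public void compute(Iterator<IntWritable> msgIterator) {
      while (msgIterator.hasNext()) {
        IntWritable msg = msgIterator.next();
        if (msg.get() > getVertexValue().get()) {
          setVertexValue(msg);                        // immediate effect
          sendMsgToAllEdges(getVertexValue());        // delivered next superstep
          addEdgeRequest(getVertexId(),               // edge added next superstep
              new Edge<IntWritable, IntWritable>(
                  new IntWritable(msg.get()), new IntWritable(1)));
        }
      }
      voteToHalt();                                   // reactivated by incoming messages
    }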
  10. Why not implement Giraph with multiple MapReduce jobs?
•  Too much disk, no in-memory caching, a superstep becomes a job!
[Figure: MapReduce dataflow: input format/splits -> map tasks -> intermediate files -> reduce tasks -> output format/outputs]
  11. Giraph is a single Map-only job in Hadoop
•  Hadoop is purely a resource manager for Giraph; all communication is done through Netty-based IPC
[Figure: Giraph dataflow: vertex input format/splits -> map tasks -> vertex output format/outputs]
  12. Maximum vertex value implementation

public class MaxValueVertex extends EdgeListVertex<
    IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  public void compute(Iterator<IntWritable> msgIterator) {
    boolean changed = false;
    while (msgIterator.hasNext()) {
      IntWritable msgValue = msgIterator.next();
      if (msgValue.get() > getVertexValue().get()) {
        setVertexValue(msgValue);
        changed = true;
      }
    }
    if (getSuperstep() == 0 || changed) {
      sendMsgToAllEdges(getVertexValue());
    } else {
      voteToHalt();
    }
  }
}
  13. [Figure: maximum vertex value trace: vertices with initial values 1, 2, and 5 spread across two processors; after each barrier the value 5 propagates until every vertex holds 5]
  14. Page rank implementation

public class SimplePageRankVertex extends EdgeListVertex<
    LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() >= 1) {
      double sum = 0;
      while (msgIterator.hasNext()) {
        sum += msgIterator.next().get();
      }
      setVertexValue(
          new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
    }
    if (getSuperstep() < 30) {
      long edges = getNumOutEdges();
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }
  }
}
  15. Giraph In MapReduce
  16. Giraph components
•  Master – Application coordinator
   ▪  One active master at a time
   ▪  Assigns partition owners to workers prior to each superstep
   ▪  Synchronizes supersteps
•  Worker – Computation & messaging
   ▪  Loads the graph from input splits
   ▪  Does the computation/messaging of its assigned partitions
•  ZooKeeper
   ▪  Maintains global application state
  17. Graph distribution
•  Master graph partitioner
   ▪  Create initial partitions, generate partition owner changes between supersteps
•  Worker graph partitioner
   ▪  Determine which partition a vertex belongs to
   ▪  Create/modify the partition stats (can split/merge partitions)
•  Default is hash partitioning (hashCode())
   ▪  Range-based partitioning is also possible on a per-type basis
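A minimal sketch of the default hash-partitioning idea described above (class and method names are illustrative, not the actual Giraph partitioner interfaces): the vertex id's hashCode() picks a partition, and partitions are then assigned to workers.

    // Illustrative sketch of hash partitioning: vertex id -> partition -> worker.
    // Not the real Giraph partitioner API; names and assignment policy are assumptions.
    public class HashPartitionSketch {
      private final int numPartitions;

      public HashPartitionSketch(int numPartitions) {
        this.numPartitions = numPartitions;
      }

      // Which partition does this vertex id belong to?
      public int getPartition(Object vertexId) {
        return Math.abs(vertexId.hashCode() % numPartitions);
      }

      // Which worker owns a given partition (simple round-robin assignment)?
      public int getWorker(int partition, int numWorkers) {
        return partition % numWorkers;
      }
    }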
  18. [Figure: graph distribution example: the master coordinates workers 0-3; each worker loads/stores, computes, and messages for two of the partitions 0-7 and reports per-partition stats]
  19. Customizable fault tolerance
•  No single point of failure from Giraph threads
   ▪  With multiple master threads, if the current master dies, a new one will automatically take over.
   ▪  If a worker thread dies, the application is rolled back to a previously checkpointed superstep. The next superstep will begin with the new number of workers.
   ▪  If a ZooKeeper server dies, as long as a quorum remains, the application can proceed.
•  Hadoop single points of failure still exist
   ▪  Namenode, jobtracker
   ▪  Restarting manually from a checkpoint is always possible
  20. Master thread fault tolerance
[Figure: before failure, master 0 is active and masters 1 and 2 are spares; after master 0 fails, master 1 becomes the active master]
•  One active master, with spare masters taking over in the event of an active master failure
•  All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails
•  “Active” master implemented as a queue in ZooKeeper
  21. Worker thread fault tolerance
[Figure: supersteps i through i+3, with checkpoints at i+1 and i+3; when a worker fails, the application restarts from the last checkpointed superstep and then runs to completion]
•  A single worker death fails the superstep
•  Application reverts to the last committed superstep automatically
   ▪  Master detects worker failure during any superstep with a ZooKeeper “health” znode
   ▪  Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
  22. Optional features
•  Combiners
   ▪  Similar to MapReduce combiners
   ▪  Users implement a combine() method that can reduce the amount of messages sent and received
   ▪  Run on both the client side (memory, network) and server side (memory)
•  Aggregators
   ▪  Similar to MPI aggregation routines (i.e. max, min, sum, etc.)
   ▪  Commutative and associative operations that are performed globally
   ▪  Examples include global communication, monitoring, and statistics
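For the maximum-value example earlier in the deck, a combiner only needs to forward the largest message bound for each vertex. The sketch below shows that idea; the exact Giraph combiner interface has changed across versions, so this method signature is illustrative rather than the real API.

    // Illustrative max-value combiner: of all messages destined for one vertex,
    // only the maximum matters, so the rest can be dropped before sending.
    // The signature is a sketch, not the exact Giraph combiner API.
    public IntWritable combine(IntWritable targetVertexId, Iterable<IntWritable> messages) {
      int max = Integer.MIN_VALUE;
      for (IntWritable msg : messages) {
        max = Math.max(max, msg.get());
      }
      return new IntWritable(max);
    }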
  23. Recent Netty IPC implementation
•  Big improvement over the Hadoop RPC implementation
•  10-39% overall performance improvement
•  Still need more Netty tuning
[Figure: runtime in seconds at 10, 30, and 50 workers for Netty vs. Hadoop RPC, with percent improvement per configuration]
  24. Recent benchmarks
•  Test cluster of 80 machines
   ▪  Facebook Hadoop (https://github.com/facebook/hadoop-20)
   ▪  72 cores, 64+ GB of memory
•  org.apache.giraph.benchmark.PageRankBenchmark
   ▪  5 supersteps
   ▪  No checkpointing
   ▪  10 edges per vertex
  25. [Figure: worker scalability: runtime in seconds vs. number of workers, 10 to 50]
  26. [Figure: edge scalability: runtime in seconds vs. number of edges, 1 to 5 billion]
  27. [Figure: worker/edge scalability: runtime in seconds and edge count in billions vs. number of workers, 10 to 50]
  28. Apache Giraph has graduated as of 5/2012
•  Incubated for less than a year (entered the incubator 9/2011)
•  Committers from Hortonworks, Twitter, LinkedIn, Facebook, Trend Micro and various schools (VU Amsterdam, TU Berlin, Korea University)
•  Released 0.1 as of 2/6/2012, will release 0.2 within a few months
  29. Future improvements
•  Out-of-core messages/graph
   ▪  Under memory pressure, dump messages and portions of the graph to local disk
   ▪  Ability to run applications without having all the needed memory
•  Performance improvements
   ▪  Netty is a good step in the right direction, but messaging still needs tuning since it takes up a majority of the run time
   ▪  Scale back the use of ZooKeeper to health registration only, rather than also implementing aggregators and coordination with it
  30. More future improvements
•  Adding a master#compute() method
   ▪  Arbitrary master computation that sends results to workers prior to a superstep, to simplify certain algorithms
   ▪  GIRAPH-127
•  Handling skew
   ▪  Some vertices have a large number of edges, and we need to break them up and handle them differently to provide better scalability
  31. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.
  32. Sessions will resume at 4:30pm
