2011.06.29. Giraph - Hadoop Summit 2011



Giraph : Large-scale graph processing on Hadoop

Web and online social graphs have been rapidly growing in size and
scale during the past decade. In 2008, Google estimated that the
number of web pages reached over a trillion. Online social networking
and email sites, including Yahoo!, Google, Microsoft, Facebook,
LinkedIn, and Twitter, have hundreds of millions of users and are
expected to grow much more in the future. Processing these graphs
plays a big role in delivering relevant and personalized information
to users, such as results from a search engine or news in an online
social networking site.

Graph processing platforms to run large-scale algorithms (such as page
rank, shared connections, personalization-based popularity, etc.) have
become quite popular. Some recent examples include Pregel and HaLoop.
For general-purpose big data computation, the map-reduce computing
model has been well adopted and the most deployed map-reduce
infrastructure is Apache Hadoop. We have implemented a
graph-processing framework that is launched as a typical Hadoop job to
leverage existing Hadoop infrastructure, such as Amazon’s EC2. Giraph
builds upon the graph-oriented nature of Pregel but additionally adds
fault-tolerance to the coordinator process with the use of ZooKeeper
as its centralized coordination service and is in the process of being

Giraph follows the bulk-synchronous parallel model relative to graphs
where vertices can send messages to other vertices during a given
superstep. Checkpoints are initiated by the Giraph infrastructure at
user-defined intervals and are used for automatic application restarts
when any worker in the application fails. Any worker in the
application can act as the application coordinator and one will
automatically take over if the current application coordinator fails.

Published in: Technology

  1. 1. Giraph: Large-scale graph processing infrastructure on Hadoop<br />Avery Ching, Yahoo!<br />Christian Kunz, Jybe <br />6/29/2011<br />
  2. 2. Why scalable graph processing?<br />Real world web and social graphs are at immense scale and continuing to grow<br />In 2008, Google estimated the number of web pages at 1 trillion<br />In 2011, Goldman Sachs disclosed Facebook has over 600 million active users<br />In March 2011, LinkedIn said it had over 100 million users<br />In October 2010, Twitter claimed to have over 175 million users<br />Relevant and personalized information for users relies strongly on iterative graph ranking algorithms (search results, social news, ads, etc.)<br />In web graphs, page rank and its variants<br />In social graphs, popularity rank, personalized rankings, shared connections, shortest paths, etc.<br />
  3. 3. Existing solutions<br />Sequence of map-reduce jobs in Hadoop<br />Classic map-reduce overheads (job startup/shutdown, reloading data from HDFS, shuffling)<br />Map-reduce programming model not a good fit for graph algorithms<br />Google’s Pregel<br />Requires its own computing infrastructure<br />Not available (unless you work at Google)<br />Master is a SPOF<br />Message passing interface (MPI)<br />Not fault-tolerant<br />Too generic<br />
  4. 4. Giraph<br />Leverage Hadoop installations around the world for iterative graph processing<br />Big data today is processed on Hadoop with the Map-Reduce computing model<br />Map-Reduce with Hadoop is widely deployed outside of Yahoo! as well (e.g. EC2, Cloudera, etc.)<br />Bulk synchronous processing (BSP) computing model<br />Input data loaded once during the application, all messaging in memory<br />Fault-tolerant/dynamic graph processing infrastructure<br />Automatically adjust to available resources on the Hadoop grid<br />No single point of failure except Hadoop namenode and jobtracker<br />Relies on ZooKeeper as a fault-tolerant coordination service<br />Vertex-centric API to do graph processing (similar to the map() method in Hadoop) in a bulk synchronous parallel computing model<br />Inspired by Pregel<br />Open-source<br />https://github.com/aching/Giraph<br />
  5. 5. Bulk synchronous parallel model<br />Computation consists of a series of “supersteps”<br />Supersteps are an atomic unit of computation where operations can happen in parallel<br />During a superstep, components are assigned tasks and receive unordered messages from the previous superstep<br />Components communicate through point-to-point messaging<br />All (or a subset of) components can be synchronized through the superstep concept<br />[Figure: components 1, 2, and 3 advancing through supersteps i, i+1, and i+2 as the application progresses]<br />
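The superstep rule above (messages sent in one superstep only become visible in the next) can be illustrated with a small stand-alone simulation. This is a minimal sketch that does not use Giraph or Hadoop; the class name `BspRingDemo`, the ring topology, and the message values are assumptions chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy bulk-synchronous loop (hypothetical, not Giraph code): components send values around a ring. */
class BspRingDemo {
    /** Runs the given number of supersteps and returns the sum of values each component received. */
    static int[] run(int components, int supersteps) {
        int[] received = new int[components];
        // inbox.get(c) holds the messages visible to component c in the current superstep.
        List<List<Integer>> inbox = emptyBoxes(components);
        for (int step = 0; step < supersteps; step++) {
            // Messages sent now land here and stay invisible until the next superstep.
            List<List<Integer>> nextInbox = emptyBoxes(components);
            for (int c = 0; c < components; c++) {
                // Consume the messages sent during the previous superstep.
                for (int msg : inbox.get(c)) {
                    received[c] += msg;
                }
                // Point-to-point send of the value (c + 1) to the right-hand neighbour.
                nextInbox.get((c + 1) % components).add(c + 1);
            }
            // Barrier: swapping the buffers models the synchronization between supersteps.
            inbox = nextInbox;
        }
        return received;
    }

    private static List<List<Integer>> emptyBoxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }
}
```

With a single superstep no component has received anything yet, because the first messages are still waiting behind the barrier.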
  6. 6. Writing a Giraph application<br />Every vertex calls its compute() method once during a superstep<br />Analogous to the map() method in Hadoop for a <key, value> tuple<br />Users choose four types for their implementation of Vertex (I VertexId, V VertexValue, E EdgeValue, M MsgValue)<br />
  7. 7. Basic Giraph API<br />Local vertex mutations happen immediately<br />Graph mutations are processed just prior to the next superstep<br />Sent messages are received during compute() in the next superstep<br />Vertices “vote” to end the computation<br />Once all vertices have voted to end the computation, the application is finished<br />
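  8. 8. Page rank example<br />import java.util.Iterator;<br />import org.apache.hadoop.io.DoubleWritable;<br />import org.apache.hadoop.io.FloatWritable;<br />import org.apache.hadoop.io.LongWritable;<br />// HadoopVertex is provided by Giraph<br />public class SimplePageRankVertex extends HadoopVertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {<br /> public void compute(Iterator<DoubleWritable> msgIterator) {<br /> // Sum the rank contributions sent during the previous superstep<br /> double sum = 0;<br /> while (msgIterator.hasNext()) {<br /> sum += msgIterator.next().get();<br /> }<br /> setVertexValue(new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));<br /> if (getSuperstep() < 30) {<br /> // Split this vertex's rank evenly across its out-edges<br /> long edges = getOutEdgeIterator().size();<br /> sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));<br /> } else {<br /> voteToHalt();<br /> }<br /> }<br />}<br />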
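To see what the vertex program on this slide computes without a Hadoop cluster, the same update rule can be run over a plain adjacency array. The following is a rough sketch under stated assumptions: the class `PageRankSketch`, its `run` method, and the array-based graph representation are hypothetical and are not part of the Giraph API; only the formula (0.15 / numVertices + 0.85 * sum) and the superstep cutoff come from the slide.

```java
/** Stand-alone re-creation of the slide's page rank update rule (hypothetical helper, not Giraph). */
class PageRankSketch {
    /**
     * outEdges[v] lists the targets of v's out-edges; every vertex is assumed
     * to have at least one out-edge. Vertices compute in supersteps
     * 0..maxSupersteps and stop sending in the last one, which is where
     * the slide's vertex calls voteToHalt().
     */
    static double[] run(int[][] outEdges, int maxSupersteps) {
        int n = outEdges.length;
        double[] value = new double[n];
        // incoming[v] is the sum of the messages delivered to v for the current superstep.
        double[] incoming = new double[n];
        for (int superstep = 0; superstep <= maxSupersteps; superstep++) {
            double[] outgoing = new double[n];
            for (int v = 0; v < n; v++) {
                value[v] = 0.15 / n + 0.85 * incoming[v];
                if (superstep < maxSupersteps) {
                    // Split the rank evenly across the out-edges.
                    double share = value[v] / outEdges[v].length;
                    for (int target : outEdges[v]) {
                        outgoing[target] += share;
                    }
                }
            }
            // Messages become visible only in the next superstep.
            incoming = outgoing;
        }
        return value;
    }
}
```

On a three-vertex ring, `PageRankSketch.run(new int[][] {{1}, {2}, {0}}, 30)` gives every vertex the same value, slightly below 1/3, since the ranks are still converging when the computation halts.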
  9. 9. Thread architecture<br />[Figure: a map-only job in Hadoop (map tasks 0 to 4) and the thread assignment in Giraph (worker, master, and ZooKeeper threads)]<br />
  10. 10. Thread responsibilities<br />Master<br />Only one active master at a time<br />Runs the VertexInputFormat getSplits() to get the appropriate number of VertexSplit objects for the application and writes them to ZooKeeper<br />Coordinates application<br />Synchronizes supersteps, end of application<br />Handles changes that can occur within supersteps (e.g. vertex movement, change in number of workers, etc.)<br />Worker<br />Reads vertices from one or more VertexSplit objects, splitting them into VertexRanges<br />Executes the compute() method for every Vertex it is assigned once per superstep<br />Buffers the incoming messages to every Vertex it is assigned for the next superstep<br />ZooKeeper<br />Manages a server that is part of the ZooKeeper quorum (maintains global application state)<br />
  11. 11. Vertex distribution<br />[Figure: the master's VertexSplit 0 and VertexSplit 1 are divided into VertexRange 0 to 3, which are assigned to workers 0 and 1]<br />Master uses the VertexInputFormat to divide the input data into VertexSplit objects and serializes them to ZooKeeper.<br />Workers process the VertexSplit objects and divide them further to create VertexRange objects.<br />Prior to every superstep, the master assigns every worker 0 or more VertexRange objects. They are responsible for the vertices that fall in those VertexRange objects.<br />
  12. 12. Worker phases in a superstep<br />Master selects one or more of the available workers to use for the superstep<br />Users can set the checkpoint frequency<br />Checkpoints are implemented by Giraph (all types implement Writable)<br />Users can determine how to distribute vertex ranges on the set of available workers<br />BSP model allows for dynamic resource usage<br />Every superstep is an atomic unit of computation<br />Resources can change between supersteps and be used accordingly (shrink or grow)<br />Worker phases, in order:<br />Prepare by registering worker health<br />If desired, checkpoint vertices<br />If desired, exchange vertex ranges<br />Execute compute() on each vertex and exchange messages<br />
  13. 13. Fault tolerance<br />No single point of failure from BSP threads<br />With multiple master threads, if the current master dies, a new one will automatically take over.<br />If a worker thread dies, the application is rolled back to a previously checkpointed superstep. The next superstep will begin with the new number of workers<br />If a ZooKeeper server dies, as long as a quorum remains, the application can proceed<br />Hadoop single points of failure still exist<br />Namenode, jobtracker<br />Restarting manually from a checkpoint is always possible<br />
  14. 14. Master thread fault tolerance<br />[Figure: before the failure of active master 0, master 0 is “active” and masters 1 and 2 are “spare”; after the failure, master 1 is “active” and master 2 remains “spare”]<br />One active master, with spare masters taking over in the event of an active master failure<br />All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails<br />“Active” master implemented as a queue in ZooKeeper<br />
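The queue-based takeover described on this slide can be pictured with an in-memory stand-in. This is only a sketch: it uses a local `ArrayDeque` in place of ZooKeeper and makes no ZooKeeper calls, and the class `MasterQueueSketch` and its method names are hypothetical rather than taken from Giraph.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory stand-in for the "active master as a queue" idea (hypothetical, no ZooKeeper involved). */
class MasterQueueSketch {
    // Masters in registration order; the head of the queue is the active master.
    private final Deque<String> candidates = new ArrayDeque<>();

    /** A master joins at the tail and is a spare unless the queue was empty. */
    void register(String masterId) {
        candidates.addLast(masterId);
    }

    /** Returns the active master, or null when no master is registered. */
    String activeMaster() {
        return candidates.peekFirst();
    }

    /** Drops a failed master; if it was the head, the next spare becomes active. */
    void fail(String masterId) {
        candidates.remove(masterId);
    }
}
```

Registering `master0`, `master1`, and `master2` and then failing `master0` leaves `master1` active, which matches the before and after states in the slide's figure. Failing a spare does not change the active master.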
  15. 15. Worker thread fault tolerance<br />[Figure: superstep timeline (i to i+3, some checkpointed) showing two cases, “Worker failure!” and “Worker failure after checkpoint complete!”, ending in application completion]<br />A single worker failure causes the superstep to fail<br />In order to disrupt the application, a worker must be registered and chosen for the current superstep<br />Application reverts to the last committed superstep automatically<br />Master detects worker failure during any superstep through the loss of a ZooKeeper “health” znode<br />Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep<br />
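The rollback behaviour on this slide can be traced with a small simulation of the superstep counter. This is a sketch under assumptions, not Giraph's recovery code: the class `CheckpointRollbackSketch` is hypothetical, a checkpoint is assumed to be taken at the start of every superstep whose number is a multiple of the user-chosen frequency, and exactly one worker failure is injected.

```java
import java.util.ArrayList;
import java.util.List;

/** Traces superstep execution order with one injected worker failure (hypothetical sketch). */
class CheckpointRollbackSketch {
    /**
     * Returns every superstep attempt in order, including the failed one.
     * Pass failAt = -1 to run without a failure.
     */
    static List<Integer> run(int totalSupersteps, int checkpointFrequency, int failAt) {
        List<Integer> executed = new ArrayList<>();
        int lastCheckpoint = 0;
        boolean failurePending = failAt >= 0;
        int superstep = 0;
        while (superstep < totalSupersteps) {
            // Checkpoint before computing, so a restart re-executes this superstep.
            if (superstep % checkpointFrequency == 0) {
                lastCheckpoint = superstep;
            }
            executed.add(superstep);
            if (failurePending && superstep == failAt) {
                // A single worker failure fails the superstep: roll back to the last checkpoint.
                failurePending = false;
                superstep = lastCheckpoint;
                continue;
            }
            superstep++;
        }
        return executed;
    }
}
```

With five supersteps, a checkpoint every second superstep, and a failure during superstep 3, the trace is 0, 1, 2, 3, 2, 3, 4: the work done since the checkpoint at superstep 2 is repeated, which is the cost that the checkpoint frequency trades against checkpointing overhead.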
  16. 16. Optional features<br />Combiners<br />Similar to Map-Reduce combiners<br />Users implement a combine() method that can reduce the number of messages sent and received<br />Run on both the client side and server side<br />Client side saves memory and message traffic<br />Server side saves memory<br />Aggregators<br />Similar to MPI aggregation routines (e.g. max, min, sum, etc.)<br />Users can write their own aggregators<br />Commutative and associative operations that are performed globally<br />Examples include global communication, monitoring, and statistics<br />
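As an illustration of how a combiner reduces message volume, the sketch below folds all messages addressed to the same vertex into one running sum before they are delivered. It is an assumption-laden example, not the Giraph combiner interface: the class `SumCombinerSketch` and its methods are hypothetical, and summation is chosen because it suits page rank style messages.

```java
import java.util.HashMap;
import java.util.Map;

/** Sender-side buffer that sums messages per target vertex (hypothetical sketch, not the Giraph API). */
class SumCombinerSketch {
    // One combined message per target vertex instead of one entry per send.
    private final Map<Long, Double> pending = new HashMap<>();

    /** Buffers a message, folding it into any message already queued for the same target. */
    void send(long targetVertex, double value) {
        pending.merge(targetVertex, value, Double::sum);
    }

    /** Number of messages that would actually be transmitted. */
    int bufferedMessageCount() {
        return pending.size();
    }

    /** Combined value queued for a vertex, or 0.0 if nothing was sent to it. */
    double messageFor(long targetVertex) {
        return pending.getOrDefault(targetVertex, 0.0);
    }
}
```

Three sends (0.25 and 0.5 to vertex 7, 1.0 to vertex 9) leave only two buffered messages, with 0.75 queued for vertex 7. Combining is only safe when the receiving vertex needs the combined result rather than the individual messages, which is why it fits commutative and associative operations such as sum.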
  17. 17. Early Yahoo! customers<br />Web of Objects<br />Currently used for the movie database (tens of millions of records, run with 400 workers)<br />Popularity rank, shared connections, personalized page rank<br />Web map<br />Next generation page-rank related algorithms will use this framework (250 billion web pages)<br />Current graph processing solution uses MPI (no fault-tolerance, customized code)<br />
  18. 18. Future work<br />Improved fault tolerance testing<br />Unit tests exist, but test a limited subset of failures<br />Performance testing<br />Tested with 400 workers, but thousands desired<br />Importing some of the currently used graph algorithms into an included library<br />Storing messages that surpass available memory on disk<br />Leverage the programming model, maybe even convert to a native Map-Reduce application<br />
  19. 19. Conclusion<br />Giraph is a graph processing infrastructure that runs on existing Hadoop infrastructure<br />Already being used at Yahoo!<br />Open source – available on GitHub<br />https://github.com/aching/Giraph<br />In the process of submitting an Apache Incubator proposal<br />Questions/Comments?<br />aching@yahoo-inc.com<br />