Large Graph Processing

A lecture for a cloud computing course on large graph processing, covering Pregel and Mizan.

Published in: Education

  1. King Abdullah University of Science and Technology
     CS348: Cloud Computing
     Large-Scale Graph Processing
     Zuhair Khayyat, 10 March 2013
  2. The Importance of Graphs
     ● A graph is a mathematical structure that represents pairwise relations between entities or objects, such as:
       – Physical communication networks
       – Web page links
       – Social interaction graphs
       – Protein-to-protein interactions
     ● Graphs abstract application-specific features into a generic problem, which makes graph algorithms applicable to a wide variety of applications.*
     * http://11011110.livejournal.com/164613.html
  3. Graph Algorithm Characteristics*
     ● Data-driven computations: computations in graph algorithms depend on the structure of the graph, so the algorithm's behavior is hard to predict.
     ● Unstructured problems: different graph distributions require distinct load-balancing techniques.
     ● Poor data locality.
     ● High data-access-to-computation ratio: runtime can be dominated by waiting for memory fetches.
     * Lumsdaine et al., Challenges in Parallel Graph Processing
  4. Challenges in Graph Processing
     ● Graphs grow fast; a single computer either cannot fit a large graph into memory or can do so only at huge cost.
     ● A custom implementation of a single graph algorithm requires time and effort and cannot be reused for other algorithms.
     ● Scientific parallel applications (e.g., parallel PDE solvers) cannot fully adapt to the computational requirements of graph algorithms.*
     ● Fault tolerance is required to support large-scale processing.
     * Lumsdaine et al., Challenges in Parallel Graph Processing
  5. Why Cloud in Graph Processing
     ● Easy to scale up and down: provision machines according to your graph size.
     ● Cheaper than buying a large physical cluster.
     ● Graph processing can be offered in the cloud as "Software as a Service" to support online social networks.
  6. Large-Scale Graph Processing
     ● Systems that try to solve the problem of processing large graphs in parallel:
       – MapReduce: automatic task scheduling, distributed disk-based computations
         ● Pegasus
         ● X-Rime
       – Pregel: Bulk Synchronous Parallel graph processing
         ● Giraph
         ● GPS
         ● Mizan
       – GraphLab: asynchronous parallel graph processing
  7. Pregel* Graph Processing
     ● Consists of a series of synchronized iterations (supersteps), based on the Bulk Synchronous Parallel (BSP) computing model. Each superstep consists of:
       – Concurrent computations
       – Communication
       – Synchronization barrier
     ● Vertex-centric computation: the user's compute() function is applied individually to each vertex, which is able to:
       – Send messages to other vertices, to be delivered in the next superstep
       – Receive messages sent in the previous superstep
     * Malewicz et al., Pregel: A System for Large-Scale Graph Processing
  8. Pregel Messaging Example 1 (slides 8 to 11)
     [Figure: four vertices A, B, C and D shown at supersteps 0 to 3, with numeric messages on the edges between them; each slide adds one superstep to the picture.]
  12. Vertex State
     ● All vertices are active in the first superstep.
     ● Every active vertex runs the user function compute() in each superstep.
     ● A vertex deactivates itself by voting to halt, but returns to the active state if it receives messages.
     ● Pregel terminates when all vertices are inactive and no messages are in transit.
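
The state rules on this slide can be sketched as a small state machine. This is only an illustration under stated assumptions: `VertexState`, `deliver` and `allInactive` are hypothetical names invented here, not part of the Pregel or Mizan APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-vertex state for illustration; not a Pregel API.
struct VertexState {
  bool active = true;                      // every vertex starts out active
  void voteToHalt() { active = false; }    // the vertex deactivates itself
  void receiveMessage() { active = true; } // an incoming message reactivates it
};

// Delivers one message to vertex `dst`, which wakes it up if it had halted.
void deliver(std::vector<VertexState>& vertices, std::size_t dst) {
  assert(dst < vertices.size());
  vertices[dst].receiveMessage();
}

// The run can stop once no vertex is active. Because delivery reactivates
// the receiver, this also implies that no delivered message is unprocessed.
bool allInactive(const std::vector<VertexState>& vertices) {
  for (const VertexState& v : vertices) {
    if (v.active) return false;
  }
  return true;
}
```

For example, after every vertex calls `voteToHalt()`, `allInactive` returns true; a single `deliver` call then makes it false again until the receiver halts once more.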
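  13. Pregel Example 2
     [Flowchart: data distribution (hash-based partitioning) across Worker 1, Worker 2 and Worker 3, followed by computation, communication and the synchronization barrier. A "Done?" check then either terminates the run (yes) or starts another superstep (no).]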
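
Hash-based partitioning as named on this slide can be sketched as follows. The sketch assumes integer vertex IDs and uses the ID itself as the hash value; `workerFor` and `partitionVertices` are hypothetical helper names, not functions from Pregel or Mizan.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps a vertex ID to a worker index with "hash mod N".
// Assumption: the ID itself serves as the hash value.
std::size_t workerFor(std::uint64_t vertexId, std::size_t numWorkers) {
  assert(numWorkers > 0);
  return static_cast<std::size_t>(vertexId % numWorkers);
}

// Groups vertex IDs by the worker that would own them.
std::vector<std::vector<std::uint64_t>> partitionVertices(
    const std::vector<std::uint64_t>& vertexIds, std::size_t numWorkers) {
  std::vector<std::vector<std::uint64_t>> workers(numWorkers);
  for (std::uint64_t id : vertexIds) {
    workers[workerFor(id, numWorkers)].push_back(id);
  }
  return workers;
}
```

With three workers, vertices 0 to 6 are spread as {0, 3, 6}, {1, 4} and {2, 5}. The assignment ignores graph structure, so neighboring vertices often land on different workers.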
  14. Pregel Example 3 – Max (slides 14 to 17)
     [Figure: four vertices propagating the maximum value; each slide adds one superstep. Vertex values per superstep:]
       3 6 2 1
       6 6 2 6
       6 6 6 6
       6 6 6 6
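  18. Pregel Example 4 – Max code
     // Template arguments: vertex value class, edge value class, message class.
     class MaxFindVertex : public Vertex<double, void, double> {
      public:
       virtual void Compute(MessageIterator* msgs) {
         double currMax = GetValue();
         // Check messages and keep the largest value seen.
         for (; !msgs->Done(); msgs->Next()) {
           if (msgs->Value() > currMax)
             currMax = msgs->Value();
         }
         bool changed = currMax > GetValue();
         if (changed)
           *MutableValue() = currMax;  // store new max
         // Send the current max only in superstep 0 or after a change.
         // Sending unconditionally would keep reactivating neighbors,
         // so the computation would never terminate.
         if (superstep() == 0 || changed)
           SendMessageToAllNeighbors(currMax);
         else
           VoteToHalt();
       }
     };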
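
To trace the Max example end to end, the following standalone simulation may help. It is a sketch, not Pregel code: `runMaxValue` is a hypothetical function, the edge list in the usage note is an assumption chosen to be consistent with the values shown on the slides, and a vertex sends only in superstep 0 or after its value changes (an assumption made here so that the run terminates).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simulates BSP max-value propagation and returns the vertex values at the
// end of every superstep. `outEdges[v]` lists the vertices that v sends to.
std::vector<std::vector<double>> runMaxValue(
    std::vector<double> values,
    const std::vector<std::vector<std::size_t>>& outEdges) {
  assert(values.size() == outEdges.size());
  const std::size_t n = values.size();
  std::vector<bool> active(n, true);          // all vertices start active
  std::vector<std::vector<double>> inbox(n);  // messages for this superstep
  std::vector<std::vector<double>> history;

  for (std::size_t superstep = 0;; ++superstep) {
    std::vector<std::vector<double>> nextInbox(n);
    bool anyRan = false;
    for (std::size_t v = 0; v < n; ++v) {
      if (!inbox[v].empty()) active[v] = true;  // a message reactivates v
      if (!active[v]) continue;
      anyRan = true;
      double currMax = values[v];
      for (double m : inbox[v]) {
        if (m > currMax) currMax = m;
      }
      const bool changed = currMax > values[v];
      values[v] = currMax;
      if (superstep == 0 || changed) {
        for (std::size_t dst : outEdges[v]) {
          assert(dst < n);
          nextInbox[dst].push_back(currMax);    // delivered next superstep
        }
      } else {
        active[v] = false;                      // vote to halt
      }
    }
    if (!anyRan) break;            // all vertices inactive, nothing in transit
    history.push_back(values);
    inbox = std::move(nextInbox);  // synchronization barrier
  }
  return history;
}
```

Calling `runMaxValue({3, 6, 2, 1}, {{1}, {0, 3}, {3}, {2}})` uses an assumed topology in which vertices 0 and 1 are linked in both directions, 1 points to 3, and 2 and 3 are linked in both directions. Under that assumption the returned history is {3, 6, 2, 1}, {6, 6, 2, 6}, {6, 6, 6, 6}, {6, 6, 6, 6}, matching the four rows of the Max example.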
  19. Pregel Message Optimizations
     ● Message combiners:
       – A special function that combines the incoming messages for a vertex before compute() runs
       – Can run on either the sending or the receiving worker
     ● Global aggregators:
       – A shared object accessible to all vertices that is synchronized at the end of each superstep, e.g., max and min aggregators
  20. Pregel Guarantees
     ● Scalability: vertices are processed in parallel, and computation overlaps with communication.
     ● Messages are delivered without duplication, but in no guaranteed order.
     ● Fault tolerance through checkpoints.
  21. Pregel's Limitations
     ● A Pregel superstep waits for all workers to finish at the synchronization barrier; that is, it waits for the slowest worker.
     ● Smart partitioning can solve the load-balancing problem for static algorithms. However, not all algorithms are static: an algorithm can have variable execution behavior, which leads to unbalanced supersteps.
  22. Mizan* Graph Processing
     ● Mizan is an open-source graph processing system, similar to Pregel, developed locally at KAUST.
     ● Mizan employs dynamic graph repartitioning, without affecting the correctness of graph processing, to rebalance the execution of the supersteps for all types of workloads.
     * Khayyat et al., Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
  23. Source of Imbalance in BSP (slides 23 and 24)
     [Figures only]
  25. Types of Graph Algorithms
     ● Stationary graph algorithms:
       – Algorithms with a fixed message distribution across supersteps
       – All vertices are either active or inactive at the same time
       – e.g., PageRank, diameter estimation and weakly connected components
     ● Non-stationary graph algorithms:
       – Algorithms with a variable message distribution across supersteps
       – Vertices can be active or inactive independently of one another
       – e.g., distributed minimum spanning tree
  26. Mizan Architecture
     ● Each Mizan worker contains three distinct main components: the BSP processor, the communicator and the storage manager.
     ● A distributed hash table (DHT) is used to maintain the location of each vertex.
     ● The migration planner interacts with the other components during the BSP barrier.
  27. Mizan's Barriers
     [Figure only]
  28. Dynamic Migration: Statistics
     ● Mizan monitors the following for every vertex:
       – Response time
       – Remote outgoing messages
       – Incoming messages
  29. Dynamic Migration: Planning
     ● Mizan's migration planner runs after the BSP barrier and creates a new barrier. The planning includes the following steps:
       – Identify unbalanced workers
       – Identify the migration objective:
         ● Response time
         ● Incoming messages
         ● Outgoing messages
       – Pair over-utilized workers with underutilized ones
       – Select vertices to migrate
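
The first and third planning steps (identifying unbalanced workers and pairing them) might look roughly like the sketch below. It is not Mizan's implementation: `pairWorkers` is a hypothetical function, and the threshold rule (load above the mean by more than a relative `tolerance`) is an assumption made for illustration. `load` stands for whichever objective was chosen, such as response time or message counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns (overUtilized, underUtilized) worker index pairs.
// Assumption: a worker is over-utilized if load > mean * (1 + tolerance),
// and it may only be paired with a worker whose load is below the mean.
std::vector<std::pair<std::size_t, std::size_t>> pairWorkers(
    const std::vector<double>& load, double tolerance) {
  assert(!load.empty() && tolerance >= 0.0);
  const double mean = std::accumulate(load.begin(), load.end(), 0.0) /
                      static_cast<double>(load.size());

  // Worker indices sorted from most to least loaded.
  std::vector<std::size_t> order(load.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&load](std::size_t a, std::size_t b) {
                     return load[a] > load[b];
                   });

  // Pair the heaviest remaining worker with the lightest remaining one.
  std::vector<std::pair<std::size_t, std::size_t>> pairs;
  std::size_t hi = 0;
  std::size_t lo = order.size() - 1;
  while (hi < lo && load[order[hi]] > mean * (1.0 + tolerance) &&
         load[order[lo]] < mean) {
    pairs.emplace_back(order[hi], order[lo]);
    ++hi;
    --lo;
  }
  return pairs;
}
```

For loads {9, 2, 5, 4} and a tolerance of 0.2, the mean is 5 and only worker 0 exceeds the threshold of 6, so the single pair is (0, 1). Selecting which vertices to move between a pair is a separate step that this sketch leaves out.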
  30. Mizan's Migration Workflow
     [Figure only]
  31. Mizan PageRank Compute() Example
     // vertexTotal and maxSuperStep are defined outside this function.
     void compute(messageIterator<mDouble> * messages,
                  userVertexObject<mLong, mDouble, mDouble, mLong> * data,
                  messageManager<mLong, mDouble, mDouble, mLong> * comm) {
       double currVal = data->getVertexValue().getValue();
       double newVal = 0;
       double c = 0.85;
       // Processing messages: sum the incoming rank contributions.
       while (messages->hasNext()) {
         double tmp = messages->getNext().getValue();
         newVal = newVal + tmp;
       }
       newVal = newVal * c + (1.0 - c) / ((double) vertexTotal);
       mDouble outVal(newVal / ((double) data->getOutEdgeCount()));
       // Termination condition: stop after maxSuperStep supersteps.
       if (data->getCurrentSS() <= maxSuperStep) {
         // Sending to neighbors
         for (int i = 0; i < data->getOutEdgeCount(); i++) {
           comm->sendMessage(data->getOutEdgeID(i), outVal);
         }
       } else {
         data->voteToHalt();
       }
       data->setVertexValue(mDouble(newVal));
     }
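
The update that compute() performs can be written as a formula. The symbols are read off the code: c = 0.85 is the constant `c`, N stands for `vertexTotal`, and each incoming message carries the sender's value divided by its out-edge count. The notation In(v) for the vertices with an edge into v, outdeg(u) for the out-edge count of u, and the superstep index s are introduced here for illustration.

```latex
\[
PR_{s+1}(v) \;=\; \frac{1 - c}{N} \;+\; c \sum_{u \in \mathrm{In}(v)} \frac{PR_{s}(u)}{\mathrm{outdeg}(u)},
\qquad c = 0.85
\]
```

In a superstep where a vertex receives no messages, the sum is empty and its value becomes (1 - c)/N.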
  32. Mizan PageRank Combiner Example
     void combineMessages(mLong dst, messageIterator<mDouble> * messages,
                          messageManager<mLong, mDouble, mDouble, mLong> * mManager) {
       // Sum all messages addressed to dst and forward a single message.
       double newVal = 0;
       while (messages->hasNext()) {
         double tmp = messages->getNext().getValue();
         newVal = newVal + tmp;
       }
       mDouble messageOut(newVal);
       mManager->sendMessage(dst, messageOut);
     }
  33. Mizan Max Aggregator Example
     class maxAggregator : public IAggregator<mLong> {
      public:
       mLong aggValue;
       maxAggregator() {
         aggValue.setValue(0);
       }
       // Keep the largest value seen so far.
       void aggregate(mLong value) {
         if (value > aggValue) {
           aggValue = value;
         }
       }
       mLong getValue() {
         return aggValue;
       }
       void setValue(mLong value) {
         this->aggValue = value;
       }
       virtual ~maxAggregator() {}
     };
  34. Class Assignment
     ● Your assignment is to configure, install and run Mizan on a single Linux machine by following this tutorial: https://thegraphsblog.wordpress.com/mizan-on-ubuntu/
     ● By the end of the tutorial, you should be able to execute the following command on your machine:
         mpirun -np 2 ./Mizan-0.1b -u ubuntu -g web-Google.txt -w 2
     ● Deliverables: store the output of the above command and submit it by Wednesday's class.
     ● For any questions regarding the tutorial, or to get an account on an Ubuntu machine, contact me at: zuhair.khayyat@kaust.edu.sa
