Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Zuhair Khayyat‡, Karim Awara‡, Amani Alonazi‡, Hani Jamjoom†, Dan Williams†, Panos Kalnis‡

‡King Abdullah University of Science and Technology, Saudi Arabia
†IBM T. J. Watson Research Center, Yorktown Heights, NY

Abstract

Pregel [23] was recently introduced as a scalable graph mining system that can provide significant performance improvements over traditional MapReduce implementations. Existing implementations focus primarily on graph partitioning as a preprocessing step to balance computation across compute nodes. In this paper, we examine the runtime characteristics of a Pregel system. We show that graph partitioning alone is insufficient for minimizing end-to-end computation. Especially where data is very large or the runtime behavior of the algorithm is unknown, an adaptive approach is needed. To this end, we introduce Mizan, a Pregel system that achieves efficient load balancing to better adapt to changes in computing needs. Unlike known implementations of Pregel, Mizan does not assume any a priori knowledge of the structure of the graph or behavior of the algorithm. Instead, it monitors the runtime characteristics of the system. Mizan then performs efficient fine-grained vertex migration to balance computation and communication. We have fully implemented Mizan; using extensive evaluation we show that—especially for highly-dynamic workloads—Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

1. Introduction

With the increasing emphasis on big data, new platforms are being proposed to better exploit the structure of both the data and algorithms (e.g., Pregel [23], HADI [14], PEGASUS [13] and X-RIME [31]). Recently, Pregel [23] was introduced as a message passing-based programming model specifically targeting the mining of large graphs. For a wide array of popular graph mining algorithms (including PageRank, shortest paths problems, bipartite matching, and semi-clustering), Pregel was shown to improve the overall performance by 1-2 orders of magnitude [17] over a traditional MapReduce [6] implementation. Pregel builds on the Bulk Synchronous Parallel (BSP) [30] programming model. It operates on graph data, consisting of vertices and edges. Each vertex runs an algorithm and can send messages—asynchronously—to any other vertex. Computation is divided into a number of supersteps (iterations), each separated by a global synchronization barrier. During each superstep, vertices run in parallel across a distributed infrastructure. Each vertex processes incoming messages from the previous superstep. This process continues until all vertices have no messages to send, thereby becoming inactive.

Not surprisingly, balanced computation and communication is fundamental to the efficiency of a Pregel system. To this end, existing implementations of Pregel (including Giraph [8], GoldenOrb [9], Hama [28], Surfer [4]) primarily focus on efficient partitioning of input data as a preprocessing step. They take one or more of the following five approaches to achieving a balanced workload: (1) provide simple graph partitioning schemes, like hash- or range-based partitioning (e.g., Giraph), (2) allow developers to set their own partitioning scheme or pre-partition the graph data (e.g., Pregel), (3) provide more sophisticated partitioning techniques (e.g., GraphLab, GoldenOrb and Surfer use min-cuts [16]), (4) utilize distributed data stores and graph indexing on vertices and edges (e.g., GoldenOrb and Hama), and (5) perform coarse-grained load balancing (e.g., Pregel).

In this paper, we analyze the efficacy of these workload-balancing approaches when applied to a broad range of graph mining problems. Built into the design of these approaches is the assumption that the structure of the graph is static, the algorithm has predictable behavior, or that the developer has deep knowledge about the runtime characteristics of the algorithm. Using large datasets from the Laboratory for Web Algorithmics (LAW) [2, 20] and a broad set of graph algorithms, we show that—especially when the graph mining algorithm has unpredictable communication needs, frequently changes the structure of the graph, or has a variable computation complexity—a single solution is not enough. We further show that a poor partitioning can result in significant performance degradation or large up-front cost.

More fundamentally, existing workload balancing approaches suffer from poor adaptation to variability (or shifts) in computation needs. Even the dynamic partition assignment that was proposed in Pregel [23] reacts poorly to highly dynamic algorithms. We are, thus, interested in building a system that (1) is adaptive, (2) is agnostic to the graph structure, and (3) requires no a priori knowledge of the behavior of the algorithm. At the same time, we are interested in improving the overall performance of the system.

To this end, we introduce Mizan¹, a Pregel system that supports fine-grained load balancing across supersteps. Similar to other Pregel implementations, Mizan supports different schemes to pre-partition the input graph. Unlike existing implementations, Mizan monitors the runtime characteristics of all vertices (i.e., their execution time, and incoming and outgoing messages). Using these measurements, at the end of every superstep, Mizan constructs a migration plan that minimizes the variations across workers by identifying which vertices to migrate and where to migrate them to. Finally, Mizan performs efficient vertex migration across workers, leveraging a distributed hash table (DHT)-based location service to track the movement of vertices as they migrate.

All components of Mizan support distributed execution, eliminating the need for a centralized controller. We have fully implemented Mizan in C++, with the messaging—BSP—layer implemented using MPI [3]. Using a representative number of datasets and a broad spectrum of algorithms, we compare our implementation against Giraph, a popular Pregel clone, and show that Mizan consistently outperforms Giraph by an average of 202%. Normalizing across platforms, we show that for stable workloads, Mizan's dynamic load balancing matches the performance of static partitioning without requiring a priori knowledge of the structure of the graph or the algorithm's runtime characteristics. When the algorithm is highly dynamic, we show that Mizan provides over 87% performance improvement. Even with inefficient input partitioning, Mizan is able to reduce the computation overhead by up to 40% when compared to the static case, incurring only a 10% overhead for performing dynamic vertex migration. We even run our implementation on an IBM Blue Gene/P supercomputer, demonstrating linear scalability to 1024 CPUs. To the best of our knowledge, this is the largest scale-out test of a Pregel-like system to date.

To summarize, we make the following contributions:

• We analyze different graph algorithm characteristics that can contribute to imbalanced computation of a Pregel system.
• We propose a dynamic vertex migration model based on runtime monitoring of vertices to optimize the end-to-end computation.
• We fully implement Mizan in C++ as an optimized Pregel system that supports dynamic load balancing and efficient vertex migration.
• We deploy Mizan on a local Linux cluster (21 machines) and evaluate its efficacy on a representative number of datasets. We also show the linear scalability of our design by running Mizan on a 1024-CPU IBM Blue Gene/P.

This paper is organized as follows. Section 2 describes factors affecting algorithm behavior and discusses a number of example algorithms. In Section 3, we introduce the design of Mizan. We describe implementation details in Section 4. Section 5 compares the performance of Mizan using a wide array of datasets and algorithms. We then describe related work in Section 6. Section 7 discusses future work and Section 8 concludes.

¹ Mizan is Arabic for a double-pan scale, commonly used in reference to achieving balance.

2. Dynamic Behavior of Algorithms

In the Pregel (BSP) computing model, several factors can affect the runtime performance of the underlying system (shown in Figure 1). During a superstep, each vertex is either in an active or inactive state. In an active state, a vertex may be computing, sending messages, or processing received messages.

[Figure 1. Factors that can affect the runtime in the Pregel framework: three workers proceed through supersteps separated by BSP barriers; the per-vertex costs are the vertex response time, the time to receive in-messages, and the time to send out-messages.]

Naturally, vertices can experience variable execution times depending on the algorithm they implement or their degree of connectivity. A densely connected vertex, for example, will likely spend more time sending and processing incoming messages than a sparsely connected vertex. The BSP programming model, however, masks this variation by (1) overlapping communication and computation, and (2) running many vertices on the same compute node. Intuitively, as long as the workloads of vertices are roughly balanced across computing nodes, the overall computation time will be minimized.

Counter to intuition, achieving a balanced workload is not straightforward. There are two sources of imbalance: one originating from the graph structure and another from the algorithm behavior. In both cases, vertices on some compute nodes can spend a disproportional amount of time computing, sending or processing messages. In some cases, these vertices can run out of input buffer capacity and start paging, further exacerbating the imbalance.

Existing implementations of Pregel (including Giraph [8], GoldenOrb [9], Hama [28]) focus on providing multiple alternatives to partitioning the graph data. The three common approaches to partitioning the data are hash-based, range-based, or min-cut [16]. Hash- and range-based partitioning methods divide a dataset based on a simple heuristic: to evenly distribute vertices across compute nodes, irrespective of their edge connectivity. Min-cut based partitioning, on the other hand, considers vertex connectivity and partitions the data such that it places strongly connected vertices close to each other (i.e., on the same cluster). The resulting performance of these partitioning approaches is, however, graph dependent. To demonstrate this variability, we ran a simple—and highly predictable—PageRank algorithm on different datasets (summarized in Table 1) using the three popular partitioning methods. Figure 2 shows that none of the partitioning methods consistently outperforms the rest, noting that ParMETIS [15] partitioning cannot be performed on the arabic-2005 graph due to memory limitations.

[Figure 2. The difference in execution time (run time in minutes) when processing PageRank on different graphs (LiveJournal1, kg4m68m and arabic-2005) using hash-based, range-based and min-cuts partitioning. Because of its size, arabic-2005 cannot be partitioned using min-cuts (ParMETIS) in our local cluster.]

In addition to the graph structure, the running algorithm can also affect the workload balance across compute nodes. Broadly speaking, graph algorithms can be divided into two categories (based on their communication characteristics across supersteps): stationary and non-stationary.

Stationary Graph Algorithms. An algorithm is stationary if its active vertices send and receive the same distribution of messages across supersteps. At the end of a stationary algorithm, all active vertices become inactive (terminate) during the same superstep. Usually graph algorithms represented by a matrix-vector multiplication² are stationary algorithms, including PageRank, diameter estimation and finding weakly connected components.

Non-stationary Graph Algorithms. A graph algorithm is non-stationary if the destination or size of its outgoing messages changes across supersteps. Such variations can create workload imbalances across supersteps. Examples of non-stationary algorithms include distributed minimal spanning tree construction (DMST), graph queries, and various simulations on social network graphs (e.g., advertisement propagation).

² The matrix represents the graph adjacency matrix and the vector represents the vertices' value.

[Figure 3. The difference between stationary and non-stationary graph algorithms with respect to the incoming messages (in millions, log scale) per superstep, for PageRank and DMST. Total represents the sum across all workers and Max represents the maximum amount (on a single worker) across all workers.]

To illustrate the differences between the two classes of algorithms, we compared the runtime behavior of PageRank (stationary) against DMST (non-stationary) when processing the same dataset (LiveJournal1) on a cluster of 21 machines. The input graph was partitioned using a hash function. Figure 3 shows the variability in the incoming messages per superstep for all workers. In particular, the variability can span over five orders of magnitude for non-stationary algorithms.

The remainder of the section describes four popular graph mining algorithms that we use throughout the paper. They cover both stationary and non-stationary algorithms.

2.1 Example Algorithms

PageRank. PageRank [24] is a stationary algorithm that uses matrix-vector multiplications to calculate the principal eigenvector of the graph's adjacency matrix at each iteration.³ The algorithm terminates when the PageRank values of all nodes change by less than a defined error during an iteration. At each superstep, all vertices are active, where every vertex always receives messages from its in-edges and sends messages to all of its out-edges. All messages have a fixed size (8 bytes) and the computation complexity on each vertex is linear in the number of messages.

³ Formally, each iteration calculates v^(k+1) = c A^T v^(k) + (1 − c)/|N|, where v^(k) is the eigenvector of iteration k, c is the damping factor used for normalization, and A is a row-normalized adjacency matrix.
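
To make the update concrete, the following self-contained C++ sketch runs the footnote's iteration on a toy four-vertex graph, with one loop pass standing in for one superstep. The graph, damping factor, and convergence threshold are illustrative choices, not values taken from the paper.

    // Standalone sketch of the iteration v(k+1) = c*A^T*v(k) + (1-c)/|N|.
    // Each pass mimics a Pregel superstep: every vertex "sends"
    // value/out_degree along its out-edges, then applies damping.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
      // Toy graph as out-edge adjacency lists (hypothetical, 4 vertices).
      std::vector<std::vector<int>> out = {{1, 2}, {2}, {0}, {0, 2}};
      const int n = static_cast<int>(out.size());
      const double c = 0.85, eps = 1e-9;
      std::vector<double> rank(n, 1.0 / n);

      for (int step = 0; step < 100; ++step) {      // supersteps
        std::vector<double> incoming(n, 0.0);       // per-vertex message sums
        for (int u = 0; u < n; ++u)
          for (int v : out[u]) incoming[v] += rank[u] / out[u].size();
        double delta = 0.0;
        for (int v = 0; v < n; ++v) {
          double next = c * incoming[v] + (1.0 - c) / n;
          delta += std::fabs(next - rank[v]);
          rank[v] = next;
        }
        if (delta < eps) break;                     // all vertices would halt
      }
      for (int v = 0; v < n; ++v) std::printf("rank[%d] = %f\n", v, rank[v]);
      return 0;
    }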

Top-K Ranks in PageRank. It is often desirable to perform graph queries, like using PageRank to find the top k vertex ranks reachable to a vertex after y supersteps. In this case, PageRank runs for x supersteps; at superstep x + 1, each vertex sends its rank to its direct out neighbors, receives other ranks from its in neighbors, and stores the highest k ranks received. At each superstep between x + 2 and x + y, each vertex sends the top-k ranks stored locally to its direct out neighbors, and again stores the highest k ranks received from its in neighbors. The message size generated by each vertex can be between 1 and k, depending on the number of ranks stored at the vertex. If the highest k values maintained by a vertex did not change at a superstep, that vertex votes to halt and becomes inactive until it receives messages from other neighbors. The size of the message sent between two vertices varies depending on the actual ranks. We classify this algorithm as non-stationary because of the variability in message size and in the number of messages sent and received.

Distributed Minimal Spanning Tree (DMST). Implementing the GHS algorithm of Gallager, Humblet and Spira [7] in the Pregel model requires that, given a weighted graph, every vertex classifies the state of its edges as (1) branch (i.e., belongs to the minimal spanning tree (MST)), (2) rejected (i.e., does not belong to the MST), or (3) basic (i.e., unclassified), in each superstep. The algorithm starts with each vertex as a fragment, then joins fragments until there is only one fragment left (i.e., the MST). Each fragment has an ID and level. Each vertex updates its fragment ID and level by exchanging messages with its neighbors. The vertex has two states: active and inactive. There is also a list of auto-activated vertices defined by the user. At the early stage of the algorithm, active vertices send messages over the minimum weighted edges (i.e., the edge that has the minimum weight among the other in-edges and out-edges). As the algorithm progresses, more messages are sent across many edges according to the number of fragments identified during the MST computations at each vertex. At the last superstep, each vertex knows which of its edges belong to the minimum weighted spanning tree. The computational complexity for each vertex is quadratic in the number of incoming messages. We classify this algorithm as non-stationary because the flow of messages between vertices is unpredictable and depends on the state of the edges at each vertex.

Simulating Advertisements on Social Networks. To simulate advertisements on a social network graph, each vertex represents a profile of a person containing a list of his/her interests. A small set of selected vertices are identified as sources and send different advertisements to their direct neighbors at each superstep for a predefined number of supersteps. When a vertex receives an advertisement, it is either forwarded to the vertex's neighbors (based on the vertex's interest matrix) or ignored. This algorithm has a computation complexity that depends on the count of active vertices at a superstep and on whether the active vertices communicate with their neighbors or not. It is, thus, a non-stationary algorithm.

We have also implemented and evaluated a number of other—stationary and non-stationary—algorithms including diameter estimation and finding weakly connected components. Given that their behavior is consistent with the results in the paper, we omit them for space considerations.

3. Mizan

Mizan is a BSP-based graph processing system that is similar to Pregel, but focuses on efficient dynamic load balancing of both computation and communication across all worker (compute) nodes. Like Pregel, Mizan first reads and partitions the graph data across workers. The system then proceeds as a series of supersteps, each separated by a global synchronization barrier. During each superstep, each vertex processes incoming messages from the previous superstep and sends messages to neighboring vertices (which are processed in the following superstep). Unlike Pregel, Mizan balances its workload by moving selected vertices across workers. Vertex migration is performed when all workers reach a superstep synchronization barrier, to avoid violating the computation integrity, isolation and correctness of the BSP compute model.

This section describes two aspects of the design of Mizan related to vertex migration: the distributed runtime monitoring of workload characteristics and a distributed migration planner that decides which vertices to migrate. Other design aspects that are not related to vertex migration (e.g., fault tolerance) follow similar approaches to Pregel [23], Giraph [8], GraphLab [22] and GPS [27]; they are omitted for space considerations. Implementation details are described in Section 4.

3.1 Monitoring

Mizan monitors three key metrics for each vertex and maintains a high-level summary of these metrics for each worker node; summaries are broadcast to other workers at the end of every superstep. As shown in Figure 4, the key metrics for every vertex are (1) the number of outgoing messages to other (remote) workers, (2) the total incoming messages, and (3) the response time (execution time) during the current superstep:

[Figure 4. The statistics monitored by Mizan: each vertex's response time, its remote outgoing messages, and its local and remote incoming messages, shown for vertices spread across two Mizan workers.]

Outgoing Messages. Only outgoing messages to vertices in remote workers are counted, since local outgoing messages are rerouted to the same worker and do not incur any actual network cost.

Incoming Messages. All incoming messages are monitored: those received from remote vertices and those locally generated. This is because queue size can affect the performance of the vertex (i.e., when its buffer capacity is exhausted, paging to disk is required).

Response Time. The response time for each vertex is measured from the time the vertex starts processing its incoming messages until it finishes.
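
These quantities can be summarized as in the following minimal C++ sketch; the struct and field names are illustrative assumptions, not Mizan's actual data structures.

    #include <cstdint>

    struct VertexStats {
      uint64_t remote_outgoing_msgs;  // messages to vertices on other workers
      uint64_t incoming_msgs;         // remote plus locally generated messages
      double response_time;           // time processing this superstep's inbox
    };

    struct WorkerSummary {  // broadcast to all workers after each superstep
      uint64_t total_outgoing_msgs;
      uint64_t total_incoming_msgs;
      double total_response_time;
    };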

3.2 Migration Planning

High values for any of the three metrics above may indicate poor vertex placement, which leads to workload imbalance. As with many constrained optimization problems, optimizing against three objectives is non-trivial. To reduce the optimization search space, Mizan's migration planner finds the strongest cause of workload imbalance among the three metrics and plans the vertex migration accordingly.

By design, all workers create and execute the migration plan in parallel, without requiring any centralized coordination. The migration planner starts on every worker at the end of each superstep (i.e., when all workers reach the synchronization barrier), after it receives summary statistics (as described in Section 3.1) from all other workers. Additionally, the execution of the migration planner is sandwiched between the superstep's BSP barrier and a second (migration) barrier, as shown in Figure 5; the second barrier is necessary to ensure the correctness of the BSP model. Mizan's migration planner executes the following five steps, summarized in Figure 6 (a control-flow sketch in code follows the list):

[Figure 5. Mizan's BSP flow with migration planning: each superstep ends with a BSP barrier, followed by the migration planner and a migration barrier before the next superstep starts.]

1. Identify the source of imbalance.
2. Select the migration objective (i.e., optimize for outgoing messages, incoming messages, or response time).
3. Pair over-utilized workers with under-utilized ones.
4. Select vertices to migrate.
5. Migrate vertices.

[Figure 6. Summary of Mizan's Migration Planner: after Compute() and the BSP barrier, a worker checks whether the superstep is imbalanced; if so, it identifies the source of imbalance and, if it is overutilized, pairs with an underutilized worker, then selects and migrates vertices before the migration barrier.]
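
The following compilable C++ skeleton captures only the ordering of these phases in a worker's main loop; the Phases struct and its callbacks are placeholders for the steps above, not Mizan's actual interface.

    #include <functional>

    // Ordering of phases in a worker's main loop; all callbacks are stubs
    // standing in for the mechanisms described in this section.
    struct Phases {
      std::function<void()> run_superstep;      // Compute() on active vertices
      std::function<void()> bsp_barrier;        // global superstep barrier
      std::function<void()> plan_and_migrate;   // STEPs 1-5, on every worker
      std::function<void()> migration_barrier;  // all migrations complete
    };

    void RunWorker(const Phases& p, const std::function<bool()>& done) {
      while (!done()) {
        p.run_superstep();
        p.bsp_barrier();
        p.plan_and_migrate();
        p.migration_barrier();
      }
    }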

STEP 1: Identify the source of imbalance. Mizan detects imbalances across supersteps by comparing the summary statistics of all workers against a normal random distribution and flagging outliers. Specifically, at the end of a superstep, Mizan computes the z-score⁴ for all workers. If any worker has a z-score greater than z_def, Mizan's migration planner flags the superstep as imbalanced. We found that z_def = 1.96, the commonly recommended value [26], allows for natural workload fluctuations across workers. We have experimented with different values of z_def and validated the robustness of our choice.

⁴ The z-score of worker i is Wz_i = |x_i − x_max| / standard deviation, where x_i is the run time of worker i.
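
A sketch of this test, applying the footnote's definition directly (the use of the population standard deviation over per-worker run times is an assumption about the exact estimator):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Flags a superstep as imbalanced if any worker's Wz_i exceeds z_def.
    bool SuperstepImbalanced(const std::vector<double>& runtime,
                             double z_def = 1.96) {
      const double n = static_cast<double>(runtime.size());
      double mean = 0.0;
      for (double x : runtime) mean += x;
      mean /= n;
      double var = 0.0;
      for (double x : runtime) var += (x - mean) * (x - mean);
      const double stddev = std::sqrt(var / n);
      if (stddev == 0.0) return false;  // perfectly uniform load
      const double xmax = *std::max_element(runtime.begin(), runtime.end());
      for (double x : runtime)
        if (std::fabs(x - xmax) / stddev > z_def) return true;
      return false;
    }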

STEP 2: Select the migration objective. Each worker—identically—uses the summary statistics to compute the correlation between outgoing messages and response time, and also the correlation between incoming messages and response time. The correlation scores are used to select the objective to optimize for: to balance outgoing messages, balance incoming messages, or balance computation time. The default objective is to balance the response time. If the outgoing or incoming messages are highly correlated with the response time, then Mizan chooses the objective with the highest correlation score. The computation of the correlation score is described in Appendix A.
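
The paper defines its correlation score in Appendix A, which is not reproduced here; as a stand-in, a plain Pearson correlation over per-worker summaries conveys the idea:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Pearson correlation between two per-worker series, e.g. message
    // counts (x) and response times (y).
    double Pearson(const std::vector<double>& x, const std::vector<double>& y) {
      const std::size_t n = x.size();  // one entry per worker
      double mx = 0.0, my = 0.0;
      for (std::size_t i = 0; i < n; ++i) { mx += x[i]; my += y[i]; }
      mx /= n; my /= n;
      double sxy = 0.0, sxx = 0.0, syy = 0.0;
      for (std::size_t i = 0; i < n; ++i) {
        sxy += (x[i] - mx) * (y[i] - my);
        sxx += (x[i] - mx) * (x[i] - mx);
        syy += (y[i] - my) * (y[i] - my);
      }
      return (sxx == 0.0 || syy == 0.0) ? 0.0 : sxy / std::sqrt(sxx * syy);
    }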

STEP 3: Pair over-utilized workers with under-utilized ones. Each overutilized worker that needs to migrate vertices out is paired with a single underutilized worker. While complex pairings are possible, we choose a design that is efficient to execute, especially since the exact number of vertices that different workers plan to migrate is not globally known. Similar to the previous two steps in the migration plan, this step is executed by each worker without explicit global synchronization. Using the summary statistics for the chosen migration objective (in Step 2), each worker creates an ordered list of all workers. For example, if the objective is to balance outgoing messages, then the list will order workers from highest to lowest outgoing messages. The resulting list, thus, places overutilized workers at the top and least utilized workers at the bottom. The pairing function then successively matches workers from opposite ends of the ordered list. As depicted in Figure 7, if the list contains n elements (one for each worker), then the worker at position i is paired with the worker at position n − i. In cases where a worker does not have memory to receive any vertices, the worker is marked unavailable in the list.

[Figure 7. Matching senders (workers with vertices to migrate) with receivers using their summary statistics: ten workers ordered by load, with opposite ends of the list paired.]
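
A sketch of the pairing, using 0-based indexing so that position i pairs with position n − 1 − i; handling of workers marked unavailable is omitted:

    #include <algorithm>
    #include <numeric>
    #include <utility>
    #include <vector>

    // Returns (sender, receiver) pairs given one statistic per worker.
    std::vector<std::pair<int, int>> PairWorkers(const std::vector<double>& stat) {
      const int n = static_cast<int>(stat.size());
      std::vector<int> order(n);
      std::iota(order.begin(), order.end(), 0);
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return stat[a] > stat[b]; });  // high to low
      std::vector<std::pair<int, int>> pairs;
      for (int i = 0; i < n / 2; ++i)
        pairs.emplace_back(order[i], order[n - 1 - i]);  // opposite ends
      return pairs;
    }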

STEP 4: Select vertices to migrate. The number of vertices to be selected from an overutilized worker depends on the difference in the selected migration objective's statistics between the worker and its pair. Assume that wx is a worker that needs to migrate out a number of vertices and is paired with the receiver wy. The load that should be migrated to the underutilized worker is defined as ∆xy, which equals half the difference in the statistics of the migration objective between the two workers. The selection criteria of the vertices depend on the distribution of the statistics of the migration objective, where the statistics of each vertex are compared against a normal distribution. A vertex is selected if it is an outlier (i.e., if its z-score Vz_stat is greater than z_def⁵). For example, if the migration objective is to balance the number of remote outgoing messages, vertices with large remote outgoing messages are selected to migrate to the underutilized worker. The sum of the statistics of the selected vertices is denoted by ΣV_stat, which should minimize |∆xy − ΣV_stat| to ensure the balance between wx and wy in the next superstep. If not enough outlier vertices are found, a random set of vertices is selected to minimize |∆xy − ΣV_stat|.

⁵ The z-score of vertex i is Vz_stat = |x_i − x_avg| / standard deviation, where x_i is the statistic of the migration objective for vertex i.
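
A greedy sketch of this selection; the descending scan order and the omission of the random fallback are simplifying assumptions:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Picks outlier vertices (per footnote 5) whose statistics sum to
    // roughly delta_xy, half the gap between the paired workers.
    std::vector<int> SelectVertices(const std::vector<double>& vstat,
                                    double delta_xy, double z_def = 1.96) {
      const int n = static_cast<int>(vstat.size());
      double mean = 0.0, var = 0.0;
      for (double s : vstat) mean += s;
      mean /= n;
      for (double s : vstat) var += (s - mean) * (s - mean);
      const double stddev = std::sqrt(var / n);

      std::vector<int> order(n);  // consider the heaviest vertices first
      for (int i = 0; i < n; ++i) order[i] = i;
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return vstat[a] > vstat[b]; });

      std::vector<int> selected;
      double sum = 0.0;
      for (int v : order) {
        const bool outlier = stddev > 0.0 && (vstat[v] - mean) / stddev > z_def;
        if (!outlier) break;  // sorted, so no outliers remain past this point
        // Take the vertex only if it moves the sum closer to delta_xy.
        if (std::fabs(delta_xy - (sum + vstat[v])) < std::fabs(delta_xy - sum)) {
          selected.push_back(v);
          sum += vstat[v];
        }
      }
      // If too few outliers exist, the paper falls back to a random set of
      // vertices chosen to minimize |delta_xy - sum| (omitted here).
      return selected;
    }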

STEP 5: Migrate vertices. After the vertex selection process, the migrating workers start sending the selected vertices while the other workers wait at the migration barrier. A migrating worker starts sending the selected set of vertices to its unique target worker, where each vertex is encoded into a stream that includes the vertex ID, state, edge information and the received messages it will process. Once a vertex stream is successfully sent, the sending worker deletes the sent vertices so that it does not run them in the next superstep. The receiving worker, on the other hand, receives vertices (together with their messages) and prepares to run them in the next superstep. The next superstep is started once all workers finish migrating vertices and reach the migration barrier. The complexity of the migration process is directly related to the size of the vertices being migrated.
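
An illustrative record for such a stream (not Mizan's actual wire format) might look as follows:

    #include <cstdint>
    #include <vector>

    // One migrated vertex as encoded into the migration stream.
    struct VertexStream {
      uint64_t vertex_id;
      double value;                     // algorithm-specific vertex state
      std::vector<uint64_t> out_edges;  // edge information
      std::vector<double> inbox;        // messages to process next superstep
    };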

4. Implementation

Mizan consists of four modules, shown in Figure 8: the BSP Processor, Storage Manager, Communicator, and Migration Planner. The BSP Processor implements the Pregel APIs, consisting primarily of the Compute class, and the SendMessageTo, GetOutEdgeIterator and getValue methods. The BSP Processor operates on the data structures of the graph and executes the user's algorithm. It also performs barrier synchronization with other workers at the end of each superstep. The Storage Manager module maintains access atomicity and correctness of the graph's data, and maintains the data structures for message queues. Graph data can be read and written to either HDFS or local disks, depending on how Mizan is deployed. The Communicator module uses MPI to enable communication between workers; it also maintains distributed vertex ownership information. Finally, the Migration Planner operates transparently across superstep barriers to maintain the dynamic workload balance.

[Figure 8. Architecture of Mizan: each Mizan worker stacks a BSP Processor, a Migration Planner, a DHT-based Communicator, and a Storage Manager that performs IO against HDFS or local disks.]

Mizan allows the user's code to manipulate the graph connectivity by adding and removing vertices and edges at any superstep. It also guarantees that all graph mutation commands issued at superstep x are executed at the end of the same superstep and before the BSP barrier, as illustrated in Figure 5. Therefore, vertex migrations performed by Mizan do not conflict with the user's graph mutations, and Mizan always considers the most recent graph structure for migration planning.

When implementing Mizan, we wanted to avoid having a centralized controller. Overall, the BSP (Pregel) model naturally lends itself to a decentralized implementation. There were, however, three key challenges in implementing a distributed control plane that supports fine-grained vertex migration. The first challenge was in maintaining vertex ownership so that vertices can be freely migrated across workers. This is different from existing approaches (e.g., Pregel, Giraph, and GoldenOrb), which operate on a much coarser granularity (clusters of vertices) to enable scalability. The second challenge was in allowing fast updates to the vertex ownership information as vertices get migrated. The third challenge was in minimizing the cost of migrating vertices with large data structures. In this section, we discuss the implementation details around these three challenges, which allow Mizan to achieve its effectiveness and scalability.

4.1 Vertex Ownership

With huge datasets, Mizan workers cannot maintain the management information for all vertices in the graph. Management information includes the collected statistics for each vertex (described in Section 3.1) and the location (ownership) of each vertex. While per-vertex monitoring statistics are only used locally by the worker, vertex ownership information is needed by all workers. When vertices send messages, workers need to know the destination address for each message. With frequent vertex migration, updating the location of the vertices across all workers can easily create a communication bottleneck.

To overcome this challenge, we use a distributed hash table (DHT) [1] to implement a distributed lookup service. The DHT implementation allows Mizan to distribute the overhead of looking up and updating vertex locations across all workers. The DHT stores a set of (key, value) pairs, where the key represents a vertex ID and the value represents its current physical location. Each vertex is assigned a home worker. The role of the home worker is to maintain the current location of the vertex. A vertex can physically exist in any worker, including its home worker. The DHT uses a globally defined hash function that maps the keys to their associated home workers, such that home worker = location_hash(key).

During a superstep, when a (source) vertex sends a message to another (target) vertex, the message is passed to the Communicator. If the target vertex is located on the same worker, the message is rerouted back to the appropriate queue. Otherwise, the source worker uses the location_hash function to locate and query the home worker for the target vertex. The home worker responds back with the actual physical location of the target vertex. The source worker finally sends the queued message to the current worker for the target vertex. It also caches the physical location to minimize future lookups.
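
A minimal sketch of such a location service, assuming a simple modulo hash; Mizan's location_hash is described only abstractly, so the class shape here is an assumption:

    #include <cstdint>
    #include <unordered_map>

    // Globally known hash mapping a vertex ID to its home worker.
    inline int location_hash(uint64_t vertex_id, int num_workers) {
      return static_cast<int>(vertex_id % num_workers);
    }

    class LocationService {
     public:
      explicit LocationService(int num_workers) : workers_(num_workers) {}
      int HomeWorker(uint64_t v) const { return location_hash(v, workers_); }
      // Run on the home worker when v migrates to a new physical worker.
      void Update(uint64_t v, int physical_worker) { loc_[v] = physical_worker; }
      // Run on the home worker to answer lookups; defaults to the home itself.
      int Lookup(uint64_t v) const {
        auto it = loc_.find(v);
        return it == loc_.end() ? HomeWorker(v) : it->second;
      }

     private:
      int workers_;
      std::unordered_map<uint64_t, int> loc_;  // vertex ID -> current worker
    };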

4.2 DHT Updates After Vertex Migration

Figure 9 depicts the vertex migration process. When a vertex v migrates between two workers, the receiving worker sends the new location of v to the home worker of v. The home worker, in turn, sends an update message to all workers that have previously asked for—and, thus, potentially cached—the location of v. Since Mizan migrates vertices in the barrier between two supersteps, all workers that have cached the location of the migrating vertex will receive the updated physical location from the home worker before the start of the new superstep.

[Figure 9. Migrating vertex vi from Worker 1 to Worker 2, while updating its DHT home worker (Worker 3): 1) migrate(Vi) from Worker 1 to Worker 2, 2) inform_DHT(Vi) to the home worker, 3) update_loc(Vi) to workers caching the old location.]

If for any reason a worker did not receive the updated location, its messages will be sent to the last known physical location of v. The receiving worker, which no longer owns the vertex, will simply buffer the incorrectly routed messages, ask for the new location of v, and reroute the messages to the correct worker.

4.3 Migrating Vertices with Large Message Size

Migrating a vertex to another worker requires moving its queued messages and its entire state (which includes its ID, value, and neighbors). Especially when processing large graphs, a vertex can have a significant number of queued messages, which are costly to migrate. To minimize the cost, Mizan migrates large vertices using a delayed migration process that spreads the migration over two supersteps. Instead of physically moving the vertex with its large message queue, delayed migration only moves the vertex's information and ownership to the new worker. Assuming w_old is the old owner and w_new is the new one, w_old continues to process the migrating vertex v in the next superstep, SS_t+1. w_new receives the messages for v, which will be processed at the following superstep, SS_t+2. At the end of SS_t+1, w_old sends the new value of v, calculated at SS_t+1, to w_new and completes the delayed migration. Note that migration planning is disabled for superstep SS_t+1 after applying delayed migration, to ensure the consistency of the migration plans.

An example is shown in Figure 10, where Vertex 7 is migrated using delayed migration. As Vertex 7 migrates, the ownership of Vertex 7 is moved to Worker_new, and messages sent to Vertex 7 at SS_t+1 are addressed to Worker_new. At the barrier of SS_t+1, Worker_old sends the edge information and state of Vertex 7 to Worker_new and starts SS_t+2 with Vertex 7 fully migrated to Worker_new.

[Figure 10. Delayed vertex migration: across supersteps t, t+1 and t+2, Vertex 7's ownership moves from Worker_old to Worker_new one superstep before its computation does.]

With delayed migration, the consistency of computation is maintained without introducing additional network cost for vertices with large message queues. Since delayed migration uses two supersteps to complete, Mizan may need more steps before it converges on a balanced state.
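
As a compact summary, the phases of delayed migration for a vertex moving from w_old to w_new can be written as follows (names are illustrative):

    #include <cstdint>

    // Phases of delayed migration for one vertex v (w_old -> w_new).
    enum class Phase : uint8_t {
      kOwnershipMoved,  // barrier after SS_t: ownership and DHT entry point
                        //   to w_new; new messages for v arrive at w_new
      kSplitExecution,  // SS_t+1: w_old still runs v on its old message queue
      kCompleted        // barrier after SS_t+1: w_old ships v's new value and
                        //   edges to w_new; v runs wholly on w_new in SS_t+2
    };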
                                                               dynamic) implementation. Mizan’s dynamic migration is
5. Evaluation                                                  evaluated later in this section. To further equalize Giraph and
We implemented Mizan using C++ and MPI and compared            the Static Mizan, we disabled the fault tolerance feature of
it against Giraph [8], a Pregel clone implemented in Java      Giraph to eliminate any additional overhead resulting from
on top of Hadoop. We ran our experiments on a local clus-      frequent snapshots during runtime. In this section, we report
ter of 21 machines equipped with a mix of i5 and i7 pro-       the results using PageRank (a stationary and well balanced
cessors with 16GB RAM on each machine. We also used            algorithm) using different graph structures.
an IBM Blue Gene/P supercomputer with 1024 PowerPC-               In Figure 11, we ran both frameworks on social network
450 CPUs, each with 4 cores at 850MHz and 4GB RAM.             and random graphs. As shown, Static Mizan outperforms Gi-
We downloaded publicly available datasets from the Stan-       raph by a large margin, up to four times faster than Giraph in
ford Network Analysis Project6 and from The Laboratory for     kg4m68, which contains around 70M edges. We also com-
                                                               pared both systems when increasing the number of nodes
6 http://snap.stanford.edu                                     in random structure graphs. As shown in Figure 12, Static
[Plots for Figures 12 and 13. Figure 12: Run Time (Min) vs. vertex count (1-16 million) for Giraph and Static Mizan. Figure 13: Run Time (Min) of Static, WS, and Mizan under Hash, Range, and METIS partitioning.]

Figure 12. Comparing Mizan vs. Giraph using PageRank on regular random graphs; the graphs are uniformly distributed, each with around 17M edges.

Figure 13. Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.
As shown in Figure 12, Static Mizan consistently outperforms Giraph on all datasets, and is up to three times faster at 16 million vertices. While the execution time of both frameworks increases linearly with graph size, the rate of increase (the slope of the line) for Giraph (0.318) is steeper than for Mizan (0.09), indicating that Mizan also achieves better scalability.

The experiments in Figures 11 and 12 show that Giraph's implementation is inefficient. It is a non-trivial task to discover the source of this inefficiency, since Giraph is tightly coupled with Hadoop. We suspect that part of the inefficiency is due to the initialization cost of Hadoop jobs and the high overhead of communication. Other factors, like the choice of internal data structures and the memory footprint, might also play a role.

5.2 Effectiveness of Dynamic Vertex Migration
Given the large performance difference between Static Mizan and Giraph, we exclude Giraph from further experiments and focus on isolating the effects of dynamic migration on the overall performance of the system.

Figure 13 shows the results for running PageRank on a social graph with various partitioning schemes. We notice that both hash-based and METIS partitioning achieve a balanced workload for PageRank, such that dynamic migration did not improve the results. In comparison, range-based partitioning resulted in poor graph partitioning. In this case, we observe that Mizan (with dynamic partitioning) was able to reduce execution time by approximately 40% when compared to the static version. We also evaluated the diameter estimation algorithm, which behaves the same as PageRank but exchanges larger messages. Mizan exhibited similar behavior with diameter estimation; the results are omitted for space considerations.

Figure 14 shows how Mizan's dynamic migration was able to optimize running PageRank starting with range-based partitioning. The figure shows that Mizan's migration reduced both the variance in workers' runtime and the superstep's runtime. The same figure also shows that Mizan's migration alternated the optimization objective between the number of outgoing messages and vertex response time, illustrated as points on the superstep's runtime trend; the selection logic is sketched below.
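The alternation between objectives can be made concrete with a small sketch. This is an illustrative reconstruction, not Mizan's actual planner: the names are hypothetical, and the 0.5 threshold is borrowed from the weak-correlation boundary defined in Appendix A.

// Illustrative sketch (hypothetical names): choose the migration objective
// whose per-worker statistics best explain the response-time imbalance,
// using the combined correlation score (CT + ST) / 2 of Appendix A.
enum class Objective { OutgoingMessages, IncomingMessages, ResponseTime };

Objective pick_objective(double corr_outgoing, double corr_incoming) {
    // Below 0.5, a statistic is at best weakly correlated with response
    // time (Appendix A), so fall back to balancing response time itself.
    if (corr_outgoing < 0.5 && corr_incoming < 0.5)
        return Objective::ResponseTime;
    return corr_outgoing >= corr_incoming ? Objective::OutgoingMessages
                                          : Objective::IncomingMessages;
}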
By looking at both Figures 14 and 15, we observe that the convergence of Mizan's dynamic migration is correlated with the algorithm's runtime reduction. For the PageRank algorithm with range-based partitioning, Mizan requires 13 supersteps to reach an acceptably balanced workload. Since PageRank is a stationary algorithm, Mizan's migration converged quickly; we expect that it would require more supersteps to converge on other algorithms and datasets. In general, Mizan requires multiple supersteps before it balances the workload. This also explains why running Mizan with range-based partitioning is less efficient than running Mizan with METIS or hash-based partitioning.

As described in Section 2.1, Top-K PageRank adds variability in the communication among the graph nodes. As shown in Figure 16, such variability in the messages exchanged leads to minor variation in both hash-based and METIS execution times. Similarly, in range partitioning, Mizan had better performance than the static version. The slight performance improvement arises from the fact that the base algorithm (PageRank) dominates the execution time. If a large number of queries were performed, the improvements would be more significant.

To study the effect of algorithms with highly variable messaging patterns, we evaluated Mizan using two algorithms: DMST and advertisement propagation simulation. In both cases, we used METIS to pre-partition the graph data. METIS partitioning groups the strongly connected subgraphs into clusters, thus minimizing the global communication among clusters.
[Plots for Figures 14 and 16. Figure 14: Run Time (Minutes) over supersteps 5-30, with series for Superstep Runtime, Average Workers Runtime, and Migration Cost, and markers for the Balance Outgoing Messages, Balance Incoming Messages, and Balance Response Time objectives. Figure 16: Run Time (Min) of Static, WS, and Mizan under Hash, Range, and METIS partitioning.]

Figure 14. Comparing Mizan's migration cost with the superstep's (or the algorithm's) runtime at each superstep using PageRank on a social graph (LiveJournal1) starting with range-based partitioning. The points on the superstep's runtime trend represent Mizan's migration objective at that specific superstep.

Figure 16. Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using Top-K PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.


[Plots for Figures 15 and 17. Figure 15: Migrated Vertices (x1000) and Migration Cost (Minutes) over supersteps 5-30, with series for Migration Cost, Total Migrated Vertices, and Max Vertices Migrated by a Single Worker. Figure 17: Run Time (Min) of Static, Work Stealing, and Mizan on Advertisement and DMST.]

Figure 15. Comparing both the total migrated vertices and the maximum vertices migrated by a single worker for PageRank on LiveJournal1 starting with range-based partitioning. The migration cost at each superstep is also shown.

Figure 17. Comparing Work Stealing (Pregel clone) vs. Mizan using DMST and Propagation Simulation on a METIS-partitioned social graph (LiveJournal1).

In DMST, as discussed in Section 2.1, computation complexity increases with the vertex connectivity degree. Because of the quadratic complexity of computation as a function of connectivity degree, some workers suffer from an extensive workload while others have a light workload. Such imbalance in the workload leads to the results shown in Figure 17: Mizan was able to reduce the imbalance in the workload, resulting in a large drop in execution time (two orders of magnitude improvement). Even when using our version of Pregel's load balancing approach (called work stealing), Mizan is roughly eight times faster.

Similar to DMST, the behavior of the advertising propagation simulation algorithm varies across supersteps. In this algorithm, a dynamic set of graph vertices communicate heavily with each other, while others have little or no communication. In every superstep, the communication behavior differs depending on the state of the vertex. Therefore, it creates an imbalance across the workers at every superstep. Even METIS partitioning is ill-suited in such a case, since workers' loads change dynamically at runtime. As shown in Figure 17, similar to DMST, Mizan is able to reduce such an imbalance, resulting in approximately 200% speedup when compared to the work stealing and static versions.

Total runtime (s)                        707
Data read time to memory (s)             72
Migrated vertices                        1,062,559
Total migration time (s)                 75
Average migrate time per vertex (µs)     70.5

Table 2. Overhead of Mizan's migration process when compared to the total runtime using range partitioning on the LiveJournal1 graph.
Linux Cluster (hollywood-2011)      Blue Gene/P (arabic-2005)
Processors    Runtime (m)           Processors    Runtime (m)
2             154                   64            144.7
4             79.1                  128           74.6
8             40.4                  256           37.9
16            21.5                  512           21.5
                                    1024          17.5

Table 3. Scalability of Mizan on a Linux Cluster of 16 machines (hollywood-2011 dataset), and an IBM Blue Gene/P supercomputer (arabic-2005 dataset).

[Plot for Figure 18: speedup of Mizan on the hollywood-2011 dataset versus the number of compute nodes (2-16).]

Figure 18. Speedup on Linux Cluster of 16 machines using PageRank on the hollywood-2011 dataset.
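The speedup curves in Figures 18 and 19 follow directly from Table 3, presumably by normalizing each runtime to that of the smallest configuration (Section 5.4 states that two compute nodes serve as the base reference on the cluster; we assume 64 nodes play the same role on the Blue Gene/P):

speedup(n) = \frac{T_{base}}{T_n}, \quad \text{e.g., } \frac{154}{21.5} \approx 7.2 \text{ at 16 cluster machines and } \frac{144.7}{17.5} \approx 8.3 \text{ at 1024 Blue Gene/P nodes.}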
5.3 Overhead of Vertex Migration
To analyze migration cost, we measured various performance metrics of Mizan. We used the PageRank algorithm with a range-based partitioning of the LiveJournal1 dataset on 21 workers. We chose range-based partitioning as it provides the worst data distribution according to the previous experiments, and therefore triggers frequent vertex migrations to balance the workload.

Table 2 reports the average cost of migrating a vertex as 70.5 µs. On the LiveJournal1 dataset, Mizan paid a 9% penalty of the total runtime to balance the workload, transferring over 1M vertices. As shown earlier in Figure 13, this resulted in a 40% saving in computation time when compared to Static Mizan. Moreover, Figures 14 and 15 compare the algorithm runtime and the migration cost at each superstep; the migration cost is at most 13% (at superstep 2) and on average 6% across all supersteps that included a migration phase.

[Plot for Figure 19: speedup of Mizan on the arabic-2005 dataset versus the number of compute nodes (64-1024).]

Figure 19. Speedup on Shaheen IBM Blue Gene/P supercomputer using PageRank on the arabic-2005 dataset.
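As a consistency check, the per-vertex cost in Table 2 follows from its own totals:

\frac{75\ \text{s}}{1{,}062{,}559\ \text{vertices}} \approx 70.6\ \mu\text{s per vertex,}

which matches the reported 70.5 µs up to rounding of the total migration time.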
5.4 Scalability of Mizan
We tested the scalability of Mizan on the Linux cluster, as shown in Table 3. We used two compute nodes as our base reference, as a single node was too small to run the hollywood-2011 dataset, causing significant paging activity. As Figure 18 shows, Mizan scales linearly with the number of workers.

We were also interested in performing large scale-out experiments, well beyond what can be achieved on public clouds. Since Mizan was written in C++ and uses MPI for message passing, it was easily ported to IBM's Blue Gene/P supercomputer. Once ported, we natively ran Mizan on 1024 Blue Gene/P compute nodes, running the PageRank algorithm on a huge graph (arabic-2005) that contains 639M edges; the results are shown in Table 3. As shown in Figure 19, Mizan scales linearly from 64 to 512 compute nodes, then starts to flatten out as we increase to 1024 compute nodes. The flattening was expected: with an increased number of cores, compute nodes spend more time communicating than computing. We expect that as we continue to increase the number of CPUs, most of the time will be spent communicating (which effectively breaks the BSP model of overlapping communication and computation).

6. Related Work
In the past few years, a number of large-scale graph processing systems have been proposed. A common thread across the majority of existing work is how to partition graph data and how to balance computation across compute nodes. One approach to this problem is dynamic load balancing of the distributed graph data through on-the-fly graph partitioning, which Hendrickson and Devine [11] assert to be either too expensive or hard to parallelize. Mizan follows the Pregel model, but focuses on dynamic load balancing of vertices. In this section, we examine the design aspects of existing work for achieving a balanced workload.

Pregel and its Clones. The default partitioning scheme for the input graph used by Pregel [23] is hash-based. In every superstep, Pregel assigns more than one subgraph (partition) to a worker. While this load balancing approach helps balance computation in any superstep, it is coarse-grained and reacts slowly to large imbalances in the initial partitioning or to large behavior changes. Giraph, an open source Pregel clone, supports both hash-based and range-based partitioning schemes; the user can also implement a different graph partitioning scheme. Giraph balances the partitions across the workers based on the number of edges or vertices. GoldenOrb [9] uses HDFS as its storage system and employs hash-based partitioning. An optimization introduced in [12] uses METIS partitioning, instead of GoldenOrb's hash-based partitioning, to maintain a balanced workload among the partitions; additionally, its distributed data storage supports duplication of frequently accessed vertices. Similarly, Surfer [4] utilizes min-cut graph partitioning to improve the performance of distributed vertex-centric graph processing. However, Surfer focuses more on providing a bandwidth-aware variant of the multilevel graph partitioning algorithm (integrated into a Pregel implementation) to improve the performance of graph partitioning on the cloud. Unlike Mizan, these systems optimize neither for dynamic algorithm behavior nor for message traffic.

Power-law Optimized Graph Processing Systems. Stanford GPS (GPS) [27] has three additional features over Pregel. First, GPS extends Pregel's API to handle graph algorithms that perform both vertex-centric and global computations. Second, GPS supports dynamic repartitioning based on the outgoing communication. Third, GPS provides an optimization scheme called large adjacency list partitioning, where the adjacency lists of high-degree vertices are partitioned across workers. GPS only monitors outgoing messages, while Mizan dynamically selects the most appropriate metric based on changes in runtime characteristics. PowerGraph [10], on the other hand, is a distributed graph system that overcomes the challenges of processing natural graphs with extreme power-law distributions. PowerGraph introduces vertex-based partitioning with replication to distribute the graph's data based on its edges. It also provides a programming abstraction (based on gather, apply, and scatter operators) to distribute the user's algorithm workload on their graph processing system. The authors in [10] focused on evaluating stationary algorithms on power-law graphs. It is, thus, unclear how their system performs across the broader class of algorithms (e.g., non-stationary) and graphs (e.g., non-power-law). In designing Mizan, we did not make any assumptions on either the type of algorithm or the shape of the input graph. Nonetheless, PowerGraph's replication approach is complementary to Mizan's dynamic load balancing techniques; they can be combined to improve the robustness of the underlying graph processing system.

Shared Memory Graph Processing Systems. GraphLab [22] is a parallel framework (using distributed shared memory) for efficient machine learning algorithms. The authors extended their original multi-core GraphLab implementation to a cloud-based distributed GraphLab, where they apply a two-phase partitioning scheme that minimizes the edges among graph partitions and allows for fast graph repartitioning. Unlike BSP frameworks, computation in GraphLab is executed on the most recently available data. GraphLab ensures that computations are sequentially consistent through a derived consistency model, guaranteeing that the end result data will be consistent as well. HipG [18] is designed to operate on distributed memory machines while providing transparent memory sharing to support hierarchical parallel graph algorithms. HipG also supports processing graph algorithms following the BSP model. But unlike the global synchronous barriers of Pregel, HipG applies more fine-grained barriers, called synchronizers, during algorithm execution. The user has to be involved in writing the synchronizers, which adds complexity to the user's code. GraphChi [19] is a disk-based graph processing system based on GraphLab and designed to run on a single machine with limited memory. It requires a small number of non-sequential disk accesses to process subgraphs from disk, while allowing for asynchronous iterative computations. Unlike Mizan, these systems do not support any dynamic load balancing techniques.

Specialized Graph Systems. Kineograph [5] is a distributed system that targets the processing of fast-changing graphs. It captures the relationships in the input data, reflects them on the graph structure, and creates regular snapshots of the data. Data consistency is ensured in Kineograph by separating graph processing and graph update routines. This system is mainly developed to support algorithms that mine the incremental changes in the graph structure over a series of graph snapshots. The user can also run an offline iterative analysis over a copy of the graph, similar to Pregel. The Little Engine(s) [25] is also a distributed system designed to scale with online social networks. This system utilizes graph repartitioning with one-hop replication to rebalance any changes to the graph structure, localize graph data, and reduce the network cost associated with query processing. The Little Engine(s) repartitioning algorithm aims to find min-cuts in a graph under the condition of minimizing the replication overhead at each worker. This system works as a middle-box between an application and its database systems; it is not intended for offline analytics or batch data processing like Pregel or MapReduce. Trinity [29] is a distributed graph processing system based on an in-memory key-value store that supports both offline analytics and online query answering. Trinity provides a similar, but more restrictive, computing model to Pregel for offline graph analytics. The key advantage of Trinity is its low-latency graph storage design, where the data of each vertex is stored in a binary format to eliminate decoding and encoding costs. While these systems address specific design challenges in processing large graphs, their focus is orthogonal to the design of Mizan.
7. Future Work
We are planning on making Mizan an open source project. We will continue to work on a number of open problems as outlined in this section. Our goal is to provide a robust and adaptive platform that frees developers from worrying about the structure of the data or the runtime variability of their algorithms.

Although Mizan is a graph-agnostic framework, its performance can degrade when processing extremely skewed graphs due to frequent migration of highly connected vertices. We believe that vertex replication, as proposed by PowerGraph [10], offers a good solution, but it mandates the implementation of custom combiners. We plan to further reduce frequent migrations in Mizan by leveraging vertex replication, without, however, requiring any custom combiners.

So far, dynamic migration in Mizan optimizes for workloads generated by a single application or algorithm. We are interested in investigating how to efficiently run multiple algorithms on the same graph. This would better position Mizan to be offered as a service for very large datasets.

8. Conclusion
In this paper, we presented Mizan, a Pregel system that uses fine-grained vertex migration to load balance computation and communication across supersteps. Using distributed measurements of the performance characteristics of all vertices, Mizan identifies the cause of workload imbalance and constructs a vertex migration plan without requiring centralized coordination. Using a representative number of datasets and algorithms, we have shown both the efficacy and the robustness of our design against varying workload conditions. We have also shown the linear scalability of Mizan, scaling to 1024 CPUs.

A. Computing the Correlation Score
In Mizan, the correlation scores between the workers' response time and their summary statistics are calculated using two tests: clustering and sorting. In the clustering test CT, elements from the set of worker response times S_{Time} and the statistics set S_{Stat} (either incoming or outgoing message statistics) are divided into two categories based on their z-score^7: positive and negative clusters. Elements from a set S are categorized as positive if their z-scores are greater than some z_{def}. Elements categorized as negative have a z-score less than -z_{def}. Elements in the range (-z_{def}, z_{def}) are ignored. The score of the cluster test is calculated based on the total count of elements that belong to the same worker in the positive and negative sets, as shown by the following equation:

CT_{score} = \frac{count(S^{+}_{Time} \cap S^{+}_{Stat}) + count(S^{-}_{Time} \cap S^{-}_{Stat})}{count(S_{Time}) + count(S_{Stat})}

The sorting test ST consists of sorting both sets, S_{Time} and S_{Stat}, and calculating the z-scores for all elements. Elements with a z-score in the range (-z_{def}, z_{def}) are excluded from both sorted sets. For the two sorted sets S_{Time} and S_{Stat} of size n and i \in [0..n-1], a mismatch is defined if the i-th elements from both sets are related to different workers. That is, if S^{i}_{Time} is the i-th element from S_{Time} and S^{i}_{Stat} is the i-th element from S_{Stat}, a mismatch occurs when S^{i}_{Time} \in Worker_x while S^{i}_{Stat} \notin Worker_x. A match, on the other hand, is defined if S^{i}_{Time} \in Worker_x and S^{i}_{Stat} \in Worker_x. The score of the sorting test is given by:

ST_{score} = \frac{count(matches)}{count(matches) + count(mismatches)}

The correlation of sets S_{Time} and S_{Stat} is reported as strongly correlated if (CT + ST)/2 \geq 1, weakly correlated if 1 > (CT + ST)/2 \geq 0.5, and uncorrelated otherwise.

^7 The z-score of S^{i}_{Time} and S^{i}_{Stat} is \frac{|x_i - x_{avg}|}{\text{standard deviation}}, where x_i is the runtime of worker i or its statistics.
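The two tests can be made concrete with a short, self-contained sketch. This is an illustrative reconstruction of Appendix A, not Mizan's actual code: the helper names are hypothetical, a signed z-score is used so that both the positive and negative clusters are populated (footnote 7 gives the magnitude form), and intersection elements are counted once per set so that a perfectly correlated pair reaches a CT score of 1, consistent with the thresholds above.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// One sample per worker: its response time or one summary statistic,
// reduced to a signed z-score.
struct Sample { int worker; double z; };

static std::vector<Sample> zscores(const std::vector<double>& xs) {
    double avg = 0.0, var = 0.0;
    for (double x : xs) avg += x;
    avg /= xs.size();
    for (double x : xs) var += (x - avg) * (x - avg);
    double sd = std::sqrt(var / xs.size());
    if (sd == 0.0) sd = 1.0;                   // guard: identical samples
    std::vector<Sample> out;
    for (int w = 0; w < (int)xs.size(); ++w)
        out.push_back({w, (xs[w] - avg) / sd});
    return out;
}

// Clustering test CT: a worker counts toward the numerator when it falls
// in the positive (or negative) cluster of both sets, counted once per set.
static double ct_score(const std::vector<Sample>& t,
                       const std::vector<Sample>& s, double zdef) {
    int agree = 0, total = 0;
    for (std::size_t i = 0; i < t.size(); ++i) {
        bool tp = t[i].z > zdef, tn = t[i].z < -zdef;
        bool sp = s[i].z > zdef, sn = s[i].z < -zdef;
        total += (tp || tn) + (sp || sn);      // count(S_Time) + count(S_Stat)
        if ((tp && sp) || (tn && sn)) agree += 2;
    }
    return total ? double(agree) / total : 0.0;
}

// Sorting test ST: drop elements with |z| < zdef, sort both sets by
// z-score, and count positions whose elements come from the same worker.
static double st_score(std::vector<Sample> t, std::vector<Sample> s,
                       double zdef) {
    auto drop = [zdef](std::vector<Sample>& v) {
        v.erase(std::remove_if(v.begin(), v.end(),
                    [zdef](const Sample& a) { return std::fabs(a.z) < zdef; }),
                v.end());
    };
    drop(t);
    drop(s);
    auto byz = [](const Sample& a, const Sample& b) { return a.z < b.z; };
    std::sort(t.begin(), t.end(), byz);
    std::sort(s.begin(), s.end(), byz);
    std::size_t n = std::min(t.size(), s.size());
    if (n == 0) return 0.0;
    int matches = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (t[i].worker == s[i].worker) ++matches;
    return double(matches) / double(n);        // matches/(matches+mismatches)
}

int main() {
    std::vector<double> time = {1.2, 3.5, 0.9, 4.1};  // per-worker runtimes
    std::vector<double> stat = {100, 900, 80, 1200};  // e.g. outgoing messages
    double ct = ct_score(zscores(time), zscores(stat), 0.5);
    double st = st_score(zscores(time), zscores(stat), 0.5);
    std::printf("CT=%.2f ST=%.2f combined=%.2f\n", ct, st, (ct + st) / 2.0);
    return 0;
}

With these sample values, the statistic tracks the response times exactly, so both tests return 1 and the pair would be reported as strongly correlated.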
References
[1] H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Looking Up Data in P2P Systems. Communications of the ACM, 46(2):43-48, 2003.
[2] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A Scalable Fully Distributed Web Crawler. Software: Practice & Experience, 34(8):711-726, 2004.
[3] D. Buntinas. Designing a Common Communication Subsystem. In Proceedings of the 12th European Parallel Virtual Machine and Message Passing Interface Conference (Euro PVM MPI), pages 156-166, Sorrento, Italy, 2005.
[4] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li. Improving Large Graph Processing on Partitioned Graphs in the Cloud. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SOCC), pages 3:1-3:13, San Jose, California, USA, 2012.
[5] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: Taking the Pulse of a Fast-Changing and Connected World. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys), pages 85-98, Bern, Switzerland, 2012.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 137-150, San Francisco, California, USA, 2004.
[7] R. G. Gallager, P. A. Humblet, and P. M. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1):66-77, 1983.
[8] Giraph. Apache Incubator Giraph. http://incubator.apache.org/giraph, 2012.
[9] GoldenOrb. A Cloud-based Open Source Project for Massive-Scale Graph Analysis. http://goldenorbos.org/, 2012.
[10] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17-30, Hollywood, California, USA, 2012.
[11] B. Hendrickson and K. Devine. Dynamic Load Balancing in Computational Mechanics. Computer Methods in Applied Mechanics and Engineering, 184(2-4):485-500, 2000.
[12] L.-Y. Ho, J.-J. Wu, and P. Liu. Distributed Graph Database for Large-Scale Social Computing. In Proceedings of the IEEE 5th International Conference on Cloud Computing (CLOUD), pages 455-462, Honolulu, Hawaii, USA, 2012.
[13] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In Proceedings of the IEEE 9th International Conference on Data Mining (ICDM), pages 229-238, Miami, Florida, USA, 2009.
[14] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):8:1-8:24, 2011.
[15] G. Karypis and V. Kumar. Parallel Multilevel K-Way Partitioning Scheme for Irregular Graphs. In Proceedings of the ACM/IEEE Conference on Supercomputing (CDROM), pages 278-300, Pittsburgh, Pennsylvania, USA.
[16] G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. Journal of Parallel and Distributed Computing, 48(1):96-129, 1998.
[17] Z. Khayyat, K. Awara, H. Jamjoom, and P. Kalnis. Mizan: Optimizing Graph Mining in Large Parallel Systems. Technical report, King Abdullah University of Science and Technology, 2012.
[18] E. Krepska, T. Kielmann, W. Fokkink, and H. Bal. HipG: Parallel Processing of Large-Scale Graphs. ACM SIGOPS Operating Systems Review, 45(2):3-13, 2011.
[19] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 31-46, Hollywood, California, USA, 2012.
[20] LAW. The Laboratory for Web Algorithmics. http://law.di.unimi.it/index.php, 2012.
[21] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Research, 11:985-1042, 2010.
[22] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proceedings of the VLDB Endowment (PVLDB), 5(8):716-727, 2012.
[23] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135-146, Indianapolis, Indiana, USA.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://ilpubs.stanford.edu:8090/422/, 2001.
[25] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez. The Little Engine(s) That Could: Scaling Online Social Networks. IEEE/ACM Transactions on Networking, 20(4):1162-1175, 2012.
[26] D. Rees. Foundations of Statistics. Chapman and Hall, London/New York, 1987. ISBN 0412285606.
[27] S. Salihoglu and J. Widom. GPS: A Graph Processing System. Technical report, Stanford University, 2012.
[28] S. Seo, E. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. HAMA: An Efficient Matrix Computation with the MapReduce Framework. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), pages 721-726, Athens, Greece.
[29] B. Shao, H. Wang, and Y. Li. Trinity: A Distributed Graph Engine on a Memory Cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, New York, USA.
[30] L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8):103-111, 1990.
[31] W. Xue, J. Shi, and B. Yang. X-RIME: Cloud-Based Large Scale Social Network Analysis. In Proceedings of the 2010 IEEE International Conference on Services Computing (SCC), pages 506-513, Shanghai, China.

Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfMahmoud M. Sallam
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaVirag Sontakke
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 

Recently uploaded (20)

“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)ESSENTIAL of (CS/IT/IS) class 06 (database)
ESSENTIAL of (CS/IT/IS) class 06 (database)
 
Pharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdfPharmacognosy Flower 3. Compositae 2023.pdf
Pharmacognosy Flower 3. Compositae 2023.pdf
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Painted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of IndiaPainted Grey Ware.pptx, PGW Culture of India
Painted Grey Ware.pptx, PGW Culture of India
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

[…] we show that—especially for highly-dynamic workloads—Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

1. Introduction

With the increasing emphasis on big data, new platforms are being proposed to better exploit the structure of both the data and algorithms (e.g., Pregel [23], HADI [14], PEGASUS [13] and X-RIME [31]). Recently, Pregel [23] was introduced as a message passing-based programming model specifically targeting the mining of large graphs. […] They take one or more of the following five approaches to achieving a balanced workload: (1) provide simple graph partitioning schemes, like hash- or range-based partitioning (e.g., Giraph), (2) allow developers to set their own partitioning scheme or pre-partition the graph data (e.g., Pregel), (3) provide more sophisticated partitioning techniques (e.g., GraphLab, GoldenOrb and Surfer use min-cuts [16]), (4) utilize distributed data stores and graph indexing on vertices and edges (e.g., GoldenOrb and Hama), and (5) perform coarse-grained load balancing (e.g., Pregel).

In this paper, we analyze the efficacy of these workload-balancing approaches when applied to a broad range of graph mining problems. Built into the design of these approaches is the assumption that the structure of the graph is static, the algorithm has predictable behavior, or that the developer has deep knowledge about the runtime characteristics of the algorithm. Using large datasets from the Laboratory for Web Algorithmics (LAW) [2, 20] and a broad set of graph algorithms, we show that—especially when the graph mining algorithm has unpredictable communication needs, frequently changes the structure of the graph, or has a variable computation complexity—a single solution is not enough. We further show that a poor partitioning can result in significant performance degradation or large up-front cost.
More fundamentally, existing workload balancing approaches suffer from poor adaptation to variability (or shifts) in computation needs. Even the dynamic partition assignment that was proposed in Pregel [23] reacts poorly to highly dynamic algorithms. We are, thus, interested in building a system that is (1) adaptive, (2) agnostic to the graph structure, and (3) requires no a priori knowledge of the behavior of the algorithm. At the same time, we are interested in improving the overall performance of the system.

To this end, we introduce Mizan¹, a Pregel system that supports fine-grained load balancing across supersteps. Similar to other Pregel implementations, Mizan supports different schemes to pre-partition the input graph. Unlike existing implementations, Mizan monitors the runtime characteristics of all vertices (i.e., their execution time, and incoming and outgoing messages). Using these measurements, at the end of every superstep, Mizan constructs a migration plan that minimizes the variations across workers by identifying which vertices to migrate and where to migrate them to. Finally, Mizan performs efficient vertex migration across workers, leveraging a distributed hash table (DHT)-based location service to track the movement of vertices as they migrate.

All components of Mizan support distributed execution, eliminating the need for a centralized controller. We have fully implemented Mizan in C++, with the messaging—BSP—layer implemented using MPI [3]. Using a representative number of datasets and a broad spectrum of algorithms, we compare our implementation against Giraph, a popular Pregel clone, and show that Mizan consistently outperforms Giraph by an average of 202%. Normalizing across platforms, we show that for stable workloads, Mizan's dynamic load balancing matches the performance of static partitioning without requiring a priori knowledge of the structure of the graph or the algorithm's runtime characteristics. When the algorithm is highly dynamic, we show that Mizan provides over 87% performance improvement. Even with inefficient input partitioning, Mizan is able to reduce the computation overhead by up to 40% when compared to the static case, incurring only a 10% overhead for performing dynamic vertex migration. We even run our implementation on an IBM Blue Gene/P supercomputer, demonstrating linear scalability to 1024 CPUs. To the best of our knowledge, this is the largest scale-out test of a Pregel-like system to date.

To summarize, we make the following contributions:

• We analyze different graph algorithm characteristics that can contribute to imbalanced computation of a Pregel system.
• We propose a dynamic vertex migration model based on runtime monitoring of vertices to optimize the end-to-end computation.
• We fully implement Mizan in C++ as an optimized Pregel system that supports dynamic load balancing and efficient vertex migration.
• We deploy Mizan on a local Linux cluster (21 machines) and evaluate its efficacy on a representative number of datasets. We also show the linear scalability of our design by running Mizan on a 1024-CPU IBM Blue Gene/P.

¹ Mizan is Arabic for a double-pan scale, commonly used in reference to achieving balance.

This paper is organized as follows. Section 2 describes factors affecting algorithm behavior and discusses a number of example algorithms. In Section 3, we introduce the design of Mizan. We describe implementation details in Section 4. Section 5 compares the performance of Mizan using a wide array of datasets and algorithms. We then describe related work in Section 6. Section 7 discusses future work and Section 8 concludes.

2. Dynamic Behavior of Algorithms

In the Pregel (BSP) computing model, several factors can affect the runtime performance of the underlying system (shown in Figure 1). During a superstep, each vertex is either in an active or inactive state. In an active state, a vertex may be computing, sending messages, or processing received messages.

Naturally, vertices can experience variable execution times depending on the algorithm they implement or their degree of connectivity. A densely connected vertex, for example, will likely spend more time sending and processing incoming messages than a sparsely connected vertex. The BSP programming model, however, masks this variation by (1) overlapping communication and computation, and (2) running many vertices on the same compute node. Intuitively, as long as the workloads of vertices are roughly balanced across computing nodes, the overall computation time will be minimized.

Counter to intuition, achieving a balanced workload is not straightforward. There are two sources of imbalance: one originating from the graph structure and another from the algorithm behavior. In both cases, vertices on some compute nodes can spend a disproportional amount of time computing, sending or processing messages. In some cases, these vertices can run out of input buffer capacity and start paging, further exacerbating the imbalance.
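Before turning to concrete algorithms, the sketch below illustrates the vertex-centric model just described: each active vertex consumes the messages from the previous superstep, updates its state, sends messages, and may vote to halt. The class shape loosely echoes the Pregel API that Section 4 lists for Mizan's BSP Processor, but every type and signature here is an illustrative assumption, not Mizan's actual interface.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of a Pregel-style vertex program; all types and
// signatures are assumptions made for exposition.
using VertexId = std::uint64_t;

struct Message {
    VertexId target;
    double value;
};

class Vertex {
public:
    // Called once per superstep while the vertex is active; `incoming`
    // holds the messages sent to this vertex in the previous superstep.
    virtual void Compute(const std::vector<Message>& incoming) = 0;
    virtual ~Vertex() = default;

    bool halted() const { return halted_; }

protected:
    // Queue a message for delivery at the start of the next superstep.
    void SendMessageTo(VertexId target, double value) {
        outbox_.push_back({target, value});
    }

    // Become inactive; an incoming message re-activates the vertex.
    void VoteToHalt() { halted_ = true; }

    double value_ = 0.0;               // per-vertex state
    std::vector<VertexId> out_edges_;  // targets of outgoing messages
    std::vector<Message> outbox_;      // drained by the runtime
    bool halted_ = false;
};
```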
Existing implementations of Pregel (including Giraph [8], GoldenOrb [9], Hama [28]) focus on providing multiple alternatives to partitioning the graph data. The three common approaches to partitioning the data are hash-based, range-based, or min-cut [16]. Hash- and range-based partitioning methods divide a dataset based on a simple heuristic: to evenly distribute vertices across compute nodes, irrespective of their edge connectivity. Min-cut based partitioning, on the other hand, considers vertex connectivity and partitions the data such that it places strongly connected vertices close to each other (i.e., on the same cluster). The resulting performance of these partitioning approaches is, however, graph dependent. To demonstrate this variability, we ran a simple—and highly predictable—PageRank algorithm on different datasets (summarized in Table 1) using the three popular partitioning methods. Figure 2 shows that none of the partitioning methods consistently outperforms the rest, noting that ParMETIS [15] partitioning cannot be performed on the arabic-2005 graph due to memory limitations.

[Figure 1. Factors that can affect the runtime in the Pregel framework.]

[Figure 2. The difference in execution time when processing PageRank on different graphs using hash-based, range-based and min-cuts partitioning. Because of its size, arabic-2005 cannot be partitioned using min-cuts (ParMETIS) in our local cluster.]

In addition to the graph structure, the running algorithm can also affect the workload balance across compute nodes. Broadly speaking, graph algorithms can be divided into two categories (based on their communication characteristics across supersteps): stationary and non-stationary.

Stationary Graph Algorithms. An algorithm is stationary if its active vertices send and receive the same distribution of messages across supersteps. At the end of a stationary algorithm, all active vertices become inactive (terminate) during the same superstep. Usually graph algorithms represented by a matrix-vector multiplication² are stationary algorithms, including PageRank, diameter estimation and finding weakly connected components.

Non-stationary Graph Algorithms. A graph algorithm is non-stationary if the destination or size of its outgoing messages changes across supersteps. Such variations can create workload imbalances across supersteps. Examples of non-stationary algorithms include distributed minimal spanning tree construction (DMST), graph queries, and various simulations on social network graphs (e.g., advertisement propagation).

² The matrix represents the graph adjacency matrix and the vector represents the vertices' value.

[Figure 3. The difference between stationary and non-stationary graph algorithms with respect to the incoming messages. Total represents the sum across all workers and Max represents the maximum amount (on a single worker) across all workers.]

To illustrate the differences between the two classes of algorithms, we compared the runtime behavior of PageRank (stationary) against DMST (non-stationary) when processing the same dataset (LiveJournal1) on a cluster of 21 machines. The input graph was partitioned using a hash function. Figure 3 shows the variability in the incoming messages per superstep for all workers. In particular, the variability can span over five orders of magnitude for non-stationary algorithms.

The remainder of the section describes four popular graph mining algorithms that we use throughout the paper. They cover both stationary and non-stationary algorithms.

2.1 Example Algorithms

PageRank. PageRank [24] is a stationary algorithm that uses matrix-vector multiplications to calculate the eigenvalues of the graph's adjacency matrix at each iteration.³ The algorithm terminates when the PageRank values of all nodes change by less than a defined error during an iteration. At each superstep, all vertices are active, where every vertex always receives messages from its in-edges and sends messages to all of its out-edges. All messages have a fixed size (8 bytes) and the computation complexity on each vertex is linear in the number of messages.

³ Formally, each iteration calculates $v^{(k+1)} = c A^{T} v^{(k)} + (1-c)/|N|$, where $v^{(k)}$ is the eigenvector of iteration $k$, $c$ is the damping factor used for normalization, and $A$ is a row-normalized adjacency matrix.
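A compact sketch of this per-vertex computation is shown below. It follows the update rule from footnote 3; the function shape, the damping value 0.85, and the halting tolerance are assumed for illustration and are not prescribed by the paper.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One PageRank superstep for a single vertex. Names and helpers are
// hypothetical; only the update rule itself comes from footnote 3.
struct PageRankVertex {
    double rank = 0.0;
    double last_delta = 1.0;
    std::size_t out_degree = 1;
};

constexpr double kDamping = 0.85;  // "c": a typical assumed value
constexpr double kEpsilon = 1e-4;  // convergence tolerance, assumed

// Returns the fixed-size (one 8-byte double) message broadcast along
// every out-edge, mirroring the fixed 8-byte messages described above.
double ComputePageRank(PageRankVertex& v,
                       const std::vector<double>& incoming,
                       std::size_t total_vertices) {
    double sum = 0.0;
    for (double m : incoming) sum += m;  // cost linear in #messages

    double new_rank = (1.0 - kDamping) / total_vertices + kDamping * sum;
    v.last_delta = new_rank - v.rank;
    v.rank = new_rank;
    return v.rank / v.out_degree;
}

// The vertex votes to halt once its value stops changing appreciably.
bool ShouldHalt(const PageRankVertex& v) {
    return std::fabs(v.last_delta) < kEpsilon;
}
```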
Top-K Ranks in PageRank. It is often desirable to perform graph queries, like using PageRank to find the top k vertex ranks reachable to a vertex after y supersteps. In this case, PageRank runs for x supersteps; at superstep x + 1, each vertex sends its rank to its direct out neighbors, receives other ranks from its in neighbors, and stores the highest received k ranks. At supersteps x + 2 through x + y, each vertex sends the top-k ranks stored locally to its direct out neighbors, and again stores the highest k ranks received from its in neighbors. The message size generated by each vertex can be between [1..k], which depends on the number of ranks stored at the vertex. If the highest k values maintained by a vertex did not change at a superstep, that vertex votes to halt and becomes inactive until it receives messages from other neighbors. The size of the message sent between two vertices varies depending on the actual ranks. We classify this algorithm as non-stationary because of the variability in message size and in the number of messages sent and received.

Distributed Minimal Spanning Tree (DMST). Implementing the GHS algorithm of Gallager, Humblet and Spira [7] in the Pregel model requires that, given a weighted graph, every vertex classifies the state of its edges as (1) branch (i.e., belongs to the minimal spanning tree (MST)), (2) rejected (i.e., does not belong to the MST), or (3) basic (i.e., unclassified), in each superstep. The algorithm starts with each vertex as a fragment, then joins fragments until there is only one fragment left (i.e., the MST). Each fragment has an ID and level. Each vertex updates its fragment ID and level by exchanging messages with its neighbors. The vertex has two states: active and inactive. There is also a list of auto-activated vertices defined by the user. At the early stage of the algorithm, active vertices send messages over the minimum weighted edges (i.e., the edge that has the minimum weight among the other in-edges and out-edges). As the algorithm progresses, more messages are sent across many edges according to the number of fragments identified during the MST computations at each vertex. At the last superstep, each vertex knows which of its edges belong to the minimum weighted spanning tree. The computational complexity for each vertex is quadratic in the number of incoming messages. We classify this algorithm as non-stationary because the flow of messages between vertices is unpredictable and depends on the state of the edges at each vertex.

Simulating Advertisements on Social Networks. To simulate advertisements on a social network graph, each vertex represents a profile of a person containing a list of his/her interests. A small set of selected vertices are identified as sources and send different advertisements to their direct neighbors at each superstep for a predefined number of supersteps. When a vertex receives an advertisement, it is either forwarded to the vertex's neighbors (based on the vertex's interest matrix) or ignored. This algorithm has a computation complexity that depends on the count of active vertices at a superstep and on whether the active vertices communicate with their neighbors or not. It is, thus, a non-stationary algorithm.

We have also implemented and evaluated a number of other—stationary and non-stationary—algorithms, including diameter estimation and finding weakly connected components. Given that their behavior is consistent with the results in the paper, we omit them for space considerations.

3. Mizan

Mizan is a BSP-based graph processing system that is similar to Pregel, but focuses on efficient dynamic load balancing of both computation and communication across all worker (compute) nodes. Like Pregel, Mizan first reads and partitions the graph data across workers. The system then proceeds as a series of supersteps, each separated by a global synchronization barrier. During each superstep, each vertex processes incoming messages from the previous superstep and sends messages to neighboring vertices (which are processed in the following superstep). Unlike Pregel, Mizan balances its workload by moving selected vertices across workers. Vertex migration is performed when all workers reach a superstep synchronization barrier, to avoid violating the computation integrity, isolation and correctness of the BSP compute model.

This section describes two aspects of the design of Mizan related to vertex migration: the distributed runtime monitoring of workload characteristics and a distributed migration planner that decides which vertices to migrate. Other design aspects that are not related to vertex migration (e.g., fault tolerance) follow similar approaches to Pregel [23], Giraph [8], GraphLab [22] and GPS [27]; they are omitted for space considerations. Implementation details are described in Section 4.

3.1 Monitoring

Mizan monitors three key metrics for each vertex and maintains a high-level summary of these metrics for each worker node; summaries are broadcast to other workers at the end of every superstep. As shown in Figure 4, the key metrics for every vertex are (1) the number of outgoing messages to other (remote) workers, (2) the total incoming messages, and (3) the response time (execution time) during the current superstep:

[Figure 4. The statistics monitored by Mizan.]

Outgoing Messages. Only outgoing messages to vertices in remote workers are counted, since local outgoing messages are rerouted to the same worker and do not incur any actual network cost.

Incoming Messages. All incoming messages are monitored: those that are received from remote vertices and those locally generated. This is because queue size can affect the performance of the vertex (i.e., when its buffer capacity is exhausted, paging to disk is required).

Response Time. The response time for each vertex is measured as the time from when the vertex starts processing its incoming messages until it finishes.
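A minimal sketch of what this bookkeeping could look like is given below; the structures, field names, and the aggregation helper are our assumptions for exposition, not Mizan's actual data layout.

```cpp
#include <chrono>
#include <cstdint>

// Per-vertex statistics gathered during a superstep, matching the
// three metrics of Section 3.1 (names are illustrative).
struct VertexStats {
    std::uint64_t remote_outgoing_msgs = 0;  // only cross-worker messages
    std::uint64_t incoming_msgs = 0;         // remote + locally generated
    std::chrono::microseconds response_time{0};
};

// Per-worker summary broadcast to all other workers at the superstep
// barrier; the migration planner consumes these summaries.
struct WorkerSummary {
    std::uint64_t total_remote_outgoing = 0;
    std::uint64_t total_incoming = 0;
    std::chrono::microseconds total_response_time{0};
};

inline void Accumulate(WorkerSummary& s, const VertexStats& v) {
    s.total_remote_outgoing += v.remote_outgoing_msgs;
    s.total_incoming += v.incoming_msgs;
    s.total_response_time += v.response_time;
}
```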
3.2 Migration Planning

High values for any of the three metrics above may indicate poor vertex placement, which leads to workload imbalance. As with many constrained optimization problems, optimizing against three objectives is non-trivial. To reduce the optimization search space, Mizan's migration planner finds the strongest cause of workload imbalance among the three metrics and plans the vertex migration accordingly.

By design, all workers create and execute the migration plan in parallel, without requiring any centralized coordination. The migration planner starts on every worker at the end of each superstep (i.e., when all workers reach the synchronization barrier), after it receives the summary statistics (as described in Section 3.1) from all other workers. Additionally, the execution of the migration planner is sandwiched between a second synchronization barrier, as shown in Figure 5, which is necessary to ensure the correctness of the BSP model.

[Figure 5. Mizan's BSP flow with migration planning.]

[Figure 6. Summary of Mizan's Migration Planner.]

Mizan's migration planner executes the following five steps, summarized in Figure 6:

1. Identify the source of imbalance.
2. Select the migration objective (i.e., optimize for outgoing messages, incoming messages, or response time).
3. Pair over-utilized workers with under-utilized ones.
4. Select vertices to migrate.
5. Migrate vertices.

STEP 1: Identify the source of imbalance. Mizan detects imbalances across supersteps by comparing the summary statistics of all workers against a normal random distribution and flagging outliers. Specifically, at the end of a superstep, Mizan computes the z-score⁴ for all workers. If any worker has a z-score greater than $z_{def}$, Mizan's migration planner flags the superstep as imbalanced. We found that $z_{def} = 1.96$, the commonly recommended value [26], allows for natural workload fluctuations across workers. We have experimented with different values of $z_{def}$ and validated the robustness of our choice.

⁴ The z-score $Wz_i = |x_i - x_{max}| / \sigma$, where $x_i$ is the run time of worker $i$ and $\sigma$ is the standard deviation of run times across workers.
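A literal reading of this test can be sketched as follows, assuming the per-worker run times have already been gathered from the broadcast summaries; the function name and types are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// STEP 1 sketch: flag the superstep as imbalanced if any worker's
// z-score exceeds z_def = 1.96. Per footnote 4 as printed, each run
// time x_i is compared against the maximum run time, normalized by
// the standard deviation across workers.
bool SuperstepImbalanced(const std::vector<double>& runtimes,
                         double z_def = 1.96) {
    if (runtimes.empty()) return false;
    const double n = static_cast<double>(runtimes.size());

    double mean = 0.0, x_max = 0.0;
    for (double x : runtimes) { mean += x; x_max = std::max(x_max, x); }
    mean /= n;

    double var = 0.0;
    for (double x : runtimes) var += (x - mean) * (x - mean);
    const double stddev = std::sqrt(var / n);
    if (stddev == 0.0) return false;  // perfectly balanced

    for (double x : runtimes)
        if (std::abs(x - x_max) / stddev > z_def) return true;
    return false;
}
```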
STEP 2: Select the migration objective. Each worker—identically—uses the summary statistics to compute the correlation between outgoing messages and response time, and also the correlation between incoming messages and response time. The correlation scores are used to select the objective to optimize for: to balance outgoing messages, balance incoming messages, or balance computation time. The default objective is to balance the response time. If the outgoing or incoming messages are highly correlated with the response time, then Mizan chooses the objective with the highest correlation score. The computation of the correlation score is described in Appendix A.

[Figure 7. Matching senders (workers with vertices to migrate) with receivers using their summary statistics.]

STEP 3: Pair over-utilized workers with under-utilized ones. Each overutilized worker that needs to migrate vertices out is paired with a single underutilized worker. While complex pairings are possible, we choose a design that is efficient to execute, especially since the exact number of vertices that different workers plan to migrate is not globally known. Similar to the previous two steps in the migration plan, this step is executed by each worker without explicit global synchronization. Using the summary statistics for the chosen migration objective (in Step 2), each worker creates an ordered list of all workers. For example, if the objective is to balance outgoing messages, then the list will order workers from highest to lowest outgoing messages. The resulting list, thus, places overutilized workers at the top and the least utilized workers at the bottom. The pairing function then successively matches workers from opposite ends of the ordered list. As depicted in Figure 7, if the list contains n elements (one for each worker), then the worker at position i is paired with the worker at position n − i. In cases where a worker does not have memory to receive any vertices, the worker is marked unavailable in the list.

STEP 4: Select vertices to migrate. The number of vertices to be selected from an overutilized worker depends on the difference of the selected migration objective statistics with its paired worker. Assume that $w_x$ is a worker that needs to migrate out a number of vertices and is paired with the receiver $w_y$. The load that should be migrated to the underutilized worker is defined as $\Delta_{xy}$, which equals half the difference in the statistics of the migration objective between the two workers. The selection criteria of the vertices depends on the distribution of the statistics of the migration objective, where the statistic of each vertex is compared against a normal distribution. A vertex is selected if it is an outlier (i.e., if its $Vz^{i}_{stat}$ is greater than $z_{def}$⁵). For example, if the migration objective is to balance the number of remote outgoing messages, vertices with large remote outgoing messages are selected to migrate to the underutilized worker. The sum of the statistics of the selected vertices is denoted by $V_{stat}$, which should minimize $|\Delta_{xy} - V_{stat}|$ to ensure balance between $w_x$ and $w_y$ in the next superstep. If not enough outlier vertices are found, a random set of vertices is selected to minimize $|\Delta_{xy} - V_{stat}|$.

⁵ The z-score $Vz^{i}_{stat} = |x_i - x_{avg}| / \sigma$, where $x_i$ is the statistic of the migration objective of vertex $i$.
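The sketch below illustrates Steps 3 and 4 under the stated design: sort workers by the chosen objective, match opposite ends of the list, then pick outlier vertices whose accumulated statistic approaches $\Delta_{xy}$. All names are hypothetical, and details such as marking memory-constrained workers unavailable and the random fall-back selection are simplified away.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

struct WorkerLoad { int id; double objective_stat; };

// STEP 3 sketch: order workers by the chosen objective and match the
// opposite ends of the list, pairing the most loaded worker with the
// least loaded one, and so on toward the middle.
std::vector<std::pair<int, int>> PairWorkers(std::vector<WorkerLoad> w) {
    std::sort(w.begin(), w.end(),
              [](const WorkerLoad& a, const WorkerLoad& b) {
                  return a.objective_stat > b.objective_stat;
              });
    std::vector<std::pair<int, int>> pairs;
    const std::size_t n = w.size();
    for (std::size_t i = 0; i < n / 2; ++i)
        pairs.push_back({w[i].id, w[n - 1 - i].id});
    return pairs;
}

// STEP 4 sketch: select outlier vertices from the overloaded worker
// until their accumulated statistic approaches delta_xy.
std::vector<int> SelectVertices(const std::vector<double>& vertex_stat,
                                double delta_xy, double z_def = 1.96) {
    if (vertex_stat.empty()) return {};
    const double mean =
        std::accumulate(vertex_stat.begin(), vertex_stat.end(), 0.0) /
        vertex_stat.size();
    double var = 0.0;
    for (double x : vertex_stat) var += (x - mean) * (x - mean);
    const double stddev = std::sqrt(var / vertex_stat.size());

    std::vector<int> selected;
    double moved = 0.0;
    for (std::size_t i = 0; i < vertex_stat.size() && moved < delta_xy; ++i) {
        const bool outlier =
            stddev > 0.0 && std::abs(vertex_stat[i] - mean) / stddev > z_def;
        if (outlier) {
            selected.push_back(static_cast<int>(i));
            moved += vertex_stat[i];
        }
    }
    // The paper additionally falls back to a random set of vertices
    // when too few outliers exist; that step is omitted here.
    return selected;
}
```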
STEP 5: Migrate vertices. After the vertex selection process, the migrating workers start sending the selected vertices while the other workers wait at the migration barrier. A migrating worker sends the selected set of vertices to its unique target worker, where each vertex is encoded into a stream that includes the vertex ID, state, edge information and the received messages it will process. Once a vertex stream is successfully sent, the sending worker deletes the sent vertices so that it does not run them in the next superstep. The receiving worker, on the other hand, receives vertices (together with their messages) and prepares to run them in the next superstep. The next superstep is started once all workers finish migrating vertices and reach the migration barrier. The complexity of the migration process is directly related to the size of the vertices being migrated.

4. Implementation

[Figure 8. Architecture of Mizan.]

Mizan consists of four modules, shown in Figure 8: the BSP Processor, Storage Manager, Communicator, and Migration Planner. The BSP Processor implements the Pregel APIs, consisting primarily of the Compute class and the SendMessageTo, GetOutEdgeIterator and getValue methods. The BSP Processor operates on the data structures of the graph and executes the user's algorithm. It also performs barrier synchronization with other workers at the end of each superstep. The Storage Manager module maintains access atomicity and correctness of the graph's data, and maintains the data structures for message queues. Graph data can be read from and written to either HDFS or local disks, depending on how Mizan is deployed. The Communicator module uses MPI to enable communication between workers; it also maintains distributed vertex ownership information. Finally, the Migration Planner operates transparently across superstep barriers to maintain the dynamic workload balance.

Mizan allows the user's code to manipulate the graph connectivity by adding and removing vertices and edges at any superstep. It also guarantees that all graph mutation commands issued at superstep x are executed at the end of the same superstep and before the BSP barrier, as illustrated in Figure 5. Therefore, vertex migrations performed by Mizan do not conflict with the user's graph mutations, and Mizan always considers the most recent graph structure for migration planning.

When implementing Mizan, we wanted to avoid having a centralized controller. Overall, the BSP (Pregel) model naturally lends itself to a decentralized implementation. There were, however, three key challenges in implementing a distributed control plane that supports fine-grained vertex migration. The first challenge was in maintaining vertex ownership so that vertices can be freely migrated across workers. This is different from existing approaches (e.g., Pregel, Giraph, and GoldenOrb), which operate on a much coarser granularity (clusters of vertices) to enable scalability. The second challenge was in allowing fast updates to the vertex ownership information as vertices get migrated. The third challenge was in minimizing the cost of migrating vertices with large data structures. In this section, we discuss the implementation details around these three challenges, which allow Mizan to achieve its effectiveness and scalability.

4.1 Vertex Ownership

With huge datasets, Mizan workers cannot maintain the management information for all vertices in the graph. Management information includes the collected statistics for each vertex (described in Section 3.1) and the location (ownership) of each vertex. While per-vertex monitoring statistics are only used locally by the worker, vertex ownership information is needed by all workers. When vertices send messages, workers need to know the destination address for each message. With frequent vertex migration, updating the location of the vertices across all workers can easily create a communication bottleneck.

To overcome this challenge, we use a distributed hash table (DHT) [1] to implement a distributed lookup service. The DHT implementation allows Mizan to distribute the overhead of looking up and updating vertex locations across all workers. The DHT stores a set of (key, value) pairs, where the key represents a vertex ID and the value represents its current physical location. Each vertex is assigned a home worker. The role of the home worker is to maintain the current location of the vertex. A vertex can physically exist in any worker, including its home worker. The DHT uses a globally defined hash function that maps the keys to their associated home workers, such that home worker = location_hash(key).

During a superstep, when a (source) vertex sends a message to another (target) vertex, the message is passed to the Communicator. If the target vertex is located on the same worker, the message is rerouted back to the appropriate queue. Otherwise, the source worker uses the location_hash function to locate and query the home worker for the target vertex. The home worker responds with the actual physical location of the target vertex. The source worker finally sends the queued message to the current worker for the target vertex. It also caches the physical location to minimize future lookups.
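The lookup path can be sketched as follows. The name location_hash comes from the paper; the modulo hash, the cache, and the stubbed home-worker query are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>

using VertexId = std::uint64_t;
using WorkerId = std::uint32_t;

// Globally known hash: maps a vertex to the worker that permanently
// tracks its current location (its "home worker"). A plain modulo is
// an assumption; any agreed-upon hash works.
inline WorkerId location_hash(VertexId v, WorkerId num_workers) {
    return static_cast<WorkerId>(v % num_workers);
}

class LocationService {
public:
    explicit LocationService(WorkerId n) : num_workers_(n) {}

    // Resolve the physical worker currently hosting vertex v, caching
    // the answer so repeated sends avoid a second round trip.
    WorkerId Resolve(VertexId v) {
        auto it = cache_.find(v);
        if (it != cache_.end()) return it->second;
        WorkerId home = location_hash(v, num_workers_);
        WorkerId physical = AskHomeWorker(home, v);
        cache_[v] = physical;
        return physical;
    }

    // Applied when a home worker pushes an update after a migration.
    void Update(VertexId v, WorkerId new_loc) { cache_[v] = new_loc; }

private:
    // Stub standing in for the MPI request/response to the home
    // worker; a real implementation performs a network round trip.
    WorkerId AskHomeWorker(WorkerId home, VertexId /*v*/) { return home; }

    WorkerId num_workers_;
    std::unordered_map<VertexId, WorkerId> cache_;
};
```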
[Figure 9. Migrating vertex vi from Worker 1 to Worker 2, while updating its DHT home worker (Worker 3).]

4.2 DHT Updates After Vertex Migration

Figure 9 depicts the vertex migration process. When a vertex v migrates between two workers, the receiving worker sends the new location of v to the home worker of v. The home worker, in turn, sends an update message to all workers that have previously asked for—and, thus, potentially cached—the location of v. Since Mizan migrates vertices in the barrier between two supersteps, all workers that have cached the location of the migrating vertex will receive the updated physical location from the home worker before the start of the new superstep.

If for any reason a worker did not receive the updated location, its messages will be sent to the last known physical location of v. The receiving worker, which no longer owns the vertex, will simply buffer the incorrectly routed messages, ask for the new location of v, and reroute the messages to the correct worker.

4.3 Migrating Vertices with Large Message Size

Migrating a vertex to another worker requires moving its queued messages and its entire state (which includes its ID, value, and neighbors). Especially when processing large graphs, a vertex can have a significant number of queued messages, which are costly to migrate. To minimize the cost, Mizan migrates large vertices using a delayed migration process that spreads the migration over two supersteps. Instead of physically moving the vertex with its large message queue, delayed migration only moves the vertex's information and ownership to the new worker. Assuming $w_{old}$ is the old owner and $w_{new}$ is the new one, $w_{old}$ continues to process the migrating vertex v in the next superstep, $SS_{t+1}$. $w_{new}$ receives the messages for v, which will be processed at the following superstep, $SS_{t+2}$. At the end of $SS_{t+1}$, $w_{old}$ sends the new value of v, calculated at $SS_{t+1}$, to $w_{new}$ and completes the delayed migration. Note that migration planning is disabled for superstep $SS_{t+1}$ after applying delayed migration, to ensure the consistency of the migration plans.

[Figure 10. Delayed vertex migration.]

An example is shown in Figure 10, where Vertex 7 is migrated using delayed migration. As Vertex 7 migrates, the ownership of Vertex 7 is moved to Worker_new and messages sent to Vertex 7 at $SS_{t+1}$ are addressed to Worker_new. At the barrier of $SS_{t+1}$, Worker_old sends the edge information and state of v to Worker_new and starts $SS_{t+2}$ with v fully migrated to Worker_new.

With delayed migration, the consistency of computation is maintained without introducing additional network cost for vertices with large message queues. Since delayed migration uses two supersteps to complete, Mizan may need more steps before it converges on a balanced state.
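The two-barrier protocol can be restated as a small timeline sketch; the phase names and state machine below are our own framing of the description above, not Mizan's code.

```cpp
// Delayed migration of a vertex v with a large message queue.
//
// SS_t    : the planner moves only v's ownership record (DHT entry),
//           so messages produced in SS_{t+1} are addressed to w_new.
// SS_{t+1}: w_old still executes v on the queue it already holds;
//           w_new buffers v's incoming messages for SS_{t+2}.
//           Migration planning is disabled during this superstep.
// SS_{t+2}: at the barrier, w_old has shipped v's value, state and
//           edges; w_new now runs v. The large queue never moved.
struct DelayedMigration {
    enum class Phase { kOwnershipMoved, kStateShipped, kComplete };
    Phase phase = Phase::kOwnershipMoved;

    // Advance at each superstep barrier; two barriers finish the move.
    void OnBarrier() {
        if (phase == Phase::kOwnershipMoved) phase = Phase::kStateShipped;
        else if (phase == Phase::kStateShipped) phase = Phase::kComplete;
    }
};
```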
5. Evaluation

We implemented Mizan using C++ and MPI and compared it against Giraph [8], a Pregel clone implemented in Java on top of Hadoop. We ran our experiments on a local cluster of 21 machines equipped with a mix of i5 and i7 processors and 16GB RAM each. We also used an IBM Blue Gene/P supercomputer with 1024 PowerPC-450 CPUs, each with 4 cores at 850MHz and 4GB RAM. We downloaded publicly available datasets from the Stanford Network Analysis Project⁶ and from The Laboratory for Web Algorithmics (LAW) [2, 20]. We also generated synthetic datasets using the Kronecker [21] generator, which models the structure of real-life networks. The details are shown in Table 1.

Table 1. Datasets—N, E denote nodes and edges, respectively. Graphs with the prefix kg are synthetic.

  G(N, E)          |N|          |E|
  kg1              1,048,576    5,360,368
  kg4m68m          4,194,304    68,671,566
  web-Google       875,713      5,105,039
  LiveJournal1     4,847,571    68,993,773
  hollywood-2011   2,180,759    228,985,632
  arabic-2005      22,744,080   639,999,458

⁶ http://snap.stanford.edu

To better isolate the effects of dynamic migration on system performance, we implemented three variations of Mizan: Static, Work Stealing (WS), and Mizan. Static Mizan disables any dynamic migration and uses either hash-based, range-based or min-cuts graph pre-partitioning (rendering it similar to Giraph). Work Stealing (WS) Mizan is our attempt to emulate Pregel's coarse-grained dynamic load balancing behavior (described in [23]). Finally, Mizan is our framework that supports dynamic migration as described in Sections 3 and 4.

We evaluated Mizan across three dimensions: partitioning scheme, graph structure, and algorithm behavior. We partition the datasets using three schemes: hash-based, range-based, and METIS partitioning. We run our experiments against both social and random graphs. Finally, we experimented with the algorithms described in Section 2.1.

5.1 Giraph vs. Mizan

We picked Giraph for its popularity as an open source Pregel framework with broad adoption. We only compared Giraph to Static Mizan, which allows us to evaluate our base (non-dynamic) implementation. Mizan's dynamic migration is evaluated later in this section. To further equalize Giraph and Static Mizan, we disabled the fault tolerance feature of Giraph to eliminate any additional overhead resulting from frequent snapshots during runtime. In this section, we report the results using PageRank (a stationary and well balanced algorithm) on different graph structures.

[Figure 11. Comparing Static Mizan vs. Giraph using PageRank on social network and random graphs.]

[Figure 12. Comparing Mizan vs. Giraph using PageRank on regular random graphs; the graphs are uniformly distributed, each with around 17M edges.]

In Figure 11, we ran both frameworks on social network and random graphs. As shown, Static Mizan outperforms Giraph by a large margin, up to four times faster than Giraph on kg4m68, which contains around 70M edges. We also compared both systems when increasing the number of nodes in random structure graphs. As shown in Figure 12, Static Mizan consistently outperforms Giraph on all datasets and is up to three times faster with 16 million vertices. While the execution time of both frameworks increases linearly with graph size, the rate of increase—the slope of the graph—for Giraph (0.318) is steeper than for Mizan (0.09), indicating that Mizan also achieves better scalability.

The experiments in Figures 11 and 12 show that Giraph's implementation is inefficient. It is a non-trivial task to discover the source of inefficiency in Giraph since it is tightly coupled with Hadoop. We suspect that part of the inefficiency is due to the initialization cost of the Hadoop jobs and the high overhead of communication. Other factors, like internal data structure choice and memory footprint, might also play a role in this inefficiency.

5.2 Effectiveness of Dynamic Vertex Migration

Given the large performance difference between Static Mizan and Giraph, we exclude Giraph from further experiments and focus on isolating the effects of dynamic migration on the overall performance of the system.

[Figure 13. Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime, while the unshaded part represents the initial partitioning cost of the input graph.]

Figure 13 shows the results for running PageRank on a social graph with various partitioning schemes. We notice that both hash-based and METIS partitioning achieve a balanced workload for PageRank, such that dynamic migration did not improve the results. In comparison, range-based partitioning resulted in poor graph partitioning. In this case, we observe that Mizan (with dynamic partitioning) was able to reduce execution time by approximately 40% when compared to the static version. We have also evaluated the diameter estimation algorithm, which behaves the same as PageRank but exchanges larger messages. Mizan exhibited similar behavior with diameter estimation; the results are omitted for space considerations.

Figure 14 shows how Mizan's dynamic migration was able to optimize running PageRank starting with range-based partitioning. The figure shows that Mizan's migration reduced both the variance in workers' runtime and the superstep's runtime. The same figure also shows that Mizan's migration alternated the optimization objective between the number of outgoing messages and the vertex response time, illustrated as points on the superstep's runtime trend.

By looking at both Figures 14 and 15, we observe that the convergence of Mizan's dynamic migration is correlated with the algorithm's runtime reduction. For the PageRank algorithm with range-based partitioning, Mizan requires 13 supersteps to reach an acceptably balanced workload. Since PageRank is a stationary algorithm, Mizan's migration converged quickly; we expect that it would require more supersteps to converge on other algorithms and datasets. In general, Mizan requires multiple supersteps before it balances the workload. This also explains why running Mizan with range-based partitioning is less efficient than running Mizan with METIS or hash-based partitioning.

As described in Section 2.1, Top-K PageRank adds variability in the communication among the graph nodes. As shown in Figure 16, such variability in the messages exchanged leads to minor variation in both hash-based and METIS execution times. Similarly, with range partitioning, Mizan had better performance than the static version. The slight performance improvement arises from the fact that the base algorithm (PageRank) dominates the execution time. If a large number of queries are performed, the improvements will be more significant.

To study the effect of algorithms with highly variable messaging patterns, we evaluated Mizan using two algorithms: DMST and advertisement propagation simulation. In both cases, we used METIS to pre-partition the graph data. METIS partitioning groups the strongly connected subgraphs into clusters, thus minimizing the global communication among the clusters.
  • 10. 4 40 Superstep Runtime 35 3.5 Average Workers Runtime Run Time (Minutes) Run Time (Min) 3 Migration Cost 30 Balance Outgoing Messages 25 2.5 Balance Incoming Messages 20 2 Balance Response Time 15 1.5 10 1 5 0.5 0 Mizan Mizan Mizan Static Static Static 0 WS WS WS 5 10 15 20 25 30 Supersteps Hash Range Metis Figure 14. Comparing Mizan’s migration cost with the su- Figure 16. Comparing static Mizan and Work Stealing perstep’s (or the algorithm) runtime at each supertstep us- (Pregel clone) vs. Mizan using Top-K PageRanks on a so- ing PageRank on a social graph (LiveJournal1) starting cial graph (LiveJournal1). The shaded part of each col- with range based partitioning. The points on the superstep’s umn represents the algorithm runtime while unshaded parts runtime trend represents Mizan’s migration objective at that represents the initial partitioning cost of the input graph. specific superstep. 4 300 Static Migrated Vertices (x1000) 500 Migration Cost Migration Cost (Minutes) Total Migrated Vertices 250 Work Stealing Max Vertices Migrated 3 Mizan 400 Run Time (Min) by Single Worker 200 300 2 150 200 100 1 100 50 0 0 0 5 10 15 20 25 30 Adve DMS r tism T Supersteps e nt Figure 15. Comparing both the total migrated vertices and Figure 17. Comparing Work stealing (Pregel clone) vs. the maximum migrated vertices by a single worker for Mizan using DMST and Propagation Simulation on a metis PageRank on LiveJournal1 starting with range-based par- partitioned social graph (LiveJournal1) titioning. The migration cost at each superstep is also shown. of the quadratic complexity of computation as a function of Total runtime (s) 707 connectivity degree, some workers will suffer from exten- Data read time to memory (s) 72 sive workload while others will have light workload. Such Migrated vertices 1,062,559 imbalance in the workload leads to the results shown in Fig- Total migration time (s) 75 ure 17. Mizan was able to reduce the imbalance in the work- Average migrate time per vertex (µs) 70.5 load, resulting in a large drop in execution time (two orders of magnitude improvement). Even when using our version Table 2. Overhead of Mizan’s migration process when of Pregel’s load balancing approach (called work stealing), compared to the total runtime using range partitioning on Mizan is roughly eight times faster. LiveJournal1 graph Similar to DMST, the behavior of the advertising propa- gation simulation algorithm varies across supersteps. In this algorithm, a dynamic set of graph vertices communicate step. Even METIS partitioning in such case is ill-suited since heavily with each other, while others have little or no com- workers’ load dynamically changes at runtime. As shown in munication. In every superstep, the communication behav- Figure 17, similar to DMST, Mizan is able to reduce such an ior differs depending on the state of the vertex. Therefore, imbalance, resulting in approximately 200% speed up when it creates an imbalance across the workers for every super- compared to the work stealing and static versions.
Table 3. Scalability of Mizan on a Linux cluster of 16 machines (hollywood-2011 dataset) and an IBM Blue Gene/P supercomputer (arabic-2005 dataset).

    Linux Cluster (hollywood-2011)     Blue Gene/P (arabic-2005)
    Processors    Runtime (m)          Processors    Runtime (m)
         2           154                    64          144.7
         4            79.1                 128           74.6
         8            40.4                 256           37.9
        16            21.5                 512           21.5
                                          1024           17.5

Figure 18. Speedup on a Linux cluster of 16 machines using PageRank on the hollywood-2011 dataset.

Figure 19. Speedup on the Shaheen IBM Blue Gene/P supercomputer using PageRank on the arabic-2005 dataset.

5.3 Overhead of Vertex Migration
To analyze migration cost, we measured various performance metrics of Mizan. We used the PageRank algorithm with range-based partitioning of the LiveJournal1 dataset on 21 workers. We chose range-based partitioning because it produces the worst data distribution in our earlier experiments and therefore triggers frequent vertex migrations to balance the workload.

Table 2 reports the average cost of migrating a vertex as 70.5 µs (75 s of total migration time over 1,062,559 migrated vertices). In the LiveJournal1 dataset, Mizan paid a 9% penalty of the total runtime to balance the workload, transferring over 1M vertices. As shown earlier in Figure 13, this resulted in a 40% saving in computation time when compared to Static Mizan. Moreover, Figures 14 and 15 compare the algorithm runtime against the migration cost at each superstep: the migration cost is at most 13% (at superstep 2) and on average 6% across all supersteps that included a migration phase.

5.4 Scalability of Mizan
We tested the scalability of Mizan on the Linux cluster, as shown in Table 3. We used two compute nodes as our base reference because a single node was too small to run the hollywood-2011 dataset, causing significant paging activity. As Figure 18 shows, Mizan scales linearly with the number of workers.

We were also interested in performing large scale-out experiments, well beyond what can be achieved on public clouds. Since Mizan was written in C++ and uses MPI for message passing, it was easily ported to IBM's Blue Gene/P supercomputer. Once ported, we natively ran Mizan on 1024 Blue Gene/P compute nodes. The results are shown in Table 3. We ran the PageRank algorithm on a huge graph (arabic-2005) that contains 639M edges. As shown in Figure 19, Mizan scales linearly from 64 to 512 compute nodes, then starts to flatten as we increase to 1024 compute nodes (144.7/17.5, or roughly an 8.3x speedup, against an ideal 16x from 64 to 1024 nodes). The flattening was expected: with an increased number of cores, compute nodes spend more time communicating than computing. We expect that as we continue to increase the number of CPUs, most of the time will be spent communicating, which effectively breaks the BSP model of overlapping communication and computation.

6. Related Work
In the past few years, a number of large-scale graph processing systems have been proposed. A common thread across the majority of existing work is how to partition graph data and how to balance computation across compute nodes. This problem amounts to applying dynamic load balancing to the distributed graph data by utilizing on-the-fly graph partitioning, which Hendrickson and Devine [11] assert to be either too expensive or hard to parallelize. Mizan follows the Pregel model, but focuses on dynamic load balancing of vertices. In this section, we examine the design aspects of existing work for achieving a balanced workload.

Pregel and its Clones. The default partitioning scheme for the input graph used by Pregel [23] is hash based. In every superstep, Pregel assigns more than one subgraph (partition) to a worker. While this load balancing approach helps in balancing computation in any superstep, it is coarse-grained and reacts slowly to large imbalances in the initial partitioning or to large changes in runtime behavior.
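As a concrete reference for the two static schemes mentioned here and throughout the evaluation, the following is a minimal sketch (our illustration, not Pregel's or Giraph's actual code) of hash-based and range-based vertex-to-worker assignment:

```cpp
#include <cstdint>
#include <cstdio>

// Pregel-style default: scatter vertex ids across workers by hashing.
static int hash_partition(uint64_t vertex_id, int num_workers) {
    return static_cast<int>(vertex_id % num_workers);
}

// Range-based: each worker owns a contiguous interval of vertex ids.
static int range_partition(uint64_t vertex_id, uint64_t max_vertex_id,
                           int num_workers) {
    const uint64_t span = max_vertex_id / num_workers + 1;
    return static_cast<int>(vertex_id / span);
}

int main() {
    const int workers = 4;
    const uint64_t max_id = 999;
    const uint64_t ids[] = {0, 1, 499, 500, 999};
    for (uint64_t v : ids)
        std::printf("vertex %3llu -> hash: worker %d, range: worker %d\n",
                    static_cast<unsigned long long>(v),
                    hash_partition(v, workers),
                    range_partition(v, max_id, workers));
    return 0;
}
```

Both rules fix the vertex-to-worker mapping before the computation starts, which is precisely why they cannot react to imbalance that only appears at runtime.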
Giraph, an open-source Pregel clone, supports both hash-based and range-based partitioning schemes; the user can also implement a different graph partitioning scheme. Giraph balances the partitions across the workers based on the number of edges or vertices. GoldenOrb [9] uses HDFS as its storage system and employs hash-based partitioning. An optimization introduced in [12] uses METIS partitioning, instead of GoldenOrb's hash-based partitioning, to maintain a balanced workload among the partitions; additionally, its distributed data storage supports duplication of frequently accessed vertices. Similarly, Surfer [4] utilizes min-cut graph partitioning to improve the performance of distributed vertex-centric graph processing. However, Surfer focuses more on providing a bandwidth-aware variant of the multilevel graph partitioning algorithm (integrated into a Pregel implementation) to improve the performance of graph partitioning on the cloud. Unlike Mizan, these systems optimize for neither dynamic algorithm behavior nor message traffic.

Power-law Optimized Graph Processing Systems. Stanford's Graph Processing System (GPS) [27] has three additional features over Pregel. First, GPS extends Pregel's API to handle graph algorithms that perform both vertex-centric and global computations. Second, GPS supports dynamic repartitioning based on the outgoing communication. Third, GPS provides an optimization scheme called large adjacency list partitioning, where the adjacency lists of high-degree vertices are partitioned across workers. GPS only monitors outgoing messages, while Mizan dynamically selects the most appropriate metric based on changes in runtime characteristics. PowerGraph [10], on the other hand, is a distributed graph system that overcomes the challenges of processing natural graphs with extreme power-law distributions. PowerGraph introduces vertex-based partitioning with replication to distribute the graph's data based on its edges. It also provides a programming abstraction (based on gather, apply and scatter operators) to distribute the user's algorithm workload on its graph processing system. The authors of [10] focused on evaluating stationary algorithms on power-law graphs. It is, thus, unclear how their system performs across the broader class of algorithms (e.g., non-stationary) and graphs (e.g., non-power-law). In designing Mizan, we did not make any assumptions about either the type of algorithm or the shape of the input graph. Nonetheless, PowerGraph's replication approach is complementary to Mizan's dynamic load balancing techniques; the two can be combined to improve the robustness of the underlying graph processing system.

Shared Memory Graph Processing Systems. GraphLab [22] is a parallel framework (that uses distributed shared memory) for efficient machine learning algorithms. The authors extended their original multi-core GraphLab implementation to a cloud-based distributed GraphLab, where they apply a two-phase partitioning scheme that minimizes the edges among graph partitions and allows for fast graph repartitioning. Unlike BSP frameworks, computation in GraphLab is executed on the most recently available data. GraphLab ensures that computations are sequentially consistent through a derived consistency model, guaranteeing that the end results are consistent as well. HipG [18] is designed to operate on distributed memory machines while providing transparent memory sharing to support hierarchical parallel graph algorithms. HipG also supports processing graph algorithms following the BSP model, but unlike the global synchronous barriers of Pregel, HipG applies more fine-grained barriers, called synchronizers, during algorithm execution. The user has to be involved in writing the synchronizers, which adds complexity to the user's code. GraphChi [19] is a disk-based graph processing system, based on GraphLab, that is designed to run on a single machine with limited memory. It requires only a small number of non-sequential disk accesses to process subgraphs from disk, while allowing for asynchronous iterative computations. Unlike Mizan, these systems do not support any dynamic load balancing techniques.

Specialized Graph Systems. Kineograph [5] is a distributed system that targets the processing of fast-changing graphs. It captures the relationships in the input data, reflects them on the graph structure, and creates regular snapshots of the data. Data consistency is ensured in Kineograph by separating graph processing and graph update routines. This system is mainly developed to support algorithms that mine the incremental changes in the graph structure over a series of graph snapshots. The user can also run an offline iterative analysis over a copy of the graph, similar to Pregel. The Little Engine(s) [25] is also a distributed system designed to scale with online social networks. This system utilizes graph repartitioning with one-hop replication to rebalance any changes to the graph structure, localize graph data, and reduce the network cost associated with query processing. The Little Engine(s) repartitioning algorithm aims to find min-cuts in a graph under the condition of minimizing the replication overhead at each worker. This system works as a middle-box between an application and its database systems; it is not intended for offline analytics or batch data processing like Pregel or MapReduce. Trinity [29] is a distributed, in-memory, key-value-store-based graph processing system that supports both offline analytics and online query answering. Trinity provides a similar, but more restrictive, computing model to Pregel for offline graph analytics. The key advantage of Trinity is its low-latency graph storage design, where the data of each vertex is stored in a binary format to eliminate decoding and encoding costs. While these systems address specific design challenges in processing large graphs, their focus is orthogonal to the design of Mizan.
7. Future Work
We are planning on making Mizan an open-source project. We will continue to work on a number of open problems as outlined in this section. Our goal is to provide a robust and adaptive platform that frees developers from worrying about the structure of the data or the runtime variability of their algorithms.

Although Mizan is a graph-agnostic framework, its performance can degrade when processing extremely skewed graphs, due to frequent migration of highly connected vertices. We believe that vertex replication, as proposed by PowerGraph [10], offers a good solution, but it mandates the implementation of custom combiners. We plan to further reduce frequent migrations in Mizan by leveraging vertex replication, however, without requiring any custom combiners.

So far, dynamic migration in Mizan optimizes for workloads generated by a single application or algorithm. We are interested in investigating how to efficiently run multiple algorithms on the same graph. This would better position Mizan to be offered as a service for very large datasets.

8. Conclusion
In this paper, we presented Mizan, a Pregel system that uses fine-grained vertex migration to load balance computation and communication across supersteps. Using distributed measurements of the performance characteristics of all vertices, Mizan identifies the cause of workload imbalance and constructs a vertex migration plan without requiring centralized coordination. Using a representative number of datasets and algorithms, we have shown both the efficacy and robustness of our design against varying workload conditions. We have also shown the linear scalability of Mizan, scaling to 1024 CPUs.

A. Computing the Correlation Score
In Mizan, the correlation scores between the workers' response times and their summary statistics are calculated using two tests: clustering and sorting. In the clustering test CT, elements from the set of worker response times $S_{Time}$ and the statistics set $S_{Stat}$ (either incoming or outgoing message statistics) are divided into two categories based on their z-score^7: positive and negative clusters. Elements from a set $S$ are categorized as positive if their z-scores are greater than some $z_{def}$; elements categorized as negative have a z-score less than $-z_{def}$; elements in the range $(-z_{def}, z_{def})$ are ignored. The score of the clustering test is calculated from the total count of elements that belong to the same worker in the positive and negative sets:

$$CT_{score} = \frac{count(S^{+}_{Time} \cap S^{+}_{Stat}) + count(S^{-}_{Time} \cap S^{-}_{Stat})}{count(S_{Time}) + count(S_{Stat})}$$

The sorting test ST consists of sorting both sets, $S_{Time}$ and $S_{Stat}$, and calculating the z-scores for all elements. Elements with a z-score in the range $(-z_{def}, z_{def})$ are excluded from both sorted sets. For the two sorted sets $S_{Time}$ and $S_{Stat}$ of size $n$ and $i \in [0..n-1]$, let $S^{i}_{Time}$ denote the $i$-th element of $S_{Time}$ and $S^{i}_{Stat}$ the $i$-th element of $S_{Stat}$. A mismatch is defined if the $i$-th elements of the two sets are related to different workers, that is, $S^{i}_{Time} \in Worker_x$ while $S^{i}_{Stat} \notin Worker_x$. A match, on the other hand, is defined if $S^{i}_{Time} \in Worker_x$ and $S^{i}_{Stat} \in Worker_x$. The score of the sorting test is then:

$$ST_{score} = \frac{count(matches)}{count(matches) + count(mismatches)}$$

The correlation of the sets $S_{Time}$ and $S_{Stat}$ is reported as strongly correlated if $(CT + ST)/2 \geq 1$, weakly correlated if $1 > (CT + ST)/2 \geq 0.5$, and uncorrelated otherwise.

^7 The z-score of an element $x_i$ of $S_{Time}$ or $S_{Stat}$ is $|x_i - x_{avg}|$ divided by the standard deviation of the set, where $x_i$ is the runtime of worker $i$ or its statistic.
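The two tests translate directly into code. Below is a minimal, self-contained sketch of our reading of this appendix (not Mizan's source): it assumes element $i$ of each input vector is worker $i$'s value, and it uses the signed deviation $(x_i - x_{avg})/\sigma$ so that positive and negative clusters are distinguishable:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Signed z-scores of a set of per-worker observations.
static std::vector<double> zscores(const std::vector<double>& s) {
    double avg = 0.0, var = 0.0;
    for (double x : s) avg += x;
    avg /= s.size();
    for (double x : s) var += (x - avg) * (x - avg);
    const double sd = std::sqrt(var / s.size());
    std::vector<double> z(s.size());
    for (size_t i = 0; i < s.size(); ++i) z[i] = (s[i] - avg) / sd;
    return z;
}

// CT: a worker whose runtime and statistic fall in the same (positive or
// negative) cluster contributes one element from each set to the numerator;
// elements with |z| < z_def are ignored, as in the appendix.
static double clusterTest(const std::vector<double>& zt,
                          const std::vector<double>& zs, double z_def) {
    int same = 0, total = 0;
    for (size_t i = 0; i < zt.size(); ++i) {
        const bool t = std::fabs(zt[i]) > z_def, s = std::fabs(zs[i]) > z_def;
        total += (t ? 1 : 0) + (s ? 1 : 0);
        if (t && s && (zt[i] > 0) == (zs[i] > 0)) same += 2;
    }
    return total ? static_cast<double>(same) / total : 0.0;
}

// ST: sort the retained workers of each set by z-score; position i is a
// match when both sorted orders name the same worker there.
static double sortTest(const std::vector<double>& zt,
                       const std::vector<double>& zs, double z_def) {
    std::vector<int> wt, ws;  // worker ids surviving the z_def filter
    for (size_t i = 0; i < zt.size(); ++i) {
        if (std::fabs(zt[i]) > z_def) wt.push_back(static_cast<int>(i));
        if (std::fabs(zs[i]) > z_def) ws.push_back(static_cast<int>(i));
    }
    std::sort(wt.begin(), wt.end(), [&](int a, int b) { return zt[a] < zt[b]; });
    std::sort(ws.begin(), ws.end(), [&](int a, int b) { return zs[a] < zs[b]; });
    const size_t n = std::min(wt.size(), ws.size());
    int matches = 0;
    for (size_t i = 0; i < n; ++i) matches += (wt[i] == ws[i]);
    return n ? static_cast<double>(matches) / n : 0.0;
}

int main() {
    std::vector<double> resp  = {1.1, 0.9, 1.0, 3.2, 1.0};  // worker runtimes
    std::vector<double> stats = {100, 90, 95, 310, 98};     // e.g., in-messages
    const double ct = clusterTest(zscores(resp), zscores(stats), 1.0);
    const double st = sortTest(zscores(resp), zscores(stats), 1.0);
    std::printf("CT = %.2f, ST = %.2f, average = %.2f\n", ct, st, (ct + st) / 2);
    return 0;
}
```

With these toy numbers, the single clear outlier (worker 3) lands in both positive clusters and at the same position in both sorted orders, so CT = ST = 1 and the pair is reported as strongly correlated.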
References
[1] H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Looking Up Data in P2P Systems. Communications of the ACM, 46(2):43–48, 2003.
[2] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A Scalable Fully Distributed Web Crawler. Software: Practice & Experience, 34(8):711–726, 2004.
[3] D. Buntinas. Designing a Common Communication Subsystem. In Proceedings of the 12th European Parallel Virtual Machine and Message Passing Interface Conference (Euro PVM MPI), pages 156–166, Sorrento, Italy, 2005.
[4] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li. Improving Large Graph Processing on Partitioned Graphs in the Cloud. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SOCC), pages 3:1–3:13, San Jose, California, USA, 2012.
[5] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: Taking the Pulse of a Fast-Changing and Connected World. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys), pages 85–98, Bern, Switzerland, 2012.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 137–150, San Francisco, California, USA, 2004.
[7] R. G. Gallager, P. A. Humblet, and P. M. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1):66–77, 1983.
[8] Giraph. Apache Incubator Giraph. http://incubator.apache.org/giraph, 2012.
[9] GoldenOrb. A Cloud-based Open Source Project for Massive-Scale Graph Analysis. http://goldenorbos.org/, 2012.
[10] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17–30, Hollywood, California, USA, 2012.
[11] B. Hendrickson and K. Devine. Dynamic Load Balancing in Computational Mechanics. Computer Methods in Applied Mechanics and Engineering, 184(2–4):485–500, 2000.
[12] L.-Y. Ho, J.-J. Wu, and P. Liu. Distributed Graph Database for Large-Scale Social Computing. In Proceedings of the IEEE 5th International Conference on Cloud Computing (CLOUD), pages 455–462, Honolulu, Hawaii, USA, 2012.
[13] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations. In Proceedings of the IEEE 9th International Conference on Data Mining (ICDM), pages 229–238, Miami, Florida, USA, 2009.
[14] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):8:1–8:24, 2011.
[15] G. Karypis and V. Kumar. Parallel Multilevel K-Way Partitioning Scheme for Irregular Graphs. In Proceedings of the ACM/IEEE Conference on Supercomputing, pages 278–300, Pittsburgh, Pennsylvania, USA, 1996.
[16] G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
[17] Z. Khayyat, K. Awara, H. Jamjoom, and P. Kalnis. Mizan: Optimizing Graph Mining in Large Parallel Systems. Technical report, King Abdullah University of Science and Technology, 2012.
[18] E. Krepska, T. Kielmann, W. Fokkink, and H. Bal. HipG: Parallel Processing of Large-Scale Graphs. ACM SIGOPS Operating Systems Review, 45(2):3–13, 2011.
[19] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 31–46, Hollywood, California, USA, 2012.
[20] LAW. The Laboratory for Web Algorithmics. http://law.di.unimi.it/index.php, 2012.
[21] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Research, 11:985–1042, 2010.
[22] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proceedings of the VLDB Endowment (PVLDB), 5(8):716–727, 2012.
[23] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135–146, Indianapolis, Indiana, USA.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://ilpubs.stanford.edu:8090/422/, 2001.
[25] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez. The Little Engine(s) That Could: Scaling Online Social Networks. IEEE/ACM Transactions on Networking, 20(4):1162–1175, 2012.
[26] D. Rees. Foundations of Statistics. Chapman and Hall, London / New York, 1987. ISBN 0412285606.
[27] S. Salihoglu and J. Widom. GPS: A Graph Processing System. Technical report, Stanford University, 2012.
[28] S. Seo, E. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. HAMA: An Efficient Matrix Computation with the MapReduce Framework. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), pages 721–726, Athens, Greece.
[29] B. Shao, H. Wang, and Y. Li. Trinity: A Distributed Graph Engine on a Memory Cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, New York, USA.
[30] L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33:103–111, 1990.
[31] W. Xue, J. Shi, and B. Yang. X-RIME: Cloud-Based Large Scale Social Network Analysis. In Proceedings of the 2010 IEEE International Conference on Services Computing (SCC), pages 506–513, Shanghai, China.