Transcript of "Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing"
Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

Zuhair Khayyat‡, Karim Awara‡, Amani Alonazi‡, Hani Jamjoom†, Dan Williams†, Panos Kalnis‡
‡King Abdullah University of Science and Technology, Saudi Arabia
†IBM T. J. Watson Research Center, Yorktown Heights, NY

Abstract
Pregel was recently introduced as a scalable graph mining system that can provide significant performance improvements over traditional MapReduce implementations. Existing implementations focus primarily on graph partitioning as a preprocessing step to balance computation across compute nodes. In this paper, we examine the runtime characteristics of a Pregel system. We show that graph partitioning alone is insufficient for minimizing end-to-end computation. Especially where data is very large or the runtime behavior of the algorithm is unknown, an adaptive approach is needed. To this end, we introduce Mizan, a Pregel system that achieves efficient load balancing to better adapt to changes in computing needs. Unlike known implementations of Pregel, Mizan does not assume any a priori knowledge of the structure of the graph or behavior of the algorithm. Instead, it monitors the runtime characteristics of the system. Mizan then performs efficient fine-grained vertex migration to balance computation and communication. We have fully implemented Mizan; using extensive evaluation we show that, especially for highly dynamic workloads, Mizan provides up to 84% improvement over techniques leveraging static graph pre-partitioning.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Eurosys'13, April 15-17, 2013, Prague, Czech Republic
Copyright © 2013 ACM 978-1-4503-1994-2/13/04...$15.00

1. Introduction
With the increasing emphasis on big data, new platforms are being proposed to better exploit the structure of both the data and algorithms (e.g., Pregel, HADI, PEGASUS and X-RIME). Recently, Pregel was introduced as a message passing-based programming model specifically targeting the mining of large graphs. For a wide array of popular graph mining algorithms (including PageRank, shortest paths problems, bipartite matching, and semi-clustering), Pregel was shown to improve the overall performance by 1-2 orders of magnitude over a traditional MapReduce implementation. Pregel builds on the Bulk Synchronous Parallel (BSP) programming model. It operates on graph data, consisting of vertices and edges. Each vertex runs an algorithm and can send messages, asynchronously, to any other vertex. Computation is divided into a number of supersteps (iterations), each separated by a global synchronization barrier. During each superstep, vertices run in parallel across a distributed infrastructure. Each vertex processes incoming messages from the previous superstep. This process continues until all vertices have no messages to send, thereby becoming inactive.

Not surprisingly, balanced computation and communication is fundamental to the efficiency of a Pregel system. To this end, existing implementations of Pregel (including Giraph, GoldenOrb, Hama, Surfer) primarily focus on efficient partitioning of input data as a preprocessing step. They take one or more of the following five approaches to achieving a balanced workload: (1) provide simple graph partitioning schemes, like hash- or range-based partitioning (e.g., Giraph), (2) allow developers to set their own partitioning scheme or pre-partition the graph data (e.g., Pregel), (3) provide more sophisticated partitioning techniques (e.g., GraphLab, GoldenOrb and Surfer use min-cuts), (4) utilize distributed data stores and graph indexing on vertices and edges (e.g., GoldenOrb and Hama), and (5) perform coarse-grained load balancing (e.g., Pregel).

In this paper, we analyze the efficacy of these workload-balancing approaches when applied to a broad range of graph mining problems. Built into the design of these approaches is the assumption that the structure of the graph is static, the algorithm has predictable behavior, or that the developer has deep knowledge about the runtime characteristics of the algorithm. Using large datasets from the Laboratory for Web Algorithmics (LAW) [2, 20] and a broad set of graph algorithms, we show that, especially when the
graph mining algorithm has unpredictable communication needs, frequently changes the structure of the graph, or has a variable computation complexity, a single solution is not enough. We further show that a poor partitioning can result in significant performance degradation or large up-front cost.

More fundamentally, existing workload balancing approaches suffer from poor adaptation to variability (or shifts) in computation needs. Even the dynamic partition assignment that was proposed in Pregel reacts poorly to highly dynamic algorithms. We are, thus, interested in building a system that is (1) adaptive, (2) agnostic to the graph structure, and (3) requires no a priori knowledge of the behavior of the algorithm. At the same time, we are interested in improving the overall performance of the system.

To this end, we introduce Mizan (footnote 1), a Pregel system that supports fine-grained load balancing across supersteps. Similar to other Pregel implementations, Mizan supports different schemes to pre-partition the input graph. Unlike existing implementations, Mizan monitors runtime characteristics of all vertices (i.e., their execution time, and incoming and outgoing messages). Using these measurements, at the end of every superstep, Mizan constructs a migration plan that minimizes the variations across workers by identifying which vertices to migrate and where to migrate them to. Finally, Mizan performs efficient vertex migration across workers, leveraging a distributed hash table (DHT)-based location service to track the movement of vertices as they migrate.

Footnote 1: Mizan is Arabic for a double-pan scale, commonly used in reference to achieving balance.

All components of Mizan support distributed execution, eliminating the need for a centralized controller. We have fully implemented Mizan in C++, with the messaging (BSP) layer implemented using MPI. Using a representative number of datasets and a broad spectrum of algorithms, we compare our implementation against Giraph, a popular Pregel clone, and show that Mizan consistently outperforms Giraph by an average of 202%. Normalizing across platforms, we show that for stable workloads, Mizan's dynamic load balancing matches the performance of the static partitioning without requiring a priori knowledge of the structure of the graph or algorithm runtime characteristics. When the algorithm is highly dynamic, we show that Mizan provides over 87% performance improvement. Even with inefficient input partitioning, Mizan is able to reduce the computation overhead by up to 40% when compared to the static case, incurring only a 10% overhead for performing dynamic vertex migration. We even run our implementation on an IBM Blue Gene/P supercomputer, demonstrating linear scalability to 1024 CPUs. To the best of our knowledge, this is the largest scale-out test of a Pregel-like system to date.

To summarize, we make the following contributions:
• We analyze different graph algorithm characteristics that can contribute to imbalanced computation of a Pregel system.
• We propose a dynamic vertex migration model based on runtime monitoring of vertices to optimize the end-to-end computation.
• We fully implement Mizan in C++ as an optimized Pregel system that supports dynamic load balancing and efficient vertex migration.
• We deploy Mizan on a local Linux cluster (21 machines) and evaluate its efficacy on a representative number of datasets. We also show the linear scalability of our design by running Mizan on a 1024-CPU IBM Blue Gene/P.

This paper is organized as follows. Section 2 describes factors affecting algorithm behavior and discusses a number of example algorithms. In Section 3, we introduce the design of Mizan. We describe implementation details in Section 4. Section 5 compares the performance of Mizan using a wide array of datasets and algorithms. We then describe related work in Section 6. Section 7 discusses future work and Section 8 concludes.

2. Dynamic Behavior of Algorithms
In the Pregel (BSP) computing model, several factors can affect the runtime performance of the underlying system (shown in Figure 1). During a superstep, each vertex is either in an active or inactive state. In an active state, a vertex may be computing, sending messages, or processing received messages.

Figure 1. Factors that can affect the runtime in the Pregel framework. [Diagram: Workers 1 to 3 across Supersteps 1 to 3, separated by BSP barriers; legend: vertex response time, time to receive in messages, time to send out messages.]

Naturally, vertices can experience variable execution times depending on the algorithm they implement or their degree of connectivity. A densely connected vertex, for example, will likely spend more time sending and processing incoming messages than a sparsely connected vertex. The BSP programming model, however, masks this variation by (1) overlapping communication and computation, and (2) running many vertices on the same compute node. Intuitively, as long as the workloads of vertices are roughly balanced across computing nodes, the overall computation time will be minimized.

Counter to intuition, achieving a balanced workload is not straightforward. There are two sources of imbalance: one originating from the graph structure and another from the algorithm behavior. In both cases, vertices on some compute nodes can spend a disproportionate amount of time computing, sending or processing messages. In some cases, these vertices can run out of input buffer capacity and start paging, further exacerbating the imbalance.

Existing implementations of Pregel (including Giraph, GoldenOrb, Hama) focus on providing multiple alternatives to partitioning the graph data. The three common approaches to partitioning the data are hash-based, range-based, or min-cut. Hash- and range-based partitioning methods divide a dataset based on a simple heuristic: to evenly distribute vertices across compute nodes, irrespective of their edge connectivity. Min-cut based partitioning, on the other hand, considers vertex connectivity and partitions the data such that it places strongly connected vertices close to each other (i.e., on the same cluster). The resulting performance of these partitioning approaches is, however, graph dependent. To demonstrate this variability, we ran a simple, and highly predictable, PageRank algorithm on different datasets (summarized in Table 1) using the three popular partitioning methods. Figure 2 shows that none of the partitioning methods consistently outperforms the rest, noting that ParMETIS partitioning cannot be performed on the arabic-2005 graph due to memory limitations.

Figure 2. The difference in execution time when processing PageRank on different graphs using hash-based, range-based and min-cuts partitioning. Because of its size, arabic-2005 cannot be partitioned using min-cuts (ParMETIS) in our local cluster. [Bar chart: Run Time (Min) for Hash, Range and Min-cuts on LiveJournal, kgraph4m68m and arabic-2005.]

In addition to the graph structure, the running algorithm can also affect the workload balance across compute nodes. Broadly speaking, graph algorithms can be divided into two categories (based on their communication characteristics across supersteps): stationary and non-stationary.

Stationary Graph Algorithms. An algorithm is stationary if its active vertices send and receive the same distribution of messages across supersteps. At the end of a stationary algorithm, all active vertices become inactive (terminate) during the same superstep. Usually graph algorithms represented by a matrix-vector multiplication (footnote 2) are stationary algorithms, including PageRank, diameter estimation and finding weakly connected components.

Footnote 2: The matrix represents the graph adjacency matrix and the vector represents the vertices' values.

Non-stationary Graph Algorithms. A graph algorithm is non-stationary if the destination or size of its outgoing messages changes across supersteps. Such variations can create workload imbalances across supersteps. Examples of non-stationary algorithms include distributed minimal spanning tree construction (DMST), graph queries, and various simulations on social network graphs (e.g., advertisement propagation).

To illustrate the differences between the two classes of algorithms, we compared the runtime behavior of PageRank (stationary) against DMST (non-stationary) when processing the same dataset (LiveJournal1) on a cluster of 21 machines. The input graph was partitioned using a hash function. Figure 3 shows the variability in the incoming messages per superstep for all workers. In particular, the variability can span over five orders of magnitude for non-stationary algorithms.

Figure 3. The difference between stationary and non-stationary graph algorithms with respect to the incoming messages. Total represents the sum across all workers and Max represents the maximum amount (on a single worker) across all workers. [Line plot: In Messages (Millions), log scale, versus SuperSteps; series: PageRank - Total, PageRank - Max, DMST - Total, DMST - Max.]

The remainder of the section describes four popular graph mining algorithms that we use throughout the paper. They cover both stationary and non-stationary algorithms.

2.1 Example Algorithms
PageRank. PageRank is a stationary algorithm that uses matrix-vector multiplications to calculate the eigenvalues of the graph's adjacency matrix at each iteration (footnote 3). The algorithm terminates when the PageRank values of all nodes change by less than a defined error during an iteration. At each superstep, all vertices are active, where every vertex always receives messages from its in-edges and sends messages to all of its out-edges. All messages have fixed size (8 bytes) and the computation complexity on each vertex is linear in the number of messages.

Footnote 3: Formally, each iteration calculates $v^{(k+1)} = c A^{t} v^{(k)} + (1 - c)/|N|$, where $v^{(k)}$ is the eigenvector of iteration $k$, $c$ is the damping factor used for normalization, and $A$ is a row normalized adjacency matrix.

Top-K Ranks in PageRank. It is often desirable to perform graph queries, like using PageRank to find the top k vertex ranks reachable to a vertex after y supersteps. In this case, PageRank runs for x supersteps; at superstep x + 1, each vertex sends its rank to its direct out neighbors, receives other ranks from its in neighbors, and stores the highest received k ranks. At supersteps x + 2 and x + y, each vertex sends the top-K ranks stored locally to its direct out neighbors, and again stores the highest k ranks received from its in neighbors. The message size generated by each vertex can be between [1..k], which depends on the number of ranks stored at the vertex. If the highest k values maintained by a vertex did not change at a superstep, that vertex votes to halt and becomes inactive until it receives messages from other neighbors. The size of the message sent between two vertices varies depending on the actual ranks. We classify this algorithm as non-stationary because of the variability in message size and number of messages sent and received.

Distributed Minimal Spanning Tree (DMST). Implementing the GHS algorithm of Gallager, Humblet and Spira in the Pregel model requires that, given a weighted graph, every vertex classifies the state of its edges as (1) branch (i.e., belongs to the minimal spanning tree (MST)), (2) rejected (i.e., does not belong to the MST), or (3) basic (i.e., unclassified), in each superstep. The algorithm starts with each vertex as a fragment, then joins fragments until there is only one fragment left (i.e., the MST). Each fragment has an ID and level. Each vertex updates its fragment ID and level by exchanging messages with its neighbors. The vertex has two states: active and inactive. There is also a list of auto-activated vertices defined by the user. At the early stage of the algorithm, active vertices send messages over the minimum weighted edges (i.e., the edge that has the minimum weight among the other in-edges and out-edges). As the algorithm progresses, more messages are sent across many edges according to the number of fragments identified during the MST computations at each vertex. At the last superstep, each vertex knows which of its edges belong to the minimum weighted spanning tree. The computational complexity for each vertex is quadratic in the number of incoming messages. We classify this algorithm as non-stationary because the flow of messages between vertices is unpredictable and depends on the state of the edges at each vertex.

Simulating Advertisements on Social Networks. To simulate advertisements on a social network graph, each vertex represents a profile of a person containing a list of his/her interests. A small set of selected vertices are identified as sources and send different advertisements to their direct neighbors at each superstep for a predefined number of supersteps. When a vertex receives an advertisement, it is either forwarded to the vertex's neighbors (based on the vertex's interest matrix) or ignored. This algorithm has a computation complexity that depends on the count of active vertices at a superstep and whether the active vertices communicate with their neighbors or not. It is, thus, a non-stationary algorithm.

We have also implemented and evaluated a number of other stationary and non-stationary algorithms, including diameter estimation and finding weakly connected components. Given that their behavior is consistent with the results in the paper, we omit them for space considerations.

3. Mizan
Mizan is a BSP-based graph processing system that is similar to Pregel, but focuses on efficient dynamic load balancing of both computation and communication across all worker (compute) nodes. Like Pregel, Mizan first reads and partitions the graph data across workers. The system then proceeds as a series of supersteps, each separated by a global synchronization barrier. During each superstep, each vertex processes incoming messages from the previous superstep and sends messages to neighboring vertices (which are processed in the following superstep). Unlike Pregel, Mizan balances its workload by moving selected vertices across workers. Vertex migration is performed when all workers reach a superstep synchronization barrier to avoid violating the computation integrity, isolation and correctness of the BSP compute model.

This section describes two aspects of the design of Mizan related to vertex migration: the distributed runtime monitoring of workload characteristics and a distributed migration planner that decides which vertices to migrate. Other design aspects that are not related to vertex migration (e.g., fault tolerance) follow similar approaches to Pregel, Giraph, GraphLab and GPS; they are omitted for space considerations. Implementation details are described in Section 4.

3.1 Monitoring
Mizan monitors three key metrics for each vertex and maintains a high-level summary of these metrics for each worker node; summaries are broadcast to other workers at the end of every superstep. As shown in Figure 4, the key metrics for every vertex are (1) the number of outgoing messages to other (remote) workers, (2) total incoming messages, and (3)
the response time (execution time) during the current superstep:

Outgoing Messages. Only outgoing messages to other vertices in remote workers are counted since the local outgoing messages are rerouted to the same worker and do not incur any actual network cost.

Incoming Messages. All incoming messages are monitored, both those received from remote vertices and those locally generated. This is because queue size can affect the performance of the vertex (i.e., when its buffer capacity is exhausted, paging to disk is required).

Response Time. The response time for each vertex is measured. It is the time from when a vertex starts processing its incoming messages until it finishes.

Figure 4. The statistics monitored by Mizan. [Diagram: vertices V1 to V6 on Mizan Worker 1 and Mizan Worker 2; labels: vertex response time of Compute(), remote outgoing messages, local incoming messages, remote incoming messages.]

3.2 Migration Planning
High values for any of the three metrics above may indicate poor vertex placement, which leads to workload imbalance. As with many constrained optimization problems, optimizing against three objectives is non-trivial. To reduce the optimization search space, Mizan's migration planner finds the strongest cause of workload imbalance among the three metrics and plans the vertex migration accordingly.

By design, all workers create and execute the migration plan in parallel, without requiring any centralized coordination. The migration planner starts on every worker at the end of each superstep (i.e., when all workers reach the synchronization barrier), after it receives summary statistics (as described in Section 3.1) from all other workers. Additionally, the execution of the migration planner runs between the BSP barrier and a second synchronization barrier (the migration barrier), as shown in Figure 5, which is necessary to ensure correctness of the BSP model. Mizan's migration planner executes the following five steps, summarized in Figure 6:

1. Identify the source of imbalance.
2. Select the migration objective (i.e., optimize for outgoing messages, incoming messages, or response time).
3. Pair overutilized workers with underutilized ones.
4. Select vertices to migrate.
5. Migrate vertices.

Figure 5. Mizan's BSP flow with migration planning. [Diagram: Workers 1 to 3 across Supersteps 1 to 3; each superstep ends with the BSP barrier, followed by the migration planner and the migration barrier.]

Figure 6. Summary of Mizan's Migration Planner. [Flowchart: BSP barrier; superstep imbalanced? (no: migration barrier); identify source of imbalance; worker overutilized? (no: migration barrier); pair with underutilized worker; select vertices and migrate; migration barrier.]

STEP 1: Identify the source of imbalance. Mizan detects imbalances across supersteps by comparing the summary statistics of all workers against a normal random distribution and flagging outliers. Specifically, at the end of a superstep, Mizan computes the z-score (footnote 4) for all workers. If any worker has a z-score greater than $z_{def}$, Mizan's migration planner flags the superstep as imbalanced. We found that $z_{def} = 1.96$, the commonly recommended value, allows for natural workload fluctuations across workers. We have experimented with different values of $z_{def}$ and validated the robustness of our choice.

Footnote 4: The z-score $Wz_i = \frac{|x_i - x_{max}|}{\text{standard deviation}}$, where $x_i$ is the run time of worker $i$.

STEP 2: Select the migration objective. Each worker, identically, uses the summary statistics to compute the correlation between outgoing messages and response time, and also the correlation between incoming messages and response time. The correlation scores are used to select the objective to optimize for: to balance outgoing messages, balance incoming messages, or balance computation time. The default objective is to balance the response time. If the
outgoing or incoming messages are highly correlated with the response time, then Mizan chooses the objective with the highest correlation score. The computation of the correlation score is described in Appendix A.

STEP 3: Pair overutilized workers with underutilized ones. Each overutilized worker that needs to migrate vertices out is paired with a single underutilized worker. While complex pairings are possible, we choose a design that is efficient to execute, especially since the exact number of vertices that different workers plan to migrate is not globally known. Similar to the previous two steps in the migration plan, this step is executed by each worker without explicit global synchronization. Using the summary statistics for the chosen migration objective (in Step 2), each worker creates an ordered list of all workers. For example, if the objective is to balance outgoing messages, then the list will order workers from highest to lowest outgoing messages. The resulting list, thus, places overutilized workers at the top and least utilized workers at the bottom. The pairing function then successively matches workers from opposite ends of the ordered list. As depicted in Figure 7, if the list contains n elements (one for each worker), then the worker at position i is paired with the worker at position n − i. In cases where a worker does not have memory to receive any vertices, the worker is marked unavailable in the list.

Figure 7. Matching senders (workers with vertices to migrate) with receivers using their summary statistics. [Diagram: ordered worker list W7, W2, W1, W5, W8, W4, W0, W6, W9, W3 with list positions starting at 0; workers at opposite ends are matched.]

STEP 4: Select vertices to migrate. The number of vertices to be selected from an overutilized worker depends on the difference of the selected migration objective statistics with its paired worker. Assume that $w_x$ is a worker that needs to migrate out a number of vertices, and is paired with the receiver, $w_y$. The load that should be migrated to the underutilized worker is defined as $\Delta_{xy}$, which equals half the difference in statistics of the migration objective between the two workers. The selection criteria of the vertices depend on the distribution of the statistics of the migration objective, where the statistics of each vertex are compared against a normal distribution. A vertex is selected if it is an outlier, i.e., if its z-score $Vz^{i}_{stat}$ (footnote 5) is greater than $z_{def}$. For example, if the migration objective is to balance the number of remote outgoing messages, vertices with large remote outgoing messages are selected to migrate to the underutilized worker. The sum of the statistics of the selected vertices is denoted by $V_{stat}$, which should minimize $|\Delta_{xy} - V_{stat}|$ to ensure the balance between $w_x$ and $w_y$ in the next superstep. If not enough outlier vertices are found, a random set of vertices is selected to minimize $|\Delta_{xy} - V_{stat}|$.

Footnote 5: The z-score $Vz^{i}_{stat} = \frac{|x_i - x_{avg}|}{\text{standard deviation}}$, where $x_i$ is the statistic of the migration objective of vertex $i$.

STEP 5: Migrate vertices. After the vertex selection process, the migrating workers start sending the selected vertices while other workers wait at the migration barrier. A migrating worker starts sending the selected set of vertices to its unique target worker, where each vertex is encoded into a stream that includes the vertex ID, state, edge information and the received messages it will process. Once a vertex stream is successfully sent, the sending worker deletes the sent vertices so that it does not run them in the next superstep. The receiving worker, on the other hand, receives vertices (together with their messages) and prepares to run them in the next superstep. The next superstep is started once all workers finish migrating vertices and reach the migration barrier. The complexity of the migration process is directly related to the size of vertices being migrated.

4. Implementation
Mizan consists of four modules, shown in Figure 8: the BSP Processor, Storage Manager, Communicator, and Migration Planner. The BSP Processor implements the Pregel APIs, consisting primarily of the Compute class, and the SendMessageTo, GetOutEdgeIterator and getValue methods. The BSP Processor operates on the data structures of the graph and executes the user's algorithm. It also performs barrier synchronization with other workers at the end of each superstep. The Storage Manager module maintains access atomicity and correctness of the graph's data, and maintains the data structures for message queues. Graph data can be read from and written to either HDFS or local disks, depending on how Mizan is deployed. The Communicator module uses MPI to enable communication between workers; it also maintains distributed vertex ownership information. Finally, the Migration Planner operates transparently across superstep barriers to maintain the dynamic workload balance.

Figure 8. Architecture of Mizan. [Diagram: a Mizan worker running the vertex Compute() on top of the BSP Processor, Migration Planner, Communicator (DHT) and Storage Manager, with IO to HDFS/local disks.]

Mizan allows the user's code to manipulate the graph connectivity by adding and removing vertices and edges at any superstep. It also guarantees that all graph mutation commands issued at superstep x are executed at the end of the same superstep and before the BSP barrier, which is illustrated in Figure 5. Therefore, vertex migrations performed by Mizan do not conflict with the user's graph mutations and Mizan always considers the most recent graph structure for migration planning.

When implementing Mizan, we wanted to avoid having a centralized controller. Overall, the BSP (Pregel) model naturally lends itself to a decentralized implementation. There were, however, three key challenges in implementing a distributed control plane that supports fine-grained vertex migration. The first challenge was in maintaining vertex ownership so that vertices can be freely migrated across workers. This is different from existing approaches (e.g., Pregel, Giraph, and GoldenOrb), which operate on a much coarser granularity (clusters of vertices) to enable scalability. The second challenge was in allowing fast updates to the vertex ownership information as vertices get migrated. The third challenge was in minimizing the cost of migrating vertices with large data structures. In this section, we discuss the implementation details around these three challenges, which allow Mizan to achieve its effectiveness and scalability.

4.1 Vertex Ownership
With huge datasets, Mizan workers cannot maintain the management information for all vertices in the graph. Management information includes the collected statistics for each vertex (described in Section 3.1) and the location (ownership) of each vertex. While per-vertex monitoring statistics are only used locally by the worker, vertex ownership information is needed by all workers. When vertices send messages, workers need to know the destination address for each message. With frequent vertex migration, updating the location of the vertices across all workers can easily create a communication bottleneck.

To overcome this challenge, we use a distributed hash table (DHT) to implement a distributed lookup service. The DHT implementation allows Mizan to distribute the overhead of looking up and updating vertex location across all workers. The DHT stores a set of (key, value) pairs, where the key represents a vertex ID and the value represents its current physical location. Each vertex is assigned a home worker. The role of the home worker is to maintain the current location of the vertex. A vertex can physically exist in any worker, including its home worker. The DHT uses a globally defined hash function that maps the keys to their associated home workers, such that home_worker = location_hash(key).

During a superstep, when a (source) vertex sends a message to another (target) vertex, the message is passed to the Communicator. If the target vertex is located on the same worker, it is rerouted back to the appropriate queue. Otherwise, the source worker uses the location_hash function to locate and query the home worker for the target vertex. The home worker responds with the actual physical location of the target vertex. The source worker finally sends the queued message to the current worker for the target vertex. It also caches the physical location to minimize future lookups.

4.2 DHT Updates After Vertex Migration
Figure 9 depicts the vertex migration process. When a vertex v migrates between two workers, the receiving worker sends the new location of v to the home worker of v. The home worker, in turn, sends an update message to all workers that have previously asked for, and thus potentially cached, the location of v. Since Mizan migrates vertices in the barrier between two supersteps, all workers that have cached the location of the migrating vertex will receive the updated physical location from the home worker before the start of the new superstep.

Figure 9. Migrating vertex v_i from Worker 1 to Worker 2, while updating its DHT home worker (Worker 3). [Diagram: Workers 1 to 4, each with BSP Processor, Migration Planner, Communicator (DHT) and Storage Manager; steps: 1) migrate(Vi), 2) inform_DHT(Vi), 3) update_loc(Vi).]

If for any reason a worker did not receive the updated location, the messages will be sent to the last known physical location of v. The receiving worker, which no longer owns the vertex, will simply buffer the incorrectly routed messages, ask for the new location for v, and reroute the messages to the correct worker.

4.3 Migrating Vertices with Large Message Size
Migrating a vertex to another worker requires moving its queued messages and its entire state (which includes its ID, value, and neighbors). Especially when processing large graphs, a vertex can have a significant number of queued messages, which are costly to migrate. To minimize the cost, Mizan migrates large vertices using a delayed migration process that spreads the migration over two supersteps. Instead of physically moving the vertex with its large message queue, delayed migration only moves the vertex's information and the ownership to the new worker. Assuming $w_{old}$ is the old owner and $w_{new}$ is the new one, $w_{old}$ continues to process the migrating vertex v in the next superstep, $SS_{t+1}$. $w_{new}$ receives the messages for v, which will be processed at the following superstep, $SS_{t+2}$. At the end of $SS_{t+1}$, $w_{old}$ sends the new value of v, calculated at $SS_{t+1}$, to $w_{new}$ and completes the delayed migration. Note that migration planning is disabled for superstep $SS_{t+1}$ after applying delayed migration to ensure the consistency of the migration plans.

Figure 10. Delayed vertex migration. [Diagram: vertices 1 to 7 on Worker_old and Worker_new across Superstep t, Superstep t+1 and Superstep t+2.]

An example is shown in Figure 10, where Vertex 7 is migrated using delayed migration. As Vertex 7 migrates, the ownership of Vertex 7 is moved to Worker_new and messages sent to Vertex 7 at $SS_{t+1}$ are addressed to Worker_new. At the barrier of $SS_{t+1}$, Worker_old sends [...]

[...] Web Algorithmics (LAW) [2, 20]. We also generated synthetic datasets using the Kronecker generator that models the structure of real-life networks. The details are shown in Table 1.

G(N, E)           |N|           |E|
kg1               1,048,576     5,360,368
kg4m68m           4,194,304     68,671,566
web-Google        875,713       5,105,039
LiveJournal1      4,847,571     68,993,773
hollywood-2011    2,180,759     228,985,632
arabic-2005       22,744,080    639,999,458

Table 1. Datasets: N, E denote nodes and edges, respectively. Graphs with prefix kg are synthetic.

Figure 11. Comparing Static Mizan vs. Giraph using PageRank on social network and random graphs. [Bar chart: Run Time (Min) for Giraph and Static Mizan on web-Google, kg1, LiveJournal1 and kg4m68m.]

To better isolate the effects of dynamic migration on system performance, we implemented three variations of Mizan: Static, Work Stealing (WS), and Mizan. Static Mizan disables any dynamic migration and uses either hash-based, range-based or min-cuts graph pre-partitioning (rendering it similar to Giraph). Work Stealing (WS) Mizan is our attempt to emulate Pregel's coarse-grained dynamic load balancing behavior. Finally, Mizan is our framework that supports dynamic migration as described in Sections 3 and 4.

We evaluated Mizan across three dimensions: partitioning scheme, graph structure, and algorithm behavior. We partition the datasets using three schemes: hash-based, range-based, and METIS partitioning.
We run our experimentsthe edge information and state of v to Worker new and starts against both social and random graphs. Finally, we experi-SSt+2 with v fully migrated to Worker new. mented with the algorithms described in Section 2.1. With delayed migration, the consistency of computationis maintained without introducing additional network cost 5.1 Giraph vs. Mizanfor vertices with large message queues. Since delayed mi- We picked Giraph for its popularity as an open source Pregelgration uses two supersteps to complete, Mizan may need framework with broad adoption. We only compared Giraphmore steps before it converges on a balanced state. to Static Mizan, which allows us to evaluate our base (non- dynamic) implementation. Mizan’s dynamic migration is5. Evaluation evaluated later in this section. To further equalize Giraph andWe implemented Mizan using C++ and MPI and compared the Static Mizan, we disabled the fault tolerance feature ofit against Giraph , a Pregel clone implemented in Java Giraph to eliminate any additional overhead resulting fromon top of Hadoop. We ran our experiments on a local clus- frequent snapshots during runtime. In this section, we reportter of 21 machines equipped with a mix of i5 and i7 pro- the results using PageRank (a stationary and well balancedcessors with 16GB RAM on each machine. We also used algorithm) using different graph structures.an IBM Blue Gene/P supercomputer with 1024 PowerPC- In Figure 11, we ran both frameworks on social network450 CPUs, each with 4 cores at 850MHz and 4GB RAM. and random graphs. As shown, Static Mizan outperforms Gi-We downloaded publicly available datasets from the Stan- raph by a large margin, up to four times faster than Giraph inford Network Analysis Project6 and from The Laboratory for kg4m68, which contains around 70M edges. We also com- pared both systems when increasing the number of nodes6 http://snap.stanford.edu in random structure graphs. As shown in Figure 12, Static
Mizan consistently outperforms Giraph in all datasets and is up to three times faster with 16 million vertices. While the execution time of both frameworks increases linearly with graph size, the rate of increase (the slope of the graph) for Giraph (0.318) is steeper than for Mizan (0.09), indicating that Mizan also achieves better scalability.

[Figure 12. Comparing Mizan vs. Giraph using PageRank on regular random graphs; the graphs are uniformly distributed, each with around 17M edges. Y-axis: Run Time (Min). X-axis: Vertex count (Millions): 1, 2, 4, 8, 16.]

The experiments in Figures 11 and 12 show that Giraph's implementation is inefficient. It is a non-trivial task to discover the source of inefficiency in Giraph since it is tightly coupled with Hadoop. We suspect that part of the inefficiency is due to the initialization cost of the Hadoop jobs and the high overhead of communication. Other factors, like internal data structure choice and memory footprint, might also play a role in this inefficiency.

5.2 Effectiveness of Dynamic Vertex Migration

Given the large performance difference between Static Mizan and Giraph, we exclude Giraph from further experiments and focus on isolating the effects of dynamic migration on the overall performance of the system.

Figure 13 shows the results for running PageRank on a social graph with various partitioning schemes. We notice that both hash-based and METIS partitioning achieve a balanced workload for PageRank such that dynamic migration did not improve the results. In comparison, range-based partitioning resulted in poor graph partitioning. In this case, we observe that Mizan (with dynamic partitioning) was able to reduce execution time by approximately 40% when compared to the static version. We have also evaluated the diameter estimation algorithm, which behaves the same as PageRank, but exchanges larger size messages. Mizan exhibited similar behavior with diameter estimation; the results are omitted for space considerations.

[Figure 13. Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime while the unshaded parts represent the initial partitioning cost of the input graph. Y-axis: Run Time (Min). Groups: Hash, Range, Metis.]

Figure 14 shows how Mizan's dynamic migration was able to optimize running PageRank starting with range-based partitioning. The figure shows that Mizan's migration reduced both the variance in workers' runtime and the superstep's runtime. The same figure also shows that Mizan's migration alternated the optimization objective between the number of outgoing messages and vertex response time, illustrated as points on the superstep's runtime trend.

By looking at both Figures 14 and 15, we observe that the convergence of Mizan's dynamic migration is correlated with the algorithm's runtime reduction. For the PageRank algorithm with range-based partitioning, Mizan requires 13 supersteps to reach an acceptable balanced workload. Since PageRank is a stationary algorithm, Mizan's migration converged quickly; we expect that it would require more supersteps to converge on other algorithms and datasets. In general, Mizan requires multiple supersteps before it balances the workload. This also explains why running Mizan with range-based partitioning is less efficient than running Mizan with METIS or hash-based partitioning.

As described in Section 2.1, Top-K PageRank adds variability in the communication among the graph nodes. As shown in Figure 16, such variability in the messages exchanged leads to minor variation in both hash-based and METIS execution times. Similarly, in range partitioning, Mizan had better performance than the static version. The slight performance improvement arises from the fact that the base algorithm (PageRank) dominates the execution time. If a large number of queries are performed, the improvements will be more significant.

To study the effect of algorithms with highly variable messaging patterns, we evaluated Mizan using two algorithms: DMST and advertisement propagation simulation. In both cases, we used METIS to pre-partition the graph data. METIS partitioning groups the strongly connected subgraphs into clusters, thus minimizing the global communication among each cluster.

In DMST, as discussed in Section 2.1, computation complexity increases with vertex connectivity degree. Because
of the quadratic complexity of computation as a function of connectivity degree, some workers will suffer from extensive workload while others will have light workload. Such imbalance in the workload leads to the results shown in Figure 17. Mizan was able to reduce the imbalance in the workload, resulting in a large drop in execution time (two orders of magnitude improvement). Even when using our version of Pregel's load balancing approach (called work stealing), Mizan is roughly eight times faster.

Similar to DMST, the behavior of the advertising propagation simulation algorithm varies across supersteps. In this algorithm, a dynamic set of graph vertices communicate heavily with each other, while others have little or no communication. In every superstep, the communication behavior differs depending on the state of the vertex. Therefore, it creates an imbalance across the workers for every superstep. Even METIS partitioning is ill-suited in such a case, since the workers' load changes dynamically at runtime. As shown in Figure 17, similar to DMST, Mizan is able to reduce such an imbalance, resulting in approximately 200% speedup when compared to the work stealing and static versions.

[Figure 14. Comparing Mizan's migration cost with the superstep's (or the algorithm's) runtime at each superstep using PageRank on a social graph (LiveJournal1) starting with range-based partitioning. The points on the superstep's runtime trend represent Mizan's migration objective at that specific superstep. Y-axis: Run Time (Minutes). X-axis: Supersteps. Series: Superstep Runtime, Average Workers Runtime, Migration Cost, Balance Outgoing Messages, Balance Incoming Messages, Balance Response Time.]

[Figure 15. Comparing both the total migrated vertices and the maximum migrated vertices by a single worker for PageRank on LiveJournal1 starting with range-based partitioning. The migration cost at each superstep is also shown. Axes: Migrated Vertices (x1000) and Migration Cost (Minutes) vs. Supersteps.]

[Figure 16. Comparing Static Mizan and Work Stealing (Pregel clone) vs. Mizan using Top-K PageRank on a social graph (LiveJournal1). The shaded part of each column represents the algorithm runtime while the unshaded parts represent the initial partitioning cost of the input graph. Y-axis: Run Time (Min). Groups: Hash, Range, Metis.]

[Figure 17. Comparing Work Stealing (Pregel clone) vs. Mizan using DMST and Propagation Simulation on a METIS-partitioned social graph (LiveJournal1). Y-axis: Run Time (Min). Series: Static, Work Stealing, Mizan.]

Total runtime (s)                       707
Data read time to memory (s)            72
Migrated vertices                       1,062,559
Total migration time (s)                75
Average migrate time per vertex (µs)    70.5

Table 2. Overhead of Mizan's migration process when compared to the total runtime using range partitioning on the LiveJournal1 graph
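The per-vertex figure in Table 2 can be cross-checked from the other rows. The short Python sketch below is only an illustration: the function names are hypothetical and the inputs are the values printed in Table 2. Dividing the total migration time by the number of migrated vertices gives roughly 70.6 µs, in line with the reported 70.5 µs. The share of the 707 s total runtime comes out near 10.6%; which baseline the authors used for their own percentage is not stated, so that ratio is an assumption of this sketch.

```python
# Values copied from Table 2 (PageRank, range partitioning, LiveJournal1).
TOTAL_RUNTIME_S = 707
MIGRATED_VERTICES = 1_062_559
TOTAL_MIGRATION_S = 75


def avg_migration_cost_us(total_migration_s, migrated_vertices):
    """Average migration time per vertex, in microseconds."""
    return total_migration_s / migrated_vertices * 1e6


def migration_share(total_migration_s, total_runtime_s):
    """Fraction of the total runtime spent migrating (assumed baseline: total runtime)."""
    return total_migration_s / total_runtime_s


per_vertex_us = avg_migration_cost_us(TOTAL_MIGRATION_S, MIGRATED_VERTICES)
share = migration_share(TOTAL_MIGRATION_S, TOTAL_RUNTIME_S)
print(f"{per_vertex_us:.1f} us per vertex, {share:.1%} of total runtime")
```

The same two helpers can be reused with the per-superstep numbers behind Figures 14 and 15, if those are available, to see how the migration share changes as the workload converges.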
5.3 Overhead of Vertex Migration

To analyze migration cost, we measured the time for various performance metrics of Mizan. We used the PageRank algorithm with a range-based partitioning of the LiveJournal1 dataset on 21 workers. We chose range-based partitioning as it provides the worst data distribution according to previous experiments and therefore will trigger frequent vertex migrations to balance the workload.

Table 2 reports the average cost of migrating as 70.5 µs per vertex. In the LiveJournal1 dataset, Mizan paid a 9% penalty of the total runtime to balance the workload, transferring over 1M vertices. As shown earlier in Figure 13, this resulted in a 40% saving in computation time when compared to Static Mizan. Moreover, Figures 14 and 15 compare the algorithm runtime and the migration cost at each superstep; the migration cost is at most 13% (at superstep 2) and on average 6% for all supersteps that included a migration phase.

5.4 Scalability of Mizan

We tested the scalability of Mizan on the Linux cluster as shown in Table 3. We used two compute nodes as our base reference as a single node was too small when running the dataset (hollywood-2011), causing significant paging activities. As Figure 18 shows, Mizan scales linearly with the number of workers.

Linux Cluster (hollywood-2011)      Blue Gene/P (arabic-2005)
Processors    Runtime (m)           Processors    Runtime (m)
2             154                   64            144.7
4             79.1                  128           74.6
8             40.4                  256           37.9
16            21.5                  512           21.5
                                    1024          17.5

Table 3. Scalability of Mizan on a Linux Cluster of 16 machines (hollywood-2011 dataset), and an IBM Blue Gene/P supercomputer (arabic-2005 dataset).

[Figure 18. Speedup on Linux Cluster of 16 machines using PageRank on the hollywood-2011 dataset. Y-axis: Speedup. X-axis: Compute Nodes.]

We were interested in performing large scale-out experiments, well beyond what can be achieved on public clouds. Since Mizan was written in C++ and uses MPI for message passing, it was easily ported to IBM's Blue Gene/P supercomputer. Once ported, we natively ran Mizan on 1024 Blue Gene/P compute nodes. The results are shown in Table 3. We ran the PageRank algorithm using a huge graph (arabic-2005) that contains 639M edges. As shown in Figure 19, Mizan scales linearly from 64 to 512 compute nodes then starts to flatten out as we increase to 1024 compute nodes. The flattening was expected since with an increased number of cores, compute nodes will spend more time communicating than computing. We expect that as we continue to increase the number of CPUs, most of the time will be spent communicating (which effectively breaks the BSP model of overlapping communication and computation).

[Figure 19. Speedup on Shaheen IBM Blue Gene/P supercomputer using PageRank on the arabic-2005 dataset. Y-axis: Speedup. X-axis: Compute Nodes (64, 128, 256, 512, 1024).]

6. Related Work

In the past few years, a number of large scale graph processing systems have been proposed. A common thread across the majority of existing work is how to partition graph data and how to balance computation across compute nodes. This problem is known as applying dynamic load balancing to the distributed graph data by utilizing on-the-fly graph partitioning, which Hendrickson and Devine [11] assert to be either too expensive or hard to parallelize. Mizan follows the Pregel model, but focuses on dynamic load balancing of vertices. In this section, we examine the design aspects of existing work for achieving a balanced workload.

Pregel and its Clones. The default partitioning scheme for the input graph used by Pregel [23] is hash based. In every superstep, Pregel assigns more than one subgraph (partition) to a worker. While this load balancing approach helps in balancing computation in any superstep, it is coarse-grained
and reacts slowly to large imbalances in the initial partitioning or large behavior changes. Giraph, an open source Pregel clone, supports both hash-based and range-based partitioning schemes; the user can also implement a different graph partitioning scheme. Giraph balances the partitions across the workers based on the number of edges or vertices. GoldenOrb [9] uses HDFS as its storage system and employs hash-based partitioning. An optimization that was introduced in [12] uses METIS partitioning, instead of GoldenOrb's hash-based partitioning, to maintain a balanced workload among the partitions. Additionally, its distributed data storage supports duplication of the frequently accessed vertices. Similarly, Surfer [4] utilizes min-cut graph partitioning to improve the performance of distributed vertex-centric graph processing. However, Surfer focuses more on providing a bandwidth-aware variant of the multilevel graph partitioning algorithm (integrated into a Pregel implementation) to improve the performance of graph partitioning on the cloud. Unlike Mizan, these systems optimize for neither dynamic algorithm behavior nor message traffic.

Power-law Optimized Graph Processing Systems. Stanford GPS (GPS) [27] has three additional features over Pregel. First, GPS extends Pregel's API to handle graph algorithms that perform both vertex-centric and global computations. Second, GPS supports dynamic repartitioning based on the outgoing communication. Third, GPS provides an optimization scheme called large adjacency list partitioning, where the adjacency lists of high degree vertices are partitioned across workers. GPS only monitors outgoing messages, while Mizan dynamically selects the most appropriate metric based on changes in runtime characteristics. PowerGraph [10], on the other hand, is a distributed graph system that overcomes the challenges of processing natural graphs with extreme power-law distributions. PowerGraph introduces vertex-based partitioning with replication to distribute the graph's data based on its edges. It also provides a programming abstraction (based on gather, apply and scatter operators) to distribute the user's algorithm workload on their graph processing system. The authors in [10] have focused on evaluating stationary algorithms on power-law graphs. It is, thus, unclear how their system performs across the broader class of algorithms (e.g., non-stationary) and graphs (e.g., non-power-law). In designing Mizan, we did not make any assumptions on either the type of algorithm or shape of input graph. Nonetheless, PowerGraph's replication approach is complementary to Mizan's dynamic load balancing techniques. They can be combined to improve the robustness of the underlying graph processing system.

Shared Memory Graph Processing Systems. GraphLab [22] is a parallel framework (that uses distributed shared memory) for efficient machine learning algorithms. The authors extended their original multi-core GraphLab implementations to a cloud-based distributed GraphLab, where they apply a two-phase partitioning scheme that minimizes the edges among graph partitions and allows for a fast graph repartitioning. Unlike BSP frameworks, computation in GraphLab is executed on the most recently available data. GraphLab ensures that computations are sequentially consistent through a derived consistency model to guarantee the end result data will be consistent as well. HipG [18] is designed to operate on distributed memory machines while providing transparent memory sharing to support hierarchical parallel graph algorithms. HipG also supports processing graph algorithms following the BSP model. But unlike the global synchronous barriers of Pregel, HipG applies more fine-grained barriers, called synchronizers, during algorithm execution. The user has to be involved in writing the synchronizers; thus, it requires additional complexity in the users' code. GraphChi [19] is a disk-based graph processing system based on GraphLab and designed to run on a single machine with limited memory. It requires a small number of non-sequential disk accesses to process the subgraphs from disk, while allowing for asynchronous iterative computations. Unlike Mizan, these systems do not support any dynamic load balancing techniques.

Specialized Graph Systems. Kineograph [5] is a distributed system that targets the processing of fast changing graphs. It captures the relationships in the input data, reflects them on the graph structure, and creates regular snapshots of the data. Data consistency is ensured in Kineograph by separating graph processing and graph update routines. This system is mainly developed to support algorithms that mine the incremental changes in the graph structure over a series of graph snapshots. The user can also run an offline iterative analysis over a copy of the graph, similar to Pregel. The Little Engine(s) [25] is also a distributed system designed to scale with online social networks. This system utilizes graph repartitioning with one-hop replication to rebalance any changes to the graph structure, localize graph data, and reduce the network cost associated with query processing. The Little Engine(s) repartitioning algorithm aims to find min-cuts in a graph under the condition of minimizing the replication overhead at each worker. This system works as a middle-box between an application and its database systems; it is not intended for offline analytics or batch data processing like Pregel or MapReduce. Trinity [29] is a distributed in-memory key-value storage based graph processing system that supports both offline analytics and online query answering. Trinity provides a computing model similar to Pregel's, but more restrictive, for offline graph analytics. The key advantage of Trinity is its low latency graph storage design, where the data of each vertex is stored in a binary format to eliminate decoding and encoding costs. While these systems address specific design challenges in processing large graphs, their focus is orthogonal to the design of Mizan.
7. Future Work

We are planning on making Mizan an open source project. We will continue to work on a number of open problems as outlined in this section. Our goal is to provide a robust and adaptive platform that frees developers from worrying about the structure of the data or runtime variability of their algorithms.

Although Mizan is a graph-agnostic framework, its performance can degrade when processing extremely skewed graphs due to frequent migration of highly connected vertices. We believe that vertex replication, as proposed by PowerGraph [10], offers a good solution, but mandates the implementation of custom combiners. We plan to further reduce frequent migrations in Mizan by leveraging vertex replication, however, without requiring any custom combiners.

So far, dynamic migration in Mizan optimizes for workloads generated by a single application or algorithm. We are interested in investigating how to efficiently run multiple algorithms on the same graph. This would better position Mizan to be offered as a service for very large datasets.

8. Conclusion

In this paper, we presented Mizan, a Pregel system that uses fine-grained vertex migration to load balance computation and communication across supersteps. Using distributed measurements of the performance characteristics of all vertices, Mizan identifies the cause of workload imbalance and constructs a vertex migration plan without requiring centralized coordination. Using a representative number of datasets and algorithms, we have shown both the efficacy and robustness of our design against varying workload conditions. We have also shown the linear scalability of Mizan, scaling to 1024 CPUs.

A. Computing the Correlation Score

In Mizan, the correlation scores between the workers' response time and their summary statistics are calculated using two tests: clustering and sorting. In the clustering test $CT$, elements from the set of worker response times $S_{Time}$ and the statistics set $S_{Stat}$ (either incoming or outgoing message statistics) are divided into two categories based on their z-score (footnote 7): positive and negative clusters. Elements from a set $S$ are categorized as positive if their z-scores are greater than some $z_{def}$. Elements categorized as negative have a z-score less than $-z_{def}$. Elements in the range $(-z_{def}, z_{def})$ are ignored. The score of the cluster test is calculated based on the total count of elements that belong to the same worker in the positive and negative sets, which is shown by the following equation:

$$CT_{score} = \frac{count(S^{+}_{Time} \cap S^{+}_{Stat}) + count(S^{-}_{Time} \cap S^{-}_{Stat})}{count(S_{Time}) + count(S_{Stat})}$$

The sorting test $ST$ consists of sorting both sets, $S_{Time}$ and $S_{Stat}$, and calculating the z-scores for all elements. Elements with a z-score in the range $(-z_{def}, z_{def})$ are excluded from both sorted sets. For the two sorted sets $S_{Time}$ and $S_{Stat}$ of size $n$ and $i \in [0..n-1]$, a mismatch is defined if the $i$-th elements from both sets are related to different workers. That is, $S^{i}_{Time}$ is the $i$-th element from $S_{Time}$, $S^{i}_{Stat}$ is the $i$-th element from $S_{Stat}$, and $S^{i}_{Time} \in Worker_x$ while $S^{i}_{Stat} \notin Worker_x$. A match, on the other hand, is defined if $S^{i}_{Time} \in Worker_x$ and $S^{i}_{Stat} \in Worker_x$. The score of the sorting test is given by:

$$ST_{score} = \frac{count(matches)}{count(matches) + count(mismatches)}$$

The sets $S_{Time}$ and $S_{Stat}$ are reported as strongly correlated if $(CT + ST)/2 \geq 1$, as weakly correlated if $1 > (CT + ST)/2 \geq 0.5$, and as uncorrelated otherwise.

7 The z-score of $S^{i}_{Time}$ and $S^{i}_{Stat}$ is $\frac{|x_i - x_{avg}|}{\text{standard deviation}}$, where $x_i$ is the run time of worker $i$ or its statistics.

References

[1] H. Balakrishnan, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica. Looking Up Data in P2P Systems. Communications of the ACM, 46(2):43–48, 2003.
[2] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A Scalable Fully Distributed Web Crawler. Software: Practice & Experience, 34(8):711–726, 2004.
[3] D. Buntinas. Designing a Common Communication Subsystem. In Proceedings of the 12th European Parallel Virtual Machine and Message Passing Interface Conference (Euro PVM MPI), pages 156–166, Sorrento, Italy, 2005.
[4] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li. Improving Large Graph Processing on Partitioned Graphs in the Cloud. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SOCC), pages 3:1–3:13, San Jose, California, USA, 2012.
[5] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: Taking the Pulse of a Fast-Changing and Connected World. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys), pages 85–98, Bern, Switzerland, 2012.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 137–150, San Francisco, California, USA, 2004.
[7] R. G. Gallager, P. A. Humblet, and P. M. Spira. A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1):66–77, 1983.
[8] Giraph. Apache Incubator Giraph. http://incubator.apache.org/giraph, 2012.
[9] GoldenOrb. A Cloud-based Open Source Project for Massive-Scale Graph Analysis. http://goldenorbos.org/, 2012.
[10] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 17–30, Hollywood, California, USA, 2012.
[11] B. Hendrickson and K. Devine. Dynamic Load Balancing in Computational Mechanics. Computer Methods in Applied Mechanics and Engineering, 184(2–4):485–500, 2000.
[12] L.-Y. Ho, J.-J. Wu, and P. Liu. Distributed Graph Database for Large-Scale Social Computing. In Proceedings of the IEEE 5th International Conference in Cloud Computing (CLOUD), pages 455–462, Honolulu, Hawaii, USA, 2012.
[13] U. Kang, C. E. Tsourakakis, and C. Faloutsos. PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations. In Proceedings of the IEEE 9th International Conference on Data Mining (ICDM), pages 229–238, Miami, Florida, USA, 2009.
[14] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):8:1–8:24, 2011.
[15] G. Karypis and V. Kumar. Parallel Multilevel K-Way Partitioning Scheme for Irregular Graphs. In Proceedings of the ACM/IEEE Conference on Supercomputing (CDROM), pages 278–300, Pittsburgh, Pennsylvania, USA.
[16] G. Karypis and V. Kumar. Multilevel k-way Partitioning Scheme for Irregular Graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
[17] Z. Khayyat, K. Awara, H. Jamjoom, and P. Kalnis. Mizan: Optimizing Graph Mining in Large Parallel Systems. Technical report, King Abdullah University of Science and Technology, 2012.
[18] E. Krepska, T. Kielmann, W. Fokkink, and H. Bal. HipG: Parallel Processing of Large-Scale Graphs. ACM SIGOPS Operating Systems Review, 45(2):3–13, 2011.
[19] A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-Scale Graph computation on Just a PC. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 31–46, Hollywood, California, USA, 2012.
[20] LAW. The Laboratory for Web Algorithmics. http://law.di.unimi.it/index.php, 2012.
[21] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Research, 11:985–1042, 2010.
[22] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proceedings of the VLDB Endowment (PVLDB), 5(8):716–727, 2012.
[23] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-Scale Graph Processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 135–146, Indianapolis, Indiana, USA.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. http://ilpubs.stanford.edu:8090/422/, 2001.
[25] J. M. Pujol, V. Erramilli, G. Siganos, X. Yang, N. Laoutaris, P. Chhabra, and P. Rodriguez. The Little Engine(s) That Could: Scaling Online Social Networks. IEEE/ACM Transactions on Networking, 20(4):1162–1175, 2012.
[26] D. Rees. Foundations of Statistics. Chapman and Hall, London New York, 1987. ISBN 0412285606.
[27] S. Salihoglu and J. Widom. GPS: A Graph Processing System. Technical report, Stanford University, 2012.
[28] S. Seo, E. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. HAMA: An Efficient Matrix Computation with the MapReduce Framework. In Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), pages 721–726, Athens, Greece.
[29] B. Shao, H. Wang, and Y. Li. Trinity: A Distributed Graph Engine on a Memory Cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, New York, USA.
[30] L. G. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33:103–111, 1990.
[31] W. Xue, J. Shi, and B. Yang. X-RIME: Cloud-Based Large Scale Social Network Analysis. In Proceedings of the 2010 IEEE International Conference on Services Computing (SCC), pages 506–513, Shanghai, China.
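
As a supplementary illustration of the correlation score defined in Appendix A, the following Python sketch computes the clustering test, the sorting test, and the final classification for two per-worker measurements. It is a sketch under stated assumptions, not Mizan's implementation: all function names and the default z_def = 0.5 are hypothetical; the z-score is taken as signed, (x - mean) / standard deviation, since negative clusters require a sign; each worker found in the same cluster of both sets is counted once per set; and the denominators are taken over the elements that remain after the ignored range is dropped, so that perfect agreement scores 1.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Signed z-score per worker (assumption: signed, population std dev)."""
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return {w: 0.0 for w in values}
    return {w: (x - mu) / sigma for w, x in values.items()}


def clusters(values, z_def):
    """Workers with z > z_def (positive) and z < -z_def (negative)."""
    z = z_scores(values)
    pos = {w for w, s in z.items() if s > z_def}
    neg = {w for w, s in z.items() if s < -z_def}
    return pos, neg


def clustering_score(times, stats, z_def):
    """CT: share of retained elements whose worker is in the same cluster of both sets."""
    pos_t, neg_t = clusters(times, z_def)
    pos_s, neg_s = clusters(stats, z_def)
    retained = len(pos_t) + len(neg_t) + len(pos_s) + len(neg_s)
    if retained == 0:
        return 0.0
    # Assumption: a shared worker contributes one element from each set.
    shared = len(pos_t & pos_s) + len(neg_t & neg_s)
    return 2 * shared / retained


def sorting_score(times, stats, z_def):
    """ST: matches / (matches + mismatches) over the position-wise sorted workers."""
    pos_t, neg_t = clusters(times, z_def)
    pos_s, neg_s = clusters(stats, z_def)
    order_t = sorted(pos_t | neg_t, key=lambda w: (times[w], w))
    order_s = sorted(pos_s | neg_s, key=lambda w: (stats[w], w))
    n = max(len(order_t), len(order_s))  # unpaired positions count as mismatches
    if n == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(order_t, order_s))
    return matches / n


def classify(times, stats, z_def=0.5):
    """Apply the thresholds of Appendix A to the average of CT and ST."""
    score = (clustering_score(times, stats, z_def) + sorting_score(times, stats, z_def)) / 2
    if score >= 1:
        return "strongly correlated"
    if score >= 0.5:
        return "weakly correlated"
    return "uncorrelated"


# Hypothetical per-worker measurements: response times and outgoing message counts.
times = {"w0": 1.0, "w1": 2.0, "w2": 3.0, "w3": 10.0}
outgoing = {"w0": 100, "w1": 200, "w2": 300, "w3": 1000}
reversed_stats = {"w0": 1000, "w1": 300, "w2": 200, "w3": 100}
print(classify(times, outgoing))
print(classify(times, reversed_stats))
```

In this toy input, the outgoing message counts are proportional to the response times, so both tests agree on every retained worker, while the reversed statistics share no cluster or sorted position with the response times.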