HyPR: Hybrid Page Ranking on Evolving Graphs

Hemant Kumar Giri, Mridul Haque, Dip Sankar Banerjee
Department of Computer Science and Engineering, Indian Institute of Information Technology Guwahati, Bongora, Guwahati 781015, Assam, India.
Email: girihemant19@gmail.com, {mridul.haque,dipsankarb}@iiitg.ac.in

Abstract—PageRank (PR) is the standard metric used by the Google search engine to compute the importance of a web page by modeling the entire web as a first-order Markov chain. The challenge of computing PR efficiently and quickly has been addressed by several prior works, which have introduced innovations both in algorithms and in the use of parallel computing. The standard method of computing PR models the web as a graph. The fast-growing internet adds several new web pages every day, and hence more nodes (representing the web pages) and edges (the hyperlinks) are added to this graph in an incremental fashion. Computing PR on this evolving graph is an emerging challenge, since computation from scratch on the massive graph is time consuming and unscalable. In this work, we propose Hybrid Page Rank (HyPR), which computes PR on evolving graphs using collaborative executions on multi-core CPUs and massively parallel GPUs. We exploit data parallelism by efficiently partitioning the graph into regions that are affected and unaffected by the new updates. The different partitions are then processed in an overlapped manner for PR updates. The novelty of our technique lies in utilizing the hybrid platform to scale the solution to massive graphs. The technique also provides high performance through parallel processing of every batch of updates using a parallel algorithm. HyPR executes efficiently on an NVIDIA V100 GPU hosted on a 6th Gen Intel Xeon CPU and is able to update a graph with 640M edges with a single batch of 100,000 edges in 12 ms.
HyPR outperforms other state-of-the-art techniques for computing PR on evolving graphs [1] by 4.8x. Additionally, HyPR provides a 1.2x speedup over GPU-only executions and a 95x speedup over CPU-only parallel executions.

Index Terms—Heterogeneous Computing, PageRank, CPU+GPU, Dynamic graphs.

I. INTRODUCTION

Link analysis is a popular technique for mining meaningful information from real-world graphs. There are a variety of knowledge models that are typically employed by different applications. The knowledge models encode structural information and rich relationships between the different entities, which help in the extraction of critical information. One such model is the Hubs and Authorities [2] model, which essentially proves that a web graph has multiple bipartite cores. Google [3], on the other hand, models the web graph as a first-order Markov chain which captures a user's browsing patterns on the web. This model is used by Google to generate rankings of the different web pages; it forms the core concept of the PageRank (PR) algorithm and is used in Google search. The page ranking method models a hyperlink from one page to another as an endorsement of the destination page by the source. With the growth of the internet, computing PR on the entire web is a challenging task, given that the graph is massive. Additionally, the graph is evolving (also called dynamic) in nature and will have several new pages and hyperlinks generated every day, which leads to the additional challenge of computing new page ranks at regular intervals. While the simplest solution is to compute the page ranks from scratch every time, it is not feasible for any realistic use case. Hence it is necessary to investigate methods for computing new page ranks on evolving graphs that do not require computing PR from scratch. Parallel PR [4–6] has found widespread success in computing PR scores quickly.
With the advent of processors such as Graphics Processing Units (GPUs) and modern multi-core CPUs, parallel solutions for PR have found further widespread success. Given the capacity for massive parallelism of GPUs, good programmability, and strong community support, GPUs have been able to provide sub-second performance in PR computations on massive real-world graphs. In computing PR scores on evolving (or dynamic) graphs too, GPUs have found good success in the past [1, 7]. Computing dynamic PR presents an ideal use case for heterogeneous computation, where the nature of the computation calls for techniques that can provide parallelism in a hierarchical manner. While some parallel work can be done at coarser degrees of granularity, the final page ranking scores need to be computed at a much finer granularity. In this paper, we present HyPR¹ (pronounced hy-per), a method for computing PR on dynamic graphs which utilizes both CPUs and GPUs for fast computation of the new scores. Our method depends on two broad phases of computation. In the first phase, we identify the partition of the graph that will be affected by the new set of edges being inserted. In the second phase, the actual PR computation is carried out. While the steps are mutually exclusive, performing both steps on a GPU would lead to some sequential computation. We therefore propose a heterogeneous method which provides higher degrees of parallelism across the two phases and hence leads to significant performance benefits. The concrete motivations and contributions of our paper are as follows: 1) Parallel partitioning: As pointed out, our technique depends on identifying portions of the graph that are affected and unaffected by the newer batch of edges that update the graph. We present a method that can perform the partitioning in parallel while the current set of updates is being incorporated into the existing graph.
¹Code available at: https://github.com/girihemant19/HyPR
This reduces irregular accesses on the GPU to a large extent and boosts performance. 2) Parallel updates with a parallel algorithm: To extract the maximum data parallelism (and hence performance) available on a GPU, we propose a technique that can process a batch of updates in parallel to compute new PR scores using a parallel algorithm which runs on the GPU. The step that makes the parallel update possible is a pre-processing step carried out on the CPU. To the best of our knowledge, this has not been proposed earlier. 3) Scalability: This is one of the main motivations behind the adoption of a hybrid solution. While the updates that happen in a batch-parallel manner may be small, the whole graph, even when represented using the most space-efficient data structures, consumes the largest amount of main memory. We show that a hybrid approach does not necessitate storing the whole graph in the limited GPU memory. We propose to create minimal working sets that can effectively assimilate a new batch of updates, constrained only when the minimal working sets exceed the GPU memory. The much larger host memory holds the full graph as auxiliary storage. 4) Benchmarking: We thoroughly evaluate our technique on several real-world graphs with up to 640M edges (limited by the GPU memory). We perform experiments to show that our technique outperforms the state-of-the-art dynamic PR techniques by 4.8x, providing PR scores in 12 ms.

The rest of the paper is organized as follows. We provide a background on the fundamental ideas underlying our work in Section II. The related works are discussed in Section V. We provide a detailed methodology of our HyPR technique in Section III. This is followed by the evaluations, where we discuss and analyze the results obtained in Section IV. Finally, we draw some brief conclusions of our work and discuss possible directions for extension in Section VI.

II. BACKGROUND

In this section we briefly discuss the algorithmic preliminaries on which HyPR depends.

A. Baseline PageRank

PR is in general computed on a graph G(V, E, w), where V (|V| = n) is the set of nodes representing the web pages, E (|E| = m) is the set of edges representing the hyperlinks, and w is the set of edge weights representing the contribution of a node to the node it connects to via an edge. Using the random-surfer model, which accounts for a random internet user landing at a particular page, the PR value of a node u is given as:

pr(u) = Σ_{v ∈ I(u)} c(v → u) + d/n    (1)

In Equation 1, d (taken as 0.15 conventionally) represents the damping factor, which is the probability of the random surfer staying at a particular page, and I(u) is the set of incoming neighbors of u. The function c() is the contribution. Taking the magnitude of the contribution to be the ratio pr(v)/out_degree(v) times (1 − d), the contribution in Equation 1 can be written as:

c(v → u) = (1 − d) · pr(v)/out_degree(v)    (2)

In Equation 2, the PR computations happen in a cyclic manner, as two different nodes can make contributions to each other. The PR values are computed by iteratively recomputing the values of all nodes until they converge, i.e., until there is very little update to the PR scores after a particular iteration. This style of computing the PR scores is popularly done through power iterations [3]. The genesis of power iterations arises from the idea of Markov chains reaching a steady state when starting from an initial distribution. Here, the initial PR values compose the starting states of every node, which then go through several transitions to reach a stable state. The PR values of all the nodes can be arranged to form the state matrix, which has a set of eigenvalues. The eigenvectors of this state matrix are solved for using power iterations.
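As a concrete illustration, the power-iteration scheme above can be sketched in plain Python. This is a sequential sketch with illustrative names (`parallel_pr` is not the HyPR code); in HyPR the per-node loop body runs as one GPU thread per node. It follows the paper's convention that d (= 0.15) weights the teleport term d/n and each neighbor contribution is scaled by (1 − d).

```python
# Sketch of baseline power-iteration PageRank (Equations 1 and 2).
def parallel_pr(nodes, out_deg, in_nbrs, d=0.15, gamma=1e-10):
    """nodes: list of node ids; out_deg: dict node -> out-degree;
    in_nbrs: dict node -> list of incoming neighbours."""
    n = len(nodes)
    prev = {u: d / n for u in nodes}      # initial distribution
    err = float("inf")
    while err > gamma:
        # In HyPR this loop runs with one GPU thread per node.
        p = {}
        for u in nodes:
            p[u] = d / n                  # teleport term of Eq. 1
            for x in in_nbrs.get(u, []):  # contributions of Eq. 2
                p[u] += (1 - d) * prev[x] / out_deg[x]
        err = max(abs(prev[u] - p[u]) for u in nodes)
        prev = p
    return prev

# Tiny 3-node cycle: by symmetry every node converges to 1/3.
scores = parallel_pr([0, 1, 2],
                     out_deg={0: 1, 1: 1, 2: 1},
                     in_nbrs={0: [2], 1: [0], 2: [1]})
```

On this symmetric cycle the iteration contracts geometrically with ratio (1 − d), so convergence to the 10⁻¹⁰ threshold takes roughly 140 iterations.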
By the intrinsic property of Markov chains, if the state matrix is stochastic, aperiodic, and irreducible, then it will converge to a stationary set of vectors, which are the final PR scores. A basic parallel implementation of PR is shown in Algorithm 1.

B. Dynamic PR calculations

The seed idea for HyPR stems from the batch-dynamic graph connectivity algorithm proposed by Acar et al. [8]. Due to the iterative nature of PR, if the PR scores of all the incoming neighbors of a node u converge in a particular iteration, then the score of u will also converge in the immediately following iteration. This property of PR allows the decomposition of a directed graph into a set of connected components (CCs) which can be processed in parallel [4]. Since maintaining the CCs of a graph in a dynamic setting is equivalent to maintaining the connectivity of the graph, we can perform a batch update of size B in parallel in O(lg n + lg(1 + n/B)) work per edge. A batch here refers to a set of updates which can be either insertions into the existing graph or deletions. Each entry in a batch is arranged as a tuple (ti, ui, vi, wi, oi), where ti is a time stamp, (ui, vi) is the edge, wi is the weight associated with the edge, and oi is the update type (insert/delete). We assume that a batch i has at most Bi edges. We examine the impact of this batch size on performance later in Section IV. As stated, the problem of computing the PR scores for an incrementally growing graph can in essence be treated as the computation of two CCs, where one contains the set of nodes that are affected by the batch of updates and the other does not. The same treatment of partitioning the graph for incrementally computing PR scores was also used by Desikan et al. [2], where the authors proposed scaling the unchanged nodes first, followed by the PR computations of the
changed nodes. Consider a graph G(V, E, w) with |V| = n being the order of the graph and Σ_{i=1}^{n} w_i = W being the sum of the node weights. Every node is initialized with a PR score of w_i/W. Consider a particular node s in the existing graph. Its PR value can be expressed as:

PR(s) = d(w_s/W) + (1 − d) Σ_{i=1}^{k} PR(x_i)/δ(x_i)    (3)

where d is the damping factor, x_i denotes each of the up to k incoming neighboring nodes pointing to s, and δ(x_i) is the out-degree of x_i. From the fact that over some k iterations the PR values of all the nodes get scaled by a constant factor proportional to W, it can be deduced that for the node s the updated PR score can be scaled as W · PR(s) = W′ · PR′(s), or:

PR′(s) = (W/W′) · PR(s)    (4)

So, the new PR scores can easily be determined by scaling the old PR by the factor W/W′. W can also be taken to be the order of the graph n(G), since every node is equally likely to be of weight 1; Equation 4 can then be re-written as:

PR′(s) = (n(G)/n(G′)) · PR(s)    (5)

The new nodes that are added to G must be put through the usual iterations to compute their PR values. In Figure 1 we demonstrate these partitions, where Bi represents the batches of updates. A set of nodes, which we denote by Vnew, are in the partition that requires computation of PR from scratch. The other partition, Vold, only requires scaling using Equation 5. The Vborder nodes, which lie in the border area, require scaling along with a few iterations of ranking before converging to their final PR scores. While the standard PR algorithm requires O(n + m) space to be maintained in memory, partitioning the graph effectively reduces the memory requirement. If we assume each batch to be of size ∆ edges, then in addition to the original graph an additional space of O(∆) needs to be created. Given the limited space available on GPUs, a combined space of O(n + m + ∆) would severely limit scaling.
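The scaling step of Equations 4 and 5 amounts to one multiplication per old node. A minimal sketch (illustrative names, not the HyPR code; the paper performs this as a GPU kernel with one thread per node):

```python
# Sketch of Equation 5: when a batch grows the graph from n(G) to
# n(G') nodes, old scores are rescaled rather than recomputed.
def scale_old_scores(pr, n_old, n_new):
    """pr: dict node -> old PageRank score; returns rescaled scores."""
    factor = n_old / n_new            # n(G) / n(G')
    return {u: factor * p for u, p in pr.items()}

# A graph of 4 nodes grows to 5: every old score shrinks by 4/5,
# freeing probability mass for the newly added node.
old = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
new = scale_old_scores(old, n_old=4, n_new=5)
```

Since each node's update is independent and O(1), the operation maps directly onto an SIMT kernel, as described in Section III-D.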
In order to scale HyPR to larger sizes, we aim to keep only O(∆) additional space on the GPU, while the rest of the graph is maintained in host memory. This is further refined through the partitioning, which decomposes the PR computation into discrete space requirements of O(Vnew), O(Vborder), and O(Vold). Each of these partitions need not reside in the same memory at all times. We use the compressed sparse row representation for the graph, as detailed in [9].

Figure 1: Identification of nodes

Algorithm 1: parallelPR(V, outdeg, InV, γ)
Require: Set of nodes V, out-degree of each node in outdeg, incoming neighbors InV.
Ensure: PageRank p of each node
 1: err = ∞
 2: for all u ∈ V do
 3:   previous(u) = d/|V|
 4: end for
 5: while err > γ do
 6:   err = 0
 7:   for all u ∈ V in parallel do
 8:     p(u) = d/|V|
 9:     for all x ∈ InV(u) do
10:       p(u) = p(u) + (1 − d) ∗ previous(x)/outdeg(x)
11:     end for
12:   end for
13:   for all u ∈ V do
14:     err = max(err, abs(previous(u) − p(u)))
15:     previous(u) = p(u)
16:   end for
17: end while
18: return p

C. Datasets used

Even though PR is usually computed on web graphs representing web pages and hyperlinks, their properties are similar to those of other real-world graphs. Hence, we choose a healthy mix of web graphs and other real-world graphs. The datasets we use are collected from the University of Florida Sparse Matrix Collection [10] and the Stanford Network Analysis Project (SNAP) [11]. The datasets we have selected range from 2.9M edges to around 640M edges, as detailed in Table I.

III. METHODOLOGY

In this section we discuss the approach adopted for implementing HyPR. We first give a basic overview of the approach and then provide a detailed explanation of the implementation strategies. Algorithm 2 shows the basic set of steps adopted in the implementation, and Figure 2 shows an overview of these steps. In a broad sense, HyPR works as concurrent (or overlapped) phases of the following.
1) Partitioning: The parallel cores of the CPU create the partitions Vold, Vnew, and Vborder, as discussed in Section II-B.
2) Transfer: Perform an asynchronous transfer of the ∆-sized batch to the GPU memory.
3) PR calculations: Depending on the type of the partition, either scale or calculate new PR scores on the GPU.

Table I: Datasets Used

  #    Graph Name            Source   |V|       |E|        Type
  1.   Amazon                [11]     0.41 M    3.35 M     Purchasing
  2.   web-Google            [11]     0.87 M    5.10 M     Web Graph
  3.   wiki-Topcat           [11]     1.79 M    28.51 M    Social
  4.   soc-pokec             [11]     1.63 M    30.62 M    Social
  5.   Reddit                [11]     2.61 M    34.40 M    Social
  6.   soc-LiveJournal       [11]     4.84 M    68.99 M    Social
  7.   Orkut                 [11]     3.00 M    117.10 M   Social
  8.   Graph500              [11]     1.00 M    200.00 M   Synthetic
  9.   NLP                   [10]     16.24 M   232.23 M   Optimization
  10.  Arabic                [10]     22.74 M   639.99 M   Web Graph

A. Why Hybrid?

Before we discuss the HyPR design in detail, we first motivate the requirement for a hybrid solution. As stated earlier, our intention in pursuing a heterogeneous solution is threefold. First, a hybrid solution allows the PR calculation to scale to very large sizes, which is otherwise limited by the GPU main memory size. As discussed in Section II-B, only O(|V| + ∆) space is needed for computing the updated PR values for a particular batch of size Bi. The static graph which is undergoing updates resides in the much larger main memory of the host and only aids in creating the partitions. Second, if we ignore scaling, we need to perform partitioning followed by parallel computation of the PR scores, which would result in limited GPU utilization, as these steps are not independent. Also, the degree of parallelism is limited for the partitioning step in comparison to the scaling and PR updates. Hence, the CPU is an ideal device for the coarsely parallel partitioning step, while the GPU is better suited for the finely parallel update and scaling. We show in Section IV that a GPU-only execution is actually slower than the hybrid technique.
Finally, a hybrid technique allows us to accrue higher system efficiency, which would otherwise have the CPU sitting idle while the GPU is updating the PR scores.

B. Graph Partitioning

We start the computation on a graph G(V, E, w) for which some PR scores have already been computed. The algorithmic phases that HyPR goes through are outlined in Algorithm 2. In successive time intervals, a set of batch updates arrives. As discussed previously in Section II-B, the batches consist of a heterogeneous mix of edges that are either to be inserted or deleted. A batch Bi can be represented by the tuples (ti, ui, vi, wi, oi), as discussed in Section II-B. For updating the batch in parallel, we identify three connected components (CCs) that require different computations. If we consider the existing graph as one single CC (say Co), and the new batch of updates as a separate CC (say Cn), then we can construct a block graph S where there are incoming edges from Cn to Co if there exist edge updates (ui, vi) with ui ∈ Co and vi ∈ Cn. We can assume that the new batch of updates is topologically sorted, since the PR scores of the new nodes in Bi are guaranteed to be lower than those in Co. In an auxiliary experiment, we tested the nature of these CCs to see if they form strongly connected components (SCCs), so as to arrive at a formulation similar to [4]. Decomposing a graph into a set of SCCs provides the advantage of a topological ordering of the partitions, where PR computations (or re-computations) of the scores can be done in a cascaded manner, with the scaling of the Vborder and Vold nodes first, followed by the PR calculation of Vborder and Vnew. This order of cascaded updates has also been adopted to update the PR scores as proved in [2]. We found that if we re-compute the three partitions using Kosaraju's algorithm (cf.
[12]) to identify the Vold, Vnew, and Vborder partitions, approximately 2% of the edges on the Arabic dataset differ when compared to our mechanism of partitioning. Hence, we can conclude that our mechanism produces approximate SCCs, which can be easily monitored for only the 2% extraneous edges so as to maintain a proper topological order while performing the batch updates. We do a small amount of book-keeping on these edges in order to partition them correctly. We can now identify three partitions for every batch Bi: (i) Vold, the set of nodes that are already present in the existing graph; (ii) Vnew, the set of entirely new nodes that are to be added to the existing graph, found as Vi − (Vi ∩ Vold) if Vi is the set of nodes in batch Bi; and (iii) Vborder, the set of nodes which have edges in Bi connecting Vold and Vnew and are reachable using a breadth-first traversal. As an example, in Figure 1 we can see the Vold nodes in red, the Vnew nodes in green, and the Vborder nodes, which are the set of nodes in the first hub having direct connections with Vnew. Essentially, all the nodes in G other than the yellow nodes are Vborder. As we can see in Algorithm 2, the first phase of the update operation is the creation of these partitions. Since the identification of these vertices is independent, it can be done in parallel. It is critical to note that at this stage the number of edges present in a single batch does not warrant any form of edge parallelism on the GPU, as that would lead to lower utilization of the available GPU bandwidth at the cost of high memory transfers.

C. Pre-processing

In the pre-processing phase, we compute the Vold, Vnew, and Vborder partitions in a manner overlapped with the GPU. The sequence of operations followed for every incoming batch of updates is:
1) Pre-process the first batch in parallel using OpenMP threads on the CPU.
2) Transfer Vold, Vnew, and Vborder to the GPU for scaling and PR computation in an asynchronous manner.
3) Start pre-processing the second batch of updates on the CPU as soon as the first partition is
handed off to the GPU for ranking.

Figure 2: Overview of HyPR execution steps. (a) Graph with batch updates; (b) identification of border, new, and unchanged nodes; (c) scaling of unchanged and border nodes; (d) PageRank computation on border and new nodes.

Algorithm 3 demonstrates the partitioning mechanism, which is called from Line 2 of Algorithm 2. As mentioned earlier, a batch of updates contains a heterogeneous mix of both insert and delete operations, which have to be handled differently.

1) Insertion: Insertion is more compute intensive than deletion. In Lines 6-22 of Algorithm 3 we depict our insertion mechanism. We first populate Vnew with all the nodes of the batch and process all the elements in Vnew using the available CPU threads, successively sending each node to the appropriate partition. We first check whether the source node of an edge (ui, vi) belongs to the existing partition. If it does, then that particular node is put in Vold. For computing Vborder, we check if the particular node is reachable in G (the existing graph) in a breadth-first manner and has a predecessor in the incoming batch. The intuition behind this is that Vborder will be the set of nodes that are reachable from the Vnew set of nodes and hence will undergo both scaling and PR computation. So, in parallel, all the nodes in Vnew are popped and classified into either Vold or Vborder. The ones remaining in Vnew are the entirely new set of nodes that have come in Bi, whose new PR values need to be computed.

2) Deletion: Deletion is much simpler than insertion and hence less compute intensive. In the case of deletion, the Vnew set will be NULL, and the only nodes involved are the Vold and Vborder sets. As we can see from Lines 23-33 of Algorithm 3, if the update type oi is delete, then we remove the nodes involved from Vold, which contains the nodes of the original graph.
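The per-batch classification for both insertions and deletions can be sketched as follows. This is an illustrative simplification with hypothetical names (`create_partition` here approximates Algorithm 3 using only one-hop successors rather than a full breadth-first reachability check):

```python
# Sketch of the V_old / V_new / V_border classification of a batch.
def create_partition(graph_nodes, successors, batch):
    """graph_nodes: set of nodes already in G;
    successors: dict node -> list of successors in G;
    batch: list of (op, u, v) updates, op in {"insert", "delete"}."""
    v_old = set(graph_nodes)          # every existing node (Alg. 3, line 4)
    v_new, v_border = set(), set()
    for op, u, v in batch:
        if op == "insert":
            v_new |= {x for x in (u, v) if x not in graph_nodes}
            # existing successors of the batch endpoints must be re-ranked
            for x in (u, v):
                v_border |= set(successors.get(x, []))
        else:  # delete: endpoints leave V_old, their neighbours become border
            v_old -= {u, v}
            v_border |= set(successors.get(u, [])) | set(successors.get(v, []))
    v_border &= set(graph_nodes)      # border nodes are existing nodes
    v_old -= v_border                 # border nodes are handled separately
    return v_old, v_new, v_border

# Insert edge (1, 4) into G = {1, 2, 3}: node 4 is new, and node 2
# (a successor of 1) becomes a border node; nodes 1 and 3 only scale.
old, new, border = create_partition({1, 2, 3}, {1: [2], 2: [3]},
                                    [("insert", 1, 4)])
```

Since the classification of each update is independent of the others, the loop body parallelizes naturally over OpenMP threads, as in HyPR.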
These removals will require a fresh set of PR computations that are handled during the PR update step in Algorithm 2. Additionally, the removals induce a newer set of Vborder nodes, for which we check whether any of the reachable successors or predecessors of the removed nodes are present. Such nodes are pushed into Vborder.

D. Scaling the old nodes

The pre-processing step essentially allows us to perform data-parallel scaling and PR computations on the individual partitions. As discussed earlier, the primary idea behind HyPR is the localization of the set of nodes that will be affected by the new batch of updates. As we can see from Lines 6-11 in Algorithm 2, we call a GPU kernel to scale the nodes of the Vold partition using Equation 5, discussed in Section II-B. We can now achieve full GPU bandwidth saturation, as |Vold| threads can be spawned to scale all the nodes in parallel. It is critical to note here that in the hybrid implementation the necessary intermediate transfers do not require the entire G (which initially consists of the V nodes of G) to be copied every time. Rather, the original graph G is copied to the GPU before the batch processing starts and is augmented with Vnew after every batch is processed. As we can see in Lines 9-11 of Algorithm 2, the scaling operation is performed on the Vborder nodes as well. Scaling on the GPU is required for all the Vold and Vborder nodes. The actual scaling operation is an O(1) operation, which makes it well suited for a massively parallel implementation on GPUs. Threads equal to the number of nodes involved in the scaling process (Vold or Vborder) are spawned on the GPU to execute the scaling kernels in an SIMT manner.

E. Page Rank Update

The PR update of the Vnew and Vborder nodes is now a much lighter computation owing to the partitions.
As with the standard parallel PR implementation shown in Algorithm 1, the power iterations for computing the PR scores continue until the scores converge to an error threshold γ (set to 10^−10). However, since the Vborder set undergoes a step of scaling before the PR update step, the number of iterations required for the scores to converge is much lower than when computing from scratch. So, during the PR update step for Vborder and Vnew, as shown in Lines 12-17 of Algorithm 2, we call parallelPR() of Algorithm 1. For the Vnew nodes, the computation is trivial, since the number of nodes is low, as it contains only the new nodes added by a batch. The Vborder nodes, although much greater in number than the Vnew set, also go through parallelPR(). However, they converge much more quickly, since they have undergone a step of scaling previously. For deletions, the PR update is required only for the Vold and Vborder sets of nodes. As we can see from Lines 19-26 in Algorithm 2, the same PR update process will
be applied first for the Vborder set. The Vold set will simply undergo a step of scaling, similar to the case of insertions.

Algorithm 2: HyPR: Hybrid Page Ranking on G with incremental batches Bi
Require: Scratch graph G and k batches in B represented by (ti, ui, vi, wi, oi), PageRank vector dest, out-degree of each node in outdeg, incoming neighbors of every node in InV.
Ensure: Rank of the nodes in vector dest
{Phase 1: Pre-processing phase}
 1: CPU:: Partition the incoming batches based on insertion and deletion
 2: (Vold, Vborder, Vnew) = createPartition(G, B)
 3: CPU:: Queue Vold, Vborder, Vnew for async transfer to GPU
{Phase 2: PR update}
 4: INSERTION: Generate threads equal to the number of Vborder, Vnew
 5: if (oi == insert) then
 6:   for ∀u ∈ Vold in parallel do
 7:     GPU:: dest[u] = |V| ∗ dest[u] / |Vold|   {Scaling}
 8:   end for
 9:   for ∀x ∈ Vborder in parallel do
10:     GPU:: dest[x] = |V| ∗ dest[x] / |Vborder|   {Scaling}
11:   end for
12:   for ∀z ∈ Vborder in parallel do
13:     GPU:: dest[z] = parallelPR(Vborder, outdeg, InV[z])
14:   end for
15:   for ∀y ∈ Vnew in parallel do
16:     GPU:: dest[y] = parallelPR(Vnew, outdeg, InV[y])
17:   end for
18: end if
19: DELETION: Generate threads equal to the number of Vold and Vborder
20: if (oi == delete) then
21:   for ∀u ∈ Vborder in parallel do
22:     GPU:: dest[u] = parallelPR(Vborder, outdeg, InV[u])   {PR Update}
23:   end for
24:   for ∀v ∈ Vold in parallel do
25:     GPU:: dest[v] = |V| ∗ dest[v] / |Vold|   {Scaling}
26:   end for
27: end if

F. CUDA+OpenMP implementation

We can observe a snapshot of the overlapped execution model in Figure 4. The target performance critically depends on creating the right balance between the computations occurring on the CPU and the GPU. The CPU is responsible for creating the partitions and transferring them to the GPU. The GPU, on the other hand, is responsible for performing the three kernel operations: scaling and the two PR update operations.
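This division of labor can be sketched in plain Python threads (a sketch only: `partition_fn` and `update_fn` are stand-ins for createPartition() and the scaling/PR kernels; the actual implementation uses OpenMP threads, cudaMemcpyAsync(), and CUDA streams):

```python
# Sketch of the overlapped pipeline: while the "GPU" worker updates
# batch i, the main "CPU" thread already partitions batch i+1.
from concurrent.futures import ThreadPoolExecutor

def pipeline(batches, partition_fn, update_fn):
    results = []
    with ThreadPoolExecutor(max_workers=1) as gpu:  # stand-in for a stream
        pending = None                      # in-flight "GPU" work
        for batch in batches:
            parts = partition_fn(batch)     # CPU-side work for this batch
            if pending is not None:
                results.append(pending.result())    # drain batch i-1
            pending = gpu.submit(update_fn, parts)  # asynchronous hand-off
        if pending is not None:
            results.append(pending.result())
    return results

# Toy run: "partitioning" sorts a batch, "updating" counts its edges.
out = pipeline([[1, 2], [3], [4, 5, 6]],
               partition_fn=lambda b: sorted(b),
               update_fn=lambda p: len(p))
```

When the partitioning time of batch i+1 exceeds the transfer-plus-kernel time of batch i, the GPU work is fully masked, which is the best-case behavior reported in Section IV-C.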
Algorithm 3: createPartition(G, B)
Require: Graph G and k batches in B represented by bi ∈ (ti, ui, vi, wi, oi), where oi denotes insert or delete
Ensure: Vold, Vnew, Vborder
 1: CPU:: Generate threads using OpenMP
 2: Initialize Vold, Vnew, Vborder = φ, Vtemp = φ
 3: Push ∀(u, v) ∈ batch to Vnew
 4: Push ∀u ∈ G to Vold
 5: for ∀bi ∈ B in parallel do
 6:   if (oi == insert) then
 7:     while (Vnew != NULL) do
 8:       Pop element x ∈ Vnew
 9:       if (x ∈ Vtemp) then
10:         Continue
11:       end if
12:       Push x to Vtemp
13:       for every successor y of x ∈ G do
14:         Push y to Vtemp
15:       end for
16:     end while
17:     for ∀z ∈ Vtemp do
18:       for ∀ predecessors li ∈ bi do
19:         Push li into Vborder
20:       end for
21:     end for
22:   end if
23:   if (oi == delete) then
24:     for ∀(u, v) ∈ bi do
25:       Choose u and v from Vold
26:     end for
27:     for ∀y, successor of (u, v) ∈ G do
28:       Push y into Vborder
29:     end for
30:     for ∀y, predecessor of (u, v) ∈ G do
31:       Push y into Vborder
32:     end for
33:   end if
34: end for
35: return Vold, Vnew, Vborder

To achieve that, we make use of synchronous CUDA kernel calls, asynchronous transfers, and CUDA streams to orchestrate the entire execution model. For creating the partitions, we utilize CPU threads created using the OpenMP library. We create threads equal to the number of available processing cores. Naturally, the batch sizes are much bigger than the number of threads. We use standard blocking of the batches for each thread to handle. Despite large batches, this provides good performance, owing to the fact that the partitioning operation is simple in nature and does not involve very CPU-intensive operations. Additionally, partitioning is an irregular operation, which the CPU is much better at handling than the GPU. CUDA streams are created before the start of the
operation. Once the CPU finishes the partitioning operation on a particular batch, cudaMemcpyAsync() calls are issued for the three partitions, each on an individual stream. CUDA events associated with the copy operations monitor their completion.

IV. PERFORMANCE EVALUATION

In this section we discuss the experiments that we perform to validate the efficacy of our solution, and also analyze the performance.

A. Experimental environment

For conducting our experiments we use a platform that has a multicore CPU connected to a state-of-the-art GPU via the PCIe link. The CPU is an Intel(R) Xeon(R) Silver 4110 with the Skylake micro-architecture. Two of these CPUs, each having 8 cores, are arranged in two sockets, effectively providing 16 NUMA cores. The cores are clocked at 2.1 GHz with 12 MB of L3 cache. The host is attached to an NVIDIA V100 GPU, which has 5120 CUDA cores spread across 80 streaming multiprocessors (SMs). Each GPU core is clocked at 1.38 GHz, and the GPU has 32 GB of global main memory. The GPU is connected to the CPU via a PCIe Gen2 link. The machine runs the CentOS 7 OS. For multi-threading on the CPU, we use OpenMP version 3.1 and GCC version 4.8. The GPU programs are compiled with nvcc from CUDA version 10.1 with the −O3 flag. All experiments have been averaged over a dozen runs. For the experiments, we use the real-world graphs shown in Table I. The datasets do not possess any timestamps of their own. As done in previous works [1, 13], we simulate the random arrival of an edge update by randomly setting the timestamps. The updates are then applied in increasing order of the timestamps. For the evaluations, we adopt a sliding-window model where we take a certain percentage of the original dataset to construct the batches. These are then varied to measure the performance.

B. Update time

In this section we discuss the performance of HyPR in the context of update times.
As mentioned earlier, we start the computations by taking half of the edges in the entire graph dataset. We then measure the update times as the sliding window moves to generate a set of batches. The reported update times are recorded using CUDA event timers so as to capture the overlapped pre-processing, transfer, and kernel execution times (shown in one blue box in Figure 4). A timer is started before the OpenMP parallel section, where 15 threads perform the parallel partitioning of a batch and one thread performs the asynchronous transfers and kernel calls. The timer is stopped once the CUDA events indicate the end of the final update kernel. We have experimented with different batch sizes from 1% to 10%. In Figure 3, we can see the latencies achieved over all the graphs. For the largest graph, Arabic, the average update time that we achieve is 85.108 ms. In Figure 3, we also show the comparative time for the "Pure GPU" implementation. This is done in order to validate our claim that a hybrid implementation, which is able to efficiently overlap the PR computations with the CPU-side partitioning and transfers, shows better performance. On average, we observe that HyPR achieves a 1.1843x speedup over the "Pure GPU" performance on the 5 largest graphs that we experiment with, and a 1.2305x speedup over all the graphs. It can be observed that the speedups achieved are more pronounced on the larger graphs in comparison to the smaller ones. This is due to the fact that the larger graphs provide higher degrees of parallelism, which allows the GPU to spawn a higher number of threads, and the CPU-side pre-processing allows a higher degree of overlap. This can be better explained using the experiment discussed in the next section. We also execute HyPR on a multi-core CPU using only OpenMP, which we call "Pure CPU".
We achieve around a 95x improvement over this baseline; since the "Pure CPU" execution is orders of magnitude slower, it is not shown in Figure 3.

C. Hybrid Overlaps

As stated in Section I, one of the major motivations behind doing a hybrid computation is to perform the partitioning of the graph iteratively while the GPU is busy updating the PR scores. We first show how the pipeline that has been set up works for the Arabic graph. The large graphs become the best use-cases for this pipelined execution, as they are able to achieve the best balance of the computations on the CPU and GPU sides. As we can observe from Figure 4, the execution begins with the partitioning of the first batch B1 of updates, which does not involve any overlap. Once the partitioning completes, the asynchronous transfer of the partitions to the GPU begins, which allows the CPU to process the B2 batch immediately. The transfer, which is registered with a callback, automatically spawns the kernels as soon as it completes. We can see that for a modest batch size of 10,000 edges, it takes 74.43 ms to pre-process the B2 batch, which fully masks the 41.12 ms of transfer and the 34.77 ms required by the kernels. This behavior is the best-case scenario, which is however not observable for all graphs. We can see from Figure 5 that for the smaller graphs (Figure 5(a-f)), the difference between the partitioning time and the transfer+kernel time is -1.81 ms (meaning the pre-processing is slower on average). In Figure 5, the batch sizes are kept uniform (at 10K edges, approximately 1% of the dataset). For several of the smaller graphs, soc-pokec, Reddit, Orkut, and Graph-500, we can observe that for several batches the two curves cross each other at multiple points. This can be attributed to the structural heterogeneity of the batches.
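The overlapped schedule described above can be sketched with host threads standing in for the CPU/GPU split. This is a minimal illustration, not the paper's implementation: `partition` and `transfer_and_update` are hypothetical placeholders for the OpenMP partitioning stage and the cudaMemcpyAsync-plus-kernel stage, respectively.

```python
import concurrent.futures as cf
import time

def partition(batch):
    """CPU side: stands in for the OpenMP partitioning step."""
    time.sleep(0.01)               # pretend partitioning work
    return {"batch": batch, "partitions": ("V_old", "V_new", "V_border")}

def transfer_and_update(parts):
    """GPU side: stands in for the async transfer + PR update kernels."""
    time.sleep(0.01)               # pretend transfer + kernel time
    return parts["batch"]

def pipelined_update(batches):
    """Overlap partitioning of batch B(i+1) with the GPU work of batch B(i)."""
    done = []
    with cf.ThreadPoolExecutor(max_workers=1) as pool:
        gpu_job = None
        for b in batches:
            parts = partition(b)               # runs while gpu_job is in flight
            if gpu_job is not None:
                done.append(gpu_job.result())  # wait for the previous batch's GPU work
            gpu_job = pool.submit(transfer_and_update, parts)
        done.append(gpu_job.result())          # drain the final batch
    return done
```

While batch B(i+1) is partitioned on the main thread, the pool thread is busy with the transfer and update of B(i), mirroring the overlap shown in Figure 4.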
If a batch, for example, contains too many new nodes, then the kernel times for calculating new PR scores on Vborder and Vnew will be higher than the pre-processing time, which is correspondingly lower in those cases. The reverse scenario, which is more conducive, is when the batches have a healthy mix of new and old nodes. This creates the right kind of overlap, as shown in Figure 4, where the waiting times of the GPU are minimized. An inflection point is observed for NLP (Figure 5(i)) and for Arabic (Figure 5(j)), where we can see that the partitioning time exceeds the transfer+kernel time at a particular batch size. The partitioning times remain higher for the remaining larger graphs. This indicates the scalability of the CPU-side partitioning on different batches. On average, the difference between the two is 4.95 ms for the larger graphs (Figure 5(g-j)). Hence, we can conclude that HyPR is data driven, and shows its best performance on the datasets that are able to extract the maximum amount of system performance. The system efficiency achieved by HyPR is discussed in the next sub-section.

Figure 3: Update Time (HyPR vs. Pure GPU, batch sizes 1%-10%, for (a) Amazon, (b) Web-google, (c) Wiki-topcat, (d) Soc-Pokec, (e) Reddit, (f) Soc-LiveJournal, (g) Orkut, (h) Graph500, (i) NLP, (j) Arabic)
Figure 4: Overlapping of the pre-processing time with transfer and kernel time for the Arabic graph

D. Resource Utilization

In this experiment, we investigate the system efficiencies achieved by HyPR. Towards that, we profile the application in order to know the utilization of the resources. This is done through measurements of the memory utilization and the warp occupancy of the GPU threads. We used the nvprof profiler from the CUDA toolkit.
We used two profiling metrics from nvprof. First, we monitor achieved_occupancy, which is the number of warps running concurrently on an SM divided by the maximum number of warps the SM can hold. Second, we look at gld_efficiency, which is the ratio of the requested global memory load throughput to the required global memory load throughput, i.e., how well the DRAM accesses are coalesced. Figure 6 shows these two profiling results for different batch sizes. We can observe that the occupancies scale linearly with increasing batch sizes. The global load efficiency indicates increasingly coalesced accesses on every batch. On average, we achieve a global load efficiency of 61.07% and a warp occupancy of 64.14%.

E. Comparative Analysis

We now compare the performance of HyPR with some of the state-of-the-art solutions for dynamic PR. We mainly compare our work with GPMA [1], GPMA+ [1], and cuSparse [14]. GPMA exploits the packed memory array (PMA) to handle dynamic updates by storing sorted elements in a partially contiguous manner that enhances dynamic updates. The cuSparse library has efficient CSR implementations for sparse matrix-vector multiplications. We implement a basic PR update mechanism (purely on the GPU) using cuSparse, where the new incoming batch iteratively undergoes SpMV operations until the PR values converge. For performing the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1], which is an Intel Xeon CPU connected to a Titan X Pascal GPU, and on the same datasets. Additionally, the PR scores are derived after running all the experiments till convergence or 1000 iterations (whichever is earlier). Towards that, we use an additional graph, "Random", having 1M nodes and 200M edges. From Figure 7, we can observe that HyPR outperforms the state-of-the-art GPMA and GPMA+, and the cuSparse implementation, for the four largest graphs. On average over all four graphs, HyPR outperforms GPMA by 4.81x, GPMA+ by 3.26x, and cuSparse by 102.36x. F.
Accuracy Analysis

For checking the accuracy of HyPR, all the PR scores are pre-computed using nvGraph [15]. nvGraph, a part of the NVIDIA CUDA Toolkit, is a state-of-the-art solution that provides millisecond performance for computing PR scores on a static graph with 500 iterations. We use nvGraph for computing the from-scratch PR scores of the graph. The batch updates are then applied to the graph through the HyPR method, and the results are checked against the scores from nvGraph, which we feed with a graph that already has the batches included. Once all the batch updates are completed, we compare the PR scores with the pre-computed ones. To check the accuracy, we introduce batch updates with insert and delete operations on the nodes with the highest PR scores in the existing graph. This is done in order to impose the maximum change on the PR scores due to an update.

Figure 5: Overlaps achieved between partitioning and transfer+kernel times for 10 batches B1-B10 of uniform size, for (a) Amazon, (b) Web-google, (c) Wiki-topcat, (d) soc-pokec, (e) Reddit, (f) soc-LiveJournal, (g) Orkut, (h) Graph-500, (i) NLP, (j) Arabic
Figure 6: Resource Utilization ((a) Reddit, (b) Graph500, (c) Orkut, (d) NLP)
Figure 7: Comparison of HyPR with GPMA, GPMA+, and cuSparse ((a) Reddit, (b) soc-Pokec, (c) Graph500, (d) Random)
Figure 8: Accuracy of PR scores
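The top-k similarity check used in this accuracy analysis can be sketched as follows, assuming the PR scores are held in node-to-score dictionaries (`topk_jaccard` is an illustrative name, not part of the paper's code):

```python
def topk_jaccard(scores_a, scores_b, k):
    """Jaccard similarity of the top-k node sets of two PageRank score maps.

    scores_a, scores_b : dict mapping node -> PR score
    k                  : number of highest-ranked nodes to compare
    """
    def top(scores):
        # nodes sorted by descending PR score, truncated to the top k
        return set(sorted(scores, key=scores.get, reverse=True)[:k])

    ta, tb = top(scores_a), top(scores_b)
    return len(ta & tb) / len(ta | tb)
```

A similarity of 1.0 means the two methods agree exactly on the top-k set; partial overlap yields the ratio of shared nodes to the union of both sets.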
In Figure 8, we show the Jaccard similarities of the PR scores computed by HyPR with those computed by nvGraph for the top 10, 20, and 30 nodes with the highest PR scores. We see that on average the similarity score comes to 0.985 for a fixed 500 iterations (the same as that used for computing the scores using nvGraph). The similarity score comes to 0.991 on average if we allow the different graphs to converge till they reach the threshold γ. We also see that on average the divergence of the PR scores computed by HyPR from those of nvGraph is less than 0.001% (shown on the secondary y-axis).

V. RELATED WORKS

GPU-based PR has been explored in the works done by Duong et al. [16] and Garg et al. [4]. In [16], the authors propose a new data structure for graph representation named the link structure file. In their work they target the steps in the PR computation where sufficient data parallelism exists. These steps are then distributed among multiple GPUs, where each thread performs finer-grained work. Garg et al. [4] provide algorithmic techniques for partitioning the graph based on its structural properties to extract parallelism. PR on evolving graphs has been explored by Sha et al. [1], who propose two algorithms, GPMA and GPMA+, based upon the packed memory array. GPMA is a lock-based approach in which conflicts among a few concurrent updates are handled efficiently. GPMA+ is a lock-free bottom-up approach that prioritizes updates and favors coalesced memory access. Another work is by Feng et al. [17], who propose the DISTINGER framework. DISTINGER employs a hash-partitioning-based scheme that favors massive graph updates and message passing among the partition sites using MPI. Another algorithm to compute personalized PR on dynamic graphs was published by Guo et al. [13], which also exploits GPUs for performance. Similar to HyPR, the computations proposed in [13] are also done in a batched manner. To enhance the performance of parallel push, different optimization techniques are introduced. One of them is eager propagation, which minimizes the number of local push operations. They also propose a frontier generation method that keeps track of vertex frontiers while cutting down the synchronization overhead needed to merge duplicate vertices. Batch parallelism for dynamic graphs has also seen several theoretical studies. A generic framework for batch parallelism is proposed by Acar et al. in [8]. Batch parallelism for graph connectivity and other problems, in the massively parallel computation (MPC) model, is explored by Dhulipala et al. in [18]. The work done by Desikan et al. [2] is one of the earliest works towards incremental PR computation on evolving graphs.
The authors proposed the partitioning and scaling techniques which we modify for parallelization on a heterogeneous platform. To the best of our knowledge, no other work exists that explores a hybrid CPU+GPU solution for computing global PR. In HyPR we propose techniques for PR computation that use batch parallelism in unison with fully parallel partitioning and PR update mechanisms on a hybrid platform towards extracting high performance.

VI. CONCLUSION AND FUTURE WORK

In this work, we propose HyPR, which is a hybrid technique for computing PR on evolving graphs. We have shown an efficient mechanism to partition the existing graph and updates into data-parallel work units which can be updated independently. HyPR is executed on a state-of-the-art high performance platform and exhaustively tested against large real-world graphs. HyPR is able to provide substantial performance gains of up to 4.8x over other existing mechanisms and also achieves generous system efficiency. In the near future, we plan on extending HyPR by spreading the computations across multiple GPUs located on shared and distributed memories. Communication between distributed nodes will become an additional overhead to handle in that case. Additionally, modern HPC systems are equipped with newer generation interconnects like NVLink, which deserve to be explored in the context of page ranking.

VII. ACKNOWLEDGMENT

This work is supported by the Science and Engineering Research Board (SERB), DST, India through the Early Career Research Grant (no. ECR/2016/002061) and by the NVIDIA Corporation through the GPU Hardware Grant program.

REFERENCES

[1] M. Sha, Y. Li, B. He, and K.-L. Tan, “Accelerating dynamic graph analytics on GPUs,” Proc. of the VLDB Endow., vol. 11, no. 1, pp. 107–120, 2017.
[2] P. Desikan, N. Pathak, J. Srivastava, and V. Kumar, “Incremental page rank computation on evolving graphs,” in 14th International WWW, 2005, pp. 1094–1095.
[3] L. Page, S. Brin, R. Motwani, and T.
Winograd, “The pagerank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep., 1999.
[4] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques for efficient parallel PageRank computation on real-world graphs,” in Proceedings of the 17th ICDCN, 2016, pp. 1–10.
[5] D. Gleich, L. Zhukov, and P. Berkhin, “Fast parallel pagerank: A linear system approach,” Yahoo! Research Technical Report YRL-038, available via http://research.yahoo.com/publication/YRL-038.pdf, vol. 13, p. 22, 2004.
[6] A. Cevahir, C. Aykanat, A. Turk, and B. B. Cambazoglu, “Site-based partitioning and repartitioning techniques for parallel PageRank computation,” IEEE TPDS, vol. 22, no. 5, pp. 786–802, 2011.
[7] M. Kim, “Towards exploiting GPUs for fast PageRank computation of large-scale networks,” in Proceedings of the 3rd International Conference on Emerging Databases, 2013.
[8] U. A. Acar, D. Anderson, G. E. Blelloch, and L. Dhulipala, “Parallel batch-dynamic graph connectivity,” in The 31st ACM SPAA, 2019, pp. 381–392.
[9] Compressed Sparse Column Format (CSC), https://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html.
[10] The University of Florida Sparse Matrix Collection, https://snap.stanford.edu/data.
[11] J. Leskovec and A. Krevl, SNAP Datasets: Stanford Large Network Dataset Collection, http://snap.stanford.edu/data.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition, 3rd ed. The MIT Press, 2009.
[13] W. Guo, Y. Li, M. Sha, and K.-L. Tan, “Parallel personalized pagerank on dynamic graphs,” Proc. of the VLDB Endow., vol. 11, no. 1, pp. 93–106, 2017.
[14] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “cuSPARSE library,” in GPU Technology Conference, 2010.
[15] nvGraph toolkit documentation, https://docs.nvidia.com/cuda/cuda-runtime-api/index.html.
[16] N. T. Duong, Q. A. P. Nguyen, A. T. Nguyen, and H.-D. Nguyen, “Parallel PageRank computation using GPUs,” in Proc.
of the 3rd Symposium on Information and Communication Technology, 2012, pp. 223–230.
[17] G. Feng, X. Meng, and K. Ammar, “DISTINGER: A distributed graph data structure for massive dynamic graph processing,” in International Conference on Big Data. IEEE, 2015, pp. 1814–1822.
[18] L. Dhulipala, D. Durfee, J. Kulkarni, R. Peng, S. Sawlani, and X. Sun, “Parallel batch-dynamic graphs: Algorithms and lower bounds,” in Proceedings of the 31st SODA. USA: SIAM, 2020, pp. 1300–1319.