Highlighted notes on A Parallel Algorithm Template for Updating Single-Source Shortest Paths in Large-Scale Dynamic Networks.
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
In Hybrid PageRank the vertices are divided into 3 groups: V_old, V_border, V_new. The scaling for old and border vertices is N/N_new, and 1/N_new for V_new (I do this too). Then PR is run only on V_border and V_new.
"V_border which is the set of nodes which have edges in Bi connecting V_old and V_new and is reachable using a breadth first traversal."
Does that mean V_border = V_batch(i) ∩ V_old? BFS from where?
"We can assume that the new batch of updates is topologically sorted since the PR scores of the new nodes in Bi is guaranteed to be lower than those in Co."
Is sum(PR) in V_old > sum(PR) in V_new always?
"For performing the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1] which is a Intel Xeon CPU connected to a Titan X Pascal GPU, and also the same datasets."
Old GPUs are going to be slower ...
As we were discussing last time, it is not possible to scale old ranks and skip the unchanged components (or here V_old). Please check this simple counterexample that shows skipping leads to incorrect ranks.
https://github.com/puzzlef/pagerank-levelwise-skip-unchanged-components
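A minimal reproduction of that counterexample (my own sketch, not the repository's code; the graph, sizes, and the d = 0.15 teleport convention from the paper's Equation 1 are my choices):

```python
def pagerank(n, in_nbrs, out_deg, d=0.15, tol=1e-12):
    # Power iteration with the paper's convention:
    # pr(u) = d/n + (1 - d) * sum over in-neighbors v of pr(v)/out_deg(v)
    pr = [1.0 / n] * n
    while True:
        new = [d / n + (1 - d) * sum(pr[v] / out_deg[v] for v in in_nbrs[u])
               for u in range(n)]
        if max(abs(a - b) for a, b in zip(new, pr)) < tol:
            return new
        pr = new

# Old graph: a 2-cycle 0 <-> 1; converged ranks are [0.5, 0.5].
old = pagerank(2, in_nbrs=[[1], [0]], out_deg=[1, 1])

# Batch insert: new node 2 with edge 2 -> 0. Node 0 is in V_old,
# yet its true rank changes, so scaling-and-skipping V_old is not enough.
true_pr = pagerank(3, in_nbrs=[[1, 2], [0], []], out_deg=[1, 1, 1])
scaled = [r * 2 / 3 for r in old]   # Equation-5-style scaling of V_old only

gap = max(abs(a - b) for a, b in zip(true_pr[:2], scaled))
```

The new edge 2 → 0 changes node 0's in-neighborhood, so the scaled V_old ranks differ noticeably from the recomputed ones.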
Another omission in the paper is that Hybrid PR (just like STICD) won't work for graphs which have dead ends. Absence of dead ends is a pre-condition for the algorithm.
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)
HyPR: Hybrid Page Ranking on Evolving Graphs
Hemant Kumar Giri, Mridul Haque, Dip Sankar Banerjee
Department of Computer Science and Engineering, Indian Institute of Information Technology Guwahati,
Bongora, Guwahati 781015, Assam, India.
Email: girihemant19@gmail.com, {mridul.haque,dipsankarb}@iiitg.ac.in
Abstract—PageRank (PR) is the standard metric used by the
Google search engine to compute the importance of a web page
via modeling the entire web as a first order Markov chain.
The challenge of computing PR efficiently and quickly has
already been addressed by several previous works, which have
shown innovations both in algorithms and in the use of parallel
computing. The standard method of computing PR is handled
by modelling the web as a graph. The fast growing internet
adds several new web pages everyday and hence more nodes
(representing the web pages) and edges (the hyperlinks) are added
to this graph in an incremental fashion. Computing PR on this
evolving graph is now an emerging challenge since computation
from scratch on the massive graph is time consuming and
unscalable. In this work, we propose Hybrid Page Rank (HyPR),
which computes PR on evolving graphs using collaborative
executions on multi-core CPUs and massively parallel GPUs. We
exploit data parallelism via efficiently partitioning the graph
into different regions that are affected and unaffected by the
new updates. The different partitions are then processed in an
overlapped manner for PR updates. The novelty of our technique
is in utilizing the hybrid platform to scale the solution to massive
graphs. The technique also provides high performance through
parallel processing of every batch of updates using a parallel
algorithm. HyPR efficiently executes on a NVIDIA V100 GPU
hosted on a 6th Gen Intel Xeon CPU and is able to update a
graph with 640M edges with a single batch of 100,000 edges
in 12 ms. HyPR outperforms other state of the art techniques
for computing PR on evolving graphs [1] by 4.8x. Additionally
HyPR provides 1.2x speedup over GPU only executions, and 95x
speedup over CPU only parallel executions.
Index Terms—Heterogeneous Computing, PageRank,
CPU+GPU, Dynamic graphs.
I. INTRODUCTION
Link analysis is a popular technique for mining meaningful
information from real world graphs. There are a variety of
knowledge models that are typically employed by different ap-
plications. The knowledge models encode structural informa-
tion and rich relationships between the different entities which
help in the extraction of critical information. One such model
is the Hubs and Authority [2] model which essentially proves
that a web graph has multiple bipartite cores. Google [3] on the
other hand models the web graph as a first order Markov chain
which captures a user's browsing patterns of the web. This
model is used by Google to generate ranking of the different
web pages which forms the core concept for the Page Rank
(PR) algorithm and is used in Google search. The page ranking
method models a hyperlink from one page to another as an
endorsement for the destination page from the source. With the
growth of the internet, computing PR on the entire web is a
challenging task given that the graph is massive. Additionally,
the graph is evolving (also called dynamic) in nature and will
have several new pages and hyperlinks generated every day
which leads to an additional challenge of computing new page
ranks at regular intervals. While the simplest solution is to
compute the page ranks from scratch every time, it is not
feasible for any realistic use case. Hence it is necessary to
investigate methods for computing new page ranks on evolving
graphs that will not require computing PR from scratch.
Parallel PR [4–6] has found widespread success in
computing PR scores quickly. With the advent of processors such
as Graphics Processing Units (GPUs), and modern multi-core
CPUs, the parallel solutions of PR have found further
widespread success. Given the capacity for massive parallelism of
GPUs, good programmability, and strong community support,
GPUs have been able to provide sub-second performance in
PR computations on massive real world graphs. In computing
PR scores on evolving (or dynamic) graphs too, GPUs have
found good success in the past [1, 7]. Computing dynamic
PR, presents an ideal use case for heterogeneous computations
where the nature of the computations require investigations
into techniques that can provide parallelism in a hierarchical
manner. While some parallel work can be done at coarser
degrees of granularity, end page ranking scores need to be
computed at a much finer granularity.
In this paper, we present HyPR (pronounced hy-per), a
method for computing PR on dynamic graphs which uti-
lizes both CPUs and GPUs towards fast computations of the
new scores. Our method depends on two broad phases of
computation. In the first phase, we identify a partition of
the graph which will be affected by a newer set of edges
that are getting inserted. In the second phase, the actual PR
computation will be carried out. While the steps are mutually
exclusive, performing both steps on a GPU will lead
to some sequential computation. To address this, we propose
a heterogeneous method which provides higher degrees of
parallelism across the two phases and hence leads
to significant performance benefits. The concrete motivations
and contributions of our paper are as follows:
1) Parallel partitioning: As pointed out, our technique
depends on identifying portions of the graph that are
affected and unaffected by the newer batch of edges
that are updated in the graph. We present a method that
can perform the partitioning in parallel while the current set of
updates is being incorporated into the existing graph.
This reduces irregular accesses of the GPU to a large
extent, and boosts performance.
(Code available at: https://github.com/girihemant19/HyPR)
2) Parallel updates with a parallel algorithm: To extract
maximum data parallelism (and hence performance)
available on a GPU, we propose a technique that can
process a batch of updates in parallel to compute new
PR scores using a parallel algorithm which runs on the
GPU. The step that makes the parallel update possible
is a pre-processing step which is carried
out on the CPU. To the best of our knowledge, this has
not been proposed earlier.
3) Scalability: This is one of the main motivations behind
the adoption of a hybrid solution. While the updates
that happen in a batch parallel manner could be small,
the whole graph, even when represented using the most
space efficient data structures, will consume the high-
est amount of main memory. We show that a hybrid
approach will not necessitate the storage of the whole
graph on a limited GPU memory. We propose to create
minimal working sets that can effectively assimilate a
new batch of updates; the approach is constrained only when a
minimal working set exceeds the GPU memory. The much
larger host memory holds the full graph as an auxiliary
storage.
4) Benchmarking: We thoroughly evaluate our technique
on several real world graphs with up to 640M edges
(limited by the GPU memory). We perform experiments
to show that our technique outperforms the state of
the art dynamic PR techniques by 4.8x by providing PR
scores in 12 ms.
The rest of the paper is organized as follows. We provide a
background of the fundamental ideas behind our work in Section II.
The related works are discussed in Section V. We provide a
detailed methodology of our HyPR technique in Section III.
This is followed by the evaluations where we discuss and
analyze the results obtained in Section IV. Finally, we draw
some brief conclusions of our work and discuss possible
directions for extension in Section VI.
II. BACKGROUND
In this section we briefly discuss the algorithmic prelimi-
naries on which HyPR depends.
A. Baseline PageRank
PR is in general computed on a graph
G(V, E, w), where V (|V| = n) is the set of nodes representing
the web pages, E (|E| = m) is the set of edges representing the
hyperlinks, and w is the set of edge weights representing
the contribution of a node to the node it connects with via an
edge. Hence, using the random-surfer model, which accounts
for a random internet user landing at a particular page, the
PR value of a node u is given as:
pr(u) = Σ_{v ∈ I(u)} c(v → u) + d/n    (1)
In Equation 1, d (taken as 0.15 conventionally) represents
the damping factor, the probability of the random surfer
jumping to a random page rather than following a link. The
function c() is the contribution. Taking the magnitude of the
contribution to be (1 − d) times the ratio pr(v)/out_degree(v),
the contribution can be written as:

c(v → u) = (1 − d) · pr(v)/out_degree(v)    (2)
In Equation 2, the PR computations happen in a cyclic
manner as two different nodes can make contributions to each
other. The PR values are computed via iteratively computing
the PR values of all nodes until they converge. This signifies
there is very little update to the PR scores after a particular
iteration. This style of computing the PR scores is popularly
done through power iterations [3]. The genesis of power
iterations arises from the idea of the Markov chains reaching
a steady state when starting from an initial distribution. Here,
the initial PR values compose the starting states of every node,
which then go through several transitions to reach a stable
state. The PR values of all the nodes can be arranged to form
the state matrix, which has a set of eigenvalues. The eigenvectors
of the state matrix are solved for using power iterations.
By the intrinsic property of Markov chains, if the state matrix
is stochastic, aperiodic, and irreducible, then the state matrix
will converge to a stationary set of vectors which will be the
final PR scores. A basic parallel implementation of
PR is shown in Algorithm 1.
B. Dynamic PR calculations
The seed idea for HyPR stems from the batch dynamic
graph connectivity algorithm proposed by Acar et al. in [8].
Due to the iterative nature of PR, if the PR scores of all the
incoming neighbors of a node u converge in a particular iter-
ation, then the score of u will also converge in the immediate
next iteration. This nature of PR, allows the decomposition
of a directed graph into a set of connected components (CC)
which can be processed in parallel [4]. Since maintaining the CCs
of a graph in a dynamic setting is equivalent to maintaining
the connectivity of the graph, we can perform a batch update
of size B in parallel in O(lg n + lg(1 + n/B)) work per edge.
A batch here, refers to a set of updates which can be either
insertions into the existing graph, or deletions. Each entry in
a batch is arranged as a tuple (ti, ui, vi, wi, oi), where ti is a
time stamp, (ui, vi) is the edge, wi is the weight associated
with edge, and oi is the update type (insert/delete). We assume
that a batch i will have at most Bi edges. We examine the
impact of this batch size on performance later in Section IV.
As stated, the problem of computing the PR scores for
an incrementally growing graph can be treated in essence
as the computation of two CCs where one contains a set
of nodes that gets affected by the batch of updates and the
other does not. The same treatment for partitioning the graph
for incrementally computing PR scores was also done by
Desikan et al. [2], where the authors proposed scaling of the
unchanged nodes first followed by the PR computations of the
changed nodes. For a graph G(V, E, w), let W = Σ_{i=1}^{n} w_i
denote the sum of the node weights and |V| = n the order of the
graph. Every node is initialized with a PR score of
w_i/W. Consider a particular node s in the existing graph. Its PR
value can be expressed as:

PR(s) = d(w_s/W) + (1 − d) Σ_{i=1}^{k} PR(x_i)/δ(x_i)    (3)
where d is the damping factor, x_i denotes each incoming
neighbouring node pointing to s (up to k such nodes), and
δ(x_i) is the out-degree of x_i. From the fact that over some k
iterations the PR values of all the nodes get scaled by a
constant factor proportional to W, it can be deduced that for
the node s the updated PR score can be scaled as:

W · PR(s) = W′ · PR′(s),  or equivalently  PR′(s) = (W/W′) PR(s)    (4)
So, the new PR scores can easily be determined by scaling the
old PR by the factor (W/W′). W can also be taken to be the order
n(G), since every node can equally be taken to have
weight 1; Equation 4 can then be re-written as:

PR′(s) = (n(G)/n(G′)) PR(s)    (5)
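A quick numeric check of Equation 5 (sizes are illustrative, not from the paper):

```python
# Numeric sketch of Equation 5: when the graph grows from n(G) to n(G')
# nodes, old ranks are multiplied by n(G)/n(G'), freeing exactly the
# probability mass the new nodes claim.
n_g, n_g_prime = 100, 110           # illustrative sizes: 10 new nodes arrive
pr_old = [1.0 / n_g] * n_g          # a converged PR vector summing to 1

scale = n_g / n_g_prime
pr_scaled = [p * scale for p in pr_old]

freed_mass = 1.0 - sum(pr_scaled)   # mass left over for the 10 new nodes
```

The freed mass, 1 − n(G)/n(G′), is exactly the share the V_new nodes receive under a uniform 1/n(G′) initialization.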
The new nodes that are getting added into G will be required
to be put through the usual iterations to compute PR values.
In Figure 1, we have clearly demonstrated these partitions
where Bi represents the batches of updates. A set of nodes
which we denote by Vnew, are in the partition that requires
scratch computations of PR. The other partition Vold requires
to be scaled using Equation 5. The Vborder nodes, which lie
in the border area, require scaling along with a few
iterations of ranking before they converge to their final PR scores.
While the standard PR algorithm will require O(n + m)
space to be maintained in memory, partitioning the graph
effectively aids in reducing the memory requirements. If we
assume each batch to be of size ∆ edges, then in addition
to the original graph, an additional space of O(∆) needs to
be created. Given the limited space that is available on
GPUs, a combined space of O(n + m + ∆) will severely limit
scaling. In order to scale HyPR to larger sizes, we aim to
keep only O(∆) additional space on the GPU, while the rest
of the graph is maintained in host memory. This
is further refined through the partitioning, which decomposes
the PR computation into discrete space requirements of O(Vnew),
O(Vborder), and O(Vold). Each of these partitions need not
reside in the same memory at all times. We use the compressed
sparse row (CSR) representation for the graph, as
detailed in [9].
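As a reminder of that layout, a minimal CSR build (my own sketch; array and variable names are illustrative, not taken from [9] or the HyPR code):

```python
# Minimal CSR (compressed sparse row) build for a directed graph.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]   # (src, dst) pairs
n = 4

row_ptr = [0] * (n + 1)
for u, _ in edges:
    row_ptr[u + 1] += 1                    # out-degree counts
for i in range(n):
    row_ptr[i + 1] += row_ptr[i]           # exclusive prefix sum -> offsets

col_idx = [0] * len(edges)
cursor = list(row_ptr[:n])                 # per-row write cursor
for u, v in edges:
    col_idx[cursor[u]] = v
    cursor[u] += 1

# Out-neighbors of node u are col_idx[row_ptr[u] : row_ptr[u + 1]]
nbrs_of_0 = col_idx[row_ptr[0]:row_ptr[1]]
```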
C. Datasets used
Even though PR is usually computed on web graphs repre-
senting web pages and hyperlinks, their properties are similar
to other real world graphs. Hence, we choose a healthy mix
of web-graphs and real world graphs. The datasets we use are
Figure 1: Identification of nodes
Algorithm 1: parallelPR(V, outdeg, InV, γ)
Require: Set of nodes V, out-degree of each node in outdeg, incoming neighbors in InV.
Ensure: PageRank p of each node
1: err = ∞
2: for all u ∈ V do
3:   previous(u) = d/|V|
4: end for
5: while err > γ do
6:   err = 0
7:   for all u ∈ V in parallel do
8:     p(u) = d/|V|
9:     for all x ∈ InV(u) do
10:       p(u) = p(u) + previous(x)/outdeg(x) · (1 − d)
11:     end for
12:   end for
13:   for all u ∈ V do
14:     err = max(err, abs(previous(u) − p(u)))
15:     previous(u) = p(u)
16:   end for
17: end while
18: return p
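A direct Python rendering of Algorithm 1 may make the listing easier to check (my own sketch: sequential loops stand in for the "in parallel" ones, and err is recomputed every sweep so the convergence test can terminate):

```python
def parallel_pr(V, outdeg, InV, gamma=1e-10, d=0.15):
    # Python rendering of Algorithm 1; the GPU-parallel loops become
    # plain loops, and err is recomputed fresh each sweep.
    previous = {u: d / len(V) for u in V}
    p = dict(previous)
    err = float("inf")
    while err > gamma:
        for u in V:                                  # parallel on the GPU
            p[u] = d / len(V)
            for x in InV[u]:                         # incoming neighbors
                p[u] += previous[x] / outdeg[x] * (1 - d)
        err = max(abs(previous[u] - p[u]) for u in V)
        previous = dict(p)
    return p

# Two-node cycle: ranks converge to 0.5 each under this convention.
ranks = parallel_pr([0, 1], outdeg={0: 1, 1: 1}, InV={0: [1], 1: [0]})
```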
collected from the University of Florida Sparse Matrix Collec-
tion [10] and Stanford Network Analysis Project (SNAP) [11].
The datasets which we have selected range from 2.9M edges
to around 640M edges, as detailed in Table I.
III. METHODOLOGY
In this section we will briefly discuss the approach
adopted for implementing HyPR. We first explain a basic
overview of the approach that we take and then provide
a detailed explanation on the implementation strategies. Al-
gorithm 2, shows the basic set of steps that are adopted
towards the implementation. Figure 2 shows the overview
of the steps that are implemented. In a broad sense, HyPR
works as concurrent (or overlapped) phases of the following.
1) Partitioning: The parallel cores of the CPU create the
Table I: Datasets Used
Graph Name Sources |V | |E| Type
1. Amazon [11] 0.41 M 3.35 M Purchasing
2. web-Google [11] 0.87 M 5.10 M Web Graph
3. wiki-Topcat [11] 1.79 M 28.51 M Social
4. soc-pokec [11] 1.63 M 30.62 M Social
5. Reddit [11] 2.61 M 34.40 M Social
6. soc-LiveJournal [11] 4.84 M 68.99 M Social
7. Orkut [11] 3.00 M 117.10 M Social
8. Graph500 [11] 1.00 M 200.00 M Synthetic
9. NLP [10] 16.24 M 232.23 M Optimization
10. Arabic [10] 22.74 M 639.99 M Web Graph
partitions Vold, Vnew, and Vborder as discussed in Section II-B.
2) Transfer: Perform asynchronous transfer of ∆ sized batch
to the GPU memory. 3) PR calculation: Depending on the
type of the partition, either scale or calculate new PR scores
on the GPU.
A. Why Hybrid?
Before we discuss HyPR design in detail, we first motivate
the requirement for a hybrid solution. As stated earlier, our
motivation for a heterogeneous solution is threefold.
In the first place, a hybrid solution allows the PR calculation
to scale to very large sizes which is otherwise limited by
the GPU main memory size. As discussed in Section II-B,
only O(|V |+∆) space is needed for computing the updated
PR values for a particular batch of size Bi. The static graph
which is undergoing updates resides in the much larger main
memory of the host and only aids in creating the partitions. In
the second place, if we ignore scaling, we need to perform
partitioning, followed by parallel computations of the PR
scores, which will result in limited GPU utilization as these
steps are not independent. Also the degree of parallelism is
limited for the partitioning step in comparison to the scaling
and PR updates. Hence, for the coarsely parallel partitioning step,
the CPU is an ideal device and the GPU is more suited for
the finely parallel update and scaling. We show how a GPU
only execution is actually slower than the hybrid technique in
Section IV. In the last place, a hybrid technique allows us to
accrue higher system efficiency which would otherwise have
the CPU sitting idle while the GPU is updating the PR scores.
B. Graph Partitioning:
We start the computation on a graph G(V, E, w) where
there already exists some PR scores that have been computed
previously. The algorithmic phases that HyPR goes through are
outlined in Algorithm 2. In successive time intervals, we will
have a set of batch updates that arrive. As discussed previously
in Section II-B, the batches consist of a heterogeneous mix
of edges that are either to be inserted or deleted. A batch Bi
can be represented by the tuple (ti, ui, vi, wi, oi), as discussed
in Section II-B.
For updating the batch in parallel, we identify three connected
components (CC) that strictly require different computations.
If we consider the existing graph as one single CC
(say Co), and the new batch of updates as a separate CC (say
Cn), then we can construct a block graph S where there will
be some incoming edges from Cn to Co if there exists edge
updates (ui, vi) where ui ∈ Co and vi ∈ Cn. We can assume
that the new batch of updates is topologically sorted since the
PR scores of the new nodes in Bi is guaranteed to be lower
than those in Co.
In an auxiliary experiment, we tested the nature of these
CCs to see if they form strongly connected components (SCC)
so as to arrive at a formulation similar to [4]. Decomposing
a graph into a set of SCCs provides the advantage of doing a
topological ordering of the partitions where PR computations
(or re-computations) of the scores can be done in a cascaded
manner with the scaling of Vborder, and Vold nodes first,
followed by the PR calculation of Vborder and Vnew. This
order of cascaded updates has also been adopted to update the
PR scores, as proved in [2]. We found that if we re-compute
the three partitions using Kosaraju's algorithm (cf. [12]) to
identify the Vold, Vnew, and Vborder partitions, approximately
2% of the edges on the Arabic dataset differ when compared
to our mechanism of partitioning. Hence, we can conclude
that our mechanism produces approximate SCCs which can
be easily monitored for only the 2% extraneous edges so as
to maintain a proper topological order while performing the
batch updates. We do a small book-keeping of these edges in
order to correctly partition the edges.
We can now identify three partitions for every batch Bi
(i) Vold which is a set of nodes that are already there in the
existing graph (ii) Vnew is a set of vertices which are entirely
new nodes which are to be added to the existing graph and
can be found as Vi − (Vi ∩ Vold) if Vi is the set of nodes in
batch Bi (iii) Vborder which is the set of nodes which have
edges in Bi connecting Vold and Vnew and is reachable using
a breadth first traversal. As an example, in Figure 1 we can see
the Vold nodes in red, Vnew nodes in green, and the Vborder
nodes, which are the collection of nodes in
the first hop having a direct connection with Vnew.
Essentially, all the nodes in G without the yellow nodes are
Vborder.
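My set-based reading of these three partitions for an insertion batch, ignoring the BFS-reachability refinement the paper mentions (toy values, illustrative names; a sketch, not the authors' code):

```python
# Existing nodes of G and an insertion batch of (u, v) edges (toy values).
V_existing = {0, 1, 2, 3}
batch = [(0, 4), (4, 5), (2, 4)]

V_batch = {x for e in batch for x in e}
V_new = V_batch - V_existing            # entirely new nodes: V_i - (V_i ∩ V_old)
V_old_touched = V_batch & V_existing    # existing nodes appearing in the batch

# Border: existing endpoints of batch edges that link V_old and V_new.
V_border = set()
for u, v in batch:
    if (u in V_existing) != (v in V_existing):
        V_border |= {u, v} & V_existing
```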
As, we can see in Algorithm 2, the first phase of the
update operation is the creation of these partitions. Since
the identifications of these vertices are independent of each
other, they can be done in parallel. It is critical to note that at
this stage the number of edges that are present in a single batch
does not warrant any form of edge parallelism on GPU as that
will lead to lower utilization of the available GPU bandwidth
at the cost of high memory transfers.
C. Pre-processing
In the pre-processing phase, we compute the Vold, Vnew, and
Vborder partitions in a manner overlapped with the GPU. The
sequence of operations followed for every incoming
batch of updates is: 1) Pre-process the first batch in parallel
using OpenMP threads on the CPU. 2) Transfer Vold, Vnew,
and Vborder to the GPU for scaling and PR computation in
an asynchronous manner. 3) Start pre-processing the second
batch of updates on the CPU as soon as the first partition is
Figure 2: Overview of HyPR execution steps. (a) Graph with batch updates. (b) Identification of border, new, and unchanged nodes. (c) Scaling on unchanged and border nodes. (d) PageRank computation on border and new nodes.
handed off to GPU for ranking. Algorithm 3 demonstrates the
partitioning mechanism, which is called from Line 2 of
Algorithm 2. As mentioned earlier, the batch of updates contains a
heterogeneous mix of both insert and delete operations which
have to be handled uniquely.
1) Insertion: Insertion is more compute intensive than
deletion. In Lines 6-22 of Algorithm 3 we depict our insertion
mechanism. We first populate Vnew with all the nodes of the
batch and process all the elements in Vnew using available
CPU threads to successively send the node to the appropriate
partition. We first check if the source node of an edge (ui, vi)
belongs to the existing partition or not. If it is, then that
particular node is put in Vold. For computing the Vborder, we
check if the particular node is reachable in G (the existing
graph) in a breadth-first manner and has a predecessor in the
incoming batch. The intuition behind this is that the Vborder
will be the set of nodes that are reachable from the Vnew
set of nodes and hence will undergo both scaling and PR
computation. So in parallel, all the nodes in Vnew are popped
and either classified in Vold or Vborder. The remaining ones in
Vnew are the entirely new set of nodes that
have come in Bi, whose new PR values need to be computed.
2) Deletion: Deletion is much simpler than insertion and
hence less compute intensive. In case of deletion, the Vnew set
will be NULL, and the only nodes that will be involved are
the Vold and Vborder sets. As we can see from Lines 23-33 of
Algorithm 3, if the update type oi is delete, then we remove
the nodes involved from Vold, which contains the nodes
of the original graph. These will require a newer set of PR
computations on the Vnew nodes that will be handled during
the PR update step in Algorithm 2. Additionally, the removals
will induce a newer set of Vborder nodes for which, we will
see if any of the reachable successors or predecessors of the
removed nodes are present. Such nodes will be pushed into
Vborder.
D. Scaling the old nodes
The pre-processing step essentially allows us to perform
data parallel scaling and PR computations on the individual
partitions. As discussed earlier, the primary idea behind HyPR
is the localization of the set of nodes that will be affected by
the new batch of updates. As we can now see from Lines
6-11 in Algorithm 2, we call a GPU kernel to scale the
nodes of the Vold partition using the Equation 5 discussed
in Section II-B. We can now achieve full GPU bandwidth
saturation as |Vold| number of threads can be spawned to scale
all the nodes in parallel. It is critical to note here that in the
hybrid implementation, there will be intermediate transfers that
will be necessary which will not require the entire G (which
is basically V nodes from G initially) to be copied every time.
Rather, the original graph G is copied to GPU before the batch
processing starts and is augmented with Vnew after every batch
is processed. As we can see in Lines 9-11 of Algorithm 2,
the scaling operation is performed accordingly on the Vborder
nodes as well. Scaling of the nodes in the GPU is required
for all the Vold and Vborder sets of nodes. The actual scaling
operation is an O(1) operation, which makes it most suitable
for a massively parallel implementation on the GPU. Threads
equal to the number of nodes involved in the scaling process
(Vold or Vborder) are spawned on the GPU for executing the
scaling kernels in a SIMT manner.
E. Page Rank Update
The PR update of the Vnew and Vborder nodes is now a
much lighter computation owing to the partitions. As with the
standard parallel PR implementation shown in Algorithm 1,
the power iterations for computing the PR scores continue until
the scores converge to an error threshold γ (set to 10^-10).
However, since the Vborder set undergoes a step of scaling
before the PR update step, the number of iterations required
for the scores to converge will be much lower than the case
of computing from scratch. So, during the PR update step for
Vborder, and Vnew, as shown in Lines 12-17 of Algorithm 2,
we call the parallelPR() of Algorithm 1. For the Vnew nodes,
the computation is trivial since the number of nodes is low,
as it contains only the new nodes being added in a new
batch. The Vborder set, although much bigger in size than
the Vnew set, will also call parallelPR(). However, it will
converge much quicker since it has undergone a step of
scaling previously.
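The warm-start claim is easy to sanity-check: starting power iteration near the fixed point (as the scaled Vborder values are) takes far fewer sweeps than starting from scratch. A small sketch under my own d = 0.15 convention (graph and starting vectors are illustrative):

```python
def pr_sweeps(n, in_nbrs, out_deg, start, gamma=1e-10, d=0.15):
    # Count power-iteration sweeps until the max per-node change < gamma.
    prev, sweeps = list(start), 0
    while True:
        cur = [d / n + (1 - d) * sum(prev[v] / out_deg[v] for v in in_nbrs[u])
               for u in range(n)]
        sweeps += 1
        if max(abs(a - b) for a, b in zip(cur, prev)) < gamma:
            return sweeps
        prev = cur

# 3-cycle 0 -> 1 -> 2 -> 0; the fixed point is 1/3 per node.
in_nbrs, out_deg = [[2], [0], [1]], [1, 1, 1]
cold = pr_sweeps(3, in_nbrs, out_deg, start=[0.15 / 3] * 3)
warm = pr_sweeps(3, in_nbrs, out_deg,
                 start=[1 / 3 + 1e-6, 1 / 3 - 1e-6, 1 / 3])
```

Here `warm` comes out well below `cold`, mirroring the paper's argument for scaling Vborder before ranking it.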
For the deletions, the PR update will be required only for
the Vold set and the Vborder set of nodes. As we can see from
Lines 19-26 in Algorithm 2, the same PR update process will
Algorithm 2: HyPR: Hybrid Page Ranking on G with incremental batches Bi
Require: Scratch graph G and k batches in B represented by (ti, ui, vi, wi, oi), PageRank vector dest, out-degree of each node in outdeg, incoming neighbors of every node in InV.
Ensure: Rank of the nodes in vector dest
{Phase 1: Pre-processing phase}
1: CPU:: Partition the incoming batches based on insertion and deletion
2: (Vold, Vborder, Vnew) = createPartition(G, B)
3: CPU:: Queue Vold, Vborder, Vnew for async transfer to GPU
{Phase 2: PR update}
4: INSERTION: Generate threads equal to the number of Vborder, Vnew nodes
5: if (oi == insert) then
6:   for ∀u ∈ Vold in parallel do
7:     GPU:: dest[u] = (|V| · dest[u]) / |Vold|  {Scaling}
8:   end for
9:   for ∀x ∈ Vborder in parallel do
10:     GPU:: dest[x] = (|V| · dest[x]) / |Vborder|  {Scaling}
11:   end for
12:   for ∀z ∈ Vborder in parallel do
13:     GPU:: dest[z] = parallelPR(Vborder, outdeg, InV[z])
14:   end for
15:   for ∀y ∈ Vnew in parallel do
16:     GPU:: dest[y] = parallelPR(Vnew, outdeg, InV[y])
17:   end for
18: end if
19: DELETION: Generate threads equal to the number of Vold and Vborder nodes
20: if (oi == delete) then
21:   for ∀u ∈ Vborder in parallel do
22:     GPU:: dest[u] = parallelPR(Vborder, outdeg, InV[u])  {PR Update}
23:   end for
24:   for ∀v ∈ Vold in parallel do
25:     GPU:: dest[v] = (|V| · dest[v]) / |Vold|  {Scaling}
26:   end for
27: end if
be applied first for the Vborder set. The Vold set will simply
undergo a step of scaling similar to the case of insertions.
F. CUDA+OpenMP implementation
We can observe a snapshot of the overlapped execution
model in Figure 4. The target performance critically depends
on creating the perfect balance of computations occurring
on the CPU and the GPU. The CPU is responsible for
creating the partitions, and transferring them to the GPU. The
GPU on the other hand is responsible for performing the three
kernel operations of scaling, and two PR update operations. To
achieve that, we make use of the synchronous CUDA kernel
Algorithm 3: createPartition(G, B)
Require: Graph G and k batches in B, each entry bi represented by (ti, ui, vi, wi, oi) where oi denotes insert or delete
Ensure: Vold, Vnew, Vborder
1: CPU:: Generate threads using OpenMP
2: Initialize Vold, Vnew, Vborder = φ, Vtemp = φ
3: Push ∀(u, v) ∈ batch to Vnew
4: Push ∀u ∈ G to Vold
5: for ∀bi ∈ B in parallel do
6:   if (oi == insert) then
7:     while (Vnew != NULL) do
8:       Pop element x ∈ Vnew
9:       if (x ∈ Vtemp) then
10:         Continue
11:       end if
12:       Push x to Vtemp
13:       for every successor y of x ∈ G do
14:         Push y to Vtemp
15:       end for
16:     end while
17:     for ∀z ∈ Vtemp do
18:       for ∀ predecessors li ∈ bi do
19:         Push li into Vborder
20:       end for
21:     end for
22:   end if
23:   if (oi == delete) then
24:     for ∀(u, v) ∈ bi do
25:       Choose u and v from Vold
26:     end for
27:     for ∀y, successor of (u, v) ∈ G do
28:       Push y into Vborder
29:     end for
30:     for ∀y, predecessor of (u, v) ∈ G do
31:       Push y into Vborder
32:     end for
33:   end if
34: end for
35: return Vold, Vnew, Vborder
calls, asynchronous transfers, and CUDA streams to orches-
trate the entire execution model. For creating the partitions,
we utilize CPU threads created using the OpenMP library. We
create threads equal to the number of processing cores that are
available. It is natural to understand that the batch sizes will
be much bigger than the number of threads. We use standard
blocking of the batches for each of the threads to handle.
Despite large batches, this provides good performance because
the partitioning operation is itself simple in nature and does
not involve CPU-intensive operations. Additionally, the
partitioning is an irregular operation,
which the CPU is much better at handling in comparison to
the GPU. CUDA streams are created before the start of the
operation. Once the CPU finishes the partitioning operation on
a particular batch, cudaMemcpyAsync() calls are issued for the
three partitions on individual streams. CUDA events
associated with the copy operations monitor completion.
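A toy rendering of this overlapped model in plain Python (threads and a bounded queue stand in for OpenMP, CUDA streams, and cudaMemcpyAsync(); everything here is my own sketch, not the HyPR code):

```python
import queue
import threading

# A 'CPU' producer partitions batch i+1 while a 'GPU' consumer processes
# batch i. The bounded queue hand-off mimics one batch in flight.
batches = [list(range(i * 10, (i + 1) * 10)) for i in range(4)]
ready = queue.Queue(maxsize=1)     # double-buffering: one batch in flight
results = []

def gpu_consumer():
    while True:
        part = ready.get()
        if part is None:           # sentinel: no more batches
            break
        results.append(sum(part))  # stand-in for scaling + parallelPR kernels

worker = threading.Thread(target=gpu_consumer)
worker.start()
for b in batches:
    ready.put(sorted(b))           # stand-in for CPU-side partitioning
ready.put(None)
worker.join()
```

With a FIFO queue and a single consumer, per-batch results come back in submission order, just as the per-batch CUDA streams preserve ordering within a batch.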
IV. PERFORMANCE EVALUATION
In this section we discuss the experiments that we perform
to validate the efficacy of our solution, and also analyze the
performance.
A. Experimental environment:
For conducting our experiments we use a platform that has a multicore CPU connected to a state-of-the-art GPU via a PCIe link. The CPU is an Intel(R) Xeon(R) Silver 4110 based on the Skylake micro-architecture. Two of these CPUs, each with 8 cores, are arranged in two sockets, effectively providing 16 NUMA cores. The cores are clocked at 2.1 GHz with 12 MB of L3 cache. The host is attached to an NVIDIA V100 GPU which has 5120 CUDA cores spread across 80 streaming multiprocessors (SMs). The GPU cores are clocked at 1.38 GHz, and the GPU has 32 GB of global memory. The GPU is connected to the CPU via a PCIe Gen2 link. The machine runs the CentOS 7 OS. For multi-threading on the CPU, we use OpenMP version 3.1 and GCC version 4.8. The GPU programs are compiled with nvcc from CUDA version 10.1 with the -O3 flag. All experiments have been averaged over a dozen runs.
For our experiments, we use the real-world graphs shown in Table I. The datasets do not possess any timestamps of their own. As done in previous works [1, 13], we simulate a random arrival of edge updates by setting the timestamps randomly; the updates are then applied in increasing order of timestamp. For evaluation, we adopt the sliding-window model, where we take a certain percentage of the original dataset to construct the batches. These are then varied to measure the performance.
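The timestamp simulation and sliding-window batching described above can be sketched as follows. The function name, the 50% initial-graph split, and the per-batch fraction are illustrative parameters (the paper varies the batch size between 1% and 10%); only the overall procedure (random arrival order, then fixed-size windows over the remainder) follows the text.

```python
import random

def make_batches(edges, start_fraction=0.5, batch_fraction=0.01, seed=0):
    """Simulate timestamped edge arrival and slice it into update batches.

    Returns (initial_edges, batches): the edges used to build the starting
    graph, and a list of equally sized update batches in arrival order.
    """
    rng = random.Random(seed)
    # random timestamps == a random permutation of the arrival order
    stamped = sorted(edges, key=lambda e: rng.random())
    split = int(len(stamped) * start_fraction)
    initial, remaining = stamped[:split], stamped[split:]
    size = max(1, int(len(edges) * batch_fraction))
    batches = [remaining[i:i + size] for i in range(0, len(remaining), size)]
    return initial, batches
```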
B. Update time
In this section we discuss the performance of HyPR in the context of update times. As mentioned earlier, we start the computations by taking half of the edges in the entire graph dataset. We then measure the update times as the sliding window moves to generate a set of batches. The reported update times are recorded using CUDA event timers to capture the overlapped pre-processing, transfer, and kernel execution times (shown in one blue box in Figure 4). A timer is started before the OpenMP parallel section, where 15 threads perform parallel partitioning of a batch and one thread performs the asynchronous transfers and kernel calls. The timer is stopped once the CUDA events indicate the end of the final update kernel.
We have experimented with different batch sizes from 1% to 10%. In Figure 3, we can see the latencies achieved over all the graphs. For the largest graph, Arabic, the average update time that we achieve is 85.108 ms. In Figure 3, we also show the comparative time with the "Pure GPU" implementations. This is done in order to validate our claim that a hybrid implementation, which is able to efficiently overlap the PR computations with the CPU-side partitioning and transfers, will show better performance. On average, we observe that HyPR achieves a 1.1843x speedup over the "PureGPU" performance on the 5 largest graphs that we experiment with, and 1.2305x over all the graphs. It can be observed that the speedups achieved are more pronounced on the larger graphs in comparison to the smaller ones. This is because the larger graphs provide higher degrees of parallelism, which allows the GPU to spawn more threads, and the CPU-side pre-processing allows a higher degree of overlap. This can be better explained using the experiment discussed in the next section. We also execute HyPR on a multi-core CPU using only OpenMP, which we call "Pure CPU". We observe around a 95x improvement over this baseline, which is orders of magnitude slower than "Pure GPU" and hence not shown in Figure 3.
C. Hybrid Overlaps
As stated in Section I, one of the major motivations behind doing a hybrid computation is to perform the partitioning of the graph iteratively while the GPU is busy updating the PR scores. We first show how the pipeline that has been set up works for the Arabic graph. The large graphs become the best use-cases for this pipelined execution, as they are able to achieve the best balance of the computations between the CPU and GPU sides. As we can observe from Figure 4, the execution begins with the partitioning of the first batch B1 of updates, which does not involve any overlap. Once the partitioning completes, the asynchronous transfer of the partitions to the GPU is done, which allows the CPU to process batch B2 immediately. The transfer, which is registered with a callback, automatically spawns the kernels as soon as it completes. We can see that for a modest batch size of 10,000 edges, it takes 74.43 ms to pre-process the B2 batch, which fully masks the 41.12 ms of transfer and 34.77 ms required by the kernels.
This behavior is the best-case scenario, which is however not observed for all graphs. We can see from Figure 5 that for the smaller graphs (Figure 5(a-f)), the difference between the partitioning time and the transfer+kernel time is -1.81 ms on average, i.e., the overlap is imperfect. In Figure 5, the batch sizes are kept uniform (at 10K edges, approximately 1% of the dataset). For several of the smaller graphs (soc-pokec, Reddit, Orkut, and Graph-500), we can observe that across batches the two curves cross each other at multiple points. This can be attributed to the structural heterogeneity of the batches. If a batch, for example, contains too many new nodes, then the kernel time for calculating new PR scores on Vborder and Vnew will be higher than the pre-processing time. The reverse scenario, which is more conducive, is when the batches have a healthy mix of new and old nodes. This creates the right kind of overlap, as shown in Figure 4, where the waiting times of the GPU are minimized.
An inflection point is observed for NLP (Figure 5(i)) and for Arabic (Figure 5(j)), where we can see that the partitioning
Figure 3: Update Time. (Per-graph panels (a) Amazon, (b) Web google, (c) Wiki topcat, (d) Soc-Pokec, (e) Reddit, (f) Soc LiveJournal, (g) Orkut, (h) Graph500, (i) NLP, (j) Arabic; each plots Time (ms) against batch sizes 1%, 5%, 10% for HyPR vs. PureGPU.)
Figure 4: Overlapping of the pre-processing time with transfer and kernel time for the Arabic graph.
time exceeds the transfer+kernel time at a particular batch size. The partitioning times remain higher for the remaining larger graphs. This indicates the scalability of the CPU-side partitioning across different batches. On average, the difference between the two is 4.95 ms for the larger graphs (Figure 5(g-j)). Hence, we can conclude that HyPR is data driven, and shows its best performance on the datasets that are able to extract the maximum amount of system performance. The system efficiency achieved by HyPR is discussed in the next sub-section.
D. Resource Utilization
In this experiment, we investigate the system efficiencies achieved by HyPR. Towards that, we profile the utilization of the resources through measurements of the memory utilization and the warp occupancy of the GPU. We used the nvprof profiler from the CUDA toolkit with two profiling metrics: first, achieved occupancy, which is the number of warps running concurrently on an SM divided by the maximum warp capacity; second, gld efficiency, which is the ratio of the requested global memory load throughput to the required global memory load throughput, i.e., how well the coalescing of DRAM accesses works. Figure 6 shows these two profiling results for different batch sizes. We can observe that the occupancies scale linearly with increasing batch sizes. The global load efficiency indicates increased coalesced accesses on every batch. On average we achieve a global load efficiency of 61.07% and a warp occupancy of 64.14%.
E. Comparative Analysis
We now compare the performance of HyPR with some of the state-of-the-art solutions for dynamic PR. We mainly compare our work with GPMA [1], GPMA+ [1], and cuSparse [14]. GPMA exploits a packed memory array (PMA) to handle dynamic updates by storing sorted elements in a partially contiguous manner that favors dynamic updates. The cuSparse library has efficient CSR implementations for sparse matrix-vector multiplications. We implement a basic PR update mechanism (purely on the GPU) using cuSparse, where the new incoming batch iteratively undergoes SpMV operations until the PR values converge. For performing the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1], which is an Intel Xeon CPU connected to a Titan X Pascal GPU, and on the same datasets. Additionally, the PR scores are derived after running all the experiments till convergence or 1000 iterations (whichever is earlier). Towards that, we use an additional graph "Random" having 1M nodes and 200M edges. From Figure 7, we can observe that HyPR outperforms the state-of-the-art GPMA, GPMA+, and cuSparse implementations on four of the largest graphs. On average over all four graphs used, HyPR outperforms GPMA by 4.81x, GPMA+ by 3.26x, and cuSparse by 102.36x.
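The cuSparse baseline described above simply iterates SpMV until the PR values converge. A minimal CPU analogue of that baseline, operating on CSR arrays (`indptr`, `indices`) with a pure-Python SpMV loop, is sketched below; the function name and tolerance are illustrative, and it assumes no dangling vertices, which is also a precondition of the paper's incremental algorithm.

```python
def pagerank_spmv(indptr, indices, n, d=0.85, tol=1e-10, max_iter=1000):
    """PageRank by repeated SpMV on a CSR adjacency (row u lists successors).

    Assumes every vertex has at least one outgoing edge (no dead ends).
    """
    outdeg = [indptr[u + 1] - indptr[u] for u in range(n)]
    pr = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [(1.0 - d) / n] * n
        for u in range(n):
            # one SpMV step: scatter u's rank share to its successors
            share = d * pr[u] / outdeg[u]
            for v in indices[indptr[u]:indptr[u + 1]]:
                nxt[v] += share
        if sum(abs(a - b) for a, b in zip(nxt, pr)) < tol:
            return nxt
        pr = nxt
    return pr
```

On the GPU, each iteration's inner loops become a single cuSparse SpMV call; the structure of the iteration is the same.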
F. Accuracy Analysis
For checking the accuracy of HyPR, all the PR scores are pre-computed using nvGraph [15]. nvGraph, part of the NVIDIA CUDA Toolkit, is a state-of-the-art solution that provides millisecond performance for computing PR scores on a static graph with 500 iterations. We use nvGraph for computing the from-scratch PR scores of the graph. The batch updates are then applied to the graph through the HyPR method, and the results are checked against the nvGraph scores obtained by feeding it a graph that has the batches included. Once all the batch
(Margin note: Older GPU.)
Figure 5: Overlaps achieved between partitioning and transfer+kernel times for 10 batches B1-B10 of uniform size. (Panels (a) Amazon, (b) Web-google, (c) Wiki-topcat, (d) soc-pokec, (e) Reddit, (f) soc-LiveJournal, (g) Orkut, (h) Graph-500, (i) NLP, (j) Arabic; each plots Preprocessing Time and Transfer and Kernel Time in ms per batch.)
Figure 6: Resource Utilization. (Panels (a) Reddit, (b) Graph500, (c) Orkut, (d) NLP; each plots Achieved Warp Occupancy and Global Load Efficiency as a ratio against batch sizes 10^1, 10^3, 10^5.)
Figure 7: Comparison of HyPR with GPMA, GPMA+, and cuSparse. (Panels (a) Reddit, (b) soc-Pokec, (c) Graph500, (d) Random; each plots Time (ms) on a log scale against batch sizes 10^2, 10^3, 10^5, 10^6.)
Figure 8: Accuracy of PR scores. (Jaccard similarity of the top-10, top-20, and top-30 sets, and divergence, for Orkut, Graph500, NLP, and Arabic.)
updates are completed, we compare the PR scores with the pre-computed ones. To check for accuracy, we introduce batch updates with insert and delete operations on the nodes with the highest PR scores in the existing graph. This is done in order to impose maximum change on the PR scores due to an update. In Figure 8, we show the Jaccard similarities of the PR scores computed by HyPR with those computed by nvGraph for the top 10, 20, and 30 nodes with the highest PR scores. We see that on average the similarity score comes to 0.985 for a fixed 500 iterations (same as that used for computing the scores using nvGraph). The similarity score comes to 0.991 on average if we allow the different graphs to iterate until they reach the convergence threshold γ. We see that on average the divergence of the PR scores computed by HyPR from those of nvGraph is less than 0.001% (shown on the y2 axis).
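The top-k Jaccard comparison above can be sketched as a small helper; the function name and dict-of-scores interface are illustrative assumptions.

```python
def topk_jaccard(scores_a, scores_b, k):
    """Jaccard similarity of the top-k vertex sets of two PR score maps.

    scores_a, scores_b : dicts mapping vertex -> PR score
    """
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    a, b = top(scores_a), top(scores_b)
    return len(a & b) / len(a | b)
```

A score of 1.0 means the two rankings agree exactly on the top-k set (ignoring order within it), which is the sense in which the 0.985 and 0.991 figures above should be read.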
V. RELATED WORKS
GPU-based PR has been explored in the works done by Duong et al. [16] and Garg et al. [4]. In [16], the authors propose a new data structure for graph representation named the link structure file. In their work they target the steps in the PR computation where sufficient data parallelism exists. These steps are distributed among multiple GPUs, where each thread performs finer-grained work. Garg et al. [4] provide algorithmic techniques for partitioning the graph based on its structural properties to extract parallelism.
PR on evolving graphs has been explored by Sha et al. [1], who propose two algorithms, GPMA and GPMA+, based upon the packed memory array. GPMA is a lock-based approach in which the few concurrent update conflicts are handled efficiently. GPMA+ is a lock-free, bottom-up approach which prioritizes updates and favors coalesced memory access. Another work is by Feng et al. [17], who propose the DISTINGER framework. DISTINGER employs a hash-partitioning-based scheme that favors massive graph updates and message passing among the partition sites using MPI. Another algorithm to compute personalized PR on dynamic graphs was published by Guo et al. [13], which also exploits GPUs for performance. Similar to HyPR, the computations proposed in [13] are also done in a batched manner. To enhance the performance of parallel push, different optimization techniques are introduced. One of them is eager propagation, which minimizes the number of local push operations. They also propose a frontier generation method that keeps track of vertex frontiers while cutting down the synchronization overhead of merging duplicate vertices. Batch parallelism for dynamic graphs has also been the subject of several theoretical studies. A generic framework for batch parallelism is proposed by Acar et al. in [8]. Batch parallelism for graph connectivity and other problems in the massively parallel computation (MPC) model is explored by Dhulipala et al. in [18].
The work by Desikan et al. [2] is one of the earliest efforts towards incremental PR computation on evolving graphs. The authors proposed the partitioning and scaling techniques which we modify for parallelization on a heterogeneous platform. To the best of our knowledge, no other work exists that explores a hybrid CPU+GPU solution for computing global PR. In HyPR we propose techniques for PR computation that use batch parallelism in unison with fully parallel partitioning and PR update mechanisms on a hybrid platform to extract high performance.
VI. CONCLUSION AND FUTURE WORK
In this work, we propose HyPR, a hybrid technique for computing PR on evolving graphs. We have shown an efficient mechanism to partition the existing graph and updates into data-parallel work units which can be updated independently. HyPR is executed on a state-of-the-art high performance platform and exhaustively tested against large real-world graphs. HyPR is able to provide substantial performance gains of up to 4.8x over other existing mechanisms and also extracts generous system efficiency. In the near future, we plan on extending HyPR by spreading the computations across multiple GPUs on shared and distributed memories. Communication between distributed nodes will become an additional overhead to handle in that case. Additionally, modern HPC systems are equipped with newer-generation interconnects like NVLink, which deserve to be explored in the context of page ranking.
VII. ACKNOWLEDGMENT
This work is supported by Science and Engineering Re-
search Board (SERB), DST, India through the Early Career
Research Grant (no. ECR/2016/002061) and NVIDIA Corpo-
ration through the GPU Hardware Grant program.
REFERENCES
[1] M. Sha, Y. Li, B. He, and K.-L. Tan, “Accelerating dynamic
graph analytics on GPUs,” Proc. of the VLDB Endow., vol. 11,
no. 1, pp. 107–120, 2017.
[2] P. Desikan, N. Pathak, J. Srivastava, and V. Kumar, “Incremental
page rank computation on evolving graphs,” in 14th Interna-
tional WWW, 2005, pp. 1094–1095.
[3] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank
citation ranking: Bringing order to the web.” Stanford InfoLab,
Tech. Rep., 1999.
[4] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques
for efficient parallel PageRank computation on Real-World
Graphs,” in Proceedings of the 17th ICDCN, 2016, pp. 1–10.
[5] D. Gleich, L. Zhukov, and P. Berkhin, “Fast parallel pagerank: A linear system approach,” Yahoo! Research Technical Report YRL-038, available via http://research.yahoo.com/publication/YRL-038.pdf, vol. 13, p. 22, 2004.
[6] A. Cevahir, C. Aykanat, A. Turk, and B. B. Cambazoglu, “Site-
based partitioning and repartitioning techniques for parallel
PageRank computation,” IEEE TPDS, vol. 22, no. 5, pp. 786–
802, 2011.
[7] M. Kim, “Towards exploiting GPUs for fast PageRank computa-
tion of large-scale networks,” in Proceeding of 3rd International
Conference on Emerging Databases, 2013.
[8] U. A. Acar, D. Anderson, G. E. Blelloch, and L. Dhulipala,
“Parallel batch-dynamic graph connectivity,” in The 31st ACM
SPAA, 2019, pp. 381–392.
[9] Compressed Sparse Column Format (CSC), https://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html.
[10] The University of Florida Sparse Matrix Collection, https:
//snap.stanford.edu/data.
[11] J. Leskovec and A. Krevl, SNAP Datasets: Stanford Large
Network Dataset Collection, http://snap.stanford.edu/data.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,
Introduction to Algorithms, Third Edition, 3rd ed. The MIT
Press, 2009.
[13] W. Guo, Y. Li, M. Sha, and K.-L. Tan, “Parallel personalized
pagerank on dynamic graphs,” Proc. of the VLDB Endow.,
vol. 11, no. 1, pp. 93–106, 2017.
[14] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “Cus-
parse library,” in GPU Technology Conference, 2010.
[15] NVGraph toolkit documentation, https://docs.nvidia.com/cuda/
cuda-runtime-api/index.html.
[16] N. T. Duong, Q. A. P. Nguyen, A. T. Nguyen, and H.-
D. Nguyen, “Parallel PageRank computation using GPUs,” in
Proce. of the 3rd Symposium on Information and Communica-
tion Technology, 2012, pp. 223–230.
[17] G. Feng, X. Meng, and K. Ammar, “Distinger: A distributed
graph data structure for massive dynamic graph processing,” in
International Conference on Big Data. IEEE, 2015, pp. 1814–
1822.
[18] L. Dhulipala, D. Durfee, J. Kulkarni, R. Peng, S. Sawlani,
and X. Sun, “Parallel Batch-Dynamic Graphs: Algorithms and
Lower Bounds,” in Proceedings of the 31st SODA. USA:
SIAM, 2020, p. 1300–1319.