HyPR: Hybrid Page Ranking on Evolving Graphs
Hemant Kumar Giri, Mridul Haque, Dip Sankar Banerjee
Department of Computer Science and Engineering, Indian Institute of Information Technology Guwahati,
Bongora, Guwahati 781015, Assam, India.
Email:girihemant19@gmail.com, {mridul.haque,dipsankarb}@iiitg.ac.in
Abstract—PageRank (PR) is the standard metric used by the
Google search engine to compute the importance of a web page
via modeling the entire web as a first order Markov chain.
The challenge of computing PR efficiently has
already been addressed by several prior works, which have
introduced innovations both in algorithms and in the use of parallel
computing. The standard method of computing PR models
the web as a graph. The fast-growing internet
adds several new web pages every day, and hence more nodes
(representing the web pages) and edges (the hyperlinks) are added
to this graph in an incremental fashion. Computing PR on this
evolving graph is now an emerging challenge since computations
from scratch on the massive graph is time consuming and
unscalable. In this work, we propose Hybrid Page Rank (HyPR),
which computes PR on evolving graphs using collaborative
executions on multi-core CPUs and massively parallel GPUs. We
exploit data parallelism via efficiently partitioning the graph
into different regions that are affected and unaffected by the
new updates. The different partitions are then processed in an
overlapped manner for PR updates. The novelty of our technique
is in utilizing the hybrid platform to scale the solution to massive
graphs. The technique also provides high performance through
parallel processing of every batch of updates using a parallel
algorithm. HyPR efficiently executes on an NVIDIA V100 GPU
hosted on a 6th Gen Intel Xeon CPU and is able to update a
graph with 640M edges with a single batch of 100,000 edges
in 12 ms. HyPR outperforms other state-of-the-art techniques
for computing PR on evolving graphs [1] by 4.8x. Additionally,
HyPR provides a 1.2x speedup over GPU-only execution and a 95x
speedup over CPU-only parallel execution.
Index Terms—Heterogeneous Computing, PageRank,
CPU+GPU, Dynamic graphs.
I. INTRODUCTION
Link analysis is a popular technique for mining meaningful
information from real world graphs. There are a variety of
knowledge models that are typically employed by different ap-
plications. The knowledge models encode structural informa-
tion and rich relationships between the different entities which
help in the extraction of critical information. One such model
is the Hubs and Authority [2] model which essentially proves
that a web graph has multiple bipartite cores. Google [3] on the
other hand models the web graph as a first order Markov chain
which captures a user's browsing patterns on the web. This
model is used by Google to generate ranking of the different
web pages which forms the core concept for the Page Rank
(PR) algorithm and is used in Google search. The page ranking
method models a hyperlink from one page to another as an
endorsement for the destination page from the source. With the
growth of the internet, computing PR on the entire web is a
challenging task given that the graph is massive. Additionally,
the graph is evolving (also called dynamic) in nature and will
have several new pages and hyperlinks generated every day
which leads to an additional challenge of computing new page
ranks at regular intervals. While the simplest solution is to
compute the page ranks from scratch every time, it is not
feasible for any realistic use case. Hence it is necessary to
investigate methods for computing new page ranks on evolving
graphs that will not require computing PR from scratch.
Parallel PR [4–6] has found widespread success in
computing PR scores quickly. With the advent of processors such
as Graphics Processing Units (GPUs), and modern multi-core
CPUs, parallel solutions of PR have found even wider
success. Given the capacity for massive parallelism of
GPUs, good programmability, and strong community support,
GPUs have been able to provide sub-second performance in
PR computations on massive real-world graphs. In computing
PR scores on evolving (or dynamic) graphs too, GPUs have
found good success in the past [1, 7]. Computing dynamic
PR presents an ideal use case for heterogeneous computation,
since the nature of the computation requires techniques
that can provide parallelism in a hierarchical
manner. While some parallel work can be done at coarser
degrees of granularity, end page ranking scores need to be
computed at a much finer granularity.
In this paper, we present HyPR1
(pronounced hy-per), a
method for computing PR on dynamic graphs which uti-
lizes both CPUs and GPUs towards fast computations of the
new scores. Our method depends on two broad phases of
computation. In the first phase, we identify a partition of
the graph which will be affected by a newer set of edges
that are getting inserted. In the second phase, the actual PR
computation will be carried out. While the steps are mutually
exclusive, performing both the steps on a GPU will lead
to some sequential computations. Towards that, we propose
a heterogeneous method which provides higher degrees of
parallelism towards computing the two phases and hence leads
to significant performance benefits. The concrete motivations
and contributions of our paper are as follows:
1) Parallel partitioning: As pointed out, our technique
depends on identifying portions of the graph that are
affected and unaffected by the newer batch of edges
that are updated in the graph. We present a method that
can perform the partitioning in parallel while the current set of
updates is being incorporated into the existing graph.
1Code available at : https://github.com/girihemant19/HyPR
This reduces irregular accesses of the GPU to a large
extent, and boosts performance.
2) Parallel updates with a parallel algorithm: To extract
maximum data parallelism (and hence performance)
available on a GPU, we propose a technique that can
process a batch of updates in parallel to compute new
PR scores using a parallel algorithm which runs on the
GPU. The step for making the parallel update possible
is done through a pre-processing step which is carried
out on the CPU. To the best of our knowledge, this has
not been proposed earlier.
3) Scalability: This is one of the main motivations behind
the adoption of a hybrid solution. While the updates
that happen in a batch parallel manner could be small,
the whole graph, even when represented using the most
space-efficient data structures, consumes a large
amount of main memory. We show that a hybrid
approach does not necessitate storing the whole
graph in the limited GPU memory. We propose to create
minimal working sets that can effectively assimilate a
new batch of updates, and are constrained only when a
minimal working set exceeds the GPU memory. The much
larger host memory holds the full graph as auxiliary
storage.
4) Benchmarking: We thoroughly evaluate our technique
on several real-world graphs with up to 640M edges
(limited by the GPU memory). We perform experiments
to show that our technique outperforms the state-of-the-art
dynamic PR techniques by 4.8x, providing updated PR
scores in 12 ms.
The rest of the paper is organized as follows. We provide a
background of the fundamental ideas behind our work in Section II.
The related works are discussed in Section V. We provide
detailed methodology of our HyPR technique in Section III.
This is followed by the evaluations where we discuss and
analyze the results obtained in Section IV. Finally, we draw
some brief conclusions of our work and discuss possible
directions for extension in Section VI.
II. BACKGROUND
In this section we briefly discuss the algorithmic prelimi-
naries on which HyPR depends.
A. Baseline PageRank
PR is in general computed on a graph
G(V, E, w), where V (|V| = n) is the set of nodes representing
the web pages, E (|E| = m) is the set of edges representing the
hyperlinks, and w is the set of edge weights representing
the contribution of a node to the nodes it connects to via
edges. Using the random-surfer model, which models
a random internet user landing at a particular page, the
PR value of a node u is given as:
pr(u) = Σ_{v ∈ I(u)} c(v → u) + d/n    (1)
In Equation 1, d (taken as 0.15 conventionally) represents
the damping factor, which is the probability of the random
surfer staying at a particular page. The function c() is the
contribution of an incoming neighbor v, whose magnitude is taken
to be (1 − d) times the ratio of pr(v) to the out-degree of v:

c(v → u) = (1 − d) · pr(v)/out_degree(v)    (2)
In Equation 2, the PR computations happen in a cyclic
manner as two different nodes can make contributions to each
other. The PR values are computed by iteratively
updating the PR values of all nodes until they converge, i.e.,
until there is very little change to the PR scores after a particular
iteration. This style of computing the PR scores is popularly
done through power iterations [3]. The genesis of power
iterations arises from the idea of Markov chains reaching
a steady state when starting from an initial distribution. Here,
the initial PR values compose the starting states of every node,
which then go through several transitions to reach a stable
state. The PR values of all the nodes can be arranged to form
the state matrix, which has a set of eigenvalues. The eigenvectors
of this state matrix are computed using power iterations.
By the intrinsic property of Markov chains, if the state matrix
is stochastic, aperiodic, and irreducible, then it
will converge to a stationary set of vectors, which are the
final PR scores. A basic parallel implementation of
PR is shown in Algorithm 1.
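As a concrete reference, the power-iteration loop of Algorithm 1 can be sketched in Python. This is a sequential sketch with names of our own choosing; the per-node loop is the part that Algorithm 1 executes in parallel on the GPU:

```python
def parallel_pr(nodes, outdeg, in_nbrs, d=0.15, gamma=1e-10):
    """Power-iteration PageRank following Algorithm 1 (sequential sketch).

    nodes   : list of node ids
    outdeg  : dict node -> out-degree
    in_nbrs : dict node -> list of incoming neighbors
    """
    n = len(nodes)
    prev = {u: d / n for u in nodes}           # initial distribution
    err = float("inf")
    while err > gamma:
        p = {}
        for u in nodes:                        # executed in parallel on the GPU
            p[u] = d / n
            for x in in_nbrs.get(u, ()):
                p[u] += prev[x] / outdeg[x] * (1 - d)
        err = max(abs(prev[u] - p[u]) for u in nodes)
        prev = p
    return prev
```

For a two-node cycle (each node linking to the other), the scores converge to 0.5 each, as the fixed point of p = d/2 + (1 − d)·p requires.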
B. Dynamic PR calculations
The seed idea for HyPR stems from the batch-dynamic
graph connectivity algorithm proposed by Acar et al. [8].
Due to the iterative nature of PR, if the PR scores of all the
incoming neighbors of a node u converge in a particular iter-
ation, then the score of u will also converge in the immediate
next iteration. This nature of PR allows the decomposition
of a directed graph into a set of connected components (CC)
which can be processed in parallel [4]. Since maintaining the CCs
of a graph in a dynamic setting is equivalent to maintaining
the connectivity of the graph, we can perform a batch update
of size B in parallel with O(lg n + lg(1 + n/B)) work per edge.
A batch here, refers to a set of updates which can be either
insertions into the existing graph, or deletions. Each entry in
a batch is arranged as a tuple (ti, ui, vi, wi, oi), where ti is a
time stamp, (ui, vi) is the edge, wi is the weight associated
with edge, and oi is the update type (insert/delete). We assume
that a batch i will have at most Bi edges. We examine the
impact of this batch size on performance later in Section IV.
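For illustration, one batch entry can be held in a small record type and the batch ordered by timestamp before processing; the names below are illustrative, not from the implementation:

```python
from collections import namedtuple

# One update: timestamp, edge (u, v), weight, and operation type
Update = namedtuple("Update", ["t", "u", "v", "w", "op"])

batch = [
    Update(t=2, u=1, v=3, w=1.0, op="insert"),
    Update(t=1, u=0, v=2, w=1.0, op="delete"),
]
batch.sort(key=lambda e: e.t)  # apply updates in increasing timestamp order
```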
As stated, the problem of computing the PR scores for
an incrementally growing graph can be treated in essence
as the computation of two CCs where one contains a set
of nodes that gets affected by the batch of updates and the
other does not. The same treatment for partitioning the graph
for incrementally computing PR scores was also done by
Desikan et al. [2], where the authors proposed scaling the
unchanged nodes first, followed by PR computations for the
changed nodes. Consider a graph G(V, E, w), where
Σ_{i=1}^{n} w_i = W is the sum of the weights of all nodes
and |V| = n is the order of the graph. Every node is
initialized with a PR score of w_i/W. Consider a particular
node s in the existing graph. Its PR value can be expressed as:
PR(s) = d · (w_s/W) + (1 − d) · Σ_{i=1}^{k} PR(x_i)/δ(x_i)    (3)
where d is the damping factor, x_i denotes each incoming
neighbor pointing to s (up to k such nodes), and
δ(x_i) is the out-degree of x_i. From the fact that over some k
iterations the PR values of all the nodes get scaled by a
constant factor proportional to W, it can be deduced that for
the node s the updated PR score can be scaled as:
W · PR(s) = W′ · PR′(s),  or  PR′(s) = (W/W′) · PR(s)    (4)
So, the new PR scores can easily be determined by scaling the
old PR scores by the factor (W/W′). W can also be taken to be the
order n(G), since every node is equally likely to have
weight 1; Equation 4 can then be re-written as:

PR′(s) = (n(G)/n(G′)) · PR(s)    (5)
The new nodes that are getting added into G will be required
to be put through the usual iterations to compute PR values.
Figure 1 demonstrates these partitions,
where Bi represents the batches of updates. The set of nodes
which we denote by Vnew is in the partition that requires
from-scratch computation of PR. The other partition, Vold, only
needs to be scaled using Equation 5. The Vborder nodes in
the border area require scaling along with a few
iterations of ranking before converging to their final PR scores.
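The scaling of Equation 5 is a one-line operation per node; a minimal Python sketch (the function name is ours) is:

```python
def scale_pr(pr, n_old, n_new):
    """Scale existing PR scores by n(G)/n(G') after the graph grows (Eq. 5)."""
    factor = n_old / n_new
    return {u: score * factor for u, score in pr.items()}
```

For example, doubling the node count from 2 to 4 halves every existing score.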
While the standard PR algorithm requires O(n + m)
space to be maintained in memory, partitioning the graph
effectively reduces the memory requirements. If we
assume each batch to be of size ∆ edges, then in addition
to the original graph, an additional space of O(∆) needs to
be allocated. Given the limited space available on
GPUs, a combined space of O(n + m + ∆) would severely limit
scaling. In order to scale HyPR to larger sizes, we aim to
keep only the O(∆) additional space on the GPU, while the rest
of the graph is maintained in host memory. This
is further refined through the partitioning, which decomposes
the PR computation into discrete space requirements of O(|Vnew|),
O(|Vborder|), and O(|Vold|). These partitions need not
reside in the same memory at all times. We use the compressed
sparse row (CSR) representation for the graph, as
detailed in [9].
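As an illustration of the CSR layout (this is not the authors' code), the graph is stored as one offsets array and one flattened adjacency array, built from an edge list as follows:

```python
def build_csr(n, edges):
    """Build CSR arrays (indptr, indices) for a directed graph with n nodes."""
    indptr = [0] * (n + 1)
    for u, _ in edges:                 # count the out-degree of each node
        indptr[u + 1] += 1
    for i in range(n):                 # prefix-sum counts into row offsets
        indptr[i + 1] += indptr[i]
    indices = [0] * len(edges)
    fill = list(indptr)                # next free slot per row
    for u, v in edges:
        indices[fill[u]] = v
        fill[u] += 1
    return indptr, indices
```

The successors of node u are then indices[indptr[u]:indptr[u+1]], which is the access pattern the GPU kernels traverse.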
C. Datasets used
Even though PR is usually computed on web graphs representing
web pages and hyperlinks, their properties are similar
to those of other real-world graphs. Hence, we choose a healthy mix
of web graphs and other real-world graphs. The datasets we use are
Figure 1: Identification of nodes
Algorithm 1: parallelPR(V, outdeg, InV, γ)
Require: Set of nodes V, out-degree of each node in outdeg,
incoming neighbors of each node in InV.
Ensure: PageRank p of each node
1: err = ∞
2: for all u ∈ V do
3:    previous(u) = d/|V|
4: end for
5: while err > γ do
6:    for all u ∈ V in parallel do
7:       p(u) = d/n
8:       for all x ∈ InV(u) do
9:          p(u) = p(u) + (previous(x)/outdeg(x)) ∗ (1 − d)
10:      end for
11:   end for
12:   err = 0
13:   for all u ∈ V do
14:      err = max(err, abs(previous(u) − p(u)))
15:      previous(u) = p(u)
16:   end for
17: end while
18: return p
collected from the University of Florida Sparse Matrix Collec-
tion [10] and Stanford Network Analysis Project (SNAP) [11].
The datasets we have selected range from 2.9M edges
to around 640M edges, as detailed in Table I.
III. METHODOLOGY
In this section we briefly discuss the approach
adopted for implementing HyPR. We first give a basic
overview of the approach and then provide
a detailed explanation of the implementation strategies.
Algorithm 2 shows the basic set of steps adopted
in the implementation. Figure 2 shows an overview
of these steps. In a broad sense, HyPR
works as the following concurrent (or overlapped) phases.
1) Partitioning: The parallel cores of the CPU create the
Table I: Datasets Used
Graph Name Sources |V | |E| Type
1. Amazon [11] 0.41 M 3.35 M Purchasing
2. web-Google [11] 0.87 M 5.10 M Web Graph
3. wiki-Topcat [11] 1.79 M 28.51 M Social
4. soc-pokec [11] 1.63 M 30.62 M Social
5. Reddit [11] 2.61 M 34.40 M Social
6. soc-LiveJournal [11] 4.84 M 68.99 M Social
7. Orkut [11] 3.00 M 117.10 M Social
8. Graph500 [11] 1.00 M 200.00 M Synthetic
9. NLP [10] 16.24 M 232.23 M Optimization
10. Arabic [10] 22.74 M 639.99 M Web Graph
partitions Vold, Vnew, and Vborder as discussed in Section II-B.
2) Transfer: Perform an asynchronous transfer of the ∆-sized batch
to the GPU memory. 3) PR Calculations: Depending on the
type of the partition, either scale or calculate new PR scores
on the GPU.
A. Why Hybrid?
Before we discuss the HyPR design in detail, we first motivate
the requirement for a hybrid solution. As stated earlier, our
motivation for a heterogeneous solution is threefold.
In the first place, a hybrid solution allows the PR calculation
to scale to very large sizes which is otherwise limited by
the GPU main memory size. As discussed in Section II-B,
only O(|V |+∆) space is needed for computing the updated
PR values for a particular batch of size Bi. The static graph
which is undergoing updates resides in the much larger main
memory of the host and only aids in creating the partitions. In
the second place, if we ignore scaling, we would need to perform
partitioning followed by parallel computation of the PR
scores, which results in limited GPU utilization since these
steps are not independent. Also, the degree of parallelism is
lower for the partitioning step than for the scaling
and PR updates. Hence, the CPU is the ideal device for the
coarsely parallel partitioning step, while the GPU is better suited for
the finely parallel update and scaling. We show that a GPU-only
execution is actually slower than the hybrid technique in
Section IV. In the last place, a hybrid technique yields
higher system efficiency; otherwise the CPU would sit
idle while the GPU is updating the PR scores.
B. Graph Partitioning
We start the computation on a graph G(V, E, w) for which
some PR scores have already been computed
previously. The algorithmic phases that HyPR goes through are
outlined in Algorithm 2. In successive time intervals, a
set of batch updates arrives. As discussed previously
in Section II-B, the batches consist of a heterogeneous mix
of edges that are either to be inserted or deleted. A batch Bi
can be represented by the tuple (ti, ui, vi, wi, oi), as discussed
in Section II-B.
For updating the batch in parallel, we identify three connected
components (CC) that strictly require different computations.
If we consider the existing graph as one single CC
(say Co), and the new batch of updates as a separate CC (say
Cn), then we can construct a block graph S in which there are
incoming edges from Cn to Co if there exist edge
updates (ui, vi) with ui ∈ Co and vi ∈ Cn. We can assume
that the new batch of updates is topologically sorted, since the
PR scores of the new nodes in Bi are guaranteed to be lower
than those in Co.
In an auxiliary experiment, we tested the nature of these
CCs to see if they form strongly connected components (SCC)
so as to arrive at a formulation similar to [4]. Decomposing
a graph into a set of SCCs provides the advantage of doing a
topological ordering of the partitions where PR computations
(or re-computations) of the scores can be done in a cascaded
manner with the scaling of Vborder, and Vold nodes first,
followed by the PR calculation of Vborder and Vnew. This
order of cascaded updates has also been adopted to update the
PR scores, as proved in [2]. We found that if we re-compute
the three partitions using Kosaraju's algorithm (cf. [12]) to
identify the Vold, Vnew, and Vborder partitions, approximately
2% of the edges differ on the Arabic dataset when compared
to our partitioning mechanism. Hence, we conclude
that our mechanism produces approximate SCCs, in which only
the 2% extraneous edges need to be monitored so as
to maintain a proper topological order while performing the
batch updates. We do a small amount of book-keeping on these
edges in order to correctly partition them.
We can now identify three partitions for every batch Bi:
(i) Vold, the set of nodes that are already present in the
existing graph; (ii) Vnew, the set of entirely
new nodes to be added to the existing graph,
found as Vi − (Vi ∩ Vold), where Vi is the set of nodes in
batch Bi; (iii) Vborder, the set of nodes that have
edges in Bi connecting Vold and Vnew and are reachable via
a breadth-first traversal. As an example, in Figure 1 we can see
the Vold nodes in red, the Vnew nodes in green, and the Vborder
nodes, which are the nodes in the first hop
having a direct connection with Vnew.
Essentially, all the nodes in G except the yellow nodes are in
Vborder.
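The three-way split for an insert batch can be summarized in a small Python sketch. This is our own simplification of the insertion case, where `succ` is the successor list of the existing graph and the border is grown breadth-first from the batch endpoints already in G:

```python
from collections import deque

def identify_partitions(existing, succ, batch_edges):
    """Split the nodes touched by an insert batch into (v_old, v_new, v_border)."""
    batch_nodes = {x for u, v in batch_edges for x in (u, v)}
    v_old = batch_nodes & existing           # batch nodes already in G
    v_new = batch_nodes - existing           # entirely new nodes
    # border: existing nodes reachable breadth-first from the batch's old endpoints
    v_border, q = set(), deque(v_old)
    while q:
        x = q.popleft()
        for y in succ.get(x, ()):
            if y in existing and y not in v_border and y not in batch_nodes:
                v_border.add(y)
                q.append(y)
    return v_old, v_new, v_border
```

For example, inserting edge (0, 3) into a path graph 0 → 1 → 2 classifies 0 as old, 3 as new, and {1, 2} as border.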
As we can see in Algorithm 2, the first phase of the
update operation is the creation of these partitions. Since
the identification of these vertices is independent for each
node, it can be done in parallel. It is critical to note that at
this stage the number of edges present in a single batch
does not warrant any form of edge parallelism on the GPU, as that
would lead to lower utilization of the available GPU bandwidth
at the cost of high memory transfers.
C. Pre-processing
In the pre-processing phase, we compute the Vold, Vnew, and
Vborder partitions in a manner that overlaps with the GPU. The
sequence of operations followed for every incoming
batch of updates is: 1) Pre-process the first batch in parallel
using OpenMP threads on the CPU. 2) Transfer Vold, Vnew,
and Vborder to the GPU for scaling and PR computations in
an asynchronous manner. 3) Start pre-processing the second
batch of updates on the CPU as soon as the first partition is
Figure 2: Overview of HyPR execution steps. (a) Graph with
batch updates. (b) Identification of border, new, and unchanged
nodes. (c) Scaling on unchanged and border nodes. (d) PageRank
computation on border and new nodes.
handed off to the GPU for ranking. Algorithm 3 demonstrates the
partitioning mechanism, which is called from Line 1 of
Algorithm 2. As mentioned earlier, the batch of updates contains a
heterogeneous mix of both insert and delete operations, which
have to be handled uniquely.
1) Insertion: Insertion is more compute intensive than
deletion. In Lines 6-22 of Algorithm 3 we depict our insertion
mechanism. We first populate Vnew with all the nodes of the
batch and process all the elements in Vnew using available
CPU threads, successively sending each node to the appropriate
partition. We first check whether the source node of an edge (ui, vi)
belongs to the existing partition. If it does, that
particular node is put in Vold. For computing Vborder, we
check whether the particular node is reachable in G (the existing
graph) in a breadth-first manner and has a predecessor in the
incoming batch. The intuition behind this is that the Vborder
will be the set of nodes that are reachable from the Vnew
set of nodes and hence will undergo both scaling and PR
computation. So, in parallel, all the nodes in Vnew are popped
and classified into either Vold or Vborder. The remaining ones in
Vnew are the entirely new nodes that have arrived in Bi,
whose new PR values need to be computed.
2) Deletion: Deletion is much simpler than insertion and
hence less compute intensive. In the case of deletion, the Vnew set
will be empty, and the only nodes involved are
the Vold and Vborder sets. As we can see from Lines 23-33 of
Algorithm 3, if the update type oi is delete, then we remove
the nodes involved from Vold, which contains the nodes
of the original graph. These removals require a fresh set of PR
computations on the Vborder nodes, handled during
the PR update step in Algorithm 2. Additionally, the removals
induce a newer set of Vborder nodes, for which we
check whether any reachable successors or predecessors of the
removed nodes are present. Such nodes are pushed into
Vborder.
D. Scaling the old nodes
The pre-processing step essentially allows us to perform
data parallel scaling and PR computations on the individual
partitions. As discussed earlier, the primary idea behind HyPR
is the localization of the set of nodes that will be affected by
the new batch of updates. As we can now see from Lines
6-11 in Algorithm 2, we call a GPU kernel to scale the
nodes of the Vold partition using the Equation 5 discussed
in Section II-B. We can achieve full GPU bandwidth
saturation since |Vold| threads can be spawned to scale
all the nodes in parallel. It is critical to note here that in the
hybrid implementation, the necessary intermediate transfers
do not require the entire graph G (which
is basically the V nodes of G initially) to be copied every time.
Rather, the original graph G is copied to GPU before the batch
processing starts and is augmented with Vnew after every batch
is processed. As we can see in Lines 9-11 of Algorithm 2,
the scaling operation is performed accordingly on the Vborder
nodes as well. Scaling of the nodes in the GPU is required
for all the Vold and Vborder set of nodes. The actual scaling
operation is a O(1) operation which makes it most suitable
for a massively parallel implementation on the GPU. Threads
equal in number to the nodes involved in the scaling process
(Vold or Vborder) are spawned on the GPU to execute the
scaling kernels in a SIMT manner.
E. Page Rank Update
The PR update of the Vnew and Vborder nodes is now a
much lighter computation owing to the partitions. As with the
standard parallel PR implementation shown in Algorithm 1,
the power iterations for computing the PR scores continue until
the scores converge to an error threshold γ (set to 10^-10).
However, since the Vborder set undergoes a step of scaling
before the PR update step, the number of iterations required
for the scores to converge is much lower than when
computing from scratch. So, during the PR update step for
Vborder and Vnew, as shown in Lines 12-17 of Algorithm 2,
we call parallelPR() of Algorithm 1. For the Vnew nodes,
the computation is trivial, since the number of nodes is low,
as the set contains only the new nodes added in the
batch. The Vborder nodes, although much larger in number than
the Vnew set, also go through parallelPR(); however, they
converge much more quickly, since they have previously undergone
a step of scaling.
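Putting the insertion path together, a compact sequential sketch (function and variable names are ours; on the GPU each loop is data parallel) might look like:

```python
def apply_insert_batch(dest, n_before, n_after, v_old, v_border, v_new,
                       outdeg, in_nbrs, d=0.15, gamma=1e-10):
    """Scale previously ranked nodes (Eq. 5), then re-rank border and new nodes."""
    s = n_before / n_after
    for u in v_old | v_border:             # scaling kernels, O(1) per node
        dest[u] *= s
    for u in v_new:                        # fresh nodes start at d/n
        dest.setdefault(u, d / n_after)
    work = list(v_border | v_new)
    err = float("inf")
    while err > gamma:                     # power iterations on the small working set
        new = {}
        for u in work:
            p = d / n_after
            for x in in_nbrs.get(u, ()):
                p += dest[x] / outdeg[x] * (1 - d)
            new[u] = p
        err = max(abs(dest[u] - new[u]) for u in work)
        dest.update(new)
    return dest
```

Note that only the border and new nodes iterate; the old nodes are touched once by the scaling pass, which is what keeps the update cost proportional to the affected region rather than to the whole graph.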
For deletions, the update is required only for
the Vold and Vborder sets of nodes. As we can see from
Lines 19-26 in Algorithm 2, the same PR update process will
Algorithm 2: HyPR: Hybrid Page Ranking on G with
incremental Batches Bi
Require: Scratch graph G and k number of batches in B
represented by (ti, ui, vi, wi, oi), PageRank vector
dest,outdegree of each node in outdeg, incoming
neighbors of every node in InV.
Ensure: Rank of the nodes in vector dest
{Phase 1: Pre-processing phase}
1: CPU::Partition the incoming batches based on insertion
and deletion
2: (Vold, Vborder, Vnew)=createPartition(G,B)
3: CPU:: Queue Vold, Vborder, Vnew for async transfer to
GPU
{Phase 2: PR update}
4: INSERTION: Generate threads equal to the number of
nodes in Vborder and Vnew
5: if (oi==insert) then
6: for ∀u ∈ Vold in parallel do
7: GPU:: dest[u] = (|V| ∗ dest[u]) / |Vold|   {Scaling}
8: end for
9: for ∀x ∈ Vborder in parallel do
10: GPU:: dest[x] = (|V| ∗ dest[x]) / |Vborder|   {Scaling}
11: end for
12: for ∀z ∈ Vborder in parallel do
13: GPU:: dest[z]=parallelPR(Vborder,outdeg,InV [z])
14: end for
15: for ∀y ∈ Vnew in parallel do
16: GPU :: dest[y]=parallelPR(Vnew,outdeg,InV [y])
17: end for
18: end if
19: DELETION: Generate threads equal to the number of
nodes in Vold and Vborder
20: if oi==delete then
21: for ∀u ∈ Vborder in parallel do
22: GPU:: dest[u] = parallelPR(Vborder, outdeg, InV[u])   {PR Update}
23: end for
24: for ∀v ∈ Vold in parallel do
25: GPU:: dest[v] = (|V| ∗ dest[v]) / |Vold|   {Scaling}
26: end for
27: end if
be applied first for the Vborder set. The Vold set will simply
undergo a step of scaling similar to the case of insertions.
F. CUDA+OpenMP implementation
We can observe a snapshot of the overlapped execution
model in Figure 4. The target performance critically depends
on creating the perfect balance of computations that are oc-
curring on the CPU and the GPU. The CPU is responsible for
creating the partitions, and transferring them to the GPU. The
GPU, on the other hand, is responsible for performing the three
kernel operations: scaling and the two PR updates. To
achieve this, we make use of synchronous CUDA kernel
Algorithm 3: createPartition(G,B)
Require: Graph G and k number of batches in Bi
represented by bi ∈ (ti, ui, vi, wi, oi) where oi denotes
insert or delete
Ensure: Vold, Vnew, Vborder
1: CPU :: Generate threads using OpenMP
2: Initialize Vold, Vnew, Vborder = φ, Vtemp = φ
3: Push ∀(u, v) ∈ batch to Vnew
4: Push ∀u ∈ G to Vold
5: for ∀bi ∈ B in parallel : do
6: if (oi==insert) then
7: while (Vnew! = NULL) do
8: Pop element x ∈ Vnew
9: if (x ∈ Vtemp) then
10: Continue
11: end if
12: Push x to Vtemp
13: for every successor y of x ∈ G do
14: Push y to Vtemp
15: end for
16: end while
17: for ∀z ∈ Vtemp do
18: for ∀ predecessors li ∈ bi do
19: Push li into Vborder
20: end for
21: end for
22: end if
23: if (oi==delete) then
24: for ∀u, v ∈ bi do
25: Choose u and v from Vold
26: end for
27: for ∀y, successor of (u, v) ∈ G do
28: Push y into Vborder
29: end for
30: for ∀y, predecessor of (u, v) ∈ G do
31: Push y into Vborder
32: end for
33: end if
34: end for
35: return Vold,Vnew,Vborder
calls, asynchronous transfers, and CUDA streams to orches-
trate the entire execution model. For creating the partitions,
we utilize CPU threads created using the OpenMP library. We
create threads equal to the number of processing cores that are
available. Naturally, the batch sizes will
be much bigger than the number of threads. We use standard
blocking of the batches for each thread to handle.
Despite large batches, this provides good performance, owing to
the fact that the partitioning operation is itself simple in
nature and does not involve CPU-intensive operations.
Additionally, partitioning is an irregular operation,
which the CPU is much better at handling than
the GPU. CUDA streams are created before the start of the
operation. Once the CPU finishes the partitioning operation on
a particular batch, cudaMemcpyAsync() calls are issued for the
three partitions, each on an individual stream. CUDA events
associated with the copy operations monitor their completion.
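The overlap can be mimicked in plain Python with two worker slots: while the "device" ranks batch i, the "host" partitions batch i+1. This is only an illustrative pipeline under our own naming; the actual implementation uses OpenMP threads and CUDA streams:

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_updates(batches, partition, rank):
    """Overlap partitioning of batch i+1 with ranking of batch i."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(partition, batches[0])
        for nxt in batches[1:]:
            parts = pending.result()               # wait for current partitions
            pending = pool.submit(partition, nxt)  # host: partition the next batch
            results.append(rank(parts))            # device: rank the current batch
        results.append(rank(pending.result()))     # drain the last batch
    return results
```

With this structure, the partitioning latency of every batch after the first is hidden behind the ranking of its predecessor, which is the effect the CUDA-event timeline in Figure 4 illustrates.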
IV. PERFORMANCE EVALUATION
In this section we discuss the experiments that we perform
to validate the efficacy of our solution, and also analyze the
performance.
A. Experimental environment
For conducting our experiments we use a platform that has a
multicore CPU connected to a state-of-the-art GPU via a PCIe
link. The CPU is an Intel(R) Xeon(R) Silver 4110 with the
Skylake micro-architecture. Two of these CPUs, each having
8 cores, are arranged in two sockets, effectively providing 16
NUMA cores. The cores are clocked at 2.1 GHz with 12 MB
of L3 cache. The host is attached to an NVIDIA V100 GPU, which has
5120 CUDA cores spread across 80 streaming multiprocessors
(SMs). Each GPU core is clocked at 1.38 GHz, and the GPU has
32 GB of global memory. The GPU is connected to the
CPU via a PCIe Gen2 link. The machine runs the CentOS 7
OS. For multi-threading on the CPU, we use OpenMP version
3.1 and GCC version 4.8. The GPU programs are compiled
with nvcc from CUDA version 10.1 with the -O3 flag. All
experiments have been averaged over a dozen runs.
For experimentation, we use the real-world graphs shown
in Table I. The datasets do not possess any timestamps of
their own. As done in previous works [1, 13], we simulate
the random arrival of an edge update by randomly assigning
timestamps. The updates are then applied in
increasing order of timestamp. For the evaluations, we adopt
a sliding-window model, where we take a certain percentage
of the original dataset to construct the batches. These are then
varied to measure the performance.
B. Update time
In this section we discuss the performance of HyPR in the context of update times. As mentioned earlier, we start the computations by taking half of the edges in the entire graph dataset. We then measure the update times as the sliding window moves to generate a set of batches. The reported update times are recorded using CUDA event timers to capture the overlapped pre-processing, transfer, and kernel execution times (shown in one blue box in Figure 4). A timer is started before the OpenMP parallel section, where 15 threads perform parallel partitioning of a batch and one thread performs asynchronous transfers and kernel calls. The timer is stopped once the CUDA events indicate the end of the final update kernel.
We have experimented with different batch sizes from 1% to 10%. In Figure 3, we can see the latencies achieved over all the graphs. For the largest graph, Arabic, the average update time that we achieve is 85.108 ms. In Figure 3, we also show the comparative time with the “Pure GPU” implementation. This is done in order to validate our claim that a hybrid implementation, which is able to efficiently overlap the PR computations with the CPU-side partitioning and transfers, shows better performance. On average, we observe that HyPR achieves a 1.1843x speedup over the “Pure GPU” performance on the 5 largest graphs that we experiment with, and 1.2305x over all the graphs. It can be observed that the speedups achieved are more pronounced on the larger graphs in comparison to the smaller ones. This is due to the fact that the larger graphs provide higher degrees of parallelism, which allows the GPU to spawn a higher number of threads, and the CPU-side pre-processing allows a higher degree of overlap. This is better explained using the experiment discussed in the next section. We also execute HyPR on a multi-core CPU using only OpenMP, which we call “Pure CPU”. HyPR achieves around a 95x improvement over this baseline, which is orders of magnitude slower than “Pure GPU” and hence is not shown in Figure 3.
C. Hybrid Overlaps
As stated in Section I, one of the major motivations behind a hybrid computation is to perform the partitioning of the graph iteratively while the GPU is busy updating the PR scores. We first show how the pipeline that has been set up works for the Arabic graph. The large graphs become the best use-cases for this pipelined execution, as they are able to achieve the perfect balance of computations on the CPU and GPU sides. As we can observe from Figure 4, the execution begins with the partitioning of the first batch B1 of updates, which does not involve any overlap. Once the partitioning completes, the asynchronous transfer of the partitions to the GPU is issued, which allows the CPU to process the B2 batch immediately. The transfer, which is registered with the callback, automatically spawns the kernels as soon as the transfers complete. We can see that for a modest batch size of 10,000 edges, it takes 74.43 ms to pre-process the B2 batch, which fully masks the 41.12 ms of transfer and the 34.77 ms required by the kernels.
This behavior is the best-case scenario, which is however not observed for all graphs. We can see from Figure 5 that for the smaller graphs (Figure 5(a-f)), the difference between the partitioning time and the kernel+transfer time is -1.81 ms (meaning pre-processing is slower on average). In Figure 5, the batch sizes are kept uniform (at 10K edges, approximately 1% of the dataset). For several of the smaller graphs, soc-pokec, Reddit, Orkut, and Graph-500, we can observe that for several batches the two curves cross each other at multiple points. This can be attributed to the structural heterogeneity of the batches. If a batch, for example, contains too many new nodes, then the kernel times for calculating new PR scores on Vborder and Vnew will be higher than the pre-processing time. The reverse scenario, which is more conducive, is when the batches have a healthy mix of new and old nodes. This creates the right kind of overlap, as shown in Figure 4, where the waiting times of the GPU are minimized.
An inflection point is observed for NLP (Figure 5(i)) and for Arabic (Figure 5(j)), where we can see that the partitioning
Figure 3: Update times (ms) of HyPR vs. PureGPU for batch sizes of 1%, 5%, and 10% on (a) Amazon, (b) Web-google, (c) Wiki-topcat, (d) Soc-Pokec, (e) Reddit, (f) Soc-LiveJournal, (g) Orkut, (h) Graph500, (i) NLP, and (j) Arabic.
Figure 4: Overlapping of the pre-processing time with transfer
and kernel time for the Arabic graph
time exceeds the transfer+kernel time at a particular batch size. The partitioning times remain higher for the remaining larger graphs. This indicates the scalability of the CPU-side partitioning across different batches. On average, the difference between the two is 4.95 ms for the larger graphs (Figure 5(g-j)). Hence, we can conclude that HyPR is data driven, and shows its best performance on the datasets that are able to extract the maximum amount of system performance. The system efficiency achieved by HyPR is discussed in the next sub-section.
D. Resource Utilization
In this experiment, we investigate the system efficiencies achieved by HyPR. Towards that, we profile the utilization of resources through measurements of the memory utilization and the warp occupancy of the GPU threads. We used the nvprof profiler from the CUDA toolkit with two profiling metrics. First, we monitor achieved occupancy, which is the number of warps running concurrently on an SM divided by the maximum warp capacity. Second, we monitor gld efficiency, which is the ratio of the requested global memory load throughput to the required global memory load throughput, i.e., how well the coalescing of DRAM accesses works. Figure 6 shows these two profiling results with different batch sizes. We can observe that the occupancies scale linearly with increasing batch sizes. The global load efficiency indicates increased coalesced accesses on every batch. On average we achieve a global load efficiency of 61.07% and a warp occupancy of 64.14%.
E. Comparative Analysis
We now compare the performance of HyPR with some of the state-of-the-art solutions for dynamic PR. We mainly compare our work with GPMA [1], GPMA+ [1], and cuSparse [14]. GPMA exploits the packed memory array (PMA) to handle dynamic updates by storing sorted elements in a partially contiguous manner that enhances dynamic updates. The cuSparse library has efficient CSR implementations for sparse matrix-vector multiplications. We implement a basic PR update mechanism (purely on the GPU) using cuSparse, where the new incoming batch iteratively undergoes SpMV operations until the PR values converge. For performing the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1], which is an Intel Xeon CPU connected to a Titan X Pascal GPU, and on the same datasets. Additionally, the PR scores are derived after running all the experiments till convergence or 1000 iterations (whichever is earlier). Towards that, we use an additional graph “Random” having 1M nodes and 200M edges. From Figure 7, we can observe that HyPR outperforms the state-of-the-art GPMA and GPMA+, and the cuSparse implementation, for four of the largest graphs. On average over all four graphs used, HyPR outperforms GPMA by 4.81x, GPMA+ by 3.26x, and cuSparse by 102.36x.
F. Accuracy Analysis
For checking the accuracy of HyPR, all the PR scores are pre-computed using nvGraph [15]. nvGraph, a part of the NVIDIA CUDA Toolkit, is the state-of-the-art solution which provides millisecond performance for computing PR scores on a static graph with 500 iterations. We use nvGraph for computing the from-scratch PR scores of the graph. The batch updates are then applied to the graph through the HyPR method, and checked against the nvGraph scores obtained by feeding nvGraph a graph that has the batches included. Once all the batch
Figure 5: Overlaps achieved between partitioning and transfer+kernel times for 10 batches B1-B10 of uniform size on (a) Amazon, (b) Web-google, (c) Wiki-topcat, (d) soc-pokec, (e) Reddit, (f) soc-LiveJournal, (g) Orkut, (h) Graph-500, (i) NLP, and (j) Arabic.
Figure 6: Resource utilization (achieved warp occupancy and global load efficiency versus batch size) for (a) Reddit, (b) Graph500, (c) Orkut, and (d) NLP.
Figure 7: Comparison of HyPR with GPMA, GPMA+, and cuSparse across batch sizes on (a) Reddit, (b) soc-Pokec, (c) Graph500, and (d) Random.
Figure 8: Accuracy of PR scores: Jaccard similarity of the top-10, top-20, and top-30 nodes, and divergence (y2 axis), for Orkut, Graph500, NLP, and Arabic.
updates are completed, we compare the PR scores with the pre-computed ones. To check for accuracy, we introduce batch updates with insert and delete updates on the nodes with the highest PR scores in the existing graph. This is done in order to impose the maximum change on the PR scores due to an update. In Figure 8, we show the Jaccard similarities of the PR scores computed by HyPR with those computed by nvGraph for the top 10, 20, and 30 nodes with the highest PR scores. We see that on average the similarity score comes to 0.985 for a fixed 500 iterations (the same as that used for computing the scores using nvGraph). The similarity score comes to 0.991 on average if we allow the different graphs to converge till they reach the threshold γ. We see that on average the divergence of the PR scores computed by HyPR from that of nvGraph is less than 0.001% (shown on the y2 axis).
V. RELATED WORKS
GPU based PR has been explored in the works of Duong et al. [16] and Garg et al. [4]. In [16], the authors propose a new data structure for graph representation named the link structure file. In their work they target the steps in the PR computation where sufficient data parallelism exists. These steps are then distributed among multiple GPUs, where each thread performs finer-grained work. Garg et al. [4] provide algorithmic techniques for partitioning the graph based on its structural properties to extract parallelism.
PR on evolving graphs has been explored by Sha et al. [1], who propose two algorithms, GPMA and GPMA+, based upon the packed memory array. GPMA is a lock-based approach in which the few concurrent update conflicts are handled efficiently. GPMA+ is a lock-free bottom-up approach which prioritizes updates and favors coalesced memory access. Another work is by Feng et al. [17], who propose the DISTINGER framework. DISTINGER employs a hash partitioning-based scheme that favors massive graph updates and message passing among the partition sites using MPI. Another algorithm to compute personalized PR on dynamic graphs was published by Guo et al. [13], which also exploits GPUs for performance. Similar to HyPR, the computations proposed in [13] are also done in a batched manner. To enhance the performance of parallel push, different optimization techniques are introduced. One of them is eager propagation, which minimizes the number of local push operations. They also propose a frontier generation method that keeps track of vertex frontiers while cutting down the synchronization overhead of merging duplicate vertices. Batch parallelism for dynamic graphs has also seen several theoretical studies. A generic framework for batch parallelism is proposed by Acar et al. in [8]. Batch parallelism for graph connectivity and other problems in the massively parallel computation (MPC) model is explored by Dhulipala et al. in [18].
The work by Desikan et al. [2] is one of the earliest towards incremental PR computation on evolving graphs. The authors proposed the partitioning and scaling techniques which we modify for parallelization on a heterogeneous platform. To the best of our knowledge, no other work exists that explores a hybrid CPU+GPU solution for computing global PR. In HyPR we propose techniques for PR computation that use batch parallelism in unison with fully parallel partitioning and PR update mechanisms on a hybrid platform to extract high performance.
VI. CONCLUSION AND FUTURE WORK
In this work, we propose HyPR, a hybrid technique for computing PR on evolving graphs. We have shown an efficient mechanism to partition the existing graph and updates into data-parallel work units which can be updated independently. HyPR is executed on a state-of-the-art high performance platform and exhaustively tested against large real-world graphs. HyPR provides substantial performance gains of up to 4.8x over other existing mechanisms and also achieves high system efficiency. In the near future, we plan on extending HyPR by spreading the computations across multiple GPUs located in shared and distributed memories. Communication between distributed nodes will become an additional overhead to handle in that case. Additionally, modern HPC systems are equipped with newer generation interconnects like NVLink, which deserve to be explored in the context of page ranking.
VII. ACKNOWLEDGMENT
This work is supported by the Science and Engineering Research Board (SERB), DST, India through the Early Career Research Grant (no. ECR/2016/002061) and by NVIDIA Corporation through the GPU Hardware Grant program.
REFERENCES
[1] M. Sha, Y. Li, B. He, and K.-L. Tan, “Accelerating dynamic
graph analytics on GPUs,” Proc. of the VLDB Endow., vol. 11,
no. 1, pp. 107–120, 2017.
[2] P. Desikan, N. Pathak, J. Srivastava, and V. Kumar, “Incremental
page rank computation on evolving graphs,” in 14th Interna-
tional WWW, 2005, pp. 1094–1095.
[3] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank
citation ranking: Bringing order to the web.” Stanford InfoLab,
Tech. Rep., 1999.
[4] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques
for efficient parallel PageRank computation on Real-World
Graphs,” in Proceedings of the 17th ICDCN, 2016, pp. 1–10.
[5] D. Gleich, L. Zhukov, and P. Berkhin, “Fast parallel pagerank: A linear system approach,” Yahoo! Research Technical Report YRL-038, available via http://research.yahoo.com/publication/YRL-038.pdf, vol. 13, p. 22, 2004.
[6] A. Cevahir, C. Aykanat, A. Turk, and B. B. Cambazoglu, “Site-
based partitioning and repartitioning techniques for parallel
PageRank computation,” IEEE TPDS, vol. 22, no. 5, pp. 786–
802, 2011.
[7] M. Kim, “Towards exploiting GPUs for fast PageRank computa-
tion of large-scale networks,” in Proceeding of 3rd International
Conference on Emerging Databases, 2013.
[8] U. A. Acar, D. Anderson, G. E. Blelloch, and L. Dhulipala,
“Parallel batch-dynamic graph connectivity,” in The 31st ACM
SPAA, 2019, pp. 381–392.
[9] Compressed Sparse Column Format (CSC), https://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html.
[10] The University of Florida Sparse Matrix Collection, https:
//snap.stanford.edu/data.
[11] J. Leskovec and A. Krevl, SNAP Datasets: Stanford Large
Network Dataset Collection, http://snap.stanford.edu/data.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein,
Introduction to Algorithms, Third Edition, 3rd ed. The MIT
Press, 2009.
[13] W. Guo, Y. Li, M. Sha, and K.-L. Tan, “Parallel personalized
pagerank on dynamic graphs,” Proc. of the VLDB Endow.,
vol. 11, no. 1, pp. 93–106, 2017.
[14] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “Cus-
parse library,” in GPU Technology Conference, 2010.
[15] NVGraph toolkit documentation, https://docs.nvidia.com/cuda/
cuda-runtime-api/index.html.
[16] N. T. Duong, Q. A. P. Nguyen, A. T. Nguyen, and H.-D. Nguyen, “Parallel PageRank computation using GPUs,” in Proc. of the 3rd Symposium on Information and Communication Technology, 2012, pp. 223–230.
[17] G. Feng, X. Meng, and K. Ammar, “Distinger: A distributed
graph data structure for massive dynamic graph processing,” in
International Conference on Big Data. IEEE, 2015, pp. 1814–
1822.
[18] L. Dhulipala, D. Durfee, J. Kulkarni, R. Peng, S. Sawlani,
and X. Sun, “Parallel Batch-Dynamic Graphs: Algorithms and
Lower Bounds,” in Proceedings of the 31st SODA. USA:
SIAM, 2020, p. 1300–1319.

SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
 
hetero_pim
hetero_pimhetero_pim
hetero_pim
 
Map reduce advantages over parallel databases report
Map reduce advantages over parallel databases reportMap reduce advantages over parallel databases report
Map reduce advantages over parallel databases report
 
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHYSPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
SPEED-UP IMPROVEMENT USING PARALLEL APPROACH IN IMAGE STEGANOGRAPHY
 
[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal
[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal
[IJET V2I5P18] Authors:Pooja Mangla, Dr. Sandip Kumar Goyal
 
I017425763
I017425763I017425763
I017425763
 
Big Data Processing: Performance Gain Through In-Memory Computation
Big Data Processing: Performance Gain Through In-Memory ComputationBig Data Processing: Performance Gain Through In-Memory Computation
Big Data Processing: Performance Gain Through In-Memory Computation
 
T180304125129
T180304125129T180304125129
T180304125129
 
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
 
Programming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi ClustersProgramming Modes and Performance of Raspberry-Pi Clusters
Programming Modes and Performance of Raspberry-Pi Clusters
 
High Performance Computing for Satellite Image Processing and Analyzing – A ...
High Performance Computing for Satellite Image  Processing and Analyzing – A ...High Performance Computing for Satellite Image  Processing and Analyzing – A ...
High Performance Computing for Satellite Image Processing and Analyzing – A ...
 
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
Review: Data Driven Traffic Flow Forecasting using MapReduce in Distributed M...
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
 

More from Subhajit Sahu

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
Subhajit Sahu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Adjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTESAdjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Experiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTESExperiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTES
Subhajit Sahu
 
PageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTESPageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTES
Subhajit Sahu
 
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Subhajit Sahu
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
Subhajit Sahu
 
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
Subhajit Sahu
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
Subhajit Sahu
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
Subhajit Sahu
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Subhajit Sahu
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
Subhajit Sahu
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
Subhajit Sahu
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
Subhajit Sahu
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Subhajit Sahu
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Subhajit Sahu
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTES
Subhajit Sahu
 

More from Subhajit Sahu (20)

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Adjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTESAdjusting Bitset for graph : SHORT REPORT / NOTES
Adjusting Bitset for graph : SHORT REPORT / NOTES
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Experiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTESExperiments with Primitive operations : SHORT REPORT / NOTES
Experiments with Primitive operations : SHORT REPORT / NOTES
 
PageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTESPageRank Experiments : SHORT REPORT / NOTES
PageRank Experiments : SHORT REPORT / NOTES
 
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...
 
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTES
 

Recently uploaded

Mobile App Development Company in Noida - Drona Infotech.
Mobile App Development Company in Noida - Drona Infotech.Mobile App Development Company in Noida - Drona Infotech.
Mobile App Development Company in Noida - Drona Infotech.
Mobile App Development Company in Noida - Drona Infotech
 
Authentication Review-June -2024 AP & TS.pptx
Authentication Review-June -2024 AP & TS.pptxAuthentication Review-June -2024 AP & TS.pptx
Authentication Review-June -2024 AP & TS.pptx
DEMONDUOS
 
UMiami degree offer diploma Transcript
UMiami degree offer diploma TranscriptUMiami degree offer diploma Transcript
UMiami degree offer diploma Transcript
attueb
 
Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...
Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...
Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...
dream girl
 
Wired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptxWired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptx
SimonedeGijt
 
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
rachitkumar09887
 
NYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction InnovationNYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction Innovation
NYGGS Construction ERP Software
 
GT degree offer diploma Transcript
GT degree offer diploma TranscriptGT degree offer diploma Transcript
GT degree offer diploma Transcript
attueb
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
ThousandEyes
 
Blockchain in Agricultural Traceability Use Cases in 2024.pdf
Blockchain in Agricultural Traceability Use Cases in 2024.pdfBlockchain in Agricultural Traceability Use Cases in 2024.pdf
Blockchain in Agricultural Traceability Use Cases in 2024.pdf
Natsoft Corporation
 
Comprehensive Vulnerability Assessments Process _ Aardwolf Security.docx
Comprehensive Vulnerability Assessments Process _ Aardwolf Security.docxComprehensive Vulnerability Assessments Process _ Aardwolf Security.docx
Comprehensive Vulnerability Assessments Process _ Aardwolf Security.docx
Aardwolf Security
 
TEQnation 2024: Sustainable Software: May the Green Code Be with You
TEQnation 2024: Sustainable Software: May the Green Code Be with YouTEQnation 2024: Sustainable Software: May the Green Code Be with You
TEQnation 2024: Sustainable Software: May the Green Code Be with You
marcofolio
 
ThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and DjangoThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and Django
akshesh doshi
 
VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...
VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...
VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...
jealousviolet
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
Philip Schwarz
 
Artificial intelligence in customer services or chatbots
Artificial intelligence  in customer services or chatbotsArtificial intelligence  in customer services or chatbots
Artificial intelligence in customer services or chatbots
kayash1656
 
welcome to presentation on Google Apps
welcome to   presentation on Google Appswelcome to   presentation on Google Apps
welcome to presentation on Google Apps
AsifKarimJim
 
Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...
Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...
Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...
kiara pandey
 
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to KnowThe Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
onemonitarsoftware
 
Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...
Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...
Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...
bhumivarma35300
 

Recently uploaded (20)

Mobile App Development Company in Noida - Drona Infotech.
Mobile App Development Company in Noida - Drona Infotech.Mobile App Development Company in Noida - Drona Infotech.
Mobile App Development Company in Noida - Drona Infotech.
 
Authentication Review-June -2024 AP & TS.pptx
Authentication Review-June -2024 AP & TS.pptxAuthentication Review-June -2024 AP & TS.pptx
Authentication Review-June -2024 AP & TS.pptx
 
UMiami degree offer diploma Transcript
UMiami degree offer diploma TranscriptUMiami degree offer diploma Transcript
UMiami degree offer diploma Transcript
 
Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...
Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...
Russian Girls Call Mumbai 🛵🚡9833363713 💃 Choose Best And Top Girl Service And...
 
Wired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptxWired_2.0_Create_AmsterdamJUG_09072024.pptx
Wired_2.0_Create_AmsterdamJUG_09072024.pptx
 
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
Agra Girls Call Agra 0X0000000X Unlimited Short Providing Girls Service Avail...
 
NYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction InnovationNYGGS 360: A Complete ERP for Construction Innovation
NYGGS 360: A Complete ERP for Construction Innovation
 
GT degree offer diploma Transcript
GT degree offer diploma TranscriptGT degree offer diploma Transcript
GT degree offer diploma Transcript
 
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
Cisco Live Announcements: New ThousandEyes Release Highlights - July 2024
 
Blockchain in Agricultural Traceability Use Cases in 2024.pdf
Blockchain in Agricultural Traceability Use Cases in 2024.pdfBlockchain in Agricultural Traceability Use Cases in 2024.pdf
Blockchain in Agricultural Traceability Use Cases in 2024.pdf
 
Comprehensive Vulnerability Assessments Process _ Aardwolf Security.docx
Comprehensive Vulnerability Assessments Process _ Aardwolf Security.docxComprehensive Vulnerability Assessments Process _ Aardwolf Security.docx
Comprehensive Vulnerability Assessments Process _ Aardwolf Security.docx
 
TEQnation 2024: Sustainable Software: May the Green Code Be with You
TEQnation 2024: Sustainable Software: May the Green Code Be with YouTEQnation 2024: Sustainable Software: May the Green Code Be with You
TEQnation 2024: Sustainable Software: May the Green Code Be with You
 
ThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and DjangoThaiPy meetup - Indexes and Django
ThaiPy meetup - Indexes and Django
 
VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...
VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...
VVIP Girls Call Mumbai 9910780858 Provide Best And Top Girl Service And No1 i...
 
Folding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a seriesFolding Cheat Sheet #7 - seventh in a series
Folding Cheat Sheet #7 - seventh in a series
 
Artificial intelligence in customer services or chatbots
Artificial intelligence  in customer services or chatbotsArtificial intelligence  in customer services or chatbots
Artificial intelligence in customer services or chatbots
 
welcome to presentation on Google Apps
welcome to   presentation on Google Appswelcome to   presentation on Google Apps
welcome to presentation on Google Apps
 
Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...
Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...
Celebrity Girls Call Mumbai 9930687706 Unlimited Short Providing Girls Servic...
 
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to KnowThe Ultimate Guide to Phone Spy Apps: Everything You Need to Know
The Ultimate Guide to Phone Spy Apps: Everything You Need to Know
 
Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...
Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...
Independent Girls call Service Pune 000XX00000 Provide Best And Top Girl Serv...
 

HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)

HyPR: Hybrid Page Ranking on Evolving Graphs
Hemant Kumar Giri, Mridul Haque, Dip Sankar Banerjee
Department of Computer Science and Engineering, Indian Institute of Information Technology Guwahati, Bongora, Guwahati 781015, Assam, India.
Email: girihemant19@gmail.com, {mridul.haque,dipsankarb}@iiitg.ac.in

Abstract—PageRank (PR) is the standard metric used by the Google search engine to compute the importance of a web page by modeling the entire web as a first-order Markov chain. The challenge of computing PR efficiently and quickly has already been addressed by several previous works, which have shown innovations both in algorithms and in the use of parallel computing. The standard method of computing PR models the web as a graph. The fast-growing internet adds several new web pages every day, and hence more nodes (representing the web pages) and edges (the hyperlinks) are added to this graph in an incremental fashion. Computing PR on this evolving graph is an emerging challenge, since computation from scratch on the massive graph is time-consuming and unscalable. In this work, we propose Hybrid Page Rank (HyPR), which computes PR on evolving graphs using collaborative executions on multi-core CPUs and massively parallel GPUs. We exploit data parallelism by efficiently partitioning the graph into regions that are affected and unaffected by the new updates. The different partitions are then processed in an overlapped manner for PR updates. The novelty of our technique lies in utilizing the hybrid platform to scale the solution to massive graphs. The technique also provides high performance through parallel processing of every batch of updates using a parallel algorithm. HyPR executes efficiently on an NVIDIA V100 GPU hosted on a 6th Gen Intel Xeon CPU and is able to update a graph with 640M edges with a single batch of 100,000 edges in 12 ms.
HyPR outperforms other state-of-the-art techniques for computing PR on evolving graphs [1] by 4.8x. Additionally, HyPR provides 1.2x speedup over GPU-only executions and 95x speedup over CPU-only parallel executions.

Index Terms—Heterogeneous Computing, PageRank, CPU+GPU, Dynamic graphs.

I. INTRODUCTION

Link analysis is a popular technique for mining meaningful information from real-world graphs. A variety of knowledge models are typically employed by different applications. These knowledge models encode structural information and rich relationships between the different entities, which help in the extraction of critical information. One such model is the Hubs and Authorities model [2], which essentially proves that a web graph has multiple bipartite cores. Google [3], on the other hand, models the web graph as a first-order Markov chain which captures a user's browsing patterns on the web. This model is used by Google to generate a ranking of the different web pages; it forms the core concept of the PageRank (PR) algorithm and is used in Google search. The page ranking method models a hyperlink from one page to another as an endorsement of the destination page by the source. With the growth of the internet, computing PR on the entire web is a challenging task given that the graph is massive. Additionally, the graph is evolving (also called dynamic) in nature, with several new pages and hyperlinks generated every day, which leads to the additional challenge of computing new page ranks at regular intervals. While the simplest solution is to compute the page ranks from scratch every time, this is not feasible for any realistic use case. Hence it is necessary to investigate methods for computing new page ranks on evolving graphs that do not require computing PR from scratch. Parallel PR [4–6] has found widespread success in computing PR scores quickly.
With the advent of processors such as Graphics Processing Units (GPUs) and modern multi-core CPUs, parallel solutions for PR have found further widespread success. Given the capacity of GPUs for massive parallelism, their good programmability, and strong community support, GPUs have been able to provide sub-second performance in PR computations on massive real-world graphs. GPUs have also found good success in computing PR scores on evolving (or dynamic) graphs [1, 7]. Computing dynamic PR presents an ideal use case for heterogeneous computation, where the nature of the computation calls for techniques that can provide parallelism in a hierarchical manner. While some parallel work can be done at coarser degrees of granularity, the final page ranking scores need to be computed at a much finer granularity.

In this paper, we present HyPR1 (pronounced hy-per), a method for computing PR on dynamic graphs which utilizes both CPUs and GPUs for fast computation of the new scores. Our method consists of two broad phases of computation. In the first phase, we identify the partition of the graph that will be affected by a new set of edges being inserted. In the second phase, the actual PR computation is carried out. While the phases are mutually exclusive, performing both on a GPU would lead to some sequential computation. We therefore propose a heterogeneous method which provides higher degrees of parallelism across the two phases and hence leads to significant performance benefits. The concrete motivations and contributions of our paper are as follows:

1) Parallel partitioning: As pointed out, our technique depends on identifying the portions of the graph that are affected and unaffected by the newer batch of edges updated in the graph. We present a method that can perform the partitioning in parallel while a current set of updates is being incorporated into the existing graph.
1 Code available at: https://github.com/girihemant19/HyPR
This reduces irregular accesses by the GPU to a large extent and boosts performance.

2) Parallel updates with a parallel algorithm: To extract the maximum data parallelism (and hence performance) available on a GPU, we propose a technique that can process a batch of updates in parallel to compute new PR scores using a parallel algorithm which runs on the GPU. The step that makes the parallel update possible is a pre-processing step carried out on the CPU. To the best of our knowledge, this has not been proposed earlier.

3) Scalability: This is one of the main motivations behind the adoption of a hybrid solution. While the updates that happen in a batch-parallel manner can be small, the whole graph, even when represented using the most space-efficient data structures, consumes the largest amount of main memory. We show that a hybrid approach does not necessitate storing the whole graph in limited GPU memory. We propose to create minimal working sets that can effectively assimilate a new batch of updates; the approach is constrained only when the minimal working sets exceed the GPU memory. The much larger host memory holds the full graph as auxiliary storage.

4) Benchmarking: We thoroughly evaluate our technique on several real-world graphs with up to 640M edges (limited by the GPU memory). Our experiments show that our technique outperforms the state-of-the-art dynamic PR techniques by 4.8x, providing PR scores in 12 ms.

The rest of the paper is organized as follows. We provide a background on the fundamental ideas of our work in Section II. Related works are discussed in Section V. We describe our HyPR technique in detail in Section III. This is followed by the evaluations, where we discuss and analyze the results obtained, in Section IV. Finally, we draw brief conclusions and discuss possible directions for extension in Section VI.

II.
BACKGROUND
In this section we briefly discuss the algorithmic preliminaries on which HyPR depends.
A. Baseline PageRank
PR is in general computed on a graph G(V, E, w) where V (|V| = n) is the set of nodes representing the web pages, E (|E| = m) is the set of edges representing the hyperlinks, and w is the set of edge weights which represent the contribution of a node to the node it connects with via an edge. Using the random-surfer model, which accounts for a random internet user landing at a particular page, the PR value of a node u is given as:

pr(u) = \sum_{v \in I(u)} c(v \to u) + \frac{d}{n}   (1)

In Equation 1, d (taken as 0.15 conventionally) represents the damping factor, the probability of the random surfer staying at a particular page, and I(u) is the set of incoming neighbors of u. The function c() is the contribution. Taking the magnitude of the contribution of an incoming neighbor v to be (1 − d) times the ratio pr(v)/out_degree(v), the contribution can be written as:

c(v \to u) = (1 - d)\,\frac{pr(v)}{out\_degree(v)}   (2)

With Equation 2, the PR computations happen in a cyclic manner since two different nodes can make contributions to each other. The PR values of all nodes are therefore computed iteratively until they converge, i.e., until there is very little change in the PR scores after a particular iteration. This style of computing the PR scores is popularly done through power iterations [3]. The genesis of power iterations arises from the idea of a Markov chain reaching a steady state when started from an initial distribution. Here, the initial PR values compose the starting states of every node, which then go through several transitions to reach a stable state. The PR values of all the nodes can be arranged to form the state matrix, which has a set of eigenvalues, and the eigenvectors of this state matrix are solved for using power iterations.
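As an illustration, the power iteration of Equations 1 and 2 can be sketched in a few lines of Python. This is a minimal sequential sketch, not the paper's parallel implementation; the toy graph, damping value, and tolerance are illustrative:

```python
# Minimal sequential sketch of the power iteration of Equations 1-2.
# in_nbrs[u] lists nodes v with an edge v -> u (the set I(u)).

def pagerank(nodes, in_nbrs, out_deg, d=0.15, gamma=1e-10):
    n = len(nodes)
    pr = {u: d / n for u in nodes}          # initial distribution
    err = float("inf")
    while err > gamma:
        # Eq. 1-2: pr(u) = d/n + sum over v in I(u) of (1-d)*pr(v)/outdeg(v)
        new = {u: d / n + sum((1 - d) * pr[v] / out_deg[v]
                              for v in in_nbrs[u])
               for u in nodes}
        err = max(abs(new[u] - pr[u]) for u in nodes)
        pr = new
    return pr

# tiny 3-node cycle 0 -> 1 -> 2 -> 0
nodes = [0, 1, 2]
in_nbrs = {0: [2], 1: [0], 2: [1]}
out_deg = {0: 1, 1: 1, 2: 1}
pr = pagerank(nodes, in_nbrs, out_deg)
```

On the symmetric 3-node cycle above every node converges to a score of 1/3, the stationary distribution of the chain.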
By the intrinsic property of Markov chains, if the state matrix is stochastic, aperiodic, and irreducible, then it will converge to a stationary set of vectors which are the final PR scores. A basic parallel implementation of PR is shown in Algorithm 1.
B. Dynamic PR calculations
The seed idea for HyPR stems from the batch dynamic graph connectivity algorithm proposed by Acar et al. in [8]. Due to the iterative nature of PR, if the PR scores of all the incoming neighbors of a node u converge in a particular iteration, then the score of u will also converge in the immediate next iteration. This nature of PR allows the decomposition of a directed graph into a set of connected components (CC) which can be processed in parallel [4]. Since maintaining the CCs of a graph in a dynamic setting is equivalent to maintaining the connectivity of the graph, we can perform a batch update of size B in parallel in O(lg n + lg(1 + n/B)) work per edge. A batch here refers to a set of updates which can be either insertions into the existing graph, or deletions. Each entry in a batch is arranged as a tuple (ti, ui, vi, wi, oi), where ti is a time stamp, (ui, vi) is the edge, wi is the weight associated with the edge, and oi is the update type (insert/delete). We assume that a batch i will have at most Bi edges. We examine the impact of this batch size on performance later in Section IV. As stated, the problem of computing the PR scores for an incrementally growing graph can be treated in essence as the computation of two CCs, where one contains the set of nodes that gets affected by the batch of updates and the other does not. The same treatment for partitioning the graph for incrementally computing PR scores was also done by Desikan et al. [2], where the authors proposed scaling of the unchanged nodes first followed by the PR computations of the
changed nodes. For a graph G(V, E, w), let \sum_{i=1}^{n} w_i = W and |V| = n be the sum of the weights of all nodes and the order of the graph respectively. Every node is initialized with a PR score of w_i/W. Consider a particular node s in the existing graph. Its PR value can be expressed as:

PR(s) = d\,\frac{w_s}{W} + (1 - d)\sum_{i=1}^{k}\frac{PR(x_i)}{\delta(x_i)}   (3)

where d is the damping factor, x_i denotes each of the k incoming neighboring nodes pointing to s, and \delta(x_i) is the out-degree of x_i. From the fact that over some k iterations the PR values of all the nodes get scaled by a constant factor proportional to W, it can be deduced that for the node s the updated PR score can be scaled as W \cdot PR(s) = W' \cdot PR'(s), or:

PR'(s) = \frac{W}{W'}\,PR(s)   (4)

So, the new PR scores can easily be determined by scaling the old PR with the factor (W/W'). W can also be taken to be the order of the graph n(G), since every node is equally likely to be of weight 1; Equation 4 can then be re-written as:

PR'(s) = \frac{n(G)}{n(G')}\,PR(s)   (5)

The new nodes being added to G are put through the usual iterations to compute their PR values. Figure 1 demonstrates these partitions, where Bi represents the batches of updates. The set of nodes which we denote by Vnew are in the partition that requires scratch computations of PR. The other partition, Vold, only requires scaling using Equation 5. The Vborder nodes in the border area require scaling along with a few iterations of ranking before converging to their final PR scores. While the standard PR algorithm requires O(n + m) space to be maintained in memory, partitioning the graph effectively aids in reducing the memory requirement. If we assume each batch to be of size ∆ edges, then in addition to the original graph, an additional space of O(∆) needs to be created. Given the limited space available on GPUs, a combined space of O(n + m + ∆) will severely limit scaling.
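The scaling of Equation 5 can be illustrated with a small sketch: when a batch grows the graph, the converged scores of the unaffected nodes are rescaled rather than recomputed. All values below are illustrative, not from the paper:

```python
# Sketch of Equation 5: PR'(s) = (n(G)/n(G')) * PR(s) for every
# unaffected node s when the graph grows from n_before to n_after nodes.

def scale_old_scores(pr_old, n_before, n_after):
    factor = n_before / n_after
    return {s: factor * score for s, score in pr_old.items()}

# a 4-node graph grows to 5 nodes after a batch insert
pr_old = {0: 0.40, 1: 0.30, 2: 0.20, 3: 0.10}
pr_scaled = scale_old_scores(pr_old, n_before=4, n_after=5)
```

In this sketch the scaled scores sum to n(G)/n(G′) = 0.8 of the original mass, leaving the remainder to be claimed by the new and border nodes during their ranking iterations.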
In order to scale HyPR to larger sizes, we aim to keep only O(∆) additional space on the GPU, while the rest of the graph is maintained in the host memory. This is further refined through the partitioning, which decomposes the PR computations into discrete space requirements of O(Vnew), O(Vborder), and O(Vold). Each of these partitions need not reside in the same memory at all times. We use the compressed sparse row representation for the graph, as detailed in [9].

C. Datasets used
Even though PR is usually computed on web graphs representing web pages and hyperlinks, their properties are similar to those of other real-world graphs. Hence, we choose a healthy mix of web graphs and real-world graphs. The datasets we use are collected from the University of Florida Sparse Matrix Collection [10] and the Stanford Network Analysis Project (SNAP) [11]. The selected datasets range from 2.9M edges to around 640M edges, as detailed in Table I.

Figure 1: Identification of nodes

Algorithm 1: parallelPR(V, outdeg, InV, γ)
Require: Set of nodes V, out-degree of each node in outdeg, incoming neighbors InV.
Ensure: PageRank p of each node
 1: err = ∞
 2: for all u ∈ V do
 3:   previous(u) = d/|V|
 4: end for
 5: while err > γ do
 6:   for all u ∈ V in parallel do
 7:     p(u) = d/n
 8:     for all x ∈ InV(u) do
 9:       p(u) = p(u) + (previous(x)/outdeg(x)) ∗ (1 − d)
10:     end for
11:   end for
12:   err = 0
13:   for all u ∈ V do
14:     err = max(err, abs(previous(u) − p(u)))
15:     previous(u) = p(u)
16:   end for
17: end while
18: return p

III. METHODOLOGY
In this section we briefly discuss the approach adopted for implementing HyPR. We first give a basic overview of the approach and then provide a detailed explanation of the implementation strategies. Algorithm 2 shows the basic set of steps adopted in the implementation, and Figure 2 shows an overview of those steps. In a broad sense, HyPR works as concurrent (or overlapped) phases of the following.
1) Partitioning: The parallel cores of the CPU create the
partitions Vold, Vnew, and Vborder as discussed in Section II-B.
2) Transfer: Perform an asynchronous transfer of the ∆-sized batch to the GPU memory.
3) PR Calculations: Depending on the type of the partition, either scale or calculate new PR scores on the GPU.

Table I: Datasets Used

     Graph Name            Source   |V|       |E|        Type
 1.  Amazon                [11]     0.41 M    3.35 M     Purchasing
 2.  web-Google            [11]     0.87 M    5.10 M     Web Graph
 3.  wiki-Topcat           [11]     1.79 M    28.51 M    Social
 4.  soc-pokec             [11]     1.63 M    30.62 M    Social
 5.  Reddit                [11]     2.61 M    34.40 M    Social
 6.  soc-LiveJournal       [11]     4.84 M    68.99 M    Social
 7.  Orkut                 [11]     3.00 M    117.10 M   Social
 8.  Graph500              [11]     1.00 M    200.00 M   Synthetic
 9.  NLP                   [10]     16.24 M   232.23 M   Optimization
10.  Arabic                [10]     22.74 M   639.99 M   Web Graph

A. Why Hybrid?
Before discussing the HyPR design in detail, we first motivate the requirement for a hybrid solution. As stated earlier, our intention for a heterogeneous solution is threefold. First, a hybrid solution allows the PR calculation to scale to very large sizes which are otherwise limited by the GPU main memory size. As discussed in Section II-B, only O(|V| + ∆) space is needed for computing the updated PR values for a particular batch Bi. The static graph undergoing updates resides in the much larger main memory of the host and only aids in creating the partitions. Second, even ignoring scaling, we need to perform partitioning followed by parallel computation of the PR scores, which results in limited GPU utilization since these steps are not independent. Also, the degree of parallelism available in the partitioning step is limited in comparison to the scaling and PR updates. Hence, the CPU is an ideal device for the coarsely parallel partitioning step, while the GPU is better suited to the finely parallel update and scaling. We show in Section IV that a GPU-only execution is actually slower than the hybrid technique.
Finally, a hybrid technique allows us to accrue higher system efficiency, which would otherwise have the CPU sitting idle while the GPU is updating the PR scores.

B. Graph Partitioning
We start the computation on a graph G(V, E, w) for which some PR scores have already been computed. The algorithmic phases that HyPR goes through are outlined in Algorithm 2. In successive time intervals, a set of batch updates arrives. As discussed previously in Section II-B, the batches consist of a heterogeneous mix of edges that are either to be inserted or deleted. A batch Bi can be represented by the tuples (ti, ui, vi, wi, oi), as discussed in Section II-B. For updating the batch in parallel, we identify three connected components (CC) that strictly require different computations. If we consider the existing graph as one single CC (say Co), and the new batch of updates as a separate CC (say Cn), then we can construct a block graph S where there will be some incoming edges from Cn to Co if there exist edge updates (ui, vi) where ui ∈ Co and vi ∈ Cn. We can assume that the new batch of updates is topologically sorted, since the PR scores of the new nodes in Bi are guaranteed to be lower than those in Co. In an auxiliary experiment, we tested the nature of these CCs to see if they form strongly connected components (SCC), so as to arrive at a formulation similar to [4]. Decomposing a graph into a set of SCCs provides the advantage of a topological ordering of the partitions, where PR computations (or re-computations) of the scores can be done in a cascaded manner: the scaling of the Vborder and Vold nodes first, followed by the PR calculation of Vborder and Vnew. This order of cascaded updates has also been adopted to update the PR scores in [2]. We found that if we re-compute the three partitions using Kosaraju's algorithm (cf.
[12]) to identify the Vold, Vnew, and Vborder partitions, approximately 2% of the edges differ on the Arabic dataset when compared to our mechanism of partitioning. Hence, we conclude that our mechanism produces approximate SCCs, which can be easily monitored for only the 2% extraneous edges so as to maintain a proper topological order while performing the batch updates. We do a small amount of book-keeping on these edges in order to partition them correctly. We can now identify three partitions for every batch Bi: (i) Vold, the set of nodes that are already present in the existing graph; (ii) Vnew, the set of entirely new nodes that are to be added to the existing graph, found as Vi − (Vi ∩ Vold) if Vi is the set of nodes in batch Bi; (iii) Vborder, the set of nodes which have edges in Bi connecting Vold and Vnew and are reachable using a breadth-first traversal. As an example, in Figure 1 we can see the Vold nodes in red, the Vnew nodes in green, and the Vborder nodes, which are the collection of nodes in the first hop having a direct connection with Vnew. Essentially, all the nodes in G without the yellow nodes are Vborder. As we can see in Algorithm 2, the first phase of the update operation is the creation of these partitions. Since the identifications of these vertices are independent of each other, they can be done in parallel. It is critical to note that at this stage the number of edges present in a single batch does not warrant any form of edge parallelism on the GPU, as that would lead to lower utilization of the available GPU bandwidth at the cost of high memory transfers.

C. Pre-processing
In the pre-processing phase, we compute the Vold, Vnew, and Vborder partitions in a manner overlapped with the GPU. The sequence of operations followed for every incoming batch of updates is:
1) Pre-process the first batch in parallel using OpenMP threads on the CPU.
2) Transfer Vold, Vnew, and Vborder to the GPU for scaling and PR computations in an asynchronous manner.
3) Start pre-processing the second batch of updates on the CPU as soon as the first partition is
handed off to the GPU for ranking.

Figure 2: Overview of HyPR execution steps. (a) Graph with batch updates. (b) Identification of border, new, and unchanged nodes. (c) Scaling on unchanged and border nodes. (d) PageRank computation on border and new nodes.

Algorithm 3 demonstrates the partitioning mechanism, which is called from Line 2 of Algorithm 2. As mentioned earlier, a batch of updates contains a heterogeneous mix of both insert and delete operations, which have to be handled uniquely.
1) Insertion: Insertion is more compute intensive than deletion. Lines 6-22 of Algorithm 3 depict our insertion mechanism. We first populate Vnew with all the nodes of the batch and process all the elements of Vnew using the available CPU threads, successively sending each node to the appropriate partition. We first check whether the source node of an edge (ui, vi) belongs to the existing partition. If it does, that node is put in Vold. For computing Vborder, we check whether the particular node is reachable in G (the existing graph) in a breadth-first manner and has a predecessor in the incoming batch. The intuition is that Vborder is the set of nodes reachable from the Vnew set of nodes, which hence undergo both scaling and PR computation. So, in parallel, all the nodes in Vnew are popped and classified into either Vold or Vborder. The ones remaining in Vnew are the entirely new nodes that have arrived in Bi, whose PR values need to be computed from scratch.
2) Deletion: Deletion is much simpler than insertion and hence less compute intensive. In case of deletion, the Vnew set will be NULL, and the only nodes involved are the Vold and Vborder sets. As we can see from Lines 23-33 of Algorithm 3, if the update type oi is delete, then we remove the involved nodes from Vold, which contains the nodes of the original graph.
These removals will require a fresh set of PR computations, handled during the PR update step in Algorithm 2. Additionally, the removals induce a new set of Vborder nodes: we check whether any reachable successors or predecessors of the removed nodes are present, and such nodes are pushed into Vborder.

D. Scaling the old nodes
The pre-processing step essentially allows us to perform data-parallel scaling and PR computations on the individual partitions. As discussed earlier, the primary idea behind HyPR is the localization of the set of nodes affected by a new batch of updates. As we can see from Lines 6-11 in Algorithm 2, we call a GPU kernel to scale the nodes of the Vold partition using Equation 5 discussed in Section II-B. We can now achieve full GPU bandwidth saturation, as |Vold| threads can be spawned to scale all the nodes in parallel. It is critical to note that in the hybrid implementation there will be intermediate transfers, but these do not require the entire G (which is initially the V nodes of G) to be copied every time. Rather, the original graph G is copied to the GPU before the batch processing starts and is augmented with Vnew after every batch is processed. As we can see in Lines 9-11 of Algorithm 2, the scaling operation is performed on the Vborder nodes as well. Scaling on the GPU is required for all nodes in the Vold and Vborder sets. The actual scaling is an O(1) operation per node, which makes it most suitable for a massively parallel implementation on the GPU. Threads equal to the number of nodes involved in the scaling process (Vold or Vborder) are spawned on the GPU to execute the scaling kernels in an SIMT manner.

E. Page Rank Update
The PR update of the Vnew and Vborder nodes is now a much lighter computation owing to the partitions.
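The per-batch flow of an insert batch — scale Vold, scale then re-rank Vborder, rank Vnew from scratch — can be sketched as follows. This is a hedged illustration: `toy_rank` is a hypothetical stand-in for the parallelPR() kernel, and all numbers are illustrative:

```python
# Sketch of one insert batch, combining the scaling step of Equation 5
# with the partition-wise updates of Algorithm 2: V_old is only
# rescaled, V_border is rescaled and then re-ranked, and V_new is
# ranked from scratch. rank() is a hypothetical stand-in kernel.

def apply_batch(pr, v_old, v_border, v_new, n_before, n_after, rank):
    factor = n_before / n_after          # n(G) / n(G')
    for u in v_old | v_border:           # scaling: O(1) per node
        pr[u] = factor * pr[u]
    for u in v_new:                      # new nodes start from scratch
        pr[u] = 0.0
    rank(pr, v_border | v_new)           # power iterations on the rest
    return pr

def toy_rank(pr, nodes):
    # stand-in "ranking": one smoothing pass toward the uniform score
    n = len(pr)
    for u in nodes:
        pr[u] = 0.5 * (pr[u] + 1.0 / n)

pr = apply_batch({0: 0.5, 1: 0.5}, v_old={0}, v_border={1}, v_new={2},
                 n_before=2, n_after=3, rank=toy_rank)
```

Note how only the border and new partitions pass through the ranking step; the old partition is touched exactly once, by the scaling.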
As with the standard parallel PR implementation shown in Algorithm 1, the power iterations for computing the PR scores continue until the scores converge to an error threshold γ (set to 10^-10). However, since the Vborder set undergoes a step of scaling before the PR update step, the number of iterations required for the scores to converge is much lower than when computing from scratch. So, during the PR update step for Vborder and Vnew, as shown in Lines 12-17 of Algorithm 2, we call the parallelPR() of Algorithm 1. For the Vnew nodes, the computation is trivial, since the number of nodes is low: it contains only the new nodes added by the batch. The Vborder nodes, although much larger in number than the Vnew set, also go through parallelPR(); however, they converge much quicker since they have undergone a step of scaling previously. For deletions, the PR update is required only for the Vold and Vborder sets of nodes. As we can see from Lines 19-26 in Algorithm 2, the same PR update process will
be applied first for the Vborder set. The Vold set simply undergoes a step of scaling, similar to the case of insertions.

Algorithm 2: HyPR: Hybrid Page Ranking on G with incremental batches Bi
Require: Scratch graph G and k batches in B, each represented by (ti, ui, vi, wi, oi); PageRank vector dest; out-degree of each node in outdeg; incoming neighbors of every node in InV.
Ensure: Rank of the nodes in vector dest
{Phase 1: Pre-processing phase}
 1: CPU:: Partition the incoming batches based on insertion and deletion
 2: (Vold, Vborder, Vnew) = createPartition(G, B)
 3: CPU:: Queue Vold, Vborder, Vnew for async transfer to GPU
{Phase 2: PR update}
 4: INSERTION: Generate threads equal to the number of Vborder, Vnew
 5: if (oi == insert) then
 6:   for ∀u ∈ Vold in parallel do
 7:     GPU:: dest[u] = (|V| ∗ dest[u]) / |Vold| {Scaling}
 8:   end for
 9:   for ∀x ∈ Vborder in parallel do
10:     GPU:: dest[x] = (|V| ∗ dest[x]) / |Vborder| {Scaling}
11:   end for
12:   for ∀z ∈ Vborder in parallel do
13:     GPU:: dest[z] = parallelPR(Vborder, outdeg, InV[z])
14:   end for
15:   for ∀y ∈ Vnew in parallel do
16:     GPU:: dest[y] = parallelPR(Vnew, outdeg, InV[y])
17:   end for
18: end if
19: DELETION: Generate threads equal to the number of Vold and Vborder
20: if (oi == delete) then
21:   for ∀u ∈ Vborder in parallel do
22:     GPU:: dest[u] = parallelPR(Vborder, outdeg, InV[u]) {PR Update}
23:   end for
24:   for ∀v ∈ Vold in parallel do
25:     GPU:: dest[v] = (|V| ∗ dest[v]) / |Vold| {Scaling}
26:   end for
27: end if

F. CUDA+OpenMP implementation
We can observe a snapshot of the overlapped execution model in Figure 4. The target performance critically depends on creating the perfect balance between the computations occurring on the CPU and the GPU. The CPU is responsible for creating the partitions and transferring them to the GPU. The GPU, on the other hand, is responsible for performing the three kernel operations: scaling, and the two PR update operations.
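The CPU/GPU overlap described above can be illustrated with plain Python threads standing in for CUDA streams. This is a sketch only: `partition_fn` and `rank_fn` are hypothetical stand-ins for the CPU partitioning step and the GPU scaling/PR kernels:

```python
# Illustrative pipeline: while batch i is being "ranked" on the device
# worker, the host loop already partitions batch i+1.
from concurrent.futures import ThreadPoolExecutor

def pipeline(batches, partition_fn, rank_fn):
    results = []
    with ThreadPoolExecutor(max_workers=1) as device:
        pending = None                      # in-flight "GPU" work
        for batch in batches:
            parts = partition_fn(batch)     # host side, overlaps device
            if pending is not None:
                results.append(pending.result())
            pending = device.submit(rank_fn, parts)
        if pending is not None:
            results.append(pending.result())
    return results

# toy stand-ins: partition = split odds/evens, rank = count nodes
batches = [[1, 2, 3], [4, 5], [6]]
out = pipeline(batches,
               partition_fn=lambda b: ([x for x in b if x % 2],
                                       [x for x in b if not x % 2]),
               rank_fn=lambda p: len(p[0]) + len(p[1]))
```

While the device worker is busy ranking batch i, the loop is already partitioning batch i+1, mirroring the overlap shown in Figure 4.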
Algorithm 3: createPartition(G, B)
Require: Graph G and k batches in B, each bi represented by (ti, ui, vi, wi, oi) where oi denotes insert or delete
Ensure: Vold, Vnew, Vborder
 1: CPU:: Generate threads using OpenMP
 2: Initialize Vold, Vnew, Vborder = φ, Vtemp = φ
 3: Push ∀(u, v) ∈ batch to Vnew
 4: Push ∀u ∈ G to Vold
 5: for ∀bi ∈ B in parallel do
 6:   if (oi == insert) then
 7:     while (Vnew != NULL) do
 8:       Pop element x ∈ Vnew
 9:       if (x ∈ Vtemp) then
10:         Continue
11:       end if
12:       Push x to Vtemp
13:       for every successor y of x ∈ G do
14:         Push y to Vtemp
15:       end for
16:     end while
17:     for ∀z ∈ Vtemp do
18:       for ∀ predecessors li ∈ bi do
19:         Push li into Vborder
20:       end for
21:     end for
22:   end if
23:   if (oi == delete) then
24:     for ∀(u, v) ∈ bi do
25:       Choose u and v from Vold
26:     end for
27:     for ∀y, successor of (u, v) ∈ G do
28:       Push y into Vborder
29:     end for
30:     for ∀y, predecessor of (u, v) ∈ G do
31:       Push y into Vborder
32:     end for
33:   end if
34: end for
35: return Vold, Vnew, Vborder

To achieve that, we make use of synchronous CUDA kernel calls, asynchronous transfers, and CUDA streams to orchestrate the entire execution model. For creating the partitions, we utilize CPU threads created using the OpenMP library. We create threads equal to the number of available processing cores. Naturally, the batch sizes will be much bigger than the number of threads; we use standard blocking of the batches for each thread to handle. Despite large batches, this provides good performance, because the partitioning operations are simple in nature and do not involve CPU-intensive computation. Additionally, the partitioning is an irregular operation which the CPU is much better at handling than the GPU. CUDA streams are created before the start of the
operation. Once the CPU finishes the partitioning operation on a particular batch, cudaMemcpyAsync() calls are issued for the three partitions on individual streams. CUDA events associated with the copy operations monitor their completion.

IV. PERFORMANCE EVALUATION
In this section we discuss the experiments performed to validate the efficacy of our solution and analyze its performance.
A. Experimental environment
For conducting our experiments we use a platform that has a multicore CPU connected to a state of the art GPU via the PCIe link. The CPU is an Intel Xeon Silver 4110 based on the Skylake micro-architecture. Two of these CPUs, each having 8 cores, are arranged in two sockets, effectively providing 16 NUMA cores. The cores are clocked at 2.1 GHz with 12 MB of L3 cache. The host is attached to an NVIDIA V100 GPU, which has 5120 CUDA cores spread across 80 streaming multiprocessors (SMs). Each GPU core is clocked at 1.38 GHz, and the GPU has 32 GB of global main memory. The GPU is connected to the CPU via a PCIe Gen2 link. The machine runs the CentOS 7 OS. For multi-threading on the CPU, we use OpenMP version 3.1 and GCC version 4.8. The GPU programs are compiled with nvcc from CUDA version 10.1 with the −O3 flag. All experiments have been averaged over a dozen runs. For the experiments, we use the real-world graphs shown in Table I. The datasets do not possess any timestamps of their own. As done in previous works [1, 13], we simulate a random arrival of edge updates by randomly setting the timestamps; the updates then happen in increasing order of the timestamps. For the evaluations, we adopt a sliding-window model where we take a certain percentage of the original dataset to construct the batches. These are then varied to measure the performance.
B. Update time
In this section we discuss the performance of HyPR in terms of update times.
As mentioned earlier, we start the computations by taking half of the edges of the entire graph dataset. We then measure the update times as the sliding window moves to generate a set of batches. The update times reported are recorded using CUDA event timers to capture the overlapped pre-processing, transfer, and kernel execution times (shown as one blue box in Figure 4). A timer is started before the OpenMP parallel section, where 15 threads perform parallel partitioning of a batch and one thread performs the asynchronous transfers and kernel calls. The timer is stopped once the CUDA events indicate the end of the final update kernel. We have experimented with batch sizes from 1% to 10%. In Figure 3, we can see the latencies achieved over all the graphs. For the largest graph, Arabic, the average update time that we achieve is 85.108 ms. In Figure 3, we also show the comparative time of the "Pure GPU" implementation. This is done in order to validate our claim that a hybrid implementation, which efficiently overlaps the PR computations with the CPU-side partitioning and transfers, shows better performance. On average, we observe that HyPR achieves a 1.1843x speedup over the "Pure GPU" performance on the 5 largest graphs that we experiment with, and 1.2305x over all the graphs. The speedups achieved are more pronounced on the larger graphs in comparison to the smaller ones. This is because the larger graphs provide higher degrees of parallelism, which allows the GPU to spawn more threads, and the CPU-side pre-processing allows a higher degree of overlap. This is better explained by the experiment discussed in the next section. We also execute HyPR on a multi-core CPU using only OpenMP, which we call "Pure CPU".
We get around a 95x improvement over it, a gap orders of magnitude larger than that against "Pure GPU", and hence the "Pure CPU" numbers are not included in Figure 3.
C. Hybrid Overlaps
As stated in Section I, one of the major motivations behind a hybrid computation is to perform the partitioning of the graph iteratively while the GPU is busy updating the PR scores. We first show how the pipeline that has been set up works for the Arabic graph. The large graphs become the best use-cases for this pipelined execution, as they are able to achieve the perfect balance of computations on the CPU and GPU sides. As we can observe from Figure 4, the execution begins with the partitioning of the first batch B1 of updates, which does not involve any overlap. Once the partitioning completes, the asynchronous transfer of the partitions to the GPU is performed, which allows the CPU to process batch B2 immediately. The transfer, which is registered with a callback, automatically spawns the kernels as soon as it completes. We can see that for a modest batch size of 10,000 edges, it takes 74.43 ms to pre-process the B2 batch, which fully masks the 41.12 ms of transfer and the 34.77 ms required by the kernels. This behavior is the best-case scenario, which is however not observed for all graphs. We can see from Figure 5 that for the smaller graphs (Figure 5(a-f)), the difference between the partitioning time and the transfer+kernel time is −1.81 ms (meaning pre-processing is slower on average). In Figure 5, the batch sizes are kept uniform (at 10K edges, approximately 1% of the dataset). For several of the smaller graphs — soc-pokec, Reddit, Orkut, and Graph-500 — we can observe that for several batches the two curves cross each other at multiple points. This can be attributed to the structural heterogeneity of the batches.
If a batch, for example, contains too many new nodes, then the kernel times for calculating new PR scores on Vborder and Vnew will be higher than the pre-processing time, which will be comparatively low in those cases. The reverse scenario, which is more conducive, is when the batches have a healthy mix of new and old nodes. This creates the right kind of overlap, as shown in Figure 4, where the waiting times of the GPU are minimized. An inflection point is observed for NLP (Figure 5(i)) and for Arabic (Figure 5(j)), where we can see that the partitioning
time exceeds the transfer+kernel time at a particular batch size. The partitioning times remain higher for the remaining larger graphs. This indicates the scalability of the CPU-side partitioning on different batches. On average, the difference between the two is 4.95 ms for the larger graphs (Figure 5(g-j)). Hence, we can conclude that HyPR is data driven, and shows its best performance on the datasets that are able to extract the maximum amount of system performance. The system efficiency achieved by HyPR is discussed in the next subsection.

Figure 3: Update Time. Panels (a)-(j) plot update time (ms) for HyPR and PureGPU against batch sizes of 1%, 5%, and 10% for Amazon, web-Google, wiki-Topcat, soc-pokec, Reddit, soc-LiveJournal, Orkut, Graph500, NLP, and Arabic.

Figure 4: Overlapping of the pre-processing time with transfer and kernel time for the Arabic graph.

D. Resource Utilization
In this experiment, we investigate the system efficiencies achieved by HyPR. Towards that, we profile the utilization of resources, through measurements of the memory utilization and the warp occupancy of the GPU threads. We used the nvprof profiler from the CUDA toolkit.
We used two profiling metrics from nvprof: first, achieved occupancy, the number of warps running concurrently on an SM divided by the maximum warp capacity; and second, gld efficiency, the ratio of the requested global memory load throughput to the required global memory load throughput, i.e., how well the coalescing of DRAM accesses works. Figure 6 shows these two profiling results for different batch sizes. We observe that the occupancies scale linearly with increasing batch sizes. The global load efficiency indicates increased coalesced accesses on every batch. On average we achieve a global load efficiency of 61.07% and a warp occupancy of 64.14%.
E. Comparative Analysis
We now compare the performance of HyPR with some of the state of the art solutions for dynamic PR. We mainly compare our work with GPMA [1], GPMA+ [1], and cuSparse [14]. GPMA exploits a packed memory array (PMA) to handle dynamic updates by storing sorted elements in a partially contiguous manner. The cuSparse library has efficient CSR-based implementations of sparse matrix-vector multiplication (SpMV). We implement a basic PR update mechanism (purely on the GPU) using cuSparse, where the new incoming batch iteratively undergoes SpMV operations until the PR values converge. For the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1] — an Intel Xeon CPU connected to a Titan X Pascal GPU — and on the same datasets. Additionally, the PR scores are derived after running all the experiments until convergence or 1000 iterations, whichever is earlier. Towards that, we use an additional graph, "Random", having 1M nodes and 200M edges. From Figure 7, we can observe that HyPR outperforms the state of the art GPMA, GPMA+, and cuSparse implementations for the four largest graphs. On average over the four graphs used, HyPR outperforms GPMA by 4.81x, GPMA+ by 3.26x, and cuSparse by 102.36x.
F.
Accuracy Analysis

For checking the accuracy of HyPR, all the PR scores are pre-computed using nvGraph [15]. nvGraph, a part of the NVIDIA CUDA Toolkit, is a state-of-the-art solution that provides millisecond-scale performance for computing PR scores on a static graph with 500 iterations. We use nvGraph for computing the from-scratch PR scores of the graph. The batch updates are then applied to the graph through the HyPR method, and the results are checked against the nvGraph scores, which we obtain by feeding nvGraph a graph that has the batches included. Once all the batch
[Figure 5: Overlaps achieved between partitioning and transfer+kernel times for 10 batches B1-B10 of uniform size, on (a) Amazon, (b) Web-Google, (c) Wiki-topcat, (d) soc-Pokec, (e) Reddit, (f) soc-LiveJournal, (g) Orkut, (h) Graph-500, (i) NLP, (j) Arabic. Each panel plots the preprocessing time against the transfer and kernel time.]
[Figure 6: Resource Utilization. Achieved warp occupancy and global load efficiency against batch size for (a) Reddit, (b) Graph500, (c) Orkut, (d) NLP.]

[Figure 7: Comparison of HyPR with GPMA, GPMA+, and cuSparse. Update times against batch size for (a) Reddit, (b) soc-Pokec, (c) Graph500, (d) Random.]

[Figure 8: Accuracy of PR scores. Jaccard similarity of the top-10, top-20, and top-30 nodes, and the divergence, for Orkut, Graph500, NLP, and Arabic.]

updates are completed, we compare the PR scores with the pre-computed ones. To check for accuracy, we introduce batch updates with insert and delete operations on the nodes with the highest PR scores in the existing graph. This is done in order to impose the maximum change on the PR scores due to an update.
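The top-k comparison behind this accuracy check reduces to a Jaccard similarity between the sets of the k highest-ranked nodes produced by the two methods. A minimal sketch, where the node names and score values are hypothetical:

```python
def top_k(pr_scores: dict, k: int) -> set:
    """Node ids of the k highest PR scores."""
    return set(sorted(pr_scores, key=pr_scores.get, reverse=True)[:k])

def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|; 1.0 means identical top-k sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical scores: an incremental update vs. a from-scratch run.
pr_hypr    = {"u": 0.31, "v": 0.25, "w": 0.20, "x": 0.14, "y": 0.10}
pr_nvgraph = {"u": 0.30, "v": 0.26, "w": 0.19, "y": 0.15, "x": 0.10}

print(jaccard(top_k(pr_hypr, 3), top_k(pr_nvgraph, 3)))  # 1.0
```

Both methods rank u, v, w in the top 3 here, so the similarity is 1.0 even though the individual scores (and the ranks of x and y) differ.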
In Figure 8, we show the Jaccard similarities of the PR scores computed by HyPR with those computed by nvGraph for the top 10, 20, and 30 nodes with the highest PR scores. On average, the similarity score comes to 0.985 for a fixed 500 iterations (the same as that used for computing the scores using nvGraph). The similarity score rises to 0.991 on average if we allow the different graphs to converge till they reach the threshold γ. We also see that, on average, the divergence of the PR scores computed by HyPR from those of nvGraph is less than 0.001% (shown on the y2 axis).

V. RELATED WORKS

GPU-based PR has been explored in the works by Duong et al. [16] and Garg et al. [4]. In [16], the authors
propose a new data structure for graph representation named the link structure file. In their work, they target the steps in PR computation where sufficient data parallelism exists. These steps are then distributed among multiple GPUs, where each thread performs finer-grained work. Garg et al. [4] provide algorithmic techniques for partitioning the graph based on its structural properties to extract parallelism. PR on evolving graphs has been explored by Sha et al. [1], who propose two algorithms, GPMA and GPMA+, based upon the packed memory array. GPMA is a lock-based approach in which a small number of concurrent update conflicts are handled efficiently. GPMA+ is a lock-free, bottom-up approach that prioritizes updates and favors coalesced memory access. Another work is by Feng et al. [17], who propose the DISTINGER framework. DISTINGER employs a hash-partitioning-based scheme that favors massive graph updates and message passing among the partition sites using MPI. Another algorithm to compute personalized PR on dynamic graphs was published by Guo et al. [13], which also exploits GPUs for performance. Similar to HyPR, the computations proposed in [13] are also done in a batched manner. To enhance the performance of parallel push, different optimization techniques are introduced. One of them is eager propagation, which minimizes the number of local push operations. They also propose a frontier generation method that keeps track of vertex frontiers while cutting down the synchronization overhead needed to merge duplicate vertices. Batch parallelism for dynamic graphs has also been the subject of several theoretical studies. A generic framework for batch parallelism is proposed by Acar et al. in [8]. Batch parallelism for graph connectivity and other problems in the massively parallel computation (MPC) model is explored by Dhulipala et al. in [18]. The work by Desikan et al. [2] is one of the earliest works towards incremental PR computation on evolving graphs.
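The common thread in these incremental approaches can be illustrated with a deliberately simplified, single-threaded sketch: after a batch of edge inserts, reuse the PR vector of the old graph as the starting point of the power iteration, so that convergence is reached in fewer iterations than a from-scratch run. This is only an illustration of warm-starting, not the partitioning scheme of [2] or of HyPR; the toy graph, damping factor, and tolerance below are made up:

```python
# Simplified single-threaded illustration of incremental PR via warm-starting,
# not the HyPR partitioning scheme itself.
def pagerank(out_edges, start=None, d=0.85, tol=1e-10, max_iter=1000):
    nodes = list(out_edges)
    n = len(nodes)
    pr = dict(start) if start else {u: 1.0 / n for u in nodes}
    for u in nodes:                       # new nodes missing from the old vector
        pr.setdefault(u, 1.0 / n)
    for it in range(max_iter):
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            outs = out_edges[u]
            if outs:
                share = d * pr[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:                         # dangling node: spread mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        if sum(abs(nxt[u] - pr[u]) for u in nodes) < tol:
            return nxt, it + 1
        pr = nxt
    return pr, max_iter

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
base, iters_cold = pagerank(g)            # from-scratch run
g["b"].append("a")                        # one batched edge insert
_, iters_warm = pagerank(g, start=base)   # warm start from the old vector
print(iters_cold, iters_warm)             # warm start typically needs fewer
```

For a small batch the old and new stationary distributions are close, so the warm-started iteration begins nearer the fixed point; HyPR and [2] go further by restricting recomputation to the affected partition.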
The authors proposed the partitioning and scaling techniques which we modify for parallelization on a heterogeneous platform. To the best of our knowledge, no other work exists that explores a hybrid CPU+GPU solution for computing global PR. In HyPR, we propose techniques for PR computation that use batch parallelism in unison with fully parallel partitioning and PR update mechanisms on a hybrid platform to extract high performance.

VI. CONCLUSION AND FUTURE WORK

In this work, we propose HyPR, a hybrid technique for computing PR on evolving graphs. We have shown an efficient mechanism to partition the existing graph and its updates into data-parallel work units which can be updated independently. HyPR is executed on a state-of-the-art high-performance platform and exhaustively tested against large real-world graphs. HyPR provides substantial performance gains of up to 4.8x over other existing mechanisms and also extracts generous system efficiency. In the near future, we plan to extend HyPR by spreading the computations across multiple GPUs located on shared and distributed memories. Communication between distributed nodes will become an additional overhead to handle in that case. Additionally, modern HPC systems are equipped with newer-generation interconnects like NVLink, which deserve to be explored in the context of page ranking.

VII. ACKNOWLEDGMENT

This work is supported by the Science and Engineering Research Board (SERB), DST, India through the Early Career Research Grant (no. ECR/2016/002061) and by NVIDIA Corporation through the GPU Hardware Grant program.

REFERENCES

[1] M. Sha, Y. Li, B. He, and K.-L. Tan, “Accelerating dynamic graph analytics on GPUs,” Proc. of the VLDB Endow., vol. 11, no. 1, pp. 107–120, 2017.
[2] P. Desikan, N. Pathak, J. Srivastava, and V. Kumar, “Incremental page rank computation on evolving graphs,” in Proceedings of the 14th International WWW Conference, 2005, pp. 1094–1095.
[3] L. Page, S. Brin, R. Motwani, and T.
Winograd, “The PageRank citation ranking: Bringing order to the web,” Stanford InfoLab, Tech. Rep., 1999.
[4] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques for efficient parallel PageRank computation on real-world graphs,” in Proceedings of the 17th ICDCN, 2016, pp. 1–10.
[5] D. Gleich, L. Zhukov, and P. Berkhin, “Fast parallel PageRank: A linear system approach,” Yahoo! Research Technical Report YRL-038, available via http://research.yahoo.com/publication/YRL-038.pdf, vol. 13, p. 22, 2004.
[6] A. Cevahir, C. Aykanat, A. Turk, and B. B. Cambazoglu, “Site-based partitioning and repartitioning techniques for parallel PageRank computation,” IEEE TPDS, vol. 22, no. 5, pp. 786–802, 2011.
[7] M. Kim, “Towards exploiting GPUs for fast PageRank computation of large-scale networks,” in Proceedings of the 3rd International Conference on Emerging Databases, 2013.
[8] U. A. Acar, D. Anderson, G. E. Blelloch, and L. Dhulipala, “Parallel batch-dynamic graph connectivity,” in The 31st ACM SPAA, 2019, pp. 381–392.
[9] Compressed Sparse Column Format (CSC), https://scipy-lectures.org/advanced/scipy_sparse/csr_matrix.html.
[10] The University of Florida Sparse Matrix Collection, https://snap.stanford.edu/data.
[11] J. Leskovec and A. Krevl, SNAP Datasets: Stanford Large Network Dataset Collection, http://snap.stanford.edu/data.
[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition, 3rd ed. The MIT Press, 2009.
[13] W. Guo, Y. Li, M. Sha, and K.-L. Tan, “Parallel personalized PageRank on dynamic graphs,” Proc. of the VLDB Endow., vol. 11, no. 1, pp. 93–106, 2017.
[14] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, “cuSPARSE library,” in GPU Technology Conference, 2010.
[15] nvGraph toolkit documentation, https://docs.nvidia.com/cuda/cuda-runtime-api/index.html.
[16] N. T. Duong, Q. A. P. Nguyen, A. T. Nguyen, and H.-D. Nguyen, “Parallel PageRank computation using GPUs,” in Proc.
of the 3rd Symposium on Information and Communication Technology, 2012, pp. 223–230.
[17] G. Feng, X. Meng, and K. Ammar, “DISTINGER: A distributed graph data structure for massive dynamic graph processing,” in International Conference on Big Data. IEEE, 2015, pp. 1814–1822.
[18] L. Dhulipala, D. Durfee, J. Kulkarni, R. Peng, S. Sawlani, and X. Sun, “Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds,” in Proceedings of the 31st SODA. USA: SIAM, 2020, pp. 1300–1319.