Adjusting OpenMP PageRank
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches: uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads), while the hybrid approach runs
certain primitives (namely sumAt and multiply) in sequential mode.
Before starting an OpenMP implementation, a good sequential pagerank implementation
needs to be set up. There are two ways (algorithmically) to think of the pagerank calculation.
One approach (push) is to find pagerank by pushing contributions to out-vertices; the push
method is somewhat easier to implement. With this approach, in each iteration, every vertex
increments the rank of each vertex on its outgoing edges by p×rn/dn, where p is the damping
factor (0.85), rn is the rank of the (source) vertex in the previous iteration, and dn is its
out-degree. But if a vertex has no outgoing edges, it is considered to have outgoing edges to
all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to
the cumulation (+=) operation. The other approach (pull) is to pull contributions from
in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is
the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of
vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the
total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N (to avoid the rank-sink effect). This approach requires two additional
calculations per vertex, i.e., the non-teleport contribution of each vertex and the total teleport
contribution (to all vertices). However, it requires only one write per destination vertex. For this
experiment both of these approaches are assessed on a number of different graphs.
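The pull computation above can be sketched as follows. This is a minimal sketch, assuming the graph is given as in-edge lists with a separate out-degree array (the names pagerankPull, in, and outdeg are illustrative, not the actual DiGraph interface):

```cpp
#include <cmath>
#include <utility>
#include <vector>
using namespace std;

// Pull-based PageRank: rank of v = c0 + p * Σ r[u]/outdeg[u] over in-edges u→v.
// in[v]  : vertices with an edge into v
// outdeg : out-degree of each vertex
vector<double> pagerankPull(const vector<vector<int>>& in,
                            const vector<int>& outdeg,
                            double p=0.85, double E=1e-10, int L=500) {
  int N = (int) in.size();
  vector<double> r(N, 1.0/N), a(N);
  for (int l=0; l<L; ++l) {
    // Common teleport contribution c0 = (1-p)/N + p*Σ(dangling ranks)/N.
    double c0 = (1-p)/N;
    for (int v=0; v<N; ++v)
      if (outdeg[v]==0) c0 += p*r[v]/N;
    // Pull non-teleport contributions from in-neighbours (one write per vertex).
    double err = 0;
    for (int v=0; v<N; ++v) {
      double s = 0;
      for (int u : in[v]) s += r[u]/outdeg[u];
      a[v] = c0 + p*s;
      err += fabs(a[v] - r[v]);
    }
    swap(r, a);
    if (err < E) break;  // L1-norm convergence check
  }
  return r;
}
```

On a 3-vertex cycle (0→1→2→0) this converges to a rank of 1/3 per vertex, and the ranks always sum to 1.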
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with the OpenMP flag (-fopenmp) and optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_resolution_clock. This is done 5 times for each test case, and the
timings are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
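The timing procedure can be sketched as below: each test case is run several times and the average taken. This is an illustrative helper (measureDurationMs is not from the actual codebase):

```cpp
#include <chrono>

// Run fn() `repeat` times and return the average duration in milliseconds,
// measured with std::chrono::high_resolution_clock.
template <class F>
double measureDurationMs(F fn, int repeat=5) {
  using namespace std::chrono;
  auto t0 = high_resolution_clock::now();
  for (int i=0; i<repeat; ++i) fn();
  auto t1 = high_resolution_clock::now();
  return duration<double, std::milli>(t1 - t0).count() / repeat;
}
```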
While it might seem that the pull method would be a clear winner, the results indicate that
although pull is always faster than the push approach, the difference between the two depends
on the nature of the graph. The next step is to compare the performance between finding
pagerank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement due to information on vertices and edges being
stored contiguously.
Adjusting Sequential approach
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Both uniform and hybrid OpenMP techniques were attempted on different types of graphs.
All OpenMP based functions are defined with a parallel for clause and static scheduling of
size 4096. When necessary, a reduction clause is used. The number of threads for this
experiment (set using OMP_NUM_THREADS) was varied from 2 to 48. Results show that the
hybrid approach performs worse in most cases, and is only slightly better than the uniform
approach in a few cases. This is possibly because OpenMP schedules chips/cores better
when it manages all the primitives itself.
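The OpenMP map and reduce primitives referred to above can be sketched as follows, with the static scheduling (chunk size 4096) and reduction clause described; the function names are illustrative. With -fopenmp these run multithreaded; without it the pragmas are ignored and the loops run sequentially:

```cpp
#include <vector>
using namespace std;

// Map-style primitive: elementwise multiply, a[i] = x[i]*y[i].
void multiplyOmp(vector<double>& a, const vector<double>& x,
                 const vector<double>& y) {
  int N = (int) a.size();
  #pragma omp parallel for schedule(static, 4096)
  for (int i=0; i<N; ++i) a[i] = x[i]*y[i];
}

// Reduce-style primitive: element sum, using a reduction clause.
double sumOmp(const vector<double>& x) {
  int N = (int) x.size();
  double s = 0;
  #pragma omp parallel for schedule(static, 4096) reduction(+:s)
  for (int i=0; i<N; ++i) s += x[i];
  return s;
}
```

In the hybrid variant, primitives like these (sumAt, multiply) are the ones kept sequential.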
Adjusting OpenMP approach
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Performance of sequential execution based vs OpenMP based vector element sum.
3. Performance of uniform-OpenMP based vs hybrid-OpenMP based PageRank (pull, CSR).
In the final experiment, the performance of OpenMP based pagerank is contrasted with the
sequential approach and nvGraph pagerank. OpenMP based pagerank provides a clear
benefit for most graphs with respect to sequential pagerank. This speedup is not directly
proportional to the number of threads, as Amdahl's law predicts for any program with a
sequential fraction. However, nvGraph is considerably faster than the OpenMP version. This
is expected, as nvGraph makes use of the GPU for performance.
Comparing sequential approach
1. Performance of sequential execution based vs OpenMP based PageRank (pull, CSR).
2. Performance of sequential execution based vs nvGraph based PageRank (pull, CSR).
3. Performance of OpenMP based vs nvGraph based PageRank (pull, CSR).
