This paper presents a hybrid parallel breadth-first search (BFS) algorithm for distributed memory systems that uses two stacks. BFS is important for graph algorithms and parallelizing it is important for large graphs. The paper's contributions include a 1D partitioning approach for graph representation. The hybrid algorithm assigns vertices to processors and uses local stacks to parallelize edge visits, balancing load. Experimental results on large systems show the hybrid 1D approach scales better than a 2D approach and is faster than a flat 1D implementation for higher processor counts. The paper concludes the algorithm can implement BFS without errors using relaxed queues and poses questions about bounding errors without level synchronization.
3. Introduction
BFS is important.
BFS is usually a subroutine of more complex graph
algorithms.
Now that graphs are very large, parallelizing BFS is very
important.
Shared Memory BFS involves: (1) communication between
processors and (2) distribution of the graph (vertices) among
processors.
4. Model
Graph G(V, E) with |V| = n and |E| = m, where m is O(n),
i.e. sparse graphs.
Edge weights = 1.
6. Serial BFS overview
Sequential BFS uses a queue data structure
BFS requirement:
all vertices at a distance k from the source should be “visited”
before vertices at distance k + 1.
Explanation?
Level Synchronous BFS is a key concept in correct shared
memory BFS.
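The queue-based serial BFS and the distance-ordering requirement can be illustrated with a minimal sketch. The function name `serial_bfs`, the adjacency-list input, and the small example graph are assumptions made for illustration, not taken from the slides.

```python
from collections import deque

def serial_bfs(adj, source):
    """Return a dict mapping each reachable vertex to its distance from source.

    adj is an adjacency list: adj[u] is an iterable of the neighbours of u.
    """
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()              # FIFO order: closer vertices leave first
        for v in adj[u]:
            if v not in level:           # the first visit fixes the BFS distance
                level[v] = level[u] + 1
                queue.append(v)
    return level

# Hypothetical undirected example graph.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = serial_bfs(adj, 0)
```

Because the queue is FIFO, every vertex at distance k is dequeued before any vertex at distance k + 1, which is the requirement stated above.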
7. Modified BFS: Use 2 stacks
Can be parallelized as is: perform lines 6-7 in parallel;
lines 8-10 must be atomic.
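A two-stack, level-synchronous BFS might look like the sketch below. It is a serial illustration under assumed names (`two_stack_bfs`, `current_stack`, `next_stack`), and its line numbers do not correspond to the pseudocode lines referenced above.

```python
def two_stack_bfs(adj, source):
    """Level-synchronous BFS using a current stack and a next stack."""
    level = {source: 0}
    current_stack = [source]             # vertices at the current level
    depth = 0
    while current_stack:
        next_stack = []                  # starts empty at every level
        # The pops from the current stack are independent of each other,
        # so this loop is the part that could be executed in parallel.
        while current_stack:
            u = current_stack.pop()
            for v in adj[u]:
                # In a parallel version this check-and-set would have to
                # be atomic so that v is pushed exactly once.
                if v not in level:
                    level[v] = depth + 1
                    next_stack.append(v)
        current_stack = next_stack       # swap stacks, then move on
        depth += 1
    return level

# Hypothetical undirected example graph.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = two_stack_bfs(adj, 0)
```

The visiting order within a level does not matter here: the swap between levels is what keeps distance-k vertices ahead of distance-(k + 1) vertices, so a LIFO stack works as well as a queue.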
8. Related Work: Level Synchronous Parallel BFS using 2
queues by Agarwal et al., SC’10 [1]
9. Hybrid 1D Parallel BFS Algorithm
One of the main areas for optimizing this basic parallel
algorithm is
load balancing: ensuring that the edge-visit
steps are spread evenly across processors.
1D partitioning: If there are p processors in the system, give
ownership of n/p vertices to each processor.
Random shuffling of the vertex identifiers prior to
partitioning, so all processors get roughly the same number of
vertices (n/p) and edges (m/p).
Use of local stacks NSi for pushes, followed by a global
union. (Overhead < 3% of execution time)
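The three ingredients above (1D ownership, random relabeling, local stacks followed by a union) can be sketched in a single-process simulation. This is not the MPI implementation from the paper: the names `partition_1d` and `bfs_1d`, the seed parameter, and the sequential loop standing in for p processors are all assumptions made for illustration.

```python
import random

def partition_1d(n, p, seed=0):
    """Randomly relabel the n vertices, then assign contiguous blocks of
    roughly n/p relabeled vertices to each of the p processors."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)    # perm[old_id] is the new id
    block = -(-n // p)                   # ceil(n / p)
    return [perm[v] // block for v in range(n)]

def bfs_1d(adj, source, p, seed=0):
    """Simulate a level-synchronous 1D BFS; adj is a list of neighbour lists."""
    n = len(adj)
    owner = partition_1d(n, p, seed)
    level = [-1] * n
    level[source] = 0
    current = [[] for _ in range(p)]     # one current stack per processor
    current[owner[source]].append(source)
    depth = 0
    while any(current):
        # NS_i: each processor pushes what it discovers to its own local
        # stack, so no synchronization is needed during the edge visits.
        local_next = [[] for _ in range(p)]
        for i in range(p):               # would run concurrently in practice
            for u in current[i]:
                local_next[i].extend(adj[u])
        # Global union: every pushed vertex goes to its owner, which keeps
        # it only if it is still unvisited. The stacks start empty again.
        current = [[] for _ in range(p)]
        for i in range(p):
            for v in local_next[i]:
                if level[v] == -1:
                    level[v] = depth + 1
                    current[owner[v]].append(v)
        depth += 1                       # advance the level every iteration
    return level

# Hypothetical undirected example graph on 5 vertices.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_1d(adj, 0, p=2)
```

In a distributed implementation the union step would be a communication phase between processors; here it is only a loop, which is enough to show that the resulting levels do not depend on p or on the shuffle.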
12. 1D BFS errors
The value of level is not incremented
The Next Stack NSi data structure should be emptied before
traversing the next level.
13. Experiments
1D Flat MPI: one process per core
1D Hybrid: one or more MPI processes within a node
Synthetic graphs based on the R-MAT random graph
model (default m : n ratio of 16); web crawl of the UK domain (133
million vertices and 5.5 billion edges).
Systems: Hopper (6392-node Cray XE6) and Franklin
(9660-node Cray XT4)
17. Experiments
Flat 1D algorithms are about 1.5 to 1.8 times faster than the
2D algorithms.
The 1D hybrid algorithm is slower than the flat 1D
algorithm at smaller concurrencies but starts to perform
significantly faster at larger concurrencies.
18. Conclusion
Conjecture: Level synchronous BFS can be implemented
without any error with relaxed queues
Question: Can the error be bounded if we don’t have a level
synchronous algorithm?
19. References
[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable
graph exploration on multicore processors. In Proc. ACM/IEEE
Conference on Supercomputing (SC10), November 2010.
[2] A. Buluç and K. Madduri. Parallel breadth-first search on
distributed memory systems. In Proceedings of 2011
International Conference for High Performance Computing,
Networking, Storage and Analysis, SC ’11, pages 65:1–65:12,
New York, NY, USA, 2011. ACM.
[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel
breadth-first search algorithm (or how to cope with the
nondeterminism of reducers). In Proc. 22nd ACM Symp. on
Parallelism in Algorithms and Architectures (SPAA ’10), pages
303–314, June 2010.