Parallel 1D Hybrid BFS Delivers Faster Traversal on Large Graphs

This presentation covers a hybrid parallel breadth-first search (BFS) algorithm for distributed memory systems that uses two stacks. BFS is a building block of many graph algorithms, and parallelizing it matters for large graphs. The paper's contributions include a 1D partitioning approach for graph representation. The hybrid algorithm assigns vertices to processors and uses local stacks to parallelize edge visits while balancing load. Experimental results on large systems show that flat 1D algorithms are about 1.5 to 1.8 times faster than the 2D algorithms, and that the hybrid 1D algorithm, while slower than flat 1D at small processor counts, becomes significantly faster at higher processor counts. The presentation closes with a conjecture that level-synchronous BFS can be implemented without errors using relaxed queues, and asks whether the error can be bounded without level synchronization.

Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri
Sapta
DC reading group
September 29, 2016

Outline
Introduction
Shared Memory BFS
Model
Contributions
Serial BFS overview
Another paper: Parallel BFS using 2 queues
This paper: Hybrid Parallel BFS using 2 stacks
Experimental Results
Conclusion

Introduction
BFS is important: it usually forms a sub-part of more complex graph algorithms.
Now that we have very large graphs, parallelizing BFS is very important.
Shared Memory BFS involves (1) communication between processors and (2) distribution of the graph (vertices) among processors.

Model
Graph G(V, E) with |V| = n and |E| = m; m is O(n), i.e. sparse graphs.
Edge weights = 1.

Contributions
Traditional representation: one-dimensional (1D) BFS over adjacency arrays.
Sparse matrix representation: 2D partitioning of the graph (not discussed).
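
As a rough illustration of what "adjacency arrays" can look like, the sketch below builds a compressed layout with an offsets array and a targets array for an undirected graph. The function names `build_adjacency_arrays` and `neighbors` and the example edge list are hypothetical and not taken from the slides or the paper.

```python
def build_adjacency_arrays(n, edges):
    """Build (offsets, targets) adjacency arrays for an undirected graph.

    Hypothetical helper for illustration; vertices are numbered 0..n-1.
    """
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # offsets[v] is where vertex v's neighbour list starts in targets.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]

    targets = [0] * offsets[n]
    cursor = offsets[:n]  # copy: next free slot for each vertex
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
        targets[cursor[v]] = u
        cursor[v] += 1
    return offsets, targets


def neighbors(offsets, targets, v):
    """Return the neighbour list of vertex v as a slice of targets."""
    return targets[offsets[v]:offsets[v + 1]]


# Example graph (made up for this sketch): edges 0-1, 0-2, 1-3.
offsets, targets = build_adjacency_arrays(4, [(0, 1), (0, 2), (1, 3)])
```

With m in O(n), both arrays together take O(n) space, which is why this layout suits the sparse graphs assumed in the model.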

Serial BFS overview
Sequential BFS uses a queue data structure.
BFS requirement: all vertices at distance k from the source should be “visited” before vertices at distance k + 1.
Explanation?
Level Synchronous BFS is a key concept in correct shared memory BFS.
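
A minimal sketch of the queue-based sequential BFS described above, assuming the graph is given as a dictionary of neighbour lists. The function name `serial_bfs` and the example graph are illustrative only. Because the queue is first-in first-out, every vertex at distance k is dequeued before any vertex at distance k + 1, which is the requirement stated on this slide.

```python
from collections import deque


def serial_bfs(adj, source):
    """Return a dict mapping each reachable vertex to its distance from source."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()            # FIFO order gives the level-by-level visit
        for v in adj[u]:
            if v not in level:         # first visit fixes the shortest distance
                level[v] = level[u] + 1
                queue.append(v)
    return level


# Example graph (made up for this sketch).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = serial_bfs(adj, 0)
```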

Modified BFS: Use 2 stacks
Can be parallelized as is: perform lines 6-7 in parallel, while lines 8-10 must be atomic.
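
The pseudocode that the line numbers above refer to is not reproduced in this transcript. The following is a sequential sketch of a two-stack, level-synchronous BFS under the assumption that one stack (here `fs`) holds the current frontier and the other (here `ns`) collects the next one. The names and the comments about which parts could run in parallel are this sketch's reading of the slide, not the paper's exact algorithm.

```python
def two_stack_bfs(adj, source):
    """Level-synchronous BFS with a current stack (fs) and a next stack (ns)."""
    level = {v: -1 for v in adj}       # -1 marks an unvisited vertex
    level[source] = 0
    fs = [source]                      # vertices of the current level
    ns = []                            # vertices discovered for the next level
    depth = 1
    while fs:
        # The loops over fs and over each neighbour list are the part that a
        # parallel version could split across workers.
        while fs:
            u = fs.pop()
            for v in adj[u]:
                # In a parallel version this check-and-update would have to be
                # atomic so that v is pushed to ns only once.
                if level[v] == -1:
                    level[v] = depth
                    ns.append(v)
        fs, ns = ns, []                # swap stacks and start with an empty ns
        depth += 1
    return level


# Example graph (made up for this sketch).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = two_stack_bfs(adj, 0)
```

Order within a stack does not matter here: every vertex in `fs` is at the same distance from the source, so popping them in any order still labels the next level correctly.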

Related Work: Level Synchronous Parallel BFS using 2 queues by Agarwal et al., SC’10 [1]

Hybrid 1D Parallel BFS Algorithm
One of the main areas for optimization of this basic parallel algorithm is load balancing: ensuring that the parallelized edge-visit steps are load-balanced.
1D partitioning: if there are p processors in the system, give ownership of n/p vertices to each processor.
Random shuffling of the vertex identifiers prior to partitioning, so that all processors get roughly the same number of vertices (n/p) and edges (m/p).
Use of local stacks NSi for pushes, followed by a global union (overhead < 3% of execution time).
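
The shuffling and ownership steps above can be sketched as follows. This is an illustrative single-process version with hypothetical names (`partition_1d`, `perm`, `owner`) and a fixed seed for reproducibility. It only shows how random relabelling followed by block assignment gives each of p processors about n/p vertices; it does not reproduce the paper's implementation.

```python
import random


def partition_1d(n, p, seed=0):
    """Randomly relabel n vertices, then assign contiguous blocks to p processors.

    Returns (perm, owner) where perm[old_id] is the shuffled identifier and
    owner[old_id] is the rank that owns the vertex.
    """
    perm = list(range(n))
    random.Random(seed).shuffle(perm)  # random relabelling breaks up hot spots
    # Block assignment on the shuffled identifiers: rank = new_id * p // n.
    owner = {old: new * p // n for old, new in enumerate(perm)}
    return perm, owner


perm, owner = partition_1d(n=10, p=3)
# Number of vertices owned by each rank; block sizes differ by at most one.
counts = [list(owner.values()).count(rank) for rank in range(3)]
```

Vertex counts are balanced by construction. Edge counts are only balanced in expectation, which is what the random shuffle is meant to achieve for skewed degree distributions.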

1D BFS errors
The value of level is not incremented.
The Next Stack NSi data structure should be emptied before traversing the next level.
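
To make the two points above concrete, here is a single-process simulation of a 1D level-synchronous BFS over p ranks. It is a sketch under several assumptions: the all-to-all exchange is replaced by plain Python lists, ownership is a simple block rule, and all names (`bfs_1d_simulated`, `inbox`, `ns`) are hypothetical. The comments mark where the level counter is advanced and where each NSi starts empty for the new level.

```python
def bfs_1d_simulated(adj, source, p):
    """Simulate 1D-partitioned BFS; adj is a list of neighbour lists for 0..n-1."""
    n = len(adj)

    def owner(v):
        return v * p // n              # assumed block ownership rule

    level = [-1] * n                   # -1 marks an unvisited vertex
    level[source] = 0
    fs = [[] for _ in range(p)]        # FS_i: frontier vertices owned by rank i
    fs[owner(source)].append(source)
    depth = 1
    while any(fs):
        # Each rank scans its frontier and buffers neighbours for their owners.
        inbox = [[] for _ in range(p)]
        for i in range(p):
            for u in fs[i]:
                for v in adj[u]:
                    inbox[owner(v)].append(v)

        # Stand-in for the all-to-all exchange. NS_i is created empty for every
        # level, so no vertex from an earlier level can leak into the frontier.
        ns = [[] for _ in range(p)]
        for i in range(p):
            for v in inbox[i]:
                if level[v] == -1:
                    level[v] = depth
                    ns[i].append(v)

        fs = ns
        depth += 1                     # without this, every vertex gets level 1
    return level


# Example graph (made up for this sketch), split across 2 simulated ranks.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_1d_simulated(adj, 0, 2)
```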

Experiments
1D Flat MPI: one process per core.
1D Hybrid: one or more MPI processes within a node.
Graphs: synthetic graphs based on the R-MAT random graph model (default m : n = 16), and a web crawl of the UK domain (133 million vertices and 5.5 billion edges).
Systems: Hopper (6392-node Cray XE6) and Franklin (9660-node Cray XT4).

Experimental Results
Figure: strong scaling on Franklin, measured in GTEPS (Giga Traversed Edges per Second); higher is better.
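
For readers unfamiliar with the metric, GTEPS is the number of edges traversed divided by the traversal time, expressed in billions per second. The small sketch below shows the arithmetic. The function name `gteps` and the 2.0-second runtime are hypothetical and are not results from the paper; only the 5.5 billion edge count comes from the experiment description.

```python
def gteps(edges_traversed, seconds):
    """Giga traversed edges per second: edges / time / 10^9."""
    return edges_traversed / seconds / 1e9


# Hypothetical example: traversing 5.5 billion edges in an assumed 2.0 seconds.
rate = gteps(5.5e9, 2.0)
```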

Experimental Results
Figure: strong scaling on Franklin; lower is better.

Experimental Results
Figure: weak scaling on Franklin; lower is better.

Experiments
Flat 1D algorithms are about 1.5 to 1.8 times faster than the 2D algorithms.
The 1D hybrid algorithm is slower than the flat 1D algorithm at smaller concurrencies but starts to perform significantly faster at larger concurrencies.

Conclusion
Conjecture: Level synchronous BFS can be implemented without any error with relaxed queues.
Question: Can the error be bounded if we don’t have a level synchronous algorithm?

[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proc. ACM/IEEE Conference on Supercomputing (SC10), November 2010.
[2] A. Buluc and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 65:1–65:12, New York, NY, USA, 2011. ACM.
[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proc. 22nd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA ’10), pages 303–314, June 2010.
