NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi
Ruba Break Mariam Al-kassar Nagham Ballan
Outline
• Background
• Breadth-first Search (BFS)
• NUMA-optimized parallel BFS
• Numerical Results
• Conclusion
Background
• Large-scale graphs in various fields
– US road network: 58 million edges
– Twitter follow-ship: 1.47 billion edges
– Neuronal network: 100 trillion edges
Background
• Fast and scalable graph processing by using HPC
Importance of graph processing
• Application fields
– Transportation
– Social networks
– Cyber-security
– Bioinformatics
- Step 3:
• concurrent search (breadth first search)
• optimization (single source shortest path)
• edge-oriented (maximal independent set)
Breadth-first search
• BFS is an important and fundamental graph-processing kernel
– Standalone: obtains the distance (number of hops) between vertices
– Building block of many algorithms (BC, max-flow, maximal independent set)
• Obstacles to fast and scalable BFS computation
- low arithmetic intensity
- irregular memory accesses
Graph500 Benchmark
• Measures computer performance in graph processing such as BFS (breadth-first search) using the TEPS ratio
• TEPS ratio = number of traversed edges per second
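As a small illustration of the TEPS definition, the sketch below divides a traversed-edge count by the elapsed time. The function names `teps` and `gteps` and the sample numbers are made up for illustration; they are not measurements from these slides.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for one BFS run."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return traversed_edges / seconds


def gteps(traversed_edges: int, seconds: float) -> float:
    """The same ratio in units of 10**9 edges per second (GTEPS)."""
    return teps(traversed_edges, seconds) / 1e9


if __name__ == "__main__":
    # Hypothetical run: 2 billion edges traversed in half a second.
    print(f"{gteps(2_000_000_000, 0.5):.2f} GTEPS")
```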
Contribution
• Efficient hybrid BFS algorithm [Beamer2011,2012]
– Reduces unnecessary edge traversal
Our proposal
- NUMA-optimized hybrid algorithm
- Improves locality of memory access
– Library for handling NUMA carefully
– Column-wise graph partitioning
• Example: 4-way Intel Xeon E5 (64 CPU cores)
– Scalable: scales well up to 64 threads
– Fast: 11.15 GTEPS, a 2.2x speedup compared with the original hybrid algorithm
Breadth-first Search (BFS)
• Obtains the level of each vertex from the source vertex
• Level = number of hops away from the source
Input: graph G and a source vertex
Output: BFS tree rooted at the source
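The input/output contract above can be sketched as a plain sequential BFS. This is a minimal sketch, assuming the graph is given as a dictionary of adjacency lists; the name `bfs` and the convention that the root is its own parent are choices made here, not taken from the slides.

```python
from collections import deque


def bfs(adj, source):
    """Return (level, parent) dicts for all vertices reachable from source.

    adj maps each vertex to an iterable of its neighbors.
    """
    level = {source: 0}
    parent = {source: source}  # convention used here: the root is its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in level:          # first visit fixes the level and tree edge
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
    level, parent = bfs(graph, 0)
    print(level)
    print(parent)
```

The `parent` dictionary encodes the output tree, and `level` gives the hop count of each vertex.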
Hybrid BFS for low-diameter graphs
• Efficient for low-diameter graphs
– Graphs with scale-free and/or small-world properties, such as social networks
• Appears at the higher ranks of the Graph500 benchmark
• Hybrid algorithm
- Combines the top-down and bottom-up algorithms
– Reduces unnecessary edge traversal
Hybrid algorithm
• Top-down algorithm: efficient for a small frontier
• Bottom-up algorithm: efficient for a large frontier
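The switching idea can be sketched as a level-synchronous BFS that picks a direction per level. This is a minimal sequential sketch on an undirected graph; the `threshold` parameter (a fraction of the vertex count) is a stand-in chosen here for illustration and is not the switching rule of [Beamer2011,2012].

```python
def hybrid_bfs(adj, source, threshold=0.05):
    """Level-synchronous BFS that switches between top-down and bottom-up.

    adj: dict vertex -> list of neighbors; every vertex must be a key and the
    graph is assumed undirected, so adj also lists incoming edges.
    threshold: hypothetical frontier fraction above which bottom-up is used.
    Returns (level, directions); directions records the choice at each level.
    """
    n = len(adj)
    level = {source: 0}
    frontier = {source}
    directions = []
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        if len(frontier) > threshold * n:
            directions.append("bottom-up")
            for w in adj:                      # scan unvisited vertices
                if w in level:
                    continue
                for v in adj[w]:
                    if v in frontier:          # one parent is enough
                        next_frontier.add(w)
                        break
        else:
            directions.append("top-down")
            for v in frontier:                 # scan edges leaving the frontier
                for w in adj[v]:
                    if w not in level:
                        next_frontier.add(w)
        for w in next_frontier:
            level[w] = depth
        frontier = next_frontier
    return level, directions
```

Both branches produce the same next frontier; only the number of edges inspected differs.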
Top-down algorithm
• Explores the outgoing edges of the frontier queue QF
• Appends unvisited vertices to the neighbor queue QN
• Efficient for a small frontier
– Performs unnecessary edge traversals for a large frontier
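One top-down level can be sketched as follows. The edge counter is added here only to make the "unnecessary edge traversal" visible; the function name and the data layout (a dictionary of adjacency lists, a Python set for visited vertices) are assumptions of this sketch.

```python
def top_down_step(adj, frontier, visited):
    """Expand one BFS level starting from the frontier.

    adj: dict vertex -> list of outgoing neighbors.
    frontier: list of vertices in the current level (QF).
    visited: set of visited vertices, updated in place.
    Returns (next_queue, edges_scanned), where next_queue plays the role of QN.
    """
    next_queue = []
    edges_scanned = 0
    for v in frontier:
        for w in adj[v]:
            edges_scanned += 1
            if w not in visited:
                visited.add(w)
                next_queue.append(w)
    return next_queue, edges_scanned


if __name__ == "__main__":
    adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
    # Frontier {1, 2} is large relative to this graph: most scanned edges
    # lead to vertices that are already visited.
    print(top_down_step(adj, [1, 2], {0, 1, 2}))
```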
Bottom-up algorithm
• Searches for the frontier queue QF from the unvisited vertices
• Appends each unvisited vertex adjacent to the frontier to the neighbor queue QN
• Efficient for a large frontier
– Performs unnecessary edge traversals for a small frontier
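One bottom-up level can be sketched in the same style. Each unvisited vertex scans its neighbors and stops at the first one found in the frontier. The function name, the edge counter, and the use of an undirected adjacency dictionary are assumptions of this sketch.

```python
def bottom_up_step(adj, frontier, visited):
    """Expand one BFS level starting from the unvisited vertices.

    adj: dict vertex -> list of neighbors (undirected graph assumed, so the
    list also holds the incoming edges of each vertex).
    frontier: set of vertices in the current level (QF).
    visited: set of visited vertices, updated in place.
    Returns (next_queue, edges_scanned), where next_queue plays the role of QN.
    """
    next_queue = []
    edges_scanned = 0
    for w in adj:
        if w in visited:
            continue
        for v in adj[w]:
            edges_scanned += 1
            if v in frontier:       # found a parent: stop scanning this vertex
                next_queue.append(w)
                break
    visited.update(next_queue)
    return next_queue, edges_scanned


if __name__ == "__main__":
    adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
    print(bottom_up_step(adj, {1, 2}, {0, 1, 2}))   # large frontier
    print(bottom_up_step(adj, {0}, {0}))            # small frontier
```

With a large frontier the early `break` keeps the scan short; with a small frontier many unvisited vertices scan their whole list without finding a parent.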
How to speed up the hybrid algorithm?
• NUMA architecture
– Non-uniform memory access
– Each CPU socket has its own local RAM
– Local RAM is fast; non-local RAM is slow
4-socket Intel Xeon E5 system
• Frequent non-local memory accesses on a NUMA architecture
Difficulty of considering NUMA architecture
1. How should the graph and data be distributed to each local RAM?
2. How should each partial graph and its data be bound to a NUMA unit?
ULIBC: Ubiquity Library for Intelligently Binding Cores
1. NUMACTL (command line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
3. ULIBC (Our library, library for C/C++)
– Processor ID : index of logical processor core
– Package ID : index of CPU socket
– Core ID : index of physical core in each CPU socket
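The three IDs above can be illustrated on Linux without any special library. The sketch below is not the ULIBC API; it reads the standard sysfs files `physical_package_id` and `core_id` and returns a processor-to-(package, core) map. The function names are made up here, and the raw sysfs core IDs are hardware identifiers that need not be contiguous per socket.

```python
import os


def read_cpu_topology(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return {processor_id: (package_id, core_id)} read from Linux sysfs.

    Returns an empty dict when the directory does not exist (non-Linux hosts).
    """
    topology = {}
    if not os.path.isdir(sysfs_cpu_dir):
        return topology
    for entry in os.listdir(sysfs_cpu_dir):
        # Keep only cpu0, cpu1, ...; skip entries such as cpufreq or cpuidle.
        if not (entry.startswith("cpu") and entry[3:].isdigit()):
            continue
        topo_dir = os.path.join(sysfs_cpu_dir, entry, "topology")
        try:
            with open(os.path.join(topo_dir, "physical_package_id")) as f:
                package_id = int(f.read())
            with open(os.path.join(topo_dir, "core_id")) as f:
                core_id = int(f.read())
        except (OSError, ValueError):
            continue  # offline processor or unexpected file contents
        topology[int(entry[3:])] = (package_id, core_id)
    return topology


def processors_by_package(topology):
    """Group processor IDs by package (socket) ID, in ascending order."""
    groups = {}
    for proc_id in sorted(topology):
        package_id, _core_id = topology[proc_id]
        groups.setdefault(package_id, []).append(proc_id)
    return groups


if __name__ == "__main__":
    print(processors_by_package(read_cpu_topology()))
```

A grouping like this is the information a binding library needs before it can pin threads to the cores of one socket.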
NUMA-opt. Column-wise Graph Partitioning
• Divides G=(V,A) into partial graphs Gk=(Vk,Ak) and binds each Gk to local RAM k
- Ak is the set of adjacency lists that hold the incoming edges to Vk
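The partitioning can be sketched as follows: vertices are split into blocks Vk and each edge (v, w) is stored with the block that owns its destination w, which corresponds to cutting the adjacency matrix into column blocks. This is a minimal sketch, assuming contiguous vertex ranges; the function name and the layout are choices made here, and no memory is actually bound to a NUMA unit.

```python
def partition_columnwise(num_vertices, edges, num_parts):
    """Split a directed graph into num_parts partial graphs (V_k, A_k).

    Vertices 0..num_vertices-1 are divided into contiguous ranges V_k
    (an assumed layout). A_k maps each w in V_k to the list of sources v
    of its incoming edges (v, w).
    """
    chunk = -(-num_vertices // num_parts)  # ceiling division
    parts = []
    for k in range(num_parts):
        v_k = range(k * chunk, min((k + 1) * chunk, num_vertices))
        parts.append((v_k, {w: [] for w in v_k}))
    for v, w in edges:
        parts[w // chunk][1][w].append(v)   # the owner of w stores the edge
    return parts


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 3)]
    for k, (v_k, a_k) in enumerate(partition_columnwise(5, edges, 2)):
        print(k, list(v_k), a_k)
```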
NUMA-optimized Top-down
• Explores the outgoing edges of the frontier queue QF
• Appends unvisited vertices to the neighbor queue QN
• Efficient for a small frontier
– Performs unnecessary edge traversals for a large frontier
Details of NUMA-optimized Top-down
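One way a top-down step could use the column-wise partitioning is sketched below. It assumes each partition k keeps, for every source v, only the edges whose destination lies in Vk, so every visited-flag update and queue append stays inside the partition that owns the vertex. The partitions are simulated sequentially, and the function names and data layout are assumptions of this sketch rather than the implementation behind the slides.

```python
def build_forward_parts(num_vertices, edges, num_parts):
    """For each partition k, map source v -> destinations w with w in V_k."""
    chunk = -(-num_vertices // num_parts)  # ceiling division
    forward = [dict() for _ in range(num_parts)]
    for v, w in edges:
        forward[w // chunk].setdefault(v, []).append(w)
    return forward


def partitioned_top_down_step(forward, frontier, visited):
    """One top-down level; partition k only discovers vertices it owns.

    forward: output of build_forward_parts.
    frontier: list of current-level vertices, read by every partition.
    visited: list with one set per partition (visited vertices of V_k),
    updated in place.
    Returns one next-queue per partition.
    """
    next_queues = []
    # In a threaded version, each iteration of this loop would run on the
    # cores of one NUMA unit; here the partitions are processed in turn.
    for k, out_k in enumerate(forward):
        local_next = []
        for v in frontier:
            for w in out_k.get(v, ()):
                if w not in visited[k]:
                    visited[k].add(w)
                    local_next.append(w)
        next_queues.append(local_next)
    return next_queues


if __name__ == "__main__":
    edges = [(0, 1), (0, 4), (1, 2), (1, 5), (4, 3)]
    forward = build_forward_parts(6, edges, 2)
    visited = [{0}, set()]
    print(partitioned_top_down_step(forward, [0], visited))
```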
NUMA-optimized Bottom-up
• Searches for the frontier queue QF from the unvisited vertices
• Appends each unvisited vertex adjacent to the frontier to the neighbor queue QN
• Efficient for a large frontier
– Performs unnecessary edge traversals for a small frontier
Details of NUMA-optimized Bottom-up
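A bottom-up step fits the column-wise partitioning directly, because each partition k already holds the incoming edges of its own vertices Vk. The sketch below simulates the partitions sequentially; the only shared data read by every partition is the frontier. Function names and data layout are assumptions of this sketch rather than the implementation behind the slides.

```python
def build_backward_parts(num_vertices, edges, num_parts):
    """For each partition k, map w in V_k -> sources v of incoming edges (v, w)."""
    chunk = -(-num_vertices // num_parts)  # ceiling division
    backward = []
    for k in range(num_parts):
        lo, hi = k * chunk, min((k + 1) * chunk, num_vertices)
        backward.append({w: [] for w in range(lo, hi)})
    for v, w in edges:
        backward[w // chunk][w].append(v)
    return backward


def partitioned_bottom_up_step(backward, frontier, visited):
    """One bottom-up level; partition k scans only its own unvisited vertices.

    backward: output of build_backward_parts.
    frontier: set of current-level vertices, shared and read-only.
    visited: list with one set per partition, updated in place.
    Returns one next-queue per partition.
    """
    next_queues = []
    for k, in_k in enumerate(backward):
        local_next = []
        for w, sources in in_k.items():
            if w in visited[k]:
                continue
            # any() stops at the first source found in the frontier.
            if any(v in frontier for v in sources):
                local_next.append(w)
        visited[k].update(local_next)
        next_queues.append(local_next)
    return next_queues


if __name__ == "__main__":
    edges = [(0, 1), (0, 4), (1, 2), (1, 5), (4, 3)]
    backward = build_backward_parts(6, edges, 2)
    visited = [{0}, set()]
    print(partitioned_bottom_up_step(backward, {0}, visited))
```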
Machine specification
• 4-way Intel Xeon E5
– CentOS 6.4 (Kernel 2.6.32)
– GCC 4.4.7
– 64 logical CPU cores
– 4 NUMA units x 16 logical cores
• 4-way AMD Opteron 6174
– Fedora 19 (Kernel 3.11.2)
– GCC 4.8.1
– 48 CPU cores
– 8 NUMA units x 6 cores
TEPS ratio varied with problem size
• NUMA-optimized algorithm: 2.2x speedup compared with the original hybrid algorithm
• NUMA-optimized algorithm: achieves 11.15 GTEPS for a Kronecker graph (SCALE 26)
Strong scaling on Intel/AMD systems
Scales well as the number of threads increases up to the number of cores
Twitter network
Graph500 benchmark
• Fastest single-node entry on the 4th list (June 2012)
• Fastest CPU-based single-node entry on the 6th list (June 2013)
1st Green Graph500 list (June 2013)
• Measures power efficiency using the TEPS/W ratio
• Results on various systems such as Android, Linux, and Mac
NUMA
Small Data category
Conclusion
• NUMA-optimized hybrid BFS algorithm
– Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA
• Numerical results on 4-way Intel Xeon
– Scales well up to 64 threads (scalable)
– Achieves 11.15 GTEPS (fast)
– 2.2x speedup compared with the original hybrid algorithm
• Graph500 and Green Graph500
– Fastest single-node entry in June 2012
– Most power-efficient in June 2013
Future work
• Further optimizing the NUMA-optimized BFS
• Distributed-memory parallel computation
References
• Parallel Breadth-First Search on Distributed Memory Systems
[Aydın Buluç, Kamesh Madduri. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA. {ABuluc, KMadduri}@lbl.gov]
• A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L
[Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]
• Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search
[Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]
• Distributed Breadth First Search
[CSE 6220, Spring 2013, Georgia Institute of Technology, April 18. Guest lecture by Anita Zakrzewska]
• Evaluation and Optimization of Breadth-First Search on NUMA Cluster
[Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences. {cuizehan, chenlicheng, cmy, baoyg, huangyongbing, lvhuiwei}@ict.ac.cn]
• Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory
[Roger Pearce, Maya Gokhale, Nancy M. Amato. Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA. {rpearce, maya}@llnl.gov, {rpearce, amato}@cse.tamu.edu]
• Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems
[Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory. huiweilu@mcs.anl.gov, tgm@ict.ac.cn, cmy@ict.ac.cn, snh@ncic.ac.cn]
• Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems
[Rudolf Berrendorf and Matthias Makulla. Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany. rudolf.berrendorf@h-brs.de, mathias.makulla@h-brs.de]

More Related Content

What's hot

Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
BaliThorat1
 
compiler design
compiler designcompiler design
compiler design
sakthibalabalamuruga
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
Viet-Trung TRAN
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
BaliThorat1
 
Chap6 slides
Chap6 slidesChap6 slides
Chap6 slides
BaliThorat1
 
[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講
[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講
[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講
台灣資料科學年會
 
Google TPU
Google TPUGoogle TPU
Google TPU
Hao(Robin) Dong
 
An Adaptive Load Balancing Middleware for Distributed Simulation
An Adaptive Load Balancing Middleware for Distributed SimulationAn Adaptive Load Balancing Middleware for Distributed Simulation
An Adaptive Load Balancing Middleware for Distributed Simulation
Gabriele D'Angelo
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
Vajira Thambawita
 
Accelerating economics: how GPUs can save you time and money
Accelerating economics: how GPUs can save you time and moneyAccelerating economics: how GPUs can save you time and money
Accelerating economics: how GPUs can save you time and money
Laurent Oberholzer
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)
Yuanyuan Tian
 
Physical organization of parallel platforms
Physical organization of parallel platformsPhysical organization of parallel platforms
Physical organization of parallel platforms
Syed Zaid Irshad
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
Syed Zaid Irshad
 
(Ds+alg) 3
(Ds+alg)   3(Ds+alg)   3
(Ds+alg) 3
MirOmranudinAbhar
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Cheng-Hsuan Li
 
AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING
 AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING
AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING
nexgentechnology
 
Communication model of parallel platforms
Communication model of parallel platformsCommunication model of parallel platforms
Communication model of parallel platforms
Syed Zaid Irshad
 

What's hot (17)

Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
compiler design
compiler designcompiler design
compiler design
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
 
Chap6 slides
Chap6 slidesChap6 slides
Chap6 slides
 
[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講
[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講
[台灣人工智慧學校] 新竹分校第一期結業典禮 - 主題演講
 
Google TPU
Google TPUGoogle TPU
Google TPU
 
An Adaptive Load Balancing Middleware for Distributed Simulation
An Adaptive Load Balancing Middleware for Distributed SimulationAn Adaptive Load Balancing Middleware for Distributed Simulation
An Adaptive Load Balancing Middleware for Distributed Simulation
 
Lecture 7 cuda execution model
Lecture 7   cuda execution modelLecture 7   cuda execution model
Lecture 7 cuda execution model
 
Accelerating economics: how GPUs can save you time and money
Accelerating economics: how GPUs can save you time and moneyAccelerating economics: how GPUs can save you time and money
Accelerating economics: how GPUs can save you time and money
 
Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)Big Graph Analytics Systems (Sigmod16 Tutorial)
Big Graph Analytics Systems (Sigmod16 Tutorial)
 
Physical organization of parallel platforms
Physical organization of parallel platformsPhysical organization of parallel platforms
Physical organization of parallel platforms
 
Limitations of memory system performance
Limitations of memory system performanceLimitations of memory system performance
Limitations of memory system performance
 
(Ds+alg) 3
(Ds+alg)   3(Ds+alg)   3
(Ds+alg) 3
 
Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...Exploring hybrid memory for gpu energy efficiency through software hardware c...
Exploring hybrid memory for gpu energy efficiency through software hardware c...
 
AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING
 AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING
AN OVERLAY ARCHITECTURE FOR THROUGHPUTOPTIMAL MULTIPATH ROUTING
 
Communication model of parallel platforms
Communication model of parallel platformsCommunication model of parallel platforms
Communication model of parallel platforms
 

Viewers also liked

Extreme Scale Breadth-First Search on Supercomputers
Extreme Scale Breadth-First Search on SupercomputersExtreme Scale Breadth-First Search on Supercomputers
Extreme Scale Breadth-First Search on Supercomputers
Toyotaro Suzumura
 
Bfs dfs
Bfs dfsBfs dfs
Bfs dfs
Praveen Yadav
 
Breadth first search
Breadth first searchBreadth first search
Breadth first search
Vignesh Prasanna
 
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node SystemNUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
Yuichiro Yasui
 
Bfs and dfs in data structure
Bfs and dfs in  data structure Bfs and dfs in  data structure
Bfs and dfs in data structure
Ankit Kumar Singh
 
Php workshop L01 CSS
Php workshop L01 CSSPhp workshop L01 CSS
Php workshop L01 CSS
Mohammad Tahsin Alshalabi
 
Open mp library functions and environment variables
Open mp library functions and environment variablesOpen mp library functions and environment variables
Open mp library functions and environment variables
Suveeksha
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
Dhanashree Prasad
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
jbp4444
 
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel Software Brasil
 
Dynamic Programming - Part II
Dynamic Programming - Part IIDynamic Programming - Part II
Dynamic Programming - Part II
Amrinder Arora
 
Bfs and Dfs
Bfs and DfsBfs and Dfs
Bfs and Dfs
Masud Parvaze
 
Graph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First SearchGraph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First Search
Amrinder Arora
 
NP completeness
NP completenessNP completeness
NP completeness
Amrinder Arora
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
Piyush Mittal
 
Dynamic Programming - Part 1
Dynamic Programming - Part 1Dynamic Programming - Part 1
Dynamic Programming - Part 1
Amrinder Arora
 
Graph Traversal Algorithms - Depth First Search Traversal
Graph Traversal Algorithms - Depth First Search TraversalGraph Traversal Algorithms - Depth First Search Traversal
Graph Traversal Algorithms - Depth First Search Traversal
Amrinder Arora
 
BFS
BFSBFS
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of PointsDivide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
Amrinder Arora
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
Smruti Sarangi
 

Viewers also liked (20)

Extreme Scale Breadth-First Search on Supercomputers
Extreme Scale Breadth-First Search on SupercomputersExtreme Scale Breadth-First Search on Supercomputers
Extreme Scale Breadth-First Search on Supercomputers
 
Bfs dfs
Bfs dfsBfs dfs
Bfs dfs
 
Breadth first search
Breadth first searchBreadth first search
Breadth first search
 
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node SystemNUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System
 
Bfs and dfs in data structure
Bfs and dfs in  data structure Bfs and dfs in  data structure
Bfs and dfs in data structure
 
Php workshop L01 CSS
Php workshop L01 CSSPhp workshop L01 CSS
Php workshop L01 CSS
 
Open mp library functions and environment variables
Open mp library functions and environment variablesOpen mp library functions and environment variables
Open mp library functions and environment variables
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
 
Intro to OpenMP
Intro to OpenMPIntro to OpenMP
Intro to OpenMP
 
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
 
Dynamic Programming - Part II
Dynamic Programming - Part IIDynamic Programming - Part II
Dynamic Programming - Part II
 
Bfs and Dfs
Bfs and DfsBfs and Dfs
Bfs and Dfs
 
Graph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First SearchGraph Traversal Algorithms - Breadth First Search
Graph Traversal Algorithms - Breadth First Search
 
NP completeness
NP completenessNP completeness
NP completeness
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
Dynamic Programming - Part 1
Dynamic Programming - Part 1Dynamic Programming - Part 1
Dynamic Programming - Part 1
 
Graph Traversal Algorithms - Depth First Search Traversal
Graph Traversal Algorithms - Depth First Search TraversalGraph Traversal Algorithms - Depth First Search Traversal
Graph Traversal Algorithms - Depth First Search Traversal
 
BFS
BFSBFS
BFS
 
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of PointsDivide and Conquer - Part II - Quickselect and Closest Pair of Points
Divide and Conquer - Part II - Quickselect and Closest Pair of Points
 
Multicore Processors
Multicore ProcessorsMulticore Processors
Multicore Processors
 

Similar to NUMA optimized Parallel Breadth first Search on Multicore Single node System

NUMA-aware Scalable Graph Traversal on SGI UV Systems
NUMA-aware Scalable Graph Traversal on SGI UV SystemsNUMA-aware Scalable Graph Traversal on SGI UV Systems
NUMA-aware Scalable Graph Traversal on SGI UV Systems
Yuichiro Yasui
 
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
Yuichiro Yasui
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
inside-BigData.com
 
NSCC Training Introductory Class
NSCC Training  Introductory ClassNSCC Training  Introductory Class
NSCC Training Introductory Class
National Supercomputing Centre Singapore
 
Presentation
PresentationPresentation
Presentation
butest
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
HPCC Systems
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
inside-BigData.com
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
BigDataEverywhere
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performance
inside-BigData.com
 
CNN Dataflow Implementation on FPGAs
CNN Dataflow Implementation on FPGAsCNN Dataflow Implementation on FPGAs
CNN Dataflow Implementation on FPGAs
NECST Lab @ Politecnico di Milano
 
Intelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiencyIntelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiency
Qualcomm Research
 
An air index for spatial query processing in road networks
An air index for spatial query processing in road networksAn air index for spatial query processing in road networks
An air index for spatial query processing in road networks
ieeepondy
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
Dr Reeja S R
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
inside-BigData.com
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
DonghyunKang12
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
NECST Lab @ Politecnico di Milano
 
KSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor designKSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor design
ssuser7dcef0
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...
Simon Lia-Jonassen
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
HPCC Systems
 

Similar to NUMA optimized Parallel Breadth first Search on Multicore Single node System (20)

NUMA-aware Scalable Graph Traversal on SGI UV Systems
NUMA-aware Scalable Graph Traversal on SGI UV SystemsNUMA-aware Scalable Graph Traversal on SGI UV Systems
NUMA-aware Scalable Graph Traversal on SGI UV Systems
 
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
NSCC Training Introductory Class
NSCC Training  Introductory ClassNSCC Training  Introductory Class
NSCC Training Introductory Class
 
Presentation
PresentationPresentation
Presentation
 
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC SystemsImproving Efficiency of Machine Learning Algorithms using HPCC Systems
Improving Efficiency of Machine Learning Algorithms using HPCC Systems
 
Application Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance CenterApplication Profiling at the HPCAC High Performance Center
Application Profiling at the HPCAC High Performance Center
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
Big Data Everywhere Chicago: High Performance Computing - Contributions Towar...
 
Modern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High PerformanceModern Computing: Cloud, Distributed, & High Performance
Modern Computing: Cloud, Distributed, & High Performance
 
CNN Dataflow Implementation on FPGAs
CNN Dataflow Implementation on FPGAsCNN Dataflow Implementation on FPGAs
CNN Dataflow Implementation on FPGAs
 
Intelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiencyIntelligence at scale through AI model efficiency
Intelligence at scale through AI model efficiency
 
An air index for spatial query processing in road networks
An air index for spatial query processing in road networksAn air index for spatial query processing in road networks
An air index for spatial query processing in road networks
 
Role of python in hpc
Role of python in hpcRole of python in hpc
Role of python in hpc
 
HPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance OptimizationHPC Best Practices: Application Performance Optimization
HPC Best Practices: Application Performance Optimization
 
Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)Cvpr 2018 papers review (efficient computing)
Cvpr 2018 papers review (efficient computing)
 
Deep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLabDeep Learning Initiative @ NECSTLab
Deep Learning Initiative @ NECSTLab
 
KSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor designKSpeculative aspects of high-speed processor design
KSpeculative aspects of high-speed processor design
 
What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...What should be done to IR algorithms to meet current, and possible future, ha...
What should be done to IR algorithms to meet current, and possible future, ha...
 
Advancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine LearningAdvancements in HPCC Systems Machine Learning
Advancements in HPCC Systems Machine Learning
 

More from Mohammad Tahsin Alshalabi

Learning Management System in Damascus University-Information Technology Engi...
Learning Management System in Damascus University-Information Technology Engi...Learning Management System in Damascus University-Information Technology Engi...
Learning Management System in Damascus University-Information Technology Engi...
Mohammad Tahsin Alshalabi
 
Learning management system in information technology engineering faculty
Learning management system in  information technology engineering facultyLearning management system in  information technology engineering faculty
Learning management system in information technology engineering faculty
Mohammad Tahsin Alshalabi
 
Moodle documentation
Moodle documentationMoodle documentation
Moodle documentation
Mohammad Tahsin Alshalabi
 
Moodle plugins programing manual
Moodle plugins programing manualMoodle plugins programing manual
Moodle plugins programing manual
Mohammad Tahsin Alshalabi
 
CodeIgniter L5 email & user agent & security
CodeIgniter L5 email & user agent & securityCodeIgniter L5 email & user agent & security
CodeIgniter L5 email & user agent & security
Mohammad Tahsin Alshalabi
 
CodeIgniter L4 file upload & image manipulation & language
CodeIgniter L4 file upload & image manipulation & languageCodeIgniter L4 file upload & image manipulation & language
CodeIgniter L4 file upload & image manipulation & language
Mohammad Tahsin Alshalabi
 
CodeIgniter L3 model & active record & template
CodeIgniter L3 model & active record  & templateCodeIgniter L3 model & active record  & template
CodeIgniter L3 model & active record & template
Mohammad Tahsin Alshalabi
 
CodeIgniter L2 helper & libraries & form validation
CodeIgniter L2 helper & libraries & form validation CodeIgniter L2 helper & libraries & form validation
CodeIgniter L2 helper & libraries & form validation
Mohammad Tahsin Alshalabi
 
CodeIgniter L1 introduction to CodeIgniter framework
CodeIgniter L1 introduction to CodeIgniter frameworkCodeIgniter L1 introduction to CodeIgniter framework
CodeIgniter L1 introduction to CodeIgniter framework
Mohammad Tahsin Alshalabi
 
Comparison between web and mobile application requirements
Comparison between web and mobile application requirementsComparison between web and mobile application requirements
Comparison between web and mobile application requirements
Mohammad Tahsin Alshalabi
 
Introduction to web services
Introduction to web servicesIntroduction to web services
Introduction to web services
Mohammad Tahsin Alshalabi
 
Introduction to HTML5
Introduction to HTML5Introduction to HTML5
Introduction to HTML5
Mohammad Tahsin Alshalabi
 
Php workshop L04 database
Php workshop L04 databasePhp workshop L04 database
Php workshop L04 database
Mohammad Tahsin Alshalabi
 
Php workshop L03 superglobals
Php workshop L03 superglobalsPhp workshop L03 superglobals
Php workshop L03 superglobals
Mohammad Tahsin Alshalabi
 
Php workshop L02 php basics
Php workshop L02 php basicsPhp workshop L02 php basics
Php workshop L02 php basics
Mohammad Tahsin Alshalabi
 
Php workshop L0 Introduction
Php workshop L0 IntroductionPhp workshop L0 Introduction
Php workshop L0 Introduction
Mohammad Tahsin Alshalabi
 

NUMA optimized Parallel Breadth first Search on Multicore Single node System

  • 1. NUMA optimized Parallel Breadth first Search on Multicore Single node System. Mohammad Opada Al-Bosh, Mohammad Tahsin Al-Shalabi, Ruba Break, Mariam Al-kassar, Nagham Ballan
  • 2. Outline: Background; Breadth-first Search (BFS); NUMA-optimized parallel BFS; Numerical Results; Conclusion
  • 3. Background: Large-scale graphs arise in various fields. US road network: 58 million edges. Twitter follower graph: 1.47 billion edges. Neuronal network: 100 trillion edges.
  • 4. Background: Fast and scalable graph processing using HPC
  • 5. Importance of graph processing. Application fields: transportation, social networks, cyber-security, bioinformatics. Step 3: concurrent search (breadth-first search); optimization (single-source shortest path); edge-oriented (maximal independent set)
  • 6. Breadth-first search: BFS is an important and fundamental graph-processing kernel. It obtains distance (hop) relationships as a stand-alone computation, and it is a building block of many algorithms (BC, max-flow, maximal independent set). Problems for fast and scalable BFS computation: low arithmetic intensity; irregular memory accesses
  • 7. Graph500 Benchmark: Measures computer performance using the TEPS ratio in graph processing such as BFS (breadth-first search). TEPS ratio = number of traversed edges per second
  • 8. Contribution. Efficient hybrid BFS algorithm [Beamer2011,2012]: reduces unnecessary edge traversal. Our proposal: a NUMA-optimized hybrid algorithm that improves locality of memory access through a library for handling NUMA carefully and column-wise graph partitioning
  • 9. Example: 4-way Intel Xeon E5 (64 CPU cores). Scalable: scales well up to 64 threads. Fast: 11.15 GTEPS and a 2.2x speedup compared with the original hybrid algorithm
  • 10. Outline: Background; Breadth-first Search (BFS); NUMA-optimized parallel BFS; Numerical Results; Conclusion
  • 11. Breadth-first Search (BFS): Obtains the level of each vertex from the source vertex. Level = number of hops away from the source. Input: graph G and a source vertex. Output: tree rooted at the source
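The input/output contract on slide 11 can be illustrated with a minimal sequential sketch. The function name `bfs_levels`, the dictionary-based adjacency format, and the small example graph are assumptions made for illustration; they are not taken from the slides.

```python
from collections import deque

def bfs_levels(adj, source):
    """Return (level, parent) dicts for every vertex reachable from source."""
    level = {source: 0}
    parent = {source: source}  # the root is recorded as its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in level:          # first visit fixes level and tree edge
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent

# Hypothetical undirected graph as adjacency lists; vertex 5 is isolated.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
level, parent = bfs_levels(adj, 0)
print(level)
print(parent)
```

The `parent` mapping encodes the output tree rooted at the source, and `level` gives the hop count of each reached vertex. Unreachable vertices simply do not appear in either mapping.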
  • 12. Hybrid BFS for low-diameter graphs: Efficient for low-diameter graphs, i.e. graphs with scale-free and/or small-world properties such as social networks. Used at the higher ranks of the Graph500 benchmark. The hybrid algorithm combines the top-down and bottom-up algorithms and reduces unnecessary edge traversal
  • 13. Hybrid algorithm: top-down algorithm (efficient for a small frontier); bottom-up algorithm (efficient for a large frontier)
  • 14. Top-down algorithm: Explores the outgoing edges of the frontier queue QF. Appends unvisited vertices to the neighbor queue QN. Efficient for a small frontier. Traverses edges unnecessarily for a large frontier
  • 15. Bottom-up algorithm: Explores the frontier queue QF from the unvisited vertices. Appends adjacent vertices to the neighbor queue QN. Efficient for a large frontier. Traverses edges unnecessarily for a small frontier
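Slides 12 to 15 can be sketched as two level-expansion routines plus a per-level direction choice. The slides do not state the switching rule, so the rule below (go bottom-up when the frontier's edge count times `alpha` exceeds the edge count of the unvisited vertices) is an assumption, loosely modeled on the heuristic in Beamer et al. The value `alpha=4`, the function names, and the example graph are also hypothetical. The graph is assumed undirected, so each adjacency list doubles as the list of incoming edges.

```python
def top_down_step(adj, frontier, parent):
    """Scan outgoing edges of the frontier QF; claim unvisited neighbours."""
    next_frontier = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:
                parent[w] = v
                next_frontier.append(w)
    return next_frontier

def bottom_up_step(adj, frontier, parent):
    """Each unvisited vertex searches its edges for any frontier vertex."""
    in_frontier = set(frontier)
    next_frontier = []
    for w in range(len(adj)):
        if parent[w] != -1:
            continue
        for v in adj[w]:
            if v in in_frontier:
                parent[w] = v
                next_frontier.append(w)
                break  # early exit: the remaining edges of w are skipped
    return next_frontier

def hybrid_bfs(adj, source, alpha=4):
    """Level-synchronous BFS that picks a direction per level (assumed rule)."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    directions = []
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited_edges = sum(len(adj[w]) for w in range(n) if parent[w] == -1)
        if frontier_edges * alpha > unvisited_edges:
            directions.append("bottom-up")
            frontier = bottom_up_step(adj, frontier, parent)
        else:
            directions.append("top-down")
            frontier = top_down_step(adj, frontier, parent)
    return parent, directions

# Hypothetical undirected graph, stored as a list of adjacency lists.
adj = [[1, 2, 3], [0, 2, 4], [0, 1, 5], [0], [1], [2]]
parent, directions = hybrid_bfs(adj, 0)
print(parent)
print(directions)
```

The `break` in `bottom_up_step` is where the saving on a large frontier comes from: an unvisited vertex stops scanning as soon as it finds one parent, whereas top-down would inspect every edge leaving the frontier.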
  • 16. Outline: Background; Breadth-first Search (BFS); NUMA-optimized parallel BFS; Numerical Results; Conclusion
  • 17. How to speed up the hybrid algorithm? NUMA architecture: non-uniform memory access; each CPU socket has its own local RAM; local RAM is fast and non-local RAM is slow. Example: 4-socket Intel Xeon E5 system
  • 18. Frequent non-local memory accesses on NUMA architecture
  • 19. Difficulty of considering NUMA architecture: 1. How to distribute the graph and data to each local RAM? 2. How to bind each partial graph and its data to a NUMA unit?
  • 20. ULIBC: Ubiquity Library for Intelligently Binding Cores. 1. NUMACTL (command-line tool, library for C/C++) 2. Intel compiler Thread Affinity Interface (API) 3. ULIBC (our library, library for C/C++). Processor ID: index of logical processor core. Package ID: index of CPU socket. Core ID: index of physical core in each CPU socket
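The three identifiers on slide 20 correspond to values that Linux exposes under `/sys/devices/system/cpu/cpuN/topology/`. The sketch below reads them directly from sysfs; it does not use ULIBC, whose API is not shown on the slides. The function names `read_topology` and `processors_by_package` are hypothetical, and the code assumes a Linux-style sysfs layout.

```python
import os

def read_topology(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return sorted (processor_id, package_id, core_id) tuples from sysfs."""
    entries = []
    for name in os.listdir(sysfs_cpu_dir):
        # Keep only cpu0, cpu1, ...; skip entries such as cpufreq or cpuidle.
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        topo = os.path.join(sysfs_cpu_dir, name, "topology")
        try:
            with open(os.path.join(topo, "physical_package_id")) as f:
                package_id = int(f.read())
            with open(os.path.join(topo, "core_id")) as f:
                core_id = int(f.read())
        except (OSError, ValueError):
            continue  # offline CPU or unexpected layout: skip this entry
        entries.append((int(name[3:]), package_id, core_id))
    return sorted(entries)

def processors_by_package(entries):
    """Group logical processor IDs by the CPU socket they belong to."""
    groups = {}
    for processor_id, package_id, _core_id in entries:
        groups.setdefault(package_id, []).append(processor_id)
    return groups

if os.path.isdir("/sys/devices/system/cpu"):
    print(processors_by_package(read_topology()))
```

With such a mapping, a thread working on partition k could be restricted to the logical processors of socket k, for example with `os.sched_setaffinity` on Linux. Note that the package ID identifies a socket, which need not coincide with a NUMA unit on every machine (the AMD system on slide 27 has 8 NUMA units on 4 sockets).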
  • 21. NUMA-optimized column-wise graph partitioning: Divides G=(V,A) into partial graphs Gk=(Vk,Ak) and binds each Gk to local RAM k. Ak is the set of adjacency lists that hold the incoming edges to Vk.
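The partitioning on slide 21 can be sketched as follows. The slide does not say how V is split into the sets Vk, so contiguous equal-sized blocks are assumed here. The function name `partition_columnwise`, the edge-list input format, and the example edges are hypothetical. Plain Python cannot place data in a particular RAM bank, so the binding step is omitted.

```python
def partition_columnwise(n, edges, num_nodes):
    """Split vertices 0..n-1 into contiguous blocks V_k and build A_k.

    A_k maps each w in V_k to the sources v of its incoming edges (v, w).
    """
    block = (n + num_nodes - 1) // num_nodes  # ceiling division
    vertex_sets = [range(k * block, min((k + 1) * block, n))
                   for k in range(num_nodes)]
    incoming = [{w: [] for w in vs} for vs in vertex_sets]
    for v, w in edges:
        k = w // block  # owner of the destination vertex
        incoming[k][w].append(v)
    return vertex_sets, incoming

# Hypothetical directed edge list; each undirected edge is stored twice.
edges = [(0, 1), (1, 0), (0, 5), (5, 0), (2, 7), (7, 2)]
vertex_sets, incoming = partition_columnwise(8, edges, 4)
for k, a_k in enumerate(incoming):
    print(k, list(vertex_sets[k]), a_k)
```

Because every edge (v, w) is stored with the owner of w, a thread that decides whether w has been visited, and writes its parent, only touches adjacency data and per-vertex state held in its own partition.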
  • 22. NUMA-optimized top-down: Explores the outgoing edges of the frontier queue QF. Appends unvisited vertices to the neighbor queue QN. Efficient for a small frontier. Traverses edges unnecessarily for a large frontier
  • 24. NUMA-optimized bottom-up: Explores the frontier queue QF from the unvisited vertices. Appends adjacent vertices to the neighbor queue QN. Efficient for a large frontier. Traverses edges unnecessarily for a small frontier
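The slide text does not spell out how the NUMA-optimized steps differ from the plain ones, so the following is only one plausible reading, combined with the column-wise partitioning of slide 21: each NUMA unit k runs the bottom-up step over its own vertices Vk using its own incoming-edge lists Ak, and fills a local neighbor queue. The partitions are processed one after another here to keep the sketch runnable; a real implementation would run one group of pinned threads per partition. All names and the example data are hypothetical.

```python
def bottom_up_step_partitioned(incoming, frontier, parent):
    """One bottom-up level; partition k only writes parent[w] for w in V_k."""
    in_frontier = set(frontier)  # read-only frontier, visible to every partition
    next_queues = []
    for local_adj in incoming:           # one iteration per NUMA unit k
        local_next = []
        for w, sources in local_adj.items():
            if parent[w] != -1:
                continue
            for v in sources:
                if v in in_frontier:
                    parent[w] = v        # write stays inside partition k
                    local_next.append(w)
                    break
        next_queues.append(local_next)
    return next_queues

# Hypothetical 4-vertex graph split into two partitions of incoming edges.
incoming = [{0: [1, 2], 1: [0, 3]}, {2: [0, 3], 3: [1, 2]}]
parent = [0, -1, -1, -1]
frontier = [0]
levels = []
while frontier:
    queues = bottom_up_step_partitioned(incoming, frontier, parent)
    levels.append(queues)
    frontier = [w for q in queues for w in q]  # merge the local queues
print(parent)
print(levels)
```

Keeping one neighbor queue per partition avoids contention on a single shared queue; the queues are merged only between levels.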
  • 26. Outline: Background; Breadth-first Search (BFS); NUMA-optimized parallel BFS; Numerical Results; Conclusion
  • 27. Machine specification. 4-way Intel Xeon E5: CentOS 6.4 (kernel 2.6.32), GCC 4.4.7, 64 logical CPU cores (4 NUMA units x 16 logical cores). 4-way AMD Opteron 6174: Fedora 19 (kernel 3.11.2), GCC 4.8.1, 48 CPU cores (8 NUMA units x 6 cores)
  • 28. TEPS ratio varied with problem size: The NUMA-optimized algorithm achieves a 2.2x speedup compared with the original hybrid algorithm. It achieves 11.15 GTEPS for a Kronecker graph (SCALE 26).
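The figures on slide 28 can be put in perspective with a short back-of-the-envelope calculation. It assumes the Graph500 default edge factor of 16 (not stated on the slides), so a SCALE 26 graph has 16 x 2^26 input edges. It also assumes that the 2.2x speedup and the 11.15 GTEPS figure refer to the same run, which the slides do not confirm. The derived values are rough estimates, not measured results.

```python
def teps(traversed_edges, seconds):
    """TEPS = traversed edges per second."""
    return traversed_edges / seconds

SCALE = 26
EDGE_FACTOR = 16                        # assumption: Graph500 default
num_edges = EDGE_FACTOR * 2 ** SCALE    # input edges of the Kronecker graph

reported_teps = 11.15e9                 # 11.15 GTEPS, from the slide
implied_seconds = num_edges / reported_teps

# If the 2.2x speedup applies to this run, the original hybrid algorithm
# would have reached roughly this many GTEPS.
implied_baseline_gteps = 11.15 / 2.2

print(num_edges)
print(round(implied_seconds, 4))
print(round(implied_baseline_gteps, 2))
```

Under these assumptions a single BFS over roughly a billion edges takes on the order of a tenth of a second.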
  • 29. Strong scaling on Intel/AMD systems: Scales well up to a number of threads equal to the number of cores
  • 31. Graph500 benchmark: Fastest single-node entry on the 4th list (June 2012). Fastest CPU-based single-node entry on the 6th list (June 2013)
  • 32. 1st Green Graph500 list (June 2013): Measures power efficiency using the TEPS/W ratio. Results on various systems such as Android, Linux, and Mac. NUMA Small Data category
  • 33. Outline: Background; Breadth-first Search (BFS); NUMA-optimized parallel BFS; Numerical Results; Conclusion
  • 34. Conclusion. NUMA-optimized hybrid BFS algorithm: reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA. Numerical results on the 4-way Intel Xeon: scales well up to 64 threads (scalable); achieves 11.15 GTEPS (fast); 2.2x speedup compared with the original hybrid algorithm. Graph500 and Green Graph500: fastest single node in June 2012; most power-efficient in June 2013
  • 35. Future work: Further optimization of the NUMA-optimized BFS. Distributed-memory parallel computation
  • 36. References: Parallel Breadth-First Search on Distributed Memory Systems [Aydın Buluç, Kamesh Madduri; Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA; {ABuluc, KMadduri}@lbl.gov]. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L [Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]. Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search [Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]. Distributed Breadth First Search [CSE 6220, Spring 2013, Georgia Institute of Technology, April 18; guest lecture by Anita Zakrzewska]. Evaluation and Optimization of Breadth-First Search on NUMA Cluster [Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences; {cuizehan, chenlicheng, cmy, baoyg, huangyongbing, lvhuiwei}@ict.ac.cn]
  • 37. References: Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory [Roger Pearce, Maya Gokhale, Nancy M. Amato; Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA; {rpearce, maya}@llnl.gov, {rpearce, amato}@cse.tamu.edu]. Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems [Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory; huiweilu@mcs.anl.gov, tgm@ict.ac.cn, cmy@ict.ac.cn, snh@ncic.ac.cn]. Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems [Rudolf Berrendorf, Matthias Makulla; Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany; rudolf.berrendorf@h-brs.de, mathias.makulla@h-brs.de]