The document presents NUMA-aware scalable graph traversal on SGI UV systems. It proposes an efficient NUMA-aware breadth-first search (BFS) algorithm for large-scale graph processing that prunes remote edge traversals. Numerical results on an SGI UV 300 system with 32 sockets show the algorithm achieves 219 GTEPS (billion traversed edges per second), setting a new single-node performance record on the Graph500 benchmark.
NUMA-aware Scalable Graph Traversal on SGI UV Systems
1. NUMA-aware scalable graph traversal on SGI UV systems
*Yuichiro Yasui, Katsuki Fujisawa (Kyushu University)
Eng Lim Goh, John Baron, Atsushi Sugiura (SGI Corp.)
Takashi Uchiyama (SGI Japan, Ltd.)
HPGP'16 @ Kyoto, May 31, 2016
2. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
3. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
4. Our motivations
• NUMA / cc-NUMA architecture; graph algorithm: BFS
• Efficient NUMA-aware BFS algorithm
– improves locality of reference in memory accesses
– exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
[Figure: Graph500 pipeline (input parameters SCALE and edgefactor → graph generation → graph construction → BFS) on a many-socket system; each CPU socket with its local RAM is assigned a partial CSR graph, so edge traversals split into fast local accesses and slow remote accesses. Target example: a Kronecker graph with SCALE 34 (17 billion vertices and 275 billion edges) on SGI UV 300 with 32 sockets of 18-core Xeon and 16 TB RAM.]
5. Our contributions (Previous work and this paper)
• Efficient graph data structure (CSR graph, bound to NUMA nodes):
– NUMA-aware 1-D partitioned graph (BD13): subgraphs A0-A3 assigned per NUMA node
– Adjacency list sorting by outdegree (ISC14)
– Vertex sorting (HPCS15)
• Efficient BFS based on Beamer's direction-optimizing BFS (SC12):
– Top-down direction: Agarwal's socket-queue top-down (SC10), handling local and remote edges separately
– Bottom-up direction: NUMA-aware bottom-up (BD13), with input CQ, data VSk, and output NQk kept local
• This paper: reduction of remote edges by pruning remote edge traversals
– New result: 219 GTEPS on UV 300 with 32 sockets, the updated highest single-node score
6. Graph processing for large-scale networks
• Large-scale networks arise in a wide range of application areas:
– US road network: 24 million vertices & 58 million edges
– Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
– Cyber-security: 15 billion log entries per day
– Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges
• Fast and scalable graph processing with HPC
– categorized as a data-intensive application
7. Graph analysis and its important kernel, BFS
• Graph processing is used to understand relationships in real-world networks
• Application fields: transportation, social networks, cyber-security, bioinformatics
• Graph kernels built on BFS: breadth-first search, single-source shortest path, maximum flow, maximal independent set, centrality metrics, clustering, graph mining
[Figure: pipeline from input parameters (SCALE, edgefactor) through Step 1 graph generation, Step 2 graph construction, and Step 3 BFS with validation over 64 iterations, yielding results (BFS time, traversed edges, TEPS).]
8. Breadth-first search (BFS)
• One of the most important and fundamental algorithms for traversing graph structures
• Many algorithms and applications build on BFS (e.g., maximum flow and centrality)
• The well-known algorithm takes O(n + m) time for a digraph G with n vertices and m edges
• Inputs: digraph G = (V, E) and a source vertex
• Outputs: BFS tree and the distance (level) of every vertex from the source
[Figure: BFS from a source vertex expanding level by level (Lv. 1, Lv. 2, Lv. 3) into a BFS tree.]
9. BFS on the Twitter follow-ship network (Twitter2009)
• Follow-ship network
– #users (#vertices): 41,652,230
– follow-ships (#edges): 2,405,026,092

BFS result from user 21,804,357, excluding unconnected users:

Lv.    #users       ratio (%)  percentile (%)
0      1            0.00       0.00
1      7            0.00       0.00
2      6,188        0.01       0.01
3      510,515      1.23       1.24
4      29,526,508   70.89      72.13
5      11,314,238   27.16      99.29
6      282,456      0.68       99.97
7      11,536       0.03       100.00
8      673          0.00       100.00
9      68           0.00       100.00
10     19           0.00       100.00
11     10           0.00       100.00
12     5            0.00       100.00
13     2            0.00       100.00
14     2            0.00       100.00
15     2            0.00       100.00
Total  41,652,230   100.00     -

• Consistent with six degrees of separation: "everyone and everything is six or fewer steps away"
• Ours: 60 milliseconds per BFS
10. Graph500 and Green Graph500
• New benchmarks based on graph processing (breadth-first search)
• Measure the performance and energy efficiency of irregular memory accesses
– Graph500 benchmark: the TEPS score (# of traversed edges per second) measures the performance of irregular memory accesses
– Green Graph500 benchmark: the TEPS-per-Watt score measures power-efficient performance
• Benchmark flow (input parameters: SCALE and edgefactor (= 16)):
1. Generation: a Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges, generated by SCALE applications of the recursive Kronecker product
2. Construction: build the graph data structure
3. BFS × 64: run and validate 64 BFS iterations; results are the BFS time, traversed edges, and TEPS of each iteration, reported as the median of the 64 TEPS values; the Green Graph500 additionally measures power consumption in watts for the TEPS-per-Watt score
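As a worked note on the scoring (our paraphrase of the inputs and outputs named above, not additional benchmark rules): each iteration is scored as TEPS = m / t_BFS, where m is the number of traversed edges and t_BFS is the measured BFS time, and the reported Graph500 score is the median over the 64 iterations. For example, traversing m = 2.19 × 10^11 edges in one second would score 219 GTEPS, the figure reported later for UV 300.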
11. NUMA (non-uniform memory access) system
• Threads see different distances to memory: fast local access to the RAM of their own NUMA node, slow remote access to other nodes; thread placement and memory placement together determine which accesses are local
• Example: a 4-socket Xeon system with 4 CPU sockets, 8 physical cores per socket, and 2 threads per core

Bandwidth (GB/s) between NUMA nodes on the 4-socket system:

       0     1     2     3
0     24.2   3.4   3.0   3.4
1      3.3  23.9   3.5   3.0
2      3.0   3.4  24.3   3.4
3      3.5   3.0   3.4  24.2
(rows/columns: source and target NUMA node ids; diagonal elements = local access)

• Local access: about 24 GB/s; remote access: about 3 GB/s
12. SGI UV 2000
• UV 2000
– Single OS: SUSE Linux 11 (x86_64)
– Hypercube interconnect (NUMAlink6, 6.7 GB/s)
– Up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis × 2 sockets × 10 cores)
• Hierarchical network topology: sockets, chassis, cubes, inner-racks, and outer-racks
– chassis = 2 sockets; cube = 8 chassis; rack = 32 nodes
• ISM has two full-specification UV 2000 systems (systems at ISM and Kyushu U.)
[Figure: UV 2000 chassis (2 sockets, each CPU + RAM, linked by NUMAlink6 at 6.7 GB/s), cube (8 chassis), and rack (32 nodes); the annotation notes that the NUMA node cannot be detected across this link.]
13. SGI UV 300
• UV 300
– Single OS: SUSE Linux 11 (x86_64)
– All-to-all interconnect
– Up to 1,152 cores and 16 TB RAM (= 8 UV 300 chassis × 4 sockets × 18 cores × 2 SMT)
• UV 300 chassis
– 4-socket 18-core Intel Xeon E7-8867 (Haswell), HT enabled (2-way SMT)
– 2 TB RAM (512 GB per NUMA node)
• UV 300 rack: 8 chassis connected all-to-all (system at Kyushu U.)
14. Memory bandwidths on UV 2000 and UV 300
• Bandwidths in GB/s between NUMA nodes, measured with STREAM TRIAD
• Local access is clearly faster than remote access
– UV 2000 (64 sockets): local 33 GB/s, remote 3-7 GB/s; each chassis has 2 sockets, and chassis connect to each other by a hypercube topology
– UV 300 (32 sockets): local 56 GB/s, within-chassis 12-14 GB/s, remote 6 GB/s; each chassis has 4 sockets, and chassis connect to each other by an all-to-all topology
[Figure: thread placement × memory placement bandwidth heatmaps for UV 2000 (64 × 64 sockets) and UV 300 (32 × 32 sockets).]
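For reference, this is the standard STREAM TRIAD kernel behind these numbers (a minimal sketch; the per-socket vector length and the scalar are our assumptions, not values from the slides):

#include <stddef.h>

/* STREAM TRIAD: a[i] = b[i] + scalar * c[i]. Effective bandwidth is
 * 3 * n * sizeof(double) bytes moved per execution, divided by time. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n) {
  for (size_t i = 0; i < n; ++i)
    a[i] = b[i] + scalar * c[i];
}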
15. Programming cost for NUMA-aware computation
• Thread and memory binding
– reduces remote accesses
– avoids thread migration
• Linux provides low-level interfaces
– sched_{set,get}affinity(): binds a thread to a processor set (specified by processor id)
– mbind(): binds pages to a NUMA node set (specified by NUMA node id)
– Linux exposes processor ids and NUMA node ids as system files: /proc/cpuinfo, /sys/devices/system/{node,cpu}/
• Reducing programming cost using ULIBC (see the sketches below)
– provides APIs for NUMA-aware programming
– available at https://bitbucket.org/yuichiro_yasui/ulibc
#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling thread to one processor, specified by processor id. */
int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
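The slide names mbind() but shows only the thread-binding path. A rough companion sketch (our addition, not from the slides or ULIBC; bind_pages() and its arguments are hypothetical) might look like this:

#define _GNU_SOURCE
#include <stddef.h>
#include <numaif.h>   /* mbind(), MPOL_BIND; link with -lnuma */

/* Hypothetical helper: bind the pages of [addr, addr+len) to a single
 * NUMA node, specified by NUMA node id. addr must be page-aligned,
 * e.g., obtained from mmap() or posix_memalign(). */
int bind_pages(void *addr, size_t len, int node) {
  unsigned long nodemask = 1UL << node;   /* assumes node id < 64 */
  return (int)mbind(addr, len, MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, 0);
}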
16. CPU affinity construction using ULIBC
1. Detects the entire topology, e.g., a 4-socket system with sockets 0-3 and RAM 0-3.
2. Detects the online topology, i.e., the subset permitted at run time, e.g., by: numactl --cpunodebind=1,2 --membind=1,2
3. Constructs two types of affinity:
– Compact-type affinity assigns threads in positions close to each other (filling one NUMA node and its local RAM before the next), e.g.: export ULIBC_AFFINITY=compact:fine; export OMP_NUM_THREADS=7
– Scatter-type affinity distributes the threads as evenly as possible across online processors (alternating between NUMA nodes), e.g.: export ULIBC_AFFINITY=scatter:fine; export OMP_NUM_THREADS=7
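A small sketch of the two placement policies (our illustration of the idea only, not ULIBC's internals; the node/core counts are assumptions):

#include <stdio.h>

/* Map thread t onto (node, core) for N NUMA nodes with C cores each. */
typedef struct { int node, core; } place_t;

place_t compact_affinity(int t, int C) {   /* fill each node before the next */
  place_t p = { t / C, t % C };
  return p;
}

place_t scatter_affinity(int t, int N) {   /* round-robin over nodes */
  place_t p = { t % N, t / N };
  return p;
}

int main(void) {
  /* 2 NUMA nodes x 4 cores, 7 threads, echoing the slide's example */
  for (int t = 0; t < 7; ++t) {
    place_t c = compact_affinity(t, 4), s = scatter_affinity(t, 2);
    printf("thread %d: compact -> node %d core %d, scatter -> node %d core %d\n",
           t, c.node, c.core, s.node, s.core);
  }
}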
17. NUMA-aware computation with ULIBC
• ULIBC is a callable library for NUMA-aware computation
• Detects the processor topology at run time
• Constructs thread and memory affinity settings
• ULIBC is available at https://bitbucket.org/yuichiro_yasui/ulibc

#include <stdio.h>
#include <omp.h>
#include <ulibc.h>    /* include header file */
int main(void) {
  ULIBC_init();                                   /* initialize */
  _Pragma("omp parallel") {
    const int tid = ULIBC_get_thread_num();       /* get thread id */
    ULIBC_bind_thread();                          /* bind current thread */
    const struct numainfo_t loc = ULIBC_get_numainfo(tid);  /* get NUMA placement */
    printf("Thread: %2d, NUMA-node: %d, NUMA-core: %d\n",
           loc.id, loc.node, loc.core);
    /* do something */
  }
}

Execution log on a 4-socket system (thread ID, NUMA node ID, core ID):
Thread:  4, NUMA-node: 0, NUMA-core:  1
Thread: 55, NUMA-node: 3, NUMA-core: 13
Thread: 16, NUMA-node: 0, NUMA-core:  4
Thread: 37, NUMA-node: 1, NUMA-core:  9
Thread: 30, NUMA-node: 2, NUMA-core:  7
. . .
18. Level-synchronized parallel BFS (top-down)
• Starts from the source vertex and executes the following two phases at each level, synchronizing between levels:
– Traversal phase: finds unvisited vertices adjacent to the current queue CQ and appends them to the next queue NQ
– Swap phase: exchanges CQ and NQ for the next level
[Figure: levels 0-3 expanding from the source; at each level CQ holds the frontier, NQ collects the newly visited vertices, and a synchronization separates levels.]
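A minimal serial sketch of these two phases over a CSR graph (our illustration in plain C; the authors' parallel implementation splits the traversal loop across NUMA-bound threads):

#include <stdint.h>
#include <stdlib.h>

/* Level-synchronized top-down BFS over a CSR graph: row[v]..row[v+1]
 * indexes the adjacency list of v in col[]. dist[] receives the BFS
 * level of each vertex (-1 if unreachable). */
void bfs_topdown(int64_t n, const int64_t *row, const int64_t *col,
                 int64_t src, int64_t *dist) {
  int64_t *cq = malloc(n * sizeof(int64_t));   /* current queue (frontier) */
  int64_t *nq = malloc(n * sizeof(int64_t));   /* next queue */
  for (int64_t v = 0; v < n; ++v) dist[v] = -1;
  int64_t cq_len = 1, level = 0;
  dist[src] = 0;
  cq[0] = src;
  while (cq_len > 0) {                         /* one iteration per level */
    int64_t nq_len = 0;
    for (int64_t i = 0; i < cq_len; ++i) {     /* traversal phase */
      int64_t v = cq[i];
      for (int64_t e = row[v]; e < row[v + 1]; ++e) {
        int64_t w = col[e];
        if (dist[w] < 0) {                     /* unvisited: append to NQ */
          dist[w] = level + 1;
          nq[nq_len++] = w;
        }
      }
    }
    int64_t *tmp = cq; cq = nq; nq = tmp;      /* swap phase */
    cq_len = nq_len;
    ++level;                                   /* level synchronization */
  }
  free(cq); free(nq);
}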
19. Direction-optimizing BFS [Beamer, SC12]
• Chooses between two directions at each level: top-down or bottom-up
– Top-down direction follows outgoing edges from the frontier to candidate neighbors
– Bottom-up direction follows incoming edges: each unvisited candidate searches the frontier for a parent
• Forward search (top-down) and backward search (bottom-up) for breadth-first search; traversed edges per level (example):

Level  Top-down       Bottom-up      Hybrid
0      2              2,103,840,895  2
1      66,206         1,766,587,029  66,206
2      346,918,235    52,677,691     52,677,691
3      1,727,195,615  12,820,854     12,820,854
4      29,557,400     103,184        103,184
5      82,357         21,467         21,467
6      221            21,240         227
Total  2,103,820,036  3,936,072,360  65,689,631
Ratio  100.00%        187.09%        3.12%

• Direction-optimizing BFS runs top-down while the frontier is small, switches to bottom-up around the large frontier, and may switch back to top-down at the end, traversing only 3.12% of the edges in this example
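A sketch of one bottom-up level (our illustration following Beamer's formulation, not the authors' code; CQ and NQ are bitmaps here, and row/col index incoming edges, which coincide with outgoing edges for the undirected Graph500 graphs):

#include <stdint.h>

/* One bottom-up level: every unvisited vertex w scans its incoming
 * edges and adopts the first parent found in the frontier bitmap cq.
 * Returns the number of newly visited vertices. */
int64_t bottomup_level(int64_t n, const int64_t *row, const int64_t *col,
                       const uint64_t *cq, uint64_t *nq,
                       int64_t *dist, int64_t level) {
  int64_t found = 0;
  for (int64_t w = 0; w < n; ++w) {
    if (dist[w] >= 0) continue;                 /* skip visited vertices */
    for (int64_t e = row[w]; e < row[w + 1]; ++e) {
      int64_t v = col[e];                       /* candidate parent */
      if (cq[v >> 6] & (1ULL << (v & 63))) {    /* v is in the frontier */
        dist[w] = level + 1;
        nq[w >> 6] |= 1ULL << (w & 63);
        ++found;
        break;        /* early exit: remaining adjacencies are skipped */
      }
    }
  }
  return found;
}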
20. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
21. Our contributions (previous work and this paper)
• Efficient graph data structure (CSR graph, bound to NUMA nodes):
– NUMA-aware 1-D partitioned graph (BD13)
– Adjacency list sorting by outdegree (ISC14)
– Vertex sorting (HPCS15)
• Efficient BFS based on Beamer's direction-optimizing BFS (SC12):
– Top-down direction: Agarwal's socket-queue top-down (SC10); this paper adds pruning of remote edges
– Bottom-up direction: NUMA-aware bottom-up (BD13)
• New results:
– UV 2000 with 64 sockets: 131 GTEPS without pruning, 152 GTEPS with pruning (NEW)
– UV 300 with 32 sockets: 219 GTEPS (NEW), 16% faster than the previous highest single-node entry
22. Ours: NUMA-aware 1-D partitioned graph [BD13]
• Divides the graph into subgraphs A0, ..., A3 (a 1-D partition of the adjacency matrix) and assigns one to each NUMA node; each subgraph is represented as a CSR graph on the node's local RAM
• A modified version of Agarwal's NUMA-aware BFS
• In the bottom-up direction (the bottleneck component), each NUMA node k computes its partial NQ using a local copy of the frontier CQ and the locally assigned visited set VSk, so all accesses are local
• The top-down direction uses the inverse of G (G is undirected): the frontier CQk and visited VSk are local, but writing the neighbors NQ incurs remote accesses
[Figure: adjacency matrix divided into A0-A3 and assigned to four CPU + RAM NUMA nodes; local vs. remote accesses in the two directions.]
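A sketch of the per-node layout (our illustration using libnuma's numa_alloc_onnode(), which the slides do not mention; the partition struct and sizes are assumptions):

#include <stdint.h>
#include <numa.h>      /* numa_alloc_onnode(); link with -lnuma */

/* NUMA node k owns vertices [lo, hi) and the corresponding CSR rows,
 * allocated on its local RAM. m_part is that partition's edge count.
 * Real code should first check numa_available() >= 0. */
typedef struct { int64_t lo, hi; int64_t *row, *col; } part_csr_t;

part_csr_t alloc_partition(int64_t n, int64_t m_part, int k, int ell) {
  part_csr_t p;
  p.lo  = n * (int64_t)k / ell;        /* 1-D split of the vertex range */
  p.hi  = n * (int64_t)(k + 1) / ell;
  p.row = numa_alloc_onnode((p.hi - p.lo + 1) * sizeof(int64_t), k);
  p.col = numa_alloc_onnode(m_part * sizeof(int64_t), k);
  return p;
}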
23. Ours: Adjacency list sorting [ISC14]
• Reduces unnecessary edge traversals in the bottom-up direction: the bottom-up scan of an adjacency list breaks as soon as it finds a frontier vertex, skipping the remaining adjacency vertices
• Sorting each adjacency list by the neighbors' outdegree (high to low) makes that break happen earlier, lowering the loop count τ
[Figure: bottom-up scans of adjacency lists A(va) and A(vb); traversed vs. skipped adjacency vertices.]
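A sketch of this preprocessing step (our illustration, not the authors' code; deg[] is an assumed global outdegree array, and qsort()'s lack of a context pointer forces the file-scope variable):

#include <stdint.h>
#include <stdlib.h>

static const int64_t *g_deg;           /* outdegree table for the comparator */

static int by_outdegree_desc(const void *a, const void *b) {
  int64_t da = g_deg[*(const int64_t *)a];
  int64_t db = g_deg[*(const int64_t *)b];
  return (da < db) - (da > db);        /* high outdegree first */
}

/* Sort each CSR adjacency list so high-outdegree (likely-frontier)
 * neighbors come first, letting the bottom-up loop break early. */
void sort_adjacency_lists(int64_t n, const int64_t *row, int64_t *col,
                          const int64_t *deg) {
  g_deg = deg;
  for (int64_t v = 0; v < n; ++v)
    qsort(col + row[v], row[v + 1] - row[v], sizeof(int64_t),
          by_outdegree_desc);
}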
24. Ours: Vertex sorting [HPCS15]
• The number of traversals of a vertex equals the outdegree of that vertex, so access frequency and outdegree are correlated
• Our vertex sorting reorders vertex indices by outdegree: the highest-outdegree vertices receive the smallest indices, so most accesses concentrate on small-index vertices
[Figure: degree distribution and access frequency with vertex sorting; original indices vs. indices sorted by outdegree.]
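A sketch of building this relabeling (our illustration, not the authors' code):

#include <stdint.h>
#include <stdlib.h>

typedef struct { int64_t deg, id; } dv_t;

static int by_deg_desc(const void *a, const void *b) {
  int64_t da = ((const dv_t *)a)->deg, db = ((const dv_t *)b)->deg;
  return (da < db) - (da > db);                 /* descending outdegree */
}

/* perm[old_id] = new_id, with new ids assigned in decreasing
 * outdegree order so frequently traversed vertices get small indices. */
void vertex_sort_permutation(int64_t n, const int64_t *deg, int64_t *perm) {
  dv_t *tmp = malloc(n * sizeof(dv_t));
  for (int64_t v = 0; v < n; ++v) { tmp[v].deg = deg[v]; tmp[v].id = v; }
  qsort(tmp, n, sizeof(dv_t), by_deg_desc);
  for (int64_t r = 0; r < n; ++r) perm[tmp[r].id] = r;
  free(tmp);
}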
25. NUMA-aware top-down BFS
• The original version was proposed by Agarwal [Agarwal, SC10]
• Reduces random remote accesses using socket queues (Local : Remote = 1 : ℓ on ℓ sockets)
– Phase 1 (CQ → NQ or socket queue): each NUMA node traverses its part of CQ, appends unvisited local vertices to NQ via the visited set VS, and appends remote edges to the target socket's queue; synchronize
– Phase 2 (socket queue → NQ): each NUMA node drains its own socket queue, appending unvisited vertices to NQ; synchronize
– Next level: swap CQ and NQ
[Figure: four NUMA nodes, example focused on NUMA 2; local edges go directly to NQ, remote edges pass through socket queues.]
26. NUMA-aware top-down BFS with pruning of remote edges
• This paper prunes remote edges to reduce remote accesses (extending the design of Agarwal's SC10 paper)
– Without pruning (original): each NUMA node appends all remote edges (v, w) to the corresponding socket queue
– With pruning (this paper): each NUMA node appends a remote edge (v, w) to the corresponding socket queue only if the bitmap F does not already contain w (and then adds w to F), so each vertex is searched only once
• F reuses the CQ bitmap of the bottom-up direction and is not re-initialized while there is no change of search direction
[Figure: remote edge traversal focused on NUMA 2, with and without pruning; CQ is a vector queue.]
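A sketch of the pruning test (our paraphrase of the rule above, not the authors' code; the serial check-then-set would need an atomic fetch-or in the real multithreaded setting):

#include <stdint.h>

/* Before appending remote edge (v, w) to the target socket queue,
 * test-and-mark w in the shared bitmap F; if w was already marked,
 * the edge is pruned. A lost update under races costs only a
 * duplicate queue entry, which Phase 2's visited check absorbs. */
static inline int keep_remote_edge(uint64_t *F, int64_t w) {
  uint64_t bit = 1ULL << (w & 63);
  if (F[w >> 6] & bit)
    return 0;            /* w already queued once: prune this edge */
  F[w >> 6] |= bit;      /* mark w so later edges to it are pruned */
  return 1;              /* caller appends (v, w) to the socket queue */
}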
27. Effects of pruning & updated TEPS score
• Pruning removes many remote edges: for a Kronecker graph with SCALE 29 on a four-socket server (SB4), Figure 5 shows the per-NUMA-node ratio of local, pruned-remote, and remote traversed edges at each level of the top-down algorithm with pruning; the total traversed edges per level are 4, 9.04K, 221M, 15.3B, 1.50B, 4.55M, 11.1K, and 29 for levels 0-7 (Algorithm 3 in the paper lists the NUMA-aware top-down procedure with pruning)
• The direction schedule changes from TD → BU (previous) to TD → BU → TD (this paper); pruning may not be effective on a small number of sockets, because the algorithm switches to the bottom-up direction at the middle levels anyway
• Weak scaling on UV 2000 (Figure 6; GTEPS over 1, 2, 4, 8, 16, 32, 64 (, 128) NUMA nodes; the UV 2000 runs used only 64 sockets):
– HPCS15-SG (SCALE 26 per NUMA node): 7.7, 15.3, 24.2, 42.1, 59.4, 94.8, 131.4, 174.7
– This paper (SCALE 27 per NUMA node): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2
• Updated TEPS score on many sockets: UV 2000 with 64 sockets achieves 131 GTEPS without pruning and 152 GTEPS with pruning; new results on SGI UV 300, which has 32 CPU sockets and 16 TB memory, follow on the next slides
28. Outline
• Introduction
– Graph analysis for large-scale networks
– Graph500 benchmark and Breadth-first search
– NUMA-aware computation
• Our proposal: Pruning of remote edge traversals
• Numerical results on SGI UV systems
29. Weak scaling performance
• UV 300 clearly outperforms UV 2000.

GTEPS vs. number of sockets (weak scaling; SCALE is per socket):

Sockets:                                  1     2     4     8     16     32     64
UV2000 (SCALE 27)                        8.3  14.2  25.1  38.6   61.5   91.8  152.2
UV300 (HT, Remote-mode, SCALE 27)       16.3  29.2  53.5  83.9  129.4  171.0
UV300 (HT, Remote-mode, SCALE 29)       15.9  28.0  57.9  93.5  151.4  209.3
UV300 (HT and THP, Remote-mode, SCALE 29)            91.4 147.6  203.7
UV300 (HT and THP, Local-mode, SCALE 29) 18.7  32.5  64.7 100.3  161.5  219.4

• The 32-socket configurations are compared in detail on the next slide.
30. Breakdown of system configuration on UV 300
• UV 300 is about 2× faster than UV 2000 with the same number of sockets (32) and #threads per socket = #logical cores
• The best UV 300 performance is obtained with: a larger problem size, THP (transparent huge pages) enabled, memory reference mode set to local-mode, and HT (hyper-threading) enabled

System  #sockets  SCALE  HT   THP  Mem-ref-mode  GTEPS
UV2000  32        32     −    −    −              92
UV300   32        32     on   −    Remote        171
UV300   32        34     on   −    Remote        204   (+19.3% by using a larger memory space)
UV300   32        34     on   on   Remote        209   (+2.5% by THP enabled)
UV300   32        34     *1   on   Local         188
UV300   32        34     on   on   Local         219   (+4.8% by memory reference mode; +16.5% over the *1 row by HT enabled)

*1: uses the same number of threads as physical cores, emulating "hyper-threading disabled".
(The weak-scaling plot from the previous slide is repeated here; at 32 sockets there is a 28.3% performance gap between the best configuration, 219.4 GTEPS, and the SCALE 27 remote-mode configuration, 171.0 GTEPS.)
32. Bandwidth and TEPS
• Bandwidth and TEPS of our implementations on three Xeon-based systems:
– UV300: 32-socket Haswell (HT, THP, Local-mode): this paper
– UV2000: 64-socket Ivy Bridge: this paper and [BD13]
– SB4: 4-socket Sandy Bridge-EP (HT, THP): this paper and [BD13]
• GB/s: STREAM TRIAD with 10 M elements per socket, via a modified implementation using ULIBC in which each thread computes the partial TRIAD operation for vectors on local memory only (subsection 3.2 of the paper)
• TEPS: SCALE 27 (n = 134M, m = 2.15B) per socket
• Memory bandwidth and GTEPS are correlated on the three Xeon processors: the optimized Graph500 implementations (this paper and our previous one) scale like the memory bandwidth, whereas the Graph500 reference code is not scalable and cannot exploit the NUMA system efficiently
[Figure: log-log plot of GTEPS vs. memory bandwidth (GB/s) for the five system/implementation combinations.]
33. Conclusion
• Motivations
– NUMA / cc-NUMA architectures and the BFS graph algorithm
– An efficient NUMA-aware BFS algorithm improves the locality of memory accesses and exploits multithreading on many-socket systems (SGI UV 2000, UV 300)
• Contributions
– A NUMA-aware scalable BFS algorithm that prunes edge traversals to reduce remote edges
– Scales to more than a thousand threads on SGI UV 2000 and SGI UV 300
– Updated the highest single-node Graph500 score to 219 GTEPS on SGI UV 300 with 32 sockets
– "ULIBC", a callable library for NUMA-aware computation, available at https://bitbucket.org/yuichiro_yasui/ulibc
[Figure: Graph500 pipeline and many-socket system, as on slide 4.]
34. References
• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, IEEE ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM, IEEE, IFIP, 2015.
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of Optimization in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.
[BD13], [ISC14], and [HPCS15] cover our NUMA-aware BFS algorithm; [GraphCREST2015] covers other results of our Graph500 team.