SlideShare a Scribd company logo
1 of 24
Download to read offline
Deploying a Task-based Runtime System on Raspberry
Pi Clusters
Patrick Diehl and Steven R. Brandt
University of Louisiana
{pdiehl,sbrandt}@cct.lsu.edu
Extreme Scale Programming Models and Middleware (ESPM2’20)
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 1 / 23
Motivation
Arm®
-Technology is emerging
in supercomputers and data
centers, e.g. Fugaku the fastest
super computer in the Top500.
The low power consumption.
The low costs of a Raspberry Pi
and building a small cluster.
One cluster of 4 nodes costs
around $200.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 2 / 23
Outline
1 Tools
2 Hardware and Software
3 Benchmarks
4 Results
Memory
Computation time
Energy consumption
5 Conclusion & Outlook
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 3 / 23
Tools
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 4 / 23
HPX
The C++ Standard Library for Concurrency and Parallelism
Thread Scheduling ActiveGlobal AddressSpace
Parcel Transport Layer Local Control Objects
PerformanceMonitoring
Operating System
HPX Application
HPX’s lightweight user threads reduce context switching overhead
Active Global Address Space (AGAS) makes a unified view of the
application
Overlapping communication and computation in the Parcel Layer
Reference
Kaiser et al., (2020). HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software,
5(53), 2352, https://doi.org/10.21105/joss.02352
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 5 / 23
Phylanx
An Asynchronous Distributed Array Computing Toolkit
Run Python code within the HPX runtime system in parallel.
Python is a common used language in machine and deep learning.
Reference
Tohid, R., et al. ”Asynchronous execution of python code on task-based runtime systems.” 2018 IEEE/ACM 4th International
Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2018.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 6 / 23
Hardware and Software
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 7 / 23
Raspberry PI cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B
Micro-architecture Arm® v8-A Arm® v8-A Arm® v8
Processor Model Cortex-A53 Cortex-A53 Cortex-A72
Number of CPUs 1 1 1
Cores per CPU 4 4 4
Total Cores 4 4 4
Frequency 1.2GHz 1.4GHz 1.5GHz
Memory 1GB 1GB 4GB
1
https://bitbucket.org/blaze-lib/blaze
2
https://github.com/STEllAR-GROUP/hpx
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 8 / 23
Raspberry PI cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B
Micro-architecture Arm® v8-A Arm® v8-A Arm® v8
Processor Model Cortex-A53 Cortex-A53 Cortex-A72
Number of CPUs 1 1 1
Cores per CPU 4 4 4
Total Cores 4 4 4
Frequency 1.2GHz 1.4GHz 1.5GHz
Memory 1GB 1GB 4GB
Table: Overview of the compilers, software, and operating system used.
Operating System Ubuntu 20.04 LTS Kernel 5.4
for Arm® blaze1
75179e6
Compilers gcc 9.30.1 boost 1.71
hwloc 2.1.0 gperftools 2.7
lapack 3.8 HPX2
5b9de48ab1
1
https://bitbucket.org/blaze-lib/blaze
2
https://github.com/STEllAR-GROUP/hpx
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 8 / 23
Benchmarks
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 9 / 23
2D Jacobi Solver (Shared memory)
2D Stencil based on the Jacobi method using
Standard grid layout for GCC’s autovecorize (−03)
Virtual Node Scheme1
for explicit vectorization
Roofline model2 to predict the optimal performance
Poptimal = Memory Bandwidth × AI
with arithmetic intensity (AI) is given by 1/24 for double precision and
1/12 for single precision.
References
1 P. Boyle, A. Yamaguchi, G. Cossu, and A. Portelli, “Grid: A next generation data
parallel c++ qcd library,” arXiv preprint arXiv:1512.03487, 2015
2 S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick, “The roofline model:
A pedagogical tool for auto-tuning kernels on multicore architectures,” in Hot
Chips, vol. 20, 2008, pp. 24–26.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 10 / 23
1D Heat equation solver (Distributed memory)
Parameter: heat transfer coefficient k = 0.5, time step dt = 1, and
grid spacing dx = 1.
2 1 0 1 2
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
0
20
40
60
80
u
Figure: Initial conditions
2 1 0 1 2
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
20
30
40
50
60
70
80
u
Figure: Solution
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 11 / 23
Results
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 12 / 23
STREAM TRIAD Benchmark
0
1000
2000
3000
4000
5000
6000
1 2 3 4
Memory
Bandwidth
(in
MBps)
Core count
STREAM TRIAD Results
Rpi 3B
Rpi 3B+
Rpi 4
Figure: Memory Bandwidth results using
the STREAM TRIAD Benchmark with
an array size of 10M elements
Pi 3B/3B+ have very low
memory bandwidth (MB)
One single processor unit (PU)
already saturated
Pi 4 same behavior, but double
MB
Conclusion: Memory bus can
only handle a certain amount of
MB and concurrency at the
same time.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 13 / 23
2D stencil (Raspberry Pi 4)
2D Stencil: Raspberry Pi 4
200
300
400
500
600
700
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
100
150
200
250
300
350
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 4): Grid size of 4096×4096 iterated over a 100
time steps.
Conclusion: Best performance on 2 cores.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 14 / 23
2D stencil (Raspberry Pi 3B+)
2D Stencil: Raspberry Pi 3B+
100
150
200
250
300
350
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
40
60
80
100
120
140
160
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 3B+): Grid size of 4096×4096 iterated over a
100 time steps.
Conclusion: We can not achieve the expected peak performance.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 15 / 23
2D stencil (Raspberry Pi 3B)
2D Stencil: Raspberry Pi 3B
100
150
200
250
300
350
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
40
60
80
100
120
140
160
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 3B): Grid size of 4096×4096 iterated over a 100
time steps.
Conclusion: PI 3B and 3B+ have similar performance, because these two
model differ only in the clock speeds.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 16 / 23
Multi-Node Benchmark
0
20
40
60
80
100
120
140
1 2 3 4
Execution
Time
(s)
Node count
Raspberry Pi 3B
0
20
40
60
80
100
120
140
1 2 3 4
Node count
Raspberry Pi 3B+
0
20
40
60
80
100
120
140
160
1 2 3 4
Node count
Raspberry Pi 4
Np=30M, Nt=100
Np=60M, Nt=100
Np=60M, Nt=500
Figure: Execution time in seconds for various node counts using all 4 threads.
Conclusion: Multi-node codes can scale well.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 17 / 23
Multi-Node Benchmark
0
10
20
30
40
50
60
70
1 2 3 4
Execution
Time
(s)
Node count
Raspberry Pi 3B
0
10
20
30
40
50
60
1 2 3 4
Node count
Raspberry Pi 3B+
0
2
4
6
8
10
12
14
16
1 2 3 4
Node count
Raspberry Pi 4
1 thread/node
2 threads/node
3 threads/node
4 threads/node
Figure: Execution time in seconds for various node counts using various threads.
Conclusion: Threads provide little performance gain, and actually hurt on
the 3 and 3+.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 18 / 23
Phylanx: ALS Benchmark
3
4
5
6
7
8
9
10
1 2 3 4
Execution
Time
(s)
Core count
Performance of the ALS Benchmark
Rpi 4
Rpi 3B
Rpi 3B+
Figure: Execution time in seconds for various core counts.
Conclusion: Our ALS code is fastest on 2 cores, but probably needs more
development.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 19 / 23
Cost wrt Power Consumption
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Rpi3 Rpi3p Rpi4
Cost
(in
1e-5
US
¢)
Model
1D stencil: Cost wrt Power Consumption
Figure: Cost with respect to power
consumption for the 1D stencil code
using 30 million stencil points per
iteration and a total of 100 iterations.
0
2
4
6
8
10
12
Rpi3 Rpi3p Rpi4
Cost
(in
1e-5
US
¢)
Model
ALS: Cost wrt Power Consumption
Figure: Cost with respect to power
consumption for the Alternating Least
Square (ALS) benchmark in Phylanx for
the MovieLens 20m database.
The power consumption for all models was obtained using the Linux
command stress3 for all four cores.
3
https://linux.die.net/man/1/stress
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 20 / 23
Conclusion & Outlook
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 21 / 23
Conclusion & Outlook
Conclusion
Limited memory bandwith limits the utilization of all cores.
Ubuntu Server 2020 supports 32-bit and 64-bit.
Frequency setting of “performance” used instead of “ondemand.”
The cluster provides modest performance at a reasonable cost.
Outlook
Use the small and affordable cluster for teaching parallel and
distributing computing.
The interface to attach sensors could be used in field studies to
collect data and Phylanx could process the data before uploading to
more powerful devices to do the analysis.
Use a larger cluster and more sophisticated Arm®
hardware.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 22 / 23
This work is licensed under a Creative
Commons “Attribution-NonCommercial-
NoDerivatives 4.0 International” license.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 23 / 23

More Related Content

What's hot

Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757
Abe Arredondo
 

What's hot (20)

Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
[Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network [Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network
 
Cognitive Engine: Boosting Scientific Discovery
Cognitive Engine:  Boosting Scientific DiscoveryCognitive Engine:  Boosting Scientific Discovery
Cognitive Engine: Boosting Scientific Discovery
 
Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
20 26
20 26 20 26
20 26
 
Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discovery
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 
Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...Combining density functional theory calculations, supercomputing, and data-dr...
Combining density functional theory calculations, supercomputing, and data-dr...
 
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV DataThe DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
The OptIPuter as a Prototype for CalREN-XD
The OptIPuter as a Prototype for CalREN-XDThe OptIPuter as a Prototype for CalREN-XD
The OptIPuter as a Prototype for CalREN-XD
 
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
[CVPRW 2020]Real world Super-Resolution via Kernel Estimation and Noise Injec...
 
TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757
 
Science and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated EraScience and Cyberinfrastructure in the Data-Dominated Era
Science and Cyberinfrastructure in the Data-Dominated Era
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Core Objective 1: Highlights from the Central Data Resource
Core Objective 1: Highlights from the Central Data ResourceCore Objective 1: Highlights from the Central Data Resource
Core Objective 1: Highlights from the Central Data Resource
 

Similar to Deploying a Task-based Runtime System on Raspberry Pi Clusters

Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer FugakuSimulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Patrick Diehl
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
Sri Prasanna
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
inside-BigData.com
 
Cwf96 (1)
Cwf96 (1)Cwf96 (1)
Cwf96 (1)
manoj g
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010
TELECOM I+D
 
Data-Centric Parallel Programming
Data-Centric Parallel ProgrammingData-Centric Parallel Programming
Data-Centric Parallel Programming
inside-BigData.com
 
D I G I T A L I C A P P L I C A T I O N S J N T U M O D E L P A P E R{Www
D I G I T A L  I C  A P P L I C A T I O N S  J N T U  M O D E L  P A P E R{WwwD I G I T A L  I C  A P P L I C A T I O N S  J N T U  M O D E L  P A P E R{Www
D I G I T A L I C A P P L I C A T I O N S J N T U M O D E L P A P E R{Www
guest3f9c6b
 

Similar to Deploying a Task-based Runtime System on Raspberry Pi Clusters (20)

Evaluating HPX and Kokkos on RISC-V Using an Astrophysics Application Octo-Tiger
Evaluating HPX and Kokkos on RISC-V Using an Astrophysics Application Octo-TigerEvaluating HPX and Kokkos on RISC-V Using an Astrophysics Application Octo-Tiger
Evaluating HPX and Kokkos on RISC-V Using an Astrophysics Application Octo-Tiger
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Nikravesh big datafeb2013bt
Nikravesh big datafeb2013btNikravesh big datafeb2013bt
Nikravesh big datafeb2013bt
 
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer FugakuSimulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
 
Parallelism Processor Design
Parallelism Processor DesignParallelism Processor Design
Parallelism Processor Design
 
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AIArm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
Arm A64fx and Post-K: Game-Changing CPU & Supercomputer for HPC, Big Data, & AI
 
Distributed Computing for Everyone
Distributed Computing for EveryoneDistributed Computing for Everyone
Distributed Computing for Everyone
 
Cwf96 (1)
Cwf96 (1)Cwf96 (1)
Cwf96 (1)
 
Valladolid final-septiembre-2010
Valladolid final-septiembre-2010Valladolid final-septiembre-2010
Valladolid final-septiembre-2010
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
The Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPCThe Coming Age of Extreme Heterogeneity in HPC
The Coming Age of Extreme Heterogeneity in HPC
 
Data-Centric Parallel Programming
Data-Centric Parallel ProgrammingData-Centric Parallel Programming
Data-Centric Parallel Programming
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Toward Greener Cyberinfrastructure
Toward Greener CyberinfrastructureToward Greener Cyberinfrastructure
Toward Greener Cyberinfrastructure
 
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...
Augmenting Amdahl's Second Law for Cost-Effective and Balanced HPC Infrastruc...
 
Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HP...
Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HP...Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HP...
Benchmarking the Parallel 1D Heat Equation Solver in Chapel, Charm++, C++, HP...
 
Parallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.pptParallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.ppt
 
Digital Ic Applications Jntu Model Paper{Www.Studentyogi.Com}
Digital Ic Applications Jntu Model Paper{Www.Studentyogi.Com}Digital Ic Applications Jntu Model Paper{Www.Studentyogi.Com}
Digital Ic Applications Jntu Model Paper{Www.Studentyogi.Com}
 
D I G I T A L I C A P P L I C A T I O N S J N T U M O D E L P A P E R{Www
D I G I T A L  I C  A P P L I C A T I O N S  J N T U  M O D E L  P A P E R{WwwD I G I T A L  I C  A P P L I C A T I O N S  J N T U  M O D E L  P A P E R{Www
D I G I T A L I C A P P L I C A T I O N S J N T U M O D E L P A P E R{Www
 
DOME 64-bit μDataCenter
DOME 64-bit μDataCenterDOME 64-bit μDataCenter
DOME 64-bit μDataCenter
 

More from Patrick Diehl

Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in FortranFramework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran
Patrick Diehl
 
A tale of two approaches for coupling nonlocal and local models
A tale of two approaches for coupling nonlocal and local modelsA tale of two approaches for coupling nonlocal and local models
A tale of two approaches for coupling nonlocal and local models
Patrick Diehl
 
Interactive C++ code development using C++Explorer and GitHub Classroom for e...
Interactive C++ code development using C++Explorer and GitHub Classroom for e...Interactive C++ code development using C++Explorer and GitHub Classroom for e...
Interactive C++ code development using C++Explorer and GitHub Classroom for e...
Patrick Diehl
 
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Patrick Diehl
 
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...
Patrick Diehl
 

More from Patrick Diehl (15)

Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger
Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-TigerEvaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger
Evaluating HPX and Kokkos on RISC-V using an Astrophysics Application Octo-Tiger
 
D-HPC Workshop Panel : S4PST: Stewardship of Programming Systems and Tools
D-HPC Workshop Panel : S4PST: Stewardship of Programming Systems and ToolsD-HPC Workshop Panel : S4PST: Stewardship of Programming Systems and Tools
D-HPC Workshop Panel : S4PST: Stewardship of Programming Systems and Tools
 
Subtle Asynchrony by Jeff Hammond
Subtle Asynchrony by Jeff HammondSubtle Asynchrony by Jeff Hammond
Subtle Asynchrony by Jeff Hammond
 
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in FortranFramework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran
Framework for Extensible, Asynchronous Task Scheduling (FEATS) in Fortran
 
JOSS and FLOSS for science: Examples for promoting open source software and s...
JOSS and FLOSS for science: Examples for promoting open source software and s...JOSS and FLOSS for science: Examples for promoting open source software and s...
JOSS and FLOSS for science: Examples for promoting open source software and s...
 
A tale of two approaches for coupling nonlocal and local models
A tale of two approaches for coupling nonlocal and local modelsA tale of two approaches for coupling nonlocal and local models
A tale of two approaches for coupling nonlocal and local models
 
Recent developments in HPX and Octo-Tiger
Recent developments in HPX and Octo-TigerRecent developments in HPX and Octo-Tiger
Recent developments in HPX and Octo-Tiger
 
Challenges for coupling approaches for classical linear elasticity and bond-b...
Challenges for coupling approaches for classical linear elasticity and bond-b...Challenges for coupling approaches for classical linear elasticity and bond-b...
Challenges for coupling approaches for classical linear elasticity and bond-b...
 
Quantifying Overheads in Charm++ and HPX using Task Bench
Quantifying Overheads in Charm++ and HPX using Task BenchQuantifying Overheads in Charm++ and HPX using Task Bench
Quantifying Overheads in Charm++ and HPX using Task Bench
 
Interactive C++ code development using C++Explorer and GitHub Classroom for e...
Interactive C++ code development using C++Explorer and GitHub Classroom for e...Interactive C++ code development using C++Explorer and GitHub Classroom for e...
Interactive C++ code development using C++Explorer and GitHub Classroom for e...
 
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
Quasistatic Fracture using Nonliner-Nonlocal Elastostatics with an Analytic T...
 
A review of benchmark experiments for the validation of peridynamics models
A review of benchmark experiments for the validation of peridynamics modelsA review of benchmark experiments for the validation of peridynamics models
A review of benchmark experiments for the validation of peridynamics models
 
On the treatment of boundary conditions for bond-based peridynamic models
On the treatment of boundary conditions for bond-based peridynamic modelsOn the treatment of boundary conditions for bond-based peridynamic models
On the treatment of boundary conditions for bond-based peridynamic models
 
EMI 2021 - A comparative review of peridynamics and phase-field models for en...
EMI 2021 - A comparative review of peridynamics and phase-field models for en...EMI 2021 - A comparative review of peridynamics and phase-field models for en...
EMI 2021 - A comparative review of peridynamics and phase-field models for en...
 
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...
Google Summer of Code mentor summit 2020 - Session 2 - Open Science and Open ...
 

Recently uploaded

Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
University of Hertfordshire
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Sérgio Sacani
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
Sérgio Sacani
 

Recently uploaded (20)

Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 

Deploying a Task-based Runtime System on Raspberry Pi Clusters

  • 1. Deploying a Task-based Runtime System on Raspberry Pi Clusters Patrick Diehl and Steven R. Brandt University of Louisiana {pdiehl,sbrandt}@cct.lsu.edu Extreme Scale Programming Models and Middleware (ESPM2’20) Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 1 / 23
  • 2. Motivation Arm® -Technology is emerging in supercomputers and data centers, e.g. Fugaku the fastest super computer in the Top500. The low power consumption. The low costs of a Raspberry Pi and building a small cluster. One cluster of 4 nodes costs around $200. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 2 / 23
  • 3. Outline 1 Tools 2 Hardware and Software 3 Benchmarks 4 Results Memory Computation time Energy consumption 5 Conclusion & Outlook Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 3 / 23
  • 4. Tools Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 4 / 23
  • 5. HPX The C++ Standard Library for Concurrency and Parallelism Thread Scheduling ActiveGlobal AddressSpace Parcel Transport Layer Local Control Objects PerformanceMonitoring Operating System HPX Application HPX’s lightweight user threads reduce context switching overhead Active Global Address Space (AGAS) makes a unified view of the application Overlapping communication and computation in the Parcel Layer Reference Kaiser et al., (2020). HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software, 5(53), 2352, https://doi.org/10.21105/joss.02352 Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 5 / 23
  • 6. Phylanx An Asynchronous Distributed Array Computing Toolkit Run Python code within the HPX runtime system in parallel. Python is a common used language in machine and deep learning. Reference Tohid, R., et al. ”Asynchronous execution of python code on task-based runtime systems.” 2018 IEEE/ACM 4th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2018. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 6 / 23
  • 7. Hardware and Software Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 7 / 23
  • 8. Raspberry PI cluster & Software Table: Specification/Architecture of the three nodes utilised in the benchmarks. Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B Micro-architecture Arm® v8-A Arm® v8-A Arm® v8 Processor Model Cortex-A53 Cortex-A53 Cortex-A72 Number of CPUs 1 1 1 Cores per CPU 4 4 4 Total Cores 4 4 4 Frequency 1.2GHz 1.4GHz 1.5GHz Memory 1GB 1GB 4GB 1 https://bitbucket.org/blaze-lib/blaze 2 https://github.com/STEllAR-GROUP/hpx Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 8 / 23
  • 9. Raspberry PI cluster & Software Table: Specification/Architecture of the three nodes utilised in the benchmarks. Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B Micro-architecture Arm® v8-A Arm® v8-A Arm® v8 Processor Model Cortex-A53 Cortex-A53 Cortex-A72 Number of CPUs 1 1 1 Cores per CPU 4 4 4 Total Cores 4 4 4 Frequency 1.2GHz 1.4GHz 1.5GHz Memory 1GB 1GB 4GB Table: Overview of the compilers, software, and operating system used. Operating System Ubuntu 20.04 LTS Kernel 5.4 for Arm® blaze1 75179e6 Compilers gcc 9.30.1 boost 1.71 hwloc 2.1.0 gperftools 2.7 lapack 3.8 HPX2 5b9de48ab1 1 https://bitbucket.org/blaze-lib/blaze 2 https://github.com/STEllAR-GROUP/hpx Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 8 / 23
  • 10. Benchmarks Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 9 / 23
  • 11. 2D Jacobi Solver (Shared memory) 2D Stencil based on the Jacobi method using Standard grid layout for GCC’s autovecorize (−03) Virtual Node Scheme1 for explicit vectorization Roofline model2 to predict the optimal performance Poptimal = Memory Bandwidth × AI with arithmetic intensity (AI) is given by 1/24 for double precision and 1/12 for single precision. References 1 P. Boyle, A. Yamaguchi, G. Cossu, and A. Portelli, “Grid: A next generation data parallel c++ qcd library,” arXiv preprint arXiv:1512.03487, 2015 2 S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick, “The roofline model: A pedagogical tool for auto-tuning kernels on multicore architectures,” in Hot Chips, vol. 20, 2008, pp. 24–26. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 10 / 23
  • 12. 1D Heat equation solver (Distributed memory) Parameter: heat transfer coefficient k = 0.5, time step dt = 1, and grid spacing dx = 1. 2 1 0 1 2 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 0 20 40 60 80 u Figure: Initial conditions 2 1 0 1 2 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 20 30 40 50 60 70 80 u Figure: Solution Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 11 / 23
  • 13. Results Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 12 / 23
  • 14. STREAM TRIAD Benchmark 0 1000 2000 3000 4000 5000 6000 1 2 3 4 Memory Bandwidth (in MBps) Core count STREAM TRIAD Results Rpi 3B Rpi 3B+ Rpi 4 Figure: Memory Bandwidth results using the STREAM TRIAD Benchmark with an array size of 10M elements Pi 3B/3B+ have very low memory bandwidth (MB) One single processor unit (PU) already saturated Pi 4 same behavior, but double MB Conclusion: Memory bus can only handle a certain amount of MB and concurrency at the same time. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 13 / 23
  • 15. 2D stencil (Raspberry Pi 4) 2D Stencil: Raspberry Pi 4 200 300 400 500 600 700 1 2 3 4 Performance (in MLUPs/s) Core count Single Precision 100 150 200 250 300 350 1 2 3 4 Core count Double Precision scalar (auto) vector (explicit) Expected Peak Figure: 2D stencil (Raspberry Pi 4): Grid size of 4096×4096 iterated over a 100 time steps. Conclusion: Best performance on 2 cores. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 14 / 23
  • 16. 2D stencil (Raspberry Pi 3B+) 2D Stencil: Raspberry Pi 3B+ 100 150 200 250 300 350 1 2 3 4 Performance (in MLUPs/s) Core count Single Precision 40 60 80 100 120 140 160 1 2 3 4 Core count Double Precision scalar (auto) vector (explicit) Expected Peak Figure: 2D stencil (Raspberry Pi 3B+): Grid size of 4096×4096 iterated over a 100 time steps. Conclusion: We can not achieve the expected peak performance. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 15 / 23
  • 17. 2D stencil (Raspberry Pi 3B) 2D Stencil: Raspberry Pi 3B 100 150 200 250 300 350 1 2 3 4 Performance (in MLUPs/s) Core count Single Precision 40 60 80 100 120 140 160 1 2 3 4 Core count Double Precision scalar (auto) vector (explicit) Expected Peak Figure: 2D stencil (Raspberry Pi 3B): Grid size of 4096×4096 iterated over a 100 time steps. Conclusion: PI 3B and 3B+ have similar performance, because these two model differ only in the clock speeds. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 16 / 23
  • 18. Multi-Node Benchmark 0 20 40 60 80 100 120 140 1 2 3 4 Execution Time (s) Node count Raspberry Pi 3B 0 20 40 60 80 100 120 140 1 2 3 4 Node count Raspberry Pi 3B+ 0 20 40 60 80 100 120 140 160 1 2 3 4 Node count Raspberry Pi 4 Np=30M, Nt=100 Np=60M, Nt=100 Np=60M, Nt=500 Figure: Execution time in seconds for various node counts using all 4 threads. Conclusion: Multi-node codes can scale well. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 17 / 23
  • 19. Multi-Node Benchmark 0 10 20 30 40 50 60 70 1 2 3 4 Execution Time (s) Node count Raspberry Pi 3B 0 10 20 30 40 50 60 1 2 3 4 Node count Raspberry Pi 3B+ 0 2 4 6 8 10 12 14 16 1 2 3 4 Node count Raspberry Pi 4 1 thread/node 2 threads/node 3 threads/node 4 threads/node Figure: Execution time in seconds for various node counts using various threads. Conclusion: Threads provide little performance gain, and actually hurt on the 3 and 3+. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 18 / 23
  • 20. Phylanx: ALS Benchmark 3 4 5 6 7 8 9 10 1 2 3 4 Execution Time (s) Core count Performance of the ALS Benchmark Rpi 4 Rpi 3B Rpi 3B+ Figure: Execution time in seconds for various core counts. Conclusion: Our ALS code is fastest on 2 cores, but probably needs more development. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 19 / 23
  • 21. Cost wrt Power Consumption 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Rpi3 Rpi3p Rpi4 Cost (in 1e-5 US ¢) Model 1D stencil: Cost wrt Power Consumption Figure: Cost with respect to power consumption for the 1D stencil code using 30 million stencil points per iteration and a total of 100 iterations. 0 2 4 6 8 10 12 Rpi3 Rpi3p Rpi4 Cost (in 1e-5 US ¢) Model ALS: Cost wrt Power Consumption Figure: Cost with respect to power consumption for the Alternating Least Square (ALS) benchmark in Phylanx for the MovieLens 20m database. The power consumption for all models was obtained using the Linux command stress3 for all four cores. 3 https://linux.die.net/man/1/stress Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 20 / 23
  • 22. Conclusion & Outlook Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 21 / 23
  • 23. Conclusion & Outlook Conclusion Limited memory bandwith limits the utilization of all cores. Ubuntu Server 2020 supports 32-bit and 64-bit. Frequency setting of “performance” used instead of “ondemand.” The cluster provides modest performance at a reasonable cost. Outlook Use the small and affordable cluster for teaching parallel and distributing computing. The interface to attach sensors could be used in field studies to collect data and Phylanx could process the data before uploading to more powerful devices to do the analysis. Use a larger cluster and more sophisticated Arm® hardware. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 22 / 23
  • 24. This work is licensed under a Creative Commons “Attribution-NonCommercial- NoDerivatives 4.0 International” license. Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 23 / 23