❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
Deploying a Task-based Runtime System on Raspberry Pi Clusters
1. Deploying a Task-based Runtime System on Raspberry
Pi Clusters
Patrick Diehl and Steven R. Brandt
University of Louisiana
{pdiehl,sbrandt}@cct.lsu.edu
Extreme Scale Programming Models and Middleware (ESPM2’20)
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 1 / 23
2. Motivation
Arm®
-Technology is emerging
in supercomputers and data
centers, e.g. Fugaku the fastest
super computer in the Top500.
The low power consumption.
The low costs of a Raspberry Pi
and building a small cluster.
One cluster of 4 nodes costs
around $200.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 2 / 23
3. Outline
1 Tools
2 Hardware and Software
3 Benchmarks
4 Results
Memory
Computation time
Energy consumption
5 Conclusion & Outlook
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 3 / 23
5. HPX
The C++ Standard Library for Concurrency and Parallelism
Thread Scheduling ActiveGlobal AddressSpace
Parcel Transport Layer Local Control Objects
PerformanceMonitoring
Operating System
HPX Application
HPX’s lightweight user threads reduce context switching overhead
Active Global Address Space (AGAS) makes a unified view of the
application
Overlapping communication and computation in the Parcel Layer
Reference
Kaiser et al., (2020). HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software,
5(53), 2352, https://doi.org/10.21105/joss.02352
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 5 / 23
6. Phylanx
An Asynchronous Distributed Array Computing Toolkit
Run Python code within the HPX runtime system in parallel.
Python is a common used language in machine and deep learning.
Reference
Tohid, R., et al. ”Asynchronous execution of python code on task-based runtime systems.” 2018 IEEE/ACM 4th International
Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2018.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 6 / 23
8. Raspberry PI cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B
Micro-architecture Arm® v8-A Arm® v8-A Arm® v8
Processor Model Cortex-A53 Cortex-A53 Cortex-A72
Number of CPUs 1 1 1
Cores per CPU 4 4 4
Total Cores 4 4 4
Frequency 1.2GHz 1.4GHz 1.5GHz
Memory 1GB 1GB 4GB
1
https://bitbucket.org/blaze-lib/blaze
2
https://github.com/STEllAR-GROUP/hpx
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 8 / 23
9. Raspberry PI cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B
Micro-architecture Arm® v8-A Arm® v8-A Arm® v8
Processor Model Cortex-A53 Cortex-A53 Cortex-A72
Number of CPUs 1 1 1
Cores per CPU 4 4 4
Total Cores 4 4 4
Frequency 1.2GHz 1.4GHz 1.5GHz
Memory 1GB 1GB 4GB
Table: Overview of the compilers, software, and operating system used.
Operating System Ubuntu 20.04 LTS Kernel 5.4
for Arm® blaze1
75179e6
Compilers gcc 9.30.1 boost 1.71
hwloc 2.1.0 gperftools 2.7
lapack 3.8 HPX2
5b9de48ab1
1
https://bitbucket.org/blaze-lib/blaze
2
https://github.com/STEllAR-GROUP/hpx
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 8 / 23
11. 2D Jacobi Solver (Shared memory)
2D Stencil based on the Jacobi method using
Standard grid layout for GCC’s autovecorize (−03)
Virtual Node Scheme1
for explicit vectorization
Roofline model2 to predict the optimal performance
Poptimal = Memory Bandwidth × AI
with arithmetic intensity (AI) is given by 1/24 for double precision and
1/12 for single precision.
References
1 P. Boyle, A. Yamaguchi, G. Cossu, and A. Portelli, “Grid: A next generation data
parallel c++ qcd library,” arXiv preprint arXiv:1512.03487, 2015
2 S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick, “The roofline model:
A pedagogical tool for auto-tuning kernels on multicore architectures,” in Hot
Chips, vol. 20, 2008, pp. 24–26.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 10 / 23
12. 1D Heat equation solver (Distributed memory)
Parameter: heat transfer coefficient k = 0.5, time step dt = 1, and
grid spacing dx = 1.
2 1 0 1 2
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
0
20
40
60
80
u
Figure: Initial conditions
2 1 0 1 2
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
20
30
40
50
60
70
80
u
Figure: Solution
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 11 / 23
14. STREAM TRIAD Benchmark
0
1000
2000
3000
4000
5000
6000
1 2 3 4
Memory
Bandwidth
(in
MBps)
Core count
STREAM TRIAD Results
Rpi 3B
Rpi 3B+
Rpi 4
Figure: Memory Bandwidth results using
the STREAM TRIAD Benchmark with
an array size of 10M elements
Pi 3B/3B+ have very low
memory bandwidth (MB)
One single processor unit (PU)
already saturated
Pi 4 same behavior, but double
MB
Conclusion: Memory bus can
only handle a certain amount of
MB and concurrency at the
same time.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 13 / 23
15. 2D stencil (Raspberry Pi 4)
2D Stencil: Raspberry Pi 4
200
300
400
500
600
700
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
100
150
200
250
300
350
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 4): Grid size of 4096×4096 iterated over a 100
time steps.
Conclusion: Best performance on 2 cores.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 14 / 23
16. 2D stencil (Raspberry Pi 3B+)
2D Stencil: Raspberry Pi 3B+
100
150
200
250
300
350
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
40
60
80
100
120
140
160
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 3B+): Grid size of 4096×4096 iterated over a
100 time steps.
Conclusion: We can not achieve the expected peak performance.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 15 / 23
17. 2D stencil (Raspberry Pi 3B)
2D Stencil: Raspberry Pi 3B
100
150
200
250
300
350
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
40
60
80
100
120
140
160
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 3B): Grid size of 4096×4096 iterated over a 100
time steps.
Conclusion: PI 3B and 3B+ have similar performance, because these two
model differ only in the clock speeds.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 16 / 23
18. Multi-Node Benchmark
0
20
40
60
80
100
120
140
1 2 3 4
Execution
Time
(s)
Node count
Raspberry Pi 3B
0
20
40
60
80
100
120
140
1 2 3 4
Node count
Raspberry Pi 3B+
0
20
40
60
80
100
120
140
160
1 2 3 4
Node count
Raspberry Pi 4
Np=30M, Nt=100
Np=60M, Nt=100
Np=60M, Nt=500
Figure: Execution time in seconds for various node counts using all 4 threads.
Conclusion: Multi-node codes can scale well.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 17 / 23
19. Multi-Node Benchmark
0
10
20
30
40
50
60
70
1 2 3 4
Execution
Time
(s)
Node count
Raspberry Pi 3B
0
10
20
30
40
50
60
1 2 3 4
Node count
Raspberry Pi 3B+
0
2
4
6
8
10
12
14
16
1 2 3 4
Node count
Raspberry Pi 4
1 thread/node
2 threads/node
3 threads/node
4 threads/node
Figure: Execution time in seconds for various node counts using various threads.
Conclusion: Threads provide little performance gain, and actually hurt on
the 3 and 3+.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 18 / 23
20. Phylanx: ALS Benchmark
3
4
5
6
7
8
9
10
1 2 3 4
Execution
Time
(s)
Core count
Performance of the ALS Benchmark
Rpi 4
Rpi 3B
Rpi 3B+
Figure: Execution time in seconds for various core counts.
Conclusion: Our ALS code is fastest on 2 cores, but probably needs more
development.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 19 / 23
21. Cost wrt Power Consumption
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Rpi3 Rpi3p Rpi4
Cost
(in
1e-5
US
¢)
Model
1D stencil: Cost wrt Power Consumption
Figure: Cost with respect to power
consumption for the 1D stencil code
using 30 million stencil points per
iteration and a total of 100 iterations.
0
2
4
6
8
10
12
Rpi3 Rpi3p Rpi4
Cost
(in
1e-5
US
¢)
Model
ALS: Cost wrt Power Consumption
Figure: Cost with respect to power
consumption for the Alternating Least
Square (ALS) benchmark in Phylanx for
the MovieLens 20m database.
The power consumption for all models was obtained using the Linux
command stress3 for all four cores.
3
https://linux.die.net/man/1/stress
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 20 / 23
23. Conclusion & Outlook
Conclusion
Limited memory bandwith limits the utilization of all cores.
Ubuntu Server 2020 supports 32-bit and 64-bit.
Frequency setting of “performance” used instead of “ondemand.”
The cluster provides modest performance at a reasonable cost.
Outlook
Use the small and affordable cluster for teaching parallel and
distributing computing.
The interface to attach sensors could be used in field studies to
collect data and Phylanx could process the data before uploading to
more powerful devices to do the analysis.
Use a larger cluster and more sophisticated Arm®
hardware.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 22 / 23
24. This work is licensed under a Creative
Commons “Attribution-NonCommercial-
NoDerivatives 4.0 International” license.
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 23 / 23