Deploying a Task-based Runtime System on Raspberry Pi Clusters

Deploying a Task-based Runtime System on Raspberry
Pi Clusters
Patrick Diehl and Steven R. Brandt
University of Louisiana
{pdiehl,sbrandt}@cct.lsu.edu
Extreme Scale Programming Models and Middleware (ESPM2’20)
Patrick Diehl and Steven R. Brandt (LSU) AMT on Raspberry Pi July 23, 2021 1 / 23

Motivation
Arm®
-Technology is emerging
in supercomputers and data
centers, e.g. Fugaku the fastest
super computer in the Top500.
The low power consumption.
The low costs of a Raspberry Pi
and building a small cluster.
One cluster of 4 nodes costs
around $200.

Outline
1 Tools
2 Hardware and Software
3 Benchmarks
4 Results
Memory
Computation time
Energy consumption
5 Conclusion & Outlook

Tools

HPX
The C++ Standard Library for Concurrency and Parallelism
Thread Scheduling ActiveGlobal AddressSpace
Parcel Transport Layer Local Control Objects
PerformanceMonitoring
Operating System
HPX Application
HPX’s lightweight user threads reduce context switching overhead
Active Global Address Space (AGAS) makes a unified view of the
application
Overlapping communication and computation in the Parcel Layer
Reference
Kaiser et al., (2020). HPX - The C++ Standard Library for Parallelism and Concurrency. Journal of Open Source Software,
5(53), 2352, https://doi.org/10.21105/joss.02352

Phylanx
An Asynchronous Distributed Array Computing Toolkit
Run Python code within the HPX runtime system in parallel.
Python is a common used language in machine and deep learning.
Reference
Tohid, R., et al. ”Asynchronous execution of python code on task-based runtime systems.” 2018 IEEE/ACM 4th International
Workshop on Extreme Scale Programming Models and Middleware (ESPM2). IEEE, 2018.

Hardware and Software

Raspberry PI cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B
Micro-architecture Arm® v8-A Arm® v8-A Arm® v8
Processor Model Cortex-A53 Cortex-A53 Cortex-A72
Number of CPUs 1 1 1
Cores per CPU 4 4 4
Total Cores 4 4 4
Frequency 1.2GHz 1.4GHz 1.5GHz
Memory 1GB 1GB 4GB
1
https://bitbucket.org/blaze-lib/blaze
2
https://github.com/STEllAR-GROUP/hpx

Raspberry PI cluster & Software
Table: Specification/Architecture of the three nodes utilised in the benchmarks.
Model Raspberry Pi 3B Raspberry Pi 3B+ Raspberry Pi 4B
Micro-architecture Arm® v8-A Arm® v8-A Arm® v8
Processor Model Cortex-A53 Cortex-A53 Cortex-A72
Number of CPUs 1 1 1
Cores per CPU 4 4 4
Total Cores 4 4 4
Frequency 1.2GHz 1.4GHz 1.5GHz
Memory 1GB 1GB 4GB
Table: Overview of the compilers, software, and operating system used.
Operating System Ubuntu 20.04 LTS Kernel 5.4
for Arm® blaze1
75179e6
Compilers gcc 9.30.1 boost 1.71
hwloc 2.1.0 gperftools 2.7
lapack 3.8 HPX2
5b9de48ab1
1
https://bitbucket.org/blaze-lib/blaze
2
https://github.com/STEllAR-GROUP/hpx

Benchmarks

2D Jacobi Solver (Shared memory)
2D Stencil based on the Jacobi method using
Standard grid layout for GCC’s autovecorize (−03)
Virtual Node Scheme1
for explicit vectorization
Roofline model2 to predict the optimal performance
Poptimal = Memory Bandwidth × AI
with arithmetic intensity (AI) is given by 1/24 for double precision and
1/12 for single precision.
References
1 P. Boyle, A. Yamaguchi, G. Cossu, and A. Portelli, “Grid: A next generation data
parallel c++ qcd library,” arXiv preprint arXiv:1512.03487, 2015
2 S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick, “The roofline model:
A pedagogical tool for auto-tuning kernels on multicore architectures,” in Hot
Chips, vol. 20, 2008, pp. 24–26.

1D Heat equation solver (Distributed memory)
Parameter: heat transfer coefficient k = 0.5, time step dt = 1, and
grid spacing dx = 1.
2 1 0 1 2
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
0
20
40
60
80
u
Figure: Initial conditions
2 1 0 1 2
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
20
30
40
50
60
70
80
u
Figure: Solution

Results

STREAM TRIAD Benchmark
0
1000
2000
3000
4000
5000
6000
1 2 3 4
Memory
Bandwidth
(in
MBps)
Core count
STREAM TRIAD Results
Rpi 3B
Rpi 3B+
Rpi 4
Figure: Memory Bandwidth results using
the STREAM TRIAD Benchmark with
an array size of 10M elements
Pi 3B/3B+ have very low
memory bandwidth (MB)
One single processor unit (PU)
already saturated
Pi 4 same behavior, but double
MB
Conclusion: Memory bus can
only handle a certain amount of
MB and concurrency at the
same time.

2D stencil (Raspberry Pi 4)
2D Stencil: Raspberry Pi 4
200
300
400
500
600
700
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
100
150
200
250
300
350
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 4): Grid size of 4096×4096 iterated over a 100
time steps.
Conclusion: Best performance on 2 cores.

2D stencil (Raspberry Pi 3B+)
2D Stencil: Raspberry Pi 3B+
100
150
200
250
300
350
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
40
60
80
100
120
140
160
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 3B+): Grid size of 4096×4096 iterated over a
100 time steps.
Conclusion: We can not achieve the expected peak performance.

2D stencil (Raspberry Pi 3B)
2D Stencil: Raspberry Pi 3B
100
150
200
250
300
350
1 2 3 4
Performance
(in
MLUPs/s)
Core count
Single Precision
40
60
80
100
120
140
160
1 2 3 4
Core count
Double Precision
scalar (auto)
vector (explicit)
Expected Peak
Figure: 2D stencil (Raspberry Pi 3B): Grid size of 4096×4096 iterated over a 100
time steps.
Conclusion: PI 3B and 3B+ have similar performance, because these two
model differ only in the clock speeds.

Multi-Node Benchmark
0
20
40
60
80
100
120
140
1 2 3 4
Execution
Time
(s)
Node count
Raspberry Pi 3B
0
20
40
60
80
100
120
140
1 2 3 4
Node count
Raspberry Pi 3B+
0
20
40
60
80
100
120
140
160
1 2 3 4
Node count
Raspberry Pi 4
Np=30M, Nt=100
Np=60M, Nt=100
Np=60M, Nt=500
Figure: Execution time in seconds for various node counts using all 4 threads.
Conclusion: Multi-node codes can scale well.

Multi-Node Benchmark
0
10
20
30
40
50
60
70
1 2 3 4
Execution
Time
(s)
Node count
Raspberry Pi 3B
0
10
20
30
40
50
60
1 2 3 4
Node count
Raspberry Pi 3B+
0
2
4
6
8
10
12
14
16
1 2 3 4
Node count
Raspberry Pi 4
1 thread/node
2 threads/node
3 threads/node
4 threads/node
Figure: Execution time in seconds for various node counts using various threads.
Conclusion: Threads provide little performance gain, and actually hurt on
the 3 and 3+.

Phylanx: ALS Benchmark
3
4
5
6
7
8
9
10
1 2 3 4
Execution
Time
(s)
Core count
Performance of the ALS Benchmark
Rpi 4
Rpi 3B
Rpi 3B+
Figure: Execution time in seconds for various core counts.
Conclusion: Our ALS code is fastest on 2 cores, but probably needs more
development.

Cost wrt Power Consumption
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Rpi3 Rpi3p Rpi4
Cost
(in
1e-5
US
¢)
Model
1D stencil: Cost wrt Power Consumption
Figure: Cost with respect to power
consumption for the 1D stencil code
using 30 million stencil points per
iteration and a total of 100 iterations.
0
2
4
6
8
10
12
Rpi3 Rpi3p Rpi4
Cost
(in
1e-5
US
¢)
Model
ALS: Cost wrt Power Consumption
Figure: Cost with respect to power
consumption for the Alternating Least
Square (ALS) benchmark in Phylanx for
the MovieLens 20m database.
The power consumption for all models was obtained using the Linux
command stress3 for all four cores.
3
https://linux.die.net/man/1/stress

Conclusion & Outlook

Conclusion & Outlook
Conclusion
Limited memory bandwith limits the utilization of all cores.
Ubuntu Server 2020 supports 32-bit and 64-bit.
Frequency setting of “performance” used instead of “ondemand.”
The cluster provides modest performance at a reasonable cost.
Outlook
Use the small and affordable cluster for teaching parallel and
distributing computing.
The interface to attach sensors could be used in field studies to
collect data and Phylanx could process the data before uploading to
more powerful devices to do the analysis.
Use a larger cluster and more sophisticated Arm®
hardware.

This work is licensed under a Creative
Commons “Attribution-NonCommercial-
NoDerivatives 4.0 International” license.

Deploying a Task-based Runtime System on Raspberry Pi Clusters

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Deploying a Task-based Runtime System on Raspberry Pi Clusters

Similar to Deploying a Task-based Runtime System on Raspberry Pi Clusters (20)

More from Patrick Diehl

More from Patrick Diehl (15)

Recently uploaded

Recently uploaded (20)

Deploying a Task-based Runtime System on Raspberry Pi Clusters