Optimize Single Particle Orbital (SPO) Evaluations Based on B-splines – Intel® Software
Orbital representations based on B-splines are widely used in quantum Monte Carlo (QMC) simulations of solids, where their evaluation has historically taken as much as 50 percent of the total runtime. Random access to a large four-dimensional array makes it challenging to use the caches and wide vector units of modern CPUs efficiently. We therefore present node-level optimizations of B-spline evaluations on multicore and manycore shared-memory processors.
To increase single instruction, multiple data (SIMD) efficiency and bandwidth utilization, we first apply a data layout transformation from an array of structures (AoS) to a structure of arrays (SoA). Then, by blocking SoA objects, we optimize cache reuse and obtain sustained throughput over a range of problem sizes. We implement efficient nested threading in the B-spline orbital evaluation kernels, paving the way toward strong scaling of QMC simulations and yielding further performance improvements. Finally, we employ roofline performance analysis to model the impact of our optimizations.
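The AoS-to-SoA transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the layout idea only, not the paper's C++ B-spline kernels; the particle fields and the distance computation are made-up stand-ins:

```python
import numpy as np

n = 1024
rng = np.random.default_rng(0)
pos = rng.random((n, 3))

# AoS: one interleaved (x, y, z) record per element. Reading all x values
# strides through memory, wasting cache lines and SIMD lanes.
aos = np.zeros(n, dtype=[("x", "f8"), ("y", "f8"), ("z", "f8")])
aos["x"], aos["y"], aos["z"] = pos[:, 0], pos[:, 1], pos[:, 2]

# SoA: one contiguous array per field, so each field is a unit-stride
# stream that maps directly onto wide vector loads.
x, y, z = pos[:, 0].copy(), pos[:, 1].copy(), pos[:, 2].copy()

# Same computation (squared distance from the origin) on both layouts.
r2_aos = aos["x"] ** 2 + aos["y"] ** 2 + aos["z"] ** 2  # strided field gathers
r2_soa = x * x + y * y + z * z                          # three contiguous streams
```

The results are identical; the difference is purely in memory access pattern, which is what determines SIMD and bandwidth efficiency on real hardware.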
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors – Intel® Software
The second-generation Intel® Xeon Phi™ processor offers new and enhanced features that provide significant performance gains in modernized code. For this lab, we pair these features with Intel® Software Development Products and methodologies to enable developers to gain insights on application behavior and to find opportunities to optimize parallelism, memory, and vectorization features.
Use C++ and Intel® Threading Building Blocks (Intel® TBB) for Hardware Progra... – Intel® Software
In this presentation, we focus on an alternative approach that uses nodes containing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. Programming models and development tools are identical for these resources, greatly simplifying development. We discuss how the same models for vectorization and threading can be used across these compute resources to create software that performs well on all of them. We further propose an extension to the Intel® Threading Building Blocks (Intel® TBB) flow graph interface that enables intra-node distributed memory programming, simplifying communication and load balancing between the processors and coprocessors. Finally, we validate this approach by presenting a benchmark of a risk analysis implementation that achieves record-setting performance.
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
The growing interest in FPGA-based solutions for accelerating compute-demanding algorithms is increasing the need for new tools and methods to improve productivity. High-Level Synthesis (HLS) tools already provide a handy way to describe an FPGA-based hardware implementation starting from a software description of an algorithm. However, HLS directives improve the hardware design only from a computational perspective, requiring manual code restructuring when memory transfers need optimizing. This limits the effectiveness of Design Space Exploration (DSE) approaches that target only HLS directives. We therefore present a comprehensive methodology to support the designer in generating optimal HLS-based hardware implementations. First, we propose an automated roofline model generation that operates directly on a C/C++ description of the target algorithm. The approach enables a fast evaluation of the operational intensity of the target function and visualizes the main bottlenecks of the current HLS implementation, providing guidance on how to improve it. Second, we introduce a DSE methodology for quickly evaluating different HLS directives to identify an optimal implementation. We report the DSE performance on the PolyBench test suite, outperforming previous automated solutions in the literature. Finally, we illustrate how our framework accelerates a complex application, the N-body physics simulation algorithm, achieving results comparable to bespoke state-of-the-art implementations.
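The directive-level DSE described above can be sketched as an exhaustive search over a small configuration space with a cost model. The cost model below is entirely hypothetical (the paper derives its estimates from HLS reports, which are not reproduced here); it only illustrates the search structure:

```python
from itertools import product

# Hypothetical cost model for one HLS loop nest: unrolling divides the trip
# count, and array partitioning removes memory-port conflicts up to its
# factor. All constants below are illustrative, not taken from the paper.
TRIP_COUNT = 1024

def estimated_cycles(unroll, partition):
    parallel_reads = min(unroll, partition)   # memory ports limit parallelism
    ii = max(1, unroll // parallel_reads)     # initiation interval per group
    return (TRIP_COUNT // unroll) * ii + 50   # 50-cycle fixed pipeline overhead

# Exhaustive DSE over a small directive space (unroll factor x partition factor).
space = product([1, 2, 4, 8, 16], [1, 2, 4, 8])
best_cfg = min(space, key=lambda cfg: estimated_cycles(*cfg))
```

With this toy model the search settles on unroll 8 with partition factor 8: unrolling beyond the number of available memory ports raises the initiation interval and buys nothing, which is exactly the kind of interaction an automated DSE is meant to discover.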
Programming Languages & Tools for Higher Performance & Productivity – Linaro
By Hitoshi Murai, RIKEN AICS
For higher performance and productivity of HPC systems, it is important to provide users with a good programming environment, including languages, compilers, and tools. This talk presents the programming model of the post-K supercomputer.
Hitoshi Murai Bio
Hitoshi Murai received a master's degree in information science from Kyoto University in 1996. He worked as a software developer at NEC from 1996 to 2010 and received a Ph.D. in computer science from the University of Tsukuba in 2010. He is currently a research scientist in the programming environment research team and the Flagship 2020 project at the RIKEN Advanced Institute for Computational Science. His research interests include compilers and parallel programming languages.
Email
h-murai@riken.jp
For more info on The Linaro High Performance Computing (HPC) visit https://www.linaro.org/sig/hpc/
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/arm/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit-iodice
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Gian Marco Iodice, Software Engineer at ARM, presents the "Using SGEMM and FFTs to Accelerate Deep Learning" tutorial at the May 2016 Embedded Vision Summit.
Matrix multiplication and the fast Fourier transform are numerical foundation stones for a wide range of scientific algorithms. With the emergence of deep learning, they are becoming even more important, particularly as use cases extend into mobile and embedded devices. In this presentation, Iodice discusses and analyzes how these two key, computationally intensive algorithms can be used to gain significant performance improvements for convolutional neural network (CNN) implementations.
After a brief introduction to the nature of CNN computations, Iodice explores the use of GEMM (General Matrix Multiplication) and mixed-radix FFTs to accelerate 3D convolution. He shows examples of OpenCL implementations of these functions and highlights their advantages, limitations and trade-offs. Central to the techniques explored is an emphasis on cache-efficient memory accesses and the crucial role of reduced-precision data types.
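The GEMM-based convolution Iodice describes rests on the im2col trick: unroll every input patch into a column, then express the whole convolution as one matrix multiplication. The talk's implementations are in OpenCL; the NumPy sketch below (single channel, stride 1, no padding) only illustrates the data rearrangement, checked against a direct convolution:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll each kh x kw patch of a 2-D input into one column (stride 1, no pad)."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(kh):
        for j in range(kw):
            cols[i * kw + j] = x[i:i + oh, j:j + ow].ravel()
    return cols

def conv2d_gemm(x, k):
    """Convolution (cross-correlation, as CNNs compute it) as a single GEMM."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (k.ravel() @ im2col(x, kh, kw)).reshape(oh, ow)

def conv2d_direct(x, k):
    """Reference sliding-window convolution to validate the GEMM path."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.random((8, 8))
k = rng.random((3, 3))
```

The payoff is that the GEMM form runs on heavily tuned, cache-blocked matrix kernels instead of a bespoke sliding-window loop, at the cost of the extra memory used by the unrolled patch matrix.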
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) – Wim Vanderbauwhede
Keynote I gave at the ParCo conference (http://www.parco2015.org) workshop paraFPGA in Edinburgh, Sept 2015, on the need to raise the abstraction level for programming of heterogeneous systems.
Assisting Users’ Transition to Titan’s Accelerated Architecture – inside-BigData.com
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. Its scale alone can be intimidating for users new to leadership computing facilities. Our facility has accumulated over four years of experience helping users port applications to Titan. This talk explains common paths and tools for porting applications successfully, and exposes common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
Moldable pipelines for CNNs on heterogeneous edge devices – LEGATO project
Abstract: Modern edge devices are equipped with more powerful computing resources than ever before, which opens up the opportunity to execute deep neural networks on the devices instead of in the cloud. Existing DNN frameworks such as TensorFlow, Caffe, and Torch do not exploit the heterogeneous features of these devices beyond supporting GPUs. Modern edge devices contain different categories of heterogeneity, i.e., clusters of different types of cores fabricated on the same chip; an example of such a board is the Jetson TX2 from Nvidia. To support DNN applications on heterogeneous edge devices, we have developed a framework that generates an efficient and balanced parallel pipelined implementation of CNN inference from a simplified template language interface. We leverage the input and output information provided by the template language to generate a balanced pipeline. Since the cores are heterogeneous, we run a brief online training phase to find the best core distribution for a balanced, high-throughput pipeline.
Our experiments show that a pipeline mapping configuration obtained from online training yields a pipeline up to 22% faster than the baseline. We compare against a kernel-level parallel implementation of VGG-16, a widely used image-classification CNN, on the Nvidia Jetson TX2 board.
Poster presented by Pirah Noor Soomro at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
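The core mapping problem behind the moldable-pipeline idea can be sketched as a search: given per-stage costs on each core type, pick the assignment that minimizes the slowest stage, since a pipeline's steady-state throughput is set by its bottleneck. All costs and the big-core budget below are invented for illustration; the framework's online training phase would measure them instead:

```python
from itertools import product

# Hypothetical per-stage costs (ms) on each core type of a big.LITTLE-style
# chip such as the Jetson TX2's CPU complex. Numbers are made up.
cost = {
    "big":    [4.0, 6.0, 3.0, 5.0],   # one entry per pipeline stage
    "little": [9.0, 14.0, 7.0, 11.0],
}
stages = range(4)

def throughput_bound(assignment):
    # A pipeline's steady-state period equals its slowest stage.
    return max(cost[core][s] for s, core in zip(stages, assignment))

# Brute-force search, with at most two stages allowed on big cores
# (a stand-in for the limited number of fast cores on the chip).
candidates = [a for a in product(cost, repeat=4) if a.count("big") <= 2]
best = min(candidates, key=throughput_bound)
```

With these numbers the search puts the big cores on the two heaviest little-core stages, cutting the pipeline period from 14 ms to 9 ms; the real framework faces the same trade-off but measures costs online rather than assuming them.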
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern F... – Shinya Takamaeda-Y
Presentation slide for CARL2013 (Co-located with MICRO-46) at Davis, CA.
PyCoRAM: Yet Another Implementation of CoRAM Memory Architecture for Modern FPGA-based Computing
SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in... – inside-BigData.com
In this deck from the HPC User Forum at Argonne, Deepak Pathania presents: SX Aurora TSUBASA (Vector Engine) a Brand-new Vector Supercomputing power in Server Chassis.
"The NEC Vector Engine Processor was developed using 16 nm FinFET process technology for extreme high performance and low power consumption. The Vector Engine Processor has the world's first implementation of one processor with six HBM2 memory modules using Chip-on-Wafer-on-Substrate technology, leading to the world-record memory bandwidth of 1.2 TB/s."
Watch the video: https://wp.me/p3RLHQ-kOK
Learn more: https://www.nec.com/en/global/solutio...
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerate Your Python* Code through Profiling, Tuning, and Compilation Part ... – Intel® Software
Learn about the latest developments and tools for high-performance Python*, which are used with scikit-learn, NumPy, SciPy, pandas, mpi4py, and Numba*. Apply low-overhead profiling tools, including Intel® VTune™ Amplifier, to analyze mixed C, C++, and Python applications to detect performance bottlenecks in the code and to pinpoint hotspots as the target for performance tuning. Get the best performance from your Python application with the best-known methods, tools, and libraries.
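The profile-then-tune workflow described above can be tried without any vendor tooling. The talk's tool of choice is Intel® VTune™ Amplifier; the sketch below uses the standard library's cProfile as a stand-in to show the same loop of running a workload under a profiler and reading off the hotspot:

```python
import cProfile
import io
import pstats

def slow_kernel(n):
    # Deliberately scalar Python loop: the kind of hotspot profiling surfaces.
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

def fast_path(n):
    # Built-in reduction, far cheaper per element than the loop above.
    return sum(range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_kernel(200_000)
fast_path(200_000)
profiler.disable()

# Render the top entries by cumulative time into a string report.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

The report ranks `slow_kernel` as the dominant cost, which is the cue to replace it with a NumPy or Numba* implementation; VTune provides the same signal with lower overhead and native-code visibility for mixed C/C++/Python applications.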
Extracting a Rails Engine to a separate application – Jônatas Paganini
As a Rails application grows, there is a need to decouple heavy subsystems from the monolithic application. Teams in many companies are doing the same: extracting (micro)services from their monolithic applications to give engineering teams more flexibility and speed up the workflow.
From the separation of the business logic to the server's setup, every change should respect the zero-downtime approach.
This talk shares the automated steps and exercises we created to have a smooth transition to the new system.
I'll share the context of the tool that automatically extracts an entire Rails engine from a project and moves it to a separate service.
We build AI and HPC solutions. Expertise: highly optimized AI Engines and HPC Apps.
• HPC: accelerating time to results and adapting complex algorithms to GPU, FPGA, many-CPU architectures.
Leverage byteLAKE's expertise in adapting and optimizing complex algorithms for NVIDIA GPUs, Xilinx Alveo FPGAs, and Intel, AMD, and ARM solutions. From single nodes to clusters.
More: www.byteLAKE.com/en/Alveo
A short introduction to FPGA acceleration and the impact of the new High-Level Synthesis toolchains on FPGA programmability.
Video here: https://www.linkedin.com/posts/marcobarbone_can-my-application-benefit-from-fpga-acceleration-activity-6848674747375460352-0fua
The increasing demand for computing power in fields such as biology, finance, and machine learning is pushing the adoption of reconfigurable hardware in order to keep up with the required performance level at a sustainable power consumption. Within this context, FPGA devices represent an interesting solution, as they combine power efficiency, performance, and flexibility. Nevertheless, the steep learning curve and the experience needed to develop efficient FPGA-based systems remain among the main limiting factors for broad adoption of such devices.
In this talk, we present CAOS, a framework that helps the application designer identify acceleration opportunities and guides them through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, from identifying the kernel functions to accelerate, to optimizing those kernels, to generating the runtime management and the configuration files needed to program the FPGA.
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin... – Intel® Software
Integrated into Intel® Advisor, Cache-aware Roofline Modeling (CARM) provides insight into how an application behaves by helping to determine a) how optimally it runs on given hardware, b) the main factors that limit performance, c) whether the workload is memory-bound or compute-bound, and d) the right strategy to improve application performance.
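The memory-bound versus compute-bound decision that CARM automates reduces to a simple formula: attainable performance is the minimum of peak compute and bandwidth times arithmetic intensity. The machine numbers below are illustrative placeholders, not measurements from any Intel tool:

```python
# Roofline sketch: attainable GFLOP/s is capped either by peak compute or
# by memory bandwidth times arithmetic intensity (AI, in FLOPs per byte).
# Both machine parameters below are made-up round numbers.
PEAK_GFLOPS = 1000.0      # peak double-precision compute
PEAK_BW_GBS = 100.0       # sustained memory bandwidth, GB/s

def attainable_gflops(ai):
    return min(PEAK_GFLOPS, PEAK_BW_GBS * ai)

def bound(ai):
    # Left of the ridge point (AI = PEAK_GFLOPS / PEAK_BW_GBS) the memory
    # roof is the binding constraint; right of it, the compute roof is.
    return "memory-bound" if PEAK_BW_GBS * ai < PEAK_GFLOPS else "compute-bound"

# Example: a stream triad a[i] = b[i] + s * c[i] performs 2 FLOPs while
# moving three 8-byte doubles, so AI = 2/24 -- deep in the memory-bound region.
triad_ai = 2 / 24
```

Plotting a kernel's measured AI against these two roofs tells you immediately whether to chase vectorization (compute-bound) or data locality and traffic reduction (memory-bound), which is the strategy guidance CARM provides.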
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other effects, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing centre in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to offer supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology to transfer to society. We are a Severo Ochoa Centre of Excellence, a first-level member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Spanish Supercomputing Network (RES). As a research centre, we have more than 456 experts from 45 countries, organized in four major research areas: Computer Sciences, Life Sciences, Earth Sciences, and computational applications in science and engineering.
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures – Dr. Fabio Baruffa
In the framework of the Intel Parallel Computing Centre at the Research Campus Garching in Munich, our group at LRZ presents recent results on performance optimization of Gadget-3, a widely used community code for computational astrophysics. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm and focus on threading parallelism optimization, change of the data layout into Structure of Arrays (SoA), compiler auto-vectorization and algorithmic improvements in the particle sorting. We measure lower execution time and improved threading scalability both on Intel Xeon (2.6× on Ivy Bridge) and Xeon Phi (13.7× on Knights Corner) systems. First tests on second generation Xeon Phi (Knights Landing) demonstrate the portability of the devised optimization solutions to upcoming architectures.
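One of the algorithmic improvements mentioned above, particle sorting, can be sketched independently of Gadget-3. The idea is to bin particles on a spatial grid and reorder the particle arrays by cell index, so that spatial neighbours become memory neighbours. The grid and particle counts below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(2)
n, box, cell = 10_000, 1.0, 0.1
pos = rng.random((n, 3)) * box

# Bin particles on a regular grid and compute a linear cell index per particle.
ncell = int(box / cell)
ijk = np.minimum((pos / cell).astype(int), ncell - 1)
cell_id = (ijk[:, 0] * ncell + ijk[:, 1]) * ncell + ijk[:, 2]

# Reorder the particle arrays by cell index: neighbour-list walks in an SPH
# kernel then touch contiguous cache lines instead of scattered addresses.
order = np.argsort(cell_id, kind="stable")
pos_sorted = pos[order]
cell_sorted = cell_id[order]
```

After the sort, every grid cell occupies one contiguous run of the arrays, which is what makes the subsequent neighbour interactions cache-friendly and, combined with the SoA layout, auto-vectorizable.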
Scrooge Attack: Undervolting ARM Processors for Profit – LEGATO project
A malicious cloud provider can intentionally undervolt its cloud infrastructure for additional savings on the electricity bill. ARM processors are low-power processors whose undervolting can yield substantial energy savings for cloud providers. In our scenario we consider a scrooge cloud provider that undervolts its ARM infrastructure for profit. The instances can be undervolted in a stealthy manner by avoiding critical voltage regions. Applications running under critical undervolting conditions can malfunction, and these conditions can be exploited by a cloud user to uncover undervolted instances. For this novel attack scenario we present a detection method for cloud users. The detection method non-selectively injects faults into processes with the intent to crash the cloud instance. Even if the cloud provider can spoof the temperature and voltage readings of the processor, the cloud user is able to uncover undervolted instances. By crashing instances simultaneously using the detection method, the cloud user is covered by the service level agreement and exposes the scrooge cloud provider.
TEEMon: A continuous performance monitoring framework for TEEsLEGATO project
LEGaTO paper presented at ACM Middleware 2020 by Robert Krahn, Donald Dragoti, Franz Gregor, Do Le Quoc, Valerio Schiavoni, Pascal Felber, Clenimar Souza, Andrey Brito and Christof Fetzer
LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the projectLEGATO project
Presentation by Osman Unsal and Pirah Noor Soomro at the webinar AI4EU WebCafé: 'Energy-efficient AI, a perspective from the LEGaTO project' on 28 October 2020
Presentation given by Jens Hagemeyer (Bielefeld University) at the ‘Low-Energy Heterogeneous Computing Workshop’ on 16 October 2020 within HiPEAC CSW Autumn 2020
TZ4Fabric: Executing Smart Contracts with ARM TrustZoneLEGATO project
Paper presented by Christian Göttel at SRDS'20.
Abstract: Transparency in blockchains can be an advantage and a disadvantage, in particular if confidential information such as assets or business interactions are exposed. There are no confidentiality guarantees in blockchain systems to protect the logic of a smart contract or the data it processes. One solution to this problem can be trusted execution environments (TEE) which are an emerging technology for example available in edge or mobile-grade processors (e.g., ARM TrustZone) or in server-grade processors (e.g., Intel SGX). In this presentation we introduce TZ4Fabric, an extension of Hyperledger Fabric which leverages ARM TrustZone to shield the execution of smart contracts from compromised systems and powerful attackers. TZ4Fabric exploits the open source OP-TEE framework to enable ARM TrustZone features. We evaluate our prototype on the Raspberry Pi platform and highlight energy and performance trade-offs.
Infection Research with Maxeler Dataflow ComputingLEGATO project
Presentation given by Tobias Becker (Maxeler) at the LEGaTO Final Event: Low-Energy Heterogeneous Computing Workshop on 4 September 2020
This event was collocated with FPL 2020
Presentation given by Nils Kucza (Bielefeld University) at the LEGaTO Final Event: Low-Energy Heterogeneous Computing Workshop on 4 September 2020
This event was collocated with FPL 2020
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-ResiliencyLEGATO project
Tutorial by Behzad Salami, Osman Unsal and Leonardo Bautista at 30th International Conference on Field-Programmable Logic and Applications (FPL2020), 3 September 2020
Scheduling Task-parallel Applications in Dynamically Asymmetric EnvironmentsLEGATO project
Presentation by Jing Chen and Pirah Noor Soomro (Chalmers University of Technology) at the 16th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2020) on 17 August 2020.
SRMPDS was a virtual event and collocated with ICPP’20 - 2020 International Conference on Parallel Processing.
RECS – Cloud to Edge Microserver Platform for Energy-Efficient ComputingLEGATO project
Abstract: Today, application developers and data center operators face the challenging task of achieving high performance while at the same time needing to reduce the total cost of ownership, which is driven especially by the energy consumption of the server itself.
This poster shows the RECS Microserver platform, developed by Christmann and Bielefeld University. RECS simplifies the combined use of heterogeneous target architectures to achieve high performance and superior energy-efficiency.
Poster presented by Martin Kaiser at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
HiPerMAb: A statistical tool for judging the potential of short fat dataLEGATO project
Abstract: Common statistical approaches are not designed to deal with so-called "short fat data" in biomarker pilot studies, where the number of biomarker candidates exceeds the sample size by orders of magnitude. Because of the high cost and long time needed to collect and prepare the data in this type of study, researchers prefer to check the potential of the large set of biomarker candidates in a small pilot study.
The aim of the pilot study is to answer the question of whether it is worthwhile to extend the study to a larger one, and to obtain information about the required sample size. The HiPerMAb tool is proposed as a method to judge the potential of a small biomarker pilot study without the need to explicitly identify and confirm a specific subset of biomarkers. It allows pilot studies to be evaluated with performance measures such as multiclass AUC, entropy, area above the cost curve, hypervolume under manifold, and misclassification rate. Entropy is a useful tool in machine learning and has become one of the most exciting developments in biology. However, unlike the area under the ROC curve (AUC), it has no closed-form solution for estimating the p-values HiPerMAb requires. A possible solution is Monte Carlo simulation, but with the number of biomarker candidates in such studies the number of simulations becomes very large, computationally costly, and energy consuming. By using Maxeler DFEs on the Jülich testbed, we are able to look at studies with more than 50,000 biomarkers; we then need to estimate probabilities smaller than 1/50,000, which means running up to 50 million simulations. The number of "good" biomarker candidates is then compared to the expected number of "good" biomarker candidates in a dataset with no association to the considered disease states, to judge whether the study is worth extending with an appropriate sample size to find and evaluate a final combination of biomarkers with high predictive value.
Poster presented by Amani Al-Mekhlafi at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
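The Monte Carlo p-value estimation the abstract relies on can be sketched as follows. This is a generic illustration, not HiPerMAb's implementation; the function and parameter names are invented.

```python
import random

def mc_p_value(observed, null_sampler, n_sim=100_000, seed=0):
    """Estimate P(T >= observed) by simulating the null distribution.

    `null_sampler(rng)` draws one value of the test statistic under the
    null hypothesis (both names are illustrative).
    """
    rng = random.Random(seed)
    hits = sum(null_sampler(rng) >= observed for _ in range(n_sim))
    # Add-one smoothing keeps the estimate strictly positive. To credibly
    # resolve p < 1/50,000, n_sim must be far larger than 50,000 -- which
    # is why the abstract needs up to 50 million simulations.
    return (hits + 1) / (n_sim + 1)
```

Accelerating exactly this kind of embarrassingly parallel simulation loop is what the Maxeler dataflow engines are used for.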
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Sérgio Sacani
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io's surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io's trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high-resolution imaging of Io's surface using adaptive optics at visible wavelengths.
Richard's entangled adventures in wonderlandRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Cancer cell metabolism: special Reference to Lactate PathwayAADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy they need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules of a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to "burn" the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis - Krebs cycle - oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
INTRODUCTION TO THE WARBURG PHENOMENON:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic ("glucose addiction") and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the 1931 Nobel Prize in Physiology or Medicine for his "discovery of the nature and mode of action of the respiratory enzyme."
The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Richard's adventures in two entangled wonderlandsRichard Gill
1. The LEGaTO project has received funding from the European Union's Horizon 2020 research and
innovation programme under the grant agreement No 780681
LEGaTO: Software Stack Runtimes
HiPEAC 2020 Computer Systems Week, 16-10-2020
Miquel Pericas, Chalmers University of Technology
3. HiPEAC CSW Autumn 2020
Slurm and RECS Master
• Integration of Slurm with RECS Master
o Nodes specification at slurm configuration (partitions, limits…)
o Slurm gets node specification and selects target nodes
o Allocates, joins and starts nodes
o Executes the application(s)
o Shuts down nodes and destroys the allocation
$ sinfo
PART… AVAIL LIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 16 idle BB_1_[0,2-15],pcxavim6
4.
Slurm and RECS Master
• Slurm contacts RECS Master at job execution and termination times
#!/bin/bash
#SBATCH -N 10
#SBATCH --constraint=ARM,bigLITTLE,hasGPU
#SBATCH -o test-%j.out
#SBATCH -e test-%j.err
# App invocation
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle* pcxavim5
debug* up infinite 10 alloc BB_1_[0,2-10]
debug* up infinite 6 idle BB_1_[11-15],pcxavim6
$ sbatch batch-10-bl.sh
Submitted batch job 39
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
39 debug batch-10 xavim R 0:42 10 BB_1_[0,2-10]
5.
Slurm and RECS Master
• Composed nodes are created using the RECS Master webservice
• They are started and stopped automatically
10 nodes are turned on
7.
● Acceleration of matrix multiplication on FPGAs
− 4 ARM cores (OpenBLAS)
− 1 to 3 IP cores
● Block size 256x256
[Chart: Matrix multiply, energy efficiency — omitted. GFlops and GFlops/W for 4 ARM cores (OpenBLAS) and 1 to 3 IP cores.]
● Best performance: 3 IP cores
● Best energy-efficiency: 2 IP cores
OmpSs@FPGA
8.
XiTAO: Energy Aware Scheduler
• Module 1: Power Profiling
  • Helps the runtime understand CPU power consumption trends (number/type of cores, different frequencies)
• Module 2: Dynamic Performance Modeling
  • Provides an accurate prediction for a future task given a set of resources
  • Independent of platforms and frequencies
  • Achieves the scalability and portability goals
• Module 3: Idleness Tracing
  • Gives information about the real-time status of cores
  • Puts cores to "sleep" when they are under-utilized
  • Sleeping time follows an exponential backoff strategy
  • Provides the real-time parallel slackness of active cores => calculation of shared board static power for each running task
• Module 4: Task Mapping Algorithm (per-task level)
  For a given configuration (start core, number of cores):
  • Performance Tracer => Execution Time Prediction
  • Power Profiles => Dynamic Power Prediction
  • Power Profiles + Idleness Tracer => Static Power Prediction
  • Energy Prediction = (Static Power + Dynamic Power) x Execution Time
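Module 4's energy formula can be expressed as a small sketch. Everything below is illustrative only: in the real XiTAO scheduler the execution-time and power terms come from its performance tracer, power profiles, and idleness tracer, whereas here they are supplied as plain numbers.

```python
def predicted_energy(exec_time_s, dynamic_power_w, static_power_w):
    """Energy Prediction = (Static Power + Dynamic Power) x Execution Time,
    in joules, for one candidate (start core, width) configuration."""
    return (static_power_w + dynamic_power_w) * exec_time_s

def pick_config(candidates):
    """Pick the configuration with the lowest predicted energy.
    Each candidate is (leader, width, exec_time_s, dyn_w, static_w);
    the tuple layout is an assumption for this sketch."""
    return min(candidates, key=lambda c: predicted_energy(c[2], c[3], c[4]))
```

For example, a wide configuration that finishes much faster can win on energy even though it draws more power, which is exactly the trade-off the per-task mapping explores.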
9.
XiTAO: Energy Aware Scheduler
● 31%-74% energy reduction compared to RWS
● 19%-68% energy reduction compared to FCC
● 25%-73% energy reduction compared to LCC
Name | Acronym | Notion
Random Work Stealing (+Sleep) | RWS (+S) | Typical greedy scheduling (enhanced with Sleep)
Fastest Cores with Criticality (+Sleep) | FCC (+S) | Critical tasks are mapped to the set of cores that minimizes execution time and are not subject to work stealing; non-critical tasks follow the parent queue and only search for the best number of cores that minimizes the execution time of the task (enhanced with Sleep)
Lowest Cost with Criticality (+Sleep) | LCC (+S) | The difference between LCC and FCC is that minimizing execution time becomes minimizing parallel cost, where parallel cost means "execution time x number of cores" (enhanced with Sleep)
Lowest Energy without Criticality | LENC | Task scheduling targets the lowest energy; no need for criticality awareness
10.
• Mapping logical data locations to physical locations (to create a model per locality)
• The Software Topology Address (STA) is a portable key that is interpreted by the XiTAO runtime to map a task to a place.
  • Example: a space-filling order is used as an STA, transforming coordinates into an integer for Cartesian inputs. The paper includes other example keys.
• This STA-to-location mapping is leveraged to model the performance per task's data locality.
  • A performance model is created per (STA, task_type) tuple.
  • An energy-aware model could potentially be used here.
• Example: the system's elastic partitions to be used by the model
XiTAO: Software & Hardware Topologies
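As a concrete example of a space-filling order usable as such a key, a Z-order (Morton) encoding interleaves the bits of 2-D Cartesian coordinates into a single integer, so that nearby points tend to get nearby keys. This is offered only as an illustration of the idea; it is not necessarily the curve the XiTAO paper uses.

```python
def _part1by1(v):
    """Spread the low 16 bits of v so that bit i moves to bit 2*i."""
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton_key(x, y):
    """Z-order (Morton) key: interleave the bits of x and y into one
    integer, a portable locality-preserving key for Cartesian inputs."""
    return _part1by1(x) | (_part1by1(y) << 1)
```

The runtime can then hash or range-partition such integer keys to assign tasks to places while preserving spatial locality.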
11.
XiTAO: Model Validation on DAG Chain
• Adaptive resource selection (leader, width) for a cache-intensive task. Green is the NUMA node where the task (depicted by its STA) is initialized.
  • The scheduler mostly chooses widths 1 and 2 (within the shared L2 cache).
• Adaptive resource selection (leader, width) for a memory-intensive task.
  • The scheduler mostly chooses width 12 (a socket encapsulating 2 NUMA nodes).
• Random work-stealing behavior for compute-bound tasks, while preferring larger widths.
• Scalability of the model running memory-bound DAG chains: up to 2.5x speedup with larger task counts.
• To validate the STA-driven performance modeling, we
  − test on a 4-socket AMD system (2 NUMA nodes each)
  − print a resource selection trace of a chain of tasks
• The scheduler adaptively behaves as locality-aware for memory/cache-intensive tasks, and as a work-stealing scheduler for compute-bound tasks.
12.
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
● A simple template tensor language to develop CNN networks.
● XiTAO pipelines are generated using the information provided by the language interface.
● An online training phase determines the optimal pipeline configuration:
  • network layer distribution among pipeline stages
  • resource partitioning among pipeline stages
● The training is led by a search algorithm which utilizes computational hints provided by the language interface.
13.
Network description in template language
main(){
…
Conv1 = CONV(ip, op, weights);
Conv2 = CONV(Conv1, op, weights);
….
network.add(Conv1);
network.add(Conv2);
…
network.execute();
}
XiTAO: Moldable pipelines for CNNs on heterogeneous edge devices
14.
FPGA Undervolting
Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
Goal: Bridge the power-efficiency gap between ASICs and FPGAs by undervolting below the nominal level
• Case study: power consumption of neural networks is a main concern
  ✔ Hardware acceleration: GPUs, FPGAs, and ASICs
Evaluation Setup
  ✔ 5 image classification workloads
  ✔ 3 Xilinx UltraScale+ ZCU102 platforms
  ✔ 2 on-chip voltage rails
Main Results
  ✔ Large voltage guardband (i.e., 33%)
  ✔ >3X power-efficiency gain
15.
Overall Voltage Behavior
Slight variation of voltage behavior across platforms and benchmarks:
❑ Crash: FPGA stops operating
❑ Guardband: no performance or reliability loss; added by the vendor to ensure worst-case conditions; large, on average 33%
❑ Critical: a narrow voltage region where neural network accuracy collapses
16.
GPU Checkpointing with FTI
● Transparent multi-GPU/multi-node checkpointing
● Parallel streams to improve I/O efficiency
● Fast checksum calculation using a GPU MD5 algorithm
17.
GPU Checkpointing with FTI
● Over 100x speedup with the new GPU MD5 algorithm
● A checkpoint takes less than 1 second
● FPGA checkpoint implementation coming