DATE 2020: Design, Automation and Test in Europe Conference
The LEGaTO project has received funding from the European Union's Horizon 2020 research
and innovation programme under the grant agreement No 780681
LEGaTO: Low-Energy, Secure,
and Resilient Toolset for
Heterogeneous Computing
DATE 2020
Valerio Schiavoni (Uni. Neuchatel)
11/March/2020
DATE 2020
The industry challenge
• Dennard scaling is dead
• Moore's law is slowing down
• Trend towards heterogeneous architectures
• Increasing focus on energy: ICT sector is responsible for
5% of global energy consumption
• Creates a programmability challenge
DATE 2020
LEGaTO Ambition
• Create software-stack support for energy-efficient
heterogeneous computing
o Starting from a mature, Made-in-Europe software stack and
optimizing it for energy efficiency
o Computing on a commercial, cutting-edge, European-developed
heterogeneous hardware substrate with CPU + GPU + FPGA +
FPGA-based Dataflow Engines (DFEs)
• Main goal: energy efficiency
DATE 2020
The future challenge of computing: MW, not FLOPS
“… without dramatic increases
in efficiency, ICT industry could
use 20% of all electricity and
emit up to 5.5% of the world’s
carbon emissions by 2025.”
“We have a tsunami of data
approaching. Everything which
can be is being digitalised. It is
a perfect storm.”
“ … a single $1bn Apple data
centre planned for Athenry in Co
Galway, expects to eventually
use 300MW of electricity, or
over 8% of the national capacity
and more than the daily entire
usage of Dublin. It will require
144 large diesel generators as
back up for when the wind does
not blow.”
DATE 2020
How did we get here?
• Decades of exponential growth in performance
• End of Dennard scaling; Moore's Law is slowing down
• Explore new architectures & models of computation
• Exponential growth in demand & data
• Move towards accelerators
DATE 2020
The End of Performance
• Frequency levels off, cores fill in the gap
http://doi.ieeecomputersociety.org/10.1109/MC.2015.368
DATE 2020
The real challenge: data movements
Pedram, Richardson, Galal, Kvatinsky, and Horowitz, "Dark Memory and Accelerator-
Rich System Optimization in the Dark Silicon Era", IEEE Design & Test, April 2016:
"As a result the most important optimization must be
done at the algorithm level, to reduce off-chip memory
accesses, to create Dark Memory. The algorithms must
first be (re)written for both locality and parallelism
before you tailor the hardware to accelerate them."
In agreement with Intel:
"My hypothesis is that we can solve [the software crisis
in parallel computing], but only if we work from the
algorithm down to the hardware — not the traditional
hardware first mentality." (Tim Mattson, principal
engineer at Intel)
DATE 2020
FPGAs to the rescue?
• The model of computation is key
• Build ultra-deep, highly efficient pipelines
DATE 2020
Hardware Platforms
• Christmann (RECS Box)
o Heterogeneous platform: supports CPU (x86 or Arm), GPU
(Nvidia) and FPGA (Xilinx and Altera)
• MAXELER
o Maxeler DFE platforms and toolset
• Heterogeneous microserver approach, integrating CPU, GPU and FPGA
technology
• High-speed, low-latency communication infrastructure for
accelerators, scalable across multiple servers
• Modular, blade-style design, hot-pluggable/swappable
RECS|Box overview: Concept
• Dedicated network for monitoring, control and iKVM
• Multiple 1Gb/10Gb Ethernet links per microserver – integrated
40 Gb Ethernet backbone
• Dedicated high-speed, low-latency communication infrastructure
o Host-to-host PCIe between CPUs
o Switched high-speed serial lanes between FPGAs
o Enables pooling of accelerators and assignment to
different hosts
RECS|Box overview: Architecture
[Architecture diagram: up to 15 baseboards per server, connected over a
backplane; 3/16 microservers per baseboard; each microserver module carries
CPU, memory and I/O; the baseboards share distributed monitoring and KVM, the
Ethernet communication infrastructure, the high-speed low-latency
communication fabric, and storage / I/O extension.]
DATE 2020
Maxeler Dataflow Computing Systems
MPC-X3000
• 8 Dataflow Engines in 1U
• Up to 1TB of DFE RAM
• Zero-copy RDMA
• Dynamic allocation of DFEs to
conventional CPU servers through
Infiniband
• Equivalent performance to
20-50 x86 servers
DATE 2020
Dataflow System at Jülich Supercomputing Center
• Pilot system deployed in October 2017 as part of PRACE PCP
• System configuration
o 1U Maxeler MPC-X with 8 MAX5 DFEs
o 1U AMD EPYC server
• 2x AMD 7601 EPYC CPUs
• 1 TB RAM
• 9.6 TB SSD (data storage)
o one 1U login head node
• Runs 4 scientific workloads
[System diagram: remote users, a head/build node, the Supermicro EPYC node
(1 TB DDR4) and the MPC-X node with MAX5 DFEs; 2x InfiniBand @ 56 Gbps
between the EPYC node and the MPC-X, 10 Gbps Ethernet links, and IPMI for
management.]
http://www.prace-ri.eu/pcp/
DATE 2020
Scaling Dataflow Computing into the Cloud
• MAX5 DFEs are compatible with Amazon EC2 F1
instances
• MaxCompiler supports programming F1
• Hybrid Cloud Model:
o On-premise Dataflow system from Maxeler
o Elastically expand to the Amazon Cloud when
needed
DATE 2020
Programming Models and Runtimes:
Task-based programming + Dataflow!
• OmpSs / OmpSs@FPGA (BSC)
• MaxCompiler (Maxeler)
• DFiant HLS (TECHNION)
• XiTAO (CHALMERS)
DATE 2020
Towards a single source for any target
• A necessity if we expect programmers to survive
o New architectures appear like mushrooms
• Productivity
o Develop once → run everywhere
• Performance
• Key concept behind OmpSs
o Sequential task-based program on a single address/name space
o + directionality annotations
o Executed in parallel: the runtime automatically computes dependences
among tasks
o LEGaTO: extend tasks with resource requirements and propagate them
through the stack to find the most energy-efficient solution at run time
(see the sketch below)
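
As a rough illustration of the model above (a minimal sketch, not LEGaTO or
BSC example code; the kernels, array sizes and clause spellings are
illustrative assumptions), a sequential C program is annotated with task
directives and directionality clauses, and the runtime derives the dependence
graph from them:

    #include <stdio.h>

    /* Hypothetical kernels annotated as OmpSs-style tasks.
       The in/out/inout clauses declare directionality, from which the
       runtime computes the dependences between task instances. */
    #pragma omp task in(a[0;n]) out(b[0;n])
    void scale(const float *a, float *b, int n) {
        for (int i = 0; i < n; i++) b[i] = 2.0f * a[i];
    }

    #pragma omp task in(b[0;n]) inout(c[0;n])
    void accumulate(const float *b, float *c, int n) {
        for (int i = 0; i < n; i++) c[i] += b[i];
    }

    int main(void) {
        enum { N = 1024 };
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; c[i] = 0.0f; }

        scale(a, b, N);        /* task 1 */
        accumulate(b, c, N);   /* task 2: ordered after task 1 via b */
        #pragma omp taskwait   /* wait for both tasks to complete */

        printf("c[10] = %f\n", c[10]);
        return 0;
    }

Without an OmpSs-aware compiler the pragmas are simply ignored and the program
runs sequentially, which is exactly the "single source" property the slide
argues for.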
DATE 2020
OmpSs@FPGA Experiments (Matrix multiply)

AXIOM board: Xilinx Zynq UltraScale+ chip,
with 4 ARM Cortex-A53 cores and the ZU9EG FPGA.

Parameter                       Values                                        Description
IPs configuration               1*256, 3*128                                  Number of instances * size
Frequency (MHz)                 200, 250, 300                                 Working frequency of the FPGA
Number of SMP cores             SMP: 1 to 4; FPGA: 3+1 helper, 2+2 helpers    Combination of SMP and helper threads
Number of FPGA helper threads   SMP: 0; FPGA: 1, 2                            Helper threads are used to manage tasks on the FPGA
Number of pending tasks         4, 8, 16 and 32                               Number of tasks sent to the IP cores before waiting for their finalization

A sketch of how such a block task can be annotated for the FPGA follows below.
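
For context, the block kernel behind these experiments could be annotated for
FPGA offload roughly as below. This is a hedged sketch only: the device(fpga),
num_instances and copy_deps clauses follow BSC's published OmpSs@FPGA examples
as best they are recalled here, and BSIZE/matmul_block are illustrative names
rather than the actual benchmark source.

    #define BSIZE 128  /* block size; the table explores 1*256 and 3*128 IP configurations */

    /* Sketch of an OmpSs@FPGA task: the annotated function is synthesized as
       an FPGA IP core (num_instances controls how many copies are generated)
       and the runtime moves the blocks to and from device memory (copy_deps). */
    #pragma omp target device(fpga) num_instances(3) copy_deps
    #pragma omp task in([BSIZE*BSIZE]a, [BSIZE*BSIZE]b) inout([BSIZE*BSIZE]c)
    void matmul_block(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < BSIZE; i++)
            for (int j = 0; j < BSIZE; j++) {
                float sum = c[i * BSIZE + j];
                for (int k = 0; k < BSIZE; k++)
                    sum += a[i * BSIZE + k] * b[k * BSIZE + j];
                c[i * BSIZE + j] = sum;
            }
    }

The SMP helper threads in the table are the host-side threads that feed such
IP instances with pending tasks and collect their results.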
DATE 2020
DFiant HLS
• Aims to bridge the programmability gap by combining constructs and
semantics from software, hardware and dataflow languages
• The programming model occupies a middle ground between low-
level HDLs and high-level sequential programming
High-Level Synthesis languages and tools (e.g., C and Vivado HLS):
automatic pipelining, but not an HDL and problematic handling of state.
Register-Transfer Level HDLs (e.g., VHDL):
concurrency and fine-grain control, but bound to the clock and requiring
explicit pipelining.
DFiant, a dataflow HDL:
separates timing from functionality while providing automatic pipelining,
concurrency and fine-grain control.
DATE 2020
MaxCompiler: Application Development

CPU Code (.c):
    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;

MyKernel (.maxj):
    DFEVar x = io.input("x", dfeInt(32));
    DFEVar result = x * x + 30;
    io.output("y", result, dfeInt(32));

MyManager (.maxj):
    Manager m = new Manager();
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", CPU),
            link("y", CPU));
    m.build();

Host Code (.c):
    int *x, *y;
    MyKernel(DATA_SIZE,
             x, DATA_SIZE*4,
             y, DATA_SIZE*4);

The host code calls the DFE through SLiC and MaxelerOS; data moves over PCI
Express between main memory on the CPU and the DFE memory, where the kernel
pipeline (x * x + 30) executes.
Task-based kernel identification / DFE mapping
OmpSs identifies "static" task graphs at run time, and a DFE kilo-task
instance is selected.
Static, customized, ultra-deep (>1,000 stages) computing pipelines are
instantiated.
DataFlow Engines (DFEs) produce one pair of results at each clock cycle.
Why extremely low voltage for FPGAs?
Why energy efficiency through undervolting?
• There is still a long way to go to reach <20MW@exascale
• FPGAs are at least 20X less power- and energy-efficient than ASICs
• Undervolting directly delivers dynamic and static power/energy reduction
• Note: undervolting is not DVFS
o The frequency is not lowered!

FPGA Undervolting Preliminary Experiments
Voltage scaling below the standard nominal level, in on-chip
memories of Xilinx FPGAs:
o Power decreases quadratically, delivering savings of more than 10X
(a first-order model is sketched below)
o A conservative voltage guardband exists below the nominal level
o Faults increase exponentially below the guardband
o Faults can be detected and managed
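
The quadratic claim follows from the standard first-order CMOS power model
(a textbook relation, not a LEGaTO measurement): with the clock frequency f
held constant, dynamic power scales with the square of the supply voltage,
and static power also drops as the voltage and the leakage current fall.

    P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
    \qquad
    P_{\mathrm{static}} \approx V_{dd} \, I_{\mathrm{leak}}(V_{dd})

For example, lowering V_dd from a nominal 1.0 V to 0.6 V at the same frequency
cuts P_dyn to (0.6/1.0)^2 = 0.36 of its original value, roughly a 2.8X
reduction from the quadratic term alone; the >10X savings reported above for
on-chip memories presumably combine this dynamic reduction with the
static-power reduction over a wider undervolting range.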
DATE 2020
Use Cases
• Healthcare: Infection biomarkers
o Statistical search for biomarkers, which often
requires intensive computation. A biomarker is
a measurable value that can indicate the
state of an organism, often the presence,
absence or severity of a specific disease
• Smart Home: Assisted Living
o The home's ability to learn from the user's
behavior and anticipate future behavior
is still an open task, and necessary for
broad user acceptance of assisted living
among the general public
DATE 2020
Use Cases
• Smart City: operational urban
pollutant dispersion modelling
o Modeling the city landscape + sensor data +
wind prediction to issue a "pollutant
weather prediction"
• Machine Learning: automated driving
and graphics rendering
o Object detection with CNNs for
automated driving systems, and CNN-
and LSTM-based methods for realistic
rendering of graphics for gaming and
multi-camera systems
• Secure IoT Gateway
o A variety of sensors and actuators in
industrial and private surroundings
DATE 2020
Boosting Use Cases by Using the LEGaTO Toolchain
• Smart City: 7x gain in energy efficiency using FPGAs
• Health Care: 822x speedup using FPGAs, enabling a new world of
biomarker analysis
• Machine Learning: up to 16x gain in energy efficiency and performance
on the same hardware, using the EmbeDL optimizer
DATE 2020
Smart Mirror Use Case
• Local machine learning frameworks
o Image: object, face and gesture detection
o Speech with DeepSpeech
• Displays personalized information
• All detections run simultaneously
• Start: 16 FPS at 500 W
• Now: 25 FPS at 400 W
• Goal: 10 FPS at 50 W