LEGaTO: Low-Energy Heterogeneous Computing Use of AI in the project

The LEGaTO project has received funding from the European Union's Horizon 2020 research and
innovation programme under grant agreement No 780681.
LEGaTO: Low-Energy,
Heterogeneous Computing
Use of AI in the Project
AI4EU Café Webinar
Osman Unsal
Barcelona Supercomputing Center
28/October/2020

AI4EU Cafe
The future challenge of computing: MW, not FLOPS
“… without dramatic increases
in efficiency, ICT industry could
use 20% of all electricity and
emit up to 5.5% of the world’s
carbon emissions by 2025.”
“We have a tsunami of data
approaching. Everything which
can be is being digitalised. It is
a perfect storm.”
“ … a single $1bn Apple data
centre planned for Athenry in Co
Galway, expects to eventually
use 300MW of electricity, or
over 8% of the national capacity
and more than the daily entire
usage of Dublin. It will require
144 large diesel generators as
back up for when the wind does
not blow.”

How did we get here?
Decades of exponential growth in performance
End of Dennard scaling
Moore’s Law is slowing down
Explore new architectures & models of computation
Exponential growth in demand & data
Move towards accelerators

FPGAs to the rescue?
• The model of computation is key
• Build ultra-deep, highly efficient pipelines

LEGaTO Ambition
• Create software-stack support for energy-efficient
heterogeneous computing
o Starting with a mature Made-in-Europe software stack, and optimizing
this stack to support energy efficiency
o Computing on a commercial, cutting-edge, European-developed
heterogeneous hardware substrate with CPU + GPU + FPGA +
FPGA-based Dataflow Engines (DFE)
• Main goal: energy efficiency

LEGaTO Objectives

Use Cases
• Healthcare: Infection biomarkers
o Statistical search for biomarkers, which often
needs intensive computation. A biomarker is
a measurable value that can indicate the
state of an organism, often the
presence, absence, or severity of a specific
disease
• Smart Home: Assisted Living
o Enabling the home to learn from the
user's behavior and anticipate future
behavior is still an open task, and is necessary
for broad acceptance of
assisted living by the general public

Use Cases
• Smart City: Operational urban
pollutant dispersion modelling
o Modelling the city landscape + sensor data +
wind prediction to issue a “pollutant
weather prediction”
• Machine Learning: Automated driving
and graphics rendering
o Object detection using CNNs for
automated driving systems, and CNN-
and LSTM-based methods for realistic
rendering of graphics for gaming and
multi-camera systems
• Secure IoT Gateway
o Variety of sensors and actuators in
industrial and private environments

LEGaTO Healthcare Use Case and AI
• Leverage tree-based methods and LASSO regression for
infection biomarker research
• Integrated with the LEGaTO SCONE security technology
o Efficient deployment of Intel SGX security extensions
• LEGaTO scheduling techniques help to accelerate one of
the key algorithms, which uses random forests

LEGaTO ML (DNN) Use Case
• Covered in the presentation by Hans Salomonsson (Embedl)

LEGaTO Smart Home Use Case and AI
• Covered in the presentation by Nils Kucza (University of Bielefeld)

LEGaTO Student Research Perspective on AI
• Scheduling VGG across heterogeneous cores in mobile
edge devices
• On Nvidia Jetson TX2
o 4-core ARM A57 and
o 2-core Denver 2
• Covered in the presentation by Pirah Noor Soomro (Chalmers University)

LEGaTO Undervolting Technology for DNN
• See the following slides

Reduced-Voltage Operation in Modern FPGAs
for Neural Network Acceleration
Behzad Salami Baturay Onural Ismail Yuksel
Fahrettin Koc Oguz Ergin Adrian Cristal
Osman Unsal Hamid Sarbazi-Azad

Executive Summary
• Motivation: Power consumption of neural networks is a main concern
o Hardware acceleration: GPUs, FPGAs, and ASICs
• Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
• Goal: Bridge the power-efficiency gap between ASIC- and FPGA-based
neural networks by undervolting below the nominal level
• Evaluation Setup
o 5 image classification workloads
o 3 Xilinx UltraScale+ ZCU102 platforms
o 2 on-chip voltage rails
• Main Results
o Large voltage guardband (33% on average)
o >3X power-efficiency gain

Outline
• Motivation and Background
• Our Goal
• Methodology
• Results
- Overall Voltage Behavior
- Power-Reliability Trade-off
- Environmental Temperature
• Prior Works
• Summary, Conclusion, and Future Works

Motivation and Background
• Motivation
o Power consumption of neural networks is a main concern
o Hardware acceleration: GPUs, FPGAs, and ASICs
o FPGAs: increasingly popular, but less power-efficient than equivalent ASICs
o Large voltage guardbands (12-35%) for CPUs, GPUs, DRAMs
o Is there any potential in undervolting FPGAs to improve the power efficiency of neural networks?
• Background
o Neural networks: widely deployed, with an inherent resilience to errors
o FPGAs: higher throughput than GPUs and better flexibility than ASICs
o Undervolting: reduces power consumption, but may incur reliability or performance issues

Our Goal
• Primary Goal
o Bridge the power-efficiency gap between ASIC- and FPGA-based
neural networks by undervolting (i.e., scaling the voltage below the nominal level)
• Secondary Goals
o Study the voltage behavior of real FPGAs (e.g., guardband)
o Study the power-efficiency gain of undervolting for neural networks
o Study the reliability overhead
o Study frequency underscaling to prevent accuracy loss
o Study the effect of environmental temperature

Overall Methodology
• 5 CNN image classification
workloads, i.e., VGGNet, GoogleNet,
AlexNet, ResNet50, Inception
• Xilinx DNNDK to map the CNNs onto the FPGA
o Optimized for INT8 by default
• 3 identical samples of Xilinx ZCU102
o Zynq UltraScale+ architecture
o Hard-core ARM for data orchestration
o FPGA for CNN acceleration
• 2 on-chip voltage rails, via PMBus
o VCCINT: DSPs, LUTs, buffers, …
o VCCBRAM: BRAMs
o V_nom = 850 mV (set by manufacturer)
• Vast majority (>99.9%) of the power is dissipated on VCCINT

Overall Voltage Behavior
Slight variation of voltage behavior across platforms and benchmarks
• Guardband: Large region below nominal level (V_nom = 850 mV)
o No performance or reliability loss
o Added by the vendor to ensure correct operation under worst-case conditions
o Large guardband, average of 33%
• Critical: Narrower region below guardband (V_min = 570 mV)
o A narrow voltage region
o Neural network accuracy collapse
• Crash: FPGA crashes below critical region (V_crash = 540 mV)
o FPGA stops operating
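
The three voltage regions can be illustrated with a short Python sketch. The thresholds are the ones quoted on the slide; treating them as sharp boundaries (and placing exactly 570 mV in the guardband and exactly 540 mV in the critical region) is an assumption made for illustration, since the slide notes slight variation across platforms and benchmarks. The function names are hypothetical.

```python
# Thresholds quoted on the slide, in millivolts. Treating them as sharp,
# board-independent boundaries is an assumption made for illustration.
V_NOM_MV = 850    # nominal level set by the manufacturer
V_MIN_MV = 570    # lower edge of the guardband
V_CRASH_MV = 540  # lower edge of the critical region


def classify_voltage(v_mv: float) -> str:
    """Return the operating region for a supply level given in millivolts."""
    if v_mv > V_NOM_MV:
        return "above nominal"
    if v_mv >= V_MIN_MV:
        return "guardband"  # no performance or reliability loss
    if v_mv >= V_CRASH_MV:
        return "critical"   # neural network accuracy collapses here
    return "crash"          # FPGA stops operating


def guardband_fraction() -> float:
    """Guardband width as a fraction of the nominal voltage."""
    return (V_NOM_MV - V_MIN_MV) / V_NOM_MV


if __name__ == "__main__":
    for v in (850, 700, 570, 555, 530):
        print(v, "mV ->", classify_voltage(v))
    print(f"guardband = {guardband_fraction():.1%}")
```

With these thresholds the guardband works out to (850 - 570) / 850, roughly 33%, which matches the average quoted on the slide.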

Power-Reliability Trade-off
Power-efficiency (GOPs/W) gain
• >3X power-efficiency gain (2.6X by eliminating the guardband and a further 43% in the critical region)
• Slight variation across 3 platforms and 5 workloads
Reliability overhead (i.e., CNN accuracy loss)
[Figure panels: VGGNet, GoogleNet, AlexNet, ResNet, Inception]
• No accuracy loss in the guardband, accuracy collapse in the critical region
• Slight variation across 3 platforms and 5 workloads
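
As a rough check of how the two quoted figures combine into the ">3X" headline, the sketch below assumes that the further 43% applies multiplicatively on top of the 2.6X guardband gain. The slide does not state this explicitly, so the composition rule and the names used are assumptions.

```python
# Figures quoted on the slide. Combining them multiplicatively is an
# assumption; the slide does not say how the two gains compose.
GUARDBAND_GAIN = 2.6   # gain from eliminating the guardband
CRITICAL_EXTRA = 0.43  # further gain inside the critical region


def total_gain(base: float = GUARDBAND_GAIN, extra: float = CRITICAL_EXTRA) -> float:
    """Overall GOPs/W gain if the extra gain applies on top of the base gain."""
    return base * (1.0 + extra)


if __name__ == "__main__":
    print(f"total gain = {total_gain():.2f}X")
```

Under this assumption the combined gain is about 3.7X, consistent with the ">3X" claim.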

Environmental Temperature
• Effects of environmental temperature on power-reliability
o Use fan speed to vary the temperature in [34 °C, 50 °C]
o On-board temperature monitored via PMBus
• Temperature effects on power consumption
o ↓ Temp → ↓ Power (direct relation between power and temperature)
o With undervolting, the impact of temperature on power consumption is reduced.
• Temperature effects on reliability
o ↓ Temp → ↑ Accuracy loss (inverse relation between accuracy loss and temperature)
o In our temperature range, V_min and V_crash do not change significantly.
[Figure: results shown for GoogleNet]

https://legato-project.eu/
legato-project@bsc.es

Execution of VGG-16 on Nvidia Jetson TX2
[Figure: per-core timeline [s] for Denver 0, Denver 1, and A57 2 to A57 5, showing the Training Phase followed by Pipeline Stages 1 to 3, with the VGG-16 layers (Conv, MAXPOOL, FC) partitioned across the stages]
Best configuration: 3-stage pipeline, 6-5-10 layer partitioning, 2-2-2
core assignment

Preliminary Results:
Comparison of the Pipe Search algorithm
with the brute-force (exhaustive search) algorithm

Approach               Number of trials  Training time [s]  Total execution time of 2000 frames [s]  Best configuration*  Seed**
Exhaustive Search      1970              8129.21            8166.9                                   3,7,4,10,2,1,1,      ….
Pipe Search Algorithm  41                116.305            2915.91                                  3,7,4,10,2,1,1,      3,6,5,10,2,1,1,

* Throughput-maximizing pipeline configuration. The sequence contains three distinct
sections:
1- Number of pipeline stages
2- Layer distribution among pipeline stages
3- Core placement for each pipeline stage
** The seed is a configuration calculated using computational hints. A good seed
minimizes the number of trials in the search-space exploration.

Experimental Setup:
Hardware: Nvidia Jetson TX2
Used cores: 2 Denver, 2 A57
Application: VGG-16
21 layers in total, 16 of which are compute-intensive
Input frames = 2000
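
The configuration strings in the table can be decoded mechanically from the three sections described in the footnote. The Python sketch below is only an illustration of that encoding, not the Pipe Search algorithm itself; `PipelineConfig` and `parse_config` are hypothetical names, and the core-placement values are kept as opaque numbers because the slide does not define what each value means.

```python
from dataclasses import dataclass
from typing import List

TOTAL_LAYERS = 21  # VGG-16 layer count as stated on the slide


@dataclass
class PipelineConfig:
    stages: int                  # section 1: number of pipeline stages
    layers_per_stage: List[int]  # section 2: layer distribution
    core_placement: List[int]    # section 3: kept opaque (meaning not given)


def parse_config(text: str, total_layers: int = TOTAL_LAYERS) -> PipelineConfig:
    """Decode 'stages, layers..., placement...' into its three sections."""
    # Empty tokens (e.g. after a trailing comma) are skipped.
    values = [int(tok) for tok in text.split(",") if tok.strip()]
    if not values:
        raise ValueError("empty configuration")
    stages = values[0]
    if len(values) != 1 + 2 * stages:
        raise ValueError(f"expected {1 + 2 * stages} numbers, got {len(values)}")
    layers = values[1:1 + stages]
    placement = values[1 + stages:]
    if sum(layers) != total_layers:
        raise ValueError(f"layers sum to {sum(layers)}, expected {total_layers}")
    return PipelineConfig(stages, layers, placement)


if __name__ == "__main__":
    print(parse_config("3,7,4,10,2,1,1,"))  # best configuration from the table
    print(parse_config("3,6,5,10,2,1,1,"))  # seed used by Pipe Search
    # Ratios computed directly from the table values.
    print(f"trials: {1970 / 41:.1f}x fewer")
    print(f"training time: {8129.21 / 116.305:.1f}x shorter")
```

Both strings from the table decode to three stages whose layer counts sum to 21, and the seed differs from the best configuration only in how the first two stages split their layers (6-5 versus 7-4).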
