FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency

FPGA Undervolting for Energy-Efficiency
30th International Conference on Field-Programmable Logic and Applications (FPL).
3rd September, 2020.
Behzad Salami
Barcelona Supercomputing Center (BSC)

2
Outline
• Motivation and Background
• Methodology and Results
- Undervolting FPGA On-Chip Memories
- Undervolting FPGA Internal Components
• More Information

3
Aggressive Undervolting
• Aggressive undervolting: underscaling the supply voltage below the nominal and safe level.
- Power/Energy Efficiency: Reduces dynamic and static power quadratically and linearly, respectively.
- Reliability: Increases the circuit delay and, in turn, causes timing faults.
• Dual/Multi-Vdd, DVS, and DVFS: mechanisms that are similar to, but different from, aggressive undervolting.
- Similarity: They also underscale the supply voltage.
- Difference: They underscale only down to a certain safe level, usually constrained by vendors.
[Figure: trade-off between reliability and power/energy efficiency]
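The scaling rule on this slide can be made concrete with a first-order model. This is a minimal sketch, assuming fixed frequency, dynamic power proportional to V^2, and static power proportional to V, as the slide states. The function name and the 70/30 dynamic/static split are illustrative assumptions, not measured values.

```python
def relative_power(v, v_nom=1.0, dynamic_share=0.7):
    """Power at voltage v relative to the power at v_nom (first-order model).

    dynamic_share is the assumed fraction of nominal power that is dynamic;
    the remainder is treated as static power.
    """
    scale = v / v_nom
    dynamic = dynamic_share * scale ** 2      # quadratic in voltage
    static = (1.0 - dynamic_share) * scale    # linear in voltage
    return dynamic + static


for v in (1.0, 0.8, 0.61):
    print(f"V = {v:.2f} V -> {relative_power(v):.2f}x nominal power")
```

Measured savings can differ from this model, so it is only a way to reason about the trend.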

4
State-of-the-art
• Aggressive undervolting has been shown to reduce energy consumption significantly.
• Real hardware:
- Devices:
  - CPUs: Itanium II (ISCA2014), X86 (IOLTS2017), ARM (HPCA2017)
  - GPUs: NVidia (Micro2015)
  - DRAMs: Multiple Brands (Sigmetrics2017)
  - FPGA: This work
- Focus of the previous works:
  - Voltage guardband
  - Minimum safe voltage, i.e., Vmin prediction
  - Fault characterization and mitigation
  - Chip-to-chip, core-to-core, and workload-to-workload variation
  - ….
• Simulation-based studies: more straightforward and cover more parameters, but are less precise.
- ASIC DNN: Minerva (Micro2016), Thundervolt (DAC2018)
- CPU: Bravo (HPCA2017)
- Network On-Chip (HPCA2014)

5
Undervolting on FPGAs: Motivation
• The share of FPGAs in large data centers is growing; FPGAs are expected to be in 30% of datacenter servers by 2020 (Top500 news).
• In comparison to ASICs, the energy efficiency of FPGAs is a serious concern: FPGAs are 10X-100X less efficient.
• The nominal voltage of FPGAs is naturally reduced from one generation to the next.
[Figure: nominal voltage across FPGA generations, with undervolting; sources: Intel/Altera, Xilinx]

7
Undervolting FPGA On-Chip Memories
1. Undervolting FPGAs
- Voltage guardband
- Overall power and reliability trade-off
2. Fault characterization in FPGA on-chip memories
- Fault type, location, and rate
- Temperature, Chip
3. Low-voltage FPGA-based Neural Network (NN)
- Power consumption and NN accuracy characterization
- Fault mitigation techniques
  - Application-aware technique
  - Built-in ECC

8
Voltage Scaling Capability in Xilinx
Evaluated Xilinx platforms:
• VC707: performance-efficient design
• KC705: power-efficient design (A & B)
• ZC702: ARM integrated with FPGA
Voltage distribution on Xilinx platforms:
• Voltage regulator
- Power Management Bus (PMBus).
- Hardwired to the host.
• Voltage rails: VCCINT, VCCBRAM
[Figure: voltage distribution on VC707]
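Setting a rail voltage over PMBus means encoding the target voltage in the regulator's data format. The slides do not name the regulator or its format, so the sketch below assumes the standard PMBus LINEAR16 format, in which the voltage equals mantissa x 2^exponent and the exponent comes from the low five bits of VOUT_MODE. The helper names are hypothetical, and the actual bus write (for example through an I2C/SMBus driver) is left out.

```python
VOUT_MODE = 0x20     # PMBus command code: output voltage data format
VOUT_COMMAND = 0x21  # PMBus command code: output voltage setpoint


def linear16_exponent(vout_mode_byte):
    """Decode the 5-bit two's-complement exponent from a VOUT_MODE byte."""
    exp = vout_mode_byte & 0x1F
    return exp - 32 if exp > 15 else exp


def volts_to_linear16(volts, exponent):
    """Encode a voltage as the 16-bit unsigned mantissa sent with VOUT_COMMAND."""
    mantissa = round(volts / 2.0 ** exponent)
    if not 0 <= mantissa <= 0xFFFF:
        raise ValueError("voltage out of range for this exponent")
    return mantissa


def linear16_to_volts(mantissa, exponent):
    """Decode a LINEAR16 mantissa back to volts."""
    return mantissa * 2.0 ** exponent


# Assumed example: a VOUT_MODE byte of 0x14 gives an exponent of -12,
# i.e. a resolution of about 0.24 mV per step.
exponent = linear16_exponent(0x14)
word = volts_to_linear16(0.61, exponent)
print(hex(word), linear16_to_volts(word, exponent))
```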

9
Overall Voltage Behavior
• SAFE (guardband): no observable faults in the voltage guardband below Vnom.
• CRITICAL: faults manifest below Vmin, the minimum safe voltage.
• CRASH: the FPGA stops operating below Vcrash, the minimum operating voltage.
• Voltage guardband: added to ensure correct operation under worst-case environmental conditions and process variation.
• Experimental conditions: ambient temperature and maximum operating frequency.
[Figure: VCCBRAM (V) from 0 to 1 per platform (VC707, ZC702, KC705-A, KC705-B), divided at Vnom, Vmin, and Vcrash into GUARDBAND, CRITICAL, and CRASH regions]
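The three regions can be written as a small classifier. This is a sketch under one assumption about the boundaries: a voltage equal to Vmin is still treated as safe and a voltage equal to Vcrash as still operating, since the slide places faults and crashes below those levels. The function name is hypothetical; the example thresholds are the VC707 values reported in this deck (Vmin = 0.61V, Vcrash = 0.54V).

```python
def voltage_region(v, v_min, v_crash):
    """Classify a supply voltage into SAFE, CRITICAL, or CRASH."""
    if v < v_crash:
        return "CRASH"      # FPGA stops operating
    if v < v_min:
        return "CRITICAL"   # FPGA operates, but faults manifest
    return "SAFE"           # guardband: no observable faults


for v in (1.0, 0.61, 0.58, 0.54, 0.53):
    print(v, voltage_region(v, v_min=0.61, v_crash=0.54))
```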

10
Experimental Methodology
• FPGA BRAMs:
- A hierarchy of sets of bit-cells, distributed over the chip.
- Size of each BRAM: 16 kbits
• Experimental Methodology:
- HW: Transfer the content of the BRAMs to the host.
- SW: Analyze the data and adjust the voltage of the BRAMs.
[Figure: floorplan of VC707 (2060 BRAMs)]
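The software step can be sketched as a comparison between the pattern written to a BRAM and the content read back at reduced voltage. This is a minimal illustration, not the authors' tool: the function names are hypothetical, a word width of 16 bits is assumed, and "per 1 Mbit" is taken to mean per 10^6 bits.

```python
def count_bit_faults(written, read_back):
    """Count bit positions that differ between two equal-length word lists."""
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))


def fault_rate_per_mbit(written, read_back, word_bits=16):
    """Faulty bits normalized to 1 Mbit (assumed here to be 10**6 bits)."""
    total_bits = len(written) * word_bits
    return count_bit_faults(written, read_back) * 1_000_000 / total_bits


# One 16-kbit BRAM modeled as 1024 rows of 16 bits, initialized to all ones.
written = [0xFFFF] * 1024
read_back = list(written)
read_back[3] = 0xFFFE    # one bit flipped in row 3
read_back[10] = 0x7FFF   # one bit flipped in row 10
print(count_bit_faults(written, read_back), fault_rate_per_mbit(written, read_back))
```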

11
Overall Trade-offs on BRAMs- Power & Reliability
[Figure: VC707. BRAM power (Watts) and fault rate (per 1 Mbit) vs. VCCBRAM (V), swept from 1V down to 0.55V, with an inset covering 0.61V to 0.54V. Marked levels: Vnom=1V, Vmin=0.61V, Vcrash=0.54V]

12
Overall Trade-offs on BRAMs- Multiple Platforms
[Figure: BRAM power and fault rate (per 1 Mbit) vs. VCCBRAM (V), one panel per platform, each with an inset for the region below Vmin]
Platform   Vnom   Vmin    Vcrash   Power unit
VC707      1V     0.61V   0.54V    Watts
ZC702      1V     0.59V   0.53V    mWatts
KC705-A    1V     0.61V   0.54V    Watts
KC705-B    1V     0.57V   0.54V    Watts

14
Fault Characterization at CRITICAL Region
Fault variability among FPGA BRAMs: fully non-uniform fault distribution
• Majority of BRAMs do not experience many faults.
• Setup: VC707 (2060 BRAMs), VCCBRAM @ Vcrash = 0.54V, temperature @ ambient.
[Figure: BRAM fault rate (%) per BRAM, from 0.0% to 1.5%]
K-means clustering of BRAMs:
Cluster           % BRAMs   Average Fault Rate (%)
High-vulnerable   1.8%      0.86%
Mid-vulnerable    9.4%      0.24%
Low-vulnerable    52.3%     0.03%
Zero-vulnerable   36.3%     0.0%
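The grouping into vulnerability classes can be illustrated with a one-dimensional k-means over per-BRAM fault rates. The slide does not say which features or initialization were used, so this is a sketch under assumptions: the only feature is the fault rate, the initial centroids are spread over the sorted data, and the sample rates are made up for illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Made-up per-BRAM fault rates (%), not measured data.
rates = [0.0, 0.0, 0.01, 0.02, 0.3, 0.31, 0.9, 0.95]
centroids, labels = kmeans_1d(rates, k=3)
print(centroids, labels)
```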

15
Fault Characterization at CRITICAL Region
Type of undervolting faults: permanent faults at a specific voltage
• There is no considerable change in the rate and location of faults over time.
• Validated by repeating the experiments 100 times.
[Figure: fault rate (per 1 Mbit) vs. run number (1 to 100), showing individual runs and the cumulative median]
• The physical location of BRAMs is extracted using Vivado.
• Fault Variation Map (FVM): Fault rate mapped to the physical location of BRAMs.
FVM can potentially be used in fault mitigation techniques!
[Figure: FVM @ (VCCBRAM @ Vcrash, T = ambient, chip = VC707); FPGA x-axis, FPGA y-axis, BRAM fault rate (%)]
Three orthogonal parameters have a significant impact on the rate and location of faults:
1. Voltage
2. Temperature
3. Chip
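A Fault Variation Map can be represented as a dictionary from physical BRAM coordinates to fault rate. The sketch below assumes BRAM site names of the form RAMB18_X<column>Y<row>, as reported by Vivado for block RAM sites; the function name and the sample rates are hypothetical.

```python
import re

# Assumed site-name format, e.g. "RAMB18_X3Y12" or "RAMB36_X0Y5".
SITE_PATTERN = re.compile(r"RAMB\d+_X(\d+)Y(\d+)")


def build_fvm(fault_rate_by_site):
    """Return {(x, y): fault_rate} from {site_name: fault_rate}."""
    fvm = {}
    for site, rate in fault_rate_by_site.items():
        match = SITE_PATTERN.fullmatch(site)
        if match is None:
            raise ValueError(f"unrecognized BRAM site name: {site}")
        fvm[(int(match.group(1)), int(match.group(2)))] = rate
    return fvm


# Made-up fault rates (%), for illustration only.
fvm = build_fvm({"RAMB18_X0Y0": 0.0, "RAMB18_X3Y12": 0.86})
print(fvm)
```

A separate FVM is needed per voltage, temperature, and chip, since each of the three parameters changes the rate and location of faults.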

16
Fault Characterization (Voltage Impacts)
Location of undervolting faults: Fault Inclusion Property (FIP)
• FIP: A bit that is corrupted at a specific voltage stays faulty at lower voltages as well.
• FIP can be used in mitigation techniques.
[Figure: illustration of FIP]
[Figure: FIP shown as fault rate (per 1 Mbit, log scale from 0.1 to 10000) vs. VCCBRAM (V) from 0.61V to 0.54V for VC707]
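FIP can be stated as a subset relation: the set of faulty bits at a given voltage is contained in the set observed at every lower voltage. A minimal check, assuming faults are recorded as sets of hypothetical (bram, row, bit) tuples keyed by voltage:

```python
def satisfies_fip(faulty_bits_by_voltage):
    """True if each fault set is contained in the set at the next lower voltage."""
    voltages = sorted(faulty_bits_by_voltage, reverse=True)  # high to low
    return all(
        faulty_bits_by_voltage[high] <= faulty_bits_by_voltage[low]
        for high, low in zip(voltages, voltages[1:])
    )


# Made-up fault locations, for illustration only.
observed = {
    0.60: {(7, 100, 3)},
    0.58: {(7, 100, 3), (12, 5, 0)},
    0.56: {(7, 100, 3), (12, 5, 0), (12, 6, 15)},
}
print(satisfies_fip(observed))
```

Because set inclusion is transitive, checking adjacent voltage pairs is enough.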

17
Fault Characterization (Temperature Impacts)
• Methodology: Adjusting the environmental temperature and monitoring the on-board temperature via PMBus.
• Experimental Observation:
- At higher temperatures, the fault rate is significantly reduced.
• Inverse Temperature Dependency (ITD) (1):
- For nano-scale technology nodes, under ultra-low-voltage operation, the circuit delay reduces at higher temperatures since the supply voltage approaches the threshold voltage.
[Figure: fault rate (per 1 Mbit) vs. VCCBRAM (V) at T = 50 °C, 60 °C, 70 °C, and 80 °C. Practical confirmation of Inverse Temperature Dependency (ITD)]
(1) Neshatpour, K., Burleson, W., Khajeh, A., & Homayoun, H. (2018). Enhancing Power, Performance, and Energy Efficiency in Chip Multiprocessors Exploiting Inverse Thermal Dependence. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (4), 778-791.

18
Fault Characterization (Chip Impacts)
• Methodology: Repeating the experiments on identical samples of KC705 (A & B).
• Observations:
- Fault rates vary significantly, by more than 4X.
- Fault Variation Maps (FVMs) are entirely different.
[Figure: fault location @ VCCBRAM = Vcrash for KC705-B and KC705-A]
[Figure: fault rate (per 1 Mbit, 0 to 300) vs. VCCBRAM (V), from 0.57V to 0.54V for KC705-B and from 0.61V to 0.54V for KC705-A]
Even identical samples of the same chip have totally different reliability behavior, due to process variation and aging effects.

20
Experimental Methodology
Neural Network (NN)
- Type: Fully-connected classifier
- Total number of weights: ~1.5 million
- Activation function: Logsig (logarithmic sigmoid)
Major benchmark
- Name-type: MNIST, handwritten digit images
- Number of images: Training: 60000, Classification: 10000
- Number of pixels per image: 28*28=784
- Number of output classes: 10
Additional benchmarks
- Names: Forest and Reuters
Data representation model
- Type: 16-bit fixed-point
- Precision: Minimum sign and digit bits per layer
An example implementation on VC707
- Frequency: 100 MHz
- BRAM usage (total: 2060): 70.8%
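One way to read "minimum sign and digit per layer" is that each layer gets the smallest sign-plus-integer field that covers its largest weight, and the remaining bits of the 16-bit word hold the fraction. The sketch below follows that reading; it is an assumption about the scheme, not the authors' implementation, and all function names are hypothetical.

```python
def integer_bits_needed(weights):
    """Smallest sign-plus-integer bit count whose range covers all weights."""
    largest = max(abs(w) for w in weights)
    bits = 1  # sign bit only: representable range is [-1, 1)
    while largest >= 2 ** (bits - 1):
        bits += 1
    return bits


def quantize(w, int_bits, total_bits=16):
    """Round w to a signed fixed-point word with (total_bits - int_bits) fraction bits."""
    frac_bits = total_bits - int_bits
    raw = round(w * 2 ** frac_bits)
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return max(lo, min(hi, raw))  # saturate instead of wrapping around


def dequantize(raw, int_bits, total_bits=16):
    """Convert a fixed-point word back to a real value."""
    return raw / 2 ** (total_bits - int_bits)


layer = [-3.2, 1.1, 0.25]  # made-up weights for one layer
bits = integer_bits_needed(layer)
print(bits, [dequantize(quantize(w, bits), bits) for w in layer])
```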

21
NN Implementation on FPGA
• Input data: off-chip DDR memory.
• Weights: on-chip FPGA BRAM.
• Computation: Streaming data onto DSPs and LUTs.
• We undervolt VCCBRAM:
- Weights of the NN are potentially affected.
[Figure: FPGA implementation]
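Since the weights live in BRAM, an undervolting fault shows up as a flipped bit in a stored weight word. The sketch below shows how much the position of the flipped bit matters. It assumes 16-bit two's-complement words with 15 fraction bits for the example values; the function name is hypothetical.

```python
def flip_bit(raw_word, bit, word_bits=16):
    """Flip one bit of a two's-complement word and return the signed result."""
    unsigned = (raw_word & (2 ** word_bits - 1)) ^ (1 << bit)
    if unsigned >= 2 ** (word_bits - 1):
        return unsigned - 2 ** word_bits
    return unsigned


FRAC_BITS = 15          # assumed format: 1 sign bit, 15 fraction bits
weight = 16384          # 0.5 in this format
for bit in (0, 14, 15):
    faulty = flip_bit(weight, bit)
    print(f"bit {bit}: {weight / 2 ** FRAC_BITS} -> {faulty / 2 ** FRAC_BITS}")
```

A flip in a low-order bit barely changes the weight, while a flip in the sign bit changes it completely.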

22
Low-Voltage FPGA-based NN
Power saving:
• Significant power reduction down to the minimum safe voltage, i.e., Vmin (by eliminating the voltage guardband).
• Additional 40% power reduction below the voltage guardband.
NN accuracy loss:
• The NN classification error increases exponentially from 2.56% (inherent classification error) to 6.74% when undervolting BRAMs beyond Vmin.
• Fault mitigation techniques to prevent the accuracy loss:
- Application-aware mechanism
- Built-in ECC
[Figure: on-chip power (Watts, 0 to 10), split into BRAM and Rest]
VCCBRAM           BRAM (W)   Rest (W)
Vnom = 1V         2.39       6.47
Vmin = 0.61V      0.25       6.47
Vcrash = 0.54V    0.15       6.47

23
Intelligently-Constrained BRAM placement (ICBP)
• Below the voltage guardband, in the CRITICAL voltage region, we present ICBP to prevent the loss of NN classification accuracy.
• Core Idea: Map the weights that are most sensitive to faults onto robust BRAMs.
- Q: Which are the most-sensitive NN weights? A: Those in deeper layers.
• ICBP: Additional constraints in the FPGA placement stage.
[Figure: normalized vulnerability per NN layer]
Layer    Normalized Vulnerability
LAYER0   1
LAYER1   1.4
LAYER2   2.1
LAYER3   3
LAYER4   5.7
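The core idea of ICBP can be sketched as a greedy assignment: sort layers from most to least vulnerable, sort BRAMs from lowest to highest fault rate, and hand out BRAMs in that order. This is an illustration of the idea only; the real technique is applied as placement constraints in the FPGA tool flow. The function name, the per-layer BRAM demand, and the BRAM fault rates are hypothetical.

```python
def icbp_assign(layer_vulnerability, layer_bram_demand, bram_fault_rate):
    """Greedy placement: the most vulnerable layers get the lowest-fault BRAMs."""
    robust_first = sorted(bram_fault_rate, key=bram_fault_rate.get)
    if sum(layer_bram_demand.values()) > len(robust_first):
        raise ValueError("not enough BRAMs for the requested layers")
    placement, cursor = {}, 0
    for layer in sorted(layer_vulnerability, key=layer_vulnerability.get, reverse=True):
        need = layer_bram_demand[layer]
        placement[layer] = robust_first[cursor:cursor + need]
        cursor += need
    return placement


vulnerability = {"LAYER0": 1.0, "LAYER3": 3.0, "LAYER4": 5.7}   # from the slide
demand = {"LAYER0": 2, "LAYER3": 1, "LAYER4": 1}                # made up
fault_rate = {"b0": 0.5, "b1": 0.0, "b2": 0.1, "b3": 0.3}       # made up (FVM)
print(icbp_assign(vulnerability, demand, fault_rate))
```

The fault rates come from the FVM, which is why ICBP needs the FVM as a pre-processing step.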

24
ICBP Evaluation
• Pros:
- Significant accuracy loss prevention.
- No power and performance overhead.
• Cons:
- Needs the FVM as a pre-processing step. Built-in ECC is evaluated as a technique without this cost.
[Figure: NN classification error (%) with default placement and with ICBP, and BRAM power (Watts), vs. VCCBRAM (V) from 0.61V to 0.54V. Inherent NN error: 2.56%]

25
Built-in ECC
• Built-in ECC of FPGA BRAMs:
- Hamming code.
- Two (2) additional bits per row are reserved as parities.
- SECDED (Single-Error Correction and Double-Error Detection).
• Experimental Methodology:
- Activate built-in ECC under low-voltage read operations.
• Experimental Observations:
- >90% fault correction
- >7% fault detection (not correction)
[Figure: fault rate (per 1 Mbit) vs. VCCBRAM (V) from 0.61V to 0.54V, without ECC and with ECC]
[Figure: BRAM row with parity bits; fault types: single-bit, double-bit, multiple-bit]
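SECDED behavior can be demonstrated with a toy extended Hamming (8,4) code: any single flipped bit is corrected, and any two flipped bits are detected but not corrected. This is a sketch of the principle only. The word width and parity layout of the built-in BRAM ECC are different, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):  # each parity bit covers the positions whose index has bit p set
        for pos in range(1, 8):
            if pos & p and pos != p:
                code[p] ^= code[pos]
    code[0] = sum(code[1:]) % 2  # overall parity makes the total number of ones even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when a double error is detected."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_ok = sum(code) % 2 == 0
    fixed = list(code)
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        fixed[syndrome] ^= 1  # single error; syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        return None, "double-error detected"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                 # inject a single-bit fault
print(decode(word))
word[2] ^= 1                 # inject a second fault
print(decode(word))
```

Faults that flip three or more bits in one word are beyond what this class of code guarantees, which is consistent with ECC not covering every fault on the slide.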

26
ECC for NN Accelerator
ECC efficiency to prevent NN accuracy loss:
[Figure: classification error (%, 0% to 4%) vs. VCCBRAM (V), without ECC and with ECC. Inherent NN error: 2.56%]
ECC area and power costs:
Area Utilization (%)
              BRAM   LUT   FF
Without ECC   96%    3%    0.25%
With ECC      100%   12%   0.25%
BRAM Power (W)
              Vnom = 1V   Vmin = 0.61V   Vcrash = 0.54V
Without ECC   2.4         0.31           0.198
With ECC      ----        ----           0.211
• Pros:
- Significant accuracy loss prevention.
- Negligible power and performance overhead.
• Cons:
- Requires larger data rows/lines.
- Not all FPGAs are equipped with this technique.

28
Executive Summary
• Motivation: Power consumption of neural networks is a main concern
- Hardware acceleration: GPUs, FPGAs, and ASICs
• Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
• Goal: Bridge the power-efficiency gap between ASIC- and FPGA-based neural networks by undervolting below the nominal level
• Evaluation Setup
- 5 image classification workloads
- 3 Xilinx UltraScale+ ZCU102 platforms
- On-chip voltage rail for internal FPGA components
• Main Results
- Large voltage guardband (i.e., 33%)
- >3X power-efficiency gain

29
Outline
• Our Goal
• Methodology
• Results
- Overall Voltage Behavior
- Power-Reliability Trade-off
- Frequency Underscaling
- Environmental Temperature
• Prior Works
• Summary, Conclusion, and Future Works

30
Motivation and Background
• Motivation
- Power consumption of neural networks is a main concern
- Hardware acceleration: GPUs, FPGAs, and ASICs
- FPGAs: Getting popular but less power-efficient than equivalent ASICs
- Large voltage guardbands (12-35%) for CPUs, GPUs, DRAMs
- Is there potential in undervolting FPGAs for the power-efficiency of neural networks?
• Background
- Neural Networks: Widely deployed, with an inherent resilience to errors
- FPGAs: Higher throughput than GPUs and better flexibility than ASICs
- Undervolting: Reduces power consumption, but may incur reliability or performance issues

32
Our Goal
• Primary Goal
- Bridge the power-efficiency gap between ASIC- and FPGA-based neural networks by undervolting (i.e., underscaling voltage below the nominal level)
• Secondary Goals
- Study the voltage behavior of real FPGAs (e.g., guardband)
- Study the power-efficiency gain of undervolting for neural networks
- Study the reliability overhead
- Study frequency underscaling to prevent the accuracy loss
- Study the effect of environmental temperature

34
Overall Methodology
• 5 CNN image classification workloads, i.e., VGGNet, GoogleNet, AlexNet, ResNet50, Inception.
• Xilinx DNNDK to map CNN into FPGA
- By default optimized for INT8
• 3 identical samples of Xilinx ZCU102
- Zynq UltraScale+ architecture
- Hard-core ARM for data orchestration
- FPGA for CNN acceleration
• 1 on-chip voltage rail, via PMBus
- VCCINT: DSPs, LUTs, buffers, …
- Vnom = 850mV (set by manufacturer)
Vast majority (>99.9%) of the power is dissipated on VCCINT

36
Overall Voltage Behavior
• Guardband: Large region below nominal level (Vnom = 850mV)
- No performance or reliability loss
- Added by the vendor to ensure correct operation under worst-case conditions
- Large guardband, average of 33%
• Critical: Narrower region below guardband (Vmin = 570mV)
- A narrow voltage region
- Neural network accuracy collapse
• Crash: FPGA crashes below critical region (Vcrash = 540mV)
- FPGA stops operating
Slight variation of voltage behavior across platforms and benchmarks
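The 33% guardband follows directly from the nominal and minimum safe voltages on this slide. A short worked check, with hypothetical function names:

```python
def guardband_percent(v_nom_mv, v_min_mv):
    """Guardband size as a percentage of the nominal voltage."""
    return 100.0 * (v_nom_mv - v_min_mv) / v_nom_mv


def critical_width_mv(v_min_mv, v_crash_mv):
    """Width of the critical region in millivolts."""
    return v_min_mv - v_crash_mv


print(f"guardband: {guardband_percent(850, 570):.1f}%")
print(f"critical region: {critical_width_mv(570, 540)} mV wide")
```

With Vnom = 850mV and Vmin = 570mV, the guardband is 280mV out of 850mV, which rounds to 33%; the critical region between Vmin and Vcrash spans only 30mV.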

38
Power-Reliability Trade-off
Power-efficiency (GOPs/W) gain
• >3X power-efficiency gain (2.6X by eliminating the guardband and a further 43% in the critical region)
• Slight variation across 3 platforms and 5 workloads
Reliability overhead (i.e., CNN accuracy loss)
• No accuracy loss in the guardband, accuracy collapse in the critical region
• Slight variation across 3 platforms and 5 workloads
[Figure: one panel per workload: VGGNet, GoogleNet, AlexNet, ResNet, Inception]

40
Frequency Underscaling
• Simultaneous frequency underscaling to prevent CNN accuracy collapse in the critical voltage region
• For each voltage level below Vmin, we found Fmax, the maximum operating frequency at which there is no accuracy loss
• Leads to performance and energy-efficiency loss
VCCINT (mV)   Fmax (MHz)   GOPs (Norm)   Power (Norm)   GOPs/W (Norm)   GOPs/J (Norm)
570           333          1             1              1               1
565           300          0.94          0.97           0.97            0.87
560           250          0.83          0.84           0.99            0.75
555           250          0.83          0.78           1.06            0.8
550           250          0.83          0.75           1.1             0.83
545           250          0.83          0.74           1.12            0.84
540           200          0.7           0.56           1.25            0.75
(Voltage steps = 5mV, frequency steps = 50MHz; shown for GoogleNet)
• Best setting for high performance and energy-efficiency: 570mV at 333MHz (highest GOPs and GOPs/J in the table)
• Best setting for power-efficiency: 540mV at 200MHz (highest GOPs/W in the table)
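The GOPs/W column can be reproduced from the normalized GOPs and power columns, which also identifies the most power-efficient operating point. A small sketch using the table values for GoogleNet; the variable and function names are hypothetical.

```python
# (VCCINT in mV, Fmax in MHz, normalized GOPs, normalized power), from the table.
ROWS = [
    (570, 333, 1.00, 1.00),
    (565, 300, 0.94, 0.97),
    (560, 250, 0.83, 0.84),
    (555, 250, 0.83, 0.78),
    (550, 250, 0.83, 0.75),
    (545, 250, 0.83, 0.74),
    (540, 200, 0.70, 0.56),
]


def gops_per_watt(gops, power):
    """Normalized power-efficiency: throughput divided by power."""
    return gops / power


for vccint, fmax, gops, power in ROWS:
    print(f"{vccint} mV @ {fmax} MHz: GOPs/W = {gops_per_watt(gops, power):.2f}")

best = max(ROWS, key=lambda row: gops_per_watt(row[2], row[3]))
print("most power-efficient setting:", best[0], "mV at", best[1], "MHz")
```

Power-efficiency keeps improving as voltage and frequency drop together, while raw throughput (GOPs) falls, which is the trade-off the slide describes.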

42
Environmental Temperature
• Effects of environmental temperature on power-reliability
- Vary the fan speed to test temperatures in [34 °C, 50 °C]
- On-board temperature monitored via PMBus
• Temperature effects on power consumption
- ↓ Temp → ↓ Power (direct relation of power and temperature)
- With undervolting, the impact of temperature on power consumption reduces.
• Temperature effects on reliability
- ↓ Temp → ↑ Accuracy loss (inverse relation of accuracy loss and temperature)
- In our temperature range, Vmin and Vcrash do not change significantly.
[Figure: shown for GoogleNet]

44
Prior Works
• Undervolting
- Studies for off-the-shelf real CPUs, GPUs, ASICs, DRAMs
- Large voltage guardband (from 12% to 35%) for many devices
- This work extends such studies to off-the-shelf FPGAs, especially for neural network acceleration, and confirms large guardbands (i.e., 33%)
• Power-Efficient Neural Networks
- Studies on architectural-, hardware-, and software-level techniques
- Undervolting in neural network ASIC accelerators (e.g., GreenTPU-DAC’19)
- This work proposes hardware-level undervolting for further power-saving (>3X) in FPGAs.
• Reliability in Neural Networks
- Analytical and simulation-based studies (e.g., Thundervolt-DAC’18)
- Some studies on real hardware (e.g., EDEN-MICRO’19)
- This work studies the reliability of neural networks on real FPGAs when operating at reduced voltage levels.

46
Summary, Conclusion, and Future Works
• Summary
- We improve the power-efficiency (>3X) of off-the-shelf FPGAs via undervolting for neural network accelerators:
  - 2.6X by eliminating the guardband (i.e., 33%) without any cost
  - 43% by further undervolting below the guardband, at the cost of either accuracy loss, when the frequency is not underscaled, or performance loss, when the frequency is underscaled
• Conclusion
- Undervolting is an effective way to achieve significant power-saving for FPGA-based neural network accelerators
• Future Works
- HW & SW extension of our undervolting for FPGA clusters and other neural network models and tools

48
References
• B. Salami, et al., "An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration," in 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2020.
• B. Salami, et al., "Comprehensive Evaluation of Supply Voltage Underscaling in FPGA on-chip Memories," in 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018.
• B. Salami, et al., "Evaluating Built-in ECC of FPGA on-chip Memories for the Mitigation of Undervolting Faults," in 27th Euromicro International Conference on Parallel, Distributed, and Network-based Processing (PDP), 2019.
• B. Salami, et al., "Fault Characterization Through FPGAs Undervolting," in 28th International Conference on Field Programmable Logic & Applications (FPL), 2018.
• B. Salami, et al., "On the Resilience of RTL NN Accelerators: Fault Characterization and Mitigation," in 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2018.

49
Ongoing and Future Extensions
• Circuit-level simulation for validating the results
• Expansion to more FPGAs (cluster) and more workloads (DNN and non-DNN)
• Heterogeneous systems, including hw-sw systems and more voltage rails
• Design of voltage-optimized FPGA components
• Integration with error handling systems like checkpointing

50
Acknowledgment
• Adrian Cristal
• Osman Unsal
• Fahrettin Koc
• Baturay Onural
• Ismail Emir Yuksel

behzad.salami@bsc.es