Tutorial by Behzad Salami, Osman Unsal and Leonardo Bautista at 30th International Conference on Field-Programmable Logic and Applications (FPL2020), 3 September 2020
FPGA Undervolting and Checkpointing for Energy-Efficiency and Error-Resiliency
1. FPGA Undervolting for Energy-Efficiency
30th International Conference on Field-Programmable Logic and Applications (FPL).
3rd September, 2020.
Behzad Salami
Barcelona Supercomputing Center (BSC)
2. 2
Outline
• Motivation and Background
• Methodology and Results
- Undervolting FPGA On-Chip Memories
- Undervolting FPGA Internal Components
• More Information
3. 3
Aggressive Undervolting
• Aggressive undervolting: Underscaling the supply voltage below the
nominal and safe level:
Power/Energy Efficiency: Reduces dynamic and static power quadratically
and linearly, respectively.
Reliability: Increases the circuit delay and, in turn, causes timing faults.
• Dual/Multi-Vdd, DVS, and DVFS: Mechanisms similar to, but different from,
aggressive undervolting:
Similarity: Underscaling the supply voltage.
Difference: These mechanisms lower the voltage only down to a certain safe
level, usually constrained by vendors.
[Figure: trade-off between reliability and power/energy efficiency]
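The quadratic and linear scaling stated on this slide can be illustrated with a small sketch. It assumes a fixed clock frequency and a hypothetical split between dynamic and static power (`dynamic_share`, which is not a number from the slides); it is a first-order model, not a measurement.

```python
def relative_power(v, v_nom, dynamic_share=0.5):
    """Total power at voltage v, relative to the power at v_nom.

    First-order model from the slide: dynamic power scales with V^2,
    static power scales with V. dynamic_share is a hypothetical fraction
    of the nominal power that is dynamic (an assumption, not slide data).
    """
    if not 0.0 <= dynamic_share <= 1.0:
        raise ValueError("dynamic_share must be in [0, 1]")
    ratio = v / v_nom
    dynamic = dynamic_share * ratio ** 2       # quadratic term
    static = (1.0 - dynamic_share) * ratio     # linear term
    return dynamic + static


if __name__ == "__main__":
    for v in (1.0, 0.9, 0.8, 0.7, 0.6):
        print(f"V = {v:.2f} V -> relative power {relative_power(v, 1.0):.3f}")
```

The model ignores the reliability side of the trade-off: it says nothing about the voltage at which timing faults start to appear.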
4. 4
State-of-the-art
• Aggressive undervolting has proven effective at reducing energy
consumption.
Real hardware:
Devices:
CPUs: Itanium II (ISCA2014), x86 (IOLTS2017), ARM (HPCA2017)
GPUs: NVIDIA (MICRO2015)
DRAMs: Multiple brands (SIGMETRICS2017)
FPGAs: This work
Focus of the previous works:
Voltage guardband
Minimum safe voltage, i.e., Vmin prediction
Fault characterization and mitigation
Chip-to-chip, core-to-core, and workload-to-workload variation
….
Simulation-based studies:
• More straightforward and cover more parameters, but less precise
ASIC DNN: Minerva (MICRO2016), Thundervolt (DAC2018)
CPU: Bravo (HPCA2017)
Network On-Chip (HPCA2014)
5. 5
Undervolting on FPGAs: Motivation
The contribution of FPGAs in large data centers is growing; FPGAs were
expected to be in 30% of datacenter servers by 2020 (Top500 news).
• In comparison to ASICs, the energy efficiency of FPGAs is a serious
concern, i.e., FPGAs are 10X-100X less efficient.
• The nominal voltage of FPGAs is naturally reduced from one generation
to the next.
[Figure: nominal voltage across FPGA generations for Intel/Altera and Xilinx, with undervolting marked]
7. 7
Undervolting FPGA On-Chip Memories
1. Undervolting FPGAs
Voltage guardband
Overall power and reliability trade-off
2. Fault characterization in FPGA on-chip memories
Fault type, location, and rate
Temperature, Chip
3. Low-voltage FPGA-based Neural Network (NN)
Power consumption and NN accuracy characterization
Fault mitigation techniques
Application-aware technique
Built-in ECC
8. 8
Voltage Scaling Capability in Xilinx
Evaluated Xilinx platforms
VC707: performance-efficient design
KC705: power-efficient design (A & B)
ZC702: ARM integrated with FPGA
Voltage distribution on Xilinx platforms
Voltage regulator
Power Management Bus (PMBus).
Hardwired to the host.
Voltage rails: VCCINT, VCCBRAM
[Figure: VC707]
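The slides do not say which PMBus commands were used to adjust the rails. As a hedged illustration, the sketch below assumes the regulator accepts the standard PMBus LINEAR16 output-voltage format, where the set-point is an unsigned 16-bit mantissa scaled by a signed 5-bit exponent reported in VOUT_MODE. The exponent value -12 used in the example is an assumption, and no bus access is performed.

```python
VOUT_MODE = 0x20      # PMBus command code: output voltage data format
VOUT_COMMAND = 0x21   # PMBus command code: output voltage set-point


def decode_exponent(vout_mode_byte):
    """Extract the signed 5-bit exponent from a VOUT_MODE byte (linear mode)."""
    raw = vout_mode_byte & 0x1F
    return raw - 32 if raw & 0x10 else raw   # two's complement, 5 bits


def volts_to_linear16(volts, exponent):
    """Encode a voltage as the 16-bit mantissa sent with VOUT_COMMAND."""
    mantissa = round(volts / 2 ** exponent)
    if not 0 <= mantissa <= 0xFFFF:
        raise ValueError("voltage not representable with this exponent")
    return mantissa


def linear16_to_volts(mantissa, exponent):
    """Decode a LINEAR16 mantissa back to volts."""
    return mantissa * 2.0 ** exponent


if __name__ == "__main__":
    exp = decode_exponent(0x14)            # assumed VOUT_MODE value, exponent -12
    word = volts_to_linear16(0.61, exp)    # e.g. a VCCBRAM target of 0.61 V
    print(hex(VOUT_COMMAND), hex(word), linear16_to_volts(word, exp))
```

On a real board the encoded word would be written over I2C/SMBus to the regulator that drives the chosen rail; the device address and page selection are board-specific and are deliberately left out.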
9. 9
Overall Voltage Behavior
• SAFE (voltage guardband, from Vnom down to Vmin):
No observable fault.
• CRITICAL (below Vmin, the minimum safe voltage):
Faults manifest.
• CRASH (below Vcrash, the minimum operating voltage):
FPGA stops operating.
• Voltage guardband: added to ensure correct operation under worst-case
environmental conditions and process variation.
• Experimental conditions: At ambient temperature and maximum operating
frequency.
[Figure: VCCBRAM (V), from 0 to 1, for VC707, ZC702, KC705-A, and KC705-B, split into GUARDBAND, CRITICAL, and CRASH regions by Vnom, Vmin, and Vcrash]
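The three regions can be expressed as a simple classifier. This is a minimal sketch: the default thresholds are the VC707 VCCBRAM values quoted later in the deck (Vnom = 1V, Vmin = 0.61V, Vcrash = 0.54V), and the function name is hypothetical.

```python
def voltage_region(v, v_nom=1.0, v_min=0.61, v_crash=0.54):
    """Classify a supply voltage into SAFE, CRITICAL, or CRASH.

    SAFE:     v_min <= v <= v_nom   (inside the guardband, no observable fault)
    CRITICAL: v_crash <= v < v_min  (device works, faults manifest)
    CRASH:    v < v_crash           (device stops operating)
    """
    if v > v_nom:
        raise ValueError("voltage above nominal is not undervolting")
    if v >= v_min:
        return "SAFE"
    if v >= v_crash:
        return "CRITICAL"
    return "CRASH"


if __name__ == "__main__":
    for v in (1.0, 0.8, 0.61, 0.58, 0.54, 0.50):
        print(f"{v:.2f} V -> {voltage_region(v)}")
```

Because the thresholds differ from platform to platform, they would have to be measured per board rather than reused.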
10. 10
Experimental Methodology
• FPGA BRAMs:
A hierarchy of sets of bit-cells distributed over the chip.
Size of each BRAM: 16 kbits
VC707: 2060 BRAMs
• Experimental Methodology:
HW: Transfer the content of BRAMs to the host.
SW: Analyze the data and adjust the voltage of BRAMs.
[Figure: floorplan of VC707]
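The software side of this methodology can be sketched as a comparison between the pattern written to a BRAM and the content read back at reduced voltage. The function names are hypothetical, the read-back data below is synthetic, and "1 Mbit" is assumed to mean 2^20 bits.

```python
BITS_PER_MBIT = 2 ** 20   # assumption: 1 Mbit = 2^20 bits


def count_bit_faults(written, read_back):
    """Number of bit positions that differ between the two byte buffers."""
    if len(written) != len(read_back):
        raise ValueError("buffers must have the same length")
    return sum(bin(w ^ r).count("1") for w, r in zip(written, read_back))


def fault_rate_per_mbit(written, read_back):
    """Bit faults normalized to faults per 1 Mbit of memory."""
    total_bits = 8 * len(written)
    return count_bit_faults(written, read_back) * BITS_PER_MBIT / total_bits


if __name__ == "__main__":
    pattern = bytes([0xFF]) * 2048        # one 16-kbit BRAM filled with ones
    observed = bytearray(pattern)
    observed[0] = 0xFE                    # synthetic fault: one bit flipped
    observed[10] = 0x7F                   # synthetic fault: another bit flipped
    print(count_bit_faults(pattern, observed))
    print(fault_rate_per_mbit(pattern, observed))
```

Repeating this per BRAM and per voltage step gives the per-BRAM fault rates that the following slides analyze.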
14. 14
Fault Characterization at CRITICAL Region
Fault variability among FPGA BRAMs:
• Fully non-uniform fault distribution.
• Majority of BRAMs do not experience many faults.
VC707 (2060 BRAMs), VCCBRAM @ Vcrash = 0.54V, Temperature @ ambient
K-means clustering of BRAMs by fault rate:
Cluster            % BRAMs    Average Fault Rate (%)
High-vulnerable    1.8%       0.86%
Mid-vulnerable     9.4%       0.24%
Low-vulnerable     52.3%      0.03%
Zero-vulnerable    36.3%      0.0%
[Figure: BRAM fault rate (%) per BRAM, axis from 0.0% to 1.5%]
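The grouping into vulnerability classes can be reproduced in spirit with a one-dimensional k-means (Lloyd's algorithm). The sketch below is a plain-Python illustration with a deterministic initialization chosen for reproducibility; the slides do not describe the exact clustering setup, and the fault rates in the example are synthetic.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster scalars with Lloyd's algorithm; returns (centroids, labels)."""
    if k < 2 or k > len(values):
        raise ValueError("need 2 <= k <= number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted data.
    centroids = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]

    def assign(cents):
        # Each value goes to its nearest centroid.
        return [min(range(k), key=lambda c: abs(v - cents[c])) for v in values]

    for _ in range(iterations):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(centroids)


if __name__ == "__main__":
    # Synthetic per-BRAM fault rates (%), not measured data.
    rates = [0.0, 0.0, 0.0, 0.02, 0.03, 0.04, 0.21, 0.25, 0.27, 0.84, 0.90]
    centroids, labels = kmeans_1d(rates, k=4)
    print(centroids)
    print(labels)
```

With four clusters, the centroids play the role of the "Average Fault Rate" column and the label counts give the "% BRAMs" column.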
15. 15
Fault Characterization at CRITICAL Region
Type of undervolting faults:
Permanent faults at a specific voltage
• There is no considerable change in the rate and location of faults over time.
• Validated by repeating the experiments 100 times.
• The physical location of BRAMs is extracted using Vivado.
• Fault Variation Map (FVM): Fault rate mapped to the physical location of
BRAMs.
FVM can potentially be used in fault mitigation techniques!
[Figure: FVM (BRAM fault rate (%) over the FPGA x-axis and y-axis) at VCCBRAM = Vcrash, T = ambient, chip = VC707]
[Figure: fault rate (per 1 Mbit) over 100 runs, showing individual runs and the cumulative median]
Three independent parameters have a significant impact on the
rate and location of faults:
1. Voltage
2. Temperature
3. Chip
16. 16
Fault Characterization (Voltage Impacts)
Location of undervolting faults:
Fault Inclusion Property (FIP)
• FIP: A bit that is corrupted at a specific voltage stays faulty at lower
voltages as well.
• FIP can be used in mitigation techniques.
[Figure: illustration of FIP]
[Figure: FIP shown as fault rate (per 1 Mbit, log scale) versus VCCBRAM (V), from 0.61V down to 0.54V, for VC707]
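One way to check the Fault Inclusion Property on collected data is to verify that the set of faulty bit locations at each voltage is contained in the set observed at the next lower voltage. The sketch below assumes fault locations are stored as sets keyed by voltage; the data layout and the example values are hypothetical.

```python
def satisfies_fip(faults_by_voltage):
    """True if every faulty bit at a voltage is also faulty at all lower voltages.

    faults_by_voltage maps a voltage (V) to the set of faulty bit locations
    observed at that voltage.
    """
    voltages = sorted(faults_by_voltage, reverse=True)   # highest voltage first
    return all(
        faults_by_voltage[high] <= faults_by_voltage[low]   # subset test
        for high, low in zip(voltages, voltages[1:])
    )


if __name__ == "__main__":
    # Synthetic fault locations given as (BRAM index, bit index) pairs.
    observed = {
        0.56: {(12, 5)},
        0.55: {(12, 5), (40, 17)},
        0.54: {(12, 5), (40, 17), (41, 3)},
    }
    print(satisfies_fip(observed))
```

If the property holds, a fault map taken at the lowest voltage of interest also covers every higher voltage, which is what makes it useful for mitigation.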
17. 17
Fault Characterization (Temperature Impacts)
• Methodology: Adjusting the environmental temperature and monitoring the
on-board temperature via PMBus.
• Experimental Observation:
At higher temperatures, the fault rate is significantly reduced.
• Inverse Temperature Dependency (ITD) (1):
For nano-scale technology nodes, under ultra-low-voltage operation, the
circuit delay decreases at higher temperatures since the supply voltage
approaches the threshold voltage.
[Figure: fault rate (per 1 Mbit, y-axis) versus VCCBRAM (V, x-axis) at T = 50 °C, 60 °C, 70 °C, and 80 °C. Practical confirmation of Inverse Temperature Dependency (ITD)]
(1) Neshatpour, K., Burleson, W., Khajeh, A., & Homayoun, H. (2018). Enhancing Power, Performance, and Energy Efficiency in Chip
Multiprocessors Exploiting Inverse Thermal Dependence. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (4), 778-791.
18. 18
Fault Characterization (Chip Impacts)
• Methodology: Repeating the experiments on identical samples of KC705 (A & B).
• Observations:
Fault rates vary significantly, by more than 4X.
Fault Variation Maps (FVMs) are entirely different.
Even identical samples of the same chip have totally different reliability
behavior, due to process variation/aging effects.
[Figure: fault location at VCCBRAM = Vcrash for KC705-A and KC705-B]
[Figure: fault rate (per 1 Mbit) versus VCCBRAM (V) for the two samples; one panel spans 0.57V to 0.54V, the other 0.61V to 0.54V]
20. 20
Experimental Methodology
Neural Network (NN)
Type: Fully-connected classifier
Total number of weights: ~1.5 million
Activation function: Logsig (logarithmic sigmoid)
Major benchmark
Name-type: MNIST, handwritten digit images
Number of images: Training: 60000, Classification: 10000
Number of pixels per image: 28*28=784
Number of output classes: 10
Additional benchmarks
Names: Forest and Reuters
Data representation model
Type: 16-bit fixed-point
Precision: Minimum sign and digit per layer
An example implementation on VC707
Frequency: 100 MHz
BRAM usage (total: 2060): 70.8%
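The 16-bit fixed-point representation with "minimum sign and digit per layer" can be read as follows: each layer reserves one sign bit plus only as many integer (digit) bits as its largest weight needs, and the rest of the 16-bit word holds the fraction. This reading is an interpretation, and the sketch below (with hypothetical function names) shows it under that assumption.

```python
import math

WORD_BITS = 16   # word length stated on the slide


def integer_bits_needed(weights):
    """Smallest number of integer bits that covers the largest magnitude."""
    largest = max(abs(w) for w in weights)
    if largest < 1.0:
        return 0
    return math.floor(math.log2(largest)) + 1


def quantize_layer(weights):
    """Quantize one layer; returns (integer codes, fractional bit count)."""
    frac_bits = WORD_BITS - 1 - integer_bits_needed(weights)   # 1 sign bit
    if frac_bits < 0:
        raise ValueError("weights too large for a 16-bit word")
    scale = 1 << frac_bits
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    codes = [min(hi, max(lo, round(w * scale))) for w in weights]  # saturate
    return codes, frac_bits


def dequantize(codes, frac_bits):
    """Map integer codes back to real values."""
    return [c / (1 << frac_bits) for c in codes]


if __name__ == "__main__":
    layer = [0.5, -0.25, 0.124, -0.9]      # synthetic weights
    codes, frac_bits = quantize_layer(layer)
    print(frac_bits, codes, dequantize(codes, frac_bits))
```

Choosing the split per layer keeps as many fractional bits as possible for layers whose weights are all small.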
21. 21
NN Implementation on FPGA
• Input data: off-chip DDR
memory.
• Weights: on-chip FPGA BRAM.
• Computation: Streaming data
onto DSPs and LUTs.
• We undervolt VCCBRAM:
Weights of the NN are
potentially affected.
[Figure: FPGA implementation]
22. 22
Low-Voltage FPGA-based NN
Power saving
• Significant power reduction down to the minimum safe voltage, i.e.,
Vmin (by eliminating the voltage guardband).
• Additional 40% power reduction below the voltage guardband.
NN accuracy loss
• The NN classification error increases exponentially from 2.56% (inherent
classification error) to 6.74% through undervolting BRAMs beyond Vmin.
• Fault mitigation techniques to prevent the accuracy loss:
Application-aware mechanism
Built-in ECC
On-chip power (Watts):
        Vnom = 1V    Vmin = 0.61V    Vcrash = 0.54V
BRAM    2.39         0.25            0.15
Rest    6.47         6.47            6.47
23. 23
Intelligently-Constrained BRAM placement (ICBP)
• Below the voltage guardband, in the CRITICAL voltage region, we present
ICBP to prevent the loss of NN classification accuracy.
• Core Idea: Map the weights most sensitive to faults onto robust BRAMs.
Q: Which are the most-sensitive NN weights? A: Deeper Layers.
ICBP: additional constraints in the FPGA placement stage
Normalized vulnerability of NN layers:
Layer0: 1
Layer1: 1.4
Layer2: 2.1
Layer3: 3
Layer4: 5.7
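The core idea of ICBP can be sketched as a greedy assignment: sort layers from most to least vulnerable, sort BRAMs from most to least robust using the fault map, and hand out BRAMs in that order. This is only an illustration of the idea; the actual technique is applied as constraints in the FPGA placement stage, and the function name, BRAM counts, and fault rates below are hypothetical.

```python
def icbp_assign(layer_brams_needed, layer_vulnerability, bram_fault_rate):
    """Give the most vulnerable layers the BRAMs with the lowest fault rate.

    layer_brams_needed:  layer name -> number of BRAMs the layer occupies
    layer_vulnerability: layer name -> normalized vulnerability
    bram_fault_rate:     BRAM id -> fault rate taken from the FVM
    Returns: layer name -> list of BRAM ids.
    """
    if sum(layer_brams_needed.values()) > len(bram_fault_rate):
        raise ValueError("not enough BRAMs for all layers")
    robust_first = sorted(bram_fault_rate, key=bram_fault_rate.get)
    sensitive_first = sorted(
        layer_vulnerability, key=layer_vulnerability.get, reverse=True
    )
    placement, cursor = {}, 0
    for layer in sensitive_first:
        count = layer_brams_needed[layer]
        placement[layer] = robust_first[cursor:cursor + count]
        cursor += count
    return placement


if __name__ == "__main__":
    vulnerability = {"LAYER0": 1.0, "LAYER1": 1.4, "LAYER2": 2.1,
                     "LAYER3": 3.0, "LAYER4": 5.7}       # values from the slide
    needed = {name: 1 for name in vulnerability}          # synthetic BRAM counts
    fvm = {"bram_a": 0.86, "bram_b": 0.0, "bram_c": 0.24,
           "bram_d": 0.03, "bram_e": 0.0, "bram_f": 0.5}  # synthetic fault rates
    print(icbp_assign(needed, vulnerability, fvm))
```

The resulting mapping would still have to be turned into placement constraints for the vendor tools, which is outside the scope of this sketch.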
24. 24
ICBP Evaluation
• Pros:
Significant accuracy loss prevention.
No power and performance overhead.
• Cons:
Needs the FVM as a pre-processing step. Built-in ECC is evaluated as an
alternative that does not have this cost.
[Figure: NN classification error (%) with default placement and with ICBP, and BRAM power (Watt), versus VCCBRAM (V) from 0.61V down to 0.54V. Inherent NN error: 2.56%]
25. 25
Built-in ECC
• Built-in ECC of FPGA BRAMs:
Hamming-code.
Two (2) additional bits per row are reserved as parities.
SECDED (Single-Error Correction and Double-Error Detection).
• Experimental Methodology:
Activate built-in ECC under low-voltage read operations.
• Experimental Observations:
>90% fault correction
>7% fault detection (not correction)
[Figure: fault rate (per 1 Mbit) versus VCCBRAM (V), from 0.61V down to 0.54V, without ECC and with ECC]
[Figure: parity bits and the breakdown of faults into single-bit, double-bit, and multiple-bit]
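The SECDED behavior (correct one flipped bit, detect two) can be demonstrated with a toy extended Hamming code. The sketch below protects 4 data bits with 3 Hamming parity bits plus one overall parity bit; this word size is chosen for readability and is an assumption, not the width used by the built-in BRAM ECC.

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d       # data positions
    code[1] = code[3] ^ code[5] ^ code[7]        # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]        # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]        # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                  # overall parity bit
    return code


def secded_decode(code):
    """Return (data, status); data is None when a double error is detected."""
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position                 # XOR of set positions
    overall_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        fixed[syndrome] ^= 1                     # single error (position 0 = overall bit)
        status = "corrected"
    else:
        return None, "double-error detected"     # even parity, nonzero syndrome
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status


if __name__ == "__main__":
    word = secded_encode(0b1011)
    word[6] ^= 1                                 # inject a single-bit fault
    print(secded_decode(word))
    word[2] ^= 1                                 # inject a second fault
    print(secded_decode(word))
```

Three or more flipped bits in one word can be miscorrected by any SECDED code, which is consistent with ECC not covering every multiple-bit fault.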
26. 26
ECC for NN Accelerator
ECC efficiency to prevent NN accuracy loss
[Figure: classification error (%) versus VCCBRAM (V), without ECC and with ECC. Inherent NN error: 2.56%]
ECC area and power costs
Area Utilization (%)
               BRAM    LUT    FF
Without ECC    96%     3%     0.25%
With ECC       100%    12%    0.25%
BRAM Power (W)
               Vnom = 1V    Vmin = 0.61V    Vcrash = 0.54V
Without ECC    2.4          0.31            0.198
With ECC       ----         ----            0.211
• Pros:
Significant accuracy loss prevention.
Negligible power and performance overhead.
• Cons:
Requires larger data rows/lines.
Not all FPGAs are equipped with this technique.
28. 28
Executive Summary
• Motivation: Power consumption of neural networks is a main concern
Hardware acceleration: GPUs, FPGAs, and ASICs
• Problem: FPGAs are at least 10X less power-efficient than equivalent ASICs
• Goal: Bridge the power-efficiency gap between ASIC- and FPGA-based
neural networks by Undervolting below nominal level
• Evaluation Setup
5 Image classification workloads
3 Xilinx UltraScale+ ZCU102 platforms
On-chip voltage rail for internal FPGA components
• Main Results
Large voltage guardband (i.e., 33%)
>3X power-efficiency gain
29. 29
Outline
• Motivation and Background
• Our Goal
• Methodology
• Results
- Overall Voltage Behavior
- Power-Reliability Trade-off
- Frequency Underscaling
- Environmental Temperature
• Prior Works
• Summary, Conclusion, and Future Works
30. 30
Motivation and Background
• Motivation
Power consumption of neural networks is a main concern
Hardware acceleration: GPUs, FPGAs, and ASICs
FPGAs: Getting popular but less power-efficient than equivalent ASICs
Large voltage guardbands (12-35%) for CPUs, GPUs, DRAMs
Any potential of “Undervolting FPGAs” for power-efficiency of neural networks?
• Background
Neural Networks: Widely deployed with an inherent resilience to errors
FPGAs: Higher throughput than GPUs and better flexibility than ASICs
Undervolting: Reduces power consumption, may incur reliability or performance issues
32. 32
Our Goal
• Primary Goal
Bridge the power-efficiency gap between ASIC- and FPGA-based
neural networks by:
Undervolting (i.e., underscaling voltage below nominal level)
• Secondary Goals
Study the voltage behavior of real FPGAs (e.g., guardband)
Study the power-efficiency gain of undervolting for neural networks
Study the reliability overhead
Study the frequency underscaling to prevent the accuracy loss
Study the effect of environmental temperature
34. 34
Overall Methodology
• 5 CNN image classification workloads, i.e., VGGNet, GoogleNet,
AlexNet, ResNet50, Inception.
• Xilinx DNNDK to map CNNs onto the FPGA
By default optimized for INT8
• 3 identical samples of Xilinx ZCU102
Zynq UltraScale+ architecture
Hard-core ARM for data orchestration
FPGA for CNN acceleration
• 1 on-chip voltage rail, via PMBus
VCCINT: DSPs, LUTs, buffers, …
Vnom = 850mV (set by manufacturer)
Vast majority (>99.9%) of the power is dissipated on VCCINT
36. 36
Overall Voltage Behavior
• Guardband: Large region below nominal level (Vnom = 850mV)
No performance or reliability loss
Added by the vendor to ensure correct operation under worst-case conditions
Large guardband, average of 33%
• Critical: Narrower region below guardband (Vmin = 570mV)
A narrow voltage region
Neural network accuracy collapse
• Crash: FPGA crashes below critical region (Vcrash = 540mV)
FPGA stops operating
Slight variation of voltage behavior across platforms and benchmarks
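The region sizes follow directly from the three voltages quoted on this slide. The short sketch below does the arithmetic (the function name is hypothetical) and shows where the roughly 33% guardband figure comes from.

```python
def region_widths(v_nom_mv=850, v_min_mv=570, v_crash_mv=540):
    """Width of the guardband and critical regions, in mV and as % of Vnom."""
    guardband = v_nom_mv - v_min_mv
    critical = v_min_mv - v_crash_mv
    return {
        "guardband_mv": guardband,
        "guardband_pct": 100 * guardband / v_nom_mv,
        "critical_mv": critical,
        "critical_pct": 100 * critical / v_nom_mv,
    }


if __name__ == "__main__":
    for name, value in region_widths().items():
        print(f"{name}: {value:.1f}")
```

With the default values the guardband is 280mV wide, against only 30mV for the critical region.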
38. 38
Power-Reliability Trade-off
Power-efficiency (GOPs/W) gain
• >3X power saving (2.6X by eliminating guardband and further 43% in critical region)
• Slight variation across 3 platforms and 5 workloads
Reliability overhead (i.e., CNN accuracy loss)
• No accuracy loss in the guardband, accuracy collapse in the critical region
• Slight variation across 3 platforms and 5 workloads
[Figure: results per workload: VGGNet, GoogleNet, AlexNet, ResNet, Inception]
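A quick arithmetic check shows how the two reported gains relate to the ">3X" headline. The sketch assumes the further 43% is measured relative to the efficiency at Vmin, so the two gains multiply; this reading is an assumption. For comparison it also computes the gain a purely quadratic (dynamic-power-only) model would predict for the guardband.

```python
def combined_gain(guardband_gain=2.6, critical_extra=0.43):
    """Total power-efficiency gain if the two gains compose multiplicatively."""
    return guardband_gain * (1.0 + critical_extra)


def quadratic_estimate(v_nom_mv=850, v_min_mv=570):
    """Gain predicted by a V^2-only power model at constant performance."""
    return (v_nom_mv / v_min_mv) ** 2


if __name__ == "__main__":
    print(f"combined gain: {combined_gain():.2f}X")
    print(f"V^2-only estimate for the guardband: {quadratic_estimate():.2f}X")
```

The measured 2.6X exceeds the V^2-only estimate, which is plausible because static power also falls with voltage.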
40. 40
Frequency Underscaling
• Simultaneous frequency underscaling to prevent CNN accuracy collapse in
the critical voltage region
• For each voltage level below Vmin, we found Fmax, the maximum
operating frequency at which there is no accuracy loss
• Leads to performance and energy-efficiency loss
VCCINT (mV)   Fmax (MHz)   GOPs (Norm)   Power (Norm)   GOPs/W (Norm)   GOPs/J (Norm)
570           333          1             1              1               1
565           300          0.94          0.97           0.97            0.87
560           250          0.83          0.84           0.99            0.75
555           250          0.83          0.78           1.06            0.8
550           250          0.83          0.75           1.1             0.83
545           250          0.83          0.74           1.12            0.84
540           200          0.7           0.56           1.25            0.75
Best setting for high performance and energy efficiency: 570mV
Best setting for power efficiency: 540mV
(Voltage steps = 5mV, frequency steps = 50MHz), shown for GoogleNet
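The best operating points can be read off the table programmatically. The sketch below copies the GoogleNet rows from this slide and picks the voltage that maximizes each metric; the column constants and function name are hypothetical.

```python
# Columns: VCCINT (mV), Fmax (MHz), then normalized GOPs, power, GOPs/W, GOPs/J.
ROWS = [
    (570, 333, 1.00, 1.00, 1.00, 1.00),
    (565, 300, 0.94, 0.97, 0.97, 0.87),
    (560, 250, 0.83, 0.84, 0.99, 0.75),
    (555, 250, 0.83, 0.78, 1.06, 0.80),
    (550, 250, 0.83, 0.75, 1.10, 0.83),
    (545, 250, 0.83, 0.74, 1.12, 0.84),
    (540, 200, 0.70, 0.56, 1.25, 0.75),
]
GOPS, GOPS_PER_W, GOPS_PER_J = 2, 4, 5   # column indices into each row


def best_voltage(column):
    """VCCINT (mV) of the row with the highest value in the given column."""
    return max(ROWS, key=lambda row: row[column])[0]


if __name__ == "__main__":
    print("performance:", best_voltage(GOPS), "mV")
    print("power efficiency:", best_voltage(GOPS_PER_W), "mV")
    print("energy efficiency:", best_voltage(GOPS_PER_J), "mV")
```

Performance and energy efficiency peak at Vmin, while power efficiency keeps improving down to Vcrash because power falls faster than throughput.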
42. 42
Environmental Temperature
• Effects of environmental temperature on the power-reliability trade-off
Vary the fan speed to test temperatures in [34 °C, 50 °C]
On-board temperature monitored by PMBus
• Temperature effects on power consumption
↓ Temp → ↓ Power (power is directly related to temperature)
With undervolting, the impact of temperature on power consumption is reduced.
• Temperature effects on reliability
↓ Temp → ↑ Accuracy loss (accuracy loss is inversely related to temperature)
In our temperature range, Vmin and Vcrash do not change significantly.
[Figure: results shown for GoogleNet]
44. 44
Prior Works
• Undervolting
Studies for off-the-shelf real CPUs, GPUs, ASICs, DRAMs
Large voltage guardband (from 12% to 35%) for many devices
This work extends such studies for off-the-shelf FPGAs especially for
neural network acceleration and confirms large guardbands (i.e., 33%)
• Power-Efficient Neural Networks
Studies on architectural-, hardware-, and software-level techniques
Undervolting in neural network ASIC accelerator (e.g., GreenTPU-DAC’19)
This work proposes a hardware-level undervolting for further
power-saving (>3X) in FPGAs.
• Reliability in Neural Networks
Analytical and simulation-based studies (e.g., Thundervolt-DAC’18)
Some studies on real hardware (e.g., EDEN-MICRO’19)
This work studies the reliability of neural networks on real FPGAs
when operating at reduced voltage levels.
46. 46
Summary, Conclusion, and Future Works
• Summary
We improve the power-efficiency (>3X) of off-the-shelf
FPGAs via undervolting for neural network accelerators:
2.6X by eliminating the guardband (i.e., 33%) without any cost
43% by further undervolting below the guardband with the cost of
either accuracy loss, when the frequency is not underscaled
or performance loss, when the frequency is underscaled
• Conclusion
Undervolting is an effective way to achieve significant
power-saving for FPGA-based neural network accelerators
• Future Works
HW & SW extension of our undervolting for FPGA clusters
and other neural network models and tools
48. 48
References
• B. Salami, et al., "An Experimental Study of Reduced-Voltage Operation in Modern FPGAs
for Neural Network Acceleration," in 50th IEEE/IFIP International Conference on
Dependable Systems and Networks (DSN), 2020.
• B. Salami, et al., "Comprehensive Evaluation of Supply Voltage Underscaling in FPGA
on-chip Memories," in 51st Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO), 2018.
• B. Salami, et al., "Evaluating Built-in ECC of FPGA on-chip Memories for the Mitigation of
Undervolting Faults," in 27th Euromicro International Conference on Parallel, Distributed,
and Network-based Processing (PDP), 2019.
• B. Salami, et al., "Fault Characterization Through FPGAs Undervolting," in 28th
International Conference on Field Programmable Logic & Applications (FPL), 2018.
• B. Salami, et al., "On the Resilience of RTL NN Accelerators: Fault Characterization and
Mitigation," in 30th International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), 2018.
49. 49
Ongoing and Future Extensions
• Circuit-level simulation for validating the results
• Expansion to more FPGAs (cluster) and more
workloads (DNN and non-DNN)
• Heterogeneous systems, including hw-sw systems and more
voltage rails
• Design of voltage-optimized FPGA components
• Integration with error handling systems such as checkpointing
51. FPGA Undervolting for Energy-Efficiency
30th International Conference on Field-Programmable Logic and Applications (FPL).
3rd September, 2020.
Behzad Salami
Barcelona Supercomputing Center (BSC)
behzad.salami@bsc.es