1. Operand Value Based Modeling of Dynamic
Energy Consumption of Soft Processors In FPGA
Zaid Al-Khatib, Samar Abdi
Presented at the Applied Reconfigurable Computing Conference
Ruhr University, Bochum
15 April 2015
2. # 2
Soft Processors in FPGAs
Compared to using function-specific hardware
• Advantages
– High programmability in FPGA fabric; can execute complex SW on a small footprint.
– Short development time.
– Easy to reuse libraries.
• Drawbacks
– Can be very slow.
– May consume more energy.
3. # 3
Soft Processors in FPGAs
Drawbacks Mitigation Approach
1. Analyze the software execution for time / energy
consumption.
2. Identify the functions that consume the most time /
energy.
3. Examine SW optimizations, or implement the
function using HW accelerators.
4. Repeat until design meets requirements.
4. # 4
Energy Consumption Analysis
Measure or Estimate?
• For ASIC processors, physical measurement is
possible; for soft processors in an FPGA it is not.
• A physical measurement would capture the energy consumed
by the entire FPGA chip, not just the resources implementing
the soft processor.
[Bazzaz, M. et al., IEEE Trans. on Instrumentation and Measurement, 2013]
5. # 5
Estimating the Energy Consumption
Model Abstraction Levels
    Processor Power Model Description
1   Transistor Level
2   Gate Level
3   RT Level
4   Pipeline state aware
5   Instruction Level
6   Analytical, instruction-class based
7   Function Level Macro Model
8   Mode Based
[Figure labels: Abstraction level (rows 1 to 8); Accuracy / Speed]
[Bansal, N. et al., VLSI Design, 2005]
6. # 6
Instruction Level Models
• Two Types of Instruction Level Models:
• First Order Model
Average energy for each instruction
First Order Estimate (nJ)   Instruction
1.5                         lwi r4, r19, 8
1.5                         lwi r3, r19, 4
1.25                        mul r3, r4, r3
1.4                         swi r3, r19, 12
5.65                        Total
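The first-order estimate is just a sum of per-instruction averages. The following is a minimal sketch of that sum in C, using the three energies shown on this slide; the table layout and the names `fo_entry`, `FO_TABLE`, and `first_order_nj` are illustrative assumptions, not part of the original tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a first-order table: opcode and its average energy in nJ. */
struct fo_entry { const char *opcode; double energy_nj; };

/* Values copied from the slide's example; the table itself is hypothetical. */
static const struct fo_entry FO_TABLE[] = {
    {"lwi", 1.5}, {"mul", 1.25}, {"swi", 1.4},
};

/* Return the average energy of an opcode, or 0.0 if it is not in the table. */
static double fo_lookup_nj(const char *opcode) {
    size_t count = sizeof FO_TABLE / sizeof FO_TABLE[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(FO_TABLE[i].opcode, opcode) == 0) {
            return FO_TABLE[i].energy_nj;
        }
    }
    return 0.0;
}

/* First-order estimate: add the average energy of every executed instruction. */
double first_order_nj(const char *const *opcodes, size_t n) {
    assert(opcodes != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += fo_lookup_nj(opcodes[i]);
    }
    return total;
}
```

Called on the sequence lwi, lwi, mul, swi, the function adds 1.5 + 1.5 + 1.25 + 1.4 and returns the 5.65 nJ total shown in the table.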
7. # 7
Instruction Level Models
• Two Types of Instruction Level Models:
• First Order Model
Average energy for each instruction
• Second Order Model
Inter Instruction Energy Effect
E(load, load) < E(load, mul)
First Order Estimate (nJ)   Instruction        Second Order Estimate (nJ)
1.5                         lwi r4, r19, 8     1.5
1.5                         lwi r3, r19, 4     0.8
1.25                        mul r3, r4, r3     1.25
1.4                         swi r3, r19, 12    1.4
5.65                        Total              4.95
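A second-order model makes an instruction's energy depend on its predecessor. The sketch below shows one simple way to express that, assuming a pair table that overrides the first-order average when a (previous, current) pair is listed. Only the (lwi, lwi) value of 0.8 nJ comes from the slide; the structure and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct so_single { const char *opcode; double energy_nj; };
struct so_pair { const char *prev; const char *cur; double energy_nj; };

/* First-order averages (nJ) from the slide's example. */
static const struct so_single SO_SINGLES[] = {
    {"lwi", 1.5}, {"mul", 1.25}, {"swi", 1.4},
};

/* Pair-specific energies (nJ); only (lwi, lwi) is taken from the slide. */
static const struct so_pair SO_PAIRS[] = {
    {"lwi", "lwi", 0.8},
};

/* Energy of cur when it follows prev; prev may be NULL for the first instruction. */
static double so_energy_nj(const char *prev, const char *cur) {
    size_t n_pairs = sizeof SO_PAIRS / sizeof SO_PAIRS[0];
    size_t n_singles = sizeof SO_SINGLES / sizeof SO_SINGLES[0];
    if (prev != NULL) {
        for (size_t i = 0; i < n_pairs; i++) {
            if (strcmp(SO_PAIRS[i].prev, prev) == 0 &&
                strcmp(SO_PAIRS[i].cur, cur) == 0) {
                return SO_PAIRS[i].energy_nj;
            }
        }
    }
    for (size_t i = 0; i < n_singles; i++) {
        if (strcmp(SO_SINGLES[i].opcode, cur) == 0) {
            return SO_SINGLES[i].energy_nj;
        }
    }
    return 0.0; /* unknown opcode */
}

/* Second-order estimate over an instruction sequence. */
double second_order_nj(const char *const *opcodes, size_t n) {
    assert(opcodes != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += so_energy_nj(i > 0 ? opcodes[i - 1] : NULL, opcodes[i]);
    }
    return total;
}
```

For lwi, lwi, mul, swi the second lwi is charged 0.8 nJ instead of 1.5 nJ, giving the 4.95 nJ total of the second-order column.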
8. # 8
Motivation for a new Instruction
Level Model
• When tested on modeling the energy consumed by a MicroBlaze soft
processor in a Virtex-5 FPGA, Instruction Level Models failed
because of:
– Poorly designed instruction characterization techniques
They assume the average power of an instruction is equal to the power of executing it in an
infinite loop.
ex. E(add) = E(add in an infinite loop) - E(empty infinite loop)
– No account for operand values
They assume E(add 0, 0) = E(add 0x7fffffff, 0x7fffffff)
12. # 12
Instruction Base Energy
• Add Instruction Base Energy from Location Based Energy Profile.
• Accounting for inter-instruction energy effect
Original Loop:
…
mul r3,r4,r3
swi r3,r19,12
lwi r3,r19,12
xori r3,r3,589994
...
Loop with inserted addk instruction:
…
mul r3,r4,r3
swi r3,r19,12
addk r6,r7,r8
lwi r3,r19,12
xori r3,r3,589994
...
[Figure: addk location-based energy profile. x-axis: location of inserted instruction in benchmarking loop (lwi, lwi, mul, swi, lwi, imm, swi, lwi, lwi, idiv, swi, lwi, ori, swi, lwi, addk, sra x4, addk, srl x5, addk, andi, rsubk, swi, lwi, addik, swi, lwi, addik, cmp, blti); y-axis: instruction energy (nJ), -0.2 to 1]
Instruction   Base energy after instruction from class (nJ)
              Arithmetic & Logic   Memory   Shift
add           0.1147               0.4882   0.1608
13. # 13
Instruction Base Energy
• Load word Instruction Base Energy from Location Based Energy Profile.
[Figure: lwi location-based energy profile. x-axis: location of inserted instruction in benchmarking loop; y-axis: instruction energy (nJ), -0.2 to 1]
Instruction   Base energy after instruction from class (nJ)
              Arithmetic & Logic   Memory   Shift
add           0.1147               0.4882   0.1608
Load word     0.7680               0.3536   0.9858
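The base energy table is indexed by the instruction and by the class of the instruction that precedes it. A minimal sketch of that lookup in C follows, using the add and load word rows from this slide; the enum names and the function `base_energy_nj` are illustrative assumptions.

```c
#include <assert.h>

/* Class of the preceding instruction (the three table columns). */
enum instr_class { CLASS_ARITH_LOGIC, CLASS_MEMORY, CLASS_SHIFT, CLASS_COUNT };

/* Instructions covered by this sketch (the two rows shown on the slide). */
enum instr { INSTR_ADD, INSTR_LWI, INSTR_COUNT };

/* Base energy in nJ, indexed as [instruction][class of preceding instruction]. */
static const double BASE_NJ[INSTR_COUNT][CLASS_COUNT] = {
    [INSTR_ADD] = {0.1147, 0.4882, 0.1608},
    [INSTR_LWI] = {0.7680, 0.3536, 0.9858},
};

/* Look up the base energy of instruction i when it follows class prev. */
double base_energy_nj(enum instr i, enum instr_class prev) {
    assert((int)i >= 0 && (int)i < INSTR_COUNT);
    assert((int)prev >= 0 && (int)prev < CLASS_COUNT);
    return BASE_NJ[i][prev];
}
```

For the sequence lwi, add, lwi, the add follows a memory instruction and is charged 0.4882 nJ, while the second lwi follows an arithmetic instruction and is charged 0.7680 nJ.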
14. # 14
Operand Value Effect
• Instruction energy range – depending on operand value:
...
lwi r4, r19, 8
add r6, r7, r8
lwi r3, r19, 4
...
• Minimum Profile: r7 = r8 = 0
• Maximum Profile: r7 = r8 = 0x7fffffff
• Energy Variance = Max profile – Min profile
• Energy Variance of an instruction: the maximum additional energy consumed as a
result of non-zero operand values
[Figure: addk minimum profile, addk maximum profile, and addk Energy Variance. x-axis: location of inserted instruction in benchmarking loop; y-axis: instruction energy (nJ), -0.4 to 1.6]
15. # 15
Operand Value – Energy Impact
• Values of input array contain: a single 1 and 31x 0’s
#define size 10
int main(void) {
    int temp, arr_in[size] =
        {1024, 4194304, 67108864, 2048, 128, 256, 2, 8388608, 32,
         268435456};
    while (1) {
        for (int i = 0; i < size; i++) {
            temp = arr_in[i];
            temp *= 2;
            temp++;
        }
    }
    return 0;
}
• Values of input array contain: 2x 1’s and 30x 0’s
#define size 10
int main(void) {
    int temp, arr_in[size] =
        {33554433, 67109888, 524416, 4196352, 134217736, 671088640,
         1073750016, 8388612, 20971520, 67141632};
    while (1) {
        for (int i = 0; i < size; i++) {
            temp = arr_in[i];
            temp *= 2;
            temp++;
        }
    }
    return 0;
}
• Values of input array contain: 3x 1’s and 29x 0’s
#define size 10
int main(void) {
    int temp, arr_in[size] =
        {1600, 1073809408, 268435496, 36872, 8413184, 135176, 11010048,
         33560576, 138, 301990400};
    while (1) {
        for (int i = 0; i < size; i++) {
            temp = arr_in[i];
            temp *= 2;
            temp++;
        }
    }
    return 0;
}
• Values of input array contain: 30x 1’s and 2x 0’s
#define size 10
int main(void) {
    int temp, arr_in[size] =
        {2147418111, 2147481599, 2147482623, 2145386495, 2147467263,
         2147352575, 2147483643, 1879048191, 2147475455, 2147221503};
    while (1) {
        for (int i = 0; i < size; i++) {
            temp = arr_in[i];
            temp *= 2;
            temp++;
        }
    }
    return 0;
}
– Increased energy consumption by approx. 20%
[Figure: dynamic energy consumed (nJ), 190 to 235, versus number of ones in each input array value, 0 to 30]
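The x-axis of this plot is the number of set bits in each input value. As a minimal sketch, that count can be computed with Kernighan's bit-clearing loop; the helper names `ones_count` and `mean_ones` are illustrative and not taken from the original tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of a 32-bit word by clearing the lowest set bit each pass. */
unsigned ones_count(uint32_t value) {
    unsigned count = 0;
    while (value != 0) {
        value &= value - 1; /* drops the lowest set bit */
        count++;
    }
    return count;
}

/* Average number of set bits per element of an input array. */
double mean_ones(const int32_t *values, size_t n) {
    assert(values != NULL && n > 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += ones_count((uint32_t)values[i]);
    }
    return (double)total / (double)n;
}
```

Applied to the first element of each of the four benchmark arrays (1024, 33554433, 1600, 2147418111), `ones_count` returns 1, 2, 3, and 30 respectively.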
16. # 16
Operand Value – Energy Impact
• Impact of operand density:
– Energy is linearly dependent on operand value density
[Figure: dynamic energy consumed (nJ), 190 to 235, versus number of ones in each input array value, 0 to 30. Annotation: 193 nJ = ∑ Base Energy of Instructions]
$E(i) = E_{base} + k \cdot EV(i)$
k: fraction of Energy Variance
$k = m \cdot OD(i) + b$
Linear function of Operand Density OD.
17. # 17
Operand Value Based Model
• Energy of an instruction
– Instruction Energy = Base energy + Operand Impact
$E(i) = E_{base}(i, j) + (m \cdot OD(i) + b) \cdot EV(i)$
• Model Parameters:
– The linear parameters (m and b)
– For each instruction
• Three values of Base Energy Ebase (one for each class)
• Maximum Energy Variance per instruction
Instruction   Base energy after instruction from class (nJ)   Max. Energy
              Arithmetic & Logic   Memory   Shift             Variance
add           0.1147               0.4882   0.1608            1.0034
rsubk         0.3461               1.0352   0.7762            0.7872
mul           0.1233               0.4819   0.4019            0.9795
idiv          0.1850               0.5401   0.4419            0.7602
and           0.0892               0.5306   0.4213            0.6977
xori          0.3257               0.6345   0.5921            0.6977
cmp           0.1821               0.7108   0.5727            1.0456
nop           0.1343               0.4808   0.1959            0
lwi           0.7680               0.3536   0.9858            0.5310
swi           0.8159               0.4108   0.9761            0.2208
srl           0.1628               0.5550   0.1124            1.0782
sra           0.1571               0.5836   0.1899            1.0373
Operand Value Impact - Linear Fit Parameters
m    0.016
b    -0.061
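The model equation can be evaluated directly from the table and the two fit parameters. The sketch below assumes that the operand density OD is expressed as the number of ones in the operand (the unit of the x-axis in the earlier plot) and that j denotes the class of the preceding instruction; both are readings of the slides rather than stated definitions. The function names are hypothetical, and k is left unclamped.

```c
#include <assert.h>

/* Linear fit parameters from the slide. */
#define OVBM_M 0.016
#define OVBM_B (-0.061)

/* Fraction k of the energy variance charged for a given operand density.
 * Assumption: density is given as a count of ones in the operand. */
double ovbm_k(unsigned ones) {
    return OVBM_M * (double)ones + OVBM_B;
}

/* E(i) = E_base(i, j) + (m * OD(i) + b) * EV(i), all energies in nJ.
 * base_nj is the table entry already selected for the preceding class. */
double ovbm_energy_nj(double base_nj, double max_variance_nj, unsigned ones) {
    assert(max_variance_nj >= 0.0);
    return base_nj + ovbm_k(ones) * max_variance_nj;
}
```

For an add that follows a memory instruction with 30 ones in its operand, `ovbm_energy_nj(0.4882, 1.0034, 30)` evaluates 0.4882 + 0.419 × 1.0034, which is about 0.909 nJ. For nop the variance is 0, so the result is the base energy alone.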
18. # 18
Estimation Tool
[Figure: estimation tool flow. Phase I elements: Application (C / C++), List of Basic Blocks, Annotated Executable, Target Device, Execution Trace [Basic Block Sequence], Ones Densities. Phase II elements: Processor Energy Model, Energy Report]
• Phase 1 – Generate model inputs:
– Instructions in Basic Blocks
– Annotated application for:
• Execution trace
• Densities of operand values
• Phase 2 – Estimate Energy
– Estimate energy of each instruction, and each basic block
– Estimate total energy consumed and generate energy report
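The final step of Phase 2 accumulates energy over the execution trace. The following is a simplified sketch, assuming each basic block already has a single estimated energy and the trace is a sequence of block indices. The real tool also feeds in operand densities per execution, which this sketch leaves out; the function name and data layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Total energy (nJ) of a run: add the estimated energy of each basic block
 * once for every time it appears in the execution trace. */
double trace_energy_nj(const double *block_energy_nj, size_t n_blocks,
                       const size_t *trace, size_t trace_len) {
    assert(block_energy_nj != NULL && trace != NULL);
    double total = 0.0;
    for (size_t t = 0; t < trace_len; t++) {
        assert(trace[t] < n_blocks); /* every trace entry must name a known block */
        total += block_energy_nj[trace[t]];
    }
    return total;
}
```

With two blocks estimated at 2.0 nJ and 3.5 nJ (made-up values) and the trace 0, 1, 1, 0, 1, the function returns 14.5 nJ.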
20. # 20
Estimation Accuracy
Application      Time (µs)   Power (mW)   Energy (mJ)
Dhrystone        39.35       33.35        1.31
Quicksort        164.20      33.78        5.55
ReadBMPBlock     251.61      39.96        10.05
DCT              166.68      30.84        5.14
Quantize         58.20       25.52        1.49
Zigzag           25.33       30.98        0.78
Huffman Encode   471.95      40.70        19.21
JPEG             973.77      37.66        36.67
• Tested the tool with a diverse group of benchmarks
• Accurate estimates from XPA (Xilinx XPower Analyzer) used as the reference
21. # 21
Instruction Level Models Accuracy
• State-of-the-art instruction level models: large errors
Application               First order Model    Second order Model
                          E (mJ)    Err        E (mJ)    Err
Dhrystone                 3.6       171%       3.3       155%
QuickSort                 15.80     185%       12.63     128%
ReadBMP                   24.6      145%       21.7      116%
DCT                       18.2      253%       18.2      253%
Quantize                  6.4       329%       4.0       169%
Zigzag                    2.3       195%       2.3       194%
Huffman Enc.              50.7      164%       47.7      148%
JPEG                      102.2     179%       93.8      156%
Average error                       216%                 156%
Std. Deviation of error             51.6%                35.0%
22. # 22
Instruction Level Models Accuracy
• State-of-the-art instruction level models
– Large errors
– Can be calibrated using the error of the first benchmark estimate
E*: estimate after calibration
Application               First order Model                      Second order Model
                          E (mJ)   Err    E* (mJ)   Err          E (mJ)   Err    E* (mJ)   Err
Dhrystone                 3.6      171%   1.31      0.0%**       3.3      155%   1.31      0.0%**
QuickSort                 15.80    185%   5.07      -8.7%        12.63    128%   4.95      -10.7%
ReadBMP                   24.6     145%   7.90      -21.4%       21.7     116%   8.50      -15%
DCT                       18.2     253%   5.82      13.2%        18.2     253%   7.12      38.5%
Quantize                  6.4      329%   2.04      38%          4.0      169%   1.57      5.4%
Zigzag                    2.3      195%   0.74      -5.3%        2.3      194%   0.90      15.3%
Huffman Enc.              50.7     164%   16.3      -15.4%       47.7     148%   18.7      -2.7%
JPEG                      102.2    179%   32.8      -10.7%       93.8     156%   36.8      0.3%
Average error                      216%             12.6%                 156%             9.5%
Std. Deviation of error            51.6%            10.6%                 35.0%            10.4%
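The slide does not spell out the calibration formula. One reading that reproduces the second-order E* column to rounding is to divide each raw estimate by one plus the relative error of the first benchmark (155% for Dhrystone). The sketch below implements that assumption only; `calibrate_mj` is a hypothetical name.

```c
#include <assert.h>

/* Scale a raw estimate by the relative error observed on the calibration
 * benchmark. calibration_error is a fraction, e.g. 1.55 for +155%.
 * Assumption: calibrated = raw / (1 + error of the first benchmark). */
double calibrate_mj(double estimate_mj, double calibration_error) {
    assert(calibration_error > -1.0); /* keeps the divisor positive */
    return estimate_mj / (1.0 + calibration_error);
}
```

Under this assumption, the second-order QuickSort estimate of 12.63 mJ becomes about 4.95 mJ and the JPEG estimate of 93.8 mJ becomes about 36.8 mJ, matching the E* values in the table.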
23. # 23
Instruction Level Models Accuracy
• State-of-the-art instruction level models
– Even with calibration, OVBM is more than twice as accurate
Application               First order Model    Second order Model   OVBM
                          E* (mJ)   Err        E* (mJ)   Err        E (mJ)   Err
Dhrystone                 1.31      0.0%**     1.31      0.0%**     1.30     -0.7%
QuickSort                 5.07      -8.7%      4.95      -10.7%     5.37     -3.2%
ReadBMP                   7.90      -21.4%     8.50      -15%       8.82     -12%
DCT                       5.82      13.2%      7.12      38.5%      4.96     -3.5%
Quantize                  2.04      38%        1.57      5.4%       1.47     -0.9%
Zigzag                    0.74      -5.3%      0.90      15.3%      0.78     -0.6%
Huffman Enc.              16.3      -15.4%     18.7      -2.7%      17.64    -8.2%
JPEG                      32.8      -10.7%     36.8      0.3%       33.67    -8.2%
Average error                       12.6%                9.5%                4.2%
Std. Deviation of error             10.6%                10.4%               3.5%
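The Err columns appear to be signed relative errors against the XPA reference energy. A minimal sketch of that computation follows; the definition is inferred from the tabulated values rather than stated on the slides, and `relative_error_pct` is a hypothetical name.

```c
#include <assert.h>

/* Signed relative error of an estimate against a reference, in percent.
 * Negative values mean the model underestimates the reference. */
double relative_error_pct(double estimate, double reference) {
    assert(reference != 0.0);
    return 100.0 * (estimate - reference) / reference;
}
```

For QuickSort, the OVBM estimate of 5.37 mJ against the XPA reference of 5.55 mJ gives about -3.2%, and for Huffman Encode 17.64 mJ against 19.21 mJ gives about -8.2%, in line with the table.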
24. # 24
Estimation Speed
Application      OVBM Tool (Seconds)          XPA (Hours)
                 Host    Target    Total
Dhrystone        0.03    7.49      7.53       1.2
Quicksort        0.01    23.08     23.09      2.5
ReadBMPBlock     0.21    5.88      6.08       3.4
DCT              0.03    10.85     10.88      2.5
Quantize         0.01    8.40      8.41       1.4
Zigzag           0.01    4.41      4.42       1.1
Huffman Encode   0.07    65.04     65.11      5.7
JPEG             0.28    104.24    104.52     10.6
• OVBM tool is up to 3 orders of magnitude faster than the accurate XPA tool (roughly 300x to 2,000x across these benchmarks)
• Speed of OVBM depends on the speed of the Target Device
25. # 25
Limitations
• The generated model is specific to a single
implementation and processor configuration.
• The source code of the application is required in order to
annotate it and trace operand value metrics.