Operand Value Based Modeling of Dynamic
Energy Consumption of Soft Processors In FPGA
Zaid Al-Khatib, Samar Abdi
Presented at the Applied Reconfigurable Computing Conference
Ruhr University, Bochum
15 April 2015
# 2
Soft Processors in FPGAs
Compared with function-specific hardware
• Advantages
– High programmability in the FPGA fabric; complex SW can execute on
a small footprint.
• Short development time.
• Easy to reuse libraries.
• Drawbacks
– Can be very slow.
– May consume more energy.
# 3
Soft Processors in FPGAs
Drawbacks Mitigation Approach
1. Analyze the software execution for time / energy
consumption.
2. Identify the functions that consume the most time /
energy.
3. Examine SW optimizations, or implement the
function using HW accelerators.
4. Repeat until design meets requirements.
# 4
Energy Consumption Analysis
Measure or Estimate?
• For ASIC processors, physical measurement is
possible; for soft processors in an FPGA it is not.
• A physical measurement would capture the energy consumed by the
entire FPGA chip, not just the resources implementing the soft
processor.
[Bazzaz, M. et al., IEEE Trans. on Instrumentation and Measurement, 2013]
# 5
Estimating the Energy Consumption
Model Abstraction Levels
Processor Power Model Description      Accuracy / Speed
1 Transistor Level
2 Gate Level
3 RT Level
4 Pipeline state aware
5 Instruction Level
6 Analytical, instruction-class based
7 Function Level Macro Model
8 Mode Based
[Bansal, N. et al., VLSI Design, 2005]
# 6
Instruction Level Models
• Two Types of Instruction Level Models:
• First Order Model
Average energy for each instruction
First Order Estimate (nJ)   Instruction
1.5                         lwi r4, r19, 8
1.5                         lwi r3, r19, 4
1.25                        mul r3, r4, r3
1.4                         swi r3, r19, 12
5.65 nJ                     Total
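A first order estimate is just a sum of per-opcode averages over the executed instructions. The sketch below reproduces the 5.65 nJ total from the slide's example; the struct and function names (`opcode_energy`, `first_order_lookup`, `first_order_estimate`) are hypothetical, and the table holds only the three opcodes shown on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Average energy per opcode in nJ, copied from the slide's example. */
struct opcode_energy {
    const char *opcode;
    double avg_nj;
};

static const struct opcode_energy FIRST_ORDER_TABLE[] = {
    { "lwi", 1.50 },
    { "mul", 1.25 },
    { "swi", 1.40 },
};

/* Return the average energy of one opcode, or 0.0 if it is not in the table. */
double first_order_lookup(const char *opcode)
{
    size_t n = sizeof FIRST_ORDER_TABLE / sizeof FIRST_ORDER_TABLE[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(FIRST_ORDER_TABLE[i].opcode, opcode) == 0)
            return FIRST_ORDER_TABLE[i].avg_nj;
    }
    return 0.0;
}

/* First order estimate: sum the average energy of every executed instruction. */
double first_order_estimate(const char *const *trace, size_t len)
{
    assert(trace != NULL || len == 0);
    double total_nj = 0.0;
    for (size_t i = 0; i < len; i++)
        total_nj += first_order_lookup(trace[i]);
    return total_nj;
}
```

Calling `first_order_estimate` on the trace lwi, lwi, mul, swi adds 1.5 + 1.5 + 1.25 + 1.4 nJ. Nothing in the sum depends on instruction order or operand values.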
# 7
Instruction Level Models
• Two Types of Instruction Level Models:
• First Order Model
Average energy for each instruction
• Second Order Model
Inter Instruction Energy Effect
E(load, load) < E(load, mul)
First Order Estimate (nJ)   Instruction       Second Order Estimate (nJ)
1.5                         lwi r4, r19, 8    1.5
1.5                         lwi r3, r19, 4    0.8
1.25                        mul r3, r4, r3    1.25
1.4                         swi r3, r19, 12   1.4
5.65 nJ                     Total             4.95 nJ
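In a second order model the energy of an instruction also depends on the instruction that preceded it. The sketch below hard-codes only the one pair effect visible in the slide's example (a lwi following a lwi costs 0.8 nJ instead of 1.5 nJ). The function names are hypothetical, and a real model would carry a full pair table rather than a single special case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Energy in nJ of `cur` when it executes right after `prev` (NULL for the
 * first instruction). Values reproduce the slide's example only: a lwi that
 * follows another lwi costs 0.8 nJ, everything else keeps its average. */
double second_order_energy(const char *prev, const char *cur)
{
    assert(cur != NULL);
    if (prev != NULL && strcmp(prev, "lwi") == 0 && strcmp(cur, "lwi") == 0)
        return 0.80;
    if (strcmp(cur, "lwi") == 0) return 1.50;
    if (strcmp(cur, "mul") == 0) return 1.25;
    if (strcmp(cur, "swi") == 0) return 1.40;
    return 0.0; /* opcode not characterised in this sketch */
}

/* Second order estimate: walk the trace and charge each instruction
 * according to its predecessor. */
double second_order_estimate(const char *const *trace, size_t len)
{
    double total_nj = 0.0;
    for (size_t i = 0; i < len; i++)
        total_nj += second_order_energy(i > 0 ? trace[i - 1] : NULL, trace[i]);
    return total_nj;
}
```

On the trace lwi, lwi, mul, swi this gives 1.5 + 0.8 + 1.25 + 1.4 nJ, matching the slide's 4.95 nJ total.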
# 8
Motivation for a New Instruction
Level Model
• When tested on a MicroBlaze soft processor in a Virtex-5 FPGA,
existing Instruction Level Models failed to model the consumed energy
because:
– Poorly designed instruction characterization techniques
They assume the average power of an instruction is equal to the power of executing it in an
infinite loop.
e.g. E(add) = E(add in an infinite loop) – E(empty infinite loop)
– No account for operand values
They assume E(add 0, 0) = E(add 0x7fffffff, 0x7fffffff)
# 9
New Instruction Energy Estimation
Method
Location Based Instruction Energy Profiling
Reference Application
$L2
lwi r4, r19, 8
lwi r3, r19, 4
mul r3, r4, r3
swi r3, r19, 12
...
bri $L2
Profiling loops, with add inserted at each location in turn
$L2
add r6, r7, r8
lwi r4, r19, 8
lwi r3, r19, 4
mul r3, r4, r3
swi r3, r19, 12
...
bri $L2
$L2
lwi r4, r19, 8
add r6, r7, r8
lwi r3, r19, 4
mul r3, r4, r3
swi r3, r19, 12
...
bri $L2
$L2
lwi r4, r19, 8
lwi r3, r19, 4
add r6, r7, r8
mul r3, r4, r3
swi r3, r19, 12
...
bri $L2
$L2
lwi r4, r19, 8
lwi r3, r19, 4
mul r3, r4, r3
add r6, r7, r8
swi r3, r19, 12
...
bri $L2
[Figure: addk instruction energy (nJ), axis 0 to 0.4, against location of inserted instruction in benchmarking loop (lwi, lwi, mul, swi)]
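The profiling step can be sketched as a subtraction per location: the energy of the loop with the instruction inserted, minus the energy of the unmodified reference loop. The function name `location_profile` is hypothetical, and treating both inputs as per-iteration energies in nJ is an assumption; the slides do not say how the loop energies are normalized.

```c
#include <assert.h>
#include <stddef.h>

/* Energy attributed to the inserted instruction at each location:
 * profile[k] = (energy of loop with the instruction inserted at location k)
 *            - (energy of the unmodified reference loop),
 * both taken per loop iteration and in the same unit (nJ here). */
void location_profile(double reference_nj, const double *with_insert_nj,
                      size_t locations, double *profile_nj)
{
    assert(with_insert_nj != NULL && profile_nj != NULL);
    for (size_t k = 0; k < locations; k++)
        profile_nj[k] = with_insert_nj[k] - reference_nj;
}
```

Each entry of `profile_nj` is one point of an instruction's energy profile, indexed by where in the benchmarking loop the instruction was inserted.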
# 10
Energy Profiles of Instructions
[Figure: four panels (muli, lwi, srl, addk), each plotting instruction energy (nJ), axis -0.2 to 0.8, against location of inserted instruction in benchmarking loop]
# 11
Instruction Classes
• Three instruction classes
– Arithmetic and Logic
– Memory Load and Store
– Shift Operations
[Figure: energy profiles of lwi, srl and addk; instruction energy (nJ), axis -0.2 to 0.8, against location of inserted instruction in benchmarking loop. Loop instructions, in order: lwi, lwi, mul, swi, lwi, imm, swi, lwi, lwi, idiv, swi, lwi, ori, swi, lwi, addk, sra, sra, sra, sra, addk, srl, srl, srl, srl, srl, addk, andi, rsubk, swi, lwi, addik, swi, lwi, addik, cmp, blti]
# 12
Instruction Base Energy
• Add Instruction Base Energy from Location Based Energy Profile.
• Accounting for inter-instruction energy effect
Original Loop
…
mul r3,r4,r3
swi r3,r19,12
lwi r3,r19,12
xori r3,r3,589994
...
Loop with inserted addk instruction
…
mul r3,r4,r3
swi r3,r19,12
addk r6,r7,r8
lwi r3,r19,12
xori r3,r3,589994
...
[Figure: addk instruction energy (nJ), axis -0.2 to 1, against location of inserted instruction in benchmarking loop]
Instruction   Base energy after instruction from class (nJ)
              Arithmetic & Logic   Memory   Shift
add           0.1147               0.4882   0.1608
# 13
Instruction Base Energy
• Load word Instruction Base Energy from Location Based Energy Profile.
[Figure: lwi instruction energy (nJ), axis -0.2 to 1, against location of inserted instruction in benchmarking loop]
Instruction   Base energy after instruction from class (nJ)
              Arithmetic & Logic   Memory   Shift
add           0.1147               0.4882   0.1608
Load word     0.7680               0.3536   0.9858
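Using the base energy table comes down to picking a column by the class of the preceding instruction. A minimal sketch, with hypothetical names (`insn_class`, `BASE_ADD`, `BASE_LWI`, `base_energy`); the numbers are the add and load word rows of the model parameter table (0.3536 nJ for lwi after a memory instruction).

```c
#include <assert.h>

/* Class of the instruction that executed immediately before. */
enum insn_class { CLASS_ARITH_LOGIC = 0, CLASS_MEMORY = 1, CLASS_SHIFT = 2 };

/* Base energy (nJ) of add and lwi after an instruction of each class,
 * indexed by enum insn_class. */
static const double BASE_ADD[3] = { 0.1147, 0.4882, 0.1608 };
static const double BASE_LWI[3] = { 0.7680, 0.3536, 0.9858 };

/* Select the base energy column from the class of the previous instruction. */
double base_energy(const double table[3], enum insn_class prev)
{
    assert((int)prev >= 0 && (int)prev <= 2);
    return table[prev];
}
```

For example, `base_energy(BASE_ADD, CLASS_MEMORY)` returns 0.4882, the base energy of an add that follows a load or store.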
# 14
Operand Value Effect
• Instruction energy range – depending on operand value:
...
lwi r4, r19, 8
add r6, r7, r8
lwi r3, r19, 4
...
• Minimum Profile: r7 = r8 = 0
• Maximum Profile: r7 = r8 = 0x7fffffff
• Energy Variance = Max profile – Min profile
• Energy Variance of an instruction: the maximum additional energy consumed as a
result of non-zero operand values
[Figure: addk minimum profile, addk maximum profile and addk Energy Variance; instruction energy (nJ), axis -0.4 to 1.6, against location of inserted instruction in benchmarking loop]
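The energy variance can be sketched as a per-location difference between the two profiles. Reducing the per-location differences to a single number by taking their maximum is an assumption made here for illustration, and `max_energy_variance` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Per-location energy variance: maximum profile minus minimum profile.
 * Returns the largest difference found over all locations (an assumed
 * reduction; the slides do not spell out how one value is chosen). */
double max_energy_variance(const double *min_profile_nj,
                           const double *max_profile_nj, size_t locations)
{
    assert(locations > 0);
    double largest = max_profile_nj[0] - min_profile_nj[0];
    for (size_t k = 1; k < locations; k++) {
        double ev = max_profile_nj[k] - min_profile_nj[k];
        if (ev > largest)
            largest = ev;
    }
    return largest;
}
```

The minimum profile would come from runs with r7 = r8 = 0 and the maximum profile from runs with r7 = r8 = 0x7fffffff.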
# 15
• Values of input array contain: a single 1 and 31x 0’s
#define size 10
/* Benchmark: endlessly computes 2*x + 1 for each array value. */
int main(){
    int temp, arr_in[size]=
        {1024, 4194304, 67108864, 2048, 128, 256, 2, 8388608, 32,
         268435456};
    while(1){
        for (int i=0; i<size; i++){
            temp=arr_in[i];
            temp*=2;
            temp++;
        }
    }
    return 0;
}
• Values of input array contain: 2x 1’s and 30x 0’s
#define size 10
/* Benchmark: endlessly computes 2*x + 1 for each array value. */
int main(){
    int temp, arr_in[size]=
        {33554433, 67109888, 524416, 4196352, 134217736, 671088640,
         1073750016, 8388612, 20971520, 67141632};
    while(1){
        for (int i=0; i<size; i++){
            temp=arr_in[i];
            temp*=2;
            temp++;
        }
    }
    return 0;
}
• Values of input array contain: 3x 1’s and 29x 0’s
#define size 10
/* Benchmark: endlessly computes 2*x + 1 for each array value. */
int main(){
    int temp, arr_in[size]=
        {1600, 1073809408, 268435496, 36872, 8413184, 135176, 11010048,
         33560576, 138, 301990400};
    while(1){
        for (int i=0; i<size; i++){
            temp=arr_in[i];
            temp*=2;
            temp++;
        }
    }
    return 0;
}
• Values of input array contain: 31x 1’s and a single 0
#define size 10
/* Benchmark: endlessly computes 2*x + 1 for each array value. */
int main(){
    int temp, arr_in[size]=
        {2147418111, 2147481599, 2147482623, 2145386495, 2147467263,
         2147352575, 2147483643, 1879048191, 2147475455, 2147221503};
    while(1){
        for (int i=0; i<size; i++){
            temp=arr_in[i];
            temp*=2;
            temp++;
        }
    }
    return 0;
}
– Increased energy consumption by approx. 20%
Operand Value – Energy Impact
[Figure: dynamic energy consumed (nJ), axis 190 to 235, against number of ones in each input array value, 0 to 30]
# 16
Operand Value – Energy Impact
[Figure: dynamic energy consumed (nJ), axis 190 to 235, against number of ones in each input array value, 0 to 30. Annotation: 193 nJ = ∑ Base Energy of Instructions]
• Impact of operand density:
– Energy is linearly dependent on operand value density
$E(i) = E_{base} + k \cdot EV(i)$
k: fraction of Energy Variance
$k = m \cdot OD(i) + b$
Linear function of Operand Density OD.
# 17
Operand Value Based Model
• Energy of an instruction
– Instruction Energy = Base energy + Operand Impact
$E(i) = E_{base}(i, j) + (m \cdot OD(i) + b) \cdot EV(i)$
• Model Parameters:
– The linear parameters (m and b)
– For each instruction
• Three values of Base Energy E_base (one for each class)
• Maximum Energy Variance per instruction
Instruction   Base energy after instruction from class (nJ)   Max. Energy
              Arithmetic & Logic   Memory   Shift             Variance
add           0.1147               0.4882   0.1608            1.0034
rsubk         0.3461               1.0352   0.7762            0.7872
mul           0.1233               0.4819   0.4019            0.9795
idiv          0.1850               0.5401   0.4419            0.7602
and           0.0892               0.5306   0.4213            0.6977
xori          0.3257               0.6345   0.5921            0.6977
cmp           0.1821               0.7108   0.5727            1.0456
nop           0.1343               0.4808   0.1959            0
lwi           0.7680               0.3536   0.9858            0.5310
swi           0.8159               0.4108   0.9761            0.2208
srl           0.1628               0.5550   0.1124            1.0782
sra           0.1571               0.5836   0.1899            1.0373
Operand Value Impact - Linear Fit Parameters
m    0.016
b   -0.061
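The model equation can be evaluated directly once the base energy, operand density and energy variance are known. In this sketch the function name `ovbm_instruction_energy` and the macro names are hypothetical, and OD is assumed to be a count of one bits as on the operand impact plot; the slides do not say whether it is counted per operand or summed over operands.

```c
#include <assert.h>

/* Linear fit parameters from the slide. */
#define OVBM_M  0.016
#define OVBM_B (-0.061)

/* E(i) = E_base(i, j) + (m * OD(i) + b) * EV(i), all energies in nJ.
 *   base_nj     : base energy of instruction i after class j
 *   ones_density: operand density OD(i), assumed to be a count of one bits
 *   variance_nj : maximum energy variance EV(i) */
double ovbm_instruction_energy(double base_nj, double ones_density,
                               double variance_nj)
{
    assert(ones_density >= 0.0);
    double k = OVBM_M * ones_density + OVBM_B; /* fraction of EV applied */
    return base_nj + k * variance_nj;
}
```

For an add after a memory instruction with OD = 10, k = 0.016 · 10 − 0.061 = 0.099, so the estimate is 0.4882 + 0.099 · 1.0034 ≈ 0.5875 nJ. A nop has zero variance, so its estimate is always its base energy.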
# 18
Estimation Tool
[Diagram labels: Application (C / C++), Phase I, List of Basic Blocks, Annotated Executable, Target Device, Execution Trace [Basic Block Sequence], Ones Densities, Processor Energy Model, Phase II, Energy Report]
• Phase 1 – Generate model inputs:
– Instructions in Basic Blocks
– Annotated application for:
• Execution trace
• Densities of operand values
• Phase 2 – Estimate Energy
– Estimate energy of each instruction, and each basic block
– Estimate total energy consumed and generate energy report
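Phase 2 can be sketched as an accumulation over the execution trace. This is a simplification under stated assumptions: each basic block is given one fixed energy per execution, whereas the tool described here also folds in operand densities, so the energy of a block could differ between executions. `accumulate_trace` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate energy per basic block over an execution trace.
 *   block_energy_nj[b]: estimated energy of one execution of block b
 *   trace[t]          : ID of the block executed at step t
 *   per_block_nj[b]   : output, total energy spent in block b
 * Returns the total energy of the whole trace. */
double accumulate_trace(const double *block_energy_nj, size_t num_blocks,
                        const size_t *trace, size_t trace_len,
                        double *per_block_nj)
{
    double total_nj = 0.0;
    for (size_t b = 0; b < num_blocks; b++)
        per_block_nj[b] = 0.0;
    for (size_t t = 0; t < trace_len; t++) {
        size_t b = trace[t];
        assert(b < num_blocks);
        per_block_nj[b] += block_energy_nj[b];
        total_nj += block_energy_nj[b];
    }
    return total_nj;
}
```

Dividing `per_block_nj[b]` by the returned total gives each block's share of the energy, which is the kind of figure an energy report would rank blocks by.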
# 19
Estimated Energy Report
• Dhrystone Energy Report
– Consists of 91 basic blocks
– In total, 333 basic blocks executed
• Basic Block 43 consumes 41% of the energy
• Focus optimization on Basic Blocks 43 and 44
Contribution of each Basic Block to total energy
BB#43   41%
BB#44   14%
BB#50   6%
BB#49   5%
[Figure: estimated dynamic energy (μJ, axis 0 to 30, and mJ, axis 0.0 to 1.5) against the execution trace of basic block IDs]
# 20
Estimation Accuracy
Application      Time (ms)   Power (mW)   Energy (mJ)
Dhrystone        39.35       33.35        1.31
Quicksort        164.20      33.78        5.55
ReadBMPBlock     251.61      39.96        10.05
DCT              166.68      30.84        5.14
Quantize         58.20       25.52        1.49
Zigzag           25.33       30.98        0.78
Huffman Encode   471.95      40.70        19.21
JPEG             973.77      37.66        36.67
• Tested the tool with a diverse group of benchmarks
• Accurate estimation from Xilinx XPower Analyzer (XPA) used as reference
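The three columns of the reference table are tied together by energy = power × time. The sketch below assumes the time column is read in milliseconds, which is the reading under which the table's rows agree with their energy values in mJ; `energy_mj` is a hypothetical helper name.

```c
#include <assert.h>

/* Energy in mJ from execution time in ms and average power in mW.
 * ms * mW = uJ, so divide by 1000 to obtain mJ. */
double energy_mj(double time_ms, double power_mw)
{
    assert(time_ms >= 0.0 && power_mw >= 0.0);
    return time_ms * power_mw / 1000.0;
}
```

For Dhrystone, 39.35 × 33.35 / 1000 ≈ 1.31 mJ; for JPEG, 973.77 × 37.66 / 1000 ≈ 36.67 mJ.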
# 21
Instruction Level Models Accuracy
• State of the art instruction level models
– Large Errors
Application               First order Model   Second order Model
                          E (mJ)   Err        E (mJ)   Err
Dhrystone                 3.6      171%       3.3      155%
QuickSort                 15.80    185%       12.63    128%
ReadBMP                   24.6     145%       21.7     116%
DCT                       18.2     253%       18.2     253%
Quantize                  6.4      329%       4.0      169%
Zigzag                    2.3      195%       2.3      194%
Huffman Enc.              50.7     164%       47.7     148%
JPEG                      102.2    179%       93.8     156%
Average error                      216%                156%
Std. Deviation of error            51.6%               35.0%
# 22
Instruction Level Models Accuracy
• State of the art instruction level models
– Large Errors
– Can be calibrated using the error of the first benchmark estimate
Application               First order Model                  Second order Model
                          E (mJ)   Err    E* (mJ)  Err       E (mJ)   Err    E* (mJ)  Err
Dhrystone                 3.6      171%   1.31     0.0%**    3.3      155%   1.31     0.0%**
QuickSort                 15.80    185%   5.07     -8.7%     12.63    128%   4.95     -10.7%
ReadBMP                   24.6     145%   7.90     -21.4%    21.7     116%   8.50     -15%
DCT                       18.2     253%   5.82     13.2%     18.2     253%   7.12     38.5%
Quantize                  6.4      329%   2.04     38%       4.0      169%   1.57     5.4%
Zigzag                    2.3      195%   0.74     -5.3%     2.3      194%   0.90     15.3%
Huffman Enc.              50.7     164%   16.3     -15.4%    47.7     148%   18.7     -2.7%
JPEG                      102.2    179%   32.8     -10.7%    93.8     156%   36.8     0.3%
Average error                      216%            12.6%              156%            9.5%
Std. Deviation of error            51.6%           10.6%              35.0%           10.4%
E*: calibrated estimate. **: benchmark used for calibration.
# 23
Instruction Level Models Accuracy
• State of the art instruction level models
– Even with calibration, OVBM is more than twice as accurate
Application               First order Model   Second order Model   OVBM
                          E* (mJ)  Err        E* (mJ)  Err         E (mJ)   Err
Dhrystone                 1.31     0.0%**     1.31     0.0%**      1.30     -0.7%
QuickSort                 5.07     -8.7%      4.95     -10.7%      5.37     -3.2%
ReadBMP                   7.90     -21.4%     8.50     -15%        8.82     -12%
DCT                       5.82     13.2%      7.12     38.5%       4.96     -3.5%
Quantize                  2.04     38%        1.57     5.4%        1.47     -0.9%
Zigzag                    0.74     -5.3%      0.90     15.3%       0.78     -0.6%
Huffman Enc.              16.3     -15.4%     18.7     -2.7%       17.64    -8.2%
JPEG                      32.8     -10.7%     36.8     0.3%        33.67    -8.2%
Average error                      12.6%               9.5%                 4.2%
Std. Deviation of error            10.6%               10.4%                3.5%
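The Err columns are relative errors against the XPA reference energies. A minimal sketch of that computation follows; the function names are hypothetical, and treating the "Average error" row as a mean of absolute errors is an assumption. Recomputing from the rounded values printed in the tables can differ slightly from the printed percentages.

```c
#include <assert.h>
#include <stddef.h>

/* Signed relative error of an estimate against a reference, in percent. */
double relative_error_pct(double estimate, double reference)
{
    assert(reference != 0.0);
    return (estimate - reference) / reference * 100.0;
}

/* Mean of the absolute relative errors over n benchmarks, in percent. */
double mean_abs_error_pct(const double *estimates, const double *references,
                          size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double err = relative_error_pct(estimates[i], references[i]);
        sum += err < 0.0 ? -err : err;
    }
    return sum / (double)n;
}
```

For QuickSort, the OVBM estimate of 5.37 mJ against the 5.55 mJ reference gives (5.37 − 5.55) / 5.55 ≈ −3.2%.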
# 24
Estimation Speed
Application      OVBM Tool (Seconds)        XPA (Hours)
                 Host    Target   Total
Dhrystone        0.03    7.49     7.53      1.2
Quicksort        0.01    23.08    23.09     2.5
ReadBMPBlock     0.21    5.88     6.08      3.4
DCT              0.03    10.85    10.88     2.5
Quantize         0.01    8.40     8.41      1.4
Zigzag           0.01    4.41     4.42      1.1
Huffman Encode   0.07    65.04    65.11     5.7
JPEG             0.28    104.24   104.52    10.6
• OVBM tool is about 3 orders of magnitude faster than the accurate XPA tool (roughly 300x to 2000x on these benchmarks)
• Speed of OVBM depends on speed of Target Device
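The speed comparison mixes units (seconds for the OVBM tool, hours for XPA), so the ratio needs a conversion. A minimal sketch with a hypothetical helper name, `speedup`:

```c
#include <assert.h>

/* Ratio of XPA run time (given in hours) to OVBM tool run time (given in
 * seconds): how many times faster the OVBM tool is. */
double speedup(double xpa_hours, double ovbm_seconds)
{
    assert(ovbm_seconds > 0.0);
    return xpa_hours * 3600.0 / ovbm_seconds;
}
```

For Dhrystone, 1.2 h is 4320 s, so the ratio against the 7.53 s total is about 574; for ReadBMPBlock it is 12240 / 6.08, about 2013.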
# 25
Limitations
• The generated model is specific to a single
implementation and processor configuration.
• The source code of the application is required in order to
annotate it and trace operand value metrics.
ARC2015_I_Slides

More Related Content

What's hot

Timing synchronization F Ling_v1
Timing synchronization F Ling_v1Timing synchronization F Ling_v1
Timing synchronization F Ling_v1Fuyun Ling
 
Initial acquisition in digital communication systems
Initial acquisition in digital communication systemsInitial acquisition in digital communication systems
Initial acquisition in digital communication systemsFuyun Ling
 
Fast Fourier Transform
Fast Fourier TransformFast Fourier Transform
Fast Fourier Transformop205
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...a3labdsp
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...EMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 track
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 trackEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 track
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 trackEMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-cleanEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-cleanEMERSON EDUARDO RODRIGUES
 
Using Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with TerasortUsing Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with TerasortAnhanguera Educacional S/A
 
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS FilterPerformance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filteridescitation
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rmEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rmEMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1Emerson Eduardo Rodrigues - ENGINEERING STUDIES1
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1EMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rmEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rmEMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...EMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid cleanEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid cleanEMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...EMERSON EDUARDO RODRIGUES
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...EMERSON EDUARDO RODRIGUES
 

What's hot (19)

Timing synchronization F Ling_v1
Timing synchronization F Ling_v1Timing synchronization F Ling_v1
Timing synchronization F Ling_v1
 
Initial acquisition in digital communication systems
Initial acquisition in digital communication systemsInitial acquisition in digital communication systems
Initial acquisition in digital communication systems
 
Coa.ppt2
Coa.ppt2Coa.ppt2
Coa.ppt2
 
Pipelining slides
Pipelining slides Pipelining slides
Pipelining slides
 
Fast Fourier Transform
Fast Fourier TransformFast Fourier Transform
Fast Fourier Transform
 
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
An Efficient DSP Based Implementation of a Fast Convolution Approach with non...
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES 2 clean1 new wi proposal perf...
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 track
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 trackEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 track
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160665 track
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-cleanEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-clean
 
Using Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with TerasortUsing Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
Using Derivation-Free Optimization Methods in the Hadoop Cluster with Terasort
 
Design Of 10 gbps
Design Of 10 gbpsDesign Of 10 gbps
Design Of 10 gbps
 
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS FilterPerformance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
Performance Analysis of OFDM Transceiver with Folded FFT and LMS Filter
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rmEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160548 rm
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1Emerson Eduardo Rodrigues - ENGINEERING STUDIES1
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rmEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rm
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160600 e lwa-wid-v21-rm
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 152263 v6 with change his...
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid cleanEmerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid clean
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160680 must wid clean
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160676 new wi srs carrier...
 
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...
Emerson Eduardo Rodrigues - ENGINEERING STUDIES1 Rp 160571-mark new si propos...
 

Similar to ARC2015_I_Slides

Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPA B Shinde
 
Application of OpenSees in Reliability-based Design Optimization of Structures
Application of OpenSees in Reliability-based Design Optimization of StructuresApplication of OpenSees in Reliability-based Design Optimization of Structures
Application of OpenSees in Reliability-based Design Optimization of Structuresopenseesdays
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesDilum Bandara
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Dr.K. Thirunadana Sikamani
 
IRJET- Efficient Design of Radix Booth Multiplier
IRJET- Efficient Design of Radix Booth MultiplierIRJET- Efficient Design of Radix Booth Multiplier
IRJET- Efficient Design of Radix Booth MultiplierIRJET Journal
 
1_Basic Structure of Computers.pptx
1_Basic Structure of Computers.pptx1_Basic Structure of Computers.pptx
1_Basic Structure of Computers.pptxKarthikChavan5
 
PCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingPCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingDVClub
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Databricks
 
Implementation of UART with BIST Technique Using Low Power LFSR
Implementation of UART with BIST Technique Using Low Power LFSRImplementation of UART with BIST Technique Using Low Power LFSR
Implementation of UART with BIST Technique Using Low Power LFSRIJERA Editor
 
LTE KPI Optimization - A to Z Abiola.pptx
LTE KPI Optimization - A to Z Abiola.pptxLTE KPI Optimization - A to Z Abiola.pptx
LTE KPI Optimization - A to Z Abiola.pptxssuser574918
 

Similar to ARC2015_I_Slides (20)

Computer Architecture Assignment Help
Computer Architecture Assignment HelpComputer Architecture Assignment Help
Computer Architecture Assignment Help
 
Unit vi
Unit viUnit vi
Unit vi
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILP
 
MES_MODULE 2.pptx
MES_MODULE 2.pptxMES_MODULE 2.pptx
MES_MODULE 2.pptx
 
Application of OpenSees in Reliability-based Design Optimization of Structures
Application of OpenSees in Reliability-based Design Optimization of StructuresApplication of OpenSees in Reliability-based Design Optimization of Structures
Application of OpenSees in Reliability-based Design Optimization of Structures
 
IS.pptx
IS.pptxIS.pptx
IS.pptx
 
Mod.2.pptx
Mod.2.pptxMod.2.pptx
Mod.2.pptx
 
Instruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler TechniquesInstruction Level Parallelism – Compiler Techniques
Instruction Level Parallelism – Compiler Techniques
 
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
Instruction Level Parallelism Compiler optimization Techniques Anna Universit...
 
IRJET- Efficient Design of Radix Booth Multiplier
IRJET- Efficient Design of Radix Booth MultiplierIRJET- Efficient Design of Radix Booth Multiplier
IRJET- Efficient Design of Radix Booth Multiplier
 
1_Basic Structure of Computers.pptx
1_Basic Structure of Computers.pptx1_Basic Structure of Computers.pptx
1_Basic Structure of Computers.pptx
 
PCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingPCI Express Verification using Reference Modeling
PCI Express Verification using Reference Modeling
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
Training Distributed Deep Recurrent Neural Networks with Mixed Precision on G...
 
C Programming - Refresher - Part II
C Programming - Refresher - Part II C Programming - Refresher - Part II
C Programming - Refresher - Part II
 
Lect08
Lect08Lect08
Lect08
 
Implementation of UART with BIST Technique Using Low Power LFSR
Implementation of UART with BIST Technique Using Low Power LFSRImplementation of UART with BIST Technique Using Low Power LFSR
Implementation of UART with BIST Technique Using Low Power LFSR
 
LTE KPI Optimization - A to Z Abiola.pptx
LTE KPI Optimization - A to Z Abiola.pptxLTE KPI Optimization - A to Z Abiola.pptx
LTE KPI Optimization - A to Z Abiola.pptx
 
module 5.1.pptx
module 5.1.pptxmodule 5.1.pptx
module 5.1.pptx
 
module 5.pptx
module 5.pptxmodule 5.pptx
module 5.pptx
 

ARC2015_I_Slides

  • 1. Operand Value Based Modeling of Dynamic Energy Consumption of Soft Processors In FPGA Zaid Al-Khatib, Samar Abdi Presented at the Applied Reconfigurable Computing Conference Ruhr University, Bochum 15 April 2015
  • 2. # 2 Soft Processors in FPGAs compared to using function specific hardware • Advantage – High programmability in FPGA fabric, can execute complex SW on a small footprint. • Short development time. • Easy to reuse libraries. • Drawbacks – Can be very slow. – May consume more energy.
  • 3. # 3 Soft Processors in FPGAs Drawbacks Mitigation Approach 1. Analyze the software execution for time / energy consumption. 2. Identify the functions that consume the most time / energy. 3. Examine SW optimizations or implementing the function using HW accelerators. 4. Repeat until design meets requirements.
  • 4. # 4 Energy Consumption Analysis Measure or Estimate? • For ASIC processors, physical measurement is possible. Not for FPGA • It would measure the energy consumed by the entire FPGA chip, not the resources implementing the soft processor [Bazzaz, M. et al., IEEE Trans. On Instrumentation and Measurement, 20013]
  • 5. # 5 Processor Power Model Description Accuracy / Speed 1 Transistor Level 2 Gate Level 3 RT Level 4 Pipeline state aware 5 Instruction Level 6 Analytical, instruction-class based 7 Function Level Macro Model 8 Mode Based Processor Power Model Description Accuracy / Speed 1 Transistor Level 2 Gate Level 3 RT Level 4 Pipeline state aware 5 Instruction Level 6 Analytical, instruction-class based 7 Function Level Macro Model 8 Mode Based [Bansal, N. et al., VLSI Design, 2005] AbstractionlevelEstimating the Energy Consumption Model Abstraction Levels
  • 6. Instruction Level Models
      • Two Types of Instruction Level Models:
      • First Order Model: average energy for each instruction
          First Order Estimate (nJ)   Instruction
          1.5                         lwi r4, r19, 8
          1.5                         lwi r3, r19, 4
          1.25                        mul r3, r4, r3
          1.4                         swi r3, r19, 12
          5.65                        Total
  • 7. Instruction Level Models
      • Two Types of Instruction Level Models:
      • First Order Model: average energy for each instruction
      • Second Order Model: inter-instruction energy effect, e.g. E(load, load) < E(load, mul)
          First Order Estimate (nJ)   Instruction        Second Order Estimate (nJ)
          1.5                         lwi r4, r19, 8     1.5
          1.5                         lwi r3, r19, 4     0.8
          1.25                        mul r3, r4, r3     1.25
          1.4                         swi r3, r19, 12    1.4
          5.65                        Total              4.95
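      The two totals on this slide are plain sums of the per-instruction estimates. A minimal C sketch that reproduces them; the function and array names are illustrative and not taken from the authors' tool:

```c
#include <assert.h>
#include <stddef.h>

/* Sum per-instruction energy estimates (nJ) over one instruction sequence. */
double sum_energy_nj(const double *energy_nj, size_t count)
{
    assert(energy_nj != NULL || count == 0);
    double total = 0.0;
    for (size_t i = 0; i < count; i++) {
        total += energy_nj[i];
    }
    return total;
}

/* Per-instruction estimates from the slide, in order: lwi, lwi, mul, swi. */
static const double first_order_nj[]  = {1.5, 1.5, 1.25, 1.4};
/* The second lwi follows another load, so its second order estimate is lower. */
static const double second_order_nj[] = {1.5, 0.8, 1.25, 1.4};
```

      Calling sum_energy_nj on the two arrays gives the 5.65 nJ and 4.95 nJ totals; the only difference is the second lwi, which the second order model prices by its predecessor.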
  • 8. Motivation for a New Instruction Level Model
      • When tested on modeling the energy consumed by a MicroBlaze soft processor in a Virtex-5 FPGA, instruction level models failed because of:
          – Poorly designed instruction characterization techniques. They assume the average power of an instruction equals the power of executing it in an infinite loop,
            e.g. E(add) = E(add in an infinite loop) - E(empty infinite loop)
          – No account of operand values. They assume E(add 0, 0) = E(add 0x7fffffff, 0x7fffffff)
  • 9. New Instruction Energy Estimation Method: Location Based Instruction Energy Profiling
      Reference Application (benchmarking loop):
          $L2
          lwi r4, r19, 8
          lwi r3, r19, 4
          mul r3, r4, r3
          swi r3, r19, 12
          ...
          bri $L2
      The instruction under test (add r6, r7, r8) is inserted at one location of the loop at a time: at the top of the loop, after the first lwi, after the second lwi, and after mul.
      [Figure: addk instruction energy (nJ, axis 0 to 0.4) against the location of the inserted instruction in the benchmarking loop (lwi, lwi, mul, swi)]
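      The slide does not spell out how the per-location energy is obtained. One plausible reading, sketched below in C, is that the energy attributed to the inserted instruction is the energy of the modified loop minus the energy of the unmodified reference loop. That subtraction, the function name and the parameters are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumed profiling step: for each insertion location, subtract the energy of
 * one iteration of the reference loop from the energy of one iteration of the
 * loop that contains the inserted instruction at that location.
 */
void location_energy_profile_nj(const double *modified_loop_nj,
                                double reference_loop_nj,
                                double *profile_nj,
                                size_t locations)
{
    assert(modified_loop_nj != NULL && profile_nj != NULL);
    for (size_t loc = 0; loc < locations; loc++) {
        profile_nj[loc] = modified_loop_nj[loc] - reference_loop_nj;
    }
}
```

      Under this reading the profile holds one value per location, which is the kind of data the bar chart on this slide plots.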
  • 10. Energy Profiles of Instructions
      [Figure: four panels (muli, lwi, srl, addk), each plotting instruction energy (nJ, axis -0.2 to 0.8) against the location of the inserted instruction in the benchmarking loop]
  • 11. Instruction Classes
      • Three instruction classes
          – Arithmetic and Logic
          – Memory Load and Store
          – Shift Operations
      [Figure: three panels (lwi, srl, addk), each plotting instruction energy (nJ, axis -0.2 to 0.8) against the location of the inserted instruction in the benchmarking loop]
      Benchmarking loop locations (x-axis): lwi lwi mul swi lwi imm swi lwi lwi idiv swi lwi ori swi lwi addk sra sra sra sra addk srl srl srl srl srl addk andi rsubk swi lwi addik swi lwi addik cmp blti
  • 12. Instruction Base Energy
      • The add instruction's base energy is taken from its location based energy profile.
      • This accounts for the inter-instruction energy effect.
      Original loop:
          …
          mul r3,r4,r3
          swi r3,r19,12
          lwi r3,r19,12
          xori r3,r3,589994
          ...
      Loop with inserted addk instruction:
          …
          mul r3,r4,r3
          swi r3,r19,12
          addk r6,r7,r8
          lwi r3,r19,12
          xori r3,r3,589994
          ...
      Base energy after an instruction from each class (nJ):
          Instruction   Arithmetic & Logic   Memory   Shift
          add           0.1147               0.4882   0.1608
      [Figure: addk instruction energy (nJ, axis -0.2 to 1) against the location of the inserted instruction in the benchmarking loop]
  • 13. Instruction Base Energy
      • The load word instruction's base energy is taken from its location based energy profile.
      Original loop:
          …
          mul r3,r4,r3
          swi r3,r19,12
          lwi r3,r19,12
          xori r3,r3,589994
          ...
      Loop with inserted addk instruction:
          …
          mul r3,r4,r3
          swi r3,r19,12
          addk r6,r7,r8
          lwi r3,r19,12
          xori r3,r3,589994
          ...
      Base energy after an instruction from each class (nJ):
          Instruction   Arithmetic & Logic   Memory   Shift
          add           0.1147               0.4882   0.1608
          Load word     0.7680               0.3536   0.9858
      [Figure: two panels (addk, lwi), each plotting instruction energy (nJ, axis -0.2 to 1) against the location of the inserted instruction in the benchmarking loop]
  • 14. Operand Value Effect
      • Energy Variance of an instruction: the maximum additional energy consumed as a result of non-zero operand values.
          ...
          lwi r4, r19, 8
          add r6, r7, r8
          lwi r3, r19, 4
          ...
      • Minimum Profile: r7 = r8 = 0
      • Maximum Profile: r7 = r8 = 0x7fffffff
      • Energy Variance = Max profile - Min profile
      • Instruction energy range, depending on operand value:
      [Figure: addk minimum profile, addk maximum profile and the energy variance between them; instruction energy (nJ, axis -0.4 to 1.6) against the location of the inserted instruction in the benchmarking loop]
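      The subtraction on this slide can be sketched in C as below. Taking the largest difference over all locations as the instruction's maximum energy variance is an assumption, as are the function and parameter names:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Energy variance per location is (maximum profile - minimum profile).
 * Assumption: the single "maximum energy variance" kept for an instruction
 * is the largest of these per-location differences.
 */
double max_energy_variance_nj(const double *min_profile_nj,
                              const double *max_profile_nj,
                              size_t locations)
{
    assert(min_profile_nj != NULL && max_profile_nj != NULL && locations > 0);
    double largest = max_profile_nj[0] - min_profile_nj[0];
    for (size_t loc = 1; loc < locations; loc++) {
        double variance = max_profile_nj[loc] - min_profile_nj[loc];
        if (variance > largest) {
            largest = variance;
        }
    }
    return largest;
}
```

      The two input arrays would come from profiling the same instruction twice, once with zero operands and once with 0x7fffffff operands.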
  • 15. Operand Value – Energy Impact
      The same program is run with four different input arrays.
      • Values of input array contain: a single 1 and 31x 0's
          #define size 10
          int main(){
              int temp, arr_in[size]= {1024, 4194304, 67108864, 2048, 128, 256, 2, 8388608, 32, 268435456};
              while(1){
                  for (int i=0; i<size; i++){
                      temp=arr_in[i];
                      temp*=2;
                      temp++;}}
              return 0;}
      • Values of input array contain: 2x 1's and 30x 0's
          arr_in[size]= {33554433, 67109888, 524416, 4196352, 134217736, 671088640, 1073750016, 8388612, 20971520, 67141632};
      • Values of input array contain: 3x 1's and 29x 0's
          arr_in[size]= {1600, 1073809408, 268435496, 36872, 8413184, 135176, 11010048, 33560576, 138, 301990400};
      • Values of input array contain: 31x 1's and a single 0
          arr_in[size]= {2147418111, 2147481599, 2147482623, 2145386495, 2147467263, 2147352575, 2147483643, 1879048191, 2147475455, 2147221503};
          – Increased energy consumption by approx. 20%
      [Figure: dynamic energy consumed (nJ, axis 190 to 235) against the number of ones in each input array value (axis 0 to 30)]
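      The x-axis quantity, the number of ones in an operand value, is a population count. A minimal C sketch; the function name is illustrative and the slides do not say how the authors' tool counts bits:

```c
#include <assert.h>
#include <stdint.h>

/* Count the 1 bits in a 32-bit operand value (Kernighan's bit-clearing loop). */
unsigned count_ones(uint32_t value)
{
    unsigned ones = 0;
    while (value != 0) {
        value &= value - 1;   /* clears the lowest set bit */
        ones++;
    }
    assert(ones <= 32);
    return ones;
}
```

      For example, count_ones(1024) is 1, count_ones(33554433) is 2 and count_ones(1600) is 3, which matches the first entry of each of the first three arrays.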
  • 16. Operand Value – Energy Impact
      • Impact of operand density:
          – Energy is linearly dependent on operand value density
      [Figure: dynamic energy consumed (nJ, axis 190 to 235) against the number of ones in each input array value (axis 0 to 30); annotation: 193 nJ = ∑ Base Energy of Instructions]
      $E(i) = E_{base} + k \cdot EV(i)$
          k: fraction of Energy Variance
      $k = m \cdot OD(i) + b$
          Linear function of Operand Density OD.
  • 17. Operand Value Based Model
      • Energy of an instruction
          – Instruction Energy = Base energy + Operand Impact
      $E(i) = E_{base}(i, j) + (m \cdot OD(i) + b) \cdot EV(i)$
      • Model Parameters:
          – The linear parameters (m and b)
          – For each instruction
              • Three values of Base Energy E_base (one for each class)
              • Maximum Energy Variance per instruction
      Base energy after an instruction from each class (nJ), and maximum energy variance:
          Instruction   Arithmetic & Logic   Memory   Shift    Max. Energy Variance
          add           0.1147               0.4882   0.1608   1.0034
          rsubk         0.3461               1.0352   0.7762   0.7872
          mul           0.1233               0.4819   0.4019   0.9795
          idiv          0.1850               0.5401   0.4419   0.7602
          and           0.0892               0.5306   0.4213   0.6977
          xori          0.3257               0.6345   0.5921   0.6977
          cmp           0.1821               0.7108   0.5727   1.0456
          nop           0.1343               0.4808   0.1959   0
          lwi           0.7680               0.3536   0.9858   0.5310
          swi           0.8159               0.4108   0.9761   0.2208
          srl           0.1628               0.5550   0.1124   1.0782
          sra           0.1571               0.5836   0.1899   1.0373
      Operand Value Impact, Linear Fit Parameters:
          m    0.016
          b   -0.061
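      The model equation and the table above translate into a few lines of C. This is a sketch under two assumptions: OD(i) is taken to be the number of ones in the operand value (the quantity on the x-axis of the preceding plots), and j is taken to be the class of the preceding instruction. The type and function names are illustrative, not the authors' tool:

```c
#include <assert.h>
#include <stddef.h>

/* Class of the preceding instruction (index j in the model equation). */
enum instr_class { CLASS_ARITH_LOGIC = 0, CLASS_MEMORY = 1, CLASS_SHIFT = 2 };

/* Per-instruction model parameters, as listed in the slide's table. */
struct instr_params {
    double base_nj[3];       /* base energy after each instruction class */
    double max_variance;     /* maximum energy variance EV(i) */
};

/* Linear fit parameters from the slide. */
#define OVBM_M 0.016
#define OVBM_B (-0.061)

/* E(i) = E_base(i, j) + (m * OD(i) + b) * EV(i) */
double ovbm_instruction_energy(const struct instr_params *params,
                               enum instr_class previous_class,
                               double operand_density)
{
    assert(params != NULL);
    double k = OVBM_M * operand_density + OVBM_B;  /* fraction of the variance */
    return params->base_nj[previous_class] + k * params->max_variance;
}

/* Two rows copied from the table: add and nop. */
static const struct instr_params ADD_PARAMS = {{0.1147, 0.4882, 0.1608}, 1.0034};
static const struct instr_params NOP_PARAMS = {{0.1343, 0.4808, 0.1959}, 0.0};
```

      With these assumptions, an add that follows a memory instruction and has an operand density of 10 comes out at 0.4882 + (0.016 · 10 - 0.061) · 1.0034, about 0.5875 nJ, while a nop always costs its base energy because its variance is 0.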
  • 18. Estimation Tool
      • Phase 1 – Generate model inputs:
          • Instructions in Basic Blocks
          • Annotated application for:
              • Execution trace
              • Densities of operand values
      • Phase 2 – Estimate Energy
          • Estimate energy of each instruction, and each basic block
          • Estimate total energy consumed and generate energy report
      [Diagram elements: Application (C / C++), Phase I, List of Basic Blocks, Annotated Executable, Target Device, Execution Trace [Basic Block Sequence], Ones Densities, Processor Energy Model, Phase II, Energy Report]
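      The last step of Phase 2 can be sketched as a walk over the execution trace. This C sketch simplifies by giving each basic block one fixed energy, whereas the tool described here also feeds in operand densities per execution; the function name and the array layout are assumptions:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Total energy of a run: add up the energy of every basic block in the
 * execution trace. trace[] holds indices into block_energy_nj[].
 * Simplification: one fixed energy per basic block.
 */
double trace_energy_nj(const double *block_energy_nj, size_t block_count,
                       const unsigned *trace, size_t trace_len)
{
    assert(block_energy_nj != NULL && trace != NULL);
    double total = 0.0;
    for (size_t step = 0; step < trace_len; step++) {
        assert(trace[step] < block_count);   /* every ID must be a known block */
        total += block_energy_nj[trace[step]];
    }
    return total;
}
```

      Keeping a second accumulator per block index in the same loop would give the per-block contributions that an energy report needs.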
  • 19. Estimated Energy Report
      • Dhrystone Energy Report
          – Consists of 91 basic blocks
          – In total, 333 basic blocks executed
      • Basic Block 43 consumes 41% of the energy
      • Focus optimization on Basic Blocks 43 and 44
      Contribution of each basic block to total energy: BB#43 41%, BB#44 14%, BB#50 6%, BB#49 5%
      [Figure: two plots of estimated dynamic energy against the execution trace (basic block IDs), one with a μJ axis from 0 to 30 and one with an mJ axis from 0.0 to 1.5]
  • 20. Estimation Accuracy
      • Tested the tool with a diverse group of benchmarks
      • Accurate estimation used as reference (XPA)
          Application      Time (µs)   Power (mW)   Energy (mJ)
          Dhrystone        39.35       33.35        1.31
          Quicksort        164.20      33.78        5.55
          ReadBMPBlock     251.61      39.96        10.05
          DCT              166.68      30.84        5.14
          Quantize         58.20       25.52        1.49
          Zigzag           25.33       30.98        0.78
          Huffman Encode   471.95      40.70        19.21
          JPEG             973.77      37.66        36.67
  • 21. Instruction Level Models Accuracy
      • State of the art instruction level models
          → Large Errors
          Application               First order Model     Second order Model
                                    E (mJ)    Err         E (mJ)    Err
          Dhrystone                 3.6       171%        3.3       155%
          QuickSort                 15.80     185%        12.63     128%
          ReadBMP                   24.6      145%        21.7      116%
          DCT                       18.2      253%        18.2      253%
          Quantize                  6.4       329%        4.0       169%
          Zigzag                    2.3       195%        2.3       194%
          Huffman Enc.              50.7      164%        47.7      148%
          JPEG                      102.2     179%        93.8      156%
          Average error                       216%                  156%
          Std. Deviation of error             51.6%                 35.0%
  • 22. Instruction Level Models Accuracy
      • State of the art instruction level models
          → Large Errors
          → Can be calibrated using the error of the first benchmark estimate
          Application               First order Model                        Second order Model
                                    E (mJ)   Err     E* (mJ)   Err           E (mJ)   Err     E* (mJ)   Err
          Dhrystone                 3.6      171%    1.31      0.0%**        3.3      155%    1.31      0.0%**
          QuickSort                 15.80    185%    5.07      -8.7%         12.63    128%    4.95      -10.7%
          ReadBMP                   24.6     145%    7.90      -21.4%        21.7     116%    8.50      -15%
          DCT                       18.2     253%    5.82      13.2%         18.2     253%    7.12      38.5%
          Quantize                  6.4      329%    2.04      38%           4.0      169%    1.57      5.4%
          Zigzag                    2.3      195%    0.74      -5.3%         2.3      194%    0.90      15.3%
          Huffman Enc.              50.7     164%    16.3      -15.4%        47.7     148%    18.7      -2.7%
          JPEG                      102.2    179%    32.8      -10.7%        93.8     156%    36.8      0.3%
          Average error                      216%              12.6%                  156%              9.5%
          Std. Deviation of error            51.6%             10.6%                  35.0%             10.4%
  • 23. Instruction Level Models Accuracy
      • State of the art instruction level models
          → Even with calibration,
          → OVBM is more than twice as accurate
          Application               First order Model     Second order Model    OVBM
                                    E* (mJ)   Err         E* (mJ)   Err         E (mJ)   Err
          Dhrystone                 1.31      0.0%**      1.31      0.0%**      1.30     -0.7%
          QuickSort                 5.07      -8.7%       4.95      -10.7%      5.37     -3.2%
          ReadBMP                   7.90      -21.4%      8.50      -15%        8.82     -12%
          DCT                       5.82      13.2%       7.12      38.5%       4.96     -3.5%
          Quantize                  2.04      38%         1.57      5.4%        1.47     -0.9%
          Zigzag                    0.74      -5.3%       0.90      15.3%       0.78     -0.6%
          Huffman Enc.              16.3      -15.4%      18.7      -2.7%       17.64    -8.2%
          JPEG                      32.8      -10.7%      36.8      0.3%        33.67    -8.2%
          Average error                       12.6%                 9.5%                 4.2%
          Std. Deviation of error             10.6%                 10.4%                3.5%
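      The Err columns appear to be signed relative errors against the XPA reference energies from the Estimation Accuracy slide. That reading is an assumption, but it reproduces several OVBM rows to within rounding. A short C sketch with an illustrative function name:

```c
#include <assert.h>

/* Signed relative error of an estimate against a reference value, in percent. */
double percent_error(double estimate, double reference)
{
    assert(reference != 0.0);
    return (estimate - reference) / reference * 100.0;
}
```

      For example, percent_error(5.37, 5.55) is about -3.2 (QuickSort) and percent_error(33.67, 36.67) is about -8.2 (JPEG). Rows whose energies are printed with only two decimals, such as Zigzag, cannot be reproduced exactly from the rounded table values.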
  • 24. Estimation Speed
      • OVBM tool is 3 orders of magnitude faster than accurate XPA tool
      • Speed of OVBM depends on speed of Target Device
          Application      OVBM Tool (Seconds)           XPA (Hours)
                           Host     Target    Total
          Dhrystone        0.03     7.49      7.53       1.2
          Quicksort        0.01     23.08     23.09      2.5
          ReadBMPBlock     0.21     5.88      6.08       3.4
          DCT              0.03     10.85     10.88      2.5
          Quantize         0.01     8.40      8.41       1.4
          Zigzag           0.01     4.41      4.42       1.1
          Huffman Encode   0.07     65.04     65.11      5.7
          JPEG             0.28     104.24    104.52     10.6
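      Because the two tools are timed in different units, the speedup needs a conversion from hours to seconds. A minimal C sketch with an illustrative function name:

```c
#include <assert.h>

/* Speedup of a tool timed in seconds over a reference tool timed in hours. */
double speedup_over_reference(double reference_hours, double tool_seconds)
{
    assert(tool_seconds > 0.0);
    return reference_hours * 3600.0 / tool_seconds;
}
```

      With the table's Dhrystone row, speedup_over_reference(1.2, 7.53) is roughly 574, and the JPEG row gives roughly 365.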
  • 25. Limitations
      • The generated model is specific to a single implementation and processor configuration.
      • The source code of the application is required to annotate it and to trace operand value metrics.