ALEA: Fine-grain Energy Profiling with Basic Block Sampling

1
ALEA: Fine-grain Energy Profiling with Basic Block Sampling
Lev Mukhanov, Dimitrios S. Nikolopoulos and Bronis R. de Supinski
Queen’s University of Belfast
PACT 2015

2
Executive summary
Fine-grain energy profiling is essential for energy optimization
Contribution:
A probabilistic approach and a tool (ALEA) for fine-grain energy profiling

3
Outline
Introduction
Probabilistic approach and ALEA implementation
Validation process
Experiments and use cases

4
Energy optimization challenge
[Figure: number of lines of code (millions) and number of commits over 12 months (thousands); the values shown are 18 and 68]
Energy efficient???

5
How do workloads affect energy?
Kmeans:
Block        Time
I/O blocks   10.2 s
Block 1      0.3 s
Block 2      10 s

6
How do workloads affect energy?
Kmeans:
Block        Time     Energy
I/O blocks   10.2 s   83 J
Block 1      0.3 s    2 J
Block 2      10 s     246 J

7
Fine-grain energy profiling challenges
Power/energy meters are coarse-grained
Any measurement biases the real energy consumption
The overhead introduced by measurements is critical

9
State of the art approaches
Manual instrumentation [PowerPack, R. Ge 2010]
  Low overhead
  Coarse-grain
  What code should be instrumented?
  Source code must be modified
Binary instrumentation
  Fine-grain
  High overhead (PIN: more than 300%)
HPM (Hardware Performance Monitors) [R. Bertran 2013]
EPI (Energy Per Instruction) models [Y.S. Shao 2013]
  Low overhead
  Do not capture the dynamic execution context
  Low accuracy
Sampling [PowerScope, J. Flinn 1999]
  Low overhead
  Is it fine-grain?

10
Performance profiling based on sampling
Performance profiling model: the period between samples is attributed to the sampled object ⇒ coarse-grain
Probabilistic model

11
Probabilistic model
Probability that basic block $bb_m$ is sampled:
$p_{bb_m} = P(X_{bb_m} = 1) = \frac{C^{1}_{t_{bb_m}}}{C^{1}_{t_{exec}}} = \frac{\sum_{j=1}^{k} latency^{j}_{bb_m}}{t_{exec}} = \frac{t_{bb_m}}{t_{exec}}$
$\hat{p}_{bb_m} = \frac{n_{bb_m}}{n}$
$\hat{t}_{bb_m} = \hat{p}_{bb_m} \cdot t_{exec} = \frac{n_{bb_m} \cdot t_{exec}}{n}$
Normal distribution
$\hat{pow}_{bb_m} = \frac{1}{n_{bb_m}} \cdot \sum_{i=1}^{n_{bb_m}} pow^{i}_{bb_m}$
$\hat{e}_{bb_m} = \hat{pow}_{bb_m} \cdot \hat{t}_{bb_m}$
95 % confidence interval
$\hat{pow}^{u}_{bb_m} = \hat{pow}_{bb_m} + z_{\alpha/2} \frac{s}{\sqrt{n_{bb_m}}}$
$\hat{pow}^{l}_{bb_m} = \hat{pow}_{bb_m} - z_{\alpha/2} \frac{s}{\sqrt{n_{bb_m}}}$
$s = \sqrt{\frac{1}{n_{bb_m} - 1} \cdot \sum_{i=1}^{n_{bb_m}} (pow^{i}_{bb_m} - \hat{pow}_{bb_m})^{2}}$

12
Probabilistic model
Execution time of a block:
$time_{block} = p_{block} \cdot time_{application}$ (1)
Estimation of $\hat{p}_{block}$ using sampling
Estimation of execution time:
$\widehat{time}_{block} = \hat{p}_{block} \cdot time_{application}$ (2)
A power measurement and a sample are taken simultaneously to estimate $\widehat{power}_{block}$
Estimation of energy consumption:
$\widehat{energy}_{block} = \widehat{power}_{block} \cdot \widehat{time}_{block}$ (3)
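Equations (1) to (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `estimate_blocks`, the sample format (a list of `(block_id, power_watts)` pairs) and the toy numbers are hypothetical and are not taken from ALEA.

```python
from collections import defaultdict


def estimate_blocks(samples, time_application):
    """Estimate time, power and energy per block from (block_id, power) samples.

    The sample format is an assumption made for this sketch only.
    """
    n = len(samples)
    counts = defaultdict(int)
    power_sums = defaultdict(float)
    for block, power in samples:
        counts[block] += 1
        power_sums[block] += power

    estimates = {}
    for block, n_block in counts.items():
        p_hat = n_block / n                       # share of samples that hit the block
        time_hat = p_hat * time_application       # equation (2)
        power_hat = power_sums[block] / n_block   # mean of the sampled power readings
        estimates[block] = {
            "time": time_hat,
            "power": power_hat,
            "energy": power_hat * time_hat,       # equation (3)
        }
    return estimates


# Toy data: four samples over a 20-second run.
samples = [("block1", 8.0), ("block2", 24.0), ("block2", 26.0), ("block1", 8.0)]
result = estimate_blocks(samples, time_application=20.0)
print(result["block1"]["energy"], result["block2"]["energy"])
```

Both toy blocks receive the same time estimate (10 s), yet their energy estimates differ because the sampled power differs.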

13
Random sampling ≈ Systematic sampling
[Figure: two application timelines, axis: Time (ticks). Random sampling: samples are taken at random ticks, and each sample records a power reading together with the basic block being executed. Systematic sampling: the first sample is taken at a random tick (23 in the figure) and the following samples at a fixed interval of 1000 ticks (1023, ...).]
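The comparison between the two sampling schemes can be tried on a toy timeline. The sketch below is a simulation under assumptions, not the ALEA sampler: the two-block loop, its tick counts, the seed and all function names are hypothetical.

```python
import random


def build_timeline(iterations, ticks_a, ticks_b):
    """One list entry per tick, naming the block that runs during that tick."""
    return (["A"] * ticks_a + ["B"] * ticks_b) * iterations


def random_sampling(timeline, n_samples, rng):
    """Each sample picks a tick uniformly at random."""
    return [timeline[rng.randrange(len(timeline))] for _ in range(n_samples)]


def systematic_sampling(timeline, period, rng):
    """Random first tick, then one sample every `period` ticks."""
    start = rng.randrange(period)
    return timeline[start::period]


def share(samples, block):
    """Fraction of samples that hit `block`, i.e. the estimate of its probability."""
    return samples.count(block) / len(samples)


rng = random.Random(2015)  # fixed seed so the run is repeatable
timeline = build_timeline(iterations=200, ticks_a=300, ticks_b=730)
true_share = 300 / 1030
random_share = share(random_sampling(timeline, 2000, rng), "A")
systematic_share = share(systematic_sampling(timeline, 1000, rng), "A")
print(f"true={true_share:.3f} random={random_share:.3f} systematic={systematic_share:.3f}")
```

The loop length in this toy setup (1030 ticks) is deliberately not a divisor or multiple of the 1000-tick sampling period; a program whose period lines up with the sampling period would not behave this way.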

14
Parallel profiling challenge
POWER

15
Parallel profiling challenge
POWER
?

16
Profiling of parallel applications
How to apportion power/energy between threads?
Basic block vector (BBV) $bb_m$:
$bb_m = (bb_{thread_1}, bb_{thread_2}, \ldots, bb_{thread_l})$ (4)
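One way to picture the basic block vector is as a tuple used as a dictionary key, so that each sample is attributed to the combination of blocks running across all threads rather than to a single thread. The sketch below assumes a hypothetical sample format of `(blocks_per_thread, power_watts)`; the names and numbers are illustrative only.

```python
from collections import defaultdict


def profile_bbvs(samples):
    """Group samples by basic block vector and average the sampled power.

    Each sample is (blocks_per_thread, power_watts); this format is an
    assumption made for the sketch.
    """
    stats = defaultdict(lambda: [0, 0.0])  # bbv -> [sample count, power sum]
    for blocks, power in samples:
        entry = stats[tuple(blocks)]
        entry[0] += 1
        entry[1] += power
    return {
        bbv: {"samples": count, "mean_power": total / count}
        for bbv, (count, total) in stats.items()
    }


# Two threads; the vector names the block each thread was executing.
samples = [
    (("bb7", "bb7"), 30.0),
    (("bb7", "bb9"), 34.0),
    (("bb7", "bb7"), 32.0),
]
profile = profile_bbvs(samples)
print(profile[("bb7", "bb7")])
```

With this keying, the single-thread estimators apply unchanged, with the vector taking the place of the individual block.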

17
Implementation
[Diagram: ALEA samples the threads of the application (Thread 1, Thread 2, ..., Thread N) together with power readings from RAPL or INA231]
DWARF debug information is used to assign energy estimates to source code
Architecture-independent implementation (portable)
Low overhead (∼1%), suitable for on-line profiling

18
Sampling period and accuracy of the estimates
Accuracy ∼ the number of samples
[Figure: two panels plotting estimated against measured values as the number of samples grows from 0 to 10000. Time panel: execution time (s), with the random error marked. Energy panel: energy (J), with the confidence interval marked.]
Sampling period ⇒ accuracy

19
Sampling period and accuracy of the estimates
Sampling incurs overhead, which biases the estimates
↓ sampling period: ↓ random error, ↑ overhead
↑ sampling period: ↑ random error, ↓ overhead
Which sampling period should be chosen?
[Figure: overhead (%) and error (%) against the sampling period (1 to 100 ms) for sequential and parallel runs, one panel for Sandy Bridge and one for Exynos. Both panels mark 10 ms as the optimal sampling period, with an overhead of ∼1%.]

20
Validation
14 benchmarks (SPEC 2000, Parsec, Rodinia, SPEC OMP)
direct instrumentation
81% coverage
Average error of the energy estimates:
                    Sandy Bridge   Exynos
all blocks          1.4 %          2.6 %
fine-grain blocks   1.6 %          3.7 %
parallel blocks     3.1 %          3.6 %
all benchmarks      1.4 %          1.9 %

21
Effect of cache instructions and pipelining
[Figure: power (W) of the Arithmetic, Cache and Original blocks on Sandy Bridge and on Exynos; energy (J) and time (s) of the Arithmetic, Cache, Original and EPIOriginal variants on each platform, with a difference of 50% marked on Sandy Bridge and 29% on Exynos.]
Pipelining hides the latency (⇒ energy) of cache accesses
EPI models could lead to significant errors

22
Use cases
kmeans / Sandy Bridge
  profiling: 50% of the total energy is spent on one block (Euclidean distance)
  optimization strategy: align and restrict pointers, force loop unrolling
  results: 7x energy reduction
ocean_cp / Exynos
  profiling: more than 50% of the energy is spent on 6 blocks
  optimization strategy: disable the predictive commoning optimization
  results: 10% power reduction
raytrace / Exynos
  profiling: 50% of the total energy is spent on 2 blocks (SphPeIntersect)
  optimization strategy: remove redundant memory accesses and indirect addressing instructions
  results: 6% energy reduction

23
Conclusion
The proposed probabilistic approach and ALEA provide:
  low overhead (∼1%, on-line profiling)
  accurate estimates (Intel ∼1.4%, ARM ∼2.6%)
  estimates at the fine-grain level
  an architecture-independent approach
ALEA can be applied effectively to optimize energy and power consumption
Future work:
  improve the accuracy of the estimates
  port to new architectures (GPUs and Intel Xeon Phi)
  profiling of VMs

24
Thank you
This research has been supported by the UK EPSRC and by the EC FP7

25
Probabilistic model
Random sampling is approximated by systematic sampling
Power and a basic block are sampled simultaneously
For each block, time, energy and power estimates are provided
For each estimate, a confidence interval is provided
See the paper for more details

26
Use cases
kmeans / Sandy Bridge
  50% of the energy is spent on one block (Euclidean distance)
  optimization strategy: align and restrict pointers, force loop unrolling
  results: 7x energy reduction
ocean_cp / Exynos
  more than 50% of the energy is spent on 6 blocks
  optimization strategy: disable the predictive commoning optimization
  results: 10% power reduction
raytrace / Exynos
  50% of the energy is spent on 2 blocks (SphPeIntersect)
  optimization strategy: remove redundant memory accesses and indirect addressing instructions
  results: 6% energy reduction

27
Use cases. Sandy Bridge
56% of the time is spent on one block (Euclidean distance)
Problem: loop unrolling and auto-vectorization are not applied
Optimization strategy: align and restrict pointers, force loop unrolling
Results: 7x energy decrease
[Figure: time (s), power (W) and energy (100 J) of the basic block against the number of threads, compiled with -O3 and with -O3 + hints. The time panel marks a cache sharing effect; the energy panel marks a difference of 221 Joules (697 %).]

28
Validation. Sandy Bridge results
[Figure: average error (%, axis from 0 to 7) of the time and energy estimates for two basic blocks per benchmark, the first with more samples and the second with fewer samples. Benchmarks: fft, ocean_cp, ocean_ncp, radix, art, ammp, quake, cfd, heartwall and streamcluster (sequential); cfd, streamcluster, ammp and quake (parallel); plus the sequential and parallel averages.]

29
Impact of Memory instructions
[Figure: power (W) of the block variants, ordered by cache access intensity. Sandy Bridge: Nop, Arithm, Mem(L1,store), Mem(L1,load), Mem(store), Mem(L1), Original, Mem(L2,store), Mem(load), Mem(L2,load), Mem, Mem(L2). Exynos: Arithm, Nop, Mem(L1,load), Mem(L1), Mem(L1,store), Mem(L2,load), Mem(L2,store), Mem(L2), Original(L2).]
CPU power is primarily affected by cache accesses

30
Probability to sample a basic block
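Basic block execution
Introduce $X_{bb_m}$, associated with each tick:
$X_{bb_m} = \begin{cases} 1, & \text{if } bb_m \text{ is the sampled basic block} \\ 0, & \text{otherwise} \end{cases}$ (5)
Take one random sample. The probability that $bb_m$ is sampled:
$p_{bb_m} = P(X_{bb_m} = 1) = \frac{C^{1}_{t_{bb_m}}}{C^{1}_{t_{exec}}} = \frac{\sum_{j=1}^{k} latency^{j}_{bb_m}}{t_{exec}} = \frac{t_{bb_m}}{t_{exec}}$ (6)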

31
Execution time estimates
Take several samples using random sampling
$X_{bb_m}$ is random and follows the Bernoulli distribution
Estimate $p_{bb_m}$ using the maximum likelihood estimator of the parameter $p_{bb_m}$ of the Bernoulli distribution for $X_{bb_m}$:
$\hat{p}_{bb_m} = \frac{n_{bb_m}}{n}$ (7)
$t_{bb_m}$ is estimated as
$\hat{t}_{bb_m} = \hat{p}_{bb_m} \cdot t_{exec} = \frac{n_{bb_m} \cdot t_{exec}}{n}$ (8)

32
Power and Energy estimates
The same probabilistic approach
Power consumption is a random variable (Normal distribution)
A realization of the variable is associated with each tick
The mean power consumption of $bb_m$:
$\hat{pow}_{bb_m} = \frac{1}{n_{bb_m}} \cdot \sum_{i=1}^{n_{bb_m}} pow^{i}_{bb_m}$ (9)
Energy consumption of $bb_m$:
$\hat{e}_{bb_m} = \hat{pow}_{bb_m} \cdot \hat{t}_{bb_m}$ (10)

33
Quality of time estimates
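Confidence interval for $p_{bb_m}$
$\hat{p}^{u}_{bb_m} = \hat{p}_{bb_m} + z_{\alpha/2} \sqrt{\frac{1}{n} \cdot \hat{p}_{bb_m} \cdot (1 - \hat{p}_{bb_m})}$ (11)
$\hat{p}^{l}_{bb_m} = \hat{p}_{bb_m} - z_{\alpha/2} \sqrt{\frac{1}{n} \cdot \hat{p}_{bb_m} \cdot (1 - \hat{p}_{bb_m})}$ (12)
$\hat{p}^{l}_{bb_m} \le p_{bb_m} \le \hat{p}^{u}_{bb_m}$ (13)
Confidence interval for $t_{bb_m}$
$\hat{p}^{l}_{bb_m} \cdot t_{exec} \le t_{bb_m} \le \hat{p}^{u}_{bb_m} \cdot t_{exec}$ (14)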
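The interval for the execution time can be computed directly from the sample counts. The sketch below assumes a 95% confidence level (z = 1.96) and uses a hypothetical function name and toy counts; it places the standard error of the Bernoulli estimate under a square root.

```python
import math


def time_interval(n_block, n, t_exec, z=1.96):
    """Confidence interval for the execution time of one block.

    n_block: samples that hit the block, n: all samples,
    t_exec: total execution time, z: normal quantile (1.96 for 95%).
    """
    p_hat = n_block / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (p_hat - half_width) * t_exec, (p_hat + half_width) * t_exec


# Toy numbers: 300 of 1000 samples hit the block during a 10-second run.
low, high = time_interval(n_block=300, n=1000, t_exec=10.0)
# Same share of samples, but four times as many samples overall.
low4, high4 = time_interval(n_block=1200, n=4000, t_exec=10.0)
print(f"[{low:.3f}, {high:.3f}] s versus [{low4:.3f}, {high4:.3f}] s")
```

Quadrupling the number of samples halves the width of the interval, which is one way to read the link between sampling period and accuracy.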

34
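Bounds and confidence. Energy
We can similarly build a confidence interval for power
$\hat{pow}^{u}_{bb_m} = \hat{pow}_{bb_m} + z_{\alpha/2} \frac{s}{\sqrt{n_{bb_m}}}$ (15)
$\hat{pow}^{l}_{bb_m} = \hat{pow}_{bb_m} - z_{\alpha/2} \frac{s}{\sqrt{n_{bb_m}}}$ (16)
$s = \sqrt{\frac{1}{n_{bb_m} - 1} \cdot \sum_{i=1}^{n_{bb_m}} (pow^{i}_{bb_m} - \hat{pow}_{bb_m})^{2}}$ (17)
$\hat{pow}^{l}_{bb_m} \le pow_{bb_m} \le \hat{pow}^{u}_{bb_m}$ (18)
Confidence interval for energy consumption
$\hat{p}^{l}_{bb_m} \cdot t_{exec} \cdot \hat{pow}^{l}_{bb_m} \le e_{bb_m} \le \hat{p}^{u}_{bb_m} \cdot t_{exec} \cdot \hat{pow}^{u}_{bb_m}$ (19)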
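The power and energy bounds can be sketched in the same way. The code below is an illustration under assumptions: the function names and toy readings are hypothetical, z = 1.96 stands for a 95% level, and `s` is taken to be the sample standard deviation with an n - 1 denominator.

```python
import math
import statistics


def power_interval(powers, z=1.96):
    """Confidence interval for the mean power of one block."""
    mean = statistics.fmean(powers)
    s = statistics.stdev(powers)  # sample standard deviation (n - 1 denominator)
    half_width = z * s / math.sqrt(len(powers))
    return mean - half_width, mean + half_width


def energy_interval(powers, n, t_exec, z=1.96):
    """Energy bounds: product of the time bounds and the power bounds."""
    p_hat = len(powers) / n
    p_half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    pow_low, pow_high = power_interval(powers, z)
    return (p_hat - p_half) * t_exec * pow_low, (p_hat + p_half) * t_exec * pow_high


# Toy data: 4 of 16 samples hit the block during an 8-second run.
powers = [9.0, 10.0, 11.0, 10.0]
pow_low, pow_high = power_interval(powers)
e_low, e_high = energy_interval(powers, n=16, t_exec=8.0)
print(f"power in [{pow_low:.2f}, {pow_high:.2f}] W, energy in [{e_low:.2f}, {e_high:.2f}] J")
```

With so few samples the energy interval is very wide around the point estimate of 20 J, which illustrates why the number of samples per block matters.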

35
Parallel applications
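Basic block vector (BBV) $bb_m$
$bb_m = (bb_{thread_1}, bb_{thread_2}, \ldots, bb_{thread_l})$ (20)
$\hat{t}_{bb_m} = \hat{p}_{bb_m} \cdot t_{exec} = \frac{n_{bb_m} \cdot t_{exec}}{n}$ (21)
$\hat{pow}_{bb_m} = \frac{1}{n_{bb_m}} \cdot \sum_{i=1}^{n_{bb_m}} pow^{i}_{bb_m}$ (22)
$\hat{e}_{bb_m} = \hat{pow}_{bb_m} \cdot \hat{t}_{bb_m}$ (23)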

36
Experiments. Impact of memory instructions
How to optimize energy consumption?
Performance vs power optimization
How to decrease power consumption? What affects power consumption?
Block           Description
Basic block A   Copy of BBA
Mem             Only the memory access instructions of BBA
NoMem           Only the arithmetic/logic instructions of BBA
Mem(L2)         Mem block with the size of accessed data limited to 2MB (L2 cache size on Exynos)
Mem(L1)         Mem block with the size of accessed data limited to 2KB (L1 cache size on Exynos)
Mem(load)       Mem block with load instructions only
Mem(store)      Mem block with store instructions only
Mem(L2,load)    Mem(L2) block with loads only
Mem(L2,store)   Mem(L2) block with stores only
Mem(L1,load)    Mem(L1) block with loads only
Mem(L1,store)   Mem(L1) block with stores only

37
Use case (Exynos). Power optimization. ocean_cp
more than 50% of the total execution time is spent in 6 basic blocks
optimization strategy: remove redundant cache accesses
disable prefetch and the predictive commoning optimization (up to 14% power decrease)
a different strategy should be applied for each basic block
DVFS could also be applied
                        Baseline               Energy-optimal
Block                   Time (s)  Energy (J)   Time (s)  Energy (J)  Threads     Frequency        Manual optimization
bb1, jacobcalc2.C:301   2.03      8.48         1.87      6.03        4           1500 MHz         No
bb2, slave2.C:641       1.54      6.70         1.31      4.16        2           1600 MHz         Yes
bb3, laplacalc.C:83     2.02      9.53         2.55      7.98        2           1500 MHz         No
bb4, multi.C:253        2.17      7.22         2.62      6.52        2           1500 MHz         No
bb5, multi.C:235        2.36      7.88         3.29      5.56        1           1500 MHz         No
bb6, multi.C:290        2.67      9.23         3.23      5.46        1           1500 MHz         No
program                 29.93     108.64       26.88     72.84       2.0 (avg.)  1516 MHz (avg.)  Yes

38
Platforms
Intel Sandy Bridge (Intel Xeon E5-2650): 2 CPUs, 8 cores, 32KB/32KB I/D-cache per core, 2MB L2 cache, 20MB L3 cache. OS: CentOS (release 6.5). Frequency: 2 GHz. Energy measurements: RAPL
Samsung Exynos 5 Octa (Odroid-XU+E): ARM big.LITTLE, 4 A15 cores, 4 A7 cores, 32KB/32KB I/D-cache per core, 2MB L2 cache. OS: Ubuntu 14.04 LTS. Frequency: 1.6 GHz. Energy measurements: power meters (INA231)
