Oscar compiler for power reduction
1. OSCAR Compiler Controlled Multicore Power Reduction on Android Platform
Hideo Yamamoto¹, Tomohiro Hirano¹, Kohei Muto¹, Hiroki Mikami¹, Takashi Goto¹, Dominic Hillenbrand¹, Moriyuki Takamura², Keiji Kimura¹ and Hironori Kasahara¹
¹Green Computing Systems Research and Development Center, Waseda University
²FUJITSU LABORATORIES LTD.
LCPC2013
2. Presentation Outline
• Background
– Power consumption in multicore
– Power control mechanism of the OSCAR Compiler
– Power control on the Android™ platform
• Experimental
– Evaluation target, power rail and measurement device
– Precise power measurement method using GPIO
– Bind mode
– Clock gating method using WFI instruction
• Highlight event in data
– Power consumption of MPEG2 decoder
• Conclusion
5. Power Consumption in Multicore
• Uni core: P = f*c*v^2 ..... Eq.1
• Multi core: P = n*f*c*v^2 ..... Eq.2
In the quad-core case, 'f' can be reduced to ¼ while keeping the same performance. If 'v' is 0.6 of its original value at ¼ 'f', power consumption is reduced to 0.36 of the original (4 x ¼ x 0.6^2 = 0.36 by Eq.2).
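The quad-core figure follows directly from Eq.1 and Eq.2. A minimal sketch of the arithmetic, assuming 'f' and 'v' are expressed relative to the single-core operating point and 'c' is unchanged (the function name relative_power is only illustrative):

```python
def relative_power(n_cores: int, f_ratio: float, v_ratio: float) -> float:
    """Power relative to one core at full frequency and voltage.

    Follows P = n*f*c*v^2 with c held constant, so c cancels in the ratio.
    """
    return n_cores * f_ratio * v_ratio ** 2


# Quad core, each core at 1/4 of the frequency and 0.6 of the voltage.
quad = relative_power(n_cores=4, f_ratio=0.25, v_ratio=0.6)
print(f"{quad:.2f}")
```

The same function can be used to compare other core counts, as long as the frequency and voltage ratios are known for the chosen operating point.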
6. OSCAR Compiler
Waseda University
Multigrain Parallel Processing
• Hierarchical and global parallelization
• Coarse grain task parallel
• Loop iteration parallel
• Statement level parallel
Data Locality Optimization
• Task (or loop) decomposition considering cache size or local memory size
• Task scheduling considering data affinity
Low Power Optimization
• Power scheduling with DVFS, clock gating and power gating by software
[Figure labels: Doall loop; Seq. loop; task level or statement level parallelization]
7. Power Control Mechanism of the OSCAR Compiler
• Estimate the execution time of each MT (macro task) and find the critical path
• Determine the execution time to satisfy the given deadline
• Decide the optimal frequency and voltage of each MT
[Figure: a statically scheduled MTG (MT1 to MT9 on Core0 to Core2), before and after power scheduling with DVFS, clock gating and power gating by software. Labels after scheduling: MT5 (low freq.), MT3 (low freq.), given deadline, time margin, clock gating, power gating, time management. Horizontal axis: time.]
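The three steps above can be sketched for a single MT as follows. This is a simplified illustration, not the OSCAR compiler's actual algorithm: it assumes execution time scales inversely with frequency, and it borrows the frequency/voltage pairs shown on the 3-core MPEG2 waveform slide of this deck (1.7GHz/1.4V, 400MHz/1.05V, 200MHz/0.92V). The names OPERATING_POINTS and pick_operating_point are hypothetical.

```python
# (frequency in MHz, voltage in V), lowest frequency first.
OPERATING_POINTS = [(200, 0.92), (400, 1.05), (1700, 1.40)]


def pick_operating_point(time_at_max_ms: float, deadline_ms: float):
    """Return the slowest (freq, volt) pair that still meets the deadline.

    time_at_max_ms is the estimated execution time at the highest frequency.
    Assumes execution time is inversely proportional to frequency.
    Returns None when even the highest frequency misses the deadline.
    """
    f_max = OPERATING_POINTS[-1][0]
    for freq, volt in OPERATING_POINTS:
        scaled_time = time_at_max_ms * f_max / freq
        if scaled_time <= deadline_ms:
            return freq, volt
    return None


# An MT that takes 2 ms at 1.7GHz and has 10 ms until its deadline.
print(pick_operating_point(2.0, 10.0))
```

Picking the slowest point that still fits uses the time margin before the deadline for a lower voltage; any margin that remains after that choice is what clock gating or power gating would cover.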
8. Power Control on Android
• CPUFreq
– Frequency and voltage scaling of a target CPU
• CPUIdle
– Manages the level of idle on each core of the CPU
• HotPlug (> 10ms)
– Extended function of CPUFreq and CPUIdle
– Adds another core to distribute the load under high utilization
– Shuts down excess cores under low utilization
– Decides the core on/off line by heuristic adaptation
9. Problems of Linux Power Control and Parallel Processing
• Hotplug can't bring cores online and bind threads swiftly
– In the worst case it needs several hundred milliseconds
• Non real-time
– Linux can't control time at resolutions finer than 5-10ms
[Figure: startup time 440.6ms]
10. Background
• Motivation
– Parallelization is effective for low power execution with DVFS, power gating and clock gating
– The OSCAR compiler can generate power control API calls automatically
• Obstacle
– Linux needs a long startup time to distribute load to multiple cores
– Lack of fine resolution time control
• Challenge
– Low power execution on the Android platform by parallelization
12. Evaluation Board - ODROID-X2
• Samsung Exynos4412 Prime
– ARM Cortex-A9 quad core
– Maximum clock frequency 1.7GHz
– Used by Samsung's Galaxy S3
• DVFS can't be applied to each core independently
• Android Open Source version is in place
• Circuit schematic is available on request
13. Power Rail for Exynos4412
• Exynos4412 is powered by 4 PMIC (Power Management IC) voltage rails
– VDD_ARM: CORE
– VDD_INT: Interrupt controller and L2
– VDD_G3D: GPU
– VDD_MIF: DDR memory
• Power consumption of VDD_ARM (CORE) has been measured
[Figure: SoC Exynos4412 block diagram. The PMIC supplies VDD_ARM to four Cortex-A9 cores (each with 32KB I/D and NEON), VDD_INT to the interrupt controller + L2, VDD_G3D to the GPU, and VDD_MIF to DDR.]
14. Modified Circuit Diagram of ODROID-X2
[Figure: modified circuit diagram with current and voltage measurement points. Voltage (V) x Current (A) = Power (W)]
15. How to Measure CORE Power on ODROID-X2
• Adding a 40 mΩ shunt resistor to VDD_ARM
[Figure: the shunt sits between the PMIC and the SoC; an instrumentation amp picks up the voltage drop across it.]
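The measurement reduces to Ohm's law across the shunt. A minimal sketch, assuming the instrumentation amp output has already been divided by its gain so that the voltage drop across the 40 mΩ shunt is known; the 1.4 V rail voltage and 20 mV drop in the example are illustrative inputs, not measurements from the slides:

```python
SHUNT_OHMS = 0.040  # 40 mOhm shunt on VDD_ARM


def core_power_watts(rail_volts: float, shunt_drop_volts: float) -> float:
    """Power drawn on the rail: P = V * I, with I = V_drop / R_shunt."""
    current_amps = shunt_drop_volts / SHUNT_OHMS
    return rail_volts * current_amps


# Hypothetical reading: 20 mV across the shunt with the rail at 1.4 V.
print(f"{core_power_watts(1.4, 0.020):.3f} W")
```

A small shunt value keeps the voltage drop on the rail low, which is why an amplifier is needed before the drop can be sampled.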
17. "bind" Mode
• Core assignment logic of Android Linux hotplug is heuristic
• A new core assignment mode called "bind" mode was developed for efficient parallel execution
• "bind" mode is integrated in Android Linux as the OSCAR runtime and API
• Specification of the OSCAR API for "bind" mode
– Core 0 is reserved for the Android system and non-OSCAR parallel programs
– The application can disable hotplug and control the core ON/OFF line
– The application can bind Cores 1, 2 and 3 to the OSCAR parallel program
[Figure: startup time 7.2ms]
18. Clock Gating
• WFI instruction
– The WFI instruction suspends the execution of the processor core and stops the clock until a timer event
• Clock gating driver using the WFI instruction
– The WFI instruction is a privileged instruction
– The API allows a user program to execute the WFI instruction within a Linux driver
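19. Fine Timing Control by WFI Driver
The time slot is 500us. A call to call_wfi_api(N) clock-gates the core for a time T with (N-1) x 500us < T < N x 500us.

    while(1) {
        gpio_value(1);
        call_wfi_api(1);    /* clock gating: 0us < T < 500us */
        gpio_value(0);
    }

    while(1) {
        gpio_value(1);
        call_wfi_api(4);    /* clock gating: 1500us < T < 2000us, wake up at 2000us (4 slots) */
        gpio_value(0);
    }

[Figure: GPIO and current waveforms (250mA and 500mA levels) for both loops, with markers at 1500us (3 slots) and 2000us (4 slots).]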
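The slot rule above can be written out as a small helper. This is only a sketch of the arithmetic on the slide, (N-1) x 500us < T < N x 500us; the function name wfi_sleep_bounds_us is hypothetical and is not part of the OSCAR API:

```python
SLOT_US = 500  # time slot length from the slide


def wfi_sleep_bounds_us(n_slots: int) -> tuple[int, int]:
    """Open interval (low, high) in microseconds for the gated time T.

    Implements (N - 1) x 500us < T < N x 500us.
    """
    if n_slots < 1:
        raise ValueError("n_slots must be at least 1")
    return (n_slots - 1) * SLOT_US, n_slots * SLOT_US


for n in (1, 4):
    low, high = wfi_sleep_bounds_us(n)
    print(f"call_wfi_api({n}): {low}us < T < {high}us")
```

The uncertainty is always one slot wide, because the wake-up is aligned to the next slot boundary regardless of when within the current slot the call is made.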
20. Current Waveform of Busy Wait without Clock Gating
[Figure: current waveform of busy wait in ordinary execution for 1, 2, 3 and 4 cores. Vertical axis ticks: 500mA, 1000mA, 1500mA, 2000mA.]
21. Current Waveform of Busy Wait with Clock Gating
[Figure: current waveform of busy wait with clock gating for 1, 2, 3 and 4 cores. Vertical axis ticks: 500mA, 1000mA, 1500mA, 2000mA. Annotations: wake up all cores; clock gating all cores.]
22. Comparison of Current Waveforms
[Figure: the two waveforms side by side for 1, 2, 3 and 4 cores: busy wait in ordinary execution, and busy wait with clock gating. Vertical axis ticks: 500mA, 1000mA, 1500mA, 2000mA. Annotations: wake up all cores; clock gating all cores.]
24. Power Consumption of MPEG2 Decoder on ODROID-X2
[Figure: power consumption versus number of cores, with and without power reduction control. Annotated reductions: 1/7 (13.3%) and 1/3 (38.1%).]
26. Power Waveform of MPEG2 Decoder for 1 Core
[Figure: power waveforms (a) without power reduction control and (b) with power reduction control, both at 1.7GHz, 1.4V. Labels: MPEG2 decode execution in high clock and voltage; busy wait execution; clock gating by WFI; reduced by WFI. Legend: consumed, reduced.]
27. Power Waveform of MPEG2 Decoder for 3 Cores
[Figure: power waveforms (a) without power reduction control and (b) with power reduction control. Labels: MPEG2 decode execution in high clock and voltage; MPEG2 decode execution in low clock and voltage; busy wait execution; clock gating by WFI; DVFS, P = n*f*c*V^2; reduced by WFI. Operating points shown: 1.7GHz, 1.4V; 400MHz, 1.05V; 200MHz, 0.92V. Legend: consumed, reduced.]
28. Power Consumption of MPEG2 Decoder on ODROID-X2
[Figure: power consumption versus number of cores. Bar labels: 2.79, 0.97, 0.63, 0.37. Annotations: WFI, DVFS, WFI; 1/3 (38.1%). Legend: consumed, reduced.]
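The percentages annotated on the charts can be reproduced from the bar labels. A quick check, assuming the labels are in watts (the conclusions give 0.97 and 0.37 in watts) and that 13.3% compares the 0.37 bar with the 2.79 bar; the extracted text does not say which configuration each bar belongs to:

```python
def percent_of(after_watts: float, before_watts: float) -> float:
    """Remaining power as a percentage of the baseline, rounded to 0.1."""
    return round(100.0 * after_watts / before_watts, 1)


print(percent_of(0.37, 0.97))  # against the 0.97 bar
print(percent_of(0.37, 2.79))  # against the 2.79 bar
```

Both results are percentages of power remaining, so "1/3 (38.1%)" means the power drops to roughly a third of the baseline rather than by a third.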
29. Conclusions
• The ODROID-X2 circuit was modified such that
1. Precise power waveforms at the output of the PMIC are observed, and
2. The power waveforms and parallel program events are inter-related in timing for OSCAR compiler optimization.
• An efficient parallel program execution platform on Android was established by
1. "bind" mode, and
2. The WFI instruction by the OSCAR compiler.
• The newly developed OSCAR compiler power control mechanism decreased the power to about one third, from 0.97 Watt on 1 core to 0.37 Watt on 3 cores, when running the MPEG2 decoder on the Android platform.
33. Power Waveform of Optical Flow for 1 Core
[Figure: power waveform. Labels: Optical Flow execution; busy wait execution; clock gating by WFI; reduce power of wasted CPU cycles.]
34. Power Waveform of Optical Flow for 3 Cores
[Figure: power waveform. Labels: Optical Flow execution in high clock and voltage; Optical Flow execution in low clock and voltage; busy wait execution; clock gating by WFI; P = n*f*c*V^2.]
35. Low-Power Code with OSCAR API
Compilation flow
• Input: sequential programs (C/Fortran)
• OSCAR Compiler
– Multigrain parallelization
– Memory optimization
– Data transfer optimization
– DVFS, clock gating
• Output: low-power parallel C/Fortran programs including OSCAR API
• Backend compiler: the API decoder translates OSCAR API into library calls, and the native compiler produces the executable object
Scheduled tasks
• Proc0: T1, off
• Proc1: T2, T4
• Proc2: T3, T6 (slow)
OSCAR API directives shown
    #pragma oscar fvcontrol(pe, (id, state))
    #pragma oscar get_fvstatus(pe, id, state)
    #pragma oscar get_current_time(current, timer_no
38. How Hotplug Works
[Figure: hotplug timing diagram showing cores 0, 1 and 2 coming online and going offline over time. Delay labels: up2g0_delay, up2gn_delay, down_delay. State labels: up, down, idle, disable.]
39. Auto Hotplug Governor
tegra_cpu_set_speed_cap (excerpt; omitted source lines are marked)
    int tegra_cpu_set_speed_cap(unsigned int *speed_cap)
    {
        /* ... */
        unsigned int new_speed = tegra_cpu_highest_speed();
        /* ... */
        new_speed = tegra_throttle_governor_speed(new_speed);
        new_speed = edp_governor_speed(new_speed);
        new_speed = user_cap_speed(new_speed);
        /* ... */
        ret = tegra_update_cpu_speed(new_speed);
        /* ... */
        tegra_auto_hotplug_governor(new_speed, false);
        /* ... */
    }
tegra_auto_hotplug_governor parameters
    Parameter   | LP-mode       | GP-mode
    up_delay    | up2g0_delay   | up2gn_delay
    down_delay  | down_delay    | down_delay
    top_freq    | idle_top_freq | idle_bottom_freq
    bottom_freq | 0             | idle_bottom_freq
State transitions
    Current state | Compare with requested freq | New state | Delay to take effect
    IDLE          | > top_freq                  | UP        | up_delay
    IDLE          | <= bottom_freq              | DOWN      | down_delay
    DOWN          | > top_freq                  | UP        | up_delay
    DOWN          | > bottom_freq               | IDLE      | NA
    UP            | < bottom_freq               | DOWN      | down_delay
    UP            | <= top_freq                 | IDLE      | NA
[Figure labels: Throttle_table, throttle_index, update from user, thermal_cooling_device, Edp_Thermal, Auto Hot plug, Suspend, CpuFreq]
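The state table on this slide can be read as a small state machine. The sketch below transcribes the rows as listed, in Python rather than kernel C; it is an illustration of the table, not the Tegra kernel implementation, and the function name next_state and the example frequencies are hypothetical.

```python
def next_state(state: str, freq: int, top_freq: int, bottom_freq: int):
    """Return (new_state, delay_name) following the transition table.

    delay_name is None when the table lists no delay, or when no row
    matches (in which case the state is returned unchanged).
    """
    if state == "IDLE":
        if freq > top_freq:
            return "UP", "up_delay"
        if freq <= bottom_freq:
            return "DOWN", "down_delay"
    elif state == "DOWN":
        if freq > top_freq:
            return "UP", "up_delay"
        if freq > bottom_freq:
            return "IDLE", None
    elif state == "UP":
        if freq < bottom_freq:
            return "DOWN", "down_delay"
        if freq <= top_freq:
            return "IDLE", None
    return state, None


# Hypothetical thresholds in MHz: top_freq=800, bottom_freq=400.
print(next_state("IDLE", 1000, 800, 400))
print(next_state("UP", 600, 800, 400))
```

Rows are checked in the order they appear in the table, so a request above top_freq always wins over the bottom_freq comparison.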