Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embedded Real-Time Operating System

Quantifying Energy Consumption
for Practical Fork-Join Parallelism
on an Embedded Real-Time Operating System
Antonio Paolillo - Ph.D student & software engineer
24th ACM International Conference on Real-Time Networks and Systems
21th October 2016

Goal:
Parallelism helps to reduce energy
while meeting real-time requirements
4

5

6

7

8

9

10

11

12

Combine:
● parallel task model
● Power-aware multi-core scheduling
55
Save more with parallelism? Yes!

Combine:
● parallel task model
● Power-aware multi-core scheduling
→ validated in theory → malleable jobs (RTCSA’14)
56
Save more with parallelism? Yes!

This work
Goal:
Reduce energy consumption as much possible
while meeting real-time requirements.
57

This work
Goal:
Reduce energy consumption as much possible
while meeting real-time requirements.
Validate it in practice
58

Validate in practice
● Parallel programming model
● RTOS techniques → power-aware multi-core hard real-time scheduling
● Evaluated with full experimental stack (RTOS + hardware in the loop)
● Simple, but real application use cases
59

1. Run-time framework
2. Task model and analysis
3. Experiments
60

3. Experiments
61

Run-time framework
62
MPSoC platform

Run-time framework
63
MPSoC platform
RTOS
& lib support

Run-time framework
64
MPSoC platform
RTOS
& lib support

Run-time framework
65
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark

Run-time framework
66
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark

(Simple) Parallel Program Flow
69

(Simple) Parallel Program Flow
70

An embedded RTOS
72
● Micro-kernel based
● Natively supports multi-core
● Provides hard-real time scheduling
● Power management fixed at task level
We developed HOMPRTL
→ OpenMP run-time support for HIPPEROS.

3. Experiments
73

● Sporadic fork-join
● Arbitrary deadlines
● Rate monotonic based analysis
● Threads are partitioned
● 3 stages, very simple
Parallel task model
76

● Symmetric Multi-Core
● Global DVFS, finite set of operating points 〈 Vdd
, f 〉
● Power increases monotonically with frequency
Platform model
77

Optimisation & partitioning process
78
Power optimisation
Partitioner
Schedulability test

Optimisation & partitioning process
79
Power optimisation
Partitioner
Schedulability test
Axer et al
proved wrong (shepherding)

3. Experiments
80

3. Experiments
a. Testbed description
b. Single use cases
c. Task systems
81

3. Experiments
b. Single use cases
c. Task systems
82

Experimental stack
83
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark

Experimental stack
84
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark

Experimental stack
85
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark
Measurement
framework

Experimental stack
86
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark
Measurement
framework

Target platform - i.MX6q SabreLite
87
● Embedded board ARM Cortex A9 MP
● Supported by HIPPEROS
● 4 cores, but global DVFS
● Operating points

91
MPSoC
Embedded board
Power supply
220 V
220 V
Transformer
5 V

92
MPSoC
Oscilloscope
Embedded board
Power supply
220 V
220 V5 V
R
Voltage
measurements
P = V² / R
Transformer

93
MPSoC
Control Computer
Oscilloscope
Embedded board
Power supply
220 V
220 V5 V
R
Serial
JTAG
USB
Captured data
transfer
Oscilloscope
trigger
Voltage
measurements
P = V² / R
Transformer
HIPPEROS
Inputs Outputs
Serial

Use cases
94
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark
Measurement
framework

Use cases
95
MPSoC platform
RTOS
& lib support
Parallel app.
benchmark
Measurement
framework
π
AES
Image

● Different workloads
● Easy OpenMP implementation
● Good scaling expected
Use cases
96
Image
AES π

3. Experiments
b. Single use cases
c. Task systems
97

Use case alone - voltage probe output
98

Use case alone - converted to power values
99

Use case alone - execution time measurement
100

Use case alone - peak power measurement
101

Use case alone - energy measurement
102

Use case alone - execution time
103

Use case alone - peak power
104

Single core power consumption
105
P ∝ Vdd
2
f
Vdd
∝ f

Single core power consumption
106
P ∝ Vdd
2
f
Vdd
∝ f
→ P ∝ f3

Multi-core power consumption
107
P ∝ k f3

Use case alone - peak power
108
P ∝ k f 3

Execution time speedup
110
Frequency scaling OpenMP parallelism

3. Experiments
b. Single use cases
c. Task systems
111

● Generate 232 random task systems (U = 0.6 → 2.9)
System experiments
113

● Bind generated tasks to use cases
System experiments
114

● Scheduling test & optimisation in Python for 1, 2, 3, 4 threads
System experiments
115

● Operating points and threads partition for each task
System experiments
116

● Generate HIPPEROS builds
System experiments
117

● Run feasible systems and measures
System experiments
118

● Run feasible systems and measures
System experiments
119

Conclusions
122
● Practical experimental framework flow for parallel real-time applications
● Parallelisation saves up to 25% energy (the whole board)
● Confronted theory with practice
○ Speedup factors
○ Power measurements
● Challenge: integrate “OpenMP-like” programs in industry systems

Conclusion
123
Parallelism helps to reduce energy
while meeting real-time requirements

Dynamic power vs static leakage
126
P ∝ k f3
+ α k

P ∝ k f3
+ α k
127
dynamic static

128
dynamic static
● α is platform-dependent
● Comparison between DVFS and DPM
P ∝ k f3
+ α k

● 232 systems for each degree of //
○ No high utilisation systems
→ Partitioned Rate Monotonic
○ Only feasible systems for 1 → 4 threads
● More threads consume less
● Low utilisation systems don’t need
high operating points (often idle)
● Analysis is pessimistic
for high number of threads
Energy consumption
129

● Same systems, but rel. to 1 thread
● Parallelism helps to save energy :-)
● < 0.13, but not a lot of systems...
● “4 threads” dominates
● ↗ Utot
⇒ ↘ energy
○ Until Utot
= 2.4
○ For high Utot
:
■ Analysis pessimism
■ Schedulable systems only bias,
so already well distributed
■ Some systems are only schedulable in parallel
Relative energy savings
130

● “3 threads” is bad:
○ Loss of symmetry with 4 cores
○ Different tasks executes simultaneously (1+3)
● Measures on the whole platform
○ CPU alone would give better results
● Need more data
○ Charts would be smoother
○ But it takes time…
■ 1 minute/execution
■ 4 executions/system
■ 232 systems, ≈ 15 hours
Questions on experiment results
131

● Flawed, but does not impact results
○ Axer et al (ECRTS’13)
○ Flaw pointed by Fonseca et al (SIES’16)
● Scheduling framework
○ Part of the technical framework
○ Not the important contribution/result
● Will be improved with future work
Analysis technique
132

Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embedded Real-Time Operating System

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embedded Real-Time Operating System

Similar to Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embedded Real-Time Operating System (20)

More from Tulipp. Eu

More from Tulipp. Eu (18)

Recently uploaded

Recently uploaded (20)

Quantifying Energy Consumption for Practical Fork-Join Parallelism on an Embedded Real-Time Operating System