8. Embedded architectural trend
Embedded market trend
• Computation
• Power consumption
HMpSoC issues
• HW/SW partitioning and task mapping are complex and hard to solve
• Estimating power consumption early in the design flow is mandatory
• Huge design space
⇒ Available power modelling tools [1, 2] are not well adapted to complex multicores
⇒ Need a fast power modelling tool
[1] Li et al., “McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures”.
[2] Binkert et al., “The Gem5 Simulator”.
B. Roux 1
9. Outline
Heterogeneous MpSoC
• Definition of HMpSoC families
• Generic representation focused on memory
Communication-based power model
• Fast power modelling approach for task-mapping on HMpSoC
Model parameter extraction
• µBenchmarking methodology to ease extraction of the power model’s parameters
Validation on Xilinx Zynq
• Power model parameter extraction with µBenchmark
• Power model output validation with mutant applications
20. Heterogeneous MpSoC
HMpSoC families
• Distributed HMpSoC: small HW accelerators, fast communications with SW
• Shared HMpSoC: large HW accelerators shared between clusters, slow communications with SW
[Figure: block diagrams of the two families, labelled “Distributed HMpSoC” and “Shared HMpSoC”. Blocks shown: processor units (PU/SW), hardware units (HW), memory units, NoC and NoC interfaces (NoC_itf), GPIO and DDR.]
Generic description
How to precisely describe an architecture in those families?
22. Memory Tree Abstraction
[Figure: memory tree with three classes of nodes (Network class, Cluster class, Core class). Node labels: NoC sublevel0 and sublevel1, Memory sublevel0 and sublevel1, Sw Core 1 … N, Hw Core.]
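A minimal sketch of how such a memory tree could be held in code, assuming one node per level (a single NoC root, one memory node per cluster, cores as leaves). The `Node` class, the `crossed_nodes` helper and the node names are hypothetical and only illustrate the idea that the nodes crossed between two cores are found through their lowest common ancestor.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison: a node is only equal to itself
class Node:
    name: str
    kind: str  # "network", "cluster" or "core" (hypothetical labels)
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach child below this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node: Node) -> List[Node]:
    """Return the chain [node, parent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def crossed_nodes(src: Node, dst: Node) -> List[Node]:
    """Nodes crossed from src up to the lowest common ancestor, then down to dst."""
    up, down = ancestors(src), ancestors(dst)
    i = next(k for k, n in enumerate(up) if n in down)  # lowest common ancestor
    j = down.index(up[i])
    return up[1:i + 1] + list(reversed(down[1:j]))


# Assumed example platform: two clusters below one NoC.
noc = Node("NoC", "network")
mem_a = noc.add(Node("Memory A", "cluster"))
mem_b = noc.add(Node("Memory B", "cluster"))
sw_a1 = mem_a.add(Node("SW A1", "core"))
hw_a = mem_a.add(Node("HW A", "core"))
sw_b = mem_b.add(Node("SW B", "core"))

print([n.name for n in crossed_nodes(sw_a1, hw_a)])  # intra-cluster path
print([n.name for n in crossed_nodes(sw_a1, sw_b)])  # inter-cluster path
```

Cores in the same cluster only cross their cluster memory, while cores in different clusters also cross the network level.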
24. Motivation
HMpSoC energy consumption
Three main sources:
• Dynamic energy consumption used for computations
• Static energy dissipated during execution time
• Energy used for communications between cores
Assumptions
A parallelisable application can be executed on multiple threads, which reduces its execution time but not its complexity:
• The amount of computation is independent of the chosen degree of parallelism
• The amount of communication and synchronization is directly linked to the number of execution threads
25. Power Model Structure (1)
Communication energy cost
• Communications are mapped into memory
• C(Tk_i, Tk_j): set of communication channels crossed from task Tk_i to task Tk_j
$$E_{com}(Tk_i, Tk_j) = \sum_{c \in C(Tk_i, Tk_j)} \left( e0_c + e1_c \times \mathrm{bytes}(Tk_i, Tk_j) \right)$$
Note: Synchronization and IO events are handled as communications.
Computation energy cost
$E_{comp}(Tk_k)$: computed once for each kind of available computational core.
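The communication cost can be sketched in a few lines. This assumes each crossed channel is described by its two parameters (e0, e1); the function name `e_com` and the choice of channels are hypothetical. The numeric parameters are the "CL1 write" and "CL2 read" energy fits quoted in the parameter extraction table of these slides, and combining them in one transfer is only an illustration.

```python
def e_com(channels, nbytes):
    """Sum e0_c + e1_c * nbytes over every crossed channel c.

    channels: iterable of (e0, e1) pairs, in joules and joules per byte.
    """
    return sum(e0 + e1 * nbytes for (e0, e1) in channels)


# Assumed example: a 1000-byte transfer that crosses two channels.
cl1_write = (2.44e-08, 4.72e-09)   # (e0, e1) from the "CL1 write" row
cl2_read = (-2.30e-09, 1.51e-09)   # (e0, e1) from the "CL2 read" row

energy = e_com([cl1_write, cl2_read], nbytes=1000)
print(f"E_com = {energy:.4e} J")
```

The per-channel offsets e0 are counted once per crossed channel, while the e1 terms scale with the number of bytes exchanged.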
[Figure: worked example of the model on two clusters (Cluster A, Cluster B) connected by a NoC, with the Network, Cluster and Core classes marked. Labels shown: CLUSTER MEMORY and SW MEMORY for each cluster, and the cores SW A, SW A1, HW A, SW B and HW B. Each core is drawn as a basic power block with Load and Store instructions; the values (42), (34), (10), (30), (95) and (17) annotate the blocks.]
32. Power Model Structure (2)
Static energy cost
$$E_{stat} = T_{exec} \times P_{stat}$$
where $P_{stat}$ is the static power and $T_{exec}$ is the length of the critical path in the mapping graph weighted with computations and communications.
Global energy cost
$$E_t = E_{stat} + \sum_{k \in N_{Tk}} E_{comp}(Tk_k) + \sum_{(i,j) \in N_{Tk}^2} E_{com}(Tk_i, Tk_j)$$
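The two formulas can be combined in a short sketch, assuming the mapping graph is a DAG whose nodes carry computation times and whose edges carry communication times. The function names, the dictionary layout and all numeric values below are hypothetical; the critical path is computed as a longest path in topological order using the standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def critical_path_time(comp_time, comm_time):
    """Longest path through the task DAG.

    comp_time: {task: seconds}, comm_time: {(src, dst): seconds}.
    """
    preds = {task: set() for task in comp_time}
    for (src, dst) in comm_time:
        preds[dst].add(src)
    finish = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(preds).static_order():
        start = max((finish[p] + comm_time[(p, task)] for p in preds[task]),
                    default=0.0)
        finish[task] = start + comp_time[task]
    return max(finish.values())


def total_energy(p_stat, comp_time, comm_time, e_comp, e_com):
    """E_t = E_stat + sum of computation energies + sum of communication energies."""
    e_stat = critical_path_time(comp_time, comm_time) * p_stat
    return e_stat + sum(e_comp.values()) + sum(e_com.values())


# Illustrative values only (not measurements).
comp_time = {"a": 1.0, "b": 2.0, "c": 1.0}
comm_time = {("a", "b"): 0.5, ("a", "c"): 0.1, ("b", "c"): 0.2}
e_comp = {"a": 1.0, "b": 2.0, "c": 1.0}
e_com = {("a", "b"): 0.1, ("a", "c"): 0.05, ("b", "c"): 0.05}

t_exec = critical_path_time(comp_time, comm_time)
e_total = total_energy(0.5, comp_time, comm_time, e_comp, e_com)
print(t_exec, e_total)
```

Only the static term depends on the schedule length; the computation and communication terms are plain sums over tasks and task pairs.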
35. µBenchmarks purpose
Definition
A µBenchmark is a simple and synthetic application that aims at
stressing a specific part of the execution architecture.
Properties
• Selectivity: each µBenchmark stresses only one specific communication channel
• Intensity variability: a µBenchmark can stress its communication channel at different intensities
• Duration variability: the µBenchmark duration is adapted to match the timing resolution of the power measurement
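As an illustration of the duration variability property, the sketch below picks a repetition count so that a run spans several power samples. It assumes the duration of one iteration follows a linear time model t1 × size + t0 and that ten samples per run are enough; the function `min_scale_factor` and the ten-sample target are hypothetical, while the 5 ms resolution and the "CL1 read" time fit are the values quoted elsewhere in these slides.

```python
import math


def min_scale_factor(t1, t0, size, resolution_s, samples=10):
    """Smallest repetition count whose total duration covers `samples` power samples."""
    per_iteration = t1 * size + t0  # assumed duration of one iteration, in seconds
    if per_iteration <= 0:
        raise ValueError("the time model gives a non-positive duration for this size")
    return math.ceil(samples * resolution_s / per_iteration)


# "CL1 read" time fit (t1, t0), 1000-byte transfers, 5 ms measurement resolution.
scale = min_scale_factor(1.82e-08, -2.95e-08, size=1000, resolution_s=5e-3)
print(scale)
```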
36. µBenchmark structure
General structure
• InterCluster
• IntraCluster
• HwChannel
• SwChannel
Algorithm 1: Generic µBenchmark structure.
Data: scaleFactor, size
initBenchmarkEnv()
startPowerMeasure()
for iteration in 1..scaleFactor do
    openCommunicationChannel()
    producer = spawnProducerThread(size)
    consumer = spawnConsumerThread(size)
    waitThread(producer, consumer)
    closeCommunicationChannel()
end
stopPowerMeasure()
writePowerMeasureToFile()
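A runnable sketch of the same structure, assuming a plain Python queue stands in for the communication channel and a wall-clock timer stands in for the power probe. The two measurement hooks are hypothetical placeholders: a real setup would sample the power rails there, and a real µBenchmark would target one specific hardware channel.

```python
import queue
import threading
import time


def start_power_measure():
    """Hypothetical hook: a real setup would start sampling the power rails here."""
    return time.perf_counter()


def stop_power_measure(t_start):
    """Hypothetical hook: only the elapsed time is returned, not power samples."""
    return time.perf_counter() - t_start


def producer(channel, size):
    """Push `size` one-byte messages into the channel."""
    for _ in range(size):
        channel.put(b"\x00")


def consumer(channel, size, received):
    """Pop `size` messages from the channel and record each of them."""
    for _ in range(size):
        received.append(channel.get())


def run_microbenchmark(scale_factor, size):
    """Repeat the producer/consumer exchange scale_factor times."""
    t_start = start_power_measure()
    total = 0
    for _ in range(scale_factor):
        channel = queue.Queue()  # stands in for openCommunicationChannel()
        received = []
        prod = threading.Thread(target=producer, args=(channel, size))
        cons = threading.Thread(target=consumer, args=(channel, size, received))
        prod.start()
        cons.start()
        prod.join()  # waitThread(producer, consumer)
        cons.join()
        total += len(received)
    elapsed = stop_power_measure(t_start)
    return total, elapsed


if __name__ == "__main__":
    messages, seconds = run_microbenchmark(scale_factor=4, size=1000)
    print(messages, seconds)
```

The `size` argument sets how much data each exchange moves, and `scale_factor` stretches the run so that it can be observed by the measurement equipment.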
[Figure: memory tree abstraction (Network, Cluster and Core classes; NoC and Memory sublevels; Sw Core 1 … N and Hw Cores).]
42. Experimental infrastructure
Zynq architecture
[Figure: Zynq block diagram. Processing System: PS7_0 and PS7_1 with their L1 caches, SCU, L2 cache, DDR and AMBA interconnect. Programmable Logic: ext P0 … ext Pn, attached through the HP1 to HP4, GP0 to GP3 and ACP ports and the IRQ0 … IRQ15 lines. A virtual memory space is also shown.]
Experimental setup
• Board: Xilinx ZC702
• OS: Linux kernel v4.0.0
• Power measurement: TI UCD92xx, PMBus controlled, 5 ms resolution, 7 rails
47. Parameters Extraction (2)
[Figure: Energy curves [3]. Energy [J] versus Size [bytes], from 300 to 1000 bytes, for CL1read, CL1read_burst, CL2read, CL2read_burst, CL1write, CL1write_burst, CL2write and CL2write_burst.]
Linear fits per benchmark: Time [s] f : x → t1·x + t0, Energy [J] f : x → e1·x + e0
Benchmark          t1          t0          e1          e0
CL1 read           1.82e-08    -2.95e-08   1.52e-09    -2.45e-09
CL1 read burst     1.02e-08    6.68e-09    8.40e-10    5.50e-10
CL1 write          6.03e-08    3.12e-07    4.72e-09    2.44e-08
CL1 write burst    5.05e-08    -3.73e-07   3.73e-09    -2.75e-08
CL2 read           1.76e-08    -2.69e-08   1.51e-09    -2.30e-09
CL2 read burst     9.62e-09    3.19e-07    8.01e-10    2.66e-08
CL2 write          7.19e-08    -5.46e-09   1.70e-08    -1.29e-09
CL2 write burst    5.08e-08    -4.14e-07   3.71e-09    -3.02e-08
[3] Results over 22 benchmarks are available in the paper.
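The slides do not say how the linear coefficients were obtained. One plausible way, shown here purely as an assumption, is an ordinary least-squares fit of measured energy against transfer size. The helper `fit_line` is hypothetical, and the data points are synthetic, generated from the "CL1 read" energy row rather than taken from measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic points on the "CL1 read" energy line (e1 = 1.52e-09, e0 = -2.45e-09).
sizes = list(range(300, 1001, 100))
energies = [1.52e-09 * s - 2.45e-09 for s in sizes]

e1, e0 = fit_line(sizes, energies)
print(e1, e0)
```

With real measurements the points would scatter around the line, and the fitted pair (e1, e0) would then feed the communication energy formula of the power model.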
48. Validation on mutant applications (1)
Mutant application
• Abstract application automatically generated from pattern functions
• Randomly generates communication traffic
Mutant generation
• (n + 1) rounds per application
• 3 workers per round
• 12 software patterns
• 6 hardware patterns
[Figure: rounds Round 0, Round 1, …, Round n. Each round runs three workers (SW slotA, SW slotB, HW slotA), each with a random size and pattern function.]
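The generation scheme can be sketched as follows, assuming a mutant is simply a list of rounds in which every slot draws a pattern and a size at random. The pattern names, the `max_size` bound and the `generate_mutant` function are hypothetical; only the counts (n + 1 rounds, three workers per round, 12 software and 6 hardware patterns) come from the slide.

```python
import random

# Placeholder names: the actual pattern functions are not listed on the slide.
SW_PATTERNS = [f"sw_pattern_{i}" for i in range(12)]
HW_PATTERNS = [f"hw_pattern_{i}" for i in range(6)]


def generate_mutant(n, max_size, seed=None):
    """Build rounds 0..n, each with two software workers and one hardware worker."""
    rng = random.Random(seed)
    rounds = []
    for _ in range(n + 1):
        rounds.append({
            "SW slotA": (rng.choice(SW_PATTERNS), rng.randint(1, max_size)),
            "SW slotB": (rng.choice(SW_PATTERNS), rng.randint(1, max_size)),
            "HW slotA": (rng.choice(HW_PATTERNS), rng.randint(1, max_size)),
        })
    return rounds


mutant = generate_mutant(n=4, max_size=4096, seed=0)
for index, workers in enumerate(mutant):
    print(index, workers)
```

Fixing the seed makes a mutant reproducible, which is convenient when the same application has to be both measured and estimated.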
49. Validation on mutant applications (2)
Table 1: Communications spread over channels in two mutants
Channel names: Cache L1, Cache L2, DDR, HPx, ACP, GPx
Total bytes: 4.56e+07
read 6.6%           read 1.3%           read 1.3%
read 1.2%           read 0.6%           polling 3.5%
read burst 0.6%     read burst 5.5%     read burst 18.0%
write 6.8%          write 2.5%          write 1.1%
write 2.0%          write 0.4%          irq 0.5%
write burst 6.6%    write burst 0.0%    write burst 41.3%
Total bytes: 5.37e+07
read 6.7%           read 2.1%           read 4.7%
read 2.9%           read 4.8%           polling 2.0%
read burst 4.7%     read burst 0.2%     read burst 10.0%
write 5.0%          write 0.6%          write 0.0%
write 7.4%          write 2.0%          irq 0.5%
write burst 4.7%    write burst 6.8%    write burst 35.0%
Table 2: Estimation vs. measurement
                         Time [s]                Energy [J]              Error
Mutant                   measured    estimated   measured    estimated   time    energy
mutant 1                 2.308       2.311       2.949       2.943       0.1%    0.2%
mutant 2                 2.340       2.336       3.031       2.964       0.2%    2.2%
average on 80 mutants    2.974       2.975       3.855       3.861       0.5%    1.0%
Power estimation time for 80 mutants: 0.5 s
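The error columns for the two individual mutants are consistent with a plain relative error, which the sketch below recomputes from the measured and estimated values of Table 2. The helper name `relative_error` is hypothetical. The "average on 80 mutants" row is presumably the mean of the per-mutant errors, so it cannot be recomputed from the averaged values alone.

```python
def relative_error(measured, estimated):
    """Absolute difference between estimate and measurement, relative to the measurement."""
    return abs(estimated - measured) / measured


# (measured, estimated) pairs copied from Table 2.
rows = {
    "mutant 1": {"time": (2.308, 2.311), "energy": (2.949, 2.943)},
    "mutant 2": {"time": (2.340, 2.336), "energy": (3.031, 2.964)},
}

for name, values in rows.items():
    time_err = 100 * relative_error(*values["time"])
    energy_err = 100 * relative_error(*values["energy"])
    print(f"{name}: time {time_err:.1f}%, energy {energy_err:.1f}%")
```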
51. Conclusion
Initial issue
Provide a very fast power modelling methodology for task mapping on Heterogeneous MpSoC
Proposals
• Generic model of Heterogeneous MpSoC
• Power modelling approach focused on communication channels
• µBenchmark approach that enables extraction of the architecture’s parameters
⇒ Estimation accuracy and time fit well with task mapping
Ongoing work
• Integrate this methodology in state-of-the-art compiler frameworks [4, 5]
• Move towards HW/SW partitioning for HMpSoC under an energy-efficiency constraint
[4] Floch et al., “GeCoS: A framework for prototyping custom hardware design flows”.
[5] Ceng et al., “MAPS: An Integrated Framework for MPSoC Application Parallelization”.
52. Thanks for your attention
Do you have any questions?
54. Zynq architecture
[Figure: Zynq block diagram. Processing System: PS7_0 and PS7_1 with their L1 caches, SCU, L2 cache, DDR and AMBA interconnect. Programmable Logic: ext P0 … ext Pn, attached through the HP1 to HP4, GP0 to GP3 and ACP ports and the IRQ0 … IRQ15 lines. A virtual memory space is also shown.]