This document describes SmartBalance, a sensing-driven Linux load balancer for heterogeneous multi-processor systems-on-chips (MPSoCs) that aims to improve energy efficiency. It does this through a predictive approach that balances tasks among cores based on ongoing performance and power measurements, with the goal of jointly addressing workload variability and hardware heterogeneity. The key aspects are on-chip sensing to monitor performance and power, online prediction of these metrics, and a simulated annealing-based allocation algorithm to optimize task distribution across cores in each scheduling epoch.
SmartBalance-DAC-v2
1. SmartBalance: A Sensing-Driven Linux Load Balancer for Energy Efficiency of Heterogeneous MPSoCs
Santanu Sarma
Computer Science, UC Irvine
http://variability.org
Coauthors: Tiago R. Muck, Danny Bathen, N. Dutt, A. Nicolau
T1: Measurement and Modeling
T2: Design Tools and Testing
T3: Microarchitecture and Compilers
T4: Runtime Support
T5: Applications and Testbeds
T6: Outreach and Education
3. Heterogeneous Platforms
Clear trend towards heterogeneous many/multi-core architectures with different core types
Examples: ARM (big.LITTLE), NVIDIA Tegra, and AMD GPGPU
4. Emerging & Future HMPs
6/10/15
Futuristic heterogeneous multicore processors are expected to have shared memories, coherent buses, multiple networks, and accelerators
[Figure: block diagram of a future HMP with A15, A11, and A7 cores, L2 caches, a cache-coherent interconnect, L3, GPU and other accelerators, Bluetooth, GSM, WiFi, 3/4G, and 5G interfaces, a global interrupt controller, DRAM, SPM, and disk]
5. Smart Load Balancing Problem
• Standard Load Balancing: distribute threads (tasks) among cores uniformly and randomly (lack of awareness at thread level)
• Smart Load Balancing: distribute threads (tasks) among cores with awareness of energy/power at the thread level
6. Traditional OS Allocator & Scheduler
• Do not cope jointly with workload variability and heterogeneity
• Do not expose variability at the OS layer
• Lack suitable abstractions
• Lack support for generic HMPs
[Figure: Allocation / Balancing: a scheduler maps Task 1, Task 2, ..., Task n, Task m onto A7, A11, and A15 cores that share an LLC]
Traditional OSes (e.g., Linux) are not yet ready to deal with the DUAL challenge of heterogeneity and variability
11. Performance-Power Prediction
• Performance prediction for each core type based on profiling or online learning
• Customized predictors for each different core with fine / precise prediction
– For known architectures
• A generic predictor for coarse prediction
– For unknown architectures with new core types
[Figure: Variability-Aware Performance & Power Prediction across Epoch 1, Epoch 2, and Epoch 3; inputs: performance counters, configurations, and power & variability sensing; outputs: Perf. matrix and Power matrix]
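The prediction step can be illustrated with a small sketch that turns one epoch of per-thread measurements into a performance matrix and a power matrix (rows are threads, columns are core types). This is only an illustration under stated assumptions, not the SmartBalance predictor: the core-type names, the fixed scaling tables, and the function names `predict_matrices` and `smooth` are all hypothetical, and a customized predictor would replace the tables with profiled or learned models.

```python
# Hypothetical sketch: build per-epoch performance and power matrices.
# Rows are threads, columns are core types. All numbers are made up.

CORE_TYPES = ["big", "little"]

# Assumed scaling from the core type a thread was measured on (first key)
# to the core type being predicted (second key).
IPC_SCALE = {("big", "big"): 1.0, ("big", "little"): 0.5,
             ("little", "little"): 1.0, ("little", "big"): 2.0}
POWER_SCALE = {("big", "big"): 1.0, ("big", "little"): 0.25,
               ("little", "little"): 1.0, ("little", "big"): 4.0}


def predict_matrices(samples):
    """samples: list of (measured_core_type, ipc, power_watts), one per thread."""
    perf, power = [], []
    for core, ipc, watts in samples:
        perf.append([ipc * IPC_SCALE[(core, c)] for c in CORE_TYPES])
        power.append([watts * POWER_SCALE[(core, c)] for c in CORE_TYPES])
    return perf, power


def smooth(previous, current, alpha=0.5):
    """Blend the last epoch's matrix with the new one (simple online update)."""
    return [[alpha * c + (1 - alpha) * p for p, c in zip(prow, crow)]
            for prow, crow in zip(previous, current)]
```

Calling `predict_matrices` once per epoch and passing the result through `smooth` gives a crude form of online learning; the blending factor `alpha` is likewise an assumption.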
12. On-line Optimization
• Problem definition: $\max_{\Psi} \left( \frac{\mathrm{IPS}}{\mathrm{Power}} \right)$
• NP-hard problem
• Finding a solution requires heuristics
• Simulated annealing based allocation
• On-line, low-overhead (< 1% for a 100 ms epoch)
[Figure: Task Allocation (SA Based Online Solver) takes the Perf. matrix (ipc00 ... ipcij), the Power matrix (p00 ... pij), and the objective(s), and outputs an allocation of threads t1 to t4 in each epoch (Epoch 1, Epoch 2, Epoch 3) alongside CFS]
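A simulated annealing allocator of the kind named on this slide can be sketched as below. It is a simplified illustration, not the SmartBalance solver: the objective is taken to be total predicted IPS divided by total predicted power, core contention and capacity limits are ignored, and the cooling schedule, step count, and function names (`efficiency`, `anneal`) are assumptions.

```python
import math
import random


def efficiency(alloc, perf, power):
    """Objective: total IPS divided by total power for an allocation.

    alloc[i] is the core index assigned to thread i; perf[i][j] and
    power[i][j] are the predicted IPS and power of thread i on core j.
    """
    ips = sum(perf[i][j] for i, j in enumerate(alloc))
    watts = sum(power[i][j] for i, j in enumerate(alloc))
    return ips / watts


def anneal(perf, power, steps=2000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for an allocation that maximizes efficiency()."""
    rng = random.Random(seed)
    n_threads, n_cores = len(perf), len(perf[0])
    current = [rng.randrange(n_cores) for _ in range(n_threads)]
    cur_score = efficiency(current, perf, power)
    best, best_score = current[:], cur_score
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Neighbor move: migrate one random thread to a random core.
        cand = current[:]
        cand[rng.randrange(n_threads)] = rng.randrange(n_cores)
        cand_score = efficiency(cand, perf, power)
        delta = cand_score - cur_score
        # Always accept improvements; accept regressions with
        # probability exp(delta / temp), which shrinks as temp drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = current[:], cur_score
    return best, best_score
```

The fixed step count bounds the solver's runtime per epoch; staying under 1% of a 100 ms epoch means the whole search has to finish in under 1 ms, which constrains how many steps are affordable.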
15. Experimental Goals & Benchmarks
• Goals: Improve energy efficiency
• Benchmarks: PARSEC & mixes
• Interactive Benchmarks (IMB)
– 9 IMBs (e.g., High Throughput High Interactivity, HTHI)
– Ability to control phases, wait periods, etc.
PARSEC mixes: Mix1, Mix2, Mix3, Mix4, Mix5, Mix6
x264Hcrew, x264Hbow, x264Lcrew, x264Lbow, x264Lcrew, x264Hbow, x264Hcrew, x264Lbow, Bodytrack, x264Hcrew, Bodytrack, x264Hcrew, x264Lbow
16. Results w.r.t. Vanilla Linux CFS
Over 50% improvement w.r.t. the vanilla Linux kernel
• The Linux CFS scheduler distributes threads uniformly, irrespective of core types and features
• SmartBalance makes workload & power-aware runtime decisions
17. Results w.r.t. ARM GTS
Over 20% improvement w.r.t. state-of-the-art ARM GTS
• ARM GTS makes a binary decision to select either a big core or a small core based on a utilization threshold
• Unaware of thread-level power and performance
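The binary, threshold-driven decision described above can be pictured with a short sketch. It is not the ARM GTS implementation: the function name `gts_like_choice`, the use of separate up and down thresholds, and their values are assumptions chosen only to show why such a policy sees nothing but utilization.

```python
def gts_like_choice(utilization, current="little", up=0.8, down=0.3):
    """Pick "big" or "little" from a thread's utilization in [0, 1].

    Two assumed thresholds give hysteresis: migrate up above `up`,
    migrate down below `down`, otherwise stay on the current core type.
    Per-thread power and IPC never enter the decision.
    """
    if utilization > up:
        return "big"
    if utilization < down:
        return "little"
    return current
```

A thread with high utilization but poor IPC on the big core would still be sent there by this rule, which is the gap a thread-level power and performance model is meant to close.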
22. Summary and Future Work
• Performance-Power-Aware Predictive Linux Load Balancing
• Over 50% improvement for a quad-core HMP at <1% overhead
• Over 20% improvement in energy efficiency w.r.t. the ARM GTS policy
• Future work: load, priority, and thermal awareness of the balancer
[Figure: Sense, Predict, Balance cycle]
25. Implementation in Linux
• Modifications for SmartBalance
– Load balancing replaced by SmartBalance
– Each phase runs as a kernel thread
– System does not halt while running SmartBalance
29. CPSoC Computational Platform
[Figure: CPSoC architecture. Software stack above the chip hardware: application layer (App 1, App 2, ..., App N); reflective middleware layer with cross-layer sensors (virtual & physical), decisions & learning (controller), and actuation (software and hardware), forming an Observe, Decide, Act loop; traditional operating system (scheduling, memory manager, file system, device drivers); hypervisor. Chip hardware: CPS cores, each with CPUs, $I/$D and $L2 caches, NIA, and OCSA, connected through adaptive NoC routers with OCSA. CPS core detail: On-Chip Sensing & Actuation (OCSA) with DDRO(s), oxide sensor(s), temperature sensor(s), leakage sensor(s), aging sensor(s), reliability sensor(s), performance counters, and an on-chip actuation unit; CPU(s), $I/$D, $L2, scratch pad/on-chip SRAM, NIA, timer & RTC, PLL, and GPIO]
30. Cross-Layer Physical/Virtual Sensing & Actuation
[Figure: sensors (Observe) SA, SO, SN, SH, SC and actuators (Act) AA, AO, AN, AH, AC at the Applications, Operating System, Network/Bus Communication Architecture, Hardware Architecture, and Device/Circuit Architecture layers, coordinated by Adaptive Control (Decide)]