SmartBalance-DAC-v2

SmartBalance: A Sensing-Driven Linux Load Balancer for Energy Efficiency of Heterogeneous MPSoCs

Santanu Sarma
Computer Science, UC Irvine
http://variability.org

Coauthors: Tiago R. Muck, Danny Bathen, N. Dutt, A. Nicolau
T1: Measurement and Modeling
T2: Design Tools and Testing
T3: Microarchitecture and Compilers
T4: Runtime Support
T5: Applications and Testbeds
T6: Outreach and Education

Current Trends in MPSoC
•  Emerging and future computing systems will be heterogeneous multicore processors (HMPs) [Borkar11]
•  They will be rich in different types of cores with diverse memories and accelerators [ARM big.LITTLE, 2013; Angstrom platform, MIT 2014; P2012 Platform]
•  Heterogeneity manifests even in homogeneous architectures due to process variability [Teodorescu08]
•  They are monitor-rich at lower layers of abstraction [Kornaros13, Lefurgy13, Gupta13]

6/10/15 © VLSI Design & Embedded Systems Conference - 2015

Heterogeneous Platforms

Clear trend towards heterogeneous many/multi-core architectures with different core types

Examples: ARM (big.LITTLE), NVIDIA Tegra, and AMD GPGPU

Emerging & Future HMPs

Futuristic heterogeneous multicore processors are expected to have shared memories, a coherent bus, multiple networks, and accelerators

[Figure: block diagram of a future HMP with A15, A11, and A7 cores, L2 caches, a cache coherent interconnect, L3, GPU, accelerators, DRAM, SPM, disk, a global interrupt controller, and Bluetooth, GSM, WiFi, 3/4G, and 5G blocks]

Smart Load Balancing Problem

•  Standard load balancing: distribute threads (tasks) among cores uniformly and randomly (lack of awareness at thread level)
•  Smart load balancing: distribute threads (tasks) among cores with awareness of energy/power at the thread level

Traditional OS Allocator & Scheduler

•  Does not cope jointly with workload variability and heterogeneity
•  Does not expose variability at the OS layer
•  Lacks suitable abstractions
•  Lacks support for generic HMPs

[Figure: allocation / balancing in a traditional OS: the scheduler maps Task 1, Task 2, ..., Task n, Task m onto A7, A11, and A15 cores that share an LLC]

Traditional OSes (e.g. Linux) are not yet ready to deal with the dual challenge of heterogeneity and variability

SmartBalance Approach

[Figure: SmartBalance closed loop: Sense, Predict, and Balance stages sit between the tasks (Task 1, Task 2, ..., Task n, Task m) and the scheduler, which maps them onto A7, A11, and A15 cores that share an LLC]

•  Sensing-driven closed-loop predictive approach
•  Supports generic HMPs
•  Supports shared & independent task models

Heterogeneity and performance-power-aware balancer/allocator for generic HMPs

SmartBalance Approach

[Figure: epoch timeline (Epoch 1, Epoch 2, Epoch 3) showing sensing, estimation & prediction, and allocation phases alongside CFS scheduling periods; interval labels TSTA and TEpoch; same Sense, Predict, Balance core diagram as on the previous slide]

SmartBalance stages are divided into time slices called epochs
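
The epoch structure can be pictured as a loop that runs the three stages once per epoch and leaves ordinary scheduling alone in between. The sketch below is only an illustration: the function names (`sense`, `predict`, `balance`, `run_scheduler`) and the placement of the stages at the start of each epoch are assumptions, not the SmartBalance kernel code.

```python
def run_epochs(num_epochs, sense, predict, balance, run_scheduler):
    """Run sense -> predict -> balance once per epoch, then schedule as usual."""
    allocations = []
    for _ in range(num_epochs):
        samples = sense()                   # read per-core counters and power sensors
        perf, power = predict(samples)      # estimate per-thread performance and power
        allocation = balance(perf, power)   # choose a thread-to-core mapping
        run_scheduler(allocation)           # ordinary scheduling until the epoch ends
        allocations.append(allocation)
    return allocations


# Hypothetical stand-ins that only record the order in which the stages run.
calls = []

def sense():
    calls.append("sense")
    return {"thread0": 1.0}

def predict(samples):
    calls.append("predict")
    return dict(samples), dict(samples)

def balance(perf, power):
    calls.append("balance")
    return {thread: 0 for thread in perf}

def run_scheduler(allocation):
    calls.append("schedule")

allocations = run_epochs(2, sense, predict, balance, run_scheduler)
```

Passing the stages in as functions keeps the loop independent of how sensing or prediction is actually done on a given platform.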

On-Chip Sensing and Measurement

•  Performance sensing
   –  Hardware performance counters at each core
•  Power sensing
   –  Per-core total power sensing
   –  Dynamic power sensing: virtual sensor based, per core
   –  Leakage power sensing:
      •  Per-block leakage sensor
      •  Network of sensors
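
A minimal sketch of how per-epoch metrics could be derived from two sensor snapshots, assuming each core exposes cumulative instruction, cycle, and energy counts. The `CounterSample` fields and the cumulative energy reading are hypothetical; the slides only state that hardware performance counters and power sensors are read.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Hypothetical snapshot of one core's cumulative counters."""
    time_s: float       # timestamp in seconds
    instructions: int   # retired instructions so far
    cycles: int         # elapsed core cycles so far
    energy_j: float     # consumed energy so far, in joules


def epoch_metrics(start, end):
    """Turn two snapshots into IPC, IPS, average power, and IPS per watt."""
    dt = end.time_s - start.time_s
    d_instr = end.instructions - start.instructions
    d_cycles = end.cycles - start.cycles
    d_energy = end.energy_j - start.energy_j
    ips = d_instr / dt
    power_w = d_energy / dt
    return {
        "ipc": d_instr / d_cycles,
        "ips": ips,
        "power_w": power_w,
        "ips_per_watt": ips / power_w,
    }


# Example values for one 100 ms epoch (illustrative, not measured data).
metrics = epoch_metrics(
    CounterSample(0.0, 0, 0, 0.0),
    CounterSample(0.1, 150_000_000, 100_000_000, 0.05),
)
```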


SmartBalance Sensing and Measurement

[Figure: per-core timing diagram (Core 1, Core 2, ..., Core n) of a smart balancing epoch TEpoch(k): the epoch spans Linux CFS scheduling periods T1k(1) ... T1k(L), each running tasks τ1, τ2, ..., τm, with Sense, Estimate & predict, and Balance phases marked]

Performance-Power Prediction

•  Performance prediction for each core type based on profiling or online learning
•  Customized predictors for each different core with fine / precise prediction
   –  For known architectures
•  A generic predictor for coarse prediction
   –  For unknown architectures with new core types
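
One way to picture a profiling-based predictor is a least-squares line that maps the IPC a thread achieves on one core type to the IPC expected on another. This is a sketch under stated assumptions: the linear model, the `fit_line` helper, and the profiling numbers are all illustrative and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical profiling data: IPC of the same workloads on a small and a big core.
ipc_small = [0.2, 0.4, 0.6, 0.8]
ipc_big = [0.5, 0.9, 1.3, 1.7]

slope, intercept = fit_line(ipc_small, ipc_big)


def predict_ipc_big(measured_ipc_small):
    """Predict big-core IPC from the IPC measured on the small core."""
    return slope * measured_ipc_small + intercept


predicted = predict_ipc_big(0.5)
```

Repeating such a fit for every pair of core types would fill one row of a performance matrix per thread; a power matrix could be built the same way from power readings.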

[Figure: variability-aware performance & power prediction, run per epoch (Epoch 1, Epoch 2, Epoch 3): performance counters, configurations, and power & variability sensing are the inputs; a performance matrix and a power matrix are the outputs]

On-line Optimization

•  Problem definition: $\max_{\Psi} \left( \frac{\mathrm{IPS}}{\mathrm{Power}} \right)$
•  NP-hard problem
•  Finding a solution requires heuristics
•  Simulated annealing based allocation
•  On-line, low overhead (< 1% for a 100 ms epoch)

[Figure: task allocation (SA-based online solver): the performance matrix (ipc00 ... ipcij), the power matrix (p00 ... pij), and the objective(s) feed the solver, which allocates tasks t1 ... t4 once per epoch (Epoch 1, Epoch 2, Epoch 3) between CFS scheduling periods]
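
The allocation step can be sketched as a small simulated-annealing search over thread-to-core mappings that maximizes total IPS divided by total power. The sketch assumes one thread per core (so moves are swaps), a geometric cooling schedule, and made-up matrices; none of these details come from the slides, and the real solver may differ.

```python
import math
import random


def efficiency(allocation, ips, power):
    """Total IPS over total power when thread t runs on core allocation[t]."""
    total_ips = sum(ips[t][c] for t, c in enumerate(allocation))
    total_power = sum(power[t][c] for t, c in enumerate(allocation))
    return total_ips / total_power


def anneal(ips, power, steps=2000, t_start=1.0, t_end=1e-3, seed=0):
    """Simulated annealing over one-thread-per-core allocations (needs >= 2 threads)."""
    rng = random.Random(seed)
    n = len(ips)
    current = list(range(n))                 # start with thread t on core t
    current_score = efficiency(current, ips, power)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = rng.sample(range(n), 2)       # propose swapping two threads' cores
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        cand_score = efficiency(candidate, ips, power)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
    return best, best_score


# Hypothetical matrices: rows are threads, columns are cores (small to huge).
ips = [
    [0.5, 1.0, 1.8, 2.4],   # compute-bound thread, scales with core size
    [0.4, 0.5, 0.6, 0.6],   # memory-bound thread, barely scales
    [0.5, 0.9, 1.5, 1.9],
    [0.3, 0.4, 0.5, 0.5],
]
power = [
    [0.10, 0.50, 1.40, 8.00],
    [0.08, 0.40, 1.00, 5.00],
    [0.10, 0.50, 1.30, 7.50],
    [0.07, 0.35, 0.90, 4.50],
]

best, best_score = anneal(ips, power)
```

Restricting the search to swaps keeps every candidate a valid one-thread-per-core mapping, so no repair step is needed after a move.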

SmartBalance Approach

[Figure: (a) traditional OS allocator/scheduler; (b) sensing-driven predictive OS allocator/scheduler; (c) predictive allocation for epochs (each epoch covers multiple Linux scheduling cycles)]

Experimental Platform

•  Extension of gem5
   –  McPAT + power variability
   –  Sensing interface
•  Heterogeneous Alpha-based cores:
   –  8-way OoO (Huge)
   –  4-way OoO (Big)
   –  2-way OoO (Medium)
   –  In-order (Small)

[Figure: experimental platform stack: applications (App 0 ... App n, each with Thread 0 ... Thread n) and benchmarks run on a Linux kernel with per-core run queues (RQ), schedule(), load_balance(), and smart_balance(), on top of the extended gem5 platform with McPAT, an HPC/sensing interface (power, perf.), disk, DRAM, and Huge, Big, Medium, and Small cores, each with $I, $D, and L2]

Experimental Goals & Benchmarks

•  Goals: improve energy efficiency
•  Benchmarks: PARSEC & mixes
•  Interactive benchmarks (IMB)
   –  9 IMBs (e.g. High Throughput High Interactivity, HTHI)
   –  Ability to control phases, wait periods, etc.

PARSEC mixes: Mix1, Mix2, Mix3, Mix4, Mix5, Mix6
Mix components (in slide order): x264Hcrew, x264Hbow, x264Lcrew, x264Lbow, x264Lcrew, x264Hbow, x264Hcrew, x264Lbow, Bodytrack, x264Hcrew, Bodytrack, x264Hcrew, x264Lbow

Results w.r.t. Vanilla Linux CFS

Over 50% improvement w.r.t. the vanilla Linux kernel
•  The Linux CFS scheduler distributes threads uniformly, irrespective of core types and features
•  SmartBalance makes workload- and power-aware runtime decisions

Results w.r.t. ARM GTS

Over 20% improvement w.r.t. the state-of-the-art ARM GTS
•  ARM GTS makes a binary decision to select either a big core or a small core based on a utilization threshold
•  It is unaware of thread-level power and performance
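
The binary decision described above can be illustrated with a threshold rule. This is a simplified sketch, not ARM's GTS implementation: the function name and the 0.7 threshold are hypothetical.

```python
def threshold_choice(utilization, threshold=0.7):
    """Pick a cluster from utilization alone (hypothetical threshold value)."""
    return "big" if utilization >= threshold else "little"


# Two threads with the same utilization get the same cluster, even if one of
# them would gain little performance from the big core for much more power.
compute_bound_choice = threshold_choice(0.9)
memory_bound_choice = threshold_choice(0.9)
idle_choice = threshold_choice(0.2)
```

Because utilization is the only input, the rule cannot tell the two busy threads apart, which is the gap that thread-level power and performance awareness targets.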

Overheads

Overhead is < 1% for a 100 ms epoch on a quad-core system

Predictor Performance

Related Work

Columns 2-3: scheme generality. Columns 4-6: per-thread awareness. Columns 7-8: per-core awareness.

Reference | No. of core types > 2 | Thread-to-core ratio > 1 | Per-thread IPC | Per-thread power | Per-thread util. | Per-core IPC | Per-core power | Integrated & implemented in OS
Chen2009 | Yes | No | No | No | No | Yes | Yes | No
Annamalai2013 | No | No | No | No | No | Yes | Yes | No
Liu2013 | Yes | Yes | No | No | No | Yes | Yes | No
Kim2014 | No | Yes | No | No | Yes | No | No | Yes
Linaro IKS 2013 | No | Yes | No | No | Yes | No | No | Yes
ARM GTS 2013 | No | Yes | No | No | Yes | No | No | Yes
SmartBalance | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes

Summary and Future Work

•  Performance-power-aware predictive Linux load balancing
•  Over 50% improvement for a quad-core HMP at <1% overhead
•  Over 20% improvement in energy efficiency w.r.t. the ARM GTS policy
•  Future work: load, priority, and thermal awareness of the balancer

Experimental Setup

(a) [Figure: gem5 performance simulator extended for heterogeneous MPSoCs, with McPAT, an HPC/sensing interface (power, perf.), disk, and DRAM, running a Linux kernel (per-core run queues, schedule(), load_balance(), smart_balance()) and application benchmarks (App 0 ... App n, each with Task 0 ... Task n) on Huge, Big, Medium, and Small cores]

(b) Core features

CORE FEATURES        Huge    Big     Medium  Small
Issue width          8       4       2       1
LQ/SQ size           32/32   16/16   8/8     8/8
IQ size              64      32      16      16
ROB size             192     128     64      64
Int/float regs       256     128     64      64
L1$I size (KB)       64      32      16      16
L1$D size (KB)       64      32      16      16
Freq. (MHz)          2000    1500    1000    500
Voltage (V) [1]      1       0.8     0.7     0.6
Peak throughput [1]  4.18    2.60    1.31    0.91
Peak power (W) [1]   8.62    1.41    0.53    0.095
Area (mm^2) [1]      11.99   5.08    3.04    2.27

[1] Estimated using gem5 and McPAT at 22 nm with PARSEC benchmarks
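
As a quick worked check on the core-features table, dividing peak throughput by peak power for each core type shows how widely efficiency varies across the four cores. The numbers are copied from the table; the ratio is in the table's throughput units per watt, and the variable names are illustrative.

```python
# Peak throughput and peak power (W) per core type, taken from the table.
cores = {
    "Huge":   {"peak_throughput": 4.18, "peak_power_w": 8.62},
    "Big":    {"peak_throughput": 2.60, "peak_power_w": 1.41},
    "Medium": {"peak_throughput": 1.31, "peak_power_w": 0.53},
    "Small":  {"peak_throughput": 0.91, "peak_power_w": 0.095},
}

# Throughput per watt for each core type.
throughput_per_watt = {
    name: spec["peak_throughput"] / spec["peak_power_w"]
    for name, spec in cores.items()
}

# Core types ordered from most to least efficient at peak.
ranking = sorted(throughput_per_watt, key=throughput_per_watt.get, reverse=True)

for name in ranking:
    print(f"{name:<6} {throughput_per_watt[name]:.2f}")
```

At peak, the Small core delivers roughly twenty times the throughput per watt of the Huge core, which is why the placement of each thread matters so much for energy efficiency.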

Implementation in Linux

•  Modifications for SmartBalance
   –  Load balancing replaced by SmartBalance
   –  Each phase runs as a kernel thread
   –  System does not halt while running SmartBalance

Predictor Coefficients

Preliminary results

Sensing Overhead

[Figure: leakage sensing overhead: % area overhead and % power overhead with respect to 4 cores, by sensor type (1, 2); vertical axis from 0 to 0.5]

CPSoC Computational Platform

[Figure: CPSoC architecture. Chip hardware: a NoC of CPS cores, each with CPUs, $I, $D, $L2, NIA, an adaptive NoC router, and on-chip sensing & actuation (OCSA). Software stack: application layer (App 1, App 2, ..., App N), reflective middleware layer (cross-layer virtual & physical sensors, decisions & learning controller, software and hardware actuation, forming an observe-decide-act loop), traditional operating system (scheduling, memory manager, file system, device drivers), and hypervisor. CPS core detail: DDRO(s), oxide sensor(s), temperature sensor(s), leakage sensor(s), aging sensor(s), reliability sensor(s), performance counters, CPU(s), $I, $D, $L2, scratch pad / on-chip SRAM, NIA, timer & RTC, PLL, GPIO, and on-chip actuation unit]

Cross-Layer Physical/Virtual Sensing & Actuation

[Figure: five layers (applications, operating system, network/bus communication architecture, hardware architecture, device/circuit architecture), each with sensors (SA, SO, SN, SH, SC; observe) and actuation (AA, AO, AN, AH, AC; act), connected through adaptive control (decide)]
