VLSI Design & Embedded Systems Conference
January 2015
Bengaluru, India
Cross-Layer Exploration of
Heterogeneous Multicore
Processor Configurations
Santanu Sarma and N. Dutt
Introduction & Motivation
•  Emerging and future computing systems will be heterogeneous multicore processors (HMPs) [Borkar11]
•  Heterogeneity manifests even in homogeneous architectures due to process variability [Teodorescu08]
•  They will be rich in different types of cores with diverse memories and accelerators [P20 Platform; ARM2013; Angstrom platform, MIT 2014]
•  They are monitor-rich at lower layers of abstraction [Kornaros13, Lefurgy13, Gupta13]
6/8/15 © VLSI Design & Embedded Systems Conference - 2015
Examples of Existing HMPs
Trend towards heterogeneous multicore processors with different core specialization
Examples: ARM (big.LITTLE), NVIDIA Tegra, and AMD GPGPU
Emerging & Future HMPs
[Figure: futuristic heterogeneous many-core processor with distributed memories, heterogeneous networks, and accelerators. Blocks shown: A7, A11, and A15 cores with L2 and L3 caches, a NoC mesh, SRAM/SPM, eDRAM, on-chip flash, GPU, and accelerators.]
Emerging & Future HMPs
[Figure: futuristic heterogeneous multicore processors are expected to have shared memories, a coherent bus, multiple networks, and accelerators. Blocks shown: A7, A11, and A15 cores with L2 caches, a cache coherent interconnect, L3, GPU, accelerators, radio blocks (Bluetooth, GSM, WiFi, 3/4G, 5G), a global interrupt controller, DRAM, SPM, and disk.]
HMP Composition Problem
[Figure: representative applications mapped onto four candidate area-power constrained HMP architectures, panels (a) to (d), each a different mix of A7, A11, and A15 cores sharing an LLC.]
A configuration is the set of core counts, one count per core type.
Which HMP configuration is the best for the representative applications?
HMP Composition Problem
[Figure: HMP configurations for an area budget of 4 Ev6 cores. x-axis: HMP configuration # (0 to 40); y-axis: # of cores (0 to 35); series: Ev4, Ev5, Ev6, Ev8. Inset: relative core sizes of EV8, EV6, EV5, and EV4.]
Large design space of HMP configurations; a 4xEV8 area budget results in 46428 HMP configurations
[Figure builds: the plot of HMP configurations for an area budget of 4 Ev6 cores, highlighting four example configurations, each with a shared LLC.]
Config# 1: 4 EV6 cores
Config# 2: 3 EV6 and 5 EV5 cores
Config# 9: 2 EV6, 5 EV5, and 8 EV4 cores
Config# 37: 34 EV4 cores
Goal
•  Explore and configure an HMP for a given system goal under system-level constraints (e.g., area or power)
•  Performance Maximization (PerfMax)
•  Energy Minimization (EnergyMin)
•  Power Minimization (PowerMin)
•  Energy Efficiency Maximization (EEMax)
•  Enables the designer to comparatively evaluate and select the most promising (e.g., energy-efficient) HMP architecture
•  Improve exploration time and resource requirements at a relatively small error
•  Present a holistic cross-layer approach that is more representative of actual HMP systems
HMP Composition Problem
Optimization Problem:
J: platform goal/objective
Processing cores in the HMP configuration
Total number of cores of K types
Total area of the HMP; a_i: area of core type i
HMP configuration: the set of core counts, one per core type
Set of all feasible configurations
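The annotations above describe a constrained optimization over core counts. One way to write it down is sketched here; it is consistent with the listed symbols but is not the slide's own equation. $J$, $K$, and $a_i$ come from the slide, while $n_i$ (the number of cores of type $i$), $N$, $A(C)$, $A_{\max}$, and $\mathcal{F}$ are notation assumed for this sketch.

```latex
\begin{aligned}
C &= (n_1, n_2, \dots, n_K), \qquad
N = \sum_{i=1}^{K} n_i, \qquad
A(C) = \sum_{i=1}^{K} n_i \, a_i, \\
\mathcal{F} &= \{\, C : A(C) \le A_{\max} \,\}, \qquad
C^{*} = \arg\max_{C \in \mathcal{F}} J(C).
\end{aligned}
```

For goals that are minimized rather than maximized (e.g., energy or power), the $\arg\max$ becomes an $\arg\min$ over the same feasible set.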
Challenges in
HMP Composition
•  Extremely large design space
–  Large parametric space
–  Huge spatio-temporal dynamics
•  Complex interaction of layers
–  Feature and attribute identification
–  Difficulty of capturing layer-specific attributes
–  Mechanisms to actuate layer-specific features
•  Full-stack model building challenge
–  Large volume of data (big data) for model building
–  Model composition
–  Accuracy-complexity trade-off
Related Work
•  Existing work focuses on HMP runtime systems [14], [4], [21], [17], [2], [6], [10]
•  Limited work on cross-layer modeling of HMPs and cross-layer DSE, but several pieces of work on DoE [1]
•  Resource allocation [Zidenberg 2012, Zidenberg 2013]
–  Optimal resource allocation to specialized accelerators in an SoC, not to cores in HMPs
–  System objective: improve performance
–  Does not consider the full system stack and OS
–  Narrowly focuses on the hardware layer; not applicable to generic HMPs
HMP Composition Approach
•  Four stages of performing cross-layer design space exploration
1.  Build predictive model of each core type
2.  Compose predictive model of HMP configuration
3.  Construct RSM of system objective (J)
4.  Find/search the best HMP configuration
[Figure: flow diagram of the four stages: build predictive model of each core type; compose predictive model of HMP configuration; use HMP predictive model to build RSM of objective (J); find the best HMP configuration for the objective.]
Cross-Layer Predictive Model
[Figure: cross-layer predictive model. HMP stack layers: Applications, Operating System, Instruction Set Architecture, Hardware Architecture, Network/Bus Communication Architecture, and Device/Circuit Architecture. Sensing and monitoring at the different layers (SA, SO, SI, SH, SN, SC), through virtual and physical sensors/monitors and an observer, feeds operating parameters and the operating condition to the HMP predictive model, whose outputs are Perf., Power, Energy, Temp., Reliability, and Error.]
Cross-Layer Predictive Model
Application Layer Features:
# of applications
Application type
•  Memory bound / core bound
•  Real-time vs. soft
•  Exact, approximate
•  Fixed vs. floating point
Application size / memory footprint
Application phases
Application criticality
# of functions, classes, LOC
Application complexity
Degree of ILP, MLP
Accuracy requirement
Performance requirement
Cross-Layer Predictive Model
Operating System Layer Features:
Scheduling policy
Allocation / balancing policy
Scheduling epoch
Balancing epoch
# of threads, thread types, thread priority
Thread location history
No. of context switches
Migration overhead
Busy cycles (cyBusy), idle cycles (cyIdle), sleep cycles (cySleep)
Execution time matrix ( )
Performance characterization matrix (S)
Power characterization matrix (P)
Energy characterization matrix (E)
Cross-Layer Predictive Model
Instruction Set Layer Features:
ISA type and width (fixed)
Committed instructions (Itotal)
Committed loads and stores (Imem)
Committed branches (Ibranch)
Floating-point instructions (IFP)
Integer instructions (Iint)
Critical instructions (Icr)
Non-critical instructions (Incr)
Cross-Layer Predictive Model
Hardware Layer Features and Properties:
Core type
Issue width (Iw)
LQ/SQ size (LSQ)
IQ size (IQ)
ROB size (ROB)
Int/float regs (IFR)
L1$I size (KB) (L1I)
L1$D size (KB) (L1D)
L2$I size (KB) (L2I)
L2$D size (KB) (L2D)
Core freq. (MHz) (F)
Core voltage (V)
Core area (a), uncore area (au)
Core power (pw)
Floorplan and placement
Cross-Layer Predictive Model
Network/Bus Layer Features:
Bus properties, from the gem5 shared bus model [Binkert11]:
No. of buses
Bus type, bus width, bus frequency, bus mode
# of L2 buses, # of coherence domains
Contention
Latency
NoC properties [Orion 2.0]:
Topology
Routing policy
Flit size, flit width
# of VCs
Buffer size
Frequency & latency
Contention
Cross-Layer Predictive Model
Circuit and Device Properties:
Imported from CACTI [Thoziyoor08] & McPAT [Shen09]
Technology node
Tech. parameters
VDD, VTh, bias voltage
Wire model parameters
Delay model parameters
Memory cell model parameters
Cell power model parameters
Building Predictive Model of
Core Types
•  Divided into two phases:
–  Training phase: known data (the training set) are used to identify the predictive model configuration; special benchmarks are used for coverage
–  Prediction phase: the predictive model is used to forecast the unknown system response
•  Use regression-based data fitting in the predictive model of core types for performance (throughput) and power
•  Predictor for each core type:
Performance Predictor coefficients
Power Predictor coefficients
Cross-layer feature vector
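The two phases can be sketched with an ordinary least-squares fit. This is a minimal illustration under stated assumptions, not the authors' model: the slides do not give the predictor's functional form, so a linear predictor with an intercept is assumed, and the feature values and coefficients below are synthetic. The names `fit_predictor` and `predict` are hypothetical.

```python
import numpy as np


def fit_predictor(X, y):
    """Training phase: least-squares fit of y ~ [1, X] @ beta; returns beta."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    """Prediction phase: evaluate the fitted linear predictor on new features."""
    return np.column_stack([np.ones(len(X)), X]) @ beta


# Synthetic training set: 50 runs x 3 cross-layer features (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(50, 3))
true_perf = np.array([1.0, 2.0, -0.5, 0.3])   # made-up performance coefficients
true_power = np.array([0.2, 0.8, 0.1, 0.4])   # made-up power coefficients
A_train = np.column_stack([np.ones(50), X_train])
perf = A_train @ true_perf
power = A_train @ true_power

# One performance predictor and one power predictor per core type.
beta_perf = fit_predictor(X_train, perf)
beta_power = fit_predictor(X_train, power)

# Forecast the response for an unseen cross-layer feature vector.
x_new = np.array([[0.5, 0.5, 0.5]])
print(predict(beta_perf, x_new), predict(beta_power, x_new))
```

In practice the training data would come from simulator runs of the coverage benchmarks, and a separate pair of coefficient vectors would be fitted for each core type.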
Predictive Model of
Core Types
[Figure: building the predictive model over the full system stack. Benchmark applications (App 0 to App n, each with Task 0 to Task n) run on a Linux kernel on a heterogeneous platform simulator (Gem5 with McPAT) modeling Ev8, Ev6, Ev5, and Ev4 cores, DRAM, and disk, with an HPC/sensing interface reporting power and performance. System specifications per layer: application type, size, etc.; task/thread model, task execution time, and task throughput; task allocation and scheduling policy/strategy and memory allocation; hardware architecture configurations, performance event counters, and bus specifications; circuit/device scaling, technology parameters, power/energy consumption, and circuit delay parameters. DoE data and regression fitting against system performance, power, and energy yield the predictive model.]
Compose Predictive Model
of HMP Configuration
•  Use predictive models of the individual core types to compose the total system model
•  The performance and power of each core are added to get full-system power and performance
•  Core-to-core interference and interactions are captured via the features of the last-level cache and network
Construct RSM of the
System Objective (J)
•  Response Surface Models (RSMs) are approximate analytical expressions of the system objective (J)
•  A higher-level predictive model built using the individual core-type predictive models
•  The system-level RSM can include uncore components and core-to-core interaction characteristics
•  RSMs are dominated by
–  Core characteristics for computation-centric apps
–  The network for communication-centric apps
DSE Optimization
•  Formulated as an optimization problem
–  Uses predictive models of the core types and of the HMP
–  Layer-specific goals can be included in system-level goals
–  Models of individual core types are used to build HMP models
•  Search for the best configuration for the objective using the predictive models / RSM
•  Global optimization methods (simulated annealing, SA) are used to find the configuration
Experiments & Setup
•  Experiment goal: find the HMP configuration C under a system constraint
•  Given:
–  System-level goal (J):
•  Performance Maximization (PerfMax)
•  Energy Minimization (EnergyMin)
•  Power Minimization (PowerMin)
•  Energy Efficiency Maximization (EEMax)
–  Individual layer-specific goal:
•  E.g., OS allocation objective: minD, minE, minED, minED2
•  Heterogeneity-aware allocator/scheduler
–  Set of representative benchmarks: PARSEC, MediaBench
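The OS allocation objectives listed above can be illustrated with a small selector. The slide does not define the abbreviations; this sketch assumes D is delay (execution time), E is energy, ED is the energy-delay product, and ED2 is the energy-delay-squared product. The per-thread estimates and the `pick_core` helper are hypothetical.

```python
# Assumed meanings: D = execution time (delay), E = energy; ED and ED2 are the
# energy-delay and energy-delay-squared products.
OBJECTIVES = {
    "minD": lambda d, e: d,
    "minE": lambda d, e: e,
    "minED": lambda d, e: e * d,
    "minED2": lambda d, e: e * d * d,
}


def pick_core(estimates, objective):
    """Return the core type that minimizes the chosen metric.

    estimates maps core type -> (delay_seconds, energy_joules) for one thread.
    """
    metric = OBJECTIVES[objective]
    return min(estimates, key=lambda core: metric(*estimates[core]))


# Made-up per-thread estimates on a big core and a small core.
estimates = {"EV6": (1.0, 8.0), "EV4": (3.0, 2.0)}
for name in OBJECTIVES:
    print(name, "->", pick_core(estimates, name))
```

With these made-up numbers the four objectives do not agree on a core, which illustrates how a layer-specific goal chosen at the OS level can pull against a system-level goal.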
  
HMP Platform Setup
Setup:
•  Full system stack
•  Integrated with Gem5 + McPAT + CACTI
•  Supports multiple ISAs
•  Runs Linux OS
•  Linux OS allocator modified for heterogeneity awareness
•  Simulation environment: cluster with 10000 cores; 10 TB storage
[Figure: platform setup. Benchmark applications (App 0 to App n, each with Thread 0 to Thread n) run on a Linux kernel whose heterogeneity-aware scheduler manages per-core run queues (RQ) with Schedule() and load_balance(). The extended Gem5 platform models Ev4, Ev6, Ev7, and EV8 cores, each with $I, $D, and L2 caches, plus DRAM and disk, with McPAT and an HPC/sensing interface reporting power and performance.]
Performance of Predictive
Model of Core Types
Full-system performance and power characteristics of core types
[Figure: two bar charts of predictor error per benchmark, one for core type Ev6 and one for core type Ev4. y-axis: % error (0 to 6); x-axis: benchmarks; series: Perf. Error and Power Error.]
Performance and power prediction errors are within 5% for each core type
Performance of Predictive
Model of Core Types
[Figure legend: Huge (EV8); Big (EV6); Medium (EV5); Small (EV4)]
Performance and power prediction errors are within 9% for core-to-core types
Performance of HMP
System Predictive Model
[Figure: HMP configurations for an area budget of 4 Ev6 cores. x-axis: HMP configuration # (0 to 40); y-axis: # of cores (0 to 35); series: Ev4, Ev5, Ev6, Ev8.]
Performance and power prediction errors are within 9% for system-level HMP configurations; over 1000x speedup
Experimental DSE Results
•  HMP configurations can have a 2x-3x performance-power difference for the same area resource
•  With increasing load, the EDP increases with heterogeneity-awareness
•  Layer-specific objectives can severely interfere with the system objective
•  Some cross-layer features have a dominant impact on HMP power and performance
Lack of heterogeneity-awareness will have serious implications for HMP performance and power, and thus for the composition problem
Cross-Layer DSE using Predictive Models
HMP Configurations
[Figure: the four selected HMP configurations, each with a shared LLC.]
Config#1 [Performance Centric]: 4 EV6 cores
Config#2 [Power-Throughput efficient]: 3 EV6 and 5 EV5 cores
Config#9 [Energy Efficient]: 2 EV6, 5 EV5, and 8 EV4 cores
Config#37 [Energy Centric]: 34 EV4 cores
Conclusion
•  We presented a holistic cross-layer approach for HMP composition under system-level constraints (e.g., area or power), formulated as an optimization problem
•  The approach consists of predictive cross-layer models of the core types and of the total system that are computationally efficient for design exploration
•  Enables over two orders of magnitude improvement in exploration time and resource requirements at less than 7% average error
•  We show:
–  HMP configurations can have a 2x-3x performance-power difference for the same area resource
–  With increasing load, the EDP increases with heterogeneity-awareness
–  Layer-specific objectives can severely interfere with the system objective
–  Some cross-layer features have a dominant impact on HMP power and performance
References
1.  The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
2.  G. Palermo et al. ReSPIR: A response surface-based Pareto iterative refinement for application-specific design space exploration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12):1816–1829, Dec. 2009.
3.  A. D. Pimentel et al. A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE Transactions on Computers, 55(2):99–112, Feb. 2006.
4.  K. Keutzer et al. System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1523–1543, Dec. 2000.
5.  P. Greenhalgh. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7: Improving energy efficiency in high-performance mobile platforms. 2011.
6.  NVIDIA. Variable SMP: A multi-core CPU architecture for low power and high performance. 2011.
7.  T. Zidenberg, I. Keslassy, and U. Weiser. Optimal resource allocation with MultiAmdahl. Computer, 46(7):70–77, July 2013.
8.  T. Zidenberg, I. Keslassy, and U. Weiser. MultiAmdahl: How should I divide my heterogeneous chip? Computer Architecture Letters, 11(2):65–68, 2012.
9.  Sheng Li et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pages 469–480, 2009.
10.  S. Thoziyoor et al. CACTI 5.1. HP Laboratories, April 2, 2008.
11.  N. Binkert et al. The gem5 simulator. SIGARCH Computer Architecture News, 39(2):1–7, August 2011.
www.variability.org  www.nsf.gov  www.uci.edu
THANKS
S. Sarma
Towards Full System
Energy Efficiency Models
[Figure: four model components: application & workload model; OS and scheduling model; hardware, memory & bus architecture; circuit and device models.]
HMP Composition Approach
•  Perform cross-layer design space exploration
•  Large design space pruned by using DoE
•  Formulated as an optimization problem
–  Uses predictive models of the HMP
–  System and layer-specific goals are evaluated using the predictive models
–  Models of individual core types are used to build HMP models
•  Global optimization methods (SA or GA) are used to find the configuration
Algorithm for DSE
Predictive Model
Layers | Parametric Features & Attributes | Remarks
Hardware architecture features | Issue width (Iw), LQ/SQ size (LSQ), IQ size (IQ), ROB size (ROB), int/float regs (IFR), L1$I size (KB) (L1I), L1$D size (KB) (L1D), freq. (MHz) (F), voltage (V), core area (a) |
Performance event counters | Branch misprediction rate (mB), L1 instruction miss rate (mL1I), L1 data cache miss rate (mL1D), instruction TLB miss rate (mITLB), data TLB miss rate (mDTLB), context switch counters (Cw) |
Cycle and instruction counters | Busy cycles (cyBusy), idle cycles (cyIdle), sleep cycles (cySleep), committed instructions (Itotal), committed loads and stores (Imem), committed branches (Ibranch) |
HMP Composition Approach
•  Perform cross-layer design space exploration in four stages
–  Provides a holistic approach with the complete system
–  Jointly considers features of the application, OS, HW, bus/network, and circuit and device layers
–  Avoids pathological scenarios of a single-layer approach
–  Effectively captures crucial interactions between layers
–  Improves exploration time and resources at small errors
–  Uses computationally efficient predictive models developed from
•  Large design space pruned by using DoE
•  Formulated as an optimization problem
–  Uses predictive models of the HMP
–  System and layer-specific goals are evaluated using the predictive models
–  Models of individual core types are used to build HMP models
•  Global optimization methods (SA or GA) are used to find the configuration
Related Work
•  Existing work focuses on HMP runtime systems [14], [4], [21], [17], [2], [6], [10]
•  Limited work on cross-layer modeling of HMPs and cross-layer DSE
•  Closest to our work: [Zidenberg 2012, Zidenberg 2013]
–  Optimal resource allocation to specialized accelerators in an SoC, not cores
–  System objective: improve performance
–  Does not consider the full system stack and OS
–  Focuses only on the hardware layer
Cross-Layer Predictive Model
[Figure, three columns: HMP Stack | Operating Parameters | HMP Predictive Model. The HMP stack layers (Applications, Operating System, Instruction Set Architecture, Hardware Architecture, Network/Bus Communication Architecture, Device/Circuit Architecture) expose operating parameters SA, SO, SI, SH, SN, SC. Physical and virtual sensors/monitors and an observer sense and monitor the operating condition at the different layers and feed the predictive model, whose outputs are Perf., Power, Energy, Temp., Reliability, and Error.]

VLSID_2015_DSE_HMP_v3

  • 1. VLSI Design & Embedded Systems Conference, January 2015, Bengaluru, India. Cross-Layer Exploration of Heterogeneous Multicore Processor Configurations. Santanu Sarma and N. Dutt
  • 2. Introduction & Motivation
    •  Emerging and future computing systems will be heterogeneous multicore processors (HMPs) [Borkar11]
    •  Heterogeneity manifests even in homogeneous architectures due to process variability [Teodorescu08]
    •  They will be rich in different types of cores with diverse memories and accelerators [P20 Platform; ARM2013; Angstrom platform, MIT 2014]
    •  They are monitor-rich at lower layers of abstraction [Kornaros13, Lefurgy13, Gupta13]
  • 3. Examples of Existing HMPs
    Examples: ARM big.LITTLE, NVIDIA Tegra, and AMD GPGPU
    Trend towards heterogeneous multicore processors with different core specialization
  • 4. Emerging & Future HMPs
    [Figure: grid of NoC routers connecting A7, A11, and A15 cores with L2 and L3 caches, SRAM/SPM, eDRAM, on-chip flash, a GPU, and accelerators (Y, Z).]
    Futuristic heterogeneous many-core processor with distributed memories, heterogeneous networks, and accelerators
  • 5. Emerging & Future HMPs
    Futuristic heterogeneous multicore processors are expected to have shared memories, a coherent bus, multiple networks, and accelerators
    [Figure: A7, A11, and A15 cores with L2 caches on a cache-coherent interconnect, with L3, GPU accelerator, other accelerators (Y, Z), global interrupt controller, DRAM, SPM, disk, and Bluetooth, GSM, WiFi, 3/4G, and 5G interfaces.]
  • 6. HMP Composition Problem
    [Figure: four candidate floorplans (a)-(d) mixing A7, A11, and A15 cores around a shared LLC; representative applications are mapped onto an area- and power-constrained HMP architecture.]
    A configuration = the set of core counts of each type
    Which HMP configuration is the best for the representative applications?
  • 7. HMP Composition Problem
    [Plot: "HMP configuration for Area Budget of 4Ev6"; x-axis: HMP Configuration #; y-axis: # of Cores; series: Ev4, Ev5, Ev6, Ev8. Inset: relative core sizes of EV8, EV6, EV5, EV4.]
    Large design space of HMP configurations; 4xEV8 area results in 46428 HMP configurations
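The growth of this design space can be illustrated by enumerating every combination of core counts that fits an area budget. The sketch below is only an illustration: the relative core areas in `CORE_AREA` and the function name `enumerate_configs` are hypothetical and are not the EV4/EV5/EV6/EV8 areas used in the slides, so the count it prints will not match the 46428 configurations reported above.

```python
from itertools import product

# Hypothetical relative core areas (arbitrary units); NOT the values from the slides.
CORE_AREA = {"EV4": 1, "EV5": 2, "EV6": 4, "EV8": 16}

def enumerate_configs(area_budget, core_area=CORE_AREA):
    """Return every non-empty set of per-type core counts whose total area fits the budget."""
    types = list(core_area)
    # For each type, the count can range from 0 up to what the budget alone allows.
    ranges = [range(area_budget // core_area[t] + 1) for t in types]
    configs = []
    for counts in product(*ranges):
        used = sum(n * core_area[t] for n, t in zip(counts, types))
        if 0 < used <= area_budget:
            configs.append(dict(zip(types, counts)))
    return configs

if __name__ == "__main__":
    budget = 4 * CORE_AREA["EV6"]  # area equivalent of four EV6 cores
    print(len(enumerate_configs(budget)), "feasible configurations")
```

Exhaustive enumeration like this is only practical for small budgets, which is the motivation for pruning and search in the later slides.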
  • 8. HMP Composition Problem
    [Same plot, highlighting Config #1: LLC with EV6 x4.]
  • 9. HMP Composition Problem
    [Same plot, highlighting Config #2: LLC with EV6 x3 and EV5 x5.]
  • 10. HMP Composition Problem
    [Same plot, highlighting Config #9: LLC with EV6 x2, EV5 x5, and EV4 x8.]
  • 11. HMP Composition Problem
    [Same plot, highlighting Config #37: LLC with EV4 cores only.]
  • 12. Goal
    •  Explore and configure an HMP for a given system goal under system-level constraints (e.g., area or power)
      •  Performance Maximization (PerfMax)
      •  Energy Minimization (EnergyMin)
      •  Power Minimization (PowerMin)
      •  Energy Efficiency Maximization (EEMax)
    •  Enable the designer to comparatively evaluate and select the most promising (e.g., energy-efficient) HMP architecture
    •  Improve exploration time and resource requirements at relatively small error
    •  Present a holistic cross-layer approach that is more representative of actual HMP systems
  • 13. HMP Composition Problem
    Optimization problem, with the following quantities:
    –  J: platform goal/objective
    –  Processing cores in the HMP configuration
    –  Total number of cores of K types
    –  Total area of the HMP; a_i: area of core type i
    –  HMP configuration: the set of core counts of each type
    –  Set of all feasible configurations
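A formulation consistent with the quantities listed on this slide can be sketched as follows. Only J and a_i appear in the slide text; the symbols n_i (number of cores of type i), N, A(C), A_max (area budget), and the feasible set F are names assumed here for illustration, and the goal is written as a maximization (a minimization goal such as EnergyMin would maximize -J).

```latex
\begin{aligned}
C &= (n_1, n_2, \dots, n_K), \qquad n_i \in \mathbb{Z}_{\ge 0}, \\
N &= \sum_{i=1}^{K} n_i, \qquad A(C) = \sum_{i=1}^{K} n_i \, a_i, \\
C^{*} &= \arg\max_{C \in \mathcal{F}} J(C), \qquad
\mathcal{F} = \{\, C : A(C) \le A_{\max} \,\}.
\end{aligned}
```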
  • 14. Challenges in HMP Composition
    •  Extremely large design space
      –  Large parametric space
      –  Huge spatio-temporal dynamics
    •  Complex interaction of layers
      –  Identification of features and attributes
      –  Difficulty of capturing layer-specific attributes
      –  Mechanism to actuate layer-specific features
    •  Full-stack model building challenge
      –  Large volume of data / big data for model building
      –  Model composition
      –  Accuracy-complexity trade-off
  • 15. Related Work
    •  Existing work focuses on HMP runtime systems [14], [4], [21], [17], [2], [6], [10]
    •  Limited work in cross-layer modeling of HMPs and cross-layer DSE, but several pieces of work in DoE [1]
    •  Resource allocation [Zidenberg 2012, Zidenberg 2013]
      –  Optimal resource allocation to specialized accelerators in an SoC; not to cores in HMPs
      –  System objective: improve performance
      –  Does not consider the full system stack and OS
      –  Narrowly focuses on the hardware layer; not applicable to generic HMPs
  • 16. HMP Composition Approach
    •  Four stages of performing cross-layer design space exploration:
      1.  Build predictive model of each core type
      2.  Compose predictive model of the HMP configuration
      3.  Construct RSM of the system objective (J)
      4.  Find/search the best HMP configuration
  • 17. Cross-Layer Predictive Model
    [Figure, three columns: HMP Stack | Operating Parameters | HMP Predictive Model. The stack layers (Applications, Operating System, Instruction Set Architecture, Hardware Architecture, Network/Bus Communication Architecture, Device/Circuit Architecture) expose operating parameters SA, SO, SI, SH, SN, SC. Physical and virtual sensors/monitors and an observer sense the operating condition at the different layers and feed the predictive model, whose outputs are Perf., Power, Energy, Temp., Reliability, and Error.]
  • 18. Cross-Layer Predictive Model
    Application layer features:
    –  # of applications
    –  Application type
      •  memory bound / core bound
      •  real-time vs soft
      •  exact, approximate
      •  fixed vs floating point
    –  Application size / memory footprint
    –  Application phases
    –  Application criticality
    –  # of functions, classes, LOC
    –  Application complexity
    –  Degree of ILP, MLP
    –  Accuracy requirement
    –  Performance requirement
  • 19. Cross-Layer Predictive Model
    Operating system layer features:
    –  Scheduling policy
    –  Allocation / balancing policy
    –  Scheduling epoch
    –  Balancing epoch
    –  # of threads, thread types, thread priority
    –  Thread location history
    –  Number of context switches
    –  Migration overhead
    –  Busy cycles (cyBusy), idle cycles (cyIdle), sleep cycles (cySleep)
    –  Execution time matrix
    –  Performance characterization matrix (S)
    –  Power characterization matrix (P)
    –  Energy characterization matrix (E)
  • 20. Cross-Layer Predictive Model
    Instruction set layer features:
    –  ISA type and width (fixed)
    –  Committed instructions (Itotal)
    –  Committed loads and stores (Imem)
    –  Committed branches (Ibranch)
    –  Floating-point instructions (IFP)
    –  Integer instructions (Iint)
    –  Critical instructions (Icr)
    –  Non-critical instructions (Incr)
  • 21. Cross-Layer Predictive Model
    Hardware layer features and properties:
    –  Core type
    –  Issue width (Iw)
    –  LQ/SQ size (LSQ)
    –  IQ size (IQ)
    –  ROB size (ROB)
    –  Int/float regs (IFR)
    –  L1$I size (KB) (L1I), L1$D size (KB) (L1D)
    –  L2$I size (KB) (L2I), L2$D size (KB) (L2D)
    –  Core freq. (MHz) (F)
    –  Core voltage (V)
    –  Core area (a), uncore area (au)
    –  Core power (pw)
    –  Floorplan and placement
  • 22. Cross-Layer Predictive Model
    Network/bus layer features:
    Bus properties (Gem5 shared bus model [Binkert11]):
    –  Number of buses
    –  Bus type, bus width, bus frequency, bus mode
    –  # of L2 buses, # of coherence domains
    –  Contention
    –  Latency
    NoC properties [Orion 2.0]:
    –  Topology
    –  Routing policy
    –  Flit size, flit width
    –  # of VCs
    –  Buffer size
    –  Frequency & latency
    –  Contention
  • 23. Cross-Layer Predictive Model
    Circuit and device properties (imported from CACTI [Thoziyoor08] & McPAT [Shen09]):
    –  Technology node
    –  Tech. parameters
    –  VDD, VTh, bias voltage
    –  Wire model parameters
    –  Delay model parameters
    –  Memory cell model parameters
    –  Cell power model parameters
  • 24. Building Predictive Model of Core Types
    •  Divided into two phases:
      –  Training phase: known data (the training set) are used to identify the predictive model configuration; special benchmarks are used for coverage
      –  Prediction phase: the predictive model is used to forecast the unknown system response
    •  Regression-based data fitting is used in the predictive model of core types for performance (throughput) and power
    •  Predictor for each core type, in terms of: performance predictor coefficients, power predictor coefficients, and the cross-layer feature vector
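The two phases can be sketched with an ordinary least-squares fit. This is a minimal illustration that assumes the predictor is linear in the cross-layer feature vector; the slide does not show the actual model form, and the function names, feature choice, and training values below are hypothetical.

```python
def fit_linear(features, targets):
    """Training phase: ordinary least squares via the normal equations (X^T X) w = X^T y.

    features: list of cross-layer feature vectors, one per training run.
    targets:  measured responses (e.g. throughput or power) for those runs.
    Returns the coefficient list, with the bias term last.
    """
    X = [list(row) + [1.0] for row in features]  # append a bias column
    n = len(X[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(X, targets))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(weights, feature_vector):
    """Prediction phase: evaluate the fitted model on an unseen feature vector."""
    return sum(w * x for w, x in zip(weights, list(feature_vector) + [1.0]))

if __name__ == "__main__":
    # Hypothetical training set: (activity feature, frequency in GHz) -> power in W,
    # generated from power = 1.0 * x0 + 2.0 * x1 - 0.5.
    train_x = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
    train_y = [2.5, 3.5, 4.5, 5.5]
    power_coeffs = fit_linear(train_x, train_y)
    print(predict(power_coeffs, (1.5, 1.5)))
```

In this reading, one coefficient vector would be fitted per core type and per response (performance and power), all sharing the same feature vector.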
  • 25. Predictive Model of Core Types
    [Figure: the full system stack (applications App 0 to App n with tasks 0 to n, Linux kernel, HMP platform with Ev8, Ev6, Ev5, and Ev4 cores, DRAM, and disk) runs benchmarks on a heterogeneous platform simulator (Gem5 with McPAT and an HPC/sensing interface reporting performance and power). DoE data on system performance, power, and energy are regression-fitted to build the predictive model from the system specifications.]
    Per-layer inputs:
    –  Application: app. type, size, etc.; task/thread model; task execution time; task throughput
    –  Operating system: task allocation and scheduling policy/strategy; memory allocation; etc.
    –  Hardware: hardware architecture configurations; performance event counters; bus specifications
    –  Circuit/device: scaling; technology parameters; power/energy consumption; circuit delay parameters
  • 26. Compose Predictive Model of HMP Configuration
    •  Use the predictive models of the individual core types to compose the total system model
    •  The performance and power of each core are added to obtain full-system performance and power
    •  Core-to-core interference and interaction are captured via the features of the last-level cache and network
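The additive composition described on this slide can be sketched as below. The function name `compose_hmp`, the data layout, and the numbers in the usage lines are hypothetical; in this sketch any interference effects are assumed to be already reflected in the per-core predictions through their last-level cache and network features.

```python
def compose_hmp(config, core_models):
    """Sum per-core predictions to estimate full-system performance and power.

    config:      mapping core type -> number of cores of that type
    core_models: mapping core type -> (perf, power) predicted for ONE core of that type
    """
    perf = sum(n * core_models[t][0] for t, n in config.items())
    power = sum(n * core_models[t][1] for t, n in config.items())
    return perf, power

if __name__ == "__main__":
    # Hypothetical per-core predictions: (throughput, watts).
    models = {"EV6": (2.0, 4.0), "EV4": (0.5, 0.5)}
    print(compose_hmp({"EV6": 2, "EV4": 8}, models))
```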
  • 27. Construct RSM of the System Objective (J)
    •  Response surface models (RSMs) are approximate analytical expressions of the system objective (J)
    •  An RSM is a higher-level predictive model built from the individual core-type predictive models
    •  A system-level RSM can include uncore components and core-to-core interaction characteristics
    •  RSMs are dominated by
      –  core characteristics for computation-centric apps
      –  the network for communication-centric apps
  • 28. DSE Optimization
    •  Formulated as an optimization problem
      –  Uses predictive models of the core types and of the HMP
      –  Layer-specific goals can be included in system-level goals
      –  Models of individual core types are used to build HMP models
    •  Search for the best configuration for the objective using predictive models / RSM
    •  Used global optimization methods (simulated annealing, SA) to find the configuration
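A simulated-annealing search over core counts can be sketched as follows. This is a generic SA loop under stated assumptions, not the authors' implementation: the relative areas, per-core throughputs, toy objective, neighbor move, and linear cooling schedule are all hypothetical stand-ins for the predictive models and RSM.

```python
import math
import random

AREA = {"EV4": 1, "EV5": 2, "EV6": 4, "EV8": 16}          # hypothetical relative areas
PERF = {"EV4": 1.0, "EV5": 1.6, "EV6": 2.4, "EV8": 4.0}   # hypothetical per-core throughput

def area(cfg):
    return sum(n * AREA[t] for t, n in cfg.items())

def objective(cfg):
    """Toy stand-in for the RSM of J (total throughput, to be maximized)."""
    return sum(n * PERF[t] for t, n in cfg.items())

def neighbor(cfg, rng):
    """Add or remove one core of a randomly chosen type (never below zero)."""
    new = dict(cfg)
    t = rng.choice(list(AREA))
    new[t] = max(0, new[t] + rng.choice((-1, 1)))
    return new

def anneal(budget, steps=5000, t0=1.0, seed=0):
    rng = random.Random(seed)
    cur = {t: 0 for t in AREA}
    best = dict(cur)
    for k in range(steps):
        temp = t0 * (1.0 - k / steps) + 1e-9    # linear cooling schedule
        cand = neighbor(cur, rng)
        if area(cand) > budget:                  # reject moves that violate the area budget
            continue
        delta = objective(cand) - objective(cur)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            cur = cand
            if objective(cur) > objective(best):
                best = dict(cur)
    return best

if __name__ == "__main__":
    best = anneal(budget=16)
    print(best, objective(best), area(best))
```

Because each objective evaluation is a cheap model lookup rather than a full-system simulation, a loop like this can afford thousands of candidate configurations.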
  • 29. Experiments & Setup
    •  Experiment goal: find the HMP configuration C under system constraints
    •  Given:
      –  System-level goal (J):
        •  Performance Maximization (PerfMax)
        •  Energy Minimization (EnergyMin)
        •  Power Minimization (PowerMin)
        •  Energy Efficiency Maximization (EEMax)
      –  Individual layer-specific goal:
        •  E.g., OS allocation objective: minD, minE, minED, minED2
        •  Heterogeneity-aware allocator/scheduler
      –  Set of representative benchmarks: PARSEC, MediaBench
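The OS allocation objectives named on this slide can be illustrated with a small sketch. It assumes the usual reading of these names (D for delay, E for energy, ED for the energy-delay product, ED2 for energy times delay squared); the slide does not define them, and the function names and the two example options are hypothetical.

```python
def energy(power_w, delay_s):
    """Energy in joules for a task drawing power_w watts for delay_s seconds."""
    return power_w * delay_s

# Each metric is written so that a smaller value is better.
ALLOCATION_OBJECTIVES = {
    "minD":   lambda p, d: d,
    "minE":   lambda p, d: energy(p, d),
    "minED":  lambda p, d: energy(p, d) * d,        # energy-delay product
    "minED2": lambda p, d: energy(p, d) * d ** 2,   # energy-delay-squared product
}

def best_option(options, objective):
    """Pick the name of the (name, power, delay) option minimizing the chosen objective."""
    score = ALLOCATION_OBJECTIVES[objective]
    return min(options, key=lambda o: score(o[1], o[2]))[0]

if __name__ == "__main__":
    # Hypothetical choice between running a thread on a big or a small core.
    options = [("big", 4.0, 1.0), ("small", 1.0, 3.0)]
    for name in ALLOCATION_OBJECTIVES:
        print(name, "->", best_option(options, name))
```

With these example numbers the small core wins only under minE, which shows how a layer-specific objective can pull the allocation in a different direction than a delay-oriented system goal.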
  • 30. HMP Platform Setup
    Setup:
    •  Full system stack
    •  Integrated with Gem5 + McPAT + CACTI
    •  Supports multiple ISAs
    •  Runs Linux OS
    •  Linux OS allocator modified for heterogeneity awareness
    •  Simulation environment: cluster with 10000 cores; 10 TB storage
    [Figure: applications (App 0 to App n, each with threads 0 to n) run on a Linux kernel with per-core run queues (RQ), schedule(), load_balance(), and a heterogeneity-aware scheduler, on an extended Gem5 platform with Ev4, Ev6, Ev7, and EV8 cores (each with $I, $D, and L2), DRAM, and disk; McPAT and an HPC/sensing interface report performance and power for the benchmarks.]
  • 31. Performance of Predictive Model of Core Types
    Full-system performance and power characteristics of core types
    [Plots: "Predictor Error for Core Type Ev6" and "Predictor Error for Core Type Ev4"; x-axis: Benchmarks; y-axis: % Error (0 to 6); series: Perf. Error, Power Error.]
    Performance and power prediction errors are within 5% for each core type
  • 32. Performance of Predictive Model of Core Types
    Legend: Huge (EV8); Big (EV6); Medium (EV5); Small (EV4)
    Performance and power prediction errors are within 9% for core-to-core types
  • 33. Performance of HMP System Predictive Model
    [Plot: "HMP configuration for Area Budget of 4Ev6"; x-axis: HMP Configuration #; y-axis: # of Cores; series: Ev4, Ev5, Ev6, Ev8.]
    Performance and power prediction errors are within 9% for system-level HMP configurations; over 1000x speedup
  • 34. Experimental DSE Results
    •  HMP configurations can have 2x-3x performance power difference for the same area resource
    •  With increasing load, the EDP increases with heterogeneity-awareness
    •  Layer-specific objectives can severely interfere with the system objective
    •  Some cross-layer features have a dominant impact on HMP power and performance
    Lack of heterogeneity-awareness will have serious implications for HMP performance and power, and thus for the composition problem
    [Figure: Cross-Layer DSE using Predictive Models]
  • 35. HMP Configurations
    –  Config #1 [Performance Centric]: LLC with EV6 x4
    –  Config #2 [Power-Throughput Efficient]: LLC with EV6 x3 and EV5 x5
    –  Config #9 [Energy Efficient]: LLC with EV6 x2, EV5 x5, and EV4 x8
    –  Config #37 [Energy Centric]: LLC with EV4 cores only
  • 36. Conclusion
    •  We presented a holistic cross-layer approach for HMP composition under system-level constraints (e.g., area or power), formulated as an optimization problem
    •  The approach consists of predictive cross-layer models of the core types and of the total system that are computationally efficient for design exploration
    •  It enables over two orders of magnitude improvement in exploration time and resource requirements at less than 7% average error
    •  We show:
      –  HMP configurations can have 2x-3x performance power difference for the same area resource
      –  With increasing load, the EDP increases with heterogeneity-awareness
      –  Layer-specific objectives can severely interfere with the system objective
      –  Some cross-layer features have a dominant impact on HMP power and performance
  • 37. References
1.  The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
2.  G. Palermo et al. ReSPIR: A response surface-based Pareto iterative refinement for application-specific design space exploration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12):1816–1829, Dec. 2009.
3.  A. D. Pimentel et al. A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE Transactions on Computers, 55(2):99–112, Feb. 2006.
4.  K. Keutzer et al. System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12):1523–1543, Dec. 2000.
5.  P. Greenhalgh. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7: Improving energy efficiency in high-performance mobile platforms. 2011.
6.  NVIDIA. Variable SMP - A multi-core CPU architecture for low power and high performance. 2011.
7.  T. Zidenberg, I. Keslassy, and U. Weiser. Optimal resource allocation with MultiAmdahl. Computer, 46(7):70–77, July 2013.
8.  T. Zidenberg, I. Keslassy, and U. Weiser. MultiAmdahl: How should I divide my heterogeneous chip? Computer Architecture Letters, 11(2):65–68, 2012.
9.  S. Li et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), pages 469–480, 2009.
10.  S. Thoziyoor et al. CACTI 5.1. HP Laboratories, April 2, 2008.
11.  N. Binkert et al. The gem5 simulator. SIGARCH Computer Architecture News, 39(2):1–7, August 2011.
  • 38. THANKS
S. Sarma
www.variability.org   www.nsf.gov   www.uci.edu
  • 39. Towards Full System Energy Efficiency Models
•  Application & Workload Model
•  OS and Scheduling Model
•  Hardware, Memory & Bus Architecture
•  Circuit and Device Models
  • 40. HMP Composition Approach
•  Perform cross-layer design space exploration
•  The large design space is pruned using design of experiments (DoE)
•  Formulated as an optimization problem
  –  Uses predictive models of the HMP
  –  System-level and layer-specific goals are evaluated using the predictive models
  –  Models of the individual core types are used to build the HMP models
•  Global optimization methods (simulated annealing, SA, or a genetic algorithm, GA) are used to find the configuration
[Figure: Algorithm for DSE]
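The search step on this slide can be illustrated with a minimal simulated-annealing sketch. Everything concrete here is an assumption made for illustration: the per-core area, throughput, and power figures, the area budget, the EDP-like objective (power divided by throughput squared), and the function names are hypothetical stand-ins for the predictive models, not values or code from the talk.

```python
import math
import random

# Hypothetical per-core-type figures (area, relative throughput, power).
# Placeholder numbers for illustration only; they are not from the slides.
CORE_TYPES = {
    "EV4": {"area": 1.0, "perf": 1.0, "power": 0.5},
    "EV5": {"area": 2.5, "perf": 2.0, "power": 1.5},
    "EV6": {"area": 6.0, "perf": 3.5, "power": 5.0},
}
AREA_BUDGET = 40.0  # assumed system-level area constraint


def area(config):
    """Total core area of a configuration {core_type: count}."""
    return sum(CORE_TYPES[t]["area"] * n for t, n in config.items())


def feasible(config):
    """A configuration needs at least one core and must fit the area budget."""
    return sum(config.values()) > 0 and area(config) <= AREA_BUDGET


def objective(config):
    """Toy stand-in for the predictive model: an EDP-like cost, lower is better."""
    perf = sum(CORE_TYPES[t]["perf"] * n for t, n in config.items())
    power = sum(CORE_TYPES[t]["power"] * n for t, n in config.items())
    return power / (perf * perf)


def neighbor(config, rng):
    """Add or remove one core of a randomly chosen type."""
    cand = dict(config)
    t = rng.choice(sorted(cand))
    cand[t] = max(0, cand[t] + rng.choice((-1, 1)))
    return cand


def anneal(steps=5000, t0=0.05, cooling=0.999, seed=0):
    """Simulated annealing over core counts; returns the best feasible config seen."""
    rng = random.Random(seed)
    current = {"EV4": 1, "EV5": 0, "EV6": 0}
    best = dict(current)
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        if feasible(cand):
            delta = objective(cand) - objective(current)
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = cand
                if objective(current) < objective(best):
                    best = dict(current)
        temp *= cooling
    return best
```

Calling `anneal()` returns a dictionary of core counts that respects the assumed area budget. In a real flow the `objective` function would be replaced by the cross-layer predictive model evaluated for the chosen system goal.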
  • 41. Predictive Model
Layers | Parametric Features & Attributes | Remarks
Hardware Architecture Features | Issue width (Iw), LQ/SQ size (LSQ), IQ size (IQ), ROB size (ROB), Int/float regs (IFR), L1$I size (KB) (L1I), L1$D size (KB) (L1D), Freq. (MHz) (F), Voltage (V), Core area (a) |
Performance Event Counters | Branch misprediction rate (mB), L1 instruction miss rate (mL1I), L1 data cache miss rate (mL1D), instruction TLB miss rate (mITLB), data TLB miss rate (mDTLB), context switch counters (Cw) |
Cycle and Instruction Counters | Busy cycles (cyBusy), idle cycles (cyIdle), sleep cycles (cySleep), committed instructions (Itotal), committed loads and stores (Imem), committed branches (Ibranch) |
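As a rough illustration of how the cycle and instruction counters in this table might be turned into model inputs, the sketch below groups them in a record and derives a few ratios. The class name, field names, and the choice of derived quantities (utilization, IPC, memory and branch fractions) are assumptions for illustration; the slide lists only the raw counters.

```python
from dataclasses import dataclass


@dataclass
class CoreCounters:
    """Cycle and instruction counters for one core (names follow the table)."""
    cy_busy: int    # cyBusy
    cy_idle: int    # cyIdle
    cy_sleep: int   # cySleep
    i_total: int    # Itotal, committed instructions
    i_mem: int      # Imem, committed loads and stores
    i_branch: int   # Ibranch, committed branches

    def utilization(self) -> float:
        """Fraction of all observed cycles in which the core was busy."""
        total = self.cy_busy + self.cy_idle + self.cy_sleep
        return self.cy_busy / total if total else 0.0

    def ipc(self) -> float:
        """Committed instructions per busy cycle."""
        return self.i_total / self.cy_busy if self.cy_busy else 0.0

    def mem_fraction(self) -> float:
        """Share of committed instructions that are loads or stores."""
        return self.i_mem / self.i_total if self.i_total else 0.0

    def branch_fraction(self) -> float:
        """Share of committed instructions that are branches."""
        return self.i_branch / self.i_total if self.i_total else 0.0


def feature_vector(c: CoreCounters) -> list:
    """Assumed ordering of derived features passed to a predictive model."""
    return [c.utilization(), c.ipc(), c.mem_fraction(), c.branch_fraction()]
```

Normalizing raw counts into ratios like these keeps the features comparable across core types that run at different frequencies, which is one plausible reason to derive them before fitting a model.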
  • 42. HMP Composition Approach
•  Perform cross-layer design space exploration
  –  Provides a holistic approach covering the complete system
  –  Jointly considers features of the application, OS, HW, bus/network, and circuit and device layers
  –  Avoids the pathological scenarios of a single-layer approach
  –  Effectively captures crucial interactions between layers
  –  Improves exploration time and resource use at the cost of small errors
  –  Uses computationally efficient predictive models developed from
•  The large design space is pruned using design of experiments (DoE)
•  Formulated as an optimization problem
  –  Uses predictive models of the HMP
  –  System-level and layer-specific goals are evaluated using the predictive models
  –  Models of the individual core types are used to build the HMP models
•  Global optimization methods (SA or GA) are used to find the configuration
Four stages:
1.  Build a predictive model of the core types
2.  Compose a predictive model of the HMP configuration
3.  Use the HMP predictive model to build an RSM of the objective (J)
4.  Find the best HMP configuration for the objective
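The four stages can be walked through with a deliberately simplified sketch. It fits a one-variable linear power model per core type, composes an HMP model by summing per-core predictions, and then evaluates a handful of candidates directly. This is a simplification under stated assumptions: direct evaluation stands in for the response surface model (RSM) and for the global search, and the sample data, throughput figures, and function names are all hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


# Stage 1: per-core-type predictive models from hypothetical
# (utilization, power) samples. Placeholder data, not from the slides.
SAMPLES = {
    "EV4": ([0.0, 0.5, 1.0], [0.1, 0.3, 0.5]),
    "EV6": ([0.0, 0.5, 1.0], [1.0, 3.0, 5.0]),
}
THROUGHPUT = {"EV4": 1.0, "EV6": 3.5}  # assumed relative throughput per core
CORE_MODELS = {t: fit_line(xs, ys) for t, (xs, ys) in SAMPLES.items()}


def core_power(core_type, util):
    """Predicted power of one core of the given type at a utilization."""
    a, b = CORE_MODELS[core_type]
    return a + b * util


# Stage 2: compose the HMP model from the core-type models.
def hmp_power(config, util):
    return sum(n * core_power(t, util) for t, n in config.items())


def hmp_throughput(config):
    return sum(n * THROUGHPUT[t] for t, n in config.items())


# Stages 3 and 4 (simplified): evaluate the objective for each candidate
# and keep the highest-throughput one that meets the power constraint.
def best_config(candidates, util, power_cap):
    feasible = [c for c in candidates if hmp_power(c, util) <= power_cap]
    return max(feasible, key=hmp_throughput, default=None)
```

For example, `best_config([{"EV4": 8, "EV6": 0}, {"EV4": 4, "EV6": 1}, {"EV4": 0, "EV6": 2}], util=1.0, power_cap=8.0)` compares three candidates under an assumed power cap. The additive composition ignores shared-resource contention, which is exactly the kind of cross-layer interaction a fuller model would need to capture.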
  • 43. Related Work
•  Existing work focuses on HMP runtime systems [14], [4], [21], [17], [2], [6], [10]
•  Limited work on cross-layer modeling of HMPs and cross-layer DSE
•  Closest to our work: [Zidenberg 2012, Zidenberg 2013]
  –  Optimal resource allocation to specialized accelerators in an SoC, not to cores
  –  System objective: improve performance
  –  Do not consider the full system stack and OS
  –  Focus only on the hardware layer
  • 44. Cross-Layer Predictive Model
[Figure: sensing and monitoring at different layers of the HMP stack, feeding an HMP predictive model]
HMP stack layers (sensed state): Applications (SA), Operating System (SO), Instruction Set Architecture (SI), Hardware Architecture (SH), Network/Bus Communication Architecture (SN), Device/Circuit Architecture (SC)
Sensing labels: sensors, monitors, and observer; virtual sensors/monitors; physical sensors/monitors; operating condition; operating parameters
Predictive model labels: Perf., Power, Energy, Temp., Reliability, Error
  • 45. Cross-Layer Predictive Model
[Figure: sensing and monitoring at different layers of the HMP stack, feeding an HMP predictive model]
HMP stack layers (sensed state): Applications (SA), Operating System (SO), Instruction Set Architecture (SI), Hardware Architecture (SH), Network/Bus Communication Architecture (SN), Device/Circuit Architecture (SC)
Sensing labels: sensors, monitors, and observer; virtual sensors/monitors; physical sensors/monitors; operating condition; operating parameters
Predictive model labels: Perf., Power, Energy, Temp., Reliability, Errors, Vulnerabil