Performance experiments on BOUT++
Praveen Narayanan and Alice Koniges
Why Analyze Performance?
•  Improving performance on HPC systems has compelling economic and scientific rationales.
  –  Dave Bailey: value of improving the performance of a single application that uses 5% of a machine's cycles by 20% over 10 years: $1,500,000
  –  Scientific benefit probably much higher
•  Goal: solve problems faster; solve larger problems
•  Accurately state computational need
•  Only that which can be measured can be improved
•  The challenge is mapping the application to an increasingly complex system architecture
  –  or set of architectures
Performance Analysis Issues
•  Difficult process for real codes
•  Many ways of measuring and reporting
•  Very broad space: not just time on one problem size
  –  Fixed total problem size (memory per processor shrinks as concurrency grows): strong scaling
  –  Scaled-up problem (fixed work per processor, ideally fixed execution time): weak scaling
•  A variety of pitfalls abound
  –  Must compare parallel performance to the best uniprocessor algorithm, not just the parallel program on 1 processor (unless it is the best)
  –  Be careful relying on any single number
•  Amdahl's Law (see the sketch below)
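The scaling and Amdahl points above can be made concrete with a small sketch. The serial fraction used below is an illustrative assumption; the two wall times are taken from the kernel-breakdown table later in this deck (364 s on 4096 cores, 201 s on 8192 cores).

// Illustrative only: Amdahl's-law speedup bound and strong-scaling efficiency.
#include <cstdio>

// Amdahl's law: speedup on p processors when a fraction s of the work is serial.
double amdahl_speedup(double s, int p) { return 1.0 / (s + (1.0 - s) / p); }

// Strong scaling (same total problem): efficiency = (t_ref * p_ref) / (t_p * p).
double strong_efficiency(double t_ref, int p_ref, double t_p, int p) {
  return (t_ref * p_ref) / (t_p * p);
}

int main() {
  std::printf("Amdahl bound, serial fraction 0.01, p = 8192: %.1fx\n",
              amdahl_speedup(0.01, 8192));
  // 364 s on 4096 cores vs 201 s on 8192 cores (wall times from the kernel
  // table later in this deck) -> about 0.91 strong-scaling efficiency.
  std::printf("Strong-scaling efficiency 4096 -> 8192 cores: %.2f\n",
              strong_efficiency(364.0, 4096, 201.0, 8192));
  return 0;
}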
Performance Questions
•  How can we tell if a program is performing well?
•  Or isn't?
•  If performance is not "good", how can we pinpoint why?
•  How can we identify the causes?
•  What can we do about it?
The multicore era
•  Moore's law still extant
•  Traditional sources of performance improvement ending
  –  Old trend: double the clock frequency every 18 months
  –  New trend: double the number of cores every 18 months
  –  Implications: flops cheap; communication and network bandwidth expensive in future
Processor-DRAM Gap (latency)
•  Memory hierarchies are getting deeper
  –  Processors get faster more quickly than memory
•  [Chart: processor vs. DRAM performance, 1980-2000 ("Moore's Law"): µProc ~60%/yr vs. DRAM ~7%/yr, so the processor-memory performance gap grows ~50% per year. Slide from Katherine Yelick's CS-267 notes.]
Performance analysis tools
•  Use profilers to measure performance
•  Approach
  –  Build an instrumented binary that monitors the function groups of interest
     (MPI, OpenMP, PGAS, HWC, IO)
  –  Run the instrumented binary
  –  Identify performance bottlenecks
  –  Suggest (and possibly implement) improvements to optimize the code
•  Tools used
  –  IPM: low overhead; communication, flops, code regions
  –  CrayPAT: communication, flops, code regions, PGAS, a variety of tracing options (a region-annotation sketch follows this slide)
•  Overhead depends on the number of trace groups monitored
•  Level of detail in the study depends on specifics: time available, difficulties presented by the code
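A hedged illustration of the CrayPAT region API used later in this deck: assuming the Cray perftools environment is loaded (so pat_api.h is available) and the binary has been instrumented with pat_build, a region of interest can be bracketed as below and its counters then appear, by id and label, in the pat_report output.

// Minimal sketch of a CrayPAT user-defined region (the annotated phys_run
// region later in this deck uses exactly this mechanism).
#include <pat_api.h>

void compute_step() {
  PAT_region_begin(1, "compute_step");   // region id and label appear in pat_report
  // ... kernel of interest ...
  PAT_region_end(1);
}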
Performance checklist
•  Scaling
  –  Application time, speed (flop/s)
  –  Double the concurrency: does speed double?
•  Communication (vs. computation)
•  Load imbalance
  –  Check the cabinet for mammoths and mosquitoes (see the sketch below)
•  Size and composition of communication
  –  Bandwidth bound?
  –  Latency bound?
  –  Collectives (large concurrency)
•  Memory bandwidth sensitivity
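One quick way to check the "mammoths and mosquitoes" item is to compare the maximum and the average per-rank compute time; a large max/avg ratio indicates imbalance. A minimal MPI sketch (the timer placement is illustrative):

// Illustrative load-imbalance check: max vs. average per-rank compute time.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  double t0 = MPI_Wtime();
  // ... local computation of interest ...
  double t_local = MPI_Wtime() - t0;

  double t_max, t_sum;
  MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
  MPI_Reduce(&t_local, &t_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    double t_avg = t_sum / nprocs;
    std::printf("max/avg compute time = %.2f (1.0 = perfectly balanced)\n",
                t_max / t_avg);
  }
  MPI_Finalize();
  return 0;
}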
BOUT++ scaling results (Elm-pb)
•  Grid used: nx=2052, ny=512; set nxpe=256
•  [Scaling charts: wall time and speed vs. concurrency, annotated "good" at lower concurrency and "less good" at higher concurrency]
•  Beyond a point the code does not scale, but communication time does not increase (it is essentially stagnant)
•  Runtime increases, and performance decreases, for other reasons
Communication not the reason for performance degradation
•  Separate the communication portion of the wall time from the compute portion and consider the compute-only rate
     speed = flop / (computation time · core)
   (see the sketch below)
•  This should be constant under ideal scaling, and it would remain so if communication were the reason for the performance degradation
•  [Chart: compute-only speed vs. concurrency, "good" at lower concurrency, "less good" beyond]
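Reading the formula above as flop count divided by computation time (wall time minus communication time) times the number of cores, a minimal sketch follows; the input values are placeholders, not measurements.

// Sketch of the compute-only rate used on this slide.
#include <cstdio>

// flops: flop count for the run; wall_s/comm_s: wall and communication time (s).
double compute_speed(double flops, double wall_s, double comm_s, int cores) {
  double compute_s = wall_s - comm_s;   // computation-only time
  return flops / (compute_s * cores);   // flop/s per core
}

int main() {
  // Placeholder values; under ideal scaling this number stays flat as the
  // core count doubles, which is the check made on this slide.
  std::printf("%.3g flop/s per core\n", compute_speed(1.0e15, 200.0, 20.0, 8192));
  return 0;
}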
MPI pie shows significance of Allreduce calls
•  MPI_Allreduce calls form the bulk of the MPI pie, with MPI_Wait next (shown at 65,536 cores)
•  MPI average message size 5 kB, about 74,000 calls
•  Not quite entirely latency bound on Hopper (5 kB should be large enough); see the estimate below
•  Might become a bottleneck after other issues are sorted out
•  (communication not yet a bottleneck)
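Whether a 5 kB message is latency- or bandwidth-dominated can be estimated with a simple alpha-beta (latency + size/bandwidth) model. The latency and bandwidth values below are rough, assumed orders of magnitude for a Gemini-class interconnect, not measured numbers.

// Rough alpha-beta estimate for a message: t = alpha + m/beta.
#include <cstdio>

int main() {
  const double alpha = 1.5e-6;   // assumed per-message latency (s)
  const double beta  = 5.0e9;    // assumed bandwidth (bytes/s)
  const double m     = 5.0e3;    // ~5 kB average message size, from the IPM data

  std::printf("latency term %.1e s, bandwidth term %.1e s\n", alpha, m / beta);
  // The two terms are comparable for these assumptions, consistent with the
  // slide's "not quite entirely latency bound" observation.
  return 0;
}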
BOUT++: break up times in each kernel to check how they scale
•  Breakup by time spent: Calc scales somewhat, but Inv and Solver do not scale
•  [Chart: Wall, Calc, Inv, Comm and Solver times (log scale, ~1-1000 s) vs. concurrency 4096, 8192, 16384, 32768, 65536]
BOUT++ (elm) scaling summary
•  Up to a concurrency of 8,192, the code scales nearly perfectly.
•  Two issues beyond 8,192:
  –  Performance decreases (flop/time decreases)
  –  Wall time increases
  –  MPI is not the reason for the performance degradation
  –  Computational performance decreases
Issues with increasing flop count
•  Steady increase in flop count (number of operations) with concurrency
•  Conjecture: extra computations in the ghost cells (and more cycles spent doing these); the valid region (excluding the ghost region) does the same amount of work (see the sketch below)
•  [Diagram: as concurrency increases, grid points per process decrease (e.g., 8×8 blocks become 4×4), so the fixed-width ghost region becomes a larger fraction of each local array]
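The conjecture can be quantified with a simple count: if a local block has nx-by-ny valid points padded by g ghost cells on each side, and field operations sweep the whole padded array (which is what the conjecture assumes), the relative overhead grows as the block shrinks. A minimal sketch; g = 2 matches the MXG = MYG = 2 used in the LAPD-drift study below.

// Ratio of (valid + ghost) points to valid points for an nx-by-ny local block
// with g ghost cells on each side.
#include <cstdio>

double inflation(int nx, int ny, int g) {
  double padded = double(nx + 2 * g) * double(ny + 2 * g);
  return padded / (double(nx) * double(ny));
}

int main() {
  const int g = 2;   // MXG = MYG = 2, as in the LAPD-drift study below
  // As concurrency grows, points per process shrink and the ratio grows.
  std::printf("8x8 block: %.2fx work   4x4 block: %.2fx work\n",
              inflation(8, 8, g), inflation(4, 4, g));
  return 0;
}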
BOUT++: Expt-LAPD-drift
•  Experiment: turbulence in an annulus (LAPD)
•  Investigate the source of the extra computations in the code [more work done, leading to a greater flop count (not 'flop/s')]
•  Code annotated to give flop counts with CrayPAT
•  A given portion is annotated and the flop count in that code section is compared across concurrencies
•  Conjecture: the increase in flops in this section is due to ghost-cell computations (arrays consist of valid + ghost regions)
Annotated code region
•  The region of interest is bracketed with CrayPAT API calls so its flop count can be measured separately (the ^ operator is BOUT++'s overloaded power operator for fields):

PAT_region_begin(24, "phys_run-14");
nu = nu_hat * Nit / (Tet^1.5);
mu_i = mui_hat * (Tit^2.5) / Nit;
kapa_Te = 3.2 * (1. / fmei) * (wci / nueix) * (Tet^2.5);
kapa_Ti = 3.9 * (wci / nuiix) * (Tit^2.5);
// Calculate pressures
pei = (Tet + Tit) * Nit;
pe = Tet * Nit;
PAT_region_end(24);

•  The quantities Tet, Tit, Nit, etc. are declared as Field3D objects, which comprise the valid region plus the ghost cells, so the extra computation in the bracketed box can be measured:

// 3D evolving fields
Field3D rho, ni, ajpar, te;
// Derived 3D variables
Field3D phi, Apar, Ve, jpar;
// Non-linear coefficients
Field3D nu, mu_i, kapa_Te, kapa_Ti;
// 3D total values
Field3D Nit, Tit, Tet, Vit, phit, VEt, ajp0;
// pressures
Field3D pei, pe;
Field2D pei0, pe0;
Computation in guard cells
•  Grid: 204x128; ghost cells: MXG = MYG = 2
•  6400 processes is the extreme case (2 inner grid points per process in the subdivided direction); 3200 is the next most extreme (4 inner points)
•  Extreme case: 3 times as much work as expected; next case: twice as much work as expected
•  [Diagram: local blocks of width w and w/2 with the fixed-width ghost region around them]
Observations: validation of ghost cell conjecture

Concurrency   FPO in given region (CrayPat)   Factor predicted       Actual
6400          51512160000                     1 (reference case)     --
3200          34341360000                     2/3                    2/3
1600          25755960000                     1/2                    1/2
800           21463260000                     1.25/3                 1.25/3

•  Predicted values match the computed flops exactly, in terms of ratios (see the check below)
•  The hypothesis appears to be correct
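The predicted factors in the table can be reproduced from the extreme-direction widths: with 2 inner points per process at 6400 processes (doubling each time the process count halves) and 2 ghost cells on each side, the per-process work inflation is (n + 4)/n, and dividing by the 3x of the reference case gives the predicted ratios. A small check against the measured flop counts:

// Check the ghost-cell prediction against the CrayPAT flop counts above.
#include <cstdio>

int main() {
  // FPO counts from the table above and the corresponding process counts.
  const double fpo[]   = {51512160000., 34341360000., 25755960000., 21463260000.};
  const int    procs[] = {6400, 3200, 1600, 800};
  const int g = 2;                 // ghost cells per side (MXG = MYG = 2)
  for (int i = 0; i < 4; ++i) {
    int n = 2 << i;                // inner points per process: 2, 4, 8, 16
    double predicted = ((n + 2.0 * g) / n) / 3.0;   // inflation relative to 3x at 6400
    double measured  = fpo[i] / fpo[0];
    std::printf("%5d procs: predicted %.3f  measured %.3f\n",
                procs[i], predicted, measured);
  }
  return 0;
}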
Kernels: Calc scales, affected by ghost cells
•  Breakup by time spent (seconds):

Concurrency        Wall   Calc   Inv   Comm   Solver
4096               364    298    24    23     19
8192               201    151    19    20     11
16384 (1st expt)   124    74     16    18     16
16384 (2nd expt)   197    136    17    30     13

•  Calc scales; Inv and Solver do not
•  The two 16384-process rows are two experiments with different per-process decompositions
BOUT++ results: improve INV, PVODE
•  Summary
  –  Scaling degrades beyond 8192 procs
  –  Scaling efficiency is very poor at 32768 procs
  –  Issues
     •  Extra computations in ghost cells
        »  Put in place to reduce communication
        »  Need to check whether the extra computations performed are worth it
     •  Performance degrades because
        »  Laplacian inversion does not scale
        »  The PVODE solver does not scale
        »  MPI collectives grow
     •  The surface-to-volume ratio tested may not be the best, but the issues remain
  –  MPI: collectives grow with concurrency, but Laplacian inversion and the PVODE solver seem to be culpable in equal measure
•  Recommendation: check whether replacing ghost-cell computation with communication improves runtime (or not)
•  Need to improve the Laplacian inversion and PVODE kernels
Investigation of collectives in time-stepping algorithms (PETSc)
•  Time-stepping algorithms are suspected to involve collectives
  –  First step: check the growth of collectives in the time-steppers
  –  Hook the code up with PETSc and turn on its profiling layer
     (-log_summary; see the sketch below)
  –  Examine the percentage of collectives vis-a-vis the runtime
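For reference, the profiling layer only requires that PETSc be initialized normally; the summary (including counts and timings of MPI reductions) is printed at PetscFinalize when the option is supplied on the command line. A minimal sketch; -log_summary is the option name used here (newer PETSc releases call it -log_view).

// Minimal PETSc driver sketch: run with "-log_summary" (or "-log_view" in
// newer PETSc) to get per-event timings, including MPI reduction counts.
#include <petscsys.h>

int main(int argc, char **argv) {
  PetscInitialize(&argc, &argv, NULL, NULL);  // -log_summary is picked up from argv
  // ... BOUT++/PETSc time-stepping would run here ...
  PetscFinalize();                            // profiling summary is printed here
  return 0;
}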
Overall conclusions and future directions(?)
•  BOUT++ scales remarkably well for a strong scaling test
  –  Performance degradation is 'not just' because of the increase in the surface-to-volume ratio
•  Communication increases at higher concurrency and could constitute the ultimate scaling bottleneck
•  Extra ghost cell computations
  –  Put in there for a reason, to lessen communication, but manifest as extra time spent in computation
  –  Might be good overhead when flops become cheaper
•  Bandwidth sensitivity is not an issue in BOUT++
•  Two-dimensional domain decomposition
  –  Possibly add OpenMP in the third direction? (see the sketch below)
•  Collectives might play a dominant role in the time-steppers
  –  Find ways of minimizing this
  –  What is the effect of putting in preconditioners?
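As a hedged illustration of the OpenMP suggestion above: since the MPI decomposition covers only two directions, threads could split the undecomposed third (z) direction, roughly as below. The loop nest and array layout are assumptions for illustration, not BOUT++'s actual internals.

// Illustrative only: OpenMP over the undecomposed z direction of a local block.
#include <omp.h>

// Thread the z direction of a local (x, y, z) block; MPI already splits x and y.
void apply_update(const double *in, double *out, int nx, int ny, int nz) {
  #pragma omp parallel for
  for (int iz = 0; iz < nz; ++iz)
    for (int ix = 0; ix < nx; ++ix)
      for (int iy = 0; iy < ny; ++iy) {
        int idx = (iz * nx + ix) * ny + iy;   // assumed z-major layout
        out[idx] = 2.0 * in[idx];             // stand-in for the real per-point update
      }
}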
