The document discusses performance analysis of the BOUT++ code. It notes that improving HPC performance has compelling economic and scientific benefits. The goals of performance analysis are to identify bottlenecks and suggest improvements to optimize code performance. Profilers are used to measure performance and identify issues such as poor scaling, load imbalance, communication overhead, and memory bandwidth sensitivity. Analysis of BOUT++ shows good scaling up to 8,192 cores but decreased performance at higher concurrencies, potentially due to increased computational work in ghost cells as the number of grid points per processor decreases with increasing concurrency.
Keystone Data Pipeline manages several thousand Flink pipelines with variable workloads. These pipelines are simple routers that consume from Kafka and write to one of three sinks. To alleviate our operational overhead, we've implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25%-45% (varying by region and time) and has reduced our on-call burden. This talk will take an in-depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work on autoscaling complex pipelines.
Computing recommendations at extreme scale with Apache Flink @ Buzzwords 2015 - Till Rohrmann
How to scale recommendations to an extremely large scale using Apache Flink. We use matrix factorization to compute a latent factor model that can be used for collaborative filtering. The implemented alternating least squares algorithm is able to deal with data sizes on the scale of Netflix.
An Introduction to Distributed Data Streaming - Paris Carbone
A lecture on distributed data streaming, introducing all the basic abstractions such as windowing, synopses (state), partitioning, and parallelism, and applying them to an example pipeline for detecting fires. It also offers a brief introduction and motivation on reliability guarantees and the need for repeatable sources and application-level fault tolerance and consistency.
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini (Flink Forward)
The world's toughest and most interesting analysis tasks lie at the intersection of graph data (inter-dependencies in the data) and deep learning (inter-dependencies in the model). Classical graph embedding techniques have for years occupied research groups seeking how complex graphs can be encoded into a low-dimensional latent space. Recently, deep learning has dominated the space of embedding generation due to its ability to automatically generate embeddings given any static graph.
Grapharis is a project that revitalizes the concept of graph embeddings, yet it does so in a real setting where graphs are not static but keep changing over time (think of user interactions in social networks). More specifically, we explored how a system like Flink can be used both to simplify the process of training a graph embedding model incrementally and to make complex inferences and predictions in real time using graph-structured data streams. To our knowledge, Grapharis is the first complete data pipeline using Flink and TensorFlow for real-time deep graph learning. This talk will cover how we can train, store, and generate embeddings continuously and accurately as data evolves over time, without the need to re-train the underlying model.
Fault Tolerance and Job Recovery in Apache Flink @ FlinkForward 2015 - Till Rohrmann
The talk explains how Apache Flink checkpoints stateful jobs using the asynchronous barrier snapshotting algorithm to give exactly-once semantics in streaming. Furthermore, Flink's approach to master high availability (HA) is described, which solves the problem of the JobManager being a single point of failure. Job checkpointing in combination with HA is the basis for Flink's fault tolerance mechanism for recovering from failures.
What's new in the 1.9.0 Blink planner - Kurt Young, Alibaba (Flink Forward)
Flink 1.9.0 added the ability to support multiple SQL planners under the same API. With this capability, we successfully merged a lot of features that come from Alibaba's internal Flink version, called Blink. In this talk, I will give an introduction to the architecture of the Blink planner and also share with you the functionalities and performance enhancements we added.
Apache Flink: Streaming Done Right @ FOSDEM 2016 - Till Rohrmann
The talk I gave at FOSDEM 2016 on the 31st of January.
The talk explains how we can do stateful stream processing with Apache Flink, using the example of counting tweet impressions. It covers Flink's windowing semantics, stateful operators, fault tolerance, and performance numbers. The talk ends with an outlook on what is going to happen in the next couple of months.
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory, and how to tune it depending on the use case.
SignalFx: Making Cassandra Perform as a Time Series Database - DataStax Academy
SignalFx ingests, processes, runs analytics against, and ultimately stores massive numbers of time series streaming in parallel into our service, which provides an analytics-based monitoring platform for modern applications.
We chose to build our time series database (TSDB) on Cassandra for its read and write performance at high load. This presentation will go over our evolution of optimizations to squeeze the most performance out of the TSDB to date, and some steps we'll be taking in the future.
Back to FME School - Day 2: Your Data and FME - Safe Software
It’s that time of year. The season is changing and FME ‘school’ is now in session! Join us for a series of 9 mini-talks to learn the latest tips for data transformation, see live demos, and get your FME questions answered. Registration gives you access for all three days — sign up now to tune in to the talks you’re most interested in.
Course schedule - Day 2
Automating Everything – Wednesday, September 27, 8:00am – 10:00am PDT
8:00am – Bulk data processing
8:40am – Ultimate Real-Time: Monitor Anything, Update Anything
9:20am – FME in the Enterprise
AREA / Total area 349.6 m2
The complex, eclectic interior of these apartments brings together both ornate Baroque frames and mirrors and austere Gothic vaults and rose windows. A contrasting color scheme is combined with a softer palette, emphasizing the zoning of the space.
04 Accelerating DL inference with (Open)CAPI and posit numbers - Yutaka Kawai
This was presented by Louis Ledoux and Marc Casas at the OpenPOWER Summit EU 2019. The original is uploaded at:
https://static.sched.com/hosted_files/opeu19/1a/presentation_louis_ledoux_posit.pdf
Have you ever wondered how to speed up your code in Python? This presentation will show you how to start. I will begin with a guide on how to locate performance bottlenecks and then give you some tips on how to speed up your code. I would also like to discuss how to avoid premature optimization, as it may be 'the root of all evil' (at least according to D. Knuth).
High Performance Erlang - Pitfalls and Solutions - Yinghai Lu
Presented at Erlang Factory 2016, San Francisco, CA.
Erlang is widely used for building concurrent applications. However, when we push the performance of our Erlang-based application to handle millions of concurrent clients, some Erlang scalability issues begin to show, and some conventional programming paradigms of Erlang no longer hold. We would like to share some of these issues and how we address them. In addition, we share some of our experience on how to profile an Erlang application to identify bottlenecks.
We will take a deep look at some of the basic mechanisms of Erlang and show how they behave under high load and parallelism, including message delivery, process management, and shared data structures such as maps and ETS tables. We will demonstrate their limitations and propose techniques to alleviate the issues.
We will also share profiling techniques for finding those bottlenecks in Erlang applications across different levels, and techniques for writing highly performant Erlang applications.
In this deck from the GPU Technology Conference, Thorsten Kurth from Lawrence Berkeley National Laboratory and Josh Romero from NVIDIA present: Exascale Deep Learning for Climate Analytics.
"We'll discuss how we scaled the training of a single deep learning model to 27,360 V100 GPUs (4,560 nodes) on the OLCF Summit HPC System using the high-productivity TensorFlow framework. We discuss how the neural network was tweaked to achieve good performance on the NVIDIA Volta GPUs with Tensor Cores and what further optimizations were necessary to provide excellent scalability, including data input pipeline and communication optimizations, as well as gradient boosting for SGD-type solvers. Scalable deep learning becomes more and more important as datasets and deep learning models grow and become more complicated. This talk is targeted at deep learning practitioners who are interested in learning what optimizations are necessary for training their models efficiently at massive scale."
Watch the video: https://wp.me/p3RLHQ-kgT
Learn more: https://ml4sci.lbl.gov/home
and
https://www.nvidia.com/en-us/gtc/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Application Profiling at the HPCAC High Performance Center - inside-BigData.com
Pak Lui from the HPC Advisory Council presented this deck at the 2017 Stanford HPC Conference.
"To achieve good scalability performance on the HPC scientific applications typically involves good understanding of the workload though performing profile analysis, and comparing behaviors of using different hardware which pinpoint bottlenecks in different areas of the HPC cluster. In this session, a selection of HPC applications will be shown to demonstrate various methods of profiling and analysis to determine the bottleneck, and the effectiveness of the tuning to improve on the application performance from tests conducted at the HPC Advisory Council High Performance Center."
Watch the video presentation: http://wp.me/p3RLHQ-gpY
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNet - Eric Haibin Lin
Training large deep learning models like Mask R-CNN and BERT takes lots of time and compute resources. Using MXNet, the Amazon Web Services deep learning framework team has been working with NVIDIA to optimize many different areas to cut the training time from hours to minutes.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex - Apache Apex
Apache Apex is a next-gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks, and an easy-to-use API for developers to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance, and processing guarantees, its programming model, and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...EUDAT
Giuseppe will present the differences between high-performance and high-throughput applications. High-throughput computing (HTC) refers to computations where individual tasks do not need to interact while running. It differs from high-performance computing (HPC), where frequent and rapid exchanges of intermediate results are required to perform the computations. HPC codes are based on tightly coupled MPI, OpenMP, GPGPU, and hybrid programs and require low-latency interconnected nodes. HTC makes use of unreliable components, distributing the work out to every node and collecting results at the end of all parallel tasks.
Visit: https://www.eudat.eu/eudat-summer-school
This talk was given at Vizianagaram and attended by many engineering college faculty. I introduced developments in multi-core computers along with their architectural developments, and explained high-performance computing and where it is used. I also introduced the concept of pipelining, Amdahl's law, issues related to pipelining, and the MIPS architecture.
2. Why Analyze Performance?
• Improving performance on HPC systems has compelling economic and scientific rationales.
– Dave Bailey: the value of improving the performance of a single application that uses 5% of a machine's cycles by 20% over 10 years: $1,500,000
– The scientific benefit is probably much higher
• Goal: solve problems faster; solve larger problems
• Accurately state computational need
• Only that which can be measured can be improved
• The challenge is mapping the application to an increasingly complex system architecture – or set of architectures
3. Performance Analysis Issues
• Difficult process for real codes
• Many ways of measuring, reporting
• Very broad space: not just time at one size
– Fixed total problem size (memory per processor shrinks as concurrency grows): strong scaling
– Problem scaled up with processor count (same memory per processor; ideally fixed execution time): weak scaling
• A variety of pitfalls abound
– Must compare parallel performance to the best uniprocessor algorithm, not just the parallel program on 1 processor (unless it is the best)
– Be careful relying on any single number
• Amdahl's Law
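For reference, the standard statement of Amdahl's Law (the deck names it but does not spell out the formula): if a fraction f of the work parallelizes perfectly over P processors, the speedup is

S(P) = \frac{1}{(1 - f) + f/P}, \qquad \lim_{P \to \infty} S(P) = \frac{1}{1 - f}

so, for example, f = 0.95 caps the achievable speedup at 20x no matter how many processors are used.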
4. Performance Questions
• How can we tell if a program is performing well? Or isn't?
• If performance is not "good", how can we pinpoint why?
• How can we identify the causes?
• What can we do about it?
5. The multicore era
• Moore's law still extant
• Traditional sources of performance improvement ending
– Old trend: double the clock frequency every 18 months
– New trend: double the number of cores every 18 months
– Implications: flops cheap; communication and network bandwidth expensive in the future
6. Processor-DRAM Gap (latency)
• Memory hierarchies are getting deeper
– Processors get faster more quickly than memory
[Figure: CPU vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves ~60%/yr while DRAM improves ~7%/yr, so the processor-memory performance gap grows ~50% per year. Slide from Katherine Yelick's CS267 notes.]
7. Performance analysis tools
• Use of profilers to measure performance
• Approach (a minimal sketch follows below):
– Build an instrumented binary to monitor function groups of interest: MPI, OpenMP, PGAS, HWC, IO
– Run the instrumented binary
– Identify performance bottlenecks
– Suggest (and possibly implement) improvements to optimize the code
• Tools used
– IPM: low overhead; communication, flops, code regions
– CrayPAT: communication, flops, code regions, PGAS, variety of tracing options
• Overhead depends on the number of trace groups monitored
• Level of detail in a study depends on specifics: time available, difficulties presented by the code
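A minimal sketch of region-based instrumentation, using the same CrayPAT region API that appears on slide 16 (the region ID, label, and kernel placement are placeholders of mine, and it assumes the pat_api.h header that CrayPAT provides):

#include <pat_api.h> // CrayPAT instrumentation API (assumed available under CrayPAT)
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  // Bracket the code section of interest; the profiler then reports
  // time, flops, and communication for this region separately.
  PAT_region_begin(1, "compute_kernel");
  // ... kernel under study ...
  PAT_region_end(1);

  MPI_Finalize();
  return 0;
}

The instrumented binary is then run as usual, and the per-region counters are examined in the profiler's report.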
8. Performance checklist
• Scaling
– Application time, speed (flop/s)
– Double the concurrency: does speed double?
• Communication (vs computation)
• Load imbalance
– Check the cabinet for mammoths and mosquitoes
• Size and composition of communication
– Bandwidth bound? Latency bound?
– Collectives (large concurrency)
• Memory bandwidth sensitivity
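The "double the concurrency, does speed double?" check is simple arithmetic; a small helper along these lines (illustrative, not from the deck) makes it explicit:

#include <cstdio>

// Strong-scaling efficiency between two runs of the same problem:
// ideal scaling means doubling the cores halves the wall time.
double scaling_efficiency(int cores_a, double wall_a,
                          int cores_b, double wall_b) {
  double speedup = wall_a / wall_b;                      // measured speedup
  double ideal = static_cast<double>(cores_b) / cores_a; // ideal speedup
  return speedup / ideal;                                // 1.0 = perfect
}

int main() {
  // Example: going from 8,192 to 16,384 cores drops wall time 100 s -> 70 s.
  std::printf("efficiency = %.2f\n",
              scaling_efficiency(8192, 100.0, 16384, 70.0)); // prints 0.71
  return 0;
}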
9. BOUT++ scaling results (Elm-pb)
• Grid used: nx=2052, ny=512; set nxpe=256
[Plots: wall time and speed vs. concurrency, with "good" and "less good" regions marked. Beyond the "good" range the code does not scale, but communication does not increase; it stays stagnant.]
• Runtime increases, and performance decreases, because of other reasons
10. Communication is not the reason for the performance degradation
• Separate out the communication portion from the walltime, and compute speed = flops / (computation time × cores)
• This quantity should be constant under ideal scaling, and it would remain constant if communication were the only reason for the performance degradation
[Plot: compute speed vs. concurrency, with "good" and "less good" regions marked.]
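Written out explicitly (notation mine, not from the slides):

\text{compute speed per core} = \frac{F}{\left(T_{\text{wall}} - T_{\text{comm}}\right) \, P}

where F is the total flop count, T_wall - T_comm is the time spent computing, and P is the number of cores. The plot shows this quantity falling at high concurrency, so the computation itself, not communication, is degrading.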
11. MPI pie shows significance of Allreduce calls
• MPI_Allreduce calls form the bulk of the pie
• MPI average message size 5 kB, 74,000 calls
• Not quite entirely latency bound on Hopper (5 kB should be large enough)
• Might become a bottleneck after other issues are sorted out (communication is not yet a bottleneck)
[Pie chart of MPI time at 65,536 cores: MPI_Allreduce dominates, with MPI_Wait also shown.]
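For context, the kind of global reduction behind such calls, e.g. a dot product inside an implicit solver, looks like this (an illustrative sketch, not BOUT++ source):

#include <mpi.h>
#include <numeric>
#include <vector>

// Each rank reduces its local part, then a single MPI_Allreduce
// combines the partial sums across all ranks. At large concurrency
// the synchronization cost of such collectives grows.
double global_dot(const std::vector<double> &x, const std::vector<double> &y) {
  double local = std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}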
12. BOUT++: break up time in each kernel to check how it scales
• Breakup by time spent: Calc scales somewhat, but Inv and Solver do not scale
[Plot: time (log scale, 1-1000) vs. concurrency (4096 to 65536) for the Wall, Calc, Inv, Comm, and Solver kernels.]
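One way to obtain such a per-kernel breakup (a sketch of the general technique; BOUT++ has its own internal timers, so this is illustrative only) is to accumulate wall time around each kernel:

#include <mpi.h>
#include <map>
#include <string>

// Accumulated wall time per kernel; report the maximum across ranks,
// since the slowest rank sets the pace.
static std::map<std::string, double> kernel_time;

template <typename Kernel>
void timed(const std::string &name, Kernel &&kernel) {
  double t0 = MPI_Wtime();
  kernel();
  kernel_time[name] += MPI_Wtime() - t0;
}

// Usage inside the time loop (calc_rhs etc. are hypothetical names):
//   timed("Calc",   [&] { calc_rhs(); });
//   timed("Inv",    [&] { laplacian_inversion(); });
//   timed("Solver", [&] { advance_timestep(); });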
13. BOUT++ (elm) scaling summary
• Up to concurrency 8,192, the code scales nearly perfectly
• Two issues beyond 8,192:
– Performance decreases (flop/time decreases)
– Wall time increases
• MPI is not the reason for the performance degradation
• Computational performance decreases
14. Issues with increasing flop count
• Steady increase in flop count (number of operations)
• Conjecture: extra computations in the ghost cells (and more cycles spent doing these)
• The valid region (excluding the ghost region) does the same amount of work
• Grid points per processor decrease with increasing concurrency (quantified in the sketch below)
[Diagram: with increasing concurrency, local subdomains shrink, e.g. from 8x8 to 4x4 grid points per processor, so the ghost-cell fraction grows.]
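The conjecture is easy to model: with a fixed ghost width, the ratio of total (valid + ghost) points to valid points grows as the local subdomain shrinks. A small calculation of mine, not from the deck:

#include <cstdio>

// Ratio of (valid + ghost) grid points to valid points for a local
// nx x ny subdomain with a ghost width of g cells on each side.
double ghost_work_factor(int nx, int ny, int g = 2) {
  double total = static_cast<double>(nx + 2 * g) * (ny + 2 * g);
  return total / (static_cast<double>(nx) * ny);
}

int main() {
  // Halving the subdomain in each direction inflates the ghost overhead:
  std::printf("32x32: %.2f\n", ghost_work_factor(32, 32)); // ~1.27
  std::printf("16x16: %.2f\n", ghost_work_factor(16, 16)); // ~1.56
  std::printf(" 8x8 : %.2f\n", ghost_work_factor(8, 8));   // ~2.25
  return 0;
}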
15. BOUT++: Expt-LAPD-drift
• Experiment: turbulence in an annulus (LAPD)
• Investigate the source of extra computations in the code [more work done leads to a greater flop (not 'flop/s') count]
• The code was annotated to give flop counts with CrayPAT
• A given portion is annotated, and the flop count in that code section is compared across concurrencies
• Conjecture: the increase in flops in this section is because of ghost cell computations (the arrays consist of valid + ghost regions)
16. Annotated code region

PAT_region_begin(24, "phys_run-14");
nu = nu_hat * Nit / (Tet^1.5);
mu_i = mui_hat * (Tit^2.5) / Nit;
kapa_Te = 3.2*(1./fmei)*(wci/nueix)*(Tet^2.5);
kapa_Ti = 3.9*(wci/nuiix)*(Tit^2.5);
// Calculate pressures
pei = (Tet+Tit)*Nit;
pe = Tet*Nit;
PAT_region_end(24);

• Quantities Tet, Tit, Nit, etc. are declared as follows (these are whole-field operations; in BOUT++, ^ is the overloaded power operator):

// 3D evolving fields
Field3D rho, ni, ajpar, te;
// Derived 3D variables
Field3D phi, Apar, Ve, jpar;
// Non-linear coefficients
Field3D nu, mu_i, kapa_Te, kapa_Ti;
// 3D total values
Field3D Nit, Tit, Tet, Vit, phit, VEt, ajp0;
// pressures
Field3D pei, pe;
Field2D pei0, pe0;

• The variables are defined to comprise the valid region + ghost cells
• The extra computation in the bracketed region can thus be measured
17. Computation in guard cells
• Grid: 204x128; ghost cells: MXG, MYG = 2
[Diagram: decomposition at 6400 processors (extreme case: 2 inner grid points per subdomain) and at 3200 processors (next-to-extreme), with subdomain widths w and w/2. The extreme case does 3 times the work, as expected; the next-to-extreme case does twice as much work, as expected.]
18. Observations: validation of the ghost cell conjecture

Concurrency | FPO in given region (CrayPat) | Factor predicted | Actual
6400        | 51,512,160,000                | 1 (reference)    | --
3200        | 34,341,360,000                | 2/3              | 2/3
1600        | 25,755,960,000                | 1/2              | 1/2
800         | 21,463,260,000                | 1.25/3           | 1.25/3

• Predicted values match the measured flop counts exactly, in terms of ratios
• The hypothesis seems to be correct
20. BOUT++ results: improve INV, PVODE
• Summary
– Scaling degrades beyond 8,192 procs
– Scaling efficiency is very poor at 32,768 procs
– Issues:
» Extra computations in ghost cells: put in place to reduce communication; need to check whether the extra computations performed are worth it
» Performance degrades because the Laplacian inversion does not scale, the PVODE solver does not scale, and MPI collectives grow
» The surface-to-volume ratio tested may not be the best, but the issues remain
– MPI: collectives grow with concurrency, but the Laplacian inversion and the PVODE solver seem to be culpable in equal measure
• Recommendations: need to check whether replacing ghost cell computation with communication improves runtime (or not); need to improve the Laplacian inversion and PVODE kernels
21. Investigation of collectives in time-stepping algorithms (PETSc)
• Time-stepping algorithms are suspected to have collectives
– First step: check the growth of collectives in the time-steppers
– Hook in with PETSc and turn on its profiling layer (-log_summary, sketched below)
– Examine % collectives vis-a-vis runtime
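A sketch of what that hook can look like (the stage name is mine; -log_summary was PETSc's profiling flag at the time, since renamed -log_view):

#include <petscsys.h>

// Run the application with the option:  -log_summary
// PETSc then prints, per stage, the time spent in MPI reductions and
// other collectives vis-a-vis total runtime.
int main(int argc, char **argv) {
  PetscInitialize(&argc, &argv, nullptr, nullptr);

  PetscLogStage stage;
  PetscLogStageRegister("TimeStepping", &stage); // label the phase of interest
  PetscLogStagePush(stage);
  // ... time-stepping loop ...
  PetscLogStagePop();

  PetscFinalize();
  return 0;
}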
23. Overall conclusions and future directions(?)
• BOUT++ scales remarkably well for a strong scaling test
– Performance degradation is 'not just' because of the increase in the surface-to-volume ratio
• Communication increases at higher concurrency and could constitute the ultimate scaling bottleneck
• Extra ghost cell computations
– Put in there for a reason (to lessen communication), but they manifest as extra time spent in computation
– Might be good overhead when flops become cheaper
• Bandwidth sensitivity is not an issue in BOUT++
• Two-dimensional domain decomposition
– Possibly add OpenMP in the third direction? (see the sketch below)
• Collectives might play a dominant role in time-steppers
– Find ways of minimizing this
– What is the effect of putting in preconditioners?
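A minimal sketch of what "OpenMP in the third direction" could mean: keep the MPI decomposition in x and y, and thread the undecomposed z loop on each rank (illustrative only; names and data layout are assumptions, not BOUT++ internals):

#include <cstddef>
#include <vector>

// The MPI decomposition splits x and y across ranks; z stays local,
// so it can be threaded with no extra communication or ghost cells.
void apply_pointwise(std::vector<double> &f, int nx, int ny, int nz) {
  #pragma omp parallel for
  for (int k = 0; k < nz; ++k) {   // third (undecomposed) direction
    for (int i = 0; i < nx; ++i) {
      for (int j = 0; j < ny; ++j) {
        std::size_t idx = (static_cast<std::size_t>(i) * ny + j) * nz + k;
        f[idx] *= 2.0;             // stand-in for the real per-point update
      }
    }
  }
}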