In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
Blue Waters and Resource Management - Now and in the Future
1. Blue Waters and Resource Management –
Now and in the Future
Dr. William Kramer
National Center for Supercomputing Applications, University of Illinois
2. Science & Engineering on Blue Waters
Blue Waters will enable advances in a broad range of science and engineering disciplines. Examples include: Molecular Science, Weather & Climate Forecasting, Earth Science, Astro*, and Health.
3. What Have I Been Doing?
4. Science Areas, Teams, and Codes
Each area's codes were characterized against nine method classes: Structured Grids, Unstructured Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, and Significant I/O.

Science Area | Teams | Codes | Method classes used
Climate and Weather | 3 | CESM, GCRM, CM1/WRF, HOMME | 5 of 9
Plasmas/Magnetosphere | 2 | H3D(M), VPIC, OSIRIS, Magtail/UPIC | 4 of 9
Stellar Atmospheres and Supernovae | 5 | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS | 6 of 9
Cosmology | 2 | Enzo, pGADGET | 3 of 9
Combustion/Turbulence | 2 | PSDNS, DISTUF | 2 of 9
General Relativity | 2 | Cactus, Harm3D, LazEV | 2 of 9
Molecular Dynamics | 4 | AMBER, Gromacs, NAMD, LAMMPS | 3 of 9
Quantum Chemistry | 2 | SIAL, GAMESS, NWChem | 5 of 9
Material Science | 3 | NEMOS, OMEN, GW, QMCPACK | 4 of 9
Earthquakes/Seismology | 2 | AWP-ODC, HERCULES, PLSQR, SPECFEM3D | 4 of 9
Quantum Chromodynamics | 1 | Chroma, MILC, USQCD | 5 of 9
Social Networks | 1 | EPISIMDEMICS | –
Evolution | 1 | Eve | –
Engineering/System of Systems | 1 | GRIPS, Revisit | 1 of 9
Computer Science | 1 | – | 5 of 9
5. Blue Waters Computing System
[Architecture diagram. Labels: Sonexion: 26 usable PB at >1 TB/sec; 100 GB/sec; 10/40/100 Gb Ethernet switch; Spectra Logic: 300 usable PB at 120+ Gb/sec; 100-300 Gbps WAN; IB switch; External Servers; Aggregate memory: 1.5 PB.]
6. Blue Waters Storage and Networking Detail
[Architecture diagram. Labels: 40 GbE; FDR IB; 10 GbE; QDR IB; Cray HSN; 1.2 PB usable disk at 1,200 GB/s; 28 Dell 720 IE servers; 4 Dell esLogin; online disk >25 PB (/home, /project, /scratch) at 100 GB/s; LNET(s); rSIP GW; Network GW; FC8; 300 GB/s; protocols: TCP/IP (10 GbE), SCSI (FCP), GridFTP (TCP/IP), LNET; 380 PB raw tape at 55 GB/s; 50 Dell 720 near-line servers; 100 GB/s; 40 GbE switch; 440 Gb/s Ethernet from site network; core FDR/QDR IB; Extreme switches; LAN/WAN.]
All storage sizes are given as the amount usable. Rates are always usable/measured sustained rates.
8. BW Focus on Sustained Performance
• Blue Waters and NSF are focusing on sustained performance in a way few have before.
• Sustained performance is the computer's useful, consistent performance on the broad range of applications that scientists and engineers use every day.
• Time to solution for a given amount of work is the important metric – not hardware ops/s.
• Sustained performance (and therefore the tests for it) includes the time to read data and write the results.
• NSF's Track-1 call emphasized sustained performance, demonstrated on a collection of application benchmarks (application + problem set).
• Not just simplistic metrics (e.g., HP Linpack).
• Applications include both petascale applications (which effectively use the full machine, solving scalability problems for both compute and I/O) and applications that use a large fraction of the system.
• The Blue Waters project focus is on delivering sustained petascale performance to computational and data-focused applications.
• Develop tools, techniques, and samples that exploit all parts of the system.
• Explore new tools, programming models, and libraries to help applications get the most from the system.
• By the Sustained Petascale Performance metric, Blue Waters sustained >1.3 petaflops across 12 different time-to-solution application tests (a sketch of this style of composite metric follows below).
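That composite number is, in spirit, a geometric mean of per-application sustained rates, where each rate is a counted reference operation total divided by the measured time to solution, I/O included. Here is a minimal Python sketch of that style of calculation; the FLOP counts and times below are entirely hypothetical, not the actual SPP inputs:

```python
from math import prod  # Python 3.8+

def sustained_petaflops(flop_counts, times_s):
    """Per-application sustained rate: reference FLOP count divided by
    measured wall-clock time to solution (including reading inputs and
    writing results), expressed in petaflop/s."""
    return [f / t / 1e15 for f, t in zip(flop_counts, times_s)]

def composite_metric(rates):
    """Composite sustained performance as the geometric mean of the
    per-application rates, so no single test dominates."""
    return prod(rates) ** (1.0 / len(rates))

# Hypothetical values for three benchmark applications.
flops = [3.1e19, 5.4e19, 2.2e19]   # reference operation counts
times = [2.1e4, 3.8e4, 1.6e4]      # measured times to solution, seconds
print(f"{composite_metric(sustained_petaflops(flops, times)):.2f} PF/s")
```

The geometric mean rewards balanced performance: a machine cannot hide a slow application behind one fast one.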
9. View from the Blue Waters Portal
As of April 2, 2013, Blue Waters has delivered over 1.3 billion core-hours to S&E teams.
10. Usage Breakdown – Jan 1 to Mar 26, 2013
• Torque log accounting (NCSA, Mike Showerman) – a log-parsing sketch follows the chart below
[Chart: Accumulated XE node-hours, January 1 to March 26, 2013. X-axis: XE job size in nodes (1 node = 32 cores), power-of-2 bins from 1 to 32,768; y-axis: usage in millions of node-hours, 0 to 35. Jobs of >65,536 cores and >262,144 cores are called out.]
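A breakdown like this can be pulled straight from the Torque accounting logs. A minimal sketch, assuming the standard semicolon-delimited Torque accounting record format with "E" (job end) entries carrying Resource_List.nodect and resources_used.walltime; the log path is hypothetical:

```python
import math
from collections import defaultdict

def hms_to_hours(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

bins = defaultdict(float)  # power-of-2 node-count bin -> node-hours

with open("/var/spool/torque/server_priv/accounting/20130326") as log:
    for line in log:
        date, rectype, jobid, attrs = line.rstrip("\n").split(";", 3)
        if rectype != "E":  # only completed-job records carry usage
            continue
        kv = dict(a.split("=", 1) for a in attrs.split() if "=" in a)
        nodes = int(kv.get("Resource_List.nodect", 0))
        wall = kv.get("resources_used.walltime")
        if nodes and wall:
            # Round job size up to a power-of-2 bin, as in the chart.
            bin_ = 2 ** math.ceil(math.log2(nodes))
            bins[bin_] += nodes * hms_to_hours(wall)

for size in sorted(bins):
    print(f"{size:6d} nodes: {bins[size] / 1e6:8.3f} M node-hours")
```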
11. OBSERVATIONS AND THOUGHTS: FLEXIBILITY IS THE WORD OF THE NEXT DECADE
What is Blue Waters already telling us about future @Scale systems?
12. Observation 1: Topology Matters
• Much of the work to improve the performance of early applications was understanding and tuning for layout/topology, even on dedicated systems
• Factors of almost 10 were seen for some applications
• Nvidia's Linpack results are mostly due to topology-aware work layout
• Done with hand tuning, special node selection, etc.
• Needs to become commonplace to really benefit users (a toy placement-scoring sketch follows below)
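To make topology awareness routine rather than hand-tuned, a scheduler needs a cheap way to score candidate allocations. A minimal sketch, assuming a 3D torus in the spirit of the Gemini network and using the maximum pairwise hop count as the score; the dimensions, coordinates, and scoring rule are illustrative, not Blue Waters' actual placement logic:

```python
from itertools import combinations

DIMS = (24, 24, 24)  # hypothetical torus dimensions

def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between two (x, y, z) coordinates on a torus:
    each dimension can wrap around, then distances are summed."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def allocation_diameter(coords):
    """Worst-case pairwise distance: a proxy for how much a badly
    placed node can stretch an application's communication."""
    return max(torus_hops(a, b) for a, b in combinations(coords, 2))

def best_allocation(candidates):
    """Prefer the candidate allocation with the smallest diameter."""
    return min(candidates, key=allocation_diameter)

compact   = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
straggler = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (12, 12, 12)]
print(allocation_diameter(compact), allocation_diameter(straggler))
print(best_allocation([compact, straggler]) is compact)  # True
```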
13. Topology Matters
• Even very small changes can have dramatic and unexpected consequences
• Example – having just 1 down Gemini out of 6,114 can slow an application by >20%
• 0.0156% of components being unavailable can extend an application's run time by >20% if those components just happen to be in the wrong place
• P3DNS – 6,114 nodes
14. Topology
• 1 poorly placed node out of 4,116 (0.02%) can slow an application by >30%
• On a dedicated system!
• It is hard to get optimal topology assignments, especially in non-dedicated use, but it should be easy to avoid really detrimental ones (see the outlier-check sketch below)
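Avoiding the really bad cases is much cheaper than finding the optimum: before launch, flag any node that sits far from the rest of the allocation and swap it for a spare rather than re-solving the whole placement. A minimal sketch in the same torus setting as above; the threshold, heuristic, and spare pool are hypothetical:

```python
import statistics

DIMS = (24, 24, 24)  # hypothetical torus dimensions

def torus_hops(a, b, dims=DIMS):
    # Same torus metric as in the earlier placement sketch.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def placement_outliers(coords, threshold=6):
    """Flag nodes whose median distance to the rest of the allocation
    exceeds a threshold: a cheap detrimental-placement check."""
    flagged = []
    for i, c in enumerate(coords):
        dists = [torus_hops(c, o) for j, o in enumerate(coords) if j != i]
        if statistics.median(dists) > threshold:
            flagged.append(i)
    return flagged

def swap_outliers(coords, pool):
    """Replace each flagged node with the pool node closest to the
    allocation's first node: crude, but it avoids the worst placements."""
    coords = list(coords)
    for i in placement_outliers(coords):
        coords[i] = min(pool, key=lambda p: torus_hops(p, coords[0]))
    return coords

alloc = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (12, 12, 12)]
print(swap_outliers(alloc, pool=[(1, 1, 0), (2, 0, 0)]))
```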
15. Topology Awareness Needed for
All Types of Interconnects
• Tori
• Trees
• Hypercubes
• Direct Connect & Dragonflies
16. Performance and Scalability through
Flexibility
• It is harder for applications to scale in the face of limited bandwidth
• BW works with science teams and technology providers to:
• Understand and develop better process-to-node mapping analysis to determine behavior and usage patterns (a hop-bytes sketch follows below)
• Better instrument what the network is really doing
• Build topology-aware resource and systems management that enables and rewards topology-aware applications
• Malleability – for applications and systems
• Understanding the topology given and maximizing effectiveness
• Being able to express a desired topology based on the algorithms
• Middleware support
• Even if applications scale, consistency becomes an increasing issue for systems and applications
• This will only get worse in future systems
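A standard way to quantify process-to-node mapping quality is hop-bytes: each rank pair's communication volume weighted by the network distance between their nodes. A minimal sketch in the same torus setting as the earlier placement examples; the communication matrix and mappings are hypothetical:

```python
DIMS = (24, 24, 24)  # hypothetical torus dimensions

def torus_hops(a, b, dims=DIMS):
    # Same torus metric as in the earlier placement sketches.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def hop_bytes(comm, mapping):
    """comm[i][j] = bytes rank i sends to rank j; mapping[i] = torus
    coordinate of the node hosting rank i. Lower totals mean the
    heavy-traffic pairs sit closer together on the network."""
    n = len(comm)
    return sum(comm[i][j] * torus_hops(mapping[i], mapping[j])
               for i in range(n) for j in range(n) if i != j)

# Ranks 0 and 1 exchange the most data; compare two mappings.
comm = [[0, 900, 10], [900, 0, 10], [10, 10, 0]]
near = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
far  = [(0, 0, 0), (10, 10, 10), (0, 1, 0)]
print(hop_bytes(comm, near), hop_bytes(comm, far))  # near scores lower
```

Logging a metric like this per job is one way to "reward" topology-aware applications: the scheduler can see which jobs would actually benefit from careful placement.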
17. Flexible Resiliency Modes
• Run again
• Defensive I/O (traditional checkpoint/restart)
• Expensive
• Extra overhead for application and system
• Intrusive
• I/O infrastructure is shared across all jobs
• New C/R (node memory copy, SSD, journaling, …)
• Spare nodes in job requests to rebalance work after a single point of failure
• Wastes resources
• Runtimes do not support this well yet (but can do it)
• Redistribute work within the remaining nodes
• Charm++, some MPI implementations
• Takes longer
• Add spare nodes from a system pool to the job
• Job scheduler, resource manager, and runtime all have to be made more flexible (a work-redistribution sketch follows below)
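At its simplest, the "redistribute work within the remaining nodes" mode is a re-partition of the job's work units over the survivors after each failure, falling back to a spare node when the pool has one. A minimal sketch of that bookkeeping; the data structures are illustrative, and frameworks such as Charm++ do the real thing with object migration:

```python
def partition(work_units, nodes):
    """Round-robin the job's work units over the current node list."""
    plan = {n: [] for n in nodes}
    for i, unit in enumerate(work_units):
        plan[nodes[i % len(nodes)]].append(unit)
    return plan

def on_node_failure(plan, failed, spares=None):
    """Drop the failed node; pull a spare from the pool if one is
    available, otherwise spread its work over the survivors."""
    orphaned = plan.pop(failed)
    if spares:
        plan[spares.pop()] = orphaned           # spare absorbs the work
    else:
        survivors = list(plan)
        for i, unit in enumerate(orphaned):     # survivors share the work
            plan[survivors[i % len(survivors)]].append(unit)
    return plan

plan = partition(range(12), ["n0", "n1", "n2", "n3"])
plan = on_node_failure(plan, "n1", spares=["spare0"])
print({n: len(units) for n, units in plan.items()})
```

The spare-node branch keeps the run time flat at the cost of idle reserved nodes; the survivor branch wastes nothing but makes the job take longer, which is exactly the trade-off the slide lists.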
18. Observation 2: Resiliency Flexibility Critical
• Migrate from checkpointing to application resiliency
• Traditional system-based checkpoint/restart is no longer viable
• Defensive I/O per application is inefficient but is the current state of the art
• Better application resiliency requires improvements in both systems and applications
• Several teams are moving to new frameworks (e.g., Charm++) to improve resiliency
• MPI is trying to add better features for resiliency
19. Resiliency Flexibility
• Application-based resiliency
• Multiple layers of software and hardware have to coordinate information and reaction
• Analysis and understanding are needed before action
• Correct and actionable messages need to flow up and down the stack to the applications so they can take the proper action with correct information (a fault-event sketch follows below)
• Application situational awareness – applications need to understand their circumstances and take action
• Flexible resource provisioning is needed in real time
• Replacing failed nodes dynamically from a system pool of nodes
• Interaction with other constraints, so that sub-optimization does not adversely impact overall system optimization
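One way to get actionable messages flowing up the stack is a simple fault-event channel: the system side publishes structured events, and the application registers handlers that decide whether to shrink, migrate, or checkpoint. A minimal sketch; the event fields, handler protocol, and node name are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FaultEvent:
    component: str   # e.g. a node or link identifier
    severity: str    # "degraded" or "failed"
    detail: str = ""

@dataclass
class FaultChannel:
    """The system side publishes; the application side subscribes."""
    handlers: list = field(default_factory=list)

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)

def app_handler(event):
    # The application chooses its own reaction per event severity.
    if event.severity == "failed":
        print(f"{event.component} failed: redistributing work")
    else:
        print(f"{event.component} degraded: noting slow ranks")

chan = FaultChannel()
chan.subscribe(app_handler)
chan.publish(FaultEvent("nid00042", "failed", "heartbeat lost"))
```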
20. The Chicken or the Egg
• Applications cannot take advantage of features the system does not provide
• So they do the best they can with guesses
• Technology providers do not provide features because they say applications do not use them
My message is:
We cannot brute-force our way to future @scale systems and applications any longer.
21. Many Other Observations – Other Presentations
• Storage and I/O: significant challenges
• System software quality and resiliency
• Testing for function, feature, and performance at scale
• Information gathering for the system
• Application methods
• Measuring real time-to-solution performance
• System SW performance at scale
• Heterogeneous components
• Application consistency
• Efficiency
• Energy, TCO, utilization
• S&E team productivity
• …
22. Summary
• Blue Waters is delivering on its commitment to sustained performance for the Nation on computational and data-focused @Scale problems.
• We appreciate the tremendous efforts and support of all our technology providers and science team partners.
• I am pleased to see Adaptive and Cray seriously addressing topology awareness issues – to meet BW-specific needs and hopefully beyond.
• I am pleased Cray made initial improvements to enable application resiliency, but Adaptive, Cray, MPI, and other technology providers need to do much more to solve the problem.
• I am very encouraged that application teams are willing (and eager) to implement flexibility in their codes if they have options.
• We need more commonality across technology providers and implementations.
• BW is an excellent platform for studying these issues as well as providing an unprecedented S&E resource.
• Stay tuned for amazing results from BW.
23. Acknowledgements
This work is part of the Blue Waters sustained-petascale computing project,
which is supported by the National Science Foundation (award number OCI
07-25070) and the state of Illinois. Blue Waters is a joint effort of the
University of Illinois at Urbana-Champaign, its National Center for
Supercomputing Applications, Cray, and the Great Lakes Consortium for
Petascale Computation.
The work described here was achieved through the efforts of many others on different teams.