In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
Blue Waters and Resource Management - Now and in the Future
1. Blue Waters and Resource Management –
Now and in the Future
Dr. William Kramer
National Center for Supercomputing Applications, University of Illinois
2. Science & Engineering on Blue Waters
Blue Waters will enable advances in a broad range of science and engineering disciplines. Examples include: Molecular Science, Weather & Climate Forecasting, Earth Science, Astro*, and Health.
3. What Have I Been Doing?
4. Science Areas, Teams, and Codes
Each area's codes were characterized against nine method classes: Structured Grids, Unstructured Grids, Dense Matrix, Sparse Matrix, N-Body, Monte Carlo, FFT, PIC, and Significant I/O.

Science Area | Teams | Codes | Method classes used
Climate and Weather | 3 | CESM, GCRM, CM1/WRF, HOMME | 5 of 9
Plasmas/Magnetosphere | 2 | H3D(M), VPIC, OSIRIS, Magtail/UPIC | 4 of 9
Stellar Atmospheres and Supernovae | 5 | PPM, MAESTRO, CASTRO, SEDONA, ChaNGa, MS-FLUKSS | 6 of 9
Cosmology | 2 | Enzo, pGADGET | 3 of 9
Combustion/Turbulence | 2 | PSDNS, DISTUF | 2 of 9
General Relativity | 2 | Cactus, Harm3D, LazEV | 2 of 9
Molecular Dynamics | 4 | AMBER, Gromacs, NAMD, LAMMPS | 3 of 9
Quantum Chemistry | 2 | SIAL, GAMESS, NWChem | 5 of 9
Material Science | 3 | NEMOS, OMEN, GW, QMCPACK | 4 of 9
Earthquakes/Seismology | 2 | AWP-ODC, HERCULES, PLSQR, SPECFEM3D | 4 of 9
Quantum Chromodynamics | 1 | Chroma, MILC, USQCD | 5 of 9
Social Networks | 1 | EPISIMDEMICS | –
Evolution | 1 | Eve | –
Engineering/System of Systems | 1 | GRIPS, Revisit | 1 of 9
Computer Science | 1 | – | 5 of 9
5. Blue Waters Computing System
[Architecture diagram. Labels: Sonexion: 26 usable PB at >1 TB/sec; 100 GB/sec; 10/40/100 Gb Ethernet switch; Spectra Logic: 300 usable PB at 120+ Gb/sec; 100-300 Gbps WAN; IB switch; External Servers; Aggregate memory: 1.5 PB.]
6. Blue Waters Storage and Networking Detail
[Architecture diagram. Labels: 40 GbE; FDR IB; 10 GbE; QDR IB; Cray HSN; 1.2 PB usable disk at 1,200 GB/s; 28 Dell 720 IE servers; 4 Dell esLogin; online disk >25 PB (/home, /project, /scratch) at 100 GB/s; LNET(s); rSIP GW; Network GW; FC8; 300 GB/s; protocols: TCP/IP (10 GbE), SCSI (FCP), GridFTP (TCP/IP), LNET; 380 PB raw tape at 55 GB/s; 50 Dell 720 near-line servers; 100 GB/s; 40 GbE switch; 440 Gb/s Ethernet from site network; core FDR/QDR IB; Extreme switches; LAN/WAN.]
All storage sizes are given as the amount usable. Rates are always usable/measured sustained rates.
8. BW Focus on Sustained Performance
• Blue Waters and NSF are focusing on sustained performance in a way few have before.
• Sustained performance is the computer's useful, consistent performance on the broad range of applications that scientists and engineers use every day.
• Time to solution for a given amount of work is the important metric – not hardware ops/s.
• Sustained performance (and therefore the tests for it) includes the time to read data and write the results.
• NSF's Track-1 call emphasized sustained performance, demonstrated on a collection of application benchmarks (application + problem set).
• Not just simplistic metrics (e.g., HP Linpack).
• Applications include both petascale applications (which effectively use the full machine, solving scalability problems for both compute and I/O) and applications that use a large fraction of the system.
• The Blue Waters project focus is on delivering sustained petascale performance to computational and data-focused applications.
• Develop tools, techniques, and samples that exploit all parts of the system.
• Explore new tools, programming models, and libraries to help applications get the most from the system.
• By the Sustained Petascale Performance metric, Blue Waters sustained >1.3 petaflops across 12 different time-to-solution application tests (a sketch of this style of composite metric follows below).
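That composite number is, in spirit, a geometric mean of per-application sustained rates, where each rate is a counted reference operation total divided by the measured time to solution, I/O included. Here is a minimal Python sketch of that style of calculation; the FLOP counts and times below are entirely hypothetical, not the actual SPP inputs:

```python
from math import prod  # Python 3.8+

def sustained_petaflops(flop_counts, times_s):
    """Per-application sustained rate: reference FLOP count divided by
    measured wall-clock time to solution (including reading inputs and
    writing results), expressed in petaflop/s."""
    return [f / t / 1e15 for f, t in zip(flop_counts, times_s)]

def composite_metric(rates):
    """Composite sustained performance as the geometric mean of the
    per-application rates, so no single test dominates."""
    return prod(rates) ** (1.0 / len(rates))

# Hypothetical values for three benchmark applications.
flops = [3.1e19, 5.4e19, 2.2e19]   # reference operation counts
times = [2.1e4, 3.8e4, 1.6e4]      # measured times to solution, seconds
print(f"{composite_metric(sustained_petaflops(flops, times)):.2f} PF/s")
```

The geometric mean rewards balanced performance: a machine cannot hide a slow application behind one fast one.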
9. View from the Blue Waters Portal
As of April 2, 2013, Blue Waters has delivered over 1.3 billion core-hours to S&E teams.
10. Usage Breakdown – Jan 1 to Mar 26, 2013
• Torque log accounting (NCSA, Mike Showerman) – a log-parsing sketch follows the chart below
[Chart: Accumulated XE node-hours, January 1 to March 26, 2013. X-axis: XE job size in nodes (1 node = 32 cores), power-of-2 bins from 1 to 32,768; y-axis: usage in millions of node-hours, 0 to 35. Jobs of >65,536 cores and >262,144 cores are called out.]
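A breakdown like this can be pulled straight from the Torque accounting logs. A minimal sketch, assuming the standard semicolon-delimited Torque accounting record format with "E" (job end) entries carrying Resource_List.nodect and resources_used.walltime; the log path is hypothetical:

```python
import math
from collections import defaultdict

def hms_to_hours(hms):
    h, m, s = (int(x) for x in hms.split(":"))
    return h + m / 60 + s / 3600

bins = defaultdict(float)  # power-of-2 node-count bin -> node-hours

with open("/var/spool/torque/server_priv/accounting/20130326") as log:
    for line in log:
        date, rectype, jobid, attrs = line.rstrip("\n").split(";", 3)
        if rectype != "E":  # only completed-job records carry usage
            continue
        kv = dict(a.split("=", 1) for a in attrs.split() if "=" in a)
        nodes = int(kv.get("Resource_List.nodect", 0))
        wall = kv.get("resources_used.walltime")
        if nodes and wall:
            # Round job size up to a power-of-2 bin, as in the chart.
            bin_ = 2 ** math.ceil(math.log2(nodes))
            bins[bin_] += nodes * hms_to_hours(wall)

for size in sorted(bins):
    print(f"{size:6d} nodes: {bins[size] / 1e6:8.3f} M node-hours")
```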
11. OBSERVATIONS AND THOUGHTS: FLEXIBILITY IS THE WORD OF THE NEXT DECADE
What is Blue Waters already telling us about future @Scale systems?
12. Observation 1: Topology Matters
• Much of the work to improve the performance of early applications was understanding and tuning for layout/topology, even on dedicated systems
• Factors of almost 10 were seen for some applications
• Nvidia's Linpack results are mostly due to topology-aware work layout
• Done with hand tuning, special node selection, etc.
• Needs to become commonplace to really benefit users (a toy placement-scoring sketch follows below)
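To make topology awareness routine rather than hand-tuned, a scheduler needs a cheap way to score candidate allocations. A minimal sketch, assuming a 3D torus in the spirit of the Gemini network and using the maximum pairwise hop count as the score; the dimensions, coordinates, and scoring rule are illustrative, not Blue Waters' actual placement logic:

```python
from itertools import combinations

DIMS = (24, 24, 24)  # hypothetical torus dimensions

def torus_hops(a, b, dims=DIMS):
    """Minimum hop count between two (x, y, z) coordinates on a torus:
    each dimension can wrap around, then distances are summed."""
    return sum(min(abs(x - y), d - abs(x - y))
               for x, y, d in zip(a, b, dims))

def allocation_diameter(coords):
    """Worst-case pairwise distance: a proxy for how much a badly
    placed node can stretch an application's communication."""
    return max(torus_hops(a, b) for a, b in combinations(coords, 2))

def best_allocation(candidates):
    """Prefer the candidate allocation with the smallest diameter."""
    return min(candidates, key=allocation_diameter)

compact   = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
straggler = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (12, 12, 12)]
print(allocation_diameter(compact), allocation_diameter(straggler))
print(best_allocation([compact, straggler]) is compact)  # True
```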
13. Topology Matters
• Even very small changes can have dramatic and unexpected consequences
• Example – having just 1 down Gemini out of 6,114 can slow an application by >20%
• 0.0156% of components being unavailable can extend an application's run time by >20% if those components just happen to be in the wrong place
• P3DNS – 6,114 nodes
14. Topology
• 1 poorly placed node out of 4,116 (0.02%) can slow an application by >30%
• On a dedicated system!
• It is hard to get optimal topology assignments, especially in non-dedicated use, but it should be easy to avoid really detrimental ones (see the outlier-check sketch below)
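Avoiding the really bad cases is much cheaper than finding the optimum: before launch, flag any node that sits far from the rest of the allocation and swap it for a spare rather than re-solving the whole placement. A minimal sketch in the same torus setting as above; the threshold, heuristic, and spare pool are hypothetical:

```python
import statistics

DIMS = (24, 24, 24)  # hypothetical torus dimensions

def torus_hops(a, b, dims=DIMS):
    # Same torus metric as in the earlier placement sketch.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def placement_outliers(coords, threshold=6):
    """Flag nodes whose median distance to the rest of the allocation
    exceeds a threshold: a cheap detrimental-placement check."""
    flagged = []
    for i, c in enumerate(coords):
        dists = [torus_hops(c, o) for j, o in enumerate(coords) if j != i]
        if statistics.median(dists) > threshold:
            flagged.append(i)
    return flagged

def swap_outliers(coords, pool):
    """Replace each flagged node with the pool node closest to the
    allocation's first node: crude, but it avoids the worst placements."""
    coords = list(coords)
    for i in placement_outliers(coords):
        coords[i] = min(pool, key=lambda p: torus_hops(p, coords[0]))
    return coords

alloc = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (12, 12, 12)]
print(swap_outliers(alloc, pool=[(1, 1, 0), (2, 0, 0)]))
```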
15. Topology Awareness Needed for
All Types of Interconnects
• Tori
• Trees
• Hypercubes
• Direct Connect & Dragonflies
16. Performance and Scalability through
Flexibility
• It is harder for applications to scale in the face of limited bandwidth
• BW works with science teams and technology providers to:
• Understand and develop better process-to-node mapping analysis to determine behavior and usage patterns (a hop-bytes sketch follows below)
• Better instrument what the network is really doing
• Build topology-aware resource and systems management that enables and rewards topology-aware applications
• Malleability – for applications and systems
• Understanding the topology given and maximizing effectiveness
• Being able to express a desired topology based on the algorithms
• Middleware support
• Even if applications scale, consistency becomes an increasing issue for systems and applications
• This will only get worse in future systems
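A standard way to quantify process-to-node mapping quality is hop-bytes: each rank pair's communication volume weighted by the network distance between their nodes. A minimal sketch in the same torus setting as the earlier placement examples; the communication matrix and mappings are hypothetical:

```python
DIMS = (24, 24, 24)  # hypothetical torus dimensions

def torus_hops(a, b, dims=DIMS):
    # Same torus metric as in the earlier placement sketches.
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def hop_bytes(comm, mapping):
    """comm[i][j] = bytes rank i sends to rank j; mapping[i] = torus
    coordinate of the node hosting rank i. Lower totals mean the
    heavy-traffic pairs sit closer together on the network."""
    n = len(comm)
    return sum(comm[i][j] * torus_hops(mapping[i], mapping[j])
               for i in range(n) for j in range(n) if i != j)

# Ranks 0 and 1 exchange the most data; compare two mappings.
comm = [[0, 900, 10], [900, 0, 10], [10, 10, 0]]
near = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
far  = [(0, 0, 0), (10, 10, 10), (0, 1, 0)]
print(hop_bytes(comm, near), hop_bytes(comm, far))  # near scores lower
```

Logging a metric like this per job is one way to "reward" topology-aware applications: the scheduler can see which jobs would actually benefit from careful placement.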
17. Flexible Resiliency Modes
• Run again
• Defensive I/O (traditional checkpoint/restart)
• Expensive
• Extra overhead for application and system
• Intrusive
• I/O infrastructure is shared across all jobs
• New C/R (node memory copy, SSD, journaling, …)
• Spare nodes in job requests to rebalance work after a single point of failure
• Wastes resources
• Runtimes do not support this well yet (but can do it)
• Redistribute work within the remaining nodes
• Charm++, some MPI implementations
• Takes longer
• Add spare nodes from a system pool to the job
• Job scheduler, resource manager, and runtime all have to be made more flexible (a work-redistribution sketch follows below)
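At its simplest, the "redistribute work within the remaining nodes" mode is a re-partition of the job's work units over the survivors after each failure, falling back to a spare node when the pool has one. A minimal sketch of that bookkeeping; the data structures are illustrative, and frameworks such as Charm++ do the real thing with object migration:

```python
def partition(work_units, nodes):
    """Round-robin the job's work units over the current node list."""
    plan = {n: [] for n in nodes}
    for i, unit in enumerate(work_units):
        plan[nodes[i % len(nodes)]].append(unit)
    return plan

def on_node_failure(plan, failed, spares=None):
    """Drop the failed node; pull a spare from the pool if one is
    available, otherwise spread its work over the survivors."""
    orphaned = plan.pop(failed)
    if spares:
        plan[spares.pop()] = orphaned           # spare absorbs the work
    else:
        survivors = list(plan)
        for i, unit in enumerate(orphaned):     # survivors share the work
            plan[survivors[i % len(survivors)]].append(unit)
    return plan

plan = partition(range(12), ["n0", "n1", "n2", "n3"])
plan = on_node_failure(plan, "n1", spares=["spare0"])
print({n: len(units) for n, units in plan.items()})
```

The spare-node branch keeps the run time flat at the cost of idle reserved nodes; the survivor branch wastes nothing but makes the job take longer, which is exactly the trade-off the slide lists.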
18. Observation 2: Resiliency Flexibility Critical
• Migrate from checkpointing to application resiliency
• Traditional system-based checkpoint/restart is no longer viable
• Defensive I/O per application is inefficient but is the current state of the art
• Better application resiliency requires improvements in both systems and applications
• Several teams are moving to new frameworks (e.g., Charm++) to improve resiliency
• MPI is trying to add better features for resiliency
19. Resiliency Flexibility
• Application-based resiliency
• Multiple layers of software and hardware have to coordinate information and reaction
• Analysis and understanding are needed before action
• Correct and actionable messages need to flow up and down the stack to the applications so they can take the proper action with correct information (a fault-event sketch follows below)
• Application situational awareness – applications need to understand their circumstances and take action
• Flexible resource provisioning is needed in real time
• Replacing failed nodes dynamically from a system pool of nodes
• Interaction with other constraints, so that sub-optimization does not adversely impact overall system optimization
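One way to get actionable messages flowing up the stack is a simple fault-event channel: the system side publishes structured events, and the application registers handlers that decide whether to shrink, migrate, or checkpoint. A minimal sketch; the event fields, handler protocol, and node name are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class FaultEvent:
    component: str   # e.g. a node or link identifier
    severity: str    # "degraded" or "failed"
    detail: str = ""

@dataclass
class FaultChannel:
    """The system side publishes; the application side subscribes."""
    handlers: list = field(default_factory=list)

    def subscribe(self, handler):
        self.handlers.append(handler)

    def publish(self, event):
        for handler in self.handlers:
            handler(event)

def app_handler(event):
    # The application chooses its own reaction per event severity.
    if event.severity == "failed":
        print(f"{event.component} failed: redistributing work")
    else:
        print(f"{event.component} degraded: noting slow ranks")

chan = FaultChannel()
chan.subscribe(app_handler)
chan.publish(FaultEvent("nid00042", "failed", "heartbeat lost"))
```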
20. The Chicken or the Egg
• Applications cannot take advantage of features the system does not provide
• So they do the best they can with guesses
• Technology providers do not provide features because they say applications do not use them
My message is:
We cannot brute-force our way to future @scale systems and applications any longer.
21. Many Other Observations – Other Presentations
• Storage and I/O: significant challenges
• System software quality and resiliency
• Testing for function, feature, and performance at scale
• Information gathering for the system
• Application methods
• Measuring real time-to-solution performance
• System SW performance at scale
• Heterogeneous components
• Application consistency
• Efficiency
• Energy, TCO, utilization
• S&E team productivity
• …
22. Summary
• Blue Waters is delivering on its commitment to sustained performance for the Nation on computational and data-focused @Scale problems.
• We appreciate the tremendous efforts and support of all our technology providers and science team partners.
• I am pleased to see Adaptive and Cray seriously addressing topology awareness issues – to meet BW-specific needs and hopefully beyond.
• I am pleased Cray made initial improvements to enable application resiliency, but Adaptive, Cray, MPI, and other technology providers need to do much more to solve the problem.
• I am very encouraged that application teams are willing (and eager) to implement flexibility in their codes if they have options.
• We need more commonality across technology providers and implementations.
• BW is an excellent platform for studying these issues as well as providing an unprecedented S&E resource.
• Stay tuned for amazing results from BW.
23. Acknowledgements
This work is part of the Blue Waters sustained-petascale computing project,
which is supported by the National Science Foundation (award number OCI
07-25070) and the state of Illinois. Blue Waters is a joint effort of the
University of Illinois at Urbana-Champaign, its National Center for
Supercomputing Applications, Cray, and the Great Lakes Consortium for
Petascale Computation.
The work described here was achieved through the efforts of many others on different teams.