3. Introduction: I/O Challenges in 2020/2025
Ø Scientific applications/simulations generate massive quantities of data.
Ø Example: BES (Basic Energy Sciences) Requirements Review, 2015
Ø Reviewed 19 projects
Ø Example projects: Quantum Materials, Soft Matter, Combustion
Ø Average increasing ratio
4. Common I/O Issues
Ø Bandwidth
Ø "The peak bandwidth is XXX GB/s; why can I get only 1% of that?"
Ø Scalability
Ø "I used more I/O processes; why doesn't the performance scale?"
Ø Metadata
Ø "File closing is so slow in my test…"
Ø Pain of productivity
Ø "I'd like to use Python/Spark, but the I/O seems slow"
5. What does Parallel I/O Mean?
Ø At the program level:
Ø Concurrent reads or writes from multiple processes to a common file
Ø At the system level:
Ø A parallel file system and hardware that support such concurrent access
- William Gropp
6. HPC I/O Software Stack
The stack, from application down to hardware:
Application -> Productive Interface -> High-Level I/O Library -> I/O Middleware -> I/O Forwarding -> Parallel File System -> I/O Hardware

Ø Productive Interface builds a thin layer on top of existing high-performance I/O libraries for productive big-data analytics.
Examples: h5py, H5Spark, Julia, Pandas, FITS
Ø High-Level I/O Libraries map application abstractions onto storage abstractions and provide data portability.
Examples: HDF5, Parallel netCDF, ADIOS
Ø I/O Middleware organizes accesses from many processes, especially those using collective I/O.
Examples: MPI-IO, GLEAN, PLFS
Ø I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage.
Examples: IBM ciod, IOFSL, Cray DVS, Cray DataWarp
Ø Parallel File System maintains the logical file model and provides efficient access to data.
Examples: PVFS, PanFS, GPFS, Lustre
7. Data Complexity in Computational Science
Ø Applications use advanced data models to fit the problem at hand
Ø Multidimensional typed arrays, images composed of scan lines, …
Ø Headers, attributes on data
Ø I/O systems have very simple data models
Ø Tree-based hierarchy of containers
Ø Some containers have streams of bytes (files)
Ø Others hold collections of other containers (directories or folders)
Effective mapping from application data models to I/O system data models is the key to I/O performance.
(Figures: right interior carotid artery with platelet aggregation. Model complexity: spectral element mesh (top) for a thermal hydraulics computation coupled with a finite element mesh (bottom) for a neutronics calculation. Scale complexity: spatial range from the reactor core in meters to fuel pellets in millimeters. Images from T. Tautges (ANL) (upper left), M. Smith (ANL) (lower left), and K. Smith (MIT) (right).)
8. I/O Hardware
Ø Storage side
Ø Hard disk drive (traditional)
Ø Solid state drive (future)
Ø Compute side
Ø DRAM, cache (traditional)
Ø HBM (e.g., MCDRAM), NVRAM (e.g., 3D XPoint)
(Figure: DRAM, HDD, SSD, and on-package MCDRAM. Courtesy of Tweaktown.)
9. I/O Hardware: HDD
Ø Contiguous I/O
• read time: 0.1 ms
Ø Noncontiguous I/O
• seek time: 4 ms
• rotation time: 3 ms
• read time: 0.1 ms
Ø SSD: no moving parts
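As a quick sanity check on these numbers, here is a toy cost model (an illustration only, not a disk simulator) comparing 1000 contiguous block reads with 1000 scattered ones, using the per-access times from this slide:

```python
# Back-of-the-envelope HDD cost model using the slide's numbers:
# a contiguous read pays only transfer time; a noncontiguous read
# pays seek + rotation + transfer on every access.
SEEK_MS = 4.0      # average seek time
ROTATION_MS = 3.0  # average rotational latency
READ_MS = 0.1      # transfer time for one small block

def access_cost_ms(n_accesses, contiguous):
    """Total time in ms for n block accesses on a spinning disk."""
    if contiguous:
        # Head already positioned: only the transfer time is paid.
        return n_accesses * READ_MS
    # Each access pays a full seek + rotation before the transfer.
    return n_accesses * (SEEK_MS + ROTATION_MS + READ_MS)

contig = access_cost_ms(1000, contiguous=True)      # 100 ms
scattered = access_cost_ms(1000, contiguous=False)  # 7100 ms
print(f"contiguous: {contig:.0f} ms, scattered: {scattered:.0f} ms, "
      f"slowdown: {scattered / contig:.0f}x")
```

The positioning overhead dominates: with these numbers, scattered accesses are roughly 71x slower than contiguous ones, which is why an SSD with no moving parts changes the picture.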
11. Parallel File System
Ø Stores application data persistently
Ø Usually extremely large datasets that can't fit in memory
Ø Provides a global shared namespace (files, directories)
Ø Designed for parallelism
Ø Concurrent (often coordinated) access from many clients
Ø Designed for high performance
Ø Operates over high-speed networks (IB, Myrinet, Portals)
Ø Optimized I/O path for maximum bandwidth
Ø Examples
Ø Lustre: most leadership supercomputers have deployed Lustre
Ø PVFS -> OrangeFS
Ø GPFS -> IBM Spectrum Scale; commercial & HPC
14. I/O Forwarding
Ø A layer between the computing system and the storage system
Ø Compute-node kernels ship I/O to dedicated I/O nodes [1]
Ø Examples
Ø Cray DVS
Ø IOFSL
Ø Cray DataWarp
1. Accelerating I/O Forwarding in IBM Blue Gene/P Systems
2. http://www.glennklockwood.com/data-intensive/storage/io-forwarding.html
15. I/O Forwarding: Cray DVS
DVS sits on top of a parallel file system, e.g., GPFS or Lustre
Ø The DVS clients can spread their I/O traffic between the DVS servers using a deterministic mapping.
Ø Configurable number of DVS clients
Ø Reduces the number of clients that communicate with the backing file system (GPFS supports a limited number of clients)
Stephen Sugiyama et al., Cray DVS: Data Virtualization Service, CUG 2008
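The deterministic mapping idea can be sketched as follows (a hypothetical modulo scheme for illustration; the actual Cray DVS mapping function is not shown on this slide):

```python
# Hypothetical sketch (not the actual Cray DVS algorithm): a deterministic
# client-to-server mapping lets many DVS clients spread traffic across a
# small, fixed set of DVS servers, so the backing file system (e.g., GPFS)
# only ever sees the servers as its clients.

def dvs_server_for(client_rank, file_id, n_servers):
    """Pick a DVS server deterministically from (client, file) identity."""
    # Mixing in the file ID spreads different files across servers even
    # when the same set of clients accesses them.
    return (client_rank + file_id) % n_servers

n_servers = 4
# 1024 clients funnel down to 4 servers; the mapping is stable across calls.
targets = {dvs_server_for(rank, file_id=7, n_servers=n_servers)
           for rank in range(1024)}
assert targets == set(range(n_servers))
assert dvs_server_for(5, 7, 4) == dvs_server_for(5, 7, 4)  # deterministic
```

Because the mapping needs no coordination or shared state, every client can compute its target server independently.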
16. I/O Middleware
Ø Why additional I/O software?
Ø Additional I/O software provides improved performance and usability over directly accessing the parallel file system.
Ø Reduces or (ideally) eliminates the need for optimization in application codes.
Ø MPI-IO
Ø I/O interface specification for use in MPI apps
Ø Data model is the same as POSIX: a stream of bytes in a file
Ø MPI-IO features
Ø Collective I/O
Ø Noncontiguous I/O with MPI datatypes and file views
Ø Nonblocking I/O
Ø Fortran bindings (and additional languages)
Ø System for encoding files in a portable format (external32)
17. What’s Wrong with POSIX?
Ø It's a useful, ubiquitous interface for basic I/O
Ø It lacks constructs useful for parallel I/O
Ø A cluster application is really one program running on N nodes, but it looks like N programs to the file system
Ø No support for noncontiguous I/O
Ø No hinting/prefetching
Ø Its rules hurt performance for parallel apps
Ø Atomic writes, read-after-write consistency
Ø Attribute freshness
Ø POSIX should not have to be used (directly) in parallel applications that want good performance
Ø But developers use it anyway
18. Independent and Collective I/O
Ø Independent I/O operations specify only what a single process will do
Ø Independent I/O calls do not convey relationships with I/O on other processes
Ø Why use independent I/O
Ø Sometimes the synchronization of collective calls is not natural
Ø Sometimes the overhead of collective calls outweighs their benefits
Ø Example: very small I/O during metadata operations
(Figure: processes P0–P5 performing independent I/O.)
19. Independent and Collective I/O
Ø Collective I/O is coordinated access to storage by a group of processes
Ø Collective I/O functions are called by all processes participating in I/O
Ø Why use collective I/O
Ø Allows I/O layers to know more about the access as a whole: more opportunities for optimization in lower software layers, better performance
Ø Combined with noncontiguous accesses, yields the highest performance
(Figure: processes P0–P5 performing collective I/O.)
20. Two Key Optimizations in ROMIO (MPI-IO)
Ø MPI-IO has many implementations
Ø ROMIO
Ø Cray, IBM, and OpenMPI all have their own implementations/variants
Ø Data sieving
Ø For independent noncontiguous requests
Ø ROMIO makes large I/O requests to the file system and, in memory, extracts the data requested
Ø For writing, a read-modify-write is required
Ø Two-phase collective I/O
Ø Communication phase to merge data into large chunks
Ø I/O phase to write large chunks in parallel
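Data sieving can be sketched in a few lines of Python (an illustration of the idea, not ROMIO's implementation): instead of issuing one small read per requested extent, read one large covering block and extract the requested byte ranges from it in memory.

```python
# Data sieving sketch: serve many small (offset, length) reads
# with a single large read of the covering byte range.

def data_sieve_read(read_block, extents):
    """read_block(offset, length) -> bytes is the underlying file read.
    extents is a list of (offset, length) requests, assumed sorted."""
    start = extents[0][0]
    end = max(off + ln for off, ln in extents)
    block = read_block(start, end - start)          # one large request
    # Extract each requested range from the in-memory block.
    return [block[off - start: off - start + ln] for off, ln in extents]

# Usage: a toy "file" backed by a bytes object, counting backing reads.
data = bytes(range(100))
calls = []
def backing_read(off, ln):
    calls.append((off, ln))
    return data[off:off + ln]

pieces = data_sieve_read(backing_read, [(10, 2), (40, 3), (90, 1)])
assert pieces == [data[10:12], data[40:43], data[90:91]]
assert calls == [(10, 81)]   # one I/O request instead of three
```

The trade-off is visible here too: the single read fetches bytes nobody asked for, which is why ROMIO's writes need a read-modify-write and why disabling data sieving is sometimes the right hint.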
21. Contiguous and Noncontiguous I/O
Ø Contiguous I/O moves data from a single memory block into a single file region
Ø Noncontiguous I/O has three forms:
Ø Noncontiguous in memory, noncontiguous in file, or noncontiguous in both
Ø Structured data leads naturally to noncontiguous I/O (e.g., block decomposition)
Ø Describing noncontiguous accesses with a single operation passes more knowledge to the I/O system
(Figure: a process whose accesses are noncontiguous in file and noncontiguous in memory; extracting variables from a block and skipping ghost cells will result in noncontiguous I/O.)
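The ghost-cell case can be made concrete with a small sketch (an illustrative row-major layout, not code from the slides) that lists the separate extents an interior-only access needs:

```python
# Why skipping ghost cells yields noncontiguous I/O: enumerate the
# (offset, length) byte extents needed to access only the interior of a
# 2-D block whose row-major layout includes a ghost-cell halo.

def interior_extents(nrows, ncols, ghost, elem_size):
    """One (offset, length) extent per interior row of the block."""
    extents = []
    for r in range(ghost, nrows - ghost):
        # Skip `ghost` cells at the start of each row.
        start_elem = r * ncols + ghost
        n_elems = ncols - 2 * ghost
        extents.append((start_elem * elem_size, n_elems * elem_size))
    return extents

# A 6x6 block of 8-byte elements with a 1-cell halo needs 4 separate
# extents, one per interior row: the data is contiguous only row by row.
ext = interior_extents(6, 6, 1, 8)
assert len(ext) == 4
assert ext[0] == ((1 * 6 + 1) * 8, 4 * 8)   # row 1: offset 56, length 32
```

Describing all of these extents with one MPI derived datatype, instead of one call per row, is exactly the "single operation passes more knowledge" point above.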
22. Example: Collective I/O for Noncontiguous I/O
(Figure, courtesy of William Gropp: a large array distributed among 16 processes; each square represents a subarray in the memory of a single process. The access pattern in the file is row major.)
23. Example: Collective I/O for Noncontiguous I/O

Ø Option 1: each process makes one independent read request for each row in its local array:

MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read(fh, &(A[i][0]), ...);
}
MPI_File_close(&fh);

Ø Option 2: each process creates a derived datatype to describe the noncontiguous access pattern, defines a file view, and calls a collective read function:

MPI_Type_create_subarray(ndims, ..., &subarray);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read_all(fh, A, ...);
MPI_File_close(&fh);
26. High Level I/O Libraries
Ø Take advantage of high-performance parallel I/O while reducing complexity
Ø Add a well-defined layer to the I/O stack
Ø Allow users to specify complex data relationships and dependencies
Ø Come with machine-independent, self-describing data formats suitable for array-oriented scientific data
Ø Examples
Ø HDF5: The HDF Group, since 1989; among the top 5 libraries at NERSC
Ø Parallel netCDF: NWU, ANL, since 2001
Ø ADIOS: ORNL, since 2009
27. High Level I/O Libraries: HDF5
MPI_Init(&argc, &argv);
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, comm, info);           /* use the MPI-IO driver */
file_id = H5Fcreate(FNAME, ..., fapl_id);
space_id = H5Screate_simple(...);
dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, ...);
xf_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);   /* collective transfers */
status = H5Dwrite(dset_id, H5T_NATIVE_INT, ..., xf_id, ...);
MPI_Finalize();

Ø A parallel HDF5 program needs only a few extra calls compared with a serial one
28. Productive I/O Interface
Ø Big data analytics stack
Ø Spark
Ø TensorFlow
Ø Caffe
Ø Science data needs to be loaded efficiently into the engine
Ø H5py
Ø H5Spark
Ø fitsio
29. Productive I/O Interface: H5py
Ø H5py exposes the HDF5 C API through layered wrappers:

HDF5 C API (libhdf5):
    hsize_t H5Dget_storage_size(hid_t dset_id)

Cython layer (h5d.pyx):
    cdef class DatasetID(ObjectID):
        def get_storage_size(self):
            return H5Dget_storage_size(self.id)

High-level Python layer (_hl/dataset.py):
    class Dataset(HLObject):
        @property
        def storagesize(self):
            return self.id.get_storage_size()

Ø Low-level identifier classes (DatasetID, FileID) back the high-level classes (Dataset, File, Group)
32. H5py vs. HDF5 Performance

H5py performance as a fraction of HDF5 performance:

                          Single Node    Multi-node
Metadata
  1k File Creation        63.8%
  1k Object Scanning      60.0%
Independent I/O
  Weak Scaling            97.8%          100%
  Strong Scaling          100%           97.1%
Collective I/O
  Weak Scaling            100%           90%
  Strong Scaling          98.6%          87%

Question: when you gain productivity, how much performance can you afford to lose?
33. HPC I/O Software Stack
(Recap of the I/O software stack from slide 6: Application -> Productive Interface -> High-Level I/O Library -> I/O Middleware -> I/O Forwarding -> Parallel File System -> I/O Hardware.)
34. Get to Know Your I/O: Warp IO
Ø Characteristics:
Ø Number of files
Ø Size per file
Ø Number of processes
Ø I/O API
(Figure: Warp I/O pattern across iteration 0 and iteration 1.)
Ø 172–600 MB per file
35. Leverage I/O Profiling Tool: Darshan
Ø Lightweight, scalable I/O profiling tool
Ø Goal: to observe the I/O patterns of the majority of applications running on production HPC platforms, without perturbing their execution, with enough detail to gain insight and aid in performance debugging
Ø Majority of applications – transparent integration with the system build environment
Ø Without perturbation – bounded use of resources (memory, network, storage); no communication or I/O prior to job termination; compression
Ø Adequate detail:
Ø Basic job statistics
Ø File access information from multiple APIs
36. The Technology behind Darshan
Ø Intercepts I/O functions using link-time wrappers
Ø No code modification
Ø Can be transparently enabled in MPI compiler scripts
Ø Compatible with all major C, C++, and Fortran compilers
Ø Records statistics independently at each process, for each file
Ø Bounded memory consumption
Ø Compact summary rather than a verbatim record
Ø Collects, compresses, and stores results at shutdown time
Ø Aggregates shared-file data using a custom MPI reduction operator
Ø Compresses the remaining data in parallel with zlib
Ø Writes results with collective MPI-IO
Ø The result is a single gzip-compatible file containing characterization information
Ø Works on Linux clusters, Blue Gene, and Cray systems
37. Darshan Analysis Example
Example job: hdf5writeTest (2/4/2016); jobid: 42375, uid: 58179, nprocs: 64, runtime: 84 seconds

(Figure: first page of the report, with four panels: (1) average I/O cost per process, split into read, write, metadata, and other time (including application compute) for POSIX and MPI-IO; (2) I/O operation counts (read, write, open, stat, seek, mmap, fsync) for POSIX, MPI-IO independent, and MPI-IO collective; (3) histogram of read and write access sizes from 0–100 bytes up to 1 GB+; (4) I/O pattern: total vs. sequential vs. consecutive operations.)

Most Common Access Sizes
    access size    count
    1048576        65331
    272            1
    544            1
    328            1

File Count Summary (estimated by I/O access offsets)
    type               number of files    avg. size    max size
    total opened       1                  64G          64G
    read-only files    0                  0            0
    write-only files   1                  64G          64G
    read/write files   0                  0            0
    created files      1                  64G          64G

Ø The darshan-job-summary tool produces a 3-page PDF file that summarizes job I/O behavior:
1. Run time
2. Percentage of runtime in I/O
3. Access type histograms
4. Access size histogram
5. File usage
39. How to Use Darshan
Ø How to link with Darshan
Ø Compile a C, C++, or Fortran program that uses MPI
Ø Run the application
Ø Look for the Darshan log file
Ø It will be in a particular directory (depending on your system's configuration)
– <dir>/<year>/<month>/<day>/<username>_<appname>*.darshan*
Ø Mira: see /projects/logs/darshan/
Ø Edison: see /scratch1/scratchdirs/darshanlogs/
Ø Cori: see /global/cscratch1/sd/darshanlogs/
Ø Use the Darshan command-line tools to analyze the log file
Ø The application must run to completion and call MPI_Finalize() to generate a log file
40. Optimizing I/O from File System Layer: Lustre
Ø File striping is a way to increase I/O performance
Ø Increases bandwidth, because multiple processes can access the file simultaneously
Ø Allows storing large files that would take more space than a single OST provides
(Figure courtesy of NICS)
41. Optimizing I/O from File System Layer: Lustre
Ø Default striping: 1 MB stripe size, 1 OST
Ø lfs getstripe
Ø lfs setstripe -c 100 -S 8m
Ø Chunks the file into 8 MB blocks and distributes them across 100 OSTs
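The effect of these settings can be sketched with a little arithmetic (an illustration of round-robin striping; actual Lustre layouts can be more elaborate): with stripe size S and stripe count C, byte offset `off` lands on OST index `(off // S) % C`.

```python
# Round-robin striping sketch: which OST (by stripe index within the
# layout) holds a given byte offset of the file.

def ost_for_offset(offset, stripe_size, stripe_count):
    """OST index holding the given file offset under round-robin striping."""
    stripe_index = offset // stripe_size
    return stripe_index % stripe_count

MB = 1024 * 1024
# `lfs setstripe -c 100 -S 8m`: 8 MB stripes over 100 OSTs.
stripe_size, stripe_count = 8 * MB, 100

assert ost_for_offset(0, stripe_size, stripe_count) == 0
assert ost_for_offset(8 * MB, stripe_size, stripe_count) == 1    # 2nd stripe
assert ost_for_offset(800 * MB, stripe_size, stripe_count) == 0  # wraps around
# A 5 GB file fills only 640 stripes, so each OST holds at most 7 of them:
# spreading such a small file over 100 OSTs buys little bandwidth.
```

This is why the next slide warns against striping relatively small files over too many OSTs.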
42. Optimizing I/O from File System Layer: Lustre
Ø More OSTs generally help
Ø Striping a relatively small file over too many OSTs is not good
Ø Communication overhead
Ø Saturated I/O bandwidth at another layer, e.g., the compute nodes
Ø Storage stragglers or a bad OST
(Figures: writing a 5 GB file vs. writing a 100 GB file)
43. Optimizing I/O from File System Layer: Lustre
Ø Empirical recommendations on Cori @ NERSC
Ø File per process -> use default striping
Ø Single shared file ->
45. Optimizing I/O from the I/O Middleware Layer: MPI-IO
Ø Aggregate small and/or noncontiguous I/O into larger contiguous I/O
Ø Collective buffer size
Ø Collective buffer nodes: the actual number of I/O (aggregator) processes
Ø Use a customized MPI-IO that better leverages the underlying file system
Ø Cray's MPI-IO knows its Lustre better, e.g., it reduces contention by exposing Lustre data-layout information to the MPI-IO layer
Ø How to pass the hints
Ø Through an environment variable:
setenv MPICH_MPIIO_HINTS "*:romio_cb_write=enable:romio_ds_write=disable"
Ø With MPI_Info_set:
MPI_Info_set(info, "striping_factor", "4");
MPI_Info_set(info, "cb_nodes", "4");
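For the environment-variable route, the hint string is just colon-separated key=value pairs after a file pattern; a tiny helper (illustrative only, not part of MPICH) makes the format explicit:

```python
# Build the MPICH_MPIIO_HINTS value from a dict of hints, using the
# "<file-pattern>:key=value:key=value" syntax shown on this slide.

def mpiio_hints_string(hints, file_pattern="*"):
    """Format MPI-IO hints for the MPICH_MPIIO_HINTS environment variable."""
    parts = [f"{key}={value}" for key, value in hints.items()]
    return ":".join([file_pattern] + parts)

env_value = mpiio_hints_string({
    "romio_cb_write": "enable",    # turn on collective buffering for writes
    "romio_ds_write": "disable",   # turn off data sieving for writes
})
assert env_value == "*:romio_cb_write=enable:romio_ds_write=disable"
# e.g. set os.environ["MPICH_MPIIO_HINTS"] = env_value before launching.
```

The `*` pattern applies the hints to every file; MPICH also accepts per-file patterns in the same syntax.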
47. Optimizing MPI-IO on Cori Haswell vs. KNL
47
Ø Different colors: Different number of aggregators
Ø X-axis : collecKve buffer size
About this test:
Ø MPI-IO, collecKve IO
Ø 486GB file
Ø 32 processes per node
Ø 32 nodes
We recommend:
Ø 4 aggregator per node on Haswell
Ø 8 aggregator per node on KNL
Ø Check our paper in CUG’17
J. Liu, etc. Understanding the IO Performance Gap Between Cori KNL and Haswell, CUG’17
48. Optimizing I/O from I/O Middle Layer: Guidelines
Ø Limit the number of files (less metadata and easier to post-process)
Ø Make large and contiguous requests
Ø Avoid small accesses
Ø Avoid noncontiguous accesses
Ø Avoid random accesses
Ø Prefer collective I/O to independent I/O (especially if the operations can be aggregated into single large contiguous requests)
Ø Use derived datatypes and file views to ease the MPI I/O collective work
Ø Try MPI I/O hints (especially the collective buffering optimization; disabling data sieving is also very often a good idea; also useful for libraries based on MPI-IO)
Credits: Philippe Wautelet @ IDRIS
49. Optimizing I/O from High Level Interface: HDF5
Ø The MPI-IO layer's optimization guidelines generally apply to the HDF5 layer
Ø Collective metadata operations, available in HDF5 1.10
Ø For reads, the library uses one rank to read the metadata and broadcasts it to all other ranks
Ø For writes, it constructs an MPI derived datatype and writes collectively in a single call
Ø Increase page buffering
Ø H5Pset_page_buffer_size
Ø Stay tuned for the next HDF5 talk at 10am
50. Optimizing I/O from Python Interface: H5py
Ø Optimal HDF5 file creation
Ø Use the low-level API in H5py
Ø Get closer to the HDF5 C library for fine tuning
(Figure: 2.25x speedup)
51. Optimizing I/O from Python Interface: H5py
Ø Speed up the I/O with collective I/O
Using 1k processes to write a 1 TB file, collective I/O achieved a 2x speedup on Cori
52. Optimizing I/O from Python Interface: H5py
Ø Avoid type casting in H5py
Reduced I/O time from 527 seconds to 1.3 seconds
53. Object Store for HPC
Ø Amazon S3 has been hugely successful in supporting diverse applications, e.g., Instagram, Dropbox
Ø HPC file systems rely on strong POSIX semantics, which hinder performance, e.g., scalability
Ø Object store operations:
Ø Put: creates a new object and fills it with data
Ø Get: retrieves the data based on the object ID
Ø Benefit: scalability
Ø Lockless: objects are immutable and write-once, so there is no need to lock before a read
Ø Fast lookup, based on a simple hash of the object ID
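The put/get model above can be sketched as a toy in-memory object store (illustrative only; real systems such as S3 or DDN WOS are distributed, and the content-derived ID here is just one possible scheme):

```python
# Toy object store: objects are immutable and write-once, and lookup is a
# hash-keyed dictionary access, so no locking is needed before a read.
import hashlib

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Create a new immutable object; return its ID."""
        # Content-derived ID: identical data always yields the same ID.
        obj_id = hashlib.sha256(data).hexdigest()
        # Write-once: an existing object is never modified.
        self._objects.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Retrieve data by object ID (a constant-time hash lookup)."""
        return self._objects[obj_id]

store = ObjectStore()
oid = store.put(b"simulation checkpoint")
assert store.get(oid) == b"simulation checkpoint"
assert store.put(b"simulation checkpoint") == oid  # write-once, same ID
```

Because objects never change after `put`, readers need no coordination with writers, which is the scalability argument on this slide.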
54. Object Store for HPC
Ø Disadvantages
Ø Data cannot be modified, while most HPC applications want to read and write data in place
Ø Limited metadata support (e.g., user info, access permissions); requires a database layer
(Figure courtesy of Glenn Lockwood)
DDN WOS, Object Store Testbed at NERSC (TBA soon)
Contact me or Damian Hazen if you are interested
55. Burst Buffer
Ø Burst Buffer on Cori
Ø 1.7 TB/s of peak I/O performance with 28M IOPS
Ø 1.8 PB of storage
Ø NVRAM-based "Burst Buffer" (BB) as an intermediate layer
Ø Handles I/O spikes without a huge PFS (stages to the PFS asynchronously)
Ø Underlying media support challenging I/O
Ø Software for on-demand file systems scales better than a large POSIX PFS
Ø Cray DataWarp software allocates storage to users per job
Ø Users see an on-demand POSIX file system, striped across nodes
Ø Can specify data to stage in/out from Lustre while the job is in the queue
57. Burst Buffer Use Case: H5Boss in Astronomy

(Figure: cost in seconds of each workflow step (file open, fiber object copy, catalog query & copy) on Lustre (Cori) vs. the Burst Buffer (BB).)

• BOSS: Baryon Oscillation Spectroscopic Survey, from SDSS
• Perform typical randomly generated queries to extract a small number of stars/galaxies from millions
• Workflows involve 1000s of file opens/closes and random, small read/write I/O
• Run on the final release of the complete SDSS-III BOSS dataset
– 2393 HDF5 files, ~3.2 TB total
• 4.4 TB Burst Buffer on 22 nodes
• Lower I/O times on the Burst Buffer
• 5.5x speedup for the entire workflow