3. Introduction: I/O Challenges in 2020/2025
Ø Scientific applications/simulations generate massive quantities of data.
Ø Example: BES (Basic Energy Sciences) Requirements Review, 2015
Ø Reviewed 19 projects
Ø Example projects: Quantum Materials, Soft Matter, Combustion
Ø Average increasing ratio
4. Common I/O Issues
Ø Bandwidth
Ø "The peak bandwidth is XXX GB/s; why can I get only 1% of that?"
Ø Scalability
Ø "I used more I/O processes; why doesn't the performance scale?"
Ø Metadata
Ø "File closing is so slow in my test…"
Ø Pain of productivity
Ø "I'd like to use Python/Spark, but the I/O seems slow"
5. What does Parallel I/O Mean?
Ø At the program level:
Ø Concurrent reads or writes from multiple processes to a common file
Ø At the system level:
Ø A parallel file system and hardware that support such concurrent access
- William Gropp
6. HPC I/O Software Stack
The stack, from application down to hardware:
Application -> Productive Interface -> High-Level I/O Library -> I/O Middleware -> I/O Forwarding -> Parallel File System -> I/O Hardware

Ø Productive Interface builds a thin layer on top of existing high-performance I/O libraries for productive big-data analytics.
Examples: h5py, H5Spark, Julia, Pandas, FITS
Ø High-Level I/O Libraries map application abstractions onto storage abstractions and provide data portability.
Examples: HDF5, Parallel netCDF, ADIOS
Ø I/O Middleware organizes accesses from many processes, especially those using collective I/O.
Examples: MPI-IO, GLEAN, PLFS
Ø I/O Forwarding transforms I/O from many clients into fewer, larger requests; reduces lock contention; and bridges between the HPC system and external storage.
Examples: IBM ciod, IOFSL, Cray DVS, Cray DataWarp
Ø Parallel File System maintains the logical file model and provides efficient access to data.
Examples: PVFS, PanFS, GPFS, Lustre
7. Data Complexity in Computational Science
Ø Applications use advanced data models to fit the problem at hand
Ø Multidimensional typed arrays, images composed of scan lines, …
Ø Headers, attributes on data
Ø I/O systems have very simple data models
Ø Tree-based hierarchy of containers
Ø Some containers have streams of bytes (files)
Ø Others hold collections of other containers (directories or folders)
Effective mapping from application data models to I/O system data models is the key to I/O performance.
(Figures: right interior carotid artery with platelet aggregation. Model complexity: spectral element mesh (top) for a thermal hydraulics computation coupled with a finite element mesh (bottom) for a neutronics calculation. Scale complexity: spatial range from the reactor core in meters to fuel pellets in millimeters. Images from T. Tautges (ANL) (upper left), M. Smith (ANL) (lower left), and K. Smith (MIT) (right).)
8. I/O Hardware
Ø Storage side
Ø Hard disk drive (traditional)
Ø Solid state drive (future)
Ø Compute side
Ø DRAM, cache (traditional)
Ø HBM (e.g., MCDRAM), NVRAM (e.g., 3D XPoint)
(Figure: DRAM, HDD, SSD, and on-package MCDRAM. Courtesy of Tweaktown.)
9. I/O Hardware: HDD
Ø Contiguous I/O
• read time: 0.1 ms
Ø Noncontiguous I/O
• seek time: 4 ms
• rotation time: 3 ms
• read time: 0.1 ms
Ø SSD: no moving parts
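As a quick sanity check on these numbers, here is a toy cost model (an illustration only, not a disk simulator) comparing 1000 contiguous block reads with 1000 scattered ones, using the per-access times from this slide:

```python
# Back-of-the-envelope HDD cost model using the slide's numbers:
# a contiguous read pays only transfer time; a noncontiguous read
# pays seek + rotation + transfer on every access.
SEEK_MS = 4.0      # average seek time
ROTATION_MS = 3.0  # average rotational latency
READ_MS = 0.1      # transfer time for one small block

def access_cost_ms(n_accesses, contiguous):
    """Total time in ms for n block accesses on a spinning disk."""
    if contiguous:
        # Head already positioned: only the transfer time is paid.
        return n_accesses * READ_MS
    # Each access pays a full seek + rotation before the transfer.
    return n_accesses * (SEEK_MS + ROTATION_MS + READ_MS)

contig = access_cost_ms(1000, contiguous=True)      # 100 ms
scattered = access_cost_ms(1000, contiguous=False)  # 7100 ms
print(f"contiguous: {contig:.0f} ms, scattered: {scattered:.0f} ms, "
      f"slowdown: {scattered / contig:.0f}x")
```

The positioning overhead dominates: with these numbers, scattered accesses are roughly 71x slower than contiguous ones, which is why an SSD with no moving parts changes the picture.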
11. Parallel File System
Ø Stores application data persistently
Ø Usually extremely large datasets that can't fit in memory
Ø Provides a global shared namespace (files, directories)
Ø Designed for parallelism
Ø Concurrent (often coordinated) access from many clients
Ø Designed for high performance
Ø Operates over high-speed networks (IB, Myrinet, Portals)
Ø Optimized I/O path for maximum bandwidth
Ø Examples
Ø Lustre: most leadership supercomputers have deployed Lustre
Ø PVFS -> OrangeFS
Ø GPFS -> IBM Spectrum Scale; commercial & HPC
14. I/O Forwarding
Ø A layer between the computing system and the storage system
Ø Compute-node kernels ship I/O to dedicated I/O nodes [1]
Ø Examples
Ø Cray DVS
Ø IOFSL
Ø Cray DataWarp
1. Accelerating I/O Forwarding in IBM Blue Gene/P Systems
2. http://www.glennklockwood.com/data-intensive/storage/io-forwarding.html
15. I/O Forwarding: Cray DVS
DVS sits on top of a parallel file system, e.g., GPFS or Lustre
Ø The DVS clients can spread their I/O traffic between the DVS servers using a deterministic mapping.
Ø Configurable number of DVS clients
Ø Reduces the number of clients that communicate with the backing file system (GPFS supports a limited number of clients)
Stephen Sugiyama et al., Cray DVS: Data Virtualization Service, CUG 2008
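The deterministic mapping idea can be sketched as follows (a hypothetical modulo scheme for illustration; the actual Cray DVS mapping function is not shown on this slide):

```python
# Hypothetical sketch (not the actual Cray DVS algorithm): a deterministic
# client-to-server mapping lets many DVS clients spread traffic across a
# small, fixed set of DVS servers, so the backing file system (e.g., GPFS)
# only ever sees the servers as its clients.

def dvs_server_for(client_rank, file_id, n_servers):
    """Pick a DVS server deterministically from (client, file) identity."""
    # Mixing in the file ID spreads different files across servers even
    # when the same set of clients accesses them.
    return (client_rank + file_id) % n_servers

n_servers = 4
# 1024 clients funnel down to 4 servers; the mapping is stable across calls.
targets = {dvs_server_for(rank, file_id=7, n_servers=n_servers)
           for rank in range(1024)}
assert targets == set(range(n_servers))
assert dvs_server_for(5, 7, 4) == dvs_server_for(5, 7, 4)  # deterministic
```

Because the mapping needs no coordination or shared state, every client can compute its target server independently.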
16. I/O Middleware
Ø Why additional I/O software?
Ø Additional I/O software provides improved performance and usability over directly accessing the parallel file system.
Ø Reduces or (ideally) eliminates the need for optimization in application codes.
Ø MPI-IO
Ø I/O interface specification for use in MPI apps
Ø Data model is the same as POSIX: a stream of bytes in a file
Ø MPI-IO features
Ø Collective I/O
Ø Noncontiguous I/O with MPI datatypes and file views
Ø Nonblocking I/O
Ø Fortran bindings (and additional languages)
Ø System for encoding files in a portable format (external32)
17. What’s Wrong with POSIX?
Ø It's a useful, ubiquitous interface for basic I/O
Ø It lacks constructs useful for parallel I/O
Ø A cluster application is really one program running on N nodes, but it looks like N programs to the file system
Ø No support for noncontiguous I/O
Ø No hinting/prefetching
Ø Its rules hurt performance for parallel apps
Ø Atomic writes, read-after-write consistency
Ø Attribute freshness
Ø POSIX should not have to be used (directly) in parallel applications that want good performance
Ø But developers use it anyway
18. Independent and Collective I/O
Ø Independent I/O operations specify only what a single process will do
Ø Independent I/O calls do not convey relationships with I/O on other processes
Ø Why use independent I/O
Ø Sometimes the synchronization of collective calls is not natural
Ø Sometimes the overhead of collective calls outweighs their benefits
Ø Example: very small I/O during metadata operations
(Figure: processes P0–P5 performing independent I/O.)
19. Independent and Collective I/O
Ø Collective I/O is coordinated access to storage by a group of processes
Ø Collective I/O functions are called by all processes participating in I/O
Ø Why use collective I/O
Ø Allows I/O layers to know more about the access as a whole: more opportunities for optimization in lower software layers, better performance
Ø Combined with noncontiguous accesses, yields the highest performance
(Figure: processes P0–P5 performing collective I/O.)
20. Two Key Optimizations in ROMIO (MPI-IO)
Ø MPI-IO has many implementations
Ø ROMIO
Ø Cray, IBM, and OpenMPI all have their own implementations/variants
Ø Data sieving
Ø For independent noncontiguous requests
Ø ROMIO makes large I/O requests to the file system and, in memory, extracts the data requested
Ø For writing, a read-modify-write is required
Ø Two-phase collective I/O
Ø Communication phase to merge data into large chunks
Ø I/O phase to write large chunks in parallel
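Data sieving can be sketched in a few lines of Python (an illustration of the idea, not ROMIO's implementation): instead of issuing one small read per requested extent, read one large covering block and extract the requested byte ranges from it in memory.

```python
# Data sieving sketch: serve many small (offset, length) reads
# with a single large read of the covering byte range.

def data_sieve_read(read_block, extents):
    """read_block(offset, length) -> bytes is the underlying file read.
    extents is a list of (offset, length) requests, assumed sorted."""
    start = extents[0][0]
    end = max(off + ln for off, ln in extents)
    block = read_block(start, end - start)          # one large request
    # Extract each requested range from the in-memory block.
    return [block[off - start: off - start + ln] for off, ln in extents]

# Usage: a toy "file" backed by a bytes object, counting backing reads.
data = bytes(range(100))
calls = []
def backing_read(off, ln):
    calls.append((off, ln))
    return data[off:off + ln]

pieces = data_sieve_read(backing_read, [(10, 2), (40, 3), (90, 1)])
assert pieces == [data[10:12], data[40:43], data[90:91]]
assert calls == [(10, 81)]   # one I/O request instead of three
```

The trade-off is visible here too: the single read fetches bytes nobody asked for, which is why ROMIO's writes need a read-modify-write and why disabling data sieving is sometimes the right hint.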
21. Contiguous and Noncontiguous I/O
Ø Contiguous I/O moves data from a single memory block into a single file region
Ø Noncontiguous I/O has three forms:
Ø Noncontiguous in memory, noncontiguous in file, or noncontiguous in both
Ø Structured data leads naturally to noncontiguous I/O (e.g., block decomposition)
Ø Describing noncontiguous accesses with a single operation passes more knowledge to the I/O system
(Figure: a process whose accesses are noncontiguous in file and noncontiguous in memory; extracting variables from a block and skipping ghost cells will result in noncontiguous I/O.)
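The ghost-cell case can be made concrete with a small sketch (an illustrative row-major layout, not code from the slides) that lists the separate extents an interior-only access needs:

```python
# Why skipping ghost cells yields noncontiguous I/O: enumerate the
# (offset, length) byte extents needed to access only the interior of a
# 2-D block whose row-major layout includes a ghost-cell halo.

def interior_extents(nrows, ncols, ghost, elem_size):
    """One (offset, length) extent per interior row of the block."""
    extents = []
    for r in range(ghost, nrows - ghost):
        # Skip `ghost` cells at the start of each row.
        start_elem = r * ncols + ghost
        n_elems = ncols - 2 * ghost
        extents.append((start_elem * elem_size, n_elems * elem_size))
    return extents

# A 6x6 block of 8-byte elements with a 1-cell halo needs 4 separate
# extents, one per interior row: the data is contiguous only row by row.
ext = interior_extents(6, 6, 1, 8)
assert len(ext) == 4
assert ext[0] == ((1 * 6 + 1) * 8, 4 * 8)   # row 1: offset 56, length 32
```

Describing all of these extents with one MPI derived datatype, instead of one call per row, is exactly the "single operation passes more knowledge" point above.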
22. Example: Collective I/O for Noncontiguous I/O
(Figure, courtesy of William Gropp: a large array distributed among 16 processes; each square represents a subarray in the memory of a single process. The access pattern in the file is row major.)
23. Example: Collective I/O for Noncontiguous I/O

Ø Option 1: each process makes one independent read request for each row in its local array:

MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, ...);
    MPI_File_read(fh, &(A[i][0]), ...);
}
MPI_File_close(&fh);

Ø Option 2: each process creates a derived datatype to describe the noncontiguous access pattern, defines a file view, and calls a collective read function:

MPI_Type_create_subarray(ndims, ..., &subarray);
MPI_Type_commit(&subarray);
MPI_File_open(MPI_COMM_WORLD, file, ..., &fh);
MPI_File_set_view(fh, ..., subarray, ...);
MPI_File_read_all(fh, A, ...);
MPI_File_close(&fh);
26. High Level I/O Libraries
Ø Take advantage of high-performance parallel I/O while reducing complexity
Ø Add a well-defined layer to the I/O stack
Ø Allow users to specify complex data relationships and dependencies
Ø Come with machine-independent, self-describing data formats suitable for array-oriented scientific data
Ø Examples
Ø HDF5: The HDF Group, since 1989; among the top 5 libraries at NERSC
Ø Parallel netCDF: NWU, ANL, since 2001
Ø ADIOS: ORNL, since 2009
27. High Level I/O Libraries: HDF5
MPI_Init(&argc, &argv);
fapl_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl_id, comm, info);           /* use the MPI-IO driver */
file_id = H5Fcreate(FNAME, ..., fapl_id);
space_id = H5Screate_simple(...);
dset_id = H5Dcreate(file_id, DNAME, H5T_NATIVE_INT, space_id, ...);
xf_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xf_id, H5FD_MPIO_COLLECTIVE);   /* collective transfers */
status = H5Dwrite(dset_id, H5T_NATIVE_INT, ..., xf_id, ...);
MPI_Finalize();

Ø A parallel HDF5 program needs only a few extra calls compared with a serial one
28. Productive I/O Interface
Ø Big data analytics stack
Ø Spark
Ø TensorFlow
Ø Caffe
Ø Science data needs to be loaded efficiently into the engine
Ø H5py
Ø H5Spark
Ø fitsio
29. Productive I/O Interface: H5py
Ø H5py exposes the HDF5 C API through layered wrappers:

HDF5 C API (libhdf5):
    hsize_t H5Dget_storage_size(hid_t dset_id)

Cython layer (h5d.pyx):
    cdef class DatasetID(ObjectID):
        def get_storage_size(self):
            return H5Dget_storage_size(self.id)

High-level Python layer (_hl/dataset.py):
    class Dataset(HLObject):
        @property
        def storagesize(self):
            return self.id.get_storage_size()

Ø Low-level identifier classes (DatasetID, FileID) back the high-level classes (Dataset, File, Group)
32. H5py vs. HDF5 Performance

H5py performance as a fraction of HDF5 performance:

                          Single Node    Multi-node
Metadata
  1k File Creation        63.8%
  1k Object Scanning      60.0%
Independent I/O
  Weak Scaling            97.8%          100%
  Strong Scaling          100%           97.1%
Collective I/O
  Weak Scaling            100%           90%
  Strong Scaling          98.6%          87%

Question: when you gain productivity, how much performance can you afford to lose?
33. HPC I/O Software Stack
(Recap of the I/O software stack from slide 6: Application -> Productive Interface -> High-Level I/O Library -> I/O Middleware -> I/O Forwarding -> Parallel File System -> I/O Hardware.)
34. Get to Know Your I/O: Warp IO
Ø Characteristics:
Ø Number of files
Ø Size per file
Ø Number of processes
Ø I/O API
(Figure: Warp I/O pattern across iteration 0 and iteration 1.)
Ø 172–600 MB per file
35. Leverage I/O Profiling Tool: Darshan
Ø Lightweight, scalable I/O profiling tool
Ø Goal: to observe the I/O patterns of the majority of applications running on production HPC platforms, without perturbing their execution, with enough detail to gain insight and aid in performance debugging
Ø Majority of applications – transparent integration with the system build environment
Ø Without perturbation – bounded use of resources (memory, network, storage); no communication or I/O prior to job termination; compression
Ø Adequate detail:
Ø Basic job statistics
Ø File access information from multiple APIs
36. The Technology behind Darshan
Ø Intercepts I/O functions using link-time wrappers
Ø No code modification
Ø Can be transparently enabled in MPI compiler scripts
Ø Compatible with all major C, C++, and Fortran compilers
Ø Records statistics independently at each process, for each file
Ø Bounded memory consumption
Ø Compact summary rather than a verbatim record
Ø Collects, compresses, and stores results at shutdown time
Ø Aggregates shared-file data using a custom MPI reduction operator
Ø Compresses the remaining data in parallel with zlib
Ø Writes results with collective MPI-IO
Ø The result is a single gzip-compatible file containing characterization information
Ø Works on Linux clusters, Blue Gene, and Cray systems
37. Darshan Analysis Example
Example job: hdf5writeTest (2/4/2016); jobid: 42375, uid: 58179, nprocs: 64, runtime: 84 seconds

(Figure: first page of the report, with four panels: (1) average I/O cost per process, split into read, write, metadata, and other time (including application compute) for POSIX and MPI-IO; (2) I/O operation counts (read, write, open, stat, seek, mmap, fsync) for POSIX, MPI-IO independent, and MPI-IO collective; (3) histogram of read and write access sizes from 0–100 bytes up to 1 GB+; (4) I/O pattern: total vs. sequential vs. consecutive operations.)

Most Common Access Sizes
    access size    count
    1048576        65331
    272            1
    544            1
    328            1

File Count Summary (estimated by I/O access offsets)
    type               number of files    avg. size    max size
    total opened       1                  64G          64G
    read-only files    0                  0            0
    write-only files   1                  64G          64G
    read/write files   0                  0            0
    created files      1                  64G          64G

Ø The darshan-job-summary tool produces a 3-page PDF file that summarizes job I/O behavior:
1. Run time
2. Percentage of runtime in I/O
3. Access type histograms
4. Access size histogram
5. File usage
39. How to Use Darshan
Ø How to link with Darshan
Ø Compile a C, C++, or Fortran program that uses MPI
Ø Run the application
Ø Look for the Darshan log file
Ø It will be in a particular directory (depending on your system's configuration)
– <dir>/<year>/<month>/<day>/<username>_<appname>*.darshan*
Ø Mira: see /projects/logs/darshan/
Ø Edison: see /scratch1/scratchdirs/darshanlogs/
Ø Cori: see /global/cscratch1/sd/darshanlogs/
Ø Use the Darshan command-line tools to analyze the log file
Ø The application must run to completion and call MPI_Finalize() to generate a log file
40. Optimizing I/O from File System Layer: Lustre
Ø File striping is a way to increase I/O performance
Ø Increases bandwidth, because multiple processes can access the file simultaneously
Ø Allows storing large files that would take more space than a single OST provides
(Figure courtesy of NICS)
41. Optimizing I/O from File System Layer: Lustre
Ø Default striping: 1 MB stripe size, 1 OST
Ø lfs getstripe
Ø lfs setstripe -c 100 -S 8m
Ø Chunks the file into 8 MB blocks and distributes them across 100 OSTs
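The effect of these settings can be sketched with a little arithmetic (an illustration of round-robin striping; actual Lustre layouts can be more elaborate): with stripe size S and stripe count C, byte offset `off` lands on OST index `(off // S) % C`.

```python
# Round-robin striping sketch: which OST (by stripe index within the
# layout) holds a given byte offset of the file.

def ost_for_offset(offset, stripe_size, stripe_count):
    """OST index holding the given file offset under round-robin striping."""
    stripe_index = offset // stripe_size
    return stripe_index % stripe_count

MB = 1024 * 1024
# `lfs setstripe -c 100 -S 8m`: 8 MB stripes over 100 OSTs.
stripe_size, stripe_count = 8 * MB, 100

assert ost_for_offset(0, stripe_size, stripe_count) == 0
assert ost_for_offset(8 * MB, stripe_size, stripe_count) == 1    # 2nd stripe
assert ost_for_offset(800 * MB, stripe_size, stripe_count) == 0  # wraps around
# A 5 GB file fills only 640 stripes, so each OST holds at most 7 of them:
# spreading such a small file over 100 OSTs buys little bandwidth.
```

This is why the next slide warns against striping relatively small files over too many OSTs.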
42. Optimizing I/O from File System Layer: Lustre
Ø More OSTs generally help
Ø Striping a relatively small file over too many OSTs is not good
Ø Communication overhead
Ø Saturated I/O bandwidth at another layer, e.g., the compute nodes
Ø Storage stragglers or a bad OST
(Figures: writing a 5 GB file vs. writing a 100 GB file)
43. Optimizing I/O from File System Layer: Lustre
Ø Empirical recommendations on Cori @ NERSC
Ø File per process -> use default striping
Ø Single shared file ->
45. Optimizing I/O from the I/O Middleware Layer: MPI-IO
Ø Aggregate small and/or noncontiguous I/O into larger contiguous I/O
Ø Collective buffer size
Ø Collective buffer nodes: the actual number of I/O (aggregator) processes
Ø Use a customized MPI-IO that better leverages the underlying file system
Ø Cray's MPI-IO knows its Lustre better, e.g., it reduces contention by exposing Lustre data-layout information to the MPI-IO layer
Ø How to pass the hints
Ø Through an environment variable:
setenv MPICH_MPIIO_HINTS "*:romio_cb_write=enable:romio_ds_write=disable"
Ø With MPI_Info_set:
MPI_Info_set(info, "striping_factor", "4");
MPI_Info_set(info, "cb_nodes", "4");
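For the environment-variable route, the hint string is just colon-separated key=value pairs after a file pattern; a tiny helper (illustrative only, not part of MPICH) makes the format explicit:

```python
# Build the MPICH_MPIIO_HINTS value from a dict of hints, using the
# "<file-pattern>:key=value:key=value" syntax shown on this slide.

def mpiio_hints_string(hints, file_pattern="*"):
    """Format MPI-IO hints for the MPICH_MPIIO_HINTS environment variable."""
    parts = [f"{key}={value}" for key, value in hints.items()]
    return ":".join([file_pattern] + parts)

env_value = mpiio_hints_string({
    "romio_cb_write": "enable",    # turn on collective buffering for writes
    "romio_ds_write": "disable",   # turn off data sieving for writes
})
assert env_value == "*:romio_cb_write=enable:romio_ds_write=disable"
# e.g. set os.environ["MPICH_MPIIO_HINTS"] = env_value before launching.
```

The `*` pattern applies the hints to every file; MPICH also accepts per-file patterns in the same syntax.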
47. Optimizing MPI-IO on Cori Haswell vs. KNL
47
Ø Different colors: Different number of aggregators
Ø X-axis : collecKve buffer size
About this test:
Ø MPI-IO, collecKve IO
Ø 486GB file
Ø 32 processes per node
Ø 32 nodes
We recommend:
Ø 4 aggregator per node on Haswell
Ø 8 aggregator per node on KNL
Ø Check our paper in CUG’17
J. Liu, etc. Understanding the IO Performance Gap Between Cori KNL and Haswell, CUG’17
48. Optimizing I/O from I/O Middle Layer: Guidelines
Ø Limit the number of files (less metadata and easier to post-process)
Ø Make large and contiguous requests
Ø Avoid small accesses
Ø Avoid noncontiguous accesses
Ø Avoid random accesses
Ø Prefer collective I/O to independent I/O (especially if the operations can be aggregated into single large contiguous requests)
Ø Use derived datatypes and file views to ease the MPI I/O collective work
Ø Try MPI I/O hints (especially the collective buffering optimization; disabling data sieving is also very often a good idea; also useful for libraries based on MPI-IO)
Credits: Philippe Wautelet @ IDRIS
49. Optimizing I/O from High Level Interface: HDF5
Ø The MPI-IO layer's optimization guidelines generally apply to the HDF5 layer
Ø Collective metadata operations, available in HDF5 1.10
Ø For reads, the library uses one rank to read the metadata and broadcasts it to all other ranks
Ø For writes, it constructs an MPI derived datatype and writes collectively in a single call
Ø Increase page buffering
Ø H5Pset_page_buffer_size
Ø Stay tuned for the next HDF5 talk at 10am
50. Optimizing I/O from Python Interface: H5py
Ø Optimal HDF5 file creation
Ø Use the low-level API in H5py
Ø Get closer to the HDF5 C library for fine tuning
(Figure: 2.25x speedup)
51. Optimizing I/O from Python Interface: H5py
Ø Speed up the I/O with collective I/O
Using 1k processes to write a 1 TB file, collective I/O achieved a 2x speedup on Cori
52. Optimizing I/O from Python Interface: H5py
Ø Avoid type casting in H5py
Reduced I/O time from 527 seconds to 1.3 seconds
53. Object Store for HPC
Ø Amazon S3 has been hugely successful in supporting diverse applications, e.g., Instagram, Dropbox
Ø HPC file systems rely on strong POSIX semantics, which hinder performance, e.g., scalability
Ø Object store operations:
Ø Put: creates a new object and fills it with data
Ø Get: retrieves the data based on the object ID
Ø Benefit: scalability
Ø Lockless: objects are immutable and write-once, so there is no need to lock before a read
Ø Fast lookup, based on a simple hash of the object ID
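The put/get model above can be sketched as a toy in-memory object store (illustrative only; real systems such as S3 or DDN WOS are distributed, and the content-derived ID here is just one possible scheme):

```python
# Toy object store: objects are immutable and write-once, and lookup is a
# hash-keyed dictionary access, so no locking is needed before a read.
import hashlib

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Create a new immutable object; return its ID."""
        # Content-derived ID: identical data always yields the same ID.
        obj_id = hashlib.sha256(data).hexdigest()
        # Write-once: an existing object is never modified.
        self._objects.setdefault(obj_id, data)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """Retrieve data by object ID (a constant-time hash lookup)."""
        return self._objects[obj_id]

store = ObjectStore()
oid = store.put(b"simulation checkpoint")
assert store.get(oid) == b"simulation checkpoint"
assert store.put(b"simulation checkpoint") == oid  # write-once, same ID
```

Because objects never change after `put`, readers need no coordination with writers, which is the scalability argument on this slide.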
54. Object Store for HPC
Ø Disadvantages
Ø Data cannot be modified, while most HPC applications want to read and write data in place
Ø Limited metadata support (e.g., user info, access permissions); requires a database layer
(Figure courtesy of Glenn Lockwood)
DDN WOS, Object Store Testbed at NERSC (TBA soon)
Contact me or Damian Hazen if you are interested
55. Burst Buffer
Ø Burst Buffer on Cori
Ø 1.7 TB/s of peak I/O performance with 28M IOPS
Ø 1.8 PB of storage
Ø NVRAM-based "Burst Buffer" (BB) as an intermediate layer
Ø Handles I/O spikes without a huge PFS (stages to the PFS asynchronously)
Ø Underlying media support challenging I/O
Ø Software for on-demand file systems scales better than a large POSIX PFS
Ø Cray DataWarp software allocates storage to users per job
Ø Users see an on-demand POSIX file system, striped across nodes
Ø Can specify data to stage in/out from Lustre while the job is in the queue
57. Burst Buffer Use Case: H5Boss in Astronomy

(Figure: cost in seconds of each workflow step (file open, fiber object copy, catalog query & copy) on Lustre (Cori) vs. the Burst Buffer (BB).)

• BOSS: Baryon Oscillation Spectroscopic Survey, from SDSS
• Perform typical randomly generated queries to extract a small number of stars/galaxies from millions
• Workflows involve 1000s of file opens/closes and random, small read/write I/O
• Run on the final release of the complete SDSS-III BOSS dataset
– 2393 HDF5 files, ~3.2 TB total
• 4.4 TB Burst Buffer on 22 nodes
• Lower I/O times on the Burst Buffer
• 5.5x speedup for the entire workflow