The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's Gordon
Glenn K. Lockwood, Ph.D.
March 6, 2015
Who Am I?
<= 2012: Materials scientist
• Interfacial chemistry, nanoporous systems
• Molecular dynamics of inorganic materials
2012 - 2014: San Diego Supercomputer Center
• Specialist in data-intensive computing
• Hadoop, HIVE, Pig, Mahout, Spark...
• Parallel R
• Operational workload analysis
• System and infrastructure design
• Emerging technologies
• Bioinformatics and genomics
• Industry consulting
2014 - 2015: Bay area startup
• Software and release engineering
• Devops and system engineering, HPC integration
>= 2015: NERSC
What am I talking about?
• Gordon: the world's first flash supercomputer™
• Deployed in 2012 at SDSC
• 1024-node cluster (Appro/Cray)
• 1024 x 300 GB SSDs via iSER (iSCSI)
• Dedicated InfiniBand fabric for I/O
• 100 GB/sec to Lustre
Burst Buffers and the Gordon Architecture
Burst Buffer Possibilities
[Diagram: the generic I/O path — Compute Node → High-Speed Network → I/O Node (I/O Processor) → Storage Fabric → Storage Server — with flash attachable at each stage]
Flash in the compute node:
• SDSC Trestles (2011)
• SDSC Comet (2014)
• ALCF Theta (2016)
• OLCF Summit (2018)
• ALCF Aurora (2018)
Flash in the I/O node:
• SDSC Gordon (2012)
• NERSC Cori (2016)
• ALCF Aurora (2018)
Flash in the storage server:
• ALCF GPFS+AFM (sort of)
Burst Buffer Architecture Concept
[Diagram: compute nodes (CN) connect over the high-speed network to burst buffer nodes (BB + SSDs) and to I/O nodes (ION + NICs), which reach the Lustre OSSs/OSTs on the storage servers through the storage fabric]
The Gordon Concept
[Diagram: the same components, but each group of compute nodes (CN) attaches to a combined BB/IO node that holds both the SSDs and the NICs to the storage fabric and Lustre OSSs/OSTs]
• Combine BB nodes and IO nodes
• Attach compute nodes to BB/IO node
• Maximum locality of compute and data
• Connect high-locality compute+data units in scalable topology
The Gordon Building Block
Compute node: 16 cores (Sandy Bridge), 64 GB DDR3, 2x QDR IB HCAs
I/O node: 12 cores (Westmere), 16x 300 GB SSDs, 2x QDR IB HCAs, 2x 10GbE ports
Gordon IO Subsystem
• 4x4x4 torus
• 64 IO nodes (LNET routers*)
• 1 hop to Lustre max*
• 100 GB/s to Lustre*
• 1024 provisionable SSDs
• all over dedicated, secondary InfiniBand fabric
* not entirely true
Experiences with the flash-based file system on Gordon
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
4. Persistent services
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
   • preventative measure, not capability
   • difficult to quantify benefit
   • very cumbersome for users
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
4. Persistent services
Proto-burst buffer: Staging
### Step 1. Distribute input data to all nodes (if necessary)
for node in $(/usr/bin/uniq $PBS_NODEFILE)
do
echo "$(/bin/date) :: Copying input data to node $node"
if [ $PARALLEL_COPY -ne 0 ]; then
scp $INPUT_FILES $node:$LOCAL_SCRATCH/ &
else
scp $INPUT_FILES $node:$LOCAL_SCRATCH/
fi
done
wait
### Step 2. Run desired code
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt
### Step 3. Flush contents of each node's SSD back to workdir
nn=0
for node in $(/usr/bin/uniq $PBS_NODEFILE)
do
echo "$(/bin/date) :: Copying output data from node $node"
command="cd $LOCAL_SCRATCH && tar cvf $PBS_O_WORKDIR/node$nn-output.tar *"
if [ $PARALLEL_COPY -ne 0 ]; then
ssh $node "$command" &
else
ssh $node "$command"
fi
let "nn++"
done
wait
https://github.com/sdsc/sdsc-user/blob/master/jobscripts/gordon/mpi-on-ssds.qsub
Asking users to turn a one-line job script (shown below) into 60(!) SLOC
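For contrast, the one-line job script being replaced is essentially just the solver launch from Step 2 above; a minimal sketch, reusing the same illustrative executable and core count:
### Without SSD staging, the entire job script is the application launch itself
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt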
Proto-burst buffer: Async I/O
BACKUP_INTERVAL=1h
backup() {
# loop forever
while true
do
sleep $BACKUP_INTERVAL
echo "backing up at $(date)"
# copy *.chk files from scratch back to job directory
rsync -avz $GAUSS_SCRDIR/*.chk $PBS_O_WORKDIR/
# can also copy both *.chk and *.rwf with the following command
#rsync -avz $GAUSS_SCRDIR/*.chk $GAUSS_SCRDIR/*.rwf $PBS_O_WORKDIR/
done
}
backup &
g09 < input.com > output.txt
https://github.com/sdsc/sdsc-user/blob/master/jobscripts/gordon/mpi-on-ssds.qsub
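Because the watcher loop runs forever, the script relies on job teardown to stop it, and checkpoints written during the final interval may never be copied back. A minimal sketch of an explicit shutdown and final flush, reusing the same variables (an illustrative extension, not part of the published script):
backup &
BACKUP_PID=$!                                    # remember the watcher's PID
g09 < input.com > output.txt
kill "$BACKUP_PID"                               # stop the periodic backup loop
rsync -avz $GAUSS_SCRDIR/*.chk $PBS_O_WORKDIR/   # one last flush after g09 exits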
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
   • obvious benefit over parallel fs
   • Gaussian as a representative application, 880 test problems
3. Large flash aggregates (BigFlash)
4. Persistent services
Do SSDs help for local scratch?
89% of cases: spinning disk would have been sufficient
Data courtesy R. S. Sinkovits, San Diego Supercomputer Center
880 Gaussian test problems, five times each
How much does iSCSI hurt?
Gaussian:
• ~75% of cases aren’t hurt
• ~10% show >10% speedup with direct-attach SSD
Raw performance1:
• up to 20% loss of bandwidth
• ~50% loss of IOPS
1 Cicotti, P. et al. Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Proceedings of the 1st Workshop on Architectures and Systems for Big Data - ASBD ’11 (2011) 13–18.
Plotted data courtesy R. S. Sinkovits, San Diego Supercomputer Center
880 Gaussian test problems, five times each
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
   • true capability feature
   • what new problems can these tackle?
4. Persistent services
BigFlash
[Diagram: default provisioning — the 16 compute nodes of a building block are each a "Flash Node" with one SSD]
Aggregate 16x SSDs:
• 4.4 TB RAID0 array
• 3.8 GB/s bandwidth
• 200,000 IOPS
BigFlash
[Diagram: BigFlash provisioning — one node takes all 16 SSDs as a single BigFlash array; the other 15 nodes in the building block become "NoFlash"]
Aggregate 16x SSDs:
• 4.4 TB RAID0 array
• 3.8 GB/s bandwidth
• 200,000 IOPS
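On Gordon this aggregation was handled by the system's provisioning tooling, but conceptually it is just a software RAID0 of the I/O node's 16 iSER-attached SSDs with a local file system on top. A minimal sketch of that idea with standard Linux tools (the device names and mount point are assumptions for illustration; XFS matches the POSIX interface noted in the Cori comparison later):
### Assumed: the 16 iSER-attached SSDs enumerate as /dev/sdb ... /dev/sdq
mdadm --create /dev/md0 --level=0 --raid-devices=16 /dev/sd[b-q]
mkfs.xfs /dev/md0                    # one file system across the striped array
mkdir -p /scratch/bigflash
mount /dev/md0 /scratch/bigflash     # ~4.4 TB of aggregate flash scratch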
BigFlash for Bioinformatics
SAMtools sort (out-of-core genome sort)
Each thread:
• Breaks 110 GB file into 600-800 files
• Reads, re-reads those files repeatedly
Per node (16 cores):
• 1.8 TB input data
• 3.6 TB of intermediate data
• 10k-12k files opened+closed repeatedly
[Plot: per-node I/O vs. walltime (hrs) — max = 3,177 GB, μIOPS = 3,400, μread = 900 MB/s, μwrite = 180 MB/s]
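A sketch of how such a sort can be pointed at the BigFlash scratch so the intermediate files never touch Lustre (paths, thread count, and per-thread memory are illustrative assumptions, and the option syntax is that of samtools 1.x rather than necessarily the version used in this work):
### Keep temporary chunks and the sorted output on the RAID0 flash scratch
samtools sort -@ 16 -m 2G \
    -T /scratch/bigflash/tmp/genome \
    -o /scratch/bigflash/genome.sorted.bam \
    /scratch/bigflash/genome.bam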
BigFlash for Bioinformatics
Can Lustre handle this?
• 3.6 TB intermediate data?
• 900 MB/sec/node?
• 10k opens/closes per node?
Were SSDs necessary, or can striped HDDs meet spec?
• 3.6 TB intermediate data?
• 900 MB/sec/node?
• 10k opens/closes per node?
[Plot: same per-node I/O trace as the previous slide]
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
4. Persistent services
Persistent Services on SSDs
• "Gordon ION" Projects – 1 year allocation
• Exclusive access to
– 1 ION (12 Westmere-EP cores, 48 GB DDR3)
– 16x SSDs
– N compute nodes, where N <= 16
• Batteries included
– 1 consultant (~0.1 FTE) to run interference
– 1 systems engineer (charity) to do root-only configuration
Persistent Services on SSDs
• Protein Data Bank
– Apache httpd + Tomcat
– "pairwise 3D protein structure alignments" stored in MySQL
• UCSD Network Telescope
– timeseries for 100k+ metrics
– Graphite = 1 whisperfile per metric (100k+ files)
• OpenTopography
– on-demand generation of 3D elevation models
– out-of-core calculation triggered via web
– middleware to stage from Microsoft Azure to Gordon
• IntegromeDB
– PostgreSQL fed by SmartCrawler and Lucene
– 5k tables, 500 billion rows, 50 TB of data
– index stored on SSD, data on Lustre (see the sketch below)
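For the "index on SSD, data on Lustre" split, the standard PostgreSQL mechanism is tablespaces. A hedged sketch of what such a configuration could look like (the database, table, index, and path names are hypothetical, not the project's actual setup):
### Hypothetical: keep index pages on the SSD mount, leave table data where it is
psql -d integromedb <<'SQL'
CREATE TABLESPACE ssd_space LOCATION '/scratch/ssd/pg_indexes';
CREATE INDEX idx_interactions_gene ON interactions (gene_id) TABLESPACE ssd_space;
SQL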
SSDs in Practice
Having SSDs benefits some of the applications some of the time
BigFlash provides a unique capability
...but who is using them?
SSD Utilization 2013/2014
[Chart of SSD usage by workload — labels: Gaussian, QChem, checkpointing (proto-burst buffering), idea.org ngram analysis]
BigFlash Utilization 2013/2014
[Chart of BigFlash usage by workload — labels: Gaussian, samtools sort]
TGT Resource Requirements
So are SSDs utilized?
Insights:
• SSD load generally not high, but...
• when SSD load is high, it is overloaded
• Better balancing to be done
Caveats:
• Load sampling was once an hour
• Using system load to measure IO is imperfect
On Gordon's Proto-Burst Buffer:
• A new capability for jobs that can't run on Lustre
• iSCSI reduces performance (20% bw, 50% IOPS)
• Middleware to facilitate usability is critical
• SSDs vs. HDDs: lots of overlap in use cases
Longer Term:
• Balance ratio of SSDs to compute nodes
• Non-PFS HDDs may often be good enough
• optimized mix of provisionable HDD+SSD
• …this becomes untrue when PMR HDDs are EOL
Experiences with the flash-based file system on Gordon
Gordon Node Block Diagram
[Diagram: dual-socket Xeon E5 node with two ConnectX-3 HCAs on a PCIe 2.0 riser — rail0 for MPI (~3.8 GB/sec), rail1 for Lustre and iSCSI (~3.2 GB/sec)]
The idea is that two rails will:
1. prevent interference between MPI and IO
2. enhance performance of communication-bound applications
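On Gordon the MPI stack was MVAPICH2 (the mpirun_rsh launcher used in the job scripts above), which selects HCAs through environment variables. A sketch of pinning MPI to a single rail versus striping it across both, assuming the two ConnectX-3 adapters enumerate as mlx4_0 and mlx4_1 (device names are illustrative):
### Assumed: mlx4_0 = rail0 (MPI), mlx4_1 = rail1 (Lustre/iSCSI)
export MV2_IBA_HCA=mlx4_0        # keep MPI traffic on rail0 only
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt

### Dual-rail experiment: let MVAPICH2 stripe MPI traffic across both HCAs
unset MV2_IBA_HCA
export MV2_NUM_HCAS=2
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt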
MPI Send/Recv Microbenchmark
Running MPI traffic over both IB rails lets us observe effects of contention
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer.
Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
P3DFFT: Single Switch Performance
DNS kernel
• All to all, all the time
• Bandwidth limited
Promising results:
• 1.7x speedup of communication
• 1.3x speedup overall
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer.
Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
P3DFFT: Single Switch vs. Many Switches
2σ = 1.92 mins
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer.
Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
P3DFFT: Dual-rail Disaster
Lustre evictions, iSCSI failures, general panic
• Separating IO works!
• Ensure IO performance doesn’t suffer because of FFTs
• IO reliability improves
Lessons learned: Multirail Fabrics
• Better performance for bandwidth-bound applications
• ...when those applications aren’t causing IO to fail
• More reliable IO without sacrificing MPI performance
Experiences with the flash-based file system on Gordon
Comparing the Proto-Burst Buffer to Cori: Hardware Features
Feature | Cori | Gordon
Flash location | On-fabric | On-fabric
RPC level | Objects via DVS | Blocks via iSER
POSIX interface | DWFS + XFS | XFS
Namespace/metadata | Server-side (XFS) | Client-side (XFS)
BB nodes | 144 (P1) / 288 (P2) | 64
SSDs per node | 2x Intel P3608 | 16x Intel 710
Capacity: BB / DRAM | 4.5x (P1) / 1.7x (P2) | 4.6x
Comparing the Proto-Burst Buffer to Cori: Performance
Feature | Cori | Gordon
SSDs per node | 2x Intel P3608 | 16x Intel 710
SSD capacity per node | 6.4 TB | 4.8 TB
SSD bandwidth per node | 6 GB/sec | 4 GB/sec
SSD IOPS per node (r/w) | 89K/89K | 200K/33K
PFS bandwidth per node | 2.1 GB/sec | 1.6 GB/sec
Total bandwidth: BB/PFS | 1.16x (P1) / 2.32x (P2) | 2.51x
Comparing the Proto-Burst Buffer to Cori: Software Capabilities
Feature | Cori | Gordon
N-to-N I/O (file per process) | Yes | Yes
N-to-1 I/O (single shared file) | Yes | No
Provisionable | Yes – at job time | Yes(ish) – manual
Persistent reservations | Yes – no hard limits | Yes – up to 4.4 TB
Asynchronous staging | Yes | No
Acknowledgments – SDSC
Mahidhar Tatineni
Rick Wagner
Bob Sinkovits
Wayne Pfeiffer
Phil Papadopoulos
D.J. Choi
Christopher Irving
Trevor Cooper
Richard L. Moore
References
Cicotti, P., Bennet, J., Strande, S., Sinkovits, R. S., Snavely, A. Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Proceedings of the 1st Workshop on Architectures and Systems for Big Data - ASBD'11 (2011) 13–18.
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
Acknowledgments – NERSC
The NERSC Burst Buffer Team
The Cray DataWarp Team
