The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's Gordon
Glenn K. Lockwood, Ph.D.
March 6, 2015
Who Am I?
<= 2012: Materials scientist
• Interfacial chemistry, nanoporous systems
• Molecular dynamics of inorganic materials
2012 - 2014: San Diego Supercomputer Center
• Specialist in data-intensive computing
• Hadoop, HIVE, Pig, Mahout, Spark...
• Parallel R
• Operational workload analysis
• System and infrastructure design
• Emerging technologies
• Bioinformatics and genomics
• Industry consulting
2014 - 2015: Bay area startup
• Software and release engineering
• Devops and system engineering, HPC integration
>= 2015: NERSC
What am I talking about?
• Gordon: the world's first flash supercomputer™
• Deployed in 2012 at SDSC
• 1024-node cluster (Appro/Cray)
• 1024 x 300 GB SSDs via iSER (iSCSI)
• Dedicated InfiniBand fabric for I/O
• 100 GB/sec to Lustre
Burst Buffers and the Gordon Architecture
Burst Buffer Possibilities
[Diagram: the generic I/O path — Compute Node → High-Speed Network → I/O Node (I/O Processor) → Storage Fabric → Storage Server — with flash attachable at each stage]
Flash in the compute node:
• SDSC Trestles (2011)
• SDSC Comet (2014)
• ALCF Theta (2016)
• OLCF Summit (2018)
• ALCF Aurora (2018)
Flash in the I/O node:
• SDSC Gordon (2012)
• NERSC Cori (2016)
• ALCF Aurora (2018)
Flash in the storage server:
• ALCF GPFS+AFM (sort of)
Burst Buffer Architecture Concept
[Diagram: compute nodes (CN) connect over the high-speed network to burst buffer nodes (BB + SSDs) and to I/O nodes (ION + NICs), which reach the Lustre OSSs/OSTs on the storage servers through the storage fabric]
The Gordon Concept
[Diagram: the same components, but each group of compute nodes (CN) attaches to a combined BB/IO node that holds both the SSDs and the NICs to the storage fabric and Lustre OSSs/OSTs]
• Combine BB nodes and IO nodes
• Attach compute nodes to BB/IO node
• Maximum locality of compute and data
• Connect high-locality compute+data units in scalable topology
The Gordon Building Block
Compute node: 16 cores (Sandy Bridge), 64 GB DDR3, 2x QDR IB HCAs
I/O node: 12 cores (Westmere), 16x 300 GB SSDs, 2x QDR IB HCAs, 2x 10GbE ports
Gordon IO Subsystem
• 4x4x4 torus
• 64 IO nodes (LNET routers*)
• 1 hop to Lustre max*
• 100 GB/s to Lustre*
• 1024 provisionable SSDs
• all over dedicated, secondary InfiniBand fabric
* not entirely true
Experiences with the flash-based file system on Gordon
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
4. Persistent services
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
   • preventative measure, not capability
   • difficult to quantify benefit
   • very cumbersome for users
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
4. Persistent services
Proto-burst buffer: Staging
### Step 1. Distribute input data to all nodes (if necessary)
for node in $(/usr/bin/uniq $PBS_NODEFILE)
do
echo "$(/bin/date) :: Copying input data to node $node"
if [ $PARALLEL_COPY -ne 0 ]; then
scp $INPUT_FILES $node:$LOCAL_SCRATCH/ &
else
scp $INPUT_FILES $node:$LOCAL_SCRATCH/
fi
done
wait
### Step 2. Run desired code
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt
### Step 3. Flush contents of each node's SSD back to workdir
nn=0
for node in $(/usr/bin/uniq $PBS_NODEFILE)
do
echo "$(/bin/date) :: Copying output data from node $node"
command="cd $LOCAL_SCRATCH && tar cvf $PBS_O_WORKDIR/node$nn-output.tar *"
if [ $PARALLEL_COPY -ne 0 ]; then
ssh $node "$command" &
else
ssh $node "$command"
fi
let "nn++"
done
wait
https://github.com/sdsc/sdsc-user/blob/master/jobscripts/gordon/mpi-on-ssds.qsub
Asking users to turn a one-line job script (shown below) into 60(!) SLOC
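For contrast, the one-line job script being replaced is essentially just the solver launch from Step 2 above; a minimal sketch, reusing the same illustrative executable and core count:
### Without SSD staging, the entire job script is the application launch itself
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt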
Proto-burst buffer: Async I/O
BACKUP_INTERVAL=1h
backup() {
# loop forever
while true
do
sleep $BACKUP_INTERVAL
echo "backing up at $(date)"
# copy *.chk files from scratch back to job directory
rsync -avz $GAUSS_SCRDIR/*.chk $PBS_O_WORKDIR/
# can also copy both *.chk and *.rwf with the following command
#rsync -avz $GAUSS_SCRDIR/*.chk $GAUSS_SCRDIR/*.rwf $PBS_O_WORKDIR/
done
}
backup &
g09 < input.com > output.txt
https://github.com/sdsc/sdsc-user/blob/master/jobscripts/gordon/mpi-on-ssds.qsub
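Because the watcher loop runs forever, the script relies on job teardown to stop it, and checkpoints written during the final interval may never be copied back. A minimal sketch of an explicit shutdown and final flush, reusing the same variables (an illustrative extension, not part of the published script):
backup &
BACKUP_PID=$!                                    # remember the watcher's PID
g09 < input.com > output.txt
kill "$BACKUP_PID"                               # stop the periodic backup loop
rsync -avz $GAUSS_SCRDIR/*.chk $PBS_O_WORKDIR/   # one last flush after g09 exits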
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
   • obvious benefit over parallel fs
   • Gaussian as a representative application, 880 test problems
3. Large flash aggregates (BigFlash)
4. Persistent services
Do SSDs help for local scratch?
89% of cases: spinning disk would have been sufficient
Data courtesy R. S. Sinkovits, San Diego Supercomputer Center
880 Gaussian test problems, five times each
How much does iSCSI hurt?
Gaussian:
• ~75% of cases aren’t hurt
• ~10% show >10% speedup with direct-attach SSD
Raw performance1:
• up to 20% loss of bandwidth
• ~50% loss of IOPS
1 Cicotti, P. et al. Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Proceedings of the 1st Workshop on Architectures and Systems for Big Data - ASBD ’11 (2011) 13–18.
Plotted data courtesy R. S. Sinkovits, San Diego Supercomputer Center
880 Gaussian test problems, five times each
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
   • true capability feature
   • what new problems can these tackle?
4. Persistent services
BigFlash
[Diagram: default provisioning — the 16 compute nodes of a building block are each a "Flash Node" with one SSD]
Aggregate 16x SSDs:
• 4.4 TB RAID0 array
• 3.8 GB/s bandwidth
• 200,000 IOPS
BigFlash
[Diagram: BigFlash provisioning — one node takes all 16 SSDs as a single BigFlash array; the other 15 nodes in the building block become "NoFlash"]
Aggregate 16x SSDs:
• 4.4 TB RAID0 array
• 3.8 GB/s bandwidth
• 200,000 IOPS
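On Gordon this aggregation was handled by the system's provisioning tooling, but conceptually it is just a software RAID0 of the I/O node's 16 iSER-attached SSDs with a local file system on top. A minimal sketch of that idea with standard Linux tools (the device names and mount point are assumptions for illustration; XFS matches the POSIX interface noted in the Cori comparison later):
### Assumed: the 16 iSER-attached SSDs enumerate as /dev/sdb ... /dev/sdq
mdadm --create /dev/md0 --level=0 --raid-devices=16 /dev/sd[b-q]
mkfs.xfs /dev/md0                    # one file system across the striped array
mkdir -p /scratch/bigflash
mount /dev/md0 /scratch/bigflash     # ~4.4 TB of aggregate flash scratch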
BigFlash for Bioinformatics
SAMtools sort (out-of-core genome sort)
Each thread:
• Breaks 110 GB file into 600-800 files
• Reads, re-reads those files repeatedly
Per node (16 cores):
• 1.8 TB input data
• 3.6 TB of intermediate data
• 10k-12k files opened+closed repeatedly
[Plot: per-node I/O vs. walltime (hrs) — max = 3,177 GB, μIOPS = 3,400, μread = 900 MB/s, μwrite = 180 MB/s]
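A sketch of how such a sort can be pointed at the BigFlash scratch so the intermediate files never touch Lustre (paths, thread count, and per-thread memory are illustrative assumptions, and the option syntax is that of samtools 1.x rather than necessarily the version used in this work):
### Keep temporary chunks and the sorted output on the RAID0 flash scratch
samtools sort -@ 16 -m 2G \
    -T /scratch/bigflash/tmp/genome \
    -o /scratch/bigflash/genome.sorted.bam \
    /scratch/bigflash/genome.bam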
BigFlash for Bioinformatics
Can Lustre handle this?
• 3.6 TB intermediate data?
• 900 MB/sec/node?
• 10k opens/closes per node?
Were SSDs necessary, or can striped HDDs meet spec?
• 3.6 TB intermediate data?
• 900 MB/sec/node?
• 10k opens/closes per node?
[Plot: same per-node I/O trace as the previous slide]
SSD Use Cases in Practice
1. Checkpointing for big jobs (proto-burst buffer)
2. Scratch for single-node jobs (e.g., Gaussian)
3. Large flash aggregates (BigFlash)
4. Persistent services
Persistent Services on SSDs
• "Gordon ION" Projects – 1 year allocation
• Exclusive access to
– 1 ION (12 Westmere-EP cores, 48 GB DDR3)
– 16x SSDs
– N compute nodes, where N <= 16
• Batteries included
– 1 consultant (~0.1 FTE) to run interference
– 1 systems engineer (charity) to do root-only configuration
Persistent Services on SSDs
• Protein Data Bank
– Apache httpd + Tomcat
– "pairwise 3D protein structure alignments" stored in MySQL
• UCSD Network Telescope
– timeseries for 100k+ metrics
– Graphite = 1 whisperfile per metric (100k+ files)
• OpenTopography
– on-demand generation of 3D elevation models
– out-of-core calculation triggered via web
– middleware to stage from Microsoft Azure to Gordon
• IntegromeDB
– PostgreSQL fed by SmartCrawler and Lucene
– 5k tables, 500 billion rows, 50 TB of data
– index stored on SSD, data on Lustre (see the sketch below)
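For the "index on SSD, data on Lustre" split, the standard PostgreSQL mechanism is tablespaces. A hedged sketch of what such a configuration could look like (the database, table, index, and path names are hypothetical, not the project's actual setup):
### Hypothetical: keep index pages on the SSD mount, leave table data where it is
psql -d integromedb <<'SQL'
CREATE TABLESPACE ssd_space LOCATION '/scratch/ssd/pg_indexes';
CREATE INDEX idx_interactions_gene ON interactions (gene_id) TABLESPACE ssd_space;
SQL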
SSDs in Practice
Having SSDs benefits some of the applications some of the time
BigFlash provides a unique capability
...but who is using them?
SSD Utilization 2013/2014
[Chart of SSD usage by workload — labels: Gaussian, QChem, checkpointing (proto-burst buffering), idea.org ngram analysis]
BigFlash Utilization 2013/2014
[Chart of BigFlash usage by workload — labels: Gaussian, samtools sort]
TGT Resource Requirements
So are SSDs utilized?
Insights:
• SSD load generally not high, but...
• when SSD load is high, it is overloaded
• Better balancing to be done
Caveats:
• Load sampling was once an hour
• Using system load to measure IO is imperfect
On Gordon's Proto-Burst Buffer:
• A new capability for jobs that can't run on Lustre
• iSCSI reduces performance (20% bw, 50% IOPS)
• Middleware to facilitate usability is critical
• SSDs vs. HDDs: lots of overlap in use cases
Longer Term:
• Balance ratio of SSDs to compute nodes
• Non-PFS HDDs may often be good enough
• optimized mix of provisionable HDD+SSD
• …this becomes untrue when PMR HDDs are EOL
Experiences with the flash-based file system on Gordon
Gordon Node Block Diagram
[Diagram: dual-socket Xeon E5 node with two ConnectX-3 HCAs on a PCIe 2.0 riser — rail0 for MPI (~3.8 GB/sec), rail1 for Lustre and iSCSI (~3.2 GB/sec)]
The idea is that two rails will:
1. prevent interference between MPI and IO
2. enhance performance of communication-bound applications
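On Gordon the MPI stack was MVAPICH2 (the mpirun_rsh launcher used in the job scripts above), which selects HCAs through environment variables. A sketch of pinning MPI to a single rail versus striping it across both, assuming the two ConnectX-3 adapters enumerate as mlx4_0 and mlx4_1 (device names are illustrative):
### Assumed: mlx4_0 = rail0 (MPI), mlx4_1 = rail1 (Lustre/iSCSI)
export MV2_IBA_HCA=mlx4_0        # keep MPI traffic on rail0 only
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt

### Dual-rail experiment: let MVAPICH2 stripe MPI traffic across both HCAs
unset MV2_IBA_HCA
export MV2_NUM_HCAS=2
mpirun_rsh -np 32 ./lmp_gordon < inputs.txt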
MPI Send/Recv Microbenchmark
Running MPI traffic over both IB rails lets us observe effects of contention
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer.
Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
P3DFFT: Single Switch Performance
DNS kernel
• All to all, all the time
• Bandwidth limited
Promising results:
• 1.7x speedup of communication
• 1.3x speedup overall
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer.
Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
P3DFFT: Single Switch vs. Many Switches
2σ = 1.92 mins
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer.
Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
P3DFFT: Dual-rail Disaster
Lustre evictions, iSCSI failures, general panic
• Separating IO works!
• Ensure IO performance doesn’t suffer because of FFTs
• IO reliability improves
Lessons learned: Multirail Fabrics
• Better performance for bandwidth-bound applications
• ...when those applications aren’t causing IO to fail
• More reliable IO without sacrificing MPI performance
Experiences with the flash-based file system on Gordon
Comparing the Proto-Burst Buffer to Cori: Hardware Features
Feature | Cori | Gordon
Flash location | On-fabric | On-fabric
RPC level | Objects via DVS | Blocks via iSER
POSIX interface | DWFS + XFS | XFS
Namespace/metadata | Server-side (XFS) | Client-side (XFS)
BB nodes | 144 (P1) / 288 (P2) | 64
SSDs per node | 2x Intel P3608 | 16x Intel 710
Capacity: BB / DRAM | 4.5x (P1) / 1.7x (P2) | 4.6x
Comparing the Proto-Burst Buffer to Cori: Performance
Feature | Cori | Gordon
SSDs per node | 2x Intel P3608 | 16x Intel 710
SSD capacity per node | 6.4 TB | 4.8 TB
SSD bandwidth per node | 6 GB/sec | 4 GB/sec
SSD IOPS per node (r/w) | 89K/89K | 200K/33K
PFS bandwidth per node | 2.1 GB/sec | 1.6 GB/sec
Total bandwidth: BB/PFS | 1.16x (P1) / 2.32x (P2) | 2.51x
Comparing the Proto-Burst Buffer to Cori: Software Capabilities
Feature | Cori | Gordon
N-to-N I/O (file per process) | Yes | Yes
N-to-1 I/O (single shared file) | Yes | No
Provisionable | Yes – at job time | Yes(ish) – manual
Persistent reservations | Yes – no hard limits | Yes – up to 4.4 TB
Asynchronous staging | Yes | No
Acknowledgments – SDSC
Mahidhar Tatineni
Rick Wagner
Bob Sinkovits
Wayne Pfeiffer
Phil Papadopoulos
D.J. Choi
Christopher Irving
Trevor Cooper
Richard L. Moore
References
Cicotti, P., Bennet, J., Strande, S., Sinkovits, R. S., Snavely, A. Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Proceedings of the 1st Workshop on Architectures and Systems for Big Data - ASBD'11 (2011) 13–18.
Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
Acknowledgments – NERSC
The NERSC Burst Buffer Team
The Cray DataWarp Team
