
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's Gordon

Comparing today's burst buffers, such as the Cray DataWarp-based burst buffer deployed on NERSC's Cori, to the proto-burst buffer deployed on SDSC's Gordon supercomputer in 2012.


  1. 1. Glenn K. Lockwood, Ph.D. March 6, 2015
  2. 2. Who Am I? <= 2012: Materials scientist • Interfacial chemistry, nanoporous systems • Molecular dynamics of inorganic materials 2012 - 2014: San Diego Supercomputer Center • Specialist in data-intensive computing • Hadoop, HIVE, Pig, Mahout, Spark... • Parallel R • Operational workload analysis • System and infrastructure design • Emerging technologies • Bioinformatics and genomics • Industry consulting 2014 - 2015: Bay area startup • Software and release engineering • Devops and system engineering, HPC integration >= 2015: NERSC
  3. 3. What am I talking about? • Gordon: the world's first flash supercomputer™ • Deployed in 2012 at SDSC • 1024-node cluster (Appro/Cray) • 1024 x 300 GB SSDs via iSER (iSCSI) • Dedicated InfiniBand fabric for I/O • 100 GB/sec to Lustre
  4. 4. Burst Buffers and the Gordon Architecture
  5. 5. Burst Buffer Possibilities (diagram of the I/O path: compute node, I/O node with I/O processor, storage fabric, and storage server, connected by a high-speed network, showing three places flash can live): Flash in the compute node • SDSC Trestles (2011) • SDSC Comet (2014) • ALCF Theta (2016) • OLCF Summit (2018) • ALCF Aurora (2018); Flash in the I/O node • SDSC Gordon (2012) • NERSC Cori (2016) • ALCF Aurora (2018); Flash in the storage system • ALCF GPFS+AFM (sort of)
  6. 6. Burst Buffer Architecture Concept (diagram): compute nodes (CN) connect over a high-speed network to dedicated burst buffer nodes (BB), each holding SSDs, and to I/O nodes (ION), each with NICs onto the storage fabric, which in turn reaches the storage servers (Lustre OSSs/OSTs).
  7. 7. The Gordon Concept (same diagram components: CNs, BB nodes with SSDs, IONs, storage fabric, Lustre OSSs/OSTs): • Combine BB nodes and IO nodes • Attach compute nodes to the BB/IO node • Maximum locality of compute and data • Connect high-locality compute+data units in a scalable topology
  8. 8. The Gordon Building Block: compute node with 16 cores (Sandy Bridge), 64 GB DDR3, and 2x QDR IB HCAs; I/O node with 12 cores (Westmere), 16x 300 GB SSDs, 2x QDR IB HCAs, and 2x 10GbE ports.
  9. 9. Gordon IO Subsystem • 4x4x4 torus • 64 IO nodes (LNET routers*) • 1 hop to Lustre max* • 100 GB/s to Lustre* • 1024 provisionable SSDs • all over dedicated, secondary InfiniBand fabric * not entirely true
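  For concreteness, here is a minimal sketch of how one of these provisionable SSDs could be exported from an I/O node over iSER and mounted on a compute node using stock tgtadm/iscsiadm; the IQN, device names, and IP addresses are hypothetical, and Gordon's actual provisioning tooling was site-specific.

    # On the I/O node: export one SSD as an iSER target (hypothetical IQN and device)
    tgtadm --lld iser --mode target --op new --tid 1 \
        --targetname iqn.2012-01.edu.sdsc.gordon:ion1.ssd0
    tgtadm --lld iser --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/sdb
    tgtadm --lld iser --mode target --op bind --tid 1 --initiator-address 10.10.0.101

    # On the compute node: discover the target, switch the session to the iSER
    # transport, log in, then format and mount the imported block device as scratch
    iscsiadm -m discovery -t sendtargets -p 10.10.0.1
    iscsiadm -m node -T iqn.2012-01.edu.sdsc.gordon:ion1.ssd0 -p 10.10.0.1 \
        -o update -n iface.transport_name -v iser
    iscsiadm -m node -T iqn.2012-01.edu.sdsc.gordon:ion1.ssd0 -p 10.10.0.1 --login
    mkfs.xfs /dev/sdc && mount /dev/sdc /scratch/ssd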
  10. 10. Experiences with the flash-based file system on Gordon
  11. 11. 1. Checkpointing for big jobs (proto-burst buffer) 2. Scratch for single-node jobs (e.g., Gaussian) 3. Large flash aggregates (BigFlash) 4. Persistent services SSD Use Cases in Practice
  12. 12. 1. Checkpointing for big jobs (proto-burst buffer) 2. Scratch for single-node jobs (e.g., Gaussian) 3. Large flash aggregates (BigFlash) 4. Persistent services • preventative measure, not capability • difficult to quantify benefit • very cumbersome for users SSD Use Cases in Practice
  13. 13. Proto-burst buffer: Staging
      ### Step 1. Distribute input data to all nodes (if necessary)
      for node in $(/usr/bin/uniq $PBS_NODEFILE)
      do
          echo "$(/bin/date) :: Copying input data to node $node"
          if [ $PARALLEL_COPY -ne 0 ]; then
              scp $INPUT_FILES $node:$LOCAL_SCRATCH/ &
          else
              scp $INPUT_FILES $node:$LOCAL_SCRATCH/
          fi
      done
      wait

      ### Step 2. Run desired code
      mpirun_rsh -np 32 ./lmp_gordon < inputs.txt

      ### Step 3. Flush contents of each node's SSD back to workdir
      nn=0
      for node in $(/usr/bin/uniq $PBS_NODEFILE)
      do
          echo "$(/bin/date) :: Copying output data from node $node"
          command="cd $LOCAL_SCRATCH && tar cvf $PBS_O_WORKDIR/node$nn-output.tar *"
          if [ $PARALLEL_COPY -ne 0 ]; then
              ssh $node "$command" &
          else
              ssh $node "$command"
          fi
          let "nn++"
      done
      wait
      https://github.com/sdsc/sdsc-user/blob/master/jobscripts/gordon/mpi-on-ssds.qsub
  14. 14. Proto-burst buffer: Staging (the same script as on the previous slide, annotated): asking users to turn a one-line job script into 60(!) SLOC.
  15. 15. Proto-burst buffer: Async I/O
      BACKUP_INTERVAL=1h
      backup() {
          # loop forever
          while true
          do
              sleep $BACKUP_INTERVAL
              echo "backing up at $(date)"
              # copy *.chk files from scratch back to job directory
              rsync -avz $GAUSS_SCRDIR/*.chk $PBS_O_WORKDIR/
              # can also copy both *.chk and *.rwf with the following command
              #rsync -avz $GAUSS_SCRDIR/*.chk $GAUSS_SCRDIR/*.rwf $PBS_O_WORKDIR/
          done
      }
      backup &
      g09 < input.com > output.txt
      https://github.com/sdsc/sdsc-user/blob/master/jobscripts/gordon/mpi-on-ssds.qsub
  16. 16. 1. Checkpointing for big jobs (proto-burst buffer) 2. Scratch for single-node jobs (e.g., Gaussian) 3. Large flash aggregates (BigFlash) 4. Persistent services • obvious benefit over parallel fs • Gaussian as a representative application, 880 test problems SSD Use Cases in Practice
  17. 17. Do SSDs help for local scratch? In 89% of cases, spinning disk would have been sufficient. (880 Gaussian test problems, each run five times. Data courtesy R. S. Sinkovits, San Diego Supercomputer Center.)
  18. 18. How much does iSCSI hurt? Gaussian: • ~75% of cases aren't hurt • ~10% show a > 10% speedup with direct-attached SSDs. Raw performance [1]: • up to 20% loss of bandwidth • ~50% loss of IOPS. [1] Cicotti, P. et al. Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Proceedings of the 1st Workshop on Architectures and Systems for Big Data (ASBD '11), 2011, 13–18. (880 Gaussian test problems, each run five times. Plotted data courtesy R. S. Sinkovits, San Diego Supercomputer Center.)
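  The raw-performance numbers above come from the Cicotti et al. study; purely as an illustration, the kind of comparison involved can be reproduced with fio against the iSER-attached block device and again against the same SSD model attached directly (the device name here is hypothetical).

    # 4 KiB random reads: exposes the IOPS penalty added by the iSCSI/iSER layer
    fio --name=randread --filename=/dev/sdc --direct=1 --ioengine=libaio \
        --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
        --time_based --group_reporting

    # 1 MiB sequential reads: exposes the bandwidth penalty
    fio --name=seqread --filename=/dev/sdc --direct=1 --ioengine=libaio \
        --rw=read --bs=1m --iodepth=8 --runtime=60 --time_based --group_reporting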
  19. 19. 1. Checkpointing for big jobs (proto-burst buffer) 2. Scratch for single-node jobs (e.g., Gaussian) 3. Large flash aggregates (BigFlash) 4. Persistent services • true capability feature • what new problems can these tackle? SSD Use Cases in Practice
  20. 20. BigFlash Aggregate (diagram: the 16 flash nodes of a building block). 16x SSDs: • 4.4 TB RAID0 array • 3.8 GB/s bandwidth • 200,000 IOPS
  21. 21. BigFlash Aggregate (diagram: the same building block reconfigured as 15 NoFlash nodes plus one BigFlash node that receives all of the SSDs). 16x SSDs: • 4.4 TB RAID0 array • 3.8 GB/s bandwidth • 200,000 IOPS
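  A rough sketch of how such an aggregate could be assembled on the receiving node with stock Linux tools (mdadm RAID0 plus XFS); the device names, chunk size, and mount point are illustrative, not Gordon's actual configuration recipe.

    # Stripe the 16 imported SSDs into one RAID0 array and expose it as local scratch
    mdadm --create /dev/md0 --level=0 --raid-devices=16 --chunk=256 /dev/sd[b-q]
    mkfs.xfs -f /dev/md0
    mkdir -p /scratch/bigflash
    mount -o noatime,nodiratime /dev/md0 /scratch/bigflash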
  22. 22. BigFlash for Bioinformatics: SAMtools sort (out-of-core genome sort). Each thread: • breaks a 110 GB file into 600-800 files • reads and re-reads those files repeatedly. Per node (16 cores): • 1.8 TB of input data • 3.6 TB of intermediate data • 10k-12k files opened and closed repeatedly. (Plot, walltime in hrs: max = 3,177 GB, mean IOPS = 3,400, mean read = 900 MB/s, mean write = 180 MB/s)
  23. 23. BigFlash for Bioinformatics: Can Lustre handle this? • 3.6 TB of intermediate data? • 900 MB/sec per node? • 10k opens/closes per node? Were SSDs necessary, or could striped HDDs meet spec? • 3.6 TB of intermediate data? • 900 MB/sec per node? • 10k opens/closes per node? (Same plot, walltime in hrs: max = 3,177 GB, mean IOPS = 3,400, mean read = 900 MB/s, mean write = 180 MB/s)
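  As a usage illustration of the workload just described (written in current samtools 1.x syntax rather than the 0.1.x release in use at the time, with hypothetical paths), pointing the sort's temporary files at the BigFlash mount is what keeps the thousands of intermediate file opens off Lustre.

    # Out-of-core coordinate sort: -T puts each thread's 600-800 temporary chunk files
    # on the striped SSD array; -m caps per-thread memory and so controls how much spills
    samtools sort -@ 16 -m 2G \
        -T /scratch/bigflash/tmp/sample \
        -o /scratch/bigflash/sample.sorted.bam \
        /scratch/bigflash/sample.bam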
  24. 24. 1. Checkpointing for big jobs (proto-burst buffer) 2. Scratch for single-node jobs (e.g., Gaussian) 3. Large flash aggregates (BigFlash) 4. Persistent services SSD Use Cases in Practice
  25. 25. Persistent Services on SSDs • "Gordon ION" Projects – 1 year allocation • Exclusive access to – 1 ION (12 Westmere-EP cores, 48 GB DDR3) – 16x SSDs – N compute nodes, where N <= 16 • Batteries included – 1 consultant (~0.1 FTE) to run interference – 1 systems engineer (charity) to do root-only configuration
  26. 26. Persistent Services on SSDs • Protein Data Bank – Apache httpd + Tomcat – "pairwise 3D protein structure alignments" stored in MySQL • UCSD Network Telescope – time series for 100k+ metrics – Graphite stores one Whisper file per metric (100k+ files) • OpenTopography – on-demand generation of 3D elevation models – out-of-core calculation triggered via the web – middleware to stage data from Microsoft Azure to Gordon • IntegromeDB – PostgreSQL fed by SmartCrawler and Lucene – 5k tables, 500 billion rows, 50 TB of data – index stored on SSD, data on Lustre
  27. 27. SSDs in Practice: Having SSDs benefits some of the applications some of the time. BigFlash provides a unique capability. ...but who is using them?
  28. 28. SSD Utilization 2013/2014 (usage charts; labeled segments: Gaussian, QChem, checkpointing (proto-burst buffering), idea.org ngram analysis)
  29. 29. BigFlash Utilization 2013/2014 (usage charts; labeled segments: Gaussian, samtools sort)
  30. 30. TGT Resource Requirements
  31. 31. So are SSDs utilized? Insights: • SSD load is generally not high, but... • when SSD load is high, the SSD is overloaded • better balancing could be done. Caveats: • load was sampled only once an hour • using system load to measure I/O is imperfect
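  The caveats above invite an obvious refinement; a minimal sketch of finer-grained, I/O-specific sampling with iostat from the sysstat package (the log path is hypothetical, and the first report covers the interval since boot).

    # Report per-device throughput, IOPS, and %util every 60 seconds with timestamps,
    # rather than inferring I/O activity from hourly system-load samples
    iostat -t -x -d 60 >> /var/log/ssd-iostat.log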
  32. 32. On Gordon's Proto-Burst Buffer: • a new capability for jobs that can't run on Lustre • iSCSI reduces performance (20% bandwidth, 50% IOPS) • middleware to improve usability is critical • SSDs vs. HDDs: lots of overlap in use cases. Longer term: • balance the ratio of SSDs to compute nodes • non-PFS HDDs are often good enough • an optimized mix of provisionable HDDs and SSDs • ...though this becomes untrue once PMR HDDs are EOL
  33. 33. Experiences with the flash-based file system on Gordon
  34. 34. Gordon Node Block Diagram (two Xeon E5 sockets, two ConnectX-3 HCAs on a PCIe 2.0 riser): rail0 for MPI (~3.8 GB/sec), rail1 for Lustre and iSCSI (~3.2 GB/sec). The idea is that two rails will 1. prevent interference between MPI and IO, and 2. enhance performance of communication-bound applications.
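  For MVAPICH2 (the mpirun_rsh stack used in the job scripts above), rail selection can be expressed with environment variables; a sketch, with a hypothetical HCA name, of pinning MPI to rail0 so that rail1 stays free for Lustre and iSER traffic.

    # Pass the rail choice to every rank via mpirun_rsh's KEY=VALUE arguments;
    # MPI stays on the first HCA (rail0) while Lustre/iSER use the second (rail1)
    mpirun_rsh -np 32 -hostfile $PBS_NODEFILE MV2_IBA_HCA=mlx4_0 ./lmp_gordon < inputs.txt

    # The dual-rail MPI experiments instead let MPI stripe across both HCAs:
    # mpirun_rsh -np 32 -hostfile $PBS_NODEFILE MV2_NUM_HCAS=2 ./lmp_gordon < inputs.txt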
  35. 35. MPI Send/Recv Microbenchmark Running MPI traffic over both IB rails lets us observe effects of contention Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus network on the Gordon Supercomputer. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment - XSEDE'14 (2014) 1–6.
  36. 36. P3DFFT: Single Switch Performance. DNS kernel: • all-to-all, all the time • bandwidth limited. Promising results: • 1.7x speedup of communication • 1.3x speedup overall. (Choi, D. J., et al., XSEDE'14.)
  37. 37. P3DFFT: Single Switch vs. Many Switches (2σ = 1.92 mins). (Choi, D. J., et al., XSEDE'14.)
  38. 38. P3DFFT: Dual-rail Disaster. Running MPI over both rails led to Lustre evictions, iSCSI failures, and general panic. • Separating IO onto its own rail works! • It ensures IO performance doesn't suffer because of FFTs • IO reliability improves
  39. 39. Lessons learned: Multirail Fabrics • Better performance for bandwidth-bound applications • ...when those applications aren’t causing IO to fail • More reliable IO without sacrificing MPI performance
  40. 40. Experiences with the flash-based file system on Gordon
  41. 41. Comparing the Proto-Burst Buffer to Cori: Hardware Features
      Feature              | Cori                   | Gordon
      Flash location       | On-fabric              | On-fabric
      RPC level            | Objects via DVS        | Blocks via iSER
      POSIX interface      | DWFS + XFS             | XFS
      Namespace/metadata   | Server-side (XFS)      | Client-side (XFS)
      BB nodes             | 144 (P1) / 288 (P2)    | 64
      SSDs per node        | 2x Intel P3608         | 16x Intel 710
      Capacity: BB / DRAM  | 4.5x (P1) / 1.7x (P2)  | 4.6x
  42. 42. Comparing the Proto-Burst Buffer to Cori: Performance
      Feature                   | Cori                    | Gordon
      SSDs per node             | 2x Intel P3608          | 16x Intel 710
      SSD capacity per node     | 6.4 TB                  | 4.8 TB
      SSD bandwidth per node    | 6 GB/sec                | 4 GB/sec
      SSD IOPS per node (r/w)   | 89K/89K                 | 200K/33K
      PFS bandwidth per node    | 2.1 GB/sec              | 1.6 GB/sec
      Total bandwidth: BB/PFS   | 1.16x (P1) / 2.32x (P2) | 2.51x
  43. 43. Comparing the Proto-Burst Buffer to Cori: Software Capabilities
      Feature                         | Cori                  | Gordon
      N-to-N I/O (file per process)   | Yes                   | Yes
      N-to-1 I/O (single shared file) | Yes                   | No
      Provisionable                   | Yes (at job time)     | Yes(ish) (manual)
      Persistent reservations         | Yes (no hard limits)  | Yes (up to 4.4 TB)
      Asynchronous staging            | Yes                   | No
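  For contrast with the hand-rolled staging scripts shown earlier, the job-time provisioning and asynchronous staging rows above map onto Cray DataWarp batch directives on Cori; a minimal sketch with illustrative sizes and hypothetical paths.

    #!/bin/bash
    #SBATCH -N 2 -t 00:30:00
    # Ask DataWarp for a 200 GB striped scratch allocation for the life of the job
    #DW jobdw capacity=200GB access_mode=striped type=scratch
    # Stage data in before the job starts and out after it ends, asynchronously
    #DW stage_in source=/global/cscratch1/sd/user/inputs destination=$DW_JOB_STRIPED/inputs type=directory
    #DW stage_out source=$DW_JOB_STRIPED/outputs destination=/global/cscratch1/sd/user/outputs type=directory

    srun -n 64 ./a.out $DW_JOB_STRIPED/inputs $DW_JOB_STRIPED/outputs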
  44. 44. Acknowledgments (SDSC): Mahidhar Tatineni, Rick Wagner, Bob Sinkovits, Wayne Pfeiffer, Phil Papadopoulos, D.J. Choi, Christopher Irving, Trevor Cooper, Richard L. Moore. References: Cicotti, P., Bennet, J., Strande, S., Sinkovits, R. S., Snavely, A. Evaluation of I/O technologies on a flash-based I/O sub-system for HPC. Proceedings of the 1st Workshop on Architectures and Systems for Big Data (ASBD '11), 2011, 13–18. Choi, D. J., et al. Performance of Applications using Dual-Rail InfiniBand 3D Torus Network on the Gordon Supercomputer. Proceedings of the 2014 Annual Conference on Extreme Science and Engineering Discovery Environment (XSEDE '14), 2014, 1–6. Acknowledgments (NERSC): the NERSC Burst Buffer Team and the Cray DataWarp Team.
