Blue Gene Active Storage


Transcript

  • 1. Blue Gene Active Storage. HEC FSIO 2010, August 2, 2010. Blake G. Fitch, Alex Rayshubskiy, Chris Ward, Mike Pitman, Bernard Metzler, Heiko J. Schick, Benjamin Krill, Peter Morjan, Robert S. Germain. IBM Research (T.J. Watson). Contact: bgf@us.ibm.com. © 2010 IBM Corporation
  • 2. ‘Exascale’ Architectures Motivate Active Storage
    Exascale: three orders of magnitude more of everything?
    – Often taken to mean just ‘exaflop’, but that’s the easy part!
    – What about the rest of the exascale ecosystem: storage, network bandwidth, usability?
    Exascale observations:
    – The wetware will not be three orders of magnitude more capable
      • Essential to think about usability; massive concurrency must be made more usable.
      • Non-heroic programmers and end-users must be able to utilize exascale architectures.
    – Data objects will be too large to move, between or even within datacenters
      • Today, terabyte objects have people carrying around portable disk arrays; this is not scalable.
      • Bandwidth costs are not likely to drop 10^3 either inside or outside the data center.
    – The single-program (MPI) use case alone will be insufficient to exploit exascale capability
      • Anticipate applications formed of composed components rather than just heroic monoliths.
      • Some of the capability created by exascale will be spent on interfaces that support composability.
    – “Tape is Dead, Disk is Tape, Flash is Disk, RAM locality is King” – Jim Gray
      • A rack of disks can deliver ~10 GB/s; 1 TB/s requires 100 racks or ~$100M.
      • Mixing archival storage and online storage will be less efficient in the future.
      • Solid-state storage and storage class memory are coming, but how will these technologies be used?
    Scalable bisection bandwidth will be THE challenge, both to create and to exploit.
  • 3. From K. Borne, “Data Science Challenges from Distributed Petabyte Astronomical Data Collections: Preparing for the Data Avalanche through Persistence, Parallelization, and Provenance,” Computing with Massive and Persistent Data (CMPD 08). https://computation.llnl.gov/home/workshops/cmpd08/Presentations/Borne.pdf
  • 4. Data-centric and Network-centric Application Classes
    Embarrassingly Parallel:
    – Table scan
    – Batch serial (floating point intensive; job farm/cloud)
    – Map-Reduce
    – Other applications without internal global data dependencies
    Network Dependent:
    – Join
    – Sort (“order by” queries)
    – “group by” queries
    – Map-Reduce aggregation operations: count(), sum(), min(), max(), avg(), …
    – Data analysis/OLAP (aggregation with “group by”; real-time analytics)
    – HPC applications
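    As an illustration of the aggregation operations listed under the network-dependent class, here is a toy map-reduce style “group by” in Python; the names and data are invented for the example and are not from the presentation. In a distributed run, the shuffle that gathers all values for a key onto one reducer is exactly what makes this class network dependent rather than embarrassingly parallel.

from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs, e.g. an amount grouped by region."""
    for region, amount in records:
        yield region, amount

def reduce_phase(pairs):
    """Aggregate values per key: count, sum, min, max, avg."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {
            "count": len(vals),
            "sum": sum(vals),
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
        }
        for key, vals in groups.items()
    }

if __name__ == "__main__":
    records = [("east", 10.0), ("west", 4.0), ("east", 6.0), ("west", 8.0)]
    print(reduce_phase(map_phase(records)))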
  • 5. Blue Gene Architecture in Review
    Blue Gene is not just FLOPs… it’s also the torus network, power efficiency, and dense packaging. A focus on scalability rather than configurability gives the Blue Gene family’s System-on-a-Chip architecture unprecedented scalability and reliability.
  • 6. Blue Gene Architectural Advantages
    [Charts: main memory capacity per rack and peak memory bandwidth per node (byte/flop) for SGI ICE, Sun TACC, Itanium 2, Cray XT5 (4 core), Cray XT3 (2 core), and BG/P (4 core); relative power, space, and cooling efficiencies (racks/TF, kW/TF, sq ft/TF, tons/TF) for IBM BG/P vs. Sun/Constellation, Cray/XT4, and SGI/ICE; low failure rates.]
    Source: http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf
  • 7. Blue Gene Torus: A Cost-Effective All-to-All Network
    10GbE network costs from: “Data Center Switch Architecture in the Age of Merchant Silicon” (Farrington et al., 2009), http://cseweb.ucsd.edu/~vahdat/papers/hoti09.pdf
    [Chart: theoretical 3D and 4D torus all-to-all throughput per node (GB/s) for 1 GB/s links (~1/2 10GbE), plotted against node count from 0 to 30,000.]
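    The curves summarized above can be approximated from first principles. Below is a minimal Python sketch, assuming the standard uniform-traffic bound for a k-ary d-cube torus: an average of k/4 hops per dimension gives a per-node all-to-all throughput of at most 8·B/k for links of bandwidth B per direction. The exact model behind the slide’s plot is not given, so these numbers are only indicative.

def all_to_all_per_node(nodes: int, dims: int, link_bw_gbs: float = 1.0) -> float:
    """Upper bound on all-to-all throughput per node (GB/s) for a k-ary d-cube torus.

    Standard uniform-traffic argument: average k/4 hops per dimension with
    bidirectional links of bandwidth B per direction bounds throughput at 8*B/k.
    """
    k = round(nodes ** (1.0 / dims))   # approximate side length of the torus
    return 8.0 * link_bw_gbs / k

if __name__ == "__main__":
    for n in (4096, 8192, 16384, 32768):
        print(n, "nodes:",
              f"3D ~{all_to_all_per_node(n, 3):.2f} GB/s,",
              f"4D ~{all_to_all_per_node(n, 4):.2f} GB/s")

    The bound depends on the torus side length k rather than on the dimension directly, which is why a 4D torus (smaller k for the same node count) sustains higher all-to-all throughput per node than a 3D torus, consistent with the slide.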
  • 8. Blue Gene Data-centric Micro-benchmark Results
    Blue Gene/L micro-benchmarks at 16,384 nodes show an order of magnitude improvement over the largest SMPs and clusters.
    Scalable random pointer chasing through terabytes of memory:
    – BG/L: 1.0 GigaLoads/s per rack; 14 GigaLoads/s for 16 racks
    – P690: 0.3 GigaLoads/s
    Scalable “Indy” sort of terabytes in memory:
    – BG/L: 1.0 TB sort in 10 seconds
    – BG/L: 7.5 TB sort in 100 seconds
    Streams benchmark:
    – BG/L: 2.5 TB/s per rack
    – P5 595: 0.2 TB/s per rack
    [Charts: random loads per second (Giga-Loads/s) vs. total read contexts for migrate-context and memory-get variants; sort time (s) vs. node count (4,000–16,000 nodes) for 1–7 TB working sets, strong scaling.]
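    To make the “GigaLoads” metric concrete, here is a minimal single-node pointer-chasing loop in Python. The BG/L benchmark chased pointers across nodes over the torus (migrate-context and memory-get variants), which this local sketch does not capture; it only illustrates how loads per second are measured from a chain of dependent loads.

import random
import time

def build_random_cycle(n: int) -> list[int]:
    """Build a permutation forming one cycle, so each load depends on the previous one."""
    order = list(range(n))
    random.shuffle(order)
    table = [0] * n
    for i in range(n):
        table[order[i]] = order[(i + 1) % n]
    return table

def chase(table: list[int], steps: int) -> float:
    """Follow the pointer chain and return achieved loads per second."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = table[idx]          # each iteration is one dependent load
    elapsed = time.perf_counter() - start
    return steps / elapsed

if __name__ == "__main__":
    table = build_random_cycle(1 << 22)   # ~4M entries, far larger than cache
    rate = chase(table, 1_000_000)
    print(f"{rate:.0f} loads/s (interpreted Python; well below hardware limits)")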
  • 9. Blue Gene/P for I/O and Active Storage Explorations
    Currently a research platform for scalable data-centric computing and BG/Q I/O system prototyping:
    – Linux 2.6.29 kernel, 32-bit PowerPC; 4 GB RAM, diskless, 4-way SMP; 1,024 nodes/rack, 4,096 nodes available on site
    – IP-over-torus to other compute nodes; I/O nodes act as IP routers for off-fabric traffic
    – MPICH/TCP, OpenMPI/TCP, SLURM, TCP sockets
    – Running SoftiWARP for OFED RDMA (http://gitorious.org/softiwarp)
    – A native OFED verbs provider for the Blue Gene torus has achieved initial function
    – Experimental GPFS version for on-board parallel file system explorations at very high node count
    – Testing MVAPICH and OpenMPI
    Future: anticipate moving ASF over to, and integrating with, Argonne National Lab’s ZeptoOS (http://www.zeptoos.com)
  • 10. Blue Gene Active Storage Opportunity – Current Conditions
    [Diagram: an IBM enterprise server (analysis here) and an IBM BG/P HPC solution (scalable compute here) on one side, with a widening gap to persistent storage: an IBM System Storage DS8000 and an IBM 3494 tape library.]
  • 11. Blue Gene Active Storage adds parallel processing into the storage hierarchy.
    [Diagram: the Blue Gene/AS option sits between the IBM enterprise server (application solution driver) and persistent storage (IBM System Storage DS8000, IBM 3494 tape library), connected at 100s of GB/s and managed as scalable solid-state storage with embedded parallel processing.]
  • 12. Thought Experiment: A Blue Gene Active Storage Machine
    Integrate significant storage class memory (SCM) at each node:
    – For now, Flash memory, perhaps similar in function to a Fusion-io ioDrive Duo
    – Future systems may deploy Phase Change Memory (PCM), Memristor, or …?
    – Assume node density drops 50%: 512 nodes/rack for embedded applications
    – Objective: balance Flash bandwidth to network all-to-all throughput

    ioDrive Duo (SLC NAND)        One Board     512 Nodes
    Capacity                      320 GB        160 TB
    Read BW (64K)                 1,450 MB/s    725 GB/s
    Write BW (64K)                1,400 MB/s    700 GB/s
    Read IOPS (4K)                270,000       138 Mega
    Write IOPS (4K)               257,000       131 Mega
    Mixed R/W IOPS (75/25 @ 4K)   207,000       105 Mega

    Resulting system attributes:
    – Rack: 0.5 petabyte, 512 Blue Gene processors, and an embedded torus network
    – 700 GB/s I/O bandwidth to Flash, competitive with ~70 large disk controllers
      • An order of magnitude less space and power than an equivalent-performance disk solution
      • Can configure fewer disk controllers and optimize them for archival use
    – With network all-to-all throughput at 1 GB/s per node, anticipate:
      • 1 TB sort from/to persistent storage in order 10 seconds
      • 130 million IOPS per rack, 700 GB/s I/O bandwidth
    – Inherit Blue Gene attributes: scalability, reliability, power efficiency
    Research challenges (list not exhaustive):
    – Packaging: can the integration succeed?
    – Resilience: storage, network, system management, middleware
    – Data management: need a clear split between online and archival data
    – Data structures and algorithms can take specific advantage of the BGAS architecture; no one cares that it’s not x86, since the software is embedded in storage
    Related work:
    – Gordon (UCSD): http://nvsl.ucsd.edu/papers/Asplos2009Gordon.pdf
    – FAWN (CMU): http://www.cs.cmu.edu/~fawnproj/papers/fawn-sosp2009.pdf
    – RAMCloud (Stanford): http://www.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
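    The rack-level figures and the ~10-second sort estimate follow from simple arithmetic on the numbers above; the Python sketch below reproduces that arithmetic. The serialized read/shuffle/write model for the sort is an assumption, since the slide does not state how the phases are combined or overlapped.

# Back-of-envelope check using only the numbers given on the slide:
# 512 nodes/rack, one ioDrive Duo class device per node, ~1 GB/s all-to-all per node.
NODES_PER_RACK = 512
READ_BW_GBS = 1.45        # per device, 64K reads
WRITE_BW_GBS = 1.40       # per device, 64K writes
WRITE_IOPS = 257_000      # per device, 4K writes
A2A_GBS = 1.0             # all-to-all throughput per node

print(f"rack write BW  : {NODES_PER_RACK * WRITE_BW_GBS:.0f} GB/s")    # ~717 (slide: ~700)
print(f"rack write IOPS: {NODES_PER_RACK * WRITE_IOPS / 1e6:.0f} M")   # ~132 (slide: ~130)

# 1 TB external sort: read from flash, all-to-all shuffle, write back.
gb_per_node = 1000 / NODES_PER_RACK                                    # ~2 GB per node
t_total = (gb_per_node / READ_BW_GBS
           + gb_per_node / A2A_GBS
           + gb_per_node / WRITE_BW_GBS)
print(f"1 TB sort, serialized phases: ~{t_total:.1f} s (slide: order 10 s)")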
  • 13. Blue Gene Active Storage for Server Acceleration
    Manage Blue Gene hardware as a scalable, solid-state storage device with embedded processing capability.
    Integrate this “Active Storage Fabric” (ASF) with middleware such as DB2, MySQL, GPFS, etc., at the data/storage interface, using a parallel, key-value, in-memory data store.
    Transparently accelerate server jobs with ASF:
    – Create libraries of reusable, low-overhead Embedded Parallel Modules (EPMs): scan, sort, join, sum, etc.
    – EPMs directly access middleware data (tables, files, etc.) in the key/value DB, based on the middleware data model.
    – Middleware makes ASF datasets available via legacy interfaces, allowing incremental parallelization and interoperability with legacy middleware functions and host programs.
    Integrate BG/ASF with information life-cycle and other standard IT data management products.
    [Diagram: P servers and Z servers share data sets with Blue Gene/ASF, backed by disk array, robot tape, and shelf tape.]
  • 14. Blue Gene Active Storage for UNIX Utility Acceleration
    GPFS servers access file system data blocks in a single ‘disk’ in the Active Store; the Blue Gene Parallel In-Memory Database (PIMD) accesses each block with a key of [file_num, block_num].
    Embedded Parallel Modules (EPMs) access file data directly through the PIMD client:
    – asf_generate_indy_data, an EPM, creates the Indy.Input synthetic data
    – The asf_grep EPM reads the Indy.Input file and greps all text records, creating Indy.Temp
    – The asf_sort EPM sorts the Indy.Temp file and creates Indy.Output
    – tail -1 is a normal UNIX command that runs on the workstation and fetches the last line in the last block of the Indy.Output file
    [Diagram: a non-heroic end-user at an Active GPFS client, an analytics server running Active GPFS and the PIMD client, RDMA routers, and BG/AS nodes running Linux with IP/RDMA, GPFS, the PIMD server, and EPMs such as grep, connected through the Active Storage Fabric and its control path.]
    Example session:
      asf_user@asf_host-> cd /active_gpfs
      asf_user@asf_host-> asf_generate_indy_data 1TB > Indy.Input
      asf_user@asf_host-> asf_grep “aaaaaa” Indy.Input > Indy.Temp
      asf_user@asf_host-> asf_sort < Indy.Temp > Indy.Output
      asf_user@asf_host-> tail -1 Indy.Output
      9876543210aaaaaa……
      asf_user@asf_host-> rm Indy.Output Indy.Input
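    The [file_num, block_num] data model above can be made concrete with a small sketch: file blocks stored as key-value pairs addressed by that compound key. The class and method names below are invented for illustration and are not the actual PIMD API; a local dict stands in for the distributed in-memory store.

BLOCK_SIZE = 4096

class BlockStore:
    """Stand-in for a parallel in-memory key-value store (here, a local dict)."""
    def __init__(self):
        self._kv: dict[tuple[int, int], bytes] = {}

    def put(self, file_num: int, block_num: int, data: bytes) -> None:
        self._kv[(file_num, block_num)] = data

    def get(self, file_num: int, block_num: int) -> bytes:
        return self._kv.get((file_num, block_num), b"")

def write_file(store: BlockStore, file_num: int, payload: bytes) -> None:
    """Split a byte stream into fixed-size blocks keyed by (file_num, block_num)."""
    n_blocks = -(-len(payload) // BLOCK_SIZE)   # ceiling division
    for block_num in range(n_blocks):
        chunk = payload[block_num * BLOCK_SIZE:(block_num + 1) * BLOCK_SIZE]
        store.put(file_num, block_num, chunk)

if __name__ == "__main__":
    store = BlockStore()
    write_file(store, file_num=7, payload=b"x" * 10000)
    print(len(store.get(7, 0)), len(store.get(7, 2)))   # 4096, 1808

    Because every block is addressable by key, an embedded module such as grep or sort can fetch and process file data in parallel across the store without going back through the host file system.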
  • 15. Questions? Note: if you are under NDA with IBM for Blue Gene/Q, or are a US Government employee, and you are interested in the confidential version of the Blue Gene/Q Active Storage presentation, please contact me. Blake Fitch, bgf@us.ibm.com
  • 16. Backup
  • 17. Blue Gene/L ASF: Early Performance Comparison with Host Commands
    Benchmark process*:
    – Generate a 22 GB synthetic “Indy Benchmark” data file
    – grep for records with “AAAAAA” (outputs 1/3 of the data, which is used as input to sort)
    – sort the output of grep
    * Note: all ASF data were taken on an untuned, non-optimized system with full debug printfs; all p55 data were taken with unix time on the command line.

    Function (measured at the host shell, EPM components broken out)    BG/ASF 512 nodes    pSeries p55 (1 thread)
    Total host command time: create Indy file (22 GB)                   22.26 s             686 s
    Total host command time: unix grep of Indy file                     25.1 s              3,180 s
      File blocks to PIMD records (embedded grep)                       7.0 s
      grep core function (embedded grep)                                2.5 s
      PIMD records to file blocks (embedded grep)                       7.6 s
      Total time on Blue Gene/ASF fabric (embedded grep)                18.8 s
    Total host command time: sort output of grep                        60.41 s             2,220 s
      File blocks to PIMD records (embedded sort)                       2.9 s
      sort core function (embedded sort)                                12.2 s
      PIMD records to file blocks (embedded sort)                       15.8 s
      Total time on Blue Gene/ASF fabric (embedded sort)                54.7 s
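    For reference, here is a small local analogue of the three-step benchmark process (generate, grep, sort) in Python. The record layout, a 10-character key followed by a tag and padding with roughly one third of records containing “AAAAAA”, is an assumption for illustration; the slides do not specify the exact Indy record format used.

import random

def generate_records(n: int, seed: int = 0) -> list[str]:
    """Synthetic records: 10-char numeric key, a 6-char tag, and padding (assumed layout)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        key = "".join(rng.choice("0123456789") for _ in range(10))
        tag = "AAAAAA" if rng.random() < 1 / 3 else "BBBBBB"   # ~1/3 match the grep
        records.append(f"{key}{tag}{'.' * 80}")
    return records

if __name__ == "__main__":
    data = generate_records(100_000)                 # analogue of asf_generate_indy_data
    matched = [r for r in data if "AAAAAA" in r]     # analogue of asf_grep
    result = sorted(matched)                         # analogue of asf_sort
    print(len(matched), "of", len(data), "records matched; last line:", result[-1][:16])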
  • 18. Blue Gene/L ASF: Early Scalability Results
    Benchmark process*: as on the previous slide (generate a synthetic “Indy Benchmark” data file; grep for records with “AAAAAA”, which outputs 1/3 of the data; sort the output of grep).
    * Note: all ASF data were taken on an untuned, non-optimized system with full debug printfs.

    Function                                                  BG/ASF 512 nodes    BG/ASF 8K nodes
    Total host command time: create Indy file                 (22 GB) 22.26 s     (1 TB) 197 s
    Total host command time: unix grep of Indy file           25.1 s              107 s
      File blocks to PIMD records (embedded grep)             7.0 s               18.1 s
      grep core function (embedded grep)                      2.5 s               5.1 s
      PIMD records to file blocks (embedded grep)             7.6 s               19.9 s
      Total time on Blue Gene/ASF fabric (embedded grep)      18.8 s              50.2 s
    Total host command time: sort output of grep              60.41 s             120 s
      File blocks to PIMD records (embedded sort)             2.9 s               6.8 s
      sort core function (embedded sort)                      12.2 s              14.8 s
      PIMD records to file blocks (embedded sort)             15.8 s              22.4 s
      Total time on Blue Gene/ASF fabric (embedded sort)      54.7 s              66.55 s
  • 19. Parallel In-Memory Database (PIMD), 512 Nodes
  • 20. Applications with Potentially Strong Data Dependencies
    Application                                    | Application class/trends                        | Algorithms
    National security/intelligence, social networks | Graph-related                                   | Breadth-first search, …
    Molecular simulation                            | N-body                                          | Spectral methods (FFT)
    Financial risk, pricing, market making          | Real-time control; shift from forensic to real time | Graph traversal, Monte Carlo, …
    Oil exploration                                 | Inverse problems                                | Reverse time migration, finite difference
    Climate modeling, weather prediction            | Fluid dynamics                                  | Finite difference, spectral methods, …
    Recommendation systems                          | Interactive query                               | Minimization (least squares, conjugate gradient, …)
    Network simulations                             | Discrete event                                  |
    Traffic management                              | Real-time control                               | Real-time optimization
    Business intelligence/analytics                 | Data mining/association discovery               | K-means clustering, aggregation, scans, …
    Data deduplication                              | Storage management                              | Hashing
    Load balancing/meshing (cross-domain)           |                                                 | Graph partitioning
    Multi-media/unstructured search                 | Search                                          | MapReduce, hashing, clustering
  • 21. BG/Q Active Storage for Expanding Flash Capacity
    [Chart, ITRS 2008 data: projected raw density (bits/cm²) of DRAM, FLASH, and PCM from 2007 to 2023, with a possible DRAM limit and a pessimistic PCM case (2 bits, 2 layers only); the next two years are highlighted.]
  • 22. Related Data Intensive Supercomputing Efforts
    Hyperion (LLNL): http://www.supercomputingonline.com/latest/appro-deploys-a-world-class-linux-cluster-testbed-solution-to-llnl-in-support-of-the-hyperion-project
    – The Linux cluster solution consists of 80 Appro 1U 1423X dual-socket customized servers based on the Intel Xeon processor 5600 series, with 24 GB of memory and 4 PCIe2 x8 slots. The cluster will be integrated with 2x ioSAN carrier cards with 320 GB of FLASH memory, one 10 Gb/s Ethernet link, and one IBA QDR (40 Gb/s) link. The I/O nodes also have 2x 1 Gb/s Ethernet links for low-speed cloud file system I/O and two SATA disks for speed comparisons. In addition to these I/O nodes, 28 Lustre OSS Appro 1U 1423X dual-socket nodes, with 24 GB of memory, one Mellanox ConnectX card, 10 Gb/s Ethernet and QDR IBA (40 Gb/s) links, and 2 SATA drives, will be added. The entire solution will provide 80 Appro server nodes powered by Fusion-io technology, delivering an aggregate 4 KB IOPS rate of over 40 million IOPS, 320 GB/s of I/O bandwidth, and 102 TB of FLASH memory capacity.
    Flash Gordon (SDSC): http://storageconference.org/2010/Presentations/MSST/10.Norman.pdf
  • 23. Failures per Month per TF
    [Chart from http://acts.nersc.gov/events/Workshop2006/slides/Simon.pdf]