Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales

Keynote at the 24th IEEE International Conference on High Performance Computing, Data, and Analytics, in Jaipur, India, on December 21, 2017.


  1. Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales. Ian Foster, Argonne National Lab & University of Chicago. December 21, 2017, HiPC, Jaipur, India. https://www.researchgate.net/publication/317703782 foster@anl.gov
  2. Earth to be paradise; distance to lose enchantment. “If, as it is said to be not unlikely in the near future, the principle of sight is applied to the telephone as well as that of sound, earth will be in truth a paradise, and distance will lose its enchantment by being abolished altogether.” — Arthur Mee, 1898
  3. Cooperating at a distance: 1994; 1999, 2003; 2017
  4. Automating the research data lifecycle. Reducing barriers to cooperation at a distance. globus.org
  5. Automating the research data lifecycle: 5 major services; 13 national labs use Globus; 340 PB transferred; 10,000 active endpoints; 50 Bn files processed; 75,000 registered users; 99.5% uptime; 65+ institutional subscribers; 1 PB largest single transfer to date; 3 months longest continuously managed transfer; 300+ federated campus identities; 12,000 active users/year
  6. Transferring 1 PB in a day (Argonne → NCSA). • Cosmology simulation on Mira @ Argonne produces 1 PB in 24 hours • Data streamed to Blue Waters for analytics • Application reveals feasibility of real-time streaming at scale. [Chart: transfer rates without checksums vs. with checksums]
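In the context of the preceding Globus slides, a transfer like this is requested through the Globus transfer service, with per-file checksum verification accounting for the difference between the two curves. As a rough illustration (not from the talk), here is a minimal sketch using the Globus Python SDK; the access token, endpoint UUIDs, and paths are placeholders, not values from the deck.

```python
# Minimal sketch of a managed Globus transfer with integrity checking.
# The token, endpoint UUIDs, and paths below are placeholders, not real values.
import globus_sdk

TRANSFER_TOKEN = "..."                   # obtained via a Globus OAuth2 flow
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"    # e.g., the Argonne (Mira) endpoint
DST_ENDPOINT = "DEST-ENDPOINT-UUID"      # e.g., the NCSA Blue Waters endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

# verify_checksum=True asks the service to checksum every file after transfer
# and retransmit anything that fails -- the "with checksums" case on the slide.
tdata = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="cosmology snapshot transfer",
    sync_level="checksum",
    verify_checksum=True,
)
tdata.add_item("/projects/cosmo/output/", "/scratch/cosmo/input/", recursive=True)

task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```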
  7. Where does HPC fit in the research lifecycle? [Chart: time to discovery vs. simulation time, at ultra scale]
  8. The challenges of managing data and computation at the 10^18 scale. US Department of Energy
  9. US Exascale Computing Program: using codesign and integration to achieve capable exascale. US Department of Energy
  10. Changing storage geography: a major challenge for exascale. Computation 125 PB/s; node memory 4.5 PB/s; interconnect (largest cross-sectional bandwidth) 24 TB/s; storage 1.4 TB/s. A. C. Bauer et al., EuroVis 2016
  11. Disk didn’t use to be the problem (~1980–2000). Patterson, CACM, 2004
  12. Disks are getting larger, but not faster (~1980–2000). Patterson, CACM, 2004; https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/
  13. [Chart: 10 → 180 (×18); 0.3 → 1 (×3)]
  14. [Image-only slide]
  15. Exascale climate goal: ensembles of 1 km models at 15 simulated years/24 hours. Full state once per model day → 260 TB every 16 seconds → 1.4 EB/day
  16. [Chart: time to discovery vs. simulation time, at ultra scale]
  17. [Diagram: time to discovery vs. simulation time, at ultra scale. Data space tools for population, navigation, manipulation, and dissemination connect leadership class facilities and smaller systems]
  18. [Same diagram as slide 17]
  19. [Same diagram as slide 17]
  20. [Same diagram as slide 17]
  21. The need for online data analysis and reduction. Traditional approach: simulate, output, analyze. Write simulation output to secondary storage; read back for analysis. Decimate in time when the simulation output rate exceeds the output rate of the computer. Online: y = F(x). Offline: a = A(y), b = B(y), …
  22. The need for online data analysis and reduction. Traditional approach: simulate, output, analyze. Write simulation output to secondary storage; read back for analysis. Decimate in time when the simulation output rate exceeds the output rate of the computer. Online: y = F(x). Offline: a = A(y), b = B(y), … New approach: online data analysis & reduction. Co-optimize simulation, analysis, and reduction for performance and information output. Substitute CPU cycles for I/O, via online data (de)compression and/or analysis. (a) Online: a = A(F(x)), b = B(F(x)), … (b) Online: r = R(F(x)); offline: a = A′(r), b = B′(r), or a = A(U(r)), b = B(U(r)) [R = reduce, U = un-reduce]
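A schematic sketch of the two patterns contrasted above, with F a simulation step, A and B analyses, and R/U the reduce and un-reduce operators. All of these functions are hypothetical stand-ins, chosen only to make the data flow concrete.

```python
import numpy as np

def F(x):            # stand-in for a simulation step producing a large state y
    return np.sin(x) + 0.01 * np.random.randn(*x.shape)

def A(y): return y.mean()          # stand-in analysis a = A(y)
def B(y): return y.max()           # stand-in analysis b = B(y)
def R(y): return y[::100]          # stand-in reduction (decimation), r = R(y)
def U(r): return np.repeat(r, 100) # stand-in un-reduce / reconstruction

x = np.linspace(0, 10, 1_000_000)

# Traditional: write the full state y online, analyze it offline.
y = F(x)                             # online
# np.save("y.npy", y)                # the expensive I/O step at exascale
a_off, b_off = A(y), B(y)            # offline: a = A(y), b = B(y)

# New approach (a): run the analyses online, store only their small results.
a_on, b_on = A(F(x)), B(F(x))

# New approach (b): reduce online, store r, reconstruct and analyze offline.
r = R(F(x))                          # online: r = R(F(x))
a_rec, b_rec = A(U(r)), B(U(r))      # offline: a = A(U(r)), b = B(U(r))
```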
  23. Exascale computing at Argonne by 2021: precision medicine; data from sensors and scientific instruments; simulation and modeling of materials and physical systems. Support for three types of computing: Traditional (HPC simulation and modeling), Learning (machine learning, deep learning, AI), and Data (data analytics, data science, big data). [Artist’s impression]
  24. Real-time analysis and experimental steering. • Current protocols process and validate data only after the experiment, which can lead to undetected errors and prevents online steering • Process data streamed from the beamline to a supercomputer; a control feedback loop makes decisions during the experiment • Tests on the TXM beamline (32-ID @ APS) in a cement wetting experiment (2 experiments, each with 8 hours of data acquisition time). [Plots: sustained projections/second vs. circular buffer size and reconstruction frequency; image quality (similarity score) vs. number of streamed projections; reconstructed image sequence] Tekin Bicer et al., eScience 2017
  25. Deep learning for precision medicine. https://de.mathworks.com/company/newsletters/articles/cancer-diagnostics-with-deep-learning-and-photonic-time-stretch.html
  26. Deep learning on HPC systems. [Diagram: training data and human expertise feed model selection and model training; the trained model serves inference (Q → A); new data drives incremental training] Evaluate 1M alternative models, each with 100M parameters → 10^14 parameter values
  27. Using learning to optimize simulation studies
  28. Using learning to optimize simulation studies. [Cycle: simulation data → learning methods → new capabilities → new simulations] Logan Ward and Ben Blaiszik
  29. Synopsis: applications are changing. [Diagram: single program vs. multiple program, offline analysis vs. online analysis; quadrants range from multiple simulations to simulation + analysis to multiple simulations + analyses] A few or many tasks: • loosely or tightly coupled • hierarchical or not • static or dynamic • fail-stop or recoverable • shared state • persistent and transient state • scheduled or data driven
  30. Many interesting codesign problems: big simulation; machine learning; deep learning; streaming; online analysis; online reduction; heterogeneity. Programming models (many task, streaming); libraries (analysis, reduction, communications); system software (fault tolerance, resource management). Complex nodes (many core, accelerators, heterogeneous, NVRAM); internal and external networks; node configuration; memory hierarchy; storage systems; operating policies.
  31. Reduction comes with challenges. • Handling high entropy • Performance (no benefit otherwise) • Not only the error in the variable itself, E ≡ f − f̃ • Must also consider the impact on derived quantities, E ≡ g(f(x, t)) − g(f̃(x, t)). S. Klasky
  32. Reduction comes with challenges. Key research challenge: how to manage the impact of errors on derived quantities? “Where did it go???” S. Klasky
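A small numerical illustration (not from the talk) of why derived quantities need their own error control: a reduction whose pointwise error respects a tight bound can still leave a much larger error in a derived quantity such as a finite-difference gradient.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 10_000)
f = np.sin(x)

# Model a lossy reduction as bounded pointwise noise: |f - f_tilde| <= eps.
eps = 1e-3
rng = np.random.default_rng(0)
f_tilde = f + rng.uniform(-eps, eps, size=f.shape)

# Derived quantity g(f) = df/dx, computed by finite differences.
g = np.gradient(f, x)
g_tilde = np.gradient(f_tilde, x)

print("max error in f:    ", np.abs(f - f_tilde).max())   # stays below eps
print("max error in df/dx:", np.abs(g - g_tilde).max())   # orders of magnitude larger
```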
  33. CODAR: Codesign center for Online Data Analysis and Reduction. • Infrastructure development and deployment: enable rapid composition of applications and “data services” (data reduction methods, data analysis methods, etc.); support CODAR-developed and other data services • Method development: new reduction & analysis routines, both motif-specific (e.g., finite difference mesh vs. particles vs. finite elements) and application-specific (e.g., reduced physics to understand deltas) • Application engagement: understand data analysis and reduction requirements; integrate, deploy, evaluate impact. https://codarcode.github.io codar-info@cels.anl.gov
  34. Cross-cutting research questions. What are the best data analysis and reduction algorithms for different application classes, in terms of speed, accuracy, and resource needs? How can we implement those algorithms to achieve scalability and performance portability? What are the tradeoffs in analysis accuracy, resource needs, and overall application performance between applying data reduction online (with reconstruction and analysis offline) vs. performing more data analysis online? How do the tradeoffs vary with hardware & software choices? How do we effectively orchestrate online data analysis and reduction to reduce the associated overheads? How can hardware and software help with orchestration?
  35. Prototypical data analysis and reduction pipeline. [Diagram: a running simulation feeds the CODAR runtime through the CODAR data API; CODAR data analysis (multivariate statistics, feature analysis, outlier detection), CODAR data reduction (application-aware transforms, encodings, error calculation, refinement hints), and CODAR data monitoring produce reduced output and reconstruction info for the I/O system and offline data analysis; simulation knowledge (application, models, numerics, performance optimization, …) informs the pipeline]
  36. Overarching data reduction challenges. • Understanding the science requires massive data reduction • How do we reduce the time spent in reducing the data to knowledge, the amount of data moved on the HPC platform, the amount of data read from the storage system, and the amount of data stored in memory, on the storage system, and moved over the WAN, without removing the knowledge? • Requires deep dives into application post-processing routines and simulations • Goal is to create both (a) codesign infrastructure and (b) reduction and analysis routines: general (e.g., reduce N bytes to M bytes, M << N), motif-specific (e.g., finite difference mesh vs. particles vs. finite elements), and application-specific (e.g., reduced physics allows us to understand deltas)
  37. HPC floating point compression. • Current interest is in lossy algorithms; some use preprocessing • Lossless may achieve up to ~3x reduction. Compress each variable separately: ISABELA, SZ, ZFP, linear auditing, SVD, adaptive gradient methods. Several variables simultaneously: PCA, tensor decomposition, …
  38. Lossy compression with SZ. No existing compressor can reduce hard-to-compress datasets by more than a factor of 2. Objective 1: reduce hard-to-compress datasets by one order of magnitude. Objective 2: add user-required error controls (error bound, shape of error distribution, spectral behavior of the error function, etc.). Example datasets: NCAR atmosphere simulation output (1.5 TB), WRF hurricane simulation output, Advanced Photon Source mouse brain data. [Figure: what we need to compress (bit map of 128 floating point numbers) looks like random noise] Franck Cappello
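The following is not the SZ algorithm, only a toy sketch of the contract named in Objective 2: a user-supplied absolute error bound that every reconstructed value is guaranteed to respect. Real compressors such as SZ add prediction, variable-length encoding, and a lossless back end on top of this basic idea.

```python
import numpy as np

def compress_error_bounded(data: np.ndarray, abs_err: float) -> np.ndarray:
    """Toy error-bounded quantizer: store integer bin indices of width 2*abs_err."""
    return np.round(data / (2 * abs_err)).astype(np.int64)

def decompress(bins: np.ndarray, abs_err: float) -> np.ndarray:
    """Reconstruct each value as the center of its bin."""
    return bins.astype(np.float64) * (2 * abs_err)

rng = np.random.default_rng(1)
field = np.cumsum(rng.standard_normal(1_000_000))   # smooth-ish synthetic 1-D field

abs_err = 1e-2
restored = decompress(compress_error_bounded(field, abs_err), abs_err)

# The user-specified bound is respected at every point.
assert np.abs(field - restored).max() <= abs_err + 1e-12
```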
  39. Lossy compression: atmospheric simulation. [Figure: results with the latest SZ] Franck Cappello
  40. Characterizing compression error: error distribution, spectral behavior, Laplacian (derivatives), autocorrelation of errors, respect of error bounds, error propagation. [Plots: error amplitude vs. frequency; maximum and average compression error per variable for SZ and ZFP] Franck Cappello
  41. Z-checker: analysis of data reduction error. A community tool to enable comprehensive assessment of lossy data reduction error: • collection of data quality criteria from applications • community repository for datasets, reduction quality requirements, and compression performance • modular design enables contributed analysis modules (C and R) and format readers (ADIOS, HDF5, etc.) • off-line/on-line parallel statistical, spectral, and point-wise distortion analysis with static & dynamic visualization. Franck Cappello, Julie Bessac, Sheng Di
  42. Z-Checker computations: • normalized root mean squared error • peak signal to noise ratio • distribution of error • Pearson correlation between raw and reduced datasets • power spectrum distortion • auto-correlation of compression error • maximum error • point-wise error bound (relative or absolute) • preservation of derivatives • structural similarity (SSIM) index
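A compact sketch of how several of the metrics above can be computed with NumPy. The arrays orig and recon are hypothetical stand-ins for the raw dataset and its reduced-then-reconstructed version.

```python
import numpy as np

def reduction_error_report(orig: np.ndarray, recon: np.ndarray) -> dict:
    """Compute a few Z-Checker-style distortion metrics for a lossy reduction."""
    err = orig - recon
    value_range = orig.max() - orig.min()
    mse = np.mean(err ** 2)
    e = err.ravel()
    return {
        "max_abs_error": np.abs(err).max(),
        "nrmse": np.sqrt(mse) / value_range,                  # normalized RMSE
        "psnr_db": 10 * np.log10(value_range ** 2 / mse),     # peak signal-to-noise ratio
        "pearson_corr": np.corrcoef(orig.ravel(), recon.ravel())[0, 1],
        "error_autocorr_lag1": np.corrcoef(e[:-1], e[1:])[0, 1],
    }

# Hypothetical usage with a synthetic field and a noisy "reconstruction".
orig = np.sin(np.linspace(0, 100, 1_000_000))
recon = orig + 1e-4 * np.random.randn(orig.size)
print(reduction_error_report(orig, recon))
```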
  43. Science-driven optimizations. • Information-theoretically derived methods like SZ, ISABELA, and ZFP make for good generic capabilities • If scientists can provide additional details on how to determine features of interest, we can use those to drive further optimizations, e.g., if they can select: regions of high gradient; regions near turbulent flow; particles with velocities > two standard deviations • How can scientists help define features?
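As a concrete, hypothetical rendering of the last example above, a rule such as “particles with velocities more than two standard deviations from the mean” reduces to a boolean mask that a reduction pipeline could use to decide what to keep at full fidelity.

```python
import numpy as np

# Synthetic per-particle velocities standing in for simulation output.
rng = np.random.default_rng(0)
velocities = rng.normal(0.0, 1.0, size=(1_000_000, 3))

# Feature-selection rule supplied by the scientist: keep the fast particles.
speed = np.linalg.norm(velocities, axis=1)
feature_mask = speed > speed.mean() + 2 * speed.std()

print(f"keep {feature_mask.sum()} of {speed.size} particles "
      f"({100 * feature_mask.mean():.2f}%) at full fidelity")
```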
  44. Multilevel compression techniques. A hierarchical reduction scheme produces multiple levels of partial decompression of the data, so that users can work with reduced representations that require minimal storage whilst achieving a user-specified tolerance. [Plot: compression vs. user-specified tolerance] Results for a turbulence dataset: extremely large, inherently non-smooth, resistant to compression. Mark Ainsworth
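A minimal sketch of the hierarchical idea, using a simple Haar-style average/detail decomposition rather than the actual scheme behind this work: the coarse level alone is a very small representation, and stored detail levels can be applied one at a time until the reconstruction meets the desired tolerance.

```python
import numpy as np

def decompose(x: np.ndarray, levels: int):
    """Split a signal (length divisible by 2**levels) into a coarse array
    plus per-level detail coefficients, finest level first."""
    coarse, details = x.astype(float), []
    for _ in range(levels):
        avg  = 0.5 * (coarse[0::2] + coarse[1::2])
        diff = 0.5 * (coarse[0::2] - coarse[1::2])
        details.append(diff)
        coarse = avg
    return coarse, details

def reconstruct(coarse: np.ndarray, details: list, keep: int) -> np.ndarray:
    """Partial decompression: rebuild using only the `keep` coarsest detail levels."""
    x = coarse
    for lvl in range(len(details) - 1, -1, -1):          # coarsest -> finest
        d = details[lvl] if lvl >= len(details) - keep else np.zeros_like(x)
        out = np.empty(2 * x.size)
        out[0::2], out[1::2] = x + d, x - d
        x = out
    return x

signal = np.sin(np.linspace(0, 50, 2 ** 20)) + 0.1 * np.cos(np.linspace(0, 500, 2 ** 20))
coarse, details = decompose(signal, levels=6)

for keep in (0, 3, 6):     # more detail levels -> lower error, more storage
    approx = reconstruct(coarse, details, keep)
    print(f"{keep} detail levels kept, max error = {np.abs(signal - approx).max():.2e}")
```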
  45. Manifold learning for change detection and adaptive sampling. Low-dimensional manifold projection of different states of MD trajectories. • A single molecular dynamics trajectory can generate 32 PB • Use online data analysis to detect relevant or significant events • Project MD trajectories across time into a two-dimensional manifold space (dimensionality reduction) • Change detection in manifold space is more robust than in the original full coordinate space, as it removes local vibrational noise • Apply an adaptive sampling strategy based on the accumulated changes of trajectories. Shinjae Yoo
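A toy sketch of the projection-then-detection idea, with PCA standing in for the manifold-learning method used in this work: frames of a synthetic trajectory are embedded in two dimensions, and a change is flagged when consecutive embedded frames jump much further apart than the vibrational noise.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for an MD trajectory: n_frames x (3 * n_atoms) coordinates,
# with an abrupt conformational change introduced halfway through.
rng = np.random.default_rng(0)
n_frames, n_coords = 2000, 300
traj = rng.normal(0.0, 0.05, size=(n_frames, n_coords))   # vibrational noise
traj[n_frames // 2:, :10] += 1.0                           # the "event"

# Project each frame into 2-D (PCA as a simple stand-in for manifold learning).
embedding = PCA(n_components=2).fit_transform(traj)

# Change detection in the projected space: flag frames whose jump from the
# previous frame is far larger than typical.
steps = np.linalg.norm(np.diff(embedding, axis=0), axis=1)
events = np.where(steps > steps.mean() + 5 * steps.std())[0] + 1
print("change detected at frame(s):", events)              # expect ~ n_frames // 2
```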
  46. Tracking blobs in XGC fusion simulations: extracting, tracking, and visualizing blobs in large 5D gyrokinetic tokamak simulations. Blobs, regions of high turbulence that can damage the tokamak, can run along the edge wall down toward the divertor and damage it. Blob extraction and tracking enables the exploration and analysis of high-energy blobs across timesteps. • Access data with high-performance ADIOS I/O • Precondition the input data with robust PCA • Detect blobs as local extrema with topology analysis • Track blobs over time with the combinatorial feature flow field method. [Figures: critical points extracted with topology analysis; data preconditioning with robust PCA; tracking graph visualizing the dynamics of blobs (birth, merge, split, and death) over time] Hanqi Guo, Tom Peterka
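A much-simplified stand-in for the extraction step, not the topology-analysis method of this work: local maxima of a smoothed two-dimensional field that also exceed an amplitude threshold are taken as candidate blobs.

```python
import numpy as np
from scipy import ndimage

# Synthetic 2-D fluctuation field with localized high-intensity regions,
# standing in for one timestep of an XGC-like slice.
rng = np.random.default_rng(0)
field = ndimage.gaussian_filter(rng.standard_normal((512, 512)), sigma=8)

# Candidate blobs: points that are the maximum of their neighborhood and
# also well above the field's typical amplitude.
footprint = np.ones((15, 15), dtype=bool)
local_max = field == ndimage.maximum_filter(field, footprint=footprint)
strong = field > field.mean() + 2 * field.std()
blob_rows, blob_cols = np.nonzero(local_max & strong)

print(f"{blob_rows.size} candidate blobs, first few at:",
      list(zip(blob_rows.tolist(), blob_cols.tolist()))[:5])
```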
  47. Reduction for visualization. “An extreme scale simulation … calculates temperature and density over 1000 time steps. For both variables, a scientist would like to visualize 10 isosurface values and X, Y, and Z cut planes for 10 locations in each dimension. One hundred different camera positions are also selected, in a hemisphere above the dataset pointing towards the data set. We will run the in situ image acquisition for every time step. These parameters will produce: 2 variables x 1000 time steps x (10 isosurface values + 3 x 10 cut planes) x 100 camera positions x 3 images (depth, float, lighting) = 2.4 x 10^7 images.” J. Ahrens et al., SC’14. Raw state: 10^3 time steps x 10^15 B state per time step = 10^18 B. Images: 2.4 x 10^7 images x 1 MB/image (megapixel, 4 B) ≈ 2.4 x 10^13 B
  48. Fusion whole device model. [Diagram: XGC and GENE coupled through an interpolator; 100+ PB in total; PB/day on Titan today, 10+ PB/day in the future; 10 TB/day on Titan today, 100+ TB/day in the future; multiple analyses, each reading 10–100 PB] http://bit.ly/2fcyznK
  49. Fusion whole device model. [Diagram: XGC and GENE coupled through an interpolator, each followed by reduction; XGC and GENE visualization and output feed a comparative visualization; data moves across NVRAM, PFS, and tape storage tiers] http://bit.ly/2fcyznK
  50. Fusion whole device model: integrates multiple technologies. • ADIOS staging (DataSpaces) for coupling • Sirius (ADIOS + Ceph) for storage • ZFP, SZ, Dogstar for reduction • VTK-m services for visualization • TAU for instrumenting the code • Cheetah + Savanna to test the different configurations (same node, different node, hybrid combination) to determine where to place the different services • Flexpath for staged writes from XGC to storage • Ceph + ADIOS to manage the storage hierarchy • Swift for workflow automation. [Diagram: the coupled XGC–interpolator–GENE pipeline with reduction, visualization, TAU instrumentation, performance visualization, and NVRAM/PFS/tape tiers; Cheetah + Savanna drive codesign experiments]
  51. Codesign questions to be addressed. • How can we couple multiple codes? Files, staging on the same node, different nodes, synchronous, asynchronous? • How can we test different placement strategies for memory and performance optimization? • What are the best reduction technologies to allow us to capture all relevant information during a simulation (e.g., performance vs. accuracy)? • How can we create visualization services that work on the different architectures and use the data models in the codes? • How do we manage data across storage hierarchies?
  52. Co-design experiment architecture. Savannah: Swift workflows coupled with ADIOS; multi-node workflow components communicate over ADIOS. Cheetah: experiment configuration and dispatch; user monitoring and control of multiple pipeline instances; stores experiment metadata. Chimbuko captures co-design performance data, along with other co-design output (e.g., Z-Checker). [Diagram: a CODAR campaign definition drives job launch of a science app, reduction, and analysis stages, with application data flowing over ADIOS and co-design data collected alongside]
  53. Tasks demand new systems capabilities. [Diagram as on slide 29: single program vs. multiple program, offline analysis vs. online analysis; multiple simulations, simulation + analysis, multiple simulations + analyses] A few or many tasks: • loosely or tightly coupled • hierarchical or not • static or dynamic • fail-stop or recoverable • shared state • persistent and transient state • scheduled or data driven
  54. Codesign of MPI interfaces in support of HPC workflows. Challenge: enable isolation, fault tolerance, and composability for ensembles of scientific simulation/analysis pipelines. Defined an MPIX_Comm_launch() call to enable vendors to support dynamic workflow pipelines, in which parallel applications of various sizes are coupled in complex ways; the key use case is ADIOS-based in situ analysis. Integrated this feature with Swift/T, a scalable, MPI-based workflow system, allowing ease of development when coupling existing codes. Working to have this mode of operation supported in the Cray OS. [Depiction of a workflow of simulation analysis pipelines: clusters of boxes are MPI programs passing output data downstream; an algorithm such as parameter optimization controls progress] The launch feature was scaled to 192 nodes with a challenging workload for performance analysis, and is in use by the CODES network simulation team for its resilience capabilities. Dorier, Wozniak, and Ross. Supporting Task-level Fault-tolerance in HPC Workflows by Launching MPI Jobs inside MPI Jobs. WORKS @ SC, 2017.
  55. EMEWS: Extreme-scale Model Exploration With Swift. EMEWS hyperparameter optimization. Many ways to extend: Hyperband (Li et al., arXiv:1603.06560); population-based training (Jaderberg et al., arXiv:1711.09846). Justin Wozniak and Jonathan Ozik
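EMEWS drives this kind of model exploration with Swift/T workflows. As a language-neutral illustration of the successive-halving step at the heart of the Hyperband reference above, here is a toy sketch in which an invented scoring function stands in for training a model under a given budget.

```python
import random

def score(config, budget):
    """Stand-in for training configuration `config` for `budget` units of time
    and returning a validation score (higher is better); purely synthetic."""
    lr, width = config
    noise = random.gauss(0, 0.01 / budget)        # more budget -> less noisy estimate
    return 1.0 - 10 * abs(lr - 0.01) - abs(width - 256) / 1000 + noise

# Successive halving: evaluate many configurations cheaply, keep the best half,
# double the budget, repeat -- the core step that Hyperband builds on.
random.seed(0)
candidates = [(10 ** random.uniform(-4, -1), random.choice([64, 128, 256, 512]))
              for _ in range(64)]
budget = 1
while len(candidates) > 1:
    ranked = sorted(candidates, key=lambda c: score(c, budget), reverse=True)
    candidates = ranked[: len(ranked) // 2]
    budget *= 2

print("selected configuration (lr, width):", candidates[0])
```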
  56. Co-evolution of HPC applications and systems … demands new applications, software, and hardware … resulting in exciting new computer science challenges. foster@anl.gov. Thanks to the US Department of Energy and the CODAR team
  57. Extra slides
