Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales
Ian Foster
Argonne National Lab & University of Chicago
December 21, 2017
HiPC, Jaipur, India
https://www.researchgate.net/publication/317703782
foster@anl.gov
Earth to be paradise; distance to lose enchantment
"If, as it is said to be not unlikely in the near future, the principle of sight is applied to the telephone as well as that of sound, earth will be in truth a paradise, and distance will lose its enchantment by being abolished altogether." (Arthur Mee, 1898)
Automating research data lifecycle
• 5 major services
• 13 national labs use Globus
• 340 PB transferred
• 50 billion files processed
• 10,000 active endpoints
• 75,000 registered users
• 12,000 active users/year
• 99.5% uptime
• 65+ institutional subscribers
• 300+ federated campus identities
• 1 PB largest single transfer to date
• 3 months longest continuously managed transfer
Transferring 1 PB in a day: Argonne → NCSA
• Cosmology simulation on Mira @ Argonne produces 1 PB in 24 hours
• Data streamed to Blue Waters for analytics
• Application reveals feasibility of real-time streaming at scale (sustaining 1 PB/day means roughly 93 Gb/s of average throughput)
[Chart: achieved transfer rates, with and without checksums]
Exascale climate goal: ensembles of 1 km models at 15 simulated years per 24 hours
Full model state output once per simulated day = 260 TB every ~16 seconds = 1.4 EB/day
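Those rates follow directly from the stated parameters; a quick sanity check (260 TB full state, as given above):

```python
# Back-of-envelope check of the exascale climate output rates quoted above.
SIM_YEARS_PER_WALL_DAY = 15
WALL_SECONDS_PER_DAY = 24 * 3600
state_bytes = 260e12                      # full model state: 260 TB

sim_days = SIM_YEARS_PER_WALL_DAY * 365   # 5475 simulated days per wall day
print(f"one model day every {WALL_SECONDS_PER_DAY / sim_days:.1f} s")  # ~15.8 s
print(f"output: {state_bytes * sim_days / 1e18:.2f} EB/day")           # ~1.42 EB
```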
Time to discovery
[Figure, repeated across four build slides: simulation time for ultra-scale runs on leadership-class facilities versus smaller systems, with Data Space tools supporting population, navigation, manipulation, and dissemination of results.]
The need for online data analysis and reduction
Traditional approach: simulate, output, analyze
• Write simulation output to secondary storage; read it back for analysis
• Decimate in time when the simulation's output rate exceeds the I/O rate of the computer
Online: y = F(x)
Offline: a = A(y), b = B(y), ...
New approach: online data analysis and reduction
• Co-optimize simulation, analysis, and reduction for performance and information output
• Substitute CPU cycles for I/O, via online data (de)compression and/or analysis
(a) Online: a = A(F(x)), b = B(F(x)), ...
(b) Online: r = R(F(x))
    Offline: a = A'(r), b = B'(r), or a = A(U(r)), b = B(U(r))
[R = reduce, U = un-reduce]
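In code, the contrast looks roughly like this (F, A, B, R, U follow the slide's notation; the concrete functions and file paths are placeholders):

```python
import numpy as np

def F(x):                       # simulation (placeholder)
    return np.sin(x)

def A(y): return y.mean()       # analysis routines (placeholders)
def B(y): return y.max()

def R(y, k=100):                # reduce: decimate in time
    return y[::k]

def U(r, k=100):                # un-reduce: crude reconstruction
    return np.repeat(r, k)

x = np.linspace(0, 10, 100_000)

# Traditional: write full output, read it back, analyze offline.
np.save("/tmp/y.npy", F(x))                 # full-size I/O
y = np.load("/tmp/y.npy")
a, b = A(y), B(y)

# (a) Online analysis: nothing hits storage.
y = F(x)
a, b = A(y), B(y)

# (b) Online reduction: store r = R(F(x)); offline work uses U(r).
np.save("/tmp/r.npy", R(y))                 # 100x less I/O
a_approx = A(U(np.load("/tmp/r.npy")))
```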
Exascale computing at Argonne by 2021 [artist's impression]
[Diagram inputs: precision medicine; data from sensors and scientific instruments; simulation and modeling of materials and physical systems]
Support for three types of computing:
• Traditional: HPC simulation and modeling
• Learning: machine learning, deep learning, AI
• Data: data analytics, data science, big data
Real-time analysis and experimental steering
• Current protocols process and validate data only after the experiment, which can lead to undetected errors and prevents online steering
• Process data streamed from the beamline to a supercomputer; a control feedback loop makes decisions during the experiment
• Tested on the TXM beamline (32-ID @ APS) in a cement wetting experiment (2 experiments, each with 8 hours of data acquisition time)
[Plots: sustained projections/second vs. circular buffer size and reconstruction frequency; image quality (similarity score) vs. number of streamed projections reconstructed; reconstructed image sequence]
Tekin Bicer et al., eScience 2017
Deep learning for precision medicine
https://de.mathworks.com/company/newsletters/articles/cancer-diagnostics-with-deep-learning-and-photonic-time-stretch.html
Using learning to optimize simulation studies
[Diagram: simulation data → learning methods → new capabilities → new simulations]
Logan Ward and Ben Blaiszik
Synopsis: Applications are changing
[Diagram: applications move from single program to multiple programs, and from offline to online analysis: simulation + analysis, multiple simulations, multiple simulations + analyses]
A few or many tasks:
• Loosely or tightly coupled
• Hierarchical or not
• Static or dynamic
• Fail-stop or recoverable
• Shared state
• Persistent and transient state
• Scheduled or data driven
Many interesting codesign problems
Application trends: big simulation, machine learning, deep learning, streaming, online analysis, online reduction, heterogeneity.
Software: programming models (many-task, streaming); libraries (analysis, reduction, communications); system software (fault tolerance, resource management).
Hardware: complex nodes (many-core, accelerators, heterogeneous); NVRAM; internal and external networks.
Codesign parameters: node configuration, internal networks, external networks, memory hierarchy, storage systems, heterogeneity, operating policies.
Reduction comes with challenges
• Handling high entropy
• Performance: no benefit otherwise
• Not only the error in the variable itself: $E \equiv \| f - \hat{f} \|$
• Must also consider the impact on derived quantities: roughly, $E \equiv \| g_l^t(f(x,t)) - g_l^t(\hat{f}(x,t)) \|$, where $\hat{f}$ is the reconstructed (decompressed) field and $g_l^t$ the $l$-th derived quantity at time $t$
S. Klasky
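A small illustration of the point, with decimation as a stand-in for lossy reduction and a finite-difference gradient as the derived quantity g (both choices are illustrative, not from the slide):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 1_000)
f = np.sin(5 * x)

# Lossy reduction: keep every 10th point, reconstruct by linear interpolation.
f_hat = np.interp(x, x[::10], f[::10])

g = lambda v: np.gradient(v, x)      # derived quantity: spatial gradient

print(f"max error in f:    {np.abs(f - f_hat).max():.3e}")
print(f"max error in g(f): {np.abs(g(f) - g(f_hat)).max():.3e}")
# The relative error in g(f) is typically several times larger than in f:
# reduction that looks harmless in the variable can badly distort derivatives.
```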
Reduction comes with challenges
Key research challenge: how to manage the impact of errors on derived quantities? ("Where did it go???")
S. Klasky
CODAR: Codesign center for Online Data Analysis and Reduction
• Infrastructure development and deployment
- Enable rapid composition of applications and "data services" (data reduction methods, data analysis methods, etc.)
- Support CODAR-developed and other data services
• Method development: new reduction and analysis routines
- Motif-specific: e.g., finite-difference mesh vs. particles vs. finite elements
- Application-specific: e.g., reduced physics to understand deltas
• Application engagement
- Understand data analysis and reduction requirements
- Integrate, deploy, evaluate impact
https://codarcode.github.io   codar-info@cels.anl.gov
Cross-cutting research questions
• What are the best data analysis and reduction algorithms for different application classes, in terms of speed, accuracy, and resource needs? How can we implement those algorithms to achieve scalability and performance portability?
• What are the tradeoffs in analysis accuracy, resource needs, and overall application performance between applying data reduction methods online prior to offline reconstruction and analysis vs. performing more data analysis online? How do the tradeoffs vary with hardware and software choices?
• How do we effectively orchestrate online data analysis and reduction to reduce the associated overheads? How can hardware and software help with orchestration?
Prototypical data analysis and reduction pipeline
[Diagram: a running simulation feeds the CODAR data API and CODAR runtime, which host CODAR data analysis (multivariate statistics, feature analysis, outlier detection), CODAR data reduction (application-aware transforms, encodings), and CODAR data monitoring (error calculation, refinement hints). Reduced output and reconstruction info pass through the I/O system to offline data analysis. Simulation knowledge (application, models, numerics, performance optimization, ...) informs every stage.]
Overarching data reduction challenges
• Understanding the science requires massive data reduction
• How do we reduce:
- the time spent in reducing the data to knowledge?
- the amount of data moved on the HPC platform?
- the amount of data read from the storage system?
- the amount of data stored in memory, on the storage system, or moved over the WAN?
... all without removing the knowledge? This requires deep dives into application post-processing routines and simulations.
• Goal is to create both (a) codesign infrastructure and (b) reduction and analysis routines
- General: e.g., reduce N bytes to M bytes, with M << N
- Motif-specific: e.g., finite-difference mesh vs. particles vs. finite elements
- Application-specific: e.g., reduced physics allows us to understand deltas
HPC floating-point compression
• Current interest is in lossy algorithms, some of which use preprocessing
• Lossless compression may achieve only up to ~3x reduction
Compress each variable separately: ISABELA, SZ, ZFP, linear auditing, SVD, adaptive gradient methods.
Compress several variables simultaneously: PCA, tensor decomposition, ...
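To make the lossy idea concrete, here is a minimal sketch of the pointwise error-bound mechanism that compressors such as SZ build on (SZ itself adds prediction and entropy coding; this is not its actual implementation):

```python
import numpy as np

def quantize(data: np.ndarray, abs_err: float) -> np.ndarray:
    """Uniform scalar quantization with a guaranteed absolute error bound."""
    return np.round(data / (2 * abs_err)).astype(np.int32)

def dequantize(codes: np.ndarray, abs_err: float) -> np.ndarray:
    return codes * (2 * abs_err)

field = np.random.randn(1_000_000).astype(np.float32)
bound = 1e-3
rec = dequantize(quantize(field, bound), bound)
print("max error:", np.abs(field - rec).max(), "<= bound:", bound)
# The int32 codes are highly repetitive, so a lossless coder then shrinks
# them far more effectively than it could the raw floating-point bits.
```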
Lossy compression with SZ
No existing compressor can reduce hard-to-compress datasets by more than a factor of 2.
• Objective 1: reduce hard-to-compress datasets by one order of magnitude
• Objective 2: add user-required error controls (error bound, shape of error distribution, spectral behavior of the error function, etc.)
Example datasets: NCAR atmosphere simulation output (1.5 TB); WRF hurricane simulation output; Advanced Photon Source mouse brain data.
[Figure: what we need to compress, shown as a bit map of 128 floating-point numbers; much of it is effectively random noise]
Franck Cappello
Z-checker: Analysis of data reduction error
• Community tool to enable comprehensive assessment of lossy data reduction error:
- Collection of data quality criteria from applications
- Community repository for datasets, reduction quality requirements, and compression performance
- Modular design enables contributed analysis modules (C and R) and format readers (ADIOS, HDF5, etc.)
- Offline/online parallel statistical, spectral, and point-wise distortion analysis with static and dynamic visualization
Franck Cappello, Julie Bessac, Sheng Di
Z-Checker computations
• Normalized root mean squared error
• Peak signal to noise ratio
• Distribution of error
• Pearson correlation between raw and reduced datasets
• Power spectrum distortion
• Auto-correlation of compression error
• Maximum error
• Point-wise error bound (relative or absolute)
• Preservation of derivatives
• Structural similarity (SSIM) index
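Several of these metrics are one-liners with NumPy/SciPy; a minimal sketch (illustrative, not Z-checker's actual implementation):

```python
import numpy as np
from scipy import stats

def reduction_error_report(raw: np.ndarray, rec: np.ndarray) -> dict:
    """Subset of Z-checker-style metrics comparing a raw dataset with its
    reduced-then-reconstructed version."""
    err = rec - raw
    rng = raw.max() - raw.min()
    mse = np.mean(err ** 2)
    return {
        "max_abs_error": np.abs(err).max(),
        "nrmse": np.sqrt(mse) / rng,                      # normalized RMSE
        "psnr_db": 20 * np.log10(rng) - 10 * np.log10(mse),
        "pearson_r": stats.pearsonr(raw.ravel(), rec.ravel())[0],
        "error_autocorr_lag1": np.corrcoef(err.ravel()[:-1],
                                           err.ravel()[1:])[0, 1],
    }

raw = np.random.rand(256, 256)
rec = raw + np.random.uniform(-1e-3, 1e-3, raw.shape)     # fake reconstruction
print(reduction_error_report(raw, rec))
```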
Science-driven optimizations
• Information-theoretically derived methods like SZ, ISABELA, and ZFP make for good generic capabilities
• If scientists can provide additional details on how to determine features of interest, we can use those to drive further optimizations (sketched below), e.g., if they can select:
- Regions of high gradient
- Regions near turbulent flow
- Particles with velocities > two standard deviations
• How can scientists help define features?
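For instance, a feature mask can route high-interest regions to a tight error bound and everything else to an aggressive one (the threshold and bounds below are made up for illustration):

```python
import numpy as np

field = np.random.randn(512, 512)          # stand-in for a simulation field
gy, gx = np.gradient(field)
grad_mag = np.hypot(gx, gy)

# Scientist-supplied feature definition: regions of unusually high gradient.
interesting = grad_mag > grad_mag.mean() + 2 * grad_mag.std()

tight, loose = 1e-6, 1e-2                  # illustrative error bounds
bounds = np.where(interesting, tight, loose)
# A feature-aware compressor would now compress each region to its own bound.
print(f"{interesting.mean():.1%} of points kept at the tight bound")
```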
Multilevel compression techniques
A hierarchical reduction scheme produces multiple levels of partial decompression of the data, so that users can work with reduced representations that require minimal storage while achieving a user-specified tolerance (see the sketch below).
[Plots: compression vs. user-specified tolerance; results for a turbulence dataset that is extremely large, inherently non-smooth, and resistant to compression]
Mark Ainsworth
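One way to picture a multilevel scheme: store a coarse base level plus per-level correction deltas, and reconstruct only as many levels as the tolerance requires. A minimal 1-D sketch (a Haar-style average/difference hierarchy, not Ainsworth's actual algorithm):

```python
import numpy as np

def build_hierarchy(data: np.ndarray, levels: int):
    """Coarsen by pairwise averaging; keep deltas needed to invert each step."""
    base, deltas = data, []
    for _ in range(levels):
        coarse = 0.5 * (base[0::2] + base[1::2])
        deltas.append(base[0::2] - coarse)   # enough to recover both halves
        base = coarse
    return base, deltas

def reconstruct(base: np.ndarray, deltas: list, upto: int) -> np.ndarray:
    """Apply `upto` correction levels, coarsest first (0 = base only)."""
    out = base
    for d in deltas[::-1][:upto]:
        fine = np.empty(out.size * 2)
        fine[0::2] = out + d                 # a = c + d
        fine[1::2] = out - d                 # b = c - d, since c = (a+b)/2
        out = fine
    return out

signal = np.sin(np.linspace(0, 8 * np.pi, 1024))
base, deltas = build_hierarchy(signal, levels=4)
for k in range(5):
    approx = np.repeat(reconstruct(base, deltas, k), 2 ** (4 - k))
    print(f"levels used: {k}, max error: {np.abs(approx - signal).max():.4f}")
```

A user with a loose tolerance fetches only the base level; tighter tolerances pull in successively more delta levels.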
Manifold learning for change detection and adaptive sampling
• A single molecular dynamics trajectory can generate 32 PB
• Use online data analysis to detect relevant or significant events
• Project MD trajectories across time into a two-dimensional manifold space (dimensionality reduction)
• Change detection in manifold space is more robust than in the original full coordinate space, since the projection removes local vibrational noise
• Apply an adaptive sampling strategy based on the accumulated changes of trajectories
[Figure: low-dimensional manifold projection of different states of MD trajectories]
Shinjae Yoo
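A toy version of the idea, with PCA standing in for a nonlinear manifold method and a synthetic "conformational change" injected halfway through the trajectory (all sizes and thresholds are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_frames, n_coords = 2_000, 300                  # frames x flattened coords

# Synthetic trajectory: vibrational noise plus one real shift at frame 1000.
traj = rng.normal(0, 0.05, (n_frames, n_coords))
traj[1_000:] += rng.normal(0, 1.0, n_coords)     # conformational change

emb = PCA(n_components=2).fit_transform(traj)    # project to 2-D "manifold"

# Change detection: distance between consecutive embedded frames.
step = np.linalg.norm(np.diff(emb, axis=0), axis=1)
events = np.where(step > step.mean() + 5 * step.std())[0]
print("change detected near frame(s):", events)  # should flag ~frame 1000
```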
Tracking blobs in XGC fusion simulations
Blobs, regions of high turbulence that can damage the tokamak, can run along the edge wall down toward the divertor and damage it. Blob extraction and tracking enables the exploration and analysis of high-energy blobs across timesteps.
• Access data with high-performance ADIOS I/O
• Precondition the input data with robust PCA
• Detect blobs as local extrema with topology analysis
• Track blobs over time with the combinatorial feature flow field method
[Figures: critical points extracted with topology analysis; data preconditioning with robust PCA; a tracking graph that visualizes the dynamics of blobs (birth, merge, split, and death) over time]
Extracting, tracking, and visualizing blobs in large 5D gyrokinetic tokamak simulations. Hanqi Guo, Tom Peterka
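The detection step can be pictured with a standard local-maximum filter, a simple stand-in for the topological critical-point analysis used in the actual work:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)
field = ndimage.gaussian_filter(rng.normal(size=(256, 256)), sigma=8)

# Local maxima: points equal to the max of their neighborhood, above a cutoff.
neighborhood_max = ndimage.maximum_filter(field, size=9)
blobs = (field == neighborhood_max) & (field > field.mean() + 2 * field.std())

labels, n_blobs = ndimage.label(blobs)
centers = ndimage.center_of_mass(blobs, labels, range(1, n_blobs + 1))
print(f"{n_blobs} candidate blobs at:",
      [tuple(round(c) for c in ctr) for ctr in centers])
```

Repeating this per timestep and linking nearby centers across steps gives the flavor of the tracking graph shown on the slide.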
Reduction for visualization
"An extreme scale simulation ... calculates temperature and density over 1000 time steps. For both variables, a scientist would like to visualize 10 isosurface values and X, Y, and Z cut planes for 10 locations in each dimension. One hundred different camera positions are also selected, in a hemisphere above the dataset pointing towards the data set. We will run the in situ image acquisition for every time step. These parameters will produce: 2 variables x 1000 time steps x (10 isosurface values + 3 x 10 cut planes) x 100 camera positions x 3 images (depth, float, lighting) = 2.4 x 10^7 images." (J. Ahrens et al., SC'14)
Raw state: 10^3 time steps x 10^15 B per time step = 10^18 B.
Images: 2.4 x 10^7 images x 1 MB/image (megapixel, 4 B) = 2.4 x 10^13 B.
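The image count and the size comparison are easy to reproduce (1 MB/image as quoted above):

```python
variables, steps, cameras = 2, 1_000, 100
surfaces = 10 + 3 * 10                  # 10 isosurfaces + 30 cut planes
image_kinds = 3                         # depth, float, lighting
images = variables * steps * surfaces * cameras * image_kinds
print(f"{images:.2e} images")           # 2.40e+07

raw_bytes = steps * 1e15                # full state: 1 PB per time step
img_bytes = images * 1e6                # ~1 MB per image
print(f"raw {raw_bytes:.1e} B vs images {img_bytes:.1e} B "
      f"({raw_bytes / img_bytes:.0f}x smaller)")
```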
Fusion whole device model
[Diagram: XGC and GENE coupled through an interpolator. XGC side: PB/day on Titan today, 10+ PB/day in the future. GENE side: 10 TB/day on Titan today, 100+ TB/day in the future. A 100+ PB store feeds multiple analysis tasks, each reading 10-100 PB per analysis.]
http://bit.ly/2fcyznK
Fusion whole device model: integrates multiple technologies:
• ADIOS staging (DataSpaces) for coupling
• Sirius (ADIOS + Ceph) for storage
• ZFP, SZ, Dogstar for reduction
• VTK-m services for visualization
• TAU for instrumenting the code
• Cheetah + Savanna to test the different configurations (same node, different node, hybrid combination) to determine where to place the different services
• Flexpath for staged writes from XGC to storage
• Ceph + ADIOS to manage the storage hierarchy
• Swift for workflow automation
[Diagram: XGC and GENE coupled via an interpolator, each with reduction, TAU instrumentation, visualization, and output stages; comparative and performance visualization; a storage hierarchy of NVRAM, parallel file system (PFS), and tape; Cheetah + Savanna drive the codesign experiments]
Codesign questions to be addressed
• How can we couple multiple codes? Files, staging on the same node, different nodes, synchronous, asynchronous?
• How can we test different placement strategies for memory and performance optimizations?
• What are the best reduction technologies to allow us to capture all relevant information during a simulation, e.g., performance vs. accuracy?
• How can we create visualization services that work on the different architectures and use the data models in the codes?
• How do we manage data across storage hierarchies?
Savannah: Swift workflows coupled with ADIOS
[Diagram: codesign experiment architecture. A CODAR campaign definition feeds Cheetah, which handles experiment configuration, dispatch, and job launch, plus user monitoring and control of multiple pipeline instances. In each pipeline, a science application, reduction, analysis, and Z-checker run as multi-node workflow components communicating over ADIOS. Chimbuko captures codesign performance data; experiment metadata, application data, and other codesign output (e.g., from Z-checker) are stored for later study.]
These tasks demand new system capabilities
[Repeats the task-characteristics diagram from the earlier "Synopsis: Applications are changing" slide]
Codesign of MPI interfaces in support of HPC workflows
Challenge: enable isolation, fault tolerance, and composability for ensembles of scientific simulation/analysis pipelines.
• Defined an MPIX_Comm_launch() call to enable vendors to support dynamic workflow pipelines, in which parallel applications of various sizes are coupled in complex ways. Key use case: ADIOS-based in situ analysis.
• Integrated this feature with Swift/T, a scalable, MPI-based workflow system, allowing ease of development when coupling existing codes.
• Working to have this mode of operation supported in the Cray OS.
[Figure: workflow of simulation analysis pipelines; clusters of boxes are MPI programs passing output data downstream, with an algorithm such as parameter optimization controlling progress. The launch feature was scaled to 192 nodes with a challenging workload for performance analysis, and is in use by the CODES network simulation team for its resilience capabilities.]
Dorier, Wozniak, and Ross. Supporting Task-level Fault-tolerance in HPC Workflows by Launching MPI Jobs inside MPI Jobs. WORKS @ SC, 2017.
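MPIX_Comm_launch() is a research extension, so its exact signature is not reproduced here; standard MPI's dynamic-process facility, MPI_Comm_spawn, gives the basic flavor of launching an MPI job from inside an MPI job (mpi4py sketch; file names are illustrative):

```python
# parent.py -- run with: mpiexec -n 1 python parent.py
from mpi4py import MPI

# Spawn a 4-rank child MPI job from inside a running MPI program.
# (MPI_Comm_spawn is standard MPI; MPIX_Comm_launch adds the isolation
# and fault-tolerance semantics described above.)
child = MPI.COMM_SELF.Spawn("python", args=["child.py"], maxprocs=4)
print("child said:", child.recv(source=0, tag=0))
child.Disconnect()
```

```python
# child.py -- launched by parent.py, runs as its own 4-rank MPI job
from mpi4py import MPI

parent = MPI.Comm.Get_parent()       # intercommunicator back to the parent
if MPI.COMM_WORLD.Get_rank() == 0:
    parent.send("analysis pipeline complete", dest=0, tag=0)
parent.Disconnect()
```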
EMEWS: Extreme-scale Model Exploration With Swift
EMEWS hyperparameter optimization: many ways to extend, e.g.:
• Hyperband (Li et al., arXiv:1603.06560)
• Population-based training (Jaderberg et al., arXiv:1711.09846)
Justin Wozniak and Jonathan Ozik
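Hyperband builds on successive halving: sample many configurations, evaluate them cheaply, keep the best fraction, and give the survivors a larger budget. A minimal sketch (the objective function and budgets are placeholders, not EMEWS code):

```python
import random

def train_and_score(config: float, budget: int) -> float:
    """Placeholder objective: pretend to train for `budget` epochs."""
    return -abs(config - 0.3) + random.gauss(0, 0.1 / budget)

def successive_halving(n_configs=27, min_budget=1, eta=3) -> float:
    configs = [random.uniform(0, 1) for _ in range(n_configs)]
    budget = min_budget
    while len(configs) > 1:
        scores = {c: train_and_score(c, budget) for c in configs}
        configs = sorted(configs, key=scores.get, reverse=True)
        configs = configs[: max(1, len(configs) // eta)]  # keep the top 1/eta
        budget *= eta                                     # survivors get more
    return configs[0]

print("best hyperparameter:", successive_halving())
```

In EMEWS, each train_and_score call would be a Swift/T task dispatched to an HPC resource rather than a local function call.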
Co-evolution of HPC applications and systems ...
... demands new application, software, and hardware ...
... resulting in exciting new computer science challenges.
foster@anl.gov
Thanks to the US Department of Energy and the CODAR team.