Scientific Applications and
Heterogeneous Architectures –
Data Analytics and the Intersection of HPC
and Edge Computing
Michela Taufer
Acknowledgements
T. Johnston
T. Estrada M. Cuendet E. Deelman
R. da Silva
Sponsors:
H. Weinstein
T. Do
S. Thomas
B. Mulligan
A. RazaviH. Carrillo
R. Vargas
D. Rorabaugh
R. LLamas M. Guevara
Trends in Next-Generation Systems: IO Gap and Ensembles
5x
12x
64x
13x
Rising Importance of Ensembles
Source:
https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification
ExascalePetascaleTerascale
Sierra (2018)
Sequoia (2013)
Purple (2005)
Number of Jobs
SystemScale(FLOPS)
Few Many
Widening IO Gap
Source: Lucy Nowell (DOE)
3
4
Pegasus LIGO PyCBC Workflow
Laser Interferometer Gravitational-Wave Observatory (LIGO)
Trends in Workflows: Compute + Analytics + Data
Source: Ewa Deelman, ISI - USC
Extending HPC to Connect to the “Edge”
Image Source: https://iiot-world.com/digital-disruption/the-right-representation-of-digital-twins-for-data-analytics/
Simulations Data Analytics Real System with Sensors
5
Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
6
Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
7
Classical Molecular Dynamics Simulations
Source: Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/
• A MD simulation comprises of
hundreds of thousands of MD job
• Each job preforms hundreds of
thousands of MD steps
8
Classical Molecular Dynamics Simulations
àForces on single atoms
à Acceleration
à Velocity
à Position
• MD step computes forces on single
atoms (e.g., bond, dihedrals, nonbond)
• Forces are added to compute
acceleration
• Acceleration is used to update
velocities
• Velocities are used to update the atom
positions
• Every n steps (Stride)
à Store 3D snapshot or frame
9
Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
Run n-Stride
simulation steps
Data Generation
10
Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data Storage
11
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System
Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data AnalyticsData Storage
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
12
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System
Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data AnalyticsData Storage
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
13
Data Feedback
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System
Extending HPC to Integrate Data Analytics
Analysis
Local Storage
MEMORY
CN
Application
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
Parallel File System
MEMORY
CN
Application
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
Analysis
CN
14
Data Data + metadata
Parallel File System
Augmenting HPC with In Situ and In Transit Analytics
AnalysisSimulation
Node 1 Node 2 Node 3
Network Interconnect
Core Core
Core Core
Core Core
Core Core
Simulation
Node
Analysis
SharedMemory
Example of tools:
• DataSpaces (Rutgers U.)
• DataStager (GeorgiaTech)
15
In Situ and In Transit Analytics for MD Simulations
• We want to capture what is going on in each frame without:
§ Disrupting the simulation (e.g., stealing CPU and memory on the node)
§ Moving all the frames to a central file system and analyzing them once the
simulation is over
§ Comparing each frame with past frames of the same job
§ Comparing each frame with frames of other jobs
Frames (or snapshots) of an MD trajectory with a stride of 5 steps:
Frame 55 Frame 60 Frame 65 Frame 70 Frame 80
Frame 75
16
Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data AnalyticsData Storage
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
17
Data Feedback
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System
Building a Closed-loop Workflow
Data Generation Data AnalyticsData Storage
18
Data Feedback
A4MD
analytics
Retriever
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
Proteins with Similar Functions
Key principle: proteins with similar structures have similar functions
• Measure millions of protein variants expressed from yeast or bacteria
• Structure proteins to produce desired properties (protein engineering)
19 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
Protein Representations
3D Cartesian representation Surface representationMulti-fold representation
20 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
From Multi-fold Representation to Image Encoding
21 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
From Multi-fold Representation to Image Encoding
22 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
From Multi-fold Representation to Image Encoding
23 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
High-Throughput Protein Analysis
• Eight biological processes from biological process taxonomy in RCSB-PDB
• 62,991 proteins from the PDB
Proteinsas3Dtens
24
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
convolutional
neural network
Google’s Inception-v3,
Gem-Net Groundtrue
Predictions
Normalized Confusion Matrix - Accuracy 80.66%
Capturing Changes in Folding with Transfer Learning
25
T. Estrada,et al. A Graphic Encoding Method for Quantitative Classification of Protein Structure and
Representation of Conformational Changes. 2019
Protein: Opsin Frame 50 Frame 1500 Frame 1950
Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
26
Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data AnalyticsData Storage
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
27
Data Feedback
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System
Modeling Lost Frames
Node
MD
Simulation
Data
Analytics
Dataspaces
Server
Main Memory
Dataspaces
Shared Memory Region
Single Node (In Situ)
28 S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
S1, S2, S3: generate MD frame
W1, W2, W2: write to shared memory
R1, R2, R3: read from shared memory
A1, A3: analyze frame
Dataflow for Modeling Lost Frames
29
Generator of
MD frames
Ingestor
Dataflow
Plumed In-memory
Staging Area
DataSpaces Retriever
Data Generation Data AnalyticsData Storage
Dataflow
Analytics
representations
Distance matrices
λmax
eigenvalues
T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of
Computational Chemistry (JCC), 38(16):1419-1430, 2017.
Eigenvalues: Proxy for Structural Changes
Frames (or snapshots) of an MD trajectory with a stride of 5 steps:
Frame 55 Frame 60 Frame 65 Frame 70 Frame 80
Frame 75
30
T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of
Computational Chemistry (JCC), 38(16):1419-1430, 2017.
λmax 60 λmax65 λmax70 λmax75 λmax80λmax55
The distance between two max eigenvalues can serve as a proxy for
distance between the two associated conformations
Cα
1
Cα
Nα
31
Single frame at time t
Nα Cα atoms
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
Cα
Nα
Cα
Nα/2
Cα
Nα/2 - 1
Cα
1
Cα
1 Cα
Nα
0 dij
dji 0
Distance of two segments with
segment length:
Nα/2 x Cα atoms
(Cα
1 - Cα
Nα/2 -1 ) ( Cα
Nα/2 - Cα
Nα/2)
Single λmax
Cα
1
Cα
Nα
32
Single frame at time t
Nα Cα atoms
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
Cα
1
Cα
2
Cα
3
Cα
4
Cα
1. Cα
4
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
Cα
Nα
Cα
Nα/2
Cα
Nα/2 - 1
Cα
1
Cα
1 Cα
Nα
0 dij
dji 0
Distance of two segments with
segment length:
Nα/2 x Cα atoms
(Cα
1 - Cα
Nα/2 -1 ) ( Cα
Nα/2 - Cα
Nα/2)
Single λmax
Cα
1
Cα
Nα
33
Distances of Nα/2 segments with
segment length: 2 x Cα atoms
(Cα
1 - Cα
2 ) ( Cα
3 - Cα
4)
(Cα
5 - Cα
6 ) ( Cα
7 – Cα
8)
….
(Cα
Nα/2-3 - Cα
Nα/2-2 ) ( Cα
Nα/2-1 - Cα
Nα/2)
λmax, 1 λmax, 2 … λmax, Nα/2
Single frame at time t
Nα Cα atoms
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
Cα
Nα
Cα
Nα/2
Cα
Nα/2 - 1
Cα
1
Cα
1 Cα
Nα
Cα
1
Cα
2
Cα
3
Cα
4
Cα
1. Cα
4
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
Many small
matrices
Few large
matrices
Segment size = proxy of number
of matrices and matrix sizes
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
34
2-step Model: Fraction of Analyzed Frames
Gltph 270,088 atoms
Trp cage 12,619 atoms T cell receptor 81,092 atoms
35
Observables: small segment lengths
(i.e., 2 , 4 , 6 , 8 , 10 , 12 , 14, 16, and 18 )
Matrix size (# elements)
Matrixperiod(s/matrices)
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
2-step Model: Fraction of Analyzed Frames
Observables: segment lengths
(i.e., 2 , 4 , 6 , 8 , 10 , 12 , 14, 16, and 18 )
Model: polynomial model of degree 2
obtained by least-square fitting observables
R2 = 0.88
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted)
36
Matrixperiod(s/matrices)
Matrix size (# elements)
Matrixperiod(s/matrices)
Matrix size (# elements)
2-step Model: Fraction of Analyzed Frames
Observables Degree 2 polynomial model Absolute error between data
and fitting model
R2 = 0.88
Matrixperiod(s/matrices)
Matrix size (# elements)
Matrixperiod(s/matrices)
Matrix size (# elements)
Matrixperiod(s/matrices)
Matrix size (# elements)
FractionofAnalyzedFrames
FractionofAnalyzedFrames
Error
2-step Model: Frames Distribution
• Given a trajectory, we model the proportions p and q of
analyzed frames (f) with periods k and k+1
§ Example: Gltph (27,000 atoms and TPS 318), trajectory of 1,000 frames.
Period between analyzed frames Period between analyzed frames Period between analyzed frames
Stride 2 Stride 6 Stride 12
Numberofanalyzedframes
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
38
Case Study: 1BDD Protein Conformations
Frame 1330 Frame 1360 Frame 1390
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted)
39
Case Study: 1BDD Protein Conformations
Frame 1330 Frame 1360 Frame 1390
Helix 1-3: λmax Helix 2-3: λmaxHelix 1-2: λmax
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted)
40
Case Study: 1BDD Protein Conformations
Frame 1330 Frame 1360 Frame 1390
Helix 1-3: λmax Helix 2-3: λmaxHelix 1-2: λmax
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted)
41
Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
42
Soil Moisture Data for Precision Farming
• Satellites collect raster data across the surface of the Earth
Image Source: http://www.esa-soilmoisture-cci.org/43
Soil Moisture: Incomplete Data
Image source: Ricardo Llamas, University of Delaware
Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/)
December 2000
Averages
(m3/m3)
44
Soil Moisture: Incomplete Data
Image source: Ricardo Llamas, University of Delaware
Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/)
December 2000
Averages
(m3/m3)
Causes of missing data:
● dense vegetation
● snow/ice cover
● frozen surface
● extremely dry surface
45
Soil Moisture: Coarse-grained Data
Original Resolution
27 km × 27 km
Image source: McPherson et al., Using coarse-grained occurrence data to predict species distributions at
finer spatial resolutions—possibilities and limitations, Ecological Modeling 192:499-522, 2006.
Desired Resolution
1 km × 1 km
46
Soil Moisture
ESA-CCI
Building a Closed-loop Workflow
Data Generation
Landscape
Surface DSM
Weather Data
NOAA
47
Satellite and
sensors data
Course-grained,
incomplete data
Soil Moisture
ESA-CCI
Building a Closed-loop Workflow
Data GenerationData Analytics
Landscape
Surface DSM
Weather Data
NOAA
48
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Satellite and
sensors data
Data
predictions
Course-grained,
incomplete data
Soil Moisture
ESA-CCI
Building a Closed-loop Workflow
Data GenerationData Analytics
Landscape
Surface DSM
Compute
Weather Data
NOAA
Kubernetes
Singularity
Shifter and
Docker
49
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Satellite and
sensors data
Data
predictions
Soil moisture
integrated in FDS for:
• Controlled
combustion;
• Wildfire
propagation
Fire Dynamics Simulator
Fine-grained,
complete data
Course-grained,
incomplete data
Soil Moisture
ESA-CCI
Building a Closed-loop Workflow
Data GenerationData Analytics
Landscape
Surface DSM
Compute
Weather Data
NOAA
Kubernetes
Singularity
Shifter and
Docker
50
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Data Feedback
Satellite and
sensors data
Data
predictions
Soil moisture
integrated in FDS for:
• Controlled
combustion
• Wildfire
propagation
Fire Dynamics Simulator
Fine-grained,
complete data
Course-grained,
incomplete data
Augmenting HPC with the Cloud
nodes
HPC system
ParallelFileSystem
51 Source: https://www.nextplatform.com/2018/02/26/adaptive-approach-bursting-hpc-cloud/
nodes
nodes nodes
nodes nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
Compute Analytics
Soil Moisture
ESA-CCI
Building a Closed-loop Workflow
Data GenerationData Analytics
Landscape
Surface DSM
Compute
Weather Data
NOAA
Kubernetes
Singularity
Shifter and
Docker
52
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Data Feedback
Satellite and
sensors data
Data
predictions
Soil moisture
integrated in FDS for:
• Controlled
combustion
• Wildfire
propagation
Fire Dynamics Simulator
Fine-grained,
complete data
Course-grained,
incomplete data
Hybrid Algorithms for Analytics
Data GenerationData AnalyticsCompute
53
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Data Feedback
Fine-grained,
complete data
Course-grained,
incomplete data
Data
(observations)
Parameter
Global versus Local Data Modeling
DependentVariable
Data
(observations)
Parameter
DependentVariable
54
Data
(observations)
Surrogate
Model
Parameter
Global versus Local Data Modeling
DependentVariable
Data
(observations)
Parameter
DependentVariable
k-Nearest Neighbor
Model
55
Global
modeling
Local
modeling
SBM vs. k-NN
56
Observations
Piecewise linear data
Surrogate-based model
2-Nearest Neighbor Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
SBM vs. k-NN
57
Observations
Piecewise linear data
Surrogate-based model
2-Nearest Neighbor Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
SBM vs. k-NN
58
Observations
Piecewise linear data
Surrogate-based model
2-Nearest Neighbor Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
SBM vs. k-NN
59
Observations
Piecewise linear data
Surrogate-based model
2-Nearest Neighbor Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
SBM vs. k-NN
60
Observations
Piecewise linear data
Surrogate-based model
2-Nearest Neighbor Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
Hybrid Piecewise Polynomial Modeling (HYPPO)
Surrogate-Based Modeling
• Use all sampled data
• Construct one polynomial
(single complex global model)
k Nearest Neighbors
• Use local data
• Compute the average
(many simple local models)
61 Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
Hybrid Piecewise Polynomial Modeling (HYPPO)
Surrogate-Based Modeling
• Use all sampled data
• Construct one polynomial
(single complex global model)
k Nearest Neighbors
• Use local data
• Compute the average
(many simple local models)
Hybrid modeling à HYPPO
• Use local data
• Construct many polynomials
(many complex local models)
62 Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
63
Observations
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
64
Surrogate-based Model
Observations
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
65
Surrogate-based Model
Observations
k Nearest Neighbors Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
66
Surrogate-based Model
Observations
k Nearest Neighbors Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33
Hybrid Piecewise Polynomial
Model (HYPPO)
3
2
1
0
Soil moisture predictions Model degrees
Case Study: Fine-grained Modeling of Mid-Atlantic Region
67 D. Rorabaugh, M. Guevara, R. Llamas, J. Kitson, R. Vargas, M. Taufer. SOMOSPIE: A modular SOil MOisture SPatial
Inference Engine based on data driven decision. arXiv:1904.07754, 2019
Challenges and Opportunities (I)
• Two trends in HPC are impacting scientific applications:
§ Convergence of simulations and data analytics
§ Emergence of edge computing
• Applications in precision medicine and precision farming
are leveraging these trends
• There is the need to further integrate these trends in HPC
§ New challenges and opportunities for the HPC community
68
Challenges and Opportunities (II)
• Efficiency: Optimize performance and power usage associated to data
generation, movement, and analytics
• Non-invasive: Capture knowledge from data without rewriting
simulations’ legacy codes or simulations’ scripts
• Generality: Build workflows that support different types of analytics
across different applications and different data
• Portability: Execute combined compute and analytics across different
systems, including the edge, and with heterogenous resources
• Scalability: Design methods for knowledge discovery at scale (e.g.,
scalable ML algorithms) for “compute + analytics + data” workflows
69
Scientific Applications and Heterogeneous Architectures

Scientific Applications and Heterogeneous Architectures

  • 1.
    Scientific Applications and HeterogeneousArchitectures – Data Analytics and the Intersection of HPC and Edge Computing Michela Taufer
  • 2.
    Acknowledgements T. Johnston T. EstradaM. Cuendet E. Deelman R. da Silva Sponsors: H. Weinstein T. Do S. Thomas B. Mulligan A. RazaviH. Carrillo R. Vargas D. Rorabaugh R. LLamas M. Guevara
  • 3.
    Trends in Next-GenerationSystems: IO Gap and Ensembles 5x 12x 64x 13x Rising Importance of Ensembles Source: https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification ExascalePetascaleTerascale Sierra (2018) Sequoia (2013) Purple (2005) Number of Jobs SystemScale(FLOPS) Few Many Widening IO Gap Source: Lucy Nowell (DOE) 3
  • 4.
    4 Pegasus LIGO PyCBCWorkflow Laser Interferometer Gravitational-Wave Observatory (LIGO) Trends in Workflows: Compute + Analytics + Data Source: Ewa Deelman, ISI - USC
  • 5.
    Extending HPC toConnect to the “Edge” Image Source: https://iiot-world.com/digital-disruption/the-right-representation-of-digital-twins-for-data-analytics/ Simulations Data Analytics Real System with Sensors 5
  • 6.
    Two Use Cases •Extending HPC to integrate data analytics § Next generation MD workflows § Molecular structures § Data transformation – i.e., capturing information § Dataflow modeling – i.e., lost information • Extending HPC to connect to the “Edge” § Next generation precision farming § Soil moisture data § Data prediction – i.e., from coarse- to fine-grained information 6
  • 7.
    Two Use Cases •Extending HPC to integrate data analytics § Next generation MD workflows § Molecular structures § Data transformation – i.e., capturing information § Dataflow modeling – i.e., lost information • Extending HPC to connect to the “Edge” § Next generation precision farming § Soil moisture data § Data prediction – i.e., from coarse- to fine-grained information 7
  • 8.
    Classical Molecular DynamicsSimulations Source: Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/ • A MD simulation comprises of hundreds of thousands of MD job • Each job preforms hundreds of thousands of MD steps 8
  • 9.
    Classical Molecular DynamicsSimulations àForces on single atoms à Acceleration à Velocity à Position • MD step computes forces on single atoms (e.g., bond, dihedrals, nonbond) • Forces are added to compute acceleration • Acceleration is used to update velocities • Velocities are used to update the atom positions • Every n steps (Stride) à Store 3D snapshot or frame 9
  • 10.
    Building a Closed-loopWorkflow MD code (e.g., GROMACS) Run n-Stride simulation steps Data Generation 10
  • 11.
    Building a Closed-loopWorkflow MD code (e.g., GROMACS) Dataflow Ingestor Run n-Stride simulation steps Data Generation Data Storage 11 Plumed In-memory Staging Area DataSpaces Burst Buffer Parallel File System
  • 12.
    Building a Closed-loopWorkflow MD code (e.g., GROMACS) A4MD analytics Retriever Dataflow Ingestor Run n-Stride simulation steps Data Generation Data AnalyticsData Storage Dataflow A4MD analytics Analytics representations + algorithms ML-inferred algorithms Collective variables 12 Plumed In-memory Staging Area DataSpaces Burst Buffer Parallel File System
  • 13.
    Building a Closed-loopWorkflow MD code (e.g., GROMACS) A4MD analytics Retriever Dataflow Ingestor Run n-Stride simulation steps Data Generation Data AnalyticsData Storage Dataflow A4MD analytics Analytics representations + algorithms ML-inferred algorithms Collective variables 13 Data Feedback Plumed In-memory Staging Area DataSpaces Burst Buffer Parallel File System
  • 14.
    Extending HPC toIntegrate Data Analytics Analysis Local Storage MEMORY CN Application CN CN CN CN CN CN CN CN CN CN CN CN CN CN CN Parallel File System MEMORY CN Application CN CN CN CN CN CN CN CN CN CN CN CN CN CN Analysis CN 14 Data Data + metadata Parallel File System
  • 15.
    Augmenting HPC withIn Situ and In Transit Analytics AnalysisSimulation Node 1 Node 2 Node 3 Network Interconnect Core Core Core Core Core Core Core Core Simulation Node Analysis SharedMemory Example of tools: • DataSpaces (Rutgers U.) • DataStager (GeorgiaTech) 15
  • 16.
    In Situ andIn Transit Analytics for MD Simulations • We want to capture what is going on in each frame without: § Disrupting the simulation (e.g., stealing CPU and memory on the node) § Moving all the frames to a central file system and analyzing them once the simulation is over § Comparing each frame with past frames of the same job § Comparing each frame with frames of other jobs Frames (or snapshots) of an MD trajectory with a stride of 5 steps: Frame 55 Frame 60 Frame 65 Frame 70 Frame 80 Frame 75 16
  • 17.
    Building a Closed-loopWorkflow MD code (e.g., GROMACS) A4MD analytics Retriever Dataflow Ingestor Run n-Stride simulation steps Data Generation Data AnalyticsData Storage Dataflow A4MD analytics Analytics representations + algorithms ML-inferred algorithms Collective variables 17 Data Feedback Plumed In-memory Staging Area DataSpaces Burst Buffer Parallel File System
  • 18.
    Building a Closed-loopWorkflow Data Generation Data AnalyticsData Storage 18 Data Feedback A4MD analytics Retriever Dataflow A4MD analytics Analytics representations + algorithms ML-inferred algorithms Collective variables
  • 19.
    Proteins with SimilarFunctions Key principle: proteins with similar structures have similar functions • Measure millions of protein variants expressed from yeast or bacteria • Structure proteins to produce desired properties (protein engineering) 19 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
  • 20.
    Protein Representations 3D Cartesianrepresentation Surface representationMulti-fold representation 20 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
  • 21.
    From Multi-fold Representationto Image Encoding 21 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
  • 22.
    From Multi-fold Representationto Image Encoding 22 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
  • 23.
    From Multi-fold Representationto Image Encoding 23 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
  • 24.
    High-Throughput Protein Analysis •Eight biological processes from biological process taxonomy in RCSB-PDB • 62,991 proteins from the PDB Proteinsas3Dtens 24 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018. convolutional neural network Google’s Inception-v3, Gem-Net Groundtrue Predictions Normalized Confusion Matrix - Accuracy 80.66%
  • 25.
    Capturing Changes inFolding with Transfer Learning 25 T. Estrada,et al. A Graphic Encoding Method for Quantitative Classification of Protein Structure and Representation of Conformational Changes. 2019 Protein: Opsin Frame 50 Frame 1500 Frame 1950
  • 26.
    Two Use Cases •Extending HPC to integrate data analytics § Next generation MD workflows § Molecular structures § Data transformation – i.e., capturing information § Dataflow modeling – i.e., lost information • Extending HPC to connect to the “Edge” § Next generation precision farming § Soil moisture data § Data prediction – i.e., from coarse- to fine-grained information 26
  • 27.
    Building a Closed-loopWorkflow MD code (e.g., GROMACS) A4MD analytics Retriever Dataflow Ingestor Run n-Stride simulation steps Data Generation Data AnalyticsData Storage Dataflow A4MD analytics Analytics representations + algorithms ML-inferred algorithms Collective variables 27 Data Feedback Plumed In-memory Staging Area DataSpaces Burst Buffer Parallel File System
  • 28.
    Modeling Lost Frames Node MD Simulation Data Analytics Dataspaces Server MainMemory Dataspaces Shared Memory Region Single Node (In Situ) 28 S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. S1, S2, S3: generate MD frame W1, W2, W2: write to shared memory R1, R2, R3: read from shared memory A1, A3: analyze frame
  • 29.
    Dataflow for ModelingLost Frames 29 Generator of MD frames Ingestor Dataflow Plumed In-memory Staging Area DataSpaces Retriever Data Generation Data AnalyticsData Storage Dataflow Analytics representations Distance matrices λmax eigenvalues T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
  • 30.
    Eigenvalues: Proxy forStructural Changes Frames (or snapshots) of an MD trajectory with a stride of 5 steps: Frame 55 Frame 60 Frame 65 Frame 70 Frame 80 Frame 75 30 T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017. λmax 60 λmax65 λmax70 λmax75 λmax80λmax55 The distance between two max eigenvalues can serve as a proxy for distance between the two associated conformations
  • 31.
    Cα 1 Cα Nα 31 Single frame attime t Nα Cα atoms S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
  • 32.
    Cα Nα Cα Nα/2 Cα Nα/2 - 1 Cα 1 Cα 1Cα Nα 0 dij dji 0 Distance of two segments with segment length: Nα/2 x Cα atoms (Cα 1 - Cα Nα/2 -1 ) ( Cα Nα/2 - Cα Nα/2) Single λmax Cα 1 Cα Nα 32 Single frame at time t Nα Cα atoms S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
  • 33.
    Cα 1 Cα 2 Cα 3 Cα 4 Cα 1. Cα 4 0 dij dji0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 Cα Nα Cα Nα/2 Cα Nα/2 - 1 Cα 1 Cα 1 Cα Nα 0 dij dji 0 Distance of two segments with segment length: Nα/2 x Cα atoms (Cα 1 - Cα Nα/2 -1 ) ( Cα Nα/2 - Cα Nα/2) Single λmax Cα 1 Cα Nα 33 Distances of Nα/2 segments with segment length: 2 x Cα atoms (Cα 1 - Cα 2 ) ( Cα 3 - Cα 4) (Cα 5 - Cα 6 ) ( Cα 7 – Cα 8) …. (Cα Nα/2-3 - Cα Nα/2-2 ) ( Cα Nα/2-1 - Cα Nα/2) λmax, 1 λmax, 2 … λmax, Nα/2 Single frame at time t Nα Cα atoms S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
  • 34.
    Cα Nα Cα Nα/2 Cα Nα/2 - 1 Cα 1 Cα 1Cα Nα Cα 1 Cα 2 Cα 3 Cα 4 Cα 1. Cα 4 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 0 dij dji 0 Many small matrices Few large matrices Segment size = proxy of number of matrices and matrix sizes S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. 34
  • 35.
    2-step Model: Fractionof Analyzed Frames Gltph 270,088 atoms Trp cage 12,619 atoms T cell receptor 81,092 atoms 35 Observables: small segment lengths (i.e., 2 , 4 , 6 , 8 , 10 , 12 , 14, 16, and 18 ) Matrix size (# elements) Matrixperiod(s/matrices) S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019.
  • 36.
    2-step Model: Fractionof Analyzed Frames Observables: segment lengths (i.e., 2 , 4 , 6 , 8 , 10 , 12 , 14, 16, and 18 ) Model: polynomial model of degree 2 obtained by least-square fitting observables R2 = 0.88 S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted) 36 Matrixperiod(s/matrices) Matrix size (# elements) Matrixperiod(s/matrices) Matrix size (# elements)
  • 37.
    2-step Model: Fractionof Analyzed Frames Observables Degree 2 polynomial model Absolute error between data and fitting model R2 = 0.88 Matrixperiod(s/matrices) Matrix size (# elements) Matrixperiod(s/matrices) Matrix size (# elements) Matrixperiod(s/matrices) Matrix size (# elements) FractionofAnalyzedFrames FractionofAnalyzedFrames Error
  • 38.
    2-step Model: FramesDistribution • Given a trajectory, we model the proportions p and q of analyzed frames (f) with periods k and k+1 § Example: Gltph (27,000 atoms and TPS 318), trajectory of 1,000 frames. Period between analyzed frames Period between analyzed frames Period between analyzed frames Stride 2 Stride 6 Stride 12 Numberofanalyzedframes S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. 38
  • 39.
    Case Study: 1BDDProtein Conformations Frame 1330 Frame 1360 Frame 1390 S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted) 39
  • 40.
    Case Study: 1BDDProtein Conformations Frame 1330 Frame 1360 Frame 1390 Helix 1-3: λmax Helix 2-3: λmaxHelix 1-2: λmax S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted) 40
  • 41.
    Case Study: 1BDDProtein Conformations Frame 1330 Frame 1360 Frame 1390 Helix 1-3: λmax Helix 2-3: λmaxHelix 1-2: λmax S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted) 41
  • 42.
    Two Use Cases •Extending HPC to integrate data analytics § Next generation MD workflows § Molecular structures § Data transformation – i.e., capturing information § Dataflow modeling – i.e., lost information • Extending HPC to connect to the “Edge” § Next generation precision farming § Soil moisture data § Data prediction – i.e., from coarse- to fine-grained information 42
  • 43.
    Soil Moisture Datafor Precision Farming • Satellites collect raster data across the surface of the Earth Image Source: http://www.esa-soilmoisture-cci.org/43
  • 44.
    Soil Moisture: IncompleteData Image source: Ricardo Llamas, University of Delaware Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/) December 2000 Averages (m3/m3) 44
  • 45.
    Soil Moisture: IncompleteData Image source: Ricardo Llamas, University of Delaware Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/) December 2000 Averages (m3/m3) Causes of missing data: ● dense vegetation ● snow/ice cover ● frozen surface ● extremely dry surface 45
  • 46.
    Soil Moisture: Coarse-grainedData Original Resolution 27 km × 27 km Image source: McPherson et al., Using coarse-grained occurrence data to predict species distributions at finer spatial resolutions—possibilities and limitations, Ecological Modeling 192:499-522, 2006. Desired Resolution 1 km × 1 km 46
  • 47.
    Soil Moisture ESA-CCI Building aClosed-loop Workflow Data Generation Landscape Surface DSM Weather Data NOAA 47 Satellite and sensors data Course-grained, incomplete data
  • 48.
    Soil Moisture ESA-CCI Building aClosed-loop Workflow Data GenerationData Analytics Landscape Surface DSM Weather Data NOAA 48 A4MD analytics A4MD analytics Analytics representations + algorithms Satellite and sensors data Data predictions Course-grained, incomplete data
  • 49.
    Soil Moisture ESA-CCI Building aClosed-loop Workflow Data GenerationData Analytics Landscape Surface DSM Compute Weather Data NOAA Kubernetes Singularity Shifter and Docker 49 A4MD analytics A4MD analytics Analytics representations + algorithms Satellite and sensors data Data predictions Soil moisture integrated in FDS for: • Controlled combustion; • Wildfire propagation Fire Dynamics Simulator Fine-grained, complete data Course-grained, incomplete data
  • 50.
    Soil Moisture ESA-CCI Building aClosed-loop Workflow Data GenerationData Analytics Landscape Surface DSM Compute Weather Data NOAA Kubernetes Singularity Shifter and Docker 50 A4MD analytics A4MD analytics Analytics representations + algorithms Data Feedback Satellite and sensors data Data predictions Soil moisture integrated in FDS for: • Controlled combustion • Wildfire propagation Fire Dynamics Simulator Fine-grained, complete data Course-grained, incomplete data
  • 51.
    Augmenting HPC withthe Cloud nodes HPC system ParallelFileSystem 51 Source: https://www.nextplatform.com/2018/02/26/adaptive-approach-bursting-hpc-cloud/ nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes nodes Compute Analytics
  • 52.
    Soil Moisture ESA-CCI Building aClosed-loop Workflow Data GenerationData Analytics Landscape Surface DSM Compute Weather Data NOAA Kubernetes Singularity Shifter and Docker 52 A4MD analytics A4MD analytics Analytics representations + algorithms Data Feedback Satellite and sensors data Data predictions Soil moisture integrated in FDS for: • Controlled combustion • Wildfire propagation Fire Dynamics Simulator Fine-grained, complete data Course-grained, incomplete data
  • 53.
    Hybrid Algorithms forAnalytics Data GenerationData AnalyticsCompute 53 A4MD analytics A4MD analytics Analytics representations + algorithms Data Feedback Fine-grained, complete data Course-grained, incomplete data
  • 54.
    Data (observations) Parameter Global versus LocalData Modeling DependentVariable Data (observations) Parameter DependentVariable 54
  • 55.
    Data (observations) Surrogate Model Parameter Global versus LocalData Modeling DependentVariable Data (observations) Parameter DependentVariable k-Nearest Neighbor Model 55 Global modeling Local modeling
  • 56.
    SBM vs. k-NN 56 Observations Piecewiselinear data Surrogate-based model 2-Nearest Neighbor Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 57.
    SBM vs. k-NN 57 Observations Piecewiselinear data Surrogate-based model 2-Nearest Neighbor Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 58.
    SBM vs. k-NN 58 Observations Piecewiselinear data Surrogate-based model 2-Nearest Neighbor Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 59.
    SBM vs. k-NN 59 Observations Piecewiselinear data Surrogate-based model 2-Nearest Neighbor Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 60.
    SBM vs. k-NN 60 Observations Piecewiselinear data Surrogate-based model 2-Nearest Neighbor Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 61.
    Hybrid Piecewise PolynomialModeling (HYPPO) Surrogate-Based Modeling • Use all sampled data • Construct one polynomial (single complex global model) k Nearest Neighbors • Use local data • Compute the average (many simple local models) 61 Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 62.
    Hybrid Piecewise PolynomialModeling (HYPPO) Surrogate-Based Modeling • Use all sampled data • Construct one polynomial (single complex global model) k Nearest Neighbors • Use local data • Compute the average (many simple local models) Hybrid modeling à HYPPO • Use local data • Construct many polynomials (many complex local models) 62 Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 63.
    63 Observations Johnston et al.,“HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 64.
    64 Surrogate-based Model Observations Johnston etal., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 65.
    65 Surrogate-based Model Observations k NearestNeighbors Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33
  • 66.
    66 Surrogate-based Model Observations k NearestNeighbors Model Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth Surfaces.” SBAC-PAD 2016: 26-33 Hybrid Piecewise Polynomial Model (HYPPO)
  • 67.
    3 2 1 0 Soil moisture predictionsModel degrees Case Study: Fine-grained Modeling of Mid-Atlantic Region 67 D. Rorabaugh, M. Guevara, R. Llamas, J. Kitson, R. Vargas, M. Taufer. SOMOSPIE: A modular SOil MOisture SPatial Inference Engine based on data driven decision. arXiv:1904.07754, 2019
  • 68.
    Challenges and Opportunities(I) • Two trends in HPC are impacting scientific applications: § Convergence of simulations and data analytics § Emergence of edge computing • Applications in precision medicine and precision farming are leveraging these trends • There is the need to further integrate these trends in HPC § New challenges and opportunities for the HPC community 68
  • 69.
    Challenges and Opportunities(II) • Efficiency: Optimize performance and power usage associated to data generation, movement, and analytics • Non-invasive: Capture knowledge from data without rewriting simulations’ legacy codes or simulations’ scripts • Generality: Build workflows that support different types of analytics across different applications and different data • Portability: Execute combined compute and analytics across different systems, including the edge, and with heterogenous resources • Scalability: Design methods for knowledge discovery at scale (e.g., scalable ML algorithms) for “compute + analytics + data” workflows 69