Scientific Applications and Heterogeneous Architectures

Scientific Applications and
Heterogeneous Architectures –
Data Analytics and the Intersection of HPC
and Edge Computing
Michela Taufer

Acknowledgements
T. Johnston
T. Estrada M. Cuendet E. Deelman
R. da Silva
Sponsors:
H. Weinstein
T. Do
S. Thomas
B. Mulligan
A. RazaviH. Carrillo
R. Vargas
D. Rorabaugh
R. LLamas M. Guevara

Trends in Next-Generation Systems: IO Gap and Ensembles
5x
12x
64x
13x
Rising Importance of Ensembles
Source:
https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification
ExascalePetascaleTerascale
Sierra (2018)
Sequoia (2013)
Purple (2005)
Number of Jobs
SystemScale(FLOPS)
Few Many
Widening IO Gap
Source: Lucy Nowell (DOE)
3

4
Pegasus LIGO PyCBC Workflow
Laser Interferometer Gravitational-Wave Observatory (LIGO)
Trends in Workflows: Compute + Analytics + Data
Source: Ewa Deelman, ISI - USC

Extending HPC to Connect to the “Edge”
Image Source: https://iiot-world.com/digital-disruption/the-right-representation-of-digital-twins-for-data-analytics/
Simulations Data Analytics Real System with Sensors
5

Two Use Cases
• Extending HPC to integrate data analytics
§ Next generation MD workflows
§ Molecular structures
§ Data transformation – i.e., capturing information
§ Dataflow modeling – i.e., lost information
• Extending HPC to connect to the “Edge”
§ Next generation precision farming
§ Soil moisture data
§ Data prediction – i.e., from coarse- to fine-grained information
6

Two Use Cases
7

Classical Molecular Dynamics Simulations
Source: Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/
• A MD simulation comprises of
hundreds of thousands of MD job
• Each job preforms hundreds of
thousands of MD steps
8

Classical Molecular Dynamics Simulations
àForces on single atoms
à Acceleration
à Velocity
à Position
• MD step computes forces on single
atoms (e.g., bond, dihedrals, nonbond)
• Forces are added to compute
acceleration
• Acceleration is used to update
velocities
• Velocities are used to update the atom
positions
• Every n steps (Stride)
à Store 3D snapshot or frame
9

Building a Closed-loop Workflow
MD code
(e.g., GROMACS)
Run n-Stride
simulation steps
Data Generation
10

MD code
(e.g., GROMACS)
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data Storage
11
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System

MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Data Generation Data AnalyticsData Storage
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
12
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System

MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
13
Data Feedback
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System

Extending HPC to Integrate Data Analytics
Analysis
Local Storage
MEMORY
CN
Application
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
Parallel File System
MEMORY
CN
Application
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
CN
Analysis
CN
14
Data Data + metadata
Parallel File System

Augmenting HPC with In Situ and In Transit Analytics
AnalysisSimulation
Node 1 Node 2 Node 3
Network Interconnect
Core Core
Core Core
Core Core
Core Core
Simulation
Node
Analysis
SharedMemory
Example of tools:
• DataSpaces (Rutgers U.)
• DataStager (GeorgiaTech)
15

In Situ and In Transit Analytics for MD Simulations
• We want to capture what is going on in each frame without:
§ Disrupting the simulation (e.g., stealing CPU and memory on the node)
§ Moving all the frames to a central file system and analyzing them once the
simulation is over
§ Comparing each frame with past frames of the same job
§ Comparing each frame with frames of other jobs
Frames (or snapshots) of an MD trajectory with a stride of 5 steps:
Frame 55 Frame 60 Frame 65 Frame 70 Frame 80
Frame 75
16

MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
17
Data Feedback
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System

18
Data Feedback
A4MD
analytics
Retriever
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables

Proteins with Similar Functions
Key principle: proteins with similar structures have similar functions
• Measure millions of protein variants expressed from yeast or bacteria
• Structure proteins to produce desired properties (protein engineering)
19 T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.

Protein Representations
3D Cartesian representation Surface representationMulti-fold representation

From Multi-fold Representation to Image Encoding

High-Throughput Protein Analysis
• Eight biological processes from biological process taxonomy in RCSB-PDB
• 62,991 proteins from the PDB
Proteinsas3Dtens
24
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer.
convolutional
neural network
Google’s Inception-v3,
Gem-Net Groundtrue
Predictions
Normalized Confusion Matrix - Accuracy 80.66%

Capturing Changes in Folding with Transfer Learning
25
T. Estrada,et al. A Graphic Encoding Method for Quantitative Classification of Protein Structure and
Representation of Conformational Changes. 2019
Protein: Opsin Frame 50 Frame 1500 Frame 1950

Two Use Cases
26

MD code
(e.g., GROMACS)
A4MD
analytics
Retriever
Dataflow
Ingestor
Run n-Stride
simulation steps
Dataflow A4MD
analytics
Analytics
representations
+ algorithms
ML-inferred
algorithms
Collective
variables
27
Data Feedback
Plumed In-memory
Staging Area
DataSpaces
Burst Buffer
Parallel File
System

Modeling Lost Frames
Node
MD
Simulation
Data
Analytics
Dataspaces
Server
Main Memory
Dataspaces
Shared Memory Region
Single Node (In Situ)
28 S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics
Simulations for Next-generation Supercomputers. eScience, 2019.
S1, S2, S3: generate MD frame
W1, W2, W2: write to shared memory
R1, R2, R3: read from shared memory
A1, A3: analyze frame

Dataflow for Modeling Lost Frames
29
Generator of
MD frames
Ingestor
Dataflow
Plumed In-memory
Staging Area
DataSpaces Retriever
Dataflow
Analytics
representations
Distance matrices
λmax
eigenvalues
T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of
Computational Chemistry (JCC), 38(16):1419-1430, 2017.

Eigenvalues: Proxy for Structural Changes
Frames (or snapshots) of an MD trajectory with a stride of 5 steps:
Frame 55 Frame 60 Frame 65 Frame 70 Frame 80
Frame 75
30
T. Johnston et al. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of
Computational Chemistry (JCC), 38(16):1419-1430, 2017.
λmax 60 λmax65 λmax70 λmax75 λmax80λmax55
The distance between two max eigenvalues can serve as a proxy for
distance between the two associated conformations

Cα
1
Cα
Nα
31
Single frame at time t
Nα Cα atoms
S. Thomas et al. Characterization of In Situ and In Transit Analytics of Molecular Dynamics

Cα
Nα
Cα
Nα/2
Cα
Nα/2 - 1
Cα
1
Cα
1 Cα
Nα
0 dij
dji 0
Distance of two segments with
segment length:
Nα/2 x Cα atoms
(Cα
1 - Cα
Nα/2 -1 ) ( Cα
Nα/2 - Cα
Nα/2)
Single λmax
Cα
1
Cα
Nα
32
Nα Cα atoms

Cα
1
Cα
2
Cα
3
Cα
4
Cα
1. Cα
4
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
Cα
Nα
Cα
Nα/2
Cα
Nα/2 - 1
Cα
1
Cα
1 Cα
Nα
0 dij
dji 0
Distance of two segments with
segment length:
Nα/2 x Cα atoms
(Cα
1 - Cα
Nα/2 -1 ) ( Cα
Nα/2 - Cα
Nα/2)
Single λmax
Cα
1
Cα
Nα
33
Distances of Nα/2 segments with
segment length: 2 x Cα atoms
(Cα
1 - Cα
2 ) ( Cα
3 - Cα
4)
(Cα
5 - Cα
6 ) ( Cα
7 – Cα
8)
….
(Cα
Nα/2-3 - Cα
Nα/2-2 ) ( Cα
Nα/2-1 - Cα
Nα/2)
λmax, 1 λmax, 2 … λmax, Nα/2
Nα Cα atoms

Cα
Nα
Cα
Nα/2
Cα
Nα/2 - 1
Cα
1
Cα
1 Cα
Nα
Cα
1
Cα
2
Cα
3
Cα
4
Cα
1. Cα
4
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
0 dij
dji 0
Many small
matrices
Few large
matrices
Segment size = proxy of number
of matrices and matrix sizes
34

2-step Model: Fraction of Analyzed Frames
Gltph 270,088 atoms
Trp cage 12,619 atoms T cell receptor 81,092 atoms
35
Observables: small segment lengths
(i.e., 2 , 4 , 6 , 8 , 10 , 12 , 14, 16, and 18 )
Matrix size (# elements)
Matrixperiod(s/matrices)

Observables: segment lengths
(i.e., 2 , 4 , 6 , 8 , 10 , 12 , 14, 16, and 18 )
Model: polynomial model of degree 2
obtained by least-square fitting observables
R2 = 0.88
Simulations for Next-generation Supercomputers. eScience, 2019. (Submitted)
36

Observables Degree 2 polynomial model Absolute error between data
and fitting model
R2 = 0.88
FractionofAnalyzedFrames
FractionofAnalyzedFrames
Error

2-step Model: Frames Distribution
• Given a trajectory, we model the proportions p and q of
analyzed frames (f) with periods k and k+1
§ Example: Gltph (27,000 atoms and TPS 318), trajectory of 1,000 frames.
Period between analyzed frames Period between analyzed frames Period between analyzed frames
Stride 2 Stride 6 Stride 12
Numberofanalyzedframes
38

Case Study: 1BDD Protein Conformations
Frame 1330 Frame 1360 Frame 1390
39

Helix 1-3: λmax Helix 2-3: λmaxHelix 1-2: λmax
40

Helix 1-3: λmax Helix 2-3: λmaxHelix 1-2: λmax
41

Two Use Cases
42

Soil Moisture Data for Precision Farming
• Satellites collect raster data across the surface of the Earth
Image Source: http://www.esa-soilmoisture-cci.org/43

Soil Moisture: Incomplete Data
Image source: Ricardo Llamas, University of Delaware
Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/)
December 2000
Averages
(m3/m3)
44

Soil Moisture: Incomplete Data
Image source: Ricardo Llamas, University of Delaware
Data source: ESA-CCI soil moisture database (http://www.esa-soilmoisture-cci.org/)
December 2000
Averages
(m3/m3)
Causes of missing data:
● dense vegetation
● snow/ice cover
● frozen surface
● extremely dry surface
45

Soil Moisture: Coarse-grained Data
Original Resolution
27 km × 27 km
Image source: McPherson et al., Using coarse-grained occurrence data to predict species distributions at
finer spatial resolutions—possibilities and limitations, Ecological Modeling 192:499-522, 2006.
Desired Resolution
1 km × 1 km
46

Soil Moisture
ESA-CCI
Data Generation
Landscape
Surface DSM
Weather Data
NOAA
47
Satellite and
sensors data
Course-grained,
incomplete data

Soil Moisture
ESA-CCI
Data GenerationData Analytics
Landscape
Surface DSM
Weather Data
NOAA
48
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Satellite and
sensors data
Data
predictions
Course-grained,
incomplete data

Soil Moisture
ESA-CCI
Landscape
Surface DSM
Compute
Weather Data
NOAA
Kubernetes
Singularity
Shifter and
Docker
49
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Satellite and
sensors data
Data
predictions
Soil moisture
integrated in FDS for:
• Controlled
combustion;
• Wildfire
propagation
Fire Dynamics Simulator
Fine-grained,
complete data
Course-grained,
incomplete data

Soil Moisture
ESA-CCI
Landscape
Surface DSM
Compute
Weather Data
NOAA
Kubernetes
Singularity
Shifter and
Docker
50
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Data Feedback
Satellite and
sensors data
Data
predictions
Soil moisture
• Controlled
combustion
• Wildfire
propagation
Fine-grained,
complete data
Course-grained,
incomplete data

Augmenting HPC with the Cloud
nodes
HPC system
ParallelFileSystem
51 Source: https://www.nextplatform.com/2018/02/26/adaptive-approach-bursting-hpc-cloud/
nodes
nodes nodes
nodes nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
nodes
Compute Analytics

Soil Moisture
ESA-CCI
Landscape
Surface DSM
Compute
Weather Data
NOAA
Kubernetes
Singularity
Shifter and
Docker
52
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Data Feedback
Satellite and
sensors data
Data
predictions
Soil moisture
• Controlled
combustion
• Wildfire
propagation
Fine-grained,
complete data
Course-grained,
incomplete data

Hybrid Algorithms for Analytics
Data GenerationData AnalyticsCompute
53
A4MD
analytics
A4MD
analytics
Analytics
representations
+ algorithms
Data Feedback
Fine-grained,
complete data
Course-grained,
incomplete data

Data
(observations)
Parameter
Global versus Local Data Modeling
DependentVariable
Data
(observations)
Parameter
DependentVariable
54

Data
(observations)
Surrogate
Model
Parameter
Global versus Local Data Modeling
DependentVariable
Data
(observations)
Parameter
DependentVariable
k-Nearest Neighbor
Model
55
Global
modeling
Local
modeling

SBM vs. k-NN
56
Observations
Piecewise linear data
Surrogate-based model
2-Nearest Neighbor Model
Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth
Surfaces.” SBAC-PAD 2016: 26-33

SBM vs. k-NN
57
Observations

SBM vs. k-NN
58
Observations

SBM vs. k-NN
59
Observations

SBM vs. k-NN
60
Observations

Hybrid Piecewise Polynomial Modeling (HYPPO)
Surrogate-Based Modeling
• Use all sampled data
• Construct one polynomial
(single complex global model)
k Nearest Neighbors
• Use local data
• Compute the average
(many simple local models)
61 Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth

Hybrid Piecewise Polynomial Modeling (HYPPO)
Surrogate-Based Modeling
• Use all sampled data
• Construct one polynomial
(single complex global model)
k Nearest Neighbors
• Use local data
• Compute the average
(many simple local models)
Hybrid modeling à HYPPO
• Use local data
• Construct many polynomials
(many complex local models)
62 Johnston et al., “HYPPO: A Hybrid, Piecewise Polynomial Modeling Technique for Non-Smooth

63
Observations

64
Surrogate-based Model
Observations

65
Observations
k Nearest Neighbors Model

66
Observations
k Nearest Neighbors Model
Hybrid Piecewise Polynomial
Model (HYPPO)

3
2
1
0
Soil moisture predictions Model degrees
Case Study: Fine-grained Modeling of Mid-Atlantic Region
67 D. Rorabaugh, M. Guevara, R. Llamas, J. Kitson, R. Vargas, M. Taufer. SOMOSPIE: A modular SOil MOisture SPatial
Inference Engine based on data driven decision. arXiv:1904.07754, 2019

Challenges and Opportunities (I)
• Two trends in HPC are impacting scientific applications:
§ Convergence of simulations and data analytics
§ Emergence of edge computing
• Applications in precision medicine and precision farming
are leveraging these trends
• There is the need to further integrate these trends in HPC
§ New challenges and opportunities for the HPC community
68

Challenges and Opportunities (II)
• Efficiency: Optimize performance and power usage associated to data
generation, movement, and analytics
• Non-invasive: Capture knowledge from data without rewriting
simulations’ legacy codes or simulations’ scripts
• Generality: Build workflows that support different types of analytics
across different applications and different data
• Portability: Execute combined compute and analytics across different
systems, including the edge, and with heterogenous resources
• Scalability: Design methods for knowledge discovery at scale (e.g.,
scalable ML algorithms) for “compute + analytics + data” workflows
69

Scientific Applications and Heterogeneous Architectures

Scientific Applications and Heterogeneous Architectures

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Scientific Applications and Heterogeneous Architectures

Similar to Scientific Applications and Heterogeneous Architectures (20)

More from inside-BigData.com

More from inside-BigData.com (20)

Recently uploaded

Recently uploaded (20)

Scientific Applications and Heterogeneous Architectures