Building an ALCF Data Service:
Interactive, scalable, reproducible data science
Ian Foster
Rick Wagner, Nick Saint, Eric Blau
Kyle Chard, Yadu Nand Babuji
Logan Ward, Ben Blaiszik
Mike Papka
with
André Schleife and Cheng-Wei Lee
Aiichiro Nakano (USC - ALCF INCITE 2017), Maria Chan (ANL - ALCF INCITE 2016), André Schleife (UIUC - ALCF INCITE 2016)
Overview
• Leadership simulations produce data of great scientific value
• We demonstrate how to:
 Make data more accessible and useful by associating them with rich data lifecycle and analysis services
 Leverage advanced data science and machine learning (ML) methods to reduce simulation costs and increase data quality and value
Find ♦ Analyze ♦ Publish
Interactive ♦ Scalable ♦ Reproducible
Collect ♦ Process ♦ Represent ♦ Learn
Interactive, scalable, reproducible data science
PUBLISH
 Automate capture, publication, and indexing of results from ALCF projects
 Enable creation of workspaces and reusable data objects to accelerate data analysis and promote replicability
ANALYZE
 Combine ML approaches with ALCF HPC resources to extract more information from existing datasets and to guide future simulation campaigns
FIND
 Unify search, discovery, and consumption of datasets, workspaces, and analysis results
Interactive, scalable, reproducible data science
[Architecture diagram, shown with and without the Parsl logo: ALCF Data Service capabilities (data movement, data discovery, data publication, data access, data interactivity, automation, machine learning, HPC) layered over ALCF and other services]
Materials science as an initial testbed
• Advanced materials are critical to economic security and competitiveness, national security, and human welfare (MGI, a 2011 interagency effort of DoD, DOE, NASA, NIST, and NSF)
• Finding and understanding new materials is complex, expensive, and time consuming: often taking > 20 years from research to application
• Materials scientists are key users of leadership-class computing (20-30% at ALCF)
• Community data tools and services to advance materials science are emerging
Nicholas Brawand, University of Chicago; Larry Curtiss, Argonne National Laboratory
Modeling material stopping power
Stopping Power: a “drag” force experienced by high-speed protons, electrons, or positrons in a material
Areas of Application
• Nuclear reactor safety
• Magnetic confinement / inertial confinement for nuclear fusion
• Solar cell surface adsorption
• Medicine (e.g., proton therapy cancer treatment)
• Critical to understanding material radiation damage
André Schleife and Cheng-Wei Lee (UIUC)
2016 ALCF INCITE Project “Electronic Response to Particle Radiation in Condensed Matter”
André Schleife, Yosuke Kanai, and Alfredo A. Correa, 2015 -- 10.1103/PhysRevB.91.014306
Computing stopping power with TD-DFT
Stopping power (SP) can be accurately calculated by time-dependent density functional theory (TD-DFT)
 Excellent agreement with experiment
 Can vary orientation, projectile, material
 Highly parallelizable
But we need many results
 Direction dependence
 Effect of defects
 Many more materials
TD-DFT alone may not be sufficient
[Figure: stopping power vs. projectile velocity, TD-DFT compared with experiment; a single TD-DFT point costs roughly 16k CPU-hours]
André Schleife, Yosuke Kanai, and Alfredo A. Correa, 2015 -- 10.1103/PhysRevB.91.014306
Potential Solution:
Machine Learning!
What is machine learning?
What? Algorithms that generate computer programs
Why? Create software too complex to write manually
General Task: Given inputs, predict output 𝑦 = 𝑓(𝑥) (a minimal illustration follows below)
Common Algorithms: linear regression (𝒇(𝒙) = 𝒎𝒙 + 𝒃), decision trees (e.g., 𝒙 < 𝟒 ⇒ 𝒚 = 𝟐, else 𝒚 = 𝟔), neural networks (figure source: nature.com)
Advantages:
 Fast: 10^4–10^7 evaluations/CPU/sec
 Adaptable: limited need to know underlying physics
 Self-correcting: improves with more data
 Parallelizable: can use large-scale resources
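As a concrete illustration of the "given inputs, predict output 𝑦 = 𝑓(𝑥)" task above, here is a minimal sketch, not from the original deck, that fits two of the listed algorithm families with scikit-learn on synthetic data:

# Minimal illustration of "given inputs, predict output y = f(x)".
# The data are synthetic; nothing here comes from the stopping-power study.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))                  # inputs
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.5, 200)    # noisy linear target

linear = LinearRegression().fit(x, y)                  # f(x) = m*x + b
net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(x, y)

print("recovered slope:", linear.coef_[0])             # close to 2.0
print("neural net prediction at x = 5:", net.predict([[5.0]])[0])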
Computing stopping power with TD-DFT+ML
We propose to use ML to create surrogate models for TD-DFT
How do we replace TD-DFT? First, consider what it does
Inputs:
 Atomic-scale structure (atom types, positions)
 Electronic structure of system
Outputs:
 Energy of entire system
 Forces on each atom
 Time-derivatives of electronic structure
(the forces and time-derivatives allow prediction of the future state)
If successful, we can use the ML model – not TD-DFT – to compute SP
Materials science and machine learning
Collect ♦ Process ♦ Represent ♦ Learn
[Toy pipeline example: two compounds with formation enthalpies ΔHf = −1.0 and −0.5 become rows (3, 4) and (3, 5) of a feature matrix X with targets y, from which a model ΔHf = f(Z_A, Z_B) is learned]
Step 1: Data collection
Collect ♦ Process ♦ Represent ♦ Learn
[Pipeline diagram, highlighting the Collect step; annotations: Cooley and Parsl for compute and orchestration, representation features (AGNI fingerprints, ion-ion force, local charge density), and learning algorithms (linear models, ANNs, RNNs)]
Stopping Power prediction: Our data
We have simulation results for H in face-centered cubic Al on a random trajectory.
For each of multiple velocities, we have:
1) A simulated SP (one red point on the SP-vs-velocity plot)
2) The trajectory for that point
3) A ground-state calculation for that trajectory’s starting point
(About 6 GB in total, mostly Qbox output files)
André Schleife, Yosuke Kanai, and Alfredo A. Correa, 2015 -- 10.1103/PhysRevB.91.014306
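To make the shape of this dataset concrete, here is a small, hypothetical sketch of how one velocity's worth of records could be described in code; the field names and file layout are illustrative, not the actual Qbox output structure.

# Hypothetical description of one velocity's records in the dataset;
# paths and field names are illustrative only.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class StoppingPowerRun:
    velocity: float          # projectile velocity (one red point on the plot)
    stopping_power: float    # simulated SP at that velocity
    trajectory_output: Path  # Qbox output: structure and energy vs. time
    ground_state_dir: Path   # ground-state calculation for the trajectory's starting point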
Steps 2-3: Data processing / Representation
Collect ♦ Process ♦ Represent ♦ Learn
[Same pipeline diagram, now highlighting the Process and Represent steps]
Designing a training set
Key Question: What are the inputs and outputs to our model?
Consider those for TD-DFT:
Inputs:
 Atomic-scale structure (atom types, positions)
 Electronic structure of system: requires TD-DFT to compute, so unavailable as a model input
Outputs:
 Energy of entire system: reliant on entire history (hard to predict)
 Time-derivatives of electronic structure: not needed to compute stopping power
 Forces on each atom: reduced to the force on the projectile
Input: Atomic Structure, Output: Force on Particle (a sketch of how this yields SP follows below)
Collect ♦ Process ♦ Represent ♦ Learn
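Here is a minimal sketch, not from the deck, of how that output feeds the quantity we want: per the speaker notes, stopping power is the average force acting on the projectile along its trajectory. `surrogate` stands for a trained ML model and `featurize` for the representation step described on the next slide; both are placeholders.

import numpy as np

def stopping_power(surrogate, featurize, trajectory_frames):
    """Average the predicted retarding force on the projectile over a trajectory."""
    forces = [surrogate.predict(featurize(frame).reshape(1, -1))[0]
              for frame in trajectory_frames]
    return float(np.mean(forces))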
Selecting a representation
Key Questions: What determines the force on the projectile? How do we quantify it?
Types of Features
 Ion-ion repulsion: can be computed directly
 Electronic interactions: approximated with two feature types
   Local charge density: density of electrons at the projectile position
   AGNI fingerprints*: describe the atom positions around the projectile
Another need: history dependence
 Approach: use the charge density at fixed points ahead of and behind the projectile (see the sketch below)
Collect ♦ Process ♦ Represent ♦ Learn
* Botu, et al. J. Phys. Chem. C 121, 511–522 (2017).
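A minimal, hypothetical sketch of assembling one feature vector from these ingredients; `ion_ion_force`, `charge_density`, and `agni_fingerprints` are stand-ins for project-specific routines that are not shown in the deck, and the sampling offsets are illustrative.

import numpy as np

def make_features(proj_pos, proj_vel, lattice_positions,
                  ion_ion_force, charge_density, agni_fingerprints,
                  offsets=(-2.0, -1.0, 0.0, 1.0)):
    """Feature vector: ion-ion force, charge density sampled behind/ahead of the
    projectile (history dependence), and AGNI fingerprints of the neighborhood."""
    direction = proj_vel / np.linalg.norm(proj_vel)
    densities = [charge_density(proj_pos + d * direction) for d in offsets]
    return np.concatenate([
        [ion_ion_force(proj_pos, lattice_positions)],
        densities,
        agni_fingerprints(proj_pos, lattice_positions),
    ])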
Step 4: Machine learning
Collect ♦ Process ♦ Represent ♦ Learn
[Same pipeline diagram, now highlighting the Learn step: linear models, ANNs, RNNs]
Selecting a machine learning algorithm
Key Criterion: Prediction accuracy
Beyond accuracy, the algorithm should…
 be feasible to train with >10^4 entries
 be quick to evaluate
 produce a differentiable model
Standard Procedure (sketched below):
1. Identify suitable algorithms (linear models, neural networks)
2. Evaluate performance using cross-validation
3. Validate the model against unseen data
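A minimal, hypothetical sketch of that procedure with scikit-learn; X and y here are random placeholders standing in for the featurized training set, not the project's data.

# Step 1: pick candidate algorithms; Step 2: compare by cross-validation;
# Step 3: validate the chosen model against held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 8))                                   # placeholder feature matrix
y = X @ rng.random(8) + 0.1 * rng.standard_normal(2000)     # placeholder force target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "linear model": LinearRegression(),
    "neural network": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000),
}
for name, model in candidates.items():
    print(name, "CV R^2:", cross_val_score(model, X_train, y_train, cv=5).mean())

best = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X_train, y_train)
print("held-out R^2:", best.score(X_test, y_test))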
Live Demo
You won’t see the live demo here. But it was cool. We located Schleife simulation data previously published to MDF; assembled a workspace comprising the aluminum data plus four Jupyter notebooks covering data processing, ML training, and SP modeling; deployed the workspace to ALCF; and ran the notebooks to process data, train a model, and predict SP values for many directions.
Summary of analysis results
 We compared a variety of ML algorithms
 We computed SP for other trajectories
 We evaluated the data needed for training
[Figure: stopping power calculated for many trajectories]
Materials science and machine learning
Collect ♦ Process ♦ Represent ♦ Learn
[Recap of the full pipeline diagram: Cooley and Parsl; AGNI fingerprints, ion-ion force, and local charge density as representations; linear models, ANNs, and RNNs as learners]
Materials Data Facility to discover data
[Architecture diagram: distributed data storage (databases, datasets, APIs, LIMS, etc.) exposed through endpoints (EP); a data publication service (Web UI or API) to mint DOIs and associate metadata; deep indexing into a data discovery service that supports query, browse, and aggregate via Web UI, Forge, or REST API]
116 data sources ♦ 3.4M records ♦ 300 TB
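A minimal sketch of querying this index with the Forge Python client mentioned above; this assumes the mdf_forge package, and the query string and record fields shown are illustrative rather than taken from the actual dataset records.

# Hypothetical MDF query; adjust the query string to the dataset of interest.
from mdf_forge.forge import Forge

mdf = Forge()  # may prompt for a Globus login on first use
results = mdf.search("stopping power aluminum", limit=10)
for record in results:
    source = record.get("mdf", {}).get("source_name")
    titles = record.get("dc", {}).get("titles", [{}])
    print(source, titles[0].get("title"))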
Data ingest flow
1. Data are created at ALCF
2. Data are staged, published, and assigned a permanent identifier (DOI)
3. Results are indexed for easy discovery
4. Interactive analysis, modeling, and interrogation
[Diagram, built up over four slides: ALCF compute feeds data storage and the data publication service; results are indexed; queries then fetch data into a Parsl-driven analysis environment]
Data collection and staging
1. Find data through search index
2. Create BDBags for data reusability, staging, and sharing (see the sketch below)
3. Stage data and launch interactive environment on ALCF computers
4. Analyze data!
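For step 2, here is a minimal sketch of packaging a workspace directory as a BDBag, assuming the bdbag Python package; the directory name is illustrative.

# Hypothetical BDBag creation for a staged workspace directory.
from bdbag import bdbag_api

bag_dir = "al_stopping_power_workspace"   # fetched data plus analysis notebooks
bdbag_api.make_bag(bag_dir)               # convert the directory into a BagIt bag in place
bdbag_api.archive_bag(bag_dir, "zip")     # archive it for sharing or staging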
Interactive, scalable, reproducible data analysis
Data science and learning applications require:
- Interactivity
- Scalability (you can’t run this on a desktop)
- Reproducibility (publish code and documentation)
Our solution: JupyterHub + Parsl (jupyter.org, parsl-project.org)
 Interactive computing environment
 Notebooks for publication
 Can run on dedicated hardware
Parsl:
• Python-based parallel scripting library
• Tasks exposed as functions (Python or bash)
• Python code used to glue functions together
• Leverages Globus for auth and data movement
@App('python', dfk)   # legacy Parsl decorator: run as a parallel Python app via the DataFlowKernel dfk
def compute_features(chunk):
    # apply each featurizer to the 'atoms' column of this chunk of the dataset
    for f in featurizers:
        chunk = f.featurize_dataframe(chunk, 'atoms')
    return chunk

# dfk, featurizers, data, and n_chunks are assumed to be defined earlier;
# each call below returns a future for one chunk, featurized in parallel
feature_futures = [compute_features(chunk)
                   for chunk in np.array_split(data, n_chunks)]
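The calls above return Parsl futures rather than DataFrames. A minimal sketch of gathering the results, assuming the legacy @App API shown on the slide (current Parsl releases use parsl.load(...) with @python_app instead):

import pandas as pd

# .result() blocks until each parallel task finishes, so this line also acts as the barrier
features = pd.concat(future.result() for future in feature_futures)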
Interactive, scalable, reproducible data science
[Summary diagram: existing TD-DFT calculation data, fed through the ALCF Data Facility and machine learning, yields a new capability: direction-dependent stopping power]
Results so far
• Indexed data from an ALCF INCITE project
• Interactively built surrogate model using ALCF Data Service capabilities
• Extended results to model SP direction dependence in aluminum
Next Steps
1. Model multiple velocities
2. Model more materials
3. Model direction dependence
4. Transfer learning
Thanks to our sponsors!
U.S. Department of Energy ♦ ALCF DF ♦ Parsl ♦ Globus ♦ IMaD

Editor's Notes

  • #8 An index for heterogeneous distributed data, coupled with APIs to facilitate data access, discovery, and addition, layered with capabilities to support simplified deep learning against these data. Goals: simplify interfaces for data publication regardless of data size, type, and location; provide automation capabilities to capture data from pipelines; deploy APIs to foster community development and integration; encourage data re-use; incentivize data sharing; and support Open Science in materials research.
  • #11 On-channel Proton in Gold lattice
  • #12 TD-DFT offers a great way for computing the stopping power of materials. As shown in the figure on the right, it can accurately reproduce experimentally measured stopping powers. Given that TD-DFT is parameter-free, we can easily model the effect of changing the direction of the projectile, the type of projectile, and the host material. Additionally, TD-DFT relies on advanced parallel codes that enable the use of leadership-class computing facilities, which is good because it is resource intensive. Just one of the points on this plot required ~16k computing hours on Sierra at LLNL. That single point is only the stopping power for a single direction in the crystal, for a single projectile type, at a single velocity, for a single host material. In the future, we want to be able to easily access the stopping power for many different types of materials in all possible directions, and even be able to ascertain the effects of defects on the stopping power. TD-DFT, while quite powerful, might not be sufficient on its own to do this. To compensate, we propose to use machine learning to extend the capability of TD-DFT.
  • #14  Self-correcting: If the model’s wrong, add more data and it will automatically correct itself [theorists do this on a slower timescale]
  • #15 Our first question is: How do we approach creating a surrogate for TD-DFT? The first step in that process is recognizing what the inputs and outputs to TD-DFT are. Its inputs are the atomic-scale structure (positions and types of atoms) and the current electronic structure. In essence, what is the current state of the material at the electronic level? Its outputs are the energy of the system and quantities that allow you to predict its future state: the forces acting on each atom, and the rate of change of the electronic structure (i.e., the wavefunctions for each atom). If we can successfully emulate the function that maps these inputs and outputs, we can use our ML surrogate to compute stopping power rather than use TD-DFT directly. OK, this outlines what we need to replace in simple language. Now, our next step is to build this model.
  • #16 We break down building a machine learning model into 4 distinct steps. First, we need to collect a resource of raw data for training the model. Next, we need to process that raw data to define a training set: what are the inputs (in broad terms) and the desired outputs? Then, we translate our materials data into a form compatible with machine learning, i.e., a list of finite-length vectors that each have the same length; in other words, we select a representation. Lastly, we employ machine learning to find a function that maps the representation to the outputs: the classic machine learning problem. At this point, we will break into a live demo to show you how this process applies to modeling TD-DFT and how the ALCF Data Facility makes this work easier.
  • #17 As I just mentioned, the first step in our process is gathering a set of training data.
  • #18 For our application, the data is the data supporting a previous publication of André Schleife; specifically, we have the data backing this figure. What this figure describes is the stopping power as a function of velocity for a proton traveling through aluminum along a random trajectory. For each red point in this figure, we have the TD-DFT data that was used to calculate it. What this actually means is we have about 6 GB of Qbox output files that contain the structure and energy of the system as a function of time during the simulation. We also have the starting point for these simulations: a ground-state DFT calculation of the electronic structure of Al. [At this point, jump to showing off the ALCF portal and creating the environment. Then, go to the notebook and show what we have.]
  • #23 Now, back to the science. Our second step in creating a model is processing the data from its raw form to create a training set with clear inputs and outputs. To decide what those should be for our model, we go back to the inputs and outputs. As inputs, TD-DFT takes the atomic-scale structure and the electronic structure. When building a model, we cannot use the time-dependent electronic structure because this requires TD-DFT to compute. On the other hand, we can know the atomic-scale structure. For outputs: the stopping power of a material is the average force acting on the projectile over time. We don't need the electronic structure to compute the stopping power, so let's eliminate that. We can get the average force on the particle from the energy, but the energy at any timestep is dependent on all of the previous timesteps, which makes it difficult to predict. So, we should just predict the force acting on the particle.
  • #24 Our next step is to determine what 'atomic structure' actually means in terms of inputs to the model. We choose to use two types of inputs. First, we know part of the force deals with the 'ion-ion' repulsion between the projectile and the surrounding nuclei; this we can just compute directly. Secondly, we know the particle interacts with the electrons in the material. To approximate this effect, we use two kinds of features: (1) the electron density (taken from the starting condition for our simulation), and (2) the AGNI fingerprints, which capture the local arrangement of atoms and provide the basis for an ML model to capture electronic effects. Another thing we need for these features is history dependence. The force acting on a particle is not just dependent on its current environment, but also on what just happened to it. As the particle travels at a constant velocity, we know its history and represent that history by computing the charge density at positions several timesteps in the past and one in the future. Now, let's jump back to the notebooks to show what these processing and representation calculations look like.
  • #25 Our last step of the model building process is training a machine learning algorithm.
  • #26 The main question when selecting a machine learning algorithm is: which algorithm produces a model with the highest prediction accuracy? However, there are also other important factors to consider with this application. The model should [explain the list]. To identify the best model we follow a simple and very common procedure: (1) we first identify suitable algorithms (for our case, linear models and neural networks are our top choices); (2) then, we test them using cross-validation; (3) finally, we validate the model using data outside of our original training set. We'll show you this process in our notebooks.