Materials Data in Action
Max Hutchinson, Scientific Software Eng.
ONE DOES NOT SIMPLY... APPLY OFF THE SHELF ML TOOLS TO MATERIALS DISCOVERY
ARTIFICIAL INTELLIGENCE FOR MAT. SCI.
8 AUGUST 2018, NIST
Bryce Meredig, Chief Science Officer
What is materials informatics?
What makes it particularly challenging?
Can we do anything about it?
LET'S TRY TO MACHINE LEARN A NOBEL PRIZE
CASE STUDY: HIGH-Tc SUPERCONDUCTORS
Pia Jensen Ray. Figure 2.4 in Master's thesis, "Structural investigation of La(2-x)Sr(x)CuO(4+y) - Following staging as a function of temperature". Niels Bohr Institute, Faculty of Science,
University of Copenhagen. Copenhagen, Denmark, November 2015. DOI:10.6084/m9.figshare.2075680.v2
CASE STUDY: HIGH-Tc SUPERCONDUCTORS
Cross-validated RMSE for Tc ≈ 10 K
CAN WE PREDICT HIGH-Tc SUPERCONDUCTIVITY?!?
(spoiler alert): no
LEAVE ONE CLUSTER OUT (LOCO) CV
Nominal k-fold cross-validation assumes that samples are drawn independently from the input space
This is almost never true in materials informatics: individual data sources have selection biases, and different data sources draw from different distributions
LOCO CV groups the data before computing train/test splits
The groups are inferred via clustering rather than being dictated by a domain expert
"Can machine learning identify the next high-temperature superconductor? Examining
extrapolation performance for materials discovery."
B. Meredig, ..., M. Hutchinson, ..., B. Gibbons, J. Hattrick-Simpers, A. Mehta, L. Ward
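A minimal sketch of the LOCO idea, not the implementation from the cited paper. It assumes a toy 1-D input, a tiny hand-written k-means, and a 1-nearest-neighbor model; all names (`kmeans_1d`, `loco_splits`, `rmse_over_splits`) and all data values are hypothetical. It compares the error from interleaved k-fold splits with the error when a whole cluster is held out.

```python
import math

def kmeans_1d(values, k, n_iter=20):
    """Tiny 1-D k-means (Lloyd's algorithm); returns one cluster label per value."""
    lo, hi = min(values), max(values)
    # Deterministic initialization: spread the centers evenly over the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def kfold_splits(n, k):
    """Nominal k-fold: interleaved folds that ignore any grouping in the data."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def loco_splits(labels):
    """LOCO: hold out one whole cluster at a time."""
    for held_out in sorted(set(labels)):
        test = [i for i, lab in enumerate(labels) if lab == held_out]
        train = [i for i, lab in enumerate(labels) if lab != held_out]
        yield train, test

def rmse_over_splits(xs, ys, splits):
    """RMSE of a 1-nearest-neighbor predictor, pooled over all test points."""
    squared_errors = []
    for train, test in splits:
        for i in test:
            nearest = min(train, key=lambda j: abs(xs[j] - xs[i]))
            squared_errors.append((ys[nearest] - ys[i]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data: three well-separated "material classes" along one input axis.
xs = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3, 10.0, 10.1, 10.2, 10.3]
ys = [2 * x for x in xs]

labels = kmeans_1d(xs, k=3)
kfold_rmse = rmse_over_splits(xs, ys, kfold_splits(len(xs), 3))
loco_rmse = rmse_over_splits(xs, ys, loco_splits(labels))
print(f"k-fold RMSE: {kfold_rmse:.2f}, LOCO RMSE: {loco_rmse:.2f}")
```

In k-fold, every test point has a near neighbor from its own cluster in the training set, so the error looks small. Under LOCO the model has to predict a cluster it has never seen, which is closer to the discovery setting.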
CASE STUDY: HIGH-Tc SUPERCONDUCTORS
The model can't "extrapolate" across material classes (clusters).
LOW CROSS-VALIDATION ERROR IS INSUFFICIENT
Possible Materials Informatics Research Program
1. Collect data
2. Train an approximate ML model
3. Validate the ML model
If insufficiently accurate, back to (1)
4. Optimize or screen over materials using the ML model
5. ...
6. Profit
A large portion of the literature focuses on collection, training,
and validation in support of screening.
CAN WE DISCOVER NEW MATERIALS?
(spoiler alert): yes
DESIGN OF EXPERIMENTS, SEQUENTIAL LEARNING, AND "FUELS"
1. Collect data
2. Train an approximate ML model
3. Design an experiment
4. Conduct the experiment
If quality is insufficient, append and back to (2)
5. ...
6. Profit
[Diagram: Modeling designs Experiment; Experiment informs Modeling]
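The loop above can be sketched in a few lines. This is an illustrative toy, not the FUELS implementation: the "model" is a 1-nearest-neighbor lookup, the distance to the nearest measured point stands in for uncertainty, and `run_sequential_learning`, `measure`, and the candidate grid are all hypothetical.

```python
def run_sequential_learning(candidates, measure, target, n_initial=2, max_rounds=50):
    """Collect -> train -> design -> conduct, repeated until the target is reached."""
    # 1. Collect data: measure a few initial candidates.
    measured = {x: measure(x) for x in candidates[:n_initial]}
    rounds = 0
    while max(measured.values()) < target and rounds < max_rounds:
        # 2. Train an approximate model: predict from the nearest measured point.
        def predict(x):
            nearest = min(measured, key=lambda m: abs(m - x))
            return measured[nearest], abs(nearest - x)  # (value, crude uncertainty)

        # 3. Design an experiment: favor high predictions and unexplored regions.
        untested = [x for x in candidates if x not in measured]
        if not untested:
            break

        def score(x):
            value, uncertainty = predict(x)
            return value + uncertainty

        next_x = max(untested, key=score)

        # 4. Conduct the experiment and append the result to the data.
        measured[next_x] = measure(next_x)
        rounds += 1
    return measured, rounds

# Hypothetical search: 21 candidate "recipes" and a property that peaks at x = 0.7.
candidates = [i / 20 for i in range(21)]
measure = lambda x: -(x - 0.7) ** 2
measured, rounds = run_sequential_learning(candidates, measure, target=-0.001)
print(f"experiments run: {len(measured)} of {len(candidates)} candidates")
```

The difference from the screening workflow is that each new measurement is appended to the training data before the next experiment is chosen.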
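DESIGNING THE NEXT EXPERIMENT
Maximum Expected Improvement (MEI): $\left[ \int_{-\infty}^{\infty} x \, p(x; \theta) \, dx \right]$
Maximum Likelihood of Improvement (MLI): $\left[ \int_{\alpha}^{\infty} p(x; \theta) \, dx \right]$
Maximum Uncertainty: $\left[ \int_{-\infty}^{\infty} (x - \bar{x})^2 \, p(x; \theta) \, dx \right]$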
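If the predicted distribution for each candidate is assumed to be Gaussian (an assumption made here for illustration, not stated on the slide), the three criteria reduce to closed forms: the mean, the upper-tail probability above a threshold α, and the variance. The candidate names and numbers below are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_value(mean, std):
    """Integral of x * p(x) dx for a Gaussian: just the mean."""
    return mean

def likelihood_of_improvement(mean, std, alpha):
    """Integral of p(x) dx from alpha to infinity for a Gaussian."""
    return 1.0 - normal_cdf((alpha - mean) / std)

def uncertainty(mean, std):
    """Integral of (x - mean)^2 * p(x) dx for a Gaussian: the variance."""
    return std ** 2

# Hypothetical candidates: name -> (predicted mean, predicted standard deviation).
candidates = {"A": (1.0, 0.1), "B": (0.8, 0.5), "C": (0.2, 1.0)}
alpha = 1.2  # hypothetical threshold: the best value measured so far

mei_pick = max(candidates, key=lambda n: expected_value(*candidates[n]))
mli_pick = max(candidates, key=lambda n: likelihood_of_improvement(*candidates[n], alpha))
mu_pick = max(candidates, key=lambda n: uncertainty(*candidates[n]))
print(mei_pick, mli_pick, mu_pick)
```

On this toy set the three criteria disagree: the highest mean, the best chance of beating α, and the largest variance belong to different candidates, which is why the choice of criterion matters.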
BENCHMARK: DESIGN ON EXPLICIT LIST
[Figure: benchmark results, annotated 9x and 2x]
REAL WORLD EXAMPLE: ADAPT @ MINES
https://www.additivemanufacturing.media/articles/how-machine-learning-is-moving-am-beyond-trial-and-error/
DATA DRIVEN MODELING DELIVERS DISCOVERY FASTER
WHAT ABOUT THE MACHINE LEARNING?
“Simply downloading and ‘applying’ open-source software to your data won’t work. AI needs to be customized to your business context and data.”
Andrew Ng in Harvard Business Review (Stanford, Google Brain, Coursera, Baidu)
MATERIALS INFORMATICS CONTEXT
Labels are scarce and expensive
Typical dataset sizes are 100-1000 labels
Preparing a sample is often more difficult than measuring it
Once a sample is prepared, measuring additional labels on it has a low marginal cost
We've been doing physics, chemistry, and materials science for hundreds of years
There are (not always accurate) sources of computational data
We have some priors for which labels are related
We have some priors for what some relationships look like
PHYSICAL RELATIONSHIPS
Materials science has Process-Structure-Property (PSP) relationships
Process -> Structure -> Property
[Figure: Structure, Properties, Performance, Processing, Characterization]
PHYSICAL RELATIONSHIPS
Physics, mathematics, and engineering think about multi-scale modeling
Micro -> Meso -> Macro
https://www.nas.nasa.gov/SC14/demos/demo26.html
GRAPHICAL MODELS: DOMAIN-AWARE MODELING
Inputs & Features
Featurization
Empirical Relation
Computational Data
Machine Learning
Quantity of Interest
GRAPHICAL MODELS: TRANSFER LEARNING
M. Hutchinson, E. Antono, B. Gibbons, S. Paradiso, J. Ling, B. Meredig
Overcoming data scarcity with transfer learning, https://arxiv.org/pdf/1711.05099.pdf
"B" is a plentiful latent variable
DFT band gap
Hydrogen splitting reaction rate
Indentation hardness
"A" is a scarce or expensive label
Color
NO splitting reaction rate
Ultimate tensile strength
GRAPHICAL MODELS: TRANSFER LEARNING
Simple example: Adding yield strength information to a fatigue strength design increases experimental efficiency
M. Hutchinson, E. Antono, B. Gibbons, S. Paradiso, J. Ling, B. Meredig
Overcoming data scarcity with transfer learning, https://arxiv.org/pdf/1711.05099.pdf
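A toy sketch of routing a scarce label "A" through a plentiful label "B". It shows the general idea only, not the method of the cited paper: the functions, the sine-shaped B, the linear B-to-A relation, and all data values are hypothetical.

```python
import math

def nearest_neighbor_model(xs, ys):
    """Return a 1-nearest-neighbor regressor built from (xs, ys)."""
    pairs = list(zip(xs, ys))

    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return predict

def linear_fit(us, vs):
    """Ordinary least squares for v ~ slope * u + intercept."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return slope, mean_v - slope * mean_u

# Plentiful label B (hypothetical): 200 cheap values, nonlinear in the input x.
xs_b = [i / 199 * 2 * math.pi for i in range(200)]
ys_b = [math.sin(x) for x in xs_b]
predict_b = nearest_neighbor_model(xs_b, ys_b)

# Scarce label A (hypothetical): only 4 measurements, simply related to B.
xs_a = [0.3, 1.5, 3.5, 5.0]
ys_a = [3 * math.sin(x) + 1 for x in xs_a]

# Transfer: learn A as a function of the *predicted* B instead of the raw input.
slope, intercept = linear_fit([predict_b(x) for x in xs_a], ys_a)

def predict_a(x):
    return slope * predict_b(x) + intercept

# Baseline: fit A directly on x with the same 4 points.
direct_slope, direct_intercept = linear_fit(xs_a, ys_a)

def predict_a_direct(x):
    return direct_slope * x + direct_intercept

x_new = 4.2
true_a = 3 * math.sin(x_new) + 1
transfer_error = abs(predict_a(x_new) - true_a)
direct_error = abs(predict_a_direct(x_new) - true_a)
print(f"transfer error: {transfer_error:.3f}, direct error: {direct_error:.3f}")
```

The four A measurements only have to pin down the simple B-to-A relation; the hard, nonlinear part of the mapping is learned from the plentiful B data.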
WHERE DOES THE UNCERTAINTY COME FROM?
Jackknife methods capture uncertainty with respect to finite sample size.
Computational cost is independent of the size of the feature space.
We add an explicit bias term trained on the out-of-bag errors.
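To illustrate how a jackknife ties uncertainty to sample size, here is a plain delete-one jackknife applied to a simple estimator. It is a generic textbook sketch, not the ensemble-based variant or the bias term described above; `jackknife_std` and the measurement values are hypothetical.

```python
import math
import statistics

def jackknife_std(data, estimator):
    """Delete-one jackknife estimate of the standard error of `estimator`."""
    n = len(data)
    # Recompute the estimate n times, each time leaving one sample out.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - center) ** 2 for t in leave_one_out)
    return math.sqrt(variance)

# Hypothetical repeated measurements of one property.
measurements = [2.1, 1.9, 2.4, 2.0, 2.6, 1.8]

small_sample_std = jackknife_std(measurements[:3], statistics.mean)
full_sample_std = jackknife_std(measurements, statistics.mean)
print(f"3 samples: {small_sample_std:.3f}, 6 samples: {full_sample_std:.3f}")
```

For the sample mean, this estimate coincides with the usual standard error s/sqrt(n), which makes the link to finite sample size explicit. The loop runs over samples and never touches the features, so its cost does not grow with the number of features.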
(PROBABILISTIC) GRAPHICAL MODELS
Inputs & Features
Featurization
Empirical Relation
Computational Data
Machine Learning
Quantity of Interest
THANK YOU!
Job listings: citrine.io/jobs
Newsletter: citrination.org/publications_talks/ddms-newsletter/
Literature review: citrination.org/learn/citrines-literature-review/
