Machine learning for materials design:
opportunities, challenges, and methods
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
Energy Probe workshop, May 13, 2019
Almost every technology could be improved with better materials!

• Batteries
  – stable, high-energy electrodes; solid-state electrolytes
• Thermal energy storage & conversion
  – high-zT thermoelectrics; high-heat-capacity liquids
• Photovoltaics
  – improved absorber efficiency; reduced degradation in coatings; control of ion migration in front glass; lifetime of organic / hybrid materials
Typically, both new materials discovery and optimization take decades

• Often, materials are known for decades before their functional applications are recognized
  – MgB2 sat on lab shelves for ~50 years before its identification as a superconductor in 2001
• Even after discovery, optimization and commercialization still take decades

Materials data from: Eagar, T. & King, M. Technology Review (1995).
Some opportunities for accelerating materials design using machine learning techniques

• ML surrogates for experiments / computation
• “Self-driving laboratories”
• Opportunities in natural language processing
ML surrogates for experiments and computation: background

• Experiments are generally time-consuming and labor-intensive
  – Days to months per measurement, with a large investment of researcher time
  – Not too long ago, one essentially needed to do everything experimentally
• Computations can be faster and require less researcher time
  – Today, some materials design problems can be modeled in the computer [1]
  – But CPU time is still a major issue
• Machine learning can be the fastest of all and could play a major role in supporting experiments and computation, e.g., by identifying the most promising regions of chemical space before any computation or theory

[1] Jain, A., Shin, Y. & Persson, K. A. Computational predictions of energy materials using density functional theory. Nature Reviews Materials 1, 15004 (2016).
Example application: machine learning as a surrogate for DFT computations

The ML model can be 5–6 orders of magnitude faster than DFT: potential to run ~1 million tests for the price of one [1,2].

1. Smith, J. S., Isayev, O. & Roitberg, A. E. ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chemical Science 8, 3192–3203 (2017).
2. Aspuru-Guzik, A. & Persson, K. Materials Acceleration Platform—Accelerating Advanced Energy Materials Discovery by Integrating High-Throughput Methods with Artificial Intelligence.
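To make “~1 million tests for the price of one” concrete, here is a minimal, purely illustrative sketch of surrogate-based screening: random features stand in for real composition/structure descriptors, and synthetic labels stand in for DFT-computed energies.

```python
# Illustrative only: synthetic features and labels stand in for real
# descriptors (e.g., from a featurization library) and DFT-computed energies.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_features = 10

# ~500 "expensive" DFT results to train on
X_train = rng.random((500, n_features))
y_train = X_train @ rng.random(n_features) + 0.05 * rng.standard_normal(500)

surrogate = RandomForestRegressor(n_estimators=200, n_jobs=-1).fit(X_train, y_train)

# Screen a huge candidate pool in seconds, then send only the top-ranked
# candidates back to full DFT for verification.
X_pool = rng.random((100_000, n_features))
predicted = surrogate.predict(X_pool)
shortlist = np.argsort(predicted)[:100]   # e.g., 100 lowest predicted energies
```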
Example from our group: developing and testing surrogate models over diverse materials data problems (paper in preparation)
“Self-driving” laboratories: background

• Typically, the choice of which materials to experiment on (or compute) is made by the researcher
• Advantage: this leverages the researcher’s domain expertise (potentially decades of knowledge)
• Potential issues:
  – Bias (exploring near already-known systems)
  – Time (it takes time to decide what to study)
• In a “self-driving” laboratory, an algorithm chooses the next experiment/computation and performs it automatically (“active learning” ML)
• At each stage, the algorithm balances exploration and exploitation (see the sketch below)

Gubernatis, J. E. & Lookman, T. Machine learning in materials design and discovery: Examples from the present and suggestions for the future. Phys. Rev. Materials 2, 120301 (2018).
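Below is a minimal sketch of such an active-learning loop, assuming a Gaussian-process surrogate and an upper-confidence-bound selection rule; the toy expensive_measurement function is a stand-in for a real experiment or computation.

```python
# Toy active-learning loop: pick the candidate maximizing an upper confidence
# bound (mean + kappa * std), trading off exploitation (high predicted mean)
# against exploration (high uncertainty). All names here are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_measurement(x):               # stand-in for an experiment / DFT run
    return -np.sin(3 * x) - x**2 + 0.7 * x

X_pool = np.linspace(-2, 2, 400).reshape(-1, 1)   # candidate "experiments"
X_done, y_done = [np.array([0.0])], [expensive_measurement(0.0)]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for step in range(15):
    gp.fit(np.vstack(X_done), np.array(y_done))
    mean, std = gp.predict(X_pool, return_std=True)
    x_next = X_pool[np.argmax(mean + 2.0 * std)]  # kappa = 2.0 sets the balance
    X_done.append(x_next)
    y_done.append(expensive_measurement(x_next[0]))

print("best property value found:", max(y_done))
```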
Example application: shape-memory alloys with low transition temperature and hysteresis

Using an adaptive design strategy, one can reduce the number of measurements needed to find all Pareto-optimal shape-memory alloys.

Gubernatis, J. E. & Lookman, T. Machine learning in materials design and discovery: Examples from the present and suggestions for the future. Phys. Rev. Materials 2, 120301 (2018).
Example from our group: Rocketsled for automated computational searches

Rocketsled can help find optimal solutions using far fewer computations overall (less CPU), parallelized over supercomputers (less time).

Dunn, A., Brenneck, J. & Jain, A. Rocketsled: a software library for optimizing high-throughput computational searches. J. Phys. Mater. 2, 034002 (2019).
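Rocketsled itself plugs into FireWorks workflows; the snippet below is not its API, just a generic illustration of the second speed-up, namely that once an optimizer suggests a batch of candidates, the expensive evaluations can run in parallel.

```python
# Generic illustration (not Rocketsled's API): evaluate an optimizer-suggested
# batch of candidate inputs in parallel, as one would across compute nodes.
from concurrent.futures import ProcessPoolExecutor
import math

def expensive_simulation(x):                # stand-in for one DFT / workflow run
    return math.exp(-(x - 1.3) ** 2)

suggested_batch = [0.5, 0.9, 1.3, 1.7, 2.1]  # from the optimizer's suggestion step

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(expensive_simulation, suggested_batch))
    score, best_x = max(zip(results, suggested_batch))
    print("best candidate so far:", best_x)
```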
Natural language processing: background

• Most materials science data and knowledge exists only in unstructured form (e.g., as text in journal publications)
• Can we make use of knowledge in text format?
Example: synthesis planning based on text mining

1. Kim, E. et al. Machine-learned and codified synthesis parameters of oxide materials. Scientific Data 4, 170127 (2017).
2. Kim, E. et al. Materials Synthesis Insights from Scientific Literature via Text Extraction and Machine Learning. Chemistry of Materials 29, 9436–9444 (2017).
Example from our group: using neural word embeddings to predict “gaps” in materials discoveries

Using word2vec on a database of 3 million materials science abstracts, we can predict which words should co-occur with one another. This can be used to predict materials that should be studied for functional applications (“gaps” in the research literature).

Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K., Ceder, G. & Jain, A. Unsupervised word embeddings capture latent knowledge from materials science literature. Accepted / in press, Nature.
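A toy sketch of the underlying idea using gensim's word2vec is below; the three hand-written "abstracts" stand in for the ~3 million real ones, so the output only illustrates the mechanics, not the science.

```python
# Toy word2vec sketch (gensim). With a real corpus, materials whose embeddings
# sit near a property word such as "thermoelectric", yet are rarely studied in
# that context, are candidate literature "gaps".
from gensim.models import Word2Vec

abstracts = [  # stand-ins for tokenized paper abstracts
    ["bi2te3", "is", "a", "promising", "thermoelectric", "material"],
    ["high", "zt", "thermoelectric", "performance", "of", "pbte"],
    ["licoo2", "is", "a", "li", "ion", "battery", "cathode", "material"],
]
model = Word2Vec(abstracts, vector_size=50, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("thermoelectric", topn=3))
```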
Challenges

• Data availability
  – Typical materials data sets range from ~a dozen examples to a few thousand; it is rare to have 100,000 data points
  – No standard data sets to build models on (no materials equivalent of ImageNet)
• Data heterogeneity
  – There is no single data type (e.g., image data, spectral data, graph data)
  – Different materials problems have their own data types, often ones unknown in computer science (e.g., periodic crystal structures)
• ML model extrapolation
  – Almost all industry ML focuses on interpolation-type problems, where data on almost all representative examples is already in place
  – Materials science requires extrapolation of very complex physics
  – Standard cross-validation is likely insufficient (would cluster-based cross-validation, sketched below, be better?)
  – ML interpretability would build confidence in extrapolation
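As a sketch of what cluster-based cross-validation could look like (synthetic data; the k-means grouping is one possible choice, not a prescription): cluster materials in feature space and hold out whole clusters, so each fold tests extrapolation to an unseen material family rather than interpolation.

```python
# Cluster-based CV sketch: hold out whole k-means clusters of (synthetic)
# materials so each fold tests extrapolation to an unseen "family".
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((300, 8))                    # stand-in for material features
y = X @ rng.random(8) + 0.1 * rng.standard_normal(300)

groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
scores = cross_val_score(RandomForestRegressor(n_estimators=100), X, y,
                         cv=GroupKFold(n_splits=5), groups=groups)
print("per-fold R^2 with whole clusters held out:", np.round(scores, 2))
```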
Some relevant groups at LBNL

• Kristin Persson (ESDR) – materials databases, ML
• Shyam Dwaraknath (ESDR) – ML for characterization
• Juli Mueller (CRD) – active learning
• Dani Ushizima (CRD) – classifying materials image data
• Tess Smidt (CRD) – crystal structure models for ML
• Emory Chan (MSD) – automated experiments
• Colin Ophus (MSD) – TEM image labeling
• Gerbrand Ceder (MSD) – text mining / NLP of synthesis
