Accelerating materials discovery with big data and machine learning
Anubhav Jain
ESDR / ETA
AI/ML for Chemistry and Materials Science Workshop
9/26/2024
Today: Simulation data forms structured materials databases for design and machine learning
• The Materials Project (www.materialsproject.org)
• Free resource of calculated and contributed materials properties
• >150,000 inorganic compounds
• >500,000 registered users
• Most popular database for downstream machine learning (composition or structure → property)
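As a minimal sketch of the "composition → property" input side, the snippet below turns a chemical formula into a fixed-length vector of atomic fractions that a downstream model could consume. The helper names (`parse_formula`, `fractional_features`) and the element ordering are assumptions made for illustration; this is not the Materials Project API, and it handles only simple formulas without parentheses.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional amount, e.g. "Sr2" or "Li".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*\.?\d*)")


def parse_formula(formula: str) -> dict[str, float]:
    """Return element -> amount for a simple formula without parentheses."""
    amounts: Counter = Counter()
    for symbol, count in _TOKEN.findall(formula):
        amounts[symbol] += float(count) if count else 1.0
    return dict(amounts)


def fractional_features(formula: str, elements: list[str]) -> list[float]:
    """Map a formula to atomic fractions over a fixed element ordering."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    return [amounts.get(el, 0.0) / total for el in elements]


# Hypothetical element ordering chosen only for this example.
ELEMENTS = ["Sr", "Li", "Al", "O"]
features = fractional_features("Sr2LiAlO4", ELEMENTS)
print(features)
```

A real pipeline would use a much longer element list and usually richer descriptors (elemental properties, structural features), but the fixed-ordering idea stays the same.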
MP for phosphors
References
✦ Wang, Z. et al. Mining Unexplored Chemistries for Phosphors for High-Color-Quality White-Light-Emitting Diodes. Joule 2, 914–926 (2018)
✦ Li, S. et al. Data-Driven Discovery of Full-Visible-Spectrum Phosphor. Chemistry of Materials 31, 6286–6294 (2019)
✦ Ha, J. et al. Color tunable single-phase Eu2+ and Ce3+ co-activated Sr2LiAlO4 phosphors. Journal of Materials Chemistry C 7, 7734–7744 (2019)
Prediction
Statistical analysis of existing materials that co-occur with the word ‘phosphor’, followed by structure prediction for new materials
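The co-occurrence idea can be sketched as follows: for each element, estimate how often materials containing it appear in text that mentions a keyword. The corpus below is a toy stand-in invented for illustration (not real literature data), and the scoring function `keyword_enrichment` is an assumed, simplified statistic rather than the method used in the cited work.

```python
import re
from collections import Counter

# Toy stand-in corpus: (text snippet, elements of the material it mentions).
# These entries are invented for illustration only.
CORPUS = [
    ("a red phosphor host for LEDs", {"Sr", "Al", "O"}),
    ("phosphor with broad emission", {"Sr", "Li", "O"}),
    ("a battery cathode material", {"Li", "Co", "O"}),
    ("thermoelectric performance", {"Bi", "Te"}),
]


def keyword_enrichment(corpus, keyword):
    """Score each element by the fraction of its mentions that co-occur with keyword."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    total, with_kw = Counter(), Counter()
    for text, elements in corpus:
        hit = bool(pattern.search(text))
        for el in elements:
            total[el] += 1
            with_kw[el] += hit  # True counts as 1, False as 0
    return {el: with_kw[el] / total[el] for el in total}


scores = keyword_enrichment(CORPUS, "phosphor")
for el, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{el}: {score:.2f}")
```

Elements with high scores would suggest chemistries worth passing to a structure-prediction step; with a real corpus, some correction for small counts would be needed.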
Experiment
Predicted first known Sr-Li-Al-O quaternary, showed green-yellow/blue emission with quantum efficiency of 25% (Eu), 40% (Ce), 55% (co-activated Eu, Ce)
Sr2LiAlO4
“matbench-discovery” benchmark task
Challenge: Make published experimental data “ML-ready” like simulation data sets (e.g., using LLMs)
Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A. S.; Ceder, G.; Persson, K. A.; Jain, A. Structured Information Extraction from Scientific Text with Large Language Models. Nat Commun 2024, 15 (1), 1418.
Top: dopant database; Bottom: Au NP synthesis
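One practical piece of making text "ML-ready" is validating whatever structured output a language model returns before it enters a database. The sketch below assumes the model was asked to emit a JSON list of host/dopant records; the keys `host` and `dopants`, the `DopingRecord` type, and the example completion are all hypothetical and are not taken from the cited paper.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DopingRecord:
    host: str
    dopants: tuple[str, ...]


def parse_completion(completion: str) -> list[DopingRecord]:
    """Validate a JSON completion and convert it to typed records.

    Malformed entries are skipped rather than guessed at.
    """
    try:
        items = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    records = []
    for item in items:
        if not isinstance(item, dict):
            continue
        host = item.get("host")
        dopants = item.get("dopants")
        if (
            isinstance(host, str)
            and isinstance(dopants, list)
            and all(isinstance(d, str) for d in dopants)
        ):
            records.append(DopingRecord(host, tuple(dopants)))
    return records


# Hypothetical model output; the second entry is deliberately malformed.
example = '[{"host": "Sr2LiAlO4", "dopants": ["Eu2+", "Ce3+"]}, {"host": 42}]'
print(parse_completion(example))
```

Dropping invalid entries instead of repairing them keeps the resulting database conservative, at the cost of some recall.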
Today/Ongoing: end-to-end materials development with automation/robotics, theory, and AI
“A-lab”
Materials Project
NERSC
AI recipes based on “reading” literature
Iterative AI refines recipe to synthesize target phase
New materials can be virtually pre-screened with supercomputers and AI (“Materials Project”)
Targets from computer models can be synthesized using robotic equipment and AI (“A-lab”)
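The "iterative AI refines recipe" step can be pictured as a closed loop: run an experiment, measure how much target phase formed, and adjust the recipe. The sketch below is a deliberately simple greedy search over temperature, not the A-lab's actual algorithm. `Recipe`, `refine_recipe`, and the `toy_furnace` response surface are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Recipe:
    temperature_c: float
    hours: float


def refine_recipe(
    initial: Recipe,
    run_experiment: Callable[[Recipe], float],
    target_yield: float = 0.9,
    step_c: float = 100.0,
    max_iters: int = 10,
):
    """Greedy refinement: move temperature while yield improves, halve the step when it does not."""
    best, best_yield = initial, run_experiment(initial)
    history = [(best, best_yield)]
    for _ in range(max_iters):
        if best_yield >= target_yield or step_c < 1.0:
            break
        improved = False
        for delta in (step_c, -step_c):
            trial = Recipe(best.temperature_c + delta, best.hours)
            trial_yield = run_experiment(trial)
            history.append((trial, trial_yield))
            if trial_yield > best_yield:
                best, best_yield, improved = trial, trial_yield, True
                break
        if not improved:
            step_c /= 2
    return best, best_yield, history


def toy_furnace(recipe: Recipe) -> float:
    """Made-up stand-in for synthesis plus characterization; yield peaks at 900 C."""
    return max(0.0, 1.0 - abs(recipe.temperature_c - 900.0) / 500.0)


best, best_yield, history = refine_recipe(Recipe(600.0, 4.0), toy_furnace)
print(best, best_yield, len(history))
```

In a real loop, `run_experiment` would dispatch to robotic synthesis and phase analysis, and the proposal step would draw on literature-derived priors and thermodynamic data rather than a fixed temperature step.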
Challenge: automated sample characterization
Hypothesis generation & simulation
Data collection
Uncertainty and decision-making
Theory & Simulation
Instrumentation & Tools
Machine learning
Possible structures, vacancies, defects, competing polymorphs or phases & their probabilities (algorithmic decision-making and confidence assessment)
Samples (e.g., from automated synthesis)
Molecular Foundry
Tools designed for rapid, automated analysis
Automated sample characterization protocol that matches “team of experts” behavior for single samples…
…yet also scales to hundreds of samples & hundreds of targets
Data infrastructure
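One way to read "phases & their probabilities" with "algorithmic decision-making and confidence assessment" is as a Bayesian update over candidate hypotheses, followed by a threshold decision. The sketch below illustrates that reading under stated assumptions: the hypothesis names, prior, likelihood values, and the 0.95 threshold are all hypothetical, and a real protocol would derive likelihoods from measured data.

```python
def update_posterior(prior: dict[str, float], likelihood: dict[str, float]) -> dict[str, float]:
    """Apply Bayes' rule over a discrete set of candidate phases."""
    unnorm = {h: prior[h] * likelihood.get(h, 0.0) for h in prior}
    z = sum(unnorm.values())
    if z == 0.0:
        raise ValueError("observation is incompatible with every hypothesis")
    return {h: p / z for h, p in unnorm.items()}


def decide(posterior: dict[str, float], threshold: float = 0.95):
    """Return (most probable hypothesis, whether its probability meets the threshold)."""
    best = max(posterior, key=posterior.get)
    return best, posterior[best] >= threshold


# Hypothetical prior over what a synthesized sample might be.
prior = {"target": 0.5, "polymorph": 0.25, "impurity": 0.25}

# Hypothetical likelihoods of two successive measurements under each hypothesis.
first_measurement = {"target": 0.8, "polymorph": 0.4, "impurity": 0.0}
second_measurement = {"target": 0.9, "polymorph": 0.1, "impurity": 0.5}

post1 = update_posterior(prior, first_measurement)
post2 = update_posterior(post1, second_measurement)
print(decide(post1))  # not yet confident enough: request another measurement
print(decide(post2))  # confident enough to accept the assignment
```

When the decision is "not confident", the loop would choose the next measurement; when it is "confident", the sample can be labeled without expert review, which is what lets the protocol scale to many samples.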
