The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol
Anubhav Jain
Lawrence Berkeley National Laboratory
TMS Spring 2022, March 2022
Slides (already) posted to hackingmaterials.lbl.gov
ML is quickly becoming a standard tool for materials screening
[Figure: screening funnel. Labels: millions of candidates, machine learning, high-throughput DFT, expensive calculation, experiment]
There are many new algorithms being published for ML in materials, with new ones constantly reported!
Q: Which one is the “best” based on the literature?
A: Can’t tell! They’re nearly all done on different data.
Difficulty of comparing ML algorithms
[Figure: three separate data sets, used in study A, study B, and study C]
• Different data sets
  • Source (e.g., OQMD vs. MP)
  • Quantity (e.g., MP 2018 vs. MP 2019)
  • Subset / data filtering (e.g., ehull < X)
• Different evaluation metrics
  • Test set vs. cross-validation?
  • Different test set fraction?
• Often no runnable version of a published algorithm.
Example: MAE (5-fold CV) = 0.102 eV vs. RMSE (test set) = 0.098 eV. These two scores cannot be compared directly.
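To make the metric mismatch concrete, here is a minimal sketch that scores one set of predictions with both MAE and RMSE. The target and prediction values are invented for illustration and do not come from any study mentioned in the talk.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root-mean-square error; weights large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical targets and predictions (eV), made up for this example.
y_true = [0.0, 1.2, 2.5, 3.1, 0.8]
y_pred = [0.1, 1.0, 2.6, 3.9, 0.8]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(f"MAE  = {mae_value:.3f} eV")
print(f"RMSE = {rmse_value:.3f} eV")
```

Because RMSE is never smaller than MAE for the same predictions, a lower RMSE in one paper and a higher MAE in another says little about which model is better, even before differences in data and splits are considered.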
What’s needed – an “ImageNet” for materials science
https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/
What does a standard data set do for a field?
One of the reasons computer science / machine learning seems to advance so quickly is that the field decouples data generation from algorithm development.
This allows groups to focus on algorithm development without all the data generation, data cleaning, etc. that often make up the majority of an end-to-end data science project.
The ingredients of the Matbench benchmark
☐ Standard data sets
☐ Standard test splits according to nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
How to design good data sets for materials science?
• There is no single type of problem that materials scientists are trying to solve
• For now, focus on materials property prediction (from structure or composition)
• We want a test set that contains a diverse array of problems
  • Smaller data versus larger data
  • Different applications (electronic, mechanical, etc.)
  • Composition-only or structure information available
  • Experimental vs. ab initio
  • Classification or regression
Matbench includes 13 different ML tasks
Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
The tasks encompass a variety of problems
The ingredients of the Matbench benchmark
✓ Standard data sets
☐ Standard test splits according to nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
The most common method: a single hold-out test set
• Training/validation is used for model selection
• Test/hold-out is used only for error estimation (i.e., final score)
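The single hold-out scheme above can be sketched as an index split. This is an illustrative helper, not code from the talk; the function name, the 60/20/20 fractions, and the fixed seed are all assumptions.

```python
import random


def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle indices once and cut them into train, validation, and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test = indices[:n_test]                  # touched only once, for the final score
    val = indices[n_test:n_test + n_val]     # used to choose between candidate models
    train = indices[n_test + n_val:]         # used to fit each candidate model
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The weakness of this scheme is that the final score depends on which samples happened to land in the one test set.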
Nested CV as a standard scoring metric
Nested CV is like hold-out, but varies the hold-out set. Think of it as k different “universes”: we have a different training + validation of the model in each universe and a different hold-out.
“A nested CV procedure provides an almost unbiased estimate of the true error.”
Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006)
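A minimal sketch of the nested CV idea follows, using only the standard library. The "model" is a deliberately trivial stand-in (a constant prediction with one hyperparameter, `alpha`) and the data are synthetic; both are assumptions chosen to keep the loop structure visible, not anything used by Matbench.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_shrunk_mean(y_train, alpha):
    """Toy model: a constant prediction, the training mean scaled by (1 - alpha)."""
    return (1.0 - alpha) * statistics.fmean(y_train)


def mae(y_true, prediction):
    """Mean absolute error of a constant prediction."""
    return statistics.fmean([abs(y - prediction) for y in y_true])


def nested_cv(y, outer_k=5, inner_k=3, alphas=(0.0, 0.1, 0.5)):
    """Return one hold-out MAE per outer fold ("universe")."""
    outer_folds = k_fold_indices(len(y), outer_k, seed=0)
    outer_scores = []
    for i, test_idx in enumerate(outer_folds):
        # Everything outside the current hold-out fold is available for development.
        y_dev = [y[j] for f, fold in enumerate(outer_folds) if f != i for j in fold]
        # Inner loop: choose alpha using the development data only.
        inner_folds = k_fold_indices(len(y_dev), inner_k, seed=1)
        best_alpha, best_score = None, float("inf")
        for alpha in alphas:
            fold_scores = []
            for v, val_idx in enumerate(inner_folds):
                y_train = [y_dev[j] for f, fold in enumerate(inner_folds) if f != v for j in fold]
                y_val = [y_dev[j] for j in val_idx]
                fold_scores.append(mae(y_val, fit_shrunk_mean(y_train, alpha)))
            score = statistics.fmean(fold_scores)
            if score < best_score:
                best_alpha, best_score = alpha, score
        # Refit on all development data, then score once on the untouched hold-out fold.
        prediction = fit_shrunk_mean(y_dev, best_alpha)
        outer_scores.append(mae([y[j] for j in test_idx], prediction))
    return outer_scores


# Synthetic targets, invented for illustration.
rng = random.Random(42)
y = [rng.gauss(2.0, 0.5) for _ in range(60)]
scores = nested_cv(y)
print("per-fold MAE:", [round(s, 3) for s in scores])
print("mean MAE:", round(statistics.fmean(scores), 3))
```

The hold-out fold never influences the choice of `alpha`, which is the property that keeps the reported error from being optimistically biased by model selection.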
The ingredients of the Matbench benchmark
✓ Standard data sets
✓ Standard test splits according to nested cross-validation procedure
☐ An online leaderboard that encourages reproducible results
Matbench Website – now complete!
https://matbench.materialsproject.org
Matbench compares ML algorithms
[Figure: comparison of ML algorithms. Annotations: “Bigger datasets”, “Better relative performance”]
Access to Datasets/ML tasks
• Interactively, via Materials Project: ml.materialsproject.org
• Programmatically via matbench in Python (2 lines; loads all 13 tasks)
• Programmatically via matminer in Python (2 lines): https://github.com/hackingmaterials/matminer
• Direct download, via matbench.materialsproject.org
Preferred/easiest method!
Programmatic Access and Analysis of Submissions
• Run a benchmark on your own algorithm in ~10 lines of code
• Run on any combination or all of the 13 existing tasks
• If your entry outperforms an existing entry, submit your algorithm in a pull request!
• Existing notebooks/code and software requirements for reproducing any benchmark, e.g. {'python': [['crabnet==1.2.1', 'scikit_learn==1.0.2', 'matbench==0.5']]}
• Comprehensive raw data (accessible via the matbench Python package or any JSON-capable language) on all benchmarks. Publicly available to anyone!
• In-depth performance metrics for individual ML tasks for all submissions, both visually on the website and programmatically
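The "~10 lines of code" workflow might look roughly like the sketch below. It assumes the `MatbenchBenchmark` interface as I understand it from the matbench package documentation (around v0.5); the class and method names should be checked against the installed version. The `model` argument, the function name `run_matbench`, and the output file name are hypothetical.

```python
def run_matbench(model, task_names=None, out_file="my_benchmark.json.gz"):
    """Run `model` through the Matbench folds and save the recorded predictions.

    `model` is a hypothetical object with scikit-learn-style fit/predict methods
    that can consume the task inputs (compositions or structures) directly.
    """
    # Imported lazily so this sketch can be loaded without matbench installed.
    from matbench.bench import MatbenchBenchmark

    # Assumption: subset=None selects every task; autoload=False defers downloads.
    mb = MatbenchBenchmark(autoload=False, subset=task_names)
    for task in mb.tasks:
        task.load()
        for fold in task.folds:
            # Training and validation data for this fold; the test targets stay hidden.
            train_inputs, train_outputs = task.get_train_and_val_data(fold)
            model.fit(train_inputs, train_outputs)
            test_inputs = task.get_test_data(fold, include_target=False)
            task.record(fold, model.predict(test_inputs))
    mb.to_file(out_file)
    return mb
```

A call such as `run_matbench(my_model, task_names=["matbench_expt_gap"])` would restrict the run to a single task; the saved file is what a leaderboard submission is built around.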
The ingredients of the Matbench benchmark
✓ Standard data sets
✓ Standard test splits according to nested cross-validation procedure
✓ An online leaderboard that encourages reproducible results
What algorithms have been tested on the matbench data set so far?
• Magpie + sine Coulomb matrix random forest (feature-based random forests)
  • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
  • Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
• Automatminer (feature-based AutoML)
  • Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138.
• CGCNN (graph neural network)
  • Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
• MEGNet (graph neural network)
  • Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 2019, 31 (9), 3564–3572.
• MODNet (feature-based neural network)
  • De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. arXiv:2102.02263 [cond-mat] 2021.
• CrabNet (attention-based composition neural network)
  • Wang, A.; Kauwe, S.; Murdock, R.; Sparks, T. Compositionally-Restricted Attention-Based Network for Materials Property Prediction; ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.11869026.v1.
• ALIGNN (graph neural network with bond angles)
  • Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
Insights from standardized comparisons
• Originally, we found traditional “hand-crafted” feature models generally performed best when n < 10^4 (n = number of samples)
• So it seemed matsci data (typically small datasets, esp. experimental) was best modelled by traditional ML/feature methods, e.g. Random Forest
• Clever developments in neural networks have improved GNN models on smaller datasets, in part powered by competition on the Matbench leaderboard
• A standardized platform has enabled easier identification of techniques which work well for certain problems, and those that do not
Insights from standardized comparisons
Errors Predicting Final Phonon DOS Peak Frequencies

Algorithm               Mean MAE (cm-1)   Mean RMSE (cm-1)   Maximum max_error (cm-1)
ALIGNN (2022)           29.5385           53.501             615.3466
MODNet v0.1.10 (2021)   38.7524           78.222             1031.8168
CrabNet (2021)          55.1114           138.3775           1452.7562
AMMExpress (2020)       56.1706           109.7048           1151.557
CGCNN (2019)            57.7635           141.7018           2504.8743

[Figure: Mean Absolute Error (μ_MAE ± σ_MAE) Predicting Final PhDOS Peaks. Annotations: “Structural GNN (2022)”, “Composition GNN (2021)”, “SoTA early 2020”]

Same data, same test; so, why are some algorithms best?
• ALIGNN: Incorporation of bond angle into crystal graph
  • Bond angle/local env importance for vibrational properties?
• Matbench enables these sorts of “instant” ablation studies
Insights from standardized comparisons
Errors Predicting Expt. E_gap

Algorithm          Mean MAE (eV)   Std. MAE (eV)   Mean RMSE (eV)
CrabNet            0.3463          0.0088          0.8504
MODNet (v0.1.10)   0.347           0.0222          0.7437
CrabNet v1.2.1     0.3757          0.0207          0.8805
AMMExpress v2020   0.4161          0.0194          0.9918

[Figure: Mean Absolute Error (μ_MAE ± σ_MAE) Predicting Expt. E_gap. Annotations: “Composition GNN”, “Traditional Features + Encoding/selection”, “SoTA early 2020”]

Same data, same test; so, why are some algorithms best?
• CrabNet: Importance of attention mechanism for compositional props.; low variability across folds
• MODNet: Normalized Mutual Information feature selection results in high performance at risk of higher variability across folds
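The "Mean MAE" and "Std. MAE" columns summarize one MAE per cross-validation fold. A minimal sketch of that summary is below; the per-fold numbers are invented to show two models with similar means but different fold-to-fold variability, and the use of the population standard deviation is an assumption, not a statement about how the leaderboard computes it.

```python
import statistics


def summarize_folds(fold_maes):
    """Return (mean, standard deviation) of per-fold MAE values."""
    mean_mae = statistics.fmean(fold_maes)
    std_mae = statistics.pstdev(fold_maes)  # population std; an assumption
    return mean_mae, std_mae


# Hypothetical per-fold MAEs (eV) for two models, made up for illustration.
steady_model = [0.34, 0.35, 0.36, 0.34, 0.35]
variable_model = [0.31, 0.38, 0.33, 0.39, 0.32]

steady_mean, steady_std = summarize_folds(steady_model)
variable_mean, variable_std = summarize_folds(variable_model)
print(f"steady:   {steady_mean:.4f} ± {steady_std:.4f} eV")
print(f"variable: {variable_mean:.4f} ± {variable_std:.4f} eV")
```

Two entries can be nearly tied on mean MAE while one is far less consistent across folds, which is the distinction the CrabNet and MODNet bullets draw.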
Improvements to Materials ML Benchmarks
Standardized Uncertainty Quantification
• ML-driven materials design is improved by UQ of each prediction
• Enables adaptive design
• Practical: modern models (e.g., MODNet) produce UQ estimates naturally
• Useful: can analyze UQ to tell us how often samples’ true values actually fall outside the UQ range
• In progress: coming soon to the matbench package!
More Datasets + Better Tasks!
• Impossible to represent the full field of materials design in a single set of benchmarks
• However… can we come close? Aim to include a wider variety of properties and sources:
  • Expt. load-dependent Vickers hardness
  • Expt. superconductor Tc
  • Expt. Δ[symbol illegible] from crystal structure
  • Expt. UV-Vis measurements of metal oxides
• Unique, domain-specific procedures for each task
  • For example: segregation of CV samples into clusters based on structure/composition (LOCOCV)
• Evaluation procedures which most closely resemble real world usage of these algorithms in the most computationally feasible fashion
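The cluster-based splitting mentioned above (LOCOCV, leave-one-cluster-out cross-validation) can be sketched as follows. The cluster labels here are hypothetical chemistry families assigned by hand; in practice they would come from some clustering of structures or compositions, which this sketch does not attempt.

```python
from collections import defaultdict


def leave_one_cluster_out(cluster_labels):
    """Yield (held_out_label, train_indices, test_indices), one split per cluster."""
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)
    for held_out, test_idx in members.items():
        # Train on every sample whose cluster differs from the held-out one.
        train_idx = [i for label, idx in members.items() if label != held_out for i in idx]
        yield held_out, train_idx, test_idx


# Hypothetical cluster assignment for six samples, invented for illustration.
clusters = ["oxide", "oxide", "nitride", "sulfide", "nitride", "oxide"]
splits = list(leave_one_cluster_out(clusters))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Compared with random folds, each test set contains only samples from a family the model never saw in training, which is closer to how a model is used when screening for new kinds of materials.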
Conclusions and future
• As the community increasingly develops new algorithms for machine learning materials properties, a standard way to test these algorithms is needed
• Matbench represents such a standard and allows you to test your algorithms against others
• Matbench also allows us to measure overall progress in the field
• We hope to see you on the leaderboard!
Acknowledgements
Alex Dunn (lead developer)
Qi Wang
Alex Ganose
Daniel Dopp
Slides (already) posted to hackingmaterials.lbl.gov

More Related Content

What's hot

Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
Anubhav Jain
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
Anubhav Jain
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
Anubhav Jain
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
Anubhav Jain
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
Anubhav Jain
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
Anubhav Jain
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
BrianDeCost
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
Shintaro Fukushima
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discovery
Anubhav Jain
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
Anubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
Anubhav Jain
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and Analytics
Anubhav Jain
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
Anubhav Jain
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
Anubhav Jain
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)
Anubhav Jain
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
Anubhav Jain
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
PyData
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processing
Anubhav Jain
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
Ian Foster
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
aimsnist
 

What's hot (20)

Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...Natural Language Processing for Materials Design - What Can We Extract From t...
Natural Language Processing for Materials Design - What Can We Extract From t...
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...Capturing and leveraging materials science knowledge from millions of journal...
Capturing and leveraging materials science knowledge from millions of journal...
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Materials Informatics and Python
Materials Informatics and PythonMaterials Informatics and Python
Materials Informatics and Python
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discovery
 
Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...Materials design using knowledge from millions of journal articles via natura...
Materials design using knowledge from millions of journal articles via natura...
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
 
DuraMat Data Management and Analytics
DuraMat Data Management and AnalyticsDuraMat Data Management and Analytics
DuraMat Data Management and Analytics
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...Discovering advanced materials for energy applications by mining the scientif...
Discovering advanced materials for energy applications by mining the scientif...
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Accelerating materials design through natural language processing
Accelerating materials design through natural language processingAccelerating materials design through natural language processing
Accelerating materials design through natural language processing
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
 

Similar to The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol

Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...
Anubhav Jain
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...
Anubhav Jain
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
tsysglobalsolutions
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
KAMAL CHOUDHARY
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
Transfer defect learning
Transfer defect learningTransfer defect learning
Transfer defect learning
Sung Kim
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
Anubhav Jain
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...
IOSR Journals
 
Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?
Manuel Martín
 
A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...
IRJET Journal
 
Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization 
CS, NcState
 
2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
aimsnist
 
Predicting Fault-Prone Files using Machine Learning
Predicting Fault-Prone Files using Machine LearningPredicting Fault-Prone Files using Machine Learning
Predicting Fault-Prone Files using Machine Learning
Guido A. Ciollaro
 
Software Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSoftware Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled Datasets
Sung Kim
 
Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)Survey on classification algorithms for data mining (comparison and evaluation)
Survey on classification algorithms for data mining (comparison and evaluation)
Alexander Decker
 
2cee Master Cocomo20071
2cee Master Cocomo200712cee Master Cocomo20071
2cee Master Cocomo20071
CS, NcState
 
Folker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data AnnotationFolker Meyer: Metagenomic Data Annotation
Folker Meyer: Metagenomic Data Annotation
GigaScience, BGI Hong Kong
 
Partial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather ConditionsPartial Object Detection in Inclined Weather Conditions
Partial Object Detection in Inclined Weather Conditions
IRJET Journal
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
Anubhav Jain
 
PCA_2022-In_and_out.pptx zxczxczxczxczxcxzczx
PCA_2022-In_and_out.pptx zxczxczxczxczxcxzczxPCA_2022-In_and_out.pptx zxczxczxczxczxcxzczx
PCA_2022-In_and_out.pptx zxczxczxczxczxcxzczx
JuanManuelNasralaAlv1
 

Similar to The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol (20)

Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...Perspectives on chemical composition and crystal structure representations fr...
Perspectives on chemical composition and crystal structure representations fr...
 
Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...Evaluating Chemical Composition and Crystal Structure Representations using t...
Evaluating Chemical Composition and Crystal Structure Representations using t...
 
IEEE Datamining 2016 Title and Abstract
IEEE  Datamining 2016 Title and AbstractIEEE  Datamining 2016 Title and Abstract
IEEE Datamining 2016 Title and Abstract
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Transfer defect learning
Transfer defect learningTransfer defect learning
Transfer defect learning
 
Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...Discovering new functional materials for clean energy and beyond using high-t...
Discovering new functional materials for clean energy and beyond using high-t...
 
A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...A Hierarchical Feature Set optimization for effective code change based Defec...
A Hierarchical Feature Set optimization for effective code change based Defec...
 
Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?Automating Machine Learning - Is it feasible?
Automating Machine Learning - Is it feasible?
 
A Review on Prediction of Compressive Strength and Slump by Using Different M...
A Review on Prediction of Compressive Strength and Slump by Using Different M...A Review on Prediction of Compressive Strength and Slump by Using Different M...
The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol

  • 1. The Status of ML Algorithms for Structure-property Relationships Using Matbench as a Test Protocol Anubhav Jain Lawrence Berkeley National Laboratory TMS Spring 2022, March 2022 Slides (already) posted to hackingmaterials.lbl.gov
  • 2. ML is quickly becoming a standard tool for materials screening. (Figure labels: Machine learning; High-throughput DFT; Expensive calculation; Experiment; Millions of candidates)
  • 3. There are many new algorithms being published for ML in materials, with new ones constantly reported!
  • 4. There are many new algorithms being published for ML in materials, with new ones constantly reported! Q: Which one is the “best” based on the literature?
  • 5. There are many new algorithms being published for ML in materials, with new ones constantly reported! Q: Which one is the “best” based on the literature? A: Can’t tell! They’re nearly all done on different data.
  • 6. Difficulty of comparing ML algorithms (Figure labels: data set used in study A; data set used in study B; data set used in study C)
      • Different data sets
          • Source (e.g., OQMD vs MP)
          • Quantity (e.g., MP 2018 vs MP 2019)
          • Subset / data filtering (e.g., ehull<X)
      • Different evaluation metrics
          • Test set vs. cross validation?
          • Different test set fraction?
      • Often no runnable version of a published algorithm.
      • Example of two results that cannot be compared directly: MAE 5-Fold CV = 0.102 eV vs. RMSE Test set = 0.098 eV
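Slide 6's point that an MAE and an RMSE are not directly comparable can be checked on a toy example. The sketch below uses made-up numbers (they do not come from any study in the talk) and computes both metrics on the same predictions. RMSE is never smaller than MAE for the same errors, so a lower RMSE in one paper does not by itself indicate a worse model than a higher MAE in another.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; weights large errors more heavily than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up target values (eV) and predictions, for illustration only.
y_true = [0.00, 0.50, 1.00, 1.50]
y_pred = [0.10, 0.40, 1.00, 1.90]

print(f"MAE  = {mae(y_true, y_pred):.3f} eV")
print(f"RMSE = {rmse(y_true, y_pred):.3f} eV")
```

Here a single large error (0.4 eV) raises the RMSE well above the MAE even though the predictions are identical, which is one reason a shared metric matters as much as a shared data set.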
  • 7. What’s needed: an “ImageNet” for materials science (https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/)
  • 8. What does a standard data set do for a field? One of the reasons computer science / machine learning seems to advance so quickly is that the field decouples data generation from algorithm development. This allows groups to focus on algorithm development without the data generation, data cleaning, etc. that often make up the majority of an end-to-end data science project.
  • 9. The ingredients of the Matbench benchmark
      ☐ Standard data sets
      ☐ Standard test splits according to nested cross-validation procedure
      ☐ An online leaderboard that encourages reproducible results
  • 10. How to design good data sets for materials science?
      • There is no single type of problem that materials scientists are trying to solve
      • For now, focus on materials property prediction (from structure or composition)
      • We want a test set that contains a diverse array of problems
          • Smaller data versus larger data
          • Different applications (electronic, mechanical, etc.)
          • Composition-only or structure information available
          • Experimental vs. ab initio
          • Classification or regression
  • 11. Matbench includes 13 different ML tasks. Source: Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
  • 12. The tasks encompass a variety of problems. Source: Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
  • 13. The ingredients of the Matbench benchmark
      ✓ Standard data sets
      ☐ Standard test splits according to nested cross-validation procedure
      ☐ An online leaderboard that encourages reproducible results
  • 14. The most common method: a single hold-out test set
      • Training/validation is used for model selection
      • Test/hold-out is used only for error estimation (i.e., final score)
  • 15. Nested CV as a standard scoring metric. Nested CV is like hold-out, but varies the hold-out set. Think of it as k different “universes”: we have a different training + validation of the model in each universe and a different hold-out.
  • 16. Nested CV as a standard scoring metric. Nested CV is like hold-out, but varies the hold-out set. Think of it as N different “universes”: we have a different training + validation of the model in each universe and a different hold-out. “A nested CV procedure provides an almost unbiased estimate of the true error.” Varma and Simon, Bias in error estimation when using cross-validation for model selection (2006)
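The nested CV idea on slides 15 and 16 can be sketched in a few lines of plain Python. This is a minimal illustration, not Matbench's actual splitting code: the data are synthetic, the model is a toy 1-D k-nearest-neighbour regressor, and the helper names (`kfold_indices`, `select_k`, `nested_cv`) are hypothetical. The outer loop plays the role of the different “universes”; the inner loop does model selection using only that universe's training data.

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and deal it into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train_x, train_y, x, k):
    """1-D k-nearest-neighbour regression: mean target of the k closest points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return mean(train_y[i] for i in nearest)

def fold_mae(xs, ys, train_idx, test_idx, k):
    """MAE on test_idx for a k-NN model fitted on train_idx."""
    tx = [xs[i] for i in train_idx]
    ty = [ys[i] for i in train_idx]
    return mean(abs(knn_predict(tx, ty, xs[i], k) - ys[i]) for i in test_idx)

def select_k(xs, ys, train_idx, candidates, n_inner=3):
    """Inner loop: pick k by cross-validation on the training indices only."""
    def cv_error(k):
        errors = []
        for fold in kfold_indices(len(train_idx), n_inner, seed=1):
            val = [train_idx[j] for j in fold]
            val_set = set(val)
            fit = [i for i in train_idx if i not in val_set]
            errors.append(fold_mae(xs, ys, fit, val, k))
        return mean(errors)
    return min(candidates, key=cv_error)

def nested_cv(xs, ys, candidates=(1, 3, 5), n_outer=5):
    """Outer loop: each fold is held out once and never used for model selection."""
    scores = []
    for test_idx in kfold_indices(len(xs), n_outer):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        best_k = select_k(xs, ys, train_idx, candidates)
        scores.append(fold_mae(xs, ys, train_idx, test_idx, best_k))
    return scores

# Synthetic data, for illustration only: y = 2x + noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

scores = nested_cv(xs, ys)
print(f"MAE per outer fold: {[round(s, 3) for s in scores]}")
print(f"Mean MAE: {mean(scores):.3f}")
```

The reported score is the mean over the outer folds. Because the hyperparameter is chosen again inside each outer fold, the held-out data never influence model selection, which is the property the Varma and Simon quote refers to.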
  • 17. The ingredients of the Matbench benchmark
      ✓ Standard data sets
      ✓ Standard test splits according to nested cross-validation procedure
      ☐ An online leaderboard that encourages reproducible results
  • 18. Matbench Website – now complete! https://matbench.materialsproject.org
  • 19. Matbench compares ML algorithms. (Figure annotations: bigger datasets; better relative performance)
  • 20. Access to Datasets/ML tasks: interactively, via Materials Project (ml.materialsproject.org); programmatically via matbench in Python (2 lines; loads all 13 tasks); programmatically via matminer in Python (2 lines); direct download, via matbench.materialsproject.org. Preferred/easiest method! https://github.com/hackingmaterials/matminer
  • 21. Programmatic Access and Analysis of Submissions
      • Run a benchmark on your own algorithm in ~10 lines of code
      • Run on any combination or all of the 13 existing tasks
      • If your entry outperforms an existing entry, submit the algorithm in a pull request!
      • Existing notebooks/code and software requirements for reproducing any benchmark, e.g. {'python': [['crabnet==1.2.1', 'scikit_learn==1.0.2', 'matbench==0.5']]}
      • Comprehensive raw data (accessible via the matbench Python package or any JSON-capable language) on all benchmarks. Publicly available to anyone!
      • In-depth performance metrics for individual ML tasks for all submissions, both visually on the website and programmatically
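The “~10 lines of code” workflow on slide 21 might look roughly like the sketch below. It follows the `MatbenchBenchmark` usage as I understand it from the matbench documentation (`task.load()`, `task.folds`, `get_train_and_val_data`, `get_test_data`, `record`, `to_file`); check these calls against the installed version before relying on them. `MeanBaseline`, `run_benchmark`, and the output file name are hypothetical and exist only for this illustration. The matbench import is deferred into the function so the block can be read and loaded without the package installed.

```python
from statistics import mean

class MeanBaseline:
    """Hypothetical stand-in model: always predicts the mean training target."""

    def train_and_validate(self, inputs, outputs):
        # A real model would do its own inner validation here.
        self.mean_ = mean(float(y) for y in outputs)

    def predict(self, inputs):
        return [self.mean_ for _ in range(len(inputs))]

def run_benchmark(model, subset=("matbench_phonons",), out_file="my_benchmark.json.gz"):
    """Run `model` on the chosen Matbench tasks using the package's fixed folds."""
    # Deferred import: needs `pip install matbench`; data download on first use.
    from matbench.bench import MatbenchBenchmark

    mb = MatbenchBenchmark(autoload=False, subset=list(subset))
    for task in mb.tasks:
        task.load()
        for fold in task.folds:
            train_inputs, train_outputs = task.get_train_and_val_data(fold)
            model.train_and_validate(train_inputs, train_outputs)
            test_inputs = task.get_test_data(fold, include_target=False)
            task.record(fold, model.predict(test_inputs))
    mb.to_file(out_file)  # results file of the kind submitted with a pull request
    return mb
```

Calling `run_benchmark(MeanBaseline())` would then score the baseline on the phonon task only; passing other task names in `subset` is how one would cover any combination of the 13 tasks.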
  • 22. The ingredients of the Matbench benchmark
      ✓ Standard data sets
      ✓ Standard test splits according to nested cross-validation procedure
      ✓ An online leaderboard that encourages reproducible results
  • 23. What algorithms have been tested on the Matbench data set so far?
      • Magpie + sine Coulomb matrix random forest (feature-based random forests)
          • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016). https://doi.org/10.1038/npjcompumats.2016.28
          • Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
      • Automatminer (feature-based AutoML)
          • Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138.
      • CGCNN (graph neural network)
          • Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
      • MEGNet (graph neural network)
          • Chen, C.; Ye, W.; Zuo, Y.; Zheng, C.; Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chemistry of Materials 2019, 31 (9), 3564–3572.
      • MODNet (feature-based neural network)
          • De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. arXiv:2102.02263 [cond-mat] 2021.
      • CrabNet (attention-based composition neural network)
          • Wang, A.; Kauwe, S.; Murdock, R.; Sparks, T. Compositionally-Restricted Attention-Based Network for Materials Property Prediction; ChemRxiv, 2020. https://doi.org/10.26434/chemrxiv.11869026.v1.
      • ALIGNN (graph neural network with bond angles)
          • Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
  • 24. Insights from standardized comparisons
      • Originally, we found that traditional “hand-crafted” feature models generally performed best when the dataset size n < 10^4
      • So it seemed that materials science data, typically small datasets (especially experimental ones), was best modelled by traditional ML/feature methods, e.g. Random Forest
      • Clever developments in neural networks have improved GNN models on smaller datasets, in part powered by competition on the Matbench leaderboard
      • The standardized platform has made it easier to identify techniques that work well for certain problems, and those that do not
  • 25. Insights from standardized comparisons: errors predicting final phonon DOS peak frequencies
      Algorithm | Mean MAE (cm⁻¹) | Mean RMSE (cm⁻¹) | Maximum max_error (cm⁻¹)
      ALIGNN (2022) | 29.5385 | 53.501 | 615.3466
      MODNet v0.1.10 (2021) | 38.7524 | 78.222 | 1031.8168
      CrabNet (2021) | 55.1114 | 138.3775 | 1452.7562
      AMMExpress (2020) | 56.1706 | 109.7048 | 1151.557
      CGCNN (2019) | 57.7635 | 141.7018 | 2504.8743
      (Figure: Mean Absolute Error, μ_MAE ± σ_MAE, predicting final PhDOS peaks. Annotations: Structural GNN (2022); Composition GNN (2021); SoTA early 2020)
      Same data, same test; so, why are some algorithms best?
      • ALIGNN: Incorporation of bond angle into crystal graph
      • Bond angle/local environment importance for vibrational properties?
      • Matbench enables these sorts of “instant” ablation studies
  • 26. Insights from standardized comparisons: errors predicting Expt. E_gap
      Algorithm | Mean MAE (eV) | Std. MAE (eV) | Mean RMSE (eV)
      CrabNet | 0.3463 | 0.0088 | 0.8504
      MODNet (v0.1.10) | 0.347 | 0.0222 | 0.7437
      CrabNet v1.2.1 | 0.3757 | 0.0207 | 0.8805
      AMMExpress v2020 | 0.4161 | 0.0194 | 0.9918
      (Figure: Mean Absolute Error, μ_MAE ± σ_MAE, predicting Expt. E_gap. Annotations: Composition GNN; Traditional Features + Encoding/selection; SoTA early 2020)
      Same data, same test; so, why are some algorithms best?
      • CrabNet: Importance of attention mechanism for compositional properties; low variability across folds
      • MODNet: Normalized Mutual Information feature selection results in high performance at risk of higher variability across folds
  • 27. Improvements to Materials ML Benchmarks
      • Standardized Uncertainty Quantification
          • ML-Materials design improved by UQ of each prediction
          • Enables adaptive design
          • Practical: modern models (e.g., MODNet) produce UQ estimates naturally
          • Useful: can analyze UQ to tell us how often samples' true values actually fall outside the UQ range
          • In progress: coming soon to the matbench package!
      • More Datasets + Better Tasks!
          • Impossible to represent the full field of materials design in a single set of benchmarks
          • However, can we come close? Aim to include a wider variety of properties and sources: Expt. load-dependent Vickers hardness; Expt. superconductor Tc; Expt. Δ"# $ from crystal structure; Expt. UV-Vis measurements of metal oxides
          • Unique, domain-specific procedures for each task, for example segregation of CV samples into clusters based on structure/composition (LOCOCV)
          • Evaluation procedures which most closely resemble real-world usage of these algorithms in the most computationally feasible fashion
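Slide 27 mentions checking how often true values fall outside a model's UQ range. One generic way to do that is an interval coverage check, sketched below. This is not the procedure planned for the matbench package (the slides do not specify it); it assumes each prediction comes with a standard deviation and uses made-up numbers. The function name `interval_coverage` and the default `z=1.96` are choices made for this illustration.

```python
def interval_coverage(y_true, y_pred, y_std, z=1.96):
    """Fraction of true values inside [pred - z*std, pred + z*std]."""
    inside = sum(
        1 for t, p, s in zip(y_true, y_pred, y_std) if abs(t - p) <= z * s
    )
    return inside / len(y_true)

# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 2.5, 2.9, 4.0]
y_std = [0.1, 0.1, 0.2, 0.3]

coverage = interval_coverage(y_true, y_pred, y_std)
print(f"Coverage of the z=1.96 interval: {coverage:.2f}")
```

If the predicted uncertainties were well calibrated and roughly Gaussian, the coverage at z = 1.96 would be close to 0.95 on a large test set; a much lower value suggests the model is overconfident.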
  • 28. Conclusions and future
      • As the community increasingly develops new algorithms for machine learning materials properties, a standard way to test these algorithms is needed
      • Matbench represents such a standard and allows you to test your algorithms against others
      • Matbench also allows us to measure overall progress in the field
      • We hope to see you on the leaderboard!
  • 29. Acknowledgements: Alex Dunn (lead developer), Qi Wang, Alex Ganose, Daniel Dopp. Slides (already) posted to hackingmaterials.lbl.gov