8/7/2019
Max Planck Society
When The New Science Is in The Outliers
Matthias Scheffler
Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany, and Physics Department and IRIS
Adlershof, Humboldt-Universität zu Berlin, 12489 Berlin, Germany
Several issues hamper progress in data-driven materials science. In particular, a FAIR [1] data
infrastructure and appropriate data-analytics methodology [2] are still missing.
Significant efforts are still necessary to fully realize the A and I of FAIR. Here, the development of metadata,
their intricate relationships, and data ontology need critical attention. To be accepted by the community,
a FAIR data infrastructure should work without bureaucratic hurdles or the need for special training.
In this talk, I will discuss the challenges and progress, focusing on computational materials science.
Concerning data analytics, we note that the number of possible materials is practically infinite, but only
10 or 100 of them may be relevant for a certain science or engineering purpose. In simple words, in
materials science and engineering, we are often looking for “needles in a haystack”. Fitting or
machine-learning all data (i.e. the hay) with a single, global model may average away the specialties of the
interesting minority (i.e. the needles). I will discuss methods that identify statistically exceptional
subgroups in a large amount of data, and I will discuss how one can estimate the domains of applicability of
machine-learning models [3].
1. FAIR stands for Findable, Accessible, Interoperable and Re-usable. The FAIR Data Principles;
https://www.force11.org/group/fairgroup/fairprinciples
2. C. Draxl and M. Scheffler, Big-Data-Driven Materials Science and its FAIR Data Infrastructure. Plenary Chapter in Handbook of Materials
Modeling (eds. S. Yip and W. Andreoni), Springer (2019). https://arxiv.org/ftp/arxiv/papers/1904/1904.05859.pdf
3. Ch. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, Domains of Applicability of Machine-Learning Models for Novel
Materials Discovery, to be published.
High-Throughput Screening
in Computational (and Experimental) Materials Science
Consider as many compounds as possible, typically O(10^3) – O(10^5)
O(10^1) – O(10^2) compounds selected
Recycle the “waste”!
Enable re-purposing.
Sharing Advances Science
Need for a FAIR, Efficient Research-Data Infrastructure
Animation by G.-M. Rignanese
Findable Accessible Interoperable Reusable
M. D. Wilkinson et al., Scientific Data 3, 160018 (2016)
Since
2015
Since
2014
Encyclopedia
Archive (normalized data)
Visualization
Repository (raw data)
Big-Data Analytics
The NOMAD Center of Excellence, since 2015
The NOMAD Repository
Requests the full input and output files
> 50 million total-energy calculations
90% of the VASP files are from:
• AFLOW (S. Curtarolo)
• OQMD (C. Wolverton)
• Materials Project (G. Ceder, K. Persson)
What Is Needed for a FAIR Data Infrastructure?
• Scientific results are only meaningful and worth keeping if they are fully characterized and all individual steps are fully documented.
• Computed data are only meaningful when method, approximations, code, code version, and all computational parameters are known.
• For experimental data, we need a full characterization of the sample, the description of the apparatus, the measurement conditions, and the measured quantity.
This requires metadata, ontologies, and workflows.
We also need good search engines, an “encyclopedia” GUI, and appropriate hardware.
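The requirement that computed data carry method, approximations, code, code version, and all computational parameters can be pictured as a minimal metadata record. The following is only a sketch under stated assumptions: the class name `CalculationMetadata`, its field names, and the example values are hypothetical and do not represent the NOMAD metadata schema.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass(frozen=True)
class CalculationMetadata:
    """Hypothetical minimal record; field names are illustrative only."""
    method: str                      # e.g. "DFT"
    approximations: tuple            # e.g. ("PBE",)
    code: str                        # name of the simulation code
    code_version: str                # exact version string
    parameters: dict = field(default_factory=dict)  # all computational settings

    def missing_fields(self):
        """Return the names of fields that are empty, i.e. not documented."""
        return [name for name, value in asdict(self).items() if not value]


record = CalculationMetadata(
    method="DFT",
    approximations=("PBE",),
    code="example-code",             # placeholder, not a real package name
    code_version="1.0.0",
    parameters={"k_grid": [4, 4, 4], "basis": "placeholder"},
)

print(json.dumps(asdict(record), indent=2))
print("undocumented fields:", record.missing_fields())
```

A record with any empty field would be flagged by `missing_fields()`, which mirrors the point that a result without its full characterization is not worth keeping.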
Artificial Intelligence (AI): Any technique that enables computers to mimic human intelligence, using logical if-then rules, compressed sensing, machine learning (including deep learning).
Machine Learning: A subset of AI that includes statistical techniques that enable machines to improve at tasks with more data. It includes deep learning.
Deep Learning: The subset of machine learning composed of algorithms that permit software to train itself to perform tasks, like speech and image recognition, by exposing multilayered neural networks to vast amounts of data.
Learning from “Big” Data:
Very Many Methods and Concepts,
Very Interdisciplinary
Building Maps of Materials
(Role Models: Periodic Table, Ashby Plots)
Crystal-structure prediction:
• Octet binaries (ZB vs. RS)
• Al_xGa_yIn_zO_3 (x + y + z = 2)
• Perovskites (Goldschmidt tolerance factor)
Property classification:
• Topological insulators
• Metal vs. insulator
Activation of CO2 at metal oxides and carbides
(work in progress)
Global Learning
-- Machine Learning --
One single model to describe the whole population (known and unknown data)
• minimize the overall prediction error (e.g. RMSE) using regularization
• therefore, disregard (on purpose) all local details
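A minimal sketch of what such a global, regularized fit does in practice, assuming ridge regression as the regularized model and a made-up toy population (neither is taken from the slides). Most points lie on a line; a few "needles" sit well above it.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^-1 X^T y.

    For simplicity the intercept column is penalized as well.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)


def rmse(y_pred, y_true):
    """Root-mean-square error over the whole population."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Toy population (assumed data): 95 "hay" points on a line, 5 "needles" far above it.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = 2.0 * x + rng.normal(0.0, 0.05, size=100)
needles = np.arange(95, 100)
y[needles] += 3.0

X = np.column_stack([np.ones_like(x), x])   # intercept + one descriptor
w = fit_ridge(X, y, lam=1e-2)
errors = np.abs(X @ w - y)

print("global RMSE:", rmse(X @ w, y))
print("mean |error| on the majority:", errors[:95].mean())
print("mean |error| on the needles: ", errors[needles].mean())
```

The single model follows the majority closely while the few exceptional points carry by far the largest errors, which is the sense in which local details are disregarded.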
Global vs. Local Learning
Subgroups are statistically exceptional.
[Figure: target property P versus descriptor d, both axes from 0.0 to 1.0; data points labeled x=a and x=b]
Subgroup selector: $\sigma(j) \equiv (d_j \geq 0.8) \wedge (x_j = a)$
A global model fitted to the entire
dataset may be difficult to interpret and
may well hide or incorrectly describe the
actuating physical mechanisms.
Given:
• Sample S of the population
• Target property P_j
• Features (descriptors) d_j
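The selector σ(j) ≡ (d_j ≥ 0.8) ∧ (x_j = a) can be read as a Boolean mask over the sample. The sketch below applies it to a small made-up sample; the numerical values of `d`, `x`, and `P` are assumptions chosen only for illustration.

```python
import numpy as np

# Toy sample S (assumed data): continuous descriptor d, categorical descriptor x, target P.
d = np.array([0.10, 0.30, 0.55, 0.82, 0.85, 0.90, 0.95, 0.88])
x = np.array(["a", "b", "a", "a", "b", "a", "a", "b"])
P = np.array([0.20, 0.25, 0.30, 0.90, 0.35, 0.95, 0.92, 0.30])


def sigma(d, x):
    """Selector from the slide: sigma(j) = (d_j >= 0.8) AND (x_j == 'a')."""
    return (d >= 0.8) & (x == "a")


mask = sigma(d, x)
print("subgroup members:", np.flatnonzero(mask))
print("population mean of P:", P.mean())
print("subgroup mean of P:  ", P[mask].mean())
```

In this toy sample the selected members have a markedly higher target value than the population as a whole, a difference that a single global fit over all points would tend to blur.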
Turning Greenhouse Gases into
Useful Chemicals and Fuels
[Figure labels: CO, CO2, C; formic acid, formaldehyde, methanol, methane]
We need an efficient catalyst!
Aliaksei Mazheika, Sergey Levchenko, Francesc Illas
H.-J. Freund et al., Angew. Chem. Int. Ed. 50, 10064 (2011).
Identifying New Potential Catalysts
Considering Oxides
Oxides:
A²⁺B⁴⁺O₃, AO, BO₂,
A³⁺B³⁺O₃, A₂O₃ (B₂O₃),
A¹⁺B⁵⁺O₃, A₂O, BO
A²⁺: Mg, Ca, Sr, Ba
A³⁺ (B³⁺): Al, Ga, In, Sc, Y, La
B⁴⁺: Ti, Zr, Si, Ge, Sn
A⁺: Li, Na, K, Rb, Cs
B⁵⁺: Nb, V, Sb
Machine learning of all produced data
does not provide a good description.
Consider surfaces of many different
materials and all possibly relevant surface
sites: Which materials (and surface sites)
are catalytically active?
Two Possibly Interesting Subgroups
for Identifying High-Performance Materials
Subgroup identification:
• Define a ‘target property’.
• Minimize the width of the target-property distribution.
• Maximize the distance between the median of the target-property distribution and that of the whole data set.
• Maximize the size of the subgroup.
For how many xxx compounds do we know high catalytic
activity? What is meant by high catalytic activity?
1) ‘Small O-C-O angle’ subgroup
2) ‘Large C-O bond length’ subgroup
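The three subgroup-identification criteria (narrow target distribution, large median shift, large subgroup) can be folded into one score. The sketch below uses a simple product of three factors; this particular functional form, the function name `subgroup_quality`, and the toy target values are assumptions for illustration and are not taken from the slides.

```python
import numpy as np


def subgroup_quality(target, mask):
    """Illustrative quality score (an assumed form). Higher is better.

    Rewards a large subgroup, a large shift of the subgroup median,
    and a narrow target-property distribution inside the subgroup.
    """
    sub = target[mask]
    if sub.size == 0:
        return 0.0
    # Spread of the whole data set: mean absolute deviation from the median.
    spread_all = np.mean(np.abs(target - np.median(target)))
    if spread_all == 0:
        return 0.0
    coverage = sub.size / target.size
    shift = abs(np.median(sub) - np.median(target)) / spread_all
    spread_sub = np.mean(np.abs(sub - np.median(sub)))
    narrowness = max(0.0, 1.0 - spread_sub / spread_all)
    return float(coverage * shift * narrowness)


# Toy target values (made up) and two candidate subgroups.
target = np.array([1.17, 1.18, 1.18, 1.19, 1.20, 1.21, 1.35, 1.36, 1.37, 1.38])
tight_and_shifted = target > 1.30          # four values clustered together
arbitrary = np.arange(target.size) % 2 == 0  # every second value, no real pattern

print("tight, shifted subgroup:", subgroup_quality(target, tight_and_shifted))
print("arbitrary subgroup:     ", subgroup_quality(target, arbitrary))
```

A search over candidate selectors would keep those with the highest score; the multiplicative form simply ensures that a subgroup failing any one criterion scores poorly.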
Statistically Exceptional Subgroups of Oxides
– Considering 51 Potential Descriptors –
Selectors shown on the slide (first group):
• VBM < −5.14 eV (wrt vacuum)
• Min. of Hirshfeld charges of the A and B atoms: q_min < 0.48 e−
• Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.26 Å
(q_min < 0.48 e) AND (W ≥ 5.14) AND (d2 > 2.16 Å).
Selectors shown on the slide (second group):
• Max. of O 2p DOS: M > −6.0 eV
• Distance between the O surface atom and its nearest neighbor cation: d1 > 1.8 Å
• Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.12 Å
[Figure: C-O bond length (Å), axis from 1.2 to 1.5; legend: ‘Small OCO angle’ subgroup, ‘Large C-O bond length’ subgroup, other materials, gas-phase CO2, δ− molecule (2 > δ > 0.9); reference point for gas-phase CO2: δ = 0, 1.17 Å, 180°]
The descriptors should characterize the clean surface.
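A selector such as (q_min < 0.48 e) AND (W ≥ 5.14) AND (d2 > 2.16 Å) is a conjunction of simple threshold propositions on clean-surface descriptors. The sketch below represents it as a list of (descriptor, test) pairs and applies it to a few hypothetical surface sites; the site names and descriptor values are made up, and the unit of W is assumed to be eV.

```python
# Hypothetical surface sites with made-up descriptor values (not data from the talk).
sites = [
    {"site": "site-1", "q_min": 0.41, "W": 5.60, "d2": 2.30},
    {"site": "site-2", "q_min": 0.52, "W": 5.90, "d2": 2.40},  # fails q_min
    {"site": "site-3", "q_min": 0.45, "W": 4.80, "d2": 2.20},  # fails W
    {"site": "site-4", "q_min": 0.30, "W": 5.14, "d2": 2.17},
]

# Selector quoted on the slide: (q_min < 0.48 e) AND (W >= 5.14) AND (d2 > 2.16 Å).
selector = [
    ("q_min", lambda v: v < 0.48),
    ("W", lambda v: v >= 5.14),
    ("d2", lambda v: v > 2.16),
]


def in_subgroup(row, selector):
    """True if the row satisfies every proposition of the conjunction."""
    return all(test(row[key]) for key, test in selector)


members = [row["site"] for row in sites if in_subgroup(row, selector)]
print(members)
```

Because the selector only uses properties of the clean surface, it can be evaluated for a new material before any adsorption calculation is run.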
Two Possibly Relevant Subgroups for
Semiconducting Oxide Materials
[Figure: number of systems per energy; legend: ‘Small OCO angle’ subgroup, ‘Large C-O bond-length’ subgroup, all materials and sites]
Most known materials with good catalytic performance belong to the ‘large C-O bond length’ subgroup.
Of the “bad-performance materials”, none belongs to the green subgroup.
Domain of (reliable) Applicability (DoA)
of Machine-Learning Models
• How reliable are machine-learning models when fitted to all data?
• Are all data fitted equally well by the one selected representation?
Individual absolute error: $e_i = |f(x_i) - y(x_i)|$
Find the subgroup with small individual errors.
Example: Data from the NOMAD-Kaggle-2018 competition(*) on transparent, conducting oxides: (Al_xGa_yIn_z)_2O_3 (for 6 space
groups and up to 80 atoms/unit cell). Consider conjunctions on lattice-vector
lengths and angles, volume per atom, # atoms/unit cell, composition (%),
average nn distances (Al-Al, Al-Ga, Al-In, ...), etc.
[Figure (simplified sketch): known data y(x_i) and fit f(x) versus representation x; a linear fit to all data compared with a linear fit in the Domain of Applicability (DoA)]
(*) C. Sutton, L.M. Ghiringhelli, et al., npj Comput. Materials, in print
Domain of (reliable) Applicability (DoA)
of Machine-Learning Models
Example: (Al_xGa_yIn_z)_2O_3
with Gaussian-kernel KRR and different representations(*)
Mean absolute error of the cohesive energy: $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |f(x_i) - y(x_i)|$

ML model   all data (meV/cation)   DoA (meV/cation)   selectors defining the DoA
n-gram     15.2                    11.41              b ≥ 5.59 Å ∧ γ < 90.35° ∧ R(Al−O) ≤ 2.06 Å ∧ R(Ga−O) ≤ 2.07 Å
SOAP       14.5                    11.25              a/c ≤ 3.89 ∧ γ < 90.35° ∧ β ≥ 88.68°
MBTR       13.9                    8.03               N ≥ 50 ∧ γ < 90.35° ∧ R(Al−O) ≤ 2.06 Å

(*) C. Sutton, M. Boley, L.M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, to be published
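The comparison between the error on all data and the error inside a DoA can be sketched in a few lines. The example assumes an already fitted model whose predictions `f` are given, uses made-up numbers, and takes a single proposition N ≥ 50 as the candidate selector; none of these values come from the study cited above.

```python
import numpy as np

# Toy data (assumed): descriptor N (atoms per unit cell), reference values y,
# and predictions f from some already fitted model.
N = np.array([20, 30, 40, 50, 60, 80, 80, 40])
y = np.array([0.10, 0.12, 0.20, 0.15, 0.18, 0.22, 0.25, 0.30])
f = np.array([0.13, 0.09, 0.24, 0.16, 0.17, 0.23, 0.24, 0.25])

errors = np.abs(f - y)            # e_i = |f(x_i) - y(x_i)|

# Candidate DoA selector: a single threshold proposition on N.
in_doa = N >= 50

mae_all = errors.mean()
mae_doa = errors[in_doa].mean()
coverage = in_doa.mean()          # fraction of the data inside the DoA

print(f"MAE, all data: {mae_all:.4f}")
print(f"MAE, in DoA:   {mae_doa:.4f} (coverage {coverage:.2f})")
```

A useful DoA lowers the error substantially while still covering a large share of the data, so both numbers should be reported together.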
The Materials-Science Challenge Is Different
to That of Standard Machine Learning
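$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(f(x_i) - y(x_i)\right)^2}$, where $f(x_i)$ = predicted value and $y(x_i)$ = true value.
Regularized RMSE optimization emphasizes the description of the majority.
It provides a “high chance of being right in the description of the hay”.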
We are looking for statistically exceptional data groups. These may be needles, or nuts,
or bolts, or coins, or … Often, we don’t know exactly what we are
searching for, except that the data should be statistically exceptional.
Identify these subgroups, and don’t “regularize away” the outliers!

More Related Content

What's hot

2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
aimsnist
 
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
aimsnist
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
BrianDeCost
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
aimsnist
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
aimsnist
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
Anubhav Jain
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
Ian Foster
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
Anubhav Jain
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
aimsnist
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
Anubhav Jain
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
KAMAL CHOUDHARY
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
aimsnist
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
Anubhav Jain
 
Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...
Anubhav Jain
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Anubhav Jain
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Anubhav Jain
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Anubhav Jain
 
Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...
Anubhav Jain
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 

What's hot (20)

2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model2D/3D Materials screening and genetic algorithm with ML model
2D/3D Materials screening and genetic algorithm with ML model
 
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
Polymer Genome: An Informatics Platform for Polymer Dielectrics Discovery and...
 
TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...TMS workshop on machine learning in materials science: Intro to deep learning...
TMS workshop on machine learning in materials science: Intro to deep learning...
 
Autonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisitionAutonomous experimental phase diagram acquisition
Autonomous experimental phase diagram acquisition
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
Machine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 
Physics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learningPhysics inspired artificial intelligence/machine learning
Physics inspired artificial intelligence/machine learning
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
 
Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...Methods, tools, and examples (Part II): High-throughput computation and machi...
Methods, tools, and examples (Part II): High-throughput computation and machi...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...Density functional theory calculations and data mining for new thermoelectric...
Density functional theory calculations and data mining for new thermoelectric...
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 

Similar to When The New Science Is In The Outliers

The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...
Ichigaku Takigawa
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
Anubhav Jain
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
Ian Foster
 
C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)
PFHub PFHub
 
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
remAYDOAN3
 
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
University of Illinois at Urbana-Champaign
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
Anubhav Jain
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
Anubhav Jain
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster
 
(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習
Ichigaku Takigawa
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
University of Illinois at Urbana-Champaign
 
AI Science
AI Science AI Science
AI Science
Melanie Swan
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
Anubhav Jain
 
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyMachine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Nathan Frey, PhD
 
Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"ieee_cis_cyprus
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Anubhav Jain
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
Paul Groth
 
E04423133
E04423133E04423133
E04423133
IOSR-JEN
 
Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...
Anubhav Jain
 

Similar to When The New Science Is In The Outliers (20)

The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...The interplay between data-driven and theory-driven methods for chemical scie...
The interplay between data-driven and theory-driven methods for chemical scie...
 
Materials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learningMaterials discovery through theory, computation, and machine learning
Materials discovery through theory, computation, and machine learning
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)C hi mad_phasefieldworkshop(1)
C hi mad_phasefieldworkshop(1)
 
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
Advanced Intelligent Systems - 2020 - Sha - Artificial Intelligence to Power ...
 
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
The Einstein Toolkit: A Community Computational Infrastructure for Relativist...
 
The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...The Materials Project: A Community Data Resource for Accelerating New Materia...
The Materials Project: A Community Data Resource for Accelerating New Materia...
 
Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...Computational Materials Design and Data Dissemination through the Materials P...
Computational Materials Design and Data Dissemination through the Materials P...
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習(2018.9) 分子のグラフ表現と機械学習
(2018.9) 分子のグラフ表現と機械学習
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
 
AI Science
AI Science AI Science
AI Science
 
Data dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNLData dissemination and materials informatics at LBNL
Data dissemination and materials informatics at LBNL
 
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyMachine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
 
Apt thomas kelly
Apt thomas kellyApt thomas kelly
Apt thomas kelly
 
Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"Xin Yao: "What can evolutionary computation do for you?"
Xin Yao: "What can evolutionary computation do for you?"
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
E04423133
E04423133E04423133
E04423133
 
Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...Overview of accelerated materials design efforts in the Hacking Materials res...
Overview of accelerated materials design efforts in the Hacking Materials res...
 

When The New Science Is In The Outliers

  • 1. When The New Science Is In The Outliers
    8/7/2019, Max Planck Society
    Matthias Scheffler
    Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany, and Physics Department and IRIS Adlershof, Humboldt-Universität zu Berlin, 12489 Berlin, Germany

    Several issues hamper progress in data-driven materials science. In particular, a FAIR [1] data infrastructure and an appropriate data-analytics methodology [2] are missing.

    Significant efforts are still necessary to fully realize the A and I of FAIR. Here the development of metadata, their intricate relationships, and data ontology need critical attention. To be accepted by the community, a FAIR data infrastructure should work without bureaucratic hurdles or the need for special training. In this talk, I will discuss the challenges and progress, focusing on computational materials science.

    Concerning the data analytics, we note that the number of possible materials is practically infinite, but only 10 or 100 of them may be relevant for a certain science or engineering purpose. In simple words, in materials science and engineering, we are often looking for “needles in a haystack”. Fitting or machine-learning all data (i.e. the hay) with a single, global model may average away the specialties of the interesting minority (i.e. the needles). I will discuss methods that identify statistically exceptional subgroups in a large amount of data, and I will discuss how one can estimate the domains of applicability of machine-learning models. [3]

    1. FAIR stands for Findable, Accessible, Interoperable and Re-usable. The FAIR Data Principles; https://www.force11.org/group/fairgroup/fairprinciples
    2. C. Draxl and M. Scheffler, Big-Data-Driven Materials Science and its FAIR Data Infrastructure. Plenary Chapter in Handbook of Materials Modeling (eds. S. Yip and W. Andreoni), Springer (2019). https://arxiv.org/ftp/arxiv/papers/1904/1904.05859.pdf
    3. Ch. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, Domains of Applicability of Machine-Learning Models for Novel Materials Discovery, to be published.
  • 2. High-Throughput Screening in Computational (and Experimental) Materials Science
      • Consider as many compounds as possible, typically O(10^3) – O(10^5).
      • O(10^1) – O(10^2) compounds selected.
      • Recycle the “waste”! Enable re-purposing.
    Animation by G.-M. Rignanese

    Sharing Advances Science: Needs for a FAIR, Efficient Research-Data Infrastructure
    Findable, Accessible, Interoperable, Reusable
    M. D. Wilkinson et al., Scientific Data 3, 160018 (2016)
    Since 2015
  • 3. Since 2014
    Findable, Accessible, Interoperable, Reusable
    M. D. Wilkinson et al., Scientific Data 3, 160018 (2016)

    The NOMAD Center of Excellence (since 2015)
      • Encyclopedia
      • Archive (normalized data)
      • Visualization
      • Repository (raw data)
      • Big-Data Analytics
    Requests the full input and output files.

    The NOMAD Repository
      • More than 50 million total-energy calculations
      • 90% of the VASP files are from AFLOW (S. Curtarolo), OQMD (C. Wolverton), and Materials Project (G. Ceder, K. Persson)
  • 4. What Is Needed for a FAIR Data Infrastructure?
      • Scientific results are only meaningful and worth keeping if they are fully characterized and all individual steps are fully documented.
      • Computed data are only meaningful when method, approximations, code, code version, and all computational parameters are known.
      • For experimental data, we need a full characterization of the sample, the description of the apparatus, the measurement conditions, and the measured quantity.
    This requires metadata, ontologies, and workflows. We also need good search engines, an “encyclopedia” GUI, and appropriate hardware.

    Learning from “Big” Data: Very Many Methods and Concepts, Very Interdisciplinary
      • Artificial Intelligence (AI): Any technique that enables computers to mimic human intelligence, using logical if-then rules, compressed sensing, machine learning (including deep learning).
      • Machine Learning: A subset of AI that includes statistical techniques that enable machines to improve at tasks with more data. It includes deep learning.
      • Deep Learning: The subset of machine learning composed of algorithms that permit software to train itself to perform tasks, like speech and image recognition, by exposing multilayered neural networks to vast amounts of data.
  • 5. Building Maps of Materials (Role Models: Periodic Table, Ashby Plots)
    Crystal-structure prediction:
      • Octet binaries (ZB vs. RS)
      • Al_xGa_yIn_zO_3 (x+y+z=2)
      • Perovskites (Goldschmidt tolerance factor)
    Property classification:
      • Topological insulators
      • Metal vs. insulator
    Activation of CO2 at metal oxides and carbides
    Work in progress

    Global Learning (Machine Learning)
    One single model to describe the whole population (known and unknown data):
      • minimize the overall prediction error (e.g. RMSE) using regularization
      • therefore, disregard (on purpose) all local details
  • 6. Global vs. Local Learning
    Subgroups are statistically exceptional.
    [Figure: target property P versus descriptor d, both axes from 0.0 to 1.0, with data points labeled x=a and x=b. Highlighted subgroup selector: $\sigma(j) \equiv (d_j \geq 0.8) \wedge (x_j = a)$]
    Given:
      • Sample S (from the population)
      • Target property P_j
      • Features (descriptors) d_j
    A global model fitted to the entire dataset may be difficult to interpret and may well hide or incorrectly describe the actuating physical mechanisms.

    Turning Greenhouse Gases into Useful Chemicals and Fuels
    [Figure labels: CO, CO2, C; formic acid, formaldehyde, methanol, methane]
    We need an efficient catalyst!
    Aliaksei Mazheika, Sergey Levchenko, Francesc Illas
    H.-J. Freund et al., Angew. Chem. Int. Ed. 50, 10064 (2011).
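The selector shown on the slide is a conjunction of simple conditions on the features. The following is a minimal sketch of how such a selector picks out a subgroup whose target values differ from those of the whole sample. The data values and the names `samples` and `selector` are hypothetical and only serve as illustration.

```python
from statistics import mean

# Hypothetical toy data: (categorical feature x, descriptor d, target property P).
samples = [
    ("a", 0.90, 0.95), ("a", 0.85, 0.90), ("a", 0.30, 0.40),
    ("b", 0.90, 0.20), ("b", 0.50, 0.45), ("b", 0.10, 0.50),
]

def selector(x, d):
    """Conjunction from the slide: (d_j >= 0.8) AND (x_j = a)."""
    return d >= 0.8 and x == "a"

# Subgroup = all samples for which the selector is true.
subgroup = [s for s in samples if selector(s[0], s[1])]

# Compare the target property in the subgroup with the whole sample.
global_mean = mean(p for _, _, p in samples)
subgroup_mean = mean(p for _, _, p in subgroup)
print(f"all data: {global_mean:.3f}, subgroup: {subgroup_mean:.3f}")
```

In this toy sample, a single global statement about P would hide that the two points selected by the conjunction behave very differently from the rest.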
  • 7. Identifying New Potential Catalysts: Considering Oxides
    Oxides: A^2+B^4+O3, AO, BO2, A^3+B^3+O3, A2O3 (B2O3), A^1+B^5+O3, A2O, BO
      • A^2+: Mg, Ca, Sr, Ba
      • A^3+ (B^3+): Al, Ga, In, Sc, Y, La
      • B^4+: Ti, Zr, Si, Ge, Sn
      • A^+: Li, Na, K, Rb, Cs
      • B^5+: Nb, V, Sb
    Consider surfaces of many different materials and all possibly relevant surface sites: Which materials (and surface sites) are catalytically active?
    Machine learning of all produced data does not provide a good description.

    Two Possibly Interesting Subgroups for Identifying High-Performance Materials
    Subgroup identification:
      • Define a ‘target property’.
      • Minimize the width of the target-property distribution.
      • Maximize the distance between the median of the target-property distribution and that of the whole data set.
      • Maximize the size of the subgroup.
    For how many xxx compounds do we know high catalytic activity? What is meant by high catalytic activity?
    1) ‘Small O-C-O angle’ subgroup
    2) ‘Large C-O bond length’ subgroup
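The three criteria for subgroup identification (narrow distribution, shifted median, large size) can be combined into one quality score that a search then maximizes. The sketch below is only an illustration under stated assumptions: the product form of `quality`, the restriction to selectors of the type d >= t, and all data values are hypothetical and are not taken from the talk.

```python
from statistics import median, pstdev

# Hypothetical toy data: one descriptor value d and one target value
# (think "C-O bond length in Å") per surface site.
descriptor = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
target = [1.20, 1.22, 1.19, 1.21, 1.20, 1.23, 1.21, 1.40, 1.42, 1.41]

def quality(sub, pop):
    """Assumed score: large, narrow, strongly shifted subgroups score high."""
    if len(sub) < 2:
        return 0.0
    coverage = len(sub) / len(pop)                           # subgroup size
    shift = abs(median(sub) - median(pop)) / pstdev(pop)     # distance of medians
    narrowness = max(0.0, 1.0 - pstdev(sub) / pstdev(pop))   # small width rewarded
    return coverage * shift * narrowness

def best_threshold(descriptor, target):
    """Scan selectors of the form d >= t; return (t, quality) of the best one."""
    best = (None, 0.0)
    for t in sorted(set(descriptor)):
        sub = [p for d, p in zip(descriptor, target) if d >= t]
        q = quality(sub, target)
        if q > best[1]:
            best = (t, q)
    return best

t_best, q_best = best_threshold(descriptor, target)
subgroup = [p for d, p in zip(descriptor, target) if d >= t_best]
print(f"best selector: d >= {t_best}, quality {q_best:.3f}, size {len(subgroup)}")
```

A practical search would scan conjunctions over many descriptors (the talk mentions 51 candidates) rather than a single threshold. The scoring idea stays the same.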
  • 8. Statistically Exceptional Subgroups of Oxides (Considering 51 Potential Descriptors)
    The descriptors should characterize the clean surface.
    [Figure: C-O bond length (Å), axis from 1.2 to 1.5. Legend: ‘small OCO angle’ subgroup, ‘large C-O bond length’ subgroup, other materials. Reference values: gas-phase CO2^δ− molecule (2 > δ > 0.9); δ = 0: 1.17 Å, 180°.]
    Selector conditions shown on the slide (first group):
      • VBM < −5.14 eV (wrt vacuum)
      • Minimum of the Hirshfeld charges of the A and B atoms: qmin < 0.48 e−
      • Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.26 Å
    Selector as printed on the slide: (qmin < 0.48 e) AND (W ≥ 5.14) AND (d2 > 2.16 Å).
    Selector conditions shown on the slide (second group):
      • Maximum of the O 2p DOS: M > −6.0 eV
      • Distance between the O surface atom and its nearest neighbor cation: d1 > 1.8 Å
      • Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.12 Å

    Two Possibly Relevant Subgroups for Semiconducting Oxide Materials
    [Figure: number of systems per energy for the ‘small OCO angle’ subgroup, the ‘large C-O bond-length’ subgroup, and all materials and sites.]
    Most known materials with good catalytic performance belong to the ‘large C-O bond length’ subgroup. None of the “bad-performance materials” belongs to the green subgroup.
  • 9. Domain of (reliable) Applicability (DoA) of Machine-Learning Models
      • How reliable are machine-learning models when fitted to all data?
      • Are all data fitted equally well by the one selected representation?
    Individual absolute error: $e_i = |f(x_i) - y(x_i)|$
    Find the subgroup with small individual errors.
    Example: Data from the NOMAD-Kaggle-2018 competition(*) on transparent, conducting oxides: (Al_xGa_yIn_z)_2O_3 (for 6 space groups and up to 80 atoms/unit cell). Consider conjunctions on lattice-vector lengths and angles, volume per atom, # atoms/unit cell, composition (%), average nn distances (Al-Al, Al-Ga, Al-In, ...), etc.
    [Figure, simplified sketch: known data y(x_i) and fit f(x) versus representation x, showing the linear fit to all data, the Domain of Applicability, and the linear fit in the DoA.]
    (*) C. Sutton, L.M. Ghiringhelli, et al., npj Comput. Materials, in print

    Domain of (reliable) Applicability (DoA) of Machine-Learning Models
    Example: (Al_xGa_yIn_z)_2O_3 with Gaussian-kernel KRR and different representations(*)
    Mean absolute error of the cohesive energy: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |f(x_i) - y(x_i)|$

    ML model | MAE, all data (meV/cation) | MAE, DoA (meV/cation) | Selectors defining the DoA
    n-gram   | 15.2                       | 11.41                 | b ≥ 5.59 Å, γ < 90.35°, R(Al−O) ≤ 2.06 Å, R(Ga−O) ≤ 2.07 Å
    SOAP     | 14.5                       | 11.25                 | a/c ≤ 3.89, γ < 90.35°, β ≥ 88.68°
    MBTR     | 13.9                       | 8.03                  | N ≥ 50, γ < 90.35°, R(Al−O) ≤ 2.06 Å

    (*) C. Sutton, M. Boley, L.M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, to be published
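The idea of finding "the subgroup with small individual errors" can be sketched as follows. This is an illustration only, not the authors' algorithm: the data are invented, the search covers just one feature with selectors of the type N >= t, and the score `impact` (coverage times relative error reduction) is an assumed choice.

```python
from statistics import mean

# Hypothetical toy data: true and predicted energies (meV/cation) and the
# number of atoms per unit cell for eight structures.
y_true = [100.0, 120.0, 90.0, 110.0, 105.0, 95.0, 115.0, 100.0]
y_pred = [130.0, 95.0, 110.0, 88.0, 113.0, 86.0, 122.0, 108.0]
n_atoms = [10, 20, 30, 40, 50, 60, 70, 80]

# Individual absolute errors e_i = |f(x_i) - y(x_i)| and the global MAE.
errors = [abs(f - y) for f, y in zip(y_pred, y_true)]
mae_all = mean(errors)

def find_doa(feature, errors, mae_all):
    """Scan selectors 'feature >= t'; return (t, MAE inside, impact) of the best."""
    best = None
    for t in sorted(set(feature)):
        inside = [e for v, e in zip(feature, errors) if v >= t]
        coverage = len(inside) / len(errors)
        mae_in = mean(inside)
        # Assumed score: reward a large domain with a strongly reduced error.
        impact = coverage * (mae_all - mae_in) / mae_all
        if best is None or impact > best[2]:
            best = (t, mae_in, impact)
    return best

t_doa, mae_doa, impact = find_doa(n_atoms, errors, mae_all)
print(f"MAE all data: {mae_all}, DoA: N >= {t_doa}, MAE in DoA: {mae_doa}")
```

The coverage factor matters: without it, the search would collapse onto a handful of well-fitted points instead of a usable domain.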
  • 10. The Materials-Science Challenge Is Different from That of Standard Machine Learning
    $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y(x_i) \right)^2}$, where $f(x_i)$ is the predicted value and $y(x_i)$ is the true value.
    Regularized RMSE optimization emphasizes the description of the majority. It provides a “high chance of being right in the description of the hay”.
    We are looking for statistically exceptional data groups. These may be needles, or nuts, or bolts, or coins, or …
    Often, we don’t know exactly what we are searching for, except that the data should be statistically exceptional.
    Identify these subgroups, and don’t “regularize away” the outliers!
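A small numerical sketch of why RMSE optimization favors the majority: the toy data below (95 "hay" points and 5 "needle" points) are hypothetical, and the constant model stands in, as an assumption, for the limit of a very strongly regularized global fit.

```python
from math import sqrt

# Hypothetical toy data: 95 ordinary values ("hay") and 5 exceptional ones ("needles").
hay = [1.0] * 95
needles = [3.0] * 5
data = hay + needles

# The constant prediction that minimizes the RMSE is the mean of the data.
prediction = sum(data) / len(data)
rmse = sqrt(sum((prediction - y) ** 2 for y in data) / len(data))

# Per-group errors: the majority is described well, the minority poorly.
hay_error = abs(prediction - hay[0])
needle_error = abs(prediction - needles[0])
print(f"prediction {prediction:.2f}, RMSE {rmse:.3f}")
print(f"error on hay {hay_error:.2f}, error on needles {needle_error:.2f}")
```

The overall RMSE looks moderate, yet the error on the needles is many times the error on the hay. A single global error measure cannot show this, which is why the subgroups have to be searched for explicitly.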