When The New Science Is In The Outliers

8/7/2019
Max Planck Society
Matthias Scheffler
Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany, and Physics Department and IRIS
Adlershof, Humboldt-Universität zu Berlin, 12489 Berlin, Germany
Several issues hamper progress in data-driven materials science, in particular a missing FAIR [1]
data infrastructure and a lack of appropriate data-analytics methodology [2].
Significant efforts are still necessary to fully realize the A and I of FAIR. Here the development of metadata,
their intricate relationships, and data ontologies needs critical attention. To be accepted by the community,
a FAIR data infrastructure should work without bureaucratic hurdles or the need for special
training. In this talk, I will discuss the challenges and progress, focusing on computational materials science.
Concerning data analytics, we note that the number of possible materials is practically infinite, but only
10 or 100 of them may be relevant for a certain science or engineering purpose. In simple words, in
materials science and engineering, we are often looking for “needles in a haystack”. Fitting or
machine-learning all data (i.e. the hay) with a single, global model may average away the special features of the
interesting minority (i.e. the needles). I will discuss methods that identify statistically exceptional
subgroups in a large amount of data, and I will discuss how one can estimate the domains of applicability of
machine-learning models [3].
1. FAIR stands for Findable, Accessible, Interoperable and Re-usable. The FAIR Data Principles;
https://www.force11.org/group/fairgroup/fairprinciples
2. C. Draxl and M. Scheffler, Big-Data-Driven Materials Science and its FAIR Data Infrastructure. Plenary Chapter in Handbook of Materials
Modeling (eds. S. Yip and W. Andreoni), Springer (2019). https://arxiv.org/ftp/arxiv/papers/1904/1904.05859.pdf
3. Ch. Sutton, M. Boley, L. M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, Domains of Applicability of Machine-Learning Models for Novel
Materials Discovery, to be published.

High-Throughput Screening in Computational (and Experimental) Materials Science
Consider as many compounds as possible, typically O(10^3) – O(10^5)
O(10^1) – O(10^2) compounds selected
Recycle the “waste”! Enable re-purposing.
Sharing Advances Science
Needs for a FAIR, Efficient Research-Data Infrastructure
Animation by G.-M. Rignanese
Findable Accessible Interoperable Reusable
M. D. Wilkinson et al., Scientific Data 3, 160018 (2016)
Since
2015

Since
2014
Findable Accessible Interoperable Reusable
M. D. Wilkinson et al., Scientific Data 3, 160018 (2016)
The NOMAD Center of Excellence (since 2015)
• Repository (raw data)
• Archive (normalized data)
• Encyclopedia
• Visualization
• Big-Data Analytics
The NOMAD Repository
• Requests the full input and output files
• >50 million total-energy calculations
• 90% of the VASP files are from AFLOW (S. Curtarolo), OQMD (C. Wolverton), and Materials Project (G. Ceder, K. Persson)

What Is Needed for a FAIR Data Infrastructure?
• Scientific results are only meaningful and worth keeping if they are fully characterized and all individual steps are fully documented.
• Computed data are only meaningful when method, approximations, code, code version, and all computational parameters are known.
• For experimental data, we need a full characterization of the sample, the description of the apparatus, the measurement conditions, and the measured quantity.
This requires metadata, ontologies, and workflows.
We also need good search engines, an “encyclopedia” GUI, and appropriate hardware.
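As a minimal sketch of what "fully characterized" could mean for a computed result, the record below refuses to accept a calculation unless method, approximations, code, code version, and computational parameters are all given. The class name `CalculationMetadata`, its field names, and the example values are hypothetical illustrations; this is not the NOMAD metadata schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CalculationMetadata:
    """Hypothetical minimal metadata for one computed result (not a real schema)."""
    method: str                  # e.g. "DFT"
    approximations: tuple        # e.g. ("PBE",)
    code: str                    # name of the program that produced the data
    code_version: str            # exact version string of that program
    parameters: dict = field(default_factory=dict)  # all computational parameters

    def __post_init__(self):
        # Reject records in which any required item is empty.
        missing = [name for name, value in vars(self).items() if not value]
        if missing:
            raise ValueError(f"incomplete metadata, missing: {missing}")


# Illustrative values only; they do not describe a real calculation.
record = CalculationMetadata(
    method="DFT",
    approximations=("PBE",),
    code="example-code",
    code_version="1.0.0",
    parameters={"k_grid": (4, 4, 4)},
)
print(record)
```

Validating at construction time is one possible design choice: an incomplete record can then never enter an archive, rather than being detected later when someone tries to re-use the data.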
Learning from “Big” Data: Very Many Methods and Concepts, Very Interdisciplinary
• Artificial Intelligence (AI): Any technique that enables computers to mimic human intelligence, using logical if-then rules, compressed sensing, machine learning (including deep learning)
• Machine Learning: A subset of AI that includes statistical techniques that enable machines to improve at tasks with more data. It includes deep learning
• Deep Learning: The subset of machine learning composed of algorithms that permit software to train itself to perform tasks, like speech and image recognition, by exposing multilayered neural networks to vast amounts of data

Building Maps of Materials
(Role Models: Periodic Table, Ashby Plots)
Crystal-structure prediction:
• Octet binaries (ZB vs. RS)
• AlxGayInzO3 (x+y+z=2)
• Perovskites (Goldschmidt tolerance factor)
Property classification:
• Topological insulators
• Metal vs. insulator
Activation of CO2 at metal oxides and carbides
work in progress
Global Learning
-- Machine Learning --
One single model to describe the whole population
(known and unknown data)
• minimize the overall prediction error (e.g. RMSE)
using regularization
• therefore, disregard (on purpose) all local details
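The effect of one global, regularized fit can be illustrated with a small sketch. The synthetic data (95 "hay" points on y = 2x and 5 "needle" points shifted by 3), the regularization strength, and all variable names are assumptions chosen for illustration; ridge regression stands in for any regularized global model.

```python
# Synthetic data (an assumption for illustration): 95 "hay" points follow
# y = 2x, while every 20th point is a "needle" shifted upwards by 3.
n = 100
x = [i / (n - 1) for i in range(n)]
is_needle = [i % 20 == 0 for i in range(n)]
y = [2.0 * xi + (3.0 if needle else 0.0) for xi, needle in zip(x, is_needle)]

# One global linear model f(x) = w0 + w1*x, fitted by ridge regression.
# The normal equations (X^T X + lam*I) w = X^T y are solved by Cramer's rule.
lam = 1e-3
a11 = n + lam
a12 = sum(x)
a22 = sum(xi * xi for xi in x) + lam
b1 = sum(y)
b2 = sum(xi * yi for xi, yi in zip(x, y))
det = a11 * a22 - a12 * a12
w0 = (b1 * a22 - a12 * b2) / det
w1 = (a11 * b2 - a12 * b1) / det

# Individual absolute errors, then split by group.
abs_err = [abs(w0 + w1 * xi - yi) for xi, yi in zip(x, y)]
rmse = (sum(e * e for e in abs_err) / n) ** 0.5
hay = [e for e, needle in zip(abs_err, is_needle) if not needle]
needles = [e for e, needle in zip(abs_err, is_needle) if needle]
mae_hay = sum(hay) / len(hay)
mae_needles = sum(needles) / len(needles)
print(f"RMSE={rmse:.2f}  MAE(hay)={mae_hay:.2f}  MAE(needles)={mae_needles:.2f}")
```

In this toy setting the fit stays close to the majority, so the error on the few shifted points is many times larger than the error on the rest, even though the overall RMSE looks moderate.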

Global vs. Local Learning
Subgroups are statistically exceptional.
[Figure: scatter plot with axes P and d, both ranging from 0.0 to 1.0; data points are labeled x = a and x = b.]
Subgroup selector: $\sigma(j) \equiv (d_j \geq 0.8) \wedge (x_j = a)$
A global model fitted to the entire
dataset may be difficult to interpret and
may well hide or incorrectly describe the
actuating physical mechanisms.
Given:
• Sample S of the population
• Target property P_j
• Features (descriptors) d_j
Turning Greenhouse Gases into Useful Chemicals and Fuels
CO2, CO, C
Formic acid, formaldehyde, methanol, methane
We need an efficient catalyst!
Aliaksei Mazheika, Sergey Levchenko, Francesc Illas
H.-J. Freund et al., Angew. Chem. Int. Ed. 50, 10064 (2011).

Identifying New Potential Catalysts
Considering Oxides
Oxides:
• A²⁺B⁴⁺O₃, AO, BO₂
• A³⁺B³⁺O₃, A₂O₃ (B₂O₃)
• A¹⁺B⁵⁺O₃, A₂O, BO
• A²⁺: Mg, Ca, Sr, Ba
• A³⁺ (B³⁺): Al, Ga, In, Sc, Y, La
• B⁴⁺: Ti, Zr, Si, Ge, Sn
• A⁺: Li, Na, K, Rb, Cs
• B⁵⁺: Nb, V, Sb
Machine learning of all produced data
does not provide a good description.
Consider surfaces of many different
materials and all possibly relevant surface
sites: Which materials (and surface sites)
are catalytically active?
Two Possibly Interesting Subgroups
for Identifying High-Performance Materials
Subgroup identification:
• Define a ‘target property’.
• Minimize the width of the target-property distribution.
• Maximize the distance between the median of the target-property distribution and that of the whole data set.
• Maximize the size of the subgroup.
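The three criteria above can be combined into a single score in many ways. The sketch below uses one assumed choice (coverage times normalized median shift times relative narrowing) and scans only selectors of the form `d >= t` on a single descriptor. The function names, the score, and the toy data are illustrative assumptions, not the quality function used in this work.

```python
from statistics import median


def spread(values):
    """Width of a distribution: mean absolute deviation from the median."""
    m = median(values)
    return sum(abs(v - m) for v in values) / len(values)


def quality(sub, everything):
    """Assumed score: large, narrow subgroups with a shifted median score high.

    Assumes the whole data set has a non-zero width.
    """
    coverage = len(sub) / len(everything)                  # size of the subgroup
    width_all = spread(everything)
    shift = abs(median(sub) - median(everything)) / width_all
    narrowing = max(0.0, 1.0 - spread(sub) / width_all)    # 1 means very narrow
    return coverage * shift * narrowing


def best_threshold_selector(descriptor, target):
    """Scan selectors 'd >= t' and return (quality, t) of the best one."""
    best = (0.0, None)
    for t in sorted(set(descriptor)):
        sub = [p for d, p in zip(descriptor, target) if d >= t]
        if len(sub) >= 2:
            best = max(best, (quality(sub, target), t), key=lambda qt: qt[0])
    return best


# Toy data (illustrative): the target is high only where the descriptor d >= 0.8.
descriptor = [i / 19 for i in range(20)]
target = [0.9 + 0.01 * (i % 2) if d >= 0.8 else 0.2 + 0.01 * (i % 3)
          for i, d in enumerate(descriptor)]
best_quality, selector_threshold = best_threshold_selector(descriptor, target)
print(f"selector: d >= {selector_threshold:.3f}, quality = {best_quality:.3f}")
```

A real search would consider conjunctions of several selectors over many descriptors; the exhaustive single-threshold scan is only meant to make the trade-off between size, median shift, and width concrete.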
For how many xxx compounds do we know high catalytic
activity? What is meant by high catalytic activity?
1) ‘Small O-C-O angle’ subgroup
2) ‘Large C-O bond length’ subgroup

Statistically Exceptional Subgroups of Oxides
– Considering 51 Potential Descriptors –
The descriptors should characterize the clean surface.
Selectors defining one subgroup:
• VBM < −5.14 eV (with respect to vacuum)
• Minimum of the Hirshfeld charges of the A and B atoms: qmin < 0.48 e−
• Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.26 Å
(qmin < 0.48 e) AND (W ≥ 5.14) AND (d2 > 2.16 Å)
Selectors defining the other subgroup:
• Maximum of the O 2p DOS: M > −6.0 eV
• Distance between the O surface atom and its nearest neighbor cation: d1 > 1.8 Å
• Distance between the O surface atom and its second-nearest neighbor cation: d2 > 2.12 Å
[Figure: C-O bond length (Å), axis ticks from 1.2 to 1.5, for the ‘small OCO angle’ subgroup, the ‘large C-O bond length’ subgroup, and all other materials and sites. Reference points: gas-phase CO2 (δ = 0; 1.17 Å, 180°) and the δ− molecule (2 > δ > 0.9).]
Two Possibly Relevant Subgroups for
Semiconducting Oxide Materials
Most known materials with good catalytic performance belong to the ‘large C-O bond length’ subgroup.
None of the “bad-performance materials” belongs to the green subgroup.

Domain of (reliable) Applicability (DoA)
of Machine-Learning Models
• How reliable are machine-learning models when fitted to all data?
• Are all data fitted equally well by the one selected representation?
Individual absolute error: $e_i = |f(x_i) - y(x_i)|$
Find the subgroup with small individual errors.
Example: Data from the NOMAD-Kaggle-2018 competition(*) on transparent, conducting oxides: (AlxGayInz)2O3
(for 6 space groups and up to 80 atoms/unit cell). Consider conjunctions on lattice-vector
lengths and angles, volume per atom, number of atoms per unit cell, composition (%),
average nearest-neighbor distances (Al-Al, Al-Ga, Al-In, ...), etc.
(*) C. Sutton, L.M. Ghiringhelli, et al., npj Comput. Materials, in print
[Figure: simplified sketch of the known data y(x_i) and the fit f(x) versus the representation x, comparing a linear fit to all data with a linear fit in the Domain of Applicability.]
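The search for a subgroup with small individual errors can be sketched as follows. The objective used here (coverage times relative error reduction), the restriction to selectors of the form `feature <= t`, the `min_coverage` parameter, and the toy data are assumptions made for illustration; they are not taken from the cited work.

```python
def individual_errors(predictions, truths):
    """e_i = |f(x_i) - y(x_i)| for every data point."""
    return [abs(f - y) for f, y in zip(predictions, truths)]


def find_doa(feature, errors, min_coverage=0.3):
    """Scan selectors 'feature <= t'; return (impact, t, mae_inside).

    Impact = coverage * relative error reduction (an assumed objective).
    """
    mae_all = sum(errors) / len(errors)
    best = (0.0, None, mae_all)
    for t in sorted(set(feature)):
        inside = [e for v, e in zip(feature, errors) if v <= t]
        coverage = len(inside) / len(errors)
        if coverage < min_coverage:
            continue  # ignore tiny subgroups
        mae_in = sum(inside) / len(inside)
        impact = coverage * (1.0 - mae_in / mae_all)
        if impact > best[0]:
            best = (impact, t, mae_in)
    return best


# Toy data (illustrative): a linear model f(x) = x describes the data well
# for x <= 6 and increasingly badly beyond that.
feature = [float(i) for i in range(10)]
predictions = list(feature)
truths = [x + 0.1 * (-1) ** i + (3.0 * (x - 6.0) if x > 6.0 else 0.0)
          for i, x in enumerate(feature)]
errors = individual_errors(predictions, truths)
impact, threshold, mae_inside = find_doa(feature, errors)
print(f"DoA: x <= {threshold}, MAE inside = {mae_inside:.2f}, "
      f"MAE all = {sum(errors) / len(errors):.2f}")
```

Weighting the error reduction by coverage keeps the search from returning a handful of perfectly fitted points; a selector is only useful if the model is reliable on a sizable part of the data.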
Domain of (reliable) Applicability (DoA)
of Machine-Learning Models
Example: (AlxGayInz)2O3
with Gaussian-kernel KRR and different representations(*)
Mean absolute error (MAE) of the cohesive energy: $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |f(x_i) - y(x_i)|$

ML model   MAE, all data (meV/cation)   MAE, DoA (meV/cation)   Selectors defining the DoA
n-gram     15.2                         11.41                   b ≥ 5.59 Å, γ < 90.35°, R(Al-O) ≤ 2.06 Å, R(Ga-O) ≤ 2.07 Å
SOAP       14.5                         11.25                   a/c ≤ 3.89, γ < 90.35°, β ≥ 88.68°
MBTR       13.9                         8.03                    N ≥ 50, γ < 90.35°, R(Al-O) ≤ 2.06 Å

(*) C. Sutton, M. Boley, L.M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler, to be published

The Materials-Science Challenge Is Different
from That of Standard Machine Learning
$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( f(x_i) - y(x_i) \right)^2}$, where $f(x_i)$ = predicted value and $y(x_i)$ = true value
Regularized RMSE optimization emphasizes the description of the majority.
It provides a “high chance of being right in the description of the hay”.
We are looking for statistically exceptional data groups. These may be needles, or nuts,
or bolts, or coins, or … Often, we don’t know exactly what we are
searching for, except that the data should be statistically exceptional.
Identify these subgroups, and don’t “regularize away” the outliers!
