Introduction (Part I): High-throughput computation and machine learning applied to materials design
1. Introduction (Part I):
High-throughput computation and machine learning
applied to materials design
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
LLNL Computational Chemistry Materials Science
Summer Institute, 2018
Slides (already) posted to hackingmaterials.lbl.gov
2. New materials discovery for devices is difficult
• Novel materials with enhanced performance characteristics
could make a big dent in sustainability, scalability, and cost
• In practice, we tend to re-use the same fundamental materials
for decades
– solar power w/Si since 1950s
– graphite/LiCoO2 (basis of today’s Li battery electrodes) since 1990
– Bi2Te3 and PbTe thermoelectrics first studied ~1910
• Although there are lots of improvements to manufacturing,
microstructure, etc., there are not many new basic compositions
• Why is discovering better materials such a challenge?
3. A material is defined at multiple length scales –
stick to the fundamental scale for now
5. Atoms in a box – the materials universe is huge!
• Bag of 30 atoms
• Each atom is one of 50
elements
• Arrange on 10x10x10 lattice
• Over 10^108 possibilities!
– more than grains of sand on all
beaches (10^21)
– more than number of atoms in
universe (10^80)
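The "atoms in a box" estimate can be reproduced with a quick count. This is only a sketch under an assumed counting scheme (the slide does not state one): choose which 30 of the 1000 lattice sites are occupied, then assign one of 50 elements to each atom, ignoring symmetry-equivalent arrangements.

```python
from math import comb, log10

n_sites = 10 * 10 * 10   # positions on the 10x10x10 lattice
n_atoms = 30             # atoms in the "bag"
n_elements = 50          # element choices per atom

# Assumed counting scheme: pick the occupied sites, then label each atom.
arrangements = comb(n_sites, n_atoms) * n_elements ** n_atoms

exponent = log10(arrangements)
print(f"about 10^{exponent:.1f} arrangements")
```

Under these assumptions the exponent lands a little above 108, consistent with the "over 10^108" figure.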
7. What constrains traditional experimentation?
“[The Chevrel] discovery resulted from a lot of
unsuccessful experiments of Mg ions insertion
into well-known hosts for Li+ ions insertion, as
well as from the thorough literature analysis
concerning the possibility of divalent ions
intercalation into inorganic materials.”
-Aurbach group, on discovery of Chevrel cathode
for multivalent (e.g., Mg2+) batteries
Levi, Levi, Chasid, Aurbach
J. Electroceramics (2009)
8. Can we invent other, faster ways of finding materials?
• The Materials Genome
Initiative wants to discover,
develop, manufacture, and
deploy advanced materials
twice as fast at a fraction of
the cost
• Strategy involves:
– simulations & supercomputers
– digital data and data mining
– better merging computation
and experiment
https://obamawhitehouse.archives.gov/mgi
9. Outline
① From quantum mechanics to density functional
theory (DFT)
② “High-throughput” DFT
③ Calculation and property databases
④ Data mining approaches to materials design
⑤ Preview of part II (tomorrow)
10. The basis of density functional theory
is quantum mechanics
The Schrödinger equation describes all the properties
of a system through the wavefunction:
$$-\frac{\hbar^2}{2m}\nabla^2\Psi(r) + V(r)\,\Psi(r) = E\,\Psi(r)$$
Time-independent, non-relativistic Schrödinger equation
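For a single particle in one dimension the equation is easy to solve numerically, which makes the later contrast with many-electron systems clearer. The sketch below is illustrative only: it assumes units with hbar = m = 1, an infinite well of length 1 with V = 0 inside, and a simple shooting method with bisection (function names are made up for this example).

```python
from math import pi

def psi_at_wall(energy, n_steps=1000, length=1.0, potential=lambda x: 0.0):
    """Integrate psi'' = 2 (V - E) psi from x = 0 and return psi(length)."""
    h = length / n_steps
    prev, curr = 0.0, h  # psi(0) = 0; the initial slope only sets the scale
    for i in range(1, n_steps):
        x = i * h
        nxt = 2 * curr - prev + h * h * 2 * (potential(x) - energy) * curr
        prev, curr = curr, nxt
    return curr

def ground_state_energy(lo=1.0, hi=10.0, tol=1e-10):
    """Bisect on the energy until psi vanishes at the far wall."""
    f_lo = psi_at_wall(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = psi_at_wall(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid  # same sign: eigenvalue lies above mid
        else:
            hi = mid               # sign change: eigenvalue lies below mid
    return 0.5 * (lo + hi)

energy = ground_state_energy()
exact = pi ** 2 / 2  # analytic ground state for this well in these units
print(f"numerical E = {energy:.5f}, analytic E = {exact:.5f}")
```

The bracket [1, 10] is chosen by hand so that it contains only the lowest eigenvalue.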
11. The wave function is formidable
• There aren’t too many real situations where we can
get a closed-form solution to the Schrödinger equation
• Let’s pretend we want to approach things
numerically for 1000 electrons
– There are ~500,000 electron-electron interactions to worry
about.
– Even storing the wavefunction would take ~10^8991 GB!
• Discretize the x, y, z position of each electron into a
1000-element grid per axis = 1 billion positions per electron
• Need the wavefunction output (real + imaginary part) for each
combination of all electron positions, i.e. 1E9 ^ (1000) * 2, or
2E9000 values
• Even at 1 byte per wavefunction value (low resolution), you
need about 2E8991 GB (2E9000 bytes) to store the wavefunction!
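The storage estimate can be checked by working with exponents rather than the numbers themselves. This sketch uses only the assumptions stated on the slide (1000 electrons, 1000 grid points per axis, two stored components per value, 1 byte each) and takes 1 GB = 1E9 bytes.

```python
from math import comb, log10

n_electrons = 1000
grid_per_axis = 1000

# Distinct electron-electron pairs
pairs = comb(n_electrons, 2)

# Each electron can sit on any of grid_per_axis**3 positions
positions_per_electron = grid_per_axis ** 3

# log10 of the number of stored values: (positions ** electrons) * 2
log10_values = n_electrons * log10(positions_per_electron) + log10(2)

# At 1 byte per value and 1E9 bytes per GB
log10_gigabytes = log10_values - 9

print(f"{pairs} pairs, ~10^{log10_values:.1f} values, ~10^{log10_gigabytes:.1f} GB")
```

Under these assumptions the count is about 2E9000 values, i.e. roughly 2E8991 GB at one byte each.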
12. Maybe Dirac said it best …
“The underlying physical laws necessary
for the mathematical theory of a large part
of physics and the whole of chemistry are
thus completely known, and the difficulty
is only that the exact application of these
laws leads to equations much too
complicated to be soluble.”
“It therefore becomes desirable that
approximate practical methods of applying
quantum mechanics should be developed,
which can lead to an explanation of the
main features of complex atomic systems
without too much computation.”
13. What is density functional theory (DFT)?
DFT is a method to solve for the electronic structure and energetics of arbitrary
materials starting from first-principles. It replaces many-body interactions
with a mean field interaction that reproduces the same charge density.
In theory, it is exact for the ground state. In practice, accuracy depends on the
choice of (some) parameters, the type of material, the property to be studied,
and whether the simulated system (crystal) is a good approximation of reality.
DFT resulted in the 1998 Nobel Prize in Chemistry (W. Kohn). It is
responsible for 2 of the top 10 cited papers of all time, across all sciences.
14. How does one use DFT to design new materials?
A. Jain, Y. Shin, and K. A.
Persson, Nat. Rev. Mater.
1, 15004 (2016).
15. How accurate is DFT in practice?
Shown are typical DFT results for (i) Li
battery voltages, (ii) electronic band gaps,
and (iii) bulk modulus
(i) V. L. Chevrier, S. P. Ong, R. Armiento, M. K. Y. Chan, and G. Ceder,
Phys. Rev. B 82, 075122 (2010).
(ii) M. Chan and G. Ceder, Phys. Rev. Lett. 105, 196403 (2010).
(iii) M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst,
M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S.
Curtarolo, G. Ceder, K.A. Persson, and M. Asta, Sci. Data 2, 150009
(2015).
16. • System size is essentially limited to a few thousand atoms
– many important materials phenomena simply do not occur at this
length scale
• Certain materials, such as those with strong electron
correlation, remain difficult to model accurately
• Certain properties, including excited state properties
such as band gap, remain difficult to model accurately
• These are all active areas of research and improvement
to the theory, and the situation is improving on all fronts
Limitations of density functional theory
18. Viewpoint of the DFT accuracy situation
• Improvements to the
theory would certainly be
very helpful
– Many researchers are
working on this problem
– New and better methods
do appear over time, e.g.,
hybrid functionals for
solids.
• But – let’s not wait for
perfection before we start
applying it.
The map is not perfect, but time
to set sail and leave port!
19. Outline
① From quantum mechanics to density functional
theory (DFT)
② “High-throughput” DFT
③ Calculation and property databases
④ Data mining approaches to materials design
⑤ Preview of part II (tomorrow)
22. High-throughput DFT: a key idea
Automate the DFT
procedure
Supercomputing
Power
FireWorks
Software for programming
general computational
workflows that can be
scaled across large
supercomputers.
NERSC
Supercomputing center,
processor count equivalent
to ~100,000 desktop
machines. Other centers
are also viable.
High-throughput
materials screening
G. Ceder & K.A.
Persson, Scientific
American (2015)
23. How much computer time is needed for
high-throughput DFT?
• The answer is “it really varies a lot”
– how big / complicated are the materials you are modeling?
– how complex / expensive are the properties you are
modeling?
• Ballpark numbers:
– Low range: optimize structure of ~3-atom compounds
• time to do a million materials ~ 10 million CPU-hours
– Medium range: bulk modulus of ~50-atom compounds
• time to do a million materials ~ 2 billion CPU-hours
• The largest CPU allocations from the DOE are
typically on the order of ~100 million CPU-hours
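The ballpark numbers can be turned into a per-material cost and compared against a large allocation. This is a back-of-the-envelope sketch using only the figures quoted on the slide; the names are made up for illustration.

```python
MATERIALS = 1_000_000      # number of compounds to screen
DOE_ALLOCATION = 100e6     # CPU-hours in a large allocation (slide figure)

def summarize(total_cpu_hours):
    """Return (CPU-hours per material, number of large allocations needed)."""
    per_material = total_cpu_hours / MATERIALS
    allocations_needed = total_cpu_hours / DOE_ALLOCATION
    return per_material, allocations_needed

low = summarize(10e6)      # structure optimization, ~3-atom compounds
medium = summarize(2e9)    # bulk modulus, ~50-atom compounds

print(f"low range:    {low[0]:.0f} CPU-h per material, {low[1]:.1f} allocations")
print(f"medium range: {medium[0]:.0f} CPU-h per material, {medium[1]:.0f} allocations")
```

With these figures the medium-range case would need about twenty of the largest allocations, which is why the property and system size matter so much.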
24. Examples of (early) high-throughput studies
Application                Researcher          Search space  Candidates  Hit rate
Scintillators              Klintenberg et al.  22,000        136         1/160
Scintillators              Curtarolo et al.    11,893        ?           ?
Topological insulators     Klintenberg et al.  60,000        17          1/3500
Topological insulators     Curtarolo et al.    15,000        28          1/535
High TC superconductors    Klintenberg et al.  60,000        139         1/430
Thermoelectrics – ICSD     Curtarolo et al.    2,500         20          1/125
 - Half Heusler systems    Curtarolo et al.    80,000        75          1/1055
 - Half Heusler best ZT    Curtarolo et al.    80,000        18          1/4400
1-photon water splitting   Jacobsen et al.     19,000        20          1/950
2-photon water splitting   Jacobsen et al.     19,000        12          1/1585
Transparent shields        Jacobsen et al.     19,000        8           1/2375
Hg adsorbers               Bligaard et al.     5,581         14          1/400
HER catalysts              Greeley et al.      756           1           1/756*
Li ion battery cathodes    Ceder et al.        20,000        4           1/5000*
Candidates in entries marked with * have been experimentally verified.
See also: Curtarolo et al., Nature Materials 12 (2013) 191–201.
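The hit rates in the table are simply search space divided by candidates. A small sketch to recompute a few of them (the slide values are rounded, so small differences are expected):

```python
# (search space, candidates) pairs copied from the table above
studies = {
    "Scintillators (Klintenberg et al.)": (22_000, 136),
    "Thermoelectrics, ICSD (Curtarolo et al.)": (2_500, 20),
    "Transparent shields (Jacobsen et al.)": (19_000, 8),
    "HER catalysts (Greeley et al.)": (756, 1),
    "Li ion battery cathodes (Ceder et al.)": (20_000, 4),
}

def one_in(search_space, candidates):
    """Return N such that the hit rate is roughly 1/N."""
    return round(search_space / candidates)

for name, (space, hits) in studies.items():
    print(f"{name}: about 1/{one_in(space, hits)}")
```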
25. Computations predict, experiments confirm
Sidorenkite-based Li-ion battery
cathodes
LED phosphors
YCuTe2 thermoelectrics
Wang, Z., Ha, J., Kim, Y. H., Im, W. Bin, McKittrick, J. &
Ong, S. P. Mining Unexplored Chemistries for Phosphors
for High-Color-Quality White-Light-Emitting Diodes.
Joule 2, 914–926 (2018).
Chen, H.; Hao, Q.; Zivkovic, O.; Hautier, G.; Du, L.-S.; Tang,
Y.; Hu, Y.-Y.; Ma, X.; Grey, C. P.; Ceder, G. Sidorenkite
(Na3MnPO4CO3): A New Intercalation Cathode Material
for Na-Ion Batteries, Chem. Mater., 2013
Aydemir, U; Pohls, J-H; Zhu, H; Hautier, G; Bajaj, S; Gibbs,
ZM; Chen, W; Li, G; Broberg, D; White, MA; Asta, M;
Persson, K; Ceder, G; Jain, A; Snyder, GJ. Thermoelectric
Properties of Intrinsically Doped YCuTe2 with CuTe4-based
Layered Structure. J. Mat. Chem C, 2016
More examples here: A. Jain, Y. Shin, and K. A. Persson, Nat. Rev. Mater. 1, 15004 (2016).
26. • All the limitations of standard DFT still apply
• How to set DFT parameters automatically?
– A single universal parameter set will not accurately model all
materials and all properties
– Different parameter sets for different materials requires deciding
how to divide things up, and adds complication of “incompatibility”
between calculations
• How to handle non-uniformity of DFT errors when doing meta
analyses?
• How to run high-throughput efficiently on large computers?
– The biggest supercomputers are designed for massive parallelization;
unfortunately, DFT does not scale well to many processors
Limitations of high-throughput DFT
27. Outline
① From quantum mechanics to density functional
theory (DFT)
② “High-throughput” DFT
③ Calculation and property databases
④ Data mining approaches to materials design
⑤ Preview of part II (tomorrow)
28. With HT-DFT, we can generate data rapidly – what to do next?
M. de Jong, W. Chen, H.
Geerlings, M. Asta, and K. A.
Persson, Sci. Data, 2015, 2,
150053.
M. De Jong, W. Chen, T.
Angsten, A. Jain, R. Notestine,
A. Gamst, M. Sluiter, C. K.
Ande, S. Van Der Zwaag, J. J.
Plata, C. Toher, S. Curtarolo,
G. Ceder, K. A. Persson, and M.
Asta, Sci. Data, 2015, 2, 150009.
>4500 elastic
tensors
>900
piezoelectric
tensors
>48000
Seebeck
coefficients +
cRTA transport
Ricci, Chen, Aydemir, Snyder,
Rignanese, Jain, & Hautier (in
submission)
29. Materials Project database: putting all the data online
• Online resource of density
functional theory simulation data
for ~85,000 inorganic materials
• Includes band structures, elastic
tensors, piezoelectric tensors,
battery properties and more
• ~55,000 registered users
• Free
• www.materialsproject.org
Jain et al. Commentary: The Materials Project: A
materials genome approach to accelerating
materials innovation. APL Mater. 1, 011002 (2013).
30. The data is re-used by the community
K. He, Y. Zhou, P. Gao, L. Wang, N. Pereira, G.G. Amatucci, et al.,
Sodiation via Heterogeneous Disproportionation in FeF2 Electrodes for
Sodium-Ion Batteries., ACS Nano. 8 (2014) 7251–9.
M.M. Doeff, J. Cabana,
M. Shirpour, Titanate
Anodes for Sodium Ion
Batteries, J. Inorg.
Organomet. Polym. Mater.
24 (2013) 5–14.
Further examples in: A. Jain, K.A. Persson, G. Ceder. APL Materials (2016).
31. There are now many
first-principles
computational
databases, including
ones not on this list
(e.g., NIST-Jarvis,
NREL-TEDesignLab)
Lin, L. Materials Databases
Infrastructure Constructed by
First Principles Calculations: A
Review. Mater. Perform. Charact.
4, MPC20150014 (2015).
32. • All the limitations of standard DFT and high-
throughput DFT still apply
• Communicating accuracy, limitations, etc. to a
diverse user group is difficult
• It remains difficult to merge information from
different computational databases
– Citrine Informatics is trying, e.g., www.citrination.com
Limitations of DFT databases
33. Outline
① From quantum mechanics to density functional
theory (DFT)
② “High-throughput” DFT
③ Calculation and property databases
④ Data mining approaches to materials design
⑤ Preview of part II (tomorrow)
37. Bottom-up vs top-down approach
Small number of
general principles
Large number of
specific cases
• Conventional theory starts
with a small number of
principles and keeps
extending / simplifying to
tackle more and more cases
(growing the theory)
• Data mining starts from a
*very* large space of
possible models and
removes ones that are
inconsistent with the data
(“trimming” the theories)
38. Early “data mining” (not really machine learning)
“Pettifor structure maps”
D.G. Pettifor: The structures of binary compounds: I.
Phenomenological structure maps. J. Phys. C: Solid State
Phys. 19, 285–313 (1986).
“Cation coordination numbers”
Brown, I. D. What factors determine cation coordination
numbers? Acta Crystallogr. Sect. B Struct. Sci. 44, 545–553
(1988).
39. An ML model automatically learns relationships between
input variables and output variables
The ML model can find nonlinear relationships
between descriptors that accurately reproduce /
model outputs
40. • Some people see machine learning as just fancy
curve fitting
– See “Big data need big theory too” by Coveney et al.
• I see two things distinguishing ML:
– ML is more flexible; some see it as “writing a program”
– Curve fitting is about good interpolation, whereas a
large part of ML is figuring out how to use the data to
be predictive or perform a function (play Go)
Is ML just curve fitting?
41. Getting a “predictive” fit: 3-tier design
Image credit: Joseph Gonzalez, ds100.org
Image credit: Adi Bronshtein, Towards Data Science
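Reading "3-tier design" as a train / validation / test split (an assumption, since the slide's figures are not reproduced here), a minimal sketch of such a split looks like this. The function name and the 60/20/20 fractions are illustrative choices.

```python
import random

def three_way_split(n_samples, frac_val=0.2, frac_test=0.2, seed=0):
    """Shuffle sample indices and split them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * frac_test)
    n_val = round(n_samples * frac_val)
    test = indices[:n_test]                  # touched only once, at the very end
    val = indices[n_test:n_test + n_val]     # used to tune model choices
    train = indices[n_test + n_val:]         # used to fit model parameters
    return train, val, test

train, val, test = three_way_split(100)
print(len(train), len(val), len(test))
```

The point of the third tier is that the test set plays no role in fitting or tuning, so its error is a fairer estimate of predictive performance.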
42. • Clustering groups
together data points
by their descriptors –
i.e., group “similar
points” together
• No output value is
needed (unsupervised)
• Many techniques,
including hierarchical
clustering, which shows
groupings as a function
of cutoff
Unsupervised learning: clustering
W. Chen, J.-H. Pöhls, G. Hautier, D. Broberg, S.
Bajaj, U. Aydemir, Z. M. Gibbs, H. Zhu, M. Asta, G. J.
Snyder, B. Meredig, M. A. White, K. Persson, and A.
Jain, J. Mater. Chem. C 4, 4414 (2016).
clustering thermoelectric materials
by similarity
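To illustrate "groupings as a function of cutoff", here is a small sketch of single-linkage clustering: any two points closer than the cutoff end up in the same group. The descriptor values are made up, and single linkage is just one of several hierarchical schemes (it is not claimed to be the one used in the cited paper).

```python
from math import dist

def single_linkage_clusters(points, cutoff):
    """Group point indices; points within `cutoff` of each other are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical 2D descriptors for six materials
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]

for cutoff in (0.05, 0.5, 8.0):
    print(cutoff, single_linkage_clusters(points, cutoff))
```

Raising the cutoff merges groups step by step, which is the information a dendrogram displays. No output values are used anywhere, only the descriptors.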
43. • A more modern approach is
autoencoders
• Neural networks are forced to
start with a high-dimensional data
set and reproduce that
information with a small number of
dimensions / neurons
• The few dimensions that the
neural network finds tend to be
very good “descriptors”
• Can then use these descriptors
on supervised problems
Unsupervised learning: autoencoders
Q. V. Le, Proc. 2013 IEEE Int. Conf. Acoust.
Speech Signal Process. 8595 (2013).
44. • Regression techniques
can predict output
values for new data
• ML commonly uses:
– Lasso, Ridge, ElasticNet –
these are all regularization
techniques to prevent
overfitting
– Kernel Ridge Regression,
which can find nonlinear
patterns in the data
Regression: beyond linear regression
K. Hansen, G. Montavon, F. Biegler, S. Fazli, M.
Rupp, M. Scheffler, O. A. Von Lilienfeld, A.
Tkatchenko, and K. R. Müller, J. Chem. Theory
Comput. 9, 3404 (2013).
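How regularization works is easiest to see in the simplest case. The sketch below assumes ridge regression with a single feature and no intercept, where the fitted slope has a closed form; the data are made up. A larger penalty `alpha` shrinks the slope toward zero, trading some fit quality for robustness against overfitting.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge solution for y ~ w * x (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)   # alpha = 0 recovers ordinary least squares

# Made-up data lying close to y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_ols = ridge_slope(xs, ys, alpha=0.0)
w_ridge = ridge_slope(xs, ys, alpha=10.0)
print(f"least squares slope: {w_ols:.4f}, ridge slope: {w_ridge:.4f}")
```

Lasso and ElasticNet follow the same idea with different penalty terms, and kernel ridge applies the same penalty after a nonlinear mapping of the inputs.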
45. • Tree-based techniques make a
series of “decisions” based on
input features
• These decisions are designed to
split the data into homogeneous
groups
• At the end, the data should be
homogeneous enough to predict
a single value
• Pro: highly interpretable
Cons: usually not very accurate,
gives discontinuous predictions
Regression: tree-based techniques (1)
J. Carrete, N. Mingo, S. Wang, and S. Curtarolo,
Adv. Funct. Mater. 24, 7427 (2014).
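The splitting idea can be sketched with the smallest possible tree: a single decision ("stump") on one feature. This is an illustrative toy with made-up data, not the method of the cited paper. It picks the threshold that makes the two groups as homogeneous as possible and predicts each group's mean.

```python
def fit_stump(xs, ys):
    """Find the split on x that minimizes the summed squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((y - mean_left) ** 2 for y in left)
               + sum((y - mean_right) ** 2 for y in right))
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    return best[1:]   # (threshold, mean_left, mean_right)

def predict_stump(stump, x):
    threshold, mean_left, mean_right = stump
    return mean_left if x <= threshold else mean_right

# Made-up data with two obvious groups
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

stump = fit_stump(xs, ys)
print(stump, predict_stump(stump, 2), predict_stump(stump, 11))
```

The prediction jumps from one group mean to the other at the threshold, which is the discontinuity listed among the cons. A full tree repeats the same split recursively inside each group.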
46. • Random forests train multiple
trees each with slightly
different information
(different subset of input data
and features) and average or
vote on the results
• Random forests tend to be a
good “starter” model to see
roughly how good ML will do
• Gradient boosted trees go
one step further
Regression: tree-based techniques (2) – random forest
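A minimal sketch of the "many trees, each with slightly different information" idea, using one-split trees and bootstrap resampling of the data. With a single feature there is no feature subsampling here, so this is closer to plain bagging than to a full random forest; data and names are made up.

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree: (threshold, mean_left, mean_right)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(1, len(pairs)):
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[k - 1][0] + pairs[k][0]) / 2, ml, mr)
    return best[1:]

def predict_stump(stump, x):
    threshold, ml, mr = stump
    return ml if x <= threshold else mr

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)

xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
forest = fit_forest(xs, ys)
low_pred, high_pred = predict_forest(forest, 2), predict_forest(forest, 11)
print(f"prediction at x=2: {low_pred:.2f}, at x=11: {high_pred:.2f}")
```

Averaging over trees smooths out the quirks of any single tree, which is why the ensemble tends to be more accurate than one tree at some cost in interpretability.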
47. • These systems try to
guess the next best point
to try for an application
given the existing data
• The next choice might try
to maximize your output,
give highest chance of
some improvement, or
might instead be tuned to
give the maximum
increase in knowledge
about the problem
Recommendation engines and adaptive learning
D. Xue, P. V. Balachandran, J. Hogden, J.
Theiler, D. Xue, and T. Lookman, Nat.
Commun. 7, 11241 (2016).
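The three selection goals above can give three different answers for the same candidates. The sketch below assumes each candidate already has a predicted value and an uncertainty (made-up numbers), and that prediction errors are roughly Gaussian; it is not the specific algorithm of the cited paper.

```python
from math import erf, sqrt

def probability_of_improvement(mean, std, best_so_far):
    """Chance that a Gaussian prediction exceeds the best value seen so far."""
    if std == 0:
        return 1.0 if mean > best_so_far else 0.0
    z = (mean - best_so_far) / std
    return 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z

# Hypothetical candidates: name -> (predicted value, predicted uncertainty)
candidates = {"A": (1.0, 0.05), "B": (0.9, 0.5), "C": (0.2, 0.6)}
best_so_far = 1.02

# 1. maximize the predicted output
by_mean = max(candidates, key=lambda k: candidates[k][0])
# 2. maximize the chance of some improvement
by_improvement = max(
    candidates,
    key=lambda k: probability_of_improvement(*candidates[k], best_so_far),
)
# 3. maximize the expected gain in knowledge (here: largest uncertainty)
by_uncertainty = max(candidates, key=lambda k: candidates[k][1])

print(by_mean, by_improvement, by_uncertainty)
```

Using the largest uncertainty as a stand-in for "increase in knowledge" is a simplification chosen for brevity.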
48. • Neural networks are one of the
hottest topics in ML at the moment
• Problem: *many* tunable
parameters, maybe a billion to train
• If you try to train a deep neural
network with 1000 data points, you
might be fooling yourself
• What to do?
– identify problems with a lot of data
– use “transfer learning”
– wait for more data …
Neural networks
Image credit: kdnuggets.com
49. ML and “creativity” – inventing fake celebrities
None of these are real people! They are all fakes from a neural net (GAN).
Karras, T., Aila, T., Laine, S. & Lehtinen, J. Progressive Growing of GANs for
Improved Quality, Stability, and Variation. 1–26 (2017). doi:10.1002/joe.20070
50. • e.g., neural net generated thesis titles
– Reconstruction of metal-to-motion : construction of
plasma emissions
– Computational approaches for distributed microscopy
– Optical effects on virtual radio Projects
– Experiments, and protein games from multiple atoms
– Hydrogel wireless charging via nanoparticle education
– Supersone legal questions support kits regulation on
qubits and transportation
– Atoms and characteristics of monolithic nanocity
Amateur fun with AI “creativity” at http://aiweirdness.com
51. • No underlying physical constraints on the model
– the machine doesn’t know whether it’s modeling baseball
statistics, physics of particle trajectories, or housing prices
• Hard to know how much to trust an ML result
– Uncertainty can be built into the model, but it’s not clear
that these estimates are all that meaningful
– Cross-validation estimates of accuracy are a “gold standard”,
but have many pitfalls, e.g. out-of-sample errors
• Almost always a tradeoff between accuracy and
interpretability
Limitations and pitfalls of machine learning
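The cross-validation pitfall can be shown with a toy case: a straight line fitted to y = x^2 on the interval [0, 1]. Cross-validation inside that interval reports a small error, but the same model is badly wrong at x = 3, outside the sampled range. Everything here (data, model, fold assignment) is an illustrative assumption.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def kfold_mae(xs, ys, k=5):
    """Mean absolute error from k-fold cross-validation (fold = index mod k)."""
    errors = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        held_out = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        slope, intercept = fit_line(*zip(*train))
        errors += [abs(slope * x + intercept - y) for x, y in held_out]
    return sum(errors) / len(errors)

# Training data only covers 0 <= x <= 1
xs = [i / 50 for i in range(51)]
ys = [x ** 2 for x in xs]

cv_mae = kfold_mae(xs, ys)
slope, intercept = fit_line(xs, ys)
extrapolation_error = abs(slope * 3 + intercept - 3 ** 2)

print(f"cross-validated MAE: {cv_mae:.3f}")
print(f"error at x = 3:      {extrapolation_error:.3f}")
```

The cross-validated error only describes points that resemble the training data, which is exactly the situation materials discovery tries to leave.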
52. ML behaves in ways that are often not well understood
We still do not really understand
how many of these models
really “work”
They can often give wrong
results, with a high degree of
confidence(!), that are very
different than our own intuition
Nguyen, A., Yosinski, J. & Clune, J. Deep neural networks are
easily fooled: High confidence predictions for unrecognizable
images. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern
Recognit. 07–12–June, 427–436 (2015).
53. • Small data sets – typically a few thousand (not millions) –
which is a major limitation for data-driven methods
• Materials scientists are typically looking to predict
outliers, not typical examples
– e.g., we are not looking for materials that behave essentially like
other materials the way ML looks for “images that look like
typical faces”
– Instead, we are looking for “outlier” materials that behave
differently than known materials, i.e., like nothing in our dataset
• A material is a complex object to describe to a computer
There are many ML challenges particular to materials science
54. Outline
① From quantum mechanics to density functional
theory (DFT)
② “High-throughput” DFT
③ Calculation and property databases
④ Data mining approaches to materials design
⑤ Preview of part II (tomorrow)
55. • tell you about some of my own research
• show you how you can get involved in this field much
faster and more easily than ever before!
Tomorrow I will:
experiment
computation
56. • Funding: DOE Basic Energy Sciences
Thank you!
Slides (already) posted to hackingmaterials.lbl.gov