Machine Learning in computational materials science: an overview, a primer, and a rant, Igor Mosyagin
1. Machine learning in computational materials science
overview and personal experience
Igor Mosyagin
June 12, 2017
2. A few disclaimers
This talk is about computational materials science
There might be scientific fields where everything’s different
Materials science ≈ statistical physics
This talk is based on my limited experience
I also express my own opinion which may not coincide with my
current employer’s opinion.
4. What is «more» in this context?
In chemistry and physics, the
Avogadro constant is the number
of constituent particles, usually
atoms or molecules, that are
contained in the amount of
substance given by one mole.
$N_A = 6.022 \times 10^{23}\ \mathrm{mol}^{-1}$
At «normal conditions»,
1 mol of an ideal gas occupies about 22.4 L.
Water bottles at pyparis coffee
breaks are 0.5 L.
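To put the bottle in perspective, here is a rough back-of-the-envelope count of the molecules in 0.5 L of liquid water. The molar mass and density are standard textbook values assumed for this sketch, not numbers from the slide, and the function name is made up for illustration.

```python
# Rough count of water molecules in one 0.5 L bottle.
AVOGADRO = 6.022e23          # particles per mole, as on the slide
MOLAR_MASS_WATER = 18.015    # g/mol (assumed standard value for H2O)
DENSITY_WATER = 1000.0       # g/L (assumed, liquid water near room temperature)


def molecules_in_water(volume_l: float) -> float:
    """Return the approximate number of H2O molecules in volume_l litres."""
    moles = volume_l * DENSITY_WATER / MOLAR_MASS_WATER
    return moles * AVOGADRO


bottle = molecules_in_water(0.5)
print(f"{bottle:.2e} molecules")  # prints "1.67e+25 molecules"
```

That is roughly 10^25 molecules in a single coffee-break bottle.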
8. Some projects even get featured in national magazine
Felix A. Faber et al. doi:10.1103/PhysRevLett.117.135502
9. Computational costs?
Modern state-of-the-art computations: N atoms in a simulation
cell, where N is several hundred. The world record is a few thousand.
If one adds temperature (but
stays at quantum level), it
becomes more complicated.
For temperature-involving
simulations N is typically several
hundred.
Cost typically scales as $N^3$
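A minimal sketch of what cubic scaling implies in practice, assuming the cost is proportional to N cubed with no prefactor or overhead. The atom counts are illustrative picks for "several hundred" and "a few thousand", and `relative_cost` is a hypothetical helper.

```python
def relative_cost(n_atoms: int, n_reference: int, exponent: float = 3.0) -> float:
    """Cost of an n_atoms run relative to an n_reference run, assuming cost ~ N**exponent."""
    return (n_atoms / n_reference) ** exponent


# Illustrative sizes: "several hundred" atoms vs. "a few thousand" atoms.
small, large = 300, 3000
print(relative_cost(large, small))    # prints 1000.0
print(relative_cost(2 * small, small))  # prints 8.0: doubling N costs 8x
```

Under this assumption, a tenfold larger cell costs a thousand times more, which is why the record sits at only a few thousand atoms.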
10. Time to solution? A month, at least
If everything is fine, it takes a few hours for a static (T = 0)
calculation, and a few weeks for temperature-related simulations.
Steps in temperature-related simulations
1 Preparation. Select parameters, build simulation cells, select
starting positions, etc.
bash/awk/sed, GUI tools. Perl, Fortran, MATLAB
2 Running simulations in an HPC environment (a shared
supercomputer with queues, priorities and quotas)
Fortran. Sometimes an old version of Fortran.
3 Processing. Parsing the output of calculations, building models on
top, visualization, etc.
Every other language and GUI tool. Also Fortran
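The processing step is often just text parsing, which a high-level language handles in a few lines. The sketch below assumes a made-up log layout (`step ... T= ... E= ...`); real simulation codes each have their own output format, so the regular expression and the sample text are purely illustrative.

```python
import re
from statistics import mean

# Hypothetical log format; not the output of any particular code.
SAMPLE_LOG = """\
step    1  T=  298.5  E= -1204.31
step    2  T=  301.2  E= -1204.27
some unrelated diagnostic line
step    3  T=  300.3  E= -1204.35
"""

LINE_RE = re.compile(r"step\s+(\d+)\s+T=\s*([-\d.]+)\s+E=\s*([-\d.]+)")


def parse_log(text):
    """Return a list of (step, temperature, energy) tuples found in text."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:  # skip lines that do not follow the assumed layout
            step, temp, energy = match.groups()
            rows.append((int(step), float(temp), float(energy)))
    return rows


rows = parse_log(SAMPLE_LOG)
print(len(rows), "steps parsed")
print("mean T =", round(mean(t for _, t, _ in rows), 2))
```

From here the rows can go straight into a plotting or fitting library instead of another round of awk and sed.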
11. Temperature-involved calculations are expensive
Everything that is not related to HPC calculations can be done in
a high-level language.
There are some packages that help with those steps. Sometimes
those packages even provide Python interfaces to Fortran codes
(python-ase, pymatgen). There are even a few journals dedicated to
this sort of program.
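As an illustration of the preparation step in a high-level language, here is a standard-library-only sketch that builds an fcc simulation cell by repeating the unit cell. It does not use the ASE or pymatgen APIs, which offer this and far more; the function name and the lattice constant are assumptions made for the example.

```python
from itertools import product

# Fractional coordinates of the four atoms in a conventional fcc unit cell.
FCC_BASIS = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)]


def build_fcc_supercell(lattice_constant, repeats):
    """Return Cartesian positions for an fcc supercell of repeats**3 unit cells."""
    positions = []
    for i, j, k in product(range(repeats), repeat=3):
        for bx, by, bz in FCC_BASIS:
            positions.append((
                (i + bx) * lattice_constant,
                (j + by) * lattice_constant,
                (k + bz) * lattice_constant,
            ))
    return positions


# 3.6 is an illustrative lattice constant, not a value from the talk.
cell = build_fcc_supercell(lattice_constant=3.6, repeats=5)
print(len(cell), "atoms")  # prints "500 atoms"
```

A 5 x 5 x 5 repetition already gives 500 atoms, which is the "several hundred" regime of a typical simulation cell.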
12. Human factors
The lack of software craftsmanship skills leads to people
believing that Fortran is the only option.
Lack of exposure? The «next» step after bash scripts and
Fortran in data processing is usually MATLAB.
(Young) researchers are no different from developers:
smart
arrogant
NIH («not invented here») syndrome
lazy
do complex stuff
It might be hard to convince your supervisor to allow you
to spend resources on improving your «programming» skills
13. What can be done?
Need more exposure!
If you organize a meetup, put a note on the local university
board/fb. PhD students tend to have a very similar set of interests
to developers.
If you have friends/acquaintances in academia, bring them to
a meetup or ask them if they suffer any computer-related pain.
You might help them save a few weeks of work, and maybe get
a free beer in return.
There are always physics projects on GitHub that would love to have
somebody help them with code.
Lead by example, if you can
Scientists believe that DS is all about classification, while «real»
science is all about regression
If you feel bold enough, organize a tutorial
A lot of people use MATLAB/RStudio only for the convenient layout,
and few know that tools like Jupyter/Spyder exist
14. Some authority to use with stubborn people
10.1371/journal.pcbi.1003285 and 10.1371/journal.pone.0067111