- AiiDA is a framework that aims to automate and manage computational workflows and data in materials science. It provides tools for provenance tracking, reproducibility of results, and sharing of data.
- Key features include automation of calculations, robust storage of data and links between calculations in a database, and development of reusable scientific workflows to calculate material properties.
- The framework uses a plugin-based system to interface with different codes, data formats, computing resources, and more through a unified Python interface.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS (a minimal PySpark sketch follows this list).
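As a taste of part 3, here is a minimal PySpark sketch of the context/RDD workflow; the local master, app name, and data are illustrative, and it assumes pyspark is installed:

```python
# Minimal PySpark sketch: build a session, parallelize data, run a
# transformation (map) and an action (sum). Assumes `pip install pyspark`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))   # distribute the data
total = rdd.map(lambda x: x * x).sum()   # map is lazy; sum forces execution
print(total)

spark.stop()
```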
This document discusses image search and analysis techniques for remote sensing data. It describes an index management system that takes in data and indexes it using column-based databases. Images are analyzed to extract features that allow for image search based on compression in compressed streams. Queries can be performed on the indexed data to return similar images based on semantic labels and normalized distances from queries. Examples are provided using different remote sensing datasets, including GeoEye, DigitalGlobe, and TerraSAR-X images.
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin..., by Advanced-Concepts-Team
Searching for information within large sets of unstructured, heterogeneous scientific data can be very challenging unless an inverted index has been created in advance. Several solutions, mainly based on the Hadoop ecosystem, have been proposed to accelerate the process of index construction. These solutions perform well when data are already distributed across the cluster nodes involved in the computation. On the other hand, the cost of distributing data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution reduces the number of I/O operations to a bare minimum by using a stream of in-memory operations to extract and index heterogeneous data. We further improve performance by using GPUs and POSIX Threads programming for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark.
High Performance Machine Learning in R with H2O, by Sri Ambati
This document summarizes a presentation by Erin LeDell from H2O.ai about machine learning using the H2O software. H2O is an open-source machine learning platform that provides APIs for R, Python, Scala and other languages. It allows distributed machine learning on large datasets across clusters. The presentation covers H2O's architecture, algorithms like random forests and deep learning, and how to use H2O within R including loading data, training models, and running grid searches. It also discusses H2O on Spark via Sparkling Water and real-world use cases with customers.
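The talk is R-centric, but H2O exposes the same workflow in Python; a hedged sketch in that API (assumes `pip install h2o` plus a local Java runtime; the data is synthetic, not from the talk):

```python
# Minimal H2O sketch: start a local cluster, build a frame, train a model.
import h2o
from h2o.estimators import H2ORandomForestEstimator
import numpy as np
import pandas as pd

h2o.init()  # launches or attaches to a local H2O JVM

df = pd.DataFrame({"x1": np.random.rand(200), "x2": np.random.rand(200)})
df["y"] = (df.x1 + df.x2 > 1).astype(int)
frame = h2o.H2OFrame(df)
frame["y"] = frame["y"].asfactor()   # mark the target as categorical

model = H2ORandomForestEstimator(ntrees=50)
model.train(x=["x1", "x2"], y="y", training_frame=frame)
print(model.auc(train=True))
```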
This document discusses tools for social network analysis and visualization. It covers Netvizz, which extracts data from Facebook for research. It also covers Pajek and Gephi, two programs for analyzing and visualizing networks. Pajek is suitable for large networks with thousands of nodes, while Gephi is interactive and can handle networks of up to 100,000 nodes. Both support a variety of input and output formats and feature layout algorithms and metrics for analysis.
Virtual Knowledge Graphs for Federated Log Analysis, by Kabul Kurniawan
This document presents a method for executing federated graph pattern queries on dispersed and heterogeneous raw log data by dynamically constructing virtual knowledge graphs (VKGs). The approach extracts only relevant log messages on demand, integrates log events into a common graph, federates queries across endpoints, and links results to background knowledge. The architecture includes modules for log parsing, query processing, and a prototype implementation demonstrates the approach for security analytics use cases. An evaluation analyzes the performance of query execution time against factors like number of extracted log lines and queried hosts.
This document discusses data visualization tools in Python. It introduces Matplotlib as the first and still standard Python visualization tool. It also covers Seaborn which builds on Matplotlib, Bokeh for interactive visualizations, HoloViews as a higher-level wrapper for Bokeh, and Datashader for big data visualization. Additional tools discussed include Folium for maps, and yt for volumetric data visualization. The document concludes that Python is well-suited for data science and visualization with many options available.
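A small sketch of the first two tools named above, with Seaborn drawing onto a Matplotlib figure (synthetic data):

```python
# Matplotlib as the base layer, Seaborn layered on top for statistical style.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = np.random.normal(loc=0.0, scale=1.0, size=1000)

fig, ax = plt.subplots()
sns.histplot(data, kde=True, ax=ax)   # histogram plus kernel density estimate
ax.set_xlabel("value")
ax.set_title("Seaborn drawing onto a Matplotlib Axes")
plt.show()
```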
DSD-INT 2015 - Data management with open earth datalabs - Gerben de Boer, van..., by Deltares
The document discusses different methods for accessing environmental data from servers using open standards. It focuses on OPeNDAP, WMS and WCS protocols. Examples are provided on how to access data from these servers using Python, Matlab and QGIS. The last section promotes using the OpenEarth server stack to serve your own data using open standards.
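For the Python route, a common pattern is to open an OPeNDAP endpoint lazily with xarray; the URL and variable name below are placeholders, not endpoints from the slides:

```python
# Reading from an OPeNDAP server with xarray: the dataset is accessed
# lazily over HTTP, so only the requested slices are downloaded.
import xarray as xr

url = "https://example.org/thredds/dodsC/some/dataset"  # placeholder URL
ds = xr.open_dataset(url)   # requires netCDF4 or pydap to be installed

print(ds)                                   # inspect variables and coordinates
subset = ds["temperature"].isel(time=0)     # variable name is illustrative
```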
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
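In the Python API, the stacking workflow looks roughly like this sketch (synthetic data; the key detail is that base learners keep their cross-validation predictions):

```python
# Sketch of H2O stacking: cross-validated base learners + a Stacked Ensemble.
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2ORandomForestEstimator,
                            H2OStackedEnsembleEstimator)
import numpy as np
import pandas as pd

h2o.init()
df = pd.DataFrame(np.random.rand(300, 3), columns=["x1", "x2", "x3"])
df["y"] = (df.x1 + df.x2 > 1).astype(int)
train = h2o.H2OFrame(df)
train["y"] = train["y"].asfactor()
x, y = ["x1", "x2", "x3"], "y"

# Base learners must keep cross-validation predictions for stacking.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)
gbm = H2OGradientBoostingEstimator(ntrees=30, **common)
gbm.train(x=x, y=y, training_frame=train)
rf = H2ORandomForestEstimator(ntrees=30, **common)
rf.train(x=x, y=y, training_frame=train)

# The metalearner learns the optimal combination of the base fits.
stack = H2OStackedEnsembleEstimator(base_models=[gbm, rf])
stack.train(x=x, y=y, training_frame=train)
```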
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
Automating materials science workflows with pymatgen, FireWorks, and atomate, by Anubhav Jain
FireWorks is a workflow management system that allows researchers to define and execute complex computational materials science workflows on local or remote computing resources in an automated manner. It provides features such as error detection and recovery, job scheduling, provenance tracking, and remote file access. The atomate library builds on FireWorks to provide a high-level interface for common materials simulation procedures like structure optimization, band structure calculation, and property prediction using popular codes like VASP. Together, these tools aim to make high-throughput computational materials discovery and design more accessible to researchers.
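A toy FireWorks sketch of the define-then-execute pattern (assumes a reachable MongoDB for the LaunchPad; the shell commands stand in for real simulation steps):

```python
# Define a two-step workflow and add it to the LaunchPad (MongoDB-backed).
from fireworks import Firework, Workflow, LaunchPad, ScriptTask
from fireworks.core.rocket_launcher import rapidfire

fw1 = Firework(ScriptTask.from_str('echo "relax structure"'), name="step1")
fw2 = Firework(ScriptTask.from_str('echo "compute bands"'), name="step2")
wf = Workflow([fw1, fw2], {fw1: [fw2]})   # fw2 depends on fw1

lp = LaunchPad()   # default: MongoDB on localhost
lp.add_wf(wf)
rapidfire(lp)      # pull jobs and run them until none remain
```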
Data Structures for Statistical Computing in Python, by Wes McKinney
The document discusses statistical data structures in Python. It summarizes that structured arrays are commonly used to store statistical data sets but have limitations. The R data frame is introduced as a flexible alternative that inspired the pandas library in Python. Pandas aims to create intuitive data structures for statistical analysis with labeled axes and automatic data alignment. Its core data structure, the DataFrame, functions similarly to R's data frame.
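The labeled-axes and automatic-alignment ideas in a tiny example:

```python
# pandas aligns on labels, not positions: arithmetic matches index values
# and fills non-overlapping labels with NaN.
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
print(s1 + s2)   # a: NaN, b: 12.0, c: 23.0, d: NaN

df = pd.DataFrame({"x": s1, "y": s2})   # columns align on the union index
print(df)
```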
A walk through the maze of understanding Data Visualization using several tools such as Python, R, Knime and Google Data Studio.
This workshop is hands-on and this set of presentations is designed to be an agenda to the workshop
This document provides an introduction to Apache Spark, including its history and key concepts. It discusses how Spark was developed in response to big data processing needs and how it builds upon earlier systems like Google's MapReduce. The document then covers Spark's core abstractions like RDDs and DataFrames/Datasets and common transformations and actions. It also provides an overview of Spark SQL and how to deploy Spark applications on a cluster.
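Building on the earlier RDD sketch, here is the DataFrame/Spark SQL side in miniature (column names and data are arbitrary):

```python
# DataFrames and Spark SQL: the same query expressed through both APIs.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("a", 1), ("a", 3), ("b", 2)], schema=["key", "value"])

df.groupBy("key").agg(F.sum("value").alias("total")).show()

df.createOrReplaceTempView("t")
spark.sql("SELECT key, SUM(value) AS total FROM t GROUP BY key").show()

spark.stop()
```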
The DuraMat Data Hub and Analytics Capability: A Resource for Solar PV Data, by Anubhav Jain
The DuraMat Data Hub and Analytics Capability provides a centralized resource for sharing solar PV data. It collects performance, materials properties, meteorological, and other data through a central Data Hub. A data analytics thrust works with partners to provide software, visualization, and data mining capabilities. The goal is to enhance efficiency, reproducibility, and new analyses by combining multiple data sources in one location. Examples of ongoing projects using the hub include clear sky detection modeling to automatically classify sky conditions from irradiance data.
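As a flavor of the clear-sky ingredient, a small pvlib sketch that models clear-sky irradiance for a site; comparing measured irradiance against such a model is the basis of clear-sky detection (coordinates and timezone are illustrative):

```python
# Model clear-sky irradiance (GHI/DNI/DHI) for one day at a given site.
import pandas as pd
from pvlib.location import Location

site = Location(latitude=32.2, longitude=-110.9, tz="US/Arizona", altitude=700)
times = pd.date_range("2020-06-01", periods=24 * 12, freq="5min", tz=site.tz)

clearsky = site.get_clearsky(times)   # Ineichen model by default
print(clearsky.head())
```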
Software Tools, Methods and Applications of Machine Learning in Functional Ma..., by Anubhav Jain
The document discusses software tools for high-throughput materials design and machine learning developed by Anubhav Jain and collaborators. The tools include pymatgen for structure analysis, FireWorks for workflow management, and atomate for running calculations and collecting output into databases. The matminer package allows analyzing data from atomate with machine learning methods. These open-source tools have been used to run millions of calculations and power databases like the Materials Project.
Swift Parallel Scripting for High-Performance Workflow, by Daniel S. Katz
The Swift scripting language was created to provide a simple, compact way to write parallel scripts that run many copies of ordinary programs concurrently in various workflow patterns, reducing the need for complex parallel programming or arcane scripting to achieve this common high-level task. The result was a highly portable programming model based on implicitly parallel functional dataflow. The same Swift script runs on multi-core computers, clusters, grids, clouds, and supercomputers, and is thus a useful tool for moving workflow computations from laptop to distributed and/or high performance systems.
Swift has proven to be very general, and is in use in domains ranging from earth systems to bioinformatics to molecular modeling. It has more recently been adapted to serve as a programming model for much finer-grain in-memory workflow on extreme-scale systems, where it can perform task rates in the millions to billions per second.
In this talk, we describe the state of Swift's implementation, present several Swift applications, and discuss ideas for the future evolution of the programming model on which it's based.
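Swift is its own language, so no Swift code here; but its implicitly parallel dataflow style can be loosely approximated in plain Python with futures, as in this analogue (not Swift, and far less capable):

```python
# Loose Python analogue of dataflow parallelism: independent tasks run
# concurrently; the final step waits only on the results it consumes.
from concurrent.futures import ProcessPoolExecutor

def simulate(i):
    return i * i          # stand-in for running an ordinary program

def summarize(results):
    return sum(results)   # stand-in for a downstream analysis step

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(simulate, i) for i in range(16)]
        print(summarize(f.result() for f in futures))
```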
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss our experience with real-time analysis of high-volume event data streams.
Scipy 2011 Time Series Analysis in Python, by Wes McKinney
1) The document discusses statsmodels, a Python library for statistical modeling that implements standard statistical models. It includes tools for linear regression, descriptive statistics, statistical tests, time series analysis, and more.
2) The talk provides an overview of using statsmodels for time series analysis, including descriptive statistics, autoregressive moving average (ARMA) models, vector autoregression (VAR) models, and filtering tools (a minimal example follows this list).
3) The discussion highlights the development of statsmodels and the need for integrated statistical data structures and user interfaces to make Python more competitive with R for data analysis and statistics.
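To make item 2 concrete on simulated data (modern statsmodels exposes ARMA estimation through its ARIMA class; all numbers are illustrative):

```python
# Fit an ARMA(1,1) model to a simulated AR(1) series with statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, 500):                  # simulate y_t = 0.7*y_{t-1} + noise
    y[t] = 0.7 * y[t - 1] + rng.normal()

model = ARIMA(y, order=(1, 0, 1))        # (AR order, differencing, MA order)
result = model.fit()
print(result.summary())
print(result.forecast(steps=5))          # out-of-sample forecast
```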
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
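A hedged taste of both tools on toy problems: Numba compiling a scalar loop, Dask parallelizing a NumPy-style reduction.

```python
# Numba compiles the hot loop; Dask evaluates an array reduction in parallel.
import numpy as np
from numba import njit
import dask.array as da

@njit
def loop_sum(x):
    total = 0.0
    for v in x:           # compiled to machine code on first call
        total += v
    return total

print(loop_sum(np.random.rand(1_000_000)))

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())   # evaluated chunk by chunk, in parallel
```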
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It carries the slides for the talk I gave on distributed deep learning over Spark
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer..., by MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ..., by Databricks
Apache Spark performance on SQL and DataFrame/DataSet workloads has made impressive progress, thanks to Catalyst and Tungsten, but there is still a significant gap towards what is achievable by best-of-breed query engines or hand-written low-level C code on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speed-ups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big memory machines. Thus, with available memory increasingly in the TB range, Flare makes scale-up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare, and will demonstrate experiments on standard SQL benchmarks that exhibit order of magnitude speedups over Spark 2.1.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19, by Sujit Pal
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grow. In the best case, a pipeline is left to run overnight or even over several days. In the worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
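A stripped-down sketch of that pipeline shape; the regex "extractor" below is a stand-in for the SciSpaCy models, which require separate downloads:

```python
# Parallel entity extraction over documents with dask.bag, landing in Parquet.
import re
import dask.bag as db

docs = [f"Patient {i} was given Remdesivir on day {i % 7}." for i in range(100)]

def extract(doc):
    # Stand-in extractor: capitalized tokens as pseudo-entities.
    return {"doc": doc, "entities": ", ".join(re.findall(r"[A-Z][a-z]+", doc))}

bag = db.from_sequence(docs, npartitions=4).map(extract)
df = bag.to_dataframe()
df.to_parquet("entities.parquet")   # requires pyarrow or fastparquet
```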
Marius Eriksen discusses Reflow, a new cloud-native workflow framework for bioinformatics. Reflow programs workflows directly using a functional programming language for simplicity and composability. It leverages lazy evaluation and caching to efficiently parallelize and distribute work across private clusters. Reflow aims to untie the hands of implementors compared to traditional workflow systems through its unified approach to programming, execution, and infrastructure.
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
Dr. Francesco Bongiovanni has expertise in scalable distributed systems and algorithms, cloud computing, applied formal methods, and distributed optimizations. He has a B.Sc. in Computer Systems, an M.Sc. in Software Engineering of Distributed Systems, and a Ph.D. in Computer Science. He has worked at INRIA and the Verimag Laboratory. This presentation provides an overview of big data frameworks and tools including HDFS, Mesos, Spark, Spark Streaming, Spark SQL, GraphX, MLlib, Chapel, ZooKeeper, and SparkR that can be run on the eScience cluster for processing large datasets in a scalable, fault-tolerant manner. Examples demonstrate operations like averaging 1 billion elements.
This summary provides the key details from a local newspaper classified ad section:
1) The classified ad section includes listings for real estate rentals and sales, automotive sales, services such as tree trimming and heating/AC repair, help wanted ads, and community event notices.
2) One help wanted ad is for a customer service representative position at a local insurance agency. Another ad is for production workers at a factory in Ubly offering benefits.
3) Upcoming community events noted include a strawberry social fundraiser, Vacation Bible School, and a VFW bazaar in August with craft vendors and food.
The document discusses three sustainability-related topics: environmental restoration, geographic information systems, and recycling and energy efficiency.
Epic Research is an experienced Singapore Stock Exchange signals provider; contact us to learn about our best picks for SGX Live, SGX stock prices, and the Singapore stock market.
Individual adult therapy can help those experiencing upsetting or disproportionate emotions and behaviors by providing a safe, confidential environment to openly explore thoughts and perceptions. Speaking freely to a professional away from daily life can help uncover root causes of troublesome feelings and identify patterns contributing to issues. The goal is to gain a deeper understanding of oneself through interpretation, leading to greater self-awareness and acceptance.
CONTRATACIONES GRUPO ENTREPARENTESIS 2009, by guest4a899
A Chilean vocal group formed in 1998, with 10 years of experience in vocal music. They have won awards such as FONDART 2001 and have sung for President Michelle Bachelet. They have also shared the stage with renowned artists such as Alberto Plaza and Myriam Hernández, and have performed at Teletón events and for major companies in the country.
Weed Control Strategies in Organically Grown Carrots and Onions
For more information, please see the websites below:

Organic Edible Schoolyards & Gardening with Children
http://scribd.com/doc/239851214

Double Food Production from your School Garden with Organic Tech
http://scribd.com/doc/239851079

Free School Gardening Art Posters
http://scribd.com/doc/239851159

Companion Planting Increases Food Production from School Gardens
http://scribd.com/doc/239851159

Healthy Foods Dramatically Improves Student Academic Success
http://scribd.com/doc/239851348

City Chickens for your Organic School Garden
http://scribd.com/doc/239850440

Simple Square Foot Gardening for Schools - Teacher Guide
http://scribd.com/doc/239851110
Epic Research provides ultimate FOREX signals for its clients to produce amazingly accurate results. Our research team prepares I-FOREX signal live charts and track-sheets of past performance; by consulting these, traders can generate maximum profit from the marketplace. This report helps you achieve the desired success on the SGX Stock Exchange.
The REALM is an immersive and open-source environment for users to discover, create, collaborate, trade, access, and share resources to achieve personal and professional goals, combining elements of social media, rich content, and semantic data systems. For businesses, The REALM is a permission-based marketing environment that provides direct access to targeted customers and audiences through a revolutionary advertainment system that captivates individuals with precisely relevant information, rewards them for interaction, and builds authentic relationships leading to repeat sales and referrals.
Course: Implementation of Internal Control, by RC Consulting
This document presents information about a specialized technical course on implementing the internal control system in State entities in accordance with Directive N°013-2016-CG/GPROD of the Comptroller General of the Republic. The course will take place from December 14 to 16, 2016, and aims to train those responsible for implementing the internal control system and for measuring its maturity level before the established deadline. The course consists of three modules covering aspects such as the
Beyond Dots on a Map: Spatially Modeled Surfaces of DHS data, by MEASURE Evaluation
This presentation was shared by Clara R. Burgert-Brucker, Pete Gething, Andy Tatem, and Tom Bird, all with The DHS Program, at the June 2016 MEASURE Evaluation GIS Working Group Meeting.
To download the editable version of this document, go to www.slidebooks.com
Learn how to create a financial plan with training and templates in editable PowerPoint slides created by former Deloitte management consultants.
German Conference on Bioinformatics 2021
https://gcb2021.de/
FAIR Computational Workflows
Computational workflows capture precise descriptions of the steps and data dependencies needed to carry out computational data pipelines, analysis and simulations in many areas of Science, including the Life Sciences. The use of computational workflows to manage these multi-step computational processes has accelerated in the past few years driven by the need for scalable data processing, the exchange of processing know-how, and the desire for more reproducible (or at least transparent) and quality assured processing methods. The SARS-CoV-2 pandemic has significantly highlighted the value of workflows.
This increased interest in workflows has been matched by the number of workflow management systems available to scientists (Galaxy, Snakemake, Nextflow and 270+ more) and the number of workflow services like registries and monitors. There is also recognition that workflows are first class, publishable Research Objects just as data are. They deserve their own FAIR (Findable, Accessible, Interoperable, Reusable) principles and services that cater for their dual roles as explicit method description and software method execution [1]. To promote long-term usability and uptake by the scientific community, workflows (as well as the tools that integrate them) should become FAIR+R(eproducible), and citable so that author’s credit is attributed fairly and accurately.
The work on improving the FAIRness of workflows has already started and a whole ecosystem of tools, guidelines and best practices has been under development to reduce the time needed to adapt, reuse and extend existing scientific workflows. An example is the EOSC-Life Cluster of 13 European Biomedical Research Infrastructures which is developing a FAIR Workflow Collaboratory based on the ELIXIR Research Infrastructure for Life Science Data Tools ecosystem. While there are many tools for addressing different aspects of FAIR workflows, many challenges remain for describing, annotating, and exposing scientific workflows so that they can be found, understood and reused by other scientists.
This keynote will explore the FAIR principles for computational workflows in the Life Science using the EOSC-Life Workflow Collaboratory as an example.
[1] Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, and Daniel Schober. FAIR Computational Workflows. Data Intelligence 2020, 2:1-2, 108-121. https://doi.org/10.1162/dint_a_00033
This document discusses FAIR computational workflows and why they are important. It defines computational workflows as multi-step processes for data analysis and simulation that link computational steps and handle data and processing dependencies. Workflows improve reproducibility, enable automation, and allow for increased sharing and reuse of research. The document outlines how applying FAIR principles to workflows makes them findable, accessible, interoperable, and reusable. This includes using standardized metadata, identifiers, licensing, and formats to describe workflows and ensure their components and data are also FAIR. Adopting FAIR workflows requires support from workflow systems, tools, communities and services.
Dr. REEJA S R gave a talk on high performance computing (HPC) and Python. She discussed what HPC is, when it is needed, and what it includes. She also covered the history of computer architectures for HPC, including vector computers, massively parallel processors, symmetric multiprocessors, and clusters. Additionally, she explained what Python is, why it is useful for HPC, and some of the libraries that can help with HPC tasks like NumPy, SciPy, and MPI4py. Finally, she discussed some challenges with Python for HPC and ways to improve performance, such as through the PyMPI, Pynamic, PyTrilinos, ODIN, and Seamless libraries
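A minimal mpi4py sketch of the SPMD pattern such talks typically cover (run with, e.g., `mpiexec -n 4 python script.py`):

```python
# Each rank computes a partial sum; MPI reduce combines them on rank 0.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.arange(rank, 10_000_000, size, dtype=np.float64).sum()
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print("global sum:", total)
```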
Software tools for high-throughput materials data generation and data mining, by Anubhav Jain
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
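A tiny matminer featurization sketch (the composition is arbitrary):

```python
# Turn a chemical composition into ML-ready descriptors with matminer.
from pymatgen.core import Composition
from matminer.featurizers.composition import ElementProperty

featurizer = ElementProperty.from_preset("magpie")   # standard descriptor set
features = featurizer.featurize(Composition("Fe2O3"))
labels = featurizer.feature_labels()

print(len(features), "features, e.g.", labels[0], "=", features[0])
```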
1) The MaX Centre of Excellence aims to enable high-throughput materials design through automated simulations and tracking of provenance using the AiiDA platform.
2) AiiDA and the Materials Cloud platform allow over 10,000 simulations per day by automating workflows, tracking provenance to ensure reproducibility, and sharing data according to FAIR principles (a minimal AiiDA sketch follows this list).
3) Potential areas for collaboration with EOSC include integrating AiiDA Lab and the Materials Cloud Archive, developing standardized workflows as services, and providing authentication and authorization through B2ACCESS and EGI Check-In.
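A minimal AiiDA sketch of that automatic provenance tracking (assumes a configured AiiDA profile and database):

```python
# Every input, output, and call of a calcfunction is recorded as nodes and
# links in AiiDA's provenance graph.
from aiida import load_profile, orm
from aiida.engine import calcfunction

load_profile()   # connect to the default configured profile

@calcfunction
def add_and_double(x, y):
    return orm.Int(2 * (x.value + y.value))

result = add_and_double(orm.Int(3), orm.Int(4))
print(result.value)     # 14
print(result.creator)   # the calculation node that produced it
```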
Azure Machine Learning: Deep Learning with Python, R, Spark, and CNTK, by Herman Wu
The document discusses Microsoft's Cognitive Toolkit (CNTK), an open source deep learning toolkit developed by Microsoft. It provides the following key points:
1. CNTK uses computational graphs to represent machine learning models like DNNs, CNNs, RNNs in a flexible way.
2. It supports CPU and GPU training and works on Windows and Linux.
3. CNTK achieves state-of-the-art accuracy and is efficient, scaling to multi-GPU and multi-server settings.
2016-10-20 BioExcel: Advances in Scientific Workflow Environments, by Stian Soiland-Reyes
Carole Goble, Stian Soiland-Reyes
http://orcid.org/0000-0001-9842-9718
Presented at 2016-10-20 BioExcel Workflow Training, BSC, Barcelona
http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
NOTE: Although these slides are licensed as CC Attribution, it includes various logos which are covered by their own licenses and copyrights.
RAMSES: Robust Analytic Models for Science at Extreme Scales, by Ian Foster
This document discusses the RAMSES project, which aims to develop a new science of end-to-end analytical performance modeling of science workflows in extreme-scale science environments. The RAMSES research agenda involves developing component and end-to-end models, tools to provide performance advice, data-driven estimation methods, automated experiments, and a performance database. The models will be evaluated using five challenge workflows: high-performance file transfer, diffuse scattering experimental data analysis, data-intensive distributed analytics, exascale application kernels, and in-situ analysis placement.
The assets of the remote sensing digital world generate massive volumes of real-time data daily, in which insight information has potential significance if collected and aggregated effectively. We propose a real-time Big Data analytical architecture for remote sensing satellite applications that supports both online and offline data processing.
Apache Airavata is an open source science gateway software framework that allows users to compose, manage, execute, and monitor distributed computational workflows. It provides tools and services to register applications, schedule jobs on various resources, and manage workflows and generated data. Airavata is used across several domains to support scientific workflows and is largely derived from academic research funded by the NSF.
A Maturing Role of Workflows in the Presence of Heterogenous Computing Archit..., by Ilkay Altintas, Ph.D.
Scientific workflows are used by many scientific communities to capture, automate and standardize computational and data practices in science. Workflow-based automation is often achieved through a craft that combines people, process, computational and Big Data platforms, application-specific purpose and programmability, leading to provenance-aware archival and publication of the results. This talk summarizes the varying and changing requirements for distributed workflows influenced by Big Data and heterogeneous computing architectures and presents a methodology for workflow-driven science based on these maturing requirements.
Software tools to facilitate materials science research, by Anubhav Jain
The document discusses software tools to facilitate materials science research, noting that the author's group works to standardize and automate computational methods for high-throughput calculations and discovery of new functional materials. It advocates for developing automated workflows and analysis frameworks to reduce errors, improve efficiency, and enable non-experts to easily conduct complex simulations and analyses through intuitive online interfaces. The goal is to make advanced computational materials science accessible to a wider audience.
1) Scientists at the Advanced Photon Source use the Argonne Leadership Computing Facility for data reconstruction and analysis from experimental facilities in real-time or near real-time. This provides feedback during experiments.
2) Using the Swift parallel scripting language and ALCF supercomputers like Mira, scientists can process terabytes of data from experiments in minutes rather than hours or days. This enables errors to be detected and addressed during experiments.
3) Key applications discussed include near-field high-energy X-ray diffraction microscopy, X-ray nano/microtomography, and determining crystal structures from diffuse scattering images through simulation and optimization. The workflows developed provide significant time savings and improved experimental outcomes.
Parsl: Pervasive Parallel Programming in Python, by Daniel S. Katz
The document summarizes Parsl, a Python library for pervasive parallel programming. Parsl allows users to naturally express parallelism in Python programs and execute tasks concurrently across different computing platforms while respecting data dependencies. It supports various use cases from small machine learning workloads to extreme-scale simulations involving millions of tasks and thousands of nodes. Parsl provides simple, scalable, and flexible parallel programming while hiding complexity of parallel execution.
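The flavor of the API, in a sketch that runs on local threads (assumes `pip install parsl`):

```python
# Parsl apps return futures; independent apps run concurrently while
# Parsl tracks the data dependencies between them.
import parsl
from parsl import python_app
from parsl.configs.local_threads import config

parsl.load(config)

@python_app
def square(x):
    return x * x

@python_app
def total(values):
    return sum(values)

futures = [square(i) for i in range(10)]            # run concurrently
print(total([f.result() for f in futures]).result())
```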
The document discusses grid computing and the development of computational grids. Key points:
- Grids allow for sharing of computing power and resources across geographic locations through networked supercomputers, databases, and instruments.
- Major organizations like NASA, DOE, and NSF are working to build computational grids for applications like scientific simulations and instrument control.
- Indiana University is involved in grid research through various departments and projects focused on resource sharing, portals, middleware, and more.
PEARC17: A real-time machine learning and visualization framework for scientif..., by Feng Li
High-performance computing resources are currently widely used in science and engineering areas. Typical post-hoc approaches use persistent storage to save data produced by simulations, so analysis tasks must read it back from storage into memory. For large-scale scientific simulations, such I/O operations produce significant overhead. In-situ/in-transit approaches bypass I/O by accessing and processing in-memory simulation results directly, which suggests that simulations and analysis applications should be more closely coupled. This paper constructs a flexible and extensible framework that connects scientific simulations with multi-step machine learning processes and in-situ visualization tools, providing plugged-in analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulence flows.
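As a rough single-node analogue of simulation-time clustering for anomaly detection (scikit-learn's MiniBatchKMeans standing in for the paper's distributed method; the data is random):

```python
# Streaming clustering: update the model per timestep without storing history,
# then flag points far from every centroid as candidate anomalies.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=4, random_state=0)

for step in range(100):
    batch = np.random.rand(256, 3)   # stand-in for one timestep's flow features
    model.partial_fit(batch)         # incremental update

    dists = model.transform(batch).min(axis=1)   # distance to nearest centroid
    anomalies = batch[dists > dists.mean() + 3 * dists.std()]
```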
Atomate: a high-level interface to generate, execute, and analyze computation..., by Anubhav Jain
Atomate is a high-level interface that makes it easy to generate, execute, and analyze computational materials science workflows. It contains a library of simulation procedures for different packages like VASP. Each procedure translates instructions into workflows of jobs and tasks. Atomate encodes expertise to run simulations and allows customizing workflows. It integrates with FireWorks to execute workflows on supercomputers and store results in databases for further analysis. The goal is to automate simulations and scale to millions of calculations.
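In outline, the atomate usage pattern looks like this sketch (assumes a configured FireWorks LaunchPad and VASP setup; the POSCAR file name is illustrative):

```python
# Generate a preset band-structure workflow and queue it for execution.
from pymatgen.core import Structure
from atomate.vasp.workflows.presets.core import wf_bandstructure
from fireworks import LaunchPad

structure = Structure.from_file("POSCAR")   # illustrative input file
wf = wf_bandstructure(structure)            # optimization + static + bands

lp = LaunchPad.auto_load()   # connect to the configured workflow database
lp.add_wf(wf)                # the workflow is now ready to be launched
```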
The function-as-a-service (FaaS) model is well established in commercial cloud offerings but less so in research computing environments. The Globus Compute service enables remote computing using the FaaS model, but allows users to execute functions on any compute resource where they have access. We provide an overview of the Globus Compute service, and demonstrate how to install an endpoint and execute a function on a remote system.
This material was presented at the Research Computing and Data Management Workshop, hosted by Rensselaer Polytechnic Institute on February 27-28, 2024.
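The submission side of the SDK, in outline (the endpoint UUID is a placeholder; the function is arbitrary):

```python
# Submit a function to a remote Globus Compute endpoint and fetch the result.
from globus_compute_sdk import Executor

def double(x):
    return 2 * x

endpoint_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

with Executor(endpoint_id=endpoint_id) as ex:
    future = ex.submit(double, 21)
    print(future.result())   # 42, computed on the remote endpoint
```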
Supermicro designed and implemented a rack-level cluster solution for the San Diego Supercomputer Center (SDSC), optimized for their custom and experimental AI training and inferencing workloads and meeting their environmental and TCO requirements. The project team will discuss the journey of designing and deploying our Rack Plug and Play cluster, and Shawn Strande, Deputy Director, SDSC, will share his experience of partnering with the Supermicro team to solve his challenges in HPC and AI.
The team will also share the technology that powers the SDSC Voyager Supercomputer, the Habana Gaudi AI system with 3rd Gen Intel® Xeon® Scalable processors for Deep Learning Training, and Habana Goya for Inferencing.
Watch the webinar: https://www.brighttalk.com/webcast/17278/517013
Similar to Handling data and workflows in computational materials science: the AiiDA initiative (20)
The Research Data Alliance (RDA) is an international organization with over 11,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing and re-use of research data across technologies, disciplines, and borders. RDA has 36 working groups and 57 interest groups addressing challenges in domains like agriculture, health, materials science, and more. It has produced 50 technical specifications and standards to reduce barriers to data sharing.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries that works to reduce barriers to data sharing and exchange. RDA brings together researchers, scientists, and data professionals through Working Groups and Interest Groups to develop standards and best practices for data infrastructure and sharing. RDA has produced 50 outputs including technical specifications and has groups working on issues across multiple disciplines.
The Research Data Alliance (RDA) is an international organization focused on building the social and technical infrastructure to enable open sharing of data. It has over 10,000 individual members from 144 countries collaborating in Working and Interest Groups to develop recommendations and standards to reduce barriers to data sharing. Some of RDA's achievements include 47 flagship outputs, 100+ adoption cases, and 93 active groups addressing challenges such as metadata, repositories, legal issues, and more. The ultimate goal is to allow researchers and innovators to openly share data across technologies and disciplines to address societal challenges.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing of data. It has 98 working groups and interest groups addressing challenges such as interoperability, data citation, metadata standards, and skills training. The RDA produces recommendations and outputs that are adopted by data repositories, domain organizations, and research communities to reduce barriers to data sharing and exchange.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines, and countries to address societal challenges. The RDA has produced 45 flagship recommendations and outputs and has over 100 cases of adoption across domains. It has 95 active working and interest groups focusing on issues like specific domains, data stewardship, and infrastructure.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 145 countries working to build the social and technical infrastructure to enable open sharing of data. RDA has 91 working groups and interest groups focused on issues like different academic disciplines, legal and technical interoperability, and community needs. The organization has produced 37 flagship recommendations and outputs that have been adopted over 100 times to help reduce barriers to sharing data internationally.
The Research Data Alliance (RDA) is an international organization with over 10,000 members from 144 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines, and countries to address societal challenges. RDA has over 100 groups working on data interoperability issues and has produced 37 flagship outputs, including technical specifications, with over 100 adoption cases in various organizations and disciplines.
The Research Data Alliance (RDA) is an international organization with over 9,859 members from 144 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines and countries to address societal challenges. RDA has 85 groups working on data interoperability challenges through Working Groups and Interest Groups. It has produced 32 outputs including technical specifications and seen adoption in over 100 cases. RDA membership is open and free for individuals and provides benefits such as networking and skills development, while organizational membership provides additional benefits such as influencing RDA activities.
The Research Data Alliance (RDA) is an international organization with over 9,600 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. Its vision is for researchers to openly share data across technologies, disciplines, and countries to address societal challenges. RDA has 85 working and interest groups collaborating to develop recommendations and standards to reduce barriers to data sharing. It has produced 32 flagship recommendations that have been adopted in over 75 cases by organizations worldwide. Membership is open and free for individuals and provides opportunities to work on global data interoperability challenges.
The Research Data Alliance (RDA) is an international organization with over 9,499 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. RDA has developed 32 flagship technical specifications and standards, and their recommendations have been adopted in 75 cases across multiple disciplines, organizations, and countries. RDA members collaborate in 85 working and interest groups focused on issues like interoperability, data stewardship, and community needs. The organization's vision is for researchers to openly share data to address societal challenges.
The Research Data Alliance (RDA) is an international organization with over 9,400 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. Its mission is to reduce barriers to data sharing across technologies, disciplines and countries. RDA has numerous working groups and interest groups addressing challenges such as metadata, citation, preservation, and more. Membership is open and free for individuals and provides opportunities for collaboration.
The Research Data Alliance (RDA) aims to build social and technical bridges that enable open sharing of data. It has over 9,000 members from 137 countries working in 83 groups to address challenges like interoperability, best practices, and more. RDA produces recommendations and specifications to help researchers openly share data across technologies and disciplines to solve societal challenges.
The Research Data Alliance (RDA) aims to facilitate data sharing across disciplines to address societal challenges. Individuals are encouraged to engage with RDA to contribute their expertise to discussions and recommendations, access an international network, receive updates on RDA's work, participate in meetings, and gain experience in all stages of the data lifecycle. RDA benefits from individual participation, as individuals bring ideas, problems, and solutions to create a valuable global community focused on reducing barriers to data sharing.
The document discusses the value of research infrastructure providers engaging with the Research Data Alliance (RDA). It outlines that RDA works to enable open sharing of research data globally across disciplines to address societal challenges. As research is global, infrastructure providers need globally compatible services, and RDA ensures this. The document provides reasons for providers to engage with RDA, such as access to an international network and opportunities to collaborate on data standards. It also describes ways providers can engage, such as joining RDA groups or attending meetings.
The Research Data Alliance (RDA) is an international organization with over 8,900 members from 137 countries working to build the social and technical infrastructure to enable open sharing of data. The RDA has developed 32 flagship recommendations and specifications to reduce barriers to data sharing, and has seen 75 cases of adoption across multiple disciplines and countries. It convenes various working and interest groups to develop solutions to challenges in areas like reference frameworks, data stewardship, and community needs.
The Research Data Alliance (RDA) aims to facilitate open sharing of data across technologies and disciplines to address societal challenges. There are two main components - the volunteer community that builds social and technical connections through Working Groups, and the business operations that support the community. Organizations performing research can engage with RDA in various ways like sponsorship, membership, or participation in Working Groups to help shape standards and address issues like data management, quality, and interoperability. RDA offers a global network and opportunities for collaboration on solutions to research data challenges.
The document discusses the value of libraries engaging with the Research Data Alliance (RDA). It outlines several benefits libraries can gain from involvement such as interacting with data professionals, developing strategic partnerships, and gaining expertise. Libraries are encouraged to become organizational members of RDA, have staff join working groups, adopt RDA recommendations, and send representatives to plenaries. RDA works to address challenges around research data reproducibility, preservation, best practices, and more through global collaboration. Libraries are positioned to augment RDA's network as bridges between data activities and open sharing.
The document discusses ways that research funders can engage with and benefit from the Research Data Alliance (RDA). RDA works to build infrastructure for open data sharing across disciplines. Funders that support RDA can get more value from the research they fund through improved data quality, reuse, and benefits to stakeholders. Funders can encourage adoption of RDA outputs, support RDA operations, participate in forums, and sponsor events, fellowships, and pilots implementing RDA recommendations. Engaging with RDA helps funders deliver more benefits from research and supports RDA's work of improving data sharing.
The Research Data Alliance (RDA) aims to build social and technical bridges to enable open sharing of data. It has over 8,800 members from 137 countries working in 87 groups to develop recommendations and standards to reduce barriers to data sharing. Some of RDA's outputs include recommendations on data citation, metadata standards, and repository interoperability.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Learn SQL from basic queries to advanced queries (manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
Codeless Generative AI Pipelines (GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Analysis insight about a Flyball dog competition team's performance (roli9797)
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Open Source Contributions to Postgres: The Basics, POSETTE 2024 (ElizabethGarrettChri)
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You... (Aggregage)
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with "Financial Odyssey", our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
Handling data and workflows in computational materials science: the AiiDA initiative
1. Handling data and workflows in computational materials science: the AiiDA initiative
Andrea Ferretti
Firenze, 15 Nov 2016
2. COMPUTATIONAL MATERIALS' SCIENCE
- Highly accurate ab initio methods in electronic structure
- Large computational power required (now available)
- High-throughput screening possible
- Reduced need for experimental data
N. Marzari, Nature Materials, Apr 2016; PRL 105, 106601 (2010)
3. COMPUTATIONAL MATERIALS’ SCIENCE
G. Hautier et al, Nat Comm 4, 2292 (2013)
[Figure residue from G. Hautier et al.: Figure 2 plots effective mass versus band gap for the p-type TCO candidates (red dots; e.g. ZnO, SnO2, In2O3, B6O, K2Pb2O3, PbZrO3, PbHfO3, ...) against current p-type and n-type TCOs; the best TCOs lie in the lower right corner. Figure 3 shows vacancy/defect formation energies (eV). p-type dopability has already been reported experimentally or computationally for several of the candidates; B6O, for example, has been experimentally measured to show p-type conductivity.]
4. COMPUTATIONAL MATERIALS' SCIENCE
- Highly accurate ab initio methods in electronic structure
- Large computational power required (now available)
- High-throughput screening possible
- Reduced need for experimental data
- Data handling needed
N. Marzari, Nature Materials, Apr 2016; PRL 105, 106601 (2010)
5. SOME THOUGHTS ON DATA
• In computational science, data are naturally generated, so the workflows that create properties and data from a structure are key
• Curated data are needed (e.g. for verification or for machine learning)
• A model of data-on-demand can be implemented (high-throughput pushes the development of robust workflows that calculate properties automatically)
6. OBJECTIVES
• Automation: run thousands of calculations daily
• Provenance: all children and all parent data are recorded
• Reproducibility: go back to a simulation years later, and redo it with new parameters or codes
• Extensible/agnostic to models, codes and formats
• Workflows: dynamical, robust, complex "turnkey solutions" that calculate desired properties on demand
• Sharing: provide the distributed environment to disseminate workflows and data and to provide services
7. ADES MODEL FOR COMPUTATIONAL SCIENCE
G. Pizzi et al., Comp. Mat. Sci. 111, 218-230 (2016)
Low-level pillars and user-level pillars
9. The four ADES pillars:
- Automation ("a factory"): automation, remote management, high-throughput
- Data ("a library"): database, provenance, storage
- Environment ("a scholar"): research environment, scientific workflows, data analytics
- Sharing ("a community"): social, sharing, standards
http://www.aiida.net
(MIT BSD, jointly developed with Robert Bosch)
G. Pizzi et al., Comp. Mat. Sci. 111, 218 (2016)
10. ADES: Automation in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
Remote management, coupling to data, high throughput
11. Automation in AiiDA: what is AiiDA?
1. The core of the code is the AiiDA API (Application Programming Interface), a set of Python classes that exposes the key objects to users: Calculations, Codes, and Data.
12. Automation in AiiDA
2. The AiiDA Object-Relational Mapper (ORM) maps AiiDA objects onto Python classes, so that objects can be created/modified/queried via an agnostic high-level interface. Any interaction with storage occurs transparently via Python calls, as sketched below.
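The following is a minimal sketch of the ORM idea, not the actual AiiDA API: the Node class, the attributes table, and the method names are all illustrative. The point is that setting and reading attributes looks like plain Python, while every access is translated into SQL behind the scenes.

```python
# Minimal sketch of the ORM idea (illustrative, not the actual AiiDA API):
# objects are plain Python instances, and every attribute access or query
# is translated transparently into storage operations.
import sqlite3

class Node:
    """A stored object; attributes live in an SQL table behind the scenes."""
    def __init__(self, db, node_id):
        self._db, self.id = db, node_id

    def set_attr(self, key, value):
        self._db.execute(
            "INSERT INTO attributes (node_id, key, value) VALUES (?, ?, ?)",
            (self.id, key, str(value)))

    def get_attr(self, key):
        row = self._db.execute(
            "SELECT value FROM attributes WHERE node_id=? AND key=?",
            (self.id, key)).fetchone()
        return row[0] if row else None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE attributes (node_id INT, key TEXT, value TEXT)")
structure = Node(db, node_id=1)
structure.set_attr("formula", "B6O")   # stored via SQL, used via Python
print(structure.get_attr("formula"))   # -> B6O
```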
13. Automation in AiiDA
3. A daemon manages calculation states (submission, retrieval, parsing, ...) without user intervention (using the Python celery and supervisor packages), through remote transports and Slurm/PBS Pro/SGE/Torque plugins.
14. Automation in AiiDA
4. User interaction occurs via the verdi command-line tool, the interactive shell, or Python scripts; a schematic launch script is sketched below.
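Purely as an illustration of what such a script does (the class names, fields, and submit function here are hypothetical, not AiiDA's API): the user describes the calculation in plain Python and hands it to the daemon's queue, which then drives it to completion.

```python
# Schematic of a user-side launch script (names are illustrative, not the
# real AiiDA classes). The point: everything is plain Python, and
# submission just hands the calculation over to the daemon.
from dataclasses import dataclass, field

@dataclass
class Calculation:
    code: str                                  # which executable, on which computer
    inputs: dict = field(default_factory=dict)
    state: str = "NEW"

def submit(calc, queue):
    """Hand the calculation to the daemon's queue and return immediately."""
    calc.state = "SUBMITTING"
    queue.append(calc)

daemon_queue = []   # stands in for the daemon's work queue
calc = Calculation(code="pw@my_cluster",
                   inputs={"structure": "B6O", "ecutwfc": 60.0})
submit(calc, daemon_queue)   # the daemon polls this queue in the background
print(calc.state)            # -> SUBMITTING
```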
15. Coupling automation with storage
• The AiiDA API acts as the unique interface to heterogeneous, remote HPC resources, which are abstracted away: all work can be done on the local resources, and the user does not need to connect explicitly to remote HPC machines.
• Coupling automation with storage ensures:
  - uniformity of the input data and of the usage of codes and computers (the same interface encompasses several supercomputers, different schedulers, connection protocols, ...)
  - full reproducibility and provenance, with automatic storage of all data and links
  - seamless sharing of calculations with other users
16. ADES: Data in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
Storage, database, provenance
17. The Open Provenance Model
• Any calculation is a function, manipulating inputs to obtain outputs: out1, out2 = F(in1, in2)
• Each functional object is a node in a graph, connected together with directional, labeled links
• Output nodes in turn can be used as inputs of following calculations
[Diagram: data nodes in1 and in2 feed the calculation node F, which produces data nodes out1 and out2; a sketch of this bookkeeping follows.]
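A minimal sketch of this bookkeeping (the data structures and the record_call helper are illustrative, not AiiDA's): running a function through record_call stores one calculation node, links it to its input data nodes, and wraps each return value in a new data node that can feed later calculations.

```python
# Minimal sketch of the provenance idea: running out1, out2 = F(in1, in2)
# records data and calculation nodes plus labeled, directed links.
nodes, links = [], []   # links: (from_node, to_node, label)

def record_call(func, labeled_inputs):
    """Run func on data nodes, storing the calculation and its links."""
    calc = {"type": "calc", "name": func.__name__}
    nodes.append(calc)
    for label, data in labeled_inputs.items():
        links.append((data["name"], calc["name"], label))        # input links
    results = func(*(d["value"] for d in labeled_inputs.values()))
    outputs = []
    for i, value in enumerate(results):
        out = {"type": "data", "name": f"{func.__name__}_out{i+1}",
               "value": value}
        nodes.append(out)
        links.append((calc["name"], out["name"], f"out{i+1}"))   # output links
        outputs.append(out)
    return outputs   # output nodes can feed later calculations

def F(a, b):
    return a + b, a * b

in1 = {"type": "data", "name": "in1", "value": 2}
in2 = {"type": "data", "name": "in2", "value": 3}
nodes += [in1, in2]
out1, out2 = record_call(F, {"in1": in1, "in2": in2})
print(links)   # directed, labeled edges of the provenance DAG
```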
19. Saving the DAGs: Nodes and Links
Nodes and links form a graph structure:
• Each node: a row in an SQL table, plus a folder for files
• Links are also stored in an SQL table (job provenance)
Transitive closure (TC) table:
• Allows queries that traverse the graph
• Automatically updated using triggers
• Queries using the TC table in SQL are faster than with graph DB backends! (A sketch of the TC idea follows.)
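A minimal, illustrative sketch of a TC table, assuming SQLite; AiiDA maintains the table with PostgreSQL triggers, while here a Python function plays the trigger's role. With the TC table filled, "does node B descend from node A?" becomes a single indexed lookup instead of a graph traversal.

```python
# Sketch of a transitive-closure (TC) table (illustrative). On every new
# link, all newly reachable ancestor-descendant pairs are added, so
# graph-traversal queries become simple table lookups.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE link (parent INT, child INT)")
db.execute("CREATE TABLE tc (ancestor INT, descendant INT)")

def add_link(parent, child):
    db.execute("INSERT INTO link VALUES (?, ?)", (parent, child))
    # every ancestor of `parent` (plus parent itself) now reaches
    # every descendant of `child` (plus child itself)
    ancestors = [parent] + [r[0] for r in db.execute(
        "SELECT ancestor FROM tc WHERE descendant=?", (parent,))]
    descendants = [child] + [r[0] for r in db.execute(
        "SELECT descendant FROM tc WHERE ancestor=?", (child,))]
    for a in ancestors:
        for d in descendants:
            db.execute("INSERT INTO tc VALUES (?, ?)", (a, d))

add_link(1, 2)   # structure -> calculation
add_link(2, 3)   # calculation -> band structure
# "is node 3 derived from node 1?" is now a single lookup, no traversal:
print(db.execute(
    "SELECT 1 FROM tc WHERE ancestor=1 AND descendant=3").fetchone())
```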
20. Benchmark against Neo4j
• Graph databases exist (e.g. Neo4j)
• They are still young, while SQL is very mature
• Our benchmark (with PostgreSQL) vs. Neo4j, on the same realistic data: ~11K graphs, ~100K nodes, >1M attributes
[Plot: query time (s) versus number of results, for AiiDA (queries 1 and 2) and for Neo4j (query 1 and query 2).]
21. The AiiDA daemon
A daemon runs in the background, advancing each calculation through its states:
SUBMITTING → WITHSCHEDULER → RETRIEVING → PARSING → FINISHED
(A sketch of this state machine follows.)
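A minimal sketch of the daemon's state machine for one calculation (the actions are illustrative stand-ins): at every pass the daemon performs the action for the current state and advances, with no user intervention.

```python
# Sketch of the daemon's state machine for one calculation (illustrative):
# the daemon advances each calculation through the fixed sequence of
# states, performing the corresponding action at every step.
STATES = ["SUBMITTING", "WITHSCHEDULER", "RETRIEVING", "PARSING", "FINISHED"]

ACTIONS = {
    "SUBMITTING":    lambda job: print("uploading inputs, calling sbatch/qsub"),
    "WITHSCHEDULER": lambda job: print("polling the scheduler until the job ends"),
    "RETRIEVING":    lambda job: print("downloading output files"),
    "PARSING":       lambda job: print("parsing outputs into new DB nodes"),
}

def daemon_tick(job):
    """One pass of the daemon: act on the current state, then advance."""
    state = job["state"]
    if state == "FINISHED":
        return
    ACTIONS[state](job)
    job["state"] = STATES[STATES.index(state) + 1]

job = {"id": 42, "state": "SUBMITTING"}
while job["state"] != "FINISHED":
    daemon_tick(job)
print(job)   # -> {'id': 42, 'state': 'FINISHED'}
```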
22. ADES: Environment in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
High-level workspace, scientific workflows, data analytics
23. Environment in AiiDA: plugins
All functionality is provided through a plugin interface (a minimal sketch follows the list):
• Calculation: generation of input files for a given code (Quantum Espresso, Phonopy, GPAW, Yambo, NWChem, ...)
• Data: management of data objects for input/output (files & folders, parameter sets, remote data, structures, pseudos, ...)
• Parser: parsing of code output and generation of new DB nodes (Quantum Espresso, Phonopy, GPAW, Yambo, NWChem, ...)
• Transport: how to connect to a cluster (local connection, ssh, ...)
• Scheduler: how to interact with the scheduler (PBSPro, Torque, SGE, SLURM, ...)
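A minimal sketch of such a plugin interface, assuming abstract base classes and a name-to-class registry (class and registry names are illustrative, not AiiDA's): adding support for a new connection method or scheduler means writing one subclass and registering it, with no change to the core.

```python
# Sketch of the plugin idea (illustrative): each extension point is an
# abstract interface, and concrete plugins are looked up by name, so new
# codes, transports, or schedulers can be added without touching the core.
from abc import ABC, abstractmethod
import subprocess

class Transport(ABC):
    @abstractmethod
    def exec_command(self, cmd: str) -> str: ...

class LocalTransport(Transport):
    def exec_command(self, cmd: str) -> str:
        return subprocess.run(cmd, shell=True, capture_output=True,
                              text=True).stdout

class Scheduler(ABC):
    @abstractmethod
    def submit_command(self, script: str) -> str: ...

class SlurmScheduler(Scheduler):
    def submit_command(self, script: str) -> str:
        return f"sbatch {script}"

# a registry maps plugin names to classes
TRANSPORTS = {"local": LocalTransport}
SCHEDULERS = {"slurm": SlurmScheduler}

transport = TRANSPORTS["local"]()
scheduler = SCHEDULERS["slurm"]()
print(scheduler.submit_command("job.sh"))   # -> sbatch job.sh
```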
24. Environment in AiiDA: Workflows
• Full Python scripting capabilities
• AiiDA manages calculation dependencies
• Workflows are modular: users can expand on the workflows of others
• A step can call nested subworkflows
• Develop turn-key solutions for the calculation of material properties: libraries of workflows
25. Workflow features
• Automatic provenance tracking, stored in the DB using simple Python functions: inputs, outputs, and function calls are stored by adding a simple decorator to existing functions (see the sketch after this list)
• Serial and parallel execution support: long-running tasks can be launched on separate threads, waiting for the result only when needed
• Control of provenance granularity: store the level of detail relevant to the workflow
• Seamless mixing of local and remote jobs
• Progress checkpointing: restart from an arbitrary step, retry on failure
• Easy debugging: execute workflows in an IDE and observe/change the state of variables as they run
• Background execution: daemon execution allows the machine to be shut down and the workflow to continue from the last point, essential for long remote jobs
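A minimal sketch of the decorator idea (illustrative; the in-memory PROVENANCE list stands in for the database): wrapping an existing function records its inputs, outputs, and the call itself, without touching the function body.

```python
# Sketch of provenance-by-decorator (illustrative): wrapping an existing
# function records its inputs, outputs, and the call itself.
import functools

PROVENANCE = []   # stands in for the database

def track(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE.append({"call": func.__name__,
                           "inputs": {"args": args, "kwargs": kwargs},
                           "outputs": result})
        return result
    return wrapper

@track
def relax_structure(structure, ecutwfc=60.0):
    # placeholder for a real calculation
    return {"energy": -123.4, "structure": structure + " (relaxed)"}

relax_structure("B6O")
print(PROVENANCE)   # inputs, outputs and calls, recorded automatically
```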
26. WORKFLOWS ENCODING CORE KNOWLEDGE
[Diagram 1: PHONON workflow, computing phonon dispersions (plus elastic and dielectric properties). From the input parameters, an energy calculation and a phonon initialization distribute the q-points to independent single-q phonon calculations. Each single-q "restart" sub-workflow loops on itself if it fails (changing parameters) and restarts after a clean stop (max CPU time reached); the results are collected into dynamical matrices and Fourier-interpolated to yield the phonon dispersion. The restart logic is sketched below.]
[Diagram 2: CHRONOS workflow, determining the electronic-magnetic-atomic structure. Starting from a structure and a set of tested and converged pseudos (SSSP), it tests the metallic character, generates structures with random magnetizations, runs magnetic and non-magnetic energy relaxations, selects the lowest-energy configuration, and performs a final energy relaxation plus bands to obtain the magnetic properties and the electronic bands.]
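A minimal sketch of that restart behaviour (function names and status values are illustrative): a step is retried with adjusted parameters when it fails, and resumed from its checkpoint after a clean stop.

```python
# Sketch of the restart logic in a phonon-like sub-workflow (illustrative):
# retry with changed parameters on failure, resume from a checkpoint
# after a clean stop (e.g. max CPU time reached).
def run_with_restarts(step, params, max_retries=3):
    checkpoint = None
    for attempt in range(max_retries):
        status, checkpoint = step(params, checkpoint)
        if status == "ok":
            return checkpoint
        if status == "clean_stop":
            continue   # restart from the checkpoint with the same parameters
        if status == "failed":
            params = dict(params, mixing=params["mixing"] * 0.5)  # change params
    raise RuntimeError("step did not converge")

def single_q_calculation(params, checkpoint):
    """Toy step: needs two passes, hitting the walltime once."""
    done = (checkpoint or 0) + 1
    if done < 2:
        return "clean_stop", done   # hit walltime, resume later
    return "ok", done

print(run_with_restarts(single_q_calculation, {"mixing": 0.7}))   # -> 2
```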
28. ADES: Sharing in AiiDA
G. Pizzi, A.C., et al., arXiv:1504.01163
Social ecosystem, repository pipelines, standardization
29. Sharing in AiiDA
[Diagram: users, clusters and databases; each group keeps some data private and shares some data with the other groups and with public/shared repositories.]
• Sharing model in AiiDA: data can be pushed to the outside world or to other repositories
• Importer of previous calculations
• UUIDs are used to uniquely identify all data/calculation objects (see the sketch below)
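A minimal sketch of why UUIDs make sharing safe (illustrative; real AiiDA export/import is richer): identifiers minted with uuid4 do not clash across databases, and a re-import of already-known nodes is detected by UUID rather than duplicated.

```python
# Sketch of UUID-based sharing (illustrative): every node gets a
# universally unique identifier at creation, so data exported from one
# database can be imported elsewhere without identifier clashes.
import uuid

def new_node(**attrs):
    return {"uuid": str(uuid.uuid4()), **attrs}

def import_nodes(local_db, exported):
    for node in exported:
        if node["uuid"] in local_db:
            continue   # already known: skip, don't duplicate
        local_db[node["uuid"]] = node

shared = [new_node(kind="structure", formula="B6O")]
mydb = {}
import_nodes(mydb, shared)
import_nodes(mydb, shared)   # importing twice is harmless
print(len(mydb))             # -> 1
```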
31. CONCLUSIONS
• In computational science, data are naturally calculated, not harvested
• The ADES model (Automation - Data - Environment - Sharing)
• AiiDA v1.0 to be released by the end of 2016
• A DMP (data management plan) is part of, and distributed with, the AiiDA software
• AiiDA as a turn-key solution for data management