Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Summary of Goals, Progress, and Next steps for these three aspects of the Materials Project (materialsproject.org) infrastructure

* Validation: constantly guard against bugs in core data and imported data

* Provenance: know how data came to be

* Sandboxes: combine public and non-public data; "good fences make good neighbors"

Presenter: Dan Gunter, LBNL

Published in: Science
  • Picture of 1915 Heinrich Campendonk painting, "Landscape with horses". Steve Martin paid $850K for a forged version of the painting, from a reputable art house in Paris, in 2004. He sold it at a loss of $250K before discovering it was a forgery. The forgery was performed by Wolfgang Beltracchi.
  • Sandboxes are a way to share preliminary data in the context of MP data and tools.
  • Transcript of "Materials Project Validation, Provenance, and Sandboxes by Dan Gunter"

    1. The Materials Project: Validation, Provenance and Sandboxes
    2. Goals
       • Validation – constantly guard against bugs in core data and imported data
       • Provenance – know how data came to be
       • Sandboxes – combine public and non-public data; "good fences make good neighbors"
    3. Validation (Internal)
       [Screenshot labels: Database ID, External ID, What we expected, What we got]
    4. Validation runs all the time
       • Rules with "constraints" for every database (and sandbox)
       • Test constraints against entire DB every night → email reports
       • Validation engine, etc. all open-source software in pymatgen-db
       [Diagram labels: Remote server, Validation engine, Rules, MP Databases, Reports (email, web pages, ..)]
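The nightly cycle described on this slide (rules tested against every record, then a report) can be sketched roughly as follows. This is a minimal illustration, not the pymatgen-db engine: `Rule`, `run_validation`, `format_report`, and the sample records are all hypothetical names and data invented for the example.

```python
from typing import Callable, Iterable

# Hypothetical rule type: a readable name plus a predicate over one record.
Rule = tuple[str, Callable[[dict], bool]]


def run_validation(records: Iterable[dict], rules: list[Rule]) -> list[dict]:
    """Test every rule against every record and collect the violations."""
    violations = []
    for rec in records:
        for name, check in rules:
            if not check(rec):
                violations.append({"id": rec.get("task_id"), "rule": name})
    return violations


def format_report(violations: list[dict]) -> str:
    """Render violations as plain text, e.g. for the body of an email."""
    if not violations:
        return "All constraints satisfied."
    lines = [f"{len(violations)} constraint violation(s):"]
    lines += [f"  {v['id']}: {v['rule']}" for v in violations]
    return "\n".join(lines)


# Made-up records standing in for a database collection.
records = [
    {"task_id": "mp-1", "final_energy_per_atom": -3.2},
    {"task_id": "mp-2", "final_energy_per_atom": 0.4},
]
rules = [("final_energy_per_atom <= 0", lambda r: r["final_energy_per_atom"] <= 0)]

report = format_report(run_validation(records, rules))
print(report)
```

In a real deployment the records would come from a database query and the report would be mailed or published on a web page; both steps are left out here.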
    5. Rules have a simple syntax

        _aliases:
          - snl_id = mps_id
          - energy = analysis.e_above_hull
        materials:
          -
            filter:
            constraints:
              - final_energy_per_atom <= 0
              - initial_structure.lattice.volume > 0
              - initial_structure.lattice.a > 0
              - initial_structure.lattice.b > 0
              - initial_structure.lattice.c > 0
              - initial_structure.lattice.matrix size 3
              - formation_energy_per_atom <= 5
              - formation_energy_per_atom > -5
              - cpu_time > 5
              - e_above_hull > -0.000001
              - final_energy < 0
              - reduced_cell_formula size$ nelements
          # Check num. ICSD sources for selected compounds
          -
            filter:
              - task_id = "mp-540081"
            constraints:
              - icsd_id size> 10
          -
            filter:
              - task_id = "mp-20379"
            constraints:
              - icsd_id size 1
          -
            filter:
              - task_id = "mp-13634"
            constraints:
              - icsd_id size> 0
          -
            filter:
              - task_id = "mp-600022"
            constraints:
              - icsd_id size 0
          # NiO2 phases should never become stable
          -
            filter:
              - e_above_hull = 0
            constraints:
              - pretty_formula != 'NiO2'
        tasks:
          -
            filter:
              - state = "successful"
            constraints:
              - output.final_energy_per_atom <= 0
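To make the rule syntax concrete, here is a small sketch of how a single `field operator value` constraint from the slide could be evaluated against a nested record. It is an assumption-laden illustration, not the pymatgen-db parser: `check`, `get_path`, and `parse_literal` are hypothetical helpers, only the plain comparison operators are handled (the `size`, `size>`, and `size$` forms are left out), and the reading of `_aliases` as "short name = full field path" is a guess from the slide.

```python
import operator
from functools import reduce

# Comparison operators that appear in the slide's constraints.
OPS = {
    "<=": operator.le, ">=": operator.ge, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
}


def get_path(doc, dotted):
    """Follow a dotted path such as 'initial_structure.lattice.a' into nested dicts."""
    return reduce(lambda d, key: d[key], dotted.split("."), doc)


def parse_literal(text):
    """Quoted values are strings; everything else is treated as a number."""
    text = text.strip()
    if text[0] in "'\"":
        return text[1:-1]
    return float(text)


def check(doc, constraint, aliases=None):
    """Evaluate one 'field op value' constraint against a record."""
    field, op, value = constraint.split(None, 2)
    field = (aliases or {}).get(field, field)  # assumed alias semantics
    return OPS[op](get_path(doc, field), parse_literal(value))


# Made-up record shaped like the fields named in the rules above.
doc = {
    "final_energy_per_atom": -1.5,
    "initial_structure": {"lattice": {"a": 3.9}},
    "analysis": {"e_above_hull": 0.0},
    "pretty_formula": "NiO",
}
aliases = {"energy": "analysis.e_above_hull"}

print(check(doc, "initial_structure.lattice.a > 0"))
print(check(doc, "energy > -0.000001", aliases))
```

Keeping each constraint as a plain string means the rules file stays readable by non-programmers, at the cost of a small parser like this one.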
    6. Validation summary
       Easy-to-use, integrated, efficient tools to report errors
       Next steps
       – Record all check results in DB
       – More sophisticated checks (Map/Reduce)
       – Make it easier to add new checks internally
       – Make it easier for anyone to add new checks
         • per-sandbox or even per-user ("MP Alerts")
    7. Provenance: How do I know that the data is correct?
    8. Types of provenance in the system
       1) Calculation workflows
          – FireWorks records calculation inputs, .. results in great detail
       2) External datasets
          – Structure Notation Language standardizes the naming of data sources and publications
       3) Post-calculation data transformations
          – New "builders" provide a framework for tracking creation of final database products
       [Diagram labels: (1), (2), (3)]
    9. Provenance is available for every material
    10. Provenance in DB: Structure Notation Language

        "snl_final": {
          "about": {
            "created_at": {
              "string": "2014-02-22 19:07:00.383869",
              "@class": "datetime",
              "@module": "datetime"
            },
            "_materialsproject": {
              "submission_id": 52621,
              "snl_id": 398676,
              "spacegroup": {
                "lattice_type": "tetragonal",
                "symbol": "P4_2/mmc",
                "number": 131,
                "point_group": "4/mmm",
                "crystal_system": "tetragonal",
                "hall": "-P 4c 2"
              }
            },
            "_cedergroup": {
              "BURP_sids": [ 409544, 409545, 409546 ],
              "icsd_ids": [ ],
              "e_above_hull": 0.075125350000000423734
            },
            "references": "",
            "authors": [
              { "name": "Geoffroy Hautier", "email": "geoffroy.hautier@uclouvain.be" },
              { "name": "Bo Xu", "email": "boxu14@mit.edu" }
            ],
            "remarks": [ "supplementary compounds from MIT matgen database" ],
            "projects": [ "MIT matgen" ],
            "history": [
              {
                "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
                "name": "Inorganic Crystal Structure Database",
                "description": { "Collection code": 24692 }
              },
              {
                "url": "",
                "name": "",
                "description": {
                  "source": null,
                  "orig_name": "Basic substitution code.",
                  "formula": "O1 Pd1"
                }
              },
              {
                "url": "http://ceder.mit.edu/",
                "name": "MIT Ceder group research database",
                "description": { "source": 105986, "orig_name": "", "formula": "FeO" }
              },
              {
                "url": "http://www.materialsproject.org",
                "name": "Materials Project structure optimization",
                "description": {
                  "fw_id": 820305,
                  "task_type": "GGA optimize structure (2x)",
                  "task_id": "mp-753682"
                }
              },
              {
                "url": "http://www.materialsproject.org",
                "name": "Materials Project structure optimization",
                "description": {
                  "fw_id": 820308,
                  "task_type": "GGA+U optimize structure (2x)",
                  "task_id": "mp-776678"
                }
              }
            ]
          },

        [Callout labels: Metadata, Crystal DB sources, References, History of structure optimizations]
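As an illustration of how the `history` list in such a record could be read back as a provenance trail, the sketch below walks a trimmed copy of the slide's `about` block. The helper `summarize_history` is hypothetical, and the sample keeps only two of the five history entries shown on the slide.

```python
# Trimmed copy of the "about" block from the slide (two history entries only).
about = {
    "projects": ["MIT matgen"],
    "history": [
        {
            "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
            "name": "Inorganic Crystal Structure Database",
            "description": {"Collection code": 24692},
        },
        {
            "url": "http://www.materialsproject.org",
            "name": "Materials Project structure optimization",
            "description": {
                "fw_id": 820305,
                "task_type": "GGA optimize structure (2x)",
                "task_id": "mp-753682",
            },
        },
    ],
}


def summarize_history(about):
    """Return one readable line per history entry, oldest first."""
    steps = []
    for entry in about.get("history", []):
        name = entry.get("name") or "(unnamed source)"
        desc = entry.get("description") or {}
        # Skip empty or null fields so the summary stays compact.
        details = ", ".join(f"{k}={v}" for k, v in desc.items() if v not in (None, ""))
        steps.append(f"{name} [{details}]" if details else name)
    return steps


for step in summarize_history(about):
    print(step)
```

Because each entry carries its own `url`, `name`, and free-form `description`, the same loop works for external databases and for internal calculation steps alike.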
    11. Future work: unified view of provenance
        [Diagram labels: VASP result (x3), ICSD, Post-processing, Material properties, Computation, Data import processing, e.g., Defects]
    12. Sandbox example: Multivalent
        [Diagram labels: JCESR users, Non-JCESR users]
    13. Multivalent app
    14. Sandboxes = Database + Apps
        [Diagram labels: Core data; Core data + multivalent materials; Non-JCESR users; JCESR users]
    15. Technical challenges
        • Pre-process data for real-time search
        • Interfaces for per-user access control
          – https://materialsproject.org/materials/1234?sandbox=jcesr
          – Web UI elements and
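One way to picture per-user access control driven by the `sandbox` query parameter in the URL above is the sketch below. It is purely illustrative: the membership table `SANDBOX_MEMBERS`, the user names, and the function `resolve_sandbox` are hypothetical, and nothing here is claimed to match how the Materials Project site actually implements it.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical membership table: sandbox name -> users allowed to see it.
SANDBOX_MEMBERS = {"jcesr": {"alice"}}


def resolve_sandbox(url, user):
    """Return which dataset a request should be served from.

    Requests without a sandbox parameter get the public core data;
    requests naming a sandbox are allowed only for its members.
    """
    query = parse_qs(urlparse(url).query)
    sandbox = query.get("sandbox", [None])[0]
    if sandbox is None:
        return "core"
    if user in SANDBOX_MEMBERS.get(sandbox, set()):
        return sandbox
    raise PermissionError(f"{user} may not access sandbox {sandbox!r}")


url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
print(resolve_sandbox(url, "alice"))
print(resolve_sandbox("https://materialsproject.org/materials/1234", "bob"))
```

Falling back to the core data when no sandbox is named keeps public URLs unchanged for users outside the collaboration.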
    16. Future: dynamic sandbox creation
        Current:
        – Large & significant additional data / apps
          • e.g., JCESR
        – Longer-term connections to MP data
          • e.g. porous materials
        – Companies
          • e.g. VW/Stanford
        Future: small collab. per-user? CoD?
    17. Summary
        • Validation
          – guard against bugs by checking all data daily and at data import/creation time
        • Provenance
          – universal standard for annotating data provenance
        • Sandboxes
          – unified view of distinct databases
          – onramp for new collaborations and data