The Materials Data Facility: A Distributed Model for the Materials Data Community

Logan Ward1 (loganw@uchicago.edu),
Ben Blaiszik1,2 (blaiszik@uchicago.edu),
Ian Foster1,2 (foster@uchicago.edu), Ryan Chard2,
Jonathon Gaff1, Kyle Chard1, Jim Pruyne1,
Rachana Ananthakrishnan1, Steven Tuecke1,
Michal Ondrejcek3, Kenton McHenry3, John Towns3
University of Chicago1, Argonne National Laboratory2, University of Illinois at Urbana-Champaign3
materialsdatafacility.org
globus.org
15 August 2017

The Materials Data Facility Team
UC/Argonne
Ian Foster (PI) Ben Blaiszik Steve Tuecke
Kyle Chard Jim Pruyne
Logan Ward Jonathon Gaff
Illinois (Urbana-Champaign)
Rachana Ananthakrishnan
John Towns (PI) Kenton McHenry
Michal Ondrejcek
Stephen Rosen
Ryan Chard

Data-Intensive Materials Science
- Materials Databases
- High-Throughput Screening
- Machine Learning
- Multi-scale Modeling
Kirklin et al. Acta Mat. (2016); de Jong et al. Sci Rep. (2016); Sparks et al. Scr. Mat. (2015); https://www.mpg.de/

Data-Intensive Materials Science
Science is becoming limited by the ability to handle data
- Where to get it?
- How to selectively share it?
- Where to store it?
- How do we know what it is?
- How to build software that uses it?
- How to get others to share theirs?
- How to keep track of provenance?
- ….?
Our goal is to create easy answers to these questions

Why create the MDF?
1. Make your data shareable
Custom access control, using institution credentials
2. Make your data open
Access to >100TB of storage space
3. Make your data accessible
Search across distributed resources
Automatic, domain-specific metadata extraction
4. Make your data computable
Tight integration with computing resources
5. Make your data valuable
Citable with DOIs, measured with usage stats

What is the MDF?
[Diagram: distributed data storage endpoints (EP) holding databases, datasets, APIs, LIMS, etc., connected to two services]
- Data publication service: publish, mint DOIs, associate metadata
- Data discovery service: deep indexing, query, browse, aggregate

Globus and the research data lifecycle
Stages: Transfer, Share, Publish, Discover
Locations shown: Instrument, Compute Facility, Personal Computer, Publication Repository
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (DataCite & domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus
• Only a Web browser required
• Use storage system of your choice
• Access using your campus credentials

Data sharing and Globus
Easily control who gains access to your data:
- Globus can use University/Laboratory credentials
- You can establish groups of authorized users

Data sharing and Globus
Simple to move data to/from any resource

Open data and Globus
Bottom Line: Globus provides a robust, highly-developed, well-supported platform for sharing and managing open data

What do I mean by “accessibility”?
Need: Simplify finding and acquiring materials data
Major Challenges:
1. Data spread across many resources
   - Have to search each repository individually
   - Different services, different APIs to get data
2. Contents of resources are poorly described
   - Lack domain-specific metadata
Goal: Link together the world’s materials data resources, with enough metadata to make them useful
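One way to picture this goal is a thin layer that reads each repository through its own interface and maps the results onto one shared record shape, so a single query covers all of them. The sketch below is illustrative only: the two in-memory "repositories", their field names, and the `search_all` helper are hypothetical and do not reflect any real MDF or partner API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    """Common record shape shared by every source (hypothetical)."""
    source: str
    title: str
    elements: frozenset


# Two toy "repositories" with different native schemas (made up for this sketch).
REPO_A = [
    {"name": "Al EAM benchmark", "species": ["Al"]},
    {"name": "TiO2 polymorphs", "species": ["Ti", "O"]},
]
REPO_B = [
    {"dataset_title": "Ni-Ti superelastic strain", "composition": "Ni Ti"},
]


def adapt_a(raw: dict) -> Record:
    """Normalize an entry from repository A."""
    return Record("repo_a", raw["name"], frozenset(raw["species"]))


def adapt_b(raw: dict) -> Record:
    """Normalize an entry from repository B."""
    return Record("repo_b", raw["dataset_title"], frozenset(raw["composition"].split()))


# Each source pairs its raw entries with the adapter that normalizes them.
SOURCES = {
    "repo_a": (REPO_A, adapt_a),
    "repo_b": (REPO_B, adapt_b),
}


def search_all(element: str) -> List[Record]:
    """Return records from every source that contain the given element."""
    hits = []
    for raw_entries, adapter in SOURCES.values():
        for raw in raw_entries:
            record = adapter(raw)
            if element in record.elements:
                hits.append(record)
    return hits


if __name__ == "__main__":
    for rec in search_all("Ti"):
        print(rec.source, "-", rec.title)
```

Adding another repository in this sketch only requires one more adapter; the search code itself does not change.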

Part 1: Linking with the Data Community
[Diagram: MDF linked to partner resources, with connections labeled Metadata (MD), Publishing (Pub.), and Compute]
- Materials Project
- Citrination
- Materials Commons
- Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers!
NCSA-PIREHV/TMSMBDH

MDF data discovery ecosystem
[Diagram: the data discovery service at the center of the ecosystem]
- Data sources (endpoints, NIST MRR, MDF publication service): harvest, deep index, register / sync
- Services and bots: automate, process, refine, analyze (data input, data output)
- User interfaces: query, browse, aggregate
Identify resources for indexing

MDF + NIST Database Tools
[Diagram: data discovery service connected to MDCS and NIST MRR]
Ref: Dima, et al. JOM. 68 (2016), 2053. doi: 10.1007/s11837-016-2000-4

MDF + NIST Database Tools
[Diagram: data discovery service connected to MDCS and NIST MRR]
MDF automates publicizing data and provides a uniform search interface

Piping DFT data from MDF to Citrine
1. User publishes DFT dataset
2. Bot requests open DFT data periodically
3. Bot accesses data, runs DFT parser to refine data
4. Push metadata to Citrine
5. Ingest DFT data quality report
…
Example record:
{
  "category": "system.chemical",
  "chemicalFormula": "MgO2",
  "properties": {
    "units": "eV",
    "name": "Band gap",
    "scalars": [ { "value": 7.8 } ]
  }
}
Our datasets are discoverable through many tools
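Step 4 can be pictured as a small function that turns one parsed DFT result into a record with the same layout as the JSON shown on this slide. This is only a sketch: `make_record` and its arguments are hypothetical, and the layout mirrors the slide's example rather than any confirmed Citrine schema.

```python
import json


def make_record(formula: str, name: str, value: float, units: str) -> dict:
    """Build a record with the same layout as the slide's example."""
    return {
        "category": "system.chemical",
        "chemicalFormula": formula,
        "properties": {
            "units": units,
            "name": name,
            "scalars": [{"value": value}],
        },
    }


if __name__ == "__main__":
    # Values copied from the slide's example record.
    record = make_record("MgO2", "Band gap", 7.8, "eV")
    print(json.dumps(record, indent=2))
```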

Part 2: A Materials Data Search Engine
Goal: Simplify finding useful data
Key Issue: Lack of metadata
Approaches:
1. Simplifying metadata capture from the source
2. Extracting useful information from dataset

Route 1: Integrating with LIMS/Workflow Tools
- MAST
- Materials Commons (MC)
- T2C2 (4CeeD)
Linking Software and Services
PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6
1 University of Chicago, 2 Argonne National Laboratory, 3 University of Michigan, 4 University of Wisconsin-Madison, 5 University of Illinois at Urbana-Champaign, 6 Northwestern University
Engagement
• Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS)
• Promote IMaD data services, tools, and accomplishments to the community
• Develop video tutorials, webinars, and shared code repositories
• Interface with the Materials Accelerator Network (MAN)
• Engage with colleges, industry, and consortiums
  • (Wisconsin) Regional Materials and Manufacturing Network (RM2N)
  • (Illinois) Digital Manufacturing and Design Innovation Institute (DMDII)
  • (Michigan) LIFT consortium
Overview
• NSF Midwest Big Data Spoke

Linking Data from Major Facilities
• Argonne Leadership Computing Facility (>1000 users/year)
  - Working with datasets that comprise ~300M core hours, with 200M more identified for near term
  - New joint effort to roll out MDF-like capabilities to ALCF users
• Advanced Photon Source (>5000 users/year)
  - Building pipelines and procedures to index and publish data from 15 beamlines (~1/3 of the facility) in conjunction with the APS software team (Schwartz)
• Advanced Light Source (>2000 users/year)
  - Integration with CAMERA project and associated tomography beamlines
Working with user facilities to facilitate capturing data/metadata

Ripple: Home automation for research data
doi:10.1109/ICDCSW.2017.30
Procedure for automating tomography experiments:
1. At ALS: Detect new beamline data, and transfer it to NERSC
2. At NERSC: Submit, run jobs on Edison, transfer data back to ALS
3. At ALS: Create a shared endpoint, notify collaborators of result via email
Automate capturing results and metadata
Ryan Chard
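The tomography procedure reads naturally as a chain of "when this happens, do that" rules. The sketch below shows that pattern under stated assumptions: it is not Ripple's implementation, the event kinds and function names are hypothetical, and each action only records a log line instead of calling Globus, a scheduler, or a mail server.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    site: str   # e.g. "ALS" or "NERSC"
    kind: str   # hypothetical event kinds, e.g. "file_created"
    path: str


@dataclass
class Rule:
    site: str
    kind: str
    action: Callable[[Event, List[Event], List[str]], None]


def transfer_to_nersc(ev: Event, queue: List[Event], log: List[str]) -> None:
    log.append(f"transfer {ev.path} ALS->NERSC")
    queue.append(Event("NERSC", "transfer_done", ev.path))


def run_job(ev: Event, queue: List[Event], log: List[str]) -> None:
    log.append(f"run job on {ev.path}; transfer results NERSC->ALS")
    queue.append(Event("ALS", "results_arrived", ev.path))


def share_and_notify(ev: Event, queue: List[Event], log: List[str]) -> None:
    log.append(f"share {ev.path}; email collaborators")


# One rule per step of the procedure on the slide.
RULES = [
    Rule("ALS", "file_created", transfer_to_nersc),
    Rule("NERSC", "transfer_done", run_job),
    Rule("ALS", "results_arrived", share_and_notify),
]


def process(first: Event) -> List[str]:
    """Run events through the rules until no new events are produced."""
    queue: List[Event] = [first]
    log: List[str] = []
    while queue:
        ev = queue.pop(0)
        for rule in RULES:
            if (rule.site, rule.kind) == (ev.site, ev.kind):
                rule.action(ev, queue, log)
    return log


if __name__ == "__main__":
    for line in process(Event("ALS", "file_created", "scan_001.h5")):
        print(line)
```

Because each action emits the event that triggers the next rule, the whole pipeline starts from a single "new file" event with no manual steps in between.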

Route 2: Deep Indexing Materials Data
MDF Index
- Data resources indexed: 116
- Records: >3.4M
- Repositories harvested: 6
  • MDF
  • NIST MML Repo
  • MATIN
  • Materials Commons
  • CXIDB
  • NIST Materials Resource Registry
- Made discoverable: ~200 datasets, ~260 TB

Adding More Metadata to NIST MatDL
Dataset As Published
Limited Metadata
Querying Difficult

Deep-Indexed into the MDF
Data Available Programmatically

Deep-Indexed into the MDF
Can be used for scripting

Another benefit: domain-specific querying
Example service possible with DFT data files
Answer questions like:
- “Do we have any data about anatase-TiO2?”
- “Who else has studied Li-MnO3 batteries with DFT?”
[Diagram: crystal structure file (.cif, VASP, etc.) in; entries from MDF that are structurally similar out]
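A much simpler cousin of the service described here is matching by composition alone. The sketch below does not implement structural similarity; it swaps in a plain reduced-formula comparison to show what a domain-specific query adds over text search. The index entries and helper names are made up for illustration.

```python
import re
from collections import Counter
from functools import reduce
from math import gcd


def parse_formula(formula: str) -> dict:
    """Parse a simple formula like 'Ti2O4' into reduced integer element counts.

    Handles only element symbols followed by optional integer counts
    (no parentheses, charges, or fractional amounts).
    """
    counts = Counter()
    for symbol, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(num) if num else 1
    if not counts:
        raise ValueError(f"could not parse formula: {formula!r}")
    divisor = reduce(gcd, counts.values())
    return {el: n // divisor for el, n in counts.items()}


# Toy index; titles and formulas are invented for this sketch.
INDEX = [
    {"title": "anatase relaxation", "formula": "Ti4O8"},
    {"title": "rutile phonons", "formula": "Ti2O4"},
    {"title": "periclase band structure", "formula": "MgO"},
]


def find_by_composition(index: list, formula: str) -> list:
    """Return titles of entries whose reduced composition matches the query."""
    target = parse_formula(formula)
    return [e["title"] for e in index if parse_formula(e["formula"]) == target]


if __name__ == "__main__":
    print(find_by_composition(INDEX, "TiO2"))
```

Both TiO2 entries match here even though they are different polymorphs, which is exactly the gap that a structure-aware comparison is meant to close.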

Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories
doi:10.1145/3085504.3091116
Goal: Build intelligent search indexes with minimal human effort
Method: Employ machine learning to extract metadata from file repositories
- Classify data files
- Detect file types
Tyler Skluzacek
Search Otherwise-Unusable Data Repositories
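To make "detect file types" concrete, here is a rule-based stand-in that guesses a coarse type from the first bytes of a file. It is deliberately not machine learning and not Skluma's method; the type labels and the `sniff_type` helper are hypothetical, and a learned classifier would replace these hand-written rules.

```python
import json


def sniff_type(sample: bytes) -> str:
    """Guess a coarse file type from the leading bytes of a file."""
    # HDF5 files begin with a fixed 8-byte signature.
    if sample.startswith(b"\x89HDF\r\n\x1a\n"):
        return "hdf5"
    # NUL bytes almost never appear in plain-text data files.
    if b"\x00" in sample:
        return "binary"
    try:
        text = sample.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    lines = stripped.splitlines()
    # CIF files contain a data_ block header and cell parameters.
    if any(line.startswith("data_") for line in lines) and "_cell_length_a" in text:
        return "cif"
    # Treat text with the same nonzero comma count on every line as CSV.
    comma_counts = {line.count(",") for line in lines}
    if len(lines) > 1 and len(comma_counts) == 1 and comma_counts != {0}:
        return "csv"
    return "text"


if __name__ == "__main__":
    print(sniff_type(b"x,y\n1,2\n3,4\n"))
    print(sniff_type(b'{"a": 1}'))
```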

MDF Forge python package (under development)
• Interface to MDF services
• Helper functions for common tasks
APIs, Automation, and Examples
https://github.com/materials-data-facility/forge
Tools for using these capabilities will be available soon

Computable Data
Reproducing data-driven science should be trivial
It often is not. Common problems:
- If available, datasets lack documentation
- Algorithms/methods are not open sourced
- Models rarely published
- Software installation/configuration require expertise
Our goal: Simplify publishing data-driven science
- Storing software and models
- Integrating them with compute resources

Integrating analytics tools with MDF
[Diagram: data flows from data repositories to compute resources to end users]
Data repositories:
- MDF Data Publication
- MATIN (GT): ~10 datasets, used in education
- MML Repository (NIST)
- Materials Commons (UM PRISMS)
- Coherent X-Ray Tomography Database (LNL)
Result: Scientists connected with data, analytics tools, and compute capability
Jetstream is a self-provisioned, scalable science and engineering cloud environment
operated by Indiana University for the National Science Foundation: jetstream-cloud.org

Building a machine learning model using MDF
A simple web service to train ML forcefields

Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with Aluminum data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Result: Improved performance by integrating data sources
Method: Botu et al. JPCC. (2017)
[Plots: training set vs. holdout set, shown for three cases: using only original data; including diffusion dataset; including 𝐷 + 𝑇# dataset]
Better performance in original application: No new DFT calculations

Reproducing data-driven MSE with MDF
• Summer intern Jiming Chen (UIUC) is reproducing and extending materials and ML papers with the MDF
• Joined our team with the NSF WholeTale project
Users publish data to the MDF… and code to WholeTale
Long-term goals:
- Assemble community-driven resource for ML tools/examples
- Use MDF/WholeTale to create benchmark challenges

Replicating Ward et al. 2016

DLHub: Advancing Deep Learning Adoption
• Publish and share models and code linked with full training datasets
• Link database with HPC/Cloud computing resources
• Provide uniform interface for training, running models

What is the MDF?
[Diagram: distributed data storage endpoints (EP) holding databases, datasets, APIs, LIMS, etc., connected to two services]
- Data publication service: publish, mint DOIs, associate metadata
- Data discovery service: deep indexing, query, browse, aggregate

Data publication service
• Mechanisms to create and enforce schemas and logical collections
• Web UI to create datasets and manage curation and admin tasks
• Tools to automate publication process
• Dataset record permanent landing page for DOI link
• Record shows some metadata, links to the rest
• Direct link to underlying files
• Download statistics
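Enforcing a schema at submission time can be as simple as checking required fields and their types before a dataset enters curation. The sketch below assumes a tiny hypothetical schema; the field names loosely echo DataCite-style descriptive metadata but are not the service's actual schema, and the example record is invented.

```python
# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "title": str,
    "creators": list,
    "publisher": str,
    "publication_year": int,
}


def validate(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in metadata:
            problems.append(f"missing field: {name}")
        elif not isinstance(metadata[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    # A dataset with an empty creator list is not citable.
    if isinstance(metadata.get("creators"), list) and not metadata["creators"]:
        problems.append("creators must not be empty")
    return problems


if __name__ == "__main__":
    record = {
        "title": "Example dataset",
        "creators": ["A. Researcher"],
        "publisher": "Example Repository",
        "publication_year": 2017,
    }
    print(validate(record))
    print(validate({"title": "Example dataset", "creators": []}))
```

Returning every problem at once, rather than stopping at the first, lets a submitter fix a record in one pass.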

Published Data Highlights
~30 datasets, ~6.5 TB
MATIN (GT): ~10 datasets, used in education
- X-ray Scattering Image Classification Using Deep Learning (Yager et al.)
  http://dx.doi.org/10.18126/M2Z30Z
- Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si (Marc De Graef et al.)
- Phase Field Benchmark I Dataset (Jokisaari et al.)
- Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension (Paranjape et al.)
  • Largest dataset to date (>1.5 TB). Showcases MDF unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking

Streamline & automate data publication
Data volumes: 12.5 TB; 12.4 TB out
Publication:
- Authors: 94
- Institutions: 14
- Accesses: >1000
- Total datasets: 50
- CHiMaD datasets: 16
Pipeline:
- CHiMaD datasets: +14
- Total datasets: +30

Advantages of Globus Publish
Capable of handling large datasets
- Publish data in place
- Integration with Globus Transfer/HTTPS
Deep indexing of materials-specific metadata
- Parse common materials data types
- Make data searchable on the file-level
Automatically re-publishing data elsewhere
- Publishing dataset metadata to MRR, Google Scholar, etc.
- Sending fine-grained metadata to other databases (e.g., Citrine)
In Progress: Know how often your data is used
- Track when it is used in analytics tools
All of these capabilities increase the value of your data

Why create the MDF?
http://materialsdatafacility.org
1. Make your data shareable
Custom access control, using institution credentials
2. Make your data open
Access to >100TB of storage space
3. Make your data accessible
Search across distributed resources
Automatic, domain-specific metadata extraction
4. Make your data computable
Tight integration with computing resources
5. Make your data valuable
Citable with DOIs, measured with usage stats

Thanks to our sponsors!
U.S. Department of Energy
