Logan Ward¹ (loganw@uchicago.edu),
Ben Blaiszik¹,² (blaiszik@uchicago.edu),
Ian Foster¹,² (foster@uchicago.edu), Ryan Chard²,
Jonathon Gaff¹, Kyle Chard¹, Jim Pruyne¹,
Rachana Ananthakrishnan¹, Steven Tuecke¹,
Michal Ondrejcek³, Kenton McHenry³, John Towns³
¹University of Chicago, ²Argonne National Laboratory, ³University of Illinois at Urbana-Champaign
materialsdatafacility.org
globus.org
Materials Data Facility: A Distributed Model for the Materials Data Community
15 August 2017
The Materials Data Facility Team
UC/Argonne
Ian Foster (PI), Ben Blaiszik, Steve Tuecke
Kyle Chard, Jim Pruyne
Logan Ward, Jonathon Gaff
Illinois (Urbana-Champaign)
Rachana Ananthakrishnan
John Towns (PI), Kenton McHenry
Michal Ondrejcek
Stephen Rosen
Ryan Chard
Data-Intensive Materials Science
Materials Databases, High-Throughput Screening, Machine Learning, Multi-scale Modeling
[Figures from: Kirklin et al. Acta Mat. (2016); de Jong et al. Sci Rep. (2016); Sparks et al. Scr. Mat. (2015); https://www.mpg.de/]
Data-Intensive Materials Science
Science is becoming limited by the ability to handle data
- Where to get it?
- How to selectively share it?
- Where to store it?
- How do you know what it is?
- How to build software that uses it?
- How to get others to share theirs?
- How to keep track of provenance?
- ….?
Our goal is to create easy answers to these questions
Why create the MDF?
1. Make your data shareable
Custom access control, using institution credentials
2. Make your data open
Access to >100TB of storage space
3. Make your data accessible
Search across distributed resources
Automatic, domain-specific metadata extraction
4. Make your data computable
Tight integration with computing resources
5. Make your data valuable
Citable with DOIs, measured with usage stats
What is the MDF?
[Diagram of MDF components]
- Data publication service: publish, mint DOIs, associate metadata
- Data discovery service: deep indexing, query, browse, aggregate
- Distributed data storage (EP)
- Sources: databases, datasets, APIs, LIMS, etc.
SHAREABLE AND OPEN DATA
Globus and the research data lifecycle
Locations: Instrument, Compute Facility, Personal Computer, Publication Repository
Transfer
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
Share
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
Publish
6. Researcher assembles data set; describes it using metadata (DataCite & domain-specific)
7. Curator reviews and approves; data set published on campus or other system
Discover
8. Peers, collaborators search and discover datasets; transfer and share using Globus
• Only a Web browser required
• Use storage system of your choice
• Access using your campus credentials
Data sharing and Globus
Easily control who gains access to your data:
- Globus can use University/Laboratory credentials
- You can establish groups of authorized users
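The access model above (individual identities plus groups of authorized users) can be illustrated with a small, self-contained sketch. This is not the Globus API: `SharedFolder`, `grant_user`, `grant_group`, `can_read`, and the example identities and paths are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFolder:
    """Toy model of a shared folder with per-user and per-group read access."""
    path: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)

    def grant_user(self, identity: str) -> None:
        self.allowed_users.add(identity)

    def grant_group(self, group: str) -> None:
        self.allowed_groups.add(group)

    def can_read(self, identity: str, memberships: set) -> bool:
        # Access is granted directly, or through any group the user belongs to.
        return identity in self.allowed_users or bool(self.allowed_groups & memberships)


# Hypothetical folder, identities, and group name.
folder = SharedFolder("/projects/alloy-dft/")
folder.grant_user("alice@uchicago.edu")
folder.grant_group("alloy-collaborators")

print(folder.can_read("alice@uchicago.edu", set()))             # direct grant
print(folder.can_read("bob@anl.gov", {"alloy-collaborators"}))  # via group membership
print(folder.can_read("eve@example.org", {"unrelated-group"}))  # neither
```

The point of the sketch is that group membership is checked at access time, so adding a collaborator to a group grants access without touching the folder itself.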
Data sharing and Globus
Simple to move data to/from any resource
Open data and Globus
Bottom Line: Globus provides a robust, highly developed, well-supported platform for sharing and managing open data
DATA ACCESSIBILITY
What do I mean by “accessibility”?
Need: Simplify finding and acquiring materials data
Major Challenges:
1. Data spread across many resources
§ Have to search each repository individually
§ Different services, different APIs to get data
2. Contents of resources are poorly described
§ Lack domain-specific metadata
Goal: Link together the world’s materials data resources, with enough metadata to make them useful
Part 1: Linking with the Data Community
[Diagram: MDF exchanges metadata, publishing, and compute services with Materials Project, Citrination, Materials Commons, and other facilities (APS, SNS, NSLS, …), institutional repositories, and publishers. Additional labels as extracted: "MD, Pub., Compute"; "NCSA-PIREHV/TMSMBDH"]
MDF data discovery ecosystem
[Diagram: data sources (EP, NIST MRR) are harvested, deep indexed, and registered/synced into the data discovery service; user interfaces query, browse, and aggregate; services and bots automate, process, refine, and analyze, taking data input and producing data output; MDF Pub Service]
Identify resources for indexing
MDF + NIST Database Tools
[Diagram: MDCS and NIST MRR connected to the data discovery service]
Ref: Dima, et al. JOM. 68 (2016), 2053. doi: 10.1007/s11837-016-2000-4
MDF automates publicizing data and provides a uniform search interface
Piping DFT data from MDF to Citrine
{ "category": "system.chemical",
"chemicalFormula": "MgO2",
"properties": {
"units": "eV", "name": "Band gap",
"scalars": [ { "value": 7.8 } ] } }
1. User publishes DFT dataset
2. Bot requests open DFT data periodically
3. Bot accesses data, runs DFT parser to refine data
4. Push metadata to Citrine
5. Ingest DFT data quality report
…
Our datasets are discoverable through many tools
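As a minimal sketch of steps 3 and 4, the snippet below maps a parsed DFT result onto the record layout shown on this slide. Only the output layout comes from the slide; the `to_citrine_record` function and the shape of the parser output are assumptions made for illustration, and nothing is actually sent to Citrine.

```python
import json


def to_citrine_record(parsed: dict) -> dict:
    """Map a parsed DFT result onto the record layout shown on the slide."""
    return {
        "category": "system.chemical",
        "chemicalFormula": parsed["formula"],
        "properties": {
            "units": "eV",
            "name": "Band gap",
            "scalars": [{"value": parsed["band_gap_ev"]}],
        },
    }


# Hypothetical parser output for a single calculation.
parsed_result = {"formula": "MgO2", "band_gap_ev": 7.8}
record = to_citrine_record(parsed_result)
print(json.dumps(record, indent=2))
```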
Part 2: A Materials Data Search Engine
Goal: Simplify finding useful data
Key Issue: Lack of metadata
Approaches:
1. Simplifying metadata capture from the source
2. Extracting useful information from dataset
Route 1: Integrating with LIMS/Workflow Tools
MAST
Materials Commons (MC)
T2C2 (4CeeD)
Engagement
• Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS)
• Promote IMaD data services, tools, and accomplishments to the community
• Develop video tutorials, webinars, and shared code repositories
• Interface with the Materials Accelerator Network (MAN)
• Engage with colleges, industry, and consortiums
§ (Wisconsin) Regional Materials and Manufacturing Network (RM2N)
§ (Illinois) Digital Manufacturing and Design Innovation Institute (DMDII)
§ (Michigan) LIFT consortium
Linking Software and Services
PIs: I. Foster¹,², J. Allison³, D. Morgan⁴, D. Trinkle⁵, P. Voorhees⁶
¹University of Chicago, ²Argonne National Laboratory, ³University of Michigan, ⁴University of Wisconsin-Madison, ⁵University of Illinois at Urbana-Champaign, ⁶Northwestern University
Overview
Linking Data from Major Facilities
• NSF Midwest Big Data Spoke
• Argonne Leadership Computing Facility (>1000 users/year)
§ Working with datasets that comprise ~300M core hours, with 200M more identified for near term
§ New joint effort to roll out MDF-like capabilities to ALCF users
• Advanced Photon Source (>5000 users/year)
§ Building pipelines and procedures to index and publish data from 15 beamlines (~1/3 of the facility) in conjunction with the APS software team (Schwartz)
• Advanced Light Source (>2000 users/year)
§ Integration with CAMERA project and associated tomography beamlines
Working with user facilities to facilitate capturing data/metadata
Ripple: Home automation for research data
doi:10.1109/ICDCSW.2017.30 (Ryan Chard)
Procedure for automating tomography experiments:
- At ALS: Detect new beamline data, and transfer it to NERSC
- At NERSC: Submit, run jobs on Edison, transfer data back to ALS
- At ALS: Create a shared endpoint, notify collaborators of result via email
Automate capturing results and metadata
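The tomography procedure above reads as a chain of "when this happens, do that" rules. The sketch below shows that trigger/action pattern in plain Python. It is not Ripple's implementation or API: `Rule`, `run_rules`, and the event fields are hypothetical, and the actions only relabel an event dictionary rather than moving data or submitting jobs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """A trigger/action pair: when trigger(event) is true, run action(event)."""
    name: str
    trigger: Callable[[dict], bool]
    action: Callable[[dict], dict]


def run_rules(rules: List[Rule], event: dict) -> List[str]:
    """Pass an event through the rules in order, chaining each action's output."""
    fired = []
    for rule in rules:
        if rule.trigger(event):
            event = rule.action(event)
            fired.append(rule.name)
    return fired


# Hypothetical rules mirroring the three stages on the slide.
rules = [
    Rule("transfer to NERSC",
         lambda e: e["site"] == "ALS" and e["type"] == "new_file",
         lambda e: {**e, "site": "NERSC", "type": "file_arrived"}),
    Rule("run analysis job",
         lambda e: e["site"] == "NERSC" and e["type"] == "file_arrived",
         lambda e: {**e, "type": "job_done"}),
    Rule("return results to ALS",
         lambda e: e["site"] == "NERSC" and e["type"] == "job_done",
         lambda e: {**e, "site": "ALS", "type": "results_ready"}),
    Rule("share and notify",
         lambda e: e["site"] == "ALS" and e["type"] == "results_ready",
         lambda e: {**e, "type": "shared"}),
]

steps = run_rules(rules, {"site": "ALS", "type": "new_file", "path": "scan_001.h5"})
print(steps)
```

Each rule only knows its own trigger, so a new stage can be added by appending a rule rather than rewriting the procedure.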
Route 2: Deep Indexing Materials Data
MDF Index
- Data resources indexed: 116
- Records: >3.4M
- Repositories harvested: 6
• MDF
• NIST MML Repo
• MATIN
• Materials Commons
• CXIDB
• NIST Materials Resource Registry
- Made discoverable: ~200 datasets, ~260 TB
Adding More Metadata to NIST MatDL
Dataset as published: limited metadata, querying difficult
Deep-indexed into the MDF: data available programmatically, can be used for scripting
Another benefit: domain-specific querying
Example service possible with DFT data files
Answer questions like:
“Do we have any data about anatase-TiO2?”
“Who else has studied Li-MnO3 batteries with DFT?”
Input: Crystal Structure File (.cif, VASP, etc.)
Output: Entries from MDF that are structurally-similar
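To make the first question concrete, here is a minimal sketch of a domain-specific query over an in-memory index. It matches on element sets and an optional phase label only, which is a much simpler stand-in for the structural-similarity search described on the slide. The index entries, field names, and the `elements_in` and `search` helpers are all hypothetical.

```python
import re


def elements_in(formula: str) -> frozenset:
    """Return the element symbols in a simple formula such as 'TiO2'."""
    return frozenset(symbol for symbol, _ in re.findall(r"([A-Z][a-z]?)(\d*)", formula))


# Hypothetical index entries; a real index would be built by parsing data files.
index = [
    {"source": "dataset-a", "formula": "TiO2", "phase": "anatase"},
    {"source": "dataset-b", "formula": "TiO2", "phase": "rutile"},
    {"source": "dataset-c", "formula": "LiMnO3", "phase": None},
]


def search(records, elements, phase=None):
    """Return records with exactly these elements and, if given, this phase."""
    wanted = frozenset(elements)
    return [
        r for r in records
        if elements_in(r["formula"]) == wanted
        and (phase is None or r["phase"] == phase)
    ]


hits = search(index, {"Ti", "O"}, phase="anatase")
print([h["source"] for h in hits])
```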
Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories
doi:10.1145/3085504.3091116 (Tyler Skluzacek)
Goal: Build intelligent search indexes with minimal human effort
Method: Employ machine learning to extract metadata from file repositories
- Classify data files
- Detect file types
Search otherwise-unusable data repositories
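The "detect file types" step can be pictured with a deliberately simple sketch. Note the substitution: the code below uses hand-written rules on file contents, not the statistical learning approach the slide describes, and `detect_file_type` and its labels are hypothetical. It only shows the kind of input and output such a step might have.

```python
import json


def detect_file_type(text: str) -> str:
    """Guess a coarse file type from the text content of a file."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    # Only try JSON when the content looks like an object or array.
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass
    lines = stripped.splitlines()
    # CIF files introduce data blocks with a line starting with 'data_'.
    if any(line.startswith("data_") for line in lines):
        return "cif"
    # Treat lines with a consistent, non-zero comma count as tabular data.
    comma_counts = {line.count(",") for line in lines}
    if len(comma_counts) == 1 and comma_counts.pop() > 0:
        return "csv"
    return "free-text"


samples = {
    "structure": "data_TiO2\n_cell_length_a 3.78\n",
    "table": "a,b,c\n1,2,3\n",
    "record": '{"formula": "TiO2"}',
    "notes": "Notes from beamtime",
}
labels = {name: detect_file_type(text) for name, text in samples.items()}
print(labels)
```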
APIs, Automation, and Examples
MDF Forge Python package (under development)
• Interface to MDF services
• Helper functions for common tasks
https://github.com/materials-data-facility/forge
Tools for using these capabilities will be available soon
COMPUTABLE DATA
Computable Data
Reproducing data-driven science should be trivial
It often is not. Common problems:
§ If available, datasets lack documentation
§ Algorithms/methods are not open sourced
§ Models rarely published
§ Software installation/configuration requires expertise
Our goal: Simplify publishing data-driven science
- Storing software and models
- Integrating them with compute resources
Integrating analytics tools with MDF
From data repositories:
- MDF Data Publication
- MATIN (GT): ~10 datasets, used in education
- MML Repository (NIST)
- Materials Commons (UM PRISMS)
- Coherent X-Ray Tomography Database (LNL)
To compute resources:
- Jetstream, a self-provisioned, scalable science and engineering cloud environment operated by Indiana University for the National Science Foundation: jetstream-cloud.org
To end users
Result: Scientists connected with data, analytics tools, and compute capability
Building a machine learning model using MDF
A simple web service to train ML forcefields
Example: Building force-field potentials from different datasets
Data resources: 3 DFT datasets with aluminum data
1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov
Method: Botu et al. JPCC. (2017)
[Figure panels: Training Set and Holdout Set, for three cases: using only original data; including diffusion dataset; including 𝐷 + 𝑇# dataset]
Result: Improved performance by integrating data sources
Better performance in original application: No new DFT calculations
Reproducing data-driven MSE with MDF
• Summer intern (Jiming Chen, UIUC) reproducing and extending materials and ML papers with the MDF
• Joined our team with the NSF WholeTale project
Users publish data to the MDF… and code to WholeTale
Long-term goals:
- Assemble community-driven resource for ML tools/examples
- Use MDF/WholeTale to create benchmark challenges
Replicating Ward et al. 2016
DLHub: Advancing Deep Learning Adoption
• Publish and share models and code linked with full training datasets
• Link database with HPC/Cloud computing resources
• Provide uniform interface for training, running models
INCREASING VALUE OF DATA
Data publication service
• Mechanisms to create and enforce schemas and logical collections
• Web UI to create datasets and manage curation and admin tasks
• Tools to automate publication process
• Dataset record: permanent landing page for DOI link
• Record shows some metadata, links to the rest
• Direct link to underlying files
• Download statistics
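Enforcing a schema at publication time can be sketched as a check run before a dataset is accepted. The field names below are lower-case stand-ins loosely modeled on DataCite's mandatory properties (identifier, creator, title, publisher, publication year, resource type); the slides do not show the MDF's actual schema, and `validate_metadata` is a hypothetical helper.

```python
# Assumed required fields, loosely modeled on DataCite's mandatory properties.
REQUIRED_FIELDS = {
    "identifier", "creators", "title",
    "publisher", "publication_year", "resource_type",
}


def validate_metadata(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    missing = sorted(REQUIRED_FIELDS - set(record))
    problems = [f"missing field: {name}" for name in missing]
    if "creators" in record and not record["creators"]:
        problems.append("creators must not be empty")
    year = record.get("publication_year")
    if year is not None and not (isinstance(year, int) and 1000 <= year <= 9999):
        problems.append("publication_year must be a four-digit integer")
    return problems


# Hypothetical draft submission with no publisher.
draft = {
    "identifier": "doi:placeholder",
    "creators": ["A. Researcher"],
    "title": "Example DFT dataset",
    "publication_year": 2017,
    "resource_type": "Dataset",
}
print(validate_metadata(draft))
```

Returning a list of problems, rather than raising on the first one, lets a submission form show the author everything that needs fixing in one pass.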
Published Data Highlights
~30 datasets, ~6.5 TB
- X-ray Scattering Image Classification Using Deep Learning, Yager et al. http://dx.doi.org/10.18126/M2Z30Z
- Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si, Marc De Graef et al.
- Phase Field Benchmark I Dataset, Jokisaari et al.
- Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension, Paranjape et al.
• Largest dataset to date (>1.5 TB). Showcases MDF unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking
- MATIN (GT): ~10 datasets, used in education
Datasets Are Citable
Streamline & automate data publication
- Data volumes: 12.5 TB, 12.4 TB out
- Publication authors: 94
- Institutions: 14
- Accesses: >1000
- Total datasets: 50
- CHiMaD datasets: 16
- Pipeline: CHiMaD datasets +14, total datasets +30
Advantages of Globus Publish
Capable of handling large datasets
§ Publish data in place
§ Integration with Globus Transfer/HTTPS
Deep indexing of materials-specific metadata
§ Parse common materials data types
§ Make data searchable on the file-level
Automatically re-publishing data elsewhere
§ Publishing dataset metadata to MRR, Google Scholar, etc.
§ Sending fine-grained metadata to other databases (e.g., Citrine)
In Progress: Know how often your data is used
§ Track when it is used in analytics tools
All of these capabilities increase the value of your data
Why create the MDF?
http://materialsdatafacility.org
1. Make your data shareable
Custom access control, using institution credentials
2. Make your data open
Access to >100TB of storage space
3. Make your data accessible
Search across distributed resources
Automatic, domain-specific metadata extraction
4. Make your data computable
Tight integration with computing resources
5. Make your data valuable
Citable with DOIs, measured with usage stats
Thanks to our sponsors!
U.S. Department of Energy

HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
Ritik83251
 
Physiology of Nervous System presentation.pptx
Physiology of Nervous System presentation.pptxPhysiology of Nervous System presentation.pptx
Physiology of Nervous System presentation.pptx
fatima132662
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR
 
Alternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart AgricultureAlternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
ABHISHEK SONI NIMT INSTITUTE OF MEDICAL AND PARAMEDCIAL SCIENCES , GOVT PG COLLEGE NOIDA
 

Recently uploaded (20)

The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
 
11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
Methods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdfMethods of grain storage Structures in India.pdf
Methods of grain storage Structures in India.pdf
 
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
 
Gadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdfGadgets for management of stored product pests_Dr.UPR.pdf
Gadgets for management of stored product pests_Dr.UPR.pdf
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
CLASS 12th CHEMISTRY SOLID STATE ppt (Animated)
CLASS 12th CHEMISTRY SOLID STATE ppt (Animated)CLASS 12th CHEMISTRY SOLID STATE ppt (Animated)
CLASS 12th CHEMISTRY SOLID STATE ppt (Animated)
 
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
Evidence of Jet Activity from the Secondary Black Hole in the OJ 287 Binary S...
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
 
Physiology of Nervous System presentation.pptx
Physiology of Nervous System presentation.pptxPhysiology of Nervous System presentation.pptx
Physiology of Nervous System presentation.pptx
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
 
Alternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart AgricultureAlternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart Agriculture
 
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
MICROBIAL INTERACTION PPT/ MICROBIAL INTERACTION AND THEIR TYPES // PLANT MIC...
 

The Materials Data Facility: A Distributed Model for the Materials Data Community

  • 1. Materials Data Facility: A Distributed Model for the Materials Data Community. 15 August 2017. Logan Ward1 (loganw@uchicago.edu), Ben Blaiszik1,2 (blaiszik@uchicago.edu), Ian Foster1,2 (foster@uchicago.edu), Ryan Chard2, Jonathon Gaff1, Kyle Chard1, Jim Pruyne1, Rachana Ananthakrishnan1, Steven Tuecke1, Michal Ondrejcek3, Kenton McHenry3, John Towns3. 1 University of Chicago, 2 Argonne National Laboratory, 3 University of Illinois at Urbana-Champaign. materialsdatafacility.org, globus.org
  • 2. The Materials Data Facility Team. UC/Argonne: Ian Foster (PI), Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Jonathon Gaff, Rachana Ananthakrishnan, Stephen Rosen, Ryan Chard. Illinois (Urbana-Champaign): John Towns (PI), Kenton McHenry, Michal Ondrejcek
  • 3. Data-Intensive Materials Science: Materials Databases, High-Throughput Screening, Machine Learning, Multi-scale Modeling. Image credits: Kirklin et al. Acta Mat. (2016); de Jong et al. Sci Rep. (2016); Sparks et al. Scr. Mat. (2015); https://www.mpg.de/
  • 4. Data-Intensive Materials Science. Science is becoming limited by the ability to handle data - Where to get it? - How to selectively share it? - Where to store it? - How do we know what it is? - How to build software that uses it? - How to get others to share theirs? - How to keep track of provenance? - ….? Our goal is to create easy answers to these questions
  • 5. Why create the MDF? 1. Make your data shareable: Custom access control, using institution credentials 2. Make your data open: Access to >100TB of storage space 3. Make your data accessible: Search across distributed resources; automatic, domain-specific metadata extraction 4. Make your data computable: Tight integration with computing resources 5. Make your data valuable: Citable with DOIs, measured with usage stats
  • 6. What is the MDF? Distributed data storage: Databases, Datasets, APIs, LIMS, etc. Data publication service: Publish, Mint DOIs, Associate metadata. Data discovery service: Deep indexing, Query, Browse, Aggregate
  • 7. SHAREABLE AND OPEN DATA
  • 8. Globus and the research data lifecycle (Transfer, Share, Publish, Discover; between Instrument, Compute Facility, Personal Computer, and Publication Repository) 1. Researcher initiates transfer request; or requested automatically by script, science gateway 2. Globus transfers files reliably, securely 3. Researcher selects files to share, selects user or group, and sets access permissions 4. Globus controls access to shared files on existing storage; no need to move files to cloud storage! 5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus 6. Researcher assembles data set; describes it using metadata (DataCite & domain-specific) 7. Curator reviews and approves; data set published on campus or other system 8. Peers, collaborators search and discover datasets; transfer and share using Globus • Only a Web browser required • Use storage system of your choice • Access using your campus credentials
  • 9. Data sharing and Globus. Easily control who gains access to your data: - Globus can use University/Laboratory credentials - You can establish groups of authorized users
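The two access-control ideas on this slide (individual credentials and groups of authorized users) can be illustrated with a small sketch. Everything here is hypothetical: `SharedFolder`, `can_read`, the identities, and the group name are invented for illustration and are not part of the Globus API.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFolder:
    """Hypothetical stand-in for a shared folder; not a Globus object."""
    path: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def can_read(folder, identity, memberships):
    """Return True if the identity, or any group it belongs to, is authorized."""
    if identity in folder.allowed_users:
        return True
    user_groups = memberships.get(identity, set())
    return any(group in folder.allowed_groups for group in user_groups)


# Assumed example identities (institutional logins) and one group.
memberships = {
    "alice@example.edu": {"project-collaborators"},
    "bob@example.gov": set(),
}
folder = SharedFolder(
    path="/data/shared/",
    allowed_users={"carol@example.edu"},
    allowed_groups={"project-collaborators"},
)

print(can_read(folder, "alice@example.edu", memberships))  # allowed via group
print(can_read(folder, "bob@example.gov", memberships))    # not authorized
```

Granting access to a group rather than to each person means membership changes take effect without touching the folder's permissions.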
  • 10. Data sharing and Globus. Simple to move data to/from any resource
  • 11. Open data and Globus
  • 12. Open data and Globus. Bottom Line: Globus provides a robust, highly developed, well-supported platform for sharing and managing open data
  • 14. What do I mean by “accessibility”? Need: Simplify finding and acquiring materials data. Major challenges: 1. Data are spread across many resources - Each repository has to be searched individually - Different services expose different APIs for getting data 2. Contents of resources are poorly described - Domain-specific metadata are lacking. Goal: Link together the world’s materials data resources, with enough metadata to make them useful
  • 15. Part 1: Linking with the Data Community. Materials Project; Citrination; Materials Commons; other facilities (APS, SNS, NSLS, …), institutional repositories, publishers! Diagram labels: Metadata; Publishing; MD, Pub., Compute; NCSA-PIRE; HV/TMS; MBDH
  • 16. MDF data discovery ecosystem. Diagram labels: Data Sources; Identify resources for indexing; Harvest; Deep index; Register / Sync; NIST MRR; Data discovery service; User Interfaces (Query, Browse, Aggregate); Services and Bots (Automate, Process, Refine, Analyze); MDF Pub Service; Data Input; Data Output
  • 17. MDF + NIST Database Tools. Data discovery service, MDCS, NIST MRR. Ref: Dima, et al. JOM. 68 (2016), 2053. doi: 10.1007/s11837-016-2000-4
  • 18. MDF + NIST Database Tools. Data discovery service, MDCS, NIST MRR. MDF automates publicizing data and provides a uniform search interface
  • 19. Piping DFT data from MDF to Citrine. 1. User publishes DFT dataset 2. Bot requests open DFT data periodically 3. Bot accesses data, runs DFT parser to refine data 4. Push metadata to Citrine 5. Ingest DFT data quality report … Example record: { "category": "system.chemical", "chemicalFormula": "MgO2", "properties": { "units": "eV", "name": "Band gap", "scalars": [ { "value": 7.8 } ] } } Our datasets are discoverable through many tools
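As a rough illustration of step 3 to step 4, the record shown on this slide can be assembled and serialized with the standard library alone. The helper name `band_gap_record` is hypothetical, and the field layout is copied from the slide rather than from any Citrine client library; how the bot actually pushes the record is not shown in the slides.

```python
import json


def band_gap_record(formula, gap_ev):
    """Build a record shaped like the example on the slide (layout assumed from the slide)."""
    return {
        "category": "system.chemical",
        "chemicalFormula": formula,
        "properties": {
            "units": "eV",
            "name": "Band gap",
            "scalars": [{"value": gap_ev}],
        },
    }


# Values taken from the slide's example record.
record = band_gap_record("MgO2", 7.8)

# Serialize to JSON text, the form a bot would send to a downstream service.
payload = json.dumps(record, indent=2)
print(payload)
```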
  • 20. Part 2: A Materials Data Search Engine. Goal: Simplify finding useful data. Key Issue: Lack of metadata. Approaches: 1. Simplifying metadata capture from the source 2. Extracting useful information from datasets
  • 21. Route 1: Integrating with LIMS/Workflow Tools: MAST, Materials Commons (MC), T2C2 (4CeeD). Embedded poster, NSF Midwest Big Data Spoke. PIs: I. Foster1,2, J. Allison3, D. Morgan4, D. Trinkle5, P. Voorhees6 (1 University of Chicago, 2 Argonne National Laboratory, 3 University of Michigan, 4 University of Wisconsin-Madison, 5 University of Illinois at Urbana-Champaign, 6 Northwestern University). Sections: Overview; Linking Software and Services; Engagement • Build connections to international materials efforts and registries (e.g., NIMS, RDA, NIST, EUDAT, NDS) • Promote IMaD data services, tools, and accomplishments to the community • Develop video tutorials, webinars, and shared code repositories • Interface with the Materials Accelerator Network (MAN) • Engage with colleges, industry, and consortiums • (Wisconsin) Regional Materials and Manufacturing Network (RM2N) • (Illinois) Digital Manufacturing and Design Innovation Institute (DMDII) • (Michigan) LIFT consortium
  • 22. Linking Data from Major Facilities: Working with user facilities to facilitate capturing data/metadata • Argonne Leadership Computing Facility (>1000 users/year) - Working with datasets that comprise ~300M core hours, with 200M more identified for the near term - New joint effort to roll out MDF-like capabilities to ALCF users • Advanced Photon Source (>5000 users/year) - Building pipelines and procedures to index and publish data from 15 beamlines (~1/3 of the facility) in conjunction with the APS software team (Schwartz) • Advanced Light Source (>2000 users/year) - Integration with CAMERA project and associated tomography beamlines
  • 23. Ripple: Home automation for research data (Ryan Chard). doi:10.1109/ICDCSW.2017.30. Automate capturing results and metadata. Procedure for automating tomography experiments: - At ALS: Detect new beamline data, and transfer it to NERSC - At NERSC: Submit, run jobs on Edison, transfer data back to ALS - At ALS: Create a shared endpoint, notify collaborators of result via email
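The tomography procedure above reads as a chain of "when this happens, do that" rules. The following is a minimal sketch of that trigger-action pattern only; it is not Ripple's actual interface. The `Rule` class, the event fields (`site`, `type`, `path`), and the event type names are all assumptions, and the actions just return a description instead of moving data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical trigger-action pair: run `action` when `trigger` matches an event."""
    trigger: Callable[[dict], bool]
    action: Callable[[dict], str]


def dispatch(event, rules):
    """Return the description of every action whose trigger matches the event."""
    return [rule.action(event) for rule in rules if rule.trigger(event)]


# One rule per step of the procedure on the slide (event names are invented).
rules = [
    Rule(lambda e: e["site"] == "ALS" and e["type"] == "file_created",
         lambda e: f"transfer {e['path']} to NERSC"),
    Rule(lambda e: e["site"] == "NERSC" and e["type"] == "transfer_done",
         lambda e: f"submit job for {e['path']} and return results to ALS"),
    Rule(lambda e: e["site"] == "ALS" and e["type"] == "results_returned",
         lambda e: f"share {e['path']} and email collaborators"),
]

print(dispatch({"site": "ALS", "type": "file_created", "path": "scan_001"}, rules))
```

Keeping each step as an independent rule means a site only needs to watch for its own events; no central script has to run the whole workflow.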
  • 24. Route 2: Deep Indexing Materials Data. MDF Index: 116 data resources indexed; >3.4M records; 6 repositories harvested (MDF, NIST MML Repo, MATIN, Materials Commons, CXIDB, NIST Materials Resource Registry); ~200 datasets and ~260 TB made discoverable
  • 25. Adding More Metadata to NIST MatDL. Dataset as published: limited metadata, querying difficult
  • 26. Adding More Metadata to NIST MatDL. Deep-indexed into the MDF: data available programmatically
  • 27. Adding More Metadata to NIST MatDL. Deep-indexed into the MDF: can be used for scripting
  • 28. Another benefit: domain-specific querying. Example service possible with DFT data files: submit a crystal structure file (.cif, VASP, etc.) and get back entries from MDF that are structurally similar. Answer questions like: “Do we have any data about anatase-TiO2?” “Who else has studied Li-MnO3 batteries with DFT?”
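A much simpler cousin of the service described here is a composition filter over indexed records. The sketch below assumes a hypothetical list of records with `composition`, `phase`, and `source` fields; it matches on the set of elements only (not stoichiometry or structure), so it is a stand-in for, not an implementation of, structural-similarity search.

```python
import re


def elements(formula):
    """Extract the set of element symbols from a simple formula such as 'TiO2'."""
    return frozenset(re.findall(r"[A-Z][a-z]?", formula))


def find_by_composition(records, formula):
    """Return records whose composition contains exactly the same elements."""
    target = elements(formula)
    return [r for r in records if elements(r["composition"]) == target]


# Invented example records; real index entries would carry far more metadata.
records = [
    {"composition": "TiO2", "phase": "anatase", "source": "dataset-a"},
    {"composition": "TiO2", "phase": "rutile", "source": "dataset-b"},
    {"composition": "LiMnO3", "phase": None, "source": "dataset-c"},
]

# "Do we have any data about anatase-TiO2?"
hits = [r for r in find_by_composition(records, "TiO2") if r["phase"] == "anatase"]
print([r["source"] for r in hits])
```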
  • 29. Skluma: A Statistical Learning Pipeline for Taming Unkempt Data Repositories (Tyler Skluzacek). doi:10.1145/3085504.3091116. Search otherwise-unusable data repositories. Goal: Build intelligent search indexes with minimal human effort. Method: Employ machine learning to extract metadata from file repositories - Classify data files - Detect file types
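To make "detect file types" concrete, here is a deliberately simple rule-based detector. It is not Skluma's method, which the slide describes as machine learning; the function name `guess_file_type`, the four labels, and the heuristics are all assumptions chosen for illustration.

```python
import json


def guess_file_type(raw: bytes) -> str:
    """Guess a coarse file type from content alone, ignoring the file extension."""
    # NUL bytes or undecodable content suggest a binary format.
    if b"\x00" in raw:
        return "binary"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"

    # Structured JSON: must parse to an object or an array.
    try:
        if isinstance(json.loads(text), (dict, list)):
            return "json"
    except ValueError:
        pass

    # Tabular: at least two non-empty lines with the same non-zero comma count.
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) >= 2:
        comma_counts = {line.count(",") for line in lines}
        if len(comma_counts) == 1 and comma_counts.pop() > 0:
            return "tabular"

    return "free text"


print(guess_file_type(b"x,y\n1,2\n3,4\n"))
```

A learned classifier would replace these hand-written rules with features extracted from file samples, which is what lets it cope with formats nobody anticipated.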
  • 30. APIs, Automation, and Examples. MDF Forge Python package (under development) • Interface to MDF services • Helper functions for common tasks. https://github.com/materials-data-facility/forge Tools for using these capabilities will be available soon
  • 32. Computable Data. Reproducing data-driven science should be trivial. It often is not. Common problems: - Datasets, when available, lack documentation - Algorithms/methods are not open source - Models are rarely published - Software installation/configuration requires expertise. Our goal: Simplify publishing data-driven science - Storing software and models - Integrating them with compute resources
  • 33. Integrating analytics tools with MDF. From data repositories: MDF Data Publication, MATIN (GT; ~10 datasets, used in education), MML Repository (NIST), Materials Commons (UM PRISMS), Coherent X-Ray Tomography Database (LNL). To compute resources: Jetstream, a self-provisioned, scalable science and engineering cloud environment operated by Indiana University for the National Science Foundation (jetstream-cloud.org). To end users. Result: Scientists connected with data, analytics tools, and compute capability
  • 34. Building a machine learning model using MDF. A simple web service to train ML force fields
  • 35. Building a machine learning model using MDF
  • 36. Building a machine learning model using MDF. Example: Building force-field potentials from different datasets. Data resources: 3 DFT datasets with aluminum data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov). Method: Botu et al. JPCC. (2017). Result: Improved performance by integrating data sources. Plot panels (training set and holdout set): Using only original data
  • 37. Building a machine learning model using MDF. Example: Building force-field potentials from different datasets. Data resources: 3 DFT datasets with aluminum data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov). Method: Botu et al. JPCC. (2017). Result: Improved performance by integrating data sources. Plot panels (training set and holdout set): Using only original data; Including Diffusion Dataset
  • 38. Building a machine learning model using MDF. Example: Building force-field potentials from different datasets. Data resources: 3 DFT datasets with aluminum data (1 dataset from khazana.uconn.edu, 2 datasets from materialsdata.nist.gov). Method: Botu et al. JPCC. (2017). Result: Improved performance by integrating data sources. Plot panels (training set and holdout set): Using only original data; Including Diffusion Dataset; Including 𝐷 + 𝑇# Dataset. Better performance in original application: No new DFT calculations
  • 39. Reproducing data-driven MSE with MDF • Summer intern Jiming Chen (UIUC) is reproducing and extending materials and ML papers with the MDF • Teamed up with the NSF WholeTale project: users publish data to the MDF and code to WholeTale. Long-term goals: - Assemble a community-driven resource for ML tools/examples - Use MDF/WholeTale to create benchmark challenges
  • 40. Replicating Ward et al. 2016
  • 41. DLHub: Advancing Deep Learning Adoption • Publish and share models and code linked with full training datasets • Link database with HPC/Cloud computing resources • Provide uniform interface for training, running models
  • 42. INCREASING VALUE OF DATA
  • 43. What is the MDF? Distributed data storage: Databases, Datasets, APIs, LIMS, etc. Data publication service: Publish, Mint DOIs, Associate metadata. Data discovery service: Deep indexing, Query, Browse, Aggregate
  • 44. Data publication service • Mechanisms to create and enforce schemas and logical collections • Web UI to create datasets and manage curation and admin tasks • Tools to automate the publication process • Dataset record: permanent landing page for the DOI link • Record shows some metadata and links to the rest • Direct link to underlying files • Download statistics
  • 45. Published Data Highlights. ~30 datasets, ~6.5 TB. MATIN (GT): ~10 datasets, used in education. Highlighted datasets: - X-ray Scattering Image Classification Using Deep Learning, Yager et al. (http://dx.doi.org/10.18126/M2Z30Z) - Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si, Marc De Graef et al. - Phase Field Benchmark I Dataset, Jokisaari et al. - Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension, Paranjape et al. • Largest dataset to date (>1.5 TB). Showcases MDF unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking
  • 47. Streamline & automate data publication. Data volumes: 12.5 TB; 12.4 TB out. Publication: 94 authors; 14 institutions; >1000 accesses; 50 total datasets; 16 CHiMaD datasets. Pipeline: +14 CHiMaD datasets; +30 total datasets
  • 48. Advantages of Globus Publish. Capable of handling large datasets - Publish data in place - Integration with Globus Transfer/HTTPS. Deep indexing of materials-specific metadata - Parse common materials data types - Make data searchable at the file level. Automatically re-publishing data elsewhere - Publishing dataset metadata to MRR, Google Scholar, etc. - Sending fine-grained metadata to other databases (e.g., Citrine). In Progress: Know how often your data is used - Track when it is used in analytics tools. All of these capabilities increase the value of your data
  • 49. Why create the MDF? http://materialsdatafacility.org 1. Make your data shareable: Custom access control, using institution credentials 2. Make your data open: Access to >100TB of storage space 3. Make your data accessible: Search across distributed resources; automatic, domain-specific metadata extraction 4. Make your data computable: Tight integration with computing resources 5. Make your data valuable: Citable with DOIs, measured with usage stats
  • 50. Thanks to our sponsors! U.S. Department of Energy