A Data Ecosystem to Support Machine Learning in Materials Science

Globus
https://www.materialsdatafacility.org
Ben Blaiszik (blaiszik@uchicago.edu), Jonathon Gaff,
Logan Ward, Ryan Chard, Marcus Schwarting, Kyle
Chard, Steven Tuecke, Ian Foster
A Data Ecosystem to Support
Machine Learning in Materials
Science
https://www.dlhub.org
1
A Growing Opportunity in Machine Learning in the Sciences
ML in Science
• Number of publications across domains is growing
rapidly
• Access to datasets is improving, but still a challenge
• Access to models and codes is a particular challenge
• Benchmarks are lagging
ML and Informatics in Materials
Model ??
Code ???
Data ????
A Motivation
2
An Emerging Data Infrastructure Ecosystem
Models and
Code
ComputeData
• Data: capture, organize, analyze, share move
and deliver data
• Compute: accessible at many scales, types, and
locations
• Models: discoverable, described, and portable to
wherever data and/or computer are located
• Workflows: easily discovered, adapted,
composed, scaled, and reused
We need and can build a cohesive infrastructure for AI-driven science
3
An Emerging Data Infrastructure Ecosystem
Models and
Code
ComputeData
CH MaD
• Data: capture, organize, analyze, share move
and deliver data
• Compute: accessible at many scales, types, and
locations
• Models: discoverable, described, and portable to
wherever data and/or computer are located
• Workflows: easily discovered, adapted,
composed, scaled, and reused
We need and can build a cohesive infrastructure for AI-driven science
4
Materials Data Facility (MDF) Overview
Build data services to
• Empower researchers to publish data, regardless
of size, type, and location
• Automate data and metadata extraction and ingest
• Enable unified search and discovery across
disparate materials data sources
Deploy with APIs to simplify connection to
other data efforts and to enable
automation
CH MaD5
The Materials Data Facility (MDF)
CH MaD
• Connect: Extract domain-relevant
metadata / transform the data
• Publish: Built to handle big data
(many TB, millions of files),
provides persistent identifier for
data, distributed storage enabled
• Discover: Programmatic search
index to aggregate and retrieve
data across hundreds of indexed
data sources
• Currently holds ~30TB of data
from over 150 authors, millions of
individual results
• Auth
• Transfer
• Groups
• Search
• Connectors
6
MDF – Enabling Flexible Data Publication
Key features
§ Receive a citable DOI for your dataset
§ Native support for large datasets
§ millions of files or many TB of data
§ Distributed storage enabled
§ Host data on high availability, reliable,
performant storage (ALCF, UIUC) or
your own storage cluster
§ Currently hold over 30 TB of data from
over 150 authors
§ Free of charge for researchers
Publish via Web Interface
Or with a Python script
CH MaDhttps://www.materialsdatafacility.org7
MDF – Enabling Flexible Data Publication
Key features
§ Receive a citable DOI for your dataset
§ Native support for large datasets
§ millions of files or many TB of data
§ Distributed storage enabled
§ Host data on high availability, reliable,
performant storage (ALCF, UIUC) or
your own storage cluster
§ Currently hold over 30 TB of data from
over 150 authors
§ Free of charge for researchers
Publish via Python Script
CH MaDhttps://www.materialsdatafacility.org8
An Emerging Data Infrastructure Ecosystem
Models and
Code
ComputeData
CH MaD
• Data: capture, organize, analyze, share move
and deliver data
• Compute: accessible at many scales, types, and
locations
• Models: discoverable, described, and portable to
wherever data and/or computer are located
• Workflows: easily discovered, adapted,
composed, scaled, and reused
We need and can build a cohesive infrastructure for AI-driven science
9
• Collect, publish, categorize models and
processing logic from many disciplines
(materials science, physics, chemistry,
genomics, etc.)
• Serve models via DLHub operated service to
simplify sharing, consumption, and access
• Mint persistent identifiers for all artifacts
• Enable new science through reuse, real-time
integration, and synthesis of existing models
DLHub – A Data and Learning Hub for Science
10
DLHub – A Data and Learning Hub for Science
Cherukara et al., 2018
Energy Storage TomographyX-Ray Science
Input Output
• Predict molecular energies with G4MP2
accuracy at B3LYP cost
• Data available in MDF
• Enhance tomographic scans and remove
noise using generative adversarial model
• Example data available on Petrel
• Predict structure and
phase of a material
given coherent
diffraction intensity
• Data available from
Github
Exascale Cancer
Research
11
Bringing it Together
Get Data
Run Model
2018 Argonne Adv. Computing LDRD12
2018 Argonne Adv. Computing LDRD
Get Data
Run Model
Models and
Code
ComputeData
CH MaD
Bringing it Together
• Auth
• Transfer
• Groups
• Search
• Connectors
13
References
Websites
• https://www.dlhub.org
• https://www.materialsdatafacility.org
Project Papers
• DLHub
– Blaiszik, Ben, Logan Ward, Marcus Schwarting, Jonathon Gaff, Ryan Chard, Daniel Pike, Kyle Chard, and Ian Foster. "A Data
Ecosystem to Support Machine Learning in Materials Science." arXiv preprint arXiv:1904.10423 (2019).
– Chard, Ryan, Zhuozhao Li, Kyle Chard, Logan Ward, Yadu Babuji, Anna Woodard, Steve Tuecke, Ben Blaiszik, Michael J.
Franklin, and Ian Foster. "DLHub: Model and Data Serving for Science." arXiv preprint arXiv:1811.11213 (2018).
– Blaiszik, Ben, Kyle Chard, Ryan Chard, Ian Foster, and Logan Ward. "Data automation at light sources." In AIP Conference
Proceedings, vol. 2054, no. 1, p. 020003. AIP Publishing, 2019.
• MDF
– Blaiszik, Ben, Logan Ward, Marcus Schwarting, Jonathon Gaff, Ryan Chard, Daniel Pike, Kyle Chard, and Ian Foster. "A Data
Ecosystem to Support Machine Learning in Materials Science." arXiv preprint arXiv:1904.10423 (2019).
– Blaiszik, B., K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, and I. Foster. "The materials data facility: data services to
advance materials science research." JOM 68, no. 8 (2016): 2045-2052.
14
CH MaD
Thanks to our sponsors!
U.S. DEPARTMENT OF
ENERGY
ALCF DF
Parsl Globus IMaD
DLHub Argonne
LDRD
15
1 of 15

Recommended

Recent Upgrades to ARM Data Transfer and Delivery Using Globus by
Recent Upgrades to ARM Data Transfer and Delivery Using GlobusRecent Upgrades to ARM Data Transfer and Delivery Using Globus
Recent Upgrades to ARM Data Transfer and Delivery Using GlobusGlobus
377 views11 slides
Enabling Secure Data Discoverability (SC21 Tutorial) by
Enabling Secure Data Discoverability (SC21 Tutorial)Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)Globus
106 views80 slides
SomeSlides by
SomeSlidesSomeSlides
SomeSlidesguestd60742
496 views40 slides
Globus: Enabling the Open Storage Network by
Globus: Enabling the Open Storage NetworkGlobus: Enabling the Open Storage Network
Globus: Enabling the Open Storage NetworkGlobus
435 views14 slides
Grid Computing July 2009 by
Grid Computing July 2009Grid Computing July 2009
Grid Computing July 2009Ian Foster
1.6K views57 slides
20090701 Climate Data Staging by
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data StagingHenning Bergmeyer
1.2K views31 slides

More Related Content

What's hot

NIH Data Commons Architecture Ideas by
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasIan Foster
587 views9 slides
Cenitpede: Analyzing Webcrawl by
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlPrimal Pappachan
3.3K views17 slides
Mining a Large Web Corpus by
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web CorpusRobert Meusel
12.5K views21 slides
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012 by
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012Amazon Web Services
2.9K views29 slides
The Web of data and web data commons by
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commonsJesse Wang
4.3K views58 slides
London HUG by
London HUGLondon HUG
London HUGBoudicca
354 views19 slides

What's hot(20)

NIH Data Commons Architecture Ideas by Ian Foster
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
Ian Foster587 views
Mining a Large Web Corpus by Robert Meusel
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web Corpus
Robert Meusel12.5K views
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012 by Amazon Web Services
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012
Amazon Web Services2.9K views
The Web of data and web data commons by Jesse Wang
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
Jesse Wang4.3K views
London HUG by Boudicca
London HUGLondon HUG
London HUG
Boudicca354 views
Using the whole web as your dataset by Turi, Inc.
Using the whole web as your datasetUsing the whole web as your dataset
Using the whole web as your dataset
Turi, Inc.1.3K views
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS by Ed Dodds
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWSExperiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Ed Dodds594 views
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information by Kai Schlegel
balloon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Informationballoon Fusion: SPARQL Rewriting Based on  Unified Co-Reference Information
balloon Fusion: SPARQL Rewriting Based on Unified Co-Reference Information
Kai Schlegel1.1K views
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ... by Robert Meusel
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...
A Web-scale Study of the Adoption and Evolution of the schema.org Vocabulary ...
Robert Meusel1.8K views
Simplified Research Data Management with the Globus Platform by Globus
Simplified Research Data Management with the Globus PlatformSimplified Research Data Management with the Globus Platform
Simplified Research Data Management with the Globus Platform
Globus 104 views
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC by Geoffrey Fox
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC
Geoffrey Fox3K views
51 Use Cases and implications for HPC & Apache Big Data Stack by Geoffrey Fox
51 Use Cases and implications for HPC & Apache Big Data Stack51 Use Cases and implications for HPC & Apache Big Data Stack
51 Use Cases and implications for HPC & Apache Big Data Stack
Geoffrey Fox2.5K views
Automating Research Data Management at Scale with Globus by Globus
Automating Research Data Management at Scale with GlobusAutomating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with Globus
Globus 221 views
Research Automation for Data-Driven Discovery by Globus
Research Automationfor Data-Driven DiscoveryResearch Automationfor Data-Driven Discovery
Research Automation for Data-Driven Discovery
Globus 173 views
20160922 Materials Data Facility TMS Webinar by Ben Blaiszik
20160922 Materials Data Facility TMS Webinar20160922 Materials Data Facility TMS Webinar
20160922 Materials Data Facility TMS Webinar
Ben Blaiszik187 views
Insight Data Engineering project by Hoa Nguyen
Insight Data Engineering projectInsight Data Engineering project
Insight Data Engineering project
Hoa Nguyen1.1K views
Gateways 2020 Tutorial - Instrument Data Distribution with Globus by Globus
Gateways 2020 Tutorial - Instrument Data Distribution with GlobusGateways 2020 Tutorial - Instrument Data Distribution with Globus
Gateways 2020 Tutorial - Instrument Data Distribution with Globus
Globus 146 views
Exploration of multidimensional biomedical data in pub chem, Presented by Lia... by Lucidworks (Archived)
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Globus Integrations (GlobusWorld Tour - UMich) by Globus
Globus Integrations (GlobusWorld Tour - UMich)Globus Integrations (GlobusWorld Tour - UMich)
Globus Integrations (GlobusWorld Tour - UMich)
Globus 74 views

Similar to A Data Ecosystem to Support Machine Learning in Materials Science

Materials Data Facility: Streamlined and automated data sharing, discovery, ... by
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...Ian Foster
880 views39 slides
Learning Systems for Science by
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
454 views34 slides
The Materials Data Facility: A Distributed Model for the Materials Data Commu... by
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...Ben Blaiszik
320 views50 slides
Hadoop ecosystem for health/life sciences by
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesUri Laserson
4.1K views57 slides
Big Process for Big Data @ PNNL, May 2013 by
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
899 views65 slides
Publishing and Serving Machine Learning Models with DLHub by
Publishing and Serving Machine Learning Models with DLHubPublishing and Serving Machine Learning Models with DLHub
Publishing and Serving Machine Learning Models with DLHubGlobus
181 views22 slides

Similar to A Data Ecosystem to Support Machine Learning in Materials Science(20)

Materials Data Facility: Streamlined and automated data sharing, discovery, ... by Ian Foster
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Ian Foster880 views
Learning Systems for Science by Ian Foster
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
Ian Foster454 views
The Materials Data Facility: A Distributed Model for the Materials Data Commu... by Ben Blaiszik
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
Ben Blaiszik320 views
Hadoop ecosystem for health/life sciences by Uri Laserson
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
Uri Laserson4.1K views
Big Process for Big Data @ PNNL, May 2013 by Ian Foster
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
Ian Foster899 views
Publishing and Serving Machine Learning Models with DLHub by Globus
Publishing and Serving Machine Learning Models with DLHubPublishing and Serving Machine Learning Models with DLHub
Publishing and Serving Machine Learning Models with DLHub
Globus 181 views
Building genomic data cyberinfrastructure with the online database software T... by mestato
Building genomic data cyberinfrastructure with the online database software T...Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...
mestato331 views
Foundations for the Future of Science by Globus
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
Globus 223 views
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A... by BigData_Europe
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...
BigData_Europe423 views
Impact of Covid-19 on Learning and Education by MANENDRASINGH30
Impact of Covid-19 on Learning and EducationImpact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and Education
MANENDRASINGH30135 views
Networked Science, And Integrating with Dataverse by Anita de Waard
Networked Science, And Integrating with DataverseNetworked Science, And Integrating with Dataverse
Networked Science, And Integrating with Dataverse
Anita de Waard596 views
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P... by Bertram Ludäscher
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Bertram Ludäscher684 views
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa... by DeVonne Parks, CEM
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
December 9, 2015 NISO Webinar: Two-Part Webinar: Emerging Resource Types - Pa...
DeVonne Parks, CEM1.1K views
Open science, open-source, and open data: Collaboration as an emergent property? by Hilmar Lapp
Open science, open-source, and open data: Collaboration as an emergent property?Open science, open-source, and open data: Collaboration as an emergent property?
Open science, open-source, and open data: Collaboration as an emergent property?
Hilmar Lapp2.7K views
Datat and donuts: how to write a data management plan by C. Tobin Magle
Datat and donuts: how to write a data management planDatat and donuts: how to write a data management plan
Datat and donuts: how to write a data management plan
C. Tobin Magle451 views
Elephant in the Room: Scaling Storage for the HathiTrust Research Center by Robert H. McDonald
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterElephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
Robert H. McDonald1.4K views
Data Science with the Help of Metadata by Jim Dowling
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
Jim Dowling640 views
re:Invent 2013-foster-madduri by Ravi Madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
Ravi Madduri973 views
Being FAIR: FAIR data and model management SSBSS 2017 Summer School by Carole Goble
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Carole Goble978 views

More from Globus

Introduction to Globus for System Administrators by
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System AdministratorsGlobus
11 views55 slides
Introduction to Data Transfer and Sharing for Researchers by
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for ResearchersGlobus
4 views33 slides
Introduction to the Globus Platform for Developers by
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for DevelopersGlobus
4 views28 slides
Introduction to the Command Line Interface (CLI) by
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)Globus
12 views12 slides
Automating Research Data with Globus Flows and Compute by
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and ComputeGlobus
6 views60 slides
Automating Research Data Flows and Introduction to the Globus Platform by
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus PlatformGlobus
50 views41 slides

More from Globus (20)

Introduction to Globus for System Administrators by Globus
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
Globus 11 views
Introduction to Data Transfer and Sharing for Researchers by Globus
Introduction to Data Transfer and Sharing for ResearchersIntroduction to Data Transfer and Sharing for Researchers
Introduction to Data Transfer and Sharing for Researchers
Globus 4 views
Introduction to the Globus Platform for Developers by Globus
Introduction to the Globus Platform for DevelopersIntroduction to the Globus Platform for Developers
Introduction to the Globus Platform for Developers
Globus 4 views
Introduction to the Command Line Interface (CLI) by Globus
Introduction to the Command Line Interface (CLI)Introduction to the Command Line Interface (CLI)
Introduction to the Command Line Interface (CLI)
Globus 12 views
Automating Research Data with Globus Flows and Compute by Globus
Automating Research Data with Globus Flows and ComputeAutomating Research Data with Globus Flows and Compute
Automating Research Data with Globus Flows and Compute
Globus 6 views
Automating Research Data Flows and Introduction to the Globus Platform by Globus
Automating Research Data Flows and Introduction to the Globus PlatformAutomating Research Data Flows and Introduction to the Globus Platform
Automating Research Data Flows and Introduction to the Globus Platform
Globus 50 views
Advanced Globus System Administration by Globus
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
Globus 26 views
Introduction to Globus for System Administrators by Globus
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
Globus 94 views
Introduction to Globus for New Users by Globus
Introduction to Globus for New UsersIntroduction to Globus for New Users
Introduction to Globus for New Users
Globus 55 views
Working with Globus Platform Services and Portals by Globus
Working with Globus Platform Services and PortalsWorking with Globus Platform Services and Portals
Working with Globus Platform Services and Portals
Globus 28 views
Globus Automation by Globus
Globus AutomationGlobus Automation
Globus Automation
Globus 20 views
Advanced Globus System Administration by Globus
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
Globus 21 views
Introduction to Globus by Globus
Introduction to GlobusIntroduction to Globus
Introduction to Globus
Globus 38 views
Introduction to Globus for System Administrators by Globus
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
Globus 27 views
Working with Globus Platform Services by Globus
Working with Globus Platform ServicesWorking with Globus Platform Services
Working with Globus Platform Services
Globus 41 views
Advanced Globus System Administration by Globus
Advanced Globus System AdministrationAdvanced Globus System Administration
Advanced Globus System Administration
Globus 29 views
Introduction to Globus for System Administrators by Globus
Introduction to Globus for System AdministratorsIntroduction to Globus for System Administrators
Introduction to Globus for System Administrators
Globus 145 views
Using Globus to Streamline Research at Scale by Globus
Using Globus to Streamline Research at ScaleUsing Globus to Streamline Research at Scale
Using Globus to Streamline Research at Scale
Globus 30 views
Introduction to Globus for Researchers by Globus
Introduction to Globus for ResearchersIntroduction to Globus for Researchers
Introduction to Globus for Researchers
Globus 89 views
Automating Research Data Flows and an Introduction to the Globus Platform by Globus
Automating Research Data Flows and an Introduction to the Globus PlatformAutomating Research Data Flows and an Introduction to the Globus Platform
Automating Research Data Flows and an Introduction to the Globus Platform
Globus 132 views

Recently uploaded

ChatGPT and AI for Web Developers by
ChatGPT and AI for Web DevelopersChatGPT and AI for Web Developers
ChatGPT and AI for Web DevelopersMaximiliano Firtman
181 views82 slides
The Research Portal of Catalonia: Growing more (information) & more (services) by
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
73 views25 slides
Attacking IoT Devices from a Web Perspective - Linux Day by
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day Simone Onofri
15 views68 slides
Tunable Laser (1).pptx by
Tunable Laser (1).pptxTunable Laser (1).pptx
Tunable Laser (1).pptxHajira Mahmood
23 views37 slides
Uni Systems for Power Platform.pptx by
Uni Systems for Power Platform.pptxUni Systems for Power Platform.pptx
Uni Systems for Power Platform.pptxUni Systems S.M.S.A.
50 views21 slides
Spesifikasi Lengkap ASUS Vivobook Go 14 by
Spesifikasi Lengkap ASUS Vivobook Go 14Spesifikasi Lengkap ASUS Vivobook Go 14
Spesifikasi Lengkap ASUS Vivobook Go 14Dot Semarang
35 views1 slide

Recently uploaded(20)

Attacking IoT Devices from a Web Perspective - Linux Day by Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day Attacking IoT Devices from a Web Perspective - Linux Day
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri15 views
Spesifikasi Lengkap ASUS Vivobook Go 14 by Dot Semarang
Spesifikasi Lengkap ASUS Vivobook Go 14Spesifikasi Lengkap ASUS Vivobook Go 14
Spesifikasi Lengkap ASUS Vivobook Go 14
Dot Semarang35 views
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor... by Vadym Kazulkin
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
How to reduce cold starts for Java Serverless applications in AWS at JCON Wor...
Vadym Kazulkin75 views
[2023] Putting the R! in R&D.pdf by Eleanor McHugh
[2023] Putting the R! in R&D.pdf[2023] Putting the R! in R&D.pdf
[2023] Putting the R! in R&D.pdf
Eleanor McHugh38 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Web Dev - 1 PPT.pdf by gdsczhcet
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet55 views
Understanding GenAI/LLM and What is Google Offering - Felix Goh by NUS-ISS
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix Goh
NUS-ISS41 views
The Importance of Cybersecurity for Digital Transformation by NUS-ISS
The Importance of Cybersecurity for Digital TransformationThe Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital Transformation
NUS-ISS27 views
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze by NUS-ISS
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng TszeDigital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
Digital Product-Centric Enterprise and Enterprise Architecture - Tan Eng Tsze
NUS-ISS19 views
Perth MeetUp November 2023 by Michael Price
Perth MeetUp November 2023 Perth MeetUp November 2023
Perth MeetUp November 2023
Michael Price15 views
Black and White Modern Science Presentation.pptx by maryamkhalid2916
Black and White Modern Science Presentation.pptxBlack and White Modern Science Presentation.pptx
Black and White Modern Science Presentation.pptx
maryamkhalid291614 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdfAMAZON PRODUCT RESEARCH.pdf
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta15 views
Future of Learning - Yap Aye Wee.pdf by NUS-ISS
Future of Learning - Yap Aye Wee.pdfFuture of Learning - Yap Aye Wee.pdf
Future of Learning - Yap Aye Wee.pdf
NUS-ISS41 views

A Data Ecosystem to Support Machine Learning in Materials Science

  • 1. https://www.materialsdatafacility.org Ben Blaiszik (blaiszik@uchicago.edu), Jonathon Gaff, Logan Ward, Ryan Chard, Marcus Schwarting, Kyle Chard, Steven Tuecke, Ian Foster A Data Ecosystem to Support Machine Learning in Materials Science https://www.dlhub.org 1
  • 2. A Growing Opportunity in Machine Learning in the Sciences ML in Science • Number of publications across domains is growing rapidly • Access to datasets is improving, but still a challenge • Access to models and codes is a particular challenge • Benchmarks are lagging ML and Informatics in Materials Model ?? Code ??? Data ???? A Motivation 2
  • 3. An Emerging Data Infrastructure Ecosystem Models and Code ComputeData • Data: capture, organize, analyze, share move and deliver data • Compute: accessible at many scales, types, and locations • Models: discoverable, described, and portable to wherever data and/or computer are located • Workflows: easily discovered, adapted, composed, scaled, and reused We need and can build a cohesive infrastructure for AI-driven science 3
  • 4. An Emerging Data Infrastructure Ecosystem Models and Code ComputeData CH MaD • Data: capture, organize, analyze, share move and deliver data • Compute: accessible at many scales, types, and locations • Models: discoverable, described, and portable to wherever data and/or computer are located • Workflows: easily discovered, adapted, composed, scaled, and reused We need and can build a cohesive infrastructure for AI-driven science 4
  • 5. Materials Data Facility (MDF) Overview Build data services to • Empower researchers to publish data, regardless of size, type, and location • Automate data and metadata extraction and ingest • Enable unified search and discovery across disparate materials data sources Deploy with APIs to simplify connection to other data efforts and to enable automation CH MaD5
  • 6. The Materials Data Facility (MDF) CH MaD • Connect: Extract domain-relevant metadata / transform the data • Publish: Built to handle big data (many TB, millions of files), provides persistent identifier for data, distributed storage enabled • Discover: Programmatic search index to aggregate and retrieve data across hundreds of indexed data sources • Currently holds ~30TB of data from over 150 authors, millions of individual results • Auth • Transfer • Groups • Search • Connectors 6
  • 7. MDF – Enabling Flexible Data Publication Key features § Receive a citable DOI for your dataset § Native support for large datasets § millions of files or many TB of data § Distributed storage enabled § Host data on high availability, reliable, performant storage (ALCF, UIUC) or your own storage cluster § Currently hold over 30 TB of data from over 150 authors § Free of charge for researchers Publish via Web Interface Or with a Python script CH MaDhttps://www.materialsdatafacility.org7
  • 8. MDF – Enabling Flexible Data Publication Key features § Receive a citable DOI for your dataset § Native support for large datasets § millions of files or many TB of data § Distributed storage enabled § Host data on high availability, reliable, performant storage (ALCF, UIUC) or your own storage cluster § Currently hold over 30 TB of data from over 150 authors § Free of charge for researchers Publish via Python Script CH MaDhttps://www.materialsdatafacility.org8
  • 9. An Emerging Data Infrastructure Ecosystem Models and Code ComputeData CH MaD • Data: capture, organize, analyze, share move and deliver data • Compute: accessible at many scales, types, and locations • Models: discoverable, described, and portable to wherever data and/or computer are located • Workflows: easily discovered, adapted, composed, scaled, and reused We need and can build a cohesive infrastructure for AI-driven science 9
  • 10. • Collect, publish, categorize models and processing logic from many disciplines (materials science, physics, chemistry, genomics, etc.) • Serve models via DLHub operated service to simplify sharing, consumption, and access • Mint persistent identifiers for all artifacts • Enable new science through reuse, real-time integration, and synthesis of existing models DLHub – A Data and Learning Hub for Science 10
  • 11. DLHub – A Data and Learning Hub for Science Cherukara et al., 2018 Energy Storage TomographyX-Ray Science Input Output • Predict molecular energies with G4MP2 accuracy at B3LYP cost • Data available in MDF • Enhance tomographic scans and remove noise using generative adversarial model • Example data available on Petrel • Predict structure and phase of a material given coherent diffraction intensity • Data available from Github Exascale Cancer Research 11
  • 12. Bringing it Together Get Data Run Model 2018 Argonne Adv. Computing LDRD12
  • 13. 2018 Argonne Adv. Computing LDRD Get Data Run Model Models and Code ComputeData CH MaD Bringing it Together • Auth • Transfer • Groups • Search • Connectors 13
  • 14. References Websites • https://www.dlhub.org • https://www.materialsdatafacility.org Project Papers • DLHub – Blaiszik, Ben, Logan Ward, Marcus Schwarting, Jonathon Gaff, Ryan Chard, Daniel Pike, Kyle Chard, and Ian Foster. "A Data Ecosystem to Support Machine Learning in Materials Science." arXiv preprint arXiv:1904.10423 (2019). – Chard, Ryan, Zhuozhao Li, Kyle Chard, Logan Ward, Yadu Babuji, Anna Woodard, Steve Tuecke, Ben Blaiszik, Michael J. Franklin, and Ian Foster. "DLHub: Model and Data Serving for Science." arXiv preprint arXiv:1811.11213 (2018). – Blaiszik, Ben, Kyle Chard, Ryan Chard, Ian Foster, and Logan Ward. "Data automation at light sources." In AIP Conference Proceedings, vol. 2054, no. 1, p. 020003. AIP Publishing, 2019. • MDF – Blaiszik, Ben, Logan Ward, Marcus Schwarting, Jonathon Gaff, Ryan Chard, Daniel Pike, Kyle Chard, and Ian Foster. "A Data Ecosystem to Support Machine Learning in Materials Science." arXiv preprint arXiv:1904.10423 (2019). – Blaiszik, B., K. Chard, J. Pruyne, R. Ananthakrishnan, S. Tuecke, and I. Foster. "The materials data facility: data services to advance materials science research." JOM 68, no. 8 (2016): 2045-2052. 14
  • 15. CH MaD Thanks to our sponsors! U.S. DEPARTMENT OF ENERGY ALCF DF Parsl Globus IMaD DLHub Argonne LDRD 15