Ben Blaiszik (blaiszik@uchicago.edu),
Kyle Chard, Rachana Ananthakrishnan
Michael Ondrejcek, Kenton McHenry
PIs: Ian Foster (foster@uchicago.edu), Steven Tuecke, John Towns
materialsdatafacility.org
globus.org
Materials Data Facility - Data Services to Advance Materials Science Research
http://dx.doi.org/10.1007/s11837-016-2001-3
MDF Article in JOM (August Issue)
MaterialsDataFacility.org
To get started, contact Ben Blaiszik
blaiszik@uchicago.edu
Outline
• Overview
§ MDF Overview
§ Globus quick introduction
• MDF Data Publication Service
§ Key MDF data pub service features
§ Publication walk-through
• General Observations and Future Outlook
What is MDF?
We are developing production services to make it simpler for materials
datasets and resources to be ...
Published: Identified, Described, Curated, Verifiable, Accessible, Preserved
and
Discovered: Searched, Browsed, Shared, Recommended, Accessed
[Figure: data tiers labeled SRD, Publishable Results, Published Results, Resource Data, Ref Data, Derived Data, and Working Data. Figure adapted from Warren et al.]
Data Service Infrastructure
[Figure: Publication + Discovery + Compute for data interaction and viz + Resource Registration, with APIs; the legend marks the initial foci]
Publication
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and
provenance
• Verify dataset contents over time
• Preserve critical datasets in a state that increases
transparency, replicability, and helps encourage reuse
Discovery
• Search and query datasets in modern ways – e.g. via
search against indexed metadata and harvested file
contents rather than remembering opaque file paths
Future... Spotlight for all data you have access to, regardless of location
Under Development
Discovery
Under Development
• SaaS cloud-hosted solution
• Logical metadata repository to index many external sources
• Flexible queries (boosting, full text, partial matches, etc.)
• Search results are limited by ACLs
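The ACL point above can be sketched in a few lines. This is a minimal illustration, not the MDF or Globus implementation: the `Record` class, the `visible_to` field, the principal strings, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One indexed metadata record (hypothetical structure)."""
    title: str
    description: str
    # Principals allowed to see the record; "public" means everyone.
    visible_to: set = field(default_factory=set)


def search(index, query, principals):
    """Return records that match every query term and that the caller may see."""
    allowed = set(principals) | {"public"}
    terms = query.lower().split()
    hits = []
    for rec in index:
        # Apply the ACL check first so hidden records never reach the matcher.
        if not (rec.visible_to & allowed):
            continue
        text = f"{rec.title} {rec.description}".lower()
        if all(term in text for term in terms):
            hits.append(rec)
    return hits


# Hypothetical sample index
index = [
    Record("Steel fatigue measurements", "S-N curves", {"public"}),
    Record("Steel microstructure images", "unpublished", {"group:example-lab"}),
]
```

Two users running the same query can therefore see different result sets, depending on their group memberships.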
Discovery
Under Development
• All MDF-published datasets will be indexed
• May use common schemas (Datacite, Dublin Core etc.) or
domain specific
• Globus endpoint contents may be indexed (owner enabled)
• Index has the flexibility of no required schema
• Built on Elasticsearch for proven scalability and speed,
hosted on scalable AWS resources
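As a rough sketch of what a boosted, partial-match query with facets could look like against an Elasticsearch index: the query keys (`multi_match`, `fields` with `^` boosts, `fuzziness`, `aggs`/`terms`) are standard Elasticsearch query DSL, but the field names (`title`, `description`, `material`) and the helper `build_query` are assumptions for illustration, not the MDF index layout.

```python
import json


def build_query(text, boosts=None, size=10):
    """Build an Elasticsearch request body with per-field boosts and one facet."""
    # Hypothetical fields: matches in the title count three times as much.
    boosts = boosts or {"title": 3, "description": 1}
    fields = [f"{name}^{weight}" for name, weight in boosts.items()]
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "fuzziness": "AUTO",  # tolerate small typos and near matches
            }
        },
        # A terms aggregation returns value counts that can drive a facet list.
        "aggs": {"by_material": {"terms": {"field": "material"}}},
    }


body = build_query("steel fatigue")
print(json.dumps(body, indent=2))
```

The resulting dictionary would be sent as the JSON body of a search request; no client library is needed to construct it.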
Discovery
Under Development
Custom boosting
Facets
Test-indexed data
Globus Background
https://www.globus.org
Globus Platform-as-a-Service (PaaS)
Identity management
• Create and manage a unique identity linked to external identities
for authentication
User groups
• Manage user group creation and administration flows
• Share data with user groups
Data transfer
• High-performance data transfer from a web browser
• Optimize transfer settings and verify transfer integrity
• Add your laptop to the Globus cloud with Globus Connect Personal
Data sharing
• Share directly from your storage device (laptop or cluster)
• File and directory-level ACLs
REST APIs, Clients, and Docs
• New version of core services released in Feb.
• New Python SDK available
§ https://github.com/globusonline/globus-sdk-python
• Jupyter Notebook Examples
§ https://github.com/globus/globus-jupyter-notebooks
• Sample Data Portal
§ https://github.com/globus/globus-sample-data-portal
• (alpha) MDF Data Publication Service API
Globus Background
[Figure: data transfer between two secure endpoints, A and B (e.g. a laptop and midway)]
• You submit a transfer request
• Globus moves the data for you
• Globus notifies you once the transfer is complete
Endpoint
• E.g. laptop or server
running a Globus
client (analogous to a
Dropbox client)
• Enables advanced file
transfer and sharing
• Currently GridFTP,
future GridFTP + HTTP
Some Key
Features
• REST API for
automation and
interoperability
• Web UI for
convenience
• Optimizes and verifies
transfers
• Handles auto-restarts
• Battle tested with big
data
Globus Web UI
Data Publication
Where are we Now?
Materials Data Publication/Discovery is Often
a Challenge
[Figure: stages Data Collection, Data Storage and Process, and Publication, with open questions (?) between them; researchers want to publish and want to discover/use data. Don’t put data under the desk!]
Needed to close the loop
• Networked storage, sometimes many TB
• Unique identifiers for data, for search and citation
• Custom metadata descriptions
• Data curation workflow
• Automation capabilities
Collection Model
• Collections might be a
research group or a research
topic...
• Collections have specified
§ Mapping to storage endpoint
§ Currently handled as automatically created
shared endpoints
§ Metadata schemas
§ Access control policies
§ Licenses
§ Curation workflows
• Collections contain
§ Datasets
§ Data
§ Metadata
• Metadata Persistence
§ Metadata log file with dataset
§ Metadata replicated in search
index
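One way to picture the collection model above is as a pair of small data structures. This is only a sketch under assumptions: the class and field names, the curation level strings, and the example values are hypothetical and do not come from the MDF service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """A dataset: files plus the metadata that describes them."""
    title: str
    files: List[str] = field(default_factory=list)
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Collection:
    """Policies are set once per collection and apply to every dataset in it."""
    name: str
    storage_endpoint: str          # where dataset files are stored
    required_metadata: List[str]   # keys every dataset must provide
    license: str
    curation: str = "none"         # "none", "metadata", or "metadata_and_files"
    readers: List[str] = field(default_factory=lambda: ["public"])
    datasets: List[Dataset] = field(default_factory=list)

    def missing_metadata(self, dataset):
        """List required metadata keys the dataset does not supply."""
        return [k for k in self.required_metadata if k not in dataset.metadata]

    def add(self, dataset):
        """Accept a dataset only if it satisfies the collection's metadata policy."""
        missing = self.missing_metadata(dataset)
        if missing:
            raise ValueError(f"missing required metadata: {missing}")
        self.datasets.append(dataset)


# Hypothetical example values
collection = Collection(
    name="example-collection",
    storage_endpoint="example#endpoint",
    required_metadata=["title", "creator"],
    license="CC-BY-4.0",
)
collection.add(Dataset("run 1", metadata={"title": "run 1", "creator": "A. Researcher"}))
```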
Hybrid Distributed Model
[Figure: centralized resources Petrel @ Argonne (1.7 PB) and BlueWaters Condo @ UIUC (100 TB), plus Globus endpoints (EP 1, EP 2, EP 3, ElectroCat EP) at Campus RDS, DOE, and NSF (XSEDE) sites, connected to a cloud metadata index and tools]
Publish Large Datasets
• Distributed data model leverages
Globus production capabilities for file
transfer (i.e. dataset assembly), user
authentication, and access control
groups
• 100s of TB of reliable storage @ NCSA,
and more storage at Argonne
§ Globus endpoint at ncsa#mdf on Nebula
§ Expandable to many PBs as necessary
§ Automated tape backup for reliability (in progress)
• Researchers can optionally use their
own local or institutional storage
Uniquely Identify Datasets
• Associate a unique identifier with a
dataset
§ DOI, Handle
• Improve dataset discovery and citability
§ Aligning incentives and understanding the culture
will be critical to driving adoption
Future...
[Figure: mock-up plot of dataset downloads over time]
• Your work has been cited
153 times in the last year
• Researchers from 30
institutions have
downloaded your datasets
Share Data with Flexible ACLs
• Share data publicly, with a set of users,
or keep data private
Leverage Curation Workflows
• Collection administrators can specify
the level of curation workflow required
for a given collection, e.g.
§ No curation
§ Curation of metadata only
§ Curation of metadata and files
Customize Metadata
• Build a custom metadata schema for
your specific research data
• Re-use existing metadata schemas
• Working in conjunction with NIST
researchers to define these schemas
• Can we build a system that allows
schema:
§ Inheritance
§ E.g. a schema “polymers” might inherit and expand
upon the “base material” of NIST
§ Versioning
§ E.g. Understand contextually how to map fields
between versions
§ Dependence
§ E.g. Allows the ability to build consensus around
schemas
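A minimal sketch of the inheritance idea, assuming schemas are plain dictionaries with an optional `extends` key. The schema names echo the "polymers" and "base material" example above, but the field names, the type labels, and the `resolve` helper are hypothetical, and the sketch does not guard against circular `extends` chains.

```python
def resolve(schemas, name):
    """Flatten a schema by merging fields from its parent chain, parents first."""
    schema = schemas[name]
    parent = schema.get("extends")
    # Start from the parent's resolved fields, then let the child add or override.
    fields = dict(resolve(schemas, parent)["fields"]) if parent else {}
    fields.update(schema.get("fields", {}))
    return {"name": name, "version": schema.get("version", 1), "fields": fields}


# Hypothetical schema registry
schemas = {
    "base_material": {
        "version": 1,
        "fields": {"composition": "string", "temperature": "quantity"},
    },
    "polymers": {
        "version": 1,
        "extends": "base_material",
        "fields": {"molecular_weight": "quantity"},
    },
}

polymer_schema = resolve(schemas, "polymers")
print(sorted(polymer_schema["fields"]))
```

Keeping a `version` on each schema is what would later allow fields to be mapped between versions.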
Future...
MDF Submission Walkthrough
Example Use Case
Publishing Big, Remote Data
• Researchers collected multiple TB of data at a light source
• They want to bundle the data with metadata and provenance
• They want a citable DOI to share the raw and derived data with the
community
• They want their data to be discoverable by free-text search and custom
metadata
MDF Collection Home
MDF Collections
Recall: Policies Set at the Collection Level
• Required metadata, schemas
• Data storage location
• Metadata curation policies
MDF Metadata Entry
• Scientist or
representative
describes the data
they are submitting
• For this collection
Dublin Core and a
custom metadata
template are
required
MDF Custom Metadata
Dataset Assembly
• Shared endpoint is
auto-created on
collection-specified
data store
• Scientist transfers
dataset files to a
unique publish
endpoint
• Dataset may be
assembled over any
period of time
• When submission is
finished, dataset
will be rendered
immutable via
checksum
(e.g. NU) (e.g. UIUC Nebula)
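The "immutable via checksum" step can be illustrated with a checksum manifest: record a hash for every file at submission time, then compare later. This is a sketch under assumptions; the slides do not say which hash algorithm or manifest format MDF uses, so SHA-256 and the function names here are illustrative choices.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = file_sha256(full)
    return manifest


def verify(root, manifest):
    """Return paths that were added, removed, or changed since the manifest was built."""
    current = build_manifest(root)
    return sorted(p for p in set(current) | set(manifest)
                  if current.get(p) != manifest.get(p))


# Small demonstration on a temporary directory
with tempfile.TemporaryDirectory() as root:
    sample = os.path.join(root, "a.txt")
    with open(sample, "w") as fh:
        fh.write("v1")
    manifest = build_manifest(root)
    unchanged = verify(root, manifest)   # nothing has changed yet
    with open(sample, "w") as fh:
        fh.write("v2")
    changed = verify(root, manifest)     # the edit is detected
```

The same manifest also supports verifying dataset contents over time: rerun `verify` periodically and flag any non-empty result.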
Dataset Curation (Optional)
• Optionally specified
in collection
configuration
• Can be approved or
rejected (i.e. sent
back to the
submitter)
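The optional curation step can be read as a small state machine. The sketch below is one possible reading, with hypothetical state names and functions; the actual MDF workflow states are not given in the slides.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"              # with the submitter, still editable
    IN_CURATION = "in_curation"  # waiting for a curator's decision
    PUBLISHED = "published"


def submit(curation_required):
    """A finished submission either goes to a curator or is published directly."""
    return State.IN_CURATION if curation_required else State.PUBLISHED


def review(state, approved):
    """A curator approves the submission or sends it back to the submitter."""
    if state is not State.IN_CURATION:
        raise ValueError("only submissions in curation can be reviewed")
    return State.PUBLISHED if approved else State.DRAFT


state = submit(curation_required=True)
state = review(state, approved=False)  # rejected: back to the submitter
print(state)
```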
Mint a Permanent Identifier
Can be DOI or Handle
Dataset Record
Dataset Discovery
General Observations and Future Outlook
Year 1 Milestones
• Opened to the public in March 2016
• Provisioned reliable storage to support researchers sharing
open materials data (~200 TB)
• MDF data volume approaching 6 TB of materials data
• Started building deep relationships with many of the key
materials data generating groups and communities
• Ingested datasets > 1 TB in size
• Ingested datasets with > 1.5M files
Integration with the Community is Key
[Figure: integration points between MDF and Materials Project, OQMD, Citrination, and Materials Commons (labels: Metadata; Publishing; MD, Pub., Compute), plus other facilities (APS, SNS, NSLS, …), institutional repositories, and publishers. Additional labels: NCSA-PIRE, HV/TMS, MBDH]
Understanding Incentives is Critical
Increasing Impact
• Increase paper citations (1)
• Add dataset citation capabilities
Smoothing Dislocations
• [Distance] Enable simple sharing among
collaborators (near and far)
• [Personnel] Ease transitions between students
• [Format] Lessen need for ad hoc resource sharing
(e.g. via group websites)
Meeting Award Requirements
• Simplify DMP compliance
(1) Citation increase of 30% (10.7717/peerj.175) to 60% (10.1371/journal.pone.0000308) [caveat: bio research]
Lessons Learned
• The demand is there from researchers and
institutions
• Lots of cross-over with centers and projects
§ (NIST) CHiMaD
§ (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute
§ (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke*
• Data Heterogeneity is a challenge
§ Metadata is the major sticking point
• Friction points
§ Need more flexible data objects e.g. {“temperature”:100, “unit”:“K”}
§ Need file or directory based metadata
§ Immutable datasets alone are not enough → Versioning
§ Data gathering in retrospect
§ Schema generation and interoperability
§ Working with and following developments at NIST, RDA, Citrine et al.
§ Differing institutional approval processes
§ Lack of programmatic interface (planned).
• Support for data interactivity and visualization
• Smart versioning for large file-based datasets
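To make the "more flexible data objects" friction point concrete, here is a sketch that parses an object of the form shown above, {"temperature": 100, "unit": "K"}, and normalizes it. The helper names and the tiny unit table are assumptions for illustration; a real system would likely rely on a units library or a shared vocabulary.

```python
# Conversions to kelvin for two temperature units (illustrative only).
TO_KELVIN = {
    "K": lambda v: v,
    "C": lambda v: v + 273.15,
}


def parse_measurement(obj):
    """Split {"<property>": <number>, "unit": "<unit>"} into (property, value, unit)."""
    if "unit" not in obj or len(obj) != 2:
        raise ValueError("expected exactly one property plus a 'unit' key")
    [(name, value)] = [(k, v) for k, v in obj.items() if k != "unit"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{name} must be numeric")
    return name, float(value), obj["unit"]


def to_kelvin(obj):
    """Normalize a temperature object to kelvin so records become comparable."""
    name, value, unit = parse_measurement(obj)
    if unit not in TO_KELVIN:
        raise ValueError(f"unsupported unit for {name}: {unit}")
    return TO_KELVIN[unit](value)


print(parse_measurement({"temperature": 100, "unit": "K"}))
```

Storing the unit next to the value is what lets a search index compare measurements reported in different units.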
Wider Data Community
• Curated and described datasets
• Well-posed problems
• Community to share analyses
• Challenges to start “sprints”
• Great APIs and clients
• Examples to get started
• Hundreds of video tutorials
Materials Project, OQMD, Citrination, Materials Commons
• Less inherently intuitive problems
• Sometimes need advanced compute
capabilities
• Often many TB
Broader Trends
• Continuous integration, QA, and testing
• Containerized solutions, microservice architecture, abstracting software from
hardware
• Automation
• Internet of Things (IoT) – connect everything
• Machine Learning / AI
• Natural Language Processing (Siri, chatbots or “slack”bots, etc.)
• Search rules the world – ok this was 20 years ago…
What are the analogs and applications in the materials community?
Experimentation Ahead
No team commitments
here!
Open source opportunities,
contact:
blaiszik@uchicago.edu
Use Case: Scenario Generator-Consumer
• Data generator
§ Generates data periodically (perhaps from an instrument)
§ Pushes data to a public channel
§ Schema is validated before inclusion in channel stream
• Data consumer
§ Polls channel periodically
§ Wants to pull datasets by property
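The generator-consumer scenario above can be sketched as an in-memory channel. Everything here is hypothetical (the `Channel` class, its methods, and the required fields); it only illustrates validating on push and polling by property, using the channel name MDF-composites from the slide.

```python
class Channel:
    """An append-only stream of dataset records with a minimal schema check."""

    def __init__(self, name, required_fields):
        self.name = name
        self.required_fields = set(required_fields)
        self._items = []

    def push(self, dataset):
        """Validate against the channel schema, then append. Returns the offset."""
        missing = self.required_fields - set(dataset)
        if missing:
            raise ValueError(f"dataset rejected, missing fields: {sorted(missing)}")
        self._items.append(dict(dataset))
        return len(self._items) - 1

    def poll(self, since=0, **properties):
        """Return (next_offset, datasets) for items from `since` that match all properties."""
        new = self._items[since:]
        matches = [d for d in new
                   if all(d.get(k) == v for k, v in properties.items())]
        return len(self._items), matches


# Data generator: pushes datasets as they are produced
channel = Channel("MDF-composites", ["title", "material"])
channel.push({"title": "run 1", "material": "steel"})
channel.push({"title": "run 2", "material": "polymer"})

# Data consumer: polls periodically, remembering where it left off
offset, steel_runs = channel.poll(material="steel")
offset, later_runs = channel.poll(since=offset, material="steel")
```

Returning the next offset from each poll lets a consumer pick up only what arrived since its last visit.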
[Figure: a Data Generator creates datasets in the channel MDF-composites; a Data Consumer sends a query (q) to the channel and receives the result (q: result)]
Automated Data Aggregation (consumer)
Aggregate, Perform ML
• Combine a cloud-published dataset, scikit-learn, and pandas to predict
steel fatigue and “reproduce” data from a journal publication
Aggregate, Perform ML, Visualize
• Combine a cloud-published dataset, scikit-learn, and pandas to predict
steel fatigue and validate a journal publication
What’s Currently Available?
• Web interface to support data publication (public-facing
APIs coming soon)
• 100s of TB of storage at NCSA (scalable to many PB);
more at Argonne (1.7 PB total on Petrel, not all for
materials…)
• Help with developing metadata schemas to describe
your research datasets
MDF Tutorial on Github
https://github.com/blaiszik/materials-data-facility-training
What are we looking for?
• Early adopters, willing to get their hands
dirty with the service and give honest
feedback
• Key integration points where metadata is
picked up automatically!
• Key datasets and resources of all sizes,
shapes, raw or derived, that might help us
understand the process better
Thanks to Our Sponsors!
U.S. Department of Energy
Publication
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and provenance
• Verify dataset contents over time
• Handle big (and small) data: we have already ingested
datasets with > 1.5M files and > 1 TB in size
Discovery
• Search and query datasets in modern ways
• Index metadata and harvest file contents
• Simple user interfaces (i.e., after Google and Amazon)
REST APIs
Opened to external users in Mar. 2016
~ 6 TB of data published
materialsdatafacility.org

More Related Content

What's hot

Ckan tutorial odw2013 131109
Ckan tutorial odw2013 131109Ckan tutorial odw2013 131109
Ckan tutorial odw2013 131109
Chengjen Lee
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
Anant Kumar
 
Russell 2012 introduction to spring integration and spring batch
Russell 2012   introduction to spring integration and spring batchRussell 2012   introduction to spring integration and spring batch
Russell 2012 introduction to spring integration and spring batch
GaryPRussell
 
Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)
Hector Correa
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
Suvradeep Rudra
 
Instrument Data Orchestration with Globus Search and Flows
Instrument Data Orchestration with Globus Search and FlowsInstrument Data Orchestration with Globus Search and Flows
Instrument Data Orchestration with Globus Search and Flows
Globus
 
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus
Gateways 2020 Tutorial - Large Scale Data Transfer with GlobusGateways 2020 Tutorial - Large Scale Data Transfer with Globus
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus
Globus
 
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus
Gateways 2020 Tutorial - Automated Data Ingest and Search with GlobusGateways 2020 Tutorial - Automated Data Ingest and Search with Globus
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus
Globus
 
Globus: Beyond File Transfer
Globus: Beyond File TransferGlobus: Beyond File Transfer
Globus: Beyond File Transfer
Globus
 
Gateways 2020 Tutorial - Instrument Data Distribution with Globus
Gateways 2020 Tutorial - Instrument Data Distribution with GlobusGateways 2020 Tutorial - Instrument Data Distribution with Globus
Gateways 2020 Tutorial - Instrument Data Distribution with Globus
Globus
 
The basics of remote data replication
The basics of remote data replicationThe basics of remote data replication
The basics of remote data replication
FileCatalyst
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
Jeff Hammerbacher
 
Azure doc db (slideshare)
Azure doc db (slideshare)Azure doc db (slideshare)
Azure doc db (slideshare)
David Green
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Filip Ilievski
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
ateeq ateeq
 
Introduction to W3C Linked Data Platform
Introduction to W3C Linked Data PlatformIntroduction to W3C Linked Data Platform
Introduction to W3C Linked Data Platform
Nandana Mihindukulasooriya
 
Gateways 2020 Tutorial - Introduction to Globus
Gateways 2020 Tutorial - Introduction to GlobusGateways 2020 Tutorial - Introduction to Globus
Gateways 2020 Tutorial - Introduction to Globus
Globus
 
Describing LDP Applications with the Hydra Core Vocabulary
Describing LDP Applications with the Hydra Core VocabularyDescribing LDP Applications with the Hydra Core Vocabulary
Describing LDP Applications with the Hydra Core Vocabulary
Nandana Mihindukulasooriya
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
J Singh
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
Fayez Shayeb
 

What's hot (20)

Ckan tutorial odw2013 131109
Ckan tutorial odw2013 131109Ckan tutorial odw2013 131109
Ckan tutorial odw2013 131109
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
 
Russell 2012 introduction to spring integration and spring batch
Russell 2012   introduction to spring integration and spring batchRussell 2012   introduction to spring integration and spring batch
Russell 2012 introduction to spring integration and spring batch
 
Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)Introduction to Linked Data Platform (LDP)
Introduction to Linked Data Platform (LDP)
 
NOSQL Databases types and Uses
NOSQL Databases types and UsesNOSQL Databases types and Uses
NOSQL Databases types and Uses
 
Instrument Data Orchestration with Globus Search and Flows
Instrument Data Orchestration with Globus Search and FlowsInstrument Data Orchestration with Globus Search and Flows
Instrument Data Orchestration with Globus Search and Flows
 
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus
Gateways 2020 Tutorial - Large Scale Data Transfer with GlobusGateways 2020 Tutorial - Large Scale Data Transfer with Globus
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus
 
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus
Gateways 2020 Tutorial - Automated Data Ingest and Search with GlobusGateways 2020 Tutorial - Automated Data Ingest and Search with Globus
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus
 
Globus: Beyond File Transfer
Globus: Beyond File TransferGlobus: Beyond File Transfer
Globus: Beyond File Transfer
 
Gateways 2020 Tutorial - Instrument Data Distribution with Globus
Gateways 2020 Tutorial - Instrument Data Distribution with GlobusGateways 2020 Tutorial - Instrument Data Distribution with Globus
Gateways 2020 Tutorial - Instrument Data Distribution with Globus
 
The basics of remote data replication
The basics of remote data replicationThe basics of remote data replication
The basics of remote data replication
 
Hdfs Dhruba
Hdfs DhrubaHdfs Dhruba
Hdfs Dhruba
 
Azure doc db (slideshare)
Azure doc db (slideshare)Azure doc db (slideshare)
Azure doc db (slideshare)
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Introduction to W3C Linked Data Platform
Introduction to W3C Linked Data PlatformIntroduction to W3C Linked Data Platform
Introduction to W3C Linked Data Platform
 
Gateways 2020 Tutorial - Introduction to Globus
Gateways 2020 Tutorial - Introduction to GlobusGateways 2020 Tutorial - Introduction to Globus
Gateways 2020 Tutorial - Introduction to Globus
 
Describing LDP Applications with the Hydra Core Vocabulary
Describing LDP Applications with the Hydra Core VocabularyDescribing LDP Applications with the Hydra Core Vocabulary
Describing LDP Applications with the Hydra Core Vocabulary
 
NoSQL and MapReduce
NoSQL and MapReduceNoSQL and MapReduce
NoSQL and MapReduce
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 

Viewers also liked

Evaluation Questions 4+5
Evaluation Questions 4+5Evaluation Questions 4+5
Evaluation Questions 4+5
misscarter123
 
Channel strategy(1)
Channel strategy(1)Channel strategy(1)
Channel strategy(1)
ngoc797
 
Course Syllabus Tier 2 5 Day Syllabus Fall 2015
Course Syllabus Tier 2 5 Day Syllabus Fall 2015Course Syllabus Tier 2 5 Day Syllabus Fall 2015
Course Syllabus Tier 2 5 Day Syllabus Fall 2015
David Bourque
 
Rg 839
Rg 839Rg 839
Rg 839
johnmichal1
 
Sc manufacturing conference and expo hitachi i io-t april 20 2016 final
Sc manufacturing conference and expo   hitachi i io-t april 20 2016 finalSc manufacturing conference and expo   hitachi i io-t april 20 2016 final
Sc manufacturing conference and expo hitachi i io-t april 20 2016 final
Keith Brown
 
Uma janela para o mundo: bibliotecas e bibliotecários em meio prisional
Uma janela para o mundo: bibliotecas e bibliotecários em meio prisionalUma janela para o mundo: bibliotecas e bibliotecários em meio prisional
Uma janela para o mundo: bibliotecas e bibliotecários em meio prisional
Bruno Duarte Eiras
 
Resumen y comentario crítico
Resumen y comentario críticoResumen y comentario crítico
Resumen y comentario crítico
telleiras4eso
 
Care santos
Care santosCare santos
Care santos
fgmezlpez
 
April resume 2016
April resume 2016April resume 2016
April resume 2016
Araceli Ulloa
 

Viewers also liked (10)

Evaluation Questions 4+5
Evaluation Questions 4+5Evaluation Questions 4+5
Evaluation Questions 4+5
 
Channel strategy(1)
Channel strategy(1)Channel strategy(1)
Channel strategy(1)
 
Course Syllabus Tier 2 5 Day Syllabus Fall 2015
Course Syllabus Tier 2 5 Day Syllabus Fall 2015Course Syllabus Tier 2 5 Day Syllabus Fall 2015
Course Syllabus Tier 2 5 Day Syllabus Fall 2015
 
Rg 839
Rg 839Rg 839
Rg 839
 
Certificate 1
Certificate 1Certificate 1
Certificate 1
 
Sc manufacturing conference and expo hitachi i io-t april 20 2016 final
Sc manufacturing conference and expo   hitachi i io-t april 20 2016 finalSc manufacturing conference and expo   hitachi i io-t april 20 2016 final
Sc manufacturing conference and expo hitachi i io-t april 20 2016 final
 
Uma janela para o mundo: bibliotecas e bibliotecários em meio prisional
Uma janela para o mundo: bibliotecas e bibliotecários em meio prisionalUma janela para o mundo: bibliotecas e bibliotecários em meio prisional
Uma janela para o mundo: bibliotecas e bibliotecários em meio prisional
 
Resumen y comentario crítico
Resumen y comentario críticoResumen y comentario crítico
Resumen y comentario crítico
 
Care santos
Care santosCare santos
Care santos
 
April resume 2016
April resume 2016April resume 2016
April resume 2016
 

Similar to 20160922 Materials Data Facility TMS Webinar

Sept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the CloudSept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the Cloud
National Information Standards Organization (NISO)
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
Globus
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Ian Foster
 
Simplified Research Data Management with the Globus Platform
Simplified Research Data Management with the Globus PlatformSimplified Research Data Management with the Globus Platform
Simplified Research Data Management with the Globus Platform
Globus
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
Ravi Madduri
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
Jim Dowling
 
Introduction to the Globus SaaS (GlobusWorld Tour - STFC)
Introduction to the Globus SaaS (GlobusWorld Tour - STFC)Introduction to the Globus SaaS (GlobusWorld Tour - STFC)
Introduction to the Globus SaaS (GlobusWorld Tour - STFC)
Globus
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Amazon Web Services
 
Introduction to Globus - XSEDE14 Tutorial
Introduction to Globus - XSEDE14 TutorialIntroduction to Globus - XSEDE14 Tutorial
Introduction to Globus - XSEDE14 Tutorial
Globus
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
Globus
 
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...The Materials Data Facility: A Distributed Model for the Materials Data Commu...
The Materials Data Facility: A Distributed Model for the Materials Data Commu...
Ben Blaiszik
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
Robert Grossman
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Ian Foster
 
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWSExperiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS
Ed Dodds
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application Development
Peter Haase
 
Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)
Globus
 
Shug meetup Hops Hadoop
Shug meetup Hops HadoopShug meetup Hops Hadoop
Shug meetup Hops Hadoop
Jim Dowling
 
Intelligent Cloud Enablement
Intelligent Cloud EnablementIntelligent Cloud Enablement
Intelligent Cloud Enablement
DocuLynx
 
Introduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCFIntroduction to Globus: Research Data Management Software at the ALCF
Introduction to Globus: Research Data Management Software at the ALCF
Globus
 

Similar to 20160922 Materials Data Facility TMS Webinar (20)

Sept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the CloudSept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the Cloud
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
Simplified Research Data Management with the Globus Platform
Simplified Research Data Management with the Globus PlatformSimplified Research Data Management with the Globus Platform
Simplified Research Data Management with the Globus Platform
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Introduction to the Globus SaaS (GlobusWorld Tour - STFC)
Introduction to the Globus SaaS (GlobusWorld Tour - STFC)Introduction to the Globus SaaS (GlobusWorld Tour - STFC)
Introduction to the Globus SaaS (GlobusWorld Tour - STFC)
 
20160922 Materials Data Facility TMS Webinar

  • 1. Ben Blaiszik (blaiszik@uchicago.edu), Kyle Chard, Rachana Ananthakrishnan Michael Ondrejcek, Kenton McHenry PIs: Ian Foster (foster@uchicago.edu), Steven Tuecke, John Towns materialsdatafacility.org globus.org Materials Data Facility - Data Services to Advance Materials Science Research
  • 4. 4 Outline APIs • Overview § MDF Overview § Globus quick introduction • MDF Data Publication Service § Key MDF data pub service features § Publication walk-through • General Observations and Future Outlook
  • 5. What is MDF? 5 We are developing production services to make it simpler for materials datasets and resources to be ... Published Identified Described Curated Verifiable Accessible Preserved Discovered Searched Browsed Shared Recommended Accessed and SRD Publishable Results Published Results Resource Data Ref Data Derived Data Working Data * Figure adapted from Warren et al.
  • 6. Data Service Infrastructure 6 Publication Discovery Compute for data interaction and viz Resource Registration APIs + + + - Initial Foci
  • 7. 7 Publication APIs • Identify datasets with persistent identifiers (e.g. DOI) • Describe datasets with appropriate metadata and provenance • Verify dataset contents over time • Preserve critical datasets in a state that increases transparency, replicability, and helps encourage reuse
  • 8. 8 Discovery • Search and query datasets in modern ways – e.g. via search against indexed metadata and harvested file contents rather than remembering opaque file paths Future... Spotlight for all data you have access to regardless of location Under Development
  • 9. 9 Discovery Under Development • SaaS cloud-hosted solution • Logical metadata repository to index many external sources • Flexible queries (boosting, full text, partial matches, etc.) • Search results are limited by ACLs
  • 10. 10 Discovery Under Development • All MDF-published datasets will be indexed • May use common schemas (Datacite, Dublin Core etc.) or domain specific • Globus endpoint contents may be indexed (owner enabled) • Index has the flexibility of no required schema • Built on Elasticsearch for proven scalability and speed, hosted on scalable AWS resources
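The query features listed for the Discovery service (boosting, full text, results limited by ACLs) can be illustrated with a small sketch. It only builds an Elasticsearch-style query body as a plain dictionary; it does not reflect the actual MDF index. The field names (`title`, `description`, `file_contents`, `acl.read`) and the `build_query` helper are hypothetical.

```python
import json


def build_query(text, user_groups, title_boost=2.0):
    """Build an Elasticsearch-style bool query as a plain dict.

    All field names are illustrative assumptions, not the MDF schema.
    """
    return {
        "query": {
            "bool": {
                # Full-text search over several fields; "title^2.0" boosts
                # matches in the title relative to the other fields.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            "fields": [
                                f"title^{title_boost}",
                                "description",
                                "file_contents",
                            ],
                        }
                    }
                ],
                # Only return records readable by one of the caller's groups.
                "filter": [{"terms": {"acl.read": list(user_groups)}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_query("steel fatigue", ["public"]), indent=2))
```

Placing the ACL check in the `filter` clause keeps it out of relevance scoring, so access control restricts which records are returned without changing their ranking.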
  • 13. Globus Platform-as-a-Service (PaaS) 14 Identity management User groups Data transfer Data sharing • Share directly from your storage device (laptop or cluster) • File and directory-level ACLs • Manage user group creation and administration flows • Share data with user groups • High-performance data transfer from a web browser • Optimize transfer settings and verify transfer integrity • Add your laptop to the Globus cloud with Globus Connect Personal • create and manage a unique identity linked to external identities for authentication Publication Discovery
  • 14. REST APIs, Clients, and Docs 15 • New version of core services released in Feb. • New Python SDK available § https://github.com/globusonline/globus-sdk-python • Jupyter Notebook Examples § https://github.com/globus/globus-jupyter-notebooks • Sample Data Portal § https://github.com/globus/globus-sample-data-portal • (alpha) MDF Data Publication Service API
  • 15. Globus Background 16 [Diagram: you submit a transfer request; Globus moves the data for you between two secure endpoints, A and B (e.g. a laptop and midway); Globus notifies you once the transfer is complete.] Endpoint • E.g. laptop or server running a Globus client (e.g. Dropbox client) • Enables advanced file transfer and sharing • Currently GridFTP, future GridFTP + HTTP Some Key Features • REST API for automation and interoperability • Web UI for convenience • Optimizes and verifies transfers • Handles auto-restarts • Battle tested with big data
  • 16. Globus Web UI 17 Endpoint • E.g. laptop or server running a Globus client (e.g. Dropbox client) • Enables advanced file transfer and sharing • Currently GridFTP, future GridFTP + HTTP Some Key Features • REST API for automation and interoperability • Web UI for convenience • Optimizes and verifies transfers • Handles auto-restarts • Battle tested with big data
  • 18. 20 Materials Data Publication/Discovery is Often a Challenge Data Collection Data Storage and Process Publication
  • 19. 21 Materials Data Publication/Discovery is Often a Challenge Data Collection ? ? ? Networked storage, sometimes many TB Unique identifier data for search/cite Custom metadata descriptions Data curation workflow Automation capabilities Data Storage and Process Publication Want to Discover / Use Want to Publish Don’t put under desk! Needed to close the loop
  • 20. 22 Data Collection ? ? ? Need storage, sometimes many TB Need to uniquely identify data for search/cite Need custom metadata descriptions Need a data curation workflow Need automation capabilities Data Storage and Process Publication Want to Discover / Use Want to Publish Materials Data Publication/Discovery is Often a Challenge Don’t put under desk!
  • 21. Collection Model 23 • Collections might be a research group or a research topic... • Collections have specified § Mapping to storage endpoint § Currently handled as automatically created shared endpoints § Metadata schemas § Access control policies § Licenses § Curation workflows • Collections contain § Datasets § Data § Metadata • Metadata Persistence § Metadata log file with dataset § Metadata replicated in search index
  • 22. Hybrid Distributed Model 24 Petrel @Argonne 1.7 PB BlueWaters Condo @UIUC 100 TB EP 1 EP 2 EP 3 Campus RDS DOE Cloud Metadata Index And Tools Centralized resource Globus endpoint NSF (XSEDE) ElectroCat EP
  • 23. Publish Large Datasets 25 • Distributed data model leverages Globus production capabilities for file transfer (i.e. dataset assembly), user authentication, and access control groups • 100s of TB of reliable storage @ NCSA, and more storage at Argonne § Globus endpoint at ncsa#mdf on Nebula § Expandable to many PBs as necessary § Automated tape backup for reliability (in progress) • Researchers can optionally use their own local or institutional storage
  • 24. Uniquely Identify Datasets 26 • Associate a unique identifier with a dataset § DOI, Handle • Improve dataset discovery and citability § Aligning incentives and understanding the culture will be critical to driving adoption [Chart: dataset downloads over time] • Your work has been cited 153 times in the last year • Researchers from 30 institutions have downloaded your datasets Future...
  • 25. Share Data with Flexible ACLs 27 • Share data publicly, with a set of users, or keep data private Leverage Curation Workflows • Collection administrators can specify the level of curation workflow required for a given collection e.g. § No curation § Curation of metadata only § Curation of metadata and files
  • 26. Customize Metadata 28 • Build a custom metadata schema for your specific research data • Re-use existing metadata schemas • Working in conjunction with NIST researchers to define these schemas • Can we build a system that allows schema: § Inheritance § E.g. a schema “polymers” might inherit and expand upon the “base material” of NIST § Versioning § E.g. Understand contextually how to map fields between versions § Dependence § E.g. Allows the ability to build consensus around schemas Future...
  • 28. Example Use Case 30 Publishing Big, Remote Data Collected multi TB of data at a light source Bundle the data with metadata and provenance Want a citable DOI to share the raw and derived data with the community Want their data to be discoverable by free text search and custom metadata
  • 30. MDF Collections 32 Recall: Policies Set at the Collection Level • Required metadata, schemas • Data storage location • Metadata curation policies
  • 31. MDF Metadata Entry 33 • Scientist or representative describes the data they are submitting • For this collection Dublin Core and a custom metadata template are required
  • 32. MDF Custom Metadata 34 • Scientist or representative describes the data they are submitting • For this collection Dublin Core and a custom metadata template are required
  • 33. Dataset Assembly 35 • Shared endpoint is auto-created on collection-specified data store • Scientist transfers dataset files to a unique publish endpoint • Dataset may be assembled over any period of time • When submission is finished, dataset will be rendered immutable via checksum (e.g. NU) (e.g. UIUC Nebula)
  • 34. Dataset Assembly 36 • Shared endpoint is auto-created on collection-specified data store • Scientist transfers dataset files to a unique publish endpoint • Dataset may be assembled over any period of time • When submission is finished, dataset will be rendered immutable via checksum (e.g. NU) (e.g. UIUC Nebula)
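The "rendered immutable via checksum" step above can be sketched as a per-file checksum manifest that is recorded at submission time and re-checked later. The slides do not say which algorithm or manifest layout MDF uses; SHA-256 and the helper names `file_sha256`, `build_manifest`, and `verify` are assumptions made for illustration.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest


def verify(root, manifest):
    """Return True if the files under root still match the manifest."""
    return build_manifest(root) == manifest
```

Recomputing the manifest at a later date and comparing it with the stored one detects added, removed, or modified files, which is one way to support the "verify dataset contents over time" goal.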
  • 35. Dataset Curation (Optional) 37 • Optionally specified in collection configuration • Can be approved or rejected (i.e. sent back to the submitter)
  • 36. Mint a Permanent Identifier 38 Can be DOI or Handle
  • 40. 48 Publication Year 1 Milestones APIs • Opened to the public in March 2016 • Provisioned reliable storage to support researchers sharing open materials data (~200 TB) • MDF data volume approaching ~ 6 TB of materials data • Started building deep relationships with many of the key materials data generating groups and communities • Ingested dataset > 1 TB in size • Ingested dataset > 1.5M files
  • 41. Integration with the Community is Key 49 Materials Project OQMD Citrination Materials Commons Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers! Metadata Publishing Metadata MD, Pub., Compute Metadata Publishing NCSA-PIRE HV/TMS MBDH
  • 42. Understanding Incentives is Critical 50 Meeting Award Requirements Smoothing Dislocations Increasing Impact • Increase paper citations [1] • Add dataset citation capabilities • [Distance] Enable simple sharing among collaborators (near and far) • [Personnel] Ease transitions between students • [Format] Lessen need for ad hoc resource sharing (e.g. via group websites) • Simplify DMP compliance [1] Citation increase of 30% (10.7717/peerj.175) to 60% (10.1371/journal.pone.0000308) [caveat: bio research]
  • 43. Lessons Learned 51 • The demand is there from researchers and institutions • Lots of cross-over with centers and projects § (NIST) CHiMaD § (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute § (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke* • Data heterogeneity is a challenge § Metadata is the major sticking point • Friction points § Need more flexible data objects e.g. {“temperature”:100, “unit”:“K”} § Need file or directory based metadata § Immutable datasets alone are not enough → Versioning § Data gathering in retrospect § Schema generation and interoperability § Working with and following developments at NIST, RDA, Citrine et al. § Differing institutional approval processes § Lack of a programmatic interface (planned) • Support for data interactivity and visualization • Smart versioning for large file-based datasets
  • 44. Wider Data Community 52 • Curated and described datasets • Well-posed problems • Community to share analyses • Challenges to start “sprints” • Great APIs and clients • Examples to get started • Hundreds of video tutorials Materials Project OQMD Citrination Materials Commons • Less inherently intuitive problems • Sometimes need advanced compute capabilities • Often many TB
  • 45. Broader Trends 53 • Continuous integration, QA, and testing • Containerized solutions, microservice architecture, abstracting software from hardware • Automation • Internet of Things (IoT) – connect everything • Machine Learning / AI • Natural Language Processing (Siri, chatbots or “slack”bots, etc.) • Search rules the world – ok this was 20 years ago… What are the analogs and applications in the materials community? Materials Project OQMD Citrination Materials Commons • Less inherently intuitive problems • Sometimes need advanced compute capabilities • Often many TB
  • 46. 54 Experimentation Ahead No team commitments here! Open source opportunities, contact: blaiszik@uchicago.edu
  • 47. Use Case: Scenario Generator-Consumer 55 • Data generator § Generates data periodically (perhaps from an instrument) § Pushes data to a public channel § Schema is validated before inclusion in channel stream • Data consumer § Polls channel periodically § Wants to pull datasets by property [Diagram: a Data Generator creates datasets in the dataset channel MDF-composites; a Data Consumer sends a query (q) to the channel and receives a result.]
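The generator-consumer scenario above can be sketched with an in-memory channel. This slide sits in the "Experimentation Ahead" part of the talk, so nothing here describes an existing MDF API: the `Channel` class, the `SCHEMA` fields, and the cursor-based `poll` method are all hypothetical.

```python
# Hypothetical schema: field name -> accepted Python type(s).
SCHEMA = {"material": str, "temperature": (int, float), "unit": str}


class Channel:
    """In-memory stand-in for a public dataset channel."""

    def __init__(self, name, schema):
        self.name = name
        self.schema = schema
        self._datasets = []

    def publish(self, dataset):
        """Validate a dataset against the schema, then append it."""
        for field, expected in self.schema.items():
            if not isinstance(dataset.get(field), expected):
                raise ValueError(f"invalid or missing field: {field}")
        self._datasets.append(dict(dataset))

    def poll(self, since=0, **properties):
        """Return (new_cursor, datasets added after `since` that match)."""
        fresh = self._datasets[since:]
        matches = [
            d for d in fresh
            if all(d.get(key) == value for key, value in properties.items())
        ]
        return len(self._datasets), matches


if __name__ == "__main__":
    channel = Channel("MDF-composites", SCHEMA)
    # Generator side: push a validated dataset.
    channel.publish({"material": "epoxy", "temperature": 100, "unit": "K"})
    # Consumer side: poll periodically, pulling datasets by property.
    cursor, hits = channel.poll(since=0, material="epoxy")
    print(cursor, hits)
```

Returning a cursor from `poll` lets a consumer that polls periodically pass it back as `since` and see only datasets published after its previous poll.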
  • 49. Aggregate, Perform ML 58 • Combine cloud-published dataset, scikit-learn, and pandas to predict steel fatigue and “reproduce” data from journal publication
  • 50. Aggregate, Perform ML, Visualize 59 • Combine cloud-published dataset, scikit-learn, and pandas to predict steel fatigue and validate journal publication
  • 51. What’s Currently Available? 60 • Web interface to support data publication (public-facing APIs coming soon) • 100s of TB of storage at NCSA (scalable to many PB), with more at Argonne (1.7 PB total on Petrel – not all for materials…) • Help with developing metadata schemas to describe your research datasets MDF Tutorial on GitHub https://github.com/blaiszik/materials-data-facility-training
  • 52. What are we looking for? 61 • Early adopters, willing to get their hands dirty with the service and give honest feedback • Key integration points where metadata is picked up automatically! • Key datasets and resources of all sizes, shapes, raw or derived, that might help us understand the process better
  • 53. Thanks to Our Sponsors! 62 U.S. Department of Energy
  • 54. Publication REST APIs Discovery • Identify datasets with persistent identifiers (e.g. DOI) • Describe datasets with appropriate metadata and provenance • Verify dataset contents over time • Handle big (and small) data: We have already ingested datasets with > 1.5M files and > 1 TB in size • Search and query datasets in modern ways • Index metadata and harvest file contents • Simple user interfaces (i.e., after Google and Amazon) Opened to external users in Mar. 2016 ~ 6 TB of data published materialsdatafacility.org