This document discusses using cloud services to facilitate materials data sharing and analysis. It proposes a "Discovery Cloud" that would allow researchers to easily store, curate, discover, and analyze materials data without needing local software or hardware. This cloud platform could accelerate discovery by automating workflows and reducing costs through on-demand scalability. It would also make long-term data preservation simpler. The document highlights Globus research data management services as an example of cloud tools that could help address the dual challenges of treating data as both a rare treasure to preserve and a "deluge" to efficiently manage.
4. computationinstitute.org
Materials Innovation Infrastructure
A data sharing system to facilitate:
• Use of a broader set of data to render more accurate models
• Multi-disciplinary communication among scientists and engineers working on different stages of materials development
• Searches for advanced materials with specific, desired properties
• Curating and sharing of reliable computational code for modeling and simulation
Credit: Meredith Drosback, OSTP
[Figure labels: Computation; Data; Experiment]
6. computationinstitute.org
It’s both …
We must manage the data deluge, both to enhance user productivity and to increase data capture
Network materials data
Or chaotic deluge?
Wellington bucket fountain: https://www.youtube.com/watch?v=_p_FNNDu16w
8. computationinstitute.org
Linking simulation and experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Figure labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering; Detect errors (secs to mins); Select experiments (mins to hours); Simulations driven by experiments (mins to days); Evolutionary optimization; Knowledge-driven decision making; Contribute to knowledge base; Knowledge base: past experiments, simulations, literature, expert knowledge]
9. computationinstitute.org
An expensive business …
Network engineer
Parallel programmer
Software engineer
Database architect
Database manager
Software engineer
Data engineer
Parallel programmer
Postdoc
Postdoc
10. computationinstitute.org
A small business, 20 years ago
Secretary
HR manager
Marketing
Database manager
Accountant
IT department
Personal assistant
Shipping department
Intern
Payroll
14. computationinstitute.org
Linking simulation and experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Figure labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering]
Globus transfer service
Cloud hosted: reliable, secure, fast
20K users, 3B files, 50 PB transferred
Available at www.globus.org
16. computationinstitute.org
Linking simulation and experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Figure labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering; Evolutionary optimization]
Globus sharing
Identities, groups, profiles
Cloud hosted
18. computationinstitute.org
Linking simulation and experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Figure labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering; Knowledge-driven decision making; Contribute to knowledge base; Knowledge base: past experiments, simulations, literature, expert knowledge]
Globus data publication and discovery
Cloud hosted
19. computationinstitute.org
Data publication and discovery
We are looking for pilot users!
[Figure labels: Community; Collection; Dataset; Data; Metadata; Access control; License; Storage; Curation workflow; Policies]
34. computationinstitute.org
Linking simulation and experiment to study disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
[Figure labels: Sample; Experimental scattering; Material composition (La 60%, Sr 40%); Simulated structure; Simulated scattering; Detect errors (secs to mins); Select experiments (mins to hours); Simulations driven by experiments (mins to days); Evolutionary optimization; Knowledge-driven decision making; Contribute to knowledge base; Knowledge base: past experiments, simulations, literature, expert knowledge]
36. computationinstitute.org
Tool shed: Simulation models & analysis tools
Data space: Local and remote datasets
Workflows: Link data, tools in reusable form
Simulation and data analysis: Point and click parallelism
Capture domain knowledge: data and code
Reusable workflows encode commonly used modeling and analysis pipelines
Builds on widely used Galaxy, Globus, and Swift systems
galaxyproject.org ✧ globus.org ✧ swift-lang.org
Large simulation campaigns
Hosted on Amazon cloud for reliable, on-demand access and scalability
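The idea of a reusable workflow applied across many datasets can be sketched in plain Python. This is only an illustration using the standard library; it does not use or represent the Galaxy, Globus, or Swift interfaces, and the step functions (`remove_background`, `normalize`) and helper names are hypothetical stand-ins for real analysis tools.

```python
from concurrent.futures import ThreadPoolExecutor


def make_workflow(*steps):
    """Chain steps into one reusable function applied to a single dataset."""
    def run(dataset):
        result = dataset
        for step in steps:
            result = step(result)
        return result
    return run


def run_campaign(workflow, datasets, max_workers=4):
    """Apply the same workflow to many datasets concurrently, keeping order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(workflow, datasets))


# Hypothetical analysis steps standing in for real tools.
def remove_background(values):
    floor = min(values)
    return [v - floor for v in values]


def normalize(values):
    peak = max(values) or 1.0
    return [v / peak for v in values]


pipeline = make_workflow(remove_background, normalize)
results = run_campaign(pipeline, [[1.0, 3.0, 5.0], [2.0, 2.0, 4.0]])
print(results)
```

Because the workflow is an ordinary function, the same pipeline can be reused on new datasets without changing the steps.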
37. computationinstitute.org
Discovery Cloud:
Three common themes
1) Accelerate discovery via automation
2) Slash costs of trying new methods
– No local software installation
– No need to read manual
– On-demand, elastic scalability
– Low operational costs, proactive support
3) Make data preservation trivial
38. computationinstitute.org
Take away messages
• Data has a dual nature: rare treasure and chaotic deluge
• MGI must embrace this duality
– Treasure: Store, curate, index, preserve
– Deluge: Slash management costs, to both accelerate use & facilitate data preservation
• Cloud services can help in both areas
39. computationinstitute.org
Thanks to great colleagues and collaborators
• Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, Steve Tuecke, Justin Wozniak, and other CS colleagues
• Ray Osborn, Francesco de Carlo, Chris Jacobsen, Nicola Ferrier, and other Argonne scientists
• Juan de Pablo, Peter Voorhees, and other NIST CHiMaD participants
We have this vision of what we want the system to be able to do - Looking for tools that bring data together seamlessly, and connect them with models (a la hubzero)
HOWEVER, materials is not as advanced in this as some fields, such as astronomy, genetics, etc. The broad materials community hasn’t entirely bought into the utility of data sharing.
An open collaboration platform is vital to the success of MGI – tell us how to do this!
Knowledge and expertise supporting the development of the materials of the 21st century are widely dispersed. To address the need for collaboration, some organizations are developing online materials databases and platforms. These "hubs" make materials research data public and searchable in order to improve transparency and help the materials community build on existing knowledge rather than unknowingly repeat research or duplicate efforts in parallel. Moreover, they provide modeling and simulation and systems integration tools that are necessary to manage and interpret complex datasets representing the structure and properties of materials.
Is data a rare treasure, to be carefully curated, cherished …
In the case of data, constant movement and change are really the norm. Data is constantly created, flows, is lost, …
A project at Argonne’s Advanced Photon Source seeks to understand properties of disordered structures such as paramagnetic insulators.
“Most of materials science is bottlenecked by disordered structures”—Littlewood.
Solve inverse problem.
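As a toy illustration of the inverse problem mentioned here, the sketch below recovers a composition fraction from a synthetic "scattering" curve with a simple evolutionary search. The forward model, the function names (`simulate_scattering`, `evolve_composition`), and every number are assumptions made for illustration; they are not the project's actual models or data.

```python
import math
import random


def simulate_scattering(la_fraction, q_values):
    """Hypothetical forward model: a made-up blend of two decaying curves."""
    return [
        la_fraction * math.exp(-q) + (1.0 - la_fraction) * math.exp(-2.0 * q)
        for q in q_values
    ]


def misfit(simulated, observed):
    """Sum of squared differences between simulated and observed curves."""
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def evolve_composition(observed, q_values, generations=200, offspring=20,
                       sigma=0.05, seed=0):
    """Elitist evolutionary search for the fraction that best fits `observed`."""
    rng = random.Random(seed)
    best = rng.random()
    best_cost = misfit(simulate_scattering(best, q_values), observed)
    for _ in range(generations):
        for _ in range(offspring):
            # Mutate the current best and clamp to the valid range [0, 1].
            candidate = min(1.0, max(0.0, best + rng.gauss(0.0, sigma)))
            cost = misfit(simulate_scattering(candidate, q_values), observed)
            # Keep a candidate only if it fits the observation better.
            if cost < best_cost:
                best, best_cost = candidate, cost
    return best, best_cost


q_values = [0.1 * i for i in range(1, 21)]
# Synthetic "experiment": generated from the toy model with a fraction of 0.6.
observed = simulate_scattering(0.6, q_values)
estimate, cost = evolve_composition(observed, q_values)
print(f"estimated fraction: {estimate:.3f}")
```

A real loop would replace the toy forward model with a structure simulation and the synthetic curve with measured scattering, which is where the time and labor costs noted in these slides arise.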
How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base.
Challenge: it takes months to complete a single loop through the cycle.
Just as important, it is an incredibly labor-intensive and expensive process.
Network engineer
Parallel computer
Storage system
Data manager
Database architect
Parallel programmer
Salesforce: Customer Relationship Management
oDesk: Freelance work management
ADP:
Xero: Accounting
Cloud9 Analytics: BI
KnowledgeTree: Document management
Scientists—thousands
Services
Cloud underneath
Could put the bottom figure somewhere else??
“Most of materials science is bottlenecked by disordered structures”—Littlewood.
Solve inverse problem.
How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base. [IDEA: Expand knowledge base?]
The publish dashboard shows all current submissions at any stage of the submission workflow. Here users can view accepted submissions, see a list of all submissions currently in the curation process, view/edit their unfinished submissions, and start a new submission.
"The Scientist" will now start a new submission.
The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research.
Note: "The Scientist" can only see collections he is allowed to publish to.
"The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin Core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined.
Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication.
Note: "The Scientist" has missed an ORCID for one of his co-authors.
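A check like the one that catches the missing ORCID could look roughly like the sketch below. The check character follows the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers; the author record layout and the helper names are assumptions for illustration, not the publication service's actual code. The identifier used is the sample iD from ORCID's own documentation.

```python
import re

ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if `orcid` is well formed and its check character matches."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def authors_needing_orcid(authors):
    """List names of authors whose ORCID is missing or invalid."""
    return [
        a["name"] for a in authors
        if not a.get("orcid") or not is_valid_orcid(a["orcid"])
    ]


# Hypothetical author records; the second co-author has no ORCID yet.
authors = [
    {"name": "The Scientist", "orcid": "0000-0002-1825-0097"},
    {"name": "Co-author", "orcid": ""},
]
print(authors_needing_orcid(authors))
```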
The second type of metadata required by the CNM relates to the materials science research at the Advanced Photon Source.
Here, "The Scientist" enters information such as keywords describing the dataset, information about the sponsors who funded this research, a description of the dataset, the experiment name, the materials analyzed in this dataset, the energy density of the materials (this is important for research into battery development) and the Argonne General User Proposal (GUP) number. The GUP number is a unique identifier for all beam time allocations at the APS and is used by administrators to associate researchers, experiments, and allocations.
All of this entered information can be subsequently used by other researchers with appropriate access to discover this dataset.
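Since the metadata requirements are defined per collection, one way to picture them is as a small configuration that a submission is checked against. The sketch below is a guess at the shape of such a check; the page names, field names, and sample values are hypothetical and do not reflect the real CNM configuration.

```python
# Hypothetical collection configuration: each page lists its required fields.
CNM_REQUIREMENTS = {
    "dublin_core": ["authors", "title", "publication_date"],
    "scientific": ["keywords", "description", "experiment_name",
                   "materials", "energy_density", "gup_number"],
}


def missing_fields(requirements, submission):
    """Return {page: [field, ...]} for required fields that are absent or empty."""
    problems = {}
    for page, fields in requirements.items():
        entered = submission.get(page, {})
        missing = [f for f in fields if entered.get(f) in (None, "", [])]
        if missing:
            problems[page] = missing
    return problems


# Placeholder submission: every value is invented, and the GUP number is absent.
submission = {
    "dublin_core": {
        "authors": ["The Scientist"],
        "title": "Example dataset",
        "publication_date": "2014-01-01",
    },
    "scientific": {
        "keywords": ["Li-ion", "microcapsules"],
        "description": "Example description",
        "experiment_name": "Example experiment",
        "materials": ["example material"],
        "energy_density": 2000,
    },
}
print(missing_fields(CNM_REQUIREMENTS, submission))
```

A different collection would supply its own requirements dictionary, including any additional pages, without changing the checking function.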
Having described the dataset, "The Scientist" must now assemble the dataset. To do so, he first chooses to select the files to be published.
Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11).
This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist".
The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and he can arrange these files in any hierarchy.
At the completion of the submission, the permissions on the endpoint will be changed so that the dataset is immutable. "The Scientist" will be given read access to the dataset, and collection curators will also be given read access so that they can view the contents.
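The access rules described for the submission endpoint can be summarized in a small model. This is a sketch of the rules as stated in these notes, not of how Globus implements them; the `Submission` class and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """Hypothetical model of a submission endpoint and its access rules."""
    owner: str
    curators: set = field(default_factory=set)
    files: dict = field(default_factory=dict)
    submitted: bool = False

    def can_read(self, user):
        # Before submission only the owner has access; afterwards curators do too.
        if user == self.owner:
            return True
        return self.submitted and user in self.curators

    def can_write(self, user):
        # The dataset becomes immutable once submitted.
        return user == self.owner and not self.submitted

    def add_file(self, user, path, content):
        if not self.can_write(user):
            raise PermissionError(f"{user} cannot modify this dataset")
        self.files[path] = content

    def submit(self, user):
        if user != self.owner:
            raise PermissionError("only the owner can submit")
        self.submitted = True


submission = Submission(owner="scientist", curators={"curator"})
submission.add_file("scientist", "run1/data.h5", b"example bytes")
submission.submit("scientist")
print(submission.can_read("curator"), submission.can_write("scientist"))
```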
When "The Scientist" is happy with his assembled dataset, he can return to the publication workflow. Here, he sees a summary of the dataset and may confirm the correct file sizes and names are associated. The system attempts to determine the file types for each of the dataset’s files.
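How the system determines file types is not described here. One simple approach, shown purely as an assumption, is to guess from the file name with Python's standard `mimetypes` module; the `summarize_files` helper and the sample file names are hypothetical.

```python
import mimetypes


def summarize_files(entries):
    """Return (name, size, guessed type) for each (name, size_in_bytes) entry."""
    summary = []
    for name, size in entries:
        # guess_type returns (type, encoding); the type is None when unknown.
        guessed, _encoding = mimetypes.guess_type(name)
        summary.append((name, size, guessed or "unknown"))
    return summary


# Hypothetical dataset listing.
listing = [("notes.txt", 1024), ("scan.unknownext", 2048)]
for name, size, kind in summarize_files(listing):
    print(f"{name}\t{size} bytes\t{kind}")
```

Name-based guessing cannot identify formats with unregistered extensions, so a production system would likely also inspect file contents.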
"The Scientist" can choose to edit, remove or add files if necessary.
When submitted, the dataset enters a pre-determined curation workflow. "The Scientist" can check the progress of the submission through his dashboard, which will also show whether any further attention is required.
“The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags.
Each of these fields can be used to search for a particular dataset.
Knowing that other collections may well have datasets of interest, “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from 2 collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
Going further, “The Researcher” can use different queries such as key-value and ranges. In this case, “The Researcher” searches for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo with an associated key-value pair of energy-density:2000 that fits the range query criteria.
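The combination of a keyword and a range query can be sketched as a filter over dataset records. The record layout, the `matches` and `search` helpers, and the sample records are assumptions for illustration (only the energy-density:2000 pair and the search terms come from the demo); the real service's query interface is not shown in these notes.

```python
def matches(dataset, keywords=(), ranges=None):
    """True if the dataset has every keyword and satisfies every range filter."""
    tags = {t.lower() for t in dataset.get("keywords", [])}
    if not all(k.lower() in tags for k in keywords):
        return False
    for key, (low, high) in (ranges or {}).items():
        value = dataset.get("key_values", {}).get(key)
        if value is None:
            return False
        # Bounds are exclusive; None means the bound is open.
        if low is not None and value <= low:
            return False
        if high is not None and value >= high:
            return False
    return True


def search(datasets, keywords=(), ranges=None):
    """Return the names of datasets that match the query."""
    return [d["name"] for d in datasets if matches(d, keywords, ranges)]


# Hypothetical published records.
datasets = [
    {"name": "Microcapsule cells", "keywords": ["Li-ion", "microcapsules"],
     "key_values": {"energy-density": 2000}},
    {"name": "Baseline cells", "keywords": ["Li-ion"],
     "key_values": {"energy-density": 1200}},
]
found = search(datasets, keywords=["microcapsules"],
               ranges={"energy-density": (1500, None)})
print(found)
```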
Having found the desired published dataset, “The Researcher” can navigate to the summary page.
Finally, “The Researcher” can view the downloaded dataset on their desktop PC.
Could add figures here of some sort to communicate the driving to zero idea. Notes about usage scaling with ease of use?