Networking Materials Data

A talk given at a workshop in Atlanta on "Building an Integrated MGI Accelerator Network": see http://acceleratornetwork.org/event/building-an-integrated-mgi-accelerator-network/.

The US Materials Genome Initiative seeks to develop an infrastructure that will accelerate advanced materials development and deployment. The term Materials Genome suggests a science that is fundamentally driven by the systematic capture of large quantities of elemental data. In practice, we know, things are more complex—in materials as in biology. Nevertheless, the ability to locate and reuse data is often essential to research progress. I discuss here three aspects of networking materials data: data publication and discovery; linking instruments, computations, and people to enable new research modalities based on near-real-time processing; and organizing data generation, transformation, and analysis software to facilitate understanding and reuse. I use these three problems to motivate a discussion of recent results in cloud computing, data publication management, high-performance computing, and related topics.

Published in: Science, Technology, Education

  • We have this vision of what we want the system to be able to do - Looking for tools that bring data together seamlessly, and connect them with models (a la hubzero)
    HOWEVER, materials is not as advanced in this as some fields, such as astronomy, genetics, etc. The broad materials community hasn’t entirely bought into the utility of data sharing.
    An open collaboration platform is vital to the success of MGI – tell us how to do this!
    Knowledge and expertise supporting the development of the materials of the 21st century are widely dispersed. To address the need for collaboration, some organizations are developing online materials databases and platforms. These "hubs" make materials research data public and searchable in order to improve transparency and help the materials community build on existing knowledge rather than unknowingly repeat research or duplicate efforts in parallel. Moreover, they provide modeling and simulation and systems integration tools that are necessary to manage and interpret complex datasets representing the structure and properties of materials.

  • Is data a rare treasure, to be carefully curated, cherished …
  • In the case of data, constant movement and change is really the norm. Data is constantly created, flows, is lost, …
  • A project at Argonne’s Advanced Photon Source seeks to understand the properties of disordered structures such as paramagnetic insulators.
  • “Most of materials science is bottlenecked by disordered structures”—Littlewood.
    Solve inverse problem.
    How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base.
    Challenge: it takes months to do a single loop through the cycle.
    Just as important, it is an incredibly labor-intensive and expensive process.
  • Network engineer
    Parallel computer
    Storage system
    Data manager
    Database architect
    Parallel programmer
  • Salesforce: Customer Relationship Management
    oDesk: Freelance work management
    ADP:
    Xero: Accounting
    Cloud9 Analytics: BI
    KnowledgeTree: Document management
  • Scientists—thousands
    Services
    Cloud underneath

    Could put the bottom figure somewhere else??
  • “Most of materials science is bottlenecked by disordered structures”—Littlewood.
    Solve inverse problem.
    How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base. [IDEA: Expand knowledge base?]
  • The publish dashboard shows all current submissions at any stage of the submission workflow. Here users can view accepted submissions, see a list of all submissions currently in the curation process, view/edit their unfinished submissions, and start a new submission.

    "The Scientist" will now start a new submission.
  • The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research.

    Note: "The Scientist" can only see collections he is allowed to publish to.
  • "The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined.

    Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication.

    Note: "The Scientist" has missed an ORCID for one of his co-authors.
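
The check that would catch the missing ORCID can be sketched as follows. This is a minimal illustration, not the publication service's actual validation: the `Author` and `Submission` classes, their field names, and the `validate` function are hypothetical, and the author names are placeholders. The checksum is the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers.

```python
import re
from dataclasses import dataclass, field

# ORCID iDs are 16 characters in four hyphenated groups; the last may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_checksum_ok(orcid: str) -> bool:
    """Check format and the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[-1] == expected


@dataclass
class Author:  # hypothetical record, not the service's schema
    name: str
    orcid: str = ""


@dataclass
class Submission:  # hypothetical subset of the descriptive fields
    title: str
    authors: list = field(default_factory=list)
    related_publication_doi: str = ""


def validate(submission: Submission) -> list:
    """Return a list of human-readable problems; empty means nothing was flagged."""
    problems = []
    if not submission.title.strip():
        problems.append("missing title")
    for author in submission.authors:
        if not author.orcid:
            problems.append(f"missing ORCID for {author.name}")
        elif not orcid_checksum_ok(author.orcid):
            problems.append(f"invalid ORCID for {author.name}")
    return problems


demo = Submission(
    title="Example dataset",
    authors=[
        Author("A. Scientist", "0000-0002-1825-0097"),  # ORCID's documented sample iD
        Author("B. Coauthor"),  # ORCID left blank, as in the demo
    ],
)
print(validate(demo))
```

Returning a list of problems rather than raising on the first one lets a form show every field that still needs attention in a single pass.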
  • The second type of metadata required by the CNM relates to the materials science research at the Advanced Photon Source.

    Here, "The Scientist" enters information such as keywords describing the dataset, information about the sponsors who funded this research, a description of the dataset, the experiment name, the materials analyzed in this dataset, the energy density of the materials (this is important for research into battery development) and the Argonne General User Proposal (GUP) number. The GUP number is a unique identifier for all beam time allocations at the APS and is used by administrators to associate researchers, experiments, and allocations.

    All of this entered information can be subsequently used by other researchers with appropriate access to discover this dataset.
  • Having described the dataset, "The Scientist" must now assemble the dataset. To do so, he first chooses to select the files to be published.
  • Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11).

    This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist".

    The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and he can arrange these files in any hierarchy.

    At the completion of the submission, the permissions on the endpoint will be changed so that the dataset is immutable. "The Scientist" and the collection curators will be given read access to the dataset so that they can view its contents.
  • When "The Scientist" is happy with his assembled dataset, he can return to the publication workflow. Here, he sees a summary of the dataset and may confirm the correct file sizes and names are associated. The system attempts to determine the file types for each of the dataset’s files.

    "The Scientist" can choose to edit, remove or add files if necessary.
  • Once submitted, the dataset enters a pre-determined curation workflow. "The Scientist" can check the progress of the submission through his dashboard, which also displays any items that require further attention.
  • “The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags.

    Each of these fields can be used to search for a particular dataset.
  • Knowing that other collections may well have datasets of interest, “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from 2 collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
  • Going further, “The Researcher” can use different kinds of queries, such as key-value and range queries. In this case, “The Researcher” searches for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo, whose associated key-value pair energy-density:2000 satisfies the range query.
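
A combined keyword and range query of this kind can be sketched in a few lines. This is an in-memory illustration only, not the discovery service's query interface: the `Dataset` class, the `search` function, its `greater_than` parameter, and the dataset titles are all hypothetical. Only the `energy-density` key, the value 2000, the bound 1500, and the keywords come from the demo.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:  # hypothetical record for illustration
    title: str
    collection: str
    keywords: set = field(default_factory=set)
    values: dict = field(default_factory=dict)  # numeric key-value tags


def search(datasets, keyword=None, greater_than=None):
    """Return datasets that carry the keyword and satisfy every 'key > bound' test."""
    greater_than = greater_than or {}
    hits = []
    for ds in datasets:
        tags = {k.lower() for k in ds.keywords}
        if keyword is not None and keyword.lower() not in tags:
            continue
        # A dataset that lacks the key cannot satisfy a range constraint on it.
        if all(key in ds.values and ds.values[key] > bound
               for key, bound in greater_than.items()):
            hits.append(ds)
    return hits


catalog = [
    Dataset("Microcapsule study (placeholder title)", "CNM",
            {"microcapsules", "Li-ion"}, {"energy-density": 2000}),
    Dataset("Baseline cell (placeholder title)", "Chemical Sciences and Engineering",
            {"Li-ion"}, {"energy-density": 1200}),
]

hits = search(catalog, keyword="microcapsules", greater_than={"energy-density": 1500})
print([ds.title for ds in hits])
```

A production index would rank results by relevance and enforce access control per collection, both of which this sketch omits.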
  • Having found the desired published dataset, “The Researcher” can navigate to the summary page.
  • Finally, “The Researcher” can view the downloaded dataset on their desktop PC.
  • Could add figures here of some sort to communicate the driving to zero idea. Notes about usage scaling with ease of use?
  • Add NIST and Amazon
  • Networking Materials Data

    1. computationinstitute.org Networking materials data. Ian Foster, foster@anl.gov, ianfoster.org
    4. Materials Innovation Infrastructure. A data sharing system to facilitate: • Use of a broader set of data to render more accurate models • Multi-disciplinary communication among scientists and engineers working on different stages of materials development • Searches for advanced materials with specific, desired properties • Curating and sharing of reliable computational code for modeling and simulation. Credit: Meredith Drosback, OSTP. Computation, Data, Experiment
    5. Data: Rare treasure? http://www.thejakartapost.com/news/2011/05/14/holy-water.html
    6. It’s both … We must manage the data deluge, both to enhance user productivity and to increase data capture. Network materials data. Or chaotic deluge? Wellington bucket fountain: https://www.youtube.com/watch?v=_p_FNNDu16w
    7. Linking simulation and experiment to study disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne
    8. Linking simulation and experiment to study disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60% Sr 40%; Detect errors (secs to mins); Knowledge base: past experiments, simulations, literature, expert knowledge; Select experiments (mins to hours); Contribute to knowledge base; Simulations driven by experiments (mins to days); Knowledge-driven decision making; Evolutionary optimization
    9. An expensive business … Network engineer; Parallel programmer; Software engineer; Database architect; Database manager; Software engineer; Data engineer; Parallel programmer; Postdoc; Postdoc
    10. A small business, 20 years ago. Secretary; HR manager; Marketing; Database manager; Accountant; IT department; Personal assistant; Shipping department; Intern; Payroll
    11. A small business, today. “Business cloud”: Reduce costs; Speed innovation; Reliable, scalable, simple
    12. Can we do the same for research? “Discovery cloud”: Reduce costs; Speed discovery; Reliable, scalable, simple?
    13. File transfer & sharing. Discovery cloud: Globus research data management services, www.globus.org
    14. Linking simulation and experiment to study disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60% Sr 40%. Globus transfer service. Cloud hosted: reliable, secure, fast. 20K users, 3B files, 50 PB transferred. Available at www.globus.org
    15. File transfer & sharing; Identity & group management. Discovery cloud: Globus research data management services, www.globus.org
    16. Linking simulation and experiment to study disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60% Sr 40%; Evolutionary optimization. Globus sharing; Identities, groups, profiles; Cloud hosted
    17. File transfer & sharing; Data publication & discovery; Identity & group management. Discovery cloud: Globus research data management services, www.globus.org
    18. Linking simulation and experiment to study disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60% Sr 40%; Knowledge base: past experiments, simulations, literature, expert knowledge; Contribute to knowledge base; Knowledge-driven decision making. Globus data publication and discovery; Cloud hosted
    19. Data publication and discovery. We are looking for pilot users! Metadata Access Control License Storage Curation Workflow Policies Collection Metadata Data Metadata Data Metadata Data Dataset Dataset Dataset Community
    20. Publish dashboard
    21. Start a new submission
    22. Describe submission: 1) Dublin Core
    23. Describe submission: 2) Science metadata
    24. Assemble the dataset
    25. Transfer files to submission endpoint
    26. Check dataset is assembled correctly
    27. Submission now in curation workflow
    28. Search published datasets
    29. Search across collections
    30. Discover a published dataset
    31. Select a published dataset
    32. View downloaded dataset
    33. File transfer & sharing; Data publication & discovery; Simulation & data analysis; Identity & group management. Discovery cloud: Globus research data management services, www.globus.org
    34. Linking simulation and experiment to study disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. Sample; Experimental scattering; Material composition; Simulated structure; Simulated scattering; La 60% Sr 40%; Detect errors (secs to mins); Knowledge base: past experiments, simulations, literature, expert knowledge; Select experiments (mins to hours); Contribute to knowledge base; Simulations driven by experiments (mins to days); Knowledge-driven decision making; Evolutionary optimization
    35. Justin Wozniak et al.
    36. Tool shed: simulation models & analysis tools. Data space: local and remote datasets. Workflows: link data, tools in reusable form. Simulation and data analysis: Point and click parallelism. Capture domain knowledge: data and code. Reusable workflows encode commonly used modeling and analysis pipelines. Builds on widely used Galaxy, Globus, and Swift systems: galaxyproject.org ✧ globus.org ✧ swift-lang.org. Large simulation campaigns. Hosted on Amazon cloud for reliable, on-demand access and scalability
    37. Discovery Cloud: Three common themes. 1) Accelerate discovery via automation 2) Slash costs of trying new methods – No local software installation – No need to read manual – On-demand, elastic scalability – Low operational costs, proactive support 3) Make data preservation trivial
    38. Take away messages • Data has a dual nature: rare treasure and chaotic deluge • MGI must embrace this duality – Treasure: Store, curate, index, preserve – Deluge: Slash management costs, to both accelerate use & facilitate data preservation • Cloud services can help in both areas
    39. Thanks to great colleagues and collaborators • Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, Steve Tuecke, Justin Wozniak, and other CS colleagues • Ray Osborn, Francesco de Carlo, Chris Jacobsen, Nicola Ferrier, and other Argonne scientists • Juan de Pablo, Peter Voorhees, and other NIST CHiMaD participants
    40. Thank you to our sponsors!
