Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

20160922 Materials Data Facility TMS Webinar

122 views

Published on

Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.

Published in: Science
  • Be the first to comment

20160922 Materials Data Facility TMS Webinar

  1. 1. Ben Blaiszik (blaiszik@uchicago.edu), Kyle Chard, Rachana Ananthakrishnan Michael Ondrejcek, Kenton McHenry PIs: Ian Foster (foster@uchicago.edu), Steven Tuecke, John Towns materialsdatafacility.org globus.org Materials Data Facility - Data Services to Advance Materials Science Research
  2. 2. 2 http://dx.doi.org/10.1007/s11837-016-2001-3 MDF Article in JOM (August Issue)
  3. 3. MaterialsDataFacility.org 3 To get started, contact Ben Blaiszik blaiszik@uchicago.edu
  4. 4. 4 Outline APIs • Overview § MDF Overview § Globus quick introduction • MDF Data Publication Service § Key MDF data pub service features § Publication walk-through • General Observations and Future Outlook
  5. 5. What is MDF? 5 We are developing production services to make it more simple for materials datasets and resources to be ... Published Identified Described Curated Verifiable Accessible Preserved Discovered Searched Browsed Shared Recommended Accessed and SRD Publishable Results Published Results Resource Data Ref Data Derived Data Working Data * Figure adapted from Warren et al.
  6. 6. Data Service Infrastructure 6 Publication Discovery Compute for data interaction and viz Resource Registration APIs + + + - Initial Foci
  7. 7. 7 Publication APIs • Identify datasets with persistent identifiers (e.g. DOI) • Describe datasets with appropriate metadata and provenance • Verify dataset contents over time • Preserve critical datasets in a state that increases transparency, replicability, and helps encourage reuse
  8. 8. 8 Discovery • Search and query datasets in modern ways – e.g. via search against indexed metadata and harvested file contents rather than remembering opaque file paths Future... Spotlight for all data you have access to regardless of location Under Development
  9. 9. 9 Discovery Under Development • SaaS cloud-hosted solution • Logical metadata repository to index many external sources • Flexible queries (boosting, full text, partial matches, etc.) • Search results are limited by ACLs
  10. 10. 10 Discovery Under Development • All MDF-published datasets will be indexed • May use common schemas (Datacite, Dublin Core etc.) or domain specific • Globus endpoint contents may be indexed (owner enabled) • Index has the flexibility of no required schema • Built on Elasticsearch for proven scalability and speed, hosted on scalable AWS resources
  11. 11. 11 Discovery Under Development Custom boosting Facets Test-indexed data
  12. 12. 13 Globus Background https://www.globus.org
  13. 13. Globus Platform-as-a-Service (PaaS) 14 Identity management User groups Data transfer Data sharing • Share directly from your storage device (laptop or cluster) • File and directory-level ACLs • Manage user group creation and administration flows • Share data with user groups • High-performance data transfer from a web browser • Optimize transfer settings and verify transfer integrity • Add your laptop to the Globus cloud with Globus Connect Personal • create and manage a unique identity linked to external identities for authentication Publication Discovery
  14. 14. REST APIs, Clients, and Docs 15 • New version of core services released in Feb. • New Python SDK available § https://github.com/globusonline/globus-sdk-python • Jupyter Notebook Examples § https://github.com/globus/globus-jupyter-notebooks • Sample Data Portal § https://github.com/globus/globus-sample-data-portal • (alpha) MDF Data Publication Service API
  15. 15. Globus Background 16 B Globus moves the data for you secure endpoint, e.g. laptop You submit a transfer request Globus notifies you once the transfer is complete secure endpoint, e.g. midway transfer A Endpoint • E.g. laptop or server running a Globus client (e.g. Dropbox client) • Enables advanced file transfer and sharing • Currently GridFTP, future GridFTP + HTTP Some Key Features • REST API for automation and interoperability • Web UI for convenience • Optimizes and verifies transfers • Handles auto-restarts • Battle tested with big data
  16. 16. Globus Web UI 17 Endpoint • E.g. laptop or server running a Globus client (e.g. Dropbox client) • Enables advanced file transfer and sharing • Currently GridFTP, future GridFTP + HTTP Some Key Features • REST API for automation and interoperability • Web UI for convenience • Optimizes and verifies transfers • Handles auto-restarts • Battle tested with big data
  17. 17. 19 Data Publication Where are we Now?
  18. 18. 20 Materials Data Publication/Discovery is Often a Challenge Data Collection Data Storage and Process Publication
  19. 19. 21 Materials Data Publication/Discovery is Often a Challenge Data Collection ? ? ? Networked storage, sometimes many TB Unique identifier data for search/cite Custom metadata descriptions Data curation workflow Automation capabilities Data Storage and Process Publication Want to Discover / Use Want to Publish Don’t put under desk! Needed to close the loop
  20. 20. 22 Data Collection ? ? ? Need storage, sometimes many TB Need to uniquely identify data for search/cite Need custom metadata descriptions Need a data curation workflow Need automation capabilities Data Storage and Process Publication Want to Discover / Use Want to Publish Materials Data Publication/Discovery is Often a Challenge Don’t put under desk!
  21. 21. Collection Model 23 • Collections might be a research group or a research topic... • Collections have specified § Mapping to storage endpoint § Currently handled as automatically created shared endpoints § Metadata schemas § Access control policies § Licenses § Curation workflows • Collections contain § Datasets § Data § Metadata • Metadata Persistence § Metadata log file with dataset § Metadata replicated in search index
  22. 22. Hybrid Distributed Model 24 Petrel @Argonne 1.7 PB BlueWaters Condo @UIUC 100 TB EP 1 EP 2 EP 3 Campus RDS DOE Cloud Metadata Index And Tools Centralized resource Globus endpoint NSF (XSEDE) ElectroCat EP
  23. 23. Publish Large Datasets 25 • Distributed data model leverages Globus production capabilities for file transfer (i.e. dataset assembly), user authentication, and access control groups • 100s of TB of reliable storage @ NCSA, and more storage at Argonne § Globus endpoint at ncsa#mdf on Nebula § Expandable to many PBs as necessary § Automated tape backup for reliability (in progress) • Researchers can optionally use your own local or institutional storage
  24. 24. Uniquely Identify Datasets 26 • Associate a unique identifier with a dataset § DOI, Handle • Improve dataset discovery and citability § Aligning incentives and understanding the culture will be critical to driving adoption DatasetDownloads Time • Your work has been cited 153 times in the last year • Researchers from 30 institutions have downloaded your datasets Future...
  25. 25. Share Data with Flexible ACLs 27 • Share data publicly, with a set of users, or keep data private Leverage Curation Workflows • Collection administrators can specify the level of curation workflow required for a given collection e.g. § No curation § Curation of metadata only § Curation of metadata and files
  26. 26. Customize Metadata 28 • Build a custom metadata schema for your specific research data • Re-use existing metadata schemas • Working in conjunction with NIST researchers to define these schemas • Can we build a system that allows schema: § Inheritance § E.g. a schema “polymers” might inherit and expand upon the “base material” of NIST § Versioning § E.g. Understand contextually how to map fields between versions § Dependence § E.g. Allows the ability to build consensus around schemas Future...
  27. 27. 29 MDF Submission Walkthrough
  28. 28. Example Use Case 30 Publishing Big, Remote Data Collected multi TB of data at a light source Bundle the data with metadata and provenance Want a citable DOI to share the raw and derived data with the community Want their data to be discoverable by free text search and custom metadata
  29. 29. MDF Collection Home 31
  30. 30. MDF Collections 32 Recall: Policies Set at the Collection Level • Required metadata, schemas • Data storage location • Metadata curation policies
  31. 31. MDF Metadata Entry 33 • Scientist or representative describes the data they are submitting • For this collection Dublin Core and a custom metadata template are required
  32. 32. MDF Custom Metadata 34 • Scientist or representative describes the data they are submitting • For this collection Dublin Core and a custom metadata template are required
  33. 33. Dataset Assembly 35 • Shared endpoint is auto-created on collection-specified data store • Scientist transfers dataset files to a unique publish endpoint • Dataset may be assembled over any period of time • When submission is finished, dataset will be rendered immutable via checksum (e.g. NU) (e.g. UIUC Nebula)
  34. 34. Dataset Assembly 36 • Shared endpoint is auto-created on collection-specified data store • Scientist transfers dataset files to a unique publish endpoint • Dataset may be assembled over any period of time • When submission is finished, dataset will be rendered immutable via checksum (e.g. NU) (e.g. UIUC Nebula)
  35. 35. Dataset Curation (Optional) 37 • Optionally specified in collection configuration • Can be approved or rejected (i.e. sent back to the submitter)
  36. 36. Mint a Permanent Identifier 38 Can be DOI or Handle
  37. 37. Dataset Record 39
  38. 38. Dataset Discovery 40
  39. 39. 47 General Observations and Future Outlook
  40. 40. 48 Publication Year 1 Milestones APIs • Opened to the public in March 2016 • Provisioned reliable storage to support researchers sharing open materials data (~200 TB) • MDF data volume approaching ~ 6 TB of materials data • Started building deep relationships with many of the key materials data generating groups and communities • Ingested dataset > 1 TB in size • Ingested dataset > 1.5M files
  41. 41. Integration with the Community is Key 49 Materials Project OQMD Citrination Materials Commons Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers! Metadata Publishing MetadataMD, Pub., Compute Metadata Publishing NCSA-PIREHV/TMSMBDH
  42. 42. Understanding Incentives is Critical 50 Meeting Award Requirements Smoothing Dislocations Increasing Impact • Increase paper citations1 • Add dataset citation capabilities • [Distance] Enable simple sharing among collaborators (near and far) • [Personnel] Ease transitions between students • [Format] Lessen need for ad hoc resource sharing (e.g. via group websites) • Simplify DMP compliance 1 Citation increase 30 (10.7717/peerj.175) - 60% (10.1371/journal.pone.0000308) [caveat bio research]
  43. 43. Lessons Learned 51 • The demand is there from researchers and institutions • Lots of cross-over with centers and projects § (NIST) CHiMaD § (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute § (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke* • Data Heterogeneity is a challenge § Metadata is the major sticking point • Friction points § Need more flexible data objects e.g. {“temperature”:100, “unit”:“K”} § Need file or directory based metadata § Immutable datasets alone is not enough à Versioning § Data gathering in retrospect § Schema generation and interoperability § Working with and following developments at NIST, RDA, Citrine et al. § Differing institutional approval processes § Lack of programmatic interface (planned). • Support for data interactivity and visualization • Smart versioning for large file-based datasets
  44. 44. Wider Data Community 52 • Curated and described datasets • Well-posed problems • Community to share analyses • Challenges to start “sprints” • Great APIs and clients • Examples to get started • Hundreds of video tutorials Materials ProjectOQMD Citrination Materials Commons • Less inherently intuitive problems • Sometimes need advanced compute capabilities • Often many TB
  45. 45. 53 • Continuous integration, QA, and testing • Containerized solutions, microservice architecture, abstracting software from hardware • Automation • Internet of Things (IoT) – connect everything • Machine Learning / AI • Natural Language Processing (Siri, chatbots or “slack”bots, etc.) • Search rules the world – ok this was 20 years ago… What are the analogs and applications in the materials community? Materials ProjectOQMD Citrination Materials Commons • Less inherently intuitive problems • Sometimes need advanced compute capabilities • Often many TB Broader Trends
  46. 46. 54 Experimentation Ahead No team commitments here! Open source opportunities, contact: blaiszik@uchicago.edu
  47. 47. Use Case: Scenario Generator-Consumer 55 • Data generator § Generates data periodically (perhaps from an instrument) § Pushes data to a public channel § Schema is validated before inclusion in channel stream • Data consumer § Polls channel periodically § Wants to pull datasets by property Dataset Channel MDF-composites Data Generator Data Consumer DatasetDatasetDataset DatasetDatasetcreate q: result q
  48. 48. Automated Data Aggregation (consumer) 56
  49. 49. Aggregate, Perform ML 58 • Combine cloud-published dataset, scikitlearn, pandas to predict steel fatigue and “reproduce” data from journal publication
  50. 50. Aggregate, Perform ML, Visualize 59 • Combine cloud-published dataset, scikitlearn, pandas to predict steel fatigue and validate journal publication
  51. 51. What’s Currently Available? 60 • Web interface to support data publication (public- facing APIs coming soon) • 100s of TB of storage at NCSA (scalable to many PB) more at Argonne (1.7 PB total on Petrel – not all for materials…) • Help with developing metadata schemas to describe your research datasets MDF Tutorial on Github https://github.com/blaiszik/materials-data-facility-training
  52. 52. What are we looking for? 61 • Early adopters, willing to get their hands dirty with the service and give honest feedback • Key integration points where metadata is picked up automatically! • Key datasets and resources of all sizes, shapes, raw or derived, that might help us understand the process better
  53. 53. Thanks to Our Sponsors! 62 U . S . D E P A RT M E N T O F ENERGY
  54. 54. Publication REST APIs Discovery • Identify datasets with persistent identifiers (e.g. DOI) • Describe datasets with appropriate metadata and provenance • Verify dataset contents over time • Handle big (and small) data: We have already ingested datasets with > 1.5M files and > 1TB in size • Search and query datasets in modern ways • Index metadata and harvest file contents • Simple user interfaces (i.e., after Google and Amazon) Opened to external users in Mar. 2016 ~ 6 TB of data published Materialsdatafacility .org

×