GeoDataspace: Simplifying Data Management Tasks with Globus


Published in: Technology

  1. 1. GeoDataspace: Simplifying Data Management Tasks with Globus. Tanu Malik, Ian Foster, Kyle Chard, Roselyne Tchoua, Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham
  2. 2. Share and Reproduce Alice wants to share her models and simulation output with Bob, and Bob wants to re-execute Alice’s application to validate her inputs and outputs.
  3. 3. Alice’s Options 1. Create a tar and gzip archive 2. Build a website with model code, parameters, and data 3. Create a virtual machine
  4. 4. Bob’s Frustration 1. I do not find the required for building the model. 2. How do I? Lack of easy and efficient methods for sharing and reproducibility. [Figure axes: "Amount of pain Bob suffers", "Amount of pain Alice suffers"]
  5. 5. GeoDataspace • Goal: Sharing and reproducibility hand-in-hand • Target users: Computational geoscientists • Data and model integration • Research output is more than "just" a research paper
  6. 6. GeoDataspace CI Components • The geounits: units of scientific activity/research output; how to capture and track this activity • Globus Catalog: a scalable, flexible catalog for annotations conforming to the open-world assumption • Globus Publish and reproduce geounits: share/publish geounits for others; replay geounits for analysis
  7. 7. geounits: package data, source code, and environment
  8. 8. geounit Client: Provenance is key 1. audit <program name> 2. PROV compliant database 3. exec <program name> [activity]
  9. 9. geounit Client: Features • Based on Code, Data, Environment (CDE’s) ptrace and okapi functionality • Data/code can be local or distributed • Data/code files are not manifested into the package until ready to share; only descriptions in package • Specify granularity of auditing • Partial replay • Unpack into Docker or Vagrant
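The audit step on the two geounit Client slides can be illustrated with a small sketch. It is not the geounit Client itself: the `audit` helper, its arguments, and the record layout are assumptions made for illustration, and input and output files are declared by hand here, whereas a ptrace-based auditor would discover them automatically. The field names loosely follow PROV vocabulary (`used`, `generated`, `startedAtTime`, `endedAtTime`).

```python
import hashlib
import json
import subprocess
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def audit(command, inputs, outputs):
    """Run `command` and return a PROV-style record of the run.

    Hypothetical helper: inputs and outputs are declared by the caller
    instead of being traced with ptrace.
    """
    # Hash the inputs before the run so later changes do not affect the record.
    used = [{"path": str(p), "sha256": sha256_of(p)} for p in inputs]
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, check=False)
    ended = datetime.now(timezone.utc).isoformat()
    # Only outputs that actually exist after the run are recorded.
    generated = [
        {"path": str(p), "sha256": sha256_of(p)}
        for p in outputs
        if Path(p).exists()
    ]
    return {
        "activity": {
            "command": [str(c) for c in command],
            "startedAtTime": started,
            "endedAtTime": ended,
            "exit_code": result.returncode,
        },
        "used": used,
        "generated": generated,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "params.txt"
        dst = Path(tmp) / "result.txt"
        src.write_text("viscosity=1e21\n")
        # A stand-in "model": copy the input to the output in upper case.
        script = (
            "import sys; "
            "open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())"
        )
        record = audit(
            [sys.executable, "-c", script, str(src), str(dst)], [src], [dst]
        )
        print(json.dumps(record, indent=2))
```

Storing content hashes next to paths is one way to let a later replay check that it is running against the same inputs; the slides do not say whether the geounit Client does this.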
  10. 10. Globus Catalog: hosts geounits • Dataset Management Model • Catalog: a hosted resource that enables the grouping of related datasets • Dataset: a virtual collection of (schemaless) metadata and distributed data elements, viz. files and provenance • Annotation: a piece of metadata that exists within the context of a dataset or data member
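The catalog, dataset, member, and annotation terms above can be sketched as plain Python data classes. This is an illustrative model only, not the Globus Catalog schema or client library: every class, field, and method name below is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A piece of metadata: a name and a schemaless value."""
    name: str
    value: object


@dataclass
class Member:
    """A data element of a dataset, e.g. a file or a provenance record."""
    location: str
    annotations: list = field(default_factory=list)


@dataclass
class Dataset:
    """A virtual collection of annotations and distributed members."""
    name: str
    members: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def annotate(self, name, value):
        self.annotations.append(Annotation(name, value))


@dataclass
class Catalog:
    """A grouping of related datasets."""
    name: str
    datasets: list = field(default_factory=list)

    def find(self, name, value):
        """Return datasets carrying an annotation with this name and value."""
        return [
            d for d in self.datasets
            if any(a.name == name and a.value == value for a in d.annotations)
        ]


if __name__ == "__main__":
    catalog = Catalog("geodataspace")
    unit = Dataset("geounit-1")
    unit.members.append(Member("run1/output.nc"))  # hypothetical file name
    unit.annotate("model", "VIC")
    catalog.datasets.append(unit)
    print([d.name for d in catalog.find("model", "VIC")])
```

Because annotations are free-form name and value pairs, new metadata can be attached to a dataset without changing a schema, which is the property the slide calls schemaless.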
  11. 11. Globus Catalog • Dataset Service • Virtual views of data based on user-defined and/or automatically extracted metadata (annotations) • Implemented as a service with web and REST interfaces • Relies on Globus Nexus for user authentication and group management • Client-side Tooling • Dataset ingest • Automatic creation of datasets and extraction of metadata from various common data formats and directory structures • Globus endpoints • Associate data (in files and directories) with one or more datasets • Python Client library • Integration with external services • Transfer: Moving datasets from their storage endpoint(s) to a selected destination • Faceted Browser Search • Search based on provenance entities and activities
  12. 12. Globus Catalog: REST interface. Approach: • Hosted user-defined catalogs • Based on annotation model <dataset/member, name, value> • Association of data members • Fine-grained access control • Flexible query language (name:value, free text, facets, …) • Integrated with other services. Resource paths: /geodataspace, /geodataspace/annotation, /geodataspace/geounit, /geodataspace/geounit/annotation, /geodataspace/geounit/acl, /geodataspace/geounit/members, /geodataspace/geounit/members/annotation, /geodataspace/geounit/provenance, /geodataspace/geounit/version
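The resource paths on this slide follow a simple nesting pattern (catalog, then geounit, then a sub-resource). A minimal sketch of building such paths is shown below. It only assembles strings and makes no network calls; the host name and both helper functions are hypothetical and are not part of any Globus client library.

```python
from urllib.parse import quote

# Hypothetical host; the slides do not give the service address.
BASE_URL = "https://catalog.example.org"


def resource_path(*segments):
    """Join path segments into a path, percent-encoding each one."""
    if not segments:
        raise ValueError("at least one path segment is required")
    return "/" + "/".join(quote(str(s), safe="") for s in segments)


def resource_url(*segments):
    """Return the full URL for a resource under the hypothetical host."""
    return BASE_URL + resource_path(*segments)


if __name__ == "__main__":
    catalog, unit = "geodataspace", "geounit"
    for tail in ([], ["annotation"], ["acl"], ["members"],
                 ["members", "annotation"], ["provenance"], ["version"]):
        print(resource_path(catalog, unit, *tail))
```

Encoding each segment separately keeps a name that contains a slash or a space from being read as extra path levels.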
  13. 13. Publish and Re-execute geounits • Still in the works • Each geounit can be published through Globus Publish and re-executed through an analysis platform
  14. 14. Science Drivers • Solid Earth • Space Science • Hydrology • CSDMS
  15. 15. Solid Earth • Allow reproducible, replayable geounits of GPlates • The GPlates software package has several dependencies • Create geounits of Kinematic Representation of Surface of Earth (3D and 4D models), containing: the GPlates software; the GPML files (XML for plate tectonics) used in the model; and the output GPML files, which are in a simple X/Y format or could be visualization files (a global set of visualization output, and images as well) • Integrating geounits in Python workflows • Incorporate metadata from workflows and use geounit metadata to inform workflows
  16. 16. Hydrology • Data processing steps for the VIC model [Figure: geounit 1, geounit 2, geounit 3, geounit 4] Objective: Monitor changes in the data processing steps and compare them across the various runs
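The objective on the Hydrology slide, comparing processing steps across runs, can be sketched as a comparison of two step-to-parameter mappings. The `compare_runs` function, the step names, and the parameters below are all hypothetical; the slides do not describe how geounits actually record or compare steps.

```python
def compare_runs(run_a, run_b):
    """Compare two runs, each a dict mapping step name -> parameter dict.

    Returns steps added in run_b, steps removed from run_a, and steps
    present in both whose parameters differ.
    """
    added = sorted(set(run_b) - set(run_a))
    removed = sorted(set(run_a) - set(run_b))
    changed = {
        step: (run_a[step], run_b[step])
        for step in sorted(set(run_a) & set(run_b))
        if run_a[step] != run_b[step]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical processing steps for two runs.
    run1 = {"regrid": {"resolution": 0.5}, "aggregate": {"period": "daily"}}
    run2 = {"regrid": {"resolution": 0.25}, "mask": {"basin": "example"}}
    print(compare_runs(run1, run2))
```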
  17. 17. Space Science • Create geounits of SuperDARN data and its plotting products • Publish them for validation
  18. 18. CSDMS • How geounits should be coupled • Metadata alignment issues • If we create geounits of CSDMS models, how do we enable suitable search interfaces with the provenance metadata and CSDMS metadata?
  19. 19. Current Work • Working with use cases to bootstrap geounits • Populating geounits based on Python workflows and incorporating geounits in workflows • Interfacing geounit Client with Globus Catalog • Improving distributed search functionality
  20. 20. Track it! • geodataspace • Software, Source code, Science Use Cases, Reports, Presentations, News
  21. 21. Acknowledgements • National Science Foundation • EarthCube Community • Globus team • CI team