GeoDataspace: Simplifying Data Management Tasks with Globus
1. Simplifying Data Management Tasks with Globus
Tanu Malik, Ian Foster, Kyle Chard, Roselyne Tchoua, Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham
GeoDataspace
2. Share and Reproduce
Alice wants to share her models and
simulation output with Bob, and Bob wants
to re-execute Alice’s application to validate
her inputs and outputs.
3. Alice’s Options
1. Create a tar and gzip archive
2. Build a website with model code,
parameters, and data
3. Create a virtual machine
4. Bob’s Frustration
1. I cannot find the lib.so required to build
the model.
2. How do I?
[Figure: lack of easy and efficient methods for sharing and reproducibility. Axes: amount of pain Bob suffers; amount of pain Alice suffers.]
5. GeoDataspace
• Goal: Sharing and reproducibility hand-in-hand
• Target users: Computational geoscientists
• Data and model integration
• Research Output is More Than "Just" a Research Paper
6. GeoDataspace CI Components
• The geounits
• Units of scientific activity/research output
• How to capture and track this activity
• Globus Catalog
• A scalable, flexible catalog for annotations
conforming to open-world assumption
• Globus Publish and reproduce geounits
• Share/Publish geounits for others
• Replay geounits for analysis
9. geounit Client: Features
• Based on the ptrace and okapi functionality of
CDE (Code, Data, Environment)
• Data/code can be local or distributed
• Data/code files are not materialized in the
package until it is ready to share; until then
the package holds only their descriptions
• Specify granularity of auditing
• Partial replay
• Unpack into Docker or Vagrant
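The "descriptions only until ready to share" idea above can be sketched as follows. This is a minimal illustration, not the geounit client's code: the class LazyPackage and its methods describe and materialize are hypothetical names, and the description fields (path, size, SHA-256) are an assumption about what such a description might hold.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


class LazyPackage:
    """Hypothetical package that stores file descriptions, not file contents."""

    def __init__(self):
        self.descriptions = []  # metadata only; nothing is copied yet

    def describe(self, path):
        """Record where a file lives, its size, and a checksum."""
        p = Path(path)
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        self.descriptions.append(
            {"path": str(p), "size": p.stat().st_size, "sha256": digest}
        )

    def materialize(self, target_dir):
        """Copy the described files into the package directory (share time)."""
        target = Path(target_dir)
        target.mkdir(parents=True, exist_ok=True)
        copied = []
        for d in self.descriptions:
            dest = target / Path(d["path"]).name
            shutil.copy2(d["path"], dest)
            copied.append(dest)
        return copied


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.dat"      # placeholder input file
    src.write_text("example")
    pkg = LazyPackage()
    pkg.describe(src)                  # only a description is stored
    package_dir = Path(tmp) / "package"
    exists_before = package_dir.exists()
    copied = pkg.materialize(package_dir)  # files are copied only now
    names_after = [p.name for p in copied]
```

The checksum recorded at describe time would also let a recipient verify that the file copied at share time is the one that was audited.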
10. Globus Catalog: hosts geounits
• Dataset Management Model
• Catalog: a hosted resource that enables
the grouping of related datasets
• Dataset: a virtual collection of
(schemaless) metadata and distributed
data elements, viz. files and provenance
• Annotation: a piece of metadata that
exists within the context of a dataset or
data member
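The catalog, dataset, and annotation model above can be pictured with a few plain data classes. This is a rough sketch for illustration only, not the Globus Catalog implementation: the class names, the find method, and the example values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Member:
    """A data element of a dataset, e.g. a file or a provenance record."""
    reference: str  # where the data lives (path or endpoint reference)
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    """A virtual collection of schemaless metadata plus distributed members."""
    name: str
    annotations: Dict[str, str] = field(default_factory=dict)
    members: List[Member] = field(default_factory=list)


@dataclass
class Catalog:
    """A hosted resource that groups related datasets."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def find(self, key: str, value: str) -> List[Dataset]:
        """Return datasets that carry the annotation key=value."""
        return [d for d in self.datasets if d.annotations.get(key) == value]


catalog = Catalog("geodataspace")
unit = Dataset("gplates-run-1")                  # hypothetical dataset name
unit.annotations["domain"] = "solid-earth"       # dataset-level annotation
unit.members.append(                             # member-level annotation
    Member("endpoint#/data/plates.gpml", {"format": "GPML"})
)
catalog.datasets.append(unit)
matches = catalog.find("domain", "solid-earth")
```

Because annotations are free-form name/value pairs, any dataset or member can carry names that no other one uses, which is what "schemaless" amounts to in this sketch.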
11. Globus Catalog
• Dataset Service
• Virtual views of data based on user-defined and/or automatically
extracted metadata (annotations)
• Implemented as a service with web and REST interfaces
• Relies on Globus Nexus for user authentication and group management
• Client-side Tooling
• Dataset ingest
• Automatic creation of datasets and extraction of metadata from various
common data formats and directory structures
• Globus endpoints
• Associate data (in files and directories) with one or more datasets
• Python Client library
• Integration with external services
• Transfer: Moving datasets from their storage endpoint(s) to a selected
destination
• Faceted Browser Search
• Search based on provenance entities and activities
12. Globus Catalog: REST interface
Approach
• Hosted user-defined catalogs
• Based on annotation model
<dataset/member, name, value>
• Association of data members
• Fine grained access control
• Flexible query language
– Name:value, free text, facets,…
• Integrated with other
services
/geodataspace
/geodataspace/annotation
/geodataspace/geounit
/geodataspace/geounit/annotation
/geodataspace/geounit/acl
/geodataspace/geounit/members
/geodataspace/geounit/members/annotation
/geodataspace/geounit/provenance
/geodataspace/geounit/version
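The resource paths listed above form a hierarchy, which the following sketch makes explicit. Only the paths come from the slide; the host in BASE_URL and the helpers url_for and sub_resources are hypothetical, and the slide does not show how an individual geounit or member is identified in a URL.

```python
BASE_URL = "https://catalog.example.org"  # hypothetical host, not from the slides

# Resource paths exactly as listed on the slide.
PATHS = [
    "/geodataspace",
    "/geodataspace/annotation",
    "/geodataspace/geounit",
    "/geodataspace/geounit/annotation",
    "/geodataspace/geounit/acl",
    "/geodataspace/geounit/members",
    "/geodataspace/geounit/members/annotation",
    "/geodataspace/geounit/provenance",
    "/geodataspace/geounit/version",
]


def url_for(path):
    """Build a full URL for one of the listed resource paths."""
    if path not in PATHS:
        raise ValueError(f"unknown resource path: {path}")
    return BASE_URL + path


def sub_resources(parent):
    """Return the listed paths that sit directly beneath `parent`."""
    prefix = parent + "/"
    return [
        p for p in PATHS
        if p.startswith(prefix) and "/" not in p[len(prefix):]
    ]


geounit_children = sub_resources("/geodataspace/geounit")
members_url = url_for("/geodataspace/geounit/members")
```

Read this way, annotations attach at three levels (catalog, geounit, member), matching the <dataset/member, name, value> annotation model on the same slide.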
13. Publish and Re-execute geounits
• Still in the works
• Each geounit can be published through
Globus Publish and re-executed through
an analysis platform
15. Solid Earth
• Allow reproducible, replayable geounits of GPlates
• GPlates
• Software package has several dependencies
• Create geounits of kinematic representations of the
Earth's surface (3D and 4D models)
• GPlates software,
• GPML files (XML for plate tectonics) used in the model,
• output files: GPML or simple X/Y format, visualization files, a
global set of visualization output, and images.
• Integrating geounits in Python workflows
• Incorporate metadata from workflows and use geounit metadata to
inform workflows
16. Hydrology
• Data processing steps for the VIC model
[Figure: data processing pipeline, with steps grouped into geounit 1, geounit 2, geounit 3, and geounit 4]
Objective: Monitor changes in the data processing steps
and compare them across the various runs
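One way to picture the objective above is to fingerprint each processing step and compare the fingerprints between runs. This is only a sketch under stated assumptions: the step names, commands, and input versions are placeholders rather than actual VIC processing steps, and fingerprint and changed_steps are hypothetical helpers, not part of the geounit client.

```python
import hashlib


def fingerprint(command, inputs):
    """Hash a step's command together with its input names and versions."""
    h = hashlib.sha256()
    h.update(command.encode())
    for name in sorted(inputs):          # sort so ordering does not matter
        h.update(name.encode())
        h.update(inputs[name].encode())
    return h.hexdigest()


def changed_steps(run_a, run_b):
    """Names of steps that differ between runs or exist in only one run."""
    names = set(run_a) | set(run_b)
    return sorted(s for s in names if run_a.get(s) != run_b.get(s))


# Placeholder steps: each run maps a step name to its fingerprint.
run1 = {
    "subset": fingerprint("subset --region A", {"forcing.nc": "v1"}),
    "regrid": fingerprint("regrid --res 0.5", {"subset.nc": "v1"}),
}
run2 = {
    "subset": fingerprint("subset --region A", {"forcing.nc": "v1"}),
    "regrid": fingerprint("regrid --res 0.25", {"subset.nc": "v1"}),
}

differences = changed_steps(run1, run2)
```

Here only the step whose command changed is reported; a step added or dropped in one run would be reported as well.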
17. Space Science
• Create geounits of SuperDARN data and its
plotting products
• Publish them for validation
18. CSDMS
• How geounits should be coupled
• Metadata alignment issues
• If we create geounits of CSDMS models,
how do we enable suitable search
interfaces with the provenance metadata
and CSDMS metadata?
19. Current Work
• Working with use cases to bootstrap
geounits
• Populating geounits based on Python
workflows and incorporating geounits in
workflows
• Interfacing geounit Client with Globus
Catalog
• Improving distributed search functionality