SlideShare a Scribd company logo
Simplifying Science Gateway Data
Management with Globus
Part IV – Automated Data Ingests
October 2020, Gateways 2020
Phase 1 - Gather data
Gathering datasets from research partners
• Your project is gathering datasets
from partners. Each dataset is
several TBs and takes ~a day to
transfer over the network.
• For the data to be useful, it needs
descriptive metadata.
• Ultimately, the team needs to find
datasets that match specific
criteria.
What are the dataset ingest challenges?
• Getting very large datasets transferred from gateway
users’ systems to the central repository
– (This is Scenario I - large-scale data transfer.)
• Generating persistent identifiers for the data in the
central repository so we can link metadata to data
• Storing the metadata
• Indexing the metadata to enable searching
Demonstration
Data ingests in a
web application
https://petraldata.net/
What needs to be in place for it to work?
• Data storage
– Globus Connect Server on Petrel
• Persistent identifiers
– FAIR Research Identifier Service
– Hosted by https://fair-research.org/
• Metadata storage, indexing, search
– Globus Search API
– Hosted by Globus
Globus Connect Server on Petrel
• Configured for self-service projects
– Researchers do not receive local (Linux) accounts!
– Uses Globus for authorization & management
• Guest collections and groups
– Project PIs request access by applying to join the “Petrel
Project Owners” group (using the Globus web app)
– Admin creates Globus group, makes PI a group manager
– Admin creates guest collection, makes PI an access manager
– Admin sets a quota of 100TB for the guest collection
• RESTful web service, written in Python, that
stores identifier metadata
• Mints (creates) identifiers from external
service providers using a unified service
provider interface (SPI)
• Different identifiers supported through
namespaces
• Client requests served as HTML landing
pages or other machine-readable formats
(e.g., JSON, JSON-LD)
FAIR Research Identifiers
AWS-RDS
AWS-EC2
Postgres
Registration SPI
(Python)
Web Server - REST API
(Apache, Flask, Python)
RDBMS ORM
(SQLAlchemy)
AuthN/AuthZ
(Globus Auth, Globus Groups)
Web
Browser
Client
APIs
HTML JSON, JSON-LD, other
extensible renderings
DataCite
(DOI)
EZID
(ARK)
Minid
(Handle)
https://minid.readthedocs.io/en/develop/
• REST API provides a simple CRUD
interface
• Has other capabilities, like finding
identifiers by checksum
• JSON is used for request and
response
• Namespaces may also have their own
handlers, landing pages, and other
customizations.
FAIR Research Identifiers
Globus Search API
• RESTful API for indexing & search
– Hosted by Globus (including the metadata &
index storage!)
– Each project gets an “index” object (private
tenancy)
– REST API, Python client package, Python CLI
• https://docs.globus.org/api/search/
Globus Search API features
• Scalable: to billions of entries
• Schema agnostic: can use standard
(e.g., DataCite) or custom metadata
• Fine-grain access control: only returns
results that are visible to user
• Plain text search: ranked results
• Faceted search: for data discovery
• Rich query language: ranges,
expressions, regex, fuzzy, stemming, etc.
Key ingredients
1. UUID and base path for the guest collection where
data is gathered
2. Minid Python client
3. UUID for Globus Search index
4. Your choice of appropriate metadata schema for
your project’s datasets
Code
Data ingest in a
web application

More Related Content

What's hot

Introduction to the Globus Platform (APS Workshop)
Introduction to the Globus Platform (APS Workshop)Introduction to the Globus Platform (APS Workshop)
Introduction to the Globus Platform (APS Workshop)
Globus
 
Globus Portal Framework (APS Workshop)
Globus Portal Framework (APS Workshop)Globus Portal Framework (APS Workshop)
Globus Portal Framework (APS Workshop)
Globus
 
What's New in Globus - Internet2 TechEXtra
What's New in Globus - Internet2 TechEXtraWhat's New in Globus - Internet2 TechEXtra
What's New in Globus - Internet2 TechEXtra
Globus
 
Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)
Globus
 
Recent Upgrades to ARM Data Transfer and Delivery Using Globus
Recent Upgrades to ARM Data Transfer and Delivery Using GlobusRecent Upgrades to ARM Data Transfer and Delivery Using Globus
Recent Upgrades to ARM Data Transfer and Delivery Using Globus
Globus
 
"What's New With Globus" Webinar: Spring 2018
"What's New With Globus" Webinar: Spring 2018"What's New With Globus" Webinar: Spring 2018
"What's New With Globus" Webinar: Spring 2018
Globus
 
Introduction to Globus (APS Workshop)
Introduction to Globus (APS Workshop)Introduction to Globus (APS Workshop)
Introduction to Globus (APS Workshop)
Globus
 
Globus: Enabling the Open Storage Network
Globus: Enabling the Open Storage NetworkGlobus: Enabling the Open Storage Network
Globus: Enabling the Open Storage Network
Globus
 
GlobusWorld 2021 Tutorial: Globus for System Administrators
GlobusWorld 2021 Tutorial: Globus for System AdministratorsGlobusWorld 2021 Tutorial: Globus for System Administrators
GlobusWorld 2021 Tutorial: Globus for System Administrators
Globus
 
Globus: Beyond File Transfer
Globus: Beyond File TransferGlobus: Beyond File Transfer
Globus: Beyond File Transfer
Globus
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data Staging
Henning Bergmeyer
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
Globus
 
Introduction to the Globus Platform (GlobusWorld Tour - UMich)
Introduction to the Globus Platform (GlobusWorld Tour - UMich)Introduction to the Globus Platform (GlobusWorld Tour - UMich)
Introduction to the Globus Platform (GlobusWorld Tour - UMich)
Globus
 
Globus: Research Data Management as Service and Platform - pearc17
Globus: Research Data Management as Service and Platform - pearc17Globus: Research Data Management as Service and Platform - pearc17
Globus: Research Data Management as Service and Platform - pearc17
Mary Bass
 
Globus status and publication plans
Globus status and publication plansGlobus status and publication plans
Globus status and publication plans
Ian Foster
 
Globus publication demo screenshots
Globus publication demo screenshotsGlobus publication demo screenshots
Globus publication demo screenshots
Ian Foster
 
Automating Research Data Flows with the Globus Command Line Interface (CLI)
Automating Research Data Flows with the Globus Command Line Interface (CLI)Automating Research Data Flows with the Globus Command Line Interface (CLI)
Automating Research Data Flows with the Globus Command Line Interface (CLI)
Globus
 
Log analysis using elk
Log analysis using elkLog analysis using elk
Log analysis using elk
Rushika Shah
 
Globus Platform Overview
Globus Platform OverviewGlobus Platform Overview
Globus Platform Overview
Globus
 

What's hot (20)

Introduction to the Globus Platform (APS Workshop)
Introduction to the Globus Platform (APS Workshop)Introduction to the Globus Platform (APS Workshop)
Introduction to the Globus Platform (APS Workshop)
 
Globus Portal Framework (APS Workshop)
Globus Portal Framework (APS Workshop)Globus Portal Framework (APS Workshop)
Globus Portal Framework (APS Workshop)
 
What's New in Globus - Internet2 TechEXtra
What's New in Globus - Internet2 TechEXtraWhat's New in Globus - Internet2 TechEXtra
What's New in Globus - Internet2 TechEXtra
 
Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)
 
Recent Upgrades to ARM Data Transfer and Delivery Using Globus
Recent Upgrades to ARM Data Transfer and Delivery Using GlobusRecent Upgrades to ARM Data Transfer and Delivery Using Globus
Recent Upgrades to ARM Data Transfer and Delivery Using Globus
 
"What's New With Globus" Webinar: Spring 2018
"What's New With Globus" Webinar: Spring 2018"What's New With Globus" Webinar: Spring 2018
"What's New With Globus" Webinar: Spring 2018
 
Introduction to Globus (APS Workshop)
Introduction to Globus (APS Workshop)Introduction to Globus (APS Workshop)
Introduction to Globus (APS Workshop)
 
Globus: Enabling the Open Storage Network
Globus: Enabling the Open Storage NetworkGlobus: Enabling the Open Storage Network
Globus: Enabling the Open Storage Network
 
GlobusWorld 2021 Tutorial: Globus for System Administrators
GlobusWorld 2021 Tutorial: Globus for System AdministratorsGlobusWorld 2021 Tutorial: Globus for System Administrators
GlobusWorld 2021 Tutorial: Globus for System Administrators
 
Globus: Beyond File Transfer
Globus: Beyond File TransferGlobus: Beyond File Transfer
Globus: Beyond File Transfer
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data Staging
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
Introduction to the Globus Platform (GlobusWorld Tour - UMich)
Introduction to the Globus Platform (GlobusWorld Tour - UMich)Introduction to the Globus Platform (GlobusWorld Tour - UMich)
Introduction to the Globus Platform (GlobusWorld Tour - UMich)
 
SomeSlides
SomeSlidesSomeSlides
SomeSlides
 
Globus: Research Data Management as Service and Platform - pearc17
Globus: Research Data Management as Service and Platform - pearc17Globus: Research Data Management as Service and Platform - pearc17
Globus: Research Data Management as Service and Platform - pearc17
 
Globus status and publication plans
Globus status and publication plansGlobus status and publication plans
Globus status and publication plans
 
Globus publication demo screenshots
Globus publication demo screenshotsGlobus publication demo screenshots
Globus publication demo screenshots
 
Automating Research Data Flows with the Globus Command Line Interface (CLI)
Automating Research Data Flows with the Globus Command Line Interface (CLI)Automating Research Data Flows with the Globus Command Line Interface (CLI)
Automating Research Data Flows with the Globus Command Line Interface (CLI)
 
Log analysis using elk
Log analysis using elkLog analysis using elk
Log analysis using elk
 
Globus Platform Overview
Globus Platform OverviewGlobus Platform Overview
Globus Platform Overview
 

Similar to Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus

Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
SATOSHI TAGOMORI
 
Advanced Computing Meets Data FAIRness
Advanced Computing Meets Data FAIRnessAdvanced Computing Meets Data FAIRness
Advanced Computing Meets Data FAIRness
Globus
 
ASP.NET Mvc 4 web api
ASP.NET Mvc 4 web apiASP.NET Mvc 4 web api
ASP.NET Mvc 4 web api
Tiago Knoch
 
Building Data Portals and Science Gateways with Globus
Building Data Portals and Science Gateways with GlobusBuilding Data Portals and Science Gateways with Globus
Building Data Portals and Science Gateways with Globus
Globus
 
Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655
Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655
Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655
Getting value from IoT, Integration and Data Analytics
 
Building RESTfull Data Services with WebAPI
Building RESTfull Data Services with WebAPIBuilding RESTfull Data Services with WebAPI
Building RESTfull Data Services with WebAPI
Gert Drapers
 
Building Software Backend (Web API)
Building Software Backend (Web API)Building Software Backend (Web API)
Building Software Backend (Web API)
Alexander Goida
 
SOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class LibrariesSOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class Libraries
Vagif Abilov
 
Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...
Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...
Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...
Thomas W. Fry
 
Denodo Partner Connect: Technical Webinar - Ask Me Anything
Denodo Partner Connect: Technical Webinar - Ask Me AnythingDenodo Partner Connect: Technical Webinar - Ask Me Anything
Denodo Partner Connect: Technical Webinar - Ask Me Anything
Denodo
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaS
Globus
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
DataWorks Summit
 
Tutorial: Leveraging Globus in your Research Applications
Tutorial: Leveraging Globus in your Research ApplicationsTutorial: Leveraging Globus in your Research Applications
Tutorial: Leveraging Globus in your Research Applications
Globus
 
ArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudArcReady - Architecting For The Cloud
ArcReady - Architecting For The Cloud
Microsoft ArcReady
 
Evolution Of The Web Platform & Browser Security
Evolution Of The Web Platform & Browser SecurityEvolution Of The Web Platform & Browser Security
Evolution Of The Web Platform & Browser Security
Sanjeev Verma, PhD
 
Debugging the Web with Fiddler
Debugging the Web with FiddlerDebugging the Web with Fiddler
Debugging the Web with Fiddler
Ido Flatow
 
Working with Globus Platform Services
Working with Globus Platform ServicesWorking with Globus Platform Services
Working with Globus Platform Services
Globus
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
MapR Technologies
 

Similar to Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus (20)

Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
Advanced Computing Meets Data FAIRness
Advanced Computing Meets Data FAIRnessAdvanced Computing Meets Data FAIRness
Advanced Computing Meets Data FAIRness
 
ASP.NET Mvc 4 web api
ASP.NET Mvc 4 web apiASP.NET Mvc 4 web api
ASP.NET Mvc 4 web api
 
Building Data Portals and Science Gateways with Globus
Building Data Portals and Science Gateways with GlobusBuilding Data Portals and Science Gateways with Globus
Building Data Portals and Science Gateways with Globus
 
Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655
Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655
Databasecentricapisonthecloudusingplsqlandnodejscon3153oow2016 160922021655
 
Building RESTfull Data Services with WebAPI
Building RESTfull Data Services with WebAPIBuilding RESTfull Data Services with WebAPI
Building RESTfull Data Services with WebAPI
 
Building Software Backend (Web API)
Building Software Backend (Web API)Building Software Backend (Web API)
Building Software Backend (Web API)
 
SOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class LibrariesSOLID Programming with Portable Class Libraries
SOLID Programming with Portable Class Libraries
 
Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...
Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...
Cerebro: Bringing together data scientists and bi users - Royal Caribbean - S...
 
Denodo Partner Connect: Technical Webinar - Ask Me Anything
Denodo Partner Connect: Technical Webinar - Ask Me AnythingDenodo Partner Connect: Technical Webinar - Ask Me Anything
Denodo Partner Connect: Technical Webinar - Ask Me Anything
 
Basics of the Web Platform
Basics of the Web PlatformBasics of the Web Platform
Basics of the Web Platform
 
Building Research Applications with Globus PaaS
Building Research Applications with Globus PaaSBuilding Research Applications with Globus PaaS
Building Research Applications with Globus PaaS
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
HDF Cloud Services
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Tutorial: Leveraging Globus in your Research Applications
Tutorial: Leveraging Globus in your Research ApplicationsTutorial: Leveraging Globus in your Research Applications
Tutorial: Leveraging Globus in your Research Applications
 
ArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudArcReady - Architecting For The Cloud
ArcReady - Architecting For The Cloud
 
Evolution Of The Web Platform & Browser Security
Evolution Of The Web Platform & Browser SecurityEvolution Of The Web Platform & Browser Security
Evolution Of The Web Platform & Browser Security
 
Debugging the Web with Fiddler
Debugging the Web with FiddlerDebugging the Web with Fiddler
Debugging the Web with Fiddler
 
Working with Globus Platform Services
Working with Globus Platform ServicesWorking with Globus Platform Services
Working with Globus Platform Services
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 

More from Globus

Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
Globus
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
Globus
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
Globus
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Globus
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
The Department of Energy's Integrated Research Infrastructure (IRI)
The Department of Energy's Integrated Research Infrastructure (IRI)The Department of Energy's Integrated Research Infrastructure (IRI)
The Department of Energy's Integrated Research Infrastructure (IRI)
Globus
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
Globus
 
Extending Globus into a Site-wide Automated Data Infrastructure.pdf
Extending Globus into a Site-wide Automated Data Infrastructure.pdfExtending Globus into a Site-wide Automated Data Infrastructure.pdf
Extending Globus into a Site-wide Automated Data Infrastructure.pdf
Globus
 
Globus at the United States Geological Survey
Globus at the United States Geological SurveyGlobus at the United States Geological Survey
Globus at the United States Geological Survey
Globus
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Globus Compute with Integrated Research Infrastructure (IRI) workflows
Globus Compute with Integrated Research Infrastructure (IRI) workflowsGlobus Compute with Integrated Research Infrastructure (IRI) workflows
Globus Compute with Integrated Research Infrastructure (IRI) workflows
Globus
 
Reactive Documents and Computational Pipelines - Bridging the Gap
Reactive Documents and Computational Pipelines - Bridging the GapReactive Documents and Computational Pipelines - Bridging the Gap
Reactive Documents and Computational Pipelines - Bridging the Gap
Globus
 

More from Globus (20)

Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus Compute wth IRI Workflows - GlobusWorld 2024
Globus Compute wth IRI Workflows - GlobusWorld 2024
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
First Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User EndpointsFirst Steps with Globus Compute Multi-User Endpoints
First Steps with Globus Compute Multi-User Endpoints
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSage
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
The Department of Energy's Integrated Research Infrastructure (IRI)
The Department of Energy's Integrated Research Infrastructure (IRI)The Department of Energy's Integrated Research Infrastructure (IRI)
The Department of Energy's Integrated Research Infrastructure (IRI)
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Enhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZEnhancing Performance with Globus and the Science DMZ
Enhancing Performance with Globus and the Science DMZ
 
Extending Globus into a Site-wide Automated Data Infrastructure.pdf
Extending Globus into a Site-wide Automated Data Infrastructure.pdfExtending Globus into a Site-wide Automated Data Infrastructure.pdf
Extending Globus into a Site-wide Automated Data Infrastructure.pdf
 
Globus at the United States Geological Survey
Globus at the United States Geological SurveyGlobus at the United States Geological Survey
Globus at the United States Geological Survey
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Globus Compute with Integrated Research Infrastructure (IRI) workflows
Globus Compute with Integrated Research Infrastructure (IRI) workflowsGlobus Compute with Integrated Research Infrastructure (IRI) workflows
Globus Compute with Integrated Research Infrastructure (IRI) workflows
 
Reactive Documents and Computational Pipelines - Bridging the Gap
Reactive Documents and Computational Pipelines - Bridging the GapReactive Documents and Computational Pipelines - Bridging the Gap
Reactive Documents and Computational Pipelines - Bridging the Gap
 

Recently uploaded

In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
Sharepoint Designs
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
Jelle | Nordend
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
vrstrong314
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
Peter Caitens
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Hivelance Technology
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
IES VE
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
AMB-Review
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
ayushiqss
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
XfilesPro
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
KrzysztofKkol1
 

Recently uploaded (20)

In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
De mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FMEDe mooiste recreatieve routes ontdekken met RouteYou en FME
De mooiste recreatieve routes ontdekken met RouteYou en FME
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Advanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should KnowAdvanced Flow Concepts Every Developer Should Know
Advanced Flow Concepts Every Developer Should Know
 
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New Zealand
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdfDominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
How Does XfilesPro Ensure Security While Sharing Documents in Salesforce?
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 

Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus

  • 1. Simplifying Science Gateway Data Management with Globus Part IV – Automated Data Ingests October 2020, Gateways 2020
  • 2. Phase 1 - Gather data Gathering datasets from research partners • Your project is gathering datasets from partners. Each dataset is several TBs and takes ~a day to transfer over the network. • For the data to be useful, it needs descriptive metadata. • Ultimately, the team needs to find datasets that match specific criteria.
  • 3. What are the dataset ingest challenges? • Getting very large datasets transferred from gateway users’ systems to the central repository – (This is Scenario I - large-scale data transfer.) • Generating persistent identifiers for the data in the central repository so we can link metadata to data • Storing the metadata • Indexing the metadata to enable searching
  • 4. Demonstration Data ingests in a web application https://petraldata.net/
  • 5. What needs to be in place for it to work? • Data storage – Globus Connect Server on Petrel • Persistent identifiers – FAIR Research Identifier Service – Hosted by https://fair-research.org/ • Metadata storage, indexing, search – Globus Search API – Hosted by Globus
  • 6. Globus Connect Server on Petrel • Configured for self-service projects – Researchers do not receive local (Linux) accounts! – Uses Globus for authorization & management • Guest collections and groups – Project PIs request access by applying to join the “Petrel Project Owners” group (using the Globus web app) – Admin creates Globus group, makes PI a group manager – Admin creates guest collection, makes PI an access manager – Admin sets a quota of 100TB for the guest collection
  • 7. • RESTful web service, written in Python, that stores identifier metadata • Mints (creates) identifiers from external service providers using a unified service provider interface (SPI) • Different identifiers supported through namespaces • Client requests served as HTML landing pages or other machine-readable formats (e.g., JSON, JSON-LD) FAIR Research Identifiers AWS-RDS AWS-EC2 Postgres Registration SPI (Python) Web Server - REST API (Apache, Flask, Python) RDBMS ORM (SQLAlchemy) AuthN/AuthZ (Globus Auth, Globus Groups) Web Browser Client APIs HTML JSON, JSON-LD, other extensible renderings DataCite (DOI) EZID (ARK) Minid (Handle) https://minid.readthedocs.io/en/develop/
  • 8. • REST API provides a simple CRUD interface • Has other capabilities, like finding identifiers by checksum • JSON is used for request and response • Namespaces may also have their own handlers, landing pages, and other customizations. FAIR Research Identifiers
  • 9. Globus Search API • RESTful API for indexing & search – Hosted by Globus (including the metadata & index storage!) – Each project gets an “index” object (private tenancy) – REST API, Python client package, Python CLI • https://docs.globus.org/api/search/
  • 10. Globus Search API features • Scalable: to billions of entries • Schema agnostic: can use standard (e.g., DataCite) or custom metadata • Fine-grain access control: only returns results that are visible to user • Plain text search: ranked results • Faceted search: for data discovery • Rich query language: ranges, expressions, regex, fuzzy, stemming, etc.
  • 11. Key ingredients 1. UUID and base path for the guest collection where data is gathered 2. Minid Python client 3. UUID for Globus Search index 4. Your choice of appropriate metadata schema for your project’s datasets
  • 12. Code Data ingest in a web application

Editor's Notes

  1. You’re working on a project with partners at other institutions, each of whom is analyzing unique samples and generating big datasets from them. You need to gather 100s of TBs of data on your campus’s HPC storage system. How can you make it easy for your partners to get the data from their labs to your server? And once it’s there, how are the partners going to understand each others’ datasets? First, they need to be able to see, in general, what’s been uploaded. Then, they need to find datasets that have specific features. NOTE: We’re presenting this as a single project, but at Globus, we see this happening for dozens-to-hundreds of research projects on a continuous basis. Our end goal is to enable research teams to do this routinely, without special planning or extraordinary measures by individual projects.
  2. Examples of “analysis on a community dataset”: Examples of ”analyze user’s data”: Examples of “download simulation results”: Examples of “submit data to a repository”:
  3. Petrel Data: https://petreldata.net/ Data storage is provided by the Advanced Leadership Computing Facility at Argonne National Laboratory. Petrel offers 100TB allocations to approved projects, with a total of 3PB of storage. Goal is to enable projects to self-manage themselves, including ingest, metadata management, index & search, and sharing permissions.
  4. PIs request access by applying to join a Globus group Petrel admin creates a project group for the PI and makes the PI a group manager Petrel admin creates a Globus guest collection with access managed by the PI Petrel admin also sets a quota of 100TB for the guest collection’s directory.
  5. https://github.com/globus/globus-jupyter-notebooks/blob/master/Data_Publication_Flow.ipynb