Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus

We describe the automated data ingest scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform automated data ingest and present a faceted search interface that can be used by science gateways to simplify data discovery. We also walk through the application's GitHub repository and highlight relevant components.

Simplifying Science Gateway Data
Management with Globus
Part IV – Automated Data Ingests
October 2020, Gateways 2020

Phase 1 - Gather data
Gathering datasets from research partners
• Your project is gathering datasets
from partners. Each dataset is
several TBs and takes ~a day to
transfer over the network.
• For the data to be useful, it needs
descriptive metadata.
• Ultimately, the team needs to find
datasets that match specific
criteria.

What are the dataset ingest challenges?
• Getting very large datasets transferred from gateway
users’ systems to the central repository
– (This is Scenario I - large-scale data transfer.)
• Generating persistent identifiers for the data in the
central repository so we can link metadata to data
• Storing the metadata
• Indexing the metadata to enable searching

Demonstration
Data ingests in a
web application
https://petreldata.net/

What needs to be in place for it to work?
• Data storage
– Globus Connect Server on Petrel
• Persistent identifiers
– FAIR Research Identifier Service
– Hosted by https://fair-research.org/
• Metadata storage, indexing, search
– Globus Search API
– Hosted by Globus

Globus Connect Server on Petrel
• Configured for self-service projects
– Researchers do not receive local (Linux) accounts!
– Uses Globus for authorization & management
• Guest collections and groups
– Project PIs request access by applying to join the “Petrel
Project Owners” group (using the Globus web app)
– Admin creates Globus group, makes PI a group manager
– Admin creates guest collection, makes PI an access manager
– Admin sets a quota of 100TB for the guest collection

FAIR Research Identifiers
• RESTful web service, written in Python, that
stores identifier metadata
• Mints (creates) identifiers from external
service providers using a unified service
provider interface (SPI)
• Different identifiers supported through
namespaces
• Client requests served as HTML landing
pages or other machine-readable formats
(e.g., JSON, JSON-LD)
[Architecture diagram. Labels: Web Browser; Client APIs; HTML, JSON, JSON-LD, other extensible renderings; Web Server - REST API (Apache, Flask, Python); AuthN/AuthZ (Globus Auth, Globus Groups); RDBMS ORM (SQLAlchemy); Postgres; AWS-EC2; AWS-RDS; Registration SPI (Python); DataCite (DOI); EZID (ARK); Minid (Handle)]
https://minid.readthedocs.io/en/develop/

FAIR Research Identifiers
• REST API provides a simple CRUD
interface
• Has other capabilities, like finding
identifiers by checksum
• JSON is used for request and
response
• Namespaces may also have their own
handlers, landing pages, and other
customizations.
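The checksum lookup mentioned above depends on the client computing a digest of the file before registering it. The following is a minimal sketch of that client-side step, not the identifier service's actual schema: the function names and the request field names (`namespace`, `location`, `checksums`, `metadata`) are hypothetical and chosen only for illustration.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that multi-terabyte files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_identifier_request(path, location_url, namespace):
    """Assemble a JSON-serializable body for an identifier request.

    ASSUMPTION: these field names are illustrative only; consult the
    identifier service documentation for the real request format.
    """
    return {
        "namespace": namespace,
        "location": [location_url],
        "checksums": [{"function": "sha256", "value": sha256_of_file(path)}],
        "metadata": {"title": os.path.basename(path)},
    }
```

Because the digest depends only on the file's bytes, the same value can later be used to ask whether an identifier already exists for that content, wherever the file is stored.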

Globus Search API
• RESTful API for indexing & search
– Hosted by Globus (including the metadata &
index storage!)
– Each project gets an “index” object (private
tenancy)
– REST API, Python client package, Python CLI
• https://docs.globus.org/api/search/

Globus Search API features
• Scalable: to billions of entries
• Schema agnostic: can use standard
(e.g., DataCite) or custom metadata
• Fine-grained access control: only returns
results that are visible to the user
• Plain text search: ranked results
• Faceted search: for data discovery
• Rich query language: ranges,
expressions, regex, fuzzy, stemming, etc.
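To make the indexing and faceting features concrete, here is a minimal sketch that builds the two JSON documents involved: an ingest document and a faceted query. It assumes the `GMetaEntry` ingest format and the query document fields (`q`, `limit`, `facets`) described in the Globus Search API documentation; the helper names and the metadata fields in the usage lines are hypothetical. Nothing here contacts the service.

```python
def make_gmeta_entry(subject, content, visible_to=("public",)):
    """Build an ingest document for one dataset.

    subject:    unique string naming the thing being described
    content:    schema-agnostic metadata dictionary
    visible_to: principals allowed to see the entry; "public" means anyone
    """
    return {
        "ingest_type": "GMetaEntry",
        "ingest_data": {
            "subject": subject,
            "visible_to": list(visible_to),
            "content": content,
        },
    }


def make_faceted_query(text, facet_fields, limit=10, facet_size=10):
    """Build a query document that runs a plain text search and asks
    for a "terms" facet (value counts) on each listed metadata field."""
    return {
        "q": text,
        "limit": limit,
        "facets": [
            {"name": field, "type": "terms", "field_name": field, "size": facet_size}
            for field in facet_fields
        ],
    }


# Hypothetical metadata for one dataset in the repository.
entry = make_gmeta_entry(
    "globus://collection-uuid/projects/demo/run-01/",
    {"title": "Run 01", "instrument": "beamline-2", "year": 2020},
)
query = make_faceted_query("run", ["instrument", "year"])
```

With the Globus Python SDK, documents like these would be passed to `SearchClient.ingest(index_id, entry)` and `SearchClient.post_search(index_id, query)`; both calls require a real index UUID and suitable credentials.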

Key ingredients
1. UUID and base path for the guest collection where
data is gathered
2. Minid Python client
3. UUID for Globus Search index
4. Your choice of appropriate metadata schema for
your project’s datasets
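One way to keep these ingredients together is a small configuration object, sketched below. The class name, its fields, and the default required metadata fields are hypothetical, and the Minid client (ingredient 2) is deliberately left out because its interface is not shown in these slides.

```python
import posixpath
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestConfig:
    """Settings an ingest application needs (illustrative names only)."""

    collection_id: str       # ingredient 1: guest collection UUID
    base_path: str           # ingredient 1: base path on the collection
    search_index_id: str     # ingredient 3: Globus Search index UUID
    required_fields: tuple = ("title", "creator", "date")  # ingredient 4

    def __post_init__(self):
        # Fail early: uuid.UUID raises ValueError on a malformed UUID.
        uuid.UUID(self.collection_id)
        uuid.UUID(self.search_index_id)

    def destination_path(self, dataset_name):
        """Directory on the guest collection where a dataset is gathered."""
        return posixpath.join(self.base_path, dataset_name.strip("/")) + "/"

    def missing_fields(self, metadata):
        """Names of required schema fields absent from a metadata dict."""
        return [name for name in self.required_fields if name not in metadata]
```

Checking metadata against the chosen schema before ingest keeps incomplete records out of the index, which matters because facets are only as useful as the fields that are consistently present.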

Code
Data ingest in a
web application

Editor's Notes

You’re working on a project with partners at other institutions, each of whom is analyzing unique samples and generating big datasets from them. You need to gather 100s of TBs of data on your campus’s HPC storage system. How can you make it easy for your partners to get the data from their labs to your server? And once it’s there, how are the partners going to understand each other’s datasets? First, they need to be able to see, in general, what’s been uploaded. Then, they need to find datasets that have specific features. NOTE: We’re presenting this as a single project, but at Globus, we see this happening for dozens to hundreds of research projects on a continuous basis. Our end goal is to enable research teams to do this routinely, without special planning or extraordinary measures by individual projects.
Examples of “analysis on a community dataset”:
Examples of “analyze user’s data”:
Examples of “download simulation results”:
Examples of “submit data to a repository”:
Petrel Data: https://petreldata.net/
Data storage is provided by the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. Petrel offers 100TB allocations to approved projects, with a total of 3PB of storage. The goal is to enable projects to manage themselves, including ingest, metadata management, index & search, and sharing permissions.
PIs request access by applying to join a Globus group. The Petrel admin creates a project group for the PI and makes the PI a group manager. The Petrel admin creates a Globus guest collection with access managed by the PI. The Petrel admin also sets a quota of 100TB for the guest collection’s directory.
https://github.com/globus/globus-jupyter-notebooks/blob/master/Data_Publication_Flow.ipynb
