Advanced Computing Meets Data FAIRness

Tutorial presented at Mini Gateways 2022. Demonstrates how to build data portals and science gateways with the Django Globus Portal Framework.

The broad scope of a typical science gateway (simplifying access to shared data, computing, and other resources) makes building one from scratch a daunting task. Investigators must be able to stage data from instruments (or other sources), submit compute jobs to analyze data, move data to more persistent storage, describe data products, and provide a means for collaborators to search, discover, reuse, and augment these data products. Myriad tools are available for each of these tasks, but integrating them in a way that hides the complexity from users is a challenge.

In this tutorial we will describe an approach that bootstraps science gateway development based on the Modern Research Data Portal[1] design pattern. The solution uses a set of open source tools that build on the established Django web framework, the ubiquitous OAuth2/OpenID Connect standards for authentication and authorization, the widely deployed Globus service for research data management, and the nascent funcX functions-as-a-service platform. Attendees will learn how to rapidly deploy a science gateway that enables both automated computation at scale and enhanced discovery of the resulting data products. The emphasis will be on automating many of the required tasks so that gateway developers can focus on building differentiated, discipline-specific functionality rather than low-value (yet critical) supporting infrastructure.

We will use the ALCF Community Data Co-Op as an exemplar to illustrate how these tools have been used to support large-scale collaborative research. We will describe the overall solution architecture and introduce attendees to the individual tools. Attendees will then use these tools to deploy and configure their own science gateway to support image analysis, description, indexing, and search.

The tutorial will comprise a mix of lectures, demonstrations, and hands-on exercises. Virtual machines will be provided for computation and for hosting the science gateway. The objective is for attendees to develop a high-level understanding of the various components and leave with working code that can serve as the starting point for their own science gateway implementation.


  1. Advanced Computing Meets Data FAIRness: Building Science Gateways with the Django Globus Portal Framework. Vas Vasiliadis (vas@uchicago.edu), Lee Liming (lliming@uchicago.edu). April 5, 2022
  2. Tutorial materials and handy links: bit.ly/minisgci-2022
  3. Agenda • Introduction and motivation • The Modern Research Data Portal design pattern • Deploying a science gateway using the MRDP • Making data findable with Globus Search • Customizing the science gateway • Making data discoverable at scale • Integrating compute into your science gateway • Session formats: hands-on exercise, live demonstration
  4. Introduction and Motivation
  5. What’s the common theme?
  6. The brilliance “arms race”... Sources: K. Wille, The Physics of Particle Accelerators: An Introduction, Oxford University Press, Oxford, UK (2000); J. B. Parise and G. E. Brown, Jr., Elements, 2, 37-42 (2006)
  7. Some challenges… • Increasing data rates, heterogeneity • Continuum of computing resources • Differing workflows across instruments
  8. A common data flow pattern (diagram) • Steps: 1 Imaging, 2 Acquisition, 3 Image Analysis, 4 Description/Identification, 5 Search/Discovery, 6 Science! • Components: Instrument Facility, Advanced Computing Facility, Distribution Store, Data Portal
  9. Globus services for research data management • Unified Data Access • Data Transfer and Sharing • Platform-as-a-Service • Reliable Automation • Publication & Discovery • Remote Execution (future)
  10. The Modern Research Data Portal Design Pattern (docs.globus.org/mrdp)
  11. Why we use portals and science gateways • Different experiments (beamlines, electron microscopes, biology, etc.) generate data of different types and sizes, with different experimental information • Processing, curation, and cataloguing need to happen as soon as possible so data are not lost • Standardize secure access between users • Work toward FAIR datasets to enable more science
  12. Benefits • Make data FAIRer • Track lots of (heterogeneous) data • Facilitate discovery – Free text search in Globus Search – Filtering on specific values – User-friendly GUI • Enforce appropriate access controls – Public/private, group-, subject-level ACLs • Integrate with other (Globus) services • Customize for your research environment
  13. MRDP: Key elements • Science DMZ: fast, clean data path • Data Transfer Nodes: purpose-built data movers • Globus Platform: secure, reliable data orchestration • Globus Connect: storage system enabler • Globus Portal Framework: data discovery and access
  14. …makes your storage system a Globus endpoint
  15. Globus Connectors support diverse systems
  16. What’s wrong with my LRDP?
  17. L(egacy)RDP architecture (Source: ESnet Science Engagement team)
  18. MRDP network architecture (Source: ESnet Science Engagement team)
  19. An exemplar: The ALCF Data Co-op (acdc.alcf.anl.gov)
  20. Globus Platform Services
  21. Relevant Globus platform capabilities • Data transfer and sharing • Data description (metadata) and discovery • Data (and compute) task orchestration • Authentication and Authorization
  22. Brokering Access to Services using Globus Auth
  23. Globus Auth: Foundational IAM service • Brokers authentication and authorization among… – End-users – Identity providers: enterprise, external (federated identities) – Services: resource servers with REST APIs – Apps: web, mobile, desktop, command line clients – Services acting as clients to other services • OAuth 2.0 Authorization Framework (a.k.a. OAuth2) • OpenID Connect Core 1.0 (a.k.a. OIDC)
  24. Several authentication models supported • Application acting as user with consent – Auth flow: Authorization code grant • Application authenticating as itself – Auth flow: Client credentials grant – Application (client) has its own identity → apps are people too! • Application able to manage tokens for offline or long-running tasks – Refresh tokens
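
To make the authorization code grant concrete, here is a minimal sketch of the first leg of that flow: building the URL that the portal redirects the user to. It uses only the Python standard library. The client ID and redirect host are placeholders, `build_authorize_url` is a hypothetical helper name, and the use of `access_type=offline` to ask for refresh tokens is an assumption to verify against the Globus Auth documentation.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlsplit

# Globus Auth's OAuth2 authorization endpoint.
AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"


def build_authorize_url(client_id, redirect_uri, scopes, offline=False):
    """Return (url, state) for the first leg of the authorization code grant."""
    # Random value echoed back on the callback; the app compares it to
    # detect forged (cross-site) responses.
    state = secrets.token_urlsafe(16)
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "response_type": "code",
        "state": state,
    }
    if offline:
        # Assumption: offline access is how refresh tokens are requested.
        params["access_type"] = "offline"
    return f"{AUTHORIZE_URL}?{urlencode(params)}", state


url, state = build_authorize_url(
    client_id="00000000-0000-0000-0000-000000000000",  # placeholder client ID
    redirect_uri="https://portal.example.org/complete/globus/",  # hypothetical host
    scopes=["openid", "profile", "email"],
    offline=True,
)
query = parse_qs(urlsplit(url).query)  # decoded view of what was requested
```

After the user consents, the service redirects back to `redirect_uri` with a `code` parameter that the application exchanges for tokens. In a portal this exchange is normally handled by the authentication library rather than by hand.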
  25. Data transfer and sharing • Move data to collection → Submit Transfer task • Make data accessible → Set guest collection access rule • Grant user/app access → Add/confirm Group membership • Transfer service: POST /transfer, POST /endpoint/{endpoint_id}/access • Groups service: GET /groups/my_groups
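
As an illustration of the "set guest collection access rule" step, the sketch below assembles the URL and JSON body for `POST /endpoint/{endpoint_id}/access`. It only builds the request and does not send it. The base URL, the helper name `access_rule_request`, and the sample IDs are assumptions for illustration; the field names follow the access rule document described in the Globus Transfer API documentation and should be checked there.

```python
import json

# Assumed base URL for the Transfer API; verify against the API docs.
TRANSFER_BASE = "https://transfer.api.globus.org/v0.10"


def access_rule_request(endpoint_id, identity_id, path="/", permissions="r"):
    """Build (url, json_body) granting one identity access to a path."""
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    if not (path.startswith("/") and path.endswith("/")):
        raise ValueError("path must begin and end with '/'")
    url = f"{TRANSFER_BASE}/endpoint/{endpoint_id}/access"
    body = {
        "DATA_TYPE": "access",
        "principal_type": "identity",  # a single user or client identity
        "principal": identity_id,
        "path": path,
        "permissions": permissions,
    }
    return url, json.dumps(body)


# Placeholder IDs, for illustration only.
rule_url, rule_body = access_rule_request(
    endpoint_id="GUEST-COLLECTION-UUID",
    identity_id="USER-IDENTITY-UUID",
    path="/shared/images/",
)
```

The request itself would be sent with a Transfer access token for an identity that holds the Access Manager role on the collection.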
  26. Using guest collections in your data portal • Create a guest collection; requires authentication – Cannot be completely automated – must "log in" – Create once and automate rest of the steps • Grant the application Access Manager role – Allows the application to manage permissions on the collection – Set for application identity: appclientid@clients.auth.globus.org • Grant roles for management of endpoint and tasks
  27. Deploying a Simple (but fully functional and extensible) Research Data Portal
  28. Evolving the MRDP design pattern • Enabling discoverability: MRDP + Faceted Search (Globus Search) • Input form • Automated extraction • Bulk ingest • Ingest metadata, set visibility policies
  29. Portal Core Functionality • User authentication • Django-based framework – Portal URL mappings – Token loading • Service calls to Globus Search • Manage request lifecycle • Post-process search requests
  30. User authentication • Scopes are configured in the portal • Users authenticate with Globus using standard flow – Python Social Auth used for Authentication backend • User tokens are saved in the database • Future requests authorized with user access tokens – Searches use Search bearer token
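
The authentication setup described above might look roughly like the following Django settings fragment. This is a sketch, not the tutorial's actual configuration: the setting names follow the Python Social Auth naming convention for a backend called `globus`, the backend path is the one recalled from the Django Globus Portal Framework documentation, and the environment variable names are hypothetical. Verify all of them against the framework docs before use.

```python
import os

# Client credentials from the application registration. Reading them from
# the environment keeps the secret out of version control.
# (GLOBUS_CLIENT_ID / GLOBUS_CLIENT_SECRET are hypothetical variable names.)
SOCIAL_AUTH_GLOBUS_KEY = os.environ.get("GLOBUS_CLIENT_ID", "")
SOCIAL_AUTH_GLOBUS_SECRET = os.environ.get("GLOBUS_CLIENT_SECRET", "")

# Scopes requested at login: identity information plus Globus Search, so
# that later search calls can be made with the user's own token.
SOCIAL_AUTH_GLOBUS_SCOPE = [
    "openid",
    "profile",
    "email",
    "urn:globus:auth:scope:search.api.globus.org:search",
]

# Assumed backend path; check the framework documentation.
AUTHENTICATION_BACKENDS = [
    "globus_portal_framework.auth.GlobusOpenIdConnect",
    "django.contrib.auth.backends.ModelBackend",
]
```

Requesting only the scopes the portal actually uses keeps the consent screen short and matches the least-privilege approach mentioned for application registration.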
  31. Portal service calls use the Globus SDK • Globus portal framework loads tokens from database • Globus service object instantiated with token • Call to Globus service(s) • Portal renders result in templates
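
The portal makes these calls through the Globus SDK. To show what such a call amounts to on the wire, the sketch below builds (but does not send) an authenticated search request using only the standard library. `build_search_request` is a hypothetical helper, and the index ID and token are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_BASE = "https://search.api.globus.org/v1"


def build_search_request(index_id, query, access_token, limit=10):
    """Build a GET request for a simple query, authorized as the user."""
    params = urlencode({"q": query, "limit": limit})
    url = f"{SEARCH_BASE}/index/{index_id}/search?{params}"
    # The user's Search access token travels as an OAuth2 bearer token.
    return Request(url, headers={"Authorization": f"Bearer {access_token}"})


# Placeholder values; in the portal the token is loaded from the database.
req = build_search_request("INDEX-UUID", "type:hdf5", "USER-SEARCH-TOKEN")
```

Because the request carries the user's token rather than the portal's, the results only include entries whose visibility rules admit that user.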
  32. Globus Portal Framework URLs • URLs span three categories – Index Selection – Index Search page – Search Subject detail page • Supports multiple Globus Search indices • Search page links to multiple result subjects • Each subject has a unique URL
  33. Format of a URL
  34. An index is configuration driven • A Search index is configured in portal settings • Add Globus Search index UUID • Add a name • Add facets • Add fields • Start searching!
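
A configuration-driven index could be declared along these lines. The `SEARCH_INDEXES` layout is modeled on the Django Globus Portal Framework documentation as best recalled, so treat the key names as assumptions. The slug, UUID, and field names are placeholders, and `check_index_config` is a hypothetical helper added only to make the expected shape explicit.

```python
# Assumed settings layout: one entry per index, keyed by the slug that
# appears in portal URLs.
SEARCH_INDEXES = {
    "my-images": {  # hypothetical URL slug
        "uuid": "00000000-0000-0000-0000-000000000000",  # placeholder index UUID
        "name": "My Image Index",
        "facets": [
            # Each facet pairs a display name with a metadata field.
            {"name": "File Type", "field_name": "filetype"},
        ],
        # Metadata fields made available to the templates.
        "fields": ["filetype", "size", "created"],
    }
}

REQUIRED_KEYS = ("uuid", "name")


def check_index_config(indexes):
    """Return a list of human-readable problems (empty when none are found)."""
    problems = []
    for slug, config in indexes.items():
        for key in REQUIRED_KEYS:
            if not config.get(key):
                problems.append(f"{slug}: missing '{key}'")
        for facet in config.get("facets", []):
            if "field_name" not in facet:
                problems.append(f"{slug}: facet without 'field_name'")
    return problems


config_problems = check_index_config(SEARCH_INDEXES)
```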
  35. Lifecycle of a request • User makes a query • Portal sends request to Globus Search – Request contains user bearer token • Portal receives response • Portal does processing on response – Parse dates, build URL for Globus webapp, etc. • Portal renders data into templates • User receives a search page
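
The "processing on response" step can be pictured with a small sketch that flattens a search response and parses date strings before the data reaches a template. The response shape (`gmeta`, `entries`, `subject`, `total`) follows the search result document shown in the slides; the `content` key inside each entry, the `created` field, and the helper name `summarize_results` are assumptions for illustration.

```python
from datetime import datetime


def summarize_results(search_result, date_fields=("created",)):
    """Flatten a search response into rows that a template can render."""
    rows = []
    for result in search_result.get("gmeta", []):
        for entry in result.get("entries", []):
            content = dict(entry.get("content", {}))  # copy; leave input intact
            for field in date_fields:
                if field in content:
                    # ISO 8601 string -> datetime, so templates can format it.
                    content[field] = datetime.fromisoformat(content[field])
            rows.append({"subject": result["subject"], "content": content})
    return {"total": search_result.get("total", 0), "rows": rows}


# Hand-written sample response, not real service output.
sample = {
    "count": 1,
    "total": 1,
    "offset": 0,
    "gmeta": [
        {
            "subject": "https://example.org/data/abc.h5",
            "entries": [
                {"content": {"type": "hdf5", "created": "2022-04-05T10:30:00"}}
            ],
        }
    ],
}
summary = summarize_results(sample)
```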
  36. Creating your science gateway using the Globus portal framework • Materials: bit.ly/minisgci-2022 • Source: github.com/globus/django-globus-portal-framework • Docs: django-globus-portal-framework.readthedocs.io/en/stable/
  37. Step 0: Application registration (developers.globus.org) • Set redirect URLs – https://tutN.globusdemo.org:8443/ – https://tutN.globusdemo.org:8443/complete/globus/ • Get client ID and secret • Consents implement the principle of least privilege
  38. Accessing your VM • Host: tutN.globusdemo.org • Login user: devN • Password: Globus_2022#
  39. Portal deployment • Install dependent libraries – For production use, add robust WSGI/ASGI server • Deploy a portal instance using cookiecutter • Configure settings • Run and use! • Future: containers
  40. Making Data Findable with Globus Search
  41. Data description and discovery (docs.globus.org/api/search) • Metadata store with fine-grained visibility controls • Schema agnostic → dynamic schemas • Simple search using URL query parameters • Complex search using search request document
  42. Distinct access policies may be applied to data and metadata • Data: (ideally) using permissions on guest collections • Metadata: using permissions on metadata elements
  43. Data ingest with Globus Search • Bulk create and update • Task model for ingest at scale
      POST /index/{index_id}/ingest
      {
        "ingest_type": "GMetaList",
        "ingest_data": {
          "gmeta": [
            {
              "id": "filetype",
              "subject": "https://search.api.globus.org/abc.txt",
              "visible_to": ["public"],
              "content": {
                "metadata-schema/file#type": "file"
              }
            },
            ...
          ]
        }
      }
  44. Data ingest with Globus Search • Visibility limited to Globus Auth identity – Single user – Globus Group – Registered client application
      POST /index/{index_id}/ingest
      {
        "ingest_type": "GMetaList",
        "ingest_data": {
          "gmeta": [
            {
              "id": "weight",
              "subject": "https://search.api.globus.org/abc.txt",
              "visible_to": ["urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"],
              "content": {
                "metadata-schema/file#size": "37.6",
                "metadata-schema/file#size_human": "<50lb"
              }
            },
            ...
          ]
        }
      }
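
The two ingest documents above share one shape, so a portal or ingest script would typically generate them. A minimal sketch using only the standard library follows; the helper names `gmeta_entry` and `ingest_document` are hypothetical, and the structure simply mirrors the slide content rather than the full ingest schema.

```python
import json


def gmeta_entry(subject, content, visible_to=("public",), entry_id=None):
    """One metadata entry: what it describes, who may see it, and the metadata."""
    entry = {
        "subject": subject,
        "visible_to": list(visible_to),
        "content": content,
    }
    if entry_id is not None:
        entry["id"] = entry_id
    return entry


def ingest_document(entries):
    """Wrap entries in the GMetaList envelope for POST /index/{index_id}/ingest."""
    return {"ingest_type": "GMetaList", "ingest_data": {"gmeta": list(entries)}}


# Values taken from the slides: one public entry, one restricted entry.
SUBJECT = "https://search.api.globus.org/abc.txt"
doc = ingest_document([
    gmeta_entry(SUBJECT, {"metadata-schema/file#type": "file"}, entry_id="filetype"),
    gmeta_entry(
        SUBJECT,
        {"metadata-schema/file#size": "37.6"},
        visible_to=["urn:globus:auth:identity:46bd0f56-e24f-11e5-a510-131bef46955c"],
        entry_id="weight",
    ),
])
payload = json.dumps(doc, indent=2)  # request body for the ingest call
```

Giving the two entries different `id` values lets them attach to the same subject with different visibility, which is how public and restricted metadata can coexist for one file.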
  45. Data discovery with Globus Search: simple query
      GET /index/{index_id}/search?q=type%3Ahdf5
      {
        "@datatype": "GSearchResult",
        "@version": "2017-09-01",
        "count": 1,
        "gmeta": [
          {
            "@datatype": "GMetaResult",
            "@version": "2019-08-27",
            "entries": [ { ... } ],
            "subject": "https://..."
          }
        ],
        "offset": 0,
        "total": 1
      }
  46. Data discovery with Globus Search: complex query • Supports filters, facets, boosts, sort, limit
      POST /index/{index_id}/search
      {
        "filters": [
          {
            "type": "range",
            "field_name": "pubdate",
            "values": [ { "from": "*", "to": "2020-12-31" } ]
          }
        ],
        "facets": [
          {
            "name": "Publication Date",
            "field_name": "pubdate",
            ...
          }
        ]
      }
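
A complex query like the one above is just a JSON document, so it can be composed from small helpers. The sketch below builds a range filter matching the slide, plus a `terms` facet. The `terms` facet type with a `size`, the `q`, `limit`, and `offset` fields, and the `filetype` field name are assumptions drawn from the Globus Search API documentation rather than from the slide, and the helper names are hypothetical.

```python
import json


def range_filter(field_name, start="*", end="*"):
    """Keep only results whose field falls within [start, end]; '*' is open-ended."""
    return {
        "type": "range",
        "field_name": field_name,
        "values": [{"from": start, "to": end}],
    }


def terms_facet(name, field_name, size=10):
    """Ask for counts of the most common values of a field (assumed facet type)."""
    return {"name": name, "type": "terms", "field_name": field_name, "size": size}


def search_document(q, filters=(), facets=(), limit=10, offset=0):
    """Assemble the body for POST /index/{index_id}/search."""
    return {
        "q": q,
        "filters": list(filters),
        "facets": list(facets),
        "limit": limit,
        "offset": offset,
    }


query_doc = search_document(
    q="*",
    filters=[range_filter("pubdate", end="2020-12-31")],
    facets=[terms_facet("File Type", "filetype")],  # hypothetical field name
)
query_body = json.dumps(query_doc)
```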
  47. Cancer Registry Records for Research (CR3) • Create network of federated cancer registries – Deploy similar infrastructure at other cancer registries – Enable queries across multiple registries • Federation via Globus: network scale ↔ local control – Data owners input/export data sets, apply QC, set access policies – Registry data remain at the institution where they were generated – Identities are provided/authenticated by the institution, not Globus – System scale depends on data owners providing storage resources
  48. CR3 requirements • Search Index – Only de-identified data in search index – No record-level access for researchers • Portal – Fine-grained access control – Researchers must use a specific identity – Access must be logged – Render graphs based on search results – Faceted search in real time
  49. CR3 Architecture (diagram) • Researcher: login to the CR3 Discovery Portal with UPMC/Pitt credentials • Auth initiated to Globus Auth (GA); authentication by UPMC/Pitt identity providers • Cohort search initiated to Globus Search (GS); cohort aggregate counts returned • Elasticsearch: cancer registry de-identified data index (minimal criteria data, e.g., staging) • Registry staff: manage authorization; request service • Data transfer from registrar to researcher mediated by Globus Transfer (GT)
  50. CR3 Portal (simulated data) • Source registries: SEER Registry, Medical Center Registry, State Registry • Federated logon using Globus Auth with Pitt/UPMC as identity providers • Dynamically updating charts as facets change • Variable facets based on source registry index • Google-like text search with facets for filtering • Developed using a framework based on the Globus Modern Research Data Portal design pattern (docs.globus.org/mrdp); see PeerJ article cs-144, https://peerj.com/articles/cs-144/
  51. Working with Globus Search (jupyter.demo.globus.org)
  52. Ingesting search metadata (github.com/globus/searchable-files-demo)
  53. Adding a new search index to your portal
  54. Making Data Discoverable at Scale
  55. Globus Automation Capabilities • Timer Service: scheduled and recurring transfers (a.k.a. Globus cron) • Command Line Interface: ad hoc scripting and integration • Globus Flows service: comprehensive task (data and compute) orchestration with human-in-the-loop interactions
  56. Globus Timer Service
  57. The Globus Timer service • Scheduled/recurring file transfers • Supports all Globus transfer and sync options • Service with a command line interface • Example: NIH – hpc.nih.gov/storage/globus_cron.html
  58. Scheduled transfers to data portal endpoint(s) • Globus Timer CLI: pypi.org/project/globus-timer-cli
  59. Globus Command Line Interface (CLI)
  60. Globus Command Line Interface • Open source, uses the Python SDK
  61. Globus Flows Service
  62. Managed automation of tasks • Flows: A platform service for defining, applying, and sharing distributed research automation flows • Flows comprise Actions • Action Providers: Called by Flows to perform tasks • Triggers*: Start flows based on events (* In development)
  63. Automation with Globus Flows • Built on AWS Step Functions – Simple JSON-based state machine language – Conditions, loops, fault tolerance, etc. – Propagates state through the flow • Standardized API for integrating custom event and action services – Actions: synchronous or asynchronous – Custom Web forms prompt for user input • Actions secured with Globus Auth
  64. Extending the ecosystem: Action providers • Action Provider is a service endpoint – Run – Status – Cancel – Release – Resume • Action Provider Toolkit: action-provider-tools.readthedocs.io/en/latest • Action providers shown (Globus-provided and custom-built): Search, Transfer, Notification, ACLs, Identifier, Delete, Ingest, User Form, Describe, Xtract, funcX, Web Form
  65. Automation services ecosystem • Create Action Providers: GET /provider_url/, POST /provider_url/run, GET /provider_url/action_id/status, GET /provider_url/action_id/cancel • Define and deploy flows: { "StartAt": "ToProject", "States": { "ToProject": { … }, "SetPermission": { … }, "ProcessData": { … } … }} • Run flows
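
To show how a flow definition hangs together, here is a small sketch that writes out the three-state flow named on the slide as a Python dictionary and checks that its transitions are consistent. The state names come from the slide; the `Type`, `ActionUrl`, `Parameters`, `Next`, and `End` keys follow the Step Functions style that Flows builds on, while the action URLs are deliberately fake placeholders and `check_flow` is a hypothetical helper that only handles simple linear flows (no choice or terminal state types).

```python
def check_flow(flow):
    """Return a list of problems in a linear flow definition (empty if none)."""
    states = flow.get("States", {})
    problems = []
    start = flow.get("StartAt")
    if start not in states:
        problems.append(f"StartAt names unknown state {start!r}")
    for name, state in states.items():
        target = state.get("Next")
        if target is None and not state.get("End", False):
            problems.append(f"{name}: needs either Next or End")
        elif target is not None and target not in states:
            problems.append(f"{name}: Next names unknown state {target!r}")
    return problems


def action_state(url, next_state=None):
    """Build one action state; the final state gets End instead of Next."""
    state = {"Type": "Action", "ActionUrl": url, "Parameters": {}}
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


# Placeholder action URLs; real flows point at registered action providers.
flow = {
    "StartAt": "ToProject",
    "States": {
        "ToProject": action_state("https://actions.example.org/transfer", "SetPermission"),
        "SetPermission": action_state("https://actions.example.org/acl", "ProcessData"),
        "ProcessData": action_state("https://actions.example.org/compute"),
    },
}
flow_problems = check_flow(flow)
```

A check like this is cheap to run before deploying a flow, since a misspelled state name otherwise only surfaces when a run reaches the broken transition.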
  66. Working with Globus Flows • Try it: demo.gladier.org/gladier-demo/upload-file • Run flows: app.globus.org/flows/library • Docs: docs.globus.org/globus-automation-services
  67. Adding compute to your science gateway
  68. Coming soon: Globus Trigger service • Trigger–Action platform • Predefined triggers and actions to create rules • Globus processes triggers and reliably executes actions
  69. bit.ly/minisgci-2022 • docs.globus.org • github.com/globus • outreach@globus.org • support@globus.org
