We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
This document discusses lessons learned for achieving interoperability. It recommends having a clear purpose, starting with basic conventions like identifiers, monitoring commitments to build trust, and focusing on outward-facing interoperability through simple APIs and platforms rather than full software stacks. Observance of industry practices like authentication methods and cloud-based platforms is also advised to promote rapid development and distribution of applications.
Research Automation for Data-Driven Discovery - Globus
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Globus and the Materials Data Facility are presented. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Gateways 2020 Tutorial - Introduction to Globus - Globus
Globus provides a platform and services for simplifying data management and sharing for science gateways and applications. It offers fast and reliable file transfers between any storage systems, secure data sharing without copying data, and APIs and SDKs for building applications. Globus uses OAuth authentication and supports a variety of interfaces like CLI, Python SDK, and Jupyter notebooks to enable access.
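To make the platform concrete, here is a minimal sketch of submitting a file transfer with the Globus Python SDK; it assumes an access token already obtained through a Globus Auth OAuth2 flow, and the endpoint UUIDs and paths are placeholders.

```python
import globus_sdk

# Assumes an access token already obtained via a Globus Auth OAuth2 flow;
# the token and endpoint UUIDs below are placeholders.
TRANSFER_TOKEN = "..."
SRC_ENDPOINT = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"  # hypothetical source
DST_ENDPOINT = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"  # hypothetical destination

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Build a transfer task: one file, with integrity checking enabled.
task = globus_sdk.TransferData(
    tc, SRC_ENDPOINT, DST_ENDPOINT,
    label="example transfer", verify_checksum=True
)
task.add_item("/data/results.h5", "/home/user/results.h5")

submission = tc.submit_transfer(task)
print("Task ID:", submission["task_id"])
```

The task runs asynchronously on the Globus service, so the script can exit immediately after submission and poll the task ID later.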
A presentation at the NIH Workshop on Advanced Networking for Data-Intensive Biomedical Research. The talk covers our work with the science community on using cloud computing to enhance basic research, data analysis, and scientific discovery.
Big Data Processing in the Cloud: a Hydra/Sufia Experience
Zhiwu Xie, Ph.D., Associate Professor and Technology Development Librarian, Center for Digital Research and Scholarship, University Libraries, Virginia Tech
This document provides information about Microsoft Azure for Research, a program that provides free cloud computing resources and services to academic and non-commercial researchers. It describes the various tools and services available on Azure including virtual machines, storage, databases, and tools for application development. It highlights case studies of researchers using Azure. It also details the Azure for Research Award Program that provides a free year of Azure services through a bi-monthly proposal process, as well as special opportunity requests on topics like machine learning, genomics, and climate data.
An Approach for RDF-based Semantic Access to NoSQL Repositories, presented as a partial requirement for the course "Metodologia da Pesquisa em Ciência da Computação" at UFSC, 2015.
Globus and Dataverse: Towards Big Data Publication - Globus
Globus is a non-profit service that aims to increase research efficiency through sustainable software. It unifies access to data across disparate systems like cloud storage, repositories, and research instruments. Globus Connectors support secure data sharing and transfer between these systems. The document proposes using Globus and Dataverse together for big data publication, allowing faceted searches of a Dataverse repository, authorization controls, and asynchronous bulk data transfer using Globus for larger datasets. A demonstration of this combination is available online.
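As a hedged illustration of the faceted-search half of this combination, the sketch below queries a Dataverse installation's Search API from Python; the server URL is a placeholder, while the /api/search endpoint is part of standard Dataverse.

```python
import requests

# Hypothetical Dataverse installation; the Search API path (/api/search)
# is standard across Dataverse deployments.
BASE = "https://dataverse.example.edu"

resp = requests.get(f"{BASE}/api/search",
                    params={"q": "climate", "type": "dataset"})
resp.raise_for_status()

# Each search hit for a dataset carries a persistent identifier and a name.
for item in resp.json()["data"]["items"]:
    print(item.get("global_id"), "-", item.get("name"))
```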
DataverseNL is a data repository service started in 2014 by DANS as a shared service for 15 Dutch institutions. It currently contains over 200 dataverses and 450 datasets that have been downloaded over 7,000 times. DANS aims to use DataverseNL for ongoing research projects and then archive finalized datasets in its Trusted Digital Repository (TDR) for permanent preservation. DataverseNL serves as a collaboration platform and integration point for sharing research data across Dutch universities and organizations. DANS is working to link DataverseNL metadata to semantic web vocabularies and expose it as linked open data.
OR2019 DSpace 7 Enhanced Submission & Workflow - 4Science
The last two years have been very intense for the DSpace community. A great effort has been put into finalizing the development of a DSpace release, 7.0, which has many changes from previous releases, particularly with regard to UI technology.
Among the activities related to the creation of DSpace 7, the submission and workflow process that can be associated with the different collections is particularly innovative.
The presentation will provide a deep dive into the new Enhanced Submission and Workflow features of DSpace 7, including how to configure, customize, and use them (and the differences from DSpace 6 and below).
iRODS UGM 2018: FAIR Data Management and DISQOVERability - Maarten Coonen
The document summarizes the DataHub at Maastricht University, which uses iRODS for FAIR data management and discoverability. DataHub goes beyond just iRODS to include additional services like a web portal, metadata entry, and semantic search using DISQOVER. It aims to make data both human and machine-readable by using ontologies and linked data principles when storing and enriching metadata. Major milestones since its start in 2014 include various software releases that expand its capabilities in support of the FAIR data principles.
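A minimal sketch of the kind of metadata enrichment described above, using the python-irodsclient library; the zone, credentials, path, and AVU values are illustrative assumptions.

```python
from irods.session import iRODSSession

# Connection parameters are placeholders for a hypothetical iRODS zone.
with iRODSSession(host="irods.example.org", port=1247,
                  user="alice", password="secret", zone="demoZone") as session:
    obj = session.data_objects.get("/demoZone/home/alice/scan_001.nii")

    # Attach machine-readable metadata as AVU triples (attribute, value, unit),
    # including a resolvable ontology term for linked-data style enrichment.
    obj.metadata.add("subject_species", "Homo sapiens")
    obj.metadata.add("ontology_term",
                     "http://purl.obolibrary.org/obo/NCBITaxon_9606")

    for avu in obj.metadata.items():
        print(avu.name, avu.value, avu.units)
```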
Architecting An Enterprise Storage Platform Using Object Stores - Niraj Tolia
This document discusses architecting an enterprise storage platform using object stores. It summarizes MagFS, a file system designed for the cloud that is layered on top of object storage. Key points include:
- MagFS provides a consistent, elastic, secure and mobile-enabled file system experience while leveraging low-cost object storage.
- The client architecture pushes intelligence to edges for heavy lifting like encryption, deduplication, and caching while coordinating with metadata servers.
- Metadata servers enforce strong consistency, authentication, and garbage collection while optimizing performance through virtualization and lease-based caching.
- Security is ensured through server-driven request signing and scrubbing of writes to object storage after client acknowledgment.
Leverage DSpace for an Enterprise, Mission-Critical Platform - Andrea Bollini
Conference: Open Repositories, Indianapolis, 8-12 June 2015
Presenters: Andrea Bollini, Michele Mennielli
Cineca, Italy
We would like to share with the DSpace community some useful tips, starting with how to embed DSpace in a larger IT ecosystem that adds value to the information it manages. We will then show how publication data in DSpace, enriched through proper use of the authority framework, can be combined with information from the HR system. The combined system can then provide rich, detailed reports and analysis through a business intelligence solution based on Pentaho's Mondrian OLAP open-source data integration tools.
We will also present other use cases related to managing publication information for reporting purposes: the publication record has a longer lifecycle than in a basic IR; the system load is much heavier, especially for writes, since researchers must be able to enrich records whenever new requirements arrive from the government or the university research office; and data quality requires the ability to make distributed changes to a publication even after a validation workflow has concluded.
Finally, we will present our direct experience and the challenges we faced in making DSpace easily and rapidly deployable to more than 60 sites.
PSICQUIC is a community effort to standardize access to and retrieval of molecular interaction data from decentralized databases. It uses a client-server model in which a single client can integrate information from multiple sources through a common query interface and standard formats. PSICQUIC exposes over 150 million binary interactions and supports discovering services through its registry, querying them with MIQL, and visualizing results.
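For example, a single client can retrieve MITAB-formatted interactions from one registered PSICQUIC service with a plain HTTP request, as in this sketch; the IntAct service URL shown is one of many and may change, so production code would first consult the PSICQUIC registry.

```python
import requests

# One registered PSICQUIC service (IntAct); service URLs can change, so in
# practice you would look them up in the PSICQUIC registry first.
SERVICE = ("https://www.ebi.ac.uk/Tools/webservices/psicquic/intact/"
           "webservices/current/search/query")

# MIQL query: all interactions involving BRCA2, returned in MITAB 2.5 format.
resp = requests.get(f"{SERVICE}/identifier:BRCA2", params={"format": "tab25"})
resp.raise_for_status()

# MITAB is tab-separated; the first two columns identify the interactors.
for line in resp.text.splitlines()[:5]:
    cols = line.split("\t")
    print(cols[0], "<->", cols[1])
```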
This presentation has been used to start the pilot phase of the OpenAIRE Advance-funded implementation project in DSpace-CRIS.
DSpace-CRIS now provides support for the OpenAIRE Guidelines for CRIS Managers, in addition to the previously supported guidelines for literature repositories and data archives.
This document summarizes several papers on integrating NoSQL databases. It discusses different approaches to integration such as global-as-view and local-as-view. Most solutions presented use a global-as-view approach and expose a unified REST API. The papers cover domains like healthcare, biodiversity, and graph matching. The BigDAWG system is highlighted as the most complete approach, federating access across different database models in a scalable way.
Webinar Slides: Tungsten Replicator for Elasticsearch - Real-time data loadin... - Continuent
Elasticsearch provides a quick and easy method to aggregate data, whether you want to use it for simplifying your search across multiple depots and databases, or as part of your analytics stack. Getting the data from your transactional engines into Elasticsearch is something that can be achieved within your application layer with all of the associated development and maintenance costs. Instead, offload the operation and simplify your deployment by using direct data replication to handle the insert, update and delete processes.
AGENDA
- Basic replication model
- How to concentrate data from multiple sources
- How the data is represented within Elasticsearch
- Customizations and configurations available to tailor the data format
- Filters and data modifications available
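Tying the agenda together, here is a hedged sketch of indexing a captured row change into Elasticsearch over its REST API; the index name, document shape, and keying scheme are illustrative assumptions, not Tungsten Replicator's actual output format.

```python
import json
import requests

# Hypothetical: a row change captured from a transactional database.
row_event = {
    "op": "INSERT",              # could also be UPDATE or DELETE
    "schema": "shop",
    "table": "orders",
    "id": 1042,
    "data": {"customer": "acme", "total": 99.50},
}

# Key the document by table + primary key so later UPDATEs and DELETEs
# for the same row address the same Elasticsearch document.
doc_id = f"{row_event['table']}-{row_event['id']}"
resp = requests.put(
    f"http://localhost:9200/shop.orders/_doc/{doc_id}",
    headers={"Content-Type": "application/json"},
    data=json.dumps(row_event["data"]),
)
print(resp.json()["result"])   # "created" on first insert, "updated" after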
This document discusses APIs and the API economy. It defines an API as an interface between software systems that allows them to interact. APIs provide programmatic access to systems and processes within organizations. Building APIs improves digital ecosystems by enabling data sharing, reuse, and rapid prototyping. The document advocates for an open innovation approach where organizations use both internal and external knowledge through APIs to accelerate innovation. It presents a vision of APIs managing complex systems and data as products that fuel collaboration across communities.
CPaaS.io Y1 Review Meeting - Holistic Data Management - Stephan Haller
Data management and governance aspects of the CPaaS.io platform as presented at the first year review meeting in Tokyo on October 5, 2017.
Disclaimer:
This document has been produced in the context of the CPaaS.io project which is jointly funded by the European Commission (grant agreement n° 723076) and NICT from Japan (management number 18302). All information provided in this document is provided "as is" and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability. For the avoidance of all doubts, the European Commission and NICT have no liability in respect of this document, which is merely representing the view of the project consortium. This document is subject to change without notice.
BlogMyData is a virtual research environment for collaboratively visualizing environmental data. It allows researchers to detect features in models, diagnose problems, preview data before downloading, make sense of large datasets, and communicate complex concepts. Existing scientific visualization software requires expert knowledge and has limited interoperability. BlogMyData addresses this by providing a web-based blogging tool for scientists to discuss, collaborate, and record discussions as part of the research record. It utilizes open authentication and spatial features from existing frameworks to overlay blogged data on other visualization clients and offer customized geospatial feeds of blog entries. The prototype received positive feedback and future features may include supporting more data types.
MongoDB does not work like other databases. Its document-oriented data model, range-based partitioning, and strong consistency are well suited to some problems and less suited to others. In this webinar, we will look at real-world examples of MongoDB deployments that take advantage of these unique features. We will discuss specific customers who use MongoDB and see how they implemented their solutions. We will also show you how to build a similar solution for your own organization.
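A minimal sketch of the document model the webinar discusses, using PyMongo; the connection string, database, and fields are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection
orders = client["shop"]["orders"]

# A single document can nest structure that would span several
# relational tables (order header plus line items).
orders.insert_one({
    "customer": "acme",
    "items": [{"sku": "A-17", "qty": 2}, {"sku": "B-03", "qty": 1}],
    "status": "open",
})

# Queries address nested fields directly, no joins required.
for doc in orders.find({"items.sku": "A-17"}):
    print(doc["customer"], doc["status"])
```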
RightsDirect provides data-driven content solutions that help make copyright work for everyone. They offer document delivery, content workflow and analytics, text and data mining, licensing solutions, and copyright education for rightsholders and publishers with over 600 million rights. For content users, RightsDirect offers a Multinational Copyright License that provides a consistent set of rights from thousands of publishers to simplify content usage and sharing across borders. The license complements but does not replace publisher subscriptions. RightsDirect also offers document delivery through RightFind, personal and shared libraries, and content decision support services to help track content usage and spending.
1) SIDN Labs is the research branch of SIDN, which runs the .nl top-level domain, and aims to improve security and stability of .nl and the DNS through various research projects.
2) One focus of the research is analyzing how attackers abuse .nl domains for activities like phishing and malware in order to better detect and mitigate such abuse.
3) The research uses SIDN's data and resources like their ENTRADA big data platform to analyze domain registration patterns and DNS queries to detect suspicious domains.
Introduction to Globus for New Users (GlobusWorld Tour - Columbia University) - Globus
This document provides an introduction to Globus for new users. It discusses how Globus can be used to move, share, describe, discover, and reproduce research data across different storage locations and computing resources. It highlights key Globus features like transferring files securely between endpoints, sharing data with collaborators, and building applications and services using the Globus APIs. The document also covers sustainability topics like Globus usage metrics, subscription plans, and support resources.
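As one hedged example of the sharing capability, the sketch below grants a collaborator read access to a folder on a shared endpoint via the Globus Transfer API; all UUIDs are placeholders.

```python
import globus_sdk

# Assumes a TransferClient authorized as the owner of a shared endpoint
# (guest collection); token and UUIDs below are placeholders.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")
)
SHARED_ENDPOINT = "0dcb4a80-7d3b-11e8-9c8a-0a6d4e044368"
COLLABORATOR_IDENTITY = "c39f2ca2-7d3b-11e8-9c8a-0a6d4e044368"

# Grant a single collaborator read-only access to one folder.
rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": COLLABORATOR_IDENTITY,
    "path": "/shared/project-a/",
    "permissions": "r",
}
result = tc.add_endpoint_acl_rule(SHARED_ENDPOINT, rule)
print("Created access rule:", result["access_id"])
```

Because the rule references a Globus identity rather than a local account, the collaborator needs no account on the storage system itself.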
Simplified Research Data Management with the Globus Platform - Globus
Overview of the Globus research data management platform, as presented at the Fall 2018 Membership Meeting of the Coalition for Networked Information (CNI), held in Washington, D.C., December 10-11, 2018
We provide a summary review of Globus features targeted at those new to Globus. We demonstrate how to transfer and share data, and install a Globus Connect Personal endpoint on your laptop.
Scalable Data Management: Automation and the Modern Research Data Portal - Globus
Globus is an established service from the University of Chicago that is widely used for managing research data in national laboratories, campus computing centers, and HPC facilities. While its interactive web browser interface addresses simple file transfer and sharing scenarios, large scale automation typically requires integration of the research data management platform it provides into bespoke applications.
We will describe one such example, the Petrel data portal (https://petreldata.net), used by researchers to manage data in diverse fields including materials science, cosmology, machine learning, and serial crystallography. The portal facilitates automated ingest of data, extraction and addition of metadata for creating search indexes, assignment of persistent identifiers, faceted search for rapid data discovery, and point-and-click downloading of datasets by authorized users. As security and privacy are often critical requirements, the portal employs fine-grained permissions that control both visibility of metadata and access to the datasets themselves. It is based on the Modern Research Data Portal design pattern, jointly developed by the ESnet and Globus teams, and leverages capabilities such as the Science DMZ for enhanced performance and to streamline the user experience.
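A hedged sketch of the metadata-ingest step such a portal automates, using the Globus Search client from the Python SDK; the index UUID, subject, group, and metadata fields are placeholders.

```python
import globus_sdk

# Placeholder token and index UUID; the GMetaEntry structure is the
# standard Globus Search ingest format.
sc = globus_sdk.SearchClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("SEARCH_TOKEN")
)
INDEX_ID = "8b9f3dd2-0000-0000-0000-000000000000"

entry = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://endpoint-id/path/to/dataset-42",
        # Fine-grained visibility: only this group may see the metadata.
        "visible_to": ["urn:globus:groups:id:my-project-group-uuid"],
        "content": {"material": "perovskite", "technique": "XRD"},
    },
}
sc.ingest(INDEX_ID, entry)
```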
Managing Protected and Controlled Data with Globus - Globus
This document discusses how Globus can be used to manage protected and controlled data with high assurance. It describes features for restricting data handling according to standards like NIST 800-53 and 800-171. Compliance focuses on access control, configuration management, maintenance, and accountability. Restricted data passed to Globus does not include file contents. The initial release includes a new web app, Globus Connect Server v5.2, and Connect Personal. High assurance capabilities include additional authentication, application instance isolation, encryption, and detailed auditing. Subscription levels like High Assurance and BAA provide these features.
Globus is a non-profit service that aims to increase research efficiency by unifying access to disparate storage systems and simplifying secure data sharing. It allows users to easily, securely, and reliably transfer data between different resources like HPC systems, cloud storage, instruments, and personal computers. Globus also provides APIs and SDKs to help researchers build data-centric applications and automate workflows. Funding comes partly from government grants, with subscriptions enabling additional features and supporting ongoing operations.
Globus: A Data Management Platform for Collaborative Research (CHPC 2019 - So... - Globus
Globus is a non-profit data management platform developed by the University of Chicago to increase the efficiency of data-driven research. It allows researchers to easily transfer large datasets, securely share data with collaborators across different storage systems, and automate the movement of data from instruments and compute facilities. Globus sees growing adoption with over 120 subscribers and has transferred over 768 petabytes of data.
Introduction to the Globus SaaS (GlobusWorld Tour - STFC) - Globus
This document summarizes a presentation about the Globus data management platform. It includes an agenda covering an introduction to the Globus Software as a Service and Platform as a Service, automating research data workflows, facilitating collaboration, and building services. There are demonstrations of file transfers, data sharing, publication, and high assurance endpoints. The sustainability model is discussed, with standard and high assurance subscriptions, branded websites, premium storage connectors, and identity providers. Support resources like documentation, email lists, and professional services are also mentioned.
Facilitating Collaboration with Globus (GlobusWorld Tour - STFC) - Globus
This document discusses how Globus services can facilitate collaboration and data sharing through automated workflows. It describes how Globus Auth enables authentication for shared access to endpoints. APIs and command line tools allow applications to programmatically manage permissions and transfer data. JupyterHub can be configured with Globus Auth to provide tokens for accessing remote Globus services within notebooks. This enables collaborative and distributed data analysis. The document also outlines how Globus services can support automated publication of datasets through search, identifiers, and metadata.
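A sketch of the notebook side of this pattern; how tokens actually reach the notebook depends on the hub's configuration, so the GLOBUS_TOKENS environment variable and its JSON layout below are hypothetical.

```python
import json
import os

import globus_sdk

# Hypothetical: we assume the hub's Globus authenticator has serialized
# the user's tokens into an environment variable as JSON keyed by
# resource server. Real deployments vary.
tokens = json.loads(os.environ["GLOBUS_TOKENS"])
transfer_token = tokens["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# List a collaborator-shared folder directly from the notebook.
for entry in tc.operation_ls("ddb59aef-6d04-11e5-ba46-22000b92c6ec",
                             path="/shared/"):
    print(entry["name"])
```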
Introduction to Globus (GlobusWorld Tour West) - Globus
This document introduces Globus, which provides fast and reliable data transfer, sharing, and platform services across different storage systems and resources. It does this through software-as-a-service that uses existing user identities, with the goal of unifying access to data across tiers like HPC, storage, cloud, and personal resources. Key features include secure data transfer, data sharing without copying files, access control, and tools for building automations and integrating with science gateways. It also discusses options for handling protected data, such as health information, with additional security controls and business agreements.
Globus is a non-profit data management service that allows users to transfer, share, and access data across different storage systems and platforms through software-as-a-service. It has transferred over 1.34 exabytes of data and aims to unify access to research data across different tiers of storage through connectors, APIs, and user interfaces. Globus ensures secure data transfers and sharing by using user identities, access controls, encryption, and audit logging without storing user credentials or data.
Introduces the Globus software-as-a-service for file transfer and data sharing. Includes step-by-step instructions for creating a Globus account, transferring a file, and setting up a Globus endpoint on your laptop.
Introduction to Globus (GlobusWorld Tour - UMich) - Globus
This document provides an agenda for a Globus World Tour event taking place on Monday and Tuesday. On Monday, there will be sessions on introductions to Globus for new and administrative users. On Tuesday, sessions will focus on developing with Globus, including building research data portals, automating workflows, and working with instrument data. The document also provides background information on Globus and how it aims to make research data movement, sharing, and synchronization easy, reliable and secure for researchers.
GlobusWorld 2021 Tutorial: Introduction to Globus - Globus
An introduction to the core features of the Globus data management service. This tutorial was presented at the GlobusWorld 2021 conference in Chicago, IL by Greg Nawrocki.
This document provides an overview and introduction to Windows Azure SQL Database. It discusses key topics such as:
- SQL Database service tiers including Basic, Standard, and Premium, which are differentiated by performance levels measured in Database Transaction Units (DTUs) and other features.
- Database size limits and performance metrics for each tier.
- Database replication and high availability capabilities to ensure reliability.
- Support for common SQL Server features while noting some limitations compared to on-premises SQL Server.
- Considerations for database naming, users/logins, migrations, and automation in the SQL Database platform.
- Indexing requirements and compatibility differences to be aware of.
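For orientation, connecting to SQL Database from Python looks much like connecting to on-premises SQL Server, apart from the encrypted Azure endpoint; this sketch uses pyodbc with placeholder server, database, and credentials.

```python
import pyodbc

# Placeholder server, database, and credentials; encryption is required
# for Azure SQL Database connections.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;Uid=myuser;Pwd=mypassword;"
    "Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"
)
cursor = conn.cursor()

# Ordinary T-SQL works; the compatibility gaps versus on-premises
# SQL Server show up mostly in server-level features, not in
# day-to-day queries like this one.
cursor.execute("SELECT DB_NAME(), SYSDATETIME()")
print(cursor.fetchone())
conn.close()
```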
Similar to NIH Data Commons Architecture Ideas
Global Services for Global Science March 2023.pptx - Ian Foster
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
The Earth System Grid Federation: Origins, Current State, Evolution - Ian Foster
The Earth System Grid Federation (ESGF) is a distributed network of climate data servers that archives and shares model output data used by scientists worldwide. ESGF has led data archiving for the Coupled Model Intercomparison Project (CMIP) since its inception. The ESGF Holdings have grown significantly from CMIP5 to CMIP6 and are expected to continue growing rapidly. A new ESGF2 project funded by the US Department of Energy aims to modernize ESGF to handle exabyte scale data volumes through a new architecture based on centralized Globus services, improved data discovery tools, and data proximate computing capabilities.
Better Information Faster: Programming the Continuum - Ian Foster
This document discusses the computing continuum and efforts to enable better information faster through computation. It provides examples of how techniques like executing tasks closer to data sources or on specialized hardware can significantly accelerate applications. Programming models and managed services are explored for specifying and executing workloads across diverse infrastructure. There are still open questions around optimizing networks, algorithms, and applications for the computing continuum.
ESnet6 provides an ultra-fast and reliable network that enables new smart instruments for 21st century science. The network capacity has increased dramatically over time, with 2022 bandwidth being 500,000 times greater than 1993. This network allows rapid data transfer between facilities, such as replicating 7 petabytes of climate data between three labs. It also enables fast assembly and use of new instruments like high energy diffraction microscopy that can perform an analysis in 31 seconds. The integrated research infrastructure provided by Globus further supports use of remote resources and smart instruments that will drive scientific discovery.
Linking Scientific Instruments and Computation - Ian Foster
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
A Global Research Data Platform: How Globus Services Enable Scientific Discovery - Ian Foster
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not, will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for its implications.
Big Data, Big Computing, AI, and Environmental Science - Ian Foster
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
The document discusses using artificial intelligence (AI) to accelerate materials innovation for clean energy applications. It outlines six elements needed for a Materials Acceleration Platform: 1) automated experimentation, 2) AI for materials discovery, 3) modular robotics for synthesis and characterization, 4) computational methods for inverse design, 5) bridging simulation length and time scales, and 6) data infrastructure. Examples of opportunities include using AI to bridge simulation scales, assist complex measurements, and enable automated materials design. The document argues that a cohesive infrastructure is needed to make effective use of AI, data, computation, and experiments for materials science.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud Automation - Ian Foster
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven Discovery - Ian Foster
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Argonne National Lab and the University of Chicago Globus project are provided. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Scaling collaborative data science with Globus and Jupyter - Ian Foster
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
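A hedged sketch of what invoking a published model might look like with the DLHub SDK; the servable name and input shape are placeholders for whatever a real published model expects.

```python
from dlhub_sdk.client import DLHubClient

# Hedged sketch: the servable name and its expected input format are
# hypothetical placeholders, not an actual published model.
dl = DLHubClient()  # authenticates via Globus Auth on first use

model_name = "someuser/formation_energy_predictor"  # hypothetical servable
predictions = dl.run(model_name, inputs=[{"composition": "Fe2O3"}])
print(predictions)
```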
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high- performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Team Argon proposes a commons platform using reusable components to promote continuous FAIRness of data. These components include Globus Connect Server for standardized data access and transfer across storage systems, Globus Auth for authentication and authorization, and BDBags for exchange of query results and cohorts using a common manifest format. Together these aim to provide uniform, secure, and reliable access, transfer, and sharing of data while supporting identification, search, and virtualization of derived data products.
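A minimal sketch of the BDBag piece, using the bdbag Python API to package a directory of query results and later validate it; the path and checksum algorithms are illustrative.

```python
from bdbag import bdbag_api

# Turn a directory of query results into a BDBag in place, recording
# checksums in the bag manifest; the path is a placeholder.
bag = bdbag_api.make_bag("/tmp/cohort-results", algs=["md5", "sha256"])

# Later, any recipient can verify completeness and integrity of the
# exchanged bag against its manifest.
bdbag_api.validate_bag("/tmp/cohort-results", fast=False)
```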
Going Smart and Deep on Materials at ALCF - Ian Foster
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ... - Ian Foster
This document discusses computing challenges posed by rapidly increasing data scales in scientific applications and high performance computing. It introduces the concept of online data analysis and reduction as an alternative to traditional offline analysis to help address these challenges. The key messages are that dramatic changes in HPC system geography due to different growth rates of technologies are driving new application structures and computational logistics problems, presenting exciting new computer science opportunities in online data analysis and reduction.
Software Infrastructure for a National Research Platform - Ian Foster
A presentation at the First National Research Platform workshop. "The purpose of this workshop is to bring together representatives from interested institutions to discuss implementation strategies for deployment of interoperable Science DMZs at a national scale." I present eight desirable properties for a software infrastructure for such a platforms, and describe our experience realizing these properties in the Globus system.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we walk through the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (the CI/CD process) involves many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chains and the lack of end-to-end governance and risk management.
Software teams must secure their delivery processes to avoid vulnerabilities and security breaches, and must do so with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions), and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
My and Rik Marselis's slides at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curricula, so many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silos continue to crumble, many organizations still relegate monitoring & observability to ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share foundational concepts to build on.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help counter climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
NIH Data Commons Architecture Ideas
1. Team Argon
“A Commons Platform for Promoting Continuous FAIRness”
NIH Data Commons Pilot
Globus, University of Chicago
University of Southern California
Contact: Ian Foster, foster@uchicago.edu
PIs: Kyle Chard, Ian Foster, Carl Kesselman, Ravi Madduri
2. Three big picture themes
• Continuous FAIRness: Make all data findable, accessible, interoperable, reusable at every stage, via pervasive use of simple identifier and exchange format conventions
• Build on proven security, data, and computation building blocks that have large user communities inside and outside biomedicine (see subsequent slides for details)
• Solutions leverage industry best practices and professional services team to meet scalability, interoperability, sustainability, and reliability needs
3. Globus Auth: A foundational service for an authentication and authorization ecosystem
[Diagram: an App (Client) interacting with multiple Services (Resource Servers), the Resource Owner, and a resource server operator]
• A flexible security infrastructure that can be used across the Commons
• Enables federation across services using arbitrary linked identities (e.g., @gmail @xsede @uchicago)
• Facilitates secure/authorized communication between users, services, clients
• Supports arbitrary clients including REST, web, command line, software
• Flexible token management
• Secure sharing between services
• Fine-grain user consents and revocation
https://docs.globus.org/api/auth/
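To make the Auth flow concrete, here is a minimal sketch of a native-app login using the globus_sdk Python package. The client ID is a placeholder for a hypothetical app registration, and the requested scope is illustrative; real deployments would request the scopes of whichever Commons services they call.

```python
import globus_sdk

# Placeholder client ID for a hypothetical native app registered with Globus Auth
CLIENT_ID = "00000000-0000-0000-0000-000000000000"

client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
client.oauth2_start_flow(
    requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all"
)

# The user logs in with any linked identity and pastes back an authorization code
print("Please log in at:", client.oauth2_get_authorize_url())
auth_code = input("Authorization code: ").strip()

# Exchange the code for tokens; Auth issues one token set per resource server
tokens = client.oauth2_exchange_code_for_tokens(auth_code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
```

Because tokens are scoped per resource server, a client holds only the consents the user granted for each service, and those consents can be individually revoked.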
4. Standards-based, reliable, performant data management
• Globus Connect Server: S3-compatible HTTP/OAuth interface for secure client-server transfer
• Endpoints have DNS names
• Globus Transfer: Managed, high-performance, secure, reliable bulk asynchronous transfer
• In-place data sharing with flexible and secure ACLs
• Standards compliant: S3, OAuth, OIDC, HTTP, GridFTP
https://docs.globus.org/api/transfer/
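A minimal sketch of a managed bulk transfer via the globus_sdk Python package, continuing from the token obtained above. The endpoint UUIDs and paths are placeholders, not real Commons endpoints.

```python
import globus_sdk

# transfer_token obtained via the Globus Auth flow in the previous sketch;
# endpoint UUIDs below are placeholders
SRC = "11111111-1111-1111-1111-111111111111"  # hypothetical source endpoint
DST = "22222222-2222-2222-2222-222222222222"  # hypothetical destination endpoint

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Managed, asynchronous bulk transfer with checksum-based verification
tdata = globus_sdk.TransferData(
    tc, SRC, DST, label="Commons pilot demo", sync_level="checksum"
)
tdata.add_item("/data/sample1.bam", "/ingest/sample1.bam")

task = tc.submit_transfer(tdata)
print("Submitted transfer, task id:", task["task_id"])
```

The service manages retries, integrity checking, and notification, so the submitting client can disconnect once the task is accepted.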
5. Interoperability: naming and exchange
Minid
• Lightweight identifiers for any product at any stage
• Easily created, dereferenced, validated
• Global integrity – validate content across the commons
BDBag
• Self-describing and flexible format for exchange
• Extended BagIt specification
• Standard manifest representation that supports different protocols
[Diagram: a BDBag containing data and metadata; its manifest lists files with checksums (File1 2AG230.., File2 A31FDC.. via FTP, File3 D0F142.. via HTTP, …), each referenced by a Minid (e.g., Minid 001, 007, 719)]
http://minid.bd2k.org http://bd2k.ini.usc.edu/tools/bdbag/
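A minimal sketch of creating and validating a BDBag, assuming the bdbag Python package and a local data directory named ./my_dataset; exact function options may differ across package versions.

```python
from bdbag import bdbag_api

# Turn a data directory into a bag in place, with checksum manifests
bdbag_api.make_bag("./my_dataset", algs=["md5", "sha256"])

# Validate bag structure and payload checksums before exchange
bdbag_api.validate_bag("./my_dataset", fast=False)

# Serialize the bag for transport (zip/tgz archives are supported)
archive_path = bdbag_api.archive_bag("./my_dataset", "zip")
print("Bag archived at:", archive_path)
```

The checksums recorded in the manifest are what a Minid carries, so any consumer can verify that dereferenced content matches what was named.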
6. Infrastructure
Workspaces: Scalable compute for distributed data
[Diagram: “My Workspace” bringing together Data and Tools]
• Workspaces bring together data and tools
• Infrastructure designed for scalability and portability
• Leverages federated identities & access control, secure access to distributed data, and data interoperability and exchange
• Provenance: tracking activity around data (by whom? with what?)
• Publication & sharing of tools and workflows
• Cost-aware resource allocation for both compute and data movement
7. Search, navigation, and virtual cohorts
• DERIVA: Digital asset management for heterogeneous data
• Organize, navigate, discover interrelated objects (e.g., assays from a sample over time)
• REST interface
• Entity/Relation model for organizing data
• Supports various DCPPC metadata models
• Fine-grain access control to support diverse collaboration models
• Model evolution to enable continuous publication and diverse, heterogeneous use cases
• Model-driven user interface that self-configures to the current data model
• Integration with Globus Auth, Minids, BDBags, and other components
• Complements Globus Search: Access-controlled search of derived data products
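To illustrate the REST interface, here is a sketch of an ERMrest-style entity query using plain HTTP. The catalog URL, schema:table path, filter, and access token are all hypothetical; real DERIVA deployments define their own models and endpoints.

```python
import requests

access_token = "..."  # bearer token obtained from Globus Auth

# Hypothetical ERMrest catalog and isa:sample table; real models differ
CATALOG = "https://deriva.example.org/ermrest/catalog/1"
resp = requests.get(
    f"{CATALOG}/entity/isa:sample/species=Homo%20sapiens",
    headers={
        "Authorization": f"Bearer {access_token}",
        "Accept": "application/json",
    },
)
resp.raise_for_status()

for row in resp.json():  # one JSON object per matching entity
    print(row)
```

Because queries are expressed against the entity/relation model rather than hand-built endpoints, the same pattern works as the catalog's model evolves.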
8. Workspace Manager: integrating scenario
[Diagram: the Workspace Manager connects Bags (minid_1, minid_2, minid_3), Workspaces (Galaxy, Jupyter, RStudio), and Pipelines (GTExRNA, GATKVar) with data sources (UCSC, GTEx, TOPMed, MOD) and user catalogs, supporting Search, Discover, Analyze, Visualize, and Publish & Reproduce]
• Uniform, secure, reliable access to storage
• Uniform search across multiple data sources
• Virtual cohorts in standard manifest with lightweight ID
• All results tracked via standard manifest and lightweight IDs
• Workspaces support Jupyter and Galaxy on different clouds
• Publication assigns DOIs and indexes datasets
9. Summary: Reusable components include...
• Globus Connect Server for data access, transfer, and sharing
• HTTP/S3 access to many storage systems (Posix, object store, etc.)
• GridFTP for managed, reliable, secure, efficient transfers
• Integration with Globus Auth for authentication and authorization
• Offers: A universal storage API
• Globus Auth for securing all REST API interactions
• OAuth2 and OIDC + fine-grained consents and revocation
• Offers: A universal authentication and authorization API
• BDBag (“big data bags”: profiles on BagIt) tools
• BagIt specification with profiles for “holey bags”, etc.
• Offers: Common manifest for exchange of query results, virtual cohorts
• Identifier service for creating lightweight identifiers
• ARKs, created on demand, associated checksum, simple metadata
• Offers: Common mechanism for naming and tracking derived data products
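As a final sketch, here is how a client might mint a lightweight identifier for a derived data product: compute the content checksum, then register it with an identifier service. The service URL and request payload shape are hypothetical; the real Minid/identifier APIs differ in detail.

```python
import hashlib
import requests

access_token = "..."  # bearer token obtained from Globus Auth

def sha256_of(path: str) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

checksum = sha256_of("cohort.bdbag.zip")

# Hypothetical identifier-service endpoint and payload shape
resp = requests.post(
    "https://identifiers.example.org/minid",
    json={
        "checksums": [{"function": "sha256", "value": checksum}],
        "locations": ["https://data.example.org/cohort.bdbag.zip"],
        "metadata": {"title": "Virtual cohort results"},
    },
    headers={"Authorization": f"Bearer {access_token}"},
)
print("Minted identifier:", resp.json())
```

Binding the checksum to the identifier at creation time is what lets any downstream consumer validate content integrity across the Commons.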