Extending Globus into
a site-wide automated
data infrastructure
Tibor Auer, Dimitrios Bellos, Laura Shemilt
Advanced Research Computing
The Rosalind Franklin Institute
The Rosalind Franklin Institute
• Research
• Aim: image, interpret and intervene in biological systems
• Integrative scope
• Multilevel resolution: macroscopic to atomic
The Rosalind Franklin Institute
• Infrastructural requirements
• Fast, secure, and efficient data transfer
• Efficient and reproducible analysis workflows
• Data management challenge
• Amount: 70TB per month
• Heterogeneous data: formats, structures, and
collection rates
• Heterogeneous computing requirements:
• Physical workstations, VMs, and offsite HPCs
• Various open-source software tools
Fast, secure, and efficient data transfer
• Globus
• RFI Globus
• Access management using guest collections and
ACL rules (ORCID links local and Globus identities)
• Automated configuration
• Service accounts avoid the need for a human in the loop
• Setup steps (need to be done only a few times) →
automated with Ansible
• Transfer steps (need to be done regularly) → RFI
Globus API based on Globus SDK for Python
[Diagram of connected systems: Workstation (SSHFS), VM, HPC, Instruments, Storage (POSIX & S3)]
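The access-management and transfer bullets above could be sketched with the Globus SDK for Python roughly as follows. This is a minimal illustration, not the RFI Globus API itself: the function names, the label, and the split into helpers are assumptions, and the `TransferData` call assumes a recent SDK release in which the transfer client argument is optional. The service account is modelled as a confidential client using client credentials.

```python
"""Sketch: ACL rule on a guest collection plus an unattended transfer."""
from typing import Any


def build_acl_rule(identity_id: str, path: str, permissions: str = "r") -> dict[str, Any]:
    """Build the rule document passed to TransferClient.add_endpoint_acl_rule."""
    if permissions not in ("r", "rw"):
        raise ValueError("permissions must be 'r' or 'rw'")
    # ACL paths are absolute directory paths and end with a slash.
    if not path.startswith("/"):
        path = "/" + path
    if not path.endswith("/"):
        path += "/"
    return {
        "DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": identity_id,
        "path": path,
        "permissions": permissions,
    }


def make_transfer_client(client_id: str, client_secret: str):
    """Authenticate as a service account (no browser login involved)."""
    import globus_sdk
    from globus_sdk.scopes import TransferScopes

    app = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret)
    authorizer = globus_sdk.ClientCredentialsAuthorizer(app, TransferScopes.all)
    return globus_sdk.TransferClient(authorizer=authorizer)


def grant_access(tc, guest_collection: str, identity_id: str, path: str) -> None:
    """Give one Globus identity read access to a directory on a guest collection."""
    tc.add_endpoint_acl_rule(guest_collection, build_acl_rule(identity_id, path, "r"))


def submit_directory_transfer(tc, source: str, destination: str,
                              src_dir: str, dst_dir: str) -> str:
    """Submit a recursive directory transfer and return the Globus task ID."""
    import globus_sdk

    tdata = globus_sdk.TransferData(
        source_endpoint=source,
        destination_endpoint=destination,
        label="RFI sketch transfer",  # hypothetical label
    )
    tdata.add_item(src_dir, dst_dir, recursive=True)
    return tc.submit_transfer(tdata)["task_id"]
```

Keeping the rule builder free of SDK imports means it can be checked without credentials or network access; only the last three helpers talk to Globus.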
Efficient and reproducible analysis workflows
• Automated data acquisition and annotation
• Automated data retrieval
Integrate with microservices
[Diagram: Instruments, Storage (POSIX & S3), and Workstation, with transactional and scientific metadata and user permission (LDAP/Keycloak)]
Efficient and reproducible analysis workflows
• Automated analysis
Integrate with microservices
• Check for new data
• Wait for the ‘sentinel’ directory
• Sanity checks
• Generate UDID (directory name on HPC)
• Transfer data
• Create a ‘sentinel’ directory
• Submit the analysis script to the scheduler
• The job deletes the ‘sentinel’ directory
• If the ‘sentinel’ directory is deleted, transfer results back
• (Optionally) delete data from HPC
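The sentinel handshake on the HPC side can be illustrated with a small local simulation. This is only a sketch under stated assumptions: the sentinel name `.sentinel`, the function names, and the use of `uuid4` as a stand-in for the UDID are all hypothetical, and the transfers and scheduler submission are replaced by local file operations.

```python
"""Sketch: 'sentinel' directory handshake between orchestrator and job."""
import tempfile
import time
import uuid
from pathlib import Path

SENTINEL_NAME = ".sentinel"  # hypothetical name for the sentinel directory


def stage_run(hpc_root: Path) -> Path:
    """Create a uniquely named run directory containing a sentinel directory."""
    run_dir = hpc_root / uuid.uuid4().hex  # stand-in for the UDID
    (run_dir / SENTINEL_NAME).mkdir(parents=True)
    return run_dir


def finish_job(run_dir: Path) -> None:
    """What the analysis job does last: write results, then delete the sentinel."""
    (run_dir / "results.txt").write_text("done\n")
    (run_dir / SENTINEL_NAME).rmdir()


def wait_for_completion(run_dir: Path, poll_seconds: float = 1.0,
                        timeout: float = 60.0) -> bool:
    """Poll until the sentinel is gone; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while (run_dir / SENTINEL_NAME).exists():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


def demo() -> tuple[bool, bool]:
    """Run the handshake once in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = stage_run(Path(tmp))
        before = wait_for_completion(run_dir, poll_seconds=0.01, timeout=0.05)
        finish_job(run_dir)
        after = wait_for_completion(run_dir, poll_seconds=0.01, timeout=0.05)
        return before, after
```

A directory works as a sentinel because creating and removing it are single filesystem operations, so the orchestrator never observes a half-written marker. Once `wait_for_completion` returns true, the results can be transferred back and the run directory optionally deleted.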
Acknowledgement
Advanced Research Computing
• Dimitrios Bellos
• Laura Crawford
• Nick Crawford
• Alex Lubbock
• Laura Shemilt
• Mark Basham
• Silvia da Graca Ramos (former member)
• Joss Whittle (former member)
Instrument scientists and lab managers
IT
• Niaz Khan
• Matthew Selby
Globus Support Team
