Extending Globus into a Site-wide Automated Data Infrastructure

globusonline 109 views 8 slides May 29, 2024

About This Presentation

The Rosalind Franklin Institute hosts a variety of scientific instruments, which allow us to capture a multifaceted and multilevel view of biological systems, generating around 70 terabytes of data a month. Distributed solutions, such as Globus and Ceph, facilitate storage, access, and transfer of ...


Slide Content

Extending Globus into a site-wide automated data infrastructure
Tibor Auer, Dimitrios Bellos, Laura Shemilt
Advanced Research Computing
The Rosalind Franklin Institute

The Rosalind Franklin Institute
•Research
•Aim: image, interpret and intervene in biological systems

•Integrative scope
•Multilevel resolution: macroscopic to atomic

The Rosalind Franklin Institute
•Infrastructural requirements

•Fast, secure, and efficient data transfer
•Efficient and reproducible analysis workflows
•Data management challenge

•Amount: 70 TB per month
•Heterogeneous data: formats, structures, and collection rates
•Heterogeneous computing requirements:
•Physical workstations, VMs, and offsite HPCs
•Various open-source software tools

Fast, secure, and efficient data transfer
•Globus / RFI Globus
•Access management using guest collections and ACL rules (ORCID links local and Globus identities)
•Automated configuration
  •Service Accounts avoid need for human-in-the-loop
  •Setup steps (need to be done only a few times) → automated with Ansible
  •Transfer steps (need to be done regularly) → RFI Globus API based on Globus SDK for Python

[Diagram: Instruments and Storage (POSIX & S3) linked via Globus to a Workstation (SSHFS), a VM, and an HPC]
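The regularly repeated "transfer steps" are automated with an in-house API built on the Globus SDK for Python. The sketch below illustrates that pattern under a service-account (client-credentials) setup; the helper names, endpoint IDs, paths, and credentials are all illustrative, not the actual RFI Globus API.

```python
# Sketch of automated "transfer steps" using the Globus SDK for Python.
# Endpoint IDs, credentials, and helper names are placeholders.
from pathlib import Path


def build_transfer_items(src_root: str, dst_root: str) -> list[tuple[str, str]]:
    """Pair each acquisition directory under src_root with a destination
    path under dst_root (pure stdlib, independent of the SDK)."""
    src = Path(src_root)
    return [
        (str(d), f"{dst_root.rstrip('/')}/{d.name}")
        for d in sorted(src.iterdir())
        if d.is_dir()
    ]


def submit_transfer(items, src_endpoint, dst_endpoint, client_id, client_secret):
    """Submit a transfer as a Service Account (no human in the loop).
    Requires `pip install globus-sdk`; imported lazily so the helper
    above works without it."""
    import globus_sdk

    auth = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret)
    scope = "urn:globus:auth:scope:transfer.api.globus.org:all"
    authorizer = globus_sdk.ClientCredentialsAuthorizer(auth, scope)
    tc = globus_sdk.TransferClient(authorizer=authorizer)

    tdata = globus_sdk.TransferData(
        tc, src_endpoint, dst_endpoint,
        label="instrument-to-storage", sync_level="checksum",
    )
    for src_path, dst_path in items:
        tdata.add_item(src_path, dst_path, recursive=True)
    return tc.submit_transfer(tdata)["task_id"]
```

In this pattern, the setup steps (endpoint and collection configuration, done via Ansible per the slide) happen once, and only the item-building and submission logic runs per acquisition.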

Efficient and reproducible analysis workflows
•Automated data acquisition and annotation
•Automated data retrieval
•Integrate with microservices

[Diagram: Instruments and a Workstation exchange transactional and scientific metadata with Storage (POSIX & S3); user permissions resolved via LDAP/Keycloak]
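The slide pairs transactional metadata (who owns the data, which instrument produced it, when) with scientific metadata attached at acquisition time. A minimal sketch of such an annotation record is below; every field name is a guess, as the slides do not specify the actual schema.

```python
# Illustrative annotation record combining transactional and scientific
# metadata. All field names are hypothetical.
import json
from datetime import datetime, timezone


def make_annotation(instrument: str, dataset: str, owner: str,
                    scientific: dict) -> str:
    """Bundle transactional and scientific metadata into one JSON record."""
    record = {
        "transactional": {
            "instrument": instrument,
            "dataset": dataset,
            "owner": owner,  # per the slide, resolved via LDAP/Keycloak
            "acquired_at": datetime.now(timezone.utc).isoformat(),
        },
        "scientific": scientific,
    }
    return json.dumps(record, indent=2)
```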

Efficient and reproducible analysis workflows
•Automated analysis
•Integrate with microservices

Workflow (from the flowchart):
1. Check for new data: run sanity checks and generate a UDID (the directory name on the HPC)
2. Transfer data
3. Create a 'sentinel' directory
4. Submit the analysis script to the scheduler
5. The job deletes the 'sentinel' directory when it finishes
6. Wait for the 'sentinel' directory; once it is deleted, transfer results back
7. (Optionally) delete data from HPC
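The sentinel-directory handshake above can be sketched with the standard library alone. The UDID scheme, paths, and the stand-in "job" are illustrative; at the RFI the analysis runs on an offsite HPC under a real scheduler, and the transfers go through Globus rather than a local copy.

```python
# Minimal sketch of the 'sentinel'-directory handshake (stdlib only).
import shutil
import time
import uuid
from pathlib import Path


def make_udid() -> str:
    """Generate a unique directory name (UDID) for the job on the HPC.
    The 'rfi-' prefix and length are assumptions, not the real scheme."""
    return f"rfi-{uuid.uuid4().hex[:12]}"


def run_job(workdir: Path, sentinel: Path) -> None:
    """Stand-in for the scheduled analysis script: write results,
    then delete the sentinel directory to signal completion."""
    (workdir / "results").mkdir(exist_ok=True)
    (workdir / "results" / "out.txt").write_text("analysis done\n")
    shutil.rmtree(sentinel)  # the job's last act: remove the sentinel


def wait_and_collect(workdir: Path, sentinel: Path, dest: Path,
                     poll: float = 0.05, timeout: float = 5.0) -> Path:
    """Poll until the sentinel directory disappears, then 'transfer'
    the results back (here: a local copy) and return their new path."""
    deadline = time.monotonic() + timeout
    while sentinel.exists():
        if time.monotonic() > deadline:
            raise TimeoutError("analysis job did not finish")
        time.sleep(poll)
    return Path(shutil.copytree(workdir / "results", dest / "results"))
```

The design means the watcher never needs to talk to the scheduler directly: the presence or absence of one directory is the whole job-status protocol.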

Acknowledgement
Advanced Research Computing
•Dimitrios Bellos
•Laura Crawford
•Nick Crawford
•Alex Lubbock
•Laura Shemilt
•Mark Basham

•Silvia da Graca Ramos (former member)
•Joss Whittle (former member)


Instrument scientists and lab managers

IT
•Niaz Khan
•Matthew Selby

Globus Support Team