Extending Globus into a Site-wide Automated Data Infrastructure
8 slides
May 29, 2024
About This Presentation
The Rosalind Franklin Institute hosts a variety of scientific instruments, which allow us to capture a multifaceted and multilevel view of biological systems, generating around 70 terabytes of data a month. Distributed solutions, such as Globus and Ceph, facilitate storage, access, and transfer of large amounts of data. However, we still must deal with the heterogeneity of the file formats and directory structures at acquisition, which are optimised for fast recording rather than for efficient storage and processing.
Our data infrastructure includes local storage at the instruments and workstations, distributed object stores with POSIX and S3 access, remote storage on HPCs, and tape backup. This can pose a challenge in ensuring fast, secure, and efficient data transfer. Globus allows us to handle this heterogeneity, while its Python SDK allows us to automate our data infrastructure using Globus microservices integrated with our data access models.
Our data management workflows are becoming increasingly complex and heterogeneous, including desktop PCs, virtual machines, and offsite HPCs, as well as several open-source software tools with different computing and data structure requirements. This complexity demands that data be annotated with enough detail about the experiments and the analysis to ensure efficient and reproducible workflows. This talk explores how we extend Globus into different parts of our data lifecycle to create a secure, scalable, and high-performing automated data infrastructure that can provide FAIR[1,2] data for all our science.
1. https://doi.org/10.1038/sdata.2016.18
2. https://www.go-fair.org/fair-principles
Slide Content
Extending Globus into a site-wide automated data infrastructure
Tibor Auer, Dimitrios Bellos, Laura Shemilt
Advanced Research Computing
The Rosalind Franklin Institute
The Rosalind Franklin Institute
• Research
  – Aim: image, interpret, and intervene in biological systems
  – Integrative scope
  – Multilevel resolution: macroscopic to atomic
The Rosalind Franklin Institute
• Infrastructural requirements
  – Fast, secure, and efficient data transfer
  – Efficient and reproducible analysis workflows
• Data management challenge
  – Amount: 70 TB per month
  – Heterogeneous data: formats, structures, and collection rates
Fast, secure, and efficient data transfer
• Globus
  – Access management using guest collections and ACL rules (ORCID links local and Globus identities)
• RFI Globus
  – Automated configuration
  – Service accounts avoid the need for a human in the loop
  – Setup steps (done only a few times) → automated with Ansible
  – Transfer steps (done regularly) → RFI Globus API based on the Globus SDK for Python
[Diagram: Globus connects Instruments, Workstation (SSHFS), VM, HPC, and Storage (POSIX & S3)]
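The regular transfer steps could be scripted with the Globus SDK for Python along the following lines. This is a minimal sketch, not the actual RFI Globus API: the collection UUIDs, the path layout, and the helper names are illustrative assumptions.

```python
"""Sketch of an automated instrument-to-storage transfer as a service
account (client-credentials grant, so no human in the loop)."""

# Hypothetical collection UUIDs -- replace with real Globus collection IDs.
INSTRUMENT_COLLECTION = "aaaaaaaa-0000-0000-0000-000000000000"
STORAGE_COLLECTION = "bbbbbbbb-0000-0000-0000-000000000000"


def build_transfer_spec(instrument: str, dataset: str) -> dict:
    """Map an instrument dataset to source/destination paths.

    Kept as a pure helper so the path logic is testable without
    credentials; the directory layout here is an assumption.
    """
    return {
        "source_endpoint": INSTRUMENT_COLLECTION,
        "destination_endpoint": STORAGE_COLLECTION,
        "source_path": f"/{instrument}/{dataset}/",
        "destination_path": f"/raw/{instrument}/{dataset}/",
    }


def submit(spec: dict, client_id: str, client_secret: str) -> str:
    """Submit the transfer; requires the globus-sdk package and a
    registered Globus service account (confidential client)."""
    import globus_sdk  # imported lazily so the sketch loads without it

    auth = globus_sdk.ConfidentialAppAuthClient(client_id, client_secret)
    authorizer = globus_sdk.ClientCredentialsAuthorizer(
        auth, globus_sdk.scopes.TransferScopes.all
    )
    tc = globus_sdk.TransferClient(authorizer=authorizer)
    tdata = globus_sdk.TransferData(
        tc,
        spec["source_endpoint"],
        spec["destination_endpoint"],
        sync_level="checksum",  # skip files already present and identical
    )
    tdata.add_item(spec["source_path"], spec["destination_path"], recursive=True)
    return tc.submit_transfer(tdata)["task_id"]
```

Running this regularly (e.g. from a timer on a VM) is what removes the human from the loop; the one-time endpoint setup is handled separately by Ansible.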
Efficient and reproducible analysis workflows
• Automated data acquisition and annotation
• Automated data retrieval
Integrate with microservices
[Diagram: microservices attach transactional and scientific metadata as data flows between Instruments, Workstation, and Storage (POSIX & S3); user permissions resolved via LDAP/Keycloak]
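One way a microservice could record the transactional and scientific metadata is a JSON sidecar written next to each dataset. A minimal sketch; the schema and field names are assumptions for illustration, not the RFI's actual metadata model.

```python
"""Sketch of dataset annotation at acquisition (illustrative schema)."""
import json
from datetime import datetime, timezone
from pathlib import Path


def build_record(instrument: str, operator: str, scientific: dict) -> dict:
    """Combine transactional and scientific metadata into one record."""
    return {
        "transactional": {
            "instrument": instrument,
            "operator": operator,  # identity resolved via LDAP/Keycloak
            "acquired_at": datetime.now(timezone.utc).isoformat(),
        },
        "scientific": scientific,  # e.g. sample, modality, pixel size
    }


def write_sidecar(dataset_dir: Path, record: dict) -> Path:
    """Drop the record as metadata.json next to the raw data, so it
    travels with the dataset through every later transfer."""
    path = dataset_dir / "metadata.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Because the sidecar moves with the data, downstream analysis and retrieval steps can read the annotations without calling back to the acquisition system.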
Efficient and reproducible analysis workflows
• Automated analysis
Integrate with microservices
1. Check for new data: wait for the ‘sentinel’ directory, run sanity checks, and generate a UDID (the directory name on the HPC)
2. Transfer data
3. Create a ‘sentinel’ directory
4. Submit the analysis script to the scheduler
5. The job deletes the ‘sentinel’ directory
6. If the ‘sentinel’ directory is deleted, transfer results back
7. (Optionally) delete data from the HPC
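The sentinel handshake above can be sketched in a few lines. The ‘.sentinel’ name, the UDID format, and the local-filesystem stand-in for the HPC are assumptions for illustration; the real transfers go through Globus.

```python
"""Sketch of the sentinel-directory handshake on the HPC."""
import tempfile
import uuid
from pathlib import Path


def generate_udid(instrument: str) -> str:
    """Unique dataset ID, used as the directory name on the HPC
    (format is an assumption)."""
    return f"{instrument}-{uuid.uuid4().hex[:12]}"


def start_job(hpc_root: Path, udid: str) -> Path:
    """Create the sentinel; while it exists, the analysis is in flight."""
    sentinel = hpc_root / udid / ".sentinel"
    sentinel.mkdir(parents=True)
    return sentinel


def finish_job(sentinel: Path) -> None:
    """The scheduler job deletes the sentinel when the analysis is done."""
    sentinel.rmdir()


def results_ready(hpc_root: Path, udid: str) -> bool:
    """Poller check: dataset present but sentinel gone -> fetch results."""
    dataset = hpc_root / udid
    return dataset.is_dir() and not (dataset / ".sentinel").exists()


# One round trip against a throwaway directory standing in for the HPC.
root = Path(tempfile.mkdtemp())
udid = generate_udid("tomo")
sentinel = start_job(root, udid)
in_flight = not results_ready(root, udid)  # sentinel still present
finish_job(sentinel)                       # the analysis job has finished
done = results_ready(root, udid)           # now safe to pull results back
```

Using directory existence as the completion signal keeps the coupling loose: the poller needs no access to the scheduler, only to the filesystem (or a Globus listing of it).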