Exploring Innovations in Data Repository Solutions - Insights from the U.S. Geological Survey - Globus Partnership
globusonline
47 views
24 slides
May 30, 2024
Slide 1 of 24
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
About This Presentation
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to impr...
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Size: 4.85 MB
Language: en
Added: May 30, 2024
Slides: 24 pages
Slide Content
Exploring Innovations in Data
Repository Solutions: Insights
from the U.S. Geological
Survey-Globus Partnership
US Geological Survey
Core Science Systems
Science Analytics & Synthesis (SAS)
Science Data Management Branch
May 2024
U.S. Department of the Interior
U.S. Geological Survey
US Geological Survey
•Science for a Changing World
•The USGS serves the Nation by providing reliable scientific
information to describe and understand the Earth;
minimizing loss of life and property from natural disasters;
managing water, biological, energy, and mineral resources; and
enhancing and protecting our quality of life.
GlobusWorld 2024
USGS by the Numbers
“Nick”
“Jane”
GlobusWorld 2024
Pressure gage water data collection | U.S. Geological Survey (usgs.gov)
Analyzing Pore Water Samples | U.S. Geological Survey (usgs.gov)
GlobusWorld 2024
Modeling wave heights due to
surface winds in Hurricane Sandy
Large River monitoring & mapping
Subsurface Geologic Data
Unplugged Orphaned Oil and Gas Well Dataset
GC! 1 Core Data
GlobusWorld 2024
The White House Office of Science and
Technology Policy (OSTP) defines Open
Science as the principle and practice of
making research products and processes
available to all, while respecting diverse
cultures, maintaining security and privacy, and
fostering collaborations, reproducibility, and
equity.
PLAN
DM Plans
ACQUIRE
Field Work
Procurement
Data Mining
Labs
Data Wrangling
Python
Compute
Open Science
“Readiness”
ANALYZE RELEASE
“Jane”
Metadata Wizard
https://code.usgs.gov/usgs/fort-pymdwizard
USGS SAS Science Data Management Team
Developing Enterprise Tools
ScienceBase
sciencebase.gov
DOI Creation Tool
www1.usgs.gov/csas/doi
mdEditor (with US FWS)
https://www.mdeditor.org
Science Data Catalog
data.usgs.gov
10
Model Catalog
data.usgs.gov/modelcatalog
Building Communities
Community for Data Integration
usgs.gov/cdi
Promoting Best Practices
USGS Data Management Website
usgs.gov/datamanagement
Policy Leadership Research in Data
Management
•USGS State of the Data
•Data Management
Planning
•Data Citation
•FAIR Roadmap
Recommendations
•Sustainability of
Seed-funded Projects
(CDI)
GlobusWorld 2024
USGS: State of the Data
11
Engaged community
to develop and test
a rubric based on
FAIR Principles
Performed
multiple analyses
of rubric using a
common dataset to
calibrate scoring
Selected ~400
datasets randomly
from Science Data
Catalog for analysis
Analyzed each
individual dataset
using rubric
Compiled dataset to
identify trends in
analysis
Data and rubric
released in USGS
ScienceBase
Manuscript
submitted to
journal
Report: https://doi.org/10.5334/dsj-2024-022
Data Release: https://doi.org/10.5066/P97V4XA4
USGS Data Policy
12
Scientific Data Management Foundation
(requires Data Management Plans)
Metadata for Scientific Data, Software, and
Other Information Products
Review and Approval of Scientific Data for
Release
Preservation Requirements for Digital Scientific
Data
Fundamental Science Practices (FSP) USGS Public Access Plan
The USGS Public Access Plan, which is reflective
of broad government requirements for
scientific public access to scientific data,
requires USGS scientists to release the data
upon which their scientific publications are
based.
GlobusWorld 2024
POLICY
Fundamental
Science Practices
TOOLS
Metadata
Identifiers
POLICYTOOLS
SUPPORT
RELEASE
PLAN ACQUIRE ANALYZE RELEASE
DATA
RELEASE
GlobusWorld 2024
Trusted Digital Repositories in USGS
Based on Core Trust Seal criteria
Approved by a USGS policy team
Recertification is required every 3 years
GlobusWorld 2024
USGS ScienceBase: Trusted Digital Repository
The ScienceBase repository serves data
generated using AI/ML and supports active
AI/ML workflows within the application.
ScienceBase: www.sciencebase.gov
GlobusWorld 2024
17
ScienceBase: www.sciencebase.gov
ScienceBase for Data Release
Identifiers
People
USGS Metadata
Validation
Open Science
Data.gov
Automated Connections
Release Approval
& Governance
Science Workflow &
Community
GlobusWorld 2024
GlobusWorld 2024
Component access is managed by JOSSO
for authentication
ScienceBase Catalog
Grails 2.5 (upgrade in
progress)
ScienceBase
Vocab
Grails 3
ScienceBase
Directory
Grails 3
Directory/User
MgmtScienceBase
CatalogMaps
/mnt/prod_a
ppdata
Mong
oDB
Elastic
Search
JOS
SO
Pos
tGI
S
Post
gres
Post
gres
On-Prem ArcGIS
Server(s)
A
G
S
2
A
G
S
1
A
G
S
3
Activ
e
Direct
ory
On Premise - CSAHC Cloud – ProdIS FORT VPC 446 (AWS) in
CHS
Keycl
oak
AD
SA
ML
Component access is managed by Keycloak
for authentication
ScienceBase
Manager UI
React
SbGrap
hQL
SB Upload
S3
SB
Prod
S3
SB
Public
S3
SBDR
Backup
ScienceBase
Lambda
Functions
Virus scan
Publish to
Public S3
Publish to
Dremio
Files Backup
File Copy
Routines
SB
Dremio
S3
Footpri
nter
Cloud – SAS VPC (AWS)
sas-sciencebase
-data S3
Cloud (Other with policy applied)
User via Globus
AWS PostGres
(RDS)
Why Globus?
When we speak of “legacy systems” in
government, it does not mean simply that
they are old. It means that we are
grappling with the legacy of decades of
competing interests, power struggles,
creative work-arounds, and make-dos that
are opportune at the time but
unmanageable in the long run.
--Jennifer Pahlka, Recoding America
GlobusWorld 2024
Cooperative Agreement with Globus to “research and develop advanced repository components
and workflows leveraging current [USGS] investment in Globus”
GlobusWorld 2024
Component access is managed
by JOSSO for authentication
ScienceBase
Catalog
Grails 2.5
(upgrade in
progress)
ScienceB
ase
Vocab
Grails 3
ScienceBas
e Directory
Grails 3
Directory/
UserMgmt
ScienceBase
CatalogMaps
/mnt/pro
d_appda
ta
Mo
ngo
DB
Elasti
c
Sear
ch
J
O
S
S
O
P
os
tG
IS
Po
stg
re
s
Po
stg
re
s
On-Prem
ArcGIS
Server(s)
A
G
S
2
A
G
S
1
A
G
S
3
Act
ive
Dir
ect
ory
On Premise - CSAHC Cloud – ProdIS FORT VPC 446
(AWS) in CHS
Ke
ycl
oa
k
A
D
S
A
M
L
Component access is managed by
Keycloak for authentication
ScienceBase
Manager UI
React
SbGr
aph
QL
SB
Upload
S3
SB
Prod
S3
SB
Publ
ic S3
SBDR
Backu
p
ScienceB
ase
Lambda
Function
s
Virus
scan
Publish
to Public
S3
Publish
to
Dremio
Files
Backup
File Copy
Routines
SB
Dremi
o S3
Foot
print
er
Cloud – SAS VPC (AWS)
sas-science
base-data
S3
Cloud (Other with policy
applied)
User via Globus
AWS
PostGres
(RDS)
Why Globus?
GlobusWorld 2024
Globus Auth
GlobusWorld 2024
Research Data Lifecycle by LMA Research Data Management Working Group is licensed under a Creative Commons
Attribution-NonCommercial 4.0 International License.
DREW IGNIZIO – SCIENCEBASE PRODUCT OWNER
GRACE DONOVAN – DATA MANAGER
TAMAR NORKIN – DATA MANAGER
MADISON LANGSETH – DATA MANAGER
BRANDON SERNA – TECHNICAL ARCHITECT
JEFF FALGOUT – HPC ENGINEER
VIV HUTCHISON – CHIEF DATA ADVISOR
GlobusWorld 2024
Thank You!