Exploring Innovations in Data Repository Solutions - Insights from the U.S. Geological Survey - Globus Partnership

globusonline 47 views 24 slides May 30, 2024

About This Presentation

The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to impr...


Slide Content

Exploring Innovations in Data Repository Solutions: Insights from the U.S. Geological Survey-Globus Partnership



US Geological Survey
Core Science Systems
Science Analytics & Synthesis (SAS)
Science Data Management Branch

May 2024

U.S. Department of the Interior
U.S. Geological Survey

US Geological Survey
•Science for a Changing World
•The USGS serves the Nation by providing reliable scientific information to describe and understand the Earth; minimizing loss of life and property from natural disasters; managing water, biological, energy, and mineral resources; and enhancing and protecting our quality of life.


GlobusWorld 2024

USGS by the Numbers

“Nick”
“Jane”

Pressure gage water data collection | U.S. Geological Survey (usgs.gov)
Analyzing Pore Water Samples | U.S. Geological Survey (usgs.gov)

Modeling wave heights due to surface winds in Hurricane Sandy
Large River monitoring & mapping
Subsurface Geologic Data
Unplugged Orphaned Oil and Gas Well Dataset
GC! 1 Core Data

The White House Office of Science and Technology Policy (OSTP) defines Open Science as the principle and practice of making research products and processes available to all, while respecting diverse cultures, maintaining security and privacy, and fostering collaborations, reproducibility, and equity.

PLAN: DM Plans
ACQUIRE: Field Work, Procurement, Data Mining, Labs
ANALYZE: Data Wrangling, Python, Compute
RELEASE: Open Science “Readiness”

“Jane”

USGS SAS Science Data Management Team

Developing Enterprise Tools
•Metadata Wizard: https://code.usgs.gov/usgs/fort-pymdwizard
•ScienceBase: sciencebase.gov
•DOI Creation Tool: www1.usgs.gov/csas/doi
•mdEditor (with US FWS): https://www.mdeditor.org
•Science Data Catalog: data.usgs.gov
•Model Catalog: data.usgs.gov/modelcatalog

Building Communities
•Community for Data Integration: usgs.gov/cdi

Promoting Best Practices
•USGS Data Management Website: usgs.gov/datamanagement

Policy Leadership

Research in Data Management
•USGS State of the Data
•Data Management Planning
•Data Citation
•FAIR Roadmap Recommendations
•Sustainability of Seed-funded Projects (CDI)

USGS: State of the Data
•Engaged community to develop and test a rubric based on FAIR Principles
•Performed multiple analyses of rubric using a common dataset to calibrate scoring
•Selected ~400 datasets randomly from Science Data Catalog for analysis
•Analyzed each individual dataset using rubric
•Compiled dataset to identify trends in analysis
•Data and rubric released in USGS ScienceBase
•Manuscript submitted to journal

Report: https://doi.org/10.5334/dsj-2024-022
Data Release: https://doi.org/10.5066/P97V4XA4
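
The sampling and scoring steps on this slide can be illustrated with a minimal Python sketch. It is not the team's actual rubric or code: the criteria names, the yes/no scoring, the seed, and the dataset identifiers are all hypothetical, chosen only to show how a random sample of about 400 catalog entries might be scored against FAIR-based criteria and then compiled into trends.

```python
import random
from statistics import mean

# Hypothetical yes/no criteria grouped by FAIR letter (not the USGS rubric).
RUBRIC = {
    "F": ["has_persistent_identifier", "has_rich_metadata"],
    "A": ["metadata_publicly_retrievable"],
    "I": ["uses_open_format"],
    "R": ["has_license", "has_provenance"],
}


def sample_catalog(dataset_ids, n=400, seed=2024):
    """Draw a reproducible simple random sample without replacement."""
    rng = random.Random(seed)
    ids = list(dataset_ids)
    return rng.sample(ids, min(n, len(ids)))


def score_dataset(answers):
    """Return the fraction of criteria met for each FAIR letter."""
    return {
        letter: mean(1.0 if answers.get(c, False) else 0.0 for c in criteria)
        for letter, criteria in RUBRIC.items()
    }


def compile_trends(scored):
    """Average each letter's score across all scored datasets."""
    return {letter: mean(s[letter] for s in scored) for letter in RUBRIC}


# Usage with made-up identifiers and made-up reviewer answers.
catalog = [f"dataset-{i}" for i in range(5000)]
sample = sample_catalog(catalog)
scores = [score_dataset({"has_license": True}) for _ in sample]
trends = compile_trends(scores)
```

Fixing the seed makes the sample reproducible, which matters if several reviewers need to score the same set of datasets while calibrating the rubric.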

USGS Data Policy

Fundamental Science Practices (FSP)
•Scientific Data Management Foundation (requires Data Management Plans)
•Metadata for Scientific Data, Software, and Other Information Products
•Review and Approval of Scientific Data for Release
•Preservation Requirements for Digital Scientific Data

USGS Public Access Plan
The USGS Public Access Plan, which reflects broad government requirements for public access to scientific data, requires USGS scientists to release the data upon which their scientific publications are based.

POLICY
Fundamental
Science Practices
TOOLS
Metadata
Identifiers

Guidance
Consultation
Training
Consistency
Documentation

SUPPORT
RELEASE

POLICY
TOOLS
SUPPORT
RELEASE
PLAN ACQUIRE ANALYZE RELEASE
DATA RELEASE

Trusted Digital Repositories in USGS
Based on CoreTrustSeal criteria
Approved by a USGS policy team
Recertification is required every 3 years

USGS ScienceBase: Trusted Digital Repository
The ScienceBase repository serves data generated using AI/ML and supports active AI/ML workflows within the application.
ScienceBase: www.sciencebase.gov

ScienceBase for Data Release
•Identifiers
•People
•USGS Metadata Validation
•Open Science
•Data.gov
•Automated Connections
•Release Approval & Governance
•Science Workflow & Community

[Architecture diagram; labels recovered from the slide]
Environments: On Premise - CSAHC; Cloud – ProdIS FORT VPC 446 (AWS) in CHS; Cloud – SAS VPC (AWS); Cloud (Other with policy applied)

Component access is managed by JOSSO for authentication
•ScienceBase Catalog: Grails 2.5 (upgrade in progress)
•ScienceBase Vocab: Grails 3
•ScienceBase Directory: Grails 3
•Directory/User Mgmt
•ScienceBase CatalogMaps
•/mnt/prod_appdata
•MongoDB
•ElasticSearch
•JOSSO
•PostGIS
•Postgres (shown twice)
•On-Prem ArcGIS Server(s): AGS1, AGS2, AGS3
•Active Directory

Component access is managed by Keycloak for authentication
•Keycloak
•AD SAML
•ScienceBase Manager UI: React
•SbGraphQL
•SB Upload S3
•SB Prod S3
•SB Public S3
•SBDR Backup
•ScienceBase Lambda Functions: Virus scan, Publish to Public S3, Publish to Dremio, Files Backup, File Copy Routines
•SB Dremio S3
•Footprinter
•sas-sciencebase-data S3
•User via Globus
•AWS PostGres (RDS)
Why Globus?
When we speak of “legacy systems” in government, it does not mean simply that they are old. It means that we are grappling with the legacy of decades of competing interests, power struggles, creative work-arounds, and make-dos that are opportune at the time but unmanageable in the long run.

--Jennifer Pahlka, Recoding America

Cooperative Agreement with Globus to “research and develop advanced repository components and workflows leveraging current [USGS] investment in Globus”






Why Globus?
Globus Auth

Globus Search

Globus Flows

Globus Transfer

Globus Compute

●"Thin" Portal Client
●Shareable (Open Source)
●Scalable
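
As one illustration of how a "thin" portal client might lean on Globus Search, the sketch below builds an ingest document from catalog-style metadata using only the Python standard library. The `GMetaList` / `gmeta` / `subject` / `visible_to` / `content` layout follows the Globus Search ingest format as publicly documented, to the best of our understanding; the record fields (`doi_url`, `title`, `keywords`), the placeholder title, and the helper names are hypothetical and are not taken from the USGS implementation.

```python
import json


def to_gmeta_entry(record):
    """Map one catalog record (hypothetical fields) to a GMetaEntry-shaped dict."""
    return {
        "subject": record["doi_url"],   # unique identifier for the entry
        "visible_to": ["public"],       # openly discoverable
        "content": {                    # free-form searchable metadata
            "title": record["title"],
            "keywords": record.get("keywords", []),
        },
    }


def build_ingest_document(records):
    """Wrap entries in a GMetaList ingest document."""
    return {
        "ingest_type": "GMetaList",
        "ingest_data": {"gmeta": [to_gmeta_entry(r) for r in records]},
    }


# Usage: the DOI is the data release cited earlier in the deck;
# the title and keywords are placeholders.
records = [
    {
        "doi_url": "https://doi.org/10.5066/P97V4XA4",
        "title": "Example data release (placeholder title)",
        "keywords": ["FAIR"],
    }
]
document = build_ingest_document(records)
payload = json.dumps(document, indent=2)
```

Submitting the payload to a real search index would additionally require a Globus Auth token and an index the caller is allowed to write to; that step is left out here because it depends on deployment details the slides do not give.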
Research Data Lifecycle by LMA Research Data Management Working Group is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

DREW IGNIZIO – SCIENCEBASE PRODUCT OWNER
GRACE DONOVAN – DATA MANAGER
TAMAR NORKIN – DATA MANAGER
MADISON LANGSETH – DATA MANAGER
BRANDON SERNA – TECHNICAL ARCHITECT

JEFF FALGOUT – HPC ENGINEER
VIV HUTCHISON – CHIEF DATA ADVISOR
Thank You!