Cloud Optimized HDF5 for the ICESat-2 mission

HDFEOS, Aug 02, 2024

About This Presentation

HDF and HDF-EOS Workshop XXVII (2024)


Slide Content

Cloud Optimized HDF5
for the
ICESat-2 mission
ESIP Summer meeting 2024
Luis López
Research Software Engineer
NSIDC


Andrew Barrett
Aleksandar Jelenak
Lisa Kaser
Jeff Lee
Amy Steiker



Credit: NASA's Goddard Space Flight Center

Important questions about our planet can now be answered by
integrating years of data from different missions.
Examples: Global Sea Ice Concentration, Boreal Forest Biomass

The data coming from these missions is
now available in the cloud!**
NASA and other agencies have started migrating their data to the cloud.
**Caveat: it is by and large in archival formats, HDF5 and NetCDF.

Problem: Accessing HDF5 in
the cloud is slow. How slow?

Improving the performance of HDF5 in the cloud is key to
enabling science at scale.
● Data is becoming too large to work with locally.

● I/O libraries are optimized for local and supercomputing workflows.

● HDF5 as a format was not designed for the cloud.

The problem = Size + Tools + Format

Cloud-Optimized HDF5?
https://www.hdfgroup.org/2024/01/strategies-and-software-to-optimize-hdf5-netcdf-4-files-for-the-cloud/
● Metadata is consolidated
● Custom caching buffer size
● Global API lock is still in place

Why HDF5 Is Not Performant in the Cloud
● Metadata is scattered through the file; each nested group makes the problem worse.
● By default, metadata is written to (and read from) the file in fixed blocks of 4 KB. 1 MB of metadata ≈ 250 requests.
● Because of the global API lock, those ~250 requests are sequential!

Paged Aggregation (data + metadata)

Metadata Blocks (user or dedicated page)

Trying Big Files from the ICESat-2 Mission
Source: https://github.com/nsidc/earthaccess/discussions/251

Accidental Complexity
[Diagram: NASA policies, data wrangling library, file format library, I/O driver (fsspec or ROS3), AWS S3 — all hidden behind one call]
ds = xr.open_dataset("s3://nasa-data.hdf5")

It’s APIs All the Way Down
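The layers behind that one-liner can be made explicit. A sketch, using a local file as a stand-in for a remote granule (the bucket on the slide is hypothetical); for S3 you would open an `s3://bucket/key` URL and pass credentials or `anon=True`, but the stack — xarray (data wrangling) → h5netcdf/h5py (file format) → fsspec (I/O driver) → storage — stays the same.

```python
import fsspec
import h5py
import numpy as np
import xarray as xr

# A plain HDF5 file standing in for a remote granule.
with h5py.File("granule.h5", "w") as f:
    f.create_dataset("h", data=np.arange(5.0))

# fsspec is the I/O driver; h5netcdf/h5py is the format library; xarray
# is the data-wrangling layer. phony_dims lets h5netcdf read plain HDF5
# datasets that lack netCDF dimension scales.
with fsspec.open("granule.h5", mode="rb") as f:
    ds = xr.open_dataset(f, engine="h5netcdf", phony_dims="sort")
    values = ds["h"].values.tolist()
```

Each layer has its own defaults and caches, which is exactly where the accidental complexity (and the excess requests) comes from.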

Cloud-Optimized HDF5 Works!*
Code: https://gist.github.com/betolink/b545c364f80882c113b8cc27b763c729
Source: Andrew Barrett

Remote I/O Visualized
https://github.com/ajelenak/ros3vfd-log-info
● Cloud optimizations to HDF5
reduce requests by an order of
magnitude.
● Data that is not cloud optimized, or
is read with out-of-the-box
parameters, produces a lot of I/O.

What could we do with CO-HDF?
Xpublish, Kerchunk, SlideRule, happy researchers

Improving the performance of HDF5 in the cloud is key to
enabling science at scale. Thanks!