Cloud Optimized HDF5
for the
ICESat-2 mission
ESIP Summer meeting 2024
Luis López
Research Software Engineer
NSIDC
Andrew Barrett
Aleksandar Jelenak
Lisa Kaser
Jeff Lee
Amy Steiker
Credit: NASA's Goddard Space Flight Center
Important questions about our planet can now be answered by
integrating years of data from different missions.
Figures: Global Sea Ice Concentration; Boreal Forest Biomass
The data coming from these missions is
now available in the cloud! **
NASA and other agencies started to migrate their data to the cloud.
**Caveat: it is by and large in archival formats, HDF5 and NetCDF.
Problem: Accessing HDF5 in
the cloud is slow. How slow?
Improving the performance of HDF5 in the cloud is key to
enabling science at scale.
●Data is becoming too large to work
locally.
●I/O libraries are optimized for local and
supercomputing workflows.
●HDF as a format was not designed for
the cloud.
The problem = Size + Tools + Format
Cloud-optimized HDF5?
https://www.hdfgroup.org/2024/01/strategies-and-software-to-optimize-hdf5-netcdf-4-files-for-the-cloud/
●Metadata is consolidated
●Custom caching buffer size
●Global API lock is still in place
Why HDF is not Performant in the Cloud
●Metadata is scattered throughout the file, and each nested group makes this
problem worse.
●By default, metadata is written to (and read from) the file in fixed blocks of
4 KB. 1 MB of metadata ~= 250 requests.
●Because of the global API lock, those 250 requests are sequential!
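The request arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the 0.1 s round-trip latency is an assumed value for illustration, not a measurement from the talk, and the 8 MiB block size is just one example of a larger block.

```python
import math

ONE_KIB = 1024
ONE_MIB = 1024 * ONE_KIB


def metadata_requests(metadata_bytes, block_bytes):
    """Number of fixed-size block reads needed to cover the metadata."""
    return math.ceil(metadata_bytes / block_bytes)


def sequential_seconds(n_requests, latency_s):
    """Total wait when requests cannot overlap (one after another)."""
    return n_requests * latency_s


ASSUMED_LATENCY_S = 0.1  # assumption: one round trip to object storage

small_blocks = metadata_requests(ONE_MIB, 4 * ONE_KIB)  # 256, i.e. roughly 250
large_blocks = metadata_requests(ONE_MIB, 8 * ONE_MIB)  # 1

print(small_blocks, sequential_seconds(small_blocks, ASSUMED_LATENCY_S))
print(large_blocks, sequential_seconds(large_blocks, ASSUMED_LATENCY_S))
```

Because the requests are sequential, total wait time scales linearly with the request count, so a larger block size pays off directly.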
Paged Aggregation (data + metadata)
Metadata Blocks (user or dedicated page)
Trying Big Files from the ICESat-2 Mission
Source: https://github.com/nsidc/earthaccess/discussions/251
Accidental Complexity
NASA Policies
file format library
I/O driver
data wrangling library
AWS S3
ds = xr.open_dataset("s3://nasa-data.hdf5")
(or ROS3)
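The one-liner above hides several layers. A minimal sketch of how they could be spelled out is shown below, assuming s3fs as the I/O driver, h5netcdf (on top of h5py) as the file format library, and xarray as the data wrangling library. The helper names `cache_options` and `open_remote_hdf5` are hypothetical, and the 8 MiB block size is an illustrative choice rather than a recommendation from the talk.

```python
def cache_options(block_size_mib=8):
    """Keyword arguments for the fsspec file open; the values are illustrative."""
    return {
        "cache_type": "blockcache",
        "block_size": block_size_mib * 1024 * 1024,
    }


def open_remote_hdf5(url, group=None, **storage_options):
    """Open an HDF5 file on S3 through s3fs, h5netcdf and xarray."""
    # Imported lazily so the helper above can be used without these packages.
    import s3fs          # I/O driver layer
    import xarray as xr  # data wrangling layer

    fs = s3fs.S3FileSystem(**storage_options)  # credentials go here
    file_obj = fs.open(url, mode="rb", **cache_options())
    # h5netcdf is the file format layer; it reads from the file-like object.
    return xr.open_dataset(file_obj, engine="h5netcdf", group=group)
```

The function is not called here because it needs network access and valid credentials; the URL on the slide is a placeholder.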
It’s APIs All the Way Down
Cloud-Optimized HDF5 Works!*
Code: https://gist.github.com/betolink/b545c364f80882c113b8cc27b763c729
Source: Andrew Barrett
Remote I/O Visualized
https://github.com/ajelenak/ros3vfd-log-info
●Cloud optimizations to HDF5
reduce requests by an order of
magnitude
●Data that’s not cloud optimized or
is read with out-of-the-box
parameters produces a lot of I/O
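The effect described in these two bullets can be imitated with a toy model. The sketch below counts how many distinct blocks must be fetched to satisfy a set of small metadata reads, assuming an idealized cache that never evicts a block. The file size, the number of reads, and the random offsets are all made up for illustration; this is not a replay of real ICESat-2 access logs.

```python
import random

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def count_requests(read_offsets, block_size):
    """Distinct blocks touched, i.e. requests made with an ideal block cache."""
    return len({offset // block_size for offset in read_offsets})


rng = random.Random(0)  # fixed seed so the toy numbers are repeatable
N_READS = 500           # hypothetical number of small metadata reads

# Metadata scattered across a hypothetical 1 GiB file.
scattered = [rng.randrange(GIB) for _ in range(N_READS)]
# Metadata consolidated at the front of the file, within the first 1 MiB.
consolidated = [rng.randrange(MIB) for _ in range(N_READS)]

results = {
    "scattered, 4 KiB blocks": count_requests(scattered, 4 * KIB),
    "consolidated, 4 KiB blocks": count_requests(consolidated, 4 * KIB),
    "consolidated, 8 MiB blocks": count_requests(consolidated, 8 * MIB),
}

for label, n in results.items():
    print(f"{label}: {n} requests")
```

In this model, consolidating metadata caps the request count at the number of blocks the metadata spans, and a block larger than the metadata region collapses it to a single request.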
What could we do with CO-HDF?
Xpublish, Kerchunk, SlideRule, happy researchers
Improving the performance of HDF5 in the cloud is key to
enabling science at scale.
Thanks!