FAIRSpectra - Towards a common data file format for SIMS images

AlexHendersonManchester, 27 slides, Mar 11, 2025

About This Presentation

Presentation from the 24th International Conference on Secondary Ion Mass Spectrometry (SIMS-24) held in La Rochelle, France, 8-13 September 2024.
https://www.sims-24.com/

This presentation describes the issues relating to storing and sharing data from Secondary Ion Mass Spectrometry experiments, ...


Slide Content

FAIRSpectra
Towards a common data file format
for SIMS images
Alex Henderson
University of Manchester, UK
Office for Open Research
https://fairspectra.net https://alexhenderson.info

What is FAIR?
The FAIR Guiding Principles
Findable
Accessible
Interoperable
Reusable
https://www.go-fair.org
Interoperable
•Integration with other data, applications
and workflows for analysis, storage and processing
Reusable
•Well-described so they can be replicated
and/or combined in different settings

What is FAIRSpectra?
Community-driven initiative
Focus on hyperspectral imaging techniques
•File formats for hyperspectral imaging
•No standards exist right now
•Software tools to support these
•Metadata requirements
•Education and training
•Raising awareness

What is FAIRSpectra?
https://fairspectra.net https://fairspectra.zulipchat.com https://github.com/FAIRSpectra

Survey from SIMS Europe, UKSAF, and SpringSciX
Positives
•Everyone wanted to see something done, but was not sure how
Barriers
•People have difficulty sharing
•Poor documentation
•Proprietary file formats – loss of information
•Raw data vs. processed data – large file size
•Gazumping / IP & prior art / confidentiality
•Time consuming
Feedback from the community

What are the issues?
…for academia
Funders require ‘data’ to be deposited in (open) repositories
But…
•No dedicated repositories
•Metadata terms are patchy
•Instrument data in proprietary file formats
•Many software packages not compatible with open formats
Researchers willing to share, but don’t know how

What are the issues?
…for industry
Barriers
•FAIR often confused with Open
•In-house processes considered good enough
•Worry about certain metadata usage giving secrets away
Benefits
•Easier to share data in-house, between labs and (overseas) sites
•FAIR practices lead to better records retention
•Acquisitions and mergers become more straightforward
•Third-party (open source) software becomes more easily accessible
•Incoming staff already familiar with systems

What are the issues?
…for instrument vendors
Barriers
•Concern about giving away commercial advantage
•Internal effort to support additional export format
Benefits
•No need to be a ‘data science house’
•‘Outsource’ multivariate statistics/machine learning/AI to academia
•Cherry-pick externally developed methods into their software
•Software team can concentrate on instrument-specific tasks
•New & exciting software solutions sell the technique in new areas
→ drives instrument sales
Instrument manufacturer buy-in is vital

The problem
with SIMS
Answers on a postcard to…
fairspectra.net
Photo by Kelly Sikkema on Unsplash

SIMS data characteristics
•Huge number of data channels
•Very sparse – almost all data channels are zero counts
•Non-zero values are ‘clumped’ together → peaks
•Example from Gus
•File (IONTOF in .grd format) is 306 MB
•Stored densely, 256 × 256 pixels × 65 layers × ~1 million channels would be 15.5 TB
•File only holds locations (4D space: Z, Y, X, Ch) of non-zero counts
•Lossless compression
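
The 15.5 TB figure can be checked with a few lines of arithmetic. The sketch below makes assumptions the slide does not state: 4 bytes per stored count, exactly 1,000,000 channels, "TB" read as binary terabytes (2**40 bytes), and the 306 MB file size read as decimal megabytes. Under those assumptions the dense size comes out at roughly 15.5.

```python
# Assumptions (not stated on the slide): 4 bytes per count, exactly
# 1,000,000 channels, and "TB" interpreted as 2**40 bytes.
BYTES_PER_VALUE = 4
pixels_y, pixels_x, layers, channels = 256, 256, 65, 1_000_000

# Number of values if every channel of every voxel were stored.
dense_values = pixels_y * pixels_x * layers * channels
dense_bytes = dense_values * BYTES_PER_VALUE
dense_tib = dense_bytes / 2**40

# Size of the sparse .grd file from the slide, read as decimal megabytes.
sparse_bytes = 306 * 10**6
ratio = dense_bytes / sparse_bytes

print(f"dense values : {dense_values:,}")
print(f"dense size   : {dense_tib:.1f} TiB")
print(f"dense/sparse : about {ratio:,.0f} to 1")
```

With a different bytes-per-value or a decimal reading of TB the headline number shifts, but the gap between dense and sparse storage stays at several orders of magnitude.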

Solutions to the sparsity problem
Lossless compression
•Compress and decompress returns original data unchanged
•Use off-the-shelf lossless compression (e.g. ZIP)
•Only record non-zero events as coordinates
Lossy compression
•Some data discarded during compression
•Use off-the-shelf lossy compression (e.g. MP3-like, or JPEG-like)
•Down-bin spectra to lower mass resolution
•Use peak detection and only record centroid and area
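
As a rough illustration of "only record non-zero events as coordinates", the sketch below converts a tiny dense cube into a list of (Z, Y, X, Ch, count) tuples and back again. The function names and the toy data are hypothetical, and storing an explicit count per coordinate is an assumption made here for simplicity; it is not a description of any vendor file layout.

```python
from itertools import product


def to_coordinates(cube):
    """Return (shape, events); events lists (z, y, x, ch, count) for non-zero counts."""
    shape = (len(cube), len(cube[0]), len(cube[0][0]), len(cube[0][0][0]))
    events = [
        (z, y, x, ch, cube[z][y][x][ch])
        for z, y, x, ch in product(*(range(n) for n in shape))
        if cube[z][y][x][ch] != 0
    ]
    return shape, events


def from_coordinates(shape, events):
    """Rebuild the dense cube from its shape and the list of non-zero events."""
    nz, ny, nx, nch = shape
    cube = [[[[0] * nch for _ in range(nx)] for _ in range(ny)] for _ in range(nz)]
    for z, y, x, ch, count in events:
        cube[z][y][x][ch] = count
    return cube


# Toy data (hypothetical): 1 layer, 2 x 2 pixels, 6 channels, mostly zeros.
cube = [[
    [[0, 0, 3, 0, 0, 0], [0, 0, 0, 0, 0, 0]],
    [[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 2, 1]],
]]

shape, events = to_coordinates(cube)
restored = from_coordinates(shape, events)
print(events)
print("round trip ok:", restored == cube)
```

Because the dense cube is recovered exactly, this counts as lossless. The saving depends entirely on how sparse the data are: each stored event costs five numbers here, so the approach only pays off when far fewer than one in five values is non-zero.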

Problems with lossless
compression
•May need to unpack/unzip data to work on it
•Compression methods not tuned to SIMS data
•Designed for floating point numbers, or
text
•Bespoke encoding requires bespoke software
•Need to reinvent the wheel for each
algorithm
Image by Freeimages.com

Problems with
lossy compression
•Throwing away data
•Possible loss of mass or
spatial resolution
•Not possible to ‘round-trip’ data

What does an ideal scenario look like?
Depends on the context
•Visualisation
•Multivariate statistical analysis (central limit theorem-based)
•Machine learning analysis (AI?)
•Library search
•Data fusion with other techniques
•Use existing software from other techniques
•Fast to read and write
•Yet to be determined…
Accessible by
novice users
in addition to experts

Potential
solutions
Photo by Neil Thomas on Unsplash

imzML
•Developed by MALDI community
•Metadata – largely biological (HUPO)
•Data stored in two flavours
•Processed – discrete peaks/regions
•Who decided the peak positions? Parameters?
•Data locked in
•Continuous – all spectral channels, no compression
•Data becomes unwieldy/impossible to store
•No facility for 3D data

Methods from other fields
Astronomy – HDF5
Climate research – netCDF, HDF5, Zarr
Microscopy – OME-NGFF, OME-ZARR
•Chunked formats
•Built-in lossless compression
•Plugin compression methods

Alternative
data encoding
Formats like HDF5 and Zarr can
be chunked and compressed
Photo by Markus Spiske on Unsplash

Chunking of a hyper-mango
https://www.ambitiouskitchen.com/how-to-cut-a-mango/

Each mini-cube is separately addressable
https://www.blosc.org/posts/blosc2-ndim-intro/

Only relevant segments are loaded to RAM
→ cloud storage friendly
[Figure: data cube with X and Y axes; labels: Spectral range, Image]
Segments are cached and garbage collected
https://commons.wikimedia.org/wiki/File:OLAPcube.png
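
The idea of separately addressable, separately compressed segments can be sketched without any HDF5 or Zarr library. The example below is a one-dimensional simplification using only the Python standard library; the chunk length, the dictionary used as a "store", and the function names are all hypothetical. Reading one channel decompresses one chunk, not the whole spectrum.

```python
import zlib
from array import array

CHUNK = 1000  # channels per chunk (hypothetical choice)


def write_chunks(spectrum, chunk=CHUNK):
    """Compress each run of `chunk` channels separately, keyed by chunk index."""
    store = {}
    for start in range(0, len(spectrum), chunk):
        raw = array("I", spectrum[start:start + chunk]).tobytes()
        store[start // chunk] = zlib.compress(raw)
    return store


def read_channel(store, channel, chunk=CHUNK):
    """Decompress only the chunk that holds `channel` and return its count."""
    index, offset = divmod(channel, chunk)
    values = array("I")
    values.frombytes(zlib.decompress(store[index]))
    return values[offset]


# Toy spectrum (hypothetical): 10,000 channels with one small peak.
spectrum = [0] * 10_000
spectrum[4321] = 7
spectrum[4322] = 3

store = write_chunks(spectrum)
stored_bytes = sum(len(blob) for blob in store.values())
raw_bytes = len(array("I", spectrum).tobytes())

print("chunks written :", len(store))
print("channel 4321   :", read_channel(store, 4321))
print("stored < raw   :", stored_bytes < raw_bytes)
```

Chunked formats apply the same principle in several dimensions and add caching, which is what makes partial reads from remote storage practical.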

Different chunk sizes
Smaller chunks give higher granularity
in selection, but more addressable
segments increase data size
Chunks can be compressed.
Contents of a chunk change its
compressibility.
Difficult to predict overall file size
https://www.bbc.co.uk/bitesize/guides/zjs9dxs/revision/3
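
The trade-off can be made concrete by counting chunks. The sketch below takes a single-layer cube of 256 × 256 pixels × 1,000,000 channels (dimensions borrowed from the earlier example; the candidate chunk shapes and function names are hypothetical) and reports how many chunks exist in total and how many must be read for one single-channel image or one single-pixel spectrum.

```python
from math import ceil

SHAPE = (256, 256, 1_000_000)  # (Y, X, channels), one layer


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each axis for the given chunk shape."""
    return tuple(ceil(n / c) for n, c in zip(shape, chunk_shape))


def report(chunk_shape):
    """Count addressable chunks and the chunks touched by two typical reads."""
    gy, gx, gch = chunk_grid(SHAPE, chunk_shape)
    return {
        "total_chunks": gy * gx * gch,
        "chunks_for_one_image": gy * gx,    # one channel, all pixels
        "chunks_for_one_spectrum": gch,     # one pixel, all channels
    }


for chunk_shape in [(256, 256, 1), (1, 1, 1_000_000), (32, 32, 1000), (8, 8, 100)]:
    print(chunk_shape, report(chunk_shape))
```

Image-shaped chunks make image reads cheap and spectrum reads expensive, spectrum-shaped chunks do the opposite, and small cubic chunks sit in between at the cost of many more segments to index. The counts say nothing about compressed size, which depends on what each chunk contains.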

Where are we now?
•Prototype HDF5 compression filter ready to test
•Two types of compression included:
•Run-length encoding (RLE)
•Replace runs of zero values with a single negative number
•Guaranteed not to increase the size
0, 0, 1, 5, 2, 0, 0, 0, 0, 2, 3, … → -2, 1, 5, 2, -4, 2, 3, …
•Dictionary of keys (DOK) sparse encoding
•Only record non-zero values as {position, value} pairs
0, 0, 1, 5, 2, 0, 0, 0, 0, 2, 3, … → {3, 1}, {4, 5}, {5, 2}, {10, 2}, {11, 3}, …
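
Both encodings are simple enough to sketch in plain Python. The code below is an illustration of the two ideas applied to the example sequence, not the prototype HDF5 filter itself. The function names are hypothetical, counts are assumed to be non-negative integers, and the sparse pairs are taken to be (1-based position, value).

```python
def rle_encode(values):
    """Replace each run of zeros with a single negative number (minus the run length)."""
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
            continue
        if run:
            out.append(-run)
            run = 0
        out.append(v)
    if run:  # flush a trailing run of zeros
        out.append(-run)
    return out


def rle_decode(encoded):
    """Expand negative entries back into runs of zeros."""
    out = []
    for v in encoded:
        if v < 0:
            out.extend([0] * -v)
        else:
            out.append(v)
    return out


def dok_encode(values):
    """Keep only non-zero entries as (1-based position, value) pairs."""
    return [(pos, v) for pos, v in enumerate(values, start=1) if v != 0]


def dok_decode(pairs, length):
    """Rebuild the dense sequence; the original length must be stored separately."""
    out = [0] * length
    for pos, v in pairs:
        out[pos - 1] = v
    return out


seq = [0, 0, 1, 5, 2, 0, 0, 0, 0, 2, 3]
rle = rle_encode(seq)
dok = dok_encode(seq)

print(rle)
print(dok)
print(rle_decode(rle) == seq, dok_decode(dok, len(seq)) == seq)
```

In this sketch a lone zero becomes a single -1, so the run-length output never has more entries than the input. The pair encoding stores two numbers per non-zero value and therefore only wins when the data are sparse enough.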

What is still needed?
•Explore compression ratio as function of:
•Instrument
•Sample type
•Chunk size
•Need more samples to test – different instruments too
•Tooling required (in Python and MATLAB) to simplify usage
•Only discussing binary payload of file
•Still need to address appropriate metadata

On another topic…
•Outcome of IUVSTA 101 meeting
•Proposed platform for visualisation and algorithmic access
•Looking for volunteers to develop the look and feel of this
•All skill levels are welcome!
•Anyone interested, meet at reception desk during coffee on
Thursday afternoon

Summary
•Need a file format for hyperspectral images
•HDF5 sparse filter developed to compress data
•Metadata not yet considered
•Need partners to test/optimise solutions
•Collecting partners to design/develop visualisation and testing
harness
Join FAIRSpectra today!
https://fairspectra.net

Thanks…
•For financial support
•University of Manchester’s Office for Open Research
•SurfaceSpectra Ltd.
•For in-kind support
•101st IUVSTA Workshop (metadata workshop)
•UK Surface Analysis Users Forum (UKSAF) (free exhibition space)
•SIMS Europe (free exhibition space)
•SpringSciX 2024 (free exhibition space)