FAIRSpectra - Towards a common data file format for SIMS images

AlexHendersonManchester 52 views 29 slides May 26, 2024

About This Presentation

Presentation from the 101st IUVSTA Workshop on High performance SIMS instrumentation and machine learning / artificial intelligence methods for complex data.
This presentation describes the issues relating to storing and sharing data from Secondary Ion Mass Spectrometry experiments, and some potent...


Slide Content

FAIRSpectra
Towards a common data file format
for SIMS images
Alex Henderson
University of Manchester, UK
Office for Open Research
https://fairspectra.net https://alexhenderson.info

Thanks…
•For financial support
•University of Manchester’s Office for Open Research
•SurfaceSpectra Ltd.
•For in-kind support
•101st IUVSTA Workshop (metadata workshop)
•UK Surface Analysis Users Forum (UKSAF) (free exhibition space)
•SIMS Europe (free exhibition space)
•SpringSciX 2024 (free exhibition space)

What is FAIR?
The FAIR Guiding Principles
Findable
Accessible
Interoperable
Reusable
https://www.go-fair.org
Interoperable
•Integration with other data, applications and workflows for analysis, storage and processing
Reusable
•Well-described so they can be replicated and/or combined in different settings

What is FAIRSpectra?
Community-driven initiative
Focus on hyperspectral imaging techniques
•File formats for hyperspectral imaging
•No standards exist right now
•Software tools to support these
•Metadata requirements
•Education and training
•Raising awareness

What is FAIRSpectra?
https://fairspectra.net https://fairspectra.zulipchat.com https://github.com/FAIRSpectra

Survey from SIMS Europe, UKSAF, and SpringSciX
Positives
•Everyone wanted to see something done, but was not sure how
Barriers
•People have difficulty sharing
•Poor documentation
•Proprietary file formats – loss of information
•Raw data vs. processed data – large file size
•Gazumping / IP & prior art / confidentiality
•Time consuming
Feedback from the community

What are the issues?
…for academia
Funders require ‘data’ to be deposited in (open) repositories
But…
•No dedicated repositories
•Metadata terms are patchy
•Instrument data in proprietary file formats
•Many software packages not compatible with open formats
Researchers willing to share, but don’t know how

What are the issues?
…for industry
Barriers
•FAIR often confused with Open
•In-house processes considered good enough
•Worry about certain metadata usage giving secrets away
Benefits
•Easier to share data in-house, between labs and (overseas) sites
•FAIR practices lead to better records retention
•Acquisitions and mergers become more straightforward
•Third-party (open source) software becomes more easily accessible
•Incoming staff already familiar with systems

What are the issues?
…for instrument vendors
Barriers
•Concern about giving away commercial advantage
•Internal effort to support additional export format
Benefits
•No need to be a ‘data science house’
•‘Outsource’ multivariate statistics/machine learning/AI to academia
•Cherry-pick externally developed methods into their software
•Software team can concentrate on instrument-specific tasks
•New & exciting software solutions sell the technique in new areas
→ drives instrument sales
Instrument manufacturer buy-in is vital

The problem with SIMS
Answers on a postcard to…
fairspectra.net
Photo by Kelly Sikkema on Unsplash

SIMS data characteristics
•Huge number of data channels
•Very sparse – almost all data channels are zero counts
•Non-zero values are ‘clumped’ together → peaks
•Example from Gus
•File (IONTOF in .grd format) is 306 MB
•256 × 256 pixels × 65 layers × ~1 million channels would be 15.5 TB if every channel were stored
•File only holds the locations (4D space: Z, Y, X, Ch) of non-zero counts
•Lossless compression

Solutions to the sparsity problem
Lossless compression
•Compress and decompress returns original data unchanged
•Use off-the-shelf lossless compression (e.g. ZIP)
•Only record non-zero events as coordinates
Lossy compression
•Some data discarded during compression
•Use off-the-shelf lossy compression (e.g. MP3-like, or JPEG-like)
•Down-bin spectra to lower mass resolution
•Use peak detection and only record centroid and area

Problems with lossless compression
•May need to unpack/unzip data to work on it
•Compression methods not tuned to SIMS data
•Designed for floating point numbers, or text
•Bespoke encoding requires bespoke software
•Need to reinvent the wheel for each algorithm
Image by Freeimages.com

Problems with lossy compression
•Throwing away data
•Possible loss of mass or spatial resolution
•Not possible to ‘round-trip’ data

What does an ideal scenario look like?
Depends on the context
•Visualisation
•Multivariate statistical analysis (central limit theorem-based)
•Machine learning analysis (AI?)
•Library search
•Data fusion with other techniques
•Use existing software from other techniques
•Fast to read and write
•Yet to be determined…
Accessible by novice users in addition to experts

Potential solutions
Photo by Neil Thomas on Unsplash

imzML
•Developed by MALDI community
•Metadata – largely biological (HUPO)
•Data stored in two flavours
•Processed – discrete peaks/regions
•Who decided the peak positions? Parameters?
•Data locked in
•Continuous – all spectral channels, no compression
•Data becomes unwieldy/impossible to store
•No facility for 3D data

Aside on peak detection
•Many methods available
•Unclear what instrument vendors do
•My method (in ChiToolbox on GitHub)
•Determine total ion spectrum
•Gaussian smooth
•Second derivative
•Determine zero-crossing points to get peak limits
•Determine channel containing centroid of total ion spectrum peak
•Apply these limits to each pixel
•Calculate area of discovered peak
https://github.com/AlexHenderson/ChiToolbox/blob/master/ChiToolbox/%40ChiMSCharacter/peakdetect.m
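
The steps listed above can be sketched in a few lines of Python. This is not a port of the ChiToolbox MATLAB code linked above; it is an independent illustration of the same sequence of steps. The smoothing width `sigma`, the `min_depth` threshold used to ignore insignificant curvature, the choice of negative-curvature runs as peak regions, and the toy data are all assumptions made for the example.

```python
import math

def gaussian_smooth(y, sigma):
    """Convolve y with a normalised Gaussian kernel, clamping at the edges."""
    half = int(3 * sigma)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-half, half + 1)]
    norm = sum(kernel)
    n = len(y)
    return [
        sum(w * y[min(max(i + k - half, 0), n - 1)] for k, w in enumerate(kernel)) / norm
        for i in range(n)
    ]

def second_derivative(y):
    """Central second difference; the two end channels are left at zero."""
    d2 = [0.0] * len(y)
    for i in range(1, len(y) - 1):
        d2[i] = y[i - 1] - 2 * y[i] + y[i + 1]
    return d2

def peak_limits(d2, min_depth=1.0):
    """Runs of negative curvature, bounded by zero crossings, deeper than min_depth."""
    limits, start = [], None
    for i, v in enumerate(d2 + [0.0]):      # sentinel closes a trailing run
        if v < 0 and start is None:
            start = i
        elif v >= 0 and start is not None:
            if min(d2[start:i]) < -min_depth:
                limits.append((start, i - 1))
            start = None
    return limits

def centroid(spectrum, lo, hi):
    """Intensity-weighted mean channel of spectrum[lo..hi] (inclusive)."""
    total = sum(spectrum[lo:hi + 1])
    return sum(ch * spectrum[ch] for ch in range(lo, hi + 1)) / total

def detect_and_integrate(pixels, sigma=1.5):
    # Total ion spectrum: sum every pixel's spectrum channel by channel.
    total_ion = [sum(column) for column in zip(*pixels)]
    # Smooth, differentiate twice, and take zero crossings as peak limits.
    limits = peak_limits(second_derivative(gaussian_smooth(total_ion, sigma)))
    # Centroid of each peak, measured on the total ion spectrum.
    centroids = [centroid(total_ion, lo, hi) for lo, hi in limits]
    # Apply the same limits to every pixel; the summed counts are the area.
    areas = [[sum(px[lo:hi + 1]) for lo, hi in limits] for px in pixels]
    return limits, centroids, areas

def toy_peak(centre, amplitude, n_channels=60, width=2.0):
    """Hypothetical test data: integer counts under a Gaussian envelope."""
    return [round(amplitude * math.exp(-0.5 * ((ch - centre) / width) ** 2))
            for ch in range(n_channels)]

pixels = [
    toy_peak(20, 100),
    [a + b for a, b in zip(toy_peak(20, 50), toy_peak(45, 80))],
    toy_peak(45, 40),
]
limits, centroids, areas = detect_and_integrate(pixels)
print(limits, [round(c, 2) for c in centroids], areas)
```

Because the limits come from the total ion spectrum and are reused for every pixel, the resulting peak areas are aligned across the image. The limits found this way bracket only the region between the inflection points of each peak, so a real implementation may choose to widen them.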

Issues with peak detection
•Works well for intense peaks
•Noisy peaks become ‘perfect’
•Noise characteristics/statistics lost
•Peaks could have shoulders included
•Moves centroid: OK for visualisation, bad for library search
•Parameters usually not shared
•Original data may no longer be available/shared/readable
•Loss of peak shape
•Detecting on each pixel means centroids not aligned in image data
•Don’t know what we don’t know…
Photo by Louise Tollisen on Unsplash

Methods from other fields
Astronomy – HDF5
Climate research – netCDF, HDF5, Zarr
Microscopy – OME-NGFF, OME-Zarr
•Chunked formats
•Built-in lossless compression
•Plugin compression methods

Alternative data encoding
Formats like HDF5 and Zarr can be chunked and compressed
Photo by Markus Spiske on Unsplash

Chunking of a hyper-mango
https://www.ambitiouskitchen.com/how-to-cut-a-mango/

Each mini-cube is separately addressable
https://www.blosc.org/posts/blosc2-ndim-intro/

Only relevant segments are loaded to RAM
→ cloud storage friendly
Figure labels: Spectral range, Image, X, Y
Segments are cached and garbage collected
https://commons.wikimedia.org/wiki/File:OLAPcube.png
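
The idea of separately addressable chunks can be illustrated without any particular file format. The sketch below does not use the HDF5 or Zarr APIs; it is a toy stand-in built on Python's standard `zlib` and `array` modules, with a dictionary playing the role of the chunk store. The cube shape, chunk shape, and data are hypothetical.

```python
import zlib
from array import array
from itertools import product

SHAPE = (8, 8, 64)   # (Y, X, channels): a toy cube, far smaller than real data
CHUNK = (4, 4, 16)   # chunk shape, chosen arbitrarily for the illustration

def value(y, x, ch):
    """Hypothetical sparse data: one peak at channels 20..22, zero elsewhere."""
    return y + x + 1 if 20 <= ch <= 22 else 0

def write_chunks():
    """Compress each chunk separately and index it by its chunk coordinates."""
    store = {}
    for y0, x0, c0 in product(*(range(0, s, c) for s, c in zip(SHAPE, CHUNK))):
        block = array("H", (value(y, x, ch)
                            for y in range(y0, y0 + CHUNK[0])
                            for x in range(x0, x0 + CHUNK[1])
                            for ch in range(c0, c0 + CHUNK[2])))
        key = (y0 // CHUNK[0], x0 // CHUNK[1], c0 // CHUNK[2])
        store[key] = zlib.compress(block.tobytes())
    return store

def read_image(store, ch):
    """Build the ion image for one channel, decompressing only the chunks needed."""
    cache = {}
    image = []
    for y in range(SHAPE[0]):
        row = []
        for x in range(SHAPE[1]):
            key = (y // CHUNK[0], x // CHUNK[1], ch // CHUNK[2])
            if key not in cache:
                block = array("H")
                block.frombytes(zlib.decompress(store[key]))
                cache[key] = block
            ly, lx, lc = y % CHUNK[0], x % CHUNK[1], ch % CHUNK[2]
            row.append(cache[key][(ly * CHUNK[1] + lx) * CHUNK[2] + lc])
        image.append(row)
    return image, len(cache)

store = write_chunks()
image, n_loaded = read_image(store, 21)
print(f"decompressed {n_loaded} of {len(store)} chunks")
```

Reading a single-channel image touches only the chunks that cover that channel, which is why chunked layouts suit cloud storage, where each chunk can be fetched independently.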

Different chunk sizes
Smaller chunks give finer granularity in selection, but the larger number of addressable segments increases data size
Chunks can be compressed.
The contents of a chunk change its compressibility.
Difficult to predict overall file size
https://www.bbc.co.uk/bitesize/guides/zjs9dxs/revision/3
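
The trade-off can be seen in a small experiment. This is a sketch only: it uses `zlib` from the Python standard library as a stand-in compressor rather than any HDF5 or Zarr codec, and the spectrum, peak positions, and chunk lengths are hypothetical.

```python
import zlib
from array import array

N_CHANNELS = 4096

def toy_spectrum():
    """Hypothetical sparse spectrum: three narrow peaks, zero elsewhere."""
    counts = [0] * N_CHANNELS
    for centre in (500, 1800, 3300):
        for offset, value in zip(range(-2, 3), (3, 40, 120, 45, 2)):
            counts[centre + offset] = value
    return counts

def compressed_chunks(counts, chunk_len):
    """Split into fixed-length chunks and zlib-compress each one separately."""
    return [
        zlib.compress(array("H", counts[i:i + chunk_len]).tobytes())
        for i in range(0, len(counts), chunk_len)
    ]

counts = toy_spectrum()
totals = {}
for chunk_len in (64, 512, 4096):
    chunks = compressed_chunks(counts, chunk_len)
    totals[chunk_len] = sum(len(c) for c in chunks)
    print(f"chunk length {chunk_len:4d}: {len(chunks):2d} chunks, "
          f"{totals[chunk_len]} bytes in total")

# Compressibility depends on content: an empty chunk versus one holding a peak.
small = compressed_chunks(counts, 64)
empty_size = len(small[0])           # channels 0..63, all zero
peak_size = len(small[500 // 64])    # the chunk containing the peak at 500
print(f"empty chunk: {empty_size} bytes, chunk with a peak: {peak_size} bytes")
```

Each compressed chunk carries a fixed overhead, so many small chunks cost more in total than one large chunk, while the size of any individual chunk depends on what it happens to contain.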

Compromise
This Photo by Unknown Author is licensed under CC BY-NC-ND

Peaks vs sticks
•Intermediate encoding
•Pseudo thresholding
•Use peak detection limits to separate ‘interesting’ from ‘uninteresting’ spectral ranges
•Not the same as intensity thresholding
•Peak detection can still be performed on regions
•Compression of entire spectrum more efficient
Photo by Stéphane Fellay on Unsplash

Encapsulate in chunked format
•Hold the full spectral resolution of the detected peaks
•Discard data between peaks
•Start with chunked data format
•Develop compression plugin for these formats
•Produces a pseudo-continuous spectrum with acceptable compression
Sean Lucas
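
A minimal sketch of the pseudo-continuous idea is given below: channels inside the detected peak limits keep their full-resolution counts, and everything between peaks is set to zero before compression. The peak limits are assumed to come from an earlier peak detection step, `zlib` stands in for whatever compression plugin is eventually developed, and the spectrum is hypothetical.

```python
import zlib
from array import array

def keep_peak_regions(spectrum, limits):
    """Zero every channel outside the (start, stop) peak limits, inclusive."""
    out = [0] * len(spectrum)
    for start, stop in limits:
        out[start:stop + 1] = spectrum[start:stop + 1]
    return out

def compressed_size(spectrum):
    """Size in bytes after zlib compression of 16-bit counts."""
    return len(zlib.compress(array("H", spectrum).tobytes()))

# Hypothetical spectrum: two peaks plus scattered single counts between them.
spectrum = [1 if (ch * 37) % 101 < 5 else 0 for ch in range(1000)]
spectrum[200:205] = [4, 30, 90, 35, 5]
spectrum[640:645] = [2, 15, 60, 20, 3]
limits = [(198, 206), (638, 646)]   # assumed output of a peak detection step

pseudo = keep_peak_regions(spectrum, limits)
print(compressed_size(spectrum), "->", compressed_size(pseudo), "bytes")
```

Unlike recording only a centroid and an area, the peak shape inside each region is preserved, so peak detection can still be rerun on the retained data.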

Summary
The researchers are willing, but their resources are weak
•Few solutions currently exist
•Metadata terms missing
•Proprietary file formats are a barrier
•Instrument vendor buy-in required
•Lack of awareness persists
But…
•Some low-hanging fruit
•Opportunity to make an impact
•Even Closed FAIR can still have benefits to industry
There’s lots to do, but FAIRSpectra is just getting started!
https://fairspectra.net https://alexhenderson.info