FAIRSpectra - Towards a common data file format for SIMS images
AlexHendersonManchester
May 26, 2024
About This Presentation
Presentation from the 101st IUVSTA Workshop on High performance SIMS instrumentation and machine learning / artificial intelligence methods for complex data.
This presentation describes the issues relating to storing and sharing data from Secondary Ion Mass Spectrometry experiments, and some potential solutions.
Slide Content
FAIRSpectra
Towards a common data file format
for SIMS images
Alex Henderson
University of Manchester, UK
Office for Open Research
https://fairspectra.net https://alexhenderson.info
Thanks…
•For financial support
•University of Manchester’s Office for Open Research
•SurfaceSpectra Ltd.
•For in-kind support
•101st IUVSTA Workshop (metadata workshop)
•UK Surface Analysis Users Forum (UKSAF) (free exhibition space)
•SIMS Europe (free exhibition space)
•SpringSciX 2024 (free exhibition space)
What is FAIR?
The FAIR Guiding Principles
Findable
Accessible
Interoperable
Reusable
https://www.go-fair.org
Interoperable
•Integration with other data, applications and workflows for analysis, storage and processing
Reusable
•Well-described so they can be replicated and/or combined in different settings
What is FAIRSpectra?
Community-driven initiative
Focus on hyperspectral imaging techniques
•File formats for hyperspectral imaging
•No standards exist right now
•Software tools to support these
•Metadata requirements
•Education and training
•Raising awareness
What is FAIRSpectra?
https://fairspectra.net https://fairspectra.zulipchat.com https://github.com/FAIRSpectra
Survey from SIMS Europe, UKSAF, and SpringSciX
Positives
•Everyone wanted to see something done, but was not sure how
Barriers
•People have difficulty sharing
•Poor documentation
•Proprietary file formats – loss of information
•Raw data vs. processed data – large file size
•Gazumping / IP & prior art / confidentiality
•Time consuming
Feedback from the community
What are the issues?
…for academia
Funders require ‘data’ to be deposited in (open) repositories
But…
•No dedicated repositories
•Metadata terms are patchy
•Instrument data in proprietary file formats
•Many software packages not compatible with open formats
Researchers willing to share, but don’t know how
What are the issues?
…for industry
Barriers
•FAIR often confused with Open
•In-house processes considered good enough
•Worry about certain metadata usage giving secrets away
Benefits
•Easier to share data in-house, between labs and (overseas) sites
•FAIR practices lead to better records retention
•Acquisitions and mergers become more straightforward
•Third-party (open source) software becomes more easily accessible
•Incoming staff already familiar with systems
What are the issues?
…for instrument vendors
Barriers
•Concern about giving away commercial advantage
•Internal effort to support additional export format
Benefits
•No need to be a ‘data science house’
•‘Outsource’ multivariate statistics/machine learning/AI to academia
•Cherry-pick externally developed methods into their software
•Software team can concentrate on instrument-specific tasks
•New & exciting software solutions sell the technique in new areas
→ drives instrument sales
Instrument manufacturer buy-in is vital
The problem with SIMS
Answers on a postcard to…
fairspectra.net
Photo by Kelly Sikkema on Unsplash
SIMS data characteristics
•Huge number of data channels
•Very sparse – almost all data channels are zero counts
•Non-zero values are ‘clumped’ together → peaks
•Example from Gus
•File (IONTOF in .grd format) is 306 MB
•Stored densely, 256 × 256 pixels × 65 layers × ~1 million channels would be 15.5 TB
•File only holds the locations (4D space: Z, Y, X, Ch) of non-zero positions
•Lossless compression
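The arithmetic behind the 15.5 TB figure can be sketched as follows. The slide does not state the storage width, so 4 bytes per channel is an assumption here; with it the dense size comes out at about 15.5 TiB. The `to_events` helper and the tiny nested-list cube are hypothetical, and only illustrate the idea of recording non-zero positions. They are not the IONTOF .grd layout.

```python
# Dimensions quoted on the slide.
rows, cols, layers = 256, 256, 65
channels = 1_000_000      # "~1 million" on the slide, rounded here
bytes_per_value = 4       # ASSUMPTION: 32-bit counts, not stated on the slide

dense_values = rows * cols * layers * channels
dense_bytes = dense_values * bytes_per_value
dense_tib = dense_bytes / 2**40

print(f"dense size: {dense_tib:.1f} TiB")


def to_events(cube):
    """List (z, y, x, ch, count) for every non-zero entry of a nested-list cube.

    Hypothetical toy layout, not the vendor file format.
    """
    events = []
    for z, layer in enumerate(cube):
        for y, row in enumerate(layer):
            for x, spectrum in enumerate(row):
                for ch, count in enumerate(spectrum):
                    if count:
                        events.append((z, y, x, ch, count))
    return events


# One layer, 2 x 2 pixels, 8 channels: 32 values, only 3 of them non-zero.
tiny_cube = [[
    [[0, 0, 3, 0, 0, 0, 0, 0], [0] * 8],
    [[0] * 8, [0, 0, 0, 0, 0, 1, 2, 0]],
]]
events = to_events(tiny_cube)
print(events)
```

Only three events need storing for the 32-value toy cube. The real file benefits in the same way because almost all channels are zero.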
Solutions to the sparsity problem
Lossless compression
•Compress and decompress returns original data unchanged
•Use off-the-shelf lossless compression (e.g. ZIP)
•Only record non-zero events as coordinates
Lossy compression
•Some data discarded during compression
•Use off-the-shelf lossy compression (e.g. MP3-like, or JPEG-like)
•Down-bin spectra to lower mass resolution
•Use peak detection and only record centroid and area
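A minimal sketch of the difference between the two families, using a made-up sparse spectrum. Python's `zlib` (the DEFLATE method used by ZIP) stands in for off-the-shelf lossless compression, and a hypothetical `down_bin` helper stands in for the lossy route. The bin factor of 4 is arbitrary.

```python
import zlib
from array import array

# Made-up spectrum: 1000 channels of 16-bit counts, almost all zero.
spectrum = array("H", [0] * 1000)
for ch, count in [(100, 2), (101, 7), (102, 3), (640, 1), (641, 4)]:
    spectrum[ch] = count

# Lossless: compress, decompress, and get the original back unchanged.
raw = spectrum.tobytes()
packed = zlib.compress(raw)
restored = array("H")
restored.frombytes(zlib.decompress(packed))
lossless_ok = restored == spectrum


def down_bin(counts, factor):
    """Sum neighbouring channels; total counts survive, channel positions do not."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


# Lossy: four channels per bin, so mass resolution drops by a factor of four.
binned = down_bin(spectrum, 4)

print(len(raw), "bytes raw,", len(packed), "bytes compressed")
print(len(spectrum), "channels down-binned to", len(binned))
```

After down-binning, the counts at channels 100 to 102 sit in a single bin, so the original peak shape cannot be recovered from `binned`.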
Problems with lossless compression
•May need to unpack/unzip data to work on it
•Compression methods not tuned to SIMS data
•Designed for floating point numbers, or text
•Bespoke encoding requires bespoke software
•Need to reinvent the wheel for each algorithm
Image by Freeimages.com
Problems with lossy compression
•Throwing away data
•Possible loss of mass or spatial resolution
•Not possible to ‘round-trip’ data
What does an ideal scenario look like?
Depends on the context
•Visualisation
•Multivariate statistical analysis (central limit theorem-based)
•Machine learning analysis (AI?)
•Library search
•Data fusion with other techniques
•Use existing software from other techniques
•Fast to read and write
•Yet to be determined…
Accessible by novice users in addition to experts
Potential solutions
Photo by Neil Thomas on Unsplash
imzML
•Developed by MALDI community
•Metadata – largely biological (HUPO)
•Data stored in two flavours
•Processed – discrete peaks/regions
•Who decided the peak positions? Parameters?
•Data locked in
•Continuous – all spectral channels, no compression
•Data becomes unwieldy/impossible to store
•No facility for 3D data
Aside on peak detection
•Many methods available
•Unclear what instrument vendors do
•My method (in ChiToolbox on GitHub)
•Determine total ion spectrum
•Gaussian smooth
•Second derivative
•Determine zero-crossing points to get peak limits
•Determine channel containing centroid of total ion spectrum peak
•Apply these limits to each pixel
•Calculate area of discovered peak
https://github.com/AlexHenderson/ChiToolbox/blob/master/ChiToolbox/%40ChiMSCharacter/peakdetect.m
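The steps listed above can be sketched in a few lines. This is not the ChiToolbox code: it is a plain-Python illustration of the same sequence (total ion spectrum, Gaussian smooth, second derivative, zero crossings, centroid, per-pixel area). All function names, the smoothing width `sigma=1.5`, the tolerance and the two-pixel toy data are assumptions made for the example.

```python
import math


def gaussian_smooth(y, sigma):
    """Convolve y with a normalised Gaussian truncated at 3 sigma (edges clamped)."""
    radius = int(3 * sigma)
    kernel = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    norm = sum(kernel)
    last = len(y) - 1
    return [
        sum(w * y[min(max(i + k - radius, 0), last)] for k, w in enumerate(kernel)) / norm
        for i in range(len(y))
    ]


def second_derivative(y):
    """Central second difference; the two end points are left at zero."""
    d2 = [0.0] * len(y)
    for i in range(1, len(y) - 1):
        d2[i] = y[i - 1] - 2 * y[i] + y[i + 1]
    return d2


def peak_limits(d2, tol=1e-9):
    """Runs of channels with negative curvature, bounded by the zero crossings."""
    limits, start = [], None
    for ch, value in enumerate(d2):
        if value < -tol and start is None:
            start = ch
        elif value >= -tol and start is not None:
            limits.append((start, ch - 1))
            start = None
    if start is not None:
        limits.append((start, len(d2) - 1))
    return limits


def centroid_channel(spectrum, lo, hi):
    """Channel nearest the intensity-weighted mean position between lo and hi."""
    total = sum(spectrum[lo:hi + 1])
    return round(sum(ch * spectrum[ch] for ch in range(lo, hi + 1)) / total)


# Toy image: two pixels, 60 channels each (made-up counts).
pixel_a = [0] * 60
pixel_b = [0] * 60
for ch, count in [(19, 2), (20, 6), (21, 2), (40, 3)]:
    pixel_a[ch] = count
for ch, count in [(20, 4), (21, 1), (39, 1), (40, 5), (41, 1)]:
    pixel_b[ch] = count
pixels = [pixel_a, pixel_b]

# 1. total ion spectrum, 2. smooth, 3. second derivative, 4. zero crossings
total_ion = [sum(counts) for counts in zip(*pixels)]
limits = peak_limits(second_derivative(gaussian_smooth(total_ion, sigma=1.5)))
# 5. centroid channel of each total-ion peak
centroids = [centroid_channel(total_ion, lo, hi) for lo, hi in limits]
# 6. apply the same limits to every pixel and 7. integrate the area
areas = [[sum(p[lo:hi + 1]) for lo, hi in limits] for p in pixels]

print(limits, centroids, areas)
```

Because the limits come from the total ion spectrum, every pixel is integrated over the same channel ranges, which keeps centroids aligned across the image.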
Issues with peak detection
•Works well for intense peaks
•Noisy peaks become ‘perfect’
•Noise characteristics/statistics lost
•Peaks could have shoulders included
•Moves centroid: OK for visualisation, bad for library search
•Parameters usually not shared
•Original data may no longer be available/shared/readable
•Loss of peak shape
•Detecting on each pixel means centroids not aligned in image data
•Don’t know what we don’t know…
Photo by Louise Tollisen on Unsplash
Methods from other fields
Astronomy – HDF5
Climate research – netCDF, HDF5, Zarr
Microscopy – OME-NGFF, OME-ZARR
•Chunked formats
•Built-in lossless compression
•Plugin compression methods
Alternative data encoding
Formats like HDF5 and Zarr can be chunked and compressed
Photo by Markus Spiske on Unsplash
Chunking of a hyper-mango
https://www.ambitiouskitchen.com/how-to-cut-a-mango/
Each mini-cube is separately addressable
https://www.blosc.org/posts/blosc2-ndim-intro/
Only relevant segments are loaded to RAM
→ cloud storage friendly
[Figure labels: Spectral range, Image, X, Y]
Segments are cached and garbage collected
https://commons.wikimedia.org/wiki/File:OLAPcube.png
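To make "only relevant segments are loaded" concrete, the sketch below counts which chunks a request would touch. It uses no HDF5 or Zarr calls. The chunk shape `(64, 64, 10_000)` and the helper names are hypothetical, and a single 256 × 256 layer with 1 million channels is assumed for the cube.

```python
import math
from itertools import product


def chunk_grid(shape, chunk_shape):
    """Number of chunks along each axis."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunk_shape))


def chunks_for_region(region, chunk_shape):
    """Indices of every chunk overlapping region, given as (start, stop) per axis."""
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(region, chunk_shape)
    ]
    return list(product(*ranges))


shape = (256, 256, 1_000_000)      # (Y, X, channel), one layer
chunk_shape = (64, 64, 10_000)     # ASSUMPTION: arbitrary chunk shape
grid = chunk_grid(shape, chunk_shape)
total_chunks = math.prod(grid)

# Ion image: every pixel, but only a narrow window of 50 channels.
image_chunks = chunks_for_region([(0, 256), (0, 256), (27_000, 27_050)], chunk_shape)

# Full spectrum of the single pixel at Y=10, X=200.
spectrum_chunks = chunks_for_region([(10, 11), (200, 201), (0, 1_000_000)], chunk_shape)

print(f"{total_chunks} chunks in total")
print(f"ion image touches {len(image_chunks)}, pixel spectrum touches {len(spectrum_chunks)}")
```

The chunk shape decides which access pattern is cheap. Here an ion image needs far fewer chunks than a single-pixel spectrum, and a different shape would reverse that.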
Different chunk sizes
Smaller chunks give finer granularity in selection, but more addressable segments increase data size
Chunks can be compressed
The contents of a chunk affect its compressibility
Difficult to predict overall file size
https://www.bbc.co.uk/bitesize/guides/zjs9dxs/revision/3
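A rough way to see the trade-off is to compress the same made-up sparse spectrum at several chunk lengths and add up the pieces. `zlib` stands in for whatever compressor a chunked format would apply. The one-byte-per-channel layout, the clump positions and the chunk lengths are all assumptions for illustration.

```python
import zlib

n_channels = 100_000
spectrum = bytearray(n_channels)                  # one byte per channel, all zero
for centre in range(5_000, n_channels, 10_000):   # ten small clumps of counts
    spectrum[centre - 1:centre + 2] = bytes([3, 9, 4])


def chunked_size(data, chunk_len):
    """Total bytes after compressing each chunk independently."""
    return sum(
        len(zlib.compress(bytes(data[i:i + chunk_len])))
        for i in range(0, len(data), chunk_len)
    )


sizes = {chunk_len: chunked_size(spectrum, chunk_len) for chunk_len in (100, 1_000, 10_000)}
for chunk_len, size in sizes.items():
    print(f"chunk length {chunk_len:>6}: {n_channels // chunk_len:>5} chunks, {size} bytes")
```

Every compressed chunk carries a fixed overhead, so many small chunks cost more in total than a few large ones, even though each small chunk is cheaper to fetch on its own.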
Compromise
This Photo by Unknown Author is licensed under CC BY-NC-ND
Peaks vs sticks
•Intermediate encoding
•Pseudo thresholding
•Use peak detection limits to separate ‘interesting’ from ‘uninteresting’ spectral ranges
•Not the same as intensity thresholding
•Peak detection can still be performed on regions
•Compression of entire spectrum more efficient
Photo by Stéphane Fellay on Unsplash
Encapsulate in chunked format
•Hold the full spectral resolution of the detected peaks
•Discard data between peaks
•Start with chunked data format
•Develop compression plugin for these formats
•Produces a pseudo-continuous spectrum with acceptable compression
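The intermediate encoding might look something like the sketch below: every channel inside the peak limits is kept at full resolution, and everything between peaks is dropped. It is not a proposed file layout. The function names, the toy spectrum and the single peak limit `(8, 12)` are hypothetical. An intensity threshold is included only to show how it differs.

```python
def encode_regions(spectrum, limits):
    """Keep every channel inside each (first, last) peak limit; drop the rest."""
    return [(first, list(spectrum[first:last + 1])) for first, last in limits]


def decode_regions(segments, n_channels):
    """Rebuild a pseudo-continuous spectrum, zero between the kept regions."""
    spectrum = [0] * n_channels
    for first, counts in segments:
        spectrum[first:first + len(counts)] = counts
    return spectrum


def intensity_threshold(spectrum, minimum):
    """Plain intensity thresholding, for comparison only."""
    return [count if count >= minimum else 0 for count in spectrum]


spectrum = [0] * 30
spectrum[8:13] = [1, 4, 9, 5, 1]   # peak with low-intensity wings
spectrum[20] = 1                   # isolated count between peaks
limits = [(8, 12)]                 # ASSUMPTION: limits supplied by peak detection

segments = encode_regions(spectrum, limits)
rebuilt = decode_regions(segments, len(spectrum))
thresholded = intensity_threshold(spectrum, minimum=2)

print(rebuilt[8:13], thresholded[8:13])
```

The region-based version keeps the single-count wings of the peak and so preserves its shape, while the intensity threshold clips them. Both discard the isolated count at channel 20.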
Sean Lucas
Summary
The researchers are willing, but their resources are weak
•Few solutions currently exist
•Metadata terms missing
•Proprietary file formats are a barrier
•Instrument vendor buy-in required
•Lack of awareness persists
But…
•Some low-hanging fruit
•Opportunity to make an impact
•Even Closed FAIR can still have benefits to industry
There’s lots to do, but FAIRSpectra is just getting started!
https://fairspectra.net https://alexhenderson.info