Scientific Datasets and Machine Learning Benchmarks

SamuelJackson100 11 views 9 slides Jul 17, 2024

About This Presentation

Presented progress of SCD benchmarking efforts and acted as a panel member for a wider discussion of ML benchmarking for science


Slide Content

Scientific Datasets and Machine Learning Benchmarks
Sam Jackson
Rutherford Appleton Labs, STFC

Facilities at RAL
Rutherford Appleton Laboratory
Harwell Campus, near Oxford
•The rate of scientific data production is exploding
•Traditional data processing cannot keep up
•Facilities are looking for new software/hardware solutions to keep up

Why Scientific Benchmarks?
•Scientific data is often large, complex, and challenging to work with
•General solutions are not necessarily optimal
•Provide realistic, specific & focussed test cases based on real experimental data
•To motivate exploration of new ideas & models
•To inform hardware & software choices
•Focus on end-to-end benchmarking rather than microbenchmarks
•Examples: three archetypal problems
  •Inverse problems
  •Self-supervised denoising
  •Multi-modal image segmentation

SLSTR Benchmark
9-channel input, each channel stored as a separate NetCDF file
2 resolutions
Generally converted to patches
Standard U-Net architecture
Outputs a binary mask
Sea Surface Temperature estimates
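The "convert to patches" step can be illustrated with a minimal NumPy sketch. It is not taken from the sciml-benchmarks code base: the function name `extract_patches`, the patch size and the array sizes are all hypothetical, and it assumes the 9 channels have already been loaded and resampled to one common grid, stacked as a (height, width, channels) array.

```python
import numpy as np


def extract_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) array into non-overlapping square patches.

    Rows and columns that do not fill a whole patch are dropped.
    Returns an array of shape (n_patches, patch_size, patch_size, C).
    """
    height, width, channels = image.shape
    n_rows = height // patch_size
    n_cols = width // patch_size
    # Crop to a whole number of patches in each direction.
    cropped = image[: n_rows * patch_size, : n_cols * patch_size]
    # Split each spatial axis into (patch index, offset within patch).
    tiled = cropped.reshape(n_rows, patch_size, n_cols, patch_size, channels)
    # Bring the two patch indices together, then flatten them into one axis.
    tiled = tiled.transpose(0, 2, 1, 3, 4)
    return tiled.reshape(n_rows * n_cols, patch_size, patch_size, channels)


# Hypothetical 9-channel scene; the sizes are illustrative only.
scene = np.arange(10 * 12 * 9, dtype=np.float32).reshape(10, 12, 9)
patches = extract_patches(scene, patch_size=4)
print(patches.shape)  # (6, 4, 4, 9)
```

Each patch keeps all 9 channels, so it can be fed to a segmentation network that predicts one binary mask per patch. Dropping the incomplete border patches is a simplification; padding or overlapping patches are equally plausible choices.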

SLSTR Benchmark
•Parts in orange have already been implemented.
•Parts in blue are not yet measured, but we can measure them.
•Timings depend on the system (e.g. transfer time to PEARL).
•Both parts incur a time penalty from unzipping.
Pipeline stages: Image Extraction, Training, Inference, SST Validation
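One way to time each stage of an end-to-end pipeline separately is sketched below. This is an illustration only, not the benchmark's actual instrumentation: the helper `timed_stage`, the `stage_durations` dictionary and the `time.sleep` calls standing in for real work are all assumptions made for the example.

```python
import time
from contextlib import contextmanager

stage_durations = {}  # stage name -> accumulated wall-clock seconds


@contextmanager
def timed_stage(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        stage_durations[name] = stage_durations.get(name, 0.0) + elapsed


# Placeholder workloads standing in for the real pipeline stages.
with timed_stage("extraction"):
    time.sleep(0.01)
with timed_stage("training"):
    time.sleep(0.02)
with timed_stage("inference"):
    time.sleep(0.01)

total = sum(stage_durations.values())
for name, seconds in stage_durations.items():
    print(f"{name}: {seconds:.3f} s ({100 * seconds / total:.0f}% of total)")
```

Wrapping data transfer and unzipping in their own stages would make those penalties visible in the same report rather than hiding them inside extraction or training.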

Benchmarking Code
•Code base started at: https://gitlab.stfc.ac.uk/sciml/sciml-benchmarks
•Pip installable
•Docker & Singularity support

Benchmarking Metrics
•Currently tracking:
  •Model performance: loss, accuracy, etc.
  •Time: duration, images/s
  •Host information: CPU cores, CPU utilization, memory utilization, RAM, etc.
  •GPU information: number of GPUs, utilization, power draw
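A small standard-library sketch of how a few of these metrics could be gathered into one report follows. The function names, the report keys and the toy threshold "model" are hypothetical and are not the sciml-benchmarks API. GPU utilization and power draw are left out because they need vendor tooling such as `nvidia-smi`.

```python
import os
import platform
import time


def pixel_accuracy(predicted, target):
    """Fraction of pixels on which two equally sized binary masks agree."""
    pairs = [
        (p, t)
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    return sum(p == t for p, t in pairs) / len(pairs)


def run_benchmark(predict, images, targets):
    """Time `predict` over all images and collect a small metrics report."""
    start = time.perf_counter()
    predictions = [predict(image) for image in images]
    duration = time.perf_counter() - start
    accuracies = [pixel_accuracy(p, t) for p, t in zip(predictions, targets)]
    return {
        "accuracy": sum(accuracies) / len(accuracies),
        "duration_s": duration,
        "images_per_s": len(images) / duration if duration > 0 else float("inf"),
        "cpu_cores": os.cpu_count(),
        "platform": platform.platform(),
    }


def threshold_model(image):
    """Toy stand-in for a trained network: threshold each pixel at 0.5."""
    return [[1 if value > 0.5 else 0 for value in row] for row in image]


images = [[[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.6], [0.1, 0.4]]]
targets = [[[1, 0], [0, 1]], [[1, 0], [0, 0]]]
report = run_benchmark(threshold_model, images, targets)
print(report["accuracy"])  # 0.875
```

Storing the host details next to the timing and accuracy figures keeps results from different systems comparable.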

Summary
•Scientific machine learning datasets are challenging in their scale & complexity
•Providing representative datasets & models can:
  •Motivate new solutions
  •Inform & train non-ML, non-HPC experts
  •Aid understanding of performance of new hardware to inform facility choices
  •Aid fair comparisons between models/methods/hardware/software