Scientific Datasets and Machine Learning Benchmarks
About This Presentation
Presents progress on SCD benchmarking efforts; the author also served as a panel member in a wider discussion of ML benchmarking for science
Size: 1.03 MB
Language: en
Added: Jul 17, 2024
Slides: 9 pages
Slide Content
Scientific Datasets and Machine Learning Benchmarks
Sam Jackson
Rutherford Appleton Laboratory, STFC
[email protected]
Facilities at RAL
Rutherford Appleton Laboratory
Harwell Campus, near Oxford
•The rate of scientific data generation is exploding
•Traditional data processing cannot keep up
•Facilities are looking for new software/hardware solutions to keep pace
Why Scientific Benchmarks?
•Scientific data is often large, complex, and challenging to work with
•General solutions are not necessarily optimal
•Provide realistic, specific & focused test cases based on real experimental data
•To motivate exploration of new ideas & models
•To inform hardware & software choices
•Focus on end-to-end benchmarking rather than microbenchmarks
•Examples: three archetypal problems
•Inverse problems
•Self-supervised denoising
•Multi-modal image segmentation
SLSTR Benchmark
•9-channel input, each channel a separate NetCDF file
•2 resolutions
•Generally converted to patches
•Standard U-Net architecture
•Output: a binary mask
•Sea Surface Temperature (SST) estimates
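The patching step above can be sketched as follows. This is a minimal illustration using plain NumPy in place of the real NetCDF reader; the patch size, stride, and image shape are assumptions, and only the 9-channel count comes from the slide.

```python
import numpy as np

def extract_patches(image, patch_size=32, stride=32):
    """Cut a (H, W, C) multi-channel image into (patch_size, patch_size, C) tiles."""
    h, w, _ = image.shape
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size, :])
    return np.stack(patches)

# Stand-in for 9 stacked SLSTR channels (real data comes from NetCDF files).
scene = np.random.rand(128, 128, 9)
patches = extract_patches(scene)
print(patches.shape)  # (16, 32, 32, 9): a 4x4 grid of 32x32 patches, 9 channels each
```

The resulting patch tensor is what a U-Net-style model would consume batch-wise; in practice the channels would first be loaded from the separate NetCDF files and resampled to a common resolution.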
SLSTR Benchmark
•Parts in orange have already been implemented.
•Parts in blue are not measured, but we could measure them.
•Depends on the system (e.g. transfer time to PEARL).
•Both parts incur a time penalty from unzipping.
(Figure: pipeline stages — Image Extraction → Training → Inference → SST Validation)
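End-to-end timing of the stages named above could be captured with a small harness like this. The stage names come from the slide; the timing mechanism itself is an illustrative sketch, and the `time.sleep` calls stand in for the real stage work.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record the wall-clock duration of a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Run each stage of the benchmark pipeline under the timer.
for name in ["image_extraction", "training", "inference", "sst_validation"]:
    with stage(name):
        time.sleep(0.01)  # placeholder for the real stage work

print({k: round(v, 3) for k, v in timings.items()})
```

Summing the per-stage durations gives the end-to-end figure the benchmark reports, while still exposing where the time goes (e.g. the unzipping penalty mentioned above).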
Benchmarking Code
•Code base started at: https://gitlab.stfc.ac.uk/sciml/sciml-benchmarks
•Pip installable
•Docker & Singularity support
Benchmarking Metrics
•Currently tracking:
•Model performance: loss, accuracy, etc.
•Time: duration, images/s
•Host information: CPU cores, CPU utilization, memory (RAM) utilization
•GPU information: number of GPUs, utilization, power draw
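The tracked quantities could be assembled into a single record along these lines. Field names mirror the slide; the function itself is a hypothetical sketch, and the GPU field is left as a placeholder because querying it (e.g. via NVML) is system-specific.

```python
import os

def benchmark_metrics(loss, n_images, duration_s):
    """Bundle the tracked benchmark metrics into one record."""
    return {
        "model": {"loss": loss},
        "time": {"duration_s": duration_s,
                 "images_per_s": n_images / duration_s},
        "host": {"cpu_cores": os.cpu_count()},
        "gpu": None,  # filled in where GPU telemetry is available
    }

record = benchmark_metrics(loss=0.042, n_images=1000, duration_s=12.5)
print(record["time"]["images_per_s"])  # 80.0
```

A flat, serializable record like this makes it easy to log one row per benchmark run and compare across hardware and software configurations.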
Summary
•Scientific machine learning datasets are challenging in their scale & complexity
•Providing representative datasets & models can:
•Motivate new solutions
•Inform & train non-ML, non-HPC experts
•Aid understanding of the performance of new hardware to inform facility choices
•Aid fair comparisons between models/methods/hardware/software