Alluxio-FUSE as a data access layer for Dask

Alluxio 251 views 33 slides Apr 26, 2021

About This Presentation

Alluxio Day III
April 27, 2021

Speaker:
Peter Roelants, Aspect Analytics


Slide Content

Alluxio Day III
Exploring Alluxio & Dask integration
1
2021-04-27

whoami
2
Peter Roelants
Machine Learning Engineering Lead
@Aspect Analytics

@PeterRoelants

Outline
1. Aspect Analytics
2. Use case: Mass Spectrometry Imaging
3. Dask
4. Alluxio
5. Data access via FUSE POSIX API
3

Aspect Analytics
A brief overview of Aspect Analytics.
4
more info at https://aspect-analytics.com/

5
Software company dedicated to Mass Spectrometry Imaging bioinformatics
We build software tools to support clients’ workflows (off-the-shelf and custom)
Leverage the full potential of MSI data in high-throughput settings
Beyond bioinformatics: data analysis embedded in integrated platform solution
more info at https://aspect-analytics.com/

Mass Spectrometry Imaging
What data are we working with?
6
more info at https://aspect-analytics.com/media/blog/2020-05-31-introduction-to-msi-data-analysis/

7
Mass spectrometry
●Measures the abundance of molecular weights in a sample.
●Output is a mass spectrum:
○Histogram of molecular weights in sample.
more info at https://aspect-analytics.com/media/blog/2020-05-30-introduction-to-mass-spectrometry-data-analysis/

8
Mass spectrometry imaging
Measure spatial distribution of molecular masses over a slice of tissue.

9
Mass spectrometry imaging workflow
Overlay tissue slice with virtual grid of "pixels".

10
Mass spectrometry imaging workflow
Measure mass spectrum at each "pixel".

MSI data structure: 3D tensor
500 x 500 pixels
100,000 to 1,000,000 mass bins
⇒ 100GB - 1TB per data set
11
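
The size estimate above can be checked with a short calculation. This is a sketch that assumes each intensity value is stored as a 4-byte float (float32) in a dense array; the slide does not state the data type, and `cube_size_bytes` is a hypothetical helper name.

```python
def cube_size_bytes(rows, cols, mass_bins, bytes_per_value=4):
    """Size of a dense rows x cols x mass_bins tensor.

    bytes_per_value=4 is an assumption (float32); it is not stated on the slide.
    """
    return rows * cols * mass_bins * bytes_per_value


GB = 10**9  # decimal gigabyte

small = cube_size_bytes(500, 500, 100_000)    # lower end of the mass-bin range
large = cube_size_bytes(500, 500, 1_000_000)  # upper end of the mass-bin range

print(f"{small / GB:.0f} GB to {large / GB:.0f} GB")
```

Under that assumption the two ends of the range come out at 100 GB and 1,000 GB (1 TB), matching the figures on the slide.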

Use case
12
Unsupervised analysis of mass spectral images to help with biomarker discovery.
⇒ new diagnostic tests
Illustration from "Unsupervised machine learning for exploratory data analysis in imaging mass spectrometry", Nico Verbeeck, Richard M. Caprioli, Raf Van de Plas, 2019.

Use case
more info at https://aspect-analytics.com/media/blog/2020-05-31-introduction-to-msi-data-analysis/
●Spatial localisation of biomolecules
●Region of interest analysis (images shown)
●Clinical diagnostics
13

Data challenges
14
•Interactively explore data
○Slice and subset data without loading the full data-cube into memory.
•Distributed machine learning
○Find patterns and extract features from multiple large data-cubes.
•Process huge data arrays
○Parallel processing
○Out-of-core
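
A minimal sketch of the slice-and-subset pattern listed above, assuming a chunked Dask array and using a small random cube as a stand-in for real MSI data. The shape, chunk sizes, and indices are illustrative only and do not come from the talk.

```python
import dask.array as da

# Stand-in data-cube: 50 x 50 pixels, 2,000 mass bins, split into chunks.
# A real cube would be loaded lazily from storage rather than generated.
cube = da.random.random((50, 50, 2_000), chunks=(25, 25, 500))

# All of these are lazy: nothing is computed or loaded yet.
ion_image = cube[:, :, 1234]                    # image of a single mass bin
spectrum = cube[10, 20, :]                      # mass spectrum of one pixel
roi_mean = cube[:25, :25, :].mean(axis=(0, 1))  # mean spectrum of a region

# Only the chunks needed for each result are materialised on compute().
ion_image_np = ion_image.compute()
roi_mean_np = roi_mean.compute()

print(cube.numblocks, ion_image_np.shape, roi_mean_np.shape)
```

Because the array is chunked, each `compute()` call only touches the chunks that overlap the requested slice, which is what makes interactive exploration of a cube larger than memory feasible.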

Dask
What, Why?
15

Why Dask
Dask is like Apache Spark in Python with support for distributed data arrays.


•Parallel processing of data array chunks.
•Integration with Python machine learning ecosystem.
•Integration with our existing Python algorithms.
16
more info at https://docs.dask.org/en/latest/spark.html

Why Dask
•Delayed compute that can be dynamically scheduled.
•Diagnostics dashboard.


more info at https://docs.dask.org/. Figure from http://matthewrocklin.com/slides/dask-scipy-2016.html
17
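
To illustrate delayed compute, here is a small sketch using `dask.delayed`. The functions `load_chunk`, `chunk_sum` and `combine` are hypothetical stand-ins for real loading and analysis steps.

```python
from dask import delayed


@delayed
def load_chunk(i):
    # Stand-in for reading one chunk of data from storage.
    return list(range(i * 4, (i + 1) * 4))


@delayed
def chunk_sum(values):
    # Stand-in for a per-chunk computation.
    return sum(values)


@delayed
def combine(partials):
    # Reduce the per-chunk results into one value.
    return sum(partials)


# Building the graph is cheap: no function body has run yet.
partials = [chunk_sum(load_chunk(i)) for i in range(3)]
total = combine(partials)

# The scheduler decides when and where each task runs.
result = total.compute()
print(result)
```

The per-chunk tasks are independent, so the scheduler is free to run them in parallel before the final `combine` step.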

Alluxio
Why Alluxio?
18

Why Alluxio
Data access layer
•Not Python-specific
○Our platform user application is built on Clojure.
•Standardized access via FUSE POSIX API
○More on this in later slides
•Distributed and tiered caching layer
○Download once, use multiple times
○Share between different processes and services
•Centralized access to data
○Analytics code does not need to deal with different storage implementations.
○Avoid keeping object store credentials on client services.

19

Why Alluxio
Deployable in various scenarios
•Deployable in cloud as well as on-prem
○Through Docker & Kubernetes
•Long-lived vs short-lived deployments
○Long-running Alluxio server for continuous data access (e.g. to provide data for notebook server)
○Short-term Alluxio deployment for ad-hoc computations (e.g. to run a set of analyses on a new dataset)
•Integrate in automated testing.
20

Dask & Alluxio
Using Alluxio as a data access layer for Dask.
21

Dask & Alluxio
25
●Dataset access to Dask is provided via Alluxio FUSE
●Alluxio worker only loads data that is required locally
●Alluxio worker keeps data in cache
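
Since the FUSE mount looks like an ordinary directory, a task can read just the byte range it needs with plain file calls. The sketch below uses a temporary directory as a stand-in for the mount point; the file name and `read_byte_range` helper are hypothetical. How much data Alluxio fetches and caches behind such a read depends on its configuration.

```python
import os
import tempfile


def read_byte_range(path, offset, length):
    """Read `length` bytes starting at `offset` using standard POSIX file calls."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for an Alluxio FUSE mount point (a real one is set up by Alluxio).
with tempfile.TemporaryDirectory() as mount_dir:
    path = os.path.join(mount_dir, "dataset.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(bytes(range(256)) * 4)  # 1 KiB of dummy data

    # A worker reads only the part of the file its task needs.
    chunk = read_byte_range(path, offset=256, length=8)

print(list(chunk))
```

The same function works unchanged on a local disk, a network share, or a FUSE mount, which is the point of going through the POSIX API.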

Alluxio FUSE
Data access via a POSIX API.
26

27
FUSE
Filesystem in Userspace (FUSE)

Filesystem:
●Expose virtual files
●POSIX filesystem API
Userspace:
●Refers to all code that is run by the user (outside the operating system's kernel).

FUSE allows creating filesystems without modifying OS kernel code.

28
Share Alluxio-FUSE via Bind-Mount
●Each service has its own Docker environment.

30
Share Alluxio-FUSE via Bind-Mount
●Each service has its own Docker environment.
●FUSE Filesystem connects Alluxio with Analytics platform via a bind-mount.

32
Some anecdotal results
•We have custom Alluxio containers to reduce image size.
•It takes 30s to 1 min to spin up the Alluxio services with FUSE.
•Dask reading from S3 through FUSE (without caching):
○30% slower compared to the native Dask S3 integration.
•Reading large files with Dask from local Alluxio cache:
○10x speedup compared to reading from S3 each time.
•Enabling FUSE kernel caching gave another 3x speedup when reading.
more info at https://docs.alluxio.io/os/user/stable/en/api/POSIX-API.html#:~:text=Tuning%20mount%20options
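
The anecdotal figures above can be put on one scale with some rough arithmetic. This sketch makes several assumptions that the slide does not confirm: "30% slower" is read as 30% more wall-clock time, the 10x speedup is taken relative to the native S3 read time, and the 3x kernel-caching speedup is assumed to multiply with the 10x. These are not additional measurements.

```python
import math

s3_native_time = 1.0  # normalised read time via the native Dask S3 integration

# Assumption: "30% slower" means 30% more time than the native S3 read.
fuse_uncached_time = s3_native_time * 1.30

# Assumption: the two speedups multiply.
cache_speedup = 10         # reading from local Alluxio cache
kernel_cache_speedup = 3   # with FUSE kernel caching enabled
combined_speedup = cache_speedup * kernel_cache_speedup

cached_time = s3_native_time / cache_speedup
kernel_cached_time = s3_native_time / combined_speedup

print(f"uncached FUSE: {fuse_uncached_time:.2f}x native S3 time")
print(f"Alluxio cache: {cached_time:.2f}x native S3 time")
print(f"+ kernel cache: {kernel_cached_time:.3f}x native S3 time")
```

Under these assumptions, repeated reads with both caches enabled would take roughly one thirtieth of the native S3 read time, while a first uncached read through FUSE costs about 1.3 times as much.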