Building a Cloud Native LSM on Object Storage

ScyllaDB · 27 slides · Oct 14, 2024

About This Presentation

Excited to introduce SlateDB, an open-source, cloud-native storage engine. Built as an LSM on object stores like S3/GCS/ABS, it leverages object storage benefits while tackling unique latency and cost challenges. Join us to explore our design decisions and tradeoffs. #DevTalk #SlateDB


Slide Content

A ScyllaDB Community
Building a Cloud Native LSM on
Object Storage
Chris Riccomini
Materialized View Capital
Rohan Desai
Responsive

Chris Riccomini (he/him)
GP at Materialized View Capital
■Investor: Materialized View Capital
■Engineer: ex-LinkedIn, ex-WePay
■Open Source: SlateDB, Apache Samza, Apache Airflow
■Writing: The Missing README, Materialized View

Rohan Desai (he/him)
Co-Founder at Responsive
■Co-Founder: Responsive
■Engineer: ex-Confluent, ex-Yahoo
■Open Source: SlateDB, Responsive, ksqlDB

Let’s talk about SlateDB.
■Backstory
■Overview
■Architecture
■Performance

The Plan

Backstory

Rise of Object Storage
(Chris is an investor)

Latency, Cost, Durability: Pick Two
https://bsky.app/profile/chris.blue/post/3kqipq5bfos2k

The Cloud Storage Triad

The Cloud Storage Triad
https://materializedview.io/p/cloud-storage-triad-latency-cost-durability

We believe that the future of object storage is multi-region, low-latency buckets that support atomic CAS operations. Inspired by The Cloud Storage Triad: Latency, Cost, Durability, we set out to build a storage engine built for the cloud. SlateDB is that storage engine.
https://slatedb.io/docs/introduction

Overview

A cloud native embedded storage engine built on object storage.
■In-process (Rust) library
■Key-Value interface
■All writes go to object storage
■Implemented as a log-structured merge-tree (LSM)


What is SlateDB?
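To make "in-process library with a key-value interface" concrete, here is a minimal usage sketch. The crate paths and method signatures below are assumptions for illustration; see slatedb.io and the GitHub README for the actual API.

```rust
// Minimal usage sketch of an embedded, object-store-backed KV engine.
// Crate paths and signatures are illustrative assumptions, not a verbatim
// copy of the SlateDB API.
use std::sync::Arc;

use object_store::{memory::InMemory, path::Path, ObjectStore};
use slatedb::db::Db; // assumed module path

#[tokio::main]
async fn main() {
    // Any object_store backend works: S3, GCS, ABS, or in-memory for tests.
    let store: Arc<dyn ObjectStore> = Arc::new(InMemory::new());

    // Open the database under a prefix inside the bucket.
    let db = Db::open(Path::from("demo/db"), store).await.expect("open");

    // put() resolves only once the WAL SST holding this write has been
    // flushed to object storage -- hence the ~50-100ms write latency.
    db.put(b"user:42", b"alice").await;

    // get() consults the memtables first, then L0 SSTs and sorted runs.
    let value = db.get(b"user:42").await.expect("get");
    assert!(value.is_some());

    db.close().await.expect("close");
}
```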

■Zero-disk architecture
■Single-writer
■Multi-reader
■Read caching
■Writer fencing
■Snapshot isolation ᵗᵒᵈᵒ
■Transactions ᵗᵒᵈᵒ
■Pluggable compaction ᵗᵒᵈᵒ

Features
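Single-writer plus writer fencing is what keeps a zombie writer from corrupting state after a new writer takes over. Below is a hedged sketch of the general epoch-plus-CAS idea the intro alludes to; the names and structure are illustrative only, not SlateDB's actual implementation: a new writer bumps an epoch in the manifest with a conditional write, and anything still holding the old epoch is fenced.

```rust
// Sketch of epoch-based writer fencing over an object store's
// compare-and-swap (conditional put). Illustrative only.
struct Manifest {
    writer_epoch: u64,
    version: u64, // stand-in for the object version / etag used by the CAS
}

trait ManifestStore {
    /// Read the current manifest together with its version tag.
    fn read(&self) -> Manifest;
    /// Write `new` only if the stored version still equals `expected_version`.
    /// Returns false if another writer updated the manifest first.
    fn compare_and_swap(&self, expected_version: u64, new: Manifest) -> bool;
}

/// Claim writership: bump the epoch with a CAS, retrying on races.
fn fence_previous_writer(store: &dyn ManifestStore) -> u64 {
    loop {
        let current = store.read();
        let my_epoch = current.writer_epoch + 1;
        let new = Manifest { writer_epoch: my_epoch, version: current.version + 1 };
        if store.compare_and_swap(current.version, new) {
            // From here on, any writer holding an older epoch must treat
            // itself as fenced and stop writing.
            return my_epoch;
        }
        // Lost the race; re-read and try again.
    }
}
```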

SlateDB is designed for use cases that are tolerant of 50-100ms write latency, are tolerant of data loss during failure, or are willing to pay for frequent API PUT calls.
■Stream processing
■Serverless functions
■Durable execution
■Workflow orchestration
■Durable caches
■Data lakes
■Online transaction processing ᵗᵒᵈᵒ
Use Cases

Architecture

Block Diagram
[Diagram] put(k, v) lands in the in-memory mutable WAL (frozen into an immutable WAL) and the mutable memtable (frozen into a frozen memtable); get(k) can serve uncommitted reads straight from these in-memory tables. Object storage holds three prefixes: wal/ with counter-named WAL SSTs (e.g. 00000000000000000073.sst), compacted/ with ULID-named SSTs organized into L0 and sorted runs such as SR1 (e.g. 01J53ZKSXP1MCCPENTTFXTQ6HS.sst), and manifest/ with counter-named manifests (e.g. 00000000000000000000.manifest).

Write Path
[Diagram] put(k, v) goes to the in-memory mutable WAL; every flush_ms it is frozen into an immutable WAL and written under wal/ as a counter-named SST (e.g. 00000000000000000073.sst). Once the memtable reaches l0_sst_size_bytes it is frozen and flushed under compacted/ as a ULID-named SST (e.g. 01J53ZKSXP1MCCPENTTFXTQ6HS.sst).

Read Path
[Diagram] get(k) is served from the in-memory WAL and memtables first (uncommitted reads), then from the L0 SSTs and the sorted run SR1 under compacted/ in object storage.

Compactor
[Diagram] The writer produces ULID-named L0 SSTs; the compactor's orchestrator runs a pluggable scheduler (which decides what to compact) and a pluggable executor (which performs the compactions), merging L0 SSTs and sorted runs (SR1) into new sorted runs (SR2). The writer sends db updates, the orchestrator tracks compactions and their status, and writer and compactor coordinate through the manifest, which each reads and writes; the compactor reads source SSTs and writes the compacted output.
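The diagram marks the scheduler and executor as pluggable. A hedged, trait-level sketch of what that split could look like follows; the names are illustrative, not the real SlateDB traits: the scheduler decides which SSTs and sorted runs to merge, and the executor performs the merge.

```rust
// Illustrative sketch of a pluggable compaction split (not the real traits).
/// A unit of work: merge these source SSTs / sorted runs into one sorted run.
struct Compaction {
    sources: Vec<String>, // ids of L0 SSTs and sorted runs to merge
    destination: u32,     // id of the sorted run to produce
}

/// Decides *when* and *what* to compact, based on the current DB state.
trait CompactionScheduler {
    fn maybe_schedule(&self, l0_ssts: &[String], sorted_runs: &[u32]) -> Vec<Compaction>;
}

/// Performs the merge, e.g. streaming SSTs from object storage,
/// merge-sorting them, and writing the new sorted run back.
trait CompactionExecutor {
    fn execute(&self, compaction: Compaction);
}

/// A trivial scheduler: compact whenever there are too many L0 SSTs.
struct SizeTieredLike {
    max_l0: usize,
}

impl CompactionScheduler for SizeTieredLike {
    fn maybe_schedule(&self, l0_ssts: &[String], sorted_runs: &[u32]) -> Vec<Compaction> {
        if l0_ssts.len() >= self.max_l0 {
            let next_sr = sorted_runs.iter().max().copied().unwrap_or(0) + 1;
            vec![Compaction { sources: l0_ssts.to_vec(), destination: next_sr }]
        } else {
            vec![]
        }
    }
}
```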

Want More?
https://github.com/slatedb/slatedb/blob/main/rfcs

Performance

Fencing Simulator
A little simulator to test the fencing protocol. Just a test, early days, no tuning, YMMV, etc.
■Instance: t2.2xlarge (us-east-1)
■Bucket: us-east-1
■Configuration: 1KiB write payload, flush_ms 5ms
■Latency
●Mean: 40.44ms
●Median: 36ms
●99th percentile (p99): 67ms
●Minimum: 28ms
●Maximum: 67ms
https://github.com/slatedb/simluator/issues/1

Benchmark to test compaction speed of a single compaction step (compacting multiple SSTs/SRs to 1 SR)
■Instance: m5.xlarge (4 cores, 16GB RAM, 1.25Gbit baseline network, us-west-2)
■Bucket: us-west-2
■Configuration: 32 1GB SSTs to 1 SR, max_sst_size 1GB
■Duration: 302,491ms => 864Mbps / 108MBps (see the check below)
■Utilization: 1.5 cores
■With 2 parallel compactions we can fully utilize the available network
Compaction Bench
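As a rough sanity check on those throughput numbers (treating each SST as ~1,024MB): 32 × 1,024MB ≈ 32,768MB moved in ~302.5s is about 108MB/s, or ×8 ≈ 864Mbit/s, which matches the reported figures and sits at roughly 70% of the instance's 1.25Gbit baseline, which is why a second parallel compaction can soak up the remaining bandwidth.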

Get Started!
slatedb.io
github.com/slatedb/slatedb

Thank you! Let’s connect.
Chris Riccomini
@criccomini
linkedin.com/in/riccomini
materializedview.io
Rohan Desai
@_RohanDesai
linkedin.com/in/rohanpd

Addendum

Write Path
■Call put on the client
■Write to the mutable, in-memory WAL table
■After flush_ms milliseconds
●Freeze the mutable WAL into an immutable WAL
●Asynchronously write the immutable WAL to object storage
■On WAL write success
●Merge the flushed WAL table into the mutable memtable
●Notify all await'ing writers
●If the memtable is >= l0_sst_size_bytes, freeze it and write it as an L0 SSTable in the object store
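A simplified, hedged sketch of that flow; the type and function names are illustrative, and object-store I/O and writer wakeups are stubbed out.

```rust
use std::collections::BTreeMap;

// Hedged sketch of the write path described above; illustrative only.
type Table = BTreeMap<Vec<u8>, Vec<u8>>;

struct Db {
    wal: Table,      // mutable, in-memory WAL table
    memtable: Table, // mutable memtable
    l0_sst_size_bytes: usize,
}

impl Db {
    fn put(&mut self, key: &[u8], value: &[u8]) {
        // Writes land in the mutable WAL table first.
        self.wal.insert(key.to_vec(), value.to_vec());
    }

    // Called by a background task every flush_ms milliseconds.
    fn flush_tick(&mut self) {
        // Freeze the mutable WAL into an immutable WAL.
        let immutable_wal = std::mem::take(&mut self.wal);
        // Write the immutable WAL to object storage as wal/<counter>.sst.
        write_wal_sst(&immutable_wal);
        // On success: merge it into the memtable and notify awaiting writers,
        // so their put() calls resolve only after the write is durable.
        self.memtable.extend(immutable_wal);
        notify_writers();
        // If the memtable is large enough, freeze it and write an L0 SST.
        if approx_size(&self.memtable) >= self.l0_sst_size_bytes {
            let frozen = std::mem::take(&mut self.memtable);
            write_l0_sst(&frozen); // compacted/<ulid>.sst
        }
    }
}

// Stubs standing in for real object-store writes and writer wakeups.
fn write_wal_sst(_t: &Table) {}
fn write_l0_sst(_t: &Table) {}
fn notify_writers() {}
fn approx_size(t: &Table) -> usize {
    t.iter().map(|(k, v)| k.len() + v.len()).sum()
}
```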

Read Path
■Call get on the client
■Look for the key in order of…
●Mutable memtable
●Immutable memtable
●L0 SSTables (newest to oldest, using bloom filtering)
●Sorted runs (newest to oldest, using bloom filtering)
■Return the first value found, or none if the key doesn't exist or a deletion tombstone is found first
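And a hedged sketch of that lookup order; the types and the empty-value tombstone convention are illustrative assumptions, and the SST stubs stand in for bloom-filter checks plus block reads against object storage.

```rust
use std::collections::BTreeMap;

// Stand-in for an SST or sorted run reachable in object storage.
struct Sst;
impl Sst {
    fn bloom_might_contain(&self, _key: &[u8]) -> bool { true }
    fn read(&self, _key: &[u8]) -> Option<Vec<u8>> { None }
}

struct ReadState {
    memtable: BTreeMap<Vec<u8>, Vec<u8>>,
    frozen_memtable: BTreeMap<Vec<u8>, Vec<u8>>,
    l0_ssts: Vec<Sst>,     // newest first
    sorted_runs: Vec<Sst>, // newest first
}

fn get(db: &ReadState, key: &[u8]) -> Option<Vec<u8>> {
    // 1. Mutable memtable, then immutable (frozen) memtable: newest data wins.
    for table in [&db.memtable, &db.frozen_memtable] {
        if let Some(v) = table.get(key) {
            return strip_tombstone(v);
        }
    }
    // 2. L0 SSTables, newest to oldest, skipping SSTs whose bloom filter
    //    says the key cannot be present.
    for sst in &db.l0_ssts {
        if sst.bloom_might_contain(key) {
            if let Some(v) = sst.read(key) {
                return strip_tombstone(&v);
            }
        }
    }
    // 3. Sorted runs, newest to oldest, with the same bloom-filter skip.
    for sr in &db.sorted_runs {
        if sr.bloom_might_contain(key) {
            if let Some(v) = sr.read(key) {
                return strip_tombstone(&v);
            }
        }
    }
    // 4. Not found anywhere.
    None
}

// Assumed convention: a deletion is stored as an empty value (tombstone),
// so a lookup that hits one returns None instead of falling through.
fn strip_tombstone(v: &[u8]) -> Option<Vec<u8>> {
    if v.is_empty() { None } else { Some(v.to_vec()) }
}
```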