Overcoming Distributed Databases Scaling Challenges with Tablets

ScyllaDB, Oct 15, 2024 (47 slides)

About This Presentation

Maximizing performance goes beyond server-level tweaks, and even beyond low-level code: scaling requires more. In this session, learn about "tablets", a dynamic sharding design in ScyllaDB that optimizes CPU, storage, and elasticity for top-notch performance. #Database #ScyllaDB


Slide Content

A ScyllaDB Community
Overcoming Distributed Databases
Scaling Challenges with Tablets
Dor Laor
Co-founder & CEO

Dor Laor

Co-founder of ScyllaDB
■Something cool I’ve done: Dyed my hair
■My perspective on P99s: Less is more
■Another thing about me: Own a PhD in Snowboarding
■What I do away from work: Work remotely

Low p99 latency algorithm:
#1: write code
#2: Optimize
#3: if (fast) break
goto #2

Algorithm #2: rewrite in assembly
internal::memory_prefaulter::work(...) {
    auto fault_in_memory = [] (char* p) {
        // Touch the page for write, but be sure not to modify anything
        // The compilers tend to optimize things, so prefer assembly
        asm volatile ("lock orb $0, %0" : "+m"(*p));
    };
    while (start < end) {
        fault_in_memory(start);
        start += page_size;
    }
}

Algorithm #3: Thread-per-core architecture
[Diagram: shards mapped to threads]

Profit!

Most times, |server| > 1
Hmm, so local optimizations matter more!

Sometimes, |server| >> 1

Let’s Shard

Redis Shards

Cassandra Token Ring [2008]
■Each node has one token range
■Token ranges are assigned randomly on node creation
■A new node splits an existing range
■Replica sets are determined algorithmically
[Diagram: token ring with nodes A, C, D, F, highlighting an owned range and a replica set]

Advantages
■Nodes can be added with no central coordination

Disadvantages
■Nodes cannot be added without central coordination
■Can only join/decommission one node at a time
■Bad data distribution
●./shardsim --nodes 12 --vnodes 1 --shards 1
maximum node overcommit: 3.42607
●Support for manual balancing by changing tokens
■New nodes stream from only three neighbors
●Generates uneven CPU and disk load during a join
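
shardsim is ScyllaDB's own simulator; as a rough, hypothetical stand-in for what its "maximum node overcommit" figure measures, the sketch below places random tokens on a ring and reports how far the most-loaded node exceeds its fair 1/nodes share. All names and parameters here are mine, not shardsim's.

// Hypothetical stand-in for shardsim (not the real tool): place `vnodes`
// random tokens per node on the ring and report how much the most-loaded
// node exceeds its fair 1/nodes share of the token space.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

int main() {
    const unsigned nodes = 12, vnodes = 1;   // try vnodes = 256 as well
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<uint64_t> dist;

    // (token, owning node) pairs, sorted by token to form the ring.
    std::vector<std::pair<uint64_t, unsigned>> ring;
    for (unsigned n = 0; n < nodes; ++n) {
        for (unsigned v = 0; v < vnodes; ++v) {
            ring.emplace_back(dist(rng), n);
        }
    }
    std::sort(ring.begin(), ring.end());

    // Each token owns the range that ends at it and starts at the previous
    // token (wrapping around the ring); sum each node's owned fraction.
    std::vector<double> owned(nodes, 0.0);
    const double full = std::pow(2.0, 64);
    for (std::size_t i = 0; i < ring.size(); ++i) {
        uint64_t prev = ring[(i + ring.size() - 1) % ring.size()].first;
        uint64_t width = ring[i].first - prev;   // unsigned wrap-around is intended
        owned[ring[i].second] += width / full;
    }

    double max_share = *std::max_element(owned.begin(), owned.end());
    std::cout << "maximum node overcommit: " << max_share * nodes << "\n";
}

With one token per node the overcommit typically lands around 3 (it varies with the seed); raising vnodes to 256 pulls it close to 1, which is the effect the shardsim figures here and on the next slides illustrate.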

Cassandra Virtual Nodes (vnodes, 2012)
■Each node has N (typically 32 or 256) randomly assigned tokens
[Diagram: token ring where node A appears at multiple positions; an owned range and a replica set are highlighted]

Advantages
■Better data distribution
$ ./shardsim --nodes 12 --vnodes 256 --shards 1
maximum node overcommit: 1.12073

Disadvantages
■Many more replica sets
●Scans of small tables have to traverse 256*nr_nodes ranges
●Overhead can dominate scan time for near-empty tables
●Two node failures: good likelihood of an availability issue in a single-zone deployment

■Even worse for Cassandra secondary indexes (that are local to the node)

ScyllaDB Architecture: vnodes & shard per core
Homogeneous nodes, ring architecture

2014 - ScyllaDB 0.1: shards
■Introduce an orthogonal distribution, internal to the node
■Different nodes can have different shard counts
[Diagram: each node's owned range and replica set subdivided among shards 0, 1, and 2]

Advantages
■Per-shard data ownership removes need for locks

Disadvantages
■Poor data distribution
●$ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 0
12 nodes, 256 vnodes, 30 shards
maximum node overcommit: 1.11148
maximum shard overcommit: 2.512377
■Poor scan performance over near-empty tables
●Need to scan each vnode, then each shard’s intersection in the vnode

2016 - ScyllaDB 1.6 - murmur3_ignore_msb_bits
■Split the token range into 4096*nr_shard ranges
■Each shard owns 4096 ranges
[Diagram: owned range and replica set, with each range subdivided into many small sub-ranges cycling through shards 0, 1, and 2]
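
To make the ignore-msb idea concrete, here is a simplified, illustrative token-to-shard mapping in that spirit: bias the signed token, discard the top ignore_msb bits, then scale linearly into the shard count. The real ScyllaDB code differs in details; the names below are mine.

// Simplified sketch of a token-to-shard mapping that ignores the top bits of
// the token, so each shard ends up owning many small, evenly sized sub-ranges.
// Not the exact ScyllaDB code; the structure is illustrative.
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <limits>

using uint128_t = unsigned __int128;  // GCC/Clang extension

unsigned shard_of(int64_t token, unsigned shards, unsigned ignore_msb) {
    // Bias the signed murmur3 token into an unsigned, zero-based ring position.
    uint64_t pos = uint64_t(token) - uint64_t(std::numeric_limits<int64_t>::min());
    // Discard the most significant bits so nearby sub-ranges rotate over shards.
    pos <<= ignore_msb;
    // Scale the 64-bit position linearly into [0, shards).
    return unsigned((uint128_t(pos) * shards) >> 64);
}

int main() {
    // With 30 shards and ignore_msb = 12, tokens only ~2^48 apart already land
    // on different shards instead of one shard owning a huge contiguous run.
    for (int64_t token : {int64_t(0), int64_t(1) << 48, int64_t(1) << 49}) {
        std::cout << "token " << token << " -> shard "
                  << shard_of(token, 30, 12) << "\n";
    }
}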

Advantages
■Better data distribution
●$ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 12
12 nodes, 256 vnodes, 30 shards
maximum node overcommit: 1.11791
maximum shard overcommit: 1.140437

Disadvantages
■Even worse small-table performance
●60 shards * 4096 per-shard subranges * 30 nodes * 256 vnodes = ???

■Terrible repair/streaming with mismatched shard counts
■Complexity, bugs

Hello Tablets

Tablet History: ClayDB > Bigtable > ScyllaDB

ScyllaDB Tablet
■A small range of keys (tokens)
■Size of 5GB
■Dynamic, per table
■Dynamically assigned to nodes
■A full LSM tree {compaction, sstable files, memtable}
■Small tables fit in a single tablet -> great scan/index performance

The Tablets Table
// Managed by the Raft protocol
// Replicated on every node

CREATE TABLE system.tablets (
    table_id uuid,
    last_token bigint,
    keyspace_name text static,
    replicas list<tuple<uuid, int>>,
    new_replicas list<tuple<uuid, int>>,
    session uuid,
    stage text,
    transition text,
    table_name text static,
    tablet_count int static,
    PRIMARY KEY (table_id, last_token)
);
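
To make the schema concrete: conceptually, each row's last_token is the upper bound of that tablet's token range, so resolving a key amounts to a binary search within one table_id partition. A sketch with illustrative names (not ScyllaDB's actual classes):

// Conceptual model of tablet lookup: each tablet covers the token range
// ending at its last_token, so routing a key means finding the first entry
// whose last_token is >= the key's token. Names are illustrative only.
#include <algorithm>
#include <cstdint>
#include <vector>

struct replica {
    uint64_t host_id_hi, host_id_lo;   // stand-in for the uuid in the schema
    int shard;                         // the int half of the <uuid, int> tuple
};

struct tablet_entry {
    int64_t last_token;                // upper bound of this tablet's range
    std::vector<replica> replicas;     // current replica set for the tablet
};

// `tablets` is the per-table list, kept sorted by last_token just like the
// clustering order within one table_id partition of system.tablets.
const tablet_entry& tablet_for_token(const std::vector<tablet_entry>& tablets,
                                     int64_t token) {
    auto it = std::lower_bound(tablets.begin(), tablets.end(), token,
        [](const tablet_entry& t, int64_t tok) { return t.last_token < tok; });
    // The last tablet's last_token is the ring maximum, so `it` never falls
    // off the end for a valid token.
    return *it;
}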

[Diagram: the tablets table is replicated via Raft (Group 0); key1 and key2 map to tablets, and each tablet maps to its tablet replicas]

Tablets - balancing
A table starts with a few tablets.

Small tables end there:
not fragmented into tiny pieces like with tokens.

Tablets - balancing
When a tablet becomes too heavy
(disk, CPU, …) it is split.

Tablets - balancing
The load balancer can decide to move tablets.

Tablets
Resharding is cheap.
SSTables are split at tablet boundaries.
Reassigning tablets to shards is a logical operation.

Tablets
Cleanup after topology change is cheap.
Just delete the SSTables.

Tablet Scheduler
The scheduler globally controls movement and maintenance operations on a per-tablet basis.
[Diagram: the scheduler schedules repair, migration, and backup operations for tablet 0 and tablet 1]

Tablet Scheduler
Goals:
■Maximize throughput (saturate)
■Keep migrations short (don’t overload)

Rules:
■migrations-in <= 2 per shard
■migrations-out <= 4 per shard
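
As an illustration of what such rules mean in practice, here is a toy admission loop that starts candidate migrations only while every shard stays under its caps. This is not ScyllaDB's scheduler, just a sketch of the constraint with made-up names.

// Toy admission loop honoring per-shard concurrency caps like the rules above
// (at most 2 incoming and 4 outgoing migrations per shard). Purely
// illustrative; not ScyllaDB's actual scheduler.
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct shard_id {
    uint32_t node, shard;
    bool operator<(const shard_id& o) const {
        return std::tie(node, shard) < std::tie(o.node, o.shard);
    }
};

struct migration { uint64_t tablet; shard_id from, to; };

constexpr int max_in_per_shard  = 2;
constexpr int max_out_per_shard = 4;

// Walk candidate moves in priority order and start any move that keeps every
// shard under its caps; the rest wait for a later scheduling round.
std::vector<migration> schedule(const std::vector<migration>& candidates,
                                std::map<shard_id, int>& in_flight_in,
                                std::map<shard_id, int>& in_flight_out) {
    std::vector<migration> started;
    for (const auto& m : candidates) {
        if (in_flight_in[m.to] >= max_in_per_shard ||
            in_flight_out[m.from] >= max_out_per_shard) {
            continue;
        }
        ++in_flight_in[m.to];
        ++in_flight_out[m.from];
        started.push_back(m);
    }
    return started;
}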

Sharding Considerations
■Data distribution/fairness -> Perfect
■Mixed node sizes, add/remove 5% in size
■Multi-node operations - double capacity in minutes
■Number of node streams to/from a single node
■Scanning is efficient
■Small tables are fast & efficient

Driver Support
■Drivers are tablet-aware
■But without reading the tablets table
●The driver contacts a random node/shard
●On a miss, it gets updated routing information for that tablet
●Ensures fast start-up even with 100,000 entries in the tablets table
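
A conceptual sketch of this learn-on-miss routing, with hypothetical names rather than a real driver API: the cache starts empty, the driver sends to a random node on a miss, and the routing info returned with the reply fills the cache for next time.

// Conceptual "learn on miss" routing cache (illustrative only, not a real
// driver API): nothing is preloaded, entries are filled lazily from routing
// info returned together with a reply that missed the right node.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct host { uint32_t id; };

class tablet_routing_cache {
    // last_token of a cached tablet -> its replicas (single table for brevity;
    // a real cache would also track where each tablet's range starts).
    std::map<int64_t, std::vector<host>> by_last_token_;
public:
    // Cached replicas covering `token`, or nullopt: on a miss the driver just
    // sends the request to a random node/shard and waits to be corrected.
    std::optional<std::vector<host>> lookup(int64_t token) const {
        auto it = by_last_token_.lower_bound(token);
        if (it == by_last_token_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Called when a response carries fresher routing info for one tablet.
    void learn(int64_t last_token, std::vector<host> replicas) {
        by_last_token_[last_token] = std::move(replicas);
    }
};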

Changes in the Data Plane
■Fencing - each write is signed with the topology version
■If there is a version mismatch, the write doesn't go through
[Sequence diagram between Coordinator, Replica, and Topology coordinator: requests carry Req(V1), the topology coordinator issues Fence(V1) and Drain(V1), and subsequent requests are sent as Req(V2)]
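
A minimal sketch of the fencing check, assuming writes carry the topology version they were planned under and a replica rejects anything older than the version it has been fenced to (names are illustrative, not ScyllaDB's):

// Minimal fencing sketch: writes carry the topology version they were planned
// under; once the replica is fenced to version V, anything older is rejected
// so in-flight requests from a stale topology cannot land after a migration.
// Illustrative only.
#include <cstdint>
#include <stdexcept>

struct write_request {
    uint64_t topology_version;  // version the coordinator signed the write with
    // ... payload ...
};

class replica {
    uint64_t fence_version_ = 0;  // lowest version still accepted
public:
    // The topology coordinator raises the fence before moving a tablet away.
    void fence(uint64_t version) { fence_version_ = version; }

    void apply(const write_request& req) {
        if (req.topology_version < fence_version_) {
            // The coordinator will retry with refreshed routing and a new version.
            throw std::runtime_error("stale topology version, write rejected");
        }
        // ... apply the write ...
    }
};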

Add node
[Diagram: existing nodes hold the tablets and keep serving reads/writes; a new node joins]

Add node - stream first tablet
[Diagram: the first tablet is streamed to the new node]

Add node - start serving
[Diagram: the new node starts serving reads/writes for the streamed tablet]

Demo time

Not a demo, a test suite

Tablets => Serverless
[Charts: data volume over time by instance type (i3en, i4i); throughput vs. required capacity over time; on-demand vs. base throughput over time]
Typeless, sizeless, limitless

Serverless tests (even more)

Thank you! Let’s connect.
@DorLaor