Overcoming Distributed Databases Scaling Challenges with Tablets

ScyllaDB, Oct 15, 2024 (47 slides)

About This Presentation

Maximizing performance goes beyond server-level tweaks, and even beyond low-level code: scaling requires more. In this session, learn about "tablets", a dynamic sharding design in ScyllaDB that optimizes CPU, storage, and elasticity for top-notch performance. #Database #ScyllaDB


Slide Content

A ScyllaDB Community
Overcoming Distributed Databases
Scaling Challenges with Tablets
Dor Laor
Co-founder & CEO

Dor Laor

Co-founder of ScyllaDB
■Something cool I’ve done: Dyed my hair
■My perspective on P99s: Less is more
■Another thing about me: Own a PhD in Snowboarding
■What I do away from work: Work remotely

Low p99 latency algorithm:
#1: write code
#2: Optimize
#3: if (fast) break
goto #2

Algorithm #2: rewrite in assembly
internal::memory_prefaulter::work(...) {
    auto fault_in_memory = [] (char* p) {
        // Touch the page for write, but be sure not to modify anything
        // The compilers tend to optimize things, so prefer assembly
        asm volatile ("lock orb $0, %0" : "+m"(*p));
    };
    while (start < end) {
        fault_in_memory(start);
        start += page_size;
    }
}

Algorithm #3: Thread-per-core architecture
[Diagram: shards mapped to threads]

Profit!

Most times, |server| > 1
Hmm, so local optimizations matter more!

Sometimes, |server| >> 1

Let’s Shard

Redis Shards

Cassandra Token Ring [2008]
■Each node has one token range
■Token ranges are assigned randomly on node creation
■A new node splits an existing range
■Replica sets are determined algorithmically
[Diagram: token ring with nodes A, C, D, F, highlighting an owned range and a replica set]

Advantages
■Nodes can be added with no central coordination

Disadvantages
■Nodes cannot be added without central coordination
■Can only join/decommission one node at a time
■Bad data distribution
●./shardsim --nodes 12 --vnodes 1 --shards 1
maximum node overcommit: 3.42607
●Support for manual balancing by changing tokens
■New nodes stream from only three neighbors
●Generates uneven CPU and disk load during a join
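
shardsim is ScyllaDB's own simulator; as a rough, hypothetical stand-in for what its "maximum node overcommit" figure measures, the sketch below places random tokens on a ring and reports how far the most-loaded node exceeds its fair 1/nodes share. All names and parameters here are mine, not shardsim's.

// Hypothetical stand-in for shardsim (not the real tool): place `vnodes`
// random tokens per node on the ring and report how much the most-loaded
// node exceeds its fair 1/nodes share of the token space.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

int main() {
    const unsigned nodes = 12, vnodes = 1;   // try vnodes = 256 as well
    std::mt19937_64 rng(42);
    std::uniform_int_distribution<uint64_t> dist;

    // (token, owning node) pairs, sorted by token to form the ring.
    std::vector<std::pair<uint64_t, unsigned>> ring;
    for (unsigned n = 0; n < nodes; ++n) {
        for (unsigned v = 0; v < vnodes; ++v) {
            ring.emplace_back(dist(rng), n);
        }
    }
    std::sort(ring.begin(), ring.end());

    // Each token owns the range that ends at it and starts at the previous
    // token (wrapping around the ring); sum each node's owned fraction.
    std::vector<double> owned(nodes, 0.0);
    const double full = std::pow(2.0, 64);
    for (std::size_t i = 0; i < ring.size(); ++i) {
        uint64_t prev = ring[(i + ring.size() - 1) % ring.size()].first;
        uint64_t width = ring[i].first - prev;   // unsigned wrap-around is intended
        owned[ring[i].second] += width / full;
    }

    double max_share = *std::max_element(owned.begin(), owned.end());
    std::cout << "maximum node overcommit: " << max_share * nodes << "\n";
}

With one token per node the overcommit typically lands around 3 (it varies with the seed); raising vnodes to 256 pulls it close to 1, which is the effect the shardsim figures here and on the next slides illustrate.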

Cassandra Virtual Nodes (vnodes, 2012)
■Each node has N (typically 32 or 256) randomly assigned tokens
[Diagram: token ring where node A appears at multiple positions; an owned range and a replica set are highlighted]

Advantages
■Better data distribution
$ ./shardsim --nodes 12 --vnodes 256 --shards 1
maximum node overcommit: 1.12073

Disadvantages
■Many more replica sets
●Scans of small tables have to traverse 256*nr_nodes ranges
●Overhead can dominate scan time for near-empty tables
●Two node failures: good likelihood of an availability issue in a single-zone deployment

■Even worse for Cassandra secondary indexes (that are local to the node)

ScyllaDB Architecture: vnodes & shard per core
Homogeneous nodes, ring architecture

2014 - ScyllaDB 0.1: shards
■Introduce an orthogonal distribution, internal to the node
■Different nodes can have different shard counts
[Diagram: each node's owned range and replica set subdivided among shards 0, 1, and 2]

Advantages
■Per-shard data ownership removes need for locks

Disadvantages
■Poor data distribution
●$ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 0
12 nodes, 256 vnodes, 30 shards
maximum node overcommit: 1.11148
maximum shard overcommit: 2.512377
■Poor scan performance over near-empty tables
●Need to scan each vnode, then each shard’s intersection in the vnode

2016 - ScyllaDB 1.6 - murmur3_ignore_msb_bits
■Split the token range into 4096*nr_shard ranges
■Each shard owns 4096 ranges
[Diagram: owned range and replica set, with each range subdivided into many small sub-ranges cycling through shards 0, 1, and 2]
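
To make the ignore-msb idea concrete, here is a simplified, illustrative token-to-shard mapping in that spirit: bias the signed token, discard the top ignore_msb bits, then scale linearly into the shard count. The real ScyllaDB code differs in details; the names below are mine.

// Simplified sketch of a token-to-shard mapping that ignores the top bits of
// the token, so each shard ends up owning many small, evenly sized sub-ranges.
// Not the exact ScyllaDB code; the structure is illustrative.
#include <cstdint>
#include <initializer_list>
#include <iostream>
#include <limits>

using uint128_t = unsigned __int128;  // GCC/Clang extension

unsigned shard_of(int64_t token, unsigned shards, unsigned ignore_msb) {
    // Bias the signed murmur3 token into an unsigned, zero-based ring position.
    uint64_t pos = uint64_t(token) - uint64_t(std::numeric_limits<int64_t>::min());
    // Discard the most significant bits so nearby sub-ranges rotate over shards.
    pos <<= ignore_msb;
    // Scale the 64-bit position linearly into [0, shards).
    return unsigned((uint128_t(pos) * shards) >> 64);
}

int main() {
    // With 30 shards and ignore_msb = 12, tokens only ~2^48 apart already land
    // on different shards instead of one shard owning a huge contiguous run.
    for (int64_t token : {int64_t(0), int64_t(1) << 48, int64_t(1) << 49}) {
        std::cout << "token " << token << " -> shard "
                  << shard_of(token, 30, 12) << "\n";
    }
}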

Advantages
■Better data distribution
●$ ./shardsim --nodes 12 --vnodes 256 --shards 30 --ignore-msb-bits 12
12 nodes, 256 vnodes, 30 shards
maximum node overcommit: 1.11791
maximum shard overcommit: 1.140437

Disadvantages
■Even worse small-table performance
●60 shards * 4096 per-shard subranges * 30 nodes * 256 vnodes = ???

■Terrible repair/streaming with mismatched shard counts
■Complexity, bugs

Hello Tablets

Tablet History: ClayDB > Bigtable > ScyllaDB

ScyllaDB Tablet
■A small range of keys (tokens)
■Size of 5GB
■Dynamic, per table
■Dynamically assigned to nodes
■A full LSM tree {compaction, sstable files, memtable}
■Small tables fit in a single tablet -> great scan/index performance

The Tablets Table
// Managed by the Raft protocol
// Replicated on every node

CREATE TABLE system.tablets (
    table_id uuid,
    last_token bigint,
    keyspace_name text static,
    replicas list<tuple<uuid, int>>,
    new_replicas list<tuple<uuid, int>>,
    session uuid,
    stage text,
    transition text,
    table_name text static,
    tablet_count int static,
    PRIMARY KEY (table_id, last_token)
);
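
To make the schema concrete: conceptually, each row's last_token is the upper bound of that tablet's token range, so resolving a key amounts to a binary search within one table_id partition. A sketch with illustrative names (not ScyllaDB's actual classes):

// Conceptual model of tablet lookup: each tablet covers the token range
// ending at its last_token, so routing a key means finding the first entry
// whose last_token is >= the key's token. Names are illustrative only.
#include <algorithm>
#include <cstdint>
#include <vector>

struct replica {
    uint64_t host_id_hi, host_id_lo;   // stand-in for the uuid in the schema
    int shard;                         // the int half of the <uuid, int> tuple
};

struct tablet_entry {
    int64_t last_token;                // upper bound of this tablet's range
    std::vector<replica> replicas;     // current replica set for the tablet
};

// `tablets` is the per-table list, kept sorted by last_token just like the
// clustering order within one table_id partition of system.tablets.
const tablet_entry& tablet_for_token(const std::vector<tablet_entry>& tablets,
                                     int64_t token) {
    auto it = std::lower_bound(tablets.begin(), tablets.end(), token,
        [](const tablet_entry& t, int64_t tok) { return t.last_token < tok; });
    // The last tablet's last_token is the ring maximum, so `it` never falls
    // off the end for a valid token.
    return *it;
}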

[Diagram: the tablets table is replicated via Raft (Group 0); key1 and key2 map to tablets, and each tablet maps to its tablet replicas]

Tablets - balancing
A table starts with a few tablets.

Small tables end there:
not fragmented into tiny pieces like with tokens.

Tablets - balancing
When a tablet becomes too heavy
(disk, CPU, …) it is split.

Tablets - balancing
The load balancer can decide to move tablets.

Tablets
Resharding is cheap.
SSTables are split at tablet boundaries.
Reassigning tablets to shards is a logical operation.

Tablets
Cleanup after topology change is cheap.
Just delete the SSTables.

Tablet Scheduler
The scheduler globally controls movement and maintenance operations on a per-tablet basis.
[Diagram: the scheduler schedules repair, migration, and backup operations for tablet 0 and tablet 1]

Tablet Scheduler
Goals:
■Maximize throughput (saturate)
■Keep migrations short (don’t overload)

Rules:
■migrations-in <= 2 per shard
■migrations-out <= 4 per shard
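
As an illustration of what such rules mean in practice, here is a toy admission loop that starts candidate migrations only while every shard stays under its caps. This is not ScyllaDB's scheduler, just a sketch of the constraint with made-up names.

// Toy admission loop honoring per-shard concurrency caps like the rules above
// (at most 2 incoming and 4 outgoing migrations per shard). Purely
// illustrative; not ScyllaDB's actual scheduler.
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

struct shard_id {
    uint32_t node, shard;
    bool operator<(const shard_id& o) const {
        return std::tie(node, shard) < std::tie(o.node, o.shard);
    }
};

struct migration { uint64_t tablet; shard_id from, to; };

constexpr int max_in_per_shard  = 2;
constexpr int max_out_per_shard = 4;

// Walk candidate moves in priority order and start any move that keeps every
// shard under its caps; the rest wait for a later scheduling round.
std::vector<migration> schedule(const std::vector<migration>& candidates,
                                std::map<shard_id, int>& in_flight_in,
                                std::map<shard_id, int>& in_flight_out) {
    std::vector<migration> started;
    for (const auto& m : candidates) {
        if (in_flight_in[m.to] >= max_in_per_shard ||
            in_flight_out[m.from] >= max_out_per_shard) {
            continue;
        }
        ++in_flight_in[m.to];
        ++in_flight_out[m.from];
        started.push_back(m);
    }
    return started;
}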

Sharding Considerations
■Data distribution/fairness -> Perfect
■Mixed node sizes, add/remove 5% in size
■Multi-node operations - double capacity in minutes
■Number of node streams to/from a single node
■Scanning is efficient
■Small tables are fast & efficient

Driver Support
■Drivers are tablet-aware
■But without reading the tablets table
●The driver contacts a random node/shard
●On a miss, it gets updated routing information for that tablet
●Ensures fast start-up even with 100,000 entries in the tablets table
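
A conceptual sketch of this learn-on-miss routing, with hypothetical names rather than a real driver API: the cache starts empty, the driver sends to a random node on a miss, and the routing info returned with the reply fills the cache for next time.

// Conceptual "learn on miss" routing cache (illustrative only, not a real
// driver API): nothing is preloaded, entries are filled lazily from routing
// info returned together with a reply that missed the right node.
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct host { uint32_t id; };

class tablet_routing_cache {
    // last_token of a cached tablet -> its replicas (single table for brevity;
    // a real cache would also track where each tablet's range starts).
    std::map<int64_t, std::vector<host>> by_last_token_;
public:
    // Cached replicas covering `token`, or nullopt: on a miss the driver just
    // sends the request to a random node/shard and waits to be corrected.
    std::optional<std::vector<host>> lookup(int64_t token) const {
        auto it = by_last_token_.lower_bound(token);
        if (it == by_last_token_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Called when a response carries fresher routing info for one tablet.
    void learn(int64_t last_token, std::vector<host> replicas) {
        by_last_token_[last_token] = std::move(replicas);
    }
};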

Changes in the Data Plane
■Fencing - each write is signed with the topology version
■If there is a version mismatch, the write doesn't go through
[Sequence diagram between Coordinator, Replica, and Topology coordinator: requests carry Req(V1), the topology coordinator issues Fence(V1) and Drain(V1), and subsequent requests are sent as Req(V2)]
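
A minimal sketch of the fencing check, assuming writes carry the topology version they were planned under and a replica rejects anything older than the version it has been fenced to (names are illustrative, not ScyllaDB's):

// Minimal fencing sketch: writes carry the topology version they were planned
// under; once the replica is fenced to version V, anything older is rejected
// so in-flight requests from a stale topology cannot land after a migration.
// Illustrative only.
#include <cstdint>
#include <stdexcept>

struct write_request {
    uint64_t topology_version;  // version the coordinator signed the write with
    // ... payload ...
};

class replica {
    uint64_t fence_version_ = 0;  // lowest version still accepted
public:
    // The topology coordinator raises the fence before moving a tablet away.
    void fence(uint64_t version) { fence_version_ = version; }

    void apply(const write_request& req) {
        if (req.topology_version < fence_version_) {
            // The coordinator will retry with refreshed routing and a new version.
            throw std::runtime_error("stale topology version, write rejected");
        }
        // ... apply the write ...
    }
};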

Add node
[Diagram: existing nodes hold the tablets and keep serving reads/writes; a new node joins]

Add node - stream first tablet
[Diagram: the first tablet is streamed to the new node]

Add node - start serving
[Diagram: the new node starts serving reads/writes for the streamed tablet]

Demo time

Not a demo, a test suite

Tablets => Serverless
[Charts: data volume over time by instance type (i3en, i4i); throughput vs. required capacity over time; on-demand vs. base throughput over time]
Typeless, sizeless, limitless

Serverless tests (even more)

Thank you! Let’s connect.
@DorLaor