ScyllaDB Tablets: Rethinking Replication

ScyllaDB · 40 slides · Jun 20, 2024

About This Presentation

ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this ...


Slide Content

ScyllaDB Tablets: Rethinking Replication Avi Kivity, CTO @ ScyllaDB

Avi Kivity - CTO and co-founder, ScyllaDB; Linux KVM author and ex-maintainer

Agenda: Goals · Short demo · The problem we're fixing · Implementation details · Deeper dive into the demo · Status (Illustration: chocolate tablets)

Tablet project goals

Goals
- Fast bootstrap/decommission: bootstrap is a critical time for a cluster running out of space or CPU capacity
- Incremental bootstrap: shoulder the load immediately, not after bootstrap completes
- Parallel bootstrap: add multiple nodes in parallel if we're in a real hurry
- Decouple topology operations: remove a dead node while bootstrapping two new nodes
- Improve support for many small tables

Demo

Demo Scenario
- Preload a 3-node cluster with 650 GB/replica
- Run a moderate mixed read/write workload
- Bootstrap three nodes
- Decommission three nodes
Test harness: scylla-cluster-tests (open source), used in weekly regression tests

Replica Writes

Replica Reads

History

How did we get here?
- ScyllaDB streaming was fast… but node storage grew faster
- Some schema shapes slow down mutation-based streaming
- The eventually consistent, masterless architecture means the operator has to coordinate everything…
- …and everything is serialized
- The layout is static (the famous token ring)
Diagram: the token ring for a 2-node cluster (A B A B) and the same ring after adding one node (A B C A B C)

Implementation

Implementation - metadata
- Introduce a new layer of indirection: the tablets table
- Each table has its own token-range-to-node mapping
- The mapping can change independently of node addition and removal: a new node can be added without any owned data!
- Different tables can have different tablet counts
- Tablet counts change so that tablet size on disk remains roughly constant
- Managed by Raft Group 0
Diagram: query → system.tablets → token → replica set
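The indirection layer described above can be sketched as a per-table sorted map from token to replica set. This is an illustrative model only, not ScyllaDB's actual data structures; the tablet boundaries and replica names are made up.

```python
import bisect
from dataclasses import dataclass

@dataclass
class Tablet:
    last_token: int   # upper (inclusive) bound of this tablet's token range
    replicas: list    # (host, shard) pairs; hypothetical values

class TabletMap:
    """Per-table token -> replica-set mapping, independent of cluster topology."""
    def __init__(self, tablets):
        # Tablets sorted by last_token together cover the full token range.
        self.tablets = sorted(tablets, key=lambda t: t.last_token)
        self.bounds = [t.last_token for t in self.tablets]

    def lookup(self, token):
        # The first tablet whose last_token >= token owns this token.
        i = bisect.bisect_left(self.bounds, token)
        return self.tablets[i]

tmap = TabletMap([
    Tablet(last_token=-1, replicas=[("node-a", 0), ("node-b", 1)]),
    Tablet(last_token=2**63 - 1, replicas=[("node-b", 2), ("node-c", 0)]),
])
print(tmap.lookup(42).replicas)  # tokens above -1 fall in the second tablet
```

Because the mapping is its own table rather than a function of the ring, moving a tablet only rewrites one row of metadata; the token ring itself never changes shape.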

Implementation - data path
- Each tablet replica is isolated into its own memtable + sstables
- It forms its own little Log-Structured Merge tree, with compaction and so on
- It can be migrated as a unit: migration copies the unit, cleanup deletes the unit
- Tablets split/merge as the table grows/shrinks
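The split/merge behavior above can be sketched as a size check against a target: when a tablet's on-disk footprint grows past roughly twice the target, its token range splits in half. The target size and token values are illustrative assumptions, not ScyllaDB's actual thresholds.

```python
TARGET_TABLET_SIZE_GB = 5  # hypothetical target; the real value is tunable

def maybe_split(first_token, last_token, size_gb):
    """Return the (first_token, last_token) ranges after a split check."""
    if size_gb <= 2 * TARGET_TABLET_SIZE_GB:
        return [(first_token, last_token)]
    mid = (first_token + last_token) // 2
    # Each half inherits roughly half the data. In the real system, splits
    # are coordinated by the load balancer and replicated via Raft.
    return [(first_token, mid), (mid + 1, last_token)]

print(maybe_split(0, 1023, size_gb=12))  # oversized: splits into two ranges
print(maybe_split(0, 1023, size_gb=4))   # within bounds: unchanged
```

Keeping per-tablet size roughly constant is what lets migration move a fixed-cost unit: copy the slot's memtable + sstables, flip the metadata row, delete the source.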

Implementation - load balancer
- Hosted on one node, but can be migrated freely if that node is down
- Synchronized via Raft
- Collects statistics on tables and tablets
- Migrates tablets to balance space
- Migrates tablets to balance CPU load
- Evacuates nodes to decommission them
- Rebuilds and repairs
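In the spirit of the space-balancing bullet above, a minimal greedy balancer can be sketched as: repeatedly move one tablet from the fullest node to the emptiest until all nodes are within one tablet of each other. The node names and counts are made up, and the real balancer weighs actual bytes and CPU statistics, not tablet counts.

```python
def balance(tablet_counts):
    """tablet_counts: dict node -> tablet count; returns (src, dst) migrations."""
    counts = dict(tablet_counts)
    migrations = []
    while True:
        src = max(counts, key=counts.get)  # fullest node
        dst = min(counts, key=counts.get)  # emptiest node
        if counts[src] - counts[dst] <= 1:
            return migrations              # balanced to within one tablet
        counts[src] -= 1
        counts[dst] += 1
        migrations.append((src, dst))

print(balance({"node-a": 6, "node-b": 2, "node-c": 1}))
```

Because each migration is an independent unit (see the data-path slide), the balancer can run these moves in the background while the cluster serves traffic.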

The tablets table
- Source of truth for the cluster:
  - How many tablets each table has
  - Token boundaries for each tablet
  - On which nodes and shards we have replicas
  - What kind of transition a tablet is undergoing
  - Which nodes and shards will host tablet replicas after the transition
- Managed using the Raft protocol; replicated on every node

CREATE TABLE system.tablets (
    table_id uuid,
    last_token bigint,
    keyspace_name text static,
    table_name text static,
    tablet_count int static,
    replicas list<tuple<uuid, int>>,
    new_replicas list<tuple<uuid, int>>,
    session uuid,
    stage text,
    transition text,
    PRIMARY KEY (table_id, last_token)
);

Driver support
- Drivers are tablet-aware, but without reading the tablets table
- The driver contacts a random node/shard; on a miss, it gets updated routing information for that tablet
- This ensures fast start-up even with 100,000 entries in the tablets table
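The miss-driven routing described above can be sketched as a lazily filled client-side cache. The class and method names here are illustrative assumptions; real drivers piggyback the updated routing information on the response to a mis-routed request rather than exposing a `learn` call.

```python
import random

class TabletRouter:
    def __init__(self, nodes):
        self.nodes = nodes
        self.cache = {}  # table -> {(first_token, last_token): (node, shard)}

    def route(self, table, token):
        for (lo, hi), target in self.cache.get(table, {}).items():
            if lo <= token <= hi:
                return target          # cache hit: go straight to the owner
        # Cache miss: contact any node/shard. The (hypothetical) reply would
        # carry the real owner, which we then record via learn().
        return (random.choice(self.nodes), 0)

    def learn(self, table, token_range, target):
        # Called when a response includes updated routing for this tablet.
        self.cache.setdefault(table, {})[token_range] = target
```

The cache starts empty, so connection setup cost is independent of the tablet count; routing accuracy improves as responses to misses fill it in.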

Deeper dive into demo

Replica Reads/Writes

Coordinator requests

CPU load (per shard)

Disk bandwidth - streaming

Disk bandwidth -query

Disk bandwidth - commitlog

Visualization tool

Status

Status/plans
What works:
- Parallel, incremental bootstrap and decommission
- File streaming
- Replace and rebuild
- Space load balancing
- Split on table growth
- Materialized views
Still working on:
- Merge on table shrinkage
- Change Data Capture
- CPU load balancing
- Automatic migration from vnodes
(Illustration: migraine relief tablets)

Stay in Touch Avi Kivity [email protected] @AviKivity @avikivity https://www.linkedin.com/in/avikivity/

Graphics Sandbox

Illustration - a request looks up the tablets table and exits with (node, shard) routing information. ("Where is this tablet?" → Node B, Shard 3; system.tablets: query → replica set)

(detailed, vnodeS)

(detailed, tablets)

(detailed, vnodes, per-shard)

(detailed, tablets, per-shard)

Vnodes, advanced

Tablets, advanced

Illustration - a node with 16 equally sized tablet slots, only some of them filled; each filled slot contains a memtable plus a few sstables.

Illustration - maybe an octopus spreading its tentacles to each node Maybe use DB Performance at Scale book style tentacles here ?

Illustration - Deuteronomy-style tablets, but with host/shard information instead of commandments (node1: shards 0, 1, 3; node2: shards 0, 2, 3; node3: shards 0, 2, 3)

CREATE TABLE system.tablets (
    keyspace_name text static,
    table_name text static,
    tablet_count int static,
    table_id uuid,
    last_token bigint,
    replicas frozen<list<tuple<uuid, int>>>,
    new_replicas frozen<list<tuple<uuid, int>>>,
    stage text,
    session uuid,
    PRIMARY KEY (table_id, last_token)
)

CREATE TABLE dump

Felipe's replacement idea for deuteronomy thingie CREATE TABLE system.tablets ( keyspace_name text static, table_name text static, tablet_count int static, table_id uuid, last_token bigint, … PRIMARY KEY (table_id, last_token) ) Raft-Managed Table User Table Replicas