ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Size: 9.25 MB
Language: en
Added: Jun 20, 2024
Slides: 40 pages
Slide Content
ScyllaDB Tablets: Rethinking Replication Avi Kivity, CTO @ ScyllaDB
Avi Kivity, CTO and co-founder, ScyllaDB; Linux KVM author and ex-maintainer
Agenda
- Goals
- Short demo
- The problem we're fixing
- Implementation details
- Deep dive into demo
- Status
(Picture of chocolate tablets)
Tablet project goals
Goals
- Fast bootstrap/decommission: bootstrap is a critical time for a cluster if it is running out of space or CPU capacity
- Incremental bootstrap: shoulder the load immediately, not after bootstrap completes
- Parallel bootstrap: add multiple nodes in parallel if we're in a real hurry
- Decouple topology operations: remove a dead node while bootstrapping two new nodes
- Improve support for many small tables
Demo
Demo Scenario
- Preload a 3-node cluster with 650 GB/replica
- Run a moderate mixed read/write workload
- Bootstrap three nodes
- Decommission three nodes
Test harness: scylla-cluster-tests (open source), used in weekly regression tests
Replica Writes
Replica Reads
History
How did we get here?
- ScyllaDB streaming was fast… but node storage grew faster
- Some schema shapes slow down mutation-based streaming
- Eventually consistent, masterless architecture means the operator has to coordinate everything…
- …and everything is serialized
- Layout is static (the famous token ring)
(Diagram: token ring for a 2-node cluster, A B A B; after adding one node, A B C A B C)
Implementation
Implementation - metadata
- Introduce a new layer of indirection: the tablets table
- Each table has its own token-range-to-node mapping
- Mapping can change independently of node addition and removal
- A new node can be added without any owned data!
- Different tables can have different tablet counts
- Tablet counts change so tablet size on disk remains roughly constant
- Managed by Raft Group 0
(Diagram: query → system.tablets → token → replica set)
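The indirection layer on this slide can be pictured as a per-table sorted list of tablets, each owning the token range up to its last_token, so routing a token is just a binary search. A minimal sketch under those assumptions (the `Tablet` class and `route` function are hypothetical illustrations, not ScyllaDB's API):

```python
import bisect
from dataclasses import dataclass

@dataclass
class Tablet:
    last_token: int  # inclusive upper bound of this tablet's token range
    replicas: list   # [(node_id, shard), ...]

def route(tablets, token):
    """Find the tablet owning `token`: the first tablet with last_token >= token.

    `tablets` must be sorted by last_token, and the final entry must cover
    the top of the token ring.
    """
    i = bisect.bisect_left([t.last_token for t in tablets], token)
    return tablets[i]

# A toy table split into 4 tablets over a tiny token space:
table = [
    Tablet(99,  [("A", 0)]),
    Tablet(199, [("B", 1)]),
    Tablet(299, [("C", 0)]),
    Tablet(399, [("A", 2)]),
]
assert route(table, 150).replicas == [("B", 1)]

# Re-mapping one tablet moves only that fragment; no other mapping changes,
# which is what lets topology change independently of node add/remove:
table[1].replicas = [("D", 0)]
assert route(table, 150).replicas == [("D", 0)]
```

This also shows why a new node can join with zero owned data: it simply appears in no replica list until the balancer starts pointing tablets at it.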
Implementation - data path
- Each tablet replica is isolated into its own memtable + sstables
- Forms its own little Log-Structured Merge tree, with compaction and so on
- Can be migrated as a unit: migration copies the unit, cleanup deletes the unit
- Split/merge as the table grows/shrinks
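Because each tablet replica is a self-contained memtable + sstable set, "migrate" and "cleanup" reduce to copying and deleting that unit. A toy sketch of the idea (the `TabletReplica`, `Node`, and `migrate` names are hypothetical; real migration streams sstable files over the network):

```python
from dataclasses import dataclass, field

@dataclass
class TabletReplica:
    """One tablet replica: a private memtable plus its own sstables (a small LSM tree)."""
    memtable: dict = field(default_factory=dict)
    sstables: list = field(default_factory=list)  # each sstable modeled as a frozen dict of rows

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

@dataclass
class Node:
    tablets: dict = field(default_factory=dict)  # tablet_id -> TabletReplica

def migrate(tablet_id, src: Node, dst: Node):
    """Migrate one tablet: copy the whole unit to dst, then clean up the source."""
    replica = src.tablets[tablet_id]
    replica.flush()                                         # settle writes into sstables
    dst.tablets[tablet_id] = TabletReplica(sstables=list(replica.sstables))
    del src.tablets[tablet_id]                              # cleanup: delete the unit

a, b = Node(), Node()
a.tablets["t1"] = TabletReplica(memtable={"k": "v"})
migrate("t1", a, b)
assert "t1" not in a.tablets
assert b.tablets["t1"].sstables == [{"k": "v"}]
```

Contrast this with vNode streaming, where the data for one token range is interleaved with everything else on the node and must be carved out mutation by mutation.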
Implementation - load balancer
- Hosted on one node, but can migrate freely if that node goes down
- Synchronized via Raft
- Collects statistics on tables and tablets
- Migrates tablets to balance disk space
- Evacuates nodes to decommission them
- Migrates tablets to balance CPU load
- Rebuilds and repairs
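The space-balancing part can be sketched as a greedy loop: find the fullest and emptiest nodes and move a tablet that narrows the gap. This is only an illustration of the shape of the decision, assuming per-node usage and per-tablet size statistics; the `balance_step` function is hypothetical, not ScyllaDB's algorithm:

```python
def balance_step(disk_used, tablet_sizes):
    """One greedy balancing step: move a tablet from the fullest node to the emptiest.

    disk_used:    {node: bytes used}
    tablet_sizes: {node: {tablet_id: bytes}}
    Returns (tablet_id, src, dst), or None if no useful move exists.
    """
    src = max(disk_used, key=disk_used.get)
    dst = min(disk_used, key=disk_used.get)
    if src == dst or not tablet_sizes[src]:
        return None
    # Only consider tablets that narrow the gap without overshooting past half of it.
    gap = disk_used[src] - disk_used[dst]
    candidates = [t for t, s in tablet_sizes[src].items() if s <= gap / 2]
    if not candidates:
        return None
    tablet = max(candidates, key=tablet_sizes[src].get)  # biggest helpful tablet
    return tablet, src, dst

used = {"A": 900, "B": 300, "C": 300}
sizes = {"A": {"t1": 200, "t2": 250, "t3": 450}, "B": {}, "C": {}}
assert balance_step(used, sizes) == ("t2", "A", "B")
```

Decommission falls out of the same mechanism: treat the leaving node as having infinite pressure so the balancer evacuates all of its tablets.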
The tablets table
- Source of truth for the cluster:
  - How many tablets each table has
  - Token boundaries for each tablet
  - On which nodes and shards we have replicas
  - What kind of transition the tablet is undergoing
  - Which nodes and shards will host tablet replicas after the transition
- Managed using the Raft protocol
- Replicated on every node

CREATE TABLE system.tablets (
    table_id uuid,
    last_token bigint,
    keyspace_name text static,
    replicas list<tuple<uuid, int>>,
    new_replicas list<tuple<uuid, int>>,
    session uuid,
    stage text,
    transition text,
    table_name text static,
    tablet_count int static,
    PRIMARY KEY (table_id, last_token)
);
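The slides say tablet counts change so on-disk tablet size stays roughly constant; given the per-tablet last_token boundaries above, a split doubles the tablet count by inserting a midpoint into every tablet's range. The exact split policy is not described in the deck; this sketch only illustrates the geometry, with a hypothetical `split_all` helper:

```python
def split_all(boundaries, ring_min=-2**63):
    """Double the tablet count: insert a midpoint into each tablet's token range.

    `boundaries` is the sorted list of last_token values, one per tablet;
    tablet i owns the token range (boundaries[i-1], boundaries[i]].
    """
    out = []
    prev = ring_min
    for last in boundaries:
        mid = (prev + last) // 2
        out.extend([mid, last])
        prev = last
    return out

b = [-100, 0, 100, 200]
b2 = split_all(b, ring_min=-200)
assert len(b2) == 2 * len(b)
assert b2 == [-150, -100, -50, 0, 50, 100, 150, 200]
```

Keeping all existing last_token values as boundaries means every old tablet maps onto exactly two new ones, so the split can proceed tablet by tablet without a cluster-wide rewrite.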
Driver support
- Drivers are tablet-aware, but without reading the tablets table
- Driver contacts a random node/shard
- On a miss, it gets updated routing information for that tablet
- Ensures fast start-up even with 100,000 entries in the tablets table
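The driver-side behavior described above amounts to a lazily populated routing cache: send to a random endpoint, and when that endpoint does not own the key's tablet, learn the correct (node, shard) from the response. A simulation sketch (the `TabletAwareDriver` and `FakeCluster` classes are hypothetical, not any real driver's API):

```python
import random

class TabletAwareDriver:
    """Lazily learns tablet routing instead of preloading system.tablets."""
    def __init__(self, cluster):
        self.cluster = cluster   # stands in for the server side, which knows true routing
        self.cache = {}          # token -> (node, shard)

    def execute(self, token):
        target = self.cache.get(token)
        if target is None:
            target = random.choice(self.cluster.endpoints)  # cold cache: pick at random
        owner = self.cluster.owner(token)
        if target != owner:
            target = owner           # miss: the response carries correct routing; retry there
        self.cache[token] = owner    # remember it for next time
        return target

class FakeCluster:
    endpoints = [("A", 0), ("B", 0), ("C", 1)]
    def owner(self, token):
        return self.endpoints[token % len(self.endpoints)]

d = TabletAwareDriver(FakeCluster())
assert d.execute(4) == ("B", 0)   # first request may detour, but lands on the owner
assert d.cache[4] == ("B", 0)     # subsequent requests for this tablet go direct
```

Because nothing is fetched up front, start-up cost is independent of the tablets table size; the cache fills only with tablets the application actually touches.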
Deeper dive into demo
Replica Reads/Writes
Coordinator requests
CPU load (per shard)
Disk bandwidth - streaming
Disk bandwidth -query
Disk bandwidth - commitlog
Visualization tool
Status
Status/plans
What works:
- Parallel, incremental bootstrap and decommission
- File streaming
- Replace and rebuild
- Space load balancing
- Split on table growth
- Materialized views
Still working on:
- Merge on table shrinkage
- Change Data Capture
- CPU load balancing
- Automatic migration from vNodes
(Illustration: migraine-relief tablets)
Stay in Touch Avi Kivity [email protected] @AviKivity @avikivity https://www.linkedin.com/in/avikivity/
Graphics Sandbox
Illustration - a request looks up a table of tablets and exits with (node, shard) routing information: "Where is this tablet?" → Node B, Shard 3. (system.tablets; query → replica set)
(detailed, vnodes)
(detailed, tablets)
(detailed, vnodes, per-shard)
(detailed, tablets, per-shard)
Vnodes, advanced
Tablets, advanced
(Diagram: a node with 16 equally sized tablet slots, only some filled; each filled slot contains a memtable plus a few sstables)
Illustration - maybe an octopus spreading its tentacles to each node. Maybe use the Database Performance at Scale book-style tentacles here?
Illustration - Deuteronomy-style tablets, but with host/shard information instead of commandments:
node1: shard 0, shard 1, shard 3
node2: shard 0, shard 3, shard 2
node3: shard 0, shard 3, shard 2

CREATE TABLE system.tablets (
    keyspace_name text static,
    table_name text static,
    tablet_count int static,
    table_id uuid,
    last_token bigint,
    replicas frozen<list<tuple<uuid, int>>>,
    new_replicas frozen<list<tuple<uuid, int>>>,
    stage text,
    session uuid,
    PRIMARY KEY (table_id, last_token)
)

(CREATE TABLE dump)
Felipe's replacement idea for the Deuteronomy thingie:

CREATE TABLE system.tablets (
    keyspace_name text static,
    table_name text static,
    tablet_count int static,
    table_id uuid,
    last_token bigint,
    …
    PRIMARY KEY (table_id, last_token)
)

(Raft-managed table; user table; replicas)