ScyllaDB Tablets:
Rethinking Replication
Avi Kivity, CTO @ ScyllaDB
Avi Kivity
■CTO, co-founder ScyllaDB
■Linux KVM author and ex-maintainer
■Goals
■Short Demo
■The problem we’re fixing
■Implementation details
■Deep dive into Demo
■Status
Agenda
Tablet project goals
Goals
■Fast bootstrap/decommission
■Bootstrap is a critical time for a cluster if running out of space or CPU capacity
■Incremental bootstrap
■Shoulder the load immediately, not after bootstrap completes
■Parallel bootstrap
■Add multiple nodes in parallel if we’re in a real hurry
■Decouple topology operations
■Remove a dead node while bootstrapping two new nodes
■Improve support for many small tables
Demo
■Scenario
■Preload a 3-node cluster with 650 GB/replica
■Run a moderate mixed read/write workload
■Bootstrap three nodes
■Decommission three nodes
■Test harness
■Scylla-cluster-tests (open-source)
■Used in weekly regression tests
Demo Scenario
Replica Writes
Replica Reads
History
How did we get here?
■ScyllaDB streaming was fast… but
node storage grew faster
■Some schema shapes slow down
mutation-based streaming
■Eventually consistent, masterless
architecture means the operator has
to coordinate everything …
■… and everything is serialized
■Layout is static (famous Token Ring)
A
B
A
BToken ring
For 2-node
cluster
After adding
one node
A
B
A
B
C
C
Implementation
■Introduce a new layer of indirection -
the tablets table
■Each table has its own token range to node
mapping
■Mapping can change independently of node
addition and removal
■A new node can be added without any owned
data!
■Different tables can have different tablet counts
■Tablet counts change so tablet size on disk
remains roughly constant
■Managed by Raft Group 0
Implementation -metadata
System, tablets
Query
Replica
Set
Token
■Each tablet replica is isolated into its
own memtable+sstables
■Forms its own little Log-Structured Merge Tree
■With compaction and stuff
■Can be migrated as a unit
■Migration: copy the unit
■Cleanup: delete the unit
■Split/merge as the table
grows/shrinks
Implementation -data path
■Hosted on one node
■But can be migrated freely if the node is down
■Synchronized via Raft
■Collects statistics on tables and
tablets
■Migrates to balance space
■Evacuates nodes to decommission
■Migrates to balance CPU load
■Rebuilds and repairs
Implementation -load balancer
■Source of truth for the cluster
■How many tablets for each table
■Token boundaries for each tablets
■On which nodes and shards do we have
replicas
■What kind of transition the tablet is
undergoing
■Which nodes and shards will host tablet
replicas after the transition
■Managed using the Raft protocol
■Replicated on every node
The tablets table
CREATE TABLEsystem.tablets (
table_id uuid,
last_token bigint,
keyspace_name text static,
replicas list<tuple<uuid, int>>,
new_replicas list<tuple<uuid, int>>,
session uuid,
stage text,
transition text,
table_name text static,
tablet_count int static,
PRIMARY KEY(table_id, last_token)
);
■Drivers are tablet-aware
■But without reading the tablets
table
■Driver contacts a random node/shard
■On miss, gets updated routing
information for that tablet
■Ensures fast start-up even with 100,000
entries in the tablets table
Driver support
Deeper dive into demo
Replica Reads/Writes
Coordinator requests
CPU load (per shard)
Disk bandwidth -streaming
Disk bandwidth -query
Disk bandwidth -commitlog
Visualization tool
Status
■What works
■Parallel, incremental bootstrap and
decommission
■File streaming
■Replace and rebuild
■Space load balancing
■Split on table growth
■Materialized Views
■Still working on it
■Merge on table shrinkage
■Change Data Capture
■CPU load balancing
■Automatic migration from Vnodes
Status/plans
Stay in Touch
Avi Kivity [email protected]
@AviKivity
@avikivity
https://www.linkedin.com/in/avikivity/
Graphics Sandbox
Illustration -request looks up into a table of tablets, exits lookup with (node,
shard) routing information
Where is
this tablet?
Node B
Shard 3
system.tablets
Query
Replica
Set
(detailed, vnodeS)
(detailed, tablets)
(detailed, vnodes, per-shard)
(detailed, tablets, per-shard)
Vnodes, advanced
Tablets, advanced
A node in which there are 16 equally sized slots for tablets, only some are
filled. In each filled slot there is a memtable + a few sstables
Memtable
SSTable
SSTable
Memtable
SSTable
SSTable
SSTable
Memtable
SSTable
SSTable
SSTable
SSTable
SSTable
SSTable
SSTableMemtable SSTable
SSTableMemtable SSTable
Memtable
SSTableSSTableSSTable
Memtable SSTable
Memtable
SSTableSSTableSSTable
SSTableMemtable SSTable
SSTableMemtable SSTable
Illustration -maybe an octopus spreading its
tentacles to each node
Maybe use DB Performance at
Scale book style tentacles here?
Illustration -Deuteronomy style tablets but with host/shard
information instead of commandments
Felipe's replacement idea for deuteronomy thingie
CREATE TABLEsystem.tablets(
keyspace_name text static,
table_name text static,
tablet_count int static,
table_id uuid,
last_token bigint,
…
PRIMARYKEY(table_id, last_token)
)
Raft-Managed Table
User Table
Replicas