CTO Avi Kivity shares how scalability is core to ScyllaDB's architecture.
Added: Mar 05, 2025
Slides: 33 pages
A ScyllaDB Community
Architecture for Extreme Scale:
Scalability is a Feature
Avi Kivity
CTO
Avi Kivity
■Linux KVM
■First line written
■Seastar
■First line written
■ScyllaDB
■First line written
Presentation Agenda
■What is Scalability?
■Case Studies
What is Scalability?
Scalability - definition
■The property of a program whose throughput is linear with the amount of
computing resources dedicated to it
■Scalability does not imply fast single-node performance, only that as you
increase resources, so does performance
Practical scalability
■The property of a program whose throughput is near-linear with the
amount of computing resources dedicated to it, up to a reasonable limit,
and for targeted workloads
■Scalability does not imply fast single-node performance, only that as you
increase resources, so does performance
■But a practical system cannot sacrifice low-node-count performance, or it
becomes impractical. A practical system must deliver good performance in a
three-node cluster.
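The near-linear definition can be made concrete as a scaling-efficiency ratio. The throughput figures below are made up for illustration (not ScyllaDB benchmarks); they show scaling that is near-linear at small node counts and tapers at high ones:

```python
# Hypothetical throughput measurements (ops/s) by node count.
measured = {1: 100_000, 3: 295_000, 10: 960_000, 100: 8_500_000}

def scaling_efficiency(nodes: int, throughput: float, single_node: float) -> float:
    """Fraction of ideal linear throughput actually achieved."""
    return throughput / (nodes * single_node)

for n, t in measured.items():
    print(n, round(scaling_efficiency(n, t, measured[1]), 2))
```

An efficiency near 1.0 at three nodes is exactly the "practical" requirement above: the system must not trade away low-node-count performance to reach high scale.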
Horizontal and Vertical Scalability
■Horizontal: Scale with the number of nodes in the system
■Vertical: Scale with the size (compute power) of individual nodes
■We need both!
Why Horizontal
■Eventually any node size is limited
■Needed for geographical distribution
■Needed for availability and fault tolerance
Why Vertical
■With small nodes, overhead quickly dominates
■1000 nodes with 3-year MTBF -> one failure per day
■Upgrade hell
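The failure-rate bullet is simple arithmetic: assuming independent failures, the expected number of node failures per day is the node count divided by the MTBF in days:

```python
def cluster_failures_per_day(nodes: int, mtbf_years: float) -> float:
    """Expected node failures per day across the cluster,
    assuming independent failures: nodes / MTBF-in-days."""
    return nodes / (mtbf_years * 365)

print(cluster_failures_per_day(1000, 3))  # ~0.91: roughly one failure per day
```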
Why both?
■Work for horizontal scaling benefits vertical
scaling (thread-per-core)
■Reasonable cluster sizes reduce maintenance
■Benefit from geographical distribution
■100x 120 TB = 12 PB
■Should be enough for anybody
ScyllaDB Goals
■Fast algorithms and implementation
■Good scalability
■Practical tradeoffs
Fun fact
■The name “Scylla” was chosen to evoke “Scale”
■Then we learned how it’s pronounced
Practical scalability goals
■100 nodes/datacenter
■100 nodes, 60 TB/node -> 6PB/datacenter
■Thousand-node clusters with 1 TB/node are eschewed
■Terrible for maintenance
■256 vcpu / TB-RAM class machines
■Grow with hardware evolution
Components of Performance
■Good algorithms and data structures
■Scalability across the cluster
■Scalability within a node
■Efficient implementation
Seastar and thread-per-core
■Clustered application already solves compute and data distribution
across nodes
■Re-apply the solution towards compute and data distribution within
cores
■Sacrifices some efficiency for high utilization
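The "re-apply the solution" idea can be sketched as hash partitioning applied twice: once to pick an owning node, then again to pick a core (shard) within that node. The hashing below is purely illustrative — ScyllaDB actually uses a token ring and its own shard-selection algorithm, not this function:

```python
import hashlib

def route(key: bytes, n_nodes: int, shards_per_node: int) -> tuple[int, int]:
    """Pick an owning (node, shard) for a key: the same partitioning
    idea is applied across nodes and then across cores within a node."""
    h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
    return h % n_nodes, (h // n_nodes) % shards_per_node

node, shard = route(b"user:42", n_nodes=100, shards_per_node=64)
print(node, shard)  # deterministic: the same key always lands on the same core
```

Because each core owns a disjoint slice of the data, it can run share-nothing, without locks — the thread-per-core design that makes vertical scaling work.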
Memtable/cache sizing
■How much memory to allocate to memtables?
■How much memory to allocate to cache?
■How much memory to leave free?
Log-Structured Merge Tree Tiers
■Write amplification is O(log N)
■But what is N?
■N ~ (size of largest sstable) / (size of smallest sstable)
■Size of largest sstable: ~ disk size
■Size of smallest sstable: ~ memtable size
■To reduce write amplification, we must have the largest memtable size
we can
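Under the stated assumptions (each tier a fixed `fanout` times larger than the one below, smallest sstable ~ memtable size, largest ~ disk size), the number of tiers — and with it write amplification — falls as the memtable grows. A minimal sketch with illustrative sizes:

```python
import math

def lsm_levels(disk_size: int, memtable_size: int, fanout: int = 10) -> int:
    """Approximate LSM tier count: tiers grow by `fanout` from
    memtable-sized sstables up to disk-sized ones. Write amplification
    grows with the tier count, so a larger memtable means fewer tiers."""
    return math.ceil(math.log(disk_size / memtable_size, fanout))

# 60 TB of disk: 1 GB memtable vs 32 GB memtable (illustrative numbers)
print(lsm_levels(60 * 2**40, 1 * 2**30))   # more tiers
print(lsm_levels(60 * 2**40, 32 * 2**30))  # fewer tiers
```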
Memtable size allocation
■Allocate half of the machine’s memory to memtables
■The other half for cache and “incidentals”
Why half and half?
■Sensitivity analysis
■At 50%/50%, a 5% change has the same impact on both sides
■At 90%/10%, a 5% change has a drastic impact on the smaller side
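The sensitivity argument in numbers (fractions of total memory; the 5% shift is the slide's figure):

```python
def relative_impact(memtable_frac: float, shift: float = 0.05):
    """Relative size change on each side of the memtable/cache split
    when `shift` of total memory moves across it."""
    cache_frac = 1 - memtable_frac
    return shift / memtable_frac, shift / cache_frac

print(relative_impact(0.50))  # both sides swing by ~10%
print(relative_impact(0.90))  # the 10% side swings by ~50%
```

At 50/50 the split is at the point of minimum sensitivity: no side can be hit disproportionately hard by a misestimate.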
Global index implementation choices
Solution review
■Cassandra secondary index
■ScyllaDB Global Index
■Cassandra Storage Attached Index (SAI/SASI)
Cassandra Secondary Index
■Each node indexes its own data
■Index is stored in an index table
■Invalidated/rebuilt/cleaned on node
bootstrap
■Work scales with number of nodes
ScyllaDB Global Index
■Use materialized view to store index
■Impervious to node bootstrap
■Additional hops to access data
■Work does not scale with cluster size
■Low cardinality indexed columns cause
large/hot partitions
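The "additional hops" bullet can be sketched with plain dictionaries standing in for the index view and the base table (an illustrative structure, not ScyllaDB's API): the first hop resolves indexed value to base-table keys, the second fetches the rows.

```python
# hop 1 source: indexed value -> base-table partition keys
index_view = {"color=red": ["pk1", "pk7"]}
# hop 2 source: the base table itself
base_table = {"pk1": {"color": "red", "n": 1},
              "pk7": {"color": "red", "n": 7}}

def query_by_index(value: str):
    keys = index_view.get(value, [])        # hop 1: read the index (view)
    return [base_table[k] for k in keys]    # hop 2: read the base table

print(query_by_index("color=red"))
```

Note how a low-cardinality value (many keys behind one index entry) turns the hop-1 partition into the large, hot partition the slide warns about.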
Data Index
Cassandra SAI/SASI
■Each sstable indexes its own data
■Supports ranges and complex queries
in addition to simple matches
■Index rebuilt during compaction
■Much more memory/CPU efficient
■Work scales with number of sstables
in the cluster
Indexing comparison
Index type              | Antiscaling with | Side effects       | Additional benefits
Local index (Cassandra) | Nodes            | Rebuild/cleanup    |
Local index (ScyllaDB)  | Shards           | Rebuild/cleanup    |
Global index            | -                | Materialized views |
SAI/SASI                | sstables         | None               | Range queries/free text
Vector Indexes
■No partition key can be extracted from the vector
■HNSW computational complexity is O(log N)
■Partitioned index ->
■Antiscaling!
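Why a partitioned vector index antiscales: since no partition key can be derived from the query vector, every partition must be searched, so total work per query is P · O(log(N/P)), which grows with the partition count P. A sketch with an arbitrary constant factor:

```python
import math

def total_search_work(total_vectors: int, partitions: int, c: float = 1.0) -> float:
    """Total work for one ANN query when an O(log n) index (e.g. HNSW)
    is split into `partitions` that must ALL be searched."""
    per_partition = c * math.log2(total_vectors / partitions)
    return partitions * per_partition

n = 10**9
for p in (1, 10, 100):
    print(p, total_search_work(n, p))  # work increases with partitions
```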
Vector Index - Solution
■Vertical scaling - use few large nodes
■Prefer increasing the node size to adding nodes
■Prefer duplicating data to partitioning
■Asymmetric architecture
■Bonus: use GPU/TPU/FPGA
Indexing comparison (with vectors)
Index type              | Antiscaling with | Side effects                             | Additional benefits
Local index (Cassandra) | Nodes            | Rebuild/cleanup                          |
Local index (ScyllaDB)  | Shards           | Rebuild/cleanup                          |
Global index            | -                | Materialized views                       |
SAI/SASI                | sstables         | None                                     | Range queries/free text
Asymmetric Indexing     | nodes            | Increased coordination and orchestration | GPU/TPU/FPGA
Change data capture streams
■Similar to indexes (and materialized views)
■A copy of the data, but with a different key
■CDC = time keyed
■How many streams?
■Streams require user effort
■Keep track of last-consumed entry
How many streams?
■One stream?
■A node becomes a hotspot
■Hard to synchronize
■One stream per node?
■A shard within a node becomes a hotspot
■One stream per shard!
One stream per shard
■Number of shards in the cluster can change
■Need to keep pace and close/start new streams
■Need meta-stream to update about those changes
■Scalability considerations generate most of the work for this feature
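A sketch of the consumer side: a meta-stream announces "generations" of the stream set, and the consumer tracks a last-consumed offset per stream, so nothing is lost when the shard count changes and streams close or open. The data layout here is illustrative, not ScyllaDB's actual CDC API:

```python
# Each meta-stream entry ("generation") lists the streams that exist,
# one per shard; when the cluster grows, a new generation appears.
meta_stream = [
    {"generation": 1, "streams": ["s1", "s2"]},        # two shards
    {"generation": 2, "streams": ["s1", "s2", "s3"]},  # cluster grew
]

def consume(meta, read_stream, offsets):
    """Drain every stream of every generation, resuming each one from
    its last-consumed offset."""
    for gen in meta:
        for s in gen["streams"]:
            for entry in read_stream(s, offsets.get(s, 0)):
                offsets[s] = offsets.get(s, 0) + 1
                yield entry

# Toy stream contents keyed by stream id:
data = {"s1": ["a", "b"], "s2": ["c"], "s3": ["d"]}
print(list(consume(meta_stream, lambda s, off: data[s][off:], {})))
# -> ['a', 'b', 'c', 'd']
```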
Conclusions
■Scalability considerations permeate everything we do
■May sacrifice some efficiency to avoid breaking down at high scale
■But must also be competitive at low scale
Stay in Touch
Avi Kivity [email protected]
@AviKivity
@avikivity
Who uses it?