Architecture for Extreme Scale by Avi Kivity

ScyllaDB 235 views 33 slides Mar 05, 2025
Slide 1
Slide 1 of 33
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33

About This Presentation

CTO Avi Kivity shares how scalability is core to ScyllaDB's architecture.


Slide Content

A ScyllaDB Community
Architecture for Extreme Scale:
Scalability is a Feature
Avi Kivity
CTO

Avi Kivity
■Linux KVM
■First line written
■Seastar
■First line written
■ScyllaDB
■First line written

■What is Scalability?
■Case Studies

Presentation Agenda

What is Scalability?

Scalability - definition
■The property of a program whose throughput is linear with the amount of
computing resources dedicated to it


■Scalability does not imply fast single-node performance, only that as you
increase resources, so does performance

Practical scalability
■The property of a program whose throughput is near-linear with the
amount of computing resources dedicated to it, up to a reasonable limit,
and for targeted workloads


■Scalability does not imply fast single-node performance, only that as you
increase resources, so does performance cannot sacrifice low node
count performance, or it becomes impractical. A practical system must
deliver good performance in a three-node cluster.

Horizontal and Scalability
■Horizontal: Scale with the number of nodes in the system
■Vertical: Scale with the size (compute power) of individual nodes

■We need both!
Vertical

Why Horizontal
■Eventually any node size is limited
■Needed for geographical distribution
■Needed for availability and fault tolerance

Why Vertical
■With small nodes, overhead quickly dominates
■1000 nodes with 3-year MTBF -> one failure per day
■Upgrade hell

Why both?
■Work for horizontal scaling benefits vertical
scaling (thread-per-core)
■Reasonable cluster sizes reduce maintenance
■Benefit from geographical distribution

■100x 120 TB = 12 PB

■Should be enough for anybody

ScyllaDB Goals
■Fast algorithms and implementation
■Good scalability
■Practical tradeoffs

Fun fact
■The name “Scylla” was chosen to evoke “Scale”
■Then we learned how it’s pronounced

Practical scalability goals
■100 nodes/datacenter
■100 nodes, 60 TB/node -> 6PB/datacenter
■Thousand-node clusters with 1TB/node are skewed
■Terrible for maintenance
■256 vcpu / TB-RAM class machines
■Grow with hardware evolution

Components of Performance
■Good algorithms and data structures
■Scalability across the cluster
■Scalability within a node
■Efficient implementation

Seastar and thread-per-core
■Clustered application already solves compute and data distribution
across nodes
■Re-apply the solution towards compute and data distribution within
cores
■Sacrifices some efficiency for high utilization

Memtable/cache sizing
■How much memory to allocate to memtables?
■How much memory to allocate to cache?
■How much memory to leave free?

Log-Structured Merge Tree Tiers
■Write amplification is O(log N)
■But what is N?

■Size of largest sstable: ~ disk size
■Size of smallest sstable: ~ memtable size

■To reduce write amplification, we must have the largest memtable size
we can

Memtable size allocation
■Allocate half of the machine’s memory to memtables
■The other half for cache and “incidentals”

Why half and half?
■Sensitivity analysis
■At 50%/50%, a 5% change has the same impact on both sides
■At 90%/10%, a 5% change has a drastic impact on the smaller side

Global index implementation choices
Solution review
■Cassandra secondary index
■ScyllaDB Global Index
■Cassandra Storage Attached Index (SAI/SASI)

Cassandra Secondary Index
■Each node indexes its own data
■Index is stored in an index table
■Invalidated/rebuilt/cleaned on node
bootstrap
■Work scales with number of nodes

Index
Index
Index
Index
Index
Index

ScyllaDB Global Index
■Use materialized view to store index
■Impervious to node bootstrap
■Additional hops to access data
■Work does not scale with cluster size
■Low cardinality indexed columns cause
large/hot partitions
Data Index

Cassandra SAI/SASI
■Each sstable indexes its own data
■Supports ranges and complex queries
in addition to simple matches
■Index rebuilt during compaction
■Much more memory/CPU efficient
■Work scales with number of sstables
in the cluster
sstable
Index
sstable
Index
sstable
Index

Indexing comparison
Index type Antiscaling with Side effects Additional benefits
Local index (Cassandra) Nodes Rebuild/cleanup
Local index (ScyllaDB) Shards Rebuild/cleanup
Global index - Materialized views
SAI/SASI sstables None Range queries/free text

Vector Indexes
■No partition key can be extracted from the vector
■HNSW computational complexity is O(log N)
■Partitioned index ->
■Antiscaling!

Vector Index - Solution
Data
■Vertical scaling - use few large nodes
■Prefer increasing the node size to adding nodes
■Prefer duplicating data to partitioning
■Asymmetric architecture
■Bonus: use GPU/TPU/FPGA
Index

Indexing comparison (with vectors)
Index type Antiscaling with Side effects Additional benefits
Local index (Cassandra) Nodes Rebuild/cleanup
Local index (ScyllaDB) Shards Rebuild/cleanup
Global index - Materialized views
SAI/SASI sstables None Range queries/free text
Asymmetric Indexing nodes Increased coordination and
orchestration
GPU/TPU/FPGA

Change data capture streams
■Similar to indexes (and materialized views)
■A copy of the data, but with a different key
■CDC = time keyed
■How many streams?
■Streams require user effort
■Keep track of last-consumed entry

How many streams?
■One stream?
■A node becomes a hotspot
■Hard to synchronize
■One stream per node?
■A shard within a node becomes a hotspot?
■One stream per shard!

One stream per shard
■Number of shards in the cluster can change
■Need to keep pace and close/start new streams
■Need meta-stream to update about those changes


■Scalability considerations generates most of the work for this feature

Conclusions
■Scalability considerations permeates everything we do
■May sacrifice some efficiency to avoid breaking down at high scale
■But must also be competitive at low scale

Stay in Touch
Your First and Last Name
[email protected]
@AviKivity
@avikivity
Who uses it?
Tags