Surviving Majority Loss: When a Leader Fails by Konstantin Osipov

ScyllaDB, Mar 10, 2025

About This Presentation

In this lightning talk I will present all common combinations of ScyllaDB deployments, single and multi-DC, and how well they play out with Raft-based topology management. I'll present a new feature in ScyllaDB 2024.2, zero token nodes, and discuss how they could be used to improve resilience t...


Slide Content

A ScyllaDB Community
Surviving Majority Loss:
When a Leader Fails
Konstantin Osipov
Director of Engineering

Konstantin Osipov
■ScyllaDB geek since 2019
■All things consistency
■Aquarist and a father of three

Previous Episodes

■ScyllaDB consistent cluster management recap
■Choices for production cluster topology
■… and disaster recovery consequences of those
Presentation Agenda

Strong consistency recap

Data vs metadata

Metadata:
■Schema information: table, view, type definitions
■Topology information: nodes, tokens
■Replicated everywhere
■Not commutative
■Changes rarely

Data:
■Static and regular rows, counters
■Partitioned
■Commutative
■Changes frequently
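As a rough illustration of the split (hypothetical keyspace and table names, not from the talk), the first CQL statement below changes metadata, which is replicated everywhere and ordered through Raft, while the second writes regular rows, which are partitioned and commute:

-- metadata: a schema change, applied on every node in the cluster
CREATE TABLE ks.users (id uuid PRIMARY KEY, name text);

-- data: a regular write, sent only to the replicas owning this partition
INSERT INTO ks.users (id, name) VALUES (uuid(), 'alice');
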
Consistency of Metadata
[Diagram: ScyllaDB cluster with replication_factor=2]

The Centralized Topology Coordinator

■Runs alongside the Raft leader
■Highly available
■Drives the progress of topology changes
■Performs linearizable reads and writes of the topology
■Request coordinators still use their local view of the topology
■No extra coordination when executing user requests
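Each request coordinator routes with its locally replicated copy of the topology; that local view is also what a client can observe through the standard system tables (an illustrative query, not the coordinator's internal mechanism):

SELECT peer, data_center, rack, tokens FROM system.peers;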

All the fun ways to deploy
… and what happens next

Let’s start with a 4-node cluster

DC=my_dc
replication_factor=3, raft_quorum=3, data_quorum=2
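To make the arithmetic concrete: with replication_factor=3, a QUORUM read or write needs 2 of the 3 replicas (data_quorum = 2), while all 4 nodes vote in the Raft group, so its majority is 3 (raft_quorum = 3). A keyspace for this cluster could be defined roughly like this (hypothetical keyspace name, DC name from the slide):

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'my_dc': 3};

-- cqlsh: subsequent requests wait for 2 of the 3 replicas
CONSISTENCY QUORUM;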

Reaction to a node failure

DC=my_dc
replication_factor=3, raft_quorum=3, data_quorum=2

Lost majority - pre 2025.2

DC=my_dc
replication_factor=3, raft_quorum=3, data_quorum=2

Lost majority - 2025.2

DC=my_dc
replication_factor=3, raft_quorum=2, data_quorum=2

Two data center deployment - pre 2025.2

DC1=us_east DC2=eu_london
replication_factor=2, raft_quorum=3, data_quorum=4

Two data center deployment - 2025.2

DC1=us_east DC2=eu_london
replication_factor=2, raft_quorum=2, data_quorum=4
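A keyspace spanning both data centers could be defined roughly like this (hypothetical keyspace name, DC names from the slide); LOCAL_QUORUM is shown only to illustrate per-DC coordination, not as a recommendation from the talk:

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us_east': 2, 'eu_london': 2};

-- cqlsh: coordinate reads and writes within the local DC
CONSISTENCY LOCAL_QUORUM;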

Split brain - 2025.2

DC1=us_east DC2=eu_london
replication_factor=2, raft_quorum=2, data_quorum=4

Let’s talk about racks

DC1=us_east DC2=eu_london, with rack 1 and rack 2 in each DC
replication_factor=2, raft_quorum=3, data_quorum=3

Having enough nodes in each rack

DC1=us_east DC2=eu_london, with rack 1 and rack 2 in each DC
replication_factor=2, raft_quorum=5, data_quorum=3

Limited voter count

replication_factor=1, raft_quorum=2, data_quorum=2

Limited voter count

replication_factor=1, raft_quorum=3, data_quorum=2

Zero token nodes

DC1=us_east DC2=eu_london DC3=arbiter
replication_factor=2, raft_quorum=2, data_quorum=3
scylla.yaml:
join_ring: false
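A zero token node (join_ring: false) joins group0 and can vote, but owns no token ranges, so the arbiter DC is simply left out of the keyspace replication map (a sketch, hypothetical keyspace name):

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us_east': 2, 'eu_london': 2};
-- DC3=arbiter is not listed: its node stores no user data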

Zero token preference

replication_factor=2, raft_quorum=2, data_quorum=3
scylla.yaml:
join_ring: false

Disaster recovery after a majority loss - 2025.2

scylla.yaml:
recovery_leader: <uuid>

The new recovery procedure - 2025.2

1. Identify the alive nodes
2. Choose a recovery leader
3. Erase the group0 state from the system tables
4. Restart the nodes with the recovery_leader option

Recap

●Adaptive voter selection algorithm makes ScyllaDB clusters even more resilient to failures
●Zero token nodes allow improving resilience at low cost
●A new disaster recovery procedure makes it possible to form a new majority automatically and supports tablets
●Certain roll-out scenarios, such as:
○Even number of nodes
○Two data centers
○Low number of racks
… are more error-prone than others and should be avoided.

Stay in Touch
Kostja Osipov
[email protected]