Surviving Majority Loss: When a Leader Fails by Konstantin Osipov

ScyllaDB, Mar 10, 2025

About This Presentation

In this lightning talk I will present all common combinations of ScyllaDB deployments, single and multi-DC, and how well they play out with Raft-based topology management. I'll present a new feature in ScyllaDB 2024.2, zero token nodes, and discuss how they could be used to improve resilience t...


Slide Content

A ScyllaDB Community
Surviving Majority Loss:
When a Leader Fails
Konstantin Osipov
Director of Engineering

Konstantin Osipov
■ScyllaDB geek since 2019
■All things consistency
■Aquarist and a father of three

Previous Episodes

■ScyllaDB consistent cluster management recap
■Choices for production cluster topology
■… and disaster recovery consequences of those
Presentation Agenda

Strong consistency recap

Data vs metadata

Metadata:
■Schema information: table, view, type definitions
■Topology information: nodes, tokens
■Replicated everywhere
■Not commutative
■Changes rarely

Data:
■Static and regular rows, counters
■Partitioned
■Commutative
■Changes frequently
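As a rough illustration of the split (hypothetical keyspace and table names, not from the talk), the first CQL statement below changes metadata, which is replicated everywhere and ordered through Raft, while the second writes regular rows, which are partitioned and commute:

-- metadata: a schema change, applied on every node in the cluster
CREATE TABLE ks.users (id uuid PRIMARY KEY, name text);

-- data: a regular write, sent only to the replicas owning this partition
INSERT INTO ks.users (id, name) VALUES (uuid(), 'alice');
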
Consistency of Metadata
[Diagram: ScyllaDB cluster with replication_factor=2]

The Centralized Topology Coordinator

■Runs alongside the Raft leader
■Highly available
■Drives the progress of topology changes
■Performs linearizable reads and writes of the topology
■Request coordinators still use their local view of the topology
■No extra coordination when executing user requests
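Each request coordinator routes with its locally replicated copy of the topology; that local view is also what a client can observe through the standard system tables (an illustrative query, not the coordinator's internal mechanism):

SELECT peer, data_center, rack, tokens FROM system.peers;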

All the fun ways to deploy
… and what happens next

Let’s start with a 4-node cluster

DC=my_dc
replication_factor=3, raft_quorum=3, data_quorum=2
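To make the arithmetic concrete: with replication_factor=3, a QUORUM read or write needs 2 of the 3 replicas (data_quorum = 2), while all 4 nodes vote in the Raft group, so its majority is 3 (raft_quorum = 3). A keyspace for this cluster could be defined roughly like this (hypothetical keyspace name, DC name from the slide):

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'my_dc': 3};

-- cqlsh: subsequent requests wait for 2 of the 3 replicas
CONSISTENCY QUORUM;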

Reaction to a node failure

DC=my_dc
replication_factor=3, raft_quorum=3, data_quorum=2

Lost majority - pre 2025.2

DC=my_dc
replication_factor=3, raft_quorum=3, data_quorum=2

Lost majority - 2025.2

DC=my_dc
replication_factor=3, raft_quorum=2, data_quorum=2

Two data center deployment - pre 2025.2

DC1=us_east DC2=eu_london
replication_factor=2, raft_quorum=3, data_quorum=4

Two data center deployment - 2025.2

DC1=us_east DC2=eu_london
replication_factor=2, raft_quorum=2, data_quorum=4
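A keyspace spanning both data centers could be defined roughly like this (hypothetical keyspace name, DC names from the slide); LOCAL_QUORUM is shown only to illustrate per-DC coordination, not as a recommendation from the talk:

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us_east': 2, 'eu_london': 2};

-- cqlsh: coordinate reads and writes within the local DC
CONSISTENCY LOCAL_QUORUM;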

Split brain - 2025.2

DC1=us_east DC2=eu_london
replication_factor=2, raft_quorum=2, data_quorum=4

Let’s talk about racks

DC1=us_east DC2=eu_london, with rack 1 and rack 2 in each DC
replication_factor=2, raft_quorum=3, data_quorum=3

Having enough nodes in each rack

DC1=us_east DC2=eu_london, with rack 1 and rack 2 in each DC
replication_factor=2, raft_quorum=5, data_quorum=3

Limited voter count

replication_factor=1, raft_quorum=2, data_quorum=2

Limited voter count

replication_factor=1, raft_quorum=3, data_quorum=2

Zero token nodes

DC1=us_east DC2=eu_london DC3=arbiter
replication_factor=2, raft_quorum=2, data_quorum=3
scylla.yaml:
join_ring: false
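A zero token node (join_ring: false) joins group0 and can vote, but owns no token ranges, so the arbiter DC is simply left out of the keyspace replication map (a sketch, hypothetical keyspace name):

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us_east': 2, 'eu_london': 2};
-- DC3=arbiter is not listed: its node stores no user data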

Zero token preference

replication_factor=2, raft_quorum=2, data_quorum=3
scylla.yaml:
join_ring: false

Disaster recovery after a majority loss - 2025.2

scylla.yaml:
recovery_leader: <uuid>

The new recovery procedure - 2025.2

1. Identify the alive nodes
2. Choose a recovery leader
3. Erase the group0 state from the system tables
4. Restart the nodes with the recovery_leader option

Recap

●Adaptive voter selection algorithm makes ScyllaDB clusters even more resilient to failures
●Zero token nodes allow improving resilience at low cost
●A new disaster recovery procedure makes it possible to form a new majority automatically and supports tablets
●Certain roll-out scenarios, such as:
○Even number of nodes
○Two data centers
○Low number of racks
… are more error-prone than others and should be avoided.

Stay in Touch
Kostja Osipov
[email protected]