intuitions for distributed consensus at dc systems

philipeaton35, Oct 09, 2024
About This Presentation

From DC Systems October 2024.

dcsystems.xyz


Slide Content

intuitions for distributed consensus @eatonphil

meetups
- started a little meetup with @ngeloxyz in nyc
- turned up in cities around the us, india, and germany
- very special to be a part of dc systems

who i am
- i’m phil
- developer for 10 years
- got interested in databases 4 years ago
- work for edb on distributed postgres product

who you are
- a number of experts in this room
- who’ve worked on some of the systems i’ll cover
- risky of me! 😂

setting expectations
- this talk is about basics and behavior
- relevant for any (backend) developer
- (first talk i’ve given in ~5 years)

let’s go

you’ve got an app MyApp

and some data Key-Value Store

(on a single node) Key-Value Store

communication is simple Key-Value Store MyApp Reads / Writes

we might write, and every time we read, we get back what we wrote:

  t0  WhatsIpp → Key-Value Store: set "id:1.name" "I. Asimov"  →  ok<()>
  t1  WhatsIpp → Key-Value Store: get "id:1.name"              →  ok<"I. Asimov">
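
a single-node store really is this simple; a minimal python sketch (mine, not from the slides, with made-up class and method names):

class KVStore:
    # one process, one in-memory map: every read sees the latest completed write
    def __init__(self):
        self.state = {}

    def set(self, key, value):
        self.state[key] = value             # ok<()>

    def get(self, key):
        return self.state.get(key)          # ok<value>, or None if never set

store = KVStore()
store.set("id:1.name", "I. Asimov")                  # t0
assert store.get("id:1.name") == "I. Asimov"         # t1: we read what we wrote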

a.k.a. consistency

consistency is a spectrum

where “consistent” = linearizable

most easily illustrated via counterexample

linearizability counterexample: reading stale values

  t0  WhatsIpp → Key-Value Store: set "id:1.name" "I. Asimov"  →  ok<()>
  t1  WhatsIpp → Key-Value Store: get "id:1.name"              →  ok<()>            (stale read)
  t2  WhatsIpp → Key-Value Store: get "id:1.name"              →  ok<"I. Asimov">

stale reads: one example of not being linearizable

more formally

https://jepsen.io/consistency/models/linearizable “Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations”
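
to make that definition concrete for the stale-read case, here is a small illustrative checker (my sketch, with a made-up find_stale_reads helper; it assumes operations complete one at a time in the real-time order given, so it is much simpler than a full linearizability checker):

def find_stale_reads(history):
    # history: list of (op, key, value) tuples in real-time completion order;
    # a read is stale if it returns something other than the latest completed write
    latest = {}
    violations = []
    for op, key, value in history:
        if op == "set":
            latest[key] = value
        elif op == "get" and key in latest and value != latest[key]:
            violations.append((key, value, latest[key]))
    return violations

history = [
    ("set", "id:1.name", "I. Asimov"),   # t0: ok<()>
    ("get", "id:1.name", None),          # t1: ok<()>, the stale read
    ("get", "id:1.name", "I. Asimov"),   # t2: ok<"I. Asimov">
]
assert find_stale_reads(history) == [("id:1.name", None, "I. Asimov")]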

linearizability
- for a single node, not super interesting
- but as a property in general, very useful
- a linearizable system can be treated as if it were a single node
- even if it consists of more than one node

single node crashes, entire system is unavailable

intuition #0: a single node is not highly available!

how to achieve high availability?

add nodes? MyApp Reads / Writes KV Store A KV Store B KV Store C ???

how to keep data in sync? WhatsIpp Reads / Writes KV Store A KV Store B KV Store C ???

read replicas!

read replicas are a thing MyApp KV Store A (Leader) KV Store B KV Store C

but on leader failure MyApp KV Store A (Leader) KV Store B KV Store C

read replicas may not be up-to-date MyApp KV Store A (Leader) KV Store B KV Store C

thus not linearizable MyApp KV Store A (Leader) KV Store B KV Store C

some options for linearizability + availability
- chain replication (ebs)
- kafka replication protocol
- foundationdb replication protocol
- client-side quorums
- distributed consensus

further reading: data replication design spectrum

but most popular

and for this talk

distributed consensus
- linearizability + availability
- examples: raft, multipaxos, viewstamped replication

used everywhere

kubernetes

elasticsearch

mongodb

cockroachdb

tigerbeetle

edb postgres distributed

etcd, redpanda, scylla, consul, hazelcast, yugabyte, clickhouse, nats, kafka, tidb, neo4j, rabbitmq, etc. https://en.wikipedia.org/wiki/Raft_(algorithm)

zooming in on

raft

at a high level

nodes form a cluster KV Store A KV Store B KV Store C

cluster elects a leader by majority vote KV Store A KV Store B (Leader) KV Store C

if a leader becomes unavailable KV Store A KV Store B (Leader) KV Store C

the cluster elects a new leader KV Store A (Leader) KV Store B KV Store C
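
the election rule itself is just a majority count; a trivial sketch (real raft elections also involve terms and a log up-to-date check, which this ignores):

def wins_election(votes_received, cluster_size):
    # a candidate becomes leader only with votes from a majority of the
    # cluster (its own vote included)
    return votes_received >= cluster_size // 2 + 1

assert wins_election(2, 3)        # 2 of 3 votes: elected
assert not wins_election(1, 3)    # a lone vote is not a majority
assert wins_election(3, 5)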

client talks with the (current) leader KV Store A (Leader) KV Store B KV Store C MyApp

state changes modeled as commands MyApp KV Store A (Leader) KV Store B KV Store C exec: 'set "id:1.name" "I. Asimov"'

state changes modeled as commands MyApp KV Store A (Leader) KV Store B KV Store C exec: 'get "id:1.name"'

note: log replication pauses while leader election is happening

commands stored in a log

  KV Store A (Leader)
  log index | value                        | additional log entry metadata
  (null)    | (null)                       |
  1         | set "id:1.name" "I. Asimov"  | …
  2         | get "id:1.name"              | …

leader replicates logs in order

example state

  KV Store A (Leader): 1 set "id:1.name" "I. Asimov" | 2 get "id:1.name" | 3 get "id:2.name"
  KV Store B:          1 set "id:1.name" "I. Asimov"
  KV Store C:          1 set "id:1.name" "I. Asimov" | 2 get "id:1.name"

once a majority replicates a log entry, the entry is “committed”
- the entry is durable

committed index = highest index replicated by a majority

  KV Store A (Leader): 1 set "id:1.name" "I. Asimov" | 2 get "id:1.name" | 3 get "id:2.name"
  KV Store B:          1 set "id:1.name" "I. Asimov"
  KV Store C:          1 set "id:1.name" "I. Asimov" | 2 get "id:1.name"

  entries 1–2 are committed (replicated on a majority); entry 3 is replicated only on the leader and not yet committed
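
the leader can compute the committed index from how far each node has replicated; a minimal sketch, assuming the leader tracks a match index per node (real raft additionally requires the entry to come from the leader's current term):

def commit_index(match_indexes):
    # match_indexes: highest replicated log index per node, leader included;
    # the committed index is the highest index present on a majority of nodes
    majority = len(match_indexes) // 2 + 1
    return sorted(match_indexes, reverse=True)[majority - 1]

# the example state above: A (leader) has 3 entries, B has 1, C has 2
assert commit_index([3, 1, 2]) == 2   # entries 1 and 2 are committed, 3 is not yet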

nodes apply committed entry to state machine, leader returns result to client

example kv store state machine

def apply_log(state: Map<string, u64>, command: []u8) -> Result<Option<u64>>:
    command_type = get_command_type(command)
    if command_type == "set":
        state[get_set_command_key(command)] = get_set_command_value(command)
        return Ok(None)
    if command_type == "get":
        return Ok(Some(state[get_get_command_key(command)]))
    return Err("Unknown command {command_type}")
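
a runnable python version of the same idea, assuming a simple quoted-string command encoding like 'set "id:1.name" "I. Asimov"' (the slide leaves the parsing helpers abstract):

import shlex

def apply_log(state: dict, command: str):
    # split e.g. 'set "id:1.name" "I. Asimov"' into ["set", "id:1.name", "I. Asimov"]
    parts = shlex.split(command)
    if parts[0] == "set":
        state[parts[1]] = parts[2]
        return None                      # Ok(None)
    if parts[0] == "get":
        return state.get(parts[1])       # Ok(Some(value))
    raise ValueError(f"unknown command {parts[0]}")

# every node applies committed entries in log order; the leader also
# returns the result of the client's entry back to the client
state = {}
assert apply_log(state, 'set "id:1.name" "I. Asimov"') is None
assert apply_log(state, 'get "id:1.name"') == "I. Asimov"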

note: reads need not be replicated like this

intuition #1: consensus involves more work than using a single node
- consensus means higher latency, lower throughput than a single node

easily encapsulated

distributed consensus libraries
- go: https://github.com/etcd-io/raft, https://github.com/lni/dragonboat, https://github.com/hashicorp/raft
- rust: https://github.com/tikv/raft-rs, https://github.com/databendlabs/openraft
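
the libraries differ in detail, but the shape is usually similar: you supply the application-specific state machine, the library runs elections and replication, and it hands committed entries back for you to apply in order. a rough, hypothetical sketch of that shape (not the actual API of any library linked above):

class ConsensusNode:
    # hypothetical consensus-library interface, for intuition only
    def __init__(self, node_id, peers, apply_fn):
        self.node_id = node_id
        self.peers = peers          # addresses of the other cluster members
        self.apply_fn = apply_fn    # your application-specific state machine

    def propose(self, command):
        # the library's job: append to the leader's log, replicate to peers,
        # wait for a majority to acknowledge, apply on every node in log
        # order via apply_fn, and return the leader's result to the caller
        raise NotImplementedError("this part is what the library implements")

state = {}
node = ConsensusNode("a", ["b:4000", "c:4000"], lambda kv: state.update([kv]))  # kv = (key, value)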

pedagogical walkthrough

but state machine is always application-specific

how is everyone using consensus?

modeling with distributed consensus
- consensus in the control plane & data plane
  - replicate statements (e.g. rqlite)
  - replicate data (e.g. cockroach, tigerbeetle, etcd)
- consensus in the control plane, data plane replicated separately
  - chain replication (ebs, delta by meta)
  - edb postgres distributed

some code

why again are we doing this?

fault tolerance
- a fault is when a service is unavailable for any reason
  - e.g. crashed process
  - e.g. network partition
  - e.g. general slowness (gray failure)

fault tolerance and raft
- handling f failures requires 2f+1 nodes
- 3-node cluster can handle 1 failure
- 5-node cluster can handle 2 failures
- 101-node cluster can handle 50 failures
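
the arithmetic behind these numbers, as a trivial sketch:

def nodes_needed(f):
    # tolerating f failures requires 2f+1 nodes: a majority (f+1) must survive
    return 2 * f + 1

def faults_tolerated(n):
    return (n - 1) // 2

assert nodes_needed(1) == 3 and faults_tolerated(3) == 1
assert nodes_needed(2) == 5 and faults_tolerated(5) == 2
assert nodes_needed(50) == 101 and faults_tolerated(101) == 50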

worst case scenarios

imagine
- processes constantly crashing due to bugs, oom killer, etc.
- disks may be slow
- network may be down or slow

impact
- takes longer to achieve consensus
- takes longer to replicate the log
- elections happen more frequently, and take longer to succeed
- bonus! leader elections block replication

for clients: worst case means worse throughput, worse latency

see: sim.tigerbeetle.com

best case scenarios

imagine
- processes are stable
- disks are fast
- network is fast and reliable

impact
- leader elected quickly
- leader is stable
- logs replicated quickly

for clients: best case means better throughput, lower latency

intuition #2: throughput and latency deteriorate as the environment worsens

as you add nodes?

just add nodes!
- more nodes means more fault tolerance
- 5-node cluster only tolerates 2 faults
- 101-node cluster tolerates 50 faults
- certainly highly available

but more communication is its own penalty

more nodes, more problems
- 5-node cluster: leader makes 4 requests for every log entry, waits for 2 responses
- 101-node cluster: leader makes 100 requests for every log entry, waits for 50 responses

we’re as slow as our slowest sub-request
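
a rough way to see this (my sketch, with a made-up right-skewed latency distribution): per log entry the leader contacts every other node but only has to wait for the majority-th fastest reply, and that order statistic climbs the latency distribution as the cluster grows.

import random

def quorum_wait(cluster_size, sample_latency):
    # the leader sends the entry to every other node and waits until a
    # majority of the cluster (itself included) has it, i.e. it waits for
    # the (majority - 1)-th fastest follower reply
    followers = cluster_size - 1
    needed = (cluster_size // 2 + 1) - 1
    replies = sorted(sample_latency() for _ in range(followers))
    return replies[needed - 1]

random.seed(1)
latency = lambda: random.lognormvariate(0, 0.5)   # made-up per-request latency, in ms
for n in (3, 5, 101):
    waits = [quorum_wait(n, latency) for _ in range(10_000)]
    print(f"{n:>3} nodes: mean replication wait {sum(waits) / len(waits):.2f} ms")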

(from Designing Data-Intensive Applications, page 17)

and consider tail latency (a fancy term for variability)

numbers everyone should know
https://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf

kind of misleading?

latency is a distribution
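
which is why percentiles say more than a single average; an illustrative sketch with made-up samples:

def percentile(samples, p):
    # crude nearest-rank percentile, good enough for illustration
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

samples = [1.0] * 98 + [30.0, 120.0]            # mostly fast, with a long tail (ms)
print("p50:", percentile(samples, 50))          # 1.0
print("p99:", percentile(samples, 99))          # 120.0
print("avg:", sum(samples) / len(samples))      # ~2.5, which hides the tail entirely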

network

https://www.evanjones.ca/network-latencies-2021.html

disk

fio --name=fiotest --ioengine=sync --size 1Gb --rw=read --bs=1M --direct=1 --numjobs=4 --runtime=60 --startdelay=60 --group_reporting

we become less predictable as the number of requests grow

intuition #3: throughput and latency deteriorate as the size of the cluster grows

how to scale?

distributed consensus as a building block
- some databases shard on top of consensus
  - cockroach, yugabyte, tidb
  - each shard is replicated with consensus
- at the same time, some do not!
  - tigerbeetle, etcd, consul

example: cockroach

https://github.com/cockroachdb/cockroach/blob/master/docs/design.md

key range sharding

https://github.com/cockroachdb/cockroach/blob/master/docs/design.md

each shard is replicated
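
key-range sharding in miniature (a hypothetical layout, not cockroach's actual code): each contiguous range of keys belongs to one shard, and each shard's replicas form their own consensus group.

import bisect

split_keys = ["g", "n", "t"]                       # hypothetical range boundaries
shards = ["range-0", "range-1", "range-2", "range-3"]

def shard_for(key):
    # route a key to the shard whose range contains it; that shard is then
    # replicated with its own raft group, independently of the others
    return shards[bisect.bisect_right(split_keys, key)]

assert shard_for("alpha") == "range-0"
assert shard_for("id:1.name") == "range-1"
assert shard_for("zeta") == "range-3"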

so what does “distributed” mean?
- horizontal scaling means sharding
- distributed means sharding? not necessarily
- consensus means sharding? definitely not

intuition #4: distributed consensus has nothing to do with horizontal scaling

recapping

intuition takeaways
- #0: a single node is not highly available
- #1: distributed consensus is not free
- #2: latency and throughput of distributed consensus get worse as the environment worsens
- #3: latency and throughput of distributed consensus get worse as the size of the cluster grows
- #4: distributed consensus has nothing to do with horizontal scaling

with thanks to alex miller (@alexmillerdb), jack vanlightly (@vanlightly), paul nowoczynski (@00pauln00), daniel chia (@DanielChiaJH), and alex petrov (@ifesdjeen)

thank you