intuitions for distributed consensus at dc systems
philipeaton35
Oct 09, 2024
About This Presentation
From DC Systems October 2024.
dcsystems.xyz
Slide Content
intuitions for distributed consensus @eatonphil
meetups
- started a little meetup with @ngeloxyz in nyc
- turned up in cities around the us, india, germany
- very special to be a part of dc systems
who i am
- i’m phil
- developer for 10 years
- got interested in databases 4 years ago
- work for edb on a distributed postgres product
who you are
- a number of experts in this room
- who’ve worked on some of the systems i’ll cover
- risky of me! 😂
setting expectations
- this talk is about basics and behavior
- relevant for any (backend) developer
- (first talk i’ve given in ~5 years)
let’s go
you’ve got an app MyApp
and some data Key-Value Store
(on a single node) Key-Value Store
communication is simple: MyApp -> reads / writes -> Key-Value Store
we might write, and every time we read we get back what we wrote (a quick sketch follows below)

  Time  WhatsIpp -> Key-Value Store
  t0    set "id:1.name" "I. Asimov"   =>  ok<()>
  t1    get "id:1.name"               =>  ok<"I. Asimov">
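A minimal sketch of the single-node picture above, assuming a plain in-memory dict as the store (the KVStore class and its method names are mine, not from the talk): a get after a set always returns what was written.

```python
# A single in-memory key-value store on one node.
class KVStore:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        # t0: set "id:1.name" "I. Asimov" -> ok<()>
        self.data[key] = value

    def get(self, key: str) -> str | None:
        # t1: get "id:1.name" -> ok<"I. Asimov">
        return self.data.get(key)


store = KVStore()
store.set("id:1.name", "I. Asimov")
assert store.get("id:1.name") == "I. Asimov"  # we read back exactly what we wrote
```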
a.k.a. consistency
consistency is a spectrum
where “consistent” = linearizable
most easily illustrated via counterexample
linearizability counterexample: reading stale values

  Time  WhatsIpp -> Key-Value Store
  t0    set "id:1.name" "I. Asimov"   =>  ok<()>
  t1    get "id:1.name"               =>  ok<()>            (the write is missing: a stale read)
  t2    get "id:1.name"               =>  ok<"I. Asimov">
stale reads: one example of not being linearizable (a toy model of how this happens follows below)
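To make the counterexample concrete, here is a toy model of my own (not from the talk): writes go to a leader, reads are served by an asynchronously updated replica, and the replication lag reproduces the stale read at t1.

```python
# Writes are acknowledged by the leader; the replica is updated asynchronously,
# so reads served by the replica can miss acknowledged writes.
leader: dict[str, str] = {}
replica: dict[str, str] = {}

def write(key: str, value: str) -> None:
    leader[key] = value                      # t0: ok<()>

def read_from_replica(key: str) -> str | None:
    return replica.get(key)                  # may lag behind the leader

write("id:1.name", "I. Asimov")
print(read_from_replica("id:1.name"))        # t1: None (stale read: not linearizable)

replica.update(leader)                       # replication eventually catches up
print(read_from_replica("id:1.name"))        # t2: "I. Asimov"
```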
more formally
https://jepsen.io/consistency/models/linearizable “Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations”
linearizability
- for a single node, not super interesting
- but as a property in general, very useful
- a linearizable system can be treated as if it were a single node
- even if it consists of more than one node
single node crashes, entire system is unavailable
intuition #0: a single node is not highly available!
how to achieve high availability?
add nodes? MyApp -> reads / writes -> KV Store A, KV Store B, KV Store C ... ???
how to keep data in sync? WhatsIpp -> reads / writes -> KV Store A, KV Store B, KV Store C ... ???
read replicas!
read replicas are a thing: MyApp talks to KV Store A (Leader), which replicates to KV Store B and KV Store C
but on leader failure, read replicas may not be up-to-date
thus not linearizable
some options for linearizability + availability
- chain replication (ebs)
- kafka replication protocol
- foundationdb replication protocol
- client-side quorums
- distributed consensus
nodes form a cluster: KV Store A, KV Store B, KV Store C
the cluster elects a leader by majority vote (say, KV Store B)
if the leader becomes unavailable, the cluster elects a new leader (say, KV Store A)
the client (MyApp) talks with the (current) leader
state changes modeled as commands
  MyApp -> KV Store A (Leader): exec: 'set "id:1.name" "I. Asimov"'
  MyApp -> KV Store A (Leader): exec: 'get "id:1.name"'
note: log replication pauses while leader election is happening
commands stored in a log (KV Store A, Leader)

  log index | value                         | additional log entry metadata
  (null)    | (null)                        |
  1         | set "id:1.name" "I. Asimov"   | …
  2         | get "id:1.name"               | …
leader replicates logs in order
example state

  KV Store A (Leader)
    log index | value                         | additional log entry metadata
    (null)    | (null)                        |
    1         | set "id:1.name" "I. Asimov"   | …
    2         | get "id:1.name"               | …
    3         | get "id:2.name"               |

  KV Store B
    log index | value                         | additional log entry metadata
    (null)    | (null)                        |
    1         | set "id:1.name" "I. Asimov"   | …

  KV Store C
    log index | value                         | additional log entry metadata
    (null)    | (null)                        |
    1         | set "id:1.name" "I. Asimov"   | …
    2         | get "id:1.name"               | …
once a majority replicates a log entry, the entry is “committed”: the entry is durable
committed index = highest index replicated by a majority (a sketch of the computation follows below)

  KV Store A (Leader): entries 1, 2, 3
  KV Store B:          entry 1
  KV Store C:          entries 1, 2

  entries 1 and 2 are on a majority (A and C) -> committed
  entry 3 is only on the leader -> replicated there, but not yet committed
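A small sketch, under my own simplification that each node is represented only by the highest log index it holds, of how the committed index falls out of the definition above; the helper name is hypothetical.

```python
def committed_index(highest_index_per_node: list[int]) -> int:
    n = len(highest_index_per_node)
    majority = n // 2 + 1
    # After sorting descending, the majority-th value is held by at least
    # `majority` nodes, so every entry up to that index is committed.
    return sorted(highest_index_per_node, reverse=True)[majority - 1]

# Matches the example state above: A (leader) has 1..3, B has 1, C has 1..2.
print(committed_index([3, 1, 2]))  # -> 2: entries 1 and 2 committed, entry 3 not yet
```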
nodes apply committed entry to state machine, leader returns result to client
example kv store state machine

  def apply_log(state: Map<string, u64>, command: []u8) -> Result<Option<u64>>:
      command_type = get_command_type(command)
      if command_type == "set":
          state[get_set_command_key(command)] = get_set_command_value(command)
          return Ok(None)
      if command_type == "get":
          return Ok(Some(state[get_get_command_key(command)]))
      return Err("Unknown command {command_type}")
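For something runnable, here is a Python rendering of the same state machine; representing commands as the plain strings shown earlier and parsing them with shlex are my assumptions, not details from the talk.

```python
import shlex

# Commands are assumed to be plain strings like 'set "id:1.name" "I. Asimov"'.
def apply_log(state: dict[str, str], command: str) -> str | None:
    parts = shlex.split(command)      # handles the quoted key/value
    command_type = parts[0]
    if command_type == "set":
        _, key, value = parts
        state[key] = value
        return None
    if command_type == "get":
        _, key = parts
        return state.get(key)
    raise ValueError(f"unknown command {command_type!r}")

# Every node applies committed entries in log order, so they all converge on
# the same state; the leader returns the result to the client.
state: dict[str, str] = {}
result = None
for entry in ['set "id:1.name" "I. Asimov"', 'get "id:1.name"']:
    result = apply_log(state, entry)
print(result)  # -> I. Asimov
```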
note: reads need not be replicated like this
intuition #1: consensus involves more work than using a single node
- consensus means higher latency and lower throughput than a single node
easily encapsulated
distributed consensus libraries
  go:
    https://github.com/etcd-io/raft
    https://github.com/lni/dragonboat
    https://github.com/hashicorp/raft
  rust:
    https://github.com/tikv/raft-rs
    https://github.com/databendlabs/openraft
pedagogical walkthrough
but state machine is always application-specific
how is everyone using consensus?
modeling with distributed consensus
- consensus in the control plane & data plane
  - replicate statements (e.g. rqlite)
  - replicate data (e.g. cockroach, tigerbeetle, etcd)
- consensus in the control plane, data plane replicated separately
  - chain replication (ebs, delta by meta)
  - edb postgres distributed
some code
why again are we doing this?
fault tolerance
- a fault is when a service is unavailable for any reason
  - e.g. crashed process
  - e.g. network partition
  - e.g. general slowness (gray failure)
fault tolerance and raft
- handling f failures requires 2f+1 nodes (quick arithmetic sketch below)
- 3-node cluster can handle 1 failure
- 5-node cluster can handle 2 failures
- 101-node cluster can handle 50 failures
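A quick arithmetic sketch of the 2f+1 rule above; the helper names are mine.

```python
# A majority of the cluster must survive f failures, hence 2f+1 nodes.
def nodes_needed(failures_to_tolerate: int) -> int:
    return 2 * failures_to_tolerate + 1

def failures_tolerated(cluster_size: int) -> int:
    return (cluster_size - 1) // 2

for n in (3, 5, 101):
    print(f"{n}-node cluster tolerates {failures_tolerated(n)} failure(s)")
# 3 -> 1, 5 -> 2, 101 -> 50
```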
worst case scenarios
imagine
- processes constantly crashing due to bugs, oom killer, etc.
- disks may be slow
- network may be down or slow
impact
- takes longer to achieve consensus
- takes longer to replicate the log
- elections happen more frequently and take longer to succeed
- bonus! leader elections block replication
for clients: worst case means worse throughput, worse latency
see: sim.tigerbeetle.com
best case scenarios
imagine
- processes are stable
- disks are fast
- network is fast and reliable
impact
- leader elected quickly
- leader is stable
- logs replicated quickly
for clients: best case means better throughput, lower latency
intuition #2: throughput and latency deteriorate as the environment worsens
as you add nodes?
just add nodes!
- more nodes means more fault tolerance
- 5-node cluster only tolerates 2 faults
- 101-node cluster tolerates 50 faults
- certainly highly available
but more communication is its own penalty
more nodes, more problems
- 5-node cluster: leader makes 4 requests for every log entry, waits for 2 responses
- 101-node cluster: leader makes 100 requests for every log entry, waits for 50 responses
we’re as slow as our slowest sub-request
(see Designing Data-Intensive Applications, page 17)
and consider tail latency: fancy term for variability
numbers everyone should know (https://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf)
kind of misleading?
we become less predictable as the number of requests grows (a toy illustration follows below)
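A toy illustration of my own, in the spirit of the fan-out argument referenced above: if completing an operation means waiting for the slowest of n parallel sub-requests, and each sub-request independently has a small chance of landing in the latency tail, the chance that the overall operation is slow grows quickly with n. In a consensus cluster the leader only needs a majority of acks, but it still fans a request out to every follower.

```python
# Each sub-request has a small independent chance of landing in the latency
# tail; waiting for the slowest means any one slow response slows the whole
# operation. The 1% figure is made up for illustration.
p_slow = 0.01
for n in (1, 4, 50, 100):
    p_overall_slow = 1 - (1 - p_slow) ** n
    print(f"{n:>3} parallel sub-requests: "
          f"{p_overall_slow:.1%} chance the overall operation is slow")
# 1 -> 1.0%, 4 -> 3.9%, 50 -> 39.5%, 100 -> 63.4%
```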
intuition #3: throughput and latency deteriorate as the size of the cluster grows
how to scale?
distributed consensus as a building block
- some databases shard on top of consensus (cockroach, yugabyte, tidb)
  - each shard is replicated with consensus (see the sketch below)
- at the same time, some do not! (tigerbeetle, etcd, consul)
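A sketch of the sharding-on-top-of-consensus pattern, with entirely hypothetical shard counts and node names: keys hash to a shard, and each shard is its own consensus group with its own leader and replicated log.

```python
import hashlib

# Hypothetical layout: three shards, each its own 3-node consensus group.
NUM_SHARDS = 3
shard_groups = {
    0: ["a1", "a2", "a3"],
    1: ["b1", "b2", "b3"],
    2: ["c1", "c2", "c3"],
}

def shard_for_key(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def route(command: str, key: str) -> str:
    shard = shard_for_key(key)
    leader = shard_groups[shard][0]  # pretend the first node is that shard's leader
    return f"shard {shard} (leader {leader}) handles: {command}"

print(route('set "id:1.name" "I. Asimov"', "id:1.name"))
```

Scaling then comes from adding shards, not from growing any single consensus group; each shard still pays the consensus costs described above.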
so what does “distributed” mean?
- horizontal scaling means sharding
- distributed means sharding? not necessarily
- consensus means sharding? definitely not
intuition #4: distributed consensus has nothing to do with horizontal scaling
recapping
intuition takeaways
#0: a single node is not highly available
#1: distributed consensus is not free
#2: latency and throughput of distributed consensus get worse as the environment worsens
#3: latency and throughput of distributed consensus get worse as the size of the cluster grows
#4: distributed consensus has nothing to do with horizontal scaling
with thanks to
- alex miller (@alexmillerdb)
- jack vanlightly (@vanlightly)
- paul nowoczynski (@00pauln00)
- daniel chia (@DanielChiaJH)
- alex petrov (@ifesdjeen)