intuitions for distributed consensus at dc systems
philipeaton35
Oct 09, 2024
About This Presentation
From DC Systems October 2024.
dcsystems.xyz
Slide Content
intuitions for distributed consensus @eatonphil
meetups
- started a little meetup with @ngeloxyz in nyc
- turned up in cities around the us, india, germany
- very special to be a part of dc systems
who i am
- i’m phil
- developer for 10 years
- got interested in databases 4 years ago
- work for edb on a distributed postgres product
who you are
- a number of experts in this room
- who’ve worked on some of the systems i’ll cover
- risky of me! 😂
setting expectations
- this talk is about basics and behavior
- relevant for any (backend) developer
- (first talk i’ve given in ~5 years)
let’s go
you’ve got an app MyApp
and some data Key-Value Store
(on a single node) Key-Value Store
communication is simple: MyApp -> reads / writes -> Key-Value Store
we might write, and every time we read we get back what we wrote (a quick sketch follows below)

  Time  WhatsIpp -> Key-Value Store
  t0    set "id:1.name" "I. Asimov"   =>  ok<()>
  t1    get "id:1.name"               =>  ok<"I. Asimov">
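A minimal sketch of the single-node picture above, assuming a plain in-memory dict as the store (the KVStore class and its method names are mine, not from the talk): a get after a set always returns what was written.

```python
# A single in-memory key-value store on one node.
class KVStore:
    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        # t0: set "id:1.name" "I. Asimov" -> ok<()>
        self.data[key] = value

    def get(self, key: str) -> str | None:
        # t1: get "id:1.name" -> ok<"I. Asimov">
        return self.data.get(key)


store = KVStore()
store.set("id:1.name", "I. Asimov")
assert store.get("id:1.name") == "I. Asimov"  # we read back exactly what we wrote
```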
a.k.a. consistency
consistency is a spectrum
where “consistent” = linearizable
most easily illustrated via counterexample
linearizability counterexample: reading stale values

  Time  WhatsIpp -> Key-Value Store
  t0    set "id:1.name" "I. Asimov"   =>  ok<()>
  t1    get "id:1.name"               =>  ok<()>            (the write is missing: a stale read)
  t2    get "id:1.name"               =>  ok<"I. Asimov">
stale reads: one example of not being linearizable (a toy model of how this happens follows below)
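To make the counterexample concrete, here is a toy model of my own (not from the talk): writes go to a leader, reads are served by an asynchronously updated replica, and the replication lag reproduces the stale read at t1.

```python
# Writes are acknowledged by the leader; the replica is updated asynchronously,
# so reads served by the replica can miss acknowledged writes.
leader: dict[str, str] = {}
replica: dict[str, str] = {}

def write(key: str, value: str) -> None:
    leader[key] = value                      # t0: ok<()>

def read_from_replica(key: str) -> str | None:
    return replica.get(key)                  # may lag behind the leader

write("id:1.name", "I. Asimov")
print(read_from_replica("id:1.name"))        # t1: None (stale read: not linearizable)

replica.update(leader)                       # replication eventually catches up
print(read_from_replica("id:1.name"))        # t2: "I. Asimov"
```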
more formally
https://jepsen.io/consistency/models/linearizable “Linearizability is one of the strongest single-object consistency models, and implies that every operation appears to take place atomically, in some order, consistent with the real-time ordering of those operations”
linearizability
- for a single node, not super interesting
- but as a property in general, very useful
- a linearizable system can be treated as if it were a single node
- even if it consists of more than one node
single node crashes, entire system is unavailable
intuition #0: a single node is not highly available!
how to achieve high availability?
add nodes? MyApp -> reads / writes -> KV Store A, KV Store B, KV Store C ... ???
how to keep data in sync? WhatsIpp -> reads / writes -> KV Store A, KV Store B, KV Store C ... ???
read replicas!
read replicas are a thing: MyApp talks to KV Store A (Leader), which replicates to KV Store B and KV Store C
but on leader failure, read replicas may not be up-to-date
thus not linearizable
some options for linearizability + availability
- chain replication (ebs)
- kafka replication protocol
- foundationdb replication protocol
- client-side quorums
- distributed consensus
nodes form a cluster: KV Store A, KV Store B, KV Store C
the cluster elects a leader by majority vote (say, KV Store B)
if the leader becomes unavailable, the cluster elects a new leader (say, KV Store A)
the client (MyApp) talks with the (current) leader
state changes modeled as commands
  MyApp -> KV Store A (Leader): exec: 'set "id:1.name" "I. Asimov"'
  MyApp -> KV Store A (Leader): exec: 'get "id:1.name"'
note: log replication pauses while leader election is happening
commands stored in a log (KV Store A, Leader)

  log index | value                         | additional log entry metadata
  (null)    | (null)                        |
  1         | set "id:1.name" "I. Asimov"   | …
  2         | get "id:1.name"               | …
leader replicates logs in order
example state

  KV Store A (Leader)
    log index | value                         | additional log entry metadata
    (null)    | (null)                        |
    1         | set "id:1.name" "I. Asimov"   | …
    2         | get "id:1.name"               | …
    3         | get "id:2.name"               |

  KV Store B
    log index | value                         | additional log entry metadata
    (null)    | (null)                        |
    1         | set "id:1.name" "I. Asimov"   | …

  KV Store C
    log index | value                         | additional log entry metadata
    (null)    | (null)                        |
    1         | set "id:1.name" "I. Asimov"   | …
    2         | get "id:1.name"               | …
once a majority replicates a log entry, the entry is “committed”: the entry is durable
committed index = highest index replicated by a majority (a sketch of the computation follows below)

  KV Store A (Leader): entries 1, 2, 3
  KV Store B:          entry 1
  KV Store C:          entries 1, 2

  entries 1 and 2 are on a majority (A and C) -> committed
  entry 3 is only on the leader -> replicated there, but not yet committed
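A small sketch, under my own simplification that each node is represented only by the highest log index it holds, of how the committed index falls out of the definition above; the helper name is hypothetical.

```python
def committed_index(highest_index_per_node: list[int]) -> int:
    n = len(highest_index_per_node)
    majority = n // 2 + 1
    # After sorting descending, the majority-th value is held by at least
    # `majority` nodes, so every entry up to that index is committed.
    return sorted(highest_index_per_node, reverse=True)[majority - 1]

# Matches the example state above: A (leader) has 1..3, B has 1, C has 1..2.
print(committed_index([3, 1, 2]))  # -> 2: entries 1 and 2 committed, entry 3 not yet
```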
nodes apply committed entry to state machine, leader returns result to client
example kv store state machine

  def apply_log(state: Map<string, u64>, command: []u8) -> Result<Option<u64>>:
      command_type = get_command_type(command)
      if command_type == "set":
          state[get_set_command_key(command)] = get_set_command_value(command)
          return Ok(None)
      if command_type == "get":
          return Ok(Some(state[get_get_command_key(command)]))
      return Err("Unknown command {command_type}")
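For something runnable, here is a Python rendering of the same state machine; representing commands as the plain strings shown earlier and parsing them with shlex are my assumptions, not details from the talk.

```python
import shlex

# Commands are assumed to be plain strings like 'set "id:1.name" "I. Asimov"'.
def apply_log(state: dict[str, str], command: str) -> str | None:
    parts = shlex.split(command)      # handles the quoted key/value
    command_type = parts[0]
    if command_type == "set":
        _, key, value = parts
        state[key] = value
        return None
    if command_type == "get":
        _, key = parts
        return state.get(key)
    raise ValueError(f"unknown command {command_type!r}")

# Every node applies committed entries in log order, so they all converge on
# the same state; the leader returns the result to the client.
state: dict[str, str] = {}
result = None
for entry in ['set "id:1.name" "I. Asimov"', 'get "id:1.name"']:
    result = apply_log(state, entry)
print(result)  # -> I. Asimov
```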
note: reads need not be replicated like this
intuition #1: consensus involves more work than using a single node
- consensus means higher latency and lower throughput than a single node
easily encapsulated
distributed consensus libraries
  go:
    https://github.com/etcd-io/raft
    https://github.com/lni/dragonboat
    https://github.com/hashicorp/raft
  rust:
    https://github.com/tikv/raft-rs
    https://github.com/databendlabs/openraft
pedagogical walkthrough
but state machine is always application-specific
how is everyone using consensus?
modeling with distributed consensus
- consensus in the control plane & data plane
  - replicate statements (e.g. rqlite)
  - replicate data (e.g. cockroach, tigerbeetle, etcd)
- consensus in the control plane, data plane replicated separately
  - chain replication (ebs, delta by meta)
  - edb postgres distributed
some code
why again are we doing this?
fault tolerance
- a fault is when a service is unavailable for any reason
  - e.g. crashed process
  - e.g. network partition
  - e.g. general slowness (gray failure)
fault tolerance and raft
- handling f failures requires 2f+1 nodes (quick arithmetic sketch below)
- 3-node cluster can handle 1 failure
- 5-node cluster can handle 2 failures
- 101-node cluster can handle 50 failures
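A quick arithmetic sketch of the 2f+1 rule above; the helper names are mine.

```python
# A majority of the cluster must survive f failures, hence 2f+1 nodes.
def nodes_needed(failures_to_tolerate: int) -> int:
    return 2 * failures_to_tolerate + 1

def failures_tolerated(cluster_size: int) -> int:
    return (cluster_size - 1) // 2

for n in (3, 5, 101):
    print(f"{n}-node cluster tolerates {failures_tolerated(n)} failure(s)")
# 3 -> 1, 5 -> 2, 101 -> 50
```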
worst case scenarios
imagine
- processes constantly crashing due to bugs, oom killer, etc.
- disks may be slow
- network may be down or slow
impact
- takes longer to achieve consensus
- takes longer to replicate the log
- elections happen more frequently and take longer to succeed
- bonus! leader elections block replication
for clients: worst case means worse throughput, worse latency
see: sim.tigerbeetle.com
best case scenarios
imagine
- processes are stable
- disks are fast
- network is fast and reliable
impact
- leader elected quickly
- leader is stable
- logs replicated quickly
for clients: best case means better throughput, lower latency
intuition #2: throughput and latency deteriorate as the environment worsens
as you add nodes?
just add nodes!
- more nodes means more fault tolerance
- 5-node cluster only tolerates 2 faults
- 101-node cluster tolerates 50 faults
- certainly highly available
but more communication is its own penalty
more nodes, more problems
- 5-node cluster: leader makes 4 requests for every log entry, waits for 2 responses
- 101-node cluster: leader makes 100 requests for every log entry, waits for 50 responses
we’re as slow as our slowest sub-request
(see Designing Data-Intensive Applications, page 17)
and consider tail latency: fancy term for variability
numbers everyone should know (https://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf)
kind of misleading?
we become less predictable as the number of requests grows (a toy illustration follows below)
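A toy illustration of my own, in the spirit of the fan-out argument referenced above: if completing an operation means waiting for the slowest of n parallel sub-requests, and each sub-request independently has a small chance of landing in the latency tail, the chance that the overall operation is slow grows quickly with n. In a consensus cluster the leader only needs a majority of acks, but it still fans a request out to every follower.

```python
# Each sub-request has a small independent chance of landing in the latency
# tail; waiting for the slowest means any one slow response slows the whole
# operation. The 1% figure is made up for illustration.
p_slow = 0.01
for n in (1, 4, 50, 100):
    p_overall_slow = 1 - (1 - p_slow) ** n
    print(f"{n:>3} parallel sub-requests: "
          f"{p_overall_slow:.1%} chance the overall operation is slow")
# 1 -> 1.0%, 4 -> 3.9%, 50 -> 39.5%, 100 -> 63.4%
```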
intuition #3: throughput and latency deteriorate as the size of the cluster grows
how to scale?
distributed consensus as a building block
- some databases shard on top of consensus (cockroach, yugabyte, tidb)
  - each shard is replicated with consensus (see the sketch below)
- at the same time, some do not! (tigerbeetle, etcd, consul)
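A sketch of the sharding-on-top-of-consensus pattern, with entirely hypothetical shard counts and node names: keys hash to a shard, and each shard is its own consensus group with its own leader and replicated log.

```python
import hashlib

# Hypothetical layout: three shards, each its own 3-node consensus group.
NUM_SHARDS = 3
shard_groups = {
    0: ["a1", "a2", "a3"],
    1: ["b1", "b2", "b3"],
    2: ["c1", "c2", "c3"],
}

def shard_for_key(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def route(command: str, key: str) -> str:
    shard = shard_for_key(key)
    leader = shard_groups[shard][0]  # pretend the first node is that shard's leader
    return f"shard {shard} (leader {leader}) handles: {command}"

print(route('set "id:1.name" "I. Asimov"', "id:1.name"))
```

Scaling then comes from adding shards, not from growing any single consensus group; each shard still pays the consensus costs described above.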
so what does “distributed” mean?
- horizontal scaling means sharding
- distributed means sharding? not necessarily
- consensus means sharding? definitely not
intuition #4: distributed consensus has nothing to do with horizontal scaling
recapping
intuition takeaways
#0: a single node is not highly available
#1: distributed consensus is not free
#2: latency and throughput of distributed consensus get worse as the environment worsens
#3: latency and throughput of distributed consensus get worse as the size of the cluster grows
#4: distributed consensus has nothing to do with horizontal scaling
with thanks to
- alex miller (@alexmillerdb)
- jack vanlightly (@vanlightly)
- paul nowoczynski (@00pauln00)
- daniel chia (@DanielChiaJH)
- alex petrov (@ifesdjeen)