Square's Lessons Learned from Implementing a Key-Value Store with Raft

ScyllaDB · 28 slides · Jun 27, 2024

About This Presentation

Put simply, Raft is used to make a use case (e.g., a key-value store or an indexing system) more fault tolerant: replication keeps the service available despite server and network failures. Raft has been gaining ground due to its simplicity, without sacrificing consistency or performance.



Slide Content

Lessons Learnt from Implementing a Key-Value Store with Raft. Omar Elgabry, Software Engineer at Square.

Omar Elgabry, Software Engineer at Square. This is my second time at P99 CONF. Blog: https://medium.com/@OmarElgabry

Intro: Motivation. To go through the journey of implementing a KV store, and the micro-lessons one can learn from building a fault-tolerant, strongly consistent distributed DB using Raft.
Goals: to build a KV store that achieves the following:
fault tolerance: provide availability despite server and network failures, using replication
consistency guarantees (linearizability): behave as if all operations are executed on a single machine
performance guarantees: should be equivalent to other consensus algorithms such as Paxos
recoverability: recover from failures

ROWA (read one, write all): reads and writes (diagram: a read can be served by a single replica, while a write must reach the leader and all replicas before it completes).

ROWA (read one, write all): failures (diagram: the leader and a follower fail, forcing a configuration change).

Majority quorum: reads and writes (diagram: the leader waits for acknowledgments from a majority (⅔) of nodes).

Majority quorum: leader failures (diagram: the leader fails and one of the remaining replicas becomes leader).

ROWA vs Majority: Raft is more complex, ROWA is simpler. Raft does not rely on a single master to make critical decisions. Majority quorum systems have been gaining ground due to their tolerance of temporarily slow or broken replicas.

Raft

Log Replication: log replication is the backbone of Raft. (Diagram: log entries x=4, y=2, z=1, x=5 stored in the on-disk log and applied to the in-memory KV store.)

Log Replication: ... but why?
Logs order the client requests and commands as log entries.
They help nodes agree on a single execution order (same order → same final state),
help the leader ensure followers have identical logs,
help the leader re-send logs to lagging followers,
help with electing a new leader that has the most up-to-date log,
and allow logs to be replayed after server crashes and reboots.

Log Replication: applying log entries (diagram: the leader and both followers each apply x=4 to their state machines).

Log Replication: log properties.
Once a command in a log entry is applied by any server, no other server executes a different command for that log entry.
Applied log entries can't be rolled back.
A log entry might be overridden, but only if it hasn't been replicated and applied on a majority of followers.
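These ordering and apply rules are what make replicated state machines deterministic. Below is a minimal Go sketch (types and names are illustrative, not from the talk) of a KV state machine applying committed log entries strictly in index order, assuming simple set commands:

```go
package main

import "fmt"

// Command is a single client operation recorded in the log.
type Command struct {
	Key   string
	Value string
}

// Entry is one log slot: the term it was appended in plus the command.
type Entry struct {
	Term    int
	Command Command
}

// KVStore applies committed entries strictly in log order, so every
// replica that applies the same log reaches the same final state.
type KVStore struct {
	data        map[string]string
	lastApplied int // index of the last entry applied to data
}

func NewKVStore() *KVStore {
	return &KVStore{data: make(map[string]string)}
}

// Apply consumes entries [lastApplied+1 .. commitIndex]; applied entries
// are never rolled back, matching the log properties above.
func (kv *KVStore) Apply(log []Entry, commitIndex int) {
	for i := kv.lastApplied + 1; i <= commitIndex && i <= len(log); i++ {
		cmd := log[i-1].Command // log indices are 1-based in Raft
		kv.data[cmd.Key] = cmd.Value
		kv.lastApplied = i
	}
}

func main() {
	kv := NewKVStore()
	log := []Entry{
		{Term: 1, Command: Command{"x", "4"}},
		{Term: 1, Command: Command{"y", "2"}},
		{Term: 2, Command: Command{"x", "5"}},
	}
	kv.Apply(log, 2)     // only entries replicated on a majority are committed
	fmt.Println(kv.data) // map[x:4 y:2]
	kv.Apply(log, 3)
	fmt.Println(kv.data) // map[x:5 y:2]
}
```

Because every replica applies the same committed prefix in the same order, they all converge to the same map.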

Network Partitioning (diagram: a network partition separates the leader from a majority of followers).

Network Partitioning: split-brain syndrome (diagram: each side of the partition ends up with its own leader).

Leader Elections: voting (diagram: after the leader fails, a candidate requests votes, wins 3 to 1, and becomes the new leader).

Leader Elections: split votes (diagram: two candidates each receive 2 votes, so neither wins; randomized election timeouts let one candidate retry first and win the next round). See also: https://www.p99conf.io/session/square-engineerings-fail-fast-retry-soon-performance-optimization-technique
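As a hedged aside, split votes are usually broken by randomizing each node's election timeout so candidates rarely collide twice in a row. A small Go sketch, assuming the 150-300 ms range suggested in the Raft paper:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout so that two candidates
// rarely time out at the same moment and split the vote again.
func electionTimeout() time.Duration {
	const min, max = 150, 300 // milliseconds, the range suggested in the Raft paper
	return time.Duration(min+rand.Intn(max-min)) * time.Millisecond
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println("node", i, "election timeout:", electionTimeout())
	}
}
```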

Recoverability: snapshots (diagram: the Raft log and a snapshot of the KV state persisted to disk).

Recoverability: snapshot performance issues.
When to take a snapshot: based on log size, a fixed timer, etc.; the schedule should balance resource usage against performance.
How to keep processing write requests while taking a snapshot: copy-on-write (e.g., fork() in Linux). Client read requests and the snapshot access the same underlying KV store data; only the parts of the state being updated are copied, so the snapshot is not affected.
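A minimal Go sketch of the copy-on-write idea at the application level (an alternative to fork(); names such as cowStore are illustrative and not from the talk): while a snapshot is in flight, writes go to an overlay map, so the snapshotter and client reads keep seeing a frozen base without blocking each other.

```go
package main

import "fmt"

// cowStore sketches application-level copy-on-write: taking a snapshot
// freezes the current base map, and writes made while the snapshot is in
// flight go to an overlay, so snapshotting and client traffic do not
// block each other.
type cowStore struct {
	base     map[string]string // frozen while a snapshot is in flight
	overlay  map[string]string // updates made after the snapshot started
	snapping bool
}

// Snapshot freezes the base map and returns it for the snapshotter to
// iterate and persist to disk.
func (s *cowStore) Snapshot() map[string]string {
	s.snapping = true
	s.overlay = make(map[string]string)
	return s.base
}

// Set writes to the overlay while a snapshot is in flight so the frozen
// base (and therefore the snapshot) is never modified.
func (s *cowStore) Set(k, v string) {
	if s.snapping {
		s.overlay[k] = v
		return
	}
	s.base[k] = v
}

// Get prefers the overlay so clients always see their latest writes.
func (s *cowStore) Get(k string) (string, bool) {
	if s.snapping {
		if v, ok := s.overlay[k]; ok {
			return v, true
		}
	}
	v, ok := s.base[k]
	return v, ok
}

// EndSnapshot folds the overlay back into the base once the snapshot
// has been written to disk.
func (s *cowStore) EndSnapshot() {
	for k, v := range s.overlay {
		s.base[k] = v
	}
	s.overlay, s.snapping = nil, false
}

func main() {
	s := &cowStore{base: map[string]string{"x": "4"}}
	frozen := s.Snapshot()
	s.Set("x", "5")                  // goes to the overlay, not the frozen base
	latest, _ := s.Get("x")          // clients still see the newest value
	fmt.Println(frozen["x"], latest) // 4 5
	s.EndSnapshot()
}
```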

Recoverability: snapshot size considerations.
Raft's snapshot logic is reasonable if the KV store is small: it simplifies the logic by always copying the entire KV store state.
If a follower was offline for some time and the leader discarded its log while taking a snapshot, the leader can't repair that follower by sending log entries; instead, it sends a copy of its snapshot in chunks over the network.
Copying large amounts of data (for snapshots and for lagging followers) is not scalable; incremental snapshots (e.g., LSM trees) persist the KV store data on disk.
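A hedged sketch of shipping a snapshot in chunks, assuming the snapshot is already serialized to a byte slice; the 4 KB chunk size and function names are illustrative:

```go
package main

import "fmt"

// chunks splits a serialized snapshot into fixed-size pieces so a large
// snapshot can be streamed to a lagging follower instead of sent at once.
func chunks(snapshot []byte, size int) [][]byte {
	var out [][]byte
	for len(snapshot) > 0 {
		n := size
		if len(snapshot) < n {
			n = len(snapshot)
		}
		out = append(out, snapshot[:n])
		snapshot = snapshot[n:]
	}
	return out
}

func main() {
	snap := make([]byte, 10_000) // pretend serialized KV state
	for i, c := range chunks(snap, 4096) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```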

Scaling Raft: scale reads.
Consistency guarantees: no stale reads. The KV store must reflect the behavior expected of a single server; e.g., read requests started after a write operation completes should see the result of that write.
Can followers respond to read requests? No, unless stale reads can be avoided. https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf

Scaling Raft: scale reads.
Can the leader respond to read-only requests from its local copy, without replicating log entries? Yes, under two conditions:
No split-brain: a leader might return a stale value that has been updated by another, more recent leader. One solution is leases: after successful heartbeats to a majority, the leader is allowed to answer read-only requests for a lease period without replicating, and a new leader can't execute write requests until that lease period has elapsed.
No-op operation: a new leader must append and apply a no-op entry to the log at the start of its election term. This ensures all log entries from previous terms are consistent with the majority and applied before the new leader starts serving client requests.
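A minimal Go sketch of the lease check described above (field names and the 200 ms lease are illustrative assumptions): the leader records when it last heard from a majority and only serves local, non-replicated reads while that lease is still valid.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// leaseLeader tracks when the leader last heard from a majority of
// followers; local (non-replicated) reads are only allowed while the
// lease granted by that round of heartbeats is still valid.
type leaseLeader struct {
	lastMajorityAck time.Time
	lease           time.Duration
	data            map[string]string
}

var errLeaseExpired = errors.New("lease expired: renew via heartbeats or replicate the read")

// LocalRead serves the read from the leader's copy without appending a
// log entry, but only inside the lease window, which avoids stale reads
// after a silent leader change.
func (l *leaseLeader) LocalRead(key string) (string, error) {
	if time.Since(l.lastMajorityAck) > l.lease {
		return "", errLeaseExpired
	}
	return l.data[key], nil
}

func main() {
	l := &leaseLeader{
		lastMajorityAck: time.Now(),
		lease:           200 * time.Millisecond,
		data:            map[string]string{"x": "5"},
	}
	v, err := l.LocalRead("x")
	fmt.Println(v, err) // 5 <nil>
}
```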

Scaling Raft: scale writes.
Consistency guarantees: ordering and concurrent writes. The state at any point in time must reflect the result of sequential execution of the operations, in the same order they were sent (or received by the KV store). Concurrent operations can be executed in any order, as long as all clients see the same order.
Batching: batch multiple client requests so they are sent efficiently over the network and written to disk together; this optimizes throughput. The leader broadcasts the batch when it exceeds a size or timeout threshold.
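A minimal Go sketch of the batching idea (thresholds and names are illustrative): commands accumulate until the batch hits a size or age limit and are then flushed as one unit. A real implementation would also flush pending commands from a timer, not only on the next Add.

```go
package main

import (
	"fmt"
	"time"
)

// batcher accumulates commands and flushes them either when the batch
// reaches maxSize or when maxDelay has passed since the first command,
// trading a little latency for higher throughput.
type batcher struct {
	maxSize  int
	maxDelay time.Duration
	pending  []string
	firstAt  time.Time
	flush    func([]string) // e.g. append to the Raft log and broadcast
}

func (b *batcher) Add(cmd string) {
	if len(b.pending) == 0 {
		b.firstAt = time.Now()
	}
	b.pending = append(b.pending, cmd)
	if len(b.pending) >= b.maxSize || time.Since(b.firstAt) >= b.maxDelay {
		b.flush(b.pending)
		b.pending = nil
	}
}

func main() {
	b := &batcher{
		maxSize:  3,
		maxDelay: 10 * time.Millisecond,
		flush:    func(cmds []string) { fmt.Println("replicating batch:", cmds) },
	}
	for _, c := range []string{"set x=4", "set y=2", "set x=5", "set z=1"} {
		b.Add(c)
	}
}
```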

Pipelining: send the next batch to followers without waiting for the previous batch to finish; this optimizes latency. The leader can send the next batch halfway through the processing of the first batch.
Parallelizing: run independent operations in parallel, e.g., disk writes in parallel with sending batches over the network.
Scaling Raft: scale writes.

Scaling Raft: scale writes (diagram: the leader writes Batch 1 to disk while Batch 2 is already being sent to the followers).
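A hedged Go sketch of pipelining (latencies and names are simulated and illustrative): the leader fires each batch immediately instead of waiting for the previous batch's acknowledgment, so network and disk latencies overlap.

```go
package main

import (
	"fmt"
	"time"
)

// replicateAsync pretends to send one batch to a follower over the
// network and delivers the acknowledgment later on the acks channel.
func replicateAsync(batchID int, batch []string, acks chan<- int) {
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated network + disk latency
		acks <- batchID
	}()
}

func main() {
	batches := [][]string{{"set x=4", "set y=2"}, {"set x=5"}, {"set z=1"}}
	acks := make(chan int, len(batches))

	// Pipelining: fire every batch immediately instead of waiting for the
	// previous batch's acknowledgment, so per-batch latencies overlap.
	for id, batch := range batches {
		replicateAsync(id, batch, acks)
	}

	// Acknowledgments may arrive out of order; a real leader would match
	// them to log indices before advancing the commit index.
	for range batches {
		fmt.Println("acknowledged batch", <-acks)
	}
}
```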

References:
Diego Ongaro and John Ousterhout, Stanford University, In Search of an Understandable Consensus Algorithm, https://raft.github.io/raft.pdf
Massachusetts Institute of Technology (MIT), MIT 6.824 Distributed Systems class, https://pdos.csail.mit.edu/6.824/index.html
Diego Ongaro, Consensus: Bridging Theory and Practice, Stanford University, https://github.com/ongardie/dissertation/blob/master/stanford.pdf
Unmesh Joshi, Patterns of Distributed Systems, https://martinfowler.com/articles/patterns-of-distributed-systems

Omar Elgabry, [email protected], LinkedIn: omarelgabry. Thank you! Let's connect.