Square's Lessons Learned from Implementing a Key-Value Store with Raft
ScyllaDB
About This Presentation
To put it simply, Raft is used to make a use case (e.g., key-value store, indexing system) more fault tolerant to increase availability using replication (despite server and network failures). Raft has been gaining ground due to its simplicity without sacrificing consistency and performance.
Although we'll cover Raft's building blocks, this is not about the Raft algorithm; it is more about the micro-lessons one can learn from building fault-tolerant, strongly consistent distributed systems using Raft. Things like majority agreement rule (quorum), write-ahead log, split votes & randomness to reduce contention, heartbeats, split-brain syndrome, snapshots & logs replay, client requests dedupe & idempotency, consistency guarantees (linearizability), leases & stale reads, batching & streaming, parallelizing persisting & broadcasting, version control, and more!
And believe it or not, you might be using some of these techniques without even realizing it!
This is inspired by the Raft paper (raft.github.io), publications & courses on Raft, and an attempt to implement a key-value store using Raft as a side project.
Size: 1.58 MB
Language: en
Added: Jun 27, 2024
Slides: 28 pages
Slide Content
Lessons Learnt from Implementing a Key-Value Store with Raft
Omar Elgabry, Software Engineer at Square
Omar Elgabry, Software Engineer at Square
I am a software engineer at Square. This is my second time at the P99 CONF.
Blog: https://medium.com/@OmarElgabry
Intro
Motivation: to go through the journey of implementing a KV store, and the micro-lessons one can learn from building a fault-tolerant, strongly consistent distributed DB using Raft.
Goals: to build a KV store that achieves the following:
- fault tolerance: provide availability despite server and network failures, using replication
- consistency guarantees (linearizability): behave as if all operations are executed on a single machine
- performance guarantees: should be equivalent to other consensus algorithms such as Paxos
- recoverability: recovers from failures
ROWA vs Majority
- Raft is more complex, vs. the simpler ROWA (read-one-write-all)
- Raft does not rely on a single master to make critical decisions
- Majority quorum systems have been gaining ground due to their toleration of temporarily slow/broken replicas
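To make that tradeoff concrete, here is a tiny Go illustration (not from the talk) of the write-set sizes: ROWA needs every replica for a write, while a majority quorum needs only ⌊N/2⌋+1, so a minority of slow or broken replicas cannot block writes.

```go
// A small illustration of the ROWA-vs-majority point, assuming N replicas:
// ROWA requires every replica for a write, a majority quorum requires only
// floor(N/2)+1, so a minority of failures cannot stall writes.
package main

import "fmt"

func main() {
	for _, n := range []int{3, 5, 7} {
		rowaWrite := n            // ROWA: a write must reach all replicas
		majority := n/2 + 1       // quorum: a write must reach a majority
		tolerated := n - majority // replicas that may be slow/down without blocking writes
		fmt.Printf("N=%d  ROWA write set=%d  majority=%d  tolerated failures=%d\n",
			n, rowaWrite, majority, tolerated)
	}
}
```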
Raft
Log Replication
Log replication is the backbone of Raft.
[Diagram: a log of entries (x=4, y=2, z=1, x=5) persisted on disk and applied in order to the in-memory KV store]
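Below is a minimal Go sketch, not the speaker's code, of the idea in this slide: a log of (term, command) entries appended in order and replayed in order against an in-memory map, so every node that applies the same log ends in the same state. All type and function names are made up for illustration.

```go
// A minimal sketch of the replicated log that backs the KV store: entries are
// appended in order, and applying them in order reproduces the same final state.
package main

import "fmt"

// Command is a single client write, e.g. "set x = 4".
type Command struct {
	Key   string
	Value int
}

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Term    int     // leader term when the entry was appended
	Command Command // the state-machine command to apply
}

// KVStore applies log entries, in log order, to an in-memory map.
type KVStore struct {
	log   []LogEntry
	state map[string]int
}

func NewKVStore() *KVStore {
	return &KVStore{state: make(map[string]int)}
}

// Append adds a client command to the log (on the leader this happens before
// the entry is replicated to followers and committed).
func (kv *KVStore) Append(term int, cmd Command) {
	kv.log = append(kv.log, LogEntry{Term: term, Command: cmd})
}

// Apply replays the log in order; same order => same final state on every node.
func (kv *KVStore) Apply() {
	for _, e := range kv.log {
		kv.state[e.Command.Key] = e.Command.Value
	}
}

func main() {
	kv := NewKVStore()
	kv.Append(1, Command{"x", 4})
	kv.Append(1, Command{"y", 2})
	kv.Append(1, Command{"z", 1})
	kv.Append(2, Command{"x", 5}) // later write to x overrides the earlier one
	kv.Apply()
	fmt.Println(kv.state) // map[x:5 y:2 z:1]
}
```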
Log Replication: … but why?
Ordering the log entries (client requests and commands) helps to:
- have nodes agree on a single execution order; same order → same final state
- have the leader ensure followers have identical logs
- have the leader re-send logs to lagging followers
- elect a new leader with the most up-to-date logs
- allow logs replay after server crashes and reboots
Log Replication: log properties
- once a command in a log entry is applied by any server, no other server executes a different command for that log entry
- applied log entries can't be rolled back
- log entries might be overridden, iff they haven't been replicated & applied on a majority of followers
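The following Go sketch (hypothetical names, not the talk's implementation) captures these properties with the usual commitIndex/lastApplied bookkeeping: entries at or below commitIndex are applied exactly once and never rolled back, while an uncommitted suffix may still be overridden when the leader repairs a follower's log.

```go
// A minimal sketch of the log properties above: committed entries are applied
// once and are final; only uncommitted suffix entries may be overridden.
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

type Node struct {
	log         []Entry
	commitIndex int // highest log index replicated on a majority
	lastApplied int // highest log index applied to the state machine
}

// applyCommitted applies entries in (lastApplied, commitIndex]; applied
// entries are never executed differently later.
func (n *Node) applyCommitted(apply func(Entry)) {
	for n.lastApplied < n.commitIndex {
		n.lastApplied++
		apply(n.log[n.lastApplied-1]) // the log is 1-indexed, as in the Raft paper
	}
}

// truncateFrom overrides uncommitted entries when the leader's log disagrees;
// it refuses to touch anything at or below commitIndex.
func (n *Node) truncateFrom(index int, leaderSuffix []Entry) {
	if index <= n.commitIndex {
		panic("cannot roll back committed entries")
	}
	n.log = append(n.log[:index-1], leaderSuffix...)
}

func main() {
	n := &Node{log: []Entry{{1, "x=4"}, {1, "y=2"}, {2, "z=9"}}, commitIndex: 2}
	n.applyCommitted(func(e Entry) { fmt.Println("apply", e.Cmd) })
	// Entry 3 was never committed, so a new leader may overwrite it.
	n.truncateFrom(3, []Entry{{3, "z=1"}})
	fmt.Println(n.log)
}
```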
Recoverability: snapshot performance issues
- when to take a snapshot: based on log size, fixed timer, etc.; the schedule should balance between resources and performance
- how to unblock processing of write requests while taking snapshots: copy-on-write (e.g., fork() in Linux)
  - client read reqs and the snapshot access the same underlying KV store data
  - only the parts of state being updated are copied, and thus do not affect the snapshot
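As a rough illustration of the copy-on-write idea, here is an application-level Go sketch standing in for the fork()-based approach the slide mentions: a snapshot pins the current shard pointers, and a write clones only the shard it touches, so an in-progress snapshot and concurrent reads keep seeing the old data. The sharding scheme and all names are assumptions for the example.

```go
// A minimal sketch of application-level copy-on-write snapshots: the snapshot
// captures shard pointers, and a write copies only the shard being updated,
// leaving the snapshot's view untouched.
package main

import "fmt"

const numShards = 4

type shard map[string]int

// Store is a sharded KV map supporting cheap snapshots.
type Store struct {
	shards [numShards]shard
}

func NewStore() *Store {
	s := &Store{}
	for i := range s.shards {
		s.shards[i] = make(shard)
	}
	return s
}

func shardFor(key string) int { return int(key[0]) % numShards }

// Snapshot captures the current shard pointers; O(numShards), not O(keys).
func (s *Store) Snapshot() [numShards]shard { return s.shards }

// Set copies only the shard being updated before mutating it (copy-on-write).
func (s *Store) Set(key string, val int) {
	i := shardFor(key)
	clone := make(shard, len(s.shards[i])+1)
	for k, v := range s.shards[i] {
		clone[k] = v
	}
	clone[key] = val
	s.shards[i] = clone
}

func main() {
	s := NewStore()
	s.Set("x", 4)
	snap := s.Snapshot() // snapshot in progress...
	s.Set("x", 5)        // ...while writes keep going
	fmt.Println(snap[shardFor("x")]["x"], s.shards[shardFor("x")]["x"]) // 4 5
}
```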
Recoverability: snapshot size considerations
- Raft's snapshot logic is reasonable if the KV store size is small; it simplifies the logic by always copying the entire KV store state
- if a follower was offline for some time and the leader discarded its log while taking a snapshot, the leader can't repair that follower by sending logs; instead, the leader sends a copy of its snapshot in chunks over the network
- copying large amounts of data (for snapshots and lagging followers) is not scalable; incremental snapshots (e.g., LSM trees) persist KV store data on disk
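Here is a small Go sketch of the chunked snapshot transfer described above; the message shape loosely mirrors the Raft paper's InstallSnapshot RPC, but the names and chunk size are illustrative.

```go
// A minimal sketch of streaming a snapshot to a lagging follower in fixed-size
// chunks instead of one large transfer. Message and field names are made up.
package main

import "fmt"

// InstallSnapshotChunk is one piece of the leader's snapshot sent to a
// follower that is too far behind to be repaired with log entries.
type InstallSnapshotChunk struct {
	Term   int
	Offset int
	Data   []byte
	Done   bool
}

// chunkSnapshot splits the serialized snapshot into chunks of at most size bytes.
func chunkSnapshot(term int, snapshot []byte, size int) []InstallSnapshotChunk {
	var chunks []InstallSnapshotChunk
	for off := 0; off < len(snapshot); off += size {
		end := off + size
		if end > len(snapshot) {
			end = len(snapshot)
		}
		chunks = append(chunks, InstallSnapshotChunk{
			Term:   term,
			Offset: off,
			Data:   snapshot[off:end],
			Done:   end == len(snapshot),
		})
	}
	return chunks
}

func main() {
	snapshot := []byte("serialized KV store state...")
	for _, c := range chunkSnapshot(3, snapshot, 8) {
		fmt.Printf("send offset=%d bytes=%d done=%v\n", c.Offset, len(c.Data), c.Done)
	}
}
```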
Scaling Raft: scale reads
- consistency guarantees: no stale reads; the KV store must reflect the behavior expected of a single server, e.g., read reqs started after a write op is completed should see the result from the write op
- can followers respond to read requests? no, unless we avoid stale reads
  https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf
Scaling Raft: scale reads
Can the leader respond to read-only requests by reading its local copy, without replicating log entries? Yes, under two conditions:
- no split-brain: one leader might return a stale value that has been updated by another recent leader. One solution is to use leases: on successful heartbeats from a majority, the leader is allowed to respond to read-only requests for a lease period, without replicating; a new leader can't execute write reqs until this lease period has elapsed.
- no-op operation: a new leader must append and apply a no-op operation into the log at the start of its election term. This ensures all prev log entries before the election term are consistent with the majority and applied before the new leader starts to serve client requests.
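A minimal Go sketch of the lease condition, under the assumption that clocks are reasonably well behaved: the lease is extended from the send time of each heartbeat round acknowledged by a majority, and local reads are refused once it expires. Names and the lease duration are made up.

```go
// A minimal sketch of lease-based local reads: after a majority-acknowledged
// heartbeat round, the leader may serve reads from its local copy until the
// lease expires, without appending a log entry.
package main

import (
	"errors"
	"fmt"
	"time"
)

const leaseDuration = 500 * time.Millisecond // illustrative value

type Leader struct {
	state      map[string]int
	leaseUntil time.Time // extended on each majority-acknowledged heartbeat
}

// OnHeartbeatQuorum is called when a heartbeat round sent at sentAt has been
// acked by a majority; the lease is measured from the send time.
func (l *Leader) OnHeartbeatQuorum(sentAt time.Time) {
	l.leaseUntil = sentAt.Add(leaseDuration)
}

// Read serves from the local copy only while the lease is valid; otherwise the
// caller must fall back to a round of consensus (or fail the read).
func (l *Leader) Read(key string) (int, error) {
	if time.Now().After(l.leaseUntil) {
		return 0, errors.New("lease expired: local read could be stale")
	}
	return l.state[key], nil
}

func main() {
	l := &Leader{state: map[string]int{"x": 5}}
	l.OnHeartbeatQuorum(time.Now())
	v, err := l.Read("x")
	fmt.Println(v, err) // 5 <nil>
}
```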
Scaling Raft: scale writes
- consistency guarantees: ordering & concurrent writes
  - the state at any point in time must reflect the result of sequential execution of ordered operations, in the same order they were sent (or received by the KV store)
  - concurrent ops can be executed in any order as long as all clients see the same order
- batching: batch multiple client requests in order to send them efficiently over the network, and to write them to disk; optimizes throughput. The leader broadcasts the batch when it exceeds the size or timeout threshold.
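The batching rule ("flush when the batch exceeds a size or timeout threshold") might look roughly like this Go sketch; the thresholds and channel-based plumbing are assumptions, not the speaker's code.

```go
// A minimal sketch of request batching on the leader: buffer client commands
// and broadcast the batch once a size threshold is hit or a timeout fires.
package main

import (
	"fmt"
	"time"
)

const (
	maxBatchSize = 64                   // illustrative threshold
	maxBatchWait = 5 * time.Millisecond // illustrative threshold
)

// batcher collects commands from clients and flushes them as one batch.
func batcher(in <-chan string, broadcast func([]string)) {
	var batch []string
	timer := time.NewTimer(maxBatchWait)
	for {
		select {
		case cmd, ok := <-in:
			if !ok { // input closed: flush whatever is left
				if len(batch) > 0 {
					broadcast(batch)
				}
				return
			}
			batch = append(batch, cmd)
			if len(batch) >= maxBatchSize { // size threshold reached
				broadcast(batch)
				batch = nil
			}
		case <-timer.C: // timeout threshold reached
			if len(batch) > 0 {
				broadcast(batch)
				batch = nil
			}
			timer.Reset(maxBatchWait)
		}
	}
}

func main() {
	in := make(chan string)
	go func() {
		for i := 0; i < 10; i++ {
			in <- fmt.Sprintf("set k%d", i)
		}
		close(in)
	}()
	batcher(in, func(b []string) { fmt.Println("broadcast batch of", len(b)) })
}
```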
Scaling Raft: scale writes
- pipelining: send the next batch to followers without having to wait for the previous batch to finish; optimizes latency. The leader can send the next batch halfway through the processing time of the first batch.
- parallelizing: parallelize independent operations, e.g., disk writes and sending batches over the network.
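And a small Go sketch of the parallelizing point: the leader's local disk write and the broadcast to followers are independent, so they can run concurrently and be awaited together. The function bodies are placeholders, not real I/O.

```go
// A minimal sketch of parallelizing independent operations: persist the batch
// to local disk and broadcast it to followers concurrently, instead of running
// the two steps back to back.
package main

import (
	"fmt"
	"sync"
	"time"
)

func persistToDisk(batch []string) {
	time.Sleep(2 * time.Millisecond) // stand-in for an fsync'd log write
	fmt.Println("persisted", len(batch), "entries")
}

func broadcastToFollowers(batch []string) {
	time.Sleep(2 * time.Millisecond) // stand-in for AppendEntries RPCs
	fmt.Println("broadcast", len(batch), "entries")
}

// replicateBatch runs the disk write and the network broadcast in parallel and
// returns once both have finished (commit still requires a majority of acks).
func replicateBatch(batch []string) {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); persistToDisk(batch) }()
	go func() { defer wg.Done(); broadcastToFollowers(batch) }()
	wg.Wait()
}

func main() {
	replicateBatch([]string{"set x 4", "set y 2", "set z 1"})
}
```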
References
- Diego Ongaro and John Ousterhout, Stanford University, In Search of an Understandable Consensus Algorithm, https://raft.github.io/raft.pdf
- Massachusetts Institute of Technology (MIT), MIT 6.824 distributed systems class, https://pdos.csail.mit.edu/6.824/index.html
- Diego Ongaro, Consensus: Bridging Theory and Practice, Stanford University, https://github.com/ongardie/dissertation/blob/master/stanford.pdf
- Unmesh Joshi, Patterns of Distributed Systems, https://martinfowler.com/articles/patterns-of-distributed-systems