Square's Lessons Learned from Implementing a Key-Value Store with Raft

ScyllaDB · 28 slides · Jun 27, 2024

About This Presentation

Put simply, Raft is used to make a use case (e.g., a key-value store or an indexing system) more fault tolerant: replication keeps the service available despite server and network failures. Raft has been gaining ground due to its simplicity, without sacrificing consistency or performance.



Slide Content

Lessons Learnt from Implementing a Key-Value Store with Raft. Omar Elgabry, Software Engineer at Square.

Omar Elgabry, Software Engineer at Square. This is my second time at P99 CONF. Blog: https://medium.com/@OmarElgabry

Intro: Motivation. To go through the journey of implementing a KV store, and the micro-lessons one can learn from building a fault-tolerant, strongly consistent distributed DB using Raft.
Goals: to build a KV store that achieves the following:
fault tolerance: provide availability despite server and network failures, using replication
consistency guarantees (linearizability): behave as if all operations are executed on a single machine
performance guarantees: should be equivalent to other consensus algorithms such as Paxos
recoverability: recover from failures

ROWA (read one, write all): reads and writes (diagram: a read can be served by a single replica, while a write must reach the leader and all replicas before it completes).

ROWA (read one, write all): failures (diagram: the leader and a follower fail, forcing a configuration change).

Majority quorum: reads and writes (diagram: the leader waits for acknowledgments from a majority (⅔) of nodes).

Majority quorum: leader failures (diagram: the leader fails and one of the remaining replicas becomes leader).

ROWA vs Majority: Raft is more complex, ROWA is simpler. Raft does not rely on a single master to make critical decisions. Majority quorum systems have been gaining ground due to their tolerance of temporarily slow or broken replicas.

Raft

Log Replication: log replication is the backbone of Raft. (Diagram: log entries x=4, y=2, z=1, x=5 stored in the on-disk log and applied to the in-memory KV store.)

Log Replication: ... but why?
Logs order the client requests and commands as log entries.
They help nodes agree on a single execution order (same order → same final state),
help the leader ensure followers have identical logs,
help the leader re-send logs to lagging followers,
help with electing a new leader that has the most up-to-date log,
and allow logs to be replayed after server crashes and reboots.

Log Replication: applying log entries (diagram: the leader and both followers each apply x=4 to their state machines).

Log Replication: log properties.
Once a command in a log entry is applied by any server, no other server executes a different command for that log entry.
Applied log entries can't be rolled back.
A log entry might be overridden, but only if it hasn't been replicated and applied on a majority of followers.
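These ordering and apply rules are what make replicated state machines deterministic. Below is a minimal Go sketch (types and names are illustrative, not from the talk) of a KV state machine applying committed log entries strictly in index order, assuming simple set commands:

```go
package main

import "fmt"

// Command is a single client operation recorded in the log.
type Command struct {
	Key   string
	Value string
}

// Entry is one log slot: the term it was appended in plus the command.
type Entry struct {
	Term    int
	Command Command
}

// KVStore applies committed entries strictly in log order, so every
// replica that applies the same log reaches the same final state.
type KVStore struct {
	data        map[string]string
	lastApplied int // index of the last entry applied to data
}

func NewKVStore() *KVStore {
	return &KVStore{data: make(map[string]string)}
}

// Apply consumes entries [lastApplied+1 .. commitIndex]; applied entries
// are never rolled back, matching the log properties above.
func (kv *KVStore) Apply(log []Entry, commitIndex int) {
	for i := kv.lastApplied + 1; i <= commitIndex && i <= len(log); i++ {
		cmd := log[i-1].Command // log indices are 1-based in Raft
		kv.data[cmd.Key] = cmd.Value
		kv.lastApplied = i
	}
}

func main() {
	kv := NewKVStore()
	log := []Entry{
		{Term: 1, Command: Command{"x", "4"}},
		{Term: 1, Command: Command{"y", "2"}},
		{Term: 2, Command: Command{"x", "5"}},
	}
	kv.Apply(log, 2)     // only entries replicated on a majority are committed
	fmt.Println(kv.data) // map[x:4 y:2]
	kv.Apply(log, 3)
	fmt.Println(kv.data) // map[x:5 y:2]
}
```

Because every replica applies the same committed prefix in the same order, they all converge to the same map.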

Network Partitioning (diagram: a network partition separates the leader from a majority of followers).

Network Partitioning: split-brain syndrome (diagram: each side of the partition ends up with its own leader).

Leader Elections: voting (diagram: after the leader fails, a candidate requests votes, wins 3 to 1, and becomes the new leader).

Leader Elections: split votes (diagram: two candidates each receive 2 votes, so neither wins; randomized election timeouts let one candidate retry first and win the next round). See also: https://www.p99conf.io/session/square-engineerings-fail-fast-retry-soon-performance-optimization-technique
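As a hedged aside, split votes are usually broken by randomizing each node's election timeout so candidates rarely collide twice in a row. A small Go sketch, assuming the 150-300 ms range suggested in the Raft paper:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout returns a randomized timeout so that two candidates
// rarely time out at the same moment and split the vote again.
func electionTimeout() time.Duration {
	const min, max = 150, 300 // milliseconds, the range suggested in the Raft paper
	return time.Duration(min+rand.Intn(max-min)) * time.Millisecond
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println("node", i, "election timeout:", electionTimeout())
	}
}
```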

Recoverability: snapshots (diagram: the Raft log and a snapshot of the KV state persisted to disk).

Recoverability: snapshot performance issues.
When to take a snapshot: based on log size, a fixed timer, etc.; the schedule should balance resource usage against performance.
How to keep processing write requests while taking a snapshot: copy-on-write (e.g., fork() in Linux). Client read requests and the snapshot access the same underlying KV store data; only the parts of the state being updated are copied, so the snapshot is not affected.
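A minimal Go sketch of the copy-on-write idea at the application level (an alternative to fork(); names such as cowStore are illustrative and not from the talk): while a snapshot is in flight, writes go to an overlay map, so the snapshotter and client reads keep seeing a frozen base without blocking each other.

```go
package main

import "fmt"

// cowStore sketches application-level copy-on-write: taking a snapshot
// freezes the current base map, and writes made while the snapshot is in
// flight go to an overlay, so snapshotting and client traffic do not
// block each other.
type cowStore struct {
	base     map[string]string // frozen while a snapshot is in flight
	overlay  map[string]string // updates made after the snapshot started
	snapping bool
}

// Snapshot freezes the base map and returns it for the snapshotter to
// iterate and persist to disk.
func (s *cowStore) Snapshot() map[string]string {
	s.snapping = true
	s.overlay = make(map[string]string)
	return s.base
}

// Set writes to the overlay while a snapshot is in flight so the frozen
// base (and therefore the snapshot) is never modified.
func (s *cowStore) Set(k, v string) {
	if s.snapping {
		s.overlay[k] = v
		return
	}
	s.base[k] = v
}

// Get prefers the overlay so clients always see their latest writes.
func (s *cowStore) Get(k string) (string, bool) {
	if s.snapping {
		if v, ok := s.overlay[k]; ok {
			return v, true
		}
	}
	v, ok := s.base[k]
	return v, ok
}

// EndSnapshot folds the overlay back into the base once the snapshot
// has been written to disk.
func (s *cowStore) EndSnapshot() {
	for k, v := range s.overlay {
		s.base[k] = v
	}
	s.overlay, s.snapping = nil, false
}

func main() {
	s := &cowStore{base: map[string]string{"x": "4"}}
	frozen := s.Snapshot()
	s.Set("x", "5")                  // goes to the overlay, not the frozen base
	latest, _ := s.Get("x")          // clients still see the newest value
	fmt.Println(frozen["x"], latest) // 4 5
	s.EndSnapshot()
}
```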

Recoverability: snapshot size considerations.
Raft's snapshot logic is reasonable if the KV store is small: it simplifies the logic by always copying the entire KV store state.
If a follower was offline for some time and the leader discarded its log while taking a snapshot, the leader can't repair that follower by sending log entries; instead, it sends a copy of its snapshot in chunks over the network.
Copying large amounts of data (for snapshots and for lagging followers) is not scalable; incremental snapshots (e.g., LSM trees) persist the KV store data on disk.
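A hedged sketch of shipping a snapshot in chunks, assuming the snapshot is already serialized to a byte slice; the 4 KB chunk size and function names are illustrative:

```go
package main

import "fmt"

// chunks splits a serialized snapshot into fixed-size pieces so a large
// snapshot can be streamed to a lagging follower instead of sent at once.
func chunks(snapshot []byte, size int) [][]byte {
	var out [][]byte
	for len(snapshot) > 0 {
		n := size
		if len(snapshot) < n {
			n = len(snapshot)
		}
		out = append(out, snapshot[:n])
		snapshot = snapshot[n:]
	}
	return out
}

func main() {
	snap := make([]byte, 10_000) // pretend serialized KV state
	for i, c := range chunks(snap, 4096) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```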

Scaling Raft: scale reads.
Consistency guarantees: no stale reads. The KV store must reflect the behavior expected of a single server; e.g., read requests started after a write operation completes should see the result of that write.
Can followers respond to read requests? No, unless stale reads can be avoided. https://www.usenix.org/system/files/conference/hotcloud17/hotcloud17-paper-arora.pdf

Scaling Raft: scale reads.
Can the leader respond to read-only requests from its local copy, without replicating log entries? Yes, under two conditions:
No split-brain: a leader might return a stale value that has been updated by another, more recent leader. One solution is leases: after successful heartbeats to a majority, the leader is allowed to answer read-only requests for a lease period without replicating, and a new leader can't execute write requests until that lease period has elapsed.
No-op operation: a new leader must append and apply a no-op entry to the log at the start of its election term. This ensures all log entries from previous terms are consistent with the majority and applied before the new leader starts serving client requests.
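A minimal Go sketch of the lease check described above (field names and the 200 ms lease are illustrative assumptions): the leader records when it last heard from a majority and only serves local, non-replicated reads while that lease is still valid.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// leaseLeader tracks when the leader last heard from a majority of
// followers; local (non-replicated) reads are only allowed while the
// lease granted by that round of heartbeats is still valid.
type leaseLeader struct {
	lastMajorityAck time.Time
	lease           time.Duration
	data            map[string]string
}

var errLeaseExpired = errors.New("lease expired: renew via heartbeats or replicate the read")

// LocalRead serves the read from the leader's copy without appending a
// log entry, but only inside the lease window, which avoids stale reads
// after a silent leader change.
func (l *leaseLeader) LocalRead(key string) (string, error) {
	if time.Since(l.lastMajorityAck) > l.lease {
		return "", errLeaseExpired
	}
	return l.data[key], nil
}

func main() {
	l := &leaseLeader{
		lastMajorityAck: time.Now(),
		lease:           200 * time.Millisecond,
		data:            map[string]string{"x": "5"},
	}
	v, err := l.LocalRead("x")
	fmt.Println(v, err) // 5 <nil>
}
```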

Scaling Raft: scale writes.
Consistency guarantees: ordering and concurrent writes. The state at any point in time must reflect the result of sequential execution of the operations, in the same order they were sent (or received by the KV store). Concurrent operations can be executed in any order, as long as all clients see the same order.
Batching: batch multiple client requests so they are sent efficiently over the network and written to disk together; this optimizes throughput. The leader broadcasts the batch when it exceeds a size or timeout threshold.
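A minimal Go sketch of the batching idea (thresholds and names are illustrative): commands accumulate until the batch hits a size or age limit and are then flushed as one unit. A real implementation would also flush pending commands from a timer, not only on the next Add.

```go
package main

import (
	"fmt"
	"time"
)

// batcher accumulates commands and flushes them either when the batch
// reaches maxSize or when maxDelay has passed since the first command,
// trading a little latency for higher throughput.
type batcher struct {
	maxSize  int
	maxDelay time.Duration
	pending  []string
	firstAt  time.Time
	flush    func([]string) // e.g. append to the Raft log and broadcast
}

func (b *batcher) Add(cmd string) {
	if len(b.pending) == 0 {
		b.firstAt = time.Now()
	}
	b.pending = append(b.pending, cmd)
	if len(b.pending) >= b.maxSize || time.Since(b.firstAt) >= b.maxDelay {
		b.flush(b.pending)
		b.pending = nil
	}
}

func main() {
	b := &batcher{
		maxSize:  3,
		maxDelay: 10 * time.Millisecond,
		flush:    func(cmds []string) { fmt.Println("replicating batch:", cmds) },
	}
	for _, c := range []string{"set x=4", "set y=2", "set x=5", "set z=1"} {
		b.Add(c)
	}
}
```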

Pipelining: send the next batch to followers without waiting for the previous batch to finish; this optimizes latency. The leader can send the next batch halfway through the processing of the first batch.
Parallelizing: run independent operations in parallel, e.g., disk writes in parallel with sending batches over the network.
Scaling Raft: scale writes.

Scaling Raft: scale writes (diagram: the leader writes Batch 1 to disk while Batch 2 is already being sent to the followers).
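A hedged Go sketch of pipelining (latencies and names are simulated and illustrative): the leader fires each batch immediately instead of waiting for the previous batch's acknowledgment, so network and disk latencies overlap.

```go
package main

import (
	"fmt"
	"time"
)

// replicateAsync pretends to send one batch to a follower over the
// network and delivers the acknowledgment later on the acks channel.
func replicateAsync(batchID int, batch []string, acks chan<- int) {
	go func() {
		time.Sleep(20 * time.Millisecond) // simulated network + disk latency
		acks <- batchID
	}()
}

func main() {
	batches := [][]string{{"set x=4", "set y=2"}, {"set x=5"}, {"set z=1"}}
	acks := make(chan int, len(batches))

	// Pipelining: fire every batch immediately instead of waiting for the
	// previous batch's acknowledgment, so per-batch latencies overlap.
	for id, batch := range batches {
		replicateAsync(id, batch, acks)
	}

	// Acknowledgments may arrive out of order; a real leader would match
	// them to log indices before advancing the commit index.
	for range batches {
		fmt.Println("acknowledged batch", <-acks)
	}
}
```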

References:
Diego Ongaro and John Ousterhout, Stanford University, In Search of an Understandable Consensus Algorithm, https://raft.github.io/raft.pdf
Massachusetts Institute of Technology (MIT), MIT 6.824 Distributed Systems class, https://pdos.csail.mit.edu/6.824/index.html
Diego Ongaro, Consensus: Bridging Theory and Practice, Stanford University, https://github.com/ongardie/dissertation/blob/master/stanford.pdf
Unmesh Joshi, Patterns of Distributed Systems, https://martinfowler.com/articles/patterns-of-distributed-systems

Omar Elgabry, [email protected], LinkedIn: omarelgabry. Thank you! Let's connect.