Conquering Load Balancing: Experiences from ScyllaDB Drivers

ScyllaDB · 35 slides · Jul 01, 2024

About This Presentation

Load balancing seems simple on the surface, with algorithms like round-robin, but the real world loves throwing curveballs. Join me in this session as we delve into the intricacies of load balancing within ScyllaDB Drivers. Discover firsthand experiences from our journey in driver development, where...


Slide Content

Conquering Load Balancing: Experiences from ScyllaDB Drivers. Piotr Grabowski, Software Team Leader at ScyllaDB

Piotr Grabowski (he/him), Software Team Leader at ScyllaDB, responsible for all ScyllaDB drivers and the ScyllaDB Kafka Connectors (ScyllaDB Sink Connector and ScyllaDB CDC Source Connector). Joined ScyllaDB 3 years ago.

Load balancing in ScyllaDB

Load balancing in ScyllaDB: for SELECT * FROM table WHERE partition_key = “R1250GS”, the driver computes hash(“R1250GS”), which determines the replica nodes for that partition.
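The hash-to-replicas step can be sketched as a token ring lookup. This is a minimal illustration, assuming a toy hash function in place of the Murmur3 partitioner ScyllaDB actually uses, and a hand-built ring of (token, node) pairs: the key's hash is mapped to the first node whose token is greater than or equal to it (wrapping around), and the next RF - 1 ring positions hold the remaining replicas.

```rust
/// Toy stand-in for the partitioner hash (NOT the real Murmur3 token hash).
fn toy_hash(key: &str) -> u64 {
    key.bytes()
        .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(b as u64))
}

/// `ring` must be sorted by token; returns the `rf` replica names for `key`.
fn replicas_for<'a>(ring: &[(u64, &'a str)], key: &str, rf: usize) -> Vec<&'a str> {
    let h = toy_hash(key);
    // First ring position whose token is >= the hash, wrapping past the end.
    let start = ring.partition_point(|&(token, _)| token < h) % ring.len();
    (0..rf).map(|i| ring[(start + i) % ring.len()].1).collect()
}

fn main() {
    let ring = [(100, "node-a"), (200, "node-b"), (300, "node-c")];
    let replicas = replicas_for(&ring, "R1250GS", 2);
    // With RF = 2, two consecutive ring positions own the partition.
    assert_eq!(replicas.len(), 2);
    println!("{:?}", replicas);
}
```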

Naïve load balancing policy: SELECT * FROM table WHERE partition_key = “R1250GS”

Round-robin policy: a very basic load balancing algorithm. Maintain an incrementing counter N. When sending a query, send it to replica number N, then increment the counter (modulo the number of replicas). The load is distributed equally between the nodes.
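The counter scheme above can be sketched in a few lines. This is a simplified illustration, not the driver's actual implementation; the `RoundRobin` struct and its `pick` method are hypothetical names:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal round-robin sketch: a shared atomic counter cycled over replicas.
struct RoundRobin {
    counter: AtomicUsize,
}

impl RoundRobin {
    fn new() -> Self {
        Self { counter: AtomicUsize::new(0) }
    }

    /// Return the next replica index, wrapping modulo the replica count.
    fn pick(&self, replica_count: usize) -> usize {
        self.counter.fetch_add(1, Ordering::Relaxed) % replica_count
    }
}

fn main() {
    let rr = RoundRobin::new();
    let picks: Vec<usize> = (0..6).map(|_| rr.pick(3)).collect();
    // Each of the 3 replicas receives exactly 2 of the 6 queries.
    assert_eq!(picks, vec![0, 1, 2, 0, 1, 2]);
    println!("{:?}", picks);
}
```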

Random choice policy: a very basic load balancing algorithm. When sending a query, send it to a random replica.

Problems!

Slow node: SELECT * FROM table WHERE partition_key = “R1250GS”. Causes: faulty network, temporary overload.

Slow node (continued): the slow replica adds +10 ms latency to requests routed to it.

Slow core: SELECT * FROM table WHERE partition_key = “R1250GS”

Inflight requests per node+shard (charts of inflight request counts)

Slow core (synthetic scenario): a 5-node i3.4xlarge Scylla cluster, in which we purposefully and severely overloaded a single core on one of the nodes. That's only 1/80 of the cluster! The results were staggering.

Slow core (synthetic scenario): results (charts)

Least Inflight load balancing? One potential solution to that problem: instead of choosing any of the replicas, send the request to the least loaded one (as measured by the number of inflight requests). This works well for a single client, but with many clients it is susceptible to all of them overloading the same node.

Power of Two Choices load balancing: a simple algorithm that aims to avoid the slowest replica while avoiding the problem of many clients overloading the same node. When selecting the replica: pick 2 replicas at random, then select the one with the fewer inflight requests.
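The selection step can be sketched as follows. To keep the example deterministic, the two randomly drawn candidate indices are passed in as arguments (a real policy would draw them itself); the `inflight` slice is a hypothetical per-replica inflight-request counter:

```rust
/// Power of Two Choices: of two randomly drawn replicas, take the less loaded.
fn power_of_two_choices(inflight: &[usize], a: usize, b: usize) -> usize {
    if inflight[a] <= inflight[b] { a } else { b }
}

fn main() {
    // Replica 1 is overloaded (e.g. it owns the slow core).
    let inflight = [3, 250, 5];
    // Whenever the overloaded replica is drawn against a healthy one, it loses.
    assert_eq!(power_of_two_choices(&inflight, 0, 1), 0);
    assert_eq!(power_of_two_choices(&inflight, 1, 2), 2);
    // Two healthy candidates: the one with fewer inflight requests wins.
    assert_eq!(power_of_two_choices(&inflight, 2, 0), 0);
    println!("ok");
}
```

Because only two candidates are compared, many independent clients do not all converge on the globally least-loaded node the way a pure least-inflight policy can.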

Sending requests to the wrong shard: a counterintuitive approach to dealing with an overloaded shard. If you send the request to the wrong shard, the initial processing of the request happens on the wrong shard, which then moves the processing to the correct shard. The correct shard didn't have to do the initial processing, reducing its load!

Zone awareness in the Cloud (diagrams)

Zone awareness in the Cloud: instead of sending the request to any replica, prefer the replica in the local AZ. Potential issues: a common setup uses a Replication Factor of 3; if the cluster is deployed across 3 AZs, there will be only a single replica in each AZ, which severely limits the choices a load balancing algorithm has. A problem with many solutions: a tradeoff between latencies and cost.
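The AZ preference can be sketched as reordering the replica plan so local-AZ replicas come first, with the others kept as fallback. The `Replica` struct and its `az` label here are hypothetical, purely for illustration:

```rust
/// Hypothetical replica record carrying its availability-zone label.
#[derive(Debug, PartialEq)]
struct Replica<'a> {
    name: &'a str,
    az: &'a str,
}

/// Order local-AZ replicas first; a stable sort preserves the original
/// relative order within each group.
fn order_by_local_az<'a>(mut replicas: Vec<Replica<'a>>, local_az: &str) -> Vec<Replica<'a>> {
    replicas.sort_by_key(|r| r.az != local_az); // false (local) sorts first
    replicas
}

fn main() {
    let replicas = vec![
        Replica { name: "n1", az: "us-east-1a" },
        Replica { name: "n2", az: "us-east-1b" },
        Replica { name: "n3", az: "us-east-1c" },
    ];
    let ordered = order_by_local_az(replicas, "us-east-1b");
    // With RF = 3 across 3 AZs, exactly one replica is local.
    assert_eq!(ordered[0].name, "n2");
    println!("{:?}", ordered);
}
```

Note the tradeoff from the slide: preferring the single local replica avoids cross-AZ traffic costs, but leaves the policy only one preferred target, so a slow local replica hurts latency.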

Optimizing load balancing in ScyllaDB Rust Driver: benchmarks of the ScyllaDB Rust Driver have shown that load balancing computation is a significant part of sending a query. Main goal: reduce the number of allocations and atomic operations while building the query plan, especially on the happy path. The plan function was split into pick() and fallback() methods, which allowed better optimization of the most common case, where only one node from the load balancing plan is needed. Precomputation of replica sets: a struct was introduced that precomputes the replica lists for the given strategies and provides O(1) access to the desired replica slices.
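The pick()/fallback() split can be illustrated with a simplified sketch (the `Plan` struct below is hypothetical, not the driver's real API): pick() serves the happy path in O(1) with no allocation, while fallback() lazily yields the remaining replicas only if the first attempt fails.

```rust
/// Hypothetical simplified query plan over a precomputed replica slice.
struct Plan<'a> {
    replicas: &'a [&'a str],
}

impl<'a> Plan<'a> {
    /// Happy path: the single preferred replica, no allocation needed.
    fn pick(&self) -> Option<&'a str> {
        self.replicas.first().copied()
    }

    /// Cold path: the remaining replicas, yielded lazily for retries.
    fn fallback(&self) -> impl Iterator<Item = &'a str> + '_ {
        self.replicas.iter().skip(1).copied()
    }
}

fn main() {
    let plan = Plan { replicas: &["n1", "n2", "n3"] };
    assert_eq!(plan.pick(), Some("n1"));
    // The fallback iterator is only consumed if the first node fails.
    let rest: Vec<_> = plan.fallback().collect();
    assert_eq!(rest, vec!["n2", "n3"]);
    println!("{:?}", rest);
}
```

Borrowing a precomputed slice, rather than building a fresh Vec per query, is the kind of change that removes per-request allocations on the happy path.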

Optimizing load balancing in ScyllaDB Rust Driver: allocation benchmark for inserts, before vs. after the optimizations (9 fewer allocations per request, -60%; 84% fewer bytes allocated):

Inserts (per request)       Before     After
allocs/req                   15.00      6.01
reallocs/req                  8.00      6.00
frees/req                    15.00      6.00
bytes allocated/req        2458.05    381.80
bytes reallocated/req       269.06    173.05
bytes freed/req            2456.80    380.62
(allocated - freed)/req       1.25      1.18

Piotr Grabowski [email protected] Thank you! Let’s connect.