Quantifying the Performance Impact of Shard-per-core Architecture

ScyllaDB · 51 slides · Jun 26, 2024

About This Presentation

Most software isn’t architected to take advantage of modern hardware. How does a shard-per-core, shared-nothing architecture help, and exactly what impact can it make? Dor will examine technical opportunities and tradeoffs, as well as disclose the results of a new benchmark study.


Slide Content

Quantifying the Performance of Shard-per-Core Architecture. Dor Laor, Co-Founder of ScyllaDB

Dor Laor, Co-Founder of ScyllaDB. Something cool I’ve done: Hello world, core router, KVM, OSv, ScyllaDB. My perspective on P99: less is more. Another thing about me: aspiring MTB rider, orthopedic patient. What I do away from work: fight for democratization.

Let’s Shard

Past, Present & Future. Me: router FIB, device drivers, kernel load balancer; fault-tolerant virtualization using lockstep, single VCPU; Virtio, no false cache-line sharing; PV ticketlocks reduce TLB flushes; in-kernel apps, no syscalls; Van Jacobson networking; shard-per-core: Seastar & ScyllaDB.

Locking is Evil

The LOCK# Signal

The ‘Bible’ Documentation

Split Locks (detection): a bus lock on unaligned buffers still hurts.

Locking & NUMA & Virtualization

Locking & NUMA & Virtualization - Still

How long does it take to invalidate the cache?

Mutex::lock() cost: no contention, lock line in the local cache ~5 ns; no contention, lock line in a remote cache ~50 ns; contention: unbounded, depends on the lock holder.
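The gap between these cases is easy to reproduce. Below is a minimal sketch (my own illustration, not from the talk) that times a tiny std::mutex critical section first from one thread, then from several; absolute numbers depend on the CPU, the topology, and whether the contending threads share a cache.

    // Hypothetical micro-benchmark, build with: g++ -O2 -pthread mutex_cost.cc
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    static std::mutex m;
    static long counter = 0;

    // Time lock/unlock of a tiny critical section from `nthreads` threads and
    // return the average cost per acquisition in nanoseconds.
    static double lock_cost_ns(int nthreads, long iters_per_thread) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t) {
            workers.emplace_back([iters_per_thread] {
                for (long i = 0; i < iters_per_thread; ++i) {
                    std::lock_guard<std::mutex> g(m);
                    ++counter;  // tiny critical section: the lock dominates
                }
            });
        }
        for (auto& w : workers) {
            w.join();
        }
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        return ns / (double(nthreads) * iters_per_thread);
    }

    int main() {
        std::printf("uncontended:           %6.1f ns per lock/unlock\n",
                    lock_cost_ns(1, 5000000));
        std::printf("contended (8 threads): %6.1f ns per lock/unlock\n",
                    lock_cost_ns(8, 1000000));
        return 0;
    }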

Locking is Evil Round-Trip is Evil

Cost of cross-core RTT

RTT measurement; Delayed Branch blog

AWS C5.metal, (28-4)*2 cores, Cascade Lake: 18-hop RTT in the worst case; hyperthreads share L1.

AWS C5.metal, (28-4)*2 cores, Cascade Lake

Ice Lake, AWS C6i.metal, 2*32 cores, 128 vcpu

Ice Lake, AWS C6i.metal, 2*32 cores, 128 vCPUs: hyperthread RTT; within-die mesh interconnect RTT; cross-NUMA-node UPI interconnect RTT.
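The cross-core RTT tiers in these slides can be approximated with a simple ping-pong over a shared atomic. The sketch below is my own illustration, not the Delayed Branch benchmark: it pins two threads to the CPUs given on the command line and bounces a cache line between them; try sibling hyperthreads, two cores on the same socket, and two cores on different NUMA nodes to see the three tiers.

    // Hypothetical cache-line ping-pong, Linux only.
    // Build: g++ -O2 -pthread rtt.cc && ./a.out <cpuA> <cpuB>
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    static std::atomic<std::uint64_t> flag{0};
    static constexpr std::uint64_t kRounds = 1000000;

    // Pin the calling thread to a single CPU so the traffic crosses a known path.
    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(int argc, char** argv) {
        int cpu_a = argc > 1 ? std::atoi(argv[1]) : 0;
        int cpu_b = argc > 2 ? std::atoi(argv[2]) : 1;

        // Responder: wait for an odd value, answer with the next even one.
        std::thread responder([cpu_b] {
            pin_to_cpu(cpu_b);
            std::uint64_t expected = 1;
            for (std::uint64_t i = 0; i < kRounds; ++i) {
                while (flag.load(std::memory_order_acquire) != expected) { /* spin */ }
                flag.store(expected + 1, std::memory_order_release);
                expected += 2;
            }
        });

        pin_to_cpu(cpu_a);
        auto t0 = std::chrono::steady_clock::now();
        std::uint64_t next = 1;
        for (std::uint64_t i = 0; i < kRounds; ++i) {
            flag.store(next, std::memory_order_release);               // ping
            while (flag.load(std::memory_order_acquire) != next + 1) { /* pong */ }
            next += 2;
        }
        auto t1 = std::chrono::steady_clock::now();
        responder.join();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("avg round trip cpu%d <-> cpu%d: %.1f ns\n",
                    cpu_a, cpu_b, ns / kRounds);
        return 0;
    }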

Single access -> jemalloc variability: small object from the tcache ~25-50 ns; tcache miss, arena access with no lock contention ~200 ns; arena access with lock contention ~400 ns; arena miss, call mmap() ~2000 ns and up.
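Allocation-latency variance like this is straightforward to observe. A rough sketch follows, assuming the figures above refer to jemalloc's tcache/arena path; it records per-malloc latency and prints tail percentiles. Note that every sample also includes clock-read overhead, and that linking against jemalloc versus the default allocator changes the shape of the tail.

    // Hypothetical allocation-latency probe. Link with -ljemalloc to measure
    // jemalloc, or without it to measure the default allocator.
    // Build: g++ -O2 alloc_tail.cc [-ljemalloc]
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        constexpr int kSamples = 1000000;
        std::vector<double> ns(kSamples);
        std::vector<void*> keep;
        keep.reserve(kSamples);

        for (int i = 0; i < kSamples; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            void* p = std::malloc(64);   // small object, usually served by the tcache
            auto t1 = std::chrono::steady_clock::now();
            keep.push_back(p);           // keep it live so the allocator keeps working
            ns[i] = std::chrono::duration<double, std::nano>(t1 - t0).count();
        }

        // Each sample also contains the (roughly constant) clock-read overhead.
        std::sort(ns.begin(), ns.end());
        std::printf("p50=%.0fns  p99=%.0fns  p99.9=%.0fns  max=%.0fns\n",
                    ns[kSamples / 2], ns[kSamples * 99 / 100],
                    ns[kSamples * 999 / 1000], ns.back());

        for (void* p : keep) {
            std::free(p);
        }
        return 0;
    }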

P99 Latency == Variance. Sources of evil: NUMA/NUCA, remote-CPU RTT, locking, TLB flushes, system calls, interrupts, contention.

“Intellectuals solve problems, geniuses prevent them” LWN Link

Birth of Our Shard-per-Core: a "locks are evil" attitude; Van Jacobson's net channels in OSv; pivot to databases, solve 330 VMs = 1M OPS.

Shard-per-Core Architecture: threads vs. shards.

Traditional vs Sharded. Traditional stack: Cassandra threads on top of the kernel's TCP/IP, scheduler, and NIC queues, all sharing memory; lock contention, cache contention, NUMA unfriendly. Don't do it! Seastar's sharded stack: one application/database shard per core, each with its own task scheduler, SMP message queues, and NIC queue, with networking either through the kernel or in userspace via DPDK (the kernel isn't involved); no contention, linear scaling, NUMA friendly.
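The routing side of this shared-nothing design is simple to sketch. The helper below is hypothetical (ScyllaDB's real mapping goes through token ownership, not a plain hash), but it shows the core idea: every partition key deterministically maps to exactly one owning shard, so only that core ever touches the key's data.

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <thread>

    // Hypothetical routing helper: hash the partition key and pick the owning
    // shard. ScyllaDB's real mapping goes through token ownership; this only
    // illustrates that exactly one shard owns each key.
    inline unsigned owning_shard(const std::string& partition_key, unsigned nshards) {
        return std::hash<std::string>{}(partition_key) % nshards;
    }

    int main() {
        // One shard per core in a shard-per-core design.
        unsigned nshards = std::thread::hardware_concurrency();
        if (nshards == 0) nshards = 1;
        unsigned shard = owning_shard("user:42", nshards);
        // The request would be handed to `shard` over a lock-free SMP queue;
        // only that shard ever touches the key's memory, so no locks are needed.
        std::printf("key user:42 -> shard %u of %u\n", shard, nshards);
        return 0;
    }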

Traditional stack: a CPU scheduler runs threads, each with its own stack. A thread is a function pointer; a stack is a byte array from 64 KB to megabytes. Context switch cost is high, and large stacks pollute the caches.

Traditional stack vs ScyllaDB's stack: instead of a scheduler running threads with stacks, each CPU runs its own scheduler over promises and tasks. A promise is a pointer to an eventually computed value; a task is a pointer to a lambda function. No sharing, millions of parallel events per core.
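In Seastar this model is exposed as futures, promises, and per-shard task queues. A minimal sketch follows, assuming a recent Seastar release (headers and exact signatures vary between versions): it posts a task to another shard with smp::submit_to and continues with .then() when the future resolves, with no thread or stack per operation.

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] () -> seastar::future<> {
            // Post a task to another shard's queue; it returns a future for the
            // result. The task is a lambda plus a promise: no dedicated thread,
            // no large stack, no locks, just lock-free message passing.
            auto target = 1 % seastar::smp::count;  // shard 1, or 0 on a 1-core box
            return seastar::smp::submit_to(target, [] {
                return unsigned(seastar::this_shard_id());
            }).then([] (unsigned shard) {
                std::cout << "computed on shard " << shard << "\n";
            });
        });
    }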

An I/O Scheduler: query, commitlog, and compaction each get their own queue in front of a userspace I/O scheduler, which feeds the disk. The scheduler keeps the disk at its maximum useful concurrency, so excess I/O queues in userspace instead of in the FS/device.
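The idea of queueing in userspace can be illustrated with a toy concurrency cap. This is a conceptual sketch only; Seastar's real I/O scheduler is far more elaborate, with shares, latency goals, and per-shard queues, and every name below is made up for illustration: each I/O class holds a semaphore sized to the disk's maximum useful concurrency, so excess requests wait where they can still be prioritized rather than piling up inside the filesystem or the device.

    // Conceptual sketch only (C++20); all names are hypothetical.
    #include <cstdio>
    #include <semaphore>

    class io_class {
        std::counting_semaphore<> slots_;  // max useful disk concurrency for this class
    public:
        explicit io_class(int max_in_flight) : slots_(max_in_flight) {}

        // Run one I/O under the concurrency cap; excess requests wait here, in
        // userspace, instead of queueing inside the filesystem or the device.
        template <typename Io>
        auto submit(Io&& io) {
            slots_.acquire();
            auto result = io();            // e.g. issue a pread/pwrite
            slots_.release();
            return result;
        }
    };

    int main() {
        io_class query_class(64);          // latency-sensitive queries get more slots...
        io_class compaction_class(16);     // ...than background compaction
        int r1 = query_class.submit([] { return 0; /* read */ });
        int r2 = compaction_class.submit([] { return 0; /* write */ });
        std::printf("done: %d %d\n", r1, r2);
        return 0;
    }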

Does it Work?

Top

Small vs Large Machines: time between failures (TBF), ease of maintenance, no noisy neighbours, no virtualization or container overhead, no other moving parts. Scale up before out!

AWS i4i instance vertical scale: doubling (2X) the instance size at each step. Credit: Felipe Cardeneti Mendes.

Linear-Scale Ingestion, Constant-Time Ingestion (2X instance size at each step).

Linear-Scale Ingestion (2023 vs 2018), Constant-Time Ingestion (2X instance size at each step).

Ingestion: 17.9k OPS per shard * 16

Ingestion: <2ms P99 per shard

“Nodes must be small, in case they fail” You can scale OPS but can you scale failures?

“Nodes must be small, in case they fail”: no they don’t! {Replace, add, remove} a node in constant time, at every 2X scale step.

Compaction Scale (2X instance size at each step).

Compaction while ingesting: 70 MB/s per shard; 120 * 70 MB/s = 8.4 GB/s.

i4i.metal major compaction: 20 GB/s.

Scale Up before Scale Out

You Want More? Tablets: how to distribute data among the shards dynamically. Per-core challenges: QUIC (many connections, HLB). io_uring: get rid of interrupts, softirqs, and I/O stalls. Client-side shard-per-core.

Dor Laor [email protected] DorLaor@twitter Thank you! Let’s connect.