Quantifying the Performance Impact of Shard-per-core Architecture

ScyllaDB · 51 slides · Jun 26, 2024

About This Presentation

Most software isn’t architected to take advantage of modern hardware. How does a shard-per-core, shared-nothing architecture help, and exactly what impact can it make? Dor will examine technical opportunities and tradeoffs, as well as disclose the results of a new benchmark study.


Slide Content

Quantifying the Performance of Shard-per-Core Architecture. Dor Laor, Co-Founder of ScyllaDB

Dor Laor, Co-Founder of ScyllaDB. Something cool I’ve done: Hello world, core router, KVM, OSv, ScyllaDB. My perspective on P99: less is more. Another thing about me: aspiring MTB rider, orthopedic patient. What I do away from work: fight for democratization.

Let’s Shard

Past, Present & Future. Me: router FIB, device drivers, kernel load balancer; fault-tolerant virtualization using lockstep, single VCPU; Virtio, no false cache-line sharing; PV ticketlocks reduce TLB flushes; in-kernel apps, no syscalls; Van Jacobson networking; shard-per-core: Seastar & ScyllaDB.

Locking is Evil

The LOCK# Signal

The ‘Bible’ Documentation

Split Locks (detection): a bus lock on unaligned buffers still hurts.

Locking & NUMA & Virtualization

Locking & NUMA & Virtualization - Still

How long does it take to invalidate the cache?

Mutex::lock() cost: no contention, lock line in the local cache ~5 ns; no contention, lock line in a remote cache ~50 ns; contention: unbounded, depends on the lock holder.
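The gap between these cases is easy to reproduce. Below is a minimal sketch (my own illustration, not from the talk) that times a tiny std::mutex critical section first from one thread, then from several; absolute numbers depend on the CPU, the topology, and whether the contending threads share a cache.

    // Hypothetical micro-benchmark, build with: g++ -O2 -pthread mutex_cost.cc
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    static std::mutex m;
    static long counter = 0;

    // Time lock/unlock of a tiny critical section from `nthreads` threads and
    // return the average cost per acquisition in nanoseconds.
    static double lock_cost_ns(int nthreads, long iters_per_thread) {
        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (int t = 0; t < nthreads; ++t) {
            workers.emplace_back([iters_per_thread] {
                for (long i = 0; i < iters_per_thread; ++i) {
                    std::lock_guard<std::mutex> g(m);
                    ++counter;  // tiny critical section: the lock dominates
                }
            });
        }
        for (auto& w : workers) {
            w.join();
        }
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        return ns / (double(nthreads) * iters_per_thread);
    }

    int main() {
        std::printf("uncontended:           %6.1f ns per lock/unlock\n",
                    lock_cost_ns(1, 5000000));
        std::printf("contended (8 threads): %6.1f ns per lock/unlock\n",
                    lock_cost_ns(8, 1000000));
        return 0;
    }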

Locking is Evil Round-Trip is Evil

Cost of cross-core RTT

RTT measurement; Delayed Branch blog

AWS C5.metal, (28-4)*2 cores, Cascade Lake: 18-hop RTT in the worst case; hyperthreads share L1.

AWS C5.metal, (28-4)*2 cores, Cascade Lake

Ice Lake, AWS C6i.metal, 2*32 cores, 128 vcpu

Ice Lake, AWS C6i.metal, 2*32 cores, 128 vCPUs: hyperthread RTT; within-die mesh interconnect RTT; cross-NUMA-node UPI interconnect RTT.
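The cross-core RTT tiers in these slides can be approximated with a simple ping-pong over a shared atomic. The sketch below is my own illustration, not the Delayed Branch benchmark: it pins two threads to the CPUs given on the command line and bounces a cache line between them; try sibling hyperthreads, two cores on the same socket, and two cores on different NUMA nodes to see the three tiers.

    // Hypothetical cache-line ping-pong, Linux only.
    // Build: g++ -O2 -pthread rtt.cc && ./a.out <cpuA> <cpuB>
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>
    #include <pthread.h>
    #include <sched.h>
    #include <thread>

    static std::atomic<std::uint64_t> flag{0};
    static constexpr std::uint64_t kRounds = 1000000;

    // Pin the calling thread to a single CPU so the traffic crosses a known path.
    static void pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(int argc, char** argv) {
        int cpu_a = argc > 1 ? std::atoi(argv[1]) : 0;
        int cpu_b = argc > 2 ? std::atoi(argv[2]) : 1;

        // Responder: wait for an odd value, answer with the next even one.
        std::thread responder([cpu_b] {
            pin_to_cpu(cpu_b);
            std::uint64_t expected = 1;
            for (std::uint64_t i = 0; i < kRounds; ++i) {
                while (flag.load(std::memory_order_acquire) != expected) { /* spin */ }
                flag.store(expected + 1, std::memory_order_release);
                expected += 2;
            }
        });

        pin_to_cpu(cpu_a);
        auto t0 = std::chrono::steady_clock::now();
        std::uint64_t next = 1;
        for (std::uint64_t i = 0; i < kRounds; ++i) {
            flag.store(next, std::memory_order_release);               // ping
            while (flag.load(std::memory_order_acquire) != next + 1) { /* pong */ }
            next += 2;
        }
        auto t1 = std::chrono::steady_clock::now();
        responder.join();

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("avg round trip cpu%d <-> cpu%d: %.1f ns\n",
                    cpu_a, cpu_b, ns / kRounds);
        return 0;
    }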

Single access -> jemalloc variability: small object from the tcache ~25-50 ns; tcache miss, arena access with no lock contention ~200 ns; arena access with lock contention ~400 ns; arena miss, call mmap() ~2000 ns and up.
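Allocation-latency variance like this is straightforward to observe. A rough sketch follows, assuming the figures above refer to jemalloc's tcache/arena path; it records per-malloc latency and prints tail percentiles. Note that every sample also includes clock-read overhead, and that linking against jemalloc versus the default allocator changes the shape of the tail.

    // Hypothetical allocation-latency probe. Link with -ljemalloc to measure
    // jemalloc, or without it to measure the default allocator.
    // Build: g++ -O2 alloc_tail.cc [-ljemalloc]
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main() {
        constexpr int kSamples = 1000000;
        std::vector<double> ns(kSamples);
        std::vector<void*> keep;
        keep.reserve(kSamples);

        for (int i = 0; i < kSamples; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            void* p = std::malloc(64);   // small object, usually served by the tcache
            auto t1 = std::chrono::steady_clock::now();
            keep.push_back(p);           // keep it live so the allocator keeps working
            ns[i] = std::chrono::duration<double, std::nano>(t1 - t0).count();
        }

        // Each sample also contains the (roughly constant) clock-read overhead.
        std::sort(ns.begin(), ns.end());
        std::printf("p50=%.0fns  p99=%.0fns  p99.9=%.0fns  max=%.0fns\n",
                    ns[kSamples / 2], ns[kSamples * 99 / 100],
                    ns[kSamples * 999 / 1000], ns.back());

        for (void* p : keep) {
            std::free(p);
        }
        return 0;
    }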

P99 Latency == Variance. Sources of evil: NUMA/NUCA, remote-CPU RTT, locking, TLB flushes, system calls, interrupts, contention.

“Intellectuals solve problems, geniuses prevent them” LWN Link

Birth of Our Shard-per-Core: a "locks are evil" attitude; Van Jacobson's net channels in OSv; pivot to databases, solve 330 VMs = 1M OPS.

Shard-per-Core Architecture: threads vs. shards.

Traditional vs Sharded. Traditional stack: Cassandra threads on top of the kernel's TCP/IP, scheduler, and NIC queues, all sharing memory; lock contention, cache contention, NUMA unfriendly. Don't do it! Seastar's sharded stack: one application/database shard per core, each with its own task scheduler, SMP message queues, and NIC queue, with networking either through the kernel or in userspace via DPDK (the kernel isn't involved); no contention, linear scaling, NUMA friendly.
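The routing side of this shared-nothing design is simple to sketch. The helper below is hypothetical (ScyllaDB's real mapping goes through token ownership, not a plain hash), but it shows the core idea: every partition key deterministically maps to exactly one owning shard, so only that core ever touches the key's data.

    #include <cstdio>
    #include <functional>
    #include <string>
    #include <thread>

    // Hypothetical routing helper: hash the partition key and pick the owning
    // shard. ScyllaDB's real mapping goes through token ownership; this only
    // illustrates that exactly one shard owns each key.
    inline unsigned owning_shard(const std::string& partition_key, unsigned nshards) {
        return std::hash<std::string>{}(partition_key) % nshards;
    }

    int main() {
        // One shard per core in a shard-per-core design.
        unsigned nshards = std::thread::hardware_concurrency();
        if (nshards == 0) nshards = 1;
        unsigned shard = owning_shard("user:42", nshards);
        // The request would be handed to `shard` over a lock-free SMP queue;
        // only that shard ever touches the key's memory, so no locks are needed.
        std::printf("key user:42 -> shard %u of %u\n", shard, nshards);
        return 0;
    }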

Traditional stack: a CPU scheduler runs threads, each with its own stack. A thread is a function pointer; a stack is a byte array from 64 KB to megabytes. Context switch cost is high, and large stacks pollute the caches.

Traditional stack vs ScyllaDB's stack: instead of a scheduler running threads with stacks, each CPU runs its own scheduler over promises and tasks. A promise is a pointer to an eventually computed value; a task is a pointer to a lambda function. No sharing, millions of parallel events per core.
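In Seastar this model is exposed as futures, promises, and per-shard task queues. A minimal sketch follows, assuming a recent Seastar release (headers and exact signatures vary between versions): it posts a task to another shard with smp::submit_to and continues with .then() when the future resolves, with no thread or stack per operation.

    #include <seastar/core/app-template.hh>
    #include <seastar/core/future.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] () -> seastar::future<> {
            // Post a task to another shard's queue; it returns a future for the
            // result. The task is a lambda plus a promise: no dedicated thread,
            // no large stack, no locks, just lock-free message passing.
            auto target = 1 % seastar::smp::count;  // shard 1, or 0 on a 1-core box
            return seastar::smp::submit_to(target, [] {
                return unsigned(seastar::this_shard_id());
            }).then([] (unsigned shard) {
                std::cout << "computed on shard " << shard << "\n";
            });
        });
    }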

An I/O Scheduler: query, commitlog, and compaction each get their own queue in front of a userspace I/O scheduler, which feeds the disk. The scheduler keeps the disk at its maximum useful concurrency, so excess I/O queues in userspace instead of in the FS/device.
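The idea of queueing in userspace can be illustrated with a toy concurrency cap. This is a conceptual sketch only; Seastar's real I/O scheduler is far more elaborate, with shares, latency goals, and per-shard queues, and every name below is made up for illustration: each I/O class holds a semaphore sized to the disk's maximum useful concurrency, so excess requests wait where they can still be prioritized rather than piling up inside the filesystem or the device.

    // Conceptual sketch only (C++20); all names are hypothetical.
    #include <cstdio>
    #include <semaphore>

    class io_class {
        std::counting_semaphore<> slots_;  // max useful disk concurrency for this class
    public:
        explicit io_class(int max_in_flight) : slots_(max_in_flight) {}

        // Run one I/O under the concurrency cap; excess requests wait here, in
        // userspace, instead of queueing inside the filesystem or the device.
        template <typename Io>
        auto submit(Io&& io) {
            slots_.acquire();
            auto result = io();            // e.g. issue a pread/pwrite
            slots_.release();
            return result;
        }
    };

    int main() {
        io_class query_class(64);          // latency-sensitive queries get more slots...
        io_class compaction_class(16);     // ...than background compaction
        int r1 = query_class.submit([] { return 0; /* read */ });
        int r2 = compaction_class.submit([] { return 0; /* write */ });
        std::printf("done: %d %d\n", r1, r2);
        return 0;
    }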

Does it Work?

Top

Small vs Large Machines: time between failures (TBF), ease of maintenance, no noisy neighbours, no virtualization or container overhead, no other moving parts. Scale up before out!

AWS i4i instance vertical scale: doubling (2X) the instance size at each step. Credit: Felipe Cardeneti Mendes.

Linear-Scale Ingestion, Constant-Time Ingestion (2X instance size at each step).

Linear-Scale Ingestion (2023 vs 2018), Constant-Time Ingestion (2X instance size at each step).

Ingestion: 17.9k OPS per shard * 16

Ingestion: <2ms P99 per shard

“Nodes must be small, in case they fail” You can scale OPS but can you scale failures?

“Nodes must be small, in case they fail”: no they don’t! {Replace, add, remove} a node in constant time, at every 2X scale step.

Compaction Scale (2X instance size at each step).

Compaction while ingesting: 70 MB/s per shard; 120 * 70 MB/s = 8.4 GB/s.

i4i.metal major compaction: 20 GB/s.

Scale Up before Scale Out

You Want More? Tablets: how to distribute data among the shards dynamically. Per-core challenges: QUIC (many connections, HLB). io_uring: get rid of interrupts, softirqs, and I/O stalls. Client-side shard-per-core.

Dor Laor [email protected] DorLaor@twitter Thank you! Let’s connect.