P99 Publish Performance in a Multi-Cloud NATS.io System

ScyllaDB, Jun 25, 2024

About This Presentation

This talk walks through the strategies and improvements made to the NATS server to meet P99 latency goals for persistent publishing to NATS JetStream, replicated across all three major cloud providers over private networks.


Slide Content

P99 Publish Performance in Multi-Cloud NATS.io
Derek Collison, Founder & CEO, Synadia

Derek Collison, Founder & CEO, Synadia
Something cool I've done: NATS.io, CloudFoundry, Google AJAX APIs, TIBCO Rendezvous and EMS
My perspective on P99s: a systems-level problem
Another thing about me: conducted federal trials for the first intra-arterial blood gas device
What I do away from work: travel, boating

GOAL

<20ms P99 Streaming Publish across Multi-Cloud
Use NATS / JetStream to publish messages which will be stored in multiple cloud providers at the same time through consensus
Requirement of using all three major cloud providers as a single system
P99 under load of <20ms
1 Gbit private network
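As a rough illustration of how a <20ms P99 target can be checked from the client side, here is a minimal Go sketch that publishes synchronously to JetStream in a loop, records each publish-to-ACK latency, and computes percentiles from the samples. It assumes the nats.go client, a hypothetical server URL, and a stream already listening on the hypothetical subject events.bench; it is not the benchmark harness used in the talk.

```go
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical URL; in the talk's setup this would resolve to one of the
	// nine servers via per-provider DNS.
	nc, err := nats.Connect("nats://nats.example.internal:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	const samples = 10000
	latencies := make([]time.Duration, 0, samples)
	payload := []byte("hello")

	for i := 0; i < samples; i++ {
		start := time.Now()
		// Publish blocks until the stream leader returns a PubAck,
		// i.e. after the replicas have reached consensus on the message.
		if _, err := js.Publish("events.bench", payload); err != nil {
			panic(err)
		}
		latencies = append(latencies, time.Since(start))
	}

	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	pct := func(p float64) time.Duration {
		return latencies[int(float64(len(latencies)-1)*p)]
	}
	fmt.Printf("P50=%v P75=%v P99=%v\n", pct(0.50), pct(0.75), pct(0.99))
}
```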

What is NATS.io

NATS.io and JetStream

NATS.io and JetStream
NATS is a modern connective technology for adaptive edge and distributed systems
Can do microservices, streaming, key-value and object stores
Based on a core pub/sub engine with location-independent addressing
At-most-once, at-least-once, and exactly-once* delivery
Static binary; servers can link to form any topology and are multi-operator
Secure multi-tenancy
JetStream is our persistence and clustering / consensus technology
More info: nats.io / youtube.com/@SynadiaCommunications
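For readers new to NATS, here is a minimal core pub/sub sketch in Go using the nats.go client (no JetStream yet), illustrating the subject-based, location-independent addressing mentioned above. The server URL and subject name are hypothetical.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to any server in the system; subjects, not hosts, address traffic.
	nc, err := nats.Connect("nats://demo.example.internal:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// Subscribe to a subject and make sure the server has processed the SUB.
	sub, err := nc.SubscribeSync("orders.created")
	if err != nil {
		panic(err)
	}
	if err := nc.Flush(); err != nil {
		panic(err)
	}

	// Publish to the same subject (at-most-once core delivery).
	if err := nc.Publish("orders.created", []byte("order 42")); err != nil {
		panic(err)
	}

	// Receive the message.
	msg, err := sub.NextMsg(time.Second)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(msg.Data))
}
```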

NATS Topology

NATS Topology
This is a single NATS system spanning 9 servers, with 3 servers per cloud provider
Cloud providers are connected via private 1 Gbit links, 3-5ms RTT
Topology is actually a full-mesh, one-hop NATS cluster
All have persistence through JetStream
All have tags that allow placement of JetStream streams and consumers
Anti-affinity tags allow R3 streams across all 3 providers and different AZs
DNS for the system and for each provider
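One simple way to sanity-check latency from a client's point of view is the nats.go RTT helper, which round-trips a PING/PONG to whichever server the client is connected to. Note this measures client-to-server RTT, only a rough proxy for the 3-5ms inter-cloud links above, and the per-provider DNS names below are hypothetical stand-ins for the deck's "DNS for system and each provider".

```go
package main

import (
	"fmt"

	"github.com/nats-io/nats.go"
)

func main() {
	// Hypothetical per-provider DNS names; the real system exposes one name
	// per cloud provider plus one for the whole system.
	urls := []string{
		"nats://aws.nats.example.internal:4222",
		"nats://gcp.nats.example.internal:4222",
		"nats://azure.nats.example.internal:4222",
	}
	for _, u := range urls {
		nc, err := nats.Connect(u)
		if err != nil {
			fmt.Println(u, "connect error:", err)
			continue
		}
		// RTT round-trips a PING/PONG to the connected server.
		rtt, err := nc.RTT()
		if err != nil {
			fmt.Println(u, "rtt error:", err)
		} else {
			fmt.Printf("%s -> %s, rtt=%v\n", u, nc.ConnectedUrl(), rtt)
		}
		nc.Close()
	}
}
```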

Creating a Stream (Anti-affinity tags)

Creating a Stream
Stream creation uses anti-affinity to place an R3 stream with one replica in each cloud provider (CP)
Applications can connect to any NATS server
JetStream runs a meta-layer for assignments; all servers have a copy (think of it as the system's DNA)
HA / R3 assets run a consensus algorithm built upon NRGs
NRGs are the NATS implementation of Raft
Fast cooperative transfer and storage fault detection
A leader is selected and it listens / subscribes for the subjects assigned to the stream
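As a sketch of what R3 stream creation with placement tags can look like from the nats.go client, the snippet below creates a 3-replica stream and requests placement by tag. The stream name, subject, and tag value are hypothetical, and the actual anti-affinity behavior (spreading replicas across providers and AZs) comes from how the servers are tagged and configured, as described in the deck, not from this client call alone.

```go
package main

import (
	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats.example.internal:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Create an R3 stream; the JetStream meta-layer picks three peers for it.
	// Placement tags (hypothetical value here) steer that selection toward
	// servers carrying matching tags.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "EVENTS",
		Subjects: []string{"events.>"},
		Replicas: 3,
		Placement: &nats.Placement{
			Tags: []string{"tier:primary"},
		},
	})
	if err != nil {
		panic(err)
	}
}
```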

Publishing to a Stream
Clients can connect to any server
They publish / send a message to a subject that the stream is listening on
If the message has a reply subject, we respond with an ACK
The ACK is only sent after consensus has been reached
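Since the ACK only comes back after consensus, a publisher that needs throughput as well as a latency bound will typically pipeline publishes and wait for the acks. Below is a hedged nats.go sketch using async publish; the URL, subject, and pending limit are hypothetical choices, not values from the talk.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats.example.internal:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// Cap how many publishes may be in flight awaiting their acks.
	js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
	if err != nil {
		panic(err)
	}

	// Pipeline publishes; each ack arrives only after the R3 replicas
	// have agreed on the message.
	for i := 0; i < 1000; i++ {
		if _, err := js.PublishAsync("events.demo", []byte(fmt.Sprintf("msg %d", i))); err != nil {
			panic(err)
		}
	}

	// Wait for all outstanding acks (or give up after a timeout).
	select {
	case <-js.PublishAsyncComplete():
		fmt.Println("all messages acknowledged")
	case <-time.After(5 * time.Second):
		fmt.Println("timed out waiting for acks")
	}
}
```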

GOAL <20ms P99 Streaming Publish

How did we do?

First Results
System started out not heavily loaded
P50 - P75 were decent
Load was then increased (more streams, more publishers, more consumers)
P50 still OK, P75 started to drift
P99 was > 5s in some instances - NOT GOOD

Digging In..

Triage
NATS servers <= 2.9.x have one connection between them
All traffic from multiple accounts is mux'd across that connection
A NATS server running JetStream could have many internal subscriptions
Many of these would run inline, and hence be susceptible to delays
Disk IO was a major culprit
Network IO was already offloaded due to the NATS server architecture

Removing IO from the hot path
Work was done to remove any internal callbacks into JetStream from routes
Storing a message had already been removed from the inline path, but the NRG traffic was still inline
Moved to just a lock and a copy inline with all route processing
Queued up internally for another execution context to process
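The pattern described above, take a lock, copy the message, append it to an internal queue, and let a different execution context do the expensive work, can be sketched in plain Go roughly as follows. This is an illustrative toy with made-up names, not the NATS server's actual IPQ implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// workQueue is a toy version of the idea: the hot path only locks, copies,
// appends, and signals; a worker goroutine drains batches and does the
// expensive processing (e.g. storage or NRG handling) off the hot path.
type workQueue struct {
	mu     sync.Mutex
	items  [][]byte
	signal chan struct{} // capacity 1: "there is work"
}

func newWorkQueue() *workQueue {
	return &workQueue{signal: make(chan struct{}, 1)}
}

// push is what the route/hot path calls: lock, copy, append, signal.
func (q *workQueue) push(msg []byte) {
	cp := append([]byte(nil), msg...) // copy; the caller may reuse its buffer
	q.mu.Lock()
	q.items = append(q.items, cp)
	q.mu.Unlock()
	select {
	case q.signal <- struct{}{}:
	default: // a wakeup is already pending
	}
}

// worker drains the queue in batches in its own execution context.
func (q *workQueue) worker(process func([]byte)) {
	for range q.signal {
		q.mu.Lock()
		batch := q.items
		q.items = nil
		q.mu.Unlock()
		for _, m := range batch {
			process(m)
		}
	}
}

func main() {
	q := newWorkQueue()
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		count := 0
		q.worker(func(m []byte) {
			count++
			if count == 3 {
				fmt.Println("processed 3 messages off the hot path")
			}
		})
	}()

	q.push([]byte("a"))
	q.push([]byte("b"))
	q.push([]byte("c"))

	// In a real server the queue lives for the process lifetime; here we just
	// close the signal channel so the worker drains and exits.
	close(q.signal)
	wg.Wait()
}
```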

Second Results
We saw improvements
P50 - P90 were looking decent
Under load we were still getting not-great results for P95 and P99

Digging Back In..

Second Triage
Determined that some coarse-grained locks were experiencing contention
Interest-based retention streams were causing issues with NRG snapshots
This was due to a bad initial design for storing state on interior deletes

Removing coarse-grained locks
Work was done to add additional fine-grained locks in the hot path
In some instances we used atomics (NATS is written in Go)
The only lock / atomic required was now for the internal queues
Each function that needed an execution context had its own dedicated IPQ
Go channels are too slow for this section of the server
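The claim that channels were too slow for this particular hot path is easy to probe on your own hardware. The toy benchmark below compares only the producer-side enqueue cost of a buffered channel send against a mutex-guarded slice append (the rough shape of an IPQ-style enqueue); it handles the consumer side differently in each case and is not the measurement the NATS team did, so treat any numbers as indicative only.

```go
// Save as ipq_test.go in a package named ipq and run: go test -bench=. -benchmem
package ipq

import (
	"sync"
	"testing"
)

// Producer-side cost of a buffered channel send, with a consumer goroutine
// draining on the other end.
func BenchmarkChannelSend(b *testing.B) {
	ch := make(chan int, 1024)
	done := make(chan struct{})
	go func() {
		for range ch {
		}
		close(done)
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	b.StopTimer()
	close(ch)
	<-done
}

// Producer-side cost of a mutex-guarded append; the drain is simulated inline
// by resetting the slice each batch, so this under-counts consumer work.
func BenchmarkMutexAppend(b *testing.B) {
	var mu sync.Mutex
	const batch = 1024
	q := make([]int, 0, batch)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		mu.Lock()
		q = append(q, i)
		if len(q) == batch {
			q = q[:0] // pretend a consumer drained the batch
		}
		mu.Unlock()
	}
}
```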

Switch to Limits Retention
Interest-based retention can cause interior deletes
Massive interior deletes caused issues with NRG snapshots for <= 2.9
The original design did not properly account for these
Large Key-Value stores also had issues
Redesigned in 2.10, with ~1600x improvement in time and ~50,000x in space
Switched to limits-based retention for now
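For reference, this is roughly what choosing limits-based retention looks like when defining a stream from nats.go. Retention policy generally cannot be changed on an existing stream, so in practice this means creating (or recreating) the stream with limits retention rather than flipping a flag on a live interest-based stream. The stream name, subject, and limit values below are hypothetical.

```go
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats.example.internal:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Limits-based retention: messages age out by count/bytes/age instead of
	// being removed as consumers acknowledge them, which avoids the interior
	// deletes that interest-based retention can generate.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "EVENTS_LIMITS",
		Subjects:  []string{"events.>"},
		Replicas:  3,
		Retention: nats.LimitsPolicy,
		MaxMsgs:   10_000_000,
		MaxBytes:  10 * 1024 * 1024 * 1024, // 10 GiB
		MaxAge:    24 * time.Hour,
	})
	if err != nil {
		panic(err)
	}
}
```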

Third Results
We saw more improvements 🙂
P50 - P95 were looking decent
Under load we were still getting so-so results for P99 and above

Third Triage
Determined that the memory system was now our bottleneck 🤨
Never seen this before (sans maybe the very early days of Go, the 0.52 days)

Removing memory contention
Work was done to use sync.Pools for objects we needed to post on IPQs
These are thread-friendly in terms of contention
Functions now have their own IPQ with a dedicated lock or atomic and a sync.Pool
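A minimal sketch of the sync.Pool approach, assuming made-up types: objects that cross the queue boundary are taken from a pool on the hot path and returned once the worker is done with them, so the allocator and GC see far less churn. A plain channel stands in for the IPQ here to keep the example short; this is illustrative, not the server's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// pubEntry is a made-up stand-in for the per-message object posted on an IPQ.
type pubEntry struct {
	subject string
	data    []byte
}

var entryPool = sync.Pool{
	New: func() any { return new(pubEntry) },
}

// enqueue grabs a pooled object on the hot path instead of allocating.
func enqueue(subject string, data []byte, q chan<- *pubEntry) {
	e := entryPool.Get().(*pubEntry)
	e.subject = subject
	e.data = append(e.data[:0], data...) // reuse the backing buffer when possible
	q <- e
}

// worker processes entries and returns them to the pool for reuse.
func worker(q <-chan *pubEntry, done chan<- struct{}) {
	for e := range q {
		fmt.Printf("processing %s (%d bytes)\n", e.subject, len(e.data))
		entryPool.Put(e)
	}
	close(done)
}

func main() {
	q := make(chan *pubEntry, 64)
	done := make(chan struct{})
	go worker(q, done)

	enqueue("events.1", []byte("hello"), q)
	enqueue("events.2", []byte("world"), q)

	close(q)
	<-done
}
```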

Final Results
We saw more improvements 🙂
P50 - P99 were looking decent and we reached our goal
Under load we could still get good results
However, not all was perfect 🤨
The system reacted poorly on certain setups:
Clumping of client connections
Clumping of JetStream asset leaders (streams or consumers)
Large interior deletes on NRG snapshots

Next Steps
JetStream peer selection can already be modified, and we try to balance
Peer selection and placement can be changed during runtime as well
Balancing client connections is possible but manual; we will automate it
Each JetStream asset leader can be instructed to step down and select a new leader
This process also needs to be automated
NATS v2.10 has multi-path routing, with mux'd and pinned routes
NATS v2.10 has redesigned NRG snapshot logic that is >1000x better in latency and size
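One of the manual operations mentioned above, asking a stream leader to step down so a new leader is elected, can be driven over the JetStream API. The sketch below sends a request to what I understand to be the step-down API subject ($JS.API.STREAM.LEADER.STEPDOWN.<stream>); treat the subject, the hypothetical stream name "EVENTS", and the response handling as assumptions, and check the JetStream API reference (or use the nats CLI) before relying on it.

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats.example.internal:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// Assumed JetStream API subject for a stream leader step-down; the
	// stream name ("EVENTS") is hypothetical.
	subj := "$JS.API.STREAM.LEADER.STEPDOWN.EVENTS"
	resp, err := nc.Request(subj, nil, 2*time.Second)
	if err != nil {
		panic(err)
	}
	// The raw JSON response indicates whether the step-down was accepted.
	fmt.Println(string(resp.Data))
}
```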

NATS v2.10 is released!

Derek Collison
[email protected]
@derekcollison
nats.io / synadia.com
Thank you! Let's connect.