P99 Publish Performance in a Multi-Cloud NATS.io System
ScyllaDB
About This Presentation
This talk walks through the strategies and improvements made to the NATS server to accomplish P99 goals for persistent publishing to NATS JetStream, replicated across all three major cloud providers over private networks.
Size: 3.93 MB
Language: en
Added: Jun 25, 2024
Slides: 37 pages
Slide Content
P99 Publish Performance in Multi-Cloud NATS.io
Derek Collison, Founder & CEO, Synadia
Derek Collison, Founder & CEO, Synadia
- Something cool I've done: NATS.io, CloudFoundry, Google AJAX APIs, TIBCO Rendezvous and EMS
- My perspective on P99s: a systems-level problem
- Another thing about me: conducted federal trials for the first intra-arterial blood gas device
- What I do away from work: travel, boating
GOAL
<20ms P99 Streaming Publish across Multi-Cloud
- Use NATS / JetStream to publish messages that are stored in multiple cloud providers at the same time through consensus
- Requirement: use all three major cloud providers as a single system
- P99 under load of <20ms
- 1Gbit private network
What is NATS.io
NATS.io and JetStream
NATS.io and JetStream
- NATS is a modern connective technology for adaptive edge and distributed systems
- Can do microservices, streaming, key-value and object stores
- Based on a core pub/sub engine with location-independent addressing
- At-most-once, at-least-once, and exactly-once* delivery
- Static binary; servers can link to form any topology and are multi-operator
- Secure multi-tenancy
- JetStream is our persistence and clustering / consensus technology
- More info: nats.io / youtube.com/@SynadiaCommunications
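For readers who have not used NATS before, here is a minimal core pub/sub example using the nats.go client. The demo server URL and the subject name are illustrative, not taken from the talk:

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to any server in a NATS system (URL is illustrative).
	nc, err := nats.Connect("nats://demo.nats.io:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	// Location-independent addressing: subscribers express interest in a subject.
	sub, _ := nc.SubscribeSync("greetings")

	// Publishers send to the subject without knowing who, or where, the subscribers are.
	nc.Publish("greetings", []byte("hello from core NATS"))

	if msg, err := sub.NextMsg(time.Second); err == nil {
		fmt.Printf("received: %s\n", string(msg.Data))
	}
}
```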
NATS Topology
NATS Topology
- This is a single NATS system spanning 9 servers, with 3 servers per cloud provider
- Cloud providers are connected via private 1Gbit links, 3-5ms RTT
- The topology is actually a full mesh, one-hop NATS cluster
- All have persistence through JetStream
- All have tags that allow placement of JetStream streams and consumers
- Anti-affinity tags allow R3 streams across all 3 providers and different AZs
- DNS for the system and for each provider
Creating a Stream (Anti-affinity tags)
Creating a Stream
- Stream creation uses anti-affinity to place one replica of an R3 stream on each cloud provider
- Applications can connect to any NATS server
- JetStream runs a meta-layer for assignments; all servers have a copy (e.g. DNA)
- HA / R3 assets run a consensus algorithm built upon NRGs
- NRGs are the NATS implementation of RAFT
- Fast cooperative transfer and storage fault detection
- A leader is selected and listens / subscribes for the subjects assigned to the stream
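As an illustration of the client side of this, here is a minimal sketch using the nats.go client to ask the meta-layer for an R3 stream. The stream name and subjects are hypothetical, and the cross-provider anti-affinity described above is assumed to come from the server-side tag configuration rather than from anything in this snippet:

```go
package main

import "github.com/nats-io/nats.go"

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Ask for an R3 stream; the JetStream meta-layer picks the three peers.
	// Spreading those replicas across the three cloud providers is assumed to
	// be enforced by the anti-affinity tags configured on the servers.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:     "EVENTS",              // hypothetical stream name
		Subjects: []string{"events.>"},  // hypothetical subject space
		Replicas: 3,
	})
	if err != nil {
		panic(err)
	}
}
```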
Publishing to a Stream
- Clients can connect to any server
- They publish / send a message to a subject that the stream is listening on
- If the message has a reply subject, we respond with an ACK
- The ACK is only sent after consensus has been reached
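A minimal sketch of such a publish with the nats.go client; the subject and payload are hypothetical. A synchronous JetStream publish does not return until the ACK arrives, and the ACK is only sent after consensus, so timing the call is one way to observe the latency the P99 goal targets:

```go
package main

import (
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Blocks until the stream leader replies with an ACK, i.e. after the
	// message has been accepted by consensus across the replicas.
	start := time.Now()
	ack, err := js.Publish("events.orders", []byte("order created"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("stored in %s at seq %d, publish latency %v\n",
		ack.Stream, ack.Sequence, time.Since(start))
}
```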
GOAL <20ms P99 Streaming Publish
How did we do?
First Results
- System started out not heavily loaded
- P50 - P75 were decent
- Load was then increased (more streams, more publishers, more consumers)
- P50 still ok, P75 started to drift
- P99 was > 5s in some instances - NOT GOOD
Digging In..
Triage
- NATS servers <= 2.9.x have one connection between them
- All traffic from multiple accounts is mux'd across that connection
- A NATS server running JetStream could have many internal subscriptions
- Many of these would run inline, and hence be susceptible to delays
- Disk IO was a major culprit
- Network IO was already offloaded due to the NATS server architecture
Removing IO from the hot path
- Work was done to remove any internal callbacks to JetStream from routes
- Storing a message had already been removed from the hot path, but the NRG traffic was still inline
- Moved to a lock and a copy inline with all route processing
- Queued up internally for another execution context to process
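The slides do not show the server code, but the pattern described, a short lock and a copy on the hot path with the real work handed to another execution context, looks roughly like this hedged Go sketch (all names here are invented for illustration):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// queue is a minimal stand-in for an internal processing queue: producers
// append under a short lock, and a dedicated goroutine drains the backlog.
type queue struct {
	mu      sync.Mutex
	pending [][]byte
	signal  chan struct{}
}

func newQueue() *queue {
	q := &queue{signal: make(chan struct{}, 1)}
	go q.loop()
	return q
}

// push is what a route's read loop would call: a lock, a copy, and a wakeup.
// No disk IO or consensus work happens inline.
func (q *queue) push(msg []byte) {
	cp := append([]byte(nil), msg...) // copy, since the read buffer gets reused
	q.mu.Lock()
	q.pending = append(q.pending, cp)
	q.mu.Unlock()
	select {
	case q.signal <- struct{}{}:
	default: // a wakeup is already pending
	}
}

// loop runs in its own execution context and does the expensive work
// (here just a print; in the real server this would be NRG / storage processing).
func (q *queue) loop() {
	for range q.signal {
		q.mu.Lock()
		batch := q.pending
		q.pending = nil
		q.mu.Unlock()
		for _, m := range batch {
			fmt.Printf("processed %d bytes off the hot path\n", len(m))
		}
	}
}

func main() {
	q := newQueue()
	q.push([]byte("replicated route message"))
	time.Sleep(100 * time.Millisecond) // give the drain goroutine time to run
}
```

The point of the pattern is that the route read loop only ever pays for a lock and a copy; everything slow sits behind the queue.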
Second Results
- We saw improvements
- P50 - P90 were looking decent
- Under load we were still getting not-great results for P95 and P99
Digging Back In..
Second Triage
- Determined that some coarse-grained locks were experiencing contention
- Interest-based retention streams were causing issues with NRG snapshots
- This was due to a bad initial design for storing state on interior deletes
Removing coarse-grained locks
- Work was done to add additional fine-grained locks in the hot path
- In some instances we used atomics (NATS is written in Go)
- The only lock / atomic now required was for the internal queues
- Each function that needed an execution context had its own dedicated IPQ
- Go channels are too slow for this section of the server
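A hedged sketch of the general idea, not the actual server code: unrelated state stops sharing one coarse mutex, simple counters become atomics, and each internal queue keeps its own small lock (type and field names are invented):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Before: one coarse lock guarding unrelated counters and queues forces every
// goroutine to serialize on the same mutex.
type coarse struct {
	mu      sync.Mutex
	inMsgs  int64
	outMsgs int64
	pending [][]byte
}

// After (sketch): counters become atomics, and the internal queue gets its own
// dedicated lock, so producers only contend on the state they actually touch.
type fineGrained struct {
	inMsgs  atomic.Int64
	outMsgs atomic.Int64

	queueMu sync.Mutex
	pending [][]byte
}

func (s *fineGrained) track(inbound bool) {
	if inbound {
		s.inMsgs.Add(1) // no lock needed for simple counters
	} else {
		s.outMsgs.Add(1)
	}
}

func (s *fineGrained) enqueue(msg []byte) {
	s.queueMu.Lock()
	s.pending = append(s.pending, msg)
	s.queueMu.Unlock()
}

func main() {
	var s fineGrained
	s.track(true)
	s.enqueue([]byte("m"))
	fmt.Println("inMsgs:", s.inMsgs.Load(), "queued:", len(s.pending))
}
```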
Switch to Limits Retention
- Interest-based retention can cause interior deletes
- Massive interior deletes caused issues with NRG snapshots for <= 2.9
- The original design did not properly account for these
- Large Key/Value stores also had issues
- Redesigned in 2.10: ~1600x improvement in time and ~50,000x in space
- Switched to limits-based retention for now
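A hedged sketch, assuming the nats.go client, of what switching a stream from interest-based to limits-based retention looks like; the stream name and the specific limits are illustrative, not the values used in the talk:

```go
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// Interest retention deletes messages as consumers ack them, which can
	// leave large numbers of interior deletes. Limits retention only removes
	// messages when the configured limits are hit.
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "EVENTS",
		Subjects:  []string{"events.>"},
		Replicas:  3,
		Retention: nats.LimitsPolicy,      // was nats.InterestPolicy
		MaxAge:    24 * time.Hour,         // illustrative limit
		MaxBytes:  8 * 1024 * 1024 * 1024, // 8 GiB, illustrative limit
	})
	if err != nil {
		panic(err)
	}
}
```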
Third Results
- We saw more improvements 🙂
- P50 - P95 were looking decent
- Under load we were still getting so-so results for P99 and above
- Determined that the memory system was now our bottleneck 🤨
- Never seen this before (sans maybe the very early days of Go, 0.52 days)
Third Triage
- Determined that the memory system was now our bottleneck 🤨
- Never seen this before (sans maybe the very early days of Go, 0.52 days)
Removing memory contention
- Work was done to use sync.Pool for objects we needed to post on IPQs
- These are thread-friendly in terms of contention
- Functions now have their own IPQ with a dedicated lock or atomic and a sync.Pool
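A hedged sketch of the pattern, not the server's implementation: objects posted onto an internal queue are taken from a sync.Pool and returned after processing, so concurrent producers stop contending on the allocator (all names are invented for illustration):

```go
package main

import (
	"fmt"
	"sync"
)

// job stands in for the objects the server posts onto its internal queues.
type job struct {
	subject string
	payload []byte
}

// jobPool lets concurrent producers reuse job objects instead of allocating a
// new one per message; sync.Pool keeps contention low via per-P caches.
var jobPool = sync.Pool{
	New: func() any { return new(job) },
}

type ipq struct {
	mu      sync.Mutex
	pending []*job
}

func (q *ipq) push(subject string, payload []byte) {
	j := jobPool.Get().(*job)
	j.subject = subject
	j.payload = append(j.payload[:0], payload...) // reuse the backing buffer
	q.mu.Lock()
	q.pending = append(q.pending, j)
	q.mu.Unlock()
}

func (q *ipq) drain() {
	q.mu.Lock()
	batch := q.pending
	q.pending = nil
	q.mu.Unlock()
	for _, j := range batch {
		fmt.Printf("processing %s (%d bytes)\n", j.subject, len(j.payload))
		jobPool.Put(j) // return the object for reuse once processed
	}
}

func main() {
	var q ipq
	q.push("events.orders", []byte("order created"))
	q.drain()
}
```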
Final Results
- We saw more improvements 🙂
- P50 - P99 were looking decent and we reached our goal
- Under load we could still get good results
- However, not all was perfect 🤨
- The system reacted poorly in certain setups:
  - Clumping of client connections
  - Clumping of JetStream asset leaders (streams or consumers)
  - Large interior deletes on NRG snapshots
Next Steps
- JetStream peer selection can already be modified, and we try to balance
- Peer selection and placement can be changed during runtime as well
- Balancing client connections is possible but manual; we will automate it
- Each JetStream asset leader can be instructed to step down and select a new leader
- This process also needs to be automated
- NATS v2.10 has multi-path routing, with mux'd and pinned routes
- NATS v2.10 has redesigned NRG snapshot logic that is >1000x better in latency and size