Architecting a High-Performance (Open Source) Distributed Message Queuing System in C++

ScyllaDB 229 views 16 slides Jul 02, 2024

About This Presentation

BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves te...


Slide Content

Architecting a High-Performance Distributed Message Queuing System
Vitaly Dzhitenov, Senior Engineer at Bloomberg

Agenda
- BlazingMQ overview
- Distributed architecture
- Key performance-related ideas
- Broker architecture
- Fixing bottlenecks

What is BlazingMQ?
- Multi-producer, multi-consumer message queue
- Physical decoupling, as well as temporal isolation, between the actors
- Guaranteed acknowledgment
- Message persistence and replication
- High availability
- Transport abstraction
- Scalability (just add more workers / applications); high fan-out ratio (1:6,000+)

Distributed architecture
[Diagram. Labels: App (x4), Proxy (x2), Node (x4), Data Center 1, Data Center 2; node roles: Leader, Primary, Replica (x2)]

Queue trajectories
[Diagram. Labels: Producer (x3), Consumer (x7), Proxy (x7), Primary, Replica (x5); message flows: PUTs, PUSHes]

BlazingMQ at Bloomberg
- Battle-tested in production for eight years
- 55,000+ queues
- Processing billions of messages and terabytes of data daily
- Low latency
  - For 600,000 msg/sec to a queue with no persistence and a fan-out ratio of 5, the median latency is 1.7 ms
  - For 150,000 msg/sec over 10 persistent queues, the median latency is 1.4 ms
- https://bloomberg.github.io/blazingmq/docs/performance/benchmarks/

Performance
- Actor thread model
- Batching
- Memory and object pools, polymorphic allocators
- No data copying
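
The slide names memory pools and polymorphic allocators without saying which classes are used. As a rough illustration of the idea, the sketch below uses the C++17 standard library's std::pmr facilities as a stand-in (this is an assumption, not necessarily what BlazingMQ uses). The names CountingResource and heapFallbacks are hypothetical. Messages are carved out of a preallocated arena, and a counting upstream resource shows whether the general-purpose heap was touched at all.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical helper: forwards to the default heap and counts how often
// it is asked for memory.
class CountingResource : public std::pmr::memory_resource {
  public:
    std::size_t allocations() const { return d_allocations; }

  private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override {
        ++d_allocations;
        return std::pmr::new_delete_resource()->allocate(bytes, alignment);
    }
    void do_deallocate(void* p, std::size_t bytes,
                       std::size_t alignment) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, alignment);
    }
    bool do_is_equal(
        const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }

    std::size_t d_allocations = 0;
};

// Builds `count` messages of `size` bytes each inside a 64 KiB arena and
// returns how many times the arena had to fall back to the heap.
std::size_t heapFallbacks(std::size_t count, std::size_t size) {
    CountingResource heap;
    std::array<std::byte, 64 * 1024> buffer;
    std::pmr::monotonic_buffer_resource arena(buffer.data(), buffer.size(),
                                              &heap);

    // The vector and every string it holds share the same arena, because
    // polymorphic allocators propagate to the elements.
    std::pmr::vector<std::pmr::string> messages(&arena);
    messages.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        messages.emplace_back(size, 'x');
    }
    return heap.allocations();
}
```

With these assumptions, a workload that fits in the arena never reaches the heap, while a larger one spills over to the upstream resource. A long-running broker would more likely reuse pooled blocks than a one-shot arena; the arena is used here only because its behavior is easy to observe.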

Actors
- Client
  - Reading/writing to client
  - Statistics and validation
  - Queue lookup
- Queue
  - Storage and replication
  - Data routing
- Cluster
  - Reading/writing to cluster nodes
  - Cluster health
  - Primary node
  - Queue lookup

Actor Model
[Diagram. Labels: Producer, Consumer, Proxy (x2), Primary, Replica (x2); actors: Client (x4), Queue (x5), Cluster (x5); message flows: PUT, PUSH]

Event Dispatcher
[Diagram. Labels: DispatcherClient type: CLIENT, CLUSTER, QUEUE; ThreadPool; Queue; Processors: Monitored SingleConsumerQueues; Clients; EventPool]
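
The dispatcher labels point at processor threads that each drain their own single-consumer queue. A minimal sketch of that actor pattern follows, assuming a mutex-protected deque in place of a real single-consumer queue. Processor, QueueActor, and runProducers are hypothetical names, not BlazingMQ classes. The point is that an actor's state is only ever touched from its processor thread, so the state itself needs no lock.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical processor: one thread draining one event queue.
class Processor {
  public:
    Processor() : d_thread([this] { run(); }) {}
    ~Processor() { stop(); }

    // Called from any thread: enqueue an event for this processor.
    void post(std::function<void()> event) {
        {
            std::lock_guard<std::mutex> lock(d_mutex);
            d_events.push_back(std::move(event));
        }
        d_ready.notify_one();
    }

    // Runs the remaining events, then joins the thread. Safe to call twice.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(d_mutex);
            d_stopping = true;
        }
        d_ready.notify_one();
        if (d_thread.joinable()) {
            d_thread.join();
        }
    }

  private:
    void run() {
        std::unique_lock<std::mutex> lock(d_mutex);
        for (;;) {
            d_ready.wait(lock,
                         [this] { return d_stopping || !d_events.empty(); });
            if (d_events.empty()) {
                return;  // stopping and fully drained
            }
            std::function<void()> event = std::move(d_events.front());
            d_events.pop_front();
            lock.unlock();
            event();  // actor code runs outside the queue lock
            lock.lock();
        }
    }

    std::mutex d_mutex;
    std::condition_variable d_ready;
    std::deque<std::function<void()>> d_events;
    bool d_stopping = false;
    std::thread d_thread;  // declared last: starts after the other members
};

// Hypothetical actor: its counter is modified only on the processor thread.
struct QueueActor {
    explicit QueueActor(Processor& processor) : d_processor(processor) {}

    void put() {
        d_processor.post([this] { ++d_messages; });
    }

    Processor& d_processor;
    long d_messages = 0;  // no mutex, no atomic
};

// Several producer threads post PUTs to one actor; returns the final count.
long runProducers(int producers, int putsEach) {
    Processor processor;
    QueueActor queue(processor);
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&queue, putsEach] {
            for (int i = 0; i < putsEach; ++i) {
                queue.put();
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    processor.stop();  // after the join, reading the actor state is safe
    return queue.d_messages;
}
```

The trade-off in this design is that contention moves from the actor's state to its event queue, which is why a cheap single-consumer queue matters in a real dispatcher.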

Batching
- Batch builders for every data type
- Flushing (to the network) on:
  - Size limit, fixed or auto-tuning
  - Dispatcher queue idleness
- Intelligent batching decisions:
  - Adjustable batch size
  - Interdependent flushing
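
The two flush triggers listed on the slide, a size limit and dispatcher queue idleness, can be sketched roughly as below. This is an illustration under stated assumptions rather than BlazingMQ's batch builder: BatchBuilder, onIdle, and batchSizes are hypothetical names, the size limit is fixed rather than auto-tuned, and the "network" is just a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical batch builder: accumulates messages and hands complete
// batches to a flush callback.
class BatchBuilder {
  public:
    using Batch = std::vector<std::string>;
    using FlushFn = std::function<void(Batch&&)>;

    BatchBuilder(std::size_t maxBytes, FlushFn flushFn)
    : d_maxBytes(maxBytes), d_flushFn(std::move(flushFn)) {}

    void add(std::string message) {
        // Trigger 1: size limit. Flush first if this message would push
        // the current batch over the limit.
        if (!d_batch.empty() && d_bytes + message.size() > d_maxBytes) {
            flush();
        }
        d_bytes += message.size();
        d_batch.push_back(std::move(message));
        if (d_bytes >= d_maxBytes) {
            flush();
        }
    }

    // Trigger 2: the dispatcher queue has no more events, so holding the
    // partial batch any longer would only add latency.
    void onIdle() { flush(); }

  private:
    void flush() {
        if (d_batch.empty()) {
            return;
        }
        Batch out;
        out.swap(d_batch);
        d_bytes = 0;
        d_flushFn(std::move(out));
    }

    std::size_t d_maxBytes;
    FlushFn d_flushFn;
    Batch d_batch;
    std::size_t d_bytes = 0;
};

// Feeds messages of the given sizes, signals idleness once at the end, and
// returns the number of messages in each flushed batch.
std::vector<std::size_t> batchSizes(
    std::size_t maxBytes, const std::vector<std::size_t>& messageSizes) {
    std::vector<std::size_t> sizes;
    BatchBuilder builder(maxBytes, [&sizes](BatchBuilder::Batch&& batch) {
        sizes.push_back(batch.size());
    });
    for (std::size_t n : messageSizes) {
        builder.add(std::string(n, 'x'));
    }
    builder.onIdle();
    return sizes;
}
```

With a 10-byte limit and five 4-byte messages, the size limit produces two batches of two messages each, and the idle signal flushes the last message on its own.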

Advanced batching
[Diagram. Labels: Proxy, Primary, Replica; actors: Client, Queue, Cluster; Channel; network; message flows: PUT, PUSH, Replication]

Actor bottleneck
[Diagram. Labels: Cluster; Client (x3); Queue (x3)]

The solution
- Separate Control and Data planes
- Keep Cluster on the Control Plane and bypass it on the Data Plane
- Queue takes over Context, Statistics, and Validation work on the Data Plane
- Queue validates data using lockless synchronization with Cluster
- AtomicGate
  - Based on one atomic int
  - Multiple lockless, non-blocking AtomicGate::tryEnter
  - Single AtomicGate::open, AtomicGate::closeAndDrain
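
The slide gives only the outline of AtomicGate: one atomic int, many non-blocking tryEnter callers, and a single caller of open and closeAndDrain. One way such a gate could work is sketched below. The bit layout, the leave method, and the tryProcess helper are assumptions made for illustration; the actual BlazingMQ class may differ.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Sketch of a gate built on one atomic int. Assumed layout: the lowest bit
// is the "closed" flag and the remaining bits count threads inside.
class AtomicGate {
  public:
    explicit AtomicGate(bool isOpen) : d_state(isOpen ? 0 : k_CLOSED) {}

    // Data plane, many threads: never blocks. On success the caller is
    // "inside" and must call leave() when done.
    bool tryEnter() {
        const int before = d_state.fetch_add(k_ENTER);
        if (before & k_CLOSED) {
            d_state.fetch_sub(k_ENTER);  // undo: the gate was closed
            return false;
        }
        return true;
    }

    // Assumed counterpart of tryEnter (not named on the slide).
    void leave() { d_state.fetch_sub(k_ENTER); }

    // Control plane, single thread.
    void open() { d_state.fetch_and(~k_CLOSED); }

    // Control plane, single thread: refuse new entries, then wait until
    // every thread that is already inside has left.
    void closeAndDrain() {
        d_state.fetch_or(k_CLOSED);
        while (d_state.load() != k_CLOSED) {
            std::this_thread::yield();
        }
    }

  private:
    static constexpr int k_CLOSED = 1;
    static constexpr int k_ENTER = 2;

    std::atomic<int> d_state;
};

// Hypothetical data-plane step: do the work only while the gate is open.
bool tryProcess(AtomicGate& gate, int& processed) {
    if (!gate.tryEnter()) {
        return false;
    }
    ++processed;
    gate.leave();
    return true;
}
```

In this sketch the data-plane cost is one atomic add on entry and one atomic subtract on exit, with no mutex. Only the control-plane caller ever waits, and only while closing the gate.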

Published as Open Source!
- https://github.com/bloomberg/blazingmq
- https://bloomberg.github.io/blazingmq
- https://bloomberg.github.io/blazingmq/docs/performance/benchmarks/

Thank you! Let’s connect.
Vitaly Dzhitenov
@TechAtBloomberg