Architecting a High-Performance (Open Source) Distributed Message Queuing System in C++
ScyllaDB
16 slides
Jul 02, 2024
About This Presentation
BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves terabytes of data and billions of messages across tens of thousands of queues in production every day.
BlazingMQ provides highly available, fault-tolerant queues through replication based on the Raft consensus algorithm. In addition, it provides a rich set of enterprise message routing strategies, enabling users to implement a variety of scenarios for message processing.
Written in C++ from the ground up, BlazingMQ has been architected with low latency as one of its core requirements. This has resulted in some unique design and implementation choices at all levels of the system, such as its lock-free threading model, custom memory allocators, compact wire protocol, multi-hop network topology, and more.
This talk will provide an overview of BlazingMQ. We will then delve into the system’s core design principles, architecture, and implementation details in order to explore the crucial role they play in its performance and reliability.
*BlazingMQ will be released as open source between now and P99 (exact timing is still TBD)
Slide Content
Architecting a High-Performance Distributed Message Queuing System
  Vitaly Dzhitenov, Senior Engineer at Bloomberg
What is BlazingMQ?
  Multi-producer, multi-consumer message queue
  Physical decoupling, as well as temporal isolation, between the actors
  Guaranteed acknowledgment
  Message persistence and replication
  High availability
  Transport abstraction
  Scalability (just add more workers / applications); high fan-out ratio (1:6,000+)
Distributed architecture
  [Diagram. Labels: App, Proxy, Node, Primary, Leader, Replica, Data Center 1, Data Center 2]
BlazingMQ at Bloomberg
  Battle-tested in production for eight years
  55,000+ queues
  Processing billions of messages and terabytes of data daily
  Low latency:
    At 600,000 msg/sec to a queue without persistence, with a fan-out ratio of 5, the median latency is 1.7 ms
    At 150,000 msg/sec across 10 persistent queues, the median latency is 1.4 ms
  https://bloomberg.github.io/blazingmq/docs/performance/benchmarks/
Performance
  Actor thread model
  Batching
  Memory and object pools, polymorphic allocators
  No data copying
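To make the "memory and object pools, polymorphic allocators" item concrete, here is a minimal sketch using the standard C++17 std::pmr facilities as a stand-in. The allocator and pool types BlazingMQ actually uses are not shown on the slide, and CountingResource, churn, directAllocations, and pooledAllocations are hypothetical names introduced only for this illustration. The sketch counts how often short-lived containers reach the upstream allocator with and without a pool in between.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical resource: forwards to an upstream resource and counts how
// many allocation requests reach it.
class CountingResource : public std::pmr::memory_resource {
  public:
    std::size_t allocations() const { return d_allocations; }

  private:
    void* do_allocate(std::size_t bytes, std::size_t alignment) override
    {
        ++d_allocations;
        return d_upstream->allocate(bytes, alignment);
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override
    {
        d_upstream->deallocate(p, bytes, alignment);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override
    {
        return this == &other;
    }

    std::pmr::memory_resource* d_upstream = std::pmr::new_delete_resource();
    std::size_t                d_allocations = 0;
};

// Creates and destroys `rounds` short-lived vectors that take their memory
// from `source`, imitating per-message scratch buffers.
void churn(std::pmr::memory_resource* source, int rounds)
{
    assert(source != nullptr);
    for (int i = 0; i < rounds; ++i) {
        std::pmr::vector<int> scratch(source);
        scratch.resize(16);  // requests one block from `source`
    }
}

// Every vector goes straight to the counted resource.
std::size_t directAllocations(int rounds)
{
    CountingResource counter;
    churn(&counter, rounds);
    return counter.allocations();
}

// A pool sits between the vectors and the counted resource, so freed blocks
// are reused instead of being requested again from upstream.
std::size_t pooledAllocations(int rounds)
{
    CountingResource counter;
    {
        std::pmr::unsynchronized_pool_resource pool(&counter);
        churn(&pool, rounds);
    }
    return counter.allocations();
}
```

With the pool in place, the upstream request count stays small and roughly constant instead of growing with the number of rounds. The exact count depends on the standard library implementation.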
Actors
  Client
    Reading/writing to client
    Statistics and validation
    Queue lookup
  Queue
    Storage and replication
    Data routing
  Cluster
    Reading/writing to cluster nodes
    Cluster health
    Primary node
    Queue lookup
Actor Model
  [Diagram. Labels: Producer, Consumer, Proxy, Primary, Replica; Client, Queue, and Cluster actors; PUT and PUSH message flows]
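The actor thread model named in the slides can be illustrated with a generic sketch: each actor owns its state and a mailbox, and only the actor's own thread touches that state. The Actor class and countOnActor function below are hypothetical and are not BlazingMQ's dispatcher. For brevity the mailbox is protected by a mutex, which is a simplification compared with the lock-free threading model the abstract describes.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical actor: tasks posted from any thread run one at a time on the
// actor's own thread, so state owned by the actor needs no lock of its own.
class Actor {
  public:
    Actor() : d_thread([this] { run(); }) {}

    // Processes everything still in the mailbox, then stops the thread.
    ~Actor()
    {
        {
            std::lock_guard<std::mutex> lock(d_mutex);
            d_stopping = true;
        }
        d_ready.notify_one();
        d_thread.join();
    }

    Actor(const Actor&)            = delete;
    Actor& operator=(const Actor&) = delete;

    // Enqueues `task` for execution on the actor thread.
    void post(std::function<void()> task)
    {
        assert(task != nullptr);
        {
            std::lock_guard<std::mutex> lock(d_mutex);
            d_mailbox.push_back(std::move(task));
        }
        d_ready.notify_one();
    }

  private:
    void run()
    {
        std::unique_lock<std::mutex> lock(d_mutex);
        for (;;) {
            d_ready.wait(lock, [this] { return d_stopping || !d_mailbox.empty(); });
            if (d_mailbox.empty()) {
                return;  // stopping was requested and the mailbox is drained
            }
            std::function<void()> task = std::move(d_mailbox.front());
            d_mailbox.pop_front();
            lock.unlock();  // run the task without holding the mailbox lock
            task();
            lock.lock();
        }
    }

    std::mutex                        d_mutex;
    std::condition_variable           d_ready;
    std::deque<std::function<void()>> d_mailbox;
    bool                              d_stopping = false;
    std::thread                       d_thread;  // declared last: starts after the rest
};

// Several producer threads post increments; only the actor thread ever
// modifies `counter`, so the counter itself is a plain integer.
long countOnActor(int producers, int perProducer)
{
    long counter = 0;
    {
        Actor                    queueActor;
        std::vector<std::thread> threads;
        for (int p = 0; p < producers; ++p) {
            threads.emplace_back([&queueActor, &counter, perProducer] {
                for (int i = 0; i < perProducer; ++i) {
                    queueActor.post([&counter] { ++counter; });
                }
            });
        }
        for (std::thread& t : threads) {
            t.join();
        }
    }  // the actor's destructor drains the mailbox and joins its thread
    return counter;
}
```

The point of the design is that contention is confined to the hand-off into the mailbox, while the work on the actor's state runs single-threaded.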
Batching
  Batch builders for every data type
  Flushing (to the network) on:
    Size limit, fixed or auto-tuning
    Dispatcher queue idleness
  Intelligent batching decisions:
    Adjustable batch size
    Interdependent flushing
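The two flush triggers listed above (a size limit and dispatcher queue idleness) can be sketched as follows. This is an illustration under assumptions, not BlazingMQ's batch builder: BatchBuilder, onIdle, and flushedBatchSizes are hypothetical names, the size limit is fixed rather than auto-tuned, and the "network" is just a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical batch builder: accumulates messages and hands a whole batch
// to `sink` when the size limit is reached or when the owner reports that
// it has nothing else to do.
class BatchBuilder {
  public:
    using Batch = std::vector<std::string>;
    using Sink  = std::function<void(const Batch&)>;

    BatchBuilder(std::size_t maxBytes, Sink sink)
    : d_maxBytes(maxBytes)
    , d_sink(std::move(sink))
    {
        assert(maxBytes > 0);
    }

    // Adds one message, flushing first if it would not fit in the current
    // batch, and flushing afterwards if the batch has reached the limit.
    void add(std::string message)
    {
        if (!d_batch.empty() && d_bytes + message.size() > d_maxBytes) {
            flush();
        }
        d_bytes += message.size();
        d_batch.push_back(std::move(message));
        if (d_bytes >= d_maxBytes) {
            flush();
        }
    }

    // Called when the owning dispatcher queue is idle: send what we have
    // rather than wait for the batch to fill up.
    void onIdle() { flush(); }

  private:
    void flush()
    {
        if (d_batch.empty()) {
            return;
        }
        d_sink(d_batch);
        d_batch.clear();
        d_bytes = 0;
    }

    std::size_t d_maxBytes;
    Sink        d_sink;
    Batch       d_batch;
    std::size_t d_bytes = 0;
};

// Feeds `messages` through a builder, signals idleness once at the end, and
// returns the number of messages in each flushed batch.
std::vector<std::size_t> flushedBatchSizes(std::size_t                     maxBytes,
                                           const std::vector<std::string>& messages)
{
    std::vector<std::size_t> sizes;
    BatchBuilder             builder(maxBytes, [&sizes](const BatchBuilder::Batch& batch) {
        sizes.push_back(batch.size());
    });
    for (const std::string& message : messages) {
        builder.add(message);
    }
    builder.onIdle();
    return sizes;
}
```

For example, five 4-byte messages with a 10-byte limit leave the builder as two size-triggered batches of two messages each, followed by one idle-triggered batch holding the last message.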
Advanced batching
  [Diagram. Labels: Proxy, Primary, Replica; Client, Queue, and Cluster actors; Channel; network; PUT, PUSH, and Replication flows]
Actor bottleneck
  [Diagram. Labels: Cluster, Client, Queue]
The solution
  Separate Control and Data planes
  Keep Cluster on the Control Plane and bypass it on the Data Plane
  Queue takes over Context, Statistics, and Validation work on the Data Plane
  Queue validates data using lockless synchronization with Cluster
  AtomicGate
    Based on one atomic int
    Multiple lockless, non-blocking AtomicGate::tryEnter
    Single AtomicGate::open, AtomicGate::closeAndDrain
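One way a gate with these properties could be built on a single atomic int is sketched below. This is not the BlazingMQ source. The names tryEnter, open, and closeAndDrain come from the slide, while leave, the bit layout, and the initial open state are assumptions made for the illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Sketch of a gate built on one atomic int. The lowest bit marks the gate
// as closed; the remaining bits count the threads currently inside.
class AtomicGate {
  public:
    // Non-blocking, callable from many threads. Returns true if the caller
    // is now inside the gate and must later call leave().
    bool tryEnter()
    {
        const int before = d_state.fetch_add(k_ENTER);
        if (before & k_CLOSED) {
            d_state.fetch_sub(k_ENTER);  // undo the optimistic entry
            return false;
        }
        return true;
    }

    // Assumed counterpart of tryEnter (not named on the slide).
    void leave()
    {
        const int before = d_state.fetch_sub(k_ENTER);
        assert(before >= k_ENTER);
        (void)before;
    }

    // Called by the single controlling thread: lets new entries in again.
    void open() { d_state.fetch_and(~k_CLOSED); }

    // Called by the single controlling thread: refuses new entries, then
    // waits until every thread already inside has left.
    void closeAndDrain()
    {
        d_state.fetch_or(k_CLOSED);
        while (d_state.load() != k_CLOSED) {
            std::this_thread::yield();
        }
    }

  private:
    static constexpr int k_CLOSED = 1;  // flag bit
    static constexpr int k_ENTER  = 2;  // one unit of the entry count

    std::atomic<int> d_state{0};  // assumption: the gate starts open and empty
};
```

In this sketch a data-plane thread would wrap its work in tryEnter and leave, and skip the work when tryEnter returns false. The control plane calls closeAndDrain before changing state that the data plane reads, then open once the change is complete. Only the controlling thread ever waits. The many callers of tryEnter never block.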
Published as Open Source!
  https://github.com/bloomberg/blazingmq
  https://bloomberg.github.io/blazingmq
  https://bloomberg.github.io/blazingmq/docs/performance/benchmarks/