Deprecating Distributed Locks for Low Latency Event Consumption by Danish Rehman
ScyllaDB
About This Presentation
This session covers how Attentive scaled mutual exclusivity in its event-driven architecture by evolving from Redis-based locks to Apache Pulsar’s FAILOVER and KEY_SHARED subscription modes. Operating at the scale of 620M texts/day during peak events, we share how we addressed the “split-brain” effect, improved the latency of event consumption, and navigated the trade-offs across subscription models. We’ll also dive into key lessons learned in building a reliable, low-latency, high-throughput message orchestration system.
Slide Content
A ScyllaDB Community
Deprecating Distributed Locks for
Low Latency Event Consumption
Danish Rehman
Staff Software Engineer
Danish Rehman (he/him)
Staff Software Engineer at Attentive
■Contributor to Apache Pulsar project
■Passionate about low-latency engineering
■Deep interest in distributed systems
■I spend my time outdoor swimming and growing fruit
Understanding
Attentive
Attentive
Attentive is a marketing automation platform that helps businesses send
personalized, real-time messages, primarily via SMS and email, to engage
customers, drive sales, and build brand loyalty.
The Marketing Challenge
■Reaching subscribers is not just about sending messages
■It’s about when the message is received
■Timing is the difference between engagement and irrelevance
■Latency directly impacts customer attention
The Role of Low Latency in Engagement
Imagine: Customer leaves a cart at checkout
■If a reminder SMS arrives 10 minutes late → ignored
■If it arrives within seconds → recovered sale
■Milliseconds = Millions in revenue
Scale + personalization + speed = engagement
Operational Scale
Traffic closely follows consumer shopping patterns, with peak activity around
noon ET and significant surges during holiday periods.
■Trusted by 8,000+ brands to power personalized customer experiences
■Sub-second decisions delivered at global scale
■3.9 billion SMS and email messages sent in 2024
■620 million messages on Black Friday 2024
Understanding event-driven
architecture
■We use Apache Pulsar as our distributed messaging and streaming platform.
■Producers (User intent) produce events
■Consumers (Microservices) consume & provide actionable business logic
Apache Pulsar with SHARED subscription mode
■Multiple consumers can attach
to the same subscription
■Each message is delivered to
only one consumer
■Messages are distributed in a
round-robin fashion across
consumers
■Ensures load balancing across
consumers, but does not
guarantee ordering per key
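For reference, here is a minimal sketch of attaching a consumer with a SHARED subscription using the Java Pulsar client; the broker URL, topic, and subscription names are illustrative, not from the talk:

    import org.apache.pulsar.client.api.*;

    public class SharedConsumerExample {
        public static void main(String[] args) throws Exception {
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650") // illustrative broker URL
                    .build();

            // Any number of consumers may attach to the same SHARED subscription;
            // the broker distributes messages across them round-robin.
            Consumer<byte[]> consumer = client.newConsumer()
                    .topic("persistent://public/default/user-events")
                    .subscriptionName("send-decision-workers")
                    .subscriptionType(SubscriptionType.Shared)
                    .subscribe();

            Message<byte[]> msg = consumer.receive();
            // ... process the event, then acknowledge ...
            consumer.acknowledge(msg);

            consumer.close();
            client.close();
        }
    }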
Understanding the problem
Split brain: events from a user are processed in parallel
■Events for the same user
can be processed
concurrently by different
consumers
■Each consumer makes
independent send
decisions, without
coordination
■Example: A single user may
receive multiple
overlapping messages
■Result: Poor user
experience (confusing,
repetitive communication)
■Consequence: Increased
risk of subscriber
frustration and opt-outs
Events from a given user are processed in order
■Goal: Ensure that events for the same
user are not processed concurrently
■Optimal approach: Route all of a
user’s events to a single processing
unit
■Guarantees consistent
decision-making for each user
■Prevents duplicate or conflicting
messages
■Lays the foundation for low-latency,
reliable message orchestration
Various solutions
Solution 01: Using Redis as a distributed lock
First Attempt: To enforce exclusivity, we initially relied on Redis-based distributed
locks (e.g., Redlock)
■In Pulsar’s SHARED mode, exclusivity requires an external lock
■First solution: Redis-based distributed lock (Redlock)
■These provided necessary exclusivity during MVP stage
■Allowed safe decision-making without duplicate sends
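As an illustration of that first approach, here is a minimal sketch of a per-user lock using Redisson's RLock; the talk does not name a specific client library, and the lock name and timeouts are assumptions:

    import org.redisson.Redisson;
    import org.redisson.api.RLock;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;
    import java.util.concurrent.TimeUnit;

    public class RedisLockExample {
        public static void main(String[] args) throws InterruptedException {
            Config config = new Config();
            config.useSingleServer().setAddress("redis://localhost:6379");
            RedissonClient redisson = Redisson.create(config);

            // One lock per user: only the holder may make send decisions.
            RLock lock = redisson.getLock("user-lock:user-123");
            // Wait up to 100 ms to acquire; auto-release after 5 s so a crashed
            // consumer cannot hold the lock forever.
            if (lock.tryLock(100, 5000, TimeUnit.MILLISECONDS)) {
                try {
                    // ... make the send decision for this user ...
                } finally {
                    lock.unlock();
                }
            } else {
                // Another consumer holds the lock: this consumer stalls or retries.
            }

            redisson.shutdown();
        }
    }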
Solution 01: Using Redis as a distributed lock
Impact / Results
■✅ Effective for MVP → gave required exclusivity
■❌ At scale, performance degradation emerged
■Growing traffic = higher lock pressure
Result: System became slower and less efficient during peak load
Solution 01: Using Redis as a distributed lock
Concerns
■❌ Increased lock contention
■❌ Consumer stalling
Concern 0: Increased lock contention as scale grew
❌ Increased lock contention
■High-traffic shopping events
trigger many concurrent events
■Multiple consumers compete
for locks → efficiency drops
Concern 1: Consumer stalling
❌ Consumer stalling
■Example: 4 consumers, only 1
acquires the lock
■Others stall until timeout →
cycle repeats with new events
■Causes delays, contention, and
poor throughput
■Some consumers remain idle,
never processing events
Solution 01: Using Redis as a distributed lock
Broader Concerns
■❌ Lack of fencing tokens
■❌ Vulnerable to GC pauses & network delays → ordering violations
■❌ Not ideal when business correctness depends on exclusivity
Key lesson: Redis locks solve early-stage needs, but don't scale to the strong
consistency and low-latency requirements of an event-driven architecture
Solution 02: Using Apache Pulsar's FAILOVER mode
Why We Adopted Failover mode
■Ahead of BFCM (Black Friday/Cyber Monday), we aligned on evolving our
architecture
■Goal: Address bottlenecks in the existing message orchestration system
■Strategy: Switch to Pulsar’s FAILOVER subscription mode for:
●Guaranteed ordering per key
●Mutual exclusivity in event processing
Apache Pulsar with FAILOVER subscription mode
■A master consumer is chosen for
each partition
■If the master disconnects → next
consumer in line takes over
■Similar to Kafka consumer groups
(one consumer per partition)
■Guarantees:
●Events for the same key
always handled by a single
consumer
●Mutual exclusivity without
external locks
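In the Java client, the change from SHARED is a single builder setting; a sketch, reusing the illustrative names from the SHARED example:

    import org.apache.pulsar.client.api.*;

    public class FailoverConsumerExample {
        static Consumer<byte[]> subscribe(PulsarClient client) throws PulsarClientException {
            // Only the subscription type changes; the broker elects one active
            // (master) consumer per partition, so no external lock is needed.
            return client.newConsumer()
                    .topic("persistent://public/default/user-events")
                    .subscriptionName("send-decision-workers")
                    .subscriptionType(SubscriptionType.Failover)
                    .subscribe();
        }
    }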
Solution 02: Using Apache Pulsar's FAILOVER mode
Impacts
■✅ Eliminated multi-consumer stalling caused by SHARED mode
■✅ Achieved per-user exclusivity in processing
■✅ Simplified coordination
●one partition → one active consumer
■⚡ Improved stability during peak BFCM traffic
But: Trade-offs emerged that challenged long-term adoption
Solution 02: Using Apache Pulsar's FAILOVER mode
Concerns
■❌ Head-of-line blocking:
●A stalled task blocks all subsequent keys on that consumer, increasing latency and reducing throughput
■❌ Scalability Limits:
●Consumers ≤ partitions → idle consumers if overscaled
●Increasing partitions is irreversible → not flexible
■❌ Scaling During Peak Load:
●Limited to vertical scaling per consumer
●Horizontal scaling constrained by partition count
■❌ Retry Support Issue:
●Built-in retries not supported (to preserve ordering)
Solution 03: Apache Pulsar's KEY_SHARED mode
Why We Adopted KEY_SHARED
■Goal: Meet business need for mutual exclusivity while mitigating earlier
tradeoffs
■Benefit: Reduced head-of-line blocking at partition level (though not
eliminated)
Apache Pulsar with KEY_SHARED subscription mode
■Unlike SHARED, all messages with
the same key go to the same
consumer
■Guarantees ordering per key ✅
■Still enables parallel processing
across different keys
■Ideal for enforcing mutual
exclusivity at user level
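A sketch of both sides under KEY_SHARED, with the same illustrative names as before; using the user ID as the key is an assumption that matches the talk's goal of per-user exclusivity:

    import org.apache.pulsar.client.api.*;

    public class KeySharedExample {
        static Consumer<byte[]> subscribe(PulsarClient client) throws PulsarClientException {
            // All messages carrying the same key are routed to the same consumer.
            return client.newConsumer()
                    .topic("persistent://public/default/user-events")
                    .subscriptionName("send-decision-workers")
                    .subscriptionType(SubscriptionType.Key_Shared)
                    .subscribe();
        }

        static void send(Producer<byte[]> producer, String userId, byte[] payload)
                throws PulsarClientException {
            // The producer must set a key (here: the user ID) for routing to work.
            producer.newMessage()
                    .key(userId)
                    .value(payload)
                    .send();
        }
    }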
Solution 03: Apache Pulsar's KEY_SHARED mode
Impacts / Results
■✅ Reduced head-of-line blocking
■✅ Enabled temporary scaling up for BFCM traffic surge
■✅ Allowed cost-efficient scale down post-event
Result: A resounding success during BFCM. KEY_SHARED proved invaluable for
temporary scaling while preserving exclusivity
Solution 03: Apache Pulsar's KEY_SHARED mode
Concern
■❌ Risky producer-consumer coupling
●the first event's key must be correct or the contract breaks
■❌ Reduced throughput
●smaller producer batches = higher CPU load
■❌ Mutual exclusivity violations possible during pod crash/shutdown
●requires idempotent consumers + graceful deployment configs
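The throughput concern comes from producer batching: under KEY_SHARED a batch must not span multiple consumers, so batches are grouped by key and end up smaller. A sketch of the producer-side setting, with illustrative names:

    import org.apache.pulsar.client.api.*;

    public class KeyBasedBatchingExample {
        static Producer<byte[]> createProducer(PulsarClient client) throws PulsarClientException {
            // KEY_BASED batching groups messages by key so one batch never spans
            // multiple KEY_SHARED consumers; smaller batches mean more
            // per-message overhead, hence the CPU cost noted above.
            return client.newProducer()
                    .topic("persistent://public/default/user-events")
                    .batcherBuilder(BatcherBuilder.KEY_BASED)
                    .create();
        }
    }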
Tools used during this rollout
1.Distributed Log Tracing
Distributed log tracing with a unique UUID provides debuggability across the
organization
■Provides visibility into message flows across the system
■Helps identify deviations in behavior early
■Enables tracing anomalous user experiences back to the point of
consumption
■Critical for spot-checking during migrations and design changes
■Strengthens confidence in maintaining low-latency, reliable delivery
Distributed Log Tracing
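A minimal sketch of one way to propagate such a UUID, attaching it as a Pulsar message property and surfacing it through the SLF4J logging context; the property name "traceId" is an assumption, not from the talk:

    import org.apache.pulsar.client.api.*;
    import org.slf4j.MDC;
    import java.util.UUID;

    public class TracingExample {
        // Producer side: attach a fresh trace UUID as a message property.
        static void sendTraced(Producer<byte[]> producer, String key, byte[] payload)
                throws PulsarClientException {
            producer.newMessage()
                    .key(key)
                    .property("traceId", UUID.randomUUID().toString())
                    .value(payload)
                    .send();
        }

        // Consumer side: put the UUID into the logging context so every log
        // line emitted while processing this message carries it.
        static void process(Message<byte[]> msg) {
            MDC.put("traceId", msg.getProperty("traceId"));
            try {
                // ... business logic; all logs now include the trace UUID ...
            } finally {
                MDC.remove("traceId");
            }
        }
    }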
2. Mutual Exclusivity Counter
■Purpose: Verify that mutual exclusivity is being maintained in event-driven
processing
■Ensures system integrity under at-least-once delivery guarantees
■Logic:
●If counter > 0 → exclusivity violated for that key
●Metric is published to observability tools
●Alerts the team for outages or degraded performance
■Provided a confidence signal during multiple architectural evolutions
Usage of mutual exclusivity counter
FUNCTION executeWithMutualExclusivityCounter(key):
    // Track operation start
    incrementMutualExclusivityCounter(key)
    TRY:
        // Perform the critical operation
        result = performCriticalTask(key)
        RETURN result
    FINALLY:
        // Always decrement to maintain correct state
        decrementMutualExclusivityCounter(key)
Decrement mutual exclusivity counter
FUNCTION decrementMutualExclusivityCounter(key):
    counter = getAtomicCounter(key)
    counter.decrement()
Mutual Exclusivity Counter: Increment
FUNCTION incrementMutualExclusivityCounter(key):
    // Retrieve counter from persistent store
    counter = getAtomicCounter(key)
    IF counter.getValue() > 0:
        // Another operation is in progress.
        publishToObservabilityApp(key, counter.getValue())
    counter.increment()
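Putting the three slides together, here is a runnable Java sketch of the pattern; the talk only says the counter lives in a persistent store, so backing it with Redis via Jedis is an assumption, and the key prefix and violation hook are illustrative:

    import java.util.function.Supplier;
    import redis.clients.jedis.JedisPooled;

    public class MutualExclusivityCounter {
        // Assumption: Redis as the persistent store; the talk does not name one.
        private final JedisPooled redis = new JedisPooled("localhost", 6379);

        // Bracket the critical task with increment/decrement, as in the pseudocode.
        public <T> T executeWithCounter(String key, Supplier<T> task) {
            increment(key);
            try {
                return task.get();
            } finally {
                decrement(key);
            }
        }

        private void increment(String key) {
            String current = redis.get(counterKey(key));
            // Non-atomic check-then-increment: mirrors the pseudocode, including
            // the race condition the next slide acknowledges.
            if (current != null && Long.parseLong(current) > 0) {
                publishViolation(key, Long.parseLong(current));
            }
            redis.incr(counterKey(key));
        }

        private void decrement(String key) {
            redis.decr(counterKey(key));
        }

        private String counterKey(String key) {
            return "mutex-counter:" + key; // illustrative key prefix
        }

        // Stand-in for publishing the metric to an observability tool.
        private void publishViolation(String key, long value) {
            System.err.printf("exclusivity violation for key=%s counter=%d%n", key, value);
        }
    }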
Concerns with the Counter
■Not perfect
●distributed counter without exclusivity may introduce race conditions
■Only detects
●signals the existence of violations, not the exact count
■Flaky during deployments
●observed false spikes during rollouts
■Mitigation strategies
●Layer counter with deployment metrics for context
●Use idempotent consumers to absorb occasional duplicates
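On the last mitigation, a minimal sketch of an idempotent consumer that absorbs occasional duplicates by remembering processed message IDs; a production version would use a persistent store with a TTL rather than this in-memory set:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class IdempotentHandler {
        private final Set<String> processed = ConcurrentHashMap.newKeySet();

        public void handle(String messageId, Runnable businessLogic) {
            // add() returns false if this ID was already seen: drop the duplicate.
            if (!processed.add(messageId)) {
                return;
            }
            businessLogic.run();
        }
    }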
Lessons learned
■Distributed systems are hard
●Complex and come with numerous edge cases.
■Simplicity scales
■Re-using existing infrastructure: by deprecating Redis and leveraging Pulsar,
we removed an entire dependency from the critical path
■Decoupling mutual exclusivity from business logic
●Application business logic is now agnostic to mutual exclusivity, which the
subscription mode enforces instead