Deprecating Distributed Locks for Low Latency Event Consumption by Danish Rehman

ScyllaDB · Oct 08, 2025

About This Presentation

This session covers how Attentive scaled mutual exclusivity in its event-driven architecture by evolving from Redis-based locks to Apache Pulsar's FAILOVER and KEY_SHARED subscription modes. Operating at the scale of 620M texts/day during peak events, we share how we addressed the "split-brain"...


Slide Content

A ScyllaDB Community
Deprecating Distributed Locks for
Low Latency Event Consumption
Danish Rehman
Staff Software Engineer

<Danish Rehman> (he)

Staff Software Engineer at Attentive
■Contributor to Apache Pulsar project
■Love the focus on low-latency engineering
■Deep diving into distributed systems
■I spend my time outdoor swimming and growing fruit

Understanding
Attentive

Attentive
Attentive is a marketing automation platform that helps businesses send
personalized, real-time messages, primarily via SMS and email, to engage
customers, drive sales, and build brand loyalty.

The Marketing Challenge
■Reaching subscribers is not just about sending messages
■It’s about when the message is received
■Timing is the difference between engagement and irrelevance
■Latency directly impacts customer attention

The Role of Low Latency in Engagement


Imagine: Customer leaves a cart at checkout
■If a reminder SMS arrives 10 minutes late → ignored
■If it arrives within seconds → recovered sale
■Milliseconds = Millions in revenue

Scale + personalization + speed = engagement

Operational Scale
Traffic closely follows consumer shopping patterns, with peak activity around
noon ET and significant surges during holiday periods.
■Trusted by 8,000+ brands to power personalized customer experiences
■Sub-second decisions delivered at global scale
■3.9 billion SMS & email messages sent in 2024
■620 million messages on Black Friday 2024

Understanding event
driven architecture

■We use Apache Pulsar as our distributed messaging and streaming platform.
■Producers (capturing user intent) publish events
■Consumers (microservices) consume the events and apply the actionable business logic (sketched below)
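
For illustration only, a minimal Pulsar producer in Java publishing a user-intent event; the topic name, key, and payload are invented placeholders, not Attentive's actual code:

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class UserEventProducer {
    public static void main(String[] args) throws Exception {
        // Connect to the Pulsar cluster (placeholder URL)
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Producer for user-intent events (placeholder topic name)
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/user-events")
                .create();

        // Key the message by user so downstream subscription modes can enforce per-user ordering
        producer.newMessage()
                .key("user-123")
                .value("cart-abandoned")
                .send();

        producer.close();
        client.close();
    }
}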

Apache Pulsar with SHARED subscription type
■Multiple consumers can attach to the same subscription
■Each message is delivered to only one consumer
■Messages are distributed in a round-robin fashion across consumers
■Ensures load balancing across consumers, but does not guarantee ordering per key (see the consumer sketch below)

Understanding the problem

Split Brain: Events from a user are processed in parallel
■Events for the same user can be processed concurrently by different consumers
■Each consumer makes independent send decisions, without coordination
■Example: A single user may receive multiple overlapping messages
■Result: Poor user experience (confusing, repetitive communication)
■Consequence: Increased risk of subscriber frustration and opt-outs

Events from a given user are processed in order

■Goal: Ensure that events for the same user are not processed concurrently
■Optimal approach: Route all of a user's events to a single processing unit
■Guarantees consistent decision-making for each user
■Prevents duplicate or conflicting messages
■Lays the foundation for low-latency, reliable message orchestration

Various solutions

Solution 01: Using Redis as a distributed lock
First Attempt: To enforce exclusivity, we initially relied on Redis-based distributed locks (e.g., Redlock)
■In Pulsar's SHARED mode, exclusivity requires an external lock
■First solution: a Redis-based distributed lock (Redlock), as sketched below
■These provided the necessary exclusivity during the MVP stage
■Allowed safe decision-making without duplicate sends
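
A rough sketch of the lock-per-user pattern, using Redisson's RLock as one possible Redlock-style client; the lock names, timeouts, and processing call are illustrative assumptions, not the talk's actual code:

import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedisLockedProcessor {
    private final RedissonClient redisson;

    public RedisLockedProcessor() {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379"); // placeholder address
        this.redisson = Redisson.create(config);
    }

    public void process(String userId, String event) throws InterruptedException {
        // One lock per user: only one consumer may make send decisions for this user at a time
        RLock lock = redisson.getLock("lock:user:" + userId);
        // Wait up to 100 ms to acquire; auto-release after 5 s if this pod crashes (illustrative values)
        boolean acquired = lock.tryLock(100, 5000, TimeUnit.MILLISECONDS);
        if (!acquired) {
            // Another consumer holds the lock; this consumer stalls or retries later
            return;
        }
        try {
            handleEvent(userId, event); // placeholder for the actual business logic
        } finally {
            lock.unlock();
        }
    }

    private void handleEvent(String userId, String event) { /* ... */ }
}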

Solution 01: Using Redis as a distributed lock
Impact / Results
■✅ Effective for MVP → gave required exclusivity
■❌ At scale, performance degradation emerged
■Growing traffic = higher lock pressure

Result: System became slower and less efficient during peak load

Solution 01: Using Redis as a distributed lock
Concerns
■❌ Increased lock contention
■❌ Consumer stalling

Concern 0: Increase in Lock contention as scale increased

❌ Increased lock contention
■High-traffic shopping events trigger many concurrent events
■Multiple consumers compete for locks → efficiency drops

Concern 1: Consumer stalling

❌ Consumer stalling
■Example: 4 consumers, only 1 acquires the lock
■Others stall until timeout → cycle repeats with new events
■Causes delays, contention, and poor throughput
■Some consumers remain idle, never processing events

Solution 01: Using Redis as a distributed lock

Broader Concerns
■❌ Lack of fencing tokens (see the sketch below for what a fencing token adds)
■❌ Vulnerable to GC pauses & network delays → ordering violations
■❌ Not ideal when business correctness depends on exclusivity

Key lesson: Redis locks solve early-stage needs, but don't scale with the strong
consistency + low latency requirements of an event-driven architecture
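
To illustrate the "lack of fencing tokens" concern: a fencing token is a monotonically increasing number handed out with each lock grant, and downstream writes carrying a stale token are rejected, so a paused or partitioned holder of an expired lock cannot clobber newer work. Redlock does not provide this. The sketch below is a generic illustration, not part of the talk:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class FencedStore {
    // Highest token seen per key; a real system would keep this in the datastore itself
    private final ConcurrentHashMap<String, Long> highestToken = new ConcurrentHashMap<>();
    private final AtomicLong tokenSource = new AtomicLong();

    // The lock service hands out a strictly increasing token with every grant
    public long grantLock(String key) {
        return tokenSource.incrementAndGet();
    }

    // Writes carrying a token older than one already seen are rejected
    public synchronized boolean write(String key, long fencingToken, String value) {
        long seen = highestToken.getOrDefault(key, 0L);
        if (fencingToken < seen) {
            return false; // stale lock holder
        }
        highestToken.put(key, fencingToken);
        // ... persist value ...
        return true;
    }
}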

Solution 02: Using Apache pulsar’s failover mode
Why We Adopted Failover mode
■Ahead of BFCM (Black Friday/Cyber Monday), we aligned on evolving our
architecture
■Goal: Address bottlenecks in the existing message orchestration system
■Strategy: Switch to Pulsar’s FAILOVER subscription mode for:
●Guaranteed ordering per key
●Mutual exclusivity in event processing

Apache Pulsar with Failover subscription mode

■A master consumer is chosen for each partition
■If the master disconnects → next consumer in line takes over
■Similar to Kafka consumer groups (one consumer per partition)
■Guarantees:
●Events for the same key always handled by a single consumer
●Mutual exclusivity without external locks (see the consumer sketch below)
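
On the client side, switching to Failover is mostly a one-line change to the subscription type; a sketch reusing the illustrative client and topic names from the earlier SHARED example:

// FAILOVER: Pulsar elects one active consumer per partition; the others are hot standbys.
// All events for a key land on one partition, so one consumer handles each user at a time.
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/user-events")
        .subscriptionName("message-orchestration")
        .subscriptionType(SubscriptionType.Failover)
        .subscribe();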

Solution 02: Using Apache pulsar’s failover mode
Impacts
■✅ Eliminated multi-consumer stalling caused by SHARED mode
■✅ Achieved per-user exclusivity in processing
■✅ Simplified coordination
●one partition → one active consumer
■⚡ Improved stability during peak BFCM traffic

But: Trade-offs emerged that challenged long-term adoption

Solution 02: Using Apache pulsar’s failover mode
Concerns
■❌ Head of the line blocking:
●A stalled task blocks all subsequent keys on that consumer increases latency & reduces throughput
■❌ Scalability Limits:
●Consumers ≤ partitions → idle consumers if overscaled
●Increasing partitions is irreversible → not flexible
■❌ Scaling During Peak Load:
●Limited to vertical scaling per consumer
●Horizontal scaling constrained by partition count
■❌ Retry Support Issue:
●Built-in retries not supported (to preserve ordering)

Solution 03: Apache pulsar’s key shared mode
Why We Adopted KEY_SHARED
■Goal: Meet business need for mutual exclusivity while mitigating earlier
tradeoffs
■Benefit: Reduced head-of-line blocking at partition level (though not
eliminated)

Apache Pulsar with Key_Shared subscription mode

■Unlike SHARED, all messages with the same key go to the same consumer
■Guarantees ordering per key ✅
■Still enables parallel processing across different keys
■Ideal for enforcing mutual exclusivity at the user level (see the consumer sketch below)
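
Again, only the subscription type changes on the consumer side (same illustrative setup as the earlier sketches):

// KEY_SHARED: messages with the same key are always routed to the same consumer,
// while different keys are spread across consumers for parallelism.
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/user-events")
        .subscriptionName("message-orchestration")
        .subscriptionType(SubscriptionType.Key_Shared)
        .subscribe();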

Solution 03: Apache pulsar’s key shared mode
Impacts / Results
■✅ Reduced head-of-line blocking
■✅ Enabled temporary scaling up for BFCM traffic surge
■✅ Allowed cost-efficient scale down post-event

Result: Resounding success during BFCM. KEY_SHARED proved invaluable for
temporary scaling while preserving exclusivity

Solution 03: Apache pulsar’s key shared mode
Concern
■❌ Risky producer-consumer coupling
●first event key must be correct or contract breaks
■❌ Reduced throughput
●smaller producer batches = higher CPU load
■❌ Mutual exclusivity violations possible during pod crash/shutdown
●requires idempotent consumers + graceful deployment configs
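
One common way to keep batching compatible with KEY_SHARED is key-based batching on the producer, so each batch contains a single key; whether Attentive configured it this way is not stated in the talk, so treat this as a general sketch (userId and eventPayload are placeholders):

// Key-based batching groups messages by key, which keeps KEY_SHARED routing correct
// but tends to produce smaller batches (the throughput/CPU trade-off noted above).
Producer<String> producer = client.newProducer(Schema.STRING)
        .topic("persistent://public/default/user-events")
        .batcherBuilder(BatcherBuilder.KEY_BASED)
        .create();

producer.newMessage()
        .key(userId)          // the key must be set correctly on the very first event
        .value(eventPayload)
        .send();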

Tools used during this rollout

1. Distributed Log Tracing

Distributed log tracing with a unique UUID provides debuggability across the
organization (sketched below)
■Provides visibility into message flows across the system
■Helps identify deviations in behavior early
■Enables investigation of anomalous user experiences back to the point of consumption
■Critical for spot-checking during migrations and design changes
■Strengthens confidence in maintaining low-latency, reliable delivery
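
Attentive's actual tracing setup isn't shown on the slides; as a minimal illustration, a trace UUID can ride along as a Pulsar message property and be logged at the point of consumption (property name and the SLF4J-style log variable are illustrative):

import java.util.UUID;

// Producer side: attach a trace id to every event
String traceId = UUID.randomUUID().toString();
producer.newMessage()
        .key(userId)
        .value(eventPayload)
        .property("traceId", traceId)
        .send();

// Consumer side: pull the same id back out and include it in every log line
Message<String> msg = consumer.receive();
String incomingTraceId = msg.getProperty("traceId");
log.info("traceId={} key={} processing event", incomingTraceId, msg.getKey());
consumer.acknowledge(msg);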

Distributed Log Tracing

2. Mutual Exclusivity Counter
■Purpose: Verify that mutual exclusivity is being maintained in event-driven
processing
■Ensures system integrity under at-least-once delivery guarantees
■Logic:
●If counter > 0 → exclusivity violated for that key
●Metric is published to observability tools
●Alerts the team to outages or degraded performance
■Provided a confidence signal during multiple architectural evolutions

Usage of mutual exclusivity counter
FUNCTION executeWithMutualExclusivityCounter(key):
    // Track operation start
    incrementMutualExclusivityCounter(key)

    TRY:
        // Perform the critical operation
        result = performCriticalTask(key)
    FINALLY:
        // Always decrement to maintain correct state
        decrementMutualExclusivityCounter(key)

    RETURN result

Decrement mutual exclusivity counter
FUNCTION decrementMutualExclusivityCounter(key):
    counter = getAtomicCounter(key)
    counter.decrement()

Mutual Exclusivity Counter: Increment
FUNCTION incrementMutualExclusivityCounter(key):
    // Retrieve counter from persistent store
    counter = getAtomicCounter(key)

    IF counter.getValue() > 0:
        // Another operation is in progress.
        publishToObservabilityApp(key, counter.getValue())

    counter.increment()
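
The slides keep the counter as pseudocode; a possible concrete form, assuming a Redisson-backed atomic counter per key (the actual store Attentive uses isn't named), could look like this:

import java.util.function.Supplier;
import org.redisson.api.RAtomicLong;
import org.redisson.api.RedissonClient;

public class MutualExclusivityCounter {
    private final RedissonClient redisson;

    public MutualExclusivityCounter(RedissonClient redisson) {
        this.redisson = redisson;
    }

    public <T> T executeWithCounter(String key, Supplier<T> criticalTask) {
        RAtomicLong counter = redisson.getAtomicLong("mutex-counter:" + key);
        // If the counter is already positive, another worker is inside the critical
        // section for this key: record the violation, but do not block.
        // Note: this check-then-increment is not atomic, which is exactly the
        // race condition the "Concerns with the Counter" slide calls out.
        long current = counter.get();
        if (current > 0) {
            publishToObservability(key, current); // placeholder for a metrics emit
        }
        counter.incrementAndGet();
        try {
            return criticalTask.get();
        } finally {
            counter.decrementAndGet();
        }
    }

    private void publishToObservability(String key, long value) { /* emit metric */ }
}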

Concerns with the Counter
■Not perfect
●distributed counter without exclusivity may introduce race conditions
■Only detects
●signals the existence of violations, not the exact count
■Flaky during deployments
●observed false spikes during rollouts
■Mitigation strategies
●Layer counter with deployment metrics for context
●Use idempotent consumers to absorb occasional duplicates (a minimal sketch follows)
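
A minimal idea of what "idempotent consumers" can mean here: dedupe on a per-event identifier before acting. The in-memory set below is only a stand-in for whatever persistent, TTL-bounded dedup store a real service would use:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    // Stand-in for a persistent dedup store shared across consumers
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    public void handle(String eventId, Runnable sendDecision) {
        // If we've already acted on this event (e.g. a redelivery after a pod restart), skip it
        if (!processedEventIds.add(eventId)) {
            return;
        }
        sendDecision.run();
    }
}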

Lessons learned
■Distributed systems are hard
●Complex and come with numerous edge cases.
■Simplicity scales
■Re-using existing infrastructure: by deprecating Redis and leveraging Pulsar, we removed an extra moving part
■Decoupling mutual exclusivity from business logic
●Application business logic is now agnostic to mutual exclusivity

Thank you! Let’s connect.
Danish Rehman
[email protected]
danishrehman.com