Deprecating Distributed Locks for Low Latency Event Consumption by Danish Rehman

ScyllaDB · Oct 08, 2025

About This Presentation

This session covers how Attentive scaled mutual exclusivity in its event-driven architecture by evolving from Redis-based locks to Apache Pulsar's FAILOVER and KEY_SHARED subscription modes. Operating at the scale of 620M texts/day during peak events, we share how we addressed the "split-brain"...


Slide Content

A ScyllaDB Community
Deprecating Distributed Locks for
Low Latency Event Consumption
Danish Rehman
Staff Software Engineer

<Danish Rehman> (he)

Staff Software Engineer at Attentive
■Contributor to Apache Pulsar project
■Love the focus on low-latency engineering
■Deep diving into distributed systems
■I spend my time outdoor swimming and growing fruit

Understanding
Attentive

Attentive
Attentive is a marketing automation platform that helps businesses send
personalized, real-time messages, primarily via SMS and email, to engage
customers, drive sales, and build brand loyalty.

The Marketing Challenge
■Reaching subscribers is not just about sending messages
■It’s about when the message is received
■Timing is the difference between engagement and irrelevance
■Latency directly impacts customer attention

The Role of Low Latency in Engagement


Imagine: Customer leaves a cart at checkout
■If a reminder SMS arrives 10 minutes late → ignored
■If it arrives within seconds → recovered sale
■Milliseconds = Millions in revenue

Scale + personalization + speed = engagement

Operational Scale
Traffic closely follows consumer shopping patterns, with peak activity around
noon ET and significant surges during holiday periods.
■Trusted by 8,000+ brands to power personalized customer experiences
■Sub-second decisions delivered at global scale
■3.9 billion SMS & email messages sent in 2024
■620 million messages on Black Friday 2024

Understanding event
driven architecture

■We use Apache Pulsar as our distributed messaging and streaming platform.
■Producers (capturing user intent) publish events
■Consumers (microservices) consume the events and apply the actionable business logic (sketched below)
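
For illustration only, a minimal Pulsar producer in Java publishing a user-intent event; the topic name, key, and payload are invented placeholders, not Attentive's actual code:

import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class UserEventProducer {
    public static void main(String[] args) throws Exception {
        // Connect to the Pulsar cluster (placeholder URL)
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Producer for user-intent events (placeholder topic name)
        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/user-events")
                .create();

        // Key the message by user so downstream subscription modes can enforce per-user ordering
        producer.newMessage()
                .key("user-123")
                .value("cart-abandoned")
                .send();

        producer.close();
        client.close();
    }
}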

Apache Pulsar with SHARED subscription type
■Multiple consumers can attach to the same subscription
■Each message is delivered to only one consumer
■Messages are distributed in a round-robin fashion across consumers
■Ensures load balancing across consumers, but does not guarantee ordering per key (see the consumer sketch below)

Understanding the problem

Split Brain: Events from a user are processed in parallel
■Events for the same user can be processed concurrently by different consumers
■Each consumer makes independent send decisions, without coordination
■Example: A single user may receive multiple overlapping messages
■Result: Poor user experience (confusing, repetitive communication)
■Consequence: Increased risk of subscriber frustration and opt-outs

Events from a given user are processed in order

■Goal: Ensure that events for the same user are not processed concurrently
■Optimal approach: Route all of a user's events to a single processing unit
■Guarantees consistent decision-making for each user
■Prevents duplicate or conflicting messages
■Lays the foundation for low-latency, reliable message orchestration

Various solutions

Solution 01: Using Redis as a distributed lock
First Attempt: To enforce exclusivity, we initially relied on Redis-based distributed locks (e.g., Redlock)
■In Pulsar's SHARED mode, exclusivity requires an external lock
■First solution: a Redis-based distributed lock (Redlock), as sketched below
■These provided the necessary exclusivity during the MVP stage
■Allowed safe decision-making without duplicate sends
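
A rough sketch of the lock-per-user pattern, using Redisson's RLock as one possible Redlock-style client; the lock names, timeouts, and processing call are illustrative assumptions, not the talk's actual code:

import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedisLockedProcessor {
    private final RedissonClient redisson;

    public RedisLockedProcessor() {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379"); // placeholder address
        this.redisson = Redisson.create(config);
    }

    public void process(String userId, String event) throws InterruptedException {
        // One lock per user: only one consumer may make send decisions for this user at a time
        RLock lock = redisson.getLock("lock:user:" + userId);
        // Wait up to 100 ms to acquire; auto-release after 5 s if this pod crashes (illustrative values)
        boolean acquired = lock.tryLock(100, 5000, TimeUnit.MILLISECONDS);
        if (!acquired) {
            // Another consumer holds the lock; this consumer stalls or retries later
            return;
        }
        try {
            handleEvent(userId, event); // placeholder for the actual business logic
        } finally {
            lock.unlock();
        }
    }

    private void handleEvent(String userId, String event) { /* ... */ }
}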

Solution 01: Using Redis as a distributed lock
Impact / Results
■✅ Effective for MVP → gave required exclusivity
■❌ At scale, performance degradation emerged
■Growing traffic = higher lock pressure

Result: System became slower and less efficient during peak load

Solution 01: Using Redis as a distributed lock
Concerns
■❌ Increased lock contention
■❌ Consumer stalling

Concern 0: Increase in Lock contention as scale increased

❌ Increased lock contention
■High-traffic shopping events trigger many concurrent events
■Multiple consumers compete for locks → efficiency drops

Concern 1: Consumer stalling

❌ Consumer stalling
■Example: 4 consumers, only 1 acquires the lock
■Others stall until timeout → cycle repeats with new events
■Causes delays, contention, and poor throughput
■Some consumers remain idle, never processing events

Solution 01: Using Redis as a distributed lock

Broader Concerns
■❌ Lack of fencing tokens (see the sketch below for what a fencing token adds)
■❌ Vulnerable to GC pauses & network delays → ordering violations
■❌ Not ideal when business correctness depends on exclusivity

Key lesson: Redis locks solve early-stage needs, but don't scale with the strong
consistency + low latency requirements of an event-driven architecture
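
To illustrate the "lack of fencing tokens" concern: a fencing token is a monotonically increasing number handed out with each lock grant, and downstream writes carrying a stale token are rejected, so a paused or partitioned holder of an expired lock cannot clobber newer work. Redlock does not provide this. The sketch below is a generic illustration, not part of the talk:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class FencedStore {
    // Highest token seen per key; a real system would keep this in the datastore itself
    private final ConcurrentHashMap<String, Long> highestToken = new ConcurrentHashMap<>();
    private final AtomicLong tokenSource = new AtomicLong();

    // The lock service hands out a strictly increasing token with every grant
    public long grantLock(String key) {
        return tokenSource.incrementAndGet();
    }

    // Writes carrying a token older than one already seen are rejected
    public synchronized boolean write(String key, long fencingToken, String value) {
        long seen = highestToken.getOrDefault(key, 0L);
        if (fencingToken < seen) {
            return false; // stale lock holder
        }
        highestToken.put(key, fencingToken);
        // ... persist value ...
        return true;
    }
}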

Solution 02: Using Apache pulsar’s failover mode
Why We Adopted Failover mode
■Ahead of BFCM (Black Friday/Cyber Monday), we aligned on evolving our
architecture
■Goal: Address bottlenecks in the existing message orchestration system
■Strategy: Switch to Pulsar’s FAILOVER subscription mode for:
●Guaranteed ordering per key
●Mutual exclusivity in event processing

Apache Pulsar with Failover subscription mode

■A master consumer is chosen for each partition
■If the master disconnects → next consumer in line takes over
■Similar to Kafka consumer groups (one consumer per partition)
■Guarantees:
●Events for the same key always handled by a single consumer
●Mutual exclusivity without external locks (see the consumer sketch below)
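
On the client side, switching to Failover is mostly a one-line change to the subscription type; a sketch reusing the illustrative client and topic names from the earlier SHARED example:

// FAILOVER: Pulsar elects one active consumer per partition; the others are hot standbys.
// All events for a key land on one partition, so one consumer handles each user at a time.
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/user-events")
        .subscriptionName("message-orchestration")
        .subscriptionType(SubscriptionType.Failover)
        .subscribe();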

Solution 02: Using Apache pulsar’s failover mode
Impacts
■✅ Eliminated multi-consumer stalling caused by SHARED mode
■✅ Achieved per-user exclusivity in processing
■✅ Simplified coordination
●one partition → one active consumer
■⚡ Improved stability during peak BFCM traffic

But: Trade-offs emerged that challenged long-term adoption

Solution 02: Using Apache pulsar’s failover mode
Concerns
■❌ Head of the line blocking:
●A stalled task blocks all subsequent keys on that consumer increases latency & reduces throughput
■❌ Scalability Limits:
●Consumers ≤ partitions → idle consumers if overscaled
●Increasing partitions is irreversible → not flexible
■❌ Scaling During Peak Load:
●Limited to vertical scaling per consumer
●Horizontal scaling constrained by partition count
■❌ Retry Support Issue:
●Built-in retries not supported (to preserve ordering)

Solution 03: Apache pulsar’s key shared mode
Why We Adopted KEY_SHARED
■Goal: Meet business need for mutual exclusivity while mitigating earlier
tradeoffs
■Benefit: Reduced head-of-line blocking at partition level (though not
eliminated)

Apache Pulsar with Key_Shared subscription mode

■Unlike SHARED, all messages with the same key go to the same consumer
■Guarantees ordering per key ✅
■Still enables parallel processing across different keys
■Ideal for enforcing mutual exclusivity at the user level (see the consumer sketch below)
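
Again, only the subscription type changes on the consumer side (same illustrative setup as the earlier sketches):

// KEY_SHARED: messages with the same key are always routed to the same consumer,
// while different keys are spread across consumers for parallelism.
Consumer<String> consumer = client.newConsumer(Schema.STRING)
        .topic("persistent://public/default/user-events")
        .subscriptionName("message-orchestration")
        .subscriptionType(SubscriptionType.Key_Shared)
        .subscribe();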

Solution 03: Apache pulsar’s key shared mode
Impacts / Results
■✅ Reduced head-of-line blocking
■✅ Enabled temporary scaling up for BFCM traffic surge
■✅ Allowed cost-efficient scale down post-event

Result: Resounding success during BFCM. KEY_SHARED proved invaluable for
temporary scaling while preserving exclusivity

Solution 03: Apache pulsar’s key shared mode
Concern
■❌ Risky producer-consumer coupling
●first event key must be correct or contract breaks
■❌ Reduced throughput
●smaller producer batches = higher CPU load
■❌ Mutual exclusivity violations possible during pod crash/shutdown
●requires idempotent consumers + graceful deployment configs
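
One common way to keep batching compatible with KEY_SHARED is key-based batching on the producer, so each batch contains a single key; whether Attentive configured it this way is not stated in the talk, so treat this as a general sketch (userId and eventPayload are placeholders):

// Key-based batching groups messages by key, which keeps KEY_SHARED routing correct
// but tends to produce smaller batches (the throughput/CPU trade-off noted above).
Producer<String> producer = client.newProducer(Schema.STRING)
        .topic("persistent://public/default/user-events")
        .batcherBuilder(BatcherBuilder.KEY_BASED)
        .create();

producer.newMessage()
        .key(userId)          // the key must be set correctly on the very first event
        .value(eventPayload)
        .send();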

Tools used during this rollout

1. Distributed Log Tracing

Distributed log tracing with a unique UUID provides debuggability across the
organization (sketched below)
■Provides visibility into message flows across the system
■Helps identify deviations in behavior early
■Enables investigation of anomalous user experiences back to the point of consumption
■Critical for spot-checking during migrations and design changes
■Strengthens confidence in maintaining low-latency, reliable delivery
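
Attentive's actual tracing setup isn't shown on the slides; as a minimal illustration, a trace UUID can ride along as a Pulsar message property and be logged at the point of consumption (property name and the SLF4J-style log variable are illustrative):

import java.util.UUID;

// Producer side: attach a trace id to every event
String traceId = UUID.randomUUID().toString();
producer.newMessage()
        .key(userId)
        .value(eventPayload)
        .property("traceId", traceId)
        .send();

// Consumer side: pull the same id back out and include it in every log line
Message<String> msg = consumer.receive();
String incomingTraceId = msg.getProperty("traceId");
log.info("traceId={} key={} processing event", incomingTraceId, msg.getKey());
consumer.acknowledge(msg);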

Distributed Log Tracing

2. Mutual Exclusivity Counter
■Purpose: Verify that mutual exclusivity is being maintained in event-driven
processing
■Ensures system integrity under at-least-once delivery guarantees
■Logic:
●If counter > 0 → exclusivity violated for that key
●Metric is published to observability tools
●Alerts the team to outages or degraded performance
■Provided a confidence signal during multiple architectural evolutions

Usage of mutual exclusivity counter
FUNCTION executeWithMutualExclusivityCounter(key):
    // Track operation start
    incrementMutualExclusivityCounter(key)

    TRY:
        // Perform the critical operation
        result = performCriticalTask(key)
    FINALLY:
        // Always decrement to maintain correct state
        decrementMutualExclusivityCounter(key)

    RETURN result

Decrement mutual exclusivity counter
FUNCTION decrementMutualExclusivityCounter(key):
    counter = getAtomicCounter(key)
    counter.decrement()

Mutual Exclusivity Counter: Increment
FUNCTION incrementMutualExclusivityCounter(key):
    // Retrieve counter from persistent store
    counter = getAtomicCounter(key)

    IF counter.getValue() > 0:
        // Another operation is in progress.
        publishToObservabilityApp(key, counter.getValue())

    counter.increment()
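
The slides keep the counter as pseudocode; a possible concrete form, assuming a Redisson-backed atomic counter per key (the actual store Attentive uses isn't named), could look like this:

import java.util.function.Supplier;
import org.redisson.api.RAtomicLong;
import org.redisson.api.RedissonClient;

public class MutualExclusivityCounter {
    private final RedissonClient redisson;

    public MutualExclusivityCounter(RedissonClient redisson) {
        this.redisson = redisson;
    }

    public <T> T executeWithCounter(String key, Supplier<T> criticalTask) {
        RAtomicLong counter = redisson.getAtomicLong("mutex-counter:" + key);
        // If the counter is already positive, another worker is inside the critical
        // section for this key: record the violation, but do not block.
        // Note: this check-then-increment is not atomic, which is exactly the
        // race condition the "Concerns with the Counter" slide calls out.
        long current = counter.get();
        if (current > 0) {
            publishToObservability(key, current); // placeholder for a metrics emit
        }
        counter.incrementAndGet();
        try {
            return criticalTask.get();
        } finally {
            counter.decrementAndGet();
        }
    }

    private void publishToObservability(String key, long value) { /* emit metric */ }
}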

Concerns with the Counter
■Not perfect
●distributed counter without exclusivity may introduce race conditions
■Only detects
●signals the existence of violations, not the exact count
■Flaky during deployments
●observed false spikes during rollouts
■Mitigation strategies
●Layer counter with deployment metrics for context
●Use idempotent consumers to absorb occasional duplicates (a minimal sketch follows)
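
A minimal idea of what "idempotent consumers" can mean here: dedupe on a per-event identifier before acting. The in-memory set below is only a stand-in for whatever persistent, TTL-bounded dedup store a real service would use:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    // Stand-in for a persistent dedup store shared across consumers
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    public void handle(String eventId, Runnable sendDecision) {
        // If we've already acted on this event (e.g. a redelivery after a pod restart), skip it
        if (!processedEventIds.add(eventId)) {
            return;
        }
        sendDecision.run();
    }
}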

Lessons learned
■Distributed systems are hard
●Complex and come with numerous edge cases.
■Simplicity scales
■Re-using existing infrastructure: by deprecating Redis and leveraging Pulsar, we removed an extra moving part
■Decoupling mutual exclusivity from business logic
●Application business logic is now agnostic to mutual exclusivity

Thank you! Let’s connect.
Danish Rehman
[email protected]
danishrehman.com