Persistence Pipelines in a Processing Graph: Mutable Big Data at Salesforce by Spencer Ho
About This Presentation
This is a case study on managing mutable big data: Exploring the evolution of the persistence layer in a processing graph, tackling design challenges, and refining key operational principles along the way.
Slide Content
A ScyllaDB Community
Persistence Pipelines in a
Processing Graph:
Mutable Big Data at Salesforce
Spencer Ho
Architect
Spencer Ho
■From networking/telecom to mobile devices, online gaming/retail, and eventually CRM SaaS
■Data analytics, data engineering, and stream and batch processing, in conjunction with various database and storage technologies
■Enjoys watching movies
■Mutable Big Data
■Refactor For Operation
■Architecture Principles
■Map to Implementation
■Widen the Scope
Presentation Agenda
Mutable Big Data
Case Study
A conclusion drawn from the journey of responding to operating issues and design challenges.
■Analytical Model for Capacity and Degradation in Distributed Systems
■https://engineering.salesforce.com/analytical-model-for-capacity-and-degradation-in-distributed-systems-f0888ec62ecc/
■Embracing Mutable Big Data
■https://engineering.salesforce.com/embracing-mutable-big-data-bf7106c2064d/
This is one of the stories along the journey.
Activity Platform
Salesforce Activity Platform (AP) ingests, stores, augments, and serves users' activity data.
■User’s activity data such as emails, meetings, voice and video calls are
served to applications as “time-series” data.
■Activity records are stored in Cassandra tables, and updated to
corresponding indices in the ElasticSearch cluster.
■Changes in the activity itself
■Meeting reschedule and updates
■Task closure or deletion
■Ingestion order that has an impact on the database
■A meeting has to be created ahead of meeting time
■An activity may be captured long after it was created
■The consideration of GDPR (General Data Protection Regulation)
■Changes to the activity metadata
■System-generated data fields, especially nowadays with AI and agents.
Temporal Order and Mutability
■Activity time is when an activity takes place
■This is how users view/consume the activities
■Capture time is the time when the existence of an activity is made known to the system
■This is how the database/storage ingests records
■Activity time may not always be earlier than capture time
■This has ramifications for the choice of database
■Fresh data vs historical data
■Off-the-shelf time-series databases are not suitable for our use case (see the data-model sketch below)
Activity Time vs Capture Time
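As a minimal illustration of the two timestamps, the sketch below models an activity table in Cassandra with the Java DataStax driver. The table name, columns, and partitioning are assumptions made for this example, not the actual AP schema; the point is that reads are organized around activity time while a row can arrive at any capture time, so out-of-order arrival is the normal case rather than the exception.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ActivityTimeVsCaptureTime {
    // Hypothetical table: partitioned by user and activity day, clustered by activity time.
    // Activity time is how users read the data; capture time only records when the system
    // learned about the record.
    static final String CREATE_TABLE =
        "CREATE TABLE IF NOT EXISTS activity_by_user ("
      + "  user_id      text,"
      + "  activity_day date,"
      + "  activity_ts  timestamp,"   // when the activity takes place
      + "  activity_id  text,"
      + "  capture_ts   timestamp,"   // when the system ingested it
      + "  payload      text,"
      + "  PRIMARY KEY ((user_id, activity_day), activity_ts, activity_id))";

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute(CREATE_TABLE);
            // A meeting scheduled a week out: activity_ts is later than capture_ts.
            // An email synced from an old mailbox would be the opposite extreme.
            // Both are ordinary writes here, which is why an append-only time-series
            // store keyed on ingestion time does not fit this workload.
            Instant captureTs = Instant.now();
            Instant activityTs = captureTs.plus(Duration.ofDays(7));
            session.execute(SimpleStatement.newInstance(
                "INSERT INTO activity_by_user "
              + "(user_id, activity_day, activity_ts, activity_id, capture_ts, payload) "
              + "VALUES (?, ?, ?, ?, ?, ?)",
                "user-1", LocalDate.ofInstant(activityTs, ZoneOffset.UTC),
                activityTs, "meeting-42", captureTs, "{...}"));
        }
    }
}
```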
Refactor For Operation
Persisting and Indexing Pipelines
The main pipelines for the activity persistence function
■Activity persisting pipeline for storing to Cassandra
■Activity indexing pipeline for indexing to ElasticSearch
■Both pipelines are built on Apache Storm topologies.
Fact of Life
■They are two separate databases (or storages)
■Data records and indices can be "eventually consistent"
■But they will never be fully in sync as long as traffic keeps coming in
Processing Graph
Refactor for Fast Diagnostics
Time to Diagnosis
■The time for diagnostics and repair counts towards SLA time
■It is easy to identify a troubled stream-processing runtime
■The not-so-easy part is to identify the root cause
Design For Operation
■Not everyone has the same level of familiarity with every part of the processing graph
■Optimization for one function should not interfere with other functions
The original implementation was written as part of an activity-type-specific processing service, carried out in a single Storm topology.
■The service facade encapsulated all the operations, and the processor bolt invoked the service entry method.
Separating the processing components of the one topology into multiple topologies (sketched below):
■Easier to measure, monitor and tune
■Better Availability
From Service Facade to Specialized Pipeline
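As a rough sketch of the refactored shape in Apache Storm, the snippet below wires the persisting function as its own topology instead of a single bolt calling a do-everything service facade. Component names, parallelism hints, and field names are illustrative assumptions; the indexing topology would be built the same way with its own spout and bolt. With the split, each topology exposes its own lag, latency, and capacity numbers, which is what makes it easier to measure, monitor, and tune.

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

public class SpecializedPipelinesSketch {

    // Placeholder spout; the real one would read activity records from the ingestion source.
    static class ActivitySpout extends BaseRichSpout {
        public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector out) { }
        public void nextTuple() { }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("recordId", "record"));
        }
    }

    // Placeholder bolt; the real one writes to Cassandra and only then hands the record
    // off to the indexing pipeline (see the persist-bolt sketch further below).
    static class CassandraPersistBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector out) { /* write, then hand off */ }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    // The persisting topology on its own: it can be monitored, tuned, and restarted
    // without touching the ElasticSearch indexing topology.
    public static TopologyBuilder persistingTopology() {
        TopologyBuilder b = new TopologyBuilder();
        b.setSpout("activity-spout", new ActivitySpout(), 4);
        // Group by record id so retries of the same record land on the same executor.
        b.setBolt("cassandra-persist", new CassandraPersistBolt(), 8)
         .fieldsGrouping("activity-spout", new Fields("recordId"));
        return b;
    }
}
```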
Architecture Principles
■At-least-once, in-order, idempotent processing (illustrated in the sketch below)
■A common practice is to pair a NoSQL database with ElasticSearch for indexing
■Cassandra as the system of record
■No way to stop and verify across the two storages
■Accept the fact that they are two storages
■Another form of Eventual Consistency
■Need to optimize separately
Processing Semantics
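One common way to make the idempotent part of these semantics concrete on the Cassandra side is a deterministic primary key plus a client-supplied write timestamp. The deck does not spell out AP's exact mechanism, so the sketch below, which reuses the hypothetical table from the earlier example, is only an illustration of the general technique.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class IdempotentUpsertSketch {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // USING TIMESTAMP with a value derived from the record's version (here, its
            // capture time) makes the upsert safe to replay: re-applying the same message,
            // or applying an older update after a newer one, cannot overwrite newer data.
            PreparedStatement upsert = session.prepare(
                "INSERT INTO activity_by_user "
              + "(user_id, activity_day, activity_ts, activity_id, capture_ts, payload) "
              + "VALUES (?, ?, ?, ?, ?, ?) USING TIMESTAMP ?");

            Instant activityTs = Instant.parse("2025-03-17T10:00:00Z");
            Instant captureTs  = Instant.now();
            long writeTsMicros = captureTs.toEpochMilli() * 1000L; // Cassandra write timestamps are microseconds

            session.execute(upsert.bind(
                "user-1", LocalDate.ofInstant(activityTs, ZoneOffset.UTC), activityTs,
                "meeting-42", captureTs, "{\"status\":\"rescheduled\"}", writeTsMicros));
        }
    }
}
```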
Map to Implementation
Throughput and Latency - Cassandra
■Cassandra persisting topology
■One message (record) at a time for Cassandra, whether it is a Create, Modify, or Delete.
■The topology can be scaled directly with the traffic because Cassandra is good at frequent, individual write operations.
■Complete the Cassandra write operation before generating the corresponding indexing message.
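The sketch below shows what such a persisting bolt might look like under Storm's at-least-once model, assuming the hypothetical table from the earlier examples: the write is a single-record upsert, and the indexing hand-off is emitted, anchored to the input tuple, only after the Cassandra write succeeds. How the hand-off actually reaches the downstream indexing topology (e.g., via a message queue) is not stated in the deck.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class PersistThenIndexBolt extends BaseRichBolt {
    private transient CqlSession session;
    private transient PreparedStatement upsert;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
        this.session = CqlSession.builder().build();
        // One record per write: Create, Modify, and Delete all become a single upsert here
        // (a real Delete could be a tombstone column or an actual DELETE; elided in this sketch).
        this.upsert = session.prepare(
            "INSERT INTO activity_by_user "
          + "(user_id, activity_day, activity_ts, activity_id, capture_ts, payload) "
          + "VALUES (?, ?, ?, ?, ?, ?)");
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            session.execute(upsert.bind(
                tuple.getStringByField("userId"),
                tuple.getValueByField("activityDay"),
                tuple.getValueByField("activityTs"),
                tuple.getStringByField("activityId"),
                tuple.getValueByField("captureTs"),
                tuple.getStringByField("payload")));
            // The indexing message is generated only after the Cassandra write completed,
            // and it is anchored to the input tuple so a downstream failure replays it.
            collector.emit(tuple, new Values(tuple.getStringByField("activityId")));
            collector.ack(tuple);
        } catch (Exception e) {
            collector.fail(tuple); // the spout replays; the upsert is safe to re-apply
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("indexMessage"));
    }
}
```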
Throughput and Latency - ElasticSearch
■ElasticSearch indexing topology
■Downstream of the persisting topology.
■It has to uphold at-least-once, idempotent execution
■Document ID has to be fixed and known beforehand
■Index shard routing key is determined by the topology
■Micro-batch indexing writes
■Special care is given to topology stream routing so it matches index shard routing.
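A sketch of the indexing side with the Elasticsearch high-level REST client is shown below: the document ID is derived only from the record's identity, so a replayed record overwrites the same document instead of creating a duplicate, and the routing key is the same field used for the Storm stream grouping, so each micro-batch touches as few index shards as possible. The index name, ID scheme, and routing choice are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class MicroBatchIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            BulkRequest batch = new BulkRequest();
            for (String[] rec : new String[][] {{"user-1", "meeting-42"}, {"user-1", "email-7"}}) {
                String userId = rec[0], activityId = rec[1];

                // Deterministic document id: derived purely from the record's identity,
                // so re-indexing the same record after a replay is idempotent.
                String docId = userId + ":" + activityId;

                // Routing key chosen by the pipeline; the Storm stream grouping uses the
                // same key, so a micro-batch maps onto as few index shards as possible.
                String routingKey = userId;

                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("userId", userId);
                doc.put("activityId", activityId);

                batch.add(new IndexRequest("activity-index")
                        .id(docId).routing(routingKey).source(doc));
            }
            es.bulk(batch, RequestOptions.DEFAULT); // one bulk call per micro-batch
        }
    }
}
```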
Widen the Scope
Different Use Cases or Considerations
■Would anything differ if it were built on Spark Streaming or Kafka Streams?
■What if the computation were CPU/memory bound?
■Upstream of a persisting operation pipeline
■Integration with a data lake if something can give
■Write embeddings to Vector database
Conclusion
The total solution lies not just in the use of NoSQL databases, but also in the processing pipelines that fully embrace the fact of eventual consistency.
■The original blog will be published on medium.com soon.
Stay in Touch
Shan-Cheng Ho aka Spencer Ho [email protected]
https://www.linkedin.com/in/spencer-ho-998374/