Elasticity vs. State? Exploring Kafka Streams Cassandra State Store
ScyllaDB
305 views
27 slides
Jun 21, 2024
Slide 1 of 27
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
About This Presentation
kafka-streams-cassandra-state-store' is a drop-in Kafka Streams State Store implementation that persists data to Apache Cassandra.
By moving the state to an external datastore the stateful streams app (from a deployment point of view) effectively becomes stateless. This greatly improves elastic...
kafka-streams-cassandra-state-store' is a drop-in Kafka Streams State Store implementation that persists data to Apache Cassandra.
By moving the state to an external datastore the stateful streams app (from a deployment point of view) effectively becomes stateless. This greatly improves elasticity and allows for fluent CI/CD (rolling upgrades, security patching, pod eviction, ...).
It also can also help to reduce failure recovery and rebalancing downtimes, with demos showing sporty 100ms rebalancing downtimes for your stateful Kafka Streams application, no matter the size of the application’s state.
As a bonus accessing Cassandra State Stores via 'Interactive Queries' (e.g. exposing via REST API) is simple and efficient since there's no need for an RPC layer proxying and fanning out requests to all instances of your streams application.
Size: 11.57 MB
Language: en
Added: Jun 21, 2024
Slides: 27 pages
Slide Content
Elasticity vs. State? A New Kafka Streams State Store Hartmut Armbruster, Software Engineer at Thriving.dev
Hartmut Armbruster Software Architect & Developer Spent the past years working with distributed systems and real time data processing with clients such as HSBC, NEX Group plc, Deutsche Bahn. Striving to see the bigger picture, passionate about architecture, combining, integrating, and bringing all things together.
Recap: Kafka Streams, State Stores, RocksDB Managing State: Challenges & Opportunities Kafka Streams Cassandra State Store ‘Drop-in’ State Store Alternative - Quickstart Data Models & Supported Store Types Interactive Queries Conclusions, Limitations, Next Steps Presentation Agenda
Recap Kafka Streams, State Store, RocksDB
Apache Kafka Distributed Streaming Platform Stores its data in topics, immutable Each topic has partitions
Functional Java API Stream Processing on data stored in Kafka Kafka Streams Processor Topology Source(s) Stateless Stream Processors filter, map, flatMap Stateful Stream Processors join, group, window, aggregate Sink(s)
Stateful Processors -> State Stores State is l ocal , distributed across replicas State Stores Types RocksDB InMemory Writes to local + changelog topics ‘State Restore’ from changelog if local state is lost
Managing State Challenges & Opportunities
Challenges for Stateful Topologies (1) Example Use Case Real-time data processing Considerably large state ⚡ App Upgrade ⚡ Security Patching ⚡ Infrastructure Failure ⚡ Scaling up/down Mitigation Use ‘persistent’ stores (RocksDB) R un Stateful (keep disks -> local state) Prevent /M inimise Rebalances Static Group Membership ⚠️ Standby Replicas Warmup Replicas Moving Restoration to a Dedicated Thread (KAFKA-10199) Tuning Restore Consumer Config State restore required 😱 …may take minutes/hours!!
Challenges for Stateful Topologies (2) Idle Streams Group Members Following rebalancing & task re-assignment new members get warmup tasks assigned Warmup Replicas catch up Transition from ‘Warmup -> Active’ Never Happens Unbalanced assignment Idle Replicas Performance Degradation Unused Allocated Resources 💸
Kafka Streams Cassandra State Store
Kafka Streams Cassandra State Store thriving-dev/kafka-streams-cassandra-state-store
Why Cassandra / ScyllaDB? Distributed Architecture Scalability (r/w & data) High Availability through Data Replication High Fault Tolerance High Performance (Expiring Data with TTL)
Under t he h ood - Data Model Key/Value are of type BLOB Any Payloads supported String, Number, Avro, Protobuf, … Serialized -> bytes
Under the hood - Primary Keys Composite Primary Key Having ` key ` as Clustering Key allows additional IQs such as `range` & `prefixScan` No. of partitions defined by Streams tasks (source topic partitions) ⚠️ Large Partitions BATCH inserts possible & efficient partitionedKeyValueStore global KeyValueStore Store Entry `key` as Primary Key High Cardinality, works for any data volumes No Support for `range` & `prefixScan` Lookups possible from any instance, knowledge/context of partition/task is not required
Store Types partitioned KeyValueStore global KeyValueStore partitioned Versioned KeyValueStore global Versioned KeyValueStore
Interactive Queries
Interactive Queries, REST API
Regular Kafka Streams State Store Interactive Queries, REST API
Cassandra Kafka Streams State Store Interactive Queries, REST API
Conclusions
Conclusions D rop-in replacement state store P ersistent, allows for very large state No changelog topic, no state restore required -> run as stateless -> instant rebalancing -> reduce rebalance downtimes & recovery time -> improve elasticity + scalability Exactly-Once Semantics (EOS) not supported No consistency guarantees for non-idempotent data streams on hard failures No window & session store support (yet) Experimental The bad and the ugly
Next Steps https://github.com/thriving-dev/kafka-streams-cassandra-state-store Buffered + batched writes at offset commit Add Window & Session Store Support Address shortcomings on reliability, consistency, EOS Benchmark Add in-memory read cache Open Source. Apache-2.0 license.