Event-Driven Architecture Masterclass: Challenges in Stream Processing

ScyllaDB 174 views 24 slides May 15, 2024

About This Presentation

Discuss the core tradeoffs and considerations involved in order-free and ordered stream processing. Brian Taylor walks through the pros and cons of three different approaches: no data dependency, deferred inter-event data dependency, and streaming inter-event data dependency.


Slide Content

A Tale of 3 Pipelines Brian Taylor

Reference Architectures
Stream to Stream
  What it is: Webhook system
  How it scales: Money = Performance
Stream to State
  What it is: ODP “Analytic segments”
  How it scales: Write: Money = Performance. Read: Data dependency limited
Stateful Stream to Stream
  What it is: ODP “Real-Time Segments”
  How it scales: Data dependency limited

Inter-Event Data Dependency
A property of the stream and the problem. Measures the way that events impact the processing of following events. For example:
- A stream of record mutations that must be applied one after another within a single record id
- A stream with many records and no sequential mutations for any given record id has no data dependency
- A stream with a single record id and only sequential mutations has maximum data dependency
- A topic with a single partition has maximum data dependency (in some sense)
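One way to make the examples above concrete is to count how many events share a record id. This is a minimal sketch, not something from the talk: the function name `dependency_profile` and the assumption that every event for a given record id depends on the previous event for that id are illustrative.

```python
from collections import Counter


def dependency_profile(record_ids):
    """Summarize inter-event data dependency for a stream of record ids.

    Assumption (not from the slides): each event must be applied after the
    previous event with the same record id, so every id forms one chain.
    """
    chains = Counter(record_ids)
    total = sum(chains.values())
    if total == 0:
        return {"events": 0, "chains": 0, "avg_chain": 0.0, "longest_chain": 0}
    return {
        "events": total,
        "chains": len(chains),
        "avg_chain": total / len(chains),
        "longest_chain": max(chains.values()),
    }


# Many records, no repeated ids: every chain has length 1 (no data dependency).
no_dependency = dependency_profile(["a", "b", "c", "d"])

# A single record id: one chain containing every event (maximum dependency).
max_dependency = dependency_profile(["a", "a", "a", "a"])
```

Chains of length 1 can be processed in any order and on any worker. A single long chain has to be processed one event at a time.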

Inter-Event Data Dependency
The average length of the data-dependent chains in your stream decides your average throughput at any scale. This is equivalent to the way “the sequential portion” of a problem constrains the maximum parallel speedup in Amdahl’s Law.
S: max speedup fraction, s: parallelism, p: “data dependency fraction”
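The formula on this slide did not survive in the transcript, so the following is a sketch under a stated assumption: p is treated as the sequential (data-dependent) fraction of the work, and s as the number of parallel workers. Under that reading, Amdahl's Law gives S = 1 / (p + (1 - p) / s). The function name `max_speedup` is illustrative.

```python
def max_speedup(p, s):
    """Amdahl's Law with p as the sequential (data-dependent) fraction.

    Assumption: p is the fraction of work that cannot be parallelized and
    s is the degree of parallelism. The slide's own formula is not in the
    transcript, so this mapping of symbols is a guess.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s < 1:
        raise ValueError("s must be at least 1")
    return 1.0 / (p + (1.0 - p) / s)


# With no data dependency, speedup tracks the number of workers.
free_scaling = max_speedup(0.0, 8)

# With half the work data-dependent, speedup stays below 2x at any scale.
capped_scaling = max_speedup(0.5, 1_000_000)
```

As s grows, the speedup approaches 1 / p, which is why adding hardware stops helping once the data-dependent portion dominates.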

Big Idea
“It’s all about the data-dependency, baby”
No data dependency: Smooth scaling
Data dependency: Navigating hell

No Data Dependency ∅DD

Reference Architecture
Stream to Stream
  What it is: Webhook system
  How it scales: Money = Performance
Diagram labels: Subscription information, Change Notifications, Delivery Requests

What you can do with ∅DD
Abstractly:
- Data reshaping
- Order-independent enrichment
- Non-self joins
Concrete Use Cases:
- Adapters
- Sentiment detectors
- Geo-IP mappers
- Redaction
If no external data access is required: Redpanda transforms FTW!
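Redaction is a convenient illustration of ∅DD, because each event is transformed using only its own contents. The sketch below is plain Python, not the Redpanda transforms API. The event shape (a dict with a `message` field), the simplified email pattern, and the function names are all assumptions made for illustration.

```python
import re

# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(event):
    """Return a copy of the event with email addresses masked.

    The output depends only on this one event: no shared state, no lookups
    against earlier events, so any worker can handle any event.
    """
    return {**event, "message": EMAIL.sub("[redacted]", event["message"])}


def process(events):
    """Apply the stateless transform to a batch of events."""
    return [redact(e) for e in events]


events = [
    {"id": 1, "message": "contact bob@example.com now"},
    {"id": 2, "message": "nothing sensitive here"},
]
forward = process(events)
backward = process(list(reversed(events)))
```

Processing the batch in reverse yields the same per-event results, which is the property that lets this kind of pipeline scale by adding shards and partitions.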

Performance Tradespace
More money = More Throughput
Tactics: Add shards and partitions until you have enough capacity

Deferred Inter-Event Data Dependency DEDD

Reference Architecture
Stream to State
  What it is: ODP “Analytic segments”, Optimizely Experimentation
  How it scales: Write: Money = Performance. Read: Data dependency limited

What you can do with DEDD
Abstractly:
- Use it when write performance is more important than read performance
Concrete Use Cases:
- Reporting: Especially when users read less than they write
- Nightly model training
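The write/read asymmetry of DEDD can be sketched as an append-only store that defers ordering work until a query arrives. This is an illustrative toy, not the ODP implementation. The class name `DeferredStore`, the `seq` ordering field, and the `set`/`add` mutation types are assumptions.

```python
from collections import defaultdict


class DeferredStore:
    """Append on write, resolve ordering on read (hypothetical sketch)."""

    def __init__(self):
        self._log = defaultdict(list)

    def write(self, record_id, seq, op, value):
        # Writes never inspect earlier events, so they scale like the ∅DD case.
        self._log[record_id].append((seq, op, value))

    def read(self, record_id):
        # The data-dependent work happens here: mutations for one record
        # are replayed in sequence order, one after another.
        state = 0
        for _, op, value in sorted(self._log.get(record_id, []), key=lambda m: m[0]):
            if op == "set":
                state = value
            elif op == "add":
                state += value
            else:
                raise ValueError(f"unknown op: {op}")
        return state


store = DeferredStore()
store.write("r1", 2, "add", 5)   # arrives first, but applies second
store.write("r1", 1, "set", 10)  # arrives second, but applies first
resolved = store.read("r1")
```

Writes arrive out of order and are accepted without coordination. The cost shows up on the read side, where the length of the chain for a record decides how much work a query has to do.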

Performance Tradespace
Write side: More money = More Speed
Read side: Data-dependency limited
Tactics: Reduce data dependency with finer grained partitioning

Streaming Inter-Event Data Dependency SEDD

Reference Architecture
Stateful Stream to Stream
  What it is: ODP “Real-Time Segments”
  How it scales: Data dependency limited

What you can do with SEDD
Abstractly:
- Streaming aggregates
- Pattern detectors
Concrete Use Cases:
- Segmentation
- Real time model training
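A streaming aggregate that drives segmentation might look like the sketch below. It is a hypothetical example, not the Real-Time Segments implementation. The event shape `(user_id, amount)`, the threshold rule, and the name `segment_stream` are assumptions.

```python
def segment_stream(events, threshold):
    """Emit a segment-entry event when a user's running total crosses a threshold.

    Each output depends on every earlier event for the same user, so events
    for one user must be processed in order. That per-key chain is the
    streaming inter-event data dependency.
    """
    totals = {}
    for user_id, amount in events:
        before = totals.get(user_id, 0)
        after = before + amount
        totals[user_id] = after
        if before < threshold <= after:
            yield (user_id, "entered_segment")


purchases = [("u1", 40), ("u2", 70), ("u1", 70), ("u2", 50)]
entries = list(segment_stream(purchases, threshold=100))
```

Different users can be handled on different partitions, but the events for any single user cannot be spread across workers without coordinating the running total.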

Performance Tradespace
Throughput: Data-dependency limited
Tactics for reducing data-dependency:
- Finer grained partitioning
- Accept eventual consistency with CRDTs
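As a rough illustration of the CRDT tactic, a grow-only counter (G-Counter) lets each worker count independently and merge later. The talk does not say which CRDTs are used, so this particular choice and the class below are assumptions.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by per-node max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so replicas can
        # exchange state in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(3)   # worker a counts its own events
b.increment(2)   # worker b counts independently
a.merge(b)
b.merge(a)
a.merge(b)       # repeated merge does not double count
```

The workers never wait on each other to count, which removes the dependency between their events. The price is that a reader may see a stale total until the merges have propagated.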

Fundamental Tradeoff
Plot axes: Inter-event data dependency, Max throughput
If you need SEDD and throughput, welcome to hell.

But…

Query Latency and Data Latency
Query Latency: Time it takes to respond to a request
  Driven by: DD work remaining to resolve the request
  Impact: The places where it’s suitable to use your query API
Data Latency: How long it takes for new information to impact a query
  Driven by: How you cheated to hide from your data dependency
  Impact: How actionable the results from your API are

“Cheating” out of Hell
Stream to State, then Periodic State to Stream
  What it is: Everyone else’s “Real-Time Segments”
  How it scales: Introduces a data latency / cost tradeoff. Min-data latency is now data dependency limited

Wrapping it Up

Data Dependency Decides Everything
∅DD - Oddly common in example code and marketing materials. Very rarely happens in real life.
DEDD - Practical workaround most of the time. Became truly effective in the last decade as data warehouses have matured.
SEDD - Sounds like “sad” for a reason. A difficult place to be. Hopefully the next decade will bring some meaningful breakthroughs here.

Keep in touch!
Brian Taylor, Director of Engineering, Optimizely
[email protected] @netguy204