Mitigating the Impact of State Management in Cloud Stream Processing Systems

ScyllaDB 110 views 42 slides Jul 02, 2024

About This Presentation

Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issu...


Slide Content

Mitigating the Impact of State Management in Cloud Stream Processing Systems Yingjun Wu Founder at RisingWave Labs

Yingjun Wu (he/him/his): Founder at RisingWave Labs, Chief Everything Officer. Ex-AWS Redshift, ex-IBM Research Almaden. PhD in database systems and stream processing.

What is RisingWave?
- A distributed SQL streaming database
- Open sourced in April 2022 under the Apache 2.0 license
- >5K GitHub stars
- >130 GitHub contributors
- >1K Slack members
- >100K K8s deployments

Background
- Stream processing systems continuously process large volumes of data.
- They need to maintain internal state for stateful operators such as joins and aggregations.

State Management
Consider joining two data streams:
- Impression stream
- Click stream
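
To make the state requirement concrete, here is a minimal sketch of a symmetric hash join between the two streams. It is an illustration only, not RisingWave's implementation: the class name, the `ad_id` join key, and the event values are assumptions introduced for this example.

```python
from collections import defaultdict


class ImpressionClickJoin:
    """Symmetric hash join: both sides are buffered so either can arrive first."""

    def __init__(self):
        # Join state, keyed by the (assumed) join key ad_id.
        self.impressions = defaultdict(list)
        self.clicks = defaultdict(list)

    def on_impression(self, ad_id, event):
        # Remember the impression, then match it against clicks seen so far.
        self.impressions[ad_id].append(event)
        return [(event, click) for click in self.clicks[ad_id]]

    def on_click(self, ad_id, event):
        # Remember the click, then match it against impressions seen so far.
        self.clicks[ad_id].append(event)
        return [(imp, event) for imp in self.impressions[ad_id]]

    def state_size(self):
        # Number of buffered events; grows with the input unless state is expired.
        return (sum(len(v) for v in self.impressions.values())
                + sum(len(v) for v in self.clicks.values()))
```

Every event has to be kept until it can no longer match, so the state grows with the streams. That growth is what makes the placement of state (local disk or remote storage) a central design decision.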

State Management: Existing Solutions
- MapReduce style
- Compute-storage coupled

State Management: the Cloud Era
- Cloud-native style
- Compute-storage decoupled

State Management in the Cloud
Consider joining two data streams:
- Impression stream
- Click stream

How can we mitigate the latency that decoupled storage introduces?

Tiered Storage
- EC2 local storage: the “cloud cache”
  - Super fast!
  - Data will get lost if the machine crashes
- EBS: the “cloud disk”
  - Fast
  - Up to 99.999% durability (5 nines; AWS quotes this figure for io2 volumes, other volume types are lower)
- S3: the persistent storage
  - Slow
  - 99.999999999% durability (11 nines)
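
The read path across tiers can be sketched as a small cache in front of a durable store. This is a simplified illustration under stated assumptions: two in-memory dictionaries stand in for EC2 local storage and S3, writes go straight through to the durable tier, and the class and method names are hypothetical.

```python
class TieredStateStore:
    """Local LRU cache (fast, volatile) in front of an object store (slow, durable)."""

    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity
        self.local_cache = {}    # stand-in for EC2 local storage
        self.object_store = {}   # stand-in for S3
        self.remote_reads = 0    # counts slow reads that missed the cache

    def put(self, key, value):
        # Assumption: write-through, so the durable tier always has the data.
        self.object_store[key] = value
        self._cache(key, value)

    def get(self, key):
        if key in self.local_cache:
            value = self.local_cache.pop(key)
            self.local_cache[key] = value  # re-insert as most recently used
            return value
        self.remote_reads += 1             # cache miss: go to the slow tier
        value = self.object_store[key]
        self._cache(key, value)
        return value

    def _cache(self, key, value):
        self.local_cache.pop(key, None)
        self.local_cache[key] = value
        # Dicts keep insertion order, so the first key is the least recently used.
        while len(self.local_cache) > self.cache_capacity:
            del self.local_cache[next(iter(self.local_cache))]

    def crash(self):
        # Losing the machine loses only the cache; the object store survives.
        self.local_cache.clear()
```

After `crash()`, reads still succeed but each one pays a remote read until the cache warms up again, which is the trade-off the tiers above describe.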

Tiered Storage: Should we really leverage EBS?
No, at least for now…
EC2 local storage vs EBS:
- EBS is more expensive than EC2 local storage
- EBS is slower than EC2 local storage
- EBS is persistent while EC2 local storage is not

Tiered Storage How to manage data at different layers?

Log Structured Merge Tree
Use a log-structured merge (LSM) tree!
- Recently accessed data is cached on the local machine
- Upper-level runs are periodically compacted into lower levels
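
The following toy LSM tree illustrates the mechanics: writes land in an in-memory table, full tables are flushed as sorted runs, and compaction merges runs so that reads have fewer places to look. It is a minimal sketch with hypothetical names and a single merged level, not the storage engine described in the talk.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # newest writes, held in memory
        self.runs = []      # sorted runs of (key, value), newest run first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable sorted run.
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest to oldest, so the first hit wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one; applying oldest first lets newer values win.
        merged = {}
        for run in reversed(self.runs):
            merged.update(run)
        self.runs = [sorted(merged.items())]
```

Before compaction a lookup may probe every run; afterwards it probes one. The cost is that compaction rewrites data, which is where the latency spikes discussed next come from.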

Compaction
- Compactions in LSM trees can cause huge latency spikes!
- Move the compaction to remote machines
- Build SST, upload SST
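
One way to picture a remote compactor is as a worker that merges sorted runs and uploads output blocks as soon as they are built, rather than after the whole SST is finished. The sketch below assumes that design; `merge_runs`, `stream_compact`, and the `upload` callback are hypothetical names, and a real compactor would write SST files to object storage instead of calling a Python function.

```python
import heapq


def _tag(run, age):
    # Attach the run's age so that, for equal keys, the newest run sorts first.
    for key, value in run:
        yield key, age, value


def merge_runs(runs):
    """K-way merge of sorted runs (newest run first), keeping the newest value per key."""
    tagged = [_tag(run, age) for age, run in enumerate(runs)]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key


def stream_compact(runs, upload, block_size=2):
    """Build output blocks incrementally and hand each one to upload() immediately."""
    block, uploaded = [], 0
    for entry in merge_runs(runs):
        block.append(entry)
        if len(block) == block_size:
            upload(block)
            uploaded += 1
            block = []
    if block:  # final partial block
        upload(block)
        uploaded += 1
    return uploaded
```

Because the merge runs on a separate machine, the compute node that serves the stream does not spend CPU or I/O on it, and overlapping building with uploading keeps the compactor from holding a full SST in memory.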

Compaction
- Accessing S3 is always expensive!
- Each bucket has a performance limit!
- Using too many buckets can cause fragmentation issues!
- We set #buckets to a magic number and scale based on workloads.
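
A simple way to spread objects over a fixed number of buckets is to hash the object key. The sketch below assumes that approach; the talk does not say how the mapping is done, and the bucket names and the `bucket_for` helper are hypothetical.

```python
import hashlib


def bucket_for(object_key, buckets):
    """Deterministically map an object key to one of the configured buckets."""
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]


# Hypothetical configuration: four buckets sharing the request load.
BUCKETS = [f"state-bucket-{i}" for i in range(4)]
```

With plain modulo hashing, changing the number of buckets remaps most keys, so a system that scales the bucket count with the workload would also need to record where existing objects live.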

Conclusion
- State management is a challenging problem in stream processing systems.
- A decoupled compute-storage architecture helps achieve infinite and independent scalability.
- Tiered storage helps optimize performance.
- Remote compaction and streaming compaction help optimize network usage.
- Set the number of buckets wisely to reduce the S3 access bottleneck.

Yingjun Wu @YingjunWu risingwave.com/github risingwave.com/slack Thank you! Let’s connect.