Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
110 views
42 slides
Jul 02, 2024
Slide 1 of 42
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
About This Presentation
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issu...
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
Size: 3.39 MB
Language: en
Added: Jul 02, 2024
Slides: 42 pages
Slide Content
Mitigating the Impact of State Management in Cloud Stream Processing Systems Yingjun Wu Founder at RisingWave Labs
Yingjun Wu ( he/him/his ) Founder at RisingWave Labs Chief Everything Officer Ex-AWS Redshift Ex-IBM Research Almaden PhD in database systems and stream processing
What is RisingWave? A distributed SQL streaming database Open sourced in April 2022 under Apache 2.0 License >5K GitHub stars >130 GitHub contributors >1K Slack members >100K K8s deployments
Background
Background
Background
Background
Stream processing systems continuously process large volume of data Background
Stream processing systems continuously process large volume of data Need to maintain internal states for stateful operators such as joins and aggregations Background
Consider joining two data streams Impression stream Click stream State Management
Consider joining two data streams Impression stream Click stream State Management
Consider joining two data streams Impression stream Click stream State Management
State Management: Existing Solutions MapReduce style Compute-storage coupled
MapReduce style Compute-storage coupled State Management: Existing Solutions
State Management: Existing Solutions MapReduce style Compute-storage coupled
State Management: Existing Solutions MapReduce style Compute-storage coupled
State Management: the Cloud Era Cloud -native style Compute-storage decoupled
State Management: the Cloud Era Cloud-native style Compute-storage decoupled +
Consider joining two data streams Impression stream Click stream State Management in the Cloud
Consider joining two data streams Impression stream Click stream State Management in the Cloud
Consider joining two data streams Impression stream Click stream State Management in the Cloud
Consider joining two data streams Impression stream Click stream State Management in the Cloud
How to mitigate the issue?
Tiered Storage EC2 local storage: the “cloud cache” Super fast! Data will get lost if machine crashes EBS: the “cloud disk” Fast 99.999% durability (5 nines) S3: the persistent storage Slow 99.999999999% durability (11 nines)
Tiered Storage EC2 local storage: the “cloud cache” Super fast! Data will get lost if machine crashes EBS: the “cloud disk” Fast 99.999% durability (5 nines) S3: the persistent storage Slow 99.999999999% durability (11 nines)
Tiered Storage Should we really leverage EBS?
Tiered Storage Should we really leverage EBS? No, at least for now…
Tiered Storage Should we really leverage EBS? No, at least for now… EC2 local storage vs EBS EBS is more expensive than EC2 local storage EBS is slower than EC2 local storage EBS is persistent while EC2 local storage is not
Tiered Storage How to manage data at different layers?
Log Structured Merge Tree Use log structured merge tree! Recently accessed data will be cached in local machine Upper level runs will be periodically compacted to lower levels
Compaction Compactions in LSM trees can cause huge latency spikes!
Compaction Compactions in LSM trees can cause huge latency spikes! Move the compaction to remote machines
Compaction Compactions in LSM trees can cause huge latency spikes! Move the compaction to remote machines
Compaction Compactions in LSM trees can cause huge latency spikes! Move the compaction to remote machines Build SST Upload SST
Compaction Compactions in LSM trees can cause huge latency spikes! Move the compaction to remote machines Build SST Upload SST
Compaction Accessing S3 is always expensive!
Compaction Accessing S3 is always expensive!
Compaction Accessing S3 is always expensive! S3 buckets
Compaction Accessing S3 is always expensive! Each bucket has performance limit! S3 buckets
Compaction Accessing S3 is always expensive! Each bucket has performance limit! Using too many buckets can cause fragmentation issues! We set #buckets to a magic number and scale based on workloads. S3 buckets
Conclusion State management is a challenging problem in stream processing systems. Decoupled compute-storage architecture helps achieve infinite and independent scalability. Tiered storage helps optimize performance. Using remote compaction and streaming compaction to optimize network. Set number of buckets wisely to reduce S3 access bottleneck.
Yingjun Wu @YingjunWu r isingwave.com/github risingwave.com/slack Thank you! Let’s connect.