ScyllaDB’s Workload Prioritization provides resource optimization and performance isolation across workloads with different performance needs, such as Analytics and Real-time. In this session, you will learn how Workload Prioritization works, how you can use it to run different types of workloads together under a single ScyllaDB cluster, and how to fine-tune priorities and resource allocation based on your specific requirements.
Added: Jun 20, 2024
Slides: 26 pages
Slide Content
Real-Time or Analytics Workloads... Why Not Both? Felipe Cardeneti Mendes, Solution Architect at ScyllaDB
Felipe Mendes: Published Author, ScyllaDB Committer, IT Specialist & Solution Architect, Open Source Enthusiast
ScyllaDB as an Analytics Engine
An IoT Application
- 1,000,000 sensors, representing homes in an area
- 1 reading per minute
- 365 days (1 year storage requirement)
- Total amount of data points: 526 billion temperature readings
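The total follows directly from the figures on this slide. As a quick check, assuming exactly 60 × 24 = 1440 readings per sensor per day and no missing readings:

```latex
\[
10^{6}\ \text{sensors} \times 1440\ \frac{\text{readings}}{\text{sensor}\cdot\text{day}} \times 365\ \text{days}
= 525{,}600{,}000{,}000 \approx 526 \times 10^{9}\ \text{readings}
\]
```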
Analytics over the entire data? How long would it take at normal speeds?
- 200,000 points/second: 730 hours (30 days)
- 1 million points/second: 146 hours (almost a week)
We need more if analytics are a part of the pipeline. We need ScyllaDB.
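Both durations come from dividing the full dataset by the scan rate. A worked version, assuming the unrounded total of 525.6 billion points (1,000,000 sensors × 1440 readings/day × 365 days):

```latex
\begin{align*}
\frac{525.6 \times 10^{9}\ \text{points}}{2 \times 10^{5}\ \text{points/s}}
  &= 2{,}628{,}000\ \text{s} = 730\ \text{h} \approx 30.4\ \text{days} \\
\frac{525.6 \times 10^{9}\ \text{points}}{10^{6}\ \text{points/s}}
  &= 525{,}600\ \text{s} = 146\ \text{h} \approx 6.1\ \text{days}
\end{align*}
```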
Easy Peasy: Scanning 3 months of data
    Finished Scanning. Succeeded 132,480,000,000 rows. Failed 0 rows. RPC Failures: 0.
    Took 110,892.91 ms
    Processed 1,194,666,083 rows/s
    Absolute min: 19.71, date 2019-08-27, sensorID 473869
    Absolute max: 135.21, date 2019-08-27, sensorID 473869
Easy Peasy: Scanning the entire dataset
    Finished Scanning. Succeeded 525,599,474,400 rows. Failed 0 rows. RPC Failures: 0.
    Took 542,191.31 ms
    Processed 969,398,554 rows/s
    Absolute min: 68.00, date 2019-05-28, sensorID 82114
    Absolute max: 79.99, date 2019-03-19, sensorID 152594
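The reported rates can be reproduced from the row counts and elapsed times in the two scan outputs. The 92-day reading of "3 months" is an inference from the row count (1,000,000 sensors × 1440 readings/day × 92 days), not something the slides state:

```latex
\begin{align*}
\text{3 months:}\quad
  & \frac{132{,}480{,}000{,}000\ \text{rows}}{110.89291\ \text{s}} \approx 1.19 \times 10^{9}\ \text{rows/s} \\
\text{Entire dataset:}\quad
  & \frac{525{,}599{,}474{,}400\ \text{rows}}{542.19131\ \text{s}} \approx 9.69 \times 10^{8}\ \text{rows/s}
\end{align*}
```

At that pace the full scan takes about 542 s (roughly 9 minutes), compared with the 146 hours estimated at 1 million points/second.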
We can efficiently process about 1.2 billion points per second (we’ll process whatever you need, too!)
Concurrency Challenges
The Latency Problem: P99 climbs to unacceptable values
The Throughput Problem: Throughput gradually drops, load distribution becomes unfair
Why Does Contention Happen?
- Primarily lack of system resources (disk I/O, CPU time)
- Not necessarily a problem
- Introduces queueing
Addressing the Problem? Easy then, let's simply ... Divide and conquer!
- Division in space (multi-DC)
- Division in time (off-peak OLAP)
Workload Prioritization
Isolation and Performance Optimization: Background Tasks, User Tasks
Shares
- Different workloads require different priorities
- Meet SLAs
- Flexible configuration
- Adaptability to changing conditions
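One way to read shares is as relative weights. The sketch below assumes that, under contention, each service level receives resources in proportion to its shares; this is a simplification of the scheduler's actual behavior, and the symbols are illustrative. The sample values are the shares=800 and shares=200 settings that appear later in this deck:

```latex
\[
f_i = \frac{s_i}{\sum_j s_j},
\qquad
f_{\text{800}} = \frac{800}{800 + 200} = 0.8,
\qquad
f_{\text{200}} = \frac{200}{800 + 200} = 0.2
\]
```

Here $s_i$ is the number of shares of service level $i$ and $f_i$ is its approximate fraction of a contended resource. When there is no contention, a workload can typically use whatever is idle.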
Getting Started
- Set up authentication, assign Roles to each workload
  - Prioritization is wired on the authenticated user's role
- Define your Service Levels
  - For a primary workload: CREATE SERVICE LEVEL main WITH shares = 200
  - For a secondary one: CREATE SERVICE LEVEL secondary WITH shares = 600
- Profit!
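The steps on this slide can be sketched end to end as follows. The role names (`main_app`, `secondary_app`) and passwords are hypothetical placeholders, and the `ATTACH` statement is an assumption based on ScyllaDB's service-level documentation, since the slide does not show how a service level is bound to a role. Some versions of the documentation spell the keyword `SERVICE_LEVEL`; check the syntax for your release.

```cql
-- Hypothetical roles, one per workload (names and passwords are placeholders).
CREATE ROLE main_app WITH PASSWORD = 'change-me' AND LOGIN = true;
CREATE ROLE secondary_app WITH PASSWORD = 'change-me' AND LOGIN = true;

-- Service levels exactly as given on the slide.
CREATE SERVICE LEVEL main WITH shares = 200;
CREATE SERVICE LEVEL secondary WITH shares = 600;

-- Assumed binding step: attach each service level to the role
-- that its workload authenticates as.
ATTACH SERVICE LEVEL main TO main_app;
ATTACH SERVICE LEVEL secondary TO secondary_app;
```

Each application then connects with its own role, and its requests are scheduled under the attached service level.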
Prioritization and Isolation are NOT enough!
Workload Characteristics
Workload Characteristics: Time
The timeout dilemma. Timeouts should follow: $T_{\text{server}} \le T_{\text{client}}$
- For Real-time: can't be too high
  - Incurs retries or dropped requests
  - Excessive retries result in wasted resources
- For Analytics: can't be too low
  - Otherwise the batch will likely fail
  - High throughput will typically increase latencies due to contention
Workload Characteristics: Shedding
Overload response:
- Interactive workload: throttling won't help
  - Delaying the response to user A will not cause some user B to stop sending requests
  - Unbounded concurrency
- Batch workload: just throttle
  - Gives us a knob that controls the pace of the analytics workload
  - Bounded concurrency
Introducing Workload Characterization
Ideally, we want the database to behave differently:
- For Real-time:
  - Have a low timeout
  - Load shedding (fail excessive requests), as the database can NOT slow down interactive workloads
  - Dedicate most of the resources to this workload
- For Batch:
  - Relatively higher timeout
  - Apply back-pressure via throttling
  - Use mostly unused resources
Introducing Workload Characterization
Why not just hint the database with specifics?
For Real-Time:
- Have low timeout (30ms): timeout=30ms
- Load shedding: AND workload_type=interactive
- Dedicate most of the resources: AND shares=800
For Analytics:
- Have relatively high timeout (5s): timeout=5s
- Throttling: AND workload_type=batch
- Use mostly unused resources: AND shares=200
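Put together, the hints on this slide might be expressed as two service-level definitions. This is a minimal sketch: the service level names (`realtime`, `analytics`) are hypothetical, and quoting the `workload_type` value is an assumption (the slide writes it unquoted), so verify the exact syntax against the documentation for your ScyllaDB version.

```cql
-- Real-time: low timeout, load shedding, most of the resources.
-- The name "realtime" is a placeholder, not taken from the slides.
CREATE SERVICE LEVEL realtime
  WITH timeout = 30ms AND workload_type = 'interactive' AND shares = 800;

-- Analytics: higher timeout, throttling, mostly unused resources.
-- The name "analytics" is a placeholder, not taken from the slides.
CREATE SERVICE LEVEL analytics
  WITH timeout = 5s AND workload_type = 'batch' AND shares = 200;
```

Each service level still has to be attached to the role that the corresponding workload authenticates as before it takes effect for that workload.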