Cost-Effective Burst Scaling For Distributed Query Execution
ScyllaDB
Jun 25, 2024
About This Presentation
Building a query engine that scales efficiently is a difficult task. Queries over big datasets stored in object storage require a large amount of IO and compute power. Keeping the latency of expensive queries acceptable on a fixed-size compute cluster is only possible by over-provisioning the cluster, while dynamically scaling up or down is too slow for interactive queries.
To overcome these challenges, we built a distributed execution model that allows us to dynamically execute across both AWS Lambda and EC2 resources. With this model we can shed excess load to Lambda functions to preserve low latency while we scale EC2 capacity to manage costs.
Size: 1.3 MB
Language: en
Slides: 16 pages
Slide Content
Cost-Effective Burst Scaling for Distributed Query Execution
Dan Harris, Principal Software Engineer @ Coralogix
Dan Harris (he/him), Principal Software Engineer @ Coralogix
Building the DataPrime Query Engine, a distributed query engine for low-cost, real-time analytics on observability data
Apache Arrow committer
To the user, a query engine is only as good as its p99 latency
When I’m not hacking, I’m a runner, hiker, husband and dad (not in that order :))
DataPrime Query Engine: Goals
Native execution of DataPrime, a custom query language optimized for heterogeneous, semi-structured data
Low-latency interactive query execution over petabytes of highly dynamic data
Cost-effectiveness:
Separation of compute and storage
Efficient resource utilization to manage infrastructure costs
DataPrime Query Engine: Architecture
Based on Arrow Ballista, a distributed execution engine for Arrow DataFusion
Two component services:
Scheduler: responsible for distributed query planning and orchestration
Executor: executes tasks assigned by the scheduler
Autoscaling: Thesis
Customer workloads are fairly predictable: they follow normal business hours in a given region
Query execution can be parallelized very effectively
Separation of compute and storage: scaling compute does not require moving data around
Autoscaling: Antithesis
Individual queries can be very large: not uncommon to scan ~500 TB for a single query
By the time you scale up, the query is already done or has already timed out
Auto-scaling signals are laggy
Servers are expensive…
Autoscaling: Synthesis
We can decompose this into two sub-problems:
Problem 1: How do we keep enough spare capacity to manage sudden bursts without falling over?
Problem 2: How do we manage sustained load cost-effectively?
Autoscaling: Right Tool for the Job
Problem 1: AWS Lambda
Instant(ish) availability
Only pay for what you use
Really, really expensive when you do use it
Problem 2: AWS EC2
Delayed provisioning
Pay whether you use it or not
Relatively cheap
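The Lambda-vs-EC2 tradeoff above can be made concrete with a toy cost model. The prices below are purely illustrative assumptions (not current AWS rates), but they show the shape of the argument: at sustained utilization the always-on instance wins, while at low, bursty utilization pay-per-use wins.

```python
# Illustrative cost model with hypothetical prices (NOT actual AWS rates):
# compare serving compute on Lambda (pay per use) vs. EC2 (pay always).

LAMBDA_PRICE_PER_GB_SECOND = 0.0000167  # assumed Lambda on-demand rate
EC2_PRICE_PER_HOUR = 0.17               # assumed rate for a 4 GB instance
EC2_MEMORY_GB = 4

def lambda_cost(gb_seconds: float) -> float:
    """Only pay for what you use."""
    return gb_seconds * LAMBDA_PRICE_PER_GB_SECOND

def ec2_cost(hours: float) -> float:
    """Pay whether you use it or not."""
    return hours * EC2_PRICE_PER_HOUR

# One hour of fully utilized 4 GB compute:
busy_gb_seconds = EC2_MEMORY_GB * 3600
print(f"Lambda, 100% busy: ${lambda_cost(busy_gb_seconds):.2f}/h")
print(f"EC2, always on:    ${ec2_cost(1):.2f}/h")
# At 5% utilization the bursty workload is far cheaper on Lambda:
print(f"Lambda, 5% busy:   ${lambda_cost(busy_gb_seconds * 0.05):.3f}/h")
```

Under these assumed prices, sustained load belongs on EC2 and only the bursts belong on Lambda, which is exactly the split the rest of the deck builds toward.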
Distributed Query Execution: A Brief Primer
DataPrime:
source logs
| filter kubernetes.pod_name ~ 'my-service'
| groupby application, severity derive count(1)
SQL:
SELECT application, severity, count(1)
FROM logs
WHERE kubernetes.pod_name LIKE '%my-service%'
GROUP BY application, severity
Execution Plan:
AggregateExec: mode=Final, gby=[applicationame, severity], aggr=[count(1)]
  CoalescePartitionsExec
    AggregateExec: mode=Partial, gby=[applicationame, severity], aggr=[count(1)]
      ProjectionExec: expr=[applicationame, severity]
        TableScanExec: logs, partitions=1000
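The plan's two AggregateExec stages are the key to parallelism: each of the 1000 scan partitions computes a Partial count independently (so this work can fan out across many executors), and only the small partial results are coalesced and merged in the Final stage. A toy sketch of that shape (not Ballista's actual code, and with hypothetical row data):

```python
# Toy sketch of two-phase aggregation, mirroring the plan above:
# AggregateExec mode=Partial per partition, then mode=Final after coalescing.
from collections import Counter

def partial_aggregate(partition):
    """mode=Partial: count rows per (application, severity) key."""
    return Counter((row["application"], row["severity"]) for row in partition)

def final_aggregate(partials):
    """mode=Final: merge the partial counts into one result."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

partitions = [
    [{"application": "api", "severity": "ERROR"},
     {"application": "api", "severity": "INFO"}],
    [{"application": "api", "severity": "ERROR"}],
]
result = final_aggregate(partial_aggregate(p) for p in partitions)
print(result)  # {('api', 'ERROR'): 2, ('api', 'INFO'): 1}
```

Because only compact partial aggregates cross the network, the expensive scan work parallelizes almost perfectly, which is what makes the burst-to-Lambda strategy viable.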
Autoscaling: Signals
CPU Utilization: per-executor utilization is throttled by the scheduler
Task Queue: a good metric for pending work, but too volatile when sampled
Task Load Average: a good metric for cluster utilization, with accurate sampling; the smoothed value is a good metric for “sustained” load
Autoscaling: All Together Now
Architect for hybrid execution (serverless + EC2)
Burst spiky compute loads onto serverless workers
Shift sustained load to cheaper EC2 compute
Profit!
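Putting the pieces together, the routing policy described above can be sketched as two small functions: fill EC2 capacity first and shed the overflow to Lambda, while a separately smoothed load signal sets the EC2 fleet's target size. This is a hypothetical sketch of the policy, not Coralogix's actual implementation; the function names and the headroom factor are assumptions.

```python
# Hypothetical sketch of hybrid routing: EC2 first (cheap), Lambda for
# overflow (fast); smoothed load drives the EC2 fleet's target size.

def route_tasks(pending_tasks: int, ec2_slots: int):
    """Fill available EC2 slots, then burst the excess onto Lambda."""
    to_ec2 = min(pending_tasks, ec2_slots)
    to_lambda = pending_tasks - to_ec2
    return to_ec2, to_lambda

def target_ec2_slots(smoothed_load: float, headroom: float = 1.2) -> int:
    """Size the EC2 fleet for sustained load plus a little headroom."""
    return int(smoothed_load * headroom)

# A burst of 500 tasks against 300 EC2 slots: 200 tasks shed to Lambda.
print(route_tasks(500, 300))    # (300, 200)
# Sustained load of 400 tasks/interval: grow the fleet toward 480 slots.
print(target_ec2_slots(400.0))  # 480
```

Latency stays low because bursts never wait for provisioning, and cost stays low because sustained work migrates onto EC2 as the fleet catches up.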
Coda
Dan Harris
dan@coralogix
@thinkharderdev
https://github.com/thinkharderdev
https://www.linkedin.com/in/dsh2va/
Thank you! Let’s connect.