Cost-Effective Burst Scaling For Distributed Query Execution

ScyllaDB · 16 slides · Jun 25, 2024

About This Presentation

Building a query engine that scales efficiently is a difficult task. Queries over big datasets stored in object storage require a large amount of I/O and compute power. Keeping the latency of expensive queries acceptable when using a fixed-size compute cluster is only possible when over-provisioning ...


Slide Content

Cost-Effective Burst Scaling for Distributed Query Execution
Dan Harris, Principal Software Engineer @ Coralogix

Dan Harris (he/him), Principal Software Engineer @ Coralogix
- Building the DataPrime Query Engine, a distributed query engine for low-cost, real-time analytics on observability data
- Apache Arrow committer
- To the user, a query engine is only as good as its p99 latency
- When I'm not hacking, I'm a runner, hiker, husband and dad (not in that order :))

DataPrime Query Engine: Goals
- Native execution of DataPrime, a custom query language optimized for heterogeneous, semi-structured data
- Low-latency interactive query execution over petabytes of highly dynamic data
- Cost-effectiveness
  - Separation of compute and storage
  - Efficient resource utilization to manage infrastructure costs

DataPrime Query Engine: Architecture
- Based on Arrow Ballista, a distributed execution engine for Arrow DataFusion
- Two component services:
  - Scheduler: responsible for distributed query planning and orchestration
  - Executor: executes tasks assigned by the scheduler
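The scheduler/executor split can be pictured as a work queue: the scheduler owns the task queue and executors pull tasks from it. A minimal toy sketch of that relationship (these class names and the pull-based loop are illustrative, not Ballista's actual API):

```python
from collections import deque

class Scheduler:
    """Toy scheduler: holds a queue of ready tasks and hands them out."""
    def __init__(self, tasks):
        self.queue = deque(tasks)

    def next_task(self):
        return self.queue.popleft() if self.queue else None

class Executor:
    """Toy executor: pulls tasks from the scheduler and runs them."""
    def __init__(self, name, scheduler):
        self.name = name
        self.scheduler = scheduler
        self.completed = []

    def run(self):
        while (task := self.scheduler.next_task()) is not None:
            self.completed.append(task())  # execute the task closure

sched = Scheduler([lambda i=i: i * i for i in range(4)])
ex = Executor("executor-0", sched)
ex.run()
print(ex.completed)  # [0, 1, 4, 9]
```

In the real engine the scheduler also tracks task state and shuffle locations; the point here is only the division of labor between the two services.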

Autoscaling: Thesis
- Customer workloads are fairly predictable
  - Follow normal business hours in a given region
- Query execution can be parallelized very effectively
- Separation of compute and storage
  - Scaling compute does not require moving data around

Autoscaling: Antithesis
- Individual queries can be very large
  - Not uncommon to scan ~500 TB for a single query
- By the time you scale up, the query is already done or has already timed out
- Auto-scaling signals are laggy
- Servers are expensive…

Autoscaling: Synthesis
We can decompose this into two sub-problems:
- Problem 1: How do we keep enough spare capacity to manage sudden bursts without falling over?
- Problem 2: How do we manage sustained load cost-effectively?

Autoscaling: Right Tool for the Job
- Problem 1: AWS Lambda
  - Instant(ish) availability
  - Only pay for what you use
  - Really, really expensive when you do use it
- Problem 2: AWS EC2
  - Delayed provisioning
  - Pay whether you use it or not
  - Relatively cheap
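The Lambda-vs-EC2 trade-off comes down to a break-even utilization: Lambda bills only for busy time, EC2 bills for the whole provisioned window. A back-of-the-envelope sketch with illustrative prices (the numbers below are assumptions for the arithmetic, not current AWS pricing; check the pricing pages for real figures):

```python
# Illustrative prices -- assumptions for the arithmetic, not quoted AWS rates.
LAMBDA_PER_GB_SECOND = 0.0000166667   # $/GB-second of function runtime
EC2_PER_HOUR = 0.17                   # $/hour for an assumed 8 GB instance
MEMORY_GB = 8

def lambda_cost(busy_seconds, memory_gb=MEMORY_GB):
    """Lambda bills only for the seconds the function is actually running."""
    return busy_seconds * memory_gb * LAMBDA_PER_GB_SECOND

def ec2_cost(hours):
    """EC2 bills for the whole provisioned window, busy or idle."""
    return hours * EC2_PER_HOUR

full_hour_on_lambda = lambda_cost(3600)
# Utilization above which a provisioned EC2 instance becomes cheaper:
break_even = EC2_PER_HOUR / full_hour_on_lambda

print(f"1h fully busy on Lambda: ${full_hour_on_lambda:.2f}")
print(f"1h on EC2 (any load):    ${ec2_cost(1):.2f}")
print(f"EC2 is cheaper above ~{break_even:.0%} utilization")
```

Under these assumed prices, sustained load above roughly a third utilization is cheaper on EC2, while short bursts are cheaper on Lambda because you pay nothing for the idle time in between.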

Distributed Query Execution: A Brief Primer

DataPrime query:
source logs | filter kubernetes.pod_name ~ 'my-service' | groupby application, severity derive count(1)

Equivalent SQL:
SELECT application, severity, count(1) FROM logs WHERE kubernetes.pod_name LIKE '%my-service%'

Execution plan:
AggregateExec: mode=Final, gby=[application, severity], aggr=[count(1)]
  CoalescePartitionsExec
    AggregateExec: mode=Partial, gby=[application, severity], aggr=[count(1)]
      ProjectionExec: expr=[application, severity]
        TableScanExec: logs, partitions=1000

Distributed Query Execution: A Brief Primer

Distributed execution plan:

Stage 1:
ShuffleWriterExec
  AggregateExec: mode=Partial, gby=[application, severity], aggr=[count(1)]
    ProjectionExec: expr=[application, severity]
      TableScanExec: logs, partitions=1000

Stage 2:
ShuffleWriterExec
  AggregateExec: mode=Final, gby=[application, severity], aggr=[count(1)]
    CoalescePartitionsExec
      ShuffleReadExec
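The partial/final split in the plan above is the heart of distributed aggregation: each executor counts its own partitions (Stage 1), the partial counts are shuffled by group key, and one final pass merges them (Stage 2). A minimal sketch of that two-stage count, with made-up rows standing in for the `logs` table:

```python
from collections import Counter

def partial_aggregate(partition):
    """Stage 1: each executor counts (application, severity) pairs locally."""
    return Counter((row["application"], row["severity"]) for row in partition)

def final_aggregate(partials):
    """Stage 2: merge the shuffled partial counts into the final result."""
    total = Counter()
    for p in partials:
        total.update(p)  # adds counts key-by-key
    return dict(total)

# Two partitions of the `logs` table, as the scheduler might split them.
partitions = [
    [{"application": "api", "severity": "ERROR"},
     {"application": "api", "severity": "INFO"}],
    [{"application": "api", "severity": "ERROR"},
     {"application": "web", "severity": "INFO"}],
]

partials = [partial_aggregate(p) for p in partitions]
result = final_aggregate(partials)
print(result)
# {('api', 'ERROR'): 2, ('api', 'INFO'): 1, ('web', 'INFO'): 1}
```

Because partial counts are tiny compared to the scanned data, only the shuffle between stages crosses the network; the 1000-partition table scan never moves.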

Distributed Query Execution: Data Flow

Hybrid Execution

Autoscaling: Signals
- CPU utilization
  - Per-executor utilization throttled by the scheduler
- Task queue
  - Good metric for pending work
  - Too volatile when sampled
- Task load average
  - Good metric for cluster utilization
  - Accurate sampling
  - Smoothed value is a good metric for "sustained" load
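"Smoothed value" here is the usual trick of running an exponential moving average over the raw samples so a short spike does not trigger a scale-out, while sustained load still shows through. A sketch (the `alpha` value and sample series are illustrative, not the engine's actual tuning):

```python
def smoothed_load(samples, alpha=0.2):
    """Exponential moving average: damps spikes, tracks sustained load."""
    ema = samples[0]
    out = [ema]
    for s in samples[1:]:
        ema = alpha * s + (1 - alpha) * ema
        out.append(ema)
    return out

# A short burst on top of a sustained baseline of ~2 tasks per executor.
raw = [2, 2, 10, 12, 2, 2, 2, 2]
smooth = smoothed_load(raw)
print([round(x, 1) for x in smooth])
# [2, 2, 3.6, 5.3, 4.6, 4.1, 3.7, 3.3]
```

The burst to 12 only lifts the smoothed signal to ~5.3 and it decays afterwards, so it reads as a spike (handled by serverless burst), not as sustained load (which would justify adding EC2 capacity).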

Autoscaling: All Together Now
- Architect for hybrid execution (serverless + EC2)
- Burst spiky compute loads onto serverless workers
- Shift sustained load to cheaper EC2 compute
- Profit!
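Putting the pieces together, the routing decision can be sketched as: prefer free EC2 capacity, burst overflow onto serverless immediately, and use the smoothed load average only as the (slow) signal to grow the EC2 fleet. The function name and threshold below are hypothetical, a sketch of the policy rather than the engine's actual code:

```python
def route_task(ec2_free_slots, smoothed_load, scale_up_at=4.0):
    """Decide where a task runs and whether to grow the EC2 fleet.

    Returns (target, scale_out_ec2):
      target        -- "ec2" if provisioned capacity is free, else "lambda"
      scale_out_ec2 -- True when sustained load justifies adding EC2 nodes
    """
    scale_out = smoothed_load > scale_up_at  # sustained load -> cheaper EC2
    if ec2_free_slots > 0:
        return "ec2", scale_out
    return "lambda", scale_out               # spike: burst to serverless now

# A sudden spike with no free EC2 slots runs on Lambda right away,
# but does not by itself trigger an EC2 scale-out:
print(route_task(ec2_free_slots=0, smoothed_load=2.0))  # ('lambda', False)
```

The key property is that the expensive path (Lambda) absorbs latency-critical bursts instantly, while the cheap path (EC2) only grows in response to the smoothed signal, so you never pay Lambda prices for sustained load longer than the EC2 provisioning delay.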

Coda

Dan Harris
dan@coralogix
@thinkharderdev
https://github.com/thinkharderdev
https://www.linkedin.com/in/dsh2va/
Thank you! Let's connect.