Batch to near-realtime: inspired by a real production incident
shiv4289
Jul 09, 2024
About This Presentation
This slide deck was used for the Platformatory Streams meetup in Bengaluru on July 7, 2024.
This is a real-world account from an Apache Druid cluster in production: a story of 48 hours of debugging, learning, and understanding batch vs. stream better, filing a couple of issues in the Druid open-source projects, and finally a stable production pipeline again, thanks to the Druid community. We will discuss which parts of your design could be impacted and how you should change the related systems so that cascading failures don't bring down your complete production availability. As an example, we will discuss the bottlenecks we had in the Overlord, slot issues for Peons in the Middle Managers, Coordinator bottlenecks, how we mitigated task and segment flooding, and which configs we changed, sprinkled with real-world numbers and snapshots from our Grafana dashboards.
Finally, we will list all the learnings and how we made sure we never repeat the same mistakes in production systems.
CONTENTS
- Background: App Context
- The Problem Statement
- Druid 101
- Ingestion Issues & Fixes
- Query Issues & Fixes
- Fixes in common libraries (OSS)
- Be a good citizen!
BACKGROUND: Application Context
- UI-based slice-and-dice analytics with filters
- Easy ops: one multi-tenant DB (Druid) cluster
- Fine-grained isolation per customer / use case
- Resilient pipeline: Temporal for orchestration
- Durable / scalable storage: S3
- Java workers with Postgres storage for state
BACKGROUND: App Ingestion Pipeline
EARLIER: old batch system, every 3 hrs
NOW: near-real-time ingestion; nudge the state machine, absorb backpressure; cron every 5 mins
Batch to near-real-time system: cron every 5 mins; Ingestion V1 => Ingestion V2; Druid ingestion tasks go from big tasks to many small tasks
ANY IDEAS how to handle this? (Image: https://www.dreamstime.com/stock-image-diversity-show-hands-image9264971)
IDEAS?
- Scale [up / out] resources?
- Isolate tasks [ingest / compact / query]?
- Parallelize?
- Choose simple, performant data-assignment algos?
- Throttle & throw defined error codes: global queues, per-customer queues
MORE IDEAS: Compaction for better queries!
- A lot of tasks means a lot of small files.
- During queries, reading from lots of small files is not efficient.
- S3 also prefers a smaller number of bigger files.
- Solution: merge files in the background.
- Enter compaction: combine smaller files into rolled-up (fewer) bigger files (illustrative config below).
- Reduce the granularity as data gets old! Example: minute => hourly => daily => weekly...
- You can't render graphs over huge time ranges at fine granularity anyway.
- Saves disk space, network throughput, and latency.
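As a minimal sketch of the background-merging idea, and not the deck's actual configuration, an auto-compaction config can be submitted to the Coordinator along these lines; the coordinator URL, datasource name, offset, and target granularity are made-up example values:

```python
# Illustrative only: enable Druid auto-compaction for one datasource so that many
# small segments get merged into fewer, larger (daily) segments in the background.
# The coordinator URL, datasource name, offset and granularity are example values.
import json
import urllib.request

COORDINATOR = "http://coordinator.example.com:8081"   # placeholder coordinator URL

compaction_config = {
    "dataSource": "customer_metrics",                  # hypothetical datasource
    "skipOffsetFromLatest": "PT1H",                    # leave the freshest hour alone
    "granularitySpec": {"segmentGranularity": "DAY"},  # roll small segments up to daily
}

req = urllib.request.Request(
    f"{COORDINATOR}/druid/coordinator/v1/config/compaction",
    data=json.dumps(compaction_config).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)
```

With a config like this in place, the Coordinator periodically launches compaction tasks in the background that rewrite the many small segments into fewer larger ones.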
SUMMARY: Change in Requirements
Change in requirement: batch (3 hours) to 5 minutes.
Earlier:
- Agent collects data, dumps to S3. Cron runs every 3 hours, ingests from S3 to Druid.
- SLA: 3 hours
New design:
- SLA: 10 minutes
- Agent collects data, dumps to S3 every 5 minutes.
- Ingestion pipeline ingests to Druid depending on what Druid likes (illustrative ingestion task below).
- Ingestion pipeline gobbles up backpressure.
Release plan:
- Data sources uploaded to the cluster in a phased manner.
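To make the new path concrete, here is a minimal sketch, not the deck's actual pipeline code, of what each 5-minute cron tick could do: submit a Druid native batch (index_parallel) task that reads the latest S3 dump. The overlord URL, bucket/prefix, datasource, and column names are all hypothetical:

```python
# Minimal sketch (not the deck's actual pipeline): every cron tick, submit one
# Druid native batch task that ingests the newest 5-minute S3 dump for a customer.
# Bucket, prefix, datasource and column names are hypothetical.
import json
import urllib.request

OVERLORD = "http://overlord.example.com:8090"   # placeholder overlord URL

def build_task(datasource: str, s3_prefix: str) -> dict:
    return {
        "type": "index_parallel",
        "spec": {
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "prefixes": [s3_prefix]},
                "inputFormat": {"type": "json"},
            },
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "ts", "format": "iso"},
                "dimensionsSpec": {"dimensions": ["customer_id", "event_type"]},
                "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
            },
            "tuningConfig": {"type": "index_parallel", "maxNumConcurrentSubTasks": 2},
        },
    }

def submit(task: dict) -> str:
    req = urllib.request.Request(
        f"{OVERLORD}/druid/indexer/v1/task",
        data=json.dumps(task).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["task"]   # overlord responds with the task id

# Example: one 5-minute window for one customer
# submit(build_task("customer_1_metrics", "s3://my-bucket/customer_1/2024-07-07T10:05/"))
```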
Heard of DRUID? (Image: https://www.dreamstime.com/stock-image-diversity-show-hands-image9264971)
DRUID 101
DRUID 101: write path and read path
Peon processes (slots); segments (ready to query)
Druid numbers: 4+ years in prod; cluster size; last 24 hrs
INGESTION ISSUES
Deploying new design
Deploy for 1 customer: Customer 1
Deploy for more: Customer 1, Customer 2, Customer 3
Deploy for all... boom! Customer 1 .. Customer N: too many tasks pile up on one side, while the ingest workers see no tasks (chill life)
Proof of the pudding! (snapshots 1, 2, 3)
SUMMARY: Druid ingestion struggling (Druid Overlord)
- Ingested smaller, but more, tasks.
- Onboarded a few large customers; all good for a day! More confidence.
- Onboarded all customers at once.
- Ingest task queue kept piling up (to 25K); overwhelmed after 5K.
- Soon, the Overlord (ingestion orchestrator) node's CPU usage was at 100%.
- All the tasks were stuck in the pending state.
- Task count was 12x more than before, but the tasks were smaller.
- Ingest-tier nodes (Middle Managers) were sitting idle, with no incoming tasks.
- Ingestion task state was not updating, as the Overlord was overwhelmed.
Get the INGESTION ORCHESTRATOR alive
Overlord process: ingest tasks not being assigned!!
Overlord process: scale up to a bigger VM (AWS EC2)
Overlord process: bigger VM; also scale up the DB to a bigger DB
Handling the Overlord...
- No support for horizontal scalability; only active/passive.
- Vertically scale the Overlord instance as well as its Postgres DB capacity.
- Changed Druid process configs to optimize task assignment!
Handling the Overlord...
- Changed Druid process configs to optimize task assignment!
- Restrict the global task queue & throttle: druid.indexer.queue.maxSize: 5000
- Set max pending tasks per customer to 1. Throttle, don't give up: GET /druid/indexer/v1/pendingTasks?datasource=ds1 (throttling sketch below)
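A minimal sketch of the per-customer throttle described above, assuming a submitter (e.g. the Temporal workflow) that can defer work: check the Overlord's pending-task list for the customer's datasource and only submit when the budget of 1 is free. The overlord URL and the ds1-style datasource name are placeholders.

```python
# Sketch of the per-customer throttle from the slide: before submitting a new
# ingest task for a datasource, ask the overlord how many tasks are already
# pending for it, and back off if the per-customer budget (1) is used up.
# The overlord URL and datasource name are placeholders.
import json
import urllib.request

OVERLORD = "http://overlord.example.com:8090"
MAX_PENDING_PER_DATASOURCE = 1   # slide: "Set max pending tasks per customer to 1"

def pending_task_count(datasource: str) -> int:
    url = f"{OVERLORD}/druid/indexer/v1/pendingTasks?datasource={datasource}"
    with urllib.request.urlopen(url) as resp:
        return len(json.load(resp))   # endpoint returns a JSON array of pending tasks

def should_submit(datasource: str) -> bool:
    # Throttle, don't give up: the caller retries later instead of flooding the queue.
    return pending_task_count(datasource) < MAX_PENDING_PER_DATASOURCE

# Example:
# if should_submit("ds1"):
#     submit_ingest_task("ds1")   # hypothetical submit helper
# else:
#     retry_later()               # absorb backpressure in the ingestion pipeline
```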
SCALE UP POSTGRES DB
- Queries to Postgres for tasks were taking a long time.
- Added more CPU to the DB server.
- Result: Overlord CPU utilization is lower; the number of pending tasks is lower.
Middle Managers: more slots, bigger compaction tasks; more VMs, more slots per Middle Manager
Middle Managers: more slots, bigger compaction tasks; fewer VMs, right-size the Middle Managers
Summary: Scaling Middle Managers
- Increased the number of Middle Managers so that more task slots were available for the Overlord to assign tasks (12 MMs * 5 slots => 24 MMs * 5 slots).
- Then increased the number of slots per Middle Manager, as the new tasks were small, i.e. had fewer files to ingest (24 MMs * 5 slots => 12 MMs * 10 slots).
- Created a separate tier for compaction, as those tasks took more resources than the index tasks (tiering).
- Then right-sized the Middle Manager count in each tier by reducing it (12 MMs * 10 slots => 10 MMs * 10 slots + 2 MMs * 5 slots).
ISSUES IN QUERY TIER
Too many segments to assign!!!
Coordinator Process
Coordinator process: scale up to a bigger VM
Coordinator process: bigger VM, same big DB (already scaled up earlier)
Handling the Coordinator...
- Increased the Coordinator instance type, as it is not horizontally scalable.
- Used the Coordinator dynamic configs (a restart at this scale is hard!):
- Reduce the number of segments considered per Coordinator cycle: maxSegmentsToMove: 1000, percentOfSegmentsToConsiderPerMove: 25
- Assign segments in round-robin fashion first, and lazily reassign with the chosen balancer strategy later: useRoundRobinSegmentAssignment: true (applied as in the sketch below)
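A minimal sketch of applying those three dynamic configs without a restart, by reading the Coordinator's current dynamic config and POSTing it back with the slide's values overlaid; the coordinator URL is a placeholder:

```python
# Sketch: apply the coordinator dynamic configs from the slide without a restart.
# Read the current dynamic config first so unrelated fields keep their values,
# then overlay the three settings discussed on the slide and POST it back.
import json
import urllib.request

COORDINATOR = "http://coordinator.example.com:8081"   # placeholder coordinator URL

with urllib.request.urlopen(f"{COORDINATOR}/druid/coordinator/v1/config") as resp:
    config = json.load(resp)

config.update({
    "maxSegmentsToMove": 1000,
    "percentOfSegmentsToConsiderPerMove": 25,
    "useRoundRobinSegmentAssignment": True,
})

req = urllib.request.Request(
    f"{COORDINATOR}/druid/coordinator/v1/config",
    data=json.dumps(config).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
urllib.request.urlopen(req)
```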
Finally, the QUERY Nodes
Handling the Historicals
Until auto-compaction is done:
- More segments to scan for queries.
- More resources needed to power the query tier.
- Beware: more segments = more cache to load. Check the Linux config vm.max_map_count (sketch below).
Diagram: current Historicals serve queries for recent data with smaller segments; older Historicals serve larger segments, for Datasource 1 and Datasource 2.
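An illustrative check of that last point, not from the deck: Historicals memory-map their segment files, so the kernel's vm.max_map_count needs headroom over the number of files in the segment cache. The cache path and the safety factor below are made-up example values; use the druid.segmentCache.locations path from your Historical's runtime.properties.

```python
# Illustrative check: Druid historicals memory-map segment files, so
# vm.max_map_count should comfortably exceed the number of cached segment files.
# The cache path and HEADROOM factor are hypothetical example values.
from pathlib import Path

SEGMENT_CACHE = Path("/var/druid/segment-cache")   # hypothetical cache location
HEADROOM = 2.0                                     # arbitrary safety factor

def max_map_count() -> int:
    return int(Path("/proc/sys/vm/max_map_count").read_text())

def cached_segment_files() -> int:
    return sum(1 for p in SEGMENT_CACHE.rglob("*") if p.is_file())

if __name__ == "__main__":
    files, limit = cached_segment_files(), max_map_count()
    print(f"segment files: {files}, vm.max_map_count: {limit}")
    if files * HEADROOM > limit:
        print("Consider raising vm.max_map_count (e.g. via sysctl).")
```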