Scaling Cron at Slack by Claire Adams, Slack

ScyllaDB 302 views 20 slides Mar 07, 2025


About This Presentation

Scaling cron at Slack: From a single node to a hyperscale distributed system. Learn how Slack evolved their infrastructure to handle high-volume workloads while ensuring reliability and maintainability.

Slide Content

A ScyllaDB Community
Scaling Cron at Slack
Claire Adams
Senior Software Engineer

Claire Adams (she/her)
■Software Engineer @ Slack
■Worked on Infra/Async Compute
■Scale: ~10 billion executions a day
■Now working on Search AI
■Lives in New York

Presentation Topic
How we transformed Slack’s cron system from a single node into a
high-scale distributed system to meet increasing load demands
and improve reliability and maintainability

Presentation Agenda
■What is cron?
■What do crons do at Slack?
■How many crons are there (scale)?
■What was the original setup?
■What was the new setup?
■What about alternative setups?
■Impact + takeaways

What is cron?
■Command line utility to schedule scripts (jobs)
■Used to automate maintenance and administration tasks
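
To make the schedule format concrete, here is a minimal Python sketch of how a five-field crontab expression (minute, hour, day of month, month, day of week) selects the minutes at which a job runs. It is illustrative only. The helper names `expand_field` and `matches` are hypothetical, names such as `MON` and shortcuts such as `@daily` are not handled, and the day-of-month and day-of-week fields are combined with a plain AND, which is simpler than the rule real cron applies.

```python
from datetime import datetime

# Allowed value ranges for the five classic crontab fields:
# minute, hour, day of month, month, day of week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field, low, high):
    """Expand one field (e.g. '*/15', '1-5', '0,30') into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = low, high
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def matches(expression, moment):
    """Return True if `moment` falls on a minute selected by `expression`."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    minute, hour, day, month, weekday = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    # crontab counts Sunday as 0; datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    # Simplification (assumption): all five fields must match (plain AND).
    return (
        moment.minute in minute
        and moment.hour in hour
        and moment.day in day
        and moment.month in month
        and cron_weekday in weekday
    )
```

For example, `matches("*/15 * * * *", datetime(2025, 3, 7, 10, 30))` checks whether a job scheduled every 15 minutes is due at 10:30.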

What does cron do at Slack?
Responsible for critical Slack functionality:
■Slack reminders
■Email notifications
■Status cleanup
■Scheduled send
■Database cleanups
■Calculating analytics
■…etc

Scale
■Slack has about 39 million daily
active users
■~385 cron scripts
■About 2,209 executions an hour /
340,890 a week / ~20 million a year

Legacy architecture
■For ~10 years, there was one server with one
crontab
■Scripts were executed locally on the server
■Scaling == buy a bigger server
■Security patches == downtime
■~11 incidents (OOMing etc) in the year before
this rewrite!

Legacy architecture
[Diagram: Cron box]

New architecture
■New: High-scale distributed system for
scheduling jobs
■Goals: increase scalability, provide reliable
uptime, decrease maintenance burden
■Components:
■Existing job execution service
■Orchestrator scheduling service
■Database for cron run tracking


New architecture
[Diagram: Orchestrator, Job Queue, Database]

Orchestrator scheduling service
■Golang service on Kubernetes
■Already using Golang & Kubernetes for infra
■Used a Golang cron library so migration was easier
■Can keep crontab format
■Much easier if we don’t need to coordinate work across many teams
■Leader election with locking

Leader election with locking
■Don’t need all pods to be scheduling jobs: one pod can lead while the others
stay in standby mode, ready to take over quickly
■Synchronizing pods seemed more of a headache than a help
■Can offload the memory/CPU-intensive work to the execution service
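
The leader/standby idea can be sketched as a lease-based lock: whichever pod holds the lease schedules jobs, and a standby takes over once the lease expires without renewal. This is only a sketch under stated assumptions. The talk does not say which locking backend Slack uses, so `LeaseLock` is a hypothetical in-memory stand-in for a shared lock, `Pod` is a hypothetical model of an orchestrator replica, and Python is used here for brevity even though the orchestrator itself is a Golang service.

```python
class LeaseLock:
    """In-memory stand-in (assumption) for a shared lock with an expiring lease."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, pod_name, now):
        """Grant or renew the lease if it is free, expired, or already ours."""
        lease_free = self.holder is None or now >= self.expires_at
        if lease_free or self.holder == pod_name:
            self.holder = pod_name
            self.expires_at = now + self.ttl_seconds
            return True
        return False


class Pod:
    """Hypothetical orchestrator replica: schedules only while it leads."""

    def __init__(self, name, lock):
        self.name = name
        self.lock = lock
        self.scheduled = []

    def tick(self, now, due_jobs):
        """Schedule due jobs if this pod holds the lease; otherwise stand by."""
        if not self.lock.try_acquire(self.name, now):
            return False  # standby: another pod is the leader
        self.scheduled.extend(due_jobs)
        return True
```

Passing `now` in explicitly keeps the sketch deterministic. In a real deployment the lease would live in a store shared by all pods, and the leader would renew it well before the TTL runs out.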

[Diagram: Job Queue; pods: Leader, Standby, Standby, Standby]

Existing job execution service
■Slack’s Job Queue:
■async compute system; multiple queues
■processes ~10 billion jobs a day
■reliable, scalable; at-least-once guarantees
■One script == one job
■Isolate to own queue for speedy execution
■No added maintenance/on-call burden
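
As a rough illustration of "one script == one job" on a dedicated queue with at-least-once delivery, the sketch below keeps a job in its queue until a handler finishes it successfully, so a failed attempt is delivered again. Everything here is an assumption made for illustration: `JobQueue`, `enqueue`, `process_next`, and the queue name `"cron"` are hypothetical and say nothing about how Slack's Job Queue is actually built.

```python
from collections import deque


class JobQueue:
    """Toy multi-queue system; a job leaves its queue only after it succeeds."""

    def __init__(self):
        self.queues = {}

    def enqueue(self, queue_name, script):
        """Add one job (one cron script) to the named queue."""
        self.queues.setdefault(queue_name, deque()).append(script)

    def process_next(self, queue_name, handler):
        """Run the oldest job; keep it queued for retry if the handler fails."""
        queue = self.queues.get(queue_name)
        if not queue:
            return False  # nothing to do
        script = queue[0]
        try:
            handler(script)
        except Exception:
            return False  # job stays queued, so it will be delivered again
        queue.popleft()
        return True
```

Because a job can be delivered more than once, scripts run this way need to tolerate a repeated execution.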

[Diagram: Job Queue, Job, Worker]

Database for job tracking
■New system: use a database to
track runs
■Check for running status before running
again, since some scripts run longer
than the interval between their runs
■Useful for reporting on job state +
investigating any errors
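
The "check before running again" rule can be sketched with a small tracking table. The example uses an in-memory SQLite database purely for illustration. The talk does not name the schema or the database, so the table `cron_runs`, its columns, and the helper functions are all hypothetical.

```python
import sqlite3


def open_tracking_db():
    """Create an in-memory table (hypothetical schema) that records cron runs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cron_runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " script TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def try_start_run(conn, script):
    """Record a new run unless one is still running; return its id or None."""
    row = conn.execute(
        "SELECT 1 FROM cron_runs WHERE script = ? AND status = 'running'",
        (script,),
    ).fetchone()
    if row is not None:
        return None  # previous run has not finished, so skip this tick
    # Check-then-insert is not atomic; this sketch assumes a single scheduler.
    cursor = conn.execute(
        "INSERT INTO cron_runs (script, status) VALUES (?, 'running')",
        (script,),
    )
    conn.commit()
    return cursor.lastrowid


def finish_run(conn, run_id, status):
    """Mark a run as finished, e.g. 'succeeded' or 'failed'."""
    conn.execute("UPDATE cron_runs SET status = ? WHERE id = ?", (status, run_id))
    conn.commit()


def run_history(conn, script):
    """Return the statuses of all runs of a script, oldest first."""
    rows = conn.execute(
        "SELECT status FROM cron_runs WHERE script = ? ORDER BY id", (script,)
    )
    return [row[0] for row in rows]
```

The same table doubles as the reporting source: `run_history` shows at a glance which runs succeeded, failed, or are still in flight.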

In summary
[Diagram: Cron box transformed into Orchestrator, Job Queue, Database]

Why not use Kubernetes CronJobs or other alternatives?
■Because of our scale + maintenance burden
■Expensive to spin pods up and down for the scale
■~54,000 pods a day!
■Difficult to debug
■Jobs need to be idempotent, which is difficult for really quick jobs
■Would need to invest in better tooling & ongoing
maintenance
■Need to consider migration effort
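
One plausible reading of the pod figure is simple arithmetic on the hourly execution count from the Scale slide, assuming each execution starts exactly one pod. That assumption is not stated in the talk, and the function name below is hypothetical.

```python
EXECUTIONS_PER_HOUR = 2_209  # figure quoted on the Scale slide


def pods_per_day(executions_per_hour, pods_per_execution=1):
    """Estimate daily pod starts, assuming one pod per execution by default."""
    return executions_per_hour * 24 * pods_per_execution


estimate = pods_per_day(EXECUTIONS_PER_HOUR)
```

Under that assumption the estimate comes to roughly 53,000 pod starts a day, the same order of magnitude as the figure on the slide.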


How are things now?
■Completed migration a little over 1 year ago
■About 6 million cron jobs have run
successfully
■Decreased on-call burden
■Issues with first DB pick; no incidents since
switching!

Takeaways
■Use what you have: Job Queue,
Kubernetes, Golang
■Decrease maintenance burden while
getting huge scale wins
■Keep it simple: Slack managed
key functionality with cron scripts
running on one box for almost 10
years

Stay in Touch
Claire Adams
[email protected]
https://github.com/claire-1
https://www.linkedin.com/in/clairebadams/
