Scaling Cron at Slack by Claire Adams, Slack

ScyllaDB 302 views 20 slides Mar 07, 2025

About This Presentation

Scaling cron at Slack: From a single node to a hyperscale distributed system. Learn how Slack evolved their infrastructure to handle high-volume workloads while ensuring reliability and maintainability.


Slide Content

A ScyllaDB Community
Scaling Cron at Slack
Claire Adams
Senior Software Engineer

Claire Adams (she/her)
■Software Engineer @ Slack
■Worked on Infra/Async Compute
■Scale: ~10 billion executions a day
■Now working on Search AI
■Lives in New York

Presentation Topic
How we transformed Slack’s cron system from a single node into a
high-scale distributed system to meet increasing load demands
and improve reliability and maintainability

Presentation Agenda
■What is cron?
■What do crons do at Slack?
■How many crons are there (scale)?
■What was the original setup?
■What was the new setup?
■Other alternative setups?
■Impact + takeaways

What is cron?
■Command line utility to schedule scripts (jobs)
■Used to automate maintenance and administration tasks

What does cron do at Slack?
Responsible for critical Slack functionality:
■Slack reminders
■Email notifications
■Status cleanup
■Scheduled send
■Database cleanups
■Calculating analytics
■…etc

Scale
■Slack has about 39 million daily active users
■~385 cron scripts
■About 2,209 executions an hour / 340,890 a week / ~20 million a year

Legacy architecture
■For ~10 years, there was one server with one crontab
■Scripts were executed locally on the server
■Scaling == buy a bigger server
■Security patches == downtime
■~11 incidents (OOMing etc.) in the year before this rewrite!

Legacy architecture
[Diagram: a single cron box]

New architecture
■New: High-scale distributed system for
scheduling jobs
■Goals: increase scalability, provide reliable
uptime, decrease maintenance burden
■Components:
■Existing job execution service
■Orchestrator scheduling service
■Database for cron run tracking



New architecture
[Diagram: Orchestrator, Job Queue, Database]

Orchestrator scheduling service
■Golang service on Kubernetes
■Already using Golang & Kubernetes for infra
■Use Golang library for cron so migration was easier
■Can keep crontab format
■Much easier if we don’t need to coordinate work across many teams
■Leader election with locking

Leader election with locking
■Don’t need all pods to be scheduling jobs: one pod can lead while the others
stay in standby mode, ready to take over quickly
■Synchronizing pods seemed more of a headache than a help
■Can offload the memory/CPU-intensive work to the execution service


[Diagram: Job Queue, one Leader pod, and three Standby pods]

Existing job execution service
■Slack’s Job Queue:
■async compute system; multiple queues
■processes ~10 billion jobs a day
■reliable, scalable; at-least-once guarantees
■One script == one job
■Isolate to own queue for speedy execution
■No added maintenance/on-call burden


[Diagram: Job Queue, Job, Worker]

Database for job tracking
■New system: use a database to track runs
■Check for running status before running again, since some scripts run longer
than the interval between their scheduled runs
■Useful for reporting on job state + investigating any errors

In summary
[Diagram: Transformation from the Cron box to Orchestrator, Job Queue, and Database]

Why not use kube cron jobs or other alternatives?
■Because of our scale + maintenance burden
■Expensive to spin pods up and down for the scale
■~54,000 pods a day!
■Difficult to debug
■Jobs would need to be idempotent, which is difficult for really quick jobs
■Would need to invest in better tooling & ongoing maintenance
■Need to consider migration effort



How are things now?
■Completed migration a little over 1 year ago
■About 6 million cron jobs have run successfully
■Decreased on-call burden
■Issues with first DB pick; no incidents since switching!

Takeaways
■Use what you have: Job Queue, Kubernetes, Golang
■Decrease maintenance burden while getting huge scale wins
■Keep it simple: Slack managed key functionality with cron scripts
running on one box for almost 10 years

Stay in Touch
Claire Adams
[email protected]
https://github.com/claire-1
https://www.linkedin.com/in/clairebadams/