Scaling cron at Slack: From a single node to a hyperscale distributed system. Learn how Slack evolved their infrastructure to handle high-volume workloads while ensuring reliability and maintainability.
Added: Mar 07, 2025
Slides: 20 pages
Slide Content
A ScyllaDB Community
Scaling Cron at Slack
Claire Adams
Senior Software Engineer
Claire Adams (she/her)
■Software Engineer @ Slack
■Worked on Infra/Async Compute
■Scale: ~10 billion executions a day
■Now working on Search AI
■Lives in New York
Presentation Topic
How we transformed Slack’s cron system from a single node into a
high-scale distributed system to meet increasing load demands
and improve reliability and maintainability
Presentation Agenda
■What is cron?
■What do crons do at Slack?
■How many crons are there (scale)?
■What was the original setup?
■What was the new setup?
■Other alternative setups?
■Impact + takeaways
What is cron?
■Command line utility to schedule scripts (jobs)
■Used to automate maintenance and administration tasks
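As an illustration of how a crontab schedule selects run times, here is a minimal Python sketch of a five-field matcher (minute, hour, day of month, month, day of week). The function names `expand_field` and `matches` are made up for this example, and it is a simplification: it supports only `*`, lists, ranges and `/step`, and it requires every field to match, whereas real cron treats day-of-month and day-of-week specially when both are restricted.

```python
from datetime import datetime

# Allowed value range for each of the five crontab fields, in order:
# minute, hour, day of month, month, day of week (0 = Sunday).
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(field: str, lo: int, hi: int) -> set:
    """Expand one field such as '*', '*/15', '1-5' or '0,30' into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def matches(expression: str, when: datetime) -> bool:
    """Return True if `when` falls on a minute selected by the expression."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    minute, hour, day, month, weekday = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    # crontab counts Sunday as 0; isoweekday() counts Monday=1 .. Sunday=7.
    cron_weekday = when.isoweekday() % 7
    return (
        when.minute in minute
        and when.hour in hour
        and when.day in day
        and when.month in month
        and cron_weekday in weekday
    )
```

For example, `matches("*/15 * * * *", datetime(2025, 3, 7, 10, 30))` is true, while the same expression does not match 10:31.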
What does cron do at Slack?
Responsible for critical Slack functionality:
■Slack reminders
■Email notifications
■Status cleanup
■Scheduled send
■Database cleanups
■Calculating analytics
■…etc
Scale
■Slack has about 39 million daily
active users
■~385 cron scripts
■About 2,209 executions an hour /
340,890 a week / ~20 million a year
Legacy architecture
■For ~10 years, there was one server with one
crontab
■Scripts were executed locally on the server
■Scaling == buy a bigger server
■Security patches == downtime
■~11 incidents (OOMing etc) in the year before
this rewrite!
Legacy architecture (diagram: Cron box)
New architecture
■New: High-scale distributed system for
scheduling jobs
■Goals: increase scalability, provide reliable
uptime, decrease maintenance burden
■Components:
■Existing job execution service
■Orchestrator scheduling service
■Database for cron run tracking
New architecture (diagram: Orchestrator, Job Queue, Database)
Orchestrator scheduling service
■Golang service on Kubernetes
■Already using Golang & Kubernetes for infra
■Use Golang library for cron so migration was easier
■Can keep crontab format
■Much easier if we don’t need to coordinate work across many teams
■Leader election with locking
Orchestrator
Leader election with locking
■Don’t need all pods to be scheduling jobs - one pod can lead while the
others stay in standby mode, ready to take over quickly
■Synchronizing pods seemed more of a headache than a help
■Can offload the memory/CPU intensive work to the execution service
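The leader/standby idea can be sketched with a lock that expires. The Python below is only an illustration under assumptions: the talk describes a Golang service, and `LeaseLock`, `tick` and the time-to-live value are hypothetical names and parameters, with an in-memory object standing in for whatever shared lock the real pods use.

```python
import threading


class LeaseLock:
    """In-memory stand-in for a shared lock with an expiry (hypothetical).

    A real deployment would keep this state in a store every pod can reach.
    """

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self.holder = None
        self.expires_at = 0.0
        self._mutex = threading.Lock()

    def try_acquire(self, pod: str, now: float) -> bool:
        """Take or renew the lease if it is free, expired, or already ours."""
        with self._mutex:
            if self.holder in (None, pod) or now >= self.expires_at:
                self.holder = pod
                self.expires_at = now + self.ttl_seconds
                return True
            return False


def tick(lock: LeaseLock, pod: str, now: float) -> str:
    """One loop iteration for a pod: only the lease holder schedules jobs."""
    return "leader" if lock.try_acquire(pod, now) else "standby"
```

With a 10-second lease, a pod that keeps calling `tick` stays leader by renewing; if it stops, a standby pod becomes leader on its first `tick` after the lease expires.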
Existing job execution service
■Slack’s Job Queue:
■async compute system; multiple queues
■processes ~10 billion jobs a day
■reliable, scalable; at least once guarantees
■One script == one job
■Isolate to own queue for speedy execution
■No added maintenance/on-call burden
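To make "one script == one job", per-script queues and at-least-once delivery concrete, here is a toy Python sketch. It is not Slack's Job Queue API: `ScriptQueue`, `enqueue_script` and `max_attempts` are hypothetical, and an in-memory deque stands in for a real queue.

```python
from collections import deque


class ScriptQueue:
    """Toy per-script queue with at-least-once delivery (hypothetical API)."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, job: dict) -> None:
        self._pending.append(job)

    def work(self, handler, max_attempts: int = 3) -> list:
        """Run jobs until the queue drains; a job that raises is retried."""
        completed = []
        while self._pending:
            job = self._pending.popleft()
            job["attempts"] = job.get("attempts", 0) + 1
            try:
                handler(job)
            except Exception:
                if job["attempts"] < max_attempts:
                    # Redeliver: the handler may see this job again.
                    self._pending.append(job)
                continue
            completed.append(job["script"])
        return completed


# One queue per script, so a slow script cannot delay the others.
queues = {}


def enqueue_script(script: str) -> None:
    """Wrap a single cron script run as one job on that script's own queue."""
    queues.setdefault(script, ScriptQueue()).enqueue({"script": script})
```

Because a failed job is delivered again, a handler can run more than once for the same job, so scripts executed this way need to tolerate repeats.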
Job Queue (diagram: Job, Worker, Database)
Database for job tracking
■New system: use a database to
track runs
■Check for running status before running
again since some scripts run longer
than their frequency
■Useful for reporting on job state +
investigating any errors
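The "check for running status before running again" step might look like the following Python sketch. The talk does not name the database or its schema, so the `cron_runs` table, its columns and the helper names are assumptions, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def open_tracker() -> sqlite3.Connection:
    """Create an in-memory table of cron runs (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cron_runs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " script TEXT NOT NULL,"
        " status TEXT NOT NULL,"  # 'running', 'succeeded' or 'failed'
        " started_at TEXT NOT NULL)"
    )
    return conn


def try_start(conn: sqlite3.Connection, script: str, now: str):
    """Record a new run unless one is still running; return its id or None."""
    running = conn.execute(
        "SELECT 1 FROM cron_runs WHERE script = ? AND status = 'running'",
        (script,),
    ).fetchone()
    if running:
        return None  # previous run has not finished; skip this tick
    cur = conn.execute(
        "INSERT INTO cron_runs (script, status, started_at)"
        " VALUES (?, 'running', ?)",
        (script, now),
    )
    conn.commit()
    return cur.lastrowid


def finish(conn: sqlite3.Connection, run_id: int, ok: bool) -> None:
    """Mark a run as finished so the next scheduled tick may start."""
    conn.execute(
        "UPDATE cron_runs SET status = ? WHERE id = ?",
        ("succeeded" if ok else "failed", run_id),
    )
    conn.commit()
```

The check followed by the insert is not atomic, which is acceptable in this sketch only because a single leader is assumed to be the one scheduling. The same table can also be queried for reporting on job state.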
In summary (diagram: Cron box, transformed into Orchestrator, Job Queue, Database)
Why not use kube cron jobs or other alternatives?
■Because of our scale + maintenance burden
■Expensive to spin pods up and down for the scale
■~54,000 pods a day!
■Difficult to debug
■Jobs would need to be idempotent, which is difficult for really quick jobs
■Would need to invest in better tooling & ongoing
maintenance
■Need to consider migration effort
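As a rough check on the pod count, the hourly execution rate from the Scale slide can be multiplied out to a day. This assumes one pod per execution, which the slides do not state explicitly.

```python
EXECUTIONS_PER_HOUR = 2_209  # figure from the Scale slide

# Assumption: each cron execution would start its own pod.
pods_per_day = EXECUTIONS_PER_HOUR * 24
print(pods_per_day)  # 53016
```

That lands in the same range as the ~54,000 pods a day quoted above.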
How are things now?
■Completed migration a little over 1 year ago
■About 6 million cron jobs have run
successfully
■Decreased on-call burden
■Had issues with the first database choice; no incidents since
switching!
Takeaways
■Use what you have: job queue,
Kubernetes, Golang
■decrease maintenance burden while
getting huge scale wins
■Keep it simple: Slack managed
key functionality with cron scripts
running on one box for almost 10
years
Stay in Touch
Claire Adams [email protected]
https://github.com/claire-1
https://www.linkedin.com/in/clairebadams/