How Discord Performs Database Upgrades at Scale by Ethan Donowitz
ScyllaDB
Mar 04, 2025
About This Presentation
Discord relies on ScyllaDB to serve millions of reads per second across many clusters, so they needed a comprehensive strategy to de-risk upgrades and avoid impact to their users. To accomplish this, they use what they call “shadow clusters.” This talk explains how testing with shadow clusters has been paramount to de-risking complicated upgrades for one of the most important pieces of infrastructure at Discord.
Slide Content
A ScyllaDB Community
How Discord Performs Database Upgrades at Scale
Ethan Donowitz
Senior Software Engineer
Ethan Donowitz (they/them)
■Senior Software Engineer at Discord on the
Persistence Infrastructure team
■Our team owns the persistent data stores in use
across Discord
■I live in NYC
■My favorite color is purple
Presentation Agenda
■Why database upgrades can be scary
■How Discord uses shadow clusters to make them less scary
■Using data services to mirror production traffic and perform read validations
■What we monitor when testing a new version
■How we can make this process less toilsome
Scenario
■A new version of your distributed database is released with the latest
and greatest features, promising to improve query performance by an
order of magnitude
■Your database cluster gets hundreds of thousands of queries per
second and contains hundreds of terabytes of data
■Your database cluster is used in many critical flows in your application
■How do you verify that it is safe to upgrade to the new version?
Potential Risks
■Changes to data formats or query semantics can result in data
corruption
■A node may not be able to boot after being upgraded
■Changes in performance may affect upstream systems and users
■Coordinating a complicated upgrade across all the nodes in a large
cluster can be toilsome and leaves a lot of room for operator error
How can we de-risk the upgrade? Ideas…
■Maybe we can test in a staging environment? But staging doesn’t
simulate the load in a production environment…
■What if we wrote a load testing framework? It is difficult to accurately
simulate the query patterns that occur in production, and tailoring such a
framework to the schemas across all our clusters would be
time-consuming
■Okay, then maybe we can capture production traffic and replay it? Closer,
but this still requires substantial work to determine how to capture the
traffic, where to store it, and how to replay it
Shadow Clusters!
What is a Shadow Cluster?
■A shadow cluster is a mirror of a production cluster that has the same
data (mostly) and receives the same reads and writes
■Created from disk snapshots of nodes in its corresponding production cluster
■Production traffic (both reads and writes) is mirrored to the shadow
cluster via our data services
■In general, for a given upgrade, we create a shadow cluster for each of
our production clusters and run tests before performing the upgrade in
production
Setting up a Shadow Cluster
■Provision new infrastructure (compute instances + disks) using Terraform
■The disks are created using disk snapshots from production; this ensures the cluster has all of the production data
■Use configuration management tools to set up the new instances (e.g.
install ScyllaDB, apply configuration, etc.)
■Wait for the nodes to come online and finish compactions
Data Services and Read Validations
What is a Data Service?
■Rust gRPC service that sits between our Python API monolith and our
databases
■Generally, an RPC is defined for each query
■Perform query coalescing (if we receive multiple concurrent queries for a single hot key, we only issue the query once; see the sketch below)
■Can implement custom logic in the RPCs that would be expensive to
implement in Python (e.g. converting between data formats)
■Serve as a protective layer to our database clusters
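
To make the coalescing idea concrete, here is a minimal Rust/tokio sketch of a singleflight-style map of in-flight fetches. This is not Discord's actual data service code; Coalescer and fetch_from_database are hypothetical names, and the real RPCs wrap ScyllaDB queries rather than a stubbed helper.

    use std::collections::HashMap;
    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::{broadcast, Mutex};

    // Map of in-flight fetches: concurrent callers for the same key share one result.
    struct Coalescer {
        in_flight: Mutex<HashMap<u64, broadcast::Sender<String>>>,
    }

    impl Coalescer {
        fn new() -> Self {
            Self { in_flight: Mutex::new(HashMap::new()) }
        }

        // Issue the underlying query at most once per key among concurrent callers.
        async fn get(self: Arc<Self>, key: u64) -> String {
            let mut rx = {
                let mut map = self.in_flight.lock().await;
                match map.get(&key) {
                    // Another caller is already fetching this key: wait for its result.
                    Some(tx) => tx.subscribe(),
                    None => {
                        let (tx, rx) = broadcast::channel(1);
                        map.insert(key, tx.clone());
                        let this = Arc::clone(&self);
                        tokio::spawn(async move {
                            let value = fetch_from_database(key).await;
                            // Remove the entry before publishing so later callers start fresh.
                            this.in_flight.lock().await.remove(&key);
                            let _ = tx.send(value);
                        });
                        rx
                    }
                }
            };
            rx.recv().await.expect("coalesced fetch failed")
        }
    }

    // Stand-in for the real ScyllaDB query behind this RPC (hypothetical helper).
    async fn fetch_from_database(key: u64) -> String {
        tokio::time::sleep(Duration::from_millis(10)).await;
        format!("row for key {key}")
    }

    #[tokio::main]
    async fn main() {
        let coalescer = Arc::new(Coalescer::new());
        // Two concurrent reads for the same hot key issue a single database query.
        let (a, b) = tokio::join!(coalescer.clone().get(42), coalescer.clone().get(42));
        println!("{a} / {b}");
    }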
What is a Data Service?
[Diagram: the API calls several data services, each of which fronts its own ScyllaDB cluster]
Data Services: Mirroring and Validation
■Data services can mirror incoming queries to a shadow cluster
■They can also perform read validations by comparing results from the
production and shadow clusters
■At a high level:
■We issue the query to the production cluster and wait for the results
■We spawn an asynchronous task that issues the same query to the shadow
cluster, waits for the results, and compares the results to the production results (a
metric is emitted that counts successful and unsuccessful read validations)
■We return the results to the requesting client and the task runs in the background
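
A minimal sketch of that read path, assuming hypothetical query_production and query_shadow helpers and plain atomic counters in place of the real metrics pipeline; the key point is that the client only ever waits on the production query.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::time::Duration;

    // Hypothetical counters standing in for the real metrics pipeline.
    #[derive(Default)]
    struct ValidationMetrics {
        matched: AtomicU64,
        mismatched: AtomicU64,
    }

    // Stand-ins for the real ScyllaDB queries against each cluster (hypothetical helpers).
    async fn query_production(user_id: u64) -> Vec<String> {
        vec![format!("row for {user_id}")]
    }
    async fn query_shadow(user_id: u64) -> Vec<String> {
        vec![format!("row for {user_id}")]
    }

    // Serve the read from the production cluster, then validate against the shadow
    // cluster in a background task so the client never waits on the shadow cluster.
    async fn get_rows(user_id: u64, metrics: Arc<ValidationMetrics>) -> Vec<String> {
        let production_rows = query_production(user_id).await;

        let expected = production_rows.clone();
        tokio::spawn(async move {
            let shadow_rows = query_shadow(user_id).await;
            if shadow_rows == expected {
                metrics.matched.fetch_add(1, Ordering::Relaxed);
            } else {
                metrics.mismatched.fetch_add(1, Ordering::Relaxed);
            }
        });

        production_rows
    }

    #[tokio::main]
    async fn main() {
        let metrics = Arc::new(ValidationMetrics::default());
        let rows = get_rows(42, metrics.clone()).await;
        println!("returned {} rows to the client", rows.len());
        // Give the background validation a moment to finish (demo only).
        tokio::time::sleep(Duration::from_millis(50)).await;
        println!("matched validations: {}", metrics.matched.load(Ordering::Relaxed));
    }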
Data Services and Shadow Clusters
[Diagram: the API calls a data service, which serves reads and writes from the production ScyllaDB cluster and mirrors the same traffic to the shadow cluster]
Read Validations
■Data services issue the same query to both the production cluster and
the shadow cluster and compare the results
■Rust and tokio make this very easy: database queries are I/O-bound, and tokio is designed to handle a high number of concurrent I/O-bound tasks
■If the results don’t match, a metric is emitted
■Generally, this gives us peace of mind that we don’t experience irreversible data corruption during and after the upgrade
Setting Up Read Validations
■Because shadow clusters are created from production disk snapshots
(usually from the prior day), the cluster is missing about one day’s worth
of writes
■This gap in data would produce false positive read validation
mismatches
■e.g. If some entity A exists in the disk snapshot and is mutated between the
creation of the disk snapshot and when we start to mirror writes to the shadow
cluster, the shadow cluster will be missing the mutation that is present in the
production cluster. As a result, when we perform a read validation on that row, it
will fail
How We Solve This: Snowflakes
■Entity IDs at Discord are all snowflakes
■A snowflake is a unique, 64-bit integer that contains an embedded
timestamp that reflects the time the snowflake was generated, meaning
snowflakes monotonically increase over time
■This gives them the property that if entity A is older than entity B, A’s ID
will be less than B’s ID
■We use a service to generate these IDs to ensure the IDs are unique
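
For concreteness, a small Rust sketch of the property being relied on here, based on Discord's publicly documented snowflake layout (the high 42 bits hold a millisecond timestamp relative to the 2015-01-01 Discord epoch); the example ID comes from Discord's API documentation.

    // Discord's publicly documented snowflake layout: the high 42 bits hold a
    // millisecond timestamp relative to the Discord epoch (2015-01-01T00:00:00Z).
    const DISCORD_EPOCH_MS: u64 = 1_420_070_400_000;

    // Milliseconds since the Unix epoch at which this snowflake was generated.
    fn snowflake_timestamp_ms(id: u64) -> u64 {
        (id >> 22) + DISCORD_EPOCH_MS
    }

    fn main() {
        // Example ID from Discord's API documentation.
        let id: u64 = 175_928_847_299_117_063;
        println!("generated at {} ms after the Unix epoch", snowflake_timestamp_ms(id));
        // Because the timestamp sits in the most significant bits, a newer entity
        // always has a numerically larger ID than an older one.
    }

Since the timestamp occupies the most significant bits, comparing two snowflakes effectively compares creation times, which is exactly what the cutoff check on the next slide relies on.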
Read Validations and Snowflakes
■Once we’ve got a shadow cluster stood up and receiving mirrored writes
from production, we generate a new snowflake and set it as the
“cutoff_id” in the data service
■When the data service receives a read, it only performs a read validation
if the ID in the query is greater than the configured “cutoff_id”
■This ensures that we are only performing read validations against
records that were created after the shadow cluster began receiving
mirrored production writes
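
A minimal sketch of that gating decision, assuming a hypothetical ValidationConfig holding the cutoff snowflake and a validation sample rate (the “percentage of reads that should be validated” mentioned in the setup recap below); the real data service configuration will differ.

    // Hypothetical per-cluster validation settings (not Discord's real config schema).
    struct ValidationConfig {
        // Snowflake generated once write mirroring reached 100%.
        cutoff_id: u64,
        // Fraction of eligible reads to validate, in the range 0.0..=1.0.
        sample_rate: f64,
    }

    // Decide whether this read should be validated against the shadow cluster.
    // `sample` is a uniform value in [0, 1) supplied by the caller (e.g. from a PRNG).
    fn should_validate(cfg: &ValidationConfig, queried_id: u64, sample: f64) -> bool {
        // Only rows created after mirroring started are guaranteed to exist on the
        // shadow cluster, so anything at or below the cutoff is skipped entirely.
        queried_id > cfg.cutoff_id && sample < cfg.sample_rate
    }

    fn main() {
        let cfg = ValidationConfig { cutoff_id: 1_200_000_000_000_000_000, sample_rate: 0.25 };
        println!("{}", should_validate(&cfg, 1_200_000_000_000_000_001, 0.1)); // true
        println!("{}", should_validate(&cfg, 900_000_000_000_000_000, 0.1));   // false: predates cutoff
    }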
Bringing It All Together
■At a high level, the shadow cluster setup process looks something like this:
■Provision and configure the infrastructure for the shadow cluster using disk snapshots from the production cluster
■Slowly start mirroring reads and writes to the shadow cluster, keeping an eye on the cluster’s health
■Once 100% of writes are being mirrored to the shadow cluster, generate a new snowflake via the snowflake service, and configure this as the “cutoff_id” in the data service
■Configure the percentage of reads that should be validated (usually anywhere from 1% to 100%)
Performing the Upgrade
■Once we are mirroring traffic to the shadow cluster and performing
read validations, we can perform the actual upgrade
■The upgrade is performed on a rolling basis, one node at a time
■Our clusters run with a replication factor of 3 and consistency level
“QUORUM”, so one node going down does not result in downtime
■At a high level, upgrading a node looks like:
■Swapping in the binary for the new version
■Updating any configuration that needs to be changed across the versions
■Draining and restarting the node
Performing the Upgrade
■The upgrade process is automated via a combination of scripts
and configuration management tools
■We monitor the shadow cluster closely while the upgrade takes
place to ensure the cluster remains healthy
■It is a useful exercise to perform the upgrade exactly as it would
be performed in production to keep the testing process as realistic
as possible
■It is also useful as an operator to practice performing the upgrade
in a production-like environment as preparation for the real deal
After the Upgrade: Monitoring Query Performance
■Our data services have per-query latency metrics
■We compare the per-query latencies between the production and
shadow clusters to look for regressions or improvements
■When evaluating a regression in latency, we look at the context:
■Where in our application is the query used? Some endpoints are more
latency-sensitive than others
■What is the absolute latency of the query? (A 50% regression in a query
with sub-millisecond latency is negligible)
■Do we just need to wait for the cache to warm?
■Performance can often be improved by tuning settings
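
As a rough illustration of how the relative and absolute questions combine, here is a small Rust sketch; the 10% and 1 ms thresholds are illustrative, not Discord's actual values.

    // Hypothetical significance check mirroring the two questions above: how big is
    // the relative regression, and does the absolute difference actually matter?
    fn latency_regression_is_significant(prod_ms: f64, shadow_ms: f64) -> bool {
        let relative = (shadow_ms - prod_ms) / prod_ms; // e.g. 0.5 == 50% slower
        let absolute_ms = shadow_ms - prod_ms;
        // Thresholds are illustrative only.
        relative > 0.10 && absolute_ms > 1.0
    }

    fn main() {
        // A 50% regression on a 0.4 ms query: relatively large, absolutely negligible.
        println!("{}", latency_regression_is_significant(0.4, 0.6)); // false
        // A 30% regression on a 20 ms query: worth investigating.
        println!("{}", latency_regression_is_significant(20.0, 26.0)); // true
    }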
After the Upgrade: Monitoring Read Validations
■We closely watch the metric that counts read validation
mismatches
■There are occasionally false positives due to differences in the way reads and writes are interleaved and applied on the production and shadow clusters
■To account for this, read validations are retried a fixed number of times to check for convergence (sketched below); if the results never converge, that suggests a problem
■To troubleshoot read validation mismatches, we can inspect the
mismatched data to see what’s different
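
A minimal sketch of the retry-for-convergence idea, with hypothetical read_production and read_shadow helpers and an illustrative attempt count and backoff; the real retry policy is not described in the slides.

    use std::time::Duration;

    // Stand-ins for the queries the RPC already issues against each cluster.
    async fn read_production(key: u64) -> Vec<String> {
        vec![format!("row {key}")]
    }
    async fn read_shadow(key: u64) -> Vec<String> {
        vec![format!("row {key}")]
    }

    // Retry the comparison a fixed number of times before counting a mismatch, to
    // filter out transient differences caused by writes being applied to the two
    // clusters in a different order.
    async fn validate_with_retries(key: u64, attempts: u32, backoff: Duration) -> bool {
        for attempt in 0..attempts {
            if read_production(key).await == read_shadow(key).await {
                return true; // results converged
            }
            if attempt + 1 < attempts {
                tokio::time::sleep(backoff).await;
            }
        }
        false // never converged: inspect the mismatched rows
    }

    #[tokio::main]
    async fn main() {
        let converged = validate_with_retries(42, 3, Duration::from_millis(100)).await;
        println!("validation converged: {converged}");
    }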
After the Upgrade: General Testing
■We also perform a number of general tests on the shadow cluster
after performing the upgrade:
■Knock over a node to simulate a host failure
■Perform a repair (a process that synchronizes data across the nodes in a
ScyllaDB cluster)
■Perform a schema migration
■Go through our metrics dashboards to ensure that metrics are not broken
■Perform a backup
Wrapping Up
■Shadow cluster testing has been an invaluable tool whenever we
need to test a risky production operation
■Shadow clusters give us the opportunity to practice running through complex operations before performing them in production
■Our data services serve as a convenient place for us to implement
traffic mirroring and read validations
Looking Forward: ScyllaDB Control Plane
■Currently, shadow cluster testing is a very manual process
■We have scripts to help us, but they tend to be brittle
■We are currently investing time in building a control plane to
manage our ScyllaDB infrastructure
■When finished, it will allow us to automate operations like:
■The entire shadow cluster creation process (short of provisioning the
infrastructure)
■Updating settings across the nodes in a cluster
■Installing new versions
■Restarting nodes