How Discord Performs Database Upgrades at Scale by Ethan Donowitz
ScyllaDB
Mar 04, 2025
About This Presentation
Discord relies on ScyllaDB to serve millions of reads per second across many clusters, so they needed a comprehensive strategy to de-risk upgrades and avoid impact to their users. To accomplish this, they use what they call “shadow clusters.” This talk explains how testing with shadow clusters has been paramount to de-risking complicated upgrades for one of the most important pieces of infrastructure at Discord.
Slide Content
A ScyllaDB Community
How Discord Performs Database Upgrades at Scale
Ethan Donowitz
Senior Software Engineer
Ethan Donowitz (they/them)
■Senior Software Engineer at Discord on the
Persistence Infrastructure team
■Our team owns the persistent data stores in use
across Discord
■I live in NYC
■My favorite color is purple
Presentation Agenda
■Why database upgrades can be scary
■How Discord uses shadow clusters to make them less scary
■Using data services to mirror production traffic and perform read validations
■What we monitor when testing a new version
■How we can make this process less toilsome
Scenario
■A new version of your distributed database is released with the latest
and greatest features, promising to improve query performance by an
order of magnitude
■Your database cluster gets hundreds of thousands of queries per
second and contains hundreds of terabytes of data
■Your database cluster is used in many critical flows in your application
■How do you verify that it is safe to upgrade to the new version?
Potential Risks
■Changes to data formats or query semantics can result in data
corruption
■A node may not be able to boot after being upgraded
■Changes in performance may affect upstream systems and users
■Coordinating a complicated upgrade across all the nodes in a large
cluster can be toilsome and leaves a lot of room for operator error
How can we de-risk the upgrade? Ideas…
■Maybe we can test in a staging environment? But staging doesn’t
simulate the load in a production environment…
■What if we wrote a load testing framework? It is difficult to accurately
simulate the query patterns that occur in production, and tailoring such a
framework to the schemas across all our clusters would be
time-consuming
■Okay, then maybe we can capture production traffic and replay it? Closer,
but this still requires substantial work to determine how to capture the
traffic, where to store it, and how to replay it
Shadow Clusters!
What is a Shadow Cluster?
■A shadow cluster is a mirror of a production cluster that has the same
data (mostly) and receives the same reads and writes
■Created from disk snapshots of nodes in its corresponding production cluster
■Production traffic (both reads and writes) is mirrored to the shadow
cluster via our data services
■In general, for a given upgrade, we create a shadow cluster for each of
our production clusters and run tests before performing the upgrade in
production
Setting up a Shadow Cluster
■Provision new infrastructure (compute instances + disks) using Terraform
■The disks are created using disk snapshots from production; this ensures the cluster has all of the production data
■Use configuration management tools to set up the new instances (e.g.
install ScyllaDB, apply configuration, etc.)
■Wait for the nodes to come online and finish compactions
Data Services and Read Validations
What is a Data Service?
■Rust gRPC service that sits between our Python API monolith and our
databases
■Generally, an RPC is defined for each query
■Perform query coalescing (if we receive multiple concurrent queries for a single hot key, we only issue the query once; see the sketch below)
■Can implement custom logic in the RPCs that would be expensive to
implement in Python (e.g. converting between data formats)
■Serve as a protective layer to our database clusters
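
To make the coalescing idea concrete, here is a minimal Rust/tokio sketch of a singleflight-style map of in-flight fetches. This is not Discord's actual data service code; Coalescer and fetch_from_database are hypothetical names, and the real RPCs wrap ScyllaDB queries rather than a stubbed helper.

    use std::collections::HashMap;
    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::{broadcast, Mutex};

    // Map of in-flight fetches: concurrent callers for the same key share one result.
    struct Coalescer {
        in_flight: Mutex<HashMap<u64, broadcast::Sender<String>>>,
    }

    impl Coalescer {
        fn new() -> Self {
            Self { in_flight: Mutex::new(HashMap::new()) }
        }

        // Issue the underlying query at most once per key among concurrent callers.
        async fn get(self: Arc<Self>, key: u64) -> String {
            let mut rx = {
                let mut map = self.in_flight.lock().await;
                match map.get(&key) {
                    // Another caller is already fetching this key: wait for its result.
                    Some(tx) => tx.subscribe(),
                    None => {
                        let (tx, rx) = broadcast::channel(1);
                        map.insert(key, tx.clone());
                        let this = Arc::clone(&self);
                        tokio::spawn(async move {
                            let value = fetch_from_database(key).await;
                            // Remove the entry before publishing so later callers start fresh.
                            this.in_flight.lock().await.remove(&key);
                            let _ = tx.send(value);
                        });
                        rx
                    }
                }
            };
            rx.recv().await.expect("coalesced fetch failed")
        }
    }

    // Stand-in for the real ScyllaDB query behind this RPC (hypothetical helper).
    async fn fetch_from_database(key: u64) -> String {
        tokio::time::sleep(Duration::from_millis(10)).await;
        format!("row for key {key}")
    }

    #[tokio::main]
    async fn main() {
        let coalescer = Arc::new(Coalescer::new());
        // Two concurrent reads for the same hot key issue a single database query.
        let (a, b) = tokio::join!(coalescer.clone().get(42), coalescer.clone().get(42));
        println!("{a} / {b}");
    }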
What is a Data Service?
[Diagram: the API calls several data services, each of which fronts its own ScyllaDB cluster]
Data Services: Mirroring and Validation
■Data services can mirror incoming queries to a shadow cluster
■They can also perform read validations by comparing results from the
production and shadow clusters
■At a high level:
■We issue the query to the production cluster and wait for the results
■We spawn an asynchronous task that issues the same query to the shadow
cluster, waits for the results, and compares the results to the production results (a
metric is emitted that counts successful and unsuccessful read validations)
■We return the results to the requesting client and the task runs in the background
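
A minimal sketch of that read path, assuming hypothetical query_production and query_shadow helpers and plain atomic counters in place of the real metrics pipeline; the key point is that the client only ever waits on the production query.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::time::Duration;

    // Hypothetical counters standing in for the real metrics pipeline.
    #[derive(Default)]
    struct ValidationMetrics {
        matched: AtomicU64,
        mismatched: AtomicU64,
    }

    // Stand-ins for the real ScyllaDB queries against each cluster (hypothetical helpers).
    async fn query_production(user_id: u64) -> Vec<String> {
        vec![format!("row for {user_id}")]
    }
    async fn query_shadow(user_id: u64) -> Vec<String> {
        vec![format!("row for {user_id}")]
    }

    // Serve the read from the production cluster, then validate against the shadow
    // cluster in a background task so the client never waits on the shadow cluster.
    async fn get_rows(user_id: u64, metrics: Arc<ValidationMetrics>) -> Vec<String> {
        let production_rows = query_production(user_id).await;

        let expected = production_rows.clone();
        tokio::spawn(async move {
            let shadow_rows = query_shadow(user_id).await;
            if shadow_rows == expected {
                metrics.matched.fetch_add(1, Ordering::Relaxed);
            } else {
                metrics.mismatched.fetch_add(1, Ordering::Relaxed);
            }
        });

        production_rows
    }

    #[tokio::main]
    async fn main() {
        let metrics = Arc::new(ValidationMetrics::default());
        let rows = get_rows(42, metrics.clone()).await;
        println!("returned {} rows to the client", rows.len());
        // Give the background validation a moment to finish (demo only).
        tokio::time::sleep(Duration::from_millis(50)).await;
        println!("matched validations: {}", metrics.matched.load(Ordering::Relaxed));
    }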
Data Services and Shadow Clusters
[Diagram: the API calls a data service, which serves reads and writes from the production ScyllaDB cluster and mirrors the same traffic to the shadow cluster]
Read Validations
■Data services issue the same query to both the production cluster and
the shadow cluster and compare the results
■Rust and tokio make this very easy: database queries are I/O-bound, and tokio is designed to handle a high number of concurrent I/O-bound tasks
■If the results don’t match, a metric is emitted
■Generally, this gives us peace of mind that we don’t experience irreversible data corruption during and after the upgrade
Setting Up Read Validations
■Because shadow clusters are created from production disk snapshots
(usually from the prior day), the cluster is missing about one day’s worth
of writes
■This gap in data would produce false positive read validation
mismatches
■e.g. If some entity A exists in the disk snapshot and is mutated between the
creation of the disk snapshot and when we start to mirror writes to the shadow
cluster, the shadow cluster will be missing the mutation that is present in the
production cluster. As a result, when we perform a read validation on that row, it
will fail
How We Solve This: Snowflakes
■Entity IDs at Discord are all snowflakes
■A snowflake is a unique, 64-bit integer that contains an embedded
timestamp that reflects the time the snowflake was generated, meaning
snowflakes monotonically increase over time
■This gives them the property that if entity A is older than entity B, A’s ID
will be less than B’s ID
■We use a service to generate these IDs to ensure the IDs are unique
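
For concreteness, a small Rust sketch of the property being relied on here, based on Discord's publicly documented snowflake layout (the high 42 bits hold a millisecond timestamp relative to the 2015-01-01 Discord epoch); the example ID comes from Discord's API documentation.

    // Discord's publicly documented snowflake layout: the high 42 bits hold a
    // millisecond timestamp relative to the Discord epoch (2015-01-01T00:00:00Z).
    const DISCORD_EPOCH_MS: u64 = 1_420_070_400_000;

    // Milliseconds since the Unix epoch at which this snowflake was generated.
    fn snowflake_timestamp_ms(id: u64) -> u64 {
        (id >> 22) + DISCORD_EPOCH_MS
    }

    fn main() {
        // Example ID from Discord's API documentation.
        let id: u64 = 175_928_847_299_117_063;
        println!("generated at {} ms after the Unix epoch", snowflake_timestamp_ms(id));
        // Because the timestamp sits in the most significant bits, a newer entity
        // always has a numerically larger ID than an older one.
    }

Since the timestamp occupies the most significant bits, comparing two snowflakes effectively compares creation times, which is exactly what the cutoff check on the next slide relies on.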
Read Validations and Snowflakes
■Once we’ve got a shadow cluster stood up and receiving mirrored writes
from production, we generate a new snowflake and set it as the
“cutoff_id” in the data service
■When the data service receives a read, it only performs a read validation
if the ID in the query is greater than the configured “cutoff_id”
■This ensures that we are only performing read validations against
records that were created after the shadow cluster began receiving
mirrored production writes
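
A minimal sketch of that gating decision, assuming a hypothetical ValidationConfig holding the cutoff snowflake and a validation sample rate (the “percentage of reads that should be validated” mentioned in the setup recap below); the real data service configuration will differ.

    // Hypothetical per-cluster validation settings (not Discord's real config schema).
    struct ValidationConfig {
        // Snowflake generated once write mirroring reached 100%.
        cutoff_id: u64,
        // Fraction of eligible reads to validate, in the range 0.0..=1.0.
        sample_rate: f64,
    }

    // Decide whether this read should be validated against the shadow cluster.
    // `sample` is a uniform value in [0, 1) supplied by the caller (e.g. from a PRNG).
    fn should_validate(cfg: &ValidationConfig, queried_id: u64, sample: f64) -> bool {
        // Only rows created after mirroring started are guaranteed to exist on the
        // shadow cluster, so anything at or below the cutoff is skipped entirely.
        queried_id > cfg.cutoff_id && sample < cfg.sample_rate
    }

    fn main() {
        let cfg = ValidationConfig { cutoff_id: 1_200_000_000_000_000_000, sample_rate: 0.25 };
        println!("{}", should_validate(&cfg, 1_200_000_000_000_000_001, 0.1)); // true
        println!("{}", should_validate(&cfg, 900_000_000_000_000_000, 0.1));   // false: predates cutoff
    }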
Bringing It All Together
■At a high level, the shadow cluster setup process looks something like this:
■Provision and configure the infrastructure for the shadow cluster using disk snapshots from the production cluster
■Slowly start mirroring reads and writes to the shadow cluster, keeping an eye on the cluster’s health
■Once 100% of writes are being mirrored to the shadow cluster, generate a new snowflake via the snowflake service, and configure this as the “cutoff_id” in the data service
■Configure the percentage of reads that should be validated (usually anywhere from 1% to 100%)
Performing the Upgrade
■Once we are mirroring traffic to the shadow cluster and performing
read validations, we can perform the actual upgrade
■The upgrade is performed on a rolling basis, one node at a time
■Our clusters run with a replication factor of 3 and consistency level
“QUORUM”, so one node going down does not result in downtime
■At a high level, upgrading a node looks like:
■Swapping in the binary for the new version
■Updating any configuration that needs to be changed across the versions
■Draining and restarting the node
Performing the Upgrade
■The upgrade process is automated via a combination of scripts
and configuration management tools
■We monitor the shadow cluster closely while the upgrade takes
place to ensure the cluster remains healthy
■It is a useful exercise to perform the upgrade exactly as it would
be performed in production to keep the testing process as realistic
as possible
■It is also useful as an operator to practice performing the upgrade
in a production-like environment as preparation for the real deal
After the Upgrade: Monitoring Query Performance
■Our data services have per-query latency metrics
■We compare the per-query latencies between the production and
shadow clusters to look for regressions or improvements
■When evaluating a regression in latency, we look at the context:
■Where in our application is the query used? Some endpoints are more
latency-sensitive than others
■What is the absolute latency of the query? (A 50% regression in a query
with sub-millisecond latency is negligible)
■Do we just need to wait for the cache to warm?
■Performance can often be improved by tuning settings
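
As a rough illustration of how the relative and absolute questions combine, here is a small Rust sketch; the 10% and 1 ms thresholds are illustrative, not Discord's actual values.

    // Hypothetical significance check mirroring the two questions above: how big is
    // the relative regression, and does the absolute difference actually matter?
    fn latency_regression_is_significant(prod_ms: f64, shadow_ms: f64) -> bool {
        let relative = (shadow_ms - prod_ms) / prod_ms; // e.g. 0.5 == 50% slower
        let absolute_ms = shadow_ms - prod_ms;
        // Thresholds are illustrative only.
        relative > 0.10 && absolute_ms > 1.0
    }

    fn main() {
        // A 50% regression on a 0.4 ms query: relatively large, absolutely negligible.
        println!("{}", latency_regression_is_significant(0.4, 0.6)); // false
        // A 30% regression on a 20 ms query: worth investigating.
        println!("{}", latency_regression_is_significant(20.0, 26.0)); // true
    }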
After the Upgrade: Monitoring Read Validations
■We closely watch the metric that counts read validation
mismatches
■There are occasionally false positives due to differences in the way reads and writes are interleaved and applied on the production and shadow clusters
■To account for this, read validations are retried a fixed number of times to check for convergence (sketched below); if the results never converge, that suggests a problem
■To troubleshoot read validation mismatches, we can inspect the
mismatched data to see what’s different
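
A minimal sketch of the retry-for-convergence idea, with hypothetical read_production and read_shadow helpers and an illustrative attempt count and backoff; the real retry policy is not described in the slides.

    use std::time::Duration;

    // Stand-ins for the queries the RPC already issues against each cluster.
    async fn read_production(key: u64) -> Vec<String> {
        vec![format!("row {key}")]
    }
    async fn read_shadow(key: u64) -> Vec<String> {
        vec![format!("row {key}")]
    }

    // Retry the comparison a fixed number of times before counting a mismatch, to
    // filter out transient differences caused by writes being applied to the two
    // clusters in a different order.
    async fn validate_with_retries(key: u64, attempts: u32, backoff: Duration) -> bool {
        for attempt in 0..attempts {
            if read_production(key).await == read_shadow(key).await {
                return true; // results converged
            }
            if attempt + 1 < attempts {
                tokio::time::sleep(backoff).await;
            }
        }
        false // never converged: inspect the mismatched rows
    }

    #[tokio::main]
    async fn main() {
        let converged = validate_with_retries(42, 3, Duration::from_millis(100)).await;
        println!("validation converged: {converged}");
    }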
After the Upgrade: General Testing
■We also perform a number of general tests on the shadow cluster
after performing the upgrade:
■Knock over a node to simulate a host failure
■Perform a repair (a process that synchronizes data across the nodes in a
ScyllaDB cluster)
■Perform a schema migration
■Go through our metrics dashboards to ensure that metrics are not broken
■Perform a backup
Wrapping Up
■Shadow cluster testing has been an invaluable tool whenever we
need to test a risky production operation
■Shadow clusters give us the opportunity to practice running through complex operations before performing them in production
■Our data services serve as a convenient place for us to implement
traffic mirroring and read validations
Looking Forward: ScyllaDB Control Plane
■Currently, shadow cluster testing is a very manual process
■We have scripts to help us, but they tend to be brittle
■We are currently investing time in building a control plane to
manage our ScyllaDB infrastructure
■When finished, it will allow us to automate operations like:
■The entire shadow cluster creation process (short of provisioning the
infrastructure)
■Updating settings across the nodes in a cluster
■Installing new versions
■Restarting nodes