Data replication ensures high availability—reliable, consistent, and timely access. Dive into the tough problems often skipped: reliable backfills, schema changes, data validation, failure resistance, and parallel writes. Learn how Shopify achieves ~5M rows/sec with <500ms p99 lag. #DataReplication #TechTalk
Slide Content
A ScyllaDB Community
Reliable Filtered Data Replication
Cameron Morgan
Staff Infrastructure Engineer - Databases
The team
Hi, I’m Cameron Morgan
I’m a staff infrastructure engineer at Shopify. I work on our Database
platform team. ~6000 databases under management.
@cameron_p_m on twitter/X
cameronpmorgan.com
I built this with really smart people: Aaron Brady, Anton Sergeev,
Pawan Dubey, Shiv Nagarajan, and others.
Outline
●Background on Shopify and data replication
●Our problem: filtered replication with extremely low latency
●Result: Our solution 2 years in production
●Alternatives considered
●Our design
●Discuss 3 hard problems of replication
○Proving data is correct
○Zero downtime schema changes
○Write performance
●Conclusion and tradeoffs
Background: Shopify
Shopify has millions of merchants - over $1 trillion in sales.
Flash sales and smart people made our databases good.
●Main logical database processes ~20 million queries per second.
●~200k transactions per second.
●$4.2 Million per minute in sales.
●Largest tenants are 10s of TB.
●Last Black Friday our MySQL instances transited ~7 exabytes of data between
the microprocessor and the memory subsystem. All the words ever spoken by
mankind would equal ~5 exabytes.
Really fun problems.
Background: Data replication
Continuously copying data from one
datastore to another to make data
highly available to users.
Users can access data in a reliable,
consistent and timely fashion.
Resiliency, scale, and performance benefits, but it costs more and adds operational complexity.
E.g. replicating into an inverted index for Google search, replicating into SQLite like Apple Notes, or simply horizontally scaling your database.
Our problem: Human version
We wanted to split read heavy and write
heavy workloads so we could optimize our
database for reads.
It would cost too much to put all the data
everywhere.
We had replication lag. Users should not
notice lag.
Speed-of-light problem: we want responses served with a p90 of 100ms, but ping between US Central and Asia is 250ms.
Our problem: Technical Version
●p99 <5s replication lag.
●Partial replication. Table and column filtering needed for cost
/ PII reasons
●Partial parallel snapshots. It should be possible to copy table
or column data into the databases while live in production, for
zero-downtime schema changes.
●Transactional consistency. Transactions are preserved and have
causal consistency. We inherit the properties of MySQL
replication.
●Data must be 100% verifiably correct.
●Ability to do stateless transformations for PII scrubbing or
filtering rows
Result: Our solution 2 years in production
Most shops replicated and
served out of many regions.
Merchants get global
performance for free.
Response times cut in half: by making network time minimal, Shopify saw a ~50% reduction in response time in Edge regions after rolling this out.
Replication lag p99.9 < 1s.
During high load flash sales,
we don’t see replication lag
due to filtered replication.
Why not MySQL filtered replication?
●MySQL (non-filtered) replication is
battle tested, reliable and fast.
●MySQL filtered replication is not a
complete solution and less common.
●Would require a separate process for
the initial partial snapshot and
coordination for changing the filtered
set in production. No way to do row /
column filters.
●Looking at it in reverse, we would be extracting out the
“simple” part, reading and writing the data, and making the
overall system more complex.
●We also tried some other tools.
The Easy Problems (the V1)
●A Go process reads the binlog and writes statements
●Partial incremental snapshot support
●The GTID is persisted into the database for crash recovery or deployments (sketched below)
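As an illustration of that last point, here is a minimal sketch of GTID checkpointing using Go's database/sql. The checkpoints table, its columns, and the stream identifier are illustrative assumptions, not the real V1 schema.

```go
// Minimal sketch: persist the last fully applied GTID set so the Go binlog
// reader can resume after a crash or a deployment. The one-row-per-stream
// `checkpoints` table is an illustrative assumption, not the real schema.
package checkpoint

import (
	"context"
	"database/sql"
)

// SaveGTID upserts the last fully applied GTID set for a replication stream.
func SaveGTID(ctx context.Context, db *sql.DB, stream, gtidSet string) error {
	_, err := db.ExecContext(ctx,
		`INSERT INTO checkpoints (stream, gtid_set) VALUES (?, ?)
		 ON DUPLICATE KEY UPDATE gtid_set = VALUES(gtid_set)`,
		stream, gtidSet)
	return err
}

// LoadGTID returns the GTID set to resume from, or "" on a fresh start.
func LoadGTID(ctx context.Context, db *sql.DB, stream string) (string, error) {
	var gtidSet string
	err := db.QueryRowContext(ctx,
		`SELECT gtid_set FROM checkpoints WHERE stream = ?`, stream).Scan(&gtidSet)
	if err == sql.ErrNoRows {
		return "", nil
	}
	return gtidSet, err
}
```

Checkpointing only after a write is fully applied (and, as described later, verified) is what makes crash recovery safe: on restart the reader resumes from a GTID it knows reached the destination.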
Partial incremental snapshot / backfill
Getting all the historical data!
●Snapshots that select * or hold open a
transaction do not work at scale. Debezium
does this by default.
●We implemented a watermark based
incremental backfill and contributed the
algorithm to Debezium for a different
project. Read only version of the DBLog
algorithm.
●Problem: Shared write path -> induced lag.
Broke our SLO.
We don’t use Debezium for this project, so we implemented this in Go using the go-mysql Canal library.
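A rough sketch of the chunked, watermark-based backfill loop in the spirit of the read-only DBLog variant mentioned above. The Backfiller hooks (CurrentGTID, ReadChunk, KeysChangedBetween, Upsert) are hypothetical stand-ins for the real binlog reader and writers, not Shopify's implementation.

```go
// Sketch of a watermark-based incremental backfill: copy a table in
// primary-key-ordered chunks while live replication continues, using GTID
// watermarks to resolve conflicts. All hooks are hypothetical stand-ins.
package backfill

import "context"

type Row struct {
	PrimaryKey string
	Values     map[string]any
}

type Backfiller struct {
	CurrentGTID        func() string                                                           // position of the live binlog reader
	ReadChunk          func(ctx context.Context, table, afterKey string, n int) ([]Row, error) // SELECT ... WHERE pk > ? ORDER BY pk LIMIT ?
	KeysChangedBetween func(low, high, table string) map[string]bool                           // keys touched by replicated events between two GTIDs
	Upsert             func(ctx context.Context, table string, r Row) error
}

func (b *Backfiller) BackfillTable(ctx context.Context, table string, chunkSize int) error {
	afterKey := ""
	for {
		low := b.CurrentGTID() // low watermark: before reading the chunk
		rows, err := b.ReadChunk(ctx, table, afterKey, chunkSize)
		if err != nil {
			return err
		}
		if len(rows) == 0 {
			return nil // no more rows: backfill complete
		}
		high := b.CurrentGTID() // high watermark: after reading the chunk

		// Rows also modified by replication between the watermarks are
		// skipped; the newer replicated write wins over the snapshot read.
		changed := b.KeysChangedBetween(low, high, table)
		for _, r := range rows {
			if changed[r.PrimaryKey] {
				continue
			}
			if err := b.Upsert(ctx, table, r); err != nil {
				return err
			}
		}
		afterKey = rows[len(rows)-1].PrimaryKey
	}
}
```

The key property is that no full-table SELECT or long-running transaction is ever held open: each chunk is bounded, and any chunk row whose key was also touched by replication between the two watermarks is discarded because the replicated event is newer.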
Hard Problem 1: Correctness
●It’s impossible to design a replication system and be 100%
correct on the first attempt. There are always bugs, and in
systems like this they cause data corruption.
●We made it a condition for a write to be considered “completed”
that we checksummed the result.
●This became the most essential part of our system. It allowed
large and risky changes to be developed quickly.
●There are two general approaches. Full table verification or
incremental verification.
Hard Problem 1: Correctness
●After write, async checksum, then checkpoint GTID. Required some
design and iterative changes for it to not slow down replication.
Reads are much faster than writes for point lookups.
●We started with incremental verification, but now do full table
verification too.
●Incremental is more efficient and gives faster feedback, but it is more
complex and doesn’t catch incorrect data caused by things outside of your system.
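As a concrete example of incremental verification, the sketch below checksums the primary keys just written on both the source and the target and compares the results before the GTID is checkpointed. The BIT_XOR(CRC32(CONCAT_WS(...))) query is a common MySQL row-checksum pattern (pt-table-checksum uses something similar); it is not necessarily the checksum Shopify uses, and the id primary key column is an assumption.

```go
// Sketch of incremental verification: after applying a batch, checksum the
// affected primary keys on source and target and compare. The checksum query
// is a common MySQL pattern, not necessarily the one Shopify uses.
package verify

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// checksumKeys returns one aggregate checksum over the given primary keys.
// cols is a pre-built CONCAT_WS argument list such as "'#', id, col_a, col_b".
func checksumKeys(ctx context.Context, db *sql.DB, table, cols string, keys []any) (uint64, error) {
	if len(keys) == 0 {
		return 0, nil
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(keys)), ",")
	q := fmt.Sprintf(
		`SELECT COALESCE(BIT_XOR(CRC32(CONCAT_WS(%s))), 0) FROM %s WHERE id IN (%s)`,
		cols, table, placeholders)
	var sum uint64
	err := db.QueryRowContext(ctx, q, keys...).Scan(&sum)
	return sum, err
}

// VerifyBatch compares source and target checksums for the keys just written;
// only when they match is the write considered complete and the GTID checkpointed.
func VerifyBatch(ctx context.Context, source, target *sql.DB, table, cols string, keys []any) error {
	want, err := checksumKeys(ctx, source, table, cols, keys)
	if err != nil {
		return err
	}
	got, err := checksumKeys(ctx, target, table, cols, keys)
	if err != nil {
		return err
	}
	if want != got {
		return fmt.Errorf("verification failed for %s: source=%d target=%d", table, want, got)
	}
	return nil
}
```

Full table verification can reuse the same checksum, walking the entire key space in chunks instead of just the keys from the last batch.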
Hard problem 2: Schema changes
●10 schema changes per day
●Must be easy/fast for developers. Schema changes are
features.
●On a specific database, schema changes are run with zero
downtime using an online schema change strategy. A new
table is created with all the changes, data is copied to
that new table, and then we do an atomic table rename from
the temporary name to the new name.
●Schema changes are a very difficult problem
Hard problem 2: Schema changes
●A developer submits a new version of the tables, columns and indexes
they would like on the edge. This triggers a state machine in a
regular rails app.
●It decides whether we need to copy new data into columns or tables and
updates the metadata table the backfiller is watching, which then
backfills that data. Once complete, the application is able to use
that new data (a rough sketch of this trigger loop follows below).
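A minimal sketch of that trigger loop, assuming a hypothetical backfill_requests metadata table with id, table_name, columns, and state fields; the real state machine lives in a regular Rails app and its schema is not shown in the talk.

```go
// Sketch: the replicator polls a metadata table that the schema-change state
// machine writes to, and runs a backfill for each pending request. The table
// name, its columns, states, and poll interval are illustrative assumptions.
package schemachange

import (
	"context"
	"database/sql"
	"time"
)

type Request struct {
	ID      int64
	Table   string
	Columns string // comma-separated columns to copy into the edge databases
}

// WatchBackfillRequests polls for pending requests, runs the supplied backfill
// for each one, and marks it complete so the application can use the new data.
func WatchBackfillRequests(ctx context.Context, db *sql.DB, run func(context.Context, Request) error) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
		pending, err := pendingRequests(ctx, db)
		if err != nil {
			return err
		}
		for _, r := range pending {
			if err := run(ctx, r); err != nil {
				return err
			}
			if _, err := db.ExecContext(ctx,
				`UPDATE backfill_requests SET state = 'complete' WHERE id = ?`, r.ID); err != nil {
				return err
			}
		}
	}
}

func pendingRequests(ctx context.Context, db *sql.DB) ([]Request, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT id, table_name, columns FROM backfill_requests WHERE state = 'pending'`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	var out []Request
	for rows.Next() {
		var r Request
		if err := rows.Scan(&r.ID, &r.Table, &r.Columns); err != nil {
			return nil, err
		}
		out = append(out, r)
	}
	return out, rows.Err()
}
```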
We broke our SLO on some databases due to write performance
Sequential writes are always limited by Little's Law: L = λW. With a single sequential writer (L = 1), throughput is capped at λ = 1/W, one transaction per round trip of write latency W.
Naive parallel writes directly cause incorrect data:
1. UPDATE foo SET col_b = 1;
2. UPDATE foo SET col_b = 2;
Assume a write can complete instantly or take arbitrarily long. If update 2
completes before update 1, you end up with incorrect data in your database.
At 10k transactions per second, this happens almost instantly if your
algorithm is wrong.
Hard problem 3: Write performance
Solution: WRITESET based parallel replication written in Go
MySQL already has parallel replication. The default algorithm in MySQL 8
is called WRITESET.
The simple version: MySQL takes the set of keys a transaction touches and
determines that transactions are independent if their keys (their WRITESETs)
are disjoint, i.e. non-overlapping. If transactions are independent, they
can be written in parallel.
We chose to implement WRITESET based parallel replication in Go
This brought our replication lag from breaking the p99 < 5s SLO to the
p99.9 < 1s that we have today.
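A sketch of the disjointness rule at the heart of WRITESET-based parallel apply, written in Go. It assumes a single coordinator goroutine calling Acquire in commit order and worker goroutines calling Release after a transaction is applied; this illustrates the rule, not Shopify's actual scheduler.

```go
// Scheduler enforces the WRITESET disjointness rule: a transaction may run
// in parallel only if none of its keys belong to an in-flight transaction.
// A single coordinator calls Acquire in commit order; a worker calls Release
// once its transaction is applied. Illustration only, not Shopify's scheduler.
package applier

import "sync"

type Scheduler struct {
	mu       sync.Mutex
	cond     *sync.Cond
	inFlight map[string]struct{} // keys currently held by in-flight transactions
}

func NewScheduler() *Scheduler {
	s := &Scheduler{inFlight: make(map[string]struct{})}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Acquire blocks until the writeset is disjoint from every in-flight
// transaction, then marks its keys as held.
func (s *Scheduler) Acquire(writeset []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.conflicts(writeset) {
		s.cond.Wait()
	}
	for _, k := range writeset {
		s.inFlight[k] = struct{}{}
	}
}

// Release frees the writeset's keys and wakes any waiting transaction.
func (s *Scheduler) Release(writeset []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, k := range writeset {
		delete(s.inFlight, k)
	}
	s.cond.Broadcast()
}

// conflicts reports whether any key in the writeset is already in flight.
// Callers must hold s.mu.
func (s *Scheduler) conflicts(writeset []string) bool {
	for _, k := range writeset {
		if _, held := s.inFlight[k]; held {
			return true
		}
	}
	return false
}
```

Because the coordinator dispatches in commit order and blocks whenever a transaction's writeset overlaps one still in flight, conflicting transactions are applied in their original order while disjoint transactions proceed in parallel, avoiding the incorrect ordering shown in the earlier UPDATE example.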
Solution
We have built a system that has been in production for close to 2
years and achieved our design goals.
●50% reduction in response time by making network time
minimal.
●p50 query latency of 300µs.
●No longer prohibitively expensive!
●No more prohibitively high replication lag!
●Flexibility as to what data is located where!
The tradeoffs:
●One schema is simpler than two schemas and is easier for devs
to reason about.
●Operational complexity is higher and requires more dedicated
time / staffing than MySQL replication.
Bonus Personal Learnings
●Data verification improves trust and improves development
speed.
●Understand your bottlenecks before over-optimizing. Improving
areas that aren't your bottleneck is usually not worth the
time, e.g. shipping Go performance improvements or using a faster
language doesn't matter as much because our bottleneck is
always database writes (I/O).
●Monitoring the write ahead log is incredibly useful in order to
make design and configuration changes to the database itself.
Thank you! Let’s connect.
Cameron Morgan [email protected]
@cameron_p_m
cameronpmorgan.com
Hard Problem 2: Snapshots without inducing
replication lag
Shared write path -> induced lag. Broke our SLO.
We needed to do many tables at once and wanted snapshots to be
fast, so we decided to do backfills in parallel using their own
writers.
This solved our lag issue but weakened our consistency model during the
snapshot process. That is fine because data being snapshotted is,
by definition, not in production.
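A brief sketch of fanning per-table backfills out to their own writers, reusing the hypothetical Backfiller from the earlier backfill sketch; the errgroup usage and the concurrency limit are illustrative, not Shopify's code.

```go
// Sketch: run many table backfills at once, each with its own writer, so
// snapshot traffic stays off the shared replication write path.
package backfill

import (
	"context"

	"golang.org/x/sync/errgroup"
)

func BackfillAll(ctx context.Context, b *Backfiller, tables []string, chunkSize int) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(8) // cap concurrent table backfills; tune to what the target can absorb
	for _, table := range tables {
		table := table // capture the loop variable (pre-Go 1.22 semantics)
		g.Go(func() error {
			return b.BackfillTable(ctx, table, chunkSize)
		})
	}
	return g.Wait()
}
```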
Hard problem 4: Write performance
We leaned heavily on our verifier which caught some really
interesting bugs. For example:
If
●the local transaction ID was rotated (called sequence_number)
AND
●the transaction right before the rotation touched the same
primary key as the transaction after the rotation AND
●the transaction right before the rotation was committed after
the transaction after the rotation
we would see a verification error. This was quite easy to fix with a
unit test once the verifier caught it in staging.