Inside Expedia's Migration to ScyllaDB for Change Data Capture
ScyllaDB
25 slides
Jun 21, 2024
About This Presentation
Database migrations are no fun, and there are several strategies and considerations to weigh before attempting one in production. In this talk, Jean Carlo and Manikar Rangu take a deep dive into Expedia’s migration journey from Cassandra to ScyllaDB. They cover the aspects and pitfalls the team had to overcome as part of their Identity service project.
Slide Content
Inside Expedia’s Migration to ScyllaDB for Change Data Capture
Jean Carlo Rivera Ura, Database Administrator at Expedia Group
Manikar Rangu, Database Administrator at Expedia Group
Jean Carlo Rivera Ura
MSc in Computer Science, Database Administrator
10 years working with NoSQL databases
Open source enthusiast
Passionate about mountain activities
Manikar Rangu
Master of Technology (M.Tech), NoSQL Database Administrator
9+ years working with NoSQL databases
Open source database expert
Presentation Agenda
Context
Migration challenges
Considered options
ScyllaDB Migrator
Recommendations
What we will do differently
Context
Context
Expedia is a travel technology company
Petabytes of data, thousands of requests per second
Data Infrastructure team: automation, NoSQL and SQL databases
Context: CDC and Cassandra
Cassandra cluster: 15 nodes, i3.xlarge, 1 DC
Session and logging data, around 50 tables
Changes replicated through the Debezium connector
Monitoring: pay attention to the consumers and to the space allotted in cdc_free_space_in_mb
Data flow: logging activity → CDC → Kafka → other datastores and apps
Context: CDC and ScyllaDB
CDC is an embedded feature: a CDC table with a TTL
Just alter the table: ALTER TABLE <table> WITH cdc = {'enabled': true};
Changes are ready to be consumed over CQL, with Kafka integrations available
Data flow: logging activity → CDC → Kafka → other datastores and apps
Migration Challenges
Migration Challenges
Zero downtime
No data loss allowed
SLO < 130 ms
50 tables to migrate
1 TB of uncompressed data
Data validation
TLS connections
Considered Options
Considered Options: SSTable Loader
Create snapshots and move the SSTables
Needs to be run on dedicated instances
No way to restart a load from a previous incomplete job
Unless your data in Cassandra is consistent, you may need to load the SSTables from every node
Considered Options: Spark-based ScyllaDB Migrator
Streams data in batches: instead of copying the data three times (once per replica), it scans the tables at a chosen consistency level and copies the result into ScyllaDB
Savepoints: keeps track of the token ranges already migrated
Can preserve the WRITETIME and TTL attributes of the copied fields
Uses Spark’s parallel processing capabilities to distribute the load across all nodes
Offers a validator job to compare source and destination
Comes with built-in monitoring capabilities
Diagram: the ScyllaDB Migrator requests data from the source and writes it to ScyllaDB Open Source
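The features above map onto the migrator’s config.yaml. A minimal illustrative sketch (host, keyspace, and table names are placeholders; check the scylla-migrator README for the exact, current keys and defaults):

```yaml
source:
  type: cassandra
  host: cassandra.internal        # placeholder
  port: 9042
  keyspace: identity              # placeholder
  table: sessions                 # placeholder
  consistencyLevel: LOCAL_QUORUM  # scan once at this CL instead of copying every replica
  splitCount: 256                 # token-range splits; raise for big tables
  fetchSize: 1000
  preserveTimestamps: true        # keep WRITETIME and TTL of copied cells
target:
  type: scylla
  host: scylla.internal           # placeholder
  port: 9042
  keyspace: identity
  table: sessions
savepoints:
  path: /app/savepoints           # lets an interrupted job resume from migrated token ranges
  intervalSeconds: 300
```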
ScyllaDB Migrator: The Selected Option
ScyllaDB Migrator
Building the Spark cluster: Ansible automation
Multiple tables, multiple configurations; big tables need specific handling
AWS instance type: i4i.4xlarge for the Spark cluster
ScyllaDB Migrator: online migration
No downtime, controlled throughput, SLO < 130 ms
Data integrity: dual writes from the applications while the historical data is loaded
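The dual-write step can be sketched as a thin wrapper that sends every application mutation to both clusters while the historical load runs. This is an illustrative pattern, not Expedia’s actual code; `DictWriter` is a hypothetical stand-in for the real Cassandra/ScyllaDB session objects:

```python
class DualWriter:
    """Send every write to both the old and the new cluster.

    Reads keep going to Cassandra until validation passes; only then
    is traffic switched over to ScyllaDB.
    """

    def __init__(self, cassandra_writer, scylla_writer):
        self.primary = cassandra_writer   # source of truth during migration
        self.secondary = scylla_writer    # new cluster being backfilled

    def write(self, key, value):
        self.primary.write(key, value)
        # A failure on the new cluster must not break production traffic:
        # log it and let the validator catch the divergence later.
        try:
            self.secondary.write(key, value)
        except Exception as exc:
            print(f"dual-write to ScyllaDB failed for {key!r}: {exc}")


# Minimal in-memory stand-ins to exercise the pattern.
class DictWriter:
    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        self.rows[key] = value


cassandra, scylla = DictWriter(), DictWriter()
writer = DualWriter(cassandra, scylla)
writer.write("session:42", {"user": "jean"})
```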
Run the scylla-migrator
Scale the Spark cluster nodes up or down to control throughput
Dual writes continue while the migrator copies data
Validation phase
Validator settings: compareTimestamps, writetimeToleranceMillis, failuresToFetch
When failures are found: identify the apps involved, sort the data by relevance
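The timestamp comparison the validator performs can be approximated like this: a row only counts as a mismatch when its values differ or its WRITETIME values drift by more than the configured tolerance. The parameter name mirrors the writetimeToleranceMillis option above; the function itself is a sketch, not the migrator’s actual code:

```python
def find_mismatches(source_rows, target_rows, writetime_tolerance_millis=2000):
    """Compare (value, writetime_millis) pairs keyed by primary key."""
    failures = []
    for key, (src_value, src_wt) in source_rows.items():
        if key not in target_rows:
            failures.append((key, "missing in target"))
            continue
        dst_value, dst_wt = target_rows[key]
        if src_value != dst_value:
            failures.append((key, "value differs"))
        elif abs(src_wt - dst_wt) > writetime_tolerance_millis:
            # Dual writes land at slightly different times on each cluster,
            # so small drift is expected and tolerated.
            failures.append((key, "writetime drift"))
    return failures


source = {"a": ("v1", 1000), "b": ("v2", 5000), "c": ("v3", 1000)}
target = {"a": ("v1", 1500), "b": ("v2", 9000)}
failures = find_mismatches(source, target)
# "a" passes (500 ms drift), "b" exceeds the tolerance, "c" is missing
```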
Switch over
50 tables migrated, 1 TB of uncompressed data
New ScyllaDB cluster: 9 nodes, i4i.xlarge
Recommendations
Recommendations: Spark tuning
The default installation isn’t enough
Set SPARK_DAEMON_MEMORY to 20 GB and spark.driver.memory to 5g
scylla-migrator tuning
You have to experiment with splitCount for big tables and big partitions
Tuning the migrator parameters (splitCount, fetchSize) helps relieve write pressure
8 SPARK_WORKER_INSTANCES and 8 SPARK_WORKER_CORES
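The tuning above lands in Spark’s standalone-mode configuration files. An illustrative spark-env.sh fragment using the slides’ numbers (these are starting points for this workload, not universal defaults):

```shell
# spark-env.sh sketch: give the daemons and workers the headroom the
# migrator needs on a large Spark node such as i4i.4xlarge.
export SPARK_DAEMON_MEMORY=20g   # default daemon heap is far too small
export SPARK_WORKER_INSTANCES=8  # worker processes per node
export SPARK_WORKER_CORES=8      # cores per worker process
# spark-defaults.conf counterpart:
#   spark.driver.memory 5g
```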
Recommendations: purging data
Purge rows with null values in the clustering key
CDC tables: disable CDC before the migration (https://github.com/scylladb/scylladb/issues/7251), then re-enable it afterwards, and CDC is good to go
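The purge step amounts to dropping source rows whose clustering-key columns are null before they reach the target, since they cannot be written there. A minimal sketch with hypothetical row dictionaries and column names:

```python
def purge_null_clustering_keys(rows, clustering_keys):
    """Split rows into (kept, purged) by clustering-key completeness.

    Rows with a null clustering-key value can linger in a source cluster,
    but they cannot be inserted into the destination table, so they must
    be purged (or repaired) before the migration run.
    """
    kept, purged = [], []
    for row in rows:
        if all(row.get(col) is not None for col in clustering_keys):
            kept.append(row)
        else:
            purged.append(row)
    return kept, purged


rows = [
    {"user_id": 1, "session_id": "s1", "payload": "ok"},
    {"user_id": 2, "session_id": None, "payload": "broken"},  # null clustering key
]
kept, purged = purge_null_clustering_keys(rows, ["session_id"])
```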
What We Will Do Differently
What we will do differently
Create a new datacenter on the Cassandra side and run the ScyllaDB Migrator there, on AWS EMR
Client activity stays on the first datacenter; the migrator reads from the isolated, separately provisioned second datacenter and writes to the ScyllaDB cluster
Stay in Touch Jean Carlo Rivera Ura [email protected] https://github.com/carlo4002 https://www.linkedin.com/in/jeancarloriveraura/ Mani Rangu [email protected] https://www.linkedin.com/in/manikar-r-9a1a99141/