Zero Downtime Critical Traffic Migration @Netflix Scale

ScyllaDB 342 views 25 slides Jun 24, 2024

About This Presentation

Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience. Behind these perfect moments of entertainment is a complex mechanism, with numerous gears and cogs working in harmony. But what happens when this ma...


Slide Content

Zero Downtime Critical Traffic Migration @Netflix Scale Abhishek Pandey Tech Lead at Meta + Ex Senior Engineer at Netflix

Abhishek Pandey (he/him/his)
Tech Lead at Meta
Migrated and modernized a bunch of critical Netflix components.
Explore nature with my wife and newborn.
Travel and play tennis.
2 truths and a lie: Caused global outage at Uber, Met Roger Federer, Got fired once.

Introduction Picture yourself enthralled by the latest episode of your beloved Netflix series, delighting in an uninterrupted, high-definition streaming experience. Behind these perfect moments of entertainment is a complex mechanism, with numerous gears and cogs working in harmony. Large-scale system migrations are necessary when this machinery needs transformation.

Introduction Challenges in Transitioning Traffic Netflix's Challenge: Uninterrupted Streaming Backend Systems: Orchestrating Product Experience Evolution and Optimization of Backend Systems Focus of this talk: Migration Strategies

Challenges of System Migrations
Main Challenge: Transitioning Traffic with No Customer Impact
Ensuring confidence in upgraded architecture.
Strategies to meet Quality-of-Experience metrics.
Architecture of Backend Systems
Distributed microservices architecture.
Migration points across the service call graph.
Stateless and stateful APIs involved.

Migration Phases
Phase 1: Validating functional correctness and performance.
Functional Correctness, Scalability, Performance, Resilience
Monitoring QoE, SLAs, and KPIs
Phase 2: Controlled migration with continuous monitoring.
Minimizing Incident Risks
Continuous Metrics Monitoring

Replay Traffic Testing
What is replay traffic testing?
Benefits of using replay traffic.
Sandboxed testing at scale.
Exercise diversity of inputs.
Functional correctness, performance validation and load testing.

Replay Traffic Testing Components
Component 1: Traffic Duplication and Correlation
Clone and Fork Production Traffic
Record and Correlate Responses
Component 2: Comparative Analysis and Reporting
Compare and Analyze Responses
Generate Comprehensive Reports
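
As an illustration of Component 1, the sketch below forks one request to a production handler and a replay handler, then records both responses under a shared correlation ID. It is a minimal in-process sketch, not Netflix's implementation: `ReplayRecorder`, `fork_request`, and the handler callables are hypothetical names introduced here, and a real system would fork asynchronously across services.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ReplayRecorder:
    """Hypothetical in-memory store keyed by correlation ID."""
    records: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def record(self, correlation_id: str, path: str, response: Any) -> None:
        self.records.setdefault(correlation_id, {})[path] = response


def fork_request(
    request: Any,
    production_handler: Callable[[Any], Any],
    replay_handler: Callable[[Any], Any],
    recorder: ReplayRecorder,
) -> Any:
    """Serve from production, clone the request to the replay path,
    and record both responses under one correlation ID."""
    correlation_id = str(uuid.uuid4())

    prod_response = production_handler(request)
    recorder.record(correlation_id, "production", prod_response)

    # The replay path must never affect the customer-facing response,
    # so any failure is captured as data instead of being raised.
    try:
        replay_response = replay_handler(request)
    except Exception as exc:
        replay_response = {"error": repr(exc)}
    recorder.record(correlation_id, "replay", replay_response)

    return prod_response
```

The caller only ever sees the production response; the recorded pairs feed the comparison step described in Component 2.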

Approaches for Replay Traffic Generation Device Driven Approach

Approaches for Replay Traffic Generation Server Driven Approach

Approaches for Replay Traffic Generation Dedicated Service Approach

Analyzing Replay Traffic
Normalization
Preprocessing Responses
Addressing Alterations
Comparison
Diffing Responses
Generating Summary Metrics
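
The normalization and comparison steps might look roughly like the following sketch. It assumes responses are JSON-like dicts and that a known set of fields (here the hypothetical `VOLATILE_FIELDS`) legitimately differs between the two paths; the field names and function names are illustrative, not taken from the talk.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical fields expected to differ between production and replay.
VOLATILE_FIELDS = frozenset({"timestamp", "request_id"})


def normalize(response: Any, volatile: frozenset = VOLATILE_FIELDS) -> Any:
    """Recursively drop volatile fields before comparison."""
    if isinstance(response, dict):
        return {
            key: normalize(value, volatile)
            for key, value in response.items()
            if key not in volatile
        }
    if isinstance(response, list):
        return [normalize(item, volatile) for item in response]
    return response


def diff(prod: Any, replay: Any, path: str = "") -> List[str]:
    """Return dotted paths where the two responses disagree."""
    mismatches: List[str] = []
    if isinstance(prod, dict) and isinstance(replay, dict):
        for key in sorted(set(prod) | set(replay)):
            child = f"{path}.{key}" if path else key
            if key not in prod or key not in replay:
                mismatches.append(child)
            else:
                mismatches.extend(diff(prod[key], replay[key], child))
    elif prod != replay:
        mismatches.append(path or "<root>")
    return mismatches


def summarize(pairs: Iterable[Tuple[Any, Any]]) -> Dict[str, float]:
    """Aggregate per-pair diffs into summary metrics."""
    total = 0
    mismatched = 0
    for prod, replay in pairs:
        total += 1
        if diff(normalize(prod), normalize(replay)):
            mismatched += 1
    rate = mismatched / total if total else 0.0
    return {"total": total, "mismatched": mismatched, "mismatch_rate": rate}
```

Lists are compared as whole values here; a fuller version would likely need order-insensitive or element-wise comparison for some fields.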

Comparing Live Traffic Replay Traffic Analysis

Load Testing with Replay Traffic Stress Testing with Replay Traffic Regulating Load Evaluating Performance Metrics

Stateful Systems and Replay Testing Application to Stateful Systems Isolated Data Stores Recording State and Response

Case Study: Netflix's Use of Replay Testing Netflix’s Migration Projects Benefits of Replay Testing Building Confidence through Testing

Move to Production

Canary Deployments Canary Concept Baseline vs. Canary Clusters Monitoring Performance Metrics
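
A baseline-versus-canary check can be sketched as a simple comparison of one metric between the two clusters. The example below uses error rate only; the `ClusterMetrics` type, the 10% tolerance, and the minimum sample size are assumptions chosen for illustration, and real canary analysis typically compares many metrics with statistical tests.

```python
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    """Hypothetical aggregate counters for one cluster."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(
    baseline: ClusterMetrics,
    canary: ClusterMetrics,
    max_relative_increase: float = 0.10,  # assumed tolerance
    min_requests: int = 1000,             # assumed minimum sample size
) -> bool:
    """Pass only if both clusters have enough traffic and the canary's
    error rate stays within the allowed increase over the baseline."""
    if baseline.requests < min_requests or canary.requests < min_requests:
        return False
    allowed = baseline.error_rate * (1 + max_relative_increase)
    return canary.error_rate <= allowed
```

With a baseline error rate of zero this check requires the canary to have zero errors as well, which may be too strict in practice; an absolute floor on the allowed rate is one way to relax it.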

Sticky Canaries Enhancing Traditional Canary Process The Canary Pool Monitoring Broader System Metrics
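
One way to make canary assignment "sticky" is to derive it deterministically from a stable identifier, so the same device lands in the canary pool for the whole experiment. The sketch below assumes hash-based bucketing; `bucket`, `in_canary_pool`, and the experiment salt are hypothetical and not necessarily how the canary pool in the talk is built.

```python
import hashlib

BUCKETS = 10_000  # assumed bucket granularity (0.01% steps)


def bucket(identifier: str, salt: str) -> int:
    """Map an identifier to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(f"{salt}:{identifier}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_canary_pool(device_id: str, experiment: str, pool_fraction: float) -> bool:
    """Return True if this device belongs to the canary pool.

    The answer depends only on the inputs, so repeated requests from
    the same device are routed consistently for a given experiment.
    """
    return bucket(device_id, experiment) < int(pool_fraction * BUCKETS)
```

Salting with the experiment name keeps pools for different migrations independent of each other.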

A/B Testing Verifying Hypotheses Controlled Experiments Divide The Population

Dialing Traffic Traffic Control Dial Function Monitoring and Rollback
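
The dial-and-rollback idea can be sketched as a small state machine that steps the share of traffic sent to the new system up while health checks pass and drops it to zero otherwise. The step percentages, the `TrafficDial` class, and the CRC-based routing are assumptions made for this example.

```python
import zlib
from typing import Sequence


class TrafficDial:
    """Hypothetical dial controlling the percentage of traffic
    routed to the new system."""

    def __init__(self, steps: Sequence[int] = (1, 5, 25, 50, 100)) -> None:
        self.steps = tuple(steps)
        self.index = -1  # -1 means the dial is at 0%

    @property
    def percent(self) -> int:
        return self.steps[self.index] if self.index >= 0 else 0

    def routes_to_new(self, request_key: str) -> bool:
        """Deterministically route a key based on the current dial value."""
        return zlib.crc32(request_key.encode("utf-8")) % 100 < self.percent

    def advance(self, healthy: bool) -> int:
        """Move one step up when metrics are healthy; roll back to 0% otherwise."""
        if not healthy:
            self.index = -1
        elif self.index < len(self.steps) - 1:
            self.index += 1
        return self.percent
```

Usage would be a loop that evaluates monitoring signals after each step and calls `advance(healthy)` with the result before dialing further.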

Migrating Persistent Stores Challenges with Stateful APIs ETL-Based Dual-Write Strategy Dual Reads
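
A rough sketch of dual writes and dual reads during a store migration follows. It uses plain dicts as stand-ins for the old and new data stores and a `backfill` method as a stand-in for the ETL step; all of these names are hypothetical, and the actual ordering and failure handling in a production migration would be more involved.

```python
from typing import Any, Dict, List, Optional


class DualStore:
    """Hypothetical wrapper that keeps an old and a new store in sync
    while the old store remains the source of truth."""

    def __init__(self, old: Dict[str, Any], new: Dict[str, Any]) -> None:
        self.old = old
        self.new = new
        self.read_mismatches: List[str] = []

    def backfill(self) -> None:
        """Stand-in for the ETL job: copy existing rows without
        overwriting anything dual writes already placed in the new store."""
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def write(self, key: str, value: Any) -> None:
        """Dual write: apply every mutation to both stores."""
        self.old[key] = value
        self.new[key] = value

    def read(self, key: str) -> Optional[Any]:
        """Dual read: serve from the old store, compare against the new
        one, and record any key whose values disagree."""
        old_value = self.old.get(key)
        new_value = self.new.get(key)
        if old_value != new_value:
            self.read_mismatches.append(key)
        return old_value
```

Once `read_mismatches` stays empty over a representative window, reads can be switched to the new store with more confidence.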

Clean Up Cleanup and Optimization Removing Migration-Related Code Documentation for Future Migrations

Conclusion
Utilized diverse techniques for various migrations.
Achieved success with minimal downtime.
Gained valuable insights and refined methods.
Customized strategies for unique migration scenarios.
Goal: Seamless migrations without disruptions.

Abhishek Pandey
pandeyabhi1987@gmail.com
pandeyabhi1987
@beinrhythm_abhi
Thank you! Let’s connect.