"Resilient by Design: Strategies for Building Robust Architecture at Uklon", Oleksandr Chumak
fwdays
About This Presentation
It is essential for the development of ride-hailing platforms that software systems not only meet functional requirements but are also resilient against failures, unexpected loads, and external threats.
Resilient architecture ensures that a system can maintain acceptable levels of service even when components fail or endure stress.
This talk covers principles and practices for designing robust software architectures that can adapt to challenges and recover, ensuring continuity and reliability.
Size: 5.67 MB
Language: en
Added: Sep 14, 2024
Slides: 33 pages
Slide Content
Uklon in numbers
● 12130+ Engineers / Product Teams
● 16M Android/iOS downloads
● 1.5M+ Riders MAU
● 1B+ events per day (0.5TB)
● 30+ microservices
● 2 countries
● 30 cities
● ~200 deployments per month
SLO/SLI
● Response time: 200ms (99th percentile), 15ms (50th percentile)
● Availability SLO: 99.97%
● Availability (w/ degraded mode): 99.99% *
An SLA level of 99.97% uptime results in the following periods of allowed downtime/unavailability:
■ Daily: 26s
■ Weekly: 3m 1.4s
■ Monthly: 13m 2.4s
■ Quarterly: 39m 7.3s
■ Yearly: 2h 36m 29s
2 hours/year of total downtime were prevented.
Downtime (1h) = $100k
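For reference, the allowed-downtime figures above follow directly from the availability target. A minimal sketch of the arithmetic (period lengths use common calendar approximations, so the exact seconds can differ slightly from the slide):

```python
# Allowed downtime (error budget) implied by an availability SLO of 99.97%.
SLO = 0.9997

PERIOD_SECONDS = {
    "daily": 24 * 3600,
    "weekly": 7 * 24 * 3600,
    "monthly": 30 * 24 * 3600,
    "quarterly": 91 * 24 * 3600,
    "yearly": 365 * 24 * 3600,
}

for period, seconds in PERIOD_SECONDS.items():
    budget = seconds * (1 - SLO)  # seconds of downtime allowed in the period
    print(f"{period}: {budget:.1f}s (~{budget / 60:.1f} min)")
```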
How do we measure availability?
● Availability (HTTP)
● Availability (product metrics):
  - order placement
  - order acceptance
  - order completion
● Directly tied to user happiness: the metric reflects how service availability impacts user satisfaction.
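A minimal sketch of a product-metric availability SLI of the kind listed above (the counters and numbers are illustrative, not actual Uklon metrics):

```python
def availability_sli(successful: int, attempted: int) -> float:
    """Share of user-facing operations (e.g. order placements) that succeeded."""
    if attempted == 0:
        return 1.0  # no traffic in the window counts as fully available
    return successful / attempted

# Example: order-placement availability over a measurement window,
# compared against the 99.97% availability SLO.
sli = availability_sli(successful=998_740, attempted=999_000)
print(f"{sli:.4%}", sli >= 0.9997)
```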
What are we NOT discussing today?
● Reliability patterns in microservices
● Coordination patterns (orchestration, choreography)
● Consistency (atomic, eventual, idempotency)
● Disaster recovery and cross-region failback
What are we talking about today?
● Metastable failure case studies
● Decoupling through tiers and service layers
● Graceful degradation
● Security
Publicly reported outages (Ukraine 2024)
Publicly reported outages (global 2024)
Publicly reported outages (Uklon 2023)
What Are the Main Causes of Failures?
Metastable Failures in the Wild
● Sustaining effects: the most common sustaining effect is due to the retry policy, affecting more than 50% of the studied incidents.
● Recovery from a metastable failure: direct load shedding, such as throttling, dropping requests, or changing workload parameters, was used in over 55% of the cases.
● Around 45% of observed triggers are due to engineer errors, such as buggy configuration or code deployments, and latent bugs (i.e., undetected preexisting bugs).
● Load spikes are another prominent trigger category, with around 35% of incidents reporting it.
● A significant number of cases (45%) have more than one trigger.
States and transitions of a system experiencing a metastable failure
Metastable failures:
● The class includes many known problems
● Failure amplifiers
● Naturally arise from optimizing the common case
● Over-represented in major distributed systems outages
● Hard to predict
Sustaining effects due to:
● Request retries
● Look-aside cache
● Slow error handling
1. Adopt resilience patterns: use patterns like circuit breaker, retry, bulkhead, and timeout (sketch below).
2. Decentralization: avoid having a single point of failure by decentralizing components.
3. Decoupling: reduce dependencies between system components to minimize the impact of failures.
4. Redundancy and replication: introduce redundancy at various levels of the system so that if one component fails, another can take over. This can include redundant servers, databases, and network paths.
5. Fault isolation: design systems in a way that isolates faults to prevent them from cascading through the entire system. Isolating failures helps contain the impact and allows the rest of the system to continue functioning.
6. Automated recovery: automate recovery processes to reduce the time it takes to restore services after a failure. This includes automated backups, configuration management, and deployment rollback procedures.
7. Graceful degradation: design systems to degrade gracefully in case of failures. When certain components fail, the system should still provide basic functionality rather than breaking completely.
8. Security: protect the system from malicious attacks and unauthorized access, which can lead to failures.
9. Proactive monitoring and alerting: adopt comprehensive monitoring tools to continuously track the health and performance of the system. Design alerts that notify the respective stakeholders when issues are detected, enabling prompt responses to failures.
Key principles associated with designing for failure
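A minimal sketch of how the circuit breaker, retry, and timeout patterns from principle 1 can be combined around one outbound call (the thresholds and the wrapper itself are illustrative assumptions, not Uklon's production code):

```python
import random
import time

class ResilientCaller:
    """Timeout-bounded call with capped, jittered retries and a simple circuit breaker."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, max_retries=3):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.max_retries = max_retries
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, timeout=0.2):
        # Circuit breaker: while the circuit is open, fail fast instead of piling up load.
        if (self._failures >= self.failure_threshold
                and time.monotonic() - self._opened_at < self.open_seconds):
            raise RuntimeError("circuit open: failing fast")

        for attempt in range(self.max_retries):
            try:
                result = fn(timeout=timeout)  # each attempt is bounded by a timeout
                self._failures = 0            # a success closes the circuit again
                return result
            except Exception:
                self._failures += 1
                self._opened_at = time.monotonic()
                if attempt == self.max_retries - 1:
                    raise
                # Capped exponential backoff with jitter keeps retries from
                # becoming a sustaining effect during a metastable failure.
                time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
```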
Decoupling.
Loosely-coupled architecture?
https://www.uber.com/en-AU/blog/microservice-architecture/
Spotify's Architecture at the 2023...
Decoupling #1: Service Tiers
● Tier 4: Non-critical; failures have no significant impact on customer experience or finances (DriverSpeedControl).
● Tier 3: Minor impact; failures are difficult to notice or have limited business effects (DenyList, TrafficJam).
● Tier 2: Important services; failure degrades experience but doesn't prevent interaction (PromoCodes, DriverBonuses, Chats, Feedbacks).
● Tier 1: Critical services; failure causes significant customer impact or financial loss (Ride-hailing, Delivery, Payments).
How to Use Service Tiers in Microservices?
1. Avoid: higher-priority services should not depend on lower-priority services.
2. Must: Tier-1 services relying on lower-priority services must have contingency and failover plans for potential downtime (see the sketch below).
3. Assume: Tier-4 services can assume Tier-1 services will respond. If a Tier-1 service fails, it's acceptable for the Tier-4 service to fail as well, since efforts will prioritize restoring the Tier-1 service.
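A minimal sketch of rule 2: a Tier-1 order flow that calls a lower-tier dependency with a contingency path, so a lower-tier outage degrades the result instead of failing the critical request (PromoCodeClient and the numbers are hypothetical):

```python
from typing import Optional

class PromoCodeClient:
    """Hypothetical Tier-2 dependency; it may be down without blocking orders."""
    def discount_for(self, code: str) -> float:
        raise ConnectionError("promo-code service unavailable")

def place_order(price: float, promo_code: Optional[str], promo: PromoCodeClient) -> float:
    """Tier-1 order placement: tolerate the Tier-2 outage by skipping the discount."""
    discount = 0.0
    if promo_code:
        try:
            discount = promo.discount_for(promo_code)
        except Exception:
            discount = 0.0  # contingency: proceed without the promo discount
    return price * (1.0 - discount)

print(place_order(250.0, "WELCOME10", PromoCodeClient()))  # 250.0 - order still placed
```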
1. Service Level Objectives (SLOs): different levels of service guarantees based on the importance of the customer or the service being provided.
2. Infrastructure updates: starting infrastructure updates from a lower-priority tier can be an effective strategy to minimize risk and ensure a smooth transition.
3. Disaster Recovery Planning (DRP): implement the most robust recovery strategies for Tier-1 assets.
Decoupling #1: How to Apply Tiers in Practice
Decoupling #2: Vertical ownership
● End-to-end responsibility
● Autonomous components
● Faster development and deployment
Decoupling #2: Vertical ownership
● Loosely coupled components
● Components with high cohesion
Decoupling #3: Service Layers. Dependency management at scale.
1. Foundation: provides foundational services like PlaceSearch or Routing.
2. Core: manages essential business logic (Ordering, Delivery, Dispatching).
3. Value Added: processes data from other services to create rich user experiences.
4. Edge: API gateway with BFFs tailored to specific client needs.
Fix “Integration Points”: avoid dependencies in the Core layer
1. Data extensions: replicating data between components via public state events (sketch below).
2. Logic extensions: NuGet packages.
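A minimal sketch of the data-extension idea: a Core service publishes a public state event and a consumer maintains its own replica, so there is no synchronous call into the Core layer (the event shape and names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DriverStateChanged:
    """Hypothetical public state event published by a Core service."""
    driver_id: str
    status: str   # e.g. "online", "busy", "offline"
    version: int  # monotonically increasing per driver

class DriverReplica:
    """Local read model built from events; no runtime dependency on the Core service."""

    def __init__(self):
        self._state = {}  # driver_id -> latest DriverStateChanged

    def apply(self, event: DriverStateChanged) -> None:
        current = self._state.get(event.driver_id)
        if current is None or event.version > current.version:
            self._state[event.driver_id] = event  # ignore stale or out-of-order events

    def status_of(self, driver_id: str) -> Optional[str]:
        event = self._state.get(driver_id)
        return event.status if event else None

replica = DriverReplica()
replica.apply(DriverStateChanged("drv-1", "online", version=3))
print(replica.status_of("drv-1"))  # "online"
```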
Fix “Chain Reaction” with Bulkhead isolation (cell-based)
Deploy the Dispatching service as three independent services, using the same codebase with three different hosts:
1. PregameDispatching
2. BroadcastDispatching
3. OfferDispatching
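A minimal sketch of the cell-based bulkhead: one codebase whose dispatching role is chosen per deployment, so each host is an independent failure domain (the DISPATCHING_ROLE variable and handlers are illustrative assumptions):

```python
import os

# One codebase, three deployments: each host runs exactly one dispatching role,
# so overload or a crash in one cell does not cascade into the others.
ROLE_HANDLERS = {
    "pregame": lambda order_id: f"pregame dispatching for {order_id}",
    "broadcast": lambda order_id: f"broadcast dispatching for {order_id}",
    "offer": lambda order_id: f"offer dispatching for {order_id}",
}

def main() -> None:
    role = os.environ.get("DISPATCHING_ROLE", "offer")
    handler = ROLE_HANDLERS[role]   # an unknown role fails fast at startup
    print(handler("order-42"))      # stand-in for the real message/request loop

if __name__ == "__main__":
    main()
```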
Postmortem-6
Handling and recovering from metastable failures:
● Load shedding
● Throttling
● Drop requests
● Change workload parameters
● Traffic prioritization
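A minimal sketch combining load shedding with traffic prioritization: once a capacity threshold is reached, non-critical requests are rejected immediately while Tier-1 traffic keeps flowing (the threshold and request names are illustrative):

```python
from collections import deque

class PriorityShedder:
    """Shed low-priority work first when the system is saturated."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.queue = deque()

    def submit(self, request: str, critical: bool) -> bool:
        if len(self.queue) >= self.max_depth and not critical:
            return False  # shed: reject non-critical traffic without queueing it
        self.queue.append(request)
        return True

shedder = PriorityShedder(max_depth=2)
print(shedder.submit("place-order", critical=True))    # True
print(shedder.submit("ride-status", critical=True))    # True
print(shedder.submit("promo-banner", critical=False))  # False - shed under load
```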
Graceful degradation
Tolerate Partner and Payment service outages
Critical requests:
● Avoid using the cache as a look-aside cache
● Use the cache for fallback-only scenarios (sketch below)
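A minimal sketch of a fallback-only cache: the primary call is always attempted first, and the cached value is served only when that call fails, which avoids the look-aside pattern's metastable-failure risk (the class and TTL are illustrative assumptions):

```python
import time

class FallbackCache:
    """Cache used only as a fallback, never as a read-through/look-aside cache."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, value)

    def get(self, key, fetch):
        try:
            value = fetch(key)                            # primary path: call the service
            self._store[key] = (time.monotonic(), value)  # refresh the fallback copy
            return value
        except Exception:
            stored = self._store.get(key)                 # degraded path: serve stale data
            if stored and time.monotonic() - stored[0] < self.ttl:
                return stored[1]
            raise

# Usage: keep serving the last known payment methods if the payment service is down.
cache = FallbackCache()
print(cache.get("payment-methods:rider-17", fetch=lambda k: ["card", "cash"]))
```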
Traffic jam
Graceful degradation to historical data:
● Grouping messages by partition key
● Aggregating messages in a hopping window
● MapReduce
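A minimal sketch of grouping by partition key and aggregating in a hopping window, here averaging speed per road segment (window and hop sizes, and the event tuples, are illustrative assumptions):

```python
from collections import defaultdict

WINDOW, HOP = 600, 300  # 10-minute windows hopping every 5 minutes (seconds)

def hopping_windows(ts: int):
    """Yield every (start, end) window that contains the timestamp."""
    start = (ts // HOP) * HOP
    while start > ts - WINDOW:
        yield (start, start + WINDOW)
        start -= HOP

def average_speed(events):
    """events: iterable of (timestamp, segment_id, speed_kmh) ->
    {(segment_id, window): average speed}, grouped by partition key."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, segment, speed in events:
        for window in hopping_windows(ts):
            acc = sums[(segment, window)]
            acc[0] += speed
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

print(average_speed([(10, "seg-1", 30.0), (320, "seg-1", 10.0)]))
```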
How do you implement change?
1. Canary releasing
2. Feature toggles
   a. geography-based
   b. weight-based
Benefits:
● Risk mitigation: reduces the potential impact of bugs by limiting new features to certain regions.
● Customization: supports regional customization and compliance.
● Controlled rollout: allows staged deployment and faster issue identification.
Traffic on new system/feature
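A minimal sketch of geography-based and weight-based feature toggles: the toggle is gated first by city and then by a stable per-user percentage bucket (the toggle names, cities, and 20% rollout are illustrative, not the actual configuration):

```python
import hashlib

TOGGLES = {
    # Hypothetical toggle config: enabled cities plus a weight-based rollout share.
    "new-dispatching": {"cities": {"Kyiv", "Lviv"}, "weight_percent": 20},
}

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket per user, so rollout decisions do not flip between requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def is_enabled(toggle: str, city: str, user_id: str) -> bool:
    cfg = TOGGLES.get(toggle)
    if cfg is None:
        return False
    if city not in cfg["cities"]:                    # geography-based gate
        return False
    return bucket(user_id) < cfg["weight_percent"]   # weight-based gate

print(is_enabled("new-dispatching", city="Kyiv", user_id="rider-123"))
```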
Handling and recovering from metastable failures
Protect the system from malicious attacks and unauthorized access, which can lead to failures:
1. Penetration testing
   a. Grey-box testing
   b. DDoS testing
   c. Phishing testing
2. The principle of least privilege
3. Product security scanning
4. Vulnerability management
Information Security
Contacts
Oleksandr Chumak, CTO
Let's connect on LinkedIn