"Resilient by Design: Strategies for Building Robust Architecture at Uklon", Oleksandr Chumak

fwdays, Sep 14, 2024

About This Presentation

For the development of ride-hailing platforms, it is essential that software systems not only meet functional requirements but are also resilient against failures, unexpected load, and external threats.

Resilient architecture ensures that a system can maintain acceptable levels of service even when certain components fail.


Slide Content

Uklon in numbers
●130+ Engineers, 12 Product Teams
●16M Android/iOS downloads
●1.5M+ Riders MAU
●1B+ events per day (0.5TB)
●30+ microservices
●2 countries, 30 cities
●~200 deployments per month

SLO/SLI
Response time: 200ms (p99), 15ms (p50)
Availability SLO: 99.97%
Availability (w/ degraded mode): 99.99%*
An SLA level of 99.97% uptime allows the following periods of downtime/unavailability:
■Daily: 26s
■Weekly: 3m 1.4s
■Monthly: 13m 2.4s
■Quarterly: 39m 7.3s
■Yearly: 2h 36m 29s
~2 hours/year of total downtime were prevented; downtime (1h) = $100k
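The downtime budget above is simple arithmetic on the availability target. A minimal sketch (ours, not from the talk) that reproduces these numbers:

from datetime import timedelta

def allowed_downtime(availability: float, period: timedelta) -> timedelta:
    # Downtime budget implied by an availability target over a period.
    return timedelta(seconds=period.total_seconds() * (1 - availability))

for name, days in [("daily", 1), ("weekly", 7), ("monthly", 30), ("yearly", 365)]:
    print(name, allowed_downtime(0.9997, timedelta(days=days)))
# daily ~26s, weekly ~3m 1.4s, monthly ~12m 57.6s, yearly ~2h 37m 40.8s
# (small differences from the slide come from the month/year basis used)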

How do we measure availability?
●Availability (HTTP)
●Availability (product metrics): order placement, order acceptance, order completion
●Directly tied to user happiness: the metric reflects how service availability impacts user satisfaction.
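Product-metric availability is a ratio SLI: the share of orders that successfully passed each funnel step. A hedged sketch of the computation (names and counts below are hypothetical):

def sli(good_events: int, total_events: int) -> float:
    # Ratio SLI: fraction of events that met the objective.
    return 1.0 if total_events == 0 else good_events / total_events

# Hypothetical counters over a 30-day window, e.g. from product metrics:
placed_ok, placed_total = 2_991_100, 2_992_000
print(f"order-placement availability: {sli(placed_ok, placed_total):.4%}")
# -> 99.9699%, compared against the 99.97% availability SLO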

What are we NOT discussing today?
●Reliability patterns in microservices
●Coordination patterns (orchestration, choreography)
●Consistency (atomic, eventual, idempotency)
●Disaster recovery and cross-region failback

What are we talking about today?
●Metastable Failure Case Studies
●Decoupling through Tiers and Service Layers
●Graceful degradation
●Security

Publicly reported outages (Ukraine 2024)

Publicly reported outages (global 2024)

Publicly reported outages (Uklon 2023)

What Are the Main Causes of Failures?

Metastable Failures in the Wild
●Sustaining effects: the most common sustaining effect is the retry policy, affecting more than 50% of the studied incidents.
●Recovery from a metastable failure: direct load shedding, such as throttling, dropping requests, or changing workload parameters, was used in over 55% of the cases.
●Around 45% of observed triggers are engineer errors, such as buggy configuration or code deployments, and latent bugs (i.e., undetected preexisting bugs).
●Load spikes are another prominent trigger category, reported in around 35% of incidents.
●A significant number of cases (45%) have more than one trigger.

States and transitions of a system experiencing a metastable failure
Metastable failures:
●The class includes many known problems
●Failure amplifiers
●Naturally arise from optimizing the common case
●Over-represented in major distributed systems outages
●Hard to predict
Sustaining effects due to:
●Request retries
●Look-aside cache
●Slow error handling
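Since retries are the most common sustaining effect, any retry policy should be bounded and jittered. A minimal illustrative sketch (not Uklon's actual policy):

import random
import time

def call_with_retries(op, max_attempts=3, base_delay=0.1, cap=2.0):
    # Capped exponential backoff with full jitter. Bounding attempts and
    # adding jitter keeps retries from synchronizing into the extra load
    # that sustains a metastable failure.
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter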

Key principles associated with designing for failure
1.Resilience patterns: adopt patterns such as circuit breaker, retry, bulkhead, and timeout (a circuit-breaker sketch follows this list).
2.Decentralization: avoid a single point of failure by decentralizing components.
3.Decoupling: reduce dependencies between system components to minimize the impact of failures.
4.Redundancy and replication: introduce redundancy at various levels of the system so that if one component fails, another can take over. This can include redundant servers, databases, and network paths.
5.Fault isolation: design systems so that faults are isolated and cannot cascade through the entire system. Isolating failures contains the impact and lets the rest of the system continue functioning.
6.Automated recovery: automate recovery processes to reduce the time it takes to restore services after a failure. This includes automated backups, configuration management, and deployment rollback procedures.
7.Graceful degradation: design systems to degrade gracefully; when certain components fail, the system should still provide basic functionality rather than breaking completely.
8.Security: protect the system from malicious attacks and unauthorized access, which can lead to failures.
9.Proactive monitoring and alerting: adopt comprehensive monitoring tools to continuously track the health and performance of the system. Design alerts that notify the relevant stakeholders when issues are detected, enabling prompt responses to failures.
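As a concrete example of the first principle, here is a minimal circuit-breaker sketch; it is illustrative only (production systems typically use a library such as Polly or resilience4j):

import time

class CircuitBreaker:
    # Open after N consecutive failures; after a cool-down, let one
    # probe call through ("half-open") and close again on success.

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = op()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result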

Decoupling.
Loosely-coupled architecture?
https://www.uber.com/en-AU/blog/microservice-architecture/
Spotify's Architecture at the 2023...

Decoupling #1: Service Tiers
●Tier 4: Non-critical; failures have no significant impact on customer experience or finances (DriverSpeedControl).
●Tier 3: Minor impact; failures are difficult to notice or have limited business effects (DenyList, TrafficJam).
●Tier 2: Important services; failure degrades experience but doesn't prevent interaction (PromoCodes, DriverBonuses, Chats, Feedbacks).
●Tier 1: Critical services; failure causes significant customer impact or financial loss (Ride-hailing, Delivery, Payments).

How to Use Service Tiers in Microservices?
1.Avoid: higher-priority services should not depend on lower-priority services (a dependency-check sketch follows this list).
2.Must: Tier-1 services relying on lower-priority services must have contingency and failover plans for potential downtime.
3.Assume: Tier-4 services can assume Tier-1 services will respond. If a Tier-1 service fails, it's acceptable for the Tier-4 service to fail as well, since efforts will prioritize restoring the Tier-1 service.
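These rules can be enforced mechanically, e.g. in CI over a service catalog. A small sketch with hypothetical service names and tier numbers (1 = most critical, as in the slides):

# Hypothetical tier map over a service catalog.
TIERS = {"ride-hailing": 1, "payments": 1, "promo-codes": 2, "deny-list": 3}

def check_dependency(service: str, dependency: str) -> None:
    # Rule 1: a service must not depend on a lower-priority
    # (higher tier number) service without a fallback plan.
    if TIERS[dependency] > TIERS[service]:
        raise ValueError(
            f"{service} (tier {TIERS[service]}) depends on "
            f"{dependency} (tier {TIERS[dependency]}): needs a fallback"
        )

check_dependency("promo-codes", "payments")    # ok: tier 2 -> tier 1
check_dependency("ride-hailing", "deny-list")  # raises: tier 1 -> tier 3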

Decoupling #1: How to Apply Tiers in Practice
1.Service Level Objectives (SLOs): different levels of service guarantees based on the importance of the customer or the service being provided.
2.Infrastructure updates: starting infrastructure updates from a lower-priority tier can be an effective strategy to minimize risk and ensure a smooth transition.
3.Disaster Recovery Planning (DRP): implement the most robust recovery strategies for Tier-1 assets.

Decoupling #2: Vertical ownership
●End-to-end responsibility
●Autonomous components
●Faster development and deployment
(Diagram: three vertical slices labeled Tier 1, Tier 1, Tier 3.)

Decoupling #2: Vertical ownership
●Loosely coupled components
●Components with high cohesion

Decoupling #3: Service Layers. Dependency management at scale.
1.Foundation: provides foundational services like PlaceSearch or Routing.
2.Core: manages essential business logic (Ordering, Delivery, Dispatching).
3.Value Added: processes data from other services to create rich user experiences.
4.Edge: API gateway with BFFs tailored to specific client needs.

Graceful degradation

Dispatching: overall architecture
Antipatterns:
●Integration Points
●Chain Reaction
●Cascading Failures
●Self-Denial Attack

Fix “Integration Points”
Avoid dependencies in the Core layer:
1.Data extensions: replicate data between components via public state events (see the sketch below).
2.Logic extensions: NuGet packages.
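A sketch of the data-extension idea: the consuming service maintains a local replica fed by public state events instead of calling the owning service synchronously, removing the integration point (event names and shapes below are hypothetical):

# Local read-only replica kept up to date by public state events.
local_partners = {}

def on_partner_state_event(event: dict) -> None:
    # Apply a public state event to the local replica.
    if event["type"] == "PartnerUpserted":
        local_partners[event["id"]] = event["state"]
    elif event["type"] == "PartnerDeleted":
        local_partners.pop(event["id"], None)

on_partner_state_event({"type": "PartnerUpserted", "id": "p-1", "state": {"rating": 4.9}})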

Fix “Chain Reaction” with Bulkhead isolation (cell-based)
Deploy the Dispatching service as three independent services:
1.PregameDispatching
2.BroadcastDispatching
3.OfferDispatching
(using the same codebase with three different hosts)
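One way to run the same codebase as three isolated cells is to pick the role at startup; everything below (the env var name, the queue names) is hypothetical:

import os

# Same codebase, three independently deployed hosts: the cell this
# process serves is chosen at startup, so a failure in one cell
# cannot take down the others.
ROLE_QUEUES = {
    "pregame": ["pregame-orders"],
    "broadcast": ["broadcast-offers"],
    "offer": ["direct-offers"],
}

role = os.environ.get("DISPATCHING_ROLE", "offer")  # hypothetical setting
print(f"starting Dispatching host for cell '{role}', consuming {ROLE_QUEUES[role]}")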

Postmortem-6

Handling and recovering from metastable failures
●Load shedding
  -Throttling
  -Dropping requests
  -Changing workload parameters
●Traffic prioritization
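A sketch of priority-aware load shedding tied to the product-critical requests named earlier (thresholds and request names are illustrative, not Uklon's values):

CRITICAL = {"order-placement", "order-acceptance", "order-completion"}

def admit(request_kind: str, utilization: float) -> bool:
    # Shed lower-priority traffic first as utilization (0..1) grows.
    if utilization < 0.8:
        return True                       # normal operation: admit all
    if utilization < 1.0:
        return request_kind in CRITICAL   # brownout: critical traffic only
    return False                          # hard overload: shed everything

print(admit("driver-bonuses", 0.9))   # False: non-critical is shed
print(admit("order-placement", 0.9))  # True: critical still admitted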

Graceful degradation
Tolerate Partner and Payment service outages for critical requests:
●Avoid using the cache as a look-aside cache
●Use the cache for fallback-only scenarios
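The distinction matters because a look-aside cache that empties during an incident multiplies load on the recovering dependency, a classic sustaining effect. A fallback-only sketch (function and store names are hypothetical):

def get_partner_profile(partner_id, fetch, cache):
    # Fallback-only cache: always hit the source of truth first and
    # refresh the cached copy; read the cache only when the dependency
    # is down. Normal traffic therefore never depends on hit rate,
    # unlike a look-aside cache.
    try:
        profile = fetch(partner_id)   # primary call to the owning service
        cache[partner_id] = profile   # keep the fallback copy fresh
        return profile
    except Exception:
        if partner_id in cache:
            return cache[partner_id]  # possibly stale, but keeps requests flowing
        raise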

Traffic jam
Graceful degradation to historical data:
●Group messages by partition key
●Aggregate messages in a hopping window
●MapReduce
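A minimal sketch of the hopping-window aggregation (window sizes, keys, and values are illustrative; e.g. key = road segment, value = observed speed):

from collections import defaultdict

def hopping_average(events, size=300, hop=60):
    # Average `value` per (window_start, key): each window is `size`
    # seconds long and windows start every `hop` seconds, so one event
    # lands in size/hop overlapping windows.
    sums = defaultdict(lambda: [0.0, 0])
    for ts, key, value in events:
        start = (int(ts) // hop) * hop   # latest window start <= ts
        while start > ts - size:         # every window containing ts
            if start >= 0:
                acc = sums[(start, key)]
                acc[0] += value
                acc[1] += 1
            start -= hop
    return {k: acc / n for k, (acc, n) in sums.items()}

print(hopping_average([(10, "seg-1", 42.0), (70, "seg-1", 38.0)], size=120, hop=60))
# {(0, 'seg-1'): 40.0, (60, 'seg-1'): 38.0}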

How do you implement change?
1.Canary releasing
2.Feature Toggles
a.geography-based
b.weight-based


Benefits:
●Risk mitigation: reduces the potential impact of bugs by limiting new features to certain regions.
●Customization: supports regional customization and compliance.
●Controlled rollout: allows staged deployment and faster issue identification.
(Chart: traffic on the new system/feature over time.)
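Weight-based toggles are typically implemented with stable hashing so each user keeps the same variant across requests; geography-based toggles work the same way keyed on region. A minimal sketch (feature name and IDs are hypothetical):

import hashlib

def toggle_enabled(feature: str, user_id: str, weight: float) -> bool:
    # Weight-based toggle: deterministically bucket each user so that
    # `weight` (0..1) of users see the feature, and any one user's
    # assignment is stable across requests.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < weight

# Roll the hypothetical "new-dispatching" flow out to 10% of riders:
print(toggle_enabled("new-dispatching", "rider-123", 0.10))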

Handling and recovering from metastable failures


Information Security
Protect the system from malicious attacks and unauthorized access, which can lead to failures:
1.Penetration testing
a.Grey-box testing
b.DDoS testing
c.Phishing testing
2.The Principle of Least Privilege
3.Product security scanning
4.Vulnerability management

Contacts
Oleksandr Chumak, CTO
Let's connect on LinkedIn