"Resilient by Design: Strategies for Building Robust Architecture at Uklon", Oleksandr Chumak
fwdays
About This Presentation
It is essential for the development of ride-hailing platforms that software systems not only meet functional requirements but are also resilient against failures, unexpected loads, and external threats.
Resilient architecture ensures that a system can maintain acceptable levels of service even when components fail or endure stress.
This talk covers principles and practices for designing robust software architectures that can adapt to challenges and recover, ensuring continuity and reliability.
Size: 5.67 MB
Language: en
Added: Sep 14, 2024
Slides: 33 pages
Slide Content
Uklon in numbers
● 12130+ Engineers / Product Teams
● 16M Android/iOS downloads
● 1.5M+ Riders MAU
● 1B+ events per day (0.5TB)
● 30+ microservices
● 2 countries
● 30 cities
● ~200 deployments per month
SLO/SLI
● Response time: 200ms (99th percentile), 15ms (50th percentile)
● Availability SLO: 99.97%
● Availability (w/ degraded mode): 99.99% *
An SLA level of 99.97% uptime results in the following periods of allowed downtime/unavailability:
■ Daily: 26s
■ Weekly: 3m 1.4s
■ Monthly: 13m 2.4s
■ Quarterly: 39m 7.3s
■ Yearly: 2h 36m 29s
2 hours/year of total downtime were prevented.
Downtime (1h) = $100k
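For reference, the allowed-downtime figures above follow directly from the availability target. A minimal sketch of the arithmetic (period lengths use common calendar approximations, so the exact seconds can differ slightly from the slide):

```python
# Allowed downtime (error budget) implied by an availability SLO of 99.97%.
SLO = 0.9997

PERIOD_SECONDS = {
    "daily": 24 * 3600,
    "weekly": 7 * 24 * 3600,
    "monthly": 30 * 24 * 3600,
    "quarterly": 91 * 24 * 3600,
    "yearly": 365 * 24 * 3600,
}

for period, seconds in PERIOD_SECONDS.items():
    budget = seconds * (1 - SLO)  # seconds of downtime allowed in the period
    print(f"{period}: {budget:.1f}s (~{budget / 60:.1f} min)")
```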
How do we measure availability?
● Availability (HTTP)
● Availability (product metrics):
  - order placement
  - order acceptance
  - order completion
● Directly tied to user happiness: the metric reflects how service availability impacts user satisfaction.
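A minimal sketch of a product-metric availability SLI of the kind listed above (the counters and numbers are illustrative, not actual Uklon metrics):

```python
def availability_sli(successful: int, attempted: int) -> float:
    """Share of user-facing operations (e.g. order placements) that succeeded."""
    if attempted == 0:
        return 1.0  # no traffic in the window counts as fully available
    return successful / attempted

# Example: order-placement availability over a measurement window,
# compared against the 99.97% availability SLO.
sli = availability_sli(successful=998_740, attempted=999_000)
print(f"{sli:.4%}", sli >= 0.9997)
```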
What are we NOT discussing today?
● Reliability patterns in microservices
● Coordination patterns (orchestration, choreography)
● Consistency (atomic, eventual, idempotency)
● Disaster recovery and cross-region failback
What are we talking about today?
● Metastable failure case studies
● Decoupling through tiers and service layers
● Graceful degradation
● Security
Publicly reported outages (Ukraine 2024)
Publicly reported outages (global 2024)
Publicly reported outages (Uklon 2023)
What Are the Main Causes of Failures?
Metastable Failures in the Wild
● Sustaining effects: the most common sustaining effect is due to the retry policy, affecting more than 50% of the studied incidents.
● Recovery from a metastable failure: direct load shedding, such as throttling, dropping requests, or changing workload parameters, was used in over 55% of the cases.
● Around 45% of observed triggers are due to engineer errors, such as buggy configuration or code deployments, and latent bugs (i.e., undetected preexisting bugs).
● Load spikes are another prominent trigger category, with around 35% of incidents reporting it.
● A significant number of cases (45%) have more than one trigger.
States and transitions of a system experiencing a metastable failure
Metastable failures:
● The class includes many known problems
● Failure amplifiers
● Naturally arise from optimizing the common case
● Over-represented in major distributed systems outages
● Hard to predict
Sustaining effects due to:
● Request retries
● Look-aside cache
● Slow error handling
1. Adopt resilience patterns: use patterns like circuit breaker, retry, bulkhead, and timeout (sketch below).
2. Decentralization: avoid having a single point of failure by decentralizing components.
3. Decoupling: reduce dependencies between system components to minimize the impact of failures.
4. Redundancy and replication: introduce redundancy at various levels of the system so that if one component fails, another can take over. This can include redundant servers, databases, and network paths.
5. Fault isolation: design systems in a way that isolates faults to prevent them from cascading through the entire system. Isolating failures helps contain the impact and allows the rest of the system to continue functioning.
6. Automated recovery: automate recovery processes to reduce the time it takes to restore services after a failure. This includes automated backups, configuration management, and deployment rollback procedures.
7. Graceful degradation: design systems to degrade gracefully in case of failures. When certain components fail, the system should still provide basic functionality rather than breaking completely.
8. Security: protect the system from malicious attacks and unauthorized access, which can lead to failures.
9. Proactive monitoring and alerting: adopt comprehensive monitoring tools to continuously track the health and performance of the system. Design alerts that notify the respective stakeholders when issues are detected, enabling prompt responses to failures.
Key principles associated with designing for failure
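A minimal sketch of how the circuit breaker, retry, and timeout patterns from principle 1 can be combined around one outbound call (the thresholds and the wrapper itself are illustrative assumptions, not Uklon's production code):

```python
import random
import time

class ResilientCaller:
    """Timeout-bounded call with capped, jittered retries and a simple circuit breaker."""

    def __init__(self, failure_threshold=5, open_seconds=30.0, max_retries=3):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.max_retries = max_retries
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, timeout=0.2):
        # Circuit breaker: while the circuit is open, fail fast instead of piling up load.
        if (self._failures >= self.failure_threshold
                and time.monotonic() - self._opened_at < self.open_seconds):
            raise RuntimeError("circuit open: failing fast")

        for attempt in range(self.max_retries):
            try:
                result = fn(timeout=timeout)  # each attempt is bounded by a timeout
                self._failures = 0            # a success closes the circuit again
                return result
            except Exception:
                self._failures += 1
                self._opened_at = time.monotonic()
                if attempt == self.max_retries - 1:
                    raise
                # Capped exponential backoff with jitter keeps retries from
                # becoming a sustaining effect during a metastable failure.
                time.sleep((2 ** attempt) * 0.1 + random.uniform(0, 0.1))
```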
Decoupling.
Loosely-coupled architecture?
https://www.uber.com/en-AU/blog/microservice-architecture/
Spotify's Architecture at the 2023...
Decoupling #1: Service Tiers
● Tier 4: Non-critical; failures have no significant impact on customer experience or finances (DriverSpeedControl).
● Tier 3: Minor impact; failures are difficult to notice or have limited business effects (DenyList, TrafficJam).
● Tier 2: Important services; failure degrades experience but doesn't prevent interaction (PromoCodes, DriverBonuses, Chats, Feedbacks).
● Tier 1: Critical services; failure causes significant customer impact or financial loss (Ride-hailing, Delivery, Payments).
How to Use Service Tiers in Microservices?
1. Avoid: higher-priority services should not depend on lower-priority services.
2. Must: Tier-1 services relying on lower-priority services must have contingency and failover plans for potential downtime (see the sketch below).
3. Assume: Tier-4 services can assume Tier-1 services will respond. If a Tier-1 service fails, it's acceptable for the Tier-4 service to fail as well, since efforts will prioritize restoring the Tier-1 service.
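A minimal sketch of rule 2: a Tier-1 order flow that calls a lower-tier dependency with a contingency path, so a lower-tier outage degrades the result instead of failing the critical request (PromoCodeClient and the numbers are hypothetical):

```python
from typing import Optional

class PromoCodeClient:
    """Hypothetical Tier-2 dependency; it may be down without blocking orders."""
    def discount_for(self, code: str) -> float:
        raise ConnectionError("promo-code service unavailable")

def place_order(price: float, promo_code: Optional[str], promo: PromoCodeClient) -> float:
    """Tier-1 order placement: tolerate the Tier-2 outage by skipping the discount."""
    discount = 0.0
    if promo_code:
        try:
            discount = promo.discount_for(promo_code)
        except Exception:
            discount = 0.0  # contingency: proceed without the promo discount
    return price * (1.0 - discount)

print(place_order(250.0, "WELCOME10", PromoCodeClient()))  # 250.0 - order still placed
```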
1. Service Level Objectives (SLOs): different levels of service guarantees based on the importance of the customer or the service being provided.
2. Infrastructure updates: starting infrastructure updates from a lower-priority tier can be an effective strategy to minimize risk and ensure a smooth transition.
3. Disaster Recovery Planning (DRP): implement the most robust recovery strategies for Tier-1 assets.
Decoupling #1: How to Apply Tiers in Practice
Decoupling #2: Vertical ownership
● End-to-end responsibility
● Autonomous components
● Faster development and deployment
Decoupling #2: Vertical ownership
● Loosely coupled components
● Components with high cohesion
Decoupling #3: Service Layers. Dependency management at scale.
1. Foundation: provides foundational services like PlaceSearch or Routing.
2. Core: manages essential business logic (Ordering, Delivery, Dispatching).
3. Value Added: processes data from other services to create rich user experiences.
4. Edge: API gateway with BFFs tailored to specific client needs.
Fix “Integration Points”: avoid dependencies in the Core layer
1. Data extensions: replicating data between components via public state events (sketch below).
2. Logic extensions: NuGet packages.
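A minimal sketch of the data-extension idea: a Core service publishes a public state event and a consumer maintains its own replica, so there is no synchronous call into the Core layer (the event shape and names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DriverStateChanged:
    """Hypothetical public state event published by a Core service."""
    driver_id: str
    status: str   # e.g. "online", "busy", "offline"
    version: int  # monotonically increasing per driver

class DriverReplica:
    """Local read model built from events; no runtime dependency on the Core service."""

    def __init__(self):
        self._state = {}  # driver_id -> latest DriverStateChanged

    def apply(self, event: DriverStateChanged) -> None:
        current = self._state.get(event.driver_id)
        if current is None or event.version > current.version:
            self._state[event.driver_id] = event  # ignore stale or out-of-order events

    def status_of(self, driver_id: str) -> Optional[str]:
        event = self._state.get(driver_id)
        return event.status if event else None

replica = DriverReplica()
replica.apply(DriverStateChanged("drv-1", "online", version=3))
print(replica.status_of("drv-1"))  # "online"
```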
Fix “Chain Reaction” with Bulkhead isolation (cell-based)
Deploy the Dispatching service as three independent services, using the same codebase with three different hosts:
1. PregameDispatching
2. BroadcastDispatching
3. OfferDispatching
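A minimal sketch of the cell-based bulkhead: one codebase whose dispatching role is chosen per deployment, so each host is an independent failure domain (the DISPATCHING_ROLE variable and handlers are illustrative assumptions):

```python
import os

# One codebase, three deployments: each host runs exactly one dispatching role,
# so overload or a crash in one cell does not cascade into the others.
ROLE_HANDLERS = {
    "pregame": lambda order_id: f"pregame dispatching for {order_id}",
    "broadcast": lambda order_id: f"broadcast dispatching for {order_id}",
    "offer": lambda order_id: f"offer dispatching for {order_id}",
}

def main() -> None:
    role = os.environ.get("DISPATCHING_ROLE", "offer")
    handler = ROLE_HANDLERS[role]   # an unknown role fails fast at startup
    print(handler("order-42"))      # stand-in for the real message/request loop

if __name__ == "__main__":
    main()
```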
Postmortem-6
Handling and recovering from metastable failures:
● Load shedding
● Throttling
● Drop requests
● Change workload parameters
● Traffic prioritization
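A minimal sketch combining load shedding with traffic prioritization: once a capacity threshold is reached, non-critical requests are rejected immediately while Tier-1 traffic keeps flowing (the threshold and request names are illustrative):

```python
from collections import deque

class PriorityShedder:
    """Shed low-priority work first when the system is saturated."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.queue = deque()

    def submit(self, request: str, critical: bool) -> bool:
        if len(self.queue) >= self.max_depth and not critical:
            return False  # shed: reject non-critical traffic without queueing it
        self.queue.append(request)
        return True

shedder = PriorityShedder(max_depth=2)
print(shedder.submit("place-order", critical=True))    # True
print(shedder.submit("ride-status", critical=True))    # True
print(shedder.submit("promo-banner", critical=False))  # False - shed under load
```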
Graceful degradation
Tolerate Partner and Payment service outages
Critical requests:
● Avoid using the cache as a look-aside cache
● Use the cache for fallback-only scenarios (sketch below)
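A minimal sketch of a fallback-only cache: the primary call is always attempted first, and the cached value is served only when that call fails, which avoids the look-aside pattern's metastable-failure risk (the class and TTL are illustrative assumptions):

```python
import time

class FallbackCache:
    """Cache used only as a fallback, never as a read-through/look-aside cache."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, value)

    def get(self, key, fetch):
        try:
            value = fetch(key)                            # primary path: call the service
            self._store[key] = (time.monotonic(), value)  # refresh the fallback copy
            return value
        except Exception:
            stored = self._store.get(key)                 # degraded path: serve stale data
            if stored and time.monotonic() - stored[0] < self.ttl:
                return stored[1]
            raise

# Usage: keep serving the last known payment methods if the payment service is down.
cache = FallbackCache()
print(cache.get("payment-methods:rider-17", fetch=lambda k: ["card", "cash"]))
```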
Traffic jam
Graceful degradation to historical data:
● Grouping messages by partition key
● Aggregating messages in a hopping window
● MapReduce
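A minimal sketch of grouping by partition key and aggregating in a hopping window, here averaging speed per road segment (window and hop sizes, and the event tuples, are illustrative assumptions):

```python
from collections import defaultdict

WINDOW, HOP = 600, 300  # 10-minute windows hopping every 5 minutes (seconds)

def hopping_windows(ts: int):
    """Yield every (start, end) window that contains the timestamp."""
    start = (ts // HOP) * HOP
    while start > ts - WINDOW:
        yield (start, start + WINDOW)
        start -= HOP

def average_speed(events):
    """events: iterable of (timestamp, segment_id, speed_kmh) ->
    {(segment_id, window): average speed}, grouped by partition key."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, segment, speed in events:
        for window in hopping_windows(ts):
            acc = sums[(segment, window)]
            acc[0] += speed
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

print(average_speed([(10, "seg-1", 30.0), (320, "seg-1", 10.0)]))
```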
How do you implement change?
1. Canary releasing
2. Feature toggles
   a. geography-based
   b. weight-based
Benefits:
● Risk mitigation: reduces the potential impact of bugs by limiting new features to certain regions.
● Customization: supports regional customization and compliance.
● Controlled rollout: allows staged deployment and faster issue identification.
Traffic on new system/feature
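A minimal sketch of geography-based and weight-based feature toggles: the toggle is gated first by city and then by a stable per-user percentage bucket (the toggle names, cities, and 20% rollout are illustrative, not the actual configuration):

```python
import hashlib

TOGGLES = {
    # Hypothetical toggle config: enabled cities plus a weight-based rollout share.
    "new-dispatching": {"cities": {"Kyiv", "Lviv"}, "weight_percent": 20},
}

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket per user, so rollout decisions do not flip between requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def is_enabled(toggle: str, city: str, user_id: str) -> bool:
    cfg = TOGGLES.get(toggle)
    if cfg is None:
        return False
    if city not in cfg["cities"]:                    # geography-based gate
        return False
    return bucket(user_id) < cfg["weight_percent"]   # weight-based gate

print(is_enabled("new-dispatching", city="Kyiv", user_id="rider-123"))
```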
Handling and recovering from metastable failures
Protect the system from malicious attacks and unauthorized access, which can lead to failures:
1. Penetration testing
   a. Grey-box testing
   b. DDoS testing
   c. Phishing testing
2. The principle of least privilege
3. Product security scanning
4. Vulnerability management
Information Security
Contacts
Oleksandr Chumak, CTO
Let's connect on LinkedIn