What Flow Metrics Teaches Us About Designing Resilient Systems by Mourjo Sen
ScyllaDB
129 views
79 slides
Mar 10, 2025
About This Presentation
Flow metrics help distinguish valuable work ("goodput") from total work done ("throughput"). This talk explores how flow metrics improve agile teams and system resilience by tracking velocity, time, efficiency, and load—key to designing scalable systems that handle real-world demands efficiently.
Size: 8.48 MB
Language: en
Added: Mar 10, 2025
Slides: 79 pages
Slide Content
A ScyllaDB Community
What Flow Metrics Teaches Us About Designing Resilient Systems
Mourjo Sen
Senior Software Engineer
About Me
■Senior Software Engineer at Booking.com
■Building backend systems for 10 years
■Writing about system resilience since 2020
mourjo.medium.com
Everything Fails. All the time.
– Werner Vogels
Metrics Should Help Solve Problems
■Latency increase (impact)
■Increase in the number of requests
■Everything is okay (200 status codes)
■No impact on processing time
■Disconnect between metrics and value proposition
github.com/mourjo/monster-scale-2025
Software: A Means to Deliver Business Value
Building Software: An Analogy
■Tasks to do
■Teams work on the tasks
■Assembled product
■Business Value
What is Flow Metrics?
Business Value Does Not Magically Appear
■The customer wants the final result, not individual work items
■Flow is the movement of work items from potential to concrete value
■Flow metrics measure flow
Flow Metrics in Software Engineering
■WIP
■Age
■Cycle Time
■Throughput
Flow Metrics in Systems
■WIP
■Age
■Cycle Time
■Throughput
Users, Requests, Servers
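The slides name the four metrics without defining them for a request/server system. A minimal sketch of one common reading follows, assuming each request records an arrival time and (once done) a finish time. The `Request` class and the function names are hypothetical and do not come from the talk or its repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    arrived_at: float                    # time the request entered the system (seconds)
    finished_at: Optional[float] = None  # None while the request is still in progress


def _in_progress(r: Request, now: float) -> bool:
    # A request is in progress if it has arrived and has not finished by `now`.
    return r.arrived_at <= now and (r.finished_at is None or r.finished_at > now)


def wip(requests: List[Request], now: float) -> int:
    """WIP: number of requests in the system at `now`."""
    return sum(1 for r in requests if _in_progress(r, now))


def ages(requests: List[Request], now: float) -> List[float]:
    """Age: time each in-progress request has spent in the system so far."""
    return [now - r.arrived_at for r in requests if _in_progress(r, now)]


def cycle_times(requests: List[Request], now: float) -> List[float]:
    """Cycle time: arrival-to-finish duration of each request finished by `now`."""
    return [
        r.finished_at - r.arrived_at
        for r in requests
        if r.finished_at is not None and r.finished_at <= now
    ]


def throughput(requests: List[Request], start: float, end: float) -> float:
    """Throughput: finished requests per second in the window (start, end]."""
    done = sum(
        1 for r in requests
        if r.finished_at is not None and start < r.finished_at <= end
    )
    return done / (end - start)
```

Under this reading, age applies to requests still in flight, while cycle time applies only to requests that have completed, so age can be observed before the client sees any impact.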
Applying Flow Metrics to Software Systems
Flow Metrics Applied to Systems
Scenario 1: Spike in Incoming Requests
■Increase in WIP
■Increase in age
■Increase in cycle time
Scenario 2: Degraded Dependency
■Increase in WIP
■Increase in age
■Decrease in throughput
■Increase in cycle time
Flow Metrics Applied to Systems
■If there is congestion
■WIP increases
■Age increases
■Cycle time will increase
■WIP and age are leading indicators
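One way to see why these quantities move together is Little's Law. This is a standard queueing-theory result that the slides do not state, so it is offered here only as supporting intuition. It relates long-run averages in a stable system:

```latex
\text{WIP} = \text{throughput} \times \text{cycle time}
\quad\Longrightarrow\quad
\text{cycle time} = \frac{\text{WIP}}{\text{throughput}}
```

With illustrative numbers, 200 requests in progress at a throughput of 100 requests per second implies an average cycle time of 2 seconds. If WIP doubles while throughput stays flat, the average cycle time doubles as well.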
Detecting Congestion with Flow Metrics
■Leading indicators
■Requests spend most of their age just waiting
■There are too many requests in the system
■Impact observed by the client
Just detecting congestion is not good enough
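The two leading indicators named on these slides can be sketched as a simple check. This is an illustration only: the function names and the thresholds (`max_wip`, `max_wait_fraction`) are hypothetical and are not values from the talk.

```python
from typing import List, Optional


def wait_fraction(arrived_at: float, started_at: Optional[float], now: float) -> float:
    """Share of a request's age spent waiting for a worker rather than being processed.

    `started_at` is None if no worker has picked the request up yet.
    """
    age = now - arrived_at
    if age <= 0:
        return 0.0
    waited = (started_at if started_at is not None else now) - arrived_at
    return waited / age


def is_congested(
    wip: int,
    wait_fractions: List[float],
    max_wip: int = 100,              # hypothetical threshold
    max_wait_fraction: float = 0.5,  # hypothetical threshold
) -> bool:
    """Flag congestion when there are too many requests in the system,
    or when requests spend most of their age just waiting."""
    too_many = wip > max_wip
    mostly_waiting = (
        bool(wait_fractions)
        and sum(wait_fractions) / len(wait_fractions) > max_wait_fraction
    )
    return too_many or mostly_waiting
```

Both inputs are available before any request has finished, which is what makes them leading indicators rather than after-the-fact measurements.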
On-the-fly Resilience with Flow Metrics
Beyond Monitoring: Limiting WIP and Age
■Server Threads
■Queue length slows down flow of business value
■Gatekeeper Threads
■Limited Queue Size
■Age filter
■Worker Threads
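The limited queue and the age filter can be sketched as below. This is a single-threaded illustration with hypothetical names (`WipAgeLimiter`, `Rejected`); it is not the Spring Boot implementation in the talk's repository, and a real server would need thread-safe structures.

```python
import time
from collections import deque


class Rejected(Exception):
    """Raised when the queue is full, so the caller can fail fast."""


class WipAgeLimiter:
    def __init__(self, max_queue_size, max_age_seconds, clock=time.monotonic):
        self.max_queue_size = max_queue_size    # WIP limit
        self.max_age_seconds = max_age_seconds  # age limit
        self.expired = []                       # requests dropped by the age filter
        self._queue = deque()
        self._clock = clock

    def submit(self, request):
        """Gatekeeper step: accept the request only if the queue has room."""
        if len(self._queue) >= self.max_queue_size:
            raise Rejected("queue is full")
        self._queue.append((self._clock(), request))

    def next_request(self):
        """Worker step: return the next request that is still young enough,
        skipping (and recording) any that waited longer than the age limit."""
        while self._queue:
            enqueued_at, request = self._queue.popleft()
            if self._clock() - enqueued_at <= self.max_age_seconds:
                return request
            self.expired.append(request)
        return None
```

A caller might translate `Rejected` and expired requests into an explicit overload response (for example an HTTP 503; the status code is an assumption here) rather than letting the client wait indefinitely.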
Why Limit WIP and Age?
■Bounded wait-time for clients
■Effective communication when there is congestion
■Faster recovery after incident
Result 1: Spike in Incoming Requests
■Default Spring Boot Server: unresponsive; time to recover is longer; pretends everything is 200 OK
■WIP-Age Limiting Server: responds in time; recovers immediately; graceful degradation
github.com/mourjo/monster-scale-2025
Result 2: Degraded Dependency
■Default Spring Boot Server: unresponsive; time to recover is longer; pretends everything is 200 OK
■WIP-Age Limiting Server: responds in time; recovers immediately; graceful degradation
github.com/mourjo/monster-scale-2025
End-to-End Resilience with Flow Metrics
Local Measurements to Global Resilience
Measure the flow of individual requests, not the server’s capacity
Independent of System Characteristics
■Machines in a Cluster
■Memory and Garbage Collection
■Distribution of request types
■Number of downstream dependencies
■CPU Utilization
■…
Resilience By Design is an Insurance Policy
Your Service Is at Risk
■Sudden burst of traffic
■Slow dependency
■Latency bugs
■…
Flow metrics detect and protect your service against cascading failures
Your System Is at Risk
Flow metrics detect and protect your system against cascading failures
An Insurance Policy: Resilience With Flow Metrics
■Acknowledgement of limits of a system
■Effective communication during outages
■Independent of system characteristics
■Simple to implement with little overhead
Embed Flow of Business Value into System Design
github.com/mourjo/monster-scale-2025