What Flow Metrics Teaches Us About Designing Resilient Systems by Mourjo Sen

ScyllaDB 129 views 79 slides Mar 10, 2025

About This Presentation

Flow metrics help distinguish valuable work ("goodput") from total work done ("throughput"). This talk explores how flow metrics improve agile teams and system resilience by tracking velocity, time, efficiency, and load—key to designing scalable systems that handle real-world d...


Slide Content

A ScyllaDB Community
What Flow Metrics
Teaches Us About Designing
Resilient Systems
Mourjo Sen
Senior Software Engineer

About Me
■Senior Software Engineer at Booking.com
■Building backend systems for 10 years
■Writing about system resilience since 2020

mourjo.medium.com

Everything Fails. All the time.
– Werner Vogels

Metrics Should Help Solve Problems
■Latency increase (impact)
■Increase in the number of requests
■Everything is okay (200 status codes)
■No impact on processing time
■Disconnect between metrics and value proposition
github.com/mourjo/monster-scale-2025

Software: A Means to Deliver
Business Value

Building Software: An Analogy
■Tasks to do
■Teams work on the tasks
■Assembled product
■Business Value

What is Flow Metrics?

Business Value Does Not Magically Appear
■The customer wants the final result, not individual work items
■Flow is the movement of work items from potential to concrete value
■Flow metrics measure flow

Flow Metrics in Software Engineering
■WIP
■Age
■Cycle Time
■Throughput

Flow Metrics in Systems
■WIP
■Age
■Cycle Time
■Throughput
Requests, Users, Servers
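
One way to read the four metrics for a request-serving system is sketched below in Java. It assumes the usual definitions: WIP is the number of requests currently in the system, age is the time an unfinished request has spent so far, cycle time is the start-to-finish time of a completed request, and throughput is completions per unit of time. The class FlowTracker and its method names are hypothetical and are not taken from the talk's repository.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker for the four flow metrics; timestamps are passed in explicitly. */
class FlowTracker {
    private final Map<String, Long> startedAtMs = new HashMap<>(); // in-flight requests
    private long completed = 0;
    private long totalCycleTimeMs = 0;

    /** A request enters the system. */
    void start(String requestId, long nowMs) {
        startedAtMs.put(requestId, nowMs);
    }

    /** A request leaves the system; its cycle time is recorded. */
    void finish(String requestId, long nowMs) {
        Long started = startedAtMs.remove(requestId);
        if (started == null) {
            return; // unknown or already finished request
        }
        completed++;
        totalCycleTimeMs += nowMs - started;
    }

    /** WIP: how many requests are in the system right now. */
    int wip() {
        return startedAtMs.size();
    }

    /** Age: how long the oldest unfinished request has been in the system. */
    long oldestAgeMs(long nowMs) {
        long oldest = 0;
        for (long started : startedAtMs.values()) {
            oldest = Math.max(oldest, nowMs - started);
        }
        return oldest;
    }

    /** Cycle time: average start-to-finish time of completed requests. */
    double averageCycleTimeMs() {
        return completed == 0 ? 0.0 : (double) totalCycleTimeMs / completed;
    }

    /** Throughput: completed requests per second since windowStartMs. */
    double throughputPerSecond(long windowStartMs, long nowMs) {
        long elapsedMs = nowMs - windowStartMs;
        return elapsedMs <= 0 ? 0.0 : completed * 1000.0 / elapsedMs;
    }
}
```

Age applies to requests that are still in progress and cycle time to requests that have finished, which is why the tracker can report age before any client has seen a slow response.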

Applying Flow Metrics to
Software Systems

Flow Metrics Applied to Systems

Scenario 1: Spike in Incoming Requests
■Increase in WIP
■Increase in age
■Increase in cycle time

Scenario 2: Degraded Dependency
■Increase in WIP
■Increase in age
■Decrease in throughput
■Increase in cycle time

Flow Metrics Applied to Systems
■If there is congestion
■WIP increases
■Age increases
■Cycle time will increase
Leading indicators
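
The toy simulation below illustrates why WIP can act as a leading indicator in the spike scenario. It is a sketch under assumptions that are not from the talk: time advances in ticks, one server completes exactly one request per tick in FIFO order, and one request arrives per tick except during a spike. The class SpikeSimulation and its parameters are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model: a single FIFO server that completes one request per tick. */
class SpikeSimulation {
    /** When WIP peaked versus when the cycle time of completed requests peaked. */
    static final class Result {
        int peakWip;
        int peakWipTick;
        int peakCycleTime;
        int peakCycleTimeTick;
    }

    /** spikeArrivals must be at least 1 so the server always has work. */
    static Result run(int ticks, int spikeStart, int spikeEnd, int spikeArrivals) {
        Deque<Integer> queue = new ArrayDeque<>(); // arrival tick of each request in the system
        Result result = new Result();
        for (int tick = 0; tick < ticks; tick++) {
            int arrivals = (tick >= spikeStart && tick < spikeEnd) ? spikeArrivals : 1;
            for (int i = 0; i < arrivals; i++) {
                queue.addLast(tick);
            }
            // The server finishes the oldest request in this tick.
            int arrivedAt = queue.pollFirst();
            int cycleTime = tick - arrivedAt + 1;
            if (cycleTime > result.peakCycleTime) {
                result.peakCycleTime = cycleTime;
                result.peakCycleTimeTick = tick;
            }
            // WIP is what is still in the system at the end of the tick.
            if (queue.size() > result.peakWip) {
                result.peakWip = queue.size();
                result.peakWipTick = tick;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = run(60, 10, 20, 3);
        System.out.println("WIP peaked at " + r.peakWip + " on tick " + r.peakWipTick);
        System.out.println("Cycle time peaked at " + r.peakCycleTime + " on tick " + r.peakCycleTimeTick);
    }
}
```

With a spike of 3 arrivals per tick from tick 10 to tick 19, WIP reaches its peak of 20 at tick 19, while the worst cycle time of 21 ticks is first observed at tick 39. In this model the backlog is visible well before clients experience the slowest responses.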

Detecting Congestion with Flow Metrics
■Impact observed by the client
■Leading Indicators
■Requests spend most of their age just waiting
■There are too many requests in the system
■Just detecting congestion is not good enough

On-the-fly Resilience with
Flow Metrics

Beyond Monitoring: Limiting WIP and Age
■Queue length slows down flow of business value
■Server Threads
■Gatekeeper Threads
■Limited Queue Size
■Age filter
■Worker Threads
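
The labels above suggest a bounded queue in front of the worker threads, plus a filter that discards requests that have already waited too long. A minimal sketch of that idea follows. It is not the code from the talk's repository: the class WipAgeLimiter, its method names, and the explicit clock parameter are assumptions made to keep the example deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

/** Hypothetical queue that limits WIP (queue size) and age (time spent waiting). */
class WipAgeLimiter<T> {
    private static final class Entry<T> {
        final T request;
        final long enqueuedAtMs;

        Entry(T request, long enqueuedAtMs) {
            this.request = request;
            this.enqueuedAtMs = enqueuedAtMs;
        }
    }

    private final Deque<Entry<T>> queue = new ArrayDeque<>();
    private final int maxQueueSize;
    private final long maxAgeMs;
    private int rejected;
    private int expired;

    WipAgeLimiter(int maxQueueSize, long maxAgeMs) {
        this.maxQueueSize = maxQueueSize;
        this.maxAgeMs = maxAgeMs;
    }

    /** Gatekeeper side: refuses new (non-null) requests once the queue is full. */
    synchronized boolean offer(T request, long nowMs) {
        if (queue.size() >= maxQueueSize) {
            rejected++;
            return false; // the caller can tell the client right away
        }
        queue.addLast(new Entry<>(request, nowMs));
        return true;
    }

    /** Worker side: returns the next request that is still young enough to be worth doing. */
    synchronized Optional<T> poll(long nowMs) {
        Entry<T> entry;
        while ((entry = queue.pollFirst()) != null) {
            if (nowMs - entry.enqueuedAtMs <= maxAgeMs) {
                return Optional.of(entry.request);
            }
            expired++; // age filter: the client has likely given up already
        }
        return Optional.empty();
    }

    synchronized int rejected() {
        return rejected;
    }

    synchronized int expired() {
        return expired;
    }
}
```

With these two limits, no request that a worker picks up has waited longer than maxAgeMs, and the backlog can never grow beyond maxQueueSize.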

Why Limit WIP and Age?
■Bounded wait-time for clients
■Effective communication when there is congestion
■Faster recovery after incident
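
As an illustration of communicating congestion to clients, the sketch below uses the standard java.util.concurrent.ThreadPoolExecutor with a bounded ArrayBlockingQueue, so that a full queue is reported immediately instead of letting requests pile up. The class BoundedExecutorDemo and the mapping to status codes 202 and 503 are assumptions for illustration only; the talk's repository may do this differently.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: one worker thread and a queue that holds a single waiting task. */
class BoundedExecutorDemo {
    /** Hands work to the pool and returns an illustrative HTTP-like status code. */
    static int submit(ThreadPoolExecutor pool, Runnable work) {
        try {
            pool.execute(work);
            return 202; // accepted for processing
        } catch (RejectedExecutionException e) {
            return 503; // queue is full: say so instead of making the client wait
        }
    }

    static int[] demo() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy()); // throws when the queue is full
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Runnable slow = () -> {
            started.countDown();
            try {
                release.await(); // simulates a slow dependency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            int first = submit(pool, slow);
            started.await();                     // the only worker is now busy
            int second = submit(pool, () -> { }); // waits in the queue
            int third = submit(pool, () -> { });  // no room left: rejected
            return new int[] {first, second, third};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] statuses = demo();
        System.out.println(statuses[0] + " " + statuses[1] + " " + statuses[2]);
    }
}
```

The third submission fails fast, which gives the client a clear signal to back off or retry later rather than waiting on a response that may never arrive in time.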

Result 1: Spike in Incoming Requests
github.com/mourjo/monster-scale-2025
■Default Spring Boot Server: unresponsive; time to recover is longer; pretends everything is 200 OK
■WIP-Age Limiting Server: responds in time; recovers immediately; graceful degradation

Result 2: Degraded Dependency
github.com/mourjo/monster-scale-2025
■Default Spring Boot Server: unresponsive; time to recover is longer; pretends everything is 200 OK
■WIP-Age Limiting Server: responds in time; recovers immediately; graceful degradation

End-to-End Resilience with
Flow Metrics

Local Measurements to Global Resilience
Measure the flow of individual requests, not the server’s capacity

Independent of System Characteristics
■Machines in a Cluster
■Memory and Garbage Collection
■Distribution of type of requests
■Number of downstream dependencies
■CPU Utilization
■…

Resilience By Design is an
Insurance Policy

Your Service Is at Risk
■Sudden burst of traffic
■Slow dependency
■Latency bugs
■…
Flow metrics detects and protects your service against cascading failures

Your System Is at Risk
■Sudden burst of traffic
■Slow dependency
■Latency bugs
■…
Flow metrics detects and protects your system against cascading failures

An Insurance Policy: Resilience With Flow Metrics
■Acknowledgement of limits of a system
■Effective communication during outages
■Independent of system characteristics
■Simple to implement with little overhead

Embed Flow of Business Value
into System Design
github.com/mourjo/monster-scale-2025

Thank you!
Questions?
linkedin.com/in/mourjo

https://mourjo.me
github.com/mourjo
