Chicago AWS Architectural Resilience Day 2024

awschicago 317 views 161 slides Sep 10, 2024

About This Presentation

September 2024: the first-ever community Resilience Day in Chicago.
See the video recording on the AWSChicago YouTube channel: https://youtu.be/z4camus_96c

Thank you, presenters from AWS and PWC.


Slide Content

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Chicago AWS User Group – September 5th
Presented by the AWS Team

Housekeeping Items
Food, Drinks, and Breaks
•Food and beverages will be served at 12:15
•Restrooms are located at the center
Other Details
•You will each be given one AWS account for the labs. This will be active for the duration of the event.
•Guest WIFI password: Applied3715#
•Look for support staff if you need help with anything at all
Feedback
•Please provide us feedback so that we know how to improve these days going forward

Agenda
Time                 Session
09:00AM - 09:15AM    Opening Remarks
09:15AM - 09:30AM    Presentation | Introduction to Resilience on AWS
09:30AM - 09:45AM    Presentation | Setting Objectives
09:45AM - 10:30AM    Workshop | Setting Objectives with AWS Resilience Hub
10:30AM - 10:45AM    Break
10:45AM - 11:45AM    Presentation | Designing and Implementing
11:45AM - 12:15PM    Workshop | Designing and Implementing for High Availability
12:15PM - 01:15PM    Lunch and <Partner> Session (12:30PM - 12:50PM)

Agenda (continued)
Time                 Session
01:15PM - 01:45PM    Workshop | Designing and Implementing for Disaster Recovery
01:45PM - 02:15PM    Presentation | Evaluating and Testing
02:15PM - 03:00PM    Presentation | Operating
03:00PM - 03:15PM    Break
03:15PM - 04:00PM    Workshop | Evaluating and Testing with AWS Fault Injection Service
04:00PM - 04:15PM    Presentation | Responding and Learning
04:15PM - 04:35PM    Workshop | Responding and Learning
04:35PM - 05:00PM    Thank you and Closing

Your team today

Introduction to Resilience
Justin Higgins
Sr. Solutions Architect, AWS

“Resilience equals revenue” (Gartner, 2023)
Companies realize the importance of resilience in today’s technological landscape:
•Financial cost: Fortune 1000 companies lose an estimated $1.5B–$2.5B annually due to unplanned system downtime (IDC)
•Brand cost: Beyond financial cost, there is also a brand cost

Customer Story
Resilient architecture built on AWS reduced the overall downtime of a company that faced the CrowdStrike outage
•The CrowdStrike agent was installed on EC2 Windows instances that were part of an Auto Scaling group
•When the Amazon EC2 Auto Scaling health check determined that an InService instance was unhealthy, it replaced it with a new instance
•Their total outage was reduced to about 30 mins while the CrowdStrike agent update was being pushed

A model for resilience
ABILITY OF A WORKLOAD TO RECOVER FROM INFRASTRUCTURE OR SERVICE DISRUPTIONS
The mental model
•High availability: Resistance to common failures through design and operational mechanisms at a primary site. Core services, design goals to meet availability goals.
•Disaster recovery: Returning to normal operation within specific targets at a recovery site for failures that cannot be handled by HA. Backup and recovery, data bunkering, managed recovery objectives.
•Continuous improvement: CI/CD, observability, moving beyond pre-deployment testing towards chaos engineering patterns.

Recovery objectives
•Recovery point objective (RPO), data loss: How much data can you afford to recreate or lose?
•Recovery time objective (RTO), downtime: How quickly must you recover? What is the cost of downtime?
Timeline: Data loss (RPO) → Disaster → Downtime (RTO)

Recovery Time Objective (RTO)
Recovery Point Objective (RPO)

Q: How do we build systems that will never fail?
A: Systems will always fail. It’s what happens when systems fail that matters.

Definitions
Reliability: Ability of a workload to perform its required function correctly and consistently… (Reliability Pillar, AWS Well-Architected Framework)
Resilience: Ability of a workload to recover from infrastructure or service disruptions…

People, Process and Technology
Success: People, Process, Technology
PRO TIP: Most customers start with technology. But, without people and process, the tech doesn’t matter. Testing and validation are crucial.

Shared Responsibility for Resilience
CUSTOMER: RESPONSIBILITY FOR RESILIENCE ‘IN’ THE CLOUD
•NETWORKING, QUOTAS, AND CONSTRAINTS
•WORKLOAD ARCHITECTURE
•CONTINUOUS TESTING OF CRITICAL INFRASTRUCTURE
•OBSERVABILITY AND FAILURE MANAGEMENT
•CHANGE MANAGEMENT AND OPERATIONAL RESILIENCE
AWS: RESPONSIBILITY FOR RESILIENCE ‘OF’ THE CLOUD
•HARDWARE AND SERVICES: COMPUTE, STORAGE, DATABASE, NETWORKING
•AWS GLOBAL INFRASTRUCTURE: REGIONS, AVAILABILITY ZONES, EDGE LOCATIONS

Resilience ‘of’ the Cloud
A CULTURE BUILT AROUND RESILIENCE
•Safe, continuous deployment: Minimizes impact on production caused by faulty deployments
•Correction of Error (CoE) processes: Helps teams understand the root cause and prevents recurrence
•Operational Readiness Reviews (ORR): Ensures compliance with best practices prior to a service launch
•Service ownership model: Incentivizes continuous improvement of operations

Resilience ‘in’ the Cloud
RESILIENCE LIFECYCLE FRAMEWORK
Set objectives → Design and implement → Evaluate and test → Operate → Respond and learn
Key Outputs

Resources
AWS Resilience Lifecycle Framework

I know (or was told) that we need to be resilient. Where do I start?

Set Objectives
Set objectives → Design and implement → Evaluate and test → Operate → Respond and learn

The Business Impact of Failure
WHAT ARE THE RISKS?
•A leading U.S. airline lost $150M in profits due to an outage (Forbes)
•Cost/hour of a critical application failure (IDC)
•Global community platform lost $8.2M in revenue due to an outage (Reuters)
•Average cost of infrastructure failure is $100K per hour among Fortune 1000 (IDC)
•Animation studio deleted almost completed movie. Studio had backups, but they had failed to work for the past month. (Independent)

Traditional Resilience Model
Tier      Availability  Loss Impact                                RTO / RPO
Silver    99.95%        Standard business systems                  24 Hours / 12 Hours
Gold      99.99%        Mission/Safety/Security/Business Critical  1 Hour / 30 Minutes
Platinum  99.995%       National/Critical infrastructure           15 Minutes / 1 Minute
PRO TIP: Customers typically use a combination of revenue, regulation, and reputation to define these. This is typically done through a risk-ranking business function.
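
To make the availability tiers concrete, each percentage can be converted into a yearly downtime allowance. The following is a minimal sketch, assuming a 365-day year; the function name allowed_downtime_minutes is hypothetical and not part of the deck.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, no leap day


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime (in minutes) permitted by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


# Tier names and percentages are taken from the Traditional Resilience Model slide.
TIERS = {"Silver": 99.95, "Gold": 99.99, "Platinum": 99.995}

for tier, pct in TIERS.items():
    print(f"{tier}: {allowed_downtime_minutes(pct):.1f} minutes/year")
```

Under these assumptions the three tiers allow roughly 263, 53, and 26 minutes of downtime per year, which is why each extra "nine" tends to demand a different architecture rather than a small tweak.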

Impact – Entire System
ORDER MANAGEMENT SYSTEM
Diagram components: Customer, Authenticate, Browse, Search, Payment Service, Inventory Service, Warehouse Service, Delivery Service, ML Personalize Service, Product Catalog
Third Parties: Credit Card Company, Delivery Providers

Impact – Critical User Journey
IDENTIFYING THE CRITICAL PATH IN YOUR WORKLOAD
Diagram components: Customer, Authenticate, Browse, Search, Payment Service, Inventory Service, Warehouse Service, Delivery Service, ML Personalize Service, Product Catalog
Third Parties: Credit Card Company, Delivery Providers
Diagram callout: Critical Path – Focus Here

High availability
About application availability
Smaller scale, more frequent events:
•Component failures
•Network issues
•Load spikes
Usually automated mitigation or ‘self healing’
Measures mean over time

Availability
DEFINING AVAILABILITY USING SLOS AND ERROR BUDGET
Amazon SLO example (not real numbers):
“In a 28-day trailing window, we will serve 99.9% of requests with a latency of less than 1000 milliseconds.”
Error budget:
•0.1% of requests
•Consumed when out of range (> 1000 ms)
•Rate of consumption: Burn rate
•Replenished when bad requests age out (> 28 days)
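
The error-budget arithmetic can be sketched in a few lines. This is an illustrative sketch only: the function names and request counts are hypothetical, and the burn-rate definition (observed error rate divided by allowed error rate) is the common SRE convention rather than something the slide spells out.

```python
SLO_TARGET = 0.999   # 99.9% of requests under 1000 ms (from the example SLO)
WINDOW_DAYS = 28     # trailing window from the example SLO


def error_budget(total_requests: int, slo_target: float = SLO_TARGET) -> float:
    """Number of out-of-range requests the window can absorb (0.1% of requests)."""
    return total_requests * (1 - slo_target)


def burn_rate(bad_requests: int, total_requests: int, slo_target: float = SLO_TARGET) -> float:
    """Observed error rate over a measurement interval, relative to the allowed rate.

    1.0 consumes the budget in exactly one window; higher values consume it faster.
    """
    return (bad_requests / total_requests) / (1 - slo_target)


def days_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """Days until a full budget is gone if the burn rate is sustained."""
    return window_days / rate


# Hypothetical numbers: 10M requests per window, and a bad hour with 360 slow requests.
print(f"budget: {error_budget(10_000_000):.0f} requests")
rate = burn_rate(bad_requests=360, total_requests=120_000)
print(f"burn rate: {rate:.1f}, budget gone in {days_to_exhaustion(rate):.1f} days")
```

Alerting on burn rate rather than on raw error counts lets a short, severe incident and a long, mild one be compared on the same scale.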

Disaster Recovery
About business continuity
Larger scale, less frequent events:
•Natural disasters
•Technical failures
•Human actions
Measures a one-time event:
•Recovery Time
•Recovery Point

Recovery Point and Recovery Time Objective (RPO/RTO)
•Recovery Point (RPO): How much data can you afford to recreate or lose?
•Recovery Time (RTO): How quickly must you recover? What is the cost of downtime?
Timeline: Data loss (Recovery Point) → Disaster → Down time (Recovery Time)
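
As a worked example of the two measures, the sketch below computes the recovery point and recovery time actually achieved in an incident and compares them with the objectives. The timestamps and function names are hypothetical; the targets reuse the 1 hour / 30 minutes pair from the Gold tier shown earlier.

```python
from datetime import datetime, timedelta


def achieved_recovery(last_good_backup: datetime, disaster: datetime, service_restored: datetime):
    """Return (recovery point, recovery time) achieved in one incident."""
    recovery_point = disaster - last_good_backup  # data written in this span is lost
    recovery_time = service_restored - disaster   # span during which the service was down
    return recovery_point, recovery_time


def meets_objectives(recovery_point: timedelta, recovery_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True when both the data-loss and the downtime targets were met."""
    return recovery_point <= rpo and recovery_time <= rto


# Hypothetical incident timeline.
rp, rt = achieved_recovery(
    last_good_backup=datetime(2024, 9, 5, 9, 40),
    disaster=datetime(2024, 9, 5, 10, 0),
    service_restored=datetime(2024, 9, 5, 10, 45),
)
ok = meets_objectives(rp, rt, rpo=timedelta(minutes=30), rto=timedelta(hours=1))
print(f"recovery point={rp}, recovery time={rt}, objectives met={ok}")
```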

Key Takeaways
•Know your business objectives before designing a system
•Ensure you have the key stakeholders in alignment on your approach

Resources
Building Resilience One Step at a Time
Learn how the Flexibility of AWS Opens New Doors for Business Continuity
AWS Resilience Hub

Lab #1 – Set Objectives

Many thanks to our sponsor partner!

Before Starting
•Sign out from your personal or corporate AWS accounts.
•Disconnect your VPN connection (if possible).
•Disable automatic translation in your browser.
•Have an easily accessible email address.

Things to Know
•As a participant, you will have access to an AWS account with any pre-provisioned infrastructure and IAM policies necessary to complete this workshop.
•The AWS account will be available for the duration of the event. After that, you will lose access to the account.
•Make sure to review the terms and conditions of the event. Do not upload any personal or confidential information to the account.

Workshop Studio Access
Navigate to the following URL: https://bit.ly/4e8T0KI
After accepting the T&C:
•Access the AWS Console from the left menu
•Access the workshop instructions from the Get started button

Architecture & Objectives
•Analyze the resilience posture of our architecture using AWS Resilience Hub
•Review the changes proposed by AWS Resilience Hub
•Deploy some recommendations

Q: What are the key resilience patterns I need to consider when designing a resilient system?

Design and Implement
Ryan Baker
Principal Technical Account Manager, Amazon Web Services
Set objectives → Design and implement → Evaluate and test → Operate → Respond and learn

Pillars of AWS Well-Architected
•Cost Optimization
•Reliability
•Security
•Operational Excellence
•Performance Efficiency
•Sustainability

Reliability Pillar – Four Areas of Best Practices
•Foundations
•Workload architecture
•Change management
•Failure management

Desired Resilience Properties
•Fault isolation
•Sufficient capacity
•Timely output
•Correct output
•Redundancy

Resilience Trade-offs
•Operational burden
•Complexity
•Consistency & latency
•Cost & effort

Resilient Architecture

Q: Since failures will happen, how do we limit the impact when they do?

Fault Isolation Boundaries
Physical and logical boundaries:
•AWS Partition
•AWS Region
•Availability Zone
•Instance
•AWS Account
•Microservices
•Cells

AWS Regions and Availability Zones (AZs) - Physical
AWS REGIONS ARE PHYSICAL LOCATIONS AROUND THE WORLD WHERE WE CLUSTER DATA CENTERS
•A Region is a physical location in the world
•Each AWS Region has multiple AZs
•Each AZ includes one or more discrete data centers
•Data centers, each with redundant power, networking, and connectivity, housed in separate facilities
Figure labels: Regions worldwide, AWS Regions, Announced Regions, AZ, Data center, Transit

Cellular Architecture - Logical
•A cellular architecture is a design pattern where a service is split into multiple deployment stacks, each called a “cell”
•Each cell is an independent instance of the service
•Each cell is assigned a portion of data (one or more customers)
•Each cell independently services the full workload of its assigned customers
•Cells share nothing, including databases
Figure: a Service with a Cell router in front of Cell 0, Cell 1, … Cell n; each cell has its own Application Load Balancer, Compute, and Storage
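
The cell router's job can be illustrated with a small sketch. This is one possible approach, not the routing scheme used by any AWS service: the CellRouter class, the hash-modulo assignment, and the endpoint names are all assumptions made for illustration.

```python
import hashlib


class CellRouter:
    """Maps each customer to exactly one cell; cells share nothing."""

    def __init__(self, cell_endpoints: list[str]):
        if not cell_endpoints:
            raise ValueError("at least one cell is required")
        self._cells = list(cell_endpoints)

    def cell_for(self, customer_id: str) -> str:
        # A stable hash keeps a customer on the same cell across requests and restarts.
        # Python's built-in hash() is salted per process, so it is not used here.
        digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._cells)
        return self._cells[index]


# Hypothetical cell endpoints.
CELLS = ["cell-0.internal.example", "cell-1.internal.example", "cell-2.internal.example"]
router = CellRouter(CELLS)
print(router.cell_for("customer-42"))
```

Hash-modulo assignment reshuffles customers whenever a cell is added, so a production router would more likely keep an explicit customer-to-cell mapping. Either way, the router is kept as thin as possible because it is the one component shared by every cell.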

Control Planes and Data Planes
Control plane
•Create, update, delete, list, and describe resources
•Complex orchestration with many dependencies
•Lower volume
Data plane
•The day-to-day business of the resource
•Simpler, fewer dependencies
•Higher volume

Control Plane and Data Plane Example
Control Plane: Launching an Instance (RunInstances)
Data Plane: EC2 Instances Running

Q: Which of these is going to fail more often?
A: The control plane.
Control planes tend to be more complex and thus statistically have a higher likelihood of failure.

Q: Which of these is more critical?
A: The data plane.
Data planes provide the “day-to-day” business of the service.

Static Stability
Static stability means a system can operate in a static state and continue to operate as normal without the need to make changes during the failure or unavailability of a dependency.

Q: But what does that MEAN?

Static Stability Example
•If using 2 Availability Zones (AZs): Provision enough EC2 capacity such that the one remaining AZ can handle 100% of your workload (Availability Zone A: 100%, Availability Zone B: 100%)
•If using 3 Availability Zones (AZs): Provision enough EC2 capacity such that the two remaining AZs can handle 100% of your workload (Availability Zone A: 50%, Availability Zone B: 50%, Availability Zone C: 50%)
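
The 100% and 50% figures follow from dividing peak load across the AZs that remain after one is lost. A minimal sketch of that arithmetic, assuming peak load is expressed as a number of identical instances (the function name and the 12-instance peak are hypothetical):

```python
import math


def instances_per_az(peak_instances: int, az_count: int) -> int:
    """Instances to run in each AZ so the remaining AZs carry peak load after losing one AZ."""
    if az_count < 2:
        raise ValueError("static stability across AZs needs at least 2 AZs")
    return math.ceil(peak_instances / (az_count - 1))


PEAK = 12  # hypothetical: instances needed to serve 100% of peak load

for azs in (2, 3):
    per_az = instances_per_az(PEAK, azs)
    total = per_az * azs
    print(f"{azs} AZs: {per_az} per AZ, {total} total ({total / PEAK:.0%} of peak)")
```

With two AZs the fleet is provisioned at 200% of peak, with three at 150%. The extra capacity is already running, so losing an AZ requires no control-plane call to launch instances.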

AWS Service Types
UNDERSTANDING THE DIFFERENCES
•Zonal services (Amazon EC2, Amazon EBS, Amazon RDS). Control plane: Regional. Data plane: Zonal.
•Regional services (Amazon S3, Amazon DynamoDB, Amazon SQS). Control plane: Regional. Data plane: Regional.
•Global services (IAM, Amazon CloudFront, Amazon Route 53, AWS Global Accelerator). Control plane: Single Region. Data plane: Global.

IAM Control Plane and Data Plane Example
•Control plane: Manage IAM identities and policies (IAM CreateRole)
•Data plane: Authentication and authorization for all AWS service API calls (Auth Runtime Service, ARS)
Figure labels: IAM, CreateRole, Propagators, three Regions (each labeled ARC), Amazon S3, Amazon SQS, Amazon DynamoDB

Avoid Control Plane Actions for Recovery
GLOBAL SERVICES
•IAM. What will work: AuthN/AuthZ of signed AWS requests. What may not work: CRUD-L IAM policies, roles, users.
•Route 53. What will work: DNS resolution and health checks. What may not work: Updates to routing policies and records, creating health checks.
•Route 53 ARC. What will work: Listing and updating routing controls. What may not work: CRUD-L cluster endpoints.
•CloudFront. What will work: Will continue to cache and serve content, perform origin failover. What may not work: CRUD-L CloudFront distributions.
•Organizations. What will work: Service control policy (SCP) evaluation. What may not work: View or update organization structure, modify SCPs.
•Global Accelerator. What will work: Edge routing will continue to function, existing traffic dials and endpoint weights. What may not work: Add/modify endpoints, change traffic dials.

Amazon Route 53 best practices
Don’t do this: fail over by calling ChangeResourceRecordSets to repoint www.example.com from us-east-1.example.com to us-west-2.example.com
Do this: keep www.example.com records for both Regions, with health checks on us-east-1.example.com and us-west-2.example.com

Amazon Route 53 best practices
www.example.com in two Regions, with health checks on us-east-1.example.com and us-west-2.example.com
Figure labels:
•CloudWatch Alarm
•Initiated by application health
•Initiated by a person
•Application recovery controller
•Target an object in an S3 bucket

Resilient Software Design Patterns

Failure-oriented Patterns
Client (Front-End)
•Timeouts
•Retries w/back off
•Jitter
•Limit queue sizes
•Caching
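
Three of the client-side patterns (timeouts, retries with backoff, jitter) usually appear together. The sketch below shows one way to combine them; call_with_retries and its parameters are hypothetical names, and "full jitter" (a random sleep between zero and the backoff cap) is one of several jitter strategies rather than the one the deck prescribes.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(); on failure, retry with capped exponential backoff and full jitter.

    `operation` is expected to enforce its own timeout (for example a socket or
    HTTP client timeout) so that a hung dependency surfaces as an exception.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error to the caller
            # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
            backoff = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: a random fraction of the backoff, so clients do not retry in lockstep.
            sleep(rng() * backoff)


# Hypothetical dependency that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky_operation, base_delay=0.01))
```

Bounding max_attempts matters as much as the backoff itself, because unbounded retries multiply load on a dependency that is already struggling.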

Failure-oriented Patterns
Service (Back-End)
•Rate limit
•Rejection (load shedding)
•Caching
•Circuit breaker
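
Of the service-side patterns, the circuit breaker is the most stateful, so a sketch may help. This is a simplified illustration under stated assumptions: the CircuitBreaker class, its thresholds, and the single-trial half-open behavior are hypothetical choices, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so this one trial call is let through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
print(breaker.call(lambda: "healthy response"))
```

Failing fast while the circuit is open protects the caller's threads and gives the dependency room to recover, at the cost of rejecting some requests that might have succeeded.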

Simple Designs and Constant Work
Figure: PUSH (Users, Configuration agent, four Nodes) vs PULL (Users, Configuration agent, Amazon S3 bucket, four Nodes)
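
The pull model can be sketched as follows. This is an interpretation of the figure, not code from the deck: ConfigStore is an in-memory stand-in for the Amazon S3 bucket, and the class and method names are hypothetical. The point of the sketch is that each node does the same amount of work on every cycle, regardless of how much configuration changed.

```python
class ConfigStore:
    """In-memory stand-in for the Amazon S3 bucket in the figure (assumption)."""

    def __init__(self):
        self._document = {}

    def put(self, document: dict) -> None:
        self._document = dict(document)  # the agent always writes the full document

    def get(self) -> dict:
        return dict(self._document)


class Node:
    def __init__(self, store: ConfigStore):
        self._store = store
        self.config = {}
        self.pulls = 0

    def poll_once(self) -> None:
        # Same work every cycle: fetch and apply the whole document,
        # whether zero or a thousand settings changed since the last pull.
        self.config = self._store.get()
        self.pulls += 1


store = ConfigStore()
nodes = [Node(store) for _ in range(4)]
store.put({"feature_x": True, "max_connections": 100})

for _ in range(3):  # three polling cycles; a real node would sleep between cycles
    for node in nodes:
        node.poll_once()

print(nodes[0].config, nodes[0].pulls)
```

Because load does not spike when many changes arrive at once, the busiest moment for this design looks the same as its quietest one.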

Asynchronous (Static Stability)
Figure: two pairs of System 1 and System 2, one pair labeled Async/Indirect Synchronization

Graceful Degradation (Static Stability)
Figure annotations:
•Missing frequently bought together
•Missing the item description
•Can still add to cart
•Can still browse images
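
The annotations describe a page that keeps its core function when optional dependencies fail. A minimal sketch of that idea, assuming a hypothetical render_product_page function whose fetchers are passed in as callables:

```python
def render_product_page(product_id, fetch_core, fetch_description, fetch_recommendations):
    """Build a product page, dropping optional sections whose dependency fails."""
    # Hard dependency: without the core product data there is no page, so errors propagate.
    page = {"product": fetch_core(product_id)}
    optional_sections = {
        "description": fetch_description,
        "frequently_bought_together": fetch_recommendations,
    }
    for section, fetch in optional_sections.items():
        try:
            page[section] = fetch(product_id)
        except Exception:
            page[section] = None  # soft dependency: render the page without this section
    return page


# Hypothetical fetchers: core data is available, the optional services are down.
def core_ok(product_id):
    return {"id": product_id, "title": "Widget", "images": ["front.png"], "can_add_to_cart": True}


def unavailable(product_id):
    raise TimeoutError("dependency timed out")


page = render_product_page("sku-1", core_ok, unavailable, unavailable)
print(page)
```

Deciding up front which dependencies are hard and which are soft is the design work; the code that follows from that decision is small.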

Avoid Bimodal – Fallback Processing
Prevent bimodal behavior:
Bimodal behavior is when your workload exhibits different behavior under normal and failure modes
https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/?did=ba_card&trk=ba_card
Figure: Consumer, Service, PayPal (marked X), fallback to Venmo
Bimodal fallback (the Venmo path only runs when PayPal fails):
    try:
        paypal.pay(who, howMuch)  # paypal processing
    except Exception:
        venmo.pay(telNo, howMuch, fromWhere)  # venmo processing
Convert to something that regularly exercises both paths:
    try:
        if requestNum % 100 == 0:
            venmo.pay(telNo, howMuch, fromWhere)
        else:
            paypal.pay(who, howMuch)
    except Exception:
        venmo.pay(telNo, howMuch, fromWhere)
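
A runnable variation of the second snippet makes the effect visible. Everything here is illustrative: the Provider class is a stub standing in for real payment clients, the pay signature is simplified to a single amount, and only the primary call is wrapped in the fallback.

```python
class Provider:
    """Hypothetical payment client; `healthy` simulates an outage."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.calls = 0

    def pay(self, amount):
        self.calls += 1
        if not self.healthy:
            raise ConnectionError(f"{self.name} unavailable")
        return f"{self.name} charged {amount}"


def pay(request_num, amount, primary, secondary, secondary_every=100):
    """Send 1 in `secondary_every` requests down the secondary path even when all is well."""
    if request_num % secondary_every == 0:
        return secondary.pay(amount)
    try:
        return primary.pay(amount)
    except ConnectionError:
        return secondary.pay(amount)


paypal, venmo = Provider("paypal"), Provider("venmo")
for n in range(1, 1001):
    pay(n, 10, paypal, venmo)
print(f"paypal calls: {paypal.calls}, venmo calls: {venmo.calls}")
```

Because the secondary path carries a steady trickle of real traffic, a broken fallback shows up in normal monitoring rather than in the middle of an outage.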

Key Takeaways
•Control planes are more complex and have more dependencies than data planes, and thus are statistically more likely to fail
•A dependency on a data plane is better than a dependency on a control plane
•Static stability means a system can operate in a static state and continue to operate as normal without the need to make changes during the failure or unavailability of a dependency
•One way AWS achieves static stability is by removing control plane dependencies from the data plane in AWS services
•Understand what does and might not work during a control plane impairment for global services

Resources
Reliability Pillar: AWS Well-Architected Framework
Amazon Builders’ Library
Advanced Multi-AZ Resilience Patterns
AWS Fault Isolation Boundaries

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Design and
Implement
69
Set
objectives
Design
and
implement
Evaluate
and testOperate
Respond
and learn

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Pillars of AWS Well-Architected
Cost
OptimizationReliabilitySecurityOperational
Excellence
Performance
EfficiencySustainability
70

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reliability Pillar – Four Areas of Best Practices
71
Foundations
Change managementFailure management
Workload architecture
Reliability

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Desired Resilience Properties
72
Fault isolationSufficient capacityTimely outputCorrect outputRedundancy

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Resilience Trade-offs
73
Operational burdenComplexityConsistency & latencyCost & effort

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. © 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. 74
Resilient
Architecture

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. 75
Q: Since failures will happen how to limit the impact
when it does?

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fault Isolation Boundaries
PhysicalLogical
76
AWS Partition
AWS Region
Availability Zone
Instance
AWS Account
Microservices
Cells

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Regions and Availability Zones (AZs) - Physical
AWS REGIONS ARE PHYSICAL LOCATIONS AROUND THE WORLD WHERE WE CLUSTER DATA CENTERS
Regions worldwide
AWS Regions
Announced Regions
Data centerData center
Data center
Each AZ includes one or
more discrete data centers
Data centers, each with
redundant power, networking,
and connectivity,
housed in separate facilities.
Each AWS Region has multiple AZs
Transit
Transit
AZ
AZ
A Region is a physical
location in the world
AZ
77

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Cellular Architecture - Logical
•A cellular architecture is a design pattern
where a service is split into multiple
deployment stacks, each called a “cell”
•Each cell is an independent instance of the
service
•Each cell is assigned a portion of data (one or
more customers)
•Each independently service the full workload
of assigned customers
•Cells share nothing, including databases
78
Cell router
Service
Application Load
Balancer
Compute
Storage
Cell 0
Application Load
Balancer
Compute
Storage
Cell n
Application Load
Balancer
Compute
Storage
Cell 1

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Control Planes and Data Planes
Control plane
•Create, update, delete, list,
and described resources
•Complex orchestration with
many dependencies
•Lower volume
Data plane
•The day-to-day business of
the resource
•Simpler, fewer dependencies
•Higher volume
79

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Control Plane and Data Plane Example
80Data Plane: EC2 Instances Running
RunInstancesControl Plane: Launching an Instance

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. 81
Q: Which of these is going to fail more
often?
A: The control plane.
Control planes tend to be more complex
and thus statistically have a higher
likelihood of failure.

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. 82
Q: Which of these is more critical?
A: The data plane.
Data planes provide the “day-to-day”
business of the service.

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Static Stability
Static stability means a system can operate in a static state and
continue to operate as normal without the need to make changes
during the failure or unavailability of a dependency.
83

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved. 84
Q: But what does that MEAN?

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Availability Zone AAvailability Zone B
100%100%
50%50%50%
Availability Zone AAvailability Zone BAvailability Zone C
If using 3 Availability Zones (AZs)
Provision enough EC2 capacity such
that the two remaining AZs can
handle 100% of your workload load
If using 2 Availability Zones (AZs)
Provision enough EC2 capacity such
that the one remaining AZ can
handle 100% of your workload load
Static Stability Example
85

AWS ARCHITECTURAL RESILIENCE DAY
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Service Types
UNDERSTANDING THE DIFFERENCES
Zonal services: Amazon EC2, Amazon EBS, Amazon RDS
Control plane: Regional
Data plane: Zonal
Regional services: Amazon S3, Amazon DynamoDB, Amazon SQS
Control plane: Regional
Data plane: Regional
Global services: IAM, Amazon CloudFront, Amazon Route 53, AWS Global Accelerator
Control plane: Single Region
Data plane: Global
86

IAM Control Plane and Data Plane Example
Control plane: Manage IAM identities and policies (for example, CreateRole)
Propagators
Data plane: Authentication and authorization for all AWS service API calls, Auth Runtime Service (ARS)
[Diagram: three Regions; services shown are Amazon S3, Amazon SQS, and Amazon DynamoDB; three boxes labeled ARC]
87

Avoid Control Plane Actions for Recovery
GLOBAL SERVICES
IAM
What will work: AuthN/AuthZ of signed AWS requests
What may not work: CRUD-L IAM policies, roles, users
Route 53
What will work: DNS resolution and health checks
What may not work: Updates to routing policies and records, creating health checks
Route 53 ARC
What will work: Listing and updating routing controls
What may not work: CRUD-L cluster endpoints
CloudFront
What will work: Will continue to cache and serve content, perform origin failover
What may not work: CRUD-L CloudFront distributions
Organizations
What will work: Service control policy (SCP) evaluation
What may not work: View or update organization structure, modify SCPs
Global Accelerator
What will work: Edge routing will continue to function, existing traffic dials and endpoint weights
What may not work: Add/modify endpoints, change traffic dials
88

Amazon Route 53 best practices
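Don’t do this
Fail over by calling ChangeResourceRecordSets during the event to change www.example.com → us-east-1.example.com into www.example.com → us-west-2.example.com
Do this!
Configure www.example.com for both Regions ahead of time and let health checks decide where traffic goes
Health check: us-east-1.example.com
Health check: us-west-2.example.com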
89
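One way to read the "Do this" side is that both records exist before any failure, each tied to a health check, so failover needs no record change at incident time. The sketch below only builds the request body; it assumes failover routing with CNAME records, and the health check IDs are hypothetical placeholders. The dictionary layout follows the ChangeBatch structure accepted by the Route 53 ChangeResourceRecordSets API, which would be called once during setup, not during recovery.

```python
def failover_change_batch(name, primary, secondary, ttl=60):
    """Build a ChangeBatch with a PRIMARY and a SECONDARY failover record.

    `primary` and `secondary` are (target_dns_name, health_check_id) pairs.
    The health check IDs are placeholders supplied by the caller.
    """
    def record(role, target, health_check_id):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": role.lower(),
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
                "HealthCheckId": health_check_id,
            },
        }

    return {
        "Changes": [
            record("PRIMARY", *primary),
            record("SECONDARY", *secondary),
        ]
    }
```

Once such records are in place, the Route 53 data plane (DNS resolution plus health checks) shifts answers on its own when the primary health check fails.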

Amazon Route 53 best practices
www.example.com
Region: Health check: us-east-1.example.com
Region: Health check: us-west-2.example.com
Initiated by application health: CloudWatch Alarm
Initiated by a person: Application Recovery Controller, or target an object in an S3 bucket

Resilient Software
Design Patterns

Failure-oriented Patterns
Client (Front-End)
•Timeouts
•Retries w/back off
•Jitter
•Limit queue sizes
•Caching
Client
Service
92
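The client-side items above (retries, back off, jitter) can be combined in a few lines. This is a minimal sketch under stated assumptions: the function names, the default delays, and the "full jitter" variant are illustrative choices, and `operation` is assumed to enforce its own timeout.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**n)]."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation` until it succeeds or attempts are exhausted."""
    delays = backoff_delays(max_attempts, base, cap, rng)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # out of attempts: surface the last error
            sleep(delay)
```

Capping both the number of attempts and the delay keeps a struggling dependency from being hit by an unbounded retry storm, and the random factor spreads retries from many clients over time.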

Failure-oriented Patterns
Service (Back-End)
•Rate limit
•Rejection (load shedding)
•Caching
•Circuit breaker
Service
Client
93
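As an illustration of the first two service-side items (rate limit and rejection), here is a small token bucket. It is a sketch, not a description of how any AWS service implements throttling; the class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Allow up to `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Return True to serve the request, False to reject (shed) it."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would typically answer a rejected request immediately with a throttling error, which is cheap for the service and gives a well-behaved client a signal to back off.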

Simple Designs and Constant Work
PUSH: Users, Configuration agent, Node, Node, Node, Node
vs
PULL: Users, Configuration agent, Amazon S3 bucket, Node, Node, Node, Node
94

Asynchronous (Static Stability)
95
System 1
System 2
System 1
System 2
Async/Indirect
Synchronization

Graceful Degradation (Static Stability)
Missing frequently bought together
Missing the item description
Can still add to cart
Can still browse images

Avoid Bimodal – Fallback Processing
Prevent bimodal behavior:
Bimodal behavior is when your workload exhibits different behavior under failure modes
https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/?did=ba_card&trk=ba_card
Consumer → Service → PayPal, with fallback to Venmo when PayPal fails (X)
Fallback path that only runs on failure:
try:
    paypal.pay(who, howMuch)    # paypal processing
exception:
    venmo.pay(telNo, howMuch, fromWhere)    # venmo processing
Convert to something that regularly exercises both paths:
try:
    if mod(requestNum, 100) == 0:
        venmo.pay(telNo, howMuch, fromWhere)
    else:
        paypal.pay(who, howMuch)
exception:
    venmo.pay(telNo, howMuch, fromWhere)
97
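A runnable version of the second pattern might look like the sketch below. The `paypal` and `venmo` arguments are hypothetical callables standing in for real payment clients, and the single `amount` parameter is a simplification of the slide's argument lists.

```python
def pay(request_num, amount, paypal, venmo, secondary_every=100):
    """Route every Nth request to the secondary processor on purpose,
    so the fallback path is exercised continuously, not only during
    an outage of the primary."""
    try:
        if request_num % secondary_every == 0:
            return venmo(amount)
        return paypal(amount)
    except Exception:
        # As on the slide, the failure path always goes to the secondary.
        return venmo(amount)
```

Because roughly 1% of normal traffic already flows through the secondary path, a break in that path shows up in everyday metrics long before it is needed as a fallback.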

Key Takeaways
•Control planes are more complex and have more dependencies than data planes, thus are statistically more likely to fail
•A dependency on a data plane is better than a control plane
•Static stability means a system can operate in a static state and continue to operate as normal without the need to make changes during the failure or unavailability of a dependency
•One way AWS achieves static stability is by removing control plane dependencies from the data plane in AWS services
•Understand what does and might not work during a control plane impairment for global services.
98

Resources
99
Reliability Pillar: AWS Well-Architected Framework
Amazon Builder’s Library
Advanced Multi-AZ Resilience Patterns
AWS Fault Isolation Boundaries

Lab #2 – Design and
implement for
High Availability

Architecture & Objectives
•Analyze the resilience posture of
our architecture again using AWS
Resilience Hub and detail the
compliance with our RTO/RPO
•Update resilience policy to include a
new requirement: multi-Region Disaster Recovery
•Analyze whether the resilience
posture of our architecture meets
the DR requirement
•Deploy some Disaster Recovery-
focused recommendations
101

Lab #3 – Design and
implement for
Disaster Recovery

Architecture
103

Objectives
•Analyze the resilience posture of our architecture using AWS Resilience Hub and detail the
compliance with our regional RTO/RPO.
•Make adjustments to Amazon S3 and AWS Backup to meet our RTO/RPO requirements.
•Verify our architecture complies with our resilience policy.
104

Q: How can I meet my business
requirements for availability and recovery?
A: You must test.

Evaluate and Test
Set objectives
Design and implement
Evaluate and test
Operate
Respond and learn

Traditional Tests
Types of tests:
•Unit
•Integration
•Functional
•Performance and load
•Smoke
•Regression
107
AUTOMATE

Chasing The Unknowns in Your Environments
Known knowns: Things we are aware of and understand
Known unknowns: Things we are aware of but don’t understand
Unknown knowns: Things we understand but are not aware of
Unknown unknowns: Things we are neither aware of nor understand
PRO TIP
Customers often ask what key things they
should measure to understand when a
system is becoming impaired. You should
test for known and unknown conditions to
identify and refine what you should
measure.

Chaos Engineering
A SCIENTIFIC METHOD
Steady state
Hypothesis
Run Experiment
Verify
Improve
109

Typical Chaos Experiment
Inject events that simulate:
Hardware failures, like servers dying
Software failures, like malformed responses
Nonfailure events, like spikes in traffic or scaling events
Of-the-cloud impairments
Any event capable of disrupting steady state
110

Chaos Experiment Definition - Hypothesis
111
At a rate of 300 TPS, if 40% of the nodes in the EKS node-group are
terminated, the Transaction Create API continues to serve the 99th percentile
of requests in under 100 ms (steady state)
The EKS nodes will recover within 5 minutes, and pods will get scheduled and
process traffic within 8 minutes after the initiation of the experiment
Alerts will fire within 3 minutes
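A hypothesis written this precisely can be checked mechanically after the experiment. The sketch below assumes the latencies and timings have already been collected; the function names, the result keys, and the nearest-rank percentile method are illustrative choices.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_hypothesis(latencies_ms, nodes_recovered_s, pods_serving_s,
                     alert_fired_s):
    """Compare measured results against the thresholds in the hypothesis."""
    return {
        "p99_under_100_ms": percentile(latencies_ms, 99) < 100,
        "nodes_recovered_within_5_min": nodes_recovered_s <= 5 * 60,
        "pods_serving_within_8_min": pods_serving_s <= 8 * 60,
        "alerts_within_3_min": alert_fired_s <= 3 * 60,
    }
```

Any `False` entry marks a finding to carry into the "Improve" step of the experiment cycle.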

Controlled Experiments Through Canary Deployments
112
Amazon
Route 53
Elastic Load
Balancing
Target group
WITHOUT the
Chaos
Experiment
Target group
WITH the
Chaos
Experiment
Requests
Compute
Database
Compute
Verify that both groups are healthy before moving forward
Synthetic
Load
Generation
Source: https://medium.com/@adhorn/chaos-engineering-q-a-how-to-safely-inject-failure-ced26e11b3db

AWS Fault Injection
Service
Experiment
template
AWS Command
Line Interface
AWS Management
Console
AWS Identity and
Access Management
FIS safeguards
FIS engine
Compute
Start experiment
Third party
AWS
Amazon
EventBridge
Amazon
CloudWatch
alarms
AWS resources
Databases
Networking
Storage
Monitoring
Stop experiment
AWS Fault Injection Service
113

Chaos Mindset – Impact SQS
CUSTOMER ASK: I WANT TO UNDERSTAND WHAT HAPPENS IF MY QUEUE SIZE GROWS
114
SQS
VPC
Endpoint
Amazon EC2
aws:network:disrupt-connectivity
Prefix of VPCE
AWS Fault Injection Service

Q: What should I do if there is no pre-defined
experiment?
A: Create your own! Many experiments/faults
in AWS Fault Injection Service execute custom
scripts on local instances

Chaos Mindset – Think Outside the Box (Impact SQS)
CUSTOMER ASK: I WANT TO UNDERSTAND WHAT HAPPENS IF MY QUEUE SIZE GROWS
116
SQS
AWS Lambda
•Throttle Lambda
•Remove permission to pick up
msg from the queue
PRO TIP
You can use any mechanism from connectivity to
permissions that creates a logical failure.

When a Chaos Experiment Becomes a Chaos Test
Types of tests:
•Unit
•Integration
•Functional
•Performance and load
•Smoke
•Regression <= Chaos Test
117

Game Days
SIMULATE FAILURE OR EVENT TO TEST SYSTEMS, PROCESSES, AND TEAM RESPONSES
People: Cross-disciplinary team
Planning: Scenario, Events, Preparation
Briefing: Overview, Roles
Execution: In production, Capture feedback
Analysis: What happened, Follow-up items
Success: Technology, Process, People

Resources
119
AWS re:Invent 2023 - Practice like you play:
How Amazon scales resilience to new heights

Lab #4 – Evaluate and Test

Architecture & Objectives
•Deploy operational
recommendations from AWS
Resilience Hub
•Review the Amazon CloudWatch
canaries and dashboards
•Introduce chaos engineering
experiments using AWS Fault
Injection Service
121

Q: How do I find the signals that matter in all
this noise?

Ryan Baker
Principal Technical Account Manager
Amazon Web Services
Operate
Set objectives
Design and implement
Evaluate and test
Operate
Respond and learn

Operational Excellence Pillar – Four Areas of Best Practices
Organizations
Prepare
Operate
Evolve

Metrics: What to Measure?
Let’s collect
EVERYTHING!
125

Monitoring in the Cloud
THREE PILLARS OF OBSERVABILITY TOOLING
Observability
PRO TIP
Use the CloudWatch embedded metric format (EMF) to
produce logs and metrics together.
Logs
Time-stamped records of
discrete events
One log per “unit of work”
Traces
User’s journey across
multiple applications
and systems
Metrics
Numeric data
Measured at various
time intervals
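To make the PRO TIP concrete, here is a sketch of one log line per unit of work that also carries a metric definition. The `_aws` envelope follows the documented CloudWatch embedded metric format; the namespace, dimension names, and the `RequestId` field are illustrative assumptions.

```python
import json
import time


def emf_record(namespace, service, operation, latency_ms, request_id,
               timestamp_ms=None):
    """Return one JSON log line that CloudWatch can read as both a log
    event and a Latency metric with Service and Operation dimensions."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return json.dumps({
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [["Service", "Operation"]],
                "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}],
            }],
        },
        "Service": service,
        "Operation": operation,
        "Latency": latency_ms,
        "RequestId": request_id,  # extra context: searchable, not a metric
    })
```

Printing such a line from a Lambda function or shipping it through the CloudWatch agent is the usual delivery path; the high-cardinality fields stay in the log without creating extra metrics.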

Implementing Golden Metrics
IN ORDER TO DETECT, INVESTIGATE, AND RESPOND TO IMPACT
Impact assessment
metrics
Customer experience
metrics
Operational health
metrics
PRO TIP
Use KPIs to measure
user experience
127

Relationship Between MTBF, MTTR, and MTTD
…are the three factors that are used to improve availability in distributed systems
Availability Metrics
[Timeline diagram, in time order: Failure Occurs, Repair Starts, Resume Normal Operations, Failure Occurs. Intervals labeled MTTD, Repair Time, MTTR, and MTBF.]
Reduce detection times
(shorter MTTD & MTTR)
Reduce repair times
(shorter MTTR)
Less frequent failure
(longer MTBF)
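The three levers above can be tied together with the commonly used availability estimate, availability = MTBF / (MTBF + MTTR). The sketch assumes, as the slide suggests, that MTTR includes detection time (MTTD) plus repair time; the function name and the choice of hours as the unit are illustrative.

```python
def availability(mtbf_hours, mttd_hours, repair_hours):
    """Estimate availability as MTBF / (MTBF + MTTR),
    where MTTR = MTTD + repair time."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if mttd_hours < 0 or repair_hours < 0:
        raise ValueError("MTTD and repair time must not be negative")
    mttr_hours = mttd_hours + repair_hours
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

For example, an MTBF of 999 hours with 0.5 hours to detect and 0.5 hours to repair gives 999 / 1000 = 0.999. Halving detection time, halving repair time, or lengthening MTBF each raises the result.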

Metrics Collection Example
GetProductInfo()
Local Cache
Remote Cache
Remote Database
Which product are we looking up?
Who called the API?
What product category is this in?
When was the call made?
Did we find the item in the local cache?
Did we find the item in the remote cache?
How long did it take to read from the cache?
How long did it take to deserialize the object from
the cache?
How full is the local cache?
How long did the query take?
Did the query succeed?
If it failed, was it due to timeout? Was it
an invalid query? Did we lose the
connection?
If it timed out, was the connection pool
full? Did we fail to connect entirely? Was
it just slow to respond?
How long did it take to populate the caches?
Were they full and did they evict other items?
How big was the product info object?
What was the response code from the server?
What was the latency?
129

Annual Downtime
•Appropriate for longer-term
goal setting and tracking
•Requires a definition of what
“downtime” means
•Whenever downtime occurs,
sum all of those intervals to
calculate annual availability
•This is where we can calculate
the ‘9s’ by looking at
historical data
130
Drop below 95% availability for
any API during a 5-minute
window
System-wide metric/KPI like a
drop in order rate of 10% or
more against forecast
Impact to specific critical
functions
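Once "downtime" has a definition such as the examples above, the yearly figure is simple arithmetic over the recorded intervals. A minimal sketch, assuming a 365-day year and intervals measured in minutes; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def annual_availability(downtime_minutes):
    """Availability for the year given a list of downtime intervals."""
    return 1 - sum(downtime_minutes) / MINUTES_PER_YEAR


def downtime_budget_minutes(target_availability):
    """Minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - target_availability)
```

A 99.9% target leaves a budget of about 525.6 minutes (roughly 8.8 hours) per year, and intervals adding up to 52.56 minutes correspond to 99.99%.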

Latency
•Latency also impacts
availability
•Use percentiles and trimmed
mean to measure latency
•Analyze histograms for
latency distribution trends
131
Source: chrome://histograms/#Net.TCP_Connection_Latency
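The difference between a plain average, a trimmed mean, and a high percentile is easy to see on a small sample. This sketch uses a nearest-rank percentile and a symmetric trim; both are illustrative choices and may differ in detail from how a monitoring service computes the same statistics.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def trimmed_mean(samples, trim_pct=10):
    """Mean after dropping the lowest and highest `trim_pct` percent."""
    if not samples:
        raise ValueError("no samples")
    if not 0 <= trim_pct < 50:
        raise ValueError("trim_pct must be in [0, 50)")
    ordered = sorted(samples)
    k = len(ordered) * trim_pct // 100
    return statistics.mean(ordered[k:len(ordered) - k])
```

For eight requests at 10 ms plus one at 1 ms and one at 1000 ms, the plain mean is 108.1 ms, the 10% trimmed mean is 10 ms, and p99 is 1000 ms: the trimmed mean describes the typical request while the percentile exposes the outlier.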

Failure isn’t binary

Abstract Model
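Diagram labels: System, Core business logic, Observer, Reactor, Report, Probe
Before: App 1 50 ms, App 2 53 ms, App 3 56 ms; average latency 53 ms; latency threshold 60 ms
After (+40% latency on App 1, a difference of 20 ms): App 1 70 ms; average latency 59.66 ms, still below the 60 ms threshold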
133
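The numbers on this slide can be reproduced directly, which shows why an aggregate threshold alone misses the impaired application. The function names below are illustrative.

```python
def fleet_average(latencies_ms):
    """Average latency across all applications."""
    return sum(latencies_ms.values()) / len(latencies_ms)


def apps_over_threshold(latencies_ms, threshold_ms):
    """Names of applications whose own latency exceeds the threshold."""
    return sorted(name for name, value in latencies_ms.items()
                  if value > threshold_ms)
```

With App 1 at 70 ms, the fleet average is 179 / 3, about 59.67 ms (shown truncated as 59.66 ms on the slide), so a 60 ms alarm on the average stays quiet while a per-application check flags App 1.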

System Versus Customer Perspective
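System Perspective healthy, Customer Perspective healthy: ALL GOOD
System Perspective unhealthy, Customer Perspective healthy: MASKED FAILURE
System Perspective healthy, Customer Perspective unhealthy: GRAY FAILURE
System Perspective unhealthy, Customer Perspective unhealthy: DETECTED FAILURE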
134

Differential Observability
Health is observed differently from different perspectives
The underlying system may not see impact at all, or the impact may not cross a
threshold
Individual users of that system may be disproportionately impacted
You can’t rely on the underlying system to detect and mitigate the failure; users of
the system must take action themselves

Amazon CloudWatch Composite Alarms
•Monitor the state of other alarms
•Can also monitor other composite alarms
•Simple logic alarm expressions, AND, OR,
NOT to aggregate
(ALARM("HighLatency") OR ALARM("HighErrors")) AND OK("HighMemoryUtilization")
PRO TIP
Use dimensions to give your metric a
unique identifier. Use your fault
isolation boundaries as a dimension.

136

Composite Alarm Example
137
ServiceName: Home
AZ ID: use1-az1
Evaluation Periods: 5
Datapoints to Alarm: 3
Threshold: Availability < 99.9%
ServiceName: ListProducts
AZ ID: use1-az1
Evaluation Periods: 5
Datapoints to Alarm: 3
Threshold: Availability < 99.9%
az1-availability
ALARM("az1-home-availability") OR ALARM("az1-listproducts-availability")
ServiceName: Home
AZ ID: use1-az2
Evaluation Periods: 5
Datapoints to Alarm: 3
Threshold: Availability < 99.9%
ServiceName: ListProducts
AZ ID: use1-az2
Evaluation Periods: 5
Datapoints to Alarm: 3
Threshold: Availability < 99.9%
az2-availability
ALARM("az2-home-availability") OR ALARM("az2-listproducts-availability")

Putting it All Together
az1-availability
ALARM("az1-home-availability") OR ALARM("az1-listproducts-availability")
az1-latency
ALARM("az1-home-latency") OR ALARM("az1-listproducts-latency")
use1-az1-impact
(ALARM("az1-availability") OR ALARM("az1-latency"))
AND NOT
(ALARM("az2-availability") OR ALARM("az2-latency")
OR
ALARM("az3-availability") OR ALARM("az3-latency"))
not-single-instance-use1-az1
INSIGHT_RULE_METRIC("5xx-errors-use1-az1", "UniqueContributors") >= 2
use1-az1-isolated-impact
ALARM("use1-az1-impact") AND ALARM("not-single-instance-use1-az1")
SNS topic
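The alarm hierarchy on this slide is plain Boolean logic, so it can be prototyped and unit tested before it is written as CloudWatch alarm rules. In this sketch the alarm names mirror the slide, while the function names and the dictionary of alarm states are assumptions for illustration.

```python
def az_impaired(alarms, az):
    """True if the AZ-level availability or latency alarm is in ALARM."""
    return (alarms.get(f"{az}-availability", False)
            or alarms.get(f"{az}-latency", False))


def az_impact(alarms, az, all_azs):
    """This AZ is impaired AND no other AZ is impaired."""
    others = [other for other in all_azs if other != az]
    return az_impaired(alarms, az) and not any(
        az_impaired(alarms, other) for other in others)


def isolated_impact(alarms, az, all_azs, unique_contributors):
    """AZ-only impact caused by at least two distinct contributors,
    so a single bad instance does not trigger an AZ evacuation."""
    return az_impact(alarms, az, all_azs) and unique_contributors >= 2
```

Here `alarms` maps an alarm name to `True` when that alarm is in the ALARM state, and `unique_contributors` stands in for the Contributor Insights value used by the `not-single-instance-use1-az1` alarm.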

Canaries
PRO TIP
Use CloudWatch blueprints to easily set
up the right canary for your use case
Synthetics
Amazon CloudWatch
Website
Alarm
139

Implement CI/CD
140
CI/CD
Best
Practices
Full
automation
Test Driven
Development
Frequent
commits and
integrations
Immutable
infrastructure
Rollback
mechanism
Version
control
Canary and
blue/green
deployments

Key Takeaways
•Continuously introspect your metrics
•Gray failures are defined by differential observability
•You must take action to detect and mitigate; you can’t rely on the
underlying service
•Emit metrics with dimensions aligned to your fault boundaries to
detect the impacted fault container
•Create Golden Metrics and make them easy to adopt
141

Resources
142
Operational Excellence Pillar:
AWS Well-Architected Framework
AWS re:Invent 2020: Monitoring production services at Amazon
Building dashboards for operational visibility
Instrumenting distributed systems for operational visibility

Q: Which metrics are the most important?
There is also a lot of noise!

Respond and Learn
Set objectives
Design and implement
Evaluate and test
Operate
Respond and learn

Separating Signal From Noise
145
Latency
Time
Query Latency
Latency
What happened here?
P99 Sev2

Using Different Metrics
146
Latency
Time
Latency
Large QuerySmall Query
if (queryCost < 10)
    metrics.addTime("querySmall", duration)
else
    metrics.addTime("queryLarge", duration)
Split the metric we started with into two
dimensions, big query and little query, so
each can be graphed and alarmed on separately

Client Behavior
147
Error Rate
Time
Errors
Errors

Graphing Top Contributors
148
Error Rate
Time
Errors per Client
Customer ACustomer BCustomer CCustomer DCustomer E

Using Top Contributors to Find Problems
149
Error Rate
Time
Errors per Client
Customer ACustomer BCustomer CCustomer DCustomer E
PRO TIP
Using a tool like “Contributor Insights” allows you to
separate signal from noise.
In this case determine if it’s a few clients or all
clients.
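The "few clients or all clients" question can be answered with a simple count per client. This sketch is not the Contributor Insights API; the event shape (a dict with a `client` key), the function names, and the 80% cut-off are assumptions for illustration.

```python
from collections import Counter


def top_contributors(error_events, n=3):
    """Return the n clients with the most errors as (client, count)."""
    counts = Counter(event["client"] for event in error_events)
    return counts.most_common(n)


def dominated_by_few(error_events, top_n=1, share=0.8):
    """True if the top_n clients account for at least `share` of errors."""
    if not error_events:
        return False
    top_total = sum(count for _, count in
                    top_contributors(error_events, top_n))
    return top_total / len(error_events) >= share
```

When one client produces most of the errors, the likely next step is to look at that client's traffic; when errors are spread evenly, the service itself is the more likely cause.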

Creating a Checklist
System Architecture
•API diagram and workflow
•Retry/back-off strategy
•Dependencies?
Release Quality and
Procedures
•What mechanisms do you
use to deploy?
•Automatically roll back
incorrect deployments?
Incident and Event
Management
•What operational goals
have you identified?
•Preventive measures?
150

Post-incident Analysis
SHARING LESSONS LEARNED
What did we learn &
what do we action on?
How were our people
and processes?
How do we share
what we’ve learned?
151

AWS Weekly Operations Review
Senior leadership + Open to every engineer
•Success stories
•Metrics review (The Wheel)
•COE review
•What can improve?
152
PRO TIP
This builds the cultural muscle for widespread adoption through
learning. When EVERY engineer in the company is able to
participate and each is accountable for operational metrics, a
culture of accountability can be fostered.

Continuous Resilience
Set objectives
Design and implement
Evaluate and test
Operate
Respond and learn
PRO TIP
Resilience is a continuous
journey. Don’t ‘set it and
forget it’. People and
technologies are ever
changing. Make sure your
resilience changes with
them.

Resources
154
AWS re:Invent 2023 –
Building observability to increase resiliency

Lab #5 – Respond and Learn

Objectives
•Analyze the resilience
score of our architecture
and review upward trend
•Learn about workshops
that can help us with our
incident analysis reports
156


Resilience Lifecycle Framework
Stages are iterated through repeatedly
People may be working on different
stages at the same time
Maturity is frequently not equal across
stages
Start with a shallow pass through all
stages; build maturity over time
Not all capabilities and practices are
necessary for every application
Set objectives
Design and implement
Evaluate and test
Operate
Respond and learn

We have shown you some tools to build and
operate more resilient systems today
Resilience is a journey not a destination
Concentrate on what’s most important for your
business needs at the time
And do ask your AWS account team for help
AWS is here to help you
159

Closing
Survey

Thank you!