5 practical operability techniques for teams - Matthew Skelton - SQUID meetup 2018
matthewskelton
103 slides
Nov 29, 2018
About This Presentation
In this talk, we explore five practical, tried-and-tested, real-world techniques for improving the operability of many kinds of software systems, including cloud, Serverless, on-premise, and IoT:
- Logging as a live diagnostics vector with sparse Event IDs
- Operational checklists and ‘Run Book dialogue sheets’ as a discovery mechanism for teams
- Deployment Verification Tests as a way to assess runtime dependencies and readiness for service
- Correlation IDs beyond simple HTTP calls
- Lightweight ‘User Personas’ as drivers for operational dashboards
Based on work in many industry sectors, we will learn how to improve the operability of software systems using these team-friendly techniques.
Matthew Skelton is Head of Consulting at Conflux (confluxdigital.net) where he specialises in Continuous Delivery, operability and organisation design for software in manufacturing, ecommerce, and online services, including cloud, IoT, and embedded software.
Slide Content
1
5 practical operability
techniques for teams
Matthew Skelton, Conflux
@matthewpskelton
confluxdigital.net
SQUID meetup, London
Weds 28 November 2018
Practical operability
Why do we need a focus on
operability?
5 practical operability
techniques that work
2
modern event-based logging
Run Book dialogue sheets
endpoint healthchecks
correlation IDs
user personas
3
team
collaboration
techniques
4
About me
Matthew Skelton, Conflux
@matthewpskelton
matthewskelton.net
Leeds, UK
5
20% discount for
SQUID meetup!
You?
Software Developer
Tester / QA
DevOps Engineer
Ops Engineer / SRE
Head of Department
6
“But can’t we just give
those things to an SRE?”
10
“But can’t we just give
those things to the
DevOp?”
11
Operability is
a shared concern
#BizDevTestSecOps
12
13
14
15
A self-managed Kubernetes
cluster near you
Operability is
a shared concern
#BizDevTestSecOps
16
17
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com
Practical operability techniques
1. Modern logging with event IDs
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs
5. Lightweight User Personas
18
19
Logging with Event IDs
Lack of observability for
distributed systems
20
Modern logging w/ Event IDs
Distinct application states
No “logorrhoea” (!)
Distributed tracing via logs
Build a shared understanding
21
search by event
Event ID
{Delivered,
InTransit,
Arrived}
22
transaction
trace
Correlation ID
612999958…
23
Modern logging with event IDs
helps to produce a well-defined
event space:
human-readable events
24
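As a rough illustration of the idea above, here is a minimal Python sketch of sparse Event IDs written as structured log lines. The state names come from the slide ({Delivered, InTransit, Arrived}); the numeric values, the `DeliveryEvent` and `format_event` names, and the `parcel` field are assumptions made for the example, not part of the talk or of opslogger.

```python
import json
import logging
from enum import IntEnum


class DeliveryEvent(IntEnum):
    """Sparse Event IDs: the gaps leave room to add states later."""
    IN_TRANSIT = 1000  # numeric values are hypothetical
    ARRIVED = 2000
    DELIVERED = 3000


def format_event(event: DeliveryEvent, correlation_id: str, **details) -> str:
    """Build one log line that is searchable by event and by correlation ID."""
    record = {
        "event_id": int(event),
        "event": event.name,
        "correlation_id": correlation_id,
        **details,
    }
    return json.dumps(record, sort_keys=True)


logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("delivery")

# One line per state transition, rather than free-text messages.
log.info(format_event(DeliveryEvent.IN_TRANSIT, "348e1cf8", parcel="A17"))
```

Because each state has exactly one ID, a search for `event_id` 2000 finds every arrival, and a search for one correlation ID reassembles a single transaction.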
Which calls might fail?
25
How many distinct event
types (state transitions) in
your application?
26
example:
https://github.com/EqualExperts/opslogger
Sean Reilly
@seanjreilly
32
Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency → TV broadcaster
High throughput
Glitch-free video & audio
33
Storage I/O
Worker Job
Queue
Upload
34
35
36
Example: video processing
Discover processing bottlenecks
Trigger alerts
Report on KPIs
Target areas for improvement
37
Modern logging w/ Event IDs
clarity about software behavior
reduce time to detect problems
increase team engagement
enhance collaboration
38
Modern Logging:
Collaborate on Event IDs
and Correlation traces for
better system awareness
39
Run Book dialogue sheets
Operational aspects not
addressed, or addressed
too late in the cycle
41
Run Book dialogue sheets
Checklists for typical
operational considerations
Team-friendly exploration
42
Run Book dialogue sheets help
to increase awareness of
operability within teams
43
runbooktemplate.info
Run Book dialogue sheets
44
System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can portions or features of the
system be unavailable at times if needed?
Hours of operation - core features
(e.g. 03:00-01:00 GMT+0)
Hours of operation - secondary features
(e.g. 07:00-23:00 GMT+0)
Data and processing flows
How and where does data flow through the system? What controls or triggers data flows?
(e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data )
…
45
http://runbooktemplate.info/
GitHub, CC BY-SA
46
runbooktemplate.info
Run Book dialogue sheets
47
Run Book dialogue sheets
Early discovery of
operational requirements
Input to team backlog
“Shift-left” testing
Avoid operational problems
48
49
http://operabilityquestions.com/
GitHub, CC BY-SA
OperabilityQuestions.com
Freeform, exploratory
questions for teams
Usability, viability, reliability,
observability, securability, …
(GitHub, CC BY-SA)
50
Run Book dialogue sheets:
Collaborate on operational
requirements for better
system awareness
51
Endpoint healthchecks
“Why has my deployment
failed again?”
“Why is Pre-Prod always
so flaky?”
53
Endpoint healthchecks
Simple HTTP check
Common way to assess any
service/app/component
Key operational requirement
54
endpoint healthchecks
Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
55
Endpoint healthchecks help
teams to collaborate on
service viability
56
endpoint healthchecks
Each component is responsible for
determining its own health and
viability – this is very contextual
57
endpoint healthchecks
Use JSON as a response type –
parsable by both
machines and humans!
58
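The healthcheck slides above can be sketched with Python's standard library alone. This is only one possible shape: the `/status/health` path and the 200 / 500 convention come from the slides, while the JSON field names, the `CHECKS` table, and the `make_server` helper are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run each named check and return (HTTP status, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as unhealthy
    healthy = all(results.values())
    body = json.dumps({"healthy": healthy, "checks": results}, sort_keys=True)
    return (200 if healthy else 500), body


# Hypothetical check: each component decides what "healthy" means for itself.
CHECKS: Dict[str, Callable[[], bool]] = {"config_loaded": lambda: True}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = health_response(CHECKS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 8080) -> HTTPServer:
    """Create (but do not start) a local server exposing the endpoint."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts the endpoint. Keeping the status decision in `health_response` means the team can unit-test what "healthy" means without opening a socket.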
59
endpoint healthchecks
For databases and other non-HTTP
components, run a lightweight HTTP
service in front of the component
200 / 500 responses
60
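For a database, the helper service mentioned above might wrap a probe like the following sketch. SQLite stands in for the real database purely so the example is self-contained; the function names and the `SELECT 1` probe are assumptions, and a real helper would use the client library of the component it fronts.

```python
import sqlite3


def database_is_healthy(db_path: str) -> bool:
    """Run a trivial query; any database error means the component is sick."""
    try:
        conn = sqlite3.connect(db_path, timeout=1.0)
        try:
            return conn.execute("SELECT 1").fetchone() == (1,)
        finally:
            conn.close()
    except sqlite3.Error:
        return False


def helper_status(db_path: str) -> int:
    """Map the probe result onto the 200 / 500 convention."""
    return 200 if database_is_healthy(db_path) else 500
```

The helper's HTTP handler would then return `helper_status(...)` as its response code, so dashboards and deployment checks can treat the database like any other HTTP component.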
Helper service
61
https://github.com/Lugribossk/simple-dashboard
62
63
Question:
What does this look like for
Serverless?
¯\_(ツ)_/¯
Endpoint healthchecks
Rapid diagnosis and visibility
Reduce confusion around
environment state
“Fail fast” → “learn sooner”
64
Endpoint healthchecks:
Collaborate on component
health status for better
system awareness
65
Correlation IDs
“Which nodes handled
the request?”
67
Correlation IDs
Unique-ish identifiers
Trace calls across machine &
container boundaries
Re-assemble HTTP
request/response later
68
‘Unique-ish’ identifier for each request
Passed through downstream layers
69
Correlation IDs help teams to
think about the big picture:
end-to-end outcomes
70
Unique-ish ID
71
Synchronous HTTP:
X-HEADER e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on
(Yes, RFC6648, but this is internal only)
72
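The "if header is present, pass it on" rule can be sketched as a small Python function. The header name and example value come from the slide; generating a new ID when the header is absent, and using eight hex characters for it, are assumptions about how the first service in the chain might behave.

```python
import uuid
from typing import Dict, Mapping

TRACE_HEADER = "X-trace-id"


def outgoing_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Headers to attach to downstream calls made while handling a request."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in incoming.items():
        if name.lower() == TRACE_HEADER.lower():
            return {TRACE_HEADER: value}  # present: pass it on unchanged
    # Assumption: the first service to see the request creates the ID.
    return {TRACE_HEADER: uuid.uuid4().hex[:8]}
```

A service would merge the returned dictionary into the headers of every downstream HTTP call it makes for that request, and include the same value in its own log lines.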
Asynchronous (queues, etc.):
Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / ReceiveMessage()
Log the Correlation ID if present
73
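For queues, the same idea might look like the sketch below. It only builds and reads plain dictionaries in the shape that SQS message attributes take, so it runs without AWS access; the function names and the log format are assumptions made for the example.

```python
import logging
from typing import Any, Dict, Optional

log = logging.getLogger("worker")


def build_attributes(trace_id: str) -> Dict[str, Dict[str, str]]:
    """Message attributes carrying the trace ID as a name:value pair."""
    return {"trace-id": {"DataType": "String", "StringValue": trace_id}}


def extract_trace_id(message: Dict[str, Any]) -> Optional[str]:
    """Read the trace ID from a received message, if one was attached."""
    attribute = message.get("MessageAttributes", {}).get("trace-id")
    return attribute["StringValue"] if attribute else None


def handle(message: Dict[str, Any]) -> Optional[str]:
    """Log the Correlation ID if present, then return it for downstream use."""
    trace_id = extract_trace_id(message)
    if trace_id is not None:
        log.info("processing message trace-id=%s", trace_id)
    return trace_id
```

With boto3, the dictionary from `build_attributes` would be passed as the `MessageAttributes` argument of `send_message`, and the consumer would need to request the attribute by name (`MessageAttributeNames=["trace-id"]`) in `receive_message` for it to appear on the received message.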
explore the UX and needs of
different roles for rapid
decisions via dashboards
99
use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
100
Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
operabilitybook.com
20% discount for SQUID meetup!
http://leanpub.com/SoftwareOperability/c/SquidMeetup
Download a free sample chapter
101
Resources
• Team Guide to Software Operability by Matthew Skelton and Rob Thatcher
http://operabilitybook.com/
• Run Book template & Run Book dialogue sheets
http://runbooktemplate.info/
• Operability Questions
http://operabilityquestions.com/
• 5 proven operability techniques for software teams
https://techbeacon.com/5-proven-operability-techniques-software-teams
102
thank you
103
@matthewpskelton / operabilitybook.com