5 practical operability techniques for teams - Matthew Skelton - SQUID meetup 2018

matthewskelton 1,301 views 103 slides Nov 29, 2018

About This Presentation

In this talk, we explore five practical, tried-and-tested, real-world techniques for improving the operability of many kinds of software systems, including cloud, Serverless, on-premise, and IoT:

- Logging as a live diagnostics vector with sparse Event IDs
- Operational checklists and ‘Run Book dia...


Slide Content

1
5 practical operability
techniques for teams
Matthew Skelton, Conflux
@matthewpskelton
confluxdigital.net
SQUID meetup, London
Weds 28 November 2018

Practical operability
Why do we need a focus on
operability?
5 practical operability
techniques that work
2

modern event-based logging
Run Book dialogue sheets
endpoint healthchecks
correlation IDs
user personas
3

team
collaboration
techniques
4

About me
Matthew Skelton, Conflux
@matthewpskelton
matthewskelton.net
Leeds, UK
5
20% discount for
SQUID meetup!

You?
Software Developer
Tester / QA
DevOps Engineer
Ops Engineer / SRE
Head of Department
6

Practical
Operability Techniques
for Teams
7

Operability

making software work well
in Production
8

Operability
9
Scale
Restore
Inspect
Failover
Monitor
Diagnose
Secure
Cleardown
Report

“But can’t we just give
those things to an SRE?”
10

“But can’t we just give
those things to the
DevOp?”
11

Operability is
a shared concern

#BizDevTestSecOps
12

14

15
A self-managed Kubernetes
cluster near you

Operability is
a shared concern

#BizDevTestSecOps
16

17
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com

Practical operability techniques
1. Modern logging with event IDs
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs
5. Lightweight User Personas
18

19
Logging with Event IDs

Lack of observability for
distributed systems
20

Modern logging w/ Event IDs
Distinct application states
No “logorrhoea” (!)
Distributed tracing via logs
Build a shared understanding
21

search by event

Event ID

{Delivered,
InTransit,
Arrived}
22

transaction
trace

Correlation ID

612999958…
23

Modern logging with event IDs
helps to produce a well-defined
event space:
human-readable events
24

Which calls might fail?
25

How many distinct event
types (state transitions) in
your application?
26

27

represent distinct states
28

enum

Human-readable sets:
unique values, sparse,
immutable

C#, Java, Python, node
(Ruby, PHP, …)
29

public enum EventID
{
    // Badly-initialised logging data
    NotSet = 0,
    // An unrecognised event has occurred
    UnexpectedError = 10000,

    ApplicationStarted = 20000,
    ApplicationShutdownNoticeReceived = 20001,

    MessageQueued = 40000,
    MessagePeeked = 40001,

    BasketItemAdded = 60001,
    BasketItemRemoved = 60002,

    CreditCardDetailsSubmitted = 70001,

    // ...
}
30

BasketItemAdded = 60001
BasketItemRemoved = 60002

31
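
A minimal Python sketch of the same idea, assuming only the standard `enum`, `json` and `logging` modules. The event values are taken from the C# enum on the slide; the `format_event` and `log_event` helpers and the JSON line layout are illustrative assumptions, not something shown in the talk.

```python
import enum
import json
import logging


@enum.unique
class EventID(enum.IntEnum):
    """Sparse, human-readable event IDs (values taken from the slide)."""
    NOT_SET = 0
    UNEXPECTED_ERROR = 10000
    APPLICATION_STARTED = 20000
    BASKET_ITEM_ADDED = 60001
    BASKET_ITEM_REMOVED = 60002


def format_event(event: EventID, **fields) -> str:
    """Render one event as a JSON line that can be searched by ID or by name."""
    record = {"event_id": int(event), "event": event.name, **fields}
    return json.dumps(record, sort_keys=True)


def log_event(logger: logging.Logger, event: EventID, **fields) -> None:
    """Hypothetical helper: emit the formatted event at INFO level."""
    logger.info(format_event(event, **fields))
```

Calling `log_event(logging.getLogger("basket"), EventID.BASKET_ITEM_ADDED, sku="A123")` writes one line containing both the numeric ID and the readable name, so a log search can match on either. `@enum.unique` rejects accidental duplicate values, which keeps the event space well-defined.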

example:
https://github.com/EqualExperts/opslogger
Sean Reilly
@seanjreilly
32

Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency → TV broadcaster
High throughput
Glitch-free video & audio
33

Storage I/O
Worker Job
Queue
Upload
34

35

36

Example: video processing
Discover processing bottlenecks
Trigger alerts
Report on KPIs
Target areas for improvement
37
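
One way to discover bottlenecks from event logs, sketched in Python under stated assumptions: each log record has already been parsed into a timestamp, a correlation ID and an event name. The record layout and the event names (`UploadReceived`, `JobQueued`, and so on) are hypothetical and only loosely mirror the Upload / Queue / Worker Job stages on the earlier slide.

```python
from collections import defaultdict

# Hypothetical parsed log records: (timestamp in seconds, correlation ID, event name).
RECORDS = [
    (0.0, "348e1cf8", "UploadReceived"),
    (0.4, "348e1cf8", "JobQueued"),
    (9.1, "348e1cf8", "JobStarted"),
    (12.6, "348e1cf8", "JobCompleted"),
]


def stage_durations(records):
    """Return {correlation_id: [(from_event, to_event, seconds), ...]}."""
    by_trace = defaultdict(list)
    for timestamp, correlation_id, event in sorted(records):
        by_trace[correlation_id].append((timestamp, event))
    durations = {}
    for correlation_id, events in by_trace.items():
        durations[correlation_id] = [
            (prev_event, next_event, round(next_ts - prev_ts, 3))
            for (prev_ts, prev_event), (next_ts, next_event) in zip(events, events[1:])
        ]
    return durations


def slowest_stage(records, correlation_id):
    """Pick the transition that took longest for one request."""
    return max(stage_durations(records)[correlation_id], key=lambda stage: stage[2])
```

With the sample records, the longest gap is between `JobQueued` and `JobStarted`, which would point at queue wait time rather than processing time. The same per-stage durations could feed alerts or KPI reports.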

Modern logging w/ Event IDs
clarity about software behaviour
reduce time to detect problems
increase team engagement
enhance collaboration
38

Modern Logging:
Collaborate on Event IDs
and Correlation traces for
better system awareness
39

Run Book dialogue sheets

Operational aspects not
addressed, or addressed
too late in the cycle
41

Run Book dialogue sheets

Checklists for typical
operational considerations
Team-friendly exploration
42

Run Book dialogue sheets help
to increase awareness of
operability within teams
43

runbooktemplate.info
Run Book dialogue sheets
44

System characteristics

Hours of operation

During what hours does the service or system actually need to operate? Can portions or features of the
system be unavailable at times if needed?

Hours of operation - core features

(e.g. 03:00-01:00 GMT+0)

Hours of operation - secondary features

(e.g. 07:00-23:00 GMT+0)

Data and processing flows

How and where does data flow through the system? What controls or triggers data flows?
(e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data )


45

http://runbooktemplate.info/
Github, CC BY-SA
46

runbooktemplate.info
Run Book dialogue sheets
47

Run Book dialogue sheets
Early discovery of
operational requirements
Input to team backlog
“Shift-left” testing
Avoid operational problems
48

49
http://operabilityquestions.com/
Github, CC BY-SA

OperabilityQuestions.com
Freeform, exploratory
questions for teams
Usability, viability, reliability,
observability, securability, …
(Github, CC BY-SA)
50

Run Book dialogue sheets:
Collaborate on operational
requirements for better
system awareness
51

Endpoint healthchecks

“Why has my deployment
failed again?”
“Why is Pre-Prod always
so flaky?”
53

Endpoint healthchecks
Simple HTTP check
Common way to assess any
service/app/component
Key operational requirement
54

endpoint healthchecks

Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
55
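
A minimal sketch of such an endpoint using only Python's standard `http.server` module. The `/status/health` path and the 200 / 500 status codes come from the slide; the `check_health` function, the report fields and the port are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_health() -> dict:
    """Hypothetical self-check; each component decides what 'healthy' means."""
    return {"healthy": True, "checks": {"config_loaded": True}}


def build_response(report: dict):
    """Map a health report to (HTTP status, JSON body bytes)."""
    status = 200 if report.get("healthy") else 500
    return status, json.dumps(report).encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status/health":
            self.send_error(404)
            return
        # 200 means "I am healthy", 500 means "I am sick".
        status, body = build_response(check_health())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking entry point; call this from your own start-up code."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Keeping `build_response` separate from the handler means the health logic can be unit-tested without opening a socket. The JSON body gives a human some detail, while tools only need the status code.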

Endpoint healthchecks help
teams to collaborate on
service viability
56

endpoint healthchecks

Each component is responsible for
determining its own health and
viability – this is very contextual
57

endpoint healthchecks

Use JSON as a response type –
parsable by both
machines and humans!
58

59

endpoint healthchecks

For databases and other non-HTTP
components, run a lightweight HTTP
service in front of the component
200 / 500 responses
60
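
A sketch of the probe that such a helper service might run, in Python. SQLite from the standard library stands in for whatever database is actually being fronted, and the `probe_database` and `health_response` names are hypothetical; the resulting status and body would be returned by a small HTTP handler at `/status/health`.

```python
import json
import sqlite3


def probe_database(path: str) -> dict:
    """Run a trivial query; report failure in the result instead of raising."""
    try:
        connection = sqlite3.connect(path, timeout=1.0)
        try:
            connection.execute("SELECT 1").fetchone()
        finally:
            connection.close()
        return {"healthy": True}
    except sqlite3.Error as error:
        return {"healthy": False, "error": str(error)}


def health_response(path: str):
    """Translate the probe result into the 200 / 500 convention."""
    report = probe_database(path)
    return (200 if report["healthy"] else 500), json.dumps(report)
```

The probe is deliberately cheap (`SELECT 1`) so that frequent polling does not load the component it is checking.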

Helper service
61

https://github.com/Lugribossk/simple-dashboard
62

63
Question:

What does this look like for
Serverless?

¯\_(ツ)_/¯

Endpoint healthchecks
Rapid diagnosis and visibility
Reduce confusion around
environment state
“Fail fast” → “learn sooner”
64

Endpoint healthchecks:
Collaborate on component
health status for better
system awareness
65

Correlation IDs

“Which nodes handled
the request?”
67

Correlation IDs
Unique-ish identifiers
Trace calls across machine &
container boundaries
Re-assemble HTTP
request/response later
68

‘Unique-ish’ identifier for each request

Passed through downstream layers
69

Correlation IDs help teams to
think about the big picture:
end-to-end outcomes
70

Unique-ish ID
71

Synchronous HTTP:

X-HEADER e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on

(Yes, RFC6648, but this is internal only)
72
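
The "pass it on" rule can be sketched in Python as a pure function over header dictionaries. The `X-trace-id` header name is from the slide; generating a fresh 8-character ID when no header arrives is an assumption about what an edge service would do, and `outgoing_headers` is a hypothetical helper name.

```python
import uuid

TRACE_HEADER = "X-trace-id"


def outgoing_headers(incoming: dict) -> dict:
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in incoming.items():
        if name.lower() == TRACE_HEADER.lower():
            return {TRACE_HEADER: value}
    # Assumption: the first service to see the request mints a 'unique-ish' ID.
    return {TRACE_HEADER: uuid.uuid4().hex[:8]}
```

A service would merge the returned dictionary into the headers of every downstream HTTP call it makes while handling the request, and include the same value in its own log lines.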

Asynchronous (queues, etc.):

Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / ReceiveMessage()
Log the Correlation ID if present
73
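
A sketch of carrying the trace ID in message attributes, in Python. The dictionaries follow the shape that AWS SQS uses for `MessageAttributes` (a `DataType` plus a `StringValue`), but no AWS call is made here; the helper names are hypothetical.

```python
def with_trace_attribute(message_attributes: dict, trace_id: str) -> dict:
    """Return a copy of SQS-style message attributes carrying the trace ID."""
    attributes = dict(message_attributes)
    attributes["trace-id"] = {"DataType": "String", "StringValue": trace_id}
    return attributes


def extract_trace_id(message: dict):
    """Read the trace ID from a received message, or None if it is absent."""
    attribute = message.get("MessageAttributes", {}).get("trace-id")
    return attribute["StringValue"] if attribute else None
```

The producer would pass the result of `with_trace_attribute` as the message attributes when sending, and the consumer would call `extract_trace_id` on each received message and log the value if it is not `None`.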

Example: OpenTracing / PCF

3 tracing elements:
TraceID, SpanID, ParentSpan
"X-B3-TraceId" "X-B3-SpanId"
"X-B3-ParentSpanId"
74

Example: OpenTracing / PCF

Always log the TraceID as-is
Log calling SpanID as ParentSpan
Log new SpanID
75
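
The three logging rules above can be sketched in Python as a function that derives the headers for a downstream call. It uses the B3 propagation header names `X-B3-TraceId`, `X-B3-SpanId` and `X-B3-ParentSpanId`; the function name and the choice to start a new trace when no trace ID arrives are assumptions.

```python
import secrets


def child_span_headers(incoming: dict) -> dict:
    """Derive B3 headers for a downstream call from the incoming request."""
    # Keep the TraceId as-is; assumption: start a new trace if none was sent.
    trace_id = incoming.get("X-B3-TraceId") or secrets.token_hex(8)
    headers = {
        "X-B3-TraceId": trace_id,
        # Every hop gets a new 16-hex-character span ID.
        "X-B3-SpanId": secrets.token_hex(8),
    }
    # The caller's span becomes the parent of the new span.
    if "X-B3-SpanId" in incoming:
        headers["X-B3-ParentSpanId"] = incoming["X-B3-SpanId"]
    return headers
```

Logging all three values on each hop is what lets the trace, span and parent-span relationships be reassembled later from logs alone.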

Trace
Span
ParentSpan
76

Correlation IDs
Detect bottlenecks and
unexpected interactions
Increase transparency
Learn about the system
77

Correlation IDs:
Collaborate on distributed
tracing for better system
awareness
78

Lightweight user personas

Software is difficult to
operate: poor UX for Ops.
80

Lightweight User Personas
Simple characterisation of
user needs for Dev/Test/Ops
Based on full UX user
personas but less detailed
81

Lightweight user personas:

Ops Engineer
Test Engineer
Build & Deployment Engineer
Service Owner
82

Lightweight user personas
help teams to build systems
with good UX for all users
83

Lightweight user personas:

Consider the User Experience (UX) of
engineers and team members using
and working with the software
84

http://www.keepitusable.com/blog/?tag=alan-cooper
85
Motivations

Goals

Frustrations

Lightweight user personas:

What data does the User Persona need
visible on a dashboard in order to
make decisions rapidly & safely?
86

https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/
87

Lightweight User Personas
Empathise better with people
from other roles
Capture missing operational
requirements
88

Lightweight User Personas:
Collaborate on user needs
for better
system awareness
89

Summary
90

Operability

making software work well
in Production
91

92

Lack of observability
Operational aspects not known
“Why has deployment failed?”
What handled the request?
Poor UX for Ops
93

94
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com

Logging with Event IDs

use enum-based Event IDs to
explore runtime behaviour and
fault conditions
95

Run Book dialogue sheets

explore and establish
operational requirements as a
team, around a physical table,
together
96

Endpoint healthchecks

HTTP 200 / 500 responses to
/status/health call with JSON
details – good for tools and
humans
97

Correlation IDs

trace execution using correlation IDs:

synchronous (HTTP X-trace-id)
async (SQS MessageAttribute)
98

Lightweight user personas

explore the UX and needs of
different roles for rapid
decisions via dashboards
99

use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
100

Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
operabilitybook.com
20% discount for SQUID meetup!
http://leanpub.com/SoftwareOperability/c/SquidMeetup
Download a free sample chapter
101

Resources
• Team Guide to Software Operability by Matthew Skelton and Rob Thatcher http://operabilitybook.com/
• Run Book template & Run Book dialogue sheets http://runbooktemplate.info/
• Operability Questions http://operabilityquestions.com/
• 5 proven operability techniques for software teams https://techbeacon.com/5-proven-operability-techniques-software-teams
102

thank you
103
@matthewpskelton / operabilitybook.com

@ConfluxHQ / confluxdigital.net