5 practical operability techniques for teams - Matthew Skelton - ADDO 2018

October 17, 2018
5 practical operability techniques
for teams
Matthew Skelton, Conflux
@matthewpskelton
confluxdigital.net

October 17, 2018
Thank You Supporters

October 17, 2018
Meet me in the Slack channel for Q&A
bit.ly/addo-slack

Practical operability
Why do we need a focus on
operability?
5 practical operability
techniques that work
5

modern event-based logging
Run Book dialogue sheets
endpoint healthchecks
correlation IDs
user personas
6

team
collaboration
techniques
7

About me
Matthew Skelton, Conflux
@matthewpskelton
matthewskelton.net
Leeds, UK
8

You?
Software Developer
Tester / QA
DevOps Engineer
Ops Engineer / SRE
Head of Department
9

Practical
Operability Techniques
for Teams
10

Operability
making software work well
in Production
11

Operability
12
Scale
Restore
Inspect
Failover
Monitor
Diagnose
Secure
Cleardown
Report

“But can’t we just give
those things to an SRE?”
13

“But can’t we just give
those things to the
DevOp?”
14

Operability is
a shared concern
#BizDevTestSecOps
15

Operability is
a shared concern
#BizDevTestSecOps
16

18
A self-managed Kubernetes
cluster near you

Operability is
a shared concern
#BizDevTestSecOps
19

20
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com

Practical operability techniques
1. Modern logging with event IDs
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs
5. Lightweight User Personas
21

Lack of observability for
distributed systems
23

Modern logging w/ Event IDs
Distinct application states
No “logorrhoea” (!)
Distributed tracing via logs
Build a shared understanding
24

search by event
Event ID
{Delivered,
InTransit,
Arrived}
25

transaction
trace
Correlation ID
612999958…
26

Modern logging with event IDs
helps to produce a well-defined
event space:
human-readable events
27

How many distinct event
types (state transitions) in
your application?
28

enum
Human-readable sets:
unique values, sparse,
immutable
C#, Java, Python, node
(Ruby, PHP, …)
32

Technical
Domain
public enum EventID
{
// Badly-initialised logging data
NotSet = 0,
// An unrecognised event has occurred
UnexpectedError = 10000,
ApplicationStarted = 20000,
ApplicationShutdownNoticeReceived = 20001,
MessageQueued = 40000,
MessagePeeked = 40001,
BasketItemAdded = 60001,
BasketItemRemoved = 60002,
CreditCardDetailsSubmitted = 70001,
// ...
} 33

BasketItemAdded = 60001
BasketItemRemoved = 60002
34

example:
https://github.com/EqualExperts/opslogger
Sean Reilly
@seanjreilly 35

Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency → TV broadcaster
High throughput
Glitch-free video & audio
36

Storage I/O
Worker Job
Queue
Upload
37

Example: video processing
Discover processing bottlenecks
Trigger alerts
Report on KPIs
Target areas for improvement
40

Modern logging w/ Event IDs
clarity about software behavior
reduce time to detect problems
increase team engagement
enhance collaboration
41

Modern Logging:
Collaborate on Event IDs
and Correlation traces for
better system awareness
42

Operational aspects not
addressed, or addressed
too late in the cycle
44

Checklists for typical
operational considerations
Team-friendly exploration
45

Run Book dialogue sheets help
to increase awareness of
operability within teams
46

runbooktemplate.infoRun Book dialogue sheets
47

System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can portions or features of the
system be unavailable at times if needed?
Hours of operation - core features
(e.g. 03:00-01:00 GMT+0)
Hours of operation - secondary features
(e.g. 07:00-23:00 GMT+0)
Data and processing flows
How and where does data flow through the system? What controls or triggers data flows?
(e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data )
… 48

http://runbooktemplate.info/
Github, CC BY-SA
49

runbooktemplate.infoRun Book dialogue sheets
50

Early discovery of
operational requirements
Input to team backlog
“Shift-left” testing
Avoid operational problems
51

52
http://operabilityquestions.com/
Github, CC BY-SA

OperabilityQuestions.com
Freeform, exploratory
questions for teams
Usability, viability, reliability,
observability, securability, …
(Github, CC BY-SA)
53

Run Book dialogue sheets:
Collaborate on operational
requirements for better
system awareness
54

“Why has my deployment
failed again?”
“Why is Pre-Prod always
so flaky?”
56

Endpoint healthchecks
Simple HTTP check
Common way to assess any
service/app/component
Key operational requirement
57

Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
58

Endpoint healthchecks help
teams to collaborate on
service viability
59

Each component is responsible for
determining its own health and
viability – this is very contextual
60

Use JSON as a response type –
parsable by both
machines and humans!
61

For databases and other non-HTTP
components, run a lightweight HTTP
service in front of the component
200 / 500 responses
63

https://github.com/Lugribossk/simple-dashboard
65

66
Question:
What does this look like for
Serverless?
¯_(ツ)_/¯

Rapid diagnosis and visibility
Reduce confusion around
environment state
“Fail fast” → “learn sooner”
67

Endpoint healthchecks:
Collaborate on component
health status for better
system awareness
68

“Which nodes handled
the request?”
70

Correlation IDs
Unique-ish identifiers
Trace calls across machine &
container boundaries
Re-assemble HTTP
request/response later
71

‘Unique-ish’ identifier for each request
Passed through downstream layers
72

Correlation IDs help teams to
think about the big picture:
end-to-end outcomes
73

Synchronous HTTP:
X-HEADER e.g. X-trace-id
X-trace-id: 348e1cf8
If header is present, pass it on
(Yes, RFC6648, but this is internal only)
75

Asynchonous (queues, etc.):
Message Attributes, name:value pair
e.g. "trace-id":"348e1cf8"
AWS SQS: SendMessage() / ReceiveMessage()
Log the Correlation ID if present
76

Example: OpenTracing / PCF
3 tracing elements:
TraceID, SpanID, ParentSpan
"X-B3-TraceId" "X-B3-SpanId"
"X-B3-ParentSpan"
77

Example: OpenTracing / PCF
Always log the TraceID as-is
Log calling SpanID as ParentSpan
Log new SpanID
78

Correlation IDs
Detect bottlenecks and
unexpected interactions
Increase transparency
Learn about the system
80

Correlation IDs:
Collaborate on distributed
tracing for better system
awareness
81

Software is difficult to
operate: poor UX for Ops.
83

Lightweight User Personas
Simple characterisation of
user needs for Dev/Test/Ops
Based on full UX user
personas but less detailed
84

Lightweight user personas:
Ops Engineer
Test Engineer
Build & Deployment Engineer
Service Owner
85

Lightweight user personas
help teams to build systems
with good UX for all users
86

Consider the User Experience (UX) of
engineers and team members using
and working with the software
87

http://www.keepitusable.com/blog/?tag=alan-cooper
88
Motivations
Goals
Frustrations

What data does the User Persona need
visible on a dashboard in order to
make decisions rapidly & safely?
89

https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/ 90

Lightweight User Personas
Empathise better with people
from other roles
Capture missing operational
requirements
91

Lightweight User Personas:
Collaborate on user needs
for better
system awareness
92

Operability
making software work well
in Production
94

Lack of observability
Operational aspects not known
“Why has deployment failed?”
What handled the request?
Poor UX for Ops
96

97
SRE: operability consultants
Collaborate on
operability
here
CC BY-SA devopstopologies.com

Logging with Event IDs
use enum-based Event IDs to
explore runtime behaviour and
fault conditions
98

explore and establish
operational requirements as a
team, around a physical table,
together
99

HTTP 200 / 500 responses to
/status/health call with JSON
details – good for tools and
humans
100

Correlation IDs
trace execution using correlation
IDs:
synchronous (HTTP X-trace-id)
async (SQS MessageAttribute)
101

Lightweight user personas
explore the UX and needs of
different roles for rapid
decisions via dashboards
102

use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
103

Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
operabilitybook.com
20% discount for ADDO 2018!
http://leanpub.com/SoftwareOperability/c/ADDO2018
Download a free sample chapter
104

Resources
•Team Guide to Software Operability by Matthew Skelton
and Rob Thatcher http://operabilitybook.com/
•Run Book template & Run Book dialogue sheets
http://runbooktemplate.info/
•Operability Questions http://operabilityquestions.com/
•5 proven operability techniques for software teams
https://techbeacon.com/5-proven-operability-techniques-
software-teams
105

October 17, 2018
thank you
@matthewpskelton / operabilitybook.com
@ConfluxHQ / confluxdigital.net

5 practical operability techniques for teams - Matthew Skelton - ADDO 2018

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 5 practical operability techniques for teams - Matthew Skelton - ADDO 2018

Similar to 5 practical operability techniques for teams - Matthew Skelton - ADDO 2018 (20)

Recently uploaded

Recently uploaded (20)

5 practical operability techniques for teams - Matthew Skelton - ADDO 2018