In this talk, we explore five practical, tried-and-tested, real world techniques for improving operability with many kinds of software systems, including cloud, Serverless, on-premise, and IoT.
Logging as a live diagnostics vector with sparse Event IDs
Operational checklists and ‘Run Book dialogue sheets’ as a discovery mechanism for teams
Endpoint healthchecks as a way to assess runtime dependencies and complexity
Correlation IDs beyond simple HTTP calls
Lightweight ‘User Personas’ as drivers for operational dashboards
Based on our work in many industry sectors, we will share our experience of helping teams to improve the operability of their software systems through
Required audience experience
Some experience of building web-scale systems or industrial IoT/embedded systems would be helpful.
Objective of the talk
We will share our experience of helping teams to improve the operability of their software systems. Attendees will learn some practical operability approaches and how teams can expand their understanding and awareness of operability through these simple, team-friendly techniques.
From a talk given at Continuous Lifecycle London 2018: https://continuouslifecycle.london/sessions/practical-team-focused-operability-techniques-for-distributed-systems/
17. Practical operability techniques
1. Modern logging
2. Run Book dialogue sheets
3. Endpoint healthchecks
4. Correlation IDs
5. Lightweight User Personas
17
31. Example: video processing
On-demand processing of TV and
mobile streaming adverts
Ad-agency → TV broadcaster
High throughput
Glitch-free video & audio
31
42. System characteristics
Hours of operation
During what hours does the service or system actually need to operate? Can portions or features of the
system be unavailable at times if needed?
Hours of operation - core features
(e.g. 03:00-01:00 GMT+0)
Hours of operation - secondary features
(e.g. 07:00-23:00 GMT+0)
Data and processing flows
How and where does data flow through the system? What controls or triggers data flows?
(e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data )
… 42
50. endpoint healthchecks
Every runnable app/service/daemon
exposes /status/health
An HTTP GET to the endpoint returns:
200 – "I am healthy"
500 – "I am sick"
50
91. use modern logging, Run Book
dialogue sheets, endpoint
healthchecks, correlation IDs,
and user personas as
team collaboration techniques
91
92. Team Guide to
Software Operability
Matthew Skelton & Rob Thatcher
skeltonthatcher.com/publications
Download a free sample chapter
92
93. Resources
•Team Guide to Software Operability by Matthew
Skelton and Rob Thatcher (Skelton Thatcher
Publications, 2016) http://operabilitybook.com/
•Run Book template & Run Book dialogue sheets
http://runbooktemplate.info/
93