Practical operability techniques for distributed systems - Velocity EU 2017

Modern software systems now increasingly span cloud and on-premises deployments and remote embedded devices and sensors. These distributed systems bring challenges with data, connectivity, performance, and systems management; to ensure success, you must design and build with operability as a first-class property.

Matthew Skelton shares five practical, tried-and-tested techniques for improving operability with many kinds of software systems, including the cloud, serverless, on-premises, and the IoT: logging as a live diagnostics vector with sparse event IDs; operational checklists and runbook dialog sheets as a discovery mechanism for teams; endpoint health checks as a way to assess runtime dependencies and complexity; correlation IDs beyond simple HTTP calls; and lightweight user personas as drivers for operational dashboards.

These techniques work very differently with different technologies. For instance, an IoT device has limited storage, processing, and I/O, so generating and shipping logs and metrics looks very different from the cloud or serverless cases. However, the principles (logging as a live diagnostics vector, event IDs for discovery, etc.) work remarkably well across very different technologies.

Drawing from his experience helping teams improve the operability of their software systems, Matthew explains what works (and what doesn’t) and how teams can expand their understanding and awareness of operability through these straightforward, team-friendly techniques.
From a talk given by Matthew Skelton at Velocity Conference EU 2017 - https://conferences.oreilly.com/velocity/vl-eu/public/schedule/detail/61954

Published in: Software

  1. Practical Operability Techniques for Teams Matthew Skelton Skelton Thatcher Consulting skeltonthatcher.com / @SkeltonThatcher Velocity Conference EU 2017, London – 19 Oct 2017
  2. Today What is operability? Modern logging Run Book dialogue sheets Endpoint healthchecks Correlation IDs User Personas for dashboards
  3. Training 1-day tutorial with exercises Book via Unicom: http://www.unicom.co.uk/workshops.html
  4. You Software Developer Tester / QA DevOps Engineer Team Leader Head of Department
  5. Operability: use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques
  6. About us Co-founders at Skelton Thatcher Consulting Matthew Skelton Rob Thatcher
  7. Team-first digital transformation 30+ organisations UK, US, EU, India, China
  8. We build modern capabilities by mentoring your teams
  9. Team Guide to Software Operability Matthew Skelton & Rob Thatcher skeltonthatcher.com/publications Download a free sample chapter
  10. Practical Operability Techniques for Teams
  11. What is operability?
  12. Operability making software work well in Production
  13. Logging with Event IDs
  14. logging with Event IDs reduce time to detect problems increase team engagement increase configurability enhance collaboration #operability
  15. search by event Event ID {Delivered, InTransit, Arrived}
  16. transaction trace Correlation ID 612999958…
  17. How many distinct event types (state transitions) in your application?
  18. represent distinct states
  19. enum Human-readable sets: unique values, sparse, immutable C#, Java, Python, node (Ruby, PHP, …)
  20. Technical Domain
      public enum EventID
      {
          // Badly-initialised logging data
          NotSet = 0,
          // An unrecognised event has occurred
          UnexpectedError = 10000,
          ApplicationStarted = 20000,
          ApplicationShutdownNoticeReceived = 20001,
          MessageQueued = 40000,
          MessagePeeked = 40001,
          BasketItemAdded = 60001,
          BasketItemRemoved = 60002,
          CreditCardDetailsSubmitted = 70001,
          // ...
      }
  21. BasketItemAdded = 60001 BasketItemRemoved = 60002
  22. log using Event IDs with a modern ‘structured logging’ library
  23. example: https://github.com/EqualExperts/opslogger Sean Reilly @seanjreilly
  24. Example: video processing On-demand processing of TV and mobile streaming adverts Ad-agency to TV broadcaster High throughput Glitch-free video & audio
  25. Storage I/O Worker Job Queue Upload
  26. Example: video processing Discover processing bottlenecks Trigger alerts via LogEntries / HostedGraphite Report on KPIs Target areas for improvement
  27. http://honeycomb.tv/
  28. Run Book dialogue sheets
  29. Run Book dialogue sheets Checklists for typical operational considerations Team-friendly exploration
  30. runbooktemplate.info Run Book dialogue sheets
  31. System characteristics Hours of operation During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed? Hours of operation - core features (e.g. 03:00-01:00 GMT+0) Hours of operation - secondary features (e.g. 07:00-23:00 GMT+0) Data and processing flows How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data) …
  32. http://runbooktemplate.info/
  33. runbooktemplate.info Run Book dialogue sheets
  34. Endpoint healthchecks
  35. endpoint healthchecks Every runnable app/service/daemon exposes /status/health An HTTP GET to the endpoint returns: 200 – "I am healthy" 500 – "I am sick"
  36. endpoint healthchecks Each component is responsible for determining its own health and viability – this is very contextual
  37. endpoint healthchecks Use JSON as a response type – parsable by both machines and humans!
  38. endpoint healthchecks For databases and other non-HTTP components, run a lightweight HTTP service in front of the component 200 / 500 responses
  39. Helper service
  40. https://github.com/Lugribossk/simple-dashboard
  41. Correlation IDs
  42. ‘Unique-ish’ identifier for each request Passed through downstream layers
  43. Unique-ish ID
  44. Synchronous HTTP: X-HEADER e.g. X-trace-id X-trace-id: 348e1cf8 If header is present, pass it on (Yes, RFC6648, but this is internal only)
  45. Asynchronous (queues, etc.): Message Attributes, name:value pair e.g. "trace-id":"348e1cf8" AWS SQS: SendMessage() / ReceiveMessage() Log the Correlation ID if present
  46. Example: electronic trading High speed, low latency Trading options & derivatives Connected to stock exchanges Sub-millisecond timings > £1 million per day traded
  47. Correlation IDs for trading Evidence for timely operation Help identify bottlenecks Target areas for perf tuning Identify race conditions Increase operability
  48. Example: OpenTracing / PCF 3 tracing elements: TraceID, SpanID, ParentSpan "X-B3-TraceId" "X-B3-SpanId" "X-B3-ParentSpanId"
  49. Example: OpenTracing / PCF Always log the TraceID as-is Log calling SpanID as ParentSpan Log new SpanID
  50. Trace Span ParentSpan
  51. Lightweight user personas
  52. Lightweight user personas: Ops Engineer Test Engineer Build & Deployment Engineer Service Owner
  53. Lightweight user personas: Consider the User Experience (UX) of engineers and team members using and working with the software
  54. http://www.keepitusable.com/blog/?tag=alan-cooper
  55. https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/
  56. Lightweight user personas: What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely?
  57. Summary
  58. Operability making software work well in Production
  59. logging with Event IDs use enum-based Event IDs to explore runtime behaviour and fault conditions
  60. Run Book dialogue sheets explore and establish operational requirements as a team, around a physical table, together
  61. endpoint healthchecks HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans
  62. Correlation IDs trace execution using correlation IDs: synchronous (HTTP X-trace-id) async (SQS MessageAttribute)
  63. lightweight user personas explore the UX and needs of different roles for rapid decisions via dashboards
  64. Operability use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques
  65. Team Guide to Software Operability Matthew Skelton & Rob Thatcher skeltonthatcher.com/publications Download a free sample chapter
  66. Training 1-day tutorial with exercises Book via Unicom: http://www.unicom.co.uk/workshops.html
  67. Resources
      • Training: Practical Operability for Developers and Testers – led by Matthew Skelton and Rob Thatcher – 1-day workshop – http://www.unicom.co.uk/practical-operability-for-developers-and-testers.html
      • Team Guide to Software Operability by Matthew Skelton and Rob Thatcher (Skelton Thatcher Publications, 2016) http://operabilitybook.com/
      • Run Book template & Run Book dialogue sheets http://runbooktemplate.info/
  68. Questions? Twitter: @SkeltonThatcher | #operability email: questions@skeltonthatcher.com
  69. thank you @SkeltonThatcher skeltonthatcher.com
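
Sketch for slides 19 to 22 (logging with Event IDs). The slides show a C# enum and recommend a structured logging library. The following is a minimal Python sketch of the same idea, not the talk's implementation: the member names and values are copied from slide 20, while `format_event`, `log_event`, and the JSON field names `event_id` and `event` are hypothetical choices made for illustration.

```python
import json
import logging
from enum import IntEnum, unique


@unique  # refuse two names mapped to the same numeric value
class EventID(IntEnum):
    """Sparse, human-readable event IDs (names and values from slide 20)."""
    NotSet = 0                  # badly-initialised logging data
    UnexpectedError = 10000     # an unrecognised event has occurred
    ApplicationStarted = 20000
    MessageQueued = 40000
    BasketItemAdded = 60001
    BasketItemRemoved = 60002


def format_event(event: EventID, **fields) -> str:
    """Render one log entry as a single JSON object (assumed field names)."""
    entry = {"event_id": int(event), "event": event.name}
    entry.update(fields)
    return json.dumps(entry, sort_keys=True)


def log_event(logger: logging.Logger, event: EventID, **fields) -> None:
    """Emit the structured entry at INFO level."""
    logger.info(format_event(event, **fields))
```

Calling `log_event(logging.getLogger("basket"), EventID.BASKET_ITEM` style names is a matter of taste; here the slide's names are kept verbatim, so a call such as `log_event(logger, EventID.BasketItemAdded, sku="ABC-1")` produces an entry that can be searched by either the number 60001 or the name.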
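
Sketch for slides 35 to 38 (endpoint healthchecks). The slides describe a `/status/health` endpoint that answers an HTTP GET with 200 or 500 and a JSON body. A minimal sketch using only the Python standard library might look like this. The `check_health` function, its `database` and `queue` entries, and the port number are placeholders invented for the example; a real component would decide its own health in its own way, as slide 36 notes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_health() -> dict:
    """Hypothetical component-specific checks; True means the dependency is OK."""
    checks = {"database": True, "queue": True}  # placeholders for real probes
    return {"healthy": all(checks.values()), "checks": checks}


def health_response(report: dict) -> tuple:
    """Map a health report to (status code, JSON body): 200 healthy, 500 sick."""
    status = 200 if report["healthy"] else 500
    return status, json.dumps(report).encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = health_response(check_health())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the healthcheck server until interrupted (port is an assumption)."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

The same handler could sit in front of a database or other non-HTTP component as the "helper service" of slides 38 and 39, with `check_health` probing that component.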
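
Sketch for slides 42 to 45 (correlation IDs). The slides describe passing an `X-trace-id` header through synchronous HTTP calls and a `trace-id` message attribute through queues such as AWS SQS. The sketch below shows one way the "if present, pass it on" rule could be written. The helper names are hypothetical, the 8-character ID length simply mirrors the `348e1cf8` example, and no AWS call is made: the dictionaries only follow the attribute shape that SQS SendMessage and ReceiveMessage use.

```python
import uuid

TRACE_HEADER = "X-trace-id"      # header name from slide 44
TRACE_ATTRIBUTE = "trace-id"     # message attribute name from slide 45


def ensure_trace_id(headers: dict) -> str:
    """Return the caller's trace ID, or mint a 'unique-ish' one if absent."""
    for name, value in headers.items():
        if name.lower() == TRACE_HEADER.lower():  # HTTP header names are case-insensitive
            return value
    return uuid.uuid4().hex[:8]


def outgoing_headers(trace_id: str, extra: dict = None) -> dict:
    """Headers for a downstream HTTP call, carrying the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def sqs_message_attributes(trace_id: str) -> dict:
    """MessageAttributes value in the shape SQS SendMessage accepts."""
    return {TRACE_ATTRIBUTE: {"DataType": "String", "StringValue": trace_id}}


def trace_id_from_message(message: dict):
    """Read the trace ID from a received SQS-style message, or None if missing."""
    attribute = message.get("MessageAttributes", {}).get(TRACE_ATTRIBUTE)
    return attribute["StringValue"] if attribute else None
```

Whatever `ensure_trace_id` or `trace_id_from_message` returns would then be included in every log entry for that request, which is what makes the transaction trace on slide 16 searchable.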
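
Sketch for slides 48 to 50 (TraceID, SpanID, ParentSpan). The three rules on slide 49 (keep the TraceID as-is, record the caller's SpanID as the parent, create a new SpanID) can be expressed as a small pure function. This is an illustrative sketch rather than a tracing library: the function names and the returned field names are assumptions, the header lookup assumes exact-case keys, and the parent header is written as `X-B3-ParentSpanId`, the name used in the Zipkin B3 propagation format.

```python
import secrets

B3_TRACE = "X-B3-TraceId"
B3_SPAN = "X-B3-SpanId"
B3_PARENT = "X-B3-ParentSpanId"


def new_id() -> str:
    """Random 64-bit identifier as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def child_span(incoming: dict) -> dict:
    """Derive this service's tracing fields from the caller's B3 headers."""
    return {
        "trace_id": incoming.get(B3_TRACE) or new_id(),  # keep TraceID as-is
        "parent_span_id": incoming.get(B3_SPAN),         # caller's span becomes the parent
        "span_id": new_id(),                             # new span for this unit of work
    }


def b3_headers(span: dict) -> dict:
    """Headers to send downstream so the next service can repeat the process."""
    headers = {B3_TRACE: span["trace_id"], B3_SPAN: span["span_id"]}
    if span["parent_span_id"]:
        headers[B3_PARENT] = span["parent_span_id"]
    return headers
```

Logging all three fields from `child_span` with each entry lets a log search reconstruct the Trace, Span, ParentSpan tree shown on slide 50.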
