Practical operability techniques for teams - IPEXPO 2017


Modern software systems increasingly span cloud, on-premise, and remote embedded devices & sensors. These distributed systems bring challenges with data, connectivity, performance, and systems management, so for business success we need to design and build with operability as a first-class property.
In this talk, we explore five practical, tried-and-tested, real-world techniques for improving the operability of many kinds of software systems, including cloud, Serverless, on-premise, and IoT:

- Logging as a live diagnostics vector with sparse Event IDs
- Operational checklists and 'Run Book dialogue sheets' as a discovery mechanism for teams
- Endpoint healthchecks as a way to assess runtime dependencies and complexity
- Correlation IDs beyond simple HTTP calls
- Lightweight 'User Personas' as drivers for operational dashboards

These techniques work very differently with different technologies. For instance, an IoT device has limited storage, processing, and I/O, so generation and shipping of logs and metrics looks very different from the cloud or Serverless case. However, the principles - logging as a live diagnostics vector, Event IDs for discovery, etc. - work remarkably well across very different technologies.

Published in: Software

  1. 1. Practical Operability Techniques for Teams Matthew Skelton Skelton Thatcher Consulting / @SkeltonThatcher IPEXPO Europe, ExCeL, London – 05 October 2017
  2. 2. Today What is operability? Modern logging Run Book dialogue sheets Endpoint healthchecks Correlation IDs User Personas for dashboards
  3. 3. Training 1-day tutorial with exercises Book via Unicom:
  4. 4. You Software Developer Tester / QA DevOps Engineer Team Leader Head of Department
  5. 5. Operability: use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques
  6. 6. About us Co-founders at Skelton Thatcher Consulting Matthew Skelton Rob Thatcher
  7. 7. Team-first digital transformation 30+ organisations UK, US, EU, India, China
  8. 8. We build modern capabilities by mentoring your teams
  9. 9. Team Guide to Software Operability Matthew Skelton & Rob Thatcher Download a free sample chapter
  10. 10. Practical Operability Techniques for Teams
  11. 11. What is operability?
  12. 12. Operability: making software work well in Production
  13. 13. Logging with Event IDs
  14. 14. logging with Event IDs reduce time to detect problems increase team engagement increase configurability enhance collaboration #operability
  15. 15. search by event Event ID {Delivered, InTransit, Arrived}
  16. 16. transaction trace Correlation ID 612999958…
  17. 17. How many distinct event types (state transitions) in your application?
  18. 18. represent distinct states
  19. 19. enum Human-readable sets: unique values, sparse, immutable C#, Java, Python, node (Ruby, PHP, …)
  20. 20. Technical Domain
      public enum EventID
      {
          // Badly-initialised logging data
          NotSet = 0,
          // An unrecognised event has occurred
          UnexpectedError = 10000,
          ApplicationStarted = 20000,
          ApplicationShutdownNoticeReceived = 20001,
          MessageQueued = 40000,
          MessagePeeked = 40001,
          BasketItemAdded = 60001,
          BasketItemRemoved = 60002,
          CreditCardDetailsSubmitted = 70001,
          // ...
      }
  21. 21. BasketItemAdded = 60001 BasketItemRemoved = 60002
  22. 22. log using Event IDs with a modern ‘structured logging’ library
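The Event ID slides above can be sketched in Python, one of the languages slide 19 lists. This is a minimal illustration, not the speakers' implementation: the enum values mirror part of the C# example, the standard `logging` and `json` modules stand in for a structured logging library, and the helper names `format_event` and `log_event` and the `sku` field are hypothetical.

```python
import json
import logging
from enum import IntEnum, unique


@unique  # reject accidental duplicate values: every Event ID must be distinct
class EventID(IntEnum):
    """Sparse, human-readable Event IDs (a subset of the slide's C# enum)."""
    NOT_SET = 0
    UNEXPECTED_ERROR = 10000
    APPLICATION_STARTED = 20000
    MESSAGE_QUEUED = 40000
    BASKET_ITEM_ADDED = 60001
    BASKET_ITEM_REMOVED = 60002


def format_event(event_id: EventID, message: str, **fields) -> str:
    """Render one log entry as a JSON object that tools can search by Event ID."""
    entry = {
        "event_id": int(event_id),
        "event_name": event_id.name,
        "message": message,
    }
    entry.update(fields)  # extra context, e.g. a correlation ID
    return json.dumps(entry, sort_keys=True)


def log_event(logger: logging.Logger, event_id: EventID, message: str, **fields) -> None:
    """Hypothetical helper: write the structured entry at INFO level."""
    logger.info(format_event(event_id, message, **fields))
```

Calling `log_event(logging.getLogger("basket"), EventID.BASKET_ITEM_ADDED, "item added", sku="ABC-1")` would emit a single JSON line. Because the IDs are sparse, new events can be added within a range later without renumbering existing ones.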
  23. 23. Example: video processing On-demand processing of TV advertisements Ad-agency → TV broadcaster High throughput Glitch-free video & audio
  24. 24. Storage I/O Worker Job Queue Upload
  25. 25. Example: video processing Discover processing bottlenecks Trigger alerts via LogEntries / HostedGraphite Report on KPIs Target areas for improvement
  26. 26. Run Book dialogue sheets
  27. 27. Run Book dialogue sheets Checklists for typical operational considerations Team-friendly exploration
  28. 28. Run Book dialogue sheets runbooktemplate.info
  29. 29. System characteristics Hours of operation During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed? Hours of operation - core features (e.g. 03:00-01:00 GMT+0) Hours of operation - secondary features (e.g. 07:00-23:00 GMT+0) Data and processing flows How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data ) …
  32. 32. Endpoint healthchecks
  33. 33. endpoint healthchecks Every runnable app/service/daemon exposes /status/health An HTTP GET to the endpoint returns: 200 – "I am healthy" 500 – "I am sick"
  34. 34. endpoint healthchecks Use JSON as a response type – parsable by both machines and humans!
  35. 35. endpoint healthchecks For databases and other non-HTTP components, run a lightweight HTTP service in front of the component 200 / 500 responses
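The healthcheck slides above could look roughly like this using only the Python standard library. It is a sketch under assumptions: the `/status/health` path and the 200 / 500 convention come from the slides, while `check_dependencies`, the JSON field names, and the port are hypothetical placeholders.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> dict:
    """Hypothetical probes; a real service would test its database, queue, etc."""
    return {"database": True, "queue": True}


def health_response(checks: dict) -> tuple:
    """Map dependency check results to (HTTP status code, JSON body)."""
    healthy = all(checks.values())
    body = json.dumps({"healthy": healthy, "checks": checks}, sort_keys=True)
    return (200 if healthy else 500), body


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /status/health with 200 ("I am healthy") or 500 ("I am sick")."""

    def do_GET(self):
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the healthcheck endpoint until interrupted (port is an assumption)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

The same small service can sit in front of a database or other non-HTTP component, as slide 35 suggests, with `check_dependencies` performing a lightweight query against that component.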
  37. 37. Correlation IDs
  38. 38. ‘Unique-ish’ identifier for each request Passed through downstream layers
  39. 39. Unique-ish ID
  40. 40. Synchronous HTTP: X-HEADER e.g. X-trace-id X-trace-id: 348e1cf8 If header is present, pass it on (Yes, RFC6648, but this is internal only)
  41. 41. Asynchronous (queues, etc.): Message Attributes, name:value pair e.g. "trace-id":"348e1cf8" AWS SQS: SendMessage() / ReceiveMessage() Log the Correlation ID if present
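A rough sketch of both propagation paths from slides 40 and 41 follows. The header name `X-trace-id`, the attribute name `trace-id`, and the 8-character ID format come from the slides. Generating a fresh ID when none arrives is an assumption, and all function names are hypothetical. The attribute dictionary follows the `DataType` / `StringValue` shape that SQS message attributes use, but no AWS call is made here.

```python
import uuid
from typing import Optional

TRACE_HEADER = "X-trace-id"   # synchronous HTTP calls
TRACE_ATTRIBUTE = "trace-id"  # asynchronous messages


def new_trace_id() -> str:
    """'Unique-ish' ID: the first 8 hex characters of a random UUID."""
    return uuid.uuid4().hex[:8]


def outgoing_headers(incoming_headers: dict) -> dict:
    """If the trace header is present, pass it on; otherwise start a new trace.

    Starting a new trace is an assumption; the slide only says to pass it on.
    """
    for name, value in incoming_headers.items():
        if name.lower() == TRACE_HEADER.lower():  # HTTP header names are case-insensitive
            return {TRACE_HEADER: value}
    return {TRACE_HEADER: new_trace_id()}


def message_attributes(trace_id: str) -> dict:
    """Build the name:value pair in the shape SQS expects for message attributes."""
    return {TRACE_ATTRIBUTE: {"DataType": "String", "StringValue": trace_id}}


def trace_id_from_message(attributes: dict) -> Optional[str]:
    """Return the Correlation ID from received message attributes, if present."""
    attribute = attributes.get(TRACE_ATTRIBUTE)
    return attribute["StringValue"] if attribute else None
```

A consumer would call `trace_id_from_message` on each received message and include the result in every log entry it writes, so that one search on the ID shows the whole transaction across layers.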
  42. 42. Example: electronic trading High speed, low latency Trading options & derivatives Connected to stock exchanges Sub-millisecond timings > £1 million per day traded
  43. 43. Correlation IDs for trading Evidence for timely operation Help identify bottlenecks Target areas for perf tuning Identify race conditions Increase operability
  44. 44. Lightweight user personas
  45. 45. Lightweight user personas: Ops Engineer Test Engineer Build & Deployment Engineer Service Owner
  46. 46. Lightweight user personas: Consider the User Experience (UX) of engineers and team members using and working with the software
  48. 48. progressing-towards-a-more-useful-and-beautiful-dashboard/
  49. 49. Lightweight user personas: What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely?
  50. 50. Summary
  51. 51. Operability: making software work well in Production
  52. 52. logging with Event IDs use enum-based Event IDs to explore runtime behaviour and fault conditions
  53. 53. Run Book dialogue sheets explore and establish operational requirements as a team, around a physical table, together
  54. 54. endpoint healthchecks HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans
  55. 55. Correlation IDs trace execution using correlation IDs: synchronous (HTTP X-trace-id) async (SQS MessageAttribute)
  56. 56. lightweight user personas explore the UX and needs of different roles for rapid decisions via dashboards
  57. 57. Operability use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques
  58. 58. Team Guide to Software Operability Matthew Skelton & Rob Thatcher Download a free sample chapter
  59. 59. Training 1-day tutorial with exercises Book via Unicom:
  60. 60. Resources • Training: Practical Operability for Developers and Testers – led by Matthew Skelton and Rob Thatcher – 1-day workshop – and-testers.html • Team Guide to Software Operability by Matthew Skelton and Rob Thatcher (Skelton Thatcher Publications, 2016) • Run Book template & Run Book dialogue sheets
  61. 61. Questions? Twitter: @SkeltonThatcher | #operability email:
  62. 62. thank you @SkeltonThatcher