Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

5 practical operability techniques - Matthew Skelton - SkillsMatter 2018

592 views

Published on

In this talk, we explore five practical, tried-and-tested, real world techniques for improving operability with many kinds of software systems, including cloud, Serverless, on-premise, and IoT.

- Modern logging as a live diagnostics vector with sparse Event IDs

- Operational checklists and ‘Run Book dialogue sheets’ as a discovery mechanism for teams

- Endpoint Healthchecks as a way to assess runtime dependencies and readiness for service

- Correlation IDs beyond simple HTTP calls

- Lightweight ‘User Personas’ as drivers for operational dashboards

Based on work in many industry sectors, we will learn how to improve the operability of software systems using these team-friendly techniques.

Published in: Software
  • Be the first to comment

  • Be the first to like this

5 practical operability techniques - Matthew Skelton - SkillsMatter 2018

  1. 1. 1 5 practical operability techniques for teams Matthew Skelton, Conflux @matthewpskelton confluxdigital.netSkillsMatter, London Tues 23 October 2018
  2. 2. Practical operability Why do we need a focus on operability? 5 practical operability techniques that work 2
  3. 3. modern event-based logging Run Book dialogue sheets endpoint healthchecks correlation IDs user personas 3
  4. 4. team collaboration techniques 4
  5. 5. About me Matthew Skelton, Conflux @matthewpskelton matthewskelton.net Leeds, UK 5 20% discount for SkillsMatter!
  6. 6. You? Software Developer Tester / QA DevOps Engineer Ops Engineer / SRE Head of Department 6
  7. 7. Practical Operability Techniques for Teams 7
  8. 8. Operability making software work well in Production 8
  9. 9. Operability 9 Scale Restore Inspect Failover Monitor Diagnose Secure Cleardown Report
  10. 10. “But can’t we just give those things to an SRE?” 10
  11. 11. “But can’t we just give those things to the DevOp?” 11
  12. 12. Operability is a shared concern #BizDevTestSecOps 12
  13. 13. Operability is a shared concern #BizDevTestSecOps 13
  14. 14. 14
  15. 15. 15 A self-managed Kubernetes cluster near you
  16. 16. Operability is a shared concern #BizDevTestSecOps 16
  17. 17. 17 SRE: operability consultants Collaborate on operability here CC BY-SA devopstopologies.com
  18. 18. Practical operability techniques 1. Modern logging with event IDs 2. Run Book dialogue sheets 3. Endpoint healthchecks 4. Correlation IDs 5. Lightweight User Personas 18
  19. 19. 19 Logging with Event IDs
  20. 20. Lack of observability for distributed systems 20
  21. 21. Modern logging w/ Event IDs Distinct application states No “logorrhoea” (!) Distributed tracing via logs Build a shared understanding 21
  22. 22. search by event Event ID {Delivered, InTransit, Arrived} 22
  23. 23. transaction trace Correlation ID 612999958… 23
  24. 24. Modern logging with event IDs helps to produce a well-defined event space: human-readable events 24
  25. 25. Which calls might fail? 25
  26. 26. How many distinct event types (state transitions) in your application? 26
  27. 27. 27
  28. 28. represent distinct states 28
  29. 29. enum Human-readable sets: unique values, sparse, immutable C#, Java, Python, node (Ruby, PHP, …) 29
  30. 30. Technical Domain public enum EventID { // Badly-initialised logging data NotSet = 0, // An unrecognised event has occurred UnexpectedError = 10000, ApplicationStarted = 20000, ApplicationShutdownNoticeReceived = 20001, MessageQueued = 40000, MessagePeeked = 40001, BasketItemAdded = 60001, BasketItemRemoved = 60002, CreditCardDetailsSubmitted = 70001, // ... } 30
  31. 31. BasketItemAdded = 60001 BasketItemRemoved = 60002 31
  32. 32. example: https://github.com/EqualExperts/opslogger Sean Reilly @seanjreilly 32
  33. 33. Example: video processing On-demand processing of TV and mobile streaming adverts Ad-agency → TV broadcaster High throughput Glitch-free video & audio 33
  34. 34. Storage I/O Worker Job Queue Upload 34
  35. 35. 35
  36. 36. 36
  37. 37. Example: video processing Discover processing bottlenecks Trigger alerts Report on KPIs Target areas for improvement 37
  38. 38. Modern logging w/ Event IDs clarity about software behavior reduce time to detect problems increase team engagement enhance collaboration 38
  39. 39. Modern Logging: Collaborate on Event IDs and Correlation traces for better system awareness 39
  40. 40. Run Book dialogue sheets 40
  41. 41. Operational aspects not addressed, or addressed too late in the cycle 41
  42. 42. Run Book dialogue sheets Checklists for typical operational considerations Team-friendly exploration 42
  43. 43. Run Book dialogue sheets help to increase awareness of operability within teams 43
  44. 44. runbooktemplate.infoRun Book dialogue sheets 44
  45. 45. System characteristics Hours of operation During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed? Hours of operation - core features (e.g. 03:00-01:00 GMT+0) Hours of operation - secondary features (e.g. 07:00-23:00 GMT+0) Data and processing flows How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data ) … 45
  46. 46. http://runbooktemplate.info/ Github, CC BY-SA 46
  47. 47. runbooktemplate.infoRun Book dialogue sheets 47
  48. 48. Run Book dialogue sheets Early discovery of operational requirements Input to team backlog “Shift-left” testing Avoid operational problems 48
  49. 49. 49 http://operabilityquestions.com/ Github, CC BY-SA
  50. 50. OperabilityQuestions.com Freeform, exploratory questions for teams Usability, viability, reliability, observability, securability, … (Github, CC BY-SA) 50
  51. 51. Run Book dialogue sheets: Collaborate on operational requirements for better system awareness 51
  52. 52. Endpoint healthchecks 52
  53. 53. “Why has my deployment failed again?” “Why is Pre-Prod always so flaky?” 53
  54. 54. Endpoint healthchecks Simple HTTP check Common way to assess any service/app/component Key operational requirement 54
  55. 55. endpoint healthchecks Every runnable app/service/daemon exposes /status/health An HTTP GET to the endpoint returns: 200 – "I am healthy" 500 – "I am sick" 55
  56. 56. Endpoint healthchecks help teams to collaborate on service viability 56
  57. 57. endpoint healthchecks Each component is responsible for determining its own health and viability – this is very contextual 57
  58. 58. endpoint healthchecks Use JSON as a response type – parsable by both machines and humans! 58
  59. 59. 59
  60. 60. endpoint healthchecks For databases and other non-HTTP components, run a lightweight HTTP service in front of the component 200 / 500 responses 60
  61. 61. Helper service 61
  62. 62. https://github.com/Lugribossk/simple-dashboard 62
  63. 63. 63 Question: What does this look like for Serverless? ¯_(ツ)_/¯
  64. 64. Endpoint healthchecks Rapid diagnosis and visibility Reduce confusion around environment state “Fail fast” → “learn sooner” 64
  65. 65. Endpoint healthchecks: Collaborate on component health status for better system awareness 65
  66. 66. Correlation IDs 66
  67. 67. “Which nodes handled the request?” 67
  68. 68. Correlation IDs Unique-ish identifiers Trace calls across machine & container boundaries Re-assemble HTTP request/response later 68
  69. 69. ‘Unique-ish’ identifier for each request Passed through downstream layers 69
  70. 70. Correlation IDs help teams to think about the big picture: end-to-end outcomes 70
  71. 71. Unique-ish ID 71
  72. 72. Synchronous HTTP: X-HEADER e.g. X-trace-id X-trace-id: 348e1cf8 If header is present, pass it on (Yes, RFC6648, but this is internal only) 72
  73. 73. Asynchonous (queues, etc.): Message Attributes, name:value pair e.g. "trace-id":"348e1cf8" AWS SQS: SendMessage() / ReceiveMessage() Log the Correlation ID if present 73
  74. 74. Example: OpenTracing / PCF 3 tracing elements: TraceID, SpanID, ParentSpan "X-B3-TraceId" "X-B3-SpanId" "X-B3-ParentSpan" 74
  75. 75. Example: OpenTracing / PCF Always log the TraceID as-is Log calling SpanID as ParentSpan Log new SpanID 75
  76. 76. Trace Span ParentSpan 76
  77. 77. Correlation IDs Detect bottlenecks and unexpected interactions Increase transparency Learn about the system 77
  78. 78. Correlation IDs: Collaborate on distributed tracing for better system awareness 78
  79. 79. Lightweight user personas 79
  80. 80. Software is difficult to operate: poor UX for Ops. 80
  81. 81. Lightweight User Personas Simple characterisation of user needs for Dev/Test/Ops Based on full UX user personas but less detailed 81
  82. 82. Lightweight user personas: Ops Engineer Test Engineer Build & Deployment Engineer Service Owner 82
  83. 83. Lightweight user personas help teams to build systems with good UX for all users 83
  84. 84. Lightweight user personas: Consider the User Experience (UX) of engineers and team members using and working with the software 84
  85. 85. http://www.keepitusable.com/blog/?tag=alan-cooper 85 Motivations Goals Frustrations
  86. 86. Lightweight user personas: What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely? 86
  87. 87. https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/ 87
  88. 88. Lightweight User Personas Empathise better with people from other roles Capture missing operational requirements 88
  89. 89. Lightweight User Personas: Collaborate on user needs for better system awareness 89
  90. 90. Summary 90
  91. 91. Operability making software work well in Production 91
  92. 92. 92
  93. 93. Lack of observability Operational aspects not known “Why has deployment failed?” What handled the request? Poor UX for Ops 93
  94. 94. 94 SRE: operability consultants Collaborate on operability here CC BY-SA devopstopologies.com
  95. 95. Logging with Event IDs use enum-based Event IDs to explore runtime behaviour and fault conditions 95
  96. 96. Run Book dialogue sheets explore and establish operational requirements as a team, around a physical table, together 96
  97. 97. Endpoint healthchecks HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans 97
  98. 98. Correlation IDs trace execution using correlation IDs: synchronous (HTTP X-trace-id) async (SQS MessageAttribute) 98
  99. 99. Lightweight user personas explore the UX and needs of different roles for rapid decisions via dashboards 99
  100. 100. use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques 100
  101. 101. Team Guide to Software Operability Matthew Skelton & Rob Thatcher operabilitybook.com 20% discount for SkillsMatter! http://leanpub.com/SoftwareOperability/c/SkillsMatter Download a free sample chapter 101
  102. 102. Resources •Team Guide to Software Operability by Matthew Skelton and Rob Thatcher http://operabilitybook.com/ •Run Book template & Run Book dialogue sheets http://runbooktemplate.info/ •Operability Questions http://operabilityquestions.com/ •5 proven operability techniques for software teams https://techbeacon.com/5-proven-operability-techniques- software-teams 102
  103. 103. thank you 103 @matthewpskelton / operabilitybook.com @ConfluxHQ / confluxdigital.net

×