
5 practical operability techniques for teams - Matthew Skelton - ADDO 2018

From a talk at All Day DevOps 2018

In this talk, we explore five practical, tried-and-tested, real-world, team-focused techniques for improving operability in many kinds of software systems, including cloud, Serverless, on-premise, and IoT:

- modern event-based logging
- Run Book dialogue sheets
- endpoint healthchecks
- correlation IDs
- user personas

We use these as team collaboration techniques to improve software operability in the context of DevOps and SRE.

Published in: Software

  1. 1. October 17, 2018 5 practical operability techniques for teams Matthew Skelton, Conflux @matthewpskelton confluxdigital.net
  3. 3. October 17, 2018 Thank You Supporters
  4. 4. October 17, 2018 Meet me in the Slack channel for Q&A bit.ly/addo-slack
  5. 5. Practical operability Why do we need a focus on operability? 5 practical operability techniques that work 5
  6. 6. modern event-based logging Run Book dialogue sheets endpoint healthchecks correlation IDs user personas 6
  7. 7. team collaboration techniques 7
  8. 8. About me Matthew Skelton, Conflux @matthewpskelton matthewskelton.net Leeds, UK 8
  9. 9. You? Software Developer Tester / QA DevOps Engineer Ops Engineer / SRE Head of Department 9
  10. 10. Practical Operability Techniques for Teams 10
  11. 11. Operability: making software work well in Production 11
  12. 12. Operability: Scale, Restore, Inspect, Failover, Monitor, Diagnose, Secure, Cleardown, Report 12
  13. 13. “But can’t we just give those things to an SRE?” 13
  14. 14. “But can’t we just give those things to the DevOp?” 14
  15. 15. Operability is a shared concern #BizDevTestSecOps 15
  18. 18. 18 A self-managed Kubernetes cluster near you
  19. 19. Operability is a shared concern #BizDevTestSecOps 19
  20. 20. 20 SRE: operability consultants Collaborate on operability here CC BY-SA devopstopologies.com
  21. 21. Practical operability techniques 1. Modern logging with event IDs 2. Run Book dialogue sheets 3. Endpoint healthchecks 4. Correlation IDs 5. Lightweight User Personas 21
  22. 22. 22 Logging with Event IDs
  23. 23. Lack of observability for distributed systems 23
  24. 24. Modern logging w/ Event IDs Distinct application states No “logorrhoea” (!) Distributed tracing via logs Build a shared understanding 24
  25. 25. search by event Event ID {Delivered, InTransit, Arrived} 25
  26. 26. transaction trace Correlation ID 612999958… 26
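
The two searches on the slides above (by Event ID and by Correlation ID) can be sketched over structured log lines. This is a minimal illustration, not the tooling used in the talk: the JSON field names `event`, `correlation_id`, and `node`, and the sample values, are assumptions made for the example.

```python
import json

# Hypothetical structured log lines; field names and values are illustrative only.
LOG_LINES = [
    '{"event": "InTransit", "correlation_id": "a1", "node": "web-1"}',
    '{"event": "Arrived", "correlation_id": "a1", "node": "worker-2"}',
    '{"event": "InTransit", "correlation_id": "b2", "node": "web-2"}',
    '{"event": "Delivered", "correlation_id": "a1", "node": "worker-3"}',
]


def parse(lines):
    """Turn JSON log lines into dictionaries."""
    return [json.loads(line) for line in lines]


def search_by_event(records, event_names):
    """Return every record whose event is one of the given event names."""
    wanted = set(event_names)
    return [r for r in records if r["event"] in wanted]


def trace(records, correlation_id):
    """Return the records for one request, in log order."""
    return [r for r in records if r["correlation_id"] == correlation_id]


records = parse(LOG_LINES)
delivered = search_by_event(records, {"Delivered"})
path = [r["node"] for r in trace(records, "a1")]
print(delivered)
print(path)
```

Searching by event answers "where did this state occur?", while the trace answers "which nodes handled this one request?".
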
  27. 27. Modern logging with event IDs helps to produce a well-defined event space: human-readable events 27
  28. 28. How many distinct event types (state transitions) in your application? 28
  29. 29. Which calls might fail? 29
  31. 31. represent distinct states 31
  32. 32. enum Human-readable sets: unique values, sparse, immutable C#, Java, Python, node (Ruby, PHP, …) 32
  33. 33. Technical Domain

        public enum EventID
        {
            // Badly-initialised logging data
            NotSet = 0,
            // An unrecognised event has occurred
            UnexpectedError = 10000,
            ApplicationStarted = 20000,
            ApplicationShutdownNoticeReceived = 20001,
            MessageQueued = 40000,
            MessagePeeked = 40001,
            BasketItemAdded = 60001,
            BasketItemRemoved = 60002,
            CreditCardDetailsSubmitted = 70001,
            // ...
        }

  34. 34. BasketItemAdded = 60001 BasketItemRemoved = 60002 34
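
The enum idea carries over to the other languages the deck lists. Below is a minimal Python sketch, assuming a `key=value` log line format; the helper `format_event`, the logger name `shop`, and the field names `basket` and `sku` are hypothetical. The numeric values are copied from the enum slide, and only a subset is shown.

```python
import logging
from enum import IntEnum, unique


@unique  # rejects two names that share one numeric value
class EventID(IntEnum):
    NOT_SET = 0                  # badly-initialised logging data
    UNEXPECTED_ERROR = 10000     # an unrecognised event has occurred
    APPLICATION_STARTED = 20000
    BASKET_ITEM_ADDED = 60001
    BASKET_ITEM_REMOVED = 60002


def format_event(event, **fields):
    """Build one log line carrying the numeric ID and the readable name."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"event_id={event.value} event={event.name} {details}".rstrip()


logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("shop")  # hypothetical logger name

line = format_event(EventID.BASKET_ITEM_ADDED, basket="b-17", sku="sku-42")
log.info(line)
```

Logging both the number and the name keeps the line searchable by machines and readable by people, and sparse values leave room to add related events later without renumbering.
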
  35. 35. example: https://github.com/EqualExperts/opslogger Sean Reilly @seanjreilly 35
  36. 36. Example: video processing On-demand processing of TV and mobile streaming adverts Ad-agency → TV broadcaster High throughput Glitch-free video & audio 36
  37. 37. Storage I/O Worker Job Queue Upload 37
  40. 40. Example: video processing Discover processing bottlenecks Trigger alerts Report on KPIs Target areas for improvement 40
  41. 41. Modern logging w/ Event IDs clarity about software behaviour reduce time to detect problems increase team engagement enhance collaboration 41
  42. 42. Modern Logging: Collaborate on Event IDs and Correlation traces for better system awareness 42
  43. 43. Run Book dialogue sheets 43
  44. 44. Operational aspects not addressed, or addressed too late in the cycle 44
  45. 45. Run Book dialogue sheets Checklists for typical operational considerations Team-friendly exploration 45
  46. 46. Run Book dialogue sheets help to increase awareness of operability within teams 46
  47. 47. runbooktemplate.info Run Book dialogue sheets 47
  48. 48. System characteristics Hours of operation During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed? Hours of operation - core features (e.g. 03:00-01:00 GMT+0) Hours of operation - secondary features (e.g. 07:00-23:00 GMT+0) Data and processing flows How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data ) … 48
  49. 49. http://runbooktemplate.info/ GitHub, CC BY-SA 49
  50. 50. runbooktemplate.info Run Book dialogue sheets 50
  51. 51. Run Book dialogue sheets Early discovery of operational requirements Input to team backlog “Shift-left” testing Avoid operational problems 51
  52. 52. 52 http://operabilityquestions.com/ GitHub, CC BY-SA
  53. 53. OperabilityQuestions.com Freeform, exploratory questions for teams Usability, viability, reliability, observability, securability, … (GitHub, CC BY-SA) 53
  54. 54. Run Book dialogue sheets: Collaborate on operational requirements for better system awareness 54
  55. 55. Endpoint healthchecks 55
  56. 56. “Why has my deployment failed again?” “Why is Pre-Prod always so flaky?” 56
  57. 57. Endpoint healthchecks Simple HTTP check Common way to assess any service/app/component Key operational requirement 57
  58. 58. endpoint healthchecks Every runnable app/service/daemon exposes /status/health An HTTP GET to the endpoint returns: 200 – "I am healthy" 500 – "I am sick" 58
  59. 59. Endpoint healthchecks help teams to collaborate on service viability 59
  60. 60. endpoint healthchecks Each component is responsible for determining its own health and viability – this is very contextual 60
  61. 61. endpoint healthchecks Use JSON as a response type – parsable by both machines and humans! 61
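
A minimal sketch of such an endpoint using only the Python standard library. The path `/status/health` and the 200 / 500 convention come from the slides; the JSON shape (`healthy`, `checks`), the function and class names, and the default port are assumptions for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(checks):
    """Run each named check and return (HTTP status, JSON body)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a check that raises counts as unhealthy
    healthy = all(results.values())
    body = json.dumps({"healthy": healthy, "checks": results}, sort_keys=True)
    return (200 if healthy else 500), body


class HealthHandler(BaseHTTPRequestHandler):
    # Hypothetical checks; each component decides what "healthy" means for itself.
    checks = {"self": lambda: True}

    def do_GET(self):
        if self.path != "/status/health":
            self.send_error(404)
            return
        status, body = health_response(self.checks)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8080):
    """Create the server; the caller starts it with serve_forever()."""
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server().serve_forever()` would start it; a GET to `/status/health` then returns 200 or 500 with a JSON body that both dashboards and people can read.
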
  63. 63. endpoint healthchecks For databases and other non-HTTP components, run a lightweight HTTP service in front of the component 200 / 500 responses 63
  64. 64. Helper service 64
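
A helper service of this kind can be sketched as follows. SQLite stands in for the database only so that the example is self-contained; the probe query, function names, and port are assumptions, not details from the talk.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_is_healthy(db_path):
    """Return True if a trivial query succeeds against the database."""
    try:
        conn = sqlite3.connect(db_path, timeout=1)
        try:
            return conn.execute("SELECT 1").fetchone() == (1,)
        finally:
            conn.close()
    except sqlite3.Error:
        return False


def make_helper_handler(db_path):
    """Build a handler class bound to one database (hypothetical helper)."""

    class HelperHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/status/health":
                self.send_error(404)
                return
            # Translate the non-HTTP component's state into 200 / 500.
            self.send_response(200 if database_is_healthy(db_path) else 500)
            self.end_headers()

    return HelperHandler


def make_helper_server(db_path, port=8081):
    """Create the helper service; the caller starts it with serve_forever()."""
    return HTTPServer(("127.0.0.1", port), make_helper_handler(db_path))
```

The helper keeps the same contract as every other component, so tooling that polls `/status/health` does not need to know it is talking to a database.
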
  65. 65. https://github.com/Lugribossk/simple-dashboard 65
  66. 66. 66 Question: What does this look like for Serverless? ¯_(ツ)_/¯
  67. 67. Endpoint healthchecks Rapid diagnosis and visibility Reduce confusion around environment state “Fail fast” → “learn sooner” 67
  68. 68. Endpoint healthchecks: Collaborate on component health status for better system awareness 68
  69. 69. Correlation IDs 69
  70. 70. “Which nodes handled the request?” 70
  71. 71. Correlation IDs Unique-ish identifiers Trace calls across machine & container boundaries Re-assemble HTTP request/response later 71
  72. 72. ‘Unique-ish’ identifier for each request Passed through downstream layers 72
  73. 73. Correlation IDs help teams to think about the big picture: end-to-end outcomes 73
  74. 74. Unique-ish ID 74
  75. 75. Synchronous HTTP: use an X- header, e.g. X-trace-id (X-trace-id: 348e1cf8). If the header is present, pass it on. (Yes, RFC 6648 deprecates the X- prefix, but this is internal only.) 75
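
The pass-it-on rule can be sketched as two small functions. The header name `X-trace-id` is from the slide; minting a new 8-character ID when the header is absent is an assumption about what the first service in the chain would do, and the function names are hypothetical.

```python
import uuid

TRACE_HEADER = "X-trace-id"


def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace ID if present, otherwise mint a new one."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TRACE_HEADER.lower() and value:
            return value
    return uuid.uuid4().hex[:8]  # "unique-ish": short and random (assumed format)


def outgoing_headers(incoming_headers, extra=None):
    """Headers for a downstream call, always carrying the trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = ensure_trace_id(incoming_headers)
    return headers


downstream = outgoing_headers({"x-trace-id": "348e1cf8"}, {"Accept": "application/json"})
print(downstream)
```

Each service would call `outgoing_headers` with the headers of the request it is handling, so the same ID travels through every downstream layer.
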
  76. 76. Asynchronous (queues, etc.): Message Attributes, name:value pair e.g. "trace-id":"348e1cf8" AWS SQS: SendMessage() / ReceiveMessage() Log the Correlation ID if present 76
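
For the queue case, the sketch below builds and reads the attribute without calling AWS, so it runs anywhere. The dictionary shape (`DataType` / `StringValue`) is the one SQS uses for message attributes; the function names and the logger name are hypothetical.

```python
import logging

TRACE_ATTRIBUTE = "trace-id"
log = logging.getLogger("worker")  # hypothetical logger name


def with_trace_attribute(trace_id, attributes=None):
    """Return message attributes that include the correlation ID."""
    merged = dict(attributes or {})
    merged[TRACE_ATTRIBUTE] = {"DataType": "String", "StringValue": trace_id}
    return merged


def trace_id_from_message(message):
    """Extract the correlation ID from a received message, if present."""
    attribute = message.get("MessageAttributes", {}).get(TRACE_ATTRIBUTE)
    return attribute.get("StringValue") if attribute else None


def handle(message):
    """Log the correlation ID if present, then return it."""
    trace_id = trace_id_from_message(message)
    if trace_id is not None:
        log.info("trace-id=%s message received", trace_id)
    return trace_id


attributes = with_trace_attribute("348e1cf8")
received = {"Body": "resize video", "MessageAttributes": attributes}
print(handle(received))
```

With boto3, the producer would pass `attributes` as the `MessageAttributes` argument when sending, and the consumer would need to request the attribute by name when receiving, since message attributes are not returned by default.
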
  77. 77. Example: OpenTracing / PCF 3 tracing elements: TraceID, SpanID, ParentSpan "X-B3-TraceId" "X-B3-SpanId" "X-B3-ParentSpanId" 77
  78. 78. Example: OpenTracing / PCF Always log the TraceID as-is Log calling SpanID as ParentSpan Log new SpanID 78
  79. 79. Trace Span ParentSpan 79
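
The three logging rules (keep the TraceID as-is, record the caller's SpanID as the parent, create a new SpanID) can be sketched as below. The parent header is written here as `X-B3-ParentSpanId`, the name used by the B3 propagation format; the function names and the 16-hex-character ID length are assumptions for the example.

```python
import secrets


def new_id():
    """Random 64-bit identifier rendered as 16 hex characters (assumed format)."""
    return secrets.token_hex(8)


def child_context(incoming):
    """Derive this service's tracing headers from the caller's headers."""
    context = {
        # Rule 1: keep the TraceID unchanged; start a trace only if none exists.
        "X-B3-TraceId": incoming.get("X-B3-TraceId") or new_id(),
        # Rule 3: every hop gets its own new SpanID.
        "X-B3-SpanId": new_id(),
    }
    caller_span = incoming.get("X-B3-SpanId")
    if caller_span:
        # Rule 2: the caller's SpanID becomes this hop's parent.
        context["X-B3-ParentSpanId"] = caller_span
    return context


def log_fields(context):
    """Render the context as key=value pairs for a log line."""
    return " ".join(f"{key}={value}" for key, value in sorted(context.items()))


incoming = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312"}
context = child_context(incoming)
print(log_fields(context))
```

Logging these three fields on every hop is what lets the parent/child tree of spans be rebuilt from the logs afterwards.
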
  80. 80. Correlation IDs Detect bottlenecks and unexpected interactions Increase transparency Learn about the system 80
  81. 81. Correlation IDs: Collaborate on distributed tracing for better system awareness 81
  82. 82. Lightweight user personas 82
  83. 83. Software is difficult to operate: poor UX for Ops. 83
  84. 84. Lightweight User Personas Simple characterisation of user needs for Dev/Test/Ops Based on full UX user personas but less detailed 84
  85. 85. Lightweight user personas: Ops Engineer Test Engineer Build & Deployment Engineer Service Owner 85
  86. 86. Lightweight user personas help teams to build systems with good UX for all users 86
  87. 87. Lightweight user personas: Consider the User Experience (UX) of engineers and team members using and working with the software 87
  88. 88. http://www.keepitusable.com/blog/?tag=alan-cooper 88 Motivations Goals Frustrations
  89. 89. Lightweight user personas: What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely? 89
  90. 90. https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/ 90
  91. 91. Lightweight User Personas Empathise better with people from other roles Capture missing operational requirements 91
  92. 92. Lightweight User Personas: Collaborate on user needs for better system awareness 92
  93. 93. Summary 93
  94. 94. Operability: making software work well in Production 94
  96. 96. Lack of observability Operational aspects not known “Why has deployment failed?” What handled the request? Poor UX for Ops 96
  97. 97. 97 SRE: operability consultants Collaborate on operability here CC BY-SA devopstopologies.com
  98. 98. Logging with Event IDs use enum-based Event IDs to explore runtime behaviour and fault conditions 98
  99. 99. Run Book dialogue sheets explore and establish operational requirements as a team, around a physical table, together 99
  100. 100. Endpoint healthchecks HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans 100
  101. 101. Correlation IDs trace execution using correlation IDs: synchronous (HTTP X-trace-id) async (SQS MessageAttribute) 101
  102. 102. Lightweight user personas explore the UX and needs of different roles for rapid decisions via dashboards 102
  103. 103. use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques 103
  104. 104. Team Guide to Software Operability Matthew Skelton & Rob Thatcher operabilitybook.com 20% discount for ADDO 2018! http://leanpub.com/SoftwareOperability/c/ADDO2018 Download a free sample chapter 104
  105. 105. Resources • Team Guide to Software Operability by Matthew Skelton and Rob Thatcher http://operabilitybook.com/ • Run Book template & Run Book dialogue sheets http://runbooktemplate.info/ • Operability Questions http://operabilityquestions.com/ • 5 proven operability techniques for software teams https://techbeacon.com/5-proven-operability-techniques-software-teams 105
  107. 107. October 17, 2018 thank you @matthewpskelton / operabilitybook.com @ConfluxHQ / confluxdigital.net
