Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Practical operability techniques for teams - Matthew Skelton - Conflux - Continuous Lifecycle London 2018

530 views

Published on

In this talk, we explore five practical, tried-and-tested, real world techniques for improving operability with many kinds of software systems, including cloud, Serverless, on-premise, and IoT.

Logging as a live diagnostics vector with sparse Event IDs
Operational checklists and ‘Run Book dialogue sheets’ as a discovery mechanism for teams
Endpoint healthchecks as a way to assess runtime dependencies and complexity
Correlation IDs beyond simple HTTP calls
Lightweight ‘User Personas’ as drivers for operational dashboards
Based on our work in many industry sectors, we will share our experience of helping teams to improve the operability of their software systems through

Required audience experience

Some experience of building web-scale systems or industrial IoT/embedded systems would be helpful.

Objective of the talk

We will share our experience of helping teams to improve the operability of their software systems. Attendees will learn some practical operability approaches and how teams can expand their understanding and awareness of operability through these simple, team-friendly techniques.

From a talk given at Continuous Lifecycle London 2018: https://continuouslifecycle.london/sessions/practical-team-focused-operability-techniques-for-distributed-systems/

Published in: Software
  • Be the first to comment

Practical operability techniques for teams - Matthew Skelton - Conflux - Continuous Lifecycle London 2018

  1. 1. Practical Operability Techniques for Teams Matthew Skelton Conflux confluxdigital.net / @ConfluxHQ Continuous Lifecycle London, 17 May 2018
  2. 2. Practical operability Why do we need a focus on operability? 5 practical operability techniques to try 2
  3. 3. use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques 3
  4. 4. About me Matthew Skelton, Conflux @matthewpskelton matthewskelton.net Leeds, UK 4 assemblyconf.com
  5. 5. Team Guide to Software Operability Matthew Skelton & Rob Thatcher skeltonthatcher.com/publications Download a free sample chapter 5
  6. 6. You? Software Developer Tester / QA DevOps Engineer Team Leader Head of Department 6
  7. 7. Practical Operability Techniques for Teams 7
  8. 8. Operability making software work well in Production 8
  9. 9. Operability 9
  10. 10. “But can’t we just give those things to a DevOp?” 10
  11. 11. “But can’t we just give those things to the DevOp?” 11
  12. 12. Operability is a shared concern #BizDevTestSecOps 12
  13. 13. Operability is a shared concern #BizDevTestSecOps 13
  14. 14. 14
  15. 15. 15 A self-managed Kubernetes cluster near you
  16. 16. Operability is a shared concern #BizDevTestSecOps 16
  17. 17. Practical operability techniques 1. Modern logging 2. Run Book dialogue sheets 3. Endpoint healthchecks 4. Correlation IDs 5. Lightweight User Personas 17
  18. 18. Lack of observability for distributed systems 18
  19. 19. 19 Logging with Event IDs
  20. 20. Modern logging w/ Event IDs Distinct application states No “logorrhoea” (!) Distributed tracing via logs Build a shared understanding 20
  21. 21. search by event Event ID {Delivered, InTransit, Arrived} 21
  22. 22. transaction trace Correlation ID 612999958… 22
  23. 23. How many distinct event types (state transitions) in your application? 23
  24. 24. 24
  25. 25. represent distinct states 25
  26. 26. enum Human-readable sets: unique values, sparse, immutable C#, Java, Python, node (Ruby, PHP, …) 26
  27. 27. public enum EventID { // Badly-initialised logging data NotSet = 0, // An unrecognised event has occurred UnexpectedError = 10000, ApplicationStarted = 20000, ApplicationShutdownNoticeReceived = 20001, MessageQueued = 40000, MessagePeeked = 40001, BasketItemAdded = 60001, BasketItemRemoved = 60002, CreditCardDetailsSubmitted = 70001, // ... } 27
  28. 28. BasketItemAdded = 60001 BasketItemRemoved = 60002 28
  29. 29. log using Event IDs with a modern ‘structured logging’ library 29
  30. 30. example: https://github.com/EqualExperts/opslogger Sean Reilly @seanjreilly 30
  31. 31. Example: video processing On-demand processing of TV and mobile streaming adverts Ad-agency → TV broadcaster High throughput Glitch-free video & audio 31
  32. 32. Storage I/O Worker Job Queue Upload 32
  33. 33. 33
  34. 34. 34
  35. 35. Example: video processing Discover processing bottlenecks Trigger alerts Report on KPIs Target areas for improvement 35
  36. 36. Modern logging w/ Event IDs reduce time to detect problems increase team engagement increase configurability enhance collaboration 36
  37. 37. Modern Logging: Collaborate on Event IDs and Correlation traces for better system awareness 37
  38. 38. Operational aspects not addressed, or addressed too late in the cycle 38
  39. 39. Run Book dialogue sheets
  40. 40. Run Book dialogue sheets Checklists for typical operational considerations Team-friendly exploration 40
  41. 41. runbooktemplate.infoRun Book dialogue sheets 41
  42. 42. System characteristics Hours of operation During what hours does the service or system actually need to operate? Can portions or features of the system be unavailable at times if needed? Hours of operation - core features (e.g. 03:00-01:00 GMT+0) Hours of operation - secondary features (e.g. 07:00-23:00 GMT+0) Data and processing flows How and where does data flow through the system? What controls or triggers data flows? (e.g. mobile requests / scheduled batch jobs / inbound IoT sensor data ) … 42
  43. 43. http://runbooktemplate.info/ 43
  44. 44. runbooktemplate.infoRun Book dialogue sheets 44
  45. 45. Run Book dialogue sheets Early discovery of operational requirements Input to team backlog “Shift-left” testing Avoid operational problems 45
  46. 46. Run Book dialogue sheets: Collaborate on operational requirements for better system awareness 46
  47. 47. “Why has my deployment failed again?” “Why is Pre-Prod always so flaky?” 47
  48. 48. Endpoint healthchecks
  49. 49. Endpoint healthchecks Simple HTTP check Common way to assess any service/app/component Key operational requirement 49
  50. 50. endpoint healthchecks Every runnable app/service/daemon exposes /status/health An HTTP GET to the endpoint returns: 200 – "I am healthy" 500 – "I am sick" 50
  51. 51. endpoint healthchecks Each component is responsible for determining its own health and viability – this is very contextual 51
  52. 52. endpoint healthchecks Use JSON as a response type – parsable by both machines and humans! 52
  53. 53. 53
  54. 54. endpoint healthchecks For databases and other non-HTTP components, run a lightweight HTTP service in front of the component 200 / 500 responses 54
  55. 55. Helper service 55
  56. 56. https://github.com/Lugribossk/simple-dashboard 56
  57. 57. 57 Question: What does this look like for Serverless? ¯_(ツ)_/¯
  58. 58. Endpoint healthchecks Rapid dashboarding and diagnosis Reduce confusion around environment state “Fail fast” aka “learn sooner” 58
  59. 59. Endpoint healthchecks: Collaborate on component health status for better system awareness 59
  60. 60. “Which containers [servers] handled the request?” 60
  61. 61. Correlation IDs
  62. 62. Correlation IDs Unique-ish identifiers Trace calls across machine & container boundaries Re-assemble HTTP request/response later 62
  63. 63. ‘Unique-ish’ identifier for each request Passed through downstream layers 63
  64. 64. Unique-ish ID 64
  65. 65. Synchronous HTTP: X-HEADER e.g. X-trace-id X-trace-id: 348e1cf8 If header is present, pass it on (Yes, RFC6648, but this is internal only) 65
  66. 66. Asynchonous (queues, etc.): Message Attributes, name:value pair e.g. "trace-id":"348e1cf8" AWS SQS: SendMessage() / ReceiveMessage() Log the Correlation ID if present 66
  67. 67. Example: OpenTracing / PCF 3 tracing elements: TraceID, SpanID, ParentSpan "X-B3-TraceId" "X-B3-SpanId" "X-B3-ParentSpan" 67
  68. 68. Example: OpenTracing / PCF Always log the TraceID as-is Log calling SpanID as ParentSpan Log new SpanID 68
  69. 69. Trace Span ParentSpan 69
  70. 70. Correlation IDs Detect bottlenecks and unexpected interactions Increase transparency Learn about the system 70
  71. 71. Correlation IDs: Collaborate on distributed tracing for better system awareness 71
  72. 72. Software is difficult to operate: poor UX for Ops. 72
  73. 73. Lightweight user personas
  74. 74. Lightweight User Personas Simple characterisation of user needs for Dev/Test/Ops Based on full UX user personas but less detailed 74
  75. 75. Lightweight user personas: Ops Engineer Test Engineer Build & Deployment Engineer Service Owner 75
  76. 76. Lightweight user personas: Consider the User Experience (UX) of engineers and team members using and working with the software 76
  77. 77. http://www.keepitusable.com/blog/?tag=alan-cooper 77 Motivations Goals Frustrations
  78. 78. Lightweight user personas: What data does the User Persona need visible on a dashboard in order to make decisions rapidly & safely? 78
  79. 79. https://www.geckoboard.com/blog/visualisation-upgrades-progressing-towards-a-more-useful-and-beautiful-dashboard/ 79
  80. 80. Lightweight User Personas Empathise better with people from other roles Capture missing operational requirements 80
  81. 81. Lightweight User Personas: Collaborate on user needs for better system awareness 81
  82. 82. Summary 82
  83. 83. Operability making software work well in Production 83
  84. 84. 84
  85. 85. Lack of observability Operational aspects not known “Why has deployment failed?” What handled the request? Poor UX for Ops 85
  86. 86. Logging with Event IDs use enum-based Event IDs to explore runtime behaviour and fault conditions 86
  87. 87. Run Book dialogue sheets explore and establish operational requirements as a team, around a physical table, together 87
  88. 88. Endpoint healthchecks HTTP 200 / 500 responses to /status/health call with JSON details – good for tools and humans 88
  89. 89. Correlation IDs trace execution using correlation IDs: synchronous (HTTP X-trace-id) async (SQS MessageAttribute) 89
  90. 90. Lightweight user personas explore the UX and needs of different roles for rapid decisions via dashboards 90
  91. 91. use modern logging, Run Book dialogue sheets, endpoint healthchecks, correlation IDs, and user personas as team collaboration techniques 91
  92. 92. Team Guide to Software Operability Matthew Skelton & Rob Thatcher skeltonthatcher.com/publications Download a free sample chapter 92
  93. 93. Resources •Team Guide to Software Operability by Matthew Skelton and Rob Thatcher (Skelton Thatcher Publications, 2016) http://operabilitybook.com/ •Run Book template & Run Book dialogue sheets http://runbooktemplate.info/ 93
  94. 94. thank you @ConfluxHQ confluxdigital.net

×