
Un-broken Logging - Operability.io 2015 - Matthew Skelton

The way in which many (most?) software teams use logging needs a re-think as we move into a world of microservices and remote sensors. Instead of using logging merely to dump out stack traces, our logs become a continuous trace of application state, with unique-enough identifiers for every interesting point of execution. We also use transaction identifiers to trace calls across components, services, and queues, so that we can reconstruct distributed calls after the fact. Logging becomes a rich source of insight for developers and operations people alike, as we 'listen to the logs' and tighten feedback cycles to improve our software systems.

Published in: Software

  1. 1. Un-Broken Logging: the foundation of software operability. Operability.io conference, #OIO15, Friday 25th September 2015. Matthew Skelton, Skelton Thatcher Consulting, @matthewpskelton
  2. 2. The way we use logging is (often) broken; how to make our logging more awesome; why we should care
  3. 3. Matthew Skelton @matthewpskelton #OIO15
  4. 4. @Operability #operability WhoOwnsMyOperability.com
  5. 5. confession: I am a big fan of logging
  6. 6. exceptional situations edge cases metrics analytics ‘audits’ … @evanphx
  7. 7. execution trace
  8. 8. BAD STUFF
  9. 9. Logging is often unloved 1. Discontinuous 2. Errors only, or arbitrary 3. ‘Bolted on’ 4. No aggregation & search 5. Specify severity up front
  10. 10. GOOD STUFF
  11. 11. How to make logging awesome 1. Continuous event IDs 2. Transaction tracing 3. Log aggregation & search tools 4. Design for logging 5. Decoupled severity
  12. 12. reduce time-to-detect increase team engagement increase configurability enhance DevOps collaboration #operability
  13. 13. Background
  14. 14. Autonomous weather station
  15. 15. MRI brain scan imaging
  16. 16. Oil well monitoring
  17. 17. Web-scale systems
  18. 18. logging makes things work
  19. 19. (event sourcing) (structured logging) (CQRS)
  20. 20. How is logging usually broken?
  21. 21. Logging is often unloved 1. Discontinuous 2. Errors only, or arbitrary 3. ‘Bolted on’ 4. No aggregation & search 5. Specify severity up front
  22. 22. using logging mainly for errors
  23. 23. inconsistent use of logging
  24. 24. logging slows down the software
  25. 25. logging ‘pollutes’ my precious domain model
  26. 26. logging is just for those weird Ops people
  27. 27. logging assumed to be free ($0) to implement; no budget for aggregating logs across machines
  28. 28. log aggregation happens only in Production; logs not available to Devs
  29. 29. fights over log severity levels
  30. 30. poor time synchronisation
  31. 31. Some history, with pirates
  32. 32. weather, course, sightings, latitude, longitude, … (even when quiet)
  33. 33. John Harrison
  34. 34. Why log?
  35. 35. verification traceability accountability charting the waters
  36. 36. - June 13th – Pirates!!!! - Weds – Sharks!!! - 19th Jun –BIGGER sharks!!!!
  37. 37. How to make logging awesome
  38. 38. How to make logging awesome 1. Continuous event IDs 2. Transaction tracing 3. Log aggregation & search tools 4. Design for logging 5. Decoupled severity
  39. 39. Storage I/O Worker Job Queue Upload
  40. 40. Continuous event IDs
  41. 41. How many distinct event types (state transitions) in your application?
  42. 42. represent distinct states
  43. 43. enum: human-readable sets; unique values, sparse, immutable. C#, Java, Python, node (Ruby, PHP, …)
  44. 44. public enum EventID
          {
              // Badly-initialised logging data
              NotSet = 0,
              // An unrecognised event has occurred
              UnexpectedError = 10000,
              ApplicationStarted = 20000,
              ApplicationShutdownNoticeReceived = 20001,
              PageGenerationStarted = 30000,
              PageGenerationCompleted = 30001,
              MessageQueued = 40000,
              MessagePeeked = 40001,
              BasketItemAdded = 60001,
              BasketItemRemoved = 60002,
              CreditCardDetailsSubmitted = 70001,
              // ...
          }
  45. 45. Technical and Domain event types:
          public enum EventID
          {
              // --- Technical ---
              // Badly-initialised logging data
              NotSet = 0,
              // An unrecognised event has occurred
              UnexpectedError = 10000,
              ApplicationStarted = 20000,
              ApplicationShutdownNoticeReceived = 20001,
              PageGenerationStarted = 30000,
              PageGenerationCompleted = 30001,
              MessageQueued = 40000,
              MessagePeeked = 40001,
              // --- Domain ---
              BasketItemAdded = 60001,
              BasketItemRemoved = 60002,
              CreditCardDetailsSubmitted = 70001,
              // ...
          }
  46. 46. BasketItemAdded = 60001
  47. 47. BasketItemAdded = 60001 BasketItemRemoved = 60002
  48. 48. BasketItemAdded = 60001 BasketItemRemoved = 60002
  49. 49. represent distinct states
  50. 50. OrderSvc_BasketItemAdded
  51. 51. Monolith to microservices: debugger does not have the full view
  52. 52. Even with remote debugger, it’s boring to attach and detach
  53. 53. Storage I/O Worker Job Queue Upload
  54. 54. Transaction tracing
  55. 55. ‘Unique-ish’ identifier for each request, passed through downstream layers
  56. 56. Unique-ish ID
  57. 57. What about APM?
  58. 58. APM gives us application insight, BUT: How much do we learn? Is APM available on the Dev box? It’s not just ‘an Ops problem’!
  59. 59. Helps us to understand how the software really works. Small overhead is worth it.
  60. 60. Configurable severity levels
  61. 61. Which log level is right?
  62. 62. DEBUG, INFO, WARNING, ERROR, CRITICAL
  63. 63. Log level should *not* be fixed at compile or build time!
  64. 64. Tune log levels
  65. 65. Tune log levels
  66. 66. Tune log levels
  67. 67. {
            "eventmappings": {
              "events": {
                "event": [
                  {
                    "id": "CacheServiceStarted",
                    "severity": { "level": "Information" }
                  },
                  {
                    "id": "PageCachePurged",
                    "severity": { "level": "Debug" },
                    "state": { "enabled": false }
                  },
                  {
                    "id": "DatabaseConnectionTimeOut",
                    "severity": { "level": "Error" }
                  }
                ]
              }
            }
          }
  68. 68. Tune severity levels of specific event IDs
  69. 69. Event tracing: use enumerations (or closest thing); Technical and Domain event types; distributed systems: debuggers less useful; trace calls with ‘unique-enough’ handles; tune log levels via config
  70. 70. Log aggregation & search tools
  71. 71. Design for log aggregation
  72. 72. develop the software using log aggregation as a first-class thing
  73. 73. stories for testing logging
  74. 74. BasketItemAdded grep BasketItem
  75. 75. logging is (‘just’) another system component
  76. 76. NTP
  77. 77. Dev and Ops collaboration* * and testers too!
  78. 78. Where?
  79. 79. auditing compliance pre-emptive fault diagnosis performance metrics …
  80. 80. Recap
  81. 81. Logging is often unloved 1. Discontinuous 2. Errors only, or arbitrary 3. ‘Bolted on’ 4. No aggregation & search 5. Specify severity up front
  82. 82. How to make logging awesome 1. Continuous event IDs 2. Transaction tracing 3. Log aggregation & search tools 4. Design for logging 5. Decoupled severity
  83. 83. logging makes things work
  84. 84. “There is no thought behind aspect-oriented programming”
  85. 85. MINDFUL LOGGING (?!)
  86. 86. database transaction logs
  87. 87. ‘Structured Logging’. TW: “Adopt” (May 2015): https://www.thoughtworks.com/radar/techniques/structured-logging ; http://gregoryszorc.com/ ; .NET: http://serilog.net/ ; Java: https://github.com/fluent/fluent-logger-java
  88. 88. sanity
  89. 89. More: Ditch the Debugger and Use Log Analysis Instead, Matthew Skelton, https://blog.logentries.com/2015/07/ditch-the-debugger-and-use-log-analysis-instead/
  90. 90. More: Using Log Aggregation Across Dev & Ops: The Pricing Advantage, Rob Thatcher, https://blog.logentries.com/2015/08/using-log-aggregation-across-dev-ops-the-pricing-advantage/
  91. 91. Evan Phoenix (@evanphx) youtube.com/watch?v=Z-JskKlIBOA
  92. 92. Books: operabilitybook.com, operationalfeatures.com
  93. 93. Thank you http://skeltonthatcher.com/ enquiries@skeltonthatcher.com @SkeltonThatcher +44 (0)20 8242 4103 @matthewpskelton
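
Slides 40 to 50 argue for continuous event IDs: a sparse, immutable enumeration with one value per interesting state transition, written into every log line. The deck shows the enum in C#; what follows is only a rough Python equivalent, since the slides list Python among the languages with a usable enum. The `log_event` helper, the `order_svc` logger name, and the message fields are assumptions made for illustration and do not come from the talk.

```python
import io
import logging
from enum import IntEnum, unique


@unique  # reject accidental duplicate values, so every ID stays unique
class EventID(IntEnum):
    """Subset of the EventID enum from slide 44 (names kept as in the C#)."""
    NotSet = 0
    UnexpectedError = 10000
    ApplicationStarted = 20000
    BasketItemAdded = 60001
    BasketItemRemoved = 60002


def log_event(logger, event, message, level=logging.INFO):
    """Hypothetical helper: tag one log line with the event name and number."""
    logger.log(level, "%s(%d) %s", event.name, event.value, message)


# Log to an in-memory stream so the sketch is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger = logging.getLogger("order_svc")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

log_event(logger, EventID.ApplicationStarted, "version=1.2.3")
log_event(logger, EventID.BasketItemAdded, "sku=ABC-123 qty=2")
log_event(logger, EventID.BasketItemRemoved, "sku=ABC-123")

lines = stream.getvalue().splitlines()
# The equivalent of `grep BasketItem` from slide 74:
basket_lines = [line for line in lines if "BasketItem" in line]
print(basket_lines)
```

Because related events share a name prefix, a plain text search for `BasketItem` returns both the add and the remove events, which is the point slide 74 makes with `grep BasketItem`.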

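Slides 54 to 56 describe transaction tracing: a 'unique-ish' identifier is created for each request and passed through every downstream layer, so that a distributed call can be reconstructed from the logs afterwards. One possible sketch uses Python's standard `contextvars` and `logging` modules. The component names echo the Upload, Job Queue, Worker, and Storage boxes on slide 39; the event names, the message shape, and the 12-character ID length are assumptions, not details from the talk.

```python
import contextvars
import io
import logging
import uuid

# Holds the transaction ID of the request being handled; "-" means none set.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionFilter(logging.Filter):
    """Copy the current transaction ID onto every log record."""

    def filter(self, record):
        record.txid = transaction_id.get()
        return True


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(txid)s %(name)s %(message)s"))
handler.addFilter(TransactionFilter())

base = logging.getLogger("demo")
base.setLevel(logging.INFO)
base.addHandler(handler)
base.propagate = False


def upload(payload):
    """Entry point: mint a 'unique-ish' ID and attach it to the queued message."""
    txid = uuid.uuid4().hex[:12]  # short enough to read, unique enough to trace
    transaction_id.set(txid)
    logging.getLogger("demo.upload").info("UploadReceived")
    return {"txid": txid, "payload": payload}


def worker(message):
    """Downstream component: restore the ID carried by the message."""
    transaction_id.set(message["txid"])
    logging.getLogger("demo.worker").info("JobStarted")
    logging.getLogger("demo.storage").info("BlobWritten")


message = upload(b"hello")
worker(message)

lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Every line for the request starts with the same identifier, so a search for that value in an aggregation tool returns the whole call path across components.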
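Slides 60 to 68 propose decoupling severity from the call site: the code emits an event ID, and a configuration file decides at run time which level that event gets and whether it is logged at all. A minimal sketch of that idea, reusing the JSON from slide 67, might look like the following. The mapping from the slide's level names to Python's `logging` levels, the default of `INFO` for unmapped events, and the helper names `load_mappings` and `log_event` are all assumptions.

```python
import io
import json
import logging

# The event mapping from slide 67, inlined here instead of read from a file.
CONFIG = json.loads("""
{
  "eventmappings": {"events": {"event": [
    {"id": "CacheServiceStarted", "severity": {"level": "Information"}},
    {"id": "PageCachePurged", "severity": {"level": "Debug"},
     "state": {"enabled": false}},
    {"id": "DatabaseConnectionTimeOut", "severity": {"level": "Error"}}
  ]}}
}
""")

# Assumed translation from the slide's level names to Python's levels.
LEVELS = {
    "Debug": logging.DEBUG,
    "Information": logging.INFO,
    "Warning": logging.WARNING,
    "Error": logging.ERROR,
    "Critical": logging.CRITICAL,
}


def load_mappings(config):
    """Return {event_id: (level, enabled)}; events are enabled by default."""
    mappings = {}
    for entry in config["eventmappings"]["events"]["event"]:
        level = LEVELS[entry["severity"]["level"]]
        enabled = entry.get("state", {}).get("enabled", True)
        mappings[entry["id"]] = (level, enabled)
    return mappings


def log_event(logger, mappings, event_id, message):
    """Look up severity at call time; unmapped events default to INFO."""
    level, enabled = mappings.get(event_id, (logging.INFO, True))
    if enabled:
        logger.log(level, "%s %s", event_id, message)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger = logging.getLogger("cache_svc")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False

mappings = load_mappings(CONFIG)
log_event(logger, mappings, "CacheServiceStarted", "port=6379")
log_event(logger, mappings, "PageCachePurged", "pages=42")  # disabled in config
log_event(logger, mappings, "DatabaseConnectionTimeOut", "host=db1")

# Retune one event without touching the code that emits it.
mappings["DatabaseConnectionTimeOut"] = (logging.WARNING, True)
log_event(logger, mappings, "DatabaseConnectionTimeOut", "host=db1")

lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

The call sites never name a severity, so the disagreement over levels mentioned on slide 29 becomes a configuration change rather than a code change.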