
Observability - Experiencing the “why” behind the jargon (FlowCon 2019)

This is a near duplicate of the previous keynote deck. It covers three examples where I really felt the pain of not applying core observability techniques. The three covered are:
- No pre-aggregation
- Arbitrarily wide events
- Exploration over dashboarding


  1. 1. @A_Bangser @FlowConFR #FlowCon Observability - Experiencing the “why” behind the jargon. Abby Bangser. My slides are / will be available for you at: https://www.slideshare.net/AbigailBangser
  2. 2. @A_Bangser @FlowConFR #FlowCon Observability In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  7. 7. @A_Bangser @FlowConFR #FlowCon “measure of how well” means observability is a scale. How easy is it to answer a new question without deploying new code? (Figure labels: “Incident triage happening?!”, “Incident triage”, “observability”.)
  8. 8. @A_Bangser @FlowConFR #FlowCon Observability In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  9. 9. @A_Bangser @FlowConFR #FlowCon External outputs help us answer these questions
  13. 13. @A_Bangser @FlowConFR #FlowCon So you might be thinking… “right, monitoring” https://bravenewgeek.com/wp-content/uploads/2019/10/monitoring_vs_observability_overlay-1024x539.png
  16. 16. @A_Bangser @FlowConFR #FlowCon True observability is discovering new behaviours https://bravenewgeek.com/wp-content/uploads/2019/10/monitoring_vs_observability_overlay-1024x539.png
  17. 17. @A_Bangser @FlowConFR #FlowCon Observability In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.
  19. 19. @A_Bangser @FlowConFR #FlowCon Characteristics of what generates valuable outputs https://thenewstack.io/observability-a-3-year-retrospective/ ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory (Image: By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=76921548)
  20. 20. @A_Bangser @FlowConFR #FlowCon Let’s understand a couple of these through examples ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory
  22. 22. @A_Bangser @FlowConFR #FlowCon The promise of monitoring vs my reality My rollercoaster journey with understanding metrics and pre-aggregation starts back in 2016...
  23. 23. @A_Bangser @FlowConFR #FlowCon Monitorama 2016 - an awakening Lessons include… ➔ It is not just testing that is dead ➔ Wow! There is a world of available data I have no idea about ➔ These tools are so cool...wait, what are these tools?
  24. 24. @A_Bangser @FlowConFR #FlowCon Metrics can track success (and failure) of changes made https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
  25. 25. @A_Bangser @FlowConFR #FlowCon An ask: I want to monitor live systems. An opportunity: Help create a client’s first cloud infrastructure
  26. 26. @A_Bangser @FlowConFR #FlowCon An operations focused project changed my tool chain
  28. 28. @A_Bangser @FlowConFR #FlowCon And then… just like testability, operability became hard to prioritise
  29. 29. @A_Bangser @FlowConFR #FlowCon Two years and many projects later, Hobbsy had a plan: track latency over 4 weeks and alert when current trends exceed 2 standard deviations
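The plan on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `LatencyBaseline`, its methods, and the sample latencies are hypothetical and do not come from the talk, and a real setup would compute this inside the monitoring system rather than in application code.

```java
import java.util.List;

// Hypothetical helper, not from the talk: flags a latency value that sits
// more than two standard deviations above a historical baseline.
class LatencyBaseline {
    static double mean(List<Double> samples) {
        double sum = 0.0;
        for (double s : samples) sum += s;
        return sum / samples.size();
    }

    // Population standard deviation of the baseline window.
    static double stdDev(List<Double> samples) {
        double m = mean(samples);
        double squared = 0.0;
        for (double s : samples) squared += (s - m) * (s - m);
        return Math.sqrt(squared / samples.size());
    }

    // Alert when the current value exceeds mean + 2 standard deviations.
    static boolean shouldAlert(List<Double> baseline, double current) {
        return current > mean(baseline) + 2 * stdDev(baseline);
    }

    public static void main(String[] args) {
        // Made-up latencies in seconds, standing in for 4 weeks of history.
        List<Double> fourWeeks = List.of(0.20, 0.22, 0.18, 0.20);
        System.out.println(shouldAlert(fourWeeks, 0.21)); // prints false
        System.out.println(shouldAlert(fourWeeks, 0.40)); // prints true
    }
}
```

Note that the check only works if the same latency measure is available, in the same shape, for the whole 4-week window.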
  32. 32. @A_Bangser @FlowConFR #FlowCon To do this at MOO (s/MOO/any company over a few years old/) ➔ 40 services ➔ 4 core languages ➔ 3 eras of architectural decisions ➔ 2 transport protocols (HTTP and gRPC) ...and a partridge in a pear tree
  33. 33. @A_Bangser @FlowConFR #FlowCon The plan: Standardise metrics across the estate
  34. 34. @A_Bangser @FlowConFR #FlowCon Consistency across services created so much learning
  35. 35. @A_Bangser @FlowConFR #FlowCon But...
  36. 36. @A_Bangser @FlowConFR #FlowCon Our data collection made certain assumptions which in the end required re-collecting in a different way
  41. 41. @A_Bangser @FlowConFR #FlowCon How histograms get generated in a time series DB: each observation is counted in the buckets of http_requests_seconds_bucket, which are le=0.05, le=0.1, le=0.5, le=1, le=5 and le=+inf (“le” stands for “less than or equal to”). Examples: www.moo.com in 0.25 seconds; www.moo.com/big_file in 5 seconds
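A minimal sketch of the bucket counting described on this slide, assuming Prometheus-style cumulative buckets (an observation is counted in every bucket whose upper bound it does not exceed). The class `BucketedHistogram` is a hypothetical in-memory stand-in, not a real client library.

```java
// Hypothetical in-memory stand-in for a histogram metric; a real client
// library would maintain these counters for you.
class BucketedHistogram {
    // Upper bounds taken from the slide; +Inf catches everything else.
    static final double[] UPPER_BOUNDS =
            {0.05, 0.1, 0.5, 1, 5, Double.POSITIVE_INFINITY};

    private final long[] counts = new long[UPPER_BOUNDS.length];

    // Buckets are cumulative: an observation increments every bucket
    // whose upper bound is greater than or equal to the observed value.
    void observe(double seconds) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (seconds <= UPPER_BOUNDS[i]) counts[i]++;
        }
    }

    long countFor(double upperBound) {
        for (int i = 0; i < UPPER_BOUNDS.length; i++) {
            if (UPPER_BOUNDS[i] == upperBound) return counts[i];
        }
        throw new IllegalArgumentException("No such bucket: " + upperBound);
    }

    public static void main(String[] args) {
        BucketedHistogram h = new BucketedHistogram();
        h.observe(0.25); // www.moo.com
        h.observe(5.0);  // www.moo.com/big_file
        for (double bound : UPPER_BOUNDS) {
            System.out.println("le=" + bound + " -> " + h.countFor(bound));
        }
    }
}
```

After the two example requests, the le=0.5 and le=1 buckets hold 1, and le=5 and le=+inf hold 2. Only the counts survive: the stored data can no longer tell a 0.25 second request from a 0.4 second one.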
  42. 42. @A_Bangser @FlowConFR #FlowCon We collected counts of how many requests per bucket (Figure: http_requests_seconds_bucket with buckets le=.05, le=.1, le=.5, le=1, le=5, le=+inf, repeated at offsets of 1, 2, 3 and 4 weeks)
  43. 43. @A_Bangser @FlowConFR #FlowCon The data we had collected, we had to throw away (Figure: the same http_requests_seconds_bucket buckets at offsets of 1, 2, 3 and 4 weeks)
  44. 44. @A_Bangser @FlowConFR #FlowCon At least the update was made, now we are all set, right?
  45. 45. @A_Bangser @FlowConFR #FlowCon Except, that 99th percentile...what does that actually mean? (Figure: buckets le=.05, le=.1, le=.5, le=1, le=5, le=+inf)
  46. 46. @A_Bangser @FlowConFR #FlowCon Let’s see what our logs say about it
  47. 47. @A_Bangser @FlowConFR #FlowCon Just 1% of 500,000 requests applies to 56,000 people
  48. 48. @A_Bangser @FlowConFR #FlowCon To see >10 seconds, I would need the 99.996%
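To make the percentile arithmetic concrete, here is a small sketch that counts how many requests sit above a given percentile. The 500,000 total is taken from the slide; the helper `PercentileTail.requestsAbove` is hypothetical, and applying the same total to the 99.996 figure is an assumption.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper for the arithmetic on the slides: how many requests
// fall in the tail above a given percentile.
class PercentileTail {
    // percentile is passed as text (e.g. "99.996") so the decimal math is exact.
    static long requestsAbove(long totalRequests, String percentile) {
        BigDecimal tailFraction = new BigDecimal("100")
                .subtract(new BigDecimal(percentile))
                .divide(new BigDecimal("100"));
        return tailFraction
                .multiply(BigDecimal.valueOf(totalRequests))
                .setScale(0, RoundingMode.HALF_UP)
                .longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(requestsAbove(500_000, "99"));     // prints 5000
        System.out.println(requestsAbove(500_000, "99.996")); // prints 20
    }
}
```

So a 99th percentile over 500,000 requests says nothing about the slowest 5,000, and the requests slower than 10 seconds would be roughly the slowest 20.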
  49. 49. @A_Bangser @FlowConFR #FlowCon So, while consistent metrics trending over time was a big step forward... in retrospect, these experiences were not mature observability
  50. 50. @A_Bangser @FlowConFR #FlowCon Why avoid pre-aggregation? Because you can never regain the original context and detail; you can only ever ask predetermined questions
  51. 51. @A_Bangser @FlowConFR #FlowCon Let’s understand a couple of these through examples ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory
  52. 52. @A_Bangser @FlowConFR #FlowCon Data is not the same as information. Step one is accepting that while sentences may be readable, <key : value> pairs are more easily queried.
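A rough sketch of why key-value pairs are easier to query than sentences. The class `StructuredVsSentence`, its methods, and the sample values are hypothetical; the sentence format is borrowed from the log message used later in the deck.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of querying a sentence log vs key-value records.
class StructuredVsSentence {
    // Sentence logs need one pattern per message shape to get values back out.
    static final Pattern FLIPPED =
            Pattern.compile("Successfully flipped image id: (\\S+)");

    static String imageIdFromSentence(String line) {
        Matcher m = FLIPPED.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Key-value records can be filtered on any field without parsing.
    static long countWhere(List<Map<String, String>> records,
                           String key, String value) {
        return records.stream()
                .filter(r -> value.equals(r.get(key)))
                .count();
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(
                imageIdFromSentence("Successfully flipped image id: 42")); // prints 42

        List<Map<String, String>> records = List.of(
                Map.of("action", "flip", "image_id", "42"),
                Map.of("action", "resize", "image_id", "43"));
        System.out.println(countWhere(records, "action", "flip")); // prints 1
    }
}
```

The regular expression breaks as soon as the sentence wording changes, while the key-value filter keeps working for any field that is present.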
  53. 53. @A_Bangser @FlowConFR #FlowCon Even from the first “Hello World” we humans logged
  54. 54. @A_Bangser @FlowConFR #FlowCon And from there we wanted more information 7a82dd3a
  56. 56. @A_Bangser @FlowConFR #FlowCon So then we backfilled in structure

        grok {
          match => [ "Request", "%{URIPROTO:request_uri_scheme}://%{HOSTNAME:request_uri_host}(?::%{POSINT:request_uri_port})?%{URIPATH:request_uri_path}(?:%{URIPARAM:request_uri_query})?" ]
        }
  57. 57. @A_Bangser @FlowConFR #FlowCon And of course, from there we wanted more

        mutate {
          split => { "uri_array" => "/" }
          add_field => {
            "uri_root" => ["/%{[uri_array][1]}"]
            "uri_first" => ["/%{[uri_array][2]}"]
            "uri_second" => ["/%{[uri_array][3]}"]
            "uri_root_first" => "%{uri_root}%{uri_first}"
            "uri_root_second" => "%{uri_root}%{uri_first}%{uri_second}"
          }
        }
  58. 58. @A_Bangser @FlowConFR #FlowCon And even looking past the bad field values, lots of servers means lots of intermingled logs
  60. 60. @A_Bangser @FlowConFR #FlowCon Rewind...how are logs written during a request?
  61. 61. @A_Bangser @FlowConFR #FlowCon Detailing how logs get written during a request

        @PostMapping("flip")
        public ResponseEntity flipImage(@RequestParam("image") MultipartFile file,
                                        @RequestParam(value = "vertical") Boolean vertical,
                                        @RequestParam(value = "horizontal") Boolean horizontal) {
            // Condition reproduced from the slide; a real check would compare
            // the content type against the supported image types.
            if (file.getContentType() != null) {
                LOGGER.warn("Wrong content type uploaded: {}", file.getContentType());
                // Status code is an assumption; the slide omits it.
                return new ResponseEntity<>("Wrong content type uploaded: " + file.getContentType(),
                                            HttpStatus.BAD_REQUEST);
            }
            LOGGER.info("Receiving {} image to flip.", file.getContentType());
            byte[] flippedImage = imageService.flip(file, vertical, horizontal);
            if (flippedImage == null) {
                return new ResponseEntity<>("Failed to flip image", HttpStatus.INTERNAL_SERVER_ERROR);
            }
            LOGGER.info("Successfully flipped image id: {}", file.getId());
            return new ResponseEntity<>(flippedImage, headers, HttpStatus.OK);
        }

      The slide builds highlight the two log lines written on the success path: first LOGGER.info("Receiving {} image to flip.", file.getContentType()); and then LOGGER.info("Successfully flipped image id: {}", file.getId());
  70. 70. @A_Bangser @FlowConFR #FlowCon Log outputs
  71. 71. @A_Bangser @FlowConFR #FlowCon In contrast, how an event is created during a request

        @PostMapping("flip")
        public ResponseEntity flipImage(...) {
            EVENT.addField("content.type", file.getContentType());
            EVENT.addField("action", "flip");
            EVENT.addField("image_id", file.getId());
            EVENT.addField("flip_vertical", vertical);
            EVENT.addField("flip_horizontal", horizontal);
            ...
            LOGGER.info("Receiving {} image to flip.", file.getContentType());
            byte[] flippedImage = imageService.flip(file, vertical, horizontal);
            ...
            LOGGER.info("Successfully flipped image id: {}", file.getId());
            EVENT.addField("action.success", "true");
            return new ResponseEntity<>(flippedImage, headers, HttpStatus.OK);
        }

      The slide builds show fields accumulating on the one event as the request progresses: content.type, action, image_id, flip_vertical, flip_horizontal and finally action.success.
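As a rough illustration of the idea behind `EVENT.addField(...)`, the sketch below accumulates fields during a request and emits them once as a single line. The `WideEvent` class and its `render` method are hypothetical stand-ins, not the library used in the talk, and the field values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for the EVENT object on the slides: one event per
// request, with fields added as the request progresses.
class WideEvent {
    // LinkedHashMap keeps fields in the order they were added.
    private final Map<String, Object> fields = new LinkedHashMap<>();

    WideEvent addField(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    // Render every accumulated field as a single key=value line.
    String render() {
        StringJoiner line = new StringJoiner(" ");
        fields.forEach((k, v) -> line.add(k + "=" + v));
        return line.toString();
    }

    public static void main(String[] args) {
        WideEvent event = new WideEvent();
        event.addField("content.type", "image/png");
        event.addField("action", "flip");
        event.addField("image_id", "42");
        event.addField("flip_vertical", true);
        event.addField("flip_horizontal", false);
        // ... the actual work of the request happens here ...
        event.addField("action.success", true);
        System.out.println(event.render());
    }
}
```

Adding one more piece of context is a single `addField` call, and everything known about the request ends up in one record rather than spread across several log lines.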
  82. 82. @A_Bangser @FlowConFR #FlowCon Comparing the outputs: multiple logs vs a single event
  83. 83. @A_Bangser @FlowConFR #FlowCon Making the information easy to query
  84. 84. @A_Bangser @FlowConFR #FlowCon And keeping the information in context
  85. 85. @A_Bangser @FlowConFR #FlowCon Most importantly, making it easy to add more!
  86. 86. @A_Bangser @FlowConFR #FlowCon In order to combat tribal-knowledge-based guessing when debugging our complex systems, we need a low-friction way to add fields to your logs for structure and searchability, allowing application and user context to be wrapped in a business context (for example CustomerID:234567, VersionOfApp:2, RequestedUri:www.)
  87. 87. @A_Bangser @FlowConFR #FlowCon Let’s understand a couple of these through examples ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory
  88. 88. @A_Bangser @FlowConFR #FlowCon Debugging distributed systems is hard, especially when business impact is on the line. Let’s talk outages
  89. 89. @A_Bangser @FlowConFR #FlowCon Hmmm, a warning alert has come in This is an automated alert based on a warning production service sending a high percent of 500’s in production!
  90. 90. @A_Bangser @FlowConFR #FlowCon Yup, definitely an issue
  91. 91. @A_Bangser @FlowConFR #FlowCon All hands on deck, what is happening...and why?
  96. 96. @A_Bangser @FlowConFR #FlowCon 2+ hrs in and we still aren’t sure we know what happened
  97. 97. @A_Bangser @FlowConFR #FlowCon And then it keeps happening
  98. 98. @A_Bangser @FlowConFR #FlowCon Oncall engineers are not amused
  99. 99. @A_Bangser @FlowConFR #FlowCon But the service owners weren’t just lounging around
  100. 100. @A_Bangser @FlowConFR #FlowCon And these were some awesome dashboards
  101. 101. @A_Bangser @FlowConFR #FlowCon Let’s break down what this dashboard shows: Request Counts; Response Latency
  102. 102. @A_Bangser @FlowConFR #FlowCon Let’s break down what this dashboard shows: Enhanced Images; Original Images; Enhanced Images; Enhanced and resized; Request Counts; Response Latency
  103. 103. @A_Bangser @FlowConFR #FlowCon This dashboard helped limit impact ~3 hours 40 min
  104. 104. @A_Bangser @FlowConFR #FlowCon And eventually, powerful human pattern matchers solved the problem
  105. 105. @A_Bangser @FlowConFR #FlowCon So what happens to this dashboard now?
  106. 106. @A_Bangser @FlowConFR #FlowCon They have been sent to a farm… with their other friends
  107. 107. @A_Bangser @FlowConFR #FlowCon Why ditch the dashboards? The scar tissue of your past outages is not a sufficient replacement for the creativity required to investigate your future incidents https://www.needpix.com/photo/907639/images-leash-leash-polaroid-free-pictures-free-photos-free-images-royalty-free
  108. 108. @A_Bangser @FlowConFR #FlowCon Let’s revisit those characteristics ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory. By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=80936515
  109. 109. @A_Bangser @FlowConFR #FlowCon Let’s revisit those characteristics ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory. By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=80936515 The only way to ask new questions is to keep the original raw data available and queryable.
  110. 110. @A_Bangser @FlowConFR #FlowCon Let’s revisit those characteristics ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory. By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=80936515 Make data easy to add details to and easy to query.
  111. 111. @A_Bangser @FlowConFR #FlowCon Let’s revisit those characteristics ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory. By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=80936515 Empower creative and shared exploration based on business context.
  112. 112. @A_Bangser @FlowConFR #FlowCon Let’s revisit those characteristics ➔ raw events ➔ no pre-aggregation ➔ structured data ➔ arbitrarily wide events ➔ schema-less-ness ➔ high cardinality dimensions ➔ oriented around the lifecycle of the request ➔ batched up context ➔ static dashboards don’t work, it must be exploratory. By Twitter, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=80936515 The only way to ask new questions is to keep the original raw data available and queryable. Make data easy to add details to and easy to query. Empower creative and shared exploration based on business context.
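To make the "no pre-aggregation" point concrete, here is a small sketch under stated assumptions: the events, the `image_type` field and the helper names `pre_aggregate` and `query` are all hypothetical illustrations, not data or tooling from the talk. It contrasts a count that was aggregated up front with raw events that can still be grouped by any field after the fact.

```python
from collections import Counter

# Hypothetical raw events, one per request; all values are invented for illustration.
RAW_EVENTS = [
    {"CustomerID": 234567, "VersionOfApp": 2, "status": 500, "image_type": "enhanced"},
    {"CustomerID": 234567, "VersionOfApp": 2, "status": 500, "image_type": "enhanced_resized"},
    {"CustomerID": 111111, "VersionOfApp": 1, "status": 200, "image_type": "original"},
    {"CustomerID": 234567, "VersionOfApp": 2, "status": 500, "image_type": "enhanced"},
    {"CustomerID": 222222, "VersionOfApp": 1, "status": 200, "image_type": "enhanced"},
]


def pre_aggregate(events):
    """Collapse events into per-status counts, as a fixed dashboard metric might."""
    return Counter(e["status"] for e in events)


def query(events, where, group_by):
    """Count events matching every key/value in `where`, grouped by any field."""
    return Counter(
        e[group_by]
        for e in events
        if all(e.get(key) == value for key, value in where.items())
    )


# The pre-aggregated view says how many 500s occurred, but not who was affected.
print(pre_aggregate(RAW_EVENTS))

# With raw events kept, a new question needs no new code in the service.
print(query(RAW_EVENTS, where={"status": 500}, group_by="CustomerID"))
print(query(RAW_EVENTS, where={"status": 500}, group_by="VersionOfApp"))
```

Once only the per-status counter is stored, the customer and version dimensions are gone for good; keeping the raw events is what lets the second and third questions be asked later.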
  113. 113. @A_Bangser @FlowConFR #FlowCon QA; TWU. Looking back, journeys are never clear, so why do we still expect them to be when we start a new one? Political Science Major; Data analysis for investments; A desire to learn how to code; Automation FTW!; An “analyst” computer; A “DevOps” friend engaged me in his work; Monitorama; An infrastructure project; Platform Engineering @; Professional scuba diver; A (slight) obsession with observability
  114. 114. @A_Bangser @FlowConFR #FlowCon Start where you are. Use what you have. Do what you can. - Arthur Ashe
  115. 115. @A_Bangser @FlowConFR #FlowCon ➔ All of tech and product is now asking more interesting questions ➔ We are expecting more of our tooling ➔ We are building new awareness about our services and system Start where you are. Use what you have. Do what you can. - Arthur Ashe
  116. 116. @A_Bangser @FlowConFR #FlowCon Thank you! www.SlideShare.net/AbigailBangser @A_Bangser
