Michael DeSa [InfluxData] | Monitoring Methodologies | InfluxDays Virtual Experience London 2020

The objective of this workshop will be to introduce participants to the RED and USE monitoring methodologies. We will compare and contrast the two methodologies. Care will be taken to highlight best practices associated with each methodology. The workshop will culminate in participants designing schema and using Flux to create relevant visualizations of their data.


  1. 1. Monitoring Methodologies. Michael DeSa, Engineering Manager
  2. 2. © 2020 InfluxData. All rights reserved. 2 Agenda: ● Part 1 ○ Explain the USE and RED monitoring methodologies ○ Talk about what to do if they don’t apply ● Part 2 ○ Reason about types of metrics and events ○ Design metrics with RED/USE methodology in mind ○ Create queries around these metrics
  3. 3. Part 1 Monitoring Methodologies
  4. 4. What is monitoring?
  5. 5. © 2020 InfluxData. All rights reserved. 5 Monitoring - What ● Observing the progress/quality of something over a period of time ● It’s a system where you collect metrics and events about your other systems ● It allows you to observe those events ● It helps you diagnose and understand the state of your system
  6. 6. Why is it so difficult?
  7. 7. Monitoring - Difficult ● There’s a nearly endless number of things that we could monitor about any one system ● It’s not immediately clear which data is signal and which is noise ● In general, people tend to just monitor everything
  8. 8. © 2020 InfluxData. All rights reserved. 8 Goal of this talk ● What signals to pay attention to ● Ways you can define those signals ● How to write queries that give you insight into those signals
  9. 9. What should I monitor?
  10. 10. What do you want to know about your system?
  11. 11. © 2020 InfluxData. All rights reserved. 11 What do you want to know ● Why isn’t my system performing as expected? ● What is the experience of my system?
  12. 12. USE & RED Monitoring Methodologies
  13. 13. © 2020 InfluxData. All rights reserved. 13 USE Method ● Developed by Brendan Gregg ● Used to identify performance problems in a system
  14. 14. USE Method ● Utilization ○ The proportion of time that a resource is busy ● Saturation ○ The degree to which a resource has extra work it cannot service yet, often queued ● Errors ○ The number of errors associated with a resource
  15. 15. 😒
  16. 16. © 2020 InfluxData. All rights reserved. 16 USE Method Imagine you’re running a pizza shop... order order order order order order Saturation Utilization Errors
  17. 17. 🤔
  18. 18. © 2020 InfluxData. All rights reserved. 18 Lots of Errors Imagine you’re running a pizza shop... order order order order order order Errors UtilizationSaturation
  19. 19. © 2020 InfluxData. All rights reserved. 19 High Utilization - 100% Imagine you’re running a pizza shop... order order order order order order Saturation Utilization order order order order order order
  20. 20. © 2020 InfluxData. All rights reserved. 20 High Utilization - 70% Imagine you’re running a pizza shop... Saturation Utilization order order order order order order
  21. 21. © 2020 InfluxData. All rights reserved. 21 High Saturation / Low Utilization Imagine you’re running a pizza shop... Saturation Utilization order order order order order order order order order order order order
  22. 22. © 2020 InfluxData. All rights reserved. 22 Applying the USE Method ● Itemize all of the resources in your system ● For each resource in your system monitor: ○ Utilization ○ Saturation ○ Errors
  23. 23. What if I don’t have a performance problem?
  24. 24. © 2020 InfluxData. All rights reserved. 24 RED Method ● Developed at WeaveWorks ● Rooted in Google’s Four Golden Signals ● Helps you understand the experience of your system ● Used to set Service Level Objectives*
  25. 25. RED Method ● (Request) Rate ○ The count of occurrences of a request ● Error (Rate) ○ The count of errors associated with a request ● (Request) Duration ○ How long it takes to serve each request
  26. 26. RED Method Imagine you’re running a pizza shop... order order order order order order Pizza Rate - the rate of incoming pizza orders Error Rate - the frequency of errors encountered while making pizzas Duration - how long it takes you to make a pizza from the time the order is placed
  27. 27. © 2020 InfluxData. All rights reserved. 27 High Duration Imagine you’re running a pizza shop... order order order order order order Long Pizza Order
  28. 28. © 2020 InfluxData. All rights reserved. 28 High Errors Imagine you’re running a pizza shop... order order order order order order
  29. 29. © 2020 InfluxData. All rights reserved. 29 High Request Rate Imagine you’re running a pizza shop... order order order order order order order order order order order order
  30. 30. © 2020 InfluxData. All rights reserved. 30 Applying the RED Method ● For each operation you’d like to understand monitor: ○ Operation Rate ○ Operation Errors ○ Operation Duration
  31. 31. What if I don’t have either a performance problem or an experience problem?
  32. 32. © 2020 InfluxData. All rights reserved. 32 Develop a Methodology ● Reason from first principles about your system ● Have specific ideas about what you’d like to understand ● Construct/Discover metrics in your system that will help you answer questions
  33. 33. Part 2 Monitoring in Practice
  34. 34. 🤔
  35. 35. © 2020 InfluxData. All rights reserved. 35 Monitoring Data Types ● Pull Style Metrics ○ Metrics are stored within a system and scraped periodically ○ Typically store aggregated values ● Push Style Events ○ Metrics are “fire and forget” ○ Typically store point-in-time or non-aggregated values
  36. 36. Monitoring Data Types Pull Style Metrics ○ Counters ■ A monotonically increasing integer ■ Used to track operations where you’d like to know the rate at which the operation happens ○ Gauges ■ A value that can go up or down ■ Used to track things like the number of concurrent requests ○ Others*
  37. 37. © 2020 InfluxData. All rights reserved. 37 Monitoring Data Types Push Style Events ○ Raw Events ■ Contains no other information other than that the operation has happened ■ Common events are combined together using shared metadata ○ Derived Events ■ Generates an individual metric for each thing that takes place ■ Can be thought of as a point in time gauge
  38. 38. 😒
  39. 39. A Quick Note about Flux
  40. 40. Base Flux // cpu,host=A usage_idle=11,usage_user=10 from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "cpu")
  41. 41. Base Flux // cpu,host=A usage_idle=11,usage_user=10 import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "cpu") |> v1.fieldsAsCols()
  42. 42. Derived Events RED HTTP API Monitoring
  43. 43. © 2020 InfluxData. All rights reserved. 43 HTTP API Monitoring - Derived Events http_event,host=A,request_id=abc,status=500 durationNs=123045 t0 http_event,host=A,request_id=123,status=500 durationNs=123021 t1 http_event,host=A,request_id=345,status=200 durationNs=213045 t2 http_event,host=B,request_id=xyz,status=200 durationNs=213045 t3
  44. 44. © 2020 InfluxData. All rights reserved. 44 HTTP API Monitoring - Derived Events (Rate) http_event,host=A,request_id=abc,status=500 durationNs=123045 t0 http_event,host=A,request_id=123,status=500 durationNs=123021 t1 http_event,host=A,request_id=345,status=200 durationNs=213045 t2 http_event,host=B,request_id=xyz,status=200 durationNs=213045 t3 Count Them: 4!
  45. 45. Derived Events Request Rate import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() |> group() |> aggregateWindow( every: 1m, fn: count, column: "durationNs", )
  46. 46. © 2020 InfluxData. All rights reserved. 46 HTTP API Monitoring - Derived Events (Errors) http_event,host=A,request_id=abc,status=500 durationNs=123045 t0 http_event,host=A,request_id=123,status=500 durationNs=123021 t1 http_event,host=A,request_id=345,status=200 durationNs=213045 t2 http_event,host=B,request_id=xyz,status=200 durationNs=213045 t3 Count Them: 2!
  47. 47. Derived Events Error Rate import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() |> filter(fn: (r) => r.status =~ /5\d{2}|4\d{2}/) |> group() |> aggregateWindow( every: 1m, fn: count, column: "durationNs", )
  48. 48. © 2020 InfluxData. All rights reserved. 48 HTTP API Monitoring - Derived Events (Duration) http_event,host=A,request_id=abc,status=500 durationNs=123045 t0 http_event,host=A,request_id=123,status=500 durationNs=123021 t1 http_event,host=A,request_id=345,status=200 durationNs=213045 t2 http_event,host=B,request_id=xyz,status=200 durationNs=213045 t3 Take the Mean of Them: 168039!
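The three derived-event calculations (count the requests, count the errors, take the mean duration) can be checked by hand with a small Python sketch. The tuple layout and the `red_summary` helper are illustrative assumptions and not part of the talk; the four events are the ones shown on the slides.

```python
from statistics import mean

# The four derived events from the slides: (host, request_id, status, durationNs).
events = [
    ("A", "abc", 500, 123045),
    ("A", "123", 500, 123021),
    ("A", "345", 200, 213045),
    ("B", "xyz", 200, 213045),
]


def red_summary(events):
    """Return (request count, error count, mean duration in ns) for a batch of events."""
    request_count = len(events)
    # Treat any 4xx or 5xx status as an error, as the slides' status filter intends.
    error_count = sum(1 for _, _, status, _ in events if 400 <= status < 600)
    mean_duration_ns = mean(duration for _, _, _, duration in events)
    return request_count, error_count, mean_duration_ns


rate, errors, duration = red_summary(events)
print(rate, errors, duration)
```

Because each derived event already carries its own duration, all three RED signals fall out of a single pass over the data.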
  49. 49. Derived Events Duration import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() |> group() |> aggregateWindow( every: 1m, fn: mean, column: "durationNs", )
  50. 50. Raw Events RED HTTP API Monitoring
  51. 51. © 2020 InfluxData. All rights reserved. 51 HTTP API Monitoring - Raw Events http_event,host=A,request_id=abc event="request started" t0 http_event,host=A,request_id=abc,status=500 event="request finished" t1 http_event,host=A,request_id=123 event="request started" t2 http_event,host=A,request_id=123,status=500 event="request finished" t3 http_event,host=A,request_id=345 event="request started" t4 http_event,host=A,request_id=345,status=200 event="request finished" t5 http_event,host=B,request_id=xyz event="request started" t6 http_event,host=B,request_id=xyz,status=200 event="request finished" t7
  52. 52. © 2020 InfluxData. All rights reserved. 52 HTTP API Monitoring - Raw Events (Rate) http_event,host=A,request_id=abc event="request started" t0 http_event,host=A,request_id=abc,status=500 event="request finished" t1 http_event,host=A,request_id=123 event="request started" t2 http_event,host=A,request_id=123,status=500 event="request finished" t3 http_event,host=A,request_id=345 event="request started" t4 http_event,host=A,request_id=345,status=200 event="request finished" t5 http_event,host=B,request_id=xyz event="request started" t6 http_event,host=B,request_id=xyz,status=200 event="request finished" t7 Count the finished event: 4
  53. 53. Raw Events Rate import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() |> filter(fn: (r) => r.event == "request finished") |> group() |> aggregateWindow(every: 1m, fn: count, column: "event")
  54. 54. © 2020 InfluxData. All rights reserved. 54 HTTP API Monitoring - Raw Events (Errors) http_event,host=A,request_id=abc event="request started" t0 http_event,host=A,request_id=abc,status=500 event="request finished" t1 http_event,host=A,request_id=123 event="request started" t2 http_event,host=A,request_id=123,status=500 event="request finished" t3 http_event,host=A,request_id=345 event="request started" t4 http_event,host=A,request_id=345,status=200 event="request finished" t5 http_event,host=B,request_id=xyz event="request started" t6 http_event,host=B,request_id=xyz,status=200 event="request finished" t7 Count the error events: 2
  55. 55. Raw Events Errors import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() |> filter(fn: (r) => r.status =~ /5\d{2}|4\d{2}/) |> filter(fn: (r) => r.event == "request finished") |> group() |> aggregateWindow(every: 1m, fn: count, column: "event")
  56. 56. © 2020 InfluxData. All rights reserved. 56 HTTP API Monitoring - Raw Events (Duration) http_event,host=A,request_id=abc event="request started" t0 http_event,host=A,request_id=abc,status=500 event="request finished" t1 http_event,host=A,request_id=123 event="request started" t2 http_event,host=A,request_id=123,status=500 event="request finished" t3 http_event,host=A,request_id=345 event="request started" t4 http_event,host=A,request_id=345,status=200 event="request finished" t5 http_event,host=B,request_id=xyz event="request started" t6 http_event,host=B,request_id=xyz,status=200 event="request finished" t7 Identify start/finish events
  57. 57. Raw Events Duration import "influxdata/influxdb/v1" base = from(bucket: "bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() start = base |> filter(fn: (r) => r.event == "request started") finish = base |> filter(fn: (r) => r.event == "request finished") join(tables: {start: start, finish: finish}, on: ["request_id"]) |> duplicate(as: "_time", column: "_time_start") |> map(fn: (r) => ({r with durationNs: int(v: r._time_finish) - int(v: r._time_start)})) |> map(fn: (r) => ({r with durationS: float(v: r.durationNs)/1000000000.0})) |> group() |> aggregateWindow(every: 30s, fn: mean, column: "durationS")
  58. 58. © 2020 InfluxData. All rights reserved. 58 HTTP API Monitoring - Raw Events (Duration) http_event,host=A,request_id=abc event="request started" t0 http_event,host=A,request_id=abc,status=500 event="request finished" t1 http_event,host=A,request_id=123 event="request started" t2 http_event,host=A,request_id=123,status=500 event="request finished" t3 http_event,host=A,request_id=345 event="request started" t4 http_event,host=A,request_id=345,status=200 event="request finished" t5 http_event,host=B,request_id=xyz event="request started" t6 http_event,host=B,request_id=xyz,status=200 event="request finished" t7 Join them on their request_id
  59. 59. Raw Events Duration import "influxdata/influxdb/v1" base = from(bucket: "bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() start = base |> filter(fn: (r) => r.event == "request started") finish = base |> filter(fn: (r) => r.event == "request finished") join(tables: {start: start, finish: finish}, on: ["request_id"]) |> duplicate(as: "_time", column: "_time_start") |> map(fn: (r) => ({r with durationNs: int(v: r._time_finish) - int(v: r._time_start)})) |> map(fn: (r) => ({r with durationS: float(v: r.durationNs)/1000000000.0})) |> group() |> aggregateWindow(every: 30s, fn: mean, column: "durationS")
  60. 60. HTTP API Monitoring - Raw Events (Duration) http_event,host=A,request_id=abc,status=500 event_start=t0,event_stop=t1 t0 http_event,host=A,request_id=123,status=500 event_start=t2,event_stop=t3 t2 http_event,host=A,request_id=345,status=200 event_start=t4,event_stop=t5 t4 http_event,host=B,request_id=xyz,status=200 event_start=t6,event_stop=t7 t6 Compute the time deltas: (t1 - t0), (t3 - t2), (t5 - t4), (t7 - t6)
  61. 61. Raw Events Duration import "influxdata/influxdb/v1" base = from(bucket: "bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() start = base |> filter(fn: (r) => r.event == "request started") finish = base |> filter(fn: (r) => r.event == "request finished") join(tables: {start: start, finish: finish}, on: ["request_id"]) |> duplicate(as: "_time", column: "_time_start") |> map(fn: (r) => ({r with durationNs: int(v: r._time_finish) - int(v: r._time_start)})) |> map(fn: (r) => ({r with durationS: float(v: r.durationNs)/1000000000.0})) |> group() |> aggregateWindow(every: 30s, fn: mean, column: "durationS")
  62. 62. Raw Events Duration import "influxdata/influxdb/v1" base = from(bucket: "bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() start = base |> filter(fn: (r) => r.event == "request started") finish = base |> filter(fn: (r) => r.event == "request finished") join(tables: {start: start, finish: finish}, on: ["request_id"]) |> duplicate(as: "_time", column: "_time_start") |> map(fn: (r) => ({r with durationNs: int(v: r._time_finish) - int(v: r._time_start)})) |> map(fn: (r) => ({r with durationS: float(v: r.durationNs)/1000000000.0})) |> group() |> aggregateWindow(every: 30s, fn: mean, column: "durationS")
  63. 63. HTTP API Monitoring - Raw Events (Duration) http_event,host=A,request_id=abc,status=500 event_duration=t1-t0 t0 http_event,host=A,request_id=123,status=500 event_duration=t3-t2 t2 http_event,host=A,request_id=345,status=200 event_duration=t5-t4 t4 http_event,host=B,request_id=xyz,status=200 event_duration=t7-t6 t6 Average the durations: [(t1 - t0) + (t3 - t2) + (t5 - t4) + (t7 - t6)]/4
  64. 64. Raw Events Duration import "influxdata/influxdb/v1" base = from(bucket: "bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() start = base |> filter(fn: (r) => r.event == "request started") finish = base |> filter(fn: (r) => r.event == "request finished") join(tables: {start: start, finish: finish}, on: ["request_id"]) |> duplicate(as: "_time", column: "_time_start") |> map(fn: (r) => ({r with durationNs: int(v: r._time_finish) - int(v: r._time_start)})) |> map(fn: (r) => ({r with durationS: float(v: r.durationNs)/1000000000.0})) |> group() |> aggregateWindow(every: 30s, fn: mean, column: "durationS")
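The raw-event duration steps (split into start and finish events, join on `request_id`, subtract the timestamps, average) can be mirrored in plain Python. This is only a sketch: the nanosecond timestamps are made-up values standing in for t0 through t7, and `request_durations` is a hypothetical helper name.

```python
from statistics import mean

# (request_id, event, timestamp_ns); the timestamps are made-up illustration values.
raw_events = [
    ("abc", "request started", 1_000),
    ("abc", "request finished", 4_000),
    ("123", "request started", 5_000),
    ("123", "request finished", 7_000),
    ("345", "request started", 8_000),
    ("345", "request finished", 9_000),
    ("xyz", "request started", 9_500),  # still in flight: no finish event yet
]


def request_durations(raw_events):
    """Pair start and finish events by request_id and return duration per request."""
    starts, finishes = {}, {}
    for request_id, event, timestamp_ns in raw_events:
        if event == "request started":
            starts[request_id] = timestamp_ns
        elif event == "request finished":
            finishes[request_id] = timestamp_ns
    # Inner join: only requests with both a start and a finish get a duration.
    return {rid: finishes[rid] - starts[rid] for rid in starts if rid in finishes}


durations = request_durations(raw_events)
mean_duration_ns = mean(durations.values())
print(durations, mean_duration_ns)
```

The in-flight request drops out of the result, just as an inner join only keeps rows present on both sides.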
  65. 65. Pull Style RED HTTP API Monitoring
  66. 66. © 2020 InfluxData. All rights reserved. 66 HTTP API Monitoring - Counters http_requests,host=A,status=500 durationSum=12304,requestCount=2 t0 http_requests,host=A,status=500 durationSum=52504,requestCount=5 t1 http_requests,host=A,status=500 durationSum=92307,requestCount=9 t2 http_requests,host=B,status=200 durationSum=23304,requestCount=1 t0 http_requests,host=B,status=200 durationSum=35045,requestCount=2 t1 http_requests,host=B,status=200 durationSum=90713,requestCount=8 t2 durationSum: counter - sum of all requests durations with shared metadata requestCount: counter - count of the total number of requests with shared metadata
  67. 67. © 2020 InfluxData. All rights reserved. 67 HTTP API Monitoring - Counters (Rate) http_requests,host=A,status=500 durationSum=12304,requestCount=2 t0 http_requests,host=A,status=500 durationSum=52504,requestCount=5 t1 http_requests,host=A,status=500 durationSum=92307,requestCount=9 t2 http_requests,host=B,status=200 durationSum=23304,requestCount=1 t0 http_requests,host=B,status=200 durationSum=35045,requestCount=2 t1 http_requests,host=B,status=200 durationSum=90713,requestCount=8 t2 Take the delta of the request count
  68. 68. Counters Rate import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> difference(nonNegative: true) |> v1.fieldsAsCols() |> group() |> aggregateWindow(every: 1m, fn: sum, column: "requestCount")
  69. 69. HTTP API Monitoring - Counters (Rate) http_requests,host=A,status=500 deltaRequestCount=3 t1 http_requests,host=A,status=500 deltaRequestCount=4 t2 http_requests,host=B,status=200 deltaRequestCount=1 t1 http_requests,host=B,status=200 deltaRequestCount=6 t2 sum the request count delta
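The two counter steps (take the per-series delta, then sum the deltas across series) can be worked through in Python using the `requestCount` samples from the counter slides. The helper names are illustrative, and dropping negative steps is an assumption about how a counter reset should be handled in this sketch.

```python
from collections import defaultdict

# (host, time index, requestCount) samples copied from the counter slides.
samples = [
    ("A", 0, 2), ("A", 1, 5), ("A", 2, 9),
    ("B", 0, 1), ("B", 1, 2), ("B", 2, 8),
]


def non_negative_deltas(samples):
    """Difference between consecutive samples of each series, skipping negative steps."""
    previous = {}
    deltas = []
    for host, t, value in sorted(samples):
        # A drop in a counter is assumed to be a reset, so that step is skipped.
        if host in previous and value >= previous[host]:
            deltas.append((host, t, value - previous[host]))
        previous[host] = value
    return deltas


def sum_by_time(deltas):
    """Add up the deltas of all series that share a time index."""
    totals = defaultdict(int)
    for _, t, delta in deltas:
        totals[t] += delta
    return dict(totals)


deltas = non_negative_deltas(samples)
totals = sum_by_time(deltas)
print(deltas, totals)
```

Each series loses its first sample, since a delta needs a predecessor, which is why the results start at t1.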
  70. 70. Counters Rate import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> difference(nonNegative: true) |> v1.fieldsAsCols() |> group() |> aggregateWindow(every: 1m, fn: sum, column: "requestCount")
  71. 71. © 2020 InfluxData. All rights reserved. 71 HTTP API Monitoring - Counters (Errors) http_requests,host=A,status=500 durationSum=12304,requestCount=2 t0 http_requests,host=A,status=500 durationSum=52504,requestCount=5 t1 http_requests,host=A,status=500 durationSum=92307,requestCount=9 t2 http_requests,host=B,status=200 durationSum=23304,requestCount=1 t0 http_requests,host=B,status=200 durationSum=35045,requestCount=2 t1 http_requests,host=B,status=200 durationSum=90713,requestCount=8 t2 Do the same things we did for rate, but limit it to errors
  72. 72. Counters Errors import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> filter(fn: (r) => r.status =~ /5\d{2}|4\d{2}/) |> difference(nonNegative: true) |> v1.fieldsAsCols() |> group() |> aggregateWindow(every: 1m, fn: sum, column: "requestCount")
  73. 73. HTTP API Monitoring - Counters (Duration) http_requests,host=A,status=500 durationSum=12304,requestCount=2 t0 http_requests,host=A,status=500 durationSum=52504,requestCount=5 t1 http_requests,host=A,status=500 durationSum=92307,requestCount=9 t2 http_requests,host=B,status=200 durationSum=23304,requestCount=1 t0 http_requests,host=B,status=200 durationSum=35045,requestCount=2 t1 http_requests,host=B,status=200 durationSum=90713,requestCount=8 t2 For this example, let’s focus on average duration. Conceptually, we want to figure out the total duration of all requests and divide that by the total number of requests
  74. 74. HTTP API Monitoring - Counters (Duration) http_requests,host=A,status=500 durationSum=12304,requestCount=2 t0 http_requests,host=A,status=500 durationSum=52504,requestCount=5 t1 http_requests,host=A,status=500 durationSum=92307,requestCount=9 t2 http_requests,host=B,status=200 durationSum=23304,requestCount=1 t0 http_requests,host=B,status=200 durationSum=35045,requestCount=2 t1 http_requests,host=B,status=200 durationSum=90713,requestCount=8 t2 For this example, let’s focus on average duration 1. compute total duration 2. compute total requests
  75. 75. Remember the Normalization // cpu,host=A usage_idle=11,usage_user=10 from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "cpu")
  76. 76. HTTP API Monitoring - Counters (Duration) http_requests,host=A,status=500,_field=durationSum _value=12304 t0 http_requests,host=A,status=500,_field=requestCount _value=2 t0 http_requests,host=A,status=500,_field=durationSum _value=52504 t1 http_requests,host=A,status=500,_field=requestCount _value=5 t1 http_requests,host=A,status=500,_field=durationSum _value=92307 t2 http_requests,host=A,status=500,_field=requestCount _value=9 t2 ... Compute the rates of the _values for each _field
  77. 77. Counters Duration import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> filter(fn: (r) => r._field == "durationSum" or r._field == "requestCount") |> difference(nonNegative: true) |> aggregateWindow(every: 10s, fn: sum, column: "_value") |> group(columns: ["_field"]) |> aggregateWindow(every: 10s, fn: sum) |> v1.fieldsAsCols() |> map(fn: (r) => ({r with avgDuration: float(v: r.durationSum) / float(v: r.requestCount)}))
  78. 78. HTTP API Monitoring - Counters (Duration) http_requests,host=A,status=500,_field=durationSum delta_value=40200 t1 http_requests,host=A,status=500,_field=requestCount delta_value=3 t1 http_requests,host=A,status=500,_field=durationSum delta_value=39803 t2 http_requests,host=A,status=500,_field=requestCount delta_value=4 t2 ... Sum the values that share a common _field
  79. 79. Counters Duration import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> filter(fn: (r) => r._field == "durationSum" or r._field == "requestCount") |> difference(nonNegative: true) |> aggregateWindow(every: 10s, fn: sum, column: "_value") |> group(columns: ["_field"]) |> aggregateWindow(every: 10s, fn: sum) |> v1.fieldsAsCols() |> map(fn: (r) => ({r with avgDuration: float(v: r.durationSum) / float(v: r.requestCount)}))
  80. 80. HTTP API Monitoring - Counters (Duration) http_requests,host=A,status=500,_field=durationSum sum_delta_value=80003 t1 http_requests,host=A,status=500,_field=requestCount sum_delta_value=7 t1 Pivot the tables
  81. 81. Counters Duration import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> filter(fn: (r) => r._field == "durationSum" or r._field == "requestCount") |> difference(nonNegative: true) |> aggregateWindow(every: 10s, fn: sum, column: "_value") |> group(columns: ["_field"]) |> aggregateWindow(every: 10s, fn: sum) |> v1.fieldsAsCols() |> map(fn: (r) => ({r with avgDuration: float(v: r.durationSum) / float(v: r.requestCount)}))
  82. 82. HTTP API Monitoring - Counters (Duration) http_requests,host=A,status=500 duration=80003,requestCount=7 t1 ... Compute the average duration/requestCount
  83. 83. Counters Duration import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_requests") |> filter(fn: (r) => r._field == "durationSum" or r._field == "requestCount") |> difference(nonNegative: true) |> aggregateWindow(every: 10s, fn: sum, column: "_value") |> group(columns: ["_field"]) |> aggregateWindow(every: 10s, fn: sum) |> v1.fieldsAsCols() |> map(fn: (r) => ({r with avgDuration: float(v: r.durationSum) / float(v: r.requestCount)}))
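The average-duration arithmetic for counters (delta of `durationSum` divided by delta of `requestCount`) can be verified with a short Python sketch. It uses the host=A samples from the slides; `average_durations` is a hypothetical helper, not something from the talk.

```python
# (time index, durationSum, requestCount) for host=A,status=500, from the slides.
samples = [(0, 12304, 2), (1, 52504, 5), (2, 92307, 9)]


def average_durations(samples):
    """Average request duration in each interval between consecutive samples."""
    averages = {}
    for (_, sum0, count0), (t1, sum1, count1) in zip(samples, samples[1:]):
        delta_count = count1 - count0
        if delta_count > 0:  # skip intervals with no new requests
            averages[t1] = (sum1 - sum0) / delta_count
    return averages


per_interval = average_durations(samples)

# Average over the whole range: total duration delta over total request delta.
first, last = samples[0], samples[-1]
overall = (last[1] - first[1]) / (last[2] - first[2])
print(per_interval, overall)
```

Dividing the two deltas gives a per-interval mean without ever storing an individual request, which is the trade-off counters make.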
  84. 84. Monitoring Data Types - Advanced Pull Style Metrics ○ Histograms ■ Used to get cross service quantiles ● Not just mean ■ Creates a sequence of counters for various buckets ● Uses those values to compute approximate quantiles ○ Summaries ■ Used to get quantiles for a single service ■ Directly computes the quantile in service during metric collection
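To make the bucket-counter approach concrete, here is a minimal Python sketch that estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The bucket bounds and counts are invented for illustration, and linear interpolation is one common choice assumed here, not necessarily what any particular system does.

```python
def approximate_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound; the lowest bucket is assumed to start at 0.
    """
    total = buckets[-1][1]
    target = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= target:
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (target - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return buckets[-1][0]


# Hypothetical request-duration buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1.0s.
duration_buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(approximate_quantile(0.5, duration_buckets))
print(approximate_quantile(0.95, duration_buckets))
```

Because bucket counters are plain counters, they can be summed across instances before the quantile is estimated; the price is that accuracy depends on how well the bucket boundaries fit the data.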
  85. 85. © 2020 InfluxData. All rights reserved. 85 Which should I use? Comes down to cardinality and available resources ● Server Side Application ○ Pull style metrics ● Analytics in a frontend application ○ Derived/Raw Events ● IoT Device ○ Derived/Raw Events
  86. 86. 😅
  87. 87. © 2020 InfluxData. All rights reserved. 87 How does this all connect back? ● RED style metrics for understanding our service ○ Define SLIs and SLOs from these ● USE style metrics for understanding a component of our system ○ Get insight into the internal bottlenecks & issues within our system
  88. 88. Availability SLI import "influxdata/influxdb/v1" all = from(bucket: "bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() |> filter(fn: (r) => r.event == "request finished") errors = all |> filter(fn: (r) => r.status =~ /5\d{2}|4\d{2}/) |> group() |> aggregateWindow(every: 30s, fn: count, column: "event") rate = all |> group() |> aggregateWindow(every: 30s, fn: count, column: "event") join(tables: {errors: errors, rate: rate}, on: ["_time"]) |> filter(fn: (r) => r.event_rate != 0) |> map(fn: (r) => ({r with availability: 100.0 - 100.0 * float(v: r.event_errors) / float(v: r.event_rate)}))
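The availability formula at the end of this query is simple enough to check by hand. A minimal Python sketch, where `availability_percent` is a hypothetical helper and returning `None` for an empty window is an assumption standing in for the query's zero-rate filter:

```python
def availability_percent(total_requests, error_requests):
    """Percentage of finished requests in a window that did not end in an error."""
    if total_requests == 0:
        return None  # no traffic in the window, so availability is undefined
    return 100.0 - 100.0 * error_requests / total_requests


# With the four finished requests from the earlier slides, two of which returned 500:
print(availability_percent(4, 2))
```

An SLO is then a threshold on this number over some period, for example a target percentage of windows in which availability stays above a chosen value.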
  89. 89. Request > 15s SLI import "influxdata/influxdb/v1" base = from(bucket: "other") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "http_event") |> v1.fieldsAsCols() start = base |> filter(fn: (r) => r.event == "request started") finish = base |> filter(fn: (r) => r.event == "request finished") all = join(tables: {start: start, finish: finish}, on: ["request_id"]) |> map(fn: (r) => ({r with durationNs: int(v: r._time_finish) - int(v: r._time_start)})) |> map(fn: (r) => ({r with durationS: float(v: r.durationNs)/1000000000.0})) total = all |> group() |> count(column: "durationS") |> set(key: "join", value: "hack") greater_than = all |> filter(fn: (r) => r.durationS >= 15.0) |> group() |> count(column: "durationS") |> set(key: "join", value: "hack") join(tables: {total: total, greater_than: greater_than}, on: ["join"]) |> drop(columns: ["join"]) |> map(fn: (r) => ({r with _value: 100.0 * float(v: r.durationS_greater_than) / float(v: r.durationS_total) })) |> map(fn: (r) => ({r with _value: 100.0 - r._value}))
  90. 90. 🤔
  91. 91. USE - Understanding Mutex Contention in HTTP server
  92. 92. © 2020 InfluxData. All rights reserved. 92 Mutex Monitoring - Pull Style mutex,host=A,id=1 acquireDurationSum=12,busySum=10,totalCount=1 t0 mutex,host=A,id=1 acquireDurationSum=18,busySum=19,totalCount=3 t2 mutex,host=A,id=1 acquireDurationSum=29,busySum=29,totalCount=7 t3 mutex,host=A,id=1 acquireDurationSum=35,busySum=51,totalCount=9 t4 mutex,host=B,id=1 acquireDurationSum=10,busySum=21,totalCount=3 t5 mutex,host=B,id=1 acquireDurationSum=10,busySum=60,totalCount=9 t6
  93. 93. Utilization import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "mutex") |> filter(fn: (r) => r._field == "busySum" or r._field == "totalCount") |> difference(nonNegative: true) |> aggregateWindow(every: 10s, fn: sum, column: "_value") |> group(columns: ["id", "_field"]) |> aggregateWindow(every: 10s, fn: sum) |> v1.fieldsAsCols() |> map(fn: (r) => ({r with avgBusyDuration: float(v: r.busySum) / float(v: r.totalCount)}))
  94. 94. Saturation import "influxdata/influxdb/v1" from(bucket: "my_bucket") |> range(start: -5m) |> filter(fn: (r) => r._measurement == "mutex") |> filter(fn: (r) => r._field == "acquireDurationSum" or r._field == "totalCount") |> difference(nonNegative: true) |> aggregateWindow(every: 10s, fn: sum, column: "_value") |> group(columns: ["id", "_field"]) |> aggregateWindow(every: 10s, fn: sum) |> v1.fieldsAsCols() |> map(fn: (r) => ({r with avgAcquireDuration: float(v: r.acquireDurationSum) / float(v: r.totalCount)}))
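The two mutex queries apply the same counter arithmetic as the HTTP example: divide the delta of a sum counter by the delta of `totalCount`. The sketch below works through the host=A samples from the mutex slide in Python; the helper name is hypothetical, and the slides do not state the time unit of the sums.

```python
# (acquireDurationSum, busySum, totalCount) for mutex id=1 on host=A, from the slide.
samples = [(12, 10, 1), (18, 19, 3), (29, 29, 7), (35, 51, 9)]


def per_acquisition_averages(samples):
    """For each interval, return (avg time held, avg time spent waiting to acquire)."""
    results = []
    for (acq0, busy0, n0), (acq1, busy1, n1) in zip(samples, samples[1:]):
        acquisitions = n1 - n0
        if acquisitions <= 0:
            continue  # no new acquisitions, or a counter reset
        avg_busy = (busy1 - busy0) / acquisitions
        avg_acquire = (acq1 - acq0) / acquisitions
        results.append((avg_busy, avg_acquire))
    return results


averages = per_acquisition_averages(samples)
print(averages)
```

In the talk's framing, a rising average hold time points at utilization of the lock, while a rising average wait to acquire it points at saturation.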
  95. 95. © 2020 InfluxData. All rights reserved. 95 Conclusion ● Be methodical about instrumenting your system ○ USE/RED methods can be helpful tools ● Reason about trade-offs with the different styles of metrics ● Flux doesn’t dictate which metric style you have to use
  96. 96. The End!
