Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Application Metrics (with Prometheus examples) #PHPDD18

305 views

Published on

We all know not to poke at alien life forms in another planet, right? But what about metrics, do you know how to pick, measure and draw conclusions from them? In this talk we will cover various Site Reliability Engineering topics, such as SLIs and SLOs while we explore real life examples of defining and implementing metrics in a system with examples using Prometheus, an open-source system monitoring and alert platform, to demonstrate implementation. Let's get back to some real science.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Application Metrics (with Prometheus examples) #PHPDD18

  1. 1. Application Metrics with Prometheus examples Rafael Dohms @rdohms
  2. 2. How do you do metrics?
  3. 3. “The Prometheus 
 Scientist Method”
  4. 4. I hope not.
  5. 5. jobs.usabilla.com Rafael Dohms Staff Engineer rdohmsdoh.ms
  6. 6. FeedbackFeedback jobs.usabilla.com Rafael Dohms Staff Engineer rdohmsdoh.ms We are hiring!
 jobs.usabilla.com
  7. 7. Let’s talk about metrics. 
 
 But let’s do it with a concrete example.
  8. 8. Kafka / DDD / Autonomous Microservices / Monitoring
  9. 9. Kafka / DDD / Autonomous Microservices / Monitoring
  10. 10. Kafka / DDD / Autonomous Microservices / Monitoring
  11. 11. Metrics are insights into the current state of your application.
  12. 12. Metrics tell you if your service is healthy.
  13. 13. Canary Deploys OksanaLatysheva
  14. 14. Metrics tell you what is wrong.
  15. 15. Metrics tell you what is right.
  16. 16. Metrics tell you what will soon be wrong.
  17. 17. Metrics tell you where to start looking.
  18. 18. Site Reliability Engineering
  19. 19. SLIs SLOs ◎ SLAs
  20. 20. SLIs Service Level Indicators “A quantitative measure of some aspect of your application” The response time of a request was 150ms Source: Site Reliability Engineering - O’Reilly
  21. 21. SLOs ◎ Service Level Objectives “A target value or a range of values for something measured by an SLI” Request response times should be below 200ms Source: Site Reliability Engineering - O’Reilly
  22. 22. Help you drive architectural decisions, like optimisation SLOs ◎ Response time SLO: 150 ms
 95th Percentile of Processing time (PHP time): 5ms
 
 As a result we decided to invest more time in exploring the problem domain and not optimising our stack.
  23. 23. SLAs Service Level Agreements “An explicit or implicit contract with your customer,that includes consequences of missing their SLOs” The 99th percentile of requests response times should meet our SLO,or we will refund users Source: Site Reliability Engineering - O’Reilly
  24. 24. Measuring
  25. 25. –Etsy Engineering “If it moves, we track it.” https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
  26. 26. Metrics Statistics What is happening right now? How often does this happen? Telemetry
  27. 27. Telemetry “the process of recording and transmitting the readings of an instrument”
  28. 28. Statistics / Analytics “the practice of collecting and analysing numerical data in large quantities”
  29. 29. Statistics / Analytics “the practice of collecting and analysing numerical data in large quantities”
  30. 30. I really miss Ayrton Senna Statistics / Analytics “the practice of collecting and analysing numerical data in large quantities”
  31. 31. Statistics Incoming feedback items with origin information Telemetry response time of public endpoints
  32. 32. “If it moves, we track it.”
  33. 33. Request Latency System Throughput Error Rate Availability Resource Usage “If it moves, we track it.”
  34. 34. Request Latency System Throughput Error Rate Availability Resource Usage “If it moves, we track it.” Incoming Data Peak frequency CPU Memory Disk Space Bandwith node PHP NginX Database
  35. 35. Request Latency System Throughput Error Rate Availability Resource Usage “If it moves, we track it.” Incoming Data Peak frequency CPU Memory Disk Space Bandwith node PHP NginX Database Measure Monitoring Measure measurements
  36. 36. Metrics,Everywhere.
  37. 37. SLIs
  38. 38. Picking good SLIs
  39. 39. SLIs may change according to who is looking at the data.
  40. 40. Understanding the nature of your system
  41. 41. User-Facing 
 serving system? availability,throughput,latency
  42. 42. Storage System? availability,durability,latency
  43. 43. Big Data Systems? throughput,end-to-end latency
  44. 44. User-Facing and Big Data Systems
  45. 45. ๏SLIs - Response time in the“receive”endpoint - Turn around time,from“receive” to“show”. - Individual processing time per step - Data counting: how many,what nature User-Facing and Big Data Systems
  46. 46. ๏SLIs - Response time in the“receive”endpoint - Turn around time,from“receive” to“show”. - Individual processing time per step - Data counting: how many,what nature User-Facing and Big Data Systems More relevant to development team
  47. 47. ๏SLIs - Response time in the“receive”endpoint - Turn around time,from“receive” to“show”. - Individual processing time per step - Data counting: how many,what nature ๏Other Metrics - node,nginx,php-fpm,java metrics - server metrics: cpu,memory,disk space - Size of cluster - Kafka health User-Facing and Big Data Systems More relevant to development team
  48. 48. ๏SLIs - Response time in the“receive”endpoint - Turn around time,from“receive” to“show”. - Individual processing time per step - Data counting: how many,what nature ๏Other Metrics - node,nginx,php-fpm,java metrics - server metrics: cpu,memory,disk space - Size of cluster - Kafka health User-Facing and Big Data Systems More relevant to development team More relevant to Infrastructure team
  49. 49. Picking Targets
  50. 50. Target value SLI value >= target Target Range lower bound <= SLI value <= upper bound
  51. 51. Don’t pick a target based on current performance What is the business need? What are users trying to achieve? How much impact does it have on the user experience?
  52. 52. How long can it take between the user clicking submit and a confirmation that our servers received the data?
  53. 53. How long can it take between the user clicking submit and a confirmation that our servers received the data? “Immediate" “We sell as real time” “500ms,too much HTML“ “I don’t know”
  54. 54. How long can it take between the user clicking submit and a confirmation that our servers received the data? “Immediate" “We sell as real time” “500ms,too much HTML“ “I don’t know” What is human perception of immediate? 100ms Collection API should respond within 150ms
  55. 55. Some, but not too many. can you settle an argument or priority based on it?
  56. 56. Don’t over achieve. The Chubby example.
  57. 57. Adapt. Evolve. re-define SLO’s as your product evolves.
  58. 58. Meeting Expectations.
  59. 59. Attach consequences to your Objectives.
  60. 60. The night is dark and full of loopholes. take a friend from legal with you.
  61. 61. Safety Margins. like setting the alarm 5 minutes before the meeting.
  62. 62. Metrics in Practice.
  63. 63. prometheus.io
  64. 64. Push Model scale this!
  65. 65. Pull Model scale this!
  66. 66. Prometheus Telemetry Statistics Prometheus StatsD,InfluxDB,etc… + Long Term Storage
  67. 67. GaugeHistogramCounter Summary Cumulative metric the represents a single number that only increases Samples and count of observations over time A counter,that can go up or down Same as a histogram but with stream of quantiles over a sliding window.
  68. 68. jimdo/prometheus_client_php
  69. 69. reads from /metrics reads from local storage writes to local storage your code /metrics
  70. 70. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']);
  71. 71. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']);
  72. 72. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']); APC / APCu Redis
  73. 73. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']); namespace metric name help label names buckets
  74. 74. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']); measurement label values
  75. 75. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']); namespace metric name help labels
  76. 76. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']);
  77. 77. <?php use PrometheusCounter; use PrometheusHistogram; use PrometheusStorageAPC;
 require_once 'vendor/autoload.php'; $adapter = new APC(); $histogram = new Histogram( $adapter, 'my_app', 'response_time_ms', 'This measures ....', ['status', 'url'], [0, 10, 50, 100] ); $histogram->observe(15, ['200', '/url']); $counter = new Counter($adapter, 'my_app', 'count_total', 'How many...', ['status', 'url']); $counter->inc(['200', '/url']); $counter->incBy(5, ['200', '/url']);
  78. 78. <?php use PrometheusRenderTextFormat; use PrometheusStorageAPC; require_once 'vendor/autoload.php'; $adapter = new APC(); $renderer = new RenderTextFormat(); $result = $renderer->render($adapter->collect()); echo $result;
  79. 79. <?php use PrometheusRenderTextFormat; use PrometheusStorageAPC; require_once 'vendor/autoload.php'; $adapter = new APC(); $renderer = new RenderTextFormat(); $result = $renderer->render($adapter->collect()); echo $result;
  80. 80. <?php use PrometheusRenderTextFormat; use PrometheusStorageAPC; require_once 'vendor/autoload.php'; $adapter = new APC(); # HELP my_app_count_total How many... # TYPE my_app_count_total counter my_app_count_total{status="200",url="/url"} 6 # HELP my_app_response_time_ms This measures .... # TYPE my_app_response_time_ms histogram my_app_response_time_ms_bucket{status="200",url="/url",le="0"} 0 my_app_response_time_ms_bucket{status="200",url="/url",le="10"} 0 my_app_response_time_ms_bucket{status="200",url="/url",le="50"} 1 my_app_response_time_ms_bucket{status="200",url="/url",le="100"} 1 my_app_response_time_ms_bucket{status="200",url="/url",le="+Inf"} 1 my_app_response_time_ms_count{status="200",url="/url"} 1 my_app_response_time_ms_sum{status="200",url="/url"} 16 $renderer = new RenderTextFormat(); $result = $renderer->render($adapter->collect()); echo $result;
  81. 81. –Also Rafael (today) “I’ll just try this live demo again.” http://localhost:9090/graph http://localhost:8180/metrics –Rafael (yesterday) “Demos always fail.” http://localhost:8180/index https://github.com/rdohms/talk-app-metrics
  82. 82. You can’t act on what you can’t see.
  83. 83. Metrics without actionability are just numbers on a screen.
  84. 84. Act as soon as an 
 SLO is threatened .
  85. 85. Thank you. Drop me some 
 feedback at Usabilla 
 and make this talk 
 better. @rdohms
 http://slides.doh.ms https://joind.in/talk/56e55

×