Observability in a Dynamically Scheduled World

The industry is moving toward a microservices architecture, and many companies have embraced container orchestration solutions such as Kubernetes. DigitalOcean is no different. Over the past year, DigitalOcean’s Delivery team has been building a runtime platform based on Kubernetes with the goal of making shipping code easier. The system has empowered service owners to quickly and efficiently deploy and update their applications. A vital component is a white box monitoring and alerting solution based on Prometheus and Alertmanager.

Sneha Inguva offers an overview of the system and shares problems encountered, potential solutions, and key lessons learned in the process. Sneha dives into the setup of Prometheus and Alertmanager that allows service owners to instrument their own metrics and alerts, explaining the service owner’s point of view and the internals that allow for the dynamic addition of alerts, and offers a glimpse of future modifications to the system. Join in to learn how to leverage open source tools for your monitoring and alerting needs.

Published in: Engineering

Observability in a Dynamically Scheduled World

  1. @snehainguva
  2. digitalocean.com frustrating deployment and update process
  3. container orchestration solutions with a bazillion features
  4. monitoring dynamically changing services
  5. observability in a dynamically scheduled world
  6. leveraging prometheus and alertmanager for cluster monitoring and alerting
  7. about me: software engineer @DigitalOcean; former delivery, currently observability; kubernetes, prometheus
  8. the plan: ● the “olden” days vs. container orchestration ● docc at DigitalOcean ● prometheus + alertmanager and docc ● alerting in action: examples ● potential pitfalls ● next steps
  9. the “olden” days: service owners write an application; provision a server with chef or ansible; use a CI/CD pipeline, bash scripts, or other tools to deploy and update application on VM
  10. the “olden” days: use nagios + various plugins to monitor; use collectd + statsd + graphite/carbon
  11. the “olden” days: longer to provision than write actual service; hard to set up monitoring services; blackbox monitoring NOT insightful; whitebox monitoring services NOT easily queryable
  12. docc: Digital Ocean Command Center, a tool for deploying containerized, stateless applications
  13. post-docc: abstraction layer on top of kubernetes; deployments and updates take minutes, not hours; easy-to-use CLI
  14. Source: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture.png
  15. CLI; DOCCSERVER; daemonset → pods; service
  16. post-docc: service owners write an application; service owner dockerizes application; describe application in json manifest file; deploy!
  17. post-docc: view running applications; get application logs; easily scale, update, or restart applications
  18. But what about monitoring?
  19. Let’s use prometheus + alertmanager
  20. Source: https://github.com/prometheus/prometheus
  21. Why use prometheus and alertmanager?
  22. easy to deploy
  23. flexible and extensible
  24. labelling works well with kubernetes primitives
  25. complementary kubernetes service discovery
  26. low-level metrics leveraged via a strong query language
  27. counters; gauges; histograms; summaries
  28. alertmanager: easily deployed alongside prometheus; dedupes alerts; high availability configuration; multiple receiver options
  29. any downsides? push vs. pull model; service owner must instrument application
  30. putting it all together
  31. instrument your application: use prometheus golang client; expose metrics endpoint
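Slide 31 can be illustrated with a rough sketch of what "expose metrics endpoint" means in practice. The deck uses the Prometheus golang client; the version below deliberately swaps that for the Go standard library and writes the text exposition format by hand, so it shows the idea rather than the client's API. The metric name `app_requests_total`, the handler paths, and the helper names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal is a hypothetical counter: it only ever increases.
var requestsTotal atomic.Uint64

// renderMetrics formats a counter value in the Prometheus text exposition format.
func renderMetrics(n uint64) string {
	return fmt.Sprintf("# HELP app_requests_total Total requests handled.\n"+
		"# TYPE app_requests_total counter\n"+
		"app_requests_total %d\n", n)
}

// newMux wires the application handler and the metrics endpoint together.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.Add(1) // count every handled application request
		fmt.Fprintln(w, "hello")
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
	})
	return mux
}

func main() {
	// A throwaway test server stands in for the deployed application.
	srv := httptest.NewServer(newMux())
	defer srv.Close()

	// Serve one application request, then scrape the metrics endpoint.
	for _, path := range []string{"/", "/metrics"} {
		resp, err := http.Get(srv.URL + path)
		if err != nil {
			panic(err)
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("GET %s\n%s", path, body)
	}
}
```

In a real service the client library would handle registration, labels, and the other metric types; the hand-rolled formatter is only there to keep the sketch dependency-free.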
  32. specify metrics, ports, alerts in your manifest file: Which metrics endpoint should be scraped? Which container port needs to be exposed? Specify alerting rule, duration interval, and channel.
  33. use docc CLI to deploy your application: CLI; DOCCSERVER; $ docc deploy manifest.json
  34. daemonset → pods; service; promconfig; alertconfig; alertmanager; docc
  35. prometheus talks to the kubernetes api and grabs the metrics endpoint and port information; promconfig; service
  36. promconfig grabs alert information and rewrites prometheus rules file; promconfig; service
  37. alertconfig grabs alert routes and rewrites alertmanager configuration file; service; alertmanager; alertconfig
  38. Some stats
  39. 300+ production applications; 1.5 million+ timeseries; 100+ prometheus alerts
  40. What should we monitor?
  41. counters: cumulative, increasing metric; gauges: single metric that goes up or down; histograms: samples and buckets observations; summaries: also samples observations but can calculate things like quantiles
  42. latency - histogram + summaries; traffic - counters + rate(); error - counters + rate(); saturations - gauge
  43. R - request rate; E - error rate; D - duration
  44. U - utilization; S - saturation; E - error rate
  45. cluster CPU reservation; node memory utilization; loadbalancer connection error rate; service http request duration; real-life examples
  46. How should we alert?
  47. State-based alerts: Is there a divergence between expected state and actual state of a service?
  48. State-based alerts: Is my service up and/or scrapeable? absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0 Do I have the # of loadbalancers I expect? sum(up{kubernetes_name="loadbalancer"}) < 3
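The two state-based checks on slide 48 can be restated as plain functions, which may make the logic easier to follow. This is only a sketch of the decision being made, not of how Prometheus evaluates rules; the function names and the slice-of-samples input are assumptions for illustration.

```go
package main

import "fmt"

// sumUp adds the 0/1 "up" samples reported for a service's scrape targets.
func sumUp(up []float64) float64 {
	total := 0.0
	for _, v := range up {
		total += v
	}
	return total
}

// serviceDown mirrors "absent(up{...}) or sum(up{...}) == 0".
// In PromQL, sum() over no series returns nothing, so the comparison alone
// would never fire for a service that has vanished; absent() covers that case.
func serviceDown(up []float64) bool {
	if len(up) == 0 {
		return true // no series at all: the service is not being scraped
	}
	return sumUp(up) == 0
}

// tooFewInstances mirrors "sum(up{...}) < N" for an expected instance count.
// Unlike the PromQL expression, it also returns true when there are no series.
func tooFewInstances(up []float64, expected int) bool {
	return sumUp(up) < float64(expected)
}

func main() {
	fmt.Println("doccserver down:", serviceDown([]float64{0, 0}))
	fmt.Println("loadbalancers short:", tooFewInstances([]float64{1, 1, 0}, 3))
}
```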
  49. Threshold alerts: Do any of our measured metrics exceed a lower or upper bound?
  50. Threshold alerts: Is our loadbalancer at 50% capacity in terms of sessions? max(haproxy_frontend_current_sessions / haproxy_frontend_limit_sessions) BY (kubernetes_node_name, frontend) * 100 > 50 Are 50 percent of tests taking longer than 10 minutes? max(test_duration_seconds{quantile="0.5",result="pass"}) BY (test_name) > 600
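As a rough illustration of the first threshold query on slide 50, the sketch below computes session utilization per (node, frontend) pair, keeps the maximum for each pair, and reports only the pairs above the threshold. The `frontend` type, field names, and sample values are assumptions invented for this example.

```go
package main

import "fmt"

// frontend is one hypothetical loadbalancer frontend sample.
type frontend struct {
	node, name     string
	current, limit float64 // current sessions and configured session limit
}

// overThreshold groups samples by node and frontend name, keeps the highest
// utilization percentage per group, and returns the groups above percent.
func overThreshold(samples []frontend, percent float64) map[string]float64 {
	worst := map[string]float64{}
	for _, f := range samples {
		if f.limit <= 0 {
			continue // skip samples without a usable limit
		}
		key := f.node + "/" + f.name
		util := f.current * 100 / f.limit
		if util > worst[key] {
			worst[key] = util
		}
	}
	for key, util := range worst {
		if util <= percent {
			delete(worst, key) // below the bound: nothing to alert on
		}
	}
	return worst
}

func main() {
	samples := []frontend{
		{"node-a", "http", 600, 1000},
		{"node-a", "http", 300, 1000},
		{"node-b", "http", 100, 1000},
	}
	for key, util := range overThreshold(samples, 50) {
		fmt.Printf("%s is at %.0f%% of its session limit\n", key, util)
	}
}
```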
  51. Common pitfalls
  52. Pitfall #1: Alerting fatigue
  53. Solution: Slack and/or Pagerduty: send only the most urgent, production alerts to pagerduty; try out different promQL queries to have less spikey metrics
  54. Solution: Dedupe and group alerts
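Slide 54 names deduplication and grouping as a remedy for alert fatigue. The sketch below shows the general idea only: identical alerts collapse into one, and the survivors are bucketed by a chosen set of labels so that one notification can cover the whole bucket. It is not Alertmanager's implementation; the `alert` type, `fingerprint`, and `groupAlerts` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// alert is a hypothetical firing alert with its label set.
type alert struct {
	name   string
	labels map[string]string
}

// fingerprint builds a stable identity from the alert name and sorted labels.
func fingerprint(a alert) string {
	keys := make([]string, 0, len(a.labels))
	for k := range a.labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(a.name)
	for _, k := range keys {
		b.WriteString("," + k + "=" + a.labels[k])
	}
	return b.String()
}

// groupAlerts drops exact duplicates, then buckets alerts by the groupBy labels.
func groupAlerts(alerts []alert, groupBy []string) map[string][]alert {
	seen := map[string]bool{}
	groups := map[string][]alert{}
	for _, a := range alerts {
		fp := fingerprint(a)
		if seen[fp] {
			continue // same alert reported twice, e.g. by duplicate promethei
		}
		seen[fp] = true
		parts := make([]string, len(groupBy))
		for i, k := range groupBy {
			parts[i] = k + "=" + a.labels[k]
		}
		key := strings.Join(parts, ",")
		groups[key] = append(groups[key], a)
	}
	return groups
}

func main() {
	alerts := []alert{
		{"HighLatency", map[string]string{"service": "api", "pod": "a"}},
		{"HighLatency", map[string]string{"service": "api", "pod": "a"}},
		{"HighLatency", map[string]string{"service": "api", "pod": "b"}},
		{"ServiceDown", map[string]string{"service": "web", "pod": "c"}},
	}
	for key, group := range groupAlerts(alerts, []string{"service"}) {
		fmt.Printf("%s: %d alert(s) in one notification\n", key, len(group))
	}
}
```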
  55. Pitfall #2: Confused service owners
  56. Solution: Docs and suggested alerts: extensive documentation and tutorials, accessible from CLI; prometheus slack channel for real-time help; standard alert examples
  57. Pitfall #3: Who owns what?
  58. Solution: opinionated manifest file: service owners must include maintainer information; alerts themselves include descriptions and summaries with several labels; alerts must include team-specific receivers
  59. Pitfall #4: Monitoring the monitors
  60. Solution: Duplicate promethei and HA alertmanager; alertmanager; alertmanager; alertmanager
  61. Solution: Deadman’s switch: ALERT JustKeepSwimming IF vector(1)
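The rule on slide 61 fires permanently, since vector(1) always returns a value. The useful signal is therefore its absence: if the always-firing alert stops arriving, something in the monitoring pipeline is broken. The deck does not say what receives the alert, so the sketch below is a hypothetical receiver-side watchdog; the `deadman` type, its method names, and the two-minute timeout are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// deadman tracks the last time the always-firing alert was received.
type deadman struct {
	lastBeat int64 // unix seconds of the most recent heartbeat
	timeout  int64 // seconds of silence tolerated before raising the alarm
}

// beat records that the heartbeat alert arrived at the given time.
func (d *deadman) beat(now int64) {
	d.lastBeat = now
}

// expired reports whether the heartbeat has been silent for too long.
func (d *deadman) expired(now int64) bool {
	return now-d.lastBeat > d.timeout
}

func main() {
	d := &deadman{timeout: 120}
	now := time.Now().Unix()
	d.beat(now)

	// Shortly after a heartbeat the pipeline is considered healthy.
	fmt.Println("expired after 60s: ", d.expired(now+60))
	// After prolonged silence the watchdog should page someone.
	fmt.Println("expired after 300s:", d.expired(now+300))
}
```

Running the watchdog outside the cluster being monitored keeps it independent of the failure it is meant to detect.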
  62. digitalocean.com
  63. #1: Automated alerts: utilize user-defined memory and cpu limits for threshold alerts; automatic state-based alerts
  64. #2: Leverage metrics for autopilot: user trusts in our custom controllers and schedulers; collect metrics and build model about resource usage over time; accordingly adjust limits and alerts
  65. #3: Leverage metrics for autoscaling: services based on resource usage, # connections, etc.; loadbalancers based on # of frontend and backend connections; # of worker nodes based on memory and cpu capacity metrics
  66. a brave new world of container orchestration; OSS whitebox monitoring; extensibility
  67. sources: ● The best prometheus tutorials you will ever read, Julius Volz ● Actual Prometheus Website, Julien Friedman ● Kubernetes Project
