
Scaling Prometheus on Kubernetes with Thanos

My talk from the Kubernetes Manchester meetup (6th Dec 2018) on how we are using Thanos to scale Prometheus on Kubernetes at BookingGo.


  1. Scaling Prometheus on Kubernetes – Tom Riley @ Booking.com
  2. BookingGo.Cloud • Kubernetes Delivery Platform • Self Service for Development Teams • Everything as Code • 100% Customer Focused & 100% Business Value • Cloud Native • Learn safely in Production • Public Cloud • We ❤️ Open Source
  3. BookingGo.Cloud Infrastructure
  4. BookingGo.Cloud Infrastructure
  5. BookingGo.Cloud Environments • Dev.. • Test.. • Production.. • Tooling.. • ..plus multiple regions! • 10 Kubernetes clusters in total and more in the pipeline!
  6. What are we doing with Observability? Past.. • Delivered Logging & Events on Kubernetes using the Elastic Stack Present.. • Deliver a product around Time Series Metrics that is suitable for BookingGo.Cloud, including an alerting-as-code feature • Continuously evolve and update our BookingGo Monitoring & Observability defaults • Deliver a learning path around Observability, helping users onboard to BookingGo.Cloud and further extend their knowledge via workshops and documentation Future.. • OpenTracing for BookingGo.Cloud • Continue evolving Observability culture
  8. Time Series Metrics Project Goals • Provide engineer-friendly tooling and instrumentation libraries • Low-cardinality monitoring, but one datastore that fits all contexts • First-class API support; no vendor lock-in, open source • Single pane of glass for Monitoring • Monitoring as code; Kubernetes-native experience • Provide a consistent mechanism for Alerting based on Metrics • Reboot the monitoring culture at BookingGo: Monitoring & Observability as part of the application development lifecycle
  9. Prometheus: Kubernetes Infrastructure and Application Monitoring with Prometheus
  10. Prometheus – What is it? • Prometheus is a metrics-oriented Monitoring solution (TSDB & Tooling) • Created at SoundCloud in 2012 • The Prometheus project joined the Cloud Native Computing Foundation in 2016 • During 2018, it became the second project to graduate from incubation, after Kubernetes
  11. Prometheus – What is it? (Architecture diagram: Service Discovery feeding Prometheus, which scrapes Applications and Exporters and feeds Alertmanager and Grafana)
  12. Prometheus - Day One • Deployed the kube-prometheus example to all of our K8 clusters • Each cluster then had a single Prometheus instance and a Grafana front end • Encouraged development teams to start exposing Prometheus metrics from day one • An opportunity to see if Prometheus was the right technology for us with very little upfront investment required – learning safely in production! • Will the development teams get value from it? • Do we feel the technology fits within Kubernetes? bit.ly/2S6Lmq0
  13. Prometheus - Day One Learnings: Happy Development Teams!
  14. Prometheus - Day One Learnings: Prometheus ❤️ Kubernetes
  15. Kubernetes Prometheus Operator • Defines Custom Resource Definitions (CRDs) for deploying and configuring Prometheus & Alertmanager • As simple as: • Deploy the operator to your Kubernetes cluster • Start deploying the CRD objects to define your Prometheus setup • The operator launches Prometheus pods automatically based on the CRD configuration
  16. Kubernetes Prometheus Operator – Deploy Prometheus (manifest shown as an image; sketch below) bit.ly/2R7ohn8
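  As a stand-in for the image, a minimal sketch of a Prometheus custom resource for the operator; the name, namespace, label, and sizing are illustrative:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
      namespace: monitoring
    spec:
      replicas: 2                      # HA pair; the operator manages the StatefulSet
      serviceAccountName: prometheus
      serviceMonitorSelector:
        matchLabels:
          team: bookinggo              # hypothetical label; selects ServiceMonitors to scrape
      resources:
        requests:
          memory: 400Mi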
  17. Kubernetes Prometheus Operator – Configure Prometheus Targets (manifest shown as an image; sketch below) bit.ly/2R7ohn8
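  Again as a stand-in for the image, a minimal sketch of a ServiceMonitor object, which tells the operator-managed Prometheus what to scrape; names and labels are illustrative:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: example-app
      namespace: monitoring
      labels:
        team: bookinggo                # matched by the serviceMonitorSelector above
    spec:
      selector:
        matchLabels:
          app: example-app             # scrape Services carrying this label
      endpoints:
        - port: web                    # named port on the Service
          path: /metrics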
  18. Next Steps.. • We decided to continue ahead with the use of Prometheus • But we had identified a number of challenges.. 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  19. Next Steps.. • What are the common patterns for tackling these problems? • How did we approach this? • We keep a close eye on sources of information: blogs, tech talks on YouTube, KubeCon/PromCon videos, etc. • We attended conferences to learn from others! • Read documentation and best practices • Keep a close eye on new and evolving projects on GitHub, etc.
  20. Highly Available Prometheus (diagram: a single Prometheus instance (x1) scraping targets)
  21. Highly Available Prometheus (diagram: two Prometheus instances (x2) scraping the same targets, twice – Highly Available!)
  22. Highly Available Prometheus Challenges: • We now have two sources of duplicate metrics! • Well, so-called duplicates – the metrics will vary slightly between the two! • Which do we use?
  23. Highly Available Prometheus: use a Load Balancer in front of the pair
  24. Highly Available Prometheus: could use something like HAProxy
  25. Highly Available Prometheus: use a Kubernetes Service when running in K8
  26. Highly Available Prometheus (Kubernetes Service) – Not without its challenges: • When you refresh the data, you will see it change, as metrics will potentially differ between the two instances
  27. Highly Available Prometheus (Kubernetes Service) – Not without its challenges: • When you refresh the data, you will see it change, as metrics will potentially differ between the two instances • Use sticky load balancing or make the second instance a hot standby (see the sketch below) • This solution is becoming complicated and does not scale with query load
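  For the sticky option on Kubernetes, session affinity on the Service is one way to pin a client to a single replica between refreshes. A minimal sketch, with illustrative names:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      selector:
        app: prometheus                # matches both replicas of the HA pair
      sessionAffinity: ClientIP        # route a given client to the same pod
      ports:
        - name: web
          port: 9090
          targetPort: 9090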
  28. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  29. Federated Prometheus: scrape metrics at /federate into a centralized Prometheus instance
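  A minimal sketch of the federation scrape job on the central Prometheus; the match[] selector and target address are illustrative:

    scrape_configs:
      - job_name: federate
        honor_labels: true             # keep the labels set by the edge instance
        metrics_path: /federate
        params:
          'match[]':
            - '{job=~".+"}'            # which series to pull; easy to get wrong or forget
        static_configs:
          - targets:
              - prometheus-cluster-a:9090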
  30. Federated Prometheus: add Grafana.. Single Pane of Glass!!
  31. Federated Prometheus – Also not without its challenges.. • Duplicating metrics is costly • You have to configure which metrics you wish to federate, and this is easily forgotten • Single point of failure
  32. Prometheus for Practitioners @ Monitorama EU 2018 Slides: https://bit.ly/2AqB11d Monitorama Talk: https://vimeo.com/289893972
  33. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  34. Long Term Storage (diagram: each Prometheus instance with its own local storage)
  35. Long Term Storage • Prometheus was initially designed for short metrics retention; it was designed for monitoring & alerting on what is happening ‘now’ • Local storage can be expensive, especially if using SSDs • We wanted to store years of metrics – will this scale efficiently with Prometheus?
  36. Long Term Storage – Remote write/read API • Prometheus has remote storage APIs • Concerns around the complexity of operating Elasticsearch or similar alongside Prometheus https://bit.ly/2zt5try
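  For reference, remote storage is wired in via the Prometheus configuration; a minimal sketch, with an illustrative adapter endpoint:

    remote_write:
      - url: http://remote-storage-adapter:9201/write
    remote_read:
      - url: http://remote-storage-adapter:9201/read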
  37. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  38. Hello, Thanos
  39. Thanos – What is it? “Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity”
  40. Thanos – What is it? Developed and open-sourced by engineers at London-based Improbable github.com/improbable-eng/thanos 619 commits, 2.3k GitHub stars, 50 contributors
  41. Thanos – What does it do? • Designed to work in Kubernetes, supported by the Prometheus Operator • Global querying view across all connected Prometheus servers • Deduplication and merging of metrics collected from Prometheus HA pairs • Seamless integration with existing Prometheus setups • Any object storage as its only, optional dependency • Downsampling historical data for massive query speedup • Cross-cluster federation • Fault-tolerant query routing • Simple gRPC “Store API” for unified data access across all metric data • Easy integration points for custom metric providers https://bit.ly/2KCAWfB
  42. Challenges – Thanos helps to tackle all of these problems in a different way.. 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  43. HA Prometheus with Thanos (diagram: the HA Prometheus pair scraping targets)
  44. HA Prometheus with Thanos 1. The Thanos sidecar is deployed alongside Prometheus in the Kubernetes Pod using the operator 2. Thanos Query makes gRPC calls to the Thanos sidecars for metrics and de-duplicates them 3. Thanos Query exposes the Prometheus HTTP API (and gRPC)
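  With the Prometheus Operator, the sidecar is enabled on the Prometheus resource itself, and Query de-duplicates on a replica label. A sketch assuming the operator's thanos field; the image version is illustrative:

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: k8s
    spec:
      replicas: 2                      # HA pair
      thanos:
        baseImage: improbable/thanos   # operator injects the sidecar container
        version: v0.2.1

  The Query component is then started with a replica label to collapse the pair, e.g. thanos query --query.replica-label=prometheus_replica.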
  45. Federation with Thanos: use a centralized instance of Thanos Query to federate the edge instances of Prometheus & Thanos Query
  46. Federation with Thanos Query • No need to scrape metrics into a centralized Prometheus • Query scales horizontally, thereby eliminating the single point of failure! • Prometheus instances running at the edge are now HA & metrics are de-duplicated; we operate these in both AWS & GCP within K8 • Point Grafana at a single Prometheus HTTP API with metrics from all environments (see the sketch below)
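  Because Thanos Query itself exposes the Store API over gRPC, a central Query can fan out to the edge Queries. A sketch of the central deployment's container spec; the image version and store addresses are illustrative:

    containers:
      - name: thanos-query
        image: improbable/thanos:v0.2.1
        args:
          - query
          - --query.replica-label=prometheus_replica         # de-duplicate HA pairs
          - --store=edge-query.eu-cluster.example.com:10901  # edge Thanos Query, cluster A
          - --store=edge-query.us-cluster.example.com:10901  # edge Thanos Query, cluster B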
  47. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  48. Long Term Storage with Thanos 1. The Thanos Sidecar ships metric blocks to a storage bucket such as AWS S3 or GCP Cloud Storage 2. Thanos Store makes the bucket's metrics available to Query via the Thanos Store API
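  The sidecar and the store gateway share an object storage client configuration. A sketch for S3; the bucket name and endpoint are illustrative, and note that recent Thanos releases pass this via --objstore.config-file while early releases used individual --s3.* / --gcs.* flags:

    type: S3
    config:
      bucket: bookinggo-thanos-metrics   # hypothetical bucket name
      endpoint: s3.eu-west-1.amazonaws.com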
  49. How?? (diagram: Prometheus persists in-memory blocks to disk as immutable TSDB blocks; the sidecar uploads each completed disk block to the bucket)
  50. Long Term Storage with Thanos • Significantly reduces the storage requirements of each Prometheus instance – you only need to store around 2 to 24 hours of metrics • Significantly cheaper to store metrics in a bucket versus scaling SSD storage • Thanos Compact executes compaction of the Prometheus TSDB data within the bucket and also downsamples data for querying over long time periods – it keeps raw, 5m & 1h samples • Query automatically de-duplicates data across Prometheus and the metrics stored in the storage bucket • Thanos is built from the Prometheus TSDB code – not reinventing the wheel
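  A sketch of a compactor invocation with per-resolution retention; the retention periods are illustrative:

    args:
      - compact
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/etc/thanos/bucket.yaml
      - --retention.resolution-raw=30d   # raw samples
      - --retention.resolution-5m=180d   # 5-minute downsampled series
      - --retention.resolution-1h=2y     # 1-hour downsampled series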
  51. Thanos in Summary • Prometheus automated in K8 • Single Prometheus API • Long-term metric retention
  52. How do we make this self-serve? • Deployments to BookingGo.Cloud are automated using our BGCloud CLI & Helm charts that we own • To self-serve metrics.. 1. Expose a Prometheus-supported metrics endpoint for the application 2. Set a Helm value to configure the path to the metrics endpoint and enable metrics 3. Deploy to the platform using the CLI tool via a CI/CD pipeline 4. Start building dashboards in Grafana!
  53. How do we make this self-serve? • It is as simple as setting this in the application's self-contained configuration and deploying via a pipeline:

    bookinggo:
      metrics:
        enabled: true
        path: /actuator/prometheus
  54. Things I’ve missed.. • We are building an Observability culture at BookingGo to ensure good-quality monitoring becomes part of the application development lifecycle, including its operation! – Prometheus and Thanos are just one part of the tooling to enable this • Alerting as a Service – development teams have full control over the alerting configuration, and it is part of a code deployment of their application • How to monitor Kubernetes infrastructure – so many metrics are exposed out of the box or easily available using Prometheus exporters • How we actually deploy all of this to Kubernetes – we use Helm and write our own charts to fit the use case if one is not available in the open source community! • So much more…
  55. Learn more about Thanos • If you want to learn more about Thanos, search for ‘PromCon 2018: Thanos - Prometheus at Scale’ on YouTube • https://bit.ly/2P6edZE • Join Improbable’s engineering Slack group to chat: #thanos • improbable-eng.slack.com • Follow the project on GitHub • https://github.com/improbable-eng/thanos • Prometheus: Up & Running book • https://oreil.ly/2r74zN5
  56. Thank you for listening! Questions? E: thomas.riley@booking.com S: Riley @ kubernetes.slack.com
