Scaling Prometheus on Kubernetes
Tom Riley @ Booking.com
BookingGo.Cloud ??
Kubernetes
Delivery Platform
Self Service for
Development Teams
Everything as Code
100% Customer
Focused &
100% Business Value
Cloud Native
Learn safely
in Production
Public Cloud
We ❤️ Open Source
BookingGo.Cloud Infrastructure
BookingGo.Cloud Environments
• Dev..
• Test..
• Production..
• Tooling..
• ..plus multiple regions!
• 10 Kubernetes clusters in total and more in the pipeline!
What are we doing with Observability?
Past..
• Delivered Logging & Events on Kubernetes using Elastic Stack
Present..
• Deliver a product around Time Series Metrics that is suitable for BookingGo.Cloud
including an alerting-as-code feature
• Continuously evolve and update our BookingGo Monitoring & Observability defaults
• Deliver a learning path around Observability; helping users onboard to BookingGo.Cloud
and further extend their knowledge via workshops and documentation
Future..
• OpenTracing for BookingGo.Cloud
• Continue evolving Observability culture
Time Series Metrics Project Goals
• Provide engineer friendly tooling and instrumentation libraries
• Low cardinality monitoring; but one datastore fits all contexts
• First class API support; no vendor lock-in, open source
• Single pane of glass for Monitoring
• Monitoring as code; Kubernetes native experience
• Provide consistent mechanism for Alerting based on Metrics
• Reboot monitoring culture at BookingGo
Monitoring & Observability as part of the
application development lifecycle
Prometheus
Kubernetes Infrastructure
and Application
Monitoring with
Prometheus
Prometheus – What is it?
• Prometheus is a metrics oriented Monitoring solution (TSDB & Tooling)
• Released by SoundCloud in 2012
• Prometheus project joined Cloud Native Computing Foundation in 2016
• During 2018, it became the second project to graduate from incubation,
alongside Kubernetes
Prometheus – What is it?
Prometheus
Application
Service
Discovery Application
Exporter
Alert
Manager
Grafana
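To make the architecture above concrete, here is a minimal prometheus.yml sketch – illustrative only (the annotation-based relabelling and the Alertmanager address are assumptions, not our production config), but it shows the pieces in the diagram: service discovery finding targets, Prometheus scraping them, and alerts routed to Alertmanager.

  global:
    scrape_interval: 30s                     # how often every target is scraped

  alerting:
    alertmanagers:
      - static_configs:
          - targets: ["alertmanager:9093"]   # hypothetical Alertmanager address

  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod                          # discover pods via the Kubernetes API
      relabel_configs:
        # keep only pods annotated with prometheus.io/scrape: "true"
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: "true"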
Prometheus - Day One
• Deployed kube-prometheus example to all of our K8 clusters
• Each cluster then had a single Prometheus instance and Grafana front end
• Encouraged development teams to start exposing Prometheus metrics from
day one
• Opportunity to see if Prometheus was the right technology for us with
very little upfront investment required – learning safely in production!
• Will the development teams get value from it?
• Do we feel the technology fits within Kubernetes?
bit.ly/2S6Lmq0
Prometheus - Day One Learnings
Happy
Development
Teams!
Prometheus - Day One Learnings
Prometheus
❤️
Kubernetes
Kubernetes Prometheus Operator
• Defines Custom Resource Definitions (CRD) for deploying and configuring
Prometheus & AlertManager
• As simple as:
• Deploy the operator to your Kubernetes cluster
• Start deploying the CRD objects to define your Prometheus setup
• Operator launches Prometheus pods automatically based on CRD
configuration
Kubernetes Prometheus Operator
Deploy Prometheus
bit.ly/2R7ohn8
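The bit.ly link above pointed at the operator examples; as a rough sketch (name, namespace and sizes are illustrative, not the exact example from the link), the Prometheus custom resource the operator watches looks something like this:

  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: k8s
    namespace: monitoring
  spec:
    replicas: 2                      # run an HA pair
    retention: 24h                   # keep local data short
    serviceAccountName: prometheus   # RBAC for service discovery
    serviceMonitorSelector:
      matchLabels:
        team: frontend               # pick up ServiceMonitors carrying this label
    resources:
      requests:
        memory: 400Mi

The operator reacts to this object by creating the underlying StatefulSet, configuration and pods for us.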
Kubernetes Prometheus Operator
Configure Prometheus
Targets
bit.ly/2R7ohn8
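Targets are then described with ServiceMonitor objects rather than hand-written scrape configs. A minimal sketch, assuming a hypothetical example-app Service exposing a named web port:

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: example-app
    namespace: monitoring
    labels:
      team: frontend            # must match the serviceMonitorSelector on the Prometheus resource
  spec:
    selector:
      matchLabels:
        app: example-app        # selects the Kubernetes Service to scrape
    endpoints:
      - port: web               # named port on that Service
        path: /metrics
        interval: 30s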
Next Steps..
• We decided to continue ahead with Prometheus
• But we had identified a number of challenges..
1. How do we run HA Prometheus in our K8 clusters?
2. How do we achieve the single pane of glass when we have so many
distributed instances of Prometheus?
3. How do we scale Prometheus retention from days to months or even
years?
Next Steps..
• What are the common patterns for tackling these problems?
• How did we approach this?
• We keep a close eye on sources of information, blogs, tech talks on
YouTube, KubeCon/PromCon videos, etc.
• We attended conferences to learn from others!
• Read documentation and best practices
• Keep a close eye on new and evolving projects from GitHub, etc.
Highly Available Prometheus
Targets Targets Targets
Prometheus x1
Scrape Targets
Highly Available Prometheus
Targets Targets Targets
Prometheus x2
Highly Available!
Scrape Targets,
Twice!
Highly Available Prometheus
Challenges:
• We have two sources of
duplicate metrics!
• Well, so-called duplicates
– the metrics will vary
slightly between the two!
• Which do we use?
Highly Available Prometheus
Targets Targets Targets
Use a Load Balancer
Load Balancer
Highly Available Prometheus
Targets Targets Targets
Could use something
like HA Proxy
HA Proxy
Highly Available Prometheus
Targets Targets Targets
Use a Service when
running in K8
Kubernetes Service
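A sketch of that Service – the selector label is illustrative; with the operator the pods already carry suitable labels:

  apiVersion: v1
  kind: Service
  metadata:
    name: prometheus
    namespace: monitoring
  spec:
    selector:
      app: prometheus           # matches both replicas, so queries are spread across the pair
    ports:
      - name: web
        port: 9090
        targetPort: 9090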
Highly Available Prometheus
Targets Targets Targets
Not without its challenges:
• When you refresh the data,
you will see it change as
metrics will potentially differ
between the two instances
Kubernetes Service
Highly Available Prometheus
Targets Targets Targets
Not without its challenges:
• When you refresh the data,
you will see it change as
metrics will potentially differ
between the two instances
• Use sticky load balancing or
make the second instance a
hot standby
• This solution is becoming
complicated and does not
scale with query load
Kubernetes Service
Challenges
1. How do we run HA Prometheus in our K8 clusters?
2. How do we achieve the single pane of glass when we have so many
distributed instances of Prometheus?
3. How do we scale Prometheus retention from days to months or even
years?
Federated Prometheus
Scrape metrics at
/federate to centralized
Prometheus instance
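On the centralized Prometheus, federation is just another scrape job pointed at /federate on each edge instance. A sketch, where the match[] selector and the target addresses are placeholders:

  scrape_configs:
    - job_name: federate
      honor_labels: true                 # keep the labels set by the edge Prometheus
      metrics_path: /federate
      params:
        "match[]":
          - '{job="kubernetes-pods"}'    # only pull the series you actually need
      static_configs:
        - targets:
            - prometheus-eu.example.internal:9090   # hypothetical edge instances
            - prometheus-us.example.internal:9090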
Federated Prometheus
Add Grafana..
Single Pane of Glass!!
Federated Prometheus
Also not without its challenges..
• Duplicating metrics is costly
• You have to configure which metrics you
wish to federate, and this can easily be
forgotten
• Single point of failure
Prometheus for Practitioners @ Monitorama EU 2018
Slides:
https://bit.ly/2AqB11d
Monitorama Talk:
https://vimeo.com/289893972
Challenges
1. How do we run HA Prometheus in our K8 clusters?
2. How do we achieve the single pane of glass when we have so many
distributed instances of Prometheus?
3. How do we scale Prometheus retention from days to months or even
years?
Long Term Storage
Long Term Storage
Storage
• Prometheus was initially designed for short
metrics retention – for monitoring & alerting
on what is happening ‘now’
• Local storage can be expensive, especially if
using SSD
• We wanted to store years of metrics – will
this scale efficiently with Prometheus?
Long Term Storage
• Remote write/read API
• Prometheus has remote storage APIs
• Concerns around the complexity of operating Elasticsearch or similar
alongside Prometheus
https://bit.ly/2zt5try
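For reference, wiring Prometheus to a remote backend is only a couple of stanzas in prometheus.yml – the endpoint below is a placeholder for whichever adapter or backend you operate:

  remote_write:
    - url: http://remote-storage-adapter.example.internal/write   # hypothetical adapter endpoint
  remote_read:
    - url: http://remote-storage-adapter.example.internal/read
      read_recent: false     # recent data still comes from the local TSDB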
Challenges
1. How do we run HA Prometheus in our K8 clusters?
2. How do we achieve the single pane of glass when we have so many
distributed instances of Prometheus?
3. How do we scale Prometheus retention from days to months or even
years?
Hello, Thanos
Thanos – What is it?
“Thanos is a set of components
that can be composed into a
highly available metric system
with unlimited storage capacity”
Thanos – What is it?
Developed and open-sourced by
engineers at London-based Improbable
github.com/improbable-eng/thanos
619 commits, 2.3k GitHub stars, 50 contributors
Thanos – What does it do?
• Designed to work in Kubernetes, supported by the Prometheus-Operator
• Global querying view across all connected Prometheus servers
• Deduplication and merging of metrics collected from Prometheus HA pairs
• Seamless integration with existing Prometheus setups
• Any object storage as its only, optional dependency
• Downsampling historical data for massive query speedup
• Cross-cluster federation
• Fault-tolerant query routing
• Simple gRPC "Store API" for unified data access across all metric data
• Easy integration points for custom metric providers
https://bit.ly/2KCAWfB
Challenges
Thanos helps to tackle all these problems in a different way..
1. How do we run HA Prometheus in our K8 clusters?
2. How do we achieve the single pane of glass when we have so many
distributed instances of Prometheus?
3. How do we scale Prometheus retention from days to months or even
years?
HA Prometheus with Thanos
Targets Targets Targets
HA Prometheus with Thanos
Targets Targets Targets
Query
1. Thanos sidecar deployed alongside Prometheus in the Kubernetes Pod using the operator
2. Thanos Query makes a gRPC call to each Thanos sidecar for metrics and de-duplicates
3. Thanos Query exposes the Prometheus HTTP API or gRPC
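A sketch of what this looks like with the operator – the image version, service name and replica label are illustrative and depend on your operator version, but the shape is: enable the sidecar on the Prometheus resource, then tell Thanos Query which label distinguishes the HA replicas so it can de-duplicate on it.

  apiVersion: monitoring.coreos.com/v1
  kind: Prometheus
  metadata:
    name: k8s
  spec:
    replicas: 2                             # the operator adds a per-replica external label
    thanos:
      image: quay.io/thanos/thanos:v0.7.0   # injects the Thanos sidecar container
  ---
  # Thanos Query, trimmed to the relevant container args
  args:
    - query
    - --store=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc   # hypothetical sidecar gRPC discovery
    - --query.replica-label=prometheus_replica                  # de-duplicate the HA pair on this label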
Federation with Thanos
Use a centralized instance
of Thanos Query to
federate the edge
instances of Prometheus &
Thanos
Query
Federation with Thanos
Query
• No need to scrape metrics to a centralized Prometheus
• Query scales horizontally, eliminating the single point of failure!
• Prometheus instances running at the edge are now HA & metrics are
de-duplicated. We operate these in both AWS & GCP within K8
• Point Grafana at a single Prometheus HTTP API with metrics from all
environments
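Because Thanos Query itself speaks the Store API, the centralized instance simply lists the edge Queries (and any Store gateways) as stores. A sketch of its args, with placeholder addresses:

  # Global Thanos Query – container args only
  args:
    - query
    - --query.replica-label=prometheus_replica
    - --store=thanos-query-eu.example.internal:10901   # edge Query in one cluster/region
    - --store=thanos-query-us.example.internal:10901   # edge Query in another
    - --store=thanos-store.monitoring.svc:10901        # Store gateway serving bucket data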
Challenges
1. How do we run HA Prometheus in our K8 clusters?
2. How do we achieve the single pane of glass when we have so many
distributed instances of Prometheus?
3. How do we scale Prometheus retention from days to months or even
years?
Long Term Storage with Thanos
Targets Targets Targets
Query
Store
1. Thanos Sidecar ships metrics to a storage bucket such as AWS S3 or GCP Storage
2. Thanos Store makes metrics available via the Thanos Store API for Query
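The bucket itself is described in a small object storage config that the sidecar, Store and Compact all share; an S3-flavoured sketch with a placeholder bucket:

  # objstore.yml – mounted as a secret and passed via --objstore.config-file
  type: S3
  config:
    bucket: example-thanos-metrics        # hypothetical bucket name
    endpoint: s3.eu-west-1.amazonaws.com
    # credentials usually come from the pod's IAM role rather than static keys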
How??
[Diagram: targets are scraped into the Prometheus in-memory block, which is persisted as disk blocks]
Long Term Storage with Thanos
• Significantly reduce storage requirements of each Prometheus instance –
only need to store around 2 to 24 hours of metrics
• Significantly cheaper storing metrics in a bucket versus scaling SSD storage
• Thanos Compact executes compression of Prometheus TSDB data within
the bucket and also downsamples data for querying over long time
periods – keeps raw, 5m & 1h samples (sketched after this list)
• Query automatically de-duplicates data within Prometheus and metrics
stored in the storage bucket
• Thanos is built from Prometheus TSDB code – not reinventing the wheel
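A sketch of the Compact flags referenced in the bullet above – the retention values are illustrative, not a recommendation:

  # Thanos Compact – container args only
  args:
    - compact
    - --objstore.config-file=/etc/thanos/objstore.yml
    - --retention.resolution-raw=30d    # keep raw samples for 30 days
    - --retention.resolution-5m=180d    # keep 5m downsampled data for 6 months
    - --retention.resolution-1h=2y      # keep 1h downsampled data for 2 years
    - --wait                            # run continuously rather than once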
Thanos in Summary
Query
• Prometheus automated in K8
• Single Prometheus API
• Long term metric retention
How do we make this self-serve?
• Deployments to BookingGo.Cloud are automated using our BGCloud CLI
& Helm charts that we own
• To self-serve metrics..
1. Expose a Prometheus-supported metrics endpoint for the application
2. Set the Helm value to configure the path to the metrics endpoint and
enable metrics
3. Deploy to the platform using the CLI tool via a CI/CD pipeline
4. Start building dashboards in Grafana!
How do we make this self-serve?
• It is as simple as setting this in the application’s self-contained
configuration and deploying via a pipeline:
bookinggo:
  metrics:
    enabled: true
    path: /actuator/prometheus
Things I’ve missed..
• We are building an Observability culture at BookingGo to ensure good quality
monitoring becomes part of the application development lifecycle, including its
operation! – Prometheus and Thanos are just one part of the tooling to enable
this
• Alerting as a Service – Development teams have full control over alerting
configuration, which is part of the code deployment of their application (see the
sketch after this list)
• How to monitor Kubernetes infrastructure – so many metrics are exposed out of
the box or easily available using Prometheus exporters
• How we actually deploy all of this to Kubernetes – we use Helm and write our
charts to fit the use case if one is not available in the open source community!
• So much more…
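As a sketch of the alerting-as-code point above (names, expression and thresholds are purely illustrative), an alert ships with the application as a PrometheusRule object that the operator loads into Prometheus:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: example-app-alerts        # hypothetical application
    labels:
      team: frontend                # matched by the Prometheus ruleSelector
  spec:
    groups:
      - name: example-app
        rules:
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5..", app="example-app"}[5m])) > 1
            for: 10m
            labels:
              severity: page
            annotations:
              summary: example-app is serving too many 5xx responses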
Learn more about Thanos
• If you want to learn more about Thanos search for ‘PromCon 2018: Thanos
- Prometheus at Scale’ on YouTube
• https://bit.ly/2P6edZE
• Join Improbable’s engineering Slack group to chat #thanos
• improbable-eng.slack.com
• Follow the project on GitHub
• https://github.com/improbable-eng/thanos
• Prometheus: Up & Running book
• https://oreil.ly/2r74zN5
Thank you for listening!
Questions?
E: thomas.riley@booking.com
S: Riley @ kubernetes.slack.com

Editor's Notes

  • #8 Logging & Events: expensive, short term, high context. Metrics: cheap, long term, low context.
  • #10 Engineer friendly tooling. High cardinality monitoring: keep ALL the context. First class API support, no vendor lock-in, future proof. Single pane of glass. Monitoring as code; K8 native experience. Consistent mechanism for alerting. Reboot our monitoring culture. Part of the application development lifecycle.
  • #12 Prometheus is a Time Series DB. Open sourced by SoundCloud in 2012. Joined the CNCF incubator in 2016. Graduated alongside Kubernetes in 2018. The community is moving toward this.
  • #16 The Operator is deployed into the cluster. Deploy Kube code to launch a Prometheus instance; the operator will then deploy and manage this for us. ServiceMonitor automates the configuration for scraping metrics endpoints in a K8-native way.