Prometheus in Practice: High Availability with Thanos (DevOpsDays Edinburgh 2019)

Tom Riley
DevOpsDays Edinburgh 2019
About Me
● Tom Riley
● Infrastructure @ Nuance
● Previously Booking.com
● Co-Organiser, Cloud Native + Kubernetes Manchester
Today
● Introduction to Prometheus
● Monitoring Kubernetes
● High Availability Prometheus
● Long Term Storage for Prometheus
What is Prometheus?
● Prometheus is a metrics-oriented monitoring solution (TSDB & tooling)
● Development started at SoundCloud in 2012
● The Prometheus project joined the Cloud Native Computing Foundation in 2016
● In 2018, it became the second project to graduate from incubation, after Kubernetes
What is Prometheus?
[Diagram: Prometheus and an Application]
Prometheus Metrics
[Diagram: anatomy of a metric, made up of a Metric Name, Metric Labels and Metric Values]
What is Prometheus?
[Diagram: Prometheus with Service Discovery, Applications, an Exporter, Alertmanager and Grafana]
Demo Environment
1. Kubernetes on my laptop using KIND
2. Prometheus Operator
3. Monitoring Kubernetes via:
   ● kube-state-metrics
   ● Node Exporter
   ● Kubelet & cAdvisor
4. Grafana Dashboards
Prometheus Operator
Deploying a Prometheus instance:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  baseImage: quay.io/prometheus/prometheus
  logLevel: info
  replicas: 1
  resources:
    limits:
      cpu: 1
      memory: 100Mi
    requests:
      cpu: 1
      memory: 100Mi
  # How long metrics are kept on local storage
  retention: 12h
  serviceAccountName: prometheus-service-account
  # Only ServiceMonitors carrying this label are picked up
  serviceMonitorSelector:
    matchLabels:
      serviceMonitorSelector: prometheus
  version: v2.10.0
Prometheus Operator
Configuring Prometheus targets with a ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # Label matched by the serviceMonitorSelector of the Prometheus resource
  labels:
    serviceMonitorSelector: prometheus
  name: prometheus
  namespace: prometheus
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      targetPort: 9090
  # Namespaces in which to look for matching Services
  namespaceSelector:
    matchNames:
      - prometheus
  # Services with this label are scraped
  selector:
    matchLabels:
      app: prometheus
Demo 1...
Highly Un-Available Prometheus
● In our demo environment we have a single instance of Prometheus scraping all targets
● If the Kubernetes worker node that Prometheus is running on fails, the Pod will be temporarily unavailable while it is evicted and launched elsewhere
[Diagram: a single Prometheus instance scraping three Targets]
Highly Available Prometheus
[Diagram: Prometheus x2, both instances scraping the same three Targets]
Highly Available! Scrape targets, twice!
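A minimal sketch of the "Prometheus x2" setup, assuming the same Prometheus resource used for the single instance in the demo environment. The only intended change is `replicas: 2`, which asks the operator to run two identical instances that each scrape every target.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  baseImage: quay.io/prometheus/prometheus
  version: v2.10.0
  # Two identical instances; each one scrapes all targets independently
  replicas: 2
  retention: 12h
  serviceAccountName: prometheus-service-account
  serviceMonitorSelector:
    matchLabels:
      serviceMonitorSelector: prometheus
```

Nothing in this sketch forces the two Pods onto different worker nodes; a pod anti-affinity rule would be needed to guarantee that.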
Highly Available Prometheus
Challenges:
• We have two sources of duplicate metrics!
• Which do we use?
Highly Available Prometheus
Use a Load Balancer
[Diagram: a Load Balancer in front of two Prometheus instances scraping the same Targets]
Highly Available Prometheus
Use a Service when running in K8s
[Diagram: a Kubernetes Service in front of two Prometheus instances scraping the same Targets]
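One way this could look is a plain ClusterIP Service in front of both Prometheus Pods. This is a sketch, not the manifest from the demo: the Service name and the `app: prometheus` Pod label are assumptions, and the selector must match whatever labels the Pods in your cluster actually carry.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus          # hypothetical name
  namespace: prometheus
  labels:
    app: prometheus
spec:
  type: ClusterIP
  selector:
    app: prometheus         # assumed Pod label; check your Pods
  ports:
    - name: web
      port: 9090
      targetPort: 9090
```

Each new connection to the Service may land on either instance, which is what gives the availability and also what makes results differ between refreshes.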
Demo 2...
Highly Available Prometheus
[Diagram: a Kubernetes Service in front of two Prometheus instances scraping the same Targets]
Not without its challenges:
• When you refresh the data, you will see it change, as metrics can differ between the two instances
• Use sticky load balancing or make the second instance a hot standby
• This solution is becoming complicated and does not scale with query load
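A sketch of the sticky load balancing option using the Service's built-in client IP affinity. The Service name and Pod label are assumptions; `sessionAffinity` and `sessionAffinityConfig` are standard Service fields.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus          # hypothetical name
  namespace: prometheus
spec:
  selector:
    app: prometheus         # assumed Pod label; check your Pods
  # Keep each client IP pinned to the same Prometheus Pod
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800
  ports:
    - name: web
      port: 9090
      targetPort: 9090
```

Affinity is per client IP, so a single Grafana instance would send all of its queries to one Prometheus Pod. That hides the flapping but does nothing to spread query load.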
Prometheus HA with Thanos
“Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity”
Prometheus HA with Thanos
Developed and open-sourced by engineers at London-based Improbable
Today, 5 core maintainers from various organisations
github.com/thanos-io/thanos
1000+ commits, 4k+ GitHub stars, 138 contributors
Prometheus HA with Thanos
[Diagram: two Prometheus instances, each with a Thanos sidecar, scraping the same Targets, with Thanos Query in front]
1. Thanos sidecar is deployed alongside Prometheus in the Kubernetes Pod using the operator
2. Thanos Query makes gRPC calls to the Thanos sidecars for metrics and de-duplicates them
3. Thanos Query exposes the Prometheus HTTP API or gRPC
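The three steps above could be wired up roughly as follows. This is a sketch under several assumptions rather than the demo's actual manifests: the Thanos image and version are illustrative, the `thanos-query` Deployment and its labels are hypothetical, and the `--store` address assumes the operator's headless `prometheus-operated` Service with the sidecar's gRPC port on its default of 10901. The `--store` flag is the one from Thanos releases of that period; check the flags of the version you run.

```yaml
# Step 1: ask the operator to inject a Thanos sidecar into each Prometheus Pod
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  replicas: 2
  version: v2.10.0
  thanos:
    baseImage: quay.io/thanos/thanos   # assumed image
    version: v0.8.1                    # illustrative version
---
# Steps 2 and 3: Thanos Query fans out to the sidecars and de-duplicates
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query                   # hypothetical name
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.8.1
          args:
            - query
            # Series that differ only in this label are treated as duplicates.
            # prometheus_replica is the operator's default replica label.
            - --query.replica-label=prometheus_replica
            # Discover the sidecars through DNS (assumed Service name)
            - --store=dns+prometheus-operated.prometheus.svc.cluster.local:10901
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
```

Grafana would then point at the Query HTTP port instead of at Prometheus directly, since Query speaks the Prometheus HTTP API.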
Demo 3...
Long Term Storage
The Challenge:
You want to store months or even years' worth of metrics within Prometheus.
You still need to be able to query that data, all of it, and the queries need to be performant.
Long Term Storage
[Diagram: Prometheus with multiple Storage volumes]
Long Term Nightmare?
Long Term Storage
• Prometheus was initially designed for short metrics retention: for monitoring & alerting on what is happening ‘now’
• Local storage can be expensive, especially if using SSD
• You want to store years of metrics. Will this scale efficiently with Prometheus?
Long Term Storage
• Remote write/read API
• Prometheus has remote storage APIs
• The complexity of operating Elasticsearch or similar alongside Prometheus seems somewhat overengineered
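For comparison, this is roughly what the remote storage route looks like on the Prometheus resource. The `remoteWrite` and `remoteRead` fields exist on the operator's Prometheus spec; the adapter URLs are hypothetical and stand in for whatever service translates between Prometheus and the external store.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  version: v2.10.0
  # Samples are pushed to the remote endpoint as they are scraped
  remoteWrite:
    - url: http://remote-storage-adapter.example.svc:9201/write   # hypothetical
  # Queries for older data are forwarded to the remote endpoint
  remoteRead:
    - url: http://remote-storage-adapter.example.svc:9201/read    # hypothetical
```

The configuration itself is short; the operational cost is in running and scaling the adapter and the storage system behind it.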
Hello again, Thanos!
Long Term Storage with Thanos
[Diagram: Targets, Prometheus with Thanos Sidecar, a storage bucket, Thanos Store and Thanos Query]
1. Thanos Sidecar ships metrics to a storage bucket such as AWS S3 or Google Cloud Storage
2. Thanos Store makes the metrics in the bucket available to Query via the Thanos Store API
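A sketch of how the sidecar could be pointed at a bucket, assuming AWS S3. The Secret name, key, bucket name and region are all hypothetical; the `type`/`config` layout is Thanos's object storage configuration format, and `objectStorageConfig` is the operator field that references it.

```yaml
# Bucket configuration, stored as a Secret
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config         # hypothetical name
  namespace: prometheus
stringData:
  thanos.yaml: |
    type: S3
    config:
      bucket: example-thanos-metrics   # hypothetical bucket
      endpoint: s3.eu-west-1.amazonaws.com
      region: eu-west-1
      # Credentials omitted here; supply access_key/secret_key
      # or rely on the environment (for example an IAM role).
---
# Tell the operator-managed sidecar to upload blocks to that bucket
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: prometheus
spec:
  replicas: 2
  version: v2.10.0
  thanos:
    version: v0.8.1                    # illustrative version
    objectStorageConfig:
      name: thanos-objstore-config
      key: thanos.yaml
```

Thanos Store would read the same file through its `--objstore.config-file` flag, and Thanos Query would list the Store as one more endpoint next to the sidecars.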
How?
[Diagram: Targets, a Memory Block and a Disk Block]
Long Term Storage with Thanos
• Significantly reduces the storage requirements of each Prometheus instance: it only needs to store around 2 to 24 hours of metrics
• Storing metrics in a bucket is significantly cheaper than scaling SSD storage
• Thanos Compact compacts Prometheus TSDB data within the bucket and also downsamples it for queries over long time periods: it keeps raw, 5m and 1h resolutions
• Query automatically de-duplicates data across the Prometheus instances and the metrics stored in the storage bucket
• Thanos is built on Prometheus TSDB code, so it is not reinventing the wheel
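As a rough illustration of the Compact component, the container fragment below runs the compactor against the bucket with a separate retention period per resolution. The image version, paths and retention periods are illustrative assumptions, and volume mounts for the data directory and the bucket configuration file are omitted.

```yaml
# Fragment of a Pod spec; volumes and volumeMounts omitted for brevity
containers:
  - name: thanos-compact
    image: quay.io/thanos/thanos:v0.8.1   # illustrative version
    args:
      - compact
      # Keep running and compact continuously instead of exiting after one pass
      - --wait
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/etc/thanos/thanos.yaml
      # Retention per resolution (illustrative values)
      - --retention.resolution-raw=30d
      - --retention.resolution-5m=90d
      - --retention.resolution-1h=1y
```

Only one compactor should run against a given bucket at a time, so this is normally deployed as a single replica.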
Demo 4...
Conclusion
● Use Prometheus Operator to make automating Prometheus on Kubernetes easy!
● Collect time series metrics from everywhere in Kubernetes and start building dashboards to enhance the Observability of your platform and services!
● Use Thanos to add resilience and ease of scalability to Prometheus in Kubernetes. It is as easy as deploying a sidecar!
Questions?
Thank you for listening!
I have published a series of K8s Observability tutorials at:
https://observability.thomasriley.co.uk
Get in touch:
Mail: contact@thomasriley.co.uk
Slack: Riley @ kubernetes.slack.com
Twitter: @therealriley