
Scaling Prometheus on Kubernetes with Thanos

My talk from the Kubernetes Manchester meetup (6th Dec 2018) on how we are using Thanos to scale Prometheus on Kubernetes at BookingGo.

Scaling Prometheus on Kubernetes
Tom Riley @ Booking.com
Scaling Prometheus on Kubernetes with Thanos
BookingGo.Cloud ??
• Kubernetes Delivery Platform
• Self Service for Development Teams
• Everything as Code
• 100% Customer Focused & 100% Business Value
• Cloud Native
• Learn safely in Production
• Public Cloud
• We ❤️ Open Source
BookingGo.Cloud Infrastructure
BookingGo.Cloud Environments
• Dev..
• Test..
• Production..
• Tooling..
• ..plus multiple regions!
• 10 Kubernetes clusters in total and more in the pipeline!

Scaling Prometheus on Kubernetes with Thanos

  • 1. Scaling Prometheus on Kubernetes Tom Riley @ Booking.com
  • 3. BookingGo.Cloud ?? Kubernetes Delivery Platform Self Service for Development Teams Everything as Code 100% Customer Focused & 100% Business Value Cloud Native Learn safely in Production Public Cloud We ❤️ Open Source
  • 6. BookingGo.Cloud Environments • Dev.. • Test.. • Production.. • Tooling.. • ..plus multiple regions! • 10 Kubernetes clusters in total and more in the pipeline!
  • 7. What are we doing with Observability? Past.. • Delivered Logging & Events on Kubernetes using Elastic Stack Present.. • Deliver a product around Time Series Metrics that is suitable for BookingGo.Cloud including alerting as code feature • Continuously evolve and update our BookingGo Monitoring & Observability defaults • Deliver a learning path around Observability; helping users onboard to BookingGo.Cloud and further extend their knowledge via workshops and documentation Future.. • OpenTracing for BookingGo.Cloud • Continue evolving Observability culture
  • 9. Time Series Metrics Project Goals • Provide engineer friendly tooling and instrumentation libraries • Low cardinality monitoring; but one datastore fits all contexts • First class API support; no vendor lock-in, open source • Single pane of glass for Monitoring • Monitoring as code; Kubernetes native experience • Provide consistent mechanism for Alerting based on Metrics • Reboot monitoring culture at BookingGo Monitoring & Observability as part of the application development lifecycle
  • 11. Prometheus – What is it? • Prometheus is a metrics oriented Monitoring solution (TSDB & Tooling) • Started at SoundCloud in 2012 • Prometheus project joined Cloud Native Computing Foundation in 2016 • During 2018, became the second project to graduate from CNCF incubation, after Kubernetes
  • 12. Prometheus – What is it? Prometheus Application Service Discovery Application Exporter Alert Manager Grafana
  • 13. Prometheus - Day One • Deployed kube-prometheus example to all of our K8 clusters • Each cluster then had a single Prometheus instance and Grafana front end • Encouraged development teams to start exposing Prometheus metrics from day one • Opportunity to see if Prometheus was the right technology for us with very little upfront investment required – learning safely in production! • Will the development teams get value from it? • Do we feel the technology fits within Kubernetes? bit.ly/2S6Lmq0
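As a rough illustration of what "exposing Prometheus metrics" means for an application, the sketch below renders a counter in the Prometheus text exposition format using only the Python standard library. The metric name `http_requests_total`, its labels, and the helper `render_counter` are hypothetical and not taken from the talk; a real service would more likely use an official client library.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counts, keyed by label set.
requests = {
    (("method", "GET"), ("code", "200")): 1027,
    (("method", "POST"), ("code", "500")): 3,
}
exposition = render_counter("http_requests_total", "Total HTTP requests.", requests)
print(exposition)
```

Serving this text over HTTP (conventionally at `/metrics`) is what lets a Prometheus server scrape the application.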
  • 14. Prometheus - Day One Learnings Happy Development Teams!
  • 15. Prometheus - Day One Learnings Prometheus ❤️ Kubernetes
  • 16. Kubernetes Prometheus Operator • Defines Custom Resource Definitions (CRD) for deploying and configuring Prometheus & AlertManager • As simple as: • Deploy the operator to your Kubernetes cluster • Start deploying the CRD objects to define your Prometheus setup • Operator launches Prometheus pods automatically based on CRD configuration
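To make the "deploy CRD objects" step more concrete, here is a minimal sketch that builds a `Prometheus` custom resource as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The resource name, namespace, replica count, and `team` label are assumptions chosen for illustration, not the configuration used at BookingGo.

```python
import json


def prometheus_manifest(name, namespace, replicas, team_label):
    """Build a Prometheus custom resource for the Prometheus Operator."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "Prometheus",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Number of Prometheus pods the operator should run.
            "replicas": replicas,
            # Pick up ServiceMonitor objects carrying this (hypothetical) label.
            "serviceMonitorSelector": {"matchLabels": {"team": team_label}},
        },
    }


manifest = prometheus_manifest("example", "monitoring", 2, "frontend")
print(json.dumps(manifest, indent=2))
```

Once the operator sees an object like this, it creates and manages the Prometheus pods itself.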
  • 17. Kubernetes Prometheus Operator Deploy Prometheus bit.ly/2R7ohn8
  • 18. Kubernetes Prometheus Operator Configure Prometheus Targets bit.ly/2R7ohn8
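Target configuration with the operator is typically expressed as a `ServiceMonitor` object. The sketch below builds one as a Python dict; the names, the `app` and `team` labels, the port name `metrics`, and the 30s interval are all illustrative assumptions rather than values from the slides.

```python
import json


def service_monitor(name, namespace, team_label, app_label, port_name):
    """Build a ServiceMonitor that tells Prometheus which Services to scrape."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label a Prometheus object can select on (hypothetical).
            "labels": {"team": team_label},
        },
        "spec": {
            # Scrape every Service whose labels match this selector.
            "selector": {"matchLabels": {"app": app_label}},
            # Named Service port that serves the metrics endpoint.
            "endpoints": [{"port": port_name, "interval": "30s"}],
        },
    }


sm = service_monitor("example-app", "monitoring", "frontend", "example-app", "metrics")
print(json.dumps(sm, indent=2))
```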
  • 19. Next Steps.. • We decided to continue ahead with use of Prometheus • But we had determined a number of challenges.. 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  • 20. Next Steps.. • What are the common patterns for tackling these problems? • How did we approach this? • We keep a close eye on sources of information, blogs, tech talks on YouTube, KubeCon/PromCon videos, etc. • We attended conferences to learn from others! • Read documentation and best practices • Keep a close eye on new and evolving projects from GitHub, etc.
  • 21. Highly Available Prometheus Targets Targets Targets Prometheus x1 Scrape Targets
  • 22. Highly Available Prometheus Targets Targets Targets Prometheus x2 Highly Available! Scrape Targets, Twice!
  • 23. Highly Available Prometheus Challenges: • We have two sources of duplicate metrics! • Well, so called duplicates – metrics will vary between the two slightly! • Which do we use?
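A small simulation can show why the two replicas hold "so-called duplicates": each scrapes the same target on its own schedule, so timestamps and values never quite line up. The counter, scrape interval, and offsets below are made-up numbers for illustration.

```python
def counter_value(t):
    """Hypothetical counter that increases by 2 every second."""
    return 2 * t


def scrape(offset, interval, count):
    """Return (timestamp, value) samples for a replica that starts at `offset` seconds."""
    return [
        (offset + i * interval, counter_value(offset + i * interval))
        for i in range(count)
    ]


# Both replicas scrape every 15s, but replica B started 7s later.
replica_a = scrape(offset=0, interval=15, count=3)
replica_b = scrape(offset=7, interval=15, count=3)
print(replica_a)
print(replica_b)
```

Asking each replica for its latest value yields different answers, which is why a dashboard can appear to change on refresh when requests alternate between the two.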
  • 24. Highly Available Prometheus Targets Targets Targets Use a Load Balancer Load Balancer
  • 25. Highly Available Prometheus Targets Targets Targets Could use something like HA Proxy HA Proxy
  • 26. Highly Available Prometheus Targets Targets Targets Use a Service when running in K8 Kubernetes Service
  • 27. Highly Available Prometheus Targets Targets Targets Not without its challenges: • When you refresh the data, you will see it change as metrics will potentially differ between the two instances Kubernetes Service
  • 28. Highly Available Prometheus Targets Targets Targets Not without its challenges: • When you refresh the data, you will see it change as metrics will potentially differ between the two instances • Use sticky load balancing or make the second instance a hot standby • This solution is becoming complicated and does not scale with query load Kubernetes Service
  • 29. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  • 30. Federated Prometheus Scrape metrics at /federate to centralized Prometheus instance
  • 32. Federated Prometheus Also not without its challenges.. • Duplicating metrics is costly • Have to configure desired metrics you wish to federate and can easily be forgotten • Single point of failure
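The point about having to configure which metrics to federate can be seen in the request itself: the `/federate` endpoint only returns series matching the `match[]` selectors it is given. The sketch below builds such a URL with the standard library; the host name and selectors are hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


# Hypothetical downstream Prometheus and selectors.
selectors = ['{job="kubernetes-pods"}', "up"]
url = federate_url("http://prometheus.example:9090", selectors)
print(url)
```

Any series not covered by a selector is simply absent from the central instance, which is how metrics end up "forgotten".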
  • 33. Prometheus for Practitioners @ Monitorama EU 2018 Slides: https://bit.ly/2AqB11d Monitorama Talk: https://vimeo.com/289893972
  • 34. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  • 36. Long Term Storage • Prometheus was initially designed for short metrics retention; it was built for monitoring & alerting on what is happening ‘now’ • Local storage can be expensive, especially if using SSD • We wanted to store years of metrics, will this scale efficiently with Prometheus?
  • 37. Long Term Storage • Remote write/read API • Prometheus has remote storage APIs • Concerns around the complexity of operating Elasticsearch or similar alongside Prometheus https://bit.ly/2zt5try
  • 38. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  • 40. Thanos – What is it? “Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity”
  • 41. Thanos – What is it? Developed and open-sourced by engineers at London-based Improbable github.com/improbable-eng/thanos 619 commits, 2.3k GitHub stars, 50 contributors
  • 42. Thanos – What does it do? • Designed to work in Kubernetes, supported by the Prometheus-Operator • Global querying view across all connected Prometheus servers • Deduplication and merging of metrics collected from Prometheus HA pairs • Seamless integration with existing Prometheus setups • Any object storage as its only, optional dependency • Downsampling historical data for massive query speedup • Cross-cluster federation • Fault-tolerant query routing • Simple gRPC "Store API" for unified data access across all metric data • Easy integration points for custom metric providers https://bit.ly/2KCAWfB
  • 43. Challenges Thanos helps to tackle all these problems in a different way.. 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  • 44. HA Prometheus with Thanos [diagram: Targets]
  • 45. HA Prometheus with Thanos 1. Thanos sidecar deployed alongside Prometheus in Kubernetes Pod using operator 2. Thanos Query makes gRPC call to Thanos sidecar for metrics and de-duplicates 3. Thanos Query exposes Prometheus HTTP API or gRPC [diagram: Targets, Query]
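To illustrate the de-duplication idea for an HA pair, here is a simplified sketch. It assumes the two Prometheus replicas attach an external label called `replica` (a hypothetical name) and that their sample timestamps line up; it is not Thanos's actual merge algorithm, only the general principle of dropping the replica label and merging what remains.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only by the replica label.

    Each series is a (labels, samples) pair, where labels is a dict and
    samples is a list of (timestamp, value) tuples.
    """
    merged = {}
    for labels, samples in series_list:
        # Series identity is every label except the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_timestamp = merged.setdefault(key, {})
        for timestamp, value in samples:
            # The first replica to report a timestamp wins.
            by_timestamp.setdefault(timestamp, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Replica "a" missed the scrape at t=15; replica "b" fills the gap.
replica_a = ({"__name__": "up", "job": "api", "replica": "a"}, [(0, 1.0), (30, 1.0)])
replica_b = ({"__name__": "up", "job": "api", "replica": "b"}, [(0, 1.0), (15, 1.0), (30, 1.0)])
result = deduplicate([replica_a, replica_b])
print(result)
```

The caller sees a single gap-free series, which is why refreshing a dashboard no longer flips between the two instances.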
  • 46. Federation with Thanos Use a centralized instance of Thanos Query to federate the edge instances of Prometheus & Thanos Query
  • 47. Federation with Thanos • No need to scrape metrics to a centralized Prometheus • Query scales horizontally, therefore eliminating the single point of failure! • Prometheus instances running at the edge are now HA & metrics are de-duplicated. We operate these in both AWS & GCP within K8 • Point Grafana at a single Prometheus HTTP API with metrics from all environments [diagram: Query]
  • 48. Challenges 1. How do we run HA Prometheus in our K8 clusters? 2. How do we achieve the single pane of glass when we have so many distributed instances of Prometheus? 3. How do we scale Prometheus retention from days to months or even years?
  • 49. Long Term Storage with Thanos 1. Thanos Sidecar ships metrics to a storage bucket such as AWS S3 or GCP Storage 2. Thanos Store makes metrics available via the Thanos Store API for Query [diagram: Targets, Query, Store]
  • 51. Long Term Storage with Thanos • Significantly reduce storage requirements of each Prometheus instance – only need to store around 2 to 24 hours of metrics • Significantly cheaper storing metrics in a bucket versus scaling SSD storage • Thanos Compact executes compression of Prometheus TSDB data within the bucket and also downsamples data for when querying over long time periods – keeps raw (1m), 5m & 15m samples • Query automatically de-duplicates data within Prometheus and metrics stored in the storage bucket • Thanos is built from Prometheus TSDB code – not reinventing the wheel
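A rough sketch of what downsampling means for long-range queries: raw samples are grouped into fixed windows and each window keeps a few aggregates instead of every point. The 5-minute window matches the slide above; the choice of aggregates (count, sum, min, max) and the function name are assumptions for illustration, not a description of the Thanos Compact internals.

```python
def downsample(samples, window_seconds=300):
    """Aggregate (timestamp, value) samples into fixed-size windows.

    Returns a dict mapping each window start time to its aggregates.
    """
    windows = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % window_seconds
        agg = windows.setdefault(
            start, {"count": 0, "sum": 0.0, "min": value, "max": value}
        )
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))


# One raw sample per minute for ten minutes, with values 0.0 to 9.0.
raw = [(t, float(t // 60)) for t in range(0, 600, 60)]
five_minute = downsample(raw)
print(five_minute)
```

Ten raw points collapse into two windows here, so a query spanning months touches far fewer values than it would at raw resolution.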
  • 52. Thanos in Summary • Prometheus automated in K8 • Single Prometheus API • Long term metric retention [diagram: Query]
  • 53. How do we make this self-serve? • Deployments to BookingGo.Cloud are automated using our BGCloud CLI & Helm charts that we own • To self-serve metrics.. 1. Expose Prometheus supported metrics endpoint for application 2. Set helm value to configure path to metrics endpoint and enable metrics 3. Deploy to platform using CLI tool via CI/CD pipeline 4. Start building dashboards in Grafana!
  • 54. How do we make this self-serve? • It is as simple as setting this in the application's self-contained configuration and deploying via a pipeline:
        bookinggo:
          metrics:
            enabled: true
            path: /actuator/prometheus
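Step 1 of the self-serve flow asks each application to expose a Prometheus-compatible metrics endpoint. The sketch below shows, using only the Python standard library, what such an endpoint returns in the Prometheus text exposition format. The counter name and its value are hypothetical; the path is the one from the configuration above, and in practice a client library or framework would normally generate this output.

```python
from http.server import BaseHTTPRequestHandler

METRICS_PATH = "/actuator/prometheus"  # path taken from the configuration above

# Hypothetical application counters: name -> (help text, current value).
COUNTERS = {"http_requests_total": ("Total HTTP requests.", 3)}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on METRICS_PATH and 404s elsewhere."""

    def do_GET(self):
        if self.path != METRICS_PATH:
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


print(render_metrics(COUNTERS), end="")
```

Passing `MetricsHandler` to `http.server.HTTPServer` would serve this on a port of your choosing; the platform then only needs the `path` value to know where to scrape.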
  • 55. Things I’ve missed.. • We are building an Observability culture at BookingGo to ensure good quality monitoring becomes part of the application development lifecycle, including its operation! – Prometheus and Thanos are just one part of the tooling to enable this • Alerting as a Service – Development teams have full control over alerting configuration, which is part of the code deployment of their application • How to monitor Kubernetes infrastructure – so many metrics are exposed out of the box or easily available using Prometheus exporters • How we actually deploy all of this to Kubernetes – we use Helm and write our own charts to fit the use case if one is not available in the open source community! • So much more…
  • 56. Learn more about Thanos • If you want to learn more about Thanos search for ‘PromCon 2018: Thanos - Prometheus at Scale’ on YouTube • https://bit.ly/2P6edZE • Join Improbable’s engineering Slack group to chat #thanos • improbable-eng.slack.com • Follow the project on GitHub • https://github.com/improbable-eng/thanos • Prometheus: Up & Running book • https://oreil.ly/2r74zN5
  • 57. Thank you for listening! Questions? E: thomas.riley@booking.com S: Riley @ kubernetes.slack.com

Editor's Notes

  1. Logging & Events: expensive, short term, high context. Metrics: cheap, long term, low context.
  2. Engineer-friendly tooling; high-cardinality monitoring: keep ALL the context; first-class API support, no vendor lock-in, future proof; single pane of glass; monitoring as code; K8-native experience; consistent mechanism for alerting; reboot our monitoring culture; part of application development lifecycle
  3. SLOW DOWN
  4. Prometheus is a Time Series DB. Open-sourced by SoundCloud in 2012. Joined CNCF incubator in 2016. Graduated alongside Kubernetes in 2018. Community is moving toward this.
  5. Operator is deployed into the cluster. Deploy Kube code to launch a Prometheus instance; the operator will then deploy and manage this for us. ServiceMonitor automates the configuration for scraping metrics endpoints in a K8-native way.
  6. SLOW DOWN
  7. SLOW DOWN
  8. SLOW DOWN
  9. SLOW DOWN
  10. SLOW DOWN