Monitoring with Prometheus
at Scale
Adam Hamsik
adam.hamsik@lablabs.io
Labyrinth Labs
Rock-solid infrastructure and DevOps
● Building rock-solid and secure foundations for all your digital operations. Our
mission is to let you focus on your business without ever needing to worry
about technical issues again.
● Making you ready for growing traffic and safe against new security vulnerabilities
and data loss.
TL;DR
● We will start with common monitoring issues and problems.
● Deploying Prometheus is easy and running a single instance can be sufficient
for most deployments.
● We will have a quick look at Alertmanager.
● We will talk about the scalability limits of a Prometheus instance, and when and how to
use sharding.
● What Trickster is and why you should use it too.
● How Thanos/Cortex can help you when all hope is lost.
Common Monitoring Problems
● Monitoring tools are limited both technically and conceptually
● Most existing tools don’t really scale with current infrastructure needs.
● Limited visibility
○ Generally we want to monitor and gather as much information as we can.
○ Even if we don’t need it right away, it will usually be useful in the future (I promise).
● No common application monitoring interface. There are different
protocols/standards
○ OpenMetrics
○ SNMP
Prometheus Monitoring System
The Prometheus monitoring system and time series database is a CNCF graduated
project.
● Originally developed by ex-Googlers at SoundCloud as their internal monitoring
system
● Inspired by Google’s Borgmon monitoring system
● Open source under the Apache License
● Written as a monolithic application in Go
Prometheus Server Overview
● Multi-dimensional data model with time series data identified by metric
name and key/value pairs (labels)
● PromQL, a flexible query language to leverage this dimensionality
● No reliance on distributed storage; single server nodes are autonomous
● Targets are discovered via service discovery or static configuration
● Pushing time series is supported via an intermediary gateway
● Monitor Services not Machines/Servers
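The data model above can be pictured with a small Python sketch. This is only an illustration of "metric name plus labels identifies a series"; the metric name, label names, values, and the `select` helper are all made up here, and real Prometheus stores and queries series very differently.

```python
# Each time series is identified by a metric name plus a set of key/value labels.
# All names and values below are hypothetical.
series = {
    ("http_requests_total", frozenset({("method", "get"), ("env", "prod")})): 1027.0,
    ("http_requests_total", frozenset({("method", "post"), ("env", "prod")})): 3.0,
    ("http_requests_total", frozenset({("method", "get"), ("env", "dev")})): 12.0,
}


def select(name, **matchers):
    """Return series with the given metric name whose labels equal all matchers."""
    wanted = set(matchers.items())
    return {
        key: value
        for key, value in series.items()
        if key[0] == name and wanted <= key[1]
    }


# Roughly what the PromQL selector http_requests_total{env="prod"} would pick.
prod = select("http_requests_total", env="prod")
total_prod = sum(prod.values())
```

Adding one more label (for example a per-user ID) would multiply the number of keys in `series`, which is why label choice matters so much later in the talk.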
Prometheus Architecture
1. https://www.prometheus.io/assets/architecture.png
Company Prometheus Usage
● Deploy the first Prometheus servers
● Add some services
● Set up Trickster as a Grafana cache
● Add more services/servers
● Keep adding CPU/memory to the Prometheus instance
● Set up simple federation/sharding if a single instance gets too big
● Use Thanos
First Prometheus Deployment
● Deploying your first Prometheus server is very easy: fetch the Prometheus
binary and a config file.
● There is no concept of a Prometheus cluster.
● Generally, Prometheus scales very well vertically with CPU/memory
○ Providing more CPU/memory allows Prometheus to handle more
metrics
○ It’s hard to run a pod in a Kubernetes cluster if it’s as big as a
worker node.
● If a job is too big for a single server, you can use federation/sharding
(remote reads) for simple scaling
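One common way to shard is to hash each target and let every Prometheus server keep only the targets whose hash falls into its bucket (Prometheus relabelling offers a `hashmod` action for this purpose). The Python sketch below shows only the idea; the hash function, target addresses, and helper names are assumptions for illustration, not Prometheus's exact implementation.

```python
import hashlib


def shard_for(target, shards):
    """Map a scrape target to one shard by hashing its address."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shards


def targets_for_shard(targets, shard, shards):
    """Targets that the Prometheus server with index `shard` would keep."""
    return [t for t in targets if shard_for(t, shards) == shard]


# Hypothetical target addresses, split across three Prometheus servers.
targets = [f"10.0.0.{i}:9100" for i in range(1, 21)]
assignment = {s: targets_for_shard(targets, s, 3) for s in range(3)}
```

Because the assignment depends only on the target address and the shard count, every server can compute its own share independently, without any coordination between servers.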
Trickster Setup
● Loading a complicated/big dashboard in Grafana can overload your
Prometheus server
○ Use Trickster to cache PromQL results for future reuse
○ Queries on metrics with high cardinality can use a lot of memory on
your Prometheus instance[1].
○ Use limits to make sure a user cannot overload your server:
--query.max-concurrency / --query.max-samples
● Delta proxy caching: Trickster inspects the time range of a client query to
determine which data points are already cached, and requests only the missing range from Prometheus
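The delta proxy idea can be sketched in a few lines of Python: given the time range already in the cache and the range a dashboard asks for, work out which parts still have to be fetched. This is a simplified illustration with a single cached extent; the function name and the timestamps are hypothetical and this is not Trickster's actual code.

```python
def missing_ranges(cached, requested):
    """Return the parts of `requested` not covered by `cached`.

    Both arguments are (start, end) tuples of Unix timestamps, or `cached`
    is None when nothing is cached yet.
    """
    req_start, req_end = requested
    if cached is None:
        return [requested]
    cache_start, cache_end = cached
    # No overlap at all: the whole request has to go to Prometheus.
    if req_end <= cache_start or req_start >= cache_end:
        return [requested]
    gaps = []
    if req_start < cache_start:
        gaps.append((req_start, cache_start))
    if req_end > cache_end:
        gaps.append((cache_end, req_end))
    return gaps


# Dashboard refresh: the last hour is requested, the first 55 minutes are cached.
to_fetch = missing_ranges(cached=(0, 3300), requested=(0, 3600))
```

In this example only the final five minutes are requested from Prometheus, which is why repeated dashboard reloads become much cheaper.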
1. https://www.robustperception.io/limiting-promql-resource-usage
Figure (Trickster partial cache hit): https://github.com/tricksterproxy/trickster/blob/main/docs/images/partial-cache-hit.png
Metrics Cardinality
● Prometheus performance almost always comes down to one thing: metrics
cardinality.
● Cardinality describes how many unique label combinations (time series) a metric has
○ The container_tasks_state metric will have a unique (pod/container) pair for each running
container in your cluster
○ custom_api_http_request will have a unique time series for each combination of
url/http_method/env. (/api/v2/users, get, dev; /api/v2/users, post, prod...)
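To make the arithmetic concrete, here is a small Python sketch that counts distinct label combinations for the custom_api_http_request example. The label values are invented for illustration; only the metric and label names come from the slide.

```python
from collections import defaultdict
from itertools import product

# Hypothetical label values for the custom_api_http_request metric.
urls = ["/api/v2/users", "/api/v2/orders", "/api/v2/items"]
methods = ["get", "post"]
envs = ["dev", "prod"]

# Every distinct combination of label values is its own time series.
samples = [
    ("custom_api_http_request", (("url", u), ("http_method", m), ("env", e)))
    for u, m, e in product(urls, methods, envs)
]


def cardinality(samples):
    """Count distinct label sets per metric name."""
    label_sets = defaultdict(set)
    for name, labels in samples:
        label_sets[name].add(labels)
    return {name: len(sets) for name, sets in label_sets.items()}


per_metric = cardinality(samples)  # 3 urls * 2 methods * 2 envs = 12 series
```

The count grows multiplicatively: adding a fourth label with ten possible values would turn these 12 series into 120.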
1. https://www.robustperception.io/cardinality-is-key
Bad Metrics Cardinality
● In our case, we threw away bad fluentd metrics and cut the number of
scraped metrics by half
● If you are using fluentd, look for fluentd_tail_file_inode and fluentd_tail_file_position
○ In our use case we saw a cardinality of 1220 per node from the two metrics above!
1. https://www.robustperception.io/cardinality-is-key
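In Prometheus this kind of filtering is typically done with `metric_relabel_configs` and a `drop` action matching on `__name__`. The Python sketch below only mimics that logic on a list of scraped samples; the sample data, the regex, and the `drop_metrics` helper are assumptions for illustration. `fullmatch` is used because Prometheus relabelling regexes are anchored at both ends.

```python
import re

# Metric names taken from the slide; the regex and sample data are illustrative.
DROP = re.compile(r"fluentd_tail_file_(inode|position)")


def drop_metrics(samples, pattern=DROP):
    """Keep only samples whose metric name does not fully match `pattern`."""
    return [s for s in samples if not pattern.fullmatch(s["__name__"])]


scraped = [
    {"__name__": "fluentd_tail_file_inode", "path": "/var/log/a.log"},
    {"__name__": "fluentd_tail_file_position", "path": "/var/log/a.log"},
    {"__name__": "fluentd_output_status_retry_count", "type": "elasticsearch"},
]
kept = drop_metrics(scraped)
```

Dropping at scrape time means the unwanted series never reach the TSDB, so they cost neither memory nor disk.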
Thanos/Cortex as the ultimate solution
● If you have multiple Kubernetes clusters or datacenters with millions of
metrics, and adding more CPU/memory to Prometheus is not an option:
○ Consider adding Thanos/Cortex to your infrastructure
● Thanos Querier provides HA for Prometheus servers: it can load metrics from multiple
Prometheus servers and make sure it presents complete data to the user.
○ Implements the Prometheus HTTP v1 API.
● Thanos Compactor can downsample your metrics and apply retention per
resolution.
● Thanos Store serves metrics kept in an AWS S3
compatible object store.
Thanos architecture (diagram on the original slide)
Thanos Sidecar
● It implements Thanos’ Store API on top of Prometheus’ remote-read API. This allows
Queriers to treat Prometheus servers as yet another source of time series data without
directly talking to its APIs.
● Optionally, the sidecar uploads TSDB blocks to an object storage bucket as Prometheus
produces them every 2 hours. This allows Prometheus servers to be run with relatively
low retention while their historic data is made durable and queryable via object storage.
● Optionally, the sidecar can watch Prometheus rules and configuration,
decompress them and substitute environment variables if needed, and ping Prometheus to
reload them.
Thanos Query
● The PromQL query is posted to the Querier
● It interprets the query and goes to a pre-filter
● The Querier fans out its request to stores, Prometheus servers or other Queriers on the basis of labels and
time-range requirements
● The Querier only sends and receives StoreAPI messages
● After it has collected all the responses, it merges and deduplicates them (if enabled)
● It then sends the series back to the user
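The merge-and-deduplicate step can be sketched as follows. This is a greatly simplified Python illustration: series that differ only in a replica label are merged into one, and gaps in one replica are filled from the other. The `replica` label name, the data, and the "first replica wins" rule are assumptions made here; Thanos's real deduplication logic is more involved.

```python
def deduplicate(series, replica_label="replica"):
    """Merge series that differ only in the replica label.

    `series` is a list of (labels, samples) pairs, where labels is a dict and
    samples is a dict of timestamp -> value. For each timestamp the first
    replica seen wins; other replicas only fill gaps.
    """
    merged = {}
    for labels, samples in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        target = merged.setdefault(key, {})
        for ts, value in samples.items():
            target.setdefault(ts, value)
    return {key: dict(sorted(points.items())) for key, points in merged.items()}


# Two HA replicas scraped the same target; "a" missed t=20 and "b" missed t=30.
responses = [
    ({"__name__": "up", "job": "api", "replica": "a"}, {10: 1.0, 30: 1.0}),
    ({"__name__": "up", "job": "api", "replica": "b"}, {10: 1.0, 20: 1.0}),
]
result = deduplicate(responses)
```

The user sees a single `up{job="api"}` series with no gaps, even though neither replica had complete data on its own.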
1. https://banzaicloud.com/img/blog/multi-cluster-monitoring/life_of_a_query.png
Questions?
Thank You.
We are hiring remote DevOps/Kubernetes
engineers.
adam.hamsik@lablabs.io
www.lablabs.io
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

Monitoring with Prometheus at Scale

  • 1. Monitoring with Prometheus at Scale Adam Hamsik adam.hamsik@lablabs.io
  • 2. Labyrinth Labs: Rock-solid infrastructure and DevOps
    ● Building rock-solid and secure foundations for all your digital operations. Our mission is to let you focus on your business without ever needing to worry about technical issues again.
    ● Making you ready for growing traffic, safe against new security vulnerabilities and data loss.
  • 3. TL;DR
    ● We will start with common monitoring issues and problems.
    ● Deploying Prometheus is easy, and running a single instance can be sufficient for most deployments.
    ● We will have a quick look at Alertmanager.
    ● We will talk about the scalability limits of a Prometheus instance, and when and how to use sharding.
    ● What Trickster is and why you should use it too.
    ● How Thanos/Cortex can help you when all hope is lost.
  • 4. Common Monitoring Problems
    ● Monitoring tools are limited both technically and conceptually.
    ● Most existing tools don’t really scale with current infrastructure needs.
    ● Limited visibility
      ○ Generally we want to monitor and gather as much information as we can.
      ○ Even if we don’t need it right away, it will usually be useful in the future (I promise).
    ● No common application monitoring interface. There are different protocols/standards:
      ○ OpenMetrics
      ○ SNMP
  • 6. Prometheus Monitoring System
    The Prometheus monitoring system and time series database is a CNCF graduated project.
    ● Originally developed by ex-Googlers for SoundCloud as their internal monitoring system
    ● Inspired by Google’s Borgmon monitoring system
    ● Open source under the Apache License 2.0
    ● Written as a monolithic application in Go
  • 7. Prometheus Server Overview
    ● Multi-dimensional data model with time series data identified by metric name and key/value pairs (labels)
    ● PromQL, a flexible query language to leverage this dimensionality
    ● No reliance on distributed storage; single server nodes are autonomous
    ● Targets are discovered via service discovery or static configuration
    ● Pushing time series is supported via an intermediary gateway
    ● Monitor services, not machines/servers
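To make the data model on this slide concrete, here is a minimal Python sketch of how a time series can be identified by its metric name plus its label set. The metric name, labels and values are illustrative assumptions, not taken from the talk, and this is not how Prometheus stores data internally.

```python
from typing import Dict, FrozenSet, List, Tuple

# A series identity is the metric name plus its set of key/value labels.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity; the order of labels does not matter."""
    return (name, frozenset(labels.items()))


# Hypothetical samples: (metric name, labels, value).
samples = [
    ("http_requests_total", {"method": "get", "env": "prod"}, 10.0),
    ("http_requests_total", {"env": "prod", "method": "get"}, 12.0),  # same series
    ("http_requests_total", {"method": "post", "env": "prod"}, 3.0),  # new series
]

# Group sample values by series identity.
store: Dict[SeriesKey, List[float]] = {}
for name, labels, value in samples:
    store.setdefault(series_key(name, labels), []).append(value)
```

The first two samples land in the same series because they share the name and every label pair; changing any single label value creates a new series.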
  • 9. Company Prometheus Usage
    ● Deploy the first Prometheus servers
    ● Add some services
    ● Set up Trickster as a cache for Grafana
    ● Add more services/servers
    ● Keep adding CPU/memory to the Prometheus instance
    ● Set up simple federation/sharding if a single instance gets too big
    ● Use Thanos
  • 10. First Prometheus Deployment
    ● Deploying your first Prometheus server is very easy: fetch the Prometheus binary and a config file.
    ● There is no concept of a Prometheus cluster.
    ● Generally Prometheus scales very well with CPU/memory.
      ○ Providing more CPU/memory allows Prometheus to monitor more metrics.
      ○ It’s hard to run a large pod in a Kubernetes cluster if it’s as big as a worker node.
    ● If a job is too big for a single server, you can use federation/sharding (remote reads) for simple scaling.
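The sharding idea from this slide can be sketched as a stable hash of each target taken modulo the number of Prometheus servers. The sketch below is a simplified stand-in for that idea (Prometheus offers it through the `hashmod` relabel action); it is not claimed to produce the same shard numbers as Prometheus, and the target addresses are hypothetical.

```python
import hashlib
from typing import Dict, List


def shard_for(target: str, shards: int) -> int:
    """Map a target address to a shard index using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % shards


def assign(targets: List[str], shards: int) -> Dict[int, List[str]]:
    """Group targets by the shard (Prometheus server) that should scrape them."""
    plan: Dict[int, List[str]] = {i: [] for i in range(shards)}
    for target in targets:
        plan[shard_for(target, shards)].append(target)
    return plan


# Hypothetical scrape targets split across three Prometheus servers.
targets = [f"10.0.0.{i}:9100" for i in range(1, 21)]
plan = assign(targets, 3)
```

Because the hash depends only on the target address, every server can compute the same assignment independently and keep only its own share of targets.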
  • 11. Trickster Setup
    ● Loading a complicated or large Grafana dashboard can overload your Prometheus server.
      ○ Use Trickster to cache PromQL results for future reuse.
      ○ Queries on metrics with high cardinality can use a lot of memory on your Prometheus instance [1].
      ○ Use the query.max-concurrency and query.max-samples limits to make sure users cannot overload your server.
    ● Delta proxy caching: Trickster inspects the time range of a client query to determine which data points are already cached, so only the missing ones have to be requested from Prometheus.
    1. https://www.robustperception.io/limiting-promql-resource-usage
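The delta proxy caching idea can be sketched as an interval computation: given the requested time range and the ranges already cached, work out which gaps still have to be fetched. This is a minimal illustration of the concept, not Trickster's actual implementation; the function name and the half-open range convention are assumptions.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (start, end) in seconds, end exclusive


def missing_ranges(query: Range, cached: List[Range]) -> List[Range]:
    """Return the parts of `query` that are not covered by any cached range."""
    start, end = query
    gaps: List[Range] = []
    cursor = start
    for c_start, c_end in sorted(cached):
        if c_end <= cursor:
            continue  # cached range lies entirely before the cursor
        if c_start >= end:
            break  # cached range lies entirely after the query
        if c_start > cursor:
            gaps.append((cursor, c_start))  # uncovered stretch before this range
        cursor = max(cursor, c_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))  # uncovered tail of the query
    return gaps
```

For a dashboard that refreshes every minute over a one-hour window, almost the whole window is already cached, so only the newest slice would be requested from Prometheus.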
  • 14. Metrics Cardinality
    ● Prometheus performance almost always comes down to one thing: metric cardinality.
    ● Cardinality describes how many unique label-value combinations (time series) a metric has.
      ○ The container_tasks_state metric will have a unique (pod, container) pair for each running container in your cluster.
      ○ custom_api_http_request will have a unique series for each combination of url/http_method/env (/api/v2/users, get, dev; /api/v2/users, post, prod; ...).
    1. https://www.robustperception.io/cardinality-is-key
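As a rough worked example of how cardinality grows, the sketch below multiplies the number of distinct values per label to get an upper bound on the series count for one metric. The label names follow the slide's custom_api_http_request example; the specific values (including the second URL) are hypothetical.

```python
from itertools import product
from typing import Dict, List


def max_cardinality(label_values: Dict[str, List[str]]) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total


# Hypothetical label values for the custom_api_http_request metric.
label_values = {
    "url": ["/api/v2/users", "/api/v2/orders"],
    "http_method": ["get", "post"],
    "env": ["dev", "prod"],
}

# Every combination of label values is a separate time series.
series = list(product(*label_values.values()))
```

Here three labels with two values each already allow 2 × 2 × 2 = 8 series; a label with unbounded values (user IDs, file paths, inodes) multiplies that number without limit.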
  • 15. Bad Metrics Cardinality
    ● See the example below, where we throw away bad fluentd metrics and cut the number of scraped metrics by half.
    ● If you are using fluentd, look for fluentd_tail_file_inode and fluentd_tail_file_position.
      ○ In our use case the two metrics above had a cardinality of 1220 per node!
    1. https://www.robustperception.io/cardinality-is-key
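The slide's original example is not part of this transcript. As a hedged illustration of the same idea, the Python sketch below filters scraped series by metric name, which is roughly what a `metric_relabel_configs` rule with `action: drop` does in Prometheus. The sample series and the `path` label are hypothetical.

```python
import re
from typing import Dict, List

# Names to drop; matched against the whole metric name, as Prometheus
# anchors relabel regexes at both ends.
DROP = re.compile(r"fluentd_tail_file_(inode|position)")


def drop_unwanted(scraped: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only series whose __name__ does not match the drop pattern."""
    return [s for s in scraped if not DROP.fullmatch(s["__name__"])]


# Hypothetical scrape result: one label set per series.
scraped = [
    {"__name__": "fluentd_tail_file_inode", "path": "/var/log/a.log"},
    {"__name__": "fluentd_tail_file_position", "path": "/var/log/a.log"},
    {"__name__": "fluentd_output_status_retry_count", "type": "elasticsearch"},
]
kept = drop_unwanted(scraped)
```

Dropping at scrape time means the high-cardinality series never reach the TSDB, so they cost neither memory nor disk.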
  • 16. Thanos/Cortex as the Ultimate Solution
    ● If you have multiple Kubernetes clusters or datacenters with millions of metrics, and adding more CPU/memory to Prometheus is not an option, consider adding Thanos/Cortex to your infrastructure.
    ● Thanos Querier provides HA for Prometheus servers: it can load metrics from multiple Prometheus servers and make sure the user is presented with complete data.
      ○ Implements the Prometheus HTTP v1 API.
    ● Thanos Compactor can downsample your metrics and change their retention or resolution.
    ● Thanos Store is a component that serves metrics kept in an AWS S3-compatible object store.
  • 18. Thanos Sidecar
    ● It implements Thanos’ Store API on top of Prometheus’ remote-read API. This allows Queriers to treat Prometheus servers as yet another source of time series data without directly talking to its APIs.
    ● Optionally, the sidecar uploads TSDB blocks to an object storage bucket as Prometheus produces them every 2 hours. This allows Prometheus servers to be run with relatively low retention while their historic data is made durable and queryable via object storage.
    ● Optionally, the Thanos sidecar is able to watch Prometheus rules and configuration, decompress and substitute environment variables if needed, and ping Prometheus to reload them.
  • 19. Thanos Query
    ● The PromQL query is posted to the Querier.
    ● It interprets the query and goes to a pre-filter.
    ● The Querier fans out its request to stores, Prometheus instances or other Queriers on the basis of labels and time-range requirements.
    ● The Querier only sends and receives StoreAPI messages.
    ● After it has collected all the responses, it merges and deduplicates them (if enabled).
    ● It then sends the series back to the user.
    1. https://banzaicloud.com/img/blog/multi-cluster-monitoring/life_of_a_query.png
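The merge-and-deduplicate step can be sketched as follows: series that differ only in a replica label are treated as the same series, and their samples are merged by timestamp. This is a deliberately naive illustration, not Thanos' real deduplication algorithm; the `replica` label name and the sample data are assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]  # (timestamp, value)
Series = Tuple[Labels, List[Sample]]
Key = FrozenSet[Tuple[str, str]]


def merge_and_dedup(responses: List[Series],
                    replica_label: str = "replica") -> Dict[Key, List[Sample]]:
    """Merge series that are identical once the replica label is removed."""
    merged: Dict[Key, Dict[int, float]] = {}
    for labels, samples in responses:
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            bucket.setdefault(ts, value)  # first response wins on a duplicate timestamp
    return {key: sorted(points.items()) for key, points in merged.items()}


# Hypothetical responses from two HA Prometheus replicas scraping the same target.
responses: List[Series] = [
    ({"__name__": "up", "job": "node", "replica": "a"}, [(1, 1.0), (2, 1.0)]),
    ({"__name__": "up", "job": "node", "replica": "b"}, [(2, 1.0), (3, 1.0)]),
]
result = merge_and_dedup(responses)
```

With two replicas that each missed a different scrape, the merged series covers all three timestamps, which is the "full data" view the Querier aims to present.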
  • 21. Thank You. We are hiring remote DevOps/Kubernetes engineers. adam.hamsik@lablabs.io www.lablabs.io