Hardware-level data-center
monitoring with Prometheus
Conrad Hoffmann
Outline
I. Our data-center
II. Brief intro to Prometheus
III. All my exporters
IV. TL;DR & Soon™
I. Our data-center
AMS5
2118 servers
56 racks
200 network devices
2 * 2 generic uplinks
3 AWS Direct Connect
3 Google X-Connect
Where we started...
& NRPE
Cloud Watch
Cacti
What’s paging you at night?
Collection Visualization Alerting
Cacti ✔ ✔ ✔
CloudWatch ✔ ✔ ✔
Ganglia ✔
Graphite ✔ ✔
Icinga/Nagios ✔ ✔ ✔
Smokeping ✔ ✔ ✔
Statsd ✔
https://xkcd.com/927/
II. Brief intro to Prometheus
prometheus.io
The Promise of Prometheus
Prometheus is a reliable, scalable, flexible monitoring and
alerting system that is easy to integrate and focused on
real-time metrics.
Prometheus: reliability
● Pull-based (“scrape”)
● List of known targets
○ Can be dynamic, e.g. DNS or service discovery
● Built-in meta-monitoring
● Redundancy is easy
Prometheus: scalability
● Performant, efficient storage
● Scales well to available resources
● Easy to scale horizontally
● Federation
Prometheus: flexibility
● Multi-dimensional, label-based data model
● Each data point is defined by
○ A metric name
○ An arbitrary number of key-value pairs (labels)
○ A value
○ A timestamp (added by Prometheus)
● Data points with identical metric names and labels form a time series
● Powerful query language allows for easy aggregation based on labels
Prometheus: flexibility
Target exposes:
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
Possible query:
sum(http_responses_total{backend="foo"})
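To make the data model concrete, here is a minimal Python sketch of what the query above does with the four exposed samples. It is not how Prometheus evaluates queries internally; the helper names `parse` and `sum_matching` are made up for this illustration, and the parser only handles simple lines like the ones on the slide.

```python
import re

# The four sample lines from the slide, in the Prometheus text format.
EXPOSITION = '''\
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
'''

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Parse simple exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels))
            samples.append((name, labels, float(value)))
    return samples


def sum_matching(samples, name, **matchers):
    """Mimic sum(name{label="value", ...}): add up all matching series."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(key) == want for key, want in matchers.items())
    )


samples = parse(EXPOSITION)
# backend="foo" selects the 2xx and 4xx series: 804 + 3170
print(sum_matching(samples, "http_responses_total", backend="foo"))  # 3974.0
```

The same selection by a different label, e.g. `code="4xx"`, aggregates across backends instead, which is the point of the label-based model.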
Prometheus: ease of integration
● Data format is text based
● Scrapes are HTTP requests
● Many integrations exist already
● Excellent tooling/libraries to write new ones
Prometheus: ease of integration
[Diagrams: an application; a host with the node exporter; a host with the SNMP exporter and routers A and B on the network]
Alertmanager
Nomen est omen...
● Alerting
● Silencing
● Alert grouping & routing
● High availability
Grafana
grafana.com
Displays data from many sources:
● Prometheus
● Graphite
● Influx
● OpenTSDB
● Elasticsearch
● MySQL/Postgres
● CloudWatch
● ...
III. All my exporters
Now with Protips!
Node exporter
● Exports: OS- and hardware-level metrics for running systems
● Replaces: Ganglia, some Icinga/NRPE checks
● Noteworthy:
○ Comes with many collectors built-in
○ Use WMI exporter on Windows
Protip I
Use the node exporter’s text file collector as an easy integration point for
custom metrics!
Examples: Chef data, RAID controller data, SMART data, cron jobs, ...
[Diagram: on the host, a script writes a text file that the node exporter reads]
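A script feeding the text file collector could look roughly like the following Python sketch. It assumes the node exporter was started with `--collector.textfile.directory` pointing at the target directory and that it picks up files ending in `.prom`; the function names and the example metric name are hypothetical.

```python
import os
import tempfile


def render_metrics(metrics):
    """Render {name: value} as Prometheus text-format lines."""
    return "".join(
        f"{name} {value}\n" for name, value in sorted(metrics.items())
    )


def write_textfile(directory, filename, metrics):
    """Write metrics atomically so a scrape never sees a partial file."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(render_metrics(metrics))
    # mkstemp creates the file with mode 0600; make it readable for the
    # user the node exporter runs as.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    return final_path
```

A cron job could end with a call such as `write_textfile("/var/lib/node_exporter", "backup.prom", {"backup_last_success_timestamp_seconds": 1528700000})`, where both the path and the metric are placeholders.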
Blackbox exporter
● Exports: data about probes against endpoints that don’t support
Prometheus natively (DNS, HTTP(S), ICMP, TCP)
● Replaces: Smokeping, some Icinga checks
● Noteworthy:
○ Monitor TLS certificate expiry :)
Blackbox exporter - Smokeping replacement
1. Send ICMP probe every five seconds
Blackbox exporter - Smokeping replacement
2. Alert on target down and packet loss
ALERT SmokepingTargetDown
IF probe_success{job="smokeping"} == 0
FOR 2m
ALERT SmokepingTargetPacketLoss
IF 100 * (1 - avg_over_time(probe_success{job="smokeping"}[2m])) > 20
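The packet loss expression works because `probe_success` is 1 for a successful probe and 0 for a failed one, so its average over a window is the success ratio. A small Python sketch of the same arithmetic (the function name and the sample window are made up for illustration):

```python
def packet_loss_percent(probe_results):
    """Same arithmetic as 100 * (1 - avg_over_time(probe_success[window]))."""
    if not probe_results:
        raise ValueError("no samples in window")
    return 100 * (1 - sum(probe_results) / len(probe_results))


# A 2m window at a 5s probe interval holds roughly 24 samples.
# Hypothetical window: 18 successful probes, 6 failed ones.
window = [1] * 18 + [0] * 6
loss = packet_loss_percent(window)
print(loss, loss > 20)  # 25.0 True
```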
Blackbox exporter - Smokeping replacement
3. Use Prometheus aggregation functions in Grafana
Protip II
Scrape more, scrape faster!
● ~ 1M metrics
● > 5000 targets
● Mostly 10s scrape interval, some 5s, some longer
● 50 days retention time
● 250 GB storage ¯_(ツ)_/¯
SNMP exporter
● Exports: SNMP data from network devices
● Replaces: Cacti
● Noteworthy:
○ a pain to configure
SNMP exporter - Cacti replacement
Once you have got the right SNMP config, alerts and nice graphs are easy!
SNMP exporter - Cacti replacement
Cacti’s killer feature: the weathermap plugin!
https://network-weathermap.com/
SNMP exporter - Cacti replacement
There is a diagram panel type in Grafana, but…
… we’re not quite there yet ¯_(ツ)_/¯
Protip III
Build a dedicated long-term Prometheus server:
● Scrape only a few selected metrics
● Yank retention time way up
● Make backups (hot backups possible in Prometheus 2.1 and later)
Very useful data for estimating e.g. future bandwidth needs!
Collins exporter - Collins?
● https://tumblr.github.io/collins
● Infrastructure management / IPAM
● Server inventory, classification and lifecycle management
Collins exporter
● Exports: asset inventory data from Collins
● Replaces: a bunch of scripts
● Noteworthy:
○ https://github.com/soundcloud/collins_exporter
Collins exporter
● Another candidate for long-term storage
● Valuable data for capacity planning
Protip IV
Build your own integrations!
Collins exporter:
● Written in Go
● 1 source file
● 264 lines total ¯_(ツ)_/¯
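To give a feel for how little an exporter needs, here is a minimal sketch using only the Python standard library. It is not the Collins exporter (which is written in Go); `collect`, the metric name, and the port are hypothetical stand-ins for a real data source.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect():
    """Hypothetical data source; a real exporter would query its backend."""
    return {"example_assets_total": 42}


def render(metrics):
    """Render {name: value} as Prometheus text-format lines."""
    return "".join(
        f"{name} {value}\n" for name, value in sorted(metrics.items())
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes are plain HTTP GET requests.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at the chosen port would be enough to try it out; for anything beyond a sketch, the official client libraries handle metric types and escaping.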
IPMI exporter
● Exports: IPMI data retrieved from BMCs
● Replaces: many Nagios/NRPE checks
● Noteworthy:
○ https://github.com/soundcloud/ipmi_exporter
○ Works regardless of the host's power state
IPMI exporter
● Mostly sensor data: temperature, fans, power consumption
● Mostly used for alerting:
○ Fans
○ Power supplies
○ Batteries
Protip V
Make use of techniques to ingest non-numeric data!*
● Use labels to expose (semi-)static data of interest
*...but do it with some caution!
ipmi_bmc_info{firmware_revision="2.52",manufacturer_id="Dell_Inc"} 1
Protip V
Make use of techniques to ingest non-numeric data!*
● Use labels and binary values to represent state
*...but do it with some caution!
collins_asset_state{tag="ABCD1234",state="Allocated"} 1
collins_asset_state{tag="ABCD1234",state="Maintenance"} 0
collins_asset_state{tag="ABCD1234",state="Unallocated"} 0
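The state encoding above amounts to one series per possible state, with value 1 for the current state and 0 for all others. A short Python sketch of how an exporter might generate such lines (the helper name `state_samples` is made up for this example):

```python
def state_samples(metric, tag, current, states):
    """Emit one sample per possible state: 1 for the current one, else 0."""
    lines = []
    for state in states:
        value = 1 if state == current else 0
        lines.append(f'{metric}{{tag="{tag}",state="{state}"}} {value}')
    return lines


STATES = ["Allocated", "Maintenance", "Unallocated"]
for line in state_samples("collins_asset_state", "ABCD1234", "Allocated", STATES):
    print(line)
```

The caution from the slide applies here: every possible state becomes its own time series per asset, so the number of series grows with assets times states.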
And now: merging data sources
Example: BMC Firmware revisions of certain server types
And now: merging data sources
Query: ipmi_bmc_info{firmware_revision!="2.52"}
Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}
Query: collins_asset_details{nodeclass="app-2"}
Result: collins_asset_details{ipmi_address="10.1.2.3",...}
Query: label_replace(ipmi_bmc_info, "ipmi_address", "$1", "instance", "(.*)")
Result: ipmi_bmc_info{firmware_revision="2.41",ipmi_address="10.1.2.3",...}
And now: merging data sources
Query: collins_asset_details{nodeclass="app-2"} *
on (ipmi_address)
group_left(firmware_revision)
label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",
nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234"}
And now: merging data sources
Query: collins_asset_details{nodeclass="app-2"} *
on (ipmi_address)
group_left(firmware_revision)
label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)")
* on (tag) group_left(status) (collins_asset_status == 1)
Result: {firmware_revision="2.41",ipmi_address="10.1.2.3",
nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234",status="Allocated"}
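For readers new to vector matching, the first join above can be emulated in a few lines of Python. This is only a sketch of the semantics for info-style series whose value is 1, and it assumes at most one right-hand series per join key; the second IPMI entry is invented sample data, and the helper names are hypothetical.

```python
def label_replace_copy(series, dst, src):
    """Like label_replace(v, dst, "$1", src, "(.*)"): copy src into dst."""
    return [{**labels, dst: labels[src]} for labels in series]


def join_group_left(left, right, on, extra):
    """Like `left * on(on) group_left(extra) right` for value-1 series:
    keep the left-hand labels and add `extra` from the matching right side."""
    index = {labels[on]: labels for labels in right}
    joined = []
    for labels in left:
        match = index.get(labels[on])
        if match is not None:
            joined.append({**labels, extra: match[extra]})
    return joined


collins = [
    {"tag": "ABCD1234", "nodeclass": "app-2",
     "ipmi_address": "10.1.2.3", "primary_address": "10.10.20.30"},
]
ipmi = [
    {"instance": "10.1.2.3", "firmware_revision": "2.41"},
    {"instance": "10.1.2.4", "firmware_revision": "2.52"},  # invented entry
]

# ipmi_bmc_info{firmware_revision!="2.52"}
outdated = [s for s in ipmi if s["firmware_revision"] != "2.52"]
result = join_group_left(
    collins,
    label_replace_copy(outdated, "ipmi_address", "instance"),
    on="ipmi_address",
    extra="firmware_revision",
)
print(result)
```

The second join on `tag` with `group_left(status)` follows the same pattern, with the `== 1` filter selecting the series for the asset's current status.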
Where we are now...
[Figure: the tool logos from "Where we started...", most of them now crossed out]
What’s paging you at night?
Collection Visualization Alerting
CloudWatch ✔ ✔ ✔
Graphite (✔)
Prometheus ✔
Grafana ✔
Alertmanager ✔
What’s up with this CloudWatch thing?
● There is a CloudWatch exporter
● However, CloudWatch's internal architecture is fundamentally
incompatible with Prometheus
● Using CloudWatch as Grafana data source can incur costs
IV. TL;DR & Soon™
So, is it working?
● Yes
Was it worth it?
● Yes
Why was it worth it?
● Many integrations readily available
● New ones are easy to write
● Quality and quantity of monitoring have
increased
● Monitoring and alerting have become much
more consistent
● Easy to merge data sources for alerting or
graphing
This is true across the entire organization, not just infrastructure!
Soon: long term storage
● Not a primary concern for Prometheus
● Simple solution: a dedicated long-term Prometheus server (see Protip III)
● Remote (read/)write interface
● Some features in Prometheus 2.0 to allow external solutions
○ Check out e.g. Thanos: https://github.com/improbable-eng/thanos
Soon: forging a standard?
OpenMetrics working group
● https://github.com/RichiH/OpenMetrics
This is the end...
Thank you!

Visitor Management System in India- Vizman.appVisitor Management System in India- Vizman.app
Visitor Management System in India- Vizman.app
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Why React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdfWhy React Native as a Strategic Advantage for Startup Innovation.pdf
Why React Native as a Strategic Advantage for Startup Innovation.pdf
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good Practices
 
Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024Explore Modern SharePoint Templates for 2024
Explore Modern SharePoint Templates for 2024
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 

OSDC 2018 | Hardware-level data-center monitoring with Prometheus by Conrad Hoffmann

  • 1. Hardware-level data-center monitoring with Prometheus Conrad Hoffmann
  • 2. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™
  • 3. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™ AMS5
  • 7. 2118 servers 56 racks 200 network devices
  • 8. 2118 servers 56 racks 200 network devices 2 * 2 generic uplinks 3 AWS Direct Connect 3 Google X-Connect
  • 9. Where we started... & NRPE Cloud Watch Cacti
  • 10. What’s paging you at night? Collection Visualization Alerting Cacti ✔ ✔ ✔ CloudWatch ✔ ✔ ✔ Ganglia ✔ Graphite ✔ ✔ Icinga/Nagios ✔ ✔ ✔ Smokeping ✔ ✔ ✔ Statsd ✔
  • 12. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™ prometheus.io
  • 13. The Promise of Prometheus Prometheus is a reliable, scalable, flexible monitoring and alerting system that is easy to integrate and focused on real time metrics.
  • 14. Prometheus: reliability ● Pull-based (“scrape”) ● List of known targets ○ Can be dynamic, e.g. DNS or service discovery ● Built-in meta-monitoring ● Redundancy is easy
  • 15. Prometheus: scalability ● Performant, efficient storage ● Scales well to available resources ● Easy to scale horizontally ● Federation
  • 16. Prometheus: flexibility ● Multi-dimensional, label-based data model ● Each data point is defined by ○ A metric name ○ An arbitrary number of key-value pairs (labels) ○ A value ○ A timestamp (added by Prometheus) ● Data points with identical metric names and labels form a time series ● Powerful query language allows for easy aggregation based on labels
  • 17. Prometheus: flexibility Target exposes: http_responses_total{backend="foo",code="2xx"} 804 http_responses_total{backend="foo",code="4xx"} 3170 http_responses_total{backend="bar",code="2xx"} 6637 http_responses_total{backend="bar",code="4xx"} 26 Possible query: sum(http_responses_total{backend="foo"})
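The label-based aggregation on the slide above can be mimicked in a few lines of Python. This is only an illustrative sketch of what `sum(http_responses_total{backend="foo"})` computes over the four exposed samples. The helper names `parse_sample` and `sum_matching` are made up for this example, and the parser handles only the simple `name{label="value",...} number` form, not the full exposition format.

```python
import re

# The four samples from the slide, as a target would expose them.
EXPOSED = '''\
http_responses_total{backend="foo",code="2xx"} 804
http_responses_total{backend="foo",code="4xx"} 3170
http_responses_total{backend="bar",code="2xx"} 6637
http_responses_total{backend="bar",code="4xx"} 26
'''

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unsupported line: {line!r}")
    name, raw_labels, value = match.groups()
    return name, dict(LABEL_RE.findall(raw_labels)), float(value)


def sum_matching(text, metric, **matchers):
    """Sum all samples of `metric` whose labels equal every given matcher."""
    total = 0.0
    for line in text.splitlines():
        if not line.strip():
            continue
        name, labels, value = parse_sample(line)
        if name == metric and all(labels.get(k) == v for k, v in matchers.items()):
            total += value
    return total


# Rough equivalent of: sum(http_responses_total{backend="foo"})
foo_total = sum_matching(EXPOSED, "http_responses_total", backend="foo")
print(foo_total)
```

Changing the matcher to `code="2xx"` aggregates across backends instead, which is the point of the multi-dimensional data model: the same series can be sliced along any label.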
  • 18. Prometheus: ease of integration ● Data format is text based ● Scrapes are HTTP requests ● Many integrations exist already ● Excellent tooling/libraries to write new ones
  • 21. Prometheus: ease of integration (diagram labels: Host, SNMP exporter, Router A, Router B, Network)
  • 23. Alertmanager: Nomen est omen... ● Alerting ● Silencing ● Alert grouping & routing ● High availability
  • 24. Grafana (grafana.com) Displays data from many sources: ● Prometheus ● Graphite ● Influx ● OpenTSDB ● Elasticsearch ● MySQL/Postgres ● CloudWatch ● ...
  • 25. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™ Now with Protips!
  • 26. Node exporter ● Exports: OS- and hardware-level metrics for running systems ● Replaces: Ganglia, some Icinga/NRPE checks ● Noteworthy: ○ Comes with many collectors built-in ○ Use WMI exporter on Windows
  • 27. Protip I Use the node exporter’s text file collector as an easy integration point for custom metrics! Examples: Chef data, RAID controller data, SMART data, cron jobs, ... (diagram labels: Host, script, Text file, node exporter)
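As a rough illustration of Protip I, the following Python sketch shows what a script feeding the text file collector might look like. It assumes the node exporter is started with `--collector.textfile.directory` pointing at the directory the script writes to, and that the file name ends in `.prom`. The metric name, label, and helper functions are hypothetical, and label values are not escaped here.

```python
import os
import tempfile


def render_metric(name, value, labels=None, help_text=None, metric_type="gauge"):
    """Render one sample in the Prometheus text format (no label escaping)."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} {metric_type}")
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


def write_textfile(directory, filename, content):
    """Write to a temp file, then rename, so a half-written file is never read."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(content)
    # mkstemp creates the file with mode 0600; make it readable for the exporter.
    os.chmod(tmp_path, 0o644)
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path


# Demo: write into a throwaway directory instead of a real collector directory.
demo_dir = tempfile.mkdtemp()
text = render_metric(
    "cron_backup_last_success_timestamp_seconds",  # hypothetical metric name
    1528800000,
    labels={"job_name": "nightly"},
    help_text="Unix time of the last successful backup run.",
)
print(write_textfile(demo_dir, "cron_backup.prom", text))
```

A cron job could call something like this at the end of each run; the node exporter then exposes the value on its next scrape without the script needing its own HTTP endpoint.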
  • 28. Blackbox exporter ● Exports: data about probes against endpoints that don’t support Prometheus natively (DNS, HTTP(S), ICMP, TCP) ● Replaces: Smokeping, some Icinga checks ● Noteworthy: ○ Monitor TLS certificate expiry :)
  • 29. Blackbox exporter - Smokeping replacement 1. Send ICMP probe every five seconds
  • 30. Blackbox exporter - Smokeping replacement 2. Alert on target down and packet loss ALERT SmokepingTargetDown IF probe_success{job="smokeping"} == 0 FOR 2m ALERT SmokepingTargetPacketLoss IF 100 * (1 - avg_over_time(probe_success{job="smokeping"}[2m])) > 20
  • 31. Blackbox exporter - Smokeping replacement 3. Use Prometheus aggregation functions in Grafana
  • 32. Blackbox exporter - Smokeping replacement
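The packet-loss expression in the alert above is plain arithmetic over the 0/1 values of `probe_success`. The Python sketch below works through it, assuming one probe every five seconds and therefore roughly 24 samples in a two-minute window (the real count can differ by one depending on timing). The function name and the sample lists are illustrative only.

```python
def packet_loss_percent(probe_results):
    """Mirror 100 * (1 - avg_over_time(probe_success[2m])) for 0/1 samples."""
    if not probe_results:
        raise ValueError("no samples in window")
    return 100 * (1 - sum(probe_results) / len(probe_results))


# One probe every 5 seconds over a 2 minute window.
SAMPLES_PER_WINDOW = 120 // 5

# 1 = probe succeeded, 0 = probe failed.
four_lost = [0] * 4 + [1] * 20
five_lost = [0] * 5 + [1] * 19

for window in (four_lost, five_lost):
    loss = packet_loss_percent(window)
    print(f"{loss:.1f}% loss, alert fires: {loss > 20}")
```

Under these assumptions the 20% threshold corresponds to five or more failed probes out of 24, so a single dropped ICMP packet does not trigger the alert.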
  • 33. Protip II Scrape more, scrape faster! ● ~ 1M metrics ● > 5000 targets ● Mostly 10s scrape interval, some 5s, some longer ● 50 days retention time ● 250 GB storage ¯_(ツ)_/¯
  • 34. SNMP exporter ● Exports: SNMP data from network devices ● Replaces: Cacti ● Noteworthy: ○ a pain to configure
  • 35. SNMP exporter - Cacti replacement Once you have got the right SNMP config, alerts and nice graphs are easy!
  • 36. SNMP exporter - Cacti replacement Cacti’s killer feature: the weathermap plugin! https://network-weathermap.com/
  • 37. SNMP exporter - Cacti replacement There is a diagram panel type in Grafana, but… … we’re not quite there yet ¯_(ツ)_/¯
  • 39. Protip III Build a dedicated long-term Prometheus server: ● Scrape only a few selected metrics ● Yank retention time way up ● Make backups (hot backups possible in Prometheus >2.1) Very useful data for estimating e.g. future bandwidth needs!
  • 40. Collins exporter - Collins? ● https://tumblr.github.io/collins ● Infrastructure management / IPAM ● Server inventory, classification and lifecycle management
  • 41. Collins exporter ● Exports: asset inventory data from Collins ● Replaces: a bunch of scripts ● Noteworthy: ○ https://github.com/soundcloud/collins_exporter
  • 43. Collins exporter ● Another candidate for long-term storage ● Valuable data for capacity planning
  • 44. Protip IV Build your own integrations! Collins exporter: ● Written in Go ● 1 source file ● 264 lines total ¯_(ツ)_/¯
  • 45. IPMI exporter ● Exports: IPMI data retrieved from BMCs ● Replaces: many Nagios/NRPE checks ● Noteworthy: ○ https://github.com/soundcloud/ipmi_exporter ○ Works regardless of the host’s power state
  • 46. IPMI exporter ● Mostly sensor data: temperature, fans, power consumption ● Mostly used for alerting: ○ Fans ○ Power supplies ○ Batteries
  • 47. Protip V Make use of techniques to ingest non-numeric data!* ● Use labels to expose (semi-)static data of interest *...but do it with some caution! ipmi_bmc_info{firmware_revision="2.52",manufacturer_id="Dell_Inc"} 1
  • 48. Protip V Make use of techniques to ingest non-numeric data!* ● Use labels and binary values to represent state *...but do it with some caution! collins_asset_state{tag="ABCD1234",state="Allocated"} 1 collins_asset_state{tag="ABCD1234",state="Maintenance"} 0 collins_asset_state{tag="ABCD1234",state="Unallocated"} 0
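The state encoding from Protip V can be sketched as follows: one series per possible state, with value 1 for the current state and 0 for all others. This is a minimal Python illustration, not the Collins exporter's actual code. The list of states contains only the three shown on the slide, and `state_series` is a hypothetical helper.

```python
# Only the three states shown on the slide; a real inventory may define more.
STATES = ["Allocated", "Maintenance", "Unallocated"]


def state_series(metric, tag, current_state, states=STATES):
    """Return one exposition line per state: 1 for the current one, else 0."""
    if current_state not in states:
        raise ValueError(f"unknown state: {current_state!r}")
    return [
        f'{metric}{{tag="{tag}",state="{s}"}} {1 if s == current_state else 0}'
        for s in states
    ]


for line in state_series("collins_asset_state", "ABCD1234", "Allocated"):
    print(line)
```

Exposing every state, rather than only the active one, keeps the set of series stable over time, and a query such as `collins_asset_state == 1` selects the current state. The caution on the slide applies because each additional label value creates another time series.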
  • 49. And now: merging data sources Example: BMC Firmware revisions of certain server types
  • 50. And now: merging data sources Query: ipmi_bmc_info{firmware_revision!="2.52"} Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...}
  • 51. And now: merging data sources Query: ipmi_bmc_info{firmware_revision!="2.52"} Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...} Query: collins_asset_details{nodeclass="app-2"} Result: collins_asset_details{ipmi_address="10.1.2.3",...}
  • 52. And now: merging data sources Query: ipmi_bmc_info{firmware_revision!="2.52"} Result: ipmi_bmc_info{firmware_revision="2.41",instance="10.1.2.3",...} Query: collins_asset_details{nodeclass="app-2"} Result: collins_asset_details{ipmi_address="10.1.2.3",...} Query: label_replace(ipmi_bmc_info, "ipmi_address", "$1", "instance", "(.*)") Result: ipmi_bmc_info{firmware_revision="2.41",ipmi_address="10.1.2.3",...}
  • 53. And now: merging data sources Query: collins_asset_details{nodeclass="app-2"} * on (ipmi_address) group_left(firmware_revision) label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)") Result: {firmware_revision="2.41",ipmi_address="10.1.2.3", nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234"}
  • 54. And now: merging data sources Query: collins_asset_details{nodeclass="app-2"} * on (ipmi_address) group_left(firmware_revision) label_replace(ipmi_bmc_info{firmware_revision!="2.52"}, "ipmi_address", "$1", "instance", "(.*)") * on (tag) group_left(status) (collins_asset_status == 1) Result: {firmware_revision="2.41",ipmi_address="10.1.2.3", nodeclass="app-2",primary_address="10.10.20.30",tag="ABCD1234",status="Allocated"}
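To make the join in the queries above easier to follow, here is a small Python sketch that imitates the first two steps on plain dictionaries: `label_replace` copies `instance` into a new `ipmi_address` label, and the `* on (ipmi_address) group_left(firmware_revision)` join attaches the firmware revision to the matching Collins series. Sample values are ignored because both metrics carry the value 1. The second asset and second BMC are invented for illustration, and both helper functions are simplified stand-ins rather than PromQL semantics in full.

```python
# Label sets only; the first entry of each list is taken from the slides,
# the second is made up so that the filter has something to exclude.
ipmi_bmc_info = [
    {"instance": "10.1.2.3", "firmware_revision": "2.41"},
    {"instance": "10.1.2.4", "firmware_revision": "2.52"},
]
collins_asset_details = [
    {"tag": "ABCD1234", "nodeclass": "app-2",
     "ipmi_address": "10.1.2.3", "primary_address": "10.10.20.30"},
    {"tag": "EFGH5678", "nodeclass": "app-2",
     "ipmi_address": "10.1.2.4", "primary_address": "10.10.20.31"},
]


def label_replace(series, dst, src):
    """Copy label `src` into `dst` (the "(.*)" -> "$1" case from the slide)."""
    return [{**labels, dst: labels[src]} for labels in series]


def join_group_left(left, right, on, extra):
    """Keep left series that have a match on `on`; copy label `extra` over."""
    index = {r[on]: r for r in right}  # assumes at most one right series per key
    joined = []
    for labels in left:
        match = index.get(labels[on])
        if match is not None:
            joined.append({**labels, extra: match[extra]})
    return joined


outdated = [s for s in ipmi_bmc_info if s["firmware_revision"] != "2.52"]
app2 = [s for s in collins_asset_details if s["nodeclass"] == "app-2"]
result = join_group_left(
    app2,
    label_replace(outdated, "ipmi_address", "instance"),
    on="ipmi_address",
    extra="firmware_revision",
)
print(result)
```

The final step on the slide, joining on `tag` to pick up the asset status, would be another call to the same kind of join with a different `on` label.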
  • 55. Where we are now... & NRPE Cloud Watch Cacti ✘ ✘ ✘ ✘ ✘ ✘ ✘
  • 56. Collection Visualization Alerting CloudWatch ✔ ✔ ✔ Graphite (✔) Prometheus ✔ Grafana ✔ Alertmanager ✔ What’s paging you at night?
  • 57. What’s up with this CloudWatch thing? ● There is a CloudWatch exporter ● However, CloudWatch’s internal architecture is fundamentally incompatible with Prometheus ● Using CloudWatch as a Grafana data source can incur costs
  • 58. Outline I. Our data-center II. Brief intro to Prometheus III. All my exporters IV. TL;DR & Soon™
  • 59. So, is it working? ● Yes
  • 60. Was it worth it? ● Yes
  • 61. Why was it worth it? ● Many integrations readily available ● New ones are easy to write ● Quality and quantity of monitoring have increased ● Monitoring and alerting have become much more consistent ● Easy to merge data sources for alerting or graphing This is true across the entire organization, not just infrastructure!
  • 62. Soon: long term storage ● Not a primary concern for Prometheus ● Simple solution as explained ● Remote (read/)write interface ● Some features in Prometheus 2.0 to allow external solutions ○ Check out e.g. Thanos: https://github.com/improbable-eng/thanos
  • 63. Soon: forging a standard? OpenMetrics working group ● https://github.com/RichiH/OpenMetrics
  • 64. This is the end... Thank you!