FROM NOTHING TO PROMETHEUS
ONE YEAR AFTER
MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
Speaker ID
Antoine LEROYER
⬢ Infrastructure Engineer / SRE @ Deezer since 2016
⬢ DevOps @ EDF (2013-2016)
⬢ Sysadmin @ Netvibes (2012-2013)
Agenda
⬢ Deezer in 30 seconds
⬢ State of Deezer Infrastructure in 2016
⬢ Why Prometheus?
⬢ Let’s dive into our setup
⬢ What’s next for us in monitoring
⬢ Questions
Deezer in 30 seconds
⬢ Streaming music service
⬢ Launched in 2007
⬢ Available on multiple devices: Mobile, Desktop, TV, Speakers, etc.
⬢ 12M active users
⬢ 185+ countries
⬢ 43M tracks
(and counting)
State of Deezer Infrastructure in 2016
⬢ Fully managed by our provider
⬡ Rack and initial setup of servers
⬡ Configuration management
⬡ Monitoring
⬡ Alerting
⬢ Majority of bare metal servers (400+)
⬢ Infrastructure Team was small
⬢ Technical staff went big (x4 in one year)
⬢ ...so our team got new members to handle the growth :)
The new Infrastructure Team @ Deezer
If you want to manage production yourself, without your provider, you
need a proper monitoring solution (and other things, but that’s not the point here).
But first, we asked ourselves: okay, so what are our needs?
Our needs
⬢ Have a bunch of metrics to make nice graphs
⬢ Send alerts if something went wrong
⬢ Easy to deploy on the existing infrastructure
⬢ But also support container orchestration for the future
⬢ Be able to scale up/down without triggering alerts
Why Prometheus?
What is Prometheus?
⬢ Open-source systems monitoring and alerting toolkit
⬢ Time series database, with series identified by metric name and labels
⬢ Pulls time series over HTTP instead of having them pushed
⬢ Targets are discovered via service discovery
⬢ No distributed storage, nodes are autonomous
https://prometheus.io/docs/introduction/overview/
⬢ Typical time series
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
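To make the format above concrete, here is a minimal sketch of how such lines could be read back into a metric name, a set of labels and a value. It is an illustration only, not how Prometheus itself parses: the function name parse_samples is hypothetical, and the sketch ignores optional timestamps and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# One label pair inside the braces: name="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines carry no sample
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"not a sample line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


EXPOSITION = '''\
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
'''

for name, labels, value in parse_samples(EXPOSITION):
    print(name, labels, value)
```

Each distinct combination of metric name and labels is its own time series, which is why the two lines above count as two series.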
Why Prometheus?
⬢ Designed for metrics (TSDB)
⬢ Provides alerting thanks to Alertmanager
⬢ Grafana support
⬢ High performance
⬢ Powerful but simple query language (PromQL)
⬢ Service Discovery
⬡ Follows your infrastructure as it scales up/down
⬡ Ready for container orchestration
Let’s dive into our setup
First, a service discovery
We use Consul:
⬢ Already deployed on some servers
⬢ Supported by Prometheus
⬢ Blazing fast and lightweight
⬢ Service declaration with tags support
⬢ Bonus:
⬡ we get service checks
⬡ and a K/V store
Consul by Hashicorp (consul.io)
Then, Prometheus
⬢ 2 monstrous servers in each PoP
⬡ 32 cores
⬡ 128GB RAM
⬡ RAID 10 SSD
⬢ Currently running 2.1
Also, an Alertmanager cluster.
And also exporters
An exporter can be:
⬢ A daemon exposing metrics through an HTTP endpoint
⬢ An HTTP endpoint inside your application
⬢ A Prometheus pushgateway
Your endpoint must expose plain text data in Prometheus format.
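As a rough illustration of the first option, the sketch below serves a /metrics endpoint in the Prometheus text format using only the Python standard library. The counters are hard-coded, hypothetical values, and the names render_metrics, MetricsHandler and serve are assumptions made for this example; port 6110 is borrowed from the exporter tag used later in this deck. A real exporter would more likely rely on an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(requests_by_code):
    """Render a counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total The total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for code, count in sorted(requests_by_code.items()):
        lines.append(f'http_requests_total{{method="post",code="{code}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Hypothetical in-memory counters; a real exporter would read them
    # from the service it monitors.
    counters = {"200": 1027, "400": 3}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.counters).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=6110):
    """Block forever, serving /metrics on the given port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Show what a scrape would return, without starting the server.
print(render_metrics(MetricsHandler.counters), end="")
```

Calling serve() starts the endpoint; Prometheus would then scrape it like any other target.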
Prometheus infrastructure for one datacenter
How do I monitor a server and its services?
1. Deploy a Consul agent and a bunch of exporters on a node
2. Add services to your Consul agent with some tags
3. ????
4. Profit!!!!
Consul Agent configuration
In the service definition below, the prod tag is used to filter by environment, and the exporter-6110 tag gives the port where my exporter is listening.
# Consul Service JSON for Apache
{
  "service": {
    "name": "apache",
    "tags": [
      "prod",
      "apache",
      "exporter-6110"
    ],
    "address": "",
    "port": 443,
    "enableTagOverride": false,
    "checks": [
      {
        "script": "apache-check.sh",
        "interval": "5s"
      }
    ]
  }
}
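If many services follow the same tag convention, the definition can be generated rather than written by hand. The sketch below is one possible way to do that; the helper name consul_service and its parameters are hypothetical, and the values simply reproduce the Apache example above.

```python
import json


def consul_service(name, environment, exporter_port, service_port, check_script):
    """Build a Consul service definition following the tag convention above."""
    return {
        "service": {
            "name": name,
            # environment tag, service tag, then the exporter port tag
            "tags": [environment, name, f"exporter-{exporter_port}"],
            "address": "",
            "port": service_port,
            "enableTagOverride": False,
            "checks": [{"script": check_script, "interval": "5s"}],
        }
    }


definition = consul_service("apache", "prod", 6110, 443, "apache-check.sh")
print(json.dumps(definition, indent=2))
```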
Prometheus relabeling: a strong feature
⬢ Before scraping, Prometheus allows you to change or create labels
⬢ You can create labels to help you identify your metrics
# Replace job label with service name
- source_labels: [__meta_consul_service]
  target_label: job
# Add datacenter label
- source_labels: [__meta_consul_dc]
  target_label: dc
# Add instance name
- source_labels: [__meta_consul_node]
  target_label: instance
# Create a group label from node name
- source_labels: [__meta_consul_node]
  regex: '^(blm|dzr|dev)-([a-z]+)-.*'
  target_label: group
  replacement: '${2}'
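To see what these rules produce, here is a small simulation of the "replace" relabeling action in Python. It is a simplified sketch, not the Prometheus implementation: the relabel function is hypothetical, it only handles the replace action, and the node name dzr-web-01 and datacenter dc1 are made-up inputs.

```python
import re


def relabel(labels, rules):
    """Apply 'replace' relabeling rules to a dict of labels, in order."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with the separator (default ";").
        value = rule.get("separator", ";").join(
            labels.get(name, "") for name in rule["source_labels"]
        )
        # Prometheus anchors the regex at both ends; the default is "(.*)".
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        if match is None:
            continue  # no match: the rule leaves the labels untouched
        # Translate ${1}-style references into Python's \g<1> syntax.
        template = re.sub(r"\$\{(\w+)\}", r"\\g<\1>", rule.get("replacement", "${1}"))
        labels[rule["target_label"]] = match.expand(template)
    return labels


RULES = [
    {"source_labels": ["__meta_consul_service"], "target_label": "job"},
    {"source_labels": ["__meta_consul_dc"], "target_label": "dc"},
    {"source_labels": ["__meta_consul_node"], "target_label": "instance"},
    {
        "source_labels": ["__meta_consul_node"],
        "regex": r"^(blm|dzr|dev)-([a-z]+)-.*",
        "target_label": "group",
        "replacement": "${2}",
    },
]

# Hypothetical labels as they might come out of Consul discovery.
discovered = {
    "__meta_consul_service": "apache",
    "__meta_consul_dc": "dc1",
    "__meta_consul_node": "dzr-web-01",
}

result = relabel(discovered, RULES)
# Labels starting with "__" are dropped once relabeling is done.
final = {k: v for k, v in result.items() if not k.startswith("__")}
print(final)
```

With these inputs the target ends up with job, dc, instance and group labels, the last one taken from the second capture group of the node name.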
⬢ You can change internal labels of Prometheus
⬡ They start with __ and will be removed before storing the metric
⬡ You can override labels used for scraping to obtain a dynamic configuration
# Retrieve exporter port from consul tags
- source_labels: [__meta_consul_tags]
  regex: '.*,exporter-([0-9]+),.*'
  target_label: __exporter_port
  replacement: '${1}'
# Define addr:port to scrape
- source_labels: [__meta_consul_address,__exporter_port]
  separator: ":"
  target_label: __address__
  replacement: '${1}'
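The two rules above can be traced by hand with a few lines of Python. This is only a sketch of their effect: the function name scrape_address and the address 10.0.0.12 are hypothetical, and returning None for a service without an exporter tag is a choice of this example, not Prometheus behaviour. It relies on the fact that Prometheus exposes Consul tags as one string joined by the tag separator, with the separator also placed at both ends.

```python
import re


def scrape_address(meta):
    """Return the addr:port to scrape, or None if no exporter tag is present."""
    # Rule 1: pull the exporter port out of the comma-joined tag list.
    match = re.fullmatch(r".*,exporter-([0-9]+),.*", meta["__meta_consul_tags"])
    if match is None:
        return None
    # Rule 2: join address and port with ":" to build __address__.
    return ":".join([meta["__meta_consul_address"], match.group(1)])


# Hypothetical discovery output for the Apache service defined earlier.
meta = {
    "__meta_consul_tags": ",prod,apache,exporter-6110,",
    "__meta_consul_address": "10.0.0.12",
}
print(scrape_address(meta))
```

The leading and trailing commas are what let the regex match the exporter tag wherever it sits in the list.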
After relabeling
Just a bunch of exporters
Typical week @ Deezer
Typical day for memcached
Impact of Prometheus v2
[Figure, two slides: 1.8.2 compared with 2.1.0]
Some stats about Prometheus itself
⬢ We have over 2.3 million time series
⬢ It scrapes ~57k samples per second
⬢ 30s scrape interval in general
⬢ No lag so far
OS tuning
# SSD Tuning
echo 0 > /sys/block/sdX/queue/rotational
echo deadline > /sys/block/sdX/queue/scheduler
# /etc/sysctl.d/local.conf
vm.swappiness=1
# /etc/security/limits.d/00prometheus
prometheus - nofile 10000000
# If you have an Intel CPU, want consistent CPU frequencies, and scaling_governor
# doesn’t work, put this in your kernel boot args.
intel_pstate=disable
Some 1.6.x to 1.8.x settings (in case you need them)
# Equal to 2/3 of your total memory
-storage.local.target-heap-size
# Set it to 5m to reduce load on the SSD
-storage.local.checkpoint-interval
# If you have a large number of time series and a short scrape interval,
# you can easily increase this above 10k
-storage.local.num-fingerprint-mutexes
# If you have SSDs, you can set this one really high
-storage.local.checkpoint-dirty-series-limit
Source: Configuring Prometheus for High Performance [A] - Björn Rabenstein, SoundCloud Ltd.
In 2.x
New TSDB engine. Just one setting:
--storage.tsdb.retention
Prometheus will take care of the rest.
Just ensure you have enough disk space for the retention you choose.
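A back-of-the-envelope estimate of that disk space can be sketched as retention in seconds, times samples ingested per second, times bytes per sample. The function name tsdb_disk_bytes is hypothetical. The ~57k samples per second comes from the stats earlier in this deck; the 15 days (the Prometheus default retention) and the 2 bytes per sample (the upper end of the 1 to 2 bytes quoted in the Prometheus storage documentation) are assumptions of this example, so treat the result as an order of magnitude only.

```python
def tsdb_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough on-disk size: retention in seconds x ingestion rate x bytes per sample."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# 57k samples/s is the rate quoted in this deck; 15 days and
# 2 bytes per sample are assumptions of this sketch.
estimate = tsdb_disk_bytes(retention_days=15, samples_per_second=57_000)
print(f"{estimate / 1e9:.1f} GB")  # prints "147.7 GB"
```

Doubling the retention doubles the estimate, which is worth keeping in mind before raising --storage.tsdb.retention.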
What’s next for us in monitoring?
⬢ Go over 15 days of retention
⬡ Use the remote read/write feature to export data and read it back
⬢ Experiment with remote read to have only one endpoint to read metrics from
⬢ Alerting as a Service
⬡ Try to automate Prometheus alerting rules creation
⬡ Provision Alertmanager for each team
⬢ Write some exporters :)
⬢ Kubernetes!
Questions?
Thanks!

More Related Content

What's hot

Linuxday.at - Lightning Talk
Linuxday.at - Lightning TalkLinuxday.at - Lightning Talk
Linuxday.at - Lightning Talk
Jan Gehring
 
Centralized Logging with syslog
Centralized Logging with syslogCentralized Logging with syslog
Centralized Logging with syslog
amiable_indian
 
Fluentd v0.12 master guide
Fluentd v0.12 master guideFluentd v0.12 master guide
Fluentd v0.12 master guide
N Masahiro
 
Logstash
LogstashLogstash
Logstash
琛琳 饶
 
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
Susan Potter
 
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Zabbix
 
Fluentd and PHP
Fluentd and PHPFluentd and PHP
Fluentd and PHP
chobi e
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4
琛琳 饶
 
Life of an Fluentd event
Life of an Fluentd eventLife of an Fluentd event
Life of an Fluentd event
Kiyoto Tamura
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
Treasure Data, Inc.
 
Nmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumerationNmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumeration
Robert Rowley
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with Openstack
Arun prasath
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
Treasure Data, Inc.
 
Using Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibanaUsing Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibana
Alejandro E Brito Monedero
 
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet
 
ELK stack at weibo.com
ELK stack at weibo.comELK stack at weibo.com
ELK stack at weibo.com
琛琳 饶
 
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq HanayshaNessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Hanaysha
 
Fluentd meetup #2
Fluentd meetup #2Fluentd meetup #2
Fluentd meetup #2
Treasure Data, Inc.
 
Lua tech talk
Lua tech talkLua tech talk
Lua tech talk
Locaweb
 
Like loggly using open source
Like loggly using open sourceLike loggly using open source
Like loggly using open source
Thomas Alrin
 

What's hot (20)

Linuxday.at - Lightning Talk
Linuxday.at - Lightning TalkLinuxday.at - Lightning Talk
Linuxday.at - Lightning Talk
 
Centralized Logging with syslog
Centralized Logging with syslogCentralized Logging with syslog
Centralized Logging with syslog
 
Fluentd v0.12 master guide
Fluentd v0.12 master guideFluentd v0.12 master guide
Fluentd v0.12 master guide
 
Logstash
LogstashLogstash
Logstash
 
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
From Zero To Production (NixOS, Erlang) @ Erlang Factory SF 2016
 
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
Rihards Olups - Encrypting Daemon Traffic With Zabbix 3.0
 
Fluentd and PHP
Fluentd and PHPFluentd and PHP
Fluentd and PHP
 
{{more}} Kibana4
{{more}} Kibana4{{more}} Kibana4
{{more}} Kibana4
 
Life of an Fluentd event
Life of an Fluentd eventLife of an Fluentd event
Life of an Fluentd event
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Nmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumerationNmap Scripting Engine and http-enumeration
Nmap Scripting Engine and http-enumeration
 
Elk with Openstack
Elk with OpenstackElk with Openstack
Elk with Openstack
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Using Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibanaUsing Logstash, elasticsearch & kibana
Using Logstash, elasticsearch & kibana
 
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
Puppet Availability and Performance at 100K Nodes - PuppetConf 2014
 
ELK stack at weibo.com
ELK stack at weibo.comELK stack at weibo.com
ELK stack at weibo.com
 
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq HanayshaNessus scan report using microsoft patchs scan policy - Tareq Hanaysha
Nessus scan report using microsoft patchs scan policy - Tareq Hanaysha
 
Fluentd meetup #2
Fluentd meetup #2Fluentd meetup #2
Fluentd meetup #2
 
Lua tech talk
Lua tech talkLua tech talk
Lua tech talk
 
Like loggly using open source
Like loggly using open sourceLike loggly using open source
Like loggly using open source
 

Similar to From nothing to Prometheus : one year after

Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Brian Brazil
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
Brian Brazil
 
System monitoring
System monitoringSystem monitoring
System monitoring
HardikBadola
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
Brian Brazil
 
Why NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB AtlasWhy NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB Atlas
Datavail
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Brian Brazil
 
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
Ji-Woong Choi
 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and Kubernetes
Sreenivas Makam
 
Prometheus Training
Prometheus TrainingPrometheus Training
Prometheus Training
Tim Tyler
 
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
InfluxData
 
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB AtlasMongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
N Masahiro
 
Build reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQBuild reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQ
Robin Xiao
 
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
Amazon Web Services
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult Steps
DigitalOcean
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Brian Brazil
 
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
InfluxData
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
Brian Brazil
 
MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011
Mike Willbanks
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin
Kuberton
 

Similar to From nothing to Prometheus : one year after (20)

Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
Monitoring Kubernetes with Prometheus (Kubernetes Ireland, 2016)
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
System monitoring
System monitoringSystem monitoring
System monitoring
 
Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)Prometheus (Microsoft, 2016)
Prometheus (Microsoft, 2016)
 
Why NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB AtlasWhy NBC Universal Migrated to MongoDB Atlas
Why NBC Universal Migrated to MongoDB Atlas
 
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
Prometheus: A Next Generation Monitoring System (FOSDEM 2016)
 
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
[오픈소스컨설팅] 프로메테우스 모니터링 살펴보고 구성하기
 
Service Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and KubernetesService Discovery using etcd, Consul and Kubernetes
Service Discovery using etcd, Consul and Kubernetes
 
Prometheus Training
Prometheus TrainingPrometheus Training
Prometheus Training
 
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
Lessons Learned Running InfluxDB Cloud and Other Cloud Services at Scale by T...
 
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB AtlasMongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
MongoDB World 2019: Why NBCUniversal Migrated to MongoDB Atlas
 
Fluentd - RubyKansai 65
Fluentd - RubyKansai 65Fluentd - RubyKansai 65
Fluentd - RubyKansai 65
 
Build reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQBuild reliable, traceable, distributed systems with ZeroMQ
Build reliable, traceable, distributed systems with ZeroMQ
 
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...
 
Building an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult StepsBuilding an Observability Platform in 389 Difficult Steps
Building an Observability Platform in 389 Difficult Steps
 
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
Your data is in Prometheus, now what? (CurrencyFair Engineering Meetup, 2016)
 
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
Lessons Learned: Running InfluxDB Cloud and Other Cloud Services at Scale | T...
 
Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)Microservices and Prometheus (Microservices NYC 2016)
Microservices and Prometheus (Microservices NYC 2016)
 
MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011MNPHP Scalable Architecture 101 - Feb 3 2011
MNPHP Scalable Architecture 101 - Feb 3 2011
 
Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin  Monitoring&Logging - Stanislav Kolenkin
Monitoring&Logging - Stanislav Kolenkin
 

Recently uploaded

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 

Recently uploaded (20)

Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 

From nothing to Prometheus : one year after

  • 1. FROM NOTHING TO PROMETHEUS: ONE YEAR AFTER
       MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
  • 2. Speaker ID: Antoine LEROYER
       ⬢ Infrastructure Engineer / SRE @ Deezer since 2016
       ⬢ DevOps @ EDF (2013-2016)
       ⬢ Sysadmin @ Netvibes (2012-2013)
  • 3. Agenda
       ⬢ Deezer in 30 seconds
       ⬢ State of Deezer Infrastructure in 2016
       ⬢ Why Prometheus?
       ⬢ Let’s dive in our setup
       ⬢ What’s next for us in monitoring
       ⬢ Questions
  • 5. Deezer in 30 seconds
  • 6. Deezer in 30 seconds
       ⬢ Streaming music service
       ⬢ Launched in 2007
       ⬢ Available on multiple devices: Mobile, Desktop, TV, Speakers, etc.
       12M active users, 185+ countries, 43M tracks (and counting)
  • 7. State of Deezer Infrastructure in 2016
  • 8. State of Deezer Infrastructure in 2016
       ⬢ Fully managed by our provider
         ⬡ Rack and initial setup of servers
         ⬡ Configuration management
         ⬡ Monitoring
         ⬡ Alerting
       ⬢ Majority of bare metal servers (400+)
       ⬢ Infrastructure Team was small
       ⬢ Technical staff went big (x4 in one year)
       ⬢ ...so our team got new members to handle the growth :)
  • 9. The new Infrastructure Team @ Deezer
  • 10. If you want to manage production yourself, without your provider, you need a proper monitoring solution (and other things, but that’s not the point here).
       But first, we asked ourselves: okay, so what are our needs?
  • 11. Our needs
       ⬢ Have a bunch of metrics to make nice graphs
       ⬢ Send alerts if something goes wrong
       ⬢ Easy to deploy on the existing infrastructure
       ⬢ But also support container orchestration for the future
       ⬢ Be able to scale up/down without triggering alerts
  • 13. What is Prometheus?
       ⬢ Open-source systems monitoring and alerting toolkit
       ⬢ Time series database with metric names and labels
       ⬢ Pulls time series over HTTP instead of having them pushed
       ⬢ Targets are discovered via service discovery
       ⬢ No distributed storage, nodes are autonomous
       https://prometheus.io/docs/introduction/overview/
  • 14. What is Prometheus?
       ⬢ Typical time series
           # HELP http_requests_total The total number of HTTP requests.
           # TYPE http_requests_total counter
           http_requests_total{method="post",code="200"} 1027
           http_requests_total{method="post",code="400"} 3
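To make the text format on slide 14 more concrete, here is a minimal Python sketch that reads those four lines back into (name, labels, value) tuples. The function name `parse_exposition` is made up for this illustration, and the parser is deliberately simplified: it assumes no timestamps and no escaped quotes inside label values, so it is not a replacement for a real Prometheus client library.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# One label pair inside the braces (no escaped quotes handled here).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

EXAMPLE = """\
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
"""


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"unsupported line: {line!r}")
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_exposition(EXAMPLE)
```

Each distinct combination of metric name and label values is its own time series, which is why the two lines above count as two series.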
  • 15. Why Prometheus?
       ⬢ Designed for metrics (TSDB)
       ⬢ Provides alerting thanks to Alertmanager
       ⬢ Grafana support
       ⬢ High performance
       ⬢ Powerful but simple query language (PromQL)
       ⬢ Service discovery
         ⬡ Follows your infrastructure as it scales up/down
         ⬡ Ready for container orchestration
  • 16. Let’s dive in our setup
  • 17. First, a service discovery
       We use Consul:
       ⬢ Already deployed on some servers
       ⬢ Supported by Prometheus
       ⬢ Blazing fast and lightweight
       ⬢ Service declaration with tags support
       ⬢ Bonus:
         ⬡ we have service checks
         ⬡ and a K/V store
       Consul by HashiCorp (consul.io)
  • 18. Then, Prometheus
       ⬢ 2 monstrous servers in each PoP
         ⬡ 32 cores
         ⬡ 128GB RAM
         ⬡ RAID 10 SSD
       ⬢ Currently running 2.1
       Also, an Alertmanager cluster.
  • 19. And also exporters
       An exporter can be:
       ⬢ A daemon exposing metrics through an HTTP endpoint
       ⬢ An HTTP endpoint inside your application
       ⬢ A Prometheus Pushgateway
       Your endpoint must expose plain text data in Prometheus format.
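As a rough illustration of the first option on slide 19, the sketch below is a tiny exporter built only on Python's standard library. The names `render_metrics`, `MetricsHandler` and `serve` are invented for this example, the counter values are hard-coded placeholders, and port 6110 simply reuses the number from the `exporter-6110` tag on slide 22. A real exporter would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(requests_by_code):
    """Render a {status_code: count} dict in the Prometheus text format."""
    lines = [
        "# HELP http_requests_total The total number of HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for code, count in sorted(requests_by_code.items()):
        lines.append(f'http_requests_total{{code="{code}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    # Placeholder values; a real exporter would read them from the service.
    counters = {"200": 1027, "400": 3}

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(self.counters).encode("utf-8")
        self.send_response(200)
        # Content type of the Prometheus text exposition format.
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=6110):
    """Block forever, answering scrapes on http://host:port/metrics."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; Prometheus then pulls `/metrics` at every scrape interval, so the exporter never needs to know where Prometheus lives.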
  • 20. Prometheus infrastructure for one datacenter
  • 21. How do I monitor a server and its services?
       1. Deploy a Consul agent and a bunch of exporters on a node
       2. Add services to your Consul agent with some tags
       3. ????
       4. Profit!!!!
  • 22. Consul Agent configuration
       # Consul Service JSON for Apache
       {
         "service": {
           "name": "apache",
           "tags": [
             "prod",
             "apache",
             "exporter-6110"
           ],
           "address": "",
           "port": 443,
           "enableTagOverride": false,
           "checks": [
             {
               "script": "apache-check.sh",
               "interval": "5s"
             }
           ]
         }
       }
       Callouts: "prod" is the tag to filter environment; "exporter-6110" gives the port where my exporter is listening.
  • 23. Prometheus relabeling: a strong feature
       ⬢ Before scraping, Prometheus allows you to change/create labels
       ⬢ You can create labels to help you identify your metrics
           # Replace job label with service name
           - source_labels: [__meta_consul_service]
             target_label: job
           # Add datacenter label
           - source_labels: [__meta_consul_dc]
             target_label: dc
           # Add instance name
           - source_labels: [__meta_consul_node]
             target_label: instance
           # Create a group label from node name
           - source_labels: [__meta_consul_node]
             regex: ^(blm|dzr|dev)-([a-z]+)-.*
             target_label: group
             replacement: ${2}
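To see what the rules on slide 23 do to a target, the following Python sketch approximates a single relabel rule with the default `replace` action. The helper `apply_replace` and the node names `dzr-web-01` and `db01` are hypothetical, chosen only for illustration. It is a simplification: Prometheus uses RE2 regular expressions and removes the target label when the result is empty, which this sketch does not reproduce.

```python
import re


def apply_replace(labels, source_labels, target_label,
                  regex="(.*)", replacement="$1", separator=";"):
    """Approximate one Prometheus relabel rule with the 'replace' action."""
    # Source label values are joined with the separator before matching.
    value = separator.join(labels.get(name, "") for name in source_labels)
    # Prometheus anchors the regex on both ends, hence fullmatch.
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the labels are left untouched
    # Translate $1 / ${1} references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated


# The "group" rule from the slide, applied to a hypothetical node name.
group_rule = dict(
    source_labels=["__meta_consul_node"],
    regex=r"^(blm|dzr|dev)-([a-z]+)-.*",
    target_label="group",
    replacement="${2}",
)
web_target = apply_replace({"__meta_consul_node": "dzr-web-01"}, **group_rule)
other_target = apply_replace({"__meta_consul_node": "db01"}, **group_rule)
```

Here `web_target` gains `group="web"`, while `other_target` is unchanged because its node name does not follow the naming pattern. Rules without an explicit `regex`, like the `job`, `dc` and `instance` ones, simply copy the source value into the target label.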
  • 24. Prometheus relabeling: a strong feature
       ⬢ You can change internal labels of Prometheus
         ⬡ They start with __ and will be removed before storing the metric
         ⬡ You can override labels used for scraping to obtain a dynamic configuration
           # Retrieve exporter port from consul tags
           - source_labels: [__meta_consul_tags]
             regex: .*,exporter-([0-9]+),.*
             target_label: __exporter_port
             replacement: ${1}
           # Define addr:port to scrape
           - source_labels: [__meta_consul_address,__exporter_port]
             separator: ":"
             target_label: __address__
             replacement: ${1}
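The two rules on slide 24 rely on how Consul service discovery exposes tags: `__meta_consul_tags` holds the tags joined by the tag separator, with a separator at both ends as well, which is why the regex can look for `,exporter-<port>,`. The Python sketch below walks through that logic under stated assumptions: the function names and the address `10.0.0.12` are hypothetical, and returning `None` when no exporter tag is present is a choice of this example, not what Prometheus itself does.

```python
import re


def consul_tags_label(tags, separator=","):
    """Build the value of __meta_consul_tags: tags joined and wrapped by the separator."""
    return separator + separator.join(tags) + separator


def scrape_address(node_address, tags):
    """Combine the node address with the port found in an exporter-<port> tag."""
    # Same regex as on the slide; Prometheus matches against the whole value.
    match = re.fullmatch(r".*,exporter-([0-9]+),.*", consul_tags_label(tags))
    if match is None:
        return None  # assumption of this sketch: no exporter tag, nothing to scrape
    # Second rule: join address and port with ":" and store it in __address__.
    return f"{node_address}:{match.group(1)}"


# Tags from the Apache service on slide 22, with a hypothetical node address.
address = scrape_address("10.0.0.12", ["prod", "apache", "exporter-6110"])
```

With these inputs `address` is `10.0.0.12:6110`, so Prometheus scrapes the exporter port rather than the service's own port 443.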
  • 26. Just a bunch of exporters
  • 27. Typical week @ Deezer
  • 28. Typical day for memcached
  • 29. Impact of Prometheus v2 (graphs comparing 1.8.2 and 2.1.0)
  • 30. Impact of Prometheus v2 (graphs comparing 1.8.2 and 2.1.0)
  • 31. Some stats about Prometheus itself
       ⬢ We have over 2.3 million time series
       ⬢ It scrapes ~57k samples per second
       ⬢ 30s scrape interval in general
       ⬢ No lag so far
  • 32. OS tuning
       # SSD Tuning
       echo 0 > /sys/block/sdX/queue/rotational
       echo deadline > /sys/block/sdX/queue/scheduler
       # /etc/sysctl.d/local.conf
       vm.swappiness=1
       # /etc/security/limits.d/00prometheus
       prometheus - nofile 10000000
       # If you have an Intel CPU, want consistent CPU frequencies, and scaling_governor
       # doesn’t work, put this in your kernel boot args.
       intel_pstate=disable
  • 33. Some 1.6.x to 1.8.x settings (in case you need them)
       # Equal to 2/3 of your total memory
       -storage.local.target-heap-size
       # Set it to 5m to reduce load on the SSD
       -storage.local.checkpoint-interval
       # If you have a large number of time series and a low scrape interval
       # you can increase this above 10k easily
       -storage.local.num-fingerprint-mutexes
       # If you have SSD, you can put this one really high
       -storage.local.checkpoint-dirty-series-limit
       Source: Configuring Prometheus for High Performance [A] - Björn Rabenstein, SoundCloud Ltd.
  • 34. In 2.x
       New TSDB engine. Just one setting:
           --storage.tsdb.retention
       Prometheus will take care of the rest.
       Just ensure you have enough disk space (depending on retention).
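For the "enough disk space" part, the Prometheus storage documentation gives a rule of thumb: needed disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample, with about 1 to 2 bytes per sample. The Python sketch below applies it to numbers from this talk (~57k samples per second from slide 31, 15 days of retention from slide 36). The function name and the choice of 2 bytes per sample are assumptions for illustration, and the result is an order-of-magnitude estimate, not a measurement of Deezer's actual disk usage.

```python
SECONDS_PER_DAY = 86_400


def needed_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rule-of-thumb TSDB sizing: retention * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# 15 days at ~57k samples/s, assuming the pessimistic 2 bytes per sample.
estimate = needed_disk_bytes(15, 57_000)
estimate_gb = estimate / 1e9  # decimal gigabytes, roughly 148 GB
```

Doubling the retention doubles the estimate, which is one reason longer retention is often handed off to remote storage rather than kept on local disks.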
  • 35. What’s next for us in monitoring?
  • 36. What’s next for us in monitoring?
       ⬢ Go over 15 days of retention
         ⬡ Use remote read/write feature to export/read back data
       ⬢ Experiment with remote read to have only one endpoint to read metrics from
       ⬢ Alerting as a Service
         ⬡ Try to automate Prometheus alerting rules creation
         ⬡ Provision Alertmanager for each team
       ⬢ Write some exporters :)
       ⬢ Kubernetes!