FROM NOTHING TO PROMETHEUS
ONE YEAR AFTER
MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
Speaker ID
Antoine LEROYER
⬢ Infrastructure Engineer / SRE @ Deezer since 2016
⬢ DevOps @ EDF (2013-2016)
⬢ Sysadmin @ Netvibes (2012-2013)
2
MEETUP CLOUD NATIVE COMPUTING PARIS - FEBRUARY 2018
Agenda
⬢ Deezer in 30 seconds
⬢ State of Deezer Infrastructure in 2016
⬢ Why Prometheus?
⬢ Let’s dive into our setup
⬢ What’s next for us in monitoring
⬢ Questions
3
4
Deezer in 30 seconds
5
Deezer in 30 seconds
⬢ Streaming music service
⬢ Launched in 2007
⬢ Available on multiple devices: Mobile, Desktop, TV, Speakers, etc.
6
12M active users · 185+ countries · 43M tracks (and counting)
State of Deezer Infrastructure
In 2016
7
⬢ Fully managed by our provider
⬡ Rack and initial setup of servers
⬡ Configuration management
⬡ Monitoring
⬡ Alerting
⬢ Majority of bare metal servers (400+)
⬢ Infrastructure Team was small
⬢ Technical staff grew a lot (x4 in one year)
⬢ ...so our team got new members to handle the growth :)
8
State of Deezer Infrastructure in 2016
The new Infrastructure Team @ Deezer
9
If you want to manage production yourself, without your provider, you
need a proper monitoring solution. (And other things, but that’s not the point here.)
10
But first, we asked ourselves:
Okay, so what are our needs?
State of Deezer Infrastructure in 2016
Our needs
⬢ Have a bunch of metrics to make nice graphs
⬢ Send alerts if something goes wrong
⬢ Easy to deploy on the existing infrastructure
⬢ But also support container orchestration for the future
⬢ Be able to scale up/down without triggering alerts
11
State of Deezer Infrastructure in 2016
Why Prometheus?
12
What is Prometheus?
13
⬢ Open-source systems monitoring and alerting toolkit
⬢ Time series database with metric names and labels
⬢ Pull time series over HTTP instead of push
⬢ Targets are discovered via service discovery
⬢ No distributed storage, nodes are autonomous
https://prometheus.io/docs/introduction/overview/
Why Prometheus?
What is Prometheus?
14
⬢ Typical time series
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
http_requests_total{method="post",code="400"} 3
Why Prometheus?
Why Prometheus?
⬢ Designed for metrics (TSDB)
⬢ Provides alerting thanks to Alertmanager
⬢ Grafana support
⬢ High performance
⬢ Powerful but simple query language (PromQL), example after this list
⬢ Service Discovery
⬡ Follow your infrastructure scaling up/down
⬡ Ready for container orchestration
15
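A couple of illustrative PromQL queries (a sketch, using the http_requests_total counter from the earlier slide):

# Per-second rate of successful POST requests over the last 5 minutes
rate(http_requests_total{method="post", code="200"}[5m])
# Request rate broken down by status code across all instances
sum by (code) (rate(http_requests_total[5m]))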
Let’s dive into our setup
16
First, a service discovery
17
We use Consul:
⬢ Already deployed on some servers
⬢ Supported by Prometheus (scrape config sketch below)
⬢ Blazing fast and lightweight
⬢ Service declaration with tags support
⬢ Bonus:
⬡ we get service checks
⬡ and a K/V store
Let’s dive into our setup
Consul by Hashicorp (consul.io)
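A minimal prometheus.yml scrape job using Consul service discovery, as a sketch (the Consul address, job name, and keep rule are illustrative, not our exact configuration):

scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'localhost:8500'   # local Consul agent
    relabel_configs:
      # only keep services carrying an exporter-<port> tag (see the relabeling slides)
      - source_labels: [__meta_consul_tags]
        regex: .*,exporter-[0-9]+,.*
        action: keep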
Then, Prometheus
⬢ 2 monstrous servers in each PoP
⬡ 32 cores
⬡ 128GB RAM
⬡ RAID 10 SSD
⬢ Currently running 2.1
Also, an Alertmanager cluster (see the config sketch below).
18
Let’s dive into our setup
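A sketch of how Prometheus is pointed at the Alertmanager cluster (hostnames are illustrative):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1:9093'
            - 'alertmanager-2:9093'
            - 'alertmanager-3:9093'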
And also exporters
It can be:
⬢ A daemon exposing metrics through an HTTP endpoint
⬢ An HTTP endpoint inside your application
⬢ A Prometheus Pushgateway
Your endpoint must expose plain-text data in the Prometheus exposition format (a sketch follows below).
19
Let’s dive into our setup
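A minimal exporter sketch in Go with the official client library, github.com/prometheus/client_golang (the metric name and port 6110 are illustrative):

// main.go: a minimal exporter exposing one counter on :6110/metrics
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// jobsProcessed is an illustrative counter; a real exporter would expose
// whatever the daemon or application actually measures.
var jobsProcessed = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "myapp_jobs_processed_total",
    Help: "Total number of processed jobs.",
})

func main() {
    prometheus.MustRegister(jobsProcessed)
    jobsProcessed.Inc() // increment wherever the real work happens

    // promhttp.Handler serves registered metrics in the Prometheus text format.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":6110", nil))
}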
Prometheus infrastructure for one datacenter
20
How do I monitor a server and its services?
1. Deploy a Consul agent and a bunch of exporters on the node
2. Add services to your Consul agent with some tags
3. ????
4. Profit!!!!
21
Let’s dive into our setup
Consul Agent configuration
22
Let’s dive into our setup
(Slide callouts: the "exporter-6110" tag carries the port where my exporter is listening; the "prod" tag is the tag used to filter by environment.)

# Consul Service JSON for Apache
{
  "service": {
    "name": "apache",
    "tags": [
      "prod",
      "apache",
      "exporter-6110"
    ],
    "address": "",
    "port": 443,
    "enableTagOverride": false,
    "checks": [
      {
        "script": "apache-check.sh",
        "interval": "5s"
      }
    ]
  }
}
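To load such a service definition, a common pattern (assuming the agent reads the conventional /etc/consul.d config directory) is:

# Drop the file into the agent's config directory, then reload the agent
cp apache.json /etc/consul.d/
consul reload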
Prometheus relabeling: a powerful feature
23
Let’s dive into our setup
⬢ Before scraping, Prometheus allows you to change or create labels
⬢ You can create labels to help you identify your metrics
# Replace job label with service name
- source_labels: [__meta_consul_service]
  target_label: job
# Add datacenter label
- source_labels: [__meta_consul_dc]
  target_label: dc
# Add instance name
- source_labels: [__meta_consul_node]
  target_label: instance
# Create a group label from node name
- source_labels: [__meta_consul_node]
  regex: ^(blm|dzr|dev)-([a-z]+)-.*
  target_label: group
  replacement: ${2}
Prometheus relabeling: a powerful feature
24
Let’s dive into our setup
⬢ You can change internal labels of Prometheus
⬡ They start with __ and will be removed before storing the metric
⬡ You can override labels used for scraping to obtain a dynamic configuration
# Retrieve exporter port from consul tags
- source_labels: [__meta_consul_tags]
  regex: .*,exporter-([0-9]+),.*
  target_label: __exporter_port
  replacement: ${1}
# Define addr:port to scrape
- source_labels: [__meta_consul_address, __exporter_port]
  separator: ":"
  target_label: __address__
  replacement: ${1}
After relabeling
25
Let’s dive into our setup
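The original slide shows the resulting targets; as a hypothetical illustration (node, datacenter and group values are made up), the apache service on node dzr-web-042 in datacenter par1 would be scraped on port 6110 with labels like:

# hypothetical target after applying the relabel rules above
{job="apache", instance="dzr-web-042", dc="par1", group="web"}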
Just a bunch of exporters
26
Typical week @ Deezer
27
Typical day for memcached
28
Impact of Prometheus v2
29
(graph comparing Prometheus 1.8.2 and 2.1.0)
Impact of Prometheus v2
30
(graph comparing Prometheus 1.8.2 and 2.1.0)
⬢ We have over 2.3 million time series
⬢ It scrapes ~57k samples per second
⬢ 30s scrape interval in general
⬢ No scrape lag so far
31
Some stats about Prometheus itself
OS tuning
# SSD Tuning
echo 0 > /sys/block/sdX/queue/rotational
echo deadline > /sys/block/sdX/queue/scheduler
# /etc/sysctl.d/local.conf
vm.swappiness=1
# /etc/security/limits.d/00prometheus
prometheus - nofile 10000000
# If you have an Intel CPU, want consistent CPU frequencies, and scaling_governor
# doesn’t work, put this in your kernel boot args.
intel_pstate=disable
32
Let’s dive into our setup
# Equal to 2/3 of your total memory
-storage.local.target-heap-size
# Set it to 5m to reduce load on the SSD
-storage.local.checkpoint-interval
# If you have a large number of time series and a low scrape interval
# you can increase this above 10k easily
-storage.local.num-fingerprint-mutexes
# If you have an SSD, you can put this one really high
-storage.local.checkpoint-dirty-series-limit
33
Some 1.6.x to 1.8.x settings (in case you need them)
Let’s dive into our setup
Source: Configuring Prometheus for High Performance [A] - Björn Rabenstein, SoundCloud Ltd.
In 2.x
New TSDB engine. Just one setting:
--storage.tsdb.retention
Prometheus will take care of the rest.
Just ensure you have enough disk space (depending on the retention); a launch sketch follows below.
34
Let’s dive into our setup
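A launch sketch for 2.x (paths and the retention value are illustrative):

# Example Prometheus 2.x invocation
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention=15d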
What’s next for us in monitoring?
35
What’s next for us in monitoring?
⬢ Go beyond 15 days of retention
⬡ Use the remote read/write feature to export data and read it back (see the sketch after this list)
⬢ Experiment with remote read so that there is a single endpoint to read metrics from
⬢ Alerting as a Service
⬡ Try to automate Prometheus alerting rules creation
⬡ Provision Alertmanager for each team
⬢ Write some exporters :)
⬢ Kubernetes!
36
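A sketch of the corresponding prometheus.yml snippet (the endpoint URL is illustrative; the long-term storage backend is not specified here):

remote_write:
  - url: 'http://long-term-storage.example.com/api/v1/write'
remote_read:
  - url: 'http://long-term-storage.example.com/api/v1/read'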
Questions?
37
Thanks!
38
