Monitoring Cloud-Native applications with
Prometheus
Jacopo Nardiello
CODEMOTION MILAN - SPECIAL EDITION
10 – 11 NOVEMBER 2017
Jacopo Nardiello
SIGHUP Founder & DevOps Engineer
@jnardiello
~ whoami
~ ./stuff_I_poke_around_with
- Linux
- Kubernetes (clusters lifecycles and workloads scheduling in general)
- The Cloud™
(VMs and Containers + other people's computers)
- golang
- More devops toys FTW! (CI/CDs, Ansible, etc..)
What exactly is “Cloud-Native”?
Cloud-Native is NOT The Cloud™
At its root, Cloud Native is structuring teams, culture and
technology to utilize automation and architectures to manage
complexity and unlock velocity.
Joe Beda
There’s a Copernican revolution happening in
infrastructure
A fundamental shift:
From VM-based Mutable
to Highly Dynamic and Immutable
infrastructures
The path to Cloud-Native Architectures
Why Containers
- A new infrastructural unit
- Atomic deployments
- Very small footprint, superfast scaling
Why Orchestrators
- Sandboxed environment
- Computers take over the scheduling
- Automatic Healthchecks and self-healing
Cloud-Native is
challenging
Cloud-Native monitoring with
Prometheus
Overview: What is Prometheus?
Community Driven Open-source
Monitoring and Alerting framework.
- Time series database for instrumentation,
metrics collection, storage and querying
- Alerting entity
- Integrated tools for metrics exposure
Overview: A bit of context around Prometheus
Started in 2012 as a SoundCloud
internal project
Second project to join CNCF after
Kubernetes
Overview: Focus
Operational systems monitoring
Dynamic cloud environments
Core features
● Powerful no-sql query language, PromQL
● Time series data model
● Optimized to be efficient
● Operational & Architectural simplicity
Monitoring model: Pull
Prometheus pulls (scrapes) metrics from the /metrics endpoints of its targets
Prometheus Architecture
The Architecture behind Prometheus
Prometheus core
- Service discovery and targets
definition
- Metrics scraping
- Time series database
- Alerts and Recording rules
- Alerting evaluation
- Metrics query
Alertmanager
- Alerting & silencing
- Dispatching notification to
different channels
Exporters & SDKs
Formatting metrics to be exported
in the expected Prometheus
format
- Either exporters (Node, RabbitMQ,
MySQL, etc.)
- Or SDKs to export application
metrics
Prometheus Basics
Prom Server configuration
- CLI flags for the immutable
daemon settings (fixed at startup)
- Config file defines scraping
targets, instances and jobs
prometheus.yml

global:
  scrape_interval: 1m
  scrape_timeout: 30s
  external_labels:
    cluster: "test-cluster"

rule_files:
  - rules/rules.yml

# Scraping targets
scrape_configs:
  - job_name: 'some-service'
    static_configs:
      - targets:
          - <host> or <dns>   # placeholder for the address to scrape
        labels:
          app: "some-service"
/metrics
# HELP hash_seconds Time taken to create hashes
# TYPE hash_seconds histogram
hash_seconds_bucket{code="200",le="1"} 2
hash_seconds_bucket{code="200",le="2.5"} 2
hash_seconds_bucket{code="200",le="5"} 2
hash_seconds_bucket{code="200",le="10"} 2
hash_seconds_bucket{code="200",le="+Inf"} 2
hash_seconds_sum{code="200"} 9.370800000000002e-05
hash_seconds_count{code="200"} 2
Data model & querying
api_http_requests_total{method="POST", handler="/messages"}
- Label-based data model
- Each label and combination of labels is a dimension where we
can filter and aggregate exported data
- Changing, adding or removing a label will create a new time
series
PromQL & Label based queries
http_requests_total
    All time series related to the metric http_requests_total
http_requests_total{code="200",method="get"}
    Time series related to successful requests with method get for the metric http_requests_total
http_requests_total{code="200",method="get"}[5m]
    Returns a range vector (the samples from the last 5 minutes)
PromQL & Label based queries
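http_requests_total{status!~"^4..$"}
    Selecting time series using regexes: the negative match (!~) returns every series whose status is NOT 4xx.
    To select the 4xx error series instead, use the positive match: http_requests_total{status=~"^4..$"}
sum(rate(http_requests_total[5m])) by (job)
    Applying functions: rate() computes the per-second rate over a 5m range vector, then sum ... by (job) aggregates the results per job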
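A query like the per-job request rate can also be precomputed by a recording rule, which the Prometheus core evaluates alongside alerts. The fragment below is a minimal sketch in the Prometheus v2 YAML rule format; the group name and the recorded series name job:http_requests:rate5m are illustrative choices, not taken from the talk.

```yaml
# rules/rules.yml (sketch, Prometheus v2 rule format)
groups:
  - name: http_requests            # hypothetical group name
    rules:
      # Store the per-job request rate as a new time series so that
      # dashboards and alerts can query it cheaply.
      - record: job:http_requests:rate5m   # illustrative series name
        expr: sum(rate(http_requests_total[5m])) by (job)
```

The recorded series can then be queried by name like any other metric.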
Prometheus web interface
Visualization
Plotting and graphing are outside the scope
of Prometheus.
Use Grafana
Alerting
Rules
- Evaluated by the prometheus
server on a regular basis
- If a certain query matches a
condition, the alert is triggered
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
This syntax is valid up to Prometheus 1.8.
Starting from Prometheus v2, rules are written in standard YAML
(the structure stays the same)
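For comparison, the same InstanceDown rule can be sketched in the Prometheus v2 YAML format. The fields map one to one onto the 1.x keywords (ALERT, IF, FOR, LABELS, ANNOTATIONS); the group name is an assumption, since v2 requires rules to live inside a named group.

```yaml
# rules/rules.yml (sketch, Prometheus v2 rule format)
groups:
  - name: example                  # hypothetical group name
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```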
Alert Dispatching
The job of the Alertmanager is to dispatch
alerts to the right channel according to
their severity
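Severity-based dispatching could look roughly like the Alertmanager configuration below. It is a sketch, not the setup used in the talk: the receiver names, the Slack channel, the webhook URL and the PagerDuty key are all placeholders. It uses the `match` form of routing; newer Alertmanager releases also offer a `matchers` list for the same purpose.

```yaml
# alertmanager.yml (sketch with placeholder receivers)
route:
  receiver: team-chat              # default channel for everything else
  group_by: ['alertname']
  routes:
    # Alerts labelled severity=critical go to the on-call pager instead.
    - match:
        severity: critical
      receiver: on-call-pager

receivers:
  - name: team-chat
    slack_configs:
      - api_url: 'https://hooks.slack.example/placeholder'   # placeholder URL
        channel: '#alerts'
  - name: on-call-pager
    pagerduty_configs:
      - service_key: '<pagerduty-service-key>'               # placeholder key
```

The `severity` label matched here is the one attached to the alert by the rule's labels, as in the InstanceDown example.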
Cloud-Native monitoring
Service discovery
Scraping statically defined targets is not very useful
kubernetes_sd_config
Native integration for kubernetes environments
- Prometheus is aware of running in a kubernetes cluster
- Automatically retrieves scraping targets such as nodes, pods, containers from the
k8s API
More integrations (many more…)
- ec2_sd_config
- azure_sd_config
- openstack_sd_config
- gce_sd_config
- kubernetes_sd_config
- consul_sd_config
- dns_sd_config
- file_sd_config
- marathon_sd_config
- nerve_sd_config
- triton_sd_config
- static_config
Re-labeling
- Relabeling is a very powerful mechanism that allows us to further manipulate labels from the targets.
- It’s a very effective way to take the targets discovered from an API and apply sophisticated targeting strategies (e.g.
manipulating addresses or ports, filtering a subset of targets, etc.)
A quick configuration example:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
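Beyond filtering, relabeling can also rewrite the scrape address and path. The sketch below extends the pod job with rules of this kind. It assumes pods carry the conventional prometheus.io/path and prometheus.io/port annotations (a convention, not something Prometheus enforces), and the target label name kubernetes_namespace is an illustrative choice.

```yaml
# Sketch: scrape_configs entry for annotated pods
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # Keep only pods annotated with prometheus.io/scrape: "true".
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
    # Use the annotated path instead of the default /metrics, if present.
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Rewrite the target address to use the annotated port.
    # Source label values are joined with ";" before the regex is applied.
    - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      regex: ([^:]+)(?::\d+)?;(\d+)
      replacement: $1:$2
      target_label: __address__
    # Copy the pod namespace into a regular label on the scraped series.
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: kubernetes_namespace
```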
Demo Time!
Thank you,
Questions?
We are hiring!
jacopo@sighup.io
@jnardiello

Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Codemotion Milan 2017
