Modern Monitoring [ with Prometheus ]

Tikal Knowledge
Haggai Philip Zagury - DevOps Group Lead - Tikal Knowledge

FullStack Developers Israel
INTRO - WHO WE ARE
▸ Tikal helps ISVs in Israel & abroad with their technological challenges.
▸ Our engineers are full-stack developers with expertise in Android, DevOps, Java, JS, Ruby & Python.
▸ We are passionate about technology and specialize in open-source technologies.
▸ Our tech and group leaders help establish & enhance existing software teams with innovative & creative thinking.
https://www.meetup.com/full-stack-developer-il/

INTRODUCTION TO MODERN MONITORING
CURRENT STATUS [ INFRASTRUCTURE ]
▸ AWS, Cloud, Hybrid / Multi Cloud
▸ Define metrics and system health based on experience and application-specific behaviors.
▸ Many false positives
▸ Scaling is hard [ semi-auto, manual ]

COMMON MONITORING STATUS
▸ Ops owns the monitoring domain
▸ Define metrics and system health based on experience and application-specific behaviors.
▸ Many false positives
▸ Scaling is hard [ semi-auto, manual ]

COMMON MONITORING SOLUTIONS
▸ Amazon CloudWatch
▸ New Relic
▸ Nagios
▸ AppDynamics
▸ Datadog
▸ Many more …

GOALS
▸ Improve existing monitoring and RCA indicators
▸ Reduce false positives & “customer-driven alerting”
▸ Proactively identify data anomalies / deviations
▸ Provide meaningful / intelligent notifications [ severity, SLA compliance, etc. ]
▸ Proactively remediate commonly known issues, or set the foundation of a robust substitute
▸ Provide KPI integration policy & methodology for both DevOps & R&D teams

CHALLENGES
▸ Preserve the knowledge and insights in the existing monitoring system
▸ Cultural changes:
  ▸ APM is part of the development process
  ▸ Monitoring tools are part of the developer stack (or developers will be woken up by any issue with their code/app)
  ▸ On-call isn’t only for Ops … everybody’s accountable
  ▸ Break down the “wall of confusion” between dev and ops

The Gap of Traditional Monitoring
- We know what we want to know …

System Metrics
Not enough || Too much a little too late

We do not always know what we are looking at / for …

Is this OK ?! || Normal
What happened at 4AM

If you’re lucky
+
= No action needed

Go back to sleep
( you still woke up ! )

REALITY
Murphy’s law …

Stop using Nagios
(so it can die peacefully)
Feb 13, 2014 [ slideshare ]

In 2 words:
Configuration files…
In a few more:
- resources
- services
- dependencies
- …

Traditional Monitoring
• Reliable
• Durable
• Scalable
Conclusion …
system monitoring does not suffice, enter APM

HOW DID WE GET HERE

TRADITIONAL MONITORING WAS (IS) ALL ABOUT THE “BLACK BOX” | “OS” METRICS
▸ All we care about is that the system is OK …
APPLICATION FRONTEND
APPLICATION BACKEND
APPLICATION DATABASE

OPS ARE WORKING ON OPTIMIZING INFRASTRUCTURE …
▸ Throw more RAM & “Reports”
▸ Add another node to the “FE cluster”
▸ Add another shard to the DB …
▸ …
APPLICATION …

IN THE PAST ~10 YEARS
▸ Developers have started to implement METRICS
▸ Organizations are adopting Standards
▸ Common metrics have become a commodity

REALITY PREVAILS

APPLICATION FRONTEND
APPLICATION BACKEND
APPLICATION DATABASE
APPLICATION …

Multiple Dimensions
• [ Stability ]
• Ops dimension
• [ Innovation ]
• Dev dimension
• Product dimension

Even More
• Environment [ stg, uat, prod ]
• Application Stack(s) || tags || types
• Business metrics

TEAMS | SCOPES | METRICS - COME TOGETHER

Apply

MONITORING CRITERIA
▸ Server (OS) level monitoring
▸ Application Monitoring (APM)
▸ Perimeter (External website) monitoring
▸ Event driven remediation
▸ Alerting and Escalation
▸ Associated log data & anomaly detection

REQUIRED FEATURES
Accessibility
Scheduling
SLAs assured
Authentication & Authorization
Escalation
Durable & Resilient
Forensics
Automatic
Flexible & Elastic
Accountable

IT’S AN ITERATIVE PROCESS
▸ How quickly did we recover ?
▸ What worked / didn’t work ?
▸ Iterative improvements [ Chaos Monkey, 10 story test ]
▸ RCA -> Remediation [ a.k.a. false positive lifecycle ]

HOW TO DEFINE A METRIC OR ALERT VS. HOW TO STORE DATA
▸ A Metric’s Lifecycle & Design
▸ Time Series Data stream(s) || source(s)
▸ Common tagging
▸ Metric naming conventions and implications
▸ Micro Services, Integration of Traditional and New Generation solutions
▸ Choose short, mid & long term tools / services

A METRIC’S LIFECYCLE
NEW (A) METRIC
INFRASTRUCTURE (OS)
APPLICATION
EXTERNAL (DEPENDENCY / ENDPOINT)
REMEDIABLE ?
ALERTABLE ?
LOG CORRELATION
SCOPE OF IMPACT
LEARN IN DEV | STG -> DEFINE IN DEV | STG -> SHIP TO PROD

A METRIC’S LIFECYCLE - “TAG-ABLE” == FILTERABLE | MEASURABLE | QUANTIFIABLE
NEW (A) METRIC
INFRASTRUCTURE (OS)
APPLICATION
REMEDIABLE ?
ALERTABLE ?
LOG CORRELATION
SCOPE OF IMPACT
LEARN IN DEV | STG -> SHIP TO PROD
ENVIRONMENT: DEVELOPMENT | STAGING | PRODUCTION

A METRIC’S LIFECYCLE
NEW (A) METRIC
INFRASTRUCTURE (OS)
APPLICATION
REMEDIABLE ?
ALERTABLE ?
LOG CORRELATION
SCOPE OF IMPACT
LEARN IN DEV | STG -> SHIP TO PROD
- QUANTIFIABLE METRICS: SEVERITY, CRITICAL STATE
- EXPOSING A SERVICE
- CONSUMING A SERVICE
- - WHY DOES MY SERVICE HAVE AN OS IMPACT ?
- - IS IT BY DESIGN ?
- FALLBACK METHODS ?
- ALTERNATE ENDPOINTS / RETRY ?
- FEATURE TOGGLE
- DEFINE SEVERITY

TSD PRINCIPLES
Credit->http://opentsdb.net/overview.html

DATAPOINTS
Credit->https://www.datadoghq.com/blog/the-power-of-tagged-metrics/
In tools like Prometheus you don't need the timestamp; it just uses the collection timestamp.
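A minimal sketch of the datapoint anatomy described on this slide: a metric name, a set of tags, a value, and an optional timestamp that falls back to the collection time when it is missing. The `Sample` type and `parse_sample` function are hypothetical names, and the parser only covers a simplified subset of the Prometheus text format (no escaped quotes inside tag values).

```python
import re
import time
from typing import Dict, NamedTuple, Optional


class Sample(NamedTuple):
    name: str                # metric name, e.g. http_requests_total
    labels: Dict[str, str]   # tags (key/value pairs)
    value: float
    timestamp_ms: int


# metric_name{label="value",...} value [timestamp_ms]
_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+(-?\d+))?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str, scrape_time_ms: Optional[int] = None) -> Sample:
    """Parse one simplified exposition line into a Sample."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value, explicit_ts = match.groups()
    labels = dict(_LABEL.findall(raw_labels or ""))
    if explicit_ts is not None:
        timestamp_ms = int(explicit_ts)       # the source supplied a timestamp
    elif scrape_time_ms is not None:
        timestamp_ms = scrape_time_ms         # use the collection time
    else:
        timestamp_ms = int(time.time() * 1000)
    return Sample(name, labels, float(value), timestamp_ms)


sample = parse_sample('http_requests_total{method="GET",env="prod"} 1027',
                      scrape_time_ms=1000)
print(sample)
```

Keeping the tags as a dictionary is what makes the datapoint filterable later: any tag can become a query dimension without renaming the metric.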

MIX ’N’ MATCH

SHORT | MID | LONG TERM SOLUTIONS

PROMETHEUS
https://github.com/prometheus/prometheus

FEATURES
▸ Open-source systems monitoring and alerting toolkit
▸ A multi-dimensional data model (time series identified by metric name and key/value pairs)
▸ A flexible query language to leverage this dimensionality
▸ No reliance on distributed storage; single server nodes are autonomous
▸ Time series collection happens via a pull model over HTTP
▸ Pushing time series is supported via an intermediary gateway
▸ Targets are discovered via service discovery or static configuration
▸ Multiple modes of graphing and dashboarding support

PROMETHEUS ARCHITECTURE
Dashboarding
Prometheus Server | Alertmanager
Retrieval / Collection
Data Series
Storage [DB]
PromQL
web UI
Prometheus server
Prometheus server(s)
Push Gateway
Service Discovery Providers
Prometheus server
Prometheus exporters

UNTIL NOW
‣ Try providing this to each developer
‣ Sensu has a very similar approach to APM …
‣ Complexity is the barrier …

UNTIL NOW
‣ Pull has become an advantage …
‣ Severity is implied [TSD]
‣ Fewer false positives
‣ Docker makes it super simple
‣ Go’s lightweight approach

IMPLEMENTATION
‣ Review old system metrics & capabilities and decide what’s good and what’s bad
  ‣ What can move
  ‣ What needs to stay | integrate with the new system
‣ Prometheus deployment is automated from day 1
‣ Prometheus exporter services are tagged and labeled per application stack | layer
  ‣ Preferably Dockerized
‣ Metric design workshops | meetings | Slack group
‣ Alert design workshops | meetings | Slack group
‣ Team metric tags and alerting & escalation

STEP 1 - IMPLEMENT DISCOVERY
AWS Discovery -> https://github.com/prometheus/prometheus/tree/master/discovery
NEW NODE DEPLOYMENT
SERVICE DISCOVERY
DEV
STAGING
PRODUCTION
STACK / APP NAME
Alertmanager
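A hedged sketch of the idea in this step: every newly deployed node ends up as a scrape target tagged with its environment and stack. The slide points at AWS discovery; as a stand-in, this example writes target groups in the JSON shape used by Prometheus file-based discovery (a list of objects with `targets` and `labels`). The inventory, the `env` / `stack` label names, and the `build_file_sd` helper are assumptions made for illustration; port 9100 is the node exporter default.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_file_sd(nodes: Iterable[Dict[str, str]], port: int = 9100) -> List[dict]:
    """Group nodes into file-based discovery target groups, one per (env, stack)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for node in nodes:
        groups[(node["env"], node["stack"])].append(f'{node["address"]}:{port}')
    return [
        {"targets": sorted(targets), "labels": {"env": env, "stack": stack}}
        for (env, stack), targets in sorted(groups.items())
    ]


# Hypothetical inventory, e.g. produced by a deployment job.
inventory = [
    {"address": "10.0.0.11", "env": "staging", "stack": "frontend"},
    {"address": "10.0.0.12", "env": "production", "stack": "frontend"},
    {"address": "10.0.0.13", "env": "production", "stack": "frontend"},
]
print(json.dumps(build_file_sd(inventory), indent=2))
```

With a cloud-native discovery mechanism the same labels would usually come from instance tags rather than from a generated file.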

STEP 2 - IMPLEMENT EXPORTERS
https://prometheus.io/docs/instrumenting/exporters/
Official node exporter -> https://github.com/prometheus/node_exporter
Mssql Exporter -> https://hub.docker.com/r/awaragi/prometheus-mssql-exporter/
Nagios Exporter -> https://github.com/m-lab/prometheus-nagios-exporter

STEP 3 - IMPLEMENT CUSTOM APPLICATION METRICS
https://prometheus.io/docs/instrumenting/exporters/
Windows WMI -> https://github.com/martinlindhe/wmi_exporter
Java -> https://github.com/prometheus/jmx_exporter
node.js -> https://www.npmjs.com/browse/keyword/prometheus
.Net -> https://github.com/andrasm/prometheus-net
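A custom application metric is usually a counter or gauge that the application code updates itself. The following is a rough, dependency-free sketch of a labelled counter; `LabelledCounter` and the `shop_orders_processed_total` metric are hypothetical, and in practice one of the client libraries listed on this slide would provide this.

```python
import threading
from typing import Dict, Sequence, Tuple


class LabelledCounter:
    """A monotonically increasing counter with one series per label combination."""

    def __init__(self, name: str, help_text: str, label_names: Sequence[str]):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values: Dict[Tuple[str, ...], float] = {}
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only go up")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}")
        key = tuple(labels[name] for name in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Render all series in the text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            items = sorted(self._values.items())
        for key, value in items:
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


orders = LabelledCounter("shop_orders_processed_total",
                         "Orders processed, by outcome.", ["status"])
orders.inc(status="ok")
orders.inc(2, status="ok")
orders.inc(status="failed")
print(orders.render())
```

The counter records a number per outcome rather than a free-text status, which keeps it usable in queries and alerts.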

STEP 4 - ADAPT TO YOUR INFRA MONITORING [ FILTER || TAG || SELECTOR ]
kubernetes_sd_config
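Filtering discovered targets by tag is typically done with relabeling. As a hedged illustration of the semantics only (not Prometheus code), the sketch below mimics a `keep` rule: the values of the source labels are joined with `;` and the target is kept only if the whole string matches the regex. The sample targets are invented; the `__meta_kubernetes_*` names follow the pattern of the meta labels that Kubernetes discovery attaches to pods.

```python
import re
from typing import Dict, List, Sequence


def relabel_keep(targets: List[Dict[str, str]],
                 source_labels: Sequence[str],
                 regex: str,
                 separator: str = ";") -> List[Dict[str, str]]:
    """Keep only targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        joined = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(joined):   # the match is anchored at both ends
            kept.append(labels)
    return kept


# Hypothetical discovery output for three pods.
discovered = [
    {"__address__": "10.1.0.5:8080",
     "__meta_kubernetes_namespace": "prod",
     "__meta_kubernetes_pod_label_app": "checkout"},
    {"__address__": "10.1.0.6:8080",
     "__meta_kubernetes_namespace": "staging",
     "__meta_kubernetes_pod_label_app": "checkout"},
    {"__address__": "10.1.0.7:8080",
     "__meta_kubernetes_namespace": "prod",
     "__meta_kubernetes_pod_label_app": "search"},
]

selected = relabel_keep(
    discovered,
    ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_label_app"],
    r"prod;checkout",
)
print([target["__address__"] for target in selected])
```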

STEP 5 - METRIC DESIGN
‣ Review sample METRICS and GRAPHS
‣ Define | Reuse
‣ Naming conventions { https://prometheus.io/docs/practices/naming/ }
‣ Quantifiable [ numbers not strings … ]
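Naming conventions are easier to keep when they are checked automatically, for example in a metric design review or a CI job. This is a small sketch of such a check covering three of the published conventions (valid characters, `_total` suffix on counters, base units such as seconds and bytes). The `lint_metric_name` function and the list of discouraged units are assumptions for illustration, not an official linter.

```python
import re
from typing import List

_VALID_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
# Illustrative, incomplete list of non-base units.
_NON_BASE_UNITS = ("milliseconds", "microseconds", "minutes", "hours",
                   "kilobytes", "megabytes", "gigabytes")


def lint_metric_name(name: str, metric_type: str) -> List[str]:
    """Return a list of human-readable problems; empty means the name looks fine."""
    problems = []
    if not _VALID_NAME.fullmatch(name):
        problems.append(f"{name}: contains characters outside [a-zA-Z0-9_:]")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append(f"{name}: counters should end in _total")
    for unit in _NON_BASE_UNITS:
        if unit in name.split("_"):
            problems.append(f"{name}: prefer base units (seconds, bytes) over {unit}")
    return problems


for metric, kind in [("http_requests_total", "counter"),
                     ("http-requests", "counter"),
                     ("request_duration_milliseconds", "histogram")]:
    print(metric, lint_metric_name(metric, kind))
```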

DEVELOPER TOOL

DEVELOPER TOOL - SIMPLE GRAPHS

DEVELOPER TOOL - METRICS - USING PROMQL
▸ Simple queries:
▸ rate(http_requests_total[5m])
▸ Linear predictions
▸ predict_linear(node_filesystem_free[1h], 4*3600)
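To give a feel for what the two queries on this slide compute, here is a simplified sketch in plain Python. `simple_rate` is the per-second increase between the first and last sample in a window; the real `rate()` additionally handles counter resets and extrapolates to the window boundaries. `predict_linear` fits a least-squares line through the samples and evaluates it a number of seconds ahead; this sketch uses the last sample's time as the reference point, which is an assumption. The sample data is invented.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase between the first and last sample (no reset handling)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def predict_linear(samples: List[Sample], seconds_ahead: float) -> float:
    """Least-squares linear fit, evaluated seconds_ahead after the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_ref = samples[-1][0]
    xs = [t - t_ref for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    variance = sum((x - mean_x) ** 2 for x in xs)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x   # fitted value at the last sample time
    return intercept + slope * seconds_ahead


requests = [(0, 100), (60, 160), (120, 220)]        # a counter
free_mb = [(0, 1000), (1800, 900), (3600, 800)]     # free disk space over 1h

print(simple_rate(requests))                # requests per second
print(predict_linear(free_mb, 4 * 3600))    # projected free space in 4 hours
```

With this invented data the projection reaches roughly zero free space four hours out, which is the kind of condition an alert on `predict_linear(...) < 0` is meant to catch early.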

GRAFANA - SIMILAR WORKING EXPERIENCE - MUCH NICER

STEP 6 - ALERT DESIGN
‣ Review new METRICS and GRAPHS; define | design thresholds
‣ Define severity
‣ Ownership
‣ Escalation ladder

ALERT DESIGN
▸ ALERT <alert name>
▸ IF <expression>
▸ [ FOR <duration> ]
▸ [ LABELS <label set> ]
▸ [ ANNOTATIONS <label set> ]
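The optional FOR clause is what keeps short spikes from paging anyone: the expression has to stay true for the whole duration before the alert moves from pending to firing. A small sketch of that state logic follows; the function name and the evaluation data are made up, and the state names follow the inactive / pending / firing states shown in the Prometheus UI.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(evaluations: Iterable[Tuple[float, bool]],
                 for_seconds: float) -> List[str]:
    """Return the alert state after each (timestamp, condition_is_true) evaluation."""
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            active_since = None            # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# One evaluation per minute; the condition flaps once, then stays true.
evaluations = [(0, False), (60, True), (120, True), (180, False),
               (240, True), (300, True), (360, True)]
print(alert_states(evaluations, for_seconds=120))
```

The short blip at 60-120 seconds never leaves pending; only the sustained condition from 240 seconds on reaches firing, which is one way FOR cuts down on false positives.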

ALERT FOR ANY INSTANCE THAT IS UNDER HIGH LOAD
ALERT high_load
  IF node_load1 > 0.5
  ANNOTATIONS {
    description = "{{ $labels.instance }} of job {{ $labels.job }} is under high load.",
    summary = "Instance {{ $labels.instance }} under high load"
  }

STILL LOOKING FOR ONLINE EDITOR FOR EASE OF DEVELOPMENT
https://github.com/alerta/prometheus-config

SIMPLE YAML FILE
route:                      # where to route to
  receiver: 'slack'
receivers:                  # router details
- name: 'slack'
  slack_configs:
  - send_resolved: true
    username: '<username>'
    channel: '#<channel-name>'
    api_url: '<incoming-webhook-url>'

ALERTING
global:                     # variables | global configuration
  resolve_timeout: 5m
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/generic/2010-04-15/create_event.json
  hipchat_url: https://api.hipchat.com/
  opsgenie_api_host: https://api.opsgenie.com/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: slack
receivers:                  # channel configuration
- name: slack
  slack_configs:
  - api_url: <secret>
    username: <username>
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '{{ template "slack.default.title" . }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    pretext: '{{ template "slack.default.pretext" . }}'
    text: '{{ template "slack.default.text" . }}'
    fallback: '{{ template "slack.default.fallback" . }}'
    icon_emoji: '{{ template "slack.default.iconemoji" . }}'
    icon_url: '{{ template "slack.default.iconurl" . }}'
templates: []

ALERT TEMPLATING
▸ What | How to say …
https://prometheus.io/blog/2016/03/03/custom-alertmanager-templates/
  api_url: <secret>
  username: <username>
  color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
  title: '{{ template "slack.default.title" . }}'
  title_link: '{{ template "slack.default.titlelink" . }}'
  pretext: '{{ template "slack.default.pretext" . }}'
  text: '{{ template "slack.default.text" . }}'
  fallback: '{{ template "slack.default.fallback" . }}'
  icon_emoji: '{{ template "slack.default.iconemoji" . }}'
  icon_url: '{{ template "slack.default.iconurl" . }}'

SILENCING, VIA UI / API

ANSWERS REQUIRED FEATURES
Accessibility
Scheduling
SLAs assured
Authentication & Authorization
Escalation
Durable & Resilient
Forensics
Automatic
Flexible & Elastic
Accountable

NEXT STEPS
INFRASTRUCTURE (OS)
APPLICATION
REMEDIABLE ?
ALERTABLE ?
LOG CORRELATION
ALERT MANAGER
LEGACY
IDENTIFY
CHOOSE

DEMO TIME
‣ Docker-compose - ready for R&D to start using to create custom application metrics.
‣ Prometheus, Node_exporter, Alertmanager, cAdvisor, Grafana

DOCKER SETTINGS - VOLUMES, NETWORKS
version: '2'                # Docker-compose version
volumes:                    # Docker volumes for prometheus and grafana
  prometheus_data: {}
  grafana_data: {}
networks:                   # Docker networks
  front-tier:
    driver: bridge
  back-tier:
    driver: bridge

PROMETHEUS - OFFICIAL CONTAINER
services:
  prometheus:               # Docker service name
    image: prom/prometheus
    container_name: prometheus
    volumes:                # Docker volumes for prometheus and grafana
      - ./prometheus/:/etc/prometheus/
      - prometheus_data:/prometheus
    command:                # configuration
      - '-config.file=/etc/prometheus/prometheus.yml'
      - '-storage.local.path=/prometheus'
      - '-alertmanager.url=http://alertmanager:9093'
    expose:                 # expose as service on specified port
      - 9090
    ports:                  # ports to expose as service
      - 9090:9090
    links:                  # link to cadvisor & alertmanager
      - cadvisor:cadvisor
      - alertmanager:alertmanager
    depends_on:
      - cadvisor
    networks:               # network placement 'back-tier'
      - back-tier

ALERT MANAGER
  alertmanager:
    image: prom/alertmanager
    ports:
      - 9093:9093
    volumes:
      - ./alertmanager/:/etc/alertmanager/
    networks:
      - back-tier
    command:
      - '-config.file=/etc/alertmanager/config.yml'
      - '-storage.path=/alertmanager'

CADVISOR
  cadvisor:
    image: google/cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    expose:
      - 8080
    networks:
      - back-tier

GRAFANA
  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana
    env_file:
      - config.monitoring
    networks:
      - back-tier
      - front-tier

DOCKER PS
CONTAINER ID  IMAGE               COMMAND                 CREATED       STATUS        PORTS                   NAMES
3dcfd7c289cb  grafana/grafana     "/run.sh"               21 hours ago  Up 4 minutes  0.0.0.0:3000->3000/tcp  prometheus_grafana_1
2b2817fc0bd9  prom/prometheus     "/bin/prometheus -..."  21 hours ago  Up 4 minutes  0.0.0.0:9090->9090/tcp  prometheus
d2c6849d3bd9  google/cadvisor     "/usr/bin/cadvisor..."  21 hours ago  Up 4 minutes  8080/tcp                prometheus_cadvisor_1
d4a3c3ceb97d  prom/node-exporter  "/bin/node_exporte..."  21 hours ago  Up 4 minutes  9100/tcp                node-exporter
75eb08791ea9  prom/alertmanager   "/bin/alertmanager..."  21 hours ago  Up 4 minutes  0.0.0.0:9093->9093/tcp  prometheus_alertmanager_1

DEMO PROJECT ON GITHUB
https://github.com/shelleg/monlog-compose-stack

‣ All containers - monitored by Prometheus + graphed in a nice small project.

ROLLOUT [ LLD ]

PLACEMENT OPTIONS
‣ 1 main Prometheus server vs. 1 Prometheus server per team
‣ 1 Alertmanager [ with pre-defined “receivers” ] vs. 1 per team / concern

DEPLOYMENT OPTIONS
‣ Automate deployment of Prometheus server(s) / Alertmanager [ pre-defined “receivers” ]
‣ Ansible, Puppet, etc.
‣ Jenkins
‣ A combination of the two ;)
‣ Automation helps solve the “one-to-many” dilemma IMHO …

DEVELOPER STACK
‣ Options:
  ‣ Personal Docker / Docker-compose [ private fork if desired ]
  ‣ A small startup.cmd / startup.sh starting the Go applications Prometheus & Alertmanager
  ‣ A centralized Grafana / Alertmanager with only Prometheus on the dev machine
‣ Toolkit to
  ‣ Develop metrics, alarms, graphs
  ‣ Add exporters to configuration [ tendency :: as common as you develop new services ]
‣ SDLC -> Git pull/merge request mechanism

DEVELOPER STACK(S) - EXAMPLE

ALERTS IN SCM MASTER -> STG -> PRD

POPULATE ALERTS | METRICS | DASHBOARDS VIA SCM
1. Use “ready made” || good starting point graphs from the Grafana dashboard exchange, or build your own
2. Customize
3. Add / push to the git master branch
4. “ci” server -> listens on a Git hook -> pushes to staging
5. “ci” server -> waits for a manual trigger -> pushes to production

CONTINUOUS DELIVERY OPTIONS [ ADDING AN ALERT SAMPLE WORKFLOW ]
master (dev)
staging
production
DEVELOP
DEPLOY TO STAGE
DEPLOY TO PROD
1 centralized repo
branch per env / prometheus instance

CONTINUOUS DELIVERY OPTIONS [ ADDING GRAPHS ]
master (dev)
staging
production
DEVELOP
DEPLOY TO STAGE
DEPLOY TO PROD
“Grafana Dashboard hub”
- separate repo ?
- part of monitoring repo ?

CI PIPELINE - DATA ORIGINS & PRESENTATION
Exporters
REGION POD INSTANCE *
App Metrics
OS Metrics
Filter Tags & Alerts

CI PIPELINE
DEV
STAGING
PRODUCTION
STACK / APP
NAME
ALERTMANAGER
ALERTMANAGER
Web-hook (PR-builder)
GRAFANA
GRAFANA
OPS “CLEANUP” ROUTINE(S)

BUILDING THE PIPELINE
‣ Routine on submit / push builds to dev/stg
‣ Run daily / weekly deployments of Alerts (Prometheus) | Dashboards (Grafana)
‣ Avoid / roll back any manual changes to alerts / graphs etc.
‣ Help make automation a common practice
‣ Scheduled task which syncs and re-configures the desired state from SCM
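The "sync the desired state from SCM" routine can be sketched as a small idempotent job: copy every rule file from the repository checkout into the live rules directory, overwrite files that were edited by hand, and delete files that exist only on the server. The directory layout, the `*.rules` file pattern, and the function name are assumptions for illustration; after a run that reports changes, Prometheus would still need to be told to reload its configuration (for example with SIGHUP), which this sketch leaves out.

```python
import filecmp
import shutil
from pathlib import Path
from typing import List


def sync_desired_state(repo_dir: Path, live_dir: Path,
                       pattern: str = "*.rules") -> List[str]:
    """Make live_dir match repo_dir for files matching pattern.

    Returns the names of files that were copied or removed.
    """
    changed: List[str] = []
    live_dir.mkdir(parents=True, exist_ok=True)
    wanted = {path.name: path for path in repo_dir.glob(pattern)}

    # Copy new files and overwrite any that drifted from the repository.
    for name, source in sorted(wanted.items()):
        target = live_dir / name
        if not target.exists() or not filecmp.cmp(source, target, shallow=False):
            shutil.copy2(source, target)
            changed.append(name)

    # Remove files that were added manually and are not in the repository.
    for stale in sorted(live_dir.glob(pattern)):
        if stale.name not in wanted:
            stale.unlink()
            changed.append(stale.name)
    return changed
```

Running it from a scheduler (cron, or a CI job) and reloading only when the returned list is non-empty keeps manual changes from surviving longer than one sync interval.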

MEASURE THE PIPELINE
‣ Pipeline steps are monitored
‣ Expose metrics such as:
‣ deployment time & status [ in env | stack etc ]
‣ count (# of alerts, new vs old last week, month etc)
‣ Metric counters [ application metrics ] …
‣ [ Jenkins exporter || push gateway TBD ]
Tikal Knowledge

FEEDBACK / QUESTIONS ? I’M HERE …
HAGZAG@TIKALK.COM, 0545302525
Haggai Philip Zagury - Tikal Knowledge
MONITORING HLD
FullStack Developers Israel
