ZMON - OS monitoring in the cloud
Atmosphere 2016 | Krakow 17.5.2016 | jan.mussler@zalando.de | @JanMussler
15 countries
3 fulfillment centers
18+ million active customers
3.0+ billion € revenue
135+ million visits per month
1,000+ employees in tech
Europe's Leading Fashion Platform
Visit us: tech.zalando.com
Zalando’s Technology History
RADICAL AGILITY
AUTONOMY
➊ One AWS account per Team
➋ Deployment with Docker
➌ Managed SSH Access
➍ REST/OAuth 2.0 mandatory
➎ Traceability of changes
IN A NUTSHELL
STUPS
AWS
DEPLOYMENT
Senza CLI
Deploy Tool
Pier One
Docker Registry
docker pull
docker push
Taupage
AMI
[Diagram: Internet; Team ABC (*.abc.example.org) and Team XYZ (*.xyz.example.org), each with ELB and EC2 instances]
ISOLATED AWS ACCOUNTS
ZMON
Flexible and extendable: Checks & Alerts in Python
Integrate: REST APIs, OAUTH2, AWS Auto Discovery
Fully configurable via UI / API: no restarts required!
Great for teams: team dashboards, alerts inheritance
Fast/scaling metrics: Redis, KairosDB + Grafana2
Hackweek 2015 - iOS app and Android app ;-)
ZMON - Highlights ;-)
Display historic data using Grafana 2
Notifications plus iOS and Android App
E-Mail
Full authentication for all endpoints
OAuth 2.0 login flow (e.g. via GitHub login)
“TV Tokens” for “read-only” dashboard login
Grafana 2 bundled and API implemented
● ZMON stores dashboards incl. tags/stars
● KairosDB proxy
● Elasticsearch proxy (in progress)
ZMON Controller -> UI + REST API
Example
Plan B Deployment - Multi Region Setup (JWT issue/verification)
[Diagram: in each region, an ELB in front of Provider (Java) and an ELB in front of Tokeninfo (Go), backed by C* nodes]
Will create “entities” to describe deployment
ELBs, ASGs, Application, instances,...
Crawls AWS API every 60 sec to update
ZMON AWS Agent - Auto Discovery
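A minimal sketch of the kind of mapping such an agent performs, assuming a simplified, already-flattened instance description. The input layout and the helper name `instance_to_entity` are hypothetical and are not taken from the agent's code; only the output fields follow the example instance entity shown in this deck.

```python
def instance_to_entity(instance, account, region):
    """Build an entity dict from a simplified instance description.

    `instance` is a hypothetical flattened dict; a real agent would
    fill it from AWS API responses on every crawl.
    """
    private_ip = instance["private_ip"]
    return {
        # id layout mirrors the example: name[account:region]
        "id": "{}[{}:{}]".format(instance["name"], account, region),
        "type": "instance",
        "application_id": instance["application_id"],
        "host": private_ip,
        "ip": private_ip,
        "infrastructure_account": account,
        "instance_type": instance["instance_type"],
        # ports are keyed by their string form, as in the example entity
        "ports": {str(port): port for port in instance["ports"]},
        "region": region,
    }


example = instance_to_entity(
    {
        "name": "planb-tokeninfo-cd44-oFM6x",
        "application_id": "planb-tokeninfo",
        "private_ip": "172.31.169.6",
        "instance_type": "c4.xlarge",
        "ports": [9020, 9021],
    },
    account="aws:999",
    region="eu-west-1",
)
print(example["id"])
```

Running the mapping on every crawl and replacing the stored entity keeps the entity store in step with what is actually deployed.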
➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
type: instance
application_id: planb-tokeninfo
host: 172.31.169.6
infrastructure_account: aws:999
instance_type: c4.xlarge
ip: 172.31.169.6
ports: { '9020': 9020, '9021': 9021 }
region: eu-west-1
source: registry.opensource.zalan.do/stups/planb-tokeninfo:cd44
stack_name: planb-tokeninfo-eu-west-1
stack_version: cd44
Example Instance Entity
Instance Metrics
● Memory usage
● Disk space usage
● CPU usage
● Application logs
● Application metrics
Monitoring Plan-B instances on AWS
[Diagram: Taupage AMI (Ubuntu base) running the Scalyr Agent (log shipping), the Prometheus Node Agent (:9100/metrics), and the Docker runtime with the application container (Go / Spring Boot / Cassandra; :8080 -> app, :7979 -> metrics)]
Jolokia Request Example
Check Results
Alert on application metrics
HTTP requests reading JSON application metrics
Read JMX data via Jolokia/HTTP for Cassandra
Read Prometheus Node data for EC2 metrics
CloudWatch() queries for ELB metrics
Scalyr API queries for application logs
Check commands used so far
Annotated Metric Data in Grafana
Entities
● hosts, databases, applications, instances ...
● generic key value object
● 10000+ entities in our deployment
Entities
{
"id": "node01:8080",
"type": "instance",
"host": "node01",
"ports": {"8080":8080,"8181":8181},
"application_id": "zmon",
"application_version": "0.1.0",
"dc":"dc1"
}
Entity "node01:8080"
Entity Service (part of controller)
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
  local_zmon_db: "localhost:5432/local_zmon_db"
local-postgres.yaml
Integrated easy-to-use entity store with REST API
Build your own discovery agent (K8S, …)
>zmon entities push local-postgres.yaml
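A custom discovery agent mostly comes down to generating entity objects like the one above. The sketch below builds such an entity and serializes it as JSON. The helper names are hypothetical, and the rule that an entity needs a non-empty `id` and `type` is an assumption for illustration. How the result is submitted (CLI or REST API) is deliberately left out.

```python
import json


def make_postgres_entity(host, port, databases):
    """Return an entity dict modeled on local-postgres.yaml."""
    address = "{}:{}".format(host, port)
    return {
        "id": address,
        "type": "postgres",
        "host": host,
        "port": port,
        # one shard entry per database: name -> host:port/name
        "shards": {db: "{}/{}".format(address, db) for db in databases},
    }


def validate_entity(entity):
    """Assumed minimal rule: every entity carries a non-empty id and type."""
    missing = [key for key in ("id", "type") if not entity.get(key)]
    if missing:
        raise ValueError("entity is missing: " + ", ".join(missing))
    return entity


entity = validate_entity(
    make_postgres_entity("localhost", 5432, ["local_zmon_db"])
)
print(json.dumps(entity, indent=2, sort_keys=True))
```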
Checks
● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch,
Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ...
● returns "value" object
○ Before long, every check returned a dict
Checks
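The check model above (select entities, evaluate a Python expression in a custom context, return a value) can be sketched roughly as follows. This is not ZMON's worker code: `port_count` is a hypothetical stand-in for real builtins such as the HTTP or SQL helpers, and the filter semantics (an entity matches if all keys of one filter match) are an assumption.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


def select_entities(entities, filters):
    """Select entities that match at least one of the given filters."""
    return [e for e in entities if any(matches(e, f) for f in filters)]


def build_context(entity):
    """Names visible to the check expression for one entity."""
    def port_count():  # hypothetical builtin, bound to the current entity
        return len(entity.get("ports", {}))
    return {"entity": entity, "port_count": port_count}


def run_check(command, entity):
    """Evaluate the check expression with only the custom context visible.

    Emptying __builtins__ limits the available names; it is not a sandbox.
    """
    return eval(command, {"__builtins__": {}}, build_context(entity))


entities = [
    {"id": "node01:8080", "type": "instance", "host": "node01",
     "ports": {"8080": 8080, "8181": 8181}, "dc": "dc1"},
    {"id": "localhost:5432", "type": "postgres", "host": "localhost"},
]
command = '{"host": entity["host"], "ports": port_count()}'
selected = select_entities(entities, [{"type": "instance"}])
results = {e["id"]: run_check(command, e) for e in selected}
print(results)
```

Returning a dict rather than a single number lets several alerts pick different keys from the same check result.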
REST API to update or use web front end
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team ZMON"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgresql
interval: 15
description: "Test connection to PostgreSQL"
select-1-check.yaml
Trial Run - Quick feedback and easier development
Alerts
● Executes using a check’s value, bound to single check
● Defines team and responsible team
● Allows inheritance from other alert
● Evaluates Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities(color) and tags
Alerts
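A minimal sketch of the alert model, assuming an alert is a plain dict whose condition is a Python expression over the check's `value`. The field names (`name`, `team`, `condition`, `priority`) and the `inherit` helper are hypothetical; the point is that the outcome is strictly True or False and that a child alert only overrides what it needs.

```python
def evaluate_alert(condition, value):
    """Return True if the alert is active for this check value, else False."""
    return bool(eval(condition, {"__builtins__": {}}, {"value": value}))


def inherit(parent, overrides):
    """Child alert: a copy of the parent's fields with some overridden."""
    child = dict(parent)
    child.update(overrides)
    return child


base_alert = {
    "name": "Too many open ports",
    "team": "Team ZMON",
    "condition": "value['ports'] > 5",
    "priority": 2,
}
# Another team reuses the alert but applies a stricter threshold.
team_alert = inherit(
    base_alert, {"team": "Team Foo", "condition": "value['ports'] > 1"}
)

check_value = {"host": "node01", "ports": 2}
print(evaluate_alert(base_alert["condition"], check_value))  # False
print(evaluate_alert(team_alert["condition"], check_value))  # True
```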
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
Anyone can add alerts to checks
Alerts are owned by team
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
Deployment
ZMON Core + UI + KairosDB
[Diagram components: Scheduler (JVM), Workers (Python), Redis (queue/state), Controller (Java), PostgreSQL (check/alert definitions, entity data), KairosDB (Java) on Cassandra, Frontend (AngularJS), CLI (Python), Metric Cache]
ZMON in AWS / Multi DC Setup
[Diagram labels: Team "Foo" (*.foo.example.org), Team "Bar" (*.bar.example.org), EC2 instances, ZMON Appliance, ELBs, ZMON Data Service]
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
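The two filter mechanisms can be pictured as a small routing function. This is a sketch under assumptions: the queue names, the default queue, and the shape of the filter configuration are hypothetical and do not reflect the scheduler's actual configuration format.

```python
def matches(entity, entity_filter):
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(key) == value for key, value in entity_filter.items())


def route(entity, base_filter, queue_filters, default_queue="queue-default"):
    """Return the queue for an entity, or None if this scheduler skips it."""
    # Base filter: this scheduler only handles matching entities at all.
    if not matches(entity, base_filter):
        return None
    # Queue filters: the first queue with a matching filter wins.
    for queue, entity_filter in queue_filters.items():
        if matches(entity, entity_filter):
            return queue
    return default_queue


queue_filters = {
    "queue-dc1": {"dc": "dc1"},
    "queue-dc2": {"dc": "dc2"},
}
node_dc1 = {"id": "node01:8080", "dc": "dc1"}
node_dc2 = {"id": "node02:8080", "dc": "dc2"}

# One scheduler for everything, separate worker queues per data center.
print(route(node_dc2, {}, queue_filters))
# A scheduler restricted to dc1 ignores entities from dc2.
print(route(node_dc2, {"dc": "dc1"}, queue_filters))
```

Workers in each data center then consume only their own queue and report results back over Redis or HTTPS.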
Microservices
Expose your data / Convention on key names/structure
{
"zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
"zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
"zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
"zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
"zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
"zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
"zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
"zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
}
Application metrics
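With a naming convention like the one in the sample keys, a check can regroup the flat map per endpoint. The sketch below assumes the layout `<app>.response.<status>.<method>.<path segments...>.<metric>`, which is inferred from the sample keys only; `parse_key` and `group_metrics` are hypothetical helper names.

```python
from collections import defaultdict


def parse_key(key):
    """Split a flat metric key into its assumed components."""
    parts = key.split(".")
    if len(parts) < 6 or parts[1] != "response":
        raise ValueError("unexpected key layout: " + key)
    return {
        "app": parts[0],
        "status": int(parts[2]),
        "method": parts[3],
        # everything between the method and the metric name is the path
        "path": "/" + "/".join(parts[4:-1]),
        "metric": parts[-1],
    }


def group_metrics(flat):
    """Group values by (method, path, status), keyed by metric name."""
    grouped = defaultdict(dict)
    for key, value in flat.items():
        parsed = parse_key(key)
        endpoint = (parsed["method"], parsed["path"], parsed["status"])
        grouped[endpoint][parsed["metric"]] = value
    return dict(grouped)


flat = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
}
grouped = group_metrics(flat)
endpoint = ("GET", "/checks/all-active-check-definitions", 200)
print(grouped[endpoint]["count"])
```

Because every service follows the same key structure, one check and one set of alerts can cover the endpoints of many applications.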
Continued ...
Spring boot (extending metrics)
https://github.com/zalando/zmon-actuator
Python (Swagger first on Flask)
https://github.com/zalando/connexion
Clojure (Swagger first)
https://github.com/zalando-stups/friboo/
Example libraries and framework support ...
Demo:
https://demo.zmon.io
ZMON on Github:
https://github.com/zalando/zmon
Documentation:
https://docs.zmon.io
Zalando Tech:
https://tech.zalando.com

Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in the cloud and DCs
