SlideShare a Scribd company logo
ZMON - Monitoring Our Platform
DevOps Meetup Dublin | September 3, 2015 | jan.mussler@zalando.de | @JanMussler
15 countries
3 fulfillment centers
16+ million active customers
2.2+ billion € revenue 2014
130+ million visits per month
8.000+ employees
ONE of EUROPE’S LARGEST ONLINE FASHION RETAILERS
Visit us: tech.zalando.com
Zalando’s Technology History
(Some!) Technologies We Use
Monitoring Situation Until Late 2013
ICINGA plus custom frontend (ZMON 1)
Did not scale with growth:
● Our UI became too slow
● Number of systems to check too many
● Number of teams that wanted checks grew
● Every request had to go through single team
Improve performance and throughput
Autonomy for individual teams
Flexibility and extendability
Integration into tooling (CMDB, DeployCtl …)
Goals of new ZMON development
Entity:
Anything you may want to monitor
Can be used as a "dimension"
Checks:
Runnable Python snippet fetching data
Alert on Check:
Python expression yielding true or false
The basic terminology ...
Zalando Tech - 24x7 team setup
Incident Team
Incident Team
Alerts
Database
Team
Alerts
2nd Level
Database
Calls if help needed
Inheritance with
custom thresholds
SMS / E-Mail
observes
Customizable ZMON dashboards
Customizable ZMON dashboards
Customizable ZMON dashboards
Display historic data using Grafana
Workers
(Python)
Workers
(Python)
ZMON’s core components
Scheduler
(jvm)
Redis
Workers
(Python)
KairosDB
(java)
Controller
(Java)
PostgreSQL
Queue/State
CLI
(Python)
Check/Alert definition
Entity data
Cassandra
Frontend
(Angular)
Redis
Slave
● hosts, databases, applications, instances ...
● generic key value object
● 4000+ entities in our deployment
Entities
{
"id": "node01:8080",
"type": "instance",
"host": "node01",
"ports": {"8080":8080,"8181":8181},
"application_id": "zmon",
"application_version": "0.1.0",
"dc":"dc1"
}
Entity "node01:8080"
Database Entity
{
"id": "customer-live-slave",
"type": "database",
"role": "slave",
"environment": "live",
"shards": {
"customer1": "customer1.db:5432/customer1"
"customer2": "customer2.db:5432/customer2"
"customer3": "customer3.db:5432/customer3"
"customer4": "customer4.db:5432/customer4"
}
}
Entity: customer-live-slave
Integrated easy-to-use entity store with REST API
>zmon entities push local-postgres.yaml
Entity Service
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
local_zmon_db: "localhost:5432/local_zmon_db"
local-postgres.yaml
● select subset of entities
● executes Python expression
○ powerful using eval with custom context
○ Builtins: HTTP, PostgreSQL, MySQL, Cloudwatch,
Redis, SNMP, tcp, SOAP, Scalyr...
● returns "value" object
○ Quickly, every check returned "dicts"
Checks
REST API to update / auto-import from SCM
zmon check-definitions update select-1-check.yaml
Managing checks
name: "Select 1"
owning_team: "Team 1"
command: |
sql().execute("select 1 as a").results()
entities:
- type: postgres
interval: 15
description: "test connection"
select-1-check.yaml
● Executes using a check’s value, bound to single check
● Defines team and responsible team
● Allows inheritance from other alert
● Evaluates Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities and tags
Alerts
Trial Run - Quick feedback and download YAML
Anyone can add alerts to checks
Alerts are owned by team
Monitor application boundaries/dependencies
Make use of inheritance to customize
Sharing and reuse of alerts and checks
Workers
(Python)
Workers
(Python)
ZMON Core + UI + KairosDB
Scheduler
(jvm)
Redis
Worker
(Python)
KairosDB
(java)
Controller
(Java)
PostgreSQL
Queue/State
CLI
(Python)
Check/Alert definition
Entity data
Cassandra
Frontend
(Angular)
Redis
Slave
Vagrant Box deploys Docker images
Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
Extendability - Check and Alert functions
● Improve user experience through provided functions
Extendability - Check and Alert functions
● Improve user experience through function wrappers
The Microservices World
● Request rates
● Response rates by HTTP status code
● Latency
Key Metrics for your service?
Expose your data
{
"zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
"zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18076110580284566,
"zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.1518180485219247,
"zmon.response.200.GET.checks.all-active-check-definitions.meanRate": 0.06792011610723951,
"zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512398137982051,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.75thPercentile": 1173,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.95thPercentile": 1233,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.98thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.999thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.max": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.mean": 1170,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.median": 1161,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.min": 1114,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.stdDev": 42,
}
Start tracking your metrics
Display application statistics
Application metrics
Continued ...
Reuse of check
Spring boot
https://github.com/zalando/zmon-actuator
Clojure
https://github.com/zalando-stups/friboo/
Play (done, to be released)
Libraries available for
Multi DC / AWS
ZMON in AWS Setup
*.foo.example.org *.bar.example.org
Team "Foo" Team "Bar"
EC2
Instance
EC2
InstanceEC2
Instance
EC2
Instance
ZMON
Appliance
ZMON
Appliance
KairosDB
EC2
Instance
EC2
Instance
ZMON
Data Service
ELB ELB
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
Multi DC / Zone deployment possible
Uses Amazon API to fetch:
● ELBs
● EC2 instances
● RDS instances
Pushes enriched entities to entity service
ZMON AWS Agent
Prometheus?
read "text" result
Kubernetes Example: Exports in Prometheus text format
kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
Yields a usable nested dictionary
{"list_images":
{"0.9":"120252",
"0.99":"120252",
"0.5":"120252"},
"version":
{"0.9":"1281",
"0.99":"2183",
"0.5":"873"},
"list_containers":
{"0.9":"9740",
"0.99":"23378",
"0.5":"3717"},
"inspect_container":
{"0.9":"9602",
"0.99":"18367",
"0.5":"4419"}
}
Internals
ZMON’s basic data flow
Scheduler
(jvm)
Redis
{"check":
{"id": 1,
"entity": {"host":"monitor01"},
"command": "snmp().load()",
"alerts":[
{"id":100,
"condition": "value[‘load1’]>10"}
]
}
}
ZMON’s basic data flow
Worker
(python)
Redis
-- store check result "snmp().load()"
lpush zmon:check:1:monitor01 {"load1":5,"load5":3,"load15:2}
-- keep last 20 results (for dashboard charts)
ltrim zmon:check:1:monitor01 20
-- alert active?
sadd zmon:alert:100 monitor01
-- alert inactive?
srem zmon:alert:100 monitor01
ZMON Vagrant Box:
https://github.com/zalando/zmon
ZMON Homepage:
https://zalando.github.io/zmon
Zalando Tech:
https://tech.zalando.com

More Related Content

What's hot

03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
Subhas Kumar Ghosh
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
Guido Schmutz
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Julian Hyde
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Rahul Jain
 
Curse of Dimensionality and Big Data
Curse of Dimensionality and Big DataCurse of Dimensionality and Big Data
Curse of Dimensionality and Big Data
Stephane Marchand-Maillet
 
Node js Introduction
Node js IntroductionNode js Introduction
Node js Introduction
sanskriti agarwal
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Dineesha Suraweera
 
Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結
James Chen
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
DataWorks Summit/Hadoop Summit
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
JWORKS powered by Ordina
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
James Serra
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
Taehoon Kim
 
Graph database
Graph database Graph database
Graph database
Shruti Arya
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
MapR Technologies
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
nehabsairam
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
Bushra Jbawi
 
ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010
Elasticsearch
 
MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals
Antonios Giannopoulos
 

What's hot (20)

03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
 
Cost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Curse of Dimensionality and Big Data
Curse of Dimensionality and Big DataCurse of Dimensionality and Big Data
Curse of Dimensionality and Big Data
 
Node js Introduction
Node js IntroductionNode js Introduction
Node js Introduction
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結Hadoop 與 SQL 的甜蜜連結
Hadoop 與 SQL 的甜蜜連結
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Mongo db intro.pptx
Mongo db intro.pptxMongo db intro.pptx
Mongo db intro.pptx
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
딥러닝과 강화 학습으로 나보다 잘하는 쿠키런 AI 구현하기 DEVIEW 2016
 
Graph database
Graph database Graph database
Graph database
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
introduction to NOSQL Database
introduction to NOSQL Databaseintroduction to NOSQL Database
introduction to NOSQL Database
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Transfer learning-presentation
Transfer learning-presentationTransfer learning-presentation
Transfer learning-presentation
 
ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010ElasticSearch at berlinbuzzwords 2010
ElasticSearch at berlinbuzzwords 2010
 
MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals MongoDB Sharding Fundamentals
MongoDB Sharding Fundamentals
 

Similar to ZMON: Monitoring Zalando's Engineering Platform

OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...
OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...
OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...
NETWAYS
 
OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...
OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...
OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...
NETWAYS
 
Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in...
Atmosphere 2016 - Jan Mussler -  ZMON: Zalando's OS approach to monitoring in...Atmosphere 2016 - Jan Mussler -  ZMON: Zalando's OS approach to monitoring in...
Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in...
PROIDEA
 
Getting all the 99.99(9) you always wanted
Getting all the 99.99(9) you always wanted Getting all the 99.99(9) you always wanted
Getting all the 99.99(9) you always wanted
Mite Mitreski
 
Airbnb - StreamAlert
Airbnb - StreamAlertAirbnb - StreamAlert
Airbnb - StreamAlert
Amazon Web Services
 
Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014
Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014
Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014
Puppet
 
RabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = AwesomeRabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = Awesome
Lenz Gschwendtner
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB
 
Webinar: Securing your data - Mitigating the risks with MongoDB
Webinar: Securing your data - Mitigating the risks with MongoDBWebinar: Securing your data - Mitigating the risks with MongoDB
Webinar: Securing your data - Mitigating the risks with MongoDB
MongoDB
 
Saltstack - Orchestration & Application Deployment
Saltstack - Orchestration & Application DeploymentSaltstack - Orchestration & Application Deployment
Saltstack - Orchestration & Application Deployment
inovex GmbH
 
Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...
Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...
Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...
Amazon Web Services
 
AWS IoT Deep Dive
AWS IoT Deep DiveAWS IoT Deep Dive
AWS IoT Deep Dive
Kristana Kane
 
Monitoring und Metriken im Wunderland
Monitoring und Metriken im WunderlandMonitoring und Metriken im Wunderland
Monitoring und Metriken im Wunderland
D
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
Srinath Perera
 
NEW LAUNCH! Building Distributed Applications with AWS Step Functions
NEW LAUNCH! Building Distributed Applications with AWS Step FunctionsNEW LAUNCH! Building Distributed Applications with AWS Step Functions
NEW LAUNCH! Building Distributed Applications with AWS Step Functions
Amazon Web Services
 
AWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDuty
AWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDutyAWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDuty
AWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDuty
Chris Farris
 
Connecting Xamarin Apps with IBM Worklight in Bluemix
Connecting Xamarin Apps with IBM Worklight in BluemixConnecting Xamarin Apps with IBM Worklight in Bluemix
Connecting Xamarin Apps with IBM Worklight in Bluemix
IBM
 
SRV408 Deep Dive on AWS IoT
SRV408 Deep Dive on AWS IoTSRV408 Deep Dive on AWS IoT
SRV408 Deep Dive on AWS IoT
Amazon Web Services
 
Ibm xamarin gtruty
Ibm xamarin gtrutyIbm xamarin gtruty
Ibm xamarin gtruty
Ron Favali
 
MongoDB Days UK: Securing Your Deployment with MongoDB Enterprise
MongoDB Days UK: Securing Your Deployment with MongoDB EnterpriseMongoDB Days UK: Securing Your Deployment with MongoDB Enterprise
MongoDB Days UK: Securing Your Deployment with MongoDB Enterprise
MongoDB
 

Similar to ZMON: Monitoring Zalando's Engineering Platform (20)

OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...
OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...
OSMC 2016 | ZMON: Zalando's OS approach to monitoring in the cloud and DCs by...
 
OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...
OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...
OSMC 2016 - ZMON Zalandos OS approach to monitoring in the cloud and DCs by J...
 
Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in...
Atmosphere 2016 - Jan Mussler -  ZMON: Zalando's OS approach to monitoring in...Atmosphere 2016 - Jan Mussler -  ZMON: Zalando's OS approach to monitoring in...
Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in...
 
Getting all the 99.99(9) you always wanted
Getting all the 99.99(9) you always wanted Getting all the 99.99(9) you always wanted
Getting all the 99.99(9) you always wanted
 
Airbnb - StreamAlert
Airbnb - StreamAlertAirbnb - StreamAlert
Airbnb - StreamAlert
 
Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014
Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014
Fully Automate Application Delivery with Puppet and F5 - PuppetConf 2014
 
RabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = AwesomeRabbitMQ + CouchDB = Awesome
RabbitMQ + CouchDB = Awesome
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
 
Webinar: Securing your data - Mitigating the risks with MongoDB
Webinar: Securing your data - Mitigating the risks with MongoDBWebinar: Securing your data - Mitigating the risks with MongoDB
Webinar: Securing your data - Mitigating the risks with MongoDB
 
Saltstack - Orchestration & Application Deployment
Saltstack - Orchestration & Application DeploymentSaltstack - Orchestration & Application Deployment
Saltstack - Orchestration & Application Deployment
 
Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...
Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...
Implementare e gestire soluzioni per l'Internet of Things (IoT) in modo rapid...
 
AWS IoT Deep Dive
AWS IoT Deep DiveAWS IoT Deep Dive
AWS IoT Deep Dive
 
Monitoring und Metriken im Wunderland
Monitoring und Metriken im WunderlandMonitoring und Metriken im Wunderland
Monitoring und Metriken im Wunderland
 
Introduction to WSO2 Data Analytics Platform
Introduction to  WSO2 Data Analytics PlatformIntroduction to  WSO2 Data Analytics Platform
Introduction to WSO2 Data Analytics Platform
 
NEW LAUNCH! Building Distributed Applications with AWS Step Functions
NEW LAUNCH! Building Distributed Applications with AWS Step FunctionsNEW LAUNCH! Building Distributed Applications with AWS Step Functions
NEW LAUNCH! Building Distributed Applications with AWS Step Functions
 
AWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDuty
AWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDutyAWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDuty
AWS re:Inforce 2019 - Threat Hunting in CloudTrail & GuardDuty
 
Connecting Xamarin Apps with IBM Worklight in Bluemix
Connecting Xamarin Apps with IBM Worklight in BluemixConnecting Xamarin Apps with IBM Worklight in Bluemix
Connecting Xamarin Apps with IBM Worklight in Bluemix
 
SRV408 Deep Dive on AWS IoT
SRV408 Deep Dive on AWS IoTSRV408 Deep Dive on AWS IoT
SRV408 Deep Dive on AWS IoT
 
Ibm xamarin gtruty
Ibm xamarin gtrutyIbm xamarin gtruty
Ibm xamarin gtruty
 
MongoDB Days UK: Securing Your Deployment with MongoDB Enterprise
MongoDB Days UK: Securing Your Deployment with MongoDB EnterpriseMongoDB Days UK: Securing Your Deployment with MongoDB Enterprise
MongoDB Days UK: Securing Your Deployment with MongoDB Enterprise
 

More from Zalando Technology

Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Zalando Technology
 
How We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityHow We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards Scalability
Zalando Technology
 
Powering Radical Agility with Docker
Powering Radical Agility with Docker Powering Radical Agility with Docker
Powering Radical Agility with Docker
Zalando Technology
 
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
Zalando Technology
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
Zalando Technology
 
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnReactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Zalando Technology
 
Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Technology
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Zalando Technology
 
Building a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickBuilding a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & Slick
Zalando Technology
 
Radical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesRadical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and Microservices
Zalando Technology
 
Order Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayOrder Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community Day
Zalando Technology
 
Mobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando TechMobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando Tech
Zalando Technology
 
Auto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamAuto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando Team
Zalando Technology
 
Radical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudRadical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the Cloud
Zalando Technology
 

More from Zalando Technology (14)

Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
Stream Processing using Apache Flink in Zalando's World of Microservices - Re...
 
How We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards ScalabilityHow We Made our Tech Organization and Architecture Converge Towards Scalability
How We Made our Tech Organization and Architecture Converge Towards Scalability
 
Powering Radical Agility with Docker
Powering Radical Agility with Docker Powering Radical Agility with Docker
Powering Radical Agility with Docker
 
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
 
High Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando PatroniHigh Availability PostgreSQL with Zalando Patroni
High Availability PostgreSQL with Zalando Patroni
 
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland KuhnReactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
Reactive Design Patterns: a talk by Typesafe's Dr. Roland Kuhn
 
Zalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three MonthsZalando Tech: From Java to Scala in Less Than Three Months
Zalando Tech: From Java to Scala in Less Than Three Months
 
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj TalkSpark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
Spark + Clojure for Topic Discovery - Zalando Tech Clojure/Conj Talk
 
Building a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & SlickBuilding a Reactive RESTful API with Akka Http & Slick
Building a Reactive RESTful API with Akka Http & Slick
 
Radical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and MicroservicesRadical Agility with Autonomous Teams and Microservices
Radical Agility with Autonomous Teams and Microservices
 
Order Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community DayOrder Processing at Scale: Zalando at Camunda Community Day
Order Processing at Scale: Zalando at Camunda Community Day
 
Mobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando TechMobile Testing Challenges at Zalando Tech
Mobile Testing Challenges at Zalando Tech
 
Auto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando TeamAuto-scaling your API: Insights and Tips from the Zalando Team
Auto-scaling your API: Insights and Tips from the Zalando Team
 
Radical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the CloudRadical Agility with Autonomous Teams and Microservices in the Cloud
Radical Agility with Autonomous Teams and Microservices in the Cloud
 

Recently uploaded

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
Alex Pruden
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 

Recently uploaded (20)

Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofszkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 

ZMON: Monitoring Zalando's Engineering Platform

  • 1. ZMON - Monitoring Our Platform DevOps Meetup Dublin | September 3, 2015 | jan.mussler@zalando.de | @JanMussler
  • 2. 15 countries 3 fulfillment centers 16+ million active customers 2.2+ billion € revenue 2014 130+ million visits per month 8.000+ employees ONE of EUROPE’S LARGEST ONLINE FASHION RETAILERS Visit us: tech.zalando.com
  • 5. Monitoring Situation Until Late 2013 ICINGA plus custom frontend (ZMON 1) Did not scale with growth: ● Our UI became too slow ● Number of systems to check too many ● Number of teams that wanted checks grew ● Every request had to go through single team
  • 6. Improve performance and throughput Autonomy for individual teams Flexibility and extendability Integration into tooling (CMDB, DeployCtl …) Goals of new ZMON development
  • 7. Entity: Anything you may want to monitor Can be used as a "dimension" Checks: Runnable Python snippet fetching data Alert on Check: Python expression yielding true or false The basic terminology ...
  • 8. Zalando Tech - 24x7 team setup Incident Team Incident Team Alerts Database Team Alerts 2nd Level Database Calls if help needed Inheritance with custom thresholds SMS / E-Mail observes
  • 12. Display historic data using Grafana
  • 14. ● hosts, databases, applications, instances ... ● generic key value object ● 4000+ entities in our deployment Entities { "id": "node01:8080", "type": "instance", "host": "node01", "ports": {"8080":8080,"8181":8181}, "application_id": "zmon", "application_version": "0.1.0", "dc":"dc1" } Entity "node01:8080"
  • 15. Database Entity { "id": "customer-live-slave", "type": "database", "role": "slave", "environment": "live", "shards": { "customer1": "customer1.db:5432/customer1" "customer2": "customer2.db:5432/customer2" "customer3": "customer3.db:5432/customer3" "customer4": "customer4.db:5432/customer4" } } Entity: customer-live-slave
  • 16. Integrated easy-to-use entity store with REST API >zmon entities push local-postgres.yaml Entity Service id: localhost:5432 type: postgres host: localhost port: 5432 shards: local_zmon_db: "localhost:5432/local_zmon_db" local-postgres.yaml
  • 17. ● select subset of entities ● executes Python expression ○ powerful using eval with custom context ○ Builtins: HTTP, PostgreSQL, MySQL, Cloudwatch, Redis, SNMP, tcp, SOAP, Scalyr... ● returns "value" object ○ Quickly, every check returned "dicts" Checks
  • 18. REST API to update / auto-import from SCM zmon check-definitions update select-1-check.yaml Managing checks name: "Select 1" owning_team: "Team 1" command: | sql().execute("select 1 as a").results() entities: - type: postgres interval: 15 description: "test connection" select-1-check.yaml
  • 19.
  • 20. ● Executes using a check’s value, bound to single check ● Defines team and responsible team ● Allows inheritance from other alert ● Evaluates Python expression yielding True/False ● No "WARNING" state, no "UNKNOWN" state ● Priorities and tags Alerts
  • 21.
  • 22.
  • 23. Trial Run - Quick feedback and download YAML
  • 24. Anyone can add alerts to checks Alerts are owned by team Monitor application boundaries/dependencies Make use of inheritance to customize Sharing and reuse of alerts and checks
  • 25. Workers (Python) Workers (Python) ZMON Core + UI + KairosDB Scheduler (jvm) Redis Worker (Python) KairosDB (java) Controller (Java) PostgreSQL Queue/State CLI (Python) Check/Alert definition Entity data Cassandra Frontend (Angular) Redis Slave
  • 26. Vagrant Box deploys Docker images
  • 27. Downtimes ● Set or schedule downtimes using the UI ● Use API to automate downtimes, e.g. in deployment tool
  • 28. Extendability - Check and Alert functions ● Improve user experience through provided functions
  • 29. Extendability - Check and Alert functions ● Improve user experience through function wrappers
  • 31. ● Request rates ● Response rates by HTTP status code ● Latency Key Metrics for your service?
  • 32. Expose your data { "zmon.response.200.GET.checks.all-active-check-definitions.count": 10, "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18076110580284566, "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.1518180485219247, "zmon.response.200.GET.checks.all-active-check-definitions.meanRate": 0.06792011610723951, "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512398137982051, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.75thPercentile": 1173, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.95thPercentile": 1233, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.98thPercentile": 1282, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.999thPercentile": 1282, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.max": 1282, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.mean": 1170, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.median": 1161, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.min": 1114, "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.stdDev": 42, }
  • 39. Multi DC / AWS
  • 40. ZMON in AWS Setup *.foo.example.org *.bar.example.org Team "Foo" Team "Bar" EC2 Instance EC2 InstanceEC2 Instance EC2 Instance ZMON Appliance ZMON Appliance KairosDB EC2 Instance EC2 Instance ZMON Data Service ELB ELB
  • 41. ● Scheduler supports queue filters by entity ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters ● Scheduler can apply base filter ○ only handles entities with {"dc":"dc1"} ● Worker can report home using: ○ Redis (we use this across DCs) ○ HTTPS (AWS->DC) Multi DC / Zone deployment possible
  • 42. Uses Amazon API to fetch: ● ELBs ● EC2 instances ● RDS instances Pushes enriched entities to entity service ZMON AWS Agent
  • 44. Kubernetes Example: Exports in Prometheus text format kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602 kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
  • 45. Yields a usable nested dictionary {"list_images": {"0.9":"120252", "0.99":"120252", "0.5":"120252"}, "version": {"0.9":"1281", "0.99":"2183", "0.5":"873"}, "list_containers": {"0.9":"9740", "0.99":"23378", "0.5":"3717"}, "inspect_container": {"0.9":"9602", "0.99":"18367", "0.5":"4419"} }
  • 47. ZMON’s basic data flow Scheduler (jvm) Redis {"check": {"id": 1, "entity": {"host":"monitor01"}, "command": "snmp().load()", "alerts":[ {"id":100, "condition": "value[‘load1’]>10"} ] } }
  • 48. ZMON’s basic data flow Worker (python) Redis -- store check result "snmp().load()" lpush zmon:check:1:monitor01 {"load1":5,"load5":3,"load15:2} -- keep last 20 results (for dashboard charts) ltrim zmon:check:1:monitor01 20 -- alert active? sadd zmon:alert:100 monitor01 -- alert inactive? srem zmon:alert:100 monitor01
  • 49. ZMON Vagrant Box: https://github.com/zalando/zmon ZMON Homepage: https://zalando.github.io/zmon Zalando Tech: https://tech.zalando.com