Atmosphere 2016 - Jan Mussler - ZMON: Zalando's OS approach to monitoring in the cloud and DCs


Two years ago we set out to build our own monitoring tool to replace Icinga. Our biggest focus was flexibility and autonomy for the growing number of teams and engineers, enabling them to monitor their services, from small microservices to databases to higher-level business KPIs. Today ZMON provides teams with a federated monitoring solution that gathers data not only in our DCs but also in the connected AWS VPCs, and assists teams with service auto discovery and sharing of checks and alerts to make everyone's life easier. ZMON comes with Grafana 2 and KairosDB, enabling rich data-driven dashboards.

As ZMON is an open source project and relies on some great products (KairosDB, Redis) in the background, we also provide some insights into how we use GoCD to build, ship, and deploy exactly the same Docker images that everyone can try for themselves.

Published in: Technology


  1. 1. ZMON - OS monitoring in the cloud | Atmosphere 2016 | Krakow, 17.5.2016 | @JanMussler
  2. 2. Europe's Leading Fashion Platform: 15 countries; 3 fulfillment centers; 18+ million active customers; 3.0+ billion € revenue; 135+ million visits per month; 1,000+ employees in tech. Visit us:
  3. 3. Zalando’s Technology History
  5. 5. STUPS in a nutshell: ➊ One AWS account per team ➋ Deployment with Docker ➌ Managed SSH access ➍ REST/OAuth 2.0 mandatory ➎ Traceability of changes
  6. 6. AWS deployment: Senza (CLI deploy tool); Pier One (Docker registry, docker pull / docker push); Taupage AMI
  8. 8. ZMON
  9. 9. ZMON - Highlights ;-) Flexible and extendable: checks and alerts in Python. Integrate: REST APIs, OAuth2, AWS auto discovery. Fully configurable via UI / API: no restarts required! Great for teams: team dashboards, alert inheritance. Fast/scaling metrics: Redis, KairosDB + Grafana 2. Hackweek 2015: iOS app and Android app ;-)
  10. 10. Display historic data using Grafana 2
  11. 11. Notifications: e-mail, plus iOS and Android app
  12. 12. ZMON Controller -> UI + REST API: Full authentication for all endpoints. OAuth2 login flow (e.g. via GitHub login). “TV tokens” for “read-only” dashboard login. Grafana 2 bundled and API implemented: ● ZMON stores dashboards incl. tags/stars ● KairosDB proxy ● Elasticsearch proxy (in progress)
  13. 13. Example
  14. 14. Plan B Deployment - Multi Region Setup (JWT issue/verification) [diagram: ELB -> Provider (Java), ELB -> Tokeninfo (Go), C* nodes; repeated per region]
  15. 15. ZMON AWS Agent - Auto Discovery: Crawls the AWS API every 60 sec to update ZMON. Creates “entities” to describe the deployment: ELBs, ASGs, applications, instances, ...
  16. 16. Example Instance Entity
      ➜ ~ zmon entities get "planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]"
      id: planb-tokeninfo-cd44-oFM6x[aws:999:eu-west-1]
      type: instance
      application_id: planb-tokeninfo
      host:
      infrastructure_account: aws:999
      instance_type: c4.xlarge
      ip:
      ports: { '9020': 9020, '9021': 9021 }
      region: eu-west-1
      source:
      stack_name: planb-tokeninfo-eu-west-1
      stack_version: cd44
  17. 17. Monitoring Plan-B instances on AWS. Instance metrics: ● Memory usage ● Disk space usage ● CPU usage ● Application logs ● Application metrics. Taupage AMI (Ubuntu base): Scalyr Agent (log shipping); Prometheus Node Agent (:9100/metrics); Docker runtime with application container (Go / Spring Boot / Cassandra), :8080 -> app, :7979 -> metrics
  18. 18. Jolokia Request Example
  19. 19. Check Results
  20. 20. Alert on application metrics
  21. 21. Check commands used so far: HTTP requests reading JSON application metrics; JMX data read via Jolokia/HTTP for Cassandra; Prometheus Node data for EC2 metrics; CloudWatch() queries for ELB metrics; Scalyr API queries for application logs
  22. 22. Annotated Metric Data in Grafana
  23. 23. Annotated Metric Data in Grafana
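
The Jolokia request on slide 18 is an image and is not part of this transcript. As a rough sketch of what reading JMX data via Jolokia/HTTP involves, the following builds the JSON body of a Jolokia "read" request and unpacks a response. The helper names `jolokia_read_request` and `extract_value` are hypothetical, the JVM memory MBean stands in for whichever Cassandra MBeans the real checks read, and the sample response is made up for illustration.

```python
import json
from typing import Any, Optional


def jolokia_read_request(mbean: str, attribute: str,
                         path: Optional[str] = None) -> dict[str, Any]:
    """Build the JSON body of a Jolokia "read" request for one MBean attribute."""
    request: dict[str, Any] = {"type": "read", "mbean": mbean, "attribute": attribute}
    if path is not None:
        # Optional inner path, e.g. a single field of a composite attribute.
        request["path"] = path
    return request


def extract_value(response: dict[str, Any]) -> Any:
    """Return the "value" field of a Jolokia response, or raise on a non-200 status."""
    if response.get("status") != 200:
        raise RuntimeError(f"Jolokia error: {response.get('error', 'unknown')}")
    return response["value"]


# Request body that would be POSTed to the agent's Jolokia endpoint.
body = json.dumps(jolokia_read_request("java.lang:type=Memory", "HeapMemoryUsage", "used"))

# Illustrative response only; the number is invented for this sketch.
sample_response = {"request": json.loads(body), "value": 123456789, "status": 200}
heap_used = extract_value(sample_response)
```

Inside a ZMON check the HTTP call itself would go through the check's own HTTP builtin; only the payload shape is shown here.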
  24. 24. Entities
  25. 25. Entities ● hosts, databases, applications, instances ... ● generic key value object ● 10000+ entities in our deployment
      Entity "node01:8080":
      {
        "id": "node01:8080",
        "type": "instance",
        "host": "node01",
        "ports": {"8080":8080,"8181":8181},
        "application_id": "zmon",
        "application_version": "0.1.0",
        "dc":"dc1"
      }
  26. 26. Entity Service (part of controller): Integrated easy-to-use entity store with REST API. Build your own discovery agent (K8S, …).
      local-postgres.yaml:
      id: localhost:5432
      type: postgres
      host: localhost
      port: 5432
      shards:
        local_zmon_db: "localhost:5432/local_zmon_db"
      > zmon entities push local-postgres.yaml
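
Because entities are generic key-value objects, selecting a subset of them can be thought of as matching key/value pairs. The sketch below is one way to picture that, not ZMON's actual implementation: `matches` and `select` are hypothetical helpers, and the OR semantics across multiple filters is an assumption. The two entities are the ones shown on slides 25 and 26.

```python
from typing import Any

Entity = dict[str, Any]
Filter = dict[str, Any]


def matches(entity: Entity, entity_filter: Filter) -> bool:
    """True if every key/value pair of the filter is present in the entity."""
    return all(key in entity and entity[key] == expected
               for key, expected in entity_filter.items())


def select(entities: list[Entity], filters: list[Filter]) -> list[Entity]:
    """Keep entities matching at least one filter (assumed OR semantics)."""
    return [e for e in entities if any(matches(e, f) for f in filters)]


ENTITIES: list[Entity] = [
    {"id": "node01:8080", "type": "instance", "host": "node01",
     "ports": {"8080": 8080, "8181": 8181},
     "application_id": "zmon", "application_version": "0.1.0", "dc": "dc1"},
    {"id": "localhost:5432", "type": "postgres", "host": "localhost", "port": 5432,
     "shards": {"local_zmon_db": "localhost:5432/local_zmon_db"}},
]

selected = select(ENTITIES, [{"type": "instance", "dc": "dc1"}])
print([e["id"] for e in selected])  # ['node01:8080']
```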
  27. 27. Checks
  28. 28. Checks ● select subset of entities ● executes Python expression ○ powerful using eval with custom context ○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch, Redis, SNMP/NRPE, tcp, SOAP, Scalyr, ES, ... ● returns "value" object ○ Quickly, every check returned "dicts"
  29. 29. Managing checks: REST API to update or use web front end
      zmon check-definitions update select-1-check.yaml
      select-1-check.yaml:
      name: "Select 1"
      owning_team: "Team ZMON"
      command: |
        sql().execute("select 1 as a").results()
      entities:
        - type: postgresql
      interval: 15
      description: "Test connection to PostgreSQL"
  30. 30. Trial Run - Quick feedback and easier development
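
To make "executes Python expression using eval with custom context" concrete, here is a minimal sketch of the idea. It is not the ZMON worker code: `run_check`, `make_functions`, and the `port_count` builtin are hypothetical stand-ins for real builtins such as the HTTP or SQL wrappers. Note that emptying `__builtins__` does not make `eval` a secure sandbox.

```python
from typing import Any, Callable

Entity = dict[str, Any]


def run_check(command: str, entity: Entity,
              functions: dict[str, Callable[..., Any]]) -> Any:
    """Evaluate a check command (a Python expression) for a single entity."""
    # Custom context: only the entity and the provided functions are visible.
    context: dict[str, Any] = {"__builtins__": {}, "entity": entity}
    context.update(functions)
    return eval(command, context)


def make_functions(entity: Entity) -> dict[str, Callable[..., Any]]:
    """Bind per-entity helper functions (hypothetical example builtin)."""
    def port_count() -> int:
        return len(entity.get("ports", {}))
    return {"port_count": port_count}


entity = {"id": "node01:8080", "type": "instance", "host": "node01",
          "ports": {"8080": 8080, "8181": 8181}}

# The command returns a dict, matching the observation that checks quickly
# ended up returning "dicts" as their value.
value = run_check('{"host": entity["host"], "ports": port_count()}',
                  entity, make_functions(entity))
print(value)  # {'host': 'node01', 'ports': 2}
```

Binding the functions per entity is what lets a single check definition run unchanged against every entity it selects.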
  31. 31. Alerts
  32. 32. Alerts ● Executes using a check’s value, bound to single check ● Defines team and responsible team ● Allows inheritance from other alert ● Evaluates Python expression yielding True/False ● No "WARNING" state, no "UNKNOWN" state ● Priorities (color) and tags
  33. 33. Downtimes ● Set or schedule downtimes using the UI ● Use API to automate downtimes, e.g. in deployment tool
  34. 34. Sharing and reuse of alerts and checks: Anyone can add alerts to checks. Alerts are owned by team. Monitor application boundaries/dependencies. Make use of inheritance to customize.
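
The alert model above (a Python expression over a check's value, plus inheritance between alerts) can be sketched as follows. This is an illustration under assumptions, not ZMON's data model: the field names, the `parent` key, the variable name `value` inside the condition, and the alert ids are all hypothetical.

```python
from typing import Any

Alert = dict[str, Any]

ALERTS: dict[str, Alert] = {
    "base-latency": {
        "name": "High latency", "team": "Team ZMON",
        "responsible_team": "Team ZMON", "priority": 2,
        "condition": "value['median'] > 1000", "parent": None,
    },
    # Another team inherits the alert and overrides only what it needs.
    "strict-latency": {
        "parent": "base-latency", "team": "Team Foo",
        "priority": 1, "condition": "value['median'] > 500",
    },
}


def resolve(alert_id: str, alerts: dict[str, Alert]) -> Alert:
    """Merge an alert with its ancestors; child fields override parent fields."""
    alert = alerts[alert_id]
    parent_id = alert.get("parent")
    merged = resolve(parent_id, alerts) if parent_id else {}
    merged.update({k: v for k, v in alert.items() if k != "parent"})
    return merged


def is_active(alert: Alert, value: Any) -> bool:
    """Evaluate the condition against a check value: True/False, nothing else."""
    return bool(eval(alert["condition"], {"__builtins__": {}, "value": value}))


check_value = {"median": 700, "max": 1282}
print(is_active(resolve("base-latency", ALERTS), check_value))    # False
print(is_active(resolve("strict-latency", ALERTS), check_value))  # True
```

The inherited alert keeps the parent's name and responsible team while changing owner, priority, and threshold, which is the kind of customization the sharing slide describes.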
  35. 35. Deployment
  36. 36. ZMON Core + UI + KairosDB [architecture diagram; components: Frontend (AngularJS), CLI (Python), Controller (Java), PostgreSQL, Scheduler (JVM), Redis, Workers (Python), KairosDB (Java), Cassandra; labels: Check/Alert definition, Entity data, Queue/State, Metric Cache]
  37. 37. ZMON in AWS / Multi DC Setup [diagram: Team "Foo" and Team "Bar", each with EC2 instances and a ZMON Appliance; ELBs; ZMON Data Service]
  38. 38. Multi DC / Zone deployment possible ● Scheduler supports queue filters by entity ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters ● Scheduler can apply base filter ○ only handles entities with {"dc":"dc1"} ● Worker can report home using: ○ Redis (we use this across DCs) ○ HTTPS (AWS->DC)
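
The queue filters and base filter from the multi-DC slide can be pictured as a small routing function. This is only a sketch of the concept: `route`, the queue names, the first-match-wins rule, and the `node02` entity are assumptions, not scheduler internals.

```python
from typing import Any, Optional

Entity = dict[str, Any]
Filter = dict[str, Any]


def matches(entity: Entity, entity_filter: Filter) -> bool:
    """True if every key/value pair of the filter is present in the entity."""
    return all(k in entity and entity[k] == v for k, v in entity_filter.items())


def route(entity: Entity,
          queue_filters: list[tuple[str, Filter]],
          base_filter: Optional[Filter] = None,
          default_queue: str = "default") -> Optional[str]:
    """Return the queue for an entity, or None if this scheduler ignores it."""
    if base_filter and not matches(entity, base_filter):
        return None  # Outside this scheduler's base filter.
    for queue_name, entity_filter in queue_filters:
        if matches(entity, entity_filter):
            return queue_name  # First matching queue wins (assumed).
    return default_queue


# Hypothetical queue names; the filters are the ones from the slide.
QUEUE_FILTERS = [("queue-dc1", {"dc": "dc1"}), ("queue-dc2", {"dc": "dc2"})]

node01 = {"id": "node01:8080", "type": "instance", "dc": "dc1"}
node02 = {"id": "node02:8080", "type": "instance", "dc": "dc2"}  # made-up entity

print(route(node01, QUEUE_FILTERS))                             # queue-dc1
print(route(node02, QUEUE_FILTERS, base_filter={"dc": "dc1"}))  # None
```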
  39. 39. Micro Services
  40. 40. Expose your data / Convention on key names/structure
      {
        "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
        "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18071,
        "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.15181,
        "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512,
        "zmon.response.200.GET.checks.all-active-check-definitions.75thPercentile": 1173,
        "zmon.response.200.GET.checks.all-active-check-definitions.95thPercentile": 1233,
        "zmon.response.200.GET.checks.all-active-check-definitions.999thPercentile": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.99thPercentile": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
        "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
        "zmon.response.200.GET.checks.all-active-check-definitions.min": 1114
      }
  41. 41. Application metrics
  42. 42. Continued ...
  43. 43. Example libraries and framework support: Spring Boot (extending metrics), Python (Swagger first on Flask), Clojure (Swagger first) ...
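
A key naming convention like the one on slide 40 pays off because a generic check can take the keys apart again. The sketch below assumes the layout `<prefix>.response.<status>.<method>.<path segments>.<measure>`, which is inferred from the sample keys rather than taken from a specification; `parse_key` and `group_by_endpoint` are hypothetical helpers.

```python
from typing import Any, NamedTuple


class ResponseMetric(NamedTuple):
    prefix: str
    status: int
    method: str
    path: str
    measure: str


def parse_key(key: str) -> ResponseMetric:
    """Split a metric key into its parts, assuming the inferred layout."""
    parts = key.split(".")
    if len(parts) < 6 or parts[1] != "response":
        raise ValueError(f"unexpected metric key: {key}")
    # The path may itself contain dots, so it is everything between
    # the HTTP method and the final measure name.
    return ResponseMetric(prefix=parts[0], status=int(parts[2]), method=parts[3],
                          path=".".join(parts[4:-1]), measure=parts[-1])


def group_by_endpoint(metrics: dict[str, Any]) -> dict[tuple[int, str, str], dict[str, Any]]:
    """Group flat metrics into {(status, method, path): {measure: value}}."""
    grouped: dict[tuple[int, str, str], dict[str, Any]] = {}
    for key, value in metrics.items():
        m = parse_key(key)
        grouped.setdefault((m.status, m.method, m.path), {})[m.measure] = value
    return grouped


METRICS = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.median": 1161,
    "zmon.response.200.GET.checks.all-active-check-definitions.max": 1282,
}

grouped = group_by_endpoint(METRICS)
endpoint = (200, "GET", "checks.all-active-check-definitions")
print(grouped[endpoint]["median"])  # 1161
```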
  44. 44. Demo: ZMON on Github: Documentation: Zalando Tech: