According to service scale, there are hundreds or thousands of running containers in your service. Should we monitor each container by microscope or monitor each microservice by magnifier? This depends which granularity can help us find and solve the problems. In this sharing, I will introduce how to use cAdvisor, Icinga2, InfluxDB and Grafana to build a self-hosted monitoring system. In addition, I also discuss with how to embrace open source and share some practical experiences.
1. The Art of Container Monitoring
Derek Chen
2016.9.22
2. Copyright 2016 Trend Micro Inc.2
About me
• DevOps Engineer at Trend Micro
– Agile transformation
– Micro service and cloud service
– Docker integration
– Monitoring system development
• Automate all the things
• Make everything smoother
• Find me at derekhound@gmail.com
3. Copyright 2016 Trend Micro Inc.3
• We want to know when things go wrong
• We want to know when things aren’t quite right
• We want to know in advance of problems
Why monitoring?
Microscope?
Magnifier?
Telescope?
4. Copyright 2016 Trend Micro Inc.4
Blackbox vs Whitebox
• Blackbox
– Requires no participation of the monitored system
– Observes external functionality, “what the user see”
– End to end test
• Whitebox
– Collects data internally provided by the target system
– Has more granular information about the system
– Can provide warning of problems before they occur
5. Copyright 2016 Trend Micro Inc.5
Blackbox tests measure…
• Can you ping the server
• Can you fetch a page from the server
• Does it have the correct contents
Whitebox monitoring collects…
• Network statistics (packets/bytes sent/received)
• System load (cpu, memory, disk usage)
• Application statistics (connection count, query per second)
6. Copyright 2016 Trend Micro Inc.6
Goal
Operating System
Application
Business Logic
Metrics
Events
Logs
Router
Destinations
Store
Visualize
Alert
Monitoring Framework
• Be metric, event, and log
• Allow us to easily visualize the
stat of our environment
• Provide contextual and useful
notification.
• Focus on whitebox monitoring
instead of blackbox monitoring
[image] https://www.artofmonitoring.com
7. Copyright 2016 Trend Micro Inc.7
• The practice of system admin is changing quickly
– DevOps
– Infrastructure as Code
– Visualization
– Cloud Platform
• Commercial solution can’t keep up!
Why Open Source Software Monitoring?
[image] http://devops.com/2014/05/05/meet-infrastructure-code
8. Copyright 2016 Trend Micro Inc.8
Monolithic
• If all-in-one gets the job done, then great
• Good for smaller scale, non-tech-focused
companies
Modular
• DevOps requires flexibility and innovation
• Good for tech driven and ops-focused
companies
9. Metric
We’ll rely most heavily on metrics to help us understand
what’s going on in our environment.
10. Copyright 2016 Trend Micro Inc.10
Metric
• Provide a dynamic, real-time picture of the state of
your infrastructure
Collect Store Visualize
Operating System
Application
Business Logic
11. Copyright 2016 Trend Micro Inc.11
Pull Mode
A central collector periodically
requests metrics from each
monitored system
Scrape
Metrics
Server #1
Service
Server #2
Service
Server #3
Service
Monitoring
Push Mode
Metrics are periodically sent
by each monitored system to a
central collector
Server #1
Agent
Server #2
Agent
Server #3
Agent
Monitoring
Push
Metrics
18. Copyright 2016 Trend Micro Inc.18
Database
Application Metric – Pull Mode
K8S Minion
K8S Master
API Server
Kubernetes Cluster
Elasticsearch
Monitor
GrafanaInfluxDB
NginxPostgres ES
Telegraf
K8S Minion
Redis
19. Copyright 2016 Trend Micro Inc.19
Database
Application Metric – Push Mode
K8S Minion
Telegraf
K8S Master
API Server
Kubernetes Cluster
Elasticsearch
Monitor
GrafanaInfluxDB
Nginx
Telegraf
Postgres
Telegraf
ES
K8S Minion
Telegraf
Redis
Pod Pod
20. Copyright 2016 Trend Micro Inc.20
Telegraf
• An agent written in Go
• Easy to contribute or develop your plugins
• Supported Input Plugins
– System: cpu, mem, net, disk, processes and etc.
– Application: docker, nginx, postgresql, redis, and etc.
– Third party api: aws cloudwatch
• Supported output plugins
– influxdb, graphite, cloudwatch, datadog and etc.
21. Copyright 2016 Trend Micro Inc.21
Pull Mode
• Collector discoveries services
periodically
• Collector needs to talk to target
services
– Overlay network. Ex: Flannel,
Weave, etc.
– Forward proxy
Scrape
Metrics
Server #1
Service
Server #2
Service
Server #3
Service
Monitoring
22. Copyright 2016 Trend Micro Inc.22
Push Mode
• Agent sends metrics as soon as it
starts up
• Suitable for short-lived containers
or dynamic environments
Server #1
Agent
Server #2
Agent
Server #3
Agent
Monitoring
Push
Metrics
24. Copyright 2016 Trend Micro Inc.24
• Similar scope to Graphite
• Written in Go
• No external dependency
• Ready for billions row
• Several client libraries
• SQL style queries
40. Copyright 2016 Trend Micro Inc.40
• Originally forked from Nagios in 2009
• Independent version Icinga2 since 2014
• Monitors everything
• Gathering status
• Collect performance data
/use/lib/nagios/plugins/plugin_name
Any program which returns
• 0 – OK
• 1 – WARNING
• 2 – CRITICAL
Message to STDOUT
41. Copyright 2016 Trend Micro Inc.41
Database
Event Monitoring
K8S Minion
K8S Master
API Server
Kubernetes Cluster
Elasticsearch
Monitor
GrafanaInfluxDBIcinga2
NginxPostgres ES
K8S Minion
Redis
45. Copyright 2016 Trend Micro Inc.45
Server Monitoring
• About changes and occurrences in our environment
– Is cpu load too high?
– Is memory not enough?
– Is docker engine still alive?
External Monitoring
• End to end test from user’s perspective
• Can you connect ssh port 22?
• Can you browse web page?
• Can you request RESTful API successfully?
46. Dashboard
• Provides a clear view for current environment
Alerting
• Notifies using email, slack, pager and etc.
• Notification escalation
Others…
• Distributed Monitoring
• Reloads config without interrupt checks
47. Log
Logs are a subset of events. They’re often most useful
for fault diagnosis and investigation
48. Copyright 2016 Trend Micro Inc.48
• EFK stack is a mature logging solution
EFK
Shipping all of your log
to where it should go
The main part to store your
data with high availability
Visualize will power your data.
To know more about its value.
51. Copyright 2016 Trend Micro Inc.51
Last but Not least
• Data Flow
– Blackbox vs Whitebox
– Pull Mode vs Push Mode
• Data Content
– Opeating System, Application,
Business Logic
• Data Type
– Metric, Events, Logs
Operating System
Application
Business Logic
Metrics
Events
Logs
Router
Destinations
Store
Visualize
Alert
Monitoring Framwork
[image] https://www.artofmonitoring.com
在容器的世界中,依據系統的規模大小,可能有數百個到數千個容器在運行。我們到底是應該用顯微鏡觀察每一個 container 的運行狀態,還是用放大鏡觀察 micro service 的整體狀態,還是拿個望遠鏡,遠遠的觀望整個系統的狀態,取決於觀察的精細度是否能幫助我們容易找到問題和解決問題。
因為我們所處的環境不斷的在改變。近幾年提昌的 DevOps,需要團隊做敏捷的開發,需要整合 CI / CD 的 tool。最近開始要有一句響亮的口號 ‘Infrastructure as Code’,希望能夠自動化完成所有的部署工作。我們需要把數據視覺化。我們開始大量的採用 cloud platform, 例如 aws, google cloud, azure。
環境一直在改變,但是 commercial solution 在大部分的時候沒有這麼彈性,難以符合我們不斷變動的需求。
右上方的圖是 google nexus 手機,是一個 all in one solution 的代表,右下方的圖是 google開發的模組化手機,可以根據你的需求替換不同的模組。當你在選擇 open source software 時,你會選擇那一種呢,all in one solution 或是 modular solution 呢?
假如這個 all in one solution 剛好符合你的需求,那你真得非常幸運,Promethues 就是這類型的代表。
但是,大部份的時候我們沒有那麼幸運,我們必須將數個 software 放在一起去建構我們自己的 monitoring system。