Back in the day, you had a single machine and could scroll through a single log file to figure out what was going on. In today's Big Data world you need to combine many logs to get the same picture. Data arrives in huge volumes and at high speed, so selecting the important information and filtering out the noise becomes a real challenge. There is a need for a centralized monitoring platform that aids the engineers operating the systems and serves the right information at the right time.
This talk will help you understand these challenges and give you an idea of which tools and technology stacks are a good fit for monitoring Big Data systems. The focus will be on open source and free solutions. The problem can be separated into two domains, both covered in this talk: a metrics stack that gathers simple metrics in a central place, and a log stack that aggregates logs from different machines to a central place. We will finish with a combined stack and ideas for improving it even further with alerting and automated failover scenarios.
Challenges of monitoring distributed systems
1. Nenad Bozic, Co-founder and Technical Consultant @ SmartCat
Challenges of Monitoring Distributed Systems
2.
3. Agenda
● Monitoring 101
● Metric data stream and tools
● Log data stream and tools
● Combine metrics and logs for full control
● Real world example
● SMART alerting
4. Monitoring 101
● Metrics data stream contains real-time data about application
(system component) performance
● Log data stream contains information about different activities and
events within the system
● Alerting gives you the freedom not to watch graphs and logs all
the time
5.
6. Metric data stream
● Easily forgotten and pushed aside when chasing deadlines
● Metrics are indicators that everything is working within expected
boundaries
● Good dashboard has enough information (not too much, not too little)
Distributed system -> many graphs to watch -> information overload trap
7. Metric data stream - decision
● SaaS solutions vs self-managed solutions
● Paid solutions vs free solutions
● Decision based on technical team skillset and security of your data
● SaaS solutions: AppDynamics, AWS CloudWatch, DataDog
● Self-managed solutions: Riemann, TICK stack
8. Metric data stream - stack
● Riemann as the sink: clients send events to the Riemann server,
which handles and routes them
● InfluxDB as a NoSQL time-series store built for measurements
● Grafana as visualization tool (flexible configurable graphs from
many data sources)
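As an illustration of how measurements end up in InfluxDB, here is a minimal Python sketch of serializing a point into InfluxDB's line protocol, the plain-text format its write API accepts (the measurement, tag, and field names are hypothetical):

```python
import time

def to_line_protocol(measurement, tags, fields, ts_ns=None):
    """Serialize one point into InfluxDB line protocol:
    measurement,tag1=v1 field1=v1 timestamp"""
    # Tags are sorted for a canonical, index-friendly ordering.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # String field values must be quoted; numeric values are not.
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement},{tag_part} {field_part} {ts}"

# e.g. to_line_protocol("cpu", {"host": "app1"}, {"load": 0.42}, ts_ns=1)
# produces: cpu,host=app1 load=0.42 1
```

A line like this is what agents POST to the InfluxDB write endpoint; this sketch omits escaping of spaces and commas that a production client would handle.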
9.
10. Log data stream
● Log monitoring even on a single machine requires skill and knowledge
● Same challenges as with metrics (not too much, not too little)
● Metrics are an indicator that something happened; logs provide
the context of what happened
Distributed system -> many terminals open -> information overload trap
11.
12. Log data stream - decision
● SaaS solutions vs self-managed solutions
● Paid solutions vs free solutions
● Decision based on technical team skillset and security of your data
● SaaS solutions: Loggly, Splunk, Rollbar
● Self-managed solutions: ELK, Splunk
13. Log data stream - ELK stack
● ELK (Elasticsearch, Logstash, Kibana), all open source
● Elasticsearch indexes log messages for easier searching
● Logstash can filter, manipulate and transform messages
● Kibana is a visualization tool with filtering capabilities
● Filebeat ships log messages from the instances
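To illustrate the kind of transformation Logstash performs on the way into Elasticsearch, here is a minimal Python sketch of a grok-style parse, assuming a hypothetical `LEVEL [timestamp] message` log format (real Logstash does this declaratively with the grok filter plugin):

```python
import re

# A grok-style pattern for a hypothetical "LEVEL [timestamp] message" format.
LOG_PATTERN = re.compile(
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"\[(?P<timestamp>[^\]]+)\]\s+"
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Turn a raw log line into a structured dict ready for indexing,
    or None if it does not match (Logstash tags such lines with
    _grokparsefailure)."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None
```

The structured fields (`level`, `timestamp`, `message`) are what make Elasticsearch queries and Kibana filters useful, instead of searching opaque text lines.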
14.
15. Combine logs and metrics
● It is much easier to look at graphs than logs
● Good metric coverage can pinpoint problems
● Usually we need log messages to bring context
● Grafana can combine InfluxDB (measurement data store) and
Elasticsearch (log index)
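As a sketch of what "bringing context" means in practice, here is a tiny Python helper that pulls the log entries surrounding a metric anomaly, assuming log entries carry a numeric `ts` field in epoch seconds (the field name and window are illustrative assumptions):

```python
def logs_around(log_entries, anomaly_ts, window=30):
    """Return log entries within `window` seconds of a metric
    anomaly, to quickly attach log context to a spike on a graph."""
    return [e for e in log_entries
            if abs(e["ts"] - anomaly_ts) <= window]
```

In the combined stack this correlation happens visually, by lining up Grafana's InfluxDB graphs and Elasticsearch panels on a shared time axis.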
16.
17. Real world example
● Provide a reliable latency guarantee for 99.999% of requests
● 3 node application with 12 node Cassandra cluster
● Lots of metrics transferred to the metrics machine
● Whole infrastructure deployed to AWS
● We needed fine-grained diagnostics for queries to the database, both
at cluster and application level, among other things
20. Alerting
● Alerting gives you the freedom not to look at graphs
● Someone with domain knowledge has encoded it into the alert rules
● Alerts must not be too frequent, or you will end up ignoring them
Distributed system -> many alerts -> information overload trap
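A minimal sketch of the "not too frequent" rule: a threshold check with a per-metric cooldown, so that a sustained breach fires once instead of flooding the channel (threshold and cooldown values are illustrative):

```python
import time

class Alerter:
    """Fire at most one alert per metric per `cooldown` seconds,
    to avoid the alert-fatigue / information-overload trap."""

    def __init__(self, threshold, cooldown=300):
        self.threshold = threshold
        self.cooldown = cooldown
        self._last_fired = {}  # metric name -> last alert timestamp

    def check(self, metric, value, now=None):
        """Return True if an alert should fire for this observation."""
        now = now if now is not None else time.time()
        if value <= self.threshold:
            return False
        last = self._last_fired.get(metric)
        if last is not None and now - last < self.cooldown:
            return False  # suppressed: still inside the cooldown window
        self._last_fired[metric] = now
        return True
```

Real alerting engines (e.g. Riemann streams or Kapacitor in the TICK stack) add richer routing, but the dedup idea is the same.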
21.
22. SMART Alerting
● Have more context when an anomaly happens
● Have a snapshot of the system at the moment something happened
● Be proactive, not reactive: let the system predict the cause of a
malfunction and prevent it instead of curing it
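One way to sketch "proactive" detection is to flag values that deviate from a rolling baseline rather than from a fixed threshold. This is a simple z-score heuristic; the window size and deviation factor are assumptions, and a production system would use something more robust:

```python
import math
from collections import deque

class AnomalyDetector:
    """Flag values more than `k` standard deviations from the
    rolling mean of the last `window` observations."""

    def __init__(self, window=60, k=3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Record a value; return True if it looks anomalous
        relative to the current baseline."""
        vals = self.values
        if len(vals) >= 2:
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            std = math.sqrt(var)
            anomalous = std > 0 and abs(value - mean) > self.k * std
        else:
            anomalous = False  # not enough history yet
        vals.append(value)
        return anomalous
```

Catching the deviation early is what turns an alert from "the node is down" into "the node is about to fall over".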
23. Conclusion
● Have right amount of information, not too much, not too little
● Selecting a good set of metrics and logs is an iterative process
● Do not end up fixing monitoring machine instead of fixing application
code (especially in distributed world)
● Be proactive, not reactive
● Tailor metrics to your needs; build tools if none suit your
use case
24. Links
● Monitoring stack for distributed systems - SmartCat blog post
● Distributed logging - SmartCat blog post
● Metrics collection stack for distributed systems - SmartCat blog post
● Monitoring machine ansible project (Riemann, Influx, Grafana, ELK)
- SmartCat github project