Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Challenges of monitoring distributed systems


Published on

Back in the days, you had a single machine and you could scroll down the single log file to figure out what is going on. In this Big Data world you need to combine a lot of logs together to figure out what is going on. Data is coming in huge volumes, with high speed so choosing important information and getting rid of noise becomes real challenge. There is a need for a centralized monitoring platform which will aid the engineers operating the systems, and serve the right information at the right time.

This talk will try to help you understand all the challenges and you will get an idea which tools and technology stacks are good fit to successfully monitor Big Data systems. The focus will be on open source and free solutions. The problem can be separated in two domains which both are the subject of this talk: metrics stack to gather simple metrics on central place and log stack to aggregate logs from different machines to central place. We will finish up with a combined stack and ideas how it can be improved even further with alerting and automated failover scenarios.

Published in: Technology
  • Be the first to comment

Challenges of monitoring distributed systems

  1. 1. Nenad Bozic, Co-founder and Technical Consultant @ SmartCat Challenges of Monitoring Distributed Systems
  2. 2. Agenda ● Monitoring 101 ● Metric data stream and tools ● Log data stream and tools ● Combine metrics and logs for full control ● Real world example ● SMART alerting
  3. 3. Monitoring 101 ● Metrics data stream contains real-time data about application (system components) performances ● Log data stream contains information about different activities and events within the system ● Alerting is giving you the freedom not to watch graphs and logs all the time
  4. 4. Metric data stream ● Easily forgotten and pushed aside when chasing deadlines ● Metrics are indicators that everything is working within expected boundaries ● Good dashboard has enough information (not too much, not to little) Distributed system -> many graphs to watch -> information overload trap
  5. 5. Metric data stream - decision ● SAS solutions vs self-managed solutions ● Paying solutions vs free solutions ● Decision based on technical team skillset and security of your data ● SAS solutions: AppDynamics, AWS CloudWatch, DataDog ● Self-managed solutions: Riemann, TICK stack
  6. 6. Metric data stream - stack ● Riemann as sinc that handles events and sends them to Riemann server ● InfluxDB as NoSQL store which is build for measurements ● Grafana as visualization tool (flexible configurable graphs from many data sources)
  7. 7. Log data stream ● Log monitoring on single machine requires skill and knowledge ● Same challenges as with metrics (not too much, not to little) ● Metrics are indicator that something happened and logs provide context what happened Distributed system -> many terminals open -> information overload trap
  8. 8. Log data stream - decision ● SAS solutions vs self-managed solutions ● Paying solutions and free solutions ● Decision based on technical team skillset and security of your data ● SAS solutions: Loggly, Splunk, Rollbar ● Self-managed solutions: ELK, Splunk
  9. 9. Log data stream - ELK stack ● ELK (ElasticSearch, LogStash, Kibana) all open source ● ElasticSearch indexes log messages for easier searching ● Logstash can filter, manipulate and transform messages ● Kibana is visualization tool with filtering capabilities ● Filebeat is sending log mesages from instances
  10. 10. Combine logs and metrics ● It is much easier to look at graphs than logs ● Good metric coverage can pinpoint problems ● Usually we need log messages to bring context ● Grafana can combine InfluxDB (measurement data store) and ElasticSearch (log index)
  11. 11. Real world example ● Provide reliable latency guarantee for 99.999% request ● 3 node application with 12 node Cassandra cluster ● Lot of metrics transferred to metrics machine ● Whole infrastructure deployed to AWS ● We needed fine grained diagnostics for queries to database both on cluster and application level among other things
  12. 12. Real world example
  13. 13. Alerting ● Alerting is giving you freedom not to look at graphs ● Someone else placed domain knowledge about alerts ● Alerting must not be frequent since you will end up ignoring alerts Distributed system -> many alerts -> information overload trap
  14. 14. SMART Alerting ● Have more context when anomaly happens ● Have snapshot of the system at moment something happened ● Be proactive, not reactive, let system predict cause of malfunction and prevent it instead of curing it
  15. 15. Conclusion ● Have right amount of information, not too much, not too little ● Having good selection of metrics and logs is iterative process ● Do not end up fixing monitoring machine instead of fixing application code (especially in distributed world) ● Be proactive, not reactive ● Tailor metrics by your needs, build tools if there are not any that suite your use case
  16. 16. Links ● Monitoring stack for distributed systems - SmartCat blog post ● Distributed logging - SmartCat blog post ● Metrics collection stack for distributed systems - SmartCat blog post ● Monitoring machine ansible project (Riemann, Influx, Grafana, ELK) - SmartCat github project
  17. 17. Q&A
  18. 18. Nenad Bozic @NenadBozicNs Thank You! SmartCat @SmartCat_io