SlideShare a Scribd company logo
1 of 26
Nenad Bozic, Co-founder and Technical Consultant @ SmartCat
Challenges of Monitoring Distributed
Systems
Agenda
● Monitoring 101
● Metric data stream and tools
● Log data stream and tools
● Combine metrics and logs for full control
● Real world example
● SMART alerting
Monitoring 101
● Metrics data stream contains real-time data about application
(system components) performances
● Log data stream contains information about different activities and
events within the system
● Alerting is giving you the freedom not to watch graphs and logs all
the time
Metric data stream
● Easily forgotten and pushed aside when chasing deadlines
● Metrics are indicators that everything is working within expected
boundaries
● Good dashboard has enough information (not too much, not to little)
Distributed system -> many graphs to watch -> information overload trap
Metric data stream - decision
● SAS solutions vs self-managed solutions
● Paying solutions vs free solutions
● Decision based on technical team skillset and security of your data
● SAS solutions: AppDynamics, AWS CloudWatch, DataDog
● Self-managed solutions: Riemann, TICK stack
Metric data stream - stack
● Riemann as sinc that handles events and sends them to Riemann
server
● InfluxDB as NoSQL store which is build for measurements
● Grafana as visualization tool (flexible configurable graphs from
many data sources)
Log data stream
● Log monitoring on single machine requires skill and knowledge
● Same challenges as with metrics (not too much, not to little)
● Metrics are indicator that something happened and logs provide
context what happened
Distributed system -> many terminals open -> information overload trap
Log data stream - decision
● SAS solutions vs self-managed solutions
● Paying solutions and free solutions
● Decision based on technical team skillset and security of your data
● SAS solutions: Loggly, Splunk, Rollbar
● Self-managed solutions: ELK, Splunk
Log data stream - ELK stack
● ELK (ElasticSearch, LogStash, Kibana) all open source
● ElasticSearch indexes log messages for easier searching
● Logstash can filter, manipulate and transform messages
● Kibana is visualization tool with filtering capabilities
● Filebeat is sending log mesages from instances
Combine logs and metrics
● It is much easier to look at graphs than logs
● Good metric coverage can pinpoint problems
● Usually we need log messages to bring context
● Grafana can combine InfluxDB (measurement data store) and
ElasticSearch (log index)
Real world example
● Provide reliable latency guarantee for 99.999% request
● 3 node application with 12 node Cassandra cluster
● Lot of metrics transferred to metrics machine
● Whole infrastructure deployed to AWS
● We needed fine grained diagnostics for queries to database both on
cluster and application level among other things
Real world example
Alerting
● Alerting is giving you freedom not to look at graphs
● Someone else placed domain knowledge about alerts
● Alerting must not be frequent since you will end up ignoring alerts
Distributed system -> many alerts -> information overload trap
SMART Alerting
● Have more context when anomaly happens
● Have snapshot of the system at moment something happened
● Be proactive, not reactive, let system predict cause of malfunction
and prevent it instead of curing it
Conclusion
● Have right amount of information, not too much, not too little
● Having good selection of metrics and logs is iterative process
● Do not end up fixing monitoring machine instead of fixing application
code (especially in distributed world)
● Be proactive, not reactive
● Tailor metrics by your needs, build tools if there are not any that
suite your use case
Links
● Monitoring stack for distributed systems - SmartCat blog post
● Distributed logging - SmartCat blog post
● Metrics collection stack for distributed systems - SmartCat blog post
● Monitoring machine ansible project (Riemann, Influx, Grafana, ELK)
- SmartCat github project
Q&A
Nenad Bozic
@NenadBozicNs
Thank You!
SmartCat
@SmartCat_io
www.smartcat.io

More Related Content

What's hot

ironSource Atom BigData Berlin
ironSource Atom BigData BerlinironSource Atom BigData Berlin
ironSource Atom BigData BerlinShimon Tolts
 
Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017VMware Tanzu
 
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 Turning an idea into a Data-Driven Production System: An Energy Load Forecas... Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...Big Data Spain
 
WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsSrinath Perera
 
Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing WSO2
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Jonathan Singer
 
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZUREREAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZUREMarco Pozzan
 
Enabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio BernalEnabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio BernalBig Data Spain
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmcIdo Shilon
 
Apache edgent
Apache edgentApache edgent
Apache edgentYogesh BG
 

What's hot (15)

ironSource Atom BigData Berlin
ironSource Atom BigData BerlinironSource Atom BigData Berlin
ironSource Atom BigData Berlin
 
Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017
 
SCADA
SCADASCADA
SCADA
 
Basic forecasting
Basic forecastingBasic forecasting
Basic forecasting
 
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 Turning an idea into a Data-Driven Production System: An Energy Load Forecas... Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 
WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and Applications
 
Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZUREREAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
 
Enabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio BernalEnabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio Bernal
 
Smart meter
Smart meterSmart meter
Smart meter
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmc
 
Thingsboard IoT Platform - A Quick Tour
Thingsboard IoT Platform - A Quick TourThingsboard IoT Platform - A Quick Tour
Thingsboard IoT Platform - A Quick Tour
 
How do you eat an elephant
How do you eat an elephantHow do you eat an elephant
How do you eat an elephant
 
Apache edgent
Apache edgentApache edgent
Apache edgent
 

Similar to Challenges of monitoring distributed systems

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)Eran Levy
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Monitoring
MonitoringMonitoring
Monitoringstrikr .
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper diveRobert Kubiś
 
Unified Operations Vision
Unified Operations VisionUnified Operations Vision
Unified Operations VisionSteve Mushero
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
 
Streaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapStreaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapWithTheBest
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsAlluxio, Inc.
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Amazon Web Services
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platformLi Gao
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
 
Making Automation Work
Making Automation WorkMaking Automation Work
Making Automation Workstrikr .
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
 
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekLogging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekvivekrajan
 

Similar to Challenges of monitoring distributed systems (20)

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Monitoring
MonitoringMonitoring
Monitoring
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper dive
 
Unified Operations Vision
Unified Operations VisionUnified Operations Vision
Unified Operations Vision
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
Streaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapStreaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara Prathap
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
What is SCADA system? SCADA Solutions for IoT
What is SCADA system? SCADA Solutions for IoTWhat is SCADA system? SCADA Solutions for IoT
What is SCADA system? SCADA Solutions for IoT
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
Making Automation Work
Making Automation WorkMaking Automation Work
Making Automation Work
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekLogging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
 

Recently uploaded

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 

Recently uploaded (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 

Challenges of monitoring distributed systems

  • 1. Nenad Bozic, Co-founder and Technical Consultant @ SmartCat Challenges of Monitoring Distributed Systems
  • 2.
  • 3. Agenda ● Monitoring 101 ● Metric data stream and tools ● Log data stream and tools ● Combine metrics and logs for full control ● Real world example ● SMART alerting
  • 4. Monitoring 101 ● Metrics data stream contains real-time data about application (system components) performances ● Log data stream contains information about different activities and events within the system ● Alerting is giving you the freedom not to watch graphs and logs all the time
  • 5.
  • 6. Metric data stream ● Easily forgotten and pushed aside when chasing deadlines ● Metrics are indicators that everything is working within expected boundaries ● Good dashboard has enough information (not too much, not to little) Distributed system -> many graphs to watch -> information overload trap
  • 7. Metric data stream - decision ● SAS solutions vs self-managed solutions ● Paying solutions vs free solutions ● Decision based on technical team skillset and security of your data ● SAS solutions: AppDynamics, AWS CloudWatch, DataDog ● Self-managed solutions: Riemann, TICK stack
  • 8. Metric data stream - stack ● Riemann as sinc that handles events and sends them to Riemann server ● InfluxDB as NoSQL store which is build for measurements ● Grafana as visualization tool (flexible configurable graphs from many data sources)
  • 9.
  • 10. Log data stream ● Log monitoring on single machine requires skill and knowledge ● Same challenges as with metrics (not too much, not to little) ● Metrics are indicator that something happened and logs provide context what happened Distributed system -> many terminals open -> information overload trap
  • 11.
  • 12. Log data stream - decision ● SAS solutions vs self-managed solutions ● Paying solutions and free solutions ● Decision based on technical team skillset and security of your data ● SAS solutions: Loggly, Splunk, Rollbar ● Self-managed solutions: ELK, Splunk
  • 13. Log data stream - ELK stack ● ELK (ElasticSearch, LogStash, Kibana) all open source ● ElasticSearch indexes log messages for easier searching ● Logstash can filter, manipulate and transform messages ● Kibana is visualization tool with filtering capabilities ● Filebeat is sending log mesages from instances
  • 14.
  • 15. Combine logs and metrics ● It is much easier to look at graphs than logs ● Good metric coverage can pinpoint problems ● Usually we need log messages to bring context ● Grafana can combine InfluxDB (measurement data store) and ElasticSearch (log index)
  • 16.
  • 17. Real world example ● Provide reliable latency guarantee for 99.999% request ● 3 node application with 12 node Cassandra cluster ● Lot of metrics transferred to metrics machine ● Whole infrastructure deployed to AWS ● We needed fine grained diagnostics for queries to database both on cluster and application level among other things
  • 19.
  • 20. Alerting ● Alerting is giving you freedom not to look at graphs ● Someone else placed domain knowledge about alerts ● Alerting must not be frequent since you will end up ignoring alerts Distributed system -> many alerts -> information overload trap
  • 21.
  • 22. SMART Alerting ● Have more context when anomaly happens ● Have snapshot of the system at moment something happened ● Be proactive, not reactive, let system predict cause of malfunction and prevent it instead of curing it
  • 23. Conclusion ● Have right amount of information, not too much, not too little ● Having good selection of metrics and logs is iterative process ● Do not end up fixing monitoring machine instead of fixing application code (especially in distributed world) ● Be proactive, not reactive ● Tailor metrics by your needs, build tools if there are not any that suite your use case
  • 24. Links ● Monitoring stack for distributed systems - SmartCat blog post ● Distributed logging - SmartCat blog post ● Metrics collection stack for distributed systems - SmartCat blog post ● Monitoring machine ansible project (Riemann, Influx, Grafana, ELK) - SmartCat github project
  • 25. Q&A