SlideShare a Scribd company logo
1 of 26
Nenad Bozic, Co-founder and Technical Consultant @ SmartCat
Challenges of Monitoring Distributed
Systems
Agenda
● Monitoring 101
● Metric data stream and tools
● Log data stream and tools
● Combine metrics and logs for full control
● Real world example
● SMART alerting
Monitoring 101
● Metrics data stream contains real-time data about application
(system components) performances
● Log data stream contains information about different activities and
events within the system
● Alerting is giving you the freedom not to watch graphs and logs all
the time
Metric data stream
● Easily forgotten and pushed aside when chasing deadlines
● Metrics are indicators that everything is working within expected
boundaries
● Good dashboard has enough information (not too much, not to little)
Distributed system -> many graphs to watch -> information overload trap
Metric data stream - decision
● SAS solutions vs self-managed solutions
● Paying solutions vs free solutions
● Decision based on technical team skillset and security of your data
● SAS solutions: AppDynamics, AWS CloudWatch, DataDog
● Self-managed solutions: Riemann, TICK stack
Metric data stream - stack
● Riemann as sinc that handles events and sends them to Riemann
server
● InfluxDB as NoSQL store which is build for measurements
● Grafana as visualization tool (flexible configurable graphs from
many data sources)
Log data stream
● Log monitoring on single machine requires skill and knowledge
● Same challenges as with metrics (not too much, not to little)
● Metrics are indicator that something happened and logs provide
context what happened
Distributed system -> many terminals open -> information overload trap
Log data stream - decision
● SAS solutions vs self-managed solutions
● Paying solutions and free solutions
● Decision based on technical team skillset and security of your data
● SAS solutions: Loggly, Splunk, Rollbar
● Self-managed solutions: ELK, Splunk
Log data stream - ELK stack
● ELK (ElasticSearch, LogStash, Kibana) all open source
● ElasticSearch indexes log messages for easier searching
● Logstash can filter, manipulate and transform messages
● Kibana is visualization tool with filtering capabilities
● Filebeat is sending log mesages from instances
Combine logs and metrics
● It is much easier to look at graphs than logs
● Good metric coverage can pinpoint problems
● Usually we need log messages to bring context
● Grafana can combine InfluxDB (measurement data store) and
ElasticSearch (log index)
Real world example
● Provide reliable latency guarantee for 99.999% request
● 3 node application with 12 node Cassandra cluster
● Lot of metrics transferred to metrics machine
● Whole infrastructure deployed to AWS
● We needed fine grained diagnostics for queries to database both on
cluster and application level among other things
Real world example
Alerting
● Alerting is giving you freedom not to look at graphs
● Someone else placed domain knowledge about alerts
● Alerting must not be frequent since you will end up ignoring alerts
Distributed system -> many alerts -> information overload trap
SMART Alerting
● Have more context when anomaly happens
● Have snapshot of the system at moment something happened
● Be proactive, not reactive, let system predict cause of malfunction
and prevent it instead of curing it
Conclusion
● Have right amount of information, not too much, not too little
● Having good selection of metrics and logs is iterative process
● Do not end up fixing monitoring machine instead of fixing application
code (especially in distributed world)
● Be proactive, not reactive
● Tailor metrics by your needs, build tools if there are not any that
suite your use case
Links
● Monitoring stack for distributed systems - SmartCat blog post
● Distributed logging - SmartCat blog post
● Metrics collection stack for distributed systems - SmartCat blog post
● Monitoring machine ansible project (Riemann, Influx, Grafana, ELK)
- SmartCat github project
Q&A
Nenad Bozic
@NenadBozicNs
Thank You!
SmartCat
@SmartCat_io
www.smartcat.io

More Related Content

What's hot

ironSource Atom BigData Berlin
ironSource Atom BigData BerlinironSource Atom BigData Berlin
ironSource Atom BigData BerlinShimon Tolts
 
Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017VMware Tanzu
 
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 Turning an idea into a Data-Driven Production System: An Energy Load Forecas... Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...Big Data Spain
 
WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsSrinath Perera
 
Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing WSO2
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Jonathan Singer
 
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZUREREAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZUREMarco Pozzan
 
Enabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio BernalEnabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio BernalBig Data Spain
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmcIdo Shilon
 
Apache edgent
Apache edgentApache edgent
Apache edgentYogesh BG
 

What's hot (15)

ironSource Atom BigData Berlin
ironSource Atom BigData BerlinironSource Atom BigData Berlin
ironSource Atom BigData Berlin
 
Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017Jon Schneider at SpringOne Platform 2017
Jon Schneider at SpringOne Platform 2017
 
SCADA
SCADASCADA
SCADA
 
Basic forecasting
Basic forecastingBasic forecasting
Basic forecasting
 
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 Turning an idea into a Data-Driven Production System: An Energy Load Forecas... Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
Turning an idea into a Data-Driven Production System: An Energy Load Forecas...
 
WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and Applications
 
Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing Big Data, Analytics and Real Time Event Processing
Big Data, Analytics and Real Time Event Processing
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZUREREAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
REAL TIME ANALYTICS INFRASTRUCTURE WITH AZURE
 
Enabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio BernalEnabling the Bank of the Future by Ignacio Bernal
Enabling the Bank of the Future by Ignacio Bernal
 
Smart meter
Smart meterSmart meter
Smart meter
 
Using druid for interactive count distinct queries at scale @ nmc
Using druid  for interactive count distinct queries at scale @ nmcUsing druid  for interactive count distinct queries at scale @ nmc
Using druid for interactive count distinct queries at scale @ nmc
 
Thingsboard IoT Platform - A Quick Tour
Thingsboard IoT Platform - A Quick TourThingsboard IoT Platform - A Quick Tour
Thingsboard IoT Platform - A Quick Tour
 
How do you eat an elephant
How do you eat an elephantHow do you eat an elephant
How do you eat an elephant
 
Apache edgent
Apache edgentApache edgent
Apache edgent
 

Similar to Challenges of monitoring distributed systems

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)Eran Levy
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Guglielmo Iozzia
 
Monitoring
MonitoringMonitoring
Monitoringstrikr .
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper diveRobert Kubiś
 
Unified Operations Vision
Unified Operations VisionUnified Operations Vision
Unified Operations VisionSteve Mushero
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Demi Ben-Ari
 
Streaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapStreaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapWithTheBest
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsAlluxio, Inc.
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesEd Hunter
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Hernan Costante
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Amazon Web Services
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platformLi Gao
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...Jorge Cardoso
 
Making Automation Work
Making Automation WorkMaking Automation Work
Making Automation Workstrikr .
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
 
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekLogging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekvivekrajan
 

Similar to Challenges of monitoring distributed systems (20)

Go Observability (in practice)
Go Observability (in practice)Go Observability (in practice)
Go Observability (in practice)
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Monitoring
MonitoringMonitoring
Monitoring
 
Monitoring - deeper dive
Monitoring  - deeper diveMonitoring  - deeper dive
Monitoring - deeper dive
 
Unified Operations Vision
Unified Operations VisionUnified Operations Vision
Unified Operations Vision
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"Monitoring Big Data Systems - "The Simple Way"
Monitoring Big Data Systems - "The Simple Way"
 
Streaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara PrathapStreaming Analytics and Internet of Things - Geesara Prathap
Streaming Analytics and Internet of Things - Geesara Prathap
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
How to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and ApplicationsHow to Develop and Operate Cloud Native Data Platforms and Applications
How to Develop and Operate Cloud Native Data Platforms and Applications
 
Netflix SRE perf meetup_slides
Netflix SRE perf meetup_slidesNetflix SRE perf meetup_slides
Netflix SRE perf meetup_slides
 
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
Eko10 - Security Monitoring for Big Infrastructures without a Million Dollar ...
 
Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016Reactive Cloud Security | AWS Public Sector Summit 2016
Reactive Cloud Security | AWS Public Sector Summit 2016
 
Cloud native data platform
Cloud native data platformCloud native data platform
Cloud native data platform
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
What is SCADA system? SCADA Solutions for IoT
What is SCADA system? SCADA Solutions for IoTWhat is SCADA system? SCADA Solutions for IoT
What is SCADA system? SCADA Solutions for IoT
 
On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...On the Application of AI for Failure Management: Problems, Solutions and Algo...
On the Application of AI for Failure Management: Problems, Solutions and Algo...
 
Making Automation Work
Making Automation WorkMaking Automation Work
Making Automation Work
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeekLogging : How much is too much? Network Security Monitoring Talk @ hasgeek
Logging : How much is too much? Network Security Monitoring Talk @ hasgeek
 

Recently uploaded

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 

Recently uploaded (20)

Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Challenges of monitoring distributed systems

  • 1. Nenad Bozic, Co-founder and Technical Consultant @ SmartCat Challenges of Monitoring Distributed Systems
  • 2.
  • 3. Agenda ● Monitoring 101 ● Metric data stream and tools ● Log data stream and tools ● Combine metrics and logs for full control ● Real world example ● SMART alerting
  • 4. Monitoring 101 ● Metrics data stream contains real-time data about application (system components) performances ● Log data stream contains information about different activities and events within the system ● Alerting is giving you the freedom not to watch graphs and logs all the time
  • 5.
  • 6. Metric data stream ● Easily forgotten and pushed aside when chasing deadlines ● Metrics are indicators that everything is working within expected boundaries ● Good dashboard has enough information (not too much, not to little) Distributed system -> many graphs to watch -> information overload trap
  • 7. Metric data stream - decision ● SAS solutions vs self-managed solutions ● Paying solutions vs free solutions ● Decision based on technical team skillset and security of your data ● SAS solutions: AppDynamics, AWS CloudWatch, DataDog ● Self-managed solutions: Riemann, TICK stack
  • 8. Metric data stream - stack ● Riemann as sinc that handles events and sends them to Riemann server ● InfluxDB as NoSQL store which is build for measurements ● Grafana as visualization tool (flexible configurable graphs from many data sources)
  • 9.
  • 10. Log data stream ● Log monitoring on single machine requires skill and knowledge ● Same challenges as with metrics (not too much, not to little) ● Metrics are indicator that something happened and logs provide context what happened Distributed system -> many terminals open -> information overload trap
  • 11.
  • 12. Log data stream - decision ● SAS solutions vs self-managed solutions ● Paying solutions and free solutions ● Decision based on technical team skillset and security of your data ● SAS solutions: Loggly, Splunk, Rollbar ● Self-managed solutions: ELK, Splunk
  • 13. Log data stream - ELK stack ● ELK (ElasticSearch, LogStash, Kibana) all open source ● ElasticSearch indexes log messages for easier searching ● Logstash can filter, manipulate and transform messages ● Kibana is visualization tool with filtering capabilities ● Filebeat is sending log mesages from instances
  • 14.
  • 15. Combine logs and metrics ● It is much easier to look at graphs than logs ● Good metric coverage can pinpoint problems ● Usually we need log messages to bring context ● Grafana can combine InfluxDB (measurement data store) and ElasticSearch (log index)
  • 16.
  • 17. Real world example ● Provide reliable latency guarantee for 99.999% request ● 3 node application with 12 node Cassandra cluster ● Lot of metrics transferred to metrics machine ● Whole infrastructure deployed to AWS ● We needed fine grained diagnostics for queries to database both on cluster and application level among other things
  • 19.
  • 20. Alerting ● Alerting is giving you freedom not to look at graphs ● Someone else placed domain knowledge about alerts ● Alerting must not be frequent since you will end up ignoring alerts Distributed system -> many alerts -> information overload trap
  • 21.
  • 22. SMART Alerting ● Have more context when anomaly happens ● Have snapshot of the system at moment something happened ● Be proactive, not reactive, let system predict cause of malfunction and prevent it instead of curing it
  • 23. Conclusion ● Have right amount of information, not too much, not too little ● Having good selection of metrics and logs is iterative process ● Do not end up fixing monitoring machine instead of fixing application code (especially in distributed world) ● Be proactive, not reactive ● Tailor metrics by your needs, build tools if there are not any that suite your use case
  • 24. Links ● Monitoring stack for distributed systems - SmartCat blog post ● Distributed logging - SmartCat blog post ● Metrics collection stack for distributed systems - SmartCat blog post ● Monitoring machine ansible project (Riemann, Influx, Grafana, ELK) - SmartCat github project
  • 25. Q&A