ALEKSANDR TAVGEN
Software Architect
@ATavgen
SOFTWARE
TRENDS IN 2019
Why Dashboards Are Useless and Observability
Is the New Buzzword
Observability
Good Old
Monitoring
YEARS AGO IT WAS SIMPLE AND STRAIGHTFORWARD
PARADIGM
SHIFT
• Cloud
• Microservices
• Ephemeral
• Dynamic
WHAT IS MONITORING
Tests for Dev
Monitoring for Ops
PING
LOAD TIME
RESPONSE TIME
SSH
BLACK BOX MONITORING
API CALL
WHITE BOX MONITORING
Metrics, Logs, Traces
CLASSIC WAY
Checking status and behaviour of systems
Some checks to verify that a bunch of things are within thresholds
Build dashboards with Graphite or Grafana
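The classic approach above can be sketched as a handful of range checks run against the latest samples. This is a minimal illustration, not how any particular monitoring tool implements it; the metric names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One classic check: a metric name and its known-good range."""
    metric: str
    low: float
    high: float


def run_checks(samples: dict[str, float], checks: list[Check]) -> list[str]:
    """Return one message for each metric that is missing or out of range."""
    problems = []
    for check in checks:
        value = samples.get(check.metric)
        if value is None:
            problems.append(f"{check.metric}: no data")
        elif not (check.low <= value <= check.high):
            problems.append(
                f"{check.metric}: {value} outside [{check.low}, {check.high}]"
            )
    return problems


# Hypothetical metric names and thresholds, for illustration only.
CHECKS = [
    Check("cpu_usage_percent", 0.0, 85.0),
    Check("disk_used_percent", 0.0, 90.0),
    Check("response_time_ms", 0.0, 500.0),
]

if __name__ == "__main__":
    latest = {"cpu_usage_percent": 42.0, "disk_used_percent": 97.5}
    for problem in run_checks(latest, CHECKS):
        print(problem)
```

Checks like these only answer questions someone thought to ask in advance, which is why they cover known-unknowns and little else.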
DO YOU LIKE
SPAGHETTI?
So Dashboards are useless?
Being asked why
customers can’t open a
site
MAIN
PROBLEM
HIGH
CARDINALITY
OF DATA
• Combinatoric By Nature
• No-Right-Aggregation
• Rich Relationships
• Interdependencies
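To see why the data is combinatoric by nature, it helps to multiply out the dimensions attached to a single metric. The sketch below assumes every combination of values can occur; the dimension names and counts are made up for illustration.

```python
from math import prod

# Hypothetical dimensions and their distinct-value counts, for illustration only.
DIMENSIONS = {
    "service": 50,
    "datacenter": 4,
    "build_version": 20,
    "endpoint": 30,
    "status_code": 8,
}


def series_count(dims: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric: product of cardinalities."""
    return prod(dims.values())


if __name__ == "__main__":
    print(series_count(DIMENSIONS))  # 960000
    # One more dimension with 10,000 values multiplies the bound by 10,000.
    print(series_count({**DIMENSIONS, "customer_id": 10_000}))  # 9600000000
```

Pre-aggregating any of these dimensions away keeps storage manageable but removes the ability to ask questions along that dimension later, which is one reading of "no right aggregation".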
LOG
AGGREGATION
• Tools like Splunk or ELK are very helpful
• But it comes with a cost
• Modern systems generate huge amounts of logs
• It can raise your bill to the moon
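The editor's notes for this slide mention sampling as one way to contain that cost: keep every abnormal record in full and only a fraction of the routine ones. A minimal sketch of that idea follows. It assumes records are dicts with a "level" field and that "abnormal" simply means WARNING or above; both are simplifications made for this example.

```python
import random

# Assumption for this sketch: severity alone decides what counts as abnormal.
ABNORMAL_LEVELS = {"WARNING", "ERROR", "CRITICAL"}


def should_keep(record: dict, sample_rate: float, rng: random.Random) -> bool:
    """Keep every abnormal record; keep routine ones with probability sample_rate."""
    if record["level"] in ABNORMAL_LEVELS:
        return True
    return rng.random() < sample_rate


def sample_logs(records, sample_rate=0.1, seed=None):
    """Return the records that survive sampling, in their original order."""
    rng = random.Random(seed)
    return [r for r in records if should_keep(r, sample_rate, rng)]


if __name__ == "__main__":
    records = [{"level": "INFO", "msg": "ok"}] * 1000 + [{"level": "ERROR", "msg": "boom"}] * 5
    kept = sample_logs(records, sample_rate=0.1, seed=1)
    errors = sum(1 for r in kept if r["level"] == "ERROR")
    print(f"kept {len(kept)} of {len(records)} records, {errors} of them errors")
```

With a 10% rate, routine volume drops by roughly an order of magnitude while every error is still available for root cause analysis.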
LOGS VS EVENTS
WHY WE ARE NOT READY
FOR A FULL AI SOLUTION
Reproducibility
Resource Consumption
Speed
Scalability
Clarity
ANOMALIES = ALERTS?
• Thousands of Metrics
• Statistical Fluctuations
• High Cardinality
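A back-of-the-envelope calculation shows why anomalies cannot simply be turned into alerts once there are thousands of metrics. The sketch assumes each healthy metric fluctuates like a normal variable and is flagged when it moves more than k standard deviations from its mean; real metrics are rarely that well behaved, so treat the numbers as an order-of-magnitude guide.

```python
from math import erfc, sqrt


def two_sided_tail(k_sigma: float) -> float:
    """P(|Z| > k) for a standard normal variable Z."""
    return erfc(k_sigma / sqrt(2))


def expected_false_alerts(n_metrics: int, k_sigma: float) -> float:
    """Expected number of healthy metrics beyond k sigma in one evaluation round."""
    return n_metrics * two_sided_tail(k_sigma)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, round(expected_false_alerts(n, 3.0), 1))
```

At a 3-sigma threshold, about 0.27% of healthy metrics are out of bounds at any moment, so 10,000 metrics produce roughly 27 spurious anomalies per evaluation round from statistical fluctuation alone.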
WHY WE NEED
STREAMING APPROACH?
• Gaining observability and bringing unknown-unknowns into the spotlight requires highly granular data.
• Even with carefully designed metrics and events, you will eventually end up with quite a large number of them.
• At this scale, regular querying or batch jobs add significant latency and overhead to real-time operation.
WHY IS IT HARD?
• Any operation on an infinite stream of data is quite an engineering endeavor in itself
• You need to deal with the implications of distributed systems
• Operating on thousands of metrics in real time makes these questions quite important
• Events can be unordered
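Unordered events are a good example of what makes this hard. The sketch below counts events in fixed-size (tumbling) windows and uses a simple watermark, the largest event time seen minus an allowed lateness, to decide when a window can be emitted. It is a single-process toy under those assumptions, not the API of any stream processing platform; the class and parameter names are invented for this example.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed-size window, tolerating bounded out-of-order arrival."""

    def __init__(self, window_size: int, allowed_lateness: int):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)  # window start -> event count
        self.watermark = float("-inf")        # max event time seen minus lateness
        self.dropped = 0                      # events that arrived after their window closed

    def add(self, event_time: int) -> list[tuple[int, int]]:
        """Ingest one event; return (window_start, count) for windows it closes."""
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark:
            self.dropped += 1  # window already emitted, too late to count
        else:
            self.open_windows[start] += 1
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        closed = sorted(
            s for s in self.open_windows if s + self.window_size <= self.watermark
        )
        return [(s, self.open_windows.pop(s)) for s in closed]


if __name__ == "__main__":
    counter = TumblingWindowCounter(window_size=10, allowed_lateness=5)
    for t in (1, 12, 3, 27, 4):  # 3 arrives late but in time; 4 arrives too late
        for window_start, count in counter.add(t):
            print(f"window [{window_start}, {window_start + 10}): {count} events")
    print("dropped:", counter.dropped)
```

The allowed lateness is the trade-off in miniature: a larger value loses fewer late events but delays every result, and a distributed version also has to agree on the watermark across nodes.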
STREAM PROCESSING
PLATFORMS
• Elastic
• Reactive
• Message Driven
• Resilient
OBSERVABILITY IN
2019
• Process large volumes of highly granular
data
• Near Real-Time
• Ad hoc questions to data on demand
• Flexibility Related to Business Domain
THANK YOU


Editor's Notes

  1. What is Observability? There are a lot of discussions and jokes about this term. Some of them: "Why call it monitoring? That’s not sexy enough anymore." "Observability, because rebranding Ops as DevOps wasn’t bad enough, now they’re devopsifying monitoring too." "The new Chuck Norris of DevOps." "I’m an engineer that can help provide monitoring to the other engineers in the organization. > Great, here’s $80k. I’m an architect that can help provide observability for cloud-native, container-based applications. > Awesome! Here’s $300k!" (Cindy Sridharan) What is the difference between Monitoring and Observability, if there is one?
  2. Looking back… Years ago, we mostly operated software on physical servers. Our applications were monoliths built on LAMP or another stack. Checking uptime was as simple as sending regular pings and keeping an eye on CPU and disk usage for your application.
  3. Paradigm Shift. The main paradigm shift came from the infrastructure and architecture space. Cloud architectures, microservices, Kubernetes, and immutable infrastructure changed the way companies build and operate systems. With the adoption of these ideas, the systems we build have become more and more distributed and ephemeral. Virtualization, containerization, and orchestration frameworks take responsibility for providing computational resources and handling failures, which creates an abstraction layer over hardware and networking. Moving towards abstraction from the underlying hardware and networking means that our responsibility is now focused on ensuring that our applications work as intended and in line with the business processes they support.
  4. What is Monitoring. Monitoring is to operations essentially what tests are to software development. Tests check the behavior of parts of the system against a set of inputs in a sandboxed environment, usually with heavily mocked components. The main issue is that the number of possible problems in production cannot be covered by tests in any way. Most of the problems in a mature, stable system are unknown-unknowns, which are related not only to software development itself but to the real world too.
  5. For the uninitiated, blackbox monitoring refers to the category of monitoring derived by treating the system as a blackbox and examining it from the outside. While some believe that with more sophisticated tooling at our disposal blackbox monitoring is a thing of the past, I’d argue that blackbox monitoring still has its place, what with large parts of core business and infrastructural components being outsourced to third-party vendors.  Even outside of third-party integrations, treating our own systems as blackboxes might still have some value, especially in a microservices environment where different services owned by different teams might be involved in servicing a request. In such cases, being able to communicate quantitatively about systems paves the way toward establishing SLOs for different services.
  6. Whitebox Monitoring versus Observability “Whitebox monitoring” refers to a category of “monitoring” based on the information derived from the internals of systems. Whitebox monitoring isn’t really a revolutionary idea anymore. Time series, logs and traces are all more in vogue than ever these days and have been for a few years. So then. Is observability just whitebox monitoring by another name? Well, not quite.
  7. Why we need new monitoring. Pretty often, Monitoring is separated from the Observability concept (https://thenewstack.io/monitoring-and-observability-whats-the-difference-and-why-does-it-matter/) by defining it as something that gathers data about the state of infrastructure and apps, plus performance traces, in one way or another. Or, according to honeycomb.io, you are checking the status and behaviors of your systems against a known baseline, to determine if anything is not behaving as expected. You can write Nagios checks to verify that a bunch of things are within known good thresholds. You can build dashboards with Graphite or Ganglia to group sets of useful graphs. All of these are terrific tools for understanding the known-unknowns about your system. A large ecosystem of such products has evolved, such as New Relic, Datadog, and AppDynamics. All these tools are a perfect fit for low-level and mid-level monitoring or for untangling performance issues. These types of monitoring tools do not handle queries on data with high cardinality. They also help poorly with problems related to third-party integrations, or with the behavior of large, complex systems made up of a swarm of services running in modern virtual environments.
  8. While adding telemetry to different parts of the system is common practice, it usually ends with a bunch of spaghetti drawn on dashboards.   These are GitLab operational metrics, which are open to the public: https://dashboards.gitlab.com/d/mnbqU9Smz/fleet-overview?refresh=5m&orgId=1 Why dashboards are useless. Actually, they are not, but only if you know where and when to look. Otherwise you are better off watching YouTube. Dashboards do not scale. Imagine a situation where you have a bunch of infrastructure metrics, such as cpu_usage or disk quotas, and application metrics, such as JVM allocation_speed or gc_runs. The number of those metrics can easily grow to thousands, or to tens or hundreds of thousands. All your dashboards are green, but a problem has occurred in a third-party integration service. Your dashboards are still green, but end users are already affected. You decide to add third-party integration checks to your monitoring and get an additional bunch of metrics and dashboards on your TV set. That works until some new case arises. When you are asked why customers can’t open a site, it often looks like this
  9. Log aggregation. Log aggregation tools such as the Elastic Stack or Splunk are used by the vast majority of modern IT companies. These instruments are amazingly helpful for root cause analysis and post-mortems. They can also monitor conditions that can be derived from your log flow. But this comes at a cost. Modern systems generate huge amounts of logs, and growth in your traffic can exhaust your ELK resources or raise your Splunk bill to the moon. Sampling techniques can reduce the volume of so-called boring logs by about an order of magnitude while saving all abnormal ones in full. This gives a high-level overview of normal system behavior and a detailed view of any problematic behavior.
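The sampling idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the record fields `level` and `request_id`, the set of "abnormal" levels and the 10% default rate are hypothetical and not taken from any particular logging product.

```python
import zlib

# Assumption: these levels count as "abnormal" and are always kept in full.
ABNORMAL_LEVELS = {"WARNING", "ERROR", "CRITICAL"}

def keep_log(record: dict, sample_rate: float = 0.1) -> bool:
    """Keep every abnormal record; keep roughly `sample_rate` of routine ones."""
    if record["level"] in ABNORMAL_LEVELS:
        return True
    # Hash a stable key so all lines of one request share the same decision.
    bucket = zlib.crc32(record["request_id"].encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000
```

Hashing the request id instead of calling a random generator keeps the decision reproducible, so either all routine lines of one request are stored or none are.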
  10. From logs to events model. Log lines usually reflect some event occurring in the system, such as making a connection, authenticating, or querying a database. Executing all phases means a piece of work was done. Defining an event as a piece of work lets it be seen as a service objective of a particular service. By service I mean not only software services but also real physical devices, such as sensors or other machinery from the IoT world. This is also very complementary to Domain Driven Design principles. Isolation and sharing of responsibility between services or domains make events specific to each piece of work in every part of the system. For a login service, events can be successful_logins and failed_logins (due to an authentication problem or business logic). Every event has its own metadata about timing and execution stages in the different phases, about which domain and service it belongs to, and so on. Metrics and events should build a story around the processes in the system. Events can be sampled so that only a fraction is stored for normal behavior, while all problematic ones are stored as is. Events are aggregated and stored as Key Performance Indicators for the objectives of the particular service. This brings together service objective metrics and the metadata related to them at every particular moment, and exposes connections between issues. Written with high cardinality in mind, with dimensions such as services, datacenters and build versions kept at separate granularities, it reveals unknown-unknowns in the system. Is this some form of software instrumentation? Yes. But compared with debug-level logging and full instrumentation, it lets you drink from a fire hose in a production environment without being drowned by data and costs.
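To make the events model more concrete, here is a minimal Python sketch of events carrying metadata and being aggregated into per-window KPI counts. The `Event` fields and the `aggregate_kpis` helper are hypothetical names chosen for illustration; only the event names successful_logins and failed_logins come from the text above.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    name: str          # e.g. "successful_logins" or "failed_logins"
    service: str
    datacenter: str
    build: str
    timestamp: float   # seconds since epoch

def aggregate_kpis(events, window_seconds=60):
    """Count events per time window, keeping every metadata dimension in the key."""
    counts = Counter()
    for e in events:
        window_start = int(e.timestamp // window_seconds) * window_seconds
        counts[(window_start, e.service, e.datacenter, e.build, e.name)] += 1
    return counts
```

Because service, datacenter and build version stay in the key, a question such as "did failed logins rise only for one build in one datacenter" can still be answered after aggregation.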
  11. Why we are not ready for full AI solutions. AI is a good badge for a startup raising investment. But the devil is in the details.   Reproducibility. The problem with fully machine-learned systems, the so-called full AI approach, is that when a system constantly learns behavior it has no reproducibility. If you want to understand why, for example, some condition raised an alert, you cannot, because the models have already changed. Any solution that constantly learns behavior has this problem. Without reproducibility it is very hard to optimize the system itself, which is essential when you operate on highly granular data or metrics.   Resource Consumption. Any sort of constant learning on your data needs a considerable amount of computational resources. Usually this is some form of batch processing over a bunch of data. For some products the minimal requirements for processing 200 000 metrics are 32 vCPUs and 64 GB of RAM; if you want to double that to 400 000 metrics, you need another machine with the same requirements.   You can’t scale fully automated deep learning yet. Research in this field (Samreen Hassan Massak’s master’s thesis) found that training on some thousands of metrics takes days on CPU or hours on GPU. You can’t scale it without blowing your budget.   Speed. All this is quite costly or hard to scale. Solutions like Amazon Forecast (time series forecasting) are batch processing services where you ingest data and wait for the computation to end, so they are not a fit for this.   Clarity. According to Google’s experience (https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/), the rules that catch real incidents most often should be as simple, predictable, and reliable as possible. When models or rules constantly change, you lose understanding of the system and it works as a black box.
  12. Imagine you have thousands of metrics; for good observability you also need to collect high-cardinality data. Every heartbeat of the system will generate statistical fluctuations across your swarm of metrics. https://berlinbuzzwords.de/15/session/signatures-patterns-and-trends-timeseries-data-mining-etsy One of the main lessons learned in Etsy’s Kale project was that alerting on metric anomalies eventually leads to massive amounts of alerts and to manual work tuning thresholds and handcrafting filters for them.
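A rough back-of-the-envelope calculation shows why fluctuations alone produce so many alerts. The sketch below assumes, purely for illustration, that every metric behaves like independent Gaussian noise and that an alert fires whenever a sample lies more than k standard deviations from the mean; real metrics are neither Gaussian nor independent, and this is not how Kale worked.

```python
import math

def false_alert_probability(k_sigma: float) -> float:
    """Two-sided probability that a Gaussian sample lies more than k_sigma from the mean."""
    return math.erfc(k_sigma / math.sqrt(2))

def expected_false_alerts(num_metrics: int, k_sigma: float, checks_per_hour: int = 60) -> float:
    """Expected alerts per hour caused by noise alone, under the assumptions above."""
    return num_metrics * checks_per_hour * false_alert_probability(k_sigma)
```

With a 3-sigma rule the per-check probability is about 0.27%, so 10 000 healthy metrics checked once a minute would still raise on the order of 1 600 alerts per hour.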
  13. Things that should be considered. Any operation on an infinite stream of data is quite an engineering endeavor by itself. You need to deal with the implications of distributed systems.   While monitoring at a high level of events, Service Level Objectives or KPIs, you need to be reactive: rather than constantly querying your data, operate on a stream that can scale horizontally and achieve high throughput and speed without consuming overwhelming resources. Some streaming frameworks, such as Apache Storm, Apache Flink and Apache Spark, are oriented toward tuple processing and not toward time series processing out of the box. There are problems with the semantics of distributed systems. Imagine you have a lot of deployments in different datacenters. There may be network problems, so the agent storing your KPI metrics is unable to send them. After a while, let’s say 3 minutes, the agent sends this data to the system, and this new information should trigger an action on the condition. Should we keep this data window in memory and check for matching conditions not only backward but forward as well? How large should this desynchronization window be? Operating on thousands of metrics in real time makes these questions quite important. In stream processing systems you cannot store everything in a database without losing speed. Real-time stream analysis of time series data in distributed systems is tricky because events about your system’s behavior can arrive out of order, and the conditions that could be met on this data depend on the order of events. This means that at-least-once semantics can be achieved easily, but duplicated events will then make the resulting amounts differ.
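One common way to reason about the desynchronization window is a tumbling window that stays open for a fixed allowed lateness and deduplicates by event id. The Python sketch below is a simplified single-process illustration, not the API of Storm, Flink or Spark; the class name, the 60-second window and the 180-second lateness (the "3 minutes" from the example above) are assumptions.

```python
from collections import defaultdict

class TumblingWindowCounter:
    """Counts events per window, tolerating late and duplicated deliveries."""

    def __init__(self, window_seconds=60, allowed_lateness=180):
        self.window_seconds = window_seconds
        self.allowed_lateness = allowed_lateness
        self.open_windows = defaultdict(int)   # window start -> count
        self.seen_ids = defaultdict(set)       # window start -> event ids seen
        self.max_event_time = float("-inf")
        self.dropped = 0                       # events that arrived too late

    def add(self, event_id, event_time):
        """Ingest one event; return the (window_start, count) pairs it finalizes."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        start = int(event_time // self.window_seconds) * self.window_seconds
        if start + self.window_seconds <= watermark:
            self.dropped += 1                  # its window is already closed
        elif event_id not in self.seen_ids[start]:
            self.seen_ids[start].add(event_id) # ignore at-least-once duplicates
            self.open_windows[start] += 1
        closed = sorted(s for s in self.open_windows
                        if s + self.window_seconds <= watermark)
        return [(s, self._close(s)) for s in closed]

    def _close(self, start):
        self.seen_ids.pop(start, None)
        return self.open_windows.pop(start)
```

The trade-off is visible in the two parameters: a larger allowed lateness absorbs longer network outages but keeps more windows in memory and delays every result by that amount.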
  14. Desirable Features of a Monitoring Strategy by Google Modern design usually involves separating collection and rule evaluation (with a solution like Prometheus server), long-term time series storage (InfluxDB), alert aggregation (Alertmanager), and dashboarding (Grafana). Google’s logs-based systems process large volumes of highly granular data. There’s some inherent delay between when an event occurs and when it is visible in logs. For analysis that’s not time-sensitive, these logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards. An example of this workflow would be using Cloud Dataflow to process logs, BigQuery for ad hoc queries, and Data Studio for the dashboards. By contrast, our metrics-based monitoring system, which collects a large number of metrics from every service at Google, provides much less granular information, but in near real time. These characteristics are fairly typical of other logs- and metrics- based monitoring systems, although there are exceptions, such as real-time logs systems or high-cardinality metrics. In an ideal world, monitoring and alerting code should be subject to the same testing standards as code development. While Prometheus developers are discussing developing unit tests for monitoring, there is currently no broadly adopted system that allows you to do this. At Google, we test our monitoring and alerting using a domain-specific language that allows us to create synthetic time series. We then write assertions based upon the values in a derived time series, or the firing status and label presence of specific alerts.   https://books.google.ee/books?id=fElmDwAAQBAJ&pg=PT88&lpg=PT88&dq=Monitoring+Jess+Frame,+Anthony+Lenton,+Steven+Thurgood,&source=bl&ots=h76liC_qH3&sig=FZ9ZZKzsOwdxwir_pjh9nwCOx1U&hl=en&sa=X&ved=2ahUKEwjdtsXhsKnfAhXwtYsKHVu4C5gQ6AEwBnoECAIQAQ#v=onepage&q=Monitoring%20Jess%20Frame%2C%20Anthony%20Lenton%2C%20Steven%20Thurgood%2C&f=false
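The idea of testing alerting rules against synthetic time series can be illustrated with a small Python sketch. It is not Google’s domain-specific language or a Prometheus feature; `synthetic_series`, `alert_firing` and the "above threshold for N consecutive samples" rule are hypothetical stand-ins for whatever rule engine is actually in use.

```python
def synthetic_series(baseline, spike, spike_start, spike_len, length):
    """Build a flat series with one rectangular spike."""
    return [spike if spike_start <= i < spike_start + spike_len else baseline
            for i in range(length)]

def alert_firing(series, threshold, for_samples):
    """Per-sample firing status: True once the value has stayed above
    `threshold` for `for_samples` consecutive samples."""
    firing, streak = [], 0
    for value in series:
        streak = streak + 1 if value > threshold else 0
        firing.append(streak >= for_samples)
    return firing
```

A test would then assert, for example, that a four-sample spike fires the alert on its third sample, while a two-sample blip never fires it.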