9. CLASSIC WAY
• Checking the status and behaviour of systems
• Checks to verify that a bunch of things are within thresholds
• Dashboards built with Graphite or Grafana
14. LOG AGGREGATION
• Tools like Splunk or ELK are very helpful
• But they come with a cost
• Modern systems generate huge amounts of logs
• It can raise billing to the moon
18. WHY WE NEED A STREAMING APPROACH?
• Gaining observability and bringing unknown-unknowns into the spotlight needs highly granular data
• Even with carefully designed metrics and events, you will eventually end up with quite a large amount of them
• For operating at this scale in real time, regular querying or batch jobs have significant latency and overhead
19. WHY IS IT HARD?
• Any operation on an infinite stream of data is quite an engineering endeavor by itself
• You need to deal with distributed-systems implications
• Operating on thousands of metrics in real time makes these questions quite important
• Events can be unordered
21. OBSERVABILITY IN 2019
• Process large volumes of highly granular data
• Near real time
• Ad hoc questions to data on demand
• Flexibility related to the business domain
What is Observability
There are a lot of discussions and jokes about this term. Some of them:
— Why call it monitoring? That’s not sexy enough anymore.
— Observability, because rebranding Ops as DevOps wasn’t bad enough, now they’re devopsifying monitoring too
— New Chuck Norris of DevOps
— I’m an engineer that can help provide monitoring to the other engineers in the organization. > Great, here’s $80k.
— I’m an architect that can help provide observability for cloud-native, container-based applications. > Awesome! Here’s $300k!
Cindy Sridharan
What is the difference between Monitoring and Observability, if there is one?
Looking back…
Years ago, we mostly operated software on physical servers. Our applications were monoliths built on LAMP or some other stack. Checking uptime was as simple as making regular pings and keeping an eye on CPU/disk usage for your application.
Paradigm Shift
The main paradigm shift came from the infrastructure and architecture space. Cloud architectures, microservices, Kubernetes and immutable infrastructure changed the way companies build and operate systems.
With the adoption of these new ideas, the systems we built became more and more distributed and ephemeral.
Virtualization, containerization and orchestration frameworks take responsibility for providing computational resources and handling failures, creating an abstraction layer over hardware and networking.
Moving away from the underlying hardware and networking means that our responsibility is focused on ensuring that our applications work as intended and serve the business processes they were built for.
What is Monitoring
Monitoring is to operations what tests are to software development. Tests check the behavior of system parts against a set of inputs in a sandboxed environment, usually with heavily mocked components.
The main issue is that the range of possible problems in production can’t be covered by tests in any way. Most problems in a mature, stable system are unknown-unknowns, related not only to the software itself but to the real world too.
For the uninitiated, blackbox monitoring refers to the category of monitoring derived by treating the system as a blackbox and examining it from the outside. While some believe that with more sophisticated tooling at our disposal blackbox monitoring is a thing of the past, I’d argue that blackbox monitoring still has its place, what with large parts of core business and infrastructural components being outsourced to third-party vendors.
Even outside of third-party integrations, treating our own systems as blackboxes might still have some value, especially in a microservices environment where different services owned by different teams might be involved in servicing a request. In such cases, being able to communicate quantitatively about systems paves the way toward establishing SLOs for different services.
Whitebox Monitoring versus Observability
“Whitebox monitoring” refers to a category of “monitoring” based on the information derived from the internals of systems. Whitebox monitoring isn’t really a revolutionary idea anymore. Time series, logs and traces are all more in vogue than ever these days and have been for a few years.
So then. Is observability just whitebox monitoring by another name?
Well, not quite.
Why we need new monitoring.
Monitoring is quite often separated from the Observability concept (https://thenewstack.io/monitoring-and-observability-whats-the-difference-and-why-does-it-matter/) by defining it as something that gathers data about the state of infrastructure/apps and performance traces in one way or another.
Or, according to honeycomb.io:
you are checking the status and behaviors of your systems against a known baseline, to determine if anything is not behaving as expected.
You can write Nagios checks to verify that a bunch of things are within known good thresholds.
You can build dashboards with Graphite or Ganglia to group sets of useful graphs.
All of these are terrific tools for understanding the known-unknowns about your system.
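The known-unknowns checks described above can be sketched as a simple threshold comparison. The exit codes follow the Nagios convention, but the metric and the concrete thresholds below are made up for illustration.

```python
# Nagios-style exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(value, warn, crit):
    """Return a Nagios-style status for one metric reading
    against known-good thresholds (assumes higher == worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

# Example: disk usage at 87% with warn at 80% and crit at 95%
status = check_threshold(87, warn=80, crit=95)  # WARNING
```

This is exactly the "known baseline" style of check: it only catches conditions you thought to encode in advance.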
A large ecosystem of such products has evolved: New Relic, Datadog, AppDynamics. All these tools are a perfect fit for low-level and mid-level monitoring or for untangling performance issues.
These types of monitoring tools do not handle queries on data with high cardinality. Nor do they help much with problems related to third-party integrations or with the behavior of large, complex systems with a swarm of services running in modern virtual environments.
While adding telemetry to different parts of the system is common practice, it usually ends with a bunch of spaghetti drawn on dashboards.
These are GitLab’s operational metrics; they are open to the public.
https://dashboards.gitlab.com/d/mnbqU9Smz/fleet-overview?refresh=5m&orgId=1
Why dashboards are useless.
Actually, they are not. But only when you know where and when to look. Otherwise, better watch YouTube.
Dashboards do not scale.
Imagine a situation where you have a bunch of metrics related to your infrastructure (cpu_usage, disk quotas) and app-related metrics such as JVM allocation_speed, gc_runs, etc. The number of these metrics can easily grow to thousands or tens to hundreds of thousands. All your dashboards are green, but a problem has occurred in a third-party integration service. Your dashboards are still green, yet end users are already affected.
So you decide to add third-party integration checks to your monitoring, and you get an additional bunch of metrics and dashboards on your TV set. Until some new case arises.
When asked why customers can’t open the site, it often looks like this:
Log aggregation.
Log aggregation tools such as the Elastic Stack or Splunk are used by the vast majority of modern IT companies. These instruments are amazingly helpful for Root Cause Analysis or Post Mortems. They also have the ability to monitor conditions that can be derived from your log flow.
But it comes with a cost. Modern systems generate huge amounts of logs, and growing traffic can exhaust your ELK resources or raise your Splunk bill to the moon.
There are sampling techniques that can reduce the volume of the usual, so-called boring logs by an order of magnitude or more while keeping all abnormal ones in full. This gives a high-level overview of normal system behavior and a detailed view of any problematic one.
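The sampling idea above can be sketched as follows: keep every abnormal line, store only a fraction of routine ones. The 1% rate and the ERROR/WARN markers are assumptions for illustration, not a prescription.

```python
import random

BORING_SAMPLE_RATE = 0.01  # keep ~1% of routine ("boring") lines

def keep_line(line, rng=random.random):
    """Keep all abnormal log lines in full, sample the boring ones."""
    if "ERROR" in line or "WARN" in line:
        return True                    # abnormal: always stored
    return rng() < BORING_SAMPLE_RATE  # boring: stored with 1% probability

logs = ["GET /health 200"] * 1000 + ["ERROR db timeout"]
kept = [line for line in logs if keep_line(line)]
# All errors survive; routine volume drops by about two orders of magnitude.
```

The `rng` parameter is only there to make the sampling decision testable; in production it would simply be the default random source.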
From logs to events model
Usually a log line reflects some event occurring in the system: making a connection, authentication, a query to the database, and so on. Executing all phases means a piece of work was done. Defining an event as a piece of work lets it be tied to the Service Objectives of a particular service. By service I mean not only software services but real physical devices as well, such as sensors or other machinery from the IoT world.
It is also very complementary to Domain-Driven Design principles. Isolation and responsibility sharing between services or domains make events specific to each piece of work in every part of the system.
For a Login Service, events can be successful_logins and failed_logins (due to an authentication problem or business logic). Every event carries its own metadata about timing and execution stages across phases: which domain, service, etc.
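As a sketch, an event in this model is a piece of work plus its metadata. The field names below (service, datacenter, build, phases) are illustrative high-cardinality dimensions, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str            # e.g. "successful_login" or "failed_login"
    service: str         # high-cardinality dimensions...
    datacenter: str
    build: str
    duration_ms: float   # overall timing of the piece of work
    phases: dict = field(default_factory=dict)  # per-phase timings

ev = Event(
    name="failed_login",
    service="login-service",
    datacenter="eu-1",
    build="v2.3.1",
    duration_ms=84.2,
    phases={"db_query": 61.0, "auth_check": 12.5},
)
```

Because every dimension travels with the event, you can later slice by service, datacenter or build version without pre-declaring each combination as a separate metric.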
Metrics and events should build a story around processes in the system.
Events can be sampled so that for normal behavior only a fraction is stored, while all problematic ones are stored as is. Events are aggregated and stored as Key Performance Indicators for the objectives of the particular service.
This brings service-objective metrics together with the metadata related to them at every particular moment, which surfaces connections between issues.
Written with high cardinality in mind (services, datacenters, build versions as separate dimensions), it reveals unknown-unknowns in the system.
Is this some form of software instrumentation? Yes. But compared with debug-level logging and full instrumentation, you can drink from the fire hose in a production environment without being drowned by data and costs.
Why we are not ready for full AI solutions.
AI is a good badge for a startup raising investments. But the devil hides in the details.
Reproducibility
The problem with fully machine-learned systems, the so-called full-AI approach, is that when the system constantly learns behavior, you lose reproducibility. If you want to understand why some condition was alerted, you can’t, because the models have already changed. Any solution that constantly learns behavior has this problem.
Without reproducibility it is very hard to optimize the system itself, which is essential when you operate on highly granular data or metrics.
Resource Consumption
Any sort of constant learning on your data requires a considerable amount of computational resources, usually in the form of batch processing over a bunch of data. For some products, the minimal requirement for processing 200 000 metrics is 32 vCPU and 64 GB RAM; if you want to double that to 400 000 metrics, you need another machine with the same specs.
You can’t scale Deep Learning full automation yet
Research in this field (Samreen Hassan Massak’s master thesis) found that the training process for a few thousand metrics takes days on CPU or hours on GPU. You can’t scale that without blowing your budget.
Speed
All this is quite costly and hard to scale. Solutions like Amazon Forecast (time series forecasting) are batch-processing services where you ingest data and wait for the computation to end; they are not a fit for this.
Clarity
According to Google’s experience (https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/):
The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
When models or rules constantly change, you lose understanding of the system and it works as a black box.
Imagine you have thousands of metrics; for good observability you need to collect high-cardinality data. Every heartbeat of the system will generate statistical fluctuations across your swarm of metrics.
https://berlinbuzzwords.de/15/session/signatures-patterns-and-trends-timeseries-data-mining-etsy
One of the main lessons learned in Etsy’s Kale project was:
Alerting on metric anomalies will eventually lead to massive amounts of alerts and manual work: playing with thresholds and handcrafting filters for them.
Things should be considered
Any operation on an infinite stream of data is quite an engineering endeavor by itself. You need to deal with distributed-systems implications.
When monitoring at the high level of events, Service Level Objectives or KPIs, you need to be reactive: instead of constantly querying your data, operate on a stream, which can scale horizontally and achieve large throughput and speed without consuming overwhelming resources.
Some streaming frameworks, such as Apache Storm, Apache Flink and Apache Spark, are oriented toward tuple processing and do not support time series processing out of the box.
There are problems with semantics of distributed systems.
Imagine you have many deployments in different datacenters. A network problem can leave the agent storing your KPI metrics unable to send them. After a while, say 3 minutes, the agent sends this data to the system, and this new information should trigger an action on some condition. Should we keep this data window in memory and check for condition matches not only backwards but forwards as well? How large should this desynchronization window be? Operating on thousands of metrics in real time makes these questions quite important. In stream-processing systems you cannot store everything in a database without losing speed.
Real-time stream analysis of time series data in distributed systems is tricky because events about your system’s behavior can arrive unordered, and the conditions that could be met on this data depend on the order of events. This means at-least-once semantics can be achieved easily, but the number of duplicates will vary.
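The desynchronization-window question can be sketched as a bucketed counter with an allowed-lateness bound. The 1-minute buckets and 3-minute lateness are assumptions taken from the example above; real streaming frameworks (e.g. Flink's event-time watermarks) implement this far more elaborately.

```python
from collections import defaultdict

BUCKET = 60      # window size in seconds
LATENESS = 180   # accept events up to 3 minutes behind the watermark

class WindowedCounter:
    def __init__(self):
        self.buckets = defaultdict(int)  # window index -> event count
        self.watermark = 0               # highest event time seen so far

    def add(self, event_time):
        """Count an event; reject it if too far behind the watermark."""
        self.watermark = max(self.watermark, event_time)
        if event_time < self.watermark - LATENESS:
            return False                 # too late: dropped (or routed to a side output)
        self.buckets[event_time // BUCKET] += 1
        return True

w = WindowedCounter()
w.add(600)   # on-time event; watermark moves to 600
w.add(480)   # 2 minutes late: still inside the lateness bound, accepted
w.add(300)   # 5 minutes late: rejected
```

The trade-off is exactly the one described above: a larger lateness bound catches more delayed agents but forces you to keep more windows in memory.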
Desirable Features of a Monitoring Strategy by Google
Modern design usually involves separating collection and rule evaluation (with a solution like Prometheus server), long-term time series storage (InfluxDB), alert aggregation (Alertmanager), and dashboarding (Grafana).
Google’s logs-based systems process large volumes of highly granular data. There’s some inherent delay between when an event occurs and when it is visible in logs. For analysis that’s not time-sensitive, these logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards. An example of this workflow would be using Cloud Dataflow to process logs, BigQuery for ad hoc queries, and Data Studio for the dashboards.
By contrast, our metrics-based monitoring system, which collects a large number of metrics from every service at Google, provides much less granular information, but in near real time. These characteristics are fairly typical of other logs- and metrics-based monitoring systems, although there are exceptions, such as real-time logs systems or high-cardinality metrics.
In an ideal world, monitoring and alerting code should be subject to the same testing standards as code development. While Prometheus developers are discussing developing unit tests for monitoring, there is currently no broadly adopted system that allows you to do this.
At Google, we test our monitoring and alerting using a domain-specific language that allows us to create synthetic time series. We then write assertions based upon the values in a derived time series, or the firing status and label presence of specific alerts.
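The synthetic-time-series idea can be approximated even without a dedicated DSL: generate a series, derive an alert condition from it, and assert on its firing status. The 5-point moving-average rule and the threshold below are hypothetical, not Google's actual language.

```python
def firing(series, threshold, window=5):
    """Alert fires when the mean of the last `window` points exceeds threshold."""
    if len(series) < window:
        return False
    return sum(series[-window:]) / window > threshold

# Synthetic series: a healthy baseline versus a sustained spike.
healthy = [10, 12, 11, 9, 10]
spiking = [10, 12, 90, 95, 92, 91, 94]

assert not firing(healthy, threshold=50)
assert firing(spiking, threshold=50)
```

Writing the alert rule as a pure function over a series is what makes it testable this way: the same code path runs against synthetic data in CI and real data in production.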
https://books.google.ee/books?id=fElmDwAAQBAJ&pg=PT88&lpg=PT88&dq=Monitoring+Jess+Frame,+Anthony+Lenton,+Steven+Thurgood,&source=bl&ots=h76liC_qH3&sig=FZ9ZZKzsOwdxwir_pjh9nwCOx1U&hl=en&sa=X&ved=2ahUKEwjdtsXhsKnfAhXwtYsKHVu4C5gQ6AEwBnoECAIQAQ#v=onepage&q=Monitoring%20Jess%20Frame%2C%20Anthony%20Lenton%2C%20Steven%20Thurgood%2C&f=false