9. CLASSIC WAY
• Checking the status and behaviour of systems
• Checks to verify that a bunch of things are within thresholds
• Dashboards built with Graphite or Grafana
14. LOG AGGREGATION
• Tools like Splunk or ELK are very helpful
• But they come with a cost
• Modern systems generate huge amounts of logs
• It can raise billing to the moon
18. WHY WE NEED A STREAMING APPROACH?
• Gaining observability and bringing unknown-unknowns into the spotlight needs highly granular data
• Even with carefully designed metrics and events, you will eventually end up with quite a large amount of them
• For operating at this scale in real time, regular querying or batch jobs have significant latency and overhead
19. WHY IS IT HARD?
• Any operation on an infinite stream of data is quite an engineering endeavor by itself
• You need to deal with distributed-systems implications
• Operating on thousands of metrics in real time makes these questions quite important
• Events can be unordered
21. OBSERVABILITY IN 2019
• Process large volumes of highly granular data
• Near real time
• Ad hoc questions to data on demand
• Flexibility related to the business domain
What is Observability
There are a lot of discussions and jokes about this term. Some of them:
— Why call it monitoring? That’s not sexy enough anymore.
— Observability, because rebranding Ops as DevOps wasn’t bad enough, now they’re devopsifying monitoring too
— New Chuck Norris of DevOps
— I’m an engineer that can help provide monitoring to the other engineers in the organization. > Great, here’s $80k.
— I’m an architect that can help provide observability for cloud-native, container-based applications. > Awesome! Here’s $300k!
Cindy Sridharan
What is the difference between Monitoring and Observability, if there is one?
Looking back…
Years ago, we mostly operated software on physical servers. Our applications were monoliths built on LAMP or some other stack. Checking uptime was as simple as making regular pings and keeping an eye on CPU/disk usage for your application.
Paradigm Shift
The main paradigm shift came from the infrastructure and architecture space. Cloud architectures, microservices, Kubernetes and immutable infrastructure changed the way companies build and operate systems.
With the adoption of these new ideas, the systems we built became more and more distributed and ephemeral.
Virtualization, containerization and orchestration frameworks take responsibility for providing computational resources and handling failures, creating an abstraction layer over hardware and networking.
Moving away from the underlying hardware and networking means that our responsibility is focused on ensuring that our applications work as intended and serve the business processes they were built for.
What is Monitoring
Monitoring is to operations what tests are to software development. Tests check the behavior of system parts against a set of inputs in a sandboxed environment, usually with heavily mocked components.
The main issue is that the range of possible problems in production can’t be covered by tests in any way. Most problems in a mature, stable system are unknown-unknowns, related not only to the software itself but to the real world too.
For the uninitiated, blackbox monitoring refers to the category of monitoring derived by treating the system as a blackbox and examining it from the outside. While some believe that with more sophisticated tooling at our disposal blackbox monitoring is a thing of the past, I’d argue that blackbox monitoring still has its place, what with large parts of core business and infrastructural components being outsourced to third-party vendors.
Even outside of third-party integrations, treating our own systems as blackboxes might still have some value, especially in a microservices environment where different services owned by different teams might be involved in servicing a request. In such cases, being able to communicate quantitatively about systems paves the way toward establishing SLOs for different services.
Whitebox Monitoring versus Observability
“Whitebox monitoring” refers to a category of “monitoring” based on the information derived from the internals of systems. Whitebox monitoring isn’t really a revolutionary idea anymore. Time series, logs and traces are all more in vogue than ever these days and have been for a few years.
So then. Is observability just whitebox monitoring by another name?
Well, not quite.
Why we need new monitoring.
Monitoring is quite often separated from the Observability concept (https://thenewstack.io/monitoring-and-observability-whats-the-difference-and-why-does-it-matter/) by defining it as something that gathers data about the state of infrastructure/apps and performance traces in one way or another.
Or, according to honeycomb.io:
you are checking the status and behaviors of your systems against a known baseline, to determine if anything is not behaving as expected.
You can write Nagios checks to verify that a bunch of things are within known good thresholds.
You can build dashboards with Graphite or Ganglia to group sets of useful graphs.
All of these are terrific tools for understanding the known-unknowns about your system.
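The known-unknowns checks described above can be sketched as a simple threshold comparison. The exit codes follow the Nagios convention, but the metric and the concrete thresholds below are made up for illustration.

```python
# Nagios-style exit codes
OK, WARNING, CRITICAL = 0, 1, 2

def check_threshold(value, warn, crit):
    """Return a Nagios-style status for one metric reading
    against known-good thresholds (assumes higher == worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

# Example: disk usage at 87% with warn at 80% and crit at 95%
status = check_threshold(87, warn=80, crit=95)  # WARNING
```

This is exactly the "known baseline" style of check: it only catches conditions you thought to encode in advance.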
A large ecosystem of such products has evolved: New Relic, Datadog, AppDynamics. All these tools are a perfect fit for low-level and mid-level monitoring or for untangling performance issues.
These types of monitoring tools do not handle queries on data with high cardinality. Nor do they help much with problems related to third-party integrations or with the behavior of large, complex systems with a swarm of services running in modern virtual environments.
While adding telemetry to different parts of the system is common practice, it usually ends with a bunch of spaghetti drawn on dashboards.
These are GitLab’s operational metrics; they are open to the public.
https://dashboards.gitlab.com/d/mnbqU9Smz/fleet-overview?refresh=5m&orgId=1
Why dashboards are useless.
Actually, they are not. But only when you know where and when to look. Otherwise, better watch YouTube.
Dashboards do not scale.
Imagine a situation where you have a bunch of metrics related to your infrastructure (cpu_usage, disk quotas) and app-related metrics such as JVM allocation_speed, gc_runs, etc. The number of these metrics can easily grow to thousands or tens to hundreds of thousands. All your dashboards are green, but a problem has occurred in a third-party integration service. Your dashboards are still green, yet end users are already affected.
So you decide to add third-party integration checks to your monitoring, and you get an additional bunch of metrics and dashboards on your TV set. Until some new case arises.
When asked why customers can’t open the site, it often looks like this:
Log aggregation.
Log aggregation tools such as the Elastic Stack or Splunk are used by the vast majority of modern IT companies. These instruments are amazingly helpful for Root Cause Analysis or Post Mortems. They also have the ability to monitor conditions that can be derived from your log flow.
But it comes with a cost. Modern systems generate huge amounts of logs, and growing traffic can exhaust your ELK resources or raise your Splunk bill to the moon.
There are sampling techniques that can reduce the volume of the usual, so-called boring logs by an order of magnitude or more while keeping all abnormal ones in full. This gives a high-level overview of normal system behavior and a detailed view of any problematic one.
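The sampling idea above can be sketched as follows: keep every abnormal line, store only a fraction of routine ones. The 1% rate and the ERROR/WARN markers are assumptions for illustration, not a prescription.

```python
import random

BORING_SAMPLE_RATE = 0.01  # keep ~1% of routine ("boring") lines

def keep_line(line, rng=random.random):
    """Keep all abnormal log lines in full, sample the boring ones."""
    if "ERROR" in line or "WARN" in line:
        return True                    # abnormal: always stored
    return rng() < BORING_SAMPLE_RATE  # boring: stored with 1% probability

logs = ["GET /health 200"] * 1000 + ["ERROR db timeout"]
kept = [line for line in logs if keep_line(line)]
# All errors survive; routine volume drops by about two orders of magnitude.
```

The `rng` parameter is only there to make the sampling decision testable; in production it would simply be the default random source.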
From logs to events model
Usually a log line reflects some event occurring in the system: making a connection, authentication, a query to the database, and so on. Executing all phases means a piece of work was done. Defining an event as a piece of work lets it be tied to the Service Objectives of a particular service. By service I mean not only software services but real physical devices as well, such as sensors or other machinery from the IoT world.
It is also very complementary to Domain-Driven Design principles. Isolation and responsibility sharing between services or domains make events specific to each piece of work in every part of the system.
For a Login Service, events can be successful_logins and failed_logins (due to an authentication problem or business logic). Every event carries its own metadata about timing and execution stages across phases: which domain, service, etc.
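As a sketch, an event in this model is a piece of work plus its metadata. The field names below (service, datacenter, build, phases) are illustrative high-cardinality dimensions, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str            # e.g. "successful_login" or "failed_login"
    service: str         # high-cardinality dimensions...
    datacenter: str
    build: str
    duration_ms: float   # overall timing of the piece of work
    phases: dict = field(default_factory=dict)  # per-phase timings

ev = Event(
    name="failed_login",
    service="login-service",
    datacenter="eu-1",
    build="v2.3.1",
    duration_ms=84.2,
    phases={"db_query": 61.0, "auth_check": 12.5},
)
```

Because every dimension travels with the event, you can later slice by service, datacenter or build version without pre-declaring each combination as a separate metric.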
Metrics and events should build a story around processes in the system.
Events can be sampled so that for normal behavior only a fraction is stored, while all problematic ones are stored as is. Events are aggregated and stored as Key Performance Indicators for the objectives of the particular service.
This brings service-objective metrics together with the metadata related to them at every particular moment, which surfaces connections between issues.
Written with high cardinality in mind (services, datacenters, build versions as separate dimensions), it reveals unknown-unknowns in the system.
Is this some form of software instrumentation? Yes. But compared with debug-level logging and full instrumentation, you can drink from the fire hose in a production environment without being drowned by data and costs.
Why we are not ready for full AI solutions.
AI is a good badge for a startup raising investments. But the devil hides in the details.
Reproducibility
The problem with fully machine-learned systems, the so-called full-AI approach, is that when the system constantly learns behavior, you lose reproducibility. If you want to understand why some condition was alerted, you can’t, because the models have already changed. Any solution that constantly learns behavior has this problem.
Without reproducibility it is very hard to optimize the system itself, which is essential when you operate on highly granular data or metrics.
Resource Consumption
Any sort of constant learning on your data requires a considerable amount of computational resources, usually in the form of batch processing over a bunch of data. For some products, the minimal requirement for processing 200 000 metrics is 32 vCPU and 64 GB RAM; if you want to double that to 400 000 metrics, you need another machine with the same specs.
You can’t scale Deep Learning full automation yet
Research in this field (Samreen Hassan Massak’s master thesis) found that the training process for a few thousand metrics takes days on CPU or hours on GPU. You can’t scale that without blowing your budget.
Speed
All this is quite costly and hard to scale. Solutions like Amazon Forecast (time series forecasting) are batch-processing services where you ingest data and wait for the computation to end; they are not a fit for this.
Clarity
According to Google’s experience (https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/):
The rules that catch real incidents most often should be as simple, predictable, and reliable as possible.
When models or rules constantly change, you lose understanding of the system and it works as a black box.
Imagine you have thousands of metrics; for good observability you need to collect high-cardinality data. Every heartbeat of the system will generate statistical fluctuations across your swarm of metrics.
https://berlinbuzzwords.de/15/session/signatures-patterns-and-trends-timeseries-data-mining-etsy
One of the main lessons learned in Etsy’s Kale project was:
Alerting on metric anomalies will eventually lead to massive amounts of alerts and manual work: playing with thresholds and handcrafting filters for them.
Things should be considered
Any operation on an infinite stream of data is quite an engineering endeavor by itself. You need to deal with distributed-systems implications.
When monitoring at the high level of events, Service Level Objectives or KPIs, you need to be reactive: instead of constantly querying your data, operate on a stream, which can scale horizontally and achieve large throughput and speed without consuming overwhelming resources.
Some streaming frameworks, such as Apache Storm, Apache Flink and Apache Spark, are oriented toward tuple processing and do not support time series processing out of the box.
There are problems with semantics of distributed systems.
Imagine you have many deployments in different datacenters. A network problem can leave the agent storing your KPI metrics unable to send them. After a while, say 3 minutes, the agent sends this data to the system, and this new information should trigger an action on some condition. Should we keep this data window in memory and check for condition matches not only backwards but forwards as well? How large should this desynchronization window be? Operating on thousands of metrics in real time makes these questions quite important. In stream-processing systems you cannot store everything in a database without losing speed.
Real-time stream analysis of time series data in distributed systems is tricky because events about your system’s behavior can arrive unordered, and the conditions that could be met on this data depend on the order of events. This means at-least-once semantics can be achieved easily, but the number of duplicates will vary.
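The desynchronization-window question can be sketched as a bucketed counter with an allowed-lateness bound. The 1-minute buckets and 3-minute lateness are assumptions taken from the example above; real streaming frameworks (e.g. Flink's event-time watermarks) implement this far more elaborately.

```python
from collections import defaultdict

BUCKET = 60      # window size in seconds
LATENESS = 180   # accept events up to 3 minutes behind the watermark

class WindowedCounter:
    def __init__(self):
        self.buckets = defaultdict(int)  # window index -> event count
        self.watermark = 0               # highest event time seen so far

    def add(self, event_time):
        """Count an event; reject it if too far behind the watermark."""
        self.watermark = max(self.watermark, event_time)
        if event_time < self.watermark - LATENESS:
            return False                 # too late: dropped (or routed to a side output)
        self.buckets[event_time // BUCKET] += 1
        return True

w = WindowedCounter()
w.add(600)   # on-time event; watermark moves to 600
w.add(480)   # 2 minutes late: still inside the lateness bound, accepted
w.add(300)   # 5 minutes late: rejected
```

The trade-off is exactly the one described above: a larger lateness bound catches more delayed agents but forces you to keep more windows in memory.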
Desirable Features of a Monitoring Strategy by Google
Modern design usually involves separating collection and rule evaluation (with a solution like Prometheus server), long-term time series storage (InfluxDB), alert aggregation (Alertmanager), and dashboarding (Grafana).
Google’s logs-based systems process large volumes of highly granular data. There’s some inherent delay between when an event occurs and when it is visible in logs. For analysis that’s not time-sensitive, these logs can be processed with a batch system, interrogated with ad hoc queries, and visualized with dashboards. An example of this workflow would be using Cloud Dataflow to process logs, BigQuery for ad hoc queries, and Data Studio for the dashboards.
By contrast, our metrics-based monitoring system, which collects a large number of metrics from every service at Google, provides much less granular information, but in near real time. These characteristics are fairly typical of other logs- and metrics-based monitoring systems, although there are exceptions, such as real-time logs systems or high-cardinality metrics.
In an ideal world, monitoring and alerting code should be subject to the same testing standards as code development. While Prometheus developers are discussing developing unit tests for monitoring, there is currently no broadly adopted system that allows you to do this.
At Google, we test our monitoring and alerting using a domain-specific language that allows us to create synthetic time series. We then write assertions based upon the values in a derived time series, or the firing status and label presence of specific alerts.
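The synthetic-time-series idea can be approximated even without a dedicated DSL: generate a series, derive an alert condition from it, and assert on its firing status. The 5-point moving-average rule and the threshold below are hypothetical, not Google's actual language.

```python
def firing(series, threshold, window=5):
    """Alert fires when the mean of the last `window` points exceeds threshold."""
    if len(series) < window:
        return False
    return sum(series[-window:]) / window > threshold

# Synthetic series: a healthy baseline versus a sustained spike.
healthy = [10, 12, 11, 9, 10]
spiking = [10, 12, 90, 95, 92, 91, 94]

assert not firing(healthy, threshold=50)
assert firing(spiking, threshold=50)
```

Writing the alert rule as a pure function over a series is what makes it testable this way: the same code path runs against synthetic data in CI and real data in production.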
https://books.google.ee/books?id=fElmDwAAQBAJ&pg=PT88&lpg=PT88&dq=Monitoring+Jess+Frame,+Anthony+Lenton,+Steven+Thurgood,&source=bl&ots=h76liC_qH3&sig=FZ9ZZKzsOwdxwir_pjh9nwCOx1U&hl=en&sa=X&ved=2ahUKEwjdtsXhsKnfAhXwtYsKHVu4C5gQ6AEwBnoECAIQAQ#v=onepage&q=Monitoring%20Jess%20Frame%2C%20Anthony%20Lenton%2C%20Steven%20Thurgood%2C&f=false