Monitoring Distributed Systems

Never Fail Twice
How Playtech Mastered Failure Detection Across Distributed Systems

Bio
• Technical Architect with more than 18 years of experience
• Passionate about IT
• Financial and Data Science background
• Recent years spent on research and design projects

Agenda
• What observability and monitoring are
• Why this is hard
• Possible approaches
• How we solved it
• The future of instrumentation and observability

Objectives
• Get acquainted with time-series analysis
• Understand the pros and cons of distributed systems
• Understand observability and instrumentation concepts

Observability
• Monitoring is for operating software/systems
• Instrumentation is for writing software
• Observability is for understanding systems
Charity Majors

Why is it difficult?
1. Various problems may lead to non-obvious system behaviour.
2. Various metrics may have different correlations in time and space.
3. Monitoring a complex application is a significant engineering endeavor in and of itself.
4. There is a mix of different measurements and metrics.

System monitoring in Playtech
• 50+ multibranded sites, distributed all over the world
• Multiple products
• Multichannel
• Different mix of integrations

On the shoulders of giants
Many companies have built their own solutions for monitoring their systems.
These were not always success stories.

Etsy
• Etsy is a large online marketplace of handmade goods
• Their engineering team collected more than 250,000 different metrics from their servers
• They tried to find anomalies using complex mathematical approaches

Lessons learnt from KALE 1.0
• Anomalies in other metrics should be used for root cause analysis
• Alerts should only be sent out when anomalies are detected in business and user metrics
• A one-size-fits-all type of approach will probably not fit at all
• Anomaly detection is more than just outlier detection

Google SRE team’s BorgMon
• Google has trended toward simpler and faster monitoring systems, with better tools for post hoc analysis
• [They] avoid “magic” systems that try to learn thresholds or automatically detect causality
• Rules that generate alerts for humans should be simple to understand and represent a clear failure
According to the authors of Site Reliability Engineering

Playtech case
• The previous tool, from HP, was “one-size-fits-all”
• Low efficiency and side effects
• False positives and missed incidents
• Horrible operability

Time Series
• A time series is a series of data points indexed (or listed or graphed) in time order
• Economic processes have a regular structure
• Examples include sales volumes in shops, champagne production, and online transactions
• Usually they have seasonal periods and trend lines
• Using this information simplifies analysis

Stationary Time-Series Data
• A stochastic process whose statistical characteristics do not change over time
• Example: white noise

Non-Stationary Time Series
• Has a trend line
• Its dispersion (variance) changes over time

How to model that?
• Every measurement consists of a signal and an error component (noise), because our processes are affected by many factors
• Point_of_measurement = signal + error
• Subtract the model’s values from our measurements
• The more our model resembles the real signal, the more closely the residual approximates the error component, that is, stationary white noise

Regression or finding a trend line

Trend line subtracted
Looks like white noise
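
The trend-subtraction step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic series, the helper names `fit_trend` and `detrend`, and the choice of an ordinary least-squares straight line are hypothetical and are not taken from the Playtech system.

```python
import random


def fit_trend(values):
    """Fit an ordinary least-squares line through (0, y0), (1, y1), ...

    Returns (slope, intercept).
    """
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_t


def detrend(values):
    """Subtract the fitted trend line; what remains is the residual."""
    slope, intercept = fit_trend(values)
    return [y - (intercept + slope * t) for t, y in enumerate(values)]


# Synthetic example (assumption): linear signal plus Gaussian noise.
random.seed(42)
series = [10.0 + 0.5 * t + random.gauss(0.0, 1.0) for t in range(200)]
slope, intercept = fit_trend(series)
residual = detrend(series)
```

If the linear model is a good description of the signal, `residual` should fluctuate around zero without any visible trend.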

Dickey-Fuller test on the initial data
The unit-root null hypothesis is not rejected: the series is not stationary

And after subtraction
Result is a stationary time series
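
A simplified, non-augmented Dickey-Fuller statistic can be computed by hand to show the idea. This is a sketch under assumptions, not the test configuration used in the talk: the function names and the synthetic data are hypothetical, the regression includes a constant but no trend term, and the critical value is the asymptotic 5% value for that case. For real work a library implementation such as `adfuller` from statsmodels is the safer choice.

```python
import math
import random


def dickey_fuller_stat(values):
    """t-statistic of rho in the regression diff(y)[t] = alpha + rho * y[t-1] + e[t]."""
    x = values[:-1]
    dy = [b - a for a, b in zip(values, values[1:])]
    n = len(x)
    mean_x = sum(x) / n
    mean_dy = sum(dy) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (di - mean_dy) for xi, di in zip(x, dy))
    rho = sxy / sxx
    alpha = mean_dy - rho * mean_x
    rss = sum((di - alpha - rho * xi) ** 2 for xi, di in zip(x, dy))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return rho / std_err


# Asymptotic 5% critical value for the constant, no-trend case.
CRITICAL_5_PERCENT = -2.86


def looks_stationary(values):
    """True when the unit-root null hypothesis is rejected at roughly the 5% level."""
    return dickey_fuller_stat(values) < CRITICAL_5_PERCENT


# Synthetic example (assumption): the noise stands in for the residual
# left after the trend has been subtracted.
random.seed(7)
noise = [random.gauss(0.0, 1.0) for _ in range(200)]
trending = [0.5 * t + e for t, e in enumerate(noise)]

trending_is_stationary = looks_stationary(trending)
noise_is_stationary = looks_stationary(noise)
```

The test statistic for the trending series stays close to zero, so the unit root is not rejected, while the noise yields a strongly negative statistic and is classified as stationary.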

Let’s take a moving average of our example

Compared with the next week’s data
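
One way to compare a new week against a moving-average model is sketched below. The slides do not say how the comparison is done, so the trailing window, the three-sigma tolerance, the synthetic hourly data, and the names `moving_average` and `find_violations` are all assumptions made for illustration.

```python
import math
import statistics


def moving_average(values, window):
    """Trailing moving average; the first points use a shorter window."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out


def find_violations(model, observed, tolerance):
    """Indices where the observed value deviates from the model by more than tolerance."""
    return [i for i, (m, o) in enumerate(zip(model, observed)) if abs(o - m) > tolerance]


# Synthetic hourly metric with a daily cycle, one week long (assumption).
last_week = [100.0 + 20.0 * math.sin(2 * math.pi * h / 24) for h in range(168)]
model = moving_average(last_week, window=3)

# Tolerance: three standard deviations of last week's own residuals (assumption).
residuals = [y - m for y, m in zip(last_week, model)]
tolerance = 3 * statistics.pstdev(residuals)

# The next week repeats the pattern except for one sharp drop.
next_week = list(last_week)
next_week[100] = 20.0

violations = find_violations(model, next_week, tolerance)
```

Only the hour with the artificial drop is reported, because the regular daily swing stays inside the tolerance band.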

Why a Time Series DB matters
• Optimized for handling time series data
• No updates: facts never change
• Data is only appended
• Recent data is queried more often
• InfluxDB is one of the leading time series databases

An Important Note
The first level involves searching for anomalies in metrics and sending out notifications if outliers are found.
This is the information emission level.
The second level involves receiving such information and deciding whether it represents a real problem or outage.
This is the information consumption level.

Overall Architecture
• Python stack
• Built as a set of loosely coupled components
• Each executed in its own Python virtual machine
• Event-driven design

Event Streamer
• Component that holds Workers, fetches data regularly, and tests this data against the statistical models managed by the Workers
• A Worker is the main working unit that holds a set of models together with meta-information
• Workers are fully independent, and every cycle is executed using a thread pool
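
The relationship between the Event Streamer and its Workers might look roughly like the sketch below. Everything here is hypothetical: the class layout, the field names, the event dictionary, and the use of `concurrent.futures.ThreadPoolExecutor` are assumptions chosen to illustrate independent workers run on a thread pool, not the actual Playtech code.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Worker:
    """Hypothetical worker: one metric, its models, and meta-information."""
    metric: str
    fetch: Callable[[], float]             # returns the latest measurement
    models: List[Callable[[float], bool]]  # each returns True when the value is anomalous
    meta: Dict[str, str] = field(default_factory=dict)

    def run_cycle(self):
        """Fetch one value and test it against every model."""
        value = self.fetch()
        violated = [i for i, model in enumerate(self.models) if model(value)]
        return {"metric": self.metric, "value": value,
                "violated_models": violated, **self.meta}


class EventStreamer:
    """Holds workers and runs one independent cycle per worker on a thread pool."""

    def __init__(self, workers, max_threads=4):
        self.workers = workers
        self.max_threads = max_threads

    def run_cycle(self):
        with ThreadPoolExecutor(max_workers=self.max_threads) as pool:
            results = list(pool.map(Worker.run_cycle, self.workers))
        # Only cycles with at least one violated model become events.
        return [event for event in results if event["violated_models"]]


# Illustrative metrics and thresholds (assumptions).
streamer = EventStreamer([
    Worker("logins_per_min", lambda: 12.0, [lambda v: v < 50.0], {"site": "demo"}),
    Worker("bets_per_min", lambda: 900.0, [lambda v: v < 100.0]),
])
events = streamer.run_cycle()
```

Because each worker owns its data source and models, the cycles share no state and can run concurrently without locking.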

Rule Engine
• Consumes the information provided by the Event Streamer
• Rules are built as an abstract syntax tree (AST)
• Around 1,500 matches per second in a single process
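
A rule represented as an abstract syntax tree can be evaluated with a small recursive walker. The slides do not describe the rule language, so this sketch assumes rules are written as Python-style boolean expressions and parsed with the standard `ast` module; the field names `severity`, `site`, and `drop_rate` are hypothetical.

```python
import ast
import operator

_COMPARATORS = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def compile_rule(text):
    """Parse the rule once; the resulting tree can be matched many times."""
    return ast.parse(text, mode="eval").body


def evaluate(node, event):
    """Recursively evaluate a rule tree against one event dictionary."""
    if isinstance(node, ast.BoolOp):
        results = (evaluate(child, event) for child in node.values)
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not evaluate(node.operand, event)
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        compare = _COMPARATORS.get(type(node.ops[0]))
        if compare is None:
            raise ValueError("unsupported comparison in rule")
        return compare(evaluate(node.left, event),
                       evaluate(node.comparators[0], event))
    if isinstance(node, ast.Name):
        return event[node.id]       # look the field up in the event
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported rule element: {ast.dump(node)}")


rule = compile_rule("severity >= 3 and (site == 'demo' or drop_rate > 5)")
hit = evaluate(rule, {"severity": 4, "site": "other", "drop_rate": 7.5})
miss = evaluate(rule, {"severity": 2, "site": "demo", "drop_rate": 0.0})
```

Parsing once and walking the tree per event keeps matching cheap, and restricting the walker to a few node types avoids executing arbitrary code from rule text.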

We also measure dynamics
• We can take into account the speed and acceleration of the degradation of the metrics
• These correspond, respectively, to the severity and the predicted change in the severity of the incident
• Speed is the slope (angular coefficient), or discrete derivative, of a particular metric, calculated for every violation
• The same applies to acceleration, the second-order derivative
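
Speed and acceleration as discrete derivatives can be sketched as follows. The function names, the unit time step, and the sample values are assumptions for illustration only.

```python
def discrete_derivative(values, step=1.0):
    """First difference divided by the sampling step."""
    return [(b - a) / step for a, b in zip(values, values[1:])]


def dynamics(values, step=1.0):
    """Return (speed, acceleration) of a metric as first and second differences."""
    speed = discrete_derivative(values, step)
    acceleration = discrete_derivative(speed, step)
    return speed, acceleration


# Hypothetical metric that degrades faster and faster.
logins = [100.0, 98.0, 94.0, 86.0, 70.0]
speed, acceleration = dynamics(logins)
# speed        -> [-2.0, -4.0, -8.0, -16.0]
# acceleration -> [-2.0, -4.0, -8.0]
```

A negative speed with a negative acceleration, as in this example, suggests that the degradation is not only present but getting worse.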

The model ensemble can be fine-tuned

A report is created for every alert

Alerta: an open-source product for alert aggregation

Q&A
• Thank you very much
• aleks.tavgen@gmail.com
• https://medium.com/@ATavgen/time-series-modelling-a9bf4f467687
• https://medium.com/@ATavgen/never-fail-twice-608147cb49b
