Using Time Series for Full
Observability of a SaaS Platform
Aleksandr Tavgen Playtech, co-founder Timetrix
About me
More than 19 years of
professional experience
FinTech and Data Science
background
From Developer to SRE Engineer
Solved and automated some
problems in Operations on scale
Overall problem
• Zoo of monitoring solutions in large enterprises often
distributed over the world
• M&A transactions or distributed teams make central
managing impossible or ineffective
• For small enterprises or startups the key question is
about finding the best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
Managing a
Zoo
• A lot of independent
teams
• Everyone has some sort
of solution
• It is hard to get overall
picture of operations
• It is hard to orchestrate
and make changes
ZOO! ZOO! ZOO!
Common
Anti-patterns
It is tempting to keep everything
recorded just in case
Amount of metrics in monitoring
grows exponentially
Nobody understands such a huge
bunch of metrics
Engineering complexity grows as
well
Uber case: 9 billion metrics / 1,000+ instances for the monitoring solution
Dashboards problem
• Proliferating amount of metrics leads to unusable
dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is ok to pursue an anti-pattern for approx. 1.5 years
• GitLab Dashboards are a good example
IF YOU NEED 9
BILLION
METRICS, YOU
ARE PROBABLY
WRONG
Actually not
• Dashboards are very useful when
you know where and when to watch
• Our brain can recognize and process
visual patterns more effectively
• But only when you know what you
are looking for and when
Queries
vs.
Dashboards
Querying your data requires more cognitive
effort than a quick look at dashboards
Metrics are a low resolution of your
system’s dynamics
Metrics should not replace logs
It is not necessary to have millions of them
What are
Incidents
• Something that has impact
on operational/business
level
• Incidents are expensive
• Incidents come with
credibility costs
COST OF AN
HOUR OF
DOWNTIME
2017-2018
https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
• Change
• Network Failure
• Bug
• Human Factor
• Hardware Failure
• Unspecified
Causes of outage
Outage in dynamics
Timeline of
Outage
Detection
Investigation
Escalation
Fixing
What is it all about?
• Any reduction of
outage/incident timeline
results in significant positive
financial impact
• It is about credibility as well
• And your DevOps teams
feel less pain and toil on
their way
Focus on KPI metrics
Metrics
• It is almost impossible to operate on
billions of metrics
• In case of normal system behavior there
will always be outliers in real production
data
• Therefore, not all outliers should be
flagged as anomalous incidents
• Etsy Kale project case
Paradigm Shift
• The main paradigm shift comes from the fields of infrastructure and
architecture
• Cloud architectures, microservices, Kubernetes, and immutable
infrastructure have changed the way companies build and operate
systems
• Virtualization, containerization and orchestration frameworks abstract
infra level
• Moving towards abstraction from the underlying hardware and
networking means that we must focus on ensuring that our
applications work as intended in the context of our business
processes.
KPI monitoring
• KPI metrics are related to the core business
operations
• It could be logins, active sessions, any domain-
specific operations
• Strongly seasonal
• Static thresholds can’t help here
What we had
Time series data
Analysis
Trend line
Dispersion change
Moving average with 60 min window
Moving variance with 60 min window
Model + Next week data
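The moving average and moving variance named above can be sketched in a few lines. This is a minimal illustration, not the production model from the talk: it assumes one sample per minute (so a window of 60 samples is 60 minutes), and the function names `moving_stats` and `is_anomalous` as well as the `k` multiplier are hypothetical.

```python
from collections import deque
from statistics import fmean, pvariance


def moving_stats(values, window=60):
    """Return (mean, variance) of the trailing `window` samples for each point.

    Assumes one sample per minute, so window=60 is a 60 min window.
    """
    buf = deque(maxlen=window)  # old samples fall out automatically
    stats = []
    for v in values:
        buf.append(float(v))
        stats.append((fmean(buf), pvariance(buf)))
    return stats


def is_anomalous(value, mean, variance, k=3.0):
    """Flag a value that lies more than k standard deviations from the mean.

    k is an illustrative tuning parameter, not a value from the talk.
    """
    return abs(value - mean) > k * variance ** 0.5
```

Because the band is recomputed from the trailing window, it follows the seasonal shape of a KPI instead of relying on a static threshold.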
Predictive
Alerting
System
Anomalies combined with rules
Rules are dynamic
Overwhelming
results
• Red area – Customer Detection
• Blue area – Own Observation (toil)
• Orange line – Central Grafana Introduced
• Green line – ML based solution in prod
Customer Detection has dropped to
low percentage points
General view
• Finding anomalies on metrics
• Finding regularities on a higher
level
• Combining events from
organization internals
(changes/deployments)
• Stream processing architectures
Why do we need time-series storage?
• We have unpredictable network delays
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How long should this window be in the future?
Why not Kafka and all those classical
streaming frameworks?
• Frameworks like Storm and Flink are oriented toward tuples, not time-ordered
events
• We do not want to process everything
• A lot of events are needed on-demand
• It is ok to lose some signals in favor of performance
• And we still have signals from the past
Why Influx v 2.0
• Flux
• Better isolation
• Central storage for metrics, events,
traces
• Streaming paradigm
Taking a higher picture
• Finding anomalies on a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
Open Tracing
• Tracing is a higher resolution of your
system’s dynamics
• Distributed tracing can show you unknown-
unknowns
• It reduces the Investigation part of the Incident
Timeline
• There is a good OSS Jaeger implementation
• Influx v 2.0 – the supported backend
storage
Jaeger with
Influx v 2.0 as a
Backend storage
• Real prod case
• Every minute approx. 8000
traces
• Performance issue with
limitation on I/O ops
connections
• Bursts of context switches
on the kernel level
Impact on the particular
execution flow
• Db query is quite constant
• Processing time in normal case - 1-3 ms
• After a process context switch - more than 40 ms
Flux
• Multi-source joining
• Same functional composition paradigm
• Easy to test hypotheses
• You can combine metrics, event logs, and traces
• Data transformation based on conditions
Real incident
We need some statistical
models to operate on raw
data
Let’s check logins part
• Let’s check relations between them
• Looks more like a stationary time series
• Easier to model
Random Walk
• Processes have a lot of random
factors
• Random Walk modelling
• X(t) = X(t-1) + Er(t)
• Er(t) = X(t) - X(t-1)
• A stationary time series is very
easy to model
• No complex statistical models are needed
• Just a reservoir with variance
Er(t) = X(t) - X(t-1)
Er(t) = discrete derivative of X
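A minimal sketch of the random-walk idea: difference the series to get Er(t) = X(t) - X(t-1), keep a constant-memory estimate of the mean and variance of those increments, and flag increments that fall far outside it. Reading "reservoir with variance" as a running estimate (Welford's algorithm here) is an assumption, and the names `VarianceReservoir` and `detect`, the `k` multiplier, and the `warmup` length are all hypothetical.

```python
import math


class VarianceReservoir:
    """Running mean and variance in O(1) memory (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0


def detect(series, k=3.0, warmup=5):
    """Return indices t where Er(t) = X(t) - X(t-1) is more than k std away.

    k and warmup are illustrative tuning parameters.
    """
    reservoir = VarianceReservoir()
    flagged = []
    for t in range(1, len(series)):
        er = series[t] - series[t - 1]  # discrete derivative of X
        spread = reservoir.std
        if reservoir.n >= warmup and spread > 0 and abs(er - reservoir.mean) > k * spread:
            flagged.append(t)  # anomalous increments are kept out of the estimate
        else:
            reservoir.update(er)
    return flagged
```

Only three numbers are stored per series, which is what makes this kind of model cheap to run across many KPIs.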
On a larger scale
• Simple to model
• Cheap in-memory reservoir models
• Very fast
Security case
• Failed logins ratio is related to
overall statistical activity
• People make typos
• Simple thresholds do not work
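To illustrate why a ratio works where a simple threshold does not, here is a hedged sketch that compares each interval's failed-login ratio with the ratio observed so far. It is not the Flux pipeline from the talk; the function name `failed_login_alerts`, the `factor` multiplier, and the `min_total` cutoff are hypothetical.

```python
def failed_login_alerts(samples, factor=3.0, min_total=20):
    """Return indices of intervals whose failed-login ratio is unusually high.

    samples: iterable of (failed, total) login counts per interval.
    An interval is flagged when failed/total exceeds `factor` times the
    baseline ratio accumulated from earlier, non-flagged intervals.
    """
    alerts = []
    failed_sum = 0
    total_sum = 0
    for i, (failed, total) in enumerate(samples):
        if total >= min_total and total_sum > 0:
            baseline = failed_sum / total_sum
            if baseline > 0 and failed / total > factor * baseline:
                alerts.append(i)
                continue  # keep suspected attack traffic out of the baseline
        failed_sum += failed
        total_sum += total
    return alerts
```

With samples such as `[(5, 100), (10, 200), (2, 40), (60, 100), (50, 1000)]`, only the fourth interval is flagged. The last interval has 50 failed logins, which a fixed count threshold might trip on, but its ratio matches normal typo-driven activity.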
One Flux transformation pipeline
Real Alerts related to attacks on Login Service
Combining all
together
Adding Traces and
Events can reduce
Investigation part
Can pinpoint the
Root Cause
• It is all about semantics
• Datacenters, sites, services
• Graph topology based on time-series data
Timetrix
• A lot of people from different
companies are involved in it
• So we decided to open-source the core
engine
• Integrations that are specific to
particular company domains can be easily
added
• We plan to launch Q3/Q4 2019
• Core engine is written in Java
• Big kudos to the bonitoo.io team for
the great drivers
Q&A
http://medium.com/@ATavgen/
www.timetrix.io
https://twitter.com/ATavgen
https://habr.com/ru/users/homunculus

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 

Recently uploaded (20)

UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

Using Time Series for Full Observability of a SaaS Platform

  • 1. Using Time Series for Full Observability of a SaaS Platform Aleksandr Tavgen Playtech, co-founder Timetrix
  • 2. About me More than 19 years of professional experience FinTech and Data Science background From Developer to SRE Engineer Solved and automated some problems in Operations at scale
  • 4. Overall problem • Zoo of monitoring solutions in large enterprises, often distributed over the world • M&A transactions or distributed teams make central management impossible or ineffective • For small enterprises or startups the key question is about finding the best solution • A lot of companies have failed this way • A lot of anti-patterns have developed
  • 5. Managing a Zoo • A lot of independent teams • Everyone has some sort of solution • It is hard to get an overall picture of operations • It is hard to orchestrate and make changes
  • 7. Common Anti-patterns It is tempting to keep everything recorded just in case Amount of metrics in monitoring grows exponentially Nobody understands such a huge bunch of metrics Engineering complexity grows as well
  • 8. Uber case – 9 billion metrics / 1000+ instances for the monitoring solution
  • 9. Dashboards problem • Proliferating amount of metrics leads to unusable dashboards • How can one observe 9 billion metrics? • Quite often it looks like spaghetti • It is ok to pursue an anti-pattern for approx. 1.5 years • GitLab Dashboards are a good example
  • 10. IF YOU NEED 9 BILLION METRICS, YOU ARE PROBABLY WRONG
  • 15. Actually not • Dashboards are very useful when you know where and when to watch • Our brain can recognize and process visual patterns more effectively • But only when you know what you are looking for and when
  • 16. Queries vs. Dashboards Querying your data requires more cognitive effort than a quick look at dashboards Metrics are a low resolution of your system’s dynamics Metrics should not replace logs It is not necessary to have millions of them
  • 17. What are Incidents • Something that has impact on operational/business level • Incidents are expensive • Incidents come with credibility costs
  • 18. COST OF AN HOUR OF DOWNTIME 2017-2018 https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
  • 19. Causes of outage • Change • Network Failure • Bug • Human Factor • Hardware Failure • Unspecified
  • 22. What is it all about? • Any reduction of outage/incident timeline results in significant positive financial impact • It is about credibility as well • And your DevOps teams feel less pain and toil on their way
  • 23. Focus on KPI metrics
  • 24. Metrics • It is almost impossible to operate on billions of metrics • In case of normal system behavior there will always be outliers in real production data • Therefore, not all outliers should be flagged as anomalous incidents • Etsy Kale project case
  • 26. Paradigm Shift • The main paradigm shift comes from the fields of infrastructure and architecture • Cloud architectures, microservices, Kubernetes, and immutable infrastructure have changed the way companies build and operate systems • Virtualization, containerization and orchestration frameworks abstract away the infrastructure level • Moving towards abstraction from the underlying hardware and networking means that we must focus on ensuring that our applications work as intended in the context of our business processes.
  • 27. KPI monitoring • KPI metrics are related to the core business operations • It could be logins, active sessions, any domain specific operations • Heavily seasonal • Static thresholds can’t help here
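To illustrate why a static threshold fails on a seasonal KPI, here is a minimal Python sketch, not the method used in the talk. It keeps a separate mean and standard deviation for each position in the season (for example, each hour of the week) and judges a new value only against its own slot. The function names, the sample login counts, the period of 4 slots, and the parameters `k` and `min_std` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def seasonal_baseline(history, period):
    """Group past values by their slot within the season and summarize each slot."""
    buckets = defaultdict(list)
    for i, value in enumerate(history):
        buckets[i % period].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(value, slot, baseline, k=3.0, min_std=1.0):
    """Flag a value that is more than k standard deviations from its slot's mean.

    min_std (an assumed floor) avoids a zero-width band for very stable slots.
    """
    mu, sigma = baseline[slot]
    return abs(value - mu) > k * max(sigma, min_std)


# Hypothetical logins per slot over three seasons of four slots each.
history = [100, 400, 900, 300,
           110, 420, 880, 310,
           90, 380, 920, 290]
baseline = seasonal_baseline(history, period=4)

# The same value is normal in one slot and anomalous in another.
print(is_anomalous(400, 1, baseline))  # slot 1 usually sees about 400
print(is_anomalous(400, 2, baseline))  # slot 2 usually sees about 900
```

A single fixed threshold cannot express both judgments at once, which is the point of the slide.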
  • 46. Moving average with 60 min window
  • 47. Moving variance with 60 min window
  • 48. Model + Next week data
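The moving average and moving variance from the slides above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the production model: it assumes one sample per minute (so a window of 60 samples corresponds to 60 minutes), and the class name `MovingStats`, the helper `out_of_band`, and the band width `k` are hypothetical.

```python
from collections import deque
from math import sqrt


class MovingStats:
    """Moving mean and variance over the last `window` samples."""

    def __init__(self, window=60):
        # deque with maxlen drops the oldest sample automatically
        self.values = deque(maxlen=window)

    def update(self, x):
        self.values.append(x)

    def mean(self):
        return sum(self.values) / len(self.values)

    def variance(self):
        m = self.mean()
        return sum((v - m) ** 2 for v in self.values) / len(self.values)


def out_of_band(x, stats, k=3.0):
    """True if x lies more than k standard deviations from the moving mean."""
    return abs(x - stats.mean()) > k * sqrt(stats.variance())


# Build the model from past data, then compare new data against it.
stats = MovingStats(window=60)
for i in range(100):
    stats.update(99 if i % 2 == 0 else 101)

print(out_of_band(102, stats))
print(out_of_band(110, stats))
```

Recomputing the sums on every call keeps the sketch short; an incremental update would be preferable for large windows.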
  • 56. Overwhelming results • Red area – Customer Detection • Blue area – Own Observation (toil) • Orange line – Central Grafana Introduced • Green line – ML based solution in prod • Customer Detection has dropped to low percentage points
  • 57. General view • Finding anomalies on metrics • Finding regularities on a higher level • Combining events from organization internals (changes/deployments) • Stream processing architectures
  • 58. Why do we need time-series storage? • We have unpredictable network delays • Operating worldwide is a problem • CAP theorem • You can receive signals from the past • But you should look into the future too • How long should this window be in the future?
  • 59. Why not Kafka and all those classical streaming frameworks? • Frameworks like Storm and Flink are oriented toward tuples, not time-ordered events • We do not want to process everything • A lot of events are needed on-demand • It is ok to lose some signals in favor of performance • And we still have signals from the past
  • 60. Why Influx v 2.0 • Flux • Better isolation • Central storage for metrics, events, traces • Streaming paradigm
  • 61. Taking a higher picture • Finding anomalies on a lower level • Tracing • Event logs • Finding regularities between them • Building a topology • We can call it AIOps as well
  • 62. Open Tracing • Tracing is a higher resolution of your system’s dynamics • Distributed tracing can show you unknown-unknowns • It reduces the Investigation part of the Incident Timeline • There is a good OSS Jaeger implementation • Influx v 2.0 – the supported backend storage
  • 63. Jaeger with Influx v 2.0 as a Backend storage • Real prod case • Every minute approx. 8000 traces • Performance issue with limitation on I/O ops connections • Bursts of context switches on the kernel level
  • 64. Impact on the particular execution flow • Db query is quite constant • Processing time in normal case - 1-3 ms • After a process context switch - more than 40 ms
  • 65. Flux • Multi-source joining • Same functional composition paradigm • Easy to test hypotheses • You can combine metrics, event logs, and traces • Data transformation based on conditions
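The idea of joining sources can be shown with a small sketch. It is written in plain Python rather than Flux, so it does not represent Flux syntax or the Influx API. It attaches to each anomaly timestamp the events (for example, deployments) that happened shortly before it. The function names, the `(timestamp, name)` tuple layout, the sample data, and the 600-second lookback are assumptions for the example.

```python
from bisect import bisect_left, bisect_right


def events_before(anomaly_ts, events, lookback):
    """Return events with a timestamp in [anomaly_ts - lookback, anomaly_ts].

    `events` must be a list of (timestamp, name) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in events]
    lo = bisect_left(timestamps, anomaly_ts - lookback)
    hi = bisect_right(timestamps, anomaly_ts)
    return events[lo:hi]


def join_anomalies_with_events(anomalies, events, lookback=600):
    """Map each anomaly timestamp to the events that preceded it."""
    events = sorted(events)
    return {ts: events_before(ts, events, lookback) for ts in anomalies}


# Hypothetical data: timestamps are seconds since an arbitrary start.
events = [(2000, "deploy gateway"), (100, "deploy auth"), (900, "config change")]
anomalies = [1000, 3000]

joined = join_anomalies_with_events(anomalies, events)
print(joined)
```

An anomaly with a change event close before it is a natural first hypothesis to test; an anomaly with no nearby event points the investigation elsewhere.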
  • 66. Real incident We need some statistical models to operate on raw data
  • 68. • Let’s check relations between them • Looks more like a stationary time series • Easier to model
  • 69. Random Walk • Processes have a lot of random factors • Random Walk modelling • X(t) = X(t-1) + Er(t) • Er(t) = X(t) - X(t-1) • Stationary time-series is very easy to model • Do not need statistical models • Just reservoir with variance
  • 70. Er(t) = X(t) - X(t-1) • Er(t) = discrete derivative of X
  • 71. On a larger scale • Simple to model • Memory-cheap reservoir models • Very fast
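The random-walk idea above can be sketched as follows: take the first difference Er(t) = X(t) - X(t-1) and track its variance in constant memory. The slides do not say how the "reservoir with variance" is implemented, so this sketch substitutes Welford's online algorithm as one plausible constant-memory choice. The names `RunningVariance`, `differences`, and `detect`, and the parameters `k` and `warmup`, are assumptions.

```python
from math import sqrt


class RunningVariance:
    """Welford's online algorithm: mean and variance in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0


def differences(series):
    """Er(t) = X(t) - X(t-1) for t = 1 .. len(series) - 1."""
    return [b - a for a, b in zip(series, series[1:])]


def detect(series, k=3.0, warmup=5):
    """Return indices t where Er(t) deviates more than k standard deviations."""
    stats = RunningVariance()
    anomalies = []
    for t, er in enumerate(differences(series), start=1):
        if stats.n >= warmup and abs(er - stats.mean) > k * sqrt(stats.variance()):
            anomalies.append(t)  # anomalous steps are not added to the statistics
        else:
            stats.add(er)
    return anomalies


series = [10, 11, 10, 11, 10, 11, 10, 11, 30, 31]
print(detect(series))
```

After the jump the series continues at a new level, but the differences return to their usual range, so only the step itself is flagged.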
  • 72. Security case • Failed logins ratio is related to overall statistical activity • People make typos • Simple thresholds are not working
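A small sketch of why the ratio matters more than the raw count. It is illustrative only and not the detection logic from the talk: the baseline ratio of 5%, the `tolerance` multiplier, and the `min_total` guard against tiny samples are all assumed values.

```python
def failed_login_ratio(failed, total):
    """Share of login attempts that failed; 0.0 when there was no activity."""
    return failed / total if total else 0.0


def is_suspicious(failed, total, baseline_ratio, tolerance=3.0, min_total=50):
    """Flag a window whose failed-login ratio is far above the usual ratio.

    Windows with fewer than min_total attempts are ignored, since a handful
    of typos would otherwise dominate the ratio.
    """
    if total < min_total:
        return False
    return failed_login_ratio(failed, total) > tolerance * baseline_ratio


baseline_ratio = 0.05  # assumed: about 5% of logins normally fail

# Peak hour: many failures in absolute terms, but the usual share.
print(is_suspicious(failed=500, total=10000, baseline_ratio=baseline_ratio))
# Quiet hour: fewer failures, but half of all attempts fail.
print(is_suspicious(failed=200, total=400, baseline_ratio=baseline_ratio))
```

A fixed threshold of, say, 300 failed logins would alert on the normal peak hour and miss the quiet-hour case.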
  • 74. Real Alerts related to attacks on Login Service
  • 75. Combining all together Adding Traces and Events can reduce the Investigation part Can pinpoint the Root Cause
  • 76. • It is all about semantics • Datacenters, sites, services • Graph topology based on time-series data
  • 77. Timetrix • As a lot of people from different companies are involved in it • We decided to Open Source the core engine • Integrations which are specific to domain companies could be easily added • We plan to launch Q3/Q4 2019 • Core engine is written in Java • Great Kudos to bonitoo.io team for great drivers