Observability – the good, the bad, and the
ugly
Aleksandr Tavgen
(Playtech / Co-founder Timetrix)
About me
More than 19 years of
professional experience
FinTech and Data Science
background
From Developer to SRE Engineer
Solved and automated some
problems in Operations at scale
Incidents and Outages
What are
Incidents
• Something that has impact
on operational/business
level
• Incidents are expensive
• Incidents come with
credibility costs
COST OF AN
HOUR OF
DOWNTIME
2017-2018
https://www.statista.com/statistics/753938/worldwide-enterprise-server-hourly-downtime-cost/
• Change
• Network Failure
• Bug
• Human Factor
• Hardware Failure
• Unspecified
Causes of outage
Outage in dynamics
Timeline of
Outage
Detection
Investigation
Escalation
Fixing
What is it all about?
• Any reduction of
outage/incident timeline
results in significant positive
financial impact
• It is about credibility as well
• Your Ops teams feel less
pain
Overall problems
• Zoo of monitoring solutions
• M&A transactions
• Finding the best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
Managing a
Zoo
• A lot of independent teams
• Everyone has some sort of
solution
• It is hard to get an overall picture
• It is hard to orchestrate and
make changes
ZOO! ZOO! ZOO!
Common Anti-patterns
It is tempting to keep everything
recorded just in case
The number of metrics in monitoring
grows exponentially
Nobody understands such a huge
bunch of metrics
Engineering complexity grows as
well
Uber case – 9 billion metrics / 1000+ instances for the monitoring solution
Dashboards problem
• A proliferating number of metrics leads to unusable
dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is OK to pursue an anti-pattern for approx. 1.5 years
• GitLab Dashboards are a good example
IF YOU NEED 9
BILLION
METRICS, YOU
ARE PROBABLY
WRONG
Actually not
• Dashboards are very useful
• Our brain can recognize and process
visual patterns more effectively
• But only when you know what you
are looking for and when
Queries
vs.
Dashboards
Querying your data requires more cognitive
effort than a quick look at dashboards
Metrics are a low resolution of your
system’s dynamics
Metrics should not replace logs
It is not necessary to have millions of them
Focus on KPI metrics
Metrics
• It is impossible to operate on billions of
metrics
• There will always be outliers in real
production data
• Not all outliers should be flagged as
anomalous incidents
• Etsy Kale project case
Paradigm Shift
• The main paradigm shift comes from the fields of infrastructure and
architecture
• Cloud architectures, microservices, Kubernetes
• Virtualization abstracts away the infrastructure level
• We must focus on Key Performance Indicators
KPI monitoring
• KPI metrics are related to the core business ops
• It could be logins, active sessions, any domain
specific operations
• Heavily seasonal
• Static thresholds can’t help here
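
A minimal sketch of why a static threshold fails on a seasonal KPI. It assumes hourly login counts and a weekly season; the function names, the hour-of-week grouping, and the 3-sigma band are illustrative assumptions, not taken from the talk.

```python
from collections import defaultdict
from statistics import mean, pstdev

HOURS_PER_WEEK = 168  # seasonality period assumed for this sketch


def build_baseline(history):
    """Group hourly values by hour-of-week and return {slot: (mean, std)}.

    `history` is a list of hourly KPI values starting at hour-of-week 0.
    """
    slots = defaultdict(list)
    for i, value in enumerate(history):
        slots[i % HOURS_PER_WEEK].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in slots.items()}


def is_anomalous(baseline, hour_index, value, sigmas=3.0):
    """Flag a value outside mean +/- sigmas * std for its hour-of-week slot."""
    mu, sd = baseline[hour_index % HOURS_PER_WEEK]
    # Guard against a zero std (perfectly constant history for this slot).
    return abs(value - mu) > sigmas * max(sd, 1e-9)
```

With such a baseline, 50 logins per hour can be normal at night and alarming at noon, which a single fixed threshold cannot express.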
What we had
Time series data
Analysis
Trend line
Dispersion change
Moving average with 60 min window
Moving variance with 60 min window
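
The two windowed statistics named above can be sketched as follows. This assumes one sample per minute, so that 60 samples correspond to the 60 min window; the function name is hypothetical and the talk does not say how the statistics were actually computed.

```python
from collections import deque


def rolling_mean_var(values, window=60):
    """Yield (mean, variance) over the last `window` samples for each point.

    Assumes one sample per minute, so window=60 matches a 60 min window.
    Uses population variance over the samples currently in the window.
    """
    buf = deque()
    total = 0.0
    total_sq = 0.0
    for x in values:
        buf.append(x)
        total += x
        total_sq += x * x
        if len(buf) > window:
            old = buf.popleft()
            total -= old
            total_sq -= old * old
        n = len(buf)
        m = total / n
        # Clamp tiny negative values caused by floating-point rounding.
        yield m, max(total_sq / n - m * m, 0.0)
```

A rise in the moving variance while the moving average stays flat is one way to see a dispersion change that a trend line alone would hide.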
Model + Next week data
Predictive
Alerting
System
Anomalies combined with rules
Rules are dynamic
Overwhelming
results
• Red area – Customer Detection
• Blue area – Own Observation (toil)
• Orange line – Central Grafana Introduced
• Green line – ML based solution in prod
Customer Detection has dropped to
low percentage points
General view
• Finding anomalies on metrics
• Finding regularities on a higher
level
• Combining events
• Stream processing architectures
Why do we need time-series storage?
• We have unpredictable network delays
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How long should this window be in the future?
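
One way to picture "signals from the past" is a store that files each point by its own event timestamp and only evaluates a bucket once a waiting period has passed. The class below is a toy in-memory stand-in, not a real time-series database; `BucketedSeries`, `allowed_lateness`, and the default values are assumptions made for illustration.

```python
from collections import defaultdict


class BucketedSeries:
    """Tiny in-memory stand-in for a time-series store (illustrative only).

    Points are filed by their own event timestamp, not by arrival time,
    so a delayed signal still lands in the bucket it belongs to.
    """

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(list)

    def write(self, event_ts, value):
        start = event_ts - event_ts % self.bucket_seconds
        self.buckets[start].append(value)

    def settled(self, now_ts, allowed_lateness=120):
        """Return {bucket_start: sum} for buckets old enough to be evaluated.

        `allowed_lateness` is the assumed time we wait for stragglers.
        """
        cutoff = now_ts - allowed_lateness
        return {
            start: sum(vals)
            for start, vals in sorted(self.buckets.items())
            if start + self.bucket_seconds <= cutoff
        }
```

The `allowed_lateness` parameter is the window the last bullet asks about: a longer wait catches more delayed signals but postpones detection.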
Why not Kafka and all those classical
streaming?
• Frameworks like Storm and Flink are oriented toward tuple processing
• We do not want to process everything
• A lot of events are needed on-demand
• It is ok to lose some signals in favor of performance
• And we still have signals from the past
Taking a higher-level view
• Finding anomalies on a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
Open Tracing
• Tracing is a higher resolution of your
system’s dynamics
• Distributed tracing can show you unknown unknowns
• It shortens the Investigation part of the Incident
Timeline
• There is a good OSS Jaeger implementation
Jaeger with
Influx v2.0 as a
Backend storage
• Real prod case
• 8000 traces per minute
• Performance issue
• Bursts of context switches
on the kernel level
Impact on the particular
execution flow
• DB query time is fairly constant
• Processing time in the normal case: 1-3 ms
• After a process context switch: more than 40 ms
Why Influx v2.0
• Flux
• Better isolation
• Central storage for metrics, events,
traces
• Streaming paradigm
Flux
• Multi-source joining
• Same functional composition paradigm
• Easy to test hypotheses
• You can combine metrics, event logs, and traces
• Data transformation based on conditions
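
Flux itself is not shown here. As a language-neutral illustration of the multi-source joining idea, the Python sketch below attaches event-log entries to metric anomalies that occurred close in time. The function name, the 300-second window, and the sample event descriptions are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def events_near(anomaly_times, events, window=300):
    """Map each anomaly timestamp to events within +/- `window` seconds.

    `events` is a list of (timestamp, description) sorted by timestamp.
    """
    stamps = [ts for ts, _ in events]
    joined = {}
    for t in anomaly_times:
        lo = bisect_left(stamps, t - window)
        hi = bisect_right(stamps, t + window)
        joined[t] = [desc for _, desc in events[lo:hi]]
    return joined
```

For example, `events_near([1000], [(900, "config change")])` pairs the anomaly at t=1000 with the configuration change 100 seconds earlier, which is the kind of hypothesis such a join makes quick to test.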
Real incident
We need some statistical
models to operate on raw
data
Let’s check logins part
• Let’s check relations between them
• Looks more like a stationary time series
• Easier to model
Random Walk
• Processes have a lot of random
factors
• Random Walk modelling
• X(t) = X(t-1) + Er(t)
• Er(t) = X(t) - X(t-1)
• A stationary time series is very
easy to model
• Statistical models are not needed
• Just a reservoir with variance
Er(t) = X(t) - X(t-1)
Er(t) = discrete derivative of (X)
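
A minimal sketch of the differencing idea: take Er(t) = X(t) - X(t-1) and keep a running mean and variance of those increments, then score each new increment against them. Reading "reservoir with variance" as an online variance estimate (Welford's algorithm) is an assumption of this sketch, as are the class and method names.

```python
import math


class IncrementModel:
    """Running mean/variance of Er(t) = X(t) - X(t-1), via Welford's algorithm."""

    def __init__(self):
        self.prev = None
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def score(self, x):
        """Return the z-score of the new increment, then learn from it.

        Returns None until at least two increments have been observed.
        """
        if self.prev is None:
            self.prev = x
            return None
        er = x - self.prev
        self.prev = x
        z = None
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            z = (er - self.mean) / std if std > 0 else None
        # Welford update of the running mean and sum of squared deviations.
        self.n += 1
        delta = er - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (er - self.mean)
        return z
```

The model needs only three numbers per series, which is what makes this kind of approach cheap in memory. Note that this sketch also learns from anomalous increments; a production version would likely exclude or down-weight them.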
On a larger scale
• Simple to model
• Memory-cheap reservoir models
• Very fast
Security case
• Failed logins ratio is related to
overall statistical activity
• People make typos
• Simple thresholds do not work
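
A rough sketch of the point above: because failed logins scale with overall activity, the ratio of failed to total logins is a more useful signal than the raw failed count. The function names, `factor`, `min_total`, and the baseline ratio are illustrative assumptions; the actual pipeline in the talk was written in Flux and is not reproduced here.

```python
def failed_ratio(failed, total):
    """Failed-login ratio for one interval; 0.0 when there was no activity."""
    return failed / total if total else 0.0


def suspicious_intervals(samples, baseline_ratio, factor=3.0, min_total=50):
    """Return indexes of intervals whose ratio exceeds factor * baseline_ratio.

    `samples` is a list of (failed, total) pairs per interval. `factor` and
    `min_total` are illustrative assumptions; min_total skips intervals with
    too little traffic for the ratio to be meaningful.
    """
    return [
        i
        for i, (failed, total) in enumerate(samples)
        if total >= min_total
        and failed_ratio(failed, total) > factor * baseline_ratio
    ]
```

Here 500 failures out of 10000 logins is ordinary typo noise, while 400 out of 1000 stands out, even though a fixed threshold on the failed count would rank them the other way round.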
Security case
One Flux transformation pipeline
Real Alerts related to attacks on Login Service
Combining all
together
Adding Traces and
Events can reduce
Investigation part
Can pinpoint the
Root Cause
• It is all about semantics
• Datacenters, sites, services
• Graph topology based on time-series data
Timetrix
• A lot of people from different
companies are involved in it
• We decided to open-source the core
engine
• Company- or domain-specific
integrations can be added easily
• We plan to launch in Q3/Q4 2019
• Core engine is written in Java
• Kudos to the bonitoo.io team for
great drivers
www.timetrix.io
Twitter - @Atavgen
Medium - @ATavgen
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine


Editor's Notes

  1. Virtualization, containerization, and orchestration frameworks, which are responsible for providing computational resources and handling failures, create an abstraction layer over hardware and networking. Moving towards abstraction from the underlying hardware and networking means that we must focus on ensuring that our applications work as intended in the context of our business processes.