A talk about approaches to observability. Do we need millions of metrics? Anomalies vs. regularities? Can machine learning help us? Plus some abilities of the Flux language by InfluxData.
Observability - The Good, the Bad and the Ugly (XP Days 2019, Kiev, Ukraine)
1. Observability – the good, the bad, and the ugly
Aleksandr Tavgen
(Playtech / Co-founder Timetrix)
2. About me
More than 19 years of professional experience
FinTech and Data Science background
From Developer to SRE Engineer
Solved and automated some problems in Operations at scale
10. What is it all about?
• Any reduction of the outage/incident timeline results in a significant positive financial impact
• It is about credibility as well
• Your Ops teams feel less pain
11. Overall problems
• A zoo of monitoring solutions
• M&A transactions
• Searching for the one best solution
• A lot of companies have failed this way
• A lot of anti-patterns have developed
12. Managing a zoo
• A lot of independent teams
• Everyone has some sort of solution
• It is hard to get the overall picture
• It is hard to orchestrate and make changes
14. Common anti-patterns
• It is tempting to keep everything recorded just in case
• The number of metrics in monitoring grows exponentially
• Nobody understands such a huge pile of metrics
• Engineering complexity grows as well
15. The Uber case – 9 billion metrics, 1,000+ instances for the monitoring solution
16. The dashboards problem
• A proliferating number of metrics leads to unusable dashboards
• How can one observe 9 billion metrics?
• Quite often it looks like spaghetti
• It is common to pursue this anti-pattern for approx. 1.5 years
• GitLab's dashboards are a good example
17. IF YOU NEED 9 BILLION METRICS, YOU ARE PROBABLY WRONG
22. Actually not
• Dashboards are very useful
• Our brains can recognize and process visual patterns more effectively
• But only when you know what you are looking for, and when
23. Queries vs. Dashboards
• Querying your data requires more cognitive effort than a quick look at dashboards
• Metrics are a low-resolution view of your system's dynamics
• Metrics should not replace logs
• It is not necessary to have millions of them (see the query sketch below)
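As a rough illustration of the trade-off, here is a minimal ad-hoc Flux query that drills into one question a fixed dashboard would not answer; the bucket, measurement, and tag names ("telemetry", "http_requests", "service", "status") are hypothetical:

// Count HTTP 500 responses for one service, per minute, over the last hour
from(bucket: "telemetry")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "http_requests" and r.status == "500")
  |> filter(fn: (r) => r.service == "payments")
  |> aggregateWindow(every: 1m, fn: count)

A handful of such targeted queries can replace a wall of pre-rendered panels, at the cost of writing them when you need them.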
25. Metrics
• It is impossible to operate on billions of metrics
• There will always be outliers in real production data
• Not all outliers should be flagged as anomalous incidents
• The Etsy Kale project is a case in point
27. Paradigm shift
• The main paradigm shift comes from the fields of infrastructure and architecture
• Cloud architectures, microservices, Kubernetes
• Virtualization abstracts away the infrastructure level
• We must focus on Key Performance Indicators
29. KPI monitoring
• KPI metrics are related to the core business operations
• It could be logins, active sessions, or any domain-specific operations
• They are heavily seasonal
• Static thresholds can't help here (see the sketch below)
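One simple way to respect seasonality instead of a static threshold is to compare the current window against the same window one week earlier. A minimal Flux sketch, assuming a hypothetical "kpi" bucket with a "logins" measurement:

// Logins per minute over the last hour
current = from(bucket: "kpi")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "logins")
  |> aggregateWindow(every: 1m, fn: sum)

// The same hour exactly one week (168h) earlier, shifted forward to align
baseline = from(bucket: "kpi")
  |> range(start: -169h, stop: -168h)
  |> filter(fn: (r) => r._measurement == "logins")
  |> aggregateWindow(every: 1m, fn: sum)
  |> timeShift(duration: 168h)

// Deviation of the current KPI from its seasonal baseline
join(tables: {now: current, lastWeek: baseline}, on: ["_time"])
  |> map(fn: (r) => ({ _time: r._time, deviation: r._value_now - r._value_lastWeek }))

This is only a sketch; a production variant would smooth the baseline over several weeks rather than trusting a single one.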
58. Overwhelming results
• Red area – customer detection
• Blue area – own observation (toil)
• Orange line – central Grafana introduced
• Green line – ML-based solution in production
Customer detection has dropped to just a few percentage points
59. General view
• Finding anomalies in metrics
• Finding regularities at a higher level
• Combining events
• Stream-processing architectures
60. Why do we need time-series storage?
• Network delays are unpredictable
• Operating worldwide is a problem
• CAP theorem
• You can receive signals from the past
• But you should look into the future too
• How long should this window into the future be?
61. Why not Kafka and all those classical streaming frameworks?
• Frameworks like Storm and Flink are oriented toward tuple processing
• We do not want to process everything
• A lot of events are needed only on demand
• It is OK to lose some signals in favor of performance
• And we still have signals from the past
62. Taking in the bigger picture
• Finding anomalies at a lower level
• Tracing
• Event logs
• Finding regularities between them
• Building a topology
• We can call it AIOps as well
63. OpenTracing
• Tracing is a higher-resolution view of your system's dynamics
• Distributed tracing can show you unknown unknowns
• It reduces the investigation part of the incident timeline
• There is a good OSS implementation: Jaeger
64. Jaeger with InfluxDB v2.0 as the backend storage
• Real production case
• 8,000 traces per minute
• A performance issue
• Bursts of context switches at the kernel level
65. Impact on a particular execution flow
• The DB query time is fairly constant
• Processing time in the normal case: 1–3 ms
• After a process context switch: more than 40 ms (see the query sketch below)
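With spans stored in InfluxDB, affected executions can be pulled out with one filter. This is only a sketch under an assumed schema: a hypothetical "traces" bucket, a "spans" measurement with a "duration_ms" field, and a "service" tag; the real Jaeger-on-InfluxDB layout may differ:

// Spans slower than the ~40 ms context-switch signature, last 15 minutes
from(bucket: "traces")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "spans" and r._field == "duration_ms")
  |> filter(fn: (r) => r.service == "order-processing" and r._value > 40.0)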
66. Why InfluxDB v2.0
• Flux
• Better isolation
• Central storage for metrics, events, and traces
• Streaming paradigm
67. Flux
• Multi-source joining
• The same functional composition paradigm
• Easy to test hypotheses
• You can combine metrics, event logs, and traces (see the join sketch below)
• Data transformation based on conditions
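A minimal sketch of multi-source joining: correlating a resource metric with deployment events stored in the same instance. The "metrics"/"events" buckets and the measurement and field names are hypothetical:

// Mean CPU usage in 5-minute windows
cpu = from(bucket: "metrics")
  |> range(start: -6h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_percent")
  |> aggregateWindow(every: 5m, fn: mean)

// Deployment events counted over the same windows
deploys = from(bucket: "events")
  |> range(start: -6h)
  |> filter(fn: (r) => r._measurement == "deployments")
  |> aggregateWindow(every: 5m, fn: count)

// One table: CPU usage side by side with deployment activity
join(tables: {cpu: cpu, deploys: deploys}, on: ["_time"])

The same pattern extends to trace-derived series, which is what makes a single central store attractive.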
70. • Let's check the relations between them
• Looks more like a stationary time series
• Easier to model
71. Random walk
• Processes have a lot of random factors
• Random walk modelling: X(t) = X(t-1) + Er(t), so Er(t) = X(t) - X(t-1)
• A stationary time series is very easy to model
• No need for statistical models, just a reservoir with variance (see the sketch below)
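In Flux the differencing step is one built-in call. A minimal sketch, assuming a hypothetical "metrics" bucket with an "active_sessions" measurement:

from(bucket: "metrics")
  |> range(start: -1d)
  |> filter(fn: (r) => r._measurement == "active_sessions")
  |> difference()   // Er(t) = X(t) - X(t-1): the (near-)stationary increments
  |> stddev()       // one cheap variance "reservoir" instead of a full statistical model

An increment that falls many standard deviations outside this reservoir is a candidate anomaly, without fitting any model to the raw, non-stationary series.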
73. On a larger scale
• Simple to model
• Cheap in-memory reservoir models
• Very fast
74. Security case
• The failed-logins ratio is related to overall statistical activity
• People make typos
• Simple thresholds do not work here (see the ratio sketch below)
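Instead of a raw threshold on failures, the failure count can be normalized by total activity. A Flux sketch, assuming hypothetical "failed_logins" and "logins" measurements in a "kpi" bucket:

failed = from(bucket: "kpi")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "failed_logins")
  |> aggregateWindow(every: 1m, fn: sum)

total = from(bucket: "kpi")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "logins")
  |> aggregateWindow(every: 1m, fn: sum)

// Failure ratio per minute; idle minutes with zero logins are skipped
join(tables: {failed: failed, total: total}, on: ["_time"])
  |> filter(fn: (r) => r._value_total > 0)
  |> map(fn: (r) => ({ _time: r._time, ratio: float(v: r._value_failed) / float(v: r._value_total) }))

A spike in the ratio that is not explained by a spike in overall activity is a much stronger signal than the absolute failure count.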
78. It is all about semantics
• Datacenters, sites, services
• A graph topology built from time-series data
79. Timetrix
• Since a lot of people from different companies are involved, we decided to open-source the core engine
• Integrations specific to particular domains and companies can be added easily
• We plan to launch in Q3/Q4 2019
• The core engine is written in Java
• Great kudos to the bonitoo.io team for the great drivers
Virtualization, containerization, and orchestration frameworks are responsible for providing computational resources and handling failures, which creates an abstraction layer over hardware and networking.
Moving towards abstraction from the underlying hardware and networking means that we must focus on ensuring that our applications work as intended in the context of our business processes.