2. About Me
• I come from the Ops side
• For me, the most interesting part is making operations boring
• I have mostly worked in financial services
• 4finance
• Swedbank
• KPMG
• T
• I have played basketball at amateur level for ages
3. Complexity of Systems
[Chart: Complexity (vertical axis) plotted against Level of System Distribution (horizontal axis), from simple monolith through modular monolith to complex modular monolith or microservices]
4. Failures of Complex Systems
• Complex systems are
intrinsically hazardous systems
• Catastrophe is always just
around the corner
• All practitioner actions are
gambles.
Paper URL:
https://goo.gl/sTvJw8
6. Monitoring System
Your monitoring system should address two questions:
what’s broken, and why? ...
“What” versus “why” is one of the most important distinctions in
writing good monitoring with maximum signal and minimum
noise.
Source: Site Reliability Engineering book
7. General Monitoring House Rules
• Metrics and Checks that catch real incidents most often should be as
simple, predictable, and reliable as possible.
• Data collection, aggregation, and alerting configuration that is rarely
exercised should be up for removal.
• Signals that are collected, but not exposed in any prebaked dashboard
nor used by any alert, are candidates for removal.
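The last rule can be sketched as a simple set difference. This is only an illustration: the function name and the signal names in the usage note are hypothetical, and a real inventory would have to be exported from whatever monitoring stack is in use.

```python
def removal_candidates(collected, dashboard_signals, alert_signals):
    """Return collected signals that no dashboard and no alert refers to."""
    # A signal counts as "used" if it appears on a dashboard or in an alert.
    used = set(dashboard_signals) | set(alert_signals)
    # Everything collected but never used is a candidate for removal.
    return sorted(set(collected) - used)
```

For example, with collected signals cpu_load, disk_io and gc_pause, where only cpu_load is on a dashboard and only disk_io feeds an alert, the function returns gc_pause as the single candidate.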
10. Blackbox Approach: Checks Monitoring
• Checks, not metrics.
• Simple, yes/no questions.
• First generation of monitoring systems
• Not suitable for understanding what’s actually happening under the hood
without guessing
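A check in this sense can be as small as the sketch below: one yes/no question ("does the port accept a TCP connection?") and nothing more. The function name and the default timeout are assumptions made for illustration, not part of any particular monitoring product.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """Blackbox check: return True if a TCP connection succeeds, else False."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The result says that something is broken, but not why, which is exactly the limitation named in the last bullet above.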
12. Whitebox Approach: Metrics Monitoring
• Addresses known failure vectors.
• Instrumentation has to be developed to expose data to monitoring
• Proper monitoring is a mixture of technical data and business data
• Too much monitoring is noise
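A minimal sketch of such instrumentation, using only the standard library, might look as follows. The Counter class, the metric names and the handle_order function are all hypothetical; the text output merely resembles the exposition formats that common metrics systems scrape. It shows one technical counter and one business counter recorded side by side.

```python
import threading
from collections import defaultdict


class Counter:
    """A monotonically increasing counter, split by one label."""

    def __init__(self, name: str, label: str):
        self.name = name
        self.label = label
        self._values = defaultdict(int)
        self._lock = threading.Lock()

    def inc(self, label_value: str, amount: int = 1) -> None:
        with self._lock:
            self._values[label_value] += amount

    def render(self) -> str:
        """Render one 'name{label="value"} count' line per label value."""
        with self._lock:
            items = sorted(self._values.items())
        return "\n".join(
            f'{self.name}{{{self.label}="{value}"}} {count}'
            for value, count in items
        )


# Technical data: requests by HTTP status (hypothetical metric name).
http_requests = Counter("http_requests_total", "status")
# Business data: orders placed by product (hypothetical metric name).
orders_placed = Counter("orders_placed_total", "product")


def handle_order(product: str, ok: bool) -> None:
    """Hypothetical request handler that records both kinds of signal."""
    http_requests.inc("200" if ok else "500")
    if ok:
        orders_placed.inc(product)
```

A monitoring system would then read the output of render() periodically. Keeping the number of counters small is one way to respect the "too much monitoring is noise" point.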
14. Whitebox Approach: Logging
• Valuable insight: the place where investigations start
• View of Request
• View of System
• Easy to collect data from data points:
• Plain text
• Structured
• Binary
• Log all vs. log actionable data
• Data sets bloat; large-scale data ingestion is tricky
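The difference between plain-text and structured logs, and between logging everything and logging only actionable data, can be sketched with Python's standard logging module. The logger name, the request_id field and the messages are assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (structured log)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        if hasattr(record, "request_id"):
            payload["request_id"] = record.request_id
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes structured lines at INFO and above."""
    logger = logging.getLogger("payments")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # the "actionable" threshold in this sketch
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info("payment accepted", extra={"request_id": "req-123"})
log.debug("cache miss")  # below INFO, so it is never written
```

Raising or lowering the level is the simplest lever against data-set bloat. The request_id field gives the "view of request" when lines from several components are joined on it.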
16. Tracing
• Historically the most challenging part to implement
• Tracing captures the lifetime of requests as they flow through
the various components of a distributed system
• Recent developments in tracing tools give a bright outlook for the future:
• DTrace and the BPF framework
• OpenTracing: http://opentracing.io/
18. Observability
In control theory, observability is a measure of how well internal states
of a system can be inferred from knowledge of its external outputs. The
observability and controllability of a system are mathematical duals.
Source: Wikipedia
21. Privacy and Observability
• From 25 May 2018, the EU General Data Protection Regulation (GDPR) will be fully in force.
• Drastic accountability measures:
• Up to 10m EUR or 2% of global annual turnover for less severe infringements
• Up to 20m EUR or 4% of global annual turnover for the most serious infringements
• Observability tools are silent but huge collectors of personal data
• Include them in your company’s data protection scope, or anonymize the data
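One way to keep personal data out of logs and traces is to mask it before it leaves the service. The sketch below redacts e-mail addresses and IPv4 addresses with regular expressions. The patterns and placeholder strings are assumptions made for illustration: they will not catch every form of personal data, and pattern masking alone does not amount to GDPR compliance.

```python
import re

# Deliberately simple patterns; real data needs a reviewed, broader set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(line: str) -> str:
    """Replace e-mail and IPv4 addresses in a log line with placeholders."""
    line = EMAIL_RE.sub("<email>", line)
    return IPV4_RE.sub("<ip>", line)
```

Applied in a log formatter or at the ingestion pipeline, this keeps the shape of a line intact for debugging while removing the identifying values.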
22. Conclusions
• System reliability makes money (by not losing it)
• In distributed systems, all teams involved in systems
development have to commit to making systems observable
• Choose one tool for each type of task
• Review what data you collect, visualize your data
• Pick your own Observability target based on the requirements
of your service.