This document discusses the importance of complete, high-resolution data for effective observability. It argues that sampling and aggregation can degrade data accuracy and precision, making it harder to detect issues, troubleshoot problems, and understand system behavior. Having full fidelity metrics, traces, and logs at a fine-grained level is necessary to gain useful insights from observability data. Timestamp alignment and network latency also impact how precisely events can be ordered and correlated. Overall, the key to effective observability is having data that is fully representative of the application and infrastructure it comes from.
3. Every company is on a cloud journey
To increase velocity, agility and responsiveness
Stages: Retain & Optimize, Lift & Shift, Re-Factor, Re-Architect / Cloud-Native
• Cloud Managed, e.g. RDS, DynamoDB, SaaS
• Cloud First Architecture
• Tightly Coupled Apps, Slow Deployment Cycles
• Primarily using Cloud IaaS
• More Modular, but Dependent App Components
• Loosely Coupled Microservices and Serverless Functions
[Diagram: Dev and Ops at each stage, running on VMs across private and public cloud]
“By 2025, 85% of organizations will run containers in production, up from less than 30% in 2020” – Gartner, Dec 14, 2020
4. Observability Challenges
● Microservices create complex interactions.
● Failures don't exactly repeat.
● Debugging multitenancy is painful.
● Monitoring alone can no longer save us.
Cynefin Framework
7. Data is the driving factor for observability
• AI/ML-driven Directed Troubleshooting
• Unlimited Cardinality
• Streaming data, incl. Monitoring and Alerting
• Full-fidelity metrics and traces
• Open standards, open source data ingest
8. Dealing with the noise
• Filter the signals
  • Linear, low-pass, band-pass, all-pass
• Sample the signals
  • Random, head-based, tail-based, post-predictive, dimensionality reduction
• Improve the visualization
  • Smart aggregation
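As a rough illustration of the filtering bullet above, a moving average is one of the simplest low-pass filters. The sketch below is not from the slides: the latency values and the `moving_average` helper are hypothetical. It suggests how smoothing reduces noise but also flattens a genuine spike.

```python
def moving_average(values, window):
    """Simple low-pass filter: each output is the mean of the last `window` inputs."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical per-second latency (ms) with a single real spike.
latency_ms = [10, 10, 10, 200, 10, 10, 10, 10]
smoothed = moving_average(latency_ms, window=4)

print("raw peak:     ", max(latency_ms))
print("filtered peak:", max(smoothed))
```

The filtered peak is far lower than the raw one, so an alert threshold tuned on raw values may never fire on the filtered signal.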
9. Problems with observability sampling
• Leads to alert storms caused by the cascading nature of failures
• Leads to needle-in-the-haystack scenarios and long MTTR
• Siloed infrastructure and application insights
• Routinely misses trace data when troubleshooting edge cases and intermittent issues
[Diagram labels: trace sampling; no awareness of service dependencies; simplistic triaging; siloed from infrastructure]
10. Typical trace sampling
Typically observes ~1-5% of transactions
[Diagram: byte-code agent, head-based sampling, microservices]
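To put the ~1-5% figure in perspective, here is a back-of-the-envelope sketch. It assumes each trace is kept independently with the same probability, decided at the start of the trace (head-based) with no knowledge of whether the trace will fail. The function name `capture_probability` and the count of ten faulty traces are illustrative assumptions.

```python
def capture_probability(sample_rate: float, faulty_traces: int) -> float:
    """Chance that at least one of `faulty_traces` independent traces is kept."""
    return 1 - (1 - sample_rate) ** faulty_traces


# Hypothetical intermittent issue that affects only 10 transactions.
for rate in (0.01, 0.05):
    chance = capture_probability(rate, faulty_traces=10)
    print(f"{rate:.0%} sampling: {chance:.1%} chance of keeping any faulty trace")
```

Under these assumptions, 1% sampling keeps at least one of the ten faulty traces less than one time in ten, and 5% sampling still misses all of them more often than not.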
Tail-based sampling misses too…
[Diagram: trace timeline from START TRACE to END TRACE]
12. But wait! My metrics tell me everything
Your infrastructure metrics are usually not sampled
But the metrics derived from your application traces can be
Leading to inaccurate duration results and potentially missed alerts
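A minimal sketch of how sampled traces can skew duration metrics. Everything here is hypothetical: the durations, the alert threshold, and the use of a systematic 1-in-20 pick as a stand-in for a 5% sampler.

```python
def summarize(values):
    """Duration metrics as they might be derived from a set of traces."""
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }


# Hypothetical durations (ms): 99 fast requests and one very slow one.
durations_ms = [100] * 99 + [5000]
# Assumption: keep every 20th trace, standing in for a 5% sampler.
sampled_ms = durations_ms[::20]

full = summarize(durations_ms)
sampled = summarize(sampled_ms)

ALERT_THRESHOLD_MS = 1000  # hypothetical alert rule on maximum duration
full_alert = full["max"] > ALERT_THRESHOLD_MS
sampled_alert = sampled["max"] > ALERT_THRESHOLD_MS

print("full:   ", full, "alert:", full_alert)
print("sampled:", sampled, "alert:", sampled_alert)
```

In this contrived case the slow request is never sampled, so the derived mean and maximum look healthy and the alert does not fire.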
13. TL;DR: Data Completeness
• Your ability to use observability is dependent on your data integrity
• Don’t let the “chosen data” bias your results
• Keep it all. Otherwise you can’t track customer happiness
• Real-time matters
17. Missing the point
10 sec: average = 13.9, 95th percentile = 27.05
First 5 sec: average = 16.4, 95th percentile = 29.2
Second 5 sec: average = 11.4, 95th percentile = 19.4
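The slide's underlying data points are not included in this transcript. The sketch below uses a hypothetical set of ten per-second values, chosen so that a linearly interpolated 95th percentile yields the same figures as the slide. It shows how the same data tells a different story depending on the window it is aggregated over.

```python
from statistics import mean, quantiles


def p95(values):
    # method="inclusive" interpolates linearly between the sorted data points.
    return quantiles(values, n=100, method="inclusive")[94]


# Hypothetical per-second measurements; not the presenter's original data.
per_second = [8, 32, 14, 10, 18, 5, 13, 21, 8, 10]
first_half, second_half = per_second[:5], per_second[5:]

windows = (
    ("10 sec", per_second),
    ("first 5 sec", first_half),
    ("second 5 sec", second_half),
)
for label, window in windows:
    print(f"{label}: average = {mean(window):.1f}, "
          f"95th percentile = {p95(window):.2f}")
```

The 10-second window hides the fact that the first half was noticeably worse than the second.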
18. Data resolution ≠ Reporting resolution
• But both can be problematic
• Always deliver all data points regardless of reporting
• Finer granularity means more potential precision
[Diagram: minute-level vs second-level resolution, highlighting the "area of actuality"]
19. Adding in concepts of native and chart resolution
• In observability
  • Native resolution is our data collection interval
  • Chart resolution is the aggregation points that our graphs use
Hint: we want speedy data collection and sufficient chart resolution
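One way to picture native versus chart resolution is as a rollup step. The sketch below assumes a hypothetical one-second native resolution and an illustrative `rollup` helper. It compares what a chart shows when the same minute of data is aggregated by average or by maximum.

```python
def rollup(points, bucket_size, aggregate):
    """Collapse native-resolution points into chart-resolution buckets."""
    return [
        aggregate(points[i:i + bucket_size])
        for i in range(0, len(points), bucket_size)
    ]


def average(values):
    return sum(values) / len(values)


# Hypothetical minute of 1-second data with a one-second spike at t=42.
native = [20] * 60
native[42] = 500

per_minute_avg = rollup(native, 60, average)
per_minute_max = rollup(native, 60, max)
per_10s_max = rollup(native, 10, max)

print("1-minute average:", per_minute_avg)
print("1-minute max:    ", per_minute_max)
print("10-second max:   ", per_10s_max)
```

The spike survives only if it was collected at native resolution in the first place. The chart then decides how much of it remains visible, which is why the choice of aggregation matters as much as the interval.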
24. Predictive behavior
• Sometimes you want to know what’s coming
• Prediction is only as good as the data precision and accuracy
• Historic versus Sudden Change
• (Trend) Stationary
• Expect false positives (and negatives)
25. TL;DR: Data Preciseness
• Observability is only as useful as your data's precision and accuracy
• Your analysis of the data needs to account for elasticity, ephemerality and skew
• Prediction is a target, but keep in mind the difference between extrapolation and interpolation
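A small sketch of the interpolation versus extrapolation distinction. It assumes a hypothetical metric that grows quadratically and a straight line fitted through two observed samples; none of these values come from the slides.

```python
def actual(t):
    """Hypothetical ground truth: a metric whose growth accelerates."""
    return t ** 2


# Observed samples at t = 0, 2 and 4.
samples = {t: actual(t) for t in (0, 2, 4)}


def linear_estimate(t, t0, t1):
    """Estimate the value at t from the line through samples t0 and t1."""
    slope = (samples[t1] - samples[t0]) / (t1 - t0)
    return samples[t0] + slope * (t - t0)


interpolated = linear_estimate(3, 2, 4)  # inside the observed range
extrapolated = linear_estimate(8, 2, 4)  # well beyond the observed range

print("t=3 estimate:", interpolated, "actual:", actual(3))
print("t=8 estimate:", extrapolated, "actual:", actual(8))
```

The interpolated estimate is off by 1, while the extrapolated one is off by 24. Estimating between known points is bounded by those points; projecting past them depends on the trend continuing to hold.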
26. “The most effective debugging tool is still careful thought, coupled with judiciously placed print statements.”
- Brian Kernighan, Unix for Beginners, 1979
Observability is the new print statement
Closing Thoughts