Short Data Rules for Observability.pdf

Data Rules for Observability
Dave McAllister - NGINX

Every company is on a cloud journey
To increase velocity, agility and responsiveness
[Figure: stages of the cloud journey]
Retain & Optimize → Lift & Shift → Re-Factor → Re-Architect / Cloud-Native
• Tightly Coupled Apps, Slow Deployment Cycles
• Primarily using Cloud IaaS
• More Modular, but Dependent App Components
• Loosely Coupled Microservices, and Serverless Functions
Other labels: Cloud Managed (e.g. RDS, DynamoDB, SaaS); Cloud First Architecture; DEV and OPS at each stage; VM; Private; Public
“By 2025, 85% of organizations will run containers in production, up from less than 30% in 2020” – Gartner, Dec 14, 2020

Observability Challenges
● Microservices create complex interactions.
● Failures don't exactly repeat.
● Debugging multitenancy is painful.
● Monitoring alone can no longer save us.
Cynefin Framework

© 2020 Splunk Inc.
Use all of your data
to avoid blind spots

©2021 F5
The more observable a system,
the quicker we can understand
why it’s acting up and fix it
Observability is a Data
Problem
Metrics: Do I have a problem? (DETECT)
Traces: Where is the problem? (TROUBLESHOOT)
Logs: Why is the problem happening? (ROOT CAUSE)
Observability: Full-Stack Visibility & Context-Rich Insights

Data is the driving factor for
observability
• AI/ML-driven Directed Troubleshooting
• Unlimited Cardinality
• Streaming data, incl. Monitoring and Alerting
• Full-fidelity metrics and traces
• Open standards, open source data ingest

Dealing with the noise
• Filter the signals: Linear, Low-pass, Band-pass, All-pass
• Sample the signals: Random, Head-based, Tail-based, Post-predictive, Dimensionality reduction
• Improve the visualization: Smart aggregation
Problems with observability sampling
• Leads to alert storms caused by the cascading nature of failures
• Leads to needle-in-the-haystack scenarios and long MTTR
• Siloed infrastructure and application insights
• Routinely miss trace data when troubleshooting edge cases and intermittent issues
Trace sampling
No awareness of service dependencies
Simplistic triaging
Siloed from infrastructure

Typical trace sampling
Typically observes ~1-5% of transactions
Tail-based sampling misses too…
[Figure labels: Byte-Code Agent; Head-Based Sampling; Microservices; START TRACE; END TRACE]
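
The difference between the two sampling styles can be sketched as follows. This is a minimal illustration, not how any particular agent works: the function names, the 5% rate, the 500 ms threshold and the trace records are all assumptions.

```python
import zlib


def head_sample(trace_id, rate=0.05):
    """Head-based: decide at trace start, knowing only the trace ID."""
    # A deterministic hash lets every service reach the same decision.
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


def tail_sample(trace, slow_ms=500):
    """Tail-based: decide after the trace ends, using its outcome."""
    return trace["error"] or trace["duration_ms"] >= slow_ms


# Hypothetical completed traces.
traces = [
    {"id": "a1", "duration_ms": 40, "error": False},
    {"id": "b2", "duration_ms": 900, "error": False},
    {"id": "c3", "duration_ms": 35, "error": True},
]

kept_tail = [t["id"] for t in traces if tail_sample(t)]
print(kept_tail)  # ['b2', 'c3']
```

Head-based sampling cannot know in advance which trace will turn out slow or failed. Tail-based sampling keeps only what its rule anticipates, so trace `a1` is dropped even if it later proves relevant.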

But wait! My metrics tell me everything
Your infrastructure metrics are usually not sampled
But metrics derived from your application traces can be
This leads to bad duration results and potentially missed alerts

TL;DR: Data Completeness
• Your ability to use observability is dependent on your data integrity
• Don’t let the “chosen data” bias your results
• Keep it all. Otherwise you can’t track customer happiness
• Real-time matters

Operate at the speed and
resolution of your app and
infrastructure

The resolution and speed of the data directly
impact the insights you gain

Discussing accuracy and precision
• Interchangeable?
• Accuracy means the measurement is correct
• Precision means it is consistent with other measurements
Observability depends on both
But aggregation and analysis can skew this

Missing the point
10 sec: average = 13.9, 95% = 27.05
First 5 sec: average = 16.4, 95% = 29.2
Second 5 sec: average = 11.4, 95% = 19.4
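
The effect of the window size can be reproduced with a few lines. This sketch uses hypothetical one-second samples, not the data behind the slide, so the numbers differ; the nearest-rank percentile is one of several common definitions and is an assumption here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; interpolating definitions differ slightly."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(values):
    """Return (average, 95th percentile) for one window."""
    return sum(values) / len(values), percentile(values, 95)


# Hypothetical one-second samples; not the data behind the slide.
first_5s = [10, 40, 12, 11, 12]
second_5s = [9, 10, 11, 10, 10]

print("10 s window:", summarize(first_5s + second_5s))  # (13.5, 40)
print("first 5 s:  ", summarize(first_5s))              # (17.0, 40)
print("second 5 s: ", summarize(second_5s))             # (10.0, 11)
```

The 10-second average of 13.5 describes neither half well: the first five seconds were noticeably worse and the second five noticeably better.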

Data resolution ≠ Reporting resolution
• But both can be problematic
• Always deliver all data points, regardless of reporting resolution
• Finer granularity means more potential precision
[Chart labels: Minute; Second; Area of actuality]

Adding in concepts of native and chart resolution
• In observability, native resolution is our data collection interval
• Chart resolution is the aggregation points that our graphs use
Hint: we want speedy data collection and sufficient chart resolution
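
A small sketch of the gap between native and chart resolution, assuming one-second collection rolled up into six-second chart points. The `rollup` helper and the sample values are hypothetical.

```python
def rollup(points, bucket_size, aggregate):
    """Collapse native-resolution points into chart-resolution buckets."""
    return [
        aggregate(points[i:i + bucket_size])
        for i in range(0, len(points), bucket_size)
    ]


def average(values):
    return sum(values) / len(values)


# Hypothetical CPU % samples collected once per second (native resolution).
native = [5, 5, 95, 5, 5, 5, 5, 5, 5, 5, 5, 5]

print(rollup(native, 6, average))  # [20.0, 5.0]
print(rollup(native, 6, max))      # [95, 5]
```

Both charts are built from the same native data. The averaged chart shows 20% where a one-second spike reached 95%, so the choice of aggregation decides whether the spike is visible at all.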

Complexity
Drift and Skew
• Ephemeral Behavior
• Cloud-compute Elasticity

Accurate Timestamps?
• Network latencies get lower
• Event frequencies are higher
• Chrony on AWS/GCP
• ~10s to 100s of accuracy
• May not always order events properly
Image: ClockWork.io

Aligned traces?
• Spans may start before their parent span starts
• Spans may start after their parent span ends
• Span durations can be impacted, resulting in a lack of precision
Image: ClockWork.io
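
The first two symptoms above can be checked mechanically. The sketch below assumes spans are plain dictionaries with `id`, `parent_id`, `start` and `end` fields in a shared time unit; this layout and the function name are illustrative, not a real tracing API.

```python
def misaligned_spans(spans):
    """Return (span_id, reason) for children that start outside their parent."""
    by_id = {span["id"]: span for span in spans}
    problems = []
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent is None:
            continue  # root span, or the parent was not collected
        if span["start"] < parent["start"]:
            problems.append((span["id"], "starts before parent starts"))
        elif span["start"] > parent["end"]:
            problems.append((span["id"], "starts after parent ends"))
    return problems


# Hypothetical spans; timestamps are in milliseconds.
spans = [
    {"id": "root", "parent_id": None, "start": 100, "end": 200},
    {"id": "db", "parent_id": "root", "start": 95, "end": 120},
    {"id": "cache", "parent_id": "root", "start": 205, "end": 210},
    {"id": "auth", "parent_id": "root", "start": 110, "end": 130},
]

for span_id, reason in misaligned_spans(spans):
    print(span_id, reason)
# db starts before parent starts
# cache starts after parent ends
```

A flagged span is a hint of clock skew between hosts rather than proof, since asynchronous work can legitimately begin after its parent has finished.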

Predictive behavior
• Sometimes you want to know what’s coming
• Prediction is only as good as the data precision
and accuracy
• Historic versus Sudden Change
• (Trend) Stationary
• Expect false positives (and negatives)

TL;DR: Data Preciseness
• Observability is only as useful as your data's precision and accuracy
• Your consideration of the data needs to account for elasticity, ephemerality and skew
• Prediction is a target, but keep in mind the difference between extrapolation and interpolation

Closing Thoughts
"The most effective debugging tool is still careful thought, coupled with judiciously placed print statements."
- Brian Kernighan, "Unix for Beginners" (1979)
Observability is the new print statement

Thanks for listening
• https://www.linkedin.com/in/davemc
