Demystifying observability as a testing tool
Abby Bangser (she/her)
@a_bangser
“I originally moved into platform engineering to never again have to debate ‘would a user ever do that?’”
Why learn about observability?
Groundwork for many tools & techniques:
● Sustainable on-call
● Chaos engineering
● Testing in production
● Progressive rollouts
So what is observability?
Ok...so...what now?
Grounding the definition in capabilities
* https://lightstep.com/observability/
Observability helps you “understand your entire system
and how it fits together, and then use that information to
discover what specifically you should care about when it’s
most important.”*
Observability is access to telemetry (data) that is both relevant and explorable
Our journey today
✓ Define observability
✓ Techniques that rely on observability
➔ Observability in testing today and the future
➔ Current data structures and pitfalls
➔ Where to focus our investment now
Debugging
Debugging a “false” error is hard work
So we use more descriptive asserts
And we can expect even more
https://docs.honeycomb.io/working-with-your-data/bubbleup/
Exploring
Persona based test charters are speculative
https://cdn.pixabay.com/photo/2012/04/28/17/11/people-43575__340.png
So we extend into data driven testing
SELECT MIN(column_name)
FROM table_name
WHERE condition;
https://stackoverflow.com/a/50507519/2035223
Real time learning from data exploration
Our journey today
✓ Define observability
✓ Techniques that rely on observability
✓ Observability in testing today and the future
➔ Current data structures and pitfalls
➔ Where to focus our investment now
Revisiting our definition of Observability
* https://lightstep.com/observability/
Observability helps you “understand your entire system
and how it fits together, and then use that information to
discover what specifically you should care about when it’s
most important.”*
Observability is access to telemetry (data) that is both relevant and explorable
These are sometimes referred to as the
“3 pillars of observability”
https://images.contentstack.io/v3/assets/bltefdd0b53724fa2ce/bltf85be52d51892228/5c98d45f8e3cc6505f19f678/three-pillars-of-observability-logs-metrics-tracs-apm.png
Quick recap on logs
Logs
Logging starts early with “Hello World”
What happens as we lean on logs?
Log recap
Strengths:
+ Very detailed insights
+ Provides a clear order of operation
Weaknesses:
‐ No built-in relationship to a user’s goals
‐ Relies on a schema, so adding new data can be difficult
‐ Expensive to store
‐ Privacy risks for certain data
Deep dive on metrics
Metrics
Metrics provide a story over time
https://medium.com/@srpillai/deploying-prometheus-in-kubernetes-with-persistent-disk-or-configmap-1f47e1a34a2e
And they get used to generate alerts
How histograms are stored in a time series DB
http_requests_duration_microseconds
Buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
* `le` stands for “less than or equal to”
How histograms get generated in a time series DB
http_requests_duration_microseconds
www.website.com in 0.25 seconds
Buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
* `le` stands for “less than or equal to”
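The bucket mechanics above can be sketched in a few lines of Java. In a Prometheus-style histogram the buckets are cumulative: every bucket whose `le` bound is at or above the observed value is incremented. The bounds below mirror the slides, but the class itself is an illustrative sketch, not the real Prometheus client library.

```java
// Illustrative sketch of a cumulative ("le") histogram, not a real client library.
public class CumulativeHistogram {
    // Upper bounds in microseconds; Long.MAX_VALUE stands in for +inf.
    private final long[] bounds = {1_000, 250_000, 500_000, 1_000_000, 5_000_000, Long.MAX_VALUE};
    private final long[] counts = new long[bounds.length];

    // Every bucket whose bound is >= the observed value is incremented.
    public void observe(long micros) {
        for (int i = 0; i < bounds.length; i++) {
            if (micros <= bounds[i]) {
                counts[i]++;
            }
        }
    }

    public long count(int bucketIndex) {
        return counts[bucketIndex];
    }

    public static void main(String[] args) {
        CumulativeHistogram h = new CumulativeHistogram();
        h.observe(250_000);   // www.website.com in 0.25 seconds
        h.observe(5_000_000); // www.website.com/big_file in 5 seconds
        // le=1k saw neither request, le=250k saw one, and +inf always sees all.
        System.out.println(h.count(0) + " " + h.count(1) + " " + h.count(5)); // prints "0 1 2"
    }
}
```

Because the buckets are cumulative, the +inf bucket always equals the total request count, which is why it can hide outliers when the finite bounds are set too low.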
How histograms get generated in a time series DB
http_requests_duration_microseconds
www.website.com/big_file in 5 seconds
www.website.com in 0.25 seconds
Buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
* `le` stands for “less than or equal to”
What does that +inf bucket really mean?
Buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
Why misjudging bucket size matters
Buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
15 total requests means the 95th percentile request is approx. the 14.25th request
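That 14.25 figure is a rank: with 15 requests, the 95th percentile sits at the 14.25th request, and the reported value is interpolated inside whichever bucket contains that rank. A hedged sketch of the idea behind Prometheus's `histogram_quantile` (the bucket counts below are invented for illustration, and this is not the real implementation):

```java
// Sketch of quantile estimation from cumulative buckets, in the spirit of
// Prometheus's histogram_quantile. Illustrative only.
public class HistogramQuantile {
    // bounds[i] is the le bound; cumCounts[i] is the cumulative count at that bound.
    public static double quantile(double q, long[] bounds, long[] cumCounts) {
        long total = cumCounts[cumCounts.length - 1];
        double rank = q * total; // e.g. 0.95 * 15 = 14.25
        for (int i = 0; i < bounds.length; i++) {
            if (cumCounts[i] >= rank) {
                long lower = (i == 0) ? 0 : bounds[i - 1];
                long prev = (i == 0) ? 0 : cumCounts[i - 1];
                long inBucket = cumCounts[i] - prev;
                // Assume requests spread evenly inside the bucket and interpolate.
                return lower + (bounds[i] - lower) * (rank - prev) / inBucket;
            }
        }
        return bounds[bounds.length - 1];
    }

    public static void main(String[] args) {
        long[] bounds = {1_000, 250_000, 500_000, 1_000_000, 5_000_000}; // microseconds
        long[] cum = {2, 8, 12, 14, 15}; // 15 total requests (illustrative)
        System.out.println(quantile(0.95, bounds, cum)); // prints "2000000.0"
    }
}
```

Here the 14.25th request falls in the 1M–5M bucket, so the p95 estimate is pinned between those two bounds. If every slow request instead landed in +inf, there would be no finite upper bound to interpolate against, which is the misjudged-bucket problem on the slide.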
Creating a more informed distribution
Old buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
New buckets: le=500k, le=1M, le=5M, le=10M, le=30M, le=+inf
What happens to the previous data?
Old buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
New buckets: le=500k, le=1M, le=5M, le=10M, le=30M, le=+inf
But we can start collecting again
Old buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
New buckets: le=500k, le=1M, le=5M, le=10M, le=30M, le=+inf
And see the benefit of right-sized buckets
Old buckets: le=1k, le=250k, le=500k, le=1M, le=5M, le=+inf
New buckets: le=500k, le=1M, le=5M, le=10M, le=30M, le=+inf
Metrics recap
Strengths:
+ Cheap to gather & store
+ High level view over long periods of time
+ Discrete numbers make for easy math
Weaknesses:
‐ Requires additional tools to debug
‐ Aggregated data
‐ Requires pre-determined questions
Quick recap on traces
Traces
(APM)
Tracing answers “where”-based questions
https://monzo.com/blog/we-built-network-isolation-for-1-500-services
Each dot is one of 1,500 services
Each line is one possible network call
Each colour is a different team
Tracing is a call stack for a distributed system
https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941
What services, in what order, and for how long
https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941
Trace recap
Strengths:
+ Full picture of a request
+ Contains duration
Weaknesses:
‐ Expensive to store
‐ Additional request details are optional
Recap on the current “3 pillar” approach
Traces (APM) + Metrics + Logs
Strengths:
+ Can support long term tracking and in the moment debugging
Weaknesses:
‐ Stores data in 3 different ways ($$$)
‐ Requires 3 different query languages
‐ Depends on knowing our questions upfront
The 3 pillars are better suited to Monitoring
What is the difference?
https://lightstep.com/observability/
Our journey today
✓ Define observability
✓ Techniques that rely on observability
✓ Observability in testing and quality today
✓ What observability in testing and quality can be
✓ Current data structures and pitfalls
➔ Where to focus our investment now
So let’s talk the future
Events
In short:
Structured logs scoped around a single request
Even shorter...
Context rich traces
Let’s use an example
How logs and events are created
@PostMapping("flip")
public ResponseEntity flipImage(@RequestParam("image") MultipartFile file,
                                @RequestParam(value = "vertical") Boolean vertical,
                                @RequestParam(value = "horizontal") Boolean horizontal) {
    EVENT.addFields(
        new Pair("content.type", file.getContentType()),
        new Pair("action", "flip"),
        new Pair("flip_vertical", vertical),
        new Pair("flip_horizontal", horizontal));
    LOGGER.info("Receiving {} image to flip.", file.getContentType());
    byte[] flippedImage = imageService.flip(file, vertical, horizontal);
    LOGGER.info("Successfully flipped image id: {}", file.getId());
    EVENT.addField("image_id", file.getId());
    return new ResponseEntity<>(flippedImage, headers, HttpStatus.OK);
}
Comparing the outputs
Multiple logs
A single event
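The "single event" side of that comparison can be sketched as a map of fields that accumulates over the request and is emitted once at the end. This Event class is illustrative only; it is not the library behind the EVENT object in the earlier code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a single wide event: fields accumulate during the
// request, and everything is emitted as one record rather than scattered log lines.
public class Event {
    private final Map<String, Object> fields = new LinkedHashMap<>();

    public Event addField(String key, Object value) {
        fields.put(key, value);
        return this;
    }

    // Emit everything we learned about the request as a single record.
    public String emit() {
        return fields.toString();
    }

    public static void main(String[] args) {
        Event event = new Event()
            .addField("action", "flip")
            .addField("flip_vertical", true)
            .addField("image_id", "7a82dd3a");
        System.out.println(event.emit()); // prints "{action=flip, flip_vertical=true, image_id=7a82dd3a}"
    }
}
```

Because the record is schema-less, adding a new field is one `addField` call, and the full request context travels together instead of being spread across multiple log lines.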
The event data covers all log data
...and the metrics data
...and the trace data
All that AND:
Request-based, allowing request & response data
All that AND:
Schema-less, easily allowing new fields
Why is this powerful?
Explorability with the data in one place
● Key numbers like latency and count
● Key variables like IDs
● Correlation between services
Explorability with the data in one place
Strengths:
+ All visualisations can be derived, including timelines and the big picture view
+ Full context of a user’s request
Weaknesses:
‐ Requires investment from engineers who know the app!
‐ Expensive to store
‐ Privacy risks for certain data
There is work to do away from code too!
➔ Drive the virtuous cycle of deep domain knowledge supporting better data collection.
➔ Encourage the use of observability as a part of feature validation.
➔ Keep asking high value questions and don’t settle until tools & data can answer them!
Thank you!
Let’s keep the conversation going
@a_bangser
