OBSERVABILITY: IT’S NOT JUST AN OPS THING
Instrumentation and monitoring often get lumped under the “ops” umbrella, but it’s just as critical for the software engineer. Answering questions about what your code is doing can be just as important during the development process as it is after you’ve shipped.
Let’s talk about how this plays out in the real world by using Honeycomb as a case study: relying on system visibility to inform planning during the development process, observing new changes during and after release, and (of course) debugging. We’ll discuss specific instrumentation practices that we love and have used successfully to gain visibility into our system’s day-to-day operations, and how those techniques have evolved over time.
3. OBSERVABILITY THE DEV PROCESS
The practice of understanding
the internal state of a system
via knowledge of its external
outputs.
Wikipedia (paraphrased)
5. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
6. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
7. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
BUILD IT
8. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
VERIFY IT
BUILD IT
9. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
VERIFY IT (ON MY MACHINE)
BUILD IT
10. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
…?
VERIFY IT (ON MY MACHINE)
BUILD IT
11. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
BUILD IT
VERIFY IT (ON MY MACHINE)
VERIFY IT (IN PRODUCTION)
12. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
BUILD IT
VERIFY IT (ON MY MACHINE)
VERIFY IT (IN PRODUCTION)
WATCH IT
13. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
BUILD IT
VERIFY IT (ON MY MACHINE)
VERIFY IT (IN PRODUCTION)
WATCH IT
14. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
BUILD IT
VERIFY IT (ON MY MACHINE)
VERIFY IT (IN PRODUCTION)
WATCH IT
15. OBSERVABILITY THE DEV PROCESS
How do my error rates look these days?
Have things gotten slower over time?
16. OBSERVABILITY THE DEV PROCESS
How do my error rates look these days?
Have things gotten slower over time?
What does my service look like from this one annoying
customer’s perspective?
17. OBSERVABILITY THE DEV PROCESS
How do my error rates look these days?
Have things gotten slower over time?
What will the impact be, of this change we’re planning?
What does my service look like from that one huge, annoying
customer’s perspective?
18. OBSERVABILITY THE DEV PROCESS
How do my error rates look these days?
Have things gotten slower over time?
What will the impact be, of this change we’re planning?
What does "normal" look like these days? Does it line
up with what I thought?
What does my service look like from that one huge, annoying
customer’s perspective?
19. OBSERVABILITY THE DEV PROCESS
The ability to answer
questions about your
system, using data
DECIDE IT
BUILD IT
VERIFY IT (ON MY MACHINE)
VERIFY IT (IN PRODUCTION)
WATCH IT
&
21. ▸ How’s our load? Is it spread reasonably evenly across our Kafka
partitions?
▸ Did latency increase in our API server? How does our new
batching endpoint compare to our old RESTy endpoint?
▸ How did those recent memory optimizations affect our query-
serving capacity?
22. ▸ How’s our load? Are high-volume customers spread reasonably
evenly across our Kafka partitions?
▸ Did latency increase in our API server? Which customers were
impacted the most? And who’ll benefit the most from batching?
▸ How did those recent memory optimizations affect our query-
serving capacity for customers with string-heavy payloads?
33. BUILD IT
DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT
🏁
Feature flags and flexible observability tools = manual canarying
… except we can do it for everything
👾
44. ▸ Start at the edge with basic, common attributes (e.g. HTTP)
▸ Business-relevant or infrastructure-specific characteristics (e.g.
customer ID, DB replica set)
CONCEPTUALLY
45. ▸ Start at the edge with basic, common attributes (e.g. HTTP)
▸ Business-relevant or infrastructure-specific characteristics (e.g.
customer ID, DB replica set)
▸ Temporary additional fields for validating hypotheses
CONCEPTUALLY
46. ▸ Start at the edge with basic, common attributes (e.g. HTTP)
▸ Business-relevant or infrastructure-specific characteristics (e.g.
customer ID, DB replica set)
▸ Temporary additional fields for validating hypotheses
▸ Prune stale fields (if necessary)
CONCEPTUALLY
49. ▸ Contextual, structured data
▸ Common set of nouns and consistent naming
▸ Don't be dogmatic; let the use case dictate the ingest pattern
SOME BEST PRACTICES
50. ▸ Contextual, structured data
▸ Common set of nouns and consistent naming
▸ Don't be dogmatic; let the use case dictate the ingest pattern
▸ e.g. instrumenting individual reads while batching writes
SOME BEST PRACTICES
52. first pass:
- server_hostname
- method
- url
- build_id
- remote_addr
- request_id
- status
- x_forwarded_for
- error
- event_time
- team_id
- payload_size
- sample_rate
then we added:
- dropped
- get_schema_dur_ms
- protobuf_encoding_dur_ms
- kafka_write_dur_ms
- request_dur_ms
- json_decoding_dur_ms +others
a couple of days later, we added:
- offset
- kafka_topic
- chosen_partition
after that:
- memory_inuse
- num_goroutines
a week after that:
- warning
- drop_reason
and on and on, adding 2-3 fields
every couple of weeks:
- user_agent
- unknown_columns
- dataset_partitions
- dataset_id
- dataset_name
- api_version
- create_marker_dur_ms
- marker_id
- nil_value_for_columns
- batch
- gzipped
- batch_datapoint_lens
- batch_num_datasets
- batch_process_datapoints_dur_ms
- batch_validate_datasets_dur_ms
- batch_dataset_names
- dataset_columns
- event_columns
53. ▸ Stop writing software based on intuition, start backing it up with
data
▸ Teach observability tools to speak more than "Ops"
▸ ??? (← a.k.a., Ask lots of questions and validate hypotheses)
▸ Profit!
DEVELOPERS, OUR MISSION:
THANKS!@cyen