Influx/Days 2017 San Francisco | Christine Yen

OBSERVABILITY: NOT JUST
AN OPS THING
@cyen
@honeycombio

OBSERVABILITY THE DEV PROCESS
The practice of understanding
the internal state of a system
via knowledge of its external
outputs.
Wikipedia (paraphrased)

Twitter hive mind

The ability to answer
questions about your
system, using data

system, using data
DECIDE IT

system, using data
DECIDE IT
BUILD IT

system, using data
DECIDE IT
VERIFY IT
BUILD IT

system, using data
DECIDE IT
VERIFY IT (ON MY MACHINE)
BUILD IT

system, using data
DECIDE IT
…?
BUILD IT

system, using data
DECIDE IT
BUILD IT
VERIFY IT (IN PRODUCTION)

system, using data
DECIDE IT
BUILD IT
WATCH IT

How do my error rates look these days?
Have things gotten slower over time?

What does my service look like from this one annoying
customer’s perspective?

What will the impact be, of this change we’re planning?
What does my service look like from that one huge, annoying

What will the impact be, of this change we’re planning?
What does "normal" look like these days? Does it line
up with what I thought?
What does my service look like from that one huge, annoying

system, using data
DECIDE IT
BUILD IT
WATCH IT
&

WHY DOES THIS MATTER SO MUCH TO ME?

▸ How’s our load? Is it spread reasonably evenly across our Kafka
partitions?
▸ Did latency increase in our API server? How does our new
batching endpoint compare to our old RESTy endpoint?
▸ How did those recent memory optimizations affect our query-
serving capacity?

▸ How’s our load? Are high-volume customers spread reasonably
evenly across our Kafka partitions?
▸ Did latency increase in our API server? Which customers were
impacted the most? And who’ll beneﬁt the most from batching?
▸ How did those recent memory optimizations affect our query-
serving capacity for customers with string-heavy payloads?

OK. SO WHAT DOES THIS LOOK LIKE?

DECIDE IT
BUILD IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT

BUILD IT
DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT

DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT
BUILD IT

DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT
BUILD IT
Like debug statements
in production data

BUILD IT
DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT
🏁

BUILD IT
DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT
🏁
Feature ﬂags and ﬂexible observability tools = manual canarying
… except we can do it for everything
👾

BUILD IT
DECIDE IT
VERIFY IT
(WFM🤘)
VERIFY IT
(IN PROD)
WATCH IT
🏁
👾

BUILD ID
FEATURE FLAGS
CUSTOMER IDS

BUILD ID
FEATURE FLAGS
CUSTOMER IDS
GRAPHS
ALERTS
CODE
RELEASES

1. HYPOTHESIS
2. INSTRUMENTATION (MAYBE)
3. VALIDATION (OR NOT)
4. ONWARD

CONCEPTUALLY
▸ Start at the edge with basic, common attributes (e.g. HTTP)

▸ Business-relevant or infrastructure-speciﬁc characteristics (e.g.
customer ID, DB replica set)
CONCEPTUALLY

▸ Temporary additional ﬁelds for validating hypotheses
CONCEPTUALLY

▸ Temporary additional ﬁelds for validating hypotheses
▸ Prune stale ﬁelds (if necessary)
CONCEPTUALLY

▸ Contextual, structured data
SOME BEST PRACTICES

▸ Common set of nouns and consistent naming
SOME BEST PRACTICES

▸ Don't be dogmatic; let the use case dictate the ingest pattern
SOME BEST PRACTICES

▸ Don't be dogmatic; let the use case dictate the ingest pattern
▸ e.g. instrumenting individual reads while batching writes
SOME BEST PRACTICES

ﬁrst pass:
- server_hostname
- method
- url
- build_id
- remote_addr
- request_id
- status
- x_forwarded_for
- error
- event_time
- team_id
- payload_size
- sample_rate
then we added:
- dropped
- get_schema_dur_ms
- protobuf_encoding_dur_ms
- kafka_write_dur_ms
- request_dur_ms
- json_decoding_dur_ms +others
a couple of days later, we added:
- offset
- kafka_topic
- chosen_partition
after that:
- memory_inuse
- num_goroutines
a week after that:
- warning
- drop_reason
and on and on, adding 2-3 ﬁelds
every couple of weeks:
- user_agent
- unknown_columns
- dataset_partitions
- dataset_id
- dataset_name
- api_version
- create_marker_dur_ms
- marker_id
- nil_value_for_columns
- batch
- gzipped
- batch_datapoint_lens
- batch_num_datasets
- batch_process_datapoints_dur_ms
- batch_validate_datasets_dur_ms
- batch_dataset_names
- dataset_columns
- event_columns

▸ Stop writing software based on intuition, start backing it up with
data
▸ Teach observability tools to speak more than "Ops"
▸ ??? (← a.k.a., Ask lots of questions and validate hypotheses)
▸ Proﬁt!
DEVELOPERS, OUR MISSION:
THANKS!@cyen

Influx/Days 2017 San Francisco | Christine Yen

Recommended

Recommended

More Related Content

Similar to Influx/Days 2017 San Francisco | Christine Yen

Similar to Influx/Days 2017 San Francisco | Christine Yen (20)

More from InfluxData

More from InfluxData (20)

Recently uploaded

Recently uploaded (20)

Influx/Days 2017 San Francisco | Christine Yen