Golang observability
(in practice)
Eran Levy
@levyeran
https://medium.com/@levyeran
We're hiring!
Agenda
● Cloud native observability
● Logs
● Metrics
● Tracing
● Best practices
● Where do we go next
● Q?
WIIFM? (What's in it for me?)
● Know the available tools for observability
● How to get started?
● Best practices
Microservices might be good for your business...
But understanding what's going on is
another story
(Image: Netflix)
Observability
“Observability”, according to this definition, is a superset
of “monitoring”, providing certain benefits and insights
that “monitoring” tools come a cropper at. - Cindy
Sridharan
“Observability”, on the other hand, aims to provide highly
granular insights into the behavior of systems
along with rich context, perfect for debugging
purposes. - Cindy Sridharan
Understanding the full cycle of a given flow and gaining insights while asking your questions along the way
(Twitter engineering blog)
Like!
Logs Metrics Traces
Let's drill down...
Logs
● Search for a specific pattern in a given time-window or dig into
application-specific logs
● Write logs to stdout/stderr and the k8s cluster will take care of shipping them
to a central logging infrastructure
● Pick the right package for your needs (see the sketch after this list):
○ Built-in “log” package - not structured, not leveled, mostly for dev - std log with a timestamp
○ Logrus - JSON format, structured, leveled, hooks (note: hooks take a lock)
○ uber-go/zap - fast (benchmarks: https://github.com/uber-go/zap/tree/master/benchmarks),
structured, leveled - performance focused - string formatting, reflection and small allocations
are CPU-intensive
○ golang/glog - if performance and volume are highly important, you might consider this one -
didn’t get the chance to use it
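A minimal sketch of structured, leveled JSON logging with logrus and zap (the field names and values are illustrative, not from the demo code):

package main

import (
	"github.com/sirupsen/logrus"
	"go.uber.org/zap"
)

func main() {
	// logrus: JSON-formatted, leveled, structured fields.
	logrus.SetFormatter(&logrus.JSONFormatter{})
	logrus.SetLevel(logrus.InfoLevel)
	logrus.WithFields(logrus.Fields{
		"service": "tokenizer", // illustrative field names
		"user_id": 42,
	}).Info("tokens generated")

	// zap: performance focused - strongly-typed fields avoid reflection.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()
	logger.Info("tokens generated",
		zap.String("service", "tokenizer"),
		zap.Int("user_id", 42),
	)
}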
Demo
Logs - Best Practices
● Logs are expensive! String formatting and interface{} reflection are CPU
intensive (see the sketch after this list)
● Aim for log standardization, i.e. common fields and standard messages - it
should help in prod
● Prefer logging actionable messages and avoid maintaining too many log levels, e.g.
warn
● Don’t manage logging concurrency - the packages already take care of that
● Hooks (e.g. in logrus) - use them wisely (they take mutex locks)
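A small sketch of the cost point, assuming zap (values are illustrative): prefer strongly-typed fields over Sprintf-style formatting, which always pays for formatting and reflection:

package main

import "go.uber.org/zap"

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()
	sugar := logger.Sugar()

	orderID, amount := "o-123", 99.95 // illustrative values

	// More expensive: Sprintf-style formatting goes through reflection and allocations.
	sugar.Infof("processed order %s for %f", orderID, amount)

	// Cheaper: strongly-typed fields avoid fmt formatting and reflection.
	logger.Info("processed order",
		zap.String("order_id", orderID),
		zap.Float64("amount", amount),
	)
}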
Another log aggregation approach - Loki by Grafana
Metrics
● Metrics provide quantitative information about processes running inside the
system, including counters, gauges, and histograms (OpenTelemetry)
● Measure business impact and user experience -
○ Add custom metrics
○ Build dashboards
○ Generate alerts
● “The four golden signals of monitoring are latency, traffic, errors, and
saturation.” (Google SRE)
● Modern metrics are stored in a time-series database - a metric name plus
key/value tags create a multi-dimensional space (see the sketch below)
grpc_io_server_server_latency_count{grpc_server_method="tokenizer.Tokenizer/GetTokens"} 7
(source: sysdig.com)
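To illustrate the time-series model behind that sample, here is a minimal, hedged sketch using the dedicated Prometheus Go client (metric and label names are made up to mirror the sample; the next slide argues for vendor-neutral APIs instead):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestCount becomes one time series per unique label combination.
var requestCount = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tokenizer_requests_total", // illustrative metric name
		Help: "Number of tokenizer requests handled.",
	},
	[]string{"grpc_server_method"},
)

func main() {
	prometheus.MustRegister(requestCount)

	// Each increment updates the series identified by the label value, e.g.
	// tokenizer_requests_total{grpc_server_method="tokenizer.Tokenizer/GetTokens"}
	requestCount.WithLabelValues("tokenizer.Tokenizer/GetTokens").Inc()

	// Expose /metrics for the Prometheus server to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}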
Integrate your metrics backend of choice
● Prefer using vendor neutral APIs such as OpenCensus (soon OpenTelemetry) over
dedicated stats backend clients (e.g. the Prometheus Go SDK) - see the sketch below
● Metrics aren’t sampled - you want to be able to spot percentile latencies, e.g. p99
● Client libraries usually aggregate the collected metrics data in-process and
send it to the backend server (Prometheus, Stackdriver, Honeycomb, others)
● Standardize your KPIs to build meaningful dashboards
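A minimal sketch of the vendor-neutral approach, assuming the OpenCensus Go Prometheus exporter: instrumentation stays the same and only the registered exporter changes when you swap backends (namespace and port are illustrative):

package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
)

func main() {
	// Export the in-process aggregated views to Prometheus.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "tokenizer"})
	if err != nil {
		log.Fatalf("failed to create Prometheus exporter: %v", err)
	}
	view.RegisterExporter(pe)

	// Swapping to Stackdriver, Honeycomb, etc. means registering a different
	// exporter here; the stats/view instrumentation code does not change.

	http.Handle("/metrics", pe) // the exporter also serves the scrape endpoint
	log.Fatal(http.ListenAndServe(":8888", nil))
}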
OpenCensus service approach
OpenTelemetry adopts that approach
● Agent vs Agentless (see the exporter sketch below)
● Collector
● Demo docker-compose
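A minimal sketch of pointing an application at the OpenCensus agent/collector, assuming the ocagent exporter (the address and service name are illustrative):

package main

import (
	"log"

	"contrib.go.opencensus.io/exporter/ocagent"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/trace"
)

func main() {
	// Ship both stats and traces to the OpenCensus agent; the agent decides
	// which backends to export to, so the app stays backend-agnostic.
	exp, err := ocagent.NewExporter(
		ocagent.WithAddress("localhost:55678"), // illustrative agent address
		ocagent.WithInsecure(),
		ocagent.WithServiceName("tokenizer"),
	)
	if err != nil {
		log.Fatalf("failed to create ocagent exporter: %v", err)
	}
	defer exp.Stop()

	trace.RegisterExporter(exp)
	view.RegisterExporter(exp)
}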
Before we move on - OpenCensus terminology (see the sketch after this list)
● Measure - the metric type that we are going to record - e.g. latency in ms
● Measurement - a recorded data point - e.g. 5 ms
● Aggregation - count, sum, distribution
● Exporter - an exporter for the backend of choice
● View - the coupling of an aggregation, a measure and tags
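Tying the terms together, a minimal OpenCensus stats sketch (measure, tag and view names, plus the bucket bounds, are illustrative):

package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// Measure: what we record - request latency in milliseconds.
	latencyMs = stats.Float64("tokenizer/latency", "Request latency", stats.UnitMilliseconds)
	// Tag key: adds a dimension to the recorded data.
	methodKey = tag.MustNewKey("method")
	// View: couples the measure, an aggregation and tags.
	latencyView = &view.View{
		Name:        "tokenizer/latency_distribution",
		Description: "Distribution of request latencies",
		Measure:     latencyMs,
		Aggregation: view.Distribution(5, 10, 25, 50, 100, 250, 500, 1000),
		TagKeys:     []tag.Key{methodKey},
	}
)

func main() {
	if err := view.Register(latencyView); err != nil {
		log.Fatalf("failed to register view: %v", err)
	}

	// Measurement: a single recorded data point, here 5 ms.
	ctx, _ := tag.New(context.Background(), tag.Insert(methodKey, "GetTokens"))
	stats.Record(ctx, latencyMs.M(5))
}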
Demo
Distributed Tracing
● Tracing, aka distributed tracing, provides insight into the full life-cycles, aka traces, of requests to
the system, allowing you to pinpoint failures and performance issues (OpenTelemetry)
● Enables engineers to understand which services participated in a given end-to-end trace
Integrate your tracing system of choice
● Prefer vendor neutral APIs such as OpenTracing/OpenCensus (soon
OpenTelemetry) over a dedicated tracing client
● Trace critical business operations and calls to other services (ServiceA -> DB)
● Context propagation is the key - use “context” to propagate traces where
possible
● Prefer sidecar agents instead of calling backend services directly where
possible (e.g. Zipkin receives requests via its collector)
● The OpenCensus agent is an interesting approach that gives you better
flexibility (e.g. dynamically changing the backend service)
● Large systems can produce a large amount of traces - high traffic and resource
intensive - choose the right sampling strategy (see the sketch after this list)
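A minimal sketch of context propagation and sampling with OpenCensus tracing (span names and the sampling rate are illustrative):

package main

import (
	"context"

	"go.opencensus.io/trace"
)

func main() {
	// Sampling strategy: trace ~1% of requests to keep overhead down under load.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.01)})

	handleRequest(context.Background())
}

func handleRequest(ctx context.Context) {
	// Root span for the business operation.
	ctx, span := trace.StartSpan(ctx, "ServiceA.HandleRequest")
	defer span.End()

	queryDB(ctx)
}

func queryDB(ctx context.Context) {
	// Child span; the parent is picked up from ctx, so the trace stays connected.
	_, span := trace.StartSpan(ctx, "ServiceA.QueryDB")
	defer span.End()

	span.AddAttributes(trace.StringAttribute("db.statement", "SELECT ..."))
}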
Remember the OpenCensus service approach?
Same goes here…
Jaeger agent
Jaeger has something a bit similar, but it is Jaeger-oriented - you can obviously use
that as well, but you won’t get all the benefits that OpenCensus can provide (see the sketch below)
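For completeness, a hedged sketch of the Jaeger-oriented route, assuming the OpenCensus Jaeger exporter sending spans to a local jaeger-agent sidecar (the address and service name are illustrative):

package main

import (
	"log"

	"contrib.go.opencensus.io/exporter/jaeger"
	"go.opencensus.io/trace"
)

func main() {
	je, err := jaeger.NewExporter(jaeger.Options{
		AgentEndpoint: "localhost:6831", // jaeger-agent sidecar (illustrative address)
		Process: jaeger.Process{
			ServiceName: "tokenizer",
		},
	})
	if err != nil {
		log.Fatalf("failed to create Jaeger exporter: %v", err)
	}
	trace.RegisterExporter(je)
}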
Demo
Best practices
- Standardization is key - tags (tracing - e.g. Semantic Conventions), fields
(logs), metrics
- Enable engineers to create alerts based on their metrics easily (e.g. Helm charts)
- Prefer sidecar agents instead of calling backend services directly where
possible (agent vs agentless)
- Prefer vendor neutral APIs and instrumentation packages
- Choose a trace sampling strategy - huge traffic is resource intensive
Where do we go next?
● OpenTelemetry
(source: opentelemetry.io)
● Cloudevents.io
● Evolving architecture - Trace graph
● Use traces to spot problems that affect KPIs
Questions?
