Microservices architectures involve many services distributed over the network, which introduces many more ways to fail. This session covers the available tools that can help you when designing and building such a distributed system in Go.
3. Agenda
● Cloud native observability
● Logs
● Metrics
● Tracing
● Best practices
● Where do we go next
● Q&A
4. WIIFM? (What's in it for me?)
● Know the available tools for observability
● How to get started?
● Best practices
5. Microservices might be good for your business...
But understanding what's going on is another story
(Image: Netflix)
6. Observability
“Observability”, according to this definition, is a superset of “monitoring”, providing certain benefits and insights that “monitoring” tools come a cropper at. - Cindy Sridharan
“Observability”, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes. - Cindy Sridharan
Understanding the full cycle of a given flow and gaining insights while asking your questions along the way
(Twitter engineering blog)
9. Logs
● Search for a specific pattern in a given time window or dig into application-specific logs
● Write logs to stdout/stderr and let the Kubernetes cluster take care of shipping them to a central logging infrastructure
● Pick the right package for your need (a minimal zap sketch follows below):
○ Built-in “log” package - not structured, not leveled, mostly for dev - standard logging with a timestamp
○ Logrus - JSON format, structured, leveled, hooks (note that hooks take a mutex lock)
○ uber-go/zap - fast (benchmarks: https://github.com/uber-go/zap/tree/master/benchmarks), structured, leveled, performance-focused - string formatting, reflection, and small allocations are CPU-intensive, and zap is designed to avoid them
○ golang/glog - if performance and volume are highly important, you might consider this one - I haven’t had the chance to use it
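A minimal sketch of structured, leveled logging with uber-go/zap; the message and field names here are illustrative, not prescribed by the talk:

```go
package main

import (
	"time"

	"go.uber.org/zap"
)

func main() {
	// zap.NewProduction returns a JSON-formatted, leveled logger suited for services.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync() // flush any buffered entries on exit

	// Typed fields avoid fmt-style string formatting and reflection.
	logger.Info("tokens fetched",
		zap.String("service", "tokenizer"), // hypothetical service name
		zap.Int("token_count", 7),
	)
	logger.Warn("upstream slow",
		zap.String("upstream", "db"),
		zap.Duration("latency", 150*time.Millisecond),
	)
}
```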
11. Logs - Best Practices
● Logs are expensive! String formatting and interface{} reflection are CPU-intensive
● Aim for log standardization, e.g. common fields and standard messages - it will help in production (see the sketch below)
● Prefer logging actionable messages and avoid maintaining too many log levels (e.g. warn)
● Don’t manage logging concurrency - the packages already take care of that
● Hooks (e.g. in logrus) - use them wisely (they take mutex locks)
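One way to standardize common fields is to derive an entry that carries them once and reuse it everywhere; this logrus sketch uses hypothetical field names and values, not a convention from the talk:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	log.SetFormatter(&log.JSONFormatter{}) // structured JSON output
	log.SetLevel(log.InfoLevel)

	// Attach the common, standardized fields once; every message logged
	// through `std` shares the same searchable structure in production.
	std := log.WithFields(log.Fields{
		"service": "tokenizer", // hypothetical service name
		"version": "1.4.2",     // hypothetical build version
	})

	// Prefer actionable messages over a noisy spread of levels.
	std.WithField("attempt", 3).Error("failed to reach token store, retrying")
}
```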
13. Metrics
● Metrics provide quantitative information about processes running inside the system, including counters, gauges, and histograms (OpenTelemetry)
● Measure business impact and user experience -
○ Add custom metrics
○ Build dashboards
○ Generate alerts
● “The four golden signals of monitoring are latency, traffic, errors, and saturation.” (Google SRE)
● Modern metrics are stored in a time-series database - a metric name plus key/value tags that create a multi-dimensional space, for example:
grpc_io_server_server_latency_count{grpc_server_method="tokenizer.Tokenizer/GetTokens"} 7
15. Integrate your metrics backend of choice
● Prefer vendor-neutral APIs such as OpenCensus (soon OpenTelemetry) to dedicated stats backend clients (e.g. the Prometheus Go SDK)
● Metrics aren’t sampled - you want to spot percentile latencies, e.g. p99
● Client libraries usually aggregate the collected metrics data in-process and send it to the backend server (Prometheus, Stackdriver, Honeycomb, and others) - a wiring sketch follows below
● Standardize your KPIs to build meaningful dashboards
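A sketch of wiring OpenCensus views to a Prometheus exporter; the namespace and listen address are placeholders, and the import path follows the OpenCensus contrib repo:

```go
package main

import (
	"log"
	"net/http"

	"contrib.go.opencensus.io/exporter/prometheus"
	"go.opencensus.io/stats/view"
)

func main() {
	// The exporter serves the /metrics endpoint that Prometheus scrapes;
	// view data is aggregated in-process by OpenCensus before export.
	pe, err := prometheus.NewExporter(prometheus.Options{Namespace: "demo"})
	if err != nil {
		log.Fatalf("failed to create Prometheus exporter: %v", err)
	}
	view.RegisterExporter(pe)

	http.Handle("/metrics", pe)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```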
17. Before we move on - OpenCensus terminology
● Measure - the metric type that we are going to record, e.g. latency in ms
● Measurement - a recorded data point, e.g. 5 ms
● Aggregation - count, sum, distribution
● Exporter - an exporter for the backend of your choice
● View - the coupling of an aggregation, a measure, and tags
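Roughly how that terminology maps onto the OpenCensus Go API; the names, bucket bounds, and tag values below are illustrative:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Measure: what we record - request latency in milliseconds.
var latencyMs = stats.Float64("demo/latency", "request latency", "ms")

// Tag key: gives the time series its extra dimension(s).
var keyMethod = tag.MustNewKey("method")

// View: couples the measure with an aggregation and tag keys.
var latencyView = &view.View{
	Name:        "demo/latency_distribution",
	Measure:     latencyMs,
	Description: "distribution of request latencies",
	Aggregation: view.Distribution(5, 10, 25, 50, 100, 250, 500, 1000),
	TagKeys:     []tag.Key{keyMethod},
}

func main() {
	if err := view.Register(latencyView); err != nil {
		log.Fatalf("failed to register view: %v", err)
	}

	// Measurement: one recorded data point, tagged with the method name.
	ctx, _ := tag.New(context.Background(), tag.Insert(keyMethod, "GetTokens"))
	start := time.Now()
	// ... handle the request ...
	stats.Record(ctx, latencyMs.M(float64(time.Since(start).Milliseconds())))
}
```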
19. Distributed Tracing
● Tracing, aka distributed tracing, provides insight into the full life cycles, aka traces, of requests to the system, allowing you to pinpoint failures and performance issues (OpenTelemetry)
● Enables engineers to understand which services participated in a given end-to-end trace
22. Integrate your tracing system of choice
● Prefer vendor-neutral APIs such as OpenTracing/OpenCensus (soon OpenTelemetry) to a dedicated tracing client
● Trace critical business operations and calls to other services (ServiceA -> DB)
● Context propagation is the key - use the “context” package to propagate traces where possible (see the sketch below)
● Prefer sidecar agents instead of calling backend services directly where possible (e.g. Zipkin receives requests via its collector)
● The OpenCensus agent is an interesting approach that gives you better flexibility (e.g. dynamically changing the backend service)
● Large systems can produce a large volume of traces - heavy traffic makes tracing resource-intensive - choose the right sampling strategy
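A minimal sketch of creating spans and propagating them via context with the OpenCensus trace API; the span names, attribute, and sleep are made up for illustration:

```go
package main

import (
	"context"
	"time"

	"go.opencensus.io/trace"
)

// getTokens represents a critical business operation worth tracing.
func getTokens(ctx context.Context) {
	// The child span links to the caller's span carried inside ctx.
	ctx, span := trace.StartSpan(ctx, "tokenizer.GetTokens")
	defer span.End()

	queryDB(ctx) // propagate the context into downstream calls (ServiceA -> DB)
}

func queryDB(ctx context.Context) {
	_, span := trace.StartSpan(ctx, "db.Query")
	defer span.End()

	span.AddAttributes(trace.StringAttribute("db.table", "tokens")) // illustrative tag
	time.Sleep(10 * time.Millisecond)                               // stand-in for real work
}

func main() {
	ctx, root := trace.StartSpan(context.Background(), "handle-request")
	defer root.End()
	getTokens(ctx)
}
```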
24. Jaeger agent
Jaeger has something quite similar, but it is Jaeger-oriented - you can obviously use it as well, but you won’t get all the benefits that the OpenCensus agent can provide
26. Best practices
- Standardization is key - tags (tracing, e.g. the Semantic Conventions), fields (logs), metrics
- Enable engineers to create alerts based on their metrics easily (e.g. Helm charts)
- Prefer sidecar agents instead of calling backend services directly where possible (agent vs. agentless)
- Prefer vendor-neutral APIs and instrumentation packages
- Choose a tracer sampling strategy - tracing everything under heavy traffic is resource-intensive (see the sketch below)
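Choosing a sampling strategy with OpenCensus might look like this sketch; the 1% rate is an arbitrary example, not a recommendation from the talk:

```go
package main

import "go.opencensus.io/trace"

func main() {
	// Under heavy traffic, tracing every request is resource-intensive.
	// A probability sampler keeps roughly 1% of traces; spans with a
	// sampled parent are also kept, so traces stay complete end to end.
	trace.ApplyConfig(trace.Config{
		DefaultSampler: trace.ProbabilitySampler(0.01),
	})

	// For local debugging you might switch to trace.AlwaysSample().
}
```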
27. Where do we go next?
● OpenTelemetry
(source: opentelemetry.io)
28. Where do we go next?
● CloudEvents.io
● Evolving architecture - trace graph
● Use traces to spot problems that affect KPIs