Observability and its application

Observability
Huynh Quang Thao
Trusting Social

What is distributed tracing?
- Tracing is a specialized use of logging that records information about a program's execution.
- Distributed tracing follows a single request as it crosses service boundaries.

What is a metric?
- A metric is a numeric measurement of system behavior, recorded over time.
- Example metric: average time to query the products table.

Tracing
[Diagram: Microservice 1 and Microservice 2 each send traces through an exporter to the backend.]

Metric
[Diagram: metrics sent to the backend.]

Why use OpenCensus / OpenTracing
- Standardized format across backends (Jaeger, Zipkin, …).
- Instrumentation is abstracted away from the application logic.
    // instrumentation code does not depend on the backend
    _, span := trace.StartSpan(r.Context(), "child")
    defer span.End()
    span.Annotate([]trace.Attribute{trace.StringAttribute("key", "value")}, "querying")
    span.AddAttributes(trace.StringAttribute("hello", "world"))
    // the backend is chosen once, when the exporter is registered
    je, err := jaeger.NewExporter(jaeger.Options{
        AgentEndpoint:     agentEndpoint,
        CollectorEndpoint: collectorEndpoint,
        ServiceName:       service,
    })
    trace.RegisterExporter(je)

Export directly to the backend
[Diagram: services exporting directly to the backend.]

OpenCensus local z-pages

Drawbacks of exporting directly to the backend:
- Every microservice is coupled to the backend.
- Changing the backend means updating code in every service.
- Changing some configuration also means updating code in the service.
- Exporters must be written for every supported language (e.g. Jaeger needs one each for Golang, Java, Python, …).
- Some backends, such as Prometheus, require managing ports.

OpenCensus Service
- Decouples services from the tracing/metric backends.
- The OpenCensus collector supports intelligent sampling (tail-based approach).
- Data can be preprocessed (annotating spans, updating tags, …) before it is forwarded to the backends.
- Documentation is still sparse, but the Jaeger documentation is a useful reference for a similar deployment.

Trace
- A trace is a tree of spans.
- Every request sent from the client generates a TraceID.
- A trace shows the path of the work through the system.
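
The "tree of spans" idea can be illustrated with a small standard-library sketch. The `Span` struct, `buildTree`, and `render` below are hypothetical names for illustration only and are not OpenCensus types; the sketch assumes each span records the ID of its parent.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a hypothetical, simplified span record (not the OpenCensus type).
type Span struct {
	TraceID  string // shared by every span of one request
	SpanID   string
	ParentID string // empty for the root span
	Name     string
}

// buildTree groups spans by parent ID so the trace can be walked as a tree.
func buildTree(spans []Span) map[string][]Span {
	children := make(map[string][]Span)
	for _, s := range spans {
		children[s.ParentID] = append(children[s.ParentID], s)
	}
	return children
}

// render returns an indented outline of the spans below parentID.
func render(children map[string][]Span, parentID string, depth int) string {
	var b strings.Builder
	for _, s := range children[parentID] {
		b.WriteString(strings.Repeat("  ", depth) + s.Name + "\n")
		b.WriteString(render(children, s.SpanID, depth+1))
	}
	return b.String()
}

func main() {
	spans := []Span{
		{TraceID: "t1", SpanID: "a", Name: "GET /first"},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "redis GET"},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Name: "gorm SELECT"},
	}
	fmt.Print(render(buildTree(spans), "", 0))
}
```

The outline printed by `main` shows the root span with its two children indented beneath it, which is the "path of the work" a tracing backend visualizes.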

Span
- A span represents a single operation in a trace.
- A span can represent an HTTP request, an RPC call, or a database query.
- The user defines the code path a span covers: where it starts and where it ends.

Span
    doSomeWork() // sleep 3s
    ctx, span := trace.StartSpan(r.Context(), "parent span")
    defer span.End()
    doSomeWork()
    // start the child from the parent's context so the two spans are linked
    _, childrenSpan := trace.StartSpan(ctx, "children span")
    defer childrenSpan.End()
    doSomeWork()

Tag
- A tag is a key-value pair of data associated with a trace.
- Tags are helpful for reporting, searching, filtering, …

Tag
_, childrenSpan := trace.StartSpan(r.Context(), "children span")
defer childrenSpan.End()
childrenSpan.AddAttributes(trace.StringAttribute("purpose", "test"))

Trace Sampling
There are 4 samplers:
- Always
- Never
- Probabilistic
- Rate limiting
Recommendations:
- Use Probabilistic or Rate limiting in general.
- Use Never for requests that should not be sampled.
Global configuration:
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.7)})
Via span:
    _, span := trace.StartSpan(r.Context(), "child", func(options *trace.StartOptions) {
        options.Sampler = trace.AlwaysSample()
    })
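
To make the Probabilistic sampler less abstract, here is a minimal standard-library sketch of one way such a sampler can work: the decision is derived from the trace ID, so every service that sees the same trace ID reaches the same decision. `probabilitySampler` is a hypothetical helper written for this example; it is not the OpenCensus implementation, although it is loosely modelled on the same idea.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// probabilitySampler returns a hypothetical sampler that keeps roughly
// `fraction` of all traces, based only on the trace ID.
func probabilitySampler(fraction float64) func(traceID [16]byte) bool {
	if fraction <= 0 {
		return func([16]byte) bool { return false }
	}
	if fraction >= 1 {
		return func([16]byte) bool { return true }
	}
	upperBound := uint64(fraction * (1 << 63))
	return func(traceID [16]byte) bool {
		// Read the first 8 bytes as a 63-bit number and compare to the bound.
		x := binary.BigEndian.Uint64(traceID[0:8]) >> 1
		return x < upperBound
	}
}

func main() {
	sample := probabilitySampler(0.7)
	var low, high [16]byte // low stays all zeros
	for i := range high {
		high[i] = 0xFF
	}
	fmt.Println(sample(low), sample(high)) // prints: true false
}
```

Because the decision is a pure function of the trace ID, no coordination between services is needed to keep a trace either fully sampled or fully dropped.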

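OpenCensus sample rules
OpenCensus uses head-based sampling with the following rules:
1. If the span is a root Span:
• If a "span-scoped" Sampler is provided, use it to determine the sampling decision.
• Else use the global default Sampler to determine the sampling decision.
2. If the span is a child of a remote Span:
• If a "span-scoped" Sampler is provided, use it to determine the sampling decision.
• Else use the global default Sampler to determine the sampling decision.
3. If the span is a child of a local Span:
• If a "span-scoped" Sampler is provided, use it to determine the sampling decision.
• Else keep the sampling decision from the parent.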
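
The head-based decision can be sketched as a single function. This is an illustration under assumptions, not the library's code: `Parent`, `Sampler`, and `decide` are hypothetical names, and the behavior for children of remote spans (span-scoped sampler if given, else the global default) follows my reading of the OpenCensus sampling specification referenced in this deck.

```go
package main

import "fmt"

// Sampler is a hypothetical sampling function; nil means "not provided".
type Sampler func() bool

func always() bool { return true }
func never() bool  { return false }

// Parent describes the parent span, if any (hypothetical type).
type Parent struct {
	Exists  bool // false for a root span
	Remote  bool // true when the parent lives in another process
	Sampled bool // the parent's sampling decision
}

// decide applies the head-based rules at span creation time.
func decide(parent Parent, spanSampler, defaultSampler Sampler) bool {
	// Child of a local span without a span-scoped sampler: keep the parent's decision.
	if parent.Exists && !parent.Remote && spanSampler == nil {
		return parent.Sampled
	}
	// Otherwise prefer the span-scoped sampler, then the global default.
	if spanSampler != nil {
		return spanSampler()
	}
	return defaultSampler()
}

func main() {
	fmt.Println(decide(Parent{}, nil, always))                             // root span, default sampler
	fmt.Println(decide(Parent{Exists: true, Sampled: false}, nil, always)) // local child inherits
	fmt.Println(decide(Parent{Exists: true, Remote: true}, nil, never))    // remote child, default sampler
}
```

The decision is made once, when the span starts, which is why head-based sampling cannot take the eventual outcome of the request into account.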

OpenCensus sample rules
Disadvantages of head-based sampling:
- Some useful data might be lost.
- This can be fixed by using the tail-based approach on the OpenCensus collector.
References:
- https://github.com/census-instrumentation/opencensus-specs/blob/master/trace/Sampling.md
- https://sflanders.net/2019/04/17/intelligent-sampling-with-opencensus/

Measure
- A measure represents a metric type to be recorded.
- For example, request latency in µs and request size in KB.
- A measure has 3 fields: Name, Description, Unit.
- Measures support 2 types: float and int.
    GormQueryCount = stats.Int64( // type: integer
        GormQueryCountName,          // name
        "Number of queries started", // description
        stats.UnitDimensionless,     // unit
    )

Measurement
- A measurement is a data point produced by recording a quantity with a measure.
- A measurement is just a raw statistic.
measurement := GormQueryCount.M(1)
// M creates a new int64 measurement.
// Use Record to record measurements.
func (m *Int64Measure) M(v int64) Measurement {
return Measurement{
m: m,
desc: m.desc,
v: float64(v),
}
}
stats.Record(wrappedCtx, GormQueryCount.M(1))

View
- A view couples an Aggregation with a Measure and, optionally, Tags.
- Supported aggregation functions: Count / Distribution / Sum / LastValue.
- Multiple views can use the same measure, but only with different aggregations.
- The tags are used to group and filter the collected metrics later on.
GormQueryCountView = &view.View{
Name: GormQueryCountName,
Description: "Count of database queries based on Table and Operator",
TagKeys: []tag.Key{GormOperatorTag, GormTableTag},
Measure: GormQueryCount,
Aggregation: view.Count(),
}
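
As a rough picture of what a view does with raw measurements, the sketch below groups data points by their tag values and applies Count, Sum, and LastValue aggregations. It uses only the standard library; `Measurement`, `Row`, and `aggregate` are hypothetical names for this example and do not mirror the OpenCensus types. Distribution is left out to keep the sketch short.

```go
package main

import "fmt"

// Measurement is a hypothetical raw data point together with its tag values.
type Measurement struct {
	Table    string
	Operator string
	Value    float64
}

// Row holds the aggregated values for one combination of tag values.
type Row struct {
	Count     int64
	Sum       float64
	LastValue float64
}

// aggregate groups measurements by (table, operator) and aggregates each group.
func aggregate(ms []Measurement) map[string]Row {
	rows := make(map[string]Row)
	for _, m := range ms {
		key := m.Table + "/" + m.Operator
		r := rows[key]
		r.Count++
		r.Sum += m.Value
		r.LastValue = m.Value
		rows[key] = r
	}
	return rows
}

func main() {
	rows := aggregate([]Measurement{
		{Table: "products", Operator: "query", Value: 1},
		{Table: "products", Operator: "query", Value: 2},
		{Table: "products", Operator: "create", Value: 1},
	})
	fmt.Println(rows["products/query"].Count, rows["products/query"].Sum)
}
```

Each distinct tag combination becomes its own row, which is what later allows grouping and filtering in the metrics backend.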

Metric Sampling
Stats are NOT sampled, so that uncommon cases can still be represented. Stats are therefore ALWAYS recorded unless dropped.

Context Propagation: B3 Standard
Header Data:
X-B3-Sampled:[1]
X-B3-Spanid:[dacdb2208f874447]
X-B3-Traceid:[9ca4a513af5f299a856dec51336a051b]
var requestOption = comm.RequestOption{
Transport: &ochttp.Transport{
Propagation: &b3.HTTPFormat{},
Base: &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
},
}
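
For intuition about what the B3 propagation format does on the wire, here is a standard-library sketch that writes and reads the three headers shown above. `SpanContext`, `inject`, `extract`, and `roundTrip` are hypothetical helpers for this example; the real work in the slide is done by `b3.HTTPFormat`. The sketch only checks ID lengths and does not validate hex digits.

```go
package main

import (
	"fmt"
	"net/http"
)

// SpanContext is a hypothetical, simplified propagation payload.
type SpanContext struct {
	TraceID string // 16 or 32 hex characters
	SpanID  string // 16 hex characters
	Sampled bool
}

// inject writes the span context into outgoing request headers.
func inject(sc SpanContext, h http.Header) {
	h.Set("X-B3-TraceId", sc.TraceID)
	h.Set("X-B3-SpanId", sc.SpanID)
	if sc.Sampled {
		h.Set("X-B3-Sampled", "1")
	} else {
		h.Set("X-B3-Sampled", "0")
	}
}

// extract reads the span context from incoming request headers.
func extract(h http.Header) (SpanContext, bool) {
	traceID := h.Get("X-B3-TraceId")
	spanID := h.Get("X-B3-SpanId")
	if (len(traceID) != 16 && len(traceID) != 32) || len(spanID) != 16 {
		return SpanContext{}, false
	}
	return SpanContext{TraceID: traceID, SpanID: spanID, Sampled: h.Get("X-B3-Sampled") == "1"}, true
}

// roundTrip injects a context into fresh headers and extracts it again.
func roundTrip(sc SpanContext) (SpanContext, bool) {
	h := make(http.Header)
	inject(sc, h)
	return extract(h)
}

func main() {
	sc := SpanContext{
		TraceID: "9ca4a513af5f299a856dec51336a051b",
		SpanID:  "dacdb2208f874447",
		Sampled: true,
	}
	got, ok := roundTrip(sc)
	fmt.Println(got == sc, ok)
}
```

Go's `http.Header` canonicalizes header names, which is why the header dump above shows `X-B3-Traceid` rather than `X-B3-TraceId`.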

Context Propagation: W3C Trace Context Standard
Header Data:
Traceparent:[00-a9f4dc05b7a78f6f2f717d7396d9450f-187065dac4cd685c-01]
var requestOption = comm.RequestOption{
    Transport: &ochttp.Transport{
        Propagation: &tracecontext.HTTPFormat{},
    },
}
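
The `Traceparent` header packs four dash-separated fields: version, trace ID, parent span ID, and flags (the lowest flag bit means "sampled"). The sketch below parses that shape with the standard library. `TraceParent` and `parseTraceParent` are hypothetical names, and the parser only handles the four-field layout shown in the header data above.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent is a hypothetical parsed form of the traceparent header.
type TraceParent struct {
	Version  string
	TraceID  string
	ParentID string
	Sampled  bool
}

// parseTraceParent splits and validates a traceparent header value.
func parseTraceParent(v string) (TraceParent, error) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	if len(parts[0]) != 2 || len(parts[1]) != 32 || len(parts[2]) != 16 || len(parts[3]) != 2 {
		return TraceParent{}, errors.New("traceparent: unexpected field length")
	}
	for _, p := range parts {
		if _, err := hex.DecodeString(p); err != nil {
			return TraceParent{}, fmt.Errorf("traceparent: %w", err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return TraceParent{
		Version:  parts[0],
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags[0]&1 == 1,
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-a9f4dc05b7a78f6f2f717d7396d9450f-187065dac4cd685c-01")
	fmt.Println(tp.TraceID, tp.ParentID, tp.Sampled, err)
}
```

Compared with B3, the same information travels in one header instead of three.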

1. HTTP Handler
mux := http.NewServeMux()
mux.HandleFunc("/first", firstAPI)
// wrap handler inside OpenCensus handler for tracing request
och := &ochttp.Handler{
Handler: mux,
}
// start
if err := http.ListenAndServe(address, och); err != nil {
panic(err)
}

1. HTTP Handler
Wrap a normal http.Handler with ochttp.Handler:
// vite/tracing
// WrapHandlerWithTracing wraps handler inside OpenCensus handler for tracing
func WrapHandlerWithTracing(handler http.Handler,
    option OptionTracing) (http.Handler, error) {
    // processing option here
    // ...
    handler = &ochttp.Handler{
        Propagation:      propagationFormat,
        IsPublicEndpoint: option.IsPublicEndpoint,
        StartOptions:     startOptions,
        Handler:          handler,
    }
    return handler, nil
}
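
The wrapping pattern itself is plain Go middleware. The standard-library sketch below shows the general idea: a wrapper times each request and captures the status code before handing a record to a reporting callback. `record`, `statusWriter`, `wrap`, and `serveOnce` are hypothetical names for this illustration; `ochttp.Handler` does considerably more (span creation, propagation, stats).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// record is a hypothetical per-request observation.
type record struct {
	Path    string
	Status  int
	Elapsed time.Duration
}

// statusWriter remembers the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// wrap decorates next so that every request is timed and reported.
func wrap(next http.Handler, report func(record)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, r)
		report(record{Path: r.URL.Path, Status: sw.status, Elapsed: time.Since(start)})
	})
}

// serveOnce sends one in-memory request through a wrapped mux.
func serveOnce(path string) record {
	mux := http.NewServeMux()
	mux.HandleFunc("/first", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	var got record
	h := wrap(mux, func(rec record) { got = rec })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", path, nil))
	return got
}

func main() {
	fmt.Println(serveOnce("/first").Status, serveOnce("/missing").Status)
}
```

Because the wrapper is itself an `http.Handler`, application handlers need no changes, which is the same property the tracing package relies on.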

2. HTTP Transport Layer
Before:
    var DefaultTransport = http.Transport{
    }
After:
    var DefaultTransport = &ochttp.Transport{
        Propagation: &tracecontext.HTTPFormat{},
    }

3. Callback: GORM
Register all necessary callbacks for GORM:
func RegisterGormCallbacksWithConfig(db *gorm.DB, cfg *GormTracingCfg) {
    db.Callback().Create().
        Before("gorm:create").
        Register("instrumentation:before_create", cfg.beforeCallback(CreateOperator))
    db.Callback().Create().
        After("gorm:create").
        Register("instrumentation:after_create", cfg.afterCallback())
    // more callbacks here
}
func MigrateDB() {
    testDB = createDBConnection()
    RegisterGormCallbacks(testDB)
}

3. Callback: GORM
Wrap the GORM object with the context before calling a database operator:
func GormWithContext(ctx context.Context, origGorm *gorm.DB) *gorm.DB {
    return origGorm.Set(ScopeContextKey, ctx)
}
Usage:
orm := tracing.GormWithContext(r.Context(), testDB)
product, _ := GetFirstProductWithContext(orm)
func GetFirstProductWithContext(db *gorm.DB) (*Product, error) {
    r := &Product{}
    if err := db.First(r, 1).Error; err != nil {
        if gorm.IsRecordNotFoundError(err) {
            return nil, nil
        }
        log.Println(vite.MarkError, err)
        return nil, err
    }
    return r, nil
}

4. Callback: Redis
// takes a vanilla redis.Client and returns trace instrumented version
func RedisWithContext(ctx context.Context, origClient *redis.Client) *redis.Client {
client := origClient.WithContext(ctx)
client.WrapProcess(perCommandTracer(ctx, &redisDefaultCfg))
return client
}
// perCommandTracer provides the instrumented function
func perCommandTracer(ctx context.Context, cfg *RedisTracingCfg,
) func(oldProcess func(cmd redis.Cmder) error) func(redis.Cmder) error {
return func(fn func(cmd redis.Cmder) error) func(redis.Cmder) error {
return func(cmd redis.Cmder) error {
span := cfg.startTrace(ctx, cmd)
defer cfg.endTrace(span, cmd)
err := fn(cmd)
return err
}
}
}

4. Callback: Redis
Client side: wrap the Redis client with the context again before calling a Redis operator.
// wrap redis object before calling redis operator
wrapRedis := tracing.RedisWithContext(r.Context(), Redis.Client)
readKeyWithContext(wrapRedis, "service")
func readKeyWithContext(client *redis.Client, key string) string {
    return client.Get(key).String()
}

5. Exporter
Export to Jaeger
func RunJaegerExporter(service string, agentEndpoint string,
    collectorEndpoint string) (*jaeger.Exporter, error) {
    je, err := jaeger.NewExporter(jaeger.Options{
        AgentEndpoint:     agentEndpoint,
        CollectorEndpoint: collectorEndpoint,
        ServiceName:       service,
    })
    if err != nil {
        return nil, err
    }
    trace.RegisterExporter(je)
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.ProbabilitySampler(0.2)})
    return je, nil
}
Usage:
_, err := tracing.RunJaegerExporter(
    "trusting_social_demo",
    "localhost:6831",
    "http://localhost:14268/api/traces",
)

5. Exporter
Export to Console
_, err = tracing.RunConsoleExporter()
if err != nil {
panic(err)
}
// Start starts the metric and span data exporter.
func (exporter *LogExporter) Start() error {
exporter.traceExporter.Start()
exporter.viewExporter.Start()
err := exporter.metricExporter.Start()
if err != nil {
return err
}
return nil
}

5. Exporter
Export to Prometheus
func RunPrometheusExporter(namespace string) (*prometheus.Exporter, error) {
    pe, err := prometheus.NewExporter(prometheus.Options{
        Namespace: namespace,
    })
    if err != nil {
        return nil, err
    }
    view.RegisterExporter(pe)
    return pe, nil
}
Create the /metrics entry point for the Prometheus service to call:
// add api endpoint for prometheus
app.Mux.Handle("/metrics", pe)
Sample Prometheus configuration:
scrape_configs:
  - job_name: 'trustingsocial_ocmetrics'
    scrape_interval: 5s
    static_configs:
      - targets: ['host.docker.internal:3000']

6. Register views
Register all database views
err := tracing.RegisterAllDatabaseViews()
if err != nil {
panic(err)
}
defer tracing.UnregisterAllDatabaseViews()
// RegisterAllDatabaseViews registers all database views
func RegisterAllDatabaseViews() error {
return view.Register(GormQueryCountView)
}
Register all Redis views
err = tracing.RegisterAllRedisViews()
if err != nil {
panic(err)
}
defer tracing.UnregisterAllRedisViews()

Export trace
func (exporter *TraceExporter) ExportSpan(sd *trace.SpanData) {
var (
traceID = hex.EncodeToString(sd.SpanContext.TraceID[:])
spanID = hex.EncodeToString(sd.SpanContext.SpanID[:])
parentSpanID = hex.EncodeToString(sd.ParentSpanID[:])
)
// RunJaegerExporter exports trace to Jaeger
}
func (exporter *TraceExporter) Start() {
trace.RegisterExporter(exporter)
}
1. Implement ExportSpan function
2. Call trace.RegisterExporter

Export view
// ExportView implements view.Exporter's interface
func (exporter *ViewExporter) ExportView(vd *view.Data) {
for _, row := range vd.Rows {
}
}
// Start starts printing log
func (exporter *ViewExporter) Start() {
view.RegisterExporter(exporter)
}
1. Implement ExportView function
2. Call view.RegisterExporter

Export metric
// ExportMetrics implements metricexport.Exporter's interface.
func (exporter *MetricExporter) ExportMetrics(ctx context.Context,
metrics []*metricdata.Metric) error {
for _, metric := range metrics {
// process each metric
}
return nil
}
// Start starts printing log
func (exporter *MetricExporter) Start() error {
exporter.initReaderOnce.Do(func() {
exporter.intervalReader, _ = metricexport.NewIntervalReader(
exporter.reader,
exporter,
)
})
exporter.intervalReader.ReportingInterval = exporter.reportingInterval
return exporter.intervalReader.Start()
}
1. Implement ExportMetrics function
2. Interval polling to get latest metric data

The four golden signals
1. Latency
2. Traffic
3. Errors
4. Saturation

Latency
• The time it takes to service a request.
• It is important to distinguish between the latency of successful requests and the latency of failed requests.
• It is important to track error latency, as opposed to just filtering out errors.
Example:
• Database: time to query the database server.
• HTTP request: time from the beginning to the end of the request.

Traffic
• A measure of how much demand is being placed on your system, measured in a high-level system-specific metric.
Example:
• HTTP request: HTTP requests per second.
• Database: successful / failed queries per second.
• Redis: successful / failed queries per second (not-found results are not counted as failures).

Error
• The rate of requests that fail, either explicitly, implicitly, or by policy.
• Explicit: a request with HTTP status code 500.
• Implicit: an HTTP 200 success response, but coupled with the wrong content.
• Policy: if you committed to one-second response times, any request over one second is an error.
Example:
• HTTP request: requests with a status code other than 200.
• Redis: queries that return an error code (excluding not found).
• Database: queries that return an error code (excluding not found).
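
The three error kinds can be expressed as a small classification function. This is a sketch under assumptions: `Result`, `sloMs`, and `classify` are hypothetical names, any 5xx status is treated as an explicit error, and the policy threshold is the one-second commitment used as the example above.

```go
package main

import "fmt"

// Result is a hypothetical summary of one handled request.
type Result struct {
	Status    int  // HTTP status code
	ContentOK bool // whether the response body was what the caller expected
	ElapsedMs int  // time taken to serve the request, in milliseconds
}

// sloMs is the assumed response-time commitment (one second).
const sloMs = 1000

// classify returns "explicit", "implicit", "policy", or "" for a success.
func classify(r Result) string {
	switch {
	case r.Status >= 500:
		return "explicit"
	case !r.ContentOK:
		return "implicit"
	case r.ElapsedMs > sloMs:
		return "policy"
	default:
		return ""
	}
}

func main() {
	fmt.Println(classify(Result{Status: 500}))
	fmt.Println(classify(Result{Status: 200, ContentOK: false, ElapsedMs: 20}))
	fmt.Println(classify(Result{Status: 200, ContentOK: true, ElapsedMs: 1500}))
}
```

Only the explicit case is visible from the status code alone; the other two need content checks and latency measurements, which is why they are easy to miss.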

Saturation
• How "full" your service is.
• Latency increases are often a leading indicator of saturation.
• Measuring your 99th percentile response time over some small window can give a very early signal of saturation.
Example:
• HTTP request: system load such as CPU, RAM, …
• Redis: idle / active / inactive connections in the connection pool.
• Database: idle / active / inactive connections in the connection pool.
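
A 99th percentile over a small window can be computed directly from the raw latency samples. The sketch below uses the nearest-rank method, which is one common definition among several; `percentile` is a hypothetical helper, and latencies are plain `float64` milliseconds to keep the example short.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples
// using the nearest-rank method. It returns 0 for an empty window.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	// Sort a copy so the caller's window is left untouched.
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	var window []float64
	for i := 1; i <= 100; i++ {
		window = append(window, float64(i)) // latencies of 1ms .. 100ms
	}
	fmt.Println(percentile(window, 50), percentile(window, 99))
}
```

Unlike an average, the 99th percentile is driven by the slowest requests, so it tends to move first when a resource starts to saturate.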

The four golden signals
Already implemented in the tracing repository, in the redis / gorm / http packages.
Must read:
- How to measure in a production environment
- A systematic way to resolve production issues
…

Tracing Repository
Repository: https://github.com/tsocial/tracing
• Implemented callbacks for Redis, Gorm, and the HTTP handler.
• Defined and implemented observability for each package.
• Implemented some exporters (e.g. Jaeger, Prometheus, …).
• Implemented a console exporter and a simple exporter for testing.
• Includes an example project to demonstrate the usage.
• Decoupled from the Telco platform. Open source?

Sample project
1. Repository: https://github.com/tsocial/distributed_tracing_demo
- Test with Gorm/Redis
- Test tracing with console exporter
- Test with Jaeger / Prometheus
- Call external service
- Call internal service
- TODO: test with OpenCensus service
2. Repository: https://github.com/census-instrumentation/opencensus-service/blob/master/demos/trace/docker-compose.yaml
- Test with OpenCensus service
- Multiple internal services
- Jaeger / Prometheus / Zipkin …

References
- Documentation: https://opencensus.io
- Examples: https://github.com/census-instrumentation/opencensus-go/tree/master/examples
- How not to measure latency: https://www.youtube.com/watch?v=lJ8ydIuPFeU
- Specification for the B3 format: https://github.com/apache/incubator-zipkin-b3-propagation
- Specification for the W3C Trace Context format:
• https://www.w3.org/TR/trace-context/#dfn-distributed-traces
• https://github.com/opentracing/specification/issues/86
- Logging architecture: https://kubernetes.io/docs/concepts/cluster-administration/logging/
- Nice post about OpenCensus vs OpenTracing: https://github.com/gomods/athens/issues/392
- OpenCensus service design: https://github.com/census-instrumentation/opencensus-service/blob/master/DESIGN.md
- Distributed tracing at Uber: https://eng.uber.com/distributed-tracing/
- Tracing HTTP request latency: https://medium.com/opentracing/tracing-http-request-latency-in-go-with-opentracing-7cc1282a100a
- Context propagation: https://medium.com/jaegertracing/embracing-context-propagation-7100b9b6029a
- Book about distributed tracing: https://www.amazon.com/Mastering-Distributed-Tracing-performance-microservices/dp/1788628462
- The four golden signals: https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals
