OpenTelemetry For Operators

Presented by Kevin Brockhoff
Apache 2.0 Licensed

Our Agenda
● Why are current observability platforms falling short?
● What OpenTelemetry features address these issues?
● How do I run OpenTelemetry components in production?
● Who are the innovators in the observability space?

Level Setting
● Have you used ELK stack or other log aggregator?
● Have you used an APM system?
● Have you used distributed tracing before?

Who am I?
● Kevin Brockhoff - Senior Consultant, Daugherty Business Solutions
○ Solving difficult cloud adoption challenges for Daugherty's Fortune 500 clients
○ OpenTelemetry committer since early stages of the project
○ GitHub: https://github.com/kbrockhoff
○ LinkedIn: https://www.linkedin.com/in/kevin-brockhoff-a557877/

Enterprise Applications
● Only instrumented with logging during initial development.
○ Logging oriented toward development, not operations
● Metrics and tracing are added later, if at all, as a separate project.
○ Each team creates their own system using familiar tools
○ Or enterprise commits to a specific APM vendor
● Logs, metrics and traces are never connected.

First Generation Observability Platforms
● Search logs in ELK; lack context
● Homegrown tracing per app, mainly accessible by developers
● Customer experience metrics
● Low-level metrics and alerts

OpenCensus + OpenTracing = OpenTelemetry
● OpenTracing:
○ Provides APIs and instrumentation for distributed tracing
● OpenCensus:
○ Provides APIs and instrumentation that allow you to collect application metrics and distributed traces.
● OpenTelemetry:
○ An effort to combine distributed tracing, metrics and logging into a single set of system components and language-specific libraries.

OpenTelemetry Project
● Specification
○ API (for application developers)
○ SDK Implementations
○ Transport Protocol (Protobuf)
● Collector (middleware)
● SDKs (various stages of maturity)
○ C++
○ C# (Auto-instrument/Manual)
○ Erlang
○ Go
○ JavaScript (Browser/Node)
○ Java (Auto-instrument/Manual)
■ Android compatibility
○ PHP
○ Python (Auto-instrument/Manual)
○ Ruby
○ Rust
○ Swift

OpenTelemetry Collector
● Offers a vendor-agnostic implementation of how to receive, process and export telemetry data.
● Removes the need to run, operate and maintain multiple agents/collectors.
● Supports open-source telemetry data formats (e.g. OTLP, Jaeger, Prometheus) and sends to multiple open-source or commercial back-ends.

Collector Concepts
● Telemetry data processing pipelines
○ Per pipeline: Receiver(s) -> Processors -> Exporter(s)
○ Currently only single telemetry type pipelines supported
● Extensions
○ Supporting functionality
○ Core collector extensions
■ health_check - HTTP endpoint for load balancer or k8s controller
■ zpages - Internal processing metrics and traces accessible via HTTP
■ pprof - Performance profiler; enables the Go net/http/pprof endpoint
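
The per-pipeline flow described above (receivers feed processors, which feed exporters) can be pictured with a small conceptual model. This is only a sketch in Python, not the Collector's Go implementation; the Pipeline class, the type aliases, and the drop_health_checks processor are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical types for illustration only; the real Collector is written in Go.
Batch = List[dict]                       # a list of telemetry records
Processor = Callable[[Batch], Batch]     # transforms or filters a batch
Exporter = Callable[[Batch], None]       # sends a batch to a back-end


@dataclass
class Pipeline:
    """One telemetry type (e.g. traces) flowing through processors to exporters."""
    processors: List[Processor] = field(default_factory=list)
    exporters: List[Exporter] = field(default_factory=list)

    def consume(self, batch: Batch) -> None:
        """Called by any receiver attached to this pipeline."""
        for process in self.processors:   # processors run in configured order
            batch = process(batch)
        for export in self.exporters:     # every exporter gets the processed data
            export(batch)


def drop_health_checks(batch: Batch) -> Batch:
    """Example processor: filter out records named 'healthz'."""
    return [record for record in batch if record.get("name") != "healthz"]
```

A receiver would call consume() with whatever it decoded from the wire; because every exporter sees the same processed batch, one pipeline can fan out to several back-ends.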

Collector Bundled Receivers
Traces
● Jaeger
○ Compact Thrift, Binary Thrift, HTTP, gRPC
○ Sampling strategy configuration server
● Kafka
○ OTLP, Jaeger, Zipkin data structures
● OpenCensus
● OTLP (OpenTelemetry Protocol)
○ gRPC, HTTP
● Zipkin
○ v1, v1 Thrift, v2, v2 Protobuf
Metrics
● Host metrics scraper
○ cpu, disk, load, filesystem, memory, network, processes, swap, process
● Kafka
○ OTLP
● OpenCensus
○ gRPC, HTTP
● Prometheus
○ Full discovery and polling capabilities
Logs
● Fluent Forward
○ Spec compliant except no mTLS

Collector Contrib Receivers
Traces
● AWS X-Ray
● SignalFX APM v1
Metrics
● AWS ECS Container
● Carbon
● CollectD (JSON only)
● Docker Stats
● Kubernetes Cluster
● Kubernetes Kubelet
● Prometheus Exporters
● Redis INFO
● SignalFX
● Splunk HEC
● StatsD
● Wavefront
Logs
● SignalFX (Events)
● Stanza

Collector Bundled Processors
● Attributes
○ Modifies span attributes
● Batch
○ Groups data into batches
● Filter
○ Include/exclude metrics by name
● Group by Trace
○ Holds all spans for a trace for a set time and then sends them to the next processor
● Memory Limiter
○ Prevents out-of-memory issues by triggering GC
○ Configuration must match the ballast setting the collector is launched with
● Queued Retry
○ Deprecated; each exporter now implements its own queuing and retry
● Resource
○ Applies changes to Resource attributes
● Probabilistic Sampling
○ Adjusts TraceID hash-based sampling decisions by sampling.priority attribute value
● Tail Sampling
○ Sampling decisions based on configured attribute values and rate limits
● Span
○ Modifies span name or attributes based on span name
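
To make the Probabilistic Sampling entry above more concrete, here is a rough sketch of a TraceID hash-based decision that a sampling.priority attribute can override. The function name, the 10,000-bucket scheme, and the use of SHA-256 are assumptions for illustration; the Collector uses its own hash function and configuration, and the priority semantics shown (greater than 0 keeps the span, 0 drops it) should be checked against the processor's documentation.

```python
import hashlib
from typing import Optional


def should_sample(trace_id: bytes, sampling_percentage: float,
                  sampling_priority: Optional[int] = None) -> bool:
    """Decide whether to keep a span, deterministically per trace ID."""
    # An explicit sampling.priority attribute overrides the hash-based decision
    # (assumed semantics: > 0 forces sampling, 0 forces a drop).
    if sampling_priority is not None:
        return sampling_priority > 0
    # Map the trace ID onto one of 10,000 buckets. SHA-256 is a stand-in here;
    # the Collector uses its own hash function.
    digest = hashlib.sha256(trace_id).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sampling_percentage * 100
```

Because the decision depends only on the trace ID, every span of a trace gets the same answer no matter which collector instance handles it.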

Recommended Processor Configuration
Traces (processors in this order)
● memory_limiter
● any sampling processors
● batch
● any other processors
Metrics (processors in this order)
● memory_limiter
● any filtering processors
● batch
● any other processors
Memory limiter ballast_size_mib must match the --mem-ballast-size-mib command-line parameter. Trigger GC with either limit_mib / spike_limit_mib or limit_percentage / spike_limit_percentage.
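
The ordering recommended above can be checked mechanically before a config is deployed. The helper below is a hypothetical lint-style sketch, not part of the Collector; the processor names in SAMPLING_OR_FILTERING are assumed to be the config keys for the sampling and filter processors.

```python
from typing import List

# Hypothetical helper; these names are assumed config keys for the
# sampling and filtering processors mentioned in the recommendation.
SAMPLING_OR_FILTERING = {"probabilistic_sampler", "tail_sampling", "filter"}


def order_problems(processors: List[str]) -> List[str]:
    """Return human-readable warnings for a pipeline's processor list."""
    problems = []
    # memory_limiter should see data before anything else allocates memory.
    if "memory_limiter" in processors and processors[0] != "memory_limiter":
        problems.append("memory_limiter should be first")
    # Sampling and filtering should shrink the data before it is batched.
    if "batch" in processors:
        batch_at = processors.index("batch")
        for name in processors[batch_at + 1:]:
            if name in SAMPLING_OR_FILTERING:
                problems.append(f"{name} should come before batch")
    return problems
```

For example, order_problems(["memory_limiter", "batch", "filter"]) reports that filter should come before batch, while the list used in the traces pipeline of the full configuration example produces no warnings.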

Collector Contrib Processors
● Kubernetes
○ Adds metadata from pod
● Metrics Transform
○ Renames/aggregations within individual metrics
● Resource Detection
○ OTEL_RESOURCE environment variable
○ GCE metadata server
○ EC2 instance metadata server
● Routing
○ Route to particular exporter based on incoming header value
TODO
● Span data sharding by TraceID

Collector Bundled Exporters
Traces
● File
○ JSON format
● Jaeger
○ v2 gRPC
● Kafka
○ OTLP, Jaeger, Zipkin
● Logging
○ Debugging
● OpenCensus
● Zipkin
○ v2 JSON or Protobuf
Metrics
● File
○ JSON format
● Logging
○ Debugging
● OpenCensus
● Prometheus
○ Metrics endpoint for Prometheus to pull from
● Prometheus Remote Write
○ Pushes metrics in Prometheus TimeSeries format (Cortex, etc.)

Collector Contrib Exporters
Traces
● AlibabaCloud LogService
● AWS X-Ray
● Azure Monitor
● Datadog
● Elastic
● Honeycomb
● Jaeger v1 Thrift
● AWS Kinesis (Jaeger proto)
● New Relic
● SignalFX APM
● Sentry
● Stackdriver
Metrics
● AlibabaCloud LogService
● AWS CloudWatch EMF
● Carbon
● Datadog
● Elastic
● New Relic
● SignalFX
● Splunk HEC
● Stackdriver

Vendor Hosted Exporters
Traces
● Dynatrace OneAgent
● Lightstep Launchers
Metrics
● Dynatrace OneAgent
● Lightstep Launchers

Full Configuration File Example
receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 32
        max_concurrent_streams: 16
        read_buffer_size: 1024
        write_buffer_size: 1024
        keepalive:
          server_parameters:
            max_connection_idle: 10s
processors:
  memory_limiter:
    ballast_size_mib: 192
    check_interval: 5s
    limit_mib: 448
    spike_limit_mib: 64
  batch:
    send_batch_size: 64
    timeout: 15s
exporters:
  jaeger:
    endpoint: jaeger.monitoring.svc.storefront-development.local.:14250
    timeout: 10s
    sending_queue:
      enabled: true
      num_consumers: 2
      queue_size: 10
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m
  prometheusremotewrite:
    namespace: "monitoring"
    sending_queue:
      enabled: true
      num_consumers: 2
      queue_size: 10
    retry_on_failure:
      enabled: true
      initial_interval: 10s
      max_interval: 60s
      max_elapsed_time: 10m
    endpoint: ":8888"
    ca_file: "/etc/pki/tls/certs/carbon-lb.pem"
    write_buffer_size: 524288
    headers:
      Prometheus-Remote-Write-Version: "0.1.0"
      X-Scope-OrgID: 234
extensions:
  health_check:
    port: 13133
  zpages:
    endpoint: :55679
service:
  extensions: [zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]

Collector Command Line Example
/usr/local/bin/otelcol \
  --config=/usr/local/etc/otel-collector-config.yaml \
  --mem-ballast-size-mib=192 \
  --log-level=DEBUG
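
Since the memory limiter's ballast_size_mib has to agree with the --mem-ballast-size-mib flag, a deployment script could compare the two before starting the collector. The sketch below assumes the YAML config has already been parsed into a Python dict by whatever YAML library is in use; the function names are hypothetical.

```python
from typing import List, Optional

FLAG = "--mem-ballast-size-mib="


def ballast_from_args(argv: List[str]) -> Optional[int]:
    """Extract the ballast size in MiB from collector command-line arguments."""
    for arg in argv:
        if arg.startswith(FLAG):
            return int(arg[len(FLAG):])
    return None  # flag not present


def ballast_matches(config: dict, argv: List[str]) -> bool:
    """True when memory_limiter.ballast_size_mib equals the command-line flag."""
    limiter = config.get("processors", {}).get("memory_limiter", {})
    return limiter.get("ballast_size_mib") == ballast_from_args(argv)
```

With the values from this deck (192 in both places) the check passes. Note that it also passes when neither side sets a ballast, which is a choice of this sketch rather than Collector behavior.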

Collector Docker Images
● otel/opentelemetry-collector
○ Core receivers, processors, and exporters bundled in
● otel/opentelemetry-collector-contrib
○ All core and contrib receivers, processors, and exporters bundled in
● OpenTelemetry Collector builder
○ https://github.com/observatorium/opentelemetry-collector-builder

Other Collector Installs
● RPM
○ Produced by opentelemetry-collector build
● Debian
○ Produced by opentelemetry-collector build

Observing the Collector
● health_check
○ http://<hostname>:13133/ returns basic pipeline availability
● zpages
○ RPC metric aggregations at http://<hostname>:55679/debug/rpcz
○ Trace summaries at http://<hostname>:55679/debug/tracez
● prometheus
○ Pipeline metrics scrape endpoint at http://<hostname>:8888/metrics
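
A monitoring script could probe the endpoints listed above. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the ports are the ones shown on this slide, which may differ in your configuration.

```python
import urllib.request


def collector_urls(hostname: str) -> dict:
    """Build the observability URLs listed above for one collector host."""
    return {
        "health_check": f"http://{hostname}:13133/",
        "rpcz": f"http://{hostname}:55679/debug/rpcz",
        "tracez": f"http://{hostname}:55679/debug/tracez",
        "metrics": f"http://{hostname}:8888/metrics",
    }


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

Calling is_reachable(collector_urls("localhost")["health_check"]) is roughly what a load balancer or Kubernetes probe does against the health_check extension.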

Current Gotchas
● Errors are propagated back through pipelines and instances in the chain
○ Errors reported by SDK exporters in the applications may be coming from two hops downstream
● TraceID sharding not working correctly
○ Can only do tail-based sampling if running single instance of collector
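
Tail-based sampling needs every span of a trace to arrive at the same collector instance, which is why TraceID sharding matters. The idea can be sketched as below; this illustrates the concept only and is not how the Collector implements it. The function name and instance names are hypothetical, and SHA-256 is an arbitrary choice of stable hash.

```python
import hashlib
from typing import List


def collector_for_trace(trace_id: str, collectors: List[str]) -> str:
    """Pick a collector instance deterministically from a hex trace ID."""
    # A stable hash means every sender maps the same trace to the same instance.
    digest = hashlib.sha256(bytes.fromhex(trace_id)).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]
```

A plain modulo like this reassigns most traces whenever the instance list changes, so a real deployment would likely want a consistent-hashing scheme instead.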

Observability Platform Innovations

Latest Innovations
● Dynatrace automates manual quality validation processes using AI-assisted SLI/SLO-based quality gates.
● New Relic Incident Intelligence continuously analyzes alerts and incident data to find patterns in event sequences and offers suggested correlation decisions that merge incidents to reduce alert noise further.
● Splunk SignalFX provides high cardinality exploration of traces across different regions, hosts, versions or users.
● Lightstep provides rapid root cause analysis using unlimited cardinality and a high-fidelity dataset uncompromised by head or tail sampling.

Latest Innovations
● Datadog provides automated tagging and correlation of logs so you can jump from any log entry to related metrics.
● Honeycomb lets you break down on every dimension in your data, both the obvious fields and the surprising ones.
● Grafana Loki datasource provides switching from metrics to logs with preserved label filters.
● Elastic Observability brings your logs, metrics, and APM traces together at scale in a single stack.
