Observability: Beyond the Three Pillars with Spring

Jonatan Ivanov
2022-06-29
Copyright © 2022 VMware, Inc. or its affiliates.

About Me
- @jonatan_ivanov
- develotters.com
- Seattle Java User Group
- Spring Team @ VMware
- Micrometer
- Spring Cloud Sleuth
- “Spring Observability”

Disclaimer
This presentation may contain product features or functionality that are currently under
development.
This overview of new technology represents no commitment from VMware to deliver these
features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or
sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new features/functionality/technology discussed or
presented, have not been determined.
The information in this presentation is for informational purposes only and may not be
incorporated into any contract. There is no commitment or obligation to deliver any items
presented herein.

Agenda
- What is Observability?
- Why do we need it?
- “The Three Pillars” (with examples)
- Logging
- Metrics
- Distributed Tracing
- How to implement it with Spring?
- “Non-conventional” Observability
- Q&A

What is Observability?
Why do we need it?

“In control theory, observability is a measure of how well
internal states of a system can be inferred from knowledge
of its external outputs.”
…
“A system is said to be observable if [...] the current state can
be estimated using only the information from outputs.”
(Wikipedia)

How well we can understand the internals of a system based on its outputs
(Providing meaningful information about what happens inside)

Being able to ask arbitrary questions without knowing ahead of time what you want to ask
Turning data points and context into insights
Being able to quickly troubleshoot problems with no prior knowledge (unknown unknowns)

Why do we need Observability?
Today's systems are insanely complex (cloud)
(Death Star Architecture, Big Ball of Mud)

Complexity (cloud): LAMP stack vs. Cloud Environments
We need to face unknown unknowns
We might not know where our apps are
We might not know how many instances we have (or what versions)
We can’t modify/debug them directly
Something is always broken (Fallacies of Distributed Computing)
Like sending rovers to Mars: You can’t touch/modify them after launch

Chaos
Environments can be chaotic
You turn a knob a little here and services go down over there
Unknown Unknowns
We can’t know everything, we need to deal with unknown unknowns
“This should be impossible!”, “That will never happen!”
Relativity
The same thing can be perceived differently by different observers
Everything is broken for the users but the server side seems ok

Continuous Improvement
If you want to improve something, you need to be able to measure it first
How many resources do you utilize (cpu, ram, io, etc.)?
What are your throughput/latency (max.) patterns?
How frequently do you deploy?
How long does it take for the code to go live?
How long does it take to troubleshoot an issue or recover from an outage?
How often are you paged?

Opens the door for advanced capabilities
Chaos Engineering
Anomaly Detection
Feature flags
A/B Testing
Auto-tuning
Adaptive Apps

“The Three Pillars”
(The most popular approach)

Logging - Metrics - Distributed Tracing
Logging
What happened?
Emitting events
Easy to read (grep)
INFO/WARN/ERROR/…
Stacktraces
Metrics
What is the context?
Measure-and-Combine data
Aggregatable
Can identify trends
Not traffic-sensitive (usually)
Distributed Tracing
Why did it happen?
Recording events
With causal ordering
Can identify cause across apps
Context Propagation (later)

Example: Latency
Logging
“Processing a request took 140ms.”
Is it bad?
Is it good?
Metrics
“99.999% of the requests were faster than 140ms.”
“The max was 150ms.”
So it’s quite bad.
But why was this slow?
Distributed Tracing
“Service A called Service B.”
“Service B called the DB.”
“The services were ok.”
“The network was ok.”
“The DB was slow.”
“Because somebody requested a lot of data.”
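The metrics statements in the latency example ("99.999% of the requests were faster than ..." and "the max was ...") are percentile and max aggregations over recorded durations. A minimal sketch of how such numbers could be derived from raw samples, assuming a nearest-rank percentile and a made-up sample set (the class name LatencyStats and the sample values are hypothetical, not from the talk):

```java
import java.util.Arrays;

final class LatencyStats {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** Largest recorded sample. */
    static long max(long[] samplesMs) {
        return Arrays.stream(samplesMs).max().orElseThrow();
    }

    public static void main(String[] args) {
        // Hypothetical request durations in milliseconds
        long[] latencies = {12, 15, 11, 140, 13, 150, 14, 12, 16, 13};
        System.out.println("p50 = " + percentile(latencies, 50) + "ms"); // 13ms
        System.out.println("p90 = " + percentile(latencies, 90) + "ms"); // 140ms
        System.out.println("max = " + max(latencies) + "ms");            // 150ms
    }
}
```

A single log line saying "140ms" only becomes meaningful next to aggregates like these. Metrics backends typically approximate percentiles from histograms rather than sorting raw samples.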

Example: Error
Logging
“Request processing failed.”
“Here’s the stacktrace.”
Is it bad?
(Well, it failed.) How bad?
How many of them failed?
Metrics
“The error rate is 0.001/sec.”
“We had 2 errors recently.”
So it’s not that bad.
But why did this happen?
Distributed Tracing
“Service A called Service B.”
“Service B called the DB.”
“The services were ok.”
“The network was ok.”
“The DB call failed.”
“Because of invalid input.”

Logging 101 - Types of logs
Application logs: classic DEBUG/INFO/WARN/ERROR events (+stacktraces)
Payload logs: Raw request and response pairs
GC logs: GC events (JEP 271 - Unified GC Logging)
Access logs: Logs from the underlying HTTP server (e.g.: Tomcat)
- Who and when called our service
- What request (HTTP method, headers, path, query)
- Response status, processing time, payload sizes
etc. (audit logs, metrics in logs, trace logs)

Logging with Spring: SLF4J + Logback
SLF4J with Logback comes pre-configured, but you can replace Logback
SLF4J
- Simple Logging Façade for Java
- Simple API for various logging libraries
- Lets you plug in the desired logging library
Logback
- Modern logging library
- Natively implements the SLF4J API
If you want Log4j2 instead of Logback:
- exclude spring-boot-starter-logging
+ add spring-boot-starter-log4j2

Logging with Spring: Payload, Access, GC
Payload logs: Logbook
+ logbook-spring-boot-starter (auto-configured)
Access logs:
server.tomcat.accesslog.enabled=true
server.tomcat.basedir=logs
server.tomcat.accesslog.pattern=...
server.jetty.accesslog.enabled=true
server.undertow.accesslog.enabled=true
+ logback-access (if you want to use Logback, needs to be configured)
GC logs: JVM args

Metrics 101
Time series data: data that changes over time
Trends, context, anomaly detection, visualization, alerting
Various Backends
Publishing: Client Pushes vs. Server Polls
Dimensionality: Dimensional vs. Hierarchical
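The difference between dimensional and hierarchical metrics can be illustrated with a small sketch: a dimensional metric is a name plus key-value tags, while a hierarchical backend only has one flat, dotted name. The flattening convention below (sorted tag keys, key.value pairs appended to the name) and the class name MetricNames are assumptions for illustration only; real libraries and backends each define their own naming conventions.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class MetricNames {

    /** Flattens a dimensional metric (name + tags) into one dotted hierarchical name. */
    static String toHierarchical(String name, Map<String, String> tags) {
        // Sort tags by key so the same metric always maps to the same flat name
        String suffix = new TreeMap<>(tags).entrySet().stream()
                .map(e -> e.getKey() + "." + e.getValue())
                .collect(Collectors.joining("."));
        return suffix.isEmpty() ? name : name + "." + suffix;
    }

    public static void main(String[] args) {
        // Dimensional form: one name, queried and aggregated by tag
        Map<String, String> tags = Map.of("method", "GET", "status", "200");
        // Hierarchical form: tags are baked into the name
        System.out.println(toHierarchical("http.requests", tags));
        // prints: http.requests.method.GET.status.200
    }
}
```

With the dimensional form you can aggregate across any tag at query time; with the hierarchical form the drill-down path is fixed when the name is built.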

Metrics with Spring: Micrometer
Popular Metrics library on the JVM
Like SLF4J, but for metrics
Simple API
Supports the most popular metric backends
Comes with spring-boot-actuator
Spring projects are instrumented using Micrometer
A lot of third-party libraries use Micrometer

Micrometer - Like SLF4J, but for metrics
AppOptics
Atlas
Azure Monitor
CloudWatch (AWS)
Datadog
Dynatrace
Elastic
Ganglia
Graphite
Humio
InfluxDB
JMX
KairosDB
New Relic
OpenTSDB
OTLP
Prometheus
SignalFx
Stackdriver (GCP)
StatsD
Wavefront* (VMware)
(/actuator/metrics)
*VMware Tanzu Observability by Wavefront

Distributed Tracing 101 - Correlation
(Figure: the same TraceId, 123, appears at every hop of the request)

Distributed Tracing 101 - Span and Trace
(Figure: spans A, B, C, D, E, F making up one trace, TraceId: 123)

Distributed Tracing 101 - Span and Trace
Span (basic unit of work)
SpanId, ParentSpanId, TraceId
Timestamps (start/stop)
Events (annotations) with timestamps
Tags (key-value pairs)
ProcessId
Local IP, Remote IP
+ Log correlation (and context propagation)
+ Visualization
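The span fields listed here (SpanId, ParentSpanId, TraceId, timestamps, tags) are enough to reconstruct the call tree of a trace. A minimal sketch, assuming a simplified, hypothetical Span record (this is not the data model of Brave, OpenTelemetry, or any other tracer, and it omits events, process ID, and IPs):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TraceSketch {

    /** Hypothetical span model holding a subset of the fields a real span carries. */
    record Span(String traceId, String spanId, String parentSpanId,
                long startMs, long stopMs, Map<String, String> tags) {
        long durationMs() { return stopMs - startMs; }
    }

    /** Returns the direct children of a span within one trace. */
    static List<Span> childrenOf(Span parent, List<Span> trace) {
        List<Span> children = new ArrayList<>();
        for (Span s : trace) {
            if (parent.spanId().equals(s.parentSpanId())) children.add(s);
        }
        return children;
    }

    /** Renders the span and its descendants as an indented tree, one span per line. */
    static String render(Span span, List<Span> trace, int depth) {
        StringBuilder sb = new StringBuilder("  ".repeat(depth))
                .append(span.spanId()).append(" (").append(span.durationMs()).append("ms)\n");
        for (Span child : childrenOf(span, trace)) sb.append(render(child, trace, depth + 1));
        return sb.toString();
    }

    public static void main(String[] args) {
        // All spans share TraceId 123; the root span has no parent
        Span a = new Span("123", "A", null, 0, 100, Map.of("service", "A"));
        Span b = new Span("123", "B", "A", 10, 70, Map.of("service", "B"));
        Span c = new Span("123", "C", "B", 20, 40, Map.of("db", "query"));
        System.out.print(render(a, List.of(a, b, c), 0));
    }
}
```

The parent links give the causal ordering, and the timestamps show where the time was spent, which is what a trace visualization draws.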

Distributed Tracing with Spring: Spring Cloud Sleuth
Distributed Tracing Support for Spring
Provides an abstraction layer on top of tracing libraries (3.x)
- Brave (OpenZipkin), default
- OpenTelemetry (CNCF), experimental
Log Correlation + Context Propagation
Instrumentation for Spring Projects (and your application)
Instrumentation for third-party libraries (through Brave and OTel)
Supports various backends (through Brave and OTel)

All-In-One: Observation API (Micrometer.next)
// Manual lifecycle: start, record the error, always stop
Observation observation = Observation.start("test", registry);
try { // TODO: scope
    Thread.sleep(1000);
}
catch (Exception exception) {
    observation.error(exception);
    throw exception;
}
finally { // TODO: attach tags
    observation.stop();
}

// Alternative to the try/catch/finally above: let the API wrap the call
observation.observeChecked(() -> Thread.sleep(1000));

“Non-conventional”
Observability

“Non-conventional” Observability
Is there anything else beyond Logging + Metrics + Tracing?
We are looking for:
- outputs (that provide)
- meaningful information
- about what’s inside of our system

Spring Boot Actuator
auditevents
beans
caches
conditions
configprops
env
flyway
health (k8s probes)
heap/thread dump
httptrace
info
integrationgraph
jolokia
logfile
loggers
liquibase
metrics, traces
mappings
prometheus
quartz
scheduledtasks
sessions
shutdown
startup

Health Endpoint
{
  "status": "UP",
  "components": {
    "db": {
      "status": "UP",
      "details": {
        "database": "H2",
        "validationQuery": "isValid()"
      }
    },
    [...]
  }
}

Health Endpoint
{
  "status": "UP",
  "components": {
    [...]
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 1000240963584,
        "free": 764043239424,
        "threshold": 10485760,
        "exists": true
      }
    },
    "ping": {
      "status": "UP"
    }
  }
}

Health Endpoint
{
  "status": "UP",
  "components": {
    [...]
    "tealeafService": {
      "status": "UP",
      "details": {
        "components": { ... }
      }
    },
    "waterService": {
      "status": "UP",
      "details": {
        "components": { ... }
      }
    }
  }
}
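The health output has one status per component and one overall status. A minimal sketch of that aggregation, assuming only two statuses and the rule "overall is UP only if every component is UP". The class HealthSketch and its methods are hypothetical; Spring Boot Actuator has its own health contributor and aggregation mechanism with more statuses than shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

final class HealthSketch {

    enum Status { UP, DOWN }

    /** Runs every check; a check that returns false or throws is reported as DOWN. */
    static Map<String, Status> checkAll(Map<String, BooleanSupplier> checks) {
        Map<String, Status> components = new LinkedHashMap<>();
        checks.forEach((name, check) -> {
            Status status;
            try {
                status = check.getAsBoolean() ? Status.UP : Status.DOWN;
            } catch (RuntimeException e) {
                status = Status.DOWN;
            }
            components.put(name, status);
        });
        return components;
    }

    /** Simplified aggregation rule: overall status is UP only if every component is UP. */
    static Status aggregate(Map<String, Status> components) {
        return components.containsValue(Status.DOWN) ? Status.DOWN : Status.UP;
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> checks = new LinkedHashMap<>();
        checks.put("tealeafService", () -> true);
        checks.put("waterService", () -> { throw new IllegalStateException("unreachable"); });

        Map<String, Status> components = checkAll(checks);
        System.out.println(components);            // {tealeafService=UP, waterService=DOWN}
        System.out.println(aggregate(components)); // DOWN
    }
}
```

Reporting downstream services as components, as in the tealeafService and waterService entries, makes the dependency chain visible from a single endpoint.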

Info Endpoint
"git": {
  "branch": "main",
  "commit": {
    "id": "96c9ebe",
    "time": "2022-04-07T19:19:19Z"
  }
},
"build": {
  "artifact": "tea-service",
  "name": "tea-service",
  "time": "2022-04-07T19:19:35.153Z",
  "version": "96c9ebe.1649359173515", // 1.2.3
  "group": "org.example.teahouse"
}

Info Endpoint
"java": {
  "vendor": "Eclipse Adoptium",
  "version": "18",
  "runtime": {
    "name": "OpenJDK Runtime Environment",
    "version": "18+36"
  },
  "jvm": {
    "name": "OpenJDK 64-Bit Server VM",
    "vendor": "Eclipse Adoptium",
    "version": "18+36"
  }
},
"environment": {
  "activeProfiles": [ "local" ]
}

Info Endpoint
"memory": {
  "total": 268435456,
  "max": 268435456,
  "free": 149509024
},
"cpu": {
  "availableProcessors": 16
},
"gcs": [
  {
    "name": "G1 Young Generation",
    "memoryPoolNames": [ ... ]
  },
  {
    "name": "G1 Old Generation",
    "memoryPoolNames": [ ... ]
  }
]
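Values like the memory, cpu, and gcs entries can be read from the JVM itself through standard JDK APIs. A minimal sketch of collecting them into a map, assuming a hypothetical class RuntimeInfoSketch and simplified output (GC names only, no memory pool names); it shows where the data comes from, not how the info endpoint above is implemented.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class RuntimeInfoSketch {

    /** Collects memory, CPU, and garbage collector details from the running JVM. */
    static Map<String, Object> collect() {
        Runtime rt = Runtime.getRuntime();

        Map<String, Object> memory = new LinkedHashMap<>();
        memory.put("total", rt.totalMemory()); // bytes currently reserved by the JVM
        memory.put("max", rt.maxMemory());     // upper bound the JVM will try to use
        memory.put("free", rt.freeMemory());   // unused part of "total"

        // Names of the active collectors, e.g. "G1 Young Generation"
        List<String> gcs = ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());

        Map<String, Object> info = new LinkedHashMap<>();
        info.put("memory", memory);
        info.put("cpu", Map.of("availableProcessors", rt.availableProcessors()));
        info.put("gcs", gcs);
        return info;
    }

    public static void main(String[] args) {
        System.out.println(collect());
    }
}
```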

Info Endpoint
"user": {
  "timezone": "UTC",
  "country": "US",
  "language": "en",
  "dir": "~/GitHub/teahouse/tea-service"
},
"os": {
  "arch": "x86_64",
  "name": "Mac OS X",
  "version": "latest :)"
},
"network": {
  "host": "my-hostname",
  "ip": "192.168.0.100"
},
"startTime": "2022-04-07T19:19:36.898Z",
"uptime": "PT15M31.094729S",
"heartbeat": "2022-04-07T19:35:07.992731Z"

Info Endpoint
How to contact the dev team, where is the repo of the project?
Cloud
instanceId and type
image version
region, account, cloud provider
TLS Certificate Chain
subject, issuer
validity (expiration date) -> health check?
signature algorithm
You can create your own endpoint
Dependencies used at runtime; Dependency lock files
/whoami: username + roles

Service Registry/Discoverability
How many service instances do we have (by environment)?
What versions are deployed (by environment)?
Where are they?
host/ip, port
instanceId, region, account, cloud provider, etc.
Service starts/stops (deployments, restarts)?

API Discoverability
How can I call this service?
Spring REST Docs
Generates docs from tests and hand-written docs
Spring Cloud Contract + Pact Broker
Consumer Driven Contracts (test client-server contract)
You know when you break your clients
Swagger / OpenAPI + ReDoc
API spec, docs
API browser + client
Spring HATEOAS + HAL Explorer
Add links to your resources (other resources or operations)
API browser + client

Spring HATEOAS
{
  "id": "6b55663a",
  [...]
  "_links": {
    "self": {
      "href": "/tealeaves/6b55663a"
    },
    "search": {
      "href": "/tealeaves/search/findByName?name=sencha"
    },
    "collection": {
      "href": "/tealeaves"
    }
  }
}

Spring HATEOAS
{
  "_embedded": {
    "tealeaves": [...]
  },
  "_links": {
    "first": { "href": "/tealeaves?page=0&size=5" },
    "prev": { "href": "/tealeaves?page=0&size=5" },
    "self": { "href": "/tealeaves?page=1&size=5" },
    "next": { "href": "/tealeaves?page=2&size=5" },
    "last": { "href": "/tealeaves?page=2&size=5" }
  },
  "page": {
    "size": 5,
    "totalElements": 15,
    "totalPages": 3,
    "number": 1
  }
}
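The first/prev/self/next/last links follow directly from the page metadata (size, totalElements, number). A minimal sketch of that arithmetic, assuming zero-based page numbers as in the response above; the class PageLinksSketch is hypothetical, and Spring HATEOAS generates such links for you rather than requiring code like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PageLinksSketch {

    /** Builds paging links for a zero-based page number; prev/next are omitted at the edges. */
    static Map<String, String> links(String basePath, int number, int size, long totalElements) {
        int totalPages = (int) ((totalElements + size - 1) / size); // ceiling division
        int last = Math.max(totalPages - 1, 0);

        Map<String, String> links = new LinkedHashMap<>();
        links.put("first", href(basePath, 0, size));
        if (number > 0) links.put("prev", href(basePath, number - 1, size));
        links.put("self", href(basePath, number, size));
        if (number < last) links.put("next", href(basePath, number + 1, size));
        links.put("last", href(basePath, last, size));
        return links;
    }

    private static String href(String basePath, int page, int size) {
        return basePath + "?page=" + page + "&size=" + size;
    }

    public static void main(String[] args) {
        // Same numbers as the response above: page 1 of 3, 5 items per page, 15 items in total
        links("/tealeaves", 1, 5, 15).forEach((rel, href) -> System.out.println(rel + " -> " + href));
    }
}
```

Because the client only follows the links it is given, it never has to compute page numbers itself, which is the discoverability point of HATEOAS.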

Questions/Feedback?
Twitter: @jonatan_ivanov
Blog: develotters.com
Try it: github.com/jonatan-ivanov/teahouse
© 2022 Spring. A VMware-backed project.
