The Observability Pipeline

@tyler_treat
The Observability Pipeline
Tyler Treat / deliver:Agile 2019 / April 29, 2019

@tyler_treat
The way we build systems has
fundamentally changed.

@tyler_treat
Our systems are more complex
than they’ve ever been.

@tyler_treat
Don’t believe me?

@tyler_treat
https://www.youtube.com/watch?v=xy3w2hGijhE

@tyler_treat
This is our server.
His name is Toby.

@tyler_treat
We take good care of Toby.

@tyler_treat
We release to him twice a year. 
(quarterly if we’re feeling dangerous)

@tyler_treat
Toby is compatible with most 
versions of Internet Explorer.

@tyler_treat
Toby likes to go on long walks, 
so sometimes we’ll take him  
offline for a bit. 
(usually just nights and weekends)

@tyler_treat
No one seems to mind.

@tyler_treat
Sometimes Toby crashes, 
but we always make sure 
to restart him.

@tyler_treat
This is 74db150601cd.

@tyler_treat
It’s best not to get too 
attached because when he’s 
no longer needed, well…

@tyler_treat
Transactional 
DB
App Server
Reporting 
DB

@tyler_treat
“We need to be
highly available.”

@tyler_treat
Node 1
App Server
Reporting 
DB
Node 2 Node 3
Node 4 Node 5
Database Cluster
App Server App Server

@tyler_treat
“We need to support
every device.”

@tyler_treat
“We need faster
response times.”

@tyler_treat
Node 1
App Server
Reporting 
DB
Node 2 Node 3
Node 4 Node 5
Database Cluster
Node 1 Node 2 Node 3
Node 4 Node 5
Cache Cluster

@tyler_treat
“We need real-time
analytics, not batch.”

@tyler_treat
App Server
Node 4 Node 5
Database Cluster
Node 4 Node 5
Cache Cluster
Node 4 Node 5
BI Data Cluster
BI Server BI Server
Data Pipeline

@tyler_treat
“We need to release
multiple times a day.”

@tyler_treat
[Diagram: four Microservices, each with its own five-node Database Cluster and Cache Cluster, alongside a Data Pipeline, a BI Data Cluster, and BI Servers]

@tyler_treat
“We need to support
multiple geos.”

@tyler_treat
North America
BI Server BI Server
Microservice Microservice
Asia Pacific
BI Server BI Server

@tyler_treat
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
CDN

@tyler_treat
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
CDN
Infrastructure
Load Balancers Orchestrators DNS Configuration . . .

@tyler_treat
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
CDN
CI/CD
Repo Repo Repo Repo
Builder Builder Builder
Artifacts Artifacts Artifacts
Deployer Deployer
Infrastructure

@tyler_treat
“Oh, and one more
thing…”

@tyler_treat
“…we need to do
DevOps.”

@tyler_treat
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
CDN
CI/CD
Repo Repo Repo Repo
Deployer Deployer
“DevOps”
Infrastructure

@tyler_treat
Because our constraints and expectations
have fundamentally changed.

@tyler_treat
Cloud and containers have led to much
more distributed and dynamic systems.

@tyler_treat
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
North America
BI Server BI Server
CDN
CI/CD
Repo Repo Repo Repo
Deployer Deployer
Infrastructure
“DevOps”

@tyler_treat
This shift has exposed deficiencies
in our tools and practices…

@tyler_treat
…and has led to new tools created
to help us support our systems.

@tyler_treat
How do we make sense of it all?

@tyler_treat
In particular, how do we make
this…

@tyler_treat
more like this…

@tyler_treat
“The Observability Pipeline”

@tyler_treat
A Brave New World

@tyler_treat
APM
Debugger
Profiler
SSH
grep

@tyler_treat
APM
Debugger
Profiler
SSH
System Behavior
grep

@tyler_treat
APM
Debugger
Profiler
SSH
System Behavior
Actual Customer Impact
grep

@tyler_treat
APM
Debugger
Profiler
SSH
Testing in Production at Scale, Amit Gud
grep

@tyler_treat
APM
Debugger
Profiler
SSH
System Behavior
???
grep

@tyler_treat
grep
APM
Debugger
Profiler
SSH
System Behavior
???

@tyler_treat
Many companies rely on a separate
operations team to monitor, triage, and
even resolve issues.

@tyler_treat
This model doesn’t map to the world
of microservices and containers.

@tyler_treat
And it leads to ineffective
feedback loops.

@tyler_treat
In order for developers to take on this
responsibility, they need to be enabled.

@tyler_treat
“DevOps” teams are really
“Developer Enablement” teams.

@tyler_treat
This shift in how we build systems has
caused an explosion of new tools and
terminology.

@tyler_treat
“Observability”

@tyler_treat
Post Hoc vs. Ad Hoc

@tyler_treat
Data Available
Understanding

@tyler_treat
Data Available
Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”

@tyler_treat
Data Available
Understanding
Known Knowns
Known Unknowns
• Things we are aware of but don’t
understand
• “The system exceeded its memory limit
and crashed, causing an outage”

@tyler_treat
Data Available
Understanding
Unknown Knowns
• Things we understand but are not
aware of
• “We implemented an orchestrator to
ensure the system is always running”
Known Knowns
Known Unknowns

@tyler_treat
Data Available
Understanding
Unknown Knowns
Known Knowns
Unknown Unknowns
• Things we are neither aware of nor
understand
• “Instances churn because the
orchestrator restarts the process when
it approaches its memory limit, causing
sporadic failures and slowdowns”
Known Unknowns

@tyler_treat
Data Available
Understanding
Unknown Knowns
Known Knowns
Unknown Unknowns
Known Unknowns
FACTS

@tyler_treat
Data Available
Understanding
Unknown Knowns
Known Knowns
Unknown Unknowns
Known Unknowns
FACTS
HYPOTHESES

@tyler_treat
Data Available
Understanding
Unknown Knowns
Known Knowns
Unknown Unknowns
Known Unknowns
ASSUMPTIONS FACTS
HYPOTHESES

@tyler_treat
Unknown Unknowns
DISCOVERIES
Data Available
Understanding
Unknown Knowns
Known Knowns
Known Unknowns
ASSUMPTIONS FACTS
HYPOTHESES

@tyler_treat
Unknown Unknowns
DISCOVERIES
Data Available
Understanding
Known Unknowns
HYPOTHESES
Monitoring Observability

@tyler_treat
Unknown Unknowns
DISCOVERIES
Data Available
Understanding
Known Unknowns
HYPOTHESES
Testing Exploring

@tyler_treat
“The army is now fully prepared
to fight the previous war.”

@tyler_treat
 
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events

@tyler_treat
Some 
challenges…
 
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise 
(or in some cases, too readily available)
- Many tools and products needed for 
different data and use cases
- Tool and data needs vary from team to 
team
- Ever-changing landscape of tools, products, 
and services
- Sheer volume of data can be overwhelming

@tyler_treat
System
Splunk
Universal
Forwarder

@tyler_treat
System
Splunk
Universal
Forwarder
Datadog Metrics
Agent
Datadog APM
Agent

@tyler_treat
System
Splunk
Universal
Forwarder
Datadog Metrics
Agent
Datadog APM
Agent
Universal
Analytics Client

@tyler_treat
System
Splunk
Universal
Forwarder
Datadog Metrics
Agent
Datadog APM
Agent
Universal
Analytics Client
Amazon Glacier
S3 Client

@tyler_treat
System
Splunk
Universal
Forwarder
Datadog APM
Agent
Universal
Analytics Client
Amazon Glacier
S3 Client
…
Datadog Metrics
Agent

[Diagram: a fleet of Systems, each running its own Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]

@tyler_treat
“Oh, actually we want to change
how we parse our logs.”

@tyler_treat
“Re-roll the agents.”

@tyler_treat
“Oh, actually we want to use
Sumo Logic for logging.”

[Diagram: a fleet of Systems, each running a Sumo Logic Collector, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, S3 Client, …]

@tyler_treat
“Oh, actually we want to use
New Relic for APM.”

[Diagram: a fleet of Systems, each running a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …]

@tyler_treat
“Oh, actually we want to evaluate
Honeycomb for debugging.”

[Diagram: a fleet of Systems, each running a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …, plus a Honeytail Agent]

@tyler_treat
You get the idea.

@tyler_treat
How big of a lift is it for your
organization to change tools?

@tyler_treat
How easy is it to experiment
with new ones?

@tyler_treat
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
What data to send?
Where to send it?
How to send it?

@tyler_treat
A decoupled approach

@tyler_treat
What data to send?
Where to send it?
How to send it?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
Observability Pipeline

@tyler_treat
Anatomy of an Observability Pipeline

@tyler_treat
1. Data Specifications
Structure your damn data.

@tyler_treat
log.error("User '{}' login failed".format(user))

@tyler_treat
ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

@tyler_treat
log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="tyler.treat@realkinetic.com",
          error=error)

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "tyler.treat@realkinetic.com",
  "error": "Invalid username or password",
  "message": "User login failed"
}
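A minimal sketch of how output like the above could be produced with only Python's standard `logging` module. The `JsonFormatter` class, the `make_logger` helper, and the `fields` key passed through `extra` are illustrative assumptions, not part of the talk; a structured logging library would normally do this for you.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}`.
        # The "fields" key is a convention made up for this sketch.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("auth")
log.error(
    "User login failed",
    extra={"fields": {
        "event": "user_login_error",
        "user": "tylertreat",
        "error": "Invalid username or password",
    }},
)
```

Every field is now a key that a downstream tool can filter or aggregate on, rather than text to be parsed back out of a message string.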

@tyler_treat
Pass a context object to
everything.

@tyler_treat
def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
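One way the context object might look, as a rough sketch. The `Context` class, the `build_event` helper, and the password check are hypothetical stand-ins; the talk shows only `ctx.set(...)` and the resulting JSON.

```python
import uuid


class Context:
    """Request-scoped bag of fields passed to every function call."""

    def __init__(self, **fields):
        # A fresh id ties together every event from one request.
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        self.fields.update(fields)
        return self

    def as_dict(self):
        return dict(self.fields)


def build_event(level, message, event, ctx, **extra):
    """Assemble a structured event that embeds the context."""
    entry = {
        "level": level,
        "event": event,
        "context": ctx.as_dict(),
        "message": message,
    }
    entry.update(extra)
    return entry


def login(ctx, username, password):
    ctx.set(user=username)  # the password never goes on the context
    if password != "correct horse":  # hypothetical credential check
        return build_event(
            "ERROR", "User login failed", "user_login_error", ctx,
            error="Invalid username or password",
        )
    return build_event("INFO", "User logged in", "user_login", ctx)


ctx = Context(http_method="POST", http_path="/login")
event = login(ctx, "tylertreat", "wrong password")
```

The request handler creates the context once; everything downstream only adds to it, so each event carries the request id, HTTP details, and user without every call site repeating them.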

@tyler_treat
What goes on the context?

@tyler_treat
What can you get for “free” and
what do you need to pass along?

@tyler_treat
Create standard specs for each data
type collected (logs, metrics, traces).

@tyler_treat
Specs can enforce required fields (e.g.
user id, license, trace id) and data types.

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd"
  }
}
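A spec that enforces required fields and data types could be checked along these lines. This is a sketch only: the field lists below are assumptions modeled on the example record, and a real spec would more likely be expressed in a schema language than in hand-written checks.

```python
# Hypothetical spec for log records: field name -> required Python type.
REQUIRED_LOG_FIELDS = {
    "timestamp": str,
    "level": str,
    "event": str,
    "context": dict,
}
REQUIRED_CONTEXT_FIELDS = {
    "id": str,
    "user_id": str,
    "license": str,
}


def _check(record, required, prefix=""):
    """Return a list of spec violations for one level of the record."""
    errors = []
    for name, expected in required.items():
        if name not in record:
            errors.append(f"missing field: {prefix}{name}")
        elif not isinstance(record[name], expected):
            errors.append(
                f"wrong type for {prefix}{name}: expected {expected.__name__}"
            )
    return errors


def validate_log(record):
    """Validate a log record against the spec; an empty list means valid."""
    errors = _check(record, REQUIRED_LOG_FIELDS)
    if isinstance(record.get("context"), dict):
        errors += _check(
            record["context"], REQUIRED_CONTEXT_FIELDS, prefix="context."
        )
    return errors
```

Running the same check in the logging library and again at the router catches malformed records before they reach a backend.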

@tyler_treat
Be mindful not to log sensitive
data like passwords.
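One possible guard is to redact known-sensitive keys before a record is emitted. The key list and the `[REDACTED]` marker below are assumptions for illustration; key-based redaction does not catch secrets embedded in free-text messages.

```python
# Hypothetical deny-list of field names that must never be logged.
SENSITIVE_KEYS = {"password", "passwd", "secret", "token", "authorization"}


def redact(value):
    """Return a copy of value with sensitive fields masked, recursively."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if str(key).lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```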

@tyler_treat
2. Specification Libraries
Specs alone aren’t enough!

@tyler_treat
Empowering developers requires
providing tools that align the “easy” path
with the “right” path.

@tyler_treat
We need libraries that implement the
specs and make it easy for devs to
instrument their systems.

@tyler_treat
There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.

@tyler_treat
For tracing and
metrics, there are
vendor-neutral APIs
like OpenTracing
and OpenCensus.

@tyler_treat
3. Data Collector
We need a lightweight agent that can
collect data from hosts/containers.

@tyler_treat
Collect data, perform transformations/
filters, and write it to the data pipeline.

@tyler_treat
Typically runs as an agent on the
host (DaemonSet in Kubernetes).

@tyler_treat
Data is written to stdout/stderr
or a Unix domain socket.
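To make the collector's job concrete, here is a rough sketch of the collect, transform, filter, and write steps as one function. The `collect` name, the level table, and the `host` enrichment are assumptions for illustration; this shows the shape of the work, not a replacement for an existing agent.

```python
import json
import socket

# Hypothetical severity ordering used for filtering.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(lines, write, min_level="INFO", host=None):
    """Parse raw lines, drop low-severity records, enrich, and forward.

    `lines` is any iterable of strings (for example a process's stdout);
    `write` is a callable that hands one record to the data pipeline.
    Returns the number of records forwarded.
    """
    host = host or socket.gethostname()
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Wrap unstructured output so downstream sees one shape.
            record = {"level": "INFO", "message": line}
        # Filter: skip anything below the configured severity.
        if LEVELS.get(record.get("level"), 20) < LEVELS[min_level]:
            continue
        # Transform: attach where the record came from.
        record["host"] = host
        write(record)
        forwarded += 1
    return forwarded
```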

@tyler_treat
Just use
Fluentd or
Logstash
(+Beats).

@tyler_treat
4. Data Pipeline
We need a scalable, fault-tolerant data
stream to handle the firehose of
observability data generated.

@tyler_treat
This also provides a buffer that
decouples producers from consumers.
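The buffering idea can be illustrated with an in-process bounded queue. This is only a toy stand-in: a real pipeline would use a durable, distributed stream, and the `run_pipeline` function and its `maxsize` default are assumptions for the sketch.

```python
import queue
import threading


def run_pipeline(events, maxsize=100):
    """Move events from a producer to a consumer through a bounded buffer."""
    buffer = queue.Queue(maxsize=maxsize)
    delivered = []
    done = object()  # sentinel marking the end of the stream

    def producer():
        for event in events:
            # put() blocks while the buffer is full, so a slow consumer
            # applies backpressure instead of losing data.
            buffer.put(event)
        buffer.put(done)

    def consumer():
        while True:
            item = buffer.get()
            if item is done:
                break
            delivered.append(item)

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=consumer),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return delivered
```

Neither side knows about the other; they share only the buffer, so consumers can be added or swapped without touching the producers.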

@tyler_treat
Lots of options…

@tyler_treat
5. Data Router
We need a component to consume data
from the pipeline, perform filtering, and
write it to the appropriate backends.

@tyler_treat
May perform transformations and processing of data,
but heavy processing should be the responsibility of a
backend system (e.g. alerting or aggregations).

@tyler_treat
This is where the data spec
comes into play.

@tyler_treat
The data type determines how
incoming data is routed.

@tyler_treat
Data Pipeline
Amazon Glacier
Data Router
logs
traces
metrics
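Routing by data type might be sketched as a small dispatch table. The `Router` class, the `type` field on each record, and the list-backed sinks are assumptions for illustration; real sinks would be calls to vendor APIs or storage clients.

```python
from collections import defaultdict


class Router:
    """Dispatch each record to the sinks registered for its data type."""

    def __init__(self):
        self.routes = defaultdict(list)

    def register(self, data_type, sink):
        """Attach a sink (any callable taking one record) to a data type."""
        self.routes[data_type].append(sink)

    def route(self, record):
        """Send the record to every matching sink; return how many got it."""
        sinks = self.routes.get(record.get("type"), [])
        for sink in sinks:
            sink(record)
        return len(sinks)


# Stand-in sinks: plain lists instead of real backends.
log_backend, archive, metrics_backend = [], [], []

router = Router()
router.register("logs", log_backend.append)
router.register("logs", archive.append)  # the same data can fan out
router.register("metrics", metrics_backend.append)

router.route({"type": "logs", "message": "User login failed"})
router.route({"type": "metrics", "name": "logins", "value": 1})
```

Changing or trialing a tool then means registering a different sink in one place, not redeploying agents across every host.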

@tyler_treat
This is primarily a stateless
component writing to APIs.

@tyler_treat
Good fit for
“serverless”
solutions.

@tyler_treat
Piecing It All Together

@tyler_treat
You don’t need to build it out all
in one go.

@tyler_treat
There are quick wins along the
way!

@tyler_treat
Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers

@tyler_treat
Moving from host-centric to
service-centric observability.

@tyler_treat
This maps to VMs and containers as
well as it does to “serverless” models.

@tyler_treat
Ops
Systems
Production
Product 
Development
Product 
Management
Security & 
Compliance
Support/ 
Helpdesk

@tyler_treat
Dev/Ops/SRE
Systems
Production
Audit
Business Analytics
Pricing Decisions
Data-Driven Product Decisions
Threat Detection
Monitoring
Debugging & Operational Insights
...

@tyler_treat
Dev/Ops/SRE
Systems
Production

@tyler_treat
Benefits
• The pattern can be adopted incrementally, with quick wins along the way
• Maps better to elastic and serverless architectures
• Empowers teams in siloed organizations and unlocks data for other parts
of the business
• Enables teams to use the tools best suited to their needs
• Decoupling makes it easier to change tools or evaluate them side by side
• Minimizes impact on developers and the core system

@tyler_treat
But it’s not a silver bullet.

@tyler_treat
Downsides
• Moving away from an agent-based model means we have to handle data
routing ourselves
• Many of the Data Router components might need to be custom-built
using vendor SDKs or client libraries (assuming the vendors expose
APIs)
• This also means we might lose some of the value-add features of
certain agents
• It is unclear how well this maps to pull-based models (e.g. Prometheus)

@tyler_treat
CI/CD Pipeline + 
Observability Pipeline

@tyler_treat
CI/CD
Pre-
Production 
(theorizing about
known unknowns)
Post-
Production 
(learning from
unknown unknowns)
Observability

@tyler_treat
Thank You
realkinetic.com 
bravenewgeek.com
