Cloud-Native Observability

@tyler_treat
Cloud-Native Observability
Tyler Treat / Cloud Native - Madison / June 6, 2019

@tyler_treat
APM
Debugger
Profiler
SSH
Testing in Production at Scale, Amit Gud
grep

@tyler_treat
APM
Debugger
Profiler
SSH
System Behavior
Actual Customer Impact
???
grep

@tyler_treat
“Observability”

@tyler_treat
Post Hoc vs. Ad Hoc

@tyler_treat
Data Available
Understanding

@tyler_treat
Data Available
Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”

@tyler_treat
Data Available
Understanding
Known Knowns
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

@tyler_treat
Data Available
Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
Known Unknowns

@tyler_treat
Data Available
Understanding
Unknown Knowns
Known Knowns
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns

@tyler_treat
Data Available
Understanding
Known Knowns: FACTS
Known Unknowns: HYPOTHESES
Unknown Knowns: ASSUMPTIONS
Unknown Unknowns: DISCOVERIES

@tyler_treat
Data Available
Understanding
Known Unknowns: HYPOTHESES (Monitoring)
Unknown Unknowns: DISCOVERIES (Observability)

@tyler_treat
Data Available
Understanding
Known Unknowns: HYPOTHESES (Testing)
Unknown Unknowns: DISCOVERIES (Exploring)

@tyler_treat
 
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events

@tyler_treat
Some challenges…
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming

@tyler_treat
System
Splunk Universal Forwarder
Datadog APM Agent
Universal Analytics Client
Amazon Glacier
S3 Client
…
Datadog Metrics Agent

[Diagram: many Systems, each running its own copy of the same agents: Splunk Universal Forwarder, Datadog APM Agent, Universal Analytics Client, S3 Client, …, Datadog Metrics Agent]

@tyler_treat
How big of a lift is it for your
organization to change tools?

@tyler_treat
How easy is it to experiment
with new ones?

@tyler_treat
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
What data to send?
Where to send it?
How to send it?

@tyler_treat
A decoupled approach

@tyler_treat
What data to send?
Where to send it?
How to send it?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
Observability Pipeline

@tyler_treat
The Observability Pipeline

@tyler_treat
1. Data Specifications
Structure your damn data.

@tyler_treat
log.error("User '{}' login failed".format(user))

@tyler_treat
ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

@tyler_treat
log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="tyler.treat@realkinetic.com",
          error=error)

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "tyler.treat@realkinetic.com",
  "error": "Invalid username or password",
  "message": "User login failed"
}
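
Output of this shape does not require any particular library. The sketch below builds it on Python's standard `logging` module. `JsonFormatter`, `make_logger`, and the `fields` key passed through `extra` are names chosen for illustration; they are not from the talk's code.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object (illustrative helper)."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M.%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logging's `extra` mechanism,
        # under a key ("fields") that is an assumption of this sketch.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("structured-example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout so the result can be inspected
log = make_logger(buffer)
log.error(
    "User login failed",
    extra={"fields": {
        "event": "user_login_error",
        "user": "tylertreat",
        "error": "Invalid username or password",
    }},
)
print(buffer.getvalue().strip())
```

Because every entry is a JSON object with named fields, downstream tools can filter on `event` or `user` instead of parsing free text.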

@tyler_treat
Pass a context object to
everything.

@tyler_treat
def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
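
A context object can be very small. The following is a minimal sketch, assuming a request-scoped bag of fields that gets an id on creation and is nested under a `context` key when an event is built. The `Context` class and `build_event` helper are hypothetical and only mirror the `ctx.set(...)` call shown on the slide.

```python
import uuid


class Context:
    """Request-scoped bag of fields (hypothetical, not the talk's code)."""

    def __init__(self, **fields):
        # Every context gets an id so related events can be correlated.
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        self.fields.update(fields)

    def as_dict(self):
        return dict(self.fields)


def build_event(message, event, context, **extra):
    """Assemble a log entry that nests the context under one key."""
    entry = {"event": event, "context": context.as_dict(), "message": message}
    entry.update(extra)
    return entry


def login(ctx, username, password):
    ctx.set(user=username)
    # Authentication is stubbed out: every attempt fails, for illustration.
    return build_event("User login failed", "user_login_error", ctx,
                       error="Invalid username or password")


# The HTTP layer creates the context once and passes it down the call chain.
ctx = Context(http_method="POST", http_path="/login")
entry = login(ctx, "tylertreat", "not-the-password")
print(entry["context"])
```

Anything set on the context early in the request shows up on every later event without each call site having to repeat it.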

@tyler_treat
Create standard specs for each data
type collected (logs, metrics, traces).

@tyler_treat
Specs can enforce required fields (e.g. user id, license, trace id) and data types.

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd"
  }
}
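
Enforcing a spec can start as a simple check of required fields and their types. The sketch below is one possible shape for it. `LOG_SPEC`, `CONTEXT_SPEC`, and both function names are assumptions made here; a real spec would more likely live in a shared schema format.

```python
# Required fields and their expected types (assumed for this sketch).
LOG_SPEC = {"timestamp": str, "level": str, "event": str, "context": dict}
CONTEXT_SPEC = {"id": str, "user_id": str, "license": str}


def violations(entry, spec, prefix=""):
    """Return human-readable spec violations for one record."""
    problems = []
    for field, expected in spec.items():
        name = prefix + field
        if field not in entry:
            problems.append(f"missing {name}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{name} must be {expected.__name__}")
    return problems


def validate_log(entry):
    """Check the top-level fields, then the nested context if present."""
    problems = violations(entry, LOG_SPEC)
    if isinstance(entry.get("context"), dict):
        problems += violations(entry["context"], CONTEXT_SPEC, "context.")
    return problems


good = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {
        "id": "accfbb8315c44a52ad893ca6772e1caf",
        "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
        "license": "942e6543f0844be680e72003d5e060fd",
    },
}
bad = {"timestamp": 1554470802, "level": "INFO", "event": "user_login",
       "context": {"id": "accfbb8315c44a52ad893ca6772e1caf"}}

print(validate_log(good))
print(validate_log(bad))
```

A check like this can run in a logging library at write time, or later in the pipeline to reject or quarantine records that do not conform.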

@tyler_treat
2. Specification Libraries
Specs alone aren’t enough!

@tyler_treat
We need libraries.

@tyler_treat
There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.

@tyler_treat
For tracing and
metrics, there are
vendor-neutral APIs
like OpenTracing
and OpenCensus.

@tyler_treat
3. Data Collector
We need a lightweight agent that can collect data from hosts/containers.

@tyler_treat
Collect data, perform transformations/filters, and write it to the data pipeline.

@tyler_treat
Typically runs as an agent on the
host (DaemonSet in Kubernetes).

@tyler_treat
Data is written to stdout/stderr
or a Unix domain socket.
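
In practice this job belongs to an off-the-shelf agent, but the core loop is easy to sketch. The example below assumes the application writes one JSON object per line. The `collect` function, its `publish` callback, and the `host` field it adds are all hypothetical.

```python
import io
import json

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(source, publish, host="host-1", min_level="INFO"):
    """Read lines, drop low-severity records, enrich, and publish them."""
    threshold = LEVELS[min_level]
    for line in source:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Unstructured output is wrapped rather than discarded.
            record = {"level": "INFO", "message": line}
        # Filter: records below the threshold (or with an unknown level)
        # are dropped in this sketch.
        if LEVELS.get(str(record.get("level")), 0) < threshold:
            continue
        record["host"] = host  # transform: add collection metadata
        publish(record)


# A StringIO stands in for the application's stdout; a list stands in
# for the data pipeline.
app_stdout = io.StringIO(
    '{"level": "DEBUG", "message": "cache miss"}\n'
    '{"level": "ERROR", "message": "User login failed"}\n'
    'plain text line\n'
)
pipeline = []
collect(app_stdout, pipeline.append)
print(pipeline)
```

Keeping this logic outside the application process means the application only has to write to stdout, and the collection rules can change without a redeploy.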

@tyler_treat
Just use
Fluentd or
Logstash
(+Beats).

@tyler_treat
4. Data Pipeline
We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.

@tyler_treat
This also provides a buffer that
decouples producers from consumers.
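
The decoupling can be shown in miniature with a bounded in-memory queue. This is only an analogy: a real pipeline would be a durable, distributed stream, and the `producer` and `consumer` functions here are illustrative names.

```python
import queue
import threading


def producer(buf, records):
    """Write records to the buffer without knowing who reads them."""
    for record in records:
        buf.put(record)  # blocks only when the buffer is full
    buf.put(None)        # sentinel: no more data


def consumer(buf, sink):
    """Read records at the consumer's own pace until the sentinel arrives."""
    while True:
        record = buf.get()
        if record is None:
            break
        sink.append(record)


buf = queue.Queue(maxsize=100)  # in-memory stand-in for the data stream
records = [{"seq": i} for i in range(10)]
received = []

reader = threading.Thread(target=consumer, args=(buf, received))
writer = threading.Thread(target=producer, args=(buf, records))
reader.start()
writer.start()
writer.join()
reader.join()
print(len(received))
```

The producer never references the consumer, so either side can be swapped or scaled independently as long as both agree on the record format.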

@tyler_treat
Lots of options…

@tyler_treat
5. Data Router
We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.

@tyler_treat
This is where the data spec
comes into play.

@tyler_treat
The data shape determines how
incoming data is handled.
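
A router of this kind can be sketched as a classifier plus a table of sinks. The field names used to recognize each type (`trace_id`, `span_id`, `metric`, `value`, `level`, `message`) are assumptions for the example; in a real system they would come from the data specs. `classify` and `Router` are hypothetical names.

```python
def classify(record):
    """Infer the data type from the record's shape (assumed field names)."""
    if "trace_id" in record and "span_id" in record:
        return "trace"
    if "metric" in record and "value" in record:
        return "metric"
    if "level" in record and "message" in record:
        return "log"
    return "unknown"


class Router:
    """Stateless dispatch from data type to one or more sinks."""

    def __init__(self):
        self.routes = {}

    def register(self, kind, sink):
        self.routes.setdefault(kind, []).append(sink)

    def route(self, record):
        kind = classify(record)
        for sink in self.routes.get(kind, []):
            sink(record)
        return kind


# Lists stand in for backend API clients.
log_backend, trace_backend, metric_backend, archive = [], [], [], []
router = Router()
router.register("log", log_backend.append)
router.register("log", archive.append)  # one type can fan out to many sinks
router.register("trace", trace_backend.append)
router.register("metric", metric_backend.append)

incoming = [
    {"level": "ERROR", "message": "User login failed"},
    {"trace_id": "abc", "span_id": "1", "name": "POST /login"},
    {"metric": "login.failures", "value": 1},
]
kinds = [router.route(record) for record in incoming]
print(kinds)
```

Adding or replacing a backend is then a change to the routing table, not to the systems producing the data.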

@tyler_treat
Data Pipeline
Amazon Glacier
Data Router
logs
traces
metrics

@tyler_treat
This is primarily a stateless
component writing to APIs.

@tyler_treat
Good fit for “serverless” solutions.

@tyler_treat
Piecing It All Together

@tyler_treat
You don’t need to build it out all
in one go.

@tyler_treat
There are quick wins along the
way!

@tyler_treat
Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers

@tyler_treat
Dev/Ops/SRE
Systems
Production

@tyler_treat
CI/CD
Pre-Production (theorizing about known unknowns)
Post-Production (learning from unknown unknowns)
Observability

@tyler_treat
Trip Service
Flight Service
Hotel Service
Car Rental Service
DynamoDB
DynamoDB
DynamoDB
DynamoDB
Book Trip

@tyler_treat
Structured logging + context

@tyler_treat
And now here’s some YAML…

@tyler_treat
Kubernetes
Kinesis Lambda
CloudWatch
Jaeger
Stackdriver

@tyler_treat
Code: 
https://github.com/RealKinetic/cloud-native-meetup-2019

@tyler_treat
Thank You
realkinetic.com 
bravenewgeek.com
