@tyler_treat
Cloud-Native Observability
Tyler Treat / Cloud Native - Madison / June 6, 2019
Monitoring
APM
Debugger
Profiler
SSH
grep
System Behavior
Actual Customer Impact
Monitoring
Testing in Production at Scale, Amit Gud
APM
Debugger
Profiler
SSH
grep
System Behavior
Actual Customer Impact
???
“Observability”
Post Hoc vs. Ad Hoc
Axes: Data Available / Understanding
Known Knowns (FACTS)
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns (HYPOTHESES)
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Unknown Knowns (ASSUMPTIONS)
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Monitoring: Known Unknowns (HYPOTHESES)
Observability: Unknown Unknowns (DISCOVERIES)
Testing: Known Unknowns (HYPOTHESES)
Exploring: Unknown Unknowns (DISCOVERIES)
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events
Some challenges…
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming
[Diagram: every System runs its own set of agents and clients: Splunk Universal Forwarder, Datadog APM Agent, Datadog Metrics Agent, Universal Analytics Client, Amazon Glacier / S3 Client, …; the same set is repeated for each System]
How big of a lift is it for your organization to change tools?
How easy is it to experiment with new ones?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …
What data to send?
Where to send it?
How to send it?
A decoupled approach
[Diagram: Data Sources → Observability Pipeline → Data Sinks]
The Observability Pipeline
1. Data Specifications
Structure your damn data.
log.error("User '{}' login failed".format(user))
ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed
log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="tyler.treat@realkinetic.com",
          error=error)
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "tyler.treat@realkinetic.com",
  "error": "Invalid username or password",
  "message": "User login failed"
}
JSON is fine.
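As a rough illustration of the structured entry above, here is a minimal sketch that emits one JSON object per log line using only Python's standard `logging` module. The `JsonFormatter` and `make_logger` names, and the convention of passing fields through `extra={"fields": ...}`, are assumptions made for this example; a library such as structlog would normally handle this.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}.
        # This is a convention of the sketch, not a logging-module feature.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


log = make_logger("auth")
log.error("User login failed", extra={"fields": {
    "event": "user_login_error",
    "user": "tylertreat",
    "error": "Invalid username or password",
}})
```

Because every entry is a flat JSON object, downstream tools can filter on `event` or `user` without parsing free-form message strings.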
Pass a context object to everything.
def login(ctx, username, email, password):
    # Attach request-scoped fields to the context once.
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "tyler.treat@realkinetic.com"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
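One possible shape for such a context object is sketched below. The `Context` class and the `log_event` helper are hypothetical and exist only for this example; authentication is stubbed so that a log entry is always produced, and the timestamp is omitted for brevity.

```python
import json
import uuid


class Context:
    """Request-scoped bag of fields attached to every log entry (hypothetical)."""

    def __init__(self, **fields):
        # A random 32-character hex id ties together all entries for one request.
        self.fields = {"id": uuid.uuid4().hex}
        self.fields.update(fields)

    def set(self, **fields):
        self.fields.update(fields)
        return self


def log_event(level, message, event, context, **extra):
    """Build a structured entry with a nested context and return it as JSON."""
    entry = {"level": level, "event": event, "context": dict(context.fields)}
    entry.update(extra)
    entry["message"] = message
    return json.dumps(entry)


def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    # Authentication is stubbed out: this sketch always takes the failure path.
    return log_event("ERROR", "User login failed", "user_login_error", ctx,
                     error="Invalid username or password")


ctx = Context(http_method="POST", http_path="/login")
print(login(ctx, "tylertreat", "tyler.treat@realkinetic.com", "not-a-real-password"))
```

Fields set early in the request (HTTP method, path, user) then appear on every later log line without each call site repeating them.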
Create standard specs for each data type collected (logs, metrics, traces).
Specs can enforce required fields (e.g. user id, license, trace id) and data types.
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd",
    "email": "tyler.treat@realkinetic.com"
  }
}
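To make the idea of a spec concrete, the following sketch checks an entry for required fields and data types. The spec contents (`LOG_SPEC`, `CONTEXT_SPEC`) and the function names are assumptions chosen to mirror the example entry, not a published standard.

```python
# Field name -> required Python type. Both specs are assumed for illustration.
LOG_SPEC = {"timestamp": str, "level": str, "event": str, "context": dict}
CONTEXT_SPEC = {"id": str, "user_id": str, "license": str}


def check_fields(data, spec, prefix=""):
    """Return a list of violations of `spec` found in `data`."""
    problems = []
    for field, expected in spec.items():
        name = prefix + field
        if field not in data:
            problems.append(f"missing required field: {name}")
        elif not isinstance(data[field], expected):
            problems.append(
                f"{name}: expected {expected.__name__}, "
                f"got {type(data[field]).__name__}")
    return problems


def validate_log(entry):
    """Validate a log entry and its nested context; an empty list means it conforms."""
    problems = check_fields(entry, LOG_SPEC)
    context = entry.get("context")
    if isinstance(context, dict):
        problems += check_fields(context, CONTEXT_SPEC, prefix="context.")
    return problems


entry = {
    "timestamp": "2019-04-05 13:26.42",
    "level": "INFO",
    "event": "user_login",
    "context": {"id": "accfbb8315c44a52ad893ca6772e1caf"},
}
for problem in validate_log(entry):
    print(problem)
```

A check like this could run in a logging library at write time or in the router at read time; where to enforce it is a design choice the spec itself does not dictate.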
2. Specification Libraries
Specs alone aren’t enough!
We need libraries.
There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.
For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.
3. Data Collector
We need a lightweight agent that can collect data from hosts/containers.
Collect data, perform transformations/filters, and write it to the data pipeline.
Typically runs as an agent on the host (DaemonSet in Kubernetes).
Data is written to stdout/stderr or a Unix domain socket.
Just use Fluentd or Logstash (+Beats).
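In practice Fluentd or Logstash fills this role. Purely to illustrate the collect, filter, transform, and forward loop such an agent performs, here is a minimal sketch; the `collect` function, the level table, and the `host` enrichment field are assumptions for this example.

```python
import json
import socket

# Severity ordering used for filtering (assumed for this sketch).
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(lines, emit, min_level="INFO", host=None):
    """Parse log lines, drop low-severity entries, enrich, and forward.

    `lines` is any iterable of strings (for example a process's stdout);
    `emit` stands in for the write to the data pipeline.
    Returns the number of entries forwarded.
    """
    host = host or socket.gethostname()
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            # Wrap unstructured output instead of silently dropping it.
            entry = {"level": "INFO", "message": line, "unstructured": True}
        level = str(entry.get("level", "INFO")).upper()
        if LEVELS.get(level, LEVELS["INFO"]) < LEVELS[min_level]:
            continue  # filter: below the configured severity
        entry["host"] = host  # transform: enrich with where it came from
        emit(entry)
        forwarded += 1
    return forwarded


sample = [
    '{"level": "DEBUG", "message": "cache miss"}',
    '{"level": "ERROR", "event": "user_login_error", "message": "User login failed"}',
]
collect(sample, lambda entry: print(json.dumps(entry)))
```

Keeping this logic out of the application process means services only write to stdout, and collection policy can change without redeploying them.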
4. Data Pipeline
We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.
This also provides a buffer that decouples producers from consumers.
Lots of options…
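The buffering idea can be shown in miniature with an in-memory queue. This is only an analogy: a real pipeline would be a durable, distributed stream, and the bounded `queue.Queue` here is neither fault-tolerant nor scalable. The `producer` and `consumer` functions are hypothetical names for this sketch.

```python
import queue
import threading


def producer(buffer, entries):
    """Write entries to the buffer without knowing who will read them."""
    for entry in entries:
        buffer.put(entry)  # blocks while the buffer is full (backpressure)
    buffer.put(None)  # sentinel: no more data


def consumer(buffer, sink):
    """Read entries at the consumer's own pace and hand them to a sink."""
    while True:
        entry = buffer.get()
        if entry is None:
            break
        sink.append(entry)


buffer = queue.Queue(maxsize=100)  # bounded, so a slow consumer slows the producer
entries = [{"seq": i} for i in range(1000)]
received = []

consumer_thread = threading.Thread(target=consumer, args=(buffer, received))
producer_thread = threading.Thread(target=producer, args=(buffer, entries))
consumer_thread.start()
producer_thread.start()
producer_thread.join()
consumer_thread.join()

print(len(received), "entries delivered")
```

The producer never references the consumer or its sink, which is the property that lets new data consumers be added later without touching the systems that emit data.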
5. Data Router
We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.
This is where the data spec comes into play.
The data shape determines how incoming data is handled.
[Diagram: Data Pipeline → Data Router, which routes logs, traces, and metrics to backends such as Amazon Glacier]
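A minimal sketch of shape-based routing follows. The field names used to recognize traces and metrics (`trace_id`, `span_id`, `metric`, `value`) and the `Router` class are assumptions for illustration; a real router would classify records according to the organization's own data specs, and the backends would be API clients rather than lists.

```python
def classify(record):
    """Infer the data type from which fields are present (assumed field names)."""
    if "trace_id" in record and "span_id" in record:
        return "trace"
    if "metric" in record and "value" in record:
        return "metric"
    if "level" in record and "event" in record:
        return "log"
    return "unknown"


class Router:
    """Fan each record out to every backend registered for its type."""

    def __init__(self):
        self.routes = {}

    def register(self, kind, backend):
        self.routes.setdefault(kind, []).append(backend)

    def route(self, record):
        kind = classify(record)
        backends = self.routes.get(kind, [])
        for backend in backends:
            backend(record)
        return kind, len(backends)


# Lists stand in for real backends.
log_index, cold_storage, tracing, metrics = [], [], [], []
router = Router()
router.register("log", log_index.append)
router.register("log", cold_storage.append)  # logs are also archived
router.register("trace", tracing.append)
router.register("metric", metrics.append)

router.route({"level": "ERROR", "event": "user_login_error"})
router.route({"trace_id": "abc", "span_id": "1", "name": "POST /login"})
router.route({"metric": "login.failures", "value": 1})
```

Adding or swapping a backend becomes a one-line registration change in the router instead of a new agent on every host.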
This is primarily a stateless component writing to APIs.
Good fit for “serverless” solutions.
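As one illustration of a stateless, serverless router, the sketch below is shaped like an AWS Lambda handler triggered by a Kinesis stream, where each record's payload arrives base64-encoded under `Records[].kinesis.data`. The `write` parameter is a hypothetical stand-in for a call to a backend API, and the assumption that every payload is a JSON object belongs to this example.

```python
import base64
import json


def decode_records(event):
    """Yield parsed JSON payloads from a Kinesis-triggered Lambda event."""
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(payload)


def handler(event, context, write=print):
    """Stateless entry point: decode each record and pass it to a backend writer.

    No state survives between invocations; everything needed is in `event`.
    """
    count = 0
    for entry in decode_records(event):
        write(entry)
        count += 1
    return {"processed": count}


# Local usage with a hand-built event; no AWS account is involved.
payload = json.dumps({"level": "INFO", "event": "user_login"}).encode()
sample_event = {
    "Records": [{"kinesis": {"data": base64.b64encode(payload).decode()}}]
}
print(handler(sample_event, None))
```

Because the handler keeps nothing in memory between calls, the platform can run as many copies as the stream's volume requires.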
Piecing It All Together
You don’t need to build it out all in one go.
There are quick wins along the way!
Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers
Dev/Ops/SRE
Systems
Production
CI/CD
Pre-Production (theorizing about known unknowns)
Post-Production (learning from unknown unknowns)
Observability
Part 2: Demo
Book Trip
Trip Service
Flight Service
Hotel Service
Car Rental Service
DynamoDB (×4)
Structured logging + context
Kubernetes
And now here’s some YAML…
Kinesis
AWS Lambda
[Diagram: Kubernetes → Kinesis → Lambda → CloudWatch, Jaeger, Stackdriver]
Code: https://github.com/RealKinetic/cloud-native-meetup-2019
Thank You
realkinetic.com
bravenewgeek.com