The document discusses cloud-native observability, emphasizing the importance of structured logging and data pipelines for effective monitoring and debugging in complex systems. It outlines challenges such as accessing observability data across different platforms, the necessity of standard specifications for data collection, and the implementation of an observability pipeline to handle data efficiently. Key components include structured logging, data collectors, and routers, which facilitate the seamless transfer and analysis of system behavior to enhance operational reliability.
@tyler_treat
Data Available
Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
22.
Data Available
Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
ASSUMPTIONS
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
FACTS
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
HYPOTHESES
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
DISCOVERIES
23.
Data Available
Understanding
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
DISCOVERIES
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
HYPOTHESES
Monitoring
Observability
24.
Data Available
Understanding
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
DISCOVERIES
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
HYPOTHESES
Testing
Exploring
Some challenges…
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming
Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers
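The first and last steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline from the talk: the application emits one JSON object per log line, and a small router hands each parsed event to whichever consumers want it. The names JsonFormatter, build_logger, and route, the "fields" convention for extra data, and the in-memory lists standing in for consumers are all hypothetical; a real deployment would put an out-of-process collector and a streaming system between the two halves.

```python
import io
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def route(lines, consumers):
    """Parse JSON lines and pass each event to every consumer whose predicate matches.

    `consumers` is a list of (predicate, sink) pairs; sinks are plain lists here.
    """
    for line in lines:
        event = json.loads(line)
        for predicate, sink in consumers:
            if predicate(event):
                sink.append(event)


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for stdout as captured by a collector
    log = build_logger("checkout", buffer)
    log.info("request served", extra={"fields": {"status": 200, "latency_ms": 12}})
    log.error("memory limit exceeded", extra={"fields": {"limit_mb": 1024}})

    errors, everything = [], []
    route(
        buffer.getvalue().splitlines(),
        [(lambda e: e["level"] == "ERROR", errors), (lambda e: True, everything)],
    )
    print(len(errors), len(everything))
```

Because every event is already a flat JSON object, adding a consumer is just another (predicate, sink) pair; nothing in the application has to change, which is the point of moving collection and routing out of process.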