The Observability Graph; Knowledge Graphs for Automated Infrastructure Observability

The Observability Graph:
Knowledge Graphs for
Automated Infrastructure
Observability
Monitorama 22 October 2019

Homin Lee
Staff Data Scientist
homin@datadoghq.com

IMAGE
So many hosts, so many
containers, so many regions...
INFRASTRUCTURE

IMAGE
Microservices, part monolith/part
micro, serverless, orchestrated,
semi-orchestrated
SERVICES

IMAGE
High cardinality, lots and lots of tags
METRICS

IMAGE
So many requests. Spans coming from
everywhere
TRACES

IMAGE
So much variety; so many fields.
EVENTS

IMAGE
So much detail. So many lines.
LOGS

IMAGE
TOTAL INFORMATION AWARENESS

IDEAL
Service A Service B Requests
DB

IDEAL
DB
lag_metric{service:B}

IDEAL
DB
system.mem.pct_usable{DB, host:h111}

IMAGE
Amazing Observability
Ideal Environment

IMAGE
Ideal Environment
Easy Root Cause Analysis

IMAGE
Ideal Environment
(maybe even…)

IMAGE
Ideal Environment
Automatically Surface All
Contributing Factors

REALITY
service:A
service:A is an intern project with simulated traffic

REALITY
Service A Requests DB
lag_metric{lsv-systype:B}
Lsv-systype (???) B
Service X
Service Y
Service Z
Service F Service G

REALITY
Service A Service B Requests DB
proxy

IMAGE
AN ASIDE
https://xkcd.com/793/

IMAGE
The mirroring
hypothesis
a.k.a. Conway’s Law
a.k.a. You ship your org chart...
Colfer, Lyra J., and Carliss Y. Baldwin. "The mirroring hypothesis: theory, evidence, and exceptions."
Industrial and Corporate Change 25.5 (2016): 709-738.

IMAGE
Corollary:
Observability follows your org chart.

IMAGE
Corollary:
Observability follows your org chart.
app | storage

IMAGE
Gore’s
hypothesis
a.k.a. The prequel to Dunbar’s Number
a.k.a. The thing you heard about from The
Tipping Point
Zhou, W-X., et al. "Discrete hierarchical organization of social group sizes." Proceedings of the Royal
Society B: Biological Sciences 272.1561 (2005): 439-444.
Hamel, Gary, and B. Breen. "Building an innovation democracy: WL Gore." The Future of
Management (2007).

IMAGE
Corollary:
Observability standards completely fall
apart after 150 employees.

IMAGE
Synthesis
Without enforcement, observability
standards fall apart after 150 employees,
but still tend to follow the org chart.

ROOT CAUSE ANALYSIS
(in the real world)

The Situation
Negatives:
– Large-scale, messy, inconsistent observability data
– Labels are hard to come by
Positives:
– Domain knowledge
– Lots of user-interaction data

The Situation
Negatives:
Positives:
Machine Learning

The Situation
Negatives:
Positives:
Unsupervised
(or Semi-Supervised)
Machine Learning

The Situation
Negatives:
Positives:
Unsupervised
Known Entities
Machine Learning

The Situation
Negatives:
Positives:
Unsupervised
Known Entities
Relational Data
Machine Learning
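
One way to picture how user-interaction data could become relational data without labels is to count which known entities people touch together. The sketch below is an illustration only: the session format, the entity names, and the co-occurrence weighting are all assumptions, not something the slides specify.

```python
from collections import Counter
from itertools import combinations

# Hypothetical interaction log: each session lists the entities one user
# opened during a single investigation. Names are made up for illustration.
SESSIONS = [
    ["alert-1", "metric-1", "dashboard-1"],
    ["alert-1", "metric-1"],
    ["dashboard-1", "metric-2"],
]


def cooccurrence_weights(sessions):
    """Count how often each unordered pair of entities shares a session."""
    weights = Counter()
    for session in sessions:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(session)), 2):
            weights[(a, b)] += 1
    return weights


weights = cooccurrence_weights(SESSIONS)
print(weights[("alert-1", "metric-1")])  # 2
```

Pairs with higher counts are candidates for stronger relations between entities; no labeled training data is needed to compute them.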

IMAGE
Knowledge
Graphs
https://arxiv.org/abs/1504.08153

Observability Graphs
Alert A Metric M
Team T
Service S

Alert A Metric M
Team T
Alert B
Metric N
Service S

Alert A Metric M
Dashboard D Team T
Alert B
Metric N
Service S

Alert A Metric M
Dashboard D Team T
Alert B
Metric N
Metric O Alert C
Service S

Alert A Metric M
Dashboard D Team T
Service S Service R
Alert B
Metric N
Metric O Alert C
system.cpu.idle{role:R} “[R] CPU is high on R!”
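
The slide pairs a metric query with an alert title, which suggests that entity links can be read out of text people already wrote. A minimal sketch of that idea follows. The regular expressions, the relation names, the `alert-1` identifier, and the guess that `role:R` and a bracketed `[R]` both point at Service R are assumptions made for illustration, not a real query grammar or the speaker's method.

```python
import re

# Illustrative patterns only (assumptions): 'name{tag, key:value}' queries
# and bracketed prefixes such as '[R]' in alert titles.
QUERY_RE = re.compile(r"^(?P<metric>[\w.]+)\{(?P<scope>[^}]*)\}$")
TITLE_TAG_RE = re.compile(r"\[([^\]]+)\]")


def parse_metric_query(query):
    """Split 'name{k:v, ...}' into the metric name and a dict of scope tags."""
    match = QUERY_RE.match(query.strip())
    if match is None:
        raise ValueError(f"unrecognized query: {query!r}")
    tags = {}
    for part in match.group("scope").split(","):
        key, sep, value = part.strip().partition(":")
        if key:
            # A bare tag such as 'DB' has no value; record it as None.
            tags[key] = value if sep else None
    return match.group("metric"), tags


def edges_from_alert(alert_id, title, query):
    """Propose (source, relation, target) edges from one alert definition."""
    metric, tags = parse_metric_query(query)
    edges = [(alert_id, "monitors", f"metric:{metric}")]
    # Service candidates: bracketed title prefixes plus the 'role' tag.
    candidates = set(TITLE_TAG_RE.findall(title))
    if tags.get("role") is not None:
        candidates.add(tags["role"])
    for name in sorted(candidates):
        edges.append((alert_id, "concerns", f"service:{name}"))
    return edges


print(edges_from_alert("alert-1", "[R] CPU is high on R!", "system.cpu.idle{role:R}"))
```

For the slide's example this proposes one edge from the alert to the metric `system.cpu.idle` and one edge from the alert to a service named R. In messier data the two sources would often disagree, so such edges are better treated as suggestions than as facts.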

Alert A Metric M
Dashboard D Team T
Service S Service R
Alert B
Metric N
Metric O Alert C

Alert A Metric M
Dashboard D
Dashboard E
Team T
Service S Service R
Alert W Metric P
Alert B
Metric N
Metric O Alert C

Alert A Metric M
Dashboard D
Dashboard E
Team T
Service S Service R
Alert W Metric P
Alert B
Metric N
Metric O Alert C
Alert X Alert Y Alert Z

Alert A Metric M
Dashboard D
Dashboard E
Team T
Service S Service R
Alert W Metric P
Alert B
Metric N
Metric O Alert C
DB U
Service V
DB W

Alert A Metric M
Dashboard D
Dashboard E
Team T
Service S Service R
Alert W Metric P
Alert B
Metric N
Metric O Alert C
DB U
Service V
DB W
Metric L
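
To make the built-up graph concrete, here is a small sketch that stores some of the entities from these slides as an undirected graph and ranks metrics by how many hops they sit from a service. The node names come from the slides, but every edge in `EDGES` is an assumption (the extracted text does not preserve which nodes are connected), and breadth-first hop counting is just one simple way to surface candidates.

```python
from collections import defaultdict, deque

# Node names are taken from the slides; the edges are ASSUMED for illustration.
EDGES = [
    ("Team T", "Service S"),
    ("Team T", "Dashboard D"),
    ("Alert A", "Metric M"),
    ("Alert A", "Service S"),
    ("Dashboard D", "Metric M"),
    ("Dashboard D", "Metric N"),
    ("Alert B", "Metric N"),
    ("Alert C", "Metric O"),
    ("Alert C", "Service R"),
    ("Service S", "Service R"),
    ("Service R", "DB W"),
    ("DB W", "Metric L"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def hops_from(graph, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def related(graph, start, kind, max_hops):
    """Names starting with `kind` within max_hops of start, nearest first."""
    dist = hops_from(graph, start)
    hits = [(d, n) for n, d in dist.items()
            if n.startswith(kind) and 0 < d <= max_hops]
    return [n for _, n in sorted(hits)]


graph = build_graph(EDGES)
print(related(graph, "Service S", "Metric", 3))
# ['Metric M', 'Metric L', 'Metric N', 'Metric O']
```

Under these assumed edges, Metric L is reachable from Service S only through Service R and DB W, which is the kind of cross-team link that tag conventions alone would miss.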

REALITY
Service S Service R Requests
DB W
system.mem.pct_usable{DB, host:h111}

IMAGE
https://knowyourmeme.com/memes/galaxy-brain
CONCLUSIONS
Machine Learning and Knowledge Graphs
save the day!

IMAGE
Sam Cali
CONCLUSIONS
Observability data is created by people to
be consumed by people.
Monitoring tools and data are useless if
people can’t make sense of them.
By studying how people interact with this
data, we can increase the observability of
our systems.
