Visualizing Systems with Statemaps

Bryan Cantrill
CTO
bryan@joyent.com
@bcantrill
The stack of abstraction
• Our software systems are built as stacks of abstraction
• These stacks allow us to stand on the shoulders of history — to
reuse components without rebuilding them
• We can do this because of the software paradox: software is
both information and machine, exhibiting properties of both
• Our stacks are higher and run deeper than we can see or know:
software is opaque; the nature of abstraction is to seal us from
what runs beneath!
Run silent, run deep
• Not only is the stack deep, it is silent
• Running software emits neither light nor heat; it makes no
sound; it attracts no mass; it (mostly) has no odor
• Running software is — by all conventional notions — unseeable
• This generally isn’t a bad thing, as long as it all works…
Hurricanes from butterflies
• When the stack of abstraction performs pathologically, its power
transmogrifies to peril: layering amplifies performance
pathologies but hinders insight
• Work amplifies as we go down the stack
• Latency amplifies as we go up the stack
• Seemingly minor issues in one layer can cascade into systemic
pathological performance…
• As the system becomes dominated by its outliers, butterflies
spawn hurricanes of pathological performance
Debugging the hurricanes
• Understanding a pathologically performing system is
excruciatingly difficult:
• Symptoms are often far removed from root cause
• There may not be a single root cause but several
• The system is dynamic and may change without warning
• Improvements to the system are hard to model and verify
• Emphatically, this is not “tuning” — it is debugging
How do we debug?
• To debug methodically, we must resist the temptation to quick
hypotheses, focusing rather on questions and observations
• Iterating between questions and observations gathers the facts
that will constrain future hypotheses
• These facts can be used to disconfirm hypotheses!
• How do we ask questions?
• How do we make observations?
Asking questions
• For performance debugging, the initial question formulation is
particularly challenging: where does one start?
• Resource-centric methodologies like the USE Method
(Utilization/Saturation/Errors) can be excellent starting points…
• But keep these methodologies in their context: they provide
initial questions to ask — they are not recipes for debugging
arbitrary performance pathologies!
Making observations
• Questions are answered through observation
• But — reminder! — software cannot be conventionally seen!
• It is up to the system itself to have the capacity to be seen
• This capacity is the system’s observability — and without it, we
are reduced to guessing
• Do not conflate software observability with control theory’s
definition of observability!
• Software is observable when it can answer your question about
its behavior — software observability is not a boolean!
The pillars of observability
• Much has been made of the so-called “pillars of observability”:
monitoring, logging and instrumentation
• Each of these is important, for each has within it the capacity to
answer questions about the system
• But each also has limitations!
• Their shared limitation: each can only be as effective as the
observer — they cannot answer questions not asked!
• Observability seeks to answer questions asked and prompt new
ones: the human is the foundation of observability!
Observability through instrumentation
• Static instrumentation modifies source to provide semantically
relevant information, e.g., via logging or counters
• Dynamic instrumentation allows for the system to be changed
while running to emit data, e.g. DTrace, OpenTracing
• Both mechanisms of instrumentation are essential!
• Static instrumentation provides the observations necessary for
early question formulation…
• Dynamic instrumentation answers deeper, ad hoc questions
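As a minimal sketch of the static half of this distinction, the snippet below compiles a counter and a log line into a request handler. The handler, the logger name and the counter keys are all hypothetical and exist only for illustration; dynamic instrumentation is not shown, because it depends on a facility such as DTrace attaching to the running process.

```python
import logging
from collections import Counter

log = logging.getLogger("orders")   # hypothetical subsystem name
counters = Counter()                # in-process counters a monitor could read


def handle_request(kind: str) -> None:
    """Hypothetical request handler with instrumentation written into the source."""
    counters[f"requests.{kind}"] += 1             # always-on counter
    log.debug("handling request kind=%s", kind)   # semantically relevant log line


for kind in ["read", "read", "write"]:
    handle_request(kind)
```

The counter answers "how many requests of each kind?" cheaply and at all times, which is the sort of observation that helps formulate a first question.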
Aggregation
• When instrumenting the system, it can become overwhelmed
with the overhead of instrumentation
• Aggregation is essential for scalable, non-invasive
instrumentation — and is a first-class primitive in (e.g.) DTrace
• But aggregation also eliminates important dimensions of data,
especially with respect to time; some questions may only be
answered with disaggregated data!
• Use aggregation for performance debugging — but also
understand its limits!
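The trade-off can be illustrated with a small sketch. The events below are made up, and the power-of-two bucketing is only loosely modeled on the distributions that DTrace's quantize() aggregation produces; it is not DTrace's implementation.

```python
from collections import Counter

# Hypothetical raw events: (timestamp in ms, latency in microseconds).
events = [(0, 90), (1, 110), (2, 95), (500, 4000), (501, 4100), (502, 3900)]


def bucket(value: int) -> int:
    """Return the largest power of two <= value (0 for values below 1)."""
    return 1 << (value.bit_length() - 1) if value > 0 else 0


# Aggregated view: a latency histogram that no longer knows about time.
histogram = Counter(bucket(latency) for _, latency in events)

# Disaggregated view: the timestamps of the slow events are still available.
slow_times = [t for t, latency in events if latency >= 2048]
```

The histogram shows that half of the requests were slow, but only the disaggregated data shows that every slow request happened in one burst around the 500 ms mark.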
Visualization
• The visual cortex is unparalleled at detecting patterns
• The value of visualizing data is not merely providing answers,
but also (and especially) provoking new questions
• Our systems are so large, complicated and abstract that there is
not one way to visualize them, but many
• The visualization of systems and their representations is an
essential facet of system observability!
Visualization: Gnuplot
• Graphs are terrific — so much so that we should not restrict
ourselves to the captive graphs found in bundled software!
• An ad hoc plotting tool is essential for performance debugging;
and Gnuplot is an excellent (if idiosyncratic) one
• Gnuplot is easily combined with workhorses like awk or perl
• That Gnuplot is an essential tool helps to set expectations
around performance debugging tools: they are not magicians!
Visualization: Heatmaps
Visualization: Flamegraphs
Visualization: Statemaps
• Flamegraphs help understand the work a system is doing, but
how does one visualize a system that isn’t doing work?
• That is, idleness is a common pathology in a suboptimal
system; there is a hidden bottleneck — but where?
• To explore these kinds of problems, we have developed
statemaps, a visualization of entity state over time
Statemap input data
• Statemaps operate on a payload of concatenated JSON where
each line corresponds to a state transition for an entity:

{ "time": "52524411", "entity": "30080", "state": 0 }
{ "time": "52587486", "entity": "30137", "state": 0 }
{ "time": "52769425", "entity": "30080", "state": 4 }
{ "time": "52895402", "entity": "30137", "state": 1 }
{ "time": "53177670", "entity": "62308", "state": 0 }
{ "time": "53230742", "entity": "30137", "state": 0 }
{ "time": "53268043", "entity": "30137", "state": 1 }
{ "time": "53562441", "entity": "62308", "state": 4 }
{ "time": "53616633", "entity": "30137", "state": 0 }
{ "time": "53762283", "entity": "30137", "state": 6 }

…
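A minimal sketch of consuming this stream, assuming one JSON object per line, input ordered by time, and a "time" field that is an integer offset encoded as a string (the unit is not stated on the slide). The function name and the choice to ignore each entity's final, open-ended state are assumptions made for illustration.

```python
import json
from collections import defaultdict

# Sample transitions copied from the slide: one JSON object per line.
payload = """\
{ "time": "52524411", "entity": "30080", "state": 0 }
{ "time": "52587486", "entity": "30137", "state": 0 }
{ "time": "52769425", "entity": "30080", "state": 4 }
{ "time": "52895402", "entity": "30137", "state": 1 }
"""


def time_in_state(lines):
    """Sum, per entity, the time spent in each state.

    A state lasts from its transition until the entity's next transition;
    an entity's final state is open-ended and is not counted here.
    """
    last = {}                                        # entity -> (time, state)
    totals = defaultdict(lambda: defaultdict(int))   # entity -> state -> time
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        entity, now = rec["entity"], int(rec["time"])
        if entity in last:
            then, state = last[entity]
            totals[entity][state] += now - then
        last[entity] = (now, rec["state"])
    return totals


totals = time_in_state(payload.splitlines())
```

Here `totals["30080"][0]` is the time entity 30080 spent in state 0 before moving to state 4.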
Statemap input data
• States are described in a JSON metadata header, e.g.:

{
    "start": [ 1544138397, 322335287 ],
    "title": "PostgreSQL statemap on HAB01436, by process ID",
    "host": "HAB01436",
    "entityKind": "Process",
    "states": {
        "on-cpu": {"value": 0, "color": "#DAF7A6" },
        "off-cpu-waiting": {"value": 1, "color": "#f9f9f9" },
        "off-cpu-semop": {"value": 2, "color": "#FF5733" },
        "off-cpu-blocked": {"value": 3, "color": "#C70039" },
        "off-cpu-zfs-read": {"value": 4, "color": "#FFC300" },
        "off-cpu-zfs-write": {"value": 5, "color": "#338AFF" },
        "off-cpu-zil-commit": {"value": 6, "color": "#66FFCC" },
        "off-cpu-tx-delay": {"value": 7, "color": "#CCFF00" },
        "off-cpu-dead": {"value": 8, "color": "#E0E0E0" },
        "wal-init": {"value": 9, "color": "#dd1871" },
        "wal-init-tx-delay": {"value": 10, "color": "#fd4bc9" }
    }
}
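Because the input is plain JSON, any program can produce it. The sketch below writes a metadata header followed by one transition per line. The function name `emit_statemap` and its parameters are hypothetical; it also assumes that "start" is a [seconds, nanoseconds] pair and that state values are numbered in declaration order, which matches the example header but is not spelled out on the slide.

```python
import json
import time


def emit_statemap(title, entity_kind, states, transitions, start=None):
    """Return statemap input text: a metadata header, then one transition per line.

    `states` maps state name -> color; values are assigned in dict order.
    `transitions` is an iterable of (time_offset, entity, state_name).
    `start` is an assumed (seconds, nanoseconds) pair; defaults to now.
    """
    if start is None:
        start = divmod(time.time_ns(), 1_000_000_000)
    header = {
        "start": [start[0], start[1]],
        "title": title,
        "entityKind": entity_kind,
        "states": {name: {"value": i, "color": color}
                   for i, (name, color) in enumerate(states.items())},
    }
    lines = [json.dumps(header, indent=4)]
    for offset, entity, state in transitions:
        lines.append(json.dumps({
            "time": str(offset),
            "entity": str(entity),
            "state": header["states"][state]["value"],
        }))
    return "\n".join(lines) + "\n"


text = emit_statemap(
    "Example statemap", "Process",
    {"on-cpu": "#DAF7A6", "off-cpu-waiting": "#f9f9f9"},
    [(0, 30080, "on-cpu"), (245014, 30080, "off-cpu-waiting")],
    start=(1544138397, 322335287),
)
```

Looking state values up by name keeps the transition records consistent with the header, so a renamed or reordered state cannot silently change the meaning of the numbers.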
Statemap output
• Statemap rendering code processes the JSON stream and
renders it into an SVG that is the actual statemap
• SVG can be manipulated interactively (zoomed, panned,
highlighted, etc.) but also stands independently
• Statemaps are entirely neutral with respect to methodology!
Instrumentation for statemaps
• Statemaps themselves — like gnuplot — are entirely generic to
input data: they visualize arbitrary state over arbitrary time
• We have developed example statemap-generating dynamic
instrumentation for database, CPU, I/O, filesystem operations
• The data rate in terms of state transitions per second varies
based on what is being instrumented: from <10/sec to >1M/sec
Coalescing states
• For even modestly large inputs, adjacent states must be
coalesced to allow for reasonable visualization
• When this aggregation is required, the statemap rendering code
coalesces the least significant two adjacent states — allowing
for larger trends to stay intact
• The threshold at which states are coalesced can be dynamically
adjusted to allow for higher resolution
• Importantly, the original data retains all state transitions!
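One way to picture coalescing is the sketch below. It is an illustration under assumptions, not the renderer's actual code: it reads "least significant" as "shortest combined duration", treats each entity's timeline as a list of contiguous rectangles, and uses a simple quadratic search where a real implementation would want something faster.

```python
def coalesce(rects, limit):
    """Merge adjacent rectangles until at most `limit` remain.

    Each rectangle is (start, duration, {state: time_in_state}). At each step
    the adjacent pair with the smallest combined duration is merged, and the
    merged rectangle keeps a per-state breakdown so it could be drawn as a blend.
    """
    rects = [(s, d, dict(w)) for s, d, w in rects]   # copy; leave input intact
    while len(rects) > max(limit, 1):
        i = min(range(len(rects) - 1),
                key=lambda k: rects[k][1] + rects[k + 1][1])
        (s0, d0, w0), (_, d1, w1) = rects[i], rects[i + 1]
        merged = dict(w0)
        for state, t in w1.items():
            merged[state] = merged.get(state, 0) + t
        rects[i:i + 2] = [(s0, d0 + d1, merged)]
    return rects


timeline = [
    (0, 100, {0: 100}),     # long run in state 0
    (100, 2, {1: 2}),       # brief blip in state 1
    (102, 3, {0: 3}),       # brief return to state 0
    (105, 200, {4: 200}),   # long run in state 4
]
coarse = coalesce(timeline, limit=3)
```

The two brief rectangles collapse into one blended rectangle while the long runs on either side stay intact, and `timeline` itself is left unmodified.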
Tagged statemaps
• We have found it useful to be able to tag states with immutable
information that describes the context around the state
• For example, tagging a state for CPU execution with immutable
context information (process, thread, etc.)
• The tag definition occurs separately in the stream, e.g.:

{ "state": 0, "tag": "d136827", "pid": "51943", "tid": "1",
  "execname": "postgres",
  "psargs": "/opt/postgresql/9.6.3/bin/postgres -D /manatee/pg/data" }

…

{ "time": "330931", "entity": "12", "state": 0, "tag": "d136827" }
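A consumer can join the two kinds of records back together. The sketch below assumes that a record without a "time" field is a tag definition and that definitions are keyed by the (state, tag) pair; both are inferences from the example above rather than documented rules, and `resolve_tags` is a hypothetical helper name.

```python
import json

# Records copied from the slide, with each object on a single line.
stream = [
    '{ "state": 0, "tag": "d136827", "pid": "51943", "tid": "1", '
    '"execname": "postgres", '
    '"psargs": "/opt/postgresql/9.6.3/bin/postgres -D /manatee/pg/data" }',
    '{ "time": "330931", "entity": "12", "state": 0, "tag": "d136827" }',
]


def resolve_tags(lines):
    """Attach each tag definition's context to the transitions that cite it."""
    tags = {}        # (state, tag) -> immutable context fields
    resolved = []
    for line in lines:
        rec = json.loads(line)
        if "time" not in rec:
            # Assumed to be a tag definition: it carries context, not a transition.
            tags[(rec["state"], rec["tag"])] = {
                k: v for k, v in rec.items() if k not in ("state", "tag")
            }
        else:
            context = tags.get((rec.get("state"), rec.get("tag")), {})
            resolved.append({**rec, "context": context})
    return resolved


resolved = resolve_tags(stream)
```

Defining the context once and citing it by tag keeps the per-transition records small, which matters at high transition rates.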
Stacked statemaps
• We have found it useful to be able to stack statemaps from
either disjoint sources or disjoint machines
• Allows for activity in one domain or machine to be tightly
correlated with activity in another domain or machine
• Across machines, can be subject to wall clock skew…
• …but if wall clocks are skewing within the datacenter, there are
likely bigger problems!
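Stacking depends on placing every statemap on a common timeline. The sketch below shows one way to do that, assuming each statemap's "start" is a [seconds, nanoseconds] wall-clock pair and each transition time is a nanosecond offset from that start. These units are assumptions, as are the helper names; clock skew between machines is not corrected.

```python
NANOSEC = 1_000_000_000


def to_epoch_ns(start):
    """Convert an assumed [seconds, nanoseconds] pair to integer nanoseconds."""
    sec, nsec = start
    return sec * NANOSEC + nsec


def align(statemaps):
    """Shift each statemap's transitions onto the earliest start time.

    `statemaps` is a list of (start, transitions), where transitions are
    (offset_ns, entity, state). Returns (base_epoch_ns, shifted_transition_lists).
    """
    base = min(to_epoch_ns(start) for start, _ in statemaps)
    aligned = []
    for start, transitions in statemaps:
        shift = to_epoch_ns(start) - base
        aligned.append([(t + shift, e, s) for t, e, s in transitions])
    return base, aligned


# Two hypothetical sources whose traces began about 0.68 seconds apart.
machine_a = ([1544138397, 322335287], [(0, "30080", 0), (1000, "30080", 4)])
machine_b = ([1544138398, 0], [(0, "cpu0", 1)])
base, (a_shifted, b_shifted) = align([machine_a, machine_b])
```

After alignment, a transition at offset 0 on the later-starting machine lands at the correct point relative to the earlier one, so activity on both can be read against the same horizontal axis.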
Stacked statemaps across domains
Stacked statemaps across machines
Stacked statemaps across many machines?
Statemaps
• Statemaps provide a generic and system-neutral tool for
visualizing system state over time
• Statemaps use visualization to prompt questions
• Statemaps work in concert with system observability facilities
that can answer the questions that statemaps raise
• We must keep the human in mind when developing for
observability — the capacity to answer arbitrary questions is
only as effective as the human asking them!
• Statemap renderer: https://github.com/joyent/statemap
1 of 31

Recommended

Kubernetes Architecture by
 Kubernetes Architecture Kubernetes Architecture
Kubernetes ArchitectureKnoldus Inc.
3K views17 slides
Introduction to Kubernetes by
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetesrajdeep
46.8K views39 slides
Interfacing C/C++ and Python with SWIG by
Interfacing C/C++ and Python with SWIGInterfacing C/C++ and Python with SWIG
Interfacing C/C++ and Python with SWIGDavid Beazley (Dabeaz LLC)
14.5K views115 slides
RHEL8 Kernel Management Manual in Korean by
RHEL8 Kernel Management Manual in KoreanRHEL8 Kernel Management Manual in Korean
RHEL8 Kernel Management Manual in KoreanJun Hee Shin
273 views87 slides
OpenTelemetry For Architects by
OpenTelemetry For ArchitectsOpenTelemetry For Architects
OpenTelemetry For ArchitectsKevin Brockhoff
844 views32 slides
Observability in Java: Getting Started with OpenTelemetry by
Observability in Java: Getting Started with OpenTelemetryObservability in Java: Getting Started with OpenTelemetry
Observability in Java: Getting Started with OpenTelemetryDevOps.com
379 views26 slides

More Related Content

What's hot

Adopting OpenTelemetry by
Adopting OpenTelemetryAdopting OpenTelemetry
Adopting OpenTelemetryVincent Behar
235 views23 slides
RedisConf17 - Distributed Java Map Structures and Services with Redisson by
RedisConf17 - Distributed Java Map Structures and Services with RedissonRedisConf17 - Distributed Java Map Structures and Services with Redisson
RedisConf17 - Distributed Java Map Structures and Services with RedissonRedis Labs
2.6K views25 slides
Kubernetes by
KubernetesKubernetes
Kuberneteserialc_w
3K views19 slides
The Business Event Bus by
The Business Event BusThe Business Event Bus
The Business Event BusJoris Meijer
1.5K views37 slides
OpenTelemetry For Developers by
OpenTelemetry For DevelopersOpenTelemetry For Developers
OpenTelemetry For DevelopersKevin Brockhoff
2.2K views44 slides
Monitoring Kubernetes with Prometheus by
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with PrometheusGrafana Labs
4.2K views35 slides

What's hot(20)

RedisConf17 - Distributed Java Map Structures and Services with Redisson by Redis Labs
RedisConf17 - Distributed Java Map Structures and Services with RedissonRedisConf17 - Distributed Java Map Structures and Services with Redisson
RedisConf17 - Distributed Java Map Structures and Services with Redisson
Redis Labs2.6K views
Kubernetes by erialc_w
KubernetesKubernetes
Kubernetes
erialc_w3K views
The Business Event Bus by Joris Meijer
The Business Event BusThe Business Event Bus
The Business Event Bus
Joris Meijer1.5K views
Monitoring Kubernetes with Prometheus by Grafana Labs
Monitoring Kubernetes with PrometheusMonitoring Kubernetes with Prometheus
Monitoring Kubernetes with Prometheus
Grafana Labs4.2K views
Serf / Consul 入門 ~仕事を楽しくしよう~ by Masahito Zembutsu
Serf / Consul 入門 ~仕事を楽しくしよう~Serf / Consul 入門 ~仕事を楽しくしよう~
Serf / Consul 入門 ~仕事を楽しくしよう~
Masahito Zembutsu19.3K views
Kubernetes #6 advanced scheduling by Terry Cho
Kubernetes #6   advanced schedulingKubernetes #6   advanced scheduling
Kubernetes #6 advanced scheduling
Terry Cho8K views
Introduction to Container Storage Interface (CSI) by Idan Atias
Introduction to Container Storage Interface (CSI)Introduction to Container Storage Interface (CSI)
Introduction to Container Storage Interface (CSI)
Idan Atias216 views
Kubernetes #4 volume &amp; stateful set by Terry Cho
Kubernetes #4   volume &amp; stateful setKubernetes #4   volume &amp; stateful set
Kubernetes #4 volume &amp; stateful set
Terry Cho2.3K views
Présentation docker et kubernetes by Kiwi Backup
Présentation docker et kubernetesPrésentation docker et kubernetes
Présentation docker et kubernetes
Kiwi Backup1.3K views
Exaリーディングのすゝめ by Shinichi Makino
ExaリーディングのすゝめExaリーディングのすゝめ
Exaリーディングのすゝめ
Shinichi Makino1.2K views
Fonctionnalites et performances des cni pour Kubernetes - devops d-day 2018 by Alexis Ducastel
Fonctionnalites et performances des cni pour Kubernetes - devops d-day 2018Fonctionnalites et performances des cni pour Kubernetes - devops d-day 2018
Fonctionnalites et performances des cni pour Kubernetes - devops d-day 2018
Alexis Ducastel781 views
Everything You Need To Know About Persistent Storage in Kubernetes by The {code} Team
Everything You Need To Know About Persistent Storage in KubernetesEverything You Need To Know About Persistent Storage in Kubernetes
Everything You Need To Know About Persistent Storage in Kubernetes
The {code} Team2.2K views
How Does Kubernetes Build OpenAPI Specifications? by reallavalamp
How Does Kubernetes Build OpenAPI Specifications?How Does Kubernetes Build OpenAPI Specifications?
How Does Kubernetes Build OpenAPI Specifications?
reallavalamp303 views
Free GitOps Workshop + Intro to Kubernetes & GitOps by Weaveworks
Free GitOps Workshop + Intro to Kubernetes & GitOpsFree GitOps Workshop + Intro to Kubernetes & GitOps
Free GitOps Workshop + Intro to Kubernetes & GitOps
Weaveworks180 views
Replacing iptables with eBPF in Kubernetes with Cilium by Michal Rostecki
Replacing iptables with eBPF in Kubernetes with CiliumReplacing iptables with eBPF in Kubernetes with Cilium
Replacing iptables with eBPF in Kubernetes with Cilium
Michal Rostecki469 views

Similar to Visualizing Systems with Statemaps

The Hurricane's Butterfly: Debugging pathologically performing systems by
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systemsbcantrill
5.7K views27 slides
From Pipelines to Refineries: Scaling Big Data Applications by
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
1.2K views34 slides
RAJAT PROJECT.pptx by
RAJAT PROJECT.pptxRAJAT PROJECT.pptx
RAJAT PROJECT.pptxSayedMohdAsim2
6 views32 slides
Is Spark the right choice for data analysis ? by
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
231 views22 slides
Deep dive time series anomaly detection with different Azure Data Services by
Deep dive time series anomaly detection with different Azure Data ServicesDeep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data ServicesMarco Parenzan
181 views50 slides
Time Series Anomaly Detection with .net and Azure by
Time Series Anomaly Detection with .net and AzureTime Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and AzureMarco Parenzan
146 views48 slides

Similar to Visualizing Systems with Statemaps(20)

The Hurricane's Butterfly: Debugging pathologically performing systems by bcantrill
The Hurricane's Butterfly: Debugging pathologically performing systemsThe Hurricane's Butterfly: Debugging pathologically performing systems
The Hurricane's Butterfly: Debugging pathologically performing systems
bcantrill5.7K views
From Pipelines to Refineries: Scaling Big Data Applications by Databricks
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data Applications
Databricks1.2K views
Is Spark the right choice for data analysis ? by Ahmed Kamal
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
Ahmed Kamal231 views
Deep dive time series anomaly detection with different Azure Data Services by Marco Parenzan
Deep dive time series anomaly detection with different Azure Data ServicesDeep dive time series anomaly detection with different Azure Data Services
Deep dive time series anomaly detection with different Azure Data Services
Marco Parenzan181 views
Time Series Anomaly Detection with .net and Azure by Marco Parenzan
Time Series Anomaly Detection with .net and AzureTime Series Anomaly Detection with .net and Azure
Time Series Anomaly Detection with .net and Azure
Marco Parenzan146 views
Building a Database for the End of the World by jhugg
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
jhugg426 views
Building an Experimentation Platform in Clojure by Srihari Sriraman
Building an Experimentation Platform in ClojureBuilding an Experimentation Platform in Clojure
Building an Experimentation Platform in Clojure
Srihari Sriraman431 views
Performance tuning Grails applications by GR8Conf
 Performance tuning Grails applications Performance tuning Grails applications
Performance tuning Grails applications
GR8Conf774 views
Innovation with SAP HANA using customisation - What are my options by Lars Breddemann
Innovation with SAP HANA using customisation - What are my optionsInnovation with SAP HANA using customisation - What are my options
Innovation with SAP HANA using customisation - What are my options
Lars Breddemann4.4K views
Zebras all the way down: The engineering challenges of the data path by bcantrill
Zebras all the way down: The engineering challenges of the data pathZebras all the way down: The engineering challenges of the data path
Zebras all the way down: The engineering challenges of the data path
bcantrill17.2K views
Velocity 2015 linux perf tools by Brendan Gregg
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
Brendan Gregg1.1M views
Performance tuning Grails applications by Lari Hotari
Performance tuning Grails applicationsPerformance tuning Grails applications
Performance tuning Grails applications
Lari Hotari3.9K views
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ... by confluent
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...
confluent1.8K views
Introduction to Java Profiling by Jerry Yoakum
Introduction to Java ProfilingIntroduction to Java Profiling
Introduction to Java Profiling
Jerry Yoakum139 views
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine by Aleksandr Tavgen
Observability -  The good, the bad and the ugly Xp Days 2019 Kiev Ukraine Observability -  The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
Observability - The good, the bad and the ugly Xp Days 2019 Kiev Ukraine
Aleksandr Tavgen55 views
I pushed in production :). Have a nice weekend by Nicolas Carlier
I pushed in production :). Have a nice weekendI pushed in production :). Have a nice weekend
I pushed in production :). Have a nice weekend
Nicolas Carlier736 views
Practical deep learning for computer vision by Eran Shlomo
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
Eran Shlomo849 views

More from bcantrill

Predicting the Present by
Predicting the PresentPredicting the Present
Predicting the Presentbcantrill
30 views17 slides
Sharpening the Axe: The Primacy of Toolmaking by
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmakingbcantrill
248 views23 slides
Coming of Age: Developing young technologists without robbing them of their y... by
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...bcantrill
370 views21 slides
I have come to bury the BIOS, not to open it: The need for holistic systems by
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systemsbcantrill
1.6K views20 slides
Towards Holistic Systems by
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systemsbcantrill
5.7K views13 slides
The Coming Firmware Revolution by
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolutionbcantrill
1.2K views12 slides

More from bcantrill(20)

Predicting the Present by bcantrill
Predicting the PresentPredicting the Present
Predicting the Present
bcantrill30 views
Sharpening the Axe: The Primacy of Toolmaking by bcantrill
Sharpening the Axe: The Primacy of ToolmakingSharpening the Axe: The Primacy of Toolmaking
Sharpening the Axe: The Primacy of Toolmaking
bcantrill248 views
Coming of Age: Developing young technologists without robbing them of their y... by bcantrill
Coming of Age: Developing young technologists without robbing them of their y...Coming of Age: Developing young technologists without robbing them of their y...
Coming of Age: Developing young technologists without robbing them of their y...
bcantrill370 views
I have come to bury the BIOS, not to open it: The need for holistic systems by bcantrill
I have come to bury the BIOS, not to open it: The need for holistic systemsI have come to bury the BIOS, not to open it: The need for holistic systems
I have come to bury the BIOS, not to open it: The need for holistic systems
bcantrill1.6K views
Towards Holistic Systems by bcantrill
Towards Holistic SystemsTowards Holistic Systems
Towards Holistic Systems
bcantrill5.7K views
The Coming Firmware Revolution by bcantrill
The Coming Firmware RevolutionThe Coming Firmware Revolution
The Coming Firmware Revolution
bcantrill1.2K views
Hardware/software Co-design: The Coming Golden Age by bcantrill
Hardware/software Co-design: The Coming Golden AgeHardware/software Co-design: The Coming Golden Age
Hardware/software Co-design: The Coming Golden Age
bcantrill1.9K views
Tockilator: Deducing Tock execution flows from Ibex Verilator traces by bcantrill
Tockilator: Deducing Tock execution flows from Ibex Verilator tracesTockilator: Deducing Tock execution flows from Ibex Verilator traces
Tockilator: Deducing Tock execution flows from Ibex Verilator traces
bcantrill475 views
No Moore Left to Give: Enterprise Computing After Moore's Law by bcantrill
No Moore Left to Give: Enterprise Computing After Moore's LawNo Moore Left to Give: Enterprise Computing After Moore's Law
No Moore Left to Give: Enterprise Computing After Moore's Law
bcantrill4.1K views
Andreessen's Corollary: Ethical Dilemmas in Software Engineering by bcantrill
Andreessen's Corollary: Ethical Dilemmas in Software EngineeringAndreessen's Corollary: Ethical Dilemmas in Software Engineering
Andreessen's Corollary: Ethical Dilemmas in Software Engineering
bcantrill2.2K views
Platform values, Rust, and the implications for system software by bcantrill
Platform values, Rust, and the implications for system softwarePlatform values, Rust, and the implications for system software
Platform values, Rust, and the implications for system software
bcantrill6.8K views
Is it time to rewrite the operating system in Rust? by bcantrill
Is it time to rewrite the operating system in Rust?Is it time to rewrite the operating system in Rust?
Is it time to rewrite the operating system in Rust?
bcantrill27.4K views
dtrace.conf(16): DTrace state of the union by bcantrill
dtrace.conf(16): DTrace state of the uniondtrace.conf(16): DTrace state of the union
dtrace.conf(16): DTrace state of the union
bcantrill836 views
Papers We Love: ARC after dark by bcantrill
Papers We Love: ARC after darkPapers We Love: ARC after dark
Papers We Love: ARC after dark
bcantrill3.1K views
Principles of Technology Leadership by bcantrill
Principles of Technology LeadershipPrinciples of Technology Leadership
Principles of Technology Leadership
bcantrill5.4K views
Platform as reflection of values: Joyent, node.js, and beyond by bcantrill
Platform as reflection of values: Joyent, node.js, and beyondPlatform as reflection of values: Joyent, node.js, and beyond
Platform as reflection of values: Joyent, node.js, and beyond
bcantrill13.2K views
Debugging under fire: Keeping your head when systems have lost their mind by bcantrill
Debugging under fire: Keeping your head when systems have lost their mindDebugging under fire: Keeping your head when systems have lost their mind
Debugging under fire: Keeping your head when systems have lost their mind
bcantrill3.6K views
Down Memory Lane: Two Decades with the Slab Allocator by bcantrill
Down Memory Lane: Two Decades with the Slab AllocatorDown Memory Lane: Two Decades with the Slab Allocator
Down Memory Lane: Two Decades with the Slab Allocator
bcantrill3.2K views
The State of Cloud 2016: The whirlwind of creative destruction by bcantrill
The State of Cloud 2016: The whirlwind of creative destructionThe State of Cloud 2016: The whirlwind of creative destruction
The State of Cloud 2016: The whirlwind of creative destruction
bcantrill6.5K views
Oral tradition in software engineering: Passing the craft across generations by bcantrill
Oral tradition in software engineering: Passing the craft across generationsOral tradition in software engineering: Passing the craft across generations
Oral tradition in software engineering: Passing the craft across generations
bcantrill4.2K views

Recently uploaded

.NET Deserialization Attacks by
.NET Deserialization Attacks.NET Deserialization Attacks
.NET Deserialization AttacksDharmalingam Ganesan
7 views50 slides
Automated Testing of Microsoft Power BI Reports by
Automated Testing of Microsoft Power BI ReportsAutomated Testing of Microsoft Power BI Reports
Automated Testing of Microsoft Power BI ReportsRTTS
11 views20 slides
POS Software in Bangladesh.pdf by
POS Software in Bangladesh.pdfPOS Software in Bangladesh.pdf
POS Software in Bangladesh.pdfSEOServiceProviderBa
6 views1 slide
Techstack Ltd at Slush 2023, Ukrainian delegation by
Techstack Ltd at Slush 2023, Ukrainian delegationTechstack Ltd at Slush 2023, Ukrainian delegation
Techstack Ltd at Slush 2023, Ukrainian delegationViktoriiaOpanasenko
7 views4 slides
Streamlining Your Business Operations with Enterprise Application Integration... by
Streamlining Your Business Operations with Enterprise Application Integration...Streamlining Your Business Operations with Enterprise Application Integration...
Streamlining Your Business Operations with Enterprise Application Integration...Flexsin
5 views12 slides
Using Qt under LGPL-3.0 by
Using Qt under LGPL-3.0Using Qt under LGPL-3.0
Using Qt under LGPL-3.0Burkhard Stubert
14 views11 slides

Recently uploaded(20)

Automated Testing of Microsoft Power BI Reports by RTTS
Automated Testing of Microsoft Power BI ReportsAutomated Testing of Microsoft Power BI Reports
Automated Testing of Microsoft Power BI Reports
RTTS11 views
Streamlining Your Business Operations with Enterprise Application Integration... by Flexsin
Streamlining Your Business Operations with Enterprise Application Integration...Streamlining Your Business Operations with Enterprise Application Integration...
Streamlining Your Business Operations with Enterprise Application Integration...
Flexsin 5 views
tecnologia18.docx by nosi6702
tecnologia18.docxtecnologia18.docx
tecnologia18.docx
nosi67026 views
predicting-m3-devopsconMunich-2023-v2.pptx by Tier1 app
predicting-m3-devopsconMunich-2023-v2.pptxpredicting-m3-devopsconMunich-2023-v2.pptx
predicting-m3-devopsconMunich-2023-v2.pptx
Tier1 app14 views
Top-5-production-devconMunich-2023-v2.pptx by Tier1 app
Top-5-production-devconMunich-2023-v2.pptxTop-5-production-devconMunich-2023-v2.pptx
Top-5-production-devconMunich-2023-v2.pptx
Tier1 app9 views
Mobile App Development Company by Richestsoft
Mobile App Development CompanyMobile App Development Company
Mobile App Development Company
Richestsoft 5 views
Google Solutions Challenge 2024 Talk pdf by MohdAbdulAleem4
Google Solutions Challenge 2024 Talk pdfGoogle Solutions Challenge 2024 Talk pdf
Google Solutions Challenge 2024 Talk pdf
MohdAbdulAleem434 views
Transport Management System - Shipment & Container Tracking by Freightoscope
Transport Management System - Shipment & Container TrackingTransport Management System - Shipment & Container Tracking
Transport Management System - Shipment & Container Tracking
Freightoscope 6 views

Visualizing Systems with Statemaps

  • 1. Visualizing Systems with Statemaps CTO bryan@joyent.com Bryan Cantrill @bcantrill
  • 2. The stack of abstraction • Our software systems are built as stacks of abstraction • These stacks allow us to stand on the shoulders of history — to reuse components without rebuilding them • We can do this because of the software paradox: software is both information and machine, exhibiting properties of both • Our stacks are higher and run deeper than we can see or know: software is opaque; the nature of abstraction is to seal us from what runs beneath!
  • 3. Run silent, run deep • Not only is the stack deep, it is silent • Running software emits neither light nor heat; it makes no sound; it attracts no mass; it (mostly) has no odor • Running software is — by all conventional notions — unseeable • This generally isn’t a bad thing, as long as it all works…
  • 4. Hurricanes from butterflies • When the stack of abstraction performs pathologically, its power transmogrifies to peril: layering amplifies performance pathologies but hinders insight • Work amplifies as we go down the stack • Latency amplifies as we go up the stack • Seemingly minor issues in one layer can cascade into systemic pathological performance… • As the system becomes dominated by its outliers, butterflies spawn hurricanes of pathological performance
  • 5. Debugging the hurricanes • Understanding a pathologically performing system is excruciatingly difficult: • Symptoms are often far removed from root cause • There may not be a single root cause but several • The system is dynamic and may change without warning • Improvements to the system are hard to model and verify • Emphatically, this is not “tuning” — it is debugging
  • 6. How do we debug? • To debug methodically, we must resist the temptation to quick hypotheses, focusing rather on questions and observations • Iterating between questions and observations gathers the facts that will constrain future hypotheses • These facts can be used to disconfirm hypotheses! • How do we ask questions? • How do we make observations?
  • 7. Asking questions • For performance debugging, the initial question formulation is particularly challenging: where does one start? • Resource-centric methodologies like the USE Method (Utilization/Saturation/Errors) can be excellent starting points… • But keep these methodologies in their context: they provide initial questions to ask — they are not recipes for debugging arbitrary performance pathologies!
  • 8. Making observations • Questions are answered through observation • But (reminder!) software cannot be conventionally seen! • It is up to the system itself to have the capacity to be seen • This capacity is the system’s observability, and without it we are reduced to guessing • Do not conflate software observability with control theory’s definition of observability! • Software is observable when it can answer your question about its behavior; software observability is not a boolean!
  • 9. The pillars of observability • Much has been made of the so-called “pillars of observability”: monitoring, logging and instrumentation • Each of these is important, for each has within it the capacity to answer questions about the system • But each also has limitations! • Their shared limitation: each can only be as effective as the observer — they cannot answer questions not asked! • Observability seeks to answer questions asked and prompt new ones: the human is the foundation of observability!
  • 10. Observability through instrumentation • Static instrumentation modifies source to provide semantically relevant information, e.g., via logging or counters • Dynamic instrumentation allows for the system to be changed while running to emit data, e.g. DTrace, OpenTracing • Both mechanisms of instrumentation are essential! • Static instrumentation provides the observations necessary for early question formulation… • Dynamic instrumentation answers deeper, ad hoc questions
  • 11. Aggregation • When instrumenting the system, it can become overwhelmed with the overhead of instrumentation • Aggregation is essential for scalable, non-invasive instrumentation — and is a first-class primitive in (e.g.) DTrace • But aggregation also eliminates important dimensions of data, especially with respect to time; some questions may only be answered with disaggregated data! • Use aggregation for performance debugging — but also understand its limits!
  • 12. Visualization • The visual cortex is unparalleled at detecting patterns • The value of visualizing data is not merely providing answers, but also (and especially) provoking new questions • Our systems are so large, complicated and abstract that there is not one way to visualize them, but many • The visualization of systems and their representations is an essential facet of system observability!
  • 13. Visualization: Gnuplot • Graphs are terrific — so much so that we should not restrict ourselves to the captive graphs found in bundled software! • An ad hoc plotting tool is essential for performance debugging; and Gnuplot is an excellent (if idiosyncratic) one • Gnuplot is easily combined with workhorses like awk or perl • That Gnuplot is an essential tool helps to set expectation around performance debugging tools: they are not magicians!
  • 16. Visualization: Statemaps • Flamegraphs help understand the work a system is doing, but how does one visualize a system that isn’t doing work? • That is, idleness is a common pathology in a suboptimal system; there is a hidden bottleneck — but where? • To explore these kinds of problems, we have developed statemaps, a visualization of entity state over time
  • 18. Statemap input data • Statemaps operate on a payload of concatenated JSON where each line corresponds to a state transition for an entity:
 
 { "time": "52524411", "entity": "30080", "state": 0 }
 { "time": "52587486", "entity": "30137", "state": 0 }
 { "time": "52769425", "entity": "30080", "state": 4 }
 { "time": "52895402", "entity": "30137", "state": 1 }
 { "time": "53177670", "entity": "62308", "state": 0 }
 { "time": "53230742", "entity": "30137", "state": 0 }
 { "time": "53268043", "entity": "30137", "state": 1 }
 { "time": "53562441", "entity": "62308", "state": 4 }
 { "time": "53616633", "entity": "30137", "state": 0 }
 { "time": "53762283", "entity": "30137", "state": 6 }
 …
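
To make the input format concrete, here is a minimal Python sketch (not the statemap renderer's own code) that reads a payload of concatenated JSON and turns each entity's transitions into (start, end, state) intervals. The function names are hypothetical, and the sketch assumes only the three fields shown above, treating "time" as an opaque integer.

```python
import json
from collections import defaultdict


def parse_concatenated_json(payload):
    """Yield each JSON object from a string of back-to-back JSON values."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(payload)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and payload[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(payload, pos)
        yield obj


def intervals_by_entity(transitions):
    """Turn state transitions into (start, end, state) intervals per entity."""
    last = {}  # entity -> (time, state) of its most recent transition
    intervals = defaultdict(list)
    for t in transitions:
        time, entity, state = int(t["time"]), t["entity"], t["state"]
        if entity in last:
            prev_time, prev_state = last[entity]
            intervals[entity].append((prev_time, time, prev_state))
        last[entity] = (time, state)
    return dict(intervals)


SAMPLE = '''
{ "time": "52524411", "entity": "30080", "state": 0 }
{ "time": "52587486", "entity": "30137", "state": 0 }
{ "time": "52769425", "entity": "30080", "state": 4 }
{ "time": "52895402", "entity": "30137", "state": 1 }
'''

result = intervals_by_entity(parse_concatenated_json(SAMPLE))
```

The final state of each entity stays open-ended in this sketch, because no later transition closes it.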
  • 19. Statemap input data • States are described in JSON metadata header, e.g.:
 
 
 {
 "start": [ 1544138397, 322335287 ],
 "title": "PostgreSQL statemap on HAB01436, by process ID",
 "host": "HAB01436",
 "entityKind": "Process",
 "states": {
 "on-cpu": {"value": 0, "color": "#DAF7A6" },
 "off-cpu-waiting": {"value": 1, "color": "#f9f9f9" },
 "off-cpu-semop": {"value": 2, "color": "#FF5733" },
 "off-cpu-blocked": {"value": 3, "color": "#C70039" },
 "off-cpu-zfs-read": {"value": 4, "color": "#FFC300" },
 "off-cpu-zfs-write": {"value": 5, "color": "#338AFF" },
 "off-cpu-zil-commit": {"value": 6, "color": "#66FFCC" },
 "off-cpu-tx-delay": {"value": 7, "color": "#CCFF00" },
 "off-cpu-dead": {"value": 8, "color": "#E0E0E0" },
 "wal-init": {"value": 9, "color": "#dd1871" },
 "wal-init-tx-delay": {"value": 10, "color": "#fd4bc9" }
 }
 }
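
As an illustration of how the header relates to the transition stream, the following Python sketch inverts the "states" table so that the numeric "state" in a transition can be mapped back to its name and color. It uses a subset of the header above; `state_lookup` is a hypothetical helper, not part of the statemap tooling.

```python
import json

# A subset of the metadata header shown above.
METADATA = json.loads('''
{
  "title": "PostgreSQL statemap on HAB01436, by process ID",
  "entityKind": "Process",
  "states": {
    "on-cpu": {"value": 0, "color": "#DAF7A6" },
    "off-cpu-waiting": {"value": 1, "color": "#f9f9f9" },
    "off-cpu-zfs-read": {"value": 4, "color": "#FFC300" },
    "off-cpu-zil-commit": {"value": 6, "color": "#66FFCC" }
  }
}
''')


def state_lookup(metadata):
    """Map each numeric state value to its (name, color) pair."""
    return {
        spec["value"]: (name, spec["color"])
        for name, spec in metadata["states"].items()
    }


lookup = state_lookup(METADATA)
name, color = lookup[4]  # the state seen in { ..., "state": 4 }
```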
  • 20. Statemap output • Statemap rendering code processes the JSON stream and renders it into an SVG that is the actual statemap • SVG can be manipulated interactively (zoomed, panned, highlighted, etc.) but also stands independently • Statemaps are entirely neutral with respect to methodology!
  • 21. Instrumentation for statemaps • Statemaps themselves — like gnuplot — are entirely generic to input data: they visualize arbitrary state over arbitrary time • We have developed example statemap-generating dynamic instrumentation for database, CPU, I/O, filesystem operations • The data rate in terms of state transitions per second varies based on what is being instrumented: from <10/sec to >1M/sec
  • 22. Coalescing states • For even modestly large inputs, adjacent states must be coalesced to allow for reasonable visualization • When this aggregation is required, the statemap rendering code coalesces the least significant two adjacent states — allowing for larger trends to stay intact • The threshold at which states are coalesced can be dynamically adjusted to allow for higher resolution • Importantly, the original data retains all state transitions!
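
The idea of coalescing can be sketched in a few lines of Python. This is a simplified illustration, not the renderer's algorithm: it assumes that "least significant" means the adjacent pair with the shortest combined duration, and it keeps a per-state duration breakdown in each merged rectangle so that no time is lost. The `coalesce` function and the `max_rects` threshold are hypothetical names.

```python
def coalesce(rects, max_rects):
    """Merge adjacent rectangles until at most max_rects remain.

    rects: time-ordered, contiguous list of (start, end, {state: duration}).
    Assumption: the "least significant" pair is the adjacent pair with the
    shortest combined duration.
    """
    rects = [(s, e, dict(d)) for s, e, d in rects]  # work on a copy
    while len(rects) > max_rects and len(rects) > 1:
        # Index of the adjacent pair spanning the least time.
        i = min(range(len(rects) - 1),
                key=lambda k: rects[k + 1][1] - rects[k][0])
        (s0, _, d0), (_, e1, d1) = rects[i], rects[i + 1]
        merged = dict(d0)
        for state, dur in d1.items():
            merged[state] = merged.get(state, 0) + dur
        rects[i:i + 2] = [(s0, e1, merged)]
    return rects


# One entity: long on-cpu, two very short states, then a long wait.
rects = [
    (0, 100, {0: 100}),
    (100, 101, {4: 1}),
    (101, 103, {0: 2}),
    (103, 200, {1: 97}),
]
coalesced = coalesce(rects, max_rects=3)
```

Raising `max_rects` in this sketch corresponds to adjusting the threshold for higher resolution; the input list is never modified, mirroring the point that the original data retains all state transitions.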
  • 25. Tagged statemaps • We have found it useful to be able to tag states with immutable information that describes the context around the state • For example, tagging a state for CPU execution with immutable context information (process, thread, etc.) • Tag occurs separately in the stream, e.g.:
 
 { "state": 0, "tag": "d136827", "pid": "51943", "tid": "1", "execname": "postgres", "psargs": "/opt/postgresql/9.6.3/bin/postgres -D /manatee/pg/data" }
 …
 { "time": "330931", "entity": "12", "state": 0, "tag": "d136827" }
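
One way a consumer might join the two kinds of records is sketched below in Python. The sketch makes assumptions that the slides do not state: a record without a "time" field is taken to be a tag definition, tags are keyed by the (state, tag) pair, and a definition appears in the stream before any transition that uses it. `resolve_tags` is a hypothetical helper.

```python
def resolve_tags(records):
    """Attach tag context to each transition that references a tag."""
    tags = {}      # (state, tag) -> immutable context fields
    resolved = []
    for rec in records:
        if "time" not in rec:
            # Assumed to be a tag definition: everything except the key
            # fields is context (pid, tid, execname, ...).
            key = (rec["state"], rec["tag"])
            tags[key] = {k: v for k, v in rec.items()
                         if k not in ("state", "tag")}
        else:
            context = tags.get((rec["state"], rec.get("tag")), {})
            resolved.append({**rec, "context": context})
    return resolved


records = [
    {"state": 0, "tag": "d136827", "pid": "51943", "tid": "1",
     "execname": "postgres"},
    {"time": "330931", "entity": "12", "state": 0, "tag": "d136827"},
]
resolved = resolve_tags(records)
```

Because the context is stored once per tag rather than on every transition, the stream stays compact even when the same process is scheduled many times.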
  • 27. Stacked statemaps • We have found it useful to be able to stack statemaps from either disjoint sources or disjoint machines • Allows for activity in one domain or machine to be tightly correlated with activity in another domain or machine • Across machines, can be subject to wall clock skew… • …but if wall clocks are skewing within the datacenter, there are likely bigger problems!
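
A rough Python sketch of how transitions from two sources could be placed on one timeline follows. It assumes that "start" in each metadata header is a [seconds, nanoseconds] wall-clock pair and that each "time" is a nanosecond offset from that start; both are assumptions here, and the second source's values are invented for illustration. As noted above, any wall-clock skew between machines carries straight through.

```python
NSEC_PER_SEC = 1_000_000_000


def absolute_ns(start, offset):
    """Wall-clock nanoseconds for an offset from a [sec, nsec] start."""
    sec, nsec = start
    return sec * NSEC_PER_SEC + nsec + int(offset)


def stack(sources):
    """Merge (metadata, transitions) pairs onto one shared timeline.

    Times in the result are nanoseconds since the earliest start.
    """
    base = min(absolute_ns(meta["start"], 0) for meta, _ in sources)
    merged = []
    for index, (meta, transitions) in enumerate(sources):
        for t in transitions:
            merged.append({
                "source": index,
                "time": absolute_ns(meta["start"], t["time"]) - base,
                "entity": t["entity"],
                "state": t["state"],
            })
    merged.sort(key=lambda t: t["time"])
    return merged


# Two hypothetical sources whose captures started 200 ns apart.
source_a = ({"start": [1544138397, 322335287]},
            [{"time": "500", "entity": "30080", "state": 0}])
source_b = ({"start": [1544138397, 322335487]},
            [{"time": "100", "entity": "cpu0", "state": 1}])
stacked = stack([source_a, source_b])
```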
  • 30. Stacked statemaps across many machines?
  • 31. Statemaps • Statemaps provide a generic and system-neutral tool for visualizing system state over time • Statemaps use visualization to prompt questions • Statemaps work in concert with system observability facilities that can answer the questions that statemaps raise • We must keep the human in mind when developing for observability — the capacity to answer arbitrary questions is only as effective as the human asking them! • Statemap renderer: https://github.com/joyent/statemap