Monitoring and Observability in Complex Architectures

Tuesday, October 2, 12
Hi! I’m @postwait




                         I founded @OmniTI
                               and @MessageSystems
                               and @Circonus




Hi! I’m @postwait




                         I am very active in @TheOfficialACM
                         participating in @ACMQueue
                         and the practitioners board.




Hi! I’m @postwait




                         I (regrettably) build complex systems.




Why we are here




                         We’re here to talk about
                         coping with breakage




Rule #1




                         Direct observation of failure
                         leads to quicker rectification.




Rule #2




                         You cannot correct
                         what you cannot measure.




Solution Approach #1



                         Debugging failures requires either
                         visibility into the
                         precipitating state




Precipitating State



                         Single threaded applications



                         ✓ Easy

Precipitating State



                         Multi-threaded applications



                         ✓ Challenging

Precipitating State



                         Distributed applications




                              here there be dragons




Solution Approach #2



                         or
                         direct observation of a
                         failing transaction
                         (and likely very many of them)




Direct Observation




                         Observing something fail...
                         is priceless.




Direct Observation




                         Observation leads to
                         intelligent questioning.




Direct Observation




                         Questioning leads to answers...
                         but only through more observation.




                                    and herein lies the rub.


Leaning Towards Scientific Process



                         In production you don’t have
                           • repeatability
                           • control groups
                           • external verification




                                              ... or do you?

What’s monitoring got to do with it?




                         Monitoring is all about the
                         passive observation of
                         telemetry data.




Monitoring Telemetry



                         cannot pinpoint problems


                         can provide evidence of
                         the existence of a problem




Monitoring




                         Gives you evidence that
                         there is a problem




Monitoring




                         Gives you evidence that
                         you have fixed a problem
                         (or at least the symptoms)




Monitoring Tactically




                         If it could be of interest,
                         measure it and
                         expose the measurement




Monitoring: embedded
                  statsd        https://github.com/etsy/statsd
                  resmon        http://labs.omniti.com/labs/resmon
                  metrics       https://github.com/codahale/metrics
                  folsom        https://github.com/boundary/folsom
                  metrics.js    https://github.com/mikejihbe/metrics
                  metrics-net   https://github.com/danielcrenna/metrics-net
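
A minimal sketch of "measure it and expose the measurement", assuming a statsd daemon listening on its default UDP port 8125 on the local host. The metric names `api.requests` and `api.service_time` are hypothetical; the `name:value|type` line format is the statsd plain-text protocol.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric as a statsd line: name:value|type."""
    return f"{name}:{value}|{kind}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP and return whether the send succeeded.

    Instrumentation should never take the application down, so socket
    errors are swallowed rather than raised.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(line.encode("ascii"), (host, port))
        return True
    except OSError:
        return False


counter = statsd_line("api.requests", 1, "c")        # count one request
timer = statsd_line("api.service_time", 320, "ms")   # one 320 ms timing
send_metric(counter)
send_metric(timer)
```

UDP keeps the cost of emitting a measurement close to zero for the caller, which makes it cheap to measure anything that could be of interest.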




Monitoring: collection
                  reconnoiter   http://labs.omniti.com/labs/reconnoiter
                  graphite      http://graphite.wikidot.com/
                  OpenTSDB      http://opentsdb.net/
                  circonus      http://circonus.com/
                  librato       https://metrics.librato.com/




Monitoring: Bling
                         visualizing an architecture rollout




Monitoring: Bling
                     visualizing the impact on service times




average API service time latency




actual API service time latency




                  http://www.slideshare.net/postwait/atldevops
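
The two graphs are not reproduced here, so this is only a sketch of one way an average can differ from what requests actually experienced. The latency values are made up for illustration: most requests are fast and a few are very slow.

```python
import statistics

# Hypothetical service times in milliseconds: 95 fast requests, 5 slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)      # pulled up by the slow tail
median_ms = statistics.median(latencies_ms)  # what a typical request saw
worst_ms = max(latencies_ms)                 # what the unlucky requests saw
slow_share = sum(1 for v in latencies_ms if v > 1000) / len(latencies_ms)

print(f"mean={mean_ms} median={median_ms} worst={worst_ms} slow={slow_share:.0%}")
```

In this made-up data no request took anything close to the mean, which is why the individual measurements are worth keeping.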



Repeatability is a Pipe Dream


                         Your production problem is a
                         (hopefully pathological)
                         outcome of circumstance.


                         A circumstance which often
                         cannot be repeated.



Control Groups



                         Control groups can
                         compensate for the
                         inability to
                         precisely repeat an experiment.




Control Groups




                         Most architectures have redundancy.




Control Groups




                         With the right design,
                         you can turn that redundancy
                         into a debugging environment.


                  [1] http://omniti.com/surge/2012/sessions/xtreme-deployment




Control Groups: Simple Example



                         I have 10 web servers
                         I fix 1
                         I verify 1 is fixed
                         I verify 9 are still broken
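
The four steps above can be sketched as a comparison between the fixed host and the nine untouched ones. Host names and error counts are hypothetical, and the sketch assumes per-host error and request counts are already being collected.

```python
def group_error_rates(samples, treated_hosts):
    """Return (treated_rate, control_rate) from {host: (errors, requests)}."""
    totals = {"treated": [0, 0], "control": [0, 0]}
    for host, (errors, requests) in samples.items():
        group = "treated" if host in treated_hosts else "control"
        totals[group][0] += errors
        totals[group][1] += requests

    def rate(pair):
        errors, requests = pair
        return errors / requests if requests else 0.0

    return rate(totals["treated"]), rate(totals["control"])


# Hypothetical numbers: web01 carries the fix, web02..web10 do not.
samples = {f"web{i:02d}": (50, 1000) for i in range(2, 11)}
samples["web01"] = (0, 1000)

treated_rate, control_rate = group_error_rates(samples, {"web01"})
fix_confirmed = treated_rate < control_rate
```

Seeing the nine controls still fail over the same period is what separates "the fix worked" from "the problem went away on its own".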




Control Groups: Seems Easy



                         Web servers tend to be:
                           • homogeneous
                           • share-(nothing|little)
                           • independent




Control Groups: Not So Easy



                         Most other services aren’t so
                         homogeneous and equal:
                         databases, batch processes (think
                         billing), orchestration middleware,
                         message queues



Observability


                         Some might claim that
                         seeing telemetry data is
                         observation...


                         It is doubly indirect at best.



Observability



                         I want to
                         directly see
                         the
                         errant behaviour




Observability is forgiving



                         In complex, multi-component
                         architectures, errors can be
                         observed as errant behaviour at
                         many junction points.




Observing the network




                         tcpdump / snoop
                         wireshark




Observing the network



                         Looking at just the
                         arrival of new connections

                         tcpdump -nnq -tttt -s384 \
                           'tcp port 80 and (tcp[13] & (2|16) == 2)'
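
A small sketch of what the `tcp[13] & (2|16) == 2` test selects. Byte 13 of the TCP header holds the flags, where 2 is the SYN bit and 16 is the ACK bit. The function name is hypothetical.

```python
SYN = 0x02  # bit 1 of the TCP flags byte (decimal 2)
ACK = 0x10  # bit 4 of the TCP flags byte (decimal 16)


def is_new_connection(flags_byte):
    """True for a client's opening SYN: SYN set and ACK clear."""
    return flags_byte & (SYN | ACK) == SYN


print(is_new_connection(0x02))  # bare SYN from a client
print(is_new_connection(0x12))  # SYN+ACK reply from the server
print(is_new_connection(0x10))  # plain ACK
```

Masking with both bits and comparing against SYN alone excludes the server's SYN+ACK, so only the arrival of new connections is shown.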




Observing the network


                         Looking at just the data
                         arrival and departure times
                         tcpdump -nnq -tt -s 384 \
                           'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

                         snoop -rq -ta -s 384 \
                           'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
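
The arithmetic in that filter can be sketched as follows: the payload length is the IP total length minus the IP header length minus the TCP header length, and the filter keeps packets where the result is non-zero. The function and parameter names are hypothetical.

```python
def tcp_payload_length(ip_total_length, ip_first_byte, tcp_offset_byte):
    """Bytes of TCP payload, using the same fields as the capture filter.

    ip_total_length  -> ip[2:2], total length of the IP packet
    ip_first_byte    -> ip[0], low nibble is the header length in 32-bit words
    tcp_offset_byte  -> tcp[12], high nibble is the header length in 32-bit words
    """
    ip_header_len = (ip_first_byte & 0x0F) * 4
    # The high nibble is already shifted left by 4 bits, so dividing by 4
    # is the same as shifting right by 4 and multiplying by 4.
    tcp_header_len = (tcp_offset_byte & 0xF0) // 4
    return ip_total_length - ip_header_len - tcp_header_len


pure_ack = tcp_payload_length(40, 0x45, 0x50)       # 20 + 20 bytes of headers only
with_data = tcp_payload_length(540, 0x45, 0x50)     # same headers plus payload
ack_with_options = tcp_payload_length(52, 0x45, 0x80)  # 32-byte TCP header, no payload
```

Dropping the zero-length packets removes bare ACKs and handshake traffic, leaving only the packets that carry data.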




Observing the network
                         Finding the difference between
                         a client’s question and
                         a server’s answer
                         (tcpdump | awk filter).
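                         {
                             # Split ".port" off each address so it becomes its own field.
                             gsub(".[0-9]+(: | >)"," & ");
                             gsub("[:=]"," ");
                             # EP is the client endpoint (address plus port), whichever way the packet flows.
                             EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);

                             # A server packet right after a client packet: print the elapsed time.
                             if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }

                             # Remember who spoke last on this endpoint, and when.
                             S[EP]= ($4==".80")?"S":"C";
                             L[EP]= $1;
                         }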
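
The same pairing logic, sketched in Python over packets that have already been parsed into tuples. The tuple layout `(timestamp, src, sport, dst, dport)` and the sample addresses are assumptions for illustration, and parsing of tcpdump output is deliberately left out.

```python
def question_to_answer_latencies(packets, server_port=80):
    """Return [(client_endpoint, seconds)] for each client-then-server exchange."""
    last_time = {}  # client endpoint -> timestamp of its most recent packet
    last_side = {}  # client endpoint -> "C" (client spoke) or "S" (server spoke)
    results = []
    for ts, src, sport, dst, dport in packets:
        from_server = sport == server_port
        client = (dst, dport) if from_server else (src, sport)
        # First server packet after a client packet: that gap is the latency.
        if from_server and last_side.get(client) == "C":
            results.append((client, ts - last_time[client]))
        last_side[client] = "S" if from_server else "C"
        last_time[client] = ts
    return results


# Hypothetical capture: one request, then two response packets.
packets = [
    (10.0, "10.0.0.1", 54321, "10.0.0.2", 80),
    (10.25, "10.0.0.2", 80, "10.0.0.1", 54321),
    (10.5, "10.0.0.2", 80, "10.0.0.1", 54321),
]
latencies = question_to_answer_latencies(packets)
```

Only the first response packet produces a measurement, because the state flips to "S" until the client speaks again.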



Observing user-space



                         strace[1] / truss
                         gstack / pstack
                         gcore + gdb / dbx / mdb[2]


                  [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
                  [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf




System call tracing




                         Watching sshd
                         is a good way to get familiar.
                         truss -f -p `pgrep sshd`




System call tracing




                         An active web server is going to be
                         like a firehose.
                         truss -f -p `pgrep httpd`




Observing the system



                         DTrace


                         Live production demo or GTFO.




Thank You




                         Questions?





More Related Content

What's hot

Observability, what, why and how
Observability, what, why and howObservability, what, why and how
Observability, what, why and how
Neeraj Bagga
 
Observability & Datadog
Observability & DatadogObservability & Datadog
Observability & Datadog
JamesAnderson599331
 
Observability vs APM vs Monitoring Comparison
Observability vs APM vs  Monitoring ComparisonObservability vs APM vs  Monitoring Comparison
Observability vs APM vs Monitoring Comparison
jeetendra mandal
 
Observability
ObservabilityObservability
Observability in the world of microservices
Observability in the world of microservicesObservability in the world of microservices
Observability in the world of microservices
Chandresh Pancholi
 
Observability for modern applications
Observability for modern applications  Observability for modern applications
Observability for modern applications
MoovingON
 
Observability; a gentle introduction
Observability; a gentle introductionObservability; a gentle introduction
Observability; a gentle introduction
Bram Vogelaar
 
Observability – the good, the bad, and the ugly
Observability – the good, the bad, and the uglyObservability – the good, the bad, and the ugly
Observability – the good, the bad, and the ugly
Timetrix
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) Observability
Christoph Engelbert
 
Demystifying observability
Demystifying observability Demystifying observability
Demystifying observability
Abigail Bangser
 
Observability
ObservabilityObservability
Observability
Diego Pacheco
 
Observability
ObservabilityObservability
Observability
Ebru Cucen Çüçen
 
MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks
 
Observability
Observability Observability
Observability
Enes Altınok
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
Danylenko Max
 
.conf Go 2022 - Observability Session
.conf Go 2022 - Observability Session.conf Go 2022 - Observability Session
.conf Go 2022 - Observability Session
Splunk
 
Observability-101
Observability-101Observability-101
Observability-101
Piyush Baderia
 
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
DevOps.com
 
Monitoring with Dynatrace Presentation.pptx
Monitoring with Dynatrace Presentation.pptxMonitoring with Dynatrace Presentation.pptx
Monitoring with Dynatrace Presentation.pptx
Knoldus Inc.
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
Abeer R
 

What's hot (20)

Observability, what, why and how
Observability, what, why and howObservability, what, why and how
Observability, what, why and how
 
Observability & Datadog
Observability & DatadogObservability & Datadog
Observability & Datadog
 
Observability vs APM vs Monitoring Comparison
Observability vs APM vs  Monitoring ComparisonObservability vs APM vs  Monitoring Comparison
Observability vs APM vs Monitoring Comparison
 
Observability
ObservabilityObservability
Observability
 
Observability in the world of microservices
Observability in the world of microservicesObservability in the world of microservices
Observability in the world of microservices
 
Observability for modern applications
Observability for modern applications  Observability for modern applications
Observability for modern applications
 
Observability; a gentle introduction
Observability; a gentle introductionObservability; a gentle introduction
Observability; a gentle introduction
 
Observability – the good, the bad, and the ugly
Observability – the good, the bad, and the uglyObservability – the good, the bad, and the ugly
Observability – the good, the bad, and the ugly
 
Road to (Enterprise) Observability
Road to (Enterprise) ObservabilityRoad to (Enterprise) Observability
Road to (Enterprise) Observability
 
Demystifying observability
Demystifying observability Demystifying observability
Demystifying observability
 
Observability
ObservabilityObservability
Observability
 
Observability
ObservabilityObservability
Observability
 
MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!MeasureWorks - Performance Labs - Why Observability Matters!
MeasureWorks - Performance Labs - Why Observability Matters!
 
Observability
Observability Observability
Observability
 
Monitoring and observability
Monitoring and observabilityMonitoring and observability
Monitoring and observability
 
.conf Go 2022 - Observability Session
.conf Go 2022 - Observability Session.conf Go 2022 - Observability Session
.conf Go 2022 - Observability Session
 
Observability-101
Observability-101Observability-101
Observability-101
 
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
More Than Monitoring: How Observability Takes You From Firefighting to Fire P...
 
Monitoring with Dynatrace Presentation.pptx
Monitoring with Dynatrace Presentation.pptxMonitoring with Dynatrace Presentation.pptx
Monitoring with Dynatrace Presentation.pptx
 
Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)Getting started with Site Reliability Engineering (SRE)
Getting started with Site Reliability Engineering (SRE)
 

Viewers also liked

The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
Theo Schlossnagle
 
Nonlinear observer design
Nonlinear observer designNonlinear observer design
Nonlinear observer design
Pantelis Sopasakis
 
Data viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoueData viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoueMakoto Inoue
 
Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?
pdyball
 
Performance and Metrics at Lonely Planet
Performance and Metrics at Lonely PlanetPerformance and Metrics at Lonely Planet
Performance and Metrics at Lonely Planet
Mark Jennings
 
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Tim Morrow
 
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack CloudsIn-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
PLUMgrid
 
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Andy Davies
 
Bring the Noise
Bring the NoiseBring the Noise
Bring the Noise
Jon Cowie
 
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks
 
Velocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and youVelocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and you
Patrick Meenan
 
Integrating multiple CDNs at Etsy
Integrating multiple CDNs at EtsyIntegrating multiple CDNs at Etsy
Integrating multiple CDNs at Etsy
Laurie Denness
 
Getting 100B Metrics to Disk
Getting 100B Metrics to DiskGetting 100B Metrics to Disk
Getting 100B Metrics to Disk
jthurman42
 
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
James Wickett
 
Atldevops
AtldevopsAtldevops
Understanding Slowness
Understanding SlownessUnderstanding Slowness
Understanding Slowness
Theo Schlossnagle
 
What's in a number?
What's in a number?What's in a number?
What's in a number?
Theo Schlossnagle
 
Xtreme Deployment
Xtreme DeploymentXtreme Deployment
Xtreme Deployment
Theo Schlossnagle
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
Theo Schlossnagle
 
Linux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene PirogovLinux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene Pirogov
Pivorak MeetUp
 

Viewers also liked (20)

The math behind big systems analysis.
The math behind big systems analysis.The math behind big systems analysis.
The math behind big systems analysis.
 
Nonlinear observer design
Nonlinear observer designNonlinear observer design
Nonlinear observer design
 
Data viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoueData viz as_interface_makoto_inoue
Data viz as_interface_makoto_inoue
 
Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?Velocity EU 2013 What is the velocity of an unladen swallow?
Velocity EU 2013 What is the velocity of an unladen swallow?
 
Performance and Metrics at Lonely Planet
Performance and Metrics at Lonely PlanetPerformance and Metrics at Lonely Planet
Performance and Metrics at Lonely Planet
 
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
Why Page Speed Isn't Enough - Tim Morrow - Velocity Europe 2012
 
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack CloudsIn-kernel Analytics and Tracing with eBPF for OpenStack Clouds
In-kernel Analytics and Tracing with eBPF for OpenStack Clouds
 
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
Are Today’s Good Practices… Tomorrow’s Performance Anti-Patterns?
 
Bring the Noise
Bring the NoiseBring the Noise
Bring the Noise
 
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
MeasureWorks - Velocity Conference Europe 2012 - a Web Performance dashboard ...
 
Velocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and youVelocity EU 2012 - Third party scripts and you
Velocity EU 2012 - Third party scripts and you
 
Integrating multiple CDNs at Etsy
Integrating multiple CDNs at EtsyIntegrating multiple CDNs at Etsy
Integrating multiple CDNs at Etsy
 
Getting 100B Metrics to Disk
Getting 100B Metrics to DiskGetting 100B Metrics to Disk
Getting 100B Metrics to Disk
 
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
Be Mean to Your Code with Gauntlt and the Rugged Way // Velocity EU 2013 Work...
 
Atldevops
AtldevopsAtldevops
Atldevops
 
Understanding Slowness
Understanding SlownessUnderstanding Slowness
Understanding Slowness
 
What's in a number?
What's in a number?What's in a number?
What's in a number?
 
Xtreme Deployment
Xtreme DeploymentXtreme Deployment
Xtreme Deployment
 
SRECon Coherent Performance
SRECon Coherent PerformanceSRECon Coherent Performance
SRECon Coherent Performance
 
Linux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene PirogovLinux Tracing Superpowers by Eugene Pirogov
Linux Tracing Superpowers by Eugene Pirogov
 

Similar to Monitoring and observability

Productivity, Productivity, Productivity
Productivity, Productivity, ProductivityProductivity, Productivity, Productivity
Productivity, Productivity, Productivity
Fabian Alcantara
 
Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012
Ryan Weald
 
Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)
packetloop
 
Optimizing the Mobile Search Experience
Optimizing the Mobile Search ExperienceOptimizing the Mobile Search Experience
Optimizing the Mobile Search Experience
Monetate
 
Twitter Storm
Twitter StormTwitter Storm
Twitter Storm
Sergey Lukjanov
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility Managers
Chris Dagdigian
 
Optimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerceOptimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerce
Kellan
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
Justin Dorfman
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
james tong
 
Big Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at EtsyBig Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at EtsyJason Davis
 
Automatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from TwitterAutomatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from Twitter
Marieke van Erp
 
The Web Designers Toolkit
The Web Designers ToolkitThe Web Designers Toolkit
The Web Designers Toolkit
R/GA
 

Similar to Monitoring and observability (14)

Productivity, Productivity, Productivity
Productivity, Productivity, ProductivityProductivity, Productivity, Productivity
Productivity, Productivity, Productivity
 
Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012Building Data Driven Products With Ruby - RubyConf 2012
Building Data Driven Products With Ruby - RubyConf 2012
 
Continous delivery
Continous deliveryContinous delivery
Continous delivery
 
Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)Ruxcon Finding Needles in Haystacks (the size of countries)
Ruxcon Finding Needles in Haystacks (the size of countries)
 
Optimizing the Mobile Search Experience
Optimizing the Mobile Search ExperienceOptimizing the Mobile Search Experience
Optimizing the Mobile Search Experience
 
Twitter Storm
Twitter StormTwitter Storm
Twitter Storm
 
Measuring
MeasuringMeasuring
Measuring
 
Bio-IT for Core Facility Managers
Bio-IT for Core Facility ManagersBio-IT for Core Facility Managers
Bio-IT for Core Facility Managers
 
Optimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerceOptimizing for change: Taking risks safely & e-commerce
Optimizing for change: Taking risks safely & e-commerce
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
 
Stability patterns presentation
Stability patterns presentationStability patterns presentation
Stability patterns presentation
 
Big Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at EtsyBig Data, Big Changes: Data-Driven Product Development at Etsy
Big Data, Big Changes: Data-Driven Product Development at Etsy
 
Automatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from TwitterAutomatic Extraction of Soccer Game Event Data from Twitter
Automatic Extraction of Soccer Game Event Data from Twitter
 
The Web Designers Toolkit
The Web Designers ToolkitThe Web Designers Toolkit
The Web Designers Toolkit
 

More from Theo Schlossnagle

Adding Simplicity to Complexity
Adding Simplicity to ComplexityAdding Simplicity to Complexity
Adding Simplicity to Complexity
Theo Schlossnagle
 
Put Some SRE in Your Shipped Software
Put Some SRE in Your Shipped SoftwarePut Some SRE in Your Shipped Software
Put Some SRE in Your Shipped Software
Theo Schlossnagle
 
Monitoring 101
Monitoring 101Monitoring 101
Monitoring 101
Theo Schlossnagle
 
Distributed Systems - Like It Or Not
Distributed Systems - Like It Or NotDistributed Systems - Like It Or Not
Distributed Systems - Like It Or Not
Theo Schlossnagle
 
Applying SRE techniques to micro service design
Applying SRE techniques to micro service designApplying SRE techniques to micro service design
Applying SRE techniques to micro service design
Theo Schlossnagle
 
Craftsmanship
CraftsmanshipCraftsmanship
Craftsmanship
Theo Schlossnagle
 
Commandments of scale
Commandments of scaleCommandments of scale
Commandments of scale
Theo Schlossnagle
 
Adaptive availability
Adaptive availabilityAdaptive availability
Adaptive availability
Theo Schlossnagle
 
Project reality
Project realityProject reality
Project reality
Theo Schlossnagle
 
Monitoring the #DevOps way
Monitoring the #DevOps wayMonitoring the #DevOps way
Monitoring the #DevOps way
Theo Schlossnagle
 
Operational Software Design
Operational Software DesignOperational Software Design
Operational Software Design
Theo Schlossnagle
 
A Coherent Discussion About Performance
A Coherent Discussion About PerformanceA Coherent Discussion About Performance
A Coherent Discussion About PerformanceTheo Schlossnagle
 
OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012OmniOS Motivation and Design ~ LISA 2012
OmniOS Motivation and Design ~ LISA 2012Theo Schlossnagle
 
It's all about telemetry
It's all about telemetryIt's all about telemetry
It's all about telemetry
Theo Schlossnagle
 
Is this normal?
Is this normal?Is this normal?
Is this normal?
Theo Schlossnagle
 
Monitoring is easy, why are we so bad at it presentation
Monitoring is easy, why are we so bad at it  presentationMonitoring is easy, why are we so bad at it  presentation
Monitoring is easy, why are we so bad at it presentationTheo Schlossnagle
 
Social improvements in monitoring
Social improvements in monitoringSocial improvements in monitoring
Social improvements in monitoringTheo Schlossnagle
 
Building Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approachBuilding Scalable Systems: an asynchronous approach
Building Scalable Systems: an asynchronous approach
Theo Schlossnagle
 
Webops dashboards
Webops dashboardsWebops dashboards
Webops dashboards
Theo Schlossnagle
 

Monitoring and observability

  • 1. Monitoring and Observability / in Complex Architectures Tuesday, October 2, 12
  • 2. Hi! I’m @postwait I founded @OmniTI and @MessageSystems and @Circonus
  • 3. Hi! I’m @postwait I am very active in @TheOfficialACM participating in @ACMQueue and the practitioners board.
  • 4. Hi! I’m @postwait I (regrettably) build complex systems.
  • 5. Why we are here We’re here to talk about coping with breakage
  • 6. Rule #1 Direct observation of failure leads to quicker rectification.
  • 7. Rule #2 You cannot correct what you cannot measure.
  • 8. Solution Approach #1 Debugging failures requires either visibility into the precipitating state
  • 9. Precipitating State Single threaded applications ✓ Easy
  • 10. Precipitating State Multi-threaded applications ✓ Challenging
  • 11. Precipitating State Distributed applications here there be dragons
  • 12. Solution Approach #2 or direct observation of a (and likely very many) failing transaction
  • 13. Direct Observation Observing something fail... is priceless.
  • 14. Direct Observation Observation leads to intelligent questioning.
  • 15. Direct Observation Questioning leads to answers... but only through more observation.
  • 16. Direct Observation Questioning leads to answers... but only through more observation. and herein lies the rub.
  • 17. Leaning Towards Scientific Process In production you don’t have • repeatability • control groups • external verification
  • 18. Leaning Towards Scientific Process In production you don’t have • repeatability • control groups • external verification ... or do you?
  • 19. What’s monitoring got to do with it? Monitoring is all about the passive observation of telemetry data.
  • 20. Monitoring Telemetry cannot pinpoint problems, but it can provide evidence of the existence of a problem
  • 21. Monitoring Gives you evidence that there is a problem
  • 22. Monitoring Gives you evidence that you have fixed a problem (or at least the symptoms)
  • 23. Monitoring Tactically If it could be of interest, measure it and expose the measurement
  • 24. Monitoring: embedded statsd (https://github.com/etsy/statsd), metrics (https://github.com/codahale/metrics), resmon (http://labs.omniti.com/labs/resmon), folsom (https://github.com/boundary/folsom), metrics.js (https://github.com/mikejihbe/metrics), metrics-net (https://github.com/danielcrenna/metrics-net)
  • 25. Monitoring: collection reconnoiter (http://labs.omniti.com/labs/reconnoiter), circonus (http://circonus.com/), graphite (http://graphite.wikidot.com/), librato (https://metrics.librato.com/), OpenTSDB (http://opentsdb.net/)
  • 26. Monitoring: Bling visualizing an architecture rollout
  • 27. Monitoring: Bling visualizing the impact on service times
  • 28. average API service time latency
  • 29. actual API service time latency http://www.slideshare.net/postwait/atldevops
  • 31. Repeatability is a Pipe Dream Your production problem is a (hopefully pathological) outcome of circumstance. A circumstance which often cannot be repeated.
  • 32. Control Groups Control groups can compensate for the inability to precisely repeat an experiment.
  • 33. Control Groups Most architectures have redundancy.
  • 34. Control Groups With the right design, you can turn that redundancy into a debugging environment. [1] http://omniti.com/surge/2012/sessions/xtreme-deployment
  • 35. Control Groups: Simple Example I have 10 web servers. I fix 1. I verify 1 is fixed. I verify 9 are still broken.
  • 36. Control Groups: Seems Easy Web servers tend to be: • homogeneous • share-(nothing|little) • independent
  • 37. Control Groups: Not So Easy Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues
  • 38. Observability Some might claim that seeing telemetry data is observation... It is doubly indirect at best.
  • 39. Observability I want to directly see the errant behaviour
  • 40. Observability is forgiving In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
  • 41. Observing the network tcpdump / snoop, wireshark
  • 42. Observing the network Looking at just the arrival of new connections: tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
  • 43. Observing the network Looking at just the data arrival and departure times. With tcpdump: tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)' With snoop: snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
  • 44. Observing the network Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter). { gsub(".[0-9]+(: | >)"," & "); gsub("[:=]"," "); EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4); if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); } S[EP]= ($4==".80")?"S":"C"; L[EP]= $1; }
  • 47. Observing user-space strace[1] / truss, gstack / pstack, gcore + gdb / dbx / mdb[2] [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
  • 48. System call tracing Watching sshd is a good way to get familiar. truss -f -p `pgrep sshd`
  • 49. System call tracing An active web server is going to be like a firehose. truss -f -p `pgrep httpd`
  • 50. Observing the system DTrace Live production demo or GTFO.
  • 51. Thank You Questions?
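
Slide 23's advice (measure anything of interest and expose it) can cost as little as one UDP datagram per event. The sketch below is an illustration, not code from the talk. It assumes a statsd-compatible daemon, such as the one listed on slide 24, listening on the conventional UDP port 8125 on localhost, and it uses the plain-text statsd line form `name:value|type`. The metric names and the helper functions `statsd_line` and `emit` are hypothetical.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str) -> bytes:
    """Format one metric in the plain-text statsd form: name:value|type."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def emit(name, value, metric_type="ms", host="127.0.0.1", port=8125):
    """Send one metric over UDP and return the payload that was built.

    host/port are assumptions (a local statsd-compatible listener).
    Send errors are swallowed so instrumentation never breaks the caller.
    """
    payload = statsd_line(name, value, metric_type)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
    except OSError:
        pass  # telemetry is best-effort
    return payload
```

A request handler might call `emit("api.request_time", 320)` after each request and `emit("api.errors", 1, "c")` on each failure. UDP is chosen here so that a slow or absent collector cannot stall the code being measured.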
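
The control-group reasoning on slide 35 has two halves, and the second is easy to forget: the fixed server must recover and the untouched servers must still be broken, otherwise the problem may simply have gone away on its own. A minimal sketch of that check follows. The host names, the per-host error rates, and the 1% health threshold are all hypothetical; in practice the rates would come from whatever telemetry is already being collected.

```python
def fix_is_confirmed(error_rates, fixed_hosts, healthy_below=0.01):
    """Return True only if every fixed host is healthy AND every
    untouched (control) host is still broken.

    error_rates: dict mapping host name -> observed error rate (0.0 to 1.0)
    fixed_hosts: set of host names that received the candidate fix
    healthy_below: assumed threshold under which a host counts as healthy
    """
    fixed = {h: r for h, r in error_rates.items() if h in fixed_hosts}
    control = {h: r for h, r in error_rates.items() if h not in fixed_hosts}
    if not fixed or not control:
        return False  # without both groups there is nothing to compare
    treated_ok = all(r < healthy_below for r in fixed.values())
    control_still_broken = all(r >= healthy_below for r in control.values())
    return treated_ok and control_still_broken
```

With ten web servers at a 20% error rate and one of them patched down to 0%, the function returns True. If all ten read 0%, it returns False, because the recovery cannot be attributed to the fix.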
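
The awk filter on slide 44 is dense, so here is the same idea restated as a Python sketch: remember who spoke last on each client endpoint, and when the server speaks right after the client, report the gap. This is a restatement under assumptions, not the talk's code. It assumes input lines shaped like `<epoch> IP <src>.<port> > <dst>.<port>: tcp <len>`, which is what `tcpdump -tt -nnq` is expected to print for IPv4 TCP; the function name `response_latencies` is hypothetical.

```python
import re

# Assumed line shape: "<epoch> IP <src>.<port> > <dst>.<port>: ..."
LINE = re.compile(r"^(\d+\.\d+) IP (\S+)\.(\d+) > (\S+)\.(\d+):")


def response_latencies(lines, server_port=80):
    """Return a list of (seconds, client_endpoint) pairs, one for each
    time the server answers immediately after the client spoke."""
    last_sender = {}  # client endpoint -> "C" (client) or "S" (server)
    last_time = {}    # client endpoint -> timestamp of the latest packet
    results = []
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip anything that is not a recognised packet line
        ts = float(m.group(1))
        src, sport = m.group(2), int(m.group(3))
        dst, dport = m.group(4), int(m.group(5))
        from_server = sport == server_port
        # Key every packet by the client side of the conversation.
        endpoint = f"{dst}.{dport}" if from_server else f"{src}.{sport}"
        if from_server and last_sender.get(endpoint) == "C":
            results.append((ts - last_time[endpoint], endpoint))
        last_sender[endpoint] = "S" if from_server else "C"
        last_time[endpoint] = ts
    return results
```

Fed with the output of the slide 43 tcpdump command (for example by iterating over `sys.stdin`), each result approximates how long one client waited for the first byte of a response. Like the awk original, it measures from the client's last data packet, so a multi-packet request is timed from its final segment.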