Monitoring and Observability in Complex Architectures

Tuesday, October 2, 12
Hi! I’m @postwait




                         I founded @OmniTI
                               and @MessageSystems
                               and @Circonus




Hi! I’m @postwait




                         I am very active in @TheOfficialACM
                         participating in @ACMQueue
                         and the practitioners board.




Hi! I’m @postwait




                         I (regrettably) build complex systems.




Why we are here




                         We’re here to talk about
                         coping with breakage




Rule #1




                         Direct observation of failure
                         leads to quicker rectification.




Rule #2




                         You cannot correct
                         what you cannot measure.




Solution Approach #1



                         Debugging failures requires either
                         visibility into the
                         precipitating state




Precipitating State



                         Single threaded applications



                         ✓ Easy

Precipitating State



                         Multi-threaded applications



                         ✓ Challenging

Precipitating State



                         Distributed applications




                              here there be dragons




Solution Approach #2



                         or
                         direct observation of a
                         failing transaction
                         (and likely very many of them)




Direct Observation




                         Observing something fail...
                         is priceless.




Direct Observation




                         Observation leads to
                         intelligent questioning.




Direct Observation




                         Questioning leads to answers...
                         but only through more observation.


                                    and herein lies the rub.


Leaning Towards Scientific Process



                         In production you don’t have
                           • repeatability
                           • control groups
                           • external verification

                                              ... or do you?

What’s monitoring got to do with it?




                         Monitoring is all about the
                         passive observation of
                         telemetry data.




Monitoring Telemetry



                         cannot pinpoint problems


                         can provide evidence of
                         the existence of a problem




Monitoring




                         Gives you evidence that
                         there is a problem




Monitoring




                         Gives you evidence that
                         you have fixed a problem
                         (or at least the symptoms)




Monitoring Tactically




                         If it could be of interest,
                         measure it and
                         expose the measurement
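
One way to act on this, as a minimal sketch: render each measurement in the plain-text statsd line format and send it as a UDP datagram. The helper names `format_metric` and `send_metric`, the metric names, and the `127.0.0.1:8125` target are assumptions for illustration and do not come from the talk.

```python
import socket

# statsd line format: "<name>:<value>|<type>" ("c" counter, "ms" timer, "g" gauge).
VALID_TYPES = {"c", "ms", "g"}


def format_metric(name, value, metric_type):
    """Render one measurement in the statsd text format."""
    if metric_type not in VALID_TYPES:
        raise ValueError("unsupported metric type: %r" % (metric_type,))
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget one UDP datagram; host and port are assumed defaults."""
    payload = format_metric(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
```

Calling `send_metric("api.requests", 1, "c")` on every request and `send_metric("api.latency", elapsed_ms, "ms")` around the handler would expose both a rate and a timing for the collector to pick up.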




Monitoring: embedded
                  statsd       https://github.com/etsy/statsd
                  resmon       http://labs.omniti.com/labs/resmon
                  metrics      https://github.com/codahale/metrics
                  folsom       https://github.com/boundary/folsom
                  metrics.js   https://github.com/mikejihbe/metrics
                  metrics-net  https://github.com/danielcrenna/metrics-net




Monitoring: collection
                  reconnoiter  http://labs.omniti.com/labs/reconnoiter
                  graphite     http://graphite.wikidot.com/
                  OpenTSDB     http://opentsdb.net/
                  circonus     http://circonus.com/
                  librato      https://metrics.librato.com/




Monitoring: Bling
                         visualizing an architecture rollout




Monitoring: Bling
                     visualizing the impact on service times




average API service time latency




actual API service time latency




                  http://www.slideshare.net/postwait/atldevops
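
A small numerical sketch of why an average can look nothing like the actual latencies: the mean of a skewed sample lands on a value that almost no request experienced. The sample below is invented for illustration, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(latencies_ms):
    """Return the mean alongside a few order statistics of the same sample."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
        rank = max(1, -(-len(ordered) * p // 100))  # ceiling division
        return ordered[rank - 1]

    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Hypothetical sample: 95 fast requests (10 ms) and 5 very slow ones (1000 ms).
sample = [10] * 95 + [1000] * 5
summary = summarize(sample)
```

Here the mean is 59.5 ms, yet half the requests took 10 ms and the slowest took a full second; no single request was anywhere near the average.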



Repeatability is a Pipe Dream


                         Your production problem is a
                         (hopefully pathological)
                         outcome of circumstance.


                         A circumstance which often
                         cannot be repeated.



Control Groups



                         Control groups can
                         compensate for the
                         inability to
                         precisely repeat an experiment.




Control Groups




                         Most architectures have redundancy.




Control Groups




                         With the right design,
                         you can turn that redundancy
                         into a debugging environment.


                  [1] http://omniti.com/surge/2012/sessions/xtreme-deployment




Control Groups: Simple Example



                         I have 10 web servers
                         I fix 1
                         I verify 1 is fixed
                         I verify 9 are still broken
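
The four steps above can be sketched as a comparison of per-host error rates between the fixed server and the untouched control group. The host names, request counts, the 1% threshold, and the helper functions are all assumptions made up for this example.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def compare_groups(counts, treated):
    """Split per-host (errors, requests) counts into treated and control rates."""
    treated_rates = {h: error_rate(*c) for h, c in counts.items() if h in treated}
    control_rates = {h: error_rate(*c) for h, c in counts.items() if h not in treated}
    return treated_rates, control_rates


def fix_is_supported(counts, treated, threshold=0.01):
    """True when every treated host is below the threshold and every control host is not."""
    treated_rates, control_rates = compare_groups(counts, treated)
    return (all(r < threshold for r in treated_rates.values())
            and all(r >= threshold for r in control_rates.values()))


# Hypothetical counts: web01 carries the fix, web02..web10 do not.
counts = {"web01": (1, 1000)}
counts.update({"web%02d" % i: (50, 1000) for i in range(2, 11)})
```

With these numbers `fix_is_supported(counts, {"web01"})` holds: the fixed host is healthy while the nine controls still show the problem, which is the evidence that the change, and not some unrelated shift in traffic, made the difference.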




Control Groups: Seems Easy



                         Web servers tend to be:
                           • homogeneous
                           • share-(nothing|little)
                           • independent




Control Groups: Not So Easy



                         Most other services aren’t so
                         homogeneous and equal:
                         databases, batch processes (think
                         billings), orchestration middleware,
                         message queues



Observability


                         Some might claim that
                         seeing telemetry data is
                         observation...


                         It is doubly indirect at best.



Observability



                         I want to
                         directly see
                         the
                         errant behaviour




Observability is forgiving



                         In complex, multi-component
                         architectures, errors can be
                         observed as errant behaviour in
                         many junction points.




Observing the network




                         tcpdump / snoop
                         wireshark




Observing the network



                         Looking at just the
                         arrival of new connections

                         tcpdump -nnq -tttt -s384 \
                           'tcp port 80 and (tcp[13] & (2|16) == 2)'
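
To unpack the filter expression: byte 13 of the TCP header holds the flag bits, where 2 is SYN and 16 is ACK, so the test keeps packets with SYN set and ACK clear, which is the first packet of a handshake. A minimal sketch of the same test, with `is_new_connection` as a hypothetical helper name:

```python
# Bit values within the TCP flags byte (byte 13 of the TCP header).
SYN = 0x02
ACK = 0x10


def is_new_connection(flags_byte):
    """Mirror `tcp[13] & (2|16) == 2`: SYN set and ACK clear."""
    return (flags_byte & (SYN | ACK)) == SYN
```

A bare SYN (`0x02`) passes, while the server's SYN+ACK reply (`0x12`) and ordinary ACK traffic (`0x10`) do not, so each new connection shows up exactly once.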




Observing the network


                         Looking at just the data
                         arrival and departure times
                         tcpdump -nnq -tt -s 384 \
                           'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

                         snoop -rq -ta -s 384 \
                           'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
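
The arithmetic in that filter is the TCP payload length: IP total length, minus the IP header length, minus the TCP header length. A sketch of the same calculation over raw bytes follows; `tcp_payload_length` and `build_packet` are hypothetical helpers, and `build_packet` fills in only the length fields needed for the demonstration.

```python
def tcp_payload_length(packet):
    """Payload bytes in an IPv4/TCP packet, given the bytes starting at the IP header."""
    ip_header_len = (packet[0] & 0x0F) * 4           # (ip[0]&0xf)*4
    total_len = int.from_bytes(packet[2:4], "big")   # ip[2:2]
    tcp = packet[ip_header_len:]
    tcp_header_len = (tcp[12] & 0xF0) // 4           # ((tcp[12]&0xf0)/4)
    return total_len - ip_header_len - tcp_header_len


def build_packet(payload):
    """Hypothetical test helper: 20-byte IP and TCP headers with only length fields set."""
    total_len = 20 + 20 + len(payload)
    ip = bytearray(20)
    ip[0] = 0x45                       # version 4, header length 5 words
    ip[2:4] = total_len.to_bytes(2, "big")
    tcp = bytearray(20)
    tcp[12] = 0x50                     # data offset 5 words
    return bytes(ip + tcp) + payload
```

Pure ACKs and handshake packets come out as zero and are dropped by the filter, leaving only the packets that actually carry request or response data.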




Observing the network
                         Finding the difference between
                         a client’s question and
                         a server’s answer
                         (tcpdump | awk filter).
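                         {
                             # Split the ".port" suffix off each address so it becomes its own field.
                             gsub("\\.[0-9]+(: | >)"," & ");
                             gsub("[:=]"," ");
                             # EP is the client endpoint: the peer of whichever side is on port 80.
                             EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);

                             # Server packet following a client packet: print the elapsed time.
                             if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }

                             S[EP]= ($4==".80")?"S":"C";
                             L[EP]= $1;
                         }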
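
The same state machine, sketched in Python for readers less at home in awk. It assumes the capture has already been parsed into tuples, which is a simplification: the tuple layout and the `response_latencies` name are hypothetical.

```python
def response_latencies(packets, server_port=80):
    """Yield (latency, client_endpoint) for each client-to-server turnaround.

    `packets` is an iterable of (timestamp, src_ip, src_port, dst_ip, dst_port)
    tuples, one per data-bearing packet, in capture order (assumed input shape).
    """
    last_sender = {}   # client endpoint -> "C" or "S"
    last_time = {}     # client endpoint -> timestamp of the previous packet
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        from_server = (src_port == server_port)
        endpoint = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        # A server packet right after a client packet is an answer to a question.
        if from_server and last_sender.get(endpoint) == "C":
            yield ts - last_time[endpoint], endpoint
        last_sender[endpoint] = "S" if from_server else "C"
        last_time[endpoint] = ts
```

Only the first server packet after a client packet produces a latency, so a response spread over many packets is counted once, from the last byte of the question to the first byte of the answer.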



Observing user-space



                         strace[1] / truss
                         gstack / pstack
                         gcore + gdb / dbx / mdb[2]


                  [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
                  [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf




System call tracing




                         Watching sshd
                         is a good way to get familiar.
                         truss -f -p `pgrep sshd`




System call tracing




                         An active web server is going to be
                         like a firehose.
                         truss -f -p `pgrep httpd`
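
One way to tame the firehose, as a rough sketch: tally the trace by system call name instead of reading it line by line. The line shape assumed here (an optional pid prefix followed by `name(args) = result`) is a simplification of real truss and strace output, and `tally_syscalls` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed line shape: optional pid prefix, then "name(args...) = result".
SYSCALL_RE = re.compile(r"([A-Za-z_]\w*)\(")


def tally_syscalls(lines):
    """Count how often each system call name appears in trace output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Both truss and strace also accept `-c` to print a per-call summary themselves, which is usually the quicker first look; a tally like this is mainly useful on a trace that has already been saved.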




Observing the system



                         DTrace


                         Live production demo or GTFO.




Thank You




                         Questions?



