Monitoring and observability

In this session we’ll leave the need for performance a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. The complexities lead to evil optimization problems and significant challenges troubleshooting production issues to a speedy and successful end.

Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability.

You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.

Presentation Transcript

  • Monitoring and Observability in Complex Architectures (Tuesday, October 2, 12)
  • Hi! I’m @postwait. I founded @OmniTI and @MessageSystems and @Circonus.
  • Hi! I’m @postwait. I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.
  • Hi! I’m @postwait. I (regrettably) build complex systems.
  • Why we are here: We’re here to talk about coping with breakage.
  • Rule #1: Direct observation of failure leads to quicker rectification.
  • Rule #2: You cannot correct what you cannot measure.
  • Solution Approach #1: Debugging failures requires either visibility into the precipitating state ...
  • Precipitating State: Single threaded applications ✓ Easy
  • Precipitating State: Multi-threaded applications ✓ Challenging
  • Precipitating State: Distributed applications: here there be dragons
  • Solution Approach #2: ... or direct observation of a failing transaction (and likely very many of them).
  • Direct Observation: Observing something fail... is priceless.
  • Direct Observation: Observation leads to intelligent questioning.
  • Direct Observation: Questioning leads to answers... but only through more observation. And herein lies the rub.
  • Leaning Towards Scientific Process: In production you don’t have • repeatability • control groups • external verification ... or do you?
  • What’s monitoring got to do with it? Monitoring is all about the passive observation of telemetry data.
  • Monitoring: Telemetry cannot pinpoint problems; it can provide evidence of the existence of a problem.
  • Monitoring: Gives you evidence that there is a problem.
  • Monitoring: Gives you evidence that you have fixed a problem (or at least the symptoms).
  • Monitoring Tactically: If it could be of interest, measure it and expose the measurement.
  • Monitoring: embedded: statsd (https://github.com/etsy/statsd), metrics (https://github.com/codahale/metrics), resmon (http://labs.omniti.com/labs/resmon), folsom (https://github.com/boundary/folsom), metrics.js (https://github.com/mikejihbe/metrics), metrics-net (https://github.com/danielcrenna/metrics-net)
  • Monitoring: collection: reconnoiter (http://labs.omniti.com/labs/reconnoiter), circonus (http://circonus.com/), graphite (http://graphite.wikidot.com/), librato (https://metrics.librato.com/), OpenTSDB (http://opentsdb.net/)
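
As a rough illustration of "measure it and expose the measurement", here is a minimal Python sketch that formats metrics in the statsd text protocol (`name:value|type`) and sends them over UDP. The metric names and the helper functions are hypothetical and are not taken from the talk; port 8125 is statsd's usual default and is assumed here.

```python
import socket


def statsd_line(name: str, value, kind: str) -> str:
    """Format one metric in the statsd text protocol: <name>:<value>|<type>."""
    if kind not in ("c", "ms", "g"):  # counter, timer, gauge
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one metric line over UDP (assumed statsd default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
print(statsd_line("api.requests", 1, "c"))
print(statsd_line("api.service_time", 320, "ms"))
```

Calling `send_metric(statsd_line("api.requests", 1, "c"))` from a request handler would emit one counter increment. UDP keeps the instrumentation cheap, and the trade-off is that a lost datagram means a lost sample.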
  • Monitoring: Bling: visualizing an architecture rollout
  • Monitoring: Bling: visualizing the impact on service times
  • average API service time latency
  • actual API service time latency (http://www.slideshare.net/postwait/atldevops)
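
The contrast between "average" and "actual" service time can be sketched with a few lines of Python. The latency values below are made up purely for illustration (they are not data from the talk), and `nearest_rank_percentile` is a hypothetical helper. The point is only that a mean can describe a latency that almost no request actually experienced.

```python
import math
import statistics


def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 90 fast requests and 10 very slow ones.
latencies_ms = [10] * 90 + [1000] * 10

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms} median={median_ms} p99={p99_ms}")
```

With this made-up sample the mean lands between the two clusters, which is why plotting the full distribution (or at least a few percentiles) tends to be more honest than plotting one averaged line.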
  • Monitoring: Bling
  • Repeatability is a Pipe Dream: Your production problem is a (hopefully pathological) outcome of circumstance, a circumstance which often cannot be repeated.
  • Control Groups: Control groups can compensate for the inability to precisely repeat an experiment.
  • Control Groups: Most architectures have redundancy.
  • Control Groups: With the right design, you can turn that redundancy into a debugging environment. [1] http://omniti.com/surge/2012/sessions/xtreme-deployment
  • Control Groups: Simple Example: I have 10 web servers. I fix 1. I verify 1 is fixed. I verify 9 are still broken.
  • Control Groups: Seems Easy: Web servers tend to be: • homogeneous • share-(nothing|little) • independent
  • Control Groups: Not So Easy: Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues.
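
The ten-web-server example can be sketched as a small check: one "treated" host got the fix, and the other nine act as the control group. This is only a sketch under assumptions. The host names, error rates, the `healthy_below` threshold and the function itself are all hypothetical.

```python
def control_group_verdict(error_rates, treated, healthy_below=0.01):
    """Compare one fixed ("treated") host against the untouched control hosts.

    error_rates maps host name -> observed error rate (0.0 to 1.0).
    """
    controls = {h: r for h, r in error_rates.items() if h != treated}
    treated_ok = error_rates[treated] < healthy_below
    controls_broken = all(r >= healthy_below for r in controls.values())
    if treated_ok and controls_broken:
        return "fix confirmed: treated host recovered, controls still failing"
    if treated_ok:
        return "inconclusive: some controls recovered without the fix"
    return "fix not confirmed: treated host is still failing"


# Hypothetical error rates for ten web servers; web01 received the fix.
rates = {f"web{i:02d}": 0.20 for i in range(1, 11)}
rates["web01"] = 0.001
print(control_group_verdict(rates, "web01"))
```

The "inconclusive" branch matters as much as the happy path: if untouched hosts also recover, the improvement cannot be attributed to the fix.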
  • Observability: Some might claim that seeing telemetry data is observation... It is doubly indirect at best.
  • Observability: I want to directly see the errant behaviour.
  • Observability is forgiving: In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
  • Observing the network: tcpdump / snoop, wireshark
  • Observing the network: Looking at just the arrival of new connections: tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
  • Observing the network: Looking at just the data arrival and departure times: tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)' or snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
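
The two capture filters above are byte arithmetic on the IP and TCP headers. The Python sketch below mirrors that arithmetic so each term can be read in isolation: byte 13 of the TCP header holds the flags (SYN is 0x02, ACK is 0x10), and the payload length is the IP total length minus both header lengths. `build_packet` is a hypothetical test helper with made-up addresses and ports, and the sketch assumes plain IPv4.

```python
import struct

SYN, ACK = 0x02, 0x10


def is_new_connection(tcp_header: bytes) -> bool:
    """Mirror `tcp[13] & (2|16) == 2`: SYN set and ACK clear."""
    return (tcp_header[13] & (SYN | ACK)) == SYN


def tcp_payload_length(packet: bytes) -> int:
    """Mirror the filter: IP total length minus IP and TCP header lengths."""
    total_length = struct.unpack("!H", packet[2:4])[0]  # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) * 4              # (ip[0]&0xf)*4
    tcp = packet[ip_header_len:]
    tcp_header_len = (tcp[12] & 0xF0) // 4              # (tcp[12]&0xf0)/4
    return total_length - ip_header_len - tcp_header_len


def build_packet(flags: int, payload: bytes) -> bytes:
    """Hypothetical helper: minimal IPv4+TCP packet, no options, zero checksums."""
    total = 20 + 20 + len(payload)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, total, 0, 0, 64, 6, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HHIIBBHHH", 40000, 80, 0, 0, 5 << 4, flags, 65535, 0, 0)
    return ip + tcp + payload


syn = build_packet(SYN, b"")
data = build_packet(ACK | 0x08, b"GET / HTTP/1.0\r\n\r\n")
print(is_new_connection(syn[20:]), tcp_payload_length(data))
```

Dividing the masked data-offset byte by 4 is the same as shifting it right by four bits and multiplying by 4, which is why the filter can skip the shift.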
  • Observing the network: Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter): { gsub(".[0-9]+(: | >)"," & "); gsub("[:=]"," "); EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4); if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); } S[EP]= ($4==".80")?"S":"C"; L[EP]= $1; }
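
The state machine inside the awk filter is compact, so here is the same idea as a hedged Python sketch. It does not parse tcpdump output. Instead it assumes the packets have already been reduced to `(timestamp, src_ip, src_port, dst_ip, dst_port)` tuples, which is a simplification introduced here. For each client endpoint it remembers who spoke last and when, and reports the gap whenever the server speaks right after the client.

```python
def question_to_answer_delays(packets, server_port=80):
    """Yield (delay_seconds, client_endpoint) for each client->server turnaround.

    packets is an iterable of (timestamp, src_ip, src_port, dst_ip, dst_port).
    """
    last_sender = {}  # client endpoint -> "C" or "S" (who spoke last)
    last_time = {}    # client endpoint -> timestamp of the last packet seen
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        from_server = src_port == server_port
        endpoint = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        if from_server and last_sender.get(endpoint) == "C":
            yield ts - last_time[endpoint], endpoint
        last_sender[endpoint] = "S" if from_server else "C"
        last_time[endpoint] = ts


# Made-up packet trace: one request, then two response packets.
trace = [
    (1.000, "10.0.0.9", 40000, "10.0.0.1", 80),
    (1.250, "10.0.0.1", 80, "10.0.0.9", 40000),
    (1.300, "10.0.0.1", 80, "10.0.0.9", 40000),
]
delays = list(question_to_answer_delays(trace))
print(delays)
```

Only the first server packet after a client packet produces a measurement, so a multi-packet response is counted once, matching the `S[EP] == "C"` guard in the awk version.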
  • Observing the network
  • Observing user-space: strace[1] / truss, gstack / pstack, gcore + gdb / dbx / mdb[2] [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
  • System call tracing: Watching sshd is a good way to get familiar. truss -f -p `pgrep sshd`
  • System call tracing: An active web server is going to be like a firehose. truss -f -p `pgrep httpd`
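
One way to cope with the firehose is to summarize it rather than read it. The Python sketch below counts system call names in trace lines. The sample lines are simplified and hypothetical (real truss and strace output varies by platform and flags), and the regular expression is an assumption about the common `pid: name(args) = result` and `[pid N] name(args) = result` shapes.

```python
import re
from collections import Counter

# Optional "1234:" or "[pid 1234]" prefix, then the syscall name and "(".
SYSCALL_RE = re.compile(r"^\s*(?:\[pid\s+\d+\]\s+|\d+:\s+)?([A-Za-z_]\w*)\(")


def count_syscalls(lines):
    """Count how often each system call name appears in trace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Simplified, hypothetical trace lines for illustration only.
sample = [
    "1234:\tread(5, 0x08047000, 8192)\t= 512",
    "1234:\twrite(7, 0x08047000, 512)\t= 512",
    '[pid 1235] read(5, "GET / HTTP/1.0", 8192) = 14',
    "1234:\tpoll(0x08046F00, 2, 1000)\t= 1",
]
counts = count_syscalls(sample)
print(counts.most_common())
```

Feeding captured trace output through a counter like this gives a quick profile of what the process spends its calls on before digging into individual lines.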
  • Observing the system: DTrace. Live production demo or GTFO.
  • Thank You. Questions?