Monitoring and observability

In this session we’ll treat the need for performance as a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. That complexity leads to evil optimization problems and makes it significantly harder to troubleshoot production issues to a speedy and successful end.

Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability.

You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.

Published in: Technology

  1. Monitoring and Observability in Complex Architectures. Tuesday, October 2, 12
  2. Hi! I’m @postwait. I founded @OmniTI and @MessageSystems and @Circonus.
  3. Hi! I’m @postwait. I am very active in @TheOfficialACM, participating in @ACMQueue and the practitioners board.
  4. Hi! I’m @postwait. I (regrettably) build complex systems.
  5. Why we are here: We’re here to talk about coping with breakage.
  6. Rule #1: Direct observation of failure leads to quicker rectification.
  7. Rule #2: You cannot correct what you cannot measure.
  8. Solution Approach #1: Debugging failures requires either visibility into the precipitating state
  9. Precipitating State: Single threaded applications ✓ Easy
  10. Precipitating State: Multi-threaded applications ✓ Challenging
  11. Precipitating State: Distributed applications: here there be dragons
  12. Solution Approach #2: or direct observation of a (and likely very many) failing transaction.
  13. Direct Observation: Observing something fail... is priceless.
  14. Direct Observation: Observation leads to intelligent questioning.
  15. Direct Observation: Questioning leads to answers... but only through more observation.
  16. Direct Observation: Questioning leads to answers... but only through more observation. And herein lies the rub.
  17. Leaning Towards Scientific Process: In production you don’t have repeatability, control groups, or external verification.
  18. Leaning Towards Scientific Process: In production you don’t have repeatability, control groups, or external verification ... or do you?
  19. What’s monitoring got to do with it? Monitoring is all about the passive observation of telemetry data.
  20. Monitoring Telemetry: cannot pinpoint problems; can provide evidence of the existence of a problem.
  21. Monitoring: Gives you evidence that there is a problem.
  22. Monitoring: Gives you evidence that you have fixed a problem (or at least the symptoms).
  23. Monitoring Tactically: If it could be of interest, measure it and expose the measurement.
  24. Monitoring: embedded
      statsd: https://github.com/etsy/statsd
      metrics: https://github.com/codahale/metrics
      resmon: http://labs.omniti.com/labs/resmon
      folsom: https://github.com/boundary/folsom
      metrics.js: https://github.com/mikejihbe/metrics
      metrics-net: https://github.com/danielcrenna/metrics-net
  25. Monitoring: collection
      reconnoiter: http://labs.omniti.com/labs/reconnoiter
      circonus: http://circonus.com/
      graphite: http://graphite.wikidot.com/
      librato: https://metrics.librato.com/
      OpenTSDB: http://opentsdb.net/
  26. Monitoring: Bling: visualizing an architecture rollout
  27. Monitoring: Bling: visualizing the impact on service times
  28. average API service time latency
  29. actual API service time latency. http://www.slideshare.net/postwait/atldevops
  30. Monitoring: Bling
  31. Repeatability is a Pipe Dream: Your production problem is a (hopefully pathological) outcome of circumstance. A circumstance which often cannot be repeated.
  32. Control Groups: Control groups can compensate for the inability to precisely repeat an experiment.
  33. Control Groups: Most architectures have redundancy.
  34. Control Groups: With the right design, you can turn that redundancy into a debugging environment. [1] http://omniti.com/surge/2012/sessions/xtreme-deployment
  35. Control Groups: Simple Example: I have 10 web servers. I fix 1. I verify 1 is fixed. I verify 9 are still broken.
  36. Control Groups: Seems Easy: Web servers tend to be homogeneous, share-(nothing|little), and independent.
  37. Control Groups: Not So Easy: Most other services aren’t so homogeneous and equal: databases, batch processes (think billing), orchestration middleware, message queues.
  38. Observability: Some might claim that seeing telemetry data is observation... It is doubly indirect at best.
  39. Observability: I want to directly see the errant behaviour.
  40. Observability is forgiving: In complex, multi-component architectures, errors can be observed as errant behaviour in many junction points.
  41. Observing the network: tcpdump / snoop, wireshark
  42. Observing the network: Looking at just the arrival of new connections
      tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
  43. Observing the network: Looking at just the data arrival and departure times
      tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
      snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
  44. Observing the network: Finding the difference between a client’s question and a server’s answer (tcpdump | awk filter).
      { gsub(".[0-9]+(: | >)"," & "); gsub("[:=]"," ");
        # EP is the client's address and port, whichever way the packet travels
        EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
        # the server answering after the client spoke: print the elapsed time
        if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
        # remember who spoke last on this endpoint, and when
        S[EP]= ($4==".80")?"S":"C";
        L[EP]= $1; }
  45. Observing the network
  46. Observing the network
  47. Observing user-space: strace[1] / truss, gstack / pstack, gcore + gdb / dbx / mdb[2]
      [1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
      [2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf
  48. System call tracing: Watching sshd is a good way to get familiar.
      truss -f -p `pgrep sshd`
  49. System call tracing: An active web server is going to be like a firehose.
      truss -f -p `pgrep httpd`
  50. Observing the system: DTrace. Live production demo or GTFO.
  51. Thank You. Questions?
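
The awk filter on slide 44 pairs each client question with the first server answer that follows it and prints the time between them. The same bookkeeping can be sketched in Python, which may be easier to extend. This is a rough sketch and not the speaker's code: the `Packet` tuple layout, the `response_latencies` name and the sample data are assumptions made for illustration, and parsing real tcpdump output into these tuples is left out.

```python
from typing import Dict, Iterable, List, Tuple

# One observed data packet: (timestamp_seconds, source_port, client_endpoint).
# client_endpoint is a hypothetical "ip.port" label for the client side of the
# connection, whichever direction the packet travels (EP in the awk filter).
Packet = Tuple[float, int, str]

SERVER_PORT = 80  # the port watched by the tcpdump filters in the slides


def response_latencies(packets: Iterable[Packet]) -> List[Tuple[str, float]]:
    """Return (client_endpoint, seconds) for each question/answer pair."""
    last_sender: Dict[str, str] = {}  # "C" if the client spoke last, "S" if the server did
    last_seen: Dict[str, float] = {}  # timestamp of the previous packet per endpoint
    results: List[Tuple[str, float]] = []
    for ts, src_port, endpoint in packets:
        from_server = src_port == SERVER_PORT
        # The server speaking right after the client marks the start of an answer.
        if from_server and last_sender.get(endpoint) == "C":
            results.append((endpoint, ts - last_seen[endpoint]))
        last_sender[endpoint] = "S" if from_server else "C"
        last_seen[endpoint] = ts
    return results


if __name__ == "__main__":
    # Made-up sample: two requests on one connection.
    sample = [
        (10.00, 54321, "10.0.0.1.54321"),  # client question
        (10.25, 80, "10.0.0.1.54321"),     # server starts answering
        (10.30, 80, "10.0.0.1.54321"),     # more answer data, not a new pair
        (11.00, 54321, "10.0.0.1.54321"),  # next question
        (11.50, 80, "10.0.0.1.54321"),     # next answer
    ]
    for endpoint, seconds in response_latencies(sample):
        print(f"{seconds:.3f} {endpoint}")
```

Like the awk version, this measures from the last client packet to the first server packet that follows it, so a multi-packet answer contributes only one measurement per question.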