In this session we’ll take the need for performance as a foregone conclusion and take a whirlwind tour through the complexity of modern Internet architectures. That complexity leads to evil optimization problems and makes it significantly harder to troubleshoot production issues to a speedy and successful end.
Starting with the simple facts that you can’t fix what you can’t see and you can’t improve what you can’t measure, we’ll discuss what needs monitoring and why. We’ll talk about unlikely allies in the fight for time and budget to instrument systems, applications and processes for observability.
You’ll leave the session with a better understanding of what it looks like to troubleshoot the storm of a malfunctioning large architecture and some tools and techniques you can use to not be swallowed by the Kraken.
2. Hi! I’m @postwait
I founded @OmniTI
and @MessageSystems
and @Circonus
Tuesday, October 2, 12
3. Hi! I’m @postwait
I am very active in @TheOfficialACM
participating in @ACMQueue
and the practitioners board.
4. Hi! I’m @postwait
I (regrettably) build complex systems.
5. Why we are here
We’re here to talk about
coping with breakage
6. Rule #1
Direct observation of failure
leads to quicker rectification.
7. Rule #2
You cannot correct
what you cannot measure.
8. Solution Approach #1
Debugging failures requires either
visibility into the
precipitating state
9. Precipitating State
Single threaded applications
✓ Easy
10. Precipitating State
Multi-threaded applications
✓ Challenging
11. Precipitating State
Distributed applications
here there be dragons
12. Solution Approach #2
or
direct observation of a
(and likely very many)
failing transaction
13. Direct Observation
Observing something fail...
is priceless.
14. Direct Observation
Observation leads to
intelligent questioning.
16. Direct Observation
Questioning leads to answers...
but only through more observation.
and herein lies the rub.
18. Leaning Towards Scientific Process
In production you don’t have
• repeatability
• control groups
• external verification
... or do you?
19. What’s monitoring got to do with it?
Monitoring is all about the
passive observation of
telemetry data.
20. Monitoring Telemetry
cannot pinpoint problems
can provide evidence of
the existence of a problem
21. Monitoring
Gives you evidence that
there is a problem
22. Monitoring
Gives you evidence that
you have fixed a problem
(or at least the symptoms)
23. Monitoring Tactically
If it could be of interest,
measure it and
expose the measurement
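A minimal sketch of what "measure it and expose the measurement" could look like inside an application. The `Metrics` class, the `checkout` function, and the metric names are hypothetical and only illustrate the idea: count events, time calls, and make the numbers readable from outside the code path.

```python
import functools
import threading
import time
from collections import defaultdict


class Metrics:
    """Tiny in-process registry (hypothetical): counters plus call timings."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)
        self._seconds = defaultdict(float)

    def incr(self, name, by=1):
        """Count an event of interest."""
        with self._lock:
            self._counts[name] += by

    def timed(self, name):
        """Decorator: count calls to a function and sum their wall-clock time."""
        def wrap(fn):
            @functools.wraps(fn)
            def inner(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    elapsed = time.perf_counter() - start
                    with self._lock:
                        self._counts[name + ".calls"] += 1
                        self._seconds[name + ".seconds"] += elapsed
            return inner
        return wrap

    def snapshot(self):
        """Expose the measurements, e.g. for a status page or a collector."""
        with self._lock:
            return {**self._counts, **self._seconds}


metrics = Metrics()


@metrics.timed("checkout")
def checkout(cart):
    # Hypothetical business function; failures are counted, not hidden.
    if not cart:
        metrics.incr("checkout.empty_cart")
        raise ValueError("empty cart")
    return sum(cart)
```

Calling `metrics.snapshot()` after some traffic returns a plain dictionary that a monitoring system could poll. How the snapshot is published (HTTP endpoint, log line, agent) is left open here.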
31. Repeatability is a Pipe Dream
Your production problem is a
(hopefully pathological)
outcome of circumstance.
A circumstance which often
cannot be repeated.
32. Control Groups
Control groups can
compensate for the
inability to
precisely repeat an experiment.
33. Control Groups
Most architectures have redundancy.
34. Control Groups
With the right design,
you can turn that redundancy
into a debugging environment.
[1] http://omniti.com/surge/2012/sessions/xtreme-deployment
35. Control Groups: Simple Example
I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
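The four steps above can be sketched as a comparison of error rates between the fixed server and the untouched ones. The host names and the request and error counts below are invented for illustration, and `compare_groups` is a hypothetical helper, not part of any tool named in the talk.

```python
def error_rate(errors, requests):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def compare_groups(stats, treated):
    """Return (treated_rate, control_rate).

    stats   -- dict of host -> (errors, requests) over the same time window
    treated -- set of hosts that received the fix; all others are the control
    """
    def pooled(hosts):
        errs = sum(stats[h][0] for h in hosts)
        reqs = sum(stats[h][1] for h in hosts)
        return error_rate(errs, reqs)

    control = set(stats) - set(treated)
    return pooled(treated), pooled(control)


# Made-up numbers: web01 got the fix, web02..web10 did not.
stats = {"web%02d" % i: (50, 1000) for i in range(2, 11)}
stats["web01"] = (2, 1000)

treated_rate, control_rate = compare_groups(stats, {"web01"})
print("fixed: %.3f  still broken: %.3f" % (treated_rate, control_rate))
```

If the treated rate drops while the control rate stays put over the same window, the fix is the likely cause. If both drop, something else changed.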
36. Control Groups: Seems Easy
Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent
37. Control Groups: Not So Easy
Most other services aren’t so
homogeneous and equal:
databases, batch processes (think
billings), orchestration middleware,
message queues
38. Observability
Some might claim that
seeing telemetry data is
observation...
It is doubly indirect at best.
39. Observability
I want to
directly see
the
errant behaviour
40. Observability is forgiving
In complex, multi-component
architectures, errors can be
observed as errant behaviour in
many junction points.
42. Observing the network
Looking at just the
arrival of new connections
tcpdump -nnq -tttt -s384 'tcp port 80 and (tcp[13] & (2|16) == 2)'
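The filter expression tests byte 13 of the TCP header, which holds the flag bits: 2 is SYN and 16 is ACK. Masking with `(2|16)` and comparing to 2 keeps packets with SYN set and ACK clear, which is the first packet of a new connection. The same test as a small sketch (the function name is an assumption for illustration):

```python
TCP_SYN = 0x02  # bit value 2 in the TCP flags byte
TCP_ACK = 0x10  # bit value 16 in the TCP flags byte


def is_new_connection(flags):
    """True when SYN is set and ACK is clear, mirroring tcp[13] & (2|16) == 2."""
    return (flags & (TCP_SYN | TCP_ACK)) == TCP_SYN


print(is_new_connection(0x02))  # client SYN
print(is_new_connection(0x12))  # server SYN+ACK
print(is_new_connection(0x10))  # plain ACK
```

Other flag bits are ignored by the mask, so a SYN carrying ECN bits still counts as a new connection.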
43. Observing the network
Looking at just the data
arrival and departure times
tcpdump -nnq -tt -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
snoop -rq -ta -s 384 'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
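The arithmetic in that filter computes the TCP payload size: the IP total length, minus the IP header length, minus the TCP header length. A non-zero result means the packet carries data rather than being a bare ACK or handshake packet. A sketch of the same calculation over raw bytes, assuming an IPv4 packet that starts at the IP header (the function name and the hand-built sample packet are illustrative):

```python
import struct


def tcp_payload_length(packet):
    """Bytes of TCP payload in an IPv4 packet starting at the IP header."""
    total_length = struct.unpack("!H", packet[2:4])[0]  # ip[2:2]
    ip_header_len = (packet[0] & 0x0F) * 4              # (ip[0]&0xf)*4
    tcp = packet[ip_header_len:]
    tcp_header_len = (tcp[12] & 0xF0) // 4              # (tcp[12]&0xf0)/4
    return total_length - ip_header_len - tcp_header_len


def sample_packet(payload):
    """Hand-built packet: 20-byte IP header, 20-byte TCP header, payload."""
    ip = struct.pack("!BBH", 0x45, 0, 40 + len(payload)) + bytes(16)
    tcp = bytes(12) + bytes([0x50]) + bytes(7)  # data offset = 5 words
    return ip + tcp + payload


print(tcp_payload_length(sample_packet(b"hello")))
print(tcp_payload_length(sample_packet(b"")))
```

The TCP data offset lives in the high nibble of byte 12 and counts 32-bit words, so masking with `0xf0` and dividing by 4 is the same as shifting right by 4 and multiplying by 4.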
44. Observing the network
Finding the difference between
a client’s question and
a server’s answer
(tcpdump | awk filter).
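{
  # Split "addr.port" into separate fields, then blank out ':' and '='.
  gsub(".[0-9]+(: | >)"," & ");
  gsub("[:=]"," ");
  # EP is the client endpoint: the destination when the server (.80) is sending.
  EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);
  # Server speaks right after the client: print the elapsed time.
  if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }
  S[EP]= ($4==".80")?"S":"C";   # who spoke last for this endpoint
  L[EP]= $1;                    # when they spoke
}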
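For readers less fluent in awk, the same bookkeeping can be sketched in Python. This version assumes the packets have already been parsed into tuples, so it makes no claims about tcpdump's output format; the function name and the tuple layout are assumptions for illustration.

```python
def response_latencies(packets, server_port=80):
    """Yield (seconds, client) each time the server answers a client.

    packets -- iterable of (timestamp, src_ip, src_port, dst_ip, dst_port)
               for data-carrying packets, in capture order
    """
    last_sender = {}  # client endpoint -> "C" or "S"
    last_seen = {}    # client endpoint -> timestamp of its last packet
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        from_server = src_port == server_port
        client = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        # Only the first server packet after a client packet is an "answer".
        if from_server and last_sender.get(client) == "C":
            yield ts - last_seen[client], client
        last_sender[client] = "S" if from_server else "C"
        last_seen[client] = ts


packets = [
    (10.00, "10.0.0.1", 54321, "10.0.0.2", 80),  # client question
    (10.25, "10.0.0.2", 80, "10.0.0.1", 54321),  # server answer
    (10.30, "10.0.0.2", 80, "10.0.0.1", 54321),  # more answer data
]
for seconds, client in response_latencies(packets):
    print("%f %s:%d" % (seconds, client[0], client[1]))
```

Only the transition from client to server produces a measurement, so follow-up packets from the server are not counted as new answers.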