Monitoring and Observability in Complex Architectures

Tuesday, October 2, 2012

Hi! I’m @postwait

I founded @OmniTI
and @MessageSystems
and @Circonus


I am very active in @TheOfficialACM
participating in @ACMQueue
and the practitioners board.


I (regrettably) build complex systems.


Why we are here

We’re here to talk about
coping with breakage


Rule #1

Direct observation of failure
leads to quicker rectification.


Rule #2

You cannot correct
what you cannot measure.


Solution Approach #1

Debugging failures requires either
visibility into the
precipitating state


Precipitating State

Single threaded applications

✓ Easy


Precipitating State

Multi-threaded applications

✓ Challenging


Precipitating State

Distributed applications

here there be dragons


Solution Approach #2

or
direct observation of a
failing transaction
(and likely very many of them)


Direct Observation

Observing something fail...
is priceless.


Direct Observation

Observation leads to
intelligent questioning.


Direct Observation

Questioning leads to answers...
but only through more observation.

and herein lies the rub.


Leaning Towards Scientific Process

In production you don’t have
• repeatability
• control groups
• external verification

... or do you?


What’s monitoring got to do with it?

Monitoring is all about the
passive observation of
telemetry data.


Monitoring Telemetry

cannot pinpoint problems

can provide evidence of
the existence of a problem


Monitoring

Gives you evidence that
there is a problem


Monitoring

Gives you evidence that
you have fixed a problem
(or at least the symptoms)


Monitoring Tactically

If it could be of interest,
measure it and
expose the measurement


Monitoring: embedded

• statsd: https://github.com/etsy/statsd
• metrics: https://github.com/codahale/metrics
• resmon: http://labs.omniti.com/labs/resmon
• folsom: https://github.com/boundary/folsom
• metrics.js: https://github.com/mikejihbe/metrics
• metrics-net: https://github.com/danielcrenna/metrics-net
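
As a rough illustration of how little code embedded instrumentation needs, here is a sketch of emitting statsd-style datagrams. The `name:value|type` line format and UDP port 8125 are the statsd defaults; the function names, the metric names and the host are assumptions made up for this example.

```python
import socket

# Metric types used here: c = counter, ms = timer, g = gauge.
KNOWN_TYPES = ("c", "ms", "g")


def statsd_line(name, value, kind):
    """Format one statsd payload as <name>:<value>|<type>."""
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unsupported metric type: {kind!r}")
    return f"{name}:{value}|{kind}"


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram per metric, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, `send(statsd_line("api.requests", 1, "c"))` would count one request and `send(statsd_line("api.latency", 320, "ms"))` would report a 320 ms service time.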


Monitoring: collection

• reconnoiter: http://labs.omniti.com/labs/reconnoiter
• circonus: http://circonus.com/
• graphite: http://graphite.wikidot.com/
• librato: https://metrics.librato.com/
• OpenTSDB: http://opentsdb.net/


Monitoring: Bling
visualizing an architecture rollout


Monitoring: Bling
visualizing the impact on service times


average API service time latency


actual API service time latency

http://www.slideshare.net/postwait/atldevops
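
The contrast between average and actual latency can be sketched with invented numbers. The sample data below is made up purely for illustration (90 fast requests and 10 slow ones); the helper names are hypothetical. The mean lands on a value that no request actually experienced, while percentiles and a histogram show both populations.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def histogram(samples, bucket_ms):
    """Count samples per bucket, keyed by the bucket's lower bound in ms."""
    counts = {}
    for sample in samples:
        bucket = int(sample // bucket_ms) * bucket_ms
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))


# Invented data: 90 requests served in 20 ms, 10 requests served in 920 ms.
latencies_ms = [20] * 90 + [920] * 10

mean_ms = statistics.mean(latencies_ms)     # 110: no request took this long
median_ms = percentile(latencies_ms, 50)    # 20: the typical request
p95_ms = percentile(latencies_ms, 95)       # 920: the slow population
buckets = histogram(latencies_ms, 100)      # {0: 90, 900: 10}
```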



Repeatability is a Pipe Dream

Your production problem is a
(hopefully pathological)
outcome of circumstance.

A circumstance which often
cannot be repeated.


Control Groups

Control groups can
compensate for the
inability to
precisely repeat an experiment.


Control Groups

Most architectures have redundancy.


Control Groups

With the right design,
you can turn that redundancy
into a debugging environment.

[1] http://omniti.com/surge/2012/sessions/xtreme-deployment


Control Groups: Simple Example

I have 10 web servers
I fix 1
I verify 1 is fixed
I verify 9 are still broken
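
The ten-server example can be sketched as a comparison between the treated host and the untouched control group. Everything here is hypothetical: the host names, the counts, and the `min_ratio` threshold are invented, and a real check would need to consider traffic mix and sample size.

```python
def error_rate(hosts):
    """hosts maps host name -> (errors, requests); returns the pooled error rate."""
    errors = sum(e for e, _ in hosts.values())
    requests = sum(r for _, r in hosts.values())
    return errors / requests if requests else 0.0


def fix_is_supported(treated, control, min_ratio=5.0):
    """True only if the control group is still failing AND the treated
    group fails at least `min_ratio` times less often."""
    treated_rate = error_rate(treated)
    control_rate = error_rate(control)
    return control_rate > 0 and treated_rate * min_ratio <= control_rate


# Invented numbers: one fixed server, nine servers left alone.
treated = {"web01": (2, 1000)}
control = {f"web{n:02d}": (50, 1000) for n in range(2, 11)}
```

If the control group recovers at the same time as the treated host, the check returns False: something other than the fix changed, which is exactly what the control group is there to reveal.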


Control Groups: Seems Easy

Web servers tend to be:
• homogeneous
• share-(nothing|little)
• independent


Control Groups: Not So Easy

Most other services aren’t so
homogeneous and equal:
databases, batch processes (think
billings), orchestration middleware,
message queues


Observability

Some might claim that
seeing telemetry data is
observation...

It is doubly indirect at best.


Observability

I want to
directly see
the
errant behaviour


Observability is forgiving

In complex, multi-component
architectures, errors can be
observed as errant behaviour in
many junction points.


Observing the network

tcpdump / snoop
wireshark



Looking at just the
arrival of new connections

tcpdump -nnq -tttt -s384 \
  'tcp port 80 and (tcp[13] & (2|16) == 2)'
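
The filter expression reads byte 13 of the TCP header, which holds the flags. Bit value 2 is SYN and 16 is ACK, so masking with both and comparing to 2 keeps packets with SYN set and ACK clear, i.e. a client opening a connection. The same test, written out as a sketch (the function name is made up for illustration):

```python
SYN = 0x02  # bit value 2 in the TCP flags byte
ACK = 0x10  # bit value 16 in the TCP flags byte


def is_new_connection(flags):
    """True for SYN without ACK; other flag bits are ignored by the mask."""
    return (flags & (SYN | ACK)) == SYN
```

A SYN+ACK (the server's reply, flags `0x12`) is rejected, so each new connection is counted once.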



Looking at just the data
arrival and departure times
tcpdump -nnq -tt -s 384 \
  'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'

snoop -rq -ta -s 384 \
  'tcp port 80 and (((ip[2:2] - ((ip[0]&0xf)*4)) - ((tcp[12]&0xf0)/4)) != 0)'
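
The arithmetic in that filter is: IP total length, minus IP header length, minus TCP header length. Anything left over is payload, so the filter drops bare ACKs and handshake packets. A sketch of the same calculation over a raw IPv4 packet (the function name is hypothetical, and the input is assumed to start at the IP header with no link-layer framing):

```python
import struct


def tcp_payload_length(packet: bytes) -> int:
    """Bytes of TCP payload in a raw IPv4 packet, as the filter computes it."""
    # ip[2:2]: 16-bit big-endian total length of the IP datagram.
    total_length = struct.unpack("!H", packet[2:4])[0]
    # ip[0] & 0xf: header length in 32-bit words, so multiply by 4 for bytes.
    ip_header_len = (packet[0] & 0x0F) * 4
    # tcp[12] & 0xf0: data offset in the high nibble (32-bit words).
    # Dividing the masked byte by 4 equals shifting right 4 then multiplying by 4.
    tcp_header_len = (packet[ip_header_len + 12] & 0xF0) // 4
    return total_length - ip_header_len - tcp_header_len
```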


Finding the difference between
a client’s question and
a server’s answer
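(tcpdump | awk filter).
{
# Split the port off each address: "10.0.0.1.80:" becomes "10.0.0.1 .80: "
gsub("\\.[0-9]+(: | >)"," & ");
gsub("[:=]"," ");
# EP is the client endpoint, whichever direction this packet travels
EP=sprintf("%s%s", ($4==".80")?$6:$3, ($4==".80")?$7:$4);

# Server packet right after a client packet: print the elapsed time
if(S[EP] == "C" && $4 == ".80") { printf("%f %s\n", $1 - L[EP], EP); }

# Remember who spoke last on this endpoint, and when
S[EP]= ($4==".80")?"S":"C";
L[EP]= $1;
}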
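
The same state machine may be easier to follow outside awk. This sketch assumes the packets have already been parsed into tuples, which sidesteps tcpdump's output format entirely; the tuple layout and function name are assumptions for illustration. It tracks, per client endpoint, who spoke last and when, and reports the gap whenever the server answers a client packet.

```python
def response_latencies(packets, server_port=80):
    """packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port).

    Returns a list of (client_endpoint, seconds) for every server packet
    that directly follows a client packet on the same endpoint.
    """
    last_sender = {}  # client endpoint -> "C" or "S"
    last_time = {}    # client endpoint -> timestamp of the last packet seen
    results = []
    for ts, src_ip, src_port, dst_ip, dst_port in packets:
        from_server = src_port == server_port
        client = (dst_ip, dst_port) if from_server else (src_ip, src_port)
        if from_server and last_sender.get(client) == "C":
            results.append((client, ts - last_time[client]))
        last_sender[client] = "S" if from_server else "C"
        last_time[client] = ts
    return results
```

Only the first server packet after a client packet produces a sample, so a multi-packet response is measured once, at its first byte.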


Observing user-space

strace[1] / truss
gstack / pstack
gcore + gdb / dbx / mdb[2]

[1] http://www.cli.di.unipi.it/~gadducci/SOL-11/Local/referenceCards/LINUX_System_Call_Quick_Reference.pdf
[2] http://hub.opensolaris.org/bin/download/Community+Group+mdb/tips/mdb-cheatsheet.pdf


System call tracing

Watching sshd
is a good way to get familiar.
truss -f -p `pgrep sshd`


System call tracing

An active web server is going to be
like a firehose.
truss -f -p `pgrep httpd`
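
One way to cope with the firehose is to summarize it. The sketch below counts system calls by name from captured trace text. The line formats it accepts (an optional `[pid N]` or `N:` prefix, then `name(`) are an assumption about typical strace and truss output and will not cover every variant; the function name is made up. Both tools also offer a `-c` option that produces a count summary directly.

```python
import re
from collections import Counter

# Assumed line shapes: "[pid 123] read(...", "123:\tread(...", or "read(...".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+:?\s+)?(\w+)\(")


def syscall_counts(lines):
    """Count system calls by name; lines that do not look like calls are skipped."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it a few seconds of saved trace output and printing `syscall_counts(lines).most_common(10)` gives a quick picture of where the process spends its calls.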


Observing the system

DTrace

Live production demo or GTFO.


Thank You

Questions?

