From Zero To Visibility
Published on

Monitorama Portland 2014
Portland, OR
2014-05-05 to 2014-05-07


When I joined a startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The devteam dreaded oncall, and our Mean Time To Lost Sleep was way too low. Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring. Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. Thrill! to the victories. Cringe! at the rewards of hubris. Share! your own insights, because this tale never really ends.

Published in: Technology, Design
Transcript of "From Zero To Visibility"

  1. From Zero To Visibility Bridget Kromhout
  2. 8thbridge.com small social commerce startup acquired in the last month by Fluid, Inc. small devteam I am the ops team http://www.thedirtbox.com/wp-content/uploads/2013/01/ping-pongart.jpg
  3. twisty maze of little shell scripts http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg
  4. time-consuming to understand difficult to modify doesn’t scale artisanal monitoring?! http://shop.bespokebacon.com/images/bespoke-logo.final(3).png
  5. New Relic pros: nice graphs application-level view good error analysis cons: slow to update many false-positive alerts high prices (better now)
  6. motivating change http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries
  7. as hideous as you remember
  8. “Horrendous interface” “Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912.” https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/ not alone!
  9. “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.” who watches the RabbitMQ? -- @murphy_slaw (via @lozzd) http://images.sodahead.com/profiles/0/0/0/5/1/6/6/3/9/Watchmen-trademark-symbol-62141795529.jpeg http://portertech.ca/images/2011-11-01/sensu-diagram.png
  10. hating on nagios: the middle years
  11. “hadoop does not suffer from a paucity of configuration options” http://jaganesundar.wordpress.com/2011/12/05/installing-and-configuring-hadoop-0-20-205-using-it-rpm/ monitor all the ports?! best way to monitor HBase: hbck: the HBase consistency checker nagios -> bash script -> parsing output of hbck http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
  12. http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png
  13. “Cyber” monday: 1988 called; wants its word back.
  14. wow. such nosql. very webscale. “a single write operation holds the lock exclusively, and no other read or write operations may share the lock.”
  15. “If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.” Ian Malpass, Etsy http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
  16. the (former) state of our graphite & statsd ● Graphite 0.9.9 ○ hand-rolled ○ over 2 years old ○ missing new features (Consolidate by!) ● StatsD was newish, but… ○ hand-rolled ○ running in a screen session ○ on a special snowflake box
  17. http://media-cache-ec0.pinimg.com/736x/68/c2/9d/68c29deb72bad94cd4e3c1aa0f3cdcd8.jpg this is wrong tool. never use this.
  18. Community cookbooks? ● StatsD ○ https://github.com/librato/statsd-cookbook ● Graphite ones good, but… ○ focus on Apache (we use nginx) ○ we haven’t moved to Chef 11 (gasp!)
  19. when in doubt: tcpdump is your friend http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
  20. carbon-aggravator (between 0.9.10 & 0.9.12) # If set true, metric received will be forwarded to # DESTINATIONS in addition to # the output of the aggregation rules. If set false # the carbon-aggregator will # only ever send the output of aggregation. FORWARD_ALL = True
  21. carbonate: A+++ would clone again whisper-fill.py backfill datapoints between whisper files
  22. life as a third wheel party thresholds: because not every outage is abrupt normal traffic decision to turn off decision to turn back on accidental removal
  23. open-source error reporting
  24. all the things StatsD Application-level error analysis Alarms for autoscaling Timers & counters Log & host-level Hadoop & HBase visualization MongoDB Graphs Time-series data graphing client-side plugins Threshold-based alarms Dashboard external checks
  25. What’s next? http://blog.xebia.fr/wp-content/uploads/2013/12/file-logstash-es-kibana.png
  26. what even is ideal monitoring solution http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg ❏ finds real problems ❏ actionable alerting ❏ usable by all ❏ …?
  27. questions; comments; whatnot Twitter: @bridgetkromhout Email: bridget@kromhout.org In person: DevOps Days Minneapolis (devopsdays.org)
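
Slide 11 describes the HBase check as "nagios -> bash script -> parsing output of hbck". The talk does not show that script, so the following is only a rough sketch of the parsing step, written in Python rather than bash. It assumes the hbck report ends with a line such as "N inconsistencies detected." followed by "Status: OK" or "Status: INCONSISTENT"; that layout and the function name `check_hbck_output` are assumptions for illustration. The exit codes follow the usual Nagios plugin convention.

```python
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_hbck_output(text):
    """Map hbck report text to a (nagios_exit_code, message) pair.

    Assumes (not confirmed by the talk) that the report contains
    "<N> inconsistencies detected" and a "Status: <WORD>" line.
    """
    count = re.search(r"(\d+) inconsistencies detected", text)
    status = re.search(r"^Status:\s*(\w+)", text, re.MULTILINE)
    if status is None:
        # No recognisable summary: report UNKNOWN instead of guessing.
        return UNKNOWN, "HBCK UNKNOWN - no Status line found in hbck output"
    found = count.group(1) if count else "?"
    if status.group(1) == "OK":
        return OK, f"HBCK OK - {found} inconsistencies detected"
    return CRITICAL, (
        f"HBCK CRITICAL - status {status.group(1)}, "
        f"{found} inconsistencies detected"
    )
```

A real plugin would capture the output of `hbase hbck`, pass it to this function, print the message, and exit with the returned code so that Nagios can act on it.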
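
Slides 19 and 24 mention watching StatsD traffic with tcpdump and using timers and counters. As a minimal sketch of what those packets look like, the snippet below builds StatsD lines in the plain-text `name:value|type` form and sends them over UDP. The metric names are made up for illustration, and the host and port are assumptions (8125 is the usual StatsD default), not details from the talk.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter line, e.g. "checkout.errors:1|c"."""
    return f"{name}:{value}|c"


def format_timer(name, millis):
    """Build a StatsD timer line, with the duration in milliseconds."""
    return f"{name}:{millis}|ms"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line as a single UDP datagram (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Calling `send_metric(format_counter("checkout.errors"))` while capturing UDP traffic on the StatsD port should show the line as the packet payload, which is a quick way to confirm that an application is emitting what you expect.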
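
Slide 20 quotes the carbon-aggregator comment for FORWARD_ALL. The toy model below is not carbon's code. It is a small sketch of the behaviour that comment describes, assuming a single hypothetical rule that sums a fixed set of input metrics into one output metric. The function name, its parameters, and the metric names in the usage note are all invented for illustration.

```python
def aggregate(datapoints, output_name, inputs, forward_all):
    """Model one aggregation interval.

    datapoints:  list of (metric_name, value) pairs received in the interval
    output_name: name of the aggregated metric (hypothetical rule output)
    inputs:      set of metric names the hypothetical rule sums together
    forward_all: if True, also pass the received datapoints through
    """
    matched = [value for name, value in datapoints if name in inputs]
    out = []
    if forward_all:
        # FORWARD_ALL = True: originals are forwarded in addition
        # to the output of the aggregation rule.
        out.extend(datapoints)
    if matched:
        out.append((output_name, sum(matched)))
    return out
```

With inputs `web1.requests` and `web2.requests`, `forward_all=True` yields both originals plus the summed `all.requests` value, while `forward_all=False` yields only the sum. If the storage layer also receives the originals by another route, the `True` setting is one way to end up with duplicated datapoints.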