Monitoring at a SAAS Startup: Tradeoffs and Tools

I gave this talk at MinneBar 2014: http://sessions.minnestar.org/sessions/162

When I joined a SaaS startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The devteam dreaded oncall, and our Mean Time To Lost Sleep was way too low.

Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring. Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. We'll talk tools and pitfalls, missteps and dead ends, and everything we haven't yet done but should.

Tools covered will include Nagios, StatsD, Graphite, and Sentry, with some digressions into others such as New Relic and MMS.

Published in: Technology, Business

Monitoring at a SAAS Startup: Tradeoffs and Tools

  1. Monitoring at a SaaS Startup: Tradeoffs and Tools
      Bridget Kromhout
  2. 8thbridge.com
      small social commerce startup
      acquired in the last week by Fluid, Inc.
      small devteam
      I am the ops team
  3. twisty maze of little shell scripts
      bespoke artisanal monitoring
      difficult to modify; doesn’t scale
      http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg
  4. New Relic
      pros:
        nice graphs
        application-level view
        good error analysis
      cons:
        slow to update
        many false-positive alerts
        high prices (better now)
  5. Motivating Change
      http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries
  6. : as hideous as you remember
  7. https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
      “Horrendous interface”
      “Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912.”
  8. “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
      -- @murphy_slaw (via @lozzd)
  9. HBase: monitor all the ports?!?
      hbck: the HBase consistency checker
      nagios -> bash script -> parsing output of hbck
      http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
  10. adding alert after alert after...
  11. http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png
  12. MMS (MongoDB Monitoring Service)
  13. “cyber” monday: 1988 called; wants its word back.
      the rewards of hubris
      MMS showed the issue
      but we weren't alerting on it
      didn't understand the global write lock
  14. If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it.
      -- @indec
      http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
  15. Graphite & StatsD
      ➔ Graphite
        ◆ Store and visualize time-series data
        ◆ http://graphite.readthedocs.org/
      ➔ StatsD
        ◆ Measure everything! (Timers, counters, events, …)
        ◆ https://github.com/etsy/statsd/
  16. Where we were
      ➔ Graphite 0.9.9 (wanted 0.9.12)
        ◆ over 2 years old
        ◆ missing new features (Consolidate by!)
      ➔ StatsD was newish, but…
        ◆ hand-rolled
        ◆ running in a screen session
        ◆ on a special snowflake box
  17. Community cookbooks?
      ➔ Graphite ones good, but…
        ◆ focus on Apache (we use nginx)
        ◆ we haven’t moved to Chef 11 (gasp!)
      ➔ StatsD
        ◆ https://github.com/librato/statsd-cookbook
        ◆ launches daemons via upstart
        ◆ generates config file based on attributes
  18. Graphite cookbook (Part 1)
      ➔ Install in a virtualenv (django, uwsgi, nginx)
      ➔ Dependencies recommended
        ◆ https://github.com/graphite-project/graphite-web/blob/master/requirements.txt
      ➔ libcairo2-dev package on Ubuntu 12.04 LTS
      ➔ install graphite’s 3 parts via pip
  19. Graphite cookbook (Part 2)
      ➔ graphite-web
        ◆ Django app, renders graphs
      ➔ whisper
        ◆ fixed-size database for storing time-series data
        ◆ like RRD
      ➔ carbon
        ◆ carbon-cache.py - stores data
        ◆ carbon-aggregator.py - buffers, then stores
        ◆ carbon-relay.py - for sharding/replication
  20. when in doubt: tcpdump is your friend
      http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
  21. carbon-aggravator (between 0.9.10 & 0.9.12)
      # If set true, metric received will be forwarded to DESTINATIONS in addition to
      # the output of the aggregation rules. If set false the carbon-aggregator will
      # only ever send the output of aggregation.
      FORWARD_ALL = True
  22. Carbonate
      whisper-fill.py
      backfill datapoints between whisper files
  23. 2am: sudden drop-off
      8am: look at graphs: ?!?!
      10am: and we’re back.
  24. What’s next?
  25. ❏ finds real problems
      ❏ actionable alerting
      ❏ usable by all
      ❏ …?
      the ideal monitoring solution...
      http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg
  26. What we’re actually using now
      StatsD Application-level error analysis Alarms for autoscaling Timers & counters Log & host-level Hadoop & HBase visualization MongoDB Graphs Time-series data graphing client-side plugins External uptime checks oncall rotation/alerting Threshold-based alarms Dashboard
  27. Discuss!
      Twitter: @bridgetkromhout
      Email: bridget@kromhout.org
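
The StatsD slides mention timers and counters without showing what an application actually sends. As a rough sketch (not taken from the talk), StatsD accepts plain-text lines of the form `name:value|type` over UDP, where `c` marks a counter and `ms` a timer. The metric names, the `127.0.0.1` host, and the helper functions below are hypothetical; port 8125 is the usual StatsD default and may differ in a given setup.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build one StatsD line, e.g. 'checkout.orders:1|c'."""
    line = f"{name}:{value}|{metric_type}"
    # An optional sample rate tells StatsD the client only sent a fraction of events.
    if sample_rate is not None and sample_rate < 1:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget: one UDP datagram per metric line, no reply expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, for illustration only.
counter_line = format_metric("checkout.orders", 1, "c")        # counter: add 1
timer_line = format_metric("checkout.render_time", 320, "ms")  # timer: 320 ms
sampled_line = format_metric("checkout.page_views", 1, "c", sample_rate=0.1)
```

Calling `send_metric(counter_line)` from application code would emit the counter. Because the transport is UDP, a missing or restarted StatsD daemon does not block the caller, which is also why tcpdump is handy for checking what actually goes over the wire.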

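Slide 9 describes the chain "nagios -> bash script -> parsing output of hbck". The talk used a bash script; the following is a Python sketch of the same idea, not the author's script. It assumes that `hbase hbck` prints a line such as `Status: OK` or `Status: INCONSISTENT` plus a line like `N inconsistencies detected.`, and it maps the result to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 300-second timeout are hypothetical.

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_hbck(output):
    """Turn hbck text output into (exit_code, one-line message)."""
    # Assumed output format: a 'Status: <WORD>' line and an 'N inconsistencies detected' line.
    status = re.search(r"^Status:\s*(\w+)", output, re.MULTILINE)
    count = re.search(r"(\d+) inconsistencies detected", output)
    if status is None:
        return UNKNOWN, "UNKNOWN: no Status line in hbck output"
    detail = f"{count.group(1)} inconsistencies" if count else "count not reported"
    if status.group(1) == "OK":
        return OK, f"OK: hbck status OK ({detail})"
    return CRITICAL, f"CRITICAL: hbck status {status.group(1)} ({detail})"


def main():
    """Run hbck, print a single status line for Nagios, return the exit code."""
    try:
        result = subprocess.run(
            ["hbase", "hbck"], capture_output=True, text=True, timeout=300
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"UNKNOWN: could not run hbck: {exc}")
        return UNKNOWN
    code, message = classify_hbck(result.stdout)
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main())` so that Nagios sees the exit code. Keeping the parsing in `classify_hbck` separate from the subprocess call makes it possible to test the check against saved hbck output without a running cluster.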