A Whirlwind Tour of
Etsy's Monitoring Stack
Daniel Schauenberg
dschauenberg@etsy.com
@mrtazz
Item by TheBackPackShoppe
How comfortable are you deploying a change right now?
“If this is your first day at Etsy, you deploy the site”
Ganglia
• System level metrics
• Instance per DC/environment
• > 220k RRD files
• Fully configured through Chef role attributes
Rainbow Graphs!
StatsD
• Single instance on one server
• Traffic mostly from 70 Web & 24 API servers
• Node.js
• Heavy Sampling
• Graphite as backend
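
As a rough illustration of what "heavy sampling" means on the client side, here is a minimal sketch of a sampled counter using the StatsD line format (`name:value|c|@rate`) over UDP. The metric name, the address, and the `rng` parameter are assumptions made for the example, not details from the talk.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0):
    """Build a StatsD counter line, e.g. 'web.requests:1|c|@0.1'."""
    line = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # The rate tells the server to scale the count back up.
        line += f"|@{sample_rate}"
    return line


def send_counter(name, value=1, sample_rate=1.0,
                 addr=("127.0.0.1", 8125), rng=random.random):
    """Send a counter with probability sample_rate.

    Returns the line that was sent, or None if this call was sampled out.
    The address is a placeholder; 8125 is the conventional StatsD port.
    """
    if sample_rate < 1.0 and rng() >= sample_rate:
        return None  # skipped: no packet leaves the client
    line = format_counter(name, value, sample_rate)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), addr)  # fire and forget
    finally:
        sock.close()
    return line
```

With a sample rate of 0.1 only about one call in ten sends a packet, which is one way a single StatsD instance can keep up with traffic from many web and API servers.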
Graphite
• Application level metrics
• 96 GB RAM, 20 cores, 7.3 TB SSD RAID 10
• 525k metrics/minute
• Mirrored Master/Master Setup
• Functionally sharded relays
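
To make "functionally sharded relays" more concrete, here is a small sketch of first-match routing of metric paths to named shards, plus the Graphite plaintext line format (`path value timestamp`). The shard names are the ones that appear on the slides; the prefix patterns and metric names are invented for illustration and are not Etsy's actual rules.

```python
import re

# (shard name, pattern) pairs, checked in order; the first match wins.
# Shard names are from the slides, the patterns are hypothetical.
RULES = [
    ("statsdtimers", re.compile(r"^stats\.timers\.")),
    ("statsdcounts", re.compile(r"^stats_counts\.")),
    ("statsd", re.compile(r"^stats\.")),
    ("chef", re.compile(r"^chef\.")),
    ("logster", re.compile(r"^logster\.")),
    ("fqld", re.compile(r"^fqld\.")),
    ("search", re.compile(r"^search\.")),
]


def pick_shard(metric_path):
    """Return the shard for a metric path, falling back to 'generic'."""
    for shard, pattern in RULES:
        if pattern.search(metric_path):
            return shard
    return "generic"


def plaintext_line(metric_path, value, timestamp):
    """Format one data point in Graphite's plaintext protocol."""
    return f"{metric_path} {value} {int(timestamp)}\n"
```

Because the more specific timer rule is listed before the general `stats.` rule, `pick_shard("stats.timers.web.render.mean")` lands on `statsdtimers` while `pick_shard("stats.web.requests")` lands on `statsd`; anything unmatched goes to `generic`.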
[Diagram: Graphite setup. Labels: CNAME, relays (two), caches (two). Relay shards: statsdtimers, statsdcounts, statsd, chef, logster, fqld, search, generic]
syslog-ng
• Web, Search, Gearman, Photos, Nagios, Network, VPN
• 1.2 GB written/minute
• Chef role attribute based config
• Rule ordering!
github.com/etsy/logster
• Extract metrics from log files
• Written in Python
• Runs every minute via cron
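
The idea of extracting metrics from log files can be sketched in a few lines. This is a standalone illustration, not logster's own parser API: it counts HTTP status classes in access-log lines and formats them as Graphite-style data points. The log format, the `logster.web` prefix, and the function names are assumptions.

```python
import re
from collections import Counter

# Matches the status code that follows the quoted request in a
# common-log-format line: ... "GET / HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Count lines per HTTP status class (http_2xx, http_4xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


def to_graphite_lines(counts, prefix, timestamp):
    """Turn the counts into 'path value timestamp' strings, sorted by name."""
    return [f"{prefix}.{name} {value} {timestamp}"
            for name, value in sorted(counts.items())]
```

Run once a minute over the lines written since the previous run, a parser like this yields one data point per status class per minute.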
Splunk
• Indexes all of our log files
• Easy search for patterns
• Saved searches for interesting ones
• Basically using it as a glorified grep
Logstash
• Experiment status
• Makes it easier to integrate different sources
• Easy to set up in dev environment
• Trying to figure out where/how it fits into our infrastructure
Eventinator
• Tracks all events in our infrastructure
• Chef runs and changes
• DNS changes
• Network
• Deploys
• Server provisioning and decommissioning
• ~ 12 million events in the last 2 years
Chef
• rules everything around me
• Same cookbooks on prod and dev
• Every node runs Chef every 10 minutes
• A ton of knife plugins and handlers
> 120 recipes
Nagios
• 2 instances in each DC/environment
• Fully Chef generated configuration
• Service checks and contacts in git
• Notifications via email->SMS gateway
• ~75% ops on-call
github.com/lozzd/nagdash
Nagios Herald
• Add context to Nagios alerts
• What are the first 5 things you do when you get paged?
• You already have the phone in your hand
• Nagios notification handler
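
A notification handler that adds context can be sketched as a function that takes the alert details Nagios exposes as environment macros (`NAGIOS_HOSTNAME`, `NAGIOS_SERVICEDESC`, `NAGIOS_SERVICESTATE`, `NAGIOS_SERVICEOUTPUT`) and appends extra lines to the message. This is a Python illustration of the idea only, not Nagios Herald's own code; the `context_lines` input stands in for whatever a real handler would look up when the alert fires.

```python
def build_notification(env, context_lines):
    """Return (subject, body) for an alert, with optional context appended.

    env is a mapping of Nagios environment macros (for example os.environ).
    context_lines is a list of strings gathered by the handler; how they
    are gathered is left out of this sketch.
    """
    state = env.get("NAGIOS_SERVICESTATE", "UNKNOWN")
    host = env.get("NAGIOS_HOSTNAME", "?")
    service = env.get("NAGIOS_SERVICEDESC", "?")
    subject = f"{state}: {host}/{service}"

    body = [env.get("NAGIOS_SERVICEOUTPUT", "")]
    if context_lines:
        body.append("")
        body.append("Context:")
        body.extend(f"  {line}" for line in context_lines)
    return subject, "\n".join(body)
```

The point is that the first few things an on-call engineer would check by hand arrive with the page itself.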
The Toys are real
There’s another
side of heaven
Ops Weekly
Summary
• Set of trusted tools
• Enhance them where they fall short
• Try out new things
• Write tools where applicable
• Continuous monitoring and adaptation
codeascraft.com
etsy.com/codeascraft/talks
etsy.github.com
etsy.com/careers
Questions?