When Spotify started in 2006, with just 20 people, they were more worried about selling the idea of music streaming than about setting up monitoring systems. Fast forward to 2015:
more than 400 engineers are collecting more than 30 million time series from more than 10,000 hosts. So how did we get here? The journey has been a long one, with plenty of false starts and growing pains, from scaling systems to scaling teams to scaling the business itself, challenging what we thought we knew about operational monitoring at every step.
This talk is about some of the more interesting challenges we've faced along the way, and about what we've learned so far; covering some of the technical details but primarily focusing on the human aspects, and how our monitoring solutions have both shaped and been shaped by organizational structures and changing engineering practices.
2. ‣ Martin Parm, Danish, 36 years old
‣ Master’s degree in CS from Copenhagen University
‣ IT operations and infrastructure since 2004
‣ Joined Spotify in 2012 (and moved to Sweden)
‣ Joined Spotify’s monitoring team in February 2014
‣ Currently Product Owner for monitoring
About Martin Parm
3. This talk neither endorses nor condemns any
specific products, open source or otherwise. All
opinions expressed are in the context of Spotify’s
specific history with monitoring.
Disclaimer
4. This talk is not a sales pitch for our monitoring
solution. It’s a story about how it came to be.
Disclaimer (2)
6. ‣ Music streaming - and discovery
‣ Clients for all major operating systems
‣ Partner integration into speakers, TVs,
PlayStation, ChromeCast, etc.
‣ More than 20 million paid subscribers*
‣ More than 75 million daily active users*
‣ Paid back more than $3 billion in royalties*
The service - the right music for every
moment
* https://news.spotify.com/se/2015/06/10/20-million-reasons-to-say-thanks/
7. ‣ Major tech offices in Stockholm, Gothenburg,
New York and San Francisco
‣ Small sales and label relations offices in all
markets (read: countries)
‣ ~1500 employees worldwide
○ ~50% in Technology
○ ~100 people in IO, our infrastructure department
○ 6 engineers in our monitoring team
The people
8. ‣ 4 physical data centers
‣ ~10K physical servers
‣ Microservice infrastructure with ~1000 different
services
‣ Mostly Ubuntu Linux
The technology
9. ‣ Ops-In-Squads
○ Distributed operational responsibility in the feature teams
○ Monitoring as self-service for backend
‣ Two monitoring systems for backend
○ Heroic - time series based graphs and alerts
○ Riemann - event based alerting
‣ ~100M time series
‣ ~750 graph-based alert definitions
‣ ~1500 graph dashboards
Operations and monitoring
11. Spotify starts in 2006 - a different world
‣ No cloud computing
○ AWS had only just launched
○ Google App Engine and Microsoft Azure didn’t exist yet
‣ Few cloud applications
○ Gmail was still in limited public beta
○ Google Docs was still in limited testing
○ Facebook opened for public access in September
‣ No smartphones
○ Apple had not even unveiled the iPhone yet
○ Android didn’t become available until 2008
12. Minutes from meeting in June 2007
“Munin mail -
It's not sustainable to get
100 status mails per day
about full disks to dev. X
should turn them off.”
13. ‣ Sitemon; our first graphing system
○ Based on Munin but with a custom frontend
○ Metrics were pulled from hosts by aggregators
○ Metrics were written to several different RRD files with different resolutions
○ Static graphs were generated from these RRD files
○ One single dashboard for all systems and users
○ Main metrics: Backend Request Failures
First steps with monitoring
14. -- Emil Fredriksson, Operations director at Spotify
(Slightly paraphrased)
“Our first alerting system was an
engineer, who looked at graphs all
the time; day and night; weekends.
And he would start calling up
people, when something looked
wrong.”
15. ‣ Spotify launched in Sweden and UK in October
2008
‣ Zabbix was introduced in September 2009
‣ Alerts were sent as text messages to
Operations, who would then contact feature
developers
‣ Most common “alerting source”: Users
○ Operations had a permanent Twitter search
First steps with alerting
17. ‣ Opened our 3rd data center and grew to ~1000
hosts
‣ Spotify grew from ~100 to 400-600 people
worldwide in a few months
‣ Many new engineers didn’t have operational
experience or DevOps mentality
‣ A rift between dev and ops emerged...
2011-2012: The 2nd great hiring spree
18. ‣ Development sped up, with a vast increase in new services
‣ Stability and reliability were an increasing problem
‣ Service ownership was often unclear
‣ Changes came too frequently for a monolithic SRE team to keep up
2011-2012: The 2nd great hiring spree
19. ‣ Big incidents almost every week → The business was unhappy
‣ Constant panic and fire fighting → The SRE team was unhappy
‣ Policies and restrictions, and angry SRE
engineers → The feature developers were
unhappy
2011-2012: The 2nd great hiring spree
20. “The infrastructure and feature squads that write
services should also take responsibility for correct
day-to-day operation of individual services.”
‣ Capacity Planning
‣ Service Configuration and Deployment
‣ Monitoring and Alerting
‣ Defining and Managing to SLAs
‣ Managing Incidents
September 2012: Ops In Squads
21. Benefits
‣ Organizational Scalability
‣ Faster incident solving - getting The Right
Person™ on the problem faster
‣ Accountability - making the right people hurt
‣ Autonomy - feature teams make all their own
planning and decisions
September 2012: Ops In Squads
22. Human challenges
‣ Developers need training, but not a new
education
‣ Developers need autonomy, but will do stupid
things
‣ Developers need to care about monitoring and
alerting, but not the monitoring pipeline
September 2012: Ops In Squads
23. We needed infrastructure as services
‣ The classic Operations team was disbanded
‣ Operations engineers and tools teams were re-formed into IO, our infrastructure organization
‣ Teaching and self-service became a primary
priority
September 2012: Ops In Squads
24. “Creating IO was
probably one of the
smartest moves in
Spotify”
-- Previous Product Owner for Monitoring
(Slightly paraphrased)
26. ‣ Late 2011: Backend Infrastructure Team (BIT)
was formed
○ BIT was the first infrastructure team at Spotify
‣ Tasked with log delivery, monitoring and
alerting
‣ Development of Sitemon2 began
○ Meant to replace Sitemon
○ Still based on Munin, but with a Cassandra backend and much
more powerful frontend
Sitemon2 - The graphing system, which
never launched
27. ...but BIT was set up for failure from the start
‣ Sitemon2 was developed mostly in isolation and
with very little collaboration with developers
‣ Priority collisions: Log delivery was always
more critical than monitoring
‣ Scope creep: BIT tried to integrate Sitemon2
with analytics
Sitemon2 - The graphing system, which
never launched
28. We needed feature teams to take part in
monitoring, but Zabbix was too inflexible and hard
to learn.
‣ Late 2012: Development of OMG began
○ Event streaming processor similar to Riemann
○ Initial development was super fast and focused
○ Developed in collaboration with Operations
‣ A few teams adopted OMG, but.....
OMG - The alerting system no one
understood
30. The alerting rule language was Esper (EPL)
‣ Most engineers found the learning curve way
too steep and confusing
‣ Too few tools and libraries for the language
Why was this not caught?
‣ The Ops engineer assigned to the collaboration happened to also like Esper
OMG - The alerting system no one
understood
31. ‣ February 2013: One of our system architects
builds Monster as a proof-of-concept hack
project
○ In-memory time series database in Java
○ Based on Munin collection and data model
○ Metric data was pushed rather than pulled
○ The prototype was completed in 2 weeks
Monster
32. ‣ Pushing monitoring data was much more
reliable than pulling
‣ Querying and graphing were blazing fast
‣ The Operations engineers loved it!
‣ Sitemon kept running, but development of
Sitemon2 was halted
We’ll get back to the failure part...
Monster
34. ‣ First dedicated monitoring team at Spotify
‣ Assigned the task of “Providing self-service monitoring solutions for DevOps teams”
‣ Inherited Monster, Zabbix and OMG
‣ Calculation: Monster could survive a year, so
we focused on alerting first
Hero Squad
35. ‣ Replaced Zabbix and OMG with Riemann
○ “Riemann is an event stream processor”
○ Written in Clojure
○ Rules are also written in Clojure
‣ We built a support library with helper functions,
namespace support and unit testing
‣ Built a web frontend for reading the current
state of Riemann
Riemann as a self-service alerting
system
;; Some boilerplate code omitted
(def-rules
(where (tagged-any roles)
(tagged "monitoring-hooks"
(:alert target))
(where (service "vfs/disk_used/")
(spotify/trigger-on (p/above 0.9)
(with {:description "Disk is getting full"}
(:alert target))))))
Riemann rule written in Clojure
37. ‣ Success: Riemann was widely adopted
‣ Success: Riemann is a true self-service
○ Riemann rules live in a shared git repo, which gets automatically deployed
○ Each team/project has its own namespace
○ Unit tests ensure that rules work as intended
○ Peak: 36 namespaces and ~5000 lines of Clojure code
‣ Failure: Many engineers didn’t understand or
like the Clojure language
Riemann as a self-service alerting
system
39. ‣ Sharding and rebalancing quickly became a
serious operational overhead
○ The Whisper write pattern involved randomly seeking and writing across a lot of files – one for each series
○ The Cyanite backend has recently addressed this
‣ Hierarchical naming
○ Example: “db1.cpu.idle”
○ Difficult to slice and dice metrics across dimensions, e.g. selecting all hosts in a site or all hosts running a particular piece of software
A brief encounter with collectd and
Graphite
40. ‣ Replace long hierarchical names with tags
‣ Compatible with what Riemann does with events
‣ Makes slicing and dicing metrics easy (see the sketch below)
.... but who supports it?
Metrics 2.0
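To make the contrast concrete, here is a purely illustrative sketch of the same measurement expressed first as a hierarchical Graphite-style name and then as a tagged, Metrics 2.0-style data point. The key and attribute names are invented for the example and are not Spotify’s actual schema.

;; Hierarchical: the meaning is locked into the position in the name.
;; "db1.cpu.idle"
;;
;; Tagged (illustrative only; field names are assumptions):
{:key        "cpu-idle-percentage"
 :value      87.3
 :time       1449060000
 :attributes {:host "db1"
              :site "lon"
              :role "database"}}
;; Selecting "all hosts in site lon" or "all database hosts" becomes a simple
;; filter on attributes instead of a naming-convention hack.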
41. ‣ We need quick adoption and commitment from
our feature teams
‣ Monitoring was still very immature and we need
room to experiment and fail
‣ Problem: Engineers get sick of migrations and
refactorings
‣ Solution: A flexible infrastructure and an “API”
Adoption vs. flexibility
42. ‣ Small daemon running on each host, which
forwards events and metrics to monitoring
infrastructure
‣ First written in Ruby, but later ported to Java
‣ Provides a stable entry point for our users
ffwd - a monitoring “API”
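As a sketch of what that stable entry point buys us: a service just fires a small JSON datagram at the local ffwd daemon and never needs to know what the pipeline behind it looks like. This is a minimal sketch, assuming ffwd’s JSON-over-UDP input on localhost port 19000; the port and the exact field names are assumptions and may differ per installation.

;; Minimal sketch of handing a metric to a local ffwd daemon.
;; Assumes org.clojure/data.json is on the classpath and that ffwd listens
;; for its JSON protocol on UDP port 19000 (an assumption, not a guarantee).
(require '[clojure.data.json :as json])
(import '(java.net DatagramSocket DatagramPacket InetAddress))

(defn send-metric [key value attributes]
  (let [payload (.getBytes (json/write-str {:key key
                                            :value value
                                            :attributes attributes}))
        socket  (DatagramSocket.)]
    (.send socket (DatagramPacket. payload (count payload)
                                   (InetAddress/getByName "localhost") 19000))
    (.close socket)))

;; Fire-and-forget from anywhere in the service:
(send-metric "request-latency-ms" 42 {:what "latency" :unit "ms"})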
43. Abstracting away the infrastructure from monitoring collection
[Diagram: metrics and events flow from ffwd into a “magical monitoring pipeline”, which produces alerts and pretty graphs]
44. ‣ Atlas, developed by Netflix
○ Hadn’t been open sourced yet
‣ Prometheus, developed by SoundCloud
○ Hadn’t been open sourced yet
‣ OpenTSDB, originally developed by
StumbleUpon
○ Was rejected because of bad experiences with HBase
‣ InfluxDB
○ Was too immature at the time
Tripping towards graphing; it’s all about
the timing
45. ‣ Time series database written in Java and
backed by Cassandra
‣ Looked promising at first
○ We deployed it and killed Sitemon for good
○ We quickly ran into problems with the query engine
‣ Timing: The two main developers got hired by
DataStax (the Cassandra company)
○ KairosDB development ground to a halt
KairosDB
46. ‣ Originally written as an alternative query engine
for KairosDB
‣ We kept using the KairosDB database schema
and KairosDB metric writers
‣ June 2014: We dropped KairosDB and Heroic
became a stand-alone product
‣ Elasticsearch is used for indexing time series metadata
May 2014: The birth of Heroic
47. Monster couldn’t scale, but this was not obvious to
the users
‣ When it worked, it was blazing fast and beat all
other solutions
‣ When it broke, it crashed and required the
attention of the monitoring team, but most
users never knew
‣ Only visible sign: shorter and shorter history
Back to the Monster failure
48. ‣ Failure: Because Monster was loved, and the
users weren’t experiencing the pain when it
broke, many teams resisted migrating from
Monster
‣ Result: We didn’t manage to shut down
Monster until August 2015
‣ In its last 6 weeks, Monster crashed 51(!) times
Back to the Monster failure
50. Alerting was becoming a problem again
‣ Scaling Riemann with the increasing number of
metrics became hard
○ We began sharding, but some groups of hosts were still too big
‣ Writing reliable sliding window rules in Riemann
was hard
○ Learning Riemann and Clojure was the most common
complaint from our users
‣ One team dropped our monitoring solution and
moved to an external vendor
51. ‣ Simple thresholds on time series using the
same backend, data and query language
○ 3 operators: Above, Below or Missing for X time
‣ Integrated directly into our frontend
○ No code, no fancy math, just a line on a graph
Graph-based alerting
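Conceptually, each of these alerts boils down to a single query plus an operator, a threshold and a duration. The sketch below is purely illustrative of that shape; it is not Heroic’s actual alert definition format, and the field names and query string are invented.

;; Illustrative only – not Heroic's real alert format.
{:query     "disk-used-percentage, averaged per host, for role = database"
 :operator  :above            ;; one of :above, :below, :missing
 :threshold 0.9               ;; the line drawn on the graph
 :for       "10m"             ;; the condition must hold this long before alerting
 :notify    "team-database"}  ;; invented alert target

The point of the “no code, no fancy math” constraint is that the alert is exactly what the team already sees on its dashboard: a line on a graph.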
53. ‣ Our engineers love it!
○ Thousands of lines of Riemann code were ripped out
○ Many teams have migrated completely away from Riemann
○ We saw a massive speed-up in adoption of monitoring; both data collection and definitions of dashboards and alerts
‣ Many monitoring problems can indeed be
expressed as a simple threshold
Graph-based alerting
55. ‣ We are currently collecting ~10TB of metrics
per month worldwide
○ 30TB of storage in Cassandra due to replication factor
‣ ~80% of our data was collected within the last 6 months
Adoption of Heroic
56. The final current picture
[Diagram: metrics and events flow from ffwd through Apache Kafka into Riemann and Heroic, producing alerts and pretty graphs]
57. ‣ ffwd and ffwd-java have been developed as Open Source software from the start
‣ Heroic was released as Open Source software
yesterday
○ Blog post: “Monitoring at Spotify: Introducing Heroic”
○ Other components are being released later
We finally Open Sourced it
59. ● Learning a new monitoring system is an investment
● Legacy systems are almost always the hardest
But it gets worse...
● Almost all systems end up as legacy
● You probably haven’t installed your last monitoring system
Migrations are hard and expensive
Suggestions:
● Consider having abstraction layers
● Beware of vendor lock-in
○ Open Source software is not safe
● Sometimes it’s cheaper to keep a migration layer for legacy systems than to migrate
60. ● The monitoring/operations team are experts; feature developers might not be
● User experience matters for adoption
● The learning curve affects the cost of adoption for teams
User experience and learning curve matter
● A technically superior solution is worthless if your users don’t understand it
● Providing good defaults and an easy golden path will not only drive adoption, but also prevent users from making common mistakes
61. When collection is easy and performance is good, engineers will start using the monitoring system as a debugger.
● Storage is cheap but not free
● The operational cost of keeping debugging data highly available is significant
Beware of scope creep in monitoring
When graphing is easy, pretty and powerful, people will start using monitoring for business analytics.
● Monitoring is supposed to be reliable, but not accurate
62. ● Seems very intuitive
● Fragile, sensitive to latency
and sporadic failures
● Noise for alerting
● What are you really
measuring?
● Solution: Convert your
problem into a metric by
interpreting close to the
source
Heartbeats are hard to get right
63. ● We used to send events on every Puppet run
● Teams would make monitoring rules for failed Puppet runs
and absent Puppet runs
● Problem: Absent Puppet runs look exactly the same when
○ Puppet is disabled
○ Network is down
○ Host is down
○ Host has been decommissioned
● Solution: Emit “Time since last successful Puppet run” metric
instead
○ Now we can do simple thresholds, which are easy to reason about (see the sketch below)
Heartbeats example: Puppet runs
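To make “interpret close to the source” concrete, here is a minimal sketch of emitting such a staleness metric, reusing the hypothetical send-metric helper from the ffwd sketch earlier. It assumes a post-run hook that touches a marker file only on successful Puppet runs; the file path and metric name are invented for the example.

;; Run from cron every minute or so. The marker file is hypothetical,
;; touched by a post-run hook only when a Puppet run succeeds.
(def marker "/var/lib/puppet/state/last_successful_run")

(defn seconds-since-last-success [path]
  (let [f (java.io.File. path)]
    (if (.exists f)
      (quot (- (System/currentTimeMillis) (.lastModified f)) 1000)
      Long/MAX_VALUE)))          ;; no successful run recorded yet: report "infinitely stale"

;; A single, always-present number that a plain "above X for Y minutes"
;; threshold can reason about, instead of an absent heartbeat event.
(send-metric "puppet-seconds-since-last-success"
             (seconds-since-last-success marker)
             {:what "puppet-staleness" :unit "s"})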
64. ● Indexing 100M time series
is hard
● Browsing 100M time series
is hard
○ UI design - getting an overview
of 100M time series is hard
○ Understanding a graph with
thousands of lines is difficult for
humans
● Your data will keep growing
The next big scaling problem is very
human: data discovery
● Anomaly detection and
machine learning might
help us
○ Many new and upcoming
product looks promising
○ ...but still largely an unsolved
problem
65. Thank you for
your time and
patience!
Martin Parm
email: parmus@spotify.com
twitter: @parmus_dk