
OSMC 2015: Monitoring at Spotify - When things go ping in the night, by Martin Parm


When Spotify started in 2006, with just 20 people, they were more worried about selling the idea of music streaming than about setting up monitoring systems. Fast forward to 2015, and more than 400 engineers are collecting more than 30 million time series from more than 10,000 hosts; so how did we get here? The journey has been a long one, with plenty of false starts and growing pains, from scaling systems to scaling teams to scaling the business itself, challenging what we thought we knew about operational monitoring at every step.

This talk is about some of the more interesting challenges we've faced along the way, and about what we've learned so far; covering some of the technical details but primarily focusing on the human aspects, and how our monitoring solutions have both shaped and been shaped by organizational structures and changing engineering practices.



  1. 1. Monitoring at Spotify - When things go ping in the night Martin Parm, Product owner for Monitoring
  2. 2. ‣ Martin Parm, Danish, 36 years old ‣ Master’s degree in CS from Copenhagen University ‣ IT operations and infrastructure since 2004 ‣ Joined Spotify in 2012 (and moved to Sweden) ‣ Joined Spotify’s monitoring team in February 2014 ‣ Currently Product Owner for monitoring About Martin Parm
  3. 3. This talk neither endorses nor condemns any specific products, open source or otherwise. All opinions expressed are in the context of Spotify’s specific history with monitoring. Disclaimer
  4. 4. This talk is not a sales pitch for our monitoring solution. It’s a story about how it came to be. Disclaimer (2)
  5. 5. Spotify - what we do
  6. 6. ‣ Music streaming - and discovery ‣ Clients for all major operating systems ‣ Partner integration into speakers, TVs, PlayStation, ChromeCast, etc. ‣ More than 20 million paid subscribers* ‣ More than 75 million daily active users* ‣ Paid back more than $3 billion in royalties* The service - the right music for every moment * https://news.spotify.com/se/2015/06/10/20-million-reasons-to-say-thanks/
  7. 7. ‣ Major tech offices in Stockholm, Gothenburg, New York and San Francisco ‣ Small sales and label relations offices in all markets (read: countries) ‣ ~1500 employees worldwide ○ ~50% in Technology ○ ~100 people in IO, our infrastructure department ○ 6 engineers in our monitoring team The people
  8. 8. ‣ 4 physical data centers ‣ ~10K physical servers ‣ Microservice infrastructure with ~1000 different services ‣ Mostly Ubuntu Linux The technology
  9. 9. ‣ Ops-In-Squads ○ Distributed operational responsibility in the feature teams ○ Monitoring as self-service for backend ‣ Two monitoring systems for backend ○ Heroic - time series based graphs and alerts ○ Riemann - event based alerting ‣ ~100M time series ‣ ~750 graph-based alert definitions ‣ ~1500 graph dashboards Operations and monitoring
  10. 10. Our story begins back in 2006...
  11. 11. Spotify starts in 2006 - a different world ‣ No cloud computing ○ AWS had only just launched ○ Google App engine and Microsoft Azure didn’t exist yet ‣ Few cloud applications ○ GMail was still in limited public beta ○ Google Docs was still in limited testing ○ Facebook opened for public access in September ‣ No smartphones ○ Apple had not even unveiled the iPhone yet ○ Android didn’t become available until 2008
  12. 12. Minutes from meeting in June 2007 “Munin mail - It's not sustainable to get 100 status mails per day about full disks to dev. X should turn them off.”
  13. 13. ‣ Sitemon; our first graphing system ○ Based on Munin but with a custom frontend ○ Metrics were pulled from hosts by aggregators ○ Metrics were written to several different RRD files with different solutions ○ Static graphs were generated from these RRD files ○ One single dashboard for all systems and users ○ Main metrics: Backend Request Failures First steps with monitoring
  14. 14. -- Emil Fredriksson, Operations director at Spotify (Slightly paraphrased) “Our first alerting system was an engineer, who looked at graphs all the time; day and night; weekends. And he would start calling up people when something looked wrong.”
  15. 15. ‣ Spotify launched in Sweden and UK in October 2008 ‣ Zabbix was introduced in September 2009 ‣ Alerts were sent as text messages to Operations, who would then contact feature developers ‣ Most common “alerting source”: Users ○ Operations had a permanent twitter search First steps with alerting
  16. 16. 2011/2012: Ops in squads
  17. 17. ‣ Opened our 3rd data center and grew to ~1000 hosts ‣ Spotify grew from ~100 to 400-600 people worldwide in a few months ‣ Many new engineers didn’t have operational experience or DevOps mentality ‣ A rift between dev and ops emerged... 2011-2012: The 2nd great hiring spree
  18. 18. ‣ Development speed-up and a vast increase in new services ‣ Stability and reliability was an increasing problem ‣ Service ownership was often unclear ‣ Too frequent changes for a monolithic SRE team to keep up 2011-2012: The 2nd great hiring spree
  19. 19. ‣ Big incidents almost every week → The business were unhappy ‣ Constant panic and fire fighting → The SRE team were unhappy ‣ Policies and restrictions, and angry SRE engineers → The feature developers were unhappy 2011-2012: The 2nd great hiring spree
  20. 20. “The infrastructure and feature squads that write services should also take responsibility for correct day-to-day operation of individual services.” ‣ Capacity Planning ‣ Service Configuration and Deployment ‣ Monitoring and Alerting ‣ Defining and Managing to SLAs ‣ Managing Incidents September 2012: Ops In Squads
  21. 21. Benefits ‣ Organizational Scalability ‣ Faster incident solving - getting The Right Person™ on the problem faster ‣ Accountability - making the right people hurt ‣ Autonomy - feature teams make all their own planning and decisions September 2012: Ops In Squads
  22. 22. Human challenges ‣ Developers need training, but not a new education ‣ Developers need autonomy, but will do stupid things ‣ Developers need to care about monitoring and alerting, but not the monitoring pipeline September 2012: Ops In Squads
  23. 23. We needed infrastructure as services ‣ The classic Operations team was disbanded ‣ Operations engineers and tools teams were reformed into IO, our infrastructure organization ‣ Teaching and self-service became a primary priority September 2012: Ops In Squads
  24. 24. “Creating IO was probably one of the smartest moves in Spotify” -- Former Product Owner for Monitoring (Slightly paraphrased)
  25. 25. Three tales of failure* * Read: Learning opportunities
  26. 26. ‣ Late 2011: Backend Infrastructure Team (BIT) was formed ○ BIT was the first infrastructure team at Spotify ‣ Tasked with log delivery, monitoring and alerting ‣ Development of Sitemon2 began ○ Meant to replace Sitemon ○ Still based on Munin, but with a Cassandra backend and much more powerful frontend Sitemon2 - The graphing system, which never launched
  27. 27. ...but BIT was set up for failure from the start ‣ Sitemon2 was developed mostly in isolation and with very little collaboration with developers ‣ Priority collisions: Log delivery was always more critical than monitoring ‣ Scope creep: BIT tried to integrate Sitemon2 with analytics Sitemon2 - The graphing system, which never launched
  28. 28. We needed feature teams to take part in monitoring, but Zabbix was too inflexible and hard to learn. ‣ Late 2012: Development of OMG began ○ Event streaming processor similar to Riemann ○ Initial development was super fast and focused ○ Developed in collaboration with Operations ‣ A few teams adopted OMG, but..... OMG - The alerting system no one understood
  29. 29. OMG rule written in Esper
{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].avg(300)}>0.1&(({TRIGGER.VALUE}=0&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.9)|({TRIGGER.VALUE}=1&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.97))
  30. 30. The alerting rule language was Esper (EPL) ‣ Most engineers found the learning curve way too steep and confusing ‣ Too few tools and libraries for the language Why was this not caught? ‣ The Ops engineer assigned to the collaboration happened to also like Esper OMG - The alerting system no one understood
  31. 31. ‣ February 2013: One of our system architects builds Monster as a proof-of-concept hack project ○ In-memory time series database in Java ○ Based on Munin collection and data model ○ Metric data was pushed rather than pulled ○ The prototype was completed in 2 weeks Monster
  32. 32. ‣ Pushing monitoring data was much more reliable than pulling ‣ Querying and graphing was blazing fast ‣ The Operations engineers loved it! ‣ Sitemon kept running, but development of Sitemon2 was halted We’ll get back to the failure part... Monster
  33. 33. 2013: The birth of a dedicated monitoring team
  34. 34. ‣ First dedicated monitoring team at Spotify ‣ Assigned with the task of “Providing self- service monitoring solutions for DevOps teams” ‣ Inherited Monster, Zabbix and OMG ‣ Calculation: Monster could survive a year, so we focused on alerting first Hero Squad
  35. 35. ‣ Replaced Zabbix and OMG with Riemann ○ “Riemann is an event stream processor” ○ Written in Clojure ○ Rules are also written in Clojure ‣ We built a support library with helper functions, namespace support and unit testing ‣ Built a web frontend for reading the current state of Riemann Riemann as a self-service alerting system
  36. 36. Riemann rule written in Clojure
;; Some boilerplate code
(def-rules
  (where (tagged-any roles)
    (tagged "monitoring-hooks"
      (:alert target))
    (where (service "vfs/disk_used/")
      (spotify/trigger-on (p/above 0.9)
        (with {:description "Disk is getting full"}
          (:alert target))))))
  37. 37. ‣ Success: Riemann was widely adopted ‣ Success: Riemann is a true self-service ○ Riemann rules live in a shared git repo, which gets automatically deployed ○ Each team/project has its own namespace ○ Unit tests ensure that rules work as intended ○ Peak: 36 namespaces and ~5000 lines of Clojure code ‣ Failure: Many engineers didn’t understand or like the Clojure language Riemann as a self-service alerting system
  38. 38. 2014: Now for the pretty graphs...
  39. 39. ‣ Sharding and rebalancing quickly became a serious operational overhead ○ The Whisper write pattern involved randomly seeking and writing across a lot of files – one for each series ○ The Cyanite backend has recently addressed this ‣ Hierarchical naming ○ Example: “db1.cpu.idle” ○ Difficult to slice and dice metrics across dimensions, e.g. selecting all hosts in a site or all hosts running a particular piece of software A brief encounter with collectd and Graphite
  40. 40. ‣ Replace long hierarchical names with tags ‣ Compatible with what Riemann does with events ‣ Makes slicing and dicing metrics easy .... but who supports it? Metrics 2.0
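To make the contrast concrete, here is a small Python sketch of a hierarchical name versus a tag-based metric; the field names are invented for this illustration and are not Spotify's actual schema.

# Hypothetical illustration of hierarchical vs. tag-based metric naming;
# the field names are invented for this example, not Spotify's actual schema.

# Hierarchical (Graphite-style): every dimension is baked into one string.
hierarchical = "db1.cpu.idle"

# Tag-based (Metrics 2.0-style): each dimension is a separate key/value pair,
# so metrics can be sliced by host, site, role, etc. without parsing names.
tagged = {
    "key": "cpu-idle",
    "attributes": {"host": "db1", "site": "lon", "role": "database"},
}

# "All database hosts in the lon site" becomes a simple filter over attributes
# instead of a glob expression over dotted names.
def matches(metric, **wanted):
    return all(metric["attributes"].get(k) == v for k, v in wanted.items())

print(matches(tagged, site="lon", role="database"))  # True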
  41. 41. ‣ We needed quick adoption and commitment from our feature teams ‣ Monitoring was still very immature and we needed room to experiment and fail ‣ Problem: Engineers get sick of migrations and refactorings ‣ Solution: A flexible infrastructure and an “API” Adoption vs. flexibility
  42. 42. ‣ Small daemon running on each host, which forwards events and metrics to the monitoring infrastructure ‣ First written in Ruby, but later ported to Java ‣ Provides a stable entry point for our users ffwd - a monitoring “API”
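As a rough sketch of what that entry point looks like from an application's point of view, the Python snippet below pushes one metric to a local ffwd agent as JSON over UDP; the port number and field names are assumptions based on ffwd's JSON input protocol, so check them against your own ffwd configuration.

# A minimal sketch of pushing one metric to a local ffwd agent over UDP.
# The port (19000) and the exact JSON field names are assumptions based on
# ffwd's JSON input protocol; verify them against your ffwd configuration.
import json
import socket
import time

def send_metric(key, value, attributes=None, host="127.0.0.1", port=19000):
    """Serialize one metric as JSON and fire-and-forget it to ffwd over UDP."""
    payload = {
        "type": "metric",                 # assumed discriminator for metrics vs. events
        "key": key,
        "value": value,
        "time": int(time.time() * 1000),  # milliseconds since the epoch
        "attributes": attributes or {},
    }
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(json.dumps(payload).encode("utf-8"), (host, port))
    finally:
        sock.close()

send_metric("disk-used-percentage", 0.42, {"what": "disk-used", "mountpoint": "/"})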
  43. 43. Abstracting away the infrastructure from monitoring collection [diagram: metrics and events → ffwd → magical monitoring pipeline → alerts and pretty graphs]
  44. 44. ‣ Atlas, developed by Netflix ○ Hadn’t been open sourced yet ‣ Prometheus, developed by SoundCloud ○ Hadn’t been open sourced yet ‣ OpenTSDB, originally developed by StumbleUpon ○ Was rejected because of bad experiences with HBase ‣ InfluxDB ○ Was too immature at the time Tripping towards graphing; it’s all about the timing
  45. 45. ‣ Time series database written in Java and backed by Cassandra ‣ Looked promising at first ○ We deployed it and killed Sitemon for good ○ We quickly ran into problems with the query engine ‣ Timing: The two main developers got hired by DataStax (the Cassandra company) ○ KairosDB development came to a halt KairosDB
  46. 46. ‣ Originally written as an alternative query engine for KairosDB ‣ We kept using the KairosDB database schema and KairosDB metric writers ‣ June 2014: We dropped KairosDB and Heroic became a stand-alone product ‣ ElasticSearch used for indexing time series metadata May 2014: The birth of Heroic
  47. 47. Monster couldn’t scale, but this was not obvious to the users ‣ When it worked, it was blazing fast and beat all other solutions ‣ When it broke, it crashed and required the attention of the monitoring team, but most users never knew ‣ Only visible sign: shorter and shorter history Back to the Monster failure
  48. 48. ‣ Failure: Because Monster was loved, and the users weren’t experiencing the pain when it broke, many teams resisted migrating away from Monster ‣ Result: We didn’t manage to shut down Monster until August 2015 ‣ In its last 6 weeks, Monster crashed 51(!) times Back to the Monster failure
  49. 49. July 2014: Graph-based alerting
  50. 50. Alerting was becoming a problem again ‣ Scaling Riemann with the increasing number of metrics became hard ○ We began sharding, but some groups of hosts were still too big ‣ Writing reliable sliding window rules in Riemann was hard ○ Learning Riemann and Clojure was the most common complaint from our users ‣ One team dropped our monitoring solution and moved to an external vendor
  51. 51. ‣ Simple thresholds on time series using the same backend, data and query language ○ 3 operators: Above, Below or Missing for X time ‣ Integrated directly into our frontend ○ No code, no fancy math, just a line on a graph Graph-based alerting
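A toy Python sketch of those three operators evaluated over a list of (timestamp, value) samples follows; it illustrates the idea only and is not Heroic's actual alerting implementation.

# Toy evaluation of the three graph-based alert operators (above, below,
# missing for X seconds) over (timestamp, value) samples. An illustration of
# the idea, not Heroic's actual alerting implementation.

def breached(samples, threshold, duration, mode="above", now=None):
    """Return True if the condition has held for at least `duration` seconds."""
    now = now if now is not None else (samples[-1][0] if samples else 0)
    window_start = now - duration

    if mode == "missing":
        # Alert if no sample at all arrived inside the window.
        return not any(ts >= window_start for ts, _ in samples)

    window = [v for ts, v in samples if ts >= window_start]
    if not window:
        return False  # absence of data is the "missing" operator's job
    if mode == "above":
        return all(v > threshold for v in window)
    if mode == "below":
        return all(v < threshold for v in window)
    raise ValueError("unknown mode: " + mode)

samples = [(t, 0.95) for t in range(0, 600, 60)]  # disk usage stuck at 95%
print(breached(samples, threshold=0.9, duration=300, mode="above"))  # True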
  52. 52. Graph-based alerting
  53. 53. ‣ Our engineers love it! ○ Thousands of lines of Riemann code were ripped out ○ Many teams have migrated completely away from Riemann ○ We saw a massive speed-up in adoption of monitoring; both data collection and definitions of dashboards and alerts ‣ Many monitoring problems can indeed be expressed as a simple threshold Graph-based alerting
  54. 54. Adoption of Heroic
  55. 55. ‣ We are currently collecting ~10TB of metrics per month worldwide ○ 30TB of storage in Cassandra due to the replication factor ‣ ~80% of our data was collected within the last 6 months Adoption of Heroic
  56. 56. The final current picture [diagram: metrics and events → ffwd → Apache Kafka → Riemann (alerts) and Heroic (pretty graphs)]
  57. 57. ‣ ffwd and ffwd-java have been developed as Open Source software from the start ‣ Heroic was released as Open Source software yesterday ○ Blog post: “Monitoring at Spotify: Introducing Heroic” ○ Other components will be released later We finally Open Sourced it
  58. 58. What we have learned so far
  59. 59. ● Learning a new monitoring system is an investment ● Legacy systems are almost always the hardest But it gets worse... ● Almost all systems end up as legacy ● You probably haven’t installed your last monitoring system Migrations are hard and expensive Suggestions: ● Consider having abstraction layers ● Beware of vendor lock-in ○ Open Source software is not safe ● Sometimes it’s cheaper to keep a migration layer for legacy systems than to migrate
  60. 60. ● The monitoring/operations team are experts; feature developers might not be ● User experience matters for adoption ● The learning curve affects the cost of adoption for teams User experience and learning curve matter ● A technically superior solution is worthless if your users don’t understand it ● Providing good defaults and an easy golden path will not only drive adoption, but also prevent users from making common mistakes
  61. 61. When collection is easy and performance is good, engineers will start using the monitoring system as a debugger. ● Storage is cheap but not free ● The operational cost of keeping debugging data highly available is significant Beware of scope creep in monitoring When graphing is easy, pretty and powerful, people will start using monitoring for business analytics. ● Monitoring is supposed to be reliable, but not accurate
  62. 62. ● Seems very intuitive ● Fragile, sensitive to latency and sporadic failures ● Noise for alerting ● What are you really measuring? ● Solution: Convert your problem into a metric by interpreting close to the source Heartbeats are hard to get right
  63. 63. ● We used to send events on every Puppet run ● Teams would make monitoring rules for failed Puppet runs and absent Puppet runs ● Problem: Absent Puppet runs look exactly the same when ○ Puppet is disabled ○ Network is down ○ Host is down ○ Host has been decommissioned ● Solution: Emit a “Time since last successful Puppet run” metric instead ○ Now we can do simple thresholds, which are easy to reason about Heartbeats example: Puppet runs
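A minimal Python sketch of that idea follows: it derives the age of the most recent Puppet run from the modification time of the agent's last-run summary file and prints it as a metric value. The file path is an assumption that varies between Puppet versions, and using the file's mtime only approximates "last successful run"; parsing the summary YAML for a success status would be closer to the slide's metric.

# Sketch of emitting "time since last Puppet run" as a metric instead of a
# heartbeat event. The summary-file path is an assumption and varies between
# Puppet versions; the file's mtime approximates the last run, while checking
# the YAML contents would be needed to single out successful runs.
import os
import time

LAST_RUN_SUMMARY = "/var/lib/puppet/state/last_run_summary.yaml"  # assumed path

def seconds_since_last_puppet_run(path=LAST_RUN_SUMMARY):
    """Age of the most recent Puppet run, based on the summary file's mtime."""
    return time.time() - os.path.getmtime(path)

if __name__ == "__main__":
    age = seconds_since_last_puppet_run()
    # In practice this value would be handed to the local metrics agent
    # (e.g. the ffwd sketch above); printing keeps the example self-contained.
    print("puppet-last-run-age %.0fs" % age)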
  64. 64. ● Indexing 100M time series is hard ● Browsing 100M time series is hard ○ UI design - getting an overview of 100M time series is hard ○ Understanding a graph with thousands of lines is difficult for humans ● Your data will keep growing The next big scaling problem is very human: data discovery ● Anomaly detection and machine learning might help us ○ Many new and upcoming products look promising ○ ...but still largely an unsolved problem
  65. 65. Thank you for your time and patience! Martin Parm email: parmus@spotify.com twitter: @parmus_dk
  66. 66. List of Open Source software mentioned ‣ Munin, http://munin-monitoring.org/ ‣ Zabbix, http://www.zabbix.com/ ‣ Riemann, http://riemann.io/ ‣ Apache Kafka, http://kafka.apache.org/ ‣ Atlas, https://github.com/Netflix/atlas ‣ Prometheus, http://prometheus.io/ ‣ OpenTSDB, http://opentsdb.net/ ‣ InfluxDB, https://influxdb.com/ ‣ KairosDB, http://kairosdb.github.io/ ‣ ffwd, https://github.com/spotify/ffwd ‣ ffwd-java, https://github.com/spotify/ffwd-java ‣ Heroic, https://github.com/spotify/heroic ‣ Cassandra, http://cassandra.apache.org/ ‣ ElasticSearch, https://www.elastic.co/
