Scaling Operations At Spotify

My talk at ServiceManagerDag in the Netherlands about Scaling Operations at Spotify. April 2015

Published in: Technology

  1. Scaling Operations at Spotify. Service Manager Dag, April 2015. David Poblador i Garcia - @davidpoblador
  2. About Spotify…
  3. About David… ‣ Joined Spotify in 2011 ‣ Infrastructure/Operations background ‣ Led the Site Reliability team at Spotify for 3+ years ‣ Currently leading the Service Availability team: Real Time Monitoring, Security, Network Engineering, Service Capacity, Operating System
  4. Spotify nowadays: some numbers
  5. Paying subscribers: over 15 million
  6. Active users: over 60 million
  7. Number of songs: over 30 million
  8. Added songs per day: over 20,000
  9. Number of playlists: over 1.5 billion
  10. Number of markets: available in 58 markets
  11. Spotify nowadays… ‣ Over 15 million paying subscribers ‣ Over 60 million active users ‣ Over 30 million songs (more than 20,000 added every day) ‣ Over 1.5 billion playlists ‣ Available in 58 markets: Andorra, Argentina, Austria, Australia, Belgium, Bolivia, Brazil, Bulgaria, Canada, Chile, Colombia, Costa Rica, Cyprus, Czech Republic, Denmark, Dominican Republic, Ecuador, El Salvador, Estonia, Finland, France, Germany, Greece, Guatemala, Honduras, Hong Kong, Hungary, Iceland, Ireland, Italy, Latvia, Liechtenstein, Lithuania, Luxembourg, Malaysia, Malta, Mexico, Monaco, New Zealand, Netherlands, Nicaragua, Norway, Panama, Paraguay, Peru, Philippines, Poland, Portugal, Singapore, Slovakia, Spain, Sweden, Switzerland, Taiwan, Turkey, UK, Uruguay and USA.
  12. But this talk is about how to scale an Operations team…
  13. Let’s have a look at the past…
  14. Late 2011
  15. Operations team in 2011
  16. Operations, now and then. 2011: spread too thin, 5 people
  17. Operations, now and then. 2011: spread too thin, 5 people. Now: ?
  18. Operations, now and then. 2011: spread too thin, 5 people. Now: no team
  19. Timeline (diagram). Labels: Backend Infrastructure, SRE, Internal IT, I/O, Operations, Dev, Feature teams. Dates: 2008, Early 2011, Mid 2012, Sep 2013
  20. How do we operate our services?
  21. How Spotify works
  22. System Ownership at Spotify…
  23. Spotify Engineering Culture
  24. Operations in Squads
  25. Ops in Squads: background • Impossible to scale a central operations team • Understaffed • Difficult to find generalists • We believe that operations has to sit close to development • Our bet for autonomy • Break dependencies • End-to-end responsibility
  26. Vicious circle • Operations does not have enough time to support squads • Squads invent a non-standard square wheel for their particular problem • Increasing technical debt due to a lot of differently shaped wheels • System ownership and operational support is complex • We need highly skilled systems engineers • It's difficult to hire skilled engineers
  27. Operations in Squads: timeline (diagram). Labels: Backend Infrastructure, SRE, Internal IT, I/O, Operations, Dev, Feature teams. Dates: 2008, Early 2011, Mid 2012, Sep 2013
  28. Current status ‣ Incident Managers on Call (IMOC): group that coordinates incidents affecting multiple teams. ‣ Increased availability: our availability keeps improving.
  29. Areas of improvement ‣ The expectations we place on squads are sometimes unclear: too many things to do. ‣ Communication between feature teams and infrastructure teams: questions squads have are not fully understood/answered by teams providing infrastructure.
  30. Ops in Squads: expectations (diagram). Labels: Capacity Planning, Alerting, Graphing, Define SLA, Backups, Restore tests, Service Operational Quality Checklist, Recoverability, Identify high level metrics, Security Reviews, Incident Tracking, Remediate incidents, Manageability, Redundancy, High Availability, Deprecate, Deployment, Manage perimeter, Graceful Degradation, System Review, Upgrade OS
  31. To summarize
  32. DevOps → Dev == Ops
  33. DevOps > Dev + Ops
  34. It is easier to learn if you can work on the full stack
  35. How every engineer at Spotify became a sysadmin and the Ops team stopped getting up at night
  36. Thank you. @davidpoblador
