The Evolution of Operations at Spotify / Lev Popov (Spotify)


How can an operations team cope with a company's rapid growth? How do you ship software quickly and safely into a product with many interdependent components? How do you spread responsibility for operations across engineering teams? I will describe the experience we gained at Spotify, which may help answer these questions.

Published in: Engineering

  1. Operations Engineering Evolution at Spotify. Lev Popov, Site Reliability Engineer, @nabamx
  2. Who am I? • Lev Popov • Service Reliability Engineer at Spotify • Joined Spotify in 2014 • Previously: QIK – Skype – Microsoft • Background in services and networks operations
  3. What is Spotify?
  4. Some Numbers • Over 60 million MAU (monthly active users) • Over 15 million paying subscribers • Over 30 million tracks • Over 1.5 billion playlists • Over 20,000 songs added per day
  5. Capacity We Own • 4 Data Centers • Over 7000 bare metal servers • Many different services • Pushing an average of 35GBps to the Internet • 24/7/365
  6. But let's talk about operations
  7. In the beginning was the… [Diagram labels: Service (×4), Dev owner, Ops owner, Operations team; operations responsibilities: On-call, Monitoring, Build systems, Backups, DB, Networks, …]
  8. Operations Team in 2011: Thin group of 5 people • Over 10 million users • Over 2 million paying subscribers • 12 Countries • Over 15 million tracks • Over 400 million playlists • 3 datacenters • Over 1300 servers
  9. Operations Team Now: ? • Over 60 million users • Over 15 million paying subscribers • 58 Countries • Over 30 million tracks • Over 1.5 billion playlists • 4 datacenters • Over 5000 servers
  10. Operations Team Now: No team • Over 60 million users • Over 15 million paying subscribers • 58 Countries • Over 30 million tracks • Over 1.5 billion playlists • 4 datacenters • Over 5000 servers
  11. Spotify Engineering Culture
  12. How We Scale • Service-oriented architecture: separate services for separate features • UNIX way: small, simple programs doing one thing well • KISS principle: simple applications are easier to scale
  13. How Spotify Works
  14. Scaling Agile • Squad is similar to a scrum team • Designed to feel like a small startup • Self-organizing teams • Autonomy to decide their own way of working
  15. Scaling Agile
  16. Can we scale that? [Diagram labels: Service (×4), Dev owner, Ops owner, Operations team; operations responsibilities: On-call, Monitoring, Build systems, Backups, DB, Networks, …]
  17. Ops in Squads
  18. Ops in Squads: Background. Impossible to scale a central operations team: • Understaffed • Difficult to find generalists. We believe that operations has to sit close to development. Our bet for autonomy: • Break dependencies • End-to-end responsibility
  19. Timeline [Diagram: milestones at 2008, Early 2011, Mid 2012, Sep 2013; labels: Dev, Backend Infrastructure, I/O, Operations, SRE, Internal IT, Operations in Squads]
  20. Infrastructure Operations [Diagram labels: IO Tribe (networks, conf mgmt, containers); feature squads; product area; enable + support]
  21. Ops in Squads Expectations
  22. Wait, wait, but what if…
  23. Core SRE [Diagram labels: Core SRE, IO Tribe, squads; topics: Major Incidents • Scalability Issues • Systems Design Problems • Teaching Best Practices in General]
  24. Incident Management
  25. Incident Management [Diagram labels: Incident, Postmortem, Remediation; roles: Incident Manager, On-Call, Everybody involved in an incident]
  26. Postmortems • Plan for post-mortems • Keep it close in time • Record the project details • Involve everyone • Get it in writing • Record successes as well as failures • It's not for punishment • Create an action plan • Make it available
  27. On-call follows the sun [Diagram: Stockholm and New York alternate; times shown: 07 CET / 01 EST and 19 CET / 13 EST; escalation labels: L0, SA Product Owners, L1, SA Lead, L2]
  28. Areas of Improvement
  29. Areas of Improvement • The expectations we place on squads are sometimes unclear • Communication between feature teams and infrastructure teams • It's hard to measure ops in squads success • Abandoned services and other ownership issues
  30. Thank you. @nabamx lev@spotify.com
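
The "On-call follows the sun" slide (27) shows Stockholm and New York trading the pager at 07 CET / 01 EST and 19 CET / 13 EST. A minimal sketch of that rotation is below. It rests on assumptions that are not stated in the slides: that Stockholm covers 07:00 to 19:00 CET and New York covers the other half, and that the fixed offsets printed on the slide apply all year (daylight saving time is ignored). The function name `on_call_site` and the constants are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets as printed on the slide; DST is deliberately ignored (assumption).
CET = timezone(timedelta(hours=1), "CET")
EST = timezone(timedelta(hours=-5), "EST")

# Handover hours expressed in CET. 07 CET is 01 EST, and 19 CET is 13 EST.
HANDOVER_TO_STOCKHOLM = 7
HANDOVER_TO_NEW_YORK = 19


def on_call_site(moment: datetime) -> str:
    """Return the site assumed to hold the pager at a timezone-aware moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour_cet = moment.astimezone(CET).hour
    # Assumption: Stockholm holds the pager from 07:00 up to (not including) 19:00 CET.
    if HANDOVER_TO_STOCKHOLM <= hour_cet < HANDOVER_TO_NEW_YORK:
        return "Stockholm"
    return "New York"


if __name__ == "__main__":
    print(on_call_site(datetime(2015, 1, 15, 12, 0, tzinfo=CET)))
    print(on_call_site(datetime(2015, 1, 15, 13, 0, tzinfo=EST)))
```

Converting every timestamp to a single reference zone before comparing keeps the handover logic in one place; the escalation levels (L0 to L2) from the slide are not modeled here.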
