Operations Engineering Evolution at Spotify / Lev Popov (Spotify)

  1. Operations Engineering Evolution at Spotify Lev Popov Site Reliability Engineer @nabamx
  2. Who am I?  Lev Popov  Site Reliability Engineer at Spotify  Joined Spotify in 2014  Previously: QIK – Skype – Microsoft  Background in service and network operations
  3. What is Spotify?
  4. Some Numbers • Over 60 million MAU (monthly active users) • Over 15 million paying subscribers • Over 30 million tracks • Over 1.5 billion playlists • Over 20,000 songs added per day
  5. Capacity We Own • 4 Data Centers • Over 7000 bare metal servers • Many different services • Pushing an average of 35GBps to the Internet • 24/7/365
  6. But let's talk about operations
  7. In the beginning was the… [diagram: each service has a dev owner and an ops owner; a central operations team handles on-call, monitoring, build systems, backups, DB, networks, …]
  8. Operations Team in 2011 A small group of 5 people • Over 10 million users • Over 2 million paying subscribers • 12 countries • Over 15 million tracks • Over 400 million playlists • 3 datacenters • Over 1300 servers
  9. Operations Team Now ? • Over 60 million users • Over 15 million paying subscribers • 58 Countries • Over 30 million tracks • Over 1.5 billion playlists • 4 datacenters • Over 5000 servers
  10. Operations Team Now No team • Over 60 million users • Over 15 million paying subscribers • 58 Countries • Over 30 million tracks • Over 1.5 billion playlists • 4 datacenters • Over 5000 servers
  11. Spotify Engineering Culture
  12. How We Scale • Service oriented architecture Separate services for separate features • UNIX way Small simple programs doing one thing well • KISS principle Simple applications are easier to scale
  13. How Spotify Works
  14. Scaling Agile • Squad is similar to a scrum team • Designed to feel like a small startup • Self organizing teams • Autonomy to decide their own way of working
  15. Scaling Agile
  16. Can we scale that? [diagram: services with dev owners and ops owners, and a central operations team handling on-call, monitoring, build systems, backups, DB, networks, …]
  17. Ops in Squads
  18. Ops in Squads Background It is impossible to scale a central operations team • Understaffed • Difficult to find generalists We believe that operations has to sit close to development Our bet for autonomy • Break dependencies • End-to-end responsibility
  19. Timeline [diagram: Dev, Backend Infrastructure, Operations/SRE, and Internal IT evolving into the I/O organization and Operations in Squads, across 2008, early 2011, mid 2012, and Sep 2013]
  20. Infrastructure Operations [diagram: the IO Tribe (networks, conf mgmt, containers) enables and supports feature squads across a product area]
  21. Ops in Squads Expectations
  22. Wait, wait, but what if…
  23. [diagram: Core SRE within the IO Tribe supports squads with major incidents, scalability issues, systems design problems, and teaching best practices in general]
  24. Incident Management
  25. Incident Management [diagram: incident → postmortem → remediation; an on-call Incident Manager coordinates everybody involved in an incident]
  26. Postmortems • Plan for post-mortems • Keep it close in time • Record the project details • Involve everyone • Get it in writing • Record successes as well as failures • It's not for punishment • Create an action plan • Make it available
  27. On-call follows the sun [diagram: three tiers rotating between Stockholm and New York: L0 SA, L1 SA Product Owners, L2 SA Lead; handovers at 07/19 CET and 01/13 EST]
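The handover scheme on this slide can be sketched in a few lines. This is a hypothetical illustration of a follow-the-sun router, assuming Stockholm covers 07:00–19:00 CET and New York covers the other half (01:00–13:00 EST); it is inferred from the slide's tick marks, not Spotify's actual tooling.

```python
from datetime import datetime, timezone, timedelta

# Assumption: fixed CET offset for illustration (ignores DST).
CET = timezone(timedelta(hours=1))

def on_call_office(now_utc: datetime) -> str:
    """Return which office holds L0 on-call at the given UTC time."""
    hour_cet = now_utc.astimezone(CET).hour
    # Stockholm takes 07:00-19:00 CET; New York covers the rest.
    return "Stockholm" if 7 <= hour_cet < 19 else "New York"

# Example: 12:00 UTC is 13:00 CET, so it falls in Stockholm's shift.
print(on_call_office(datetime(2015, 3, 1, 12, 0, tzinfo=timezone.utc)))
```

The point of the scheme is that nobody is paged at night: each office is only ever on call during its own daytime.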
  28. Areas of Improvement
  29. Areas of Improvement • The expectations we place on squads are sometimes unclear • Communication between feature teams and infrastructure teams • It’s hard to measure ops in squads success • Abandoned services and other ownership issues
  30. Thank you. @nabamx

Editor's Notes

  1. We are kind of big and we are growing
  2. Available in 58 countries
  3. 4 DCs: 2 in Europe (STO, LON) and 2 in the US: ASH (Virginia, East Coast) and SJC (California, West Coast). We have over 5k bare-metal machines, and growing. We use a service-oriented architecture, with hundreds of different services. That's approximately 25 Wikipedias per minute. It's a non-stop party: the music can't stop streaming.
  4. Spotify was founded in 2006 and went live in 2008. As usual, operations was handled by one person in the beginning. The number of operations engineers grew over time, so at some point we arrived at the following scheme: every service has a dev owner and an ops owner. Generally, every ops engineer owned multiple services. All ops engineers were also responsible for general operations tasks such as being on-call, building monitoring, backups, etc.
  5. In 2011 we were already kind of big. Around 5 people were holding all operational responsibility.
  6. Guess what now?
  7. We have no operations team at all anymore
  8. That sounds strange, but let's talk about engineering culture at Spotify first before I continue on operations.
  9. We are growing fast. To handle that and scale appropriately, we follow some basic principles. SOA: each service talks to one or multiple services, coupling several layers of services over well-defined interfaces; services are maintained and deployed separately. UNIX way: short, simple, clear, modular, and extendable code that can be easily maintained and repurposed by developers other than its creators. KISS principle: simplicity should always be a key goal in architectural design, and unnecessary complexity should be avoided.
  10. These basic principles brought us to a pretty complex system overall, which consists of many moving parts running on top of our infrastructure. This diagram is a bit outdated, but it shows the idea. You don't need to understand every part of the system to build a new service or maintain an existing one. Most of the services are autonomous. The number of services is growing continuously. Who maintains all this stuff?
  11. Our engineers are grouped into squads. A squad is the basic unit of development at Spotify. Squads have product-driven missions. A squad is similar to a Scrum team and is designed to feel like a mini-startup. They sit together, and they have all the skills and tools needed to design, develop, test, and release to production. Squads are self-organizing teams and decide their own way of working: some use Scrum sprints, some use Kanban, some use a mix of these approaches. Every squad owns some services and features. We have a lot of squads at Spotify, and dealing with multiple teams is always a challenge! Besides that: different offices, different time zones. A matrix organization structure helps a lot to handle that:
  12. A matrix organization structure helps a lot to handle that; we have the following organizational primitives. Squad: the basic unit of development at Spotify, as I mentioned before. Tribe: a collection of squads that work in related areas, such as general backend services, desktop clients, mobile clients, or feature squads working on a set of similar features. Chapter: a small family of people having similar skills and working within the same general area, within the same tribe; chapter leads take care of employees' career growth. PO: head of a squad, an entrepreneur who prioritizes the work and takes both business value and tech aspects into consideration. Every individual contributor is part of some squad and chapter. Besides that, we have guilds, which are basically "communities of interest": informal groups of people that want to share knowledge, tools, code, and practices, such as the QA Guild or the Python Guild. Our structure and methodologies help us to scale well, both the product and the organization.
  13. Let's come back to operations: can we scale this structure? How many operations engineers are needed to handle 100 services? 1000 services? How easy is it to hire engineers nowadays? Ops and devs are blocking each other all the time.
  14. initiative
  15. Collaboration and autonomy: a central ops team means blockers and a single point of failure for service development. Who knows a service better than its developers?
  16. This timeline shows the evolution of the operations division. In Sep 2013 we merged the SRE and backend infrastructure organizations into the I/O tribe. The sizes of the blocks don't reflect the number of people.
  17. What is the IO tribe? It consists of squads grouped by product area, each with a product-driven mission. The IO Tribe provides: platform, tools, support, docs, and best practices. Feature squads are responsible for operating their services themselves! So now we have developers instead of sysadmins, and operational responsibility is spread across the whole tech organization. Feature squads are more autonomous and have no handovers to an operations team.
  18. Capacity planning, on-call for the services you own, deployments, and so on. For every ops task we want squads to do, the IO Tribe should provide a toolset, documentation, best practices, and support. Sample (Sony!): a story of how the squad that owns one of the backend services increased its capacity by 100 bare-metal nodes with a couple of clicks in web interfaces.
  19. Something really bad happens
  20. Despite being part of a product-driven organization, SRE falls back into a support function in many situations that require immediate attention:
  21. In order to handle that we have a Core SRE organization that involves highly skilled SREs from SA squads: resolving scalability issues, helping with systems design problems, "explaining" our platform, teaching best practices in general, and doing incident management: postmortems and remediations.
  22. Mistakes are OK unless the same ones are repeated twice. Every incident should be reviewed, and appropriate remediation should be made to avoid the same incident in the future. Anyone who may be concerned can attend incident postmortem and remediation meetings to influence the outcome. Mistakes -> more automation. Services have different SLAs and reliability requirements, so immediate action is not always required. Incidents that don't affect major features should be handled by feature squads. This approach helps us to be HA. Example: music stops playing. Receive an alert from monitoring or support. Find out what is broken. Contact the broken service's owners and stakeholders if necessary. Coordinate recovery. Coordinate the postmortem and schedule remediations.
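The flow in the note above (alert, diagnose, recover, postmortem, remediation) can be sketched as a tiny incident-lifecycle state machine. The stage names and transitions here are a hypothetical illustration of that sequence, not Spotify's actual incident tooling.

```python
from enum import Enum, auto

class Stage(Enum):
    """Hypothetical stages of the incident flow described above."""
    ALERTED = auto()      # alert received from monitoring or support
    DIAGNOSING = auto()   # finding out what is broken
    RECOVERING = auto()   # owners/stakeholders contacted, recovery coordinated
    POSTMORTEM = auto()   # blameless review of what happened
    REMEDIATION = auto()  # action items scheduled so it can't repeat
    CLOSED = auto()

# Allowed forward transitions; the on-call incident manager drives these.
TRANSITIONS = {
    Stage.ALERTED: Stage.DIAGNOSING,
    Stage.DIAGNOSING: Stage.RECOVERING,
    Stage.RECOVERING: Stage.POSTMORTEM,
    Stage.POSTMORTEM: Stage.REMEDIATION,
    Stage.REMEDIATION: Stage.CLOSED,
}

def advance(stage: Stage) -> Stage:
    """Move an incident to its next stage; a closed incident stays closed."""
    return TRANSITIONS.get(stage, Stage.CLOSED)
```

The key property the talk emphasizes is that an incident is not done at recovery: CLOSED is only reachable through POSTMORTEM and REMEDIATION.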
  23. Plan for post-mortems. To get the most value out of this activity, you need to take it seriously. The postmortem should be a scheduled activity, with time for a meeting of the team to discuss the lessons learned and time for someone (or some group) to write the postmortem report. Keep it close in time. Don't let memories fade by scheduling the postmortem too long after the end of the project. Ship the software, have the celebration, and then roll right into the post-mortem. Record the project details. Part of the post-mortem report needs to be a recital of the details of the project: how big it was, how long it took, what software was used, what the objectives were, and so on. Involve everyone. There are two facets to this. First, different people will have different insights about the project, and you need to collect them all to really understand what worked and didn't. Second, getting everyone involved helps prevent the post-mortem from degenerating into scapegoating. Get it in writing. The project manager needs to own the process of reducing the post-mortem lessons to a written report, delegating this if necessary. Record successes as well as failures. It's easy for a post-mortem to degenerate into a blame session, especially if the project went over budget or the team didn't manage to deliver all the promised features. It's not for punishment. If you want honest post-mortems, management has to develop a reputation for listening openly to input and not punishing people for being honest. Create an action plan. The written post-mortem should make recommendations of how to continue things that worked, and how to fix things that didn't work. Remember, the idea is to learn from your successes and failures, not just to document them. Make it available. A software post-mortem locked in a filing cabinet in the sub-basement does no one any good. Good organizations store the supply of post-mortems somewhere that they're easily found.
  24. Too many things to do. We are working on well-defined requirements and an audit process for ops in squads. Questions squads have are not fully understood or answered by the teams providing infrastructure, including documentation and best practices. Plus: visibility issues, abandoned services, handover between squads.
  25. One more slide: where we are heading now.