Operations Engineering Evolution at Spotify
Lev Popov
Site Reliability Engineer
@nabamx
Who am I?
 Lev Popov
 Site Reliability Engineer at Spotify
 Joined Spotify in 2014
 Previously at Qik – Skype – Microsoft
 Background in services and networks operations
What is Spotify?
Some Numbers
• Over 60 million MAU (monthly active users)
• Over 15 million paying subscribers
• Over 30 million tracks
• Over 1.5 billion playlists
• Over 20,000 songs added per day
Capacity We Own
• 4 Data Centers
• Over 7000 bare metal servers
• Many different services
• Pushing an average of 35 GBps to the Internet
• 24/7/365
But let's talk about operations
In the beginning was the…
[Diagram: each service starts with just a dev owner; over time every service gets both a dev owner and an ops owner, while a central operations team handles on-call, monitoring, build systems, backups, DB, networks, …]
Operations Team in 2011
Thin group of 5 people
• Over 10 million users
• Over 2 million paying subscribers
• 12 Countries
• Over 15 million tracks
• Over 400 million playlists
• 3 datacenters
• Over 1300 servers
Operations Team Now
?
• Over 60 million users
• Over 15 million paying subscribers
• 58 Countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers
Operations Team Now
No team
• Over 60 million users
• Over 15 million paying subscribers
• 58 Countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers
Spotify Engineering Culture
How We Scale
• Service oriented architecture
Separate services for separate features
• UNIX way
Small simple programs doing one thing well
• KISS principle
Simple applications are easier to scale
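To make "small simple programs doing one thing well" concrete, here is a minimal sketch in Python of a single-purpose backend service behind a tiny HTTP interface. It is purely illustrative: the endpoint, data and names are hypothetical, not an actual Spotify service.

# Hypothetical single-purpose service: look up a track title by id over HTTP.
# It does exactly one thing, behind one small, well-defined interface, which
# is what makes it easy to reason about and to scale horizontally.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

TRACKS = {"1": "Track One", "2": "Track Two"}  # stand-in for a real datastore


class TrackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Single responsibility: GET /track/<id> -> {"id": ..., "title": ...}
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "track" and parts[1] in TRACKS:
            status, body = 200, {"id": parts[1], "title": TRACKS[parts[1]]}
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("", 8080), TrackHandler).serve_forever()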
How Spotify Works
Scaling Agile
• A squad is similar to a Scrum team
• Designed to feel like a small startup
• Self-organizing teams
• Autonomy to decide their own way of working
Scaling Agile
Can we scale that?
[Diagram repeated: each service has a dev owner and an ops owner, and the central operations team handles on-call, monitoring, build systems, backups, DB, networks, …]
Ops in Squads
Ops in Squads Background
Impossible to scale a central operations team
• Understaffed
• Difficult to find generalists
We believe that operations has to sit close to development
Our bet for autonomy
• Break dependencies
• End-to-end responsibility
Timeline
[Timeline of the operations division, 2008 – Early 2011 – Mid 2012 – Sep 2013: Dev, Backend Infrastructure, Operations, SRE, Internal IT; in Sep 2013, SRE and Backend Infrastructure merged into the I/O tribe and operations moved into squads]
Infrastructure Operations
[Diagram: the IO Tribe (squads such as networks, conf mgmt, containers) enables and supports the feature squads in each product area]
Ops in Squads
Expectations
Wait, wait, but what if…
[Diagram: Core SRE, within the IO Tribe, backs the squads on major incidents, scalability issues, systems design problems, and teaching best practices in general]
Incident Management
Incident Management
[Diagram: Incident → Postmortem → Remediation. The incident manager role is taken by on-call; postmortems involve everybody involved in the incident]
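As a rough sketch of this flow (an assumption drawn from the slide and the notes, not an internal Spotify tool), an incident moves through detection, recovery coordinated by the incident manager, a postmortem with everybody involved, and remediation:

# Hypothetical incident lifecycle matching the flow above.
from enum import Enum


class Stage(Enum):
    DETECTED = 1     # alert comes in from monitoring or support
    RECOVERED = 2    # incident manager (the on-call engineer) coordinates recovery
    POSTMORTEM = 3   # reviewed with everybody involved in the incident
    REMEDIATED = 4   # action items completed so the same mistake is not repeated


def advance(stage: Stage) -> Stage:
    """Move an incident to the next stage; remediation is the final state."""
    return Stage(min(stage.value + 1, Stage.REMEDIATED.value))


assert advance(Stage.DETECTED) is Stage.RECOVERED
assert advance(Stage.REMEDIATED) is Stage.REMEDIATED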
Postmortems
• Plan for post-mortems
• Keep it close in time
• Record the project details
• Involve everyone
• Get it in writing
• Record successes as well as failures
• It's not for punishment
• Create an action plan
• Make it available
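One way to "get it in writing" and keep an action plan is to capture each postmortem as structured data. The sketch below is a hypothetical format (field names and values are illustrative, not a Spotify-internal template):

# Hypothetical postmortem record: written down, close in time to the incident,
# covering successes as well as failures, with an action plan owned by people.
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str          # every remediation needs a responsible person
    due: date


@dataclass
class Postmortem:
    incident_id: str
    held_on: date                 # keep it close in time
    attendees: List[str]          # involve everyone
    what_went_well: List[str]     # record successes as well as failures
    what_went_wrong: List[str]
    action_plan: List[ActionItem] = field(default_factory=list)


pm = Postmortem(
    incident_id="INC-1234",       # made-up id
    held_on=date(2015, 3, 2),
    attendees=["on-call SRE", "service owner", "incident manager"],
    what_went_well=["Alerting fired within a minute"],
    what_went_wrong=["The runbook was out of date"],
    action_plan=[ActionItem("Update the runbook", "service owner", date(2015, 3, 9))],
)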
On-call follows the sun
[Diagram: on-call follows the sun between Stockholm and New York. L0: on-call engineers in Stockholm and New York; L1: SA Product Owners; L2: SA Lead. Handovers at 07 and 19 CET (13 and 01 EST)]
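Reading the handover times off the diagram (07/19 CET and 13/01 EST, i.e. 06:00 and 18:00 UTC if we ignore daylight saving), a follow-the-sun rotation can be expressed as a simple rule. This is only a sketch of the idea, not Spotify's actual scheduling tool:

# Sketch of the follow-the-sun handover: each site covers its local daytime.
from datetime import datetime, timezone


def on_call_site(now_utc: datetime) -> str:
    # Stockholm covers 07-19 CET (06:00-18:00 UTC),
    # New York covers 13-01 EST (18:00-06:00 UTC).
    return "Stockholm" if 6 <= now_utc.hour < 18 else "New York"


print(on_call_site(datetime(2015, 3, 2, 10, 0, tzinfo=timezone.utc)))  # Stockholm
print(on_call_site(datetime(2015, 3, 2, 22, 0, tzinfo=timezone.utc)))  # New York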
Areas of Improvement
Areas of Improvement
• The expectations we place on squads are sometimes unclear
• Communication between feature teams and infrastructure teams
• It's hard to measure the success of ops in squads
• Abandoned services and other ownership issues
Thank you.
@nabamx
lev@spotify.com

The Evolution of Operations at Spotify / Lev Popov (Spotify)

Editor's Notes

  • #5 We are kind of big and we are growing
  • #6 Available in 58 countries
  • #7 4 DCs: 2 in Europe (STO, LON) and 2 in the US, ASH (Virginia, East Coast) and SJC (California, West Coast). We have over 5k bare metal machines and growing. We use a service-oriented architecture with hundreds of different services. That's approximately 25 Wikipedias per minute. It's a non-stop party; the music can't stop streaming.
  • #9 Spotify was founded in 2006 and went live in 2008. As usual, operations was handled by one person in the beginning, and the number of operations engineers grew over time. At some point we arrived at the following scheme: every service has a dev owner and an ops owner; generally every ops engineer owned multiple services; all ops engineers were also responsible for general operations tasks such as being on-call, building monitoring, backups, etc.
  • #10 In 2011 we were already kind of big. Around 5 people held all operational responsibility.
  • #11 Guess what now?
  • #12 We have no operations team at all anymore
  • #13 That sounds strange, so let's talk about engineering culture at Spotify first before I continue with operations.
  • #15 We are growing fast. To handle that and scale appropriately, we follow some basic principles. SOA: each service talks to one or more other services, coupling several layers of services over well-defined interfaces; services are maintained and deployed separately. UNIX way: short, simple, clear, modular and extendable code that can be easily maintained and repurposed by developers other than its creators. KISS principle: simplicity should always be a key goal in architectural design, and unnecessary complexity should be avoided.
  • #16 These basic principles brought us to a pretty complex system overall, consisting of many moving parts running on top of our infrastructure. This diagram is a bit outdated, but it shows the idea. You don't need to understand every part of the system to build a new service or maintain an existing one. Most of the services are autonomous, and the number of services is growing continuously. Who maintains all this stuff?
  • #17 Our engineers are grouped into squads. A squad is the basic unit of development at Spotify, and squads have product-driven missions. A squad is similar to a Scrum team and is designed to feel like a mini-startup. They sit together, and they have all the skills and tools needed to design, develop, test, and release to production. Squads are self-organizing teams and decide their own way of working – some use Scrum sprints, some use Kanban, some use a mix of these approaches. Every squad owns some services and features. We have a lot of squads at Spotify, and dealing with multiple teams is always a challenge, especially across different offices and time zones. A matrix organization structure helps a lot to handle that:
  • #18 A matrix organization structure helps a lot to handle that; we have the following organizational primitives. Squad: the basic unit of development at Spotify, as mentioned before. Tribe: a collection of squads that work in related areas, such as general backend services, desktop clients, mobile clients, or feature squads working on a set of similar features. Chapter: a small family of people with similar skills working within the same general area, within the same tribe; chapter leads take care of employees' career growth. PO: the head of a squad, an entrepreneur who prioritizes the work and takes both business value and tech aspects into consideration. Every individual contributor is part of some squad and chapter. Besides that, we have guilds, which are basically "communities of interest": informal groups of people who want to share knowledge, tools, code, and practices – a QA Guild or a Python Guild, for example. Our structure and methodologies help us scale both the product and the organization well.
  • #19 Let's come back to operations: can we scale this structure? How many operations engineers are needed to handle 100 services? 1000 services? How easy is it to hire engineers nowadays? Ops and devs are blocking each other all the time.
  • #21 initiative
  • #22 Collaboration and autonomy: blockers, a single point of failure for service development. Who knows a service better than its developers?
  • #23 This timeline shows the evolution of the operations division. In Sep 2013 we merged the SRE and backend infrastructure organizations into the I/O tribe. The sizes of the blocks don't reflect the number of people.
  • #24 What is the IO tribe? It consists of squads grouped by product area, each with a product-driven mission. The IO Tribe provides the platform, tools, support, docs and best practices. Feature squads are responsible for operating their services themselves! So now we have developers instead of sysadmins, and operational responsibility is spread across the whole tech organization. Feature squads are more autonomous and have no handovers to an operations team.
  • #25 Capacity planning, on-call for services you own, deployments and so on. For every ops task we want squads to do, the IO Tribe should provide a toolset, documentation, best practices and support. Samples (Sony!): a story about how the squad that owns one of the backend services increased their capacity by 100 bare metal nodes with a couple of clicks in a web interface.
  • #26 Something really bad happens
  • #27 Despite being part of a product-driven organization, SREs fall back into a support function in many situations that require immediate attention:
  • #28 To handle that we have a Core SRE organization that involves highly skilled SREs from SA squads: resolving scalability issues, helping with systems design problems, "explaining" our platform, teaching best practices in general, and doing incident management: postmortems and remediations.
  • #30 Mistakes are OK unless the same ones are repeated twice. Every incident should be reviewed, and appropriate remediation should be made to avoid the same incident in the future. Anyone who may be concerned can attend incident postmortem and remediation meetings to influence the outcome. Mistakes -> more automation. Services have different SLAs and reliability requirements, so immediate action is not always required; incidents that don't affect major features should be handled by the feature squads. This approach helps us stay highly available. Example: music stops playing. Receive an alert from monitoring or support, find out what is broken, contact the broken service's owners and stakeholders if necessary, coordinate recovery, coordinate the post-mortem and schedule remediations.
  • #31 Plan for post-mortems. To get the most value out of this activity, you need to take it seriously. The postmortem should be a scheduled activity, with time for a meeting of the team to discuss the lessons learned and time for someone (or some group) to write the postmortem report.
    Keep it close in time. Don't let memories fade by scheduling the postmortem too long after the end of the project. Ship the software, have the celebration, and then roll right into the post-mortem.
    Record the project details. Part of the post-mortem report needs to be a recital of the details of the project: how big it was, how long it took, what software was used, what the objectives were, and so on.
    Involve everyone. There are two facets to this. First, different people will have different insights about the project, and you need to collect them all to really understand what worked and didn't. Second, getting everyone involved helps prevent the post-mortem from degenerating into scapegoating.
    Get it in writing. The project manager needs to own the process of reducing the post-mortem lessons to a written report, delegating this if necessary.
    Record successes as well as failures. It's easy for a post-mortem to degenerate into a blame session, especially if the project went over budget or the team didn't manage to deliver all the promised features.
    It's not for punishment. If you want honest post-mortems, management has to develop a reputation for listening openly to input and not punishing people for being honest.
    Create an action plan. The written post-mortem should make recommendations of how to continue things that worked, and how to fix things that didn't work. Remember, the idea is to learn from your successes and failures, not just to document them.
    Make it available. A software post-mortem locked in a filing cabinet in the sub-basement does no one any good. Good organizations store the supply of post-mortems somewhere that they're easily found.
  • #34 Too many things to do. We are working on well-defined requirements and an audit process for ops in squads. Questions squads have are not fully understood or answered by the teams providing infrastructure, including documentation and best practices. Plus visibility issues, abandoned services, and handovers between squads.
  • #35 +1 slide where are we heading now