Operations Engineering Evolution at Spotify
Lev Popov
Site Reliability Engineer
@nabamx
Who am I?
 Lev Popov
 Site Reliability Engineer at Spotify
 Joined Spotify in 2014
 Previously at Qik – Skype – Microsoft
 Background in services and networks operations
What is Spotify?
Some Numbers
• Over 60 million MAU (monthly active users)
• Over 15 million paying subscribers
• Over 30 million tracks
• Over 1.5 billion playlists
• Over 20,000 songs added per day
Capacity We Own
• 4 Data Centers
• Over 7000 bare metal servers
• Many different services
• Pushing an average of 35 GBps to the Internet
• 24/7/365
But let's talk about operations
In the beginning was the…
[Diagram: each service starts with just a dev owner; over time every service gets both a dev owner and an ops owner, while a central operations team handles on-call, monitoring, build systems, backups, DB, networks, …]
Operations Team in 2011
Thin group of 5 people
• Over 10 million users
• Over 2 million paying subscribers
• 12 Countries
• Over 15 million tracks
• Over 400 million playlists
• 3 datacenters
• Over 1300 servers
Operations Team Now
?
• Over 60 million users
• Over 15 million paying subscribers
• 58 Countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers
Operations Team Now
No team
• Over 60 million users
• Over 15 million paying subscribers
• 58 Countries
• Over 30 million tracks
• Over 1.5 billion playlists
• 4 datacenters
• Over 5000 servers
Spotify Engineering Culture
How We Scale
• Service oriented architecture
Separate services for separate features
• UNIX way
Small simple programs doing one thing well
• KISS principle
Simple applications are easier to scale
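To make "small simple programs doing one thing well" concrete, here is a minimal sketch in Python of a single-purpose backend service behind a tiny HTTP interface. It is purely illustrative: the endpoint, data and names are hypothetical, not an actual Spotify service.

# Hypothetical single-purpose service: look up a track title by id over HTTP.
# It does exactly one thing, behind one small, well-defined interface, which
# is what makes it easy to reason about and to scale horizontally.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

TRACKS = {"1": "Track One", "2": "Track Two"}  # stand-in for a real datastore


class TrackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Single responsibility: GET /track/<id> -> {"id": ..., "title": ...}
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "track" and parts[1] in TRACKS:
            status, body = 200, {"id": parts[1], "title": TRACKS[parts[1]]}
        else:
            status, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("", 8080), TrackHandler).serve_forever()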
How Spotify Works
Scaling Agile
• A squad is similar to a Scrum team
• Designed to feel like a small startup
• Self-organizing teams
• Autonomy to decide their own way of working
Scaling Agile
Can we scale that?
[Diagram repeated: each service has a dev owner and an ops owner, and the central operations team handles on-call, monitoring, build systems, backups, DB, networks, …]
Ops in Squads
Ops in Squads Background
Impossible to scale a central operations team
• Understaffed
• Difficult to find generalists
We believe that operations has to sit close to development
Our bet for autonomy
• Break dependencies
• End-to-end responsibility
Timeline
[Timeline of the operations division, 2008 – Early 2011 – Mid 2012 – Sep 2013: Dev, Backend Infrastructure, Operations, SRE, Internal IT; in Sep 2013, SRE and Backend Infrastructure merged into the I/O tribe and operations moved into squads]
Infrastructure Operations
[Diagram: the IO Tribe (squads such as networks, conf mgmt, containers) enables and supports the feature squads in each product area]
Ops in Squads
Expectations
Wait, wait, but what if…
[Diagram: Core SRE, within the IO Tribe, backs the squads on major incidents, scalability issues, systems design problems, and teaching best practices in general]
Incident Management
Incident Management
[Diagram: Incident → Postmortem → Remediation. The incident manager role is taken by on-call; postmortems involve everybody involved in the incident]
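As a rough sketch of this flow (an assumption drawn from the slide and the notes, not an internal Spotify tool), an incident moves through detection, recovery coordinated by the incident manager, a postmortem with everybody involved, and remediation:

# Hypothetical incident lifecycle matching the flow above.
from enum import Enum


class Stage(Enum):
    DETECTED = 1     # alert comes in from monitoring or support
    RECOVERED = 2    # incident manager (the on-call engineer) coordinates recovery
    POSTMORTEM = 3   # reviewed with everybody involved in the incident
    REMEDIATED = 4   # action items completed so the same mistake is not repeated


def advance(stage: Stage) -> Stage:
    """Move an incident to the next stage; remediation is the final state."""
    return Stage(min(stage.value + 1, Stage.REMEDIATED.value))


assert advance(Stage.DETECTED) is Stage.RECOVERED
assert advance(Stage.REMEDIATED) is Stage.REMEDIATED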
Postmortems
• Plan for post-mortems
• Keep it close in time
• Record the project details
• Involve everyone
• Get it in writing
• Record successes as well as failures
• It's not for punishment
• Create an action plan
• Make it available
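One way to "get it in writing" and keep an action plan is to capture each postmortem as structured data. The sketch below is a hypothetical format (field names and values are illustrative, not a Spotify-internal template):

# Hypothetical postmortem record: written down, close in time to the incident,
# covering successes as well as failures, with an action plan owned by people.
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str          # every remediation needs a responsible person
    due: date


@dataclass
class Postmortem:
    incident_id: str
    held_on: date                 # keep it close in time
    attendees: List[str]          # involve everyone
    what_went_well: List[str]     # record successes as well as failures
    what_went_wrong: List[str]
    action_plan: List[ActionItem] = field(default_factory=list)


pm = Postmortem(
    incident_id="INC-1234",       # made-up id
    held_on=date(2015, 3, 2),
    attendees=["on-call SRE", "service owner", "incident manager"],
    what_went_well=["Alerting fired within a minute"],
    what_went_wrong=["The runbook was out of date"],
    action_plan=[ActionItem("Update the runbook", "service owner", date(2015, 3, 9))],
)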
On-call follows the sun
[Diagram: on-call follows the sun between Stockholm and New York. L0: on-call engineers in Stockholm and New York; L1: SA Product Owners; L2: SA Lead. Handovers at 07 and 19 CET (13 and 01 EST)]
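Reading the handover times off the diagram (07/19 CET and 13/01 EST, i.e. 06:00 and 18:00 UTC if we ignore daylight saving), a follow-the-sun rotation can be expressed as a simple rule. This is only a sketch of the idea, not Spotify's actual scheduling tool:

# Sketch of the follow-the-sun handover: each site covers its local daytime.
from datetime import datetime, timezone


def on_call_site(now_utc: datetime) -> str:
    # Stockholm covers 07-19 CET (06:00-18:00 UTC),
    # New York covers 13-01 EST (18:00-06:00 UTC).
    return "Stockholm" if 6 <= now_utc.hour < 18 else "New York"


print(on_call_site(datetime(2015, 3, 2, 10, 0, tzinfo=timezone.utc)))  # Stockholm
print(on_call_site(datetime(2015, 3, 2, 22, 0, tzinfo=timezone.utc)))  # New York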
Areas of Improvement
Areas of Improvement
• The expectations we place on squads are sometimes unclear
• Communication between feature teams and infrastructure teams
• It's hard to measure the success of ops in squads
• Abandoned services and other ownership issues
Thank you.
@nabamx
lev@spotify.com

The Evolution of Operations at Spotify / Lev Popov (Spotify)

Editor's Notes

  • #5 We are kind of big and we are growing
  • #6 Available in 58 countries
  • #7 4 DCs: 2 in Europe (STO, LON) and 2 in the US, ASH (Virginia, East Coast) and SJC (California, West Coast). We have over 5k bare metal machines and growing. We use a service-oriented architecture with hundreds of different services. That's approximately 25 Wikipedias per minute. It's a non-stop party; the music can't stop streaming.
  • #9 Spotify was founded in 2006 and went live in 2008. As usual, operations was handled by one person in the beginning, and the number of operations engineers grew over time. At some point we arrived at the following scheme: every service has a dev owner and an ops owner; generally every ops engineer owned multiple services; all ops engineers were also responsible for general operations tasks such as being on-call, building monitoring, backups, etc.
  • #10 In 2011 we were already kind of big. Around 5 people held all operational responsibility.
  • #11 Guess what now?
  • #12 We have no operations team at all anymore
  • #13 That sounds strange, so let's talk about engineering culture at Spotify first before I continue with operations.
  • #15 We are growing fast. To handle that and scale appropriately, we follow some basic principles. SOA: each service talks to one or more other services, coupling several layers of services over well-defined interfaces; services are maintained and deployed separately. UNIX way: short, simple, clear, modular and extendable code that can be easily maintained and repurposed by developers other than its creators. KISS principle: simplicity should always be a key goal in architectural design, and unnecessary complexity should be avoided.
  • #16 These basic principles brought us to a pretty complex system overall, consisting of many moving parts running on top of our infrastructure. This diagram is a bit outdated, but it shows the idea. You don't need to understand every part of the system to build a new service or maintain an existing one. Most of the services are autonomous, and the number of services is growing continuously. Who maintains all this stuff?
  • #17 Our engineers are grouped into squads. A squad is the basic unit of development at Spotify, and squads have product-driven missions. A squad is similar to a Scrum team and is designed to feel like a mini-startup. They sit together, and they have all the skills and tools needed to design, develop, test, and release to production. Squads are self-organizing teams and decide their own way of working – some use Scrum sprints, some use Kanban, some use a mix of these approaches. Every squad owns some services and features. We have a lot of squads at Spotify, and dealing with multiple teams is always a challenge, especially across different offices and time zones. A matrix organization structure helps a lot to handle that:
  • #18 A matrix organization structure helps a lot to handle that; we have the following organizational primitives. Squad: the basic unit of development at Spotify, as mentioned before. Tribe: a collection of squads that work in related areas, such as general backend services, desktop clients, mobile clients, or feature squads working on a set of similar features. Chapter: a small family of people with similar skills working within the same general area, within the same tribe; chapter leads take care of employees' career growth. PO: the head of a squad, an entrepreneur who prioritizes the work and takes both business value and tech aspects into consideration. Every individual contributor is part of some squad and chapter. Besides that, we have guilds, which are basically "communities of interest": informal groups of people who want to share knowledge, tools, code, and practices – a QA Guild or a Python Guild, for example. Our structure and methodologies help us scale both the product and the organization well.
  • #19 Let's come back to operations: can we scale this structure? How many operations engineers are needed to handle 100 services? 1000 services? How easy is it to hire engineers nowadays? Ops and devs are blocking each other all the time.
  • #21 initiative
  • #22 Collaboration and autonomy: blockers, a single point of failure for service development. Who knows a service better than its developers?
  • #23 This timeline shows the evolution of the operations division. In Sep 2013 we merged the SRE and backend infrastructure organizations into the I/O tribe. The sizes of the blocks don't reflect the number of people.
  • #24 What is the IO tribe? It consists of squads grouped by product area, each with a product-driven mission. The IO Tribe provides the platform, tools, support, docs and best practices. Feature squads are responsible for operating their services themselves! So now we have developers instead of sysadmins, and operational responsibility is spread across the whole tech organization. Feature squads are more autonomous and have no handovers to an operations team.
  • #25 Capacity planning, on-call for services you own, deployments and so on. For every ops task we want squads to do, the IO Tribe should provide a toolset, documentation, best practices and support. Samples (Sony!): a story about how the squad that owns one of the backend services increased their capacity by 100 bare metal nodes with a couple of clicks in a web interface.
  • #26 Something really bad happens
  • #27 Despite being part of a product-driven organization, SREs fall back into a support function in many situations that require immediate attention:
  • #28 To handle that we have a Core SRE organization that involves highly skilled SREs from SA squads: resolving scalability issues, helping with systems design problems, "explaining" our platform, teaching best practices in general, and doing incident management: postmortems and remediations.
  • #30 Mistakes are OK unless the same ones are repeated twice. Every incident should be reviewed, and appropriate remediation should be made to avoid the same incident in the future. Anyone who may be concerned can attend incident postmortem and remediation meetings to influence the outcome. Mistakes -> more automation. Services have different SLAs and reliability requirements, so immediate action is not always required; incidents that don't affect major features should be handled by the feature squads. This approach helps us stay highly available. Example: music stops playing. Receive an alert from monitoring or support, find out what is broken, contact the broken service's owners and stakeholders if necessary, coordinate recovery, coordinate the post-mortem and schedule remediations.
  • #31 Plan for post-mortems. To get the most value out of this activity, you need to take it seriously. The postmortem should be a scheduled activity, with time for a meeting of the team to discuss the lessons learned and time for someone (or some group) to write the postmortem report.
    Keep it close in time. Don't let memories fade by scheduling the postmortem too long after the end of the project. Ship the software, have the celebration, and then roll right into the post-mortem.
    Record the project details. Part of the post-mortem report needs to be a recital of the details of the project: how big it was, how long it took, what software was used, what the objectives were, and so on.
    Involve everyone. There are two facets to this. First, different people will have different insights about the project, and you need to collect them all to really understand what worked and didn't. Second, getting everyone involved helps prevent the post-mortem from degenerating into scapegoating.
    Get it in writing. The project manager needs to own the process of reducing the post-mortem lessons to a written report, delegating this if necessary.
    Record successes as well as failures. It's easy for a post-mortem to degenerate into a blame session, especially if the project went over budget or the team didn't manage to deliver all the promised features.
    It's not for punishment. If you want honest post-mortems, management has to develop a reputation for listening openly to input and not punishing people for being honest.
    Create an action plan. The written post-mortem should make recommendations of how to continue things that worked, and how to fix things that didn't work. Remember, the idea is to learn from your successes and failures, not just to document them.
    Make it available. A software post-mortem locked in a filing cabinet in the sub-basement does no one any good. Good organizations store the supply of post-mortems somewhere that they're easily found.
  • #34 Too many things to do. We are working on well-defined requirements and an audit process for ops in squads. Questions squads have are not fully understood or answered by the teams providing infrastructure, including documentation and best practices. Plus visibility issues, abandoned services, and handovers between squads.
  • #35 +1 slide where are we heading now