July 10, 2013
Data center &
Backend buildout
Emil Fredriksson
David Poblador i Garcia
@davidpoblador
• Some numbers about Spotify
• Data centers, Infrastructure
and Capacity
• How Spotify works
• What are we working on now?
Some numbers
•1000M+ playlists
•Over 24M active users
•Over 20M songs (adding 20K every day)
•Over 6M paying subscribers
•Available in 28 markets
Operations
in numbers
•90+ backend systems
•23 SRE engineers
•2 locations: NYC and Stockholm
•Around 15 teams building the Spotify Platform
in Operations and Infrastructure
Data centers,
infrastructure
and capacity
Data centers:
our factories
•Input electricity, servers and software.
Get the Spotify services as output
•We have to scale it up as we grow our
business
•Where the software meets the real world and
customers
•If it does not work, the music stops playing
The capacity
challenge
•Supporting our service for a growing number
of users
•New, more complex features require server
capacity
•Keeping up with very fast software
development
Delivering capacity
•We operate four data centers with more than
5 000 servers and 140Gbps of Internet
capacity
•In 2008 there were 20 servers
•Renting space in large data center facilities
•Owning and operating hardware and network
What we need in a
data center
•Reliable power supply
•Air conditioning
•Secure space
•Network POPs
•Remote hands
•Shipping and handling
Pods – standard
data center units
•Deploying a new data center takes a long
time!
•We need to be agile and fast to keep up with
the product development
•We solve this by standardizing our data
centers and networking into pods and
pre-provisioning servers
•Target is to keep 30% spare capacity at all
times
Pods – standard
data center units
•44 racks in one pod, about 1500 servers
•Racks redundantly connected with 10GE
uplink to core switches
•Pod is directly connected to the Internet via
multiple 10GE transit links
•Build it the same way every time
•Include the base infrastructure services
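As a rough illustration of the pod arithmetic above, the sketch below estimates how many pods a given number of in-use servers would need while keeping the 30% spare target. Reading "30% spare" as 30% of installed servers left unused is an assumption, and `pods_needed` is a hypothetical helper, not a Spotify tool.

```python
SERVERS_PER_POD = 1500  # "about 1500 servers" per pod, from the slide
RACKS_PER_POD = 44      # from the slide
SPARE_PERCENT = 30      # assumption: 30% of installed servers stay unused


def usable_servers_per_pod() -> int:
    """Servers per pod that may carry load once the spare share is set aside."""
    return SERVERS_PER_POD * (100 - SPARE_PERCENT) // 100


def pods_needed(servers_in_use: int) -> int:
    """Smallest number of pods that still leaves the spare target free."""
    usable = usable_servers_per_pod()
    return -(-servers_in_use // usable)  # integer ceiling division


def servers_per_rack() -> int:
    """Average rack density implied by the slide's figures."""
    return round(SERVERS_PER_POD / RACKS_PER_POD)
```

Under these assumptions a pod carries at most 1050 loaded servers, so growth is planned in whole-pod steps rather than server by server.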
Data center
locations
•You cannot go faster than light
•Distance == Latency
•Current locations: Stockholm, London,
Ashburn (US east coast), San Jose (US west
coast)
•Static content on CDN. Dynamic content
comes from our data centers
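To put a number on "Distance == Latency", the sketch below computes a lower bound on round-trip time between two of the listed locations. The city coordinates are approximate, the 2/3 factor for light in optical fiber is a common rule of thumb, and real paths are longer than the great circle, so actual latency will be higher.

```python
import math

C_VACUUM_KM_S = 299_792.458  # speed of light in vacuum
FIBER_FACTOR = 2 / 3         # assumption: light in fiber travels at roughly 2/3 c
EARTH_RADIUS_KM = 6371.0     # mean Earth radius

# Approximate city coordinates in degrees (illustrative, not data center sites).
STOCKHOLM = (59.33, 18.07)
SAN_JOSE = (37.34, -121.89)


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_rtt_ms(distance_km):
    """Lower bound on round-trip time over a straight fiber path."""
    return 2 * distance_km / (C_VACUUM_KM_S * FIBER_FACTOR) * 1000


if __name__ == "__main__":
    d = great_circle_km(STOCKHOLM, SAN_JOSE)
    print(f"{d:.0f} km, at least {min_rtt_ms(d):.0f} ms round trip")
```

A bound of this size on every request is why dynamic content is served from a data center near the user.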
So what about the
public clouds?
•Commoditization of the data center is
happening now, few companies will need to
build data centers in the future
•We already use both AWS S3 and EC2, usage
will increase
•Challenges that still remain:
•Inter node network performance
•Cost (at large scale)
•Flexible hardware configurations
Automated
installation
•Information about servers goes into a database:
MAC address, hardware configuration, location,
networks, hostnames and state (available, in-use)
•Automatic generation of DNS, DHCP and PXE
records
•Cobbler used as an installation server
•Single command installs multiple servers in
multiple data centers
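The record generation step can be sketched as follows. This is a minimal illustration, not Cobbler's API or Spotify's tooling: the `Server` fields mirror the slide, the MAC and IP values are made up, and the output uses ISC dhcpd host stanzas and zone-file A records as assumed target formats.

```python
from dataclasses import dataclass


@dataclass
class Server:
    hostname: str
    mac: str
    ip: str
    state: str  # "available" or "in-use", as on the slide


def dhcp_host_entry(s: Server) -> str:
    """Render an ISC dhcpd-style host stanza for one server."""
    short = s.hostname.split(".")[0]
    return (f"host {short} {{ hardware ethernet {s.mac}; "
            f"fixed-address {s.ip}; }}")


def dns_a_record(s: Server, ttl: int = 3600) -> str:
    """Render a zone-file A record for one server."""
    return f"{s.hostname}. {ttl} IN A {s.ip}"


def render(servers):
    """Return (dhcp_lines, dns_lines) for every server in the inventory."""
    return ([dhcp_host_entry(s) for s in servers],
            [dns_a_record(s) for s in servers])


if __name__ == "__main__":
    # Hypothetical inventory row; 192.0.2.0/24 is a documentation range.
    inventory = [Server("host1.spotify.net", "00:11:22:33:44:55",
                        "192.0.2.10", "available")]
    for lines in render(inventory):
        print("\n".join(lines))
```

Keeping the database as the single source and regenerating all records from it is what makes a one-command install across data centers feasible.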
How Spotify works
[Architecture diagram. Labels: Clients; www.spotify.com; access point; Backend services: storage, search, playlist, user, web api, browse, ads, social, key, ...; Facebook; Amazon S3; CDN; Content ingestion, indexing, and transcoding; Log analysis (hadoop); Record labels]
DNS à la Spotify
•Distribution of clients
•Error reporting by clients
•Service discovery
•DHT ring configuration
DNS: Service
discovery
•_playlist: service name
•_http: protocol
•3600: ttl
•10: prio
•50: weight
•8081: port
•host1.spotify.net: host
_playlist._http.spotify.net 3600 SRV 10 50 8081 host1.spotify.net.
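A client-side view of the record above might look like the sketch below. It splits the line into the fields named on the slide and picks an instance by priority and weight. The selection is a simplification of the procedure in RFC 2782, and `parse_srv` and `pick` are hypothetical helpers, not Spotify code.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    service: str
    protocol: str
    domain: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(line: str) -> SrvRecord:
    """Parse '<name> <ttl> SRV <prio> <weight> <port> <target>'."""
    name, ttl, rtype, prio, weight, port, target = line.split()
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {line!r}")
    service, protocol, domain = name.split(".", 2)
    return SrvRecord(service.lstrip("_"), protocol.lstrip("_"), domain,
                     int(ttl), int(prio), int(weight), int(port),
                     target.rstrip("."))


def pick(records, rng=random):
    """Choose among the lowest-priority records, weighted by 'weight'."""
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rec = parse_srv("_playlist._http.spotify.net 3600 SRV 10 50 8081 "
                    "host1.spotify.net.")
    chosen = pick([rec])
    print(f"{chosen.protocol}://{chosen.target}:{chosen.port}")
```

A lower priority value is tried first; weight only spreads load among records that share the same priority.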
DNS: DHT rings
Which service instance should I ask
for a resource?
•Configuration
config._key._http.spotify.net 3600 TXT "slaves=0"
config._key._http.spotify.net 3600 TXT "slaves=2 redundancy=host"
•Mapping ring segment to service instance
tokens.8081.host1.spotify.net 3600 TXT "00112233445566778899aabbccddeeff"
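How a token maps a resource to an instance is not spelled out on the slide, so the following is only a sketch under stated assumptions: each instance owns the ring segment that starts at its token, resource keys are hashed to 128 bits (truncated SHA-256 here, chosen only to match the token width), and the second token and host are made up.

```python
import bisect
import hashlib


class Ring:
    def __init__(self, tokens):
        """tokens: mapping of 32-hex-digit token -> 'host:port' instance."""
        self._entries = sorted((int(t, 16), inst) for t, inst in tokens.items())
        self._keys = [t for t, _ in self._entries]

    def owner_of_hash(self, h: int) -> str:
        # Assumption: the owner is the instance with the largest token <= h.
        # When h is below every token, index -1 wraps to the last instance.
        i = bisect.bisect_right(self._keys, h) - 1
        return self._entries[i][1]

    def owner(self, resource: str) -> str:
        # Assumption: truncated SHA-256 stands in for the real hash function.
        digest = hashlib.sha256(resource.encode("utf-8")).digest()[:16]
        return self.owner_of_hash(int.from_bytes(digest, "big"))


if __name__ == "__main__":
    ring = Ring({
        "00112233445566778899aabbccddeeff": "host1.spotify.net:8081",
        "80000000000000000000000000000000": "host2.spotify.net:8081",  # made up
    })
    print(ring.owner("some-resource-id"))
```

Because every client derives the same ring from the same TXT records, any client can find the right instance without asking a central coordinator.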
Databases:
Cassandra & Postgres
•Critical data where consistency is important:
PostgreSQL
•Huge, growing fast, eventual consistency OK:
Cassandra
Storage:
Production Storage
•Read only
•Large files
•HTTP based
•nginx + storage proxies + Amazon S3
Other types of storage
•Hadoop
•Tokyo Cabinet
•CDB
•BDB
Communication protocols
between services: HTTP
•Originally used by every system
•Simple
•Well known
•Battle tested
•Proper implementations in many languages
•Each service defines its own RESTful protocol
Communication protocols
between services: Hermes
Thin layer on top of ØMQ
Data in messages is serialized as protobuf
•Services define their APIs partly as protobuf
Hermes is embedded in the client-AP protocol
•AP doesn’t need to translate protocols, it is just a
message router.
In addition to request/reply, we get pub/sub.
Configuration management
•We use Puppet
•Installs Debian packages based on recipes
•Teams developing a system write Puppet
manifests
•Hiera: a simple hierarchical database for
service parameters
•Not the most scalable solution
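The idea behind a hierarchical parameter store can be sketched in a few lines: levels are searched from most specific to most general, and the first level that defines the key wins. This mimics Hiera's default first-match lookup in spirit only. The level names and parameters below are hypothetical, and this is not Hiera's API.

```python
def lookup(key, hierarchy, data, default=None):
    """Return the value from the first (most specific) level defining key."""
    for level in hierarchy:
        values = data.get(level, {})
        if key in values:
            return values[key]
    return default


# Hypothetical parameter data, most general values under "common".
DATA = {
    "common": {"playlist_port": 8081, "log_level": "info"},
    "site/stockholm": {"log_level": "warning"},
    "host/host1.spotify.net": {"log_level": "debug"},
}

# Search order for one host: host first, then its site, then common.
HIERARCHY = ["host/host1.spotify.net", "site/stockholm", "common"]

if __name__ == "__main__":
    print(lookup("log_level", HIERARCHY, DATA))
    print(lookup("playlist_port", HIERARCHY, DATA))
```

Teams then only override what differs for a host or site and inherit everything else from the common level.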
Working on...
Operational responsibility
delegation
•Each feature team takes responsibility for the
entire stack: from developing a system to
running and operating it.
•Mentality shift: from “it works” to “it scales”
•Full responsibility: capacity planning,
monitoring, incident management.
•Risk of reinventing square wheels. Closing the
feedback loop is key.
Service Discovery
•DNS will stay
•We can’t afford rewriting every system
•We like to be able to use standard tools (dig)
to troubleshoot
•We aim for hands-free zone file
management
•Automated registration and deregistration of
nodes is a goal
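One way to picture hands-free zone management is a registry that nodes join and leave, from which the SRV lines are regenerated. The sketch below is purely illustrative: `Registry` is a hypothetical class, the default priority and weight are taken from the example record earlier in the deck, and nothing here describes the system Spotify built.

```python
class Registry:
    def __init__(self, domain="spotify.net", ttl=3600):
        self.domain = domain
        self.ttl = ttl
        # (service, protocol, host, port) -> (priority, weight)
        self._nodes = {}

    def register(self, service, protocol, host, port, priority=10, weight=50):
        """Add or update a service instance."""
        self._nodes[(service, protocol, host, port)] = (priority, weight)

    def deregister(self, service, protocol, host, port):
        """Remove a service instance; unknown instances are ignored."""
        self._nodes.pop((service, protocol, host, port), None)

    def zone_lines(self):
        """Render one SRV line per registered instance, in stable order."""
        lines = []
        for (service, protocol, host, port), (prio, weight) in sorted(
                self._nodes.items()):
            lines.append(f"_{service}._{protocol}.{self.domain} {self.ttl} "
                         f"SRV {prio} {weight} {port} {host}.")
        return lines


if __name__ == "__main__":
    reg = Registry()
    reg.register("playlist", "http", "host1.spotify.net", 8081)
    print("\n".join(reg.zone_lines()))
```

Since the output is still plain DNS, existing clients and tools such as dig keep working unchanged.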
Unit of deployment
(containers)
•Runs on top of our OS platform
•Consistency between different environments (testing,
production, public cloud, development boxes...)
•Version N always looks the same
•Testability improves
•Deployments are fast. Gradual rollouts FTW!
•Rollbacks are easy
•Configurations could be part of the bundle
Incident management
process improvements
•Main objective: A type of incident happens only once.
•Streamline internal and external communication
•Teams developing a system lead the process for
incidents connected with it
•SRE leads the process for incidents affecting multiple
pieces that require a higher level of coordination
•Mitigation > Post-mortem > Remediation > Resolution
More stuff being done
•Explaining our challenges to the world
•Open-sourcing many of our tools
•Self-service provisioning of capacity
•Improvements in our continuous integration pipeline
•Network platform
•OS platform
•Automation everywhere
•Recruitment
We are hiring
spoti.fi/ops-jobs
Gràcies (thank you)! Q & A
spoti.fi/ops-jobs
Emil Fredriksson / David Poblador i Garcia

Spotify: Data center & Backend buildout

  • 1. July 10, 2013 Data center & Backend buildout Emil Fredriksson David Poblador i Garcia @davidpoblador
  • 2. July 10, 2013 • Some numbers about Spotify • Data centers, Infrastructure and Capacity • How Spotify works • What are we working on now?
  • 3. Some numbers •1000M+ playlists •Over 24M active users •Over 20M songs (adding 20K every day) •Over 6M paying subscribers •Available in 28 markets
  • 4. Operations in numbers •90+ backend systems •23 SRE engineers •2 locations: NYC and Stockholm •Around 15 teams building the Spotify Platform in Operations and Infrastructure
  • 5. July 10, 2013 Data centers, infrastructure and capacity
  • 6. Data centers: our factories •Input electricity, servers and software. Get the Spotify services as output •We have to scale it up as we grow our business •Where the software meets the real world and customers •If it does not work, the music stops playing
  • 7. The capacity challenge •Supporting our service for a growing number of users •New more complex features require server capacity •Keeping up with very fast software development
  • 8. Delivering capacity •We operate four data centers with more than 5 000 servers and 140Gbps of Internet capacity •In 2008 there were 20 servers •Renting space in large data center facilities •Owning and operating hardware and network
  • 9. What we need in a data center •Reliable power supply •Air conditioning •Secure space •Network POPs •Remote hands •Shipping and handling
  • 10. Pods – standard data center units •Deploying a new data center takes a long time! •We need to be agile and fast to keep up with product development •We solve this by standardizing our data centers and networking into pods and pre-provisioning servers •Target is to keep 30% spare capacity at all times
  • 11. Pods – standard data center units •44 racks in one pod, about 1500 servers •Racks redundantly connected with 10GE uplink to core switches •Pod is directly connected to the Internet via multiple 10GE transit links •Build it the same way every time •Include the base infrastructure services
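The pod figures on slides 10 and 11 lend themselves to a quick capacity estimate. The sketch below is illustrative only: it assumes the 30% spare target is measured against installed servers in a pod, and the function name `pods_needed` is hypothetical. It is not a statement about how many pods Spotify actually runs.

```python
def pods_needed(servers_in_use, servers_per_pod=1500, spare_pct=30):
    """Pods required so that spare_pct of installed servers stay idle.

    Assumption: the spare target applies per pod, to installed servers.
    """
    # Integer arithmetic avoids floating-point rounding surprises.
    usable_per_pod = servers_per_pod * (100 - spare_pct) // 100
    return -(-servers_in_use // usable_per_pod)  # ceiling division


# With about 1500 servers per pod, 1050 are usable at 30% spare.
print(pods_needed(5000))
```

Under these assumptions a pod offers 1050 usable servers, so growth has to be planned in pod-sized steps rather than server by server.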
  • 13. Data center locations •You cannot go faster than light •Distance == Latency •Current locations: Stockholm, London, Ashburn (US east coast), San Jose (US west coast) •Static content on CDN. Dynamic content comes from our data centers
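To put a number on "Distance == Latency" from slide 13, the sketch below estimates a lower bound on round-trip time between the listed sites. The coordinates are approximate, and the fiber propagation speed of about 200,000 km/s (roughly two thirds of the speed of light in vacuum) is an assumption. Real paths are longer than the great circle and add routing and queuing delay, so measured latency will be higher.

```python
import math

# Approximate (latitude, longitude) in degrees; illustrative values only.
SITES = {
    "Stockholm": (59.33, 18.07),
    "London": (51.51, -0.13),
    "Ashburn": (39.04, -77.49),
    "San Jose": (37.34, -121.89),
}

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_S = 200_000.0  # assumption: about 2/3 of c in optical fiber


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_rtt_ms(a, b):
    """Lower bound on round-trip time over a straight fiber path, in ms."""
    return 2 * great_circle_km(a, b) / FIBER_SPEED_KM_S * 1000


for name, coords in SITES.items():
    if name != "Stockholm":
        print(f"Stockholm <-> {name}: at least "
              f"{min_rtt_ms(SITES['Stockholm'], coords):.0f} ms RTT")
```

Even this optimistic bound puts a Stockholm to US west coast round trip in the tens of milliseconds, which is why dynamic content is served from a data center near the user.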
  • 14. So what about the public clouds? •Commoditization of the data center is happening now; few companies will need to build data centers in the future •We already use both AWS S3 and EC2, and usage will increase •Challenges that still remain: •Inter-node network performance •Cost (at large scale) •Flexible hardware configurations
  • 16. Automated installation •Information about servers goes into a database: MAC address, hardware configuration, location, networks, hostnames and state (available, in-use) •Automatic generation of DNS, DHCP and PXE records •Cobbler used as an installation server •A single command installs multiple servers in multiple data centers
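Slide 16 describes generating DNS and DHCP records from a server database. A minimal sketch of that idea follows. The `Server` fields, the example hostname and the addresses are hypothetical, and the talk does not show Spotify's actual schema or tooling. The output uses the ISC dhcpd `host` declaration and a standard zone-file A record.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical subset of the inventory fields named on the slide.
    hostname: str
    mac: str
    ip: str      # assumption: one address derived from the "networks" field
    state: str   # "available" or "in-use"


def dhcp_host_entry(server):
    """Render one ISC dhcpd 'host' declaration for a server."""
    short_name = server.hostname.split(".")[0]
    return (
        f"host {short_name} {{\n"
        f"  hardware ethernet {server.mac};\n"
        f"  fixed-address {server.ip};\n"
        f"}}"
    )


def dns_a_record(server, ttl=3600):
    """Render one zone-file A record for a server."""
    return f"{server.hostname}. {ttl} IN A {server.ip}"


inventory = [
    Server("host1.example.net", "00:11:22:33:44:55", "192.0.2.10", "available"),
]
for s in inventory:
    print(dns_a_record(s))
    print(dhcp_host_entry(s))
```

Keeping the database as the single source of truth means the DNS and DHCP configuration can be regenerated at any time instead of being edited by hand.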
  • 17. July 10, 2013 How Spotify works
  • 19. DNS à la Spotify •Distribution of clients •Error reporting by clients •Service discovery •DHT ring configuration
  • 20. DNS: Service discovery •Record: _playlist._http.spotify.net 3600 SRV 10 50 8081 host1.spotify.net. •_playlist: service name •_http: protocol •3600: ttl •10: priority •50: weight •8081: port •host1.spotify.net: host
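The SRV record on slide 20 can be consumed roughly as follows. This is a sketch, not Spotify's client code: it parses the zone-file form of the record and picks an instance by lowest priority value, then by weight, which is a simplified version of the selection described in RFC 2782. The names `SrvRecord`, `parse_srv` and `pick_instance` are hypothetical.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "name ttl priority weight port target")


def parse_srv(line):
    """Parse a zone-file style SRV line as shown on the slide."""
    name, ttl, rtype, priority, weight, port, target = line.split()
    if rtype != "SRV":
        raise ValueError(f"not an SRV record: {line!r}")
    return SrvRecord(name, int(ttl), int(priority), int(weight), int(port),
                     target.rstrip("."))


def pick_instance(records, rng=random):
    """Choose among the lowest-priority-value records, weighted by 'weight'."""
    best = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == best]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        # All weights zero: fall back to a uniform choice.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]


record = parse_srv(
    "_playlist._http.spotify.net 3600 SRV 10 50 8081 host1.spotify.net."
)
chosen = pick_instance([record])
print(f"connect to {chosen.target}:{chosen.port}")
```

In practice a client would fetch the records with a DNS resolver library rather than parse text. The same lookup can be checked by hand with `dig`, which is the troubleshooting benefit slide 30 mentions.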
  • 21. DNS: DHT rings •Which service instance should I ask for a resource? •Configuration: config._key._http.spotify.net 3600 TXT “slaves=0”; config._key._http.spotify.net 3600 TXT “slaves=2 redundancy=host” •Mapping ring segment to service instance: tokens.8081.host1.spotify.net 3600 TXT “00112233445566778899aabbccddeeff”
  • 22. Databases: Cassandra & Postgres •Critical data where consistency is important: PostgreSQL •Huge, growing fast, eventual consistency OK: Cassandra
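Slide 21 maps ring segments to service instances through 128-bit tokens published in TXT records. The sketch below shows one common way such a ring can be queried. It assumes the resource key is hashed with MD5 (which matches the 32-hex-digit token width but is not confirmed by the talk) and that an instance owns the segment ending at its token. The `Ring` class and the second token are hypothetical.

```python
import bisect
import hashlib


class Ring:
    def __init__(self, tokens):
        """tokens maps a 32-hex-digit token string to an instance name."""
        self._points = sorted((int(tok, 16), inst) for tok, inst in tokens.items())
        self._keys = [point for point, _ in self._points]

    def lookup(self, resource_key):
        """Return the instance owning the segment that contains the key."""
        # Assumption: keys are placed on the ring by their MD5 digest.
        h = int(hashlib.md5(resource_key.encode("utf-8")).hexdigest(), 16)
        # First token at or after the hash; wrap around past the last token.
        i = bisect.bisect_left(self._keys, h) % len(self._keys)
        return self._points[i][1]


ring = Ring({
    "00112233445566778899aabbccddeeff": "host1.spotify.net:8081",
    "80112233445566778899aabbccddeeff": "host2.spotify.net:8081",  # made up
})
print(ring.lookup("playlist:example"))
```

Because every client derives the same owner from the same DNS data, requests for a given resource consistently land on the same instance without a central lookup service.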
  • 23. Storage: Production Storage •Read only •Large files •HTTP based •nginx + storage proxies + Amazon S3
  • 24. Other types of storage •Hadoop •Tokyo Cabinet •CDB •BDB
  • 25. Communication protocols between services: HTTP •Originally used by every system •Simple •Well known •Battle tested •Proper Implementations in many languages •Each service defines its own RESTful protocol
  • 26. Communication protocols between services: Hermes •Thin layer on top of ØMQ •Data in messages is serialized as protobuf •Services define their APIs partly as protobuf •Hermes is embedded in the client-AP protocol •AP doesn’t need to translate protocols, it is just a message router •In addition to request/reply, we get pub/sub
  • 27. Configuration management •We use Puppet •Installs Debian packages based on recipes •Teams developing a system write Puppet manifests •Hiera: simple Hierarchical Database for service parameters •Not the most scalable solution
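The hierarchical lookup that slide 27 attributes to Hiera can be illustrated with a few lines of Python. This is a conceptual sketch of "most specific level wins", not Hiera's implementation. The level names, keys and values are hypothetical.

```python
def hiera_style_lookup(key, hierarchy, default=None):
    """Return the value from the first (most specific) level defining key.

    hierarchy is an ordered list of (level_name, data) pairs, most
    specific first.
    """
    for _level_name, data in hierarchy:
        if key in data:
            return data[key]
    return default


# Hypothetical levels: one node, its data center, then global defaults.
hierarchy = [
    ("node/host1", {"playlist::port": 8081}),
    ("site/stockholm", {"playlist::workers": 16}),
    ("common", {"playlist::port": 8080, "playlist::workers": 8}),
]

print(hiera_style_lookup("playlist::port", hierarchy))
print(hiera_style_lookup("playlist::workers", hierarchy))
```

The appeal of this layout is that a service parameter is defined once as a default and overridden only where a site or host genuinely differs.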
  • 29. Operational responsibility delegation •Each feature team takes responsibility for the entire stack: from developing a system to running and operating it. •Mentality shift: from “it works” to “it scales” •Full responsibility: capacity planning, monitoring, incident management. •Risk of reinventing square wheels. Closing the feedback loop is key.
  • 30. Service Discovery •DNS will stay •We can’t afford rewriting every system •We like to be able to use standard tools (dig) to troubleshoot •We aim to have hands-free zone file management •Automated registration and deregistration of nodes is a goal
  • 31. Unit of deployment (containers) •Runs on top of our OS platform •Consistency between different environments (testing, production, public cloud, development boxes...) •Version N always looks the same •Testability improves •Deployments are fast. Gradual rollouts FTW! •Rollbacks are easy •Configurations could be part of the bundle
  • 32. Incident management process improvements •Main objective: A type of incident happens only once. •Streamline internal and external communication •Teams developing a system lead the process for incidents connected with it •SRE leads the process for incidents affecting multiple pieces that require a higher level of coordination •Mitigation > Post-mortem > Remediation > Resolution
  • 33. More stuff being done •Explaining our challenges to the world •Opensourcing many of our tools •Self-service provisioning of capacity •Improvements in our continuous integration pipeline •Network platform •OS platform •Automation everywhere •Recruitment
  • 34. July 10, 2013 We are hiring spoti.fi/ops-jobs
  • 35. July 10, 2013 Gràcies! Q & A spoti.fi/ops-jobs Emil Fredriksson / David Poblador i Garcia