Spotify: Playing for millions, tuning for more


Barcelona Developers Conference presentation by Nick Barkas and David Poblador i Garcia, 18 November 2011. It covers how we manage a large fleet of servers and some of the technologies we use to build a scalable, high-performance music streaming service.

Published in: Technology

  1. Playing for millions, tuning for more
      David Poblador i Garcia (@davidpoblador), Nick Barkas (@snb)
      Barcelona Developers Conference, Friday 18 November 2011
  2. Spotifiera anyone?
  3. Outline
      • Growth
      • Deploying lots of servers
      • Backend architecture overview
      • Communication protocols
      • Storage
      • Monitoring
      • Future improvements
  4. We’re kind of big
      • Over ten million registered users
      • Over two million paying subscribers
      • Launched in 12 countries
      • Over 15 million tracks*
      • Over 400 million playlists
      • Three datacentres
      • Over 1300 servers
      * Number of tracks licensed globally. Catalogue size varies in each country.
  5. We’re getting bigger!
      More countries
      • Added US (July) and Denmark (October) this year
      • Austria, Switzerland, and Belgium added this week
      More users
      • Sign-up via Facebook
      • From one to two million paying subscribers in six months
      More music!
      • Adding over 20,000 tracks each day
  6. How to manage so many servers?
      • ServerDB
      • FAI
      • Debian Packaging
      • Puppet (yes, we also hate it sometimes)
      • Monitoring
  7. ServerDB
      In-house tool
      An authoritative database of equipment
      • Locations
      • Datacentres
      • Hostnames
      Aiming to have it as the single source of info
      • DNS config
      • What server does what
      • Puppet classes
      • FAI classes
  8. FAI and Puppet
      FAI installs all the basic stuff on TFTP boot
      • Partitions based on server type (and FAI class)
      • Installs base packages (.deb, of course)
      • Sets the basic network configuration
      • Bootstraps Puppet
      Puppet takes over
      • Installs packages based on Puppet recipes
      • Our devs write Puppet manifests
      • We hate it (sometimes)
  9. Let’s install a server!
  10. Overview of Spotify components
      [Architecture diagram. Components shown: record labels; content ingestion, indexing, and transcoding; Amazon S3; CDN; backend services (ads, search, social, storage, playlist, browse, key, user, ...); Facebook; log analysis (Hadoop); access point; web API; clients.]
  11. Reducing bandwidth: P2P and caching
  12. DNS: finding services and resources
      What’s the hostname and port for the service I want?
      • SRV record (fields: name, ttl, prio, weight, port, host): 3600 SRV 10 50 8081
      Which service instance should I ask for a resource?
      • Distributed hash tables (DHT). Ring configuration:
        3600 TXT "slaves=0"
        3600 TXT "slaves=2 redundancy=host"
      • Mapping ring segment to service instance:
        3600 TXT "00112233445566778899aabbccddeeff"
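The two lookups on the DNS slide can be sketched roughly as follows. This is an illustration under stated assumptions, not Spotify's implementation: the hostnames are made up, hashing keys with MD5 is an assumption, and so is the rule that an instance owns the ring segment ending at its token. The slide only says that SRV records carry priority, weight, and port, and that TXT records map a 128-bit ring position to a service instance. The SRV choice follows the general idea of RFC 2782 (lowest priority first, then a weighted random pick).

```python
import bisect
import hashlib
import random

# Hypothetical SRV answers: (priority, weight, port, host). Hosts are made up.
SRV_RECORDS = [
    (10, 50, 8081, "host-a.example.net"),
    (10, 50, 8081, "host-b.example.net"),
    (20, 100, 8081, "host-c.example.net"),
]


def pick_srv(records, rng=random):
    """Pick one (host, port): lowest priority wins, weights split the load."""
    best = min(prio for prio, _, _, _ in records)
    candidates = [r for r in records if r[0] == best]
    weights = [r[1] for r in candidates]
    if sum(weights) == 0:
        chosen = rng.choice(candidates)  # all weights zero: uniform pick
    else:
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
    _, _, port, host = chosen
    return host, port


# Hypothetical ring: 128-bit hex tokens (as in the TXT records) -> instance.
RING = {
    "00112233445566778899aabbccddeeff": "host-a.example.net",
    "55555555555555555555555555555555": "host-b.example.net",
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa": "host-c.example.net",
}


def ring_lookup(ring, key):
    """Return the instance that owns `key`.

    Assumption: the key is hashed with MD5 (128 bits) and belongs to the
    first token at or after the hash, wrapping around to the smallest token.
    """
    by_token = {int(token, 16): instance for token, instance in ring.items()}
    tokens = sorted(by_token)
    position = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    index = bisect.bisect_left(tokens, position) % len(tokens)
    return by_token[tokens[index]]
```

A caller would first resolve the service with `pick_srv(SRV_RECORDS)` and then, for a sharded service, route a resource with something like `ring_lookup(RING, "some-resource-id")`. The same key always maps to the same instance as long as the ring is unchanged.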
  13. Communication between services
      Clients -> AP (access point): proprietary protocol
      AP -> service and service <-> service
      • HTTP
        ‣ Originally all services used this
        ‣ Simple, well known, battle tested
        ‣ Each service defines its own (usually) RESTful protocol
      • Splat: Service Platform
        ‣ Custom-built by Spotify devs
        ‣ Protocol defined with Thrift
        ‣ Provides replication and load balancing
  14. New communication framework: hermes
      Thin layer on top of ØMQ
      Data in messages are serialized as protobuf
      • Services define their APIs partly as protobuf messages
      Hermes messages embedded in client <-> AP protocol
      • AP doesn’t need to translate protocols; acts as ØMQ router
      In addition to request/reply, we get pub/sub
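The difference between the two messaging patterns named on the hermes slide can be illustrated with a toy in-process router. This is neither hermes nor the ØMQ API: the `Router` class, the service name, and the topic strings are hypothetical, and payloads are plain bytes standing in for serialized protobuf messages. Matching subscriptions by topic prefix mirrors how ØMQ SUB sockets filter messages; whether hermes uses topics this way is an assumption.

```python
from collections import defaultdict


class Router:
    """Toy in-process stand-in for a message router (not the ØMQ API)."""

    def __init__(self):
        self._handlers = {}                    # service name -> request handler
        self._subscribers = defaultdict(list)  # topic prefix -> callbacks

    def register(self, service, handler):
        """Attach the single handler that answers requests for `service`."""
        self._handlers[service] = handler

    def request(self, service, payload):
        """Request/reply: exactly one handler answers, the caller gets the reply."""
        return self._handlers[service](payload)

    def subscribe(self, prefix, callback):
        """Ask for every published message whose topic starts with `prefix`."""
        self._subscribers[prefix].append(callback)

    def publish(self, topic, payload):
        """Pub/sub: each matching subscriber gets a copy; no reply is expected."""
        delivered = 0
        for prefix, callbacks in self._subscribers.items():
            if topic.startswith(prefix):
                for callback in callbacks:
                    callback(topic, payload)
                    delivered += 1
        return delivered


# Usage: one request/reply exchange and one published notification.
router = Router()
router.register("playlist", lambda payload: b"reply:" + payload)
received = []
router.subscribe("playlist/", lambda topic, payload: received.append((topic, payload)))
reply = router.request("playlist", b"get")
delivered = router.publish("playlist/changed", b"update")
```

With request/reply the sender waits for one answer. With pub/sub the sender does not know or care how many receivers exist, which is what makes it useful for pushing change notifications.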
  15. Storage technologies
      Critical, consistency important: PostgreSQL
      • User info required for authentication
      Huge, growing, eventual consistency OK: Cassandra
      • Playlists, other user info, social
      Fast, small, read-only key-value: Tokyo Cabinet
      • Track/artist/album metadata, encryption keys
      Large files, read-only: Nginx caching proxy + Amazon S3
      • Music files, album cover art
  16. Monitoring
      We graph all our systems
      • Munin plugins to collect data
        ‣ Server related figures (CPU, disk...)
        ‣ Systems related figures (latency, playbacks...)
      • We use our own frontend to display the data
      Alerts are handled using Zabbix
      • We classify alerts by severity
      • High severity alerts are delivered to our pagers
        ‣ Currently we only get a handful per week
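For readers unfamiliar with Munin plugins: a plugin is an executable that prints graph metadata when run with the argument `config` and prints `<field>.value <number>` lines otherwise. The sketch below shows that contract for a server-related figure (the 1-minute load average). It is a generic example, not one of Spotify's plugins; the field name and graph labels are arbitrary choices.

```python
#!/usr/bin/env python3
import os
import sys

FIELD = "load"  # Munin field name, chosen for this example


def config_lines():
    """Lines Munin expects when the plugin is run with the 'config' argument."""
    return [
        "graph_title Load average (1 min)",
        "graph_vlabel load",
        "graph_category system",
        f"{FIELD}.label load",
    ]


def value_lines(load_1min):
    """Lines Munin expects on a normal run: '<field>.value <number>'."""
    return [f"{FIELD}.value {load_1min:.2f}"]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        lines = value_lines(os.getloadavg()[0])  # Unix only
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

A systems-related figure such as latency or playbacks would follow the same shape, with `value_lines` fed from the service's own counters instead of `os.getloadavg()`.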
  17. Future (and current) challenges
      Self-recovery
      • Diagnose
      • Take measures
      Auto notification
      • Do not bother ops, bother our suppliers
      Auto scaling
      • Bring up new servers
      Better way to register services than DNS
      • ZooKeeper? Faster to update, always consistent
  18. Thank you! Questions?
      Nick Barkas @snb
      David Poblador i Garcia @davidpoblador