Obvious and Non-Obvious Scalability Issues: Spotify Learnings

These are the slides for the talk I gave at the Barcelona Developers Conference 2013. In this talk, I cover some of the scalability issues we have faced during the intense growth we have experienced since 2008. The talk is mostly aimed at systems and backend engineers.

Note: some of the slides are not super awesome because the transitions were lost in the conversion to PDF.



  1. Obvious and Non-Obvious Scalability Issues: Spotify Learnings. David Poblador i Garcia, @davidpoblador. BcnDevCon13, November 12, 2013
  2. Spotify in numbers
  3. 2011
  4. 2011
  5. 2013
  6. One order of magnitude bigger in some dimensions
  7. 1,000,000,000 playlists. 400M two years ago. 2M new playlists every day
  8. 6,000+ servers. 1,300 two years ago. 20 in 2008
  9. Available in 32 markets. 12 two years ago
  10. Two years ago: fewer than 10 people in Ops + Infra. More than 10 times bigger now
  11. 4 Data Centers
  12. More than 20M songs. Adding 20K every day
  13. More than 24M active users. 6M paying subscribers
  14. More than 50 teams building products & features
  15. Around 100 backend systems
  16. Learning to Scale
  17. Scaling Data Centers
  18. 1. Scaling Data Centers: Admit that when you are small, there is someone better than you at building data centers
  19. 2009? Scaling Data Centers
  20. 2. Scaling Data Centers: Streamline your procurement process
  21. 2012. Scaling Data Centers
  22. 3. Scaling Data Centers: Have a “unit of capacity”. We call it a POD
  23. 2012. Scaling Data Centers
  24. 4. Scaling Data Centers: Data centers are being commoditized. Chances are that only a few players will deploy DCs in the future. Keep an eye on that; it might make sense for your needs
  25. cloud. Scaling Operations
  26. Scaling your backend
  27. Scaling your backend (diagram: many users connect through Access Points (APs) to backend services)
  28. Know your limits. Scaling your backend
  29. Examples: one AP handles 60K users / 5,000 reqs/second (diagram: users → APs → backend service nodes). Scaling your backend
  30. Do not try to be ‘too smart’
  31. DNS à la Spotify. Do not try to be ‘too smart’
  32. Error Reporting, DHT ring lookup, Service Discovery, User Distribution. Do not try to be ‘too smart’
  33. Do not try to be ‘too smart’ (diagram: users resolved to nearby APs via DNS GeoIP magic)
  34. Do not try to be ‘too smart’
  35. 8.8.8.8. Do not try to be ‘too smart’
  36. Do not try to be ‘too smart’ (diagram: users behind 8.8.8.8 all resolved to the same APs)
  37. Storage Devices
  38. RAM (diagram: APs → backend service nodes serving 5,000 reqs/second from RAM). Storage Devices
  39. Does not fit in RAM anymore (diagram: APs → big backend service nodes, ? reqs/second). Storage Devices
  40. Hard Drives: 200 IOPS. Storage Devices
  41. (diagram: APs → big backend service nodes, ? reqs/second). Storage Devices
  42. SSD: 10,000 IOPS. Storage Devices
  43. Fusion-io: 250,000 IOPS. Storage Devices
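Once the working set no longer fits in RAM, storage IOPS put a hard ceiling on request throughput. A rough sketch using the figures from the slides (HDD ≈ 200 IOPS, SSD ≈ 10,000, Fusion-io ≈ 250,000), assuming each request misses the page cache and costs a fixed number of random reads; this is an estimate for reasoning, not a benchmark:

```python
# Approximate random-read IOPS per device class, from the slides.
IOPS = {"hdd": 200, "ssd": 10_000, "fusion_io": 250_000}

def max_reqs_per_second(device: str, reads_per_request: int = 2) -> float:
    """Upper bound on request throughput when every request hits disk."""
    return IOPS[device] / reads_per_request
```

At two random reads per request, a spinning disk tops out around 100 reqs/second — nowhere near the 5,000 reqs/second one node served from RAM, which is why the next slides turn to the page cache.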
  44. Page Cache
  45. Example: backend service node with 32 GB RAM (2 GB used by the OS), 10M songs, 10 GB index; APs sending 5,000 reqs/second. Page Cache
  46. Increase in data (songs…): index grows to approx. 13 GB. Page Cache
  47. Page Cache
  48. posix_fadvise(2), mlock(2), orchestrate index deployment. Page Cache
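One way to apply the advice on the slide is to warm a freshly deployed index into the page cache before the node takes traffic, so first lookups do not stall on disk. A minimal sketch using `posix_fadvise(2)` via Python's `os.posix_fadvise` (POSIX-only); pinning pages with `mlock(2)` would need `ctypes` or a C helper and is omitted here. The function name and path are illustrative:

```python
import os

def warm_index(path: str) -> int:
    """Ask the kernel to prefetch the whole file; return its size in bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Length 0 means "to end of file"; WILLNEED starts readahead now,
        # so the index is already cached when the first requests arrive.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)
```

Orchestrating the deployment matters too: if the new 13 GB index and the old one are both touched at once on a 32 GB node, they compete for the same cache.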
  49. Retry (not much). Back Off. Fail Fast. Degrade Gracefully
  50. (diagram: users → APs → backend services). Retry (not much). Back Off. Fail Fast. Degrade Gracefully
  51. DDoS’d by your own clients: 5,000 conns/sec. Retry (not much). Back Off. Fail Fast. Degrade Gracefully
  52. Exponential Back Off on retry: 5,000 conns/sec. Retry (not much). Back Off. Fail Fast. Degrade Gracefully
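The back-off the slide recommends is what keeps a restarted AP from being flattened by 5,000 simultaneous reconnects. A minimal sketch of exponential back-off with full jitter; the parameter values are illustrative, not the ones Spotify's clients use:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    exp = min(cap, base * (2 ** attempt))
    # Full jitter: spread clients uniformly over [0, exp) so they do not
    # retry in lockstep and DDoS the server all over again.
    return random.uniform(0, exp)
```

The cap matters: without it, long outages push delays past anything useful; with it, every client settles into retrying at most once per minute.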
  53. Fail Fast (diagram: users → APs → backend services). Retry (not much). Back Off. Fail Fast. Degrade Gracefully
  54. Degrade Gracefully: acceptable behaviour for the user (diagram: users → APs → backend services). Retry (not much). Back Off. Fail Fast. Degrade Gracefully
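Fail fast and graceful degradation combine naturally: bound the time a request may wait on a struggling backend, then serve a degraded-but-acceptable answer instead of queueing. A sketch of that pattern; the names, timeout, and fallback are all illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def call_with_fallback(primary, fallback, timeout_s: float = 0.2):
    """Run `primary`; on timeout or error, fail fast and use `fallback`."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(primary)
        # Waiting longer than timeout_s only builds queues upstream, so
        # give up quickly and return the degraded response instead.
        return future.result(timeout=timeout_s)
    except Exception:
        return fallback()
    finally:
        pool.shutdown(wait=False)
```

For example, if a personalized-recommendations backend is slow, falling back to a cached or generic playlist is acceptable behaviour; a spinner that never resolves is not.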
  55. Test in real-world conditions
  56. Use your most valuable asset: start by sending X% of users to X% of your servers. Test in real-world conditions
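Sending X% of users to X% of your servers works best when the split is stable: the same user should land on the same side on every request. A sketch of one common way to do that, a hash of the user id mapped to buckets; the function and bucket scheme are illustrative:

```python
import hashlib

def in_canary(user_id: str, percent: float) -> bool:
    """True if this user falls into the canary X% (stable across requests)."""
    # A stable (non-seeded) hash keeps the decision consistent per user,
    # unlike Python's built-in hash(), which is randomized per process.
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return (h % 10_000) < percent * 100  # bucket in [0, 10000) ~ 0.01% steps
```

Ramping the rollout is then just raising `percent`: users already in the canary stay in it, and new ones join.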
  57. Automate
  58. When necessary. Automate
  59. Automate. http://xkcd.com/1205/
  60. Take a self-service approach everywhere
  61. Configuration Management, Databases and Storage, Provisioning of Servers, Service Discovery, Load Balancing, Monitoring, … Take a self-service approach everywhere
  62. Scaling Operations (the team)
  63. 2011. Scaling Operations
  64. 1. Scaling Operations: Start having teams carry operational responsibility for their own services, including on-call duties for the systems they own
  65. 2012. Scaling Operations
  66. 2. Scaling Operations: Infrastructure and Operations provide expert guidance and help on how to run the services teams own in production (and everywhere else)
  67. 2013. Scaling Operations
  68. 3. Scaling Operations: Infrastructure and Operations focus their effort on building and extending our platform to create an awesome place to run services
  69. devops. Scaling Operations
  70. Incident Management Process
  71. “Prevent an issue from happening twice”. Incident Management Process
  72. OPS-6000. Incident Management Process
  73. Incident (severity) → Postmortem meeting with stakeholders → Remediations (urgency). Incident Management Process
  74. Moltes gràcies! (Thank you very much!) David Poblador i Garcia, @davidpoblador. BcnDevCon13, November 12, 2013
