Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Scaling the Data Infrastructure @Spotify

1,399 views

Published on

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2k7JYDP.

Mārtiņš Kalvāns and Matti Pehrs overview the Data Infrastructure at Spotify, diving into some of the data infrastructure components, such us Event Delivery, Datamon and Styx. Filmed at qconsf.com.

Mārtiņš Kalvāns is a Big Data Engineer at Spotify. Since 2013, he has been working in the Spotify Data Infrastructure Tribe. Matti Pehrs is a Software Engineer at Spotify. He works as a back-end developer in the Spotify Data Infrastructure Tribe. He has worked in the IT industry for more than 20 years and with Java almost from its inception.

Published in: Technology

Scaling the Data Infrastructure @Spotify

  1. 1. Scaling Data Infrastructure @ Spotify matti@spotify.com kalvans@spotify.com
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ spotify-data-infrastructure
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Mārtiņš Kalvāns kalvans@spotify.com Matti Pehrs matti@spotify.com
  5. 5. Agenda 1. Data at Spotify 2. Summer of 2015 3. Challenges & Victory ○ Datamon ○ Styx ○ GABO
  6. 6. Spotify big-data context ● Over 100 million monthly active users ● Over 30 million song ● Over 2 billion playlists ● Active in 60 markets
  7. 7. Data is at the heart of Spotify In 2007 - Monthly Royalty Report In 2016 - Monthly Royalty Report - Weekly Billboard - Daily reports to partners - ... - AB-Testing - Discover weekly - Daily Mix - ...
  8. 8. Our growth in Data Users +50 TB/day +100M Users Developers +60 TB/day +10k M/R jobs
  9. 9. Hadoop Autonomy & Dependencies Team A Team B Team C
  10. 10. Autonomy & Dependencies
  11. 11. Autonomy & Dependencies
  12. 12. Autonomy & Dependencies
  13. 13. Summer of Incidents
  14. 14. ● A strain of incidents Summer of Incidents
  15. 15. ● A strain of incidents ● War-room Summer of Incidents
  16. 16. ● A strain of incidents ● War-room ● Hadoop on it’s knees Summer of Incidents
  17. 17. ● A strain of incidents ● War-room ● Hadoop on it’s knees ● Event Delivery Catch up Summer of Incidents
  18. 18. ● A strain of incidents ● War-room ● Hadoop on it’s knees ● Event Delivery Catch up ● Reprocessing of data Summer of Incidents
  19. 19. ● A strain of incidents ● War-room ● Hadoop on it’s knees ● Event Delivery Catch up ● Reprocessing of data ● Hard to debug data issues Summer of Incidents
  20. 20. Challenges and the path to victory...
  21. 21. 1. Early Warning Datamon - Data monitoring Challenges and the path to victory...
  22. 22. 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control Challenges and the path to victory...
  23. 23. 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery Challenges and the path to victory...
  24. 24. 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery Challenges and the path to victory...
  25. 25. Early Warning - Datamon
  26. 26. ● Unified view ○ Alignment between teams ● Ownership ○ Clear ownership of data ● SLA ○ Alert on late data Early Warning - Datamon
  27. 27. ● Define terminology ● Provide metadata language ● Implement a Datamon service Early Warning - Datamon
  28. 28. 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery Challenges and the path to victory...
  29. 29. - Execution control - Self service for data users - Execution information - Expose debug information - Execution isolation - Docker for data jobs Debuggability & Control - Styx The river Styx
  30. 30. ● Execution control ○ Centralized execution API Debuggability & Control - Styx
  31. 31. Debuggability & Control - Styx ● Execution control ○ Centralized execution API ○ Backfilling and reprocessing
  32. 32. ● Execution control ● Execution information ○ Timeline Debuggability & Control - Styx
  33. 33. Debuggability & Control - Styx ● Execution control ● Execution information ○ Timeline ○ Google Cloud Logging
  34. 34. Debuggability & Control - Styx ● Execution control ● Execution information ● Execution isolation ○ Docker
  35. 35. 1. Early Warning Datamon - Data monitoring 2. Debuggability & Control Styx - Scheduling and control 3. Automate Capacity GABO - Event Delivery Challenges and the path to victory...
  36. 36. ● Complex and manual config Automate Capacity - GABO/Event Delivery
  37. 37. ● Complex and manual config ● Pubsub & Dataflow streaming Automate Capacity - GABO/Event Delivery
  38. 38. ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale Automate Capacity - GABO/Event Delivery
  39. 39. ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale ● Dataflow streaming Automate Capacity - GABO/Event Delivery
  40. 40. ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale ● Dataflow streaming :-( ● 2 micro services + 1 Map/Reduce job Automate Capacity - GABO/Event Delivery
  41. 41. ● Complex and manual config ● Pubsub & Dataflow streaming ● Pubsubs at scale ● Dataflow streaming :-( ● 2 micro services + 1 Map/Reduce job ● Autoscaling & The Stuffer Automate Capacity - GABO/Event Delivery
  42. 42. ● Handles at least 10x our load ● Darkloading ● Autoscale everything ● Self service GABO - WIP
  43. 43. ● Make sure you have the right tools to deal with data incidents ○ Make sure you have time to implement the tools you need ● Remember that your capacity model can fail at larger scale ○ Keep track of your scale and Automate, automate, automate... Summary
  44. 44. Thank you! kalvans@spotify.com matti@spotify.com Want to join the band? http://spoti.fi/jobs
  45. 45. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ spotify-data-infrastructure

×