
Data at Spotify

Data infrastructure at Spotify


  1. Data at Spotify. Danielle Jabin, Data Engineer, A/B Testing. June 12, 2014
  2. I’m Danielle Jabin • Data Engineer in the Stockholm office • A/B testing infrastructure • California born & raised • If I can survive a Swedish winter, so can you! • Studied Computer Science, Statistics, and Real Estate through the M&T program at the University of Pennsylvania
  3. Over 40 million active users (as of June 9, 2014)
  4. Access to more than 20 million songs (as of June 9, 2014)
  5. Big Data • 40 million monthly active users • 20+ million tracks • 1.5 TB of compressed data from users per day • 64 TB of data generated in Hadoop each day (including a replication factor of 3). As of June 9, 2014
  6. So how much data is that?
  7. Let’s compare: 64 TB is roughly • 293,203,072 books (200 pages, or 240,000 characters, each) • 16,777,216 MP3 files (at a 4 MB average file size) • 22,369,600 images (at a 3 MB average file size)
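The slide’s comparisons follow from simple division, assuming binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and one byte per character; a quick back-of-envelope check:

```python
# Back-of-envelope check of the 64 TB comparisons, assuming binary
# units (1 TB = 2**40 bytes, 1 MB = 2**20 bytes) and 1 byte per char.
TOTAL_BYTES = 64 * 2**40

mp3s = TOTAL_BYTES // (4 * 2**20)    # 4 MB average MP3
images = TOTAL_BYTES // (3 * 2**20)  # 3 MB average image
books = TOTAL_BYTES // 240_000       # 240,000 characters per book

print(mp3s)    # matches the slide's 16,777,216 exactly
print(images)  # within rounding of the slide's 22,369,600
print(books)   # within rounding of the slide's 293,203,072
```

The MP3 figure is exact because 4 MB divides 64 TB evenly in binary units; the book and image figures land within rounding of the slide’s numbers.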
  8. That’s a lot of selfies
  9. How do we use this data?
  10. Use Cases • Reporting • Business Analytics • Operational Analytics • Product Features
  11. Reporting • Reporting to labels, licensors, partners, and advertisers • We support our partners
  12. Business Analytics • Analyzing growth, user behavior, sign-up funnels, etc. • Company KPIs • NPS analysis
  13. Operational Metrics • Root cause analysis • Latency analysis • Better capacity planning (servers, people, bandwidth)
  14. Product Features • Discover and Radio • Top lists • Personalized recommendations • A/B testing
  15. How do we collect this data?
  16. The three pillars of our data infrastructure: Kafka (collection), Hadoop (processing), and databases (analytics/visualization)
  17. This is Dave. Data Engineer at Spotify by day…
  18. …chiptune DJ Demoscene Time Machine by night.
  19. Let’s listen to Dave’s song
  20. Kafka • High-volume pub-sub system • “Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages.”
  21. Kafka • A robust and scalable solution for log collection • Fast data transfer • Low CPU overhead • Built-in partitioning, replication, and fault tolerance • Consumers can pull data at different rates • Able to handle extremely high volumes
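The pub-sub model the slides describe — producers appending to topics, and each consumer pulling from those topics at its own pace — can be sketched in a few lines. This is a toy in-memory illustration, not the real Kafka API; the topic name and message shape are made up:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory stand-in for a Kafka broker: an append-only
    log per topic. Not the real Kafka API, just the pub-sub idea."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only message log

    def publish(self, topic, message):    # the producer side
        self.topics[topic].append(message)

class Consumer:
    """Each consumer tracks its own offset per topic, which is why
    different consumers can pull the same topic at different rates."""
    def __init__(self, broker):
        self.broker = broker
        self.offsets = defaultdict(int)

    def poll(self, topic, max_messages=10):
        log = self.broker.topics[topic]
        start = self.offsets[topic]
        batch = log[start:start + max_messages]
        self.offsets[topic] += len(batch)
        return batch

broker = Broker()
for n in range(5):
    broker.publish("playback-logs", {"track": n})  # hypothetical topic

fast, slow = Consumer(broker), Consumer(broker)
print(len(fast.poll("playback-logs", max_messages=5)))  # 5
print(len(slow.poll("playback-logs", max_messages=2)))  # 2: same topic, own offset
```

The per-consumer offset is the key design point: the broker never waits for its slowest reader, which is what makes the "consumers can pull data at different rates" bullet work.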
  22. 22. Other people listened too!
  23. Hadoop • Processes and stores massive amounts of unstructured data across a distributed cluster • Grown from a single 37-node cluster to 690 nodes today • 28 PB of storage • The largest Hadoop cluster in Europe
  24. Hadoop • Entering the land of optimizations • Data retention policy • MapReduce languages: Python with Hadoop Streaming, Java, Hive, Pig, Scala • Moving to JVM-based languages: Crunch, for speed and scalability • Sprunch: a Scala wrapper for Crunch, open-sourced by Spotify • Luigi: Spotify’s open-sourced scheduler, written in Python • A simple and easy way to chain jobs
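The Hadoop Streaming approach mentioned above means plain Python processes reading lines on stdin and emitting tab-separated key/value pairs on stdout, with Hadoop handling the shuffle/sort between map and reduce. A minimal sketch of that model, simulated in-process — the log format (`user_id<TAB>track_id`) is a made-up example, not Spotify’s schema:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: one (track_id, 1) pair per play event."""
    for line in lines:
        user_id, track_id = line.rstrip("\n").split("\t")
        yield (track_id, 1)

def reducer(pairs):
    """Reduce phase: Hadoop delivers pairs sorted by key, so a
    groupby over the sorted stream sums each key's run of counts."""
    for track_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (track_id, sum(count for _, count in group))

# In a real job, `logs` would arrive on stdin and results leave on stdout;
# here the shuffle/sort that Hadoop performs is simulated by sorted().
logs = ["u1\ttrackA", "u2\ttrackA", "u1\ttrackB"]
print(dict(reducer(mapper(logs))))  # {'trackA': 2, 'trackB': 1}
```

Luigi’s role is one level up from this: chaining many such jobs together with declared dependencies, so a failed step can be retried without rerunning the whole pipeline.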
  25. What if we want to know more?
  26. Databases • Aggregates from Hadoop are put into PostgreSQL or Cassandra • Sqoop • Core data can be used and manipulated for various needs • Ad hoc queries • Dashboards
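The pattern on the databases slide — aggregates computed in Hadoop loaded into a relational store so analysts and dashboards can run ad hoc queries — can be sketched with the standard library. SQLite stands in for PostgreSQL here, and the `daily_plays` table and its columns are hypothetical, not Spotify’s actual schema:

```python
import sqlite3

# SQLite (stdlib) standing in for PostgreSQL; table/columns are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE daily_plays (
    day TEXT, country TEXT, plays INTEGER)""")

# In production this load step would be Sqoop (or a bulk COPY),
# moving Hadoop output into the database.
aggregates = [("2014-06-09", "SE", 120), ("2014-06-09", "US", 340),
              ("2014-06-10", "SE", 150), ("2014-06-10", "US", 310)]
conn.executemany("INSERT INTO daily_plays VALUES (?, ?, ?)", aggregates)

# An ad hoc query a dashboard might run: total plays per country.
rows = conn.execute("""SELECT country, SUM(plays) FROM daily_plays
                       GROUP BY country ORDER BY country""").fetchall()
print(rows)  # [('SE', 270), ('US', 650)]
```

The point of the split is that Hadoop does the heavy aggregation once, while the database answers many small interactive queries over the pre-aggregated rows.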
  29. Questions?
  30. A/B testing questions? Find me!
  31. Thank you!
