Big Data At Spotify

Published in: Technology

  1. 1. Adam Kawa Data Engineer @ Spotify (Big) Data At Spotify
  2. 2. At Spotify, important questions are being asked all the time!
  3. 3. Some of these questions are “relatively easy” to answer…
  4. 4. 1. How many times has Coldplay been streamed this month? 2. Who was the most popular artist in NYC last week? 3. How many times was “Get Lucky” streamed during its first 24h? Labels, Licensors, Partners, Advertisers
  5. 5. ■ Very granular reports are required - Divided by gender, age, location and more ■ We have been delivering various reports from day 1 - Too much data for traditional solutions Reporting
  6. 6. QUIZ!
  7. 7. Question Who was the most frequently streamed female artist in 2013? Answer? A) Katy Perry B) Lady Gaga C) Madonna D) Rihanna Popular Artists
  8. 8. Question Who was the most frequently streamed female artist in 2013? Popular Artists
  9. 9. ■ The Most Popular Male Artist - Macklemore ■ The Most Popular Band - Imagine Dragons ■ The Most Popular Track - “Can't Hold Us” Popular Artists In 2013
  10. 10. ■ Users love local artists! - Berlin: Sido - London: Coldplay - Singapore: Vanessa-Mae - Stockholm: Avicii Popular Artists In 2013
  11. 11. ■ Users love local artists! - NYC listens to Jay-Z 88% more than the rest of the world - Stockholm listens to ABBA 110% more than the rest of the world Popular Artists In 2013
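A figure such as "88% more than the rest of the world" can be read as a comparison of an artist's share of streams in one city against the share everywhere else. The sketch below shows that arithmetic. The function name and the stream counts are made up for illustration, and the slides do not say how the published figures were actually computed.

```python
def relative_uplift(local_artist_streams, local_total_streams,
                    rest_artist_streams, rest_total_streams):
    """Return how much more (as a fraction) an artist is streamed locally
    than elsewhere, comparing the artist's share of all streams."""
    local_share = local_artist_streams / local_total_streams
    rest_share = rest_artist_streams / rest_total_streams
    return local_share / rest_share - 1.0


# Made-up counts, for illustration only (not Spotify data).
uplift = relative_uplift(188, 10_000, 1_000, 100_000)
print(f"{uplift:.0%} more than the rest of the world")
```

Comparing shares rather than raw counts keeps a large city from looking like a fan of every artist simply because it streams more in total.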
  12. 12. Question What was the most “viral” track in 2013? Popular Tracks
  13. 13. Question What was the most “viral” track in 2013? Answer “Get Lucky” by Daft Punk feat. Pharrell Williams Popular Tracks
  14. 14. Artist Analytics – Daft Punk “Get Lucky” was released on April 19th, 2013.
  15. 15. Artist Analytics – Daft Punk Around 5x more streams on the day after the release of “Get Lucky” than on the day before
  16. 16. Artist Analytics – Daft Punk What happened that day?
  17. 17. Artist Analytics – Daft Punk “Random Access Memories” was released on May 17th, 2013.
  18. 18. ■ 09.08.1963 – 11.02.2012 Artist Analytics – Whitney Houston
  19. 19. ■ One of the most popular Polish rock bands ever Artist Analytics – Budka Suflera What happened?
  20. 20. ■ One of the most popular Polish rock bands ever Artist Analytics – Budka Suflera Information about the retirement was announced...
  21. 21. 1. What was the number of daily active users (DAU) yesterday? 2. How many users have signed up this week? 3. In which country should Spotify launch next? Management And Investors
  22. 22. ■ Analyzing growth - Number of active users, streamed songs, sign-ups and more - Where to launch Spotify next ■ Company KPIs Business Analytics
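The DAU question above comes down to counting distinct users per calendar day. Here is a minimal in-memory sketch, assuming activity logs have already been reduced to hypothetical `(user_id, unix_timestamp)` pairs. At the data volumes described in the talk this would run as a Hadoop job, but the logic is the same.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_active_users(events):
    """events: iterable of (user_id, unix_timestamp) pairs.

    Returns {date: number of distinct users active on that UTC date}.
    """
    users_per_day = defaultdict(set)
    for user_id, ts in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        users_per_day[day].add(user_id)  # a set ignores repeat activity
    return {day: len(users) for day, users in users_per_day.items()}


# Hypothetical events: two users on 2014-01-01, one user on 2014-01-02.
sample_events = [
    ("a", 1388534400),
    ("a", 1388538000),
    ("b", 1388534460),
    ("a", 1388620800),
]
print(daily_active_users(sample_events))
```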
  23. 23. However, some of the questions are really tricky to answer!
  24. 24. 1. What song to stream to Jay-Z when he wakes up? 2. Is Adam Kawa bored with Timbuktu today? 3. How to encourage Jeff to go for the Premium Account? Data Scientists, Researchers
  25. 25. ■ Recommendations - Powering features like Discover, Radio - “Perfect music for every moment ♪♫ ♬ ♯” ■ Classification of songs and playlists by genre or mood ■ Top lists per country Product Features
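As an illustration of the "top lists per country" feature, the sketch below counts streams per track within each country and keeps the most frequent ones. The `(country, track)` record shape and the sample data are assumptions. The talk does not describe how the production job is implemented.

```python
from collections import Counter, defaultdict


def top_tracks_per_country(streams, n=3):
    """streams: iterable of (country, track) pairs, one per stream.

    Returns {country: list of the n most streamed tracks, most popular first}.
    """
    counts = defaultdict(Counter)
    for country, track in streams:
        counts[country][track] += 1
    return {
        country: [track for track, _ in counter.most_common(n)]
        for country, counter in counts.items()
    }


# Made-up streams, for illustration only.
sample_streams = [
    ("SE", "Levels"), ("SE", "Levels"), ("SE", "Get Lucky"),
    ("US", "Can't Hold Us"), ("US", "Get Lucky"), ("US", "Can't Hold Us"),
]
print(top_tracks_per_country(sample_streams, n=2))
```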
  26. 26. ■ Overall, in 2013 - Best Hangover Cure: “The Lazy Song” - Best Song To Get Over An Ex: “Someone Like You” - Best Party Starter: “Levels” - Best Driving Song: “Bohemian Rhapsody” - Best Work Out Song: “Eye of the Tiger” Perfect Music For Every Moment
  27. 27. 1. Is this button nicer than the previous one? 2. How to personalize the messages displayed to users? 3. How should the results of search be displayed? Designers, Feature Owners
  28. 28. ■ A/B Test - Come up with promising “look-and-feels” and do A/B tests ■ Explicit feedback from users - But users usually do not like to rate things - But users usually do not like to customize things Designers, Feature Owners
  29. 29. ■ Sign-up Button On Facebook - Sign-up button on the landing page A/B Test Use Case
  30. 30. Sign-up Button On Facebook Layouts of sign-up button A – Control Group (50%) B – Test Group (50%)
  31. 31. Sign-up Button On Facebook Layouts of sign-up button A – Control Group (50%) B – Test Group (50%) Which one performed better?
  32. 32. Sign-up Button On Facebook Layouts of sign-up button A – Control Group (50%) B – Test Group (50%) Many more sign-ups!
  33. 33. ■ “Only 10% are likely to cause a true uplift” (Google, after 12K tests) - Be able to iterate fast! ■ “80% of the time, we are wrong about what consumers want” - The truth is in data! A/B Tests
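Whether one button really got "many more sign-ups" than the other is usually checked with a significance test. A common choice for a 50/50 split like the one above is a two-proportion z-test, sketched here with made-up numbers. The slides do not state which test Spotify used, so treat this as one reasonable option, not the method from the talk.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and test (B) groups.

    Returns (z, p_value), where p_value is two-sided under the
    normal approximation with a pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 100 of 1,000 visitors sign up with A, 150 of 1,000 with B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The normal approximation is reasonable only when each group has a fair number of both conversions and non-conversions. For very small samples an exact test is safer.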
  34. 34. In the past, we guesstimated a bit (common sense, intuition, gut feeling, observations, inspirations)
  35. 35. Isn't it inspired by the Windows Start menu button? ;) “KöP!” means “BUY!”
  36. 36. Today, we make data-driven decisions
  37. 37. To make data-driven decisions, data and data infrastructure are required (among other things)
  38. 38. ■ Over 6 million paying subscribers ■ Over 24 million MAU (monthly active users) ■ 1.5 billion playlists created so far ■ Available in 55 countries ■ Over 20 million songs ■ 4.5 billion hours streamed in 2013 Users At Spotify
  39. 39. ■ Data generated by users and for users! - 1.5 TB of compressed data from users per day - 64 TB of data generated in Hadoop each day (triplicated) (Big) Data At Spotify
  40. 40. ■ Apache Hadoop YARN ■ Many other systems including: - Kafka, Luigi, Cassandra, PostgreSQL in production - Giraph, Tez, Spark under evaluation Data Infrastructure At Spotify
  41. 41. ■ Probably the largest commercial Hadoop cluster in Europe! - 694 heterogeneous nodes - 12.63 PB of data used - ~7,000 jobs each day Apache Hadoop
  42. 42. ■ Used for “off-line” processing - When Hadoop is down, Spotify still plays music! - When Hadoop is down, Data Analysts play FIFA, table tennis or … run queries locally ■ We mostly analyze logs from users' activity Apache Hadoop
  43. 43. ■ Get insights to offer a better product - “More data usually beats better algorithms” ■ Get insights to make better decisions - Avoid “guesstimates” ■ Gain a competitive advantage - More companies have started offering music streaming What Does Hadoop Allow Us To Do?
  44. 44. ■ We use multiple tools and languages - Hive is very popular among our data analysts - Crunch for core pipeline jobs - Some legacy code in Hadoop Streaming with Python - A number of Pig and Java MapReduce jobs - Avro as storage format (but we are starting to consider columnar formats) How Do We Use Hadoop?
  45. 45. ■ Primarily used to transport logs - from multiple servers - to a central location for storage and analysis ■ A better fit for us than Flume - We got higher throughput with Kafka ■ We added more features to Kafka - End-to-end delivery - Encryption Apache Kafka
  46. 46. ■ A scalable and distributed key-value store ■ Provides fast read-write access for many small pieces of data - We use it for playlists, user profiles, popularity count ■ Was a better fit for us than HBase - The NameNode (NN) was a single point of failure (SPOF) at that time Apache Cassandra
  47. 47. ■ Allows us to build complex pipelines of batch jobs ■ Handles dependency resolution, workflow management, visualization and more ■ Our alternative to Oozie and Azkaban - Spotify, Foursquare, Bitly and more contribute Luigi
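"Dependency resolution" here means running every batch job only after the jobs it depends on have finished. The sketch below illustrates the idea with Python's standard-library `graphlib` (Python 3.9+). It is not Luigi's API, and the job names are hypothetical. In Luigi itself, tasks declare their upstream tasks in a `requires()` method and the scheduler works out the order.

```python
from graphlib import TopologicalSorter

# Hypothetical batch jobs mapped to the jobs they depend on.
dependencies = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "top_lists": {"clean_logs"},
    "label_reports": {"clean_logs"},
    "dashboard": {"top_lists", "label_reports"},
}


def execution_order(deps):
    """Return one valid order in which every job runs after its dependencies.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Several valid orders may exist (for example, `top_lists` and `label_reports` can run in either order), which is also what lets a scheduler run independent jobs in parallel.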
  48. 48. We still use them! ■ Powering features that require transaction support, integrity constraints - e.g. ordering Spotify gift cards ■ Semi-aggregated data for dashboards ■ Semi-aggregated data for quick analysis RDBMS
  49. 49. March 2013 Tricky questions were asked!
  50. 50. 1. How many servers do you need to buy to survive one year? 2. If we agree, what will you do to use them efficiently? 3. If we agree, do not come back to us this year, OK? Finance Department
  51. 51. ■ Partially responsible for answering these questions! ■ One of the Data Engineers who - takes care of a 694-node Hadoop-YARN cluster - implements and troubleshoots users' jobs - works in a team with Josh, Marcin, Rafal, Fabian and Wouter ■ Hadoop instructor for almost 2 years ■ Co-organizer of Warsaw and Stockholm HUGs ■ Blogger at HakunaMapData.com Adam Kawa
  52. 52. ■ Latency analysis - msec to wait for music after pressing the “Play” button ■ Capacity planning - servers, bandwidth, data-center space and more Operational Metrics
  53. 53. ■ Hadoop provides tons of metrics, logs and files ■ They can be analyzed by … Hadoop Operational Metrics For Hadoop
  54. 54. ■ This knowledge can be useful to learn how to - measure how fast our HDFS is growing - calculate the empirical retention policy for datasets - optimize the scheduler - benchmark the cluster - and more What Hadoop Can Tell About Itself
  55. 55. Let's see a couple of examples
  56. 56. 5,000 TB of data created before October 1, 2013
  57. 57. Could we archive data accessed before this day?
  58. 58. ■ You can analyze the FsImage file to learn how fast you grow ■ You can even correlate this data with - number of DAU - total size of logs generated by users - activity of users e.g. hours streamed - number of queries / day run by analysts Advanced HDFS Capacity Planning
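To make the FsImage idea concrete, the sketch below assumes the image has already been dumped to a flat file listing and reduced to hypothetical `(path, modification_date, size_in_bytes)` records. It then accumulates file sizes by month to show how the stored volume grows. The record layout is an assumption for illustration, and the modification date stands in for a creation date.

```python
from collections import defaultdict
from itertools import accumulate


def cumulative_size_by_month(files):
    """files: iterable of (path, 'YYYY-MM-DD', size_in_bytes) records.

    Returns a list of (month, cumulative_bytes) sorted by month, i.e. the
    total size of all files dated in that month or earlier.
    """
    per_month = defaultdict(int)
    for _path, date_str, size in files:
        per_month[date_str[:7]] += size  # 'YYYY-MM' prefix identifies the month
    months = sorted(per_month)
    totals = accumulate(per_month[month] for month in months)
    return list(zip(months, totals))


# Made-up listing, for illustration only.
sample_files = [
    ("/logs/a", "2013-08-03", 10),
    ("/logs/b", "2013-09-10", 5),
    ("/logs/c", "2013-08-20", 7),
]
print(cumulative_size_by_month(sample_files))
```

The same per-month series can then be placed next to DAU, log volume or hours streamed to look for the correlations the slide mentions.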
  59. 59. ■ You can also use the “trend” feature in Ganglia Simplified HDFS Capacity Planning If we do NOTHING, we will fill the cluster in September...
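A forecast like "we will fill the cluster in September" can be approximated by fitting a straight line to past disk usage and extrapolating to the cluster's capacity. The sketch below does this with an ordinary least-squares fit. It is not Ganglia's trend implementation, and the sample numbers are made up.

```python
def days_until_full(samples, capacity_tb):
    """samples: list of (day_index, used_tb) with at least two distinct days.

    Fits a least-squares line through the samples and returns the number of
    days after the last sample at which usage reaches capacity_tb, or None
    if usage is not growing.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx  # growth in TB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_tb - intercept) / slope
    return day_full - samples[-1][0]


# Made-up usage history: 10 TB of growth per day, 200 TB of capacity.
history = [(0, 100.0), (1, 110.0), (2, 120.0)]
print(days_until_full(history, capacity_tb=200.0))
```

A linear fit ignores seasonality and accelerating growth, so in practice the estimate is a lower bound on urgency, not a precise date.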
  60. 60. What will we do to survive longer than September?
  61. 61. ■ We introduced an automatic retention policy - An owner of the dataset specifies a retention period - If needed, a retention period can be calculated empirically
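One way a retention period could be "calculated empirically" is to look at how old the data was each time somebody read it, then keep data long enough to cover almost all observed reads. The sketch below uses a nearest-rank percentile for that. The approach, the function name and the 99% default are assumptions, since the slides do not describe the actual calculation.

```python
import math


def empirical_retention_days(access_ages_days, coverage_percent=99):
    """access_ages_days: age in days of the data at each observed read.

    Returns the smallest observed age that covers at least
    coverage_percent of the reads (nearest-rank percentile).
    """
    if not access_ages_days:
        raise ValueError("no accesses observed")
    ages = sorted(access_ages_days)
    rank = max(1, math.ceil(coverage_percent * len(ages) / 100))
    return ages[rank - 1]


# Made-up read ages: most reads touch recent data, one touches 30-day-old data.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
print(empirical_retention_days(ages, coverage_percent=90))
```

Choosing a coverage below 100% accepts that rare reads of very old data will miss, in exchange for not letting a single outlier dictate the retention period.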
  62. 62. We continuously improve our MapReduce jobs
  63. 63. ■ We schedule some jobs each hour, day or week e.g.: - Top lists for each country - Reports for the labels, partners, advertisers Idea ■ Use job statistics from the previous executions of a job - to optimize the current execution of this job - to learn about the history of performance of a given job Recurring MapReduce Jobs Even perfect manual settings may eventually become outdated when an input dataset grows!
  64. 64. ■ A tiny PoC ;) ■ The average task time set to 10 minutes (inspired by LinkedIn) ■ It should help in extreme cases: very short and very long living tasks MapReduce Jobs Autotuning

      type  | # map | # reduce | avg map time | avg reduce time     | job execution time
      old_1 | 4826  | 25       | 46sec        | 1hrs, 52mins, 14sec | 2hrs, 52mins, 16sec
      new_1 | 391   | 294      | 4mins, 46sec | 8mins, 24sec        | 23mins, 12sec

      type  | # map | # reduce | avg map time | avg reduce time     | job execution time
      old_2 | 4936  | 800      | 7mins, 30sec | 22mins, 18sec       | 5hrs, 20mins, 1sec
      new_2 | 4936  | 1893     | 8mins, 52sec | 7mins, 35sec        | 1hrs, 18mins, 29sec
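The autotuning idea can be sketched as follows: take the number of tasks and the average task time from the previous run, estimate the total work, and pick a task count that would bring the average task time to the 10-minute target. This assumes the total work stays constant and spreads evenly across tasks. The slides do not show the PoC's actual logic, so the function below is only an illustration.

```python
import math

TARGET_TASK_SECONDS = 10 * 60  # average task time targeted on the slide


def suggest_task_count(previous_tasks, previous_avg_seconds,
                       target_seconds=TARGET_TASK_SECONDS):
    """Suggest a task count for the next run of a recurring job.

    Assumes the total work (tasks * average time) is unchanged and can be
    split evenly, which real jobs only approximate.
    """
    total_work_seconds = previous_tasks * previous_avg_seconds
    return max(1, math.ceil(total_work_seconds / target_seconds))


# Statistics of the old_1 run above: 4826 maps averaging 46 sec,
# 25 reduces averaging 1 hr 52 min 14 sec (6734 sec).
print(suggest_task_count(4826, 46))
print(suggest_task_count(25, 6734))
```

In Hadoop the number of reduce tasks can be set directly, while the number of map tasks is driven by input splits, so a suggestion for maps would have to be turned into a split-size setting.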
  65. 65. ■ We make data-driven decisions to improve our product ■ Scalable and open-source projects allow us to do that ■ Hadoop, Cassandra, Kafka need love and care - And passionate people who give it to them ■ Hadoop is like a salutary virus - It quickly spreads across people and projects Summary
  66. 66. Questions?
  67. 67. BONUS!
  68. 68. One Question: What could happen after some time of simultaneous development of MapReduce jobs, maintenance of a large cluster, and listening to perfect music for every moment?
  69. 69. A Possible Answer: You may discover Hadoop in the lyrics of many popular songs!
  70. 70. Check out spotify.com/jobs or @Spotifyjobs for more information kawaa@spotify.com Check out my blog: HakunaMapData.com Want to join the band?
  71. 71. Thank you!
