
Music Personalization : Real time Platforms.

On Spotify, Storm and Platforms. Presented at Crunch Conf.

Published in: Data & Analytics

  1. 1. Music Personalization: Realtime Platforms ♫ + ML + You = ❤ CrunchConf, Budapest, October 30, 2015
  2. 2. Esh Kumar Machine Learning & Data Products @ Spotify NYC @eshvk
  3. 3. Who am I? • UT Austin Machine Learning • Building Large Scale Recommendation Systems @ Mozilla, StumbleUpon & Spotify
  4. 4. 75 M+ Active Users
  5. 5. 58 Markets
  6. 6. 1 TB of Logs/Day
  7. 7. 1200+ Node Hadoop Cluster
  8. 8. Products •Discover … to find new albums •Discover Weekly … A weekly Playlist •Editorial Playlist Recommendations •Radio
  9. 9. Music Personalization •Understanding People ➡ User Experience, Cultural Variations •Understanding Content ➡ Genres, Cultural knowledge •Models ➡ Collaborative Filtering, Content Based ML Content User
  10. 10. Music Personalization •Understanding People ➡ User Experience, Cultural Variations •Understanding Content ➡ Genres, Cultural knowledge •Models ➡ Collaborative Filtering, Content Based • News, Blogs, NLP
  11. 11. Music Personalization •Understanding People ➡ User Experience, Cultural Variations •Understanding Content ➡ Genres, Cultural knowledge •Models ➡ Collaborative Filtering, Content Based • News, Blogs, NLP • Manually tag attributes • Curation
  12. 12. Music Personalization •Understanding People ➡ User Experience, Cultural Variations •Understanding Content ➡ Genres, Cultural knowledge •Models ➡ Collaborative Filtering, Content Based • News, Blogs, NLP • Manually tag attributes • Curation • CF
  13. 13. 30 Million Songs… What To Play? 75 Million Users … 1 Person Every 3 Secs…
  14. 14. Recommendation Systems • Predict user response to options. • Rich field: Matrix completion, ranking, text models, latent factor models. • Several conferences annually: RecSys, NIPS, ICML, etc. • Industry researchers include NFLX, GOOG, MS and more…
  15. 15. Collaborative Filtering Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T! Model you based on songs you played… Predict your future based on similar users… Millions of users and billions of streams… …. so there is someone like you out there
  16. 16. Collaborative Filtering The Netflix Prize. A million dollars for beating NFLX’s best algorithms by ~ 10%.
  17. 17. Similarity Our problem is to figure out how similar two items are. Mathematically, this means modeling a function Similarity(x,y) for all users and items, if possible.
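The conversation on slide 15 and the Similarity(x, y) idea on slide 17 can be sketched with a toy example. This is only an illustration: the Jaccard overlap used as the similarity function, the function names `jaccard` and `recommend`, and the listener labels are assumptions for the example, not something the slides specify.

```python
def jaccard(a, b):
    """Overlap between two sets of liked tracks: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes):
    """Suggest tracks liked by the most similar other user that `user` has not played."""
    others = [u for u in likes if u != user]
    if not others:
        return []
    # Pick the single most similar user (hypothetical choice; real systems blend many).
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[nearest] - likes[user])


# The two listeners from the slide: one likes P, Q, R, S and the other Q, R, S, T.
likes = {
    "listener_1": {"P", "Q", "R", "S"},
    "listener_2": {"Q", "R", "S", "T"},
}

print(recommend("listener_1", likes))  # ['T']
print(recommend("listener_2", likes))  # ['P']
```

With millions of users, comparing every pair this way becomes too expensive, which is one motivation for the vector models on the following slides.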
  18. 18. How do we do this? Matrix Completion. A matrix expresses a system. We model the data in the form of a matrix. For example, play counts for all songs (columns) and all users (rows) could be:
      $$\begin{pmatrix} s_{1,1} & s_{1,2} & 14 & \cdots & s_{1,n} \\ s_{2,1} & s_{2,2} & 2 & \cdots & s_{2,n} \\ \vdots & \vdots & \vdots & & \vdots \\ s_{m,1} & s_{m,2} & 1 & \cdots & s_{m,n} \end{pmatrix} \approx \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{pmatrix} \begin{pmatrix} t_1 & t_2 & \cdots & t_n \end{pmatrix}$$
      The highlighted column is "Call Me Maybe" and the highlighted row is Esh: Esh listened to "Call Me Maybe" once…
  19. 19. Matrix Completion is well studied … Start with random vectors around the origin. Run alternating least squares or gradient descent or stochastic gradient descent… All this is Hadoopable™. (The slide repeats the play-count matrix and its approximation by user vectors u1 … um and track vectors t1 … tn.)
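The recipe on slide 19 (start with random vectors around the origin, then run stochastic gradient descent) can be sketched in a few lines of plain Python. This is a minimal sketch, not the production job: the toy play counts, the latent dimension `k`, the learning rate `lr`, the regularization weight `reg`, and the names `factorize` and `predict` are all assumptions made for the example.

```python
import random


def factorize(plays, n_users, n_tracks, k=2, lr=0.01, reg=0.01, epochs=3000, seed=0):
    """Learn user and track vectors whose dot products approximate observed play counts.

    plays maps (user_index, track_index) -> play count for observed entries only.
    """
    rng = random.Random(seed)
    # Start with small random vectors around the origin.
    users = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    tracks = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_tracks)]
    observed = list(plays.items())
    for _ in range(epochs):
        rng.shuffle(observed)
        for (u, t), count in observed:
            pred = sum(a * b for a, b in zip(users[u], tracks[t]))
            err = count - pred
            for f in range(k):
                uf, tf = users[u][f], tracks[t][f]
                # Gradient step on squared error with L2 regularization.
                users[u][f] += lr * (err * tf - reg * uf)
                tracks[t][f] += lr * (err * uf - reg * tf)
    return users, tracks


def predict(users, tracks, u, t):
    """Estimated play count: dot product of a user vector and a track vector."""
    return sum(a * b for a, b in zip(users[u], tracks[t]))


# Toy data (made up): 3 users x 3 tracks, with three entries unobserved.
plays = {(0, 0): 5, (0, 1): 3, (1, 0): 4, (1, 2): 1, (2, 1): 1, (2, 2): 5}
users, tracks = factorize(plays, n_users=3, n_tracks=3)

# An unobserved entry can now be estimated from the learned vectors.
print(round(predict(users, tracks, 0, 2), 2))
```

Only observed entries contribute to the updates, which is what makes this "completion": the missing cells are filled in by the dot products of the learned vectors.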
  20. 20. 30 Million Songs… What To Play? 75 Million People … 1 Person Every 3 Secs…
  21. 21. 1.5 Billion Playlists
  22. 22. Language Models • Language models work well too. For example, a playlist could be considered as a document and you could learn the latent vectors for tracks (words). • Then represent a User as a linear combination of their Tracks.
  23. 23. word2vec Words with similar contexts have similar meaning
  24. 24. word2vec
  25. 25. word2vec Target Word Context Word
  26. 26. word2vec Target Words and Corresponding Contexts
                      shining  bright  trees  dark  green
          stars          61      50     10     30      1
          sun            71      60      5      2      0
          cucumber        2       1     15      3     40
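The counts on slide 26 already show why "similar contexts, similar meaning" works. The sketch below compares the three target words by cosine similarity over their context counts. It assumes the column order shining, bright, trees, dark, green as read from the slide text, and it uses raw counts rather than learned word2vec vectors; the function name `cosine` is chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Context counts from the slide; columns: shining, bright, trees, dark, green.
contexts = {
    "stars": [61, 50, 10, 30, 1],
    "sun": [71, 60, 5, 2, 0],
    "cucumber": [2, 1, 15, 3, 40],
}

print(round(cosine(contexts["stars"], contexts["sun"]), 2))       # 0.94
print(round(cosine(contexts["stars"], contexts["cucumber"]), 2))  # 0.12
```

"stars" and "sun" share the contexts "shining" and "bright", so they end up close together, while "cucumber" lives mostly in the "green" and "trees" contexts. The same reasoning carries over when playlists play the role of documents and tracks the role of words.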
  27. 27. word2vec Playlists CPU Vectors Read GetVectors & Update
  28. 28. Vectors are awesome! •Unique fingerprint for every user, track, album, artist & even playlist, all in the same space. •Similarity is easily computable: Euclidean Distance or Cosine Similarity.
  29. 29. Approximate Nearest Neighbors •Fast approximate nearest neighbor search. • Locality Sensitive Hashing • https://github.com/spotify/annoy
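One way to picture the locality sensitive hashing bullet on slide 29 is the random-hyperplane scheme sketched below. This is a generic textbook illustration in plain Python and not the Annoy API; the track names, the toy vectors, and the helper names (`make_hyperplanes`, `signature`, `build_index`, `query`) are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes through the origin (one normal vector each)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0 for plane in planes)


def build_index(vectors, planes):
    """Group item names into buckets keyed by their bit signature."""
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(signature(vec, planes), []).append(name)
    return buckets


def query(vec, vectors, planes, buckets):
    """Rank only the items in the query's own bucket, most similar first."""
    candidates = buckets.get(signature(vec, planes), [])
    return sorted(candidates, key=lambda name: -cosine(vec, vectors[name]))


# Toy track vectors (made up for the example).
vectors = {
    "track_a": [1.0, 0.1, 0.0],
    "track_b": [0.9, 0.2, 0.1],
    "track_c": [-1.0, 0.0, 0.5],
    "track_d": [0.0, -1.0, 0.2],
}
planes = make_hyperplanes(dim=3, n_bits=4)
buckets = build_index(vectors, planes)
print(query(vectors["track_a"], vectors, planes, buckets))
```

Vectors pointing in similar directions tend to land in the same bucket, so a query scans a small candidate set instead of every track. The search is approximate: a true neighbor that falls on the other side of one hyperplane is missed, which is why practical systems combine several hash tables or trees.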
  30. 30. Vectors are great for Infrastructure too… •Machine Learning can be decomposed & abstracted away. •A Lambda Architecture involving Machine Learning becomes eas(ier). •Platforms for Personalization become possible….
  31. 31. The Record Store… The List Maker … How do you scale this?
  32. 32. Tools of the trade • Build models in Python (NumPy, SciPy). • Jobs in Scalding + Luigi (https://github.com/spotify/luigi) • Storm for real time. • In-house RPC for serving requests.
  33. 33. Storm 101 • Realtime Stream Processing. • Like Hadoop but easier. • Fault tolerant. • Java, Clojure (yay!) and more!
  34. 34. Storm @ Spotify • Major users are Ads & Personalization! • Every team manages its own cluster. For personalization, we have a 12 node cluster. • A relatively new tech compared to Hadoop™.
  35. 35. So why Storm? • Hadoop is slowwww. The daily UserVector job takes ~16 hours to run. Small Data FTW! • New Users are important; they need a friend! • What moment are you in? Gym, running, etc.
  36. 36. Getting Data Across The Globe
  37. 37. Pipeline … (diagram labels: HDFS, Kafka, User Listens, Playlists, Realtime Listens, Spout)
  38. 38. Pipeline … (diagram labels: HDFS, Kafka, User Listens, Playlists, Realtime Listens, Spout, User Vector Generation Job, Latent Vector Models, Track/Artist/Album Vectors)
  39. 39. Pipeline … (diagram labels: HDFS, Kafka, User Listens, Playlists, Realtime Listens, Spout, User Vector Generation Job, Latent Vector Models, Track/Artist/Album Vectors, Compressed Listening History, Bolts, Cassandra, Cassandra)
  40. 40. Pipeline + Platform (diagram labels: HDFS, Kafka, User Listens, Playlists, Realtime Listens, Spout, User Vector Generation Job, Latent Vector Models, Track/Artist/Album Vectors, Compressed Listening History, Bolts, Cassandra, Cassandra, Backend Systems) •Top Albums •Top Tracks •Top Playlists
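The pipeline slides do not say how a bolt turns a realtime listen into an updated user vector. One plausible reading, combining them with slide 22 (a user as a linear combination of their tracks), is a running mean of the track vectors a user has played. The sketch below makes that assumption explicit: the class `UserVectorUpdater`, its method names, and the in-memory dictionaries standing in for Cassandra are all hypothetical, and no Storm API is used.

```python
class UserVectorUpdater:
    """Hypothetical per-listen update step, standing in for the logic inside a bolt."""

    def __init__(self, track_vectors):
        self.track_vectors = track_vectors  # precomputed by the batch models
        self.user_vectors = {}              # stand-in for a Cassandra table
        self.listen_counts = {}             # listens seen so far, per user

    def on_listen(self, user_id, track_id):
        """Fold one listen event into the user's vector and return the new vector."""
        track_vec = self.track_vectors.get(track_id)
        if track_vec is None:
            # Unknown track: leave the user vector unchanged.
            return self.user_vectors.get(user_id)
        n = self.listen_counts.get(user_id, 0)
        old = self.user_vectors.get(user_id, [0.0] * len(track_vec))
        # Running mean: equal-weight linear combination of all tracks played so far.
        new = [(o * n + t) / (n + 1) for o, t in zip(old, track_vec)]
        self.user_vectors[user_id] = new
        self.listen_counts[user_id] = n + 1
        return new


# Toy track vectors (made up for the example).
updater = UserVectorUpdater({"t1": [1.0, 0.0], "t2": [0.0, 1.0]})
print(updater.on_listen("new_user", "t1"))  # [1.0, 0.0]
print(updater.on_listen("new_user", "t2"))  # [0.5, 0.5]
```

Under this reading a brand-new user has a usable vector after the first play, which can then be matched against track, artist, and album vectors without waiting for the daily batch job.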
  41. 41. Discover New User •Going from two weeks of no recommendations to recommendations as soon as a user plays a track. •Successful A/B test. •First team to build a production-ready personalization feature using Storm.
  42. 42. Lessons Learnt … • Boring technology works well. Complicated Storm Topology = Bad. (Dan McKinley) • Storm is nice. Would have preferred reusing batch Scalding code. Maybe Spark Streaming? • Grow your API from one use case to another. Don’t solve for everything at one time.
  43. 43. Join the band! • Machine Learning, Data & Backend Gigs. • Now touring in New York, Boston & Stockholm! • https://www.spotify.com/jobs/
  44. 44. Thanks! Esh Kumar @eshvk
