Playlist Recommendations @ Spotify

Slides from a talk at a meetup organized by SF Scala at Spotify's San Francisco office. The slides present details of playlist recommendations at Spotify and how Spotify uses Scalding to develop robust and reliable pipelines to generate these recommendations.
Meetup details: http://www.meetup.com/SF-Scala/events/224430674/

Published in: Engineering

Playlist Recommendations @ Spotify

  1. 1. Playlist Recommendations @ Spotify Nikhil Tibrewal @nikhil_tibrewal
  2. 2. Who am I? Nikhil Tibrewal (Nick-hill) ● Data Engineer on Lambda squad (Spotify’s primary ML team) ● Graduated from Carnegie Mellon University in Dec 2013 ● B.Sc. in Computer Science + additional major in Econ ● Been part of Spotify band for ~1.5 years ● Worked on a range of projects, primarily Playlist Recommendations
  3. 3. Spotify in numbers ● Started in 2006, 58 markets ● 75M+ active users, 20M+ paying ● 30M+ songs, 20K new per day ● 1.5+ billion playlists ● 1 TB logs per day
  4. 4. Recommendations so far on Spotify ● Discover tab ● Radio ● Related Artists ● Discover Weekly ● Playlist recs on “Now” Strip [Screenshot text: For Ellie Goulding]
  5. 5. “Now” Strip Human curated playlist
  6. 6. “Now” Strip Human curated playlist Recommended playlist
  7. 7. But… How are playlist recs generated?
  8. 8. Quick Overview! ● Recommend only human curated playlists (1000+) ○ Well-designed cover images ○ Thorough descriptions ○ Title reflects content
  9. 9. Quick Overview! ● Recommend only human curated playlists (1000+) ○ Well-designed cover images ○ Thorough descriptions ○ Title reflects content Good
  10. 10. Quick Overview! ● Recommend only human curated playlists (1000+) ○ Well-designed cover images ○ Thorough descriptions ○ Title reflects content Good Bad
  11. 11. Quick Overview! ● Recommendations pipeline: Candidate Generation ○ Generate N dimensional track vectors from collaborative filtering
  12. 12. Quick Overview! ● Recommendations pipeline: Candidate Generation ○ Generate N dimensional track vectors from collaborative filtering ○ Vectorize playlists: ■ Playlist vector derived from track vectors in playlist
  13. 13. Quick Overview! ● Recommendations pipeline: Candidate Generation ○ Generate N dimensional track vectors from collaborative filtering ○ Vectorize playlists: ■ Playlist vector derived from track vectors in playlist ○ Use Annoy to store playlist vectors in N dimensional space ANNOY (Approximate Nearest Neighbors Oh Yeah) created at Spotify https://github.com/spotify/annoy
  14. 14. Quick Overview! ● Recommendations pipeline: Candidate Generation ○ Generate N dimensional track vectors from collaborative filtering ○ Vectorize playlists: ■ Playlist vector derived from track vectors in playlist ○ Use Annoy to store playlist vectors in N dimensional space ○ Vectorize user taste as well: ■ User vector derived from user listening history
  15. 15. Quick Overview! ● Recommendations pipeline: Candidate Generation ○ Generate N dimensional track vectors from collaborative filtering ○ Vectorize playlists: ■ Playlist vector derived from track vectors in playlist ○ Use Annoy to store playlist vectors in N dimensional space ○ Vectorize user taste as well: ■ User vector derived from user listening history ○ User and playlist vectors in same space! ○ Query for nearest playlists to user from Annoy tree annoyTree.getNearest(seedVector, K)
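The candidate-generation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline from the talk: the slides only say a playlist vector is "derived from" its track vectors, so the element-wise mean used here is an assumption, as are the toy vectors and all function names. An exact cosine-similarity search stands in for the Annoy index, which would return approximate neighbors much faster at scale.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (assumed aggregation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_playlists(seed_vector, playlist_vectors, k):
    """Exact top-k playlists by cosine similarity (stand-in for an Annoy query)."""
    ranked = sorted(
        playlist_vectors,
        key=lambda pid: cosine(seed_vector, playlist_vectors[pid]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-dimensional track vectors; real ones come from collaborative filtering.
track_vectors = {
    "rock_1": [1.0, 0.0],
    "rock_2": [0.9, 0.1],
    "jazz_1": [0.0, 1.0],
    "jazz_2": [0.1, 0.9],
}
playlists = {
    "rock_classics": ["rock_1", "rock_2"],
    "late_night_jazz": ["jazz_1", "jazz_2"],
    "mixed_bag": ["rock_1", "jazz_1"],
}

# Playlists and the user are embedded in the same space as the tracks.
playlist_vectors = {
    pid: mean_vector([track_vectors[t] for t in tracks])
    for pid, tracks in playlists.items()
}
listening_history = ["rock_1", "rock_2", "rock_1"]
user_vector = mean_vector([track_vectors[t] for t in listening_history])

top = nearest_playlists(user_vector, playlist_vectors, 2)
```

Because users and playlists share one vector space, the same search routine can serve any seed vector, which is what the `annoyTree.getNearest(seedVector, K)` call on the slide suggests.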
  16. 16. Quick Overview! ● Recommendations pipeline: Ranking Model ○ Use genre information, demographics data, and playlist popularity data to further rank recommendations ■ John: 21, USA, likes rock ■ Should get rock playlist recs that are popular in the USA and among 21-year-olds ○ Apply post-processing steps that shuffle results and add variety to avoid repetition
  17. 17. Quick Overview! ● Recommendations pipeline: Ranking Model ○ Use genre information, demographics data, and playlist popularity data to further rank recommendations ■ John: 21, USA, likes rock ■ Should get rock playlist recs that are popular in the USA and among 21-year-olds ○ Apply post-processing steps that shuffle results and add variety to avoid repetition 90% of DAUs have recs!
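To make the "John" example concrete, here is a hedged sketch of re-ranking candidates with genre, demographic, and popularity signals. The slides do not describe the actual ranking model, so the additive score, the equal weights, the popularity table, and every name below are hypothetical and chosen only for illustration.

```python
def rank_candidates(user, candidates, popularity):
    """Order candidate playlists by a hand-picked additive score (hypothetical)."""
    def score(playlist):
        # 1.0 if the playlist's genre is one the user likes, else 0.0.
        genre_match = 1.0 if playlist["genre"] in user["genres"] else 0.0
        # Popularity of this playlist in the user's country and age group.
        by_country = popularity.get((playlist["id"], "country", user["country"]), 0.0)
        by_age = popularity.get((playlist["id"], "age", user["age"]), 0.0)
        return genre_match + by_country + by_age
    return sorted(candidates, key=score, reverse=True)

john = {"age": 21, "country": "US", "genres": {"rock"}}

candidates = [
    {"id": "pop_us", "genre": "pop"},
    {"id": "rock_se", "genre": "rock"},
    {"id": "rock_us", "genre": "rock"},
]

# Made-up popularity values in [0, 1], keyed by (playlist, dimension, value).
popularity = {
    ("rock_us", "country", "US"): 0.8,
    ("rock_us", "age", 21): 0.7,
    ("rock_se", "country", "US"): 0.1,
    ("rock_se", "age", 21): 0.2,
    ("pop_us", "country", "US"): 0.5,
    ("pop_us", "age", 21): 0.4,
}

ranked_ids = [p["id"] for p in rank_candidates(john, candidates, popularity)]
```

With these toy numbers the rock playlist that is popular in the US and among 21-year-olds comes first. A post-processing pass for shuffling and variety would run on this ranked list; it is omitted here.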
  18. 18. Quick Overview! ● Infrastructure ○ Luigi to manage workflow (also built at Spotify) ○ Entire pipeline written in Scalding ○ 1200+ nodes Hadoop cluster to run jobs ○ Cassandra (~dozen nodes for playlist recs) ○ Java backend micro-services serving recs
  19. 19. Quick Overview! "Scalding is comprised of a DSL (domain-specific language) that makes MapReduce computations look like Scala’s collection API and is a wrapper for Cascading to make it easy to define jobs, test and data sources on an HDFS" (http://cascading.io/customer/twitter/)
  20. 20. Scalding w.r.t. Playlist Recs ● Used Python back in the day ○ Inputs and outputs were tab separated ○ Complexity UP => Difficulty to maintain UP ○ Hard to write tests ● Scalding provided compile time error checks ○ Catch errors early ○ Define schemas (e.g. Avro) ● Can use Parquet + Avro for input/output ○ Easy to write and read data ○ Records with a lot of fields! ○ Lesson: Parquet hurts performance w/ fat columns (nested data structs)
  21. 21. Scalding w.r.t. Playlist Recs
  22. 22. Scalding w.r.t. Playlist Recs ● Data quality ○ Hadoop counters wrappers in extended Scalding library code
  23. 23. Scalding w.r.t. Playlist Recs ● Data quality ○ Hadoop counters wrappers in extended Scalding library code ○ Verify counters within reasonable ranges
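The "verify counters within reasonable ranges" idea can be sketched as a small check that runs after a job finishes. This is an assumption-laden illustration in plain Python, not the Scalding wrapper from the talk: the counter names, the ranges, and the `check_counters` helper are all hypothetical.

```python
def check_counters(counters, expected_ranges):
    """Return a description of every counter that is missing or out of range."""
    violations = []
    for name, (low, high) in expected_ranges.items():
        value = counters.get(name)
        if value is None:
            violations.append(f"{name}: missing")
        elif not low <= value <= high:
            violations.append(f"{name}: {value} outside [{low}, {high}]")
    return violations

# Counter values as they might be read back from a finished job (made up).
job_counters = {
    "playlists_vectorized": 1200,
    "users_without_history": 90000,
}

# Ranges a team might consider "reasonable" for a healthy run (made up).
expected_ranges = {
    "playlists_vectorized": (1000, 5000),
    "users_without_history": (0, 50000),
    "tracks_missing_vector": (0, 100),
}

violations = check_counters(job_counters, expected_ranges)
```

A workflow step could fail the run when `violations` is non-empty, so that bad data is caught before it reaches downstream jobs.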
  24. 24. Scalding w.r.t. Playlist Recs
  25. 25. Scalding w.r.t. Playlist Recs ● Pipeline tolerance ○ Job failures are normal, and annoying with big jobs ○ Scalding checkpoints ○ Lesson: checkpoint itself is a map-reduce job and has the same caveats ○ Still very helpful!
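The checkpointing idea above can be illustrated with a small local-file sketch. It shows only the general technique (persist an intermediate result, reuse it on a rerun) and is not Scalding's checkpoint API; the `checkpoint` helper, the JSON format, and the file name are assumptions made for the example.

```python
import json
import os
import tempfile

def checkpoint(path, compute):
    """Return the result stored at path if present; otherwise compute and store it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    # Write to a temporary file first so a crash never leaves a partial checkpoint.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, path)
    return result

calls = []

def expensive_step():
    """Stand-in for a long-running pipeline stage; records each invocation."""
    calls.append(1)
    return [1, 2, 3]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "step.json")
    first = checkpoint(path, expensive_step)   # computes and writes the file
    second = checkpoint(path, expensive_step)  # reads the file, skips the work
```

Writing the checkpoint is extra work in its own right, which mirrors the lesson on the slide that a checkpoint is itself a map-reduce job with the same caveats.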
  26. 26. Scalding w.r.t. Playlist Recs ● Job runtimes ○ Common solutions: more reducers and code optimizations ○ Speculative execution for larger jobs ○ Caveat: can take up unnecessary resources
  27. 27. Scalding w.r.t. Playlist Recs ● Memory issues ○ Used Sparkey indices in Python (developed at Spotify, now open source) ■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts” ■ Replicated to all mappers ○ Complex jobs in Scalding => higher memory config for jobs with Sparkey https://github.com/spotify/sparkey
  28. 28. Scalding w.r.t. Playlist Recs ● Memory issues ○ Used Sparkey indices in Python (developed at Spotify, now open source) ■ “Simple constant key/value storage lib for read-heavy systems with infrequent large bulk inserts” ■ Replicated to all mappers ○ Complex jobs in Scalding => higher memory config for jobs with Sparkey ○ Lesson: trade memory resources for MAYBE a little more time with joins: bigPipe.join(exSparkeyPipe) https://github.com/spotify/sparkey
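The trade-off in the lesson above can be sketched by enriching the same records in two ways. This is an in-memory Python illustration with hypothetical names and data, not Sparkey or Scalding code: the first function mimics a key/value index replicated to every mapper, and the second mimics a shuffle-style join that groups both inputs by key.

```python
from collections import defaultdict

def lookup_enrich(records, index):
    """Map-side style: every worker holds the full index in memory."""
    return [(key, value, index[key]) for key, value in records if key in index]

def join_enrich(records, side_records):
    """Join style: group both inputs by key, then combine matches per key."""
    groups = defaultdict(lambda: ([], []))
    for key, value in records:
        groups[key][0].append(value)
    for key, meta in side_records:
        groups[key][1].append(meta)
    joined = []
    for key in sorted(groups):  # sorting stands in for the shuffle/sort phase
        left, right = groups[key]
        joined.extend((key, value, meta) for value in left for meta in right)
    return joined

# Made-up play counts per track and a made-up genre side table.
play_counts = [("t1", 3), ("t2", 5), ("t3", 1)]
track_genres = [("t1", "rock"), ("t2", "jazz")]

via_lookup = lookup_enrich(play_counts, dict(track_genres))
via_join = join_enrich(play_counts, track_genres)
```

Both paths produce the same rows here. On a cluster, the lookup costs memory on every mapper, while the join costs a shuffle, which is the "memory for maybe a little more time" exchange the slide describes.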
  29. 29. Scalding w.r.t. Playlist Recs ● Driven ○ “A sophisticated tool that collects telemetry data from running Scalding / Cascading jobs on a cluster and presenting them in an intriguing User Interface.” ○ http://cascading.io/
  30. 30. Scalding w.r.t. Playlist Recs
  31. 31. Scalding w.r.t. Playlist Recs ● Other awesome benefits
  32. 32. Scalding w.r.t. Playlist Recs ● Other awesome benefits ○ Active community + big players
  33. 33. Scalding w.r.t. Playlist Recs ● Other awesome benefits ○ Active community + big players ○ Data pipeline flows naturally follow the functional paradigm - essentially writing Scala code
  34. 34. Scalding w.r.t. Playlist Recs
  35. 35. Scalding w.r.t. Playlist Recs Productivity without sacrificing performance!
  36. 36. Status: Completed Spotify is hiring! Nikhil Tibrewal @nikhil_tibrewal
