Big data and machine learning @ Spotify

Presented at the Machine Learning class at Chalmers, Gothenburg.
http://www.cse.chalmers.se/research/lab/courses.php?coid=9

The talk tries to connect the theoretical machine learning class with industry examples.

Published in: Data & Analytics

  1. Big Data and Machine Learning @ Spotify
     Oscar Carlsson, Data Engineer, lad@spotify.com
     Friday 6 March 2015
  2. Me
     ● D-student starting 2009
     ● Graduated last year from CSALL (student in this class in 2013)
     ● Master's thesis at Spotify
     ● Data Engineer at Spotify in Gothenburg
  3. Outline
     ● What is data at Spotify?
     ● Big data and processing it
     ● Using data at Spotify
     ● Machine Learning
  4. In the Machine Learning class:
     Supervised learning: data (X), labels (Y)
     Unsupervised learning: data (X)
  5. What is data at Spotify?
     ● Songs
     ● Track metadata: cover art, album, genres, mood etc.
     ● User generated: listens, clicks, page views
     ● Users: country, email etc.
     ● Playlists: tracks of playlist, adds/removes
     30 million songs, 60 million monthly active users, 58 markets, 15 million subscribers, 1.5 billion playlists
  6. Outline
     ● What is data at Spotify?
     ● Big data and processing it
     ● Using data at Spotify
     ● Machine Learning
  7. Big Data and processing it
     ● 20 TB compressed data / day
       ○ 200 TB generated and stored / day (replication)
     ● Our business is highly dependent on these logs
       ○ We pay artists depending on plays; plays = logs
     Too much to store on a single computer. We need a cluster to process it!
     This is typically what is called “Big Data”.
  8. Big Data and processing it
     ● Distributed computing and storage
       ○ Hadoop
         ■ MapReduce
       ○ Cassandra
     ● Hadoop cluster
       ○ 1100 nodes
       ○ ~8000 jobs/day
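To make the MapReduce bullet concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, counting plays per track. The log format (`user_id<TAB>track_id`) and all function names are assumptions for illustration; this is not Spotify's actual job code, and a real Hadoop job would run the same two functions distributed over the cluster.

```python
from collections import defaultdict

def map_play(log_line):
    """Map step: emit (track_id, 1) for one play-log line 'user_id<TAB>track_id'."""
    user_id, track_id = log_line.split("\t")
    yield track_id, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_plays(track_id, counts):
    """Reduce step: sum the emitted counts for one track."""
    return track_id, sum(counts)

def count_plays(log_lines):
    pairs = (pair for line in log_lines for pair in map_play(line))
    return dict(reduce_plays(key, values) for key, values in shuffle(pairs).items())

# Made-up log lines: two plays of t1, one play of t2.
logs = ["u1\tt1", "u2\tt1", "u1\tt2"]
play_counts = count_plays(logs)
```

Because plays = logs, a count like this is the kind of aggregate that payout and popularity reports could be built on.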
  9. Outline
     ● What is data at Spotify?
     ● Big data and processing it
     ● Using data at Spotify
     ● Machine Learning
  10. Using data at Spotify
      Every part of the company is interested in our data
      ● Product
        ○ Are people using X? Should we focus on features such as Y?
      ● Insights
        ○ What music is trending? Which artists are popular where?
      ● Performance
        ○ How is latency in country Y? Did this reduce stutter in country X?
  11. Using data at Spotify
      ● Data-driven decision making
        ○ Like... every decision.
        ○ Analysts / Data scientists
      ● A/B test everything!
      ● A/B testing:
        ○ Statistical hypothesis testing
        ○ Simple randomized experiment with >= 2 variants (A, B)
  12. Using data at Spotify: A/B testing (“The shuffle button”)
      Objective: Decrease time from loading playlist to first play
      Hypothesis: The bigger the button, the faster users find it
      Test setup:
      ● A - variant 1
        ○ 2% of US and SE monthly active users (MAU)
      ● B - variant 2
        ○ 2% of US and SE MAU
      ● Control - normal
        ○ Rest of users in US and SE
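One common way to realise a test setup like this is deterministic hash-based bucketing, sketched below. The slides do not say how Spotify assigns users, so the hashing scheme, the experiment name and the function name are all assumptions; only the 2% / 2% / rest split comes from the slide.

```python
import hashlib

def assign_variant(user_id, experiment="shuffle-button", share=0.02):
    """Deterministically place a user in A, B or control (hypothetical scheme)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    if bucket < share:
        return "A"
    if bucket < 2 * share:
        return "B"
    return "control"

# Assign 100,000 made-up user ids and count the group sizes.
counts = {"A": 0, "B": 0, "control": 0}
for i in range(100_000):
    counts[assign_variant(f"user-{i}")] += 1
```

Hashing the experiment name together with the user id keeps a user in the same group across sessions while giving independent splits for different experiments.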
  13. Using data at Spotify: A/B testing
      [Image: Control, A and B variants]
  14. Analytics: A/B testing
      Metric: Share of users whose first play takes > 500 ms (500 ms is made up)
      Let's roll out A to all users and throw away B!
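A metric of this kind (a share of users) can be compared between two groups with a two-proportion z-test, one standard form of statistical hypothesis testing. The slides do not say which test was used, so this is a sketch under that assumption, and the counts below are made up.

```python
import math

def two_proportion_z_test(slow_a, n_a, slow_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = slow_a / n_a, slow_b / n_b
    pooled = (slow_a + slow_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Made-up counts: users whose first play took longer than the threshold.
z, p_value = two_proportion_z_test(slow_a=4000, n_a=20000, slow_b=4600, n_b=20000)
```

A negative `z` with a small `p_value` would suggest that the first group really has a lower share of slow first plays than the second.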
  15. Outline
      ● What is data at Spotify?
      ● Big data and processing it
      ● Using data at Spotify
      ● Machine Learning
  16. Outline
      ● Machine Learning
        ○ User analysis
        ○ Artist disambiguation
        ○ Recommender systems
  17. “A music session somehow represents a moment for the user. Can we find these moments and describe them?”
  18. Machine Learning: Cluster user music sessions
      ● Take a subset of user listening data with new genre data
        ○ Combine listens into sessions
          ■ Consecutive plays with no 15 min pause
        ○ Session = [genres]
      ● Clustering algorithms to find similar sessions
        ○ K-means / Hierarchical clustering
      ● Describe the clusters using logistic regression
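The session-building step can be sketched as follows. It assumes each play is a `(timestamp, genre)` tuple for a single user and measures the pause between play start times; both are assumptions, since the slide does not define the input format or how the pause is measured.

```python
from datetime import datetime, timedelta

MAX_PAUSE = timedelta(minutes=15)

def split_sessions(plays, max_pause=MAX_PAUSE):
    """Group one user's plays into sessions, each a list of genres.

    A new session starts when the gap to the previous play is at least max_pause.
    """
    sessions = []
    previous = None
    for timestamp, genre in sorted(plays):
        if previous is None or timestamp - previous >= max_pause:
            sessions.append([])
        sessions[-1].append(genre)
        previous = timestamp
    return sessions

# Made-up plays: a 26 minute gap separates the rock plays from the jazz plays.
plays = [
    (datetime(2015, 3, 6, 9, 0), "rock"),
    (datetime(2015, 3, 6, 9, 4), "rock"),
    (datetime(2015, 3, 6, 9, 30), "jazz"),
    (datetime(2015, 3, 6, 9, 33), "jazz"),
]
sessions = split_sessions(plays)
```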
  19. Machine Learning: Cluster user music sessions
      [Figure panels: K-Means; Per cluster classification]
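For readers who have not implemented K-means, a minimal pure-Python sketch follows. It assumes each session has already been turned into a vector of genre shares (here `[rock, jazz, pop]`); that encoding and the toy data are assumptions, not taken from the slides.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move every non-empty cluster's centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids

# Made-up session vectors: two rock-heavy and two jazz-heavy sessions.
session_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
assignment, centroids = k_means(session_vectors, k=2)
```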
  20. Machine Learning: Cluster user music sessions
      Per cluster logistic regression
      w: weight vector
      Each w_i can be interpreted as the effect of the x_i variable
      x_i = genres
  21. Machine Learning: Cluster user music sessions
      Clusters described by logistic regression:
      the name of the x_i with the largest w_i
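The idea of describing a cluster by its largest weight can be sketched as a one-vs-rest logistic regression trained with plain gradient descent. The training procedure, hyperparameters, genre names and data below are assumptions for illustration; the slides only state that a per-cluster logistic regression is used and that the largest w_i names the cluster.

```python
import math

def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            error = 1.0 / (1.0 + math.exp(-z)) - target
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        w = [wi - learning_rate * g / len(X) for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b / len(X)
    return w, b

def describe_cluster(X, assignment, cluster, genre_names):
    """Label a cluster with the genre whose weight is largest (cluster vs rest)."""
    y = [1.0 if a == cluster else 0.0 for a in assignment]
    w, _ = train_logistic_regression(X, y)
    return genre_names[max(range(len(w)), key=lambda i: w[i])]

# Made-up session vectors (genre shares) and cluster assignments.
genre_names = ["rock", "jazz", "pop"]
X = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
assignment = [0, 0, 1, 1]
labels = [describe_cluster(X, assignment, c, genre_names) for c in (0, 1)]
```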
  22. Machine Learning: Cluster user music sessions
  23. Machine Learning: Cluster user music sessions
  24. Machine Learning: Artist disambiguation
      Cleaning up the artist pages
  25. Machine Learning: Artist disambiguation
  26. Machine Learning: Artist disambiguation
      Let's listen to those tracks!
      Is it really the same Fredrik?
  27. Machine Learning: Artist disambiguation
  28. Machine Learning: Artist disambiguation
      ● Rank artists by probability of being ambiguous
      ● Apply clustering to each “ambiguous” artist's albums/tracks
        ○ Using features such as country, release year, label/licensor etc.
        ○ Distinct clusters could be different artists
      ● Present this nicely for manual curation
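As an illustration of the clustering step, the sketch below links two releases when they agree on at least two of three features and treats each connected group as a candidate artist. The linking rule, the use of release decade instead of exact year, the threshold and the data are all assumptions; the slides do not say which clustering method was applied.

```python
def shared_features(a, b):
    """Count how many of country, release decade and label two releases share."""
    return (
        (a["country"] == b["country"])
        + (a["release_year"] // 10 == b["release_year"] // 10)
        + (a["label"] == b["label"])
    )

def cluster_releases(releases, min_shared=2):
    """Single-linkage clustering with union-find over pairwise feature agreement."""
    parent = list(range(len(releases)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(releases)):
        for j in range(i + 1, len(releases)):
            if shared_features(releases[i], releases[j]) >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i, release in enumerate(releases):
        clusters.setdefault(find(i), []).append(release["title"])
    return sorted(clusters.values())

# Made-up releases listed under one artist name.
releases = [
    {"title": "Album A", "country": "SE", "release_year": 2012, "label": "Label X"},
    {"title": "Album B", "country": "SE", "release_year": 2014, "label": "Label Y"},
    {"title": "Album C", "country": "US", "release_year": 1975, "label": "Label Z"},
]
candidate_artists = cluster_releases(releases)
```

Each resulting group would then be shown to a curator, who decides whether the groups really are different artists.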
  29. Machine Learning: Recommender system
      The discover page
  30. Machine Learning: Recommender system
      Collaborative filtering
  31. Machine Learning: Recommender system
      Collaborative filtering
      ● Build a matrix of user plays
      ● Compute similarity between items
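The two collaborative-filtering steps can be sketched on a tiny, made-up play matrix. The sparse dictionary layout and the choice of cosine similarity at this step are assumptions for illustration.

```python
import math

# user -> {track: play count}; made-up data
plays = {
    "u1": {"t1": 5, "t2": 3},
    "u2": {"t1": 4, "t2": 4},
    "u3": {"t3": 7},
}

def item_vectors(plays):
    """Turn the user-by-track matrix into one sparse user vector per track."""
    items = {}
    for user, tracks in plays.items():
        for track, count in tracks.items():
            items.setdefault(track, {})[user] = count
    return items

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b[user] for user, value in a.items() if user in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

vectors = item_vectors(plays)
sim_t1_t2 = cosine(vectors["t1"], vectors["t2"])  # played by the same users
sim_t1_t3 = cosine(vectors["t1"], vectors["t3"])  # no listeners in common
```

Tracks played by the same users come out as similar, without looking at the audio or metadata at all.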
  32. Machine Learning: Recommender system
      4 million tracks x 60 million users
      → Pairwise similarity infeasible
      Approximate the matrix with NMF (non-negative matrix factorization)
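A quick back-of-the-envelope calculation shows why the full matrix and all pairwise similarities are out of reach, and what a factorization buys. The track and user counts are from the slide; the number of latent factors (40) is a hypothetical value chosen only for illustration.

```python
n_tracks = 4_000_000
n_users = 60_000_000

# Cells in the full track-by-user play matrix.
matrix_cells = n_tracks * n_users

# Distinct track pairs whose similarity would have to be computed.
track_pairs = n_tracks * (n_tracks - 1) // 2

# Values stored if every track and user gets one small latent vector.
n_factors = 40  # hypothetical, not from the slides
factor_values = (n_tracks + n_users) * n_factors
```

That is 2.4e14 matrix cells and about 8e12 track pairs, against roughly 2.6e9 stored values for the factorized form.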
  33. Machine Learning: Recommender system
      Matrix factorization (latent factor models)
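As a rough illustration of a latent factor model, the sketch below factorizes a small play matrix V into non-negative factors W and H using the classic multiplicative update rules for NMF. This is a toy, pure-Python version on made-up data, not the production approach, which the slides do not detail.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Squared reconstruction error between V and W * H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, k, iterations=200, seed=0, eps=1e-9):
    """Approximate V (users x tracks) as W (users x k) times H (k x tracks)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Made-up play counts: rows are users, columns are tracks.
V = [[5, 3, 0], [4, 4, 0], [0, 0, 7], [0, 1, 6]]
W, H = nmf(V, k=2)
error = frobenius_error(V, W, H)
```

Each row of `W` is a small user vector and each column of `H` a small track vector; their dot product approximates the play count.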
  34. Machine Learning: Recommender system
      Small vectors
      Cosine similarity and dot product efficient
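With small dense vectors, finding similar tracks reduces to a few multiplications per candidate. The sketch below does an exact brute-force scan over made-up two-dimensional track vectors; the vectors and names are assumptions for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, k=2):
    """Return the k tracks whose vectors are closest to the query track's vector."""
    scored = [
        (cosine(vectors[query], vector), name)
        for name, vector in vectors.items()
        if name != query
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Made-up latent track vectors.
track_vectors = {
    "t1": [0.9, 0.1],
    "t2": [0.8, 0.3],
    "t3": [0.1, 0.9],
    "t4": [0.2, 0.8],
}
neighbours = most_similar("t1", track_vectors, k=2)
```

A scan like this is linear in the number of tracks per query, which is what an approximate nearest neighbour index is meant to avoid at catalogue scale.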
  35. Machine Learning: Recommender system
      Finding recommendations:
      ● Approximate nearest neighbour (ANN)
        ○ code: https://github.com/spotify/annoy
      Related artists & Radio:
      ● Similar to user recommendations, more models and not all CF-based
      Multiple models:
      ● Score candidates from all models, combine and rank!
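The "combine and rank" step could look like the weighted sum sketched below. The slides do not say how scores are combined, so the weighted sum, the model names, the weights and the scores are all assumptions.

```python
def combine_scores(model_scores, weights):
    """Rank candidates by the weighted sum of the scores each model gave them."""
    combined = {}
    for model, scores in model_scores.items():
        for candidate, score in scores.items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[model] * score
    # Highest combined score first; ties broken by candidate name.
    return sorted(combined, key=lambda c: (-combined[c], c))

# Made-up candidate scores from two hypothetical models.
model_scores = {
    "collaborative_filtering": {"t1": 0.9, "t2": 0.4},
    "content_based": {"t2": 0.8, "t3": 0.5},
}
weights = {"collaborative_filtering": 0.5, "content_based": 0.5}
ranking = combine_scores(model_scores, weights)
```

A candidate that several models agree on (here `t2`) can outrank one that a single model scores highly.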
  36. Machine Learning: Recommender system
      I just went through this quickly; read more details about the Spotify rec sys here:
      ● Doing this on MapReduce
      ● Comparing with Netflix
      ● Music Rec @ MLConf 2014
  37. Final words on the future of ML at Spotify
      ● More content-based ML
        ○ Fingerprinting: The Echo Nest
        ○ Content-based music recommendation using convolutional neural networks
      ● Personalize everything
        ○ Emails
        ○ Ads
        ○ User profiling
      ● ML on other parts of the product than rec sys
  38. Summary
      ● Multiple data sources -> multiple angles
      ● Data drives decisions with A/B testing
      ● User analysis
        ○ Cluster and describe with classifier
      ● Artist disambiguation
        ○ Cluster and give to manual curators
      ● Recommender systems
        ○ Collaborative filtering
  39. ... and potentially you could help us?
      ● We supervise thesis workers
        ○ Artist disambiguation/deduplication
        ○ Cluster user music sessions
        ○ Context-based recommender systems
        ○ Personalized ads / Personalized emails
      ● We have internships! www.spotify.com/jobs
  40. Thank you for listening!
      Oscar Carlsson, lad@spotify.com, LinkedIn
