Machine Learning & Big Data @ Spotify
Andy Sloane
@a1k0n
http://a1k0n.net
Madison Big Data Meetup
Jan 27, 2015
Big data?
60M Monthly Active Users (MAU)
50M tracks in our catalog
...But many are identical copies from different
releases (e.g. US and UK releases of the same
album)
...and only 4M unique songs have been listened to
>500 times
Big data?
Raw material: application logs, delivered via Apache
Kafka
Wake Me Up by Avicii has been played 330M times, by
~6M different users
"EndSong": 500GB / day
...But aggregated per-user play counts for a whole
year fit in ~60GB ("medium data")
Hadoop @ Spotify
900 nodes (all in London datacenter)
34 TB RAM total
~16000 typical concurrent tasks (mappers/reducers)
2GB RAM per mapper/reducer slot
What do we need ML for?
Recommendations
Related Artists
Radio
Recommendations
The Discover page
4M tracks x 60M active users, rebuilt daily
The Discover page
Okay, but how do we come up with recommendations?
Collaborative filtering!
Collaborative filtering
Great, but how does that actually work?
Each time a user plays something, add it to a matrix
Compute similarity, somehow, between items based on
who played what
Collaborative filtering
So compute some distance between every pair of rows
and columns
That's just $O(60\mathrm{M}^2 / 2) = O(1.8 \times 10^{15})$ operations... O_O
We need a better way...
(BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM:
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)
I've tried it but don't have results to report here yet :(
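As a rough check of the all-pairs figure, the number of unordered pairs among n rows is n(n-1)/2. The sketch below plugs in the 60M users quoted earlier. The variable names are only illustrative.

```python
# Count unordered pairs among 60M users (the user count quoted on the slides).
n_users = 60_000_000
n_pairs = n_users * (n_users - 1) // 2

# Roughly 1.8e15 pairwise comparisons, before even multiplying by the
# cost of a single distance computation.
print(f"{n_pairs:.2e}")
```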
Collaborative filtering
Latent factor models
Instead, we use a "small" representation for each user &
item: $f$-dimensional vectors (here, $f = 2$),
and approximate the big matrix with it.
Why vectors?
Very compact representation of musical style or user's
taste
Only like 40-200 elements (2 shown above for
illustration)
Why vectors?
Dot product between items = similarity between items
Dot product between a user vector and an item vector = good/bad
recommendation

user   item   user x item
   2      4        8
  -4      0        0
   2     -2       -4
  -1      5       -5
          sum  =  -1
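The worked example above can be reproduced in a couple of lines of NumPy. This is only a sketch: the two 4-element vectors are the ones from the slide, and the variable names are made up for illustration.

```python
import numpy as np

# User and item vectors from the worked example on the slide.
user = np.array([2, -4, 2, -1])
item = np.array([4, 0, -2, 5])

# Elementwise products, then their sum (the dot product).
products = user * item
score = int(user @ item)
print(products.tolist(), score)
```

A negative score like this one would suggest a poor recommendation for this user; a large positive score would suggest a good one.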
Recommendations via dot products
Another example of tracks in two
dimensions
Implicit Matrix Factorization
Hu, Koren, Volinsky - Collaborative Filtering for Implicit
Feedback Datasets
Tries to predict whether user $u$ listens to item $i$:
$$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X Y^T$$
$Y$ is all item vectors, $X$ is all user vectors
"implicit" because users don't tell us what they like, we
only observe what they do/don't listen to
Implicit Matrix Factorization
Goal: make $x_u \cdot y_i$ close to 1 for things each user has
listened to, 0 for everything else.
$x_u$: user $u$'s vector
$y_i$: item $i$'s vector
$p_{ui}$: 1 if user $u$ played item $i$, 0 otherwise
$c_{ui}$: "confidence", ad-hoc weight based on number of
times user $u$ played item $i$; e.g., $1 + \alpha \cdot \mathrm{count}$
$\lambda$: regularization penalty to avoid overfitting
Minimize:
$$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$
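To make the objective concrete, here is a minimal NumPy sketch that evaluates it on a dense play-count matrix. It assumes the confidence weight 1 + alpha * count mentioned in the talk; the function name and the default values for `alpha` and `lam` are placeholders, not values from the slides. A dense matrix like this is only practical for toy data.

```python
import numpy as np

def implicit_mf_loss(counts, X, Y, alpha=1.0, lam=0.1):
    """Confidence-weighted squared error plus L2 penalty.

    counts: (n_users, n_items) play counts
    X:      (n_users, f) user vectors
    Y:      (n_items, f) item vectors
    alpha, lam: assumed hyperparameter values, for illustration only
    """
    P = (counts > 0).astype(float)      # p_ui: 1 if played, else 0
    C = 1.0 + alpha * counts            # c_ui: confidence weight
    err = P - X @ Y.T                   # p_ui - x_u^T y_i for every pair
    penalty = np.sum(X ** 2) + np.sum(Y ** 2)
    return float(np.sum(C * err ** 2) + lam * penalty)

# Toy data: 2 users, 2 items, all-zero vectors.
counts = np.array([[0.0, 2.0], [1.0, 0.0]])
zeros = np.zeros((2, 2))
loss_at_zero = implicit_mf_loss(counts, zeros, zeros, alpha=1.0, lam=0.1)
print(loss_at_zero)
```

With all-zero vectors every prediction is 0, so the loss is just the summed confidence of the played entries (3 + 2 here).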
Alternating Least Squares
Solution: alternate solving for all users $x_u$:
$$x_u = \left( Y^T Y + Y^T (C^u - I) Y + \lambda I \right)^{-1} Y^T C^u p_{u\cdot}$$
and all items $y_i$:
$$y_i = \left( X^T X + X^T (C^i - I) X + \lambda I \right)^{-1} X^T C^i p_{\cdot i}$$
$Y^T Y$ = $f \times f$ matrix, sum of outer products of all items
$Y^T (C^u - I) Y$ = same, except only items the user played
$Y^T C^u p_{u\cdot}$ = weighted $f$-dimensional sum of items the
user played
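A single-user update can be sketched in NumPy as follows. The point of the decomposition is that the shared Gram matrix is computed once, and each user only touches the items they played. The function name, argument layout, and the values of `alpha` and `lam` are assumptions for illustration; the check at the end compares against the naive solve with the full diagonal confidence matrix.

```python
import numpy as np

def solve_user(Y, YtY, item_ids, counts, alpha, lam):
    """One ALS update for one user, using only the items they played."""
    f = Y.shape[1]
    Yu = Y[item_ids]                                      # (n_played, f)
    conf = 1.0 + alpha * np.asarray(counts, dtype=float)  # c_ui for played items
    # Y^T Y + Y^T (C^u - I) Y + lambda I; (C^u - I) is zero for unplayed items
    A = YtY + (Yu.T * (conf - 1.0)) @ Yu + lam * np.eye(f)
    # Y^T C^u p_u; p_ui = 1 for played items and 0 otherwise
    b = Yu.T @ conf
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 3))                # 6 items, f = 3
item_ids, counts = [1, 4], [3, 1]          # the user played items 1 and 4
x_fast = solve_user(Y, Y.T @ Y, item_ids, counts, alpha=2.0, lam=0.5)

# Naive reference: build the full confidence and preference vectors.
c = np.ones(6)
c[item_ids] += 2.0 * np.array(counts)
p = np.zeros(6)
p[item_ids] = 1.0
x_naive = np.linalg.solve(Y.T @ np.diag(c) @ Y + 0.5 * np.eye(3), Y.T @ (c * p))
assert np.allclose(x_fast, x_naive)
```

The item update is the same computation with the roles of users and items swapped.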
Alternating Least Squares
Key point: each iteration is linear in size of input, even
though we are solving for all users x all items, and needs
only $f^2$ memory to solve
No learning rates, just a few tunable parameters ($f$, $\lambda$, $\alpha$)
All you do is add stuff up, solve an $f \times f$ matrix problem,
and repeat!
We use $f = 40$ dimensional vectors for
recommendations
Matrix/vector math using NumPy in Python, Breeze in
Scala
Alternating Least Squares
Adding lots of stuff up
Problem: any user (60M) can play any item (4M)
thus we may need to add any user's vector to any
item's vector
If we put user vectors in memory, it takes a lot of RAM!
Worst case: 60M users * 40 dimensions * sizeof(float) =
9.6GB of user vectors
...too big to fit in a mapper slot on our cluster
Solution: Split the data into a matrix
Most recent run made a 14 x 112 grid
Adding lots of stuff up
Input is a bunch of (user, item, count) tuples
user is the same modulo K for all users in the shard
item is the same modulo L for all items in the shard
e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ...
One map shard
Add up vectors from every data point
Then flip users ↔ items and repeat!
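The modulo sharding can be sketched as below, assuming users and items have integer IDs. The helper names are hypothetical. Each shard (k, l) then only needs the user vectors with user % K == k and the item vectors with item % L == l in memory.

```python
from collections import defaultdict

def shard_of(user, item, K, L):
    """Grid cell that a (user, item, count) tuple belongs to."""
    return (user % K, item % L)

def partition(tuples, K, L):
    """Group play tuples into a K x L grid of shards."""
    shards = defaultdict(list)
    for user, item, count in tuples:
        shards[shard_of(user, item, K, L)].append((user, item, count))
    return shards

# Toy input with K = L = 4: users 1, 5 and 9 all land in row 1.
plays = [(1, 10, 3), (5, 14, 1), (2, 10, 7), (9, 7, 2)]
shards = partition(plays, K=4, L=4)
for cell in sorted(shards):
    print(cell, shards[cell])
```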
Adding stuff up
(user, item, count)
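import numpy as np

def mapper(self, input):  # Luigi-style python job
    user, item, count = parse(input)  # parse() is defined elsewhere
    conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha*count
    # add up user vectors from previous iteration
    x = self.user_vectors[user]
    term1 = conf * x                     # part of X^T C^i p_i
    term2 = np.outer(x, x) * (conf - 1)  # part of X^T (C^i - I) X
    # term1 is a vector and term2 a matrix, so yield them as a tuple
    # rather than stacking them into one array
    yield item, (term1, term2)

def reducer(self, item, terms):
    terms = list(terms)
    term1 = sum(t[0] for t in terms)
    term2 = sum(t[1] for t in terms)
    # self.YTY is the precomputed Gram matrix of the fixed side
    # (the user vectors when solving for items)
    item_vector = np.linalg.solve(
        self.YTY + term2 + self.l2penalty * np.identity(self.dim),
        term1)
    yield item, item_vector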
Alternating Least Squares
Implemented in Java Map-Reduce framework which
runs other models, too
After about 20 iterations, we converge
Each iteration takes about 20 minutes, so about 7-8
hours total
Recomputed from scratch weekly
User vectors recomputed daily, keeping items fixed
So we have vectors, now what?
60M users x 4M recommendable items
Finding Recommendations
For each user, how do we find the best items given
their vector?
Brute force is O(60M x 4M x 40) = O(9 peta-operations)!
Instead, use an approximation based on locality
sensitive hashing (LSH)
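For reference, the brute-force baseline that the approximation replaces looks roughly like this in NumPy. It scores every item for one user, so doing it for all users costs users x items x f multiplications. The function name is illustrative, and `k` is assumed to be no larger than the number of items.

```python
import numpy as np

def top_k_brute_force(user_vec, item_vecs, k):
    """Score every item by dot product; return the k best indices, best first."""
    scores = item_vecs @ user_vec
    top = np.argpartition(-scores, k - 1)[:k]   # the k best, in arbitrary order
    return top[np.argsort(-scores[top])]        # sort those k by score

item_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]])
user_vec = np.array([1.0, 0.0])
best = top_k_brute_force(user_vec, item_vecs, k=2)
print(best.tolist())
```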
Approximate Nearest Neighbors /
Locality-Sensitive Hashing
Annoy - github.com/spotify/annoy
Pre-built read-only database of item vectors
Internally, recursively splits the space with random hyperplanes
Nearby points likely on the same side of random split
Builds several random trees (a forest) for better
approximation
Given an $f$-dimensional query vector, finds similar items
in database
Index loads via mmap, so all processes on the same
machine share RAM
Queries are very, very fast, but approximate
Python implementation available, Java forthcoming
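The idea of a forest of random hyperplane splits can be illustrated with a small NumPy sketch. This is not Annoy's code or API, only a simplified stand-in: each split is the hyperplane halfway between two randomly chosen points, leaves hold at most `leaf_size` items, and a query unions the leaves it falls into across all trees before ranking the candidates exactly. All names and parameters here are assumptions.

```python
import numpy as np

def build_tree(vectors, ids, rng, leaf_size=8):
    """Recursively split ids by the side of a random hyperplane they fall on."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    # Hyperplane through the midpoint between two randomly chosen points.
    a, b = vectors[rng.choice(ids, size=2, replace=False)]
    normal, offset = a - b, (a - b) @ (a + b) / 2.0
    side = vectors[ids] @ normal > offset
    if side.all() or not side.any():        # degenerate split, stop here
        return ("leaf", ids)
    return ("node", normal, offset,
            build_tree(vectors, ids[side], rng, leaf_size),
            build_tree(vectors, ids[~side], rng, leaf_size))

def query_leaf(tree, q):
    """Walk down to the leaf whose region contains q; return its ids."""
    while tree[0] == "node":
        _, normal, offset, above, below = tree
        tree = above if q @ normal > offset else below
    return tree[1]

def approx_neighbors(forest, vectors, q, n):
    """Union the candidates from every tree, then rank them exactly."""
    cand = np.unique(np.concatenate([query_leaf(t, q) for t in forest]))
    dist = np.linalg.norm(vectors[cand] - q, axis=1)
    return cand[np.argsort(dist)[:n]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 40))       # 1000 items, f = 40
ids = np.arange(1000)
forest = [build_tree(vectors, ids, rng) for _ in range(10)]
result = approx_neighbors(forest, vectors, vectors[42], n=5)
print(result)
```

More trees mean more candidates per query, which trades speed for a better approximation.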
Generating recommendations
Annoy index for all items is only 1.2GB
I have one on my laptop... Live demo!
Could serve up nearest neighbors at load time, but we
precompute Discover on Hadoop
Generating recommendations in parallel
Send annoy index in distributed cache, load it via mmap
in map-reduce process
Reducer loads vectors + user stats, looks up ANN,
generates recommendations.
Related Artists
Great for music discovery
Essential for finding believable reasons for latent
factor-based recommendations
When generating recommendations, run through a list
of related artists to find potential reasons
Similar items use cosine distance
Cosine is similar to dot product; just add a
normalization step
Helps "factor out" popularity from similarity
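A small sketch of the normalization step, assuming for illustration that a more popular item tends to have a longer vector pointing in the same direction. The two vectors below are made up.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

niche = np.array([1.0, 2.0])
popular = 10 * niche          # same direction, ten times the length

dot_self = float(niche @ niche)
dot_popular = float(niche @ popular)
cos_popular = cosine_similarity(niche, popular)
print(dot_self, dot_popular, cos_popular)
```

The raw dot product favors the longer vector, while the cosine treats the two items as pointing the same way regardless of length.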
Related Artists
How we build it
Similar to user recommendations, but with more
models, not necessarily collaborative filtering based
Implicit Matrix Factorization (shown previously)
"Vector-Exp", similar model but probabilistic in
nature, trained with gradient descent
Google word2vec on playlists
Echo Nest "cultural similarity" — based on scraping
web pages about music!
Query ANNs to generate candidates
Score candidates from all models, combine and rank
Pre-build table of 20 nearest artists to each artist
Radio
ML-wise, exactly the same as Related Artists!
Radio
For each track, generate candidates with ANN from
each model
Score w/ all models, rank with ensemble
Store top 250 nearest neighbors in a database
(Cassandra)
User plays radio → load 250 tracks and shuffle
Thumbs up → load more tracks from the thumbed-up
song
Thumbs down → remove that song / re-weight tracks
Upcoming work
Deep learning based item similarity
http://benanne.github.io/2014/08/05/spotify-cnns.html
Upcoming work
Audio fingerprint based content deduplication
~1500 Echo Nest Musical Fingerprints per track
Min-Hash based matching to accelerate all-pairs
similarity
Fast connected components using Hash-to-Min
algorithm - $O(\log d)$ mapreduce steps
http://arxiv.org/pdf/1203.5387.pdf
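A minimal MinHash sketch, assuming each track is represented as a set of integer fingerprint codes (the toy sets below are invented). The fraction of matching signature slots estimates the Jaccard similarity of two sets, so near-duplicate tracks can be found by comparing short signatures instead of full fingerprint sets. The seeded-hash construction with `hashlib.blake2b` is one simple choice, not the method from the talk.

```python
import hashlib

def _hash(seed, value):
    """64-bit hash of a value under one of the seeded hash functions."""
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature slots where the two minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

track_a = set(range(0, 100))
track_b = set(range(10, 110))      # true Jaccard = 90 / 110
sig_a = minhash_signature(track_a)
sig_b = minhash_signature(track_b)
print(estimate_jaccard(sig_a, sig_a), estimate_jaccard(sig_a, sig_b))
```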
Thanks!
I can be reached here:
Andy Sloane
Email: andy@a1k0n.net
Twitter: @a1k0n
http://a1k0n.net
Special thanks to Erik Bernhardsson, whose slides I
plagiarized mercilessly

Machine learning @ Spotify - Madison Big Data Meetup

  • 1. Machine Learning & Big Data @ Andy Sloane @a1k0n http://a1k0n.net Madison Big Data Meetup Jan 27, 2015
  • 2. Big data? 60M Monthly Active Users (MAU) 50M tracks in our catalog ...But many are identical copies from different releases (e.g. US and UK releases of the same album) ...and only 4M unique songs have been listened to >500 times
  • 3. Big data? Raw material: application logs, delivered via Apache Kafka Wake Me Up by Avicii has been played 330M times, by ~6M different users "EndSong": 500GB / day ...But aggregated per-user play counts for a whole year fit in ~60GB ("medium data")
  • 4. Hadoop @ Spotify 900 nodes (all in London datacenter) 34 TB RAM total ~16000 typical concurrent tasks (mappers/reducers) 2GB RAM per mapper/reducer slot
  • 5. What do we need ML for? Recommendations Related Artists Radio
  • 7. The Discover page 4M tracks x 60M active users, rebuilt daily
  • 8. The Discover page Okay, but how do we come up with recommendations? Collaborative filtering!
  • 10. Collaborative filtering Great, but how does that actually work? Each time a user plays something, add it to a matrix Compute similarity, somehow, between items based on who played what
  • 11. Collaborative filtering. So compute some distance between every pair of rows and columns. That's just $O\left(\frac{(60\text{M})^2}{2}\right) = O(1.8 \times 10^{15})$ operations... O_O We need a better way... (BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum) I've tried it but don't have results to report here yet :(
  • 12. Collaborative filtering: latent factor models. Instead, we use a "small" representation for each user & item: $f$-dimensional vectors (here, $f = 2$), and approximate the big matrix with it.
  • 13. Why vectors? Very compact representation of musical style or user's taste Only like 40-200 elements (2 shown above for illustration)
  • 14. Why vectors? Dot product between items = similarity between items. Dot product between vectors = good/bad recommendation. Example (user x item): 2 x 4 = 8, -4 x 0 = 0, 2 x -2 = -4, -1 x 5 = -5; sum = 8 + 0 - 4 - 5 = -1.
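The arithmetic in the slide's example can be checked in a few lines of Python. This is only a sketch: the two vectors are read off the slide, and the helper name `dot` is chosen here for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

user = [2, -4, 2, -1]   # user vector from the slide
item = [4, 0, -2, 5]    # item vector from the slide

score = dot(user, item)  # 8 + 0 - 4 - 5
print(score)             # -1
```

A negative score like this one would count as a poor recommendation for that user.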
  • 16. Another example of tracks in two dimensions
  • 17. Implicit Matrix Factorization. Hu, Koren, Volinsky - Collaborative Filtering for Implicit Feedback Datasets. Tries to predict whether user $u$ listens to item $i$: $P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X Y^T$. $Y$ is all item vectors, $X$ is all user vectors. "Implicit" because users don't tell us what they like, we only observe what they do/don't listen to.
  • 18. Implicit Matrix Factorization. Goal: make $x_u \cdot y_i$ close to 1 for things each user has listened to, 0 for everything else. $x_u$: user $u$'s vector. $y_i$: item $i$'s vector. $p_{ui}$: 1 if user $u$ played item $i$, 0 otherwise. $c_{ui}$: "confidence", ad-hoc weight based on number of times user $u$ played item $i$; e.g., $1 + \alpha \cdot \text{count}$. $\lambda$: regularization penalty to avoid overfitting. Minimize: $\sum_{u,i} c_{ui} (p_{ui} - x_u^T y_i)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$
  • 19. Alternating Least Squares. Solution: alternate solving for all users $x_u$: $x_u = (Y^T Y + Y^T (C^u - I) Y + \lambda I)^{-1} Y^T C^u p_{u\cdot}$ and all items $y_i$: $y_i = (X^T X + X^T (C^i - I) X + \lambda I)^{-1} X^T C^i p_{\cdot i}$. $Y^T Y$ = $f \times f$ matrix, sum of outer products of all items. $Y^T (C^u - I) Y$ = same, except only items the user played. $Y^T C^u p_{u\cdot}$ = weighted $f$-dimensional sum of items the user played.
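To make the objective on this slide concrete, here is a minimal NumPy sketch that evaluates it on a tiny dense matrix. The function name `implicit_mf_loss` and the dense `counts` array are assumptions for illustration; at the scale described in the talk the matrix would never be materialized like this.

```python
import numpy as np

def implicit_mf_loss(X, Y, counts, alpha, lam):
    """Confidence-weighted squared error plus an L2 penalty.

    X: (n_users, f) user vectors, Y: (n_items, f) item vectors,
    counts: (n_users, n_items) play counts (dense, toy sizes only).
    """
    P = (counts > 0).astype(float)   # p_ui: 1 if the user played the item
    C = 1.0 + alpha * counts         # c_ui: confidence, 1 + alpha * count
    err = P - X @ Y.T                # p_ui - x_u^T y_i
    penalty = lam * ((X ** 2).sum() + (Y ** 2).sum())
    return float((C * err ** 2).sum() + penalty)

counts = np.array([[0, 3],
                   [1, 0]])
X = np.zeros((2, 2))
Y = np.zeros((2, 2))
loss = implicit_mf_loss(X, Y, counts, alpha=2.0, lam=0.1)
print(loss)  # 10.0: the two played cells contribute 7 and 3
```

With all-zero vectors, only the played cells contribute, each weighted by its confidence, which is why heavily played items dominate the error.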
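The user update on this slide can be sketched directly in NumPy. This is an illustrative single-machine version, not the Map-Reduce implementation the talk describes; the function name `solve_user` and its arguments are assumptions. It relies on the slide's observation that $Y^T Y$ is shared by all users, so only the items a user actually played need to be touched.

```python
import numpy as np

def solve_user(Y, YtY, played, counts, alpha, lam):
    """Solve for one user's vector with the item vectors Y held fixed.

    played: indices of the items this user played
    counts: play counts for those items, in the same order
    """
    f = Y.shape[1]
    Yu = Y[played]                                      # played items only
    cu = 1.0 + alpha * np.asarray(counts, dtype=float)  # confidences c_ui
    # Y^T Y + Y^T (C^u - I) Y + lambda * I
    A = YtY + (Yu.T * (cu - 1.0)) @ Yu + lam * np.eye(f)
    # Y^T C^u p_u, where p_ui = 1 exactly for the played items
    b = Yu.T @ cu
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))   # 5 items, f = 2
x_u = solve_user(Y, Y.T @ Y, played=[1, 3], counts=[2, 5], alpha=40.0, lam=0.1)
print(x_u)                    # a length-2 user vector
```

The item update is the same computation with the roles of users and items swapped.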
  • 20. Alternating Least Squares. Key point: each iteration is linear in size of input, even though we are solving for all users x all items, and needs only $f^2$ memory to solve. No learning rates, just a few tunable parameters ($f$, $\lambda$, $\alpha$). All you do is add stuff up, solve an $f \times f$ matrix problem, and repeat! We use $f = 40$ dimensional vectors for recommendations. Matrix/vector math using numpy in Python, breeze in Scala.
  • 21. Alternating Least Squares Adding lots of stuff up Problem: any user (60M) can play any item (4M) thus we may need to add any user's vector to any item's vector If we put user vectors in memory, it takes a lot of RAM! Worst case: 60M users * 40 dimensions * sizeof(float) = 9.6GB of user vectors ...too big to fit in a mapper slot on our cluster
  • 22. Adding lots of stuff up. Solution: split the data into a $K \times L$ matrix. Most recent run made a 14 x 112 grid.
  • 23. One map shard. Input is a bunch of (user, item, count) tuples. user is the same modulo K for all users; item is the same modulo L for all items. E.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ...
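The modulo sharding described on this slide can be sketched as follows. It assumes users and items are identified by integers, which the slides do not state; `shard_of` is a hypothetical helper name.

```python
def shard_of(user_id, item_id, K, L):
    """Return the (row, column) block of the K x L grid for one tuple."""
    return user_id % K, item_id % L

# With K = 4, users 1, 5, 9, 13 all fall in the same row of the grid,
# so one mapper only needs the vectors for that slice of users.
rows = {shard_of(u, 0, K=4, L=3)[0] for u in (1, 5, 9, 13)}
print(rows)  # {1}
```

Each block then needs only the user vectors for its row and the item vectors for its column, rather than all of them.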
  • 24. Adding stuff up. Add up vectors from every (user, item, count) data point, then flip users ↔ items and repeat!

        import numpy as np

        def mapper(self, input):  # Luigi-style python job
            user, item, count = parse(input)
            conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha*count
            # add up user vectors from previous iteration
            x = self.user_vectors[user]
            term1 = conf * x                     # f-vector
            term2 = np.outer(x, x) * (conf - 1)  # f x f matrix
            yield item, (term1, term2)

        def reducer(self, item, terms):
            # sum the vector parts and the matrix parts separately
            term1, term2 = (sum(parts) for parts in zip(*terms))
            # self.YTY is the precomputed f x f sum of outer products
            # over all vectors on the side being held fixed
            item_vector = np.linalg.solve(
                self.YTY + term2 + self.l2penalty * np.identity(self.dim),
                term1)
            yield item, item_vector
  • 25. Alternating Least Squares Implemented in Java Map-Reduce framework which runs other models, too After about 20 iterations, we converge Each iteration takes about 20 minutes, so about 7-8 hours total Recomputed from scratch weekly User vectors recomputed daily, keeping items fixed So we have vectors, now what?
  • 26. 60M users x 4M recommendable items Finding Recommendations For each user, how do we find the best items given their vector? Brute force is O(60M x 4M x 40) = O(9 peta-operations)! Instead, use an approximation based on locality sensitive hashing (LSH)
  • 27. Approximate Nearest Neighbors / Locality-Sensitive Hashing Annoy - github.com/spotify/annoy
  • 28. Annoy - github.com/spotify/annoy. Pre-built read-only database of item vectors. Internally, recursively splits the space with random hyperplanes; nearby points are likely on the same side of a random split. Builds several random trees (a forest) for better approximation. Given an $f$-dimensional query vector, finds similar items in the database. Index loads via mmap, so all processes on the same machine share RAM. Queries are very, very fast, but approximate. Python implementation available, Java forthcoming.
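The idea of a forest of random-hyperplane trees can be illustrated with a small pure-Python sketch. This is not Annoy's actual code or API. The function names, the choice of a hyperplane halfway between two randomly sampled points, and the final ranking by Euclidean distance are all assumptions made for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(ids, vectors, rng, leaf_size=4):
    """Recursively split points by a random hyperplane until leaves are small."""
    if len(ids) <= leaf_size:
        return ids                                   # leaf: list of item ids
    a, b = (vectors[i] for i in rng.sample(ids, 2))  # two random points
    normal = [x - y for x, y in zip(a, b)]           # hyperplane between them
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in ids if dot(normal, vectors[i]) < offset]
    right = [i for i in ids if dot(normal, vectors[i]) >= offset]
    if not left or not right:                        # degenerate split: stop
        return ids
    return (normal, offset,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))

def candidates(tree, query):
    """Walk one tree down to the leaf on the query's side of each split."""
    while isinstance(tree, tuple):
        normal, offset, left, right = tree
        tree = left if dot(normal, query) < offset else right
    return tree

def nearest(forest, vectors, query, n):
    """Union the leaves reached in every tree, then rank them exactly."""
    pool = set()
    for tree in forest:
        pool.update(candidates(tree, query))
    def dist(i):
        return sum((x - y) ** 2 for x, y in zip(vectors[i], query))
    return sorted(pool, key=dist)[:n]

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
forest = [build_tree(list(range(200)), vectors, rng) for _ in range(10)]
result = nearest(forest, vectors, vectors[17], n=5)
```

Each query only inspects a handful of leaves instead of all 200 points. Using more trees enlarges the candidate pool, which is the trade-off between speed and accuracy that the slide mentions.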
  • 29. Generating recommendations Annoy index for all items is only 1.2GB I have one on my laptop... Live demo! Could serve up nearest neighbors at load time, but we precompute Discover on Hadoop
  • 30. Generating recommendations in parallel Send annoy index in distributed cache, load it via mmap in map-reduce process Reducer loads vectors + user stats, looks up ANN, generates recommendations.
  • 32. Related Artists Great for music discovery Essential for finding believable reasons for latent factor-based recommendations When generating recommendations, run through a list of related artists to find potential reasons
  • 33. Similar items use cosine distance Cosine is similar to dot product; just add a normalization step Helps "factor out" popularity from similarity
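The normalization step can be sketched in a few lines. The example assumes that a popular item tends to have a larger vector than a niche one pointing in the same direction, which is one reading of "factoring out popularity"; the vectors and the helper name `cosine` are made up for illustration.

```python
import math

def cosine(a, b):
    """Dot product of a and b after scaling both to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

niche = [1.0, 2.0]
popular = [10.0, 20.0]   # same direction, 10x the magnitude

raw = sum(x * y for x, y in zip(niche, popular))  # grows with magnitude
sim = cosine(niche, popular)                      # depends on direction only
```

Here `raw` is 50.0 while `sim` is 1 up to floating-point rounding: the two items count as identical in style regardless of how large either vector is.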
  • 34. Related Artists How we build it Similar to user recommendations, but with more models, not necessarily collaborative filtering based Implicit Matrix Factorization (shown previously) "Vector-Exp", similar model but probabilistic in nature, trained with gradient descent Google word2vec on playlists Echo Nest "cultural similarity" — based on scraping web pages about music! Query ANNs to generate candidates Score candidates from all models, combine and rank Pre-build table of 20 nearest artists to each artist
  • 35. Radio
  • 36. Radio. ML-wise, exactly the same as Related Artists! For each track, generate candidates with ANN from each model. Score w/ all models, rank with ensemble. Store top 250 nearest neighbors in a database (Cassandra). User plays radio → load 250 tracks and shuffle. Thumbs up → load more tracks from the thumbed-up song. Thumbs down → remove that song / re-weight tracks.
  • 37. Upcoming work Deep learning based item similarity http://benanne.github.io/2014/08/05/spotify-cnns.html
  • 38. Upcoming work: audio fingerprint based content deduplication. ~1500 Echo Nest Musical Fingerprints per track. Min-Hash based matching to accelerate all-pairs similarity. Fast connected components using Hash-to-Min algorithm - $O(\log d)$ mapreduce steps. http://arxiv.org/pdf/1203.5387.pdf
  • 39. Thanks! I can be reached here: Andy Sloane. Email: andy@a1k0n.net. Twitter: @a1k0n. http://a1k0n.net. Special thanks to Erik Bernhardsson, whose slides I plagiarized mercilessly.