ML+Hadoop at NYC Predictive Analytics

How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/


1. August 5, 2013: ML ♡ Hadoop @ Spotify. If it's slow, buy more racks.
2. I'm Erik Bernhardsson. Master's in Physics from KTH in Stockholm. Started at Spotify in 2008, managed the Analytics team for two years. Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC.
3. What's Spotify? What are the challenges? Started in 2006. Currently has 24 million users, 6 million of them paying. Available in 20 countries. About 300 engineers, of which 70 are in NYC.
4. And adding 20K every day... Big challenge: Spotify has over 20 million tracks.
5. Good and bad news: we also have 100B streams. Let's use collaborative filtering! "Hey, I like tracks P, Q, R, S!" "Well, I like tracks Q, R, S, T!" "Then you should check out track P!" "Nice! Btw try track T!"
6. Hadoop at Spotify.
7. Back in 2009: matrix factorization causing the cluster to overheat? Don't worry, put up a curtain.
8. Hadoop today: 700 nodes at our data center in London.
9. The Discover page.
10-13. Here's a secret behind the Discover page: it's precomputed every night. Log streams flow into HADOOP; the music recs are pushed to Cassandra with hdfs2cass and served by Bartender. https://github.com/spotify/luigi https://github.com/spotify/hdfs2cass
14. OK, so how do we come up with recommendations? Let's do collaborative filtering! In particular, implicit collaborative filtering; in particular, matrix factorization (aka latent factor methods).
15. Stop!!! Break it down!!
16. Step 1: Collect data. Play events ("play track x", "play track y", "play track z") stream from the APs into Hadoop at about 5k tracks/s, for a total of >100B streams.
17. Step 2: Put everything into a big sparse matrix. A very big matrix too: M = (c_ui), where c_ui is how many times user u played item i, with roughly 10^7 users (rows) and 10^7 items (columns).
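To make "put everything into a big sparse matrix" concrete, here is a minimal sketch (not Spotify's pipeline; toy data and shapes) of turning (user, item, count) tuples into a sparse play-count matrix with scipy:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy (user, item, count) tuples standing in for the aggregated play logs.
plays = [(0, 3, 7), (0, 5, 2), (1, 3, 1), (2, 0, 4)]

rows, cols, counts = zip(*plays)
# In production the shape would be roughly 1e7 x 1e7; here it is tiny.
M = coo_matrix((counts, (rows, cols)), shape=(3, 6), dtype=np.float32).tocsr()

print(M[0, 3])  # 7.0: user 0 played item 3 seven times
```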
18-19. Matrix example: roughly 25 billion nonzero entries. Total size is roughly 25 billion * 12 bytes = 300 GB ("medium data"). Each cell is a play count, e.g. the (Erik, "Never Gonna Give You Up") entry is 1: Erik listened to "Never Gonna Give You Up" once.
20. Step 3: Matrix factorization. The idea is to find vectors for each user and item. Here's how it looks algebraically: the big matrix P = (p_ui) is approximated by a product of two skinny matrices, p_ui = a_u^T b_i, i.e. M' = A^T B, where a_u and b_i are f-dimensional user and item vectors.
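A tiny numpy illustration of the factorization identity above, with random stand-in matrices (f = 40 as in the talk, toy user/item counts): every predicted score p_ui is just the dot product of the corresponding columns of A and B.

```python
import numpy as np

f, n_users, n_items = 40, 5, 8      # assumed toy sizes
A = np.random.randn(f, n_users)     # column u is user u's vector a_u
B = np.random.randn(f, n_items)     # column i is item i's vector b_i

P_hat = A.T @ B                     # P_hat[u, i] == a_u . b_i for every pair
u, i = 2, 3
assert np.isclose(P_hat[u, i], A[:, u] @ B[:, i])
```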
21. For instance, PLSA: Probabilistic Latent Semantic Indexing (Hofmann, 1999), invented as a method intended for text classification. The big matrix is approximated by a tall, skinny matrix of user vectors times a short, wide matrix of item vectors: P(u, i) = Σ_z P(u|z) P(i, z), with the user vectors holding P(u|z) and the item vectors holding P(i, z).
22. Why are vectors nice? Super small fingerprints of the musical style or the user's taste. Usually something like 40-200 elements. Hard to illustrate 40 dimensions in a 2-dimensional slide, but here's an attempt: track X's vector is something like (0.87, 1.17, -0.26, 0.56, 2.21, 0.77, -0.03, ...), plotted along latent factors 1 and 2.
23. Another example of tracks in two dimensions.
24. Implementing matrix factorization is a little tricky. Iterative algorithms that take many steps to converge. 40 parameters for each item and user, so something like 1.2 billion parameters. See "Google News Personalization: Scalable Online Collaborative Filtering".
25-26. One iteration, one map/reduce job. Map step: the log entries are sharded into a K x L grid by (u % K, i % L), i.e. cells (u % K = 0, i % L = 0), (u % K = 0, i % L = 1), ..., (u % K = K-1, i % L = L-1); each cell gets the matching slice of user vectors (u % K = ...) and item vectors (i % L = ...). Reduce step: the emitted contributions are combined per user shard (u % K = 0, u % K = 1, ..., u % K = K-1) into new vectors.
27. Here's what happens in one map shard. Input is a bunch of (user, item, count) tuples, where user is the same modulo K and item is the same modulo L for the whole shard. Each map task also gets two files from the distributed cache: all user vectors where u % K = x, and all item vectors where i % L = y. The mapper emits contributions, and the reducer combines them into each new vector.
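Here is one hypothetical way the "emit contributions" / "new vector" step could look. The talk doesn't give the exact update rule (and for the PLSA model it would differ), so this sketch assumes an ALS-style least-squares half-step for the user vectors; the shard mechanics (K, L, distributed cache) are as described on the slide, everything else is illustrative.

```python
import numpy as np

f = 40     # latent dimension, as in the talk
lam = 0.1  # regularization (assumed)

def mapper(tuples, user_vectors, item_vectors):
    """tuples: (u, i, count) with u % K and i % L fixed for this shard.
    user_vectors / item_vectors come from the distributed cache.
    (user_vectors would be needed for the item half-step or other losses;
    it is unused in this simplified user update.)"""
    for u, i, count in tuples:
        b = item_vectors[i]
        # Emit item i's contribution to user u's least-squares system.
        yield u, (np.outer(b, b), count * b)

def reducer(u, contributions):
    """Sum contributions for user u from all shards, then solve for the new vector."""
    A = lam * np.eye(f)
    y = np.zeros(f)
    for outer, weighted in contributions:
        A += outer
        y += weighted
    return u, np.linalg.solve(A, y)  # user u's new vector
```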
28. Might take a while to converge. Start with random vectors around the origin.
29. Hadoop? Yeah, we could probably do it in Spark 10x or 100x faster. Still, Hadoop is a great way to scale things horizontally.
30. Nice compact vectors and it's super fast to compute similarity. (Figure: track x and track y plotted on latent factors 1 and 2, with cos(x, y) = HIGH.) Item-to-item transitions (IPMF): P(i → j) = exp(b_j^T b_i) / Z_i = exp(b_j^T b_i) / Σ_k exp(b_k^T b_i). With vectors: p_ui = a_u^T b_i and sim_ij = cos(b_i, b_j) = b_i^T b_j / (|b_i| |b_j|), which is O(f) per pair. Example similarities: sim(2pac, 2pac) = 1.0; sim(2pac, Notorious B.I.G.) = 0.91; sim(2pac, Dr. Dre) = 0.87; sim(2pac, Florence + the Machine) = 0.26; sim(Florence + the Machine, Lana Del Rey) = 0.81. There is also an MDS-style variant: P(i → j) = exp(-|b_j - b_i|^2) / Σ_k exp(-|b_k - b_i|^2).
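Since sim_ij only needs one dot product and two norms over f-dimensional vectors, it really is O(f); a minimal numpy version with placeholder vectors:

```python
import numpy as np

def cosine(b_i, b_j):
    # O(f): one dot product and two norms.
    return float(b_i @ b_j / (np.linalg.norm(b_i) * np.linalg.norm(b_j)))

# Random stand-ins for two item vectors (not real track vectors).
b_2pac = np.random.randn(40)
b_biggie = np.random.randn(40)
print(cosine(b_2pac, b_biggie))
```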
31. Music recommendations are now just dot products. (Figure: user u's vector plotted with track x and track y along latent factors 1 and 2.)
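And scoring tracks for a user is one matrix-vector product; a sketch with made-up matrices (in practice the item matrix has ~20M rows, which is why the next slides bring in approximate nearest neighbors):

```python
import numpy as np

f = 40
item_vectors = np.random.randn(20_000, f)  # rows are item vectors b_i (toy size)
user_vector = np.random.randn(f)           # the user's vector a_u

scores = item_vectors @ user_vector        # one dot product per item
top10 = np.argsort(-scores)[:10]           # item ids to recommend
print(top10)
```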
32. It's still tricky to search for similar tracks though. We have many millions of tracks and you don't want to compute cosine for all pairs.
33. Approximate nearest neighbors to the rescue! Cut the space recursively with random planes. If two points are close, they are more likely to end up on the same side of each plane. https://github.com/spotify/annoy
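Querying annoy from Python looks roughly like this; this follows the current API documented in the repo above (which may have differed in 2013), and the vectors are random placeholders:

```python
import random
from annoy import AnnoyIndex

f = 40
index = AnnoyIndex(f, 'angular')  # angular distance, i.e. cosine-like
for item_id in range(10_000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(f)])
index.build(10)  # 10 trees; more trees give better recall but a bigger index

similar = index.get_nns_by_item(0, 10)  # 10 approximate neighbours of item 0
print(similar)
```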
34. How do you retrain the model? It takes a long time to train a full factorization model. We want to update user vectors much more frequently (at least daily!). However, item vectors are fairly stable: throw away the user vectors and recreate them from scratch!
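The "recreate user vectors from scratch" step can be a single solve against the frozen item vectors, essentially the same half-step as in the reducer sketch earlier; again this assumes an ALS-style update rather than the talk's exact rule:

```python
import numpy as np

def fold_in_user(play_counts, item_vectors, lam=0.1):
    """play_counts: dict item_id -> count for one user.
    item_vectors: (n_items, f) matrix, held fixed between full retrains."""
    f = item_vectors.shape[1]
    A = lam * np.eye(f)
    y = np.zeros(f)
    for item_id, count in play_counts.items():
        b = item_vectors[item_id]
        A += np.outer(b, b)
        y += count * b
    return np.linalg.solve(A, y)  # the user's fresh vector
```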
35. The pipeline: a "hack" to recalculate user vectors more frequently. Is this a little complicated? Yeah, probably. (Diagram: the May 2013 logs go through matrix factorization to produce item vectors and user vectors; the June 2013 logs plus more logs do the same; in between, the user vectors are seeded and recomputed repeatedly over time: user vectors (1) from the logs, then (2), (3), (4), (5) as more logs arrive.)
36. Ideal case: put all vectors in Cassandra/Memcached, use Storm to update in real time.
37. But Hadoop is pretty nice at parallelizing recommendations. 24 cores but not a lot of RAM? mmap is your friend. One map/reduce job: every mapper gets the ANN index of all vectors and the user vectors from the distributed cache, and writes out recs.
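A sketch of the mmap trick: memory-map the big read-only vector file so that all map tasks on one 24-core box share a single copy through the OS page cache instead of each loading it into RAM. The file name, dtype and shape here are assumptions for illustration, not Spotify's actual format:

```python
import numpy as np

NUM_ITEMS, F = 20_000_000, 40
# Assumes the item vectors were already written to this flat float32 file.
item_vectors = np.memmap("item_vectors.f32", dtype=np.float32, mode="r",
                         shape=(NUM_ITEMS, F))

def score(user_vector, item_id):
    # Only the touched pages are read; concurrent tasks share the page cache.
    return float(item_vectors[item_id] @ user_vector)
```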
38. Music recommendations! Our latest baby, the Discover page, featuring lots of different types of recommendations. Expect this to change quite a lot in the next few months!
39. More music recommendations! Radio!
40. More music recommendations! Related artists.
41. Thanks! Btw, we're hiring Machine Learning Engineers and Data Engineers! Email me at erikbern@spotify.com!