Advertisement

ML+Hadoop at NYC Predictive Analytics

Head of Engineering — we're hiring at Better Mortgage
Aug. 3, 2013
Advertisement

More Related Content

Slideshows for you(20)

Similar to ML+Hadoop at NYC Predictive Analytics(20)

Advertisement

ML+Hadoop at NYC Predictive Analytics

  1. August 5, 2013 ML ♡ Hadoop @ Spotify If it’s slow, buy more racks
  2. I’m Erik Bernhardsson Master’s in Physics from KTH in Stockholm Started at Spotify in 2008, managed the Analytics team for two years Moved to NYC in 2011, now the Engineering Manager of the Discovery team at Spotify in NYC 2
  3. August 5, 2013 What’s Spotify? What are the challenges? Started in 2006 Currently has 24 million users 6 million paying users Available in 20 countries About 300 engineers, of which 70 in NYC
  4. And adding 20K every day... Big challenge: Spotify has over 20 million tracks 4
  5. Good and bad news: we also have 100B streams Let’s use collaborative filtering! 5 Hey, I like tracks P, Q, R, S! Well, I like tracks Q, R, S, T! Then you should check out track P! Nice! Btw try track T!
  6. Hadoop at Spotify 6
  7. Back in 2009 Matrix factorization causing cluster to overheat? Don’t worry, put up curtain 7
  8. Source: Hadoop today 700 nodes at our data center in London 8
  9. The Discover page 9
  10. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass
  11. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass
  12. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass https://github.com/spotify/luigi
  13. Here’s a secret behind the Discover page It’s precomputed every night 10 HADOOP Cassandra Bartender Log streams Music recs hdfs2cass https://github.com/spotify/luigi https://github.com/spotify/hdfs2cass
  14. OK so how do we come up with recommendations? Let’s do collaborative filtering! In particular, implicit collaborative filtering In particular, matrix factorization (aka latent factor methods) 11
  15. Stop!!! Break it down!! 12
  16. AP AP AP AP AP AP Hadoop (>100B streams) Play track z play track y play track x 5k tracks/s Step 1: Collect data 13
  17. Step 2: Put everything into a big sparse matrix 14 @ . . . 7 . . . . . . . . . ... ... ... A very big matrix too: M = 0 B B B @ c11 c12 . . . c1n c21 c22 . . . c2n ... ... cm1 cm2 . . . cmn 1 C C C A | {z } 107 items 9 >>>>>>>>>= >>>>>>>>>; 107 users
  18. Matrix example Roughly 25 billion nonzero entries Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”) 15
  19. Matrix example Roughly 25 billion nonzero entries Total size is roughly 25 billion * 12 bytes = 300 GB (“medium data”) 15 Erik Never gonna give you up Erik listened to Never gonna give you up 1 times
  20. Idea is to find vectors for each user and item Here’s how it looks like algebraically: Step 3: Matrix factorization 16P = B B B @ p21 p22 . . . p2n ... ... pm1 pm2 . . . pmn C C C A The idea with matrix factorization is to represent this probability distribu- tion like this: pui = aT u bi M0 = AT B 0 B B B B B B @ 1 C C C C C C A ⇡ 0 B B B B B B @ 1 C C C C C C A | {z } f f 0 . . . . . . . 1 0 . . 1
  21. For instance, for PLSA Probabilistic Latent Semantic Indexing (Hoffman, 1999) Invented as a method intended for text classification 17 P = 0 B B B B B B @ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 C C C C C C A ⇡ 0 B B B B B B @ . . . . . . . . . . . . 1 C C C C C C A | {z } user vectors ✓ . . . . . . . . . . . . . . ◆ | {z } item vectors PLSA 0 B B B B B B @ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 C C C C C C A | {z } P (u,i)= P z P (u|z)P (i,z) ⇡ 0 B B B B B B @ . . . . . . . . . . . . 1 C C C C C C A | {z } P (u|z) ✓ . . . . . . . . . . . . . . ◆ | {z } P (i,z) X
  22. Why are vectors nice? Super small fingerprints of the musical style or the user’s taste Usually something like 40-200 elements Hard to illustrate 40 dimensions in a 2 dimensional slide, but here’s an attempt: 18 0.87 1.17 -0.26 0.56 2.21 0.77 -0.03 Latent factor 1 Latent factor 2 track x's vector Track X:
  23. Another example of tracks in two dimensions 19
  24. Implementing matrix factorization is a little tricky Iterative algorithms that stake many steps to converge 40 parameters for each item and user So something like 1.2 billion parameters “Google News Personalization: Scalable Online Collaborative Filtering” 20
  25. One iteration, one map/reduce job 21 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1
  26. One iteration, one map/reduce job 21 Reduce stepMap step u % K = 0 i % L = 0 u % K = 0 i % L = 1 ... u % K = 0 i % L = L-1 u % K = 1 i % L = 0 u % K = 1 i % L = 1 ... ... ... ... ... ... u % K = K-1 i % L = 0 ... ... u % K = K-1 i % L = L-1 item vectors item%L=0 item vectors item%L=1 item vectors i % L = L-1 user vectors u % K = 0 user vectors u % K = 1 user vectors u % K = K-1 all log entries u % K = 1 i % L = 1 u % K = 0 u % K = 1 u % K = K-1
  27. Here’s what happens in one map shard Input is a bunch of (user, item, count) tuples user is the same modulo K for all users item is the same modulo L for all items 22 One map task Distributed cache: All user vectors where u % K = x Distributed cache: All item vectors where i % L = y Mapper Emit contributions Map input: tuples (u, i, count) where u % K = x and i % L = y Reducer New vector!
  28. Might take a while to converge Start with random vectors around the origin 23
  29. Hadoop? Yeah we could probably do it in Spark 10x or 100x faster. Still, Hadoop is a great way to scale things horizontally. ???? 24
  30. Nice compact vectors and it’s super fast to compute similarity 25 Latent factor 1 Latent factor 2 track x track y cos(x, y) = HIGH IPMF item item: P(i ! j) = exp(bT j bi)/Zi = exp(bT j bi) P k exp(bT k bi) VECTORS: pui = aT u bi simij = cos(bi, bj) = bT i bj |bi||bj| O(f) i j simi,j 2pac 2pac 1.0 2pac Notorious B.I.G. 0.91 2pac Dr. Dre 0.87 2pac Florence + the Machine 0.26 IPMF item item: P(i ! j) = exp(bT j bi)/Zi = exp(bT j bi) P k exp(bT k bi) VECTORS: pui = aT u bi simij = cos(bi, bj) = bT i bj |bi||bj| O(f) i j simi,j 2pac 2pac 1.0 2pac Notorious B.I.G. 0.91 2pac Dr. Dre 0.87 2pac Florence + the Machine 0.26 Florence + the Machine Lana Del Rey 0.81 IPMF item item MDS: P(i ! j) = exp(bT j bi)/Zi = exp( |bj bi| 2 ) P k exp( |bk bi| 2 )
  31. Music recommendations are now just dot products 26 Latent factor 1 Latent factor 2 track x User u's vector track y
  32. It’s still tricky to search for similar tracks though We have many million tracks and you don’t want to compute cosine for all pairs 27
  33. Approximate nearest neighbors to the rescue! Cut the space recursively by random plane. If two points are close, they are more likely to end up on the same side of each plane. https://github.com/spotify/annoy 28
  34. How do you retrain the model? It takes a long time to train a full factorization model. We want to update user vectors much more frequently (at least daily!) However, item vectors are fairly stable. Throw away user vectors and recreate them from scratch! 29
  35. The pipeline “Hack” to recalculate user vectors more frequently. Is this a little complicated? Yeah probably. 30 May 2013 logs Matrix factorization Item vectors User vectors June 2013 logs Matrix factorization Item vectors User vectors + more logs Seeding User vectors (1) Logs User vectors (2) More logs User vectors (3) More logs User vectors (4) More logs User vectors (5) More logs Time
  36. Ideal case Put all vectors in Cassandra/Memcached, use Storm to update in real time 31
  37. But Hadoop is pretty nice at parallelizing recommendations 24 core but not a lot of RAM? mmap is your friend 32 One map reduce job Recs! ANN index of all vectors Distributed cache: User vectors M M M M DC M M M M DC M M M M DC
  38. Music recommendations! Our latest baby, the Discover page. Featuring lots of different types of recommendations. Expect this to change quite a lot in the next few months! 33
  39. More music recommendations! Radio! 34
  40. More music recommendations! Related artists 35
  41. Thanks! Btw, we’re hiring Machine Learning Engineers and Data Engineers! Email me at erikbern@spotify.com!
Advertisement