Spotify's collaborative filtering platform powers our Discover Page. Because we add one new user every three seconds, it is paramount that we produce recommendations in real time. We redesigned our recommendation system and added a Storm-based real-time platform.
Emily
Introduce Emily and Esh; give background on who we are and what we do.
Discover Page
Google Now
Playlist Recommendations
Discover Weekly
Personalization features
Emily
Emily
Scalding
Esh
Emily
Scalding
Emily
Scalding
Esh
Talk about high-intent vs. low-intent signals when building user vectors.
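To make the high-intent vs. low-intent idea concrete, here is a minimal sketch that builds a user vector as a weighted sum of track vectors. The signal names (`search_play`, `radio_play`) and their weights are hypothetical, chosen only for illustration; the talk does not specify them.

```python
# Hypothetical weights: a deliberate action (high intent) counts more than
# passive listening (low intent). The numbers are illustrative only.
INTENT_WEIGHTS = {"search_play": 1.0, "radio_play": 0.25}

def build_user_vector(events, track_vectors):
    """Combine track vectors into one unit-length user vector.

    events: list of (track_id, signal) pairs.
    track_vectors: dict mapping track_id to a list of floats.
    """
    dim = len(next(iter(track_vectors.values())))
    total = [0.0] * dim
    for track_id, signal in events:
        weight = INTENT_WEIGHTS[signal]
        vec = track_vectors[track_id]
        for i in range(dim):
            total[i] += weight * vec[i]
    # Normalize so that heavy listeners and light listeners are comparable.
    norm = sum(x * x for x in total) ** 0.5
    return [x / norm for x in total] if norm > 0 else total
```

With these assumed weights, a track the user searched for pulls the user vector toward itself four times as strongly as a track that simply played on radio.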
Esh
A single machine can process 100 billion words a day. Word2vec builds on the distributional hypothesis: words that appear in the same context have similar meanings. One model in the word2vec framework that we use is the skip-gram model. We go through the documents and, for each word in a document, try to predict the words that come before and after it. It can be shown mathematically that this is like factorizing a word-context matrix. For us, playlists are the documents, and the words are the songs we would like to learn vectors for. The advantage of something like word2vec is that you end up with a geometry defined on the vectors, so you can, for example, add up the vectors of an artist's tracks to get a vector representation of the artist.
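The two ideas in this note can be sketched in a few lines: generating skip-gram (target, context) training pairs from a playlist, and summing track vectors into an artist vector. This is a simplified illustration, not our training code; the function names and the `window` parameter are assumptions made for the example.

```python
def skipgram_pairs(playlist, window=2):
    """Return (target, context) track pairs within `window` positions.

    A playlist plays the role of a document; each track is a word.
    """
    pairs = []
    for i, target in enumerate(playlist):
        lo = max(0, i - window)
        hi = min(len(playlist), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, playlist[j]))
    return pairs

def artist_vector(track_vectors):
    """Sum an artist's track vectors element-wise."""
    return [sum(components) for components in zip(*track_vectors)]
```

For example, `skipgram_pairs(["t1", "t2", "t3"], window=1)` pairs each track with its immediate neighbours only; a model trained on such pairs learns to place tracks that share playlist contexts close together.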
Esh
Static indices that can be shipped around. The core principle is locality-sensitive hashing (LSH).
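As an illustration of the LSH principle, the sketch below uses random hyperplanes: each vector gets one bit per hyperplane depending on which side it falls on, and vectors pointing in similar directions tend to share a bucket. The note does not say which LSH scheme the index uses, so random-hyperplane hashing and all function names here are assumptions.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; a fixed seed keeps the index static."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)

def build_index(vectors, hyperplanes):
    """Group item ids by signature; lookups then scan only one bucket."""
    index = {}
    for item_id, vec in vectors.items():
        index.setdefault(lsh_signature(vec, hyperplanes), []).append(item_id)
    return index
```

Because the hyperplanes are fixed once drawn, the resulting index is a plain lookup table that can be built offline and copied to any machine that needs it.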
Esh
reflective of their whole music taste
Emily
We are the first team to build a production-ready personalization feature using Storm at Spotify.
The Kafka queues were optimized for Hadoop ingestion
Located close to the Hadoop cluster in London.
Emily
Spouts, bolts, tuples
Topology to stitch together the bolts
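The vocabulary above can be mimicked with plain Python generators. This is a conceptual sketch only: it does not use the Storm API, and the event fields and function names are hypothetical. A spout emits tuples, each bolt transforms the stream, and the topology wires them together.

```python
def play_event_spout(events):
    """Spout: emits a stream of tuples, here (user_id, track_id)."""
    for event in events:
        yield (event["user"], event["track"])

def lookup_vector_bolt(stream, track_vectors):
    """Bolt: swaps the track id for its vector, dropping unknown tracks."""
    for user_id, track_id in stream:
        if track_id in track_vectors:
            yield (user_id, track_vectors[track_id])

def update_user_bolt(stream, user_vectors):
    """Bolt: folds each track vector into a running per-user sum."""
    for user_id, vec in stream:
        current = user_vectors.get(user_id, [0.0] * len(vec))
        user_vectors[user_id] = [c + v for c, v in zip(current, vec)]
        yield (user_id, user_vectors[user_id])

def run_topology(events, track_vectors):
    """Topology: wires spout -> bolt -> bolt and drains the stream."""
    user_vectors = {}
    stream = play_event_spout(events)
    stream = lookup_vector_bolt(stream, track_vectors)
    for _ in update_user_bolt(stream, user_vectors):
        pass
    return user_vectors
```

In a real Storm deployment the spout would read from a queue such as Kafka and the bolts would run in parallel across machines; the generator chain only shows the data flow.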
Emily
First team at Spotify to do real-time recommendations
The Kafka queues were optimized for Hadoop ingestion. The Kafka cluster was located close to the Hadoop cluster, both in our data center in London.
Emily
Write into LON Cassandra cluster
Use Sparkey files to store vector info
Splash to ship Sparkey files
Not writing user vectors, only writing out the recs
Esh
Despite the challenges, we had a successful A/B test and are now running this in production.
Emily
Write out the vectors, not just the recs
Service for vectors
Aggregation service on top of vectors to compute recs
Use real-time data to improve recs for all users
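One way an aggregation service could turn stored vectors into recs is to rank tracks by cosine similarity to the user vector. The sketch below is an assumption about how such a service might work, not a description of a built system; `recommend`, `k`, and `exclude` are hypothetical names.

```python
def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user_vec, track_vectors, k=2, exclude=()):
    """Return the ids of the k tracks most similar to the user vector.

    exclude: track ids to skip, e.g. tracks the user has already played.
    """
    scored = [
        (cosine(user_vec, vec), track_id)
        for track_id, vec in track_vectors.items()
        if track_id not in exclude
    ]
    scored.sort(reverse=True)
    return [track_id for _, track_id in scored[:k]]
```

Storing the vectors rather than only the final recs is what makes this possible: the same user vector can be re-scored against a fresh track catalogue at request time. This brute-force scan is only workable for small catalogues; at scale the scan would be replaced by an approximate nearest-neighbour index.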
Emily
Why Lambda
The Batch Architecture
Real-time Architecture
Challenges
Future Work