- 1. Recommenders: Shallow / Deep. SUDEEP DAS. Frontiers and Advances in Data Sciences Conference, Xi'an, China, 2017
- 2. Recommendations guide our experiences almost everywhere!
- 3. Personalization in my typical day
- 4. Morning: News/ Workout/ Getting ready
- 5. Commute hours: Music/ YouTube Lectures/ Books
- 6. Now and then: Social Media/ Shopping online
- 7. Evenings are for Netflix, of course!
- 8. ORIGINS
- 9. Background ● 2006-2009: Netflix Prize: ○ first to improve RMSE by >10% wins $1,000,000 ● The top-performing models ended up being variations of Matrix Factorization (SVD++, Koren et al.) ● Although Netflix's rec system has moved on, MF is still the foundational method on which most collaborative filtering systems are based
- 10. Matrix Factorization
- 11. Singular Value Decomposition (Origins) ● R = U Σ Vᵀ, where R is the (users × items) ratings matrix, U and Vᵀ hold the left/right singular vectors (an orthonormal basis), and Σ holds the singular values (scaling)
- 12. ● Low-rank approximation ● Eckart-Young theorem: the truncated SVD, keeping the largest singular values, is the best low-rank approximation: [U′, Σ′, V′ᵀ] = argmin ‖R − UΣVᵀ‖_F, where ‖·‖_F is the Frobenius norm
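The Eckart-Young statement above can be sketched numerically: truncating the SVD to the largest k singular values gives the best rank-k approximation in Frobenius norm. A minimal illustration with a random matrix (all sizes are arbitrary):

```python
# Truncated-SVD low-rank approximation of a small synthetic ratings matrix.
import numpy as np

rng = np.random.default_rng(0)
R = rng.random((6, 5))  # 6 users x 5 items

# full_matrices=False gives the compact SVD; singular values come sorted descending
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # size of the latent space
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# The Frobenius error equals the norm of the discarded singular values
err = np.linalg.norm(R - R_k, "fro")
```

By Eckart-Young, `err` equals `sqrt(sum(s[k:]**2))`, i.e. the energy in the dropped singular values.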
- 13. Low-rank Matrix Factorization ● R ≈ P Q, where the inner dimension of P and Q sets the size of the latent space ● No orthogonality requirement ● Fit by weighted least squares (or other losses) ● The scaling factor Σ of the SVD is absorbed into both matrices (they are not normalized)
- 14. Low-rank MF (cont.) ● Bias terms: r̂_ui = μ + b_u + b_i + p_u·q_i, with overall bias μ, user bias b_u, and item bias b_i ● Regularization, e.g. L2, L1, etc.
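A hedged sketch of biased low-rank MF trained with SGD on observed ratings, in the spirit of the slides; the hyperparameters (k, learning rate, L2 strength, epochs) and the synthetic data are illustrative choices, not from the talk:

```python
# Biased matrix factorization: r_hat = mu + b_u + b_i + p_u . q_i, fit by SGD
# with L2 regularization on synthetic (user, item, rating) triples.
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, k = 20, 15, 4
obs = [(rng.integers(n_users), rng.integers(n_items), rng.uniform(1, 5))
       for _ in range(200)]

mu = np.mean([r for _, _, r in obs])         # overall bias
b_u = np.zeros(n_users)                      # user biases
b_i = np.zeros(n_items)                      # item biases
P = 0.1 * rng.standard_normal((n_users, k))  # user factors (rows = users)
Q = 0.1 * rng.standard_normal((n_items, k))  # item factors (rows = items)

lr, reg = 0.01, 0.05
for _ in range(30):
    for u, i, r in obs:
        e = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])  # prediction error
        b_u[u] += lr * (e - reg * b_u[u])
        b_i[i] += lr * (e - reg * b_i[i])
        P[u], Q[i] = (P[u] + lr * (e * Q[i] - reg * P[u]),
                      Q[i] + lr * (e * P[u] - reg * Q[i]))

rmse = np.sqrt(np.mean([(r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])) ** 2
                        for u, i, r in obs]))
```

After training, the model's RMSE on the observed triples should beat the mean-only baseline, showing the biases and factors are absorbing user/item structure.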
- 15. The Feedforward View (from Olivier Grisel, dotAI 2017)
- 16. MF Extensions
- 17. Asymmetric Matrix Factorization ● Replace the user vector with a sum of item vectors: the model learns two item matrices, Y and Q, and scores item i for user u as (Σ_{j∈N(u)} y_j)·q_i, where N(u) is the set of all items user u rated/viewed/clicked
- 18. AMF, relation to neural networks ● Input: a 1-hot encoding of a user's play history ● A single hidden layer is equivalent to learning the Y and Q matrices (i.e. the weights)
- 19. SLIM ● SLIM replaces the low-rank approximation with a sparse item-item matrix, R ≈ R·W with the diagonal of W set to zero; sparsity comes from an L1 regularizer ● Equivalent to constructing a regression that predicts a user's ratings from their play history ● NB: it is important that the diagonal is excluded; otherwise the trivial solution (each item predicting itself) is optimal
- 20. Clustering and PGM
- 21. Example / Motivation
- 22. Classic Example / Motivation
- 23. (figure: a users × items matrix with many unknown "?" entries) ● Users now belong to multiple "topics", with some proportion (e.g. 0.88 vs 0.12) ● Purchases are a mix, proportional to the user's affinity for a topic and the item's affinity within that topic
- 24. Latent Dirichlet Allocation (LDA) (plate diagram: Dirichlet priors α and β; per-document topic mixtures θ; per-topic distributions φ; topic assignments z; observed words w; plates over D documents, W words, and K topics)
- 25. LDA as a generative model
- 26. What topics look like (figure: example items grouped into topics, with proportions such as 0.15, 0.63, 0.22)
- 27. Final step: recommending from topics ● Once we've learned a user's distribution over topics and each topic's distribution over items, producing a recommendation is easy. ● Score every item i as p(i | u) = Σ_k p(k | u)·p(i | k), and recommend the items with the highest probability (discarding items the user has already purchased)
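The scoring step above is a single matrix-vector product. A tiny sketch, where both distributions are made-up illustrative numbers rather than posteriors learned from data:

```python
# Recommend from LDA posteriors: theta_u = p(topic | user), phi = p(item | topic).
import numpy as np

theta_u = np.array([0.7, 0.2, 0.1])        # user's mixture over K = 3 topics
phi = np.array([[0.5, 0.3, 0.1, 0.1],      # each row: a topic's distribution
                [0.1, 0.1, 0.4, 0.4],      # over 4 items (rows sum to 1)
                [0.25, 0.25, 0.25, 0.25]])

scores = theta_u @ phi                     # p(item | u) = sum_k theta_k * phi_k,i
purchased = {0}                            # discard already-purchased items
recs = [int(i) for i in np.argsort(-scores) if i not in purchased]
```

Because both inputs are proper distributions, the scores themselves form a distribution over items, so ranking by score is ranking by purchase probability.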
- 28. Deep Learning in Recommender Systems
- 29. Why deep? Deep Learning Is Making Waves Everywhere!
- 30. In many domains, deep learning is achieving near-human or super-human accuracy! However, applications of deep learning in recommender systems are still in their infancy.
- 31. So, what is Deep Learning? A class of machine learning algorithms: ● that use a cascade of multiple non-linear processing layers ● and complex model structures ● to learn different representations of the data in each layer ● where higher level features are derived from lower level features to form a hierarchical representation. Balázs Hidasi, RecSys 2016
- 32. Traditional vs Deep Handcrafted Features Learned/Trainable Features Trainable Classifier Trainable Classifier Traditional ML Deep Learning “Socrates” “Socrates”
- 33. Learning hierarchical representations of data ● Learned features + trainable classifier ● Each layer learns progressively more complex representations from its predecessor: raw pixels → edges → parts of objects composed from edges → object models → "Socrates"
- 34. Earliest adaptation: Restricted Boltzmann Machines (from a recent presentation by Alexandros Karatzoglou) ● One hidden layer; user feedback on the items interacted with is propagated back to all items ● Very similar to an autoencoder!
- 35. There are many ways to make this deep. From Olivier Grisel, dotAI 2017
- 36. From Olivier Grisel, dotAI 2017
- 37. From Olivier Grisel, dotAI 2017
- 38. From Olivier Grisel, dotAI 2017
- 39. Deep Triplet Networks From Olivier Grisel, dotAI 2017
- 40. Wide + Deep Models for Recommendations ● In a recommender setting, you may want to train with a wide set of cross-product feature transformations, so that the model essentially memorizes these sparse feature combinations (rules): Meh! Yay! Cheng et al., Google Inc. (2016)
- 41. Wide + Deep Models for Recommendations On the other hand, you may want the ability to generalize using the representational power of a deep network. But deep nets can over-generalize. Cheng et al, Google Inc. (2016)
- 42. Wide + Deep Models for Recommendations Best of both worlds: Jointly train a deep + wide network. The cross-feature transformation in the wide model component can memorize all those sparse, specific rules, while the deep model component can generalize to similar items via embeddings. Cheng et al, Google Inc. (2016)
- 43. Wide + Deep Models for Recommendations Cheng et al, Google Inc. (2016) Wide + Deep Model for app recommendations.
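The joint logit described in the slides above can be sketched as a toy forward pass: a linear "wide" part over sparse cross-product features plus a small MLP "deep" part over dense embeddings, summed into one output. All weights here are random stand-ins (the real model trains both parts jointly):

```python
# Wide & deep forward pass: logit = w_wide . x_wide + MLP(x_deep).
import numpy as np

rng = np.random.default_rng(3)

x_wide = np.zeros(8)
x_wide[[2, 5]] = 1.0                       # active cross-product features (rules)
x_deep = rng.standard_normal(6)            # concatenated dense embeddings

w_wide = rng.standard_normal(8)            # memorization: one weight per rule
W1, b1 = rng.standard_normal((4, 6)), np.zeros(4)
w2 = rng.standard_normal(4)                # generalization: tiny MLP head

hidden = np.maximum(0, W1 @ x_deep + b1)   # ReLU hidden layer
logit = w_wide @ x_wide + w2 @ hidden      # wide and deep parts share one logit
p_click = 1 / (1 + np.exp(-logit))         # sigmoid output, e.g. p(install)
```

Because the two parts share a single logit and loss, the wide part can soak up the sparse exception rules while the deep part handles generalization via embeddings.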
- 44. The YouTube Recommendation model A two-stage approach with two deep networks: ● The candidate generation network takes events from the user's YouTube activity history as input and retrieves a small subset (hundreds) of videos from a large corpus. These candidates are intended to be generally relevant to the user with high precision. The candidate generation network only provides broad personalization via collaborative filtering. ● The ranking network scores each video according to a desired objective function using a rich set of features describing the video and user. The highest-scoring videos are presented to the user, ranked by their score. Covington et al., Google Inc. (2016)
- 45. The YouTube Recommendation model Stage One Deep candidate generation model architecture ● Embedded sparse features are concatenated with dense features. Embeddings are averaged before concatenation to transform variable-sized bags of sparse IDs into fixed-width vectors suitable for input to the hidden layers. ● All hidden layers are fully connected. ● In training, a cross-entropy loss is minimized with gradient descent on the output of the sampled softmax. ● At serving, an approximate nearest-neighbor lookup is performed to generate hundreds of candidate video recommendations. Covington et al., Google Inc. (2016)
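The stage-one data flow above can be sketched end to end: average a variable-length bag of watch embeddings into a fixed-width vector, concatenate dense features, pass through a fully connected layer, then retrieve candidates by nearest neighbor. The sizes and weights here are illustrative, and an exact dot-product top-k stands in for the approximate nearest-neighbor lookup used in production:

```python
# Candidate generation sketch: watch history -> averaged embedding ->
# hidden layer -> "user vector" -> top-k nearest videos.
import numpy as np

rng = np.random.default_rng(4)
n_videos, d = 50, 8
video_emb = rng.standard_normal((n_videos, d))  # stand-in learned embeddings

watch_ids = [3, 17, 17, 42]                     # variable-sized watch history
watch_avg = video_emb[watch_ids].mean(axis=0)   # fixed-width average
dense = np.array([0.5, 1.2])                    # e.g. example-age features
x = np.concatenate([watch_avg, dense])          # network input

W = rng.standard_normal((d, d + 2))             # one fully connected layer
user_vec = np.maximum(0, W @ x)                 # ReLU -> user vector u

scores = video_emb @ user_vec                   # dot-product relevance
top_k = np.argsort(-scores)[:5]                 # candidate set (exact top-k)
```

At serving time the softmax is never evaluated; only the user vector and a nearest-neighbor index over the item embeddings are needed, which is what makes the retrieval stage cheap.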
- 46. The YouTube Recommendation model Stage Two Deep ranking network architecture ● Uses embedded categorical features (both univalent and multivalent) with shared embeddings, and powers of normalized continuous features. ● All layers are fully connected; in practice, hundreds of features are fed into the network. Covington et al., Google Inc. (2016)
- 47. Autoencoders
- 48. Collaborative Denoising Auto-Encoder (Collaborative Denoising Auto-Encoders for Top-N Recommender Systems, Wu et al., WSDM 2016) ● Treats the feedback on items y that the user U has interacted with (input layer) as a noisy version of the user's preferences on all items (output layer) ● Introduces a user-specific input node and a hidden bias node, while the item weights are shared across all users
- 49. Recurrent Neural Networks - Sequence Modeling http://colah.github.io/posts/2015-08-Understanding-LSTMs/ A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
- 50. Session-based recommendation with Recurrent Neural Networks (GRU4Rec) Hidasi et al., ICLR (2016) ● Treat each user session as a sequence of clicks ● Predict the next item in the session sequence
- 51. Adding Item metadata to GRU4Rec: Parallel RNN Hidasi et al. Recsys (2016) ● Separate RNNs for each input type ○ Item ID ○ Image feature vector obtained from CNN (last avg. pooling layer)
- 52. Convolutional Neural Nets Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 (2016) http://cs231n.stanford.edu/
- 53. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback He et al., AAAI (2016) Helping cold start by augmenting item factors with visual factors ● Create an item factor that is the sum of two terms: an item visual factor, which is an embedding of a deep CNN on the item image, and the usual collaborative item factor
- 54. Deep content-based music recommendations http://benanne.github.io/2014/08/05/spotify-cnns.html Cold-starting new or less popular music ● Take the Mel spectrogram of the song and run it through several convolutional and max-pooling layers down to a compressed 1-d representation. ● The training objective is to minimize the squared error between the collaborative item factors of a known item and the item factor predicted from the CNN. ● Then, for a new item, the model can predict the item factor and make recommendations. Aäron van den Oord, Sander Dieleman and Benjamin Schrauwen, NIPS 2013
- 55. The Pinterest Application: Pin2Vec Related Pins Liu et al. (2017) https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e ● Learn a 128-dimensional compressed representation of each item (embedding), then use a similarity function (cosine) between embeddings to find similar items
- 56. The Pinterest Application: Pin2Vec Related Pins Liu et al. (2017) https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e Co-occurrence vs. Pin2Vec
- 57. The Pinterest Application: Pin2Vec Related Pins Liu et al. (2017) https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e
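The retrieval step described in the Pin2Vec slides, i.e. cosine similarity between learned 128-dimensional embeddings, can be sketched as follows; the embeddings here are random stand-ins for vectors learned from engagement co-occurrence:

```python
# Related pins via cosine similarity over 128-d pin embeddings.
import numpy as np

rng = np.random.default_rng(5)
n_pins, d = 100, 128
emb = rng.standard_normal((n_pins, d))          # stand-in learned embeddings

def related_pins(pin_id, k=5):
    """Return the k pins most cosine-similar to pin_id (excluding itself)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize rows
    sims = unit @ unit[pin_id]                  # cosine similarity to the query
    order = np.argsort(-sims)                   # most similar first
    return [int(i) for i in order if i != pin_id][:k]

neighbors = related_pins(7)
```

Normalizing once and taking dot products makes cosine retrieval a single matrix-vector product, which also maps directly onto approximate nearest-neighbor indexes at scale.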
- 58. Some concluding thoughts ● Deep learning is augmenting shallow-model-based recommender systems. The main draws for DL in RecSys seem to be: ● Better generalization beyond linear models for user-item interactions. ● Embeddings: a unified representation of heterogeneous signals (e.g. adding image/audio/textual content as side information to item embeddings via convolutional NNs). ● Exploitation of the sequential information in actions leading up to a recommendation (e.g. an LSTM on viewing/purchase/search history to predict what will be watched/purchased/searched next). ● DL toolkits provide unprecedented flexibility in experimenting with loss functions (e.g. in toolkits like TensorFlow/MXNet/Keras, switching from a classification loss to a ranking loss is trivial).
- 59. THANKS! sdas@netflix.com @datamusing @netflixresearch