- 1. Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks 7th Italian Information Retrieval Workshop, Venice (Italy), May 30-31, 2016 Cataldo Musto, Claudio Greco, Alessandro Suglia and Giovanni Semeraro Work supported by the IBM Faculty Award “Deep Learning to boost Cognitive Question Answering” Titan X GPU used for this research donated by the NVIDIA Corporation 1
- 2. Overview 1. Background Content-based recommender systems Neural network models 2. Research work Ask Me Any Rating (AMAR) Experimental evaluation 3. Conclusions Lessons learnt Vision 2
- 3. Background
- 4. Content-based recommender systems Consist in matching up the attributes of a user profile with the attributes of a content object (item) [1] [1] P. Lops, M. De Gemmis, and G. Semeraro. “Content-based recommender systems: State of the art and trends”. In: Recommender systems handbook. Springer, 2011 3
- 5. Deep learning Definition Allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [2] • Discovers intricate structure in large data sets by using the backpropagation algorithm [3]; • Leads to progressively more abstract features at higher layers of representation; • More abstract concepts are generally invariant to most local changes of the input. [2] Y. LeCun, Y. Bengio, and G. Hinton. “Deep learning”. In: Nature 521 (2015) [3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. “Learning representations by back-propagating errors”. In: Cognitive modeling (1988) 4
- 6. Recurrent Neural Networks • Recurrent Neural Networks (RNNs) are architectures suited to modelling variable-length sequential data [4]; • The connections between their units may contain loops which let them consider past states in the learning process; • Their roots are in dynamical systems theory, in which the following relation holds: s(t) = f(s(t−1), x(t); θ) where s(t) is the current system state computed by a generic function f evaluated on the previous state s(t−1), x(t) is the current input and θ are the network parameters. [4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Tech. rep. DTIC Document, 1985 5
- 7. RNN pros and cons Pros • Appropriate for representing sequential data; • A versatile framework which can be applied to different tasks; • Can learn short-term and long-term temporal dependencies. Cons • Vanishing/exploding gradient problem [5]; • Difficulty in reaching satisfactory minima during the optimization of the loss function; • Training is difficult to parallelize. [5] Y. Bengio, P. Simard, and P. Frasconi. “Learning long-term dependencies with gradient descent is difficult”. In: Neural Networks, IEEE Transactions on 5 (1994) 6
- 8. Long Short-Term Memory (LSTM) • A specific RNN variant introduced to solve the vanishing/exploding gradient problem; • Each cell presents a complex structure which is more powerful than a simple RNN cell. Figure: LSTM architecture [6] forget gate (f): considers the current input and the previous state to remove or preserve the most appropriate information for the given task; input gate (i): considers the current input and the previous state to determine how the input information will be used to update the cell state; output gate (o): considers the current input, the previous state and the updated cell state to generate an appropriate output for the given task. [6] A. Graves, A. Mohamed, and G. Hinton. “Speech recognition with deep recurrent neural networks”. In: Acoustics, Speech and Signal Processing (ICASSP), IEEE 2013 7
- 11. Research work
- 12. Ask Me Any Rating (AMAR) “Mirror, mirror, here I stand. What is the fairest movie in the land?” • Inspired by a neural network model used to solve Question Answering toy tasks [7]; • Name adapted from “Ask Me Anything” [8]; • A very simple factoid Question Answering system in which user profiles are questions and ratings are answers. [7] J. Weston et al. “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks”. In: CoRR abs/1502.05698 (2015) [8] A. Kumar et al. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing”. In: CoRR abs/1506.07285 (2015) 8
- 13. Ask Me Any Rating (AMAR) • Two different modules generate: • a user embedding; • an item embedding. • The user embedding is associated to a user identifier; • The item embedding is generated from an item description; • The concatenation of the user and item embeddings is given to a logistic regression layer to predict the probability of a “like”. Figure: AMAR architecture (user and word lookup tables, LSTM module, mean pooling layer, concatenation layer, logistic regression layer) 9
- 14. Ask Me Any Rating (AMAR) User embedding • An identifier u is associated to each user; • The identifier is given as input to a lookup table (User LT); • User LT converts it to a learnt user embedding v(u). Item embedding • Each word w1 . . . wm of the item description id is associated to a unique identifier specific to the item-description corpus; • Word identifiers are given as input to a lookup table (Word LT); • Word LT converts them to learnt word embeddings v(wk); • Word embeddings v(wk) are sequentially passed through an RNN with LSTM cells (LSTM module); • The LSTM module generates a latent representation h(wk) for each word; • A mean pooling layer averages the word representations, generating an item embedding v(id) for the item i. 10
- 15. Ask Me Any Rating (AMAR) “Like” probability estimation • The item and user embeddings, v(id) and v(u), are concatenated into a single representation; • The resulting representation is used as the feature vector for the prediction task; • A logistic regression layer is used to estimate the probability of a “like” given by user u to a specific item i; • The generated score is used to build a sorted list of recommended items for user u. Optimization criterion • The neural network is trained by minimizing the binary cross-entropy loss function. 11
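The pipeline above can be sketched in a few lines of numpy. This is an illustrative sketch only: the LSTM is replaced by the raw word embeddings for brevity, and all names and sizes here (`user_lt`, `word_lt`, `like_probability`, the toy vocabulary) are hypothetical, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper uses embedding size 10 for users and items.
n_users, vocab_size, emb_size = 100, 5000, 10

user_lt = rng.normal(scale=0.1, size=(n_users, emb_size))     # User LT
word_lt = rng.normal(scale=0.1, size=(vocab_size, emb_size))  # word lookup table
w = rng.normal(scale=0.1, size=2 * emb_size)                  # logistic regression weights
b = 0.0                                                       # logistic regression bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def like_probability(user_id, word_ids):
    v_u = user_lt[user_id]                  # user embedding v(u)
    h = word_lt[word_ids]                   # stand-in for the LSTM outputs h(w_k)
    v_id = h.mean(axis=0)                   # mean pooling -> item embedding v(id)
    features = np.concatenate([v_u, v_id])  # concatenation layer
    return sigmoid(features @ w + b)        # logistic regression layer

prob = like_probability(3, [10, 42, 7])
```

In the real model the ranked recommendation list for a user is obtained by scoring every candidate item with this probability and sorting.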
- 16. AMAR extended • AMAR extended adds to the AMAR architecture an additional module for item genres; • An identifier gk is associated to each item genre; • Genre identifiers are given as input to a lookup table (Genre LT); • Genre LT converts them to learnt genre embeddings v(gk); • A mean pooling layer averages the genre representations, generating a genres embedding v(ig). Figure: AMAR extended architecture (genre lookup table and mean pooling layer added alongside the user and item modules, followed by the concatenation and logistic regression layers) 12
- 17. Experimental protocol • Datasets: Movielens 1M (ML1M) and DBbook; • Text preprocessing: tokenization and stopword removal; • Evaluation strategy: 5-fold cross-validation for Movielens 1M, holdout for DBbook; • Recommendation task: top-N recommendation leveraging binary user feedback; • Evaluation methodology for recommendation: TestRatings [9]; • Metric: F1-measure evaluated at 5, 10 and 15. [9] A. Bellogin, P. Castells, and I. Cantador. “Precision-oriented evaluation of recommender systems: an algorithmic comparison”. In: Proceedings of the fifth ACM conference on Recommender systems. 2011 13
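The F1-measure at a cutoff N can be sketched as follows for a single user's ranked list; `f1_at_n` is a hypothetical illustration of the standard definition, not the evaluation code used in the experiments.

```python
def f1_at_n(ranked_items, relevant_items, n):
    """F1-measure at cutoff n: harmonic mean of Precision@n and Recall@n."""
    top_n = ranked_items[:n]
    hits = len(set(top_n) & set(relevant_items))  # relevant items in the top n
    precision = hits / n
    recall = hits / len(relevant_items) if relevant_items else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 2 of the 3 relevant items appear in the top 5: P = 0.4, R = 2/3, F1 = 0.5
score = f1_at_n([1, 2, 3, 4, 5], {2, 5, 9}, n=5)
```

The dataset-level figure is the average of this score over all test users.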
- 18. ML1M A film dataset created by the GroupLens research group at the University of Minnesota which contains user ratings on a 5-star scale. Each rating has been binarized according to the following formula: bin_rating(r) = 1 if r ≥ 4, 0 otherwise. Statistics: #ratings 1,000,209 · #users 6,040 · #items 3,301 · avg ratings per user 31.423 · avg positive ratings per user 17.985 · avg negative ratings per user 13.439 · sparsity 0.95 14
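The binarization formula translates directly to code (a trivial sketch):

```python
def bin_rating(r):
    """Binarize a 5-star rating: 4 stars and above count as a positive ('like')."""
    return 1 if r >= 4 else 0
```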
- 19. DBbook A book dataset released for the Linked open data-enabled recommender systems: ESWC 2014 challenge [10]. It contains binary user preferences (e.g., “I like it”, “I don’t like it”). Statistics: #ratings 72,371 · #users 6,181 · #items 8,170 · avg ratings per user 11.392 · avg positive ratings per user 6.727 · avg negative ratings per user 4.665 · sparsity 0.998 [10] T. Di Noia, I. Cantador, and V. C. Ostuni. “Linked open data-enabled recommender systems: ESWC 2014 challenge on book recommendation”. In: Semantic Web Evaluation Challenge. Springer, 2014 15
- 20. Model configurations Embedding-based recommenders W2V Google News (W2V-news) • Method: SG • Embedding size: 300 • Corpus: Google News GloVe • Embedding size: 300 • Corpus: Wikipedia 2014 + Gigaword 5 Baseline recommenders Item-to-item CF (I2I) * • Neighbours: 30, 50, 80 User-to-user CF (U2U) * • Neighbours: 30, 50, 80 SLIM with BPR-Opt (BPRSlim) * TF-IDF Bayesian Personalized Ranking Matrix Factorization (BPRMF) * • Latent factors: 10, 30, 50 Weighted Regularized Matrix Factorization (WRMF) * • Latent factors: 10, 30, 50 * MyMediaLite implementations 16
- 21. Model configurations AMAR • Opt. method: RMSprop [11] • α: 0.9 • Learning rate: 0.001 • Epochs: 25 • User embedding size: 10 • Item embedding size: 10 • LSTM output size: 10 • Batch size: ML1M 1536, DBbook 512 AMAR extended • Opt. method: RMSprop • α: 0.9 • Learning rate: 0.001 • Epochs: 25 • User embedding size: 10 • Item embedding size: 10 • Genre embedding size: 10 • LSTM output size: 10 • Batch size: ML1M 1536, DBbook 512 [11] T. Tieleman and G. E. Hinton. “rmsprop”. In: COURSERA: Neural Networks for Machine Learning Lecture 6.5 (2012) 17
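The RMSprop rule used to train both models (with the α and learning rate listed above) can be sketched as follows; `rmsprop_step` is a hypothetical minimal implementation of the update from [11], not the actual training code.

```python
import numpy as np

def rmsprop_step(param, grad, cache, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSprop update: keep a decaying average of squared gradients
    and divide each step by its square root (Tieleman & Hinton)."""
    cache = alpha * cache + (1 - alpha) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

# Toy usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.
x, cache = np.array([1.0]), np.zeros(1)
for _ in range(100):
    x, cache = rmsprop_step(x, 2 * x, cache)
```

Because the step is normalized by the gradient magnitude, the effective step size stays near the learning rate regardless of the gradient scale.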
- 22. DBbook results F1@10 per recommender configuration: AMAR 0.662 · AMAR extended 0.662 · GloVe 0.655 · W2V-News 0.656 · I2I-30 0.640 · U2U-30 0.639 · BPRMF-30 0.631 · WRMF-50 0.636 · BPRSlim 0.632 · TF-IDF 0.662. Differences statistically significant according to the Wilcoxon test (p ≤ 0.05) 18
- 23. ML1M results F1@10 per recommender configuration: AMAR 0.641 · AMAR extended 0.644 · GloVe 0.575 · W2V-News 0.587 · I2I-30 0.527 · U2U-30 0.525 · BPRMF-30 0.524 · WRMF-50 0.525 · BPRSlim 0.548 · TF-IDF 0.590. Only the differences between U2U and GloVe, BPRSlim and GloVe, and GloVe and W2V-News are not statistically significant according to the Wilcoxon test (p ≤ 0.05) 19
- 24. Conclusions
- 25. AMAR pros and cons Pros • Large improvement on ML1M; • Able to learn item and user representations better suited to the recommendation task; • Item and user embeddings are not generated using a simple mean, but are adapted during training. Cons • It does not deal well with very sparse datasets: • Small improvement on DBbook • Long training times: • DBbook: 50 minutes per epoch • ML1M: 90 minutes per epoch 20
- 26. AMAR improvements Optimization • Use alternative training methods and regularization techniques; • Use pretrained word embeddings; • Use cost functions more appropriate for top-N recommendation; • Increase embedding dimensions. Architecture • Item modelling may be improved by using different neural network architectures; • The classification step may be improved by using deeper fully connected layers. Additional features Leverage important data silos to enrich item representations: • Linked Open Data; • Web and social media. 21
- 27. Thanks for your attention • Design of recommender systems using deep neural networks; • Experimental evaluation on well-known datasets on the top-N recommendation task; • Higher performance using deep models than using shallow models. Alessandro Suglia alessandro.suglia@gmail.com Claudio Greco claudiogaetanogreco@gmail.com 22
- 28. Technical details (Warning: for geeks only)
- 29. Cross entropy Definition Given two probability distributions p and q over the same underlying set of events, it measures the average number of bits needed to identify an event drawn from the set when a coding scheme based on the “unnatural” probability distribution q is used, rather than one based on the “true” distribution p. For discrete probability distributions p and q, the cross entropy is defined as follows: H(p, q) = − ∑x p(x) log q(x) 23
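A direct numpy transcription of the definition (using the natural logarithm, so the result is in nats rather than bits):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x) for discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q))

# Encoding a certain event with a uniform two-event code q costs log 2 nats.
h = cross_entropy([1.0, 0.0], [0.5, 0.5])
```

The binary cross-entropy loss minimized by AMAR is this quantity with p the 0/1 target rating and q the predicted “like” probability, averaged over the training examples.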
- 30. RNN Given an input vector xt, bias vectors b and c, and weight matrices U, V and W, a forward step of an RNN is computed as follows: at = b + W st−1 + U xt ; st = tanh(at) ; ot = c + V st ; pt = softmax(ot). The activation functions are the hyperbolic tangent (tanh) for the hidden layer and the multinomial logistic function (softmax) for the output layer. 24
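The four equations map one-to-one onto a small numpy function; the toy dimensions below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V, b, c):
    """One forward step of the vanilla RNN described above."""
    a_t = b + W @ s_prev + U @ x_t  # pre-activation
    s_t = np.tanh(a_t)              # new hidden state
    o_t = c + V @ s_t               # output pre-activation
    p_t = softmax(o_t)              # output distribution
    return s_t, p_t

# Toy dimensions: input size 4, hidden size 3, output size 5.
U = rng.normal(size=(3, 4))
W = rng.normal(size=(3, 3))
V = rng.normal(size=(5, 3))
b, c = np.zeros(3), np.zeros(5)
s, p = rnn_step(rng.normal(size=4), np.zeros(3), U, W, V, b, c)
```

Processing a sequence means calling `rnn_step` once per time step, feeding each returned state back in as `s_prev`.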
- 31. LSTM The information flow in an LSTM module is much more complex than in a plain RNN. The architecture used in this work uses the following equations, presented in [6]: it = σ(Wxi xt + Whi ht−1 + Wci ct−1 + bi) ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1 + bf) ct = ft ct−1 + it tanh(Wxc xt + Whc ht−1 + bc) ot = σ(Wxo xt + Who ht−1 + Wco ct + bo) ht = ot tanh(ct) where σ is the logistic sigmoid function, and i, f, o and c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the hidden vector h. 25
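These equations can likewise be transcribed in numpy. Note that the peephole weights Wci, Wcf and Wco are diagonal in [6], hence the element-wise products below; the parameter dictionary `p` and the toy sizes are an illustrative convenience, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, d = 3, 4  # hidden/cell size and input size (toy values)
p = {k: rng.normal(size=(n, d)) for k in ("Wxi", "Wxf", "Wxc", "Wxo")}
p.update({k: rng.normal(size=(n, n)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.normal(size=n)
          for k in ("Wci", "Wcf", "Wco", "bi", "bf", "bc", "bo")})

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the equations above."""
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n), p)
```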
- 32. Corpus stats Google News • # tokens: 100B • Vocabulary size: 3M • # matched words: • DBbook: 44636 (41.52%) • ML1M: 35150 (49.13%) GloVe • # tokens: 6B • Vocabulary size: 400K • # matched words: • DBbook: 65013 (60.48%) • ML1M: 49893 (69.74%) 26
- 33. References
- 34. [1] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. “Content-based recommender systems: State of the art and trends”. In: Recommender systems handbook. Springer, 2011. [2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521 (2015). [3] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors”. In: Cognitive modeling 5 (1988). [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. Tech. rep. DTIC Document, 1985. [5] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencies with gradient descent is difficult”. In: Neural Networks, IEEE Transactions on 5 (1994). [6] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. “Speech recognition with deep recurrent neural networks”. In: 26
- 35. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE. 2013. [7] Jason Weston et al. “Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks”. In: CoRR abs/1502.05698 (2015). [8] Ankit Kumar et al. “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing”. In: CoRR abs/1506.07285 (2015). [9] Alejandro Bellogin, Pablo Castells, and Ivan Cantador. “Precision-oriented evaluation of recommender systems: an algorithmic comparison”. In: Proceedings of the ﬁfth ACM conference on Recommender systems. 2011. [10] Tommaso Di Noia, Iván Cantador, and Vito Claudio Ostuni. “Linked open data-enabled recommender systems: ESWC 2014 challenge on book recommendation”. In: Semantic Web Evaluation Challenge. Springer, 2014. 26
- 36. [11] Tijmen Tieleman and Geoffrey E. Hinton. “rmsprop”. In: COURSERA: Neural Networks for Machine Learning Lecture 6.5 (2012). 26