
Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations

Slides from my RecSys 2016 talk on integrating image and textual information into session-based recommendations using novel parallel RNN architectures.

Link to the paper: http://www.hidasi.eu/en/publications.html#p_rnn_recsys16

  1. Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations
     Balázs Hidasi, Gravity R&D (@balazshidasi)
     Massimo Quadrana, Politecnico di Milano (@mxqdr)
     Alexandros Karatzoglou, Telefonica Research (@alexk_z)
     Domonkos Tikk, Gravity R&D (@domonkostikk)
  2. Session-based recommendations
     • Permanent (user) cold start
       ◦ User identification
       ◦ Changing goal/intent
       ◦ Disjoint sessions
     • Item-to-session recommendations
       ◦ Recommend based on the previous events of the session
     • Next-event prediction
  3. GRU4Rec [Hidasi et al., ICLR 2016]
     • Gated Recurrent Unit
       ◦ $z_t = \sigma(W_z x_t + U_z h_{t-1})$
       ◦ $r_t = \sigma(W_r x_t + U_r h_{t-1})$
       ◦ $\hat{h}_t = \tanh(W x_t + U(r_t \circ h_{t-1}))$
       ◦ $h_t = (1 - z_t)h_{t-1} + z_t \hat{h}_t$
     • Adapted to the recommendation problem
       ◦ Session-parallel mini-batches
       ◦ Ranking loss: $\mathrm{TOP1} = \frac{1}{N_S} \sum_{j=1}^{N_S} \sigma(r_{s,j} - r_{s,i}) + \sigma(r_{s,j}^2)$
       ◦ Sampling the output
     [Figures: the GRU cell; session-parallel mini-batch formation (events of Sessions 1-5 arranged in parallel, each input item paired with the next item as desired output); network layout: one-hot ItemID → GRU layer → weighted output → scores on items f() → one-hot ItemID (next).]
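To make the slide's formulas concrete, here is a minimal numpy sketch of one GRU step and the TOP1 ranking loss. It follows the equations above; the helper names, weight shapes, and the toy usage at the end are illustrative assumptions, not code from GRU4Rec.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU update following the slide: gates z_t and r_t, candidate
    state, then interpolation between the old and candidate states."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # reset gate r_t
    h_cand = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand         # new hidden state h_t

def top1_loss(r_pos, r_neg):
    """TOP1 = (1/N_S) * sum_j [sigma(r_neg_j - r_pos) + sigma(r_neg_j^2)],
    where r_pos scores the actual next item and r_neg holds the scores
    of the N_S sampled negative items."""
    return np.mean(sigmoid(r_neg - r_pos) + sigmoid(r_neg ** 2))

# Toy usage: a 5-item catalog, 4-dim hidden state, 3 sampled negatives.
rng = np.random.default_rng(0)
Wz, Wr, W = (0.1 * rng.normal(size=(4, 5)) for _ in range(3))
Uz, Ur, U = (0.1 * rng.normal(size=(4, 4)) for _ in range(3))
h = gru_step(np.eye(5)[2], np.zeros(4), Wz, Uz, Wr, Ur, W, U)
print(h, top1_loss(r_pos=1.2, r_neg=rng.normal(size=3)))
```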
  4. Item features & user decisions
     • Clicking on recommendations
     • Influencing factors
       ◦ Prior knowledge of the user
       ◦ What the user sees:
         ▪ Image (thumbnail, product image, etc.)
         ▪ Text (title, short description, etc.)
         ▪ Price
         ▪ Item properties
         ▪ ...
     • Feature extraction from images and text
  5. Naive solution 1: train on the features
     • Item features on the input
     • Bad offline accuracy
     • Recommendations make sense, but are somewhat generic
     [Figure: image feature vector → GRU layer → network output → scores on items f() → one-hot ItemID (next).]
  6. Naive solution 2: concatenate session & feature info
     • Concatenate the one-hot vector of the item ID and the feature vector of the item
     • Recommendation accuracy similar to the ID-only network
     • The ID input is initially the stronger signal on the next item
       ◦ It dominates the image input
       ◦ The network learns to focus on the weights of the ID (both naive input schemes are sketched below)
     [Figure: one-hot ItemID (weight w_ID) and image feature vector (weight w_image) → GRU layer → weighted output → scores on items f() → one-hot ItemID (next).]
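For concreteness, a small numpy sketch of how the two naive schemes (slides 5 and 6) assemble the network input; the catalog size, feature dimensionality, and variable names are illustrative assumptions.

```python
import numpy as np

n_items, feat_dim = 1000, 1024                 # illustrative sizes
item_id = 42
one_hot = np.eye(n_items)[item_id]             # one-hot vector of the item ID
img_feat = np.random.rand(feat_dim)
img_feat /= np.linalg.norm(img_feat)           # feature vector, unit length

# Naive solution 1: the feature vector alone is the input.
x_feature_only = img_feat

# Naive solution 2: concatenate ID and feature vector. The sparse but
# initially stronger ID signal tends to dominate the dense image input.
x_concat = np.concatenate([one_hot, img_feat])
```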
  7. Parallel RNN
     • Separate RNNs for each input type
       ◦ ID
       ◦ Image feature vector
       ◦ ...
     • Subnets model the sessions separately
     • Concatenate the hidden layers
       ◦ Optional scaling / weighting
     • Output computed from the concatenated hidden state (see the sketch below)
     [Figure: one-hot ItemID → GRU layer → subnet output; image feature vector → GRU layer → subnet output; subnet outputs weighted by w_ID, w_image → weighted output → scores on items f() → one-hot ItemID (next).]
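A minimal sketch of the p-RNN forward pass, reusing gru_step from the GRU sketch above; the softmax output layer and all parameter names are assumptions for illustration.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def prnn_forward(one_hot, img_feat, h_id, h_img,
                 id_params, img_params, W_out, w_id=1.0, w_img=1.0):
    """One p-RNN step: each subnet tracks the session in its own hidden
    state; the states are (optionally weighted and) concatenated before
    the output layer computes scores on the next items."""
    h_id = gru_step(one_hot, h_id, *id_params)        # ID subnet
    h_img = gru_step(img_feat, h_img, *img_params)    # image-feature subnet
    h = np.concatenate([w_id * h_id, w_img * h_img])  # concat hidden states
    return softmax(W_out @ h), h_id, h_img            # scores on items
```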
  8. Training
     • Backpropagate through the whole network in one go
     • Accuracy is slightly better than for the ID-only network
     • The subnets learn similar aspects of the sessions
       ◦ Some of the model capacity is wasted
  9. Alternative training methods
     • Force the subnets to learn different aspects
       ◦ So that they complement each other
     • Inspired by ALS and ensemble learning
     • Train one subnet at a time while the others are fixed
       ◦ Alternate between them
     • Depending on the frequency of alternation (sketched below):
       ◦ Residual training (train each subnet fully before moving on)
       ◦ Alternating training (switch per epoch)
       ◦ Interleaving training (switch per mini-batch)
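The three schedules differ only in when the trainable subnet is switched. A sketch of that control flow, where train_step is a hypothetical helper standing in for one optimizer update that touches only the given subnet's parameters:

```python
def train_step(batch, trainable):
    """Hypothetical stand-in: one optimizer update on `trainable` only,
    with the other subnets' parameters kept fixed."""
    pass

def residual_schedule(subnets, epochs, batches):
    # Residual training: train each subnet fully before moving on.
    for net in subnets:
        for _ in range(epochs):
            for batch in batches:
                train_step(batch, trainable=net)

def alternating_schedule(subnets, epochs, batches):
    # Alternating training: switch the trainable subnet every epoch.
    for e in range(epochs):
        for batch in batches:
            train_step(batch, trainable=subnets[e % len(subnets)])

def interleaving_schedule(subnets, epochs, batches):
    # Interleaving training: switch the trainable subnet every mini-batch.
    for _ in range(epochs):
        for b, batch in enumerate(batches):
            train_step(batch, trainable=subnets[b % len(subnets)])
```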
  10. Experiments on video thumbnails
     • Online video site
       ◦ 712,824 items
       ◦ Training: 17,419,964 sessions; 69,312,698 events
       ◦ Test: 216,725 sessions; 921,202 events
         ▪ Target score compared against only the 50K most popular items
     • Feature extraction (normalization sketched below)
       ◦ GoogLeNet (Caffe) pretrained on the ImageNet dataset
       ◦ Features: values of the last average-pooling layer
         ▪ Dense vector of 1024 values
         ▪ Length-normalized to 1
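A small numpy sketch of the last preprocessing step, assuming the 1024-dimensional GoogLeNet activations have already been extracted with Caffe (the extraction itself is omitted here):

```python
import numpy as np

def normalize_features(feats):
    """L2-normalize each row of an (n_items, 1024) feature matrix so every
    image feature vector has length 1, as on the slide."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)  # guard against zero vectors

feats = np.random.rand(5, 1024)              # stand-in for extracted features
unit_feats = normalize_features(feats)
assert np.allclose(np.linalg.norm(unit_feats, axis=1), 1.0)
```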
  11. Results on video thumbnails
     • Low number of hidden units: MRR improvements
       ◦ +18.72% (vs. item-kNN)
       ◦ +15.41% (vs. ID only, 100 units)
       ◦ +14.40% (vs. ID only, 200 units)
     • High number of hidden units
       ◦ 500+500 for p-RNN, 1000 for the others
       ◦ Diminishing returns from increasing the number of epochs or hidden units
       ◦ +5-7% in MRR

     Method         #Hidden units   Recall@20   MRR@20
     Item-kNN       -               0.6263      0.3740
     ID only        100             0.6831      0.3847
     ID only        200             0.6963      0.3881
     Feature only   100             0.5367      0.3065
     Concat         100             0.6766      0.3850
     P-RNN (naive)  100+100         0.6765      0.4014
     P-RNN (res)    100+100         0.7028      0.4440
     P-RNN (alt)    100+100         0.6874      0.4331
     P-RNN (int)    100+100         0.7040      0.4361
  12. Experiments on text
     • Classified-ad site
       ◦ 339,055 items
       ◦ Training: 1,173,094 sessions; 9,011,321 events
       ◦ Test: 35,741 sessions; 254,857 events
         ▪ Target score compared against all items
       ◦ Multilingual, user-generated titles and item descriptions
     • Feature extraction (sketched below)
       ◦ Concatenated title and description
       ◦ Bi-grams with TF-IDF weighting
         ▪ Sparse vectors of length 1,099,425; 5.44 non-zeroes on average
     • Experiments with a high number of hidden units
       ◦ +3.4% in recall
       ◦ +5.37% in MRR
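The slides do not name a text-processing library; as one way to reproduce the described features, here is a scikit-learn sketch in which TfidfVectorizer, the word-level bi-gram setting, and the toy documents are all assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per item: its title concatenated with its description.
docs = [
    "red city bicycle sturdy frame good condition",
    "mountain bike red frame barely used",
]
vectorizer = TfidfVectorizer(ngram_range=(2, 2))  # bi-grams, TF-IDF weighted
X = vectorizer.fit_transform(docs)                # sparse (n_items, n_bigrams)
print(X.shape, X.nnz / X.shape[0])                # dimensionality, avg non-zeroes
```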
  13. Summary
     • Incorporated image and text features into session-based recommendations using RNNs
     • Naive inclusion is not effective
     • P-RNN architectures
     • Training methods for p-RNNs
  14. Future work
     • Revisit feature extraction
     • Try representations of the whole video
     • Use multiple aspects for items
  15. Thanks! Q&A
