Parallel Recurrent Neural Network Architectures for Feature-rich Session-based Recommendations

Balázs Hidasi, Gravity R&D (@balazshidasi)
Massimo Quadrana, Politecnico di Milano (@mxqdr)
Alexandros Karatzoglou, Telefonica Research (@alexk_z)
Domonkos Tikk, Gravity R&D (@domonkostikk)
Session-based recommendations
• Permanent (user) cold-start
 User identification
 Changing goal/intent
 Disjoint sessions
• Item-to-Session recommendations
 Recommend based on the previous events of the session
• Next event prediction
GRU4Rec [Hidasi et al., ICLR 2016]
• Gated Recurrent Unit
 $z_t = \sigma(W^z x_t + U^z h_{t-1})$
 $r_t = \sigma(W^r x_t + U^r h_{t-1})$
 $\hat{h}_t = \tanh(W x_t + U(r_t \circ h_{t-1}))$
 $h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t$
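A minimal numpy sketch of one GRU step, directly transcribing the four updates above (bias terms omitted, as on the slide); all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU update; W*/U* are the input/recurrent weight matrices."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)           # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)           # reset gate r_t
    h_hat = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_hat         # new hidden state h_t
```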
• Adapted to the recommendation problem
 Session-parallel mini-batches
 Ranking loss: $\mathrm{TOP1} = \frac{1}{N_S} \sum_{j=1}^{N_S} \sigma(r_{s,j} - r_{s,i}) + \sigma(r_{s,j}^2)$
 Sampling the output
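The TOP1 loss above compares the score of the actual next item $i$ with the scores of $N_S$ sampled negative items $j$; a hedged numpy sketch (variable names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top1_loss(r_pos, r_neg):
    """r_pos: score of the actual next item of session s (scalar);
    r_neg: scores of the N_S sampled negative items (1-D array)."""
    return np.mean(sigmoid(r_neg - r_pos) + sigmoid(r_neg ** 2))
```

The second term pushes the scores of the negative samples toward zero and acts as a regularizer.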
[Figure: GRU unit schematic: input $x_t$, gates $z$ and $r$, hidden state $h_t$]
[Figure: session-parallel mini-batches: sessions of different lengths ($i_{1,1} \dots i_{5,3}$, Session1-Session5) laid out in parallel; at each step the input column holds the current item of each active session and the desired output its next item, with finished sessions replaced by new ones]
[Figure: network layout: ItemID one-hot vector → GRU layer → weighted output → f() → scores on items, targeting the next ItemID]
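A simplified Python sketch of the session-parallel mini-batching in the figure above; `sessions` is assumed to be a list of item-ID lists (each of length at least 2), and the partial last batch is simply dropped:

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (input_items, target_items) mini-batches. Each of the
    batch_size "lanes" steps through one session; when a session runs
    out, the lane is refilled with the next unused session (GRU4Rec
    also resets that lane's hidden state to zero at this point)."""
    lanes = list(range(batch_size))   # which session each lane follows
    pos = [0] * batch_size            # current position within each session
    next_session = batch_size         # next session to load into a lane
    while True:
        inputs, targets = [], []
        for lane in range(batch_size):
            s = sessions[lanes[lane]]
            inputs.append(s[pos[lane]])       # current event -> input
            targets.append(s[pos[lane] + 1])  # next event -> desired output
            pos[lane] += 1
            if pos[lane] + 1 >= len(s):       # session exhausted
                if next_session >= len(sessions):
                    return                    # no more sessions to load
                lanes[lane] = next_session
                pos[lane] = 0
                next_session += 1
        yield inputs, targets
```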
Item features & user decisions
• Clicking on recommendations
• Influencing factors
 Prior knowledge of the user
 What the user sees
o Image (thumbnail, product image, etc.)
o Text (title, short description, etc.)
o Price
o Item properties
o ...
• Feature extraction from images and text
Naive solution 1: Train on the features
• Item features on the input
• Bad offline accuracy
• Recommendations make sense
 but are somewhat generic
[Figure: network layout: image feature vector of the current ItemID → GRU layer → network output → f() → scores on items, targeting the next ItemID]
Naive solution 2: Concat. session & feature info
• Concatenate the one-hot vector of the ID and the feature vector of the item
• Recommendation accuracy is similar to the ID-only network
• The ID input is initially the stronger signal on the next item
 It dominates the image input
 The network learns to focus on the ID weights
[Figure: network layout: one-hot ItemID vector and image feature vector concatenated (with weights wID, wimage) into a single GRU layer → weighted output → f() → scores on items, targeting the next ItemID]
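For concreteness, a small numpy sketch of the concatenated input used by this naive solution (names are illustrative):

```python
import numpy as np

def concat_input(item_id, n_items, image_features, w_id=1.0, w_image=1.0):
    """Naive solution 2: one GRU input built from the (optionally
    weighted) one-hot ID vector and the item's image feature vector."""
    one_hot = np.zeros(n_items)
    one_hot[item_id] = 1.0
    return np.concatenate([w_id * one_hot, w_image * image_features])
```

Because the one-hot part identifies the item exactly, the network can reduce the loss by relying on the ID weights alone, which is why the image part ends up largely ignored.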
Parallel RNN
• Separate RNNs for each input type
 ID
 Image feature vector
 ...
• Subnets model the sessions separately
• Concatenate the hidden layers
 Optional scaling / weighting
• Output computed from the concatenated hidden state
[Figure: P-RNN layout: ItemID one-hot vector → GRU layer → subnet output; image feature vector → GRU layer → subnet output; subnet outputs scaled by wID and wimage and concatenated → weighted output → f() → scores on items, targeting the next ItemID]
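A hedged PyTorch sketch of this architecture, assuming two subnets (ID and image) and a plain linear scoring layer standing in for f(); class and parameter names are illustrative, and a real implementation would index an embedding instead of feeding dense one-hots:

```python
import torch
import torch.nn as nn

class PRNN(nn.Module):
    """One GRU subnet per input type; hidden states are scaled,
    concatenated, and fed to a common output layer."""
    def __init__(self, n_items, feat_dim, hidden=100, w_id=1.0, w_image=1.0):
        super().__init__()
        self.gru_id = nn.GRU(n_items, hidden, batch_first=True)
        self.gru_image = nn.GRU(feat_dim, hidden, batch_first=True)
        self.w_id, self.w_image = w_id, w_image
        self.out = nn.Linear(2 * hidden, n_items)  # f(): scores on all items

    def forward(self, id_onehot, image_feat):
        # Each subnet models the session on its own input type.
        h_id, _ = self.gru_id(id_onehot)        # (batch, steps, hidden)
        h_im, _ = self.gru_image(image_feat)    # (batch, steps, hidden)
        # Weighted concatenation of the hidden states.
        h = torch.cat([self.w_id * h_id, self.w_image * h_im], dim=-1)
        return self.out(h)                      # scores for the next item
```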
Training
• Backpropagate through the network in one go
• Accuracy is slightly better than for the ID-only network
• Subnets learn similar aspects of the sessions
 Some of the model capacity is wasted
Alternative training methods
• Force the networks to learn different aspects
 Complement each other
• Inspired by ALS and ensemble learning
• Train one subnet at a time and fix the others
 Alternate between them
• Depending on the frequency of alternation
 Residual training (train fully)
 Alternating training (per epoch)
 Interleaving training (per minibatch)
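A hedged sketch of the per-minibatch variant (interleaving); `model` is a full P-RNN such as the sketch above, `subnets` its subnet modules, and `optimizers` one optimizer per subnet, all illustrative:

```python
def train_interleaving(model, subnets, optimizers, batches, loss_fn):
    """Each mini-batch updates one subnet (round-robin) while the
    others stay fixed; switching per epoch instead gives alternating
    training, and training each subnet to convergence in turn gives
    residual training."""
    for step, (x_id, x_image, target) in enumerate(batches):
        k = step % len(subnets)               # subnet trained on this batch
        for i, net in enumerate(subnets):     # freeze every other subnet
            for p in net.parameters():
                p.requires_grad_(i == k)
        optimizers[k].zero_grad()
        scores = model(x_id, x_image)         # forward through the full net
        loss = loss_fn(scores, target)
        loss.backward()                       # gradients reach only subnet k
        optimizers[k].step()
```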
Experiments on video thumbnails
• Online video site
 712,824 items
 Training: 17,419,964 sessions; 69,312,698 events
 Test: 216,725 sessions; 921,202 events
o Target score compared to only the 50K most popular items
• Feature extraction
 GoogLeNet (Caffe) pretrained on the ImageNet dataset
 Features: values of the last avg. pooling layer
o Dense vector of 1024 values
o Length normalized to 1
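A hedged re-creation of this feature-extraction pipeline, with torchvision's GoogLeNet (assuming a recent torchvision) standing in for the original Caffe model:

```python
import torch
from torchvision.models import googlenet

# Pretrained GoogLeNet with the classifier dropped, so the forward
# pass ends at the 1024-d average-pooling features.
net = googlenet(weights="IMAGENET1K_V1")
net.fc = torch.nn.Identity()
net.eval()

def thumbnail_features(images):
    """images: preprocessed (N, 3, 224, 224) tensor -> unit-length
    1024-d feature vectors, as on the slide."""
    with torch.no_grad():
        feats = net(images)                          # (N, 1024)
    return feats / feats.norm(dim=1, keepdim=True)   # normalize to length 1
```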
Results on video thumbnails
• Low number of hidden units
 MRR improvements
o +18.72% (vs. item-kNN)
o +15.41% (vs. ID only, 100)
o +14.40% (vs. ID only, 200)
• High number of hidden units
 500+500 for P-RNN
 1000 for the others
 Diminishing returns for increasing the number of epochs or hidden units
 +5-7% in MRR
Method         #Hidden units   Recall@20   MRR@20
Item-kNN       -               0.6263      0.3740
ID only        100             0.6831      0.3847
ID only        200             0.6963      0.3881
Feature only   100             0.5367      0.3065
Concat         100             0.6766      0.3850
P-RNN (naive)  100+100         0.6765      0.4014
P-RNN (res)    100+100         0.7028      0.4440
P-RNN (alt)    100+100         0.6874      0.4331
P-RNN (int)    100+100         0.7040      0.4361
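The MRR improvements quoted above follow directly from the table: they compare the best P-RNN (residual, 0.4440) against each baseline, e.g.:

```python
baselines = {"item-kNN": 0.3740, "ID only (100)": 0.3847, "ID only (200)": 0.3881}
best = 0.4440                                  # P-RNN (res), MRR@20
for name, mrr in baselines.items():
    print(f"{name}: {100 * (best / mrr - 1):+.2f}%")
# item-kNN: +18.72%, ID only (100): +15.41%, ID only (200): +14.40%
```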
Experiments on text
• Classified ad site
 339,055 items
 Train: 1,173,094 sessions; 9,011,321 events
 Test: 35,741 sessions; 254,857 events
o Target score compared with all items
 Multilingual, user-generated titles and item descriptions
• Feature extraction
 Concatenated title and description
 Bi-grams with TF-IDF weighting
o Sparse vectors of length 1,099,425; 5.44 non-zeros on average
• Experiments with a high number of hidden units
• +3.4% in recall
• +5.37% in MRR
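A hedged sketch of this text pipeline with scikit-learn standing in for the authors' tooling; the `items` list is illustrative data, not the real ad corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data; the real input is the multilingual ad corpus.
items = [("used city bike", "blue city bike in good condition"),
         ("winter tires", "four winter tires, one season used")]
texts = [title + " " + desc for title, desc in items]   # concat per item
vectorizer = TfidfVectorizer(ngram_range=(2, 2))        # TF-IDF bi-grams
item_features = vectorizer.fit_transform(texts)         # sparse matrix
```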
Summary
• Incorporated image and text features into session-based recommendations using RNNs
• Naive inclusion is not effective
• P-RNN architectures
• Training methods for P-RNNs
Future work
• Revisit feature extraction
• Try representations of the whole video
• Use multiple aspects for items
Thanks!
Q&A