Parallel RNN Architectures for Feature-rich Session Recommendations

Parallel Recurrent Neural Network
Architectures for Feature-rich
Session-based Recommendations
Balázs Hidasi, Gravity R&D (@balazshidasi)
Massimo Quadrana, Politecnico di Milano (@mxqdr)
Alexandros Karatzoglou, Telefonica Research
(@alexk_z)
Domonkos Tikk, Gravity R&D (@domonkostikk)

Session-based recommendations
• Permanent (user) cold-start
 User identification
 Changing goal/intent
 Disjoint sessions
• Item-to-Session recommendations
 Recommend to previous events
• Next event prediction

GRU4Rec [Hidasi et. al. ICLR 2016]
• Gated Recurrent Unit
 𝑧𝑡 = 𝜎 𝑊 𝑧
𝑥𝑡 + 𝑈 𝑧
ℎ 𝑡−1
 𝑟𝑡 = 𝜎 𝑊 𝑟
𝑥𝑡 + 𝑈 𝑟
ℎ 𝑡−1
 ℎ 𝑡 = 𝜎 𝑊𝑥𝑡 + 𝑈(𝑟𝑡∘ ℎ 𝑡−1)
 ℎ 𝑡 = 1 − 𝑧𝑡 ℎ 𝑡−1 + 𝑧𝑡ℎ 𝑡
• Adapted to the recommendation problem
 Session-parallel mini-batches
 Ranking loss: TOP1=
1
𝑁 𝑆
𝑗=1
𝑁 𝑆
𝜎 𝑟𝑠,𝑖 − 𝑟𝑠,𝑗 + 𝜎 𝑟𝑠,𝑗
2
 Sampling the output
ℎℎ
𝑥𝑡
ℎ 𝑡
𝑧
𝑖1 𝑖5 𝑖8
𝑦1
1
𝑦2
1
𝑦3
1 𝑦4
1
𝑦5
1
𝑦6
1
𝑦7
1
𝑦8
1
𝑦1
3
𝑦2
3
𝑦3
3
𝑦4
3
𝑦5
3
𝑦6
3
𝑦7
3
𝑦8
3
𝑦1
2
𝑦2
2
𝑦3
2 𝑦4
2
𝑦5
2
𝑦6
2
𝑦7
2
𝑦8
2
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1
0 0 0 0 1 0 0 0
𝑖1,1 𝑖1,2 𝑖1,3 𝑖1,4
𝑖2,1 𝑖2,2 𝑖2,3
𝑖3,1 𝑖3,2 𝑖3,3 𝑖3,4 𝑖3,5 𝑖3,6
𝑖4,1 𝑖4,2
𝑖5,1 𝑖5,2 𝑖5,3
Session1
Session2
Session3
Session4
Session5
𝑖1,1 𝑖1,2 𝑖1,3
𝑖2,1 𝑖2,2
𝑖3,1 𝑖3,2 𝑖3,3 𝑖3,4 𝑖3,5
𝑖4,1
𝑖5,1 𝑖5,2
Input
Desired
output
…
𝑖1,2 𝑖1,3 𝑖1,4
𝑖2,2 𝑖2,3
𝑖3,2 𝑖3,3 𝑖3,4 𝑖3,5 𝑖3,6
𝑖4,2
𝑖5,2 𝑖5,3
…
GRU layer
One-hot vector
Weighted output
Scores on items
f()
One-hot vector
ItemID (next)
ItemID

Item features & user decisions
• Clicking on recommendations
• Influencing factors
 Prior knowledge of the user
 The stuff they see
o Image (thumbnail, product image, etc)
o Text (title, short description, etc)
o Price
o Item properties
o ...
• Feature extraction from images and text

Naive solution 1: Train on the features
• Item features on the input
• Bad offline accuracy
• Recommendations make sense
 but somewhat general
Image feature vector
GRU layer
Network output
Scores on items
f()
ItemID
One-hot vector
ItemID (next)

Naive solution 2: Concat. session & feature info
• Concatenate one-hot
vector of the ID and the
feature vector of the item
• Recommendation
accuracy similar to the ID
only network
• ID input is initially stronger
signal on the next item
 Dominates the image input
 Network learns to focus on
weights of the ID
wID wimage
GRU layer
One-hot vector
Weighted output
Scores on items
f()
One-hot vector
ItemID (next)
ItemID

Parallel RNN
• Separate RNNs for
each input type
 ID
 Image feature vector
 ...
• Subnets model the
sessions separately
• Concatenate hidden
layers
 Optional scaling /
weighting
• Output computed from
the concat. hidden
state
wID wimage
GRU layer
Subnet output
GRU layer
One-hot vector
Subnet output
Weighted output
Scores on items
f()
ItemID
One-hot vector
ItemID (next)

Training
• Backpropagate through the network in one go
• Accuracy is slightly better than for the ID only
network
• Subnets learn similar aspects of the sessions
 Some of the model capacity is wasted

Alternative training methods
• Force the networks to learn different aspects
 Complement each other
• Inspired by ALS and ensemble learning
• Train one subnet at a time and fix the others
 Alternate between them
• Depending on the frequency of alternation
 Residual training (train fully)
 Alternating training (per epoch)
 Interleaving training (per minibatch)

Experiments on video thumbnails
• Online video site
 712,824 items
 Training: 17,419,964 sessions; 69,312,698 events
 Test: 216,725 sessions; 921,202 events
o Target score compared to only the 50K most popular items
• Feature extraction
 GoogLeNet (Caffe) pretrained on ImageNet dataset
 Features: values of the last avg. pooling layer
o Dense vector of 1024 values
o Length normalized to 1

Results on video thumbnails
• Low number of hidden
units
 MRR improvements
o +18.72% (vs item-kNN)
o +15.41% (vs ID only, 100)
o +14.40% (vs ID only, 200)
• High number of hidden
units
 500+500 for p-RNN
 1000 for others
 Diminishing return for
increasing the number of
epochs or the hidden units
 +5-7% in MRR
Method #Hidden
units
Recall@2
0
MRR@2
0
Item-kNN - 0.6263 0.3740
ID only 100 0.6831 0.3847
ID only 200 0.6963 0.3881
Feature only 100 0.5367 0.3065
Concat 100 0.6766 0.3850
P-RNN (naive) 100+100 0.6765 0.4014
P-RNN (res) 100+100 0.7028 0.4440
P-RNN (alt) 100+100 0.6874 0.4331
P-RNN (int) 100+100 0.7040 0.4361

Experiments on text
• Classified ad site
 339,055 items
 Train: 1,173,094 sessions; 9,011,321 events
 Test: 35,741 sessions; 254,857 events
o Target score compared with all items
 Multilanguage user generated titles and item descriptions
• Feature extraction
 Concatenated title and description
 Bi-grams with TF-IDF weighting
o Sparse vectors of 1,099,425 length; 5.44 avg. non-zeroes
• Experiments with high number of
hidden units
• +3.4% in recall
• +5.37% in MRR

Summary
• Incorporated image and text features into session-
based recommendations using RNNs
• Naive inclusion is not effective
• P-RNN architectures
• Training methods for P-RNNs

Future work
• Revisit feature extraction
• Try representations of the whole video
• Use multiple aspects for items

Parallel RNN Architectures for Feature-rich Session Recommendations

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Parallel RNN Architectures for Feature-rich Session Recommendations

Similar to Parallel RNN Architectures for Feature-rich Session Recommendations (20)

More from Balázs Hidasi

More from Balázs Hidasi (9)

Recently uploaded

Recently uploaded (20)

Parallel RNN Architectures for Feature-rich Session Recommendations