This document summarizes the research paper "Session-Based Recommendations with Recurrent Neural Networks" (Balazs Hidasi, Alexandros Karatzoglou et al.). It introduces the factor-model and neighborhood approaches commonly used for recommendation, then discusses the limits of these approaches on session data and why RNNs are well suited to sequential session information. The proposed model uses GRUs to process item sequences, outputs predicted item scores, and is trained on mini-batches with a ranking loss to optimize next-item prediction. Experiments evaluate the model on two datasets.
Artificial Neural Networks have been used very successfully in several machine learning applications and are often the building blocks of deep learning systems. We discuss the hypothesis, training with backpropagation, update methods, and regularization techniques.
Artificial Intelligence, Machine Learning and Deep Learning (Sujit Pal)
Slides for a talk Abhishek Sharma and I gave at the Gennovation tech talks (https://gennovationtalks.com/) at Genesis. The talk was part of outreach for the Deep Learning Enthusiasts meetup group in San Francisco. My part of the talk is covered in slides 19-34.
With the explosive growth of online information, recommender systems have become an effective tool to overcome information overload and promote sales. In recent years, deep learning's revolutionary advances in speech recognition, image analysis and natural language processing have gained significant attention. Recent studies also demonstrate its efficacy in information retrieval and recommendation tasks. Applying deep learning techniques to recommender systems has been gaining momentum due to its state-of-the-art performance. In this talk, I will present recent developments in deep learning based recommender models and highlight some future challenges and open issues in this research field.
This presentation is a part of ML Course and this deals with some of the basic concepts such as different types of learning, definitions of classification and regression, decision surfaces etc. This slide set also outlines the Perceptron Learning algorithm as a starter to other complex models to follow in the rest of the course.
Overview of TensorFlow For Natural Language Processing (ananth)
TensorFlow, recently open-sourced by Google, is one of the key frameworks that support the development of deep learning architectures. In this slide set, part 1, we get started with a few basic primitives of TensorFlow. We will also discuss when and when not to use TensorFlow.
Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as items to be recommended, in response to a user's need. LTR models typically employ training data, such as human relevance labels and click data, to discriminatively train towards an IR objective. The focus of this tutorial will be on the fundamentals of neural networks and their applications to learning to rank.
In this presentation we describe the formulation of the HMM as consisting of hidden states that generate the observables. We introduce the 3 basic problems: finding the probability of a sequence of observations given the model; the decoding problem of finding the hidden states given the observations and the model; and the training problem of determining the model parameters that generate the given observations. We discuss the Forward, Backward, Viterbi and Forward-Backward algorithms.
Generative Adversarial Networks: Basic architecture and variants (ananth)
In this presentation we review the fundamentals behind GANs and look at different variants. We quickly review the theory, such as the cost functions, training procedure and challenges, and go on to look at variants such as CycleGAN, SAGAN, etc.
This is the first lecture on Applied Machine Learning. The course focuses on the emerging and modern aspects of this subject, such as Deep Learning, Recurrent and Recursive Neural Networks (RNN), Long Short Term Memory (LSTM), Convolutional Neural Networks (CNN), and Hidden Markov Models (HMM). It deals with several application areas, such as Natural Language Processing and Image Understanding. This presentation provides the landscape.
Artificial Intelligence Course: Linear models (ananth)
In this presentation we present the linear models: regression and classification, illustrated with several examples. Concepts such as underfitting (bias) and overfitting (variance) are presented. Linear models can be used as stand-alone classifiers in simple cases, and they are essential building blocks of larger deep learning networks.
Adversarial Reinforced Learning for Unsupervised Domain Adaptation (taeseon ryu)
Hello, this is the deep learning paper reading group. Today's paper review video covers "Adversarial Reinforced Learning for Unsupervised Domain Adaptation", presented at WACV 2021.
Automating data classification requires a large amount of training data. Domain adaptation, which reuses a model trained on labeled data and applies it to a new domain, has therefore attracted a great deal of attention.
The paper has three main contributions.
First, it proposes a framework that performs domain adaptation in an unsupervised manner using a GAN, where a reinforcement learning model is used to select the optimal feature pairs between the source and target domains.
Second, to find the most suitable features in the unlabeled target domain, it develops a policy that uses the correlation between source and target as the reward.
Finally, the proposed adversarial reinforcement learning model improves performance over the state of the art by searching for feature pairs that minimize the distance between the source and target domains and by learning to align the distributions of the two domains.
Lee Geun-bae of the fundamentals team contributed a great deal to this detailed review of the paper!
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur... (Alessandro Suglia)
Presentation for "Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neural Networks" at the 7th Italian Information Retrieval Workshop.
See paper: http://ceur-ws.org/Vol-1653/paper_11.pdf
These are the slides from the workshop "Introduction to Machine Learning with R", which I gave at the University of Heidelberg, Germany on June 28th 2018.
The accompanying code to generate all plots in these slides (plus additional code) can be found on my blog: https://shirinsplayground.netlify.com/2018/06/intro_to_ml_workshop_heidelberg/
The workshop covered the basics of machine learning. With an example dataset I went through a standard machine learning workflow in R with the packages caret and h2o:
- reading in data
- exploratory data analysis
- missingness
- feature engineering
- training and test split
- model training with Random Forests, Gradient Boosting, Neural Nets, etc.
- hyperparameter tuning
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions and backpropagation, and then a quick dive into CNNs. Basic knowledge of vectors, matrices, and elementary calculus (derivatives) is helpful in order to derive the maximum benefit from this session.
Next we'll see a simple neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. (Bonus points if you know Zorn's Lemma, the Well-Ordering Theorem, and the Axiom of Choice.)
IMAGE CLASSIFICATION USING DIFFERENT CLASSICAL APPROACHES (Vikash Kumar)
Image classification using the KNN, Random Forest and SVM algorithms on glaucoma datasets, explaining the accuracy, sensitivity, and specificity of each algorithm.
An introduction to Deep Learning concepts, with a simple yet complete neural network, CNNs, followed by rudimentary concepts of Keras and TensorFlow, and some simple code fragments.
Methodological study of opinion mining and sentiment analysis techniques (ijsc)
Decision making, at both the individual and organizational level, is always accompanied by a search for others' opinions. Opinion-rich resources such as reviews, forum discussions, blogs, micro-blogs and Twitter provide a rich anthology of sentiments. This user-generated content can serve as a boon to the market if its semantic orientations are deliberated. Opinion mining and sentiment analysis are the formalizations for studying and construing opinions and sentiments. The digital ecosystem has itself paved the way for the use of the huge volume of opinionated data recorded. This paper is an attempt to review and evaluate the various techniques used for opinion and sentiment analysis.
Learning to rank (LTR) for information retrieval (IR) involves the application of machine learning models to rank artifacts, such as webpages, in response to a user's need, which may be expressed as a query. LTR models typically employ training data, such as human relevance labels and click data, to discriminatively train towards an IR objective. The focus of this lecture will be on the fundamentals of neural networks and their applications to learning to rank.
This slide deck introduces Deep Learning concepts, such as gradient descent, backpropagation, activation functions, and CNNs. Basic knowledge of vectors, matrices, and Android, as well as elementary calculus (derivatives), is strongly recommended in order to derive the maximum benefit from this session.
A fast-paced introduction to Deep Learning that starts with a simple yet complete neural network (no frameworks), followed by an overview of activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Next we'll create a neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. For best results, familiarity with basic vectors and matrices, inner (aka "dot") products of vectors, and rudimentary Python is definitely helpful.
A fast-paced introduction to Deep Learning concepts, such as activation functions, cost functions, back propagation, and then a quick dive into CNNs. Basic knowledge of vectors, matrices, and derivatives is helpful in order to derive the maximum benefit from this session.
New Approach of Preprocessing For Numeral Recognition (IJERA Editor)
The present paper proposes a new approach to preprocessing for handwritten, printed and isolated numeral characters. The new approach reduces the size of the input image of each numeral by discarding redundant information. This method also reduces the number of features in the attribute vector produced by the feature-extraction method. Numeral recognition is carried out in this work using k-nearest-neighbors and multilayer perceptron techniques. The simulations obtained a good recognition rate in less running time.
This presentation focuses on Deep Learning (DL) concepts, such as neural networks, backprop, activation functions, and Convolutional Neural Networks. You'll also learn how to incorporate Deep Learning in Android applications. Basic knowledge of matrices is helpful for this session, which is targeted primarily to beginners.
An introductory, illustrative, but precise slide deck on the mathematics of neural networks (densely connected layers).
Please download it and view its animations with PowerPoint.
*This slide deck is not finished yet. If you like it, please give me some feedback to motivate me.
I made these slides as an intern at DATANOMIQ GmbH.
URL: https://www.datanomiq.de/
An introduction to Deep Learning (DL) concepts, starting with a simple yet complete neural network (no frameworks), followed by aspects of deep neural networks, such as back propagation, activation functions, CNNs, and the AUT theorem. Next, a quick introduction to TensorFlow and Tensorboard, and then some code samples with Scala and TensorFlow.
Similar to Session-Based Recommendations with Recurrent Neural Networks (Balazs Hidasi, Alexandros Karatzoglou et al.)
2. Table of Contents
Backgrounds
Factor model approach in recommender systems
Neighborhood approach in recommender systems
Recurrent Neural Networks and the GRU (Gated Recurrent Unit)
RNNs in session-based recommender systems
Structure of the proposed model
Session-based mini-batches
Ranking loss
Experiments and discussion
Conclusion
What makes this paper great?
4. Factor model based recommender systems
Represent users and items numerically in a latent space.
Ex)
Represent a user U as the vector u = (0.7, 1.3, -0.5, 0.6)^T
Represent an item I as the vector i = (2.05, 1.2, 2.6, 3.9)^T
Targets (what we want to predict) are computed from the numerical representations of the user, the item, and other content information.
Ex)
Predicted rating of user U on item I:
r_{u,i} = dot(u, i) = u^T i = 0.7·2.05 + 1.3·1.2 - 0.5·2.6 + 0.6·3.9 = 4.035
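The dot-product prediction above can be sketched in a couple of lines (a minimal NumPy version of the slide's worked example; the variable names are mine):

```python
import numpy as np

# The latent vectors from the example above.
u = np.array([0.7, 1.3, -0.5, 0.6])   # user U
i = np.array([2.05, 1.2, 2.6, 3.9])   # item I

r_ui = u @ i  # predicted rating: the dot product u^T i, ≈ 4.035
```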
5. Neighborhood based recommender systems
The rating of user U on item I is predicted from how U's neighbors rated item I.
Determining a user's neighborhood is important; finding users similar to a given user is a big issue.
Targets are computed as a weighted, normalized sum of the neighbors' ratings, where similarity is used as the weight.
Ex)
Predicted rating of user U on item I:
r_{u,i} = sum_{users}(similarity · rating) / sum_{users}(similarity)
        = (0.7·3 + 0.3·4 + 0.05·2) / (0.7 + 0.3 + 0.05) ≈ 3.24

| User | Similarity with user U | Rating on item I |
| A    | 70%                    | 3                |
| B    | 30%                    | 4                |
| C    | 5%                     | 2                |
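The weighted average from the table can be sketched as follows (a plain-Python version of the slide's example; the `neighbors` structure is my own choice):

```python
# Neighbors of user U: (similarity, rating on item I), from the table above.
neighbors = [(0.70, 3), (0.30, 4), (0.05, 2)]

# Similarity-weighted, normalized average of the neighbors' ratings.
r_ui = sum(w * r for w, r in neighbors) / sum(w for w, _ in neighbors)
# r_ui ≈ 3.24
```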
6. Limits of the factor model in session-based recommender systems
The same user in different sessions is treated as a different user.
It is hard to construct user profiles, so user profiles are lacking.
Neighborhood based recommender systems still work:
computing similarities between items is based on co-occurrences of items within sessions (user profiles).
In session-based recommender systems, neighborhood methods are therefore used extensively.
7. Recurrent Neural Networks and the GRU
A Recurrent Neural Network is a kind of network that takes
sequential input
  e.g., text sentences, or the series of actions a user takes on the web
and predicts arbitrary targets (mainly the next element/action in the sequential data)
  e.g., the sentiment of a given sentence, or which page a user will visit next
Gated Recurrent Unit (GRU) (Cho et al., 2014)
  Designed, like the LSTM, to solve the gradient vanishing/exploding problem in RNNs
  Trains faster than the LSTM because it has fewer parameters
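One GRU update can be sketched as follows (a minimal NumPy version of the standard gate equations from Cho et al., 2014; the parameter names and the omission of bias terms are my simplifications):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, p):
    """One GRU step; p holds the weight matrices, biases omitted for brevity."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1 - z) * h + z * h_cand                     # blend old and new state

rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(3, 3))
     for k in ["Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
h = gru_step(rng.normal(size=3), np.zeros(3), p)        # hidden state after one step
```

The gates are what let the GRU carry information across many steps without the repeated-multiplication problem discussed later in the deck.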
8. Abstract view of a Recurrent Neural Network (1)
An RNN layer takes two inputs:
  the given input, and
  the hidden state from the previous step (initially zero).
The input and the hidden state determine the state of the RNN layer.
An RNN layer produces two outputs:
  the output, and
  the hidden state passed to the next step.
An RNN can span sequences of arbitrary length.
We can train using only the last output, the whole output sequence, or some subset of it.
[Diagram: two unrolled steps; each RNN layer takes Input 1 / Input 2 plus the hidden state (h1, initially zero, then h2) and produces Output 1 / Output 2.]
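The unrolling in the diagram can be sketched like this (a toy NumPy RNN step; the shapes, initialization, and use of a plain tanh cell are my own choices):

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    # New hidden state from the current input and the previous hidden state.
    return np.tanh(Wx @ x + Wh @ h)

rng = np.random.default_rng(1)
Wx = rng.normal(scale=0.5, size=(3, 4))
Wh = rng.normal(scale=0.5, size=(3, 3))

h = np.zeros(3)                     # hidden state, initially zero
outputs = []
for x in rng.normal(size=(5, 4)):   # a sequence of 5 inputs of dimension 4
    h = rnn_step(x, h, Wx, Wh)      # h is passed on to the next step
    outputs.append(h)               # train on the last output, all outputs, or a subset
```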
12. Gradient vanishing/exploding in deep learning
(We presume familiarity with basic linear algebra.)
Repeated matrix-vector multiplication can be dangerous:
W x_t = x_{t+1}
Suppose x_0 = v_1 + v_2 + ... + v_k, where v_1, v_2, ..., v_k are eigenvectors of W.
(This is true in most cases.) Then
W x_0 = λ_1 v_1 + λ_2 v_2 + ... + λ_k v_k, where λ_j is the eigenvalue of W for v_j, and
x_n = W x_{n-1} = λ_1^n v_1 + λ_2^n v_2 + ... + λ_k^n v_k.
If the largest eigenvalue is > 1 in magnitude, x_n goes to infinity.
If the largest eigenvalue is < 1 in magnitude, x_n goes to zero.
In both cases, training becomes infeasible.
The LSTM, the GRU, and other RNN variants are designed to solve this problem while preserving long-term dependencies.
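The effect of the eigenvalues is easy to see numerically (a small demonstration under my own choice of matrices, using a scaled identity so all eigenvalues equal `scale`):

```python
import numpy as np

norms = {}
for name, scale in [("exploding", 1.1), ("vanishing", 0.9)]:
    W = scale * np.eye(2)      # both eigenvalues equal `scale`
    x = np.ones(2)             # x_0
    for _ in range(100):       # x_n = W x_{n-1}
        x = W @ x
    norms[name] = np.linalg.norm(x)

# norms["exploding"] is huge (1.1**100 ≈ 1.4e4 per component);
# norms["vanishing"] is tiny (0.9**100 ≈ 2.7e-5 per component).
```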
14. Structure of the proposed model
Input: the sequence of items a user has viewed
  i_{1,t1}, i_{1,t2}, ..., i_{1,tk}
Output: (the probability distribution over) the items the user will view next
  p_{1,t2}, p_{1,t3}, ..., p_{1,t(k+1)}
i_{1,t1} is an item id; p_{1,t2} is the probability distribution over the items user 1 will view at time t2.
Ex) Over items 1-5, if p_{1,t2} = (0.2, 0.3, 0.1, 0.1, 0.3), then the probability of viewing item 1 is 0.2, item 2 is 0.3, item 3 is 0.1, ...
Architecture: Input (one-hot encoded vector) → Embedding layer → GRU layer → GRU layer → ... → GRU layer → Feedforward layer → Output (scores on items)
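A forward pass through this pipeline can be sketched as follows. This is a toy NumPy version: all sizes and names are illustrative, and the GRU layer is replaced by a plain tanh recurrence to keep the sketch short; the structure (embedding lookup → recurrent layer → feedforward layer → distribution over items) matches the slide.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Toy dimensions (my choices): 5 items, 4-dim embeddings, 3-dim hidden state.
rng = np.random.default_rng(2)
E = rng.normal(scale=0.1, size=(5, 4))       # embedding: one row per item
Wx = rng.normal(scale=0.1, size=(3, 4))
Wh = rng.normal(scale=0.1, size=(3, 3))
W_out = rng.normal(scale=0.1, size=(5, 3))   # feedforward layer to item scores

h = np.zeros(3)
for item_id in [0, 3, 1]:         # the session: items the user viewed, in order
    x = E[item_id]                # embedding lookup (= one-hot vector @ E)
    h = np.tanh(Wx @ x + Wh @ h)  # recurrent layer (stands in for the GRU)
    p_next = softmax(W_out @ h)   # distribution over the next item
```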
15. Structure of the proposed model
One-hot vector
  An input vector whose length equals the number of items, where only the element corresponding to the active item is one.
Embedding
  Assigns a trainable vector to every item.
  (A model with embedding performs worse.)
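The one-hot encoding described above, as a quick sketch (item ids and vector length are illustrative):

```python
import numpy as np

n_items = 5
active_item = 2                 # index of the item the user just clicked
one_hot = np.zeros(n_items)     # vector length = number of items
one_hot[active_item] = 1.0      # only the active item's element is one
# one_hot -> [0., 0., 1., 0., 0.]
```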
22. Datasets
| Dataset    | RecSys 2015 (RSC15) | OTT video (VIDEO) |
| # sessions | 15,324              | ~37k              |
| # items    | 37,483              | ~330k             |
| # clicks   | 71,222              | ~180k             |

Preprocessing
  Remove items from the test set that do not appear in the training set.
  Remove sessions of length 1.
  Do not split a session's sequence between the training set and the test set.
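The preprocessing steps above can be sketched as a small helper (hypothetical function and data layout of my own; sessions are lists of item ids):

```python
def preprocess(train_sessions, test_sessions):
    # Hypothetical helper mirroring the slide's preprocessing steps.
    train_items = {item for s in train_sessions for item in s}
    # Remove test-set items that never appear in the training set.
    test = [[item for item in s if item in train_items] for s in test_sessions]
    # Remove sessions of length 1 (nothing to predict from them).
    train = [s for s in train_sessions if len(s) > 1]
    test = [s for s in test if len(s) > 1]
    return train, test

train, test = preprocess([[1, 2, 3], [4]], [[2, 9, 3], [1]])
# train == [[1, 2, 3]]; test == [[2, 3]]
```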
23. Evaluation measures
Recall@k
  The system sorts its candidates and takes the top k; Recall@k is 1 if the correct answer is among them, 0 otherwise.
  Ex) The desired answer is "cat", but the computer's top 3 candidates are [chicken, dog, horse]: Recall@3 = 0.
  Ex) The desired answer is "pizza", and the computer's top 3 candidates are [chicken, dog, pizza]: Recall@3 = 1.
MRR@k
  The system sorts its candidates and reports the reciprocal of the rank of the correct answer within the top k; if the answer is outside the top k, MRR@k = 0.
  Ex) The computer tries to guess my hair color in 3 chances and says red (1st), black (2nd), yellow (3rd). My hair color is black, so MRR@3 = 1/2 = 0.5.
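Both measures from the examples above can be written directly (plain-Python sketch; function names are mine):

```python
def recall_at_k(ranked, answer, k):
    # 1 if the correct answer appears among the top-k candidates, else 0.
    return int(answer in ranked[:k])

def mrr_at_k(ranked, answer, k):
    # Reciprocal rank of the correct answer within the top-k, else 0.
    top = ranked[:k]
    return 1.0 / (top.index(answer) + 1) if answer in top else 0.0

recall_at_k(["chicken", "dog", "horse"], "cat", 3)   # 0
recall_at_k(["chicken", "dog", "pizza"], "pizza", 3) # 1
mrr_at_k(["red", "black", "yellow"], "black", 3)     # 0.5
```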
24. Recall@20 and MRR@20 using baseline methods
| Baseline | RSC15 Recall@20 | RSC15 MRR@20 | VIDEO Recall@20 | VIDEO MRR@20 |
| POP      | 0.0050          | 0.0012       | 0.0499          | 0.0117       |
| S-POP    | 0.2672          | 0.1775       | 0.1301          | 0.0863       |
| Item-KNN | 0.5065          | 0.2048       | 0.5598          | 0.3381       |
| BPR-MF   | 0.2574          | 0.0618       | 0.0692          | 0.0374       |
25. Recall@20 and MRR@20 for different types of single-layer GRU
(# Units = length of the hidden state vector h_i)
| Loss function | # Units | RSC15 Recall@20 | RSC15 MRR@20 | VIDEO Recall@20 | VIDEO MRR@20 |
| TOP1          | 100     | 0.5853          | 0.2305       | 0.6141          | 0.3511       |
| BPR           | 100     | 0.6069          | 0.2407       | 0.5999          | 0.3260       |
| Cross-Entropy | 100     | 0.6074          | 0.2430       | 0.6372          | 0.3720       |
| TOP1          | 1000    | 0.6206          | 0.2693       | 0.6624          | 0.3891       |
| BPR           | 1000    | 0.6322          | 0.2467      | 0.6311          | 0.3136       |
| Cross-Entropy | 1000    | 0.5777          | 0.2153       | N/A             | N/A          |
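The TOP1 and BPR ranking losses compared above can be sketched as follows (a NumPy rendering of the paper's formulas; `r_pos` is the score of the target item, `r_negs` the scores of sampled negative items, and the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bpr_loss(r_pos, r_negs):
    # BPR: mean of -log sigmoid(target score - negative score).
    return float(np.mean(-np.log(sigmoid(r_pos - r_negs))))

def top1_loss(r_pos, r_negs):
    # TOP1: push negative scores below the target score, plus a term
    # that regularizes the negative scores toward zero.
    return float(np.mean(sigmoid(r_negs - r_pos) + sigmoid(r_negs ** 2)))
```

Both losses decrease as the target item's score rises above the negatives, which is what makes them ranking losses rather than pointwise losses.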
26. Discussion
A larger hidden state (more units) gives better performance: 100 < 1000 < 10^4.
The pointwise loss is unstable. (I am not sure whether this means numerically unstable, i.e., overflow or underflow, or that the results are inconsistent.)
Deeper GRU layers improve performance.
Embedding does not help this model.
27. What makes this paper great?
A new parallel training method for RNNs (in recommender systems).
New ranking losses (I think other models can exploit these loss functions).
Performance improvement: 20-25% over the best baseline, Item-KNN.
A novel model framework that solves the session-based recommendation problem with RNNs.