Copyright © 2018 by Kyungwoo Song, Dept. of Industrial and Systems Engineering, KAIST
Kyungwoo Song, Mingi Ji, Sungrae Park, Il-Chul Moon
Department of Industrial and Systems Engineering
KAIST
Hierarchical Context enabled
Recurrent Neural Network for
Recommendation (HCRNN)
kyungwoo.song@gmail.com
Contents
• Motivation
• Related Work
• Methodology
• Experimental Setting
• Results
• Conclusion
• Reference
Motivation
< Example user history (t = 1 … 8) with genres Action, Action, Musical, Musical, Action, Action, Action/Romance, Action, split into Sub-sequence 1 (Action), Sub-sequence 2 (Musical), and Sub-sequence 3 (Action/Romance); a temporary context for the current item, a local context for each subsequence, and a global context for the whole sequence >
• A long user history contains multiple hierarchical contexts: the global context, the local context, and the temporary context.
• The users' interest drift should be considered within the hierarchical context.
• If we consider the hierarchical context:
  • The user's primary interest is action movies.
  • We can recommend an action movie at t = 8 rather than a romance movie.
• 1) How can we model the hierarchical context?
• 2) How can we model the interest drift based on the hierarchical context?
• 3) How can we model the long-term and short-term dependencies?
Related Work
• Session-based Recommendation (GRU4REC)
  • A sequential model with GRUs for recommendation. This model adopts a session-parallel batch and loss functions such as cross-entropy, TOP1, and BPR.
• Neural Attentive Recommendation Machine (NARM)
  • NARM is based on GRU4REC, with an attention mechanism to capture the long-term dependency.
• Short-Term Attention/Memory Priority (STAMP)
  • STAMP considers both the current interest and the general interest of users. In particular, STAMP uses an additional neural network on the current input only, to model the user's current interest.
NARM can be improved
• if we consider both the long-term and the short-term dependencies.
STAMP can be improved
• if we consider structured interest-drift modeling.
• if we consider the interest drift with the hierarchical context.
Methodology (HCRNN overall)
• Global context (for the sequence) : θ, M_global
  • Abstractive context
  • θ : topic proportion
  • M_global : topic memory
• Local context (for a subsequence) : c_t
  • Relatively abstractive context
  • Generated adaptively from the global context
• Temporary context (for the current item) : h_t
  • Specific context
  • Generated by focusing on the current input
1) How can we model the hierarchical context?
• Hierarchical contexts should capture different levels of context
• Separate the generation of the local context and the temporary context
Methodology (HCRNN overall)
1) How can we model the hierarchical context ?
• Hierarchical contexts should capture different levels of context
• Separate the generation of the local context and the temporary context
< LSTM with peephole >
c_t = f_t ⊙ c_{t-1} + i_t ⊙ σ_c(c̃_t)
h_t = o_t ⊙ σ_h(c_t)

< HCRNN >
c_t = (1 - G_t^c) ⊙ c_{t-1} + G_t^c ⊙ c̃_t
h̃_t = r_t ⊙ h_{t-1} W_hh + x_t W_xh + b_h
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ σ_h(h̃_t)

No direct connection between c_t and h_t in HCRNN
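The HCRNN update above can be sketched in code. This is a minimal NumPy illustration: the gates G_t^c, r_t, z_t and the candidate c̃_t are passed in as inputs here, whereas the paper computes them from x_t, h_{t-1}, and the global context.

```python
import numpy as np

def hcrnn_step(x_t, h_prev, c_prev, G_c, r_t, z_t, c_tilde, W_hh, W_xh, b_h):
    """One HCRNN step following the slide's update rules (a sketch;
    the gates and the candidate local context are taken as inputs)."""
    # Local context: convex combination controlled by the gate G_t^c
    c_t = (1.0 - G_c) * c_prev + G_c * c_tilde
    # Candidate temporary context: the reset gate r_t modulates h_{t-1}
    h_tilde = (r_t * h_prev) @ W_hh + x_t @ W_xh + b_h
    # Temporary context: GRU-style interpolation with the update gate z_t
    h_t = (1.0 - z_t) * h_prev + z_t * np.tanh(h_tilde)
    # Note: c_t never enters the h_t update directly (unlike the peephole LSTM)
    return h_t, c_t
```

Note how the two state updates are fully separated: h_t depends on c_t only indirectly, through whatever drives the gates.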
Methodology (HCRNN-1)
• (11)-(12) : Topic proportion for the sequence (variational encoder)
• (13) : Attention weight (which global context vector M_global^k should be used for the current local context)
  • If θ_k is large, the corresponding global context vector M_global^k is used with high importance
• (14)-(16) : Generation of the local context with the local context gate G_t^c
• (17)-(20) : Generation of the temporary context h_t (separated from the local context)
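As a rough sketch of the idea behind Eq. (13), the topic proportion θ can rescale an attention distribution over the K rows of M_global so that high-proportion topics dominate the readout. The scoring form and the matrix W_a below are illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_readout(theta, M_global, h_prev, W_a):
    """Sketch: the topic proportion theta (K,) rescales an attention
    distribution over the K global memory rows M_global (K, d)."""
    scores = M_global @ (W_a @ h_prev)   # relevance of each memory row
    attn = softmax(scores) * theta       # topic proportion rescales attention
    attn = attn / attn.sum()             # renormalize to a distribution
    return attn @ M_global               # weighted readout, shape (d,)
```

A large θ_k pushes the readout toward row k of M_global, matching the slide's "used with high importance" description.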
Methodology (HCRNN-2)
• Interest drift assumption : "If the user's local context (for the sub-sequence) and the current item are very different, the user's temporary interest drift occurs."
  • Local context : c_t
  • Current item : x_t
• x_t ⊙ c_t ↓ ⟹ r_t ↓ ⟹ h_t focuses on the current input instead of h_{t-1}
2) How can we model the interest drift based on the hierarchical context?
• Interest drift assumption
Methodology (HCRNN-3)
2) How can we model the interest drift based on the hierarchical context?
• Interest drift assumption with interest drift gate
• The sigmoid function outputs a value between 0 and 1, so the reset gate of HCRNN-2 in Eq. 21 can theoretically take any value between 0 and 1.
• However, the sigmoid function is not sharp ⟹ r_t in Eq. 21 : 0.47 (± 0.03)
• ⟹ An additional gate makes h_t focus on the current input
• r_t ⊙ G_t^d in HCRNN-3 : 0.29 (± 0.021), 38% smaller than r_t in HCRNN-2
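A minimal sketch of the drift gating: both the reset gate and the additional drift gate read the item/local-context interaction x_t ⊙ c_t, and HCRNN-3 multiplies them. The parameterization below is an assumption for illustration, not Eq. (21) verbatim.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def drift_gates(x_t, c_t, W_r, W_d, b_r, b_d):
    """Sketch of the interest-drift gating: both gates read the
    item/local-context interaction, so a mismatch between the current
    item and the local context pushes them toward zero."""
    interaction = x_t * c_t                 # elementwise match signal
    r_t = sigmoid(interaction @ W_r + b_r)  # reset gate (HCRNN-2)
    G_d = sigmoid(interaction @ W_d + b_d)  # extra drift gate (HCRNN-3)
    return r_t * G_d, r_t                   # HCRNN-3 gate and plain r_t
```

Since both factors lie in (0, 1), r_t ⊙ G_t^d is elementwise no larger than r_t alone, which matches the sharper (smaller) gate values reported on the slide.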
Methodology (HCRNN-3+Bi)
bi-channel attention
• α_t^c : attention based on the local context
  • Emphasizes items belonging to the same sub-sequence as the current one
  • ⟹ Short-term dependency
• α_t^h : attention based on the temporary context
  • Finds similar transitions throughout the entire history
  • ⟹ Long-term dependency
3) How can we model the long-term and short-term dependency?
• bi-channel attention
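The two channels can be sketched as two independent softmax attentions, one over the local contexts c_{1:T} and one over the temporary contexts h_{1:T}. Dot-product scoring with query vectors q_c, q_h is an illustrative assumption; the paper's attention score may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bi_channel_attention(C, H, q_c, q_h):
    """Sketch of bi-channel attention: separate softmax attentions over
    the local contexts C (T, d) and the temporary contexts H (T, d)."""
    alpha_c = softmax(C @ q_c)  # channel 1: local contexts -> short-term
    alpha_h = softmax(H @ q_h)  # channel 2: temporary contexts -> long-term
    return alpha_c, alpha_h, alpha_c @ C, alpha_h @ H
```

Keeping the two distributions separate is what lets α_t^c stay peaked near the current sub-sequence while α_t^h spreads over the whole history.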
Experimental Setting
Data preprocessing
• We aim at modeling long user histories ⇒ we removed sequences whose length is less than 10.
• We removed the items that exist only in the test set.
• We removed the items that appeared fewer than 50/50/25 times in the three datasets, respectively.
• Cross-validation by assigning 10% of the randomly chosen training set as the validation set.

Baselines
• POP
• SPOP
• Item-KNN (RecSys-10)
• BPR-MF (UAI-09)
• GRU4REC (ICLR-16)
• LSTM4REC
• NARM (CIKM-17)
• STAMP (KDD-18)
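The preprocessing rules above can be sketched as a small filter. The function name and signature are my own; the thresholds follow the slide (minimum length 10, minimum counts 50/50/25 per dataset).

```python
from collections import Counter

def preprocess(sequences, min_len=10, min_count=50, train_items=None):
    """Sketch of the slide's preprocessing: drop rare items, items
    unseen in training (for a test split), and short sequences."""
    counts = Counter(item for seq in sequences for item in seq)
    kept = []
    for seq in sequences:
        filtered = [item for item in seq
                    if counts[item] >= min_count
                    and (train_items is None or item in train_items)]
        if len(filtered) >= min_len:  # keep only long histories
            kept.append(filtered)
    return kept
```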
Results (Quantitative Evaluation)
• HCRNN achieves significant performance improvements on all datasets and metrics.
• HCRNN-1 > baselines (NARM, STAMP)
  • ⇒ Hierarchical context modeling is needed in recommendation.
• HCRNN-3 > HCRNN-2, HCRNN-1
  • ⇒ The interest drift assumption may be experimentally justifiable.
• HCRNN-3+Bi > HCRNN-3
  • ⇒ Bi-channel attention with hierarchical contexts may improve the performance experimentally.
Results (Embedding and Context)
• The local context is generated from the global context memory (M_global), and the temporary context is generated from the previous temporary context and the current item embedding (x_t).
• The item embeddings are coherently organized into cohesive clusters of the same genre.
• The global context memory covers most of the area over which the item embeddings are dispersed.
< Visualization of M_global and x_t >
< Interpretation of M_global >
Results (Interest drift assumption)
• If the genre of the current input is different from the previous items, r_t ⊙ G_t^d has a smaller value compared to the opposite situation.
< Gate heatmap (r_t^NARM, r_t^HCRNN, G_t^d) for an example user history; a "check" marks each point where the item genre changes >
< Average value of the r_t^HCRNN ⊙ G_t^d gate after items with a similar genre appear consecutively >
Results (bi-channel attention)
• The NARM attention weight, α_t^NARM, cannot differentiate the attentions on the local and the temporary contexts.
• The bi-channel attention distinguishes the attentions:
  • α_t^(c) focuses on the neighboring items (short-term)
  • α_t^(h) reads out through the whole sequence (long-term)
< Attention heatmap for a user history >
< Averaged attention weight over the time difference Δt, i.e., the difference between the prediction timestep and the timestep of the previous user history >
Results (Case study)
• Attention weights
  • α_t^c focuses on the recent history.
  • α_t^h considers the relatively distant history.
• Gates
  • G_{t=17}^d has a relatively small value.
  • This small value is caused by the selection of items misaligned with the previous sub-sequence at t = 16.
< Attention and gate values in NARM and HCRNN, and the change of the context value in HCRNN over time >
Conclusion
• We propose HCRNN to model the hierarchical contexts and the interest drift assumption for sequential recommendation.
• 1) How can we model the hierarchical context?
  • Hierarchical contexts should capture different levels of context
  • Separate the generation of the local context and the temporary context
• 2) How can we model the interest drift based on the hierarchical context?
  • Interest drift assumption with an interest drift gate
• 3) How can we model the long-term and short-term dependencies?
  • Bi-channel attention with hierarchical contexts
Reference
• Liu, Q.; Zeng, Y.; Mokhosi, R.; and Zhang, H. 2018. STAMP: Short-Term Attention/Memory Priority Model
for Session-based Recommendation. In Proceedings of the 24th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining, 1831–1839. ACM.
• Li, J.; Ren, P.; Chen, Z.; Ren, Z.; Lian, T.; and Ma, J. 2017. Neural attentive session-based
recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge
Management, 1419–1428. ACM.
• Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2016. Session-based recommendations with
recurrent neural networks. International Conference on Learning Representations.
• Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
• Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and
translate. International Conference on Learning Representations.
• Rakhlin, A.; Shamir, O.; and Sridharan, K. 2012. Making gradient descent optimal for strongly convex
stochastic optimization. International Conference on Machine Learning.
• Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 452–461. AUAI Press.
• Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, 2440–2448.
• van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning
research 9(Nov):2579–2605.
• Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. International Conference on
Learning Representations.