Tutorial on sequence-aware recommender systems - UMAP 2018
1. Sequence-Aware Recommender Systems
Massimo Quadrana, Pandora Media, USA
mquadrana@pandora.com
Paolo Cremonesi, Politecnico di Milano
paolo.cremonesi@polimi.it
Dietmar Jannach, AAU Klagenfurt, Austria
dietmar.jannach@aau.at
2. About today’s presenter(s)
• Paolo Cremonesi
– Politecnico di Milano, Italy
• Dietmar Jannach
– University Klagenfurt, Austria
• Massimo Quadrana
– Pandora Media, USA
3. Today's tutorial is based on
• M. Quadrana, P. Cremonesi, D. Jannach,
"Sequence-Aware Recommender Systems",
ACM Computing Surveys, 2018
• Link to paper: bit.ly/sequence-aware-rs
7. Recommender Systems
• A central part of our daily user experience
– They help us locate potentially interesting things
– They serve as filters in times of information overload
– They have an impact on user behavior and business
9. A field with a tradition
The last 18 years in RecSys research:
• 2000: recommendation = matrix completion
• 2010: recommendation = learning to rank
10. Common problem abstraction (1):
matrix completion
• Goal
– Learn missing ratings
• Quality assessment
– Error between true and estimated ratings
        Item1  Item2  Item3  Item4  Item5
Alice     5      ?      4      4      ?
Paolo     3      1      ?      3      3
Susan     ?      3      ?      ?      5
Bob       3      ?      1      5      4
George    ?      5      5      ?      1
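As a minimal sketch of this abstraction (not from the tutorial itself), the toy matrix above can be completed with a small gradient-descent matrix factorization; the factor size, learning rate, and regularization below are illustrative assumptions.

```python
import numpy as np

# Toy rating matrix from the slide (0 = missing rating).
R = np.array([
    [5, 0, 4, 4, 0],   # Alice
    [3, 1, 0, 3, 3],   # Paolo
    [0, 3, 0, 0, 5],   # Susan
    [3, 0, 1, 5, 4],   # Bob
    [0, 5, 5, 0, 1],   # George
], dtype=float)

def factorize(R, k=2, steps=5000, lr=0.01, reg=0.05, seed=0):
    """Estimate missing ratings with gradient-descent matrix factorization."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors
    mask = R > 0                                   # observed entries only
    for _ in range(steps):
        E = (R - P @ Q.T) * mask                   # error on observed ratings
        P += lr * (E @ Q - reg * P)
        Q += lr * (E.T @ P - reg * Q)
    return P @ Q.T

R_hat = factorize(R)
# R_hat[0, 1] is the estimated rating of Alice for Item2
```

Quality would then be assessed as the error between true and estimated ratings on held-out entries, exactly as the slide describes.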
11. Common problem abstraction (2):
learning to rank
• Goal
– Learn relative ranking of items
• Quality assessment
– Error between true and estimated ranking
12. A field with a tradition
• But are we focusing on
the most important problems?
13. Real-world problem situations
• User intent
• Short-term intent/context vs. long term taste
• Order can matter
• Order matters (constraints)
• Interest drift
• Reminders
• Repeated purchases
14. User intent
• Our user searched and listened to
“Last Christmas” by Wham!
• Should we, …
– Play more songs by Wham!?
– More pop Christmas songs?
– More popular songs from the 1980s?
– Play more songs with
controversial user feedback?
16. Short-term intent/context vs. long-term taste
• Here's what the customer purchased during the last weeks
• Now, the user returns to the shop and browses these items
17. Short-term intent/context vs.
long term taste
• What to recommend?
• Some plausible options
– Only shoes
– Mostly Nike shoes
– Maybe also some T-shirts
(figure: products purchased in the past vs. products browsed in the current session)
18. Short-term intent/context vs. long-term taste
• Using the matrix completion formulation
– One trains a model based only on past actions
– Without the context, the algorithm will most probably recommend mostly (Nike) T-shirts and some trousers
– Is this what you expect?
(figure: products purchased in the past vs. products browsed in the current session)
19. Order can matter
• Next-track recommendation
• What to recommend next should suit the previous tracks
20. Order matters (order constraints)
• Is it meaningful to recommend Star Wars III, if
the user has not seen the previous episodes?
21. Interest drift
• Before having a family vs. after having a family
24. Algorithm choice depends on …
• The choice of the best recommender algorithm depends on
– Goal (user task, application domain, context, …)
– Data available
– Target quality
26. Examples based on goal …
• Movie recommendation
– if you watched a SF movie …
– … you recommend another SF movie
• Algorithm: most similar item
27. Examples based on goal …
• On-line store recommendations
– if you bought a coffee maker …
28. Examples based on goal …
• On-line store recommendations
– if you bought a coffee maker …
– … you recommend cups or coffee beans (not another coffee maker)
• Algorithm: frequently bought together ("users who bought … also bought")
29. Example based on data …
• If you have
– all items with many ratings
– no item attributes
• You need CF (collaborative filtering)
• If you have
– no ratings
– all items with many attributes
• You need CBF (content-based filtering)
30. Examples for sequence-aware algorithms …
• Goal
– next-track recommendation, …
• Data
– new users, …
• Need for quality
– context, intent, …
31. Implications for research
• Many of the scenarios cannot be addressed by a
matrix completion or learning-to-rank problem
formulation
– No short-term context
– Only one single user-item interaction pair
– Often only one type of interaction (e.g., ratings)
– Rating timestamps sometimes exist, but might be
disconnected from actually experiencing/using the
item
32. Sequence-Aware Recommenders
• A family of recommenders that
– uses different input data,
– often bases the recommendations on certain types of sequential patterns in the data,
– addresses the mentioned practical problem settings
44. Inputs: differences with traditional RSs
• for each user, all sequences contain
one single action (item)
45. Inputs: differences with traditional RSs (with time-stamp)
• for each user, there is
one single sequence
with several actions (items)
46. Output (1)
• One (or more) ordered lists of items
• The list can have different interpretations,
based on goal, domain, application scenario
• Usual item-ranking
tasks
– list of alternatives
for a given item
– complements or
accessories
47. Output (2)
• An ordered list of items
• The list can have different interpretations,
based on goal, domain, application scenario
• Suggested sequence
of actions
– next-track music
recommendations
48. Output (3)
• An ordered list of items
• The list can have different interpretations,
based on goal, domain, application scenario
• Strict sequence
of actions
– course learning
recommendations
49. Typical computational tasks
• Find sequence-related patterns in the data, e.g.,
– co-occurrence patterns
– sequential patterns
– distance patterns
• Reasoning about order constraints
– weak and strong constraints
• Relate patterns with user profile and current
point in time
– e.g., items that match the current session
50. Abstract problem characterization: item-ranking or list-generation
• Some definitions
– U : users
– I : items
– L : ordered list of items of length k
– L* : set of all possible lists L of length up to k
– f(u, L) : utility function, with u ∈ U and L ∈ L*
• Recommendation task:
L(u) = argmax_{L ∈ L*} f(u, L), for each u ∈ U
– Task: learn f(u, L) from sequences A of past user actions
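For tiny item sets the abstract formulation can be made concrete by brute force: enumerate every list of length up to k and take the argmax of a utility function. The relevance and transition scores below are made-up illustrative values, and `utility` is a hypothetical stand-in for f(u, L).

```python
from itertools import permutations

items = ["a", "b", "c"]

# Hypothetical utility components: per-item relevance plus a bonus for
# good item-to-item transitions (e.g., a logical continuation).
relevance = {"a": 0.9, "b": 0.5, "c": 0.7}
transition = {("a", "c"): 0.4, ("c", "b"): 0.3}

def utility(user, L):
    """Toy f(u, L); the user argument is ignored in this illustration."""
    score = sum(relevance[i] for i in L)
    score += sum(transition.get((x, y), 0.0) for x, y in zip(L, L[1:]))
    return score

def best_list(user, k=2):
    # L(u) = argmax over all lists of length <= k of f(u, L)
    candidates = [p for n in range(1, k + 1) for p in permutations(items, n)]
    return max(candidates, key=lambda L: utility(user, L))

print(best_list("u1"))  # ('a', 'c') under this toy utility
```

Because the utility scores whole lists, list-level aspects such as transitions, order constraints, or diversity fit naturally into this formulation, which is exactly the point of the slide.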
51. Abstract problem characterization
• Utility function is not limited to scoring
individual items
• The utility of entire lists can be assessed
– including, e.g., transition between objects,
fulfillment of order constraints, diversity aspects
• The design of the utility function depends on
the purpose of the system
– e.g., provide logical continuation, show alternatives,
show accessories, …
Jannach, D. and Adomavicius, G.: "Recommendations with a Purpose". In: Proceedings of the 10th ACM
Conference on Recommender Systems (RecSys 2016). Boston, Massachusetts, USA, 2016, pp. 7-10
52. On recommendation purposes
• Often, researchers are not explicit about the
purpose
– Traditionally, could be information filtering or
discovery, with conflicting goals
• Commonly used abstraction
– e.g., predict hidden rating
• For sequence-aware recommenders
– often: predict next (hidden) action, e.g., for a given
session beginning
53. Relation to other areas
• Implicit feedback recommender systems
– Sequence-aware recommenders are often built on
implicit feedback signals (action logs)
– Problem formulation is however not based on matrix
completion
• Context-aware recommender systems
– Sequence-aware recommenders often are special
forms of context-aware systems
– Here: Interactional context is relevant, which is only
implicitly defined through the user’s actions
54. Relation to other areas
• Time-Aware RSs
– Sequence-aware recommenders do not necessarily need explicit timestamps
– Time-Aware RSs use explicit time
• e.g., to detect long-term user drifts
• Other:
– interest drift
– user-modeling
56. Categorization of tasks
• Four main categories
– Context adaptation
– Trend detection
– Repeated recommendation
– Consideration of order constraints and sequential
patterns
• Notes
– Categories are not mutually exclusive
– All types of problems based on the same problem
characterization, but with different utility functions,
and using the data in different ways
57. Context adaptation
• Traditional context-aware recommenders are
often based on the representational context
– defined set of variables and observations, e.g.,
weather, time of the day etc.
• Here, the interactional context is relevant
– no explicit representation of the variables
– contextual situation has to be inferred from user
actions
58. How much past information is
considered?
• Last-N interactions based recommendation:
– Often used in Next-Point-Of-Interest
recommendation scenarios
– In many cases only the very last visited location is
considered
– Also: “Customers who bought … also bought”
• Reasons to limit oneself:
– No more information available
– Previous information not relevant
59. How much past information is
considered?
• Session-based recommendation:
– Short-term only
– Only last sequence of actions of the current user is
known
– User might be anonymous
• Session-aware recommendation:
– Short-term + Long-term
– In addition, past sessions of the current user are
known
– Allows for personalized session-based
recommendation
Quadrana, M., Karatzoglou, A., Hidasi, B., Cremonesi, P.: “Personalizing Session-based Recommendations
with Hierarchical Recurrent Neural Networks”. Proceedings ACM RecSys 2017: 130-137
67. Trend detection
• Less explored than context adaptation
• Community trends:
– Consider the recent or seasonal popularity of items,
e.g., in the fashion domain and, in particular, in the
news domain
• Individual trends:
– E.g., natural interest drift
• Over time, because of influence of other people, because of
a recent purchase, because something new was discovered
(e.g., a new artist)
Jannach, D., Ludewig, M. and Lerche, L.: "Session-based Item Recommendation in E-Commerce: On Short-
Term Intents, Reminders, Trends, and Discounts". User-Modeling and User-Adapted Interaction, Vol. 27(3-
5). Springer, 2017, pp. 351-392
68. Repeated Recommendation
• Identifying repeated user behavior patterns:
– Recommendation of repeat purchases or actions
• E.g., ink for a printer, next app to open after call on mobile
– Patterns can be mined from the individual or the
community as a whole
• Repeated recommendations as reminders
– Remind users of things they found interesting in the past
• To remind them of things they might have forgotten
• As navigational shortcuts, e.g., in a decision situation
• Timing is an interesting question in both situations
69. Consideration of Order Constraints
and Observed Sequential Patterns
• Two types of sequentiality information
– External domain knowledge: strict or weak ordering
constraints
• Strict, e.g., sequence of learning courses
• Weak, e.g., when recommending sequels to movies
– Information that is mined from the user behavior
• Learn that one movie is always consumed after another
• Predict next web page, e.g., using sequential pattern mining
techniques
70. Review of existing works (1)
• 100+ papers reviewed
– Focus on last N interactions common
• But more emphasis on session-based / session-aware
approaches in recent years
71. Review of existing works (2)
• 100+ papers reviewed
– Very limited work for other problems:
• Repeated recommendation, trend detection, consideration
of constraints
73. Summary of first part
• Matrix completion abstraction not well suited
for many practical problems
• In reality, rich user interaction logs are available
• Different types of information can be derived
from the sequential information in the logs
– and used for special recommendation tasks, in
particular for the prediction of the next-action
• Coming next:
– Algorithms and evaluation
77. Sequence Learning (SL)
• Useful in application domains where input data
has an inherent sequential nature
– Natural Language Processing
– Time-series prediction
– DNA modelling
– Sequence-Aware Recommendation
78. SL: Frequent Pattern Mining (FPM)
1. Discover user consumption patterns
• e.g., Association Rules, (Contiguous) Sequential Patterns
2. Look for patterns matching partial transactions
3. Rank items by confidence of matched rules
• FPM Applications
– Page prefetching and recommendations
– Personalized FPM for next-item recommendation
– Next-app prediction
79. SL: Frequent Pattern Mining (FPM)
• Pros
– Easy to implement
– Explainable predictions
• Cons
– Choice of the minimum support/confidence
thresholds
– Data sparsity
– Limited scalability
80. SL: Sequence Modeling
• Sequences of past user actions as
time series with discrete observations
– Timestamps used only to order user actions
• SM aims to learn models from past observations
(user actions) to predict future ones
• Categories of SM models
– Markov Models
– Reinforcement Learning
– Recurrent Neural Networks
81. SL: Sequence Modeling
Markov Models
• Stochastic processes over discrete random variables
– Finite history (= order of the model): user actions depend on a limited number of the most recent actions
• Applications
– Online shops
– Playlist generation
– Variable-Order Markov Models for news recommendation
– Hidden Markov Models for contextual next-track prediction
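A first-order Markov model of this kind reduces to counting item-to-item transitions; the sketch below (toy sessions with hypothetical items) predicts the next item from the current one only.

```python
from collections import defaultdict

# Toy session logs with hypothetical items; a first-order Markov model
# only conditions on the current item (finite history of order 1).
sessions = [["a", "b", "c"], ["a", "b", "d"], ["b", "c", "a"]]

# Count item-to-item transitions observed across all sessions.
counts = defaultdict(lambda: defaultdict(int))
for s in sessions:
    for cur, nxt in zip(s, s[1:]):
        counts[cur][nxt] += 1

def predict_next(item):
    """Most probable next item and its estimated transition probability."""
    followers = counts[item]
    total = sum(followers.values())
    best = max(followers, key=followers.get)
    return best, followers[best] / total

predict_next("b")   # ("c", 2/3): "c" followed "b" twice, "d" once
```

A higher-order model would simply key the counts on tuples of the last N items instead of a single item, at the cost of much sparser statistics.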
82. SL: Sequence Modeling
Reinforcement Learning
• Learn by sequential interactions with the
environment
• Generate recommendations (actions)
and collect user feedback (reward)
• Markov Decision Processes (MDPs)
• Applications
– Online e-commerce services
– Sequential relationships between the attributes of
items explored in a user session
83. SL: Sequence Modeling
Recurrent Neural Networks (RNN)
• Distributed real-valued hidden state models
with non-linear dynamics
– Hidden state: latent representation of user state
within/across sessions
– Update the hidden state from the current input and its
previous value, then use it to predict the probability
of the next action
• Trained end-to-end with gradient descent
• Applications
– Next-click prediction with RNNs
84. SL: Distributed Item Representations
• Dense, lower-dimensional representations of
the items
– derived from sequences of events
– preserve the sequential relationships between items
• Similar to latent factor models
– “similar” items are projected into similar vectors
– every item is associated with a real-valued
embedding vector
– its projection into a lower-dimensional space
– certain item transition properties are preserved
• e.g., co-occurrence of items in similar contexts
86. SL: Distributed Item Representations – Prod2Vec
• v_i = vector representing item i
• i_k = item at step k of the session
• Learn vectors v_i in order to maximize the joint probability of having items i_{k-2}, i_{k-1}, i_{k+1}, i_{k+2} given item i_k
• Sliding window of size N (e.g., N = 5: positions k−2, k−1, k, k+1, k+2)
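The sliding-window objective can be illustrated by extracting the (center, context) training pairs; an actual Prod2Vec model would then learn the vectors v_i from these pairs with word2vec-style training, which is omitted in this sketch.

```python
def skipgram_pairs(session, window=2):
    """(center, context) training pairs as in Prod2Vec: item i_k is paired
    with i_{k-2}, i_{k-1}, i_{k+1}, i_{k+2} for window=2 (window size N=5)."""
    pairs = []
    for k, center in enumerate(session):
        lo, hi = max(0, k - window), min(len(session), k + window + 1)
        pairs += [(center, session[j]) for j in range(lo, hi) if j != k]
    return pairs

# Hypothetical four-item session, window of one item on each side.
print(skipgram_pairs(["a", "b", "c", "d"], window=1))
```

Maximizing the probability of each pair's context item given its center item (e.g., via negative sampling) is what makes items that occur in similar contexts end up with similar vectors.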
92. FPM
• Association Rule Mining
– items co-occurring within the same sessions (no check on order)
• if you like A and B, you also like C
• Sequential Pattern Mining
– items co-occurring in the same order (no check on distance)
• if you watch A and later watch B, you will later watch C
• Contiguous Sequential Pattern Mining
– items co-occurring in the same order and at the same distance
• if you watch A and B one after the other, you will now watch C
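The three pattern notions differ only in how a mined pattern is matched against a session; a minimal sketch of the three matching rules:

```python
def co_occurs(pattern, session):
    """Association rule view: items appear together, order ignored."""
    return set(pattern) <= set(session)

def is_subsequence(pattern, session):
    """Sequential pattern view: same order, gaps allowed."""
    it = iter(session)
    return all(item in it for item in pattern)   # consumes the iterator in order

def is_contiguous(pattern, session):
    """Contiguous sequential pattern: same order, no gaps."""
    n = len(pattern)
    return any(session[i:i + n] == pattern for i in range(len(session) - n + 1))

s = ["a", "x", "b", "c"]                 # hypothetical session
co_occurs(["b", "a"], s)                 # True: order ignored
is_subsequence(["a", "b"], s)            # True: gap over "x" is fine
is_contiguous(["a", "b"], s)             # False: "x" sits between them
```

Each stricter notion accepts fewer sessions, which is why contiguous sequential patterns are the rarest (and the costliest to mine).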
93. FPM
• Two-step approach
1. Offline: rule mining
2. Online: rule matching (with the current user session)
• Rules have
– Confidence: conditional probability
– Support: number of supporting examples (main parameter)
• higher thresholds -> fewer rules; lower thresholds -> more rules
– few rules: difficult to find rules matching a session
– many rules: noisy rules (low quality)
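The two-step approach can be sketched for pairwise association rules; the sessions, the support threshold, and the max-confidence ranking below are illustrative choices, not the tutorial's reference implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions, treated as item sets (association rules ignore order).
sessions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

# Offline step: mine pairwise rules x -> y with support and confidence.
item_count = Counter(i for s in sessions for i in s)
pair_count = Counter(frozenset(p) for s in sessions for p in combinations(sorted(s), 2))

min_support = 2   # absolute number of supporting sessions (the main parameter)
rules = {}
for pair, supp in pair_count.items():
    if supp < min_support:
        continue                              # higher threshold -> fewer rules
    x, y = tuple(pair)
    rules[(x, y)] = supp / item_count[x]      # confidence = P(y | x)
    rules[(y, x)] = supp / item_count[y]

# Online step: match rules against the current session, rank by confidence.
def recommend(session):
    scores = {}
    for item in session:
        for (x, y), conf in rules.items():
            if x == item and y not in session:
                scores[y] = max(scores.get(y, 0.0), conf)
    return sorted(scores, key=scores.get, reverse=True)

print(recommend({"a"}))   # "b" and "c", each with confidence 0.75 here
```

The offline/online split matters in practice: mining is expensive and batched, while matching must run at recommendation time against the live session.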
94. FPM
• More constrained FPM methods produce fewer
rules and are computationally more complex
– CSPM -> SPM -> ARM
• Efficiency and number of mined rules decrease
with the number N of past user actions (items)
considered in the mining
– N=1 : users who like A also like B
– N=2 : users who like A and B also like C
– …
96. Simple Recurrent Neural Network
• Hidden state used to predict the output
– Computed from next input and previous hidden state
(diagram: hidden states h_0, h_1, h_2; inputs x_1, x_2; outputs y_1, y_2)
97. Simple Recurrent Neural Network
• Hidden state used to predict the output
– Computed from the next input and the previous hidden state
• Three weight matrices
– h_t = f(x_t^T W_x + h_{t-1}^T W_h + b_h)
– y_t = f(h_t^T W_y)
(diagram: hidden states h_0, h_1, h_2; inputs x_1, x_2; outputs y_1, y_2)
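The two equations above translate directly into a forward pass. This NumPy sketch uses tanh for the generic non-linearity f and random weights; a real next-item model would use a softmax output layer and trained parameters.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, bh, Wy, f=np.tanh):
    """Simple RNN: h_t = f(x_t^T Wx + h_{t-1}^T Wh + bh), y_t = f(h_t^T Wy)."""
    h = np.zeros(Wh.shape[0])          # h_0, the initial hidden state
    ys = []
    for x in xs:
        h = f(x @ Wx + h @ Wh + bh)    # update hidden state from input + h_{t-1}
        ys.append(f(h @ Wy))           # predict output from the hidden state
    return ys

rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 3, 4             # hypothetical sizes (one-hot items)
Wx = rng.normal(size=(d_in, d_h))
Wh = rng.normal(size=(d_h, d_h))
bh = np.zeros(d_h)
Wy = rng.normal(size=(d_h, d_out))

xs = np.eye(d_in)[[0, 2]]              # toy session: item 0, then item 2
ys = rnn_forward(xs, Wx, Wh, bh, Wy)   # one output distribution per step
```

Note how the same three weight matrices are reused at every step; only the hidden state carries information forward through the session.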
103. Simple Recurrent Neural Network
• Fairly effective in modeling sequences
– good for session-based problems
• Problems
– exploding gradients -> careful regularization and initialization needed
– how to manage attributes of events?
– how to manage multiple sessions?
(diagram: hidden states h_0, h_1, h_2; inputs x_1, x_2; outputs y_1, y_2)
104. Multiple sessions
• Naïve solution: concatenation
• Whole user history as a single sequence
• Trivial implementation but limited effectiveness
(diagram: User 1's sessions 1 … N concatenated and fed to a single chain of RNN cells)
105. Multiple sessions
• Naïve solution: concatenation
• Whole user history as a single sequence
• Trivial implementation but limited effectiveness
• -> Hierarchical RNN (HRNN)
(diagram: User 1's sessions 1 … N concatenated and fed to a single chain of RNN cells)
116. Training Parallel RNNs
• Straight backpropagation fails -> alternative training methods
• Train one subnet at a time and freeze the others (like in ALS!)
• Depending on how often the trained subnet is switched:
– Residual training (train fully)
– Alternating training (per epoch)
– Interleaving training (per mini-batch)
119. Traditional evaluation approaches
• Off-line evaluation
– evaluation metrics
• Error metrics: RMSE, MAE
• Classification metrics: Precision, Recall, …
• Ranking metrics: MAP, MRR, NDCG, …
– dataset partitioning (hold-out, leave-one-out, …)
– available datasets
• On-line evaluation
– user studies
– field tests
120. Dataset partitioning
• Splitting the data into training and test sets: event-level vs. session-level
(diagram: users × time (events), with training and test regions of the log)
125. Dataset partitioning
• Community-level vs. user-level
(diagram: community-level — training users vs. test users over time; user-level — each test user's timeline split into profile, training, and test events)
127. Dataset partitioning
• session-based or session-aware task -> use session-level partitioning
– better mimics real systems trained on past sessions
• session-based task -> use session-level + user-level partitioning
– to test for new users
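Session-level partitioning keeps whole sessions on one side of a time split; a minimal sketch with a hypothetical session log:

```python
from datetime import datetime

# Hypothetical session log: (user, session_id, session start time).
sessions = [
    ("u1", "s1", datetime(2018, 1, 5)),
    ("u1", "s2", datetime(2018, 3, 2)),
    ("u2", "s3", datetime(2018, 2, 10)),
    ("u2", "s4", datetime(2018, 4, 1)),
]

def session_level_split(sessions, split_time):
    """Whole sessions before the split train the model; later ones test it,
    mimicking a real system that is trained only on past sessions."""
    train = [s for s in sessions if s[2] < split_time]
    test = [s for s in sessions if s[2] >= split_time]
    return train, test

train, test = session_level_split(sessions, datetime(2018, 3, 1))
# train holds s1 and s3; test holds s2 and s4
```

A user-level split on top of this would additionally hold out all sessions of some users, so the model is also tested on users it has never seen.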
128. Split size
• Fixed
– e.g., 80% training / 20% test
• Time-rolling
– move the time split forward in time
129. Target items
• Sequence-agnostic prediction
– similar to matrix completion
• Given-N next-item prediction
– e.g., next-track prediction
• Given-N next-item prediction with look-ahead
– e.g., predict a sequence of events
• Given-N next-item prediction with look-back
– e.g., how many events are required to predict a final product?
• In some scenarios, the same item can be consumed many times
– e.g., next-track recommendation, repeated recommendations, …
(diagram: target items highlighted along the time (events) axis)
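The given-N protocols amount to slicing each test session into (observed prefix, target) pairs; a sketch of the basic next-item variants (look-ahead and look-back variants would change only what counts as the target):

```python
def given_n_cases(session, n):
    """Given the first n events, predict the next one (next-item protocol)."""
    if len(session) <= n:
        return []
    return [(session[:n], session[n])]

def sliding_next_item_cases(session):
    """All (prefix, next item) pairs, as in incremental next-item evaluation."""
    return [(session[:k], session[k]) for k in range(1, len(session))]

s = ["a", "b", "c", "d"]     # hypothetical test session
given_n_cases(s, 2)          # [(['a', 'b'], 'c')]
sliding_next_item_cases(s)   # prefixes of length 1..3 with their next item
```

A look-ahead variant would accept any of the remaining session items as a hit, while a look-back variant would vary n to see how much history is needed.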
135. Evaluation metrics
• Classification metrics
– Precision, Recall, …
– good for context adaptation (traditional matrix-completion task)
• Ranking metrics
– MAP, NDCG, MRR, …
– good with fixed-size splits
• Multi-metrics
– ranking + others (e.g., diversity)
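Precision@k and MRR, two of the metrics most often reported for next-item prediction, take only a few lines each; the recommendation list and relevance judgments below are made-up toy values.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    top = recommended[:k]
    return len([i for i in top if i in relevant]) / k

def mrr(recommended, target):
    """Reciprocal rank of the single target item (next-item evaluation)."""
    return 1.0 / (recommended.index(target) + 1) if target in recommended else 0.0

recs = ["b", "c", "a", "d"]           # hypothetical ranked recommendations
precision_at_k(recs, {"a", "c"}, 3)   # 2/3: "c" and "a" are in the top 3
mrr(recs, "a")                        # 1/3: the target sits at rank 3
```

MRR rewards placing the single hidden next item near the top of the list, which is why it pairs naturally with the given-N next-item protocol.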
137. Datasets
Name          Domain  Users  Items  Events  Sessions
AOL           Query   650k   –      17M     2.5M
Adressa       News    15k    923    2.7M    –
Foursquare_2  POI     225k   –      22.5M   –
Gowalla_1     POI     54k    367k   4M      –
Gowalla_2     POI     196k   –      6.4M    –
30Music       Music   40k    5.6M   31M     2.7M
138. Thank you for your attention
Massimo Quadrana, Pandora Media, USA
mquadrana@pandora.com
Paolo Cremonesi, Politecnico di Milano
paolo.cremonesi@polimi.it
Dietmar Jannach, AAU Klagenfurt, Austria
dietmar.jannach@aau.at
Editor's Notes
• Analyze clicks to infer intent
• We now focus on RNNs because they are state-of-the-art for session-based recommendation
• Naïve solution: concatenate and let the RNN deal with it
• For the first session of the user, GRU_ses behaves exactly as session-based GRU4Rec
• HRNN Init: the user-level representation is used to initialize the hidden state of the next GRU_ses
• HRNN All: the user-level representation is used both to initialize and to update the hidden state of the next GRU_ses; the user representation is kept fixed during the whole forthcoming session
• Note that, because of the propagation of the user representation, two identical sessions will receive different recommendations
• Naïve solutions for multiple input types fail: training on features only has bad offline accuracy (recommendations make sense but are too general); concatenating the one-hot ID with the features has similar quality to ID-only, since the stronger signal (the ID) dominates over the features
• Parallel RNNs: separate RNNs for each input type (ID, image, text, …); the subnets model sessions separately, capturing different aspects; the hidden layers are merged with optional weighting/scaling; the output is computed from the merged hidden representations