Recsys 2018 overview and highlights

Recsys 2018
Conference Overview and Highlights

Recsys Overview
- Deep learning is “omnipresent” now (no more specialized workshop or DL-specific track)
- Reinforcement Learning gaining popularity (industry mostly)
- User-centric papers (calibration, diversity…)
- Evaluation and Metrics
- Recsys Challenge (Spotify) - LTR
- Tutorials (material & slides here)
- Mixed Methods (Spotify)
- Sequence-aware RS (Politecnico di Milano + Pandora)
- OpenRec (open-source and modular library for NN algo’s)
- Deep Learning (Flipkart)
- Emotions and Personality in Recommender Systems
- Many authors are making their code available

Reinforcement Learning
Traditionally used in robotics, games (AlphaGo), self-driving cars...
Why RL in RS? RS have 2 competing goals:
● Recommend items with the highest user predicted engagement (exploit)
● Recommend items with uncertain predicted user engagement to gather more
information (explore)
=> Traditional RS focus on exploiting only.
Exploration is important in settings with new users, new items and dynamic user
preferences.

RS problem framed as a RL problem
RL: Sequential decision making problem
At step t, an agent must perform an action a_t in an uncertain environment, which
presents the agent with a new reward r_{t+1} and a new state s_{t+1}.
Example:
● action: recommend a product
● reward: click or no click (binary)
The goal of the agent is to learn a policy that selects the action maximising the
total reward collected. Common strategies: greedy, epsilon-greedy, upper confidence
bound (UCB), Thompson sampling...
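
As one illustration of the strategies listed above, here is a minimal Thompson sampling sketch for the click / no click example. It assumes a Beta-Bernoulli model with a uniform prior and a simulated environment; the function name `thompson_sampling` and the click rates are hypothetical and not taken from any of the papers.

```python
import random

def thompson_sampling(true_ctr, n_steps, seed=0):
    """Beta-Bernoulli Thompson sampling over a fixed set of items (arms)."""
    rng = random.Random(seed)
    n_arms = len(true_ctr)
    clicks = [0] * n_arms  # observed clicks per item
    skips = [0] * n_arms   # observed non-clicks per item
    for _ in range(n_steps):
        # Draw a plausible click rate for each item from its Beta posterior.
        samples = [rng.betavariate(clicks[a] + 1, skips[a] + 1) for a in range(n_arms)]
        # Action: recommend the item with the highest sampled click rate.
        action = max(range(n_arms), key=samples.__getitem__)
        # Simulated environment (an assumption): binary reward, click or no click.
        if rng.random() < true_ctr[action]:
            clicks[action] += 1
        else:
            skips[action] += 1
    return clicks, skips

# Hypothetical click-through rates for three items.
clicks, skips = thompson_sampling([0.05, 0.10, 0.60], n_steps=2000)
pulls = [c + s for c, s in zip(clicks, skips)]
print(pulls)
```

Items with uncertain estimates still get sampled occasionally (explore), while most recommendations go to the item that currently looks best (exploit).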

Netflix: Artwork recommendation using RL
Personalise artwork of movie titles so users can better decide whether to watch
something or not

Choose an action uniformly at random with probability epsilon.
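
The rule above can be sketched as follows. This is a generic epsilon-greedy selection, not Netflix's implementation; the reward estimates and the function name `epsilon_greedy_action` are made up for illustration.

```python
import random

def epsilon_greedy_action(estimated_reward, epsilon, rng):
    """Return an index into estimated_reward chosen by the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: choose an action uniformly at random.
        return rng.randrange(len(estimated_reward))
    # Exploit: choose the action with the highest estimated reward.
    return max(range(len(estimated_reward)), key=estimated_reward.__getitem__)

rng = random.Random(42)
estimates = [0.10, 0.35, 0.20]  # hypothetical estimated engagement per artwork
choices = [epsilon_greedy_action(estimates, 0.1, rng) for _ in range(1000)]
print({a: choices.count(a) for a in range(len(estimates))})
```

With epsilon = 0 the rule reduces to the purely greedy (exploit-only) policy.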

RL @ Recsys 2018
- REVEAL: workshop on Offline Evaluation for Recommender Systems
- BEARS (Insight, Athena): Evaluation framework to test bandit-based RS
- Netflix: Artwork recommendation using RL
- Pandora: Rank modules to show to users (i.e. “friday mood”, ...)
- Spotify: Jointly learn what items and explanations to show to users
- Criteo: Causal Embeddings (best paper)
- Deal with data confounded by the recommender by combining a large sample of biased
feedback data with a small sample of unbiased feedback data.

However...
Deep Reinforcement Learning Doesn't Work Yet (Feb 2018)
Reinforcement Learning never worked, and 'deep' only helped a bit (Feb 2018)
RL Researchers

Calibrated Recommendations (Steck)
- Nominated paper by Harald Steck (Netflix)
- RS trained on accuracy tend to focus on user’s main interests (unbalanced
recommendations)
- E.g. user’s items 70% romance and 30% action → an uncalibrated algorithm
would recommend most items in the romance category
- The work proposes a calibration metric and performs an evaluation on
MovieLens

Calibration Metric
C_KL(p, q) = 0 → p(g | u) = q(g | u) (perfect calibration)
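
The slide's formula image did not survive extraction. A minimal sketch of a KL-based calibration metric, assuming C_KL is the KL divergence between the genre distribution of the user's history p(g | u) and that of the recommended list q(g | u), with q smoothed towards p by a small alpha so the logarithm stays finite (the smoothing and the value alpha = 0.01 are assumptions here):

```python
import math

def calibration_kl(p, q, alpha=0.01):
    """KL divergence between genre distributions p (history) and q (recommended list)."""
    total = 0.0
    for genre, p_g in p.items():
        if p_g == 0.0:
            continue  # genres absent from the history contribute nothing
        # Smooth q towards p so that q_g is never zero (assumed smoothing).
        q_g = (1 - alpha) * q.get(genre, 0.0) + alpha * p_g
        total += p_g * math.log(p_g / q_g)
    return total

history = {"romance": 0.7, "action": 0.3}
calibrated = calibration_kl(history, {"romance": 0.7, "action": 0.3})
romance_only = calibration_kl(history, {"romance": 1.0, "action": 0.0})
print(calibrated, romance_only)
```

The metric is zero for a list that reproduces the 70% / 30% split and grows as the list drifts towards romance only.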

Calibration Method
Goal: Find the optimal set I* of items to recommend
Maximum marginal relevance:
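
The objective itself is missing from the extracted slide. The sketch below assumes it has the form (1 - lambda) * s(I) - lambda * C_KL(p, q(I)), where s(I) is the sum of predicted scores, and builds the list greedily one item at a time. The uniform averaging in `genre_distribution`, the item names and the scores are all hypothetical.

```python
import math

def calibration_kl(p, q, alpha=0.01):
    """KL-based calibration metric with q smoothed towards p (assumed form)."""
    total = 0.0
    for genre, p_g in p.items():
        if p_g > 0.0:
            q_g = (1 - alpha) * q.get(genre, 0.0) + alpha * p_g
            total += p_g * math.log(p_g / q_g)
    return total

def genre_distribution(items, item_genres):
    """Average the per-item genre distributions over a list of items."""
    dist = {}
    for item in items:
        for genre, weight in item_genres[item].items():
            dist[genre] = dist.get(genre, 0.0) + weight / len(items)
    return dist

def greedy_calibrated_list(scores, item_genres, p_user, n, lam):
    """Greedily add the item that maximises the assumed trade-off objective."""
    selected, remaining = [], set(scores)
    while remaining and len(selected) < n:
        def objective(candidate):
            trial = selected + [candidate]
            relevance = sum(scores[i] for i in trial)
            q = genre_distribution(trial, item_genres)
            return (1 - lam) * relevance - lam * calibration_kl(p_user, q)
        best = max(sorted(remaining), key=objective)  # sorted for deterministic ties
        selected.append(best)
        remaining.remove(best)
    return selected

scores = {"r1": 0.9, "r2": 0.8, "r3": 0.7, "r4": 0.6, "a1": 0.5, "a2": 0.4}
item_genres = {i: ({"romance": 1.0} if i.startswith("r") else {"action": 1.0}) for i in scores}
p_user = {"romance": 0.7, "action": 0.3}
accuracy_only = greedy_calibrated_list(scores, item_genres, p_user, n=3, lam=0.0)
calibrated = greedy_calibrated_list(scores, item_genres, p_user, n=3, lam=0.9)
print(accuracy_only, calibrated)
```

With lambda = 0 the list is simply the top-scored items; a larger lambda trades some predicted relevance for a genre mix closer to the user's history.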

Calibration vs Diversity
If we have 2 genres, romance and action, the most diverse list would contain 50%
romance and 50% action movies.
But that does not consider the accuracy-diversity tradeoff for each user (some
users may not want diverse recommendations). Calibration ≈ personalized diversity.
Extension taking into account diversity (introduce a new param beta that controls
the calibration-diversity tradeoff):
[Formula labels: diversity-promoting prior, calibration target]
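
The formula for this extension is not recoverable from the slide. One plausible reading, stated as an assumption, is that the calibration target is mixed with a diversity-promoting prior p0(g), giving beta * p0(g) + (1 - beta) * p(g | u). A sketch of that mixture, with a uniform prior chosen purely for illustration:

```python
def diversified_target(p_user, prior, beta):
    """Mix the user's genre distribution with a diversity-promoting prior (assumed form)."""
    genres = set(p_user) | set(prior)
    return {
        g: beta * prior.get(g, 0.0) + (1 - beta) * p_user.get(g, 0.0)
        for g in genres
    }

p_user = {"romance": 0.7, "action": 0.3}
uniform_prior = {"romance": 0.5, "action": 0.5}  # hypothetical prior
target = diversified_target(p_user, uniform_prior, beta=0.5)
print(target)
```

With beta = 0 the target is the plain calibration target; with beta = 1 it is the prior alone.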

Results
calibration - accuracy tradeoff (controlled by lambda)

Reciprocal Recommender Systems (Kleinerman et al)
Reciprocal -> online dating, jobs… (recommending people), marketplaces (items)
For each user receiving a recommendation, the system finds the optimal balance
of:
● Likelihood of the user accepting the recommendation
● Likelihood of the recommended user positively responding
Evaluation on an online dating site (app)

Approach based on combining:
● Collaborative filtering -> score for each user pair CF(x,y)
● AdaBoost classifier that predicts the probability that a user will respond to
another user based on features from the sender and the receiver (content +
popularity features) -> PR(y,x)
Classifier AUC = 0.83
Baseline = CF (on its own)
Learn a weight that balances CF and PR
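
A minimal sketch of how such a weight could be applied, assuming a simple linear combination alpha * CF(x, y) + (1 - alpha) * PR(y, x). The exact combination used in the paper is not given on the slide; the candidate names and scores below are invented.

```python
def reciprocal_score(cf_xy, pr_yx, alpha):
    """Blend x's predicted interest in y with y's predicted probability of replying."""
    return alpha * cf_xy + (1 - alpha) * pr_yx

def rank_candidates(candidates, alpha):
    """candidates maps a candidate user y to the pair (CF(x, y), PR(y, x))."""
    return sorted(
        candidates,
        key=lambda y: reciprocal_score(candidates[y][0], candidates[y][1], alpha),
        reverse=True,
    )

candidates = {"ann": (0.9, 0.1), "bea": (0.6, 0.7), "cat": (0.3, 0.9)}
cf_only = rank_candidates(candidates, alpha=1.0)
balanced = rank_candidates(candidates, alpha=0.5)
print(cf_only, balanced)
```

With alpha = 1 the ranking is the CF baseline; lowering alpha favours candidates who are likely to respond positively.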

Automatic Playlist Continuation
Given a playlist, what songs should be played after?
Input: A user-created playlist, represented by:
● Playlist metadata
● A list of the K tracks in the playlist, where K is one of 0, 1, 5, 10, 25 or 100
Output: A list of 500 recommended tracks, ordered by relevance

Top 3 Teams:
1st place: Two-stage model:
● 1st: WRMF + CNN + user-user + item-item neighborhood models.
● 2nd: gradient boosting model used to re-rank the retrieved songs
2nd place: Multimodal CF that uses an autoencoder and a character-level CNN
3rd place: Two-stage model:
● 1st: LightFM
● 2nd: gradient boosting

Winning Team Model Architecture
Two-stage Model for Automatic Playlist Continuation at Scale [Volkovs et al]

First Stage: Linear weighted ensemble method:
Second Stage: Learning to rank using gradient boosting trees (GBT)
● first stage scores (s_blend, s_wrmf…) + other engineered features
● pairwise ranking loss
● 150 trees; depth = 10 (XGBoost library)
Cold-start handled as a separate problem
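
The first-stage blend can be sketched as a weighted sum of per-model scores followed by a top-k cut to produce candidates for the re-ranker. The model names, weights and scores below are hypothetical, and the second-stage gradient boosting model is not shown.

```python
def blend_scores(model_scores, weights):
    """Linear weighted ensemble: s_blend(track) = sum over models of w_m * s_m(track)."""
    blended = {}
    for model, scores in model_scores.items():
        for track, score in scores.items():
            blended[track] = blended.get(track, 0.0) + weights[model] * score
    return blended

def top_candidates(blended, k):
    """Keep the k best tracks as candidates for the second-stage re-ranker."""
    return sorted(blended, key=lambda t: (-blended[t], t))[:k]

model_scores = {
    "wrmf": {"t1": 0.2, "t2": 0.8},
    "item_item": {"t1": 0.9, "t3": 0.5},
}
weights = {"wrmf": 0.3, "item_item": 0.7}  # hypothetical blend weights
s_blend = blend_scores(model_scores, weights)
candidates = top_candidates(s_blend, k=2)
print(s_blend, candidates)
```

In a two-stage setup, both the blended score and the individual model scores would then be passed to the re-ranker as features.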

Main Findings
● Blending (1st stage) already produced high performance (recall of 90% at 20K
candidate songs, 60% within the top 1K).
● Neighbourhood based approaches outperformed CNN and WRMF (CNN was
the worst performing)
● Re-scaling similarity scores using inverse popularity [Verstrepen et al]
significantly improved the accuracy of the 2 neighbourhood-based approaches
● First few trees of GBM (2nd stage) already beat performance of blending (1st
stage) because GBM uses output of blender (scores of the different models)
as input

2nd place: Multimodal Collaborative Filtering
Multimodal approach that uses:
1. an autoencoder using the playlist
and its categorical contents
2. a character-level CNN that only
uses the playlist title
MMCF: Multimodal Collaborative Filtering for
Automatic Playlist Continuation. [Yang et al.]

Combine the two models using a linear combination of output vectors w_item & w_title
The autoencoder can capture the characteristics of a given playlist more precisely
as the number of input items increases → give more weight to w_item when the number
of items is higher
N([p; a_p]): number of items
I(T_p): importance of the playlist title

3rd place: hybrid two-stage recommender
1st stage: LightFM used for generating candidates
2nd stage:
● LightFM features: features produced by LightFM (score(p, t), but also b_p, b_t
and <q_p, q_t>), where p = playlist, t = track
● Co-occurrence features related to playlist p and candidate track t
○ e.g. number of playlists containing tracks t_i and t… → calculated for each track
t_i in the playlist and candidate track t, then aggregated with mean, min, max, and
median statistics over the tracks in the playlist
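
The co-occurrence feature described above can be sketched as follows: for each track t_i in the playlist, count the playlists that contain both t_i and the candidate t, then aggregate. The toy playlists and the function name are hypothetical; returning zeros for an empty playlist is an assumption made here for the K = 0 case.

```python
from statistics import mean, median

def cooccurrence_features(playlist_tracks, candidate, playlists):
    """Aggregate co-occurrence counts between each playlist track and the candidate."""
    if not playlist_tracks:
        return {"mean": 0, "min": 0, "max": 0, "median": 0}
    counts = [
        sum(1 for pl in playlists if t_i in pl and candidate in pl)
        for t_i in playlist_tracks
    ]
    return {
        "mean": mean(counts),
        "min": min(counts),
        "max": max(counts),
        "median": median(counts),
    }

# Toy training playlists, each represented as a set of track ids.
playlists = [{"a", "b", "x"}, {"a", "x"}, {"b", "c"}, {"a", "b", "c", "x"}]
features = cooccurrence_features(["a", "b", "c"], "x", playlists)
print(features)
```

Here track "a" co-occurs with "x" in three playlists, "b" in two and "c" in one, so the four statistics summarise the counts 3, 2 and 1.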
A hybrid two-stage recommender system for automatic playlist continuation

On the Robustness and Discriminative Power
of IR Metrics for Top-N Recommendation (Valcarce et al)
● Studies the robustness and discriminative power of several ranking metrics
(originally used in IR) when applied to the top-N recommendation task.
● A desirable metric for recommendation should be robust to incompleteness in
the test set.
● Assess robustness by simulating sparsity and popularity bias in the test set
(removing at random or removing top popular) and recalculating the metrics.
● Compare the rankings of recommenders obtained on the original and modified test
sets using Kendall’s correlation
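
A minimal sketch of the comparison step, assuming each recommender has one metric value on the full test set and one on the modified test set. It computes Kendall's tau-a with no tie correction; the system names and metric values are invented.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by system name."""
    concordant = discordant = 0
    for s1, s2 in combinations(sorted(scores_a), 2):
        product = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if product > 0:
            concordant += 1   # both test sets order this pair the same way
        elif product < 0:
            discordant += 1   # the two test sets disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical metric values for four recommenders.
full_test = {"A": 0.30, "B": 0.25, "C": 0.20, "D": 0.10}
sparse_test = {"A": 0.15, "B": 0.16, "C": 0.09, "D": 0.04}
tau = kendall_tau(full_test, sparse_test)
print(tau)
```

A tau close to 1 means the metric ranks the recommenders the same way even after test data is removed, which is the robustness property under study.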

3 datasets and 21 recommendation algorithms (All Items methodology)
● Precision is very robust to sparsity and popularity biases
● NDCG → high robustness to the sparsity bias and moderate robustness to
the popularity bias.
● MRR: performs poorly in RS evaluation.
Interesting approach to evaluate the robustness of our metrics for our own
datasets

Judging Similarity: A user-centric study of related
item recommendations (Yao et al.)
Two survey tasks: evaluate item similarity, evaluate recommendation quality

● User-centric evaluation of 6 related item algorithms: random,
content-based (tag genome), content-based (user reviews), SVD,
item2vec, ARM
● 700 participants (MovieLens users invited by email)
● 2 research questions:
○ Which related item algorithms best match user perceptions of relatedness and
recommendation quality?
○ How should related item algorithms be designed to improve the user experience?
● Survey and responses are publicly available

Stratified sampling strategy
● 100 source items sampled from the top 2500 most popular items
○ Easier for people to know about the movies
○ 2500 most popular account for 80% of user ratings
● Stratified -> split 2500 into 10 groups, pick 10 random movies from each
● For each algorithm, retrieve top 10 neighbours
● Top 10,000 items considered as target items
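
The sampling of source items can be sketched as below, assuming the 2500 items are sorted by popularity and split into 10 equal-sized groups. Integer ids stand in for movies, and the seed is arbitrary.

```python
import random

def stratified_sample(items_by_popularity, n_groups=10, per_group=10, seed=0):
    """Split a popularity-sorted list into equal groups and sample from each."""
    rng = random.Random(seed)
    group_size = len(items_by_popularity) // n_groups
    sample = []
    for g in range(n_groups):
        group = items_by_popularity[g * group_size:(g + 1) * group_size]
        sample.extend(rng.sample(group, per_group))  # sampling without replacement
    return sample

top_items = list(range(2500))  # hypothetical ids, most popular first
source_items = stratified_sample(top_items)
print(len(source_items))
```

Sampling per group keeps the 100 source items spread across the popularity range instead of concentrating on the very top titles.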

High correlation between similarity and recommendation quality (0.80 Spearman
rank-order correlation)
CB approaches outperform CF-based ones in terms of user
expectations for similarity and recommendation quality

Results and Conclusions of the Study
- Item similarity from content-based approaches matches user expectations better
than item similarity from CF approaches
- Perceived recommendation quality is also better.
- Users said they want something in between “similar to their interests” and “not
obvious”
- Based on the user’s feedback the authors suggest that related item
recommendations should combine item similarity with other factors such as
diversity and serendipity.

Interpreting User Inaction (Zhao et al.)
Most work focuses on user’s interactions with items. This work focuses on
studying the lack of interaction / inaction through a live user survey on MovieLens.
Inaction doesn’t always mean negative feedback. E.g. the “explore later” type of
inaction is positive user feedback.
Research questions:
● What causes inaction?
● Can inaction reason be predicted?
● Can we improve recommender systems using an inaction model?

7 categories of inaction:
● Not Noticed / Lack of attention (38.6%)
● Not Now (18.2%)
● Already Watched (14.6%)
● Others Better Titles (9.5%)
● Explore Later / Need more info to make a decision (6.9%)
● Would Not Enjoy / Not matching user’s taste (5.8%)
● Already Decided To Watch (5.8%)
User’s lack of attention → UI design should try to optimize user attention

Used a multinomial regression model to predict the type / category of user
inaction.
Main Findings:
=> Reason for inaction is hard to predict → overall poor classification performance
but some categories have high accuracy.
=> Predicted probability of “Not Now” could be used to decide when to “skip” showing
a recommendation and wait for a future session to show it.

Mixed Methods (Discover Weekly @ Spotify)
Qualitative:
How to do interviews and surveys (best practices)
E.g. scales shouldn’t have numbers, should have words
Quantitative:
How to collect data: attention, interaction, task-success

Recsys 2019
See you in Copenhagen
