4. Definition of impressions
An item appears in the viewport of the
application
● for at least x milliseconds
● partial visibility can be OK
Impressions can be logged for different
entities on screen
● shows, rows, boxart images, etc
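As a minimal sketch of the definition above, the check below counts a viewport event as an impression when the item stays on screen long enough and enough of it is visible. The `ViewportEvent` schema and both thresholds (`min_duration_ms`, `min_visible_fraction`) are hypothetical; the slide only says "at least x milliseconds" and that partial visibility can be acceptable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewportEvent:
    """One on-screen appearance of an entity (hypothetical schema)."""
    entity_id: str
    entity_type: str         # e.g. "show", "row", "boxart"
    duration_ms: int         # how long the entity stayed in the viewport
    visible_fraction: float  # share of the entity's area on screen, 0.0 to 1.0


def is_impression(event: ViewportEvent,
                  min_duration_ms: int = 500,
                  min_visible_fraction: float = 0.5) -> bool:
    """Return True if the event counts as an impression.

    Both thresholds are illustrative assumptions, not values from the talk.
    """
    return (event.duration_ms >= min_duration_ms
            and event.visible_fraction >= min_visible_fraction)
```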
5. Goal of this presentation
Impression data is critical for building recommender models at
Netflix
● and other industry recommenders
How do we incorporate impression data into recommender
models?
● Impressions for label definition (training objectives)
● Impressions for feature definition
Share interesting learnings and challenges
8. A simplified recommendation algorithm
Given a user:
for every item, predict
p(engage | user impression of item)
then choose the item with the highest
prediction
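The simplified algorithm above can be sketched as an argmax over per-item predictions. The `predict` callable stands in for any trained model of p(engage | user impression of item); its name and signature, and the toy scores, are assumptions made for illustration.

```python
from typing import Callable, Iterable


def recommend(user_id: str,
              items: Iterable[str],
              predict: Callable[[str, str], float]) -> str:
    """Return the item with the highest predicted p(engage | impression).

    Raises ValueError if `items` is empty.
    """
    return max(items, key=lambda item: predict(user_id, item))


# Toy usage with made-up scores standing in for a trained model.
toy_scores = {"show_a": 0.10, "show_b": 0.70, "show_c": 0.30}
best = recommend("user_1", toy_scores, lambda user, item: toy_scores[item])
```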
9. How to train p(engage | impression)
Binary classification model: engage or no-engage?
Training data: take all user-item impressions
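One way to read this slide is that every impressed user-item pair becomes a training example, labeled 1 if the user engaged and 0 otherwise. The sketch below assumes impressions and engagements are available as `(user_id, item_id)` pairs; that representation is an assumption, not something the slides specify.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (user_id, item_id)


def build_training_examples(impressions: Iterable[Pair],
                            engagements: Set[Pair]) -> List[Tuple[str, str, int]]:
    """Label each impressed (user, item) pair: 1 if engaged, else 0."""
    return [(user, item, int((user, item) in engagements))
            for user, item in impressions]
```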
10. If only “relevant” items are impressed
Training data concentrates on the most
relevant part of the item space
If we train a classifier on this data
● relevance is not the main difference
between positives and negatives
● so it may be ignored by the model
The classifier will not generalize well to the
whole item space
● may over-predict for many non-relevant
items
11. Solution 1: Add item exploration
Display random items to each user
A user still can’t be shown every single item
● there can be millions of items
But a user can be shown most “types” of items
Model generalizes better!
Too much exploration may hurt user
experience or ads revenue
Explore volume needs to be limited
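A minimal sketch of limited exploration, assuming a per-slot replacement scheme: each slot of the slate is swapped for a uniformly random catalog item with probability `explore_rate`, so the expected explore volume is `explore_rate * slate_size` slots. The function name, parameters, and the per-slot scheme itself are illustrative assumptions; the slides do not describe how exploration is implemented.

```python
import random
from typing import List, Sequence


def build_slate(ranked_items: Sequence[str],
                catalog: Sequence[str],
                slate_size: int,
                explore_rate: float,
                rng: random.Random) -> List[str]:
    """Fill a slate from the ranked list, swapping in random catalog items.

    Duplicates between explore and exploit slots are not removed here.
    """
    if len(ranked_items) < slate_size:
        raise ValueError("need at least slate_size ranked items")
    slate: List[str] = []
    exploit = iter(ranked_items)
    for _ in range(slate_size):
        if rng.random() < explore_rate:
            slate.append(rng.choice(catalog))  # explore: uniform random item
        else:
            slate.append(next(exploit))        # exploit: next best ranked item
    return slate
```

Keeping `explore_rate` small is one simple way to cap the cost to user experience mentioned above.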
12. Solution 2: Add random negatives
Pseudo-impressions with no
engagement
May incorrectly mark a relevant item as
negative
● risk is small when item space is large
Random negatives are easy to classify
● little connection to user interests
But help a lot with model generalization
Challenges
● what distribution to sample negatives from?
● how to mix random negatives with
impressed negatives?
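The sketch below shows one possible answer to both questions, chosen only for simplicity: sample negatives uniformly from the catalog, and add a fixed number per positive example. Both choices, and the names `add_random_negatives` and `negatives_per_positive`, are assumptions; the slides list the sampling distribution and the mixing ratio as open challenges.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, str, int]  # (user_id, item_id, label)


def add_random_negatives(examples: Sequence[Example],
                         catalog: Sequence[str],
                         negatives_per_positive: int,
                         rng: random.Random) -> List[Example]:
    """Append uniformly sampled pseudo-impressions with label 0.

    For every positive example, draw `negatives_per_positive` catalog items
    for the same user. A sampled item may in fact be relevant; that risk is
    small when the catalog is large.
    """
    augmented = list(examples)  # keep the impressed positives and negatives
    for user, _item, label in examples:
        if label == 1:
            for _ in range(negatives_per_positive):
                augmented.append((user, rng.choice(catalog), 0))
    return augmented
```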
13. Popularity bias
Definition: popular items get higher predictions than they should
Model trained only using impressions (exploit + explore)
● no popularity bias as popular items get both more positives
and more negatives in training data
● some items can suffer from high variance if there is not
enough exploration
Adding uniform random negatives
● may increase popularity bias as we add the same number of
negatives for popular and non-popular items
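A small worked example of the last point, with made-up numbers: two items have the same 10% engagement rate on impressions, but one is impressed 100 times more often. Adding the same number of random negatives to each shifts the share of positives in the training data very differently. The helper `positive_share` is a hypothetical name.

```python
def positive_share(positives: int, impressed: int, random_negatives: int = 0) -> float:
    """Fraction of an item's training examples that are positive.

    `impressed` counts all impressed examples (positives and negatives).
    """
    return positives / (impressed + random_negatives)


# Illustrative numbers only: both items engage on 10% of impressions.
popular_before = positive_share(positives=100, impressed=1000)
niche_before = positive_share(positives=1, impressed=10)

# Add the same number of uniform random negatives (90) to each item.
popular_after = positive_share(100, 1000, random_negatives=90)
niche_after = positive_share(1, 10, random_negatives=90)
```

Here the popular item's positive share moves from 0.10 to about 0.092, while the niche item's drops from 0.10 to 0.01, so a model fit to this data would score the popular item roughly nine times higher despite identical engagement rates on impressions.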
14. When item space is large (millions)
Too costly to compute p(engage |
impression) for every item
Candidate generation pass
● efficient model architectures (e.g. two-tower)
● millions → hundreds (loosely-relevant)
● care more about recall @ hundreds
Fine-grained ranking pass
● more sophisticated model architectures
● distinguish between good and excellent
● often trained only on impressed negatives as
it is applied on already relevant candidates
More passes can be used, e.g.
● adjusting the ranking for diversity
Efficiency optimization: use 2 passes
● both predict p(engage | impression)
● with different focuses
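The two passes can be sketched as follows, assuming the candidate generator scores items by the dot product of a user embedding and item embeddings (the usual serving form of a two-tower model) and the ranker is any more expensive scoring function. All names and the toy data are illustrative assumptions.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def candidate_generation(user_vec: Sequence[float],
                         item_vecs: Dict[str, Sequence[float]],
                         k: int) -> List[str]:
    """First pass: top-k items by dot product of user and item embeddings."""
    return heapq.nlargest(k, item_vecs,
                          key=lambda item: dot(user_vec, item_vecs[item]))


def rank(candidates: Sequence[str],
         score: Callable[[str], float]) -> List[str]:
    """Second pass: reorder the candidates with a finer-grained scorer."""
    return sorted(candidates, key=score, reverse=True)


# Toy usage: 4 items reduced to 2 candidates, then re-ranked.
toy_item_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [-1.0, 0.0]}
toy_candidates = candidate_generation([1.0, 0.0], toy_item_vecs, k=2)
toy_fine_scores = {"a": 0.2, "c": 0.8}  # made-up ranker outputs
toy_ranking = rank(toy_candidates, toy_fine_scores.get)
```

In production the first pass would typically use an approximate nearest-neighbor index rather than scanning every item; the exhaustive scan here only keeps the sketch short.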
15. Repeated impressions
User scrolls back and forth multiple times
Items at the top get repeated
impressions
Need to deduplicate the impressions per
session in the training data
Otherwise, top items get unfairly
penalized in the model as they have more
repeated impressions
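Per-session deduplication can be sketched as keeping only the first impression of each item within each user session. The `(user_id, session_id, item_id)` tuple layout is an assumed log format.

```python
from typing import Iterable, List, Set, Tuple

Impression = Tuple[str, str, str]  # (user_id, session_id, item_id)


def dedupe_per_session(impressions: Iterable[Impression]) -> List[Impression]:
    """Keep the first impression of each item within each user session."""
    seen: Set[Impression] = set()
    unique: List[Impression] = []
    for imp in impressions:
        if imp not in seen:
            seen.add(imp)
            unique.append(imp)
    return unique
```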
16. Noisy impressions
Many items on screen at the same
time
Not clear if the user saw the item
If no engagement, is it because
● user is not interested?
● user didn’t see it?
17. Impressions may have long-term value
Impression of a Netflix show makes it more familiar
to the user
● even if the user did not play it
User may become more/less likely to play the show
at the next impression
19. Typical features
Frequency counts: number of past impressions
of the item
● can add different variations
Engagement rate: #engagements /
#impressions
● how to set the value if #impressions = 0?
● 0, average, 1, adding prior?
● this could affect cold-start performance
● we can also skip this feature to let the model
learn directly from raw counts
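The "adding prior" option can be sketched as additive smoothing: the observed rate is shrunk toward a prior rate, which also gives a defined value when there are no impressions. The parameter names and the default `prior_strength` are illustrative assumptions, not values from the talk.

```python
def smoothed_engagement_rate(engagements: int,
                             impressions: int,
                             prior_rate: float,
                             prior_strength: float = 10.0) -> float:
    """Engagement rate shrunk toward a prior (additive smoothing).

    Equivalent to adding `prior_strength` pseudo-impressions that engage at
    `prior_rate`. With zero impressions the result equals `prior_rate`.
    """
    return ((engagements + prior_rate * prior_strength)
            / (impressions + prior_strength))
```

Setting `prior_rate` to the average engagement rate recovers the "average" option for unseen items, while items with many impressions stay close to their raw rate.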
Categorical features: user’s impressed item
ids
● can help model generalize better via id
embeddings
But a user can have hundreds of impressions
even in a single day
Need to reduce the noise
20. Impression data volume
Impression data volume is huge
Logging is challenging
● heterogeneous client devices (TV,
mobile, web)
● need to process, sessionize and
summarize in real-time
● need to be available via multiple
channels (table, stream, API) for
different purposes
Handle volume in feature definition
● summary counts
● focus on most recent impressions
● increase minimum impression
duration requirement
● random sampling
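Three of these options (summary counts, a recency window, and a stricter minimum duration) can be combined in one pass over a user's impression log, as in the sketch below. The `(item_id, timestamp_s, duration_ms)` log layout and all parameter names are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

LoggedImpression = Tuple[str, int, int]  # (item_id, timestamp_s, duration_ms)


def recent_impression_counts(impressions: Iterable[LoggedImpression],
                             now_s: int,
                             window_s: int,
                             min_duration_ms: int) -> Dict[str, int]:
    """Count per-item impressions that are recent and long enough."""
    counts: Counter = Counter()
    for item_id, timestamp_s, duration_ms in impressions:
        is_recent = now_s - timestamp_s <= window_s
        is_long_enough = duration_ms >= min_duration_ms
        if is_recent and is_long_enough:
            counts[item_id] += 1
    return dict(counts)
```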
21. How do impression features help?
Correlation: items with more prior impressions tend to have a
higher average label
Should we then recommend more items with
many prior impressions from the user? No
Correlation does not imply causation
● highly-impressed items probably have
higher quality and thus have higher avg
label
In an A/B test, after adding impression features
● the model recommends more items with few
prior impressions
22. Conclusion
● Overview of using impression data to build an unbiased
recommendation model at Netflix
● Label definition: we may need exploration and random
negative sampling to enrich the training data
● Feature definition: various ways to summarize and
denoise impression data
● Long-term value: impressions can have different
long-term values for different users/items
23. Challenges
● How to do efficient exploration that maximizes signal
collection and minimizes user experience impact?
● How to sample random negatives? How to mix
random negatives with impressed negatives?
● How to model long-term value of impressions?