Jaya WWW talk 2023.pdf

© Tubi, proprietary and confidential
Powering Personalized Binge-Watching Recommendations:
A Journey of Realtime Multi-Interest Based Retrieval
Jaya Kawale
Vice President of Engineering (Machine Learning), Tubi

What to expect ?
● Part I: How ML can help streaming services like Tubi ?
● Part II: Case Study of Retrieval

Introduction

Free streaming service - Watch free movie, tv, news & sports!

Tubi
● More than 64 million monthly active users.
● Available across several countries including US, Canada, Mexico & LatAm.
● Most watched Free Ad-Supported Streaming Television (FAST)

How can Machine Learning help ?
Part I

Recommendation

Personalized Recommendations
Content Ranking, Container Ranking, Image Ranking, Search, Notifications,
Container Generation, Cold starting titles
● 70+ models helping organize the
homepage!
● Rank content and containers based
on users’ features and past
interactions
● Ranking based on GBDT, Deep
Neural Network
● Retrieval based upon a lot of
Embeddings (e.g. two tower, NLP
embeddings, etc)
● Distilled models for new users
● Exploration strategies for new titles

Why is it challenging ?
Beyond Accuracy at the Top!
01 Offline vs Online
02 Feedback loops
03 Changing tastes and catalog

01 Offline vs Online

Typical Metrics
● Typical Offline Metrics: ranking metrics such as NDCG, NMRR, Precision @K
● Typical Online Metrics: Streaming, Retention
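The offline ranking metrics named above can be sketched in a few lines. This is a minimal illustration, not Tubi's evaluation code: the title ids, the binary gains and the function names are assumptions made up for the example, and plain reciprocal rank stands in for the MRR-style metric.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """DCG of the top-k list divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(values, start=1))
    actual = dcg([gains.get(item, 0.0) for item in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical data: the model's ranking and the titles the user actually watched.
ranked = ["t3", "t1", "t7", "t2"]
watched = {"t1", "t2"}
gains = {"t1": 1.0, "t2": 1.0}  # binary relevance, an assumption

print(precision_at_k(ranked, watched, 2))
print(reciprocal_rank(ranked, watched))
print(ndcg_at_k(ranked, gains, 4))
```

All three reward relevant titles near the top of the list, which is why they are cheap proxies for, but not guarantees of, online streaming and retention gains.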

Correlation vs Causation
Offline evaluation
● Use historical data
● Cheap, fast, risk free
● Correlation based
● Counterfactual rewards are missing: logged data does not capture what would have happened if a different recommendation had been shown.
Online evaluation
● Randomized experiment (A/B tests)
● Wait for days to compute the reward
● Reliable but expensive

Dynamic Environment
● Recommender dynamics can affect the performance in ways not captured by
the offline metrics. E.g. impression caps.
● Recommendations can influence user preferences in ways not captured by
offline metrics. E.g. Did you watch a title because it was recommended ?
● User dynamics and confounding factors can influence the watch behavior in
ways not captured by offline metrics. E.g. Watching a title because it was
recommended by a friend.

Counterfactual evaluation
● Estimate the potential outcome of a policy offline using logged data.
● Inverse Propensity Scoring (IPS): Importance weighting to account for the
mismatch in the distribution of logged data and the policy to evaluate.
● Several variants - CIPS, SNIPS, etc.
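A minimal sketch of the estimators listed above, assuming each logged impression carries the reward, the logging policy's probability of showing that title, and the probability the new policy would assign to it. The function names and toy numbers are illustrative assumptions; CIPS is read here as clipped IPS.

```python
def ips(rewards, target_probs, logging_probs, clip=None):
    """Inverse propensity scoring estimate of the new policy's mean reward.

    Passing `clip` caps each importance weight (clipped IPS), trading a
    little bias for lower variance.
    """
    total = 0.0
    for r, p_new, p_log in zip(rewards, target_probs, logging_probs):
        weight = p_new / p_log
        if clip is not None:
            weight = min(weight, clip)
        total += weight * r
    return total / len(rewards)

def snips(rewards, target_probs, logging_probs):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    weights = [p_new / p_log for p_new, p_log in zip(target_probs, logging_probs)]
    denom = sum(weights)
    return sum(w * r for w, r in zip(weights, rewards)) / denom if denom else 0.0

# Toy log (made-up numbers): 1.0 means the recommended title was watched.
rewards = [1.0, 0.0, 1.0, 0.0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.25, 0.25, 0.5, 0.5]

print(ips(rewards, target_probs, logging_probs))            # plain IPS
print(ips(rewards, target_probs, logging_probs, clip=1.5))  # clipped IPS
print(snips(rewards, target_probs, logging_probs))          # SNIPS
```

Impressions the new policy favours more than the logging policy did get weights above 1, which is how the estimate corrects for the distribution mismatch.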

02 Feedback loops
03 Changing User tastes & Catalog

Feedback loops
● Different algorithms on the
homepage influencing one another
● Underlying data influencing the
algorithms
● Recommendations influencing the
watch behavior
● Watch behavior influences the
data.
Observational Data

Feedback loops
● Typical offline training data: clicks, watches, plays
● Implicit feedback has inherent biases
● Position/ recommendations influence the data collection

Changing User tastes & Catalog
● Users adapt and change their preferences over time.
● Also, new titles come into the system whereas some others leave the service.
● Uncertainty around new users and titles
● Trends outside influence the watch behavior.

Exploration and bandits
● Tradeoff: Explore unknown choices to gather
information vs exploit known preferences.
● Exploration helps break feedback loops and
helps with uncertainty around new items/
users.
● Caveat: Designing good exploration that
works in practice is hard due to
non-stationarity of the data and large
dynamic action spaces. Reward is myopic.
● RL: Optimize long term
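The explore/exploit tradeoff can be illustrated with an epsilon-greedy bandit. The talk does not say which exploration strategy Tubi uses, so this is a generic sketch under assumed names; the two arms and their watch rates are invented for the simulation.

```python
import random

class EpsilonGreedy:
    """With probability epsilon explore a random arm, otherwise exploit the best known one."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in arms}
        self.means = {arm: 0.0 for arm in arms}

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))   # explore
        return max(self.means, key=self.means.get)      # exploit

    def update(self, arm, reward):
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Simulated environment: made-up watch rates for a known hit and a new title.
true_watch_rate = {"known_hit": 0.4, "new_title": 0.6}
bandit = EpsilonGreedy(true_watch_rate, epsilon=0.1)
env = random.Random(1)
for _ in range(1000):
    arm = bandit.select()
    reward = 1.0 if env.random() < true_watch_rate[arm] else 0.0
    bandit.update(arm, reward)

print(bandit.counts, bandit.means)
```

The reward here is the immediate watch, which is exactly the myopic signal the caveat above warns about; a fixed set of two arms also sidesteps the large, changing action space of a real catalog.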

Content Understanding

ML for Content
● Content understanding helps make sense of
the rich metadata: plot synopsis, cast, genre,
box office, ratings, posters/ images, language,
video trailers.
Helps us improve
● Recommendations
● Content acquisition decisions
● Cold starting of titles
● Container Genesis
● Image Ranking
● …

Content Understanding
Tasks, from easy to hard:
● Keyword Search
● Review/ Sentiment Classification
● Topic Extraction
● Embedding Generation
● Natural Language Understanding
● Video Understanding
● Multi-modal data Understanding (e.g. Text + Images)

Spock Platform
● Platform for data ingestion, preprocessing and cleaning.
● Generates a variety of embeddings powering the different use cases across
the product.
● Helps assess embeddings quality via surrogate tasks.

Spock Platform (diagram)
● Data: 1st & 3rd party data; viewer-oriented data; title-oriented data; universe of content + metadata
● Models: Embeddings (CTXT, MD, MMD, Genre, Demos, Actor, et al)
● Products / use cases: Audience Assessment; Beam from Universe to Tubiverse; Cold ➔ Warm ➔ Hot Starting; Content Value Assessment; Tiering Inventory in Tubiverse; Augmented Search; Seeding Growth; Coordinated Pursuit of New Audience; Portfolio Analysis / Simulation

ML for Ads

Overview of ML for AdTech
● Audience Segments: Leverage data to generate audience segments for targeting.
● Ad break finder: Detect where to place an ad break in a video using Computer Vision.
● Time series forecasting: Forecast ad opportunities.
● Ad Understanding: Understand what an ad is about.

The Journey of Retrieval
Part II

Retrieval
● Retrieval helps reduce the
candidate space to a much
smaller number.
● Typically lightweight methods
to prune candidates.
● Fewer candidates -> latency room for a
more complex ranker
Pipeline: Retrieval (reduces the candidate space to hundreds) -> Ranker (ranks hundreds of content) -> HomePage

How it started ?
● Catalog was small. DAU was small.
● Ranking entire catalog for all users
possible.
● Offline Batch Based Jobs - Publish
Ranking Daily for all users. No real
time inference support needed.
● Issues: Daily ingest jobs. Compute
& storage cost.

As time goes by..
● Tubi starts becoming more popular.
Catalog grows.
● Ranking large catalog for all users
daily became compute intensive.
● Limit the number of candidates
ranked per device (say 200) to save on
daily ingestion costs.
● Ingestion cost goes down, but the entire
page is no longer personalized.

Fast forward Ranking …
● We moved Rankers to real time
inference.
● Got rid of daily ingest jobs per user.
Huge savings in compute and storage.
● Also gives us room to personalize the
entire homepage.

Retrieval Gen 1.0
● Reduce the candidates for ranking and storing.
● Start with popularity based measures. No need to rank larger catalog.
● Simple measures: Popular in Country, Language, Genre, Externally, etc.
● Issues: Unpersonalized recall. Reinforces popularity bias.

Personalization is the key
● Idea: Start with collaborative filtering.
Use the “wisdom of the crowd”.
● Matrix Factorization: Factorize the
User-Item interaction matrix into low
rank matrices.
● Use the score of MF as first level
pruning.
● Issues: Cold start user/ item
Example User-Item interaction matrix (axes: User, Movie):
1 x x 1
1 x 1 x
x x 1 1
1 x x 1
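A minimal sketch of matrix factorization as a first pruning step, trained with plain SGD on the toy matrix above. Several things are assumptions made for the example: the rows are read as users, the "x" cells in the second movie column are treated as sampled negatives (label 0), and the hyperparameters and function names are invented.

```python
import random

def factorize(interactions, n_users, n_items, rank=2, epochs=500, lr=0.1, reg=0.01, seed=0):
    """Learn low-rank user and item factors from (user, item, label) triples via SGD."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, label in interactions:
            err = label - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(rank):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def top_candidates(user, U, V, seen, n):
    """Score every unseen item for the user and keep the n highest."""
    scores = [(sum(a * b for a, b in zip(U[user], V[i])), i)
              for i in range(len(V)) if i not in seen]
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]

# The 1 entries of the toy matrix (user, movie), plus assumed negatives in column 1.
positives = [(0, 0), (0, 3), (1, 0), (1, 2), (2, 2), (2, 3), (3, 0), (3, 3)]
negatives = [(0, 1), (1, 1), (2, 1), (3, 1)]
data = [(u, i, 1.0) for u, i in positives] + [(u, i, 0.0) for u, i in negatives]

U, V = factorize(data, n_users=4, n_items=4)
print(top_candidates(0, U, V, seen={0, 3}, n=1))
```

Only the MF score is used here, so a user or title with no interactions gets an essentially random vector, which is the cold start issue listed above.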

Item Embeddings
● Problem: Subsampling of users for training results in a poor user vector
representation. The user vector is also very large.
● Idea: Can we use item vector only ?
● Approximate User representation by watch history. Take the nearest
neighbors in the item space wrt the watch history.
● Retrieval candidates are nearest neighbors of watch history.

Moar Embeddings!
● Lot of additional metadata
associated with a title.
● Abundance of natural language
text.
● Use deep learning/ NLP to
generate more content
embeddings.
Additional Metadata associated with a title

Why is NLP hard ?
Ambiguity in representation and learning!
Example search: “Kids Horror”. Not looking for titles to make
kids horrified.

Word 2 Vec
● Use the similarity of word vectors to calculate the probability of the outside
context words given the centre word (or vice versa).
● Keep adjusting the word vector to maximize the probability.
*Richard Socher, Stanford NLP course
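The two bullets above can be made concrete with a tiny skip-gram style sketch: a softmax over dot products gives the probability of each outside word given the centre word, and one gradient ascent step on the centre vector raises the probability of the observed outside word. The vocabulary, vectors and learning rate are invented for illustration; real word2vec training also updates the outside vectors and typically uses negative sampling.

```python
import math

def outside_probabilities(center_vec, outside_vecs):
    """P(outside word | centre word) as a softmax over dot products."""
    scores = {w: sum(a * b for a, b in zip(u, center_vec)) for w, u in outside_vecs.items()}
    top = max(scores.values())                      # subtract max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def ascent_step(center_vec, outside_vecs, observed, lr=0.1):
    """One gradient ascent step on log P(observed | centre) with respect to the centre vector."""
    probs = outside_probabilities(center_vec, outside_vecs)
    grad = list(outside_vecs[observed])             # u_observed - sum_w P(w | centre) * u_w
    for w, p in probs.items():
        for d, x in enumerate(outside_vecs[w]):
            grad[d] -= p * x
    return [v + lr * g for v, g in zip(center_vec, grad)]

# Made-up 2-d vectors for a three-word vocabulary.
outside_vecs = {"scary": [1.0, 0.0], "funny": [0.0, 1.0], "kids": [0.5, 0.5]}
center = [0.2, 0.1]                                 # hypothetical vector for "horror"

before = outside_probabilities(center, outside_vecs)["scary"]
center = ascent_step(center, outside_vecs, observed="scary")
after = outside_probabilities(center, outside_vecs)["scary"]
print(before, after)
```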

Doc 2 Vec
● Create a numeric representation
of a document instead of a word.
● Add paragraph id to the context
for a word.

Embeddings fun

Transformers, Language Models n all
● Transformers are ruling the world.
● BERT widely used. LLMs are on
everyone's mind.
● Pre-training vs Fine tuning.
● And the latest prompt engineering…
Pre-trained similarities not enough, fine tune for a
specific task.

Gen 2.0 Interaction Based Model - Two Tower
● Two Tower Model: User & Content
tower
● User features: e.g. watch history,
tenure, etc.
● Content features: e.g. genre, tags, etc.
● Final score determines user’s affinity for
a title. Use that for pruning candidates.
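A forward-pass-only sketch of the two tower idea: each tower maps its own features to a normalized vector, and the dot product is the affinity used for pruning. The features, the single tanh layer and the hand-set weights are placeholders assumed for illustration; in a real model the towers are deeper and their weights are learned from interaction data.

```python
import math

def tower(features, weights):
    """One linear layer with tanh, followed by L2 normalization."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]
    norm = math.sqrt(sum(h * h for h in hidden)) or 1.0
    return [h / norm for h in hidden]

def affinity(user_features, content_features, user_weights, content_weights):
    """Dot product of the user tower output and the content tower output."""
    u = tower(user_features, user_weights)
    c = tower(content_features, content_weights)
    return sum(a * b for a, b in zip(u, c))

# Hypothetical features: user = [horror share of history, comedy share, tenure],
# content = [is_horror, is_comedy]. Weights are hand-set, not trained.
user_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
content_weights = [[1.0, 0.0], [0.0, 1.0]]
user = [0.9, 0.1, 0.5]
catalog = {"horror_title": [1.0, 0.0], "comedy_title": [0.0, 1.0]}

scores = {t: affinity(user, f, user_weights, content_weights) for t, f in catalog.items()}
pruned = sorted(scores, key=scores.get, reverse=True)[:1]
print(scores, pruned)
```

Because the two towers only meet at the final dot product, content vectors can be precomputed and the user vector compared against them with a nearest neighbor search.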

Recap: How to Generate Retrieval Candidates ?
● Variety of Content Embeddings: Interaction based, Language based, etc.
● For each of the embeddings, generate a User representation.
● Get the nearest neighbors for the user.
● Key: What could help build a user representation ? Watch History!

Embeddings Based Retrieval
Design Choice 1: Generate Average
Embedding Vector given the watch
history and then compute the Nearest
Neighbors ?
Pros: Single representation for a user
Cons: Averaging loses information. E.g.
a Horror & a Comedy title averaged
together.
Watch History (A, B, C, D) -> Average Embedding -> User Representation

Design Choice 2: Generate Nearest
Neighbors Per Watch History ?
Pros: Horror and Comedy titles not
averaged together.
Cons: A lot of Nearest Neighbors to
compute. Daily ingest jobs took
tremendous compute and storage.
Watch History: A, B, C, D
Nearest neighbors per title: A -> E, F, G; B -> E, P, Q; C -> R, A, P; D -> X, Y, B
Candidates: E, F, G, P, Q, R, X, Y
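Design Choices 1 and 2 can be compared side by side on toy data. The 2-d item vectors and title names below are invented so that the effect is easy to see; cosine similarity and exact brute-force search are assumptions of the sketch.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query_vec, item_vecs, exclude, k):
    """Exact top-k titles by cosine similarity, skipping excluded titles."""
    scored = sorted(((cosine(query_vec, v), t) for t, v in item_vecs.items()
                     if t not in exclude), reverse=True)
    return [t for _, t in scored[:k]]

def retrieve_by_average(history, item_vecs, k):
    """Design Choice 1: one averaged user vector, then nearest neighbors."""
    dim = len(item_vecs[history[0]])
    avg = [sum(item_vecs[t][d] for t in history) / len(history) for d in range(dim)]
    return nearest(avg, item_vecs, set(history), k)

def retrieve_per_title(history, item_vecs, k):
    """Design Choice 2: nearest neighbors of each watched title, then the union."""
    seen, candidates = set(history), []
    for title in history:
        for neighbor in nearest(item_vecs[title], item_vecs, seen, k):
            if neighbor not in candidates:
                candidates.append(neighbor)
    return candidates

# Made-up embeddings: first axis leans horror, second leans comedy.
ITEM_VECS = {
    "horror_a": [1.0, 0.1], "horror_b": [0.9, 0.2],
    "comedy_a": [0.1, 1.0], "comedy_b": [0.2, 0.9],
    "drama_a": [0.7, 0.7],
}
history = ["horror_a", "comedy_a"]

print(retrieve_by_average(history, ITEM_VECS, k=1))
print(retrieve_per_title(history, ITEM_VECS, k=1))
```

On this data the averaged vector sits between the two genres and retrieves the drama title, while the per-title search returns one horror and one comedy title at the cost of one search per watched title.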

● User’s watch behavior shows
patterns of clustering
● Depending on the context,
particular titles should be shown.
E.g. news in the morning, horror
in the evening.
● Key Idea: User embedding should
capture multi-modal interests.
Cluster 1: Romance
Cluster 2: Horror
Cluster 3: Action

● Design Choice 3: Medoid Based
Representation of User. [Pal et al., KDD 2020,
PinnerSage: Multi-Modal User Embedding Framework]
● Medoids to represent cluster
centres. Reason: Cheaper! A medoid is one of
the watched titles, so only its Id is stored
rather than an embedding vector.
Cluster 1
Cluster 2
Cluster 3

● Hierarchical clustering of the user
watch history. Important compared with a
fixed k: the number of clusters can differ per user.
● Huge reduction in daily ingestion
jobs - only store medoids & NNs for
medoids.
Hierarchical Clustering
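A minimal sketch of clustering a watch history hierarchically and keeping one medoid per cluster. It uses complete linkage with a distance threshold as a simple stand-in (PinnerSage itself uses Ward linkage), and the 2-d embeddings, titles and threshold are invented for the example.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_history(history, vecs, threshold):
    """Agglomerative clustering: merge the two closest clusters (complete
    linkage) until the closest pair is farther apart than `threshold`."""
    clusters = [[title] for title in history]

    def linkage(c1, c2):
        return max(dist(vecs[a], vecs[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def medoid(cluster, vecs):
    """The member with the smallest total distance to the rest of its cluster."""
    return min(cluster, key=lambda t: sum(dist(vecs[t], vecs[o]) for o in cluster))

# Made-up embeddings for one user's watch history.
VECS = {
    "romance_1": [0.0, 1.0], "romance_2": [0.1, 0.9], "romance_3": [0.05, 1.0],
    "horror_1": [1.0, 0.0], "horror_2": [0.9, 0.1],
    "action_1": [5.0, 5.0],
}
history = ["romance_1", "romance_2", "horror_1", "horror_2", "romance_3", "action_1"]

clusters = cluster_history(history, VECS, threshold=0.5)
medoids = [medoid(c, VECS) for c in clusters]
print(clusters, medoids)
```

The threshold, not a preset k, decides how many clusters come out, so a user with narrow tastes gets fewer medoids than a user with broad ones. Each medoid is a title id from the history, which is all that needs to be stored.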

● Design Choice 4: Real time!
● Can we move the NN computation online ? Approximate them ?
● FAISS: ANN based RT inference. Get medoids for each user, compute the
ANN online.
● Only medoids need to be stored offline! More savings in compute & storage.

● Design Choice 5: Context Based
Exploration and Sampling
● Cluster Importance: Assign
importance based upon size of the
cluster, recency of the watched
content, time of watch, etc.
● Sample based upon importance.
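One way to sketch importance-based sampling of clusters: score each cluster from its size and recency, then draw medoids without replacement in proportion to that score. The half-life decay, the omission of time of watch, and all names and numbers are assumptions made for the example rather than the scheme used in the talk.

```python
import random

def cluster_importance(size, days_since_last_watch, half_life_days=7.0):
    """Cluster size, halved for every `half_life_days` since the last watch."""
    return size * 0.5 ** (days_since_last_watch / half_life_days)

def sample_medoids(importance_by_medoid, n, rng):
    """Draw up to n distinct medoids with probability proportional to importance."""
    pool = dict(importance_by_medoid)
    picked = []
    while pool and len(picked) < n:
        ids = list(pool)
        choice = rng.choices(ids, weights=[pool[i] for i in ids], k=1)[0]
        picked.append(choice)
        del pool[choice]
    return picked

# Hypothetical user: (cluster size, days since a title in the cluster was watched).
user_clusters = {
    "romance_medoid": (12, 1),
    "horror_medoid": (4, 0),
    "action_medoid": (8, 30),
}
importance = {m: cluster_importance(size, days) for m, (size, days) in user_clusters.items()}

rng = random.Random(0)
print(importance)
print(sample_medoids(importance, n=2, rng=rng))
```

Sampling rather than always taking the top clusters leaves some probability for smaller or older interests, which gives a built-in form of exploration.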

● Design Choice 6: Bring it on!
● Additional signals, Adaptive Clusters, RT clustering, Better handling of
multiple embeddings, Incremental updates
● Sequence prediction: Use transformers to learn what to pay attention to.

Conclusions
● Retrieval is an important area that helps surface relevant content to the
users.
● User interests are multi-modal.
● The road ahead is very promising and exciting.

Thank You!
We are hiring!
Email: jkawale@tubi.tv
Twitter: @jayakawale
