CONTENT-BASED PERSONALIZATION
From Spark to Elasticsearch and Back
Learning Large Scale Models for Content Recommendation
Sonya Liberman
Personalization Team Lead
Outbrain Recommendations Group
Discovery State of Mind
OUTBRAIN helps people discover things that they find interesting.
OUTBRAIN IS EVERYWHERE
OFFICES IN 14 COUNTRIES
9,000+ WEBSITES AND APPS
OUTBRAIN BY THE NUMBERS
300 Billion Content Recommendations Every Month*
800 Million People Worldwide*
* Outbrain Internal Data 2018
Know Your Reader
Outbrain's NLP Engine
Crawling articles where our recommendations are displayed
Crawling the content we recommend
Over 3 million new articles a week
What is a Document About?
Categories: Entertainment/Television
Topics: sitcom
Entities: Jim Parsons, Mayim Bialik, CBS
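The slides don't show the data model behind this, but as a rough illustration, the NLP engine's per-article output could be represented as something like the following hypothetical Scala sketch (class and field names are assumptions, not Outbrain's schema):

```scala
// Hypothetical sketch of the NLP engine's per-document output;
// the real schema is not shown in the slides.
case class DocumentSemantics(
  docId: String,
  categories: Seq[String], // e.g. Seq("Entertainment/Television")
  topics: Seq[String],     // e.g. Seq("sitcom")
  entities: Seq[String]    // e.g. Seq("Jim Parsons", "Mayim Bialik", "CBS")
)

val example = DocumentSemantics(
  docId = "article-123",
  categories = Seq("Entertainment/Television"),
  topics = Seq("sitcom"),
  entities = Seq("Jim Parsons", "Mayim Bialik", "CBS")
)
```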
User Semantic Profile
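As a minimal sketch, assuming the DocumentSemantics class above and a simple count-based weighting (an assumption for illustration, not the production logic), a user semantic profile can be built by aggregating the semantics of the articles a user has read:

```scala
// Minimal sketch: aggregate the semantics of the documents a user read into
// a weighted interest profile. Count-based weights are an assumption; a real
// profile would use smarter weighting, decay, and more signals.
case class UserSemanticProfile(userId: String, interests: Map[String, Double])

def buildProfile(userId: String, readDocs: Seq[DocumentSemantics]): UserSemanticProfile = {
  val terms = readDocs.flatMap(d => d.categories ++ d.topics ++ d.entities)
  val weights = terms.groupBy(identity).map { case (term, occurrences) =>
    term -> occurrences.size.toDouble
  }
  UserSemanticProfile(userId, weights)
}
```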
Predictive Models
1. Content-Based Models
Recommend content based on semantic similarity with user interests (e.g. Music, Tech, Travel)
2. Collaborative Models
If my taste matches someone else's taste, what they liked will probably interest me as well
- Matrix Factorization
- Factorization Machines
- Feature Embedding with Deep Neural Networks

The Machine Learning Challenge: Predicting the Probability of a Click
Feature vector: User Profile, Current Context
Supervision: Click / No Click
A minimal sketch of such a click model follows.
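To make the setup concrete, here is a minimal Spark ML sketch of a click model. Column names such as user_profile_features, context_features, and clicked are hypothetical, and the production feature set and algorithm are richer than plain logistic regression:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Minimal sketch of a click-probability model. Column names are assumptions:
// "user_profile_features" and "context_features" are precomputed vector columns,
// "clicked" is the 0/1 supervision label.
def trainClickModel(trainingSet: DataFrame) = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("user_profile_features", "context_features"))
    .setOutputCol("features")

  val lr = new LogisticRegression()
    .setLabelCol("clicked")
    .setFeaturesCol("features")
    .setProbabilityCol("p_click") // predicted click probability

  lr.fit(assembler.transform(trainingSet))
}
```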
Offline Data Processing and Distributed Machine Learning Framework
Machine Learning, Technical Debt, and You - D. Sculley (Google) | PAPIs.io Boston 2017
What Do Data Scientists Spend the Most Time Doing?
(chart: cleaning and organizing data, collecting datasets, refining algorithms)
Source: Data Science Report (CrowdFlower), https://visit.figure-eight.com/data-science-report.html, 2016
What Makes Preparing Datasets Hard Work?
Multiple data stores: Hive, MySQL, Cassandra
Multiple data sources: recommendations, clicks, user profiles
Logic: deduping, filtering, siloing
Scale: huge tables; generating a large dataset takes a long time
A Variety of Evaluation Metrics
The Need for a Machine Learning Framework
Goal #1 - Agile Research
Focus on modeling, enable quick POC cycles
Goal #2 - Robust Productization
Easy periodic model re-training
Automatic pre-production evaluation and decision making
A Machine Learning Framework on Top of Spark
Distributed Machine Learning Framework
Pipeline stages: 1. Data Collection → 2. Feature Engineering → 3. Model Training → 4. Offline Evaluation Metrics → 5. Model Deployment
Automatic Data Collection (stage 1: Data Collection)
Ongoing hourly jobs that:
1. Aggregate data from multiple sources
2. Apply data pre-processing logic
3. Make datasets available for model training and evaluation
(a sketch of such a job follows)
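A minimal sketch of what such an hourly job might look like on Spark. Table names, column names, and the output path are assumptions for illustration; the real pipeline reads from Hive, MySQL, and Cassandra:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch of an hourly dataset-building job. Table names, columns, and the
// output path are assumptions, not the production configuration.
object HourlyDatasetJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hourly-training-dataset").getOrCreate()

    val recs     = spark.read.table("served_recommendations")  // what we recommended
    val clicks   = spark.read.table("clicks")                  // what was clicked
    val profiles = spark.read.table("user_semantic_profiles")  // who saw it

    val dataset = recs
      .join(clicks, Seq("request_id", "doc_id"), "left_outer")
      .withColumn("label", when(col("click_time").isNotNull, 1.0).otherwise(0.0))
      .join(profiles, Seq("user_id"))
      .dropDuplicates("request_id", "doc_id")                  // pre-processing: dedupe

    dataset.write.mode("overwrite").parquet("/datasets/ctr/latest")
    spark.stop()
  }
}
```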
Offline Model Evaluation (stage 4: Offline Evaluation Metrics)
1. Collects model predictions for our labeled test datasets
2. Applies a large set of evaluation metrics
3. Reports metrics to a centralized datastore
(sketched below)
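A sketch of the evaluation step, assuming binary click labels and a Spark ML style rawPrediction column. The metric shown (AUC) and the metrics table name are assumptions; the real framework applies a much larger set of metrics:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: score a labeled test set and report one metric (AUC) to a metrics table.
// Column and table names are assumptions for illustration.
def evaluateAndReport(spark: SparkSession, modelName: String, predictions: DataFrame): Unit = {
  import spark.implicits._

  val auc = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")
    .setMetricName("areaUnderROC")
    .evaluate(predictions)

  Seq((modelName, "areaUnderROC", auc))
    .toDF("model", "metric", "value")
    .write.mode("append")
    .saveAsTable("ml_offline_metrics") // centralized metrics datastore (assumed table)
}
```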
Goal #1 - Agile Research
(pipeline recap: 1. Data Collection → 2. Feature Engineering → 3. Model Training → 4. Offline Evaluation Metrics → 5. Model Deployment)
Simple Model Interface
Data scientists and algorithm engineers only need to implement their model's logic.
Use a variety of algorithms: Spark.ML, open-source implementations, and "home-made" ones.
Everything else comes out of the box.
A sketch of such an interface follows.
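The slide that showed the actual interface is an image, so here is a hypothetical sketch of what such a model contract could look like (trait and method names are assumptions):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of a "simple model interface": implementers supply only
// the model logic; data collection, evaluation, and deployment are provided
// by the framework around it.
trait RecommendationModel {
  def name: String

  /** Fit the model on a dataset prepared by the data-collection stage. */
  def train(trainingSet: DataFrame): Unit

  /** Return the input with an added score/probability column. */
  def score(examples: DataFrame): DataFrame

  /** Serialize the trained model so the serving layer can load it. */
  def save(path: String): Unit
}
```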
The Need for a Machine Learning Framework (recap)
Goal #1 - Agile Research: focus on modeling, enable quick POC cycles
Goal #2 - Robust Productization: easy periodic model re-training; automatic pre-production evaluation and decision making
ML Framework for both Research and Production
Daily retraining of production models refreshes them as new data becomes available.
(pipeline recap: 1. Data Collection → 2. Feature Engineering → 3. Model Training → 4. Offline Evaluation Metrics → 5. Model Deployment)
Goal #2 - Robust Productization (stages 4-5: Offline Evaluation Metrics → Model Deployment)
Models are evaluated right before deployment, and deployment is conditional on the evaluation results (a sketch of such a gate follows).
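A minimal sketch of the kind of gate this implies. The metric, threshold, and deploy hook are assumptions, not the production mechanism:

```scala
// Sketch of a conditional deployment gate: deploy the freshly retrained model
// only if its offline metric clears a threshold. All names are assumptions.
def maybeDeploy(modelName: String, auc: Double, minAuc: Double)(deploy: String => Unit): Boolean = {
  if (auc >= minAuc) {
    deploy(modelName) // e.g. push the model artifact to the serving layer
    true
  } else {
    // keep the currently deployed model and alert instead of deploying
    println(s"Skipping deployment of $modelName: AUC $auc below threshold $minAuc")
    false
  }
}
```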
The Serving Layer - Our Pace
35K requests/sec
Under 30 ms latency
Millions of potential recommendations
Using Search Technology for the Recommender System Serving Layer
Distributed Search Engines - What Can They Do?
1. Efficiently apply filters over a large inventory
2. Apply (custom) scoring functions
3. Scale with inventory size and request volume
Why Are Search Engines Efficient? The Inverted Index
Example document: "what the day brings" - each term in the index points to the documents that contain it, so queries hit the index instead of scanning the whole corpus (a toy sketch follows).
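A toy sketch of an inverted index over the example phrase, just to show the idea; the extra documents are invented for illustration, and real engines store much more (positions, frequencies, compression):

```scala
// Toy inverted index: map each term to the IDs of the documents containing it.
val docs = Map(
  1 -> "what the day brings",
  2 -> "the day after",
  3 -> "what a day"
)

val invertedIndex: Map[String, Set[Int]] =
  docs.toSeq
    .flatMap { case (id, text) => text.split("\\s+").map(term => term -> id) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

// Lookup is a cheap index hit rather than a scan over all documents:
val hits = invertedIndex.getOrElse("day", Set.empty) // Set(1, 2, 3)
```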
Reducing Recommendations to Search
Content is translated into a searchable index; the user and context are translated into a query of interests and metadata (user interests, user geo, context info).
1. Index the semantics of each article (e.g. Tech, Music, Sports, Celebrities)
2. Generate a query from the user's interests (e.g. Music, Tech, Travel)
3. Get relevant recommendations
(a sketch of such a query follows)
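A hedged sketch of what such a query could look like in Elasticsearch's query DSL; field names like topics, language, and geo, as well as the boost values, are assumptions, since the actual index schema is not shown:

```scala
// Sketch of an Elasticsearch query built from a user's interests, with context
// filters. Field names and boosts are assumptions for illustration.
val userInterests = Seq("music" -> 3.0, "tech" -> 2.0, "travel" -> 1.0)

val shouldClauses = userInterests.map { case (topic, boost) =>
  s"""{ "term": { "topics": { "value": "$topic", "boost": $boost } } }"""
}.mkString(", ")

val query =
  s"""{
     |  "query": {
     |    "bool": {
     |      "filter": [
     |        { "term": { "language": "en" } },
     |        { "term": { "geo": "US" } }
     |      ],
     |      "should": [ $shouldClauses ]
     |    }
     |  }
     |}""".stripMargin
```

The filter clauses cheaply restrict the inventory, while the should clauses make items matching more of the user's interests score higher.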
Beyond Content Based
Elasticsearch Plugins:
- Write custom scoring functions in native Java
- Deploy them on the Elasticsearch cluster
- Instruct Elasticsearch to use them for ranking
Recommending with Embeddings
1. Index item embedding vectors in Elasticsearch for every recommendation candidate (item_vector_1, item_vector_2, item_vector_3, ...)
2. Send the user embedding vector (user_vector) in the Elasticsearch query
3. Compute relevance with a custom score function using plugins
(a sketch of such a query follows)
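The talk describes a custom native Java scoring plugin; as a stand-in sketch, here is roughly how the same idea looks with Elasticsearch's built-in dense_vector field and script_score query (available in recent Elasticsearch versions). This is an assumption about the setup for illustration, not the actual plugin, and it assumes item_vector is mapped as a dense_vector field:

```scala
// Stand-in sketch: score candidates by similarity between the query-time user
// vector and the indexed item vector. The talk used a custom Java plugin; this
// uses the built-in dense_vector / script_score instead.
val userVector = Seq(0.12, -0.07, 0.33, 0.91) // toy 4-dim embedding; real ones are larger

val query =
  s"""{
     |  "query": {
     |    "script_score": {
     |      "query": { "bool": { "filter": [ { "term": { "language": "en" } } ] } },
     |      "script": {
     |        "source": "cosineSimilarity(params.user_vector, 'item_vector') + 1.0",
     |        "params": { "user_vector": [${userVector.mkString(", ")}] }
     |      }
     |    }
     |  }
     |}""".stripMargin
```

The "+ 1.0" keeps the script score non-negative, as script_score requires; the inner bool filter still restricts the inventory before scoring.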
Recommending with Embeddings - Challenges
1. Incorporating embeddings into the scoring function
2. Synchronization and versioning
3. Handling indirect, feature-based embeddings

Search-Based Serving Architecture of Embeddings-Based Recommendations | S. Liberman, S. Bar, R. Vannerom, D. Rosenstein, R. Lempel. RecSys 2019, Workshop on Online Recommender Systems and User Modeling.
Key Takeaways
1. Invest in your ML pipeline to perform agile research
2. Automate periodic model deployment with control mechanisms
3. Reduce your task to a problem someone already solved
Thank You

Recommender Systems @ Scale, Big Data Europe Conference 2019
