CONTENT-BASED PERSONALIZATION
From Spark to Elasticsearch and Back
Learning Large Scale Models for Content Recommendation
Sonya Liberman
Personalization Team Lead
Outbrain Recommendations Group
Discovery State of Mind
OUTBRAIN helps people discover things that they find interesting.
OUTBRAIN IS EVERYWHERE
OFFICES IN 14 COUNTRIES
9,000+ WEBSITES AND APPS
OUTBRAIN BY THE NUMBERS
300 Billion Content Recommendations Every Month*
800 Million People Worldwide*
* Outbrain Internal Data 2018
Know Your Reader
Outbrain's NLP Engine
Crawling articles where our recommendations are displayed
Crawling the content we recommend
Over 3 million new articles a week
What is a Document About?
Categories: Entertainment/Television
Topics: sitcom
Entities: Jim Parsons, Mayim Bialik, CBS
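The slides don't show the data model behind this, but as a rough illustration, the NLP engine's per-article output could be represented as something like the following hypothetical Scala sketch (class and field names are assumptions, not Outbrain's schema):

```scala
// Hypothetical sketch of the NLP engine's per-document output;
// the real schema is not shown in the slides.
case class DocumentSemantics(
  docId: String,
  categories: Seq[String], // e.g. Seq("Entertainment/Television")
  topics: Seq[String],     // e.g. Seq("sitcom")
  entities: Seq[String]    // e.g. Seq("Jim Parsons", "Mayim Bialik", "CBS")
)

val example = DocumentSemantics(
  docId = "article-123",
  categories = Seq("Entertainment/Television"),
  topics = Seq("sitcom"),
  entities = Seq("Jim Parsons", "Mayim Bialik", "CBS")
)
```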
User Semantic Profile
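As a minimal sketch, assuming the DocumentSemantics class above and a simple count-based weighting (an assumption for illustration, not the production logic), a user semantic profile can be built by aggregating the semantics of the articles a user has read:

```scala
// Minimal sketch: aggregate the semantics of the documents a user read into
// a weighted interest profile. Count-based weights are an assumption; a real
// profile would use smarter weighting, decay, and more signals.
case class UserSemanticProfile(userId: String, interests: Map[String, Double])

def buildProfile(userId: String, readDocs: Seq[DocumentSemantics]): UserSemanticProfile = {
  val terms = readDocs.flatMap(d => d.categories ++ d.topics ++ d.entities)
  val weights = terms.groupBy(identity).map { case (term, occurrences) =>
    term -> occurrences.size.toDouble
  }
  UserSemanticProfile(userId, weights)
}
```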
Predictive Models
1. Content-Based Models
Recommend content based on semantic similarity with user interests (e.g. Music, Tech, Travel)
2. Collaborative Models
If my taste matches someone else's taste, what they liked will probably interest me as well
- Matrix Factorization
- Factorization Machines
- Feature Embedding with Deep Neural Networks

The Machine Learning Challenge: Predicting the Probability of a Click
Feature vector: User Profile, Current Context
Supervision: Click / No Click
A minimal sketch of such a click model follows.
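To make the setup concrete, here is a minimal Spark ML sketch of a click model. Column names such as user_profile_features, context_features, and clicked are hypothetical, and the production feature set and algorithm are richer than plain logistic regression:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Minimal sketch of a click-probability model. Column names are assumptions:
// "user_profile_features" and "context_features" are precomputed vector columns,
// "clicked" is the 0/1 supervision label.
def trainClickModel(trainingSet: DataFrame) = {
  val assembler = new VectorAssembler()
    .setInputCols(Array("user_profile_features", "context_features"))
    .setOutputCol("features")

  val lr = new LogisticRegression()
    .setLabelCol("clicked")
    .setFeaturesCol("features")
    .setProbabilityCol("p_click") // predicted click probability

  lr.fit(assembler.transform(trainingSet))
}
```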
Offline Data Processing and Distributed Machine Learning Framework
Machine Learning, Technical Debt, and You - D. Sculley (Google) | PAPIs.io Boston 2017
What Do Data Scientists Spend the Most Time Doing?
(chart: cleaning and organizing data, collecting datasets, refining algorithms)
Source: Data Science Report (CrowdFlower), https://visit.figure-eight.com/data-science-report.html, 2016
What Makes Preparing Datasets Hard Work?
Multiple data stores: Hive, MySQL, Cassandra
Multiple data sources: recommendations, clicks, user profiles
Logic: deduping, filtering, siloing
Scale: huge tables; generating a large dataset takes a long time
A Variety of Evaluation Metrics
The Need for a Machine Learning Framework
Goal #1 - Agile Research
Focus on modeling, enable quick POC cycles
Goal #2 - Robust Productization
Easy periodic model re-training
Automatic pre-production evaluation and decision making
A Machine Learning Framework on Top of Spark
Distributed Machine Learning Framework
Pipeline stages: 1. Data Collection → 2. Feature Engineering → 3. Model Training → 4. Offline Evaluation Metrics → 5. Model Deployment
Automatic Data Collection (stage 1: Data Collection)
Ongoing hourly jobs that:
1. Aggregate data from multiple sources
2. Apply data pre-processing logic
3. Make datasets available for model training and evaluation
(a sketch of such a job follows)
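A minimal sketch of what such an hourly job might look like on Spark. Table names, column names, and the output path are assumptions for illustration; the real pipeline reads from Hive, MySQL, and Cassandra:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Sketch of an hourly dataset-building job. Table names, columns, and the
// output path are assumptions, not the production configuration.
object HourlyDatasetJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hourly-training-dataset").getOrCreate()

    val recs     = spark.read.table("served_recommendations")  // what we recommended
    val clicks   = spark.read.table("clicks")                  // what was clicked
    val profiles = spark.read.table("user_semantic_profiles")  // who saw it

    val dataset = recs
      .join(clicks, Seq("request_id", "doc_id"), "left_outer")
      .withColumn("label", when(col("click_time").isNotNull, 1.0).otherwise(0.0))
      .join(profiles, Seq("user_id"))
      .dropDuplicates("request_id", "doc_id")                  // pre-processing: dedupe

    dataset.write.mode("overwrite").parquet("/datasets/ctr/latest")
    spark.stop()
  }
}
```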
Offline Model Evaluation (stage 4: Offline Evaluation Metrics)
1. Collects model predictions for our labeled test datasets
2. Applies a large set of evaluation metrics
3. Reports metrics to a centralized datastore
(sketched below)
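A sketch of the evaluation step, assuming binary click labels and a Spark ML style rawPrediction column. The metric shown (AUC) and the metrics table name are assumptions; the real framework applies a much larger set of metrics:

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: score a labeled test set and report one metric (AUC) to a metrics table.
// Column and table names are assumptions for illustration.
def evaluateAndReport(spark: SparkSession, modelName: String, predictions: DataFrame): Unit = {
  import spark.implicits._

  val auc = new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("rawPrediction")
    .setMetricName("areaUnderROC")
    .evaluate(predictions)

  Seq((modelName, "areaUnderROC", auc))
    .toDF("model", "metric", "value")
    .write.mode("append")
    .saveAsTable("ml_offline_metrics") // centralized metrics datastore (assumed table)
}
```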
Goal #1 - Agile Research
(pipeline recap: 1. Data Collection → 2. Feature Engineering → 3. Model Training → 4. Offline Evaluation Metrics → 5. Model Deployment)
Simple Model Interface
Data scientists and algorithm engineers only need to implement their model's logic.
Use a variety of algorithms: Spark.ML, open-source implementations, and "home-made" ones.
Everything else comes out of the box.
A sketch of such an interface follows.
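The slide that showed the actual interface is an image, so here is a hypothetical sketch of what such a model contract could look like (trait and method names are assumptions):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical sketch of a "simple model interface": implementers supply only
// the model logic; data collection, evaluation, and deployment are provided
// by the framework around it.
trait RecommendationModel {
  def name: String

  /** Fit the model on a dataset prepared by the data-collection stage. */
  def train(trainingSet: DataFrame): Unit

  /** Return the input with an added score/probability column. */
  def score(examples: DataFrame): DataFrame

  /** Serialize the trained model so the serving layer can load it. */
  def save(path: String): Unit
}
```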
The Need for a Machine Learning Framework (recap)
Goal #1 - Agile Research: focus on modeling, enable quick POC cycles
Goal #2 - Robust Productization: easy periodic model re-training; automatic pre-production evaluation and decision making
ML Framework for both Research and Production
Daily retraining of production models refreshes them as new data becomes available.
(pipeline recap: 1. Data Collection → 2. Feature Engineering → 3. Model Training → 4. Offline Evaluation Metrics → 5. Model Deployment)
Goal #2 - Robust Productization (stages 4-5: Offline Evaluation Metrics → Model Deployment)
Models are evaluated right before deployment, and deployment is conditional on the evaluation results (a sketch of such a gate follows).
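A minimal sketch of the kind of gate this implies. The metric, threshold, and deploy hook are assumptions, not the production mechanism:

```scala
// Sketch of a conditional deployment gate: deploy the freshly retrained model
// only if its offline metric clears a threshold. All names are assumptions.
def maybeDeploy(modelName: String, auc: Double, minAuc: Double)(deploy: String => Unit): Boolean = {
  if (auc >= minAuc) {
    deploy(modelName) // e.g. push the model artifact to the serving layer
    true
  } else {
    // keep the currently deployed model and alert instead of deploying
    println(s"Skipping deployment of $modelName: AUC $auc below threshold $minAuc")
    false
  }
}
```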
The Serving Layer - Our Pace
35K requests/sec
Under 30 ms latency
Millions of potential recommendations
Using Search Technology for the Recommender System Serving Layer
Distributed Search Engines - What Can They Do?
1. Efficiently apply filters over a large inventory
2. Apply (custom) scoring functions
3. Scale with inventory size and request volume
Why Are Search Engines Efficient? The Inverted Index
Example document: "what the day brings" - each term in the index points to the documents that contain it, so queries hit the index instead of scanning the whole corpus (a toy sketch follows).
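A toy sketch of an inverted index over the example phrase, just to show the idea; the extra documents are invented for illustration, and real engines store much more (positions, frequencies, compression):

```scala
// Toy inverted index: map each term to the IDs of the documents containing it.
val docs = Map(
  1 -> "what the day brings",
  2 -> "the day after",
  3 -> "what a day"
)

val invertedIndex: Map[String, Set[Int]] =
  docs.toSeq
    .flatMap { case (id, text) => text.split("\\s+").map(term => term -> id) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

// Lookup is a cheap index hit rather than a scan over all documents:
val hits = invertedIndex.getOrElse("day", Set.empty) // Set(1, 2, 3)
```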
Reducing Recommendations to Search
Content is translated into a searchable index; the user and context are translated into a query of interests and metadata (user interests, user geo, context info).
1. Index the semantics of each article (e.g. Tech, Music, Sports, Celebrities)
2. Generate a query from the user's interests (e.g. Music, Tech, Travel)
3. Get relevant recommendations
(a sketch of such a query follows)
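A hedged sketch of what such a query could look like in Elasticsearch's query DSL; field names like topics, language, and geo, as well as the boost values, are assumptions, since the actual index schema is not shown:

```scala
// Sketch of an Elasticsearch query built from a user's interests, with context
// filters. Field names and boosts are assumptions for illustration.
val userInterests = Seq("music" -> 3.0, "tech" -> 2.0, "travel" -> 1.0)

val shouldClauses = userInterests.map { case (topic, boost) =>
  s"""{ "term": { "topics": { "value": "$topic", "boost": $boost } } }"""
}.mkString(", ")

val query =
  s"""{
     |  "query": {
     |    "bool": {
     |      "filter": [
     |        { "term": { "language": "en" } },
     |        { "term": { "geo": "US" } }
     |      ],
     |      "should": [ $shouldClauses ]
     |    }
     |  }
     |}""".stripMargin
```

The filter clauses cheaply restrict the inventory, while the should clauses make items matching more of the user's interests score higher.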
Beyond Content Based
Elasticsearch Plugins:
- Write custom scoring functions in native Java
- Deploy them on the Elasticsearch cluster
- Instruct Elasticsearch to use them for ranking
Recommending with Embeddings
1. Index item embedding vectors in Elasticsearch for every recommendation candidate (item_vector_1, item_vector_2, item_vector_3, ...)
2. Send the user embedding vector (user_vector) in the Elasticsearch query
3. Compute relevance with a custom score function using plugins
(a sketch of such a query follows)
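The talk describes a custom native Java scoring plugin; as a stand-in sketch, here is roughly how the same idea looks with Elasticsearch's built-in dense_vector field and script_score query (available in recent Elasticsearch versions). This is an assumption about the setup for illustration, not the actual plugin, and it assumes item_vector is mapped as a dense_vector field:

```scala
// Stand-in sketch: score candidates by similarity between the query-time user
// vector and the indexed item vector. The talk used a custom Java plugin; this
// uses the built-in dense_vector / script_score instead.
val userVector = Seq(0.12, -0.07, 0.33, 0.91) // toy 4-dim embedding; real ones are larger

val query =
  s"""{
     |  "query": {
     |    "script_score": {
     |      "query": { "bool": { "filter": [ { "term": { "language": "en" } } ] } },
     |      "script": {
     |        "source": "cosineSimilarity(params.user_vector, 'item_vector') + 1.0",
     |        "params": { "user_vector": [${userVector.mkString(", ")}] }
     |      }
     |    }
     |  }
     |}""".stripMargin
```

The "+ 1.0" keeps the script score non-negative, as script_score requires; the inner bool filter still restricts the inventory before scoring.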
Recommending with Embeddings - Challenges
1. Incorporating embeddings into the scoring function
2. Synchronization and versioning
3. Handling indirect, feature-based embeddings

Search-Based Serving Architecture of Embeddings-Based Recommendations | S. Liberman, S. Bar, R. Vannerom, D. Rosenstein, R. Lempel. RecSys 2019, Workshop on Online Recommender Systems and User Modeling.
Key Takeaways
1. Invest in your ML pipeline to perform agile research
2. Automate periodic model deployment with control mechanisms
3. Reduce your task to a problem someone already solved
Thank You

Recommender Systems @ Scale, Big Data Europe Conference 2019
