SlideShare a Scribd company logo
1 of 142
Download to read offline
Bayesian Optimization for Balancing Metrics in
Recommender Systems
IJCAI Tutorial 2020
Yunbo Ouyang
Sr. Software Engineer
Brendan Gavin
Software Engineer
Cyrus DiCiccio
Sr. Software Engineer
Kinjal Basu
Staff Software Engineer
Viral Gupta
Tech Lead Comms AI
Overall
Agenda
1 Introduction to
Bayesian Optimization
2 Reformulating a
Recommender System
3 Practical
Considerations
4 Infrastructure
5 Future Direction
Introduction to Bayesian Optimization
Yunbo Ouyang
Sr. Software Engineer
Brendan Gavin
Software Engineer
Introduction to
Bayesian
Optimization
1 Problem Setup
2 Gaussian Processes
3 Optimization with GP's
4 Neural Architecture
Search
5 Demo
Why Bayesian Optimization?
E.g. in machine learning: optimizing hyperparameters
f(): F1, precision, recall, accuracy, etc
Why Bayesian Optimization?
E.g. in machine learning: optimizing hyperparameters
feature vector
parameters
hyperparameters
f(): F1, precision, recall, accuracy, etc
parameters: neural network weights, decision tree features to split on
hyperparameters: learning rate, classification threshold
Why Bayesian Optimization?
E.g. in machine learning: optimizing hyperparameters
feature vector
parameters
hyperparameters
parameters: fast/efficient to optimize
hyperparameters: slow/resource intensive to optimize
Why Bayesian Optimization?
E.g. in machine learning: optimizing hyperparameters
Want to optimize with a method that:
β€’ minimizes function evaluations
β€’ can optimize black box
β€’ quantifies uncertainty in solution
Bayesian optimization
β€’ model as a Gaussian random process
β€’ fit model to data
β€’ use uncertainty quantification to inform search for optimal x
Bayesian optimization
Bayesian optimization
Bayesian optimization
Gaussian Processes
Gaussian Processes: distributions for sampling functions
Gaussian random variable
Gaussian Processes: distributions for sampling functions
Gaussian random variable
Gaussian multivariate RV
Gaussian Processes: distributions for sampling functions
Gaussian random variable
Gaussian multivariate RV
Gaussian process
Gaussian Processes: distributions for sampling functions
Mean function
95% confidence
from
Gaussian Processes: distributions for sampling functions
Gaussian Processes: distributions for sampling functions
Gaussian Processes: distributions for sampling functions
Noiseless
Measurements
Measurements
With Noise
Gaussian Processes: distributions for sampling functions
Multitask Gaussian processes
Gaussian Processes: distributions for sampling functions
Multitask Gaussian processes
ICM approximation:
Optimization with Gaussian
Processes
Optimization with Gaussian processes
The basic recipe:
1. Choose kernel and mean functions
2. Sample function at several points
3. Use maximum likelihood to fit GP parameters
4. Are we done? If not, select new points to
sample from; GOTO step 3
Choosing mean and kernel functions
Mean function: usually
Choosing mean and kernel functions
Mean function: usually
Choosing mean and kernel functions
Mean function: usually
Kernel function: usually
Choosing mean and kernel functions
Mean function: usually
Kernel function: usually
RBF Kernel
Choosing mean and kernel functions
Mean function: usually
Kernel function: usually
RBF Kernel Matern Kernel
Which values of x to sample?
Start out with random points
Later, use acquisition function
Acquisition functions
Balance exploration and exploitation
Expected improvement:
Upper confidence bound (UCB):
Thompson sampling:
Acquisition functions
Expected Improvement
Upper confidence bound
Gaussian Process Regression: use maximum likelihood to fit GP parameters
Sample function and treat problem as discrete Gaussian RV
Use maximum likelihood to fit kernel parameters
Gaussian Process Regression: use maximum likelihood to fit GP parameters
Optimization with Gaussian processes
The basic recipe:
1. Choose kernel and mean functions
2. Sample function at several points
3. Use maximum likelihood to fit GP parameters
4. Are we done? If not, select new points to
sample from; GOTO step 3
Additional Resources
Gaussian process book: http://www.gaussianprocess.org/
β€’ Especially chapters 2 and 5
Peter Frazier’s tutorial on Bayesian optimization: arXiv:1807.02811
Eytan Bakshy’s publications: https://eytan.github.io/
Neural Architecture Search
(NAS)
Components
1 Search Space
2 Search Strategy
3 Performance
Estimation Strategy
NAS Overlapped with BO
β€’ Search Space
β€’ Both BO and NAS need to define a
proper search space
β€’ Search Strategy
β€’ BO is one important search strategy
β€’ Performance Estimation Strategy
β€’ Standard training and evaluation is
expensive. Performance estimation is
required
Search Space
The choice of the search space largely determines the difficulty of the
optimization problem
Search Strategy
β€’ Existing methods: Random Search, Bayesian Optimization, Evolutionary
Methods, Reinforcement Learning, Gradient-based Methods
β€’ BO has not been widely applied to NAS yet since Gaussian Process
focuses more on low-dimensional continuous optimization problems
Performance Estimation Strategy
β€’ Training each architecture to be evaluated from scratch frequently
yields computational demands in the order of thousands of GPU days
for NAS
References
β€’ Neural Architecture Search Survey: https://arxiv.org/pdf/1808.05377.pdf
β€’ AutoML online book: http://www.automl.org/book/
β€’ A blog for Neural Architecture Search: https://lilianweng.github.io/lil-
log/2020/08/06/neural-architecture-search.html
Bayesian Optimization Demo
Bayesian Optimization Framework Overview
β€’ GPyOpt: A Bayesian Optimization library based on GPy. Not actively
developed and maintained as of now.
β€’ Ray-Tune: Have multiple hyperparameter tuning frameworks integrated,
including BayesianOptimization and BoTorch.
β€’ BoTorch: A flexible and scalable Bayesian Optimization library based on
GPyTorch and PyTorch.
Advantages of BoTorch
β€’ BoTorch provides a modular and easily extensible interface for
composing Bayesian optimization primitives, including probabilistic
models, acquisition functions, and optimizers.
β€’ BoTorch harnesses the power of PyTorch, including auto-differentiation,
native support for highly parallelized modern hardware (e.g. GPUs)
using device-agnostic code, and a dynamic computation graph.
BoTorch Models: SingleTaskGP
β€’ A single-task exact GP model. Use this model when you have
independent output(s) and all outputs use the same training data.
BoTorch Models: FixedNoiseGP
β€’ A single-task exact GP model using fixed noise levels.
BoTorch Models: MultiTaskGP
β€’ Multi-task exact GP that uses a simple ICM (Intrinsic Coregionalization
Model) kernel.
BoTorch Model Fitting: fit_gpytorch_model
Parameters:
β€’ mll (MarginalLogLikelihood) – MarginalLogLikelihood to be maximized
β€’ optimizer (Callable) – The optimizer function
β€’ kwargs (Any) – Arguments passed along to the optimizer function
BoTorch Acquisition Functions: ExpectedImprovement
β€’ Single-outcome Expected Improvement
β€’ EI(x) = E(max(y - best_f, 0)), y ~ f(x)
BoTorch Acquisition Functions: UpperConfidenceBound
β€’ Single-outcome Upper Confidence Bound (UCB).
β€’ UCB(x) = mu(x) + sqrt(beta) * sigma(x)
BoTorch Optimizers: optimize_acqf
β€’ Optimize the acquisition function
β€’ Parameters
β€’ acq_function (AcquisitionFunction) – An AcquisitionFunction.
β€’ bounds (Tensor) – A 2 x d tensor of lower and upper bounds for each column
of X.
β€’ q (int) – The number of candidates.
β€’ num_restarts (int) – The number of starting points for multistart acquisition
function optimization.
β€’ raw_samples (int) – The number of samples for initialization.
Bayesian Optimization Demo Scripts
Bayesian Optimization Demo Scripts -- Continued
Reformulating a Recommendation System Problem
Viral Gupta
Tech Lead Comms AI
Yunbo Ouyang
Sr. Software Engineer
Agenda
1 Real World Problems
2 Examples
3 Solving the Problem
LinkedIn Feed
Mission: Enable
Members to build an
active professional
community that
advances their career.
LinkedIn Feed
Mission: Enable Members
to build an active
professional community
that advances their career.
Heterogenous List:
β€’ Shares from a member’s
connections
β€’ Recommendations such
as jobs, articles, courses,
etc.
β€’ Sponsored content or ads
Member Shares
Jobs
Articles
Ads
Important Metrics
Viral Actions (VA)
Members liked, shared or
commented on an item
Job Applies (JA)
Members applied for a job
Engaged Feed Sessions (EFS)
Sessions where a member
engaged with anything on
feed.
Ranking Function
𝑆 π‘š, 𝑒 ≔ 𝑃!" π‘š, 𝑒 + π‘₯#$% 𝑃#$% π‘š, 𝑒 + π‘₯&" 𝑃&" π‘š, 𝑒
β€’ The weight vector π‘₯ = π‘₯!"#, π‘₯$% controls the balance between the three business metrics:
EFS, VA and JA.
β€’ A Sample Business Strategy is
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. 𝑉𝐴(π‘₯)
𝑠. 𝑑. 𝐸𝐹𝑆 π‘₯ > 𝑐#$%
𝐽𝐴 π‘₯ > 𝑐&"
β€’ m – Member, u - Item
● π‘Œ&,(
)
π‘₯ ∈ 0,1 denotes if the 𝑖-th member during the 𝑗-th session which was served by
parameter π‘₯, did action k or not. Here k = VA, EFS or JA.
● We model this data as follows
● Assume a Gaussian process prior on each of the latent function 𝑓) .
Modeling The Metrics
!
!
!
"
π‘Œ!,"(π‘₯) ~ Gaussian 𝑓$ π‘₯ , 𝜎%
𝑓'(π‘₯) ~ N 0, 𝐾()$(π‘₯, π‘₯ )
Reformulation
We approximate each of the metrics as:
𝑉𝐴 π‘₯ = 𝑓&'(π‘₯)
E 𝐹𝑆 π‘₯ = 𝑓()*(π‘₯)
𝐽𝐴 π‘₯ = 𝑓+' (π‘₯)
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. 𝑉𝐴(π‘₯)
𝑠. 𝑑. EF 𝑆 π‘₯ > 𝑐()*
JA π‘₯ > 𝑐+'
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓&'(π‘₯)
𝑠. 𝑑. 𝑓()*(π‘₯) > 𝑐()*
𝑓+'(π‘₯) > 𝑐+'
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓 π‘₯
π‘₯ ∈ 𝑋
Benefit: The last problem can now be solved using techniques from the literature of Bayesian
Optimization.
The original optimization problem can be written through this parametrization.
PYMK
Recommendation
● Main driver of network growth at LinkedIn.
● Besides network growth, Engagement is another Important piece.
Business Metrics
● Accept Rate: the fraction of accepted invitations out of all invitations sent from
PYMK.
● Invite Rate: the fraction of impressions for which members send invitations, out of all
impressions delivered.
● Member Engagement Metric: the fraction of invitations sent that helps member
engagement.
PYMK Recommendation
PYMK Scoring Function
𝑆 π‘š, 𝑐 ≔ 𝑃*+,-./"00/1. π‘š, 𝑐 βˆ— (1 + 𝛼 𝑃#+232/ π‘š, 𝑐 )
β€’ pInviteAccept is a score predicting a combination of accept score and
invite score.
β€’ pEngage is a score predicting member engagement score.
β€’ 𝛼 is the hyper-parameter we need to tune
β€’ m – Member, c - candidate
Reformulation
We approximate each of the metrics as:
𝐼𝑛𝑣𝑖𝑑𝑒 π‘₯ = 𝑓!,-!./(π‘₯)
Accept π‘₯ = 𝑓011/2.(π‘₯)
πΈπ‘›π‘”π‘Žπ‘”π‘’ π‘₯ = 𝑓/,303/ (π‘₯)
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. 𝐼𝑛𝑣𝑖𝑑𝑒(π‘₯)
𝑠. 𝑑. Accept π‘₯ > 𝑐011/2.
Engage π‘₯ > 𝑐/,303/
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓011/2.(π‘₯)
𝑠. 𝑑. 𝑓!,-!./(π‘₯) > 𝑐!,-!./
𝑓/,303/(π‘₯) > 𝑐/,303/
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓 π‘₯
π‘₯ ∈ 𝑋
Benefit: The last problem can now be solved using techniques from the literature of Bayesian
Optimization.
The original optimization problem can be written through this parametrization.
● π‘Œ&,(
)
π‘₯ ∈ 0,1 denotes if the 𝑖-th member during the 𝑗-th session which was served by
parameter π‘₯, did action k or not. Here k = invite, accept, engage.
● We model this data as follows
● Assume a Gaussian process prior on each of the latent function 𝑓) .
Modeling The Metrics
!
!
!
"
π‘Œ!,"(π‘₯) ~ Gaussian 𝑓 π‘₯ , 𝜎%
𝑓'(π‘₯) ~ N 0, 𝐾()$(π‘₯, π‘₯ )
LinkedIn Feed is mixture of OU and SU
Organic
Update
Sponsored
Update
Share
Comment
Like
Revenue
Need to find a desired balance
Share
Comment
Like
Revenue
Revenue Engagement Tradeoff (RENT)
● RENT is a mechanism that blends sponsored updates (SU) and organic updates (OU) to optimize
revenue and engagement tradeoff
● Two important while conflicting objectives are considered
Fast blending system with a light-weight infrastructure
● High velocity on model dev
● Ranking invariance
● Calibration with autotuning
72
Ads use-case
π‘€π‘–π‘›π‘–π‘šπ‘–π‘§π‘’. |π‘†π‘ˆ-415/66-7+6(𝐿𝐢𝐹) βˆ’ π‘†π‘ˆ-415/66-7+6(control)|
β€’ Compare ads models M1 and M2 by making sure each have equal impressions.
β€’ This is controlled by LCF factor that affects the score and hence the ranking on the feed.
β€’ Increasing LCF gives higher score and hence increase rank which increases number of
impressions
Solution
1 Mathematical
Abstraction
2 Bayesian Optimization
3 Thompson Sampling
Mathematical Abstraction – Constrained Optimization
max
8
𝑦(π‘₯)
𝑠. 𝑑. 𝑦9 π‘₯ β‰₯ 𝑐9, β‹― , 𝑦' π‘₯ β‰₯ 𝑐' β‹― , 𝑦: π‘₯ β‰₯ 𝑐:
Goal
β€’ When launching a new model online, we need to tune hyperparameters
to optimize the major metric while making sure the guardrail metrics do
not drop
Notation
β€’ π‘₯: The hyperparameter to tune
β€’ 𝑦(π‘₯): The major business metric to optimize
β€’ 𝑦Q π‘₯ : The guardrail metrics
β€’ 𝑐Q: The constraint value
Model Each Metric via Gaussian Process
Examples
β€’ 𝑦(π‘₯): Member viral actions in Feed; invitation rate in PYMK
β€’ 𝑦Q π‘₯ : Job application metrics in Feed; acceptance rate in PYMK
Use Gaussian Process to model online metrics.
Different distributions can be used (𝑛 π‘₯ is a normalizer if we use Binomial
or Poisson distribution):
β€’ 𝑦 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 𝑓 π‘₯ , 𝜎R
, 𝑓 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 0, 𝐾 π‘₯, π‘₯
β€’ 𝑦 π‘₯ ~π΅π‘–π‘›π‘œπ‘šπ‘–π‘Žπ‘™ 𝑛 π‘₯ , π‘†π‘–π‘”π‘šπ‘œπ‘–π‘‘(𝑓 π‘₯ ) , 𝑓 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 0, 𝐾 π‘₯, π‘₯
β€’ 𝑦 π‘₯ ~π‘ƒπ‘œπ‘–π‘ π‘ π‘œπ‘› 𝑒S T U(T)
, 𝑓 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 0, 𝐾 π‘₯, π‘₯
Combine All Metrics – Method of Lagrange Multipliers
Equivalent to maximize the following
𝐿 π‘₯ = 𝑦 π‘₯ + πœ†V 𝜎W 𝑦V π‘₯ βˆ’ 𝑐V + β‹― πœ†X 𝜎W 𝑦X π‘₯ βˆ’ 𝑐X
𝜎W(𝑧) = 1 + 𝑒YWZ YV
β€’ 𝜎W is a smoothing Sigmoid function which take the value close to 1 for
positive 𝑧 and close to 0 for negative 𝑧.
β€’ πœ†Q 𝜎W 𝑦Q π‘₯ βˆ’ 𝑐Q penalizes constraint violation.
β€’ πœ†Q is a positive multiplier, which is tunable in different applications
β€’ The optimal value of 𝐿(π‘₯) optimizes 𝑦 π‘₯ while satisfying all constraints.
Langrange Multipliers + Gaussian Process
𝐿(π‘₯) = 𝑦 π‘₯ + πœ†V 𝜎W 𝑦V π‘₯ βˆ’ 𝑐V + β‹― πœ†X 𝜎W(𝑦X π‘₯ βˆ’ 𝑐X)
Posterior samples of 𝑦 π‘₯ and 𝑦Q π‘₯ could be easily obtained via
Gaussian Process Regression. So could 𝐿(π‘₯).
Collecting Metrics
β€’ Metrics 𝑦 π‘₯ and 𝑦Q π‘₯ are collected on equal spaced Sobol points
π‘₯V, β‹― , π‘₯S.
β€’ Sobol points offer a lower discrepancy than the same number of random
numbers.
Reweighting Points
β€’ Initially recommender systems are served equally by π‘₯V, β‹― , π‘₯S.
β€’ After collecting metrics and apply Gaussian Process Regression, we
generate samples from 𝐿(π‘₯).
β€’ We need to assign each point π‘₯[ a probability value 𝑝[ to split traffic.
Thompson Sampling
β€’ Use Thompson Sampling to assign probability value 𝑝[ to each point π‘₯[
β€’ Step 1: For all candidate points π‘₯9, β‹― , π‘₯+, obtain posterior samples from 𝐿(π‘₯):
5𝐿 π‘₯9 , 5𝐿 π‘₯; , β‹― , 5𝐿 π‘₯+ .
β€’ Step 2: Pick π‘₯-
βˆ—
such that 5𝐿 π‘₯-
βˆ—
is the largest among 5𝐿 π‘₯9 , 5𝐿 π‘₯; , β‹― , 5𝐿 π‘₯+
β€’ Repeat Step 1 and Step 2 1000 times.
β€’ Step 3: Aggregate π‘₯9
βˆ—
, β‹― , π‘₯9===
βˆ—
. Set 𝑝- as the frequency of π‘₯- in π‘₯9
βˆ—
, β‹― , π‘₯9===
βˆ—
Intuition of Thompson Sampling
β€’ Thompson Sampling is used to select the optimal hyperparameters via
sampling from a distribution π‘₯V, 𝑝V , π‘₯R, 𝑝R , β‹― , π‘₯S, 𝑝S among all
candidates.
β€’ 𝑝[ is equal to the probability of each candidate being optimal among
all.
β€’ Thompson Sampling is proved to converge to the optimal value in
various settings.
β€’ Thompson Sampling is widely used in online A/B test, online advertising
and other applications.
Exploration versus Exploitation
β€’ Exploration: explore sub-optimal but high potential regions in the search
space
β€’ Exploitation: search around the current optimal candidate
β€’ Thompson Sampling automatically balances exploration and
exploitation via a probability distribution
Infrastructure
Viral Gupta
Tech Lead CommsAI
Agenda
1 System Design
2 At-Scale Use
3 Visualizations
4 Results
High Level Steps
We need to run the following steps in a loop
β€’ Optimization Function emits hyper-parameter distribution
β€’ Use the hyper-parameters in online serving
β€’ Collect Metrics
β€’ Call the optimization function
Goals of the Infrastructure
Few realities about the clients of the Optimization package
β€’ Every team within LinkedIn can have different online serving
system. eg: Streaming, offline, rest end point.
β€’ Tracking data and metrics for each system can be very different.
Eg. Different calculation methodology.
To make a minimum viable product and stay focused on solving the
optimization problem well we built the Optimization package as a
library.
Flow Architecture
Built the Optimization package as a library. It will expand as a spark
flow.
Client will call the Optimization package in its existing spark offline
flow.
Client flow will contain the following
β€’ A hadoop/spark flow to collect metrics and aggregate.
β€’ Call the Optimization package which expands into spark flow
β€’ A flow to use the output of the optimizer.
Overall System Architecture
Notifications Example
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. π‘†π‘’π‘ π‘ π‘–π‘œπ‘›π‘ (π‘‘π‘Ÿ)
𝑠. 𝑑. πΆπ‘™π‘–π‘π‘˜π‘  π‘‘π‘Ÿ > 𝑐>?-0'6
𝑆𝑒𝑛𝑑 π‘‰π‘œπ‘™π‘’π‘šπ‘’ π‘‘π‘Ÿ < 𝑐%/+@ !7?A4/
Send the right volume of notification at most opportune time to
improve sessions and clicks keeping send volume same.
𝑃!-6-. π‘š, 𝑖 : π‘ƒπ‘Ÿπ‘œπ‘π‘Žπ‘π‘–π‘™π‘–π‘‘π‘¦ π‘œπ‘“ π‘Ž 𝑣𝑖𝑠𝑖𝑑
𝑃>?-0' π‘š, 𝑖 : π‘ƒπ‘Ÿπ‘œπ‘π‘Žπ‘π‘–π‘™π‘–π‘‘π‘¦ π‘œπ‘“ π‘Ž π‘π‘™π‘–π‘π‘˜ on	item	i member	m	
We have the following ML models
Notifications Scoring Function
𝑆 π‘š, 𝑖 ≔ 𝑃>?-0' π‘š, 𝑖 + π‘₯B 𝑃!-6-. π‘š, 𝑖 > π‘₯.C
β€’ The weight vector π‘₯ = π‘₯W, π‘₯] controls the balance between the
business metrics: Sessions, Clicks, Send Volume.
β€’ A Sample Business Strategy is
β€’ m – Member, i - Item
π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. π‘†π‘’π‘ π‘ π‘–π‘œπ‘›π‘ (π‘₯)
𝑠. 𝑑. πΆπ‘™π‘–π‘π‘˜π‘  π‘₯ > 𝑐>?-0'6
𝑆𝑒𝑛𝑑 π‘‰π‘œπ‘™π‘’π‘šπ‘’ π‘₯ < 𝑐%/+@ !7?A4/
Notifications Optimization
Optimizer outputs a probability distribution over parameters.
Using Optimizer output
The next step is to just map all members in the treatment to parameter
Set using the CDF.
Member Assignment to hyper-parameters
Optimizer outputs a probability distribution over parameters.
Overall System Architecture
Tracking Data Aggregation
Tracking Data from user activity logs are available on HDFS.
The raw data gets aggregated by the client flows.
Library API: Problem Specification
β€’ Problem Specification
β€’ Treatment models and a control model
β€’ Parameters and search range
β€’ Objective metric and constraint metrics
β€’ The number of exploration iterations
β€’ Exploration – Exploitation Setup
β€’ The algorithm starts with a few
exploration iterations that outputs uniform
serving probability.
β€’ Exploration stage is the stage in
the algorithm which explores each point
equally while exploitation stage is the
stage to reassign probabilities to different
points.
Library API: Problem Specification Continued
Offline System
The heart of the product
Tracking
β€’ All member activities are
tracked with the parameter
of interest.
β€’ ETL into HDFS for easy
consumption
Utility Evaluation
β€’ Using the tracking data we
generate (π‘₯, 𝑓! π‘₯ ) for
each function k.
β€’ The data is kept in
appropriate schema that is
problem agnostic.
Bayesian Optimization
β€’ The data and the problem
specifications are input to this.
β€’ Using the data, we first
estimate each of the posterior
distributions of the latent
functions.
β€’ Sample from those
distributions to get distribution
of the parameter π‘₯ which
maximizes the objective.
β€’ The Bayesian Optimization library generates
β€’ A set of potential candidates for trying in the next round (π‘₯9, π‘₯;, … , π‘₯+)
β€’ A probability of how likely each point is the true maximizer (𝑝9, 𝑝;, … , 𝑝+) such that
βˆ‘-D9
+
𝑝- = 1
β€’ To serve members with the above distribution, each memberId is mapped to [0,1] using a
hashing function β„Ž. For example
βˆ‘-D9
'
𝑝- < β„Ž π‘’π‘ π‘’π‘Ÿ ≀ βˆ‘-D9
'E9
𝑝-
Then my feed is served with parameter π‘₯'E9
β€’ The parameter store (depending on use-case) can contain
β€’ <parameterValue, probability> i.e. π‘₯-, 𝑝- or
β€’ <memberId, parameterValue>
The Parameter Store and Online Serving
Online System
Serving hundreds of millions of users
Parameter Sampling
β€’ For each member π‘š visiting
LinkedIn,
β€’ Depending on the parameter
store, we either evaluate <m,
parameterValue>
β€’ Or we directly call the store to
retrieve <m,
parameterValue>
Online Serving
β€’ Depending on the parameter
value that is retrieved (say x),
the notification item is scored
according to the ranking
function and served
𝑆 π‘š, 𝑒 ≔ 𝑃>?-0' π‘š, 𝑖 +
π‘₯B 𝑃!-6-. π‘š, 𝑖 > π‘₯.C
Library Design
Library design can be broken into 2 components – Function fitting
Module and Optimization module.
Entities
β€’ Task is the single unit of a problem that we want to solve.
β€’ OptimizationToken can be either objective or a constraint. A
given task object can have multiple optimizationTokens.
β€’ UtilitySegment encapsulates all the information for modelling a
single latent function fQ.
β€’ ObservationDataToken is used to store observed data. The raw
metrics 𝑦 (vector), hyper parameters π‘₯ (matrix), sampleSize 𝑛
(vector)
Task Object
Visualization
Tools
Plots (When tuning 1d parameter)
Shows how Send Volume for a cohort (FourByFour)
changes as we change threshold.
The red curve is the observed metrics.
The middle blue line is the mean of the GP estimate.
The left and right blue lines are the GP variance
estimates.
Results
Online A/B Testing Results
Operational Efficiency
β€’ Feed model tuning needs 0.5x traffic and improve the tuning
speed by 2.5x .
β€’ Ads model tuning improved by 4x.
β€’ Notification model tuning improved by 3x.
Key Takeaways
● Removes the human in the loop: Fully automatic process to find the
optimal parameters.
● Drastically improves developer productivity.
● Can scale to multiple competing metrics.
● Very easy onboarding infra for multiple vertical teams. Currently used by
Ads, Feed, Notifications, PYMK, etc.
Extensions of Bayesian Optimization
Cyrus DiCiccio
Sr. Software Engineer
The Simplest Optimization Problem
Our goal is to efficiently solve an optimization problem of
the form
Subject to constraints
Basic Bayesian Optimization
Observe data as
where the error term is assumed to follow a normal distribution,
and the mean function of interest is assumed to follow a Gaussian process
with mean mu, typically assumed to be identically zero, and covariance
kernel K (common choices include the RBF kernel or the Matern kernel).
Some Natural Extensions
1) Often there is time dependence in A/B experiments.
β€’ Day over-day fluctuations
β€’ User behavior may differ on weekdays vs weekends
2) Can we incorporate other sources of information into the optimization?
3) How can we address heterogeneity in users’ affinity towards treatments
Time Dependence
Traditional Bayesian optimization strongly assumes metrics are stable over
time. Is this a problem?
It can be when using explore/exploit methods
Examples of time dependence in A/B experiments.
β€’ Day over-day fluctuations
β€’ User behavior may differ on weekdays vs weekends
If an exploit step suggests a new point to sample, but this occurs on a
Saturday when users tend to be more active, this may give an overly
optimistic sense of the impact
Additional Sources of Information
It is common in the internet industry to iterate through a large volume of
models. Past experiment data may be suggestive towards the best
hyperparameters for a new model
Many companies have developed β€œoffline replay” systems which simulate
the results that would have been seen had users been given a particular
treatment.
Such sources of additional information can be leveraged to expedite
optimization
β€œPersonalized” Hyperparameter Tuning
It is plausible that different groups of members may have different affinity
towards model hyperparameters
For instance, active users of a platform may be very interested in frequent
updates
On the other hand, less active users may not appreciate frequent pings
This can be handled in a notification system by allowing a separate
hyperparameter for the active and less active users.
Commonalities
β€’ Each of these extensions are easily accommodated through modifications
of the Bayesian optimization framework
β€’ They are all used to improve the speed and accuracy of function fits
Time Varying
Models
Time Dependence
A/B experiment metrics often see variations in time, such as day over-day
fluctuations or differences on weekdays vs weekends or holidays
Bayesian optimization typically assumes data is observed as
a more realistic model might be to assume we have a time dependent error
term in addition to the measurement error:
Reformulating this model
The model incorporating time fluctuations can be expressed as
That is, we are now trying to optimize for a function that is allowed to vary with
time!
Practically, this can be accomplished by simple adjustments to the
covariance kernel.
Time Dependent Gaussian Processes
Rather than assuming the function follows a Gaussian process, i.e.
we can instead assume that the function is a Gaussian process indexed by
the hyperparameter and time:
And Bayesian optimization proceeds exactly as in the time-invariant case!
A Simple Example of a Time Dependent Kernel
A simple example of a kernel for a time dependent Gaussian process is
which specifies
and is equivalent to
An Example
Additional Sources
of Information
Additional Sources of Information
It is common in the internet industry to iterate through a large volume of
models. Past experiment data may be suggestive towards the best
hyperparameters for a new model
Many companies have developed β€œoffline replay” systems which simulate
the results that would have been seen had users been given a particular
treatment.
Such sources of additional information can be leveraged to expedite our
predictions of the metrics of interest
Joint Model Of Experiment And Additional Information
Assume we observe metrics of interest as
as well as another β€œproxy” metric
and we believe that g is informative of the function of interest f
We can jointly model the functions f and g to improve our predictions of f
Joint Gaussian Processes
We still assume
and also that
Further assume the Gaussian processes f and g are correlated. Namely
(Note that the common marginal priors is not a necessary assumption, but
one typically made to simplify estimation of kernel parameters)
Extensions To Multiple Sources Of Information
Suppose we are interested in modelling C different Gaussian Processes,
A simple joint model is specified by the Intrinsic Coregionalization Model
(ICM). This specifies that marginally
and also jointly
Notification Sessions Modeling
Send a notification if a score exceeds a threshold
Our goal is to estimate the sessions that would be observed for a given threshold
The sessions induced by sending a notification are easily modelled, for instance
by a survival model
Based on offline data, we can estimate the induced sessions as
Online vs Offline Sessions
Comparison of GP Fit With Left Out Data
Mean Squared Error:
β€’ Without offline data:
0.00928
β€’ With offline data:
0.00325
Relative efficiency: 285%
Group Level
Optimization
Group Definitions
Often, segments of users can be given by ad-hoc definitions based on
domain specific knowledge
Demographic segments (e.g. students vs full-time employees)
β€’ Usage patterns (e.g. heavy vs light users)
Recently, advances have been made towards identifying heterogeneous
cohorts of users which can be leveraged in the cohort definition phase
Causal Discovery of Groups
1. Launch an A/B experiment for a variety of hyperparameter values
2. Using this randomized experiment data, apply an automated method for
discovering subgroups of the population which demonstrate
heterogeneity in treatment effect
3. An example of such a tool is the β€œcausal tree” methodology proposed by
Athey and Imbens (2016)
4. This takes in user features (usage, demographics, etc), and results in a
tree whose splits are chosen according to heterogeneity in treatment
effect
Optimization by Group
We now want to give each group a hyperparameter value which globally
optimizes the metrics of interest.
Suppose we have G groups, and each group g is exposed to
hyperparameter xg
Our optimization problem can be expressed as
which is generally not tractable
Grey Box Optimization
Generally, the metric of interest can be expressed as a (weighted) average
of the metric across the cohorts
That is,
so we estimate the objective metric (and similarly constraint metrics)
separately for each cohort as a function of the hyperparameter for that
cohort
This estimation requires dramatically less data than estimating a single
function of G variables
Grey Box Optimization
Using the decomposition
Bayesian optimizations (explore/exploit) proceeds exactly as previously seen
Now, the mean and variances of the objective and constraints are simply the
sum of the independent GPs for each cohort
A Simple Example
Send a notification if a score exceeds a threshold
Ad-hoc cohorts based on usage frequency:
β€’ Four by four: daily active users
β€’ One by three: weekly active users
β€’ One by one: monthly active users
β€’ Onboarding
β€’ Dormant
Influence Of One Parameter On Number Of Sent Messages
Marginal plot of Send Volume for FourByFour
members
The two red lines are maximum and minimum over
all cohorts except FourByFour
The blue lines are the variance estimates
Summary
β€’ An advantage of Bayesian optimization is that it is easily modified to
suit the nuances of the optimization problem at hand
β€’ Some easy extensions
β€’ Incorporate time dependence seen in many online experiments
β€’ Incorporate additional sources of information
β€’ Personalize user experience through group specific optimization
Thank you
Marco Alvarado, Mavis Li, Yafei Wang, Jason Xu, Jinyun Yan, Shuo Yang, Wensheng Sun,
Yiping Yuan, Ajith Muralidharan, Matthew Walker, Boyi Chen, Jun Jia ,
Haichao Wei, Huiji Gao, Zheng Li, Mohsen Jamali, Shipeng Yu,
Hema Raghavan, Romer Rosales

More Related Content

Similar to Bayesian Optimization for Balancing Metrics in Recommender Systems

Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsScott Clark
Β 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsSigOpt
Β 
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016MLconf
Β 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkSigOpt
Β 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneXiaoweiJiang7
Β 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowDatabricks
Β 
Using Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning PipelinesUsing Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning PipelinesSigOpt
Β 
Using Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning PipelinesUsing Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning PipelinesScott Clark
Β 
Advanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarAdvanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarSigOpt
Β 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Databricks
Β 
Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...
Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...
Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...guest78b81
Β 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningDatabricks
Β 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningShubhmay Potdar
Β 
Intro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft VenturesIntro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft Venturesmicrosoftventures
Β 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningSigOpt
Β 

Similar to Bayesian Optimization for Balancing Metrics in Recommender Systems (20)

Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
Β 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
Β 
Final
FinalFinal
Final
Β 
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Β 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott Clark
Β 
Cutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tuneCutting edge hyperparameter tuning made simple with ray tune
Cutting edge hyperparameter tuning made simple with ray tune
Β 
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflowAdvanced Hyperparameter Optimization for Deep Learning with MLflow
Advanced Hyperparameter Optimization for Deep Learning with MLflow
Β 
Using Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning PipelinesUsing Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning Pipelines
Β 
Using Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning PipelinesUsing Optimal Learning to Tune Deep Learning Pipelines
Using Optimal Learning to Tune Deep Learning Pipelines
Β 
Benchmarking PyCon AU 2011 v0
Benchmarking PyCon AU 2011 v0Benchmarking PyCon AU 2011 v0
Benchmarking PyCon AU 2011 v0
Β 
Advanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarAdvanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise Webinar
Β 
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Apache Spark Based Hyper-Parameter Selection and Adaptive Model Tuning for De...
Β 
Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...
Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...
Multi-criteria meta-parameter tuning for mono-objective stochastic metaheuris...
Β 
Tree building 2
Tree building 2Tree building 2
Tree building 2
Β 
Repair dagstuhl jan2017
Repair dagstuhl jan2017Repair dagstuhl jan2017
Repair dagstuhl jan2017
Β 
Pydata presentation
Pydata presentationPydata presentation
Pydata presentation
Β 
Auto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine LearningAuto-Pilot for Apache Spark Using Machine Learning
Auto-Pilot for Apache Spark Using Machine Learning
Β 
Deep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter TuningDeep Dive into Hyperparameter Tuning
Deep Dive into Hyperparameter Tuning
Β 
Intro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft VenturesIntro to Machine Learning by Microsoft Ventures
Intro to Machine Learning by Microsoft Ventures
Β 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep Learning
Β 

Recently uploaded

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
Β 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
Β 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
Β 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
Β 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
Β 
FULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | DelhiFULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | Delhisoniya singh
Β 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
Β 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
Β 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
Β 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
Β 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
Β 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
Β 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
Β 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
Β 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
Β 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
Β 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
Β 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
Β 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
Β 

Recently uploaded (20)

Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
Β 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
Β 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Β 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
Β 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Β 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
Β 
FULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | DelhiFULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY πŸ” 8264348440 πŸ” Call Girls in Diplomatic Enclave | Delhi
Β 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Β 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
Β 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Β 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
Β 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Β 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
Β 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
Β 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
Β 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
Β 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
Β 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
Β 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
Β 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
Β 

Bayesian Optimization for Balancing Metrics in Recommender Systems

  • 1. Bayesian Optimization for Balancing Metrics in Recommender Systems IJCAI Tutorial 2020 Yunbo Ouyang Sr. Software Engineer Brendan Gavin Software Engineer Cyrus DiCiccio Sr. Software Engineer Kinjal Basu Staff Software Engineer Viral Gupta Tech Lead Comms AI
  • 2. Overall Agenda 1 Introduction to Bayesian Optimization 2 Reformulating a Recommender System 3 Practical Considerations 4 Infrastructure 5 Future Direction
  • 3. Introduction to Bayesian Optimization Yunbo Ouyang Sr. Software Engineer Brendan Gavin Software Engineer
  • 4. Introduction to Bayesian Optimization 1 Problem Setup 2 Gaussian Processes 3 Optimization with GP's 4 Neural Architecture Search 5 Demo
  • 5. Why Bayesian Optimization? E.g. in machine learning: optimizing hyperparameters f(): F1, precision, recall, accuracy, etc
  • 6. Why Bayesian Optimization? E.g. in machine learning: optimizing hyperparameters feature vector parameters hyperparameters f(): F1, precision, recall, accuracy, etc parameters: neural network weights, decision tree features to split on hyperparameters: learning rate, classification threshold
  • 7. Why Bayesian Optimization? E.g. in machine learning: optimizing hyperparameters feature vector parameters hyperparameters parameters: fast/efficient to optimize hyperparameters: slow/resource intensive to optimize
  • 8. Why Bayesian Optimization? E.g. in machine learning: optimizing hyperparameters Want to optimize with a method that: β€’ minimizes function evaluations β€’ can optimize black box β€’ quantifies uncertainty in solution
  • 9. Bayesian optimization β€’ model as a Gaussian random process β€’ fit model to data β€’ use uncertainty quantification to inform search for optimal x
  • 14. Gaussian Processes: distributions for sampling functions Gaussian random variable
  • 15. Gaussian Processes: distributions for sampling functions Gaussian random variable Gaussian multivariate RV
  • 16. Gaussian Processes: distributions for sampling functions Gaussian random variable Gaussian multivariate RV Gaussian process
  • 17. Gaussian Processes: distributions for sampling functions Mean function 95% confidence from
  • 18. Gaussian Processes: distributions for sampling functions
  • 19. Gaussian Processes: distributions for sampling functions
  • 20. Gaussian Processes: distributions for sampling functions Noiseless Measurements Measurements With Noise
  • 21. Gaussian Processes: distributions for sampling functions Multitask Gaussian processes
  • 22. Gaussian Processes: distributions for sampling functions Multitask Gaussian processes ICM approximation:
  • 24. Optimization with Gaussian processes The basic recipe: 1. Choose kernel and mean functions 2. Sample function at several points 3. Use maximum likelihood to fit GP parameters 4. Are we done? If not, select new points to sample from; GOTO step 3
  • 25. Choosing mean and kernel functions Mean function: usually
  • 26. Choosing mean and kernel functions Mean function: usually
  • 27. Choosing mean and kernel functions Mean function: usually Kernel function: usually
  • 28. Choosing mean and kernel functions Mean function: usually Kernel function: usually RBF Kernel
  • 29. Choosing mean and kernel functions Mean function: usually Kernel function: usually RBF Kernel Matern Kernel
  • 30. Which values of x to sample? Start out with random points Later, use acquisition function
  • 31. Acquisition functions Balance exploration and exploitation Expected improvement: Upper confidence bound (UCB): Thompson sampling:
  • 33. Gaussian Process Regression: use maximum likelihood to fit GP parameters Sample function and treat problem as discrete Gaussian RV
  • 34. Use maximum likelihood to fit kernel parameters Gaussian Process Regression: use maximum likelihood to fit GP parameters
  • 35. Optimization with Gaussian processes The basic recipe: 1. Choose kernel and mean functions 2. Sample function at several points 3. Use maximum likelihood to fit GP parameters 4. Are we done? If not, select new points to sample from; GOTO step 3
  • 36. Additional Resources Gaussian process book: http://www.gaussianprocess.org/ β€’ Especially chapters 2 and 5 Peter Frazier’s tutorial on Bayesian optimization: arXiv:1807.02811 Eytan Bakshy’s publications: https://eytan.github.io/
  • 38. Components 1 Search Space 2 Search Strategy 3 Performance Estimation Strategy
  • 39. NAS Overlapped with BO β€’ Search Space β€’ Both BO and NAS need to define a proper search space β€’ Search Strategy β€’ BO is one important search strategy β€’ Performance Estimation Strategy β€’ Standard training and evaluation is expensive. Performance estimation is required
  • 40. Search Space The choice of the search space largely determines the difficulty of the optimization problem
  • 41. Search Strategy β€’ Existing methods: Random Search, Bayesian Optimization, Evolutionary Methods, Reinforcement Learning, Gradient-based Methods β€’ BO has not been widely applied to NAS yet since Gaussian Process focuses more on low-dimensional continuous optimization problems
  • 42. Performance Estimation Strategy β€’ Training each architecture to be evaluated from scratch frequently yields computational demands in the order of thousands of GPU days for NAS
  • 43. References β€’ Neural Architecture Search Survey: https://arxiv.org/pdf/1808.05377.pdf β€’ AutoML online book: http://www.automl.org/book/ β€’ A blog for Neural Architecture Search: https://lilianweng.github.io/lil- log/2020/08/06/neural-architecture-search.html
  • 45. Bayesian Optimization Framework Overview β€’ GPyOpt: A Bayesian Optimization library based on GPy. Not actively developed and maintained as of now. β€’ Ray-Tune: Have multiple hyperparameter tuning frameworks integrated, including BayesianOptimization and BoTorch. β€’ BoTorch: A flexible and scalable Bayesian Optimization library based on GPyTorch and PyTorch.
  • 46. Advantages of BoTorch β€’ BoTorch provides a modular and easily extensible interface for composing Bayesian optimization primitives, including probabilistic models, acquisition functions, and optimizers. β€’ BoTorch harnesses the power of PyTorch, including auto-differentiation, native support for highly parallelized modern hardware (e.g. GPUs) using device-agnostic code, and a dynamic computation graph.
  • 47. BoTorch Models: SingleTaskGP β€’ A single-task exact GP model. Use this model when you have independent output(s) and all outputs use the same training data.
  • 48. BoTorch Models: FixedNoiseGP β€’ A single-task exact GP model using fixed noise levels.
  • 49. BoTorch Models: MultiTaskGP β€’ Multi-task exact GP that uses a simple ICM (Intrinsic Coregionalization Model) kernel.
  • 50. BoTorch Model Fitting: fit_gpytorch_model Parameters: β€’ mll (MarginalLogLikelihood) – MarginalLogLikelihood to be maximized β€’ optimizer (Callable) – The optimizer function β€’ kwargs (Any) – Arguments passed along to the optimizer function
  • 51. BoTorch Acquisition Functions: ExpectedImprovement β€’ Single-outcome Expected Improvement β€’ EI(x) = E(max(y - best_f, 0)), y ~ f(x)
  • 52. BoTorch Acquisition Functions: UpperConfidenceBound β€’ Single-outcome Upper Confidence Bound (UCB). β€’ UCB(x) = mu(x) + sqrt(beta) * sigma(x)
  • 53. BoTorch Optimizers: optimize_acqf β€’ Optimize the acquisition function β€’ Parameters β€’ acq_function (AcquisitionFunction) – An AcquisitionFunction. β€’ bounds (Tensor) – A 2 x d tensor of lower and upper bounds for each column of X. β€’ q (int) – The number of candidates. β€’ num_restarts (int) – The number of starting points for multistart acquisition function optimization. β€’ raw_samples (int) – The number of samples for initialization.
  • 55. Bayesian Optimization Demo Scripts -- Continued
  • 56. Reformulating a Recommendation System Problem Viral Gupta Tech Lead Comms AI Yunbo Ouyang Sr. Software Engineer
  • 57. Agenda 1 Real World Problems 2 Examples 3 Solving the Problem
  • 58. LinkedIn Feed Mission: Enable Members to build an active professional community that advances their career.
  • 59. LinkedIn Feed Mission: Enable Members to build an active professional community that advances their career. Heterogenous List: β€’ Shares from a member’s connections β€’ Recommendations such as jobs, articles, courses, etc. β€’ Sponsored content or ads Member Shares Jobs Articles Ads
  • 60. Important Metrics Viral Actions (VA) Members liked, shared or commented on an item Job Applies (JA) Members applied for a job Engaged Feed Sessions (EFS) Sessions where a member engaged with anything on feed.
  • 61. Ranking Function 𝑆 π‘š, 𝑒 ≔ 𝑃!" π‘š, 𝑒 + π‘₯#$% 𝑃#$% π‘š, 𝑒 + π‘₯&" 𝑃&" π‘š, 𝑒 β€’ The weight vector π‘₯ = π‘₯!"#, π‘₯$% controls the balance between the three business metrics: EFS, VA and JA. β€’ A Sample Business Strategy is π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. 𝑉𝐴(π‘₯) 𝑠. 𝑑. 𝐸𝐹𝑆 π‘₯ > 𝑐#$% 𝐽𝐴 π‘₯ > 𝑐&" β€’ m – Member, u - Item
  • 62. ● π‘Œ&,( ) π‘₯ ∈ 0,1 denotes if the 𝑖-th member during the 𝑗-th session which was served by parameter π‘₯, did action k or not. Here k = VA, EFS or JA. ● We model this data as follows ● Assume a Gaussian process prior on each of the latent function 𝑓) . Modeling The Metrics ! ! ! " π‘Œ!,"(π‘₯) ~ Gaussian 𝑓$ π‘₯ , 𝜎% 𝑓'(π‘₯) ~ N 0, 𝐾()$(π‘₯, π‘₯ )
  • 63. Reformulation We approximate each of the metrics as: 𝑉𝐴 π‘₯ = 𝑓&'(π‘₯) E 𝐹𝑆 π‘₯ = 𝑓()*(π‘₯) 𝐽𝐴 π‘₯ = 𝑓+' (π‘₯) π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. 𝑉𝐴(π‘₯) 𝑠. 𝑑. EF 𝑆 π‘₯ > 𝑐()* JA π‘₯ > 𝑐+' π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓&'(π‘₯) 𝑠. 𝑑. 𝑓()*(π‘₯) > 𝑐()* 𝑓+'(π‘₯) > 𝑐+' π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓 π‘₯ π‘₯ ∈ 𝑋 Benefit: The last problem can now be solved using techniques from the literature of Bayesian Optimization. The original optimization problem can be written through this parametrization.
  • 65. ● Main driver of network growth at LinkedIn. ● Besides network growth, Engagement is another Important piece. Business Metrics ● Accept Rate: the fraction of accepted invitations out of all invitations sent from PYMK. ● Invite Rate: the fraction of impressions for which members send invitations, out of all impressions delivered. ● Member Engagement Metric: the fraction of invitations sent that helps member engagement. PYMK Recommendation
  • 66. PYMK Scoring Function 𝑆 π‘š, 𝑐 ≔ 𝑃*+,-./"00/1. π‘š, 𝑐 βˆ— (1 + 𝛼 𝑃#+232/ π‘š, 𝑐 ) β€’ pInviteAccept is a score predicting a combination of accept score and invite score. β€’ pEngage is a score predicting member engagement score. β€’ 𝛼 is the hyper-parameter we need to tune β€’ m – Member, c - candidate
  • 67. Reformulation We approximate each of the metrics as: 𝐼𝑛𝑣𝑖𝑑𝑒 π‘₯ = 𝑓!,-!./(π‘₯) Accept π‘₯ = 𝑓011/2.(π‘₯) πΈπ‘›π‘”π‘Žπ‘”π‘’ π‘₯ = 𝑓/,303/ (π‘₯) π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. 𝐼𝑛𝑣𝑖𝑑𝑒(π‘₯) 𝑠. 𝑑. Accept π‘₯ > 𝑐011/2. Engage π‘₯ > 𝑐/,303/ π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓011/2.(π‘₯) 𝑠. 𝑑. 𝑓!,-!./(π‘₯) > 𝑐!,-!./ 𝑓/,303/(π‘₯) > 𝑐/,303/ π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’ 𝑓 π‘₯ π‘₯ ∈ 𝑋 Benefit: The last problem can now be solved using techniques from the literature of Bayesian Optimization. The original optimization problem can be written through this parametrization.
  • 68. ● π‘Œ&,( ) π‘₯ ∈ 0,1 denotes if the 𝑖-th member during the 𝑗-th session which was served by parameter π‘₯, did action k or not. Here k = invite, accept, engage. ● We model this data as follows ● Assume a Gaussian process prior on each of the latent function 𝑓) . Modeling The Metrics ! ! ! " π‘Œ!,"(π‘₯) ~ Gaussian 𝑓 π‘₯ , 𝜎% 𝑓'(π‘₯) ~ N 0, 𝐾()$(π‘₯, π‘₯ )
  • 69. LinkedIn Feed is mixture of OU and SU Organic Update Sponsored Update Share Comment Like Revenue
  • 70. Need to find a desired balance Share Comment Like Revenue
  • 71. Revenue Engagement Tradeoff (RENT) ● RENT is a mechanism that blends sponsored updates (SU) and organic updates (OU) to optimize revenue and engagement tradeoff ● Two important while conflicting objectives are considered
  • 72. Fast blending system with a light-weight infrastructure ● High velocity on model dev ● Ranking invariance ● Calibration with autotuning 72
  • 73. Ads use-case π‘€π‘–π‘›π‘–π‘šπ‘–π‘§π‘’. |π‘†π‘ˆ-415/66-7+6(𝐿𝐢𝐹) βˆ’ π‘†π‘ˆ-415/66-7+6(control)| β€’ Compare ads models M1 and M2 by making sure each have equal impressions. β€’ This is controlled by LCF factor that affects the score and hence the ranking on the feed. β€’ Increasing LCF gives higher score and hence increase rank which increases number of impressions
  • 74. Solution 1 Mathematical Abstraction 2 Bayesian Optimization 3 Thompson Sampling
  • 75. Mathematical Abstraction – Constrained Optimization max 8 𝑦(π‘₯) 𝑠. 𝑑. 𝑦9 π‘₯ β‰₯ 𝑐9, β‹― , 𝑦' π‘₯ β‰₯ 𝑐' β‹― , 𝑦: π‘₯ β‰₯ 𝑐: Goal β€’ When launching a new model online, we need to tune hyperparameters to optimize the major metric while making sure the guardrail metrics do not drop Notation β€’ π‘₯: The hyperparameter to tune β€’ 𝑦(π‘₯): The major business metric to optimize β€’ 𝑦Q π‘₯ : The guardrail metrics β€’ 𝑐Q: The constraint value
  • 76. Model Each Metric via Gaussian Process Examples β€’ 𝑦(π‘₯): Member viral actions in Feed; invitation rate in PYMK β€’ 𝑦Q π‘₯ : Job application metrics in Feed; acceptance rate in PYMK Use Gaussian Process to model online metrics. Different distributions can be used (𝑛 π‘₯ is a normalizer if we use Binomial or Poisson distribution): β€’ 𝑦 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 𝑓 π‘₯ , 𝜎R , 𝑓 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 0, 𝐾 π‘₯, π‘₯ β€’ 𝑦 π‘₯ ~π΅π‘–π‘›π‘œπ‘šπ‘–π‘Žπ‘™ 𝑛 π‘₯ , π‘†π‘–π‘”π‘šπ‘œπ‘–π‘‘(𝑓 π‘₯ ) , 𝑓 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 0, 𝐾 π‘₯, π‘₯ β€’ 𝑦 π‘₯ ~π‘ƒπ‘œπ‘–π‘ π‘ π‘œπ‘› 𝑒S T U(T) , 𝑓 π‘₯ ~πΊπ‘Žπ‘’π‘ π‘ π‘–π‘Žπ‘› 0, 𝐾 π‘₯, π‘₯
  • 77. Combine All Metrics – Method of Lagrange Multipliers Equivalent to maximize the following 𝐿 π‘₯ = 𝑦 π‘₯ + πœ†V 𝜎W 𝑦V π‘₯ βˆ’ 𝑐V + β‹― πœ†X 𝜎W 𝑦X π‘₯ βˆ’ 𝑐X 𝜎W(𝑧) = 1 + 𝑒YWZ YV β€’ 𝜎W is a smoothing Sigmoid function which take the value close to 1 for positive 𝑧 and close to 0 for negative 𝑧. β€’ πœ†Q 𝜎W 𝑦Q π‘₯ βˆ’ 𝑐Q penalizes constraint violation. β€’ πœ†Q is a positive multiplier, which is tunable in different applications β€’ The optimal value of 𝐿(π‘₯) optimizes 𝑦 π‘₯ while satisfying all constraints.
  • 78. Langrange Multipliers + Gaussian Process 𝐿(π‘₯) = 𝑦 π‘₯ + πœ†V 𝜎W 𝑦V π‘₯ βˆ’ 𝑐V + β‹― πœ†X 𝜎W(𝑦X π‘₯ βˆ’ 𝑐X) Posterior samples of 𝑦 π‘₯ and 𝑦Q π‘₯ could be easily obtained via Gaussian Process Regression. So could 𝐿(π‘₯).
  • 79. Collecting Metrics β€’ Metrics 𝑦 π‘₯ and 𝑦Q π‘₯ are collected on equal spaced Sobol points π‘₯V, β‹― , π‘₯S. β€’ Sobol points offer a lower discrepancy than the same number of random numbers.
  • 80. Reweighting Points β€’ Initially recommender systems are served equally by π‘₯V, β‹― , π‘₯S. β€’ After collecting metrics and apply Gaussian Process Regression, we generate samples from 𝐿(π‘₯). β€’ We need to assign each point π‘₯[ a probability value 𝑝[ to split traffic.
  • 81. Thompson Sampling β€’ Use Thompson Sampling to assign probability value 𝑝[ to each point π‘₯[ β€’ Step 1: For all candidate points π‘₯9, β‹― , π‘₯+, obtain posterior samples from 𝐿(π‘₯): 5𝐿 π‘₯9 , 5𝐿 π‘₯; , β‹― , 5𝐿 π‘₯+ . β€’ Step 2: Pick π‘₯- βˆ— such that 5𝐿 π‘₯- βˆ— is the largest among 5𝐿 π‘₯9 , 5𝐿 π‘₯; , β‹― , 5𝐿 π‘₯+ β€’ Repeat Step 1 and Step 2 1000 times. β€’ Step 3: Aggregate π‘₯9 βˆ— , β‹― , π‘₯9=== βˆ— . Set 𝑝- as the frequency of π‘₯- in π‘₯9 βˆ— , β‹― , π‘₯9=== βˆ—
  • 82. Intuition of Thompson Sampling β€’ Thompson Sampling is used to select the optimal hyperparameters via sampling from a distribution π‘₯V, 𝑝V , π‘₯R, 𝑝R , β‹― , π‘₯S, 𝑝S among all candidates. β€’ 𝑝[ is equal to the probability of each candidate being optimal among all. β€’ Thompson Sampling is proved to converge to the optimal value in various settings. β€’ Thompson Sampling is widely used in online A/B test, online advertising and other applications.
  • 83. Exploration versus Exploitation β€’ Exploration: explore sub-optimal but high potential regions in the search space β€’ Exploitation: search around the current optimal candidate β€’ Thompson Sampling automatically balances exploration and exploitation via a probability distribution
  • 85. Agenda 1 System Design 2 At-Scale Use 3 Visualizations 4 Results
  • 86. High Level Steps We need to run the following steps in a loop β€’ Optimization Function emits hyper-parameter distribution β€’ Use the hyper-parameters in online serving β€’ Collect Metrics β€’ Call the optimization function
  • 87. Goals of the Infrastructure Few realities about the clients of the Optimization package β€’ Every team within LinkedIn can have different online serving system. eg: Streaming, offline, rest end point. β€’ Tracking data and metrics for each system can be very different. Eg. Different calculation methodology. To make a minimum viable product and stay focused on solving the optimization problem well we built the Optimization package as a library.
  • 88. Flow Architecture Built the Optimization package as a library. It will expand as a spark flow. Client will call the Optimization package in its existing spark offline flow. Client flow will contain the following β€’ A hadoop/spark flow to collect metrics and aggregate. β€’ Call the Optimization package which expands into spark flow β€’ A flow to use the output of the optimizer.
  • 90. Notifications Example π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. π‘†π‘’π‘ π‘ π‘–π‘œπ‘›π‘ (π‘‘π‘Ÿ) 𝑠. 𝑑. πΆπ‘™π‘–π‘π‘˜π‘  π‘‘π‘Ÿ > 𝑐>?-0'6 𝑆𝑒𝑛𝑑 π‘‰π‘œπ‘™π‘’π‘šπ‘’ π‘‘π‘Ÿ < 𝑐%/+@ !7?A4/ Send the right volume of notification at most opportune time to improve sessions and clicks keeping send volume same. 𝑃!-6-. π‘š, 𝑖 : π‘ƒπ‘Ÿπ‘œπ‘π‘Žπ‘π‘–π‘™π‘–π‘‘π‘¦ π‘œπ‘“ π‘Ž 𝑣𝑖𝑠𝑖𝑑 𝑃>?-0' π‘š, 𝑖 : π‘ƒπ‘Ÿπ‘œπ‘π‘Žπ‘π‘–π‘™π‘–π‘‘π‘¦ π‘œπ‘“ π‘Ž π‘π‘™π‘–π‘π‘˜ on item i member m We have the following ML models
  • 91. Notifications Scoring Function 𝑆 π‘š, 𝑖 ≔ 𝑃>?-0' π‘š, 𝑖 + π‘₯B 𝑃!-6-. π‘š, 𝑖 > π‘₯.C β€’ The weight vector π‘₯ = π‘₯W, π‘₯] controls the balance between the business metrics: Sessions, Clicks, Send Volume. β€’ A Sample Business Strategy is β€’ m – Member, i - Item π‘€π‘Žπ‘₯π‘–π‘šπ‘–π‘§π‘’. π‘†π‘’π‘ π‘ π‘–π‘œπ‘›π‘ (π‘₯) 𝑠. 𝑑. πΆπ‘™π‘–π‘π‘˜π‘  π‘₯ > 𝑐>?-0'6 𝑆𝑒𝑛𝑑 π‘‰π‘œπ‘™π‘’π‘šπ‘’ π‘₯ < 𝑐%/+@ !7?A4/
  • 92. Notifications Optimization Optimizer outputs a probability distribution over parameters.
  • 93. Using Optimizer output The next step is to just map all members in the treatment to parameter Set using the CDF.
  • 94. Member Assignment to hyper-parameters Optimizer outputs a probability distribution over parameters.
  • 96. Tracking Data Aggregation Tracking Data from user activity logs are available on HDFS. The raw data gets aggregated by the client flows.
  • 97. Library API: Problem Specification β€’ Problem Specification β€’ Treatment models and a control model β€’ Parameters and search range β€’ Objective metric and constraint metrics β€’ The number of exploration iterations β€’ Exploration – Exploitation Setup β€’ The algorithm starts with a few exploration iterations that outputs uniform serving probability. β€’ Exploration stage is the stage in the algorithm which explores each point equally while exploitation stage is the stage to reassign probabilities to different points.
  • 98. Library API: Problem Specification Continued
  • 99. Offline System The heart of the product Tracking β€’ All member activities are tracked with the parameter of interest. β€’ ETL into HDFS for easy consumption Utility Evaluation β€’ Using the tracking data we generate (π‘₯, 𝑓! π‘₯ ) for each function k. β€’ The data is kept in appropriate schema that is problem agnostic. Bayesian Optimization β€’ The data and the problem specifications are input to this. β€’ Using the data, we first estimate each of the posterior distributions of the latent functions. β€’ Sample from those distributions to get distribution of the parameter π‘₯ which maximizes the objective.
  • 100. β€’ The Bayesian Optimization library generates β€’ A set of potential candidates for trying in the next round (π‘₯9, π‘₯;, … , π‘₯+) β€’ A probability of how likely each point is the true maximizer (𝑝9, 𝑝;, … , 𝑝+) such that βˆ‘-D9 + 𝑝- = 1 β€’ To serve members with the above distribution, each memberId is mapped to [0,1] using a hashing function β„Ž. For example βˆ‘-D9 ' 𝑝- < β„Ž π‘’π‘ π‘’π‘Ÿ ≀ βˆ‘-D9 'E9 𝑝- Then my feed is served with parameter π‘₯'E9 β€’ The parameter store (depending on use-case) can contain β€’ <parameterValue, probability> i.e. π‘₯-, 𝑝- or β€’ <memberId, parameterValue> The Parameter Store and Online Serving
  • 101. Online System Serving hundreds of millions of users Parameter Sampling β€’ For each member π‘š visiting LinkedIn, β€’ Depending on the parameter store, we either evaluate <m, parameterValue> β€’ Or we directly call the store to retrieve <m, parameterValue> Online Serving β€’ Depending on the parameter value that is retrieved (say x), the notification item is scored according to the ranking function and served 𝑆 π‘š, 𝑒 ≔ 𝑃>?-0' π‘š, 𝑖 + π‘₯B 𝑃!-6-. π‘š, 𝑖 > π‘₯.C
  • 102. Library Design Library design can be broken into 2 components – Function fitting Module and Optimization module.
  • 103. Entities β€’ Task is the single unit of a problem that we want to solve. β€’ OptimizationToken can be either objective or a constraint. A given task object can have multiple optimizationTokens. β€’ UtilitySegment encapsulates all the information for modelling a single latent function fQ. β€’ ObservationDataToken is used to store observed data. The raw metrics 𝑦 (vector), hyper parameters π‘₯ (matrix), sampleSize 𝑛 (vector)
  • 106. Plots (When tuning 1d parameter) Shows how Send Volume for a cohort (FourByFour) changes as we change threshold. The red curve is the observed metrics. The middle blue line is the mean of the GP estimate. The left and right blue lines are the GP variance estimates.
  • 108. Online A/B Testing Results
  • 109. Operational Efficiency β€’ Feed model tuning needs 0.5x traffic and improve the tuning speed by 2.5x . β€’ Ads model tuning improved by 4x. β€’ Notification model tuning improved by 3x.
  • 110. Key Takeaways ● Removes the human in the loop: Fully automatic process to find the optimal parameters. ● Drastically improves developer productivity. ● Can scale to multiple competing metrics. ● Very easy onboarding infra for multiple vertical teams. Currently used by Ads, Feed, Notifications, PYMK, etc.
  • 111. Extensions of Bayesian Optimization Cyrus DiCiccio Sr. Software Engineer
  • 112. The Simplest Optimization Problem Our goal is to efficiently solve an optimization problem of the form Subject to constraints
  • 113. Basic Bayesian Optimization Observe data as where the error term is assumed to follow a normal distribution, and the mean function of interest is assumed to follow a Gaussian process with mean mu, typically assumed to be identically zero, and covariance kernel K (common choices include the RBF kernel or the Matern kernel).
  • 114. Some Natural Extensions 1) Often there is time dependence in A/B experiments. β€’ Day over-day fluctuations β€’ User behavior may differ on weekdays vs weekends 2) Can we incorporate other sources of information into the optimization? 3) How can we address heterogeneity in users’ affinity towards treatments
  • 115. Time Dependence Traditional Bayesian optimization strongly assumes metrics are stable over time. Is this a problem? It can be when using explore/exploit methods Examples of time dependence in A/B experiments. β€’ Day over-day fluctuations β€’ User behavior may differ on weekdays vs weekends If an exploit step suggests a new point to sample, but this occurs on a Saturday when users tend to be more active, this may give an overly optimistic sense of the impact
  • 116. Additional Sources of Information It is common in the internet industry to iterate through a large volume of models. Past experiment data may be suggestive towards the best hyperparameters for a new model Many companies have developed β€œoffline replay” systems which simulate the results that would have been seen had users been given a particular treatment. Such sources of additional information can be leveraged to expedite optimization
  • 117. β€œPersonalized” Hyperparameter Tuning It is plausible that different groups of members may have different affinity towards model hyperparameters For instance, active users of a platform may be very interested in frequent updates On the other hand, less active users may not appreciate frequent pings This can be handled in a notification system by allowing a separate hyperparameter for the active and less active users.
  • 118. Commonalities β€’ Each of these extensions are easily accommodated through modifications of the Bayesian optimization framework β€’ They are all used to improve the speed and accuracy of function fits
  • 120. Time Dependence A/B experiment metrics often see variations in time, such as day over-day fluctuations or differences on weekdays vs weekends or holidays Bayesian optimization typically assumes data is observed as a more realistic model might be to assume we have a time dependent error term in addition to the measurement error:
  • 121. Reformulating this model The model incorporating time fluctuations can be expressed as That is, we are now trying to optimize for a function that is allowed to vary with time! Practically, this can be accomplished by simple adjustments to the covariance kernel.
  • 122. Time Dependent Gaussian Processes Rather than assuming the function follows a Gaussian process, i.e. we can instead assume that the function is a Gaussian process indexed by the hyperparameter and time: And Bayesian optimization proceeds exactly as in the time-invariant case!
  • 123. A Simple Example of a Time Dependent Kernel A simple example of a kernel for a time dependent Gaussian process is which specifies and is equivalent to
  • 126. Additional Sources of Information It is common in the internet industry to iterate through a large volume of models. Past experiment data may be suggestive towards the best hyperparameters for a new model Many companies have developed β€œoffline replay” systems which simulate the results that would have been seen had users been given a particular treatment. Such sources of additional information can be leveraged to expedite our predictions of the metrics of interest
  • 127. Joint Model Of Experiment And Additional Information Assume we observe metrics of interest as as well as another β€œproxy” metric and we believe that g is informative of the function of interest f We can jointly model the functions f and g to improve our predictions of f
  • 128. Joint Gaussian Processes We still assume and also that Further assume the Gaussian processes f and g are correlated. Namely (Note that the common marginal priors is not a necessary assumption, but one typically made to simplify estimation of kernel parameters)
  • 129. Extensions To Multiple Sources Of Information Suppose we are interested in modelling C different Gaussian Processes, A simple joint model is specified by the Intrinsic Coregionalization Model (ICM). This specifies that marginally and also jointly
  • 130. Notification Sessions Modeling Send a notification if a score exceeds a threshold Our goal is to estimate the sessions that would be observed for a given threshold The sessions induced by sending a notification are easily modelled, for instance by a survival model Based on offline data, we can estimate the induced sessions as
  • 131. Online vs Offline Sessions
  • 132. Comparison of GP Fit With Left Out Data Mean Squared Error: β€’ Without offline data: 0.00928 β€’ With offline data: 0.00325 Relative efficiency: 285%
  • 134. Group Definitions Often, segments of users can be given by ad-hoc definitions based on domain specific knowledge Demographic segments (e.g. students vs full-time employees) β€’ Usage patterns (e.g. heavy vs light users) Recently, advances have been made towards identifying heterogeneous cohorts of users which can be leveraged in the cohort definition phase
  • 135. Causal Discovery of Groups 1. Launch an A/B experiment for a variety of hyperparameter values 2. Using this randomized experiment data, apply an automated method for discovering subgroups of the population which demonstrate heterogeneity in treatment effect 3. An example of such a tool is the β€œcausal tree” methodology proposed by Athey and Imbens (2016) 4. This takes in user features (usage, demographics, etc), and results in a tree whose splits are chosen according to heterogeneity in treatment effect
  • 136. Optimization by Group We now want to give each group a hyperparameter value which globally optimizes the metrics of interest. Suppose we have G groups, and each group g is exposed to hyperparameter xg Our optimization problem can be expressed as which is generally not tractable
  • 137. Grey Box Optimization Generally, the metric of interest can be expressed as a (weighted) average of the metric across the cohorts That is, so we estimate the objective metric (and similarly constraint metrics) separately for each cohort as a function of the hyperparameter for that cohort This estimation requires dramatically less data than estimating a single function of G variables
  • 138. Grey Box Optimization Using the decomposition Bayesian optimizations (explore/exploit) proceeds exactly as previously seen Now, the mean and variances of the objective and constraints are simply the sum of the independent GPs for each cohort
  • 139. A Simple Example Send a notification if a score exceeds a threshold Ad-hoc cohorts based on usage frequency: β€’ Four by four: daily active users β€’ One by three: weekly active users β€’ One by one: monthly active users β€’ Onboarding β€’ Dormant
  • 140. Influence Of One Parameter On Number Of Sent Messages Marginal plot of Send Volume for FourByFour members The two red lines are maximum and minimum over all cohorts except FourByFour The blue lines are the variance estimates
  • 141. Summary β€’ An advantage of Bayesian optimization is that it is easily modified to suit the nuances of the optimization problem at hand β€’ Some easy extensions β€’ Incorporate time dependence seen in many online experiments β€’ Incorporate additional sources of information β€’ Personalize user experience through group specific optimization
  • 142. Thank you Marco Alvarado, Mavis Li, Yafei Wang, Jason Xu, Jinyun Yan, Shuo Yang, Wensheng Sun, Yiping Yuan, Ajith Muralidharan, Matthew Walker, Boyi Chen, Jun Jia , Haichao Wei, Huiji Gao, Zheng Li, Mohsen Jamali, Shipeng Yu, Hema Raghavan, Romer Rosales