Most large-scale online recommender systems, such as newsfeed ranking, people recommendations, and job recommendations, have multiple utilities or metrics that need to be simultaneously optimized. The machine learning models, each trained to optimize a single utility, are combined through weight parameters to generate the final ranking function. These combination parameters drive business metrics. Finding the right choice of parameters is often done through online A/B experimentation, which can be incredibly complex and time-consuming, especially considering the non-linear effects of these parameters on the metrics of interest.
In this tutorial, we will talk about how Bayesian Optimization techniques can be applied to obtain the parameters for such complex online systems in order to balance the competing metrics. First, we will provide an in-depth introduction to Bayesian Optimization, covering the basics as well as recent advances in the field. Second, we will talk about how to formulate a real-world recommender system problem as a black-box optimization problem that can be solved via Bayesian Optimization, focusing on a few key applications such as newsfeed ranking, people recommendations, and job recommendations. Third, we will talk about the architecture of the solution and how we are able to deploy it for large-scale systems. Finally, we will discuss extensions and some of the future directions in this domain.
6. Why Bayesian Optimization?
E.g. in machine learning: optimizing the hyperparameters of a model f(feature vector; parameters; hyperparameters)
• f(): F1, precision, recall, accuracy, etc.
• parameters: neural network weights, decision tree features to split on
• hyperparameters: learning rate, classification threshold
7. Why Bayesian Optimization?
E.g. in machine learning: optimizing the hyperparameters of a model f(feature vector; parameters; hyperparameters)
• parameters: fast/efficient to optimize
• hyperparameters: slow/resource intensive to optimize
8. Why Bayesian Optimization?
E.g. in machine learning: optimizing hyperparameters
Want to optimize with a method that:
• minimizes function evaluations
• can optimize a black box
• quantifies uncertainty in the solution
9. Bayesian optimization
• model f as a Gaussian random process
• fit the model to data
• use uncertainty quantification to inform the search for the optimal x
24. Optimization with Gaussian processes
The basic recipe:
1. Choose kernel and mean functions
2. Sample function at several points
3. Use maximum likelihood to fit GP parameters
4. Are we done? If not, select new points to sample from; GOTO step 3
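The recipe can be illustrated with a minimal NumPy sketch (not code from the tutorial): an RBF kernel, a zero mean function, a toy objective, a grid search standing in for the maximum-likelihood fit, and the next sample point chosen where the posterior variance is largest:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale, variance=1.0):
    """Squared-exponential (RBF) covariance between 1-d point sets a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_test, lengthscale, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train, lengthscale) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_test, lengthscale)
    Kss = rbf_kernel(x_test, x_test, lengthscale)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

def log_marginal_likelihood(x, y, lengthscale, noise=1e-4):
    """Step 3: the quantity maximized when fitting GP parameters."""
    K = rbf_kernel(x, x, lengthscale) + noise * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(x) * np.log(2 * np.pi))

# Step 2: sample the (expensive, black-box) function at several points.
f = lambda x: np.sin(3 * x)
x_train = np.array([0.0, 0.4, 1.0, 1.7, 2.0])
y_train = f(x_train)

# Step 3: fit the kernel lengthscale by maximum likelihood (grid search here).
grid = np.linspace(0.1, 2.0, 50)
best_ls = max(grid, key=lambda ls: log_marginal_likelihood(x_train, y_train, ls))

# Step 4: not done yet -- pick the next point where posterior variance is largest.
x_test = np.linspace(0.0, 2.0, 101)
mean, var = gp_posterior(x_train, y_train, x_test, best_ls)
next_x = x_test[np.argmax(var)]
```

In practice the lengthscale would be fit with a gradient-based optimizer rather than a grid, and the next point would be chosen by an acquisition function rather than raw variance, but the loop structure is the same.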
33. Gaussian Process Regression: use maximum likelihood to fit GP parameters
Sample the function at a finite set of points and treat the problem as one about a discrete (multivariate) Gaussian random variable
34. Use maximum likelihood to fit kernel parameters
Gaussian Process Regression: use maximum likelihood to fit GP parameters
35. Optimization with Gaussian processes
The basic recipe:
1. Choose kernel and mean functions
2. Sample function at several points
3. Use maximum likelihood to fit GP parameters
4. Are we done? If not, select new points to sample from; GOTO step 3
36. Additional Resources
Gaussian process book: http://www.gaussianprocess.org/
• Especially chapters 2 and 5
Peter Frazier's tutorial on Bayesian optimization: arXiv:1807.02811
Eytan Bakshy's publications: https://eytan.github.io/
39. NAS Overlap with BO
• Search Space
  • Both BO and NAS need to define a proper search space
• Search Strategy
  • BO is one important search strategy
• Performance Estimation Strategy
  • Standard training and evaluation is expensive, so performance estimation is required
40. Search Space
The choice of the search space largely determines the difficulty of the optimization problem.
41. Search Strategy
• Existing methods: Random Search, Bayesian Optimization, Evolutionary Methods, Reinforcement Learning, Gradient-based Methods
• BO has not been widely applied to NAS yet, since Gaussian Processes focus more on low-dimensional continuous optimization problems
42. Performance Estimation Strategy
• Training each architecture to be evaluated from scratch frequently yields computational demands on the order of thousands of GPU days for NAS
43. References
• Neural Architecture Search survey: https://arxiv.org/pdf/1808.05377.pdf
• AutoML online book: http://www.automl.org/book/
• A blog post on Neural Architecture Search: https://lilianweng.github.io/lil-log/2020/08/06/neural-architecture-search.html
45. Bayesian Optimization Framework Overview
• GPyOpt: a Bayesian Optimization library based on GPy. Not actively developed or maintained as of now.
• Ray Tune: has multiple hyperparameter tuning frameworks integrated, including BayesianOptimization and BoTorch.
• BoTorch: a flexible and scalable Bayesian Optimization library based on GPyTorch and PyTorch.
46. Advantages of BoTorch
• BoTorch provides a modular and easily extensible interface for composing Bayesian optimization primitives, including probabilistic models, acquisition functions, and optimizers.
• BoTorch harnesses the power of PyTorch, including auto-differentiation, native support for highly parallelized modern hardware (e.g. GPUs) using device-agnostic code, and a dynamic computation graph.
47. BoTorch Models: SingleTaskGP
• A single-task exact GP model. Use this model when you have independent output(s) and all outputs use the same training data.
50. BoTorch Model Fitting: fit_gpytorch_model
Parameters:
• mll (MarginalLogLikelihood) – the MarginalLogLikelihood to be maximized
• optimizer (Callable) – the optimizer function
• kwargs (Any) – arguments passed along to the optimizer function
53. BoTorch Optimizers: optimize_acqf
• Optimize the acquisition function
• Parameters
  • acq_function (AcquisitionFunction) – an AcquisitionFunction.
  • bounds (Tensor) – a 2 x d tensor of lower and upper bounds for each column of X.
  • q (int) – the number of candidates.
  • num_restarts (int) – the number of starting points for multistart acquisition function optimization.
  • raw_samples (int) – the number of samples for initialization.
59. LinkedIn Feed
Mission: Enable members to build an active professional community that advances their career.
Heterogeneous list:
• Shares from a member's connections
• Recommendations such as jobs, articles, courses, etc.
• Sponsored content or ads
60. Important Metrics
Viral Actions (VA): members liked, shared, or commented on an item
Job Applies (JA): members applied for a job
Engaged Feed Sessions (EFS): sessions where a member engaged with anything on feed
61. Ranking Function
  S(m, u) = P_VA(m, u) + x_EFS · P_EFS(m, u) + x_JA · P_JA(m, u)
• The weight vector x = (x_EFS, x_JA) controls the balance between the three business metrics: EFS, VA and JA.
• A sample business strategy is
  maximize VA(x)
  s.t. EFS(x) > c_EFS
       JA(x) > c_JA
• m – member, u – item
62. Modeling The Metrics
● y_{i,j}^k(x) ∈ {0, 1} denotes whether the i-th member, during the j-th session served with parameter x, did action k or not. Here k = VA, EFS or JA.
● We model this data as follows:
  y_{i,j}^k(x) ~ Gaussian(f_k(x), σ²)
● Assume a Gaussian process prior on each of the latent functions f_k:
  f_k(x) ~ N(0, K_k(x, x′))
63. Reformulation
We approximate each of the metrics as:
  VA(x) = f_VA(x)
  EFS(x) = f_EFS(x)
  JA(x) = f_JA(x)
The business strategy
  maximize VA(x)
  s.t. EFS(x) > c_EFS, JA(x) > c_JA
then becomes
  maximize f_VA(x)
  s.t. f_EFS(x) > c_EFS, f_JA(x) > c_JA
which is an instance of the generic constrained problem
  maximize F(x), x ∈ X
Benefit: the last problem can now be solved using techniques from the Bayesian Optimization literature, since the original optimization problem can be written through this parametrization.
65. PYMK Recommendation
● Main driver of network growth at LinkedIn.
● Besides network growth, engagement is another important piece.
Business Metrics
● Accept Rate: the fraction of accepted invitations out of all invitations sent from PYMK.
● Invite Rate: the fraction of impressions for which members send invitations, out of all impressions delivered.
● Member Engagement Metric: the fraction of invitations sent that help member engagement.
66. PYMK Scoring Function
  S(m, c) = pInviteAccept(m, c) · (1 + α · pEngage(m, c))
• pInviteAccept is a score predicting a combination of accept score and invite score.
• pEngage is a score predicting member engagement.
• α is the hyper-parameter we need to tune
• m – member, c – candidate
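A toy illustration (hypothetical scores, not real model outputs) of how α shifts the ranking:

```python
# Hypothetical candidates; pInviteAccept and pEngage would come from the ML models.
candidates = {
    "cand_a": {"p_invite_accept": 0.30, "p_engage": 0.10},
    "cand_b": {"p_invite_accept": 0.25, "p_engage": 0.90},
}

def pymk_score(c, alpha):
    """S(m, c) = pInviteAccept(m, c) * (1 + alpha * pEngage(m, c))."""
    return c["p_invite_accept"] * (1 + alpha * c["p_engage"])

def rank(alpha):
    return sorted(candidates, key=lambda k: pymk_score(candidates[k], alpha),
                  reverse=True)

order_low = rank(0.0)   # alpha = 0: engagement ignored, cand_a ranks first
order_high = rank(2.0)  # large alpha: engagement dominates, cand_b ranks first
```

This is exactly why α must be tuned: it trades invite/accept probability against engagement in the final ordering.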
67. Reformulation
We approximate each of the metrics as:
  Invite(x) = f_invite(x)
  Accept(x) = f_accept(x)
  Engage(x) = f_engage(x)
The business strategy
  maximize Invite(x)
  s.t. Accept(x) > c_accept, Engage(x) > c_engage
then becomes
  maximize f_invite(x)
  s.t. f_accept(x) > c_accept, f_engage(x) > c_engage
which is an instance of the generic constrained problem
  maximize F(x), x ∈ X
Benefit: the last problem can now be solved using techniques from the Bayesian Optimization literature, since the original optimization problem can be written through this parametrization.
68. Modeling The Metrics
● y_{i,j}^k(x) ∈ {0, 1} denotes whether the i-th member, during the j-th session served with parameter x, did action k or not. Here k = invite, accept, engage.
● We model this data as follows:
  y_{i,j}^k(x) ~ Gaussian(f_k(x), σ²)
● Assume a Gaussian process prior on each of the latent functions f_k:
  f_k(x) ~ N(0, K_k(x, x′))
69. LinkedIn Feed is a mixture of OU and SU
• Organic updates (OU) drive engagement: shares, comments, likes
• Sponsored updates (SU) drive revenue
70. Need to find a desired balance between engagement (shares, comments, likes) and revenue
71. Revenue Engagement Tradeoff (RENT)
● RENT is a mechanism that blends sponsored updates (SU) and organic updates (OU) to optimize the revenue and engagement tradeoff
● Two important but conflicting objectives are considered
72. Fast blending system with a light-weight infrastructure
● High velocity on model development
● Ranking invariance
● Calibration with autotuning
73. Ads use-case
  minimize |N_impressions(LCF) − N_impressions(control)|
• Compare ads models M1 and M2 by making sure each has equal impressions.
• This is controlled by the LCF factor, which affects the score and hence the ranking on the feed.
• Increasing the LCF gives a higher score and hence a higher rank, which increases the number of impressions.
75. Mathematical Abstraction – Constrained Optimization
  max_x y(x)
  s.t. y_1(x) ≥ c_1, ⋯, y_j(x) ≥ c_j, ⋯, y_J(x) ≥ c_J
Goal
• When launching a new model online, we need to tune hyperparameters to optimize the major metric while making sure the guardrail metrics do not drop
Notation
• x: the hyperparameter to tune
• y(x): the major business metric to optimize
• y_j(x): the guardrail metrics
• c_j: the constraint values
76. Model Each Metric via Gaussian Process
Examples
• y(x): member viral actions in Feed; invitation rate in PYMK
• y_j(x): job application metrics in Feed; acceptance rate in PYMK
Use a Gaussian Process to model online metrics. Different distributions can be used (N(x) is a normalizer if we use a Binomial or Poisson distribution):
• y(x) ~ Gaussian(f(x), σ²), with f(x) ~ GaussianProcess(0, K(x, x′))
• y(x) ~ Binomial(N(x), sigmoid(f(x))), with f(x) ~ GaussianProcess(0, K(x, x′))
• y(x) ~ Poisson(N(x) · e^{f(x)}), with f(x) ~ GaussianProcess(0, K(x, x′))
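For example, a Binomial metric can be summarized at each parameter point into an empirical rate plus a sampling variance before GP regression (the counts below are made up):

```python
import numpy as np

# Made-up aggregates per parameter point: impressions N(x) and action counts.
N = np.array([10000, 12000, 9000, 11000])
successes = np.array([510, 640, 430, 600])

# Empirical rate and its sampling variance p(1 - p) / N. The resulting
# (x, y, variance) triples can be fed to a GP with per-point observation noise.
y = successes / N
obs_var = y * (1 - y) / N
```

This Gaussian approximation is one simple way to handle count metrics; the Binomial and Poisson likelihoods above handle the same data without approximation.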
77. Combine All Metrics – Method of Lagrange Multipliers
Equivalent to maximizing the following:
  L(x) = y(x) + λ_1 · s_t(y_1(x) − c_1) + ⋯ + λ_J · s_t(y_J(x) − c_J)
  s_t(z) = 1 / (1 + e^{−z/t})
• s_t is a smoothing sigmoid function which takes values close to 1 for positive z and close to 0 for negative z.
• λ_j · s_t(y_j(x) − c_j) penalizes constraint violation.
• λ_j is a positive multiplier, which is tunable in different applications.
• The optimal value of L(x) optimizes y(x) while satisfying all constraints.
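A toy sketch of the combined objective with a single guardrail metric (the temperature t of the sigmoid is an assumed knob, not fixed on the slide):

```python
import math

def s(z, t=0.01):
    """Smoothing sigmoid s_t(z): close to 1 for z > 0, close to 0 for z < 0."""
    return 1.0 / (1.0 + math.exp(-z / t))

def lagrangian(y, y1, c1, lam=1.0, t=0.01):
    """L(x) = y(x) + lam * s_t(y1(x) - c1) for a single guardrail metric."""
    return y + lam * s(y1 - c1, t)

# A point that violates the guardrail loses (almost all of) the bonus term,
# so it scores below a slightly worse point that satisfies the guardrail.
ok = lagrangian(y=0.50, y1=0.30, c1=0.25)   # guardrail met
bad = lagrangian(y=0.55, y1=0.20, c1=0.25)  # guardrail violated
```

A smaller t makes the penalty sharper (closer to a hard constraint), while a larger t gives smoother gradients for the optimizer.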
78. Lagrange Multipliers + Gaussian Process
  L(x) = y(x) + λ_1 · s_t(y_1(x) − c_1) + ⋯ + λ_J · s_t(y_J(x) − c_J)
Posterior samples of y(x) and y_j(x) can easily be obtained via Gaussian Process Regression, and therefore so can posterior samples of L(x).
79. Collecting Metrics
• Metrics y(x) and y_j(x) are collected at Sobol points x_1, ⋯, x_n spread evenly over the search space.
• Sobol points offer lower discrepancy than the same number of random points.
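Sobol points can be generated with, for example, SciPy's quasi-Monte Carlo module (an implementation choice for illustration, not necessarily the production one):

```python
from scipy.stats import qmc

# Draw 8 Sobol points in [0, 1) and scale them to a search range, say [0.5, 2.0].
sampler = qmc.Sobol(d=1, scramble=False)
points = sampler.random(8)
scaled = qmc.scale(points, l_bounds=[0.5], u_bounds=[2.0])
```

Drawing a power-of-two number of points preserves the balance properties of the Sobol sequence.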
80. Reweighting Points
• Initially, recommender systems are served equally by x_1, ⋯, x_n.
• After collecting metrics and applying Gaussian Process Regression, we generate samples from L(x).
• We need to assign each point x_i a probability value p_i to split traffic.
81. Thompson Sampling
• Use Thompson Sampling to assign a probability value p_i to each point x_i.
• Step 1: for all candidate points x_1, ⋯, x_n, obtain posterior samples from L(x): L̃(x_1), L̃(x_2), ⋯, L̃(x_n).
• Step 2: pick x_i* such that L̃(x_i*) is the largest among L̃(x_1), L̃(x_2), ⋯, L̃(x_n).
• Repeat Step 1 and Step 2 1000 times.
• Step 3: aggregate the 1000 picks x*_1, ⋯, x*_1000. Set p_i to the frequency of x_i in x*_1, ⋯, x*_1000.
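The three steps can be sketched as follows, with independent normal posteriors per candidate standing in for the true GP posterior over L(x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior over L(x) at 4 candidate points: means and standard deviations
# (in practice these samples come from the Gaussian Process posterior).
means = np.array([0.10, 0.30, 0.28, 0.05])
stds = np.array([0.05, 0.05, 0.05, 0.05])

wins = np.zeros(len(means))
for _ in range(1000):
    draws = rng.normal(means, stds)   # Step 1: posterior samples of L(x_i)
    wins[np.argmax(draws)] += 1       # Step 2: record which point won
p = wins / wins.sum()                 # Step 3: frequencies -> serving probabilities
```

Candidates whose posteriors overlap near the top (here the second and third points) keep receiving traffic, which is exactly the exploration/exploitation balance described on the next slides.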
82. Intuition of Thompson Sampling
• Thompson Sampling selects the optimal hyperparameters via sampling from a distribution (x_1, p_1), (x_2, p_2), ⋯, (x_n, p_n) over all candidates.
• p_i is equal to the probability of candidate x_i being optimal among all candidates.
• Thompson Sampling is proven to converge to the optimal value in various settings.
• Thompson Sampling is widely used in online A/B testing, online advertising, and other applications.
83. Exploration versus Exploitation
• Exploration: explore sub-optimal but high-potential regions of the search space
• Exploitation: search around the current optimal candidate
• Thompson Sampling automatically balances exploration and exploitation via a probability distribution
86. High Level Steps
We need to run the following steps in a loop:
• The optimization function emits a hyper-parameter distribution
• Use the hyper-parameters in online serving
• Collect metrics
• Call the optimization function
87. Goals of the Infrastructure
A few realities about the clients of the optimization package:
• Every team within LinkedIn can have a different online serving system, e.g. streaming, offline, REST endpoint.
• Tracking data and metrics for each system can be very different, e.g. different calculation methodologies.
To build a minimum viable product and stay focused on solving the optimization problem well, we built the optimization package as a library.
88. Flow Architecture
We built the optimization package as a library that expands into a Spark flow. The client calls the optimization package in its existing Spark offline flow.
The client flow contains the following:
• A Hadoop/Spark flow to collect and aggregate metrics
• A call to the optimization package, which expands into a Spark flow
• A flow to use the output of the optimizer
90. Notifications Example
  maximize Sessions(tp)
  s.t. Clicks(tp) > c_clicks
       Send Volume(tp) < c_send_volume
Send the right volume of notifications at the most opportune time to improve sessions and clicks while keeping send volume the same.
We have the following ML models:
  p_visit(m, i): probability of a visit
  p_click(m, i): probability of a click on item i by member m
91. Notifications Scoring Function
  S(m, i) = p_click(m, i) + x_v · p_visit(m, i) > x_th   (send if the score exceeds the threshold)
• The weight vector x = (x_v, x_th) controls the balance between the business metrics: Sessions, Clicks, and Send Volume.
• A sample business strategy is
  maximize Sessions(x)
  s.t. Clicks(x) > c_clicks
       Send Volume(x) < c_send_volume
• m – member, i – item
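The send decision implied by this scoring rule can be sketched as (the model scores below are made up):

```python
def should_send(p_click, p_visit, x_v, x_th):
    """Send the notification iff p_click + x_v * p_visit exceeds threshold x_th."""
    return p_click + x_v * p_visit > x_th

# Raising x_th lowers send volume; raising x_v favors session-driving items.
sent = should_send(p_click=0.20, p_visit=0.50, x_v=0.4, x_th=0.35)       # score 0.40
suppressed = should_send(p_click=0.20, p_visit=0.50, x_v=0.4, x_th=0.45)
```

Both knobs (x_v, x_th) move Sessions, Clicks, and Send Volume together, which is why they are tuned jointly by the optimizer rather than one at a time.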
96. Tracking Data Aggregation
Tracking data from user activity logs is available on HDFS.
The raw data is aggregated by the client flows.
97. Library API: Problem Specification
• Problem Specification
  • Treatment models and a control model
  • Parameters and search ranges
  • Objective metric and constraint metrics
  • The number of exploration iterations
• Exploration–Exploitation Setup
  • The algorithm starts with a few exploration iterations that output uniform serving probability.
  • The exploration stage explores each point equally, while the exploitation stage reassigns probabilities to the different points.
99. Offline System
The heart of the product
Tracking
• All member activities are tracked with the parameter of interest.
• ETL into HDFS for easy consumption.
Utility Evaluation
• Using the tracking data, we generate (x, y_k(x)) for each function k.
• The data is kept in an appropriate schema that is problem agnostic.
Bayesian Optimization
• The data and the problem specifications are the inputs to this stage.
• Using the data, we first estimate the posterior distribution of each of the latent functions.
• We then sample from those distributions to get a distribution over the parameter x which maximizes the objective.
100. The Parameter Store and Online Serving
• The Bayesian Optimization library generates
  • a set of potential candidates to try in the next round (x_1, x_2, …, x_n)
  • a probability of how likely each point is the true maximizer (p_1, p_2, …, p_n), such that Σ_{i=1}^{n} p_i = 1
• To serve members with the above distribution, each memberId is mapped to [0, 1] using a hashing function h. For example, if
  Σ_{i=1}^{k} p_i < h(u) ≤ Σ_{i=1}^{k+1} p_i
  then member u's feed is served with parameter x_{k+1}.
• The parameter store (depending on the use-case) can contain
  • <parameterValue, probability>, i.e. (x_i, p_i), or
  • <memberId, parameterValue>
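The hashing scheme can be sketched as follows (the hash function choice here is illustrative, not necessarily what is used in production):

```python
import hashlib
from bisect import bisect_left
from itertools import accumulate

def member_hash(member_id):
    """Deterministically map a memberId to [0, 1)."""
    digest = hashlib.sha256(str(member_id).encode()).hexdigest()
    return int(digest[:15], 16) / 16 ** 15

def serve_parameter(member_id, xs, ps):
    """Pick x_{k+1} where sum(p_1..p_k) < h(member) <= sum(p_1..p_{k+1})."""
    cum = list(accumulate(ps))
    idx = bisect_left(cum, member_hash(member_id))
    return xs[min(idx, len(xs) - 1)]  # guard against float round-off at 1.0

xs = [0.1, 0.5, 0.9]
ps = [0.2, 0.5, 0.3]  # serving probabilities, sum to 1
x_for_member = serve_parameter(12345, xs, ps)
```

Because the hash is deterministic, each member sees a consistent parameter within a round, while aggregate traffic splits according to (p_1, …, p_n).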
101. Online System
Serving hundreds of millions of users
Parameter Sampling
• For each member m visiting LinkedIn, depending on the parameter store, we either evaluate <m, parameterValue> or directly call the store to retrieve <m, parameterValue>.
Online Serving
• Depending on the parameter value that is retrieved (say x), the notification item is scored according to the ranking function and served:
  S(m, i) = p_click(m, i) + x_v · p_visit(m, i) > x_th
103. Entities
• Task is the single unit of a problem that we want to solve.
• OptimizationToken can be either an objective or a constraint. A given task object can have multiple optimizationTokens.
• UtilitySegment encapsulates all the information for modeling a single latent function f_k.
• ObservationDataToken is used to store observed data: the raw metrics y (vector), the hyper-parameters x (matrix), and the sample sizes n (vector).
106. Plots (when tuning a 1-d parameter)
Shows how Send Volume for a cohort (FourByFour) changes as we change the threshold.
The red curve is the observed metric.
The middle blue line is the mean of the GP estimate.
The left and right blue lines are the GP variance estimates.
109. Operational Efficiency
• Feed model tuning needs 0.5x the traffic and improves tuning speed by 2.5x.
• Ads model tuning speed improved by 4x.
• Notification model tuning speed improved by 3x.
110. Key Takeaways
● Removes the human in the loop: a fully automatic process to find the optimal parameters.
● Drastically improves developer productivity.
● Can scale to multiple competing metrics.
● Very easy onboarding infrastructure for multiple vertical teams. Currently used by Ads, Feed, Notifications, PYMK, etc.
112. The Simplest Optimization Problem
Our goal is to efficiently solve an optimization problem of the form
  max_x y(x)
subject to constraints
  y_j(x) ≥ c_j, j = 1, ⋯, J
113. Basic Bayesian Optimization
Observe data as
  y_i = f(x_i) + ε_i
where the error term is assumed to follow a normal distribution, ε_i ~ N(0, σ²), and the mean function of interest is assumed to follow a Gaussian process,
  f ~ GP(μ, K)
with mean μ, typically assumed to be identically zero, and covariance kernel K (common choices include the RBF kernel and the Matérn kernel).
114. Some Natural Extensions
1) Often there is time dependence in A/B experiments.
  • Day-over-day fluctuations
  • User behavior may differ on weekdays vs weekends
2) Can we incorporate other sources of information into the optimization?
3) How can we address heterogeneity in users' affinity towards treatments?
115. Time Dependence
Traditional Bayesian optimization strongly assumes metrics are stable over time. Is this a problem? It can be when using explore/exploit methods.
Examples of time dependence in A/B experiments:
• Day-over-day fluctuations
• User behavior may differ on weekdays vs weekends
If an exploit step suggests a new point to sample, but this occurs on a Saturday when users tend to be more active, this may give an overly optimistic sense of the impact.
116. Additional Sources of Information
It is common in the internet industry to iterate through a large volume of models. Past experiment data may be suggestive of the best hyperparameters for a new model.
Many companies have developed "offline replay" systems which simulate the results that would have been seen had users been given a particular treatment.
Such sources of additional information can be leveraged to expedite optimization.
117. βPersonalizedβ Hyperparameter Tuning
It is plausible that different groups of members have different affinities towards model hyperparameters.
For instance, active users of a platform may be very interested in frequent updates. On the other hand, less active users may not appreciate frequent pings.
This can be handled in a notification system by allowing separate hyperparameters for the active and less active users.
118. Commonalities
• Each of these extensions is easily accommodated through modifications of the Bayesian optimization framework
• They are all used to improve the speed and accuracy of function fits
120. Time Dependence
A/B experiment metrics often see variations in time, such as day-over-day fluctuations or differences on weekdays vs weekends or holidays.
Bayesian optimization typically assumes data is observed as
  y_i = f(x_i) + ε_i
A more realistic model might assume a time dependent error term in addition to the measurement error:
  y_i = f(x_i) + δ(t_i) + ε_i
121. Reformulating this model
The model incorporating time fluctuations can be expressed as
  y_i = f(x_i, t_i) + ε_i
That is, we are now trying to optimize a function that is allowed to vary with time!
Practically, this can be accomplished by simple adjustments to the covariance kernel.
122. Time Dependent Gaussian Processes
Rather than assuming the function follows a Gaussian process indexed by the hyperparameter alone, i.e.
  f(x) ~ GP(μ, K)
we can instead assume that the function is a Gaussian process indexed by the hyperparameter and time:
  f(x, t) ~ GP(μ, K)
And Bayesian optimization proceeds exactly as in the time-invariant case!
123. A Simple Example of a Time Dependent Kernel
A simple example of a kernel for a time dependent Gaussian process is the product kernel
  K((x, t), (x′, t′)) = K_x(x, x′) · K_t(t, t′)
which specifies separate covariances over the hyperparameter and over time, and is equivalent to assuming that the hyperparameter and time effects on the covariance factorize.
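One common construction for such a time dependent kernel is the product of a kernel over the hyperparameter and a kernel over time; a NumPy sketch (the lengthscales are arbitrary choices):

```python
import numpy as np

def rbf(a, b, ls):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def time_dependent_kernel(x1, t1, x2, t2, ls_x=0.5, ls_t=3.0):
    """K((x, t), (x', t')) = K_x(x, x') * K_t(t, t')."""
    return rbf(x1, x2, ls_x) * rbf(t1, t2, ls_t)

x = np.array([0.1, 0.1, 0.9])
t = np.array([0.0, 7.0, 0.0])  # e.g. days since the experiment started
K = time_dependent_kernel(x, t, x, t)
# The same hyperparameter observed a week apart (entries 0 and 1) is less
# correlated than identical (hyperparameter, time) pairs on the diagonal.
```

The time lengthscale ls_t controls how quickly metric behavior is allowed to drift between observations.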
126. Additional Sources of Information
It is common in the internet industry to iterate through a large volume of models. Past experiment data may be suggestive of the best hyperparameters for a new model.
Many companies have developed "offline replay" systems which simulate the results that would have been seen had users been given a particular treatment.
Such sources of additional information can be leveraged to expedite our predictions of the metrics of interest.
127. Joint Model Of Experiment And Additional Information
Assume we observe the metrics of interest as
  y_i = f(x_i) + ε_i
as well as another "proxy" metric
  z_i = g(x_i) + η_i
and we believe that g is informative of the function of interest f. We can then jointly model the functions f and g to improve our predictions of f.
128. Joint Gaussian Processes
We still assume
  f ~ GP(0, K)
and also that
  g ~ GP(0, K)
Further assume the Gaussian processes f and g are correlated, namely
  Cov(f(x), g(x′)) = ρ · K(x, x′)
(Note that the common marginal prior is not a necessary assumption, but one typically made to simplify estimation of kernel parameters.)
129. Extensions To Multiple Sources Of Information
Suppose we are interested in modelling C different Gaussian processes, f_1, ⋯, f_C. A simple joint model is specified by the Intrinsic Coregionalization Model (ICM). This specifies that marginally
  f_c ~ GP(0, K)
and jointly
  Cov(f_c(x), f_{c′}(x′)) = B_{c,c′} · K(x, x′)
for a positive semi-definite matrix B.
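An ICM covariance can be assembled as a Kronecker product of a C × C coregionalization matrix B with the input kernel K; a NumPy sketch (B here is an arbitrary PSD example, not from the tutorial):

```python
import numpy as np

def rbf(a, b, ls=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

x = np.linspace(0.0, 1.0, 4)
K = rbf(x, x)                     # shared input kernel K(x, x')

# Coregionalization matrix for C = 2 correlated functions (PSD by construction).
A = np.array([[1.0, 0.0], [0.8, 0.6]])
B = A @ A.T                       # B[c, c'] = covariance scale between functions

# Full covariance over all (function, input) pairs: Cov(f_c(x_i), f_c'(x_j)).
K_icm = np.kron(B, K)
```

Off-diagonal blocks of K_icm are what let observations of a proxy metric (e.g. offline replay) sharpen the posterior of the online metric.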
130. Notification Sessions Modeling
Send a notification if a score exceeds a threshold.
Our goal is to estimate the sessions that would be observed for a given threshold.
The sessions induced by sending a notification are easily modelled, for instance by a survival model.
Based on offline data, we can then estimate the induced sessions for each candidate threshold.
132. Comparison of GP Fit With Left Out Data
Mean squared error:
• Without offline data: 0.00928
• With offline data: 0.00325
Relative efficiency: 285%
134. Group Definitions
Often, segments of users can be given by ad-hoc definitions based on domain-specific knowledge:
• Demographic segments (e.g. students vs full-time employees)
• Usage patterns (e.g. heavy vs light users)
Recently, advances have been made towards identifying heterogeneous cohorts of users, which can be leveraged in the cohort definition phase.
135. Causal Discovery of Groups
1. Launch an A/B experiment for a variety of hyperparameter values
2. Using this randomized experiment data, apply an automated method for discovering subgroups of the population which demonstrate heterogeneity in treatment effect
3. An example of such a tool is the "causal tree" methodology proposed by Athey and Imbens (2016)
4. This takes in user features (usage, demographics, etc.) and results in a tree whose splits are chosen according to heterogeneity in treatment effect
136. Optimization by Group
We now want to give each group a hyperparameter value so that the metrics of interest are globally optimized.
Suppose we have G groups, and each group g is exposed to hyperparameter x_g.
Our optimization problem can be expressed as
  max_{x_1, ⋯, x_G} y(x_1, ⋯, x_G)
which is generally not tractable.
137. Grey Box Optimization
Generally, the metric of interest can be expressed as a (weighted) average of the metric across the cohorts. That is,
  y(x_1, ⋯, x_G) = Σ_g w_g · y_g(x_g)
so we estimate the objective metric (and similarly the constraint metrics) separately for each cohort, as a function of the hyperparameter for that cohort.
This estimation requires dramatically less data than estimating a single function of G variables.
138. Grey Box Optimization
Using this decomposition, Bayesian optimization (explore/exploit) proceeds exactly as previously seen.
Now, the means and variances of the objective and constraints are simply sums over the independent GPs for each cohort.
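Since the global metric decomposes into a weighted sum of independent per-cohort GPs, the posterior mean and variance of the objective at a joint setting are just weighted sums of the per-cohort posterior moments; a toy sketch (all numbers made up):

```python
import numpy as np

# Toy per-cohort posterior moments at each cohort's candidate setting x_g,
# plus traffic weights w_g.
cohort_mean = np.array([0.20, 0.35, 0.15])
cohort_var = np.array([0.010, 0.020, 0.005])
w = np.array([0.5, 0.3, 0.2])

# Independence across cohorts: the moments of y = sum_g w_g * y_g(x_g) are
# simply weighted sums of the per-cohort moments.
global_mean = float(np.sum(w * cohort_mean))
global_var = float(np.sum(w ** 2 * cohort_var))
```

Each cohort's GP is a function of a single hyperparameter, so fitting G one-dimensional functions replaces fitting one G-dimensional function.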
139. A Simple Example
Send a notification if a score exceeds a threshold
Ad-hoc cohorts based on usage frequency:
• Four by four: daily active users
• One by three: weekly active users
• One by one: monthly active users
• Onboarding
• Dormant
140. Influence Of One Parameter On Number Of Sent Messages
Marginal plot of Send Volume for FourByFour members.
The two red lines are the maximum and minimum over all cohorts except FourByFour.
The blue lines are the variance estimates.
141. Summary
• An advantage of Bayesian optimization is that it is easily modified to suit the nuances of the optimization problem at hand
• Some easy extensions:
  • Incorporate the time dependence seen in many online experiments
  • Incorporate additional sources of information
  • Personalize the user experience through group-specific optimization
142. Thank you
Marco Alvarado, Mavis Li, Yafei Wang, Jason Xu, Jinyun Yan, Shuo Yang, Wensheng Sun, Yiping Yuan, Ajith Muralidharan, Matthew Walker, Boyi Chen, Jun Jia, Haichao Wei, Huiji Gao, Zheng Li, Mohsen Jamali, Shipeng Yu, Hema Raghavan, Romer Rosales