Survey slides on contextual multi-armed bandits.
Main reference: Li Zhou. A Survey on Contextual Multi-armed Bandits. arXiv, 2015. (https://arxiv.org/abs/1508.03326)
Houston machine learning meetup, where two papers are discussed:
- A Contextual-Bandit Approach to Personalized News Article Recommendation
http://rob.schapire.net/papers/www10.pdf
- An efficient bandit algorithm for realtime multivariate optimization
https://www.kdd.org/kdd2017/papers/view/an-efficient-bandit-algorithm-for-realtime-multivariate-optimization
Talk from QCon SF on 2018-11-05
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of each of our members at the right time. With a catalog spanning thousands of titles and a diverse member base spanning over a hundred million accounts, recommending the titles that are just right for each member is crucial. But the job of recommendation does not end there. Why should you care about any particular title we recommend? What can we say about a new and unfamiliar title that will pique your interest? How do we convince you that a title is worth watching? Answering these questions is critical in helping our members discover great content, especially for unfamiliar titles. One way to do this is to consider the artwork or imagery we use to visually portray each title. If the artwork representing a title captures something compelling to you, then it acts as a gateway into that title and gives you some visual “evidence” for why the title might be good for you. Selecting good artwork is important because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we show for each title on the Netflix homepage. We will look at how to frame this as a machine learning problem using contextual multi-armed bandits in a recommendation system setting. We will also describe the algorithmic and system challenges involved in getting this type of approach for artwork personalization to succeed at Netflix scale. Finally, we will discuss some of the future opportunities that we see to expand and improve upon this approach.
What if there’s a better way to run more complex tests and gain results faster? Joni explains the wonderful world of multi-armed bandit experiments.
About Joni
Joni Turunen is a Senior Developer at FROSMO currently working in the Product Team. He has vast experience with different frontend & backend technologies. With his 8-year history in the company he knows Frosmo's software inside out.
A Multi-Armed Bandit Framework For Recommendations at Netflix (Jaya Kawale)
In this talk, we present a general multi-armed bandit framework for recommendations on the Netflix homepage. We present two example case studies using MABs at Netflix - a) Artwork Personalization to recommend personalized visuals for each of our members for the different titles and b) Billboard recommendation to recommend the right title to be watched on the Billboard.
Netflix talk at ML Platform meetup, Sep 2019 (Faisal Siddiqi)
In this talk at the Netflix Machine Learning Platform Meetup on 12 Sep 2019, Fernando Amat and Elliot Chow from Netflix talk about the Bandit infrastructure for Personalized Recommendations
Overview of tree algorithms from decision tree to xgboost (Takami Sato)
To deepen my understanding, I surveyed popular tree algorithms in machine learning and their evolution. This is the first presentation I have written in English, so I would be happy to receive feedback.
Crafting Recommenders: the Shallow and the Deep of it! (Sudeep Das, Ph.D.)
I present a brief review, and an outlook on the rapid changes happening in the field of recommendation engine research on the heels of the deep learning revolution!
At Netflix, we try to provide the best personalized video recommendations to our members. To do this, we need to adapt our recommendations for each contextual situation, which depends on information such as time or device. In this talk, I will describe how state of the art Contextual Recommendations are used at Netflix. A first example of contextual adaptation is the model that powers the Continue Watching row. It uses a feature-based approach with a carefully constructed training set to learn how to adapt to the context of the member. Next, I will dive into more modern approaches such as Tensor Factorization and LSTMs and share some results from deployments of these methods. I will highlight lessons learned and some common pitfalls of using these powerful methods in industrial scale systems. Finally, I will touch upon system reliability, choice of optimization metrics, hidden costs, risks and benefits of using highly adaptive systems.
4. Multi-Armed Bandit
At each time t, the agent selects an arm a_t ∈ {1, ..., K}
Then, the agent receives a reward r_t (= r_{a_t,t}) from the environment
If r_{i,t} is drawn i.i.d. from some distribution, we call it a stochastic bandit; if r_{i,t} is selected by the environment, we call it an adversarial bandit
The goal of MAB is to find the policy π ∈ Π with π(a_1, r_1, ..., a_{t−1}, r_{t−1}) = a_t which minimizes the regret [1]

R_T := \max_{i=1,\dots,K} \mathbb{E}\left[ \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{a_t,t} \right]

[1] Properly speaking, the cumulative pseudo-regret.
Sangwoo Mo (KAIST) Contextual Bandit Survey August 4, 2016 4 / 32
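The pseudo-regret definition above can be checked numerically. Below is a minimal sketch, not from the slides (the Bernoulli arm means and the always-pull-arm-0 policy are illustrative), that simulates a stochastic bandit and computes the cumulative pseudo-regret of a policy:

```python
import random

def pseudo_regret(means, policy, T, seed=0):
    """Simulate a stochastic Bernoulli bandit and return the cumulative
    pseudo-regret: T * max_i mu_i minus the sum of the means of the arms
    actually pulled."""
    rng = random.Random(seed)
    best = max(means)
    pulled = 0.0
    history = []                       # (arm, reward) pairs seen so far
    for t in range(T):
        arm = policy(t, history)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        history.append((arm, reward))
        pulled += means[arm]           # pseudo-regret uses means, not samples
    return best * T - pulled

# A policy that ignores feedback and always pulls arm 0 (mean 0.3)
# incurs regret linear in T: (0.5 - 0.3) * 1000 = 200.0
r = pseudo_regret([0.3, 0.5], lambda t, history: 0, T=1000)
```

A good bandit algorithm keeps this quantity sublinear in T.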
5. Contextual Bandit
In a contextual bandit, the agent receives additional information (= context) c_t ∈ C at the beginning of time t [1]
In a stochastic contextual bandit, the reward r_{i,t} can be represented as a function of the context c_{i,t} and noise ε_{i,t}:

r_{i,t} = f(c_{i,t}) + \epsilon_{i,t}

or simply r_{i,t} = f_i(c_t) + ε_{i,t} if c_t is independent of i
In an adversarial contextual bandit, the reward r_{i,t} is selected by the environment, as in the non-contextual MAB
[1] The literature often writes c_{i,t} to emphasize that each arm i has a corresponding context c_{i,t}. However, both notations are equivalent, since we can construct a single vector c_t by concatenating the c_{i,t}'s.
6. Optimal Regret Bound
Stochastic Bandit: Ω(log T) [1]
Adversarial Bandit: Ω(√(KT)) [2]
Contextual Bandit: Ω(d√T) [3]
[1] Lai & Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
[2] Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS, 1995. By a minimax strategy. Note that an adversarial bandit can be thought of as a 2-player game between the agent and the environment.
[3] Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2012. Remark that the lower bound is Ω(√T) even for the stochastic contextual bandit, since contexts may arrive adversarially.
7. Naïve Approach: Reduce to MAB
8. Naïve Approach: Reduce to MAB
Approach 1: assume the context set is finite (|C| = N)
Run a MAB algorithm (e.g. EXP3) for each context independently
The regret bound is O(√(TNK log K)) [1] (w/ EXP3)
Approach 2: assume the policy space is finite (|H| = M)
Run a MAB algorithm (e.g. EXP3) on policies, instead of arms
The regret bound is O(√(TM log M)) (w/ EXP3)
[1] \sum_{c=1}^{N} O(n_c \sqrt{K \log K}) \le O(\sqrt{TN} \sqrt{K \log K}), where n_c is the number of times context c is observed (by the Cauchy-Schwarz inequality)
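Approach 1 amounts to keeping one independent bandit state per discrete context. A minimal sketch (epsilon-greedy stands in for EXP3 here; the class and parameter names are mine, not from the slides):

```python
import random

class PerContextBandit:
    """One independent bandit state per discrete context (Approach 1).
    Epsilon-greedy stands in for the per-context MAB algorithm."""
    def __init__(self, K, eps=0.1, seed=0):
        self.K, self.eps = K, eps
        self.rng = random.Random(seed)
        self.counts = {}   # context -> per-arm pull counts
        self.sums = {}     # context -> per-arm reward sums

    def select(self, c):
        counts = self.counts.setdefault(c, [0] * self.K)
        sums = self.sums.setdefault(c, [0.0] * self.K)
        if min(counts) == 0 or self.rng.random() < self.eps:
            return self.rng.randrange(self.K)   # explore
        means = [s / n for s, n in zip(sums, counts)]
        return means.index(max(means))          # exploit

    def update(self, c, arm, reward):
        self.counts[c][arm] += 1
        self.sums[c][arm] += reward

# Toy run: the best arm equals the context, and each per-context state
# learns this independently of the others.
bandit = PerContextBandit(K=2, seed=1)
for t in range(2000):
    c = t % 2
    a = bandit.select(c)
    bandit.update(c, a, 1.0 if a == c else 0.0)
```

The price of this reduction is the √N factor in the bound: no information is shared across contexts.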
10. Review: Index Policy and Greedy Algorithm
Since the Gittins index [1], index policies have become one of the most popular strategies for MAB problems
Idea: at each time t, define a score s_{i,t} (= index) for each arm i. Select the arm with the highest score
Question: how to define a proper s_{i,t}?
Naïve approach: use the empirical mean [2]! (greedy algorithm)
However, the naïve greedy algorithm may incur O(T) regret
[1] Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, 1979.
[2] Note that MAB becomes trivial if we know the true means. The general goal of MAB algorithms is to estimate the means correctly and rapidly (explore-exploit dilemma)
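The O(T) failure mode of pure greedy is easy to reproduce: with constant probability the best arm's single initial pull returns 0, and greedy never tries it again. A small sketch (the arm means 0.4/0.6 and horizon are illustrative, not from the slides):

```python
import random

def greedy(means, T, seed):
    """Pure greedy: pull each arm once, then always pull the arm with the
    highest empirical mean (ties broken toward the lower index)."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    for t in range(T):
        if t < K:
            arm = t                      # initialization round
        else:
            emp = [s / n for s, n in zip(sums, counts)]
            arm = emp.index(max(emp))
        sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
    return counts

# If the better arm's initial pull returns 0 (probability 0.4 here),
# greedy can ignore it forever, so some runs pull the worse arm ~T times.
locked = sum(1 for s in range(50)
             if (c := greedy([0.4, 0.6], 500, s))[0] > c[1])
```

Each such "locked" run accumulates regret proportional to T, which is why a confidence bonus (next slide) is needed.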
11. Review: UCB1
Assume r_{i,t} ∼ P_i with support [0, 1] and mean μ_i
Idea: favor seldom-selected arms over often-selected ones. In other words, give a confidence bonus [1]!
UCB1 [2]: define the score as

s_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{2 \log t}{n_{i,t}}}

where \hat{\mu}_{i,t} is the empirical mean and n_{i,t} is the number of times arm i has been selected
The UCB1 policy guarantees the optimal regret O(log T)
Also, there are other choices for the UCB (e.g. KL-UCB [3], Bayes-UCB [4])
[1] We call this bonus the UCB (upper confidence bound). Thus, score = estimated mean + UCB.
[2] Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
[3] Garivier & Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT, 2011.
[4] Kaufmann et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.
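The UCB1 score above translates directly into code. A minimal sketch (the deterministic reward functions in the usage line are illustrative):

```python
import math

def ucb1(reward_fns, T):
    """UCB1: play each arm once, then play the arm maximizing
    empirical mean + sqrt(2 * log t / n_i)."""
    K = len(reward_fns)
    counts, sums = [0] * K, [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                  # initialization: each arm once
        else:
            arm = max(range(K), key=lambda i:
                      sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        counts[arm] += 1
        sums[arm] += reward_fns[arm]()
    return counts

# With rewards 0.2 vs 0.8, the confidence bonus sqrt(2 log t / n_i)
# shrinks as n_i grows, so pulls concentrate on the better arm.
counts = ucb1([lambda: 0.2, lambda: 0.8], T=500)
```

The suboptimal arm is still pulled occasionally, at the O(log T) rate the bonus dictates.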
12. LinUCB
Assume r_{i,t} ∼ P(r_{i,t} | c_{i,t}, θ) where E[r_{i,t}] = c_{i,t}^T θ* (c_{i,t}, θ* ∈ R^d)
Like UCB1, we want to define the score as

s_{i,t} = c_{i,t}^T \hat{\theta}_t + \mathrm{UCB}_{i,t}

Question: how to choose a proper UCB_{i,t}?
13. LinUCB
Idea: let \hat{\theta}_t be the ridge-regression estimator of θ*:

\hat{\theta}_t = (C_t^T C_t + \lambda I_d)^{-1} C_t^T R_t

where C_t = [c_1, ..., c_{t−1}] and R_t = [r_1, ..., r_{t−1}]
Then, the inequality below holds with probability 1 − δ/T:

\left| c_{i,t}^T \hat{\theta}_t - c_{i,t}^T \theta^* \right| \le (\epsilon + 1) \sqrt{c_{i,t}^T A_t^{-1} c_{i,t}}

where A_t = C_t^T C_t + I_d and \epsilon = \sqrt{\frac{1}{2} \log \frac{2TK}{\delta}}
14. LinUCB
LinUCB [1]: define the score as

s_{i,t} = c_{i,t}^T \hat{\theta}_t + \alpha \sqrt{c_{i,t}^T A_t^{-1} c_{i,t}}

The regret bound (with probability 1 − δ) is

O\left( d \sqrt{T \log \frac{1+T}{\delta}} \right)

The LinUCB policy guarantees the optimal regret Õ(d√T)
Also, there are other choices for the UCB (e.g. LinREL [2], CoFineUCB [3])
[1] Li et al. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
[2] Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
[3] Yue et al. Hierarchical Exploration for Accelerating Contextual Bandits. ICML, 2012.
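The estimator and score above can be sketched in a few lines with NumPy. A minimal version (λ = 1; the toy θ*, contexts, and noiseless rewards are illustrative, not from the paper):

```python
import numpy as np

class LinUCB:
    """LinUCB sketch: ridge-regression estimate of theta plus the bonus
    alpha * sqrt(c^T A^{-1} c) from the slide (lambda = 1)."""
    def __init__(self, d, alpha=0.5):
        self.A = np.eye(d)       # A_t = C_t^T C_t + I_d
        self.b = np.zeros(d)     # C_t^T R_t
        self.alpha = alpha

    def select(self, contexts):
        theta = np.linalg.solve(self.A, self.b)   # ridge estimate
        A_inv = np.linalg.inv(self.A)
        scores = [c @ theta + self.alpha * np.sqrt(c @ A_inv @ c)
                  for c in contexts]
        return int(np.argmax(scores))

    def update(self, c, reward):
        self.A += np.outer(c, c)
        self.b += reward * c

# Toy run with true theta* = (1, 0) and noiseless rewards c^T theta*.
theta_star = np.array([1.0, 0.0])
contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
model = LinUCB(d=2)
for _ in range(200):
    i = model.select(contexts)
    model.update(contexts[i], float(contexts[i] @ theta_star))
```

In production one would update A^{-1} incrementally (e.g. via Sherman-Morrison) rather than re-inverting each round.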
15. Review: Thompson Sampling
Another popular strategy for MAB is Thompson Sampling [1]
It can be applied to both contextual and non-contextual bandits
Assume r_{i,t} ∼ P(r_{i,t} | c_{i,t}, θ*) with prior θ* ∼ P(θ)
Idea: sample the estimator \hat{\theta}_t from the posterior distribution
step 1. draw θ_t from the posterior P(θ | D = {c_t, a_t, r_t})
step 2. select the arm a_t = arg max_i E[r_{i,t} | c_{i,t}, θ_t]
The idea is simple, but it works well both in theory [2] and in practice [3]
[1] Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
[2] Agrawal et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.
[3] Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 2010.
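For Bernoulli rewards with a Beta prior, the two steps above reduce to sampling one mean per arm from its Beta posterior and playing the argmax. A minimal non-contextual sketch (the arm means and horizon are illustrative):

```python
import random

def thompson_bernoulli(true_means, T, seed=0):
    """Beta-Bernoulli Thompson Sampling: keep a Beta(a_i, b_i) posterior per
    arm, draw one sample per arm (step 1), play the argmax (step 2)."""
    rng = random.Random(seed)
    K = len(true_means)
    a, b = [1.0] * K, [1.0] * K        # uniform Beta(1, 1) priors
    counts = [0] * K
    for _ in range(T):
        samples = [rng.betavariate(a[i], b[i]) for i in range(K)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_means[arm] else 0
        a[arm] += reward               # conjugate posterior update
        b[arm] += 1 - reward
        counts[arm] += 1
    return counts

counts = thompson_bernoulli([0.2, 0.8], T=1000)
```

Exploration here is implicit: an arm with few pulls has a wide posterior, so it still occasionally produces the largest sample.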
16. LinTS
Assume r_{i,t} ∼ N(c_{i,t}^T θ*, v²) and θ* ∼ N(\hat{\theta}_t, v² B_t^{-1}) where

B_t = \sum_{\tau=1}^{t-1} c_{i,\tau} c_{i,\tau}^T + I_d, \quad \hat{\theta}_t = B_t^{-1} \sum_{\tau=1}^{t-1} c_{i,\tau} r_{i,\tau}

with r_{i,t} ∈ [\bar{r}_{i,t} − R, \bar{r}_{i,t} + R] and v = R \sqrt{24 d \log \frac{t}{\delta}}
Then, the posterior of θ* is N(\hat{\theta}_{t+1}, v² B_{t+1}^{-1})
LinTS [1]: run Thompson Sampling under this assumption
The regret bound (with probability 1 − δ) is

O\left( d^2 \sqrt{T^{1+\epsilon}} \log(Td) \log \frac{1}{\delta} \right)

[1] Agrawal et al. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2013.
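The Gaussian posterior above makes sampling straightforward. A minimal LinTS sketch (the value of v, the toy θ*, and the noiseless rewards are illustrative, not the paper's tuned constants):

```python
import numpy as np

class LinTS:
    """Linear Thompson Sampling sketch: Gaussian posterior
    N(theta_hat, v^2 B^{-1}); sample theta, play argmax_i c_i^T theta."""
    def __init__(self, d, v=0.5, seed=0):
        self.B = np.eye(d)       # B_t = sum c c^T + I_d
        self.f = np.zeros(d)     # sum c * r
        self.v = v
        self.rng = np.random.default_rng(seed)

    def select(self, contexts):
        B_inv = np.linalg.inv(self.B)
        theta_hat = B_inv @ self.f
        # Step 1: sample theta from the posterior; step 2: argmax score.
        theta = self.rng.multivariate_normal(theta_hat, self.v ** 2 * B_inv)
        return int(np.argmax([c @ theta for c in contexts]))

    def update(self, c, reward):
        self.B += np.outer(c, c)
        self.f += reward * c

# Toy run with true theta* = (1, 0) and noiseless rewards.
theta_star = np.array([1.0, 0.0])
contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
model = LinTS(d=2, seed=1)
for _ in range(300):
    i = model.select(contexts)
    model.update(contexts[i], float(contexts[i] @ theta_star))
theta_hat = np.linalg.solve(model.B, model.f)
```

Compared with LinUCB, the only change is replacing the deterministic bonus with posterior sampling; the sufficient statistics (B, f) are identical.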
17. UCB & TS: Nonlinear Case
Assume E[r_{i,t}] = f(c_{i,t}) for a general nonlinear function f
If we assume f is a member of the exponential family, we can use GLM-UCB [1]
If we assume f is sampled from a Gaussian process, we can use GP-UCB [2] / CGP-UCB [3]
If we assume f is an element of a reproducing kernel Hilbert space, we can use KernelUCB [4]
Also, we can use Thompson Sampling if we know the form of the probability distribution
[1] Filippi et al. Parametric Bandits: The Generalized Linear Case. NIPS, 2010.
[2] Srinivas et al. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.
[3] Krause & Ong. Contextual Gaussian Process Bandit Optimization. NIPS, 2011.
[4] Valko et al. Finite-Time Analysis of Kernelised Contextual Bandits. UAI, 2013.
19. Epoch-Greedy
Assume the policy space H is finite [1]
Idea: explore for T′ steps and exploit for T − T′ steps (epsilon-first)
issue 1. how to get an unbiased estimate of the best policy?
issue 2. how to balance explore and exploit if we don't know T?
trick 1: use the data D = {c_t, a_t, r_t} observed in the explore steps:

\hat{\pi} = \arg\max_{\pi \in H} \sum_{(c_t, a_t, r_t) \in D} \frac{r_{a_t} \mathbb{1}[\pi(c_t) = a_t]}{1/K}

trick 2: run epsilon-first in mini-batches (a partition of T)
[1] The infinite case with finite VC-dimension can be handled in a similar way
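Trick 1 can be sketched as offline policy selection on uniformly-explored data. A minimal illustration (the two toy policies, the reward rule, and the function name are mine, not from the paper):

```python
import random

def offline_best_policy(policies, data, K):
    """Trick 1: score each policy on uniformly-explored data with the
    importance-weighted estimator sum_t r_t * 1[pi(c_t) = a_t] / (1/K)."""
    def score(pi):
        return sum(r * K for (c, a, r) in data if pi(c) == a)
    return max(policies, key=score)

# Exploration data: arms drawn uniformly (propensity 1/K), reward 1 iff
# the arm matches the context.
rng = random.Random(0)
K = 2
data = []
for _ in range(400):
    c = rng.randrange(2)
    a = rng.randrange(K)                 # uniform exploration
    data.append((c, a, 1.0 if a == c else 0.0))

pi_good = lambda c: c                    # plays the matching arm
pi_bad = lambda c: 0                     # constant policy
chosen = offline_best_policy([pi_bad, pi_good], data, K)
```

Dividing by the uniform propensity 1/K makes each policy's score an unbiased estimate (up to scale) of its expected reward, even though only one arm's reward is observed per round.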
20. Epoch-Greedy
Epoch-Greedy [1]: combine trick 1 & trick 2
The regret bound is Õ(T^{2/3}) (not optimal!)
[1] Langford & Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS, 2007.
21. RandomizedUCB
Idea: estimate a distribution P_t over the policy space H
RandomizedUCB [1]:
The regret bound is Õ(√T), but the time complexity is O(T^6)
[1] Dudik et al. Efficient Optimal Learning for Contextual Bandits. UAI, 2011.
22. ILOVETOCONBANDITS
Idea: similar to RandomizedUCB, but with improved time complexity
ILOVETOCONBANDITS [1] (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS):
The regret bound is Õ(√T), and the time complexity is O(T^{1.5})
[1] Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML, 2014.
24. Review: EXP3
Assume r_{i,t} ∈ [0, 1] is selected by the environment
In the adversarial setting, the agent must select arms randomly
Idea: place more probability on arms with higher observed rewards
EXP3 [1] (EXPonential-weight algorithm for EXPloration and EXPloitation):
The regret bound is O(√(TK log K))
[1] Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
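The EXP3 weight update can be sketched as follows (a minimal version; the γ value and the deterministic reward functions in the usage line are illustrative):

```python
import math
import random

def exp3(reward_fns, T, gamma=0.1, seed=0):
    """EXP3 sketch: exponential weights over arms, mixed with uniform
    exploration; the pulled arm's weight is updated with the
    importance-weighted estimate r / p."""
    rng = random.Random(seed)
    K = len(reward_fns)
    w = [1.0] * K
    counts = [0] * K
    for _ in range(T):
        total = sum(w)
        # Mix the weight distribution with uniform exploration gamma/K.
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        arm = rng.choices(range(K), weights=p)[0]
        r = reward_fns[arm]()                    # reward in [0, 1]
        w[arm] *= math.exp(gamma * (r / p[arm]) / K)
        counts[arm] += 1
    return counts

# With rewards 0.2 vs 0.9, probability mass shifts to the higher-reward
# arm, up to the forced gamma/K exploration floor.
counts = exp3([lambda: 0.2, lambda: 0.9], T=2000)
```

Dividing the observed reward by p[arm] keeps the per-arm reward estimates unbiased despite only one arm being observed per round.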
25. EXP4
Idea: run EXP3 on policies (experts), instead of arms
EXP4 [1] (EXPonential-weight algorithm for EXPloration and EXPloitation using EXPert advice):
The regret bound is O(√(TK log N)), but the variance is high
[1] Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
26. EXP4.P
Idea: run EXP4 with better weights, to make the algorithm stable
EXP4.P [1] (EXP4 with Probability):
The regret bound is O(√(TK log N)), with high probability
[1] Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2011.
27. Supervised Learning to Contextual Bandit
28. Supervised Learning to Contextual Bandit
Idea: note that a contextual bandit can be thought of as a supervised learning problem with a partially-observed restriction
Trick: use a randomized algorithm (e.g. epsilon-greedy) and the unbiased reward estimator

\hat{r}_{a_t,t} = \frac{r_{a_t,t}}{p_{a_t}}

instead of the observed reward r_{a_t,t}. Then,

\mathbb{E}[\hat{r}_{i,t}] = p_i \cdot \frac{r_{i,t}}{p_i} + (1 - p_i) \cdot 0 = r_{i,t}

Using this trick, any supervised learning algorithm can be converted to a contextual bandit algorithm
Banditron and NeuralBandit are examples using neural networks
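The unbiasedness computation above can be verified empirically. A small sketch (the propensities and true rewards are illustrative numbers):

```python
import random

def ips_mean(i, p, r, T=100_000, seed=0):
    """Average of the importance-weighted estimator r_hat_i = r_i / p_i
    when arm i is pulled (probability p_i) and 0 otherwise; by the
    identity above, this should approach the true reward r_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(T):
        arm = 0 if rng.random() < p[0] else 1   # randomized two-arm policy
        if arm == i:
            total += r[i] / p[i]                # observed reward / propensity
        # else: the estimate for arm i at this round is 0
    return total / T

# With propensities p = (0.3, 0.7) and true rewards r = (0.4, 0.8),
# the estimate for arm 0 concentrates around 0.4.
est = ips_mean(0, p=(0.3, 0.7), r=(0.4, 0.8))
```

The estimator is unbiased but its variance scales with 1/p_i, which is why low-probability arms make the converted learner noisy.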
29. Banditron and NeuralBandit
Both Banditron [1] and NeuralBandit [2] use a multi-layer perceptron and an epsilon-greedy algorithm w/ the unbiased reward estimator
However, Banditron uses the 0-1 loss (classification) while NeuralBandit uses the L2 loss (regression)
The regret bound of the original Banditron is O(T^{2/3}), and a 2nd-order variant [3] reduced it to Õ(√T)
No theoretical guarantee has been proved for NeuralBandit yet
[1] Kakade et al. Efficient Bandit Algorithms for Online Multiclass Prediction. ICML, 2008.
[2] Allesiardo et al. A Neural Networks Committee for the Contextual Bandit Problem. ICONIP, 2014.
[3] Crammer & Gentile. Multiclass Classification with Bandit Feedback using Adaptive Regularization. ICML, 2013.