Lab Seminar: Contextual Bandit Survey
Sangwoo Mo
KAIST
swmo@kaist.ac.kr
August 4, 2016
Overview
1 Problem Setting
2 Naïve Approach: Reduce to MAB
3 Stochastic Contextual Bandit
UCB & Thompson Sampling
Arbitrary Set of Policies
4 Adversarial Contextual Bandit
5 Supervised Learning to Contextual Bandit
Problem Setting
Multi-Armed Bandit
At each time t, the agent selects an arm a_t (a_t ∈ {1, ..., K})
Then, the agent receives a reward r_t (= r_{a_t,t}) from the environment
If r_{i,t} is drawn i.i.d. from some distribution, we call it a stochastic bandit; if r_{i,t} is chosen by the environment, we call it an adversarial bandit
The goal of MAB is to find the policy π ∈ Π s.t.
    π(a_1, r_1, ..., a_{t-1}, r_{t-1}) = a_t
which minimizes the regret¹
    R_T := \max_{i=1,\dots,K} \mathbb{E}\left[ \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{a_t,t} \right]
¹ Properly speaking, the cumulative pseudo-regret.
Contextual Bandit
In a contextual bandit, the agent receives additional information
(= context) c_t¹ ∈ C at the beginning of time t
In a stochastic contextual bandit, the reward r_{i,t} can be represented as
a function of the context c_{i,t} and noise ε_{i,t}
    r_{i,t} = f(c_{i,t}) + ε_{i,t}
or simply r_{i,t} = f_i(c_t) + ε_{i,t} if the context c_t does not depend on the arm i
In an adversarial contextual bandit, the reward r_{i,t} is selected by the
environment, as in the non-contextual MAB
¹ The literature often writes c_{i,t} to emphasize that each arm i has a corresponding context c_{i,t}. However, both notations are equivalent, since we can construct a single vector c_t by concatenating the c_{i,t}'s.
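To make the stochastic linear setting above concrete, here is a minimal sketch (my own illustration, not part of the slides) of an environment with rewards r_{i,t} = c_{i,t}^T θ* + ε_{i,t}. The class name and the `contexts()` / `reward()` interface are assumptions; the LinUCB and LinTS sketches later in this write-up are written against this interface.

```python
import numpy as np

class LinearContextualEnv:
    """Stochastic contextual bandit with linear expected rewards."""
    def __init__(self, n_arms, dim, noise_std=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta_star = self.rng.normal(size=dim)   # unknown parameter θ*
        self.n_arms, self.dim, self.noise_std = n_arms, dim, noise_std

    def contexts(self):
        # one context vector c_{i,t} per arm at each round
        return self.rng.normal(size=(self.n_arms, self.dim))

    def reward(self, contexts, arm):
        # r_{i,t} = c_{i,t}^T θ* + ε_{i,t}
        return contexts[arm] @ self.theta_star + self.rng.normal(scale=self.noise_std)
```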
Optimal Regret Bound
Stochastic Bandit: Ω(log T)¹
Adversarial Bandit: Ω(√(KT))²
Contextual Bandit: Ω(d√T)³
¹ Lai & Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
² Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem. FOCS, 1995. By a minimax strategy. Note that an adversarial bandit can be thought of as a 2-player game between the agent and the environment.
³ Dani et al. Stochastic Linear Optimization under Bandit Feedback. COLT, 2008. Remark that the lower bound is Ω(√T) even for the stochastic contextual bandit, since contexts may arrive adversarially.
Naïve Approach: Reduce to MAB
Naïve Approach: Reduce to MAB
Approach 1: assume the context set is finite (|C| = N)
Run a MAB algorithm (e.g. EXP3) for each context independently
The regret bound is O(√(TNK log K))¹ (w/ EXP3)
Approach 2: assume the policy space is finite (|H| = M)
Run a MAB algorithm (e.g. EXP3) on policies, instead of arms
The regret bound is O(√(TM log M)) (w/ EXP3)
¹ \sum_{c=1}^{N} O(\sqrt{n_c K \log K}) \le O(\sqrt{TN} \cdot \sqrt{K \log K}), where n_c is the number of times context c is observed (by the Cauchy–Schwarz inequality)
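A minimal sketch of Approach 1, assuming contexts are hashable values and assuming a generic MAB interface with `select()` / `update(arm, reward)` methods (that interface, and the `make_mab` factory, are hypothetical names of mine): keep one independent bandit instance per observed context.

```python
class PerContextBandit:
    """Approach 1: run an independent MAB instance for each observed context."""
    def __init__(self, make_mab):
        self.make_mab = make_mab      # factory, e.g. lambda: EXP3(n_arms=K)
        self.bandits = {}             # context -> MAB instance

    def select(self, context):
        if context not in self.bandits:
            self.bandits[context] = self.make_mab()
        return self.bandits[context].select()

    def update(self, context, arm, reward):
        self.bandits[context].update(arm, reward)
```

Since every context learns from scratch, the regret scales with the number of distinct contexts N, which is exactly the √N factor in the bound above.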
Stochastic Contextual Bandit
UCB & Thompson Sampling
Review: Index Policy and Greedy Algorithm
Since the Gittins index¹, index policies have become one of the most popular
strategies for MAB problems
Idea: at each time t, define a score s_{i,t} (= index) for each arm i.
Select the arm with the highest score
Question: how do we define a proper s_{i,t}?
Naïve approach: use the empirical mean²! (greedy algorithm)
However, the naïve greedy algorithm may incur O(T) regret
¹ Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society, 1979.
² Note that MAB becomes trivial if we know the true means. The general goal of MAB algorithms is to estimate the means accurately and quickly (explore–exploit dilemma).
Review: UCB1
Assume r_{i,t} ∼ P_i with support [0, 1] and mean µ_i
Idea: select seldom-selected arms more and often-selected arms less.
In other words, give a confidence bonus¹!
UCB1²: define the score as
    s_{i,t} = \hat{\mu}_{i,t} + \sqrt{\frac{2 \log t}{n_{i,t}}}
where \hat{\mu}_{i,t} is the empirical mean and n_{i,t} is the number of times arm i has been selected
The UCB1 policy guarantees the optimal regret O(log T)
Also, there are other choices for the UCB (e.g. KL-UCB³, Bayes-UCB⁴)
¹ We call this bonus the UCB (upper confidence bound). Thus, score = estimated mean + UCB.
² Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.
³ Garivier & Cappé. The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. COLT, 2011.
⁴ Kaufmann et al. On Bayesian Upper Confidence Bounds for Bandit Problems. AISTATS, 2012.
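A minimal sketch of the UCB1 rule above, assuming rewards in [0, 1] come from a caller-supplied function `env_pull(arm)` (a hypothetical interface, and the horizon is assumed to be at least K so the initialization finishes).

```python
import numpy as np

def ucb1(env_pull, n_arms, horizon):
    """UCB1 sketch: play each arm once, then pick argmax of mean + sqrt(2 log t / n_i)."""
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:                      # initialization: try every arm once
            arm = t - 1
        else:
            means = sums / counts
            bonus = np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(means + bonus))
        r = env_pull(arm)                    # observe a reward in [0, 1]
        counts[arm] += 1
        sums[arm] += r
    return sums.sum()                        # total collected reward
```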
LinUCB
Assume r_{i,t} ∼ P(r_{i,t} | c_{i,t}, θ∗) where E[r_{i,t}] = c_{i,t}^T θ∗ (c_{i,t}, θ∗ ∈ R^d)
Like UCB1, we want to define the score as
    s_{i,t} = c_{i,t}^T \hat{\theta}_t + UCB_{i,t}
Question: how do we choose a proper UCB_{i,t}?
LinUCB
Idea: let \hat{\theta}_t be the ridge-regression estimator of θ∗
    \hat{\theta}_t = (C_t^T C_t + \lambda I_d)^{-1} C_t^T R_t
where C_t = {c_1, ..., c_{t-1}} and R_t = {r_1, ..., r_{t-1}}
Then, the inequality below holds with probability 1 − δ
    \left| c_{i,t}^T \hat{\theta}_t - c_{i,t}^T \theta^* \right| \le (\epsilon + 1) \sqrt{ c_{i,t}^T A_t^{-1} c_{i,t} }
where A_t = C_t^T C_t + I_d and \epsilon = \sqrt{ \tfrac{1}{2} \log \tfrac{2TK}{\delta} }
LinUCB
LinUCB¹: define the score as
    s_{i,t} = c_{i,t}^T \hat{\theta}_t + \alpha \sqrt{ c_{i,t}^T A_t^{-1} c_{i,t} }
The regret bound (with probability 1 − δ) is
    O\left( d \sqrt{ T \log \tfrac{1 + T}{\delta} } \right)
so the LinUCB policy guarantees the optimal regret Õ(d√T)
Also, there are other choices for the UCB (e.g. LinREL², CoFineUCB³)
¹ Li et al. A contextual-bandit approach to personalized news article recommendation. WWW, 2010.
² Auer. Using Confidence Bounds for Exploitation-Exploration Trade-offs. JMLR, 2002.
³ Yue et al. Hierarchical Exploration for Accelerating Contextual Bandits. ICML, 2012.
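A minimal sketch of LinUCB as defined above, keeping A_t and b_t = C_t^T R_t incrementally. The `select` / `update` interface and the per-round `(n_arms, dim)` context array are my assumptions (they match the environment sketch earlier); α plays the role of ε + 1 in the confidence bound.

```python
import numpy as np

class LinUCB:
    """LinUCB sketch: ridge-regression estimate plus alpha * sqrt(c^T A^{-1} c) bonus."""
    def __init__(self, dim, alpha=1.0, lam=1.0):
        self.A = lam * np.eye(dim)   # A_t = C_t^T C_t + lambda * I_d
        self.b = np.zeros(dim)       # C_t^T R_t
        self.alpha = alpha

    def select(self, contexts):                  # contexts: (n_arms, dim)
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        bonus = np.sqrt(np.einsum('id,dk,ik->i', contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta_hat + self.alpha * bonus))

    def update(self, context, reward):           # context of the played arm
        self.A += np.outer(context, context)
        self.b += reward * context
```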
Review: Thompson Sampling
Another popular strategy for MAB is Thompson Sampling¹
It can be applied to both contextual and non-contextual bandits
Assume r_{i,t} ∼ P(r_{i,t} | c_{i,t}, θ∗) with prior θ∗ ∼ P(θ)
Idea: sample an estimator θ_t from the posterior distribution
step 1. draw θ_t from the posterior P(θ | D = {c_t, a_t, r_t})
step 2. select the arm a_t = arg max_i E[r_{i,t} | c_{i,t}, θ_t]
The idea is simple, but it works well both in theory² and in practice³
¹ Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples. Biometrika, 1933.
² Agrawal et al. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. COLT, 2012.
³ Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 2010.
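A minimal sketch of the two steps above for the non-contextual Bernoulli case with Beta(1, 1) priors; the conjugate-prior choice is my illustrative assumption, and `env_pull(arm)` is assumed to return a reward in {0, 1}.

```python
import numpy as np

def thompson_bernoulli(env_pull, n_arms, horizon, seed=0):
    """Thompson Sampling sketch for Bernoulli rewards with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_arms)   # 1 + number of successes
    beta = np.ones(n_arms)    # 1 + number of failures
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)       # step 1: sample from the posterior
        arm = int(np.argmax(theta))         # step 2: act greedily w.r.t. the sample
        r = env_pull(arm)                   # observe a reward in {0, 1}
        alpha[arm] += r
        beta[arm] += 1 - r
    return alpha, beta
```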
LinTS
Assume r_{i,t} ∼ N(c_{i,t}^T θ∗, v²) and θ∗ ∼ N(\hat{\theta}_t, v² B_t^{-1}), where
    B_t = I_d + \sum_{\tau=1}^{t-1} c_{i,\tau} c_{i,\tau}^T, \qquad \hat{\theta}_t = B_t^{-1} \sum_{\tau=1}^{t-1} c_{i,\tau} r_{i,\tau}
    r_{i,t} \in [\bar{r}_{i,t} - R, \bar{r}_{i,t} + R], \qquad v = R \sqrt{ 24\, d \log \tfrac{t}{\delta} }
Then, the posterior of θ∗ is N(\hat{\theta}_{t+1}, v² B_{t+1}^{-1})
LinTS¹: run Thompson Sampling under this model
The regret bound (with probability 1 − δ) is
    O\left( d^2 \sqrt{ T^{1+\epsilon} } \log(Td) \log \tfrac{1}{\delta} \right)  for a parameter ε ∈ (0, 1)
¹ Agrawal et al. Thompson Sampling for Contextual Bandits with Linear Payoffs. ICML, 2013.
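A minimal sketch of LinTS under the Gaussian model above: sample θ_t ~ N(θ̂_t, v² B_t⁻¹), then act greedily on the sample. The `select` / `update` interface and the constant value of v are my simplifying assumptions (the theory scales v with R, d, and log(t/δ) as in the slide).

```python
import numpy as np

class LinTS:
    """LinTS sketch: sample theta_t ~ N(theta_hat_t, v^2 * B_t^{-1}) and act greedily."""
    def __init__(self, dim, v=0.5):
        self.B = np.eye(dim)          # B_t = I_d + sum of c c^T
        self.f = np.zeros(dim)        # sum of c * r
        self.v = v                    # posterior scale (fixed here for simplicity)

    def select(self, contexts, rng):              # contexts: (n_arms, dim)
        B_inv = np.linalg.inv(self.B)
        theta_hat = B_inv @ self.f
        theta = rng.multivariate_normal(theta_hat, self.v ** 2 * B_inv)
        return int(np.argmax(contexts @ theta))

    def update(self, context, reward):            # context of the played arm
        self.B += np.outer(context, context)
        self.f += reward * context
```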
UCB & TS: Nonlinear Case
Assume E[r_{i,t}] = f(c_{i,t}) is a general nonlinear function
If we assume f is a member of an exponential family, we can use GLM-UCB¹
If we assume f is sampled from a Gaussian Process, we can use GP-UCB²/CGP-UCB³
If we assume f is an element of a Reproducing Kernel Hilbert Space, we can use KernelUCB⁴
Also, we can use Thompson Sampling if we know the form of the probability distribution
¹ Filippi et al. Parametric Bandits: The Generalized Linear Case. NIPS, 2010.
² Srinivas et al. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML, 2010.
³ Krause & Ong. Contextual Gaussian Process Bandit Optimization. NIPS, 2011.
⁴ Valko et al. Finite-Time Analysis of Kernelised Contextual Bandits. UAI, 2013.
Stochastic Contextual Bandit
Arbitrary Set of Policies
Epoch-Greedy
Assume the policy space H is finite¹
Idea: explore for T′ steps and exploit for the remaining T − T′ steps (epsilon-first)
issue 1. how to get an unbiased estimate of the best policy?
issue 2. how to balance explore and exploit if we don't know T?
trick 1: use the data D = {c_t, a_t, r_t} observed in the explore steps
    \hat{\pi} = \arg\max_{\pi \in H} \sum_{(c_t, a_t, r_t) \in D} \frac{ r_{a_t,t} \, \mathbb{1}(\pi(c_t) = a_t) }{ 1/K }
trick 2: run epsilon-first in mini-batches (a partition of T)
¹ The infinite case w/ finite VC-dimension can be handled in a similar way
Epoch-Greedy
Epoch-Greedy¹: combine trick 1 & trick 2
Regret bound is Õ(T^{2/3}) (not optimal!)
¹ Langford & Zhang. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. NIPS, 2007.
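A minimal sketch combining the two tricks, assuming the environment interface sketched earlier (`contexts()` / `reward()`) and a finite list of policies given as functions from a context array to an arm index. The fixed per-epoch exploration/exploitation lengths are a simplification of mine; the paper sets the exploitation length adaptively from the exploration data.

```python
import numpy as np

def epoch_greedy(env, policies, n_arms, n_epochs, explore_per_epoch, exploit_per_epoch, seed=0):
    """Epoch-Greedy sketch: per epoch, explore uniformly, then exploit the policy
    that maximizes the importance-weighted (1/K) estimate of its value (trick 1)."""
    rng = np.random.default_rng(seed)
    data, total_exploit_reward = [], 0.0
    for _ in range(n_epochs):
        for _ in range(explore_per_epoch):             # explore: pull a uniform arm
            c = env.contexts()
            a = int(rng.integers(n_arms))
            data.append((c, a, env.reward(c, a)))
        # trick 1: unbiased policy value = K * sum of r * 1[pi(c) == a]
        values = [sum(r * (pi(c) == a) for c, a, r in data) * n_arms for pi in policies]
        best = policies[int(np.argmax(values))]
        for _ in range(exploit_per_epoch):             # exploit the selected policy
            c = env.contexts()
            total_exploit_reward += env.reward(c, best(c))
    return total_exploit_reward
```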
RandomizedUCB
Idea: estimate a distribution P_t over the policy space H
RandomizedUCB¹:
Regret bound is Õ(√T), but the time complexity is O(T⁶)
¹ Dudik et al. Efficient Optimal Learning for Contextual Bandits. UAI, 2011.
ILOVECONBANDITS
Idea: similar to RandomizedUCB, but with improved time complexity
ILOVECONBANDITS¹ (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS):
Regret bound is Õ(√T), and the time complexity is O(T^{1.5})
¹ Agarwal et al. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. ICML, 2014.
Adversarial Contextual Bandit
Review: EXP3
Assume r_{i,t} ∈ [0, 1] is selected by the environment
In the adversarial setting, the agent must select arms randomly
Idea: put more probability on arms whose observed rewards are higher
EXP3¹ (EXPonential-weight algorithm for EXPloration and EXPloitation):
Regret bound is O(√(TK log K))
¹ Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
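A minimal sketch of EXP3: exponential weights over arms, uniform exploration mixing, and an importance-weighted reward estimate so every arm's estimate is unbiased. `env_pull(arm)` returning a reward in [0, 1] is an assumed interface.

```python
import numpy as np

def exp3(env_pull, n_arms, horizon, gamma=0.1, seed=0):
    """EXP3 sketch: exponential weights with importance-weighted reward estimates."""
    rng = np.random.default_rng(seed)
    w = np.ones(n_arms)
    for _ in range(horizon):
        p = (1 - gamma) * w / w.sum() + gamma / n_arms   # mix in uniform exploration
        arm = int(rng.choice(n_arms, p=p))
        r = env_pull(arm)                                # reward in [0, 1]
        r_hat = r / p[arm]                               # unbiased estimate of r_{arm,t}
        w[arm] *= np.exp(gamma * r_hat / n_arms)         # exponential weight update
    return w
```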
EXP4
Idea: run EXP3 on policies, instead of arms
EXP4¹ (EXPonential-weight algorithm for EXPloration and EXPloitation using EXPert advice):
Regret bound is O(√(TK log N)), but the variance is high
¹ Auer et al. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
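A minimal sketch of EXP4, reusing the EXP3 machinery but keeping weights over experts (policies) rather than arms. It assumes an environment with the `contexts()` / `reward()` interface sketched earlier and experts given as functions from a context array to a probability vector over arms; both are my assumptions.

```python
import numpy as np

def exp4(env, experts, n_arms, horizon, gamma=0.1, seed=0):
    """EXP4 sketch: exponential weights over experts, each advising a distribution over arms."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(experts))
    for _ in range(horizon):
        c = env.contexts()
        advice = np.array([e(c) for e in experts])       # (n_experts, n_arms) probabilities
        p = (1 - gamma) * (w / w.sum()) @ advice + gamma / n_arms
        arm = int(rng.choice(n_arms, p=p))
        r = env.reward(c, arm)
        r_hat = np.zeros(n_arms)
        r_hat[arm] = r / p[arm]                          # importance-weighted arm reward
        w *= np.exp(gamma * (advice @ r_hat) / n_arms)   # credit each expert by its advice
    return w
```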
EXP4.P
Idea: run EXP4 with better weights, to make the algorithm stable
EXP4.P¹ (EXP4 with Probability):
Regret bound is O(√(TK log N)), with high probability
¹ Beygelzimer et al. Contextual Bandit Algorithms with Supervised Learning Guarantees. AISTATS, 2011.
Supervised Learning to Contextual Bandit
Supervised Learning to Contextual Bandit
Idea: note that a contextual bandit can be thought of as a supervised
learning problem with a partial-observability restriction
Trick: use a randomized algorithm (e.g. epsilon-greedy) and the unbiased
(true) reward estimator \hat{r}_{a_t,t} = r_{a_t,t} / p_{a_t} instead of the observed reward r_{a_t,t}.
Then,
    \mathbb{E}[\hat{r}_{i,t}] = p_i \cdot \frac{r_{i,t}}{p_i} + (1 - p_i) \cdot 0 = r_{i,t}
Using this trick, any supervised learning algorithm can be converted
into a contextual bandit algorithm
Banditron and NeuralBandit are examples using neural networks
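A minimal sketch of the two ingredients of the trick: an ε-greedy action distribution built on any supervised model's scores, and the importance-weighted reward vector whose expectation equals the true reward vector. The function names are mine; any online learner can then be trained on `(context, ips_reward_vector(...))` pairs.

```python
import numpy as np

def epsilon_greedy_probs(scores, eps):
    """Epsilon-greedy distribution over arms given a supervised model's scores."""
    n = len(scores)
    p = np.full(n, eps / n)
    p[int(np.argmax(scores))] += 1 - eps
    return p

def ips_reward_vector(n_arms, arm, reward, p_arm):
    """Importance-weighted reward vector: the played arm gets r / p, others get 0,
    so E[r_hat_i] = p_i * (r_i / p_i) + (1 - p_i) * 0 = r_i."""
    r_hat = np.zeros(n_arms)
    r_hat[arm] = reward / p_arm
    return r_hat
```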
Banditron and NeuralBandit
Both Banditron¹ and NeuralBandit² combine an epsilon-greedy policy with the
unbiased reward estimator (Banditron with a multiclass perceptron, NeuralBandit with multi-layer perceptrons)
However, Banditron uses the 0-1 loss (classification) while NeuralBandit uses the L2 loss (regression)
The regret bound of the original Banditron is O(T^{2/3}), and a 2nd-order variant³ reduced it to Õ(√T)
No theoretical guarantee has been proved for NeuralBandit yet
¹ Kakade et al. Efficient Bandit Algorithms for Online Multiclass Prediction. ICML, 2008.
² Allesiardo et al. A Neural Networks Committee for the Contextual Bandit Problem. ICONIP, 2014.
³ Crammer & Gentile. Multiclass Classification with Bandit Feedback using Adaptive Regularization. ICML, 2013.
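A sketch of a Banditron-style step, written from the description above: ε-greedy exploration around the perceptron's prediction plus an importance-weighted perceptron update. The callable `true_label_is(label)`, assumed to return 1 if the guessed label matches the hidden true label, stands in for the bandit feedback and is a hypothetical interface of mine.

```python
import numpy as np

class Banditron:
    """Banditron-style sketch: epsilon-greedy multiclass perceptron with an IPS update."""
    def __init__(self, n_classes, dim, gamma=0.05, seed=0):
        self.W = np.zeros((n_classes, dim))
        self.k, self.gamma = n_classes, gamma
        self.rng = np.random.default_rng(seed)

    def step(self, x, true_label_is):
        y_hat = int(np.argmax(self.W @ x))                 # greedy prediction
        p = np.full(self.k, self.gamma / self.k)
        p[y_hat] += 1 - self.gamma                         # epsilon-greedy exploration
        y_tilde = int(self.rng.choice(self.k, p=p))
        correct = true_label_is(y_tilde)                   # bandit feedback: 1[y_tilde == y]
        U = np.zeros_like(self.W)
        U[y_tilde] += (correct / p[y_tilde]) * x           # importance-weighted positive part
        U[y_hat] -= x                                      # perceptron-style negative part
        self.W += U
        return y_tilde
```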
Summary & Reference
Summary
Reference
[Zhou 2015] A Survey on Contextual Multi-armed Bandits. arXiv, 2015.
[Burtini 2015] A Survey of Online Experiment Design with the Stochastic Multi-Armed Bandit. arXiv, 2015.
[Bubeck 2012] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv, 2012.
