Regret of Queueing Bandits
Sanjay Shakkottai
Department of Electrical and Computer Engineering
The University of Texas at Austin
Joint with Subhashini Krishnasamy, Rajat Sen, Ari Arapostathis (UT
Austin); Ramesh Johari (Stanford Univ.)
SAVES Meeting
April 10, 2018
Motivation (1/3)
Stream of multiple types of tasks
(jobs)
Multiple agents (servers) with
varying task dependent expertise
Match (schedule) tasks to agents
Dynamic decision making problem
because the number of tasks
changes with time based on past
decisions
Queueing Models
Rich history for such decision making
through queueing and scheduling for
various performance metrics
[Figure: tasks arriving to Queues 1, …, U are matched to agents/servers with rates μ1, …, μK, where e.g. μ1 = (μ11, μ12, …, μ1U)]
Motivation (2/3)
Emerging Setting: Agent and task
characteristics unknown
Joint online learning and dynamic
optimization
Online Learning: Learn agent and task
characteristics/statistics
Formally, learn task dependent service
rates of agents
Dynamic Optimization: Using the
learned statistics, iteratively optimize to
achieve performance goals
Applications
Online service systems (Uber, Lyft, Airbnb, Upwork); Scheduling in wireless
networks; Crowdsourced task allocation for human-machine systems
Motivation, Questions and Approach (3/3)
How well do we need to learn the
statistics?
What is the time-scale of learning?
How many resources are needed for learning?
Algorithms for joint online learning
and optimization?
Bandit Approach
Rich history for online learning and
optimization through bandits and regret
Bandit Overview
[Figure: K bandit arms, Arm 1, …, Arm K, with unknown means μ1, μ2, μ3, …, μK]
Bandit Overview (1/3)
Multi-armed Bandit: K arms, each arm returns a random Bernoulli
reward if the arm is played
Can play one arm at each (discrete) time t
Associate an rv Xi(t) with arm i, with P(Xi(t) = 1) = µi
WLOG 1 > µ1 > µ2 ≥ . . . ≥ µK > 0
Reward: Accumulate reward at time t if the chosen arm returns ’1’
Key Question
Suppose the means {µ1, . . . , µK} are unknown. Which arm should we play at each time to maximize the expected cumulative reward?
Bandit Overview (2/3)
As we play arms over time, we learn the values of {µi } (with varying
reliabilities)
Explore vs. Exploit: At time t should we play unknown arms (explore
to discover the arm with maximum µi ) OR play best known arm
(exploit past information)
Applications – Optimizing while Learning: Online advertising, drug
trials, wireless spectrum probing/sharing, finance, ...
Bandit Overview (3/3)
Policy π plays a sequence of arms {i1, i2, . . .} over time
Arm selection can depend on all past arm selections and reward
observations
Regret of a policy R(t): The expected accumulated loss of reward with
respect to a genie that knows the best arm (i.e. genie knows {µi })
R(t) = tµ1 − E[ Σ_{s=1}^{t} X_{i_s}(s) ]
Key Results (Lai and Robbins; Auer, Cesa-Bianchi and Fischer)
1. R(t) scales as K log(t)
2. Simple algorithms along with finite time upper and lower regret bounds
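To make the regret definition concrete, here is a minimal simulation sketch (not from the talk): a K-armed Bernoulli bandit played with the UCB1 index policy of Auer, Cesa-Bianchi and Fischer, with R(t) tracked along one sample path (averaging over seeds approximates the expectation). The means, horizon, and names are illustrative.

```python
import math, random

def ucb1_regret(mu, horizon, seed=0):
    """Simulate UCB1 on a Bernoulli bandit with means `mu`;
    return the running regret R(t) = t*mu_1 - accumulated reward (one path)."""
    rng = random.Random(seed)
    K = len(mu)
    counts = [0] * K            # times each arm was played
    sums = [0.0] * K            # total reward collected per arm
    best = max(mu)
    reward_total, regret = 0.0, []
    for t in range(1, horizon + 1):
        if t <= K:                               # play each arm once first
            arm = t - 1
        else:                                    # UCB1 index: mean + sqrt(2 log t / n)
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < mu[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        reward_total += reward
        regret.append(t * best - reward_total)
    return regret

if __name__ == "__main__":
    R = ucb1_regret(mu=[0.7, 0.5, 0.4, 0.3], horizon=5000)
    print("R(5000) ≈", round(R[-1], 1))          # grows roughly like K log t
```

Plotting R(t) from such runs shows the logarithmic growth referred to above.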
Part 1: Queue Regret for Server Selection/Matching
[Figure: a single queue of arrivals that can be served by any one of K agents/servers with rates μ1, μ2, μ3, …, μK]
Queueing + Bandits
Arms as servers; departure from queue
if reward equals ’1’
Bernoulli job arrivals at rate λ ∈ (0, 1);
job backlogged in queue until served
Genie is stable: λ < µ1
Bandit algorithm schedules server
(’plays arm’) whenever queue is
backlogged
Applications: Online service systems
(Uber, Lyft, Airbnb, Upwork); financial
markets (limit order books);
communication networks, ...
Queue-Regret
[Figure: queue length vs. time, with a regenerative (busy) cycle marked]
Q(t) queue length at time t under bandit algorithm,
Q∗(t) queue length under “genie” policy
Always schedules the best server (here, server ’1’)
Ψ(t) is the queue-regret:
Ψ(t) := E[Q(t) − Q∗(t)].
Interpretation: Ψ(t) is the traditional regret, with the caveat that reward is accumulated only when the queue is backlogged
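A minimal Monte-Carlo sketch of Ψ(t) (an illustration, not the talk's code): run the bandit-controlled queue and the genie queue on a shared Bernoulli arrival stream and average Q(t) − Q∗(t) over sample paths. The ε-greedy server choice below is only a stand-in for a generic bandit scheduler; rates and parameters are made up.

```python
import random

def queue_regret(mu, lam, horizon, runs=200, eps=0.1, seed=0):
    """Monte-Carlo estimate of Psi(t) = E[Q(t) - Q*(t)] on a shared
    arrival stream; the bandit queue picks servers eps-greedily."""
    K = len(mu)
    best = max(range(K), key=lambda i: mu[i])
    psi = [0.0] * horizon
    for r in range(runs):
        rng = random.Random(seed + r)
        q = q_star = 0
        succ, tries = [0] * K, [0] * K
        for t in range(horizon):
            arrival = rng.random() < lam            # same arrivals for both queues
            q += arrival
            q_star += arrival
            if q > 0:                               # bandit queue schedules a server
                if rng.random() < eps:
                    k = rng.randrange(K)
                else:                               # exploit the empirical best server
                    k = max(range(K),
                            key=lambda i: succ[i] / tries[i] if tries[i] else 1.0)
                served = rng.random() < mu[k]
                tries[k] += 1
                succ[k] += served
                q -= served
            if q_star > 0 and rng.random() < mu[best]:  # genie always uses the best server
                q_star -= 1
            psi[t] += (q - q_star) / runs
    return psi

if __name__ == "__main__":
    psi = queue_regret(mu=[0.65, 0.5, 0.4], lam=0.55, horizon=3000)
    print("Psi(500) ≈ %.2f, Psi(3000) ≈ %.2f" % (psi[499], psi[-1]))
```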
The Bandit vs. the Queueing Viewpoint
Q(t) = Σ_{s=1}^{t} (A(s) − D(s)): the queue length equals [cumulative arrivals − cumulative departures]
Q∗(t) = Σ_{s=1}^{t} (A(s) − D∗(s))
Ψ(t) = E[ Σ_{s=1}^{t} (D∗(s) − D(s)) ]: accumulated difference in service
In bandit terms, this is the difference in accumulated rewards
Bandit Viewpoint
Regret increases over time: Ψ(t) ≤ R(t) ∼ K log(t)
Queueing Viewpoint
As t → ∞ in steady state, E[Q(t) − Q∗(t)] ∼ 0
Key Question
How do we bridge these two different viewpoints?
Intuition – Bridging these Viewpoints
[Figure: queue-regret Ψ(t) vs. t; early stage bounds Ω(log t / log log t) and O(log³ t); late stage bounds Ω(1/t) and O(log³ t / t)]
Over time, we (approximately) learn the values of {µi }
Eventually, learn “well enough” so that “effective service rate” exceeds
λ (arrival rate)
Queue length hits zero periodically =⇒ sample path queue-regret
“resets” at these epochs!
Takeaway
We should anticipate a phase transition in queue-regret behavior
Main Results – The Late Stage (1/3)
Queue length hits zero infinitely often; at these epochs the
sample-path regret “resets”
Queue-regret approximately a (discrete) derivative of the bandit
cumulative regret
Since the optimal cumulative regret scales like log(t), asymptotically
the optimal queue-regret should scale like 1/t
Takeaway
Order-wise matching upper and lower bounds showing O(1/t) behavior
Main Results – The Early Stage (2/3)
Still cannot (even approximately) identify the best server
Expected service rate is smaller than arrival rate λ
Queue continuously backlogged; queue-regret similar to bandit regret
Takeaway
Order-wise matching upper and lower bounds showing O(log(t)) behavior
Main Results – The Transition (3/3)
Time to switch scales at least as t = Ω(K/ε), where ε = (µ1 − λ) is the gap between the best service rate and the arrival rate
Transition analysis through a heavily loaded setting as ε → 0
Scale K and ε; demonstrate an algorithm with queue-regret O(poly(log t)/(ε²t)) for times arbitrarily close to Ω(K/ε)
Takeaway
Phase transition time scales as Ω(K/ε). A smaller ε means it is harder to learn the optimal server, which pushes out the phase transition time.
Implications
Scheduling: Much of scheduling literature focuses on steady-state or
long-time-scale behavior (e.g. Lyapunov arguments)
With emerging systems (online matching markets, wireless systems),
short-time behavior is equally important
In Online service systems, the number of jobs per customer might
reach steady-state only after a long time
In wireless 5G, much more flux between base-stations due to
densification
Related Work
Bandits and Cumulative Regret: Vast literature started with Lai &
Robbins 1985, and UCB (finite time bounds and simple algorithm)
Auer, Cesa-Bianchi, & Fischer 2002; See Bubeck and Cesa-Bianchi
2012 for a survey
Bandit and Queues: Rich history with focus on infinite horizon costs
and optimality of index policies (Gittins index 1979): Cox & Smith
1961, Buyukkoc, Varaiya and Walrand 1985, Van Mieghem 1995, Lott
& Teneketzis 2000, Mahajan & Teneketzis 2008, Nino-Mora 2006
Achieving the Bounds: ε-Greedy Thompson Sampling
[Figure: timeline of time steps, each marked as an explore or an exploit step]
Bandit algorithms trade-off between explore and exploit steps
εt-Greedy: With probability εt, choose a server uniformly at random; otherwise use Thompson sampling (here, εt = 3K log²(t)/t)
Thompson Sampling: A sampling and Bayesian-update algorithm to model and update {µ1, . . . , µK}
Jointly used to both update “belief” on best arm as well as determine
the next arm to sample
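A sketch of one Q-ThS decision as just described; maintaining per-server success/failure counts and capping εt at 1 are assumptions of this sketch, not taken from the talk.

```python
import math, random

def q_ths_choose(t, successes, failures, rng=random):
    """One Q-ThS decision at (discrete) time t >= 1.
    successes[k] / failures[k] count past '1' / '0' outcomes of server k."""
    K = len(successes)
    eps_t = min(1.0, 3 * K * math.log(t) ** 2 / t)   # structured-exploration probability
    if rng.random() < eps_t:
        return rng.randrange(K)                      # explore: uniformly random server
    # exploit: Thompson sampling from the Beta(A_k + 1, B_k + 1) posteriors
    samples = [rng.betavariate(successes[k] + 1, failures[k] + 1) for k in range(K)]
    return max(range(K), key=lambda k: samples[k])
```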
ε-Greedy Structured Exploration
Explore Step: The εt-Greedy algorithm provides structured exploration
εt-Greedy: With probability εt, choose a server uniformly at random; otherwise use Thompson sampling (here, εt = 3K log²(t)/t)
Ensures that a poly-logarithmic amount of time is used for “pure” learning
Provides high probability upper bounds on number of sub-optimal
schedules in the exploit steps
A Primer on Thompson Sampling
[Figure: Beta pdfs β(1, 1), β(3, 2), β(10, 4), β(100, 34) on [0, 1]]
Model µi as a random variable Qi for each i ∈ [K]
Initially, Uniform [0, 1] prior distribution for each Qi
Sample each distribution, choose arm/server with largest sample
Update the sampled arm’s distribution (Bayesian posterior
distribution) based on {0, 1} observations
Observations until time t: Arm i has seen Ai (t) ’1’s and Bi (t) ’0’s
Conjugate Prior: The posterior distribution of Qi given the
observations is Beta(Ai (t) + 1, Bi (t) + 1)
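A small sketch of the Beta-Bernoulli bookkeeping just described, assuming one object per arm holding the counts of observed '1's and '0's; names are illustrative.

```python
import random

class BetaArm:
    """Beta-Bernoulli posterior for one arm: Uniform(0, 1) = Beta(1, 1) prior,
    posterior Beta(A + 1, B + 1) after A ones and B zeros."""
    def __init__(self):
        self.ones = 0
        self.zeros = 0

    def sample(self, rng=random):
        return rng.betavariate(self.ones + 1, self.zeros + 1)

    def update(self, reward):            # reward in {0, 1}
        if reward:
            self.ones += 1
        else:
            self.zeros += 1

def thompson_step(arms, true_mu, rng=random):
    """Sample every posterior, play the argmax arm, update its posterior."""
    i = max(range(len(arms)), key=lambda k: arms[k].sample(rng))
    reward = 1 if rng.random() < true_mu[i] else 0
    arms[i].update(reward)
    return i, reward
```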
Achievability in the Late Stage
Theorem
Consider any problem instance (λ, µ). Then,
Ψ(t) = O( K log³(t) / (ε²t) )
for all t large enough (precise bounds available).
K = number of servers/arms
ε = (µ1 − λ): gap between the best service rate and the arrival rate
Scaling of K and ε (Heavy-Load Scaling)
For any β ∈ (0, 1), there exists a scaling of K with ε such that the queue-regret scales as O(poly(log t)/(ε²t)) for all t > (K/ε)^β
Sketch of Proof: Achievability in the Late Stage (1/3)
Main Challenge: Coupled Cycles
Queues usually go through regenerative cycles which are independent
BUT HERE ...
Queue length evolution is dependent on the past history of bandit arm
schedules (cycles are coupled by the bandit)
Our Approach
1. High probability bound on the number of sub-optimal schedules
2. Bounds on length of regenerative cycle
Sketch of Proof: Achievability in the Late Stage (2/3)
Sub-optimal Schedules: Structured exploration via εt-Greedy shows that all servers, including the sub-optimal ones, are sampled a sufficiently large number of times
Ensures that algorithm schedules the correct link in the exploit phase
in the late stages with high probability
Busy Cycle of Queue: Coarse high probability upper bound on the
queue-length =⇒ coarse upper bound on busy cycle
Recursive Bound: Use above bound to get tighter bounds on the
queue-length, and in turn, the start of the current regenerative cycle
Sketch of Proof: Bounding the Busy Cycle (3/3)
Bandit System: Bandit bounds suggest that expected number of
sub-optimal arm pulls until time t is bounded by log(t)
Queueing System: There is a linear gap between the arrival rate and the best service rate (the cumulative gap scales as εt)
Combination implies that even in the “worst case”, the current regenerative
cycle cannot extend too far into the past
Use this as a first-cut bound on the busy cycle, use that to bound the queue length, and then use the queue-length bound to sharpen the busy-cycle bound
Converse in the Late Stage
Theorem
For any problem instance (λ, µ) and any “reasonable” policy, the queue-regret Ψ(t) satisfies
Ψ(t) ≥ (λ / (4 D(µ))) (1 − α)(K − 1) (1/t)
for infinitely many t, where D(µ) := KL(µmin, (µ∗ + 1)/2).
λ = arrival rate
∆ = rate gap between best and second best server/arm
α ∈ (0, 1), characterizes “reasonable” policy (formally, α-consistency in
bandit literature)
Sketch of Proof: Converse in the Late Stage
Sample Path Coupling: Construct a new bandit system (Bandit-Alt)
for which:
(a) Queue-regret is unchanged from bandit system
(b) Queue length of genie system (sample-path-wise) smaller than
Bandit-Alt
Bandit Bounds: (Roughly) use one time-step argument to show that
probability of using wrong server is O(1/t) for Bandit-Alt
Actually more delicate argument, as one step bounds not attainable
Show average (over time) bounds using bandit measure-change
arguments, and use pigeon-holing to get infinitely often bounds on
queue-regret
Numerical Results (1/2)
[Figure: Ψ(t) vs. t for ε = 0.05, 0.1, 0.15, with the phase-transition shift marked]
System with 5 servers and ε ∈ {0.05, 0.1, 0.15}
The phase-transition point shifts to the right as ε decreases
Numerical Results (2/2)
[Figure: Ψ(t) vs. t for Q-ThS (exploration probability 3K log²(t)/t), Q-UCB, UCB-1, Thompson Sampling, and Q-ThS (exploration probability 0.4K log²(t)/t)]
Comparison of queue-regret performance of Q-ThS, Q-UCB, UCB-1 and Thompson Sampling
5-server system with ε = 0.15 and ∆ = 0.17
Algorithm Design Questions
Learning with Queues: Should we explore more aggressively in initial
stages, because regret “resets” and we do not have an “asymptotic”
penalty?
Learning and Matching: More complex resource allocation tasks such
as matching
Interactions with multiple users/queues
Low dimensional structure across users and servers
Learning with Agent Dynamics: Agents/servers change over time
(agents arrive and depart)
How much to “trust” current “learned” agents and how much to explore
new agents?
Part 2: Holding-Cost Regret for Multi-Class Systems
System Model: Multi-Class Queues
K servers and U queues (job
classes)
Bernoulli job arrivals at rate
λu ∈ (0, 1) to class u ∈ U
Service rate matrix
(µuk, 1 ≤ u ≤ U, 1 ≤ k ≤ K)
µuk ∈ (0, 1) is the unknown
service rate (success probability)
of server k for a job of type u
Models servers/agents whose
service rate depends on the type
of job/task
System Model: Holding Costs
Holding cost: The expected total
waiting cost over finite time T
J(T) := E[ Σ_{t=1}^{T} β^t Σ_{i=1}^{U} ci Qi(t) ]
β ∈ (0, 1] a discount factor
(useful when considering
T → ∞)
ck > 0 a waiting time cost in
queue class k
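Spelled out as code, the double sum above is just the following (a minimal sketch; the trajectory and cost numbers are made up purely for illustration).

```python
def holding_cost(queue_lengths, costs, beta=1.0):
    """J(T) = sum_{t=1..T} beta^t * sum_i c_i * Q_i(t).
    queue_lengths[t - 1][i] is Q_i(t); costs[i] is c_i."""
    total = 0.0
    for t, q in enumerate(queue_lengths, start=1):
        total += beta ** t * sum(c * qi for c, qi in zip(costs, q))
    return total

# Example with two classes over three slots (illustrative numbers only)
print(holding_cost([(2, 0), (1, 1), (0, 1)], costs=(3.0, 1.0), beta=0.99))
```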
The cµ Rule
Algorithm 1 The cµ Algorithm with costs {ck} and rates {µk}
At time t,
Choose job-server pairs according to the Max-Weight rule with weights
given by {ci µi,j } (product of the waiting cost and success probability)
Single Server Case
Serve the non-empty queue with largest
cuµu, u = 1, 2, . . . , U
Buyukkoc, Varaiya and Walrand 1985
Single server cµ rule is holding cost optimal
for any β ∈ (0, 1] and any T > 0
[Figure: Queues 1, …, U served by a single server with rate vector μ1 = (μ11, μ12, …, μ1U)]
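A sketch of the single-server decision stated above: among the non-empty queues, serve the one maximizing cu·µu (ties broken arbitrarily here); the function name and example numbers are illustrative.

```python
def cmu_choose(queue_lengths, costs, mu):
    """Single-server c*mu rule: serve the non-empty queue u maximizing c_u * mu_u.
    Returns the chosen class index, or None if all queues are empty."""
    candidates = [u for u, q in enumerate(queue_lengths) if q > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda u: costs[u] * mu[u])

# Example: class 0 is costly to hold but slow to serve, class 1 the opposite
print(cmu_choose(queue_lengths=[3, 5], costs=[4.0, 1.0], mu=[0.3, 0.9]))  # -> 0
```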
Background: A History of the cµ Rule
The single server system (with known statistics)
Optimal expected holding costs for finite/infinite horizon (Cox and
Smith 1961, Buyukkoc, Varaiya and Walrand 1985)
Asymptotically optimal for convex costs in heavy traffic (Van Mieghem
1995)
The multi-server system – optimal policies for infinite horizon and
heavy traffic (with known statistics)
’N’, ’W’ networks – Harrison 1998, Bell and Williams 2001
Homogeneous servers – Lott and Teneketzis 2000, Glazebrook and
Nino Mora 2001
Generalized cµ rule (with convex costs, i.e. c = c(q) is convex) –
Mandelbaum and Stolyar 2004
Single Server Case: Main Result (1/2)
Service rates unknown – estimate them from observed samples of past allocation decisions, and use the plug-in estimates {µ̂u}
Unlike bandit algorithms, there is no explicit explore step to force learning of the service rates
J∗(T) is the holding cost under the cµ rule
J(T) is the holding cost under the empirical cµ̂ rule
Main Result: Constant Holding Cost Regret
J∗(T) − J(T) = O(1), i.e. it does not depend on T
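A sketch of the empirical cµ̂ rule in a single-server simulation: estimate µ̂u from past service attempts and apply the cµ priority with the plug-in estimates, with no forced exploration. The optimistic initial estimate for never-sampled classes and all numbers are assumptions of this sketch.

```python
import random

def simulate_c_mu_hat(lam, mu, costs, horizon, seed=0):
    """Single-server multi-class queue under the empirical c*mu_hat rule."""
    rng = random.Random(seed)
    U = len(mu)
    q = [0] * U
    attempts, successes = [0] * U, [0] * U
    for _ in range(horizon):
        for u in range(U):                          # Bernoulli arrivals per class
            q[u] += rng.random() < lam[u]
        busy = [u for u in range(U) if q[u] > 0]
        if busy:
            # plug-in estimate; unsampled classes get an optimistic 1.0 (assumption)
            mu_hat = lambda u: successes[u] / attempts[u] if attempts[u] else 1.0
            u = max(busy, key=lambda v: costs[v] * mu_hat(v))
            served = rng.random() < mu[u]           # true (unknown) service rate
            attempts[u] += 1
            successes[u] += served
            q[u] -= served
    return q

# Illustrative stable instance: final queue lengths stay small
print(simulate_c_mu_hat(lam=[0.2, 0.3], mu=[0.4, 0.7], costs=[2.0, 1.0], horizon=5000))
```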
Single Server Case: Main Result (2/2)
Intuition: Busy cycles are identical for all work conserving policies
Intuition: Within each cycle, all jobs need to be scheduled by any
policy (in some order). Implies sufficient number of server “samples”
Stability (busy cycles are sample-path identical) + server samples give “free exploration”
Implication: The learned priority order of queues “couples” with the genie by time τ = O(log t) with probability at least 1 − 1/t³
Explore-Free System
No “random exploration” regret incurred unlike typical bandit systems
Multi Server Case: Instability
Stability of multi-server cµ rule
previously unknown
New Result: In general, the cµ
rule is unstable
Queue 1 has strict priority:
c1µ1j > c2µ2j , j = 1, 2
Server 1 has higher rate:
µ11 > µ12
π1 : stationary distribution of Q1
[Figure: two queues with arrival rates λ1, λ2 served by two servers, with service rates µ11, µ12 (queue 1) and µ21, µ22 (queue 2)]
Result: Q2 is strongly unstable
If λ2 > π1(0)µ21 + π1({0, 1})µ22, then there exist positive constants b0, b1, t0 s.t. ∀t ≥ t0,
P(Q2(t) < b0t) ≤ exp(−b1t)
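This can be checked numerically with a sketch like the one below: give queue 1 strict priority over both servers and watch Q2(t) grow roughly linearly once λ2 exceeds the service left over for queue 2 (i.e. once the condition above holds). The rates are illustrative assumptions, chosen so that queue 1 alone is stable while queue 2 is starved.

```python
import random

def simulate_priority(lam, mu, horizon, seed=0):
    """Two queues, two servers, one job per server per slot; queue 1 has strict
    c*mu priority, so it picks its servers first and queue 2 gets the leftovers."""
    rng = random.Random(seed)
    q = [0, 0]
    for _ in range(horizon):
        q[0] += rng.random() < lam[0]
        q[1] += rng.random() < lam[1]
        servers = [0, 1]                     # server 0 is queue 1's better server
        n1 = min(q[0], len(servers))         # queue 1 grabs one server per backlogged job
        for s in servers[:n1]:
            q[0] -= rng.random() < mu[0][s]
        n2 = min(q[1], len(servers) - n1)    # queue 2 is matched to the remaining servers
        for s in servers[n1:n1 + n2]:
            q[1] -= rng.random() < mu[1][s]
    return q

# Illustrative rates: queue 1 stays small, while Q2 grows roughly linearly
print(simulate_priority(lam=[0.5, 0.3], mu=[[0.7, 0.6], [0.2, 0.15]], horizon=20000))
```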
Multi Server Case: Sufficient Conditions for Stability
Stability and Tail Bounds on Busy Cycles
λ · α < min_{q ∈ QK} (R(q) · α), for some α > 0, α ∈ PU,
where PU is the probability simplex and Ql := {q ∈ Z+^U : |q|1 = l} for l ∈ Z+.
R(q) is the service rate to queues as a function of queue lengths
Condition in addition leads to exponential tails on busy cycles – uses
drift analysis from (Hajek 1982)
In limiting cases, provides close to complete stability region
Multi Server Case: A Conditional Explore Algorithm (1/2)
Algorithm 2 Conditional Explore cµ̂ Algorithm
At time t,
ε(t) ← 1{Nmin(t) < Υ(t)},
B(t) ← independent Bernoulli sample of mean min{1, 3U log²(t)/t}.
if ε(t) ∧ B(t) = 1 then
Explore: Schedule from E uniformly at random.
else
Exploit: Schedule according to the cµ rule with parameters µ̂(t).
end if
Nmin(t) = min_{i,j} Ni,j(t), Υ(t) = polylog(t)
Intuition: Explore only if the (worst case) number of samples is
sub-logarithmic with respect to time
Algorithm initially explores aggressively, but falls off quickly
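A sketch of the explore/exploit gate above: exploration can fire only while the least-sampled (class, server) pair has fewer than Υ(t) samples, and even then only with probability min{1, 3U log²(t)/t}. The exploit branch uses a simple greedy assignment as a stand-in for the full Max-Weight matching, and interpreting “schedule from E uniformly at random” as a random backlogged class per server is also an assumption of this sketch.

```python
import math, random

def conditional_explore_step(t, counts, q, costs, mu_hat, rng=random):
    """One scheduling decision of a Conditional-Explore c*mu_hat sketch at time t >= 1.
    counts[u][k] = past samples of (class u, server k); q[u] = queue length;
    mu_hat[u][k] = empirical service-rate estimate. Returns {server: class}."""
    U, K = len(q), len(mu_hat[0])
    busy = [u for u in range(U) if q[u] > 0]
    if not busy:
        return {}
    n_min = min(min(row) for row in counts)
    upsilon = math.log(t) ** 2                    # Upsilon(t) = polylog(t), one choice
    explore = n_min < upsilon                     # epsilon(t) = 1{N_min(t) < Upsilon(t)}
    b = rng.random() < min(1.0, 3 * U * math.log(t) ** 2 / t)
    assign = {}
    if explore and b:
        for k in range(K):                        # Explore: random backlogged class per server
            assign[k] = rng.choice(busy)
    else:
        remaining = list(q)                       # Exploit: greedy stand-in for Max-Weight
        for k in range(K):
            cand = [u for u in range(U) if remaining[u] > 0]
            if not cand:
                break
            u = max(cand, key=lambda v: costs[v] * mu_hat[v][k])
            assign[k] = u
            remaining[u] -= 1
    return assign
```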
Multi Server Case: A Conditional Explore Algorithm (2/2)
(Conditional Explore cµ̂ Algorithm repeated from the previous slide.)
Key point: Can show that with sufficiently high probability, algorithm
does not explore after sufficient time elapses
Every link has a constant probability of being scheduled in a busy cycle
Multi Server Case: The Main Result
O(1) Holding Cost Regret
For any (λ, µ) such that the cµ rule with known parameters has exponential tails (satisfies stability), the holding cost regret with the Conditional Explore cµ̂ Algorithm is O(1), i.e. independent of time.
Takeaway: Explore strategies different from traditional bandit settings
Intuition: Asymptotic free-learning in a queueing job/task allocation
setting
Conclusion
Queues + Bandits over finite time horizons
Phase transition in queue-regret
Initially increases logarithmically over time
Asymptotically goes down as 1/t
Learning based variants of the cµ rule
Conditional Explore to cut off asymptotic explore
Free-learning in queueing + learning
Holding cost regret does not scale with time
References
“Regret of Queueing Bandits”, S. Krishnasamy, R. Sen, R. Johari and
S. Shakkottai. Proceedings of the Thirtieth Annual Conference on
Neural Information Processing Systems (NIPS), Barcelona, Spain,
December 2016. Available at: https://arxiv.org/abs/1604.06377
“On Learning the c mu Rule: Single and Multiserver Settings”, S.
Krishnasamy, A. Arapostathis, R. Johari and S. Shakkottai, UT Austin
Technical Report, February 2018. Available at:
https://arxiv.org/abs/1802.06723