1. Regret of Queueing Bandits
Sanjay Shakkottai
Department of Electrical and Computer Engineering
The University of Texas at Austin
Joint with Subhashini Krishnasamy, Rajat Sen, Ari Arapostathis (UT Austin); Ramesh Johari (Stanford Univ.)
SAVES Meeting
April 10, 2018
2. Motivation (1/3)
Stream of multiple types of tasks (jobs)
Multiple agents (servers) with varying task-dependent expertise
Match (schedule) tasks to agents
Dynamic decision-making problem, because the number of tasks changes with time based on past decisions
Queueing Models: Rich history of such decision making through queueing and scheduling, for various performance metrics
[Figure: tasks arriving to Queues 1, …, U, matched to agents/servers 1, …, K with type-dependent service-rate vectors, e.g. μ1 = (μ11 μ12 … μ1U)]
3. Motivation (2/3)
Emerging Setting: Agent and task characteristics unknown
Joint online learning and dynamic optimization
Online Learning: Learn agent and task characteristics/statistics; formally, learn the task-dependent service rates of the agents
Dynamic Optimization: Using the learned statistics, iteratively optimize to achieve performance goals
Applications: Online service systems (Uber, Lyft, Airbnb, Upwork); scheduling in wireless networks; crowdsourced task allocation for human-machine systems
4. Motivation, Questions and Approach (3/3)
How well do we need to learn the statistics?
What is the time-scale of learning?
How many resources are needed for learning?
Algorithms for joint online learning and optimization?
Bandit Approach: Rich history of online learning and optimization through bandits and regret
5. Bandit Overview
[Figure: K bandit arms, Arm 1, …, Arm K, with means μ1, …, μK]
6. Bandit Overview (1/3)
Multi-armed Bandit: K arms; each arm returns a random Bernoulli reward when played
Can play one arm at each (discrete) time t
Associate a random variable X_i(t) with arm i, with P(X_i(t) = 1) = µ_i
WLOG 1 > µ_1 > µ_2 ≥ … ≥ µ_K > 0
Reward: Accumulate reward at time t if the chosen arm returns '1'
Key Question: Suppose {µ_i}_{i=1}^K are unknown. Which arm should we play at each time to maximize the expected cumulative reward?
7. Bandit Overview (2/3)
As we play arms over time, we learn the values of {µ_i} (with varying reliabilities)
Explore vs. Exploit: At time t, should we play unknown arms (explore, to discover the arm with maximum µ_i) OR play the best known arm (exploit past information)?
Applications – optimizing while learning: online advertising, drug trials, wireless spectrum probing/sharing, finance, ...
8. Bandit Overview (3/3)
Policy π plays a sequence of arms {i_1, i_2, …} over time
Arm selection can depend on all past arm selections and reward observations
Regret of a policy, R(t): the expected accumulated loss of reward relative to a genie that knows the best arm (i.e., the genie knows {µ_i}):
R(t) = t·µ_1 − E[ Σ_{s=1}^t X_{i_s}(s) ]
Key Results (Lai and Robbins; Auer, Cesa-Bianchi and Fischer)
1. R(t) scales as K log(t)
2. Simple algorithms with finite-time upper and lower regret bounds
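As a concrete illustration of these results, here is a minimal UCB1 sketch (in the spirit of Auer, Cesa-Bianchi and Fischer) on Bernoulli arms; the arm means and horizons below are illustrative choices, not values from the talk.

```python
# Minimal UCB1 sketch on Bernoulli arms (illustrative parameters).
import math
import random

def ucb1_regret(mus, horizon, seed=0):
    rng = random.Random(seed)
    K, best = len(mus), max(mus)
    counts, means = [0] * K, [0.0] * K
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1  # play each arm once to initialize estimates
        else:
            # UCB index: empirical mean plus a confidence bonus
            arm = max(range(K),
                      key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < mus[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        regret += best - mus[arm]  # expected regret increment
    return regret

# Regret grows roughly like K log(t):
print([round(ucb1_regret([0.7, 0.5, 0.4, 0.3], T), 1) for T in (1000, 10000, 100000)])
```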
9. Part 1: Queue Regret for Server Selection/Matching
[Figure: a single queue of arrivals served by one of K agents/servers with rates μ1, …, μK]
10. Queueing + Bandits
Arms as servers; a departure from the queue occurs if the reward equals '1'
Bernoulli job arrivals at rate λ ∈ (0, 1); jobs are backlogged in the queue until served
Genie is stable: λ < µ_1
The bandit algorithm schedules a server ('plays an arm') whenever the queue is backlogged
Applications: Online service systems (Uber, Lyft, Airbnb, Upwork); financial markets (limit order books); communication networks, ...
11. Queue-Regret
[Figure: queue-length sample path over time, marking a regenerative cycle]
Q(t): queue length at time t under the bandit algorithm
Q*(t): queue length under the "genie" policy, which always schedules the best server (here, server '1')
Ψ(t) is the queue-regret:
Ψ(t) := E[Q(t) − Q*(t)]
Interpretation: Ψ(t) is the traditional regret, with the caveat that reward is accumulated only when the queue is backlogged
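A toy Monte-Carlo estimate can make this definition concrete. The sketch below couples the bandit and genie queues through common arrival and service randomness (so Q(t) ≥ Q*(t) path-wise) and uses a uniform-random scheduler as a stand-in for the bandit policy; all parameters are illustrative.

```python
# Toy Monte-Carlo estimate of queue-regret Psi(t) = E[Q(t) - Q*(t)].
import random

def queue_regret(mus, lam, horizon, runs=500, seed=0):
    rng = random.Random(seed)
    best = max(mus)
    psi = [0.0] * horizon
    for _ in range(runs):
        q = q_star = 0
        for t in range(horizon):
            a = 1 if rng.random() < lam else 0  # common arrival draw
            u = rng.random()                    # common service draw
            arm = rng.randrange(len(mus))       # stand-in for the bandit policy
            if q + a > 0 and u < mus[arm]:      # serve only if backlogged
                q = q + a - 1
            else:
                q = q + a
            if q_star + a > 0 and u < best:     # genie always uses the best server
                q_star = q_star + a - 1
            else:
                q_star = q_star + a
            psi[t] += (q - q_star) / runs
    return psi

psi = queue_regret([0.7, 0.5, 0.4], lam=0.55, horizon=1000)
print(psi[99], psi[999])
```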
12. The Bandit vs. the Queueing Viewpoint
Q(t) = Σ_{s=1}^t (A(s) − D(s)): the queue length equals cumulative arrivals minus cumulative departures
Q*(t) = Σ_{s=1}^t (A(s) − D*(s))
Ψ(t) = E[ Σ_{s=1}^t (D*(s) − D(s)) ]: the accumulated difference in service; in bandit terms, this is the difference in accumulated rewards
Bandit Viewpoint: Regret increases over time: Ψ(t) ≤ R(t) ~ K log(t)
Queueing Viewpoint: As t → ∞, in steady state E[Q(t) − Q*(t)] ~ 0
Key Question: How do we bridge these two different viewpoints?
13. Intuition – Bridging these Viewpoints
[Figure: queue-regret Ψ(t) vs. t, with an early stage where Ψ(t) grows (between Ω(log t / log log t) and O(log³ t)) and a late stage where it decays (between Ω(1/t) and O(log³ t / t)); inset: a single queue served by K servers]
Over time, we (approximately) learn the values of {µ_i}
Eventually, we learn "well enough" that the "effective service rate" exceeds λ (the arrival rate)
The queue length hits zero periodically ⟹ the sample-path queue-regret "resets" at these epochs!
Takeaway: We should anticipate a phase transition in queue-regret behavior
14. Main Results – The Late Stage (1/3)
The queue length hits zero infinitely often; at these epochs the sample-path regret "resets"
Queue-regret is approximately a (discrete) derivative of the cumulative bandit regret
Since the optimal cumulative regret scales like log(t), asymptotically the optimal queue-regret should scale like 1/t
Takeaway: Order-wise matching upper and lower bounds showing O(1/t) behavior
15. Main Results – The Early Stage (2/3)
Early on, we still cannot (even approximately) identify the best server
The expected service rate is smaller than the arrival rate λ
The queue is continuously backlogged, so queue-regret behaves like bandit regret
Takeaway: Order-wise matching upper and lower bounds showing O(log(t)) behavior
16. Main Results – The Transition (3/3)
The time to switch scales at least as t = Ω(K/ε), where ε = (µ_1 − λ) is the gap between the best service rate and the arrival rate
Transition analysis through a heavily loaded setting as ε → 0
Scale K and ε; demonstrate an algorithm with queue-regret O(poly(log t)/(ε²t)) for times arbitrarily close to Ω(K/ε)
Takeaway: The phase-transition time scales as K/ε. Smaller ε makes it harder to learn the optimal server, and pushes out the phase-transition time.
17. Implications
Scheduling: Much of the scheduling literature focuses on steady-state or long-time-scale behavior (e.g., Lyapunov arguments)
With emerging systems (online matching markets, wireless systems), short-time behavior is equally important
In online service systems, the number of jobs per customer might reach steady state only after a long time
In wireless 5G, there is much more flux between base stations due to densification
18. Related Work
Bandits and Cumulative Regret: A vast literature starting with Lai & Robbins 1985, and UCB (finite-time bounds and a simple algorithm) by Auer, Cesa-Bianchi & Fischer 2002; see Bubeck and Cesa-Bianchi 2012 for a survey
Bandits and Queues: A rich history focusing on infinite-horizon costs and the optimality of index policies (Gittins index 1979): Cox & Smith 1961; Buyukkoc, Varaiya and Walrand 1985; Van Mieghem 1995; Lott & Teneketzis 2000; Mahajan & Teneketzis 2008; Nino-Mora 2006
19. Achieving the Bounds: ε-Greedy Thompson Sampling
[Figure: timeline of exploit steps interleaved with occasional explore steps]
Bandit algorithms trade off between explore and exploit steps
ε_t-Greedy: With probability ε_t, choose a server uniformly at random; otherwise use Thompson sampling (here, ε_t = 3K log²(t)/t)
Thompson Sampling: A sampling and Bayesian-update algorithm to model and update {µ_i}_{i=1}^K
Jointly used both to update the "belief" about the best arm and to determine the next arm to sample
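A sketch of one Q-ThS scheduling decision, combining the ε_t exploration coin with a Thompson-sampling exploit step; the Beta posteriors match the primer two slides ahead, and the success/failure tallies are assumed bookkeeping.

```python
# One scheduling decision of Q-ThS (epsilon_t-greedy Thompson sampling), sketched.
import math
import random

def qths_choose(t, succ, fail, rng):
    """succ[i]/fail[i]: observed '1's/'0's for server i; t >= 2 is the time step."""
    K = len(succ)
    eps_t = min(1.0, 3 * K * math.log(t) ** 2 / t)  # exploration probability
    if rng.random() < eps_t:
        return rng.randrange(K)                     # structured exploration
    # Exploit: Thompson sample from each Beta(succ+1, fail+1) posterior
    samples = [rng.betavariate(succ[i] + 1, fail[i] + 1) for i in range(K)]
    return max(range(K), key=lambda i: samples[i])

rng = random.Random(0)
print(qths_choose(t=100, succ=[40, 20, 10], fail=[20, 20, 30], rng=rng))
```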
20. ε-Greedy Structured Exploration
Explore Step: The ε_t-Greedy algorithm provides structured exploration
ε_t-Greedy: With probability ε_t, choose a server uniformly at random; otherwise use Thompson sampling (here, ε_t = 3K log²(t)/t)
Ensures that a poly-logarithmic amount of time is used for "pure" learning
Provides high-probability upper bounds on the number of sub-optimal schedules in the exploit steps
21. A Primer on Thompson Sampling
[Figure: Beta posterior pdfs β(1,1), β(3,2), β(10,4), β(100,34), concentrating as observations accumulate]
Model µ_i as a random variable Q_i for each i ∈ [K]
Initially, a Uniform[0, 1] prior distribution for each Q_i
Sample each distribution; choose the arm/server with the largest sample
Update the sampled arm's distribution (Bayesian posterior distribution) based on the {0, 1} observations
Observations until time t: arm i has seen A_i(t) '1's and B_i(t) '0's
Conjugate Prior: The posterior distribution of Q_i given the observations is Beta(A_i(t) + 1, B_i(t) + 1)
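The conjugate update is a one-liner in code. This sketch runs pure Thompson sampling on illustrative Bernoulli arms and shows the posterior counts concentrating on the best arm.

```python
# Thompson sampling with Beta-Bernoulli conjugate updates (illustrative arms).
import random

def thompson_step(succ, fail, mus, rng):
    K = len(mus)
    # Sample each posterior Beta(A_i + 1, B_i + 1); play the arm with largest sample
    samples = [rng.betavariate(succ[i] + 1, fail[i] + 1) for i in range(K)]
    arm = max(range(K), key=lambda i: samples[i])
    # Bayesian update: increment the success or failure count of the played arm
    if rng.random() < mus[arm]:
        succ[arm] += 1
    else:
        fail[arm] += 1

rng = random.Random(0)
succ, fail = [0, 0, 0], [0, 0, 0]
for _ in range(5000):
    thompson_step(succ, fail, [0.7, 0.5, 0.4], rng)
print(succ, fail)  # most plays concentrate on the best arm
```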
22. Achievability in the Late Stage
Theorem
Consider any problem instance (λ, µ). Then
Ψ(t) = O( K log³(t) / (ε²t) )
for all t large enough (precise bounds available).
K = number of servers/arms
ε = (µ_1 − λ): the gap between the best service rate and the arrival rate
Scaling of K and ε (Heavy-Load Scaling): For any β ∈ (0, 1), there exists a scaling of K with ε such that the queue-regret scales as O(poly(log t)/(ε²t)) for all t > (K/ε)^β
23. Sketch of Proof: Achievability in the Late Stage (1/3)
Main Challenge: Coupled Cycles
Queues usually go through independent regenerative cycles, BUT HERE ...
The queue-length evolution depends on the past history of bandit arm schedules (the cycles are coupled by the bandit)
Our Approach:
1. A high-probability bound on the number of sub-optimal schedules
2. Bounds on the length of the regenerative cycle
24. Sketch of Proof: Achievability in the Late Stage (2/3)
Sub-optimal Schedules: Structured exploration via ε_t-Greedy shows that all servers, including the sub-optimal ones, are sampled a sufficiently large number of times
Ensures that the algorithm schedules the correct link in the exploit phase in the late stage with high probability
Busy Cycle of the Queue: A coarse high-probability upper bound on the queue length ⟹ a coarse upper bound on the busy cycle
Recursive Bound: Use the above bound to get tighter bounds on the queue length and, in turn, on the start of the current regenerative cycle
25. Sketch of Proof: Bounding the Busy Cycle (3/3)
Bandit System: Bandit bounds suggest that the expected number of sub-optimal arm pulls up to time t is bounded by log(t)
Queueing System: There is a linear gap between the arrival rate and the best service rate (the cumulative gap scales linearly in t)
The combination implies that, even in the "worst case", the current regenerative cycle cannot extend too far into the past
Use this as a first-cut bound on the busy cycle; use this bound to bound the queue length; and use the queue-length bound in turn to sharpen the busy-cycle bound
26. Converse in the Late Stage
Theorem
For any problem instance (λ, µ) and any "reasonable" policy, the queue-regret Ψ(t) satisfies
Ψ(t) ≥ (λ/4) · D(µ) · (1 − α)(K − 1) · (1/t)
for infinitely many t, where D(µ) = ∆ / KL(µ_min, (µ* + 1)/2).
λ = arrival rate
∆ = rate gap between the best and second-best server/arm
α ∈ (0, 1) characterizes a "reasonable" policy (formally, α-consistency in the bandit literature)
27. Sketch of Proof: Converse in the Late Stage
Sample-Path Coupling: Construct a new bandit system (Bandit-Alt) for which:
(a) the queue-regret is unchanged from the original bandit system;
(b) the queue length of the genie system is (sample-path-wise) smaller than in Bandit-Alt
Bandit Bounds: (Roughly) use a one-time-step argument to show that the probability of using the wrong server is O(1/t) for Bandit-Alt
The actual argument is more delicate, as one-step bounds are not attainable
Show time-averaged bounds using bandit measure-change arguments, and use pigeonholing to get infinitely-often bounds on the queue-regret
28. Numerical Results (1/2)
[Figure: Ψ(t) vs. t for ε ∈ {0.05, 0.1, 0.15}, showing the phase transition shifting right as ε decreases]
A system with 5 servers and ε ∈ {0.05, 0.1, 0.15}
The phase-transition point shifts to the right as ε decreases
29. Numerical Results (2/2)
[Figure: Ψ(t) vs. t for Q-ThS (explore prob. 3K log²(t)/t), Q-UCB, UCB-1, Thompson sampling, and Q-ThS (explore prob. 0.4K log²(t)/t)]
Comparison of the queue-regret performance of Q-ThS, Q-UCB, UCB-1 and Thompson sampling
A 5-server system with ε = 0.15 and ∆ = 0.17
30. Algorithm Design Questions
Learning with Queues: Should we explore more aggressively in the initial stages, because regret "resets" and we do not pay an "asymptotic" penalty?
Learning and Matching: More complex resource-allocation tasks such as matching
Interactions with multiple users/queues
Low-dimensional structure across users and servers
Learning with Agent Dynamics: Agents/servers change over time (agents arrive and depart)
How much should we "trust" the current "learned" agents, and how much should we explore new agents?
31. Part 2: Holding-Cost Regret for Multi-Class Systems
[Figure: multi-class queueing diagram – tasks in Queues 1, …, U served by K agents/servers with type-dependent rates, e.g. μ1 = (μ11 μ12 … μ1U)]
32. System Model: Multi-Class Queues
K servers and U queues (job classes)
Bernoulli job arrivals at rate λ_u ∈ (0, 1) to class u ∈ U
Service-rate matrix (µ_uk, 1 ≤ u ≤ U, 1 ≤ k ≤ K)
µ_uk ∈ (0, 1) is the unknown service rate (success probability) of server k for a job of type u
Models servers/agents whose service rate depends on the type of job/task
33. System Model: Holding Costs
Holding cost: the expected total waiting cost over a finite time T:
J(T) := E[ Σ_{t=1}^T β^t Σ_{i=1}^U c_i Q_i(t) ]
β ∈ (0, 1] is a discount factor (useful when considering T → ∞)
c_k > 0 is the waiting-time cost for queue class k
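Given a trace of queue lengths, J(T) is a direct sum; here is a minimal sketch, assuming Q[t][i] holds the length of queue i at step t+1 (the trace, costs, and discount below are illustrative).

```python
# Discounted holding cost J(T) from a queue-length trace (illustrative data).
def holding_cost(Q, c, beta=1.0):
    # Sum over t of beta^t * sum_i c_i * Q_i(t), with t starting at 1.
    return sum(beta ** t * sum(ci * qi for ci, qi in zip(c, Qt))
               for t, Qt in enumerate(Q, start=1))

print(holding_cost(Q=[[2, 1], [1, 0], [0, 1]], c=[1.0, 2.0], beta=0.95))
```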
34. The cµ Rule
Algorithm 1: The cµ Algorithm with costs {c_k} and rates {µ_k}
At time t, choose job-server pairs according to the Max-Weight rule with weights given by {c_i µ_ij} (the product of the waiting cost and the success probability)
Single-Server Case: Serve the non-empty queue with the largest c_u µ_u, u = 1, 2, …, U
Buyukkoc, Varaiya and Walrand 1985: The single-server cµ rule is holding-cost optimal for any β ∈ (0, 1] and any T > 0
[Figure: Queues 1, …, U served by a single server with rate vector μ1 = (μ11 μ12 … μ1U)]
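A minimal sketch of the single-server cµ decision with known rates (the queue lengths, costs, and rates below are illustrative):

```python
# Single-server c-mu rule: serve the non-empty queue with the largest c_u * mu_u.
def cmu_choose(queues, c, mu):
    nonempty = [u for u, q in enumerate(queues) if q > 0]
    if not nonempty:
        return None  # idle when all queues are empty
    return max(nonempty, key=lambda u: c[u] * mu[u])

# Queue 0 wins: the c*mu products are (1.2, 0.9, 1.05) and queue 1 is empty.
print(cmu_choose(queues=[3, 0, 5], c=[2.0, 1.0, 1.5], mu=[0.6, 0.9, 0.7]))
```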
35. Background: A History of the cµ Rule
The single-server system (with known statistics):
Optimal expected holding costs for finite/infinite horizon (Cox and Smith 1961; Buyukkoc, Varaiya and Walrand 1985)
Asymptotically optimal for convex costs in heavy traffic (Van Mieghem 1995)
The multi-server system – optimal policies for infinite horizon and heavy traffic (with known statistics):
'N', 'W' networks – Harrison 1998; Bell and Williams 2001
Homogeneous servers – Lott and Teneketzis 2000; Glazebrook and Nino-Mora 2001
Generalized cµ rule (with convex costs, i.e., c = c(q) convex) – Mandelbaum and Stolyar 2004
36. Single Server Case: Main Result (1/2)
Service rates unknown – estimate them using observed samples from past allocation decisions, and use these estimates {µ̂_u}
Unlike bandit algorithms, there is no explicit explore step forcing the learning of server rates
J*(T) is the holding cost under the cµ rule
J(T) is the holding cost under the empirical cµ̂ rule
Main Result: Constant Holding-Cost Regret
J(T) − J*(T) = O(1), and does not depend on T
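A sketch of the empirical cµ̂ decision: the same priority rule, with the true rates replaced by plug-in estimates from past service attempts (the tallies below are assumed bookkeeping, not values from the talk).

```python
# Empirical c-mu-hat rule: plug estimated service rates into the c-mu priority.
def cmuhat_choose(queues, c, successes, attempts):
    mu_hat = [s / a if a > 0 else 0.0 for s, a in zip(successes, attempts)]
    nonempty = [u for u, q in enumerate(queues) if q > 0]
    if not nonempty:
        return None
    return max(nonempty, key=lambda u: c[u] * mu_hat[u])

print(cmuhat_choose(queues=[3, 0, 5], c=[2.0, 1.0, 1.5],
                    successes=[6, 9, 7], attempts=[10, 10, 10]))  # -> 0
```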
37. Single Server Case: Main Result (2/2)
Intuition: Busy cycles are identical for all work-conserving policies
Intuition: Within each cycle, all jobs need to be scheduled by any policy (in some order); this implies a sufficient number of server "samples"
Stability (busy cycles are sample-path identical) + server samples give "free" exploration
Implication: The learned priority order of the queues "couples" with the genie's by time τ = O(log t), with probability at least 1 − 1/t³
Explore-Free System: No "random exploration" regret is incurred, unlike in typical bandit systems
38. Multi Server Case: Instability
Stability of the multi-server cµ rule was previously unknown
New Result: In general, the cµ rule is unstable
Queue 1 has strict priority: c_1 µ_1j > c_2 µ_2j, j = 1, 2
Server 1 has the higher rate: µ_11 > µ_12
π_1: stationary distribution of Q_1
[Figure: two queues with arrival rates λ1, λ2, served by two servers with rates μ11, μ12, μ21, μ22]
Result: Q_2 is strongly unstable
If λ_2 > π_1(0)µ_21 + π_1({0, 1})µ_22, then there exist positive constants b_0, b_1, t_0 such that for all t ≥ t_0,
P(Q_2(t) < b_0·t) ≤ exp(−b_1·t)
39. Multi Server Case: Sufficient Conditions for Stability
Stability and Tail Bounds on Busy Cycles:
λ · α < min_{q ∈ Q_K} (R(q) · α), for some α > 0, α ∈ P_U,
where P_U is the probability simplex and Q_l := {q ∈ Z_+^U : ||q||_1 = l} for l ∈ Z_+
R(q) is the vector of service rates to the queues as a function of the queue lengths
In addition, the condition leads to exponential tails on busy cycles – via drift analysis from Hajek 1982
In limiting cases, it provides close to the complete stability region
40. Multi Server Case: A Conditional Explore Algorithm (1/2)
Algorithm 2: Conditional Explore cµ̂ Algorithm
At time t:
  ε(t) ← 1{N_min(t) < Υ(t)}
  B(t) ← independent Bernoulli sample with mean min{1, 3U log²(t)/t}
  if ε(t) ∧ B(t) = 1 then
    Explore: schedule from E uniformly at random
  else
    Exploit: schedule according to the cµ rule with parameters µ̂(t)
  end if
N_min(t) = min_{i,j} N_ij(t); Υ(t) = polylog(t)
Intuition: Explore only if the (worst-case) number of samples is sub-logarithmic with respect to time
The algorithm initially explores aggressively, but exploration falls off quickly
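A sketch of the explore/exploit gate above; the specific polylog threshold Υ(t) and the small-t guards are illustrative choices, and N[u][k] is assumed bookkeeping that counts observed service attempts for each (queue, server) pair.

```python
# Explore/exploit gate of the Conditional Explore c-mu-hat algorithm (sketch).
import math
import random

def should_explore(t, N, U, rng):
    n_min = min(min(row) for row in N)   # worst-case sample count N_min(t)
    upsilon = math.log(t + 1) ** 2       # assumed polylog threshold Upsilon(t)
    eps = n_min < upsilon                # explore only if some pair is under-sampled
    b = rng.random() < min(1.0, 3 * U * math.log(t + 1) ** 2 / max(t, 1))
    return eps and b                     # True => schedule uniformly at random from E

rng = random.Random(0)
N = [[1, 0], [2, 1]]  # hypothetical counts for 2 queues x 2 servers
print(should_explore(t=10, N=N, U=2, rng=rng))
```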
41. Multi Server Case: A Conditional Explore Algorithm (2/2)
(Algorithm 2 repeated from the previous slide.)
Key point: One can show that, with sufficiently high probability, the algorithm does not explore after sufficient time has elapsed
Every link has a constant probability of being scheduled within a busy cycle
42. Multi Server Case: The Main Result
O(1) Holding-Cost Regret: For any (λ, µ) such that the cµ rule with known parameters has exponential busy-cycle tails (i.e., satisfies the stability condition), the holding-cost regret of the Conditional Explore cµ̂ Algorithm is O(1), i.e., independent of time
Takeaway: The explore strategy differs from traditional bandit settings
Intuition: Asymptotically free learning in a queueing job/task-allocation setting
43. Conclusion
Queues + bandits over finite time horizons
Phase transition in queue-regret: initially increases logarithmically over time; asymptotically goes down as 1/t
Learning-based variants of the cµ rule: Conditional Explore cuts off asymptotic exploration
Free learning in queueing + learning: holding-cost regret does not scale with time
44. References
"Regret of Queueing Bandits", S. Krishnasamy, R. Sen, R. Johari and S. Shakkottai. Proceedings of the Thirtieth Annual Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, December 2016. https://arxiv.org/abs/1604.06377
"On Learning the cµ Rule: Single and Multiserver Settings", S. Krishnasamy, A. Arapostathis, R. Johari and S. Shakkottai. UT Austin Technical Report, February 2018. https://arxiv.org/abs/1802.06723