Approximate Dynamic Programming using Fluid and Diffusion Approximations with Applications to Power Management (CDC 2009, Shanghai)

Speaker: Dayu Huang

Wei Chen, Dayu Huang, Ankur A. Kulkarni¹, Jayakrishnan Unnikrishnan, Quanyan Zhu,
Prashant Mehta, Sean Meyn, and Adam Wierman²

Coordinated Science Laboratory, UIUC
¹ Dept. of IESE, UIUC
² Dept. of CS, California Inst. of Tech.
[Title-slide figures: left panel plotted versus iteration n (×10^4); right panel, J versus x]

Supported by the National Science Foundation (ECS-0523620 and CCF-0830511), ITMANET DARPA RK 2006-07284, and Microsoft Research.

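Introduction
MDP model: a controlled Markov chain with state $X(t)$ and control $U(t)$, driven by an i.i.d. disturbance
Cost: $c(x,u)$; objective: minimize the average cost
Generator: $D_u h(x) = \mathsf{E}[h(X(t+1)) - h(X(t)) \mid X(t) = x,\ U(t) = u]$
Average Cost Optimality Equation (ACOE): $\min_u \{ c(x,u) + D_u h^*(x) \} = \eta^*$
Relative value function: $h^*$ (with $\eta^*$ the optimal average cost)
Solve the ACOE and find $h^*$; a policy attaining the minimum in the ACOE is optimal.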
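For a small model the ACOE can be solved directly. Below is a minimal sketch of relative value iteration, assuming a hypothetical truncated queue $X(t+1) = \min(X(t) - U(t) + A(t+1), N)$ with $0 \le U(t) \le X(t)$, i.i.d. arrivals $A \in \{0,1,2\}$ and cost $c(x,u) = x + \beta u^2$. The truncation level `N`, the arrival law and `BETA` are illustrative choices, not values from the talk. The sketch iterates $h \leftarrow Th - (Th)(0)$ with $Th(x) = \min_u\{c(x,u) + \mathsf{E}[h(X(t+1))]\}$, so at the fixed point $h$ solves the ACOE with $h(0) = 0$ and $\eta = (Th)(0)$.

```python
import numpy as np

# Hypothetical model for illustration only (not parameters from the talk):
# X(t+1) = min(X(t) - U(t) + A(t+1), N), 0 <= U(t) <= X(t), cost x + BETA * u**2.
N = 30
BETA = 1.0
ARRIVAL_VALUES = np.array([0, 1, 2])
ARRIVAL_PROBS = np.array([0.3, 0.4, 0.3])


def cost(x, u):
    """Delay-plus-energy style cost c(x, u) = x + BETA * u^2 (assumed form)."""
    return x + BETA * u**2


def expected_next(h, x, u):
    """E[h(X(t+1)) | X(t) = x, U(t) = u] under the truncated queue dynamics."""
    nxt = np.minimum(x - u + ARRIVAL_VALUES, N)
    return float(np.dot(ARRIVAL_PROBS, h[nxt]))


def relative_value_iteration(tol=1e-9, max_iter=10_000):
    """Iterate h <- Th - (Th)(0); at the fixed point h solves the ACOE with h(0) = 0."""
    h = np.zeros(N + 1)
    for _ in range(max_iter):
        # Th(x) = min_u {c(x, u) + E[h(X(t+1))]}, minimising over 0 <= u <= x.
        th = np.array([
            min(cost(x, u) + expected_next(h, x, u) for u in range(x + 1))
            for x in range(N + 1)
        ])
        eta = th[0]        # estimate of the optimal average cost eta*
        h_new = th - eta   # normalise so that h(0) = 0
        if np.max(np.abs(h_new - h)) < tol:
            return h_new, eta
        h = h_new
    raise RuntimeError("relative value iteration did not converge")


h_star, eta_star = relative_value_iteration()
print(f"average cost ~ {eta_star:.4f}, h*(10) ~ {h_star[10]:.2f}")
```

Enumerating every state and action like this is exactly what the curse of dimensionality rules out once the state space has several dimensions.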

TD Learning
The “curse of dimensionality”: the complexity of solving the ACOE grows exponentially with the dimension of the state space.

Approximate the relative value function $h^*$ within a finite-dimensional function class: $h^\theta(x) = \sum_{i=1}^{d} \theta_i \psi_i(x)$

Criterion: minimize the mean-square error between $h^\theta$ and $h^*$, solved by stochastic approximation algorithms

Problem: how to select the basis functions $\psi_i$? This choice is key to the success of TD learning.
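A minimal sketch of fitting $h^\theta = \sum_i \theta_i \psi_i$ from one simulated trajectory under a fixed policy. It swaps the stochastic-approximation recursion for its least-squares counterpart, LSTD(0) for average cost, which solves in one linear step the fixed-point equation $\mathsf{E}[\psi(X_t)(c_t - \eta + h^\theta(X_{t+1}) - h^\theta(X_t))] = 0$ that TD(0) converges to. The queue model, the policy and the basis $\psi(x) = (x, x^{3/2})$ are hypothetical stand-ins. The $x^{3/2}$ term is motivated by the fluid model $\frac{d}{dt}x = -u$ with cost $x + \beta u^2$, whose total-cost value function is $\tfrac43\sqrt{\beta}\,x^{3/2}$.

```python
import numpy as np

# Hypothetical queue, policy and basis; chosen only to illustrate the method.
BETA = 1.0
ARRIVAL_VALUES = np.array([0, 1, 2])
ARRIVAL_PROBS = np.array([0.3, 0.4, 0.3])


def policy(x):
    """Assumed fixed stabilising policy: serve up to 1 + floor(sqrt(x)) jobs."""
    return min(x, 1 + int(np.sqrt(x)))


def basis(xs):
    """Basis psi(x) = (x, x^{3/2}) evaluated at an array of states, shape (len, 2)."""
    xs = np.asarray(xs, dtype=float)
    return np.column_stack((xs, xs**1.5))


def lstd_average_cost(T=50_000, seed=0):
    """LSTD(0) for average cost: solve E[psi_t (psi_t - psi_{t+1})^T] theta = E[psi_t (c_t - eta)]."""
    rng = np.random.default_rng(seed)
    arrivals = rng.choice(ARRIVAL_VALUES, size=T, p=ARRIVAL_PROBS)
    xs = np.zeros(T + 1, dtype=int)
    costs = np.zeros(T)
    for t in range(T):
        u = policy(int(xs[t]))
        costs[t] = xs[t] + BETA * u**2
        xs[t + 1] = xs[t] - u + arrivals[t]
    eta = costs.mean()                       # estimate of the average cost
    psi = basis(xs)                          # features along the trajectory
    A = psi[:-1].T @ (psi[:-1] - psi[1:]) / T
    b = psi[:-1].T @ (costs - eta) / T
    theta = np.linalg.solve(A, b)            # h_theta(x) = theta @ psi(x)
    return theta, eta


theta, eta = lstd_average_cost()
print("theta =", theta, " average cost ~", round(eta, 3))
```

Swapping $x^{3/2}$ for other candidate functions and comparing the fitted $h^\theta$ is one simple way to probe the basis-selection question.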

Approach Based on Fluid and Diffusion Models
This talk: the fluid model

Value function of the fluid model: the total cost for an associated deterministic model

The fluid value function is a tight approximation to the relative value function.
[Figure: fluid value function versus x]

It can be used as part of the basis.

Related Work
Multiclass queueing networks: Meyn 1997, Meyn 1997b

Optimal control: Chen and Meyn 1999

Simulation: Henderson et al. 2003

Network scheduling and routing: Veatch 2004; Moallemi, Kumar and Van Roy 2006

Meyn 2007, Control Techniques for Complex Networks

Other approaches: Tsitsiklis and Van Roy 1997; Mannor, Menache and Shimkin 2005
Taylor series approximation: this work

Power Management via Speed Scaling
Bansal, Kimbrel and Pruhs 2007; Wierman, Andrew and Tang 2009

Single processor: jobs arrive, and the processing rate is determined by the current power.

Control the processing speed to balance delay and energy costs.

Processor design: polynomial cost (Kaxiras and Martonosi 2008; Wierman, Andrew and Tang 2009)
This talk: …
We also consider … for wireless communication applications.

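Fluid Model and MDP

Fluid model: a deterministic counterpart of the MDP, $\frac{d}{dt}x(t) = f(x(t), u(t))$

Total cost: $J^*(x) = \inf_{u(\cdot)} \int_0^\infty c(x(t), u(t))\,dt$, with $x(0) = x$

Total Cost Optimality Equation (TCOE) for the fluid model: $\min_u \{ c(x,u) + \nabla J^*(x) \cdot f(x,u) \} = 0$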
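A sketch under hypothetical assumptions that the slide does not state: fluid dynamics $\frac{d}{dt}x = \alpha - u$ (mean arrival rate $\alpha$), and a quadratic power cost shifted so that the equilibrium $(x,u) = (0,\alpha)$ costs zero, $\bar c(x,u) = x + \beta(u^2 - \alpha^2)$. Minimizing over $u$ in the TCOE gives $u^*(x) = J^{*\prime}(x)/(2\beta)$. Substituting back leaves a quadratic in $J^{*\prime}$ whose relevant root is $J^{*\prime}(x) = 2\beta\alpha + 2\sqrt{\beta x}$, hence $J^*(x) = 2\beta\alpha x + \tfrac43\sqrt{\beta}\,x^{3/2}$. The code checks numerically that this candidate makes the TCOE residual vanish. `ALPHA`, `BETA` and all function names are hypothetical.

```python
import numpy as np

# Hypothetical parameters for illustration only.
ALPHA = 1.0   # mean arrival rate in the assumed fluid model dx/dt = ALPHA - u
BETA = 2.0    # weight on the assumed quadratic power cost


def cbar(x, u):
    """Assumed cost, shifted so that the equilibrium (x=0, u=ALPHA) costs zero."""
    return x + BETA * (u**2 - ALPHA**2)


def J(x):
    """Candidate fluid value function derived from the TCOE for cbar."""
    return 2 * BETA * ALPHA * x + (4.0 / 3.0) * np.sqrt(BETA) * x**1.5


def dJ(x):
    """Derivative J'(x)."""
    return 2 * BETA * ALPHA + 2 * np.sqrt(BETA * x)


def u_star(x):
    """Minimiser of cbar(x, u) + J'(x) * (ALPHA - u); equals ALPHA + sqrt(x / BETA)."""
    return dJ(x) / (2 * BETA)


def tcoe_residual(x, u_grid=np.linspace(0.0, 20.0, 200_001)):
    """min_u {cbar(x, u) + J'(x) (ALPHA - u)}, by brute force over a grid of u."""
    return float(np.min(cbar(x, u_grid) + dJ(x) * (ALPHA - u_grid)))


for x in [0.5, 1.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  J(x) = {J(x):8.3f}  u*(x) = {u_star(x):.3f}  "
          f"residual = {tcoe_residual(x):.2e}")
```

The fluid-optimal rate $u^*(x) = \alpha + \sqrt{x/\beta}$ serves faster than the arrival rate whenever work is queued, and the queue drains in finite time.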

Why the Fluid Model?

A first-order Taylor series approximation of the MDP generator gives the drift term of the fluid model, which turns the ACOE into the TCOE.

Hence the fluid value function almost solves the ACOE. Simple but powerful idea!
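A minimal sketch of the first-order step, assuming a queue-type MDP $X(t+1) = X(t) - U(t) + A(t+1)$ with i.i.d. arrivals of mean $\alpha = \mathsf{E}[A(t)]$, fluid model $\frac{d}{dt}x = \alpha - u$, and a twice-differentiable $h$. The slides do not spell out this model. Taylor's theorem with a second-order remainder gives:

```latex
\begin{aligned}
D_u h(x) &= \mathsf{E}\big[h(x - u + A)\big] - h(x) \\
         &= h'(x)\,(\alpha - u) + \tfrac12\,\mathsf{E}\big[h''(\xi)\,(A - u)^2\big],
            \qquad \xi \text{ between } x \text{ and } x - u + A, \\
         &\approx h'(x)\,(\alpha - u) .
\end{aligned}
```

Substituting the first-order term into the ACOE $\min_u\{c(x,u) + D_u h^*(x)\} = \eta^*$ turns its left-hand side into that of the fluid TCOE $\min_u\{c(x,u) + J^{*\prime}(x)(\alpha - u)\} = 0$. The two equations then differ only by the remainder term and by the constant on the right-hand side.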

Approximation of the Cost Function

Error analysis: is the error term constant?

Surrogate cost: the fluid value function solves the ACOE exactly for a surrogate cost, which approximates the original cost.

Bounds on the error term?
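One way to make "almost solves" precise uses the same assumed queue model: generator $D_u h(x) = \mathsf{E}[h(x-u+A)] - h(x)$, $\alpha = \mathsf{E}[A]$, fluid drift $\alpha - u$. It also assumes that $J^*$ solves the TCOE $\min_u\{c(x,u) + J^{*\prime}(x)(\alpha-u)\} = 0$. The Taylor remainder $\Delta$ and the surrogate cost $\tilde c$ below are our labels, not notation from the slides.

```latex
\begin{aligned}
\Delta(x,u) &:= D_u J^*(x) - J^{*\prime}(x)\,(\alpha - u)
             \;\approx\; \tfrac12\, J^{*\prime\prime}(x)\,\mathsf{E}\big[(A-u)^2\big], \\
\tilde c(x,u) &:= c(x,u) - \Delta(x,u), \\
\tilde c(x,u) + D_u J^*(x) &= c(x,u) + J^{*\prime}(x)\,(\alpha - u)
\;\;\Longrightarrow\;\;
\min_u \big\{ \tilde c(x,u) + D_u J^*(x) \big\} = 0 .
\end{aligned}
```

So $J^*$ solves the ACOE exactly for $\tilde c$, with average cost 0. If $\Delta$ were a constant $\delta$, then $\min_u\{c(x,u) + D_u J^*(x)\} = \delta$, and $J^*$ would solve the original ACOE with $\eta^* = \delta$. This suggests why the question of a constant error term, and bounds on $\Delta$, matter.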

Structure Results on the Fluid Solution

Lower bound

Convexity of the fluid value function

TD Learning Experiment

Basis functions:

[Figure: estimates of the coefficients versus iteration n (×10^4), for the case of quadratic cost (left); approximate relative value function and fluid value function versus x (right)]

TD Learning with Policy Improvement

[Figure: average cost at each stage of policy improvement, stages 0 to 25]

Nearly optimal after just a few iterations
Need the value of the optimal policy
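Below is a minimal sketch of policy improvement in which TD evaluation is replaced by exact evaluation, so the loop is easy to check. Each stage solves the Poisson equation $h + \eta = c_\pi + P_\pi h$ (with $h(0) = 0$) for the current policy and then acts greedily, $u(x) = \arg\min_u\{c(x,u) + \mathsf{E}[h(X(t+1))]\}$. The truncated queue, the arrival law, `BETA` and the initial policy are hypothetical. In the talk's setting the exact evaluation step would be replaced by TD learning with a fluid-based basis.

```python
import numpy as np

# Hypothetical truncated queue X(t+1) = min(X(t) - U(t) + A(t+1), N), cost x + BETA * u^2.
N = 30
BETA = 1.0
ARRIVAL_VALUES = np.array([0, 1, 2])
ARRIVAL_PROBS = np.array([0.3, 0.4, 0.3])


def cost(x, u):
    return x + BETA * u**2


def transition_matrix(policy):
    """P[x, y] = probability of moving from x to y under the stationary policy."""
    P = np.zeros((N + 1, N + 1))
    for x in range(N + 1):
        for a, p in zip(ARRIVAL_VALUES, ARRIVAL_PROBS):
            P[x, min(x - policy[x] + a, N)] += p
    return P


def evaluate(policy):
    """Solve the Poisson equation (I - P) h + eta = c_policy with h(0) = 0."""
    P = transition_matrix(policy)
    c = np.array([cost(x, policy[x]) for x in range(N + 1)], dtype=float)
    M = np.zeros((N + 1, N + 1))
    M[:, :N] = (np.eye(N + 1) - P)[:, 1:]   # columns for h(1), ..., h(N)
    M[:, N] = 1.0                           # column for eta
    sol = np.linalg.solve(M, c)
    return sol[N], np.concatenate(([0.0], sol[:N]))


def improve(h):
    """Greedy policy: u(x) = argmin over 0 <= u <= x of c(x, u) + E[h(X(t+1))]."""
    def q(x, u):
        return cost(x, u) + ARRIVAL_PROBS @ h[np.minimum(x - u + ARRIVAL_VALUES, N)]
    return np.array([min(range(x + 1), key=lambda u: q(x, u)) for x in range(N + 1)])


def policy_iteration(n_stages=5):
    policy = np.minimum(np.arange(N + 1), 2)   # assumed initial policy: serve up to 2 jobs
    etas = []
    for _ in range(n_stages):
        eta, h = evaluate(policy)
        etas.append(eta)
        policy = improve(h)
    return etas, policy


etas, policy = policy_iteration()
print("average cost by stage:", np.round(etas, 4))
```

For a unichain model each improvement stage cannot increase the average cost, which the recorded `etas` make easy to verify.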

Conclusions

The fluid value function can be used as part of the basis for TD learning.

Motivation from a Taylor series analysis: the fluid value function almost solves the ACOE. In particular, it solves the ACOE for a slightly different cost function, and the error term can be estimated.

In experiments, TD learning with policy improvement gives a near-optimal policy within a few iterations.

Application to power management for processors.

Value Iteration (Chen and Meyn 1999)
[Figure: value iteration versus iteration n, for initializations V0 ≡ 0 and V0 = …]

Policy
[Figure: stochastic optimal policy, myopic policy, and their difference, versus x]
