Application of Reinforcement Learning in Network Routing By Chaopin Zhu
Machine Learning: supervised learning, unsupervised learning, reinforcement learning.
Supervised Learning. Feature: learning with a teacher. Phases: training phase, testing phase. Applications: pattern recognition, function approximation.
Unsupervised Learning. Feature: learning without a teacher. Applications: feature extraction, other preprocessing.
Reinforcement Learning. Feature: learning with a critic. Applications: optimization, function approximation.
Elements of Reinforcement Learning: agent, environment, policy, reward function, value function, model of the environment (optional).
Reinforcement Learning Problem
Markov Decision Process (MDP). Definition: a reinforcement learning task that satisfies the Markov property. Specified by its transition probabilities.
An Example of MDP
Markov Decision Process (cont.): parameters, value functions.
Elementary Methods for the Reinforcement Learning Problem: dynamic programming, Monte Carlo methods, temporal-difference learning.
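For reference, a hedged restatement of the standard MDP parameters and value functions in the notation of [1], with x for states, a for actions, γ for the discount factor, and π for a policy:

P^a_{xx'} = \Pr\{x_{t+1} = x' \mid x_t = x,\ a_t = a\}   (transition probabilities)
R^a_{xx'} = E\{r_{t+1} \mid x_t = x,\ a_t = a,\ x_{t+1} = x'\}   (expected rewards)
V^\pi(x) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, x_t = x\Big\}   (state-value function)
Q^\pi(x,a) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, x_t = x,\ a_t = a\Big\}   (action-value function)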
Bellman’s Equations
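The equations on this slide are not reproduced in the transcript; the following is a hedged reconstruction of the standard Bellman equation for V^π and the Bellman optimality equation, in the same notation:

V^\pi(x) = \sum_a \pi(x,a) \sum_{x'} P^a_{xx'} \big[ R^a_{xx'} + \gamma V^\pi(x') \big]
V^*(x) = \max_a \sum_{x'} P^a_{xx'} \big[ R^a_{xx'} + \gamma V^*(x') \big]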
Dynamic Programming Methods: policy evaluation, policy improvement.
Dynamic Programming (cont.). E: policy evaluation; I: policy improvement. Policy iteration alternates E and I steps until the policy is stable; value iteration truncates each evaluation to a single sweep.
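As a sketch of how value iteration collapses evaluation and improvement into one backup, here is a minimal Python version. The MDP encoding (P[x][a] as a list of (probability, next_state, reward) triples, with every state having at least one action) is an assumption for illustration, not part of the original slides.

def value_iteration(P, gamma=0.9, theta=1e-6):
    # P[x][a] = [(prob, next_state, reward), ...]  -- assumed encoding of the MDP
    V = {x: 0.0 for x in P}                       # initialize V(x) arbitrarily (here: zero)
    while True:
        delta = 0.0
        for x in P:
            # Bellman optimality backup: V(x) <- max_a sum_x' P(x'|x,a) [R + gamma V(x')]
            best = max(
                sum(p * (r + gamma * V[x2]) for p, x2, r in P[x][a])
                for a in P[x]
            )
            delta = max(delta, abs(best - V[x]))
            V[x] = best
        if delta < theta:                         # stop when the sweep changes V very little
            break
    # Greedy policy with respect to the converged value function
    policy = {
        x: max(P[x], key=lambda a: sum(p * (r + gamma * V[x2]) for p, x2, r in P[x][a]))
        for x in P
    }
    return V, policy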
Monte Carlo Methods. Features: learn from experience; do not need complete transition probabilities. Idea: partition experience into episodes, average the sample returns, and update on an episode-by-episode basis.
Temporal-Difference Learning. Features (a combination of Monte Carlo and DP ideas): learn from experience (Monte Carlo); update estimates based in part on other learned estimates (DP). The TD(λ) algorithm seamlessly integrates TD and Monte Carlo methods.
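A minimal first-visit Monte Carlo evaluation sketch of the idea above; generate_episode(policy) is a hypothetical helper that runs one episode under the policy and returns a list of (state, reward) pairs.

from collections import defaultdict

def mc_evaluate(generate_episode, policy, num_episodes=1000, gamma=1.0):
    # Average the sampled returns observed after the first visit to each state,
    # updating the estimate one episode at a time.
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = {}
    for _ in range(num_episodes):
        episode = generate_episode(policy)        # [(state, reward), ...] for one episode
        G = 0.0
        for t in reversed(range(len(episode))):   # walk backwards, accumulating the return G
            x, r = episode[t]
            G = gamma * G + r
            if x not in (s for s, _ in episode[:t]):   # first visit to x in this episode
                returns_sum[x] += G
                returns_cnt[x] += 1
                V[x] = returns_sum[x] / returns_cnt[x]
    return V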
TD(0) Learning
Initialize V(x) arbitrarily and π to the policy to be evaluated
Repeat (for each episode):
    Initialize x
    Repeat (for each step of the episode):
        a ← action given by π for x
        Take action a; observe reward r and next state x'
        V(x) ← V(x) + α[r + γV(x') − V(x)]
        x ← x'
    until x is terminal
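As a concrete illustration, a minimal tabular TD(0) sketch of the pseudocode above. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the policy(x) function are assumptions made for the example, not part of the original slides.

from collections import defaultdict

def td0(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    # After each step, move V(x) toward the bootstrapped target r + gamma * V(x').
    V = defaultdict(float)                 # V(x) initialized arbitrarily (zero)
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            a = policy(x)                  # action given by the policy for x
            x_next, r, done = env.step(a)  # observe reward and next state
            V[x] += alpha * (r + gamma * V[x_next] - V[x])
            x = x_next
    return V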
Q-Learning
Initialize Q(x, a) arbitrarily
Repeat (for each episode):
    Initialize x
    Repeat (for each step of the episode):
        Choose a from x using a policy derived from Q
        Take action a; observe r and x'
        Q(x, a) ← Q(x, a) + α[r + γ max_a' Q(x', a') − Q(x, a)]
        x ← x'
    until x is terminal
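And a matching tabular Q-learning sketch under the same assumed environment interface; the ε-greedy selection over a fixed action list stands in for "a policy derived from Q".

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Off-policy TD control: bootstrap on the max over next actions.
    Q = defaultdict(float)                        # Q(x, a) initialized arbitrarily (zero)
    for _ in range(num_episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy policy derived from Q
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(x, a_)])
            x_next, r, done = env.step(a)
            target = r + gamma * max(Q[(x_next, a_)] for a_ in actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])
            x = x_next
    return Q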
Q-Routing
Q_x(y, d): estimated time for a packet to reach destination node d from current node x via x's neighbor node y
T_y(d): y's estimate of the time remaining in the trip
q_y: queuing time at node y
T_xy: transmission time between x and y
Algorithm of Q-Routing
1. Set initial Q-values at each node.
2. Get the first packet from the packet queue of node x.
3. Choose the best neighbor node y = argmin_y Q_x(y, d) and forward the packet to node y.
4. Get the estimated value T_y(d) = min_z Q_y(z, d) from node y.
5. Update Q_x(y, d) ← Q_x(y, d) + η[q_y + T_xy + T_y(d) − Q_x(y, d)].
6. Go to 2.
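A minimal sketch of one Q-routing forwarding step, assuming each node keeps its table as Q[node][(neighbor, destination)]; neighbors(), queue_delay(), and link_delay() are hypothetical helpers, and the special case where the chosen neighbor is the destination itself (remaining time 0) is omitted for brevity.

def route_packet(Q, x, d, neighbors, queue_delay, link_delay, eta=0.5):
    # 1. Choose the neighbor with the smallest estimated delivery time to d.
    y = min(neighbors(x), key=lambda n: Q[x][(n, d)])
    # 2. Neighbor y reports its own best remaining-time estimate T_y(d).
    T_y = min(Q[y][(z, d)] for z in neighbors(y))
    # 3. Update Q_x(y, d) toward the observed delays plus y's estimate.
    sample = queue_delay(y) + link_delay(x, y) + T_y
    Q[x][(y, d)] += eta * (sample - Q[x][(y, d)])
    return y   # forward the packet to neighbor y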
Dual Reinforcement Q-Routing
Network Model
Network Model (cont.)
Node Model
Routing Controller
Initialization/Termination Procedures. Initialization: initialize and/or register global variables; initialize the routing table. Termination: destroy the routing table; release memory.
Arrival Procedure. Data packet arrival: update the routing table; route the packet onward with control information, or destroy it if it has reached its destination. Control-information packet arrival: update the routing table; destroy the packet.
Departure Procedure. Set all fields of the packet; get a shortest route; send the packet according to the route.
References
[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction.
[2] Chengan Guo, Applications of Reinforcement Learning in Sequence Detection and Network Routing.
[3] Simon Haykin, Neural Networks: A Comprehensive Foundation.
