10. ASSUMPTIONS
◉ There exists a path between any two parts of the city.
◉ Data for each point is present in the system.
◉ The maximum number of buses available is bounded by N.
◉ There are N imaginary passengers at the DEPOT.
11. Notation
◉ x^r_{b,i,j} is a Boolean variable: it is 1 if and only if bus b travels from point i to point j in the r-th step, and 0 otherwise.
◉ d_{i,j} denotes the (i, j) entry of the distance matrix.
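As a small illustration (the names `routing_cost`, `x`, and `dist` are ours, not part of the formulation), the objective simply sums the distance-matrix entries of every (bus, step) move whose indicator variable is 1:

```python
# Sketch: cost of a routing under the Boolean decision variables.
# x[b] lists the (i, j) moves of bus b in step order, i.e. the tuples
# for which the indicator variable equals 1.
# dist[i][j] is the (i, j) entry of the distance matrix.

def routing_cost(x, dist):
    """Total distance travelled by all buses over all steps."""
    total = 0
    for steps in x.values():          # one list of (i, j) moves per bus
        for i, j in steps:
            total += dist[i][j]
    return total

dist = [
    [0, 4, 7],
    [4, 0, 2],
    [7, 2, 0],
]
# Bus 0: depot(0) -> 1 -> 2 -> depot(0)
x = {0: [(0, 1), (1, 2), (2, 0)]}
print(routing_cost(x, dist))  # 4 + 2 + 7 = 13
```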
32. The Fittest Survives!
The fitness function determines how fit an individual is (the ability
of an individual to compete with other individuals). It gives a fitness
score to each individual. The probability that an individual will be
selected for reproduction is based on its fitness score.
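Fitness-proportional ("roulette wheel") selection is one common way to realize this; a sketch with toy fitness values (`roulette_select` is an illustrative name):

```python
import random

def roulette_select(population, fitness, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, score in zip(population, fitness):
        running += score
        if pick <= running:
            return individual
    return population[-1]  # guard against floating-point rounding

random.seed(0)
population = ["A", "B", "C"]
fitness = [1.0, 3.0, 6.0]   # "C" should be chosen ~60% of the time
counts = {p: 0 for p in population}
for _ in range(10000):
    counts[roulette_select(population, fitness)] += 1
print(counts)
```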
36. Overview of Reinforcement Learning
◉ Based on the Markov Decision Process (MDP) mathematical framework.
◉ Consists of an agent that takes actions in an environment according to a policy and receives a reward or punishment for each action.
◉ The policy function is optimized by repeated training using policy gradient methods.
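The agent-environment loop described above can be sketched on a toy 1-D world (the environment, rewards, and the fixed greedy policy are all illustrative stand-ins; a real RL agent would also update its policy from the rewards it receives):

```python
# Minimal agent-environment loop on a toy 1-D world.
# The agent earns +1 for moving closer to position 0 and -1 otherwise.
# The "policy" here is a fixed greedy rule, just to show the loop;
# learning would adjust the policy from the accumulated rewards.

def policy(state):
    return -1 if state > 0 else 1

def step(state, action):
    next_state = state + action
    reward = 1 if abs(next_state) < abs(state) else -1
    return next_state, reward

state, total_reward = 5, 0
for _ in range(20):
    action = policy(state)
    state, reward = step(state, action)
    total_reward += reward
print(state, total_reward)  # 1 4
```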
37. Heuristic Solution – Reinforcement Learning
◉ NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT
LEARNING –
Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, Samy Bengio
38. Overview of Neural Combinatorial Optimization with
Reinforcement Learning
◉ This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning.
◉ Solves the TSP: given a set of city coordinates, the model predicts a distribution over city permutations.
◉ Uses the negative tour length as the reward signal and optimizes the network with a policy gradient method.
39. Overview of Neural Combinatorial Optimization with
Reinforcement Learning
◉ Supervised learning is not applicable to combinatorial optimization problems because one does not have access to optimal labels.
◉ RL is based on exploration and learning in an environment.
40. Neural Network Architecture for TSP
◉ Given an input graph represented as a sequence of n cities in two-dimensional space, s = {x_i}, i = 1, …, n, the total tour length, which is also the quantity the neural network is trained to minimize, is
L(π | s) = ‖x_{π(n)} − x_{π(1)}‖ + Σ_{i=1}^{n−1} ‖x_{π(i)} − x_{π(i+1)}‖
where π is a permutation of the cities.
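A straightforward sketch of this tour-length computation (Euclidean distances, closed tour; the function name is ours):

```python
import math

def tour_length(cities, perm):
    """Length of the closed tour visiting `cities` in the order `perm`."""
    n = len(perm)
    total = 0.0
    for k in range(n):
        x1, y1 = cities[perm[k]]
        x2, y2 = cities[perm[(k + 1) % n]]  # wrap around to close the tour
        total += math.hypot(x2 - x1, y2 - y1)
    return total

cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # unit square
print(tour_length(cities, [0, 1, 2, 3]))   # perimeter: 4.0
print(tour_length(cities, [0, 2, 1, 3]))   # crossing tour: 2 + 2*sqrt(2)
```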
41. Network Architecture – Pointer Network
◉ Using a pointer network allows the model to effectively point to a specific position in the input sequence.
◉ A pointer network consists of an encoder that maps the input sequence to latent memory states and a decoder that, at each output step, points to one position of the input.
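One decoding step of the pointing mechanism can be sketched with plain dot-product attention (the tiny hand-made vectors stand in for learned encoder and decoder states):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def point(query, encoded_inputs):
    """Distribution over input positions via dot-product attention."""
    scores = [sum(q * e for q, e in zip(query, enc)) for enc in encoded_inputs]
    return softmax(scores)

encoded = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # encoder states for 3 cities
query = [0.9, 0.1]                               # current decoder state
probs = point(query, encoded)
print(probs.index(max(probs)))  # points at position 0
```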
43. Network Optimization using Policy Gradient
◉ The network is trained with a loss that measures the difference between its predicted path and the path found by the policy function.
◉ In effect, the tours produced by the RL policy serve as the training set, and the model learns to solve the TSP for a given set of inputs.
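The policy gradient idea can be sketched on a toy two-armed bandit: the log-probability of an action is nudged in proportion to the reward it earns, the same REINFORCE-style update used with (negative) tour length as the reward (all names and numbers here are illustrative):

```python
import math, random

def softmax(prefs):
    """Softmax policy over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]

rng = random.Random(0)
prefs = [0.0, 0.0]           # action preferences (the "policy parameters")
rewards = [0.2, 1.0]         # arm 1 pays better
lr = 0.1

for _ in range(500):
    probs = softmax(prefs)
    a = 0 if rng.random() < probs[0] else 1
    r = rewards[a]
    # gradient of log pi(a) w.r.t. prefs[k] is (1[k == a] - probs[k])
    for k in range(2):
        prefs[k] += lr * r * ((1 if k == a else 0) - probs[k])

# after training, the policy strongly prefers the better arm
print(round(softmax(prefs)[1], 3))
```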
44. Reinforcement Learning + Neural Network for VRP Problem
◉ We modify the Bello et al. approach to solve the VRP.
◉ We reuse the pointer network model proposed by Bello et al. in our VRP model.
45. Our solution for solving Vehicle Routing Problem
◉ We represent each state by a sequence of tuples x_i(t) = (s_i, d_i(t)), where s_i are the coordinates of point i and d_i(t) is its demand at time t. The set of all inputs is denoted by X(t).
◉ We start from an arbitrary input X(0), and at every step our pointer model predicts the next customer to be picked.
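A minimal sketch of this state representation (the names `make_state` and `visit` are ours; serving a customer simply zeroes its demand):

```python
# Each point i carries static coordinates s_i and a time-varying
# demand d_i(t). Visiting a customer sets its demand to 0; the
# coordinates never change.

def make_state(coords, demands):
    """X(t): one (s_i, d_i) tuple per point."""
    return [(s, d) for s, d in zip(coords, demands)]

def visit(state, i):
    """Serve customer i: demand at i drops to 0, coordinates unchanged."""
    s, _ = state[i]
    return state[:i] + [(s, 0)] + state[i + 1:]

coords = [(0, 0), (2, 1), (5, 4)]
demands = [0, 3, 7]            # point 0 is the depot
state = make_state(coords, demands)
state = visit(state, 1)        # pointer model picked customer 1
print([d for _, d in state])   # [0, 0, 7]
```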
46. Our solution for solving Vehicle Routing Problem
◉ We are interested in finding a stochastic policy π that generates a sequence Y in a way that minimizes the loss while satisfying the problem constraints.
◉ After training the bus in an environment, we obtain the policy function π and then use it to train our network model to predict the optimal path.
◉ Using reinforcement learning for the VRP involves multiple agents acting on an environment, jointly minimizing the total distance covered.
48. Algorithm Comparison
                         Time on     Time on     Time on           Training   Accuracy
                         vrp10       vrp50       vrp100            time
Integer Programming      5 minutes   54 hours    Approx. 300 days  Nil        100%
Genetic Programming      1 minute    1.5 minutes 2 minutes         Nil        20%
Reinforcement Learning   0.1 min     0.2 min     0.5 min           6 hours    61%
49. Conclusion
◉ We propose three solutions for solving the required problem.
◉ The algorithm can be chosen depending on the given dataset, test time, training time, hardware, etc.
◉ The Integer Programming problem is NP-hard, so exact solvers take exponential time in the worst case.
◉ Hence, for real-life applications it is important to develop a heuristic solution.
◉ Genetic algorithms search for the best possible solution through the principle of evolution.
◉ Reinforcement learning produces a policy function that is used to train a neural network to predict the optimal route.