Paper Study: A Learning-based Iterative Method for Solving Vehicle Routing Problem
1. A Learning-based Iterative
Method for Solving Vehicle
Routing Problem
Hao Lu, Xingwen Zhang and Shuang Yang
Princeton University and Ant Financial Services Group
ICLR 2020
2. Abstract
• Present “Learn to Improve (L2I)” to solve the capacitated vehicle routing
problem (CVRP).
• Start with an initial solution and refine it iteratively.
• Outperform the classical operations research (OR) approach (e.g.,
LKH3).
3. Introduction
• In recent years, following the Pointer Network, researchers have started to
develop new deep learning and reinforcement learning frameworks to
solve combinatorial optimization problems.
• For the vehicle routing problem, prior learning-based results could not beat
the OR algorithm LKH3.
• Propose a learning-based algorithm for solving CVRP that outperforms
classical solvers.
4. Introduction (cont’d)
• Propose a hierarchical framework.
• Separate heuristic operators into two classes, improvement operators and
perturbation operators.
• Choose the class first and then choose operators within the class.
• Propose an ensemble method training several RL policies at the same
time.
6. Capacitated Vehicle Routing Problem (CVRP)
• There is a depot and a set of 𝑁 customers in the CVRP. Each customer
𝑖 has a demand 𝑑𝑖 to be satisfied.
• A vehicle, which starts and ends at the depot, can serve a set of
customers as long as the total customer demand does not exceed the
vehicle capacity 𝐶.
• Find a minimal-cost set of routes that fulfills the demands of all
customers without violating the vehicle capacity constraint (a small
evaluation sketch follows below).
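As a concrete reading of the objective above, here is a minimal Python sketch of how a CVRP solution (a list of routes over customer indices) could be evaluated; the helper names route_cost, is_feasible, and solution_cost are illustrative and not from the paper.

```python
import math

def route_cost(route, coords, depot=0):
    """Euclidean length of one route: depot -> customers -> depot."""
    path = [depot] + route + [depot]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))

def is_feasible(routes, demands, capacity):
    """Each route must respect the vehicle capacity C."""
    return all(sum(demands[i] for i in route) <= capacity for route in routes)

def solution_cost(routes, coords):
    """Total cost of a CVRP solution is the sum of its route lengths."""
    return sum(route_cost(route, coords) for route in routes)
```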
7. Local search and 2-opt
• Start with a feasible solution and look for an improved solution.
• Two TSP tours are called 2-adjacent if one can be obtained from the
other by deleting two edges and adding two edges.
• A TSP tour T is called 2-optimal if there is no 2-adjacent tour to T with
lower cost than T.
• 2-opt heuristic: repeatedly replace the current tour with a 2-adjacent tour
of lower cost until the tour is 2-optimal (see the sketch below).
Source: MIT 15.053/8 The Traveling Salesman Problem and Heuristics
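A minimal Python sketch of the 2-opt heuristic described above (illustrative only, not taken from the paper or the MIT notes): reversing a tour segment deletes two edges and adds two, and any such move that lowers the cost is kept until none remains.

```python
import math

def tour_length(tour, coords):
    """Length of a closed tour visiting the nodes in order."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, coords):
    """Replace the tour with cheaper 2-adjacent tours until it is 2-optimal."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                # Reversing tour[i:j+1] removes two edges and adds two new ones.
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, coords) < tour_length(tour, coords):
                    tour, improved = candidate, True
    return tour
```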
12. • Improvement operators try to improve the current solution.
• The maximal consecutive sequence of improvement operators applied
before a perturbation is called an improvement iteration.
• Perturbation operators destroy and reconstruct part of the solution to
generate a new starting solution.
• If no cost reduction has been made for 𝐿 improvement steps, perturb
the solution.
• After 𝑇 steps, the algorithm stops and the minimum-cost solution is
chosen (the loop is sketched below).
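The improvement/perturbation loop above might be sketched as follows; improve_ops, perturb_ops, and the two choose_* policies are placeholders standing in for the paper's operator pools and learned controller.

```python
def learn_to_improve(initial_solution, cost, improve_ops, perturb_ops,
                     choose_improve, choose_perturb, T=40000, L=6):
    """Improve until L consecutive steps bring no cost reduction, then perturb;
    after T steps return the minimum-cost solution seen."""
    current = initial_solution
    best, best_cost = current, cost(current)
    no_improve = 0
    for _ in range(T):
        if no_improve < L:
            op = choose_improve(current, improve_ops)  # policy picks an improvement operator
        else:
            op = choose_perturb(current, perturb_ops)  # destroy-and-reconstruct operator
            no_improve = 0
        candidate = op(current)
        no_improve = 0 if cost(candidate) < cost(current) else no_improve + 1
        current = candidate
        if cost(current) < best_cost:
            best, best_cost = current, cost(current)
    return best
```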
16. States for each node
• Combine problem-specific and solution-specific information with the effect of
recent actions (+1 if an action led to a cost reduction, -1 otherwise).
17. Reward and Policy network
• Reward
• Intermediate impact
• +1 if the operator improves the current solution, -1 otherwise
• Advantage-based
• Take the distance achieved during the first improvement iteration as the baseline.
• For each subsequent iteration, the reward is the difference between the current
distance and the baseline.
• Policy network
• REINFORCE algorithm
\nabla_\theta J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\left[ (L(\pi \mid s) - b(s)) \, \nabla_\theta \log p_\theta(\pi \mid s) \right]
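A minimal PyTorch sketch of a REINFORCE update matching the gradient above, where L(π|s) is treated as a cost (e.g. total distance) and b(s) is the baseline; the policy network and optimizer are assumed to exist elsewhere and the names are illustrative.

```python
import torch

def reinforce_loss(log_probs, returns, baseline):
    """Surrogate loss whose gradient is E[(L(pi|s) - b(s)) * grad log p_theta(pi|s)].

    log_probs : log p_theta(pi | s) of each sampled rollout (requires grad)
    returns   : L(pi | s), e.g. the tour length obtained by each rollout
    baseline  : b(s), e.g. the distance of the first improvement iteration
    """
    advantage = (returns - baseline).detach()  # no gradient through the baseline
    # Gradient descent on this loss follows the gradient above and lowers expected cost.
    return (advantage * log_probs).mean()

# usage sketch:
# loss = reinforce_loss(log_probs, returns, baseline)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```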
18. • Attention network:
• Transformer
• 8 heads
• 64 output units
• Ensemble method: train 6 different policies.
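A minimal PyTorch sketch of an attention block with the quoted sizes (8 heads, 64-dimensional output); the full L2I architecture is not reproduced here, so this module is purely illustrative.

```python
import torch
import torch.nn as nn

class NodeAttention(nn.Module):
    """Self-attention over node embeddings: 8 heads, 64-dimensional output."""
    def __init__(self, dim=64, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):            # x: (batch, num_nodes, 64)
        out, _ = self.attn(x, x, x)  # queries, keys, values are all node embeddings
        return self.norm(x + out)    # residual connection + layer norm

# usage: NodeAttention()(torch.randn(2, 21, 64))  # e.g. depot + 20 customers
```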
19. Experiments and Analyses
• Three sub-problems with number of customers 𝑁 = 20, 50, 100.
• Location of each customer and the depot sampled uniformly in [0, 1]².
• Demand of each customer in {1, 2, …, 9}.
• The capacity of a vehicle is 20, 30, 40 for 𝑁 = 20, 50, 100, respectively.
• ADAM optimizer
• 𝑇 = 40000, perturb solution after 𝐿 = 6 consecutive non-
improvements
• 2000 random samples
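The setup above corresponds to instance generation along these lines (a sketch; the paper's exact sampling code is not given in the slides):

```python
import random

def generate_cvrp_instance(n):
    """Random instance: uniform locations in the unit square, demands in
    {1,...,9}, and capacity 20/30/40 for N = 20/50/100 (node 0 is the depot)."""
    capacity = {20: 20, 50: 30, 100: 40}[n]
    coords = [(random.random(), random.random()) for _ in range(n + 1)]
    demands = [0] + [random.randint(1, 9) for _ in range(n)]
    return coords, demands, capacity
```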
23. Apply on TSP
Use the first node as the depot and set the demand of each customer to zero
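In the CVRP representation used earlier, this reduction could look like the following hypothetical helper:

```python
def tsp_to_cvrp(coords):
    """Treat the first node as the depot and give every customer zero demand,
    so the capacity constraint never binds and one route covers all nodes."""
    demands = [0] * len(coords)
    capacity = 1  # any positive value works since all demands are zero
    return coords, demands, capacity
```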
24. Conclusion
• Propose “Learn to Improve (L2I)” for solving CVRP, together with an ensemble
method that trains several RL policies and chooses the best solution produced
by the policies.
• Combine the strengths of OR with the learning capabilities of RL.
• Achieve new state-of-the-art results on CVRP instances.