1. A Parallel GPU Version of the Traveling Salesman Problem
By Molly A. O’Neil, Dan Tamir and Martin Burtscher
Presented By
Rukshan Siriwardhane (148208V)
Vimukthi Wickramasinghe (148245F)
2. Outline
● The Traveling Salesman Problem
● The TSP algorithm used
● Using a GPU to solve TSP
● Optimizations used
● Evaluation method
● Results
3. The Traveling Salesman Problem
Definition: Given n cities, find the shortest Hamiltonian tour through the cities
● Combinatorial optimization problem
○ E.g., finding effective drilling-arm movements, best routing, logistics, etc.
● A brute-force search of the solution space is not feasible
● Usually expressed as a graph problem
○ A complete, undirected, planar, Euclidean graph is used
○ Vertices represent cities
○ Edge weights reflect distances or costs (see the sketch below)
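To make this representation concrete, here is a minimal sketch of evaluating a tour under the Euclidean model. The function and array names are hypothetical (not the paper's code): cities are points in the plane, and a tour's length is the sum of the straight-line edge weights along the permutation.

```cuda
#include <math.h>

// Tour length for an n-city Euclidean instance: tour is a permutation of
// city indices 0..n-1; each edge weight is the straight-line distance
// between the coordinates of its two endpoint cities.
float tour_length(const float *x, const float *y, const int *tour, int n)
{
    float len = 0.0f;
    for (int i = 0; i < n; i++) {
        int a = tour[i];
        int b = tour[(i + 1) % n];   // wrap around: last city back to first
        float dx = x[a] - x[b];
        float dy = y[a] - y[b];
        len += sqrtf(dx * dx + dy * dy);
    }
    return len;
}
```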
4. The TSP Algorithm Used
● Finding the optimal solution is NP-hard
○ Heuristic algorithms are used to find an approximate solution
● Here, an iterative hill-climbing search algorithm is used
○ Generate k random initial tours (k climbers)
○ Iteratively refine them until a local minimum is reached
● In each iteration, apply the best 2-opt move
○ Find the best pair of edges (a, b) and (c, d) such that replacing them with (a, c) and (b, d) minimizes the tour length (see the sketch below)
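Applying a 2-opt move reduces to reversing a contiguous segment of the tour. A minimal sketch (a hypothetical helper, not the authors' code), marked __host__ __device__ so the same routine can be used from CPU or GPU code:

```cuda
// If the tour visits ... a, b, ..., c, d ... with b at position i and c at
// position j, replacing edges (a, b) and (c, d) by (a, c) and (b, d) is
// equivalent to reversing the tour segment from position i to position j.
__host__ __device__ void reverse_segment(int *tour, int i, int j)
{
    while (i < j) {
        int tmp = tour[i];
        tour[i] = tour[j];
        tour[j] = tmp;
        i++;
        j--;
    }
}
```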
6. Using a GPU to solve TSP
● Parallelism: more than 10,000 threads
● Memory access regularity: sets of 32 threads (warps) need to access memory in regular, coalescible patterns
● Code regularity: sets of 32 threads need to follow the same control flow
● Data reuse: at least O(n²) operations on O(n) data
7. Using a GPU to solve TSP
▪ Assuming 100-city problems & 100,000 climbers
▪ Climbers are independent, so they can be run in parallel (see the kernel sketch after this slide)
▪ Pro: plenty of data parallelism
▪ Con: potential load imbalance
▪ Different numbers of steps are required to reach a local minimum
▪ Every step determines the best of 4851 2-opt moves
▪ Same control flow (but different data)
▪ Coalesced memory access patterns
▪ O(n²) operations on O(n) data
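A sketch of this parallelization scheme, assuming one CUDA thread per climber; it shows the structure only, not the paper's tuned kernel. The names climb, tours (k private tours of n indices, stored row-major), and len (initialized to each climber's starting tour length) are assumptions; best_2opt_move() is sketched after slide 8 and reverse_segment() after slide 4.

```cuda
__device__ float best_2opt_move(const int *tour, const float *dist, int n,
                                int *best_i, int *best_j);
__host__ __device__ void reverse_segment(int *tour, int i, int j);

// One thread per climber: each refines its own private tour until no
// 2-opt move improves it (a local minimum), tracking the tour length
// via length differences only.
__global__ void climb(int *tours, float *len, const float *dist, int n, int k)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= k) return;                          // k = number of climbers

    int *tour = &tours[c * n];                   // this climber's tour
    int i, j;
    float delta;
    while ((delta = best_2opt_move(tour, dist, n, &i, &j)) < 0.0f) {
        reverse_segment(tour, i, j);             // apply the improving move
        len[c] += delta;                         // update length incrementally
    }
}
```

Since each climber takes a different number of steps to converge, threads in the same warp finish at different times, which is the load imbalance the slide flags as the main drawback.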
8. Optimizations - Code
● Main code section: finding the best 2-opt move
○ Doubly nested loop (sketched below)
■ Only computes the difference in tour length, not the absolute length
○ Highly optimized to minimize memory accesses
■ “Caches” the rest of the data in registers
■ Requires only 6 clock cycles per move on a Xeon CPU core
○ Local minimum compared to the best solution so far
■ Best solution updated if needed; otherwise the tour is discarded
○ Other small optimizations
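A hedged sketch of that doubly nested loop, without the register-caching tricks the slide describes: it scores every 2-opt move by the difference in tour length alone, since no other edge of the tour is affected. dist is assumed to be a row-major n × n distance matrix.

```cuda
// Scores all (n-2)(n-1)/2 2-opt moves (4851 for n = 100) and returns the
// best (most negative) length difference; four matrix lookups per move.
__device__ float best_2opt_move(const int *tour, const float *dist, int n,
                                int *best_i, int *best_j)
{
    float best = 0.0f;                           // accept improving moves only
    for (int i = 1; i < n - 1; i++) {
        int a = tour[i - 1], b = tour[i];
        for (int j = i + 1; j < n; j++) {
            int c = tour[j], d = tour[(j + 1) % n];
            // Swapping edges (a,b) and (c,d) for (a,c) and (b,d) changes
            // the tour length by exactly this delta.
            float delta = dist[a * n + c] + dist[b * n + d]
                        - dist[a * n + b] - dist[c * n + d];
            if (delta < best) {
                best = delta;
                *best_i = i;
                *best_j = j;
            }
        }
    }
    return best;
}
```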
9. Optimizations - GPU
● Random tours generated in parallel on GPU
○ Minimizes data transfer to GPU
● 2D distance matrix resident in shared memory (see the sketch below)
○ Ensures hits in software-controlled fast data cache
● Tours copied to local memory in chunks of 1024
○ Enables accessing them with coalesced loads & stores
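A sketch of the shared-memory placement, assuming 100-city instances: a 100 × 100 float matrix is about 40 KB, which fits in the Tesla C2050's 48 KB of shared memory. The kernel name and parameters are hypothetical; the staging loop is coalesced because consecutive threads read consecutive addresses.

```cuda
#define N 100                                    // cities per instance

__global__ void climb_shared(const float *dist_global, int *tours,
                             float *len, int k)
{
    __shared__ float dist[N * N];                // on-chip copy of the matrix

    // Cooperatively stage the distance matrix into shared memory:
    // consecutive threads load consecutive addresses (coalesced).
    for (int idx = threadIdx.x; idx < N * N; idx += blockDim.x)
        dist[idx] = dist_global[idx];
    __syncthreads();

    // ... each thread then runs its climber against the shared matrix,
    // e.g. via the climb logic sketched earlier ...
}
```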
10. Evaluation Method
● Hardware
○ NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs each, 3 GB global memory)
○ Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons, sharing 4 TB main memory)
● Data
○ Five 100-city inputs from TSPLIB
● Implementations
○ CUDA (GPU), Pthreads (CPU), serial C (CPU)
○ All use almost identical code for finding the best 2-opt move
11. Results - Runtime Comparison
● GPU is 7.8x faster than CPU with 8 cores
● One GPU chip is as fast as 16 or 32 CPU chips
12. Speedup over Serial
● Pthreads code scales well up to 32 threads (4 CPUs)
● CPU performance fluctuates (NUMA); GPU performance is stable
13. Results - Solution Quality
● Optimal tour found in 4 of 5 cases with 100,000 climbers
○ 200,000 climbers find the best solution in the fifth case
● Runtime is independent of the input and linear in the number of climbers
14. Summary
▪ TSP_GPU algorithm
▪ Highly optimized implementation for GPUs
▪ Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
▪ Produces high-quality results
▪ May be better suited to GPUs than Ant Colony Optimization and genetic algorithms