1. A Parallel GPU Version of the Traveling Salesman Problem
By Molly A. O’Neil, Dan Tamir and Martin Burtscher
Presented By
Rukshan Siriwardhane (148208V)
Vimukthi Wickramasinghe (148245F)
2. Outline
● The Traveling Salesman Problem
● The TSP algorithm used
● Using a GPU to solve TSP
● Optimizations used
● Evaluation method
● Results
3. The Traveling Salesman Problem
Definition: Given n cities, find the shortest Hamiltonian tour through the cities
● Combinatorial optimization problem
○ E.g., finding effective drilling-arm movements, best routing, logistics, etc.
● A brute-force search of the solution space is not feasible
● Usually expressed as a graph problem
○ A complete, undirected, planar, Euclidean graph is used
○ Vertices represent cities
○ Edge weights reflect distances or costs (see the sketch below)
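To make this representation concrete, here is a minimal sketch of evaluating a tour under the Euclidean model. The function and array names are hypothetical (not the paper's code): cities are points in the plane, and a tour's length is the sum of the straight-line edge weights along the permutation.

```cuda
#include <math.h>

// Tour length for an n-city Euclidean instance: tour is a permutation of
// city indices 0..n-1; each edge weight is the straight-line distance
// between the coordinates of its two endpoint cities.
float tour_length(const float *x, const float *y, const int *tour, int n)
{
    float len = 0.0f;
    for (int i = 0; i < n; i++) {
        int a = tour[i];
        int b = tour[(i + 1) % n];   // wrap around: last city back to first
        float dx = x[a] - x[b];
        float dy = y[a] - y[b];
        len += sqrtf(dx * dx + dy * dy);
    }
    return len;
}
```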
4. The TSP Algorithm Used
● Finding the optimal solution is NP-hard
○ Heuristic algorithms are used to find an approximate solution
● Here, an iterative hill-climbing search algorithm is used
○ Generate k random initial tours (k climbers)
○ Iteratively refine them until a local minimum is reached
● In each iteration, apply the best 2-opt move
○ Find the best pair of edges (a, b) and (c, d) such that replacing them with (a, c) and (b, d) minimizes the tour length (see the sketch below)
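Applying a 2-opt move reduces to reversing a contiguous segment of the tour. A minimal sketch (a hypothetical helper, not the authors' code), marked __host__ __device__ so the same routine can be used from CPU or GPU code:

```cuda
// If the tour visits ... a, b, ..., c, d ... with b at position i and c at
// position j, replacing edges (a, b) and (c, d) by (a, c) and (b, d) is
// equivalent to reversing the tour segment from position i to position j.
__host__ __device__ void reverse_segment(int *tour, int i, int j)
{
    while (i < j) {
        int tmp = tour[i];
        tour[i] = tour[j];
        tour[j] = tmp;
        i++;
        j--;
    }
}
```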
6. Using a GPU to solve TSP
● Parallelism: more than 10,000 threads
● Memory access regularity: sets of 32 threads (warps) need to access memory in regular, coalescible patterns
● Code regularity: sets of 32 threads need to follow the same control flow
● Data reuse: at least O(n²) operations on O(n) data
7. Using a GPU to solve TSP
▪ Assuming 100-city problems & 100,000 climbers
▪ Climbers are independent, so they can be run in parallel (see the kernel sketch after this slide)
▪ Pro: plenty of data parallelism
▪ Con: potential load imbalance
▪ Different numbers of steps are required to reach a local minimum
▪ Every step determines the best of 4851 2-opt moves
▪ Same control flow (but different data)
▪ Coalesced memory access patterns
▪ O(n²) operations on O(n) data
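A sketch of this parallelization scheme, assuming one CUDA thread per climber; it shows the structure only, not the paper's tuned kernel. The names climb, tours (k private tours of n indices, stored row-major), and len (initialized to each climber's starting tour length) are assumptions; best_2opt_move() is sketched after slide 8 and reverse_segment() after slide 4.

```cuda
__device__ float best_2opt_move(const int *tour, const float *dist, int n,
                                int *best_i, int *best_j);
__host__ __device__ void reverse_segment(int *tour, int i, int j);

// One thread per climber: each refines its own private tour until no
// 2-opt move improves it (a local minimum), tracking the tour length
// via length differences only.
__global__ void climb(int *tours, float *len, const float *dist, int n, int k)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= k) return;                          // k = number of climbers

    int *tour = &tours[c * n];                   // this climber's tour
    int i, j;
    float delta;
    while ((delta = best_2opt_move(tour, dist, n, &i, &j)) < 0.0f) {
        reverse_segment(tour, i, j);             // apply the improving move
        len[c] += delta;                         // update length incrementally
    }
}
```

Since each climber takes a different number of steps to converge, threads in the same warp finish at different times, which is the load imbalance the slide flags as the main drawback.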
8. Optimizations - Code
● Main code section: finding the best 2-opt move
○ Doubly nested loop (sketched below)
■ Only computes the difference in tour length, not the absolute length
○ Highly optimized to minimize memory accesses
■ “Caches” the rest of the data in registers
■ Requires only 6 clock cycles per move on a Xeon CPU core
○ Local minimum compared to the best solution so far
■ Best solution updated if needed; otherwise the tour is discarded
○ Other small optimizations
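A hedged sketch of that doubly nested loop, without the register-caching tricks the slide describes: it scores every 2-opt move by the difference in tour length alone, since no other edge of the tour is affected. dist is assumed to be a row-major n × n distance matrix.

```cuda
// Scores all (n-2)(n-1)/2 2-opt moves (4851 for n = 100) and returns the
// best (most negative) length difference; four matrix lookups per move.
__device__ float best_2opt_move(const int *tour, const float *dist, int n,
                                int *best_i, int *best_j)
{
    float best = 0.0f;                           // accept improving moves only
    for (int i = 1; i < n - 1; i++) {
        int a = tour[i - 1], b = tour[i];
        for (int j = i + 1; j < n; j++) {
            int c = tour[j], d = tour[(j + 1) % n];
            // Swapping edges (a,b) and (c,d) for (a,c) and (b,d) changes
            // the tour length by exactly this delta.
            float delta = dist[a * n + c] + dist[b * n + d]
                        - dist[a * n + b] - dist[c * n + d];
            if (delta < best) {
                best = delta;
                *best_i = i;
                *best_j = j;
            }
        }
    }
    return best;
}
```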
9. Optimizations - GPU
● Random tours generated in parallel on GPU
○ Minimizes data transfer to GPU
● 2D distance matrix resident in shared memory (see the sketch below)
○ Ensures hits in software-controlled fast data cache
● Tours copied to local memory in chunks of 1024
○ Enables accessing them with coalesced loads & stores
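A sketch of the shared-memory placement, assuming 100-city instances: a 100 × 100 float matrix is about 40 KB, which fits in the Tesla C2050's 48 KB of shared memory. The kernel name and parameters are hypothetical; the staging loop is coalesced because consecutive threads read consecutive addresses.

```cuda
#define N 100                                    // cities per instance

__global__ void climb_shared(const float *dist_global, int *tours,
                             float *len, int k)
{
    __shared__ float dist[N * N];                // on-chip copy of the matrix

    // Cooperatively stage the distance matrix into shared memory:
    // consecutive threads load consecutive addresses (coalesced).
    for (int idx = threadIdx.x; idx < N * N; idx += blockDim.x)
        dist[idx] = dist_global[idx];
    __syncthreads();

    // ... each thread then runs its climber against the shared matrix,
    // e.g. via the climb logic sketched earlier ...
}
```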
10. Evaluation Method
● Hardware
○ NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs each, 3 GB global memory)
○ Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons, sharing 4 TB main memory)
● Data
○ Five 100-city inputs from TSPLIB
● Implementations
○ CUDA (GPU), Pthreads (CPU), serial C (CPU)
○ All use almost identical code for finding the best 2-opt move
11. Results - Runtime Comparison
● GPU is 7.8x faster than CPU with 8 cores
● One GPU chip is as fast as 16 or 32 CPU chips
12. Speedup over Serial
● Pthreads code scales well up to 32 threads (4 CPUs)
● CPU performance fluctuates (NUMA); GPU performance is stable
13. Results - Solution Quality
● Optimal tour found in 4 of 5 cases with 100,000 climbers
○ 200,000 climbers find the best solution in the fifth case
● Runtime is independent of the input and linear in the number of climbers
14. Summary
▪ TSP_GPU algorithm
▪ Highly optimized implementation for GPUs
▪ Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
▪ Produces high-quality results
▪ May be better suited to GPUs than Ant Colony Optimization and genetic algorithms