A Parallel GPU Version of the Traveling Salesman Problem
By Molly A. O’Neil, Dan Tamir and Martin Burtscher
Presented By
Rukshan Siriwardhane (148208V)
Vimukthi Wickramasinghe (148245F)
Outline
● The Traveling Salesman Problem
● The TSP algorithm used
● Using a GPU to solve TSP
● Optimizations used
● Evaluation method
● Results
The Traveling Salesman Problem
Definition: given n cities, find the shortest Hamiltonian tour between them (a closed tour that visits every city exactly once)
● Combinatorial optimization problem
○ E.g., finding efficient drill-arm movements, best routing, logistics, etc.
● A brute-force search of the solution space is not feasible
○ A symmetric instance with n cities has (n-1)!/2 distinct tours
● Usually expressed as a graph problem
○ A complete, undirected, Euclidean graph is used, with the cities as points in the plane
○ Vertices represent cities
○ Edge weights reflect distances or costs
The TSP Algorithm used
● Finding the optimal solution is NP-hard
○ Heuristic algorithms are used to find approximate solutions
● Here, an iterative hill-climbing search algorithm is used
○ Generate k random initial tours (k climbers)
○ Iteratively refine each tour until a local minimum is reached
● In each iteration, apply the best opt-2 move
○ Find the pair of edges (a, b) and (c, d) such that replacing them with (a, c) and (b, d), which reverses the path from b to c, minimizes the tour length
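The hill-climbing procedure can be sketched in C as follows. This is an illustration, not the authors' code: it assumes a precomputed, symmetric, row-major n × n distance matrix, uses the C library `rand()` for the initial tour, and keeps `tour[0]` fixed while reversing `tour[i..j]` to apply a move. Under this indexing, replacing edges (a, b) and (c, d) with (a, c) and (b, d) changes the tour length by d(a,c) + d(b,d) - d(a,b) - d(c,d). The names `random_tour` and `climb` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Look up d(u, v) in a row-major n x n distance matrix (assumed symmetric). */
static double D(const double *dist, int n, int u, int v) {
    return dist[u * n + v];
}

/* Apply an opt-2 move by reversing tour[i..j] in place. */
static void reverse_segment(int *tour, int i, int j) {
    while (i < j) {
        int t = tour[i]; tour[i] = tour[j]; tour[j] = t;
        i++; j--;
    }
}

/* Random initial tour: Fisher-Yates shuffle of 0..n-1 using rand(). */
void random_tour(int *tour, int n) {
    for (int k = 0; k < n; k++) tour[k] = k;
    for (int k = n - 1; k > 0; k--) {
        int r = rand() % (k + 1);
        int t = tour[k]; tour[k] = tour[r]; tour[r] = t;
    }
}

/* Hill-climb one tour until no opt-2 move shortens it (a local minimum).
   Returns the number of moves applied. */
int climb(const double *dist, int *tour, int n) {
    assert(n >= 4);
    const double EPS = 1e-9;  /* treat tiny "gains" as rounding noise */
    int steps = 0;
    for (;;) {
        double best = 0.0;
        int bi = -1, bj = -1;
        for (int i = 1; i <= n - 2; i++) {
            for (int j = i + 1; j <= n - 1; j++) {
                int a = tour[i - 1], b = tour[i];
                int c = tour[j], d = tour[(j + 1) % n];
                /* Length change if (a,b),(c,d) are replaced by (a,c),(b,d). */
                double delta = D(dist, n, a, c) + D(dist, n, b, d)
                             - D(dist, n, a, b) - D(dist, n, c, d);
                if (delta < best) { best = delta; bi = i; bj = j; }
            }
        }
        if (bi < 0 || best > -EPS) return steps;  /* local minimum */
        reverse_segment(tour, bi, bj);
        steps++;
    }
}
```

A driver for k climbers would call `random_tour` and `climb` once per climber and keep the shortest result; because the climbers are independent, they can run in parallel.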
Using a GPU to solve TSP
What a GPU needs to perform well:
● Parallelism: more than 10,000 threads
● Memory access regularity: sets of 32 threads (warps) need to access memory in a regular, coalesced pattern
● Code regularity: sets of 32 threads need to follow the same control flow
● Data reuse: at least O(n²) operations on O(n) data
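The data-reuse requirement can be made concrete with a rough count. This assumes that one step examines (n-1)(n-2)/2 opt-2 moves, a formula chosen because it reproduces the 4,851 moves per step quoted for 100-city problems, and that the per-climber data is the n-entry tour:

```latex
\begin{aligned}
\text{moves per step} &= \frac{(n-1)(n-2)}{2} = \frac{99 \cdot 98}{2} = 4851 \quad (n = 100) \\
\text{tour entries per climber} &= n = 100 \\
\text{moves per tour entry} &= \frac{(n-1)(n-2)}{2n} = \frac{4851}{100} \approx 48.5
\end{aligned}
```

Under these assumptions each tour entry is used by dozens of move evaluations per step, and the ratio grows linearly with n.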
Using a GPU to solve TSP
▪ Assuming 100-city problems and 100,000 climbers
▪ Climbers are independent and can be run in parallel
  ▪ Pro: plenty of data parallelism
  ▪ Con: potential load imbalance, because climbers need different numbers of steps to reach a local minimum
▪ Every step determines the best of 4,851 opt-2 moves
  ▪ Same control flow (but different data)
  ▪ Coalesced memory access patterns
  ▪ O(n²) operations on O(n) data
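One loop nest that reproduces the count of 4,851 moves for 100 cities is sketched below; the indexing is an assumption, not necessarily the authors' loop. With `tour[0]` fixed, position pairs i = 1..n-2 and j = i+1..n-1 give (n-1)(n-2)/2 pairs. Exactly one of them (i = 1, j = n-1) removes the two edges touching `tour[0]` and so leaves the set of edges unchanged, which means n(n-3)/2 = 4,850 pairs are genuine moves. The function names are hypothetical.

```c
#include <assert.h>

/* Number of (i, j) pairs visited by the loop nest
   i = 1..n-2, j = i+1..n-1 over a tour stored as tour[0..n-1]. */
long count_2opt_pairs(int n) {
    assert(n >= 3);
    long count = 0;
    for (int i = 1; i <= n - 2; i++)
        for (int j = i + 1; j <= n - 1; j++)
            count++;
    return count;  /* equals (n-1)(n-2)/2 */
}

/* Pairs that actually change the set of edges: every pair except
   (i = 1, j = n-1), whose two removed edges both touch tour[0]. */
long count_genuine_2opt_moves(int n) {
    assert(n >= 4);
    return count_2opt_pairs(n) - 1;  /* equals n(n-3)/2 */
}
```

Since every climber runs the same nest, the threads of a warp follow the same control flow while reading different tours; only the number of steps per climber varies, which is the source of the load imbalance.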
Optimizations - code
● Main code section: finding best opt-2 move
○ Doubly nested loop
■ Only computes difference in tour length, not absolute length
○ Highly optimized to minimize memory accesses
■ “Caches” rest of data in registers
■ Requires only 6 clock cycles per move on a Xeon CPU core
○ Each climber's local minimum is compared to the best solution so far
■ The best solution is updated if needed; otherwise the tour is discarded
○ Other small optimizations
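The code optimizations can be illustrated with a C sketch of the best-move search. Assumptions (not from the slides): the distance matrix is a row-major n × n `float` array, the tour is stored with a sentinel entry `tour[n] == tour[0]` so the inner loop needs no wrap-around test, and `Move`, `best_move`, and `update_best` are hypothetical names. Row pointers and d(a, b) are hoisted out of the inner loop into locals that a compiler can keep in registers, and `tour[j + 1]` is carried over as the next iteration's `tour[j]`, so each inner iteration loads only one new tour entry.

```c
#include <assert.h>

/* Best opt-2 move found in one step: change in length and positions i, j. */
typedef struct { float delta; int i, j; } Move;

/* Find the best opt-2 move using only length differences.
   tour has n + 1 entries with tour[n] == tour[0] (sentinel). */
Move best_move(const float *dist, const int *tour, int n) {
    assert(n >= 4 && tour[n] == tour[0]);
    Move best = {0.0f, -1, -1};
    for (int i = 1; i <= n - 2; i++) {
        const int a = tour[i - 1], b = tour[i];
        const float *row_a = &dist[a * n];  /* distances from a, hoisted */
        const float *row_b = &dist[b * n];  /* distances from b, hoisted */
        const float d_ab = row_a[b];
        int c = tour[i + 1];                 /* tour[j] for the first j */
        for (int j = i + 1; j <= n - 1; j++) {
            const int d = tour[j + 1];       /* the only new tour load */
            const float delta = row_a[c] + row_b[d] - d_ab - dist[c * n + d];
            if (delta < best.delta) { best.delta = delta; best.i = i; best.j = j; }
            c = d;                           /* becomes tour[j] next time */
        }
    }
    return best;  /* best.i < 0 means no move shortens the tour */
}

/* Compare a climber's local minimum with the best tour so far:
   copy it if shorter, otherwise the climber's tour is simply discarded. */
int update_best(float len, const int *tour, int n,
                float *best_len, int *best_tour) {
    if (len >= *best_len) return 0;
    *best_len = len;
    for (int k = 0; k < n; k++) best_tour[k] = tour[k];
    return 1;
}
```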
Optimizations - GPU
● Random tours generated in parallel on GPU
○ Minimizes data transfer to GPU
● 2D distance matrix resident in shared memory
○ Ensures hits in software-controlled fast data cache
● Tours copied to local memory in chunks of 1024
○ Enables accessing them with coalesced loads & stores
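The first and third GPU optimizations can be sketched in C, standing in for per-thread CUDA code. Assumptions: each climber seeds its own small xorshift32 generator from its climber ID, so no random tours need to be copied from the host; and tours are addressed in an interleaved layout where entry k of climber t lives at k * num_climbers + t. CUDA organizes per-thread local memory in a similar interleaved way, which is why identical accesses by the threads of a warp coalesce. The generator, the seeding constant, and the function names are illustrative choices, not the authors'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32 step; the state must never be zero. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Build one climber's random tour as a single GPU thread would: Fisher-Yates
   driven by a generator seeded from the climber ID (slight modulo bias accepted). */
void climber_random_tour(int *tour, int n, uint32_t climber_id) {
    uint32_t state = climber_id * 2654435761u + 1u;  /* spread nearby IDs */
    if (state == 0) state = 1;                       /* xorshift needs non-zero */
    for (int k = 0; k < n; k++) tour[k] = k;
    for (int k = n - 1; k > 0; k--) {
        int r = (int)(xorshift32(&state) % (uint32_t)(k + 1));
        int t = tour[k]; tour[k] = tour[r]; tour[r] = t;
    }
}

/* Interleaved layout: entry k of climber t. Threads t, t+1, ... reading the
   same k touch consecutive addresses, so a warp's accesses coalesce. */
size_t interleaved_index(int k, int t, int num_climbers) {
    return (size_t)k * (size_t)num_climbers + (size_t)t;
}
```

Because the seed depends only on the climber ID, a climber's initial tour can be regenerated on the device at any time without storing or transferring it.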
Evaluation Method
● Hardware
○ NVIDIA Tesla C2050 GPU
○ (1.15 GHz, 14 SMs with 32 PEs each = 448 PEs, 3 GB global memory)
○ Nautilus supercomputer (2.0 GHz 8-core Xeon X7550 CPUs sharing 4 TB of main memory)
● Data
○ Five 100-city inputs from TSPLIB
● Implementations
○ CUDA (GPU), Pthreads (CPU), serial C (CPU)
○ All three use almost identical code for finding the best opt-2 move
Results - Runtime Comparison
● The GPU is 7.8x faster than an 8-core CPU
● One GPU chip is as fast as 16 or 32 CPU chips
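As a rough consistency check of the 7.8x figure, assume that the 6 clock cycles per move quoted for the optimized CPU code apply to the 2.0 GHz X7550 cores, that the 8 cores scale perfectly, and take the summary's "almost 20 billion" GPU evaluations per second as 2.0 × 10^10:

```latex
\begin{aligned}
\frac{2.0 \times 10^{9}\ \text{cycles/s}}{6\ \text{cycles/move}} &\approx 3.3 \times 10^{8}\ \text{moves/s per core} \\
8 \times 3.3 \times 10^{8} &\approx 2.7 \times 10^{9}\ \text{moves/s per 8-core CPU} \\
\frac{2.0 \times 10^{10}}{8 \times \left(2.0 \times 10^{9} / 6\right)} &= 7.5
\end{aligned}
```

The estimate of about 7.5x is close to the measured 7.8x; it is only a sanity check, since real CPU throughput also depends on memory and NUMA effects.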
Speedup over Serial
● Pthreads code scales well up to 32 threads (4 CPUs)
● CPU performance fluctuates (NUMA effects), while GPU performance is stable
Results - Solution Quality
● Optimal tour found in 4 of 5 cases with 100,000 climbers
○ With 200,000 climbers, the best solution is also found in the fifth case
● Runtime is independent of the input and linear in the number of climbers
Summary
▪ TSP_GPU algorithm
▪ Highly optimized implementation for GPUs
▪ Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
▪ Produces high-quality results
▪ May be better suited for GPUs than Ant Colony Optimization and genetic algorithms (GAs)
Any Questions?
Thank You