A Parallel GPU Version of the Traveling Salesman Problem (slides)


This document summarizes a parallel GPU implementation of the traveling salesman problem (TSP). The TSP algorithm used is an iterative hill climbing search that generates random initial tours and refines them through opt-2 moves until a local minimum is reached. The algorithm was optimized for GPUs by distributing independent climbers across threads and minimizing memory accesses through techniques like caching distance matrices in shared memory. Evaluation on a Tesla GPU found it was 7.8x faster than an 8-core CPU implementation and produced optimal tours in 4 out of 5 test cases using 100,000-200,000 climbers.


A Parallel GPU Version of the Traveling Salesman Problem
By Molly A. O’Neil, Dan Tamir and Martin Burtscher
Presented By
Rukshan Siriwardhane (148208V)
Vimukthi Wickramasinghe (148245F)

Outline
● The Traveling Salesman Problem
● The TSP algorithm used
● Using a GPU to solve TSP
● Optimizations used
● Evaluation method
● Results

The Traveling Salesman Problem
Definition
- Given n cities, find the shortest Hamiltonian tour that visits all of them
● Combinatorial optimization problem
○ Eg: Finding effective drilling arm movement, best routing, logistics etc.
● A brute force search in the solution space is not feasible
● Usually expressed as a graph problem
○ Complete, undirected, planar, Euclidean graph is used
○ Vertices represent cities
○ Edge weights reflect distances or costs
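The graph formulation above can be sketched in a few lines. This is a minimal illustration, assuming cities are given as 2D coordinates and edge weights are plain Euclidean distances; the function names and the sample coordinates are hypothetical and do not come from the slides.

```python
import math

def distance_matrix(coords):
    """Return a symmetric n x n matrix of Euclidean distances between cities."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        xi, yi = coords[i]
        for j in range(i + 1, n):
            xj, yj = coords[j]
            d = math.hypot(xi - xj, yi - yj)
            dist[i][j] = d  # undirected graph: same weight in both directions
            dist[j][i] = d
    return dist

def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the first city."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

# Four cities on the corners of a 3 x 4 rectangle (made-up coordinates).
coords = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = distance_matrix(coords)
perimeter = tour_length([0, 1, 2, 3], dist)  # visits the corners in order
crossing = tour_length([0, 2, 1, 3], dist)   # uses both diagonals
```

A tour is stored as a permutation of city indices, so the tour itself is O(n) data while the complete graph has O(n^2) edge weights.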

The TSP Algorithm used
● Finding an optimal solution is NP-hard
○ Heuristic algorithms are used to find an approximate solution
● Here an iterative hill climbing search algorithm is used
○ Generate k random initial tours (k climbers)
○ Iteratively refine them until a local minimum is reached
● In each iteration, apply the best opt-2 move
○ Find the best pair of edges (a, b) and (c, d) such that replacing them with (a, d) and (b, c) minimizes the tour length
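The hill climbing search described on this slide can be sketched serially as follows. This is an illustrative sketch, not the authors' code: the function names, the small tolerance used to detect a local minimum, and the example input (eight points on a circle) are all assumptions. The edge labels follow the slide, with edge (c, d) traversed from d to c so that the replacement edges are (a, d) and (b, c).

```python
import math
import random

EPS = 1e-9  # assumed tolerance: smaller gains are treated as "no improvement"

def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

def best_move(tour, dist):
    """Return (delta, i, j) for the opt-2 move that shortens the tour the most."""
    n = len(tour)
    best = (-EPS, -1, -1)
    for i in range(n - 2):
        a, b = tour[i], tour[i + 1]            # loaded once per outer iteration
        for j in range(i + 2, n):
            d, c = tour[j], tour[(j + 1) % n]  # edge (c, d), traversed d -> c
            delta = dist[a][d] + dist[b][c] - dist[a][b] - dist[c][d]
            if delta < best[0]:
                best = (delta, i, j)
    return best

def climb(tour, dist):
    """Apply the best opt-2 move repeatedly until a local minimum is reached."""
    while True:
        _, i, j = best_move(tour, dist)
        if i < 0:
            return tour
        # Reversing the segment between the two edges performs the move.
        tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]

def hill_climb(dist, climbers, seed=0):
    """Run independent climbers from random tours and keep the best result."""
    rng = random.Random(seed)
    n = len(dist)
    best_tour, best_len = None, math.inf
    for _ in range(climbers):
        tour = list(range(n))
        rng.shuffle(tour)                      # random initial tour
        climb(tour, dist)
        length = tour_length(tour, dist)
        if length < best_len:                  # otherwise the tour is discarded
            best_tour, best_len = tour, length
    return best_tour, best_len

# Made-up input: eight cities evenly spaced on the unit circle.
pts = [(math.cos(2 * math.pi * k / 8), math.sin(2 * math.pi * k / 8)) for k in range(8)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
best_tour, best_len = hill_climb(dist, climbers=20)
```

Each iteration of the loop in `hill_climb` is independent of the others, which is the property that lets the climbers be distributed across GPU threads.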

Using a GPU to solve TSP
● Parallelism
○ More than 10,000 threads
● Memory access regularity
○ Sets of 32 threads need to have good access to memory
● Code regularity
○ Sets of 32 threads need to follow the same control flow
● Data reuse
○ At least O(n^2) operations on O(n) data

Using a GPU to solve TSP
▪ Assuming 100-city problems & 100,000 climbers
▪ Climbers are independent, can be run in parallel
▪ Pro - Plenty of data parallelism
▪ Con - Potential load imbalance
▪ Different number of steps required to reach local minimum
▪ Every step determines best of 4851 opt-2 moves
▪ Same control flow (but different data)
▪ Coalesced memory access patterns
▪ O(n^2) operations on O(n) data
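The figure of 4851 moves per step for 100 cities is consistent with the closed form (n - 1)(n - 2) / 2. The short check below counts the iterations of one doubly nested loop that yields this number; the loop bounds are an assumption chosen to match the count and are not taken from the slides.

```python
def count_moves(n):
    """Count the edge pairs (i, j) visited by a nested loop with j >= i + 2."""
    count = 0
    for i in range(n - 2):
        for j in range(i + 2, n):
            count += 1
    return count

n = 100
moves_per_step = count_moves(n)
closed_form = (n - 1) * (n - 2) // 2
```

The number of candidate moves grows quadratically with n, while the tour each climber works on holds only n city indices.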

Optimizations - code
● Main code section: finding best opt-2 move
○ Doubly nested loop
■ Only computes difference in tour length, not absolute length
○ Highly optimized to minimize memory accesses
■ “Caches” rest of data in registers
■ Requires only 6 clock cycles per move on a Xeon CPU core
○ Local minimum compared to best solution so far
■ Best solution updated if needed, otherwise tour is discarded
○ Other small optimizations
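The point about computing only the difference in tour length can be checked directly: an opt-2 move changes exactly two edges, so its effect follows from four distance lookups rather than a full recomputation. The sketch below compares the two on a small instance. The helper names and the random symmetric integer distances are made up for illustration, and the check relies on the distances being symmetric.

```python
import random

def tour_length(tour, dist):
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

def move_delta(tour, dist, i, j):
    """Change in tour length for the opt-2 move on edges i and j (four lookups)."""
    n = len(tour)
    a, b = tour[i], tour[i + 1]
    d, c = tour[j], tour[(j + 1) % n]  # edge (c, d), traversed d -> c
    return dist[a][d] + dist[b][c] - dist[a][b] - dist[c][d]

def apply_move(tour, i, j):
    """Return a copy of the tour with the segment between the two edges reversed."""
    new = tour[:]
    new[i + 1:j + 1] = new[i + 1:j + 1][::-1]
    return new

rng = random.Random(1)
n = 12
dist = [[0] * n for _ in range(n)]
for u in range(n):
    for v in range(u + 1, n):
        dist[u][v] = dist[v][u] = rng.randint(1, 100)  # symmetric, made-up weights

tour = list(range(n))
rng.shuffle(tour)
before = tour_length(tour, dist)

# Compare the four-lookup delta against a full O(n) recomputation for every move.
mismatches = 0
for i in range(n - 2):
    for j in range(i + 2, n):
        after = tour_length(apply_move(tour, i, j), dist)
        if after - before != move_delta(tour, dist, i, j):
            mismatches += 1
```

Integer weights are used here so that the comparison is exact; with floating-point distances the two values would agree only up to rounding.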

Optimizations - GPU
● Random tours generated in parallel on GPU
○ Minimizes data transfer to GPU
● 2D distance matrix resident in shared memory
○ Ensures hits in software-controlled fast data cache
● Tours copied to local memory in chunks of 1024
○ Enables accessing them with coalesced loads & stores

Evaluation Method
● Hardware
○ NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs each, 3 GB global memory)
○ Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons, sharing 4 TB main memory)
● Data
○ Five 100-city inputs from TSPLIB
● Implementations
○ CUDA (GPU), Pthreads (CPU), serial C (CPU)
○ Use almost identical code for finding best opt-2 move

Results - Runtime Comparison
● GPU is 7.8x faster than CPU with 8 cores
● One GPU chip is as fast as 16 or 32 CPU chips

Speedup over Serial
● Pthreads code scales well up to 32 threads (4 CPUs)
● CPU performance fluctuates (NUMA), GPU stable

Results - Solution Quality
● Optimal tour found in 4 of 5 cases with 100,000 climbers
○ 200,000 climbers find best solution in fifth case
● Runtime independent of input and linear in climbers

Summary
▪ TSP_GPU algorithm
▪ Highly optimized implementation for GPUs
▪ Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
▪ Produces high-quality results
▪ May be better suited for GPUs than Ant Colony Optimization and genetic algorithms (GAs)

This document summarizes research on implementing robust parallel adaptive smoothing algorithms for image processing on CPU and GPU/MPI systems. It presents adaptive smoothing techniques, CPU and GPU/MPI implementations of the algorithms, results comparing performance of the implementations, and conclusions that the GPU implementation provides over a 100x speed up for processing full resolution images compared to the CPU implementation.

Toolchain for real-time simulations: GSN-MeteoIO-GEOtop

Riccardo Rigon

This document describes a real-time toolchain for collecting sensor data through GSN, feeding it into the GEOtop hydrological model, and publishing the real-time simulation results. Key aspects include making GEOtop capable of starting, pausing and resuming simulations using recovery files, accessing sensor data through the MeteoIO library and its GSNIO plugin, and demonstrating the end-to-end workflow of retrieving real-time data and running GEOtop simulations. Future work includes enhancing GSN and the models to better support real-time gridded data and multiple concurrent users.

Oss vs drive test pdcp throughput

Subhash Kumar

Automating Complex Airport Operations with WSO2 Middleware Platform

WSO2

Operations conducted in a typical airport are diversified and often span multiple information systems. Complex yet interdependent processes such as flight check-in, flight information maintenance, cleaning crew operations and waiting time prediction raise the requirement of operating a connected business. These requirements have made the airport operators choose comprehensive enterprise middleware platforms over isolated software products. In this session we will explore the features of WSO2’s comprehensive 100% open source middleware platform which enables successful automation of complex airport operations. We will do so with a few example processes from two information domains such as individuals and airport services. This webinar will also discuss the following topics: Introducing airport operations and the requirements of a connected business Providing real-time information to individuals Integrating diversified information for successful flight operations Predicting the time taken for regular airport services

Introsort or introspective sort

Eftykhar Mahmud

Introsort is a hybrid sorting algorithm that begins with quicksort and switches to heapsort when the recursion depth exceeds a level based on the number of elements. This avoids quicksort's potential O(n^2) worst-case runtime. It also uses insertion sort for small element subsets. Introsort provides optimal average O(n log n) performance while ensuring quicksort cannot exceed its worst-case bounds.

QUANTUM COMP 22

Tejasri Jampani

This document discusses quantum computation and its advantages over classical computation. Quantum computation uses quantum bits (qubits) that can exist in superpositions of states rather than just 1s and 0s. This allows quantum processors to perform multiple computations simultaneously. While challenging to implement physically, quantum algorithms like Shor's algorithm could solve certain problems like integer factorization vastly faster than any classical computer. Nanotechnology is needed to build qubits that can maintain coherent superpositions, potentially enabling the construction of a functional quantum computer.

Coupling ecophysiological data with models_Whitley

TERN Australia

The document describes equations and relationships related to photosynthesis and plant productivity. It defines key terms like NPP, GPP, Rd, and presents equations showing their relationships to factors like Vcmax, leaf nitrogen concentration, temperature and precipitation. It also includes graphs showing correlations between leaf nitrogen content, photosynthetic rates and environmental variables.

Spark meets Telemetry

Roberto Agostino Vitillo

This document discusses using Spark to analyze telemetry data from Mozilla. It describes how telemetry data is currently collected and processed using a map-reduce framework in Python. It then introduces Spark as a faster alternative that supports interactive analysis, machine learning, and other capabilities. The document demonstrates launching a Spark cluster on EC2 to analyze Mozilla telemetry data in an interactive tutorial.

This document discusses an optimized Graph500 implementation using the DISLIB library. Key points: - DISLIB is a library for message passing and synchronization that aims to improve productivity for Graph500 and other applications. - The Graph500 algorithm uses a simple 1D decomposition with a CRS graph storage format and BFS kernel. DISLIB handles complexity like message aggregation and synchronization. - For the June 2011 submissions, DISLIB was built on DCMF for BlueGene/P and GASNET for InfiniBand clusters. The November 2011 submission used MPI with improved tuning.

Meteo I/O Introduction

Riccardo Rigon

MeteoIO is a meteorological data input/output library that aims to remove IO functions from physics engines, make IO easy to use and robust, and simplify data preparation. It supports reading data from various sources into standardized data structures and writing output. Plugins allow adding new data sources. It performs operations like data filtering, unit conversions, and spatial and temporal interpolation to produce complete meteorological datasets.

cnsm2011_slide

rerngvit yanggratoke

This document summarizes an approach called Gossip-based Resource Allocation for Green Computing in Large Clouds. The approach uses a distributed middleware architecture and gossip-based algorithms to dynamically consolidate virtual machines on the minimum number of active servers for energy efficiency. It aims to maximize utility and minimize power consumption and reconfiguration costs in large cloud environments with over 100,000 machines. The Generic Resource Management Protocol (GRMP) is presented as a scalable solution that can be instantiated in different ways, such as GRMP-Q, to achieve server consolidation under low load and fair allocation under high load. Simulation results show GRMP-Q reduces power consumption while maintaining satisfied demand and fairness across large systems.

Introduction to Date and Time API 3

Kenji HASUNUMA

This document provides an introduction to Java's Date and Time API. It discusses key classes like LocalDate, LocalDateTime, and ZonedDateTime that represent dates, times, and timestamps. It also covers formatting and parsing dates and times using DateTimeFormatter, and manipulating dates and times using factory and conversion methods. The document concludes that the API provides a powerful way to perform date and time calculations while modeling the ISO 8601 standard.

Slides meyer116

prettygully

The era of genomic evaluation has brought the need to perform computations involving large, dense matrices. Particular tasks are the computation and inversion of the genomic relationship matrix. This paper investigates the suitability of Graphics Processing Units together with highly optimised software libraries for these computations, using blocked algorithms. It is shown that calculations are readily sped up by parallel processing, using freely available library routines, and that reductions in time by factors of 4 to 5 are achievable even for `consumer' grade graphics cards.

Log Event Stream Processing In Flink Way

George T. C. Lai

Logs are one of the most important sources to monitor and reveal some significant events of interest. In this presentation, we introduced an implementation of log streams processing architecture based on Apache Flink. With fluentd, different kinds of emitted logs are collected and sent to Kafka. After having processed by Flink, we try to build a dash board utilizing elasticsearch and kibana for visualization.

Quantum Machine Learning for IBM AI

Sasha Lazarevic

The document discusses various machine learning algorithms that have been implemented using quantum computing techniques, including linear regression, ridge regression, perceptron, support vector machines, k-means clustering, principal component analysis, autoencoders, restricted Boltzmann machines, deep neural networks, generative adversarial networks, reinforcement learning, and Monte Carlo sampling. It also describes common quantum algorithms like quantum phase estimation, quantum amplitude estimation, quantum Fourier transform, and variational quantum algorithms. Finally, it discusses approaches for encoding classical data onto quantum states for processing, including using basis states, amplitudes, and rotation angles.

Rtl design optimizations and tradeoffs

Grace Abraham

This document discusses several optimizations and tradeoffs that can be made during RTL design of datapaths, including pipelining, concurrency, component allocation, operator binding, and operator scheduling. It provides examples of applying these techniques to optimize designs for FIR filters and SAD computation. Additionally, it discusses a multiple clocking scheme for low-power RTL design and an input space adaptive design methodology for optimizing energy and performance.

Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...

Filipo Mór

This document discusses parallelization strategies for N-body simulations on multicore architectures. It describes GPU and multicore CPU implementations of an N-body simulation algorithm. For the GPU implementation, data is transferred between global GPU memory and shared memory buffers while computation is performed in parallel. For the multicore CPU implementation, the serial code was parallelized with OpenMP directives to distribute work across CPU cores without data transfers. Computational costs and speedups of the different approaches are analyzed.

Public Wi-Fi

Hiroshi Mano

- The document summarizes the results of monitoring a public Wi-Fi channel (CH 6) for 300 seconds at Shinjuku station in Tokyo on October 11, 2011 between 18:00-18:05. - It found that beacon and probe request packets consumed approximately 1/4 of the total time, with beacons using 21.17 seconds (7.06% of occupancy rate) and probe requests using 10.75 seconds (3.58% of occupancy rate). - Probe response packets occupied 44.22 seconds, or 14.74% of the total time monitored.

OSM Cycle Map

gravitystorm

The document summarizes the OpenStreetMap cycle map, which contains over 47,000 km of mapped cycle routes. It describes how the map is generated from OpenStreetMap data and rendered into image tiles, then hosted on a website. Over 1 million image tiles totaling 23 GB are generated, taking around 5-6 hours. In June 2008, there were nearly 8,000 visitors to the site. Future plans include adding more detailed cycling route information and developing printed map versions.

Advanced Tracing features using GDB and LTTng

marckhouzam

This document discusses advanced tracing features using GDB and LTTng. It provides an overview of the Linux Tracing Toolkit next generation (LTTng), which allows highly efficient full system tracing. LTTng includes kernel tracing, user-space tracing, a trace viewer, and trace streaming. It also discusses LTTng's Eclipse integration, which provides perspectives and views for visualizing traces, as well as upcoming features like integrating GDB tracepoints. The document concludes by explaining how to get Eclipse set up for tracing in under a minute.

Peer sim (p2p network)

Hein Min Htike

PeerSim is an open-source simulation engine used to model peer-to-peer networks with up to 1 million nodes. It has two simulation modes: event-driven and cycle-driven. The document describes using PeerSim to create new protocols for finding the minimum and maximum values in a network, including implementing MaxFunction and MinFunction classes, observer classes, and configuration files to simulate networks of different sizes and topologies. Key results showed that networks with more connections converged faster, and removing nodes dynamically affected the minimum and maximum values over time.

Cs403 Parellel Programming Travelling Salesman Problem

Jishnu P

This document discusses parallelizing the traveling salesman problem. It proposes two approaches: 1) dividing the permutations of routes among threads, having each calculate costs and return the minimum, and 2) sharing a variable for minimum cost across threads and updating it concurrently as threads evaluate routes. Pseudocode is provided for both approaches. Speedup calculations assume ideal parallelization but also account for reductions from unconnected cities.

Optimized Multi-agent Box-pushing - 2017-10-24

Aritra Sarkar

The document summarizes an internship project involving optimized collaborative box-pushing using multiple agents. The project aimed to implement a distributed system on a Raspberry Pi-based embedded system to minimize completion time while maximizing accuracy. Algorithms like A* and D*-Lite were used for path planning, and approaches like Hopcroft-Karp and Hungarian were employed for agent assignment. Issues around localization errors were addressed through techniques like encoder feedback and motion retiming. The project evolved from simulation to hardware implementation with three networked rovers performing box pushing in test environments.

진동데이터 활용 충돌체 탐지 AI 경진대회 1등

DACON AI 데이콘

The document summarizes the approach taken to detect collisions by analyzing vibration data in a machine learning competition hosted on Dacon.io. It describes initial data processing steps including Fourier transforms and calculating onset times. Models tested included XGB, DNNs, and 1D CNNs applied directly to raw waveform data. Key aspects that improved performance were using L1 loss with Adam optimization, quantile transformations of the output, and developing separate models for position, mass and velocity rather than a single model. The best score achieved was around 0.0015-0.0017.

Travelling salesman problem

Dimitris Mavrommatis

This document discusses implementing a brute force algorithm to solve the travelling salesman problem (TSP) using GPUs. TSP involves finding the shortest route to visit each city once and return to the origin city. The author details dividing the problem across GPU blocks, threads, and permutations to calculate all routes within memory limits. Shared memory is used to find the shortest path within each block, while global memory tracks the overall shortest path across blocks. Testing showed GPUs can efficiently solve large TSP problems due to parallelizing many small route calculations. The key challenge was dividing the problem suitably for massive parallel GPU processing.

Terrain Rendering usingGPU-Based Geometry Clipmaps

none299359

Mirko Damiani - An Embedded soft real time distributed system in Go

linuxlab_conf

[論文読み会資料] Asynchronous Bidirectional Decoding for Neural Machine Translation

Hayahide Yamagishi

Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model

Koichi Shirahata

This document analyzes the performance of lattice quantum chromodynamics (QCD) simulations using the asynchronous partitioned global address space (APGAS) programming model on GPUs. It implements lattice QCD in X10 CUDA and compares performance to other implementations. Results show a 19.4x speedup from using X10 CUDA on 32 nodes of the TSUBAME 2.5 supercomputer compared to the original X10 implementation. Optimizations like data layout transformation and communication overlapping contributed to this acceleration.

Robotics: Vision-Aided Navigation and Motion Path Planning on Low-End Android...

Nevada County Tech Connection

This presentation provides an overview of ACME Robotics' 2017-2018 robotics team. It discusses their involvement in the FIRST and FTC robotics competitions. It then details how the team uses computer vision with OpenCV to implement tasks like jewel and cryptobox detection. It explains concepts like image representation, thresholding, color spaces, contours and morphological operations. Finally, it discusses the team's use of motion profiling and PID control to implement accurate motion control for their robot subsystems.

What's hot

Graph 500 DISLIB powered optimized version

Anton Korzh

Meteo I/O Introduction

Riccardo Rigon

cnsm2011_slide

rerngvit yanggratoke

Introduction to Date and Time API 3

Kenji HASUNUMA

Slides meyer116

prettygully

Log Event Stream Processing In Flink Way

George T. C. Lai

Quantum Machine Learning for IBM AI

Sasha Lazarevic

Rtl design optimizations and tradeoffs

Grace Abraham

Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...

Filipo Mór

Public Wi-Fi

Hiroshi Mano

OSM Cycle Map

gravitystorm

Advanced Tracing features using GDB and LTTng

marckhouzam

Peer sim (p2p network)

Hein Min Htike

What's hot (13)

Graph 500 DISLIB powered optimized version

Meteo I/O Introduction

cnsm2011_slide

Introduction to Date and Time API 3

Slides meyer116

Log Event Stream Processing In Flink Way

Quantum Machine Learning for IBM AI

Rtl design optimizations and tradeoffs

Parallelization Strategies for Implementing Nbody Codes on Multicore Architec...

Public Wi-Fi

OSM Cycle Map

Advanced Tracing features using GDB and LTTng

Peer sim (p2p network)

Similar to A parallel gpu version of the traveling salesman problem slides

Cs403 Parellel Programming Travelling Salesman Problem

Jishnu P

Optimized Multi-agent Box-pushing - 2017-10-24

Aritra Sarkar

진동데이터 활용 충돌체 탐지 AI 경진대회 1등

DACON AI 데이콘

Travelling salesman problem

Dimitris Mavrommatis

Terrain Rendering usingGPU-Based Geometry Clipmaps

none299359

Mirko Damiani - An Embedded soft real time distributed system in Go

linuxlab_conf

[論文読み会資料] Asynchronous Bidirectional Decoding for Neural Machine Translation

Hayahide Yamagishi

Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model

Koichi Shirahata

Robotics: Vision-Aided Navigation and Motion Path Planning on Low-End Android...

Nevada County Tech Connection

Esd module2

SOURAV KUMAR

This document discusses processor design, including custom single-purpose processors and general-purpose processors. It covers topics such as combinational and sequential logic design, finite state machine design, optimizing custom processors by improving the original program, finite state machine with datapath, and datapath and finite state machine. General-purpose processors are also introduced, including their basic architecture consisting of a control unit and datapath.

Aa sort-v4

Malithi Edirisinghe

The document proposes a new parallel sorting algorithm called AA-sort that is suitable for multi-core SIMD processors. AA-sort avoids unaligned memory accesses to fully utilize SIMD instructions. It sorts data in-core using an optimized combsort and then merges sorted blocks out-of-core in parallel. Experimental results on PowerPC and Cell processors show AA-sort has better scalability and performance than other algorithms both sequentially and in parallel.

Prediction of taxi rides ETA

Daniel Marcous

Comparing pregel related systems

Prashant Raaghav

The document summarizes preliminary results from a project comparing the performance of open source implementations of Pregel and related graph processing systems (Hama, Giraph, GPS) on single-source shortest path (SSSP) and PageRank algorithms. Initial results show that Hama does not scale well to larger graphs, while Giraph and GPS scale better. Further analysis of memory usage, network traffic, additional systems like GraphLab and Signal/Collect, and using Green-Marl to generate code for Giraph and GPS is still in progress.

Monte Carlo G P U Jan2010

John Holden

Monte Carlo simulation is one of the most important numerical methods in financial derivative pricing and risk management. Due to the increasing sophistication of exotic derivative models, Monte Carlo becomes the method of choice for numerical implementations because of its flexibility in high-dimensional problems. However, the method of discretization of the underlying stochastic differential equation (SDE) has a significant effect on convergence. In addition the choice of computing platform and the exploitation of parallelism offers further efficiency gains. We consider here the effect of higher order discretization methods together with the possibilities opened up by the advent of programmable graphics processing units (GPUs) on the overall performance of Monte Carlo and quasi-Monte Carlo methods.

In datacenter performance analysis of a tensor processing unit

Jinwon Lee

Distributed implementation of a lstm on spark and tensorflow

Emanuel Di Nardo

Adaptive indexing throttling

Arpit Jain

This document proposes an architecture for adaptively throttling indexing to improve query responsiveness. It describes testing an in-house throttling algorithm, AIMD, Gradient2, and TCP Vegas algorithms. TCP Vegas processed all updates fastest at 16 minutes while maintaining lower response times than other algorithms. Areas for improvement include reliability, intelligent limit allocation between update types, failover strategies, and metrics for Redis.

Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...

Tokyo Institute of Technology

Multicore architectures

Muhammet SOYTÜRK

SSCCIP Final Presentation (The Spartans)

Derek J. Russell

The document describes the design and implementation of a rover called S.P.A.R.T.A.N. that was built to collect data samples and traverse terrain. Key aspects include using sensors to detect changes in CO2, temperature, magnetic fields, and read RFID tags. The rover uses a rock crawler chassis with a robotic arm to pick up samples. An Arduino controls movement and sensors while communicating data to a computer via XBee. The design successfully enabled terrain traversal and sensor data collection, though some components like the electromagnet did not work as intended.

Similar to A parallel gpu version of the traveling salesman problem slides (20)

Cs403 Parellel Programming Travelling Salesman Problem

Optimized Multi-agent Box-pushing - 2017-10-24

진동데이터 활용 충돌체 탐지 AI 경진대회 1등

Travelling salesman problem

Terrain Rendering usingGPU-Based Geometry Clipmaps

Mirko Damiani - An Embedded soft real time distributed system in Go

[論文読み会資料] Asynchronous Bidirectional Decoding for Neural Machine Translation

Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model

Robotics: Vision-Aided Navigation and Motion Path Planning on Low-End Android...

Esd module2

Aa sort-v4

Prediction of taxi rides ETA

Comparing pregel related systems

Monte Carlo G P U Jan2010

In datacenter performance analysis of a tensor processing unit

Distributed implementation of a lstm on spark and tensorflow

Adaptive indexing throttling

Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...

Multicore architectures

SSCCIP Final Presentation (The Spartans)

More from Vimukthi Wickramasinghe

Beanstalkg

Vimukthi Wickramasinghe

This document provides an overview of BeanstalkG, an open source work queue system written in Go that is based on the Beanstalkd work queue but aims to add additional features like high availability. It discusses the motivation for creating BeanstalkG due to issues with lack of high availability in Beanstalkd. It then covers key aspects of work queues, examples of existing work queue systems, the state model and commands of Beanstalkd, and the architecture and workflow of BeanstalkG including how producers, workers, tubes and backends are handled. It concludes with a demo, roadmap which includes implementing high availability, and a call for contributions.

pgdip-project-report-final-148245F

Vimukthi Wickramasinghe

This document describes the development of Visuo, a deep learning video search engine. It begins with an introduction discussing the importance of machine understanding of video content as video becomes a primary form of storing human knowledge. It then provides an outline of the report and introduces Visuo as the goal of the research. The document contains sections on literature review of content-based video retrieval and deep learning techniques for video as well as the proposed research methodology and architecture for Visuo.

Factored Operating Systems paper review

Vimukthi Wickramasinghe

The document discusses the scalability issues of contemporary operating systems as multi-core processors increase exponentially. It introduces the concept of a factored operating system (FOS) as an alternative designed for 1000+ core systems. FOS avoids using locks, separates OS and application resources, and replaces shared memory communication with messaging. The key aspects of FOS include running a microkernel on each core for fast messaging, having multiple server instances of each OS service to scale, and scheduling OS and applications on different cores. FOS aims to overcome the scalability limitations of current OS designs for future many-core chips.

Exploring Strategies for Training Deep Neural Networks paper review

Vimukthi Wickramasinghe

This document discusses strategies for training deep neural networks. It introduces stacked restricted Boltzmann machine networks and stacked autoencoder networks as two methods. For stacked restricted Boltzmann machine networks, individual layers of restricted Boltzmann machines are trained using contrastive divergence and then stacked together. For stacked autoencoder networks, layers of autoencoders are trained to minimize reconstruction loss and stacked in the same way. Experimental results show that pre-training deep networks layer-by-layer in an unsupervised manner helps learn more complex representations than single layer networks.

Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...

Vimukthi Wickramasinghe

A parallel gpu version of the traveling salesman problem slides

1. A Parallel GPU Version of the Traveling Salesman Problem
By Molly A. O’Neil, Dan Tamir and Martin Burtscher
Presented By
Rukshan Siriwardhane (148208V)
Vimukthi Wickramasinghe (148245F)

2. Outline
● The Traveling Salesman Problem
● The TSP algorithm used
● Using a GPU to solve TSP
● Optimizations used
● Evaluation method
● Results

3. The Traveling Salesman Problem
Definition: Given n cities, find the shortest Hamiltonian tour between the cities
● Combinatorial optimization problem
○ E.g., finding efficient drilling arm movement, best routing, logistics, etc.
● A brute-force search of the solution space is not feasible
● Usually expressed as a graph problem
○ Complete, undirected, planar, Euclidean graph is used
○ Vertices represent cities
○ Edge weights reflect distances or costs

4. The TSP Algorithm used
● Finding the optimal solution is NP-hard
○ Heuristic algorithms are used to find an approximate solution
● Here an iterative hill climbing search algorithm is used
○ Generate k random initial tours (k climbers)
○ Iteratively refine them until a local minimum is reached
● In each iteration, apply the best opt-2 move
○ Find the best pair of edges (a, b) and (c, d) such that replacing them with (a, d) and (b, c) minimizes tour length
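The climbing loop described on this slide can be sketched in Python as follows. This is a minimal serial illustration under stated assumptions, not the authors' CUDA code: the names `random_instance`, `best_two_opt_move`, `climb` and `hill_climb` are hypothetical, distances are assumed symmetric and rounded to integers, and the opt-2 move (more commonly written 2-opt) is applied by reversing a tour segment. With cities labeled in tour order a, b, ..., c, d, the reconnection that keeps a single tour is (a, c) and (b, d); the slide's (a, d) and (b, c) appears to correspond to labeling the second edge in the opposite direction.

```python
import math
import random


def random_instance(n, rng):
    """Hypothetical helper: n random cities with rounded integer distances."""
    pts = [(rng.randrange(1000), rng.randrange(1000)) for _ in range(n)]
    return [[int(math.dist(p, q) + 0.5) for q in pts] for p in pts]


def tour_length(tour, dist):
    """Length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def best_two_opt_move(tour, dist):
    """Return (delta, i, j) for the move that shortens the tour the most.

    delta is 0 (with i = j = -1) when no move improves the tour.
    """
    n = len(tour)
    best = (0, -1, -1)
    for i in range(n - 2):
        a, b = tour[i], tour[i + 1]
        for j in range(i + 2, n):
            c, d = tour[j], tour[(j + 1) % n]
            # Change in length if edges (a, b), (c, d) become (a, c), (b, d).
            delta = dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]
            if delta < best[0]:
                best = (delta, i, j)
    return best


def climb(dist, rng):
    """One climber: random initial tour, refined until a local minimum."""
    tour = list(range(len(dist)))
    rng.shuffle(tour)
    while True:
        delta, i, j = best_two_opt_move(tour, dist)
        if delta >= 0:
            return tour  # no improving move left: local minimum
        # Apply the move by reversing the segment between the two edges.
        tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]


def hill_climb(dist, k, rng):
    """Run k independent climbers and keep the shortest local minimum."""
    best = None
    for _ in range(k):
        tour = climb(dist, rng)
        if best is None or tour_length(tour, dist) < tour_length(best, dist):
            best = tour
    return best
```

For example, `hill_climb(random_instance(20, random.Random(0)), 50, random.Random(1))` returns the best of 50 local minima on a random 20-city instance. Integer distances make every accepted move strictly shorten the tour, so each climber is guaranteed to terminate.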

5. The TSP Algorithm used

6. Using a GPU to solve TSP
● Parallelism: more than 10,000 threads
● Memory access regularity: sets of 32 threads need to have good access to memory
● Code regularity: sets of 32 threads need to follow the same control flow
● Data reuse: at least O(n^2) operations on O(n) data

7. Using a GPU to solve TSP
▪ Assuming 100-city problems & 100,000 climbers
▪ Climbers are independent, can be run in parallel
▪ Pro - Plenty of data parallelism
▪ Con - Potential load imbalance
▪ Different number of steps required to reach local minimum
▪ Every step determines best of 4851 opt-2 moves
▪ Same control flow (but different data)
▪ Coalesced memory access patterns
▪ O(n^2) operations on O(n) data
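The figure of 4851 moves for a 100-city tour can be checked with a short count. The loop bounds below are an assumption about how the edge pairs are enumerated (the second edge starts at least two positions after the first); they are one enumeration that reproduces the slide's number, and `count_two_opt_moves` is a hypothetical name.

```python
from math import comb


def count_two_opt_moves(n):
    """Count the (i, j) edge pairs visited by a doubly nested opt-2 loop."""
    count = 0
    for i in range(n - 2):
        for j in range(i + 2, n):
            count += 1
    return count


# For n cities this is (n - 1)(n - 2) / 2, i.e. C(n - 1, 2).
print(count_two_opt_moves(100), comb(99, 2))
```

Every climber evaluates the same number of candidate moves per step, which is why all threads can follow the same control flow even though they work on different tours.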

8. Optimizations - code
● Main code section: finding best opt-2 move
○ Doubly nested loop
■ Only computes difference in tour length, not absolute length
○ Highly optimized to minimize memory accesses
■ “Caches” rest of data in registers
■ Requires only 6 clock cycles per move on a Xeon CPU core
○ Local minimum compared to best solution so far
■ Best solution updated if needed, otherwise tour is discarded
○ Other small optimizations
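The point about computing only the difference in tour length can be illustrated with a small Python check. It is a sketch assuming a symmetric distance matrix; `two_opt_delta` and `apply_two_opt` are hypothetical names, not functions from the original implementation. Evaluating a move touches four matrix entries, whereas recomputing the whole tour length touches n.

```python
import random


def tour_length(tour, dist):
    """Absolute length of the closed tour: n distance lookups."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def two_opt_delta(tour, dist, i, j):
    """Change in length for the move on edges i and j: 4 lookups only."""
    n = len(tour)
    a, b = tour[i], tour[i + 1]
    c, d = tour[j], tour[(j + 1) % n]
    return dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]


def apply_two_opt(tour, i, j):
    """Return a new tour with the segment between the two edges reversed."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


def random_symmetric_matrix(n, rng):
    """Hypothetical helper: random symmetric integer distance matrix."""
    dist = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            dist[u][v] = dist[v][u] = rng.randrange(1, 100)
    return dist


def deltas_match(n, seed):
    """Check that the delta equals the full recomputation for every move."""
    rng = random.Random(seed)
    dist = random_symmetric_matrix(n, rng)
    tour = list(range(n))
    rng.shuffle(tour)
    old = tour_length(tour, dist)
    return all(
        tour_length(apply_two_opt(tour, i, j), dist) - old
        == two_opt_delta(tour, dist, i, j)
        for i in range(n - 2)
        for j in range(i + 2, n)
    )
```

Calling `deltas_match(10, 0)` confirms on a random instance that the four-lookup delta agrees with the full recomputation. The equality relies on symmetry, because reversing a segment traverses its inner edges in the opposite direction.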

9. Optimizations - GPU
● Random tours generated in parallel on GPU
○ Minimizes data transfer to GPU
● 2D distance matrix resident in shared memory
○ Ensures hits in software-controlled fast data cache
● Tours copied to local memory in chunks of 1024
○ Enables accessing them with coalesced loads & stores
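A quick footprint estimate shows why keeping the distance matrix in shared memory is plausible for 100-city inputs. The numbers below rest on two assumptions that the slides do not state: 4-byte matrix entries, and the 48 KB per-SM shared-memory configuration available on Fermi-class GPUs. The function names are hypothetical.

```python
from math import isqrt

# Assumption: 48 KB of shared memory per SM (Fermi-class configuration).
SHARED_MEMORY_BYTES = 48 * 1024


def distance_matrix_bytes(n_cities, bytes_per_entry=4):
    """Size of a dense n x n distance matrix (4-byte entries assumed)."""
    return n_cities * n_cities * bytes_per_entry


def max_cities(shared_bytes=SHARED_MEMORY_BYTES, bytes_per_entry=4):
    """Largest n whose dense matrix still fits in the given budget."""
    return isqrt(shared_bytes // bytes_per_entry)


print(distance_matrix_bytes(100))  # 40000 bytes for a 100-city matrix
print(max_cities())                # largest n that fits under these assumptions
```

Under these assumptions a 100-city matrix fits with little room to spare, so much larger inputs would need a different storage scheme.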

10. Evaluation Method
● Hardware
○ NVIDIA Tesla C2050 GPU (1.15 GHz, 14 SMs w/ 32 PEs, 3 GB global memory)
○ Nautilus supercomputer (2.0 GHz 8-core X7550 Xeons, sharing 4 TB main memory)
● Data
○ Five 100-city inputs from TSPLIB
● Implementations
○ CUDA (GPU), Pthreads (CPU), serial C (CPU)
○ Use almost identical code for finding best opt-2 move

11. Results - Runtime Comparison
● GPU is 7.8x faster than CPU with 8 cores
● One GPU chip is as fast as 16 or 32 CPU chips

12. Speedup over Serial
● Pthreads code scales well up to 32 threads (4 CPUs)
● CPU performance fluctuates (NUMA), GPU stable

13. Results - Solution Quality
● Optimal tour found in 4 of 5 cases with 100,000 climbers
○ 200,000 climbers find best solution in fifth case
● Runtime independent of input and linear in climbers

14. Summary
▪ TSP_GPU algorithm
▪ Highly optimized implementation for GPUs
▪ Evaluates almost 20 billion tour modifications per second on a single GPU (as fast as 32 8-core Xeons)
▪ Produces high-quality results
▪ May be better suited for GPU than Ant Colony Optimization and GAs

15. Any Questions?

16. Thank You
