This document announces a seminar on using GPUs for parallel processing. The talk will be given by A. Stephen McGough as part of the Sci-Prog seminar series, which covers computing and programming topics at levels from basic to advanced. More information is available on the series website, and researchers can contact Matt Wade for access to the research community site. Announcements are sent via the sci-prog-seminars mailing list, and the series is run by a small group of organizers.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
Kato Mivule: An Overview of CUDA for High Performance Computing
This document provides an overview of CUDA (Compute Unified Device Architecture), a parallel computing platform developed by NVIDIA that allows programming of GPUs for general-purpose processing. It outlines CUDA's process flow of copying data to the GPU, running a kernel program on the GPU, and copying results back to CPU memory. It then demonstrates CUDA concepts like kernel and thread structure, memory management, and provides a code example of vector addition to illustrate CUDA programming.
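The copy-in / compute / copy-out flow described above can be sketched in plain Python rather than CUDA C; `launch`, `vec_add_kernel`, and the "device" buffers are illustrative stand-ins for a kernel launch, `cudaMalloc`, and `cudaMemcpy`, not CUDA API calls:

```python
# Sketch of the CUDA process flow: copy inputs to the "device", run a
# per-element kernel across n thread indices, copy the result back.
# launch() and the dev_* buffers are stand-ins, not CUDA API.

def launch(kernel, n, *buffers):
    """Emulate launching n threads: run the kernel once per thread index."""
    for tid in range(n):
        kernel(tid, *buffers)

def vec_add_kernel(tid, a, b, c):
    """Each 'thread' adds one pair of elements, as in the vector-add example."""
    c[tid] = a[tid] + b[tid]

def vector_add(a, b):
    n = len(a)
    dev_a, dev_b = list(a), list(b)   # stand-in for cudaMemcpy host -> device
    dev_c = [0] * n                   # stand-in for cudaMalloc
    launch(vec_add_kernel, n, dev_a, dev_b, dev_c)
    return list(dev_c)                # stand-in for cudaMemcpy device -> host
```

On a GPU the `launch` loop runs as thousands of concurrent threads; the point here is only that each element is computed independently.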
This document provides an overview of Nvidia CUDA programming basics. It discusses the CUDA programming model, memory model, and API. The programming model describes how the GPU is seen as a compute device to execute kernels in parallel across a grid of thread blocks. Each block contains a batch of cooperating threads with shared memory. The memory model describes the different memory spaces including shared, global, and constant memory. The API extends C with qualifiers for functions, variables, and execution configurations to specify kernel execution. A simple example calculates scalar products across vectors in parallel. Optimization techniques for the example are discussed.
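The grid-of-blocks execution model above can be illustrated with the standard indexing rule: each thread derives a unique global index from its block and thread coordinates. This is a plain-Python sketch of that rule, not CUDA code:

```python
# Sketch of CUDA thread indexing: with a grid of grid_dim blocks of
# block_dim threads each, every thread computes
#   i = blockIdx * blockDim + threadIdx
# and skips work when i >= n (the last block may be partially filled).

def global_indices(grid_dim, block_dim, n):
    handled = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:          # bounds check for the partially filled last block
                handled.append(i)
    return handled
```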
2011.02.18 Marco Parenzan - Programming Models for GPUs
This document discusses programming languages and compilers for GPUs. It begins by noting that Nvidia/CUDA is not the only option for GPU computing and that other platforms like Intel and AMD exist. It then explains why GPU computing is useful due to GPUs' parallel processing capabilities. The document outlines some popular GPU products and programming models like CUDA, OpenCL, and DirectCompute. It provides kernel code examples for tasks like matrix multiplication in languages like C for CUDA and OpenCL. Finally, it discusses related topics like GPU programming libraries, host languages, and metaprogramming.
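The matrix-multiplication kernels the deck shows in C for CUDA and OpenCL assign one thread per output element. This plain-Python sketch shows that work decomposition; the two outer loops stand in for the 2D grid of threads:

```python
# Sketch of the per-thread work assignment in a matrix-multiplication
# kernel: one "thread" per output element (row, col), each computing one
# dot product. In CUDA/OpenCL the two outer loops run as a 2D grid.

def matmul_kernel(row, col, a, b, c):
    # Each thread accumulates one row of a against one column of b.
    k = len(b)
    c[row][col] = sum(a[row][i] * b[i][col] for i in range(k))

def matmul(a, b):
    rows, cols = len(a), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for row in range(rows):
        for col in range(cols):
            matmul_kernel(row, col, a, b, c)
    return c
```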
Accelerating HPC Applications on NVIDIA GPUs with OpenACC (inside-BigData.com)
In this deck from the Stanford HPC Conference, Doug Miles from NVIDIA presents: "Accelerating HPC Applications on NVIDIA GPUs with OpenACC."
"OpenACC is a directive-based parallel programming model for GPU accelerated and heterogeneous parallel HPC systems. It offers higher programmer productivity compared to use of explicit models like CUDA and OpenCL.
Application source code instrumented with OpenACC directives remains portable to any system with a standard Fortran/C/C++ compiler, and can be efficiently parallelized for various types of HPC systems – multicore CPUs, heterogeneous CPU+GPU, and manycore processors.
This talk will include an introduction to the OpenACC programming model, provide examples of its use in a number of production applications, explain how OpenACC and CUDA Unified Memory working together can dramatically simplify GPU programming, and close with a few thoughts on OpenACC future directions."
Watch the video: https://youtu.be/CaE3n89QM8o
Learn more: https://www.openacc.org/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Introduction to CUDA, Geek Camp Singapore 2011 (Raymond Tay)
This document provides an introduction to CUDA (Compute Unified Device Architecture). It discusses that GPUs have advantages over CPUs for parallel computing due to their optimized architecture and large number of cores. It explains how CUDA works by offloading parts of a program to run on GPU memory and cores. An example of a block cipher encryption is provided to illustrate a CPU and GPU program for the same task. Additional CUDA concepts covered include debugging tools, adoption rates, and libraries.
This document summarizes random number generation using OpenCL. It discusses the Marsaglia polar method for generating random numbers and Gaussian pairs. It presents pseudocode for the Gaussian pair generation algorithm. Profiling results show that 54% of time is spent generating Gaussian pairs while 46% is for random numbers. The document also discusses optimization techniques like using local memory, coalesced global memory access, and choosing an optimal work group size. Performance results show near linear speedup from 1 to 8 GPUs.
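The Marsaglia polar method mentioned above is simple enough to sketch directly: draw a uniform point in the square, reject it if it falls outside the unit circle, and transform the accepted point into a pair of independent standard Gaussians. This Python version follows the standard algorithm, not the deck's OpenCL pseudocode:

```python
import math
import random

# Marsaglia polar method: rejection-sample (u, v) in the unit circle,
# then scale by sqrt(-2 ln(s) / s) to get two independent N(0, 1) values.

def gaussian_pair(rng):
    while True:
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:              # reject points outside the unit circle
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor
```

The rejection loop accepts about pi/4 of the draws, which is why the deck's profiling attributes a large share of runtime to the underlying uniform random-number generation.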
This document discusses GPU computing and CUDA programming. It begins with an introduction to GPU computing and CUDA. CUDA (Compute Unified Device Architecture) allows programming of Nvidia GPUs for parallel computing. The document then provides examples of optimizing matrix multiplication and closest pair problems using CUDA. It also discusses implementing and optimizing convolutional neural networks (CNNs) and autoencoders for GPUs using CUDA. Performance results show speedups for these deep learning algorithms when using GPUs versus CPU-only implementations.
The ability to write shaders that can be used on any hardware vendor's graphics card that supports the OpenGL Shading Language. Each hardware vendor includes the GLSL compiler in their driver, thus allowing each vendor to create code optimized for their particular graphics card's architecture.
Implementation of Computational Algorithms using Parallel Programming (ijtsrd)
Parallel computing is a type of computation in which many computations are performed concurrently, often by dividing large problems into smaller ones that execute independently of each other. There are several types of parallel computing. The first is the shared-memory architecture, which harnesses the power of multiple processors and multiple cores on a single machine and uses program threads and shared memory to exchange data. The second is the distributed architecture, which harnesses the power of multiple machines in a networked environment and uses message passing for communication between processes. This paper implements several computational algorithms using parallel programming techniques, namely distributed message passing. The algorithms are the Mandelbrot set, Bucket Sort, Monte Carlo, Grayscale Image Transformation, Array Summation, and Insertion Sort. All are implemented using C#.NET and tested in a parallel environment using the MPI.NET SDK and the DeinoMPI API. Experiments showed that the proposed parallel algorithms have faster execution times than their sequential counterparts. As future work, the algorithms are to be redesigned to operate on shared-memory multiprocessor and multicore architectures. Youssef Bassil, "Implementation of Computational Algorithms using Parallel Programming", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-3, April 2019, URL: https://www.ijtsrd.com/papers/ijtsrd22947.pdf
Paper URL: https://www.ijtsrd.com/computer-science/parallel-computing/22947/implementation-of-computational-algorithms-using-parallel-programming/youssef-bassil
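The paper's message-passing pattern can be illustrated with its Monte Carlo example: split the samples across independent "processes", let each report a partial count back to the master, and reduce. This Python sketch runs the workers sequentially; the `worker` function and per-worker seeds are illustrative, not the paper's MPI.NET code:

```python
import random

# Sketch of the distributed message-passing pattern using Monte Carlo pi:
# each "process" counts hits inside the quarter circle on its own chunk of
# samples, and the master sums the partial counts (an MPI-style Reduce).
# Workers run sequentially here purely for illustration.

def worker(samples, seed):
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits                      # the message each process sends back

def monte_carlo_pi(total_samples, workers=4):
    per_worker = total_samples // workers
    hits = sum(worker(per_worker, seed) for seed in range(workers))
    return 4.0 * hits / (per_worker * workers)
```

Because the chunks are fully independent, this is the "embarrassingly parallel" shape that gives the near-linear speedups the paper reports.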
Abstract:
Many machine learning algorithms can be implemented to run parallel operations on graphics cards. Deeplearning4j is a Java-based machine learning library that includes implementations of many popular neural-network algorithms. Deeplearning4j uses a library called Nd4j to run matrix algebra operations on either CPUs or GPUs via NVIDIA's CUDA API.
In this talk, I will show how to get a simple machine learning algorithm running on the GPU. I will also cover how to get started with CUDA development: how to get your code to run on the GPU, how to monitor the device, and how to write code that makes effective use of parallelization.
Bio: Gary Sieling is a Lead Software Engineer at IQVIA, in Blue Bell, PA, with interests in database technologies, machine learning, and software engineering practices. He has been involved in curating talks for a company lunch-and-learn program and in the organizing committee for a tech conference. Building on these experiences, he built a search engine called FindLectures.com to help find great talks and speakers.
Introduction to homomorphic encryption, an encryption technique that allows computations on ciphertext. An overview of the key ideas that make these schemes work is given, along with examples of how to apply them.
Christoph Matthies (@chrisma0), Hubert Hesse (@hubx), Robert Lehmann (@rlehmann)
Cloud computing is an ever-growing field in today's era. With the accumulation of data and the advancement of technology, a large amount of data is generated every day. Storage, availability, and security of that data are major concerns in the field of cloud computing. This paper focuses on homomorphic encryption, which is widely used to secure data in the cloud. Homomorphic encryption is a technique of encryption in which specific operations can be carried out on the encrypted data. The data is stored on a remote server, and the task is to operate on it while it remains encrypted. There are two types of homomorphic encryption: fully homomorphic encryption and partially homomorphic encryption. Fully homomorphic encryption allows arbitrary computation on ciphertext in a ring, while partially homomorphic encryption supports only addition or multiplication operations on the ciphertext. Homomorphic encryption plays a vital role in cloud computing, as companies' encrypted data is stored in a public cloud, taking advantage of the cloud provider's services. Various algorithms and methods of homomorphic encryption that have been proposed are discussed in this paper.
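A concrete example of the "partially homomorphic" case above is textbook RSA, which is multiplicatively homomorphic: Enc(a) * Enc(b) = (a^e)(b^e) = (ab)^e = Enc(ab) mod n. The sketch below uses the classic toy parameters (n = 61 * 53); it is an illustration of the algebraic property only, with none of the padding real RSA needs:

```python
# Multiplicative homomorphism of textbook RSA: the product of two
# ciphertexts decrypts to the product of the plaintexts. Toy parameters,
# illustrative only -- completely insecure.

N, E, D = 3233, 17, 2753   # n = 61 * 53, the standard textbook example

def encrypt(m):
    return pow(m, E, N)

def decrypt(c):
    return pow(c, D, N)
```

A server holding only `encrypt(6)` and `encrypt(7)` can multiply them (mod N) to produce a valid encryption of 42 without ever seeing the plaintexts.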
This document provides a tutorial introduction to GPGPU computation using NVIDIA CUDA. It begins with a brief overview and warnings about the large numbers involved in GPGPU. The agenda then outlines topics to be covered including general purpose GPU computing using CUDA and optimization topics like memory bandwidth optimization. Key aspects of CUDA programming are introduced like the CUDA memory model, compute capabilities of GPUs, and profiling tools. Examples are provided of simple CUDA kernels and how to configure kernel launches for grids and blocks of threads. Optimization techniques like choosing block/grid sizes to maximize occupancy are also discussed.
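The kernel launch configuration mentioned above boils down to a small piece of arithmetic: choose a block size, then use ceiling division so the grid covers all n elements. A sketch, with the common default of 256 threads per block as an illustrative choice:

```python
# Launch-configuration arithmetic: ceil(n / block_size) blocks, so that
# grid_size * block_size >= n; the last block may be partially idle, which
# is why kernels carry an "if i < n" bounds check.

def launch_config(n, block_size=256):
    grid_size = (n + block_size - 1) // block_size   # ceil division
    return grid_size, block_size
```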
Introducing additional threads acting on mutable state throws away all the understandability and predictability of single-threaded applications. This is caused by the combinatorial explosion of the state space under non-deterministic preemptive scheduling. Message passing between single-threaded agents is a better alternative, while still giving us parallel execution through pipelining, improved cache locality, and no kernel-thread context switching.
CUDA by Example: Thread Cooperation: Notes (Subhajit Sahu)
This document discusses thread cooperation in parallel programming on GPUs. It introduces mechanisms for threads to communicate and synchronize their parallel execution. Specifically, it shows how to index data for threads within blocks and across multiple blocks to allow vector addition on arbitrarily long vectors. By launching threads in a grid of blocks and incrementing the thread index by the total number of threads, each thread can work on a different portion of the data in parallel.
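The indexing scheme described above is the grid-stride loop: each thread starts at its global index and strides by the total number of launched threads, so a fixed-size grid covers a vector of any length. A plain-Python sketch:

```python
# Grid-stride loop: thread i handles elements i, i + T, i + 2T, ... where
# T = grid_dim * block_dim is the total number of threads, so an
# arbitrarily long vector is covered by a fixed-size grid.

def grid_stride_add(a, b, grid_dim, block_dim):
    n = len(a)
    c = [0] * n
    total_threads = grid_dim * block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            while i < n:                 # the grid-stride loop
                c[i] = a[i] + b[i]
                i += total_threads
    return c
```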
The document summarizes a presentation on revocable identity-based encryption (RIBE) from codes with rank metric. Key points:
- RIBE adds an efficient revocation procedure to identity-based encryption by using a binary tree structure and key updates.
- The construction is based on low rank parity-check codes, with the master secret key defined as the "trapdoor" generated by the RankSign algorithm.
- Security relies on the rank syndrome decoding problem. Key updates are done efficiently through the binary tree with logarithmic complexity.
- Parameters are given that allow decoding of up to 2wr errors with small failure probability, suitable for the identity-based encryption scheme.
The goal of this session is to demonstrate techniques that improve GPU scalability when rendering complex scenes. This is achieved through a modular design that separates the scene graph representation from the rendering backend. We will explain how the modules in this pipeline are designed and give insights into implementation details, which leverage the GPU's compute capabilities for scene graph processing. Our modules cover topics such as shader generation for improved parameter management, synchronizing updates between the scene graph and the rendering backend, and efficient data structures inside the renderer.
Video here: http://on-demand.gputechconf.com/gtc/2013/video/S3032-Advanced-Scenegraph-Rendering-Pipeline.mp4
These slides introduce the concepts behind TensorFlow based on a study of its source code, including tensors, operations, the computation graph, and execution.
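The computation-graph idea can be shown with a minimal evaluator: nodes hold an operation and input edges, and execution computes each node's value from its dependencies, the way a session evaluates a fetched tensor. This is a toy sketch of the concept, not TensorFlow's actual classes:

```python
# Minimal computation graph: constants and binary ops as nodes; run()
# evaluates dependencies first, then applies the node's operation.

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def run(self):
        if self.op == "const":
            return self.value
        args = [n.run() for n in self.inputs]   # evaluate dependencies first
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError(self.op)
```

Building the graph and running it are separate steps, which is what lets a real framework optimize, parallelize, or place the graph on a GPU before execution.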
Shai Halevi discusses new ways to protect cloud data and security. Presented at "New Techniques for Protecting Cloud Data and Security" organized by the New York Technology Council.
Homomorphic encryption allows computations to be carried out on encrypted data without decrypting it first. This summary discusses Craig Gentry's scheme for fully homomorphic encryption based on ideal lattices. The scheme works by encrypting bits as ciphertexts with small noise that grows with computations. A bootstrapping procedure called re-crypt reduces the noise to keep ciphertexts decryptable. While promising for applications like cloud computing, the scheme has high computational costs that scale poorly with security level. Current research aims to make homomorphic encryption more efficient and practical.
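The noise-growth behavior described above can be illustrated with a toy version of the symmetric "homomorphic encryption over the integers" construction (van Dijk et al. style), not Gentry's lattice scheme itself: a bit m is hidden as c = m + 2r + p*q, decryption is (c mod p) mod 2, and every homomorphic add or multiply grows the noise term until it would exceed p/2 and decryption fails, which is what bootstrapping exists to prevent. The parameter sizes below are illustrative only and offer no security:

```python
import random

# Toy integer scheme: c = m + 2r + P*q with odd secret P. Decryption
# recovers m while the noise (m + 2r, grown by adds and multiplies)
# stays below P/2. Illustrative parameters, not secure.

P = 1000003                        # odd secret key (toy size)

def encrypt(m, rng):
    r = rng.randrange(1, 50)       # small noise term
    q = rng.randrange(1, 1000)
    return m + 2 * r + P * q

def decrypt(c):
    return (c % P) % 2
```

Adding ciphertexts XORs the plaintext bits and adds the noises; multiplying ANDs the bits and multiplies the noises, which is why multiplicative depth is the expensive resource.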
Rust is a multi-paradigm systems programming language focused on safety, especially safe concurrency. Rust is syntactically similar to C++, but is designed to provide better memory safety while maintaining high performance.
This talk covers principles of design, features, and applications. Many successful projects use Rust, including browsers, operating systems, and database management systems, which will also be discussed in the talk.
This document summarizes CUDA programming using CUBLAS and direct parallelization. It first introduces CUBLAS, which implements BLAS functions on GPUs using CUDA. It describes how to initialize CUBLAS, transfer data between host and device memory, execute CUBLAS functions, and clean up. It then discusses direct parallelization, where each thread is assigned a specific task. It explains how to determine grid and block sizes, allocate device memory, copy data to the device, execute kernels, and copy results back to host memory. The document provides examples of using CUBLAS and coding a direct parallelization kernel for a matrix-vector multiplication operation.
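The matrix-vector operation the CUBLAS example computes is gemv, y = alpha * A @ x + beta * y. This plain-loop Python sketch shows the arithmetic each kernel thread would be responsible for (one output row each); it is an illustration of the operation, not the cuBLAS API:

```python
# gemv: y = alpha * (A @ x) + beta * y, written as explicit loops. In a
# direct CUDA parallelization, each thread would compute one output row's
# dot product.

def gemv(alpha, a, x, beta, y):
    return [alpha * sum(a_rc * x_c for a_rc, x_c in zip(row, x)) + beta * y_r
            for row, y_r in zip(a, y)]
```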
This document introduces Groovy, a dynamic language for the Java Virtual Machine (JVM). Groovy extends Java syntactically and semantically, allowing Java and Groovy code to seamlessly work together. Groovy aims to be a more concise, compact, and pragmatic alternative to Java with features like optional parentheses, closures, builders, and dynamic typing. The document discusses how Groovy can be used for everything from small scripts to full applications and its popularity in testing, building, and rapid prototyping.
Parallel Processing & Multi-level Logic (Hamza Saleem)
This presentation is about parallel processing: it lets us perform tasks quickly, but requires very complex circuitry. To reduce that circuitry, we use multi-level logic.
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman
Presentation to University of Kentucky Computer Science graduate studentrs on high level Cloud Computing, how MapReduce works, and the current competition for Parallel Processing on a Massive Scale
Parallel Processing for Digital Image EnhancementNora Youssef
Masters defense for Parallel Processing for Digital Image Enhancement
By Nora Youssef
Defense: 1 Oct 2015
Judges:
Prof. Dr. El-Sayed M. El-Horbaty - ASU
Prof. Dr. Hani Mohamed Kamal Mahdi - ASU
Prof. Dr. Mohamed Waheed Meselhey - Suez Channel University
The document discusses sequential processing, parallel processing, and pipelining techniques to improve CPU performance.
Sequential processing executes instructions one at a time based on the von Neumann architecture. Pipelining breaks jobs into stages to keep processor resources busy and improve throughput. Parallel processing uses multiple processors simultaneously to potentially reduce execution time by dividing a program across processors. Different parallel processor architectures include multiple instruction/multiple data streams and symmetric multiprocessors. The document compares sequential, pipelined, and parallel systems and their advantages and disadvantages for efficient processing.
Introduction to Parallel Processing Algorithms in Shared Nothing DatabasesOfir Manor
This document provides an introduction to parallel processing algorithms in shared nothing databases. It discusses scaling databases through a shared nothing architecture where data is sharded across multiple independent nodes. Examples are given of single table processing and join processing across the sharded database. Execution plans are shown for queries involving filtering, aggregation, sorting and joins on single and multiple tables both when the tables are distributed by the same key and different keys.
QGIS plugin for parallel processing in terrain analysisRoss McDonald
Art Lembo's presentation on embarrassingly parallel processing with QGIS and pyCUDA for terrain analysis. Given at 6th Scottish QGIS UK user group meeting.
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent P...Johann Schleier-Smith
Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or ``what if?'' analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay.
We introduce an execution engine that produces serial-equivalent output while accelerating throughput with pipelining and distributed parallelism. This is made possible by optimizing for high throughput rather than the traditional stream processing goal of low latency, and by aggressive sharing of versioned state, a technique we term Multi-Versioned Parallel Streaming (MVPS).
In experiments we see that this engine, which we call ReStream, performs as well as batch processing and more than an order of magnitude better than a single-threaded implementation.
The document discusses lessons learned from using Spring Batch to process pension recalculations in parallel. It describes how the initial implementation used Hibernate in the item reader, which caused problems with concurrent sessions. The solution was to remove Hibernate from the reader and instead use JDBC to fetch primary keys and let the processor fetch related data using Hibernate. This improved performance significantly by allowing more parallelization and avoiding issues with concurrent sessions. The key lessons are to avoid Hibernate in the reader, test parallelization early, and monitor SQL performance.
Massively Parallel Processing with Procedural Python - Pivotal HAWQInMobi Technology
The document discusses massively parallel processing using procedural Python. It describes EMC Corporation and its subsidiaries which provide data storage, virtualization, security, and other software solutions. It also discusses Pivotal's open source contributions and the architecture of its HAWQ database which allows Python user-defined functions to perform parallel operations across clusters.
Scaling up with hadoop and banyan at ITRIX-2015, College of Engineering, GuindyRohit Kulkarni
The document discusses LatentView Analytics and provides an overview of data processing frameworks and MapReduce. It introduces LatentView Analytics, describing its services, partners, and experience. It then discusses distributed and parallel processing frameworks, providing examples like Hadoop, Spark, and Storm. It also provides a brief history of Hadoop, describing its key developments from 1999 to present day in addressing challenges of indexing, crawling, distributed processing etc. Finally, it explains the MapReduce process and provides a simple example to illustrate mapping and reducing functions.
The document summarizes four major theories of information processing:
1) The stage theory proposes information is processed and stored in three stages: sensory memory, short-term memory, and long-term memory.
2) The levels-of-processing theory states retrieval depends on the depth of elaboration during encoding, from superficial to deep semantic analysis.
3) Parallel distributed processing theory posits information is processed simultaneously across networks rather than sequentially as in stage theory.
4) Connectionist theory emphasizes information storage in networks of brain connections that become stronger through elaboration.
Massively Parallel Processing with Procedural Python (PyData London 2014)Ian Huston
The Python data ecosystem has grown beyond the confines of single machines to embrace scalability. Here we describe one of our approaches to scaling, which is already being used in production systems. The goal of in-database analytics is to bring the calculations to the data, reducing transport costs and I/O bottlenecks. Using PL/Python we can run parallel queries across terabytes of data using not only pure SQL but also familiar PyData packages such as scikit-learn and nltk. This approach can also be used with PL/R to make use of a wide variety of R packages. We look at examples on Postgres compatible systems such as the Greenplum Database and on Hadoop through Pivotal HAWQ. We will also introduce MADlib, Pivotal’s open source library for scalable in-database machine learning, which uses Python to glue SQL queries to low level C++ functions and is also usable through the PyMADlib package.
002 - Introduction to CUDA Programming_1.pptceyifo9332
This document provides an introduction to CUDA programming. It discusses the programmer's view of the GPU as a co-processor with its own memory, and how GPUs are well-suited for data-parallel applications with many independent computations. It describes how CUDA uses a grid of blocks of threads to run kernels in parallel. Memory is organized into global, constant, shared, and local memory. Kernels launch a grid of blocks, and threads within blocks can cooperate through shared memory and synchronization.
1) The document provides an introduction to GPGPU programming with CUDA, outlining goals of providing an overview and vision for using GPUs to improve applications.
2) Key aspects of GPU programming are discussed, including the large number of cores devoted to data processing, example applications that are well-suited to parallelization, and the CUDA tooling in Visual Studio.
3) A hands-on example of matrix multiplication is presented to demonstrate basic CUDA programming concepts like memory management between host and device, kernel invocation across a grid of blocks, and using thread IDs to parallelize work.
CUDA is a parallel computing platform and programming model developed by Nvidia that allows software developers and researchers to utilize GPUs for general purpose processing. CUDA allows developers to achieve up to 100x performance gains over CPU-only applications. CUDA works by having the CPU copy input data to GPU memory, executing a kernel program on the GPU that runs in parallel across many threads, and copying the results back to CPU memory. Key GPU memories that can be used in CUDA programs include shared memory for thread cooperation, textures for cached reads, and constants for read-only data.
This document provides an overview of CUDA (Compute Unified Device Architecture), NVIDIA's parallel computing platform and programming model that allows software developers to leverage the parallel compute engines in NVIDIA GPUs. The document discusses key aspects of CUDA including: GPU hardware architecture with many scalar processors and concurrent threads; the CUDA programming model with host CPU code calling parallel kernels that execute across multiple GPU threads; memory hierarchies and data transfers between host and device memory; and programming basics like compiling with nvcc, allocating and copying data between host and device memory.
This document provides an overview of CUDA C/C++ basics for processing data in parallel on GPUs. It discusses:
- The CUDA architecture which exposes GPU parallelism for general-purpose computing while retaining performance.
- The CUDA programming model which uses a grid of thread blocks, with each block containing a group of threads that execute concurrently.
- Key CUDA C/C++ concepts like declaring device functions, launching kernels, managing host and device memory, and using thread and block indexes to parallelize work across threads and blocks.
- A simple example of vector addition to demonstrate parallel execution using threads and blocks, with indexing to map threads/blocks to problem elements.
The document summarizes a lecture on parallel computing with CUDA (Compute Unified Device Architecture). It introduces CUDA as a parallel programming model for GPUs, covering key concepts like memory architecture, host-GPU workload partitioning, programming paradigm, and programming examples. It then outlines the agenda, benefits of GPU computing, and provides details on CUDA programming interfaces, kernels, threads, blocks, and memory hierarchies. Finally, it lists some lab exercises on CUDA programming including HelloWorld, matrix multiplication, and parallel sorting algorithms.
This document provides an introduction to parallel programming using GPUs. It outlines the hardware architecture of GPUs, which have hundreds of cores optimized for processing pixels in parallel. It then discusses CUDA programming, with examples of initializing the GPU, allocating and transferring memory, executing kernels, and common applications in physics, finance, and other fields. The document concludes by discussing the sparse conjugate gradient method for inverting matrices on the GPU as an example application in computational physics.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
This lecture discusses manycore GPU architectures and programming, focusing on the CUDA programming model. It covers GPU execution models, CUDA programming concepts like threads and blocks, and how to manage GPU memory including different memory types like global and shared memory. It also discusses optimizing memory access patterns for global memory and profiling CUDA programs.
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
Kiran Lonikar proposes extending Project Tungsten in Spark SQL to enable parallel execution of DataFrame operations on GPUs. The proposal involves refactoring DataFrames to use a columnar layout and generating OpenCL code for batched execution across columns. Initial results show speedups from GPU execution. Future work includes supporting multi-GPU execution and adapting additional systems like Impala that may be better suited than Spark for GPU integration.
Presentation I gave at the SORT Conference in 2011. Was generalized from some work I had done with using GPUs to accelerate image processing at FamilySearch.
This document discusses parallel computing with GPUs. It introduces parallel computing, GPUs, and CUDA. It describes how GPUs are well-suited for data-parallel applications due to their large number of cores and throughput-oriented design. The CUDA programming model is also summarized, including how kernels are launched on the GPU from the CPU. Examples are provided of simple CUDA programs to perform operations like squaring elements in parallel on the GPU.
NVidia CUDA for Bruteforce Attacks - DefCamp 2012DefCamp
Ian Buck developed GPU computing at Nvidia. CUDA 1.0 was released in 2006, allowing normal applications to utilize GPU processing for higher performance without low-level programming. A GPU can execute many more instructions per clock than a CPU due to its large number of arithmetic logic units. In CUDA, programs specify blocks and threads to distribute work across a GPU. Calling a GPU function launches the specified number of blocks with threads. This massive parallelism allows GPUs to greatly accelerate brute force searches.
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
This document provides an introduction to CUDA and OpenCL for graphics processors. It discusses how GPUs are optimized for throughput rather than latency via parallel processing. The CUDA programming model exposes thread-level parallelism through blocks of cooperative threads and SIMD parallelism. OpenCL is inspired by CUDA but is hardware-vendor neutral. Both support features like shared memory, synchronization, and memory copies between host and device. Efficient CUDA coding requires exposing abundant fine-grained parallelism and minimizing execution and memory divergence.
The document provides an overview of GPU computing and CUDA programming. It discusses how GPUs enable massively parallel and affordable computing through their manycore architecture. The CUDA programming model allows developers to accelerate applications by launching parallel kernels on the GPU from their existing C/C++ code. Kernels contain many concurrent threads that execute the same code on different data. CUDA features a memory hierarchy and runtime for managing GPU memory and launching kernels. Overall, the document introduces GPU and CUDA concepts for general-purpose parallel programming on NVIDIA GPUs.
Scaling into Billions of Nodes and Relationships with Neo4j Graph Data ScienceNeo4j
The document discusses Neo4j Graph Data Science (GDS) and its ability to scale to billions of nodes and relationships. It outlines a typical GDS workflow involving graph projection, algorithm execution, and data export. It then discusses challenges of scaling GDS, including data size, import/export speeds, and algorithm performance. The document dives into how GDS addresses these challenges through techniques like graph compression, parallel processing, and optimized data structures like "huge collections" to handle large primitive data types in Java.
This document summarizes VPU and GPGPU computing technologies. It discusses that a VPU is a visual processing unit, also known as a GPU. GPUs have massively parallel architectures that allow them to perform better than CPUs for some complex computational tasks. The document then discusses GPU, PPU and GPGPU architectures, programming models like CUDA, and applications of GPGPU computing such as machine learning, robotics and scientific research.
Similar to Using GPUs for parallel processing (20)
Using GPUs for parallel processing
1. Sci-Prog seminar series
Talks on computing and programming related topics ranging from basic to advanced levels.
Talk: Using GPUs for parallel processing
A. Stephen McGough
Website: http://conferences.ncl.ac.uk/sciprog/index.php
Research community site: contact Matt Wade for access
Alerts mailing list: sci-prog-seminars@ncl.ac.uk
(sign up at http://lists.ncl.ac.uk )
Organisers: Dr Liz Petrie, Dr Matt Wade, Dr Stephen McGough,
Dr Ben Allen and Gregg Iceton
3. Why?
• Observation: Moore's law is dead?
– "the number of transistors on integrated circuits doubles approximately every two years"
– Processors aren't getting faster… they're getting fatter
• Processor speed and energy: power ~ frequency³
– Assume a 1 GHz core consumes 1 watt
– A 4 GHz core then consumes ~64 watts
– Four 1 GHz cores consume only ~4 watts
• Computers are going many-core
4. What?
• Games industry is multi-billion dollar
• Gamers want photo-realistic games
– Computationally expensive
– Requires complex physics calculations
• Latest generation of Graphical Processing Units are therefore many-core parallel processors
– General Purpose Graphical Processing Units – GPGPUs
5. Not just normal processors
• 1000s of cores
– But cores are simpler than a normal processor
– Multiple cores perform the same action at the same time – Single Instruction Multiple Data (SIMD)
• Conventional processor -> minimize latency of a single program
• GPU -> maximize throughput of all cores
• Potential for orders of magnitude speed-up
6. "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"
• Famous quote from Seymour Cray arguing for small numbers of processors
– But the chickens are now winning
• Need a new way to think about programming
– Need hugely parallel algorithms
• Many existing algorithms won't work (efficiently)
7. Some Issues with GPGPUs
• Cores are slower than a standard CPU
– But you have lots more
• No direct control on when your code runs on a core
– The GPGPU decides where and when
• Can't communicate between cores
• Order of execution is 'random'
– Synchronization is through exiting the parallel GPU code
• SIMD only works (efficiently) if all cores are doing the same thing
– NVIDIA GPUs have Warps of 32 cores working together
• Code divergence leads to more Warps
• Cores can interfere with each other
– Overwriting each other's memory
8. How
• Many approaches
– OpenGL – for the mad Guru
– Compute Unified Device Architecture (CUDA)
– OpenCL – emerging standard
– Dynamic Parallelism – for existing code loops
• Focus here on CUDA
– Well developed and supported
– Exploits full power of the GPGPU
9. CUDA
• CUDA is a set of extensions to C/C++ (and Fortran)
• Code consists of sequential and parallel parts
– Parallel parts are written as kernels
– A kernel describes what one thread of the code will do
• Flow: Start -> Sequential code -> Transfer data to card -> Execute kernel -> Transfer data from card -> Finish sequential code
10. Example: Vector Addition
• One dimensional data
• Add two vectors (A, B) together to produce C
• Need to define the kernel to run and the main code
• Each thread can compute a single value for C
11. Example: Vector Addition
• Pseudo code for the kernel:
– Identify which element in the vector I'm computing (i)
– Compute C[i] = A[i] + B[i]
• How do we identify our index (i)?
12. Blocks and Threads
• In CUDA the whole data space is the Grid
– The Grid is divided into a number of blocks
– Each block is divided into a number of threads
• Blocks can be executed in any order
• Threads in a block are executed together
• Blocks and threads can be 1D, 2D or 3D
13. Blocks
• As blocks are executed in arbitrary order, this gives CUDA the opportunity to scale to the number of cores in a particular device
14. Thread id
• CUDA provides three pieces of data for identifying a thread:
– BlockIdx – block identity
– BlockDim – the size of a block (number of threads in the block)
– ThreadIdx – identity of a thread within its block
• Can use these to compute the absolute thread id:
id = BlockIdx * BlockDim + ThreadIdx
• E.g.: BlockIdx = 2, BlockDim = 3, ThreadIdx = 1
id = 2 * 3 + 1 = 7

               Block0  | Block1  | Block2
Thread index   0 1 2   | 0 1 2   | 0 1 2
Absolute id    0 1 2   | 3 4 5   | 6 7 8
15. Example: Vector Addition
Kernel code (__global__ marks the entry point for a kernel definition; otherwise it reads like a normal function):

__global__ void vector_add(double *A, double *B, double *C, int N) {
    // Find my thread id - block and thread
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    // We might be invalid - the data size may not be completely divisible by blocks
    if (id >= N) { return; } // I'm not a valid id
    C[id] = A[id] + B[id]; // do my work
}
16. Example: Vector Addition
Pseudo code for sequential code
• Create Data on Host Computer
• Create space on device
• Copy data to device
• Run Kernel
• Copy data back to host and do something with it
• Clean up
17. Host and Device
• Data needs copying to / from the GPU (device)
• Often end up with the same data on both
– Suffix variable names with _host or _device to help identify where data is
– e.g. A_host on the host, A_device on the device
18. Example: Vector Addition
int N = 2000;
// Create data on host computer
double *A_host = new double[N];
double *B_host = new double[N];
double *C_host = new double[N];
for (int i = 0; i < N; i++) { A_host[i] = i; B_host[i] = (double)i/N; }

// Allocate space on device GPGPU
double *A_device, *B_device, *C_device;
cudaMalloc((void**) &A_device, N*sizeof(double));
cudaMalloc((void**) &B_device, N*sizeof(double));
cudaMalloc((void**) &C_device, N*sizeof(double));

// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*sizeof(double), cudaMemcpyHostToDevice);

// How many blocks will we need? Choose a block size of 256 (rounds up)
int blocks = (N - 0.5)/256 + 1;
vector_add<<<blocks, 256>>>(A_device, B_device, C_device, N); // run kernel

// Copy data back
cudaMemcpy(C_host, C_device, N*sizeof(double), cudaMemcpyDeviceToHost);
// do something with result

// Free device memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
// Free host memory (allocated with new[], so delete[] rather than free)
delete[] A_host; delete[] B_host; delete[] C_host;
19. More Complex: Matrix Addition
• Now a 2D problem
– BlockIdx, BlockDim, ThreadIdx now have x and y components
• But the general principles hold
– For the kernel: compute the location in a matrix of two dimensions
– For the main code: define and transmit the data
• But keep the data 1D
– Why?
20. Why data in 1D?
• If you define data as 2D there is no guarantee that the data will be a contiguous block of memory
– So it can't be transmitted to the card in one command
[Diagram: the rows of a 2D array separated in memory by some other data]
21. Faking 2D data
• 2D data of size N*M
• Define a 1D array of size N*M
• Index element [x,y] as x*N + y
• Then can transfer to the device in one go
[Diagram: memory layout Row 1 | Row 2 | Row 3 | Row 4, contiguous]
22. Example: Matrix Add
Kernel:

__global__ void matrix_add(double *A, double *B, double *C, int N, int M)
{
    // Find my thread id - block and thread, in both dimensions
    int idX = blockDim.x * blockIdx.x + threadIdx.x;
    int idY = blockDim.y * blockIdx.y + threadIdx.y;
    if (idX >= N || idY >= M) { return; } // I'm not a valid id
    int id = idY * N + idX; // compute the 1D location
    C[id] = A[id] + B[id]; // do my work
}
23. Example: Matrix Addition
Main code:

int N = 20;
int M = 10;
// Define matrices on the host
double *A_host = new double[N * M];
double *B_host = new double[N * M];
double *C_host = new double[N * M];
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        A_host[i + j * N] = i;
        B_host[i + j * N] = (double)j/M;
    }
}

// Define space on the device GPGPU
double *A_device, *B_device, *C_device;
cudaMalloc((void**) &A_device, N*M*sizeof(double));
cudaMalloc((void**) &B_device, N*M*sizeof(double));
cudaMalloc((void**) &C_device, N*M*sizeof(double));

// Copy data from host memory to device memory
cudaMemcpy(A_device, A_host, N*M*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(B_device, B_host, N*M*sizeof(double), cudaMemcpyHostToDevice);

// How many blocks will we need? Choose a block size of 16x16
int blocksX = (N - 0.5)/16 + 1;
int blocksY = (M - 0.5)/16 + 1;
dim3 dimGrid(blocksX, blocksY);
dim3 dimBlocks(16, 16);

// Run the kernel
matrix_add<<<dimGrid, dimBlocks>>>(A_device, B_device, C_device, N, M);

// Copy data back from device to host
cudaMemcpy(C_host, C_device, N*M*sizeof(double), cudaMemcpyDeviceToHost);
// do something with the result, e.g.:
// for (int i = 0; i < N*M; i++) printf("C[%d,%d] = %f\n", i/N, i%N, C_host[i]);

// Tidy up: free device then host memory
cudaFree(A_device); cudaFree(B_device); cudaFree(C_device);
delete[] A_host; delete[] B_host; delete[] C_host;
24. Running Example
• Computer: condor-gpu01
– Set path: set path = ( $path /usr/local/cuda/bin/ )
• Compile with the nvcc command, then just run the binary file
• Card: C2050, 440 cores, 3GB RAM
– Single precision: 1.03 Tflops
– Double precision: 515 Gflops
25. Summary and Questions
• GPGPUs have great potential for parallelism
• But at a cost
– Not 'normal' parallel computing
– Need to think about problems in a new way
• Further reading
– NVIDIA CUDA Zone: https://developer.nvidia.com/category/zone/cuda-zone
– Online courses: https://www.coursera.org/course/hetero