A presentation that introduces the basic concepts of parallel computing and gives some details on General Purpose GPU computing using the CUDA architecture.
This document discusses GPU memory and how to optimize memory access patterns. It begins with an example of how a wide memory bus is used in GPUs. It describes the importance of coalescing memory accesses from multiple threads to fully utilize the bus bandwidth. It also discusses memory bank conflicts that can occur if multiple threads access the same memory bank, degrading performance. The key to high GPU memory bandwidth is coalescing accesses and avoiding bank conflicts.
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization (AMD Developer Central)
Presentation HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization, by Huming Zhu at the AMD Developer Summit (APU13) November 11-13, 2013.
This document discusses how work groups are scheduled for execution on GPU compute units. It explains that work groups are broken down into hardware schedulable units known as warps or wavefronts. These group threads together and execute instructions in lockstep. The document covers thread scheduling, effects of divergent control flow, predication, warp voting, and optimization techniques like maximizing occupancy.
Knitting boar - Toronto and Boston HUGs - Nov 2012 (Josh Patterson)
1) The document discusses machine learning and parallel iterative algorithms like stochastic gradient descent. It introduces the Mahout machine learning library and describes an implementation of parallel SGD called Knitting Boar that runs on YARN.
2) Knitting Boar parallelizes Mahout's SGD algorithm by having worker nodes process partitions of the training data in parallel while a master node merges their results.
3) The author argues that approaches like Knitting Boar and IterativeReduce provide better ways to implement machine learning algorithms for big data compared to traditional MapReduce.
IRJET - Latin Square Computation of Order-3 using OpenCL (IRJET Journal)
This document discusses using OpenCL parallel programming to compute Latin squares of order 3 more efficiently than sequential algorithms. It proposes dividing the input matrix into sub-matrices that are processed concurrently by multiple processing elements in the GPU. This parallel approach reduces the computation time compared to performing the operations sequentially on the CPU. First, the input matrix is divided based on task or data parallelism. Then the sub-matrices are computed simultaneously by different processing elements. The results are combined and stored in GPU memory before being transferred to CPU memory and output. Implementing the Latin square computation with OpenCL exploits parallelism to improve efficiency over the traditional sequential approach.
This lecture covers the principles and architectures of modern cluster schedulers, including Apache Mesos, Apache YARN, Google Borg and Kubernetes (K8s), with some notes on Omega.
This document discusses parallel computing with GPUs. It introduces parallel computing, GPUs, and CUDA. It describes how GPUs are well-suited for data-parallel applications due to their large number of cores and throughput-oriented design. The CUDA programming model is also summarized, including how kernels are launched on the GPU from the CPU. Examples are provided of simple CUDA programs to perform operations like squaring elements in parallel on the GPU.
This document discusses approaches to programming multiple devices in OpenCL, including using a single context with multiple devices or multiple contexts. With a single context, memory objects are shared but data must be explicitly transferred between devices. Multiple contexts allow splitting work by device but require extra communication. Load balancing work between heterogeneous CPUs and GPUs requires considering scheduling overhead and data location.
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip.
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
The document summarizes two papers about MapReduce frameworks for cloud computing. The first paper describes Hadoop, which uses MapReduce and HDFS to process large amounts of distributed data across clusters. HDFS stores data across cluster nodes in a fault-tolerant manner, while MapReduce splits jobs into parallel map and reduce tasks. The second paper discusses P2P-MapReduce, which allows for a dynamic cloud environment where nodes can join and leave. It uses a peer-to-peer model where nodes can be masters or slaves, and maintains backup masters to prevent job loss if the primary master fails.
Machine Learning with New Hardware Challenges (Oscar Law)
Describes basic neural network design with a focus on the Convolutional Neural Network architecture. Explains why CPUs and GPUs can't fulfill CNN hardware requirements. Lists three hardware examples: Nvidia, Microsoft and Google. Finally, highlights optimization approaches for CNN design.
This document discusses optimizations for implementing an N-body simulation algorithm on GPUs using OpenCL. It begins with an overview of the basic N-body algorithm and its parallel implementation. Two key optimizations are explored: using local memory to enable data reuse across work items, and unrolling the computation loop. Performance results on AMD and Nvidia GPUs show that data reuse provides significant speedup, and loop unrolling further improves performance on the AMD GPU. An example N-body application is provided to experiment with these optimization techniques.
Parallel computing uses multiple processors simultaneously to solve computational problems faster. It allows solving larger problems or more problems in less time. Shared memory parallel programming with tools like OpenMP and pthreads is used for multicore processors that share memory. Distributed memory parallel programming with MPI is used for large clusters with separate processor memories. GPU programming with CUDA is also widely used to leverage graphics hardware for SIMD-style parallel tasks. The key challenges in parallel programming are load balancing, communication overhead, and synchronization between processors.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS (cseij)
This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general purpose computing due to their high parallel processing capabilities. Several computational intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a good alternative for computational intensive tasks to reduce CPU load and improve performance compared to CPU-only implementations.
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS (csandit)
The document discusses improving the performance of the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP) on CUDA GPUs by avoiding duplicated computation among threads. It first provides background on GPU architecture, the PFSP problem, and related parallelization methods. It then observes that if two permutations share the same prefix, their completion time tables will contain identical column data equal to the length of the prefix, leading to duplicated computation. The paper proposes an approach where each thread is assigned a permutation and allocated shared memory to store and compute the completion time table in parallel, avoiding this duplicated work by leveraging the shared prefix property. Experimental results show the new approach runs up to 1.5 times faster than an existing implementation.
KnittingBoar Toronto Hadoop User Group Nov 27 2012 (Adam Muise)
This document discusses machine learning and parallel iterative algorithms. It provides an introduction to machine learning and Mahout. It then describes Knitting Boar, a system for parallelizing stochastic gradient descent on Hadoop YARN. Knitting Boar partitions data among workers that perform online logistic regression in batches. The workers send gradient updates to a master node, which averages the updates to produce a new global model. Experimental results show Knitting Boar achieves roughly linear speedup. The document concludes by discussing developing YARN applications and the Knitting Boar codebase.
Hadoop interview questions for freshers and experienced people. A good resource for beginners and experts who are eager to learn Hadoop from scratch.
Read more here http://softwarequery.com/hadoop/
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm..." (Yahoo Developer Network)
This document discusses programming abstractions for smart applications on clouds. It proposes a new programming model called Deformable Mesh Abstraction (DMA) that addresses limitations in existing models like MapReduce. DMA allows tasks to recursively spawn new tasks at runtime, supports efficient communication through a shared structure, and can operate on changing datasets. The document describes how DMA can model heuristic problem solving and presents case studies applying DMA to AI planners. It also discusses how DMA could be extended to support file systems and integrated with Hadoop.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of Angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other things, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, CEO and Founder of Auviz Systems, presents the "Trade-offs in Implementing Deep Neural Networks on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Video and images are a key part of Internet traffic—think of all the data generated by social networking sites such as Facebook and Instagram—and this trend continues to grow. Extracting usable information from video and images is thus a growing requirement in the data center. For example, object and face recognition are valuable for a wide range of uses, from social applications to security applications. Deep neural networks are currently the most popular form of convolutional neural networks (CNN) used in data centers for such applications. 3D convolutions are a core part of CNNs. Nagesh presents alternative implementations of 3D convolutions on FPGAs, and discusses trade-offs among them.
An Introduction to TensorFlow architecture (Mani Goswami)
Introduces you to the internals of TensorFlow and deep dives into distributed version of TensorFlow. Refer to https://github.com/manigoswami/tensorflow-examples for examples.
In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Woven Wabbit works and examine the lessons learned from YARN application construction.
This document discusses synchronization, timing, and profiling in OpenCL. It covers coarse-grained synchronization at the command queue level and fine-grained synchronization at the function call level using events. It describes how to use events for timing, profiling, and asynchronous host-device communication. It provides an example of how asynchronous I/O can improve performance in medical imaging applications by overlapping computation and data transfers.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which may lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how a GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by Leo Meyerovich and Matthew Torok (AMD Developer Central)
Presentation WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by Leo Meyerovich and Matthew Torok at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Monte Carlo simulation is well-suited for GPU acceleration due to its highly parallel nature. GPUs provide lower cost and higher performance than CPUs for Monte Carlo applications. Numerical libraries for GPUs allow developers to focus on their models rather than reimplementing basic components. NAG has developed GPU libraries including random number generators and is working with financial institutions to apply Monte Carlo simulations to problems in finance.
The document describes a proposed grid computing framework that aims to make grid computing easier to deploy, use, and maintain. The framework would accept computational problems from users, distribute tasks to client machines based on dependencies and load balancing, collect and compile results from clients, and present outputs to the user. The framework is intended to address concerns with existing grid middleware being complicated and not accessible to all, and will be open source, Linux-based, and work on a moderately sized local area network.
This document discusses patterns for parallel computing. It outlines key concepts like Amdahl's law and types of parallelism like data and task parallelism. Examples are provided of how major tech companies like Microsoft, Google, Amazon implement parallelism at different levels of their infrastructure and applications to scale efficiently. Design principles are discussed for converting sequential programs to parallel programs while maintaining performance.
The document discusses the CAP theorem which states that it is impossible for a distributed computer system to simultaneously provide consistency, availability, and partition tolerance. It defines these terms and explores how different systems address the tradeoffs. Consistency means all nodes see the same data at the same time. Availability means every request results in a response. Partition tolerance means the system continues operating despite network failures. The CAP theorem says a system can only choose two of these properties. The document discusses how different types of systems, like CP and AP systems, handle partitions and trade off consistency and availability. It also notes the CAP theorem is more nuanced in reality with choices made at fine granularity within systems.
This document provides an introduction to peer-to-peer (P2P) computer networks. It discusses how P2P networks rely on the computing power and bandwidth of participants rather than centralized servers. The document then covers several examples of P2P networks including Gnutella and Kademlia, and discusses techniques like distributed hash tables, queries, and node joining/leaving.
NoSQL databases, the CAP theorem, and the theory of relativity (Lars Marius Garshol)
The document discusses NoSQL databases and the CAP theorem. It begins by providing an overview of NoSQL databases, their key features like being schemaless and supporting eventual consistency over ACID transactions. It then explains the CAP theorem - that a distributed system can only provide two of consistency, availability, and partition tolerance. It also discusses how Google's Spanner database achieves consistency and scalability using ideas from Lamport's Paxos algorithm and a new time service called TrueTime.
Please contact me to download this presentation. A comprehensive presentation on the field of parallel computing, whose applications are only growing day by day. A useful seminar covering the basics, classification, and implementation thoroughly.
Visit www.ameyawaghmare.wordpress.com for more info
This document discusses NoSQL and the CAP theorem. It begins with an introduction of the presenter and an overview of topics to be covered: What is NoSQL and the CAP theorem. It then defines NoSQL, provides examples of major NoSQL categories (document, graph, key-value, and wide-column stores), and explains why NoSQL is used, including to handle large, dynamic, and distributed data. The document also explains the CAP theorem, which states that a distributed data store can only satisfy two of three properties: consistency, availability, and partition tolerance. It provides examples of how to choose availability over consistency or vice versa. Finally, it concludes that both SQL and NoSQL have valid use cases, and a combination of the two may be appropriate.
Migration To Multi Core - Parallel Programming Models (Zvi Avraham)
The document discusses multi-core and many-core processors and parallel programming models. It provides an overview of hardware trends including increasing numbers of cores in CPUs and GPUs. It also covers parallel programming approaches like shared memory, message passing, data parallelism and task parallelism. Specific APIs discussed include Win32 threads, OpenMP, and Intel TBB.
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture (mohamedragabslideshare)
This document summarizes research on revisiting co-processing techniques for hash joins on coupled CPU-GPU architectures. It discusses three co-processing mechanisms: off-loading, data dividing, and pipelined execution. Off-loading involves assigning entire operators like joins to either the CPU or GPU. Data dividing partitions data between the processors. Pipelined execution aims to schedule workloads adaptively between the CPU and GPU to maximize efficiency on the coupled architecture. The researchers evaluate these approaches for hash join algorithms, which first partition, build hash tables, and probe tables on the input relations.
CUDA is a parallel computing platform developed by NVIDIA that allows developers to use GPUs for general purpose processing. It extends programming languages like C, C++ and Fortran to leverage the parallel processing capabilities of GPUs. The CUDA platform divides a program into portions that run on the CPU and GPU - the CPU handles control tasks while the GPU executes extensive calculations in parallel across its many cores. This approach of using GPUs for general computations beyond graphics is called GPGPU (general-purpose computing on graphics processing units). Parallel computing solves problems faster by breaking them into discrete parts that can be processed simultaneously, unlike serial computing, which handles one instruction at a time.
This document provides an introduction to parallel computing. It discusses serial versus parallel computing and how parallel computing involves simultaneously using multiple compute resources to solve problems. Common parallel computer architectures involve multiple processors on a single computer or connecting multiple standalone computers together in a cluster. Parallel computers can use shared memory, distributed memory, or hybrid memory architectures. The document outlines some of the key considerations and challenges in moving from serial to parallel code such as decomposing problems, identifying dependencies, mapping tasks to resources, and handling dependencies.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how GPUs are optimized for graphics applications through massively parallel and highly multithreaded designs. Programming models like CUDA allow GPUs to be used for general purpose computing by addressing thread, data, and task parallelism. Overall, the document outlines how multi-core and GPU technologies enable computers to better utilize parallelism for improved performance.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how multithreading exploits thread-level parallelism and how GPUs are optimized for parallel graphics applications through thousands of simple processor cores focused on throughput over latency. The document provides examples of Intel's multi-core chips and the Polaris chip with 80 cores, and explains how applications can benefit from multi-core and multi-threaded programming.
This document discusses advance computer architectures including multi-core computers, multithreading, and GPUs. It provides information on multi-core systems and how they integrate multiple processor cores on a single chip to provide cheap parallel computing. It also discusses limitations of single core architectures and how multithreading enables parallelism through dividing instruction streams into threads. Finally, it covers GPUs and how they are optimized for parallel processing of graphics applications using thousands of simpler cores compared to CPUs.
This document discusses advance computer architectures including multi-core computers, multithreading, and GPUs. It provides information on multi-core systems having multiple processor cores on a single chip that share memory. It discusses how multi-core processors address limitations of single core designs by providing cheaper parallelism while increasing computation power. The document also covers multithreading, different approaches, and how programming must support multi-core through multiple threads or processes. Finally, it introduces GPUs, how they are optimized for graphics applications through parallelism and throughput, and how CUDA enables general purpose programming on GPUs.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
This document provides an overview of parallel and distributed computing. It begins by outlining the key learning outcomes of studying this topic, which include defining parallel algorithms, analyzing parallel performance, applying task decomposition techniques, and performing parallel programming. It then reviews the history of computing from the batch era to today's network era. The rest of the document discusses parallel computing concepts like Flynn's taxonomy, shared vs distributed memory systems, limits of parallelism based on Amdahl's law, and different types of parallelism including bit-level, instruction-level, data, and task parallelism. It concludes by covering parallel implementation in both software through parallel programming and in hardware through parallel processing.
This document provides an overview of parallel computing. It discusses why parallel computation is needed due to limitations in increasing processor speed. It then covers various parallel platforms including shared and distributed memory systems. It describes different parallel programming models and paradigms including MPI, OpenMP, Pthreads, CUDA and more. It also discusses key concepts like load balancing, domain decomposition, and synchronization which are important for parallel programming.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
This document provides an overview of parallel and distributed computing using GPUs. It discusses GPU architecture and how GPUs are designed for massively parallel processing using hundreds of smaller cores compared to CPUs which use 4-8 larger cores. The document also covers GPU memory hierarchy, programming GPUs using OpenCL, and key concepts like work items, work groups, and occupancy which is keeping GPU compute units busy with work to process.
This document discusses using GPUs to improve the performance of content-based matching. It describes how GPUs can process subscriptions and events in parallel using thousands of lightweight threads. The algorithm stores constraints in arrays to maximize memory coalescing. Testing shows the GPU implementation is 7-13x faster than software on CPUs and can process over 9,000 events per second while using modest memory. Future work includes integrating the algorithm into a real system and exploring probabilistic matching.
The document discusses MapReduce and the Hadoop framework. It provides an overview of how MapReduce works, examples of problems it can solve, and how Hadoop implements MapReduce at scale across large clusters in a fault-tolerant manner using the HDFS distributed file system and YARN resource management.
This document provides an outline of manycore GPU architectures and programming. It introduces GPU architectures, the GPGPU concept, and CUDA programming. It discusses the GPU execution model, CUDA programming model, and how to work with different memory types in CUDA like global, shared and constant memory. It also covers streams and concurrency, CUDA intrinsics and libraries, performance profiling and debugging. Finally, it mentions directive-based programming models like OpenACC and OpenMP.
The document discusses big data and distributed computing. It explains that big data refers to large, unstructured datasets that are too large for traditional databases. Distributed computing uses multiple computers connected via a network to process large datasets in parallel. Hadoop is an open-source framework for distributed computing that uses MapReduce and HDFS for parallel processing and storage across clusters. HDFS stores data redundantly across nodes for fault tolerance.
Architecting and productionising data science applications at scale (samthemonad)
This document discusses architecting and productionizing data science applications at scale. It covers topics like parallel processing with Spark, streaming platforms like Kafka, and scalable machine learning approaches. It also discusses architectures for data pipelines and productionizing models, with a focus on automation, avoiding SQL databases, and using Kafka streams and Spark for batch and streaming workloads.
Similar to Parallel Computing: Perspectives for more efficient hydrological modeling
In a recent study (Koutsoyiannis et al., On the credibility of climate predictions, Hydrological Sciences Journal, 53 (4), 671–684, 2008), the credibility of climate predictions was assessed based on comparisons with long series of observations. Extending this research, which compared the outputs of various climatic models to temperature and precipitation observations from 8 stations around the globe, we test the performance of climate models at over 50 additional stations. Furthermore, we make comparisons at a large sub-continental spatial scale after integrating modelled and observed series.
Cellular Automata are used in various disciplines for the modeling of complex system processes. Their inherent simplicity and their natural parallelism make them a very efficient tool for the simulation of large scale physical phenomena. We explore the framework of Cellular Automata to develop a physically based model for the spatial and temporal prediction of shallow landslides. Particular weight is given to the modeling of hydrological processes in order to investigate the hydrological triggering mechanisms and the importance of continuous modeling of the water balance to detect the timing and location of soil slip occurrences. Specifically, the 3D flow of water and the resulting water balance in the unsaturated and saturated zone is modeled taking into account important phenomena such as hydraulic hysteresis and evapotranspiration. In this poster the hydrological component of the model will be presented and tested against well-established benchmark experiments [Vauclin et al, 1975; Vauclin et al, 1979]. Furthermore, we investigate the applicability of incorporating it in a hydrological catchment model for the prediction (temporal and spatial) of rainfall-triggered shallow landslides.
An introductory presentation of my PhD research covering rainfall-induced landslides, subsurface hydrology, unsaturated soil mechanics, Ground Penetrating Radar, and some experimental data from a field campaign that I conducted.
A distributed physically based model to predict timing and spatial distributi...Grigoris Anagnostopoulos
Shallow landslides induced by rainfall are among the most costly and deadly natural hazards, which mostly afflict mountainous and steep terrain regions. A crucial role in the initiation of these events is attributed to subsurface hydrology and to how changes in the soil water regime can significantly affect the soil shear strength. Rainfall infiltration results in a decrease of matric suction, which is followed by a rapid drop in apparent cohesion. Especially on steep slopes in shallow soils, this loss of shear strength can lead to failure even in the unsaturated zone before positive water pressures are developed. Evidently, fundamental elements for an efficient prediction of rainfall-induced landslides are the interdependence of shear strength and suction, as well as the temporal evolution of suction during the wetting and drying process. A distributed physically based model, raster-based and continuous in space and time, was developed in order to investigate the interactions between surface and subsurface hydrology and shallow landslide initiation. In this effort emphasis is given to the modelling of the temporal evolution of hydrological processes and their triggering effects on soil slip occurrences. Specifically, the 3D variably saturated flow through soil and the resulting water balance is modelled using the Cellular Automata concept. Evapotranspiration, root water uptake and soil hydraulic hysteresis are taken into account for the continuous simulation of soil water content during storm and inter-storm periods. A multidimensional limit equilibrium analysis is utilized for the computation of the stability of every cell by taking into account the basic principles of unsaturated soil mechanics. A test case of a serious and spatially diffuse landslide event in Switzerland is investigated for the verification of the model.
Landslides of any type, and particularly soil slips, pose a great threat in mountainous and steep terrain environments. One of the major triggering mechanisms for slope failures in shallow soils is the build-up of soil pore water pressure resulting in a decrease of effective stress. However, infiltration may have other effects both before and after slope failure. Especially on steep slopes in shallow soils, soil slips can be triggered by a rapid drop in the apparent cohesion following a decrease in matric suction when a wetting front penetrates into the soil without generating positive pore pressures. These types of failures are very frequent in pre-alpine and alpine landscapes. The key factors for a realistic prediction of rainfall-induced landslides are the interdependence of shear strength and suction and the monitoring of suction changes during the cyclic wetting (due to infiltration) and drying (due to percolation and evaporation) processes. The non-unique relationship between suction and water content, expressed by the Soil Water Retention Curve, results in different values of suction and, therefore, of soil shear strength for the same water content, depending on whether the soil is being wetted (during storms) or dried (during inter-storm periods). We developed a physically based model, distributed in space and continuous in time, for the simulation of the hydrological triggering of shallow landslides at scales larger than a single slope. In this modeling effort particular weight is given to the modeling of hydrological processes in order to investigate the role of hydrological triggering mechanisms on soil changes leading to slip occurrences. Specifically, the 3D flow of water and the resulting water balance in the unsaturated and saturated zone is modeled using a Cellular Automata framework. The infinite slope analysis is coupled to the hydrological component of the model for the computation of slope stability. For the computation of the Factor of Safety a unified concept for effective stress under both saturated and unsaturated conditions has been used (Lu Ning and Godt Jonathan, WRR, 2010). A test case of a serious landslide event in Switzerland is investigated to assess the plausibility of the model and to verify its performance.
2. What is parallel computing?
Simultaneous use of multiple computing resources to solve a single computational problem.
The computing resources can be:
A single computer with multiple processors.
A number of computers connected to a network.
A combination of both.
Benefits of parallel computing:
The computational load is broken apart into discrete pieces of work that can be treated simultaneously.
The total simulation time is much less when multiple computing resources are used.
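To make "much less" concrete, the usual idealization (not stated on the slide, but standard) is the linear speedup bound, assuming p identical resources, a perfectly divisible load, and no overhead:

```latex
% Idealized case: p identical resources, perfect split, no overhead
T_p = \frac{T_1}{p}, \qquad S(p) = \frac{T_1}{T_p} = p
```

In practice, dependencies, communication, and load imbalance keep the speedup S(p) below p.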
3. Parallel Computer Classification
Flynn's taxonomy: a widely used classification.
Classify along two independent dimensions: Instruction and Data.
Each dimension can have two possible states: Single or Multiple.
SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
4. MIMD: Multiple Instruction, Multiple Data
The most common type of parallel computer (most modern parallel computers fall into this category).
Consists of a collection of fully independent processing units or cores, each having its own control unit and its own ALU.
Execution can be synchronous or asynchronous, as the processors can operate at their own pace.
[Figure 2.3: a shared-memory system; the CPUs reach a single shared memory through an interconnect.]
[Figure 2.4: a distributed-memory system; each CPU has its own memory, and the CPU-memory pairs communicate over an interconnect.]
5. Parallelism: An everyday example
Task parallelism: the ability to execute different tasks within a problem at the same time.
Data parallelism: the ability to execute parts of the same task on different data at the same time.
As an analogy, think about a farmer who hires workers to pick apples from an orchard of trees:
Worker = hardware (processing element).
Trees = tasks.
Apples = data.
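The analogy maps directly onto code. Below is a minimal CUDA sketch of data parallelism, in which one thread handles one array element just as one worker handles one apple; the kernel name, array size, and the doubling operation are illustrative assumptions, not from the slides.

```cuda
#include <cstdio>

// Data parallelism: one worker (thread) per apple (array element).
// Every thread executes the same instructions on different data.
__global__ void pickApples(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        out[i] = in[i] * 2.0f;                      // the "work" done per apple
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));     // unified memory for brevity
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover n
    pickApples<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();                        // wait before reading on host

    printf("out[0] = %f\n", out[0]);                // expect 2.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Each thread computes its own global index, so one instruction stream operates on many data elements at once, which is exactly the SIMD/data-parallel pattern from Flynn's taxonomy.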
6. Parallelism: Sequential approach
The sequential approach would be to have one worker pick all of the apples from each tree.
7. Parallelism: More workers
Data parallel hardware: the workers pick from the same tree, which allows each task to be completed more quickly.
How many workers should work per tree?
What if some trees have few apples, while others have many?
8. General Concepts
Parallelism: More workers
Task-parallel hardware: each worker picks apples from a different tree.
Although each task takes the same time as in the sequential version, many tasks are accomplished in parallel.
What if there are only a few densely populated trees?
9. General Concepts
Algorithm Decomposition
Most engineering problems are non-trivial, so it is crucial to have more formal concepts for determining parallelism.
Task decomposition: dividing the algorithm into individual tasks which are functionally independent.
Data decomposition: dividing a data set into discrete chunks that can be processed in parallel.
Tasks may have dependencies on other tasks:
If the input of task B depends on the output of task A, then task B is dependent on task A.
Tasks that don't have dependencies (or whose dependencies are completed) can be executed at any time to achieve parallelism.
Task dependency graphs are used to describe the relationships between tasks. For example, if B is dependent on A, B can only start after A completes; if A and B are independent of each other and C is dependent on both, then A and B may run in parallel before C. A sketch of how such a graph can be expressed on a GPU follows below.
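One way to express such a dependency graph on a GPU is with CUDA streams and events; the slides don't show this, so the sketch below (with illustrative kernel names taskA, taskB, taskC) is only one possible realization. It runs the independent tasks A and B concurrently and makes C wait for both:

    #include <cuda_runtime.h>

    __global__ void taskA(float* x) { }            // independent of B
    __global__ void taskB(float* y) { }            // independent of A
    __global__ void taskC(float* x, float* y) { }  // needs A's and B's results

    void runGraph(float* d_x, float* d_y) {
        cudaStream_t sA, sB;
        cudaEvent_t doneB;
        cudaStreamCreate(&sA);
        cudaStreamCreate(&sB);
        cudaEventCreate(&doneB);

        taskA<<<1, 256, 0, sA>>>(d_x);   // A and B have no dependencies,
        taskB<<<1, 256, 0, sB>>>(d_y);   // so they may execute in parallel
        cudaEventRecord(doneB, sB);

        // C is dependent on A and B: within stream sA it already follows A,
        // and this wait makes it also follow B.
        cudaStreamWaitEvent(sA, doneB, 0);
        taskC<<<1, 256, 0, sA>>>(d_x, d_y);

        cudaStreamSynchronize(sA);
        cudaEventDestroy(doneB);
        cudaStreamDestroy(sA);
        cudaStreamDestroy(sB);
    }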
10. GPU Programming
Why GPU Programming?
A quiet revolution and potential build-up:
Calculation: TFLOPS on the GPU vs. about 100 GFLOPS on the CPU.
Memory bandwidth: roughly 10x that of the CPU.
There is a GPU in every PC: massive volume and potential impact (figure courtesy of John Owens).
The performance gap between many-core GPUs and multi-core CPUs keeps enlarging.
Parallel programming is easier than ever because it can be done on relatively low-end PCs.
Cards based on the GT200 chip, such as the Nvidia Tesla C1060, contain 240 cores, each of which is highly multithreaded.
11. GPU Programming
GPU vs CPU
GPU: few instructions but very fast execution. Uses very fast GDDR3 RAM. Most die area is used for ALUs, and the caches are relatively small.
CPU: lots of instructions but slower execution. Uses slower DDR2 or DDR3 RAM (but has direct access to more memory than GPUs). Most die area is used for cache memory, and relatively few transistors are devoted to ALUs.
12. GPU Programming
GPU is fast
13. GPU Programming
CUDA: Compute Unified Device Architecture
A CUDA program consists of phases that are executed on either the host (CPU) or a device (GPU).
No data parallelism: the code is executed on the host.
Data parallelism: the code is executed on the device.
Data-parallel portions of an application are expressed as kernels which run on the device.
Arrays of Parallel Threads
GPU kernels are written using the Single Program Multiple Data (SPMD) programming model: multiple instances of the same program execute independently, each working on a different portion of the data.
A CUDA kernel is executed by an array of threads. All threads run the same code, and each thread has an ID that it uses to compute memory addresses and make control decisions:
threadID: 0 1 2 3 4 5 6 7 …

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
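To show how that fragment fits into a complete program, here is a minimal self-contained sketch; the kernel name apply, the choice of func, and the sizes are illustrative assumptions, not from the slides:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative per-element function; any pure function of x would do.
    __device__ float func(float x) { return x * x; }

    // Each thread computes its global ID and processes one element (SPMD).
    __global__ void apply(const float* input, float* output, int n) {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n) {              // guard against out-of-range threads
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    int main() {
        const int n = 1024;
        float h_in[n], h_out[n];
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        apply<<<blocks, threads>>>(d_in, d_out, n);

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("out[10] = %f\n", h_out[10]);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }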
14. GPU Programming
CUDA: Compute Unified Device Architecture
A CUDA kernel is executed by an array of threads. Each thread has an ID, which is used to compute memory addresses and make control decisions.
CUDA threads are organized into multiple blocks, which together form a grid.
Threads within a block cooperate via shared memory, atomic operations and barrier synchronization.
[Figure 2-1: Grid of thread blocks: a grid of blocks (0,0) through (2,1), with block (1,1) expanded into its 4x3 array of threads.]
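As a hedged illustration of that block-level cooperation (the kernel and array names are mine, not from the slides), this sketch sums the elements handled by one block through shared memory, with __syncthreads() as the barrier:

    // Sketch: block-level sum using shared memory and barrier synchronization.
    // Assumes the kernel is launched with exactly 256 threads per block
    // (a power of two, so the tree reduction below works).
    __global__ void blockSum(const float* input, float* blockSums, int n) {
        __shared__ float partial[256];      // one slot per thread in the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        partial[tid] = (gid < n) ? input[gid] : 0.0f;
        __syncthreads();                    // barrier: all loads visible to the block

        // Tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = partial[0];
    }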
15. GPU Programming
CUDA memory types
Global memory: low bandwidth but large space. Reads/writes are fastest when they are coalesced.
Texture memory: cache optimized for 2D spatial access patterns.
Constant memory: slow, but cached (8 KB cache per multiprocessor).
Shared memory: fast, but it can be used only by the threads of the same block.
Registers: 32768 32-bit registers per multiprocessor.
[Figure 4-2: Hardware model: the device is a set of SIMT multiprocessors with on-chip shared memory; each multiprocessor has per-processor registers, an instruction unit, and constant and texture caches on top of the off-chip device memory.]
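A minimal sketch of how these memory spaces appear in CUDA C code (the names and sizes are illustrative assumptions):

    // Constant memory: read-only from kernels, cached, filled from the host.
    __constant__ float simConstants[16];

    __global__ void memorySpaces(const float* globalIn, float* globalOut, int n) {
        // Shared memory: on-chip, visible only to the threads of the same block.
        __shared__ float tile[256];   // assumes 256 threads per block

        // Registers: ordinary local variables normally live in registers.
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        if (gid < n) {
            // Global memory: consecutive threads reading consecutive addresses
            // lets the hardware coalesce the loads into wide transactions.
            tile[threadIdx.x] = globalIn[gid];
            __syncthreads();
            globalOut[gid] = tile[threadIdx.x] * simConstants[0];
        }
    }

    // Host side, before the launch:
    //   cudaMemcpyToSymbol(simConstants, hostConstants, sizeof(hostConstants));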
16. CA Parallel implementation
A parallel version of the Cellular Automata (CA) algorithm for variably saturated flow in soils was developed with the CUDA API.
The infiltration experiment of Vauclin et al. (1979) was chosen as a benchmark test for the accuracy and the speed of the algorithm.
[Figure: simulated water depth (m) versus distance (m) at t = 2, 3, 4 and 8 hrs, compared against the experimental data.]
17. CA Parallel implementation
Why is parallel code important?
In real-case scenarios, where 3-D simulation of large areas is needed, the grid sizes are excessively large.
In natural hazards assessment the simulations must be fast in order to be useful (the prediction should come before the actual event!).
Fast simulations allow us to calibrate the model parameters more easily and to investigate the physical phenomena more efficiently.
The natural parallelism inherent in the CA concept makes the parallel implementation of the algorithm easier.
18. CA Parallel implementation
Technical details
Difficulties:
The most challenging issue was the irregular geometry of the domain, which made it harder to exploit locality in the thread computations and to use the shared memory.
The cell values were stored in a 1D array, and for each cell the indexes of its neighboring cells were also stored.
Code structure (a hedged sketch follows below):
Simulation constants are stored in constant memory.
Soil properties for each soil class are stored in texture memory.
Atomic operations are used to check for convergence at every iteration.
Shared memory is used to accelerate the atomic operations and the block's memory accesses.
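The presentation shows no code, so the following is only a sketch of what the described layout and convergence check could look like; all names, the neighbor count, the toy update rule, and the tolerance are assumptions:

    #define NUM_NEIGHBORS 4   // assumed neighborhood size; the actual CA may differ

    // Cell values live in a 1D array; for each cell the indexes of its
    // neighbors are stored explicitly to handle the irregular domain geometry.
    __global__ void caStep(const float* head, float* headNew,
                           const int* neighborIdx, int nCells,
                           int* notConverged) {
        __shared__ int blockFlag;             // per-block convergence flag
        if (threadIdx.x == 0) blockFlag = 0;
        __syncthreads();

        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell < nCells) {
            // Toy update rule standing in for the real variably-saturated-flow
            // CA: average the cell with its stored neighbors.
            float sum = head[cell];
            for (int k = 0; k < NUM_NEIGHBORS; ++k)
                sum += head[neighborIdx[cell * NUM_NEIGHBORS + k]];
            float updated = sum / (NUM_NEIGHBORS + 1);
            headNew[cell] = updated;

            // Convergence test: first accumulate in fast shared memory ...
            if (fabsf(updated - head[cell]) > 1e-6f) atomicOr(&blockFlag, 1);
        }
        __syncthreads();
        // ... then issue at most one atomic to slow global memory per block.
        if (threadIdx.x == 0 && blockFlag) atomicOr(notConverged, 1);
    }

In such a scheme the host would clear notConverged before each iteration, launch the kernel, swap the head/headNew buffers, copy the flag back, and stop iterating once it stays zero.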
19. CA Parallel implementation
Results of the numerical tests
Nvidia Quadro 2000:
192 CUDA cores.
1 GB of GDDR5 RAM.
[Figure: two plots against the number of cells (10^3 to 10^7). Left: speed (cells/sec, logarithmic axis from 10 to 100000) for the CPU and GPU versions. Right: speed-up of the GPU over the CPU (0 to 90).]