Parallel Computing: Perspectives for more efficient hydrological modeling


A presentation that introduces the basic concepts of parallel computing and gives some details on General Purpose GPU computing using the CUDA architecture.

Published in: Education, Technology

  1. Parallel Computing: Perspectives for more efficient hydrological modeling. Grigorios Anagnostopoulos. Internal Seminar, 11.10.2011.
  2. What is parallel computing? Simultaneous use of multiple computing resources to solve a single computational problem. The computing resources can be: a single computer with multiple processors; a number of computers connected by a network; a combination of both. Benefits of parallel computing: the computational load is broken into discrete pieces of work that can be processed simultaneously, and the total simulation time is much shorter when multiple computing resources are used.
  3. Parallel computer classification. Flynn's taxonomy is a widely used classification. It classifies along two independent dimensions: Instruction and Data. Each dimension can have two possible states: Single or Multiple. SISD: Single Instruction, Single Data. SIMD: Single Instruction, Multiple Data. MISD: Multiple Instruction, Single Data. MIMD: Multiple Instruction, Multiple Data.
  4. MIMD: Multiple Instruction, Multiple Data. The most common type of parallel computer (most modern parallel computers fall into this category). It consists of a collection of fully independent processing units or cores, each having its own control unit and its own ALU. Execution can be synchronous or asynchronous, as the processors can operate at their own pace. [Figure 2.3: A shared-memory system (CPUs connected to a common memory through an interconnect). Figure 2.4: A distributed-memory system (CPU-memory pairs connected through an interconnect).]
  5. Parallelism: An everyday example. Task parallelism: the ability to execute different tasks within a problem at the same time. Data parallelism: the ability to execute parts of the same task on different data at the same time. As an analogy, think about a farmer who hires workers to pick apples from his trees: Worker = hardware (processing element). Trees = tasks. Apples = data.
  6. Parallelism: Sequential approach. The sequential (serial) approach would be to have one worker pick all of the apples from each tree.
  7. Parallelism: More workers. Data parallelism: the workers all work on the same tree, which allows each task to be completed quicker. How many workers should there be per tree? What if some trees have few apples, while others have many?
  8. Parallelism: More workers. Task parallelism: each worker picks apples from a different tree. Although each task takes the same time as in the sequential version, many tasks are accomplished in parallel. What if there are only a few densely populated trees?
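The apple-picking analogy of the last slides can be put into numbers with a minimal Python sketch. It assumes that one worker picks one apple per time unit and that splitting work costs nothing; the apple counts, the worker count and all function names are made up for illustration and do not come from the slides.

```python
import math

def sequential_time(trees):
    """One worker picks every apple from every tree, one after another."""
    return sum(trees)

def task_parallel_time(trees):
    """One worker per tree: the fullest tree determines the finish time."""
    return max(trees)

def data_parallel_time(trees, workers):
    """All workers share each tree in turn, splitting its apples evenly."""
    return sum(math.ceil(apples / workers) for apples in trees)

trees = [10, 2, 30]   # hypothetical number of apples on each tree
workers = 3           # hypothetical number of workers

print("sequential:   ", sequential_time(trees))
print("task parallel:", task_parallel_time(trees))
print("data parallel:", data_parallel_time(trees, workers))
```

Under these assumptions the task-parallel time is dominated by the one densely populated tree, which is the load-imbalance question raised on the slide.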
  9. Algorithm decomposition. Most engineering problems are non-trivial, and it is crucial to have more formal concepts for determining parallelism. The concept of decomposition: Task decomposition: dividing the algorithm into individual tasks that are functionally independent. Tasks may have dependencies on other tasks: if the input of task B depends on the output of task A, then task B is dependent on task A. Tasks that have no dependencies (or whose dependencies are completed) can be executed at any time to achieve parallelism. Task dependency graphs are used to describe the relationships between tasks. Data decomposition: dividing a data set into discrete chunks that can be processed in parallel. [Figure: task dependency graphs. A -> B: B is dependent on A. A, B -> C: A and B are independent of each other; C is dependent on A and B.]
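A task dependency graph such as the A, B, C example can be turned into groups of tasks that may run at the same time. The following sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the helper name `parallel_waves` is an assumption made for this illustration.

```python
from graphlib import TopologicalSorter

def parallel_waves(dependencies):
    """Group tasks into waves; all tasks in one wave can run in parallel.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Tasks whose dependencies are all completed are ready to run.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Mark the whole wave as finished so dependent tasks become ready.
        sorter.done(*ready)
    return waves

# A and B are independent of each other; C is dependent on A and B.
dependencies = {"A": set(), "B": set(), "C": {"A", "B"}}
waves = parallel_waves(dependencies)
print(waves)
```

Each inner list is one group of tasks with no unfinished dependencies, so the number of waves gives a lower bound on the number of sequential steps.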
  10. Why GPU Programming? A quiet revolution and potential build-up. Calculation: TFLOPS vs. 100 GFLOPS. Memory bandwidth: ~10x. A GPU in every PC: massive volume and potential impact. [Figure 1.1: Enlarging performance gap between GPUs and CPUs (many-core GPU vs. multi-core CPU). Courtesy: John Owens.] Parallel programming is easier than ever because it can be done on relatively low-end PCs. Cards such as the Nvidia Tesla C1060 and GT200 contain 240 cores, each of which is highly multithreaded.
  11. GPU vs CPU. GPU: few instructions but very fast execution. Uses very fast GDDR3 RAM. Most die area is used for ALUs and the caches are relatively small. CPU: lots of instructions but slower execution. Uses slower DDR2 or DDR3 RAM (but it has direct access to more memory than GPUs). Most die area is used for memory cache and there are relatively few transistors for ALUs.
  12. GPU is fast.
  13. CUDA: Compute Unified Device Architecture. CUDA program: consists of phases that are executed on either the host (CPU) or a device (GPU). No data parallelism = the code is executed on the host. Data parallelism = the code is executed on the device. Data-parallel portions of an application are expressed as device kernels, which run on the device. GPU kernels are written using the Single Program Multiple Data (SPMD) programming model. SPMD executes multiple instances of the same program independently, where each program works on a different portion of the data. Arrays of parallel threads: a CUDA kernel is executed by an array of threads. All threads run the same code (SPMD). Each thread has an ID that it uses to compute memory addresses and make control decisions. Example for threadID = 0, 1, 2, ..., 7: float x = input[threadID]; float y = func(x); output[threadID] = y;
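The three-line kernel body on this slide can be mimicked on the CPU to show what SPMD means. The sketch below is plain Python, not CUDA: the `launch` helper runs the "threads" one after another, whereas a GPU would run them concurrently, and `func` is a placeholder because the slide does not say what it computes.

```python
def func(x):
    """Placeholder per-element computation (the slide leaves func undefined)."""
    return x * x

def kernel(thread_id, input_data, output_data):
    """Same program for every thread; only thread_id differs."""
    x = input_data[thread_id]       # the ID selects the memory address
    y = func(x)
    output_data[thread_id] = y

def launch(kernel_fn, n_threads, *args):
    """Hypothetical stand-in for a kernel launch: one call per thread ID."""
    for thread_id in range(n_threads):
        kernel_fn(thread_id, *args)

input_data = [float(i) for i in range(8)]   # one element per thread, IDs 0..7
output_data = [0.0] * len(input_data)
launch(kernel, len(input_data), input_data, output_data)
print(output_data)
```

Because every thread writes only to its own output element, the order in which the threads run does not change the result, which is what makes this pattern safe to execute in parallel.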
  14. CUDA: Compute Unified Device Architecture. A CUDA kernel is executed by an array of threads. Each thread has an ID, which is used to compute memory addresses and make control decisions. CUDA threads are organized into multiple blocks. Threads within a block cooperate via shared memory, atomic operations and barrier synchronization. [Figure 2-1: Grid of Thread Blocks. A grid of Blocks (0, 0) to (2, 1); Block (1, 1) is expanded into Threads (0, 0) to (3, 2).]
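How a thread ID becomes a memory address can be sketched for the grid in Figure 2-1 (3 x 2 blocks of 4 x 3 threads). The arithmetic below mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`, but it is written in plain Python, and the row-major layout of the flat array is an assumption made for this example.

```python
GRID_DIM = (3, 2)    # blocks per grid in x and y, as in Figure 2-1
BLOCK_DIM = (4, 3)   # threads per block in x and y, as in Figure 2-1
WIDTH = GRID_DIM[0] * BLOCK_DIM[0]   # total number of threads along x

def global_coordinate(block_idx, block_dim, thread_idx):
    """Position of a thread along one axis of the whole grid."""
    return block_idx * block_dim + thread_idx

def flat_index(block, thread):
    """Row-major index into a 1D array for a thread in a 2D grid."""
    x = global_coordinate(block[0], BLOCK_DIM[0], thread[0])
    y = global_coordinate(block[1], BLOCK_DIM[1], thread[1])
    return y * WIDTH + x

# Thread (2, 1) inside Block (1, 1)
example = flat_index((1, 1), (2, 1))
print(example)

# Every thread of the grid maps to a distinct array element.
all_indices = {
    flat_index((bx, by), (tx, ty))
    for bx in range(GRID_DIM[0]) for by in range(GRID_DIM[1])
    for tx in range(BLOCK_DIM[0]) for ty in range(BLOCK_DIM[1])
}
print(len(all_indices))
```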
  15. CUDA memory types. Global memory: low bandwidth but large space. Fastest read/write calls if they are coalesced. Texture memory: cache optimized for 2D spatial patterns. Constant memory: slow, but with cache (8 KB). Shared memory: fast, but it can be used only by the threads of the same block. Registers: 32768 32-bit registers per multiprocessor. [Figure 4-2: Hardware Model. A set of SIMT multiprocessors with on-chip shared memory; each multiprocessor contains processors with registers, an instruction unit, shared memory, a constant cache and a texture cache, and all multiprocessors access the device memory.]
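Why coalesced global-memory access is fastest can be illustrated by counting how many memory segments a group of consecutive threads touches. The sketch below is a simplified model in plain Python, not a measurement: the group size of 32 threads, the 4-byte elements and the 128-byte segments are assumptions that vary between GPU generations, and the function name is invented for this example.

```python
def segments_touched(stride, threads=32, element_bytes=4, segment_bytes=128):
    """Count distinct memory segments read when thread t accesses element t * stride.

    All default sizes are illustrative assumptions, not values from the slides.
    """
    addresses = (t * stride * element_bytes for t in range(threads))
    return len({address // segment_bytes for address in addresses})

for stride in (1, 2, 32):
    print(f"stride {stride:2d}: {segments_touched(stride)} segment(s)")
```

In this model, neighbouring threads that read neighbouring elements are served by a single segment, while a large stride forces one segment per thread.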
  16. CA Parallel implementation. A parallel version of the Cellular Automata algorithm for variably saturated flow in soils was developed with the CUDA API. The infiltration experiment of Vauclin et al. (1979) was chosen as a benchmark test for the accuracy and the speed of the algorithm. [Figure: water depth (m) against distance (m); legend: t = 2 hrs, t = 3 hrs, t = 4 hrs, t = 8 hrs, experimental data.]
  17. Why is parallel code important? In real-case scenarios, where the 3-D simulation of large areas is needed, the grid sizes are excessively large. In natural hazards assessment the simulations should be fast in order to be useful (the prediction should come before the actual event!). Fast simulations allow us to calibrate the model parameters more easily and to investigate the physical phenomena more efficiently. The natural parallelism inherent in the CA concept makes the parallel implementation of the algorithm easier.
  18. Technical details. Difficulties: the most challenging issue was the irregular geometry of the domain, which made it more difficult to exploit locality in the thread computations and to use the shared memory. The cell values were stored in a 1D array, and for each cell the indexes of its neighboring cells were also stored. Code structure: simulation constants are stored in the constant memory. Soil properties for each soil class are stored in the texture memory. Atomic operations are used to check for convergence at every iteration. The shared memory is used to accelerate the atomic operations and the block's memory accesses.
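The storage scheme described here (cell values in a 1D array plus a list of neighbor indexes per cell) can be sketched in plain Python. This is not the author's CUDA code: the update rule (average of the neighbors) is a placeholder for the soil-flow rule, the five-cell domain is invented, and the counter of unconverged cells only imitates what an atomic counter would accumulate on the GPU.

```python
def ca_step(values, neighbors, fixed):
    """One synchronous CA update; on a GPU each cell would be one thread."""
    new_values = list(values)            # write to a second buffer
    for cell, cell_neighbors in enumerate(neighbors):
        if cell in fixed or not cell_neighbors:
            continue                     # boundary cells keep their value
        # Placeholder rule: average of the neighbors' previous values.
        new_values[cell] = sum(values[n] for n in cell_neighbors) / len(cell_neighbors)
    return new_values

def run_until_converged(values, neighbors, fixed, tol=1e-6, max_iter=10_000):
    """Iterate until no cell changes by more than tol."""
    for iteration in range(1, max_iter + 1):
        new_values = ca_step(values, neighbors, fixed)
        # Stand-in for the atomic counter of cells that have not converged.
        not_converged = sum(
            1 for old, new in zip(values, new_values) if abs(new - old) > tol
        )
        values = new_values
        if not_converged == 0:
            return values, iteration
    return values, max_iter

# Hypothetical irregular domain: neighbor lists may have different lengths.
neighbors = [[1], [0, 2, 3], [1, 4], [1, 4], [2, 3]]
values = [1.0, 0.0, 0.0, 0.0, 0.0]
fixed = {0, 4}                           # cells 0 and 4 hold boundary values

result, iterations = run_until_converged(values, neighbors, fixed)
print([round(v, 3) for v in result], iterations)
```

Reading only the previous buffer and writing to a new one means no cell depends on another cell's update within the same iteration, which is the natural parallelism of the CA concept mentioned on the previous slide.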
  19. Results of the numerical tests. Nvidia Quadro 2000: 192 CUDA cores, 1 GB of GDDR5 RAM. [Figures: speed (cells/sec) against number of cells for the CPU and the GPU, and speed-up against number of cells, for 1,000 to 10,000,000 cells.]
  20. Thanks for your attention!
