A presentation that introduces the basic concepts of parallel computing and gives some details on General Purpose GPU computing using the CUDA architecture.
This document discusses GPU memory and how to optimize memory access patterns. It begins with an example of how a wide memory bus is used in GPUs. It describes the importance of coalescing memory accesses from multiple threads to fully utilize the bus bandwidth. It also discusses memory bank conflicts that can occur if multiple threads access the same memory bank, degrading performance. The key to high GPU memory bandwidth is coalescing accesses and avoiding bank conflicts.
HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization (AMD Developer Central)
Presentation HC-4012, Complex Network Clustering Using GPU-based Parallel Non-negative Matrix Factorization, by Huming Zhu at the AMD Developer Summit (APU13) November 11-13, 2013.
This document discusses how work groups are scheduled for execution on GPU compute units. It explains that work groups are broken down into hardware schedulable units known as warps or wavefronts. These group threads together and execute instructions in lockstep. The document covers thread scheduling, effects of divergent control flow, predication, warp voting, and optimization techniques like maximizing occupancy.
Knitting boar - Toronto and Boston HUGs - Nov 2012 (Josh Patterson)
1) The document discusses machine learning and parallel iterative algorithms like stochastic gradient descent. It introduces the Mahout machine learning library and describes an implementation of parallel SGD called Knitting Boar that runs on YARN.
2) Knitting Boar parallelizes Mahout's SGD algorithm by having worker nodes process partitions of the training data in parallel while a master node merges their results.
3) The author argues that approaches like Knitting Boar and IterativeReduce provide better ways to implement machine learning algorithms for big data compared to traditional MapReduce.
IRJET - Latin Square Computation of Order-3 using OpenCL (IRJET Journal)
This document discusses using OpenCL parallel programming to compute Latin squares of order 3 more efficiently than sequential algorithms. It proposes dividing the input matrix into sub-matrices that are processed concurrently by multiple processing elements in the GPU. This parallel approach reduces the computation time compared to performing the operations sequentially on the CPU. First, the input matrix is divided based on task or data parallelism. Then the sub-matrices are computed simultaneously by different processing elements. The results are combined and stored in GPU memory before being transferred to CPU memory and output. Implementing the Latin square computation with OpenCL exploits parallelism to improve efficiency over the traditional sequential approach.
This lecture covers the principles and architectures of modern cluster schedulers, including Apache Mesos, Apache YARN, Google Borg and Kubernetes (K8s), with some notes on Omega.
This document discusses parallel computing with GPUs. It introduces parallel computing, GPUs, and CUDA. It describes how GPUs are well-suited for data-parallel applications due to their large number of cores and throughput-oriented design. The CUDA programming model is also summarized, including how kernels are launched on the GPU from the CPU. Examples are provided of simple CUDA programs to perform operations like squaring elements in parallel on the GPU.
This document discusses approaches to programming multiple devices in OpenCL, including using a single context with multiple devices or multiple contexts. With a single context, memory objects are shared but data must be explicitly transferred between devices. Multiple contexts allow splitting work by device but require extra communication. Load balancing work between heterogeneous CPUs and GPUs requires considering scheduling overhead and data location.
1. Building exascale computers requires moving to sub-nanometer scales and steering individual electrons to solve problems more efficiently.
2. Moving data is a major challenge, as moving data off-chip uses 200x more energy than computing with it on-chip.
3. Future computers should optimize for data movement at all levels, from system design to microarchitecture, to minimize energy usage.
The document summarizes two papers about MapReduce frameworks for cloud computing. The first paper describes Hadoop, which uses MapReduce and HDFS to process large amounts of distributed data across clusters. HDFS stores data across cluster nodes in a fault-tolerant manner, while MapReduce splits jobs into parallel map and reduce tasks. The second paper discusses P2P-MapReduce, which allows for a dynamic cloud environment where nodes can join and leave. It uses a peer-to-peer model where nodes can be masters or slaves, and maintains backup masters to prevent job loss if the primary master fails.
Machine Learning with New Hardware Challenges (Oscar Law)
Describes basic neural network design with a focus on the Convolutional Neural Network architecture. Explains why CPUs and GPUs can't fulfill CNN hardware requirements. Lists three hardware examples: Nvidia, Microsoft and Google. Finally, highlights optimization approaches for CNN design.
This document discusses optimizations for implementing an N-body simulation algorithm on GPUs using OpenCL. It begins with an overview of the basic N-body algorithm and its parallel implementation. Two key optimizations are explored: using local memory to enable data reuse across work items, and unrolling the computation loop. Performance results on AMD and Nvidia GPUs show that data reuse provides significant speedup, and loop unrolling further improves performance on the AMD GPU. An example N-body application is provided to experiment with these optimization techniques.
Parallel computing uses multiple processors simultaneously to solve computational problems faster. It allows solving larger problems or more problems in less time. Shared memory parallel programming with tools like OpenMP and pthreads is used for multicore processors that share memory. Distributed memory parallel programming with MPI is used for large clusters with separate processor memories. GPU programming with CUDA is also widely used to leverage graphics hardware for SIMD-style parallel tasks. The key challenges in parallel programming are load balancing, communication overhead, and synchronization between processors.
A SURVEY ON GPU SYSTEM CONSIDERING ITS PERFORMANCE ON DIFFERENT APPLICATIONS (cseij)
This document summarizes a survey on GPU systems and their performance on different applications. It discusses how GPUs can be used for general purpose computing due to their high parallel processing capabilities. Several computational intensive applications that achieve speedups when implemented on GPUs are described, including video decoding, matrix multiplication, parallel AES encryption, and password recovery for MS office documents. The GPU architecture and Nvidia's CUDA programming model are also summarized. While GPUs provide significant performance benefits, some limitations for non-graphics applications are noted. The conclusion is that GPUs are a good alternative for computational intensive tasks to reduce CPU load and improve performance compared to CPU-only implementations.
AVOIDING DUPLICATED COMPUTATION TO IMPROVE THE PERFORMANCE OF PFSP ON CUDA GPUS (csandit)
The document discusses improving the performance of the Tabu Search algorithm for solving the Permutation Flowshop Scheduling Problem (PFSP) on CUDA GPUs by avoiding duplicated computation among threads. It first provides background on GPU architecture, the PFSP problem, and related parallelization methods. It then observes that if two permutations share the same prefix, their completion time tables will contain identical column data equal to the length of the prefix, leading to duplicated computation. The paper proposes an approach where each thread is assigned a permutation and allocated shared memory to store and compute the completion time table in parallel, avoiding this duplicated work by leveraging the shared prefix property. Experimental results show the new approach runs up to 1.5 times faster than an existing implementation.
KnittingBoar Toronto Hadoop User Group Nov 27 2012 (Adam Muise)
This document discusses machine learning and parallel iterative algorithms. It provides an introduction to machine learning and Mahout. It then describes Knitting Boar, a system for parallelizing stochastic gradient descent on Hadoop YARN. Knitting Boar partitions data among workers that perform online logistic regression in batches. The workers send gradient updates to a master node, which averages the updates to produce a new global model. Experimental results show Knitting Boar achieves roughly linear speedup. The document concludes by discussing developing YARN applications and the Knitting Boar codebase.
Hadoop interview questions for freshers and experienced people. A good resource for beginners and experts who are eager to learn Hadoop from scratch.
Read more here http://softwarequery.com/hadoop/
Apache Hadoop India Summit 2011 Keynote talk "Programming Abstractions for Sm..." (Yahoo Developer Network)
This document discusses programming abstractions for smart applications on clouds. It proposes a new programming model called Deformable Mesh Abstraction (DMA) that addresses limitations in existing models like MapReduce. DMA allows tasks to recursively spawn new tasks at runtime, supports efficient communication through a shared structure, and can operate on changing datasets. The document describes how DMA can model heuristic problem solving and presents case studies applying DMA to AI planners. It also discusses how DMA could be extended to support file systems and integrated with Hadoop.
We live in an era where the atomic building elements of silicon computers, e.g., transistors and wires, are no longer visible with traditional optical microscopes and their sizes are measured in just tens of Angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other things, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, CEO and Founder of Auviz Systems, presents the "Trade-offs in Implementing Deep Neural Networks on FPGAs" tutorial at the May 2015 Embedded Vision Summit.
Video and images are a key part of Internet traffic—think of all the data generated by social networking sites such as Facebook and Instagram—and this trend continues to grow. Extracting usable information from video and images is thus a growing requirement in the data center. For example, object and face recognition are valuable for a wide range of uses, from social applications to security applications. Deep neural networks are currently the most popular form of convolutional neural networks (CNN) used in data centers for such applications. 3D convolutions are a core part of CNNs. Nagesh presents alternative implementations of 3D convolutions on FPGAs, and discusses trade-offs among them.
An Introduction to TensorFlow architecture (Mani Goswami)
Introduces you to the internals of TensorFlow and deep dives into distributed version of TensorFlow. Refer to https://github.com/manigoswami/tensorflow-examples for examples.
In this session, we will introduce “Knitting Boar”, an open-source Java library for performing distributed online learning on a Hadoop cluster under YARN. We will give an overview of how Woven Wabbit works and examine the lessons learned from YARN application construction.
This document discusses synchronization, timing, and profiling in OpenCL. It covers coarse-grained synchronization at the command queue level and fine-grained synchronization at the function call level using events. It describes how to use events for timing, profiling, and asynchronous host-device communication. It provides an example of how asynchronous I/O can improve performance in medical imaging applications by overlapping computation and data transfers.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which may lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs and some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how a GPU works, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by Leo Meyerovich and Matthew Torok (AMD Developer Central)
Presentation WT-4065, Superconductor: GPU Web Programming for Big Data Visualization, by Leo Meyerovich and Matthew Torok at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Monte Carlo simulation is well-suited for GPU acceleration due to its highly parallel nature. GPUs provide lower cost and higher performance than CPUs for Monte Carlo applications. Numerical libraries for GPUs allow developers to focus on their models rather than reimplementing basic components. NAG has developed GPU libraries including random number generators and is working with financial institutions to apply Monte Carlo simulations to problems in finance.
The document describes a proposed grid computing framework that aims to make grid computing easier to deploy, use, and maintain. The framework would accept computational problems from users, distribute tasks to client machines based on dependencies and load balancing, collect and compile results from clients, and present outputs to the user. The framework is intended to address concerns with existing grid middleware being complicated and not accessible to all, and will be open source, Linux-based, and work on a moderately sized local area network.
This document discusses patterns for parallel computing. It outlines key concepts like Amdahl's law and types of parallelism like data and task parallelism. Examples are provided of how major tech companies like Microsoft, Google, Amazon implement parallelism at different levels of their infrastructure and applications to scale efficiently. Design principles are discussed for converting sequential programs to parallel programs while maintaining performance.
The document discusses the CAP theorem which states that it is impossible for a distributed computer system to simultaneously provide consistency, availability, and partition tolerance. It defines these terms and explores how different systems address the tradeoffs. Consistency means all nodes see the same data at the same time. Availability means every request results in a response. Partition tolerance means the system continues operating despite network failures. The CAP theorem says a system can only choose two of these properties. The document discusses how different types of systems, like CP and AP systems, handle partitions and trade off consistency and availability. It also notes the CAP theorem is more nuanced in reality with choices made at fine granularity within systems.
This document provides an introduction to peer-to-peer (P2P) computer networks. It discusses how P2P networks rely on the computing power and bandwidth of participants rather than centralized servers. The document then covers several examples of P2P networks including Gnutella and Kademlia, and discusses techniques like distributed hash tables, queries, and node joining/leaving.
NoSQL databases, the CAP theorem, and the theory of relativity (Lars Marius Garshol)
The document discusses NoSQL databases and the CAP theorem. It begins by providing an overview of NoSQL databases, their key features like being schemaless and supporting eventual consistency over ACID transactions. It then explains the CAP theorem - that a distributed system can only provide two of consistency, availability, and partition tolerance. It also discusses how Google's Spanner database achieves consistency and scalability using ideas from Lamport's Paxos algorithm and a new time service called TrueTime.
Please contact me to download this presentation. A comprehensive presentation on the field of parallel computing, whose applications are only growing day by day. A useful seminar covering the basics, classification, and implementation thoroughly.
Visit www.ameyawaghmare.wordpress.com for more info
This document discusses NoSQL and the CAP theorem. It begins with an introduction of the presenter and an overview of topics to be covered: What is NoSQL and the CAP theorem. It then defines NoSQL, provides examples of major NoSQL categories (document, graph, key-value, and wide-column stores), and explains why NoSQL is used, including to handle large, dynamic, and distributed data. The document also explains the CAP theorem, which states that a distributed data store can only satisfy two of three properties: consistency, availability, and partition tolerance. It provides examples of how to choose availability over consistency or vice versa. Finally, it concludes that both SQL and NoSQL have valid use cases, and a combination of the two may be appropriate.
Migration To Multi Core - Parallel Programming Models (Zvi Avraham)
The document discusses multi-core and many-core processors and parallel programming models. It provides an overview of hardware trends including increasing numbers of cores in CPUs and GPUs. It also covers parallel programming approaches like shared memory, message passing, data parallelism and task parallelism. Specific APIs discussed include Win32 threads, OpenMP, and Intel TBB.
Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture (mohamedragabslideshare)
This document summarizes research on revisiting co-processing techniques for hash joins on coupled CPU-GPU architectures. It discusses three co-processing mechanisms: off-loading, data dividing, and pipelined execution. Off-loading involves assigning entire operators like joins to either the CPU or GPU. Data dividing partitions data between the processors. Pipelined execution aims to schedule workloads adaptively between the CPU and GPU to maximize efficiency on the coupled architecture. The researchers evaluate these approaches for hash join algorithms, which first partition, build hash tables, and probe tables on the input relations.
CUDA is a parallel computing platform developed by NVIDIA that allows developers to use GPUs for general purpose processing. It extends programming languages like C, C++ and Fortran to leverage the parallel processing capabilities of GPUs. The CUDA platform divides a program into portions that run on the CPU and GPU - the CPU handles control tasks while the GPU executes extensive calculations in parallel across its many cores. This approach of using GPUs for general computations beyond graphics is called GPGPU (general-purpose computing on graphics processing units). Parallel computing solves problems faster by breaking them into discrete parts that can be processed simultaneously, unlike serial computing, which handles one instruction at a time.
This document provides an introduction to parallel computing. It discusses serial versus parallel computing and how parallel computing involves simultaneously using multiple compute resources to solve problems. Common parallel computer architectures involve multiple processors on a single computer or connecting multiple standalone computers together in a cluster. Parallel computers can use shared memory, distributed memory, or hybrid memory architectures. The document outlines some of the key considerations and challenges in moving from serial to parallel code such as decomposing problems, identifying dependencies, mapping tasks to resources, and handling dependencies.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how GPUs are optimized for graphics applications through massively parallel and highly multithreaded designs. Programming models like CUDA allow GPUs to be used for general purpose computing by addressing thread, data, and task parallelism. Overall, the document outlines how multi-core and GPU technologies enable computers to better utilize parallelism for improved performance.
The document discusses advancements in computer architecture, including multi-core computers, multithreading, and GPUs. It describes how multi-core processors integrate multiple processor cores on a single chip to provide cheap parallel processing and increase computation power. It also discusses how multithreading exploits thread-level parallelism and how GPUs are optimized for parallel graphics applications through thousands of simple processor cores focused on throughput over latency. The document provides examples of Intel's multi-core chips and the Polaris chip with 80 cores, and explains how applications can benefit from multi-core and multi-threaded programming.
This document discusses advance computer architectures including multi-core computers, multithreading, and GPUs. It provides information on multi-core systems and how they integrate multiple processor cores on a single chip to provide cheap parallel computing. It also discusses limitations of single core architectures and how multithreading enables parallelism through dividing instruction streams into threads. Finally, it covers GPUs and how they are optimized for parallel processing of graphics applications using thousands of simpler cores compared to CPUs.
This document discusses advance computer architectures including multi-core computers, multithreading, and GPUs. It provides information on multi-core systems having multiple processor cores on a single chip that share memory. It discusses how multi-core processors address limitations of single core designs by providing cheaper parallelism while increasing computation power. The document also covers multithreading, different approaches, and how programming must support multi-core through multiple threads or processes. Finally, it introduces GPUs, how they are optimized for graphics applications through parallelism and throughput, and how CUDA enables general purpose programming on GPUs.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
This document provides an overview of parallel and distributed computing. It begins by outlining the key learning outcomes of studying this topic, which include defining parallel algorithms, analyzing parallel performance, applying task decomposition techniques, and performing parallel programming. It then reviews the history of computing from the batch era to today's network era. The rest of the document discusses parallel computing concepts like Flynn's taxonomy, shared vs distributed memory systems, limits of parallelism based on Amdahl's law, and different types of parallelism including bit-level, instruction-level, data, and task parallelism. It concludes by covering parallel implementation in both software through parallel programming and in hardware through parallel processing.
This document provides an overview of parallel computing. It discusses why parallel computation is needed due to limitations in increasing processor speed. It then covers various parallel platforms including shared and distributed memory systems. It describes different parallel programming models and paradigms including MPI, OpenMP, Pthreads, CUDA and more. It also discusses key concepts like load balancing, domain decomposition, and synchronization which are important for parallel programming.
The document provides an overview of introductory GPGPU programming with CUDA. It discusses why GPUs are useful for parallel computing applications due to their high FLOPS and memory bandwidth capabilities. It then outlines the CUDA programming model, including launching kernels on the GPU with grids and blocks of threads, and memory management between CPU and GPU. As an example, it walks through a simple matrix multiplication problem implemented on the CPU and GPU to illustrate CUDA programming concepts.
This document provides an overview of parallel and distributed computing using GPUs. It discusses GPU architecture and how GPUs are designed for massively parallel processing using hundreds of smaller cores compared to CPUs which use 4-8 larger cores. The document also covers GPU memory hierarchy, programming GPUs using OpenCL, and key concepts like work items, work groups, and occupancy which is keeping GPU compute units busy with work to process.
This document discusses using GPUs to improve the performance of content-based matching. It describes how GPUs can process subscriptions and events in parallel using thousands of lightweight threads. The algorithm stores constraints in arrays to maximize memory coalescing. Testing shows the GPU implementation is 7-13x faster than software on CPUs and can process over 9,000 events per second while using modest memory. Future work includes integrating the algorithm into a real system and exploring probabilistic matching.
The document discusses MapReduce and the Hadoop framework. It provides an overview of how MapReduce works, examples of problems it can solve, and how Hadoop implements MapReduce at scale across large clusters in a fault-tolerant manner using the HDFS distributed file system and YARN resource management.
This document provides an outline of manycore GPU architectures and programming. It introduces GPU architectures, the GPGPU concept, and CUDA programming. It discusses the GPU execution model, CUDA programming model, and how to work with different memory types in CUDA like global, shared and constant memory. It also covers streams and concurrency, CUDA intrinsics and libraries, performance profiling and debugging. Finally, it mentions directive-based programming models like OpenACC and OpenMP.
The document discusses big data and distributed computing. It explains that big data refers to large, unstructured datasets that are too large for traditional databases. Distributed computing uses multiple computers connected via a network to process large datasets in parallel. Hadoop is an open-source framework for distributed computing that uses MapReduce and HDFS for parallel processing and storage across clusters. HDFS stores data redundantly across nodes for fault tolerance.
Architecting and productionising data science applications at scale (samthemonad)
This document discusses architecting and productionizing data science applications at scale. It covers topics like parallel processing with Spark, streaming platforms like Kafka, and scalable machine learning approaches. It also discusses architectures for data pipelines and productionizing models, with a focus on automation, avoiding SQL databases, and using Kafka streams and Spark for batch and streaming workloads.
Similar to Parallel Computing: Perspectives for more efficient hydrological modeling
In a recent study (Koutsoyiannis et al., On the credibility of climate predictions, Hydrological Sciences Journal, 53 (4), 671–684, 2008), the credibility of climate predictions was assessed based on comparisons with long series of observations. Extending this research, which compared the outputs of various climatic models to temperature and precipitation observations from 8 stations around the globe, we test the performance of climate models at over 50 additional stations. Furthermore, we make comparisons at a large sub-continental spatial scale after integrating modelled and observed series.
Cellular Automata are used in various disciplines for the modeling of complex system processes. Their inherent simplicity and their natural parallelism make them a very efficient tool for the simulation of large scale physical phenomena. We explore the framework of Cellular Automata to develop a physically based model for the spatial and temporal prediction of shallow landslides. Particular weight is given to the modeling of hydrological processes in order to investigate the hydrological triggering mechanisms and the importance of continuous modeling of the water balance to detect the timing and location of soil slip occurrences. Specifically, the 3D flow of water and the resulting water balance in the unsaturated and saturated zone is modeled taking into account important phenomena such as hydraulic hysteresis and evapotranspiration. In this poster the hydrological component of the model will be presented and tested against well-established benchmark experiments [Vauclin et al, 1975; Vauclin et al, 1979]. Furthermore, we investigate the applicability of incorporating it in a hydrological catchment model for the prediction (temporal and spatial) of rainfall-triggered shallow landslides.
An introductory presentation of my PhD research covering rainfall-induced landslides, subsurface hydrology, unsaturated soil mechanics, Ground Penetrating Radar, and some experimental data from a field campaign that I conducted.
A distributed physically based model to predict timing and spatial distributi...Grigoris Anagnostopoulos
Shallow landslides induced by rainfall are among the most costly and deadly natural hazards, which mostly afflict mountainous and steep terrain regions. A crucial role in the initiation of these events is attributed to subsurface hydrology and to how changes in the soil water regime can significantly affect the soil shear strength. Rainfall infiltration results in a decrease of matric suction, which is followed by a rapid drop in apparent cohesion. Especially on steep slopes in shallow soils, this loss of shear strength can lead to failure even in the unsaturated zone before positive water pressures are developed. Evidently, fundamental elements for an efficient prediction of rainfall-induced landslides are the interdependence of shear strength and suction, as well as the temporal evolution of suction during the wetting and drying process. A distributed physically based model, raster-based and continuous in space and time, was developed in order to investigate the interactions between surface and subsurface hydrology and shallow landslide initiation. In this effort emphasis is given to the modelling of the temporal evolution of hydrological processes and their triggering effects on soil slip occurrences. Specifically, the 3D variably saturated flow through soil and the resulting water balance is modelled using the Cellular Automata concept. Evapotranspiration, root water uptake and soil hydraulic hysteresis are taken into account for the continuous simulation of soil water content during storm and inter-storm periods. A multidimensional limit equilibrium analysis is utilized for the computation of the stability of every cell by taking into account the basic principles of unsaturated soil mechanics. A test case of a serious and spatially diffuse landslide event in Switzerland is investigated for the verification of the model.
Landslides of any type, and particularly soil slips, pose a great threat in mountainous and steep terrain environments. One of the major triggering mechanisms for slope failures in shallow soils is the build-up of soil pore water pressure resulting in a decrease of effective stress. However, infiltration may have other effects both before and after slope failure. Especially on steep slopes in shallow soils, soil slips can be triggered by a rapid drop in the apparent cohesion following a decrease in matric suction when a wetting front penetrates into the soil without generating positive pore pressures. These types of failures are very frequent in pre-alpine and alpine landscapes. The key factors for a realistic prediction of rainfall-induced landslides are the interdependence of shear strength and suction and the monitoring of suction changes during the cyclic wetting (due to infiltration) and drying (due to percolation and evaporation) processes. The non-unique relationship between suction and water content, expressed by the Soil Water Retention Curve, results in different values of suction and, therefore, of soil shear strength for the same water content, depending on whether the soil is being wetted (during storms) or dried (during inter-storm periods). We developed a physically based model, distributed in space and continuous in time, for the simulation of the hydrological triggering of shallow landslides at scales larger than a single slope. In this modeling effort particular weight is given to the modeling of hydrological processes in order to investigate the role of hydrological triggering mechanisms on soil changes leading to slip occurrences. Specifically, the 3D flow of water and the resulting water balance in the unsaturated and saturated zone is modeled using a Cellular Automata framework. The infinite slope analysis is coupled to the hydrological component of the model for the computation of slope stability. For the computation of the Factor of Safety a unified concept for effective stress under both saturated and unsaturated conditions has been used (Lu Ning and Godt Jonathan, WRR, 2010). A test case of a serious landslide event in Switzerland is investigated to assess the plausibility of the model and to verify its performance.
2. What is parallel computing?
Simultaneous use of multiple computing resources to solve a single computational problem.
The computing resources can be:
A single computer with multiple processors.
A number of computers connected to a network.
A combination of both.
Benefits of parallel computing:
The computational load is broken apart into discrete pieces of work that can be treated simultaneously.
The total simulation time is much less when multiple computing resources are used.
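To make "much less" concrete, the usual idealization (not stated on the slide, but standard) is the linear speedup bound, assuming p identical resources, a perfectly divisible load, and no overhead:

```latex
% Idealized case: p identical resources, perfect split, no overhead
T_p = \frac{T_1}{p}, \qquad S(p) = \frac{T_1}{T_p} = p
```

In practice, dependencies, communication, and load imbalance keep the speedup S(p) below p.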
3. Parallel Computer Classification
Flynn's taxonomy: a widely used classification.
Classify along two independent dimensions: Instruction and Data.
Each dimension can have two possible states: Single or Multiple.
SISD: Single Instruction, Single Data
SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data
MIMD: Multiple Instruction, Multiple Data
4. MIMD: Multiple Instruction, Multiple Data
The most common type of parallel computer (most modern parallel computers fall into this category).
Consists of a collection of fully independent processing units or cores, each having its own control unit and its own ALU.
Execution can be synchronous or asynchronous, as the processors can operate at their own pace.
[Figure 2.3: a shared-memory system; the CPUs reach a single shared memory through an interconnect.]
[Figure 2.4: a distributed-memory system; each CPU has its own memory, and the CPU-memory pairs communicate over an interconnect.]
5. Parallelism: An everyday example
Task parallelism: the ability to execute different tasks within a problem at the same time.
Data parallelism: the ability to execute parts of the same task on different data at the same time.
As an analogy, think about a farmer who hires workers to pick apples from an orchard of trees:
Worker = hardware (processing element).
Trees = tasks.
Apples = data.
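The analogy maps directly onto code. Below is a minimal CUDA sketch of data parallelism, in which one thread handles one array element just as one worker handles one apple; the kernel name, array size, and the doubling operation are illustrative assumptions, not from the slides.

```cuda
#include <cstdio>

// Data parallelism: one worker (thread) per apple (array element).
// Every thread executes the same instructions on different data.
__global__ void pickApples(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against overrun
        out[i] = in[i] * 2.0f;                      // the "work" done per apple
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));     // unified memory for brevity
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover n
    pickApples<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();                        // wait before reading on host

    printf("out[0] = %f\n", out[0]);                // expect 2.0
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Each thread computes its own global index, so one instruction stream operates on many data elements at once, which is exactly the SIMD/data-parallel pattern from Flynn's taxonomy.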
6. Parallelism: Sequential approach
The sequential approach would be to have one worker pick all of the apples from each tree.
7. Parallelism: More workers
Data parallel hardware: the workers pick from the same tree, which allows each task to be completed more quickly.
How many workers should work per tree?
What if some trees have few apples, while others have many?
8. General Concepts
Parallelism: More workers
Task-parallel hardware: each worker picks apples from a different tree.
Although each task takes the same time as in the sequential version, many tasks are accomplished in parallel.
What if there are only a few densely populated trees?
9. General Concepts
Algorithm Decomposition
Most engineering problems are non-trivial, so it is crucial to have more formal concepts for determining parallelism.
Task decomposition: dividing the algorithm into individual tasks which are functionally independent.
Data decomposition: dividing a data set into discrete chunks that can be processed in parallel.
Tasks may have dependencies on other tasks:
If the input of task B depends on the output of task A, then task B is dependent on task A.
Tasks that don't have dependencies (or whose dependencies are completed) can be executed at any time to achieve parallelism.
Task dependency graphs are used to describe the relationships between tasks. For example, if B is dependent on A, B can only start after A completes; if A and B are independent of each other and C is dependent on both, then A and B may run in parallel before C. A sketch of how such a graph can be expressed on a GPU follows below.
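One way to express such a dependency graph on a GPU is with CUDA streams and events; the slides don't show this, so the sketch below (with illustrative kernel names taskA, taskB, taskC) is only one possible realization. It runs the independent tasks A and B concurrently and makes C wait for both:

    #include <cuda_runtime.h>

    __global__ void taskA(float* x) { }            // independent of B
    __global__ void taskB(float* y) { }            // independent of A
    __global__ void taskC(float* x, float* y) { }  // needs A's and B's results

    void runGraph(float* d_x, float* d_y) {
        cudaStream_t sA, sB;
        cudaEvent_t doneB;
        cudaStreamCreate(&sA);
        cudaStreamCreate(&sB);
        cudaEventCreate(&doneB);

        taskA<<<1, 256, 0, sA>>>(d_x);   // A and B have no dependencies,
        taskB<<<1, 256, 0, sB>>>(d_y);   // so they may execute in parallel
        cudaEventRecord(doneB, sB);

        // C is dependent on A and B: within stream sA it already follows A,
        // and this wait makes it also follow B.
        cudaStreamWaitEvent(sA, doneB, 0);
        taskC<<<1, 256, 0, sA>>>(d_x, d_y);

        cudaStreamSynchronize(sA);
        cudaEventDestroy(doneB);
        cudaStreamDestroy(sA);
        cudaStreamDestroy(sB);
    }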
10. GPU Programming
Why GPU Programming?
A quiet revolution and potential build-up:
Calculation: TFLOPS on the GPU vs. about 100 GFLOPS on the CPU.
Memory bandwidth: roughly 10x that of the CPU.
There is a GPU in every PC: massive volume and potential impact (figure courtesy of John Owens).
The performance gap between many-core GPUs and multi-core CPUs keeps enlarging.
Parallel programming is easier than ever because it can be done on relatively low-end PCs.
Cards based on the GT200 chip, such as the Nvidia Tesla C1060, contain 240 cores, each of which is highly multithreaded.
11. GPU Programming
GPU vs CPU
GPU: few instructions but very fast execution. Uses very fast GDDR3 RAM. Most die area is used for ALUs, and the caches are relatively small.
CPU: lots of instructions but slower execution. Uses slower DDR2 or DDR3 RAM (but has direct access to more memory than GPUs). Most die area is used for cache memory, and relatively few transistors are devoted to ALUs.
12. GPU Programming
GPU is fast
13. GPU Programming
CUDA: Compute Unified Device Architecture
A CUDA program consists of phases that are executed on either the host (CPU) or a device (GPU).
No data parallelism: the code is executed on the host.
Data parallelism: the code is executed on the device.
Data-parallel portions of an application are expressed as kernels which run on the device.
Arrays of Parallel Threads
GPU kernels are written using the Single Program Multiple Data (SPMD) programming model: multiple instances of the same program execute independently, each working on a different portion of the data.
A CUDA kernel is executed by an array of threads. All threads run the same code, and each thread has an ID that it uses to compute memory addresses and make control decisions:
threadID: 0 1 2 3 4 5 6 7 …

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
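To show how that fragment fits into a complete program, here is a minimal self-contained sketch; the kernel name apply, the choice of func, and the sizes are illustrative assumptions, not from the slides:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative per-element function; any pure function of x would do.
    __device__ float func(float x) { return x * x; }

    // Each thread computes its global ID and processes one element (SPMD).
    __global__ void apply(const float* input, float* output, int n) {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n) {              // guard against out-of-range threads
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    int main() {
        const int n = 1024;
        float h_in[n], h_out[n];
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        float *d_in, *d_out;
        cudaMalloc((void**)&d_in,  n * sizeof(float));
        cudaMalloc((void**)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // Launch enough 256-thread blocks to cover all n elements.
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        apply<<<blocks, threads>>>(d_in, d_out, n);

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("out[10] = %f\n", h_out[10]);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }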
14. GPU Programming
CUDA: Compute Unified Device Architecture
A CUDA kernel is executed by an array of threads. Each thread has an ID, which is used to compute memory addresses and make control decisions.
CUDA threads are organized into multiple blocks, which together form a grid.
Threads within a block cooperate via shared memory, atomic operations and barrier synchronization.
[Figure 2-1: Grid of thread blocks: a grid of blocks (0,0) through (2,1), with block (1,1) expanded into its 4x3 array of threads.]
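As a hedged illustration of that block-level cooperation (the kernel and array names are mine, not from the slides), this sketch sums the elements handled by one block through shared memory, with __syncthreads() as the barrier:

    // Sketch: block-level sum using shared memory and barrier synchronization.
    // Assumes the kernel is launched with exactly 256 threads per block
    // (a power of two, so the tree reduction below works).
    __global__ void blockSum(const float* input, float* blockSums, int n) {
        __shared__ float partial[256];      // one slot per thread in the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        partial[tid] = (gid < n) ? input[gid] : 0.0f;
        __syncthreads();                    // barrier: all loads visible to the block

        // Tree reduction within the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }
        if (tid == 0) blockSums[blockIdx.x] = partial[0];
    }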
15. GPU Programming
CUDA memory types
Global memory: low bandwidth but large space. Reads/writes are fastest when they are coalesced.
Texture memory: cache optimized for 2D spatial access patterns.
Constant memory: slow, but cached (8 KB cache per multiprocessor).
Shared memory: fast, but it can be used only by the threads of the same block.
Registers: 32768 32-bit registers per multiprocessor.
[Figure 4-2: Hardware model: the device is a set of SIMT multiprocessors with on-chip shared memory; each multiprocessor has per-processor registers, an instruction unit, and constant and texture caches on top of the off-chip device memory.]
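A minimal sketch of how these memory spaces appear in CUDA C code (the names and sizes are illustrative assumptions):

    // Constant memory: read-only from kernels, cached, filled from the host.
    __constant__ float simConstants[16];

    __global__ void memorySpaces(const float* globalIn, float* globalOut, int n) {
        // Shared memory: on-chip, visible only to the threads of the same block.
        __shared__ float tile[256];   // assumes 256 threads per block

        // Registers: ordinary local variables normally live in registers.
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        if (gid < n) {
            // Global memory: consecutive threads reading consecutive addresses
            // lets the hardware coalesce the loads into wide transactions.
            tile[threadIdx.x] = globalIn[gid];
            __syncthreads();
            globalOut[gid] = tile[threadIdx.x] * simConstants[0];
        }
    }

    // Host side, before the launch:
    //   cudaMemcpyToSymbol(simConstants, hostConstants, sizeof(hostConstants));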
16. CA Parallel implementation
A parallel version of the Cellular Automata (CA) algorithm for variably saturated flow in soils was developed with the CUDA API.
The infiltration experiment of Vauclin et al. (1979) was chosen as a benchmark test for the accuracy and the speed of the algorithm.
[Figure: simulated water depth (m) versus distance (m) at t = 2, 3, 4 and 8 hrs, compared against the experimental data.]
17. CA Parallel implementation
Why is parallel code important?
In real-case scenarios, where 3-D simulation of large areas is needed, the grid sizes are excessively large.
In natural hazards assessment the simulations must be fast in order to be useful (the prediction should come before the actual event!).
Fast simulations allow us to calibrate the model parameters more easily and to investigate the physical phenomena more efficiently.
The natural parallelism inherent in the CA concept makes the parallel implementation of the algorithm easier.
18. CA Parallel implementation
Technical details
Difficulties:
The most challenging issue was the irregular geometry of the domain, which made it harder to exploit locality in the thread computations and to use the shared memory.
The cell values were stored in a 1D array, and for each cell the indexes of its neighboring cells were also stored.
Code structure (a hedged sketch follows below):
Simulation constants are stored in constant memory.
Soil properties for each soil class are stored in texture memory.
Atomic operations are used to check for convergence at every iteration.
Shared memory is used to accelerate the atomic operations and the block's memory accesses.
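The presentation shows no code, so the following is only a sketch of what the described layout and convergence check could look like; all names, the neighbor count, the toy update rule, and the tolerance are assumptions:

    #define NUM_NEIGHBORS 4   // assumed neighborhood size; the actual CA may differ

    // Cell values live in a 1D array; for each cell the indexes of its
    // neighbors are stored explicitly to handle the irregular domain geometry.
    __global__ void caStep(const float* head, float* headNew,
                           const int* neighborIdx, int nCells,
                           int* notConverged) {
        __shared__ int blockFlag;             // per-block convergence flag
        if (threadIdx.x == 0) blockFlag = 0;
        __syncthreads();

        int cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell < nCells) {
            // Toy update rule standing in for the real variably-saturated-flow
            // CA: average the cell with its stored neighbors.
            float sum = head[cell];
            for (int k = 0; k < NUM_NEIGHBORS; ++k)
                sum += head[neighborIdx[cell * NUM_NEIGHBORS + k]];
            float updated = sum / (NUM_NEIGHBORS + 1);
            headNew[cell] = updated;

            // Convergence test: first accumulate in fast shared memory ...
            if (fabsf(updated - head[cell]) > 1e-6f) atomicOr(&blockFlag, 1);
        }
        __syncthreads();
        // ... then issue at most one atomic to slow global memory per block.
        if (threadIdx.x == 0 && blockFlag) atomicOr(notConverged, 1);
    }

In such a scheme the host would clear notConverged before each iteration, launch the kernel, swap the head/headNew buffers, copy the flag back, and stop iterating once it stays zero.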
19. CA Parallel implementation
Results of the numerical tests
Nvidia Quadro 2000:
192 CUDA cores.
1 GB of GDDR5 RAM.
[Figure: two plots against the number of cells (10^3 to 10^7). Left: speed (cells/sec, logarithmic axis from 10 to 100000) for the CPU and GPU versions. Right: speed-up of the GPU over the CPU (0 to 90).]