Introduction to Parallel Programming
Agenda
• Flynn taxonomy
• CUDA hardware
• Parallel Programming Intro
Flynn Taxonomy
• Single Instruction Single Data (SISD)
– Classic CPU architecture
• Single Instruction Multiple Data (SIMD)
– Original Supercomputers (Cray-1)
– Today's Graphics Cards, OpenCL (SIMT)
• Multiple Instruction Multiple Data (MIMD)
– Clusters, Distributed Computing, MPI (MPMD)
– Multi-Core CPUs, Using Threads or IPC (SPMD)
• Multiple Instruction Single Data (MISD)
– Redundancy/Experimental
Example
• Adding two vectors
• Show for:
– SISD
– SIMD SSE
– SIMD-SIMT (CUDA/OpenCL)
– MIMD-MPMD
– MIMD-SPMD
SISD
• Set CX = 0
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX,BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length
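In C-like code, the whole SISD version is one scalar loop over the arrays. A minimal sketch; the array names and float element type are assumptions for illustration:

```cpp
#include <cstddef>

// SISD: one instruction stream, one data element per iteration.
void add_sisd(const float* array1, const float* array2,
              float* array3, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        array3[i] = array1[i] + array2[i];  // load, load, add, store
    }
}
```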
SIMD
• MOVUPS xmm0, address of array1
• MOVUPS xmm1, address of array2
• ADDPS xmm0, xmm1
• MOVUPS address of array3, xmm0
• (MOVSS/ADDSS handle only a single scalar float; the packed MOVUPS/ADDPS forms process four floats per instruction)
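A rough host-side equivalent using SSE intrinsics: _mm_loadu_ps/_mm_add_ps/_mm_storeu_ps correspond to the packed MOVUPS/ADDPS instructions, and a scalar tail handles lengths that are not a multiple of four. Function and array names are illustrative:

```cpp
#include <immintrin.h>
#include <cstddef>

// SIMD: one instruction (ADDPS) operates on four packed floats at once.
void add_sse(const float* array1, const float* array2,
             float* array3, std::size_t length) {
    std::size_t i = 0;
    for (; i + 4 <= length; i += 4) {
        __m128 a = _mm_loadu_ps(array1 + i);          // load 4 floats
        __m128 b = _mm_loadu_ps(array2 + i);
        _mm_storeu_ps(array3 + i, _mm_add_ps(a, b));  // add and store 4 floats
    }
    for (; i < length; ++i)                           // scalar tail
        array3[i] = array1[i] + array2[i];
}
```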
SIMD-SIMT (CUDA/OpenCL)
• All threads run the same instructions; only the thread ID differs (see the CUDA sketch below)
• ThreadID = 1
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• ThreadID = 2
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• ThreadID = n
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
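The same pattern as an actual CUDA kernel: every thread executes identical code and uses its global ID to select one element. A minimal sketch with error checking omitted; the buffer names and the 256-thread block size are arbitrary choices:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// SIMT: each thread computes one element, selected by its global ID.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // "Set CX = ThreadID"
    if (i < n)                                      // guard the last partial block
        c[i] = a[i] + b[i];                         // load, load, add, store
}

void add_cuda(const float* a, const float* b, float* c, int n) {
    float *dA, *dB, *dC;
    std::size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                              // threads per block
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover n
    addKernel<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```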
MIMD-MPMD
• Partition Array1 and Array2
• Distribute partitions to nodes
• As responses arrive
– Place each response into the correct position in Array3 (see the MPI sketch below)
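One possible realization of the partition/distribute/collect flow is MPI collectives (strictly this is MPI's SPMD style, but it captures the idea). A sketch that assumes the length divides evenly among the nodes; function and buffer names are illustrative:

```cpp
#include <mpi.h>
#include <vector>

// Root scatters partitions of the inputs, each node adds its partition,
// and the gather places every result into the correct spot in array3.
void add_mpi(const float* array1, const float* array2, float* array3,
             int length, int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = length / size;                   // assumes an even split
    std::vector<float> a(chunk), b(chunk), c(chunk);

    MPI_Scatter(array1, chunk, MPI_FLOAT, a.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(array2, chunk, MPI_FLOAT, b.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; ++i)              // local work on each node
        c[i] = a[i] + b[i];

    MPI_Gather(c.data(), chunk, MPI_FLOAT, array3, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}
```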
MIMD-SPMD
• Thread 1:
• Set CX = 0
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length/2
• Wait for thread 2
• Thread 2:
• Set CX = Length/2
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length
• Wait for thread 1
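A host-side sketch of the two-worker SPMD version using std::thread: both threads run the same loop over their own half of the index range, and join() plays the role of "wait for thread":

```cpp
#include <thread>
#include <cstddef>

// SPMD: the same code runs in both threads, parameterized by [begin, end).
static void add_range(const float* a, const float* b, float* c,
                      std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        c[i] = a[i] + b[i];
}

void add_spmd(const float* a, const float* b, float* c, std::size_t length) {
    std::size_t half = length / 2;
    std::thread t1(add_range, a, b, c, std::size_t{0}, half);  // thread 1: first half
    std::thread t2(add_range, a, b, c, half, length);          // thread 2: second half
    t1.join();                                                 // "wait for thread 1"
    t2.join();                                                 // "wait for thread 2"
}
```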
Parallel Programming Intro
• Syncing
• Computational Patterns
• Data Usage Patterns
Syncing
• Fence/Barrier
– Use to sync all workers before/after work
– Use to allow data to stabilize
• Lock
– Allows only one worker at a time to access a resource
• Signal
– Allows one worker to wait for another worker's task
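A host-side C++ sketch of the three primitives (std::barrier requires C++20; inside a CUDA block, __syncthreads() plays the fence/barrier role). The worker count of four and the ready flag are illustrative assumptions; the workers and producer are assumed to be launched elsewhere:

```cpp
#include <barrier>
#include <condition_variable>
#include <mutex>

std::barrier<>          sync_point(4);   // fence/barrier for 4 workers
std::mutex              m;               // lock protecting a shared resource
std::condition_variable cv;              // signal between workers
bool                    ready = false;

void worker() {
    // Fence/Barrier: wait until all 4 workers reach this point.
    sync_point.arrive_and_wait();

    // Lock: only one worker at a time inside this scope.
    {
        std::lock_guard<std::mutex> guard(m);
        // ... touch the shared resource ...
    }

    // Signal: wait for another worker's task to complete.
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; });
}

void producer() {
    { std::lock_guard<std::mutex> guard(m); ready = true; }
    cv.notify_all();                     // signal the waiting workers
}
```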
Computational Patterns
• Map
• Reduction/Fold
• Sequences
• Pipeline
Map
• Applies a function to one or more collections and creates a new collection of the results.
– B[] = inc(A[])
– C[] = add(A[],B[])
• A variation of Map is Transform, which does this in place.
• f() is a pure function; using a non-pure function is almost impossible except on SISD systems.
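A host-side sketch of Map using std::transform; the unary and binary forms mirror the inc and add examples above:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Map: build a new collection by applying a function to each element.
std::vector<int> inc_map(const std::vector<int>& A) {
    std::vector<int> B(A.size());
    std::transform(A.begin(), A.end(), B.begin(),
                   [](int x) { return x + 1; });      // B[] = inc(A[])
    return B;
}

// Binary Map over two collections of equal length.
std::vector<int> add_map(const std::vector<int>& A, const std::vector<int>& B) {
    std::vector<int> C(A.size());
    std::transform(A.begin(), A.end(), B.begin(), C.begin(),
                   std::plus<int>());                 // C[] = add(A[],B[])
    return C;
}

// Transform variant: apply the function in place, overwriting the input.
void inc_in_place(std::vector<int>& A) {
    std::transform(A.begin(), A.end(), A.begin(),
                   [](int x) { return x + 1; });
}
```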
Reduction/Fold
• Applies a binary function to a collection repeatedly until only one value is left.
– B = sum(A[])
– C = max(A[])
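A host-side sketch of Reduction/Fold with std::accumulate; the max fold assumes a non-empty collection:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Reduction/Fold: repeatedly apply a binary function until one value remains.
int sum(const std::vector<int>& A) {
    return std::accumulate(A.begin(), A.end(), 0);          // B = sum(A[])
}

int max_of(const std::vector<int>& A) {                     // C = max(A[])
    // Seeded with the first element; assumes A is non-empty.
    return std::accumulate(A.begin() + 1, A.end(), A.front(),
                           [](int a, int b) { return std::max(a, b); });
}
```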
Superscalar Sequence
• Slice a collection into smaller collections
• Apply functions to the new collections
• Merge results back into a single collection
• Similar to Map, but a Sequence can apply a different function to each sliced collection
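A minimal host-side sketch of the idea, with two slices and two illustrative functions (+1 and *2); in a real implementation the slices would be processed in parallel:

```cpp
#include <cstddef>
#include <vector>

// Sequence: slice, apply a (possibly different) function per slice, merge.
std::vector<int> sequence(const std::vector<int>& A) {
    std::size_t half = A.size() / 2;
    std::vector<int> out(A.size());
    for (std::size_t i = 0; i < half; ++i)         // slice 1: one function
        out[i] = A[i] + 1;
    for (std::size_t i = half; i < A.size(); ++i)  // slice 2: a different function
        out[i] = A[i] * 2;
    return out;                                    // merged back into one collection
}
```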
Pipeline
• Multiple functions are applied to a collection, where the output of each function serves as the input to the next function
– B[] = f(g(A[]))
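A sketch of the data flow only: each element passes through g and then f. In a parallel pipeline the stages run concurrently on different elements; the stage functions here are illustrative assumptions:

```cpp
#include <vector>

// Pipeline data flow: the output of stage g feeds stage f, element by element.
std::vector<float> pipeline(const std::vector<float>& A) {
    auto g = [](float x) { return x * x; };      // first stage (illustrative)
    auto f = [](float x) { return x + 1.0f; };   // second stage (illustrative)
    std::vector<float> B;
    B.reserve(A.size());
    for (float x : A)
        B.push_back(f(g(x)));                    // B[] = f(g(A[]))
    return B;
}
```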
Data Usage Patterns
• Gather
• Subdivide
• Scatter
• Pack
Gather
• Takes a collection of indices and creates a new collection containing only the elements of another collection at those indices.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {1,4,8}
• {a,d,h} = gather(A,B) (1-based indices)
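A minimal gather sketch (0-based indices here, unlike the 1-based indices in the slide example):

```cpp
#include <cstddef>
#include <vector>

// gather: result[i] = source[indices[i]]
template <typename T>
std::vector<T> gather(const std::vector<T>& source,
                      const std::vector<std::size_t>& indices) {
    std::vector<T> result;
    result.reserve(indices.size());
    for (std::size_t idx : indices)
        result.push_back(source[idx]);   // read only from the given positions
    return result;
}
// Example: gather({a,...,i}, {0,3,7}) yields {a,d,h}.
```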
Subdivide
• Creates multiple smaller collections from one collection.
• Many variations based on spatial properties of source and destination collections.
• A[] = {a,b,c,d,e,f,g,h}
• subdivide(A,4) = {a,b,c,d},{e,f,g,h}
• neighbors(A) = {a,b,c},{b,c,d},{c,d,e},…,{f,g,h}
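A host-side sketch of two subdivide variants, fixed-size chunks and sliding three-element windows, matching the examples above:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// subdivide(A, k): split A into consecutive chunks of size k.
template <typename T>
std::vector<std::vector<T>> subdivide(const std::vector<T>& A, std::size_t k) {
    std::vector<std::vector<T>> chunks;
    for (std::size_t i = 0; i < A.size(); i += k)
        chunks.emplace_back(A.begin() + i,
                            A.begin() + std::min(i + k, A.size()));
    return chunks;
}

// neighbors(A): sliding windows of three consecutive elements.
template <typename T>
std::vector<std::vector<T>> neighbors(const std::vector<T>& A) {
    std::vector<std::vector<T>> windows;
    for (std::size_t i = 0; i + 3 <= A.size(); ++i)
        windows.emplace_back(A.begin() + i, A.begin() + i + 3);
    return windows;
}
```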
Scatter
• Takes a collection of values, a collection of indices, and a destination collection; writes each value into the destination at its corresponding index (see the sketch below).
• Can be non-deterministic if B doesn't contain unique indices
• A[] = {a,b,c,d,e,f,g,h}
• B[] = {0,1,0,8,0,0,4,0}
• C[] = {m,n,o,p,q,r,s,t,v}
• {h,b,o,p,g,r,s,t,d} = scatter(A,B,C) (one possible result: index 0 is written five times, so which of {a,c,e,f,h} survives is non-deterministic)
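A minimal scatter sketch; with duplicate indices the final value depends on write order, which is exactly where the non-determinism comes from once the writes run in parallel:

```cpp
#include <cstddef>
#include <vector>

// scatter: destination[indices[i]] = values[i]
template <typename T>
void scatter(const std::vector<T>& values,
             const std::vector<std::size_t>& indices,
             std::vector<T>& destination) {
    for (std::size_t i = 0; i < values.size(); ++i)
        destination[indices[i]] = values[i];  // duplicate indices: last write wins here;
}                                             // in parallel, the "winner" is arbitrary
```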
Pack
• Takes a collection and creates a new collection containing only the elements for which a unary boolean function returns true.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {0,1,0,8,0,0,4,0}
• Pack(A, vowels(x)) = {a,e,i}
• Pack(B, x > 4) = {8}
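A Pack sketch built on std::copy_if with an arbitrary predicate:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// pack: keep only the elements for which the predicate returns true.
template <typename T, typename Pred>
std::vector<T> pack(const std::vector<T>& input, Pred keep) {
    std::vector<T> output;
    std::copy_if(input.begin(), input.end(), std::back_inserter(output), keep);
    return output;
}
// Example: pack(std::vector<int>{0,1,0,8,0,0,4,0}, [](int x){ return x > 4; }) yields {8}.
```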
CUDA Hardware
• SIMD-SIMT with lots of caveats.
• Memory is separate from the host and other devices
• The device is broken up into Streaming Multiprocessors (SMs), each with its own:
– memory cache (very limited)
– 32-bit registers (between 8K and 64K, depending on the architecture)
– one or more warp schedulers
• Executes ‘kernels’, which are made up of many blocks of threads, which in turn are made up of warps
• Accesses CPU memory at PCIe speeds, i.e. very slowly
Kernels
• Flavor of C/C++
– Extensions
• Keywords to denote if code is for Host, CUDA, or both
• Keywords to denote if variable is for Host, CUDA, or both
• Built-in variables for execution context
• Synchronization
• 3D interoperability
– Restrictions
• No RTTI
• No Exceptions
• No STL
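A small sketch of the extension keywords: the qualifiers mark where code and data live, and the built-in variables describe the execution context. The 256-element shared tile assumes a launch with at most 256 threads per block; the names are illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float scale = 2.0f;            // variable in device constant memory

__device__ float square(float x) {          // callable from device code only
    return x * x;
}

__host__ __device__ float twice(float x) {  // compiled for both host and device
    return 2.0f * x;
}

__global__ void kernel(float* data, int n) {        // kernel: launched from the host
    __shared__ float tile[256];                      // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in context variables
    if (i < n) tile[threadIdx.x] = data[i];
    __syncthreads();                                 // synchronization within the block
    if (i < n) data[i] = scale * square(twice(tile[threadIdx.x]));
}
```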
Kernels cont.
• Execution context decided at runtime
– Number of threads in a block (Max 512/1024)
– Number of blocks
• Should be coded to take into account that the block size and block count can change (see the grid-stride sketch below).
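One common way to write a kernel that tolerates any block size and block count is a grid-stride loop; a minimal sketch:

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: correct for any block size / block count chosen at launch.
__global__ void addGridStride(const float* a, const float* b, float* c, int n) {
    int stride = blockDim.x * gridDim.x;             // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];                          // each thread handles i, i+stride, ...
}

// Launch with whatever configuration suits the device; the kernel still covers n.
// addGridStride<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, n);
```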
Thread Blocks
• Split into Warps of 32 threads.
• Warps are formed from threads with sequential IDs.
• Threads in a block can share a limited amount of memory.
• Threads have two IDs:
– threadIdx, the ID within the block
– threadIdx combined with blockIdx, the ID within the kernel execution context
Warps
• Collection of 32 threads from the same thread block.
• An SM can context switch between warps at essentially no cost.
• Threads in a warp execute in lockstep.
Data Divergence - CUDA Caveats
• Threads in a warp share a limited data cache
• If a thread causes a cache miss, it will be suspended until the other threads finish
• After the other threads finish, the suspended threads are restarted
Execution Divergence - CUDA Caveats
• Threads in a warp execute instructions in lockstep
• If a thread branches, it will be suspended until the other threads reach the same execution point
• The smallest execution unit is the warp, so a single divergent thread can tie up the whole warp and thus the SM
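A tiny illustration of execution divergence: branching on a per-lane condition forces the warp to serialize both paths, while warp-uniform conditions do not. The kernel and its operations are illustrative:

```cuda
#include <cuda_runtime.h>

// Divergent branch: within a warp, the two paths execute one after the other,
// with the threads on the inactive path masked off (suspended).
__global__ void divergent(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)        // even lanes take one path...
        data[i] = data[i] * 2.0f;
    else                             // ...odd lanes take the other
        data[i] = data[i] + 1.0f;
    // Branching on warp-aligned ranges (e.g. by block) instead of per-lane
    // conditions avoids this intra-warp divergence.
}
```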
