Introduction to Parallel Programming
Agenda
• Flynn taxonomy
• CUDA hardware
• Parallel Programming Intro
Flynn Taxonomy
• Single Instruction Single Data (SISD)
– Classic CPU architecture
• Single Instruction Multiple Data (SIMD)
– Original Supercomputers (Cray-1)
– Today's Graphics Cards, OpenCL (SIMT)
• Multiple Instruction Multiple Data (MIMD)
– Clusters, Distributed Computing, MPI (MPMD)
– Multi-Core CPUs, Using Threads or IPC (SPMD)
• Multiple Instruction Single Data (MISD)
– Redundancy/Experimental
Example
• Adding two vectors
• Show for:
– SISD
– SIMD SSE
– SIMD-SIMT (CUDA/OpenCL)
– MIMD-MPMD
– MIMD-SPMD
SISD
• Set CX = 0
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX,BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length
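In C-like code, the whole SISD version is one scalar loop over the arrays. A minimal sketch; the array names and float element type are assumptions for illustration:

```cpp
#include <cstddef>

// SISD: one instruction stream, one data element per iteration.
void add_sisd(const float* array1, const float* array2,
              float* array3, std::size_t length) {
    for (std::size_t i = 0; i < length; ++i) {
        array3[i] = array1[i] + array2[i];  // load, load, add, store
    }
}
```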
SIMD
• MOVUPS xmm0, address of array1
• MOVUPS xmm1, address of array2
• ADDPS xmm0, xmm1
• MOVUPS address of array3, xmm0
• (MOVSS/ADDSS handle only a single scalar float; the packed MOVUPS/ADDPS forms process four floats per instruction)
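A rough host-side equivalent using SSE intrinsics: _mm_loadu_ps/_mm_add_ps/_mm_storeu_ps correspond to the packed MOVUPS/ADDPS instructions, and a scalar tail handles lengths that are not a multiple of four. Function and array names are illustrative:

```cpp
#include <immintrin.h>
#include <cstddef>

// SIMD: one instruction (ADDPS) operates on four packed floats at once.
void add_sse(const float* array1, const float* array2,
             float* array3, std::size_t length) {
    std::size_t i = 0;
    for (; i + 4 <= length; i += 4) {
        __m128 a = _mm_loadu_ps(array1 + i);          // load 4 floats
        __m128 b = _mm_loadu_ps(array2 + i);
        _mm_storeu_ps(array3 + i, _mm_add_ps(a, b));  // add and store 4 floats
    }
    for (; i < length; ++i)                           // scalar tail
        array3[i] = array1[i] + array2[i];
}
```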
SIMD-SIMT (CUDA/OpenCL)
• All threads run the same instructions; only the thread ID differs (see the CUDA sketch below)
• ThreadID = 1
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• ThreadID = 2
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• ThreadID = n
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
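The same pattern as an actual CUDA kernel: every thread executes identical code and uses its global ID to select one element. A minimal sketch with error checking omitted; the buffer names and the 256-thread block size are arbitrary choices:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// SIMT: each thread computes one element, selected by its global ID.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // "Set CX = ThreadID"
    if (i < n)                                      // guard the last partial block
        c[i] = a[i] + b[i];                         // load, load, add, store
}

void add_cuda(const float* a, const float* b, float* c, int n) {
    float *dA, *dB, *dC;
    std::size_t bytes = n * sizeof(float);
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                              // threads per block
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover n
    addKernel<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```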
MIMD-MPMD
• Partition Array1 and Array2
• Distribute partitions to nodes
• As responses arrive
– Place each response into the correct position in Array3 (see the MPI sketch below)
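One possible realization of the partition/distribute/collect flow is MPI collectives (strictly this is MPI's SPMD style, but it captures the idea). A sketch that assumes the length divides evenly among the nodes; function and buffer names are illustrative:

```cpp
#include <mpi.h>
#include <vector>

// Root scatters partitions of the inputs, each node adds its partition,
// and the gather places every result into the correct spot in array3.
void add_mpi(const float* array1, const float* array2, float* array3,
             int length, int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = length / size;                   // assumes an even split
    std::vector<float> a(chunk), b(chunk), c(chunk);

    MPI_Scatter(array1, chunk, MPI_FLOAT, a.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(array2, chunk, MPI_FLOAT, b.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; ++i)              // local work on each node
        c[i] = a[i] + b[i];

    MPI_Gather(c.data(), chunk, MPI_FLOAT, array3, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
}
```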
MIMD-SPMD
• Thread 1:
• Set CX = 0
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length/2
• Wait for thread 2
• Thread 2:
• Set CX = Length/2
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length
• Wait for thread 1
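A host-side sketch of the two-worker SPMD version using std::thread: both threads run the same loop over their own half of the index range, and join() plays the role of "wait for thread":

```cpp
#include <thread>
#include <cstddef>

// SPMD: the same code runs in both threads, parameterized by [begin, end).
static void add_range(const float* a, const float* b, float* c,
                      std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        c[i] = a[i] + b[i];
}

void add_spmd(const float* a, const float* b, float* c, std::size_t length) {
    std::size_t half = length / 2;
    std::thread t1(add_range, a, b, c, std::size_t{0}, half);  // thread 1: first half
    std::thread t2(add_range, a, b, c, half, length);          // thread 2: second half
    t1.join();                                                 // "wait for thread 1"
    t2.join();                                                 // "wait for thread 2"
}
```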
Parallel Programming Intro
• Syncing
• Computational Patterns
• Data Usage Patterns
Syncing
• Fence/Barrier
– Use to sync all workers before/after work
– Use to allow data to stabilize
• Lock
– Allows only one worker at a time to access a resource
• Signal
– Allows one worker to wait for another worker's task
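A host-side C++ sketch of the three primitives (std::barrier requires C++20; inside a CUDA block, __syncthreads() plays the fence/barrier role). The worker count of four and the ready flag are illustrative assumptions; the workers and producer are assumed to be launched elsewhere:

```cpp
#include <barrier>
#include <condition_variable>
#include <mutex>

std::barrier<>          sync_point(4);   // fence/barrier for 4 workers
std::mutex              m;               // lock protecting a shared resource
std::condition_variable cv;              // signal between workers
bool                    ready = false;

void worker() {
    // Fence/Barrier: wait until all 4 workers reach this point.
    sync_point.arrive_and_wait();

    // Lock: only one worker at a time inside this scope.
    {
        std::lock_guard<std::mutex> guard(m);
        // ... touch the shared resource ...
    }

    // Signal: wait for another worker's task to complete.
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready; });
}

void producer() {
    { std::lock_guard<std::mutex> guard(m); ready = true; }
    cv.notify_all();                     // signal the waiting workers
}
```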
Computational Patterns
• Map
• Reduction/Fold
• Sequences
• Pipeline
Map
• Applies a function to one or more collections and creates a new collection of the results.
– B[] = inc(A[])
– C[] = add(A[],B[])
• A variation of Map is Transform, which does this in place.
• f() is a pure function; using a non-pure function is almost impossible except on SISD systems.
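A host-side sketch of Map using std::transform; the unary and binary forms mirror the inc and add examples above:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Map: build a new collection by applying a function to each element.
std::vector<int> inc_map(const std::vector<int>& A) {
    std::vector<int> B(A.size());
    std::transform(A.begin(), A.end(), B.begin(),
                   [](int x) { return x + 1; });      // B[] = inc(A[])
    return B;
}

// Binary Map over two collections of equal length.
std::vector<int> add_map(const std::vector<int>& A, const std::vector<int>& B) {
    std::vector<int> C(A.size());
    std::transform(A.begin(), A.end(), B.begin(), C.begin(),
                   std::plus<int>());                 // C[] = add(A[],B[])
    return C;
}

// Transform variant: apply the function in place, overwriting the input.
void inc_in_place(std::vector<int>& A) {
    std::transform(A.begin(), A.end(), A.begin(),
                   [](int x) { return x + 1; });
}
```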
Reduction/Fold
• Applies a binary function to a collection repeatedly until only one value is left.
– B = sum(A[])
– C = max(A[])
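A host-side sketch of Reduction/Fold with std::accumulate; the max fold assumes a non-empty collection:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Reduction/Fold: repeatedly apply a binary function until one value remains.
int sum(const std::vector<int>& A) {
    return std::accumulate(A.begin(), A.end(), 0);          // B = sum(A[])
}

int max_of(const std::vector<int>& A) {                     // C = max(A[])
    // Seeded with the first element; assumes A is non-empty.
    return std::accumulate(A.begin() + 1, A.end(), A.front(),
                           [](int a, int b) { return std::max(a, b); });
}
```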
Superscalar Sequence
• Slice a collection into smaller collections
• Apply functions to the new collections
• Merge results back into a single collection
• Similar to Map, but a Sequence can apply a different function to each sliced collection
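A minimal host-side sketch of the idea, with two slices and two illustrative functions (+1 and *2); in a real implementation the slices would be processed in parallel:

```cpp
#include <cstddef>
#include <vector>

// Sequence: slice, apply a (possibly different) function per slice, merge.
std::vector<int> sequence(const std::vector<int>& A) {
    std::size_t half = A.size() / 2;
    std::vector<int> out(A.size());
    for (std::size_t i = 0; i < half; ++i)         // slice 1: one function
        out[i] = A[i] + 1;
    for (std::size_t i = half; i < A.size(); ++i)  // slice 2: a different function
        out[i] = A[i] * 2;
    return out;                                    // merged back into one collection
}
```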
Pipeline
• Multiple functions are applied to a collection, where the output of each function serves as the input to the next function
– B[] = f(g(A[]))
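A sketch of the data flow only: each element passes through g and then f. In a parallel pipeline the stages run concurrently on different elements; the stage functions here are illustrative assumptions:

```cpp
#include <vector>

// Pipeline data flow: the output of stage g feeds stage f, element by element.
std::vector<float> pipeline(const std::vector<float>& A) {
    auto g = [](float x) { return x * x; };      // first stage (illustrative)
    auto f = [](float x) { return x + 1.0f; };   // second stage (illustrative)
    std::vector<float> B;
    B.reserve(A.size());
    for (float x : A)
        B.push_back(f(g(x)));                    // B[] = f(g(A[]))
    return B;
}
```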
Data Usage Patterns
• Gather
• Subdivide
• Scatter
• Pack
Gather
• Takes a collection of indices and creates a new collection containing only the elements of another collection at those indices.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {1,4,8}
• {a,d,h} = gather(A,B) (1-based indices)
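A minimal gather sketch (0-based indices here, unlike the 1-based indices in the slide example):

```cpp
#include <cstddef>
#include <vector>

// gather: result[i] = source[indices[i]]
template <typename T>
std::vector<T> gather(const std::vector<T>& source,
                      const std::vector<std::size_t>& indices) {
    std::vector<T> result;
    result.reserve(indices.size());
    for (std::size_t idx : indices)
        result.push_back(source[idx]);   // read only from the given positions
    return result;
}
// Example: gather({a,...,i}, {0,3,7}) yields {a,d,h}.
```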
Subdivide
• Creates multiple smaller collections from one collection.
• Many variations based on spatial properties of source and destination collections.
• A[] = {a,b,c,d,e,f,g,h}
• subdivide(A,4) = {a,b,c,d},{e,f,g,h}
• neighbors(A) = {a,b,c},{b,c,d},{c,d,e},…,{f,g,h}
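A host-side sketch of two subdivide variants, fixed-size chunks and sliding three-element windows, matching the examples above:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// subdivide(A, k): split A into consecutive chunks of size k.
template <typename T>
std::vector<std::vector<T>> subdivide(const std::vector<T>& A, std::size_t k) {
    std::vector<std::vector<T>> chunks;
    for (std::size_t i = 0; i < A.size(); i += k)
        chunks.emplace_back(A.begin() + i,
                            A.begin() + std::min(i + k, A.size()));
    return chunks;
}

// neighbors(A): sliding windows of three consecutive elements.
template <typename T>
std::vector<std::vector<T>> neighbors(const std::vector<T>& A) {
    std::vector<std::vector<T>> windows;
    for (std::size_t i = 0; i + 3 <= A.size(); ++i)
        windows.emplace_back(A.begin() + i, A.begin() + i + 3);
    return windows;
}
```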
Scatter
• Takes a collection of values, a collection of indices, and a destination collection; writes each value into the destination at its corresponding index (see the sketch below).
• Can be non-deterministic if B doesn't contain unique indices
• A[] = {a,b,c,d,e,f,g,h}
• B[] = {0,1,0,8,0,0,4,0}
• C[] = {m,n,o,p,q,r,s,t,v}
• {h,b,o,p,g,r,s,t,d} = scatter(A,B,C) (one possible result: index 0 is written five times, so which of {a,c,e,f,h} survives is non-deterministic)
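A minimal scatter sketch; with duplicate indices the final value depends on write order, which is exactly where the non-determinism comes from once the writes run in parallel:

```cpp
#include <cstddef>
#include <vector>

// scatter: destination[indices[i]] = values[i]
template <typename T>
void scatter(const std::vector<T>& values,
             const std::vector<std::size_t>& indices,
             std::vector<T>& destination) {
    for (std::size_t i = 0; i < values.size(); ++i)
        destination[indices[i]] = values[i];  // duplicate indices: last write wins here;
}                                             // in parallel, the "winner" is arbitrary
```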
Pack
• Takes a collection and creates a new collection containing only the elements for which a unary boolean function returns true.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {0,1,0,8,0,0,4,0}
• Pack(A, vowels(x)) = {a,e,i}
• Pack(B, x > 4) = {8}
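A Pack sketch built on std::copy_if with an arbitrary predicate:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// pack: keep only the elements for which the predicate returns true.
template <typename T, typename Pred>
std::vector<T> pack(const std::vector<T>& input, Pred keep) {
    std::vector<T> output;
    std::copy_if(input.begin(), input.end(), std::back_inserter(output), keep);
    return output;
}
// Example: pack(std::vector<int>{0,1,0,8,0,0,4,0}, [](int x){ return x > 4; }) yields {8}.
```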
CUDA Hardware
• SIMD-SIMT with lots of caveats.
• Memory is separate from the host and other devices
• The device is broken up into Streaming Multiprocessors (SMs), each with its own:
– memory cache (very limited)
– 32-bit registers (between 8K and 64K, depending on the architecture)
– one or more warp schedulers
• Executes ‘kernels’, which are made up of many blocks of threads, which in turn are made up of warps
• Accesses CPU memory at PCIe speeds, i.e. very slowly
Kernels
• Flavor of C/C++
– Extensions
• Keywords to denote if code is for Host, CUDA, or both
• Keywords to denote if variable is for Host, CUDA, or both
• Built-in variables for execution context
• Synchronization
• 3D interoperability
– Restrictions
• No RTTI
• No Exceptions
• No STL
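A small sketch of the extension keywords: the qualifiers mark where code and data live, and the built-in variables describe the execution context. The 256-element shared tile assumes a launch with at most 256 threads per block; the names are illustrative:

```cuda
#include <cuda_runtime.h>

__constant__ float scale = 2.0f;            // variable in device constant memory

__device__ float square(float x) {          // callable from device code only
    return x * x;
}

__host__ __device__ float twice(float x) {  // compiled for both host and device
    return 2.0f * x;
}

__global__ void kernel(float* data, int n) {        // kernel: launched from the host
    __shared__ float tile[256];                      // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // built-in context variables
    if (i < n) tile[threadIdx.x] = data[i];
    __syncthreads();                                 // synchronization within the block
    if (i < n) data[i] = scale * square(twice(tile[threadIdx.x]));
}
```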
Kernels cont.
• Execution context decided at runtime
– Number of threads in a block (Max 512/1024)
– Number of blocks
• Should be coded to take into account that the block size and block count can change (see the grid-stride sketch below).
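One common way to write a kernel that tolerates any block size and block count is a grid-stride loop; a minimal sketch:

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: correct for any block size / block count chosen at launch.
__global__ void addGridStride(const float* a, const float* b, float* c, int n) {
    int stride = blockDim.x * gridDim.x;             // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        c[i] = a[i] + b[i];                          // each thread handles i, i+stride, ...
}

// Launch with whatever configuration suits the device; the kernel still covers n.
// addGridStride<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, n);
```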
Thread Blocks
• Split into Warps of 32 threads.
• Warps are formed from threads with sequential IDs.
• Threads in a block can share a limited amount of memory.
• Threads have two IDs:
– threadIdx, the ID within the block
– threadIdx combined with blockIdx, the ID within the kernel execution context
Warps
• Collection of 32 threads from the same thread block.
• An SM can context switch between warps at essentially no cost.
• Threads in a warp execute in lockstep.
Data Divergence - CUDA Caveats
• Threads in a warp share a limited data cache
• If a thread causes a cache miss, it will be suspended until the other threads finish
• After the other threads finish, the suspended threads are restarted
Execution Divergence - CUDA Caveats
• Threads in a warp execute instructions in lockstep
• If a thread branches, it will be suspended until the other threads reach the same execution point
• The smallest execution unit is the warp, so a single divergent thread can tie up the whole warp and thus the SM
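A tiny illustration of execution divergence: branching on a per-lane condition forces the warp to serialize both paths, while warp-uniform conditions do not. The kernel and its operations are illustrative:

```cuda
#include <cuda_runtime.h>

// Divergent branch: within a warp, the two paths execute one after the other,
// with the threads on the inactive path masked off (suspended).
__global__ void divergent(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0)        // even lanes take one path...
        data[i] = data[i] * 2.0f;
    else                             // ...odd lanes take the other
        data[i] = data[i] + 1.0f;
    // Branching on warp-aligned ranges (e.g. by block) instead of per-lane
    // conditions avoids this intra-warp divergence.
}
```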
