3. Flynn's Taxonomy
• Single Instruction Single Data (SISD)
– Classic CPU architecture
• Single Instruction Multiple Data (SIMD)
– Original Supercomputers (Cray-1)
– Today's Graphics Cards, OpenCL (SIMT)
• Multiple Instruction Multiple Data (MIMD)
– Clusters, Distributed Computing, MPI (MPMD)
– Multi-Core CPUs, Using Threads or IPC (SPMD)
• Multiple Instruction Single Data (MISD)
– Redundancy/Experimental
4. Example
• Adding two vectors
• Show for:
– SISD
– SIMD SSE
– SIMD-SIMT (CUDA/OpenCL)
– MIMD-MPMD
– MIMD-SPMD
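As a minimal sketch of the first two variants (function names are illustrative, and the SSE version assumes an x86 target with SSE support):

#include <immintrin.h>
#include <cstddef>

// SISD: one instruction stream operates on one data element at a time.
void add_sisd(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// SIMD (SSE): one instruction operates on four packed floats at a time.
void add_sse(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; ++i)   // scalar tail when n is not a multiple of 4
        c[i] = a[i] + b[i];
}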
8. MIMD-MPMD
• Partition Array1 and Array2
• Distribute partitions to nodes
• While waiting for all responses
– Place each response into its correct spot in Array3
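A sketch of this flow using MPI collectives: the scatter/gather calls below do the partitioning and the "place into the correct spot" step, and n is assumed divisible by the node count.

#include <mpi.h>
#include <vector>

// a and b are only meaningful on rank 0; results are gathered into c on rank 0.
void add_mpi(const float* a, const float* b, float* c, int n) {
    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);
    int chunk = n / nodes;   // assumes n % nodes == 0
    std::vector<float> la(chunk), lb(chunk), lc(chunk);
    // Partition Array1 and Array2 and distribute the partitions to the nodes.
    MPI_Scatter(a, chunk, MPI_FLOAT, la.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(b, chunk, MPI_FLOAT, lb.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; ++i)   // each node adds its partition
        lc[i] = la[i] + lb[i];
    // Gather waits for all responses and places each at the right offset in c.
    MPI_Gather(lc.data(), chunk, MPI_FLOAT, c, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
}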
11. Syncing
• Fence/Barrier
– Use to sync all workers before/after work
– Use to allow data to stabilize
• Lock
– Allows only one worker at a time to access a resource
• Signal
– Allows one worker to wait for another worker's task to complete
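One possible mapping of these three primitives onto C++20 (a sketch; kWorkers and the shared flag are illustrative):

#include <barrier>
#include <condition_variable>
#include <mutex>

constexpr int kWorkers = 4;
std::barrier gate(kWorkers);      // fence/barrier
std::mutex m;                     // lock
std::condition_variable cv;       // signal
bool producer_done = false;

void worker(int id) {
    gate.arrive_and_wait();       // sync all workers before the work starts
    {
        std::lock_guard<std::mutex> g(m);   // only one worker touches the resource
        // ... update shared resource ...
    }
    if (id == 0) {                // worker 0 signals that its task finished
        { std::lock_guard<std::mutex> g(m); producer_done = true; }
        cv.notify_all();
    } else {                      // other workers wait on worker 0's task
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return producer_done; });
    }
    gate.arrive_and_wait();       // sync again after work so the data stabilizes
}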
13. Map
• Applies a function to one or more collections
and creates a new collection of the results.
– B[] = inc(A[])
– C[] = add(A[],B[])
• A variation of Map is Transform, which does this in place.
• f() must be a pure function; it is almost impossible to use a non-pure function safely except on SISD systems.
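The B[] = inc(A[]) map as a CUDA sketch; the pure function is inlined into the kernel:

// Each thread applies the pure function to one element: B[i] = inc(A[i]).
__global__ void map_inc(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = a[i] + 1.0f;   // inc() is pure: no shared state, no side effects
}

// Transform variant: same function, applied in place.
__global__ void transform_inc(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = a[i] + 1.0f;
}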
15. Reduction/Fold
• Applies a binary function to a collection repeatedly until only one value is left.
– B = sum(A[])
– C = max(A[])
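The common parallel form is a tree reduction. A sketch of one CUDA block summing its slice in shared memory (assumes blockDim.x is a power of two; the host sums the per-block partials):

__global__ void block_sum(const float* a, float* partial, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? a[i] : 0.0f;      // identity element pads the tail
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];    // apply the binary function pairwise
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];       // one value left per block
}
// launch: block_sum<<<blocks, threads, threads * sizeof(float)>>>(a, partial, n);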
16. Superscalar Sequence
• Slice a collection into smaller collections
• Apply functions to the new collections
• Merge results back to single collection
• Similar to Map, but Sequence applies a different function to each sliced collection (see the sketch below)
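A sketch with two slices and two different functions running concurrently (the functions and slice count are illustrative):

#include <thread>
#include <vector>

void sequence(const std::vector<float>& a, std::vector<float>& out) {
    std::size_t mid = a.size() / 2;
    // A different function per slice, unlike Map, which applies one function to all.
    std::thread t1([&] { for (std::size_t i = 0; i < mid; ++i) out[i] = a[i] + 1.0f; });
    std::thread t2([&] { for (std::size_t i = mid; i < a.size(); ++i) out[i] = a[i] * 2.0f; });
    t1.join();
    t2.join();   // merge is implicit: each slice already sits in the single output collection
}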
18. Pipeline
• Multiple functions are applied to a collection
where the output of each function serves as
the input to the next function
– B[] = f(g(A[]))
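In the simplest form the stages are fused element-wise (a sketch; f and g are illustrative pure stages):

float g(float x) { return x * 2.0f; }   // first stage
float f(float x) { return x + 1.0f; }   // second stage consumes g's output

void pipeline(const float* a, float* b, int n) {
    for (int i = 0; i < n; ++i)
        b[i] = f(g(a[i]));               // B[] = f(g(A[]))
}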
21. Gather
• Takes a collection of indices and builds a new collection from another, containing only the elements at the given indices.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {0,3,7} (zero-based indices)
• {a,d,h} = gather(A,B)
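As a sketch (zero-based indices, matching the example above):

// out[i] = a[idx[i]]
void gather(const float* a, const int* idx, float* out, int m) {
    for (int i = 0; i < m; ++i)
        out[i] = a[idx[i]];
}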
22. Subdivide
• Creates multiple smaller collections from one
collection.
• Many variations based on spatial properties of
source and destination collections.
• A[] = {a,b,c,d,e,f,g,h}
• subdivide(A,4) = {a,b,c,d},{e,f,g,h}
• neighbors(A) = {a,b,c},{b,c,d},{c,d,e},…,{f,g,h}
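A sketch of the fixed-chunk variant; the sliding-window neighbors() variant differs only in stepping by one element and letting the windows overlap.

#include <algorithm>
#include <vector>

std::vector<std::vector<float>> subdivide(const std::vector<float>& a, std::size_t k) {
    std::vector<std::vector<float>> out;
    for (std::size_t i = 0; i < a.size(); i += k)
        out.emplace_back(a.begin() + i,
                         a.begin() + std::min(i + k, a.size()));   // last chunk may be short
    return out;
}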
23. Scatter
• Takes a collection of indices and updates a destination collection with values from another, writing only at the given indices.
• Can be non-deterministic if B doesn't contain unique indices, since duplicate indices write to the same slot
• A[] = {a,b,c}
• B[] = {1,4,8}
• C[] = {m,n,o,p,q,r,s,t,v}
• scatter(A,B,C) = {m,a,o,p,b,r,s,t,c}
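As a sketch (zero-based; when idx contains duplicates, the last write to a slot wins on a serial machine, while parallel writes race):

// c[idx[i]] = a[i]
void scatter(const float* a, const int* idx, float* c, int m) {
    for (int i = 0; i < m; ++i)
        c[idx[i]] = a[i];
}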
24. Pack
• Takes a collection and creates a new collection
based on a unary boolean function.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {0,1,0,8,0,0,4,0}
• Pack(A,vowels(x)) = {a,e,i}
• Pack(B, x > 4) = {8}
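A sketch with a predicate parameter standing in for the unary Boolean function:

#include <vector>

std::vector<int> pack(const std::vector<int>& a, bool (*pred)(int)) {
    std::vector<int> out;
    for (int x : a)
        if (pred(x))          // keep only elements where the predicate holds
            out.push_back(x);
    return out;
}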
25. CUDA Hardware
• SIMD-SIMT with lots of caveats.
• Device memory is separate from the host and other devices
• The device is broken up into Streaming Multiprocessors (SMs), each with its own:
– memory cache (very limited)
– 32-bit registers (between 8K and 64K, depending on architecture)
– one or more warp schedulers
• Executes 'kernels', which are made up of many blocks of threads, which in turn are made up of warps
• Accesses CPU memory at PCIe speeds, i.e. very slowly.
26. Kernels
• Flavor of C/C++
– Extensions
• Keywords to denote if code is for Host, CUDA, or both
• Keywords to denote if variable is for Host, CUDA, or both
• Built-in variables for execution context
• Synchronization
• 3D graphics interoperability
– Restrictions
• No RTTI
• No Exceptions
• No STL
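A sketch showing the main extensions (the variable and function names are illustrative):

__constant__ float scale;   // variable qualifier: device constant memory, set from the host with cudaMemcpyToSymbol
__device__ float dbl(float x) { return 2.0f * x; }          // callable from device code only
__host__ __device__ float sq(float x) { return x * x; }     // compiled for both host and device

__global__ void kernel(const float* a, float* b, int n) {   // entry point launched by the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // built-in execution-context variables
    if (i < n)
        b[i] = sq(dbl(a[i])) * scale;
    __syncthreads();   // block-level synchronization keyword (shown only to illustrate it)
}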
27. Kernels cont.
• Execution context is decided at runtime
– Number of threads in a block (max 512 or 1024, depending on compute capability)
– Number of blocks
• Kernels should be coded to take into account that block size and block count can change, as in the grid-stride sketch below.
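The usual way to satisfy this is a grid-stride loop: the kernel stays correct for whatever block size and block count the launch picks.

__global__ void add(const float* a, const float* b, float* c, int n) {
    // Each thread starts at its global ID and strides by the total thread count.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        c[i] = a[i] + b[i];
}
// Launch with any configuration, e.g.: add<<<blocks, threadsPerBlock>>>(a, b, c, n);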
28. Thread Blocks
• Split into warps of 32 threads
– A warp's threads have sequential IDs
• Threads in a block can share a limited amount of memory
• Threads have two IDs
– threadIdx, the ID within the block
– an ID within the kernel execution context, derived from threadIdx and blockIdx (see the sketch below)
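The two IDs and the block-shared memory in a sketch (assumes blockDim.x <= 256):

__global__ void ids(float* out) {
    __shared__ float tile[256];        // limited memory shared across the block
    int local  = threadIdx.x;                             // ID within the block
    int global = blockIdx.x * blockDim.x + threadIdx.x;   // ID within the kernel execution context
    tile[local] = (float)global;
    __syncthreads();                   // wait until every thread has written its slot
    out[global] = tile[(local + 1) % blockDim.x];
}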
29. Warps
• Collection of 32 threads from the same thread
block.
• An SM can context switch between warps at no cost.
• Threads in a warp execute in lockstep.
30. Data Divergence - CUDA Caveats
• Threads in a warp share a limited data cache
• If a thread causes a cache miss, it is suspended until the other threads finish
• Once those threads finish, the suspended threads are restarted
31. Execution Divergence - CUDA Caveats
• Threads in a warp execute instructions in
lockstep
• If a thread branches, it is suspended until the other threads reach the same execution point
• The smallest execution unit is the warp, so a single thread can tie up the whole warp and thus the SM
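A sketch of a divergent kernel: the warp executes both sides of the branch, masking off the inactive half each time.

__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        x[i] *= 2.0f;   // even lanes execute while odd lanes sit suspended
    else
        x[i] += 1.0f;   // then odd lanes execute while even lanes sit suspended
}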