Introduction to
Parallel Programming
Agenda
• Flynn taxonomy
• CUDA hardware
• Parallel Programming Intro
Flynn Taxonomy
• Single Instruction Single Data (SISD)
– Classic CPU architecture
• Single Instruction Multiple Data (SIMD)
– Original Supercomputers (Cray-1)
– Today's Graphics Cards, OpenCL (SIMT)
• Multiple Instruction Multiple Data (MIMD)
– Clusters, Distributed Computing, MPI (MPMD)
– Multi-Core CPUs, Using Threads or IPC (SPMD)
• Multiple Instruction Single Data (MISD)
– Redundancy/Experimental
Example
• Adding two vectors
• Show for:
– SISD
– SIMD SSE
– SIMD-SIMT (CUDA/OpenCL)
– MIMD-MPMD
– MIMD-SPMD
SISD
• Set CX = 0
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX,BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length
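The pseudo-assembly above is just a plain scalar loop; a minimal C++ sketch (array names are my own):

```cpp
#include <vector>

// SISD: one instruction stream processes one data element per iteration.
std::vector<float> add_sisd(const std::vector<float>& array1,
                            const std::vector<float>& array2) {
    std::vector<float> array3(array1.size());
    for (std::size_t cx = 0; cx < array1.size(); ++cx) {  // Inc CX / Loop CX < Length
        array3[cx] = array1[cx] + array2[cx];             // Load, Add, Store
    }
    return array3;
}
```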
SIMD
• MOVUPS xmm0, [Array1]
• MOVUPS xmm1, [Array2]
• ADDPS xmm0, xmm1
• MOVUPS [Array3], xmm0
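The same idea in C++ with SSE intrinsics: a single packed add instruction processes four floats at once (a minimal sketch, assuming 4-element arrays; the function name is illustrative):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// SIMD: one packed add handles four floats in a single instruction.
void add_sse(const float* array1, const float* array2, float* array3) {
    __m128 x0  = _mm_loadu_ps(array1);  // load 4 floats from Array1
    __m128 x1  = _mm_loadu_ps(array2);  // load 4 floats from Array2
    __m128 sum = _mm_add_ps(x0, x1);    // packed add: 4 sums at once
    _mm_storeu_ps(array3, sum);         // store 4 results to Array3
}
```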
SIMD-SIMT (CUDA/OpenCL)
• ThreadID = 1
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• ThreadID = 2
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• ThreadID = n
• Set CX = ThreadID
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
MIMD-MPMD
• Partition Array1 and Array2
• Distribute partitions to nodes
• As responses arrive
– Place each response into its correct position in Array3
MIMD-SPMD
Thread 1:
• Set CX = 0
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length/2
• Wait for thread 2
Thread 2:
• Set CX = Length/2
• Load AX, Array1[CX]
• Load BX, Array2[CX]
• Add AX, BX
• Store Array3[CX], AX
• Inc CX
• Loop CX < Length
• Wait for thread 1
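In SPMD both workers run the same program over different halves of the data; a C++ sketch with std::thread (names are my own):

```cpp
#include <thread>
#include <vector>

// SPMD: both threads execute the same worker over disjoint index ranges.
std::vector<float> add_spmd(const std::vector<float>& array1,
                            const std::vector<float>& array2) {
    std::vector<float> array3(array1.size());
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t cx = begin; cx < end; ++cx)
            array3[cx] = array1[cx] + array2[cx];
    };
    std::size_t half = array1.size() / 2;
    std::thread t1(worker, 0, half);              // thread 1: [0, Length/2)
    std::thread t2(worker, half, array1.size());  // thread 2: [Length/2, Length)
    t1.join();  // "Wait for thread 1"
    t2.join();  // "Wait for thread 2"
    return array3;
}
```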
Parallel Programming Intro
• Syncing
• Computational Patterns
• Data Usage Patterns
Syncing
• Fence/Barrier
– Used to sync all workers before/after work
– Used to allow data to stabilize
• Lock
– Allows only one worker at a time to access a resource
• Signal
– Allows one worker to wait for another worker's task to complete
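In C++, a lock maps to std::mutex and a signal to std::condition_variable; a minimal sketch of one worker waiting for another worker's task (names and the value 42 are illustrative):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;                 // Lock: one worker at a time
std::condition_variable cv;   // Signal: wait for another worker's task
bool ready = false;
int shared_result = 0;

void producer() {
    std::lock_guard<std::mutex> lock(m);  // exclusive access to shared state
    shared_result = 42;
    ready = true;
    cv.notify_one();                      // signal the waiting worker
}

int consumer() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });  // block until signalled
    return shared_result;
}
```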
Computational Patterns
• Map
• Reduction/Fold
• Sequences
• Pipeline
Map
• Applies a function to one or more collections
and creates a new collection of the results.
– B[] = inc(A[])
– C[] = add(A[],B[])
• A variation of Map is Transform, which does this
in place.
• f() must be a pure function; it is almost
impossible to use a non-pure function safely
except on SISD systems.
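Map corresponds directly to std::transform; the inc and add examples above can be sketched as:

```cpp
#include <algorithm>
#include <vector>

// B[] = inc(A[]): apply a pure unary function elementwise.
std::vector<int> map_inc(const std::vector<int>& a) {
    std::vector<int> b(a.size());
    std::transform(a.begin(), a.end(), b.begin(),
                   [](int x) { return x + 1; });
    return b;
}

// C[] = add(A[], B[]): apply a pure binary function elementwise.
std::vector<int> map_add(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> c(a.size());
    std::transform(a.begin(), a.end(), b.begin(), c.begin(),
                   [](int x, int y) { return x + y; });
    return c;
}
```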
Reduction/Fold
• Applies a binary function to a collection
repeatedly until only one value is left.
– B = sum(A[])
– C = max(A[])
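Both examples above are folds over a binary function; a sketch with the standard library:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// B = sum(A[]): fold the collection with binary +.
int reduce_sum(const std::vector<int>& a) {
    return std::accumulate(a.begin(), a.end(), 0);
}

// C = max(A[]): fold the collection with binary max.
int reduce_max(const std::vector<int>& a) {
    return *std::max_element(a.begin(), a.end());
}
```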
Superscalar Sequence
• Slice a collection into smaller collections
• Apply functions to the new collections
• Merge results back to single collection
• Similar to Map, but Sequence will apply
different functions to each sliced collection
Pipeline
• Multiple functions are applied to a collection
where the output of each function serves as
the input to the next function
– B[] = f(g(A[]))
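B[] = f(g(A[])) can be sketched as two chained transforms (f and g here are arbitrary stand-ins; a real parallel runtime would run the stages concurrently on successive elements):

```cpp
#include <algorithm>
#include <vector>

// Pipeline: the output of stage g feeds stage f.
std::vector<int> pipeline(const std::vector<int>& a) {
    std::vector<int> tmp(a.size()), b(a.size());
    std::transform(a.begin(), a.end(), tmp.begin(),
                   [](int x) { return x * 2; });  // g: illustrative stage
    std::transform(tmp.begin(), tmp.end(), b.begin(),
                   [](int x) { return x + 1; });  // f: illustrative stage
    return b;
}
```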
Data Usage Patterns
• Gather
• Subdivide
• Scatter
• Pack
Gather
• Takes a collection of indices and creates a new
collection containing only the elements of
another collection at those indices.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {1,4,8}
• {a,d,h} = gather(A,B)
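A sketch of gather in C++, using the slide's 1-based index convention:

```cpp
#include <vector>

// gather(A, B): pick out A's elements at the indices listed in B.
std::vector<char> gather(const std::vector<char>& a,
                         const std::vector<int>& indices) {
    std::vector<char> out;
    out.reserve(indices.size());
    for (int i : indices)
        out.push_back(a[i - 1]);  // slide's example uses 1-based indices
    return out;
}
```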
Subdivide
• Creates multiple smaller collections from one
collection.
• Many variations based on spatial properties of
source and destination collections.
• A[] = {a,b,c,d,e,f,g,h}
• subdivide(A,4) = {a,b,c,d},{e,f,g,h}
• neighbors(A) = {a,b,c},{b,c,d},{c,d,e},…,{f,g,h}
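Both variations above can be sketched in a few lines of C++ (function names follow the slide):

```cpp
#include <algorithm>
#include <vector>

// subdivide(A, n): split A into consecutive chunks of size n.
std::vector<std::vector<char>> subdivide(const std::vector<char>& a,
                                         std::size_t n) {
    std::vector<std::vector<char>> out;
    for (std::size_t i = 0; i < a.size(); i += n)
        out.emplace_back(a.begin() + i,
                         a.begin() + std::min(i + n, a.size()));
    return out;
}

// neighbors(A): overlapping windows of 3, as in the slide's example.
std::vector<std::vector<char>> neighbors(const std::vector<char>& a) {
    std::vector<std::vector<char>> out;
    for (std::size_t i = 0; i + 3 <= a.size(); ++i)
        out.emplace_back(a.begin() + i, a.begin() + i + 3);
    return out;
}
```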
Scatter
• Takes a collection of values and a collection of
destination indices, and writes each value into
another collection at its given index.
• Can be non-deterministic if B doesn't contain
unique indices
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {0,1,0,8,0,0,4,0}
• C[] = {m,n,o,p,q,r,s,t,v}
• {m,a,o,h,p,q,d,s,t,v} = scatter(A,B,C)
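A sketch of the basic scatter operation, using 0-based indices and illustrative data of my own:

```cpp
#include <vector>

// scatter(A, B, C): write A[i] into C at index B[i]. If B contains
// duplicate indices, the final value at that slot depends on write
// order -- this is the non-determinism noted above.
std::vector<char> scatter(const std::vector<char>& a,
                          const std::vector<int>& indices,
                          std::vector<char> c) {
    for (std::size_t i = 0; i < indices.size(); ++i)
        c[indices[i]] = a[i];
    return c;
}
```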
Pack
• Takes a collection and creates a new collection
containing only the elements for which a unary
Boolean predicate is true.
• A[] = {a,b,c,d,e,f,g,h,i}
• B[] = {0,1,0,8,0,0,4,0}
• Pack(A,vowels(x)) = {a,e,i}
• Pack(B, x > 4) = {8}
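A generic sketch of pack in C++ (the predicate is supplied by the caller):

```cpp
#include <vector>

// pack(A, pred): keep only the elements satisfying a unary Boolean predicate.
template <typename T, typename Pred>
std::vector<T> pack(const std::vector<T>& a, Pred pred) {
    std::vector<T> out;
    for (const T& x : a)
        if (pred(x))
            out.push_back(x);
    return out;
}
```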
CUDA Hardware
• SIMD-SIMT, with lots of caveats.
• Device memory is separate from the host and other devices.
• The device is broken up into Streaming Multiprocessors (SMs),
each with its own:
– memory cache (very limited)
– 32-bit registers (between 8K and 64K, varying by architecture)
– one or more warp schedulers
• Executes 'kernels', which are made up of many blocks of
threads, which in turn are made up of warps.
• Accesses CPU memory at PCIe speeds, i.e. very slowly.
Kernels
• Flavor of C/C++
– Extensions
• Keywords to denote if code is for Host, CUDA, or both
• Keywords to denote if variable is for Host, CUDA, or both
• Built-in variables for execution context
• Synchronization
• 3D interoperability
– Restrictions
• No RTTI
• No Exceptions
• No STL
Kernels cont.
• Execution context decided at runtime
– Number of threads in a block (max 512 or 1024,
depending on compute capability)
– Number of blocks
• Kernels should be coded to account for the fact
that block size and block count can change.
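Putting the pieces together, a minimal CUDA vector-add kernel might look like the sketch below (names are illustrative; memory allocation and error checking are omitted):

```cuda
// __global__ marks code that runs on the device, callable from the host.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Kernel-wide thread ID from the built-in blockIdx, blockDim, threadIdx.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: blocks * block size may exceed n
        c[i] = a[i] + b[i];
}

// Host side: the execution context (block size and count) is chosen at
// runtime, so the kernel must work for any valid configuration.
void launch(const float* a, const float* b, float* c, int n) {
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
}
```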
Thread Blocks
• Split into warps of 32 threads.
• Warps are formed from threads with sequential IDs.
• Threads in a block can share a limited amount of
memory.
• Threads have two IDs:
– threadIdx, the ID within the block
– threadIdx combined with blockIdx, the ID within
the kernel's execution context
Warps
• Collection of 32 threads from the same thread
block.
• An SM can context switch between warps at
essentially no cost.
• Threads in a warp execute in lockstep.
Data Divergence - CUDA Caveats
• Threads in a warp share a limited data cache.
• If a thread causes a cache miss, it is
suspended until the other threads finish.
• Once the other threads finish, suspended
threads are restarted.
Execution Divergence - CUDA Caveats
• Threads in a warp execute instructions in
lockstep.
• If a thread branches, it is suspended until the
other threads reach the same execution point.
• The warp is the smallest execution unit, so a
single divergent thread can tie up the whole
warp and thus the SM.