GPU-Accelerated
Parallel Computing
RTSS Jun Young Park
References
 Intro to Parallel Computing – Udacity
 CUDA by Example – Jason Sanders, Edward Kandrot
Introduction to CUDA
Why GPU?
[Diagram: a CPU with a handful of cores, each running a few threads, versus a GPU with 3,584 cores]
CPU (~3.6 GHz) : good for a few huge tasks.
GPU (~1,531 MHz) : good for an enormous number of small tasks.
Measuring Performance
CPU – Latency
 How long a single task takes.
GPU – Throughput
 How many tasks complete per hour.
Data Size : 4.5 [GB]
Assume that …
• the CPU can process 2 tasks at a time, and each core processes 200 [MB/h]
• the GPU can process 40 tasks at a time, and each core processes 50 [MB/h]
Latency (time per task)
• CPU : 4500 [MB] / 200 [MB/h] = 22.5 [Hours] ← Better!
• GPU : 4500 [MB] / 50 [MB/h] = 90 [Hours]
Throughput
• CPU : 2 [Tasks] / 22.5 [Hours] ≈ 0.089 [Tasks/Hour]
• GPU : 40 [Tasks] / 90 [Hours] ≈ 0.444 [Tasks/Hour] ← Better!
CUDA Program Diagram
[Diagram: the CPU (host) and the GPU (co-processor) each have their own memory. cudaMalloc() allocates on the GPU, cudaMemcpy() transfers between the two memories, and a __global__ kernel such as hello() in hello.cu, compiled with NVCC, runs on the GPU]
Typical Procedure
CPU allocates memory on the GPU
• cudaMalloc((void **)&pointer, size);
CPU copies input data from host to device
• cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
CPU launches a kernel on the GPU
• kernel<<<N_BLOCKS, N_THREADS>>>(args…);
CPU copies the results back from device to host
• cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost);
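The four steps above can be sketched as one minimal program. This is an illustrative sketch, not the slide's exact code; the kernel name `add` and the variable names are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical single-thread kernel: adds two integers on the GPU.
__global__ void add(const int *a, const int *b, int *out) {
    *out = *a + *b;
}

int main() {
    int h_a = 2, h_b = 7, h_out = 0;
    int *d_a, *d_b, *d_out;

    // 1. CPU allocates memory on the GPU
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));
    cudaMalloc((void **)&d_out, sizeof(int));

    // 2. CPU copies input data host -> device
    cudaMemcpy(d_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);

    // 3. CPU launches the kernel: 1 block, 1 thread
    add<<<1, 1>>>(d_a, d_b, d_out);

    // 4. CPU copies the result device -> host
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", h_a, h_b, h_out);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```

Compile with `nvcc add.cu -o add` (requires an NVIDIA GPU and the CUDA toolkit).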
CUDA Example - Addition
- Single Thread (1)
These pointers will point to GPU memory.
Allocate memory for each pointer.
Copy from CPU -> GPU.
CUDA Example - Addition
- Single Thread (2)
Kernel : code executed on the GPU.
[Diagram: 1. Memcpy h_a, h_b → d_a, d_b; 2. kernel call computes the sum; 3. d_out is updated; 4. Memcpy d_out → h_out]
CUDA Example - Addition
- Single Thread (3)
 Compilation using NVCC
 Execution result
CUDA Example – Cubic
- Multi Thread (1)
Used to determine the size of the memory space.
Initialize the elements
with each array index.
CUDA Example - Cubic
- Multi Thread (2)
Kernel call with SIZE_ARRAY threads.
Use threadIdx.x to acquire the current thread index.
CUDA Example - Cubic
- Multi Thread (3)
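The cubic example above can be sketched as follows. This is a reconstruction under the slide's description, not its exact code; `SIZE_ARRAY` and the kernel name `cube` are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE_ARRAY 16

// Each thread cubes one element; threadIdx.x selects which one.
__global__ void cube(float *d_out, const float *d_in) {
    int idx = threadIdx.x;           // current thread index within the block
    float f = d_in[idx];
    d_out[idx] = f * f * f;
}

int main() {
    float h_in[SIZE_ARRAY], h_out[SIZE_ARRAY];
    for (int i = 0; i < SIZE_ARRAY; i++)
        h_in[i] = (float)i;          // initialize elements with their index

    const size_t bytes = SIZE_ARRAY * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    cube<<<1, SIZE_ARRAY>>>(d_out, d_in);   // 1 block, SIZE_ARRAY threads

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < SIZE_ARRAY; i++)
        printf("%.0f ", h_out[i]);
    printf("\n");

    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```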
Small Project
- cuCUDAn
A 99×99 multiplication table (the Korean "gugudan"), with one thread per multiplication.
Total 10,000 threads (0×0 … 99×99).
cuCUDAn
- Implementation
Two index arrays, one indexed by block and one by thread, each initialized with its own index:

idx   : 0  1  2  3  …  99
value : 0  1  2  3  …  99

Result of B[i] × T[j]:

  ×      T[0]  T[1]  T[2]  T[3]  …  T[99]
  B[0]     0     0     0     0   …      0
  B[1]     0     1     2     3   …     99
  B[2]     0     2     4     6   …    198
  B[3]     0     3     6     9   …    297
  B[…]     …     …     …     …   …      …
  B[99]    0    99   198   297   …   9801
blockDim.x gives the number of threads per block.
Limitation : at most 512 or 1,024 threads per block (device-dependent).
cuCUDAn
- Implementation
dim(d_out) = [100 * 100]
Using a 1-D array instead of a 2-D matrix.
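The flattened layout can be sketched as a kernel; this is a reconstruction, and the names `mul`, `d_block`, and `d_thread` are assumptions, not the slide's exact code.

```cuda
// Hypothetical mul kernel: 100 blocks x 100 threads = 10,000 products.
// d_out is a 1-D array of 100*100 ints; the row is the block index
// and the column is the thread index.
__global__ void mul(int *d_out, const int *d_block, const int *d_thread) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flatten 2-D position to 1-D
    d_out[idx] = d_block[blockIdx.x] * d_thread[threadIdx.x];
}

// Launch: mul<<<100, 100>>>(d_out, d_block, d_thread);
```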
cuCUDAn
- Implementation
Initialize cudaEvent objects.
mul<<<numBlocks, numThreads>>>
Launch the kernel between the two event records.
Wait until the kernel has finished.
Get the elapsed time.
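The timing steps above can be sketched like this (a reconstruction; the kernel launch is the `mul` call from the slides, and the variable names are assumptions):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                   // record "before" timestamp
mul<<<numBlocks, numThreads>>>(d_out, d_block, d_thread);
cudaEventRecord(stop);                    // record "after" timestamp

cudaEventSynchronize(stop);               // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("Elapsed: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```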
cuCUDAn
- Implementation
Copy the result back from the GPU.
Show the results.
Free the memory.
cuCUDAn
- Result
……
One multiplication per thread.
※ Total : 10,000 Threads
Elapsed time : 7.168 [μs]
cuCUDAn
- Result
The interleaved output looks dizzying … but it works!
Self-Check
 Which kinds of tasks are CPUs and GPUs specialized for?
 Show how to quantify CPU and GPU performance (latency vs. throughput).
 Describe the basic procedure of a CUDA program.
 Describe how to measure elapsed time using cudaEvent objects.
Communication &
Hardware
Communication Patterns
 Map : 1-to-1 mapping.
 Gather : many-to-1 mapping.
 Scatter : 1-to-many mapping.
 *Stencil : each output reads from a fixed neighborhood in the array.
[Diagram: Map, Gather, Scatter, Stencil]
Communication Patterns
 Transpose : 1-to-1 mapping that reorders elements.

   1  2  3  4  5          1  6 11
   6  7  8  9 10    →     2  7 12
  11 12 13 14 15          3  8 13
                          4  9 14
                          5 10 15

In row-major memory: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 → 1 6 11 2 7 12 3 8 13 4 9 14 5 10 15
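The transpose pattern can be sketched as a naive kernel; this is an illustrative sketch added here, not from the original deck.

```cuda
// Naive transpose: thread (x, y) reads in[y][x] and writes out[x][y].
// A rows-by-cols input becomes a cols-by-rows output; a 1-to-1 mapping.
__global__ void transpose(float *out, const float *in, int rows, int cols) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        out[x * rows + y] = in[y * cols + x];
}
```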
Thread Block and SM
A kernel is divided into blocks of threads. Each streaming multiprocessor (SM, e.g. in the Titan V) has its own cores and memory, and one or more blocks are mapped onto each SM.
Memory Structure – Programmer’s View
[Diagram: within each block (1 … N), every thread has its own local memory and all threads of the block share that block's shared memory; all blocks on the GPU access the global memory]
Memory Structure – Titan V
Synchronization
- Barrier
 Barrier : wait until all threads have finished their operations.
[Diagram: threads arrive at a barrier and wait for each other]
__syncthreads()
Synchronization
- Example
• Quiz from Lesson 2.
• Each thread performs a shift operation:
  0 1 2 3 4 5 6 7 …  →  1 2 3 4 5 6 7 8 …
Synchronization
- Example
Copy from shared to global memory
*so the result can be sent back to the host.
Synchronization
- Example
• Without a barrier : only one element (thread) is filled with its index (threads don't wait for each other).
• With a barrier : every element is filled with its index (each thread waits until the others have written).
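The shift quiz can be sketched with shared memory and __syncthreads(); this is a reconstruction under the slides' description, with `N` and the kernel name `shift` as assumptions.

```cuda
#define N 128

// Each thread copies its right neighbor's value: array[i] = array[i+1].
__global__ void shift(int *d_array) {
    __shared__ int s[N];             // shared among the threads of the block
    int i = threadIdx.x;

    s[i] = d_array[i];
    __syncthreads();                 // barrier: all writes to s[] are done

    int v = (i < N - 1) ? s[i + 1] : s[i];
    __syncthreads();                 // barrier: all reads of s[] are done

    s[i] = v;
    __syncthreads();                 // barrier: all shifted writes are done

    d_array[i] = s[i];               // copy shared -> global for the host
}
```

Without the barriers, a fast thread could overwrite s[i] before its neighbor has read it.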
Memory Example
- Local
Visible only within the thread.
Memory Example
- Global
Local variables pointing to global memory allocated elsewhere.
Memory Example
- Shared
Declare a shared array.
Set a barrier after the line above
(the write operation).
Atomicity
 CUDA supports atomic operations.
 atomicAdd(), atomicCAS(), atomicXor() … and so on
 Limitations
 Only certain data types and operations.
 No ordering constraints.
 May slow the program down.
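A minimal sketch of why atomics matter (the kernel name `count` is an assumption):

```cuda
// Many threads increment one shared counter. atomicAdd makes the
// read-modify-write indivisible, so no increments are lost.
__global__ void count(int *d_counter) {
    atomicAdd(d_counter, 1);   // a plain (*d_counter)++ here would race
}

// Launch: count<<<1, 256>>>(d_counter);
// With atomicAdd, the counter reliably ends up at 256.
```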
Memory Management Strategies
 Maximize arithmetic intensity
 Maximize compute operations per thread
 Minimize time spent on memory per thread
 Move frequently-accessed data to fast memory
[Access speed from a core, fastest to slowest: Local > Shared > Global > Host]
Coalesce Memory Access
[Diagram: eight threads accessing global memory]
Coalesced : adjacent threads access adjacent addresses, served with a single transaction.
Stride (not coalesced) : threads access scattered addresses, requiring multiple transactions.
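The two access patterns can be sketched as kernels (illustrative sketches, not from the deck):

```cuda
// Coalesced: thread i touches element i, so a warp's 32 accesses fall in
// consecutive addresses and a single memory transaction can serve them.
__global__ void coalesced(float *d_out, const float *d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i];
}

// Strided: thread i touches element i * stride, scattering the warp's
// accesses across memory and forcing multiple transactions.
__global__ void strided(float *d_out, const float *d_in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    d_out[i] = d_in[i];
}
```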
Thread Divergence
 Threads take different paths through the same code.
[Diagram: threads T1–T4 split between the IF-THEN branch and the ELSE branch]
Thread Divergence (Warp Divergence1))
 Assume each thread loops a number of times equal to its thread index.
 Threads that finish early wait until the other threads have finished.
1) https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec3-2x2.pdf - Lecture note from Prof. Mike Giles (Oxford University)
[Diagram: T1–T4 execute Pre Loop, then 0–4 loop iterations, then Post Loop; threads with fewer iterations sit idle]
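The index-dependent loop can be sketched as follows (an illustrative sketch; the kernel name `diverge` is an assumption):

```cuda
// Each thread loops threadIdx.x times. Threads in a warp execute in
// lockstep, so threads with small indices sit idle until the thread
// with the largest index finishes its last iteration.
__global__ void diverge(int *d_out) {
    int sum = 0;
    for (int k = 0; k < threadIdx.x; k++)   // trip count differs per thread
        sum += k;
    d_out[threadIdx.x] = sum;               // the whole warp pays for the longest loop
}
```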
Self Check
 Communication Patterns
 Memory Structure in CUDA
 Synchronization
 Atomicity
 Memory Management Strategies

More Related Content

What's hot

Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolversinside-BigData.com
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Fisnik Kraja
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaRob Gillen
 
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...JAX London
 
NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011
NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011
NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011Yukio Saito
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012DefCamp
 
Linux scheduling and input and output
Linux scheduling and input and outputLinux scheduling and input and output
Linux scheduling and input and outputSanidhya Chugh
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Yukio Saito
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule
 
Introduction to Hadron Structure from Lattice QCD
Introduction to Hadron Structure from Lattice QCDIntroduction to Hadron Structure from Lattice QCD
Introduction to Hadron Structure from Lattice QCDChristos Kallidonis
 
Resource Management with Systemd and cgroups
Resource Management with Systemd and cgroupsResource Management with Systemd and cgroups
Resource Management with Systemd and cgroupsTsung-en Hsiao
 
Named entity recognition - Kaggle/Own data
Named entity recognition - Kaggle/Own dataNamed entity recognition - Kaggle/Own data
Named entity recognition - Kaggle/Own dataPetr Lorenc
 

What's hot (20)

Adaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and EigensolversAdaptive Linear Solvers and Eigensolvers
Adaptive Linear Solvers and Eigensolvers
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
 
Chainer v4 and v5
Chainer v4 and v5Chainer v4 and v5
Chainer v4 and v5
 
Intro to GPGPU Programming with Cuda
Intro to GPGPU Programming with CudaIntro to GPGPU Programming with Cuda
Intro to GPGPU Programming with Cuda
 
20131121
2013112120131121
20131121
 
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
Java Core | Understanding the Disruptor: a Beginner's Guide to Hardcore Concu...
 
NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011
NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011
NVIDIA GeForece Spec for personal CUDA environment, 27/Aug/2011
 
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012NVidia CUDA for Bruteforce Attacks - DefCamp 2012
NVidia CUDA for Bruteforce Attacks - DefCamp 2012
 
Linux scheduling and input and output
Linux scheduling and input and outputLinux scheduling and input and output
Linux scheduling and input and output
 
Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2Nvidia® cuda™ 5 sample evaluationresult_2
Nvidia® cuda™ 5 sample evaluationresult_2
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Cuda
CudaCuda
Cuda
 
Kato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance ComputingKato Mivule: An Overview of CUDA for High Performance Computing
Kato Mivule: An Overview of CUDA for High Performance Computing
 
CUDA
CUDACUDA
CUDA
 
Introduction to Hadron Structure from Lattice QCD
Introduction to Hadron Structure from Lattice QCDIntroduction to Hadron Structure from Lattice QCD
Introduction to Hadron Structure from Lattice QCD
 
Exploring Gpgpu Workloads
Exploring Gpgpu WorkloadsExploring Gpgpu Workloads
Exploring Gpgpu Workloads
 
Resource Management with Systemd and cgroups
Resource Management with Systemd and cgroupsResource Management with Systemd and cgroups
Resource Management with Systemd and cgroups
 
Named entity recognition - Kaggle/Own data
Named entity recognition - Kaggle/Own dataNamed entity recognition - Kaggle/Own data
Named entity recognition - Kaggle/Own data
 

Similar to GPU-Accelerated Parallel Computing

002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.pptceyifo9332
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1
Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1
Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1Yukio Saito
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Tokyo Institute of Technology
 
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC PlatformsProtecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC PlatformsHeechul Yun
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAnirudhGarg35
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to AcceleratorsDilum Bandara
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threadingAntonio Cesarano
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
 
Disruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on LinuxDisruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on LinuxNaoto MATSUMOTO
 
How to build a gaming computer
How to build a gaming computerHow to build a gaming computer
How to build a gaming computerDonald Gillies
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdfTigabu Yaya
 
GPU Introduction.pptx
 GPU Introduction.pptx GPU Introduction.pptx
GPU Introduction.pptxSherazMunawar5
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computingArka Ghosh
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 

Similar to GPU-Accelerated Parallel Computing (20)

002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt002 - Introduction to CUDA Programming_1.ppt
002 - Introduction to CUDA Programming_1.ppt
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1
Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1
Nvidia® cuda™ 5.0 Sample Evaluation Result Part 1
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
 
Reduction
ReductionReduction
Reduction
 
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC PlatformsProtecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
Protecting Real-Time GPU Kernels in Integrated CPU-GPU SoC Platforms
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
 
Introduction to Accelerators
Introduction to AcceleratorsIntroduction to Accelerators
Introduction to Accelerators
 
Threads and multi threading
Threads and multi threadingThreads and multi threading
Threads and multi threading
 
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Exploring Parallel Merging In GPU Based Systems Using CUDA C.
Exploring Parallel Merging In GPU Based Systems Using CUDA C.
 
Disruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on LinuxDisruptive IP Networking with Intel DPDK on Linux
Disruptive IP Networking with Intel DPDK on Linux
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
 
How to build a gaming computer
How to build a gaming computerHow to build a gaming computer
How to build a gaming computer
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
 
GPU Introduction.pptx
 GPU Introduction.pptx GPU Introduction.pptx
GPU Introduction.pptx
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 

More from Jun Young Park

Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorchJun Young Park
 
Using Multi GPU in PyTorch
Using Multi GPU in PyTorchUsing Multi GPU in PyTorch
Using Multi GPU in PyTorchJun Young Park
 
Trial for Practical NN Using
Trial for Practical NN UsingTrial for Practical NN Using
Trial for Practical NN UsingJun Young Park
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural NetworkJun Young Park
 
PyTorch and Transfer Learning
PyTorch and Transfer LearningPyTorch and Transfer Learning
PyTorch and Transfer LearningJun Young Park
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural NetworksJun Young Park
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural NetworkJun Young Park
 

More from Jun Young Park (8)

Introduction to PyTorch
Introduction to PyTorchIntroduction to PyTorch
Introduction to PyTorch
 
Using Multi GPU in PyTorch
Using Multi GPU in PyTorchUsing Multi GPU in PyTorch
Using Multi GPU in PyTorch
 
Trial for Practical NN Using
Trial for Practical NN UsingTrial for Practical NN Using
Trial for Practical NN Using
 
Convolutional Neural Network
Convolutional Neural NetworkConvolutional Neural Network
Convolutional Neural Network
 
PyTorch and Transfer Learning
PyTorch and Transfer LearningPyTorch and Transfer Learning
PyTorch and Transfer Learning
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Deep Neural Network
Deep Neural NetworkDeep Neural Network
Deep Neural Network
 
Introduction to Neural Network
Introduction to Neural NetworkIntroduction to Neural Network
Introduction to Neural Network
 

Recently uploaded

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesThousandEyes
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Product School
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Thierry Lestable
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...Product School
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaRTTS
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform EngineeringJemma Hussein Allen
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...Elena Simperl
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Alison B. Lowndes
 

Recently uploaded (20)

Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 

GPU-Accelerated Parallel Computing

  • 2. References  Intro to Parallel Computing – Udacity  CUDA by Example – Jason Sanders, Edward Kandrot
  • 4. Why GPU? T T Core T T Core T T Core T T Core T T Core T T Core … 3584 cores Good for few huge tasks Good for enormous small tasks 3.6 GHz 1.531 MHz
  • 5. Measuring Performance. CPU – latency: how long a single task takes. GPU – throughput: how many tasks per hour. Assume a data size of 4.5 GB, a CPU that processes 2 tasks at a time with each core handling 200 MB/h, and a GPU that processes 40 tasks at a time with each core handling 50 MB/h. Latency — CPU: 4500/200 = 22.5 hours (better!); GPU: 4500/50 = 90 hours. Throughput — CPU: 2 tasks / 22.5 hours = 0.089 tasks/hour; GPU: 40 tasks / 90 hours = 0.444 tasks/hour (better!).
  • 6. CUDA Program Diagram. A source file such as hello.cu is compiled by NVCC into both CPU and GPU code. The GPU acts as a co-processor with its own memory: cudaMalloc() allocates device memory, cudaMemcpy() copies between CPU and GPU memory, and __global__ marks a kernel such as hello() that runs on the GPU.
  • 7. Typical Procedure. CPU allocates memory on the GPU • cudaMalloc((void **)&pointer, size); CPU copies input data from CPU to GPU • cudaMemcpy(dest, &src, size, cudaMemcpyHostToDevice) CPU launches the kernel on the GPU • Kernel<<<N_BLOCKS,N_THREADS>>>(args…) CPU copies results back from GPU to CPU • cudaMemcpy(dest, &src, size, cudaMemcpyDeviceToHost)
  • 8. CUDA Example - Addition - Single Thread (1) The pointers will indicate GPU memory space. Allocate memory for each pointer. Copy from CPU to GPU.
  • 9. CUDA Example - Addition - Single Thread (2) Kernel: will be executed on the GPU. Host side: h_a, h_b, h_out; device side: d_a, d_b, d_out. Steps: 1. Memcpy (host → device) 2. Kernel call (sum) 3. d_out updated 4. Memcpy (device → host)
  • 10. CUDA Example - Addition - Single Thread (3)  Compilation using NVCC  Execution result
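The code for this example appears in the slides only as screenshots. A minimal sketch of the single-thread addition it describes might look as follows (the names h_a/h_b/h_out, d_a/d_b/d_out, and the kernel name sum follow the slide's diagram; the rest is an assumption):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the GPU; reads two device floats and writes their sum.
__global__ void sum(float *d_a, float *d_b, float *d_out) {
    *d_out = *d_a + *d_b;
}

int main() {
    float h_a = 1.0f, h_b = 2.0f, h_out = 0.0f;
    float *d_a, *d_b, *d_out;

    // 1. Allocate device memory.
    cudaMalloc((void **)&d_a, sizeof(float));
    cudaMalloc((void **)&d_b, sizeof(float));
    cudaMalloc((void **)&d_out, sizeof(float));

    // 2. Copy inputs host -> device.
    cudaMemcpy(d_a, &h_a, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &h_b, sizeof(float), cudaMemcpyHostToDevice);

    // 3. Launch the kernel: 1 block, 1 thread.
    sum<<<1, 1>>>(d_a, d_b, d_out);

    // 4. Copy the result device -> host.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    printf("%.1f + %.1f = %.1f\n", h_a, h_b, h_out);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}
```

Compiled with `nvcc sum.cu -o sum`, mirroring the four-step procedure on slide 7.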
  • 11. CUDA Example – Cubic - Multi Thread (1) SIZE_ARRAY is used to determine the size of the memory space. Initialize each element with its array index.
  • 12. CUDA Example - Cubic - Multi Thread (2) Kernel call with SIZE_ARRAY threads; threadIdx.x acquires the current thread index.
  • 13. CUDA Example - Cubic - Multi Thread (3)
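The multi-thread cubic example is likewise only shown as screenshots. A hedged reconstruction of what it describes (SIZE_ARRAY, h_in/h_out, d_in/d_out, and the kernel name cube are assumptions) could be:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE_ARRAY 64   // determines both the thread count and the memory size

// One thread per element: d_out[idx] = d_in[idx]^3.
__global__ void cube(int *d_out, int *d_in) {
    int idx = threadIdx.x;   // current thread index within the block
    d_out[idx] = d_in[idx] * d_in[idx] * d_in[idx];
}

int main() {
    const int bytes = SIZE_ARRAY * sizeof(int);
    int h_in[SIZE_ARRAY], h_out[SIZE_ARRAY];
    for (int i = 0; i < SIZE_ARRAY; ++i) h_in[i] = i;  // element = its index

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    cube<<<1, SIZE_ARRAY>>>(d_out, d_in);   // one block, SIZE_ARRAY threads

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < SIZE_ARRAY; ++i) printf("%d ", h_out[i]);
    printf("\n");
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```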
  • 14. Small Project - cuCUDAn: the 99×99 multiplication table ("dan"), with one thread per multiplication — 10,000 threads in total (0×0 … 99×99).
  • 15. cuCUDAn - Implementation. Two input vectors each hold the values 0–99 at indices 0–99. Each block B[i] pairs with each thread T[j] to produce the product i×j — e.g. B[2]×T[3] = 6 and B[99]×T[99] = 9801 — with the flat output index computed from blockDim.x. Limitation: a block can hold at most 512/1024 threads.
  • 16. cuCUDAn - Implementation. dim(d_out) = 100 × 100, stored as a 1-D matrix instead of a 2-D matrix.
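The kernel itself is only visible in the screenshot; based on the block/thread indexing described above, a sketch (the names mul, d_out, d_a, d_b are assumptions carried over from the slides) might be:

```cuda
// Each block handles one left operand, each thread one right operand.
// The 100x100 result is stored as a flat 1-D array of 10,000 ints.
__global__ void mul(int *d_out, int *d_a, int *d_b) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // flat output index
    d_out[idx] = d_a[blockIdx.x] * d_b[threadIdx.x];
}

// Launched as: mul<<<100, 100>>>(d_out, d_a, d_b);
```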
  • 17. cuCUDAn - Implementation. Initialize the cudaEvent objects, launch mul<<<numBlocks, numThreads>>> between the two recordings, wait until the kernel has finished, then get the elapsed time.
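The timing code is again only in the screenshot. A sketch of the cudaEvent pattern the slide describes (mul, numBlocks, numThreads, and the device pointers are assumed from the earlier slides):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                       // record before the launch
mul<<<numBlocks, numThreads>>>(d_out, d_a, d_b);
cudaEventRecord(stop);                        // record after the launch

cudaEventSynchronize(stop);                   // wait until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
printf("Elapsed: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```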
  • 18. cuCUDAn - Implementation. Acquire the result from the GPU, show the results, and free up the memory spaces.
  • 19. cuCUDAn - Result. One multiplication in each thread (10,000 threads in total). Elapsed time: 7.168 μs.
  • 20. cuCUDAn - Result. It looks dizzying … however, it works well!
  • 21. Self-Check  What kinds of tasks are CPUs and GPUs specialized for?  Show how to quantify performance for CPUs and GPUs.  Describe the basic procedure of a CUDA program.  Describe how to measure elapsed time using cudaEvent objects.
  • 23. Communication Patterns  Map : 1-to-1 matching.  Gather : Many-to-1 matching.  Scatter : 1-to-Many matching.  *Stencil : Input from fixed neighborhood in array Map Gather Scatter Stencil
  • 24. Communication Patterns  Transpose : 1-to-1 matching that reorders positions — e.g. the row-major 3×5 matrix [1 2 3 4 5 / 6 7 8 9 10 / 11 12 13 14 15] is stored transposed as [1 6 11 2 7 12 3 8 13 4 9 14 5 10 15].
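These patterns can be illustrated with small kernels; all names below are illustrative, not from the slides, and the gather example doubles as a 3-point stencil:

```cuda
// Map: 1-to-1 -- each thread reads and writes its own element.
__global__ void map(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * in[i];
}

// Gather: many-to-1 -- each thread reads several inputs for one output
// (reading a fixed neighborhood like this is also a stencil).
__global__ void gather3(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

// Transpose: 1-to-1 with a reordered write position.
__global__ void transpose(float *out, const float *in, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        int r = i / cols, c = i % cols;
        out[c * rows + r] = in[i];
    }
}
```

Scatter is the mirror image of gather: each thread computes where its one input should land and writes it there, so several threads may target the same output.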
  • 25. Thread Block and SM. A kernel launches a stream of blocks, each containing threads; each block is mapped to one Streaming Multiprocessor (an SM can host one or more blocks), which contains cores and memory (Titan V shown).
  • 26. Memory Structure – Programmer's View. Within each block (1 … N), every thread (1 … N) has its own local memory, and the block's threads share one shared memory; global memory is visible to all blocks on the GPU.
  • 28. Synchronization - Barrier  Barrier : each thread waits at __syncthreads() until all of the threads' operations have finished.
  • 29. Synchronization - Example. Quiz from Lesson 2: each thread performs a "shift" operation, turning [0 1 2 3 4 5 6 7 …] into [1 2 3 4 5 6 7 8 …].
  • 30. Synchronization - Example. Copy from shared to global memory to send the result back to the host.
  • 31. Synchronization - Example. Without a barrier, only one element (thread) is filled with its index, because threads do not wait for one another; with barriers, each element is filled correctly, since every thread waits until the others are done.
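The shift code is only shown as a screenshot. A sketch of the barrier-correct version the slides describe (N, shift_left, and d_out are assumptions) could be:

```cuda
#define N 128

// Shift-left by one using shared memory; __syncthreads() separates the
// write, read, and overwrite phases so no thread races ahead of the rest.
// Launched as: shift_left<<<1, N>>>(d_out);
__global__ void shift_left(int *d_out) {
    __shared__ int s[N];
    int i = threadIdx.x;

    s[i] = i;                                  // phase 1: every thread writes its index
    __syncthreads();                           // barrier: all writes complete

    int next = (i < N - 1) ? s[i + 1] : s[i];  // phase 2: read the right neighbor
    __syncthreads();                           // barrier: all reads done before overwriting

    s[i] = next;                               // phase 3: perform the shift
    __syncthreads();

    d_out[i] = s[i];                           // copy shared -> global for the host
}
```

Removing any of the three barriers reintroduces exactly the read/write race the slide warns about.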
  • 32. Memory Example - Local. Local memory can only be used within its own thread.
  • 33. Memory Example - Global. Local pointer variables point to global memory allocated elsewhere.
  • 34. Memory Example - Shared. Declare a shared array with __shared__, and set a barrier after the line that writes to it.
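The three memory-space examples can be condensed into one kernel (this sketch is not from the slides; all names are illustrative):

```cuda
__device__ float g_value;   // global: visible to every thread and block (zero-initialized)

// Launch with at most 256 threads per block.
__global__ void memory_spaces(float *d_arr) {
    float local = 1.0f;                   // local: private to this one thread

    __shared__ float s[256];              // shared: one copy per block
    s[threadIdx.x] = local;               // write into shared memory ...
    __syncthreads();                      // ... then barrier, as slide 34 advises

    // d_arr is a local pointer variable, but it points to global memory
    // allocated elsewhere (via cudaMalloc on the host).
    d_arr[threadIdx.x] = s[threadIdx.x] + g_value;
}
```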
  • 35. Atomicity  CUDA supports atomic operations:  atomicAdd(), atomicCAS(), atomicXor() … and so on.  Limitations:  only certain data types and operations;  no ordering constraints;  may slow execution down (accesses are serialized).
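A common illustration of why atomics are needed is a histogram (this example is not from the slides; the names are illustrative):

```cuda
// Many threads may hit the same bin, so the increment must be atomic;
// a plain d_bins[b]++ is a read-modify-write race and would lose updates.
__global__ void histogram(int *d_bins, const int *d_in, int n, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&d_bins[d_in[i] % num_bins], 1);
}
```

Note the limitation the slide mentions: atomicAdd guarantees each increment lands, but not in any particular order, and contended bins serialize their accesses.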
  • 36. Memory Management Strategies  Maximize arithmetic intensity  Maximize compute operations per thread  Minimize time spent on memory per thread  Move frequently-accessed data to fast memory. Access speed from a core, fastest to slowest: local > shared > global > host.
  • 37. Coalesce Memory Access. When adjacent threads access adjacent global-memory addresses, the accesses coalesce into a single transaction; strided (non-coalesced) access requires multiple transactions.
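The two access patterns can be sketched side by side (illustrative kernels, not from the slides):

```cuda
// Coalesced: thread i touches element i -- adjacent threads hit adjacent
// addresses, so a warp's loads merge into a single transaction.
__global__ void coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: thread i touches element i * stride -- adjacent threads hit
// addresses far apart, forcing multiple separate transactions.
__global__ void strided(float *out, const float *in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];
}
```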
  • 38. Thread Divergence  Threads in the same warp do different work — e.g. some take the IF-THEN branch while others take the ELSE branch.
  • 39. Thread Divergence (Warp Divergence1))  Assume each thread loops a number of times equal to its thread index: after the shared pre-loop code, T1 runs 1 iteration, T2 runs 2, and so on, and each thread waits until all the others have finished before the post-loop code runs. 1) https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec3-2x2.pdf - Lecture note from Prof. Mike Giles (Oxford University)
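Both divergence cases from slides 38–39 can be shown in one illustrative kernel (not from the slides):

```cuda
// Warp divergence: threads in the same warp take different paths.
// In the branch, half the warp idles while the other half runs; in the
// loop, thread i iterates i times, so threads that finish early wait
// for the slowest thread before the warp proceeds past the loop.
__global__ void divergent(float *out) {
    int i = threadIdx.x;

    if (i % 2 == 0)                  // IF-THEN / ELSE splits the warp
        out[i] = 1.0f;
    else
        out[i] = 2.0f;

    float acc = 0.0f;
    for (int k = 0; k < i; ++k)      // per-thread trip count -> divergence
        acc += 1.0f;
    out[i] += acc;
}
```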
  • 40. Self Check  Communication Patterns  Memory Structure in CUDA  Synchronization  Atomicity  Memory Management Strategies

Editor's Notes

  1. A CPU consists of cores and threads that can perform a few large tasks quickly — optimal for a handful of big jobs. A GPU consists of a great many CUDA cores that each handle a small task — optimal for enormous numbers of small jobs.
  2. Performance is measured differently according to the characteristics above. CPU: how long does it take to do a given job? GPU: how much work gets done per hour?
  3. Compiling a source written with the .cu extension generates code for both the CPU and the GPU at once. The GPU operates like a co-processor to the CPU, and each has its own memory space (host/device). __global__ creates a kernel that runs on the GPU; cudaMalloc() allocates GPU memory; cudaMemcpy() copies between CPU and GPU memory.
  4. Allocate memory on the GPU. Copy the data the HOST will send into GPU memory. Run the kernel code, specifying the number of BLOCKs and THREADs. After the kernel runs, copy the updated result from GPU memory back to the HOST.
  5. A floating-point variable refers to space allocated on the host; pointers like these refer to space to be allocated on the device (global memory — covered in Part 2). cudaMalloc allocates device memory of sizeof(float). cudaMemcpy copies the value held in host memory into the allocated device memory.
  6. The kernel is defined as shown; the parameters it receives live in the device's global memory. When the kernel is called, it is passed pointers to those memory spaces. cudaMemcpy then copies the value computed on the device back into host memory.
  7. Compilation uses the dedicated nvcc compiler, whose usage is similar to gcc. Running the program confirms the desired result.
  8. We want to use one thread to compute each element of the array. Set the array SIZE and the allocation size; since each element is an int, compute the size accordingly. Likewise define host variables for the input and output. Initialize the input h_in so that each element equals its index, and declare pointers to the device memory spaces.
  9. Here we write a kernel that computes the cube; it takes the device memory addresses of the input and output. The current index equals the thread index, so assign it; the output at that index is the cube of the input.
  10. We can confirm that each thread computed the cube correctly.
  11. We write a small project, cuCUDAn: it computes up to 99×99, with each individual multiplication performed in its own thread.
  12. Inside the kernel, each thread simply stores the product of two values at one output index. d_out has 10,000 elements, and each element gets its own thread.
  13. The first and second operands are each defined as vectors of size 100 whose values are initialized to their indices. The result matrix is 100×100. cudaMalloc would follow, but it is omitted since it was covered earlier.
  14. Launch the kernel with the thread/block counts matching the matrix size, and wrap the kernel with the cudaEvent object's Start and Stop to measure the elapsed time of that section.
  15. After the kernel runs, copy the values filled into d_out back to host memory, print the results, and free the memory allocated on the device.
  16. Thread allocation works correctly; the elapsed time is about 7 microseconds.
  17. The results were also computed correctly, without any collisions.
  18. There are several ways memory values can be referenced; here we first look at four. (*Stencil is the least intuitive.)
  19. Transpose is a way of expressing a matrix transposition in terms of memory layout.
  20. In the course, an SM contains a single group of cores, but the Titan V has four. Blocks are each assigned to an SM. (On the Titan X and lower-end cards, only one block exists per SM.) One SM can hold several blocks, but a block cannot span multiple SMs — hence the limit on block size (the number of SMs is finite).
  21. From a program's point of view, the GPU's internal structure looks like this: global memory spanning the GPU, shared memory spanning a block, and local memory allocated per thread.
  22. The actual structure of the Titan V looks like this.
  23. When several threads try to access one shared memory, synchronization problems arise. Use __syncthreads() to pause until every thread's shared-memory reads/writes are complete (meaning changes to the values in shared memory are finished), then resume. Threads wait at the barrier until all operations end.
  24. To understand synchronization, we run a matrix shift example.
  25. In code like the one on the left, the problem spots are where reads/writes occur: first "writing" the index number into shared memory, then "reading" the element after a given element, then "writing" the read value into the element before it. A total of three barriers is needed, so the code is revised as on the right. (// only the threads within the block)
  26. Without a barrier, the threads do not wait for all shared-memory writes, so only one write lands and the result comes out wrong; applying the barriers correctly produces the proper result.
  27. Local memory is a region used only within a particular kernel "thread", as shown.
  28. Global memory is allocated on the device and can be referenced from every thread and block on the GPU.
  29. Shared memory is defined using the __shared__ keyword, as shown.
  30. Atomics apply only to certain operations and data types. Ordering cannot be constrained — we cannot know which operation runs first (though both are guaranteed to run). Serializing memory access introduces waiting and can slow things down. This means synchronization across the whole system, not just one SM (and it references global memory, so it is quite slow).
  31. For efficient memory use: reduce the time threads spend on memory, keep frequently used data in fast memory, increase the amount of computation per thread, and take warp (thread) divergence into account.
  32. Global memory is most efficient when stored and accessed in coalesced form. Gaps require that many more transactions, and if the gaps are irregular it becomes very inefficient.
  33. Thread divergence is threads within one block doing different work: depending on conditionals, threads in the same warp can take different amounts of time to run, and threads that finish first end up waiting.
  34. If the threads in a warp do different work, a thread that finishes early must still wait until all the other threads finish (warp divergence). The figure shows an example where threads run for different lengths of time (a for loop). Reduce conditionals as much as possible so threads finish at similar times.