SlideShare a Scribd company logo
1 of 4
Download to read offline
/
Docs » Tutorials » Tutorial 02: CUDA in Actions
Tutorial 02: CUDA in Actions
Introduction
In tutorial 01, we implemented vector addition in CUDA using only one GPU thread. However,
the strength of GPU lies in its massive parallelism. In this tutorial, we will explore how to
exploit GPU parallelism.
Going parallel
CUDA use a kernel execution con guration <<<...>>> to tell CUDA runtime how many
threads to launch on GPU. CUDA organizes threads into a group called "thread block". Kernel
can launch multiple thread blocks, organized into a "grid" structure.
The syntax of kernel execution con guration is as follows
<<< M , T >>>
Which indicate that a kernel launches with a grid of M thread blocks. Each thread block has T
parallel threads.
Exercise 1: Parallelizing vector addition using
multithread
In this exercise, we will parallelize vector addition from tutorial 01 ( vector_add.cu ) using a
thread block with 256 threads. The new kernel execution con guration is shown below.
vector_add <<< 1 , 256 >>> (d_out, d_a, d_b, N);
CUDA provides built-in variables for accessing thread information. In this exercise, we will use
two of them: threadIdx.x and blockIdx.x .
threadIdx.x contains the index of the thread within the block
blockDim.x contains the size of thread block (number of threads in the thread block).
For the vector_add() con guration, ะhe value of threadIdx.x ranges from 0 to 255 and the
value of blockDim.x is 256.
  v: latest 
/
Parallelizing idea
Recalls the kernel of single thread version in vector_add.cu . Notes that we modi ed the
vector_add() kernel a bit to make the explanation easier.
__global__ void vector_add(float *out, float *a, float *b, int n) {
int index = 0;
int stride = 1
for(int i = index; i < n; i += stride){
out[i] = a[i] + b[i];
}
}
In this implementation, only one thread computes vector addition by iterating through the
whole arrays. With 256 threads, the addition can be spread across threads and computed
simultaneously.
For the k -th thread, the loop starts from k -th element and iterates through the array with a
loop stride of 256. For example, in the 0-th iteration, the k -th thread computes the addition
of k -th element. In the next iteration, the k -th thread computes the addition of (k+256) -th
element, and so on. Following gure shows an illustration of the idea.
EXERCISE: Try to implement this in vector_add_thread.cu
1. Copy vector_add.cu to vector_add_thread.cu
$> cp vector_add.cu vector_add_thread.cu
1. Parallelize vector_add() using a thread block with 256 threads.
2. Compile and pro le the program
$> nvcc vector_add_thread.cu -o vector_add_thread
$> nvprof ./vector_add_thread
See the solution in solutions/vector_add_thread.cu
Performance
Following is the pro ling result on Tesla M2050
  v: latest 
/
Exercise 2: Adding more thread blocks
CUDA GPUs have several parallel processors called Streaming Multiprocessors or SMs. Each
SM consists of multiple parallel processors and can run multiple concurrent thread blocks. To
take advantage of CUDA GPUs, kernel should be launched with multiple thread blocks. This
exercise will expand the vector addition from exercise 1 to uses multiple thread blocks.
Similar to thread information, CUDA provides built-in variables for accessing block
information. In this exercise, we will use two of them: blockIdx.x and gridDim.x .
blockIdx.x contains the index of the block with in the grid
gridDim.x contains the size of the grid
Parallelizing idea
Instead of using a thread block to iterate over the arrays, we will use multiple thread blocks to
create N threads; each thread processes an element of the arrays. Following is an illustration
of the parallelization idea.
With 256 threads per thread block, we need at least N/256 thread blocks to have a total of N
threads. To assign a thread to a speci c element, we need to know a unique index for each
thread. Such index can be computed as follow
int tid = blockIdx.x * blockDim.x + threadIdx.x;
EXERCISE: Try to implement this in vector_add_grid.cu
1. Copy vector_add.cu to vector_add_grid.cu
$> cp vector_add.cu vector_add_thread.cu
1. Parallelize vector_add() using multiple thread blocks.
==6430== Profiling application: ./vector_add_thread
==6430== Profiling result:
Time(%) Time Calls Avg Min Max Name
39.18% 22.780ms 1 22.780ms 22.780ms 22.780ms vector_add(float*, float*, float*, int
34.93% 20.310ms 2 10.155ms 10.137ms 10.173ms [CUDA memcpy HtoD]
25.89% 15.055ms 1 15.055ms 15.055ms 15.055ms [CUDA memcpy DtoH]
  v: latest 
/
2. Handle case when N is an arbitrary number.
3. HINT: Add a condition to check that the thread work within the acceptable array index
range.
4. Compile and pro le the program
$> nvcc vector_add_grid.cu -o vector_add_grid
$> nvprof ./vector_add_grid
See the solution in solutions/vector_add_grid.cu
Performance
Following is the pro ling result on Tesla M2050
Comparing Performance
Version Execution Time (ms) Speedup
1 thread 1425.29 1.00x
1 block 22.78 62.56x
Multiple blocks 1.13 1261.32x
Wrap up
This tutorial gives you some rough idea of how to parallelize program on GPUs. So far, we
learned about GPU threads, thread blocks, and grid. We parallized vector addition using
multiple threads and multiple thread blocks.
Acknowledgments
Contents are adopted from An Even Easier Introduction to CUDA by Mark Harris, NVIDIA
and CUDA C/C++ Basics by Cyril Zeller, NVIDIA.
==6564== Profiling application: ./vector_add_grid
==6564== Profiling result:
Time(%) Time Calls Avg Min Max Name
55.65% 20.312ms 2 10.156ms 10.150ms 10.162ms [CUDA memcpy HtoD]
41.24% 15.050ms 1 15.050ms 15.050ms 15.050ms [CUDA memcpy DtoH]
3.11% 1.1347ms 1 1.1347ms 1.1347ms 1.1347ms vector_add(float*, float*, float*, int
  v: latest 

More Related Content

What's hot

NS-2 Tutorial
NS-2 TutorialNS-2 Tutorial
NS-2 Tutorial
code453
 
CUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : NotesCUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : Notes
Subhajit Sahu
 

What's hot (20)

Introduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : NotesIntroduction to CUDA C: NVIDIA : Notes
Introduction to CUDA C: NVIDIA : Notes
 
GPU Computing with CUDA
GPU Computing with CUDAGPU Computing with CUDA
GPU Computing with CUDA
 
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACCAccelerating HPC Applications on NVIDIA GPUs with OpenACC
Accelerating HPC Applications on NVIDIA GPUs with OpenACC
 
Ns 2 Network Simulator An Introduction
Ns 2 Network Simulator An IntroductionNs 2 Network Simulator An Introduction
Ns 2 Network Simulator An Introduction
 
【論文紹介】Relay: A New IR for Machine Learning Frameworks
【論文紹介】Relay: A New IR for Machine Learning Frameworks【論文紹介】Relay: A New IR for Machine Learning Frameworks
【論文紹介】Relay: A New IR for Machine Learning Frameworks
 
NS-2 Tutorial
NS-2 TutorialNS-2 Tutorial
NS-2 Tutorial
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
Ns network simulator
Ns network simulatorNs network simulator
Ns network simulator
 
NvFX GTC 2013
NvFX GTC 2013NvFX GTC 2013
NvFX GTC 2013
 
Grover's Algorithm
Grover's AlgorithmGrover's Algorithm
Grover's Algorithm
 
Lecture 04
Lecture 04Lecture 04
Lecture 04
 
GPU Computing with Ruby
GPU Computing with RubyGPU Computing with Ruby
GPU Computing with Ruby
 
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
PT-4057, Automated CUDA-to-OpenCL™ Translation with CU2CL: What's Next?, by W...
 
2011.02.18 marco parenzan - modelli di programmazione per le gpu
2011.02.18   marco parenzan - modelli di programmazione per le gpu2011.02.18   marco parenzan - modelli di programmazione per le gpu
2011.02.18 marco parenzan - modelli di programmazione per le gpu
 
CUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : NotesCUDA by Example : Graphics Interoperability : Notes
CUDA by Example : Graphics Interoperability : Notes
 
~Ns2~
~Ns2~~Ns2~
~Ns2~
 
TensorFlow Study Part I
TensorFlow Study Part ITensorFlow Study Part I
TensorFlow Study Part I
 
Ns2
Ns2Ns2
Ns2
 
2013 0928 programming by cuda
2013 0928 programming by cuda2013 0928 programming by cuda
2013 0928 programming by cuda
 
Java file
Java fileJava file
Java file
 

Similar to CUDA Tutorial 02 : CUDA in Actions : Notes

Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
mouhouioui
 
I Need Serious Help with this assignment. Plenty Points to be earned.pdf
I Need Serious Help with this assignment. Plenty Points to be earned.pdfI Need Serious Help with this assignment. Plenty Points to be earned.pdf
I Need Serious Help with this assignment. Plenty Points to be earned.pdf
ahntagencies
 
Python multithreading session 9 - shanmugam
Python multithreading session 9 - shanmugamPython multithreading session 9 - shanmugam
Python multithreading session 9 - shanmugam
Navaneethan Naveen
 
JOURNAL PAPER
JOURNAL PAPERJOURNAL PAPER
JOURNAL PAPER
Raj kumar
 
Vectors Intro.ppt
Vectors Intro.pptVectors Intro.ppt
Vectors Intro.ppt
puneet680917
 

Similar to CUDA Tutorial 02 : CUDA in Actions : Notes (20)

Python multithreading
Python multithreadingPython multithreading
Python multithreading
 
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
Etude éducatif sur les GPUs & CPUs et les architectures paralleles -Programmi...
 
I Need Serious Help with this assignment. Plenty Points to be earned.pdf
I Need Serious Help with this assignment. Plenty Points to be earned.pdfI Need Serious Help with this assignment. Plenty Points to be earned.pdf
I Need Serious Help with this assignment. Plenty Points to be earned.pdf
 
Python multithreading session 9 - shanmugam
Python multithreading session 9 - shanmugamPython multithreading session 9 - shanmugam
Python multithreading session 9 - shanmugam
 
اسلاید ارائه اول جلسه ۱۰ کلاس پایتون برای هکر های قانونی
اسلاید ارائه اول جلسه ۱۰ کلاس پایتون برای هکر های قانونی اسلاید ارائه اول جلسه ۱۰ کلاس پایتون برای هکر های قانونی
اسلاید ارائه اول جلسه ۱۰ کلاس پایتون برای هکر های قانونی
 
Multi Threading
Multi ThreadingMulti Threading
Multi Threading
 
Java concurrency
Java concurrencyJava concurrency
Java concurrency
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Python multithreaded programming
Python   multithreaded programmingPython   multithreaded programming
Python multithreaded programming
 
JOURNAL PAPER
JOURNAL PAPERJOURNAL PAPER
JOURNAL PAPER
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Mathemetics module
Mathemetics moduleMathemetics module
Mathemetics module
 
Java adv
Java advJava adv
Java adv
 
Thread model in java
Thread model in javaThread model in java
Thread model in java
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
 
Vectors Intro.ppt
Vectors Intro.pptVectors Intro.ppt
Vectors Intro.ppt
 
Generators & Decorators.pptx
Generators & Decorators.pptxGenerators & Decorators.pptx
Generators & Decorators.pptx
 
Multithreading Concepts
Multithreading ConceptsMultithreading Concepts
Multithreading Concepts
 
C optimization notes
C optimization notesC optimization notes
C optimization notes
 
Keras and TensorFlow
Keras and TensorFlowKeras and TensorFlow
Keras and TensorFlow
 

More from Subhajit Sahu

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
Subhajit Sahu
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
Subhajit Sahu
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Subhajit Sahu
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
Subhajit Sahu
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
Subhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Subhajit Sahu
 

More from Subhajit Sahu (20)

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTES
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTES
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTES
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTES
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTES
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTES
 

Recently uploaded

young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Naicy mandal
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 

Recently uploaded (20)

9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...9004554577, Get Adorable Call Girls service. Book call girls & escort service...
9004554577, Get Adorable Call Girls service. Book call girls & escort service...
 
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
Kalyan callg Girls, { 07738631006 } || Call Girl In Kalyan Women Seeking Men ...
 
(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...
(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...
(ISHITA) Call Girls Service Aurangabad Call Now 8617697112 Aurangabad Escorts...
 
Call Girls in Thane 9892124323, Vashi cAll girls Serivces Juhu Escorts, powai...
Call Girls in Thane 9892124323, Vashi cAll girls Serivces Juhu Escorts, powai...Call Girls in Thane 9892124323, Vashi cAll girls Serivces Juhu Escorts, powai...
Call Girls in Thane 9892124323, Vashi cAll girls Serivces Juhu Escorts, powai...
 
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
young call girls in Sainik Farm 🔝 9953056974 🔝 Delhi escort Service
 
Top Rated Pune Call Girls Katraj ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated  Pune Call Girls Katraj ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Top Rated  Pune Call Girls Katraj ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated Pune Call Girls Katraj ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
 
VIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Dharwad 7001035870 Whatsapp Number, 24/07 Booking
 
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Hauz Quazi  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Hauz Quazi (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
Makarba ( Call Girls ) Ahmedabad ✔ 6297143586 ✔ Hot Model With Sexy Bhabi Rea...
 
(=Towel) Dubai Call Girls O525547819 Call Girls In Dubai (Fav0r)
(=Towel) Dubai Call Girls O525547819 Call Girls In Dubai (Fav0r)(=Towel) Dubai Call Girls O525547819 Call Girls In Dubai (Fav0r)
(=Towel) Dubai Call Girls O525547819 Call Girls In Dubai (Fav0r)
 
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
9892124323 Pooja Nehwal Call Girls Services Call Girls service in Santacruz A...
 
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Sakshi Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Pimple Saudagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Pimple Saudagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Pimple Saudagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Pimple Saudagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
Call Now ≽ 9953056974 ≼🔝 Call Girls In Yusuf Sarai ≼🔝 Delhi door step delevry≼🔝
 
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Kothrud Call Me 7737669865 Budget Friendly No Advance Booking
 
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
Get Premium Pimple Saudagar Call Girls (8005736733) 24x7 Rate 15999 with A/c ...
 
Top Rated Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated  Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...Top Rated  Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
Top Rated Pune Call Girls Chakan ⟟ 6297143586 ⟟ Call Me For Genuine Sex Serv...
 
Book Paid Lohegaon Call Girls Pune 8250192130Low Budget Full Independent High...
Book Paid Lohegaon Call Girls Pune 8250192130Low Budget Full Independent High...Book Paid Lohegaon Call Girls Pune 8250192130Low Budget Full Independent High...
Book Paid Lohegaon Call Girls Pune 8250192130Low Budget Full Independent High...
 
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(PARI) Alandi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 

CUDA Tutorial 02 : CUDA in Actions : Notes

  • 1. / Docs » Tutorials » Tutorial 02: CUDA in Actions Tutorial 02: CUDA in Actions Introduction In tutorial 01, we implemented vector addition in CUDA using only one GPU thread. However, the strength of GPU lies in its massive parallelism. In this tutorial, we will explore how to exploit GPU parallelism. Going parallel CUDA use a kernel execution con guration <<<...>>> to tell CUDA runtime how many threads to launch on GPU. CUDA organizes threads into a group called "thread block". Kernel can launch multiple thread blocks, organized into a "grid" structure. The syntax of kernel execution con guration is as follows <<< M , T >>> Which indicate that a kernel launches with a grid of M thread blocks. Each thread block has T parallel threads. Exercise 1: Parallelizing vector addition using multithread In this exercise, we will parallelize vector addition from tutorial 01 ( vector_add.cu ) using a thread block with 256 threads. The new kernel execution con guration is shown below. vector_add <<< 1 , 256 >>> (d_out, d_a, d_b, N); CUDA provides built-in variables for accessing thread information. In this exercise, we will use two of them: threadIdx.x and blockIdx.x . threadIdx.x contains the index of the thread within the block blockDim.x contains the size of thread block (number of threads in the thread block). For the vector_add() con guration, ะhe value of threadIdx.x ranges from 0 to 255 and the value of blockDim.x is 256.   v: latest 
  • 2. / Parallelizing idea Recalls the kernel of single thread version in vector_add.cu . Notes that we modi ed the vector_add() kernel a bit to make the explanation easier. __global__ void vector_add(float *out, float *a, float *b, int n) { int index = 0; int stride = 1 for(int i = index; i < n; i += stride){ out[i] = a[i] + b[i]; } } In this implementation, only one thread computes vector addition by iterating through the whole arrays. With 256 threads, the addition can be spread across threads and computed simultaneously. For the k -th thread, the loop starts from k -th element and iterates through the array with a loop stride of 256. For example, in the 0-th iteration, the k -th thread computes the addition of k -th element. In the next iteration, the k -th thread computes the addition of (k+256) -th element, and so on. Following gure shows an illustration of the idea. EXERCISE: Try to implement this in vector_add_thread.cu 1. Copy vector_add.cu to vector_add_thread.cu $> cp vector_add.cu vector_add_thread.cu 1. Parallelize vector_add() using a thread block with 256 threads. 2. Compile and pro le the program $> nvcc vector_add_thread.cu -o vector_add_thread $> nvprof ./vector_add_thread See the solution in solutions/vector_add_thread.cu Performance Following is the pro ling result on Tesla M2050   v: latest 
  • 3. / Exercise 2: Adding more thread blocks CUDA GPUs have several parallel processors called Streaming Multiprocessors or SMs. Each SM consists of multiple parallel processors and can run multiple concurrent thread blocks. To take advantage of CUDA GPUs, kernel should be launched with multiple thread blocks. This exercise will expand the vector addition from exercise 1 to uses multiple thread blocks. Similar to thread information, CUDA provides built-in variables for accessing block information. In this exercise, we will use two of them: blockIdx.x and gridDim.x . blockIdx.x contains the index of the block with in the grid gridDim.x contains the size of the grid Parallelizing idea Instead of using a thread block to iterate over the arrays, we will use multiple thread blocks to create N threads; each thread processes an element of the arrays. Following is an illustration of the parallelization idea. With 256 threads per thread block, we need at least N/256 thread blocks to have a total of N threads. To assign a thread to a speci c element, we need to know a unique index for each thread. Such index can be computed as follow int tid = blockIdx.x * blockDim.x + threadIdx.x; EXERCISE: Try to implement this in vector_add_grid.cu 1. Copy vector_add.cu to vector_add_grid.cu $> cp vector_add.cu vector_add_thread.cu 1. Parallelize vector_add() using multiple thread blocks. ==6430== Profiling application: ./vector_add_thread ==6430== Profiling result: Time(%) Time Calls Avg Min Max Name 39.18% 22.780ms 1 22.780ms 22.780ms 22.780ms vector_add(float*, float*, float*, int 34.93% 20.310ms 2 10.155ms 10.137ms 10.173ms [CUDA memcpy HtoD] 25.89% 15.055ms 1 15.055ms 15.055ms 15.055ms [CUDA memcpy DtoH]   v: latest 
  • 4. / 2. Handle case when N is an arbitrary number. 3. HINT: Add a condition to check that the thread work within the acceptable array index range. 4. Compile and pro le the program $> nvcc vector_add_grid.cu -o vector_add_grid $> nvprof ./vector_add_grid See the solution in solutions/vector_add_grid.cu Performance Following is the pro ling result on Tesla M2050 Comparing Performance Version Execution Time (ms) Speedup 1 thread 1425.29 1.00x 1 block 22.78 62.56x Multiple blocks 1.13 1261.32x Wrap up This tutorial gives you some rough idea of how to parallelize program on GPUs. So far, we learned about GPU threads, thread blocks, and grid. We parallized vector addition using multiple threads and multiple thread blocks. Acknowledgments Contents are adopted from An Even Easier Introduction to CUDA by Mark Harris, NVIDIA and CUDA C/C++ Basics by Cyril Zeller, NVIDIA. ==6564== Profiling application: ./vector_add_grid ==6564== Profiling result: Time(%) Time Calls Avg Min Max Name 55.65% 20.312ms 2 10.156ms 10.150ms 10.162ms [CUDA memcpy HtoD] 41.24% 15.050ms 1 15.050ms 15.050ms 15.050ms [CUDA memcpy DtoH] 3.11% 1.1347ms 1 1.1347ms 1.1347ms 1.1347ms vector_add(float*, float*, float*, int   v: latest 