CUDA Lab Manual

Parallel Computing with Compute Unified Device Architecture (CUDA)

Al-Khawarizmi Institute of Computer Science
University of Engineering and Technology, Lahore, Pakistan

LAB WORKBOOK
Parallel Programming With CUDA
Summer Short Course, August 2009

© Copyright 2009 Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore.
TABLE OF CONTENTS

1 Introduction
  1.1 General Purpose Graphic Processing Unit (GPGPU)
  1.2 Compute Unified Device Architecture (CUDA)
  1.3 Main Objectives
2 Setting Up the CUDA Development Environment
  2.1 Verifying That You Have a CUDA-Capable System
  2.2 Downloading CUDA Development Components
  2.3 Installing CUDA Software Components
  2.4 Verifying CUDA Installations
  2.5 General Procedure of Programming in CUDA
3 Programming in CUDA
  3.1 Programming Exercise 1 (Hello World)
  3.2 Programming Exercise 2 (Matrix Multiplication)
  3.3 Programming Exercise 3 (Numerical Calculation of the Value of Pi (π))
  3.4 Programming Exercise 4 (Parallel Sort)
LAB WORKBOOK

This workbook is written to assist the students of the Summer Short Course on "Parallel Programming With CUDA" at the Al-Khawarizmi Institute of Computer Science (KICS). This edition was prepared over a short period of two months and was finalized in July 2009. The contents of this document have been compiled from various academic resources to expose students to General Purpose Graphic Processing Units (GPGPU) and Nvidia's Compute Unified Device Architecture (CUDA) in a hands-on fashion.

For further information, please contact KICS at UET, Lahore:
  Telephone: (042) 992 50450
  Fax: (042) 992 50246
  Email: ghulam.mustafa@kics.edu.pk
1 Introduction

Multicore and many-core systems provide within-the-box parallel processing capabilities. Computing tasks that used to run on supercomputers can now run on desktops, provided that we know the capabilities of the available hardware and the software techniques to exploit those resources.

1.1 General Purpose Graphic Processing Unit (GPGPU)

The Graphic Processing Unit (GPU) found on commodity video adapters has evolved into a highly parallel, multithreaded, many-core processor, thanks to the gaming industry. These GPUs offer huge computational power as well as very high memory bandwidth, both of which can be exploited by general-purpose high-performance applications. Such programmable GPUs are also known as general purpose graphic processing units (GPGPU; from now on we will simply use the term GPU). The GPU is specialized for compute-intensive, highly parallel computation, exactly the kind of work graphics rendering requires. It is based on the SIMD architectural model and is programmed through a data-parallel programming model.

1.2 Compute Unified Device Architecture (CUDA)

Nvidia Corporation, the market leader in GPUs, introduced a general purpose parallel computing architecture in November 2006 to harness the computing capabilities of its high-end GPUs. Compute Unified Device Architecture (CUDA) is based on a new parallel programming model and instruction set architecture that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language. Other languages and interfaces, such as FORTRAN, C++, OpenCL, and DirectX Compute, will be supported in the future.

1.3 Main Objectives

The objective of this lab is to become familiar with parallel programming using CUDA. It will give you an idea of how CUDA programs can be run on systems with and without a CUDA-capable GPU. The programming exercises will enable you to decompose a complex problem into portions that can run in parallel using the data-parallel programming model. The following activities are intended to be carried out in this lab:
  • Verification of a CUDA-capable system
  • Installation and verification of CUDA software components
  • Programming exercises
    o Hello World
    o Matrix Multiplication
    o Numerical calculation of the value of π
    o Parallel Sort

At the end of this lab, you should be able to:
  • Set up a CUDA development environment
  • Write, compile and run CUDA programs on an Nvidia device as well as on x86 multicore systems in device emulation mode
  • Use the data-parallel programming model
2 Setting up the CUDA development environment

To use CUDA on your system, you will need a supported version of Linux with a gcc compiler and toolchain, the CUDA software (available freely at http://www.nvidia.com/cuda), and a CUDA-capable GPU. If you do not have a CUDA-capable GPU, you can still use CUDA in device emulation mode. Device emulation mode is mainly intended for debugging and obviously does not offer the performance of a CUDA-capable GPU, so it should not be used for release builds or performance tuning. After installing the CUDA software, we need to test our CUDA build environment by compiling and running one or more sample programs (available in the CUDA SDK). This validates that the hardware and software are installed and communicating correctly.

2.1 Verifying that you have a CUDA-capable system

Before installing the CUDA software components, we should verify that we have a supported version of Linux with a gcc compiler and toolchain and, optionally, a CUDA-capable Nvidia GPU.

2.1.1 Verify the Nvidia video adapter

Note: Skip this section if your system is not equipped with a CUDA-capable Nvidia GPU.

Enter the following command to verify the Nvidia video adapter:

[root@gm gm]# lspci | grep -i nVidia
01:00.0 VGA compatible controller: nVidia Corporation GeForce 9600M GT (rev a1)
[root@gm gm]#

If you do not see anything, either you do not have an Nvidia graphics adapter or you need to update the PCI hardware database maintained by Linux, using the following command. If your network connection is fine, the output should look like this:

[root@gm gm]# update-pciids
  % Total  % Received % Xferd  Average Speed   Time    Time    Time   Current
                               Dload  Upload   Total   Spent   Left   Speed
100  148k 100  148k    0    0  6241k      0 --:--:-- --:--:-- --:--:-- 6767k
Done.
[root@gm gm]#

2.1.2 Verify a supported version of Linux

The current version (2.2) of the CUDA software components requires an x86-based Linux distribution. The following command checks the distribution and release number of the running system:

[root@gm gm]# uname -i && cat /etc/*release
i386
Fedora release 10 (Cambridge)
Fedora release 10 (Cambridge)
Fedora release 10 (Cambridge)
[root@gm gm]#
The output shows that the running system is 32-bit (i386) Fedora version 10. On a 64-bit system running in 64-bit mode the typical output will be x86_64. Version 2.2 of the CUDA development tools supports only the following distributions:

  Red Hat Enterprise Linux 4.3-4.7, 5.0-5.3
  SUSE Linux Enterprise Desktop 10 SP2
  OpenSUSE 11.0 or 11.1
  Fedora 9 or 10
  Ubuntu 8.04 or 8.10

You should check the CUDA download page frequently for updates, because support for other distributions is promised later.

2.1.3 Verifying gcc

The current CUDA development tools support versions 3.4 and 4.x of gcc. You can check the version of the currently installed gcc by issuing the following command:

[root@gm gm]# gcc --version
gcc (GCC) 4.3.2 20081105 (Red Hat 4.3.2-7)
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[root@gm gm]#

2.2 Downloading CUDA development components

You can get the CUDA software components from http://www.nvidia.com/object/cuda_get.html. Read the instructions given on that page carefully and download the necessary files. The Nvidia CUDA driver is not necessary if you do not have an Nvidia GPU and want to run CUDA programs in device emulation mode.

2.3 Installing CUDA software components

Uninstall any previously installed versions of the CUDA SDK and toolkit by simply deleting the directories containing these packages. The default directories for the toolkit and SDK are /usr/local/cuda/ and ~/NVIDIA_CUDA_SDK/ respectively. If you want to keep older versions, just rename these directories.

2.3.1 Installing the CUDA driver

Note: You do not have to install the CUDA driver if you don't have a CUDA-capable Nvidia GPU. If you try, you will see an error like "You do not appear to have an NVIDIA GPU supported by the 185.18.14 NVIDIA Linux graphics driver installed in this system."

You need to shut down the X server before installing the driver (the easiest way is to change id:5:initdefault: to id:3:initdefault: in the /etc/inittab file and reboot). You will then get a console only (no graphics). Secondly, you must have the source code of the running kernel (if needed), which can be installed by issuing the following command:

[root@gm gm]# yum install kernel-devel
More information about driver installation is available at http://us.download.nvidia.com/XFree86/Linux-x86/1.0-9755/README/index.html.

To install the driver, first exit the GUI (Ctrl-Alt-Backspace). On the command line, issue the following commands to switch to superuser, turn off X windows, install the driver, and restart the GUI environment, respectively:

[root@gm gm]# su
password:
[root@gm gm]# /sbin/init 3
[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./NVIDIA-Linux-x86-185.18.14-pkg1.run
[root@gm gm]# /sbin/init 5

You can also issue the following command to start the GUI environment:

[root@gm gm]# startx

Make sure your internet connection is working and follow the instructions displayed on your screen.

Note: You can verify the driver release by running the following command:
[root@gm gm]# /usr/bin/nvidia-settings

2.3.2 Installing the CUDA toolkit

Just issue the following commands:

[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudatoolkit_2.2_linux_32_fedora10.run
(Output omitted for the sake of brevity)

2.3.3 Setting environment variables

Issue the following commands:

[root@gm gm]# export PATH=/usr/local/cuda/bin/:$PATH
[root@gm gm]# export LD_LIBRARY_PATH=/usr/local/cuda/lib/:$LD_LIBRARY_PATH

You can make these settings permanent by adding the above commands to ~/.bashrc.

2.3.4 Configuring CUDA libraries

Add the directory /usr/local/cuda/lib/ on a new line in /etc/ld.so.conf and issue the following command:

[root@gm gm]# ldconfig

2.3.5 Installing the CUDA SDK

[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudasdk_2.21_linux.run
(Output omitted for the sake of brevity)
2.3.6 Installing the CUDA debugger

[root@gm gm]# cd <directory containing downloaded .run files>
[root@gm gm]# ./cudagdb_2.2_linux_32_rhel5.3.run
(Output omitted for the sake of brevity)

2.4 Verifying CUDA installations

After installation, the best practice is to validate the installed packages and the environment settings.

2.4.1 Verifying the CUDA environment

Run env and confirm that LD_LIBRARY_PATH and PATH contain the CUDA directories set in section 2.3.3:

[root@gm gm]# env
...
LD_LIBRARY_PATH=/usr/local/cuda/lib/:
...
PATH=/usr/local/cuda/bin/:/usr/kerberos/sbin:/usr/lib/qt-3.3/bin:/usr/kerberos/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/gm/bin
...
(Remaining environment variables omitted for the sake of brevity)
2.4.2 Verifying the CUDA compiler

nvcc is the compiler driver for CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler for CUDA code. To verify it, enter the following commands:

[root@gm gm]# which nvcc
/usr/local/cuda/bin/nvcc
[root@gm ~]# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Apr__9_07:37:20_PDT_2009
Cuda compilation tools, release 2.2, V0.2.1221
[root@gm ~]#

2.4.3 Compiling the sample projects

[root@gm gm]# cd <SDK directory>
[root@gm gm]# make

The resulting binaries will be in NVIDIA_CUDA_SDK/bin/linux/release.

2.4.4 Compiling the sample projects in emulation mode

[root@gm gm]# cd <SDK directory>
[root@gm gm]# make emu=1

The resulting binaries will be in NVIDIA_CUDA_SDK/bin/linux/emurelease.

2.4.5 Running deviceQuery and bandwidthTest

Note: You do not need to run deviceQuery and bandwidthTest if you don't have a CUDA-capable Nvidia GPU. In that case, you can try some other executable from the NVIDIA_CUDA_SDK/bin/linux/emurelease directory.

Run ./deviceQuery in <NVIDIA_CUDA_SDK>/bin/linux/release. On SELinux-enabled systems, you may need to disable this security feature using the setenforce command:

[root@gm gm]# setenforce 0
[root@gm gm]# cd <NVIDIA_CUDA_SDK>/bin/linux/release
[root@gm release]# ./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GeForce 9600M GT"
  CUDA Capability Major revision number:          1
  CUDA Capability Minor revision number:          1
  Total amount of global memory:                  536150016 bytes
  Number of multiprocessors:                      4
  Number of cores:                                32
  Total amount of constant memory:                65536 bytes
  Total amount of shared memory per block:        16384 bytes
  Total number of registers available per block:  8192
  Warp size:                                      32
  Maximum number of threads per block:            512
  Maximum sizes of each dimension of a block:     512 x 512 x 64
  Maximum sizes of each dimension of a grid:      65535 x 65535 x 1
  Maximum memory pitch:                           262144 bytes
  Texture alignment:                              256 bytes
  Clock rate:                                     1.25 GHz
  Concurrent copy and execution:                  Yes
  Run time limit on kernels:                      Yes
  Integrated:                                     No
  Support host page-locked memory mapping:        No
  Compute mode:                                   Default (multiple host threads can use this device simultaneously)

Test PASSED
Press ENTER to exit...

To test that the system and the CUDA-capable device communicate correctly, run the following:

[root@gm release]# ./bandwidthTest
Running on......
      device 0: GeForce 9600M GT
Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1756.6

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1168.8

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               10762.2

&&&& Test PASSED
Press ENTER to exit...
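The SDK's deviceQuery sample gathers the figures above through the CUDA runtime API. If you are curious how that works, the following is a minimal sketch (not part of the SDK; the file name props.cu is arbitrary) that prints a few of the same fields using cudaGetDeviceCount and cudaGetDeviceProperties. It should also build and run with -deviceemu on a system without an Nvidia GPU.

/* props.cu - minimal sketch that reports a few device properties,
 * similar in spirit to the SDK's deviceQuery sample. */
#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Found %d CUDA device(s)\n", count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: \"%s\"\n", dev, prop.name);
        printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        printf("  Global memory:           %lu bytes\n", (unsigned long)prop.totalGlobalMem);
        printf("  Multiprocessors:         %d\n", prop.multiProcessorCount);
        printf("  Shared memory per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
        printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        printf("  Warp size:               %d\n", prop.warpSize);
        printf("  Clock rate:              %.2f GHz\n", prop.clockRate * 1e-6f);  /* clockRate is in kHz */
    }
    return 0;
}

Compile it with nvcc exactly as described in section 2.5.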
Start using CUDA to build your own high-performance applications. The NVIDIA CUDA Programming Guide, located in /usr/local/cuda/doc/, is your next step in this course.

2.5 General procedure of programming in CUDA

You can use any text editor to write the source code for your CUDA program. Save it with a .cu extension, then issue the following commands (assuming the environment variables are properly set, as described above):

[root@gm <dir>]# nvcc -o <executable_name> -deviceemu <program_name>.cu
[root@gm <dir>]# ./<executable_name>

Replace the contents contained in "< >" with actual names. The "-deviceemu" option compiles code that is expected to run on the CPU only.

3 Programming in CUDA

CUDA comes with a software environment that allows developers to use C as a high-level programming language. This section is composed of programming exercises for hands-on practice. Partitioning a problem in terms of threads and thread blocks, and organizing the thread blocks into one or more grids, is the main challenge faced by CUDA programmers. The following programming exercises are designed to build an understanding of this kind of problem orchestration. Complicated details of CUDA, such as compilation steps, generated files, different file formats, and the precise and efficient use of the memory hierarchy, are out of the scope of this activity; you will learn these concepts gradually. What matters most here is to tackle the problem orchestration and get output from your simple programs.

3.1 Programming Exercise 1 (Hello World)

This is a well-known warm-up program that asks all threads to print "Hello World!".

3.1.1 Lab Objectives

Objectives of this lab experiment include:
  1. Learning about the general structure of a CUDA program
  2. Learning the concepts of a kernel, kernel invocation, and hierarchical thread grouping
  3. Learning the concepts of threadIdx, blockIdx and blockDim
  4. Compiling and running CUDA code in device emulation mode

3.1.2 Setup

Make sure that the environment variables are properly set up. If not, first set the environment variables as described in section 2.3.3.
/*
 * File: Hello_World.cu
 * Author: Ghulam Mustafa
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void printhello()
{
    /* Global thread ID: block offset plus position within the block.
     * printf from a kernel works here because the program is built
     * with -deviceemu, so the kernel actually runs on the CPU. */
    int thid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Thread%d: Hello World!\n", thid);
}

int main()
{
    /* Launch 5 blocks of 10 threads each (50 threads in total). */
    printhello<<<5, 10>>>();
    return 0;
}

3.1.3 Procedure

Write this simple program in any text editor and save it with a .cu extension (if a softcopy is not available). Compile and run as shown below. Experiment with the kernel invocation statement by changing the values of dimGrid and dimBlock, where the general kernel invocation statement is kernel<<<dimGrid, dimBlock>>>(). Try to figure out how the ID of a thread changes when dimBlock and dimGrid change (see the sketch after this exercise).

To compile and run:

[root@gm gm]# nvcc -o hello -deviceemu Hello_World.cu
[root@gm gm]# ./hello

3.1.4 Conclusions

List your conclusions with respect to the objectives of this experiment.

3.1.5 Lab Instructor's Evaluation

Lab instructor's remarks on whether the student finished the work to meet the lab objectives.
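As a small illustration of how the global thread ID depends on the launch configuration, the host-only sketch below enumerates the IDs produced by blockIdx.x * blockDim.x + threadIdx.x for two example configurations. It is not part of the exercise; the file name id_map.c and the helper show_ids are illustrative only.

/* id_map.c (illustrative only): enumerates the global IDs produced by a
 * <<<dimGrid, dimBlock>>> launch, mirroring blockIdx.x * blockDim.x + threadIdx.x. */
#include <stdio.h>

static void show_ids(int dimGrid, int dimBlock)
{
    printf("<<<%d,%d>>> :", dimGrid, dimBlock);
    for (int b = 0; b < dimGrid; b++) {
        int first = b * dimBlock;          /* ID of the first thread in block b */
        int last  = first + dimBlock - 1;  /* ID of the last thread in block b  */
        printf("  block %d -> %d..%d", b, first, last);
    }
    printf("\n");
}

int main()
{
    show_ids(5, 10);   /* the launch used in Hello_World.cu */
    show_ids(2, 25);   /* same 50 threads, grouped differently */
    return 0;
}

In both configurations 50 threads run and every global ID from 0 to 49 appears exactly once; only the grouping into blocks changes.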
3.2 Programming Exercise 2 (Matrix Multiplication)

Parallel matrix multiplication is representative of the kind of problem that maps well onto CUDA: each element of the resulting matrix is calculated in parallel.

3.2.1 Lab Objectives

Objectives of this lab experiment include:
  5. Learning the application of CUDA to linear algebra problems
  6. Learning how to partition a large problem into subproblems
  7. Learning how to exploit the thread and block IDs for useful calculations
  8. Learning how to offload the parallel portion of the code to the device
  9. Learning how to use device memory
  10. Understanding heterogeneous programming

3.2.2 Setup

Make sure that the environment variables are properly set up. If not, first set the environment variables as described in section 2.3.3.

/*
 * File: matrix_mul.cu
 * Author: Ghulam Mustafa
 * Created on July 31, 2009, 7:30 PM
 * Code is adapted from the Nvidia CUDA Programming Guide ver 2.2.1
 * Matrices are stored in row-major order: M(row, col) = M.elements[row * M.c + col]
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SZ 2
#define DBG 1

// Order of Matrix X = (Xr x Xc)
#define Xc (2 * BLOCK_SZ)
#define Xr (3 * BLOCK_SZ)
// Order of Matrix Y = (Yr x Yc)
#define Yc (2 * BLOCK_SZ)
#define Yr Xc
// Order of Matrix Z = (Zr x Zc)
#define Zc Yc
#define Zr Xr
#define N (Zr * Zc)

typedef struct Matrix {
    int r, c;
    float* elements;
} matrix;

void populate_matrix(matrix*);
void print_matrix(matrix);

__global__ void matrix_mul_krnl(matrix A, matrix B, matrix C)
{
    float C_entry = 0;
    // Each thread computes one element of C
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int i;
    for (i = 0; i < A.c; i++)
        C_entry += A.elements[row * A.c + i] * B.elements[i * B.c + col];
    C.elements[row * C.c + col] = C_entry;
}

int main()
{
    matrix X, Y, Z;
    X.r = Xr; Y.r = Yr; Z.r = Zr;
    X.c = Xc; Y.c = Yc; Z.c = Zc;

    if (DBG)
        printf("C(%d,%d) = A(%d,%d) x B(%d,%d)\n-----------------------\n",
               Z.r, Z.c, X.r, X.c, Y.r, Y.c);

    size_t size_Z = Z.c * Z.r * sizeof(float);
    Z.elements = (float*) malloc(size_Z);

    populate_matrix(&X);
    populate_matrix(&Y);
    printf("Matrix A (%d,%d)\n", X.r, X.c);
    print_matrix(X);
    printf("Matrix B (%d,%d)\n", Y.r, Y.c);
    print_matrix(Y);

    // Allocate A in device memory and copy it from the host
    matrix d_A;
    d_A.c = X.c;
    d_A.r = X.r;
    size_t size_A = X.c * X.r * sizeof(float);
    cudaMalloc((void**)&d_A.elements, size_A);
    cudaMemcpy(d_A.elements, X.elements, size_A, cudaMemcpyHostToDevice);

    // Allocate B in device memory and copy it from the host
    matrix d_B;
    d_B.c = Y.c;
    d_B.r = Y.r;
    size_t size_B = Y.c * Y.r * sizeof(float);
    cudaMalloc((void**)&d_B.elements, size_B);
    cudaMemcpy(d_B.elements, Y.elements, size_B, cudaMemcpyHostToDevice);

    // Allocate C in device memory
    matrix d_C;
    d_C.c = Z.c;
    d_C.r = Z.r;
    size_t size_C = Z.c * Z.r * sizeof(float);
    cudaMalloc((void**)&d_C.elements, size_C);

    // One thread per element of C
    dim3 dimBlock(BLOCK_SZ, BLOCK_SZ);
    dim3 dimGrid(Y.c / dimBlock.x, X.r / dimBlock.y);
    matrix_mul_krnl<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(Z.elements, d_C.elements, size_C, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);

    printf("Matrix C (%d,%d)\n", Z.r, Z.c);
    print_matrix(Z);

    free(X.elements);
    free(Y.elements);
    free(Z.elements);
}

void populate_matrix(matrix* mat)
{
    int dim = mat->c * mat->r;
    size_t sz = dim * sizeof(float);
    mat->elements = (float*) malloc(sz);
    int i;
    for (i = 0; i < dim; i++)
        mat->elements[i] = (float)(rand() % 1000);
}

void print_matrix(matrix mat)
{
    int i, n = 0, dim;
    dim = mat.c * mat.r;
    for (i = 0; i < dim; i++) {
        if (i == mat.c * n) {
            printf("\n");
            n++;
        }
        printf("%0.2f\t", mat.elements[i]);
    }
    printf("\n============================================================\n");
}

3.2.3 Procedure

Write this program in any text editor and save it with a .cu extension (if a softcopy is not available). Compile and run as shown below. Experiment with matrices of different sizes as well as with different block sizes (see the note after this exercise). Try to understand the concepts of threadIdx, blockDim and blockIdx and how they are used in this context.

To compile and run:

[root@gm gm]# nvcc -o matrix -deviceemu matrix_mul.cu
[root@gm gm]# ./matrix

3.2.4 Conclusions

List your conclusions with respect to the objectives of this experiment.

3.2.5 Lab Instructor's Evaluation

Lab instructor's remarks on whether the student finished the work to meet the lab objectives.
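A note for the size experiments: the grid dimensions above, (Y.c / dimBlock.x, X.r / dimBlock.y), assume that the matrix dimensions are exact multiples of BLOCK_SZ. A common way to lift that restriction, sketched below as an editor's addition rather than part of the original exercise, is to round the grid size up and add a bounds check in the kernel. The variant kernel name matrix_mul_krnl_guarded is illustrative; the matrix struct and the host variables are the ones defined in matrix_mul.cu above.

// Sketch only: guard against matrix dimensions that are not multiples of BLOCK_SZ.
__global__ void matrix_mul_krnl_guarded(matrix A, matrix B, matrix C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= C.r || col >= C.c)   // threads that fall outside the matrix do nothing
        return;
    float C_entry = 0;
    for (int i = 0; i < A.c; i++)
        C_entry += A.elements[row * A.c + i] * B.elements[i * B.c + col];
    C.elements[row * C.c + col] = C_entry;
}

/* Host side: round the grid size up so every element of C is covered.
 *
 *   dim3 dimBlock(BLOCK_SZ, BLOCK_SZ);
 *   dim3 dimGrid((Z.c + dimBlock.x - 1) / dimBlock.x,
 *                (Z.r + dimBlock.y - 1) / dimBlock.y);
 *   matrix_mul_krnl_guarded<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
 */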
3.3 Programming Exercise 3 (Numerical Calculation of the Value of Pi (π))

Parallel programming is extensively used in scientific computing. Numerical calculation of the value of π involves a loop. This programming exercise uses a specified number of threads in such a way that each thread is assigned an equal portion of the specified interval.

3.3.1 Lab Objectives

Objectives of this lab experiment include:
  11. Learning the application of CUDA to scientific (numerical) computing
  12. Learning how to use thread IDs in situations where the sequence of execution is important
  13. Learning how to attack loops for parallelism

3.3.2 Setup

Make sure that the environment variables are properly set up. If not, first set the environment variables as described in section 2.3.3.

/*
 * File: pi.cu
 * Author: Ghulam Mustafa
 * Created on July 31, 2009, 7:30 PM
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct PI_data {
    int n;          // total number of intervals
    int PerThrItr;  // intervals handled by each thread
    int nThr;       // number of threads
} data;

__global__ void calculate_PI(data d, float* s)
{
    float sum, x, w;
    int itr, i, j;
    itr = d.PerThrItr;
    i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    w = 1.0 / (float)d.n;                        // width of each interval
    sum = 0.0;
    if (i < d.nThr) {
        // Each thread integrates its own contiguous range of intervals
        for (j = i * itr; j < (i * itr + itr); j++) {
            x = w * (j - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        s[i] = sum * w;   // partial result for this thread
    }
}
// Host code
int main(int argc, char** argv)
{
    if (argc < 3) {
        printf("Usage: ./<progname> #intervals #threads\n");
        exit(1);
    }

    data pi_data;
    float PI = 0;
    pi_data.n = atoi(argv[1]);
    pi_data.nThr = atoi(argv[2]);
    pi_data.PerThrItr = pi_data.n / pi_data.nThr;

    float *d_sum;
    float *h_sum;

    // Allocate the partial-sum vector in device memory
    size_t size = pi_data.nThr * sizeof(float);
    cudaMalloc((void**)&d_sum, size);

    // Memory allocation on host
    h_sum = (float*) malloc(size);
    // cudaMemcpy(d_sum, h_sum, size, cudaMemcpyHostToDevice);

    int threads_per_block = 4;
    int blocks_per_grid;
    blocks_per_grid = (pi_data.nThr + threads_per_block - 1) / threads_per_block;

    calculate_PI<<<blocks_per_grid, threads_per_block>>>(pi_data, d_sum);

    cudaMemcpy(h_sum, d_sum, size, cudaMemcpyDeviceToHost);

    // Accumulate the per-thread partial sums on the host
    int i;
    for (i = 0; i < pi_data.nThr; i++)
        PI += h_sum[i];
    // PI = PI * pi_data.n;
    printf("Using %d intervals, the value of PI is %f\n", pi_data.n, PI);

    // Free device and host memory
    cudaFree(d_sum);
    free(h_sum);
}
3.3.3 Procedure

For computing π we use a numerical method based on the midpoint-rule approximation of the integral:

\pi = \int_0^1 \frac{4}{1 + x^2}\, dx \;\approx\; \frac{1}{N} \sum_{i=0}^{N-1} \frac{4}{1 + \left( \frac{i - 0.5}{N} \right)^2}

Using this technique, each partial sum can be calculated in parallel. Write this program in any text editor and save it with a .cu extension (if a softcopy is not available). Compile and run as shown below. Experiment with different numbers of intervals and threads, and compare the result against the sequential reference that follows this exercise. Try to understand how threadIdx, blockDim and blockIdx are exploited here to keep the sequence of the workflow.

To compile and run:

[root@gm gm]# nvcc -o PI -deviceemu pi.cu
[root@gm gm]# ./PI 2300 25

3.3.4 Conclusions

List your conclusions with respect to the objectives of this experiment.

3.3.5 Lab Instructor's Evaluation

Lab instructor's remarks on whether the student finished the work to meet the lab objectives.
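To check the kernel's output, it can help to compare it against a plain sequential version of the same midpoint-rule sum. The sketch below is an editor's addition, not part of the original handout; the file name pi_ref.c and the hard-coded interval count are arbitrary.

/* pi_ref.c - sequential reference for the midpoint-rule approximation of pi. */
#include <stdio.h>

int main(void)
{
    const int n = 2300;                 /* same interval count as the example CUDA run */
    const double w = 1.0 / (double)n;   /* width of each interval */
    double sum = 0.0;

    for (int i = 0; i < n; i++) {
        double x = w * (i - 0.5);       /* same sample point as the lab's formula */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("Using %d intervals, the value of PI is %f\n", n, sum * w);
    return 0;
}

Compiling this with gcc and comparing its output with ./PI 2300 25 (the two should agree to several decimal places) is a quick sanity check on the thread decomposition.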
3.4 Programming Exercise 4 (Parallel Sort)

A sorting network is a sorting algorithm in which the sequence of comparisons is not data-dependent, which makes it suitable for parallel implementation. Bitonic sort is one of the fastest sorting networks, consisting of Θ(n log²n) comparators. It has a simple implementation and is very efficient when sorting a small number of elements.

3.4.1 Lab Objectives

Objectives of this lab experiment include:
  14. Learning the bitonic sorting algorithm
  15. Learning how to use the __shared__ construct
  16. Learning how to use the __device__ construct
  17. Using barrier synchronization for thread coordination to support parallelism

3.4.2 Setup

Make sure that the environment variables are properly set up. If not, first set the environment variables as described in section 2.3.3.

/*
 * File: parallel_sort.cu
 * Author: Ghulam Mustafa
 * Created on July 31, 2009, 7:30 PM
 * Code is adapted from Nvidia CUDA SDK sample projects ver 2.2.1
 */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM 32

__device__ inline void swap(int &a, int &b)
{
    int tmp = a;
    a = b;
    b = tmp;
}

__global__ static void bitonicSort(int* values)
{
    extern __shared__ int shared[];
    const unsigned int tid = threadIdx.x;

    // Copy input to shared memory
    shared[tid] = values[tid];
    __syncthreads();

    // Parallel bitonic sort
    for (unsigned int k = 2; k <= NUM; k *= 2) {
        // Bitonic merge:
        for (unsigned int j = k / 2; j > 0; j /= 2) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    // Ascending part of the bitonic sequence
                    if (shared[tid] > shared[ixj])
                        swap(shared[tid], shared[ixj]);
                } else {
                    // Descending part of the bitonic sequence
                    if (shared[tid] < shared[ixj])
                        swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    }

    // Write result back to global memory
    values[tid] = shared[tid];
}

int main(int argc, char** argv)
{
    int values[NUM];

    printf("\nUnsorted Array\n==============\n");
    for (int i = 0; i < NUM; i++) {
        values[i] = rand() % 1000;
        printf("%d\t", values[i]);
    }
    printf("\n");

    int* dvalues;
    cudaMalloc((void**)&dvalues, sizeof(int) * NUM);
    cudaMemcpy(dvalues, values, sizeof(int) * NUM, cudaMemcpyHostToDevice);

    // One block of NUM threads; dynamic shared memory holds the whole array
    bitonicSort<<<1, NUM, sizeof(int) * NUM>>>(dvalues);

    // check for any errors
    cudaMemcpy(values, dvalues, sizeof(int) * NUM, cudaMemcpyDeviceToHost);
    cudaFree(dvalues);

    bool passed = true;
    int i;
    printf("\nSorted Array\n==============\n");
    for (i = 1; i < NUM; i++) {
        if (values[i - 1] > values[i])
            passed = false;
        printf("%d\t", values[i - 1]);
    }
    printf("%d\t\n", values[i - 1]);
    printf("Test %s\n", passed ? "PASSED" : "FAILED");
}

3.4.3 Procedure

Write this program in any text editor and save it with a .cu extension (if a softcopy is not available). Compile and run as shown below. Experiment with values of NUM and check the status of the test (the last line of the output); see the note after this exercise. Try to understand the concepts of threadIdx, blockDim and blockIdx and how they are used in this context.

To compile and run:

[root@gm gm]# nvcc -o ll_sort -deviceemu parallel_sort.cu
[root@gm gm]# ./ll_sort

3.4.4 Conclusions

List your conclusions with respect to the objectives of this experiment.

3.4.5 Lab Instructor's Evaluation

Lab instructor's remarks on whether the student finished the work to meet the lab objectives.
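A cautionary note for the NUM experiments, added here as an editor's sketch rather than part of the original handout: the kernel launches a single block with one thread per element, so NUM must be a power of two (the bitonic network assumes it) and must not exceed the device's maximum number of threads per block (512 on the GeForce 9600M GT reported by deviceQuery above). A simple host-side guard makes those limits explicit; cudaGetDeviceProperties is the same runtime call used earlier, and the function name check_launch_config is illustrative.

// Sketch: refuse to launch if NUM violates the kernel's assumptions.
// Call check_launch_config(NUM); in main() before the bitonicSort launch.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

static void check_launch_config(int num)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    int is_pow2 = (num > 0) && ((num & (num - 1)) == 0);
    if (!is_pow2) {
        fprintf(stderr, "NUM = %d is not a power of two\n", num);
        exit(1);
    }
    if (num > prop.maxThreadsPerBlock) {
        fprintf(stderr, "NUM = %d exceeds the limit of %d threads per block\n",
                num, prop.maxThreadsPerBlock);
        exit(1);
    }
}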
