CUDA Development in Python Language
Published in: Education, Technology
  • CUDA Development in Python Language

    1. CUDA Development using PyCUDA, Part 1. Prof. Mario A. Gazziro (Yah!). Organization: Prof. André Carvalho. July 2012. Support: Igor, Heitor, Pedro, Ruan and Andre.
    2. What does 1 TeraFlop/s mean?
       Computer              TeraFlop/s   Year of installation   Price (US$)   Institute
       PowerPC 970 Cluster   3            2007                   500,000       USP (CCE)
       GPU Attilio           16           2009                   100,000       IFSC
       *BlueGene L           26 (49)      2007 (2010)            Donation      IINN
       TUPÃ                  258          2010                   25,000,000    INPE
       *Only parallel computer; all the others are serial ones.
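       One way to read the comparison on this slide is as price per TeraFlop/s (1 TeraFlop/s = 10^12 floating-point operations per second). The sketch below is only arithmetic on the three priced rows of the slide; the helper name `cost_per_tflops` is an illustrative assumption, and Python 3 is assumed.

```python
# Price per TeraFlop/s for the machines on the slide that list a price.
# The figures are copied from the slide; cost_per_tflops is a hypothetical helper.
FLOPS_PER_TERAFLOP = 10 ** 12  # 1 TeraFlop/s = 10^12 floating-point operations per second

machines = [
    # (name, TeraFlop/s, price in US$)
    ("PowerPC 970 Cluster", 3, 500000),
    ("GPU Attilio", 16, 100000),
    ("TUPA", 258, 25000000),
]


def cost_per_tflops(price_usd, tflops):
    """Return the purchase price in US$ per TeraFlop/s."""
    return price_usd / float(tflops)


for name, tflops, price in machines:
    print("%-20s %12.2f US$ per TeraFlop/s" % (name, cost_per_tflops(price, tflops)))
```

       With these figures the GPU machine comes out far cheaper per TeraFlop/s than the other two, which appears to be the point the slide is making.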
    3. Installation – Step 1
       DRIVER:
       sudo /etc/init.d/gdm stop
       <ALT+F1>
       <log in as labredes, password 12345678>
       chmod 777 devdriver_4.2_linux_64_295.41.run
       sudo ./devdriver_4.2_linux_64_295.41.run
       <accept everything>
       sudo /etc/init.d/gdm start
    4. Installation – Step 2
       TOOLKIT:
       chmod 777 ./cudatoolkit_4.2.9_linux_64_ubuntu10.04.run
       sudo ./cudatoolkit_4.2.9_linux_64_ubuntu10.04.run
       <accept all the options>
       <add the text below to the end of the .bashrc file>
       cd ~
       gedit .bashrc
       export PATH=/usr/local/cuda/bin:$PATH
       export LPATH=/usr/lib/nvidia-current:$LPATH
       export LIBRARY_PATH=/usr/lib/nvidia-current:$LIBRARY_PATH
       export LD_LIBRARY_PATH=/usr/lib/nvidia-current:/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH
       <save and exit>
    5. Installation – Step 3
       SDK:
       chmod 777 gpucomputingsdk_4.2.9_linux.run
       <DO NOT USE SUDO!>
       ./gpucomputingsdk_4.2.9_linux.run
       <accept all the options>
       <close all terminal windows so the environment variables are reloaded>
       <test the compiler>
       nvcc
       <the following error message should appear>
       nvcc fatal : No input files specified; use option --help for more information
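       The compiler check in this step can also be scripted. The sketch below is a minimal example, assuming Python 3; the function name `nvcc_version` is hypothetical. It looks for `nvcc` on the `PATH` and, if found, asks it for its version instead of relying on the "No input files specified" error.

```python
import shutil
import subprocess


def nvcc_version():
    """Return the output of `nvcc --version`, or None if nvcc is not on PATH."""
    path = shutil.which("nvcc")  # searches the directories listed in PATH
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


version = nvcc_version()
if version is None:
    print("nvcc not found: check that /usr/local/cuda/bin is on PATH")
else:
    print(version)
```

       If this reports that `nvcc` is missing right after the install, the usual cause is a terminal that was opened before `.bashrc` was edited.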
    6. Installation – Step 4
       cd ~/NVIDIA_GPU_Computing_SDK/
       make
       <wait>
       cd ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/
       ls
       ./SobelFilter
       ./particles
       <click with button 2 and select "move cursor mode">
       <move the mouse>
       ./fluidsGL
       <CLOSE ALL>
       ./particles
       <CTRL+Z>
       bg
       ./fluidsGL
       <check the concurrent execution of the kernels>
       <sharing of the GPU resources>
    7. CUDA SDK sample applications
    8. CUDA SDK sample applications
    9. Installation – Step 5
       <INSTALLING THE PYTHON DEPENDENCIES>
       sudo apt-get install python-numpy
       sudo apt-get install python-h5py
       sudo apt-get install python-scipy
       sudo apt-get install python-matplotlib
       <test the graphical interface>
       cd ~/"Área de Trabalho"/CUDA
       python ./teste_grafico.py
       HDF5 viewer:
       chmod 777 hdfview_install_linux64.bin
       ./hdfview_install_xxx.bin
       ./hdfview
    10. Installation – Step 6
       PyCUDA:
       sudo apt-get install build-essential python-dev python-setuptools libboost-python-dev libboost-thread-dev -y
       tar xzvf pycuda-2011.2.2.tar.gz
       cd pycuda-2011.2.2/
       ./configure.py --cuda-root=/usr/local/cuda --cudadrv-lib-dir=/usr/lib \
           --boost-inc-dir=/usr/include --boost-lib-dir=/usr/lib \
           --boost-python-libname=boost_python-mt \
           --boost-thread-libname=boost_thread-mt --no-use-shipped-boost
       make -j 4
       <delete siteconf.py in case of error!>
       sudo env PATH=$PATH python setup.py install
       <test PyCUDA>
       cd ~/"Área de Trabalho"/CUDA/pycuda-2011.2.2/examples
       python ./demo.py
       <look at the manipulation of arrays and matrices>
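       Besides running `demo.py`, a quick import check can show whether the installation is usable. This is a minimal sketch, assuming Python 3; the function name `pycuda_status` is hypothetical. It uses only `pycuda.driver.init()` and `pycuda.driver.Device.count()`, and it reports a message instead of failing on a machine without PyCUDA or without a GPU.

```python
def pycuda_status():
    """Describe whether PyCUDA can be imported and can see a CUDA device."""
    try:
        import pycuda.driver as cuda
    except ImportError as exc:
        # PyCUDA (or the CUDA driver library it links against) is missing.
        return "PyCUDA is not importable: %s" % exc
    try:
        cuda.init()  # must be called before any other driver function
        return "PyCUDA OK, %d CUDA device(s) found" % cuda.Device.count()
    except Exception as exc:
        # Typically a driver problem, or no GPU present.
        return "PyCUDA imported, but CUDA initialisation failed: %s" % exc


print(pycuda_status())
```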
    11. Part I: Overview
       Definition:
       • Graphics Processing Units (GPUs) are graphics card adapters that give programmers access to their internals through an API (Application Programming Interface). Today there are even GPUs without graphics output, built only to perform scientific calculations.
       • Introduced in 2006, the Compute Unified Device Architecture (CUDA) is a combined software and hardware architecture, available for NVIDIA G80 GPUs and above, that enables data-parallel general-purpose computing on graphics hardware. It offers a C-like programming API with some language extensions.
       Key points:
       • The architecture supports massively multithreaded applications and provides inter-thread communication and memory access.
    12. Why is this topic important?
       • Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements.
       • Emerging hardware technologies, like the CUDA architecture, can significantly boost the performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.
    13. Where would I encounter this?
       • Gaming
       • Raytracing
       • 3D scanners
       • Computer graphics
       • Number crunching
       • Scientific calculation
    14. CUDA vs Intel: NVIDIA GeForce 8800 GTX vs Intel Xeon E5335 2 GHz, L2 cache 8 MB
    15. Grid of thread blocks
       • The computational grid consists of a grid of thread blocks
       • Each thread executes the kernel
       • The application specifies the grid and block dimensions
       • The grid layouts can be 1-, 2- or 3-dimensional
       • The maximum sizes are limited by the GPU hardware
       • Each block has a unique block ID
       • Each thread has a unique thread ID (within the block)
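       The relationship between block ID, thread ID and the data element a thread works on can be emulated on the CPU. The sketch below is plain Python 3, not GPU code; the function name `global_thread_ids` is hypothetical. In a one-dimensional CUDA kernel the same index is computed as `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def global_thread_ids(n_items, block_dim):
    """Emulate 1D CUDA indexing: map every (block, thread) pair to a global index."""
    # Enough blocks to cover all items (ceiling division).
    grid_dim = (n_items + block_dim - 1) // block_dim
    ids = []
    for block_idx in range(grid_dim):        # plays the role of blockIdx.x
        for thread_idx in range(block_dim):  # plays the role of threadIdx.x
            gid = block_idx * block_dim + thread_idx
            if gid < n_items:                # guard: the last block may overhang
                ids.append((block_idx, thread_idx, gid))
    return grid_dim, ids


grid_dim, ids = global_thread_ids(n_items=10, block_dim=4)
print(grid_dim)  # 3 blocks are needed for 10 items with 4 threads per block
print(ids[-1])   # (2, 1, 9): block 2, thread 1 handles the last item
```

       The thread ID alone is only unique inside its block, which is why the block ID has to enter the formula.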
    16. Elementwise Matrix Addition
    17. Elementwise Matrix Addition: the nested for-loops are replaced with an implicit grid.
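       The code shown on these two slides is not preserved in the transcript, so the following is a hedged CPU-only sketch of the idea in Python 3, with hypothetical names (`add_loops`, `add_kernel`, `launch`). The first function is the nested-loop version; the second is what a single thread would do; the third stands in for the kernel launch, visiting every (block, thread) pair that the GPU would run in parallel.

```python
def add_loops(a, b):
    """Serial version: two nested for-loops over rows and columns."""
    rows, cols = len(a), len(a[0])
    c = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            c[i][j] = a[i][j] + b[i][j]
    return c


def add_kernel(a, b, c, rows, cols, block_idx, block_dim, thread_idx):
    """Work of ONE thread: compute its own (i, j) and add a single element."""
    j = block_idx[0] * block_dim[0] + thread_idx[0]  # x direction -> column
    i = block_idx[1] * block_dim[1] + thread_idx[1]  # y direction -> row
    if i < rows and j < cols:                        # threads outside the matrix do nothing
        c[i][j] = a[i][j] + b[i][j]


def launch(a, b, block_dim=(2, 2)):
    """Emulate a 2D kernel launch; on a GPU these iterations run in parallel."""
    rows, cols = len(a), len(a[0])
    grid_dim = ((cols + block_dim[0] - 1) // block_dim[0],
                (rows + block_dim[1] - 1) // block_dim[1])
    c = [[0] * cols for _ in range(rows)]
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    add_kernel(a, b, c, rows, cols, (bx, by), block_dim, (tx, ty))
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
print(launch(a, b) == add_loops(a, b))  # True
```

       The kernel body contains no loop at all: the loops in `launch` exist only because this sketch runs on the CPU.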
    18. Memory model: CUDA exposes all the different types of memory on the GPU.
    19. Part II: Classroom Exercises – HEAT (local)
       Compile and run the heat.cu example.
       Command line: nvcc heat.cu -o heat -lglut
    20. Part II: Classroom Exercises – HEAT (remote)
       Compile and run the heat.cu example on the HAL9k GPU server (IFSC/CIERMag).
       cmd: ssh -X yah@afrodite.ifsc.usp.br -p 2236
       What changes?
       Why does this change happen?
       What is the influence of network latency on the final result?
    21. Part II: Classroom Exercises – EXERC1
       Type, compile and test the following code. What does this program do?
       Command line: gcc exerc1.c -o exerc1
    22. Part II: Classroom Exercises – EXERC2
       Type, compile and test the following code. What does this program do?
       Command line: gcc exerc2.c -o exerc2
    23. Part II: Classroom Exercises – EXERC3
       Type, compile and test the following code. What does this program do?
       Command line: nvcc exerc3.cu -o exerc3
    24. Part II: Classroom Exercises – EXERC4
    25. Part II: Classroom Exercises – EXERC5
    26. Part II: Classroom Exercises – EXERC6 – part A
    27. Part II: Classroom Exercises – EXERC6 – part B
    28. Part III: Project
       Case Study: Initial calculation for solving a sparse matrix in the method proposed by Professor Guilherme Sipahi, from IFSC.
           N=1001;
           K(1:N) = rand(1,N);
           g1(1:2*N) = rand(1,2*N);
           k = 1.3;
           tic;
           for i=1:N
             for j=1:N
               M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k);
             end
           end
       Task: Design the CUDA kernel for this algorithm (using PyCUDA or C) and compare its speed-up with the gold standard provided by the professor.
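       Before writing the kernel it can help to have a CPU reference to check results and timings against. The sketch below is a plain Python 3 translation of the loop on this slide, not the professor's gold standard; the function name `build_matrix` is hypothetical. The only non-obvious step is the index shift: the slide's code is 1-based, so `g1(N+i-j)` becomes `g1[N + i - j - 1]` with 0-based `i` and `j`.

```python
import random
import time


def build_matrix(K, g1, k):
    """CPU reference: M[i][j] = g1[N+i-j-1] * (K[i]+k) * (K[j]+k), 0-based."""
    N = len(K)
    M = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # N + i - j - 1 runs from 0 to 2N-2, so g1 needs at least 2N-1 entries.
            M[i][j] = g1[N + i - j - 1] * (K[i] + k) * (K[j] + k)
    return M


N = 1001
K = [random.random() for _ in range(N)]
g1 = [random.random() for _ in range(2 * N)]
k = 1.3

start = time.time()
M = build_matrix(K, g1, k)
elapsed = time.time() - start
print("CPU reference for N=%d took %.3f s" % (N, elapsed))
```

       Each element `M[i][j]` depends only on `i` and `j`, so in a CUDA kernel the body of the inner loop would become the work of one thread.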
    29. Part III: Project Best Solution (Mateus and Bié)
       Case Study: Initial calculation for solving a sparse matrix in the method proposed by Professor Guilherme Sipahi, from IFSC.
       BLOCK(16, 2, 1)
       GRID(1000/32, 1000)
       370 µs with 16 CUDA cores
    30. Questions?
       So long, and thanks for all the fish! See you tomorrow!
       2nd day activities:
       - Database integration (HDF5)
       - Graphics visualization (Matplotlib)
       - Thread syncs and thread fences
       - Atomic operations and critical region control
    31. References
       • Gokhale M. et al. Hardware Technologies for High-Performance Data-Intensive Computing, IEEE Computer, 18-9162, pg 60, 2008.
       • Lietsch S. et al. A CUDA-Supported Approach to Remote Rendering, Lecture Notes in Computer Science, 2007.
       • Fujimoto N. Faster Matrix-Vector Multiplication on GeForce 8800 GTX, IEEE, 2008.
       Book Reference
       • NVIDIA Corporation, David, NVIDIA CUDA Programming Guide, Version 1.1, 2007.