CUDA Tutorial - Interfacing with MATLAB via MEX files

  • @MarioGazziro
    I use MATLAB Version 8.1.0.604 (R2013a) and whichever mex version comes with it. I get zeroes if I compile it by just using 'mex'. If, however, I use the command from the tutorial:

    nvmex -f nvmexopts.bat brazil_coulomb.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart

    with #include "cuda.h", #include "mex.h", #include "/usr/include/math.h",

    then I get the following error:
    Warning: No source files in argument list. Assuming C source
    code for linking purposes. To override this
    assumption use '-fortran' or '-cxx'.
    ld: warning: ignoring file coulomb.o, file was built for i386 which is not the architecture being linked (x86_64): coulomb.o
    Undefined symbols for architecture x86_64:
    '_mexFunction', referenced from:
    -exported_symbol[s_list] command line option
    ld: symbol(s) not found for architecture x86_64
    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    mex: link of 'coulomb.mexmaci64' failed.

    I modified the code in your example to calculate a coulomb field.
  • Which version of the MATLAB MEX compiler are you using? This is a very old tutorial. Try my new slides about Python and CUDA.
  • Hi, I'm getting zeros for the CUDA output matrix from your code. Any idea?



  1. High-Performance Computing Using GPUs (Uso de placas gráficas em computação de alto-desempenho). Mario Alexandre Gazziro (YAH!). Advisor: Jan F. W. Slaets. 24/09/08
  2. Part I: Overview. Definition: introduced in 2006, the Compute Unified Device Architecture (CUDA) is a combination of software and hardware architecture (available for NVIDIA G80 GPUs and above) that enables data-parallel general-purpose computing on graphics hardware. It offers a C-like programming API with some language extensions. Key points: the architecture supports massively multithreaded applications and provides inter-thread communication and memory access.
  3. Why is this topic important? Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Emerging hardware technologies such as the CUDA architecture can significantly boost the performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.
  4. Where would I encounter this? Gaming, ray tracing, 3D scanners, computer graphics, number crunching, scientific calculation.
  5. CUDA SDK sample applications.
  6. CUDA SDK sample applications (continued).
  7. CUDA vs. Intel: NVIDIA GeForce 8800 GTX vs. Intel Xeon E5335 (2 GHz, 8 MB L2 cache).
  8. Grid of thread blocks. The computational grid consists of a grid of thread blocks; each thread executes the kernel. The application specifies the grid and block dimensions. Grid layouts can be 1-, 2-, or 3-dimensional; the maximal sizes are determined by GPU memory. Each block has a unique block ID, and each thread has a unique thread ID within its block.
  9. Elementwise matrix addition.
  10. Elementwise matrix addition. The nested for-loops are replaced with an implicit grid.
  11. Memory model. CUDA exposes all the different types of memory on the GPU.
  12. Part II: Accelerating MATLAB with CUDA. Case study: the initial calculation for solving a sparse matrix in the method proposed by professor Guilherme Sipahi, from IFSC.

         N = 1001;
         K(1:N) = rand(1,N);
         g1(1:2*N) = rand(1,2*N);
         k = 1.3;
         tic;
         for i = 1:N
           for j = 1:N
             M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k);
           end
         end
         matlabTime = toc
         tic;
         M = guilherme_cuda(K,g1);
         cudaTime = toc
         speedup = matlabTime/cudaTime
  13. Results: a speedup of 4.77x using an NVIDIA 8400M with 128 MB.

         matlabTime = 10.6880
         cudaTime   = 2.2406
         speedup    = 4.7701
  14. The MEX file structure. The main() function is replaced with mexFunction.

         #include "mex.h"

         void mexFunction(int nlhs, mxArray *plhs[],
                          int nrhs, const mxArray *prhs[])
         {
             /* code that handles the interface and calls
                the computational function */
             return;
         }

      mexFunction arguments:
      - nlhs: the number of lhs (output) arguments.
      - plhs: pointer to an array which will hold the output data; each element is of type mxArray.
      - nrhs: the number of rhs (input) arguments.
      - prhs: pointer to an array which holds the input data; each element is of type const mxArray.
  15. MX functions. The collection of functions used to manipulate mxArrays are called MX functions, and their names begin with mx. Examples:
      • mxArray creation functions: mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateString, mxCreateDoubleScalar.
      • Access data members of mxArrays: mxGetPr, mxGetPi, mxGetM, mxGetN.
      • Modify data members: mxSetPr, mxSetPi.
      • Manage mxArray memory: mxMalloc, mxCalloc, mxFree, mxDestroyArray.
  16. MEX file for CUDA used in the case study, part 1. Compilation instructions under MATLAB:

         nvmex -f nvmexopts.bat square_me_cuda.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart

         #include "cuda.h"
         #include "mex.h"

         /* Kernel to compute elements of the array on the GPU */
         __global__ void guilherme_kernel(float* K, float* g1, float* M, int N)
         {
             float k = 1.3f;   /* must be float: an int would truncate 1.3 to 1 */
             int i = blockIdx.x*blockDim.x + threadIdx.x;
             int j = blockIdx.y*blockDim.y + threadIdx.y;
             if (i < N && j < N)
                 M[i+j*N] = g1[N+i-j]*(K[i]+k)*(K[j]+k);
         }
  17. MEX file for CUDA used in the case study, part 2.

         /* Gateway function */
         void mexFunction(int nlhs, mxArray *plhs[],
                          int nrhs, const mxArray *prhs[])
         {
             int j, m_0, m_1, m_o, n_0, n_1, n_o;
             double *data1, *data2, *data3;
             float *data1f, *data2f, *data3f;
             float *data1f_gpu, *data2f_gpu, *data3f_gpu;
             mxClassID category;

             if (nrhs != (nlhs+1))
                 mexErrMsgTxt("The number of input and output arguments must be the same.");

             /* Find the dimensions of the data */
             m_0 = mxGetM(prhs[0]);
             n_0 = mxGetN(prhs[0]);

             /* Create an input data array on the GPU */
             cudaMalloc((void **) &data1f_gpu, sizeof(float)*m_0*n_0);

             /* Retrieve the input data */
             data1 = mxGetPr(prhs[0]);

             /* Check if the input array is single or double precision */
             category = mxGetClassID(prhs[0]);
             if (category == mxSINGLE_CLASS)
             {
                 /* Single precision: it can be sent directly to the card */
                 cudaMemcpy(data1f_gpu, data1, sizeof(float)*m_0*n_0,
                            cudaMemcpyHostToDevice);
             }
  18. MEX file for CUDA used in the case study, part 3.

         /* Find the dimensions of the data */
         m_1 = mxGetM(prhs[1]);
         n_1 = mxGetN(prhs[1]);

         /* Create an input data array on the GPU */
         cudaMalloc((void **) &data2f_gpu, sizeof(float)*m_1*n_1);

         /* Retrieve the input data */
         data2 = mxGetPr(prhs[1]);

         /* Check if the input array is single or double precision */
         category = mxGetClassID(prhs[1]);
         if (category == mxSINGLE_CLASS)
         {
             /* Single precision: it can be sent directly to the card */
             cudaMemcpy(data2f_gpu, data2, sizeof(float)*m_1*n_1,
                        cudaMemcpyHostToDevice);
         }

         /* Dimensions of the output */
         m_o = n_0;
         n_o = n_1;

         /* Create the output data array on the GPU */
         cudaMalloc((void **) &data3f_gpu, sizeof(float)*m_o*n_o);
  19. MEX file for CUDA used in the case study, part 4.

         /* Compute execution configuration using 128 threads per block */
         dim3 dimBlock(128);
         dim3 dimGrid((m_o*n_o)/dimBlock.x);
         if ((n_o*m_o) % 128 != 0) dimGrid.x += 1;

         /* Call function on GPU */
         guilherme_kernel<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu,
                                                data3f_gpu, n_o*m_o);

         data3f = (float *) mxMalloc(sizeof(float)*m_o*n_o);

         /* Copy result back to host */
         cudaMemcpy(data3f, data3f_gpu, sizeof(float)*n_o*m_o,
                    cudaMemcpyDeviceToHost);

         /* Create an mxArray for the output data */
         plhs[0] = mxCreateDoubleMatrix(m_o, n_o, mxREAL);

         /* Create a pointer to the output data */
         data3 = mxGetPr(plhs[0]);
  20. Part III: Device options.

         GPU Model   Memory          Threads   Price (R$)
         8600 GT     256 MB           3,072      150.00
         8600 GT     512 MB           3,072      300.00
         8800 GT     512 MB          12,288      800.00
         9800 GTX    512 MB (DDR3)   12,288    1,200.00
         9800 GX2    1 GB (DDR3)     24,576    2,500.00
  21. References.
      • Gokhale, M. et al. Hardware Technologies for High-Performance Data-Intensive Computing. IEEE Computer (ISSN 0018-9162), p. 60, 2008.
      • Lietsch, S. et al. A CUDA-Supported Approach to Remote Rendering. Lecture Notes in Computer Science, 2007.
      • Fujimoto, N. Faster Matrix-Vector Multiplication on GeForce 8800 GTX. IEEE, 2008.
      Book reference:
      • NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 1.1, 2007.
  22. Questions? So long and thanks for all the fish!
