CUDA Tutorial: Interfacing with MATLAB via MEX files
  • @MarioGazziro
    I use MATLAB Version: 8.1.0.604 (R2013a) and whichever mex version comes with it. I get zeroes if I compile it by just using 'mex'. If however I use the command from the tutorial:

        nvmex -f nvmexopts.bat brazil_coulomb.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart

    with the source including 'cuda.h', 'mex.h' and '/usr/include/math.h', then I get the following error:

        Warning: No source files in argument list. Assuming C source
        code for linking purposes. To override this
        assumption use '-fortran' or '-cxx'.
        ld: warning: ignoring file coulomb.o, file was built for i386 which is not the architecture being linked (x86_64): coulomb.o
        Undefined symbols for architecture x86_64:
        '_mexFunction', referenced from:
        -exported_symbol[s_list] command line option
        ld: symbol(s) not found for architecture x86_64
        clang: error: linker command failed with exit code 1 (use -v to see invocation)
        mex: link of 'coulomb.mexmaci64' failed.

    I modified the code in your example to calculate a coulomb field.
  • Which version of the MATLAB MEX compiler are you using? This is a very old tutorial. Try my new slides about Python and CUDA.
  • Hi, I'm getting zeros for the CUDA output matrix from your code. Any idea?
Transcript

  • 1. Uso de placas gráficas em computação de alto desempenho (High-performance computing using GPUs). Mario Alexandre Gazziro (YAH!). Advisor: Jan F. W. Slaets. 24/09/08
  • 2. Part I: Overview
    Definition: introduced in 2006, the Compute Unified Device Architecture (CUDA) is a combination of software and hardware architecture (available for NVIDIA G80 GPUs and above) which enables data-parallel general-purpose computing on the graphics hardware. It therefore offers a C-like programming API with some language extensions.
    Key points: the architecture offers support for massively multithreaded applications and provides support for inter-thread communication and memory access.
  • 3. Why is this topic important? Data-intensive problems challenge conventional computing architectures with demanding CPU, memory, and I/O requirements. Emerging hardware technologies, like the CUDA architecture, can significantly boost the performance of a wide range of applications by increasing compute cycles and bandwidth and reducing latency.
  • 4. Where would I encounter this? Gaming, raytracing, 3D scanners, computer graphics, number crunching, scientific calculation.
  • 5. CUDA SDK sample applications
  • 6. CUDA SDK sample applications
  • 7. CUDA vs Intel: NVIDIA GeForce 8800 GTX vs Intel Xeon E5335 (2 GHz, 8 MB L2 cache)
  • 8. Grid of thread blocks
    - The computational grid consists of a grid of thread blocks
    - Each thread executes the kernel
    - The application specifies the grid and block dimensions
    - The grid layouts can be 1-, 2- or 3-dimensional
    - The maximal sizes are determined by GPU memory
    - Each block has a unique block ID
    - Each thread has a unique thread ID (within the block)
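The ID arithmetic above can be sketched on the CPU. This fragment is not part of the original slides; the function names are invented here for illustration:

```c
#include <assert.h>

/* CPU-side sketch of CUDA's index arithmetic: in a 1-D launch, a thread's
   global index combines its block ID, the block size, and its thread ID. */
int global_index(int block_id, int block_dim, int thread_id)
{
    return block_id * block_dim + thread_id;
}

/* Number of blocks needed to cover n elements with block_dim threads per
   block, rounding up (the grid-size computation slide 19 performs). */
int blocks_needed(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}
```

For example, thread 5 of block 2 with 128-thread blocks handles element 261, and covering 1001 elements takes 8 blocks of 128.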
  • 9. Elementwise matrix addition
  • 10. Elementwise matrix addition: the nested for-loops are replaced with an implicit grid
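How the implicit grid replaces the nested loops can be mimicked in plain C. This is a sketch added for illustration (names invented here), with the per-element body playing the role of one thread's work:

```c
#include <assert.h>

/* One logical thread's work in an elementwise matrix addition kernel:
   each (i, j) pair handles a single element, with a bounds guard. */
void add_element(const float *A, const float *B, float *C, int n, int i, int j)
{
    if (i < n && j < n)
        C[i + j * n] = A[i + j * n] + B[i + j * n];
}

/* The implicit grid: every (i, j) the nested for-loops would visit
   becomes one invocation of the per-element body. */
void add_matrix(const float *A, const float *B, float *C, int n)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            add_element(A, B, C, n, i, j);
}
```

On the GPU, the outer loops disappear: each (i, j) pair is one thread, and the guard handles grids that overshoot the matrix.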
  • 11. Memory model: CUDA exposes all the different types of memory on the GPU
  • 12. Part II: Accelerating MATLAB with CUDA. Case study: initial calculation for solving a sparse matrix in the method proposed by professor Guilherme Sipahi, from IFSC.

    N=1001;
    K(1:N) = rand(1,N);
    g1(1:2*N) = rand(1,2*N);
    k = 1.3;
    tic;
    for i=1:N
      for j=1:N
        M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k);
      end
    end
    matlabTime = toc
    tic;
    M = guilherme_cuda(K,g1);
    cudaTime = toc
    speedup = matlabTime/cudaTime
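For reference, the case-study expression can be checked against a plain C version. This block is an added sketch (the name guilherme_ref is invented here), translating MATLAB's 1-based indices to C's 0-based ones:

```c
#include <assert.h>

/* CPU reference for the case-study matrix. MATLAB's 1-based
   M(i,j) = g1(N+i-j)*(K(i)+k)*(K(j)+k) becomes, with 0-based i and j,
   M[i + j*N] = g1[N+i-j-1]*(K[i]+k)*(K[j]+k). g1 must hold at least
   2*N-1 elements so the shifted index stays in range. */
void guilherme_ref(const float *K, const float *g1, float *M, int n, float k)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            M[i + j * n] = g1[n + i - j - 1] * (K[i] + k) * (K[j] + k);
}
```

A reference like this is useful for validating the GPU result elementwise before timing it.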
  • 13. Results: speedup of 4.77 times using an NVIDIA 8400M with 128 MB

    matlabTime = 10.6880
    cudaTime = 2.2406
    speedup = 4.7701
  • 14. The MEX file structure. The main() function is replaced with mexFunction:

    #include "mex.h"

    void mexFunction(int nlhs, mxArray *plhs[],
                     int nrhs, const mxArray *prhs[])
    {
        /* code that handles the interface and calls the computational function */
        return;
    }

    mexFunction arguments:
    - nlhs: the number of lhs (output) arguments.
    - plhs: pointer to an array which will hold the output data; each element is of type mxArray.
    - nrhs: the number of rhs (input) arguments.
    - prhs: pointer to an array which holds the input data; each element is of type const mxArray.
  • 15. MX functions. The collection of functions used to manipulate mxArrays are called MX-functions, and their names begin with mx. Examples:
    - mxArray creation: mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateString, mxCreateDoubleScalar.
    - Access data members of mxArrays: mxGetPr, mxGetPi, mxGetM, mxGetN.
    - Modify data members: mxSetPr, mxSetPi.
    - Manage mxArray memory: mxMalloc, mxCalloc, mxFree, mxDestroyArray.
  • 16. MEX file for CUDA used in the case study, part 1. Compilation instructions under MATLAB:

    nvmex -f nvmexopts.bat square_me_cuda.cu -IC:\cuda\include -LC:\cuda\lib -lcufft -lcudart

    #include "cuda.h"
    #include "mex.h"

    /* Kernel to compute elements of the array on the GPU */
    __global__ void guilherme_kernel(float* K, float* g1, float* M, int N)
    {
        float k = 1.3f;  /* the slide declares k as int, which truncates 1.3 to 1 */
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        int j = blockIdx.y*blockDim.y + threadIdx.y;
        if (i < N && j < N)
            M[i+j*N] = g1[N+i-j]*(K[i]+k)*(K[j]+k);
    }
  • 17. MEX file for CUDA used in the case study, part 2:

    /* Gateway function */
    void mexFunction(int nlhs, mxArray *plhs[],
                     int nrhs, const mxArray *prhs[])
    {
        int j, m_0, m_1, m_o, n_0, n_1, n_o;
        double *data1, *data2, *data3;
        float *data1f, *data2f, *data3f;
        float *data1f_gpu, *data2f_gpu, *data3f_gpu;
        mxClassID category;

        if (nrhs != (nlhs+1))
            mexErrMsgTxt("The number of input and output arguments must be the same.");

        /* Find the dimensions of the data */
        m_0 = mxGetM(prhs[0]);
        n_0 = mxGetN(prhs[0]);

        /* Create an input data array on the GPU */
        cudaMalloc((void **) &data1f_gpu, sizeof(float)*m_0*n_0);

        /* Retrieve the input data */
        data1 = mxGetPr(prhs[0]);

        /* Check if the input array is single or double precision */
        category = mxGetClassID(prhs[0]);
        if (category == mxSINGLE_CLASS)
        {
            /* The input array is single precision, it can be sent directly to the card */
            cudaMemcpy(data1f_gpu, data1, sizeof(float)*m_0*n_0, cudaMemcpyHostToDevice);
        }
        else
        {
            /* Double-precision input must be narrowed to single before upload;
               this branch is cut off in the slide as scraped (note data1f is
               declared above but otherwise unused) */
            data1f = (float *) mxMalloc(sizeof(float)*m_0*n_0);
            for (j = 0; j < m_0*n_0; j++)
                data1f[j] = (float) data1[j];
            cudaMemcpy(data1f_gpu, data1f, sizeof(float)*m_0*n_0, cudaMemcpyHostToDevice);
            mxFree(data1f);
        }
  • 18. MEX file for CUDA used in the case study, part 3:

    /* Find the dimensions of the second input */
    m_1 = mxGetM(prhs[1]);
    n_1 = mxGetN(prhs[1]);

    /* Create the second input data array on the GPU */
    cudaMalloc((void **) &data2f_gpu, sizeof(float)*m_1*n_1);

    /* Retrieve the input data */
    data2 = mxGetPr(prhs[1]);

    /* Check if the input array is single or double precision */
    category = mxGetClassID(prhs[1]);
    if (category == mxSINGLE_CLASS)
    {
        /* The input array is single precision, it can be sent directly to the card */
        cudaMemcpy(data2f_gpu, data2, sizeof(float)*m_1*n_1, cudaMemcpyHostToDevice);
    }
    else
    {
        /* As with the first input, the double-to-single conversion branch
           is cut off in the slide as scraped */
        data2f = (float *) mxMalloc(sizeof(float)*m_1*n_1);
        for (j = 0; j < m_1*n_1; j++)
            data2f[j] = (float) data2[j];
        cudaMemcpy(data2f_gpu, data2f, sizeof(float)*m_1*n_1, cudaMemcpyHostToDevice);
        mxFree(data2f);
    }

    /* Find the dimensions of the output */
    m_o = n_0;
    n_o = n_1;

    /* Create the output data array on the GPU */
    cudaMalloc((void **) &data3f_gpu, sizeof(float)*m_o*n_o);
  • 19. MEX file for CUDA used in the case study, part 4:

    /* Compute execution configuration using 128 threads per block */
    dim3 dimBlock(128);
    dim3 dimGrid((m_o*n_o)/dimBlock.x);
    if ((n_o*m_o) % 128 != 0) dimGrid.x += 1;

    /* Call function on GPU. Note: this 1-D launch leaves blockIdx.y and
       threadIdx.y at zero, so the kernel's j index never advances; the
       kernel as written expects a 2-D configuration over (n_o, m_o) */
    guilherme_kernel<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, data3f_gpu, n_o*m_o);

    data3f = (float *) mxMalloc(sizeof(float)*m_o*n_o);

    /* Copy result back to host */
    cudaMemcpy(data3f, data3f_gpu, sizeof(float)*n_o*m_o, cudaMemcpyDeviceToHost);

    /* Create an mxArray for the output data */
    plhs[0] = mxCreateDoubleMatrix(m_o, n_o, mxREAL);

    /* Create a pointer to the output data */
    data3 = mxGetPr(plhs[0]);

    /* Widen the single-precision result into the double output array;
       this final copy is cut off in the slide as scraped, and without it
       the returned matrix stays at zero */
    for (j = 0; j < m_o*n_o; j++)
        data3[j] = (double) data3f[j];

    mxFree(data3f);
    cudaFree(data1f_gpu);
    cudaFree(data2f_gpu);
    cudaFree(data3f_gpu);
    }
  • 20. Part III: Device options

    GPU Model   Memory          Threads   Price (R$)
    8600 GT     256 MB          3,072     150.00
    8600 GT     512 MB          3,072     300.00
    8800 GT     512 MB          12,288    800.00
    9800 GTX    512 MB (DDR3)   12,288    1,200.00
    9800 GX2    1 GB (DDR3)     24,576    2,500.00
  • 21. References
    - Gokhale, M. et al. Hardware Technologies for High-Performance Data-Intensive Computing. IEEE Computer (ISSN 0018-9162), p. 60, 2008.
    - Lietsch, S. et al. A CUDA-Supported Approach to Remote Rendering. Lecture Notes in Computer Science, 2007.
    - Fujimoto, N. Faster Matrix-Vector Multiplication on GeForce 8800 GTX. IEEE, 2008.
    Book reference:
    - NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 1.1, 2007.
  • 22. Questions? So long and thanks for all the fish!