
VPU Technology & GPGPU Computing



  1. VPU TECHNOLOGY & GPGPU COMPUTING. Arka Ghosh (B.Tech Computer Science & Engineering). Delivered at Seacom Engineering College, CSE Dept, 7th April 2011.
  2. What Is VPU? VPU stands for Visual Processing Unit; it is more generally known as the Graphics Processing Unit, or GPU. The GPU is a MASSIVELY PARALLEL and MASSIVELY MULTITHREADED microprocessor.
     Why GPU?
     • The GPU is used for high-performance computing.
     • Long ago the GPU's job was to offload and accelerate graphics rendering for the CPU, but nowadays the scene has changed: the GPU can work like a CPU, and in some complex computational cases it beats the CPU.
     GPU solutions: we can get a GPU in two forms.
     1. Integrated GPU: integrated on the motherboard chipset. It has low memory bandwidth, and its latency is much higher than that of dedicated parts. Example: the NVIDIA 730a chipset provides an 8200GT GPU with a 540 MHz core.
     2. Discrete or dedicated GPU: the most powerful form. It is generally installed in a PCIe or AGP slot on the motherboard and has its own memory module. Example: the ATI Radeon HD 5970 X2 has a compute power of 4.64 TFLOPS, with 3200 stream processors and a 1 GHz core.
     Hybrid solutions:
     • NVIDIA SLI
     • ATI Radeon CrossFireX
     © Arka Ghosh 2011
  3. What is PPU? PPU is the Physics Processing Unit, which is specialized for the calculation of rigid body dynamics, soft body dynamics, collision detection, fluid dynamics, hair and clothing simulation, finite element analysis, and fracturing of objects.
     • The main leader among PPUs is the AGEIA PhysX.
     • It consists of a general-purpose RISC core controlling an array of custom SIMD floating-point VLIW processors working in local banked memories, with a switch fabric to manage transfers between them. There is no cache hierarchy as in a CPU or GPU.
     GPUs vs PPUs:
     • The drive toward GPGPU is making GPUs more and more suitable for the job of a PPU.
     Ultimate fate of the GPU:
     1. Intel's LARRABEE
     2. AMD's FUSION
  4. -:INTO THE ARCHITECTURE:-
     Use of SPM:
     • SPM, or SCRATCHPAD MEMORY, is a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. In reference to a microprocessor ("CPU"), scratchpad refers to a special high-speed memory circuit used to hold small items of data for rapid retrieval.
     • Example: NVIDIA's 8800 GPU running under CUDA provides 16 KiB of scratchpad per thread bundle when used for GPGPU tasks.
     Stream processing:
     • The stream processing paradigm simplifies parallel software and hardware by restricting the parallel computation that can be performed.
     1. Uniform streaming: one kernel function is applied to every element of the stream.
     Applications need:
     • Compute intensity
     • Data parallelism
     • Data locality
     Conventional, sequential paradigm (one element per iteration):
         for (int i = 0; i < 100 * 4; i++)
             result[i] = source0[i] + source1[i];
     Parallel SIMD paradigm with packed registers (SWAR), one 4-wide vector per iteration:
         for (int el = 0; el < 100; el++) // for each vector
             vector_sum(result[el], source0[el], source1[el]);
  5. Graphics Pipeline
     • The graphics pipeline typically accepts some representation of a three-dimensional scene as input and produces a 2D raster image as output. OpenGL and Direct3D are two notable graphics pipeline models accepted as widespread industry standards.
     Stages of the graphics pipeline:
     1. Transformation
     2. Per-vertex lighting
     3. Viewing transformation
     4. Primitives generation
     5. Projection transformation
     6. Clipping
     7. Viewport transformation
     8. Scan conversion or rasterization
     9. Texturing, fragment shading
     10. Display
     Shaders:
     • Shaders are used to program the GPU's programmable rendering pipeline, which has mostly superseded the fixed-function pipeline that allowed only common geometry-transformation and pixel-shading functions; with shaders, customized effects can be used.
     Types of shader:
     • Vertex shaders
     • Pixel shaders
     • Geometry shaders
     Usefulness of shaders:
     1. Simplified graphics processing unit pipeline
     2. Parallel processing
     Programming shaders:
     • We can program shaders using OpenGL's GLSL, NVIDIA Cg, and Microsoft HLSL.
  6. GPU Cluster
     What is a cluster? A GPU cluster is a computer cluster in which each node contains a GPU.
     1. Homogeneous (the same GPU model in every node)
     2. Heterogeneous (different GPU models across nodes)
     Components:
     Hardware (other): interconnect
     Software:
     1. Operating system
     2. GPU driver for each type of GPU present in each cluster node
     3. Clustering API (such as the Message Passing Interface, MPI)
     • Algorithm mapping
     GPU switching:
     • Means switching from one cluster node to another.
     • Windows switching
     • Linux switching
  7. What Is GPGPU? GPGPU stands for general-purpose computing on graphics processing units: using the GPU as a CPU is GPGPU computing.
     NVIDIA CUDA:
     • It is a GPGPU computing architecture.
     • It provides a heterogeneous computing environment.
     Why GPU computing?
     • To achieve high-performance computing
     • Minimize error
     • Low power consumption (GO GREEN)
     (Slide image headline: "NVIDIA flexes Tesla muscle")
  8. CUDA Kernels and Threads
     • Parallel portions of an application are executed on the device as kernels
     • One kernel is executed at a time
     • Many threads execute each kernel
     Differences between CUDA and CPU threads:
     • CUDA threads are extremely lightweight
     • CUDA uses thousands of threads to achieve efficiency; multi-core CPUs can use only a few
     Definitions:
     • Device = GPU
     • Host = CPU
     • Kernel = function that runs on the device
     Data movement example:
         #include <assert.h>
         #include <stdlib.h>
         #include <cuda_runtime.h>

         int main(void) {
             float *a_h, *b_h; // host data
             float *a_d, *b_d; // device data
             int N = 14, nBytes, i;
             nBytes = N * sizeof(float);
             a_h = (float *)malloc(nBytes);
             b_h = (float *)malloc(nBytes);
             cudaMalloc((void **)&a_d, nBytes);
             cudaMalloc((void **)&b_d, nBytes);
             for (i = 0; i < N; i++)
                 a_h[i] = 100.f + i;
             cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);   // host -> device
             cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); // device -> device
             cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);   // device -> host
             for (i = 0; i < N; i++)
                 assert(a_h[i] == b_h[i]);
             free(a_h); free(b_h);
             cudaFree(a_d); cudaFree(b_d);
             return 0;
         }
     © Arka Ghosh 2011
  9. 10-Series Architecture
     • 240 thread processors execute kernel threads
     • 30 multiprocessors, each containing 8 thread processors and one double-precision unit
     • Shared memory enables thread cooperation
  10. Execution Model (software to hardware mapping)
     • Thread -> thread processor: threads are executed by thread processors.
     • Thread block -> multiprocessor: thread blocks are executed on multiprocessors, and thread blocks do not migrate. Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file).
     • Grid -> device: a kernel is launched as a grid of thread blocks. Only one kernel can execute on a device at one time.
  11. Tesla Architecture (diagram)
  12. GigaThread Hardware Thread Scheduler
     • Concurrent kernel execution plus faster context switching.
     • (Timeline diagram: under serial kernel execution, kernels 1 through 5 run one after another; under parallel kernel execution, independent kernels overlap in time.)
  13. EXAMPLE: MATLAB code for a simple FFT. Device: nVidia Quadro FX 5200 x2.
     CPU (host) mode:
         clear all;
         t1 = cputime;
         x = rand(2^20, 1);
         f = fft(x);
         t2 = cputime;
         t3 = t2 - t1;    % here t3 = 0.4056
     GPU mode:
         clear all;
         t1 = cputime;
         x = rand(2^20, 1);
         gx = gpuArray(x);
         f = fft(gx);
         t2 = cputime;
         t3 = t2 - t1;    % here t3 = 0.006056
     MATLAB code for a simple ANN:
         clear all;
         t1 = cputime;
         x = rand(50); y = rand(50); z = rand(50);
         a = 10; b = 20; c = 30; d = 40;
         f = a*(x^2) + b*(x*y) + c*(y^3) + d*(z^4);
         net = feedforwardnet(800);
         net = trainlm(net, x, f);
         t2 = cputime;
         t3 = t2 - t1;
     For the CPU, t3 = 250.2154; for the GPU, t3 = 122.25. So for this network the GPU run is about 2.05x as fast as the CPU run (250.2154 / 122.25 is roughly 2.05).
  14. CONCLUSION
     • C for the GPU
     • Multi-GPU computing
     • Massively multithreaded computing architecture
     • Compatible with industry-standard architectures
     Where is GPGPU used?
     • MIT, for educational and scientific research purposes
     • Stanford University, for educational and scientific research purposes
     • NCSA (National Center for Supercomputing Applications)
     • NASA
     • Machine learning and AI
     • Machine vision (mainly robot vision)
     • Tablets
  15. Acknowledgement
     • Mriganka Chakraborty (Prof., Seacom Engineering College)
     • Saibal Chakraborty
     • Dr. Nicolas Pinto, Prof. of MIT, Advanced Supercomputing Dept
     • T. Halfhill, NVIDIA Corp Developer Guide
     • GOOGLE
  16. THANK YOU