Parallel Vision by GPGPU/CUDA

Academic talk made by Yuan-Kai Wang

Published in: Education, Technology


  1. Parallel Vision with GPGPU/CUDA (平行視覺與GPGPU/CUDA)
     Yuan-Kai Wang (王元凱), Department of Electrical Engineering, Fu Jen Catholic University (輔仁大學電機工程系)
     Email: ykwang@mail.fju.edu.tw
     URL: http://www.ykwang.tw
     2011/10/07. This work is licensed under the Creative Commons Attribution 3.0 Taiwan license.
  2. What About This Talk
     • The multicore era: it's time for parallel computing
     • GPGPU/CUDA: GPGPU architecture; parallel programming with CUDA
     • Some examples: image restoration (Retinex); feature extraction (SIFT); video cloud computing
  3. 1. The Multicore Era for Computer Vision
     • Paradigm shift from the clock speed race to the multicore race
     • Some examples of multicore
  4. Multicore Computing
     • What is multicore? Several processors that used to be separate chips are combined on a single chip
     • Multicore computing is inevitable
  5. Moore's Law
     • In 1965, Gordon Moore (Intel co-founder) predicted that the number of transistors on an IC would keep doubling at a regular interval, popularly quoted as every 18 months
     • The well-known form of the law: computer performance doubles every 18 months (more transistors → more performance)
     • Intel's CPUs followed the prediction for about 40 years
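
As a rough worked example of what the 18-month doubling quoted on this slide implies over the 40 years it mentions (the symbols $t$, $T$ and $N(t)$ are introduced here for illustration and do not appear on the slide):

$$ \frac{N(t)}{N(0)} = 2^{\,t/T}, \qquad T = 1.5\ \text{years} $$

$$ t = 40\ \text{years}: \quad 2^{40/1.5} = 2^{26.67} \approx 1.1 \times 10^{8} $$

Under that assumption the transistor count grows by roughly eight orders of magnitude, which is why even a small change in the doubling period matters so much.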
  6. Review of Moore's Law
     • The number of transistors on a chip did increase as predicted
  7. Problems
     • More transistors call for a higher clock frequency
     • Higher frequency means higher power consumption
     • This led to the clock speed race
     • But clock speeds have stalled at about 4 GHz
     • Moore's law breaks down
  8. Paradigm Shift from 2000
     • General-purpose multicore comes of age
     • Chip companies race to create multicore processors
       • CPU: Intel Core Duo, quad-core, ...
       • DSP: TI DaVinci
       • GPU: nVidia GeForce/Tesla
       • ...
  9. The Multicore Evolution
     • From a large mono-core to multiple lightweight cores
     • Pentium processor: optimized for a single thread
     • Core Duo
     • In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution
  10. Moore's Law Needs Multicore
      • A single core can no longer keep up with Moore's law
      • Multicore can keep up with Moore's law if a parallel programming model exists
      • (Figure: performance versus time for single core and multicore)
  11. Two Architectures for Multicore
      • Symmetric multiprocessing (SMP): homogeneous computing
        • Multicore CPU, GPGPU, multicore DSP
      • Asymmetric multiprocessing (AMP): heterogeneous computing
        • CPU+GPGPU, CPU+FPGA, CPU+DSP
  12. Multicore CPU (1/2)
      • Two or more CPU cores on a chip
      • Example: Intel Core i7
      • One processor with multiple execution cores
  13. Multicore CPU (2/2)
      • Windows Task Manager (工作管理員) screenshots: two cores and eight cores
  14. GPGPU (1/2)
      • GPU (Graphics Processing Unit)
        • The processor on a graphics card that speeds up 3D graphics
        • Gaming is a major application
      • GPGPU: General-Purpose GPU
        • General-purpose computation on a GPU, in applications other than 3D graphics
  15. GPGPU (2/2)
      • A GPGPU has more cores than a CPU: 120 ~ 512 cores
      • A GPGPU offers higher peak throughput than a multicore CPU
      • Vendors: nVidia, ATI, Intel, AMD
  16. Computer Vision Needs High Performance Computing
      • A CV example: video processing
        • Intelligent video surveillance
      • Its complexity is high
        • One video: 10 megapixels per frame, 30 fps, 100 flops per pixel → 30 gigaflops per video
        • Massive data processing
        • Intensive computation
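
The 30-gigaflops figure above is just the product of the three numbers on the slide. A minimal C sketch of that arithmetic follows; the function name `required_flops` is made up for illustration and is not part of the talk.

```c
#include <assert.h>

/* Floating-point operations per second needed to process one video stream.
   The inputs are the three quantities quoted on the slide; the function
   name is hypothetical. */
double required_flops(double pixels_per_frame, double frames_per_second,
                      double flops_per_pixel)
{
    assert(pixels_per_frame >= 0.0 && frames_per_second >= 0.0 &&
           flops_per_pixel >= 0.0);
    return pixels_per_frame * frames_per_second * flops_per_pixel;
}
```

Calling `required_flops(10e6, 30.0, 100.0)` reproduces the slide's estimate of 3 x 10^10 flops per second for a single stream; a surveillance site with many cameras multiplies that again.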
  17. Approaches for HPC
      • Cluster/distributed computing
        • Supercomputer
        • MapReduce (Google; cloud computing)
        • MPI
      • Multi-processing computing
        • Multicore CPU: programming with multithreading
        • FPGA/DSP
        • GPGPU: programming with CUDA
  18. However
      • Multicore is not a simple solution for upgrading performance
      • The transition from single core to multicore will be held back by software
      • We are not ready to face the software programming challenges
  19. Multicore Demands Threading
  20. 2. GPGPU and CUDA
      • GPGPU hardware
      • Programming with CUDA
  21. Why GPGPU
      • A GPGPU is many-core (> 100 cores)
      • Suitable for intensive parallel computing
      • GPGPU vs. CPU
        • Calculation: 367 GFLOPS vs. 32 GFLOPS
        • Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  22. GPGPU Vendors
      • NVIDIA
      • ATI
      • Intel
      • AMD
      • …
  23. Hardware View
      • PC-based
      • GPGPU card as a coprocessor
      • From PC to PSC: Personal Super-Computer
  24. Applications of GPGPU
      • http://developer.nvidia.com/category/zone/cuda-zone
  25. Two New GPGPUs from nVidia
      • GT200: GTX 260/280, Quadro FX 5800, Tesla 1060
      • Fermi: Tesla 2060
      • (Figure: the CPU (host) is multicore, with a few ALUs, a large control unit and cache, and DRAM; the GPU (device) is many-core, with many ALUs and DRAM)
  26. nVidia GPGPU Architecture
      • SM/SP (streaming multiprocessor/streaming processor) + shared memory + DRAM
  27. Memory Hierarchy
      • On-chip memory
        • Registers
        • Shared memory
        • Constant memory and texture memory (stored in device DRAM, read through on-chip caches)
      • Off-chip memory
        • Local memory
        • Global memory
  28. Parallel Computing
      • Serial computing
      • Parallel computing on GPGPU cores
  29. Parallel Programming
      • Much code is written in C/C++/Java, especially algorithmic programs
      • Can we write GPGPU parallel programs in C/C++/Java?
      • However, C/C++ is sequential
        • The three control structures of C/C++/Java: sequence, selection, repetition
  30. Multi-threading
      • Multi-threading is the most important technique for parallel programming
      • Some techniques are already established
        • Pthreads, Win32 threads, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
      • New techniques
        • CUDA, OpenCL, ...
  31. Parallel Programming in Sequential Language
      • Do we need to learn new languages for multi-threading? No
      • Write multi-threaded code in C/C++
        • Add functions/directives to C/C++ for multi-threading
      • That is how current solutions work
        • Pthreads, Win32 threads, OpenMP, MPI, CUDA, OpenCL, ...
  32. CUDA
      • CUDA: Compute Unified Device Architecture
      • Parallel programming for nVidia's GPGPUs
      • Uses the C/C++ language
        • Java, Fortran, and MATLAB can also be used
      • When a CUDA program executes, the GPU operates as a coprocessor to the main CPU
  33. CUDA Hardware Environment: CPU+GPU
      • CPU: organizes, interprets, and communicates information
      • GPU: handles the core processing on large quantities of parallel information
      • (Diagram: CPU connected to GPU over PCI-E)
      • Compute-intensive portions of an application that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU
  34. CUDA Software Stack
  35. Processing Flow on CUDA (between CPU with main memory and GPU with GPU memory)
      1. Allocate device memory
      2. Copy the data from main memory to GPU memory
      3. The CPU instructs the GPU to start processing
      4. Execute in parallel on each core
      5. Copy the result from GPU memory back to main memory
      6. Release device memory
  36. Programming with Memory Hierarchy
      • Locality principle
        • Temporal locality
        • Spatial locality
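
To make spatial locality concrete, here is a small CPU-side sketch in plain C. It is an illustration added here, not code from the talk, and the function names and the row-major image layout are assumptions. Both functions compute the same sum, but the first walks memory in address order while the second jumps by one row per step.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a row-major image with the inner loop over columns, so consecutive
   iterations touch adjacent addresses (good spatial locality). */
long sum_row_major(const unsigned char *img, size_t height, size_t width)
{
    assert(img != NULL);
    long total = 0;
    for (size_t i = 0; i < height; i++)
        for (size_t j = 0; j < width; j++)
            total += img[i * width + j];
    return total;
}

/* Same result, but the inner loop strides by `width` bytes, so successive
   accesses tend to fall on different cache lines (poor spatial locality). */
long sum_column_major(const unsigned char *img, size_t height, size_t width)
{
    assert(img != NULL);
    long total = 0;
    for (size_t j = 0; j < width; j++)
        for (size_t i = 0; i < height; i++)
            total += img[i * width + j];
    return total;
}
```

On large images the first version is usually noticeably faster on a cached CPU, although the exact gap depends on the hardware. The same reasoning about access order carries over to choosing between the GPU memory types.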
  37. Example - Hello World (1/3)
      (Diagram: the host holds src and h_hello; the device holds d_hello1 and d_hello2)

          // Needs <stdio.h> and <stdlib.h>; the kernel hello() from part 2/3
          // must be defined before main().
          int main()
          {
              char src[12] = "Hello World";
              char h_hello[12];
              char* d_hello1;
              char* d_hello2;
              cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
              cudaMalloc((void**) &d_hello2, sizeof(char) * 12);
              cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);
              hello<<<1,1>>>(d_hello1, d_hello2);   // call the kernel function

  38. Example - Hello World (2/3)
      • Kernel function

          __global__ void hello(char* hello1, char* hello2)
          {
              int k;
              for (k = 0; hello1[k] != 0; k++) {
                  hello2[k] = hello1[k];
              }
              hello2[k] = 0;   // also copy the terminating null so printf stops correctly
          }

      • No parallel processing in this example
  39. Example - Hello World (3/3)

              cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);
              printf("%s\n", h_hello);
              cudaFree(d_hello1);
              cudaFree(d_hello2);
              system("pause");
              return 0;
          }

      • Result: (screenshot of the program output)
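
The kernel above runs in a single thread. As a hedged illustration of how the same copy could be split so that each thread handles one character, here is a CPU-only emulation in plain C. The names `hello_one_thread` and `hello_spmd` are invented for this sketch, and the loop merely stands in for a launch with `n` threads; in a real CUDA kernel `tid` would come from the built-in `threadIdx.x`.

```c
#include <assert.h>
#include <stddef.h>

/* What one thread would do: copy exactly one character.
   `tid` plays the role of threadIdx.x in a real CUDA kernel. */
static void hello_one_thread(size_t tid, const char *src, char *dst)
{
    dst[tid] = src[tid];
}

/* CPU stand-in for a launch with `n` threads: every "thread" runs the same
   function on a different index. On a GPU these iterations could run
   concurrently because they do not depend on each other. */
void hello_spmd(const char *src, char *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t tid = 0; tid < n; tid++)
        hello_one_thread(tid, src, dst);
}
```

With `n = 12` the terminating null is copied like any other character, so no separate termination step is needed. The point of the sketch is only the shape of the work: one small, independent task per thread.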
  40. Parallelization
      • Multicore/multi-threading
      • Data parallelization
        • Data distribution
        • Parallel convolution
        • Reduction algorithm
        • Amdahl's law
      • Memory hierarchy management
        • Locality principle: a program accesses a relatively small portion of the address space at any instant of time
  41. Develop Multi-thread Program
      • Identify parallelism: analyze the algorithm
      • Express parallelism: write parallel code
      • Validate parallelism: debug and verify the parallel code
      • Optimize parallelism: enhance parallel performance
  42. 3. Image Restoration (Retinex) by CUDA
  43. Image Restoration
      • Restore and enhance an image
      • Its complexity is high for large images: O(N^2) ~ O(N^3)
      • (Figure: original image and restored image)
  44. Algorithms for Image Restoration
      • Wiener filter
      • Histogram-based approaches
        • Histogram equalization, histogram modification, …
      • Retinex
        • Path-based Retinex
        • Recursive Retinex
        • Center/surround Retinex
          • Has no iterative process and is suitable for parallelization
          • Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]
  45. MSRCR Algorithm

      $$ R_i(x,y) = r_i(x,y) \sum_{k=1}^{n} W_k \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}, \qquad i \in \{R, G, B\} $$

      $$ r_i(x,y) = \beta \log\left[ \frac{\alpha\, I_i(x,y)}{\sum_{j=1}^{N} I_j(x,y)} \right] $$

      • $R_i(x,y)$: the MSRCR output
      • $I_i(x,y)$: the original image distribution in the $i$th spectral band
      • $F_k(x,y)$: the $k$th Gaussian surround function
      • $*$: the convolution operation
      • $W_k$: the weight of the $k$th scale
      • $r_i(x,y)$: the color restoration factor in the $i$th spectral band
      • $N$: the number of spectral bands
      • $\beta$: the gain constant
      • $\alpha$: controls the strength of the nonlinearity
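
As a small worked special case, added here for illustration and not taken from the slide, set $n = 1$ and $W_1 = 1$. The multi-scale sum then collapses to a single center/surround term:

$$ R_i(x,y) = r_i(x,y) \left\{ \log I_i(x,y) - \log\left[ F_1(x,y) * I_i(x,y) \right] \right\} = r_i(x,y)\, \log \frac{I_i(x,y)}{F_1(x,y) * I_i(x,y)} $$

If we additionally assume the surround function is normalized so that it integrates to 1, then in a region where $I_i$ is constant we get $F_1 * I_i = I_i$ and the output is 0. The algorithm therefore responds to local contrast rather than to absolute brightness. Every pixel's output depends only on the input image and its blurred copies, which is why the per-pixel work is independent across pixels.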
  46. Decompose the Problem
      • Two basic approaches to partitioning computational work
        • Domain decomposition: partition the data used in solving the problem
        • Function decomposition: partition the jobs (functions) that make up the overall work (problem)
      • (Figure labels: GPGPU and CPU cooperate)
  47. Multi-Threading
      • A program running in serial versus in parallel
      • http://en.wikipedia.org/wiki/Thread_(computer_science)
  48. Domain Decomposition (1/3)
      • An image example
        • It is 2D data
        • Three popular ways to partition it
  49. Domain Decomposition (2/3)
      • Domain data are usually processed by a loop

          for (i = 0; i < height; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

      • (Figure: X-ray image of a circuit board; original image (img1) and enhanced image (img2), with row index i and column index j)
  50. Domain Decomposition (3/3)
      • A three-block partition example, as done with OpenMP or CUDA (SPMD)

          // Thread 1
          for (i = 0; i < height/3; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

          // Thread 2
          for (i = height/3; i < height*2/3; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

          // Thread 3
          for (i = height*2/3; i < height; i++)
              for (j = 0; j < width; j++)
                  img2[i][j] = RemoveNoise(img1[i][j]);

      • (Figure: fork(threads) into subdomain 1 (i = 0..3), subdomain 2 (i = 4..7), and subdomain 3 (i = 8..11), followed by join(barrier))
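
The three hand-written loops above generalize to any number of threads. The following C sketch shows one way to compute the row range each thread owns. The type `RowRange` and the function `rows_for_thread` are hypothetical names introduced here, not part of the talk.

```c
#include <assert.h>

/* Half-open row range [begin, end) owned by one thread. */
typedef struct {
    int begin;
    int end;
} RowRange;

/* Split `height` rows into `num_threads` contiguous blocks. The same integer
   arithmetic as the slide (height/3, height*2/3) is used, so the blocks
   cover every row exactly once even when height is not a multiple of
   num_threads. */
RowRange rows_for_thread(int height, int num_threads, int thread_id)
{
    assert(height >= 0 && num_threads > 0);
    assert(thread_id >= 0 && thread_id < num_threads);
    RowRange r;
    r.begin = (int)((long long)height * thread_id / num_threads);
    r.end = (int)((long long)height * (thread_id + 1) / num_threads);
    return r;
}
```

For `height = 12` and three threads this gives rows 0..3, 4..7 and 8..11, matching the subdomains in the figure. Each thread would then run the slide's inner loops over its own range only.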
  51. The Method
      • Processing flow: copy data from CPU to GPGPU → Gaussian blur → log-domain processing → normalization → histogram stretching → copy data from GPGPU to CPU
      • CPU: Intel Core 2, 2 cores (3.0 GHz)
      • GPGPU: Tesla C1060, 240 SPs (1.296 GHz)
  52. Parallelization by GPGPU
      • Multicore/multi-threading
        • Tesla C1060: 240 SPs (streaming processors)
        • CUDA: thread, block, grid
      • Data parallelization
        • Parallel convolution (figure: the M pixels of the image are distributed across processing elements, one pixel per PE i)
        • Parallel convolution (figure: partial results are summed pairwise over time steps t0..t5 on PEs 0..7: A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7); then A(0)+A(1)+A(2)+A(3) and A(4)+A(5)+A(6)+A(7); then the final sum)
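
The pairwise summation pattern in the figure (A(0)+A(1), then A(0)+...+A(3), and so on) can be sketched on the CPU as follows. This is an illustrative emulation in plain C, not the CUDA code used in the talk; the function name `tree_sum` is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* In-place pairwise (tree) sum. After the call a[0] holds the total.
   Each pass of the outer loop corresponds to one time step in the figure.
   The inner-loop iterations of one pass touch disjoint elements, so on a
   GPU they could be carried out by different threads at the same time. */
long tree_sum(long *a, size_t n)
{
    assert(a != NULL && n > 0);
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    return a[0];
}
```

For eight elements the sum finishes in three passes instead of seven sequential additions, which is the log2(n) depth that makes reductions attractive on many-core hardware. The input array is overwritten with partial sums, so callers that still need the data should pass a copy.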
  53. Our Memory Hierarchy (memory types and stages paired as laid out on the slide)
      • Texture memory: parallel Gaussian blur
      • Constant memory: parallel log-domain processing
      • Global memory: parallel normalization
      • Shared memory: parallel histogram stretching
  54. Experimental Results (1/2)
      • (Figure: original images, CPU results, GPGPU results)
  55. Experimental Results (2/2)
      • (Figure: original images, CPU results, GPGPU results)
  56. GPGPU Speedup over CPU
      • (Plot: speedup versus M on logarithmic axes, with curves Speedup_N, Speedup_P, and Speedup_NPP; annotations mark a 74x speedup and a 2x speedup)
      • Ideal speedup: 240 × (1.296 GHz / 3 GHz) ≈ 103
      • NPP: NVIDIA Performance Primitives
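
The ideal-speedup estimate on this slide scales the number of GPU cores by the ratio of clock rates. A minimal C sketch of that formula follows, assuming (as the slide's arithmetic implies) that the baseline is a single CPU core and that every GPU core does as much work per cycle as the CPU core. The function name `ideal_speedup` is hypothetical.

```c
#include <assert.h>

/* Upper-bound speedup of a many-core GPU over ONE CPU core, assuming equal
   work per clock cycle on both kinds of core. This is a simplification made
   for the estimate, not a measured property of the hardware. */
double ideal_speedup(int gpu_cores, double gpu_clock_ghz, double cpu_clock_ghz)
{
    assert(gpu_cores > 0 && gpu_clock_ghz > 0.0 && cpu_clock_ghz > 0.0);
    return gpu_cores * (gpu_clock_ghz / cpu_clock_ghz);
}
```

`ideal_speedup(240, 1.296, 3.0)` evaluates to 103.68, which the slide quotes as 103. The measured 74x on the plot is therefore roughly 70 percent of this bound.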
  57. 4. Feature Extraction (SIFT) by CUDA
  58. What Is SIFT
      • SIFT: Scale Invariant Feature Transform
      • Invariance of feature points
        • Translation
        • Rotation
        • Scale
  59. Applications of SIFT
      • Object recognition/tracking
      • Image retrieval
      • Autostitch
  60. Parallelize SIFT by GPGPU
      • CPU: Intel Q9400, quad-core (2.66 GHz)
      • GPGPU: GeForce GTS 250, 128 SPs (1.836 GHz)
  61. Experimental Results
      • (Figure: CPU results and GPU results)
  62. Execution Time
      • CPU: 10 seconds on average
      • GPGPU: 0.8 seconds on average
      • (Plot: execution time in ms)
  63. Speedup
      • 13x speedup on average
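  64. 5. Video Cloud Computing
      • Wide-area surveillance of outdoor sites and campuses
        • Large number of cameras
        • System stability is a challenge
      • Technical features
        • Covers both cloud computing and embedded systems
        • Central-control display that integrates electronic maps, events, and video summaries
        • Detection techniques that overcome outdoor weather effects
  65. A Campus Monitoring System
      • (Map labels: central control room demonstration area; human-event demonstration area; vehicle-event demonstration area)
  66. I. Human-Event Technology Demonstrations
      • (Map labels: Electronics and Information Research Building; NCTU campus; campus ring road for scooters; Science Park)
      • Wall-climbing and restricted-area intrusion detection
      • Embedded PTZ camera tracking
      • Camera anomaly detection
  67. 1.1 Wall-Climbing and Restricted-Area Intrusion Detection
      • Monitors the wall along the scooter ring road behind the Electronics and Information Research Building, where the campus adjoins the Science Park, detects whether anyone climbs over the wall to intrude, and sends an alarm.
  68. 1.2 Embedded PTZ Camera Tracking
      • The initial position of the tracked object is obtained from the front-end fixed surveillance system.
      • An embedded platform tracks the moving object and controls the PTZ camera lens.
  69. 1.3 Camera Anomaly Detection
      • A cloud platform runs camera anomaly detection simultaneously on multiple cameras along the ring road and at the Electronics and Information Research Building. (GPGPU)
      • Deliberate sabotage of a camera at the building is simulated; it is detected and raises an alarm.
      • False alarms from ring-road cameras with heavy pedestrian traffic are effectively suppressed.
  70. II. Vehicle-Event Technology Demonstrations
      • Embedded illegal-parking detection (with human feature detection in dynamic scenes)
      • Outdoor parking lot vacancy detection
  71. 2.1 Embedded Illegal-Parking Detection
      • An embedded platform detects illegally parked vehicles and drives a PTZ camera to capture close-up images of the event.
      • Face detection on multi-resolution image sequences is used to stop the PTZ camera's close-up tracking. (GPGPU)
  72. 2.2 Outdoor Parking Lot Vacancy Detection
      • Detects the status of parking spaces in a large parking lot and displays the locations of vacant spaces.
      • When a vehicle has parked in any vacant space, that space is shown as occupied.
  73. III. Central Control Room Technology Demonstrations
      • Central control room of the intelligent community event security surveillance system
      • Electronic-map-based central control room display (central video and management system)
      • Multi-resolution wide-area surveillance
      • Efficient video event retrieval
  74. 3.1 Electronic-Map-Based Central Control Room Display
      • Google Maps is used to integrate all heterogeneous surveillance information.
      • (Figure labels: Video, Event, Geography)
  75. 3.2 Multi-Resolution Wide-Area Surveillance
      • Multi-resolution display combining a large low-resolution view with a small high-resolution view
      • Integrates Google Earth
      • GPGPU hardware acceleration of the image alignment computation
      • (Figure labels: steerable projector; fixed projector)
  76. 3.3 Efficient Video Event Retrieval
      • Lengthy surveillance video is converted into a concise summary video, so that users can review a full day of events from a chosen camera in a short time.
      • (Figure labels: browsing the condensed video; time axis with 3:00 and 5:00; time is compressed by exploiting space)
  77. System Architecture
      • (Diagram; recoverable labels: ring road; Electronics and Information Research Building; legal parking in the parking lot x5; parking lot x8; illegal wall climbing; illegal roadside parking; wall climbing; CMS; HVR; CAD; 3D; 608)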
  78. 6. Conclusions
  79. Issues with Parallelization
      • Good parallel programs
        • Execute correctly
        • With good speedup
      • Ideal speedup by Amdahl's law
        • Speedup = N if you have N cores
      • However, the ideal speedup is not reached in practice
        • Because of parallel overhead, such as data communication, data dependencies, and synchronization
      • Other issues: design overhead
        • No free lunch for software development
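
Amdahl's law makes the gap between the ideal and the achievable speedup explicit: if only a fraction p of the run time can be parallelized, the speedup on N cores is 1 / ((1 - p) + p / N). A small C sketch follows; the function name `amdahl_speedup` is chosen here for illustration.

```c
#include <assert.h>

/* Amdahl's law: speedup on `cores` cores when a fraction
   `parallel_fraction` (between 0 and 1) of the run time is parallelizable.
   Parallel overheads such as communication are NOT modeled here. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores > 0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

With `parallel_fraction = 1.0` the result equals the number of cores, which is the ideal case on the slide. With 90 percent parallel code on 10 cores the speedup is already limited to about 5.3, before any communication or synchronization cost is counted.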
  80. Parallel Computing on GPGPU
      • CUDA can only parallelize code for nVidia's GPGPUs
      • CUDA's programming model
        • Multithreaded
        • SPMD (Single Program Multiple Data)
      • Best-performance CUDA code needs optimization
        • Without optimization, CUDA improves native code by only about 2~3 times
      • Optimization can be achieved through
        • Data parallelism, thread parallelism, data localization
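
In the SPMD model every thread runs the same program and picks its own data element from its block and thread indices. The following plain C sketch shows the index arithmetic usually used for a one-dimensional launch. The function names are invented for this illustration; in real CUDA code the three inputs of `global_index` correspond to the built-ins `blockIdx.x`, `blockDim.x` and `threadIdx.x`.

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block >= n
   (integer ceiling division). */
int blocks_needed(int n, int threads_per_block)
{
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Global element index handled by one thread: the CPU-side equivalent of
   blockIdx.x * blockDim.x + threadIdx.x in a 1D CUDA launch. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 0 && block_dim > 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

Because the last block may be only partly used (1000 elements with 256 threads per block need 4 blocks, that is 1024 threads), a kernel normally checks that its global index is below `n` before touching memory.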
  81. Programming Challenges of CUDA
      • We have to parallelize the algorithm manually
      • We need expertise in
        • Algorithms of image and signal processing
          • Filtering, frequency analysis, compression, feature extraction, recognition, ...
        • Theory, tools, and methodology of parallel computing
          • Communication, synchronization, resource management, load balancing, debugging, ...
  82. GPUs for Multimedia
      • Speedups shown on the slide: 3.5x, 10x, 10x, 26x, 10x
      • Applications shown on the slide
        • PowerDirector7 Ultra
        • CUDA JPEG Decoder
        • DivideFrame GPU Decoder
        • Hyperspectral Image Compression on NVIDIA GPUs
        • GPU Decoder (Vegas/Premiere): Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files
        • Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
  83. GPUs for Computer Vision (1/2) (speedups paired with titles in the order they appear on the slide)
      • 87x: CUDA SURF, a Real-time Implementation for SURF (TU Darmstadt)
      • 26x: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
      • 200x: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
      • 100x: Image Denoising with Bilateral Filter (Wrocław University of Technology)
      • 85x: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
      • 100x: Fast Optical Flow on GPU at Video Rate for Full HD Resolution (Onera)
      • 8x: A Framework for Efficient and Scalable Execution of Domain-specific Templates on GPU (NEC Labs, Berkeley, Purdue)
      • 13x: Accelerating Advanced MRI Reconstructions (University of Illinois)
  84. GPUs for Computer Vision (2/2) (speedups paired with titles in the order they appear on the slide)
      • 20x: GPU for Surveillance
      • 13x: Fast Human Detection with Cascaded Ensembles
      • 109x: Fast Sliding-Window Object Detection
      • 263x: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
      • 300x: Audience Measurement: Real-time Video Analysis for Counting People, Face Detection and Tracking
      • 10x: Real-time Visual Tracker by Stream Processing
      • 45x: A GPU Accelerated Evolutionary Computer Vision System
      • 3x: Canny Edge Detection
  85. The ParLab in Berkeley
      • The Parallel Computing Lab at UC Berkeley: http://parlab.eecs.berkeley.edu
      • The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers.
  86. Multicore Programming Practice (MPP)
      • Goal: write portable C/C++ programs that are "multicore ready" and platform compatible
      • Proposed by the MPP working group of the Multicore Association: http://www.multicore-association.org/workgroup/mpp.php
  87. Special Conference
      • HPEC: High Performance Embedded Computing
        • MIT Lincoln Lab, 1997 ~
  88. The End. Questions are welcome.
