2014/07/17 Parallelize computer vision by GPGPU computing

Published in: Engineering

  1. 1. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Wang, Yuan-Kai (王元凱) Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系) ykwang@mail.fju.edu.tw http://www.ykwang.tw 2014/07/17 Parallelize Computer Vision by GPGPU Computing 1
  2. 2. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. About this Course ❖ Multicore Era for Computer Vision ❖ GPGPU ❖ Parallel Programming (CUDA, OpenCL, Renderscript) ❖ OpenCV Acceleration with GPGPU ❖ Computer Vision Acceleration 2
  3. 3. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 1. Multicore Era for Computer Vision Paradigm shift from Clock Speed Race to Multicore Race 3
  4. 4. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multicore Computing ❖ What Is Multicore • Combine multiple processors (CPU, DSP, GPGPU, FPGA) into a single chip ❖ Multicore computing is inevitable 4
  5. 5. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Moore's Law ❖ In 1965, Gordon Moore (Intel co-founder) predicted • The number of transistors on an IC would double every year (revised in 1975 to every two years) ❖ The well-known version • The performance of computers doubles every 18 months • More transistors → More performance ❖ Intel's CPUs kept pace with the prediction for 40 years 5
  6. 6. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Review of Moore's Law ❖ Transistors in a chip did increase 6 Software enjoys the fruits of hardware's labour.
  7. 7. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Problems ❖ More transistors need high frequency • We entered the Clock Speed Race ❖ But high frequency needs high power consumption • High power consumption → Heat problem • About 4 GHz has been the practical clock-speed limit 7
  8. 8. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Paradigm Shift from 2000 AD ❖ General-purpose multicore comes of age ❖ Chip companies race to create multicore processors • CPU: Intel Core Duo, Quad-core, ARM v7, ... • DSP: TI OMAP, ARM NEON, … • GPU/GPGPU: • nVidia: GeForce/Tesla, Tegra • ARM: Mali-T6x • … 8
  9. 9. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. The Multicore Evolution: from large mono-core to multiple lightweight cores ❖ Pentium processor: optimized for single thread ❖ Core Duo ❖ In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution 9
  10. 10. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Moore’s Law Needs Multicore ❖ Single core cannot fit Moore's law ❖ Multicore can fit Moore's law if a parallel programming model exists ❖ Figure: performance vs. time for single core and multi-core 10
  11. 11. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Two Architectures for Multicore ❖ Symmetric multiprocessing (SMP) • Multicore CPU, GPGPU, DSP multicore • Homogeneous computing ❖ Asymmetric multiprocessing (AMP) • CPU+GPGPU, CPU+FPGA, CPU+DSP • Heterogeneous computing 11
  12. 12. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multicore CPU (1/2) ❖ Two or more CPUs in a chip ❖ Ex.: Intel Core i7 12 Multiple Execution Cores
  13. 13. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multicore CPU (2/2) ❖ Windows Task Manager • Two cores • Eight cores 13
  14. 14. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU (1/2) ❖ GPU (Graphics Processing Unit) • The processor in a graphics card that speeds up 3D graphics • Game playing is a major application ❖ GPGPU: General-Purpose GPU • General-purpose computation using the GPU in applications other than 3D graphics 14
  15. 15. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU (2/2) ❖ GPGPU has more cores than CPU • 120 ~ 3072 cores vs. 2 ~ 8 cores (Many-core vs. Multi-core) ❖ GPGPU is more powerful than multicore CPU ❖ Vendors: • nVidia • Qualcomm (AMD, ATI) • ARM • Intel 15
  16. 16. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 16 It is the Software, Stupid ❖Gary Smith and Daya Nadamuni, Gartner Dataquest, Design Automation Conf., 2006 ❖The biggest problem with SoC design is embedded software development. ❖The next big hurdle is programmability. It's the ability to program these multicore platforms. ❖You can have elegant algorithms, first-pass silicon, and fancy intellectual property. But without software, the product goes nowhere.
  17. 17. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multicore Demands Threading 17
  18. 18. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multicore Demands Threading 18
  19. 19. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What Is Computer Vision 19
  20. 20. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. A Complete Vision System – Video Surveillance as an Example ❖ Pipeline: Video Capture → Image Enhancement → Object/Event Detection → Object Tracking → Object/Event Recognition → Behavior Analysis → Retrieval ❖ Example functions: Imaging, Image/Video Enhancement, Tripwire, Event Detection, Abnormal Detection, Face Recognition, Retrieval 20
  21. 21. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Computer Vision Needs High Performance Computing ❖ A CV example: video processing • Intelligent video surveillance ❖ Its complexity is high • Video (1080p RGB): 1920 × 1080 × 3 ≈ 6 million samples per frame, 30 fps • 100 to 1K floating-point operations per sample • ⇒ 18 to 180 Gflop/s ❖ Massive data processing • Intensive computation 21
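
The estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name `required_gflops` is hypothetical, and it assumes the slide's "6 Mega" figure counts the three RGB samples of each 1920 × 1080 pixel.

```cpp
// Required throughput, in Gflop/s, for a per-sample video algorithm.
// samples per frame = width * height * channels (assumption: RGB, 3 channels).
double required_gflops(int width, int height, int channels,
                       double fps, double flops_per_sample) {
    const double samples_per_frame =
        static_cast<double>(width) * height * channels;
    return samples_per_frame * fps * flops_per_sample / 1e9;
}
```

For instance, `required_gflops(1920, 1080, 3, 30.0, 100.0)` evaluates to roughly 18.7, and raising the cost to 1000 operations per sample gives roughly 187, which matches the slide's 18 to 180 range.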
  22. 22. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. HPC Approaches ❖ Cluster/distributed computing • Hadoop/MapReduce (Google, Cloud Computing) • MPI ❖ Multi-processing computing • Multicore (GPGPU, CPU, FPGA/DSP) • Programming: multi-thread • Windows thread, Pthreads, OpenMP • CUDA, Renderscript, C++ AMP, … Supercomputer 22
  23. 23. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. However ❖ Can CV algorithms speed-up every 18 months with multicore? ❖ Multicore is not a simple solution for upgrading CV algorithm performance • The transition from single core to multicore will be blocked by software • We are not ready to face the software programming challenges • It is the software, stupid. 23
  24. 24. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Software, Threading, and Parallel Computing ❖ Identify parallelism: Analyze algorithm ❖ Express parallelism: Write parallel code ❖ Validate parallelism: Debug & verify parallel code ❖ Optimize parallelism: enhance parallel performance 24
  25. 25. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multi-threading Demands New Programming Skills ❖ Previous multi-threading techniques ❖ Windows thread, pthread, OpenMP, MPI, … ❖ New techniques • CUDA, C++ AMP, OpenCL, Renderscript, OpenACC, Map Reduce, … ❖ Concepts • Race condition, deadlock, • Domain partition, function partition, … 25
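
A race condition is easiest to see on a shared counter. The sketch below is a hedged illustration, not taken from the slides: `count_bright` is a hypothetical helper, and it uses the OpenMP `reduction` clause as one possible fix. Compile with OpenMP enabled (for example `-fopenmp` on GCC or Clang); without it the pragma is ignored and the loop simply runs serially.

```cpp
#include <vector>

// Counts the pixels brighter than `threshold`.
// If every thread incremented one shared counter directly, two threads could
// read the same old value and one increment would be lost (a race condition).
// The reduction clause gives each thread a private counter and adds the
// private counters together when the loop ends.
long count_bright(const std::vector<unsigned char>& img,
                  unsigned char threshold) {
    long count = 0;
    const long n = static_cast<long>(img.size());
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; ++i) {
        if (img[i] > threshold) ++count;
    }
    return count;
}
```

The result is the same whether the loop runs on one thread or many, which is the property a racy version would not guarantee.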
  26. 26. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multicore Programming Practice (MPP) ❖ Goal: Write portable C/C++ programs to be "Multicore ready" and platform compatible • Proposed by a MPP working group in the Multicore Association http://www.multicore-association.org/workgroup/mpp.php 26
  27. 27. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenACC ❖ An organization that develops an API • The API describes a collection of compiler directives • The directives specify loops and regions of code in standard C, C++ and Fortran • These are offloaded from a host CPU to an attached accelerator, including • APUs, GPUs, and many-core coprocessors 27
  28. 28. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. HSA Foundation ❖Heterogeneous System Architecture • Key members: AMD, QUALCOMM, ARM, SAMSUNG, TI ❖System architecture easing efficient use of accelerators, SoCs • Intended to support high-level parallel programming frameworks • OpenCL, C++, C#, OpenMP, Java • Accelerator requirements • Full-system SVM, memory coherency, preemption, user-mode dispatch 28
  29. 29. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. The ParLab in Berkeley ❖ The Parallel Computing Lab. in UC Berkeley http://parlab.eecs.berkeley.edu • The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers. 29
  30. 30. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. HPEC ❖ High Performance Embedded Computing • MIT Lincoln Lab, 1997 ~ 30
  31. 31. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCL ❖ Royalty-free, cross-platform, cross-vendor standard •Targeting: supercomputers → embedded systems → mobile devices ❖Enables programming of diverse compute resources •CPU, GPU, DSP, FPGA … 31
  32. 32. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCL Working Group Members ❖Diverse industry participation – many industry experts ❖NVIDIA is chair, Apple is specification editor 32
  33. 33. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ Vendor, Hardware ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6) 33
  34. 34. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 2. GPGPU PC platform Mobile platform 34
  35. 35. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Why GPGPU ❖ GPGPU has many-core (vs. multi-core) • Suitable for massively parallel computing 35
  36. 36. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU as a Coprocessor Heterogeneous Computing 36
  37. 37. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. PC Platform • Discrete GPUs • GPGPU card as a coprocessor From PC to PSC (Personal Super-Computer) 37 PCIe
  38. 38. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Mobile Platform • Integrated GPUs • GPGPU sub-chip as a coprocessor From mobile phone to mobile personal computer 38 No PCIe GPGPU CPU
  39. 39. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Solutions - nVidia • Compute Architecture: Tesla, Fermi, Kepler, … • PC • GeForce, Quadro • Tesla • 870, 1060, 2070, K40 • Mobile • Tegra: …, 4, K1(192 cores) 39 It’s Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.
  40. 40. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Solutions – Qualcomm/AMD ❖ Qualcomm, AMD, ATI ❖ APU: integrated CPU+GPU ❖ Low energy consumption ❖ PC(AMD): FirePro ❖ Mobile(Snapdragon): ❖ Adreno: 330(32 cores) 40
  41. 41. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Solutions - ARM ❖ Mali ❖ Samsung Exynos, MediaTek ❖ Compute engine after T-600 ❖ Exynos 5 ❖ At most 8 cores (Mali-T678) 41
  42. 42. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Intel – Multicore CPU • PC (Xeon Phi) • Iris Pro GPU • Knights Landing: 60 cores • Knights Corner: 48 CPU cores, PCIe • Mobile • Haswell • Atom 42
  43. 43. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Applications of GPGPU http://developer.nvidia.com/category/zone/cuda-zone 43
  44. 44. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Heterogeneous Architecture ❖Host: CPU ❖Device: GPGPU ❖Notice: memory hierarchy in device 44
  45. 45. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPUs Architecture - nVidia ❖ GT200 • GTX 260/280, Quadro 5800, Tesla 1060 ❖ Fermi • Tesla 2060 ❖ Figure: CPU (host, multicore) with Control, Cache, ALUs and DRAM vs. GPU (device, many-core) with ALUs and DRAM 45
  46. 46. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. nVidia GPGPU Architecture ❖ SM/SP (Streaming multiprocessor/Streaming processor) + Shared memory + DRAM 46
  47. 47. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Memory Hierarchy ❖ On-Chip Memory • Registers • Shared Memory • Constant Memory (on-chip cache; the data resides in device DRAM) • Texture Memory (on-chip cache; the data resides in device DRAM) ❖ Off-Chip Memory • Local Memory • Global Memory 47
  48. 48. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU vs. FPGA ❖GPU: nVidia GeForce GTX 280, GTX580 ❖FPGA: Xilinx Virtex4, Virtex5 A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features, IEEE Transactions on Computers, 2012. 48
  49. 49. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU vs. FPGA ❖GPU: nVidia GeForce 7900 GTX ❖FPGA: Xilinx Virtex-4 Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 2010. 49
  50. 50. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU vs. FPGA vs. Multicore ❖Application: 2-D image convolution GPU: nVidia GeForce 295 GTX FPGA: Altera Stratix III E260 A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, ACM/SIGDA international symposium on FPGA, 2012. 50
  51. 51. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. However, GPGPU May Not Always Improve Speed & Energy 51
  52. 52. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Hardware vs. Software 52 GPGPU nVidia Qualcomm ARM Intel Parallel Programming CUDA OpenCL RenderScript C++ AMP
  53. 53. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) • CUDA, Renderscript, OpenCL, … ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6) 53
  54. 54. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 3. Parallel Programming Multi-threading Programming Languages for Parallels 54
  55. 55. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallel Computing ❖ Serial Computing ❖ Parallel Computing CPU/GPU 55 Core Core Core Core
  56. 56. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallel Programming ❖ Much code is written in C/C++/Java • Especially algorithmic programs ❖ Can we write GPGPU parallel programs in C/C++/Java? ❖ However, C/C++ is sequential • Three control structures of C/C++/Java: sequence, selection, repetition 56
  57. 57. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multi-threading ❖ Multi-threading is the fundamental concept for parallel programming • Some techniques are ready • Pthreads, Win32 thread, OpenMP, MPI, Intel TBB (Threading Building Blocks)... • New techniques • CUDA, OpenCL, Renderscript, OpenACC, C++ AMP, ... 57
  58. 58. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallel Programming Models 58
  59. 59. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallel Programming in Sequential Language ❖ Do we need to learn new languages for multi-threading? • No ❖ Write multi-threaded code in C/C++ • Add functions/directives to C/C++ for multi-threading • That is how current solutions work • pthread, Win32 thread, OpenMP, MPI, CUDA, OpenCL, ... 59
  60. 60. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Decompose the Problem ❖ Two basic approaches to partition computational work • Domain decomposition • Partition the data used in solving the problem • Function decomposition • Partition the jobs (functions) from the overall work (problem) GPGPU CPU Cooperate 60
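
Function decomposition can be sketched by letting two different jobs run on the same data at once. The example below is only an illustration under stated assumptions: `ImageStats` and `image_stats` are hypothetical names, the image is a flat, non-empty vector of integers, and OpenMP sections are one of several ways to express it. Without OpenMP enabled the two blocks simply run one after the other.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

struct ImageStats {
    double mean = 0.0;
    int max = 0;
};

// Function decomposition: the work is split by job (mean vs. maximum),
// not by data. Each section may be executed by a different thread, and
// each section writes a different field, so no synchronization is needed.
// Assumes `img` is not empty.
ImageStats image_stats(const std::vector<int>& img) {
    ImageStats s;
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            s.mean = std::accumulate(img.begin(), img.end(), 0.0) / img.size();
        }
        #pragma omp section
        {
            s.max = *std::max_element(img.begin(), img.end());
        }
    }
    return s;
}
```

Domain decomposition, by contrast, would give every thread the same job on a different slice of `img`.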
  61. 61. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multi-Threading ❖ A program running • In serial • In parallel ❖ Figure source: http://en.wikipedia.org/wiki/Thread_(computer_science) 61
  62. 62. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Domain Decomposition (1/3) ❖An image example • It is 2D data • Three popular partition ways 62
  63. 63. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Domain Decomposition (2/3) ❖Domain data are usually processed by loops • for (i=0; i<height; i++) for (j=0; j<width; j++) img2[i][j] = RemoveNoise(img1[i][j]); ❖ Figure: original image (img1, the X-ray image of a circuit board) and enhanced image (img2), indexed by row i and column j ❖ Related models: SIMD, SPMD, SIMT 63
  64. 64. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Domain Decomposition (3/3) ❖A three-block partition example • // Thread 1 for (i=0; i<height/3; i++) for (j=0; j<width; j++) img2[i][j] = RemoveNoise(img1[i][j]); • // Thread 2 for (i=height/3; i<height*2/3; i++) for (j=0; j<width; j++) img2[i][j] = RemoveNoise(img1[i][j]); • // Thread 3 for (i=height*2/3; i<height; i++) for (j=0; j<width; j++) img2[i][j] = RemoveNoise(img1[i][j]); ❖ Figure: rows i=0 to i=3 form subdomain 1, i=4 to i=7 subdomain 2, and i=8 to i=11 subdomain 3; the threads fork before the loops and join at a barrier (OpenMP, CUDA SPMD) 64
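
The three-block partition can be written once and parameterized by the block index. This is a minimal sketch, assuming OpenMP for the fork and join: the `Image` alias, the clamping body of `RemoveNoise`, and the function `remove_noise_blocks` are hypothetical stand-ins, since the slides do not define them. Without OpenMP enabled the blocks are processed serially with the same result.

```cpp
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Hypothetical stand-in for the slide's RemoveNoise(): clamps to [0, 255].
int RemoveNoise(int pixel) {
    return pixel < 0 ? 0 : (pixel > 255 ? 255 : pixel);
}

// Splits the rows into `blocks` subdomains; each loop iteration handles one
// subdomain and may run on its own thread.
// Assumes img2 already has the same dimensions as img1.
void remove_noise_blocks(const Image& img1, Image& img2, int blocks = 3) {
    const int height = static_cast<int>(img1.size());
    #pragma omp parallel for num_threads(blocks)  // fork: one thread per block
    for (int b = 0; b < blocks; ++b) {
        const int begin = height * b / blocks;        // first row of block b
        const int end = height * (b + 1) / blocks;    // one past its last row
        for (int i = begin; i < end; ++i)
            for (std::size_t j = 0; j < img1[i].size(); ++j)
                img2[i][j] = RemoveNoise(img1[i][j]);
    }  // join: implicit barrier at the end of the parallel loop
}
```

Because every thread writes a disjoint set of rows in `img2`, no locking is required; this is what makes the domain decomposition of a per-pixel filter straightforward.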
  65. 65. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Programming: SIMT model ❖ CPU (“host”) program often written in C or C++ ❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect 65
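
The idea of a "sequential kernel" can be emulated on the host in plain C++. The sketch below is not GPU code and not any vendor's API: `invert_kernel` and `launch_invert` are hypothetical names, and the nested loop stands in for the hardware scheduler that, under SIMT, would run one kernel instance per thread.

```cpp
#include <vector>

// The "kernel": sequential code for ONE element, identified by its index.
// On a GPU, every thread would execute this body with its own (i, j).
void invert_kernel(int i, int j, int width,
                   const std::vector<unsigned char>& in,
                   std::vector<unsigned char>& out) {
    out[i * width + j] = static_cast<unsigned char>(255 - in[i * width + j]);
}

// Host-side emulation of a kernel launch over a height x width grid.
// A GPU runtime would distribute these invocations across many cores
// instead of looping over them.
void launch_invert(int height, int width,
                   const std::vector<unsigned char>& in,
                   std::vector<unsigned char>& out) {
    for (int i = 0; i < height; ++i)
        for (int j = 0; j < width; ++j)
            invert_kernel(i, j, width, in, out);
}
```

The kernel body contains no loop over the image: the iteration space lives in the launch, which is the main shift in thinking compared with the nested `for` loops of the CPU version.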
  66. 66. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Programming Techniques CUDA OpenCL C++ AMP Renderscript 66
  67. 67. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Programming Techniques 67
  68. 68. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA 68
  69. 69. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA ❖ CUDA: Compute Unified Device Architecture ❖ Parallel programming for nVidia's GPGPU ❖ Use C/C++ language • Java, Fortran, Matlab are OK ❖ When executing CUDA programs, the GPU operates as coprocessor to the main CPU 69
  70. 70. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Hardware Environment: CPU+GPU ❖ CPU • Organizes, interprets, and communicates information ❖ GPU • Handles the core processing on large quantities of parallel information • Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU CPU GPU PCI-E 70
  71. 71. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Software Stack 71
  72. 72. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Processing Flow on CUDA ❖ 1. Allocate device memory ❖ 2. Copy processing data (main memory → memory for GPU) ❖ 3. Instruct the processing (CPU → GPU) ❖ 4. Execute in parallel in each core ❖ 5. Copy the result (memory for GPU → main memory) ❖ 6. Release device memory 72
  73. 73. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Programming with Memory Hierarchy ❖ Locality principle • Temporal locality • Spatial locality 73
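
Spatial locality can be illustrated with two traversals of the same image buffer. This is a hedged sketch on the CPU side: the function names are hypothetical, the image is assumed to be stored row by row in one contiguous vector, and no timing claim is made because the gap depends on the cache and image size.

```cpp
#include <cstddef>
#include <vector>

// Row-major storage: element (i, j) lives at index i * width + j.

// Good spatial locality: the inner loop visits adjacent addresses.
long sum_row_order(const std::vector<int>& img,
                   std::size_t height, std::size_t width) {
    long sum = 0;  // reused on every iteration: temporal locality
    for (std::size_t i = 0; i < height; ++i)
        for (std::size_t j = 0; j < width; ++j)
            sum += img[i * width + j];
    return sum;
}

// Poor spatial locality: the inner loop jumps `width` elements per step.
long sum_column_order(const std::vector<int>& img,
                      std::size_t height, std::size_t width) {
    long sum = 0;
    for (std::size_t j = 0; j < width; ++j)
        for (std::size_t i = 0; i < height; ++i)
            sum += img[i * width + j];
    return sum;
}
```

Both functions return the same value; only the memory access pattern differs, which is what a memory hierarchy rewards or penalizes.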
  74. 74. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(1/3) int main() { char src[12]="Hello World"; char h_hello[12]; char* d_hello1; char* d_hello2; cudaMalloc((void**) &d_hello1, sizeof(char)*12); cudaMalloc((void**) &d_hello2, sizeof(char)*12); cudaMemcpy(d_hello1, src, sizeof(char)*12, cudaMemcpyHostToDevice); hello<<<1,1>>>(d_hello1, d_hello2); /* call the kernel function */ ❖ Host: src, h_hello ❖ Device: d_hello1, d_hello2 74
  75. 75. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(2/3) ❖ Kernel Function __global__ void hello(char* hello1, char* hello2) { int k; for (k = 0; hello1[k] != '\0'; k++) { hello2[k] = hello1[k]; } hello2[k] = '\0'; /* copy the string terminator as well */ } ❖ Host: src, h_hello ❖ Device: d_hello1, d_hello2 ❖ No parallel processing in this example 75
  76. 76. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(3/3) cudaMemcpy(h_hello, d_hello2, sizeof(char)*12, cudaMemcpyDeviceToHost); printf("%s\n", h_hello); cudaFree(d_hello1); cudaFree(d_hello2); system("pause"); return 0; } Result: ❖ Host: src, h_hello ❖ Device: d_hello1, d_hello2 76
  77. 77. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCL Standard 77
  78. 78. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. The Inspiration for OpenCL 78
  79. 79. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What's OpenCL ❖One code tree can be executed on CPUs, GPUs, DSPs and hardware • Dynamically interrogate system load and balance across available processors ❖Powerful, low-level flexibility • Foundational access to compute resources for higher-level engines, frameworks and languages 79
  80. 80. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Broad OpenCL Implementer Adoption ❖Multiple conformant implementations shipping on desktop and mobile ❖Android ICD extension released in latest extension specification ❖Multiple implementations shipping in Android NDK 80
  81. 81. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCL Enables Portability ❖C to gates programs are proprietary 81
  82. 82. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Altera OpenCL SDK for FPGAs 82
  83. 83. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. NVIDIA OpenCL SDK for GPU 83
  84. 84. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. AMD OpenCL Optimization Case Study ❖Platform • AMD Phenom II X4 965 CPU (quad core) • ATI Radeon HD 5870 GPU ❖Unoptimized CPU performance: 1 GFLOP/s ❖Optimized CPU performance reaches: 4 GFLOP/s ❖Optimized GPU performance reaches: 50 GFLOP/s 84
  85. 85. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(1/3) Including Declaring 85
  86. 86. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(2/3) Creating 86
  87. 87. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(2/3) Do Copy to host & display Creating 87
  88. 88. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(3/3) Kernel Function 88
  89. 89. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. C++ AMP Microsoft 89
  90. 90. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What's C++ AMP(1/2) ❖Microsoft’s C++ AMP (Accelerated Massive Parallelism) • Part of Visual C++, integrated with Visual Studio, built on Direct3D • “Performance for the mainstream” ❖STL-like library for multidimensional array data • Special convenience support for 1, 2, and 3 dimensional arrays on CPU or GPU • C++ AMP runtime handles CPU<->GPU data copying • Tiles enable efficient processing of sub-arrays 90
  91. 91. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What's C++ AMP(2/2) ❖parallel_for_each •Executes a kernel (C++ lambda) at each point in the extent •restrict() clause specifies where to run the kernel: cpu (default) or amp (GPU; named direct3d in the developer preview) 91
  92. 92. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(1/2) Declaring & Copying to device 92
  93. 93. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(2/2) Do Display 93
  94. 94. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Google Android 94
  95. 95. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What's Renderscript(1/2) ❖Higher-level than CUDA or OpenCL: simpler & less performance control • Emphasis on mobile devices & cross-SoC performance portability ❖Programming model • C99-based kernel language, JIT-compiled, single input-single output • Automatic Java class reflection • Intrinsics: built-in, highly-tuned operations, e.g. ScriptIntrinsicConvolve3x3 • Script groups combine kernels to amortize launch cost & enable kernel fusion 95
  96. 96. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What's Renderscript(2/2) ❖ Data type: • 1D/2D collections of elements, C types like int and short2, types include size • Runtime type checking ❖ Parallelism • Implicit: one thread per data element, atomics for thread-safe access • Thread scheduling not exposed, VM-decided 96
  97. 97. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Architecture 97
  98. 98. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Low Level Virtual Machine ❖Low Level Virtual Machine (LLVM) is a compiler infrastructure 98
  99. 99. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Offline Compiler Flow 99
  100. 100. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Renderscript Compiler: libbcc 100
  101. 101. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Renderscript Project Framework 101
  102. 102. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(1/8) 102
  103. 103. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(2/8) HelloWorld.java 103
  104. 104. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(3/8) HelloWorld.java 104
  105. 105. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(4/8) HelloWorldView.java 105
  106. 106. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(5/8) HelloWorldView.java 106
  107. 107. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(6/8) HelloWorldRS.java 107
  108. 108. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(7/8) HelloWorldRS.java 108
  109. 109. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(7/8) ScriptC_helloworld.java 109
  110. 110. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(7/8) ScriptC_helloworld.java 110
  111. 111. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Example - Hello World(8/8) HelloWorld.rs 111
  112. 112. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Comparison (1/2) ❖Renderscript vs. Native (NDK) vs. Java (SDK) • OS: Honeycomb v3.2 (CPU only) Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." in Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC). 201 112
  113. 113. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Comparison(2/2) ❖OpenCL & CUDA • Sobel filter with (CMw) and without (CMw/o) constant memory ❖ OpenCL’s portability does not fundamentally affect its performance Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A comprehensive performance comparison of CUDA and OpenCL." in Proc. International Conference Parallel Processing (ICPP), 2011. 113
  114. 114. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Programming 114 Performance: more control, better performance Productivity: ease of use, quick programming, portability
  115. 115. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. ❖ Multicore/Multi-threading ❖ Data Parallelization • Data distribution • Parallel convolution • Reduction algorithm • Amdahl’s law ❖ Memory Hierarchy Management • Locality principle • Program accesses a relatively small portion of the address space at any instant of time Parallelization 115
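
Amdahl's law bounds the speedup that parallelization can deliver when only part of a program can run in parallel: speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of cores. The helper names below are hypothetical; the formula itself is the standard statement of the law.

```cpp
// Amdahl's law: speedup on `cores` cores when a fraction
// `parallel_fraction` (between 0 and 1) of the run time can be parallelized.
double amdahl_speedup(double parallel_fraction, int cores) {
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

// Upper bound as the number of cores grows without limit.
// Assumes parallel_fraction < 1.
double amdahl_limit(double parallel_fraction) {
    return 1.0 / (1.0 - parallel_fraction);
}
```

With 90% of the work parallelized, `amdahl_limit(0.9)` is about 10, so even a 3072-core GPGPU stays below a 10x speedup; the remaining serial part dominates. This is why identifying and shrinking the serial portion matters as much as adding cores.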
  116. 116. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Multi-thread Programming with the Discipline of Parallelization ❖ Identify parallelism: Analyze algorithm ❖ Express parallelism: Write parallel code ❖ Validate parallelism: Debug & verify parallel code ❖ Optimize parallelism: enhance parallel performance 116
  117. 117. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6) 117
  118. 118. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 4. OpenCV Acceleration 118
  119. 119. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. What Is OpenCV ❖A very popular computer vision library • 6M downloads • BSD licenses • 2000 ~ CV functions • Modularized and efficient • Optimization • Intel SSE, IPP, TBB • ARM NEON & GLSL (Tegra) • CUDA, OpenCL 119
  120. 120. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV Modules ❖Image/video I/O, processing, display (core, imgproc, highgui) ❖Object/feature detection (objdetect, features2d, nonfree) ❖Geometry-based monocular or stereo computer vision (calib3d, stitching, videostab) ❖Computational photography (photo, video, superres) ❖Machine learning & clustering (ml, flann) ❖CUDA and OpenCL GPU acceleration (gpu, ocl) Normal CV modules: 14 Acceleration modules: 2 120
  121. 121. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV GPU Module ❖Implemented using NVIDIA CUDA Runtime API ❖Latest version: 2.4.9 • Utilizing Multiple GPUs ❖Implemented modules: 11 ❖Implemented functions: 270 ❖ Focus on PC platform ❖ Not fully compatible with mobile GPGPUs on Android 121
  122. 122. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Matrix Operations ❖Point-wise matrix math • gpu::add(), ::sum(), ::div(), ::sqrt(), ::sqrSum(), ::meanStdDev, ::min(), ::max(), ::minMaxLoc(), ::magnitude(), ::norm(), ::countNonZero(), ::cartToPolar(), etc.. ❖Matrix multiplication • gpu::gemm() ❖Channel manipulation • gpu::merge(), ::split() 122
  123. 123. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Geometric Operations ❖Image resize with sub-pixel interpolation • gpu::resize() ❖Image rotate with sub-pixel interpolation • gpu::rotate() ❖Image warp (e.g., panoramic stitching) • gpu::warpPerspective(), ::warpAffine() 123
  124. 124. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA other Math and Geometric Operations ❖Integral images • gpu::integral(), ::sqrIntegral() ❖Custom geometric transformation (e.g., lens distortion correction) • gpu::remap(), ::buildWarpCylindricalMaps(), ::buildWarpSphericalMaps() 124
  125. 125. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Image Processing(1/2) ❖Smoothing • gpu::blur(), ::boxFilter(), ::GaussianBlur() ❖Morphological • gpu::dilate(), ::erode(), ::morphologyEx() ❖Edge Detection • gpu::Sobel(), ::Scharr(), ::Laplacian(), gpu::Canny() ❖Custom 2D filters • gpu::filter2D(), ::createFilter2D_GPU(), ::createSeparableFilter_GPU() ❖Color space conversion • gpu::cvtColor() 125
  126. 126. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Image Processing(2/2) ❖Image blending • gpu::blendLinear() ❖Template matching (automated inspection) • gpu::matchTemplate() ❖Gaussian pyramid (scale invariant feature/object detection) • gpu::pyrUp(), ::pyrDown() ❖Image histogram • gpu::calcHist(), gpu::histEven, gpu::histRange() ❖Contrast enhancement • gpu::equalizeHist() 126
  127. 127. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA De-noising ❖Gaussian noise removal • gpu::FastNonLocalMeansDenoising() ❖Edge preserving smoothing • gpu::bilateralFilter() 127
  128. 128. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Fourier and MeanShift ❖Fourier analysis •gpu::dft(), ::convolve(), ::mulAndScaleSpectrums(), etc.. ❖MeanShift •gpu::meanShiftFiltering(), ::meanShiftSegmentation() 128
  129. 129. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Shape Detection ❖Line detection (e.g., lane detection, building detection, perspective correction) • gpu::HoughLines(), ::HoughLinesDownload() ❖Circle detection (e.g., cells, coins, balls) • gpu::HoughCircles(), ::HoughCirclesDownload() 129
  130. 130. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Object Detection ❖HAAR and LBP cascaded adaptive boosting (e.g., face, nose, eyes, mouth) • gpu::CascadeClassifier_GPU::detectMultiScale() ❖HOG detector (e.g., person, car, fruit, hand) • gpu::HOGDescriptor::detectMultiScale() 130
  131. 131. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Object Recognition ❖Interest point detectors • gpu::cornerHarris(), ::cornerMinEigenVal(), ::SURF_GPU, ::FAST_GPU, ::ORB_GPU(), ::GoodFeaturesToTrackDetector_GPU() ❖Feature matching • gpu::BruteForceMatcher_GPU(), ::BFMatcher_GPU() 131
  132. 132. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Stereo and 3D ❖RANSAC • gpu::solvePnPRansac() ❖Stereo correspondence (disparity map) • gpu::StereoBM_GPU(), ::StereoBeliefPropagation(), ::StereoConstantSpaceBP(), ::DisparityBilateralFilter() ❖Represent stereo disparity as 3D or 2D • gpu::reprojectImageTo3D(), ::drawColorDisp() 132
  133. 133. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Optical Flow ❖Dense/sparse optical flow • gpu::FastOpticalFlowBM(), ::PyrLKOpticalFlow, ::BroxOpticalFlow(), ::FarnebackOpticalFlow(), ::OpticalFlowDual_TVL1_GPU(), ::interpolateFrames() 133
  134. 134. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA Background Segmentation ❖Foreground/background segmentation (e.g., object detection/removal, motion tracking, background removal) • gpu::FGDStatModel, ::GMG_GPU, ::MOG_GPU, ::MOG2_GPU 134
  135. 135. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Performance of OpenCV GPU Accelerators on PC 135
  136. 136. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6) 136
  137. 137. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 5. Computer Vision Acceleration on PC Image enhancement (HDR) Feature extraction Video surveillance cloud 137
  138. 138. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. HDR and Image Enhancement 138
  139. 139. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. HDR Image Enhancement ❖ Restore and enhance an image ❖ Its complexity is high for large images • Complexity: O(N^2) ~ O(N^3) [Figure: original image and restored image] 139
  140. 140. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Algorithms for Image Restoration ❖ Wiener Filter ❖ Histogram Based Approach • Histogram Equalization, Histogram Modification, … ❖ Retinex • Path-based Retinex • Recursive Retinex • Center/surround Retinex • No iterative process and is suitable for parallelization • Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997] 140
  141. 141. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. MSRCR Algorithm [Equation and symbols not preserved in this transcript] ❖ Quantities in the MSRCR formula • The MSRCR output • The original image distribution in the ith spectral band • The kth Gaussian surround function • The convolution operation • The weight of each scale • The color restoration factor in the ith spectral band • N: the number of spectral bands • The gain constant • The constant that controls the strength of the nonlinearity 141
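For readers without the slide image, the formulation of MSRCR commonly given in the literature (following Rahman et al.) is sketched below. The symbol names are assumptions chosen to match the quantities listed on the slide and may differ from the notation on the original slide: R_i is the output, I_i the image in the ith spectral band, F_k the kth Gaussian surround, * convolution, W_k the weight of scale k, C_i the color restoration factor, N the number of spectral bands, β the gain constant, and α the constant controlling the strength of the nonlinearity. K, the number of scales, is an added symbol.

```latex
R_i(x, y) = C_i(x, y) \sum_{k=1}^{K} W_k
            \left\{ \log I_i(x, y) - \log\left[ F_k(x, y) * I_i(x, y) \right] \right\}

C_i(x, y) = \beta \left\{ \log\left[ \alpha\, I_i(x, y) \right]
            - \log\left[ \sum_{j=1}^{N} I_j(x, y) \right] \right\}
```

Each term depends only on a pixel and its Gaussian-weighted neighbourhood, with no iteration across the image, which is what makes the center/surround form attractive for GPGPU execution.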
  142. 142. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. The Method [Flowchart spanning CPU and GPGPU: Copy Data from CPU to GPGPU; Gaussian Blur; Log-domain Processing; Normalization; Histogram Stretching; Copy Data from GPGPU to CPU] • Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011. • Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19. 142
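The stages named in the flowchart can be illustrated with a small single-scale sketch on one image row. Everything here is an assumption made for brevity: a moving-average `box_blur` stands in for the Gaussian blur, normalization and histogram stretching are merged into a single linear `stretch`, and the function names are hypothetical. It is not the algorithm from the cited papers.

```python
import math


def box_blur(signal, radius):
    """Moving average with clamped borders (stand-in for the Gaussian surround)."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def log_domain(signal, blurred):
    """Retinex core: log of each pixel minus log of its surround (+1 avoids log(0))."""
    return [math.log(s + 1.0) - math.log(b + 1.0) for s, b in zip(signal, blurred)]


def stretch(values, lo=0, hi=255):
    """Linearly map the observed value range onto [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [round(lo + (v - vmin) * scale) for v in values]


def retinex_row(signal, radius=2):
    """Blur, take the log-domain difference, then stretch to the display range."""
    return stretch(log_domain(signal, box_blur(signal, radius)))
```

Each stage is either a per-pixel map or a small-neighbourhood filter, so every stage can run as its own GPU kernel between the two host-device copies.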
  143. 143. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallelization by GPGPU ❖ Multicore/Multi-threading • Tesla C1060: 240 SPs (Stream Processors) • CUDA: Thread, Block, Grid ❖ Data Parallelization • Parallel convolution [Figure: tree-structured parallel sum of A(0) to A(7) across processing elements (PE) over time steps t0 to t5; assignment of M-pixel blocks versus single pixels to each PE] 143
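The tree-shaped sum in the figure (A(0)+A(1), then pairs of pairs, and so on) can be simulated sequentially. In this sketch each pass of the loop corresponds to one parallel time step in which every pair would be handled by a separate processing element; the function name `tree_reduce_sum` is an assumption.

```python
def tree_reduce_sum(values):
    """Pairwise (tree) summation.

    Returns (total, steps), where steps is the number of parallel time steps
    needed if every pair in a level is added simultaneously.
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        # All additions in this comprehension are independent of each other.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is carried to the next level
        level = nxt
        steps += 1
    return (level[0] if level else 0), steps
```

For eight inputs the sum finishes in three steps instead of seven sequential additions, i.e. the depth grows with log2 of the input size.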
  144. 144. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Our Memory Hierarchy Parallel Gaussian Blur Parallel Log-domain Processing Parallel Normalization Texture Memory Parallel Histogram Stretching Constant Memory Global Memory Shared Memory 144
  145. 145. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Experimental Results (1/2) [Figure: original images, CPU results, GPGPU results] 145
  146. 146. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Experimental Results (2/2) [Figure: original images, CPU results, GPGPU results] 146
  147. 147. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Speedup over CPU 74x 2x • Ideal speedup: 240 * (1.296 GHz / 3 GHz) ≈ 103 • NPP: nVidia Performance Primitives 147
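The ideal-speedup estimate on this slide is simple arithmetic: the number of GPU cores scaled by the clock ratio. The sketch below reproduces it and compares the 74x figure from the slide against that bound; the function names and the `cpu_cores` parameter are assumptions added for illustration.

```python
def ideal_speedup(gpu_cores, gpu_clock_ghz, cpu_clock_ghz, cpu_cores=1):
    """Upper bound on speedup if every GPU core did one CPU core's work per cycle."""
    return gpu_cores * gpu_clock_ghz / (cpu_cores * cpu_clock_ghz)


def parallel_efficiency(measured_speedup, ideal):
    """Fraction of the ideal speedup that was actually achieved."""
    return measured_speedup / ideal


ideal = ideal_speedup(240, 1.296, 3.0)       # Tesla C1060 versus a 3 GHz CPU core
efficiency = parallel_efficiency(74, ideal)  # 74x is the speedup shown on the slide
```

With these numbers the bound is about 103.7 and the 74x result corresponds to roughly 71% of it; memory transfers and non-parallel stages account for the rest.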
  148. 148. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Feature Extraction (SIFT) 148
  149. 149. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. ❖SIFT • Scale Invariant Feature Transform ❖Invariance of feature points • Translation • Rotation • Scale What Is SIFT 149
  150. 150. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. ❖Object recognition/tracking ❖Image retrieval ❖Autostitch Applications of SIFT 150
  151. 151. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallelize SIFT by GPGPU Intel Q9400 Quad cores (2.66GHz) Geforce GTS 250 128 SPs (1.836GHz) 151
  152. 152. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CPU GPU Experimental Results 152
  153. 153. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Execution Time (ms) CPU: 10 seconds on average GPGPU: 0.8 seconds on average 153
  154. 154. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Speedup 13x speedup on average 154
  155. 155. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Video Surveillance Cloud 155
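  156. 156. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU Cloud Video Surveillance System [Diagram labels: restricted-area intrusion detection; PTZ camera tracking; camera anomaly detection; efficient video event browsing system; central video and message management system; multi-resolution wide-area surveillance system; outdoor; parking lot; vacant-space detection; illegal-parking detection; dynamic scenes; face detection; Storage Area Network; PC; Mobile device; Multi-core; Hypervisor; GPGPU] 156
  157. 157. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Private cloud server room 157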
  158. 158. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6) 158
  159. 159. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 6. Computer Vision Acceleration on Android OpenCV RenderScript 159
  160. 160. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV on Android 160
  161. 161. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV4Android SDK ❖Enables development of Android applications that use the OpenCV library. ❖Uses the Java Native Interface (JNI) to call C/C++ code directly. ❖Supports nVidia's Tegra Android Development Pack (TADP) • Not fully compatible with the GPU module 161
  162. 162. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. System Framework 162
  163. 163. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Two Methods to Call OpenCV ❖Using Java API ❖Using native C++ 163
  164. 164. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV for Android SDK by GPU(1/5) 164
  165. 165. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV for Android SDK by GPU(2/5) 165
  166. 166. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV for Android SDK by GPU(3/5) 166
  167. 167. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV for Android SDK by GPU(4/5) 167
  168. 168. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV for Android SDK by GPU(5/5) 168
  169. 169. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript on Android with GPU Acceleration 169
  170. 170. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript on android with GPU(1/5) 170
  171. 171. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript on android with GPU(2/5) 171
  172. 172. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript on android with GPU(3/5) 172
  173. 173. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript on android with GPU(4/5) 173
  174. 174. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript on android with GPU(5/5) 174
  175. 175. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Image Processing Intrinsics (Name: Operation) • ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5: Performs a 3x3 or 5x5 convolution. • ScriptIntrinsicBlur: Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows. • ScriptIntrinsicYuvToRGB: Converts a YUV buffer to RGB. Often used to process camera data. • ScriptIntrinsicColorMatrix: Applies a 4x4 color matrix to a buffer. • ScriptIntrinsicBlend: Blends two allocations in a variety of ways. • ScriptIntrinsicLUT: Applies a per-channel lookup table to a buffer. • ScriptIntrinsic3DLUT: Applies a color cube with interpolation to a buffer. • ScriptIntrinsicHistogram: Intrinsic histogram filter. 175
  176. 176. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Gaussian Blur Example by RenderScript Intrinsic RenderScript rs = RenderScript.create(theActivity); ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs)); Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap); Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap); theIntrinsic.setRadius(25.f); theIntrinsic.setInput(tmpIn); theIntrinsic.forEach(tmpOut); tmpOut.copyTo(outputBitmap); 176
  177. 177. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Intrinsic Example(1/2) 177
  178. 178. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Intrinsic Example(2/2) 178
  179. 179. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Blur Intrinsic Performance Analysis 179
  180. 180. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Performance of RenderScript Intrinsics ❖On new Nexus 7 ❖Relative to equivalent multithreaded C implementations. 180
  181. 181. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Image Processing Benchmarks(1/2) ❖CPU only on a Galaxy Nexus device. 181
  182. 182. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript Image Processing Benchmarks(2/2) 182
  183. 183. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Acceleration of Retinex Using RenderScript ❖This paper presents rsRetinex, a Retinex implementation optimized with RenderScript. ❖The experimental results show that rsRetinex achieves up to a fivefold speedup across different image resolutions. Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image Processing on Android Device Using Renderscript." in Proc. The 8th International Conference on Robotic, Vision, Signal Processing & Power Applications, 2014. 183
  184. 184. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Mobile GPGPU List (GPU: Adoption | OpenCL/CUDA | OpenCV | Renderscript) • Qualcomm Adreno: Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | 1.2 (302~420) | OCL module | Android 4.0 or later • ARM Mali: Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 or later • nVIDIA Tegra: Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 or later (K1 only) • PowerVR: iPad Air, iPad mini | OpenCL 1.2 | OCL module | none • Intel HD Graphics: Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none Sources: AnandTech; "Nvidia CEO sees future in cars and gaming," 2014/5/19, CNet. 184
  185. 185. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. 7. Summary 185
  186. 186. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPGPU ❖ Single-core → Multi-core → Many-core ❖PC • nVidia Tesla + CUDA/OpenCV ❖Android • Qualcomm Adreno + OpenCV ocl • nVidia Tegra + OpenCV gpu 186
  187. 187. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallel Programming ❖C/C++/OpenCV • OpenMP, OpenACC, CUDA, C++ AMP • OpenCL ❖Java • OpenCL, RenderScript ❖Note that OpenCL and RenderScript are • Not efficient in parallelization • Efficient in CV algorithmic design 187
  188. 188. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV Acceleration (1/2) ❖Ver. 2.4.x • gpu module: CUDA, PC • ocl module: OpenCL, mobile ❖Ver. 3.0 (2014/6) • Transparent API for GPGPU acceleration 188
  189. 189. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV Acceleration (2/2) 189
  190. 190. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCL 2.0 ❖Released in 2013 ❖SVM: Shared Virtual Memory • OpenCL 1.2: Explicit memory management ❖Dynamic (Nested) Parallelism • Allows a device to enqueue kernels onto itself – no round trip to host required ❖Disadvantage • Requires strong hardware support • Not well supported in current GPGPUs 190
  191. 191. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA still Dominant in the Near Future ❖ However, we have to manually parallelize the algorithm: more design overhead ❖ We need expertise in • Algorithms of image and signal processing • Filtering, frequency analysis, compression, feature extraction, recognition, ... • Theory, tools and methodology of parallel computing • Communication, synchronization, resource management, load balancing, debugging, ... 191
  192. 192. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPUs for Multimedia Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA 10 X CUDA JPEG Decoder 10 X DivideFrame GPU Decoder Hyperspectral Image Compression on NVIDIA GPUs 10 X GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files 26 X PowerDirector7 Ultra 3.5X 192
  193. 193. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPUs for Computer Vision(1/2) • 87 X: CUDA SURF – A Real-time Implementation for SURF, TU Darmstadt • 26 X: Leukocyte Tracking: ImageJ Plugin, University of Virginia • 200 X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid • 100 X: Image Denoising with Bilateral Filter, Wroclaw University of Technology • 85 X: Digital Breast Tomosynthesis Reconstruction, Massachusetts General Hospital • 100 X: Fast Optical Flow on GPU At Video Rate for Full HD Resolution, Onera • 8 X: A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU, NEC Labs, Berkeley, Purdue • 13 X: Accelerating Advanced MRI Reconstructions, University of Illinois 193
  194. 194. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. GPUs for Computer Vision(2/2) • 20 X: GPU for Surveillance • 13 X: Fast Human Detection with Cascaded Ensembles • 109 X: Fast Sliding-Window Object Detection • 263 X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA • 10 X: Real-time Visual Tracker by Stream Processing • 45 X: A GPU Accelerated Evolutionary Computer Vision System • 3 X: Canny Edge Detection • 300 X: Audience Measurement – Real-time Video Analysis for Counting People, Face Detection and Tracking 194
  195. 195. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. The Embedded Vision Alliance 195
  196. 196. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Readings (1/2) • Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2011. • Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19. • Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features." Computers, IEEE Transactions on 61.7 (2012): 999-1012. • Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A review." Medical physics 38.5 (2011): 2685-2697. • Cope, Ben, et al. "Performance comparison of graphics processors to reconfigurable logic: a case study." Computers, IEEE Transactions on 59.4 (2010): 433-448. 196
  197. 197. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Readings (2/2) ❖ “Designing Visionary Mobile Apps Using the Tegra Android Development Pack,” http://bit.ly/1jvwbgV ❖ “Getting Started With GPU-Accelerated Computer Vision Using OpenCV and CUDA,” http://bit.ly/1oMwJEG ❖ “The open standard for parallel programming of heterogeneous systems,” https://www.khronos.org/opencl/ 197
  198. 198. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCV Acceleration ❖ GPU Module Introduction — OpenCV 2.4.9.0 documentation ❖ OpenCL Module Introduction - opencv documentation! ❖ OpenCV-CL: Computer vision with OpenCL acceleration, AMD Developer Central, 2013. ❖ Pulli, Kari, et al. "Real-time computer vision with OpenCV." Communications of the ACM 55.6 (2012): 61-69. ❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated framework for image processing and computer vision." Advances in Visual Computing. Springer Berlin Heidelberg, 2008. 430-439. 198
  199. 199. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. CUDA ❖ CUDA Programming guide. nVidia. ❖ CUDA Best Practices Guide. nVidia. ❖ CUDA Reference Manual. nVidia. ❖ CUDA Zone - NVIDIA Developer, https://developer.nvidia.com/cuda-zone ❖ Parallel Programming and Computing Platform | CUDA Home, www.nvidia.com/object/cuda_home_new.html ❖ Applications of CUDA for Imaging and Computer Vision http://www.nvidia.com/object/imaging_comp_vision.html ❖ nVidia Performance Primitives (NPP) http://developer.nvidia.com/object/npp_home.html 199
  200. 200. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. OpenCL ❖ Khronos OpenCL specification, reference card, tutorials, etc: http://www.khronos.org/opencl ❖ AMD OpenCL Resources: http://developer.amd.com/opencl ❖ NVIDIA OpenCL Resources: http://developer.nvidia.com/opencl ❖ Books • Using OpenCL: Programming Massively Parallel Computers. IOS Press, 2012. • OpenCL programming guide. Pearson Education, 2011. • Heterogeneous Computing with OpenCL: Revised OpenCL 1. Newnes, 2012. • OpenCL in Action: how to accelerate graphics and computation. Manning, 2012. 200
  201. 201. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. RenderScript ❖ RenderScript for Android Developer, Official web site http://developer.android.com/guide/topics/renderscript/compute.html ❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." First Asia-Pacific Programming Languages and Compilers Workshop. 2012. ❖ "High Performance Apps Development with RenderScript," 12th Kandroid Conference, 2013. 201
  202. 202. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Web Sites and Resources ❖Embedded Vision Alliance, http://www.embedded-vision.com ❖GPUComputing.Net, http://www.gpucomputing.net ❖HSA Foundation, www.hsafoundation.com 202
  203. 203. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Parallel Computing with GPGPU ❖Programming Massively Parallel Processors – A Hands-on Approach • D. B. Kirk, W. M. Hwu • Morgan Kaufmann, 2010 • http://www.nvidia.com/object/promotion_david_kirk_book.html 203
