This document discusses parallelizing computer vision algorithms using GPGPU computing. It begins with an introduction to multicore computing and GPUs. It explains that as CPU clock speeds can no longer increase due to power constraints, the industry has shifted to multicore CPUs and GPUs to continue improving performance. Computer vision algorithms are well-suited to parallelization on GPUs due to their massive data processing needs. The document reviews GPU architectures from Nvidia, Qualcomm, AMD, and ARM that can be used to accelerate computer vision. It also discusses parallel programming frameworks for GPUs like CUDA, OpenCL, and OpenACC.
A graphics processing unit, or GPU (occasionally also called a visual processing unit, or VPU), is a specialized microprocessor that offloads and accelerates graphics rendering from the central processor. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. In a CPU, only a fraction of the chip performs computations, whereas a GPU devotes more of its transistors to data processing.
GPGPU is a programming methodology based on modifying algorithms to run on existing GPU hardware for increased performance. Unfortunately, GPGPU programming is significantly more complex than traditional programming for several reasons.
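As a rough sketch of the pattern GPGPU exploits (the same operation applied independently to every data element), consider a toy per-pixel brighten step. This is purely illustrative: all names are invented, and a Python thread pool only mimics the kernel-to-thread mapping, it does not deliver GPU-style speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def brighten(pixel, delta=40):
    # The "kernel": one independent operation per data element.
    return min(pixel + delta, 255)

def brighten_serial(image):
    return [brighten(p) for p in image]

def brighten_parallel(image, workers=4):
    # A thread pool stands in for the GPU's grid of lightweight threads;
    # on a real GPU each pixel would be handled by its own thread.
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(brighten, image))

image = [0, 100, 200, 250] * 1000   # a fake flat grayscale image
assert brighten_serial(image) == brighten_parallel(image)
```

Because every pixel is independent, the work partitions trivially, which is exactly why vision workloads suit the GPU's many-core design.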
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabilities (AMD Developer Central)
Presentation MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabilities, by Srikanth Gollapudi at the AMD Developer Summit (APU13) November 11-13, 2013.
Slide at OpenStack Summit 2018 Vancouver
Session Info and Video: https://www.openstack.org/videos/vancouver-2018/can-we-boost-more-hpc-performance-integrate-ibm-power-servers-with-gpus-to-openstack-environment
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2021/10/deploying-pytorch-models-for-real-time-inference-on-the-edge-a-presentation-from-nomitri/
Moritz August, CDO at Nomitri GmbH, presents the “Deploying PyTorch Models for Real-time Inference On the Edge” tutorial at the May 2021 Embedded Vision Summit.
In this presentation, August provides an overview of workflows for deploying compressed deep learning models, starting with PyTorch and creating native C++ application code running in real-time on embedded hardware platforms. He illustrates these workflows on smartphones with real-world examples targeting ARM-based CPU, GPUs, and NPUs as well as embedded chips and modules like the NXP i.MX8+ and NVIDIA Jetson Nano.
August examines TorchScript, architecture-side optimizations, quantization and common pitfalls. Additionally, he shows how the PyTorch deployment workflow can be extended to conversion to ONNX and quantization of ONNX models using an ONNX Runtime. On the application side, he demonstrates how deployed models can be integrated efficiently into a C++ library that runs natively on mobile and embedded devices and highlights known limitations.
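The quantization step August covers boils down to mapping floats onto int8 with a scale and zero-point. A minimal sketch of that affine scheme (formulas only; real toolchains such as PyTorch and ONNX Runtime also calibrate the scale per tensor or per channel):

```python
# Affine (scale / zero-point) int8 quantization, as used when deploying
# compressed models. Values and ranges below are illustrative.

def quantize(x, scale, zero_point):
    """float -> int8: q = clamp(round(x / scale) + zero_point, -128, 127)"""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """int8 -> float approximation: x ~ (q - zero_point) * scale"""
    return (q - zero_point) * scale

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = 2.0 / 255, 0          # symmetric range [-1, 1]
q = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(v, scale, zero_point) for v in q]
# Round-tripping loses at most about scale/2 per weight.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

The bounded round-trip error is why quantized models usually lose little accuracy while shrinking 4x and running on integer units.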
Slides at OpenStack Summit 2017 Sydney
Session Info and Video: https://www.openstack.org/videos/sydney-2017/100gbps-openstack-for-providing-high-performance-nfv
Review state-of-the-art techniques that use neural networks to synthesize motion, such as mode-adaptive neural network and phase-functioned neural networks. See how next-generation CPUs with reinforcement learning can offer better performance.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/03/opencv-past-present-and-future-a-presentation-from-opencv-org/
For more information about edge AI and vision, please visit:
http://www.edge-ai-vision.com
Gary Bradski, the President and CEO of OpenCV.org, delivers the presentation “OpenCV: Past, Present and Future” at the Edge AI and Vision Alliance’s March 2020 Vision Industry and Technology Forum. Bradski shares the latest developments in the OpenCV open source library for computer vision and deep learning applications, as well as where OpenCV is heading.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2019-alliance-vitf-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group and Vice President of Developer Ecosystems at NVIDIA, delivers the presentation "Current and Planned Standards for Computer Vision and Machine Learning" at the Embedded Vision Alliance's December 2019 Vision Industry and Technology Forum. Trevett shares updates on recent, current and planned Khronos standardization activities aimed at streamlining the deployment of embedded vision and AI.
Despite the growing number of deep learning practitioners and researchers, many of them do not use GPUs, which can lead to long training/evaluation cycles and impractical research.
In his talk, Lior shares how to get started with GPUs, along with some of the best practices that helped him during research and work. The talk is for everyone who works with machine learning (deep learning experience is NOT mandatory!). It covers the very basics of how GPUs work, CUDA drivers, IDE configuration, training, inference, and multi-GPU training.
Applying Deep Learning Vision Technology to Low-Cost/Low-Power Embedded Systems (Jenny Midwinter)
Slides from Ottawa Machine Learning Meetup from January 16, 2016.
Pierre Paulin, Director of R&D at Synopsys (Embedded Vision Subsystems), will be making a presentation on:
“Applying Deep Learning Vision Technology to Low-Cost, Low-Power Embedded Systems: An Industrial Perspective”
Hire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas (WithTheBest)
Bucko and Nicolas share their vision and products, and explain what Deckard is. Drawing on insights from their software development team, they argue that code can solve many of the problems we face, and they place their hopes in machine-assisted source coding as a way to reduce human error.
Michael Arthur Bucko & Aurélien Nicolas
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-mallick
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Satya Mallick, Interim CEO of OpenCV.org, presents the "OpenCV: Current Status and Future Plans" tutorial at the May 2019 Embedded Vision Summit.
With over two million downloads per week, OpenCV is the most popular open source computer vision library in the world. It implements over 2500 optimized algorithms, works on all major operating systems, is available in multiple languages and is free for commercial use.
This talk primarily provides a technical update on OpenCV: What’s new in OpenCV 4.0? What is the Graph API? Why are we so excited about the Deep Neural Network (DNN) module in OpenCV? (Short answer: It is one of the fastest inference engines on the CPU.)
Mallick also shares plans for the future of OpenCV, including new algorithms that the organization plans to add through the Google Summer of Code this year. And he briefly shares information on the new Open Source Vision Foundation (OSVF), on OpenCV’s sister organizations, CARLA and Open3D, and on some of the initiatives planned by these organizations.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2014-embedded-vision-summit-khronos
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of Khronos and Vice President at NVIDIA, presents the "OpenVX Hardware Acceleration API for Embedded Vision Applications and Libraries" tutorial at the May 2014 Embedded Vision Summit.
This presentation introduces OpenVX, a new application programming interface (API) from the Khronos Group. OpenVX enables performance and power optimized vision algorithms for use cases such as face, body and gesture tracking, smart video surveillance, automatic driver assistance systems, object and scene reconstruction, augmented reality, visual inspection, robotics and more.
OpenVX enables significant implementation innovation while maintaining a consistent API for developers. OpenVX can be used directly by applications or to accelerate higher-level middleware with platform portability. OpenVX complements the popular OpenCV open source vision library that is often used for application prototyping.
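The key OpenVX idea is declaring a pipeline as a graph of kernels first, then letting the runtime execute (and potentially fuse or offload) it. A toy sketch of that graph-then-execute pattern in plain Python; all class, node, and kernel names here are invented for illustration, not OpenVX API names:

```python
# Toy "declare a graph, then process it" pattern. A real OpenVX runtime
# could fuse these nodes or map them onto a GPU/DSP; here we just run
# them in dependency order on a flat grayscale image.

class Graph:
    def __init__(self):
        self.nodes = []                 # (function, name) pairs in order

    def add_node(self, fn, name):
        self.nodes.append((fn, name))
        return self                     # allow chaining

    def process(self, image):
        for fn, _ in self.nodes:
            image = fn(image)
        return image

def gaussian_blur(img):                 # stand-in blur: 3-tap average
    return [(img[max(i - 1, 0)] + img[i] + img[min(i + 1, len(img) - 1)]) // 3
            for i in range(len(img))]

def threshold(img, t=128):              # stand-in binarization kernel
    return [255 if p > t else 0 for p in img]

g = Graph().add_node(gaussian_blur, "blur").add_node(threshold, "thresh")
result = g.process([0, 90, 200, 255])   # -> [0, 0, 255, 255]
```

Because the whole pipeline is known before execution, the runtime has the freedom to tile, fuse, or schedule it per platform, which is the portability argument the talk makes.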
GPGPU: What It Is and What It's For. Alexander Titov. CoreHard Spring 2019 (corehard_by)
GPGPU is the use of a graphics processor (GPU) to perform general-purpose computations normally carried out by the central processor (CPU). Thanks to the GPU's large computational resources, this approach can speed up some applications by tens of times compared with a traditional CPU. Given that GPUs are present in a great many modern devices, the approach can be a useful tool for any programmer who cares about the performance of their programs. This talk is an introduction to GPGPU technology. The presentation discusses the differences between CPUs and GPUs at the hardware level and explains how those differences led to different programming models for the two devices. It covers the classes of problems that GPGPU accelerates well, and the cases where a GPU can turn out to be slower than a CPU. The talk does not focus on any particular GPGPU API (OpenCL, CUDA, etc.) and requires no prior knowledge of GPU or CPU hardware.
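One way to see why a GPU can end up slower than a CPU on some workloads is Amdahl's law: if only a fraction of a program is parallelizable, the overall speedup is bounded no matter how fast the parallel part becomes. A small worked example (the fractions chosen here are illustrative):

```python
# Amdahl's law: overall speedup when a fraction p of the runtime is
# accelerated by a factor s, and the remaining (1 - p) stays serial.

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# Even a 100x kernel speedup yields under 10x overall when 10% of the
# work (data transfers, serial setup) cannot be parallelized:
assert round(amdahl_speedup(0.9, 100), 2) == 9.17
```

Add the cost of copying data over PCIe on top of the serial fraction, and a small or poorly parallel task can easily run faster on the CPU, which is the caveat the talk highlights.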
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units (AMD Developer Central)
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit-trevett
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Neil Trevett, President of the Khronos Group and Vice President at NVIDIA, presents the "APIs for Accelerating Vision and Inferencing: An Industry Overview of Options and Trade-offs" tutorial at the May 2019 Embedded Vision Summit.
The landscape of SDKs, APIs and file formats for accelerating inferencing and vision applications continues to evolve rapidly. Low-level compute APIs, such as OpenCL, Vulkan and CUDA are being used to accelerate inferencing engines such as OpenVX, CoreML, NNAPI and TensorRT, being fed by neural network file formats such as NNEF and ONNX.
Some of these APIs, like OpenCV, are vision-specific, while others, like OpenCL, are general-purpose. Some engines, like CoreML and TensorRT, are supplier-specific, while others such as OpenVX, are open standards that any supplier can adopt. Which ones should you use for your project? Trevett answers these and other questions in this presentation.
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su... (Intel® Software)
Software AI accelerators deliver orders-of-magnitude performance gains for AI across deep learning, classical machine learning, and graph analytics, and are key to enabling AI Everywhere. Get started on your AI Developer Journey @ software.intel.com/ai.
Time Critical Multitasking for Multicore (ijesajournal)
This paper presents research on multicore microcontrollers using parallel and time-critical programming for embedded systems. Due to their high complexity and limitations, such architectures are very hard to work with during the application development phase. The experimental results reported in the paper are based on the xCORE multicore microcontroller from XMOS®. The paper also demonstrates multitasking and parallel programming on the same platform. Tasks assigned to multiple cores are executed simultaneously, which saves time and energy. A comparative study of multicore processors and multicore controllers concludes that a microarchitecture-based controller with multiple cores delivers better performance in time-critical multitasking environments. The work presented here not only illustrates the functionality of the multicore microcontroller but also describes a novel technique for programming, profiling, and optimization on such platforms in real-time environments.
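The xCORE model above statically assigns tasks to hardware cores that run simultaneously. As a loose analogy only (Python gives no real-time guarantees, and the task names below are invented), the stdlib executor shows the same submit-tasks-and-join pattern:

```python
# Loose analogy for assigning independent tasks to workers and joining
# on their results, as a multicore scheduler assigns tasks to cores.
from concurrent.futures import ThreadPoolExecutor

def sample_sensor(sensor_id):
    # Stand-in for a task that would be pinned to one core;
    # the "reading" is fake and deterministic.
    return sensor_id, sensor_id * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    # All four tasks are submitted at once and run concurrently.
    readings = dict(pool.map(sample_sensor, range(4)))
```

On the xCORE, this dispatch is done in hardware with deterministic timing, which is precisely what a general-purpose OS-level pool cannot promise.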
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/sept-2014-member-meeting-linley
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Linley Gwennap, founder and principal analyst of The Linley Group, delivers the presentation "Processors for Embedded Vision: Technology and Market Trends" at the September 2014 Embedded Vision Alliance Member Meeting.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/amd/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Allen Rush, Fellow at AMD, presents the "How Computer Vision Is Accelerating the Future of Virtual Reality" tutorial at the May 2016 Embedded Vision Summit.
Virtual reality (VR) is the new focus for a wide variety of applications including entertainment, gaming, medical, science, and many others. The technology driving the VR user experience has advanced rapidly in the past few years, and it is now poised to proliferate into these applications with solid products that offer a range of cost, performance and capabilities. The next question is: how does computer vision intersect this emerging modality? Already we are seeing examples of the integration of computer vision and VR, for example for simple eye tracking and gesture recognition. This talk explores how we can expect more complex computer vision capabilities to become part of the VR landscape and the business and technical challenges that must be overcome to realize these compelling capabilities.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2015-embedded-vision-summit-opencv
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Gary Bradski, President and CEO of the OpenCV Foundation, presents the "OpenCV Open Source Computer Vision Library: Latest Developments" tutorial at the May 2015 Embedded Vision Summit.
OpenCV is an enormously popular open source computer vision library, with over 9 million downloads. Originally used mainly for research and prototyping, in recent years OpenCV has increasingly been used in deployed products on a wide range of platforms from cloud to mobile.
The latest version, OpenCV 3.0 is currently in beta, and is a major overhaul, bringing OpenCV up to modern C++ standards and incorporating expanded support for 3D vision. The new release also introduces a modular “contrib” facility that enables independently developed modules to be quickly integrated with OpenCV as needed, providing a flexible mechanism to allow developers to experiment with new techniques before they are officially integrated into the library.
In this talk, Gary Bradski, head of the OpenCV Foundation, provides an insider’s perspective on the new version of OpenCV and how developers can utilize it to maximum advantage for vision research, prototyping, and product development.
Mobile computer vision requires deep SoC-based optimization and an extensive amount of development resources. This presentation reviews the challenges of mobile computer vision optimization, the vision for a cross-platform API, and the current solution of using FastCV.
Hai Tao at AI Frontiers: Deep Learning For Embedded Vision System (AI Frontiers)
This presentation will demonstrate our recent progress in developing advanced computer vision algorithms using embedded platforms for video-based face recognition, vehicle attribute analysis, urban management event detection, and high-density crowd counting. These algorithms combine the traditional CV approach with recent advances in deep learning to make high-performance computer vision systems practical and enable products in several vertical markets including intelligent transportation systems (ITS), business intelligence (BI), and smart video surveillance. We will demonstrate algorithm design and optimization scheme for several recently available processors from Movidius, Nvidia, and ARM.
Using GPUs to Handle Big Data with Java (Tim Ellison)
A copy of the slides presented at JavaOne conference 2014.
Learn how Java can exploit the power of graphics processing units (GPUs) to optimize high-performance enterprise and technical computing applications such as big data and analytics workloads. This presentation covers principles and considerations for GPU programming from Java and looks at the software stack and developer tools available. It also presents a demo showing GPU acceleration and discusses what is coming in the future.
Various virtualization technologies have been on the market for more than a decade, but they typically occupied cloud platforms. Recently, virtualization began spreading to embedded platforms after ARM introduced the Virtualization Extension for its recent processors. Various peripherals (like disks and network interfaces) have been easily virtualized for use by several operating systems at once, but components like graphics processing units (GPUs) remain among the most intricate to adapt, with very few vendors who have actually managed to do it.
Sergiy Kibrik (Software Engineer, GlobalLogic) explains how it was done at GlobalLogic. This presentation was delivered at the GlobalLogic Embedded TechTalk in Kyiv on July 22, 2015.
Checkpointing the Un-checkpointable: MANA and the Split-Process Approach (inside-BigData.com)
In this deck from the MVAPICH User Group, Gene Cooperman from Northeastern University presents: Checkpointing the Un-checkpointable: MANA and the Split-Process Approach.
"Checkpointing is the ability to save the state of a running process to stable storage, and later restarting that process from the point at which it was checkpointed. Transparent checkpointing (also known as system-level checkpointing) refers to the ability to checkpoint a (possibly MPI-parallel or distributed) application, without modifying the binaries of that target application. Traditional wisdom has assumed that the transparent checkpointing approach has some natural restrictions. Examples of long-held restrictions are: (i) the need for a separate network-aware checkpoint-restart module for each network that will be targeted (e.g., one for TCP, one for InfiniBand, one for Intel Omni-Path, etc.); (ii) the impossibility of transparently checkpointing a CUDA-based GPU application that uses NVIDIA UVM (UVM is "unified virtual memory", which allows the host CPU and the GPU device to each access the same virtual address space at the same time.); and (iii) the impossibility of transparently checkpointing an MPI application that was compiled for one MPI library implementation (e.g., for MPICH or for Open MPI), and then restarting under an MPI implementation with targeted optimizations (e.g., MVAPICH2-X or MVAPICH2-EA). This talk breaks free from the restrictions described above, and presents an efficient, new software architecture: split processes. The "MANA for MPI" software demonstrates this split-process architecture. The MPI application code resides in "upper-half memory", and the MPI/network libraries reside in "lower-half memory". The tight coupling of upper and lower half ensures low runtime overhead. And yet, when restarting from a checkpoint, "MANA for MPI" allows one to choose to replace the original lower half with a different MPI library implementation. 
This different MPI implementation may offer such specialized features as enhanced intra- and inter-node point-to-point performance and enhanced performance of collective communication (e.g., with MVAPICH2-X); or perhaps better energy awareness (e.g., with MVAPICH2-EA). Further, the new lower half MPI may be optimized to run on different hardware, including a different network interconnect, a different number of CPU cores, a different configuration of ranks-per-node, etc. This makes cross-cluster migration both efficient and practical. This talk represents joint work with Rohan Garg and Gregory Price."
Watch the video: https://wp.me/p3RLHQ-kMn
Learn more: http://mug.mvapich.cse.ohio-state.edu/program/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
ScicomP 2015 presentation discussing best practices for debugging CUDA and OpenACC applications with a case study on our collaboration with LLNL to bring debugging to the OpenPOWER stack and OMPT.
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers the on-demand sessions from the OpenACC Summit 2020, upcoming GPU Hackathons and Bootcamps, an OpenACC-to-FPGA framework, the NERSC GPU Hackathon, new resources and more!
Stay up-to-date on the latest news, events and resources for the OpenACC community. This month’s highlights covers working on applications for the new Frontier supercomputer, using OpenACC for weather forecasting, upcoming GPU Hackathons and Bootcamps, and new resources!
Engineering software is widely employed for its powerful abstraction of scientific and technical knowledge. It enables productive applications, e.g., analysis, prototyping, and manufacturing. Making engineering software requires a profound understanding in the problem domain, as well as the art of engineering it.
Software engineering differs substantially from conventional engineering. To professionally build software, mathematicians, scientists, and engineers need skills including system administration, automatic build, automatic testing, version control, to name but a few. Computer science knowledge like algorithms and data structures is also indispensable. It is a joyful, interdisciplinary, and world-changing enterprise worth sharing with all future engineering practitioners.
Similar to 2014/07/17 Parallelize computer vision by GPGPU computing (20)
This is an academic talk for professors and graduate students. In addition to introducing recent trends in embedded computer vision (ECV), I also present our research experience in ECV.
My slides for acamedia talk about embedded vision in 2010. Some of our research results are also presented in this presentation.
Few slides have chinese characters.
It is a presentation for acamedia talk about cloud computing for intelligent video surveillance, i.e. VSaaS, given in 2010. Some of our research results are also presented in this presentation.
It is a presentation for acamedia talk about intelligent video surveillance and video sousveillance given in 2010. Some of our research results are also presented in this presentation.
More from IEEE International Conference on Intelligent Information Hiding and Multimedia Signal Processing (16)
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdffxintegritypublishin
Advancements in technology unveil a myriad of electrical and electronic breakthroughs geared towards efficiently harnessing limited resources to meet human energy demands. The optimization of hybrid solar PV panels and pumped hydro energy supply systems plays a pivotal role in utilizing natural resources effectively. This initiative not only benefits humanity but also fosters environmental sustainability. The study investigated the design optimization of these hybrid systems, focusing on understanding solar radiation patterns, identifying geographical influences on solar radiation, formulating a mathematical model for system optimization, and determining the optimal configuration of PV panels and pumped hydro storage. Through a comparative analysis approach and eight weeks of data collection, the study addressed key research questions related to solar radiation patterns and optimal system design. The findings highlighted regions with heightened solar radiation levels, showcasing substantial potential for power generation and emphasizing the system's efficiency. Optimizing system design significantly boosted power generation, promoted renewable energy utilization, and enhanced energy storage capacity. The study underscored the benefits of optimizing hybrid solar PV panels and pumped hydro energy supply systems for sustainable energy usage. Optimizing the design of solar PV panels and pumped hydro energy supply systems as examined across diverse climatic conditions in a developing country, not only enhances power generation but also improves the integration of renewable energy sources and boosts energy storage capacities, particularly beneficial for less economically prosperous regions. Additionally, the study provides valuable insights for advancing energy research in economically viable areas. 
Recommendations included conducting site-specific assessments, utilizing advanced modeling tools, implementing regular maintenance protocols, and enhancing communication among system components.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
The Internet of Things (IoT) is a revolutionary concept that connects everyday objects and devices to the internet, enabling them to communicate, collect, and exchange data. Imagine a world where your refrigerator notifies you when you’re running low on groceries, or streetlights adjust their brightness based on traffic patterns – that’s the power of IoT. In essence, IoT transforms ordinary objects into smart, interconnected devices, creating a network of endless possibilities.
Here is a blog on the role of electrical and electronics engineers in IOT. Let's dig in!!!!
For more such content visit: https://nttftrg.com/
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it's thought to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of general workflow and administration process of the shop. The main processes of the system focus on customer's request where the system is able to search the most appropriate products and deliver it to the customers. It should help the employees to quickly identify the list of cosmetic product that have reached the minimum quantity and also keep a track of expired date for each cosmetic product. It should help the employees to find the rack number in which the product is placed.It is also Faster and more efficient way.
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...Amil Baba Dawood bangali
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...ssuser7dcef0
Power plants release a large amount of water vapor into the
atmosphere through the stack. The flue gas can be a potential
source for obtaining much needed cooling water for a power
plant. If a power plant could recover and reuse a portion of this
moisture, it could reduce its total cooling water intake
requirement. One of the most practical way to recover water
from flue gas is to use a condensing heat exchanger. The power
plant could also recover latent heat due to condensation as well
as sensible heat due to lowering the flue gas exit temperature.
Additionally, harmful acids released from the stack can be
reduced in a condensing heat exchanger by acid condensation. reduced in a condensing heat exchanger by acid condensation.
Condensation of vapors in flue gas is a complicated
phenomenon since heat and mass transfer of water vapor and
various acids simultaneously occur in the presence of noncondensable
gases such as nitrogen and oxygen. Design of a
condenser depends on the knowledge and understanding of the
heat and mass transfer processes. A computer program for
numerical simulations of water (H2O) and sulfuric acid (H2SO4)
condensation in a flue gas condensing heat exchanger was
developed using MATLAB. Governing equations based on
mass and energy balances for the system were derived to
predict variables such as flue gas exit temperature, cooling
water outlet temperature, mole fraction and condensation rates
of water and sulfuric acid vapors. The equations were solved
using an iterative solution technique with calculations of heat
and mass transfer coefficients and physical properties.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Technical Drawings introduction to drawing of prisms
2014/07/17 Parallelize computer vision by GPGPU computing
1. Parallelize Computer Vision by GPGPU Computing
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系)
ykwang@mail.fju.edu.tw
http://www.ykwang.tw
2014/07/17
2. About this Course
❖ Multicore Era for Computer Vision
❖ GPGPU
❖ Parallel Programming (CUDA, OpenCL, Renderscript)
❖ OpenCV Acceleration with GPGPU
❖ Computer Vision Acceleration
3. Multicore Era for Computer Vision
Paradigm shift from the clock-speed race to the multicore race
4. Multicore Computing
❖ What is multicore?
• Combining multiple processors (CPU, DSP, GPGPU, FPGA) into a single chip
❖ Multicore computing is inevitable
5. Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder) predicted
• The number of transistors on an IC would double every 18 months
❖ The well-known corollary
• The performance of computers doubles every 18 months
• More transistors → more performance
❖ The prediction held true for Intel's CPUs for 40 years
6. Review of Moore's Law
❖ Transistors per chip did increase
Software enjoys the fruits of hardware's labour.
7. Problems
❖ More transistors need high frequency
• We entered the clock-speed race
❖ But high frequency needs high power consumption
• High power consumption → heat problem
• Clock speeds stalled at around 4 GHz
8. Paradigm Shift from 2000 AD
❖ General-purpose multicore comes of age
❖ Chip companies race to create multicore processors
• CPU: Intel Core Duo, quad-core, ARM v7, ...
• DSP: TI OMAP, ARM NEON, ...
• GPU/GPGPU:
• nVidia: GeForce/Tesla, Tegra
• ARM: Mali-T6x
• ...
9. The Multicore Evolution
• Pentium processor: optimized for a single thread
• In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution (e.g., Core Duo and beyond)
From a large mono-core to multiple lightweight cores
10. Moore's Law Needs Multicore
❖ A single core can no longer track Moore's law
❖ Multicore can track Moore's law if a parallel programming model exists
[Chart: performance vs. time; the single-core curve flattens while the multicore curve keeps rising]
11. Two Architectures for Multicore
❖ Symmetric multiprocessing (SMP)
• Multicore CPU, GPGPU, DSP multicore
• Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
• CPU+GPGPU, CPU+FPGA, CPU+DSP
• Heterogeneous computing
12. Multicore CPU (1/2)
❖ Two or more CPU cores in a chip
❖ Example: Intel Core i7 (multiple execution cores)
14. GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
• The processor on a graphics card that speeds up 3D graphics
• Game playing is a major application
❖ GPGPU: General-Purpose GPU
• General-purpose computation using a GPU in applications other than 3D graphics
15. GPGPU (2/2)
❖ A GPGPU has more cores than a CPU
• 120~3072 cores vs. 2~8 cores (many-core vs. multi-core)
❖ A GPGPU is more powerful than a multicore CPU
❖ Vendors:
• nVidia
• Qualcomm (Adreno, formerly AMD/ATI)
• ARM
• Intel
16. It is the Software, Stupid
❖ Gary Smith and Daya Nadamuni, Gartner Dataquest, Design Automation Conf., 2006:
❖ "The biggest problem with SoC design is embedded software development."
❖ "The next big hurdle is programmability. It's the ability to program these multicore platforms."
❖ "You can have elegant algorithms, first-pass silicon, and fancy intellectual property. But without software, the product goes nowhere."
20. A Complete Vision System – Video Surveillance as an Example
[Pipeline: Video Capture → Image Enhancement → Object/Event Detection → Object Tracking → Object/Event Recognition → Behavior Analysis / Retrieval]
[Examples shown: imaging, image/video enhancement, tripwire event detection, abnormal-behavior detection, face recognition, retrieval]
21. Computer Vision Needs High-Performance Computing
❖ A CV example: video processing
• e.g., intelligent video surveillance
❖ Its complexity is high
• Video (1080p RGB): ~6 M values per frame (1920×1080×3 channels), 30 fps
• 100~1K flops per value
• ⇒ 18~180 Gflops per second
❖ Massive data processing, intensive computation
23. However
❖ Can CV algorithms speed up every 18 months with multicore?
❖ Multicore is not a simple drop-in upgrade for CV algorithm performance
• The transition from single core to multicore is blocked by software
• We are not ready to face the software programming challenges
• It is the software, stupid.
25. Multi-threading Demands New Programming Skills
❖ Established multi-threading techniques
• Windows threads, pthreads, OpenMP, MPI, ...
❖ New techniques
• CUDA, C++ AMP, OpenCL, Renderscript, OpenACC, MapReduce, ...
❖ Concepts
• Race conditions, deadlock
• Domain partition, function partition, ...
26. Multicore Programming Practice (MPP)
❖ Goal: write portable C/C++ programs that are "multicore ready" and platform compatible
• Proposed by the MPP working group in the Multicore Association
http://www.multicore-association.org/workgroup/mpp.php
27. OpenACC
❖ A standards organization that develops an API
• It describes a collection of compiler directives
• to specify loops and regions of code in standard C, C++ and Fortran
• to be offloaded from a host CPU to an attached accelerator, including APUs, GPUs, and many-core coprocessors
28. HSA Foundation
❖ Heterogeneous System Architecture
• Key members: AMD, Qualcomm, ARM, Samsung, TI
❖ A system architecture easing efficient use of accelerators and SoCs
• Intended to support high-level parallel programming frameworks
• OpenCL, C++, C#, OpenMP, Java
• Accelerator requirements: full-system SVM, memory coherency, preemption, user-mode dispatch
29. The ParLab in Berkeley
❖ The Parallel Computing Lab at UC Berkeley (http://parlab.eecs.berkeley.edu)
• The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers.
31. OpenCL
❖ A royalty-free, cross-platform, cross-vendor standard
• Targeting: supercomputers → embedded systems → mobile devices
❖ Enables programming of diverse compute resources
• CPU, GPU, DSP, FPGA, ...
32. OpenCL Working Group Members
❖ Diverse industry participation with many industry experts
❖ NVIDIA is chair; Apple is specification editor
33. Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
• Vendors, hardware
❖ How to program in parallel (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
37. PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor (attached via PCIe)
From PC to PSC (Personal Super-Computer)
38. Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor (CPU and GPGPU on the same SoC, no PCIe)
From mobile phone to mobile personal computer
39. GPGPU Solutions - nVidia
• Compute architectures: Tesla, Fermi, Kepler, ...
• PC
• GeForce, Quadro
• Tesla: 870, 1060, 2070, K40
• Mobile
• Tegra: ..., 4, K1 (192 cores)
It's Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.
40. GPGPU Solutions - Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC (AMD): FirePro
❖ Mobile (Snapdragon): Adreno 330 (32 cores)
41. GPGPU Solutions - ARM
❖ Mali
❖ Used in Samsung Exynos and MediaTek SoCs
❖ Compute engine since the Mali-T600 series
❖ Exynos 5
❖ At most 8 cores (Mali-T678)
42. Intel - Multicore CPU
• PC (Xeon Phi)
• Iris Pro GPU
• Knights Landing: ~60 cores
• Knights Corner: up to 61 cores, over PCIe
• Mobile
• Haswell
• Atom
44. Heterogeneous Architecture
❖ Host: CPU
❖ Device: GPGPU
❖ Note the memory hierarchy in the device
45. GPGPU Architectures - nVidia
❖ GT200
• GTX 260/280, Quadro FX 5800, Tesla 1060
❖ Fermi
• Tesla 2060
[Diagram: the CPU (host, multicore) devotes much of the die to control logic and cache; the GPU (device, many-core) devotes most of the die to ALUs; each has its own DRAM]
46. nVidia GPGPU Architecture
❖ SM/SP (streaming multiprocessor / streaming processor) + shared memory + DRAM
47. Memory Hierarchy
❖ On-Chip Memory
• Registers
• Shared Memory
• Constant Memory
• Texture Memory
❖ Off-Chip Memory
• Local Memory
• Global Memory
48. GPGPU vs. FPGA
❖ GPU: nVidia GeForce GTX 280, GTX 580
❖ FPGA: Xilinx Virtex-4, Virtex-5
A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features, IEEE Transactions on Computers, 2012.
49. GPGPU vs. FPGA
❖ GPU: nVidia GeForce 7900 GTX
❖ FPGA: Xilinx Virtex-4
Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 2010.
50. GPGPU vs. FPGA vs. Multicore
❖ Application: 2-D image convolution
• GPU: nVidia GeForce 295 GTX
• FPGA: Altera Stratix III E260
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, ACM/SIGDA International Symposium on FPGA, 2012.
52. Hardware vs. Software
• GPGPU vendors: nVidia, Qualcomm, ARM, Intel
• Parallel programming: CUDA, OpenCL, Renderscript, C++ AMP
53. Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to program in parallel (Sec. 3)
• CUDA, Renderscript, OpenCL, ...
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
55. Parallel Computing
❖ Serial computing: one instruction stream on a single core
❖ Parallel computing: work split across multiple cores (CPU/GPU)
56. Parallel Programming
❖ Many codes are written in C/C++/Java
• Especially algorithmic programs
❖ Can we write GPGPU parallel programs in C/C++/Java?
❖ However, C/C++ is sequential
• The three control structures of C/C++/Java: sequence, selection, repetition
57. Multi-threading
❖ Multi-threading is the fundamental concept of parallel programming
• Established techniques: pthreads, Win32 threads, OpenMP, MPI, Intel TBB (Threading Building Blocks), ...
• New techniques: CUDA, OpenCL, Renderscript, OpenACC, C++ AMP, ...
59. Parallel Programming in a Sequential Language
❖ Do we need to learn new languages for multi-threading? No.
❖ Write multi-threading code in C/C++
• Add functions/directives to C/C++ for multi-threading
• That is what current solutions do: pthreads, Win32 threads, OpenMP, MPI, CUDA, OpenCL, ...
60. Decompose the Problem
❖ Two basic approaches to partitioning computational work
• Domain decomposition: partition the data used in solving the problem
• Function decomposition: partition the jobs (functions) that make up the overall work; e.g., CPU and GPGPU cooperate
61. Multi-Threading
❖ A program running in serial vs. in parallel
http://en.wikipedia.org/wiki/Thread_(computer_science)
62. Domain Decomposition (1/3)
❖ An image example
• An image is 2D data
• Three popular ways to partition it
63. Domain Decomposition (2/3)
❖ Domain data are usually processed by loops:
  for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: X-ray image of a circuit board; original image (img1) and enhanced image (img2)]
Related execution models: SIMD, SPMD, SIMT
64. Domain Decomposition (3/3)
❖ A three-block partition example (fork threads, then join at a barrier):
  // Thread 1: rows [0, height/3)
  for (i = 0; i < height/3; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
  // Thread 2: rows [height/3, 2*height/3)
  for (i = height/3; i < height*2/3; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
  // Thread 3: rows [2*height/3, height)
  for (i = height*2/3; i < height; i++)
    for (j = 0; j < width; j++)
      img2[i][j] = RemoveNoise(img1[i][j]);
The same fork/join pattern underlies OpenMP parallel loops and CUDA's SPMD model (e.g., rows i = 0..11 split into subdomains 1, 2, 3).
65. GPGPU Programming: the SIMT Model
❖ The CPU ("host") program is often written in C or C++
❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect
69. CUDA
❖ CUDA: Compute Unified Device Architecture
❖ Parallel programming for nVidia's GPGPUs
❖ Uses the C/C++ language
• Java, Fortran, and Matlab are also supported
❖ When executing CUDA programs, the GPU operates as a coprocessor to the main CPU
70. CUDA Hardware Environment: CPU+GPU
❖ CPU
• Organizes, interprets, and communicates information
❖ GPU
• Handles the core processing on large quantities of parallel information
• Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU
(CPU ↔ GPU over PCI-E)
72. Processing Flow on CUDA
1. Allocate device memory
2. Copy the processing data from main memory to GPU memory
3. Instruct the processing (launch the kernel)
4. Execute in parallel on each GPU core
5. Copy the result back to main memory
6. Release device memory
73. Programming with the Memory Hierarchy
❖ The locality principle
• Temporal locality
• Spatial locality
74. Example - Hello World (1/3)
  int main()
  {
      char src[12] = "Hello World";
      char h_hello[12];
      char* d_hello1;
      char* d_hello2;
      cudaMalloc((void**)&d_hello1, sizeof(char) * 12);
      cudaMalloc((void**)&d_hello2, sizeof(char) * 12);
      cudaMemcpy(d_hello1, src, sizeof(char) * 12,
                 cudaMemcpyHostToDevice);
      hello<<<1,1>>>(d_hello1, d_hello2);  // call the kernel function
(Host side: src, h_hello; device side: d_hello1, d_hello2)
75. Example - Hello World (2/3)
❖ Kernel function
  __global__ void hello(char* hello1, char* hello2)
  {
      int k;
      for (k = 0; hello1[k] != '\0'; k++) {  // copy until the terminator
          hello2[k] = hello1[k];
      }
  }
No parallel processing in this example.
79. What's OpenCL
❖ One code tree can be executed on CPUs, GPUs, DSPs and hardware
• Dynamically interrogate system load and balance work across available processors
❖ Powerful, low-level flexibility
• Foundational access to compute resources for higher-level engines, frameworks and languages
80. Broad OpenCL Implementer Adoption
❖ Multiple conformant implementations shipping on desktop and mobile
❖ Android ICD extension released in the latest extension specification
❖ Multiple implementations shipping in the Android NDK
84. AMD OpenCL Optimization Case Study
❖ Platform
• AMD Phenom II X4 965 CPU (quad core)
• ATI Radeon HD 5870 GPU
❖ Unoptimized CPU performance: 1 GFLOP/s
❖ Optimized CPU performance: 4 GFLOP/s
❖ Optimized GPU performance: 50 GFLOP/s
90. What's C++ AMP (1/2)
❖ Microsoft's C++ AMP (Accelerated Massive Parallelism)
• Part of Visual C++, integrated with Visual Studio, built on Direct3D
• "Performance for the mainstream"
❖ An STL-like library for multidimensional array data
• Special convenience support for 1-, 2-, and 3-dimensional arrays on CPU or GPU
• The C++ AMP runtime handles CPU↔GPU data copying
• Tiles enable efficient processing of sub-arrays
91. What's C++ AMP (2/2)
❖ parallel_for_each
• Executes a kernel (a C++ lambda) at each point in the extent
• The restrict() clause specifies where to run the kernel: cpu (the default) or direct3d (the GPU)
95. What's Renderscript (1/2)
❖ Higher-level than CUDA or OpenCL: simpler, with less performance control
• Emphasis on mobile devices and cross-SoC performance portability
❖ Programming model
• C99-based kernel language, JIT-compiled, single input-single output
• Automatic Java class reflection
• Intrinsics: built-in, highly tuned operations, e.g. ScriptIntrinsicConvolve3x3
• Script groups combine kernels to amortize launch cost and enable kernel fusion
96. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p.
What's Renderscript(2/2)
❖ Data type:
• 1D/2D collections of elements, C types like int
and short2, types include size
• Runtime type checking
❖ Parallelism
• Implicit: one thread per data element,
atomics for thread-safe access
• Thread scheduling not exposed, VM-decided
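Renderscript's implicit parallelism (one thread per data element, with scheduling decided by the runtime) can be mimicked in plain Python. This is only an illustrative analogy, not Renderscript itself; the `kernel` function and data are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    # per-element "kernel": invoked once per data element,
    # mirroring Renderscript's one-thread-per-element model
    return x * x

data = [0, 1, 2, 3, 4, 5, 6, 7]
with ThreadPoolExecutor() as pool:
    # the pool decides the scheduling, as the Renderscript VM does
    out = list(pool.map(kernel, data))
```

Because each element is processed independently, no atomics are needed here; atomics only matter when kernels write to shared locations.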
Comparison (1/2)
❖Renderscript vs. Native (NDK) vs. Java (SDK)
• OS: Android Honeycomb v3.2 (CPU only)
Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." in Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC), 2012.
Comparison(2/2)
❖OpenCL & CUDA
• Sobel filter with (CMw) and without (CMw/o)
constant memory
OpenCL’s portability does not
fundamentally affect its performance
Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A
comprehensive performance comparison of CUDA and OpenCL." in
Proc. International Conference Parallel Processing (ICPP), 2011.
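For reference, the horizontal-gradient pass of the benchmarked Sobel kernel can be written as a plain-Python CPU baseline. This is only a sketch of the operation being accelerated; the CUDA/OpenCL versions assign one thread per pixel and may cache the 3x3 kernel in constant memory:

```python
def sobel_x(img):
    # horizontal-gradient Sobel pass over a 2-D list of gray values;
    # border pixels are left at 0 for simplicity
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(kx[j][i] * img[y - 1 + j][x - 1 + i]
                            for j in range(3) for i in range(3))
    return out
```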
GPGPU Programming
Performance: more control, better performance
Productivity: ease of use, quick programming, portability
Parallelization
❖ Multicore/Multi-threading
❖ Data Parallelization
• Data distribution
• Parallel convolution
• Reduction algorithm
• Amdahl's law
❖ Memory Hierarchy Management
• Locality principle: a program accesses a relatively small portion of the address space at any instant of time
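Amdahl's law in the list above caps the achievable speedup by the serial fraction of the program. A minimal sketch, where the 95% parallel fraction and 240 processors are illustrative numbers, not figures from the slides:

```python
def amdahl_speedup(p, n):
    # Amdahl's law: overall speedup when a fraction p of the runtime
    # is parallelized across n processors; the (1 - p) serial part
    # bounds the gain no matter how large n grows
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.95, 240))  # serial 5% keeps the speedup below 20x
```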
Multi-thread Programming with
the Discipline of Parallelization
❖ Identify parallelism: Analyze algorithm
❖ Express parallelism: Write parallel code
❖ Validate parallelism: Debug & verify parallel code
❖ Optimize parallelism: Enhance parallel performance
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to program in parallel (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
OpenCV GPU Module
❖Implemented using NVIDIA CUDA
Runtime API
❖Latest version: 2.4.9
• Utilizing Multiple GPUs
❖Implemented modules: 11
❖Implemented functions: 270
Focused on the PC platform
Not fully compatible with mobile GPGPU on Android
CUDA De-noising
❖Gaussian noise removal
• gpu::FastNonLocalMeansDenoising()
❖Edge preserving smoothing
• gpu::bilateralFilter()
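gpu::bilateralFilter() accelerates the classic edge-preserving smoother. The 1-D plain-Python reference below shows the two weights it combines, spatial distance and intensity difference; this is a sketch of the algorithm, not OpenCV's implementation, and the sigma values are arbitrary:

```python
import math

def bilateral_1d(signal, sigma_s=1.0, sigma_r=30.0, radius=2):
    # each output sample is a normalized average weighted by both
    # spatial closeness and intensity similarity, so strong edges
    # contribute little across the discontinuity and are preserved
    out = []
    for i, center in enumerate(signal):
        wsum = vsum = 0.0
        for d in range(-radius, radius + 1):
            j = min(max(i + d, 0), len(signal) - 1)  # clamp at borders
            w = (math.exp(-d * d / (2 * sigma_s ** 2)) *
                 math.exp(-(signal[j] - center) ** 2 / (2 * sigma_r ** 2)))
            wsum += w
            vsum += w * signal[j]
        out.append(vsum / wsum)
    return out
```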
CUDA Fourier and MeanShift
❖Fourier analysis
•gpu::dft(), ::convolve(),
::mulAndScaleSpectrums(), etc.
❖MeanShift
•gpu::meanShiftFiltering(),
::meanShiftSegmentation()
CUDA Shape Detection
❖Line detection (e.g., lane detection, building
detection, perspective correction)
• gpu::HoughLines(), ::HoughLinesDownload()
❖Circle detection (e.g., cells, coins, balls)
• gpu::HoughCircles(),
::HoughCirclesDownload()
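The Hough transform behind gpu::HoughLines() is a voting scheme: every edge point votes for all (theta, rho) parameter pairs of lines passing through it, and peaks in the accumulator are the detected lines. A coarse plain-Python sketch with a hypothetical 4-angle quantization:

```python
import math

def hough_lines(points, thetas_deg=(0, 45, 90, 135)):
    # accumulate votes: each point votes for every quantized line
    # rho = x*cos(theta) + y*sin(theta) that passes through it
    acc = {}
    for (x, y) in points:
        for t in thetas_deg:
            th = math.radians(t)
            rho = round(x * math.cos(th) + y * math.sin(th))
            acc[(t, rho)] = acc.get((t, rho), 0) + 1
    # return the strongest line as (theta_deg, rho)
    return max(acc, key=acc.get)
```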
CUDA Object Detection
❖HAAR and LBP cascaded adaptive boosting
(e.g., face, nose, eyes, mouth)
• gpu::CascadeClassifier_GPU::detectMulti
Scale()
❖HOG detector (e.g., person, car, fruit, hand)
• gpu::HOGDescriptor::detectMultiScale()
CUDA Object Recognition
❖Interest point detectors
• gpu::cornerHarris(), ::cornerMinEigenVal(),
::SURF_GPU, ::FAST_GPU, ::ORB_GPU(),
::GoodFeaturesToTrackDetector_GPU()
❖Feature matching
• gpu::BruteForceMatcher_GPU(),
::BFMatcher_GPU()
CUDA Stereo and 3D
❖RANSAC
• gpu::solvePnPRansac()
❖Stereo correspondence (disparity map)
• gpu::StereoBM_GPU(),
::StereoBeliefPropagation(),
::StereoConstantSpaceBP(),
::DisparityBilateralFilter()
❖Represent stereo disparity as 3D or 2D
• gpu::reprojectImageTo3D(),
::drawColorDisp()
CUDA Optical Flow
❖Dense/sparse optical flow
gpu::FastOpticalFlowBM(),
::PyrLKOpticalFlow, ::BroxOpticalFlow(),
::FarnebackOpticalFlow(),
::OpticalFlowDual_TVL1_GPU(),
::interpolateFrames()
CUDA Background
Segmentation
❖Foreground/background segmentation (e.g.,
object detection/removal, motion tracking,
background removal)
• gpu::FGDStatModel, ::GMG_GPU,
::MOG_GPU, ::MOG2_GPU
5. Computer Vision
Acceleration on PC
Image enhancement (HDR)
Feature extraction
Video surveillance cloud
HDR Image Enhancement
❖ Restore and enhance an image
❖ Its complexity is high for large images: O(N^2) ~ O(N^3)
[Figure: original vs. restored image]
Algorithms for
Image Restoration
❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization,
Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex: no iterative process, suitable for parallelization
• Multi-Scale Retinex with Color Restoration (MSRCR)
[Rahman et al. 1997]
MSRCR Algorithm

R_{MSRCR,i}(x,y) = C_i(x,y) \sum_{k=1}^{K} W_k \{ \log I_i(x,y) - \log [F_k(x,y) * I_i(x,y)] \}

C_i(x,y) = \beta \log \left[ \alpha I_i(x,y) \Big/ \sum_{j=1}^{N} I_j(x,y) \right]

• R_{MSRCR,i}: the MSRCR output in the ith spectral band
• I_i: the original image distribution in the ith spectral band
• F_k: the kth Gaussian Surround function
• *: the convolution operation
• W_k: the weight of the kth surround scale
• C_i: the color restoration factor in the ith spectral band
• N: the number of spectral bands
• \beta: the gain constant
• \alpha: controls the strength of the nonlinearity
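At a single surround scale, the center/surround Retinex output is log I(x) minus the log of the Gaussian-blurred image F * I. A 1-D plain-Python sketch (the weights W_k and the color restoration factor C_i are omitted; the kernel parameters are arbitrary):

```python
import math

def gaussian_kernel(sigma, radius):
    # normalized 1-D Gaussian surround function F
    k = [math.exp(-d * d / (2 * sigma ** 2)) for d in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def single_scale_retinex(signal, sigma=2.0, radius=4):
    # R(x) = log I(x) - log (F * I)(x): one surround scale of MSRCR
    kern = gaussian_kernel(sigma, radius)
    out = []
    for i in range(len(signal)):
        surround = sum(kern[d + radius] *
                       signal[min(max(i + d, 0), len(signal) - 1)]
                       for d in range(-radius, radius + 1))
        out.append(math.log(signal[i]) - math.log(surround))
    return out
```

On a uniform image the surround equals the pixel value, so the output is zero everywhere; only local contrast survives the subtraction.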
The Method
[Pipeline: copy data from CPU to GPGPU; on the GPGPU run Gaussian blur, log-domain processing, normalization, and histogram stretching; copy data from GPGPU back to CPU]
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm."
Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for
accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
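The final histogram-stretching stage can be sketched as a linear contrast stretch. This is a generic formulation, not necessarily the exact stretch used in the cited papers:

```python
def histogram_stretch(values, lo=0, hi=255):
    # linear contrast stretch: map the data range [min, max]
    # onto the display range [lo, hi]
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo] * len(values)  # flat input: nothing to stretch
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]
```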
Parallelization by GPGPU
❖ Multicore/Multi-threading
• Tesla C1060: 240 SPs (Stream Processors)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution: the image is partitioned into M-pixel blocks, one per processing element (PE), with 1-pixel borders exchanged between neighboring PEs
• Parallel reduction: A(0)..A(7) are summed pairwise (A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7)), then the partial sums are combined, reaching the total sum in log2(8) = 3 time steps
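The pairwise reduction illustrated on this slide can be sketched as follows; each loop iteration corresponds to one parallel time step, and the input length is assumed to be a power of two:

```python
def tree_reduce(a):
    # pairwise tree reduction: every step sums adjacent pairs and
    # halves the active array, so 8 elements finish in log2(8) = 3 steps
    a = list(a)
    while len(a) > 1:
        a = [a[2 * i] + a[2 * i + 1] for i in range(len(a) // 2)]
    return a[0]
```

On a GPU each pair-sum in a step runs on its own processing element; here the steps are sequential, but the data flow is identical.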
OpenCV4Android SDK
❖Enables development of Android applications with the OpenCV library
❖Uses the Java Native Interface (JNI) to directly access C code
❖Supports NVIDIA's Tegra Android Development Pack (TADP)
Not fully compatible with the GPU module
RenderScript Image Processing Intrinsics
Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers; used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.
Gaussian Blur Example
by RenderScript Intrinsic
RenderScript rs = RenderScript.create(theActivity);
// Blur intrinsic for 4-channel (RGBA) byte data
ScriptIntrinsicBlur theIntrinsic =
    ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
theIntrinsic.setRadius(25.f);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
tmpOut.copyTo(outputBitmap);
Performance of
RenderScript Intrinsics
❖On the new Nexus 7
❖Speedups relative to equivalent multithreaded C implementations
RenderScript Image
Processing Benchmarks(1/2)
❖CPU only on a Galaxy Nexus device.
Acceleration of Retinex Using
RenderScript
❖This paper presents rsRetinex, an optimized Retinex implementation using the Renderscript technique.
❖The experimental results show that rsRetinex gains up to a five-times speedup across different image resolutions.
Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image
Processing on Android Device Using Renderscript." in Proc. The 8th International
Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
Mobile GPGPU List
GPU | Adoption | OpenCL / CUDA | OpenCV | Renderscript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (Adreno 302~420) | OCL module | Android 4.0 or later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 or later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield, Tegra Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 or later (K1 only)
PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
Sources: AnandTech; "Nvidia CEO sees future in cars and gaming," CNet, 2014/5/19.
GPGPU
❖ Single-core → Multi-core → Many-core
❖PC
• nVidia Tesla + CUDA/OpenCV
❖Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu
Parallel Programming
❖C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖Java
• OpenCL, RenderScript
❖Notice that OpenCL and RenderScript are
• Not efficient in parallelization
• Efficient in CV algorithmic design
OpenCV Acceleration (1/2)
❖Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖Ver. 3.0 (2014/6)
• Transparent API for GPGPU
acceleration
OpenCL 2.0
❖Released in 2013
❖SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory
management
❖Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto
itself – no round trip to host required
❖Disadvantages
• Requires strong hardware support
• Not well supported by current GPGPUs
CUDA still Dominant in the
Near Future
❖ However, we have to manually parallelize
the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
• Filtering, frequency analysis, compression,
feature extraction, recognition, ...
• Theory, tools and methodology of parallel
computing
• Communication, synchronization, resource
management, load balancing, debugging, ...
GPUs for Multimedia
• 10X: Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA
• 10X: CUDA JPEG Decoder (DivideFrame GPU Decoder)
• 10X: Hyperspectral Image Compression on NVIDIA GPUs
• 26X: GPU Decoder (Vegas/Premiere): Using the Power of NVIDIA Graphic Cards to Decode H.264 Video Files
• 3.5X: PowerDirector7 Ultra
GPUs for Computer Vision (1/2)
• 87X: CUDA SURF, a real-time implementation for SURF (TU Darmstadt)
• 26X: Leukocyte Tracking, ImageJ Plugin (University of Virginia)
• 200X: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100X: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85X: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100X: Fast Optical Flow on GPU at Video Rate for Full HD Resolution (Onera)
• 8X: A Framework for Efficient and Scalable Execution of Domain-specific Templates on GPU (NEC Labs, Berkeley, Purdue)
• 13X: Accelerating Advanced MRI Reconstructions (University of Illinois)
GPUs for Computer Vision (2/2)
• 20X: GPU for Surveillance
• 13X: Fast Human Detection with Cascaded Ensembles
• 109X: Fast Sliding-Window Object Detection
• 263X: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 10X: Real-time Visual Tracker by Stream Processing
• 45X: A GPU Accelerated Evolutionary Computer Vision System
• 3X: Canny Edge Detection
• 300X: Audience Measurement, Real-time Video Analysis for Counting People, Face Detection and Tracking
Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved
Retinex algorithm." IEEE Conference on Computer Vision and
Pattern Recognition Workshops (CVPRW). 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel
algorithm for accelerating retinex." Journal of Real-Time Image
Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time
phase-based optical flow, stereo, and local image features."
Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A
review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors
to reconfigurable logic: a case study." Computers, IEEE
Transactions on 59.4 (2010): 433-448.
Readings (2/2)
❖ “Designing Visionary Mobile Apps Using the Tegra
Android Development Pack,” http://bit.ly/1jvwbgV
❖ “Getting Started With GPU-Accelerated Computer
Vision Using OpenCV and CUDA,”
http://bit.ly/1oMwJEG
❖ “The open standard for parallel programming of
heterogeneous systems,”
https://www.khronos.org/opencl/
OpenCV Acceleration
❖ GPU Module Introduction — OpenCV 2.4.9.0
documentation
❖ OpenCL Module Introduction - OpenCV documentation
❖ OpenCV-CL: Computer vision with OpenCL
acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with
OpenCV." Communications of the ACM 55.6 (2012):
61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated
framework for image processing and computer vision."
Advances in Visual Computing. Springer Berlin
Heidelberg, 2008. 430-439.
CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer,
https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA
Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer
Vision
http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP)
http://developer.nvidia.com/object/npp_home.html
OpenCL
❖ Khronos OpenCL specification, reference card, tutorials, etc:
http://www.khronos.org/opencl
❖ AMD OpenCL Resources:
http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources:
http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers.
IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1.2.
Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and
computation. Manning, 2012.
RenderScript
❖ RenderScript for Android Developer, Official web site
http://developer.android.com/guide/topics/renderscript/compute.ht
ml
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li.
"Comparison and analysis of the three programming
models in google android." First Asia-Pacific
Programming Languages and Compilers Workshop.
2012.
❖ "High Performance Apps Development with
RenderScript," 12th Kandroid Conference, 2013.
Web Sites and Resources
❖Embedded Vision Alliance,
http://www.embedded-vision.com
❖GPUComputing.Net,
http://www.gpucomputing.net
❖HSA Foundation, www.hsafoundation.com
Parallel Computing with
GPGPU
❖Programming Massively Parallel
Processors – A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html