Wang,	Yuan-Kai	(王元凱) Computer Vision Parallelization by GPGPU p.
Wang, Yuan-Kai (王元凱)
Electrical Engineering Department, Fu Jen Catholic
University (輔仁大學電機工程系)
ykwang@mail.fju.edu.tw		
http://www.ykwang.tw
2014/07/17
Parallelize Computer Vision
by GPGPU Computing
About this Course
❖ Multicore Era for Computer Vision
❖ GPGPU
❖ Parallel Programming
(CUDA, OpenCL, Renderscript)
❖ OpenCV Acceleration with GPGPU
❖ Computer Vision Acceleration
1. Multicore Era for
Computer Vision
Paradigm shift
from Clock Speed Race
to Multicore Race
Multicore Computing
❖ What Is Multicore
• Combine multiple processors
(CPU, DSP, GPGPU, FPGA)
into a single chip
❖ Multicore computing is inevitable
Moore's Law
❖ In 1965, Gordon Moore (Intel co-founder)
predicted
• The number of transistors on an IC would
double every year (revised in 1975 to every
two years)
❖ The well-known version of the law
• The performance of computers
doubles every 18 months
• More transistors
→ More performance
❖ The prediction held for
Intel's CPUs for about 40 years
Review of Moore's Law
❖ Transistors in a chip did increase
Software enjoys the fruits of hardware's labour.
Problems
❖ More performance from more transistors needed higher clock frequency
• We entered the Clock Speed Race
❖ But higher frequency means higher power
consumption
• High power consumption → Heat problem
• About 4 GHz has become the practical limit for clock speed
Paradigm Shift from 2000 AD
❖ General-purpose multicore
comes of age
❖ Chip companies race to create multicore
processors
• CPU: Intel Core Duo, Quad-core,
ARM v7, ...
• DSP: TI OMAP, ARM NEON, …
• GPU/GPGPU:
• nVidia: GeForce/Tesla, Tegra
• ARM: Mali-T6x
• …
The Multicore Evolution
Pentium processor: optimized for a single thread
Core Duo
In 5~10 years: 10~100 energy-efficient
cores optimized for
parallel execution
From large mono-core to multiple lightweight cores
Moore’s Law Needs Multicore
❖ Single core cannot fit Moore's law
❖ Multicore can fit Moore's law if a
parallel programming model exists
[Figure: performance over time, single core vs. multi-core]
Two Architectures
for Multicore
❖ Symmetric multiprocessing (SMP)
• Multicore CPU, GPGPU, DSP multicore
• Homogeneous computing
❖ Asymmetric multiprocessing (AMP)
• CPU+GPGPU,
CPU+FPGA,
CPU+DSP
• Heterogeneous computing
Multicore CPU (1/2)
❖ Two or more CPUs in a chip
❖ Ex.: Intel Core i7
Multiple execution cores
Multicore CPU (2/2)
❖ Windows Task Manager
Two cores vs. eight cores
GPGPU (1/2)
❖ GPU (Graphics Processing Unit)
• The processor on a graphics card that speeds
up 3D graphics
• Game playing is a major application
❖ GPGPU: General-Purpose GPU
• General purpose computation using
GPU in applications other than 3D
graphics
GPGPU (2/2)
❖ GPGPU has more cores than CPU
• 120 ~ 3072 cores vs. 2 ~ 8 cores
(Many-core vs. Multi-core)
❖ For massively parallel workloads, GPGPU offers
higher throughput than a multicore CPU
❖ Vendors:
• nVidia
• Qualcomm
(AMD, ATI)
• ARM
• Intel
It is the Software, Stupid
❖ Gary Smith and Daya Nadamuni, Gartner
Dataquest, Design Automation Conf., 2006
❖ "The biggest problem with SoC design
is embedded software development."
❖ "The next big hurdle is
programmability. It's the ability to
program these multicore platforms."
❖ "You can have elegant algorithms,
first-pass silicon, and fancy intellectual
property. But without software, the
product goes nowhere."
Multicore Demands Threading
What Is Computer Vision
A Complete Vision System
– Video Surveillance as an Example
Pipeline: Video Capture → Image Enhancement → Object/Event Detection → Object Tracking → Object/Event Recognition → Behavior Analysis → Retrieval
Examples: Imaging, Image/Video Enhancement, Tripwire, Event Detection, Abnormal Detection, Face Recognition, Retrieval
Computer Vision Needs
High Performance Computing
❖ A CV example: video processing
• Intelligent video surveillance
❖ Its complexity is high
• Video (1080p RGB):
about 2 megapixels × 3 channels ≈ 6 million samples per frame, 30 fps
• 100 – 1K floating-point operations per sample
• ⇒ 18 – 180 GFLOPS
❖ Massive data processing
• Intensive computation
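The estimate on this slide can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming a 1920×1080 frame with 3 color channels and counting operations per color sample; the function names `required_gflops` and `print_video_budget` are hypothetical and not part of the course material.

```cpp
#include <cassert>
#include <cstdio>

// Estimated compute demand, in GFLOPS, of a video stream.
// All parameter names are illustrative assumptions.
double required_gflops(int width, int height, int channels,
                       double fps, double flops_per_sample)
{
    assert(width > 0 && height > 0 && channels > 0);
    // 1920 * 1080 * 3 is roughly 6.2 million samples per frame.
    const double samples_per_frame =
        static_cast<double>(width) * height * channels;
    return samples_per_frame * fps * flops_per_sample / 1e9;
}

// Prints the lower and upper bound used on the slide.
void print_video_budget()
{
    std::printf("low : %.1f GFLOPS\n",
                required_gflops(1920, 1080, 3, 30.0, 100.0));
    std::printf("high: %.1f GFLOPS\n",
                required_gflops(1920, 1080, 3, 30.0, 1000.0));
}
```

Calling `print_video_budget()` from `main` gives figures close to the 18 – 180 range quoted above; the exact values depend on the assumed resolution and channel count.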
HPC Approaches
❖ Cluster/distributed computing
• Hadoop/MapReduce
(Google, Cloud Computing)
• MPI
❖ Multi-processing
computing
• Multicore (GPGPU, CPU, FPGA/DSP)
• Programming: multi-thread
• Windows thread, Pthreads, OpenMP
• CUDA, Renderscript, C++ AMP, …
Supercomputer
However
❖ Can CV algorithms speed up every 18
months with multicore?
❖ Multicore is not a simple solution for
upgrading CV algorithm performance
• The transition from single core to
multicore will be blocked by software
• We are not ready to face the software
programming challenges
• It is the software, stupid.
Software, Threading, and
Parallel Computing
❖ Identify parallelism: Analyze algorithm
❖ Express parallelism: Write parallel code
❖ Validate parallelism: Debug & verify parallel code
❖ Optimize parallelism: enhance parallel
performance
Multi-threading Demands
New Programming Skills
❖ Previous multi-threading techniques
• Windows thread, pthread, OpenMP,
MPI, …
❖ New techniques
• CUDA, C++ AMP, OpenCL, Renderscript,
OpenACC, MapReduce, …
❖ Concepts
• Race condition, deadlock
• Domain partition, function partition, …
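To make the race-condition concept concrete, here is a minimal sketch in standard C++ (not one of the GPGPU techniques listed above). Several threads increment one shared counter; the mutex makes each read-modify-write atomic. The function name `count_with_mutex` is a hypothetical choice for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Increment a shared counter from several threads.
// The mutex serializes each increment; without it, the
// unsynchronized ++counter would be a data race and the
// final value could be lower than expected.
long count_with_mutex(int num_threads, int increments_per_thread)
{
    assert(num_threads > 0 && increments_per_thread >= 0);
    long counter = 0;
    std::mutex guard;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &guard, increments_per_thread] {
            for (int n = 0; n < increments_per_thread; ++n) {
                std::lock_guard<std::mutex> lock(guard);
                ++counter;
            }
        });
    }
    for (std::thread& w : workers) w.join();   // wait for all threads
    return counter;
}
```

Locking on every increment is deliberately simple; in practice a per-thread partial sum combined at the end would reduce contention.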
Multicore Programming
Practice (MPP)
❖ Goal: Write portable C/C++
programs to be "Multicore ready"
and platform compatible
• Proposed by an
MPP working group
in the Multicore
Association
http://www.multicore-association.org/workgroup/mpp.php
OpenACC
❖ An organization that develops an API which
• Describes a collection of
compiler directives
• Specifies loops and regions of
code in standard C, C++ and Fortran
• To be offloaded from a host CPU to
an attached accelerator, including
• APUs, GPUs, and many-core
coprocessors
HSA Foundation
❖Heterogeneous System Architecture
• Key members: AMD, QUALCOMM, ARM,
SAMSUNG, TI
❖System architecture easing efficient
use of accelerators, SoCs
• Intended to support high-level parallel
programming frameworks
• OpenCL, C++, C#, OpenMP, Java
• Accelerator requirements
• Full-system SVM, memory coherency,
preemption,
user-mode dispatch
The ParLab in Berkeley
❖ The Parallel Computing Lab at UC Berkeley
http://parlab.eecs.berkeley.edu
• The ParLab. offers programmers a
practical introduction to parallel
programming techniques and tools on
current parallel computers,
emphasizing multicore and manycore
computers.
HPEC
❖ High Performance Embedded
Computing
• MIT Lincoln Lab, 1997 ~
OpenCL
❖ Royalty-free, cross-platform, cross-
vendor standard
• Targeting: supercomputers
→ embedded systems
→ mobile devices
❖Enables programming of diverse
compute resources
•CPU, GPU,
DSP, FPGA …
OpenCL Working Group
Members
❖Diverse industry participation – many
industry experts
❖NVIDIA is chair, Apple is specification
editor
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ Vendor, Hardware
❖ How to do parallel programming (Sec. 3)
❖ OpenCV Acceleration (Sec. 4)
❖ Computer Vision Acceleration on PC (Sec. 5)
❖ Computer Vision Acceleration on Android
(Sec. 6)
2. GPGPU
PC platform
Mobile platform
Why GPGPU
❖ GPGPU is many-core (vs. multi-core)
• Suitable for massively parallel computing
GPGPU as a Coprocessor
Heterogeneous Computing
PC Platform
• Discrete GPUs
• GPGPU card as a coprocessor
From PC to PSC (Personal Super-Computer)
Connected over PCIe
Mobile Platform
• Integrated GPUs
• GPGPU sub-chip as a coprocessor
From mobile phone to mobile personal computer
No PCIe: GPGPU and CPU on the same chip
GPGPU Solutions - nVidia
• Compute Architecture:
Tesla, Fermi, Kepler, …
• PC
• GeForce, Quadro
• Tesla
• 870, 1060, 2070, K40
• Mobile
• Tegra: …, 4, K1(192 cores)
It’s Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.
GPGPU Solutions
– Qualcomm/AMD
❖ Qualcomm, AMD, ATI
❖ APU: integrated CPU+GPU
❖ Low energy consumption
❖ PC(AMD): FirePro
❖ Mobile(Snapdragon):
❖ Adreno: 330(32 cores)
GPGPU Solutions - ARM
❖ Mali
❖ Samsung Exynos, MediaTek
❖ Compute engine
after T-600
❖ Exynos 5
❖ At most 8 cores
(Mali-T678)
Intel – Multicore CPU
• PC (Xeon Phi)
• Iris Pro GPU
• Knights Landing: 60 cores
• Knights Corner: 48 CPU cores,
PCIe
• Mobile
• Haswell
• Atom
Applications of GPGPU
http://developer.nvidia.com/category/zone/cuda-zone
Heterogeneous Architecture
❖Host: CPU
❖Device: GPGPU
❖Notice: memory hierarchy in device
GPGPUs Architecture
- nVidia
❖ GT200
• GTX 260/280, Quadro 5800, Tesla 1060
❖ Fermi
• Tesla 2060
[Figure: CPU (host, multicore) with control logic, cache, a few ALUs, and DRAM vs. GPU (device, many-core) with many ALUs and DRAM]
nVidia GPGPU Architecture
❖ SM/SP (Streaming Multiprocessor/Streaming
Processor) + Shared memory + DRAM
Memory Hierarchy
❖ On-Chip Memory
• Registers
• Shared Memory
• Constant Memory (cached on chip, stored in device DRAM)
• Texture Memory (cached on chip, stored in device DRAM)
❖ Off-Chip Memory
• Local Memory
• Global Memory
GPGPU vs. FPGA
❖GPU: nVidia GeForce
GTX 280, GTX580
❖FPGA: Xilinx Virtex4,
Virtex5
A Comparison of FPGA and GPU for real-Time Phase-Based Optical Flow, Stereo, and Local
Image Features, IEEE Transactions on Computers, 2012.
GPGPU vs. FPGA
❖GPU: nVidia GeForce 7900 GTX
❖FPGA: Xilinx Virtex-4
Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case
Study, IEEE Transactions on Computers, 2010.
GPGPU vs. FPGA vs. Multicore
❖Application: 2-D image convolution
GPU: nVidia GeForce 295 GTX
FPGA: Altera Stratix III E260
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-
Window Applications, ACM/SIGDA international symposium on FPGA, 2012.
However, GPGPU May Not
Always Improve Speed & Energy
Hardware vs. Software
GPGPU (hardware): nVidia, Qualcomm, ARM, Intel
Parallel Programming (software): CUDA, OpenCL, RenderScript, C++ AMP
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
• CUDA, Renderscript, OpenCL, …
❖ OpenCV Acceleration (Sec. 4)
❖ Computer Vision Acceleration on PC (Sec. 5)
❖ Computer Vision Acceleration on Android
(Sec. 6)
3. Parallel
Programming
Multi-threading
Programming Languages for Parallelism
Parallel Computing
❖ Serial
Computing
❖ Parallel
Computing
[Figure: a CPU/GPU with four cores]
Parallel Programming
❖ Much code is written in C/C++/Java
• Especially algorithmic programs
❖ Can we write GPGPU parallel
programs in C/C++/Java?
❖ However, C/C++ is sequential
• Three control structures of C/C++/Java:
sequence, selection, repetition
Multi-threading
❖ Multi-threading is the fundamental
concept for parallel programming
• Some techniques are ready
• Pthread, Win32 thread, OpenMP,
MPI, Intel TBB (Threading Building
Blocks), ...
• New techniques
• CUDA, OpenCL, Renderscript,
OpenACC, C++ AMP, ...
Parallel Programming Models
Parallel Programming in
Sequential Language
❖ Do we need to learn new languages for
multi-threading?
• No
❖ Write multi-threaded code in C/C++
• Add functions/directives to C/C++ for
multi-threading
• This is how current solutions work
• pthread, Win32 thread, OpenMP,
MPI, CUDA, OpenCL, ...
Decompose the Problem
❖ Two basic approaches to partition
computational work
• Domain decomposition
• Partition the data used
in solving the problem
• Function decomposition
• Partition the jobs (functions)
from the overall work (problem)
CPU and GPGPU cooperate
Multi-Threading
❖ A program running in serial vs. in parallel
http://en.wikipedia.org/wiki/Thread_(computer_science)
Domain Decomposition (1/3)
❖An image example
• It is 2D data
• Three popular ways to partition it
Domain Decomposition (2/3)
❖ Domain data are usually processed
by loops
for (i = 0; i < height; i++)
    for (j = 0; j < width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: original image (img1) and enhanced image (img2), the X-ray image of a circuit board; i indexes rows and j indexes columns]
Related execution models: SIMD, SPMD, SIMT
Domain Decomposition (3/3)
❖ A three-block partition
example
// Thread 1
for (i = 0; i < height/3; i++)
    for (j = 0; j < width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
// Thread 2
for (i = height/3; i < height*2/3; i++)
    for (j = 0; j < width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
// Thread 3
for (i = height*2/3; i < height; i++)
    for (j = 0; j < width; j++)
        img2[i][j] = RemoveNoise(img1[i][j]);
[Figure: rows i=0 to i=11 split into subdomain 1, subdomain 2, and subdomain 3; threads fork, process their subdomain, then join at a barrier (OpenMP, CUDA SPMD)]
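The three-block partition on this slide can be sketched with standard C++ threads. This is only an illustration under stated assumptions: the image is a vector of rows, `remove_noise` is a hypothetical stand-in for the slide's `RemoveNoise`, and the names `process_rows` and `enhance_parallel` are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Placeholder per-pixel operation (hypothetical stand-in for RemoveNoise).
int remove_noise(int pixel) { return pixel / 2; }

// Process rows [row_begin, row_end) of src into dst.
void process_rows(const Image& src, Image& dst, int row_begin, int row_end)
{
    for (int i = row_begin; i < row_end; ++i)
        for (std::size_t j = 0; j < src[i].size(); ++j)
            dst[i][j] = remove_noise(src[i][j]);
}

// Fork one thread per block of rows, then join (the barrier).
Image enhance_parallel(const Image& src, int num_threads)
{
    assert(num_threads > 0);
    const int height = static_cast<int>(src.size());
    Image dst = src;   // same shape as the input
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        const int row_begin = height * t / num_threads;
        const int row_end = height * (t + 1) / num_threads;
        workers.emplace_back(process_rows, std::cref(src), std::ref(dst),
                             row_begin, row_end);
    }
    for (std::thread& w : workers) w.join();
    return dst;
}
```

Each thread writes a disjoint block of rows, so no locking is needed; with `num_threads = 3` and 12 rows the blocks match the three subdomains of the slide.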
GPGPU Programming:
SIMT model
❖ CPU (“host”) program often
written in C or C++
❖ GPU code is written as a sequential
kernel in (usually) a C or C++
dialect
GPGPU Programming
Techniques
CUDA
OpenCL
C++ AMP
Renderscript
CUDA
❖ CUDA: Compute Unified Device
Architecture
❖ Parallel programming
for nVidia's GPGPU
❖ Uses the C/C++ language
• Java, Fortran, and Matlab can also be used
❖ When executing CUDA programs,
the GPU operates as a coprocessor to
the main CPU
CUDA Hardware Environment:
CPU+GPU
❖ CPU
• Organizes, interprets, and
communicates information
❖ GPU
• Handles the core processing on large quantities
of parallel information
• Compute-intensive portions of applications
that are executed many times, but on different
data, are extracted from the main application
and compiled to execute in parallel on the GPU
CPU ↔ GPU over PCI-E
CUDA Software Stack
Processing Flow on CUDA
1. Allocate device memory
2. Copy processing data (main memory → memory for GPU)
3. Instruct the processing (CPU → GPU)
4. Execute in parallel in each core
5. Copy the result (memory for GPU → main memory)
6. Release device memory
Programming with
Memory Hierarchy
❖ Locality
principle
• Temporal
locality
• Spatial
locality
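The effect of spatial locality can be illustrated on a CPU with two traversals of the same image buffer. This is a minimal sketch, assuming a row-major image stored in one contiguous array; the function names are hypothetical. Both functions return the same sum, but the row-by-row version touches consecutive addresses, while the column-by-column version jumps `width` elements at a time and typically causes more cache misses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-by-row traversal: consecutive addresses (good spatial locality).
// The accumulator `total` is reused on every iteration (temporal locality).
long sum_row_major(const std::vector<int>& pixels, int height, int width)
{
    assert(static_cast<std::size_t>(height) * width == pixels.size());
    long total = 0;
    for (int i = 0; i < height; ++i)
        for (int j = 0; j < width; ++j)
            total += pixels[static_cast<std::size_t>(i) * width + j];
    return total;
}

// Column-by-column traversal: stride of `width` elements between accesses.
long sum_column_major(const std::vector<int>& pixels, int height, int width)
{
    assert(static_cast<std::size_t>(height) * width == pixels.size());
    long total = 0;
    for (int j = 0; j < width; ++j)
        for (int i = 0; i < height; ++i)
            total += pixels[static_cast<std::size_t>(i) * width + j];
    return total;
}
```

On a GPGPU the same principle shows up as the preference for neighbouring threads to access neighbouring addresses.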
Example - Hello World(1/3)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel, defined on the next slide
__global__ void hello(char* hello1, char* hello2);

int main()
{
    char src[12] = "Hello World";   // host input: 11 characters + '\0'
    char h_hello[12];               // host output
    char* d_hello1;                 // device input
    char* d_hello2;                 // device output
    cudaMalloc((void**) &d_hello1, sizeof(char) * 12);
    cudaMalloc((void**) &d_hello2, sizeof(char) * 12);
    cudaMemcpy(d_hello1, src, sizeof(char) * 12,
               cudaMemcpyHostToDevice);
    // Call the kernel function with 1 block of 1 thread
    hello<<<1,1>>>(d_hello1, d_hello2);
Host: src, h_hello
Device: d_hello1, d_hello2
Example - Hello World(2/3)
❖ Kernel Function
__global__ void hello(char* hello1, char* hello2)
{
    int k;
    // Copy characters until the terminating '\0'
    for (k = 0; hello1[k] != '\0'; k++) {
        hello2[k] = hello1[k];
    }
    hello2[k] = '\0';   // terminate the copy so printf("%s") is safe
}
Host: src, h_hello
Device: d_hello1, d_hello2
No parallel processing in this example
Example - Hello World(3/3)
    cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12,
               cudaMemcpyDeviceToHost);
    printf("%s\n", h_hello);
    cudaFree(d_hello1);
    cudaFree(d_hello2);
    system("pause");   // Windows only: keeps the console window open
    return 0;
}
Result: Hello World
Host: src, h_hello
Device: d_hello1, d_hello2
OpenCL
Standard
The Inspiration for OpenCL
What's OpenCL
❖One code tree can be executed on CPUs, GPUs,
DSPs and hardware
• Dynamically interrogate system load and
balance across available processors
❖Powerful, low-level flexibility
• Foundational access to compute resources for
higher-level engines, frameworks and
languages
Broad OpenCL Implementer
Adoption
❖Multiple conformant implementations shipping
on desktop and mobile
❖Android ICD extension released in latest
extension specification
❖Multiple implementations shipping in Android
NDK
OpenCL Enables Portability
❖C to gates programs are
proprietary
Altera OpenCL SDK for FPGAs
NVIDIA OpenCL SDK for GPU
AMD OpenCL Optimization
Case Study
❖Platform
• AMD Phenom II X4 965 CPU (quad core)
• ATI Radeon HD 5870 GPU
❖Unoptimized CPU performance: 1 GFLOP/s
❖Optimized CPU performance reaches: 4 GFLOP/s
❖Optimized GPU performance reaches: 50 GFLOP/s
Example - Hello World(1/3)
Including
Declaring
Example - Hello World(2/3)
Creating
Example - Hello World(2/3)
Do
Copy to host &
display
Creating
Example - Hello World(3/3)
Kernel Function
C++ AMP
Microsoft
What's C++ AMP(1/2)
❖Microsoft’s C++ AMP (Accelerated Massive
Parallelism)
• Part of Visual C++, integrated with Visual
Studio, built on Direct3D
• “Performance for the mainstream”
❖STL-like library for multidimensional array
data
• Special convenience support for 1, 2, and 3
dimensional arrays on CPU or GPU
• C++ AMP runtime handles CPU<->GPU data
copying
• Tiles enable efficient processing of sub-arrays
What's C++ AMP(2/2)
❖ parallel_for_each
• Executes a kernel (C++ lambda) at
each point in the extent
• restrict() clause specifies where to
run the kernel: cpu (default) or
amp (GPU; named direct3d in the preview release)
Example - Hello World(1/2)
Declaring &
Copying to device
Example - Hello World(2/2)
Do
Display
RenderScript
Google Android
What's Renderscript(1/2)
❖Higher-level than CUDA or OpenCL: simpler &
less performance control
• Emphasis on mobile devices & cross-SoC
performance portability
❖Programming model
• C99-based kernel language, JIT-compiled,
single input-single output
• Automatic Java class reflection
• Intrinsics: built-in, highly-tuned operations,
e.g. ScriptIntrinsicConvolve3x3
• Script groups combine kernels to amortize
launch cost & enable kernel fusion
What's Renderscript(2/2)
❖ Data type:
• 1D/2D collections of elements, C types like int
and short2, types include size
• Runtime type checking
❖ Parallelism
• Implicit: one thread per data element,
atomics for thread-safe access
• Thread scheduling not exposed, VM-decided
RenderScript Architecture
Low Level Virtual Machine
❖Low Level Virtual Machine (LLVM)
is a compiler infrastructure
Offline Compiler Flow
Renderscript Compiler: libbcc
Renderscript Project
Framework
Example - Hello World(1/8)
Example - Hello World(2/8)
HelloWorld.java
Example - Hello World(3/8)
HelloWorld.java
Example - Hello World(4/8)
HelloWorldView.java
Example - Hello World(5/8)
HelloWorldView.java
Example - Hello World(6/8)
HelloWorldRS.java
Example - Hello World(7/8)
HelloWorldRS.java
Example - Hello World(7/8)
ScriptC_helloworld.java
Example - Hello World(8/8)
HelloWorld.rs
Comparison (1/2)
❖Renderscript vs. Native(NDK) vs. Java(SDK)
• OS: Honeycomb v3.2(CPU only)
Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of
the three programming models in google android." in Proc. First Asia-
Pacific Programming Languages and Compilers Workshop (APPLC). 201
Comparison(2/2)
❖OpenCL & CUDA
• Sobel filter with (CMw) and without (CMw/o)
constant memory
OpenCL’s portability does not
fundamentally affect its performance
Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A
comprehensive performance comparison of CUDA and OpenCL." in
Proc. International Conference Parallel Processing (ICPP), 2011.
GPGPU Programming
Performance: more control, better performance
Productivity: ease of use, quick programming,
portability
Parallelization
❖ Multicore/Multi-threading
❖ Data Parallelization
• Data distribution
• Parallel convolution
• Reduction algorithm
• Amdahl’s law
❖ Memory Hierarchy Management
• Locality principle
• Program accesses a relatively small portion
of the address space at any instant of time
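Amdahl's law, listed on this slide, bounds the speedup that more cores can deliver when part of a program stays serial: speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of cores. A minimal sketch follows; the function name `amdahl_speedup` is a hypothetical choice.

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup for a program whose
// fraction `parallel_fraction` (0..1) can run on `cores` cores,
// while the remainder runs serially.
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

For example, with 90% of the work parallelized, even 240 cores give a speedup below 10, because the serial 10% dominates.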
Multi-thread Programming with
the Discipline of Parallelization
❖ Identify parallelism: Analyze algorithm
❖ Express parallelism: Write parallel code
❖ Validate parallelism: Debug & verify parallel code
❖ Optimize parallelism: enhance parallel
performance
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to do parallel programming (Sec. 3)
❖ OpenCV Acceleration (Sec. 4)
❖ Computer Vision Acceleration on PC (Sec. 5)
❖ Computer Vision Acceleration on Android
(Sec. 6)
4. OpenCV
Acceleration
What Is OpenCV
❖A very popular computer vision
library
• 6M downloads
• BSD license
• 2000 ~ CV functions
• Modularized and efficient
• Optimization
• Intel SSE, IPP, TBB
• ARM NEON & GLSL (Tegra)
• CUDA, OpenCL
OpenCV Modules
❖Image/video I/O, processing, display (core,
imgproc, highgui)
❖Object/feature detection (objdetect, features2d,
nonfree)
❖Geometry-based monocular or stereo computer
vision (calib3d, stitching, videostab)
❖Computational photography (photo, video,
superres)
❖Machine learning & clustering (ml, flann)
❖CUDA and OpenCL GPU acceleration (gpu, ocl)
Normal CV modules: 14
Acceleration modules: 2
OpenCV GPU Module
❖Implemented using NVIDIA CUDA
Runtime API
❖Latest version: 2.4.9
• Utilizing Multiple GPUs
❖Implemented modules: 11
❖Implemented functions: 270
Focus on PC platform
Not fully compatible with mobile GPGPU on Android
CUDA Matrix Operations
❖Point-wise matrix math
• gpu::add(), ::sum(), ::div(), ::sqrt(),
::sqrSum(), ::meanStdDev(), ::min(), ::max(),
::minMaxLoc(), ::magnitude(), ::norm(),
::countNonZero(), ::cartToPolar(), etc.
❖Matrix multiplication
• gpu::gemm()
❖Channel manipulation
• gpu::merge(), ::split()
CUDA Geometric Operations
❖Image resize with sub-pixel interpolation
• gpu::resize()
❖Image rotate with sub-pixel interpolation
• gpu::rotate()
❖Image warp (e.g., panoramic stitching)
• gpu::warpPerspective(), ::warpAffine()
CUDA other Math and
Geometric Operations
❖Integral images
• gpu::integral(), ::sqrIntegral()
❖Custom geometric transformation (e.g., lens
distortion correction)
• gpu::remap(), ::buildWarpCylindricalMaps(),
::buildWarpSphericalMaps()
CUDA Image Processing(1/2)
❖Smoothing
• gpu::blur(), ::boxFilter(),
::GaussianBlur()
❖Morphological
• gpu::dilate(), ::erode(), ::morphologyEx()
❖Edge Detection
• gpu::Sobel(), ::Scharr(), ::Laplacian(),
gpu::Canny()
❖Custom 2D filters
• gpu::filter2D(), ::createFilter2D_GPU(),
::createSeparableFilter_GPU()
❖Color space conversion
• gpu::cvtColor()
CUDA Image Processing(2/2)
❖Image blending
• gpu::blendLinear()
❖Template matching (automated inspection)
• gpu::matchTemplate()
❖Gaussian pyramid (scale invariant
feature/object detection)
• gpu::pyrUp(), ::pyrDown()
❖Image histogram
• gpu::calcHist(), gpu::histEven,
gpu::histRange()
❖Contrast enhancement
• gpu::equalizeHist()
CUDA De-noising
❖Gaussian noise removal
• gpu::FastNonLocalMeansDenoising()
❖Edge preserving smoothing
• gpu::bilateralFilter()
CUDA Fourier and MeanShift
❖Fourier analysis
•gpu::dft(), ::convolve(),
::mulAndScaleSpectrums(), etc.
❖MeanShift
•gpu::meanShiftFiltering(),
::meanShiftSegmentation()
CUDA Shape Detection
❖Line detection (e.g., lane detection, building
detection, perspective correction)
• gpu::HoughLines(), ::HoughLinesDownload()
❖Circle detection (e.g., cells, coins, balls)
• gpu::HoughCircles(),
::HoughCirclesDownload()
CUDA Object Detection
❖HAAR and LBP cascaded adaptive boosting
(e.g., face, nose, eyes, mouth)
• gpu::CascadeClassifier_GPU::detectMultiScale()
❖HOG detector (e.g., person, car, fruit, hand)
• gpu::HOGDescriptor::detectMultiScale()
CUDA Object Recognition
❖Interest point detectors
• gpu::cornerHarris(), ::cornerMinEigenVal(),
::SURF_GPU, ::FAST_GPU, ::ORB_GPU(),
::GoodFeaturesToTrackDetector_GPU()
❖Feature matching
• gpu::BruteForceMatcher_GPU(),
::BFMatcher_GPU()
CUDA Stereo and 3D
❖RANSAC
• gpu::solvePnPRansac()
❖Stereo correspondence (disparity map)
• gpu::StereoBM_GPU(),
::StereoBeliefPropagation(),
::StereoConstantSpaceBP(),
::DisparityBilateralFilter()
❖Represent stereo disparity as 3D or 2D
• gpu::reprojectImageTo3D(),
::drawColorDisp()
CUDA Optical Flow
❖Dense/sparse optical flow
• gpu::FastOpticalFlowBM(),
::PyrLKOpticalFlow, ::BroxOpticalFlow(),
::FarnebackOpticalFlow(),
::OpticalFlowDual_TVL1_GPU(),
::interpolateFrames()
CUDA Background
Segmentation
❖Foreground/background segmentation (e.g.,
object detection/removal, motion tracking,
background removal)
• gpu::FGDStatModel, ::GMG_GPU,
::MOG_GPU, ::MOG2_GPU
Performance of OpenCV GPU
Accelerators on PC
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to program in parallel (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
5. Computer Vision
Acceleration on PC
Image enhancement (HDR)
Feature extraction
Video surveillance cloud
HDR and
Image Enhancement
HDR Image Enhancement
❖ Restore and enhance an image
❖ Its complexity is high for large images
• Complexity: O(N²) ~ O(N³)
[Figure: original image vs. restored image]
Algorithms for
Image Restoration
❖ Wiener Filter
❖ Histogram Based Approach
• Histogram Equalization,
Histogram Modification, …
❖ Retinex
• Path-based Retinex
• Recursive Retinex
• Center/surround Retinex
• It has no iterative process, so it is suitable for parallelization
• Multi-Scale Retinex with Color Restoration (MSRCR)
[Rahman et al. 1997]
MSRCR Algorithm
• R_MSRCR_i(x, y): the MSRCR output
• I_i(x, y): the original image distribution in the ith spectral band
• F_k(x, y): the kth Gaussian surround function
• *: the convolution operation
• w_k: the weight
• C_i(x, y): the color restoration factor in the ith spectral band
• N: the number of spectral bands
• β: the gain constant
• α: controls the strength of the nonlinearity
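The definitions above match the usual statement of MSRCR in the Retinex literature. The equation on the slide did not survive extraction, so the form below is a reconstruction under assumptions: the symbol names (R, I, F, w, C, β, α) and the scale count K are assumed, not taken from the slide.

```latex
% MSRCR output in the i-th spectral band (K Gaussian surround scales assumed)
R_{MSRCR_i}(x,y) = C_i(x,y) \sum_{k=1}^{K} w_k
  \left\{ \log I_i(x,y) - \log\left[ F_k(x,y) * I_i(x,y) \right] \right\}

% Color restoration factor: \beta is the gain constant,
% \alpha controls the strength of the nonlinearity, N is the number of bands
C_i(x,y) = \beta \left\{ \log\left[ \alpha\, I_i(x,y) \right]
  - \log\left[ \sum_{j=1}^{N} I_j(x,y) \right] \right\}
```

Every term is evaluated independently per pixel once the convolutions are done, which is why the method maps well onto data-parallel hardware.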
The Method
[Flowchart, CPU side: copy data from CPU to GPGPU; copy data from GPGPU to CPU]
[Flowchart, GPGPU side: Gaussian blur → log-domain processing → normalization → histogram stretching]
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm."
Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer
Society Conference on. IEEE, 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for
accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
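The stages named on this slide can be sketched on the CPU as a single-scale Retinex. This is a minimal illustration, not the authors' implementation: the `Image` struct, the function names, the clamped borders, the `+1` offset inside the logarithm, and the plain min-max stretch standing in for normalization and histogram stretching are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Grayscale image in row-major order; pixel values are expected in [0, 255].
struct Image {
    int width;
    int height;
    std::vector<float> pixels;
};

// Normalized 1D Gaussian kernel with radius ceil(3 * sigma); sigma must be > 0.
std::vector<float> gaussian_kernel(float sigma) {
    const int radius = static_cast<int>(std::ceil(3.0f * sigma));
    std::vector<float> kernel(2 * radius + 1);
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        kernel[i + radius] = std::exp(-static_cast<float>(i * i) / (2.0f * sigma * sigma));
        sum += kernel[i + radius];
    }
    for (float& v : kernel) v /= sum;
    return kernel;
}

// One 1D blur pass along x (dx = 1, dy = 0) or y (dx = 0, dy = 1), clamping at the borders.
Image blur_pass(const Image& in, const std::vector<float>& kernel, int dx, int dy) {
    const int radius = static_cast<int>(kernel.size() / 2);
    Image out = in;
    for (int y = 0; y < in.height; ++y) {
        for (int x = 0; x < in.width; ++x) {
            float acc = 0.0f;
            for (int d = -radius; d <= radius; ++d) {
                const int xx = std::min(std::max(x + d * dx, 0), in.width - 1);
                const int yy = std::min(std::max(y + d * dy, 0), in.height - 1);
                acc += kernel[d + radius] * in.pixels[yy * in.width + xx];
            }
            out.pixels[y * in.width + x] = acc;
        }
    }
    return out;
}

// Single-scale Retinex: Gaussian blur, log-domain difference, min-max stretch to [0, 255].
Image single_scale_retinex(const Image& in, float sigma) {
    Image out = in;
    if (out.pixels.empty()) return out;
    const std::vector<float> kernel = gaussian_kernel(sigma);
    const Image surround = blur_pass(blur_pass(in, kernel, 1, 0), kernel, 0, 1);
    for (std::size_t i = 0; i < in.pixels.size(); ++i) {
        // The +1 offset avoids log(0) for black pixels.
        out.pixels[i] = std::log(in.pixels[i] + 1.0f) - std::log(surround.pixels[i] + 1.0f);
    }
    const auto range = std::minmax_element(out.pixels.begin(), out.pixels.end());
    const float lo = *range.first;
    const float hi = *range.second;
    for (float& v : out.pixels) {
        v = (hi > lo) ? 255.0f * ((v - lo) / (hi - lo)) : 0.0f;
    }
    return out;
}
```

Each output pixel of every stage depends only on the input image, so on a GPGPU each loop body over (x, y) could be assigned to one thread.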
Parallelization by GPGPU
❖ Multicore/Multi-threading
• Tesla C1060: 240 SPs (stream processors)
• CUDA: Thread, Block, Grid
❖ Data Parallelization
• Parallel convolution
[Figure: parallel sum over 8 processing elements (PE 0 to PE 7), with axes PE, data, and time (t0 to t5). Step 1: A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7). Step 2: A(0)+A(1)+A(2)+A(3), A(4)+A(5)+A(6)+A(7). Step 3: sum.]
[Figure: image pixels partitioned into blocks of M pixels, each block assigned to PE i]
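The pairwise summation in the figure (A(0)+A(1), then partial sums of four, then the total) is a tree reduction. The sketch below runs it sequentially on the CPU; the function names are illustrative, and on a GPGPU each iteration of the inner loop would be executed by a separate thread (PE) within one time step.

```cpp
#include <cstddef>
#include <vector>

// Tree reduction: in each step, element i accumulates element i + stride.
// The inner loop iterations are independent, so a GPGPU can run them in parallel.
double tree_sum(std::vector<double> a) {
    if (a.empty()) return 0.0;
    for (std::size_t stride = 1; stride < a.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < a.size(); i += 2 * stride) {
            a[i] += a[i + stride];
        }
    }
    return a[0];
}

// Number of parallel steps needed for n elements: ceil(log2(n)).
int reduction_steps(std::size_t n) {
    int steps = 0;
    for (std::size_t stride = 1; stride < n; stride *= 2) ++steps;
    return steps;
}
```

For the 8 values in the figure this takes 3 parallel steps instead of 7 sequential additions.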
Our Memory Hierarchy
[Figure: the four parallel stages (Gaussian blur, log-domain processing, normalization, histogram stretching) mapped onto texture, constant, global, and shared memory]
Experimental Results (1/2)
[Figure: original images, CPU results, GPGPU results]
Experimental Results (2/2)
[Figure: original images, CPU results, GPGPU results]
GPGPU Speedup over CPU
74x
2x
• Ideal speedup: 240 × (1.296 GHz / 3 GHz) ≈ 103
• NPP: NVIDIA Performance Primitives
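The ideal-speedup estimate on this slide multiplies the number of GPU cores by the GPU-to-CPU clock ratio. A small sketch of that arithmetic follows; the function names are illustrative, and reading the 74x figure as the measured GPGPU speedup over the CPU is an assumption.

```cpp
// Ideal speedup as estimated on the slide: GPU core count scaled by the clock ratio.
double ideal_speedup(int gpu_cores, double gpu_clock_ghz, double cpu_clock_ghz) {
    return gpu_cores * (gpu_clock_ghz / cpu_clock_ghz);
}

// Fraction of the ideal speedup that a measured speedup achieves.
double parallel_efficiency(double measured_speedup, double ideal) {
    return measured_speedup / ideal;
}
```

With the slide's numbers, `ideal_speedup(240, 1.296, 3.0)` evaluates to about 103.7, and a measured 74x would correspond to roughly 71% of that bound.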
Feature Extraction
(SIFT)
What Is SIFT
❖SIFT
• Scale Invariant Feature Transform
❖Invariance of feature points
• Translation
• Rotation
• Scale
Applications of SIFT
❖Object recognition/tracking
❖Image retrieval
❖Autostitch
Parallelize SIFT by GPGPU
Intel Q9400
Quad cores
(2.66GHz)
Geforce GTS 250
128 SPs
(1.836GHz)
Experimental Results
[Figure: CPU result vs. GPU result]
Execution Time
[Chart: execution time in ms]
• CPU: 10 seconds on average
• GPGPU: 0.8 seconds on average
Speedup
13x speedup on average
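As a rough check, the ratio of the average execution times reported for SIFT (10 seconds on the CPU, 0.8 seconds on the GPGPU) can be worked out directly. The symbols below are illustrative.

```latex
S = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPGPU}}}
  = \frac{10\ \mathrm{s}}{0.8\ \mathrm{s}}
  = 12.5
```

This is close to the reported 13x. The small gap is plausible if the reported figure averages the per-image speedups, since a mean of ratios need not equal the ratio of the mean times.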
Video
Surveillance Cloud
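GPGPU Cloud Video Surveillance System
Restricted-area intrusion detection
PTZ camera tracking
Camera anomaly detection
High-efficiency video event browsing system
Central video and message management system
Multi-resolution wide-area surveillance system
Outdoor parking-lot vacancy detection
Illegal parking detection
Face detection in dynamic scenes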
Storage Area Network
PC Mobile
device
Multi-core
Hypervisor
GPGPU
Private cloud server room
Today We Talk About
❖ Why GPGPU's multicore is better (Sec. 2)
❖ How to program in parallel (Sec. 3)
❖ OpenCV acceleration (Sec. 4)
❖ Computer vision acceleration on PC (Sec. 5)
❖ Computer vision acceleration on Android (Sec. 6)
6. Computer Vision
Acceleration on
Android
OpenCV
RenderScript
OpenCV
on Android
OpenCV4Android SDK
❖Enables development of Android applications that use the OpenCV library
❖Uses the Java Native Interface (JNI) to access C code directly
❖Supports NVIDIA's Tegra Android Development Pack (TADP)
• Not fully compatible with the GPU module
System Framework
Two Methods to Call OpenCV
❖Using Java API
❖Using native C++
OpenCV for Android SDK by GPU (1/5 to 5/5)
[Five slides of figures]
RenderScript on
Android with GPU
Acceleration
RenderScript on Android with GPU (1/5 to 5/5)
[Five slides of figures]
RenderScript Image Processing Intrinsics
Name | Operation
ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5 | Performs a 3x3 or 5x5 convolution.
ScriptIntrinsicBlur | Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
ScriptIntrinsicYuvToRGB | Converts a YUV buffer to RGB. Often used to process camera data.
ScriptIntrinsicColorMatrix | Applies a 4x4 color matrix to a buffer.
ScriptIntrinsicBlend | Blends two allocations in a variety of ways.
ScriptIntrinsicLUT | Applies a per-channel lookup table to a buffer.
ScriptIntrinsic3DLUT | Applies a color cube with interpolation to a buffer.
ScriptIntrinsicHistogram | Intrinsic histogram filter.
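To make the first row of the table concrete, the operation behind a 3x3 convolution can be sketched on the CPU. This is a plain C++ illustration of the arithmetic only, not RenderScript code: the function name, the single-channel 8-bit layout, the clamped borders, and the rounding are assumptions of this sketch and say nothing about how the intrinsic itself handles them.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Applies a 3x3 kernel (row-major coefficients, no kernel flip) to a single-channel
// 8-bit image stored row-major. Border pixels are handled by clamping coordinates.
std::vector<std::uint8_t> convolve3x3(const std::vector<std::uint8_t>& src,
                                      int width, int height,
                                      const std::array<float, 9>& kernel) {
    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    const int yy = std::min(std::max(y + ky, 0), height - 1);
                    const int xx = std::min(std::max(x + kx, 0), width - 1);
                    acc += kernel[(ky + 1) * 3 + (kx + 1)] * src[yy * width + xx];
                }
            }
            // Round to nearest and clamp to the 8-bit range.
            const float rounded = std::min(255.0f, std::max(0.0f, acc + 0.5f));
            dst[y * width + x] = static_cast<std::uint8_t>(rounded);
        }
    }
    return dst;
}
```

Every output pixel is computed independently of the others, which is what lets an intrinsic of this kind spread the work across CPU cores or a GPU.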
Gaussian Blur Example
by RenderScript Intrinsic
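// Create the RenderScript context and a blur intrinsic for 4-channel 8-bit data
RenderScript rs = RenderScript.create(theActivity);
ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
// Wrap the input and output bitmaps in allocations
Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
// Blur with radius 25 and copy the result back to the output bitmap
theIntrinsic.setRadius(25.f);
theIntrinsic.setInput(tmpIn);
theIntrinsic.forEach(tmpOut);
tmpOut.copyTo(outputBitmap);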
RenderScript Intrinsic Example (1/2 and 2/2)
[Two slides of figures]
Blur Intrinsic
Performance Analysis
Performance of
RenderScript Intrinsics
❖On new Nexus 7
❖Relative to equivalent multithreaded C
implementations.
RenderScript Image
Processing Benchmarks(1/2)
❖CPU only on a Galaxy Nexus device.
RenderScript Image
Processing Benchmarks(2/2)
Acceleration of Retinex Using
RenderScript
❖The paper cited below presents rsRetinex, a Retinex
implementation optimized with RenderScript.
❖Its experimental results show that rsRetinex gains up to
a 5x speedup across different image resolutions.
Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image
Processing on Android Device Using Renderscript." in Proc. The 8th International
Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
Mobile GPGPU List
GPU | Adoption | OpenCL/CUDA | OpenCV | RenderScript
Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (302~420) | OCL module | Android 4.0 or later
ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 or later
nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 or later (K1 only)
PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
Sources: AnandTech; "Nvidia CEO sees future in cars and gaming," CNet, 2014/5/19.
7. Summary
GPGPU
❖ Single-core → Multi-core → Many-core
❖PC
• nVidia Tesla + CUDA/OpenCV
❖Android
• Qualcomm Adreno + OpenCV ocl
• nVidia Tegra + OpenCV gpu
Parallel Programming
❖C/C++/OpenCV
• OpenMP, OpenACC, CUDA, C++ AMP
• OpenCL
❖Java
• OpenCL, RenderScript
❖Notice that OpenCL and RenderScript are
• Not efficient in parallelization.
• Efficient in CV algorithmic design.
OpenCV Acceleration (1/2)
❖Ver. 2.4.x
• gpu module: CUDA, PC
• ocl module: OpenCL, mobile
❖Ver. 3.0 (2014/6)
• Transparent API for GPGPU
acceleration
OpenCV Acceleration (2/2)
OpenCL 2.0
❖Released in 2013
❖SVM: Shared Virtual Memory
• OpenCL 1.2: Explicit memory
management
❖Dynamic (Nested) Parallelism
• Allows a device to enqueue kernels onto
itself – no round trip to host required
❖Disadvantages
• Requires strong hardware support
• Not well supported in current GPGPUs
CUDA still Dominant in the
Near Future
❖ However, we have to manually parallelize
the algorithm: more design overhead
❖ We need expertise in
• Algorithms of image and signal processing
• Filtering, frequency analysis, compression,
feature extraction, recognition, ...
• Theory, tools and methodology of parallel
computing
• Communication, synchronization, resource
management, load balancing, debugging, ...
GPUs for Multimedia
Motion Estimation for
H.264/AVC on
Multiple GPUs
Using NVIDIA CUDA
10 X
CUDA JPEG Decoder
10 X
DivideFrame GPU Decoder
Hyperspectral Image
Compression on
NVIDIA GPUs
10 X
GPU Decoder
(Vegas/Premiere) -
Using the Power of
NVIDIA Graphic Card to
Decode H.264 Video Files
26 X
PowerDirector7 Ultra
3.5X
GPUs for Computer Vision(1/2)
• 87x: CUDA SURF – A Real-time Implementation for SURF (TU Darmstadt)
• 26x: Leukocyte Tracking: ImageJ Plugin (University of Virginia)
• 200x: Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
• 100x: Image Denoising with Bilateral Filter (Wroclaw University of Technology)
• 85x: Digital Breast Tomosynthesis Reconstruction (Massachusetts General Hospital)
• 100x: Fast Optical Flow on GPU at Video Rate for Full HD Resolution (Onera)
• 8x: A Framework for Efficient and Scalable Execution of Domain-specific Templates on GPU (NEC Labs, Berkeley, Purdue)
• 13x: Accelerating Advanced MRI Reconstructions (University of Illinois)
GPUs for Computer Vision(2/2)
• 20x: GPU for Surveillance
• 13x: Fast Human Detection with Cascaded Ensembles
• 109x: Fast Sliding-Window Object Detection
• 263x: GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA
• 10x: Real-time Visual Tracker by Stream Processing
• 45x: A GPU Accelerated Evolutionary Computer Vision System
• 3x: Canny Edge Detection
• 300x: Audience Measurement – Real-time Video Analysis for Counting People, Face Detection and Tracking
The Embedded Vision
Alliance
Readings (1/2)
• Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved
Retinex algorithm." IEEE Computer Society Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW). 2011.
• Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel
algorithm for accelerating retinex." Journal of Real-Time Image
Processing (2012): 1-19.
• Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time
phase-based optical flow, stereo, and local image features."
Computers, IEEE Transactions on 61.7 (2012): 999-1012.
• Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A
review." Medical physics 38.5 (2011): 2685-2697.
• Cope, Ben, et al. "Performance comparison of graphics processors
to reconfigurable logic: a case study." Computers, IEEE
Transactions on 59.4 (2010): 433-448.
Readings (2/2)
❖ “Designing Visionary Mobile Apps Using the Tegra
Android Development Pack,” http://bit.ly/1jvwbgV
❖ “Getting Started With GPU-Accelerated Computer
Vision Using OpenCV and CUDA,”
http://bit.ly/1oMwJEG
❖ “The open standard for parallel programming of
heterogeneous systems,”
https://www.khronos.org/opencl/
OpenCV Acceleration
❖ GPU Module Introduction — OpenCV 2.4.9.0
documentation
❖ OpenCL Module Introduction - opencv documentation!
❖ OpenCV-CL: Computer vision with OpenCL
acceleration, AMD Developer Central, 2013.
❖ Pulli, Kari, et al. "Real-time computer vision with
OpenCV." Communications of the ACM 55.6 (2012):
61-69.
❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated
framework for image processing and computer vision."
Advances in Visual Computing. Springer Berlin
Heidelberg, 2008. 430-439.
CUDA
❖ CUDA Programming guide. nVidia.
❖ CUDA Best Practices Guide. nVidia.
❖ CUDA Reference Manual. nVidia.
❖ CUDA Zone - NVIDIA Developer,
https://developer.nvidia.com/cuda-zone
❖ Parallel Programming and Computing Platform | CUDA
Home, www.nvidia.com/object/cuda_home_new.html
❖ Applications of CUDA for Imaging and Computer
Vision
http://www.nvidia.com/object/imaging_comp_vision.html
❖ nVidia Performance Primitives (NPP)
http://developer.nvidia.com/object/npp_home.html
OpenCL
❖ Khronos OpenCL specification, reference card, tutorials, etc:
http://www.khronos.org/opencl
❖ AMD OpenCL Resources:
http://developer.amd.com/opencl
❖ NVIDIA OpenCL Resources:
http://developer.nvidia.com/opencl
❖ Books
• Using OpenCL: Programming Massively Parallel Computers.
IOS Press, 2012.
• OpenCL programming guide. Pearson Education, 2011.
• Heterogeneous Computing with OpenCL: Revised OpenCL 1.
Newnes, 2012.
• OpenCL in Action: how to accelerate graphics and
computation. Manning, 2012.
RenderScript
❖ RenderScript for Android Developer, Official web site
http://developer.android.com/guide/topics/renderscript/compute.html
❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li.
"Comparison and analysis of the three programming
models in google android." First Asia-Pacific
Programming Languages and Compilers Workshop.
2012.
❖ "High Performance Apps Development with
RenderScript," 12th Kandroid Conference, 2013.
Web Sites and Resources
❖Embedded Vision Alliance,
http://www.embedded-vision.com
❖GPUComputing.Net,
http://www.gpucomputing.net
❖HSA Foundation, www.hsafoundation.com
Parallel Computing with
GPGPU
❖Programming Massively Parallel
Processors – A Hands-on Approach
• D. B. Kirk, W. M. Hwu
• Morgan Kaufmann, 2010
• http://www.nvidia.com/object/promotion_david_kirk_book.html
203

More Related Content

What's hot

MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
AMD Developer Central
 
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
NTT Communications Technology Development
 
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
Edge AI and Vision Alliance
 
100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV
NTT Communications Technology Development
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Intel® Software
 
“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org
“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org
“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org
Edge AI and Vision Alliance
 
"Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P..."Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P...
Edge AI and Vision Alliance
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
Lior Sidi
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
Shinya Takamaeda-Y
 
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded SystemsApplying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
Jenny Midwinter
 
Possibilities of generative models
Possibilities of generative modelsPossibilities of generative models
Possibilities of generative models
Alison B. Lowndes
 
Hire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas
Hire a Machine to Code - Michael Arthur Bucko & Aurélien NicolasHire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas
Hire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas
WithTheBest
 
"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org
"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org
"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org
Edge AI and Vision Alliance
 
"The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li...
"The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li..."The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li...
"The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li...
Edge AI and Vision Alliance
 
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019
corehard_by
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
AMD Developer Central
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
GiannisTsagatakis
 
Cloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & InferenceCloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & Inference
Mr. Vengineer
 
"APIs for Accelerating Vision and Inferencing: An Industry Overview of Option...
"APIs for Accelerating Vision and Inferencing: An Industry Overview of Option..."APIs for Accelerating Vision and Inferencing: An Industry Overview of Option...
"APIs for Accelerating Vision and Inferencing: An Industry Overview of Option...
Edge AI and Vision Alliance
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Intel® Software
 

What's hot (20)

MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
MM-4092, Optimizing FFMPEG and Handbrake Using OpenCL and Other AMD HW Capabi...
 
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
Can we boost more HPC performance? Integrate IBM POWER servers with GPUs to O...
 
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
“Deploying PyTorch Models for Real-time Inference On the Edge,” a Presentatio...
 
100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV100Gbps OpenStack For Providing High-Performance NFV
100Gbps OpenStack For Providing High-Performance NFV
 
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
Bring Intelligent Motion Using Reinforcement Learning Engines | SIGGRAPH 2019...
 
“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org
“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org
“OpenCV: Past, Present and Future,” a Presentation from OpenCV.org
 
"Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P..."Current and Planned Standards for Computer Vision and Machine Learning," a P...
"Current and Planned Standards for Computer Vision and Machine Learning," a P...
 
GPU and Deep learning best practices
GPU and Deep learning best practicesGPU and Deep learning best practices
GPU and Deep learning best practices
 
A CGRA-based Approach for Accelerating Convolutional Neural Networks
A CGRA-based Approachfor Accelerating Convolutional Neural NetworksA CGRA-based Approachfor Accelerating Convolutional Neural Networks
A CGRA-based Approach for Accelerating Convolutional Neural Networks
 
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded SystemsApplying Deep Learning Vision Technology to low-cost/power Embedded Systems
Applying Deep Learning Vision Technology to low-cost/power Embedded Systems
 
Possibilities of generative models
Possibilities of generative modelsPossibilities of generative models
Possibilities of generative models
 
Hire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas
Hire a Machine to Code - Michael Arthur Bucko & Aurélien NicolasHire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas
Hire a Machine to Code - Michael Arthur Bucko & Aurélien Nicolas
 
"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org
"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org
"OpenCV: Current Status and Future Plans," a Presentation from OpenCV.org
 
"The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li...
"The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li..."The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li...
"The OpenVX Hardware Acceleration API for Embedded Vision Applications and Li...
 
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019
GPGPU: что это такое и для чего. Александр Титов. CoreHard Spring 2019
 
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
Cloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & InferenceCloud Deep Learning Chips Training & Inference
Cloud Deep Learning Chips Training & Inference
 
"APIs for Accelerating Vision and Inferencing: An Industry Overview of Option...
"APIs for Accelerating Vision and Inferencing: An Industry Overview of Option..."APIs for Accelerating Vision and Inferencing: An Industry Overview of Option...
"APIs for Accelerating Vision and Inferencing: An Industry Overview of Option...
 
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
Software AI Accelerators: The Next Frontier | Software for AI Optimization Su...
 

Viewers also liked

Time critical multitasking for multicore
Time critical multitasking for multicoreTime critical multitasking for multicore
Time critical multitasking for multicore
ijesajournal
 
Reconfigurable 3D MultiCore Concept by Prof. Michael Hübner @ ARC 2013
Reconfigurable 3D MultiCore Concept by Prof. Michael Hübner @ ARC 2013Reconfigurable 3D MultiCore Concept by Prof. Michael Hübner @ ARC 2013
Reconfigurable 3D MultiCore Concept by Prof. Michael Hübner @ ARC 2013
FlexTiles Team
 
Multicore
MulticoreMulticore
Multicoretjk2n
 
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Studies on Automatic Parallelization for Heterogeneous and Homogeneous Multi...
Akihiro Hayashi
 
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio...
"Processors for Embedded Vision: Technology and Market Trends," A Presentatio..."Processors for Embedded Vision: Technology and Market Trends," A Presentatio...

2014/07/17 Parallelize computer vision by GPGPU computing

  • 1. Wang, Yuan-Kai (王元凱) Computer Vision Parallelization by GPGPU p. Wang, Yuan-Kai (王元凱) Electrical Engineering Department, Fu Jen Catholic University (輔仁大學電機工程系) ykwang@mail.fju.edu.tw http://www.ykwang.tw 2014/07/17 Parallelize Computer Vision by GPGPU Computing 1
  • 2. About this Course ❖ Multicore Era for Computer Vision ❖ GPGPU ❖ Parallel Programming (CUDA, OpenCL, Renderscript) ❖ OpenCV Acceleration with GPGPU ❖ Computer Vision Acceleration
  • 3. 1. Multicore Era for Computer Vision: Paradigm shift from Clock Speed Race to Multicore Race
  • 4. Multicore Computing ❖ What Is Multicore • Combine multiple processors (CPU, DSP, GPGPU, FPGA) into a single chip ❖ Multicore computing is inevitable
  • 5. Moore's Law ❖ In 1965, Gordon Moore (Intel co-founder) predicted • The number of transistors on an IC would double every year (revised in 1975 to every two years) ❖ The well-known law • The performance of a computer doubles every 18 months • More transistors → More performance ❖ The prediction was tracked closely by Intel's CPUs for 40 years
  • 6. Review of Moore's Law ❖ The number of transistors on a chip did increase ❖ Software enjoys the fruits of hardware's labour.
  • 7. Problems ❖ Turning more transistors into more performance relied on higher clock frequency • This led to the Clock Speed Race ❖ But higher frequency means higher power consumption • High power consumption → Heat problem • About 4 GHz has been the practical limit of clock speed
  • 8. Paradigm Shift from 2000 AD ❖ General-purpose multicore comes of age ❖ Chip companies race to create multicore processors • CPU: Intel Core Duo, Quad-core, ARM v7, ... • DSP: TI OMAP, ARM NEON, … • GPU/GPGPU: • nVidia: GeForce/Tesla, Tegra • ARM: Mali-T6x • …
  • 9. The Multicore Evolution: from a large mono-core to multiple lightweight cores ❖ Pentium processor: optimized for single thread ❖ Core Duo ❖ In 5~10 years: 10~100 energy-efficient cores optimized for parallel execution
  • 10. Moore's Law Needs Multicore ❖ Single core cannot fit Moore's law ❖ Multicore can fit Moore's law if a parallel programming model exists [Figure: performance vs. time for single core and multi-core]
  • 11. Two Architectures for Multicore ❖ Symmetric multiprocessing (SMP) • Multicore CPU, GPGPU, DSP multicore • Homogeneous computing ❖ Asymmetric multiprocessing (AMP) • CPU+GPGPU, CPU+FPGA, CPU+DSP • Heterogeneous computing
  • 12. Multicore CPU (1/2) ❖ Two or more CPU cores in a chip ❖ Ex.: Intel Core i7 [Figure: multiple execution cores]
  • 13. Multicore CPU (2/2) ❖ Windows Task Manager [Screenshots: two cores; eight cores]
  • 14. GPGPU (1/2) ❖ GPU (Graphics Processing Unit) • The processor in a graphics card that speeds up 3D graphics • Game playing is a major application ❖ GPGPU: General-Purpose GPU • General-purpose computation using a GPU in applications other than 3D graphics
  • 15. GPGPU (2/2) ❖ GPGPU has more cores than CPU • 120 ~ 3072 cores vs. 2 ~ 8 cores (Many-core vs. Multi-core) ❖ For massively parallel workloads, GPGPU offers more computing power than a multicore CPU ❖ Vendors: • nVidia • Qualcomm (AMD, ATI) • ARM • Intel
  • 16. It is the Software, Stupid ❖ Gary Smith and Daya Nadamuni, Gartner Dataquest, Design Automation Conf., 2006 ❖ The biggest problem with SoC design is embedded software development. ❖ The next big hurdle is programmability. It's the ability to program these multicore platforms. ❖ You can have elegant algorithms, first-pass silicon, and fancy intellectual property. But without software, the product goes nowhere.
  • 17. Multicore Demands Threading
  • 18. Multicore Demands Threading
  • 19. What Is Computer Vision
  • 20. A Complete Vision System: Video Surveillance as an Example [Diagram stages: Video Capture, Image Enhance, Object/Event Detection, Object Tracking, Object/Event Recognition, Behavior Analysis, Retrieval. Diagram labels: Imaging, Image/Video Enhancement, Event Detection, Tripwire, Abnormal Detection, Face Recognition, Retrieval]
  • 21. Computer Vision Needs High Performance Computing ❖ A CV example: video processing • Intelligent video surveillance ❖ Its complexity is high • Video (1080p RGB): about 6 million samples per frame (1920 × 1080 pixels × 3 channels), 30 fps • 100 to 1K floating-point operations per sample • ⇒ 18 to 180 GFLOPS ❖ Massive data processing • Intensive computation
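The estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `flops_per_second` and its parameters are hypothetical, and it assumes the slide's "6 megapixels" counts 1920 × 1080 pixels times 3 RGB channels.

```cpp
#include <cstdint>

// Estimated floating-point operations per second for a video pipeline.
// Hypothetical helper: samples per frame x frames per second x operations per sample.
constexpr std::uint64_t flops_per_second(std::uint64_t width, std::uint64_t height,
                                         std::uint64_t channels, std::uint64_t fps,
                                         std::uint64_t ops_per_sample) {
    return width * height * channels * fps * ops_per_sample;
}

// Low and high ends of the slide's range (100 to 1,000 operations per sample)
// for 1080p RGB video at 30 frames per second.
constexpr std::uint64_t kLow  = flops_per_second(1920, 1080, 3, 30, 100);
constexpr std::uint64_t kHigh = flops_per_second(1920, 1080, 3, 30, 1000);

static_assert(kLow  == 18662400000ULL,  "about 18.7 GFLOPS");
static_assert(kHigh == 186624000000ULL, "about 186.6 GFLOPS");
```

Rounded, these values match the slide's range of 18 to 180 billion operations per second.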
  • 22. HPC Approaches ❖ Cluster/distributed computing • Hadoop/MapReduce (Google, Cloud Computing) • MPI ❖ Multi-processing computing • Multicore (GPGPU, CPU, FPGA/DSP) • Programming: multi-thread • Windows thread, Pthreads, OpenMP • CUDA, Renderscript, C++ AMP, … [Figure: supercomputer]
  • 23. However ❖ Can CV algorithms speed up every 18 months with multicore? ❖ Multicore is not a simple solution for upgrading CV algorithm performance • The transition from single core to multicore will be blocked by software • We are not ready to face the software programming challenges • It is the software, stupid.
  • 24. Software, Threading, and Parallel Computing ❖ Identify parallelism: Analyze algorithm ❖ Express parallelism: Write parallel code ❖ Validate parallelism: Debug & verify parallel code ❖ Optimize parallelism: Enhance parallel performance
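The first three steps above can be sketched on a simple per-pixel operation. This is a minimal illustration, not code from the course: image thresholding is chosen as an assumed example, the function names are hypothetical, and OpenMP is used only because it is one of the threading techniques the slides mention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Serial reference: binarize each pixel against a threshold level.
std::vector<std::uint8_t> threshold_serial(const std::vector<std::uint8_t>& src,
                                           std::uint8_t level) {
    std::vector<std::uint8_t> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = src[i] > level ? 255 : 0;
    return dst;
}

// Identify: each output pixel depends only on the input pixel at the same index.
// Express: an OpenMP directive distributes the loop iterations over threads.
// If OpenMP is not enabled at compile time, the pragma is ignored and the
// loop simply runs serially.
std::vector<std::uint8_t> threshold_parallel(const std::vector<std::uint8_t>& src,
                                             std::uint8_t level) {
    std::vector<std::uint8_t> dst(src.size());
    const long long n = static_cast<long long>(src.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        dst[k] = src[k] > level ? 255 : 0;
    }
    return dst;
}

// Validate: the parallel result must match the serial reference exactly.
bool validate(const std::vector<std::uint8_t>& src, std::uint8_t level) {
    return threshold_serial(src, level) == threshold_parallel(src, level);
}
```

The fourth step, optimization, would start only once `validate` passes, for example by measuring both versions on realistic frame sizes.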
  • 25. Multi-threading Demands New Programming Skills ❖ Previous multi-threading techniques • Windows thread, pthread, OpenMP, MPI, … ❖ New techniques • CUDA, C++ AMP, OpenCL, Renderscript, OpenACC, MapReduce, … ❖ Concepts • Race condition, deadlock, • Domain partition, function partition, …
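Two of the concepts listed above, domain partition and race condition, can be shown together in a short sketch. The example is an assumption made for illustration: it sums pixel values with standard C++ threads, and the name `parallel_pixel_sum` is hypothetical. On GCC or Clang it typically needs the `-pthread` flag.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Domain partition: the pixel array is split into contiguous slices, one per thread.
// Race avoidance: each thread writes only to its own slot in `partial`, so no two
// threads update the same memory location and no lock is required.
std::uint64_t parallel_pixel_sum(const std::vector<std::uint8_t>& pixels,
                                 unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::uint64_t> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Slice length, rounded up so that all pixels are covered.
    const std::size_t chunk = (pixels.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin =
            std::min(pixels.size(), static_cast<std::size_t>(t) * chunk);
        const std::size_t end = std::min(pixels.size(), begin + chunk);
        workers.emplace_back([&pixels, &partial, t, begin, end] {
            std::uint64_t local = 0;
            for (std::size_t i = begin; i < end; ++i) local += pixels[i];
            partial[t] = local;  // the only write to shared data, one slot per thread
        });
    }
    for (auto& w : workers) w.join();

    // Combine the per-thread results after all threads have finished.
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Had every thread added directly into one shared counter without synchronization, the result would depend on thread timing, which is the race condition the slide refers to.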
  • 26. Multicore Programming Practice (MPP) ❖ Goal: Write portable C/C++ programs to be "Multicore ready" and platform compatible • Proposed by an MPP working group in the Multicore Association http://www.multicore-association.org/workgroup/mpp.php
  • 27. OpenACC ❖ An organization that develops an API which • Describes a collection of compiler directives • To specify loops and regions of code in standard C, C++ and Fortran • To be offloaded from a host CPU to an attached accelerator, including • APUs, GPUs, and many-core coprocessors
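As a rough illustration of what such a directive looks like, the sketch below marks one loop for offloading. The function `scale_pixels` and its parameters are hypothetical. `parallel loop`, `copyin`, and `copyout` are standard OpenACC constructs, but whether the loop actually runs on an accelerator depends on the compiler and its flags.

```cpp
#include <cassert>

// Multiply every input value by `gain`.
// The directive asks an OpenACC-capable compiler to offload the loop:
// copyin  transfers in[0:n] from host to accelerator before the loop,
// copyout transfers out[0:n] back to the host afterwards.
// A compiler without OpenACC support ignores the pragma and runs the loop on the host CPU.
void scale_pixels(const float* in, float* out, int n, float gain) {
    assert(n >= 0);
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}
```

The loop body stays ordinary C++, so the same source builds with or without an accelerator.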
  • 28. HSA Foundation ❖ Heterogeneous System Architecture • Key members: AMD, QUALCOMM, ARM, SAMSUNG, TI ❖ System architecture easing efficient use of accelerators, SoCs • Intended to support high-level parallel programming frameworks • OpenCL, C++, C#, OpenMP, Java • Accelerator requirements • Full-system SVM, memory coherency, preemption, user-mode dispatch
  • 29. The ParLab in Berkeley ❖ The Parallel Computing Lab at UC Berkeley http://parlab.eecs.berkeley.edu • The ParLab offers programmers a practical introduction to parallel programming techniques and tools on current parallel computers, emphasizing multicore and manycore computers.
  • 30. HPEC ❖ High Performance Embedded Computing • MIT Lincoln Lab, 1997 ~
  • 31. OpenCL ❖ Royalty-free, cross-platform, cross-vendor standard • Targeting: supercomputers → embedded systems → mobile devices ❖ Enables programming of diverse compute resources • CPU, GPU, DSP, FPGA, …
  • 32. OpenCL Working Group Members ❖ Diverse industry participation – many industry experts ❖ NVIDIA is chair, Apple is specification editor
  • 33. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) • Vendor, Hardware ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6)
  • 34. 2. GPGPU: PC platform, Mobile platform
  • 35. Why GPGPU ❖ GPGPU has many-core (vs. multi-core) • Suitable for massively parallel computing
  • 36. GPGPU as a Coprocessor: Heterogeneous Computing
  • 37. PC Platform • Discrete GPUs • GPGPU card as a coprocessor, attached over PCIe • From PC to PSC (Personal Super-Computer)
  • 38. Mobile Platform • Integrated GPUs • GPGPU sub-chip as a coprocessor (no PCIe between CPU and GPGPU) • From mobile phone to mobile personal computer
  • 39. GPGPU Solutions - nVidia • Compute Architecture: Tesla, Fermi, Kepler, … • PC • GeForce, Quadro • Tesla • 870, 1060, 2070, K40 • Mobile • Tegra: …, 4, K1 (192 cores) • Source: It’s Tegra K1 Everywhere at Google I/O, Embedded Vision Alliance, 2014/7/7.
  • 40. GPGPU Solutions – Qualcomm/AMD ❖ Qualcomm, AMD, ATI ❖ APU: integrated CPU+GPU ❖ Low energy consumption ❖ PC (AMD): FirePro ❖ Mobile (Snapdragon): • Adreno: 330 (32 cores)
  • 41. GPGPU Solutions - ARM ❖ Mali ❖ Samsung Exynos, MediaTek ❖ Compute engine after T-600 ❖ Exynos 5 ❖ At most 8 cores (Mali-T678)
  • 42. Intel – Multicore CPU • PC (Xeon Phi) • Iris Pro GPU • Knights Landing: 60 cores • Knights Corner: 48 CPU cores, PCIe • Mobile • Haswell • Atom
  • 43. Applications of GPGPU http://developer.nvidia.com/category/zone/cuda-zone
  • 44. Heterogeneous Architecture ❖ Host: CPU ❖ Device: GPGPU ❖ Notice: memory hierarchy in device
  • 45. GPGPUs Architecture - nVidia ❖ GT200 • GTX 260/280, Quadro 5800, Tesla 1060 ❖ Fermi • Tesla 2060 ❖ [Figure: CPU (host), multicore, with control logic, cache, a few ALUs, and DRAM, next to GPU (device), many-core, with many ALUs and DRAM]
  • 46. nVidia GPGPU Architecture ❖ SM/SP (Streaming Multiprocessor/Streaming Processor) + Shared memory + DRAM
  • 47. Memory Hierarchy ❖ On-Chip Memory • Registers • Shared Memory • Constant Memory (cache) • Texture Memory (cache) ❖ Off-Chip Memory • Local Memory • Global Memory
  • 48. GPGPU vs. FPGA ❖ GPU: nVidia GeForce GTX 280, GTX 580 ❖ FPGA: Xilinx Virtex4, Virtex5 • Source: A Comparison of FPGA and GPU for Real-Time Phase-Based Optical Flow, Stereo, and Local Image Features, IEEE Transactions on Computers, 2012.
  • 49. GPGPU vs. FPGA ❖ GPU: nVidia GeForce 7900 GTX ❖ FPGA: Xilinx Virtex-4 • Source: Performance Comparison of Graphics Processors to Reconfigurable Logic: A Case Study, IEEE Transactions on Computers, 2010.
  • 50. GPGPU vs. FPGA vs. Multicore ❖ Application: 2-D image convolution • GPU: nVidia GeForce 295 GTX • FPGA: Altera Stratix III E260 • Source: A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications, ACM/SIGDA International Symposium on FPGA, 2012.
  • 51. However, GPGPU May Not Always Improve Speed & Energy
  • 52. Hardware vs. Software • GPGPU hardware: nVidia, Qualcomm, ARM, Intel • Parallel programming: CUDA, OpenCL, RenderScript, C++ AMP
  • 53. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) • CUDA, Renderscript, OpenCL, … ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6)
  • 54. 3. Parallel Programming: Multi-threading, Programming Languages for Parallelism
  • 55. Parallel Computing ❖ Serial Computing ❖ Parallel Computing • [Figure: a CPU/GPU with four cores]
  • 56. Parallel Programming ❖ Much code is written in C/C++/Java • Especially algorithmic programs ❖ Can we write GPGPU parallel programs in C/C++/Java? ❖ However, C/C++ is sequential • Three control structures of C/C++/Java: sequence, selection, repetition
  • 57. Multi-threading ❖ Multi-threading is the fundamental concept for parallel programming • Some techniques are ready • Pthread, Win32 thread, OpenMP, MPI, Intel TBB (Threading Building Blocks), ... • New techniques • CUDA, OpenCL, Renderscript, OpenACC, C++ AMP, ...
  • 58. Parallel Programming Models
  • 59. Parallel Programming in Sequential Language ❖ Do we need to learn new languages for multi-threading? • No ❖ Write multi-threading code in C/C++ • Add functions/directives to C/C++ for multi-threading • That is how current solutions work • pthread, Win32 thread, OpenMP, MPI, CUDA, OpenCL, ...
  • 60. Decompose the Problem ❖ Two basic approaches to partition computational work • Domain decomposition • Partition the data used in solving the problem • Function decomposition • Partition the jobs (functions) from the overall work (problem) • [Figure: GPGPU and CPU cooperate]
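Function decomposition can be sketched in C++ as two different jobs running concurrently on the same data. The names `ImageStats` and `compute_stats_concurrently` are hypothetical, chosen only for this illustration; domain decomposition would instead split `pixels` itself among threads.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical result type for this sketch.
struct ImageStats {
    long sum;
    int max;
};

// Function decomposition: two *different* jobs (sum and max) run
// concurrently, each reading the whole input.
ImageStats compute_stats_concurrently(const std::vector<int>& pixels) {
    assert(!pixels.empty());
    std::future<long> sum_task = std::async(std::launch::async, [&pixels]() {
        return std::accumulate(pixels.begin(), pixels.end(), 0L);
    });
    std::future<int> max_task = std::async(std::launch::async, [&pixels]() {
        return *std::max_element(pixels.begin(), pixels.end());
    });
    // get() blocks until the corresponding job has finished.
    return ImageStats{sum_task.get(), max_task.get()};
}
```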
  • 61. Multi-Threading ❖ A program running in serial vs. in parallel • [Figure source: http://en.wikipedia.org/wiki/Thread_(computer_science)]
  • 62. Domain Decomposition (1/3) ❖ An image example • It is 2D data • Three popular ways to partition it
  • 63. Domain Decomposition (2/3) ❖ Domain data are usually processed by a loop
      for (i = 0; i < height; i++)
          for (j = 0; j < width; j++)
              img2[i][j] = RemoveNoise(img1[i][j]);
    [Figure: original image (img1) and enhanced image (img2), the X-ray image of a circuit board, indexed by row i and column j; labels: SIMD, SPMD, SIMT]
  • 64. Domain Decomposition (3/3) ❖ A three-block partition example
      // Thread 1
      for (i = 0; i < height/3; i++)
          for (j = 0; j < width; j++)
              img2[i][j] = RemoveNoise(img1[i][j]);
      // Thread 2
      for (i = height/3; i < height*2/3; i++)
          for (j = 0; j < width; j++)
              img2[i][j] = RemoveNoise(img1[i][j]);
      // Thread 3
      for (i = height*2/3; i < height; i++)
          for (j = 0; j < width; j++)
              img2[i][j] = RemoveNoise(img1[i][j]);
    [Figure: rows i=0 to i=11 split into subdomain 1, subdomain 2, and subdomain 3, between fork (threads) and join (barrier); labels: OpenMP, CUDA (SPMD)]
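The three-block partition can be written as one runnable C++ sketch with `std::thread`. Everything here is an assumption for illustration: the `Image` alias, the `ProcessRows` and `RemoveNoiseParallel` helpers, and in particular the body of `RemoveNoise`, which is only a placeholder (a clamp to 0..255) because the slides do not define it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

using Image = std::vector<std::vector<int>>;

// Placeholder for the slide's RemoveNoise(): clamps a pixel to [0, 255].
int RemoveNoise(int pixel) {
    return pixel < 0 ? 0 : (pixel > 255 ? 255 : pixel);
}

// Work of one thread: rows [row_begin, row_end) of the image.
void ProcessRows(const Image& img1, Image& img2, int row_begin, int row_end) {
    for (int i = row_begin; i < row_end; ++i)
        for (std::size_t j = 0; j < img1[i].size(); ++j)
            img2[i][j] = RemoveNoise(img1[i][j]);
}

// Fork num_threads workers, one horizontal band each, then join.
Image RemoveNoiseParallel(const Image& img1, int num_threads) {
    assert(num_threads > 0);
    const int height = static_cast<int>(img1.size());
    Image img2 = img1;  // output with the same shape as the input
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        int row_begin = height * t / num_threads;
        int row_end = height * (t + 1) / num_threads;
        workers.emplace_back(ProcessRows, std::cref(img1), std::ref(img2),
                             row_begin, row_end);
    }
    for (std::thread& w : workers) w.join();  // barrier
    return img2;
}
```

With `num_threads = 3` the band limits reduce to `height/3` and `height*2/3`, as on the slide. The threads never write the same row, so no locking is needed.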
  • 65. GPGPU Programming: SIMT model ❖ CPU (“host”) program often written in C or C++ ❖ GPU code is written as a sequential kernel in (usually) a C or C++ dialect
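The idea of a "sequential kernel" run by many threads can be emulated on the CPU in plain C++. This is only a sketch: `ThreadContext`, `scale_kernel`, and `launch_scale` are hypothetical names that stand in for the block and thread indices a GPU runtime provides, and the loops here run serially where a GPU would run the threads in parallel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-thread indices of a GPU runtime.
struct ThreadContext {
    int block_idx;   // which block this thread belongs to
    int block_dim;   // threads per block
    int thread_idx;  // index of the thread inside its block
};

// The "kernel": sequential code for ONE thread handling ONE element.
void scale_kernel(const ThreadContext& ctx, const std::vector<float>& in,
                  std::vector<float>& out, float factor) {
    int i = ctx.block_idx * ctx.block_dim + ctx.thread_idx;  // global index
    if (i < static_cast<int>(in.size()))  // guard: the grid may overshoot
        out[i] = in[i] * factor;
}

// Serial emulation of launching a grid of blocks of threads.
std::vector<float> launch_scale(const std::vector<float>& in, float factor,
                                int block_dim) {
    assert(block_dim > 0);
    std::vector<float> out(in.size());
    int n = static_cast<int>(in.size());
    int grid_dim = (n + block_dim - 1) / block_dim;  // round up
    for (int b = 0; b < grid_dim; ++b)
        for (int t = 0; t < block_dim; ++t)
            scale_kernel(ThreadContext{b, block_dim, t}, in, out, factor);
    return out;
}
```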
  • 66. GPGPU Programming Techniques: CUDA, OpenCL, C++ AMP, RenderScript
  • 67. GPGPU Programming Techniques
  • 68. CUDA
  • 69. CUDA ❖ CUDA: Compute Unified Device Architecture ❖ Parallel programming for nVidia's GPGPU ❖ Uses the C/C++ language • Java, Fortran, and Matlab can also be used ❖ When executing CUDA programs, the GPU operates as a coprocessor to the main CPU
  • 70. CUDA Hardware Environment: CPU+GPU ❖ CPU • Organizes, interprets, and communicates information ❖ GPU • Handles the core processing on large quantities of parallel information • Compute-intensive portions of applications that are executed many times, but on different data, are extracted from the main application and compiled to execute in parallel on the GPU • [Figure: CPU and GPU connected by PCI-E]
  • 71. CUDA Software Stack
  • 72. Processing Flow on CUDA • 1. Allocate device memory • 2. Copy processing data (main memory to memory for GPU) • 3. Instruct the processing (CPU to GPU) • 4. Execute in parallel in each core • 5. Copy the result (memory for GPU to main memory) • 6. Release device memory
  • 73. Programming with Memory Hierarchy ❖ Locality principle • Temporal locality • Spatial locality
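Spatial locality can be illustrated with two traversals of the same row-major image buffer. The function names are hypothetical, and the sketch makes no claim about how large the speed difference is, since that depends on the cache and memory system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Image stored row-major in one contiguous buffer:
// pixel (i, j) lives at data[i * width + j].

// Inner loop walks consecutive addresses: good spatial locality.
long sum_row_major(const std::vector<int>& data, std::size_t height,
                   std::size_t width) {
    assert(data.size() == height * width);
    long sum = 0;
    for (std::size_t i = 0; i < height; ++i)
        for (std::size_t j = 0; j < width; ++j)
            sum += data[i * width + j];
    return sum;
}

// Inner loop jumps by `width` elements each step: poor spatial locality,
// although the result is identical.
long sum_column_major(const std::vector<int>& data, std::size_t height,
                      std::size_t width) {
    assert(data.size() == height * width);
    long sum = 0;
    for (std::size_t j = 0; j < width; ++j)
        for (std::size_t i = 0; i < height; ++i)
            sum += data[i * width + j];
    return sum;
}
```

Reusing the running `sum` in every iteration is an example of temporal locality: the same location is touched again and again within a short time.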
  • 74. Example - Hello World (1/3)
      int main() {
          char src[12] = "Hello World";
          char h_hello[12];
          char* d_hello1;
          char* d_hello2;
          cudaMalloc((void**)&d_hello1, sizeof(char) * 12);
          cudaMalloc((void**)&d_hello2, sizeof(char) * 12);
          cudaMemcpy(d_hello1, src, sizeof(char) * 12, cudaMemcpyHostToDevice);
          hello<<<1, 1>>>(d_hello1, d_hello2);  // call the kernel function
    [Figure: host buffers src and h_hello; device buffers d_hello1 and d_hello2]
  • 75. Example - Hello World (2/3) ❖ Kernel Function
      __global__ void hello(char* hello1, char* hello2) {
          int k;
          // Copy characters until the string terminator is reached.
          for (k = 0; hello1[k] != '\0'; k++) {
              hello2[k] = hello1[k];
          }
          hello2[k] = '\0';  // copy the terminator so the host can print the result
      }
    No parallel processing in this example
    [Figure: host buffers src and h_hello; device buffers d_hello1 and d_hello2]
  • 76. Example - Hello World (3/3)
          cudaMemcpy(h_hello, d_hello2, sizeof(char) * 12, cudaMemcpyDeviceToHost);
          printf("%s\n", h_hello);
          cudaFree(d_hello1);
          cudaFree(d_hello2);
          system("pause");
          return 0;
      }
    [Figure: program output; host buffers src and h_hello; device buffers d_hello1 and d_hello2]
  • 77. OpenCL Standard
  • 78. The Inspiration for OpenCL
  • 79. What's OpenCL ❖ One code tree can be executed on CPUs, GPUs, DSPs and hardware • Dynamically interrogate system load and balance across available processors ❖ Powerful, low-level flexibility • Foundational access to compute resources for higher-level engines, frameworks and languages
  • 80. Broad OpenCL Implementer Adoption ❖ Multiple conformant implementations shipping on desktop and mobile ❖ Android ICD extension released in latest extension specification ❖ Multiple implementations shipping in Android NDK
  • 81. OpenCL Enables Portability ❖ C to gates programs are proprietary
  • 82. Altera OpenCL SDK for FPGAs
  • 83. NVIDIA OpenCL SDK for GPU
  • 84. AMD OpenCL Optimization Case Study ❖ Platform • AMD Phenom II X4 965 CPU (quad core) • ATI Radeon HD 5870 GPU ❖ Unoptimized CPU performance: 1 GFLOP/s ❖ Optimized CPU performance reaches: 4 GFLOP/s ❖ Optimized GPU performance reaches: 50 GFLOP/s
  • 85. Example - Hello World(1/3) Including Declaring
  • 86. Example - Hello World(2/3) Creating
  • 87. Example - Hello World(2/3) Do Copy to host & display Creating
  • 88. Example - Hello World(3/3) Kernel Function
  • 89. C++ AMP Microsoft
  • 90. What's C++ AMP (1/2) ❖ Microsoft’s C++ AMP (Accelerated Massive Parallelism) • Part of Visual C++, integrated with Visual Studio, built on Direct3D • “Performance for the mainstream” ❖ STL-like library for multidimensional array data • Special convenience support for 1, 2, and 3 dimensional arrays on CPU or GPU • C++ AMP runtime handles CPU<->GPU data copying • Tiles enable efficient processing of sub-arrays
  • 91. What's C++ AMP (2/2) ❖ parallel_for_each • Executes a kernel (C++ lambda) at each point in the extent • restrict() clause specifies where to run the kernel: cpu (default) or amp (GPU; named direct3d in pre-release versions)
  • 92. Example - Hello World(1/2) Declaring & Copying to device
  • 93. Example - Hello World(2/2) Do Display
  • 94. RenderScript Google Android
  • 95. What's Renderscript (1/2) ❖ Higher-level than CUDA or OpenCL: simpler & less performance control • Emphasis on mobile devices & cross-SoC performance portability ❖ Programming model • C99-based kernel language, JIT-compiled, single input-single output • Automatic Java class reflection • Intrinsics: built-in, highly-tuned operations, e.g. ScriptIntrinsicConvolve3x3 • Script groups combine kernels to amortize launch cost & enable kernel fusion
  • 96. What's Renderscript (2/2) ❖ Data type: • 1D/2D collections of elements, C types like int and short2, types include size • Runtime type checking ❖ Parallelism • Implicit: one thread per data element, atomics for thread-safe access • Thread scheduling not exposed, VM-decided
  • 97. RenderScript Architecture
  • 98. Low Level Virtual Machine ❖ Low Level Virtual Machine (LLVM) is a compiler infrastructure
  • 99. Offline Compiler Flow
  • 100. Renderscript Compiler: libbcc
  • 101. Renderscript Project Framework
  • 102. Example - Hello World(1/8)
  • 103. Example - Hello World(2/8) HelloWorld.java
  • 104. Example - Hello World(3/8) HelloWorld.java
  • 105. Example - Hello World(4/8) HelloWorldView.java
  • 106. Example - Hello World(5/8) HelloWorldView.java
  • 107. Example - Hello World(6/8) HelloWorldRS.java
  • 108. Example - Hello World(7/8) HelloWorldRS.java
  • 109. Example - Hello World(7/8) ScriptC_helloworld.java
  • 110. Example - Hello World(7/8) ScriptC_helloworld.java
  • 111. Example - Hello World(8/8) HelloWorld.rs
  • 112. Comparison (1/2) ❖ Renderscript vs. Native (NDK) vs. Java (SDK) • OS: Honeycomb v3.2 (CPU only) • Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." in Proc. First Asia-Pacific Programming Languages and Compilers Workshop (APPLC). 201
  • 113. Comparison (2/2) ❖ OpenCL & CUDA • Sobel filter with (CMw) and without (CMw/o) constant memory • OpenCL’s portability does not fundamentally affect its performance • Fang, Jianbin, Ana Lucia Varbanescu, and Henk Sips. "A comprehensive performance comparison of CUDA and OpenCL." in Proc. International Conference on Parallel Processing (ICPP), 2011.
  • 114. GPGPU Programming • Performance: more control, better performance • Productivity: ease of use, quick programming, portability
  • 115. Parallelization ❖ Multicore/Multi-threading ❖ Data Parallelization • Data distribution • Parallel convolution • Reduction algorithm • Amdahl’s law ❖ Memory Hierarchy Management • Locality principle • Program accesses a relatively small portion of the address space at any instant of time
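Amdahl's law, listed above, bounds the speedup of a program when only a fraction p of its run time can be parallelized over n processors: S(n) = 1 / ((1 - p) + p / n). A minimal C++ sketch follows; the function name `amdahl_speedup` is hypothetical.

```cpp
#include <cassert>

// Amdahl's law: speedup on `processors` processors when a fraction
// `parallel_fraction` (0..1) of the run time can be parallelized.
double amdahl_speedup(double parallel_fraction, int processors) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(processors > 0);
    double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / processors);
}
```

For example, with 95% of the work parallelizable, 240 processors give a speedup of about 18.5, and no number of processors can exceed 1 / 0.05 = 20.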
  • 116. Multi-thread Programming with the Discipline of Parallelization ❖ Identify parallelism: Analyze algorithm ❖ Express parallelism: Write parallel code ❖ Validate parallelism: Debug & verify parallel code ❖ Optimize parallelism: Enhance parallel performance
  • 117. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6)
  • 118. 4. OpenCV Acceleration
  • 119. What Is OpenCV ❖ A very popular computer vision library • 6M downloads • BSD license • 2000 ~ CV functions • Modularized and efficient • Optimization • Intel SSE, IPP, TBB • ARM NEON & GLSL (Tegra) • CUDA, OpenCL
  • 120. OpenCV Modules ❖ Image/video I/O, processing, display (core, imgproc, highgui) ❖ Object/feature detection (objdetect, features2d, nonfree) ❖ Geometry-based monocular or stereo computer vision (calib3d, stitching, videostab) ❖ Computational photography (photo, video, superres) ❖ Machine learning & clustering (ml, flann) ❖ CUDA and OpenCL GPU acceleration (gpu, ocl) • Normal CV modules: 14 • Acceleration modules: 2
  • 121. OpenCV GPU Module ❖ Implemented using the NVIDIA CUDA Runtime API ❖ Latest version: 2.4.9 • Utilizing Multiple GPUs ❖ Implemented modules: 11 ❖ Implemented functions: 270 • Focus on PC platform • Not fully compatible with mobile GPGPU on Android
  • 122. CUDA Matrix Operations ❖ Point-wise matrix math • gpu::add(), ::sum(), ::div(), ::sqrt(), ::sqrSum(), ::meanStdDev(), ::min(), ::max(), ::minMaxLoc(), ::magnitude(), ::norm(), ::countNonZero(), ::cartToPolar(), etc. ❖ Matrix multiplication • gpu::gemm() ❖ Channel manipulation • gpu::merge(), ::split()
  • 123. CUDA Geometric Operations ❖ Image resize with sub-pixel interpolation • gpu::resize() ❖ Image rotate with sub-pixel interpolation • gpu::rotate() ❖ Image warp (e.g., panoramic stitching) • gpu::warpPerspective(), ::warpAffine()
  • 124. CUDA other Math and Geometric Operations ❖ Integral images • gpu::integral(), ::sqrIntegral() ❖ Custom geometric transformation (e.g., lens distortion correction) • gpu::remap(), ::buildWarpCylindricalMaps(), ::buildWarpSphericalMaps()
  • 125. CUDA Image Processing (1/2) ❖ Smoothing • gpu::blur(), ::boxFilter(), ::GaussianBlur() ❖ Morphological • gpu::dilate(), ::erode(), ::morphologyEx() ❖ Edge Detection • gpu::Sobel(), ::Scharr(), ::Laplacian(), gpu::Canny() ❖ Custom 2D filters • gpu::filter2D(), ::createFilter2D_GPU(), ::createSeparableFilter_GPU() ❖ Color space conversion • gpu::cvtColor()
  • 126. CUDA Image Processing (2/2) ❖ Image blending • gpu::blendLinear() ❖ Template matching (automated inspection) • gpu::matchTemplate() ❖ Gaussian pyramid (scale invariant feature/object detection) • gpu::pyrUp(), ::pyrDown() ❖ Image histogram • gpu::calcHist(), gpu::histEven(), gpu::histRange() ❖ Contrast enhancement • gpu::equalizeHist()
  • 127. CUDA De-noising ❖ Gaussian noise removal • gpu::FastNonLocalMeansDenoising() ❖ Edge preserving smoothing • gpu::bilateralFilter()
  • 128. CUDA Fourier and MeanShift ❖ Fourier analysis • gpu::dft(), ::convolve(), ::mulAndScaleSpectrums(), etc. ❖ MeanShift • gpu::meanShiftFiltering(), ::meanShiftSegmentation()
  • 129. CUDA Shape Detection ❖ Line detection (e.g., lane detection, building detection, perspective correction) • gpu::HoughLines(), ::HoughLinesDownload() ❖ Circle detection (e.g., cells, coins, balls) • gpu::HoughCircles(), ::HoughCirclesDownload()
  • 130. CUDA Object Detection ❖ HAAR and LBP cascaded adaptive boosting (e.g., face, nose, eyes, mouth) • gpu::CascadeClassifier_GPU::detectMultiScale() ❖ HOG detector (e.g., person, car, fruit, hand) • gpu::HOGDescriptor::detectMultiScale()
  • 131. CUDA Object Recognition ❖ Interest point detectors • gpu::cornerHarris(), ::cornerMinEigenVal(), ::SURF_GPU, ::FAST_GPU, ::ORB_GPU(), ::GoodFeaturesToTrackDetector_GPU() ❖ Feature matching • gpu::BruteForceMatcher_GPU(), ::BFMatcher_GPU()
  • 132. CUDA Stereo and 3D ❖ RANSAC • gpu::solvePnPRansac() ❖ Stereo correspondence (disparity map) • gpu::StereoBM_GPU(), ::StereoBeliefPropagation(), ::StereoConstantSpaceBP(), ::DisparityBilateralFilter() ❖ Represent stereo disparity as 3D or 2D • gpu::reprojectImageTo3D(), ::drawColorDisp()
  • 133. CUDA Optical Flow ❖ Dense/sparse optical flow • gpu::FastOpticalFlowBM(), ::PyrLKOpticalFlow, ::BroxOpticalFlow(), ::FarnebackOpticalFlow(), ::OpticalFlowDual_TVL1_GPU(), ::interpolateFrames()
  • 134. CUDA Background Segmentation ❖ Foreground/background segmentation (e.g., object detection/removal, motion tracking, background removal) • gpu::FGDStatModel, ::GMG_GPU, ::MOG_GPU, ::MOG2_GPU
  • 135. Performance of OpenCV GPU Accelerators on PC
  • 136. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6)
  • 137. 5. Computer Vision Acceleration on PC: Image enhancement (HDR), Feature extraction, Video surveillance cloud
  • 138. HDR and Image Enhancement
  • 139. HDR Image Enhancement ❖ Restore and enhance an image ❖ Its complexity is high for large images • Complexity: O(N^2) ~ O(N^3) • [Figure: original image vs. restored image]
  • 140. Algorithms for Image Restoration ❖ Wiener Filter ❖ Histogram Based Approach • Histogram Equalization, Histogram Modification, … ❖ Retinex • Path-based Retinex • Recursive Retinex • Center/surround Retinex • Has no iterative process, so it is suitable for parallelization • Multi-Scale Retinex with Color Restoration (MSRCR) [Rahman et al. 1997]
  • 141. MSRCR Algorithm (symbols given in the commonly used MSRCR notation) • R_MSRCR_i(x, y): the MSRCR output • I_i(x, y): the original image distribution in the ith spectral band • F_k(x, y): the kth Gaussian surround function • *: the convolution operation • W_k: the weight • C_i(x, y): the color restoration factor in the ith spectral band • N: the number of spectral bands • β: the gain constant • α: controls the strength of the nonlinearity
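The equation itself was an image on the slide and is not in the transcript. As a hedged reconstruction, the MSRCR formulation commonly given in the literature matches the quantities listed above: output R, input band I_i, Gaussian surround F_k, weight W_k, color restoration factor C_i, N spectral bands, gain β, and nonlinearity strength α. The number of scales K is an assumption, since the slide does not list it.

```latex
% Multi-scale retinex with color restoration, band i, pixel (x, y)
R_{\mathrm{MSRCR}_i}(x, y) = C_i(x, y) \sum_{k=1}^{K} W_k
    \left\{ \log I_i(x, y) - \log\left[ F_k(x, y) * I_i(x, y) \right] \right\}

% Color restoration factor
C_i(x, y) = \beta \left\{ \log\left[ \alpha\, I_i(x, y) \right]
    - \log\left[ \sum_{j=1}^{N} I_j(x, y) \right] \right\}
```

Each output pixel depends only on the input image and its blurred versions, which is why the center/surround form suits per-pixel parallelization.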
  • 142. The Method • CPU: copy data from CPU to GPGPU • GPGPU: Gaussian blur, log-domain processing, normalization, histogram stretching • Copy data from GPGPU to CPU • Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 2011. • Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19.
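As an illustration of the last stage, here is a minimal C++ sketch of histogram stretching. It assumes the simplest min-max linear stretch to 0..255; the cited papers may use a different variant (for example with clipping), and the function name `stretch_to_8bit` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Linearly maps [min, max] of the input onto [0, 255] with rounding.
// Assumption: plain min-max stretch, no percentile clipping.
std::vector<int> stretch_to_8bit(const std::vector<float>& values) {
    assert(!values.empty());
    auto mm = std::minmax_element(values.begin(), values.end());
    const float lo = *mm.first;
    const float hi = *mm.second;
    std::vector<int> out(values.size(), 0);
    if (hi == lo) return out;  // flat input: nothing to stretch
    for (std::size_t i = 0; i < values.size(); ++i) {
        float scaled = (values[i] - lo) / (hi - lo) * 255.0f;
        out[i] = static_cast<int>(scaled + 0.5f);  // round to nearest
    }
    return out;
}
```

Finding the minimum and maximum is a reduction, and the mapping loop is independent per pixel, so both parts fit the data-parallel pattern used in the other stages.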
  • 143. Parallelization by GPGPU ❖ Multicore/Multi-threading • Tesla C1060: 240 SPs (Stream Processors) • CUDA: Thread, Block, Grid ❖ Data Parallelization • Parallel convolution: [Figure: image pixels distributed over processing elements (PE i), M pixels per PE vs. 1 pixel per PE] • Parallel reduction: [Figure: 8 PEs sum A(0) to A(7) pairwise over successive time steps: A(0)+A(1), A(2)+A(3), A(4)+A(5), A(6)+A(7); then A(0)+A(1)+A(2)+A(3), A(4)+A(5)+A(6)+A(7); then the final sum]
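The pairwise summation in the figure can be sketched in C++ as a tree reduction. The function name `tree_sum` is hypothetical, and the steps run serially here; on a GPGPU, all additions inside one step are independent and could be done by separate processing elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) reduction: in each step, element i accumulates
// element i + stride, and the stride doubles every step.
long tree_sum(std::vector<long> a) {
    assert(!a.empty());
    const std::size_t n = a.size();
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        // These additions do not depend on each other.
        for (std::size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];  // the total ends up in element 0
}
```

For 8 elements this takes 3 steps instead of 7 sequential additions, matching the three levels of the figure.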
  • 144. Our Memory Hierarchy • Stages: Parallel Gaussian Blur, Parallel Log-domain Processing, Parallel Normalization, Parallel Histogram Stretching • Memories used: Texture Memory, Constant Memory, Global Memory, Shared Memory
  • 145. Experimental Results (1/2) • [Figure: original images, CPU results, GPGPU results]
  • 146. Experimental Results (2/2) • [Figure: original images, CPU results, GPGPU results]
  • 147. GPGPU Speedup over CPU • [Figure labels: 74x, 2x] • Ideal speedup: 240 * (1.296 GHz / 3 GHz) ≈ 103.7 • NPP: nVidia Performance Primitives
  • 148. Feature Extraction (SIFT)
  • 149. What Is SIFT ❖ SIFT • Scale Invariant Feature Transform ❖ Invariance of feature points • Translation • Rotation • Scale
  • 150. Applications of SIFT ❖ Object recognition/tracking ❖ Image retrieval ❖ Autostitch
  • 151. Parallelize SIFT by GPGPU • CPU: Intel Q9400, quad-core (2.66 GHz) • GPU: GeForce GTS 250, 128 SPs (1.836 GHz)
  • 152. Experimental Results • [Figure: CPU vs. GPU]
  • 153. Execution Time (ms) • CPU: 10 seconds on average • GPGPU: 0.8 seconds on average
  • 154. Speedup • 13x speedup on average
  • 155. Video Surveillance Cloud
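  • 156. GPGPU Cloud Video Surveillance System • Restricted-area intrusion detection • PTZ camera tracking • Camera anomaly detection • High-efficiency video event browsing system • Central video and message management system • Multi-resolution wide-area surveillance system • Outdoor • Parking lot • Vacant-space detection • Illegal parking detection • Dynamic scenes • Face detection • Storage Area Network • PC • Mobile device • Multi-core • Hypervisor • GPGPU
  • 157. Private cloud server room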
  • 158. Today We Talk About ❖ Why GPGPU's multicore is better (Sec. 2) ❖ How to do parallel programming (Sec. 3) ❖ OpenCV Acceleration (Sec. 4) ❖ Computer vision Acceleration-PC (Sec. 5) ❖ Computer vision Acceleration-Android (Sec. 6)
  • 159. 6. Computer Vision Acceleration on Android: OpenCV, RenderScript
  • 160. OpenCV on Android
  • 161. OpenCV4Android SDK ❖ Enables development of Android applications that use the OpenCV library ❖ Uses the Java Native Interface (JNI) to access native C/C++ code directly ❖ Supports nVidia's Tegra Android Development Pack (TADP) • Not fully compatible with the GPU module
  • 162. System Framework
  • 163. Two Methods to Call OpenCV ❖ Using Java API ❖ Using native C++
  • 164. OpenCV for Android SDK by GPU(1/5)
  • 165. OpenCV for Android SDK by GPU(2/5)
  • 166. OpenCV for Android SDK by GPU(3/5)
  • 167. OpenCV for Android SDK by GPU(4/5)
  • 168. OpenCV for Android SDK by GPU(5/5)
  • 169. RenderScript on Android with GPU Acceleration
  • 170. RenderScript on Android with GPU(1/5)
  • 171. RenderScript on Android with GPU(2/5)
  • 172. RenderScript on Android with GPU(3/5)
  • 173. RenderScript on Android with GPU(4/5)
  • 174. RenderScript on Android with GPU(5/5)
  • 175. RenderScript Image Processing Intrinsics
      Name: Operation
      ScriptIntrinsicConvolve3x3, ScriptIntrinsicConvolve5x5: Performs a 3x3 or 5x5 convolution.
      ScriptIntrinsicBlur: Performs a Gaussian blur. Supports grayscale and RGBA buffers and is used by the system framework for drop shadows.
      ScriptIntrinsicYuvToRGB: Converts a YUV buffer to RGB. Often used to process camera data.
      ScriptIntrinsicColorMatrix: Applies a 4x4 color matrix to a buffer.
      ScriptIntrinsicBlend: Blends two allocations in a variety of ways.
      ScriptIntrinsicLUT: Applies a per-channel lookup table to a buffer.
      ScriptIntrinsic3DLUT: Applies a color cube with interpolation to a buffer.
      ScriptIntrinsicHistogram: Intrinsic histogram filter.
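To make the convolution row of the table concrete, the following plain-Java sketch shows the per-pixel arithmetic that a 3x3 convolution performs on a single-channel 8-bit image. It is a CPU reference for illustration only, not the RenderScript implementation: the class name `Convolve3x3Reference`, the clamp-to-edge border handling, and applying the coefficients without flipping the kernel are all assumptions of this sketch.

```java
import java.util.Arrays;

/** Hypothetical CPU reference for a 3x3 convolution on a row-major grayscale image. */
final class Convolve3x3Reference {

    /**
     * Applies a 3x3 kernel (9 coefficients, row-major) to every pixel.
     * Assumption: coordinates outside the image are clamped to the nearest edge pixel.
     */
    static int[] convolve(int[] src, int width, int height, float[] kernel) {
        if (src.length != width * height || kernel.length != 9) {
            throw new IllegalArgumentException("unexpected image or kernel size");
        }
        int[] dst = new int[src.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                float sum = 0f;
                for (int ky = -1; ky <= 1; ky++) {
                    for (int kx = -1; kx <= 1; kx++) {
                        int sx = Math.min(Math.max(x + kx, 0), width - 1);
                        int sy = Math.min(Math.max(y + ky, 0), height - 1);
                        sum += kernel[(ky + 1) * 3 + (kx + 1)] * src[sy * width + sx];
                    }
                }
                // Round to the nearest integer and clamp to the 8-bit range.
                dst[y * width + x] = Math.min(255, Math.max(0, Math.round(sum)));
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] image = {
            10, 10, 10,
            10, 100, 10,
            10, 10, 10
        };
        float[] box = new float[9];
        Arrays.fill(box, 1f / 9f); // simple averaging kernel
        System.out.println(Arrays.toString(convolve(image, 3, 3, box)));
    }
}
```

Every output pixel depends only on its own 3x3 neighbourhood in the input, which is why this kind of filter maps well onto a data-parallel runtime such as RenderScript.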
  • 176. Gaussian Blur Example by RenderScript Intrinsic
      // Create the RenderScript context and a blur intrinsic for 4-channel 8-bit (RGBA) data.
      RenderScript rs = RenderScript.create(theActivity);
      ScriptIntrinsicBlur theIntrinsic = ScriptIntrinsicBlur.create(rs, Element.U8_4(rs));
      // Wrap the input and output bitmaps in allocations.
      Allocation tmpIn = Allocation.createFromBitmap(rs, inputBitmap);
      Allocation tmpOut = Allocation.createFromBitmap(rs, outputBitmap);
      // Blur with the maximum supported radius (25), then copy the result back.
      theIntrinsic.setRadius(25.f);
      theIntrinsic.setInput(tmpIn);
      theIntrinsic.forEach(tmpOut);
      tmpOut.copyTo(outputBitmap);
  • 177. RenderScript Intrinsic Example (1/2)
  • 178. RenderScript Intrinsic Example (2/2)
  • 179. Blur Intrinsic Performance Analysis
  • 180. Performance of RenderScript Intrinsics ❖ Measured on the new Nexus 7 ❖ Relative to equivalent multithreaded C implementations
  • 181. RenderScript Image Processing Benchmarks (1/2) ❖ CPU only, on a Galaxy Nexus device
  • 182. RenderScript Image Processing Benchmarks (2/2)
  • 183. Acceleration of Retinex Using RenderScript ❖ The paper presents rsRetinex, a Retinex algorithm optimized with RenderScript. ❖ The experimental results show that rsRetinex achieves up to a fivefold speedup across different image resolutions. Le, Duc Phuoc Phat Dat, et al. "Acceleration of Retinex Algorithm for Image Processing on Android Device Using Renderscript." in Proc. The 8th International Conference on Robotic, Vision, Signal Processing & Power Applications, 2014.
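For readers unfamiliar with Retinex, here is a minimal plain-Java sketch of the general single-scale idea: each output pixel is the log of the input intensity minus the log of a smoothed "surround" of the same image. This is not the rsRetinex implementation from the cited paper. The class name `SingleScaleRetinexSketch`, the box blur used in place of a Gaussian surround, the clamp-to-edge borders, and the `+ 1.0` offset that avoids `log(0)` are all assumptions made for illustration.

```java
/** Hypothetical single-scale Retinex sketch on a row-major intensity image. */
final class SingleScaleRetinexSketch {

    /** Box-blur surround with clamped borders; a stand-in for a Gaussian surround. */
    static double[] boxBlur(double[] src, int w, int h, int radius) {
        double[] dst = new double[src.length];
        int side = 2 * radius + 1;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                double sum = 0.0;
                for (int dy = -radius; dy <= radius; dy++) {
                    for (int dx = -radius; dx <= radius; dx++) {
                        int sx = Math.min(Math.max(x + dx, 0), w - 1);
                        int sy = Math.min(Math.max(y + dy, 0), h - 1);
                        sum += src[sy * w + sx];
                    }
                }
                dst[y * w + x] = sum / (side * side);
            }
        }
        return dst;
    }

    /** Per-pixel log(intensity) minus log(surround); both are offset by 1 to avoid log(0). */
    static double[] retinex(double[] intensity, int w, int h, int radius) {
        double[] surround = boxBlur(intensity, w, h, radius);
        double[] out = new double[intensity.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.log(intensity[i] + 1.0) - Math.log(surround[i] + 1.0);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] image = {
            10, 10, 10,
            10, 100, 10,
            10, 10, 10
        };
        double[] result = retinex(image, 3, 3, 1);
        System.out.println("center response: " + result[4]);
    }
}
```

Both stages (the surround filter and the per-pixel log difference) are independent per output pixel, which is the property a RenderScript or GPGPU port would exploit.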
  • 184. Mobile GPGPU List
      GPGPU | Adoption | OpenCL/CUDA | OpenCV | RenderScript
      Qualcomm Adreno | Google Nexus 10, Google new Nexus 7, SONY Xperia Tablet Z2 | OpenCL 1.2 (302~420) | OCL module | Android 4.0 or later
      ARM Mali | Nexus 10, Samsung Note 3, Samsung Note PRO 12.2, Meizu MX3 | OpenCL 1.1 (T604~T678) | OCL module | Android 4.0 or later
      nVIDIA Tegra | Google Project Tango, HTC Nexus 9, Microsoft Surface 2, Nvidia Shield Note 7 | CUDA, OpenCL 1.2 (K1 only) | GPU module | Android 4.0 or later (K1 only)
      PowerVR | iPad Air, iPad mini | OpenCL 1.2 | OCL module | none
      Intel HD Graphics | Microsoft Surface Pro 3, Sony VAIO Tap 11 | OpenCL 1.1 | OCL module | none
      Sources: AnandTech; "Nvidia CEO sees future in cars and gaming," CNet, 2014/5/19.
  • 185. 7. Summary
  • 186. GPGPU ❖ Single-core → Multi-core → Many-core ❖ PC • nVidia Tesla + CUDA/OpenCV ❖ Android • Qualcomm Adreno + OpenCV ocl • nVidia Tegra + OpenCV gpu
  • 187. Parallel Programming ❖ C/C++/OpenCV • OpenMP, OpenACC, CUDA, C++ AMP • OpenCL ❖ Java • OpenCL, RenderScript ❖ Note that OpenCL and RenderScript are • Not efficient in parallelization • Efficient in CV algorithmic design
  • 188. OpenCV Acceleration (1/2) ❖ Ver. 2.4.x • gpu module: CUDA, PC • ocl module: OpenCL, mobile ❖ Ver. 3.0 (2014/6) • Transparent API for GPGPU acceleration
  • 189. OpenCV Acceleration (2/2)
  • 190. OpenCL 2.0 ❖ Released in 2013 ❖ SVM: Shared Virtual Memory • OpenCL 1.2: explicit memory management ❖ Dynamic (nested) parallelism • Allows a device to enqueue kernels onto itself; no round trip to the host required ❖ Disadvantages • Requires strong hardware support • Not well supported in current GPGPUs
  • 191. CUDA still Dominant in the Near Future ❖ However, we have to manually parallelize the algorithm: more design overhead ❖ We need expertise in • Algorithms of image and signal processing • Filtering, frequency analysis, compression, feature extraction, recognition, ... • Theory, tools and methodology of parallel computing • Communication, synchronization, resource management, load balancing, debugging, ...
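As a small illustration of what "manually parallelize" involves, the sketch below splits a per-pixel threshold across worker threads by rows and waits for all of them before returning. It is written in plain Java rather than CUDA so that it stays self-contained. The class name `RowParallelThreshold`, the even row split (a simple form of load balancing), and the use of a fixed thread pool are assumptions of this sketch, not something taken from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical example: manual row-wise partitioning of a per-pixel operation. */
final class RowParallelThreshold {

    /** Sequential baseline: pixels at or above the level become 255, others 0. */
    static int[] thresholdSequential(int[] src, int level) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i] >= level ? 255 : 0;
        }
        return dst;
    }

    /** Same operation, with rows divided into contiguous blocks, one block per task. */
    static int[] thresholdParallel(int[] src, int width, int height, int level, int workers) {
        if (workers < 1 || src.length != width * height) {
            throw new IllegalArgumentException("bad worker count or image size");
        }
        int[] dst = new int[src.length];
        int rowsPerTask = Math.max(1, (height + workers - 1) / workers);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int start = 0; start < height; start += rowsPerTask) {
            final int firstRow = start;
            final int lastRow = Math.min(height, start + rowsPerTask);
            tasks.add(() -> {
                // Each task writes only its own rows, so no locking is needed.
                for (int i = firstRow * width; i < lastRow * width; i++) {
                    dst[i] = src[i] >= level ? 255 : 0;
                }
                return null;
            });
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            // Synchronization point: wait until every block has been processed.
            for (Future<Void> done : pool.invokeAll(tasks)) {
                done.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] image = {0, 50, 100, 150, 200, 250};
        int[] result = thresholdParallel(image, 3, 2, 128, 2);
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

Even in this tiny case the programmer decides how the data is partitioned, which writes are safe without locks, and where the threads must be joined. These are the same design questions that arise, at larger scale, when writing CUDA kernels by hand.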
  • 192. GPUs for Multimedia Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA 10 X CUDA JPEG Decoder 10 X DivideFrame GPU Decoder Hyperspectral Image Compression on NVIDIA GPUs 10 X GPU Decoder (Vegas/Premiere) - Using the Power of NVIDIA Graphic Card to Decode H.264 Video Files 26 X PowerDirector7 Ultra 3.5X
  • 193. GPUs for Computer Vision (1/2) 87 X CUDA SURF – A Real-time Implementation for SURF TU Darmstadt 26 X Leukocyte Tracking: ImageJ Plugin University of Virginia 200 X Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid 100 X Image Denoising with Bilateral Filter Wrocław University of Technology 85 X Digital Breast Tomosynthesis Reconstruction Massachusetts General Hospital 100 X Fast Optical Flow on GPU At Video Rate for Full HD Resolution Onera 8 X A Framework for Efficient and Scalable Execution of Domain-specific Templates On GPU NEC Labs, Berkeley, Purdue 13 X Accelerating Advanced MRI Reconstructions University of Illinois
  • 194. GPUs for Computer Vision (2/2) 20 X GPU for Surveillance 13 X Fast Human Detection with Cascaded Ensembles 109 X Fast Sliding-Window Object Detection 263 X GPU Acceleration of Object Classification Algorithm Using NVIDIA CUDA 10 X Real-time Visual Tracker by Stream Processing 45 X A GPU Accelerated Evolutionary Computer Vision System 3 X Canny Edge Detection 300 X Audience Measurement – Real-time Video Analysis for Counting People, Face Detection and Tracking
  • 195. The Embedded Vision Alliance
  • 196. Readings (1/2) • Wang, Yuan-Kai, and Wen-Bin Huang. "Acceleration of an improved Retinex algorithm." IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). 2011. • Wang, Yuan-Kai, and Wen-Bin Huang. "A CUDA-enabled parallel algorithm for accelerating retinex." Journal of Real-Time Image Processing (2012): 1-19. • Pauwels, Karl, et al. "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features." IEEE Transactions on Computers 61.7 (2012): 999-1012. • Pratx, Guillem, and Lei Xing. "GPU computing in medical physics: A review." Medical Physics 38.5 (2011): 2685-2697. • Cope, Ben, et al. "Performance comparison of graphics processors to reconfigurable logic: a case study." IEEE Transactions on Computers 59.4 (2010): 433-448.
  • 197. Readings (2/2) ❖ "Designing Visionary Mobile Apps Using the Tegra Android Development Pack," http://bit.ly/1jvwbgV ❖ "Getting Started With GPU-Accelerated Computer Vision Using OpenCV and CUDA," http://bit.ly/1oMwJEG ❖ "The open standard for parallel programming of heterogeneous systems," https://www.khronos.org/opencl/
  • 198. OpenCV Acceleration ❖ GPU Module Introduction - OpenCV 2.4.9.0 documentation ❖ OpenCL Module Introduction - OpenCV documentation ❖ OpenCV-CL: Computer vision with OpenCL acceleration, AMD Developer Central, 2013. ❖ Pulli, Kari, et al. "Real-time computer vision with OpenCV." Communications of the ACM 55.6 (2012): 61-69. ❖ Allusse, Yannick, et al. "GpuCV: A GPU-accelerated framework for image processing and computer vision." Advances in Visual Computing. Springer Berlin Heidelberg, 2008. 430-439.
  • 199. CUDA ❖ CUDA Programming guide. nVidia. ❖ CUDA Best Practices Guide. nVidia. ❖ CUDA Reference Manual. nVidia. ❖ CUDA Zone - NVIDIA Developer, https://developer.nvidia.com/cuda-zone ❖ Parallel Programming and Computing Platform | CUDA Home, www.nvidia.com/object/cuda_home_new.html ❖ Applications of CUDA for Imaging and Computer Vision http://www.nvidia.com/object/imaging_comp_vision.html ❖ nVidia Performance Primitives (NPP) http://developer.nvidia.com/object/npp_home.html
  • 200. OpenCL ❖ Khronos OpenCL specification, reference card, tutorials, etc: http://www.khronos.org/opencl ❖ AMD OpenCL Resources: http://developer.amd.com/opencl ❖ NVIDIA OpenCL Resources: http://developer.nvidia.com/opencl ❖ Books • Using OpenCL: Programming Massively Parallel Computers. IOS Press, 2012. • OpenCL programming guide. Pearson Education, 2011. • Heterogeneous Computing with OpenCL: Revised OpenCL 1. Newnes, 2012. • OpenCL in Action: how to accelerate graphics and computation. Manning, 2012.
  • 201. RenderScript ❖ RenderScript for Android Developer, official web site, http://developer.android.com/guide/topics/renderscript/compute.html ❖ Qian, Xi, Guangyu Zhu, and Xiao-Feng Li. "Comparison and analysis of the three programming models in google android." First Asia-Pacific Programming Languages and Compilers Workshop. 2012. ❖ "High Performance Apps Development with RenderScript," 12th Kandroid Conference, 2013.
  • 202. Web Sites and Resources ❖ Embedded Vision Alliance, http://www.embedded-vision.com ❖ GPUComputing.Net, http://www.gpucomputing.net ❖ HSA Foundation, www.hsafoundation.com
  • 203. Parallel Computing with GPGPU ❖ Programming Massively Parallel Processors – A Hands-on Approach • D. B. Kirk, W. M. Hwu • Morgan Kaufmann, 2010 • http://www.nvidia.com/object/promotion_david_kirk_book.html