SlideShare a Scribd company logo
NVIDIA DevTech | Anton Obukhov <aobukhov@nvidia.com>
GPU-Accelerated Video Encoding
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
1
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
2
PRESENTED BY
Motivation for the talk
3
Video Encode and Decode are even more significant now as sharing of
media is more popular with Portable Media Devices and on the internet
Encoding HD movies takes tens
of hours on modern desktops
Portable and mobile devices have
underutilized processing power
PRESENTED BY
Trends
• GPU encoding is gaining wide adoption from developers (consumer,
server, and professionals)
• Up to now the majority of encoders provide performance through a Fast
Profile, but the Quality Profile increases CPU usage
• Normal state of things: CUDA exists since 2007
• A video encoder becomes mature after ~5 years
• Time to deliver quality GPU encoding
4
PRESENTED BY
Trends
• Knuth: “We should forget about small efficiencies, say
about 97% of the time: premature optimization is the root
of all evil”
• x86 over-optimization has its roots in reference encoders
• Slices were a step towards partial encoder parallelization,
but the whole idea is not GPU-friendly (doesn’t fit well
with SIMT)
• Integrated solution is an encoder architecture from
scratch
5
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
6
PRESENTED BY
Video encoding with NVIDIA GPU
Facilities:
• SW H.264 codec designed for CUDA
– Baseline, Main, High profiles
– CABAC, CAVLC
Interfaces:
• C library (NVCUVENC)
• Direct Show API
7
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
8
PRESENTED BY
Hybrid CPU-GPU encoding pipeline
• The codec should be designed with best practices for both
CPU and GPU architectures
• PCI-e is the main bridge between architectures, it has
limited bandwidth, CPU–GPU data transfers can take
away all benefits of GPU speed-up
• PCI-e bandwidth is an order of magnitude less than video
memory bandwidth
9
PRESENTED BY
Hybrid CPU-GPU encoding pipeline
• There are not so many parts of a codec that vitally need data
dependencies: CABAC, CAVLC are the most well-known
• Many other dependencies were introduced to respect CPU best
practices guides, can be resolved by codec architecture revision
• It might be beneficial to perform some serial processing with
one CUDA block and one CUDA thread on GPU instead of
copying data back and forth
10
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
11
PRESENTED BY
GPU Encoding Use Cases
• Video encoding
• Video transcoding
• Encoding live video input
• Compressing rendered scenes
12
PRESENTED BY
GPU Encoding Use Cases
• The use cases differ in the way input video frames
appear in GPU memory
• Frames come from GPU memory when:
– Compressing rendered scenes
– Transcoding using CUDA decoder
• Frames come from CPU memory when:
– Encoding/Transcoding of a video file decoded on CPU
– Live feed from a webcam or any other video input device
via USB
13
PRESENTED BY
GPU Encoding Use Cases
Helper libraries for source acquisition acceleration:
• NVCUVID – GPU acceleration of H.264, VC-1, MPEG2
decoding in hardware (available in CUDA C SDK)
• videoInput – an open-source library for video devices
handling on CPU via DirectShow
– DirectShow filter can be implemented instead to minimize the
amount of “hidden” buffers on the CPU. The webcam filter
writes frames directly into pinned/mapped CUDA memory
14
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
15
PRESENTED BY
Algorithm primitives
• Fundamental primitives of Parallel Programming
• Building blocks for other algorithms
• Have very efficient implementations
• Mostly well-known:
– Reduce
– Scan
– Compact
16
PRESENTED BY
Reduce
• Having a vector of numbers a = [ a0, a1, … an-1 ]
• And a binary associative operator ⊕, calculate:
res = a0 ⊕ a1 ⊕ … ⊕ an-1
• Instead of ⊕ take any of the following:
“ + ”, “ * ”, “min”, “max”, etc.
17
PRESENTED BY
Reduce
• Use half of N threads
• Each thread loads 2 elements, applies the operator to them and puts the intermediate
result into the shared memory
• Repeat log2 N times, halve the number of threads on every new iteration, use
__syncthreads to verify integrity of results in shared memory
18
2 5 0 1 3 9 9 4
7 1 12 13
8 25
33
Legend
⊕
__syncthreads
PRESENTED BY
Reduce sample: 8 numbers
19
2 5 0 1 3 9 9 4
Legend
⊕
set
PRESENTED BY
Reduce sample: 8 numbers
20
2 5 0 1 3 9 9 4
5 14 9 5
Legend
⊕
set
PRESENTED BY
Reduce sample: 8 numbers
21
2 5 0 1 3 9 9 4
5 14 9 5
14 19
Legend
⊕
set
PRESENTED BY
Reduce sample: 8 numbers
22
2 5 0 1 3 9 9 4
5 14 9 5
14 19
33
Legend
⊕
set
PRESENTED BY
Scan
• Having a vector of numbers a = [ a0, a1, … an-1 ]
• And a binary associative operator ⊕, calculate:
scan(a) = [ 0, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2) ]
• Instead of ⊕ take any of the following:
“ + ”, “ * ”, “min”, “max”, etc.
23
PRESENTED BY
Scan sample: 8 numbers
24
2 9 5 1 3 6 1 4
Legend
⊕
set
PRESENTED BY
Scan sample: 8 numbers
25
2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 9 5 1 3 6 1 4
Legend
⊕
set
PRESENTED BY
Scan sample: 8 numbers
26
2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 11 14 6 4 9 7 5
Legend
⊕
set
PRESENTED BY
Scan sample: 8 numbers
27
2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 11 14 6 4 9 7 5
0 0 0 0 0 0 0 2 11 16 17 18 15 11 14
Legend
⊕
set
PRESENTED BY
Scan sample: 8 numbers
28
2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 11 14 6 4 9 7 5
0 0 0 0 0 0 0 2 11 16 17 18 15 11 14
0 0 0 0 0 0 0 2 11 16 17 20 26 27 31
Legend
⊕
set
PRESENTED BY
Scan sample: 8 numbers
29
2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 9 5 1 3 6 1 4
0 0 0 0 0 0 0 2 11 14 6 4 9 7 5
0 0 0 0 0 0 0 2 11 16 17 18 15 11 14
0 0 0 0 0 0 0 2 11 16 17 20 26 27 31
Legend
⊕
set
PRESENTED BY
Compact
• Having a vector of numbers a = [ a0, a1, … an-1 ] and a
mask of elements of interest m = [m0, m1, …, mn-1],
mi = [0 | 1],
• Calculate:
compact(a) = [ ai0
, ai1
, … aik-1
], where
reduce(m) = k,
j[0, k-1]: mij
= 1.
30
PRESENTED BY
Compact sample
31
Input:
Desired output:
Green elements are of interest with corresponding mask
elements set to 1, while gray to 0.
0 1 2 3 4 5 6 7 8 9
0 2 4 7 8
PRESENTED BY
Compact sample
32
The algorithm makes use of the scan primitive:
Input:
Mask of interest:
Scan of the mask:
Compacted vector:
0 1 2 3 4 5 6 7 8 9
0 2 4 7 8
1 0 1 0 1 0 0 1 1 0
0 1 1 2 2 3 3 3 4 5
PRESENTED BY
Algorithm primitives discussion
• Highly efficient implementations in CUDA C SDK,
CUDPP and Thrust libraries
• Performance benefits from HW native intrinsics (POPC
and etc), which makes difference when implementing
with CUDA
33
PRESENTED BY
Algorithm primitives discussion
• Reduce: sum of absolute differences (SAD)
• Scan: Integral Image calculation, Compact facilitation
• Compact: Work pool index creation
34
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
35
PRESENTED BY
Intra Prediction
36
PRESENTED BY
Intra Prediction
• Each block is predicted according to the predictors vector and the
prediction rule
• Predictors vector consists of N pixels to the top of the block, N to
the left and 1 pixel at top-left location
• Prediction rules (few basic for 16x16 blocks):
1. Constant – each pixel of the block is predicted with a constant
value
2. Vertical – each row of the block is predicted using the top predictor
3. Horizontal – each column of the block is predicted using the left
predictor
4. Plane – A 2D-plane prediction that optimally fits the source pixels
37
?
PRESENTED BY
Intra Prediction
• Select the rule with the lowest prediction error
• The prediction error can be calculated as a Sum of Absolute Differences
of the corresponding pixels in the predicted and the original blocks:
• Data compression effect: only need to store predictors and the prediction
rule
38
 

y
x
x
y
predict
x
y
src
SAD
,
]
][
[
]
][
[
PRESENTED BY
Intra Prediction
39
PRESENTED BY
Intra Prediction
40
Horizontal
Constant
Plane
Vertical
PRESENTED BY
CUDA: Intra Prediction 16x16
• An image block of 16x16 pixels is processed by 16 CUDA threads
• Each CUDA-block contains 64 threads and processes 4 image blocks at
once:
– 0-15
– 16-31
– 32-47
– 48-63
• Block configuration: blockDim = {16, 4, 1};
41
PRESENTED BY
CUDA: Intra Prediction 16x16
42
1. Load image block and predictors into the shared memory:
int mb_x = ( blockIdx.x * 16 ); // left offset (of the current 16x16 image block in the frame)
int mb_y = ( blockIdx.y * 64 ) + ( threadIdx.y * 16 ); // top offset
int thread_id = threadIdx.x; // thread ID [0,15] working inside of the current image block
// copy the entire image block into the shared memory
int offset = thread_id * 16; // offset of the thread_id row from the top of image block
byte* current = share_mem->current + offset; // pointer to the row in shared memory (SHMEM)
int pos = (mb_y + thread_id) * stride + mb_x; // position of the corresponding row in the frame
fetch16pixels(pos, current); // copy 16 pixels of the frame from GMEM to SHMEM
// copy predictors
pos = (mb_y - 1) * stride + mb_x; // position one pixel to the top of the current image block
if (!thread_id) // this condition is true only for one thread (zero) working on the image block
fetch1pixel(pos - 1, share_mem->top_left); // copy 1-pix top-left predictor
fetch1pixel(pos + thread_id, share_mem->top[thread_id]); // copy thread_idth pixel of the top predictor
pos += stride - 1; // position one pixel to the left of the current image block
fetch1pixel(pos + thread_id * stride, share_mem->left[thread_id]); // copy left predictor
PRESENTED BY
CUDA: Intra Prediction 16x16
43
2. Calculate horizontal prediction (using the left predictor):
//INTRA_PRED_16x16_HORIZONTAL
int left = share_mem->left[thread_id]; // load thread_id pixel of the left predictor
int sad = 0;
for (int i=0; i<16; i++) // iterate pixels of the thread_idth row of the image block
{
sad = __usad(left, current[i], sad); // accumulate row-wise SAD
}
// each of 16 threads own its own variable “sad”, which resides in a register
// the function devSum16 sums all 16 variables using the reduce primitive and SHMEM
sad = devSum16(share_mem, sad, thread_id);
if (!thread_id) // executes once for an image block
{
share_mem->mode[MODE_H].mode = INTRA_PRED_16x16_HORIZONTAL;
// store sad if left predictor actually exists
// otherwise store any number greater than the worst const score
share_mem->mode[MODE_H].score = mb_x ? sad : WORST_INTRA_SCORE;
}
PRESENTED BY
CUDA: Intra Prediction 16x16
44
3.
Under the certain assumptions about the CUDA warp size, some of the
__syncthreads can be omitted
__device__ int devSum16( IntraPred16x16FullSearchShareMem_t* share_mem, int a, int thread_id )
{
share_mem->tmp[thread_id] = a; // store the register value in the shared memory vector
//consisting of 16 cells
__syncthreads(); // make all threads pass the barrier when working with shared memory
if ( thread_id < 8 ) // work with only 8 first threads out of 16
{
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 8];
__syncthreads();
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 4]; // only 4 threads are useful
__syncthreads();
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 2]; // only 2 threads are useful
__syncthreads();
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 1]; // only 1 thread is useful
__syncthreads();
}
return share_mem->tmp[0];
}
PRESENTED BY
CUDA: Intra Prediction 16x16
45
3. __device__ int devSum16
1 3 6 0 3 2 8 1
5 14 16 5
21 19
40
1 2 1 1 0 7 1 3
5
2 7 1 3 9 9 4
PRESENTED BY
CUDA: Intra Prediction 16x16
4. Calculate rest of the rules (Vertical and Plane)
5. Rule selection: take the one with the minimum SAD
6. Possible to do Intra reconstruction in the same kernel
(see gray images few slides back)
46
PRESENTED BY
CUDA Host: Intra Prediction 16x16
47
// main() body
byte *h_frame, *d_frame; // frame pointers in CPU (host) and CUDA (device) memory
int *h_ip16modes, *d_ip16modes; // output prediction rules pointers
int width, height; // frame dimensions
getImageSizeFromFile(pathToImg, &width, &height); // get frame dimensions
int sizeFrame = width * height; // calculate frame size
int sizeModes = (width/16) * (height/16) * sizeof(int); // calculate prediction rules array size
cudaMallocHost((void**)&h_frame, sizeFrame); // allocate host frame memory (if not having yet)
cudaMalloc((void**)&d_frame, sizeFrame); // allocate device frame memory
cudaMallocHost((void**)& h_ip16modes, sizeModes); // allocate host rules memory
cudaMalloc((void**)& d_ip16modes, sizeModes); // allocate device rules memory
loadImageFromFile(pathToImg, width, height, &h_frame); // load image to h_frame
cudaMemcpy(d_frame, h_frame, sizeFrame, cudaMemcpyHostToDevice); // copy host frame to device
dim3 blockSize = {16, 4, 1}; // configure CUDA block (16 threads per image block, 4 blocks in CUDA block)
dim3 gridSize = {width / 16, height / 64, 1}; // configure CUDA grid to cover the whole frame
intraPrediction16x16<<<gridSize, blockSize>>>(d_frame, d_modes_out); // launch kernel
cudaMemcpy(h_ip16modes, d_ip16modes, sizeModes, cudaMemcpyDeviceToHost); // copy rules back to host
cudaFree(d_frame); cudaFreeHost(h_frame); // free frames memory
cudaFree(d_ip16modes); cudaFreeHost(h_ip16modes); // free rules memory
PRESENTED BY
SAD and SATD
• Sum of Absolute Transformed Differences increases
MVF quality
• More robust than SAD
• Takes more time to compute
• Can be efficiently computed, i.e. split 16x16 block into
16 blocks of 4x4 pixels and do transform in-place
48
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
49
PRESENTED BY
Motion Estimation
50
PRESENTED BY
Motion Estimation
51
PRESENTED BY
Motion Estimation
52
PRESENTED BY
Motion Estimation
53
Current (P)
frame
Reference (I,
previous)
frame
Original Block match Motion Vector Field
Like in Intra Prediction, when doing block-matching Motion Estimation a
frame is split into blocks (typically 16x16, 8x8 and subdivisions like 16x8,
8x16, 8x4, 4x8, 4x4)
PRESENTED BY
Motion Estimation
54
Motion vectors field (MVF) quality can be represented by the amount of information
in the residual (the less – the better)
Compensated from reference (notice blocking)
Original frame Residual Original-Compensated
Residual Original-Reference
PRESENTED BY
ME – Full Search
55
• Full search of the best match in some local area of
the frame
• The area is defined as a rectangle centered on the
block that we want to predict and going range_x
pixels both sides along X axis and range_y pixels
along Y axis
• range_x is usually greater than range_y because
horizontal motion prevails in movies
• Very computationally expensive though straight-
forward
• Doesn’t “catch” long MVs outside of the range
Current frame
Reference frame
range_x
range_y
PRESENTED BY
ME – Diamond search
• Template methods class reduces the amount of block matches by the order of
magnitude comparing to Full Search
• On every iteration the algorithm matches 6 blocks with anchors that form diamond
centered on the initial position
• The diamond size doesn’t grow and must reduce with the amount of steps made
56
PRESENTED BY
ME – Diamond search
57
• The majority of template methods are based on gradient descent and have certain
application restrictions
• When searching for a match for the block on the frame below, red initialization
is not better than a random block selection, while green will likely converge
• Used often in junction with some
other methods as a refinement
step
PRESENTED BY
ME – Candidate Set
• Objects that move inside a frame are typically larger than a block, adjacent blocks have
similar motion vectors
• Candidate set is a list of motion vectors that have to be checked in first place. The list is
usually filled with motion vectors of already found adjacent (spatially or temporarily)
blocks
• The majority of CPU-school ME algorithms fill candidate set with motion vectors found
for left, top-left, top, top-right blocks (assuming block-linear YX-order processing)
• This creates an explicit data dependency which hinders efficient porting of the
algorithm to CUDA
58
PRESENTED BY
ME – 2D-wave
• A non-insulting way to resolve spatial data dependency
• Splits processing into the list of groups of blocks, with possibility of parallel processing of blocks
within any group. Groups have to be executed in order
• The basic idea – to move by independent slices from the top-left corner to the down-right. TN
stands for the group ID (and an iteration)
59
• Inefficiency of CUDA implementation:
– Low SMs load
– Need work pool with
producerconsumer roles, or several
kernel launches
– Irregular data access
PRESENTED BY
ME – 3D-wave
• 3D-wave treats time as a 3rd dimension and moves from the top-left corner of the first frame (cube
origin) by groups of independent blocks
• Useful for resolving spatio-temporal candidate set dependencies
60
Frame 0 Frame 1 Frame 2
Legend
Blocks with MV found
Blocks to be processed (in parallel)
PRESENTED BY
ME – GPU-friendly candidate set
• 2D and 3D wave approaches make GPU mock CPU pipeline by using
clever tricks, which rarely agree with best GPU programming
practices
• If full CPU-GPU data compliance is not a case, then revising the
candidate set or even the full Motion Estimation pipeline is necessary
• Revised GPU-friendly candidate sets might give worse MVF. But they
will free tons of time spent on maintaining useless structures, which
in turn can be partially spent on doing more computations to
leverage MVF quality
61
PRESENTED BY
ME – GPU-friendly candidate set
• Any candidate set is friendly if it is known for every block at the kernel launch time
(see Hierarchical Search)
• A candidate set with spatial MV dependency can be formed as below
• Just by removing one (left) spatial candidate, we:
– Enhance memory access pattern
– Relax constraints on work producingconsuming
– Reduce the code size and complexity
– Equalize parallel execution group sizes
• Can be implemented in a loop where each CUDA block processes a column of the
frame of few image blocks width
• Some grid synchronization trickery is still needed
62
Group ID
Group Size (MBs)
PRESENTED BY
ME – Hierarchical Search
63
Current frame
Reference frame
Full Search
Refine
Refine
Refinement data
Refinement data
Level 2
Level 1
Level 0
PRESENTED BY
ME – Hierarchical Search
Algorithm:
1. Build image pyramids
2. Perform full search on the top level
3. A motion vector of level K is the best approximation (candidate)
for all “underlying” blocks on level K+1 and can be in candidate
set for block on all subsequent levels
4. MVs refinement can be done by any template search
5. On the refinement step, N best matches can be passed to
subsequent levels to prevent the fall into the local minima
64
PRESENTED BY
ME – Hierarchical Search
Hierarchical search ingredients:
1. Downsampling kernel (participates in the creation of pyramid)
2. Full search kernel (creates the initialization MVF on the largest scale, can
be used to refine problematic areas on subsequent layers)
3. Template search kernel (refines the detections obtained from the
previous scales, or on the previous steps, frames)
4. Global memory candidate sets approach (to store N best results from the
previous layers and frames)
65
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
66
PRESENTED BY
CUDA and OpenCL
• For the sub-pixel Motion Estimation precision CUDA
provides textures with bilinear interpolation on access
• CUDA provides direct access to GPU intrinsics
• CUDA: C++, IDE (Parallel Nsight), Visual Studio Integration,
ease of use
• OpenCL kernels are not a magic bullet, they need to be
fine-tuned for each particular architecture
• OpenCL for CPU vs GPU will have different optimizations
67
PRESENTED BY
Tesla and Fermi architectures
• All algorithms will benefit from L1 and L2 caches
(Fermi)
• Forget about SHMEM bank conflicts when working
with 1-byte data types
• Increased SHMEM size (Tesla = 16K, Fermi = up to
48K)
• 2 Copy engines, Parallel kernel execution
68
PRESENTED BY
Multi-GPU encoding
• One slice per GPU
– Fine-grain parallelism
– Requires excessive PCI-e transfers to collect results either on CPU or on one of the
GPUs to generate the final bitstream
– Requires data frame-level data consolidation after every kernel launch
– Data management might get not trivial due to apron processing
69
I I
P B B P B B
PRESENTED BY
Multi-GPU encoding
• N GOPs for N GPUs
– Coarse-grain parallelism
– Better for offline compression
– Higher utilization of GPU resources
– Requires N times more memory on CPU to handle processing
– Drawbacks: must be careful when concatenating GOP structures, CPU
multiplexing may become a bottleneck,
– Requires N times more memory on the host to process video
70
I P B B I P B B
PRESENTED BY
Other directions
• JPEG2000 - cuj2k is out there
– open-source: http://sourceforge.net/projects/cuj2k/
– needs more BPP for usage in the industry, currently
supports only 8bpp per channel
• CUDA JPEG encoder – now available in CUDA C SDK
• MJPEG
71
PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
72
PRESENTED BY
Source code
• The reference Full Search algorithm is implemented in
CUDA and is available here in public domain:
http://tinyurl.com/cuda-me-fs
http://courses.graphicon.ru/files/courses/mdc/2010/assigns/assign3/MotionEstimationCUDA.zip
• Feel free to email us to see if we have an algorithm
you need implemented in CUDA
73
PRESENTED BY
Recommended GTC talks
We have a plenty of Computer Vision algorithms
developed and discussed by our engineers:
– James Fung, “Accelerating Computer Vision on the Fermi
Architecture”
– Joe Stam, “Extending OpenCV with GPU Acceleration”
– Timo Stich, “Fast High-Quality Panorama Stitching”
74
PRESENTED BY
Ideas? Questions? Suggestions?
75
reach me via email:
Anton Obukhov <aobukhov@nvidia.com>

More Related Content

Similar to S12075-GPU-Accelerated-Video-Encoding.pdf

The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
bakers84
 
GPU Programming with CUDA
GPU Programming with CUDAGPU Programming with CUDA
GPU Programming with CUDA
Filipo Mór
 
3526192.ppt
3526192.ppt3526192.ppt
3526192.ppt
ssuseraf60311
 
FPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusionFPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusion
PersiPersi1
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
byteLAKE
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
 
Summary Of Course Projects
Summary Of Course ProjectsSummary Of Course Projects
Summary Of Course Projects
awan2008
 
Nbsingh csir-ceeri-semiconductor-activities
Nbsingh csir-ceeri-semiconductor-activitiesNbsingh csir-ceeri-semiconductor-activities
Nbsingh csir-ceeri-semiconductor-activities
Narendra Bahadur Singh
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
Edge AI and Vision Alliance
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
Sundance Multiprocessor Technology Ltd.
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goKristofferson A
 
Programming Models for Heterogeneous Chips
Programming Models for  Heterogeneous ChipsProgramming Models for  Heterogeneous Chips
Programming Models for Heterogeneous Chips
Facultad de Informática UCM
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
fcassier
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)Shivam Gupta
 
nbsingh-CSIR-CEERI-Semiconductor-Activities
nbsingh-CSIR-CEERI-Semiconductor-Activitiesnbsingh-CSIR-CEERI-Semiconductor-Activities
nbsingh-CSIR-CEERI-Semiconductor-ActivitiesNarendra Bahadur Singh
 
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
Yuichiro Yasui
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
ssuser413a98
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Tokyo Institute of Technology
 

Similar to S12075-GPU-Accelerated-Video-Encoding.pdf (20)

The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
GPU Programming with CUDA
GPU Programming with CUDAGPU Programming with CUDA
GPU Programming with CUDA
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
3526192.ppt
3526192.ppt3526192.ppt
3526192.ppt
 
FPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusionFPGA_prototyping proccesing with conclusion
FPGA_prototyping proccesing with conclusion
 
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
 
Summary Of Course Projects
Summary Of Course ProjectsSummary Of Course Projects
Summary Of Course Projects
 
A2 e overview
A2 e overviewA2 e overview
A2 e overview
 
Nbsingh csir-ceeri-semiconductor-activities
Nbsingh csir-ceeri-semiconductor-activitiesNbsingh csir-ceeri-semiconductor-activities
Nbsingh csir-ceeri-semiconductor-activities
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
E3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - SundanceE3MV - Embedded Vision - Sundance
E3MV - Embedded Vision - Sundance
 
OOW 2013: Where did my CPU go
OOW 2013: Where did my CPU goOOW 2013: Where did my CPU go
OOW 2013: Where did my CPU go
 
Programming Models for Heterogeneous Chips
Programming Models for  Heterogeneous ChipsProgramming Models for  Heterogeneous Chips
Programming Models for Heterogeneous Chips
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
System On Chip (SOC)
System On Chip (SOC)System On Chip (SOC)
System On Chip (SOC)
 
nbsingh-CSIR-CEERI-Semiconductor-Activities
nbsingh-CSIR-CEERI-Semiconductor-Activitiesnbsingh-CSIR-CEERI-Semiconductor-Activities
nbsingh-CSIR-CEERI-Semiconductor-Activities
 
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
NUMA-aware thread-parallel breadth-first search for Graph500 and Green Graph5...
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
 
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memor...
 

More from gopikahari7

Tema2_ArchitectureMIPS.pptx
Tema2_ArchitectureMIPS.pptxTema2_ArchitectureMIPS.pptx
Tema2_ArchitectureMIPS.pptx
gopikahari7
 
barrera.ppt
barrera.pptbarrera.ppt
barrera.ppt
gopikahari7
 
cuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptx
cuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptxcuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptx
cuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptx
gopikahari7
 
Final PPT.pptx (1).pptx
Final PPT.pptx (1).pptxFinal PPT.pptx (1).pptx
Final PPT.pptx (1).pptx
gopikahari7
 
batalgorithm-160501121237 (1).pptx
batalgorithm-160501121237 (1).pptxbatalgorithm-160501121237 (1).pptx
batalgorithm-160501121237 (1).pptx
gopikahari7
 
batalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptxbatalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptx
gopikahari7
 
Copy of Parallel_and_Cluster_Computing.pptx
Copy of Parallel_and_Cluster_Computing.pptxCopy of Parallel_and_Cluster_Computing.pptx
Copy of Parallel_and_Cluster_Computing.pptx
gopikahari7
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
gopikahari7
 
batalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptxbatalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptx
gopikahari7
 
ELEMPowerPoint.pptx
ELEMPowerPoint.pptxELEMPowerPoint.pptx
ELEMPowerPoint.pptx
gopikahari7
 
Hayes2010.ppt
Hayes2010.pptHayes2010.ppt
Hayes2010.ppt
gopikahari7
 
plantpresentation.ppt
plantpresentation.pptplantpresentation.ppt
plantpresentation.ppt
gopikahari7
 
2_2018_12_20!06_04_28_PM.ppt
2_2018_12_20!06_04_28_PM.ppt2_2018_12_20!06_04_28_PM.ppt
2_2018_12_20!06_04_28_PM.ppt
gopikahari7
 
abelbrownnvidiarakuten2016-170208065814 (1).pptx
abelbrownnvidiarakuten2016-170208065814 (1).pptxabelbrownnvidiarakuten2016-170208065814 (1).pptx
abelbrownnvidiarakuten2016-170208065814 (1).pptx
gopikahari7
 
realtime_ai_systems_academia.pptx
realtime_ai_systems_academia.pptxrealtime_ai_systems_academia.pptx
realtime_ai_systems_academia.pptx
gopikahari7
 
ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...
ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...
ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...
gopikahari7
 
FPGA-Arch.ppt
FPGA-Arch.pptFPGA-Arch.ppt
FPGA-Arch.ppt
gopikahari7
 
BEC007 -Digital image processing.pdf
BEC007  -Digital image processing.pdfBEC007  -Digital image processing.pdf
BEC007 -Digital image processing.pdf
gopikahari7
 
1-Data Understanding.pdf
1-Data Understanding.pdf1-Data Understanding.pdf
1-Data Understanding.pdf
gopikahari7
 

More from gopikahari7 (20)

Tema2_ArchitectureMIPS.pptx
Tema2_ArchitectureMIPS.pptxTema2_ArchitectureMIPS.pptx
Tema2_ArchitectureMIPS.pptx
 
barrera.ppt
barrera.pptbarrera.ppt
barrera.ppt
 
cuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptx
cuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptxcuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptx
cuckoosearchalgorithm-141028173457-conversion-gate02 (1).pptx
 
Final PPT.pptx (1).pptx
Final PPT.pptx (1).pptxFinal PPT.pptx (1).pptx
Final PPT.pptx (1).pptx
 
batalgorithm-160501121237 (1).pptx
batalgorithm-160501121237 (1).pptxbatalgorithm-160501121237 (1).pptx
batalgorithm-160501121237 (1).pptx
 
batalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptxbatalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptx
 
Copy of Parallel_and_Cluster_Computing.pptx
Copy of Parallel_and_Cluster_Computing.pptxCopy of Parallel_and_Cluster_Computing.pptx
Copy of Parallel_and_Cluster_Computing.pptx
 
MattsonTutorialSC14.pptx
MattsonTutorialSC14.pptxMattsonTutorialSC14.pptx
MattsonTutorialSC14.pptx
 
batalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptxbatalgorithm-170406072944 (4).pptx
batalgorithm-170406072944 (4).pptx
 
ELEMPowerPoint.pptx
ELEMPowerPoint.pptxELEMPowerPoint.pptx
ELEMPowerPoint.pptx
 
Hayes2010.ppt
Hayes2010.pptHayes2010.ppt
Hayes2010.ppt
 
plantpresentation.ppt
plantpresentation.pptplantpresentation.ppt
plantpresentation.ppt
 
Plants.ppt
Plants.pptPlants.ppt
Plants.ppt
 
2_2018_12_20!06_04_28_PM.ppt
2_2018_12_20!06_04_28_PM.ppt2_2018_12_20!06_04_28_PM.ppt
2_2018_12_20!06_04_28_PM.ppt
 
abelbrownnvidiarakuten2016-170208065814 (1).pptx
abelbrownnvidiarakuten2016-170208065814 (1).pptxabelbrownnvidiarakuten2016-170208065814 (1).pptx
abelbrownnvidiarakuten2016-170208065814 (1).pptx
 
realtime_ai_systems_academia.pptx
realtime_ai_systems_academia.pptxrealtime_ai_systems_academia.pptx
realtime_ai_systems_academia.pptx
 
ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...
ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...
ppd_seminar_110202_talk_edward_freeman_introduction_to_programmable_logic_dev...
 
FPGA-Arch.ppt
FPGA-Arch.pptFPGA-Arch.ppt
FPGA-Arch.ppt
 
BEC007 -Digital image processing.pdf
BEC007  -Digital image processing.pdfBEC007  -Digital image processing.pdf
BEC007 -Digital image processing.pdf
 
1-Data Understanding.pdf
1-Data Understanding.pdf1-Data Understanding.pdf
1-Data Understanding.pdf
 

Recently uploaded

LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
ViniHema
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
bakpo1
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
MuhammadTufail242431
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
Amil Baba Dawood bangali
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
Pratik Pawar
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 

Recently uploaded (20)

LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
power quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptxpower quality voltage fluctuation UNIT - I.pptx
power quality voltage fluctuation UNIT - I.pptx
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Halogenation process of chemical process industries
Halogenation process of chemical process industriesHalogenation process of chemical process industries
Halogenation process of chemical process industries
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
NO1 Uk best vashikaran specialist in delhi vashikaran baba near me online vas...
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
weather web application report.pdf
weather web application report.pdfweather web application report.pdf
weather web application report.pdf
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 

S12075-GPU-Accelerated-Video-Encoding.pdf

  • 1. NVIDIA DevTech | Anton Obukhov <aobukhov@nvidia.com> GPU-Accelerated Video Encoding
  • 2. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 1
  • 3. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 2
  • 4. PRESENTED BY Motivation for the talk 3 Video Encode and Decode are even more significant now as sharing of media is more popular with Portable Media Devices and on the internet Encoding HD movies takes tens of hours on modern desktops Portable and mobile devices have underutilized processing power
  • 5. PRESENTED BY Trends • GPU encoding is gaining wide adoption from developers (consumer, server, and professionals) • Up to now the majority of encoders provide performance through a Fast Profile, but the Quality Profile increases CPU usage • Normal state of things: CUDA exists since 2007 • A video encoder becomes mature after ~5 years • Time to deliver quality GPU encoding 4
  • 6. PRESENTED BY Trends • Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil” • x86 over-optimization has its roots in reference encoders • Slices were a step towards partial encoder parallelization, but the whole idea is not GPU-friendly (doesn’t fit well with SIMT) • Integrated solution is an encoder architecture from scratch 5
  • 7. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 6
  • 8. PRESENTED BY Video encoding with NVIDIA GPU Facilities: • SW H.264 codec designed for CUDA – Baseline, Main, High profiles – CABAC, CAVLC Interfaces: • C library (NVCUVENC) • Direct Show API 7
  • 9. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 8
  • 10. PRESENTED BY Hybrid CPU-GPU encoding pipeline • The codec should be designed with best practices for both CPU and GPU architectures • PCI-e is the main bridge between architectures, it has limited bandwidth, CPU–GPU data transfers can take away all benefits of GPU speed-up • PCI-e bandwidth is an order of magnitude less than video memory bandwidth 9
  • 11. PRESENTED BY Hybrid CPU-GPU encoding pipeline • There are not so many parts of a codec that vitally need data dependencies: CABAC, CAVLC are the most well-known • Many other dependencies were introduced to respect CPU best practices guides, can be resolved by codec architecture revision • It might be beneficial to perform some serial processing with one CUDA block and one CUDA thread on GPU instead of copying data back and forth 10
  • 12. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 11
  • 13. PRESENTED BY GPU Encoding Use Cases • Video encoding • Video transcoding • Encoding live video input • Compressing rendered scenes 12
  • 14. PRESENTED BY GPU Encoding Use Cases • The use cases differ in the way input video frames appear in GPU memory • Frames come from GPU memory when: – Compressing rendered scenes – Transcoding using CUDA decoder • Frames come from CPU memory when: – Encoding/Transcoding of a video file decoded on CPU – Live feed from a webcam or any other video input device via USB 13
  • 15. PRESENTED BY GPU Encoding Use Cases Helper libraries for source acquisition acceleration: • NVCUVID – GPU acceleration of H.264, VC-1, MPEG2 decoding in hardware (available in CUDA C SDK) • videoInput – an open-source library for video devices handling on CPU via DirectShow – DirectShow filter can be implemented instead to minimize the amount of “hidden” buffers on the CPU. The webcam filter writes frames directly into pinned/mapped CUDA memory 14
  • 16. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 15
  • 17. PRESENTED BY Algorithm primitives • Fundamental primitives of Parallel Programming • Building blocks for other algorithms • Have very efficient implementations • Mostly well-known: – Reduce – Scan – Compact 16
  • 18. PRESENTED BY Reduce • Having a vector of numbers a = [ a0, a1, … an-1 ] • And a binary associative operator ⊕, calculate: res = a0 ⊕ a1 ⊕ … ⊕ an-1 • Instead of ⊕ take any of the following: “ + ”, “ * ”, “min”, “max”, etc. 17
  • 19. PRESENTED BY Reduce • Use half of N threads • Each thread loads 2 elements, applies the operator to them and puts the intermediate result into the shared memory • Repeat log2 N times, halve the number of threads on every new iteration, use __syncthreads to verify integrity of results in shared memory 18 2 5 0 1 3 9 9 4 7 1 12 13 8 25 33 Legend ⊕ __syncthreads
  • 20. PRESENTED BY Reduce sample: 8 numbers 19 2 5 0 1 3 9 9 4 Legend ⊕ set
  • 21. PRESENTED BY Reduce sample: 8 numbers 20 2 5 0 1 3 9 9 4 5 14 9 5 Legend ⊕ set
  • 22. PRESENTED BY Reduce sample: 8 numbers 21 2 5 0 1 3 9 9 4 5 14 9 5 14 19 Legend ⊕ set
  • 23. PRESENTED BY Reduce sample: 8 numbers 22 2 5 0 1 3 9 9 4 5 14 9 5 14 19 33 Legend ⊕ set
  • 24. PRESENTED BY Scan • Having a vector of numbers a = [ a0, a1, … an-1 ] • And a binary associative operator ⊕, calculate: scan(a) = [ 0, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2) ] • Instead of ⊕ take any of the following: “ + ”, “ * ”, “min”, “max”, etc. 23
  • 25. PRESENTED BY Scan sample: 8 numbers 24 2 9 5 1 3 6 1 4 Legend ⊕ set
  • 26. PRESENTED BY Scan sample: 8 numbers 25 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 9 5 1 3 6 1 4 Legend ⊕ set
  • 27. PRESENTED BY Scan sample: 8 numbers 26 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 11 14 6 4 9 7 5 Legend ⊕ set
  • 28. PRESENTED BY Scan sample: 8 numbers 27 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 11 14 6 4 9 7 5 0 0 0 0 0 0 0 2 11 16 17 18 15 11 14 Legend ⊕ set
  • 29. PRESENTED BY Scan sample: 8 numbers 28 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 11 14 6 4 9 7 5 0 0 0 0 0 0 0 2 11 16 17 18 15 11 14 0 0 0 0 0 0 0 2 11 16 17 20 26 27 31 Legend ⊕ set
  • 30. PRESENTED BY Scan sample: 8 numbers 29 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 9 5 1 3 6 1 4 0 0 0 0 0 0 0 2 11 14 6 4 9 7 5 0 0 0 0 0 0 0 2 11 16 17 18 15 11 14 0 0 0 0 0 0 0 2 11 16 17 20 26 27 31 Legend ⊕ set
  • 31. PRESENTED BY Compact • Having a vector of numbers a = [ a0, a1, … an-1 ] and a mask of elements of interest m = [m0, m1, …, mn-1], mi = [0 | 1], • Calculate: compact(a) = [ ai0 , ai1 , … aik-1 ], where reduce(m) = k, j[0, k-1]: mij = 1. 30
  • 32. PRESENTED BY Compact sample 31 Input: Desired output: Green elements are of interest with corresponding mask elements set to 1, while gray to 0. 0 1 2 3 4 5 6 7 8 9 0 2 4 7 8
  • 33. PRESENTED BY Compact sample 32 The algorithm makes use of the scan primitive: Input: Mask of interest: Scan of the mask: Compacted vector: 0 1 2 3 4 5 6 7 8 9 0 2 4 7 8 1 0 1 0 1 0 0 1 1 0 0 1 1 2 2 3 3 3 4 5
  • 34. PRESENTED BY Algorithm primitives discussion • Highly efficient implementations in CUDA C SDK, CUDPP and Thrust libraries • Performance benefits from HW native intrinsics (POPC and etc), which makes difference when implementing with CUDA 33
  • 35. PRESENTED BY Algorithm primitives discussion • Reduce: sum of absolute differences (SAD) • Scan: Integral Image calculation, Compact facilitation • Compact: Work pool index creation 34
  • 36. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 35
  • 38. PRESENTED BY Intra Prediction • Each block is predicted according to the predictors vector and the prediction rule • Predictors vector consists of N pixels to the top of the block, N to the left and 1 pixel at top-left location • Prediction rules (few basic for 16x16 blocks): 1. Constant – each pixel of the block is predicted with a constant value 2. Vertical – each row of the block is predicted using the top predictor 3. Horizontal – each column of the block is predicted using the left predictor 4. Plane – A 2D-plane prediction that optimally fits the source pixels 37 ?
  • 39. PRESENTED BY Intra Prediction • Select the rule with the lowest prediction error • The prediction error can be calculated as a Sum of Absolute Differences of the corresponding pixels in the predicted and the original blocks: • Data compression effect: only need to store predictors and the prediction rule 38    y x x y predict x y src SAD , ] ][ [ ] ][ [
  • 42. PRESENTED BY CUDA: Intra Prediction 16x16 • An image block of 16x16 pixels is processed by 16 CUDA threads • Each CUDA-block contains 64 threads and processes 4 image blocks at once: – 0-15 – 16-31 – 32-47 – 48-63 • Block configuration: blockDim = {16, 4, 1}; 41
  • 43. PRESENTED BY CUDA: Intra Prediction 16x16 42 1. Load image block and predictors into the shared memory: int mb_x = ( blockIdx.x * 16 ); // left offset (of the current 16x16 image block in the frame) int mb_y = ( blockIdx.y * 64 ) + ( threadIdx.y * 16 ); // top offset int thread_id = threadIdx.x; // thread ID [0,15] working inside of the current image block // copy the entire image block into the shared memory int offset = thread_id * 16; // offset of the thread_id row from the top of image block byte* current = share_mem->current + offset; // pointer to the row in shared memory (SHMEM) int pos = (mb_y + thread_id) * stride + mb_x; // position of the corresponding row in the frame fetch16pixels(pos, current); // copy 16 pixels of the frame from GMEM to SHMEM // copy predictors pos = (mb_y - 1) * stride + mb_x; // position one pixel to the top of the current image block if (!thread_id) // this condition is true only for one thread (zero) working on the image block fetch1pixel(pos - 1, share_mem->top_left); // copy 1-pix top-left predictor fetch1pixel(pos + thread_id, share_mem->top[thread_id]); // copy thread_idth pixel of the top predictor pos += stride - 1; // position one pixel to the left of the current image block fetch1pixel(pos + thread_id * stride, share_mem->left[thread_id]); // copy left predictor
  • 44. PRESENTED BY CUDA: Intra Prediction 16x16 43 2. Calculate horizontal prediction (using the left predictor): //INTRA_PRED_16x16_HORIZONTAL int left = share_mem->left[thread_id]; // load thread_id pixel of the left predictor int sad = 0; for (int i=0; i<16; i++) // iterate pixels of the thread_idth row of the image block { sad = __usad(left, current[i], sad); // accumulate row-wise SAD } // each of 16 threads own its own variable “sad”, which resides in a register // the function devSum16 sums all 16 variables using the reduce primitive and SHMEM sad = devSum16(share_mem, sad, thread_id); if (!thread_id) // executes once for an image block { share_mem->mode[MODE_H].mode = INTRA_PRED_16x16_HORIZONTAL; // store sad if left predictor actually exists // otherwise store any number greater than the worst const score share_mem->mode[MODE_H].score = mb_x ? sad : WORST_INTRA_SCORE; }
  • 45. PRESENTED BY CUDA: Intra Prediction 16x16 44 3. Under the certain assumptions about the CUDA warp size, some of the __syncthreads can be omitted __device__ int devSum16( IntraPred16x16FullSearchShareMem_t* share_mem, int a, int thread_id ) { share_mem->tmp[thread_id] = a; // store the register value in the shared memory vector //consisting of 16 cells __syncthreads(); // make all threads pass the barrier when working with shared memory if ( thread_id < 8 ) // work with only 8 first threads out of 16 { share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 8]; __syncthreads(); share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 4]; // only 4 threads are useful __syncthreads(); share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 2]; // only 2 threads are useful __syncthreads(); share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 1]; // only 1 thread is useful __syncthreads(); } return share_mem->tmp[0]; }
  • 46. PRESENTED BY CUDA: Intra Prediction 16x16 45 3. __device__ int devSum16 1 3 6 0 3 2 8 1 5 14 16 5 21 19 40 1 2 1 1 0 7 1 3 5 2 7 1 3 9 9 4
  • 47. PRESENTED BY CUDA: Intra Prediction 16x16 4. Calculate rest of the rules (Vertical and Plane) 5. Rule selection: take the one with the minimum SAD 6. Possible to do Intra reconstruction in the same kernel (see gray images few slides back) 46
  • 48. PRESENTED BY CUDA Host: Intra Prediction 16x16 47 // main() body byte *h_frame, *d_frame; // frame pointers in CPU (host) and CUDA (device) memory int *h_ip16modes, *d_ip16modes; // output prediction rules pointers int width, height; // frame dimensions getImageSizeFromFile(pathToImg, &width, &height); // get frame dimensions int sizeFrame = width * height; // calculate frame size int sizeModes = (width/16) * (height/16) * sizeof(int); // calculate prediction rules array size cudaMallocHost((void**)&h_frame, sizeFrame); // allocate host frame memory (if not having yet) cudaMalloc((void**)&d_frame, sizeFrame); // allocate device frame memory cudaMallocHost((void**)& h_ip16modes, sizeModes); // allocate host rules memory cudaMalloc((void**)& d_ip16modes, sizeModes); // allocate device rules memory loadImageFromFile(pathToImg, width, height, &h_frame); // load image to h_frame cudaMemcpy(d_frame, h_frame, sizeFrame, cudaMemcpyHostToDevice); // copy host frame to device dim3 blockSize = {16, 4, 1}; // configure CUDA block (16 threads per image block, 4 blocks in CUDA block) dim3 gridSize = {width / 16, height / 64, 1}; // configure CUDA grid to cover the whole frame intraPrediction16x16<<<gridSize, blockSize>>>(d_frame, d_modes_out); // launch kernel cudaMemcpy(h_ip16modes, d_ip16modes, sizeModes, cudaMemcpyDeviceToHost); // copy rules back to host cudaFree(d_frame); cudaFreeHost(h_frame); // free frames memory cudaFree(d_ip16modes); cudaFreeHost(h_ip16modes); // free rules memory
  • 49. PRESENTED BY SAD and SATD • Sum of Absolute Transformed Differences increases MVF quality • More robust than SAD • Takes more time to compute • Can be efficiently computed, i.e. split 16x16 block into 16 blocks of 4x4 pixels and do transform in-place 48
  • 50. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 49
  • 54. PRESENTED BY Motion Estimation 53 Current (P) frame Reference (I, previous) frame Original Block match Motion Vector Field Like in Intra Prediction, when doing block-matching Motion Estimation a frame is split into blocks (typically 16x16, 8x8 and subdivisions like 16x8, 8x16, 8x4, 4x8, 4x4)
  • 55. PRESENTED BY Motion Estimation 54 Motion vectors field (MVF) quality can be represented by the amount of information in the residual (the less – the better) Compensated from reference (notice blocking) Original frame Residual Original-Compensated Residual Original-Reference
  • 56. PRESENTED BY ME – Full Search 55 • Full search of the best match in some local area of the frame • The area is defined as a rectangle centered on the block that we want to predict and going range_x pixels both sides along X axis and range_y pixels along Y axis • range_x is usually greater than range_y because horizontal motion prevails in movies • Very computationally expensive though straight- forward • Doesn’t “catch” long MVs outside of the range Current frame Reference frame range_x range_y
  • 57. PRESENTED BY ME – Diamond search • Template methods class reduces the amount of block matches by the order of magnitude comparing to Full Search • On every iteration the algorithm matches 6 blocks with anchors that form diamond centered on the initial position • The diamond size doesn’t grow and must reduce with the amount of steps made 56
  • 58. PRESENTED BY ME – Diamond search 57 • The majority of template methods are based on gradient descent and have certain application restrictions • When searching for a match for the block on the frame below, red initialization is not better than a random block selection, while green will likely converge • Used often in junction with some other methods as a refinement step
  • 59. PRESENTED BY ME – Candidate Set • Objects that move inside a frame are typically larger than a block, adjacent blocks have similar motion vectors • Candidate set is a list of motion vectors that have to be checked in first place. The list is usually filled with motion vectors of already found adjacent (spatially or temporarily) blocks • The majority of CPU-school ME algorithms fill candidate set with motion vectors found for left, top-left, top, top-right blocks (assuming block-linear YX-order processing) • This creates an explicit data dependency which hinders efficient porting of the algorithm to CUDA 58
  • 60. PRESENTED BY ME – 2D-wave • A non-insulting way to resolve spatial data dependency • Splits processing into the list of groups of blocks, with possibility of parallel processing of blocks within any group. Groups have to be executed in order • The basic idea – to move by independent slices from the top-left corner to the down-right. TN stands for the group ID (and an iteration) 59 • Inefficiency of CUDA implementation: – Low SMs load – Need work pool with producerconsumer roles, or several kernel launches – Irregular data access
  • 61. PRESENTED BY ME – 3D-wave • 3D-wave treats time as a 3rd dimension and moves from the top-left corner of the first frame (cube origin) by groups of independent blocks • Useful for resolving spatio-temporal candidate set dependencies 60 Frame 0 Frame 1 Frame 2 Legend Blocks with MV found Blocks to be processed (in parallel)
  • 62. PRESENTED BY ME – GPU-friendly candidate set • 2D and 3D wave approaches make GPU mock CPU pipeline by using clever tricks, which rarely agree with best GPU programming practices • If full CPU-GPU data compliance is not a case, then revising the candidate set or even the full Motion Estimation pipeline is necessary • Revised GPU-friendly candidate sets might give worse MVF. But they will free tons of time spent on maintaining useless structures, which in turn can be partially spent on doing more computations to leverage MVF quality 61
  • 63. PRESENTED BY ME – GPU-friendly candidate set • Any candidate set is friendly if it is known for every block at the kernel launch time (see Hierarchical Search) • A candidate set with spatial MV dependency can be formed as below • Just by removing one (left) spatial candidate, we: – Enhance memory access pattern – Relax constraints on work producingconsuming – Reduce the code size and complexity – Equalize parallel execution group sizes • Can be implemented in a loop where each CUDA block processes a column of the frame of few image blocks width • Some grid synchronization trickery is still needed 62 Group ID Group Size (MBs)
  • 64. PRESENTED BY ME – Hierarchical Search 63 Current frame Reference frame Full Search Refine Refine Refinement data Refinement data Level 2 Level 1 Level 0
  • 65. PRESENTED BY ME – Hierarchical Search Algorithm: 1. Build image pyramids 2. Perform full search on the top level 3. A motion vector of level K is the best approximation (candidate) for all “underlying” blocks on level K+1 and can be in candidate set for block on all subsequent levels 4. MVs refinement can be done by any template search 5. On the refinement step, N best matches can be passed to subsequent levels to prevent the fall into the local minima 64
  • 66. PRESENTED BY ME – Hierarchical Search Hierarchical search ingredients: 1. Downsampling kernel (participates in the creation of pyramid) 2. Full search kernel (creates the initialization MVF on the largest scale, can be used to refine problematic areas on subsequent layers) 3. Template search kernel (refines the detections obtained from the previous scales, or on the previous steps, frames) 4. Global memory candidate sets approach (to store N best results from the previous layers and frames) 65
  • 67. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 66
  • 68. PRESENTED BY CUDA and OpenCL • For the sub-pixel Motion Estimation precision CUDA provides textures with bilinear interpolation on access • CUDA provides direct access to GPU intrinsics • CUDA: C++, IDE (Parallel Nsight), Visual Studio Integration, ease of use • OpenCL kernels are not a magic bullet, they need to be fine-tuned for each particular architecture • OpenCL for CPU vs GPU will have different optimizations 67
  • 69. PRESENTED BY Tesla and Fermi architectures • All algorithms will benefit from L1 and L2 caches (Fermi) • Forget about SHMEM bank conflicts when working with 1-byte data types • Increased SHMEM size (Tesla = 16K, Fermi = up to 48K) • 2 Copy engines, Parallel kernel execution 68
  • 70. PRESENTED BY Multi-GPU encoding • One slice per GPU – Fine-grain parallelism – Requires excessive PCI-e transfers to collect results either on CPU or on one of the GPUs to generate the final bitstream – Requires data frame-level data consolidation after every kernel launch – Data management might get not trivial due to apron processing 69 I I P B B P B B
  • 71. PRESENTED BY Multi-GPU encoding • N GOPs for N GPUs – Coarse-grain parallelism – Better for offline compression – Higher utilization of GPU resources – Requires N times more memory on CPU to handle processing – Drawbacks: must be careful when concatenating GOP structures, CPU multiplexing may become a bottleneck, – Requires N times more memory on the host to process video 70 I P B B I P B B
  • 72. PRESENTED BY Other directions • JPEG2000 - cuj2k is out there – open-source: http://sourceforge.net/projects/cuj2k/ – needs more BPP for usage in the industry, currently supports only 8bpp per channel • CUDA JPEG encoder – now available in CUDA C SDK • MJPEG 71
  • 73. PRESENTED BY Outline • Motivation • Video encoding facilities • Video encoding with CUDA – Principles of a hybrid CPU/GPU encode pipeline – GPU encoding use cases – Algorithm primitives: reduce, scan, compact – Intra prediction in details – High level Motion Estimation approaches – CUDA, OpenCL, Tesla, Fermi – Source code 72
  • 74. PRESENTED BY Source code • The reference Full Search algorithm is implemented in CUDA and is available here in public domain: http://tinyurl.com/cuda-me-fs http://courses.graphicon.ru/files/courses/mdc/2010/assigns/assign3/MotionEstimationCUDA.zip • Feel free to email us to see if we have an algorithm you need implemented in CUDA 73
  • 75. PRESENTED BY Recommended GTC talks We have a plenty of Computer Vision algorithms developed and discussed by our engineers: – James Fung, “Accelerating Computer Vision on the Fermi Architecture” – Joe Stam, “Extending OpenCV with GPU Acceleration” – Timo Stich, “Fast High-Quality Panorama Stitching” 74
  • 76. PRESENTED BY Ideas? Questions? Suggestions? 75 reach me via email: Anton Obukhov <aobukhov@nvidia.com>