The document discusses GPU-accelerated video encoding using CUDA. It outlines the motivation for GPU encoding and describes the basic principles of a hybrid CPU-GPU encoding pipeline. It then discusses various algorithm primitives like reduce, scan and compact that are useful building blocks for encoding algorithms. The document dives deeper into how intra prediction and motion estimation can be implemented on the GPU. It provides pseudocode examples for intra prediction of 16x16 blocks using CUDA threads and shared memory.
Heterogeneous computing is seen as a path forward to deliver the energy and performance improvements needed over the next decade. To that end, heterogeneous systems feature GPUs (Graphics Processing Units) or FPGAs (Field Programmable Gate Arrays) that excel at accelerating complex tasks while consuming less energy. There are also on-chip heterogeneous architectures, like the processors developed for mobile devices (laptops, tablets and smartphones), comprising multiple cores and a GPU.
This talk covers hardware and software aspects of these kinds of heterogeneous architectures. Regarding the HW, we briefly discuss the underlying architecture of some heterogeneous chips composed of multicores+GPU and multicores+FPGA, delving into the differences between both kinds of accelerators and how to measure the energy they consume. We also address the different solutions for getting a coherent view of the memory shared between the cores and the GPU, or between the cores and the FPGA.
Various processor architectures are described in this presentation. It could be useful for people working on hardware selection and processor identification.
Presentation I gave at the SORT Conference in 2011. It was generalized from work I had done using GPUs to accelerate image processing at FamilySearch.
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)
byteLAKE's presentation from the PPAM 2019 conference.
Abstract:
The goal of this work is to adapt 4 CFD kernels to the Xilinx ALVEO U250 FPGA: the first-order step of the non-linear iterative upwind advection MPDATA scheme (non-oscillatory forward in time), the divergence part of the matrix-free linear operator formulation in the iterative Krylov scheme, the tridiagonal Thomas algorithm for vertical matrix inversion inside the preconditioner for the iterative solver, and computation of the pseudovelocity for the second pass of the upwind algorithm in MPDATA. All the kernels use a 3-dimensional compute domain consisting of 7 to 11 arrays. Since all kernels belong to the group of memory-bound algorithms, our main challenge is to provide the highest utilization of global memory bandwidth. Our adaptation allows us to reduce the execution time by up to 4x.
Find out more at: www.byteLAKE.com/en/CFD
Foot note:
This is the presentation about the non-AI version of byteLAKE's CFD kernels, highly optimized for Alveo FPGA. Based on this research project and many others in the CFD space, we decided to shift the course of the CFD Suite product development and leverage AI to accelerate computations and enable new possibilities. Instead of adapting CFD solvers to accelerators, we use AI and work on a cross-platform solution. More on the latest: www.byteLAKE.com/en/CFDSuite.
-
Update for 2020: byteLAKE is currently developing CFD Suite, a collection of AI (Artificial Intelligence) models to accelerate and enable new features for CFD simulations. It is a cross-platform solution (not only for FPGAs). More: www.byteLAKE.com/en/CFDSuite.
My activities at CSIR-CEERI, related to semiconductor devices (memory and linear IC design), circuits, and smart-system sensor signal-processing hardware and software design and development.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/07/efficiently-map-ai-and-vision-applications-onto-multi-core-ai-processors-using-cevas-parallel-processing-framework-a-presentation-from-ceva/
Rami Drucker, Machine Learning Software Architect at CEVA, presents the “Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework” tutorial at the May 2023 Embedded Vision Summit.
Next-generation AI and computer vision applications for autonomous vehicles, cameras, drones and robots require higher-than-ever computing power. Often, the most efficient way to deliver high performance (especially in cost- and power-constrained applications) is to use multi-core processors. But developers must then map their applications onto the multiple cores in an efficient manner, which can be difficult. To address this challenge and streamline application development, CEVA has introduced the Architecture Planner tool as a new element in CEVA’s comprehensive AI SDK.
In this talk, Drucker shows how the Architecture Planner tool analyzes the network model and the processor configuration (number of cores, memory sizes), then automatically maps the workload onto the multiple cores in an efficient manner. He explains key techniques used by the tool, including symmetrical and asymmetrical multi-processing, partition by sub-graphs, batch partitioning and pipeline partitioning.
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf, by Kamal Acharya
The College Bus Management System is developed in Visual Basic .NET, connected to an MS SQL Server database, a well-proven combination of front-end and back-end technologies. The application uses a flat user interface design, which was a popular style in 2017, and gives priority to system functionality: it manages student details, driver details, bus details, bus routes, bus fees and more. There is a single admin unit; the admin logs in with a username and password and manages the entire application. It is developed for both large and small colleges and is user-friendly enough that non-technical staff can learn to manage it within hours, and it is kept secure by the admin. The application generates reports that the admin can view and download in Excel format, which makes it easy to understand the income and expenses of the college bus service. It is developed mainly for Windows users (in 2017, roughly 73% of enterprises used the Windows operating system), so it installs easily on all Windows systems. The application's installed size is very small, so the user needs to allocate only minimal local disk space for it.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams, from the hydrologist's survey of the valley before construction, covering all involved disciplines (fluid dynamics, structural engineering, generation and mains-frequency regulation), to the transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co-editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Saudi Arabia stands as a titan in the global energy landscape, renowned for its abundant oil and gas resources. It's the largest exporter of petroleum and holds some of the world's most significant reserves. Let's delve into the top 10 oil and gas projects shaping Saudi Arabia's energy future in 2024.
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL), by MdTanvirMahtab2
This presentation is about the working procedure of Shahjalal Fertilizer Company Limited (SFCL), a government-owned company of Bangladesh Chemical Industries Corporation under the Ministry of Industries.
Automobile Management System Project Report.pdf, by Kamal Acharya
The proposed project is developed to manage the automobiles in an automobile dealer company. The main modules in this project are login, automobile management, customer management, sales, complaints and reports. The first module is the login: the automobile showroom owner must log in to use the application. The username and password are verified and, if they are correct, the next form opens; otherwise an error message is shown.
When a customer searches for an automobile, if it is available they are taken to a page showing the details of the automobile, including automobile name, automobile ID, quantity, price, etc. The "Automobile Management System" is useful for maintaining automobiles and customers effectively, and hence helps establish a good relationship between the customer and the automobile organization. It contains various customized modules for maintaining automobile and stock information accurately and safely.
When an automobile is sold to a customer, stock is reduced automatically; when a new purchase is made, stock is increased automatically. While selecting automobiles for sale, the proposed software automatically checks the total available stock of that particular item, and if it is less than 5, the software notifies the user to purchase more of that item.
Also, when the user tries to sell items that are not in stock, the system prompts the user that the stock is not sufficient. Customers of this system can search for an automobile and purchase one easily by quick selection. On the other hand, the stock of automobiles can be maintained properly by the automobile shop manager, overcoming the drawbacks of the existing system.
Sachpazis: Terzaghi Bearing Capacity Estimation in simple terms with Calculati..., by Dr. Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The calculation HTML code is included.
Forklift Classes Overview by Intella Parts
Discover the different forklift classes and their specific applications. Learn how to choose the right forklift for your needs to ensure safety, efficiency, and compliance in your operations.
For more technical information, visit our website https://intellaparts.com
About
Indigenized remote control interface card suitable for MAFI-system CCR equipment. Compatible with the IDM8000 CCR. Backplane-mounted serial and TCP/Ethernet communication module for CCR remote access. IDM 8000 CCR remote control over serial and TCP protocols.
• Remote control: parallel or serial interface.
• Compatible with the MAFI CCR system.
• Compatible with the IDM8000 CCR.
• Compatible with backplane-mount serial communication.
• Compatible with commercial and defence aviation CCR systems.
• Remote control system for accessing the CCR and allied systems over serial or TCP.
• Indigenized local support/presence in India.
• Easy configuration using DIP switches.
1. NVIDIA DevTech | Anton Obukhov <aobukhov@nvidia.com>
GPU-Accelerated Video Encoding
2. PRESENTED BY
Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
4. Motivation for the talk
Video encode and decode are more significant now than ever, as media sharing has become more popular on portable media devices and on the internet.
Encoding HD movies takes tens of hours on modern desktops.
Portable and mobile devices have underutilized processing power.
5. Trends
• GPU encoding is gaining wide adoption among developers (consumer, server, and professional)
• Until now, most encoders have delivered performance through a Fast profile, while the Quality profile increases CPU usage
• The normal state of things: CUDA has existed since 2007
• A video encoder becomes mature after ~5 years
• Time to deliver quality GPU encoding
6. Trends
• Knuth: “We should forget about small efficiencies, say
about 97% of the time: premature optimization is the root
of all evil”
• x86 over-optimization has its roots in reference encoders
• Slices were a step towards partial encoder parallelization, but the idea as a whole is not GPU-friendly (it does not fit well with SIMT)
• The integrated solution is an encoder architecture designed from scratch
8. Video encoding with NVIDIA GPU
Facilities:
• SW H.264 codec designed for CUDA
– Baseline, Main, High profiles
– CABAC, CAVLC
Interfaces:
• C library (NVCUVENC)
• DirectShow API
10. Hybrid CPU-GPU encoding pipeline
• The codec should be designed with best practices for both
CPU and GPU architectures
• PCI-e is the main bridge between the architectures; its bandwidth is limited, and CPU-GPU data transfers can take away all the benefits of GPU speed-up
• PCI-e bandwidth is an order of magnitude lower than video memory bandwidth
11. Hybrid CPU-GPU encoding pipeline
• Not many parts of a codec vitally need serial data dependencies: CABAC and CAVLC are the best-known examples
• Many other dependencies were introduced to follow CPU best-practice guides, and can be resolved by revising the codec architecture
• It might be beneficial to perform some serial processing with one CUDA block and one CUDA thread on the GPU instead of copying data back and forth
13. GPU Encoding Use Cases
• Video encoding
• Video transcoding
• Encoding live video input
• Compressing rendered scenes
14. GPU Encoding Use Cases
• The use cases differ in the way input video frames
appear in GPU memory
• Frames come from GPU memory when:
– Compressing rendered scenes
– Transcoding using CUDA decoder
• Frames come from CPU memory when:
– Encoding/Transcoding of a video file decoded on CPU
– Live feed from a webcam or any other video input device
via USB
15. GPU Encoding Use Cases
Helper libraries for source acquisition acceleration:
• NVCUVID – GPU acceleration of H.264, VC-1, MPEG2
decoding in hardware (available in CUDA C SDK)
• videoInput – an open-source library for handling video devices on the CPU via DirectShow
– A DirectShow filter can be implemented instead, to minimize the number of "hidden" buffers on the CPU; the webcam filter then writes frames directly into pinned/mapped CUDA memory
17. Algorithm primitives
• Fundamental primitives of Parallel Programming
• Building blocks for other algorithms
• Have very efficient implementations
• Mostly well-known:
– Reduce
– Scan
– Compact
18. Reduce
• Having a vector of numbers a = [ a0, a1, … an-1 ]
• And a binary associative operator ⊕, calculate:
res = a0 ⊕ a1 ⊕ … ⊕ an-1
• Instead of ⊕ take any of the following:
“ + ”, “ * ”, “min”, “max”, etc.
19. Reduce
• Use N/2 threads
• Each thread loads 2 elements, applies the operator to them, and puts the intermediate result into shared memory
• Repeat log2 N times, halving the number of threads on every iteration; use __syncthreads to guarantee that shared-memory results are visible across the block
2 5 0 1 3 9 9 4
7 1 12 13
8 25
33
Legend
⊕
__syncthreads
24. Scan
• Having a vector of numbers a = [ a0, a1, … an-1 ]
• And a binary associative operator ⊕, calculate:
scan(a) = [ 0, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2) ]
• Instead of ⊕ take any of the following:
“ + ”, “ * ”, “min”, “max”, etc.
31. Compact
• Having a vector of numbers a = [ a0, a1, … an-1 ] and a mask of elements of interest m = [ m0, m1, …, mn-1 ], mi ∈ {0, 1},
• Calculate:
compact(a) = [ ai0, ai1, … aik-1 ], where k = reduce(m) and ∀ j ∈ [0, k-1]: mij = 1.
33. Compact sample
The algorithm makes use of the scan primitive:
Input:            0 1 2 3 4 5 6 7 8 9
Mask of interest: 1 0 1 0 1 0 0 1 1 0
Scan of the mask: 0 1 1 2 2 3 3 3 4 5
Compacted vector: 0 2 4 7 8
34. Algorithm primitives discussion
• Highly efficient implementations exist in the CUDA C SDK, CUDPP and Thrust libraries
• Performance benefits from native HW intrinsics (POPC, etc.), which make a difference when implementing these primitives in CUDA
35. Algorithm primitives discussion
• Reduce: sum of absolute differences (SAD)
• Scan: Integral Image calculation, Compact facilitation
• Compact: Work pool index creation
38. Intra Prediction
• Each block is predicted according to the predictors vector and the
prediction rule
• Predictors vector consists of N pixels to the top of the block, N to
the left and 1 pixel at top-left location
• Prediction rules (a few basic ones for 16x16 blocks):
1. Constant – each pixel of the block is predicted with a constant
value
2. Vertical – each row of the block is predicted using the top predictor
3. Horizontal – each column of the block is predicted using the left
predictor
4. Plane – A 2D-plane prediction that optimally fits the source pixels
39. Intra Prediction
• Select the rule with the lowest prediction error
• The prediction error can be calculated as a Sum of Absolute Differences
of the corresponding pixels in the predicted and the original blocks:
• Data compression effect: only need to store predictors and the prediction
rule
SAD = Σ_y Σ_x | src[y][x] - predict[y][x] |
42. CUDA: Intra Prediction 16x16
• An image block of 16x16 pixels is processed by 16 CUDA threads
• Each CUDA block contains 64 threads and processes 4 image blocks at once:
– threads 0-15
– threads 16-31
– threads 32-47
– threads 48-63
• Block configuration: blockDim = {16, 4, 1};
43. CUDA: Intra Prediction 16x16
1. Load image block and predictors into the shared memory:
int mb_x = ( blockIdx.x * 16 ); // left offset (of the current 16x16 image block in the frame)
int mb_y = ( blockIdx.y * 64 ) + ( threadIdx.y * 16 ); // top offset
int thread_id = threadIdx.x; // thread ID [0,15] working inside of the current image block
// copy the entire image block into the shared memory
int offset = thread_id * 16; // offset of the thread_id row from the top of image block
byte* current = share_mem->current + offset; // pointer to the row in shared memory (SHMEM)
int pos = (mb_y + thread_id) * stride + mb_x; // position of the corresponding row in the frame
fetch16pixels(pos, current); // copy 16 pixels of the frame from GMEM to SHMEM
// copy predictors
pos = (mb_y - 1) * stride + mb_x; // position one pixel to the top of the current image block
if (!thread_id) // this condition is true only for one thread (zero) working on the image block
fetch1pixel(pos - 1, share_mem->top_left); // copy 1-pix top-left predictor
fetch1pixel(pos + thread_id, share_mem->top[thread_id]); // copy thread_idth pixel of the top predictor
pos += stride - 1; // position one pixel to the left of the current image block
fetch1pixel(pos + thread_id * stride, share_mem->left[thread_id]); // copy left predictor
44. CUDA: Intra Prediction 16x16
2. Calculate horizontal prediction (using the left predictor):
//INTRA_PRED_16x16_HORIZONTAL
int left = share_mem->left[thread_id]; // load thread_id pixel of the left predictor
int sad = 0;
for (int i=0; i<16; i++) // iterate pixels of the thread_idth row of the image block
{
sad = __usad(left, current[i], sad); // accumulate row-wise SAD
}
// each of 16 threads own its own variable “sad”, which resides in a register
// the function devSum16 sums all 16 variables using the reduce primitive and SHMEM
sad = devSum16(share_mem, sad, thread_id);
if (!thread_id) // executes once for an image block
{
share_mem->mode[MODE_H].mode = INTRA_PRED_16x16_HORIZONTAL;
// store sad if left predictor actually exists
// otherwise store any number greater than the worst const score
share_mem->mode[MODE_H].score = mb_x ? sad : WORST_INTRA_SCORE;
}
45. CUDA: Intra Prediction 16x16
3. Under certain assumptions about the CUDA warp size, some of the __syncthreads can be omitted:
__device__ int devSum16( IntraPred16x16FullSearchShareMem_t* share_mem, int a, int thread_id )
{
share_mem->tmp[thread_id] = a; // store the register value in the shared memory vector
//consisting of 16 cells
__syncthreads(); // make all threads pass the barrier when working with shared memory
if ( thread_id < 8 ) // work with only 8 first threads out of 16
{
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 8];
__syncthreads();
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 4]; // only 4 threads are useful
__syncthreads();
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 2]; // only 2 threads are useful
__syncthreads();
share_mem->tmp[thread_id] += share_mem->tmp[thread_id + 1]; // only 1 thread is useful
__syncthreads();
}
return share_mem->tmp[0];
}
47. CUDA: Intra Prediction 16x16
4. Calculate the rest of the rules (Vertical and Plane)
5. Rule selection: take the one with the minimum SAD
6. It is possible to do Intra reconstruction in the same kernel (see the gray images a few slides back)
48. CUDA Host: Intra Prediction 16x16
// main() body
byte *h_frame, *d_frame; // frame pointers in CPU (host) and CUDA (device) memory
int *h_ip16modes, *d_ip16modes; // output prediction rules pointers
int width, height; // frame dimensions
getImageSizeFromFile(pathToImg, &width, &height); // get frame dimensions
int sizeFrame = width * height; // calculate frame size
int sizeModes = (width/16) * (height/16) * sizeof(int); // calculate prediction rules array size
cudaMallocHost((void**)&h_frame, sizeFrame); // allocate pinned host frame memory (if not allocated yet)
cudaMalloc((void**)&d_frame, sizeFrame); // allocate device frame memory
cudaMallocHost((void**)&h_ip16modes, sizeModes); // allocate host rules memory
cudaMalloc((void**)&d_ip16modes, sizeModes); // allocate device rules memory
loadImageFromFile(pathToImg, width, height, &h_frame); // load image to h_frame
cudaMemcpy(d_frame, h_frame, sizeFrame, cudaMemcpyHostToDevice); // copy host frame to device
dim3 blockSize = {16, 4, 1}; // 16 threads per image block, 4 image blocks per CUDA block
dim3 gridSize = {width / 16, height / 64, 1}; // CUDA grid covers the whole frame (64 = 16 rows x 4 blocks)
intraPrediction16x16<<<gridSize, blockSize>>>(d_frame, d_ip16modes); // launch kernel
cudaMemcpy(h_ip16modes, d_ip16modes, sizeModes, cudaMemcpyDeviceToHost); // copy rules back to host
cudaFree(d_frame); cudaFreeHost(h_frame); // free frame memory
cudaFree(d_ip16modes); cudaFreeHost(h_ip16modes); // free rules memory
49. SAD and SATD
• Sum of Absolute Transformed Differences (SATD) improves
MVF quality
• More robust than SAD
• Takes more time to compute
• Can be computed efficiently, e.g. split a 16x16 block into
16 blocks of 4x4 pixels and do the transform in-place
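As an illustration (not code from the deck), a minimal host-side comparison of SAD against a 4x4 Hadamard-based SATD; the function names, the butterfly ordering, and the unnormalized transform scaling are all assumptions of this sketch:

```c
#include <stdlib.h>

/* SAD of one 4x4 difference block */
static int sad4x4(const int cur[4][4], const int ref[4][4])
{
    int sum = 0;
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            sum += abs(cur[y][x] - ref[y][x]);
    return sum;
}

/* SATD: 4x4 Hadamard transform of the difference, then a sum of
   absolute coefficients - correlates better with the coded cost */
static int satd4x4(const int cur[4][4], const int ref[4][4])
{
    int d[4][4], t[4][4];
    for (int y = 0; y < 4; ++y)
        for (int x = 0; x < 4; ++x)
            d[y][x] = cur[y][x] - ref[y][x];

    /* horizontal 4-point Hadamard butterfly on each row */
    for (int y = 0; y < 4; ++y) {
        int a = d[y][0] + d[y][3], b = d[y][1] + d[y][2];
        int c = d[y][1] - d[y][2], e = d[y][0] - d[y][3];
        t[y][0] = a + b; t[y][1] = a - b; t[y][2] = e - c; t[y][3] = e + c;
    }
    /* vertical pass, accumulating absolute coefficients */
    int sum = 0;
    for (int x = 0; x < 4; ++x) {
        int a = t[0][x] + t[3][x], b = t[1][x] + t[2][x];
        int c = t[1][x] - t[2][x], e = t[0][x] - t[3][x];
        sum += abs(a + b) + abs(a - b) + abs(e - c) + abs(e + c);
    }
    return sum;
}
```

For a flat difference both metrics agree, but an isolated impulse (high-frequency error) is penalized much more heavily by SATD, which is why it is more robust for mode decisions.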
50. Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
54. Motion Estimation
(Figure: the current (P) frame matched against the reference (I, previous) frame; panels show the original, the block match, and the motion vector field.)
As in Intra Prediction, block-matching Motion Estimation splits a
frame into blocks (typically 16x16, 8x8 and subdivisions like 16x8,
8x16, 8x4, 4x8, 4x4)
55. Motion Estimation
Motion vector field (MVF) quality can be judged by the amount of information
in the residual (the less, the better)
(Figure: the frame compensated from the reference (notice the blocking), the original frame, the residual Original-Compensated, and the residual Original-Reference.)
56. ME – Full Search
• Full search for the best match in some local area of
the frame
• The area is defined as a rectangle centered on the
block we want to predict, extending range_x
pixels to both sides along the X axis and range_y pixels
along the Y axis
• range_x is usually greater than range_y because
horizontal motion prevails in movies
• Very computationally expensive, though straightforward
• Doesn’t “catch” long MVs outside of the range
(Figure: the range_x by range_y search rectangle around the current block, shown on the current and reference frames.)
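The procedure above can be sketched as a host-side C reference (the deck's actual implementation is the CUDA kernel linked at the end; all names here are hypothetical, frames are 8-bit luma planes):

```c
#include <stdlib.h>

typedef struct { int dx, dy, sad; } MV;

/* SAD of a 16x16 block at (cx,cy) in cur vs. (rx,ry) in ref */
static int sad16x16(const unsigned char* cur, const unsigned char* ref,
                    int stride, int cx, int cy, int rx, int ry)
{
    int sum = 0;
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
            sum += abs(cur[(cy + y) * stride + cx + x] -
                       ref[(ry + y) * stride + rx + x]);
    return sum;
}

/* Full search: test every candidate vector in [-range_x..range_x] x
   [-range_y..range_y], keeping the one with the minimum SAD. */
static MV fullSearch(const unsigned char* cur, const unsigned char* ref,
                     int width, int height, int cx, int cy,
                     int range_x, int range_y)
{
    MV best = { 0, 0, sad16x16(cur, ref, width, cx, cy, cx, cy) };
    for (int dy = -range_y; dy <= range_y; ++dy)
        for (int dx = -range_x; dx <= range_x; ++dx) {
            int rx = cx + dx, ry = cy + dy;
            if (rx < 0 || ry < 0 || rx + 16 > width || ry + 16 > height)
                continue; /* candidate block must stay inside the frame */
            int s = sad16x16(cur, ref, width, cx, cy, rx, ry);
            if (s < best.sad) { best.dx = dx; best.dy = dy; best.sad = s; }
        }
    return best;
}
```

A GPU version evaluates many candidates in parallel and collapses the per-thread SADs with a devSum16-style reduce.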
57. ME – Diamond search
• The class of template methods reduces the number of block matches by an
order of magnitude compared to Full Search
• On every iteration the algorithm matches 6 blocks at anchor positions that form a diamond
centered on the current position
• The diamond size never grows and must shrink as the number of steps grows
58. ME – Diamond search
• The majority of template methods are based on gradient descent and have certain
application restrictions
• When searching for a match for the block on the frame below, the red initialization
is no better than a random block selection, while the green one will likely converge
• Often used in conjunction with other
methods as a refinement
step
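The gradient-descent character of template search can be sketched as below (not deck code; a 4-anchor small diamond for brevity, where the deck's variant matches 6 blocks per iteration, and a convex toy cost stands in for the block SAD):

```c
#include <stdlib.h>

/* Toy matching cost: in a real encoder this would be the SAD of the
   block displaced by (dx, dy). Hidden optimum at (3, -2). */
static int cost(int dx, int dy)
{
    return abs(dx - 3) + abs(dy + 2);
}

/* Move to the best improving diamond anchor until no anchor improves. */
static void diamondSearch(int* bx, int* by)
{
    static const int anchor[4][2] = { {1,0}, {-1,0}, {0,1}, {0,-1} };
    int improved = 1;
    while (improved) {
        improved = 0;
        int bestc = cost(*bx, *by);
        for (int i = 0; i < 4; ++i) {
            int nx = *bx + anchor[i][0], ny = *by + anchor[i][1];
            if (cost(nx, ny) < bestc) {
                bestc = cost(nx, ny);
                *bx = nx; *by = ny;
                improved = 1;
            }
        }
    }
}
```

Started near the optimum (the "green" case) the descent converges in a handful of matches; started on a flat or misleading cost surface (the "red" case) it stalls, which is why template search is usually a refinement step.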
59. ME – Candidate Set
• Objects that move inside a frame are typically larger than a block, so adjacent blocks have
similar motion vectors
• A candidate set is a list of motion vectors that are checked first. The list is
usually filled with motion vectors of already processed adjacent (spatially or temporally)
blocks
• The majority of CPU-school ME algorithms fill the candidate set with motion vectors found
for the left, top-left, top and top-right blocks (assuming block-linear YX-order processing)
• This creates an explicit data dependency which hinders efficient porting of the
algorithm to CUDA
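The classic CPU-style candidate set can be sketched as follows (not deck code; the MV type and MVF layout are hypothetical) — note how every candidate except the left one comes from the already finished row above:

```c
typedef struct { int dx, dy; } MV;

/* Collect the spatial candidates for block (bx, by): MVs of the left,
   top-left, top and top-right neighbours, skipped at frame borders.
   mvf holds one MV per block in block-linear YX order. */
static int spatialCandidates(const MV* mvf, int blocksX, int blocksY,
                             int bx, int by, MV out[4])
{
    static const int off[4][2] = { {-1,0}, {-1,-1}, {0,-1}, {1,-1} };
    int n = 0;
    for (int i = 0; i < 4; ++i) {
        int nx = bx + off[i][0], ny = by + off[i][1];
        if (nx < 0 || ny < 0 || nx >= blocksX || ny >= blocksY)
            continue; /* neighbour outside the frame */
        out[n++] = mvf[ny * blocksX + nx]; /* already processed in YX order */
    }
    return n;
}
```

The left neighbour is the problem: it forces blocks within a row to be processed serially, which is exactly the dependency the wave schemes and GPU-friendly sets below attack.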
60. ME – 2D-wave
• A non-invasive way to resolve the spatial data dependency
• Splits processing into a list of groups of blocks; blocks within a group can be processed in
parallel, but groups have to be executed in order
• The basic idea: move in independent anti-diagonal slices from the top-left corner to the
bottom-right. TN stands for the group ID (and an iteration)
• Inefficiencies of the CUDA implementation:
– Low SM load
– Needs a work pool with
producer/consumer roles, or several
kernel launches
– Irregular data access
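The wave grouping itself is simple to compute (a sketch, not deck code): blocks sharing an anti-diagonal index bx + by have no left/top dependency on each other, so each group can run in parallel while the groups run in order. The low-SM-load problem is visible in the group sizes: they ramp up from 1 and back down to 1.

```c
/* Assign every block of a blocksX x blocksY grid to its 2D-wave group
   (group ID = anti-diagonal index bx + by) and count group sizes.
   Returns the number of groups; groupSize must hold blocksX+blocksY-1 ints. */
static int waveGroups(int blocksX, int blocksY, int groupSize[])
{
    int nGroups = blocksX + blocksY - 1;
    for (int g = 0; g < nGroups; ++g) groupSize[g] = 0;
    for (int by = 0; by < blocksY; ++by)
        for (int bx = 0; bx < blocksX; ++bx)
            groupSize[bx + by]++;
    return nGroups;
}
```

For an 8x5 grid there are 12 groups whose sizes peak at only 5 blocks — far too little parallelism to load the SMs, hence the need for a work pool or several launches.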
61. ME – 3D-wave
• 3D-wave treats time as the 3rd dimension and moves from the top-left corner of the first frame (the cube
origin) in groups of independent blocks
• Useful for resolving spatio-temporal candidate set dependencies
(Figure: Frames 0–2 with a legend distinguishing blocks with MVs already found from blocks to be processed in parallel.)
62. ME – GPU-friendly candidate set
• 2D- and 3D-wave approaches make the GPU mimic a CPU pipeline through
clever tricks, which rarely agree with GPU programming best
practices
• If full CPU-GPU data compliance is not required, revising the
candidate set, or even the full Motion Estimation pipeline, is necessary
• Revised GPU-friendly candidate sets might give a worse MVF, but they
free tons of time spent on maintaining unwieldy structures, which
in turn can be partially spent on doing more computations to
raise MVF quality
63. ME – GPU-friendly candidate set
• Any candidate set is friendly if it is known for every block at kernel launch time
(see Hierarchical Search)
• A candidate set with a spatial MV dependency can be formed as below
• Just by removing one (left) spatial candidate, we:
– Enhance the memory access pattern
– Relax constraints on work producing/consuming
– Reduce code size and complexity
– Equalize parallel execution group sizes
• Can be implemented in a loop where each CUDA block processes a column of the
frame a few image blocks wide
• Some grid synchronization trickery is still needed
(Figure: per-group IDs and group sizes in macroblocks.)
64. ME – Hierarchical Search
(Figure: a three-level pyramid, Level 2 down to Level 0, of the current and reference frames; Full Search runs on the top level, and refinement data flows down through two Refine steps.)
65. ME – Hierarchical Search
Algorithm:
1. Build image pyramids
2. Perform full search on the top level
3. A motion vector found on level K is the best approximation (candidate)
for all “underlying” blocks on level K+1, and can be in the candidate
set for blocks on all subsequent levels
4. MV refinement can be done by any template search
5. On the refinement step, the N best matches can be passed to
subsequent levels to prevent falling into local minima
66. ME – Hierarchical Search
Hierarchical search ingredients:
1. A downsampling kernel (builds the pyramid)
2. A full search kernel (creates the initial MVF on the top, coarsest level; can
also be used to refine problematic areas on subsequent layers)
3. A template search kernel (refines results obtained from the
previous scales, steps or frames)
4. A global-memory candidate set approach (stores the N best results from the
previous layers and frames)
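Ingredient 1 can be sketched as a 2x2 box-filter downsampler (host C, not deck code; the function name and rounding choice are assumptions) — applied repeatedly, it builds the pyramid, each level half the width and height of the one below:

```c
/* Downsample an 8-bit w x h plane by 2x in each dimension using a
   rounded 2x2 box filter; dst must hold (w/2) x (h/2) pixels. */
static void downsample2x(const unsigned char* src, int w, int h,
                         unsigned char* dst)
{
    for (int y = 0; y < h / 2; ++y)
        for (int x = 0; x < w / 2; ++x) {
            int s = src[(2*y)   * w + 2*x] + src[(2*y)   * w + 2*x + 1]
                  + src[(2*y+1) * w + 2*x] + src[(2*y+1) * w + 2*x + 1];
            dst[y * (w / 2) + x] = (unsigned char)((s + 2) / 4); /* rounded average */
        }
}
```

On the GPU the same averaging maps naturally to one thread per destination pixel, and a motion vector found on the downsampled level scales by 2 when projected onto the level below.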
67. Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
68. CUDA and OpenCL
• For sub-pixel Motion Estimation precision, CUDA
provides textures with bilinear interpolation on access
• CUDA provides direct access to GPU intrinsics
• CUDA: C++, an IDE (Parallel Nsight), Visual Studio integration,
ease of use
• OpenCL kernels are not a magic bullet; they need to be
fine-tuned for each particular architecture
• OpenCL for CPU vs. GPU will require different optimizations
69. Tesla and Fermi architectures
• All algorithms benefit from the L1 and L2 caches
(Fermi)
• Forget about SHMEM bank conflicts when working
with 1-byte data types
• Increased SHMEM size (Tesla: 16 KB, Fermi: up to
48 KB)
• Two copy engines, concurrent kernel execution
70. Multi-GPU encoding
• One slice per GPU
– Fine-grain parallelism
– Requires excessive PCI-e transfers to collect results either on the CPU or on one of the
GPUs to generate the final bitstream
– Requires frame-level data consolidation after every kernel launch
– Data management might get non-trivial due to apron processing
(Figure: an I P B B P B B frame sequence with each frame split into slices, one slice per GPU.)
71. Multi-GPU encoding
• N GOPs for N GPUs
– Coarse-grain parallelism
– Better for offline compression
– Higher utilization of GPU resources
– Requires N times more memory on the host to handle processing
– Drawbacks: must be careful when concatenating GOP structures; CPU
multiplexing may become a bottleneck
(Figure: two GOPs, I P B B and I P B B, assigned to different GPUs.)
72. Other directions
• JPEG2000 – cuj2k is out there
– open source: http://sourceforge.net/projects/cuj2k/
– needs higher bit depth for industry use; currently
supports only 8 bpp per channel
• CUDA JPEG encoder – now available in the CUDA C SDK
• MJPEG
73. Outline
• Motivation
• Video encoding facilities
• Video encoding with CUDA
– Principles of a hybrid CPU/GPU encode pipeline
– GPU encoding use cases
– Algorithm primitives: reduce, scan, compact
– Intra prediction in details
– High level Motion Estimation approaches
– CUDA, OpenCL, Tesla, Fermi
– Source code
74. Source code
• The reference Full Search algorithm is implemented in
CUDA and is publicly available:
http://tinyurl.com/cuda-me-fs
http://courses.graphicon.ru/files/courses/mdc/2010/assigns/assign3/MotionEstimationCUDA.zip
• Feel free to email us to ask whether we have an algorithm
you need implemented in CUDA
75. Recommended GTC talks
We have plenty of Computer Vision algorithms
developed and discussed by our engineers:
– James Fung, “Accelerating Computer Vision on the Fermi
Architecture”
– Joe Stam, “Extending OpenCV with GPU Acceleration”
– Timo Stich, “Fast High-Quality Panorama Stitching”