Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform


This presentation gives a framework for parallelizing motion estimation (ME) of HEVC on a multicore platform.



  1. Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Platform. Xiangwen Wang¹, Li Song², Yanan Zhao², Min Chen¹. ¹Shanghai University of Electric Power, ²Shanghai Jiao Tong University
  2. Outline: Introduction; the proposed parallel VBSME of HEVC for GPU+CPU; preliminary results and future work
  3. Introduction. HEVC is the newest video coding standard, introduced jointly by ITU-T VCEG and ISO/IEC MPEG. Compared with H.264/AVC, HEVC decreases the bitrate by about 50% on average while maintaining the same visual quality. [Figure: BQMall_832x480. Left: HEVC at 1.5 Mbps; right: x264 at 3.0 Mbps]
  4. Introduction. However, HEVC encoding is several times more complex than H.264:
     • RDO: iterates over all mode and partition combinations to decide the best coding information
     • RDOQ: iterates over many QP candidates for each block
     • Intra: luma prediction modes increased to 35
     • SAO: works pixel by pixel
     • Quadtree structure: bigger block sizes and numerous partition manners
     • Some other highly computational modules ...
     As a result, the traditional approach of performing the encoding sequentially can no longer meet real-time demands, especially for HD (1920x1080) and UHD (3840x2160) videos. Parallelism in the encoding procedure must be exploited extensively.
  5. Overview of VBSME in HEVC
     • Three independent block concepts: CU (Coding Unit), PU (Prediction Unit), TU (Transform Unit)
     • The total number of allowed PU sizes is 12 (from 64x64 down to 4x8/8x4) → up to 425 ME runs for one 64x64 CTU (5 + 4x5 + 16x5 + 64x5 = 425)
     [Figure: CU size and PU partition structure; depth 1: 64x64, depth 2: 32x32, depth 3: 16x16, depth 4: 8x8]
  6. Mode Decision with VBSME in HM. Two stages:
     • ME to select the best MV for candidate PUs
     • CU depth and PU partition mode decision
     MV selection criterion for each PU: Jpred,SAD = SA(T)D + λpred · Rpred
     CU size and PU partition mode decision: Jmode = SSD + λmode · Rmode
     To calculate Jmode for each PU, reconstruction and entropy coding of all syntax elements are necessary; this complexity is beyond the computational capability of common computers for real applications.
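The Jpred criterion above can be sketched as a simple best-candidate search. This is a minimal illustration, not HM's implementation: the MV-rate model below (absolute MV difference from the predictor) is an assumption standing in for the real bit-count table, and all names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } MV;

/* ASSUMED rate model: MV cost grows with the difference from the
 * predictor; real encoders use a table of actual MV bit counts. */
static int mv_rate(MV mv, MV pmv) {
    return abs(mv.x - pmv.x) + abs(mv.y - pmv.y);
}

/* Jpred = SAD + lambda_pred * R_pred: return the index of the
 * candidate MV with the lowest rate-distortion cost. */
int best_mv_index(const int *sad, const MV *cand, int n, MV pmv, int lambda) {
    int best = 0;
    long best_j = (long)sad[0] + (long)lambda * mv_rate(cand[0], pmv);
    for (int i = 1; i < n; ++i) {
        long j = (long)sad[i] + (long)lambda * mv_rate(cand[i], pmv);
        if (j < best_j) { best_j = j; best = i; }
    }
    return best;
}
```

Note how λ trades distortion against MV rate: a larger λ pushes the decision toward MVs close to the predictor even when their SAD is higher.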
  7. The proposed parallel encoding framework
     [Flow chart: the CPU reads the input image fEnc and copies it to GPU memory, then launches ME for one LCU (CTU) line at a time. The GPU interpolates and border-pads into half/quarter-pixel image buffers and runs the ME kernel, producing 64x64~8x8 PU MVs. The CPU synchronizes to the last LCU-line ME, then performs mode decision, MC, encode & reconstruct, and entropy coding; the reconstructed frame fRec feeds the next iteration of the frame loop.]
  8. Fast PU partition mode decision scheme
     [Flow chart: SKIP check → CBF_fast check → fast CU partition using the 64x64~8x8 PU MVs; on PART_2Nx2N, proceed to CU partition or the next CU; the RD cost is calculated only when CU depth == 4.]
     The MV and residual information are employed for the PU partition decision. Two edge feature parameters:
     V = |S00 + S01 - S10 - S11| / (8 · N · QP_step)
     H = |S00 + S10 - S01 - S11| / (8 · N · QP_step)
     If (H == V && H != 0) PART_2Nx2N
     else if (H == V && H == 0) PART_NxN
     else if (H > V) PART_Nx2N
     else PART_2NxN
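The edge-feature rule can be written out directly as it appears on the slide. This is a sketch under assumptions: S00..S11 are taken as the residual sums of the four sub-blocks, and the 8·N·QP_step scaling is applied as integer division, which is an assumption about the intended rounding:

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { PART_2Nx2N, PART_2NxN, PART_Nx2N, PART_NxN } PartMode;

/* Edge features from the four sub-block residual sums S00..S11:
 *   V = |S00 + S01 - S10 - S11| / (8 * N * QP_step)
 *   H = |S00 + S10 - S01 - S11| / (8 * N * QP_step)
 * The decision rule follows the slide verbatim. */
PartMode decide_pu_partition(int s00, int s01, int s10, int s11,
                             int n, int qp_step) {
    int denom = 8 * n * qp_step;               /* ASSUMED integer division */
    int v = abs(s00 + s01 - s10 - s11) / denom;
    int h = abs(s00 + s10 - s01 - s11) / denom;
    if (h == v && h != 0) return PART_2Nx2N;
    else if (h == v && h == 0) return PART_NxN;
    else if (h > v) return PART_Nx2N;
    else return PART_2NxN;
}
```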
  9. Parallel realization of VBSME on CUDA. Kernel stages, applied to each group of four 16x16 lines:
     • 8x8 block size SAD calculation
     • 16x16 block size Jpred calculation and integer-pixel Jpred comparison
     • 16x16 fractional-pixel MV refinement
     • Variable block size Jpred generation, calculation, and integer-pixel Jpred comparison (variable block size ME)
     The MV selection criterion is as follows:
     Jpred = SAD + λpred · DMV = SAD + λpred · (MV_C - PMV)
     where MV_C is the MV of the current search point and PMV is the MV prediction (defined on the next slide).
  10. PMV for MV cost calculation
     [Figure: five neighbouring MVs MV0..MV4] PMV = median(MV0, MV1, MV2, MV3, MV4)
     • One CTU (64x64) line is divided into four 16x16 block lines;
     • the ME process of each 16x16 line is done by the GPU sequentially;
     • the MVs of the 16x16 block size are used as the MV predictions for all other block sizes.
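The median of the five neighbouring MVs can be computed with a small sort. A sketch, assuming the median is taken component-wise (x and y independently), which the slide does not state explicitly; helper names are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int x, y; } MV;

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* Median of 5 values = middle element after sorting. */
static int median5(const int v[5]) {
    int tmp[5];
    memcpy(tmp, v, sizeof tmp);
    qsort(tmp, 5, sizeof(int), cmp_int);
    return tmp[2];
}

/* PMV = component-wise median of the five neighbouring MVs (ASSUMED). */
MV pmv_median(const MV mv[5]) {
    int xs[5], ys[5];
    for (int i = 0; i < 5; ++i) { xs[i] = mv[i].x; ys[i] = mv[i].y; }
    MV p = { median5(xs), median5(ys) };
    return p;
}
```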
  11. Variable block size SAD generation
     [Figure: variable block size SAD generation on CUDA. The SADs of the 64 8x8 blocks of a CTU (indices 0..63) are combined stage by stage into 16x16, 32x32, and 64x64 SADs.]
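The aggregation on the slide can be sketched as follows: for one candidate MV, the SADs of the 64 8x8 blocks are summed in 2x2 quads to obtain the 16x16 SADs, then again for 32x32 and 64x64. This is a CPU sketch of the idea with hypothetical names; on CUDA each level would be a parallel reduction across threads:

```c
#include <assert.h>

/* Combine 4 child SADs (a 2x2 quad) into one parent SAD.
 * sad_in:  n x n grid of child-block SADs (row-major)
 * sad_out: (n/2) x (n/2) grid of parent-block SADs */
void combine_sads(const int *sad_in, int n, int *sad_out) {
    for (int y = 0; y < n / 2; ++y)
        for (int x = 0; x < n / 2; ++x)
            sad_out[y * (n / 2) + x] =
                sad_in[(2 * y) * n + 2 * x]     + sad_in[(2 * y) * n + 2 * x + 1] +
                sad_in[(2 * y + 1) * n + 2 * x] + sad_in[(2 * y + 1) * n + 2 * x + 1];
}

/* 8x8 grid of 8x8-block SADs -> 4x4 grid of 16x16 SADs
 * -> 2x2 grid of 32x32 SADs -> the single 64x64 SAD. */
int ctu_sad_pyramid(const int sad8x8[64], int sad16[16], int sad32[4]) {
    int sad64;
    combine_sads(sad8x8, 8, sad16);
    combine_sads(sad16, 4, sad32);
    combine_sads(sad32, 2, &sad64);
    return sad64;
}
```

This reuse is what makes variable block size ME cheap on the GPU: the 8x8 SADs are computed once and every larger PU size is derived by addition.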
  12. Experimental Results. Platform: Z620 = NVIDIA Tesla C2050 + i7 @ 2.6 GHz, with Windows 7. The CUDA driver version is 5.0 and the CUDA Capability version is 2.0. The search range is 64x64 with the full search strategy for the integer MV (IMV), plus 24 fractional-pixel positions around the IMV.

     Sequence                  CPU (fps)   GPU (fps)   Speedup ratio
     Traffic_2560x1600_crop    0.21        23.77       113.2
     ParkScene_1920x1080_24    0.69        77.76       112.7

     The speed-up ratio is about 113x.
  13. Experimental Results: RD comparison
     Note 1: the proposed algorithm is implemented on the x265 encoder ("x265 project"), an early open-source encoder implementation of HEVC.
     Note 2: Cactus_Proposed denotes the RD curve generated by the x265 encoder with the proposed algorithm.
  14. Conclusion
     • We present a parallel-friendly VBSME (variable block size motion estimation) scheme which makes full use of the available computational resources of both the CPU and the GPU.
     • Preliminary results show a speedup ratio of over 100x compared to a single-threaded, CPU-only solution.
     • We will continue to exploit parallelism, targeting a 4K@30fps real-time HEVC encoder on a multicore CPU + GPGPU platform.
  15. Thanks!