Efficient Realization of
Parallel HEVC Intra Coding
Yanan Zhao1, Li Song1, Xiangwen Wang2, Min Chen2, Jia Wang1
1Institute of Image Communication and Network Engineering,
Shanghai Jiao Tong University, China
2Shanghai University of Electric Power, China
Introduction
HEVC is the newest video coding standard, introduced by ITU-T VCEG and
ISO/IEC MPEG.
Compared with H.264/AVC, HEVC decreases the bitrate by 50%
on average while maintaining the same visual quality [1].
Fig.1. BQMall_832x480. Left: HEVC 1.5Mbps, right: x264 3.0Mbps
However, HEVC encoding is several times more complex than H.264.
• RDO: iterates over all mode and partition combinations to decide the best coding
information
• RDOQ: iterates over many QP candidates for each block
• Intra: prediction modes increased to 35 for luma
• SAO: works pixel by pixel
• Quadtree structure: bigger block sizes and numerous partition manners
• Some other highly computational modules ...
As a result, the traditional method, which performs the encoding
sequentially, can no longer meet real-time demands, especially
for HD (1920x1080) and UHD (3840x2160) videos.
Parallelism in the encoding procedure must be extensively utilized.
1. HEVC Intra Coding
• Quadtree partition structure
• Flexible partition manners fit picture
characteristics better
• Three independent block concepts
• CU - Coding Unit
• PU - Prediction Unit
• TU - Transform Unit
• Intra Prediction Modes
• 35 modes for luma prediction
• One special mode can also be used for
chroma prediction
Fig.2. Prediction modes in HEVC
intra coding
In HEVC intra coding, all the prediction modes use the same basic set of
reference samples, made up of the pixels in the left and bottom-left columns,
the top and top-right rows, and the top-left corner point.
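The reference sample set described above can be sketched as follows. This is a minimal illustration, assuming the usual HEVC layout of 2N samples to the left and bottom-left, 2N samples above and above-right, and one corner sample, for 4N+1 in total. The function name `reference_sample_coords` and the pixel coordinate convention are hypothetical, not taken from the slides.

```python
def reference_sample_coords(x0, y0, n):
    """Return the (x, y) positions of the reference samples for an n x n block
    whose top-left pixel is at (x0, y0). Assumed layout: 2n left/bottom-left
    samples, 1 top-left corner sample, 2n top/top-right samples."""
    left = [(x0 - 1, y0 + i) for i in range(2 * n)]   # left and bottom-left column
    corner = [(x0 - 1, y0 - 1)]                       # top-left point
    top = [(x0 + i, y0 - 1) for i in range(2 * n)]    # top and top-right row
    return left + corner + top


if __name__ == "__main__":
    coords = reference_sample_coords(8, 8, 4)
    print(len(coords))  # number of reference samples for a 4x4 block
```

Every prediction mode reads from this one set, which is why a block cannot be predicted until its left and upper neighbours have been reconstructed.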
Intra coding is performed in units of CTUs (the largest CUs) in raster scan
order. Within each CTU, CUs are processed in quadtree traversal order. As
a consequence, two levels of CU dependency exist.
2. HEVC Intra Dependency Analysis
2.1 CTU-level Dependency
At the CTU level, each CTU must wait until its left and top-right neighbor
CTUs finish reconstruction. This makes the current CTU row always lag
two CTUs behind its adjacent upper row.
Maximum parallelism is achieved if each CTU starts encoding as soon as the
two CTUs it depends on finish, so that processing proceeds like a wavefront (Fig.3).
Fig.3. Maximum parallelism at the CTU level: threads thread0 to thread7 each
process one of 8 CTU rows of 16 CTUs (each block stands for one CTU)
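The wavefront constraint can be checked with a small scheduling sketch. It assumes that every CTU takes one time slot, that there are enough threads, and that the rightmost CTU of a row (which has no top-right neighbour) waits for the CTU directly above it. The function name `wavefront_start_times` is illustrative only.

```python
def wavefront_start_times(rows, cols):
    """Earliest start slot of each CTU when a CTU may start only after its
    left and top-right neighbours have finished (one slot per CTU)."""
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1] + 1)  # left neighbour finished
            if r > 0:
                # top-right neighbour, or top neighbour at the right picture edge
                deps.append(start[r - 1][min(c + 1, cols - 1)] + 1)
            start[r][c] = max(deps, default=0)
    return start


if __name__ == "__main__":
    grid = wavefront_start_times(8, 16)
    for row in grid:
        print(" ".join(f"{t:2d}" for t in row))
```

Under these assumptions the CTU in row r and column c starts at slot c + 2r, which is the two-CTU lag between adjacent rows, and an 8x16 picture needs 30 slots instead of 128.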
2.2 CU-level Dependency
Intra dependency also exists within one
CTU. Assume the CTU size is 32x32, so that it
can be further split into 16x16 and 8x8 CUs,
and 4x4 PUs, in a quadtree structure.
HEVC decides the best partition and best
mode in a brute-force way: it compares
the best cost of the current CU with the summed
cost of its four sub-CUs. The flow chart
is given in Fig.4.
For each CTU, the encoder has to
perform the following number of
calculations:
• IntraPred32x32 35 times
• IntraPred16x16 4x35 times
• IntraPred8x8 16x35 times
• IntraPred4x4 64x35 times
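The per-CTU workload listed above can be tallied with a few lines of arithmetic. This sketch only restates the counts from the list, assuming a 32x32 CTU, a 4x4 minimum block and 35 modes per block. The function name `intra_pred_calls` is hypothetical.

```python
def intra_pred_calls(ctu_size=32, min_size=4, modes=35):
    """Number of intra prediction evaluations per block size for one CTU
    under an exhaustive search (every mode tried on every block)."""
    calls = {}
    size = ctu_size
    while size >= min_size:
        blocks = (ctu_size // size) ** 2  # blocks of this size inside the CTU
        calls[size] = blocks * modes
        size //= 2
    return calls


if __name__ == "__main__":
    calls = intra_pred_calls()
    print(calls, sum(calls.values()))
```

The four levels add up to 85 x 35 = 2975 predictions per CTU.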
Fig.4. Deciding the best encoding cost in HEVC intra coding
(flow chart: start with nDepth=0, nSize=32, nMaxDepth=3; run
CheckRDCostIntra(nDepth,nSize) on the current CU; while nDepth<nMaxDepth,
increment nDepth, halve nSize and run CheckRDCostIntra(nDepth,nSize) on the
four sub-CUs; if SumChildCost<ParentCost, set SplitFlagCurrParCU=true,
BestModes=BestChildModes and BestCost=SumChildCost, otherwise set
SplitFlagCurrParCU=false, BestModes=BestParMode and BestCost=ParentCost;
decrement nDepth and repeat the comparison until nDepth==1)
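The brute-force decision of Fig.4 can be sketched recursively. This is an illustration only: `cost_fn` is a hypothetical stand-in for CheckRDCostIntra (which would itself search the prediction modes), and `toy_cost` is an invented cost used just to exercise the split decision.

```python
def best_partition(x, y, size, min_size, cost_fn):
    """Return (best_cost, tree) for the block at (x, y): keep the block whole
    unless the summed best cost of its four sub-blocks is strictly lower."""
    parent_cost = cost_fn(x, y, size)
    if size == min_size:
        return parent_cost, ("leaf", x, y, size)
    half = size // 2
    children = [best_partition(x + dx, y + dy, half, min_size, cost_fn)
                for dy in (0, half) for dx in (0, half)]
    child_cost = sum(cost for cost, _ in children)
    if child_cost < parent_cost:          # SumChildCost < ParentCost
        return child_cost, ("split", [tree for _, tree in children])
    return parent_cost, ("leaf", x, y, size)


def toy_cost(x, y, size):
    """Invented cost: blocks larger than 4x4 that cover pixel (0, 0) are expensive."""
    penalty = 100 * size if (x == 0 and y == 0 and size > 4) else 0
    return size * size + penalty


if __name__ == "__main__":
    cost, tree = best_partition(0, 0, 32, 4, toy_cost)
    print(cost, tree[0])
```

Each parent comparison needs the final costs of all four children, which is the source of the sequential behaviour discussed next.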
Because of its sequential nature, this process is very time-consuming and
limits the intra coding speed to a large extent.
Now we analyse the CU dependencies in terms of time iterations. We denote by
CU_N(x,y) a CU of size NxN at relative coordinate (x,y) from the top-left
pixel of the CTU. We also take the processing time of one CU_4x4 as one iteration. The
processing times of each CU_8x8, CU_16x16 and CU_32x32 are then 4, 16 and 64
iterations respectively.
The encoder first uses 64 iterations for CU_32(0,0), then 16 more iterations for
CU_16(0,0), then 4 for CU_8(0,0), so CU_4(0,0) starts at the 64+16+4+1=85th
iteration. The calculations for the other CUs are similar. The results for each
4x4 CU are shown in Fig.5.(a).
(a) sequential fashion
 85  86  93  94 133 134 141 142
 87  88  95  96 135 136 143 144
101 102 109 110 149 150 157 158
103 104 111 112 151 152 159 160
181 182 189 190 229 230 237 238
183 184 191 192 231 232 239 240
197 198 205 206 245 246 253 254
199 200 207 208 247 248 255 256
(b) with maximum parallelism
  1   2   5   6   9  10  15  16
  3   4   7   8  13  14  17  18
  5   8  11  12  15  18  21  22
  9  10  13  14  19  20  23  24
 11  14  17  20  25  26  30  31
 15  16  21  22  28  29  32  33
 17  23  26  27  30  33  34  35
 24  25  28  29  34  35  36  37
Fig.5. Starting times of each CU_4x4
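The sequential starting times of Fig.5.(a) can be reproduced with a short traversal. It assumes what the text states: a CU of size NxN costs (N/4)^2 iterations, each CU is fully processed before its four sub-blocks, and sub-blocks are visited in quadtree (z-scan) order. The function name `sequential_starts` is illustrative.

```python
def sequential_starts(ctu_size=32, min_size=4):
    """Map (size, x, y) to the iteration at which that block starts when the
    quadtree is processed strictly sequentially. Iterations count from 1."""
    starts = {}
    clock = 1

    def visit(size, x, y):
        nonlocal clock
        starts[(size, x, y)] = clock
        clock += (size // min_size) ** 2      # 64, 16, 4 or 1 iterations
        if size > min_size:
            half = size // 2
            for dy in (0, half):              # z-scan order of the four sub-blocks
                for dx in (0, half):
                    visit(half, x + dx, y + dy)

    visit(ctu_size, 0, 0)
    return starts, clock - 1                  # start times, total iterations


if __name__ == "__main__":
    starts, total = sequential_starts()
    print(starts[(4, 0, 0)], starts[(4, 8, 0)], total)
```

With these assumptions CU_4(0,0) starts at iteration 85 and the last 4x4 block starts at iteration 256.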
Notice that CU_4(0,0), CU_8(0,0) and CU_16(0,0) each need only a subset of the
reference pixels of CU_32(0,0), so once CU_32(0,0) is ready for processing, so are the
other three CUs.
Further, CU_4(8,0) and CU_4(0,8) are also ready for prediction once the partition
and best modes of CU_8(0,0) are decided.
We use a DAG (Directed Acyclic Graph) to visualize the CU dependencies [2], as
shown in Fig.6. The vertical axis is the iteration. CUs with the same vertical coordinate can
be started simultaneously.
If CUs are always started as soon as they are ready (i.e. parallel processing is enabled),
maximum parallelism within one CTU can be achieved. The starting moments under
this mechanism are shown in Fig.5.(b). It should be noted that although the last CU_4x4
finishes at iteration 37, the whole CTU ends at iteration 64, which is the finishing time of
CU_32(0,0). Clearly, the processing time of all the other CUs is hidden in the
processing time of the largest CU.
If parallelism at this level is fully utilized, the theoretical speedup gain for one CTU can
be as high as
(257-64)/64x100% = 301.56%
Fig.6. Part of the DAG for one CTU processing
(nodes: CU_32X32, CU_16x16(0,0), the CU_8x8 and CU_4x4 blocks, and the
"select" steps that join each group of four sub-blocks; vertical axis:
iteration, with marks at 1, 2, 3, 4, 8, 12 and 16)
3. Proposed Parallelization Scheme
In this section, we propose a two-stage parallelization scheme
that exploits CTU-level parallelism.
The design of the scheme takes two major aspects into consideration:
• Maximizing encoding speedup
• Minimizing compression performance loss
The overall structure strikes a good balance between design effort,
degree of parallelism and RD performance.
• The first stage performs parallel processing by launching a number of threads, with
each thread processing one CTU row under the CTU-level constraint. The resulting
prediction and partition information is then stored.
• In the second stage, a single thread is used to encode all the CTUs within the
picture in raster scan order.
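The two stages can be sketched with Python threads. This is a structural illustration under stated assumptions, not the x265-based implementation: rows are assigned to stage-1 threads round-robin (the slides do not say how rows are assigned), the actual mode decision and entropy coding are left as placeholders, and the name `run_two_stage` is hypothetical.

```python
import threading


def run_two_stage(rows, cols, num_threads):
    """Stage 1: worker threads analyse CTU rows under the wavefront constraint.
    Stage 2: one thread consumes finished CTUs in raster scan order."""
    done = [[False] * cols for _ in range(rows)]   # stage-1 completion flags
    cond = threading.Condition()
    stage1_order, stage2_order = [], []

    def deps_ready(r, c):
        # The left neighbour is finished implicitly: one thread walks a row in order.
        return r == 0 or done[r - 1][min(c + 1, cols - 1)]

    def stage1_worker(tid):
        for r in range(tid, rows, num_threads):    # assumed round-robin row assignment
            for c in range(cols):
                with cond:
                    cond.wait_for(lambda: deps_ready(r, c))
                # Placeholder: mode and partition decision for CTU (r, c).
                with cond:
                    done[r][c] = True
                    stage1_order.append((r, c))
                    cond.notify_all()

    def stage2_worker():
        for r in range(rows):
            for c in range(cols):
                with cond:
                    cond.wait_for(lambda: done[r][c])
                # Placeholder: entropy coding of CTU (r, c) from the stored decisions.
                stage2_order.append((r, c))

    threads = [threading.Thread(target=stage1_worker, args=(t,))
               for t in range(num_threads)]
    threads.append(threading.Thread(target=stage2_worker))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stage1_order, stage2_order


if __name__ == "__main__":
    s1, s2 = run_two_stage(8, 16, 4)
    print(len(s1), s2[:3])
```

Because the second stage only waits on completion flags, it can run concurrently with the first stage and still visits the CTUs in raster order.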
Fig.6. Different moments of the proposed parallelism scheme:
(a) thread3 starts processing, (b) thread1 starts processing the first CTU in the 6th row
(The gray areas are CTUs that have finished processing, while the blue areas are those that have finished entropy coding)
This scheme achieves three benefits in terms of encoding speed and
compression efficiency:
• Maximizing acceleration ratio - the most computation-intensive part is processed in parallel
• Minimizing performance loss - entropy coding is continuous over the whole picture
• Further speedup gain - stages 1 and 2 run simultaneously, so the encoding time is hidden in
the processing time
4. Simulation results
We implemented our algorithm on x265, an open source HEVC encoder [3].
The encoder is configured as follows: all intra, CTU size 32, split depth 3,
QP 22, 27, 32, 37, no fast split decision or fast intra mode selection
algorithms. The benchmark is the default x265 encoder running with a single
thread. All tests run on an HP workstation with 8 cores @ 2.6GHz.
The first group of experiments studies the relationship between speedup
gains and the number of threads (Fig.7).
The second group tests the proposed algorithm on different video
sequences. The results are shown in Table.1.
Fig.7. Speedup gains with thread numbers
Test sequence: BasketballDrill_832x480_50
Speedup gain:
$$\text{Speedup gain} = \frac{T_{\text{anchor}} - T_{\text{proposed}}}{T_{\text{proposed}}} \times 100\%$$
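As a quick check, the speedup gain can be computed as (T_anchor - T_proposed) / T_proposed x 100%, which agrees with the first row of Table.1 (384.212 s against 61.729 s). The function name `speedup_gain` is illustrative.

```python
def speedup_gain(t_anchor, t_proposed):
    """Speedup gain in percent: time saved relative to the proposed encoder's time."""
    return (t_anchor - t_proposed) / t_proposed * 100.0


if __name__ == "__main__":
    # PeopleOnStreet_2560x1600_30 at QP 22, times in seconds as listed in Table.1
    print(round(speedup_gain(384.212, 61.729)))
```

A gain of 522% by this definition means the proposed encoder is about 6.2 times as fast as the single-threaded anchor.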
Sequence                     QP  Anchor (s)  Proposed (s)  Gain
PeopleOnStreet_2560x1600_30  22  384.212     61.729        522%
                             27  375.194     58.203        545%
                             32  367.106     57.268        541%
                             37  366.805     57.205        541%
Traffic_2560x1600_30         22  386.036     58.391        586%
                             27  374.760     57.704        549%
                             32  370.128     57.642        542%
                             37  372.294     57.143        512%
BQTerrace_1920x1080_60       22  822.581     156.312       426%
                             27  806.491     132.708       508%
                             32  782.312     131.881       493%
                             37  757.416     131.694       475%
Cactus_1920x1080_50          22  646.017     127.295       407%
                             27  644.591     109.981       486%
                             32  638.755     110.055       480%
                             37  632.842     110.055       475%
ParkScene_1920x1080_24       22  327.787     55.972        486%
                             27  310.908     52.728        490%
                             32  310.066     52.759        488%
                             37  304.854     52.759        478%
Average                                                    502%
Table.1. Speedup gains on different video sequences
Newest Work - Speedup Inter Coding
Sequence (3840x2160) PSNR(dB) Bitrate(Mbps) Frame Rate (fps)
Cactus 43.266 3.994 30.16
Foreman 42.345 4.878 29.73
Coastguard 36.985 17.025 14.47
News 44.203 2.504 33.66
Suzie 41.640 5.157 28.32
Mobile 39.599 5.999 26.28
Library 38.782 3.405 31.34
BundNightscape 38.548 6.190 26.85
AncientTown 39.355 5.830 27.91
Horses 38.219 8.310 21.07
TrafficAndBuilding 37.567 5.683 25.01
Marathon 34.576 24.672 10.21
Average 39.590 7.111 25.42
Table.2. Performance on UHD sequences with speedup of intra and inter coding
Table.2 shows the encoding performance when both intra and inter coding are accelerated. Configuration: QP 32, one I
frame followed by 49 P frames. 16 threads are used. UHD sequences are from SJTU [4] and Elemental
Technologies [5].
References
[1] G. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high
efficiency video coding (HEVC) standard", IEEE Trans. on CSVT, 2012
[2] N. Cheung, O. Au, M. Kung, "Highly parallel rate-distortion optimized intra-mode
decision on multicore graphics processors", IEEE Trans. on CSVT, 2009
[3] x265 project, http://code.google.com/p/x265/
[4] http://medialab.sjtu.edu.cn//web4k/index.html
[5] http://www.elementaltechnologies.com/resources/4k-test-sequences
Questions?

More Related Content

What's hot

Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryKenta Oono
 
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCJeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCMLconf
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAprithan
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition동호 이
 
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU ArchitectureAn OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU ArchitectureWaqas Tariq
 
GPU-Quicksort
GPU-QuicksortGPU-Quicksort
GPU-Quicksortdaced
 
Introduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious AlgorithmsIntroduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious AlgorithmsChristopher Gilbert
 
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYCBryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYCMLconf
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows seteSAT Publishing House
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...eSAT Journals
 
IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...
IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...
IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...Vignesh V Menon
 
FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptgrssieee
 
IEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVC
IEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVCIEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVC
IEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVCVignesh V Menon
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 

What's hot (20)

Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
Deep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistryDeep learning for molecules, introduction to chainer chemistry
Deep learning for molecules, introduction to chainer chemistry
 
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYCJeff Johnson, Research Engineer, Facebook at MLconf NYC
Jeff Johnson, Research Engineer, Facebook at MLconf NYC
 
MaPU-HPCA2016
MaPU-HPCA2016MaPU-HPCA2016
MaPU-HPCA2016
 
Slide tesi
Slide tesiSlide tesi
Slide tesi
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Parallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDAParallel Implementation of K Means Clustering on CUDA
Parallel Implementation of K Means Clustering on CUDA
 
Aes
AesAes
Aes
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition
 
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU ArchitectureAn OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture
 
GPU-Quicksort
GPU-QuicksortGPU-Quicksort
GPU-Quicksort
 
Introduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious AlgorithmsIntroduction to Cache-Oblivious Algorithms
Introduction to Cache-Oblivious Algorithms
 
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYCBryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC
Bryan Thompson, Chief Scientist and Founder at SYSTAP, LLC at MLconf NYC
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Performance comparison of row per slave and rows set
Performance comparison of row per slave and rows setPerformance comparison of row per slave and rows set
Performance comparison of row per slave and rows set
 
Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...Performance comparison of row per slave and rows set per slave method in pvm ...
Performance comparison of row per slave and rows set per slave method in pvm ...
 
IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...
IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...
IEEE ICIP'22:Efficient Content-Adaptive Feature-based Shot Detection for HTTP...
 
FAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.pptFAST MAP PROJECTION ON CUDA.ppt
FAST MAP PROJECTION ON CUDA.ppt
 
IEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVC
IEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVCIEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVC
IEEE MMSP'21: INCEPT: Intra CU Depth Prediction for HEVC
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 

Similar to Efficient Realization of Parallel HEVC Intra Coding

INCEPT: Intra CU Depth Prediction for HEVC
INCEPT: Intra CU Depth Prediction for HEVCINCEPT: Intra CU Depth Prediction for HEVC
INCEPT: Intra CU Depth Prediction for HEVCAlpen-Adria-Universität
 
Project report - Hypergraph states
Project report - Hypergraph statesProject report - Hypergraph states
Project report - Hypergraph statesHenry Costello
 
Accelerating Dynamic Time Warping Subsequence Search with GPU
Accelerating Dynamic Time Warping Subsequence Search with GPUAccelerating Dynamic Time Warping Subsequence Search with GPU
Accelerating Dynamic Time Warping Subsequence Search with GPUDavide Nardone
 
Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...
Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...
Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...sipij
 
Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...
Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...
Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...IJECEIAES
 
New generation video coding OVERVIEW.pptx
New generation video coding OVERVIEW.pptxNew generation video coding OVERVIEW.pptx
New generation video coding OVERVIEW.pptxYaseenMo
 
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...ijma
 
Algorithm and architecture design of the h.265 hevc intra encoder
Algorithm and architecture design of the h.265 hevc intra encoderAlgorithm and architecture design of the h.265 hevc intra encoder
Algorithm and architecture design of the h.265 hevc intra encoderjpstudcorner
 
Efficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT ArchitectureEfficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT ArchitectureIJMER
 
H.265ImprovedCE_over_H.264-HarmonicMay2014Final
H.265ImprovedCE_over_H.264-HarmonicMay2014FinalH.265ImprovedCE_over_H.264-HarmonicMay2014Final
H.265ImprovedCE_over_H.264-HarmonicMay2014FinalDonald Pian
 
Advanced Comuter Architecture Ch6 Problem Solutions
Advanced Comuter Architecture Ch6 Problem SolutionsAdvanced Comuter Architecture Ch6 Problem Solutions
Advanced Comuter Architecture Ch6 Problem SolutionsJoe Christensen
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520IJRAT
 
Conception of a new Syndrome Block for BCH codes with hardware Implementation...
Conception of a new Syndrome Block for BCH codes with hardware Implementation...Conception of a new Syndrome Block for BCH codes with hardware Implementation...
Conception of a new Syndrome Block for BCH codes with hardware Implementation...IJERA Editor
 
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...ijma
 
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...ijma
 

Similar to Efficient Realization of Parallel HEVC Intra Coding (20)

Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Pla...
Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Pla...Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Pla...
Paralleling Variable Block Size Motion Estimation of HEVC On CPU plus GPU Pla...
 
INCEPT: Intra CU Depth Prediction for HEVC
INCEPT: Intra CU Depth Prediction for HEVCINCEPT: Intra CU Depth Prediction for HEVC
INCEPT: Intra CU Depth Prediction for HEVC
 
Project report - Hypergraph states
Project report - Hypergraph statesProject report - Hypergraph states
Project report - Hypergraph states
 
Architecture Assignment Help
Architecture Assignment HelpArchitecture Assignment Help
Architecture Assignment Help
 
Accelerating Dynamic Time Warping Subsequence Search with GPU
Accelerating Dynamic Time Warping Subsequence Search with GPUAccelerating Dynamic Time Warping Subsequence Search with GPU
Accelerating Dynamic Time Warping Subsequence Search with GPU
 
Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...
Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...
Efficient pu mode decision and motion estimation for h.264 avc to hevc transc...
 
Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...
Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...
Evaluation and Analysis of Rate Control Methods for H.264/AVC and MPEG-4 Vide...
 
New generation video coding OVERVIEW.pptx
New generation video coding OVERVIEW.pptxNew generation video coding OVERVIEW.pptx
New generation video coding OVERVIEW.pptx
 
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
 
Algorithm and architecture design of the h.265 hevc intra encoder
Algorithm and architecture design of the h.265 hevc intra encoderAlgorithm and architecture design of the h.265 hevc intra encoder
Algorithm and architecture design of the h.265 hevc intra encoder
 
Real time SHVC decoder
Real time SHVC decoderReal time SHVC decoder
Real time SHVC decoder
 
Efficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT ArchitectureEfficient Implementation of Low Power 2-D DCT Architecture
Efficient Implementation of Low Power 2-D DCT Architecture
 
H.265ImprovedCE_over_H.264-HarmonicMay2014Final
H.265ImprovedCE_over_H.264-HarmonicMay2014FinalH.265ImprovedCE_over_H.264-HarmonicMay2014Final
H.265ImprovedCE_over_H.264-HarmonicMay2014Final
 
Kai hwang solution
Kai hwang solutionKai hwang solution
Kai hwang solution
 
Advanced Comuter Architecture Ch6 Problem Solutions
Advanced Comuter Architecture Ch6 Problem SolutionsAdvanced Comuter Architecture Ch6 Problem Solutions
Advanced Comuter Architecture Ch6 Problem Solutions
 
Paper id 37201520
Paper id 37201520Paper id 37201520
Paper id 37201520
 
K505028085
K505028085K505028085
K505028085
 
Conception of a new Syndrome Block for BCH codes with hardware Implementation...
Conception of a new Syndrome Block for BCH codes with hardware Implementation...Conception of a new Syndrome Block for BCH codes with hardware Implementation...
Conception of a new Syndrome Block for BCH codes with hardware Implementation...
 
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
 
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
IMPROVING PSNR AND PROCESSING SPEED FOR HEVC USING HYBRID PSO FOR INTRA FRAME...
 

More from Shanghai Jiao Tong University(上海交通大学) (6)

ICIP2013-video stabilization with l1 l2 optimization
ICIP2013-video stabilization with l1 l2 optimizationICIP2013-video stabilization with l1 l2 optimization
ICIP2013-video stabilization with l1 l2 optimization
 
THE SJTU 4K VIDEO SEQUENCE DATASET
THE SJTU 4K VIDEO SEQUENCE DATASETTHE SJTU 4K VIDEO SEQUENCE DATASET
THE SJTU 4K VIDEO SEQUENCE DATASET
 
No-reference Video Quality Assessment on Mobile Devices
No-reference Video Quality Assessment on Mobile DevicesNo-reference Video Quality Assessment on Mobile Devices
No-reference Video Quality Assessment on Mobile Devices
 
Foreground Detection : Combining Background Subspace Learning with Object Smo...
Foreground Detection : Combining Background Subspace Learning with Object Smo...Foreground Detection : Combining Background Subspace Learning with Object Smo...
Foreground Detection : Combining Background Subspace Learning with Object Smo...
 
Background Subtraction Based on Phase and Distance Transform Under Sudden Ill...
Background Subtraction Based on Phase and Distance Transform Under Sudden Ill...Background Subtraction Based on Phase and Distance Transform Under Sudden Ill...
Background Subtraction Based on Phase and Distance Transform Under Sudden Ill...
 
Perceptual Video Coding
Perceptual Video Coding Perceptual Video Coding
Perceptual Video Coding
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Efficient Realization of Parallel HEVC Intra Coding

  • 1. Efficient Realization of Parallel HEVC Intra Coding
    Yanan Zhao (1), Li Song (1), Xiangwen Wang (2), Min Chen (2), Jia Wang (1)
    (1) Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, China
    (2) Shanghai University of Electric Power, China
  • 2. Introduction
    HEVC is the newest video coding standard introduced by ITU-T VCEG and ISO/IEC MPEG.
    Compared with H.264/AVC, HEVC decreases the bitrate by 50% on average while maintaining the same visual quality [1].
    Fig.1. BQMall_832x480. Left: HEVC 1.5Mbps, right: x264 3.0Mbps
  • 3. Introduction
    However, the encoding complexity of HEVC is several times that of H.264.
    • RDO: iterates over all mode and partition combinations to decide the best coding information
    • RDOQ: iterates over many QP candidates for each block
    • Intra: prediction modes increased to 35 for luma
    • SAO: works pixel by pixel
    • Quadtree structure: bigger block sizes and numerous partition manners
    • Some other computationally heavy modules ...
    As a result, the traditional method, which performs the encoding sequentially, can no longer meet real-time demands, especially for HD (1920x1080) and UHD (3840x2160) videos.
    Parallelism in the encoding procedure must be extensively utilized.
  • 4. 1. HEVC Intra Coding
    • Quadtree partition structure
      • Flexible partition manners fit picture characteristics better
      • Three independent block concepts
        • CU - Coding Unit
        • PU - Prediction Unit
        • TU - Transform Unit
    • Intra Prediction Modes
      • 35 modes for luma prediction
      • One special mode can also be used for chroma prediction
    Fig.2. Prediction modes in HEVC intra coding
  • 5. 2. HEVC Intra Dependency Analysis
    In HEVC intra coding, all prediction modes use the same basic set of reference samples: the pixels of the bottom-left and left columns, the top and top-right rows, and the top-left corner point.
    Intra coding is performed CTU by CTU (a CTU is the largest CU) in raster scan order. Within each CTU, CUs are processed in quadtree traversal order. As a consequence, two levels of CU dependency exist.
  • 6. 2.1 CTU-level Dependency
    At the CTU level, each CTU must wait until its left and top-right neighbor CTUs finish reconstruction. This makes each CTU row always lag two CTUs behind the row directly above it.
    Maximum parallelism is achieved if each CTU starts encoding as soon as its two dependent CTUs finish, so that the encoding proceeds like a wavefront (Fig.3).
    Fig.3. Maximum parallelism at the CTU level (each block stands for one CTU; CTU rows are assigned to thread0 through thread7)
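The wavefront rule above can be illustrated with a small sketch. It assumes, purely for illustration, that every CTU takes one time step and that one worker is available per CTU row; the function name `wavefront_start_times` is hypothetical and not taken from the slides or from x265.

```python
def wavefront_start_times(rows, cols):
    """Earliest start step (1-based) of each CTU under the CTU-level rule.

    Assumption: every CTU takes exactly one time step.
    A CTU depends on its left neighbor and on the top-right neighbor
    (or the last CTU of the upper row when it sits at the right edge).
    """
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            deps = []
            if c > 0:
                deps.append(start[r][c - 1])          # left neighbor
            if r > 0:
                deps.append(start[r - 1][min(c + 1, cols - 1)])  # top-right
            # A CTU starts one step after its latest dependency started.
            start[r][c] = 1 + max(deps, default=0)
    return start


if __name__ == "__main__":
    for row in wavefront_start_times(4, 8):
        print(row)
```

Under these assumptions each row starts two steps after the row above it, which is the two-CTU lag described in the slide.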
  • 7. 2.2 CU-level Dependency
    Intra dependency also exists within one CTU. Assume the CTU size is 32x32 and that it can be further split into 16x16 and 8x8 CUs and 4x4 PUs in a quadtree structure. HEVC decides the best partition and best mode by brute force: it compares the best cost of the current CU with the summed cost of its four sub-CUs. The flow chart is given in Fig.4.
    For each CTU, the encoder has to perform the following number of calculations:
    • IntraPred32x32: 35 times
    • IntraPred16x16: 4x35 times
    • IntraPred8x8: 16x35 times
    • IntraPred4x4: 64x35 times
    Fig.4. Deciding the best encoding cost in HEVC intra coding (flow chart: start with nDepth=0, nSize=32, nMaxDepth=3; call CheckRDCostIntra(nDepth, nSize); while nDepth<nMaxDepth, increment nDepth, halve nSize and call CheckRDCostIntra for the four children; if SumChildCost<ParentCost, set SplitFlagCurrParCU=true, BestModes=BestChildModes, BestCost=SumChildCost; otherwise set SplitFlagCurrParCU=false, BestModes=BestParMode, BestCost=ParentCost)
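As a quick check of the workload listed above, the following sketch counts the intra prediction evaluations for one CTU. The function name and its default parameters are illustrative assumptions that mirror the 32x32 CTU, 4x4 minimum block and 35 luma modes used in the slide.

```python
def intra_pred_counts(ctu_size=32, min_size=4, modes=35):
    """Number of intra prediction evaluations per block size in one CTU."""
    counts = {}
    size, blocks = ctu_size, 1
    while size >= min_size:
        counts[size] = blocks * modes  # every block tries every mode
        size //= 2                     # next quadtree depth
        blocks *= 4                    # four children per block
    return counts


if __name__ == "__main__":
    counts = intra_pred_counts()
    print(counts)                # per-size counts
    print(sum(counts.values()))  # total evaluations per CTU
```

With the defaults this gives 35, 140, 560 and 2240 evaluations for the 32x32, 16x16, 8x8 and 4x4 levels, or 2975 in total per CTU.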
  • 8. 2.2 CU-level Dependency
    Because it is sequential, this process is very time-consuming and limits the intra coding speed to a large extent.
    We now analyse the CU dependencies in terms of time iterations. We denote by CU_N(x,y) a CU of size NxN at relative coordinate (x,y) from the top-left pixel of the CTU. We also take the processing time of one CU_4x4 as one iteration. The processing times of each CU_8x8, CU_16x16 and CU_32x32 are then 4, 16 and 64 iterations respectively.
    The encoder first uses 64 iterations for CU_32(0,0), then 16 more iterations for CU_16(0,0), then 4 for CU_8(0,0), so CU_4(0,0) starts at the 64+16+4+1=85th iteration. Calculations for the other CUs are similar. The results for all 4x4 CUs are shown in Fig.5.(a).
    Fig.5. Starting times of each CU_4x4: (a) sequential fashion, (b) with maximum parallelism
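The sequential starting times can be reproduced with a short sketch. It assumes that each CU is evaluated before its four sub-CUs and that sub-CUs are visited in z-order (top-left, top-right, bottom-left, bottom-right); the function name `sequential_start_times` is hypothetical.

```python
def sequential_start_times(ctu_size=32, min_size=4):
    """Return {(x, y): start_iteration} for every 4x4 CU (1-based).

    Assumptions: a CU of size N costs (N/4)^2 iterations, a parent CU
    is evaluated before its children, and children follow z-order.
    """
    start = {}
    clock = 0  # iterations consumed so far

    def visit(x, y, size):
        nonlocal clock
        cost = (size // min_size) ** 2  # 1, 4, 16 or 64 iterations
        if size == min_size:
            start[(x, y)] = clock + 1   # this 4x4 CU starts now
        clock += cost
        if size > min_size:
            half = size // 2
            for dy in (0, half):
                for dx in (0, half):
                    visit(x + dx, y + dy, half)

    visit(0, 0, ctu_size)
    return start


if __name__ == "__main__":
    times = sequential_start_times()
    print(times[(0, 0)], times[(4, 0)], times[(8, 0)], max(times.values()))
```

Under these assumptions CU_4(0,0) starts at iteration 85 and the last 4x4 CU starts at iteration 256.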
  • 9. 2.2 CU-level Dependency
    Notice that CU_4(0,0), CU_8(0,0) and CU_16(0,0) each need only a subset of the reference pixels of CU_32(0,0), so if CU_32(0,0) is ready for processing, so are the other three CUs. Further, CU_4(8,0) and CU_4(0,8) are also ready for prediction once the partition and best modes of CU_16(0,0) are decided.
    We use a DAG (Directed Acyclic Graph) to visualize the CU dependencies [2], as shown in Fig.6. The vertical axis is the iteration. CUs with the same vertical coordinate can be started simultaneously.
    If CUs are always started as soon as they are ready (that is, with parallel processing enabled), maximum parallelism within one CTU can be achieved. The starting moments under this mechanism are shown in Fig.5.(b).
    It should be noted that although the last CU_4x4 finishes at iteration 37, the whole CTU ends at iteration 64, which is the finishing time of CU_32(0,0). Clearly, the processing time of all other CUs is hidden in the processing time of the largest CU. If parallelism at this level is fully utilized, the theoretical speedup gain for one CTU can be as high as (257-64)/64x100% = 301.56%
  • 10. 2.2 CU-level Dependency
    Fig.6. Part of the DAG for one CTU processing (nodes: the CU_4x4, CU_8x8, CU_16x16 and CU_32x32 blocks and their "select" decision steps; vertical axis: iteration)
  • 11. 3. Proposed Parallelization Scheme
    In this section, we propose a two-stage parallelization scheme that exploits CTU-level parallelism. The design of the scheme takes two major aspects into consideration:
    • Maximizing encoding speedup
    • Minimizing compression performance loss
    The overall structure strikes a good balance between design effort, degree of parallelism and RD performance.
    • The first stage performs parallel processing by launching a number of threads, with each thread processing one CTU row under the CTU-level constraint. The resulting prediction and partition information is then stored.
    • In the second stage, a single thread is used to encode all the CTUs within the picture in raster scan order.
  • 12. 3. Proposed Parallelization Scheme
    Fig.6. Different moments of the proposed parallelism scheme: (a) thread3 starts processing, (b) thread1 starts processing the first CTU in the 6th row. (Gray areas are CTUs that have finished processing, while blue areas are those that have finished entropy coding.)
    This scheme achieves three benefits in terms of encoding speed and compression efficiency:
    • Maximizing acceleration ratio - the most computation-intensive part is processed in parallel
    • Minimizing performance loss - entropy coding is continuous over the whole picture
    • Further speedup gain - stages 1 and 2 run simultaneously, so the encoding time is hidden in the processing time
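The two-stage idea can be sketched with standard Python threads. This is only a simplified illustration under stated assumptions, not the x265-based implementation from the slides: it launches one thread per CTU row (the slides reuse a fixed pool of threads), `analyse_ctu` is a hypothetical placeholder for the mode and partition decision, and the "entropy coding" stage merely records the order in which CTUs are consumed.

```python
import threading


def run_two_stage(rows, cols):
    """Run stage 1 (parallel CTU rows) and stage 2 (raster-order coding)."""
    done = [[False] * cols for _ in range(rows)]  # stage-1 completion flags
    cond = threading.Condition()
    coded = []                                    # stage-2 output order

    def analyse_ctu(r, c):
        # Hypothetical placeholder for prediction and partition decision.
        pass

    def stage1_row(r):
        for c in range(cols):
            if r > 0:
                # Wait for the top-right CTU (last CTU at the right edge).
                dep = min(c + 1, cols - 1)
                with cond:
                    cond.wait_for(lambda: done[r - 1][dep])
            # The left neighbor is finished because this thread did it.
            analyse_ctu(r, c)
            with cond:
                done[r][c] = True
                cond.notify_all()

    def stage2_entropy():
        # Single thread consuming CTUs in raster scan order.
        for r in range(rows):
            for c in range(cols):
                with cond:
                    cond.wait_for(lambda: done[r][c])
                coded.append((r, c))

    workers = [threading.Thread(target=stage1_row, args=(r,))
               for r in range(rows)]
    workers.append(threading.Thread(target=stage2_entropy))
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return coded


if __name__ == "__main__":
    print(run_two_stage(4, 6)[:8])
```

Because the second stage starts consuming CTUs while the first stage is still running, its time overlaps with the parallel stage, which is the "further speedup gain" mentioned above.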
  • 13. 4. Simulation results
    We implemented our algorithm on x265, an open source HEVC encoder [3]. The encoder is configured as follows: all intra, CTU size 32, split depth 3, QP 22, 27, 32, 37, with no fast split decision or fast intra mode selection algorithms. The benchmark is the default x265 encoder running with a single thread. All tests run on an HP workstation with 8 cores @ 2.6GHz.
    The first group of experiments studies the relationship between speedup gain and the number of threads (Fig.7).
    The second group tests the proposed algorithm on different video sequences. The results are shown in Table.1.
  • 14. 4. Simulation results
    Fig.7. Speedup gains with thread numbers
    Test sequence: BasketballDrill_832x480_50
    Speedup gain: $\frac{T_{anchor} - T_{proposed}}{T_{proposed}} \times 100\%$
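As a small worked example, the speedup gain is taken here as the anchor time minus the proposed time, relative to the proposed time, which is consistent with the first row of Table.1. The function name `speedup_gain` is an illustrative assumption.

```python
def speedup_gain(t_anchor, t_proposed):
    """Speedup gain in percent: (T_anchor - T_proposed) / T_proposed * 100."""
    return (t_anchor - t_proposed) / t_proposed * 100.0


if __name__ == "__main__":
    # PeopleOnStreet_2560x1600_30 at QP 22, times in seconds from Table.1.
    print(round(speedup_gain(384.212, 61.729)))
```

For that row the function returns roughly 522, matching the gain reported in the table.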
  • 15. 4. Simulation results
    Sequence                      QP   Anchor (s)   Proposed (s)   Gain
    PeopleOnStreet_2560x1600_30   22   384.212      61.729         522%
                                  27   375.194      58.203         545%
                                  32   367.106      57.268         541%
                                  37   366.805      57.205         541%
    Traffic_2560x1600_30          22   386.036      58.391         586%
                                  27   374.760      57.704         549%
                                  32   370.128      57.642         542%
                                  37   372.294      57.143         512%
    BQTerrace_1920x1080_60        22   822.581      156.312        426%
                                  27   806.491      132.708        508%
                                  32   782.312      131.881        493%
                                  37   757.416      131.694        475%
    Cactus_1920x1080_50           22   646.017      127.295        407%
                                  27   644.591      109.981        486%
                                  32   638.755      110.055        480%
                                  37   632.842      110.055        475%
    ParkScene_1920x1080_24        22   327.787      55.972         486%
                                  27   310.908      52.728         490%
                                  32   310.066      52.759         488%
                                  37   304.854      52.759         478%
    Average                                                        502%
    Table.1. Speedup gains on different video sequences
  • 16. Newest Work - Speedup Inter Coding
    Sequence (3840x2160)   PSNR (dB)   Bitrate (Mbps)   Frame rate (fps)
    Cactus                 43.266      3.994            30.16
    Foreman                42.345      4.878            29.73
    Coastguard             36.985      17.025           14.47
    News                   44.203      2.504            33.66
    Suzie                  41.640      5.157            28.32
    Mobile                 39.599      5.999            26.28
    Library                38.782      3.405            31.34
    BundNightscape         38.548      6.190            26.85
    AncientTown            39.355      5.830            27.91
    Horses                 38.219      8.310            21.07
    TrafficAndBuilding     37.567      5.683            25.01
    Marathon               34.576      24.672           10.21
    Average                39.590      7.111            25.42
    Table.2. Performance on UHD sequences with sped-up intra and inter coding
    Table.2 shows the encoding performance when both intra and inter coding are sped up. Configuration: QP 32, one I frame followed by 49 P frames, 16 threads. The UHD sequences are from SJTU [4] and Elemental Technologies [5].
  • 17. References
    [1] G. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard", IEEE Trans. on CSVT, 2013
    [2] N. Cheung, O. Au, M. Kung, "Highly parallel rate-distortion optimized intra-mode decision on multicore graphics processors", IEEE Trans. on CSVT, 2009
    [3] x265 project, http://code.google.com/p/x265/
    [4] http://medialab.sjtu.edu.cn//web4k/index.html
    [5] http://www.elementaltechnologies.com/resources/4k-test-sequences