4K Ultra High Definition Video Coding using
Homogeneous Motion Discovery Oriented Prediction
Ashek Ahmmed†
Afrin Rahman†
Mark PickeringΨ
Aous Thabit Naman∗
†
Department of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh.
Ψ
School of Engineering and Information Technology, The University of New South Wales, Canberra, Australia.
∗
School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, Australia.
Abstract
State-of-the-art video compression techniques use the motion model to approximate the geometric boundaries of moving objects, i.e. the locations where motion discontinuities occur. The motion hint based inter-frame prediction paradigm moves away from this redundant approach and employs a framework of motion hint fields that are continuous and invertible, at least over their respective domains. However, motion hint estimation is computationally demanding, in particular for high resolution video sequences. Discovering homogeneous motion models and their associated masks over the current frame, and then using these models and masks to form a prediction of the current frame, provides a computationally simpler approach to video coding than motion hints. In this paper, the potential of this coherent motion model based approach, equipped with larger blocks, is investigated for coding 4K Ultra High Definition (UHD) video sequences. Experimental results show that bit rate savings of up to 4.68% are achievable over standalone HEVC.
Introduction
The block-based translational motion model assigns a single motion vector to all the pixels inside a block, based on the assumption that the constituent pixels are moving in the same direction at a constant speed. This hypothesis of uniform motion within a block does not hold if the block straddles object boundaries, i.e. where motion discontinuities exist. Hence, this model fails to efficiently capture the actual locations of discontinuities in the motion field.
Partitioning motion blocks that contain object boundaries into smaller square or rectangular sub-blocks is a popular approach to improving compression efficiency, since it allows the blocks to better match the objects in the scene.
Motion hints can provide a global description of motion over a specific domain and are related to foreground-background segmentation, where the foreground and background motions serve as the hints. The idea behind motion hints is to avoid using the motion model to describe object boundaries, since the spatial structure of previously decoded reference frames can be exploited to infer appropriate boundaries in the frames to be predicted.
The motion hint based prediction paradigm introduced in [2] is promising. Each reference frame is segmented into super-pixels, and those super-pixels are then grouped iteratively into homogeneous motion groups. However, for high definition, full high definition and 4K ultra high definition video sequences, the number of super-pixels becomes too large for the segmentation algorithm to handle and to produce sufficiently representative foreground-background shapes within a viable number of iterations. This phenomenon is depicted in Figure 1.
Figure 1: Outcome of the motion hint segmentation approach, described in [2], after 4 iterations. The example sequence is the Kimono 1080p sequence. Many background super-pixels are still misclassified as foreground, yielding a poor quality motion hint segmentation.
In this paper we investigate the applicability of a prediction paradigm [1], where a bi-directional affine motion model compensated prediction is used as a reference frame and the prediction generation process does not require any foreground-background segmentation, to coding 4K Ultra High Definition (UHD) video sequences.
Structure of the coding/decoding architecture
In the considered approach, depicted with a simplified block diagram in Figure 2, the affine motion field between the reference frame R_i and the current B-frame C is estimated using a standard gradient-based image registration technique. The mask f_1^(Ri→C) used to estimate the associated 6-parameter affine model is the entire frame C, i.e. f_1^(Ri→C) is a binary image with all values equal to 1. The resultant affine motion model M_1^(Ri→C) is approximated by three corner motion vectors (MVs), specifically the MVs of the top-left, top-right and centre pixels. The fractional parts of these MVs are quantized to an accuracy of 1/16-th of a pixel and coded using the exponential-Golomb coding technique. This quantized motion model M_1^(Ri→C) is employed to warp R_i, generating an affine motion compensated prediction C_1^(Ri→C) of C:

C_1^(Ri→C) = M_1^(Ri→C)(R_i)    (1)
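Three non-collinear anchor pixels pin down the 2×3 affine matrix exactly, so the 3-corner MV parameterisation above determines the 6 affine parameters uniquely. The following is a minimal numpy/scipy sketch of recovering the model from the corner MVs and warping the reference frame; the function names and sampling choices are illustrative, not the paper's actual implementation:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def affine_from_corner_mvs(anchors, mvs):
    """Recover the 2x3 affine matrix from 3 anchor-pixel motion vectors.

    anchors: (3, 2) array of (x, y) anchor positions (e.g. top-left,
             top-right and centre pixels of the frame).
    mvs:     (3, 2) array of motion vectors at those anchors.
    Solves [x y 1] @ P = (x + mv_x, y + mv_y) exactly for the three
    non-collinear anchors.
    """
    src = np.hstack([anchors, np.ones((3, 1))])  # 3x3 homogeneous coords
    dst = anchors + mvs                          # mapped anchor positions
    P = np.linalg.solve(src, dst)                # 3x2 solution
    return P.T                                   # 2x3 affine matrix

def warp_affine(ref, P):
    """Warp a reference frame by affine model P with bilinear sampling."""
    h, w = ref.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3xN
    mapped = P @ coords                          # 2xN mapped (x, y)
    # map_coordinates expects (row, col) = (y, x) ordering
    return map_coordinates(ref, [mapped[1], mapped[0]],
                           order=1, mode='nearest').reshape(h, w)
```

With zero MVs the recovered model is the identity, and a uniform MV of (1, 0) at all three anchors reduces to a pure horizontal translation, which is a quick sanity check on the solve.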
Figure 2: Block diagram of the coding/decoding framework that uses the bi-directional affine motion model compensated
prediction as a reference frame, along with the usual temporal reference(s), for the B-frames.
Next, an error analysis is carried out on the prediction error associated with the prediction C_1^(Ri→C). Prediction error blocks whose sum-squared error (SSE) is greater than the block-wise mean SSE of the error image are identified. An example of these blocks, marked with white boundary pixels, is shown for a 4K UHD sequence in Figure 3. The high-SSE blocks are then used to form another mask, f_2^(Ri→C), which is fed into the affine motion model estimation process involving R_i and C to generate a second affine model, M_2^(Ri→C). The reference frame R_i is warped by this newly estimated and quantized affine motion model to generate another prediction of C, namely C_2^(Ri→C):

C_2^(Ri→C) = M_2^(Ri→C)(R_i)    (2)
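The mask construction described above (blocks whose SSE exceeds the block-wise mean SSE) can be sketched as follows; the block size of 240 × 240 matches Figure 3, but the function itself is an illustrative reconstruction, not the paper's code:

```python
import numpy as np

def high_sse_mask(error, block=240):
    """Binary mask covering blocks whose prediction-error SSE exceeds
    the mean block SSE of the error image (240x240 blocks for 4K)."""
    h, w = error.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    sse = {}
    for y in range(0, h, block):
        for x in range(0, w, block):
            blk = error[y:y + block, x:x + block].astype(np.float64)
            sse[(y, x)] = float(np.sum(blk ** 2))
    mean_sse = np.mean(list(sse.values()))
    for (y, x), s in sse.items():
        if s > mean_sse:                       # above-average error block
            mask[y:y + block, x:x + block] = 1
    return mask
```

The resulting mask plays the role of f_2^(Ri→C): it restricts the second affine model estimation to the regions where the first model failed.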
Figure 3: The affine motion model M_1^(Ri→C) performed poorly in the blocks with white boundary pixels, i.e. these blocks have high prediction error energy. The example scenario is for predicting frame 5 using coded frames 1 and 9 of the Vehicles (3840 × 2160) sequence. The block size used is 240 × 240 pixels.
Now, with the help of the mask f_2^(Ri→C), the predictions C_1^(Ri→C) and C_2^(Ri→C) are fused into a single prediction of C from the reference frame R_i as follows:

C^(Ri→C) = (1 − f_2^(Ri→C)) · C_1^(Ri→C) + f_2^(Ri→C) · C_2^(Ri→C)    (3)
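Equation (3) is a per-pixel switch driven by the binary mask: inside the high-error region the second prediction is used, elsewhere the first. A one-line sketch (illustrative naming):

```python
import numpy as np

def fuse_predictions(pred1, pred2, mask2):
    """Fuse two affine predictions per Eq. (3): take pred2 where the
    binary high-error mask is 1, pred1 everywhere else."""
    m = mask2.astype(pred1.dtype)
    return (1 - m) * pred1 + m * pred2
```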
Similarly, using the reference frame R_j and C, the prediction C^(Rj→C) of C is formed. The predictions C^(Ri→C) and C^(Rj→C) are then combined through a weighting scheme to generate the bi-directional affine motion model compensated prediction R_affine of C:

R_affine = w_i · C^(Ri→C) + w_j · C^(Rj→C)    (4)

where w_i, w_j ∈ [0, 1] are real-valued weights.
What is communicated by the encoder are the 4 sets of 3-corner motion vectors for M_1^(Ri→C), M_2^(Ri→C), M_1^(Rj→C), and M_2^(Rj→C). In total, therefore, 4 × 3 = 12 MVs, each with 1/16-th of a pixel accuracy, are needed to reconstruct the reference frame R_affine at the decoder.
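The signalling overhead of these 12 MVs can be estimated from the code lengths of a signed order-0 exponential-Golomb code (the se(v) binarization familiar from HEVC); the helper below is a sketch of that length computation under the assumption that MV components are already integers in 1/16-pel units:

```python
def exp_golomb_bits(v):
    """Bit length of a signed integer under signed order-0
    exponential-Golomb coding (HEVC-style se(v) mapping)."""
    u = 2 * abs(v) - (1 if v > 0 else 0)  # signed-to-unsigned mapping
    # ue(u) code length = 2*floor(log2(u+1)) + 1
    return 2 * (u + 1).bit_length() - 1

def side_info_bits(mv_sets):
    """Total bits to signal the 4 sets of 3-corner MVs, where each MV
    is an (x, y) pair of quantised integer 1/16-pel components."""
    return sum(exp_golomb_bits(c)
               for mvs in mv_sets for mv in mvs for c in mv)
```

For instance, 24 zero-valued components cost 24 bits in total, which illustrates why this side information is negligible next to the residual bitstream of a 4K frame.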
Experimental Analysis
The RD performance of the employed coder is investigated on three 4K UHD video sequences. The first 49 frames of each sequence are coded with the HM 16.10 reference software for HEVC. The HM encoder is configured with the random access main configuration, i.e. a hierarchical GOP structure with GOP size = 8 and intra period = 32, as per the common test conditions. Four quantization parameter values (QP = 22, 27, 32, 37) are used. For each B-frame, the available pair of reference frames is fed into the bi-directional affine motion model based prediction process to generate the additional reference frame R_affine. All the obtained results are summarized in Table 1.
Sequence             BD-rate    BD-PSNR
Park and Buildings   −2.70%     0.06 dB
Vehicles             −4.68%     0.15 dB
Book                 −3.97%     0.08 dB
Average              −3.78%     0.10 dB
Table 1: The Bjøntegaard delta gains obtained from the test sequences over standalone HEVC by using the bi-directional
affine motion models compensated reference frames.
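The delta-rate figures in Table 1 follow the standard Bjøntegaard metric: fit a cubic polynomial to each RD curve in the log-rate domain and average the horizontal gap over the overlapping PSNR range. A sketch of that computation (the four (rate, PSNR) anchor points per curve are not reproduced here):

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta-rate (%) between two RD curves, each given as
    four (bitrate, PSNR) points, e.g. from QP = 22, 27, 32, 37.
    Negative values mean the test codec saves bit rate."""
    lr_ref, lr_test = np.log10(rates_ref), np.log10(rates_test)
    # cubic fit of log-rate as a function of PSNR (exact through 4 points)
    p_ref = np.polyfit(psnrs_ref, lr_ref, 3)
    p_test = np.polyfit(psnrs_test, lr_test, 3)
    lo = max(min(psnrs_ref), min(psnrs_test))   # overlapping PSNR range
    hi = min(max(psnrs_ref), max(psnrs_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)  # mean log-rate gap
    return (10 ** avg_diff - 1) * 100
```

Identical curves yield 0%, and doubling every test bitrate at unchanged PSNR yields exactly +100%, which are useful sanity checks on the integration.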
Conclusions
• We investigated the RD performance of a hybrid coder that uses an affine motion compensated reference frame, along with the typical references, for coding 4K UHD video sequences.
• Generation of this additional reference is done by finding homogeneous motion regions over the current frame; it does not require any super-pixel level segmentation.
• The experimental results are encouraging, e.g. a bit rate saving of up to 4.68% is achieved over standalone HEVC.
• The associated motion information and signalling overhead could be further optimized to achieve increased bit rate savings.
References
[1] A. Ahmmed, D. Taubman, A. T. Naman, and M. Pickering. Homogeneous motion discovery oriented reference frame for high efficiency video coding. In Picture Coding Symposium (PCS), pages 1–5, 2016.
[2] A. Ahmmed, R. Xu, A. T. Naman, M. J. Alam, M. Pickering, and D. Taubman. Motion seg-
mentation initialization strategies for bi-directional inter-frame prediction. In IEEE International
Workshop on Multimedia Signal Processing, pages 58–63, Sept 2013.
Acknowledgements
The authors would like to thank Dr. Matteo Naccari of the BBC Research and Development team for providing the 4K UHD video sequences and for his valuable suggestions during this work.

DICTA 2017 poster
