2. Section I
− Video Compression History
− A Generic Interframe Video Encoder
− The Principle of Compression
− Differential Pulse-Code Modulation (DPCM)
− Transform Coding
− Quantization of DCT Coefficients
− Entropy Coding
Section II
− Still Image Coding
− Prediction in Video Coding (Temporal and Spatial Prediction)
− A Generic Video Encoder/Decoder
− Some Motion Estimation Approaches
2
Outline
4. Lossless Compression
– Transparent (Totally reversible without any loss)
– Compression ratio not guaranteed
– Essential for computer data, where any loss is unacceptable
– Lempel–Ziv–Welch coding
• Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by Abraham
Lempel, Jacob Ziv, and Terry Welch.
• LZW compresses a file into a smaller file using a table-based lookup algorithm.
• It replaces strings of characters with single codes.
• A dynamic (adaptive) coding system
• Used by PKZip, WinZip, GIF, TIFF, PNG, Fax, UNIX and Linux, Microsoft DriveSpace (a disk
compression utility supplied with MS-DOS), and many other compression systems
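The table-based substitution described above can be sketched in a few lines. A minimal LZW sketch in Python (for clarity the codes are kept as a list of integers rather than packed into a bitstream, as a real implementation would do):

```python
def lzw_compress(data: str) -> list:
    # Start with single-character strings; grow the table as new strings appear.
    table = {chr(i): i for i in range(256)}
    w, out = "", []
    for c in data:
        wc = w + c
        if wc in table:
            w = wc                       # keep extending the current string
        else:
            out.append(table[w])         # emit code for the longest known string
            table[wc] = len(table)       # add the new string to the table
            w = c
    if w:
        out.append(table[w])
    return out

def lzw_decompress(codes: list) -> str:
    table = {i: chr(i) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for k in codes[1:]:
        entry = table[k] if k in table else w + w[0]   # the one special case
        out.append(entry)
        table[len(table)] = w + entry[0]               # rebuild the same table
        w = entry
    return "".join(out)
```

The round trip is exact (transparent), but how many codes come out depends entirely on the input, which is why lossless coding cannot guarantee a compression ratio.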
4
(Diagram: Encoder → Decoder; transparent: the output is identical to the input.)
5. Lossy Compression
– Non-transparent.
– Compression ratio is guaranteed.
– Examples
• JPEG, MPEG, DV, Digital Betacam, Betacam SX, IMX, etc.
• MP3, Dolby E, Dolby Digital, DTS (originally Digital Theater Systems), etc.
• Both DTS and Dolby Digital are audio compression technologies, allowing movie makers to record
surround sound that can be reproduced in cinemas as well as homes.
– Good for media data ...
• … where compression ratio is important.
5
(Diagram: Encoder → Decoder; non-transparent: the output is not identical to the input.)
6. Lossless and Lossy Compression Techniques
− The lossless stages in the compression process:
− DCT: Discrete Cosine Transform
− VLC: Variable Length Coding
− RLC: Run Length Coding
− The lossy stages in the compression process:
− Chroma subsampling: 4:2:2, 4:2:0, 4:1:1
− DPCM: Differential Pulse Code Modulation
− Quantization
6
7. 7
− It arises when parts of a picture are often replicated within a single frame of video (with minor
changes).
Spatial Redundancy in Still Images
(Figure: a sky picture; one area is all blue, another is half blue and half green; eight neighbouring blocks are all labelled ‘Sky Blue’.)
8. − Spatial Redundancy Reduction (pixels inside a picture are similar)
− Statistical Redundancy Reduction (more frequent symbols are assigned short code words and less
frequent ones longer words)
The Principle of Compression in Still Images
8
10. • Zig-Zag Scan
• Run-length coding
• VLC
Quantization
• Major Reduction
• Controls ‘Quality’
10
Intra-frame Compression (like still-image compression)
11. Intra-frame Compression (like still-image compression)
(Block diagram: Baseband input → Discrete Cosine Transform → Zig-zag Scanning → Quantisation → Entropy Coding → Data Buffer → Compressed output, with a feedback signal from the data buffer to the quantiser.)
− Discrete Cosine Transform: rearranges the pixels into frequency coefficients.
− Zig-zag scanning: rearranges the coefficients from raster scan to low-frequency first.
− Quantisation: quantises the data if the compression ratio has not been achieved.
− Entropy coding: replaces the original data with shorter codes or symbols.
− Data buffer: stores the compressed data and checks the compression ratio; returns a quantisation signal if the ratio is not achieved.
11
12. Lossless and Lossy in Intra-frame Compression
(The same block diagram, with the reversibility of each stage marked.)
− Discrete Cosine Transform and zig-zag scanning: rearrangement (reversible).
− Quantisation: possible loss of entropy (non-reversible!).
− Entropy coding: data reduction (reversible).
− Data buffer: compression ratio checking (reversible).
12
16. – The quantizer cuts entropy.
– Controlled by the data buffer.
– May pass data through without change.
• The quantizer can effectively “switch off” by going into pass-through mode if the amount
of data from the data buffer is OK.
– May quantize the data in a number of steps.
• If the data buffer fills, a signal is sent back to the quantiser, which switches to a different
quantisation matrix to flatten the data and reduce its content.
Quantisation
16
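The buffer-controlled behaviour above can be mimicked with a simple uniform quantiser; the step size plays the role of the quantisation matrix (a sketch, not any particular standard's quantiser):

```python
import numpy as np

def quantise(coeffs, step):
    # Dividing by the step and rounding discards small variations: this is
    # the only non-reversible stage. step = 1 is effectively pass-through
    # for integer coefficients.
    return np.round(np.asarray(coeffs, dtype=float) / step).astype(int)

def dequantise(levels, step):
    # The decoder can only restore multiples of the step: the discarded
    # detail (entropy) cannot be recovered.
    return np.asarray(levels) * step

coeffs = np.array([-34, 21, -10, 5, 3, -2, 1, 0])
print(quantise(coeffs, 1))   # pass-through: [-34  21 -10   5   3  -2   1   0]
print(quantise(coeffs, 8))   # a coarser step flattens the data
```

With step 8 the small coefficients collapse to zero, which is exactly what lets the later run-length and variable-length coding shrink the output when the buffer demands it.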
26. – Rearranges the DCT coefficients.
• Changed from raster scan so that the DC and low frequency coefficients are first and the
high frequency coefficients are last.
– Helps to separate entropy from redundancy.
Zig-zag Scanning
26
29. Zig-zag Scanning
29
DC and low frequency coefficients are first
and the high frequency coefficients are last.
30. Zig-zag Scanning for Separating Redundancy and Entropy
30
(Figure: entropy is concentrated in the early coefficients of the scan, redundancy in the later ones.)
DC and low frequency coefficients are first
and the high frequency coefficients are last.
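The zig-zag order itself is easy to generate: visit the block's anti-diagonals in turn, alternating direction. A sketch in Python:

```python
def zigzag_order(n=8):
    # Coefficients are visited diagonal by diagonal (r + c constant), so the
    # DC and low-frequency coefficients come first and the high-frequency
    # coefficients last. Alternating the direction on odd/even diagonals
    # gives the familiar zig-zag path.
    cells = [(r, c) for r in range(n) for c in range(n)]
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

print(zigzag_order(4)[:6])   # [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
```

Scanning an 8×8 DCT block in this order groups the trailing zeros together, which is what makes the subsequent run-length coding effective.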
32. (Recap of the block diagram: Baseband input → Discrete Cosine Transform → Zig-zag Scanning → Quantisation → Entropy Coding → Data Buffer → Compressed output.)
− Discrete Cosine Transform: rearranges the pixels into frequency coefficients.
− Zig-zag scanning: rearranges the coefficients from raster scan to low-frequency first.
− Quantisation: quantises the data if the compression ratio has not been achieved.
− Entropy coding: replaces the original data with shorter codes or symbols.
− Data buffer: stores the compressed data and checks the compression ratio; returns a quantisation signal if the ratio is not achieved.
32
Intra-frame Compression (like still-image compression)
33. – Holds the results from the variable length coder and outputs data at a constant rate.
– If the data buffer empties, ‘packing’ data is output.
– If the data buffer fills, a signal is sent to quantizer.
– The quantizer is instructed to reduce the amount of data.
Data Buffer
(Block diagram: Baseband input → Discrete Cosine Transform → Zig-zag Scanning → Quantisation → Entropy Coding → Data Buffer → Compressed output.)
33
34. – Simple video signal
• DCT has most of its big numbers in the top left corner.
• Zig-zag scan has all the big numbers at the start of the scan.
• RLC and VLC can reduce the amount of data a lot.
• Amount of data entering the data buffer is small.
– Medium complexity video signal
• DCT has some of its big numbers in the top left corner.
• Zig-zag scan has a few big numbers at the start of the scan.
• RLC and VLC reduce data a bit.
• Amount of data entering the data buffer is OK.
• Data buffer sends the compressed data out as is.
• Data buffer adds packing data.
– Complex video signal
• DCT has big numbers all over the DCT block.
• Zig-zag scan still results in big numbers everywhere.
• RLC and VLC cannot reduce the amount of data very much.
• Amount of data entering the data buffer is too high.
• Data buffer sends a signal back to the quantiser.
• Quantiser reduces the amount of data by cutting entropy.
Data Buffer
34
(Block diagram: Baseband input → Discrete Cosine Transform → Zig-zag Scanning → Quantisation → Entropy Coding → Data Buffer → Compressed output.)
36. Spatial Redundancy in Still Images (recall)
− It arises when parts of a picture are often replicated within a single frame of video (with minor
changes).
36
37. Temporal Redundancy in Moving Images
− It arises when successive frames of video display images of the same scene.
− Take advantage of the similarity between successive frames.
(Figure: two pictures; the second is the same as the first except for one area.)
37
39. Moving Picture Redundancies
Temporal Redundancy
− It arises when successive frames of video display images of the same scene.
Spatial Redundancy
− It arises when parts of a picture are often replicated within a single frame of video (with
minor changes).
39
Temporal Redundancy (inter-frame)
Spatial Redundancy (intra-frame)
40. The MPEG video compression algorithm achieves very high rates of compression by exploiting the
redundancy in video information.
− Spatial Redundancy Reduction (pixels inside a picture are similar)
− Temporal Redundancy Reduction (Similarity between the frames)
− Statistical Redundancy Reduction (more frequent symbols are assigned short code words and less
frequent ones longer words)
The Principle of Compression for Moving Images
40
42. The goal of the prediction model is to reduce redundancy by forming a prediction of the data and subtracting
this prediction from the current data.
− The residual is encoded and sent to the decoder which re-creates the same prediction so that it can add the
decoded residual and reconstruct the current frame.
− In order that the decoder can create an identical prediction, it is essential that the encoder forms the
prediction using only data available to the decoder, i.e. data that has already been coded and transmitted.
Prediction Model
42
Encoder:
I. Forms a prediction (predictor).
II. Subtracts the prediction from the current data to create the residual.
Decoder (receives the encoded residual):
I. Re-creates the same prediction (predictor).
II. Adds the decoded residual to the prediction (predictor).
III. Reconstructs a version of the original block.
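The encoder/decoder symmetry above can be shown with a toy numeric example (the sample values are invented for illustration):

```python
import numpy as np

current    = np.array([52, 55, 61, 66])    # current data to be coded
prediction = np.array([50, 54, 60, 68])    # formed only from already-coded data

residual = current - prediction            # encoder: subtract the prediction
# ... the residual is encoded and transmitted ...
reconstructed = prediction + residual      # decoder: same prediction + residual

assert (reconstructed == current).all()    # the decoder recovers the current data
print(residual)                            # [ 2  1  1 -2]
```

The residual carries far less energy than the original samples, and that is the entire source of the coding gain: the better the prediction, the smaller the residual.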
43. Spatial Prediction: The prediction is formed from previously coded image samples in the same frame
− The output of this process is a set of residual or difference samples and the more accurate the
prediction process, the less energy is contained in the residual.
Temporal Prediction: The prediction is formed from previously coded frames
43
Inter Frame (Temporal) and Intra Frame (Spatial) Prediction
46. The prediction for the current block of image samples is
created from previously-coded samples in the same frame.
− Assuming that the blocks of image samples are coded in
raster-scan order, which is not always the case, the
upper/left shaded blocks are available for intra prediction.
− These blocks have already been coded and placed in the
output bitstream.
− When the decoder processes the current block, the shaded
upper/left blocks are already decoded and can be used to
re-create the prediction.
− H.264/AVC uses spatial extrapolation to create an intra
prediction for a block or macroblock.
46
Intra Prediction
Available samples
Spatial extrapolation
48. Intra Prediction (Ex: H.264/AVC)
The nine 4×4 intra prediction modes (directions: vertical, horizontal, DC (mean), diagonal down-left, diagonal down-right, vertical-right, vertical-left, horizontal-up, horizontal-down):
Mode | Mode Name
0 | DC
1 | Vertical
2 | Horizontal
3 | Diagonal down/right
4 | Diagonal down/left
5 | Vertical-right
6 | Vertical-left
7 | Horizontal-up
8 | Horizontal-down
48
(Figure: prediction-direction diagram and two 4×4 sample grids with neighbouring samples A–D above, I–L to the left and M at the corner, illustrating the modes.)
49. 49
Intra Prediction for 4x4 Luma Blocks
Mode 0: DC Prediction
− If all samples A, B, C, D, I, J, K, L, are available, a=b=c=…=p = (A+B+C+D+I+J+K+L+4) / 8.
− If A, B, C, and D are not available and I, J, K, and L are available, a=b=c=…=p =(I+J+K+L+2) / 4.
− If I, J, K, and L are not available and A, B, C, and D are available, a=b=c=…=p =(A+B+C+D+2) /4.
− If none of the eight samples is available, a=b=c=…=p = 128.
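The four cases above translate directly into code (a sketch; integer division implements the rounding):

```python
def dc_predict_4x4(top=None, left=None):
    # top = [A, B, C, D] if available, left = [I, J, K, L] if available.
    if top is not None and left is not None:
        dc = (sum(top) + sum(left) + 4) // 8
    elif left is not None:            # A..D unavailable
        dc = (sum(left) + 2) // 4
    elif top is not None:             # I..L unavailable
        dc = (sum(top) + 2) // 4
    else:                             # nothing available: mid-grey
        dc = 128
    return [[dc] * 4 for _ in range(4)]   # a = b = ... = p = dc

print(dc_predict_4x4(top=[96, 100, 104, 100], left=[100, 100, 96, 104])[0])
# [100, 100, 100, 100]
```

Every sample of the predicted block gets the same average value, which is why DC prediction suits flat regions.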
Intra Prediction (Ex: H.264/AVC)
(Figure: 4×4 block samples a–p, with neighbouring samples A–H above, I–L to the left and Q at the top-left corner.)
50. 50
Intra Prediction for 4x4 Luma Blocks
Mode 1: Vertical Prediction
− This mode shall be used only if A, B, C, D are available. The prediction in this mode shall be as follows:
• a, e, i, m are predicted by A,
• b, f, j, n are predicted by B,
• c, g, k, o are predicted by C,
• d, h, l, p are predicted by D.
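Mode 1 is the simplest to express in code: each column is a copy of the sample above it (a sketch):

```python
def vertical_predict_4x4(top):
    # top = [A, B, C, D]: column 0 <- A (a, e, i, m), column 1 <- B, etc.
    A, B, C, D = top
    return [[A, B, C, D] for _ in range(4)]

print(vertical_predict_4x4([10, 20, 30, 40]))
# [[10, 20, 30, 40], [10, 20, 30, 40], [10, 20, 30, 40], [10, 20, 30, 40]]
```

This extrapolation is a good fit when the block contains vertical edges or stripes continuing from the row above.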
Intra Prediction (Ex: H.264/AVC)
(Figure: 4×4 block samples a–p, with neighbouring samples A–H above, I–L to the left and Q at the top-left corner.)
51. 51
Intra Prediction for 4x4 Luma Blocks
Mode 3: Diagonal Down/Right prediction
− This mode is used only if all A,B,C,D,I,J,K,L,Q are inside the picture. This is a 'diagonal' prediction.
• m is predicted by: (J + 2K + L + 2)/4
• i, n are predicted by: (I + 2J + K + 2)/4
• e, j, o are predicted by: (Q + 2I + J + 2)/4
• a, f, k, p are predicted by: (A + 2Q + I + 2)/4
• b, g, l are predicted by: (Q + 2A + B + 2)/4
• c, h are predicted by: (A + 2B + C + 2)/4
• d is predicted by: (B + 2C + D + 2)/4
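The per-diagonal equations map onto a small table (a sketch; the value for sample d, which the same three-tap pattern gives as (B + 2C + D + 2)/4, is included to cover the whole block):

```python
def diag_down_right_4x4(Q, top, left):
    # top = [A, B, C, D], left = [I, J, K, L], Q = corner sample.
    A, B, C, D = top
    I, J, K, L = left
    f = lambda x, y, z: (x + 2 * y + z + 2) // 4   # 3-tap filter with rounding
    val = {}
    for group, (x, y, z) in [("m", (J, K, L)), ("in", (I, J, K)),
                             ("ejo", (Q, I, J)), ("afkp", (A, Q, I)),
                             ("bgl", (Q, A, B)), ("ch", (A, B, C)),
                             ("d", (B, C, D))]:
        for s in group:                 # every sample on a diagonal shares a value
            val[s] = f(x, y, z)
    return [[val[s] for s in row] for row in ("abcd", "efgh", "ijkl", "mnop")]

# A flat neighbourhood predicts a flat block:
print(diag_down_right_4x4(100, [100] * 4, [100] * 4)[0])   # [100, 100, 100, 100]
```

Each anti-diagonal of the block is filled with one filtered value, extending the neighbouring samples along the down-right direction.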
Intra Prediction (Ex: H.264/AVC)
(Figure: 4×4 block samples a–p, with neighbouring samples A–H above, I–L to the left and Q at the top-left corner.)
52. (Figure: two 4×4 sample grids a–p with neighbouring samples A–D above, I–L to the left and M at the corner, illustrating Mode 4 and Mode 8.)
Intra Prediction for 4x4 Luma Blocks
Example of 4 x 4 luma block
– Sample a, d : predicted by round(I/4 + M/2 + A/4), round(B/4 + C/2 + D/4) for mode 4
– Sample a, d : predicted by round(I/2 + J/2), round(J/4 + K/2 + L/4) for mode 8
Intra Prediction (Ex: H.264/AVC)
53. 53
Ex. Intra Prediction for 4x4 Luma Blocks
Intra Prediction (Ex: H.264/AVC)
A 4 × 4 luma block, part of the
highlighted macroblock
QCIF frame with highlighted macroblock
54. 54
Ex. Intra Prediction for 4x4 Luma Blocks
− The 9 prediction modes 0-8 are calculated for the
following 4 × 4 block.
− The Sum of Absolute Errors (SAE) for each
prediction indicates the magnitude of the
prediction error.
− In this case, the best match to the actual current
block is given by mode 8, horizontal-up, because
this mode gives the smallest SAE.
− A visual comparison shows that the mode 8 P
block (prediction block) appears quite similar to
the original 4 × 4 block.
Intra Prediction (Ex: H.264/AVC)
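The mode decision described above is just an error measurement: form each candidate P block, sum the absolute differences against the actual block, and keep the smallest. A sketch with three of the nine modes (the sample block and neighbours are invented):

```python
import numpy as np

def sae(block, pred):
    # Sum of Absolute Errors between the actual block and a prediction.
    return int(np.abs(block - pred).sum())

block = np.array([[i * 10 + 40] * 4 for i in range(4)])   # rows get darker downwards
top   = np.array([55, 55, 55, 55])                        # samples above the block
left  = np.array([40, 50, 60, 70])                        # samples to the left

candidates = {
    "vertical":   np.tile(top, (4, 1)),                   # columns copy top samples
    "horizontal": np.tile(left[:, None], (1, 4)),         # rows copy left samples
    "DC":         np.full((4, 4), (top.sum() + left.sum() + 4) // 8),
}
best = min(candidates, key=lambda m: sae(block, candidates[m]))
print(best, sae(block, candidates[best]))   # horizontal 0
```

Here the block's rows are constant, so horizontal prediction matches exactly (SAE = 0) and would be chosen, just as mode 8 was chosen for the slide's example block.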
55. 55
Intra Prediction for 4x4 Chroma Blocks (only one mode: DC Prediction)
− A, B, C, D are four 4x4 blocks in an 8x8 chroma block.
− S0, S1, S2, S3 are the sums of 4 neighboring pixels.
Intra Prediction (Ex: H.264/AVC)
If S0, S1, S2, S3 are all inside the frame:
A = (S0 + S2 + 4)/8
B = (S1 + 2)/4
C = (S3 + 2)/4
D = (S1 + S3 + 4)/8
If only S0 and S1 are inside the frame:
A = (S0 + 2)/4
B = (S1 + 2)/4
C = (S0 + 2)/4
D = (S1 + 2)/4
If only S2 and S3 are inside the frame:
A = (S2 + 2)/4
B = (S2 + 2)/4
C = (S3 + 2)/4
D = (S3 + 2)/4
If S0, S1, S2, S3 are all outside the frame: A = B = C = D = 128
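The four cases can be written down directly (a sketch; S is the list [S0, S1, S2, S3] and the flags say which sums lie inside the frame):

```python
def chroma_dc_predict(S, s_inside):
    # Returns the DC value predicted for each 4x4 block A, B, C, D of the
    # 8x8 chroma block, following the four cases above.
    S0, S1, S2, S3 = S
    top  = s_inside[0] and s_inside[1]    # S0, S1 available (row above)
    left = s_inside[2] and s_inside[3]    # S2, S3 available (column to the left)
    if top and left:
        return {"A": (S0 + S2 + 4) // 8, "B": (S1 + 2) // 4,
                "C": (S3 + 2) // 4,      "D": (S1 + S3 + 4) // 8}
    if top:
        return {"A": (S0 + 2) // 4, "B": (S1 + 2) // 4,
                "C": (S0 + 2) // 4, "D": (S1 + 2) // 4}
    if left:
        return {"A": (S2 + 2) // 4, "B": (S2 + 2) // 4,
                "C": (S3 + 2) // 4, "D": (S3 + 2) // 4}
    return {"A": 128, "B": 128, "C": 128, "D": 128}

print(chroma_dc_predict([400, 480, 400, 480], [True] * 4))
# {'A': 100, 'B': 120, 'C': 120, 'D': 120}
```

Note that A averages its top and left sums while B and C use only the neighbours directly above or beside them, exactly as in the equations above.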
(Figure: the 8×8 chroma block is divided into 4×4 blocks A, B (top row) and C, D (bottom row); S0 and S1 are the sums of the four pixels above A and B, S2 and S3 the sums of the four pixels to the left of A and C.)
56. 56
Intra Prediction (Ex: H.264/AVC)
16x16 Intra Prediction Mode
− Especially suitable for smooth areas
− Prediction Modes
• Mode 0 =Vertical Prediction
• Mode 1 = Horizontal Prediction
• Mode 2 = DC prediction
• Mode 3 = Plane prediction
− Residual coding
• Another 4x4 transform is applied to the 16 DC coefficients
• Only a single scan is used.
58. 58
Ex. Intra 16×16
− A luma macroblock with previously encoded samples at the upper and left-hand edges.
− The best match is given by
mode 3 which in this case
produces a plane with a
luminance gradient from
light at the upper-left to dark
at the lower-right.
Intra Prediction (Ex: H.264/AVC)
60. 60
General Reasons for Differences Between Two Frames
Differences between two frames can be caused by
• Camera motion: the outlines of background or stationary objects can be seen in the Diff Image
• Object motion: the outlines of moving objects can be seen in the Diff Image
• Illumination changes: sun rising, headlights, etc.
• Scene cuts: lots of content appears in the Diff Image
• Noise: If the only difference between two frames is noise (nothing moved), then you won’t recognize
anything in the Difference Image
We try to minimize the entropy of the difference image by motion prediction.
61. 61
Typical Camera Motions, Recall
Track right
Dolly
backward
Boom up
(Pedestal up)
Pan right
Tilt up
Track left
Dolly forward
Boom down
(Pedestal down)
Pan left
Tilt down
Roll
62. 62
Typical Object Motions, Recall
Translation: Simple movement of typically rigid objects
Camera pans vs. movement of objects
Rotation: Spinning about an axis
– Camera versus object rotation
Zoom in/out
– Camera zoom vs. object zoom (movement in/out)
Frame n+1
(Translation)
Frame n+1
(Rotation)
Frame n+2
(Zoom)
Frame n
70. Difference Frame
Without Motion Prediction
Difference Frame
With Motion Prediction
Frame N Frame N+1
70
Goal: to remove the correlation by motion compensation
If you can see something in the Diff Image and recognize it, there’s still correlation in the difference image.
Temporal Prediction (Motion Prediction)
71. Temporal Redundancy Reduction
− Pixels in the successive frames of the same locations are highly correlated.
− In static parts of the picture, they are virtually the same.
− Due to motion, they are displaced, but their motion compensated values become more similar.
− The accuracy of the prediction can usually be improved by compensating for motion between the
reference frame(s) and the current frame.
• Hence motion compensated frame difference pixels become smaller
• Instead of transforming a block of pixels, their motion compensated values are transformed and quantised.
• The predicted frame is created from one or more past or future frames known as reference frames (anchor).
71
Temporal Prediction (Motion Prediction)
Frame 1 (as a predictor for frame 2 ) Frame 2 (current frame ) Difference
Mid-grey represents a difference of zero and light or dark greys correspond to positive and negative differences.
72. (Figure: previous frame N and next frame N+1. The best match for the macroblock in next frame N+1 is found in previous frame N; this is the predictor, i.e. the motion-compensated prediction. The (forward) motion vector gives its displacement (X, Y), and a differentiator produces the small difference, the residual block.)
72
Movement is cancelled in the frames and carried as vector information in the header.
Temporal Prediction (Motion Prediction)
73. 73
(Figure: image t, image t−1, the difference without motion compensation, the motion vectors for blocks, and the difference with motion compensation.)
Temporal Prediction (Motion Prediction), Example
75. 75
Searching area for finding predictor
Image to code
Reference image
Temporal Prediction (Motion Prediction), Example
76. Motion Estimation (ME)
− The process of finding the best match (finding MVs).
− The chosen candidate region becomes the predictor for the current MxN block (a motion-compensated prediction).
(Figure: the best match for the macroblock in previous frame N is subtracted from the macroblock in next frame N+1 to give the residual block.)
Motion Compensation (MC)
− The selected ‘best’ matching region in the
reference frame is subtracted from the current
macroblock to produce a residual macroblock.
− This residual block is encoded and transmitted together with a motion vector describing the
position of the best matching region relative to the current macroblock position.
Temporal Prediction (Motion Prediction)
77. Motion Estimation (ME)
− Search an area in the reference frame (a past or future frame) to find a similar MxN-sample region.
− The process of finding the best match (Motion Vector) is known as motion estimation (ME).
Motion Compensation (MC)
− The chosen candidate region becomes the predictor for the current MxN block (a motion compensated
prediction) and is subtracted from the current block to form a residual MxN block.
− The residual block is encoded and transmitted and the offset between the current block and the position
of the candidate region (motion vector) is also transmitted.
− The decoder uses the received motion vector to re-create the predictor region.
− It decodes the residual block, adds it to the predictor and reconstructs a version of the original block.
77
Temporal Prediction (Motion Prediction)
78. 78
Motion Vector Extraction
− First, the frame to be approximated, the current frame, is chopped up into uniform non-overlapping
blocks.
− Then each block in the current frame is compared to areas of similar size from the previous frame in order
to find an area that is similar. A block from the current frame for which a similar area is sought is known as
a target block.
− The location of the similar or matching block in the past frame might be different from the location of the
target block in the current frame. The relative difference in locations is known as the motion vector.
− If the target block and matching block are found at the same location in their respective frames, then the
motion vector that describes their difference is known as a zero vector.
Temporal Prediction (Motion Prediction)
79. Block-based Motion Estimation:
− Motion estimation of a macroblock involves finding a M×N-sample region in a reference frame that
closely matches the current macroblock (best match).
− The reference frame is a previously encoded frame from the sequence and may be before or after the
current frame in display order (in which case frames must be encoded out of display order).
− Where there is a significant change between the reference and current frames (ex: a scene change or
an uncovered area) it may be more efficient to encode the macroblock without motion compensation
and so an encoder may choose intra mode encoding using intra prediction.
Block-based Motion Compensation:
− The luma and chroma samples of the selected ‘best’ matching region in the reference frame are
subtracted from the current macroblock to produce a residual macroblock that is encoded and
transmitted together with a motion vector describing the position of the best matching region relative to
the current macroblock position.
Block-based Motion Estimation and Compensation
79
82. 82
Frame 1 s[x,y,t-1](previous) Frame 2 s[x,y,t](current) Partition of frame 2 into blocks (schematic)
Frame 2 with displacement vectors Difference between motion-compensated
prediction and current frame u[x,y,t] Referenced blocks in frame 1
Block-based Motion Estimation and Compensation, Ex3
83. 83
Effectiveness of Block Based Motion Prediction
The effectiveness of compression techniques that use block based motion compensation depends on
the extent to which the following assumptions hold.
• Objects move in a plane that is parallel to the camera plane. Thus the effects of zoom and
object rotation are not considered, although tracking in the plane parallel to object motion is.
• Illumination is spatially and temporally uniform. That is, the level of lighting is constant
throughout the image and does not change over time.
• Occlusion of one object by another, and uncovered background are not considered.
84. Frame N
Frame N+1
Available
from earlier
frame (N)
Not available from earlier frame (N)
for prediction of frame N+1
84
Occlusion in Motion Estimation and Compensation, Example
Occlusion of parts of one object by another object
85. 85
Motion Estimation and Block Matching Algorithms
To carry out motion compensation, the motion of the moving objects has to be estimated first.
− The technique used in all the standard video codecs is the Block Matching Algorithm (BMA).
− In a typical BMA, a frame is divided into blocks of 𝑀 × 𝑁 pixels or, more usually, square blocks
of 𝑁 × 𝑁 pixels.
− Then, for a maximum motion displacement of w pixels/frame, the current block of pixels is
matched against a corresponding block at the same coordinates but in the previous/next
frame, within the square window of width 𝑁 + 2𝑤.
− The best match on the basis of a matching criterion yields the displacement.
− Measurements of the video encoders’ complexity show that ME comprises almost 50–70 per
cent of the overall encoder’s complexity.
86. (Figure: an N×N block at (m, n) in the current frame is matched against N×N blocks in a search window of width N + 2w in the previous frame, each candidate shifted by (i, j).)

Matching criteria:

Mean Squared Error (MSE):
M(i, j) = (1/N²) Σ_{m=1..N} Σ_{n=1..N} [f(m, n) − g(m + i, n + j)]²,  −w ≤ i, j ≤ w

Mean Absolute Error (MAE):
M(i, j) = (1/N²) Σ_{m=1..N} Σ_{n=1..N} |f(m, n) − g(m + i, n + j)|,  −w ≤ i, j ≤ w

where f is the current block and g the previous frame.

Complexity
• Various measures such as the Cross-correlation Function (CCF), Mean Squared Error (MSE) and Mean Absolute Error (MAE) can be used in the matching criterion.
• For the best match, in the CCF the correlation has to be maximised, whereas in the latter two the distortion must be minimised.
• In practical coders, both MSE and MAE are used, since it is believed that CCF would not give good motion tracking, especially when the displacement is not large.
• An exhaustive search tests all (2w + 1)² candidate displacements.
Motion Estimation and Block Matching Algorithms
86
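A full-search matcher using the MAE criterion is a direct transcription of the equations above (a sketch; the fast algorithms that follow exist precisely because this double loop is so expensive):

```python
import numpy as np

def full_search(cur_block, prev_frame, x, y, w):
    # Tests all (2w + 1)^2 displacements (i, j) of the NxN block at (x, y)
    # and returns the motion vector with the minimum Mean Absolute Error.
    N = cur_block.shape[0]
    best_mae, best_mv = None, (0, 0)
    for i in range(-w, w + 1):
        for j in range(-w, w + 1):
            u, v = x + i, y + j
            if u < 0 or v < 0 or u + N > prev_frame.shape[0] or v + N > prev_frame.shape[1]:
                continue                      # candidate falls outside the frame
            cand = prev_frame[u:u + N, v:v + N]
            mae = np.abs(cur_block.astype(int) - cand.astype(int)).mean()
            if best_mae is None or mae < best_mae:
                best_mae, best_mv = mae, (i, j)
    return best_mv, best_mae

# A block that moved by (2, 2) between frames is found exactly:
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, size=(32, 32))
cur_block = prev[10:18, 12:20]               # pretend this 8x8 block moved from (10, 12)
print(full_search(cur_block, prev, 8, 10, w=4))   # ((2, 2), 0.0)
```

With w = 16 the loop already visits 1089 candidate positions per block, which is why ME dominates the encoder's complexity.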
93. Complexity/Performance: Computational Complexity
Algorithm | Maximum number of search points | w = 4 | w = 8 | w = 16
TDL | 2 + 7·log₂w | 16 | 23 | 30
TSS | 1 + 8·log₂w | 17 | 25 | 33
MMEA | 1 + 6·log₂w | 13 | 19 | 25
CDS | 3 + 2w | 11 | 19 | 35
CSA | 5 + 4·log₂w | 13 | 17 | 21
OSA | 1 + 4·log₂w | 9 | 13 | 17
FSM | (2w + 1)² | 81 | 289 | 1089
93
The range of motion speed is from w = 4 to 16 pixels/frame.
− The accuracy of the motion estimation is expressed in terms of errors: maximum absolute error, root-mean-
square error, average error and standard deviation.
FSM: Full Search Mode
TDL: Two-dimensional Logarithmic
TSS: Three-step Search
MMEA: Modified Motion Estimation Algorithm
CDS: Conjugate Direction Search
OSA: Orthogonal Search Algorithm
CSA: Cross-search Algorithm
94. Complexity/Performance: Compensation Efficiency
Algorithm | Split Screen: Entropy (bits/pel) | Split Screen: Std. deviation | Trevor White: Entropy (bits/pel) | Trevor White: Std. deviation
FSM | 4.57 | 7.39 | 4.41 | 6.07
TDL | 4.74 | 8.23 | 4.60 | 6.92
TSS | 4.74 | 8.19 | 4.58 | 6.86
MMEA | 4.81 | 8.56 | 4.69 | 7.46
CDS | 4.84 | 8.86 | 4.74 | 7.54
OSA | 4.85 | 8.81 | 4.72 | 7.51
CSA | 4.82 | 8.65 | 4.68 | 7.42
94
The motion compensation efficiencies of some algorithms for a motion speed of w = 8 pixels/frame, for two test image sequences (Split Screen and Trevor White).
96. 96
Sub-pixel Motion Compensation
− In the first stage, motion estimation finds the best match
on the integer pixel grid (circles).
− The encoder searches the half-pixel positions
immediately next to this best match (squares) to see
whether the match can be improved and if required, the
quarter-pixel positions next to the best half-pixel position
(triangles) are then searched.
− The final match, at an integer, half-pixel or quarter-pixel
position, is subtracted from the current block or
macroblock.
101. − Smaller motion compensation block sizes can produce better motion compensation results.
• Motion compensating each 8 × 8 block instead of each 16 × 16 macroblock reduces the residual
energy further and motion compensating each 4 × 4 block gives the smallest residual energy of all.
− However, a smaller block size leads to increased complexity, with more search operations to be carried
out, and an increase in the number of motion vectors that need to be transmitted.
− An effective compromise is to adapt the block size to the picture characteristics, for example
• choosing a large block size in flat, homogeneous regions of a frame
• choosing a small block size around areas of high detail and complex motion
Motion Compensation Block Size
101
103. − It is possible to estimate the trajectory of each pixel between successive video frames, producing a field
of pixel trajectories known as the optical flow or optic flow.
− If the optical flow field is accurately known, it should be possible to form an accurate prediction of most of
the pixels of the current frame by moving each pixel from the reference frame along its optical flow
vector.
− However, this is not a practical method of motion compensation. (An accurate calculation of optical flow
is very computationally intensive)
Optical Flow or Optic Flow
103
Frame 1
(as a predictor for frame 2 )
Frame 2
(current frame )
Optical Flow or Optic Flow
104. − The macroblock, corresponding to a 16×16-pixel region of a frame, is the basic unit for motion
compensated prediction in a number of important visual coding standards including MPEG-1, MPEG-2,
MPEG-4 Visual, H.261, H.263 and H.264.
− For source video material in the popular 4:2:0 format, a macroblock is organized as shown in Figure.
− An H.261 codec processes each video frame in units of a macroblock.
104
(Figure: a 4:2:0 macroblock is organised as four 8×8 Y blocks plus one 8×8 Cb and one 8×8 Cr block.)
Ex: Motion Estimation in H.261
105. Macro-block
– Motion estimation of a macroblock involves finding a 16×16-sample
region in a reference frame that closely matches the current
macroblock.
– Luminance: 16x16, four 8x8 blocks
– Chrominance: two 8x8 blocks
– Motion estimation only performed for luminance component
Motion Vector Range
– [−15, +15] in each dimension
– MB: 16 x 16
(Figure: the search area in the reference frame extends 15 samples beyond the MB on each side.)
105
Ex: Motion Estimation in H.261
(Figure: macroblock structure: luma blocks Y0, Y1, Y2, Y3 and chroma blocks Cr, Cb.)
107. I, P & B Frames (Ex: MPEG-1)
Uncompressed SDTV Digital Video Stream – 170 Mb/s: every picture is 830 kBytes.
MPEG-2 Compressed SDTV Digital Video Stream – 3.9 Mb/s:
I (≈100 kBytes) – Intra-coded picture, without reference to other pictures; compressed using spatial redundancy only.
P (≈33–50 kBytes) – Predictive-coded picture, using motion-compensated prediction from past I or P frames.
B (≈12–30 kBytes) – Bi-directionally predictive-coded picture, using both past and future I or P frames.
107
I: Intra Coded Frame
P: Predictively Coded (Predictive-coded) Frame
B: Bidirectionally Coded (Bidirectional-coded) Frame
108. • Intraframe Compression
– Frames marked by (I) denote the frames that are strictly intraframe compressed.
– The purpose of these frames, called the "I pictures", is to serve as random access points
to the sequence.
I Frames
108
109. • P Frames use motion-compensated forward predictive compression on a block basis.
– Motion vectors and prediction errors are coded.
– Predicting blocks from closest (most recently decoded) I and P pictures are utilised.
Forward Prediction
P Frames
109
110. • B frames use motion-compensated bi-directional predictive compression on a block basis.
– Motion vectors and prediction errors are coded.
– Predictor blocks from the closest (most recently decoded) past and future I and P pictures are utilised.
Forward Prediction
Bi-Directional Prediction
B Frames
110
Backward Prediction
111. I-pictures
• They are coded without reference to the previous picture.
• They provide access points to the coded sequence for decoding (intraframe coded as for JPEG)
P-pictures
• They are predictively coded with reference to the previous I- or P-coded pictures.
• They themselves are used as a reference (anchor) for coding of the future pictures.
B-pictures
• Bidirectionally coded pictures, which may use past, future or combinations of both pictures in their
predictions.
D-pictures
• As intraframe coded, where only the DC coefficients are retained.
• Hence, the picture quality is poor and normally used for applications like fast forward.
• D-pictures are not part of the GOP; hence, they are not present in a sequence containing any other
picture types. 111
I, P, B and D Pictures Features
112. Group of Pictures
• The relative number of I, P and B pictures can be arbitrary.
• A Group of Pictures (GOP) is the distance from one I frame to the next I frame.
• Ex: MPEG-2: an I picture is mandatory at least once in a sequence of 132 frames (period_max = 132).
(Figure: frames 1–12 followed by the next I frame: GOP = 12.)
112
113. An I picture is mandatory at least once in a sequence of 132 frames (period_max= 132)
GOP = 6
GOP = 2
GOP = 2
113
Group of Pictures, Examples
114. – I frames are independently encoded.
– P frames are based on previous I, P frames.
– B frames are based on previous and following I and/or P frames.
114
The Typical Size of Compressed Frames
I: Intra Coded Frame
P: Predictively Coded (Predictive-coded) Frame
B: Bidirectionally Coded (Bidirectional-coded) Frame
Type | Size | Compression
I | 18 kB | 7:1
P | 6 kB | 20:1
B | 2.5 kB | 50:1
Avg | 4.8 kB | 27:1
Typical Sizes of MPEG-1 Frames
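The table's average follows from a typical GOP pattern. Using a hypothetical 12-frame GOP of I BB P BB P BB P BB together with the sizes above (a back-of-envelope check, not a measurement):

```python
gop = "IBBPBBPBBPBB"                        # hypothetical MPEG-1 GOP, length 12
size_kb = {"I": 18.0, "P": 6.0, "B": 2.5}   # typical sizes from the table

avg = sum(size_kb[f] for f in gop) / len(gop)
print(round(avg, 2))   # 4.67 - close to the table's 4.8 kB average
```

One I frame, three P frames and eight B frames give (18 + 3·6 + 8·2.5)/12 ≈ 4.7 kB, roughly consistent with the 4.8 kB average quoted; the exact figure depends on the GOP pattern used.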
115. – If B-pictures are not used for predictions of future frames, then they can be coded with the highest
possible compression without any side effects.
– This is because, if one picture is coarsely coded and is used as a prediction, the coding distortions are
transferred to the next frame. This frame then needs more bits to clear the previous distortions, and the
overall bit rate may increase rather than decrease.
– The typical size of compressed P-frames is significantly smaller than that of I-frames (because temporal
redundancy is exploited in inter-frame compression).
– B-frames are even smaller than P-frames because
• the advantage of bi-directional prediction
• the lowest priority given to B-frames
115
Type | Size | Compression
I | 18 kB | 7:1
P | 6 kB | 20:1
B | 2.5 kB | 50:1
Avg | 4.8 kB | 27:1
Typical Sizes of MPEG-1 Frames
The Typical Size of Compressed Frames
116. Previous reference Current frame Future reference
Forward prediction
P-frame (Predictive Coded Frame) Features
Forward prediction
Why we need P frame?
116
117. (Figure: forward prediction: the P-frame is reconstructed from its predictor P’ plus an intra-frame-compressed difference.)
– The P-frames are forward predicted from the last I-frame or P-frame.
– It is impossible to reconstruct them without the data of another frame (I or P).
– They are coded with respect to the nearest previous I- or P-frames.
– This technique is called forward prediction.
– It uses motion compensation to provide more compression than I-frames.
– About 30% of the size of an I frame
P-frame (Predictive Coded Frame) Features
117
Forward Prediction
118. Difference
Forward Motion vector
Reference image
(previous image)
Encode:
− Motion Vector - difference in spatial location of macro-blocks.
− Small difference in content of the macro-blocks
Current image
Huffman
Coding
P-frame (Predictive Coded Frame) Features
118
000111010…...
119. B-frame (Bidirectional Coded Frame) Features
Previous reference Current frame Future reference
Forward prediction Backward prediction
Why do we need B-frames?
119
120. 120
– The MB containing part of a ball in the Target frame cannot find a good matching MB in the previous
frame because half of the ball was occluded by another object.
– A match however can readily be obtained from the next frame.
B-frame (Bidirectional Coded Frame) Features
Forward
prediction
Backward
prediction
121. Best match
Forward Motion Vector
Macroblock to be coded
Previous reference picture
Current B-picture
Future reference picture
Best match
Backward Motion Vector
121
Forward Prediction
Backward Prediction
B-frame (Bidirectional Coded Frame) Features
123. – B-frame requires information of the previous and following I-frame
and/or P-frame for encoding and decoding.
– Three types of motion compensation techniques are used:
• Forward motion compensation uses past anchor frame
information.
• Backward motion compensation uses future anchor frame
information.
• Interpolative motion compensation uses the average of the
past and future anchor frame information.
– It uses motion compensation to provide more compression than I- and P-frames.
– A B-frame is about 15% of the size of an I-frame.
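The three compensation modes can be sketched as a per-block decision: compute the forward, backward, and interpolative (averaged) predictions and keep the one with the smallest error. Blocks are flat lists of pixel values; all names below are illustrative, not from any standard:

```python
# Per-block B-frame mode decision: forward, backward, or interpolative
# prediction, picked by smallest SAD (sum of absolute differences).
def sad(a, b):
    """Sum of absolute differences between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_b_mode(current, forward_pred, backward_pred):
    """Return the compensation mode with the smallest SAD for this block."""
    interp = [(f + b) / 2 for f, b in zip(forward_pred, backward_pred)]
    candidates = {"forward": forward_pred, "backward": backward_pred,
                  "interpolative": interp}
    return min(candidates, key=lambda mode: sad(current, candidates[mode]))

# Occlusion-like case: only the future anchor matches, so backward wins.
print(best_b_mode([10, 10, 50, 50], [10, 10, 0, 0], [10, 10, 50, 50]))  # backward
```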
123
B-frame (Bidirectional Coded Frame) Features
(Figure: B-frame prediction loop with forward, backward, and bi-directional prediction; the prediction residual is intra-frame compressed, and B′ denotes the reconstructed B-frame.)
124. 124
– B-pictures have access to both past and future anchor pictures.
– Such an option increases the motion compensation efficiency, particularly when there are occluded
objects in the scene.
– In fact, one of the reasons for the introduction of B-pictures was this fact that the forward motion
estimation and P-pictures cannot compensate for the uncovered background of moving objects.
– From the two forward and backward motion vectors, the coder has a choice of choosing any of the
forward, backward or their combined motion-compensated predictions.
B-frame (Bidirectional Coded Frame) Features
125. 125
– Note that B-pictures do not use motion compensation from each other, since they are not used as
predictors.
– Also note that the motion vector overhead in B-pictures is much more than in P-pictures.
– The reason is that, for B-pictures, there are more macroblock types, which increase the macroblock type
overhead, and for the bidirectionally motion-compensated macroblocks two motion vectors have to be
sent.
B-frame (Bidirectional Coded Frame) Features
126. (Figure: the target frame is predicted from a weighted combination of the past and future references, and only the small remaining difference is encoded.)
Encode:
− Two motion vectors - difference in spatial location of macro-blocks.
• Two motion vectors are estimated (one to a past frame, one to a future frame).
− Small difference in content of the macro-blocks
126
Interpolative compensation uses the weighted average of the past and future anchor frame information.
B-frame (Bidirectional Coded Frame) Features
(Figure: forward (FMV) and backward (BMV) motion vectors with weights w1 and w2; the residual passes through DCT + quantisation + RLE, and the motion vectors and coefficients are Huffman coded into the bitstream 000111010…)
127. 127
The combined motion-compensated predictions
– A weighted average of the forward and backward motion-compensated pictures is calculated.
– The weight is inversely proportional to the distance of the B-picture with its anchor pictures.
Ex: GOP structure of I, B1, B2, P
– The bidirectionally interpolated motion-compensated picture for B1 would be two-thirds of the forward
motion-compensated pixels from the I-picture and one-third from backward motion-compensated pixels
of the P-picture.
B-frame (Bidirectional Coded Frame) Features
Prediction for B1 = (2/3) × (forward motion-compensated pixels from I) + (1/3) × (backward motion-compensated pixels from P)
128. New Search Mechanism: Prediction
Spatial Motion Vector Prediction:
a. Due to the spatial correlation, the motion vector of the current block is close to those in nearby blocks.
− Usage of Predictor:
1. Initial search point
2. DPCM coding of motion vector
128
(Figure: the motion vector v(i, j) of the current, uncoded block is predicted from previously coded neighbouring blocks.)
Predictor Example 1: ṽ(i, j) = Mean{ v(i, j−1), v(i−1, j−1), v(i−1, j) }
Predictor Example 2: ṽ(i, j) = Median{ v(i, j−1), v(i−1, j−1), v(i−1, j) }
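The median predictor above can be sketched component-wise on (x, y) motion vectors (function and argument names are illustrative):

```python
# Component-wise median predictor for the current block's MV from three
# previously coded neighbours v(i, j-1), v(i-1, j-1), v(i-1, j).
from statistics import median

def predict_mv(left, upleft, up):
    """Median of the three neighbouring MVs, taken per component."""
    return (median([left[0], upleft[0], up[0]]),
            median([left[1], upleft[1], up[1]]))

print(predict_mv(left=(4, 0), upleft=(3, 1), up=(-2, 1)))  # (3, 1)
```

The prediction serves both listed purposes: as the initial search point, and for DPCM coding of the MV, where only the residual (actual minus predicted) is transmitted.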
129. New Search Mechanism: Prediction
Temporal Motion Vector Prediction:
b. Due to the temporal correlation, the motion vector of the current block is close to those in nearby
blocks in the previous frame.
129
(Figure: the current, uncoded block in the current frame with its coded neighbours v^t, and the co-located neighbourhood v^(t−1) in the previous frame, in which all blocks are coded.)
Predictor Example: ṽ^t(i, j) = Median{ v^t(i, j−1), v^t(i−1, j−1), v^t(i−1, j), v^(t−1)(i, j), v^(t−1)(i, j+1) }
132. Intra-frame Compression/Coding
– An I-picture is coded without reference to any picture except itself.
– It is effectively a still image, encoded JPEG-style in real time.
– Often, I pictures (I-frames)
are used for random
access and as references
for the decoding of other
pictures.
132
133. • Zig-Zag Scan
• Run-length coding
• VLC
Quantization
• Major Reduction
• Controls ‘Quality’
133
Intra-frame Compression (similar to still-image compression)
134. Intra-frame Compression (similar to still-image compression)
(Figure: Baseband input → Discrete Cosine Transform → Zig-zag Scanning → Quantisation → Entropy Coding → Data Buffer → Compressed output)
− Discrete Cosine Transform: rearranges the pixels into frequency coefficients.
− Zig-zag scanning: rearranges the coefficients from raster-scan order to low-frequency-first order.
− Quantisation: quantises the data if the compression ratio has not been achieved.
− Entropy coding: replaces the original data with shorter codes or symbols.
− Data buffer: stores the compressed data and checks the compression ratio; returns a quantisation signal if the ratio is not achieved.
134
137. – Inter-frame coding removes temporal redundancy (Inter-frames reduce the average bit rate for the same
quality!)
– Relies on successive frames looking similar.
• Does not work well with cuts and breaks.
– Two different types of comparison:
• P-frames and B-frames.
– Inter-frames need a ‘reference’ frame.
• i.e. an I-frame (or P-frame).
– Many inter-frames can be used after the I frame.
• This can reduce bit rate a lot.
– Eventually the process must be started again.
• The difference becomes too great especially if there is a cut.
137
Inter-frame Compression/Coding
141. 141
Interframe Loop
− In interframe predictive coding, the difference between pixels in the current frame and their
prediction values from the reference frame (ex: previous frame) is coded and transmitted.
− At the receiver, after decoding the error signal of each pixel, it is added to a similar prediction
value to reconstruct the picture.
− The better the predictor, the smaller the error signal, and hence the transmission bit rate.
− When there is motion, assuming that movement in the picture is only a shift of object position, a pixel in the previous frame, displaced by a motion vector, is used as the prediction.
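The loop described above can be sketched on scalar pixel values. This is a toy frame-difference DPCM (no motion compensation); the key point is that the encoder reconstructs exactly what the decoder will, and predicts from that reconstruction:

```python
# Toy interframe predictive (DPCM) loop: the encoder sends quantised
# prediction errors; the decoder adds them to its own prediction, so
# encoder and decoder stay in step.
def quantise(e, step=4):
    """Coarse uniform quantiser."""
    return round(e / step) * step

def encode(frames):
    recon_prev, codes = None, []
    for frame in frames:
        pred = recon_prev if recon_prev is not None else [0] * len(frame)
        err = [quantise(c - p) for c, p in zip(frame, pred)]
        codes.append(err)
        recon_prev = [p + e for p, e in zip(pred, err)]  # encoder mirrors the decoder
    return codes

def decode(codes):
    recon, out = None, []
    for err in codes:
        pred = recon if recon is not None else [0] * len(err)
        recon = [p + e for p, e in zip(pred, err)]
        out.append(recon)
    return out

frames = [[100, 102, 98], [101, 103, 97]]
print(decode(encode(frames))[-1])  # within one quantiser step of the source
```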
A Generic Video Encoder
142. 142
Motion Estimator
− Assigning a motion vector to a group of pixels.
− A group of pixels is motion compensated, such that the motion vector overhead per pixel can
be very small.
− In standard codecs, a block of 16×16 pixels, known as a macroblock (MB) (to be differentiated from 8×8 DCT blocks), is motion estimated and compensated.
− It should be noted that ME is only carried out on the luminance parts of the pictures.
− A scaled version of the same motion vector is used for compensation of chrominance blocks,
depending on the picture format.
A Generic Video Encoder
143. 143
BD – Block Difference
DBD – Displaced Block Difference
BD = (1/256) Σ_MB | c[x, y] − r[x, y] |
DBD = (1/256) Σ_MB | c[x, y] − r[x − dx, y − dy] |
(Figure: the MC/No-MC decision characteristic, with BD on the x-axis and DBD on the y-axis; MC is selected below the line y = x/1.1.)
A Generic Video Encoder
– Not all blocks are motion compensated.
– The mode that generates fewer bits is preferred.
Motion Compensation Decision Characteristic (H.261)
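A sketch of this decision, with the slope 1.1 read off the y = x/1.1 characteristic in the figure (the function names and block values are illustrative, not normative H.261 definitions):

```python
# MC/no-MC decision sketch: motion compensation is used only when the
# displaced block difference (DBD) falls clearly below the block
# difference (BD).
def block_difference(cur, ref):
    """Mean absolute difference per pixel between two blocks."""
    return sum(abs(c - r) for c, r in zip(cur, ref)) / len(cur)

def use_motion_compensation(bd, dbd, slope=1.1):
    """MC is chosen when the point (BD, DBD) lies below the line y = x/slope."""
    return dbd < bd / slope

bd = block_difference([10, 12, 9, 13], [8, 9, 7, 10])      # no displacement: 2.5
dbd = block_difference([10, 12, 9, 13], [10, 12, 10, 13])  # displaced match: 0.25
print(use_motion_compensation(bd, dbd))    # True: MC pays off here
print(use_motion_compensation(1.0, 0.95))  # False: DBD barely better than BD
```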
144. 144
Inter/Intra Switch
− Every MB is either interframe or intraframe coded, called inter/intra MBs.
− The decision on the type of MB depends on the coding technique.
− Sometimes it might be advantageous to intraframe code an MB rather than interframe code it.
There are at least two reasons for intraframe coding:
I. At scene cuts, or in the event of violent motion, interframe prediction errors may not be smaller than intraframe ones; hence, intraframe coding might yield a lower bit rate.
II. Intraframe coded pictures have a better error resilience to channel errors.
A Generic Video Encoder
145. 145
Inter/Intra Switch
− In interframe coding in the event of channel error, the error
propagates into the subsequent frames. If that part of the
picture is not updated, the error can persist for a long time.
− The variance of the intraframe MB is compared with the variance of the interframe MB (motion compensated or not) with respect to the previous frame, and the mode with the smaller variance is chosen.
• For large variances, there is no preference between the two modes.
• For smaller variances, interframe is preferred.
− The reason is that, in intra mode, the DC coefficients of the blocks have to be quantised with a quantiser without a dead zone and with 8-bit resolution. This increases the bit rate compared to that of the interframe mode, and hence interframe is preferred.
(Figure: the inter/intra decision characteristic, plotting intraframe AC energy against interframe AC energy, alongside the MC/NO_MC mode decision in H.261.)
A Generic Video Encoder
146. 146
DCT
− Every MB is divided into 8×8 luminance and chrominance pixel blocks.
− Each block is then transformed via the DCT.
− There are four luminance blocks in each MB, but the number of chrominance blocks depends
on the colour resolutions (image format).
A Generic Video Encoder
147. 147
Quantiser
− There are two types of quantisers.
• With a dead zone, for the AC coefficients and the DC coefficients of inter MBs
• Without a dead zone, for the DC coefficients of intra MBs
− With a dead zone quantiser, if the modulus (absolute value) of a
coefficient is less than the quantiser step size q, it is set to zero; otherwise, it
is quantised according to quantiser indices.
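The two quantiser types can be sketched as follows (the index mapping is illustrative; the standards define the exact forward and reconstruction rules):

```python
# Sketch of a uniform quantiser with a dead zone (AC coefficients and
# inter-MB DC) versus one without a dead zone (intra-MB DC).
def quantise_dead_zone(coeff, q):
    """Coefficients with |coeff| < q fall in the dead zone and map to index 0."""
    if abs(coeff) < q:
        return 0
    return int(coeff / q)          # quantiser index (truncation toward zero)

def quantise_no_dead_zone(coeff, q):
    """Plain uniform quantiser: every interval, including around 0, is q wide."""
    return round(coeff / q)

q = 8
print([quantise_dead_zone(c, q) for c in (-20, -5, 3, 30)])     # [-2, 0, 0, 3]
print([quantise_no_dead_zone(c, q) for c in (-20, -5, 3, 30)])  # [-2, -1, 0, 4]
```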
Variable length coding
− The quantiser indices are variable length coded, according to the type of
VLC used.
− Motion vectors, as well as the address of coded MBs, are also variable
length coded.
A Generic Video Encoder
148. 148
IQ and IDCT
− To generate a prediction for interframe coding, the quantised DCT coefficients are first inverse
quantised and inverse DCT coded.
− These are added to their previous picture values (after a frame delay by the frame store) to
generate a replica of decoded picture.
− The picture is then used as a prediction for coding of the next picture in the sequence.
Buffer
− The bit rate generated by an interframe coder is variable. (a function of motion of objects and their
details)
− Therefore, to transmit coded video into fixed rate channels, the bit rate has to be regulated. Storing
the coded data in a buffer and then emptying the buffer at the channel rate does this.
− However, if the picture activity is such that the buffer may overflow (violent motion), then a
feedback from the buffer to the quantiser can regulate the bit rate.
A Generic Video Encoder
151. 151
A Generic Video Decoder
− The compressed bitstream, after demultiplexing and Variable Length Decoding (VLD), separates the
motion vectors and the DCT coefficients.
− Motion vectors are used by motion compensation
− The DCT coefficients after the inverse quantisation and IDCT are converted to error data.
− They are then added to the motion-compensated previous frame to reconstruct the decoded picture.
152. 152
Bit Rate Variation
Constant Bit Rate (CBR)
− Quantiser step size and even frame rate may change to adapt the bit rate to channel rate
− Video quality is variable
− Normally a complex structure is used to regulate the bit rate
Variable Bit Rate (VBR)
− Quantiser step size is nearly constant, generating almost constant quality picture
− Difficult to adapt to the channel rate, but suitable for packet-switched network applications (e.g. the Internet)
− No need for bit rate regulation (the codec is simple)
153. − To achieve the requirement of random access, a set of pictures can be defined to form a
Group of Picture (GOP), consisting of a minimum of one I-frame, which is the first frame,
together with some P-frames and/or B-frames.
153
Group of Picture (GOP), Recall
Forward Prediction
Bi-Directional PredictionBackward Prediction
GOP = 12
I: Intra Coded Frame
P: Predictively Coded (Predictive-coded ) Frame
B: Bidirectionally Coded (Bidirectional-coded) Frame
154. Example GOP Structures
MPEG-2: Simple Possibilities
MPEG-2: An I picture is mandatory at least once in a sequence of 132 frames (period_max=132)
154
I B B P B B I B B P B B I
I I I I I I I I I I I I I
I P I P I P I P I P I P I
I P I P I I I P I P I I I
155. Example GOP Structures
I I I I I I I I I I I I I I I I I I (high bit rate, broadcast quality; easy to edit)
I B I B I B I B I B I B I B I B I B
I B B P B B P B B P B B P B B P B B (low bit rate, domestic and transmission quality; no further editing required)
156. Example GOP Structures
I I I I I I I I I I I I I I I I I I (I-frame only, 1-frame GOP; used by Sony IMX)
I B I B I B I B I B I B I B I B I B (IB only, 2-frame GOP; used by Sony Betacam SX)
I B B P B B P B B P B B P B B P B B (long GOP; used by satellite, cable & DVD)
157. 157
How to Choose the Current Frame Type: I, B, or P?
The choice of how to encode the current frame is made by the encoder.
− Scene-change frames should be encoded as I-frames.
− The encoder should never allocate too long a sequence of P or B frames (interframe coding is bad for error resilience).
− B-frames are computationally intensive:
− both forward and backward motion vectors must be computed.
On average, for natural images with fixed quantization intervals:
− Size(I-frame) : Size(P-frame) : Size(B-frame) = 6 : 3 : 2
158. − I and P pictures are called “anchor” pictures
− A GOP is a series of one or more pictures to assist random access into the picture sequence.
− The GOP length is normally defined as the distance between I-pictures, which is represented by
parameter N in the standard codecs.
− The distance between the anchor I/P and P-pictures is represented by M.
− The encoding or transmission order of pictures differs from the display or incoming picture order.
− This reordering introduces delays amounting to several frames at the encoder (equal to the number of B-
pictures between the anchor I- and P-pictures).
− The same amount of delay is introduced at the decoder in putting the transmission/decoding sequence back into its original order. This format inevitably limits the application of MPEG-1 in telecommunications.
− A GOP, in coding, must start with an I picture and in display order, must start with an I or B picture and
must end with an I or P picture.
158
Group of pictures and Reordering
159. − In order to allow B frames to be decoded, frames are re-ordered when the MPEG file is created, so that
when a frame is received, the decoder will already have the required reference frames.
− To encode the frames in display order
I1B1B2P1B3B4P2B5B6P3B7B8P4
− The ordering of the frames in the file would be
I1P1B1B2P2B3B4P3B5B6P4B7B8
− When the decoder receives the P frame, it decodes it, but it would delay displaying the picture, as the
next frame is a B frame
159
Picture Re-ordering, Ex. 1
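The re-ordering rule above can be sketched directly: emit each anchor (I or P) frame first, followed by the run of B-frames that preceded it in display order. A minimal sketch (names are illustrative):

```python
# Display-order to transmission-order re-ordering: every B-frame must follow
# the two anchor (I/P) frames it depends on.
def display_to_transmission(display_order):
    out, pending_b = [], []
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)   # hold B-frames until their future anchor
        else:
            out.append(frame)         # anchor (I or P) goes first...
            out += pending_b          # ...then the B-frames that depend on it
            pending_b = []
    return out + pending_b

disp = ["I1", "B1", "B2", "P1", "B3", "B4", "P2",
        "B5", "B6", "P3", "B7", "B8", "P4"]
print(display_to_transmission(disp))
# ['I1', 'P1', 'B1', 'B2', 'P2', 'B3', 'B4', 'P3', 'B5', 'B6', 'P4', 'B7', 'B8']
```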
160. Picture Re-ordering, Ex. 1
Encoder input / decoder output (display order): I B B P B B P B B I
 1 2 3 4 5 6 7 8 9 10
Encoder output / decoder input (transmission order): I P B B P B B I B B
 1 4 2 3 7 5 6 10 8 9
160
161. 1 2 3 4 5 6 7 8 9 10 11 12 1
Source and Display Order
Transmission Order
161
Picture Re-ordering, Ex. 2
162. Group of pictures and Reordering
162
Display order (temporal reference): 0 1 2 3 4 5 6 7 8 9 10
 I B B P B B P B B I B
Encoding order of frames: 0 3 1 2 6 4 5 9 7 8 12
 I P B B P B B I B B P
(Figure: the GOP structure with N = 8 and M = 3, showing forward, backward, and bidirectional prediction; frame 0 is intra-frame coded.)
Picture Re-ordering, Ex. 3
163. • Give the encoded sequence of the following frames:
I1P1P2B1B2B3B4P3B5I2B6B7B8P4
• Answer
I1P1P2P3B1B2B3B4I2B5P4B6B7B8
163
Picture Re-ordering, Ex. 4
165. Ex: Bit Rate and Compression Ratio
− Consider a video clip encoded in MPEG-1 with a frame rate of 30 frames per second and a group of
pictures with sequence:
I B B P B B P B B P B B P B B .....
− If the size of each I-frame, P-frame and B-frame is 12.5 KB, 6 KB and 2.5 KB respectively, calculate the
average bit rate for the video clip.
− Suppose that the uncompressed frames are each of size 150 KB, find the compression ratio.
165
166. Solution:
− In a GOP, there are 1 I-frame, 4 P-frames, and 10 B-frames.
I-frame = 1 ×12.5 = 12.5 KB
P-frame = 4 × 6 = 24 KB
B-frame = 10 × 2.5 = 25 KB
Size of a GOP = 61.5 KB
In 1 second, there are 30 frames = 2 GOPs = 2 × 61.5 KB = 123 KB
Average bit rate = 123 × 1024 × 8 = 1007616 bit/s
− Overall compression ratio for the video stream
= original/compressed = 15×150 / 61.5 = 36.59.
166
Ex: Bit Rate and Compression Ratio
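The arithmetic of the worked example, spelled out as a check:

```python
# Bit rate and compression ratio for the 15-frame GOP I BB P BB P BB P BB P BB.
i_kb, p_kb, b_kb = 12.5, 6, 2.5
gop = {"I": 1, "P": 4, "B": 10}
gop_size_kb = gop["I"] * i_kb + gop["P"] * p_kb + gop["B"] * b_kb   # 61.5 KB

fps, gop_len = 30, 15
bitrate = (fps / gop_len) * gop_size_kb * 1024 * 8   # 2 GOPs per second
print(f"{bitrate:.0f} bit/s")                        # 1007616 bit/s (about 1 Mbit/s)

uncompressed_kb = 150 * gop_len
print(f"{uncompressed_kb / gop_size_kb:.2f}")        # 36.59 compression ratio
```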
167. 167
(Figure: PSNR versus time for a long-GOP sequence (B I B B P B B I B) compared with an I-frame-only sequence.)
Moving Picture Types Quality
169. 169
To GOP or not to GOP
(Figure: PSNR across the 1st, 5th, and 10th generation for a long-GOP codec versus AVC-Intra100 and AVC-Intra50 cut edits, over content such as still pictures, landscape, fast motion, confetti fall, and flashing lights.)
Long-GOP quality is content dependent.
170. 170
Code and Decode Speed for Inter and Intra Codecs, Examples
Software Codec Performance
171. 171
Code and Decode Speed for Inter and Intra Codecs, Examples
Core i7 4770, 4 core, 8 thread
172. Multi Slice Encoding
172
(Figure: single-CPU model: CPU #0 encodes GOP 0, GOP 1, GOP 2, … sequentially. Multi-CPU model: each picture is split into four slices A–D, encoded in parallel on CPUs #0–#3. 1 GOP = 6 frames is used for the explanation.)
173. Blocking
− Borders of 8x8 blocks become visible in reconstructed frame (Caused by coarse quantization, with
different quantization applied to neighboring blocks.)
Ringing (Echoing, Ghosting)
− Distortions near edges of image (Caused by quantization/truncation of high-frequency
transform(DCT/DWT) coefficients during compression)
173
Original image
Reconstructed image
(with ringing artifacts)
De-blocking and De-ringing Filters
174. Deblocking and Deringing Filters
Low-pass filters are used to smooth the image where artifacts occur.
De-blocking:
− Do Low-pass filtering on the pixels at borders of 8x8 blocks
− One-dimensional filter applied perpendicular to 8x8 block borders
− Can be turned on or off for each block; usually applied together with MC
− Advantage
• Decreases prediction error by smoothing the prediction frame
• Reduces high-frequency artifacts like mosquito effects
− Disadvantage
• Increases complexity & overhead
De-ringing:
− Detect edges of image features
− Adaptively apply 2D filter to smooth out areas near edges
− Little or no filtering applied to edge pixels in order to avoid blurring
174
Deblocking
175. Artifact Reduction: Post-processing vs. In-loop filtering
De-blocking/de-ringing often applied after the decoder
(post-processing)
− Reference frames are not filtered
− Developers free to select best filters for the application
or not filter at all
− It may require an additional frame buffer
De-blocking/de-ringing can be incorporated in the
compression algorithm (in-loop filtering)
− Reference frames are filtered
− Same filters must be applied in encoder and decoder
− Better image quality at very low bit-rates
175
176. Sensitivity to Transmission Errors
− Prediction and variable length coding (VLC) make the video stream very sensitive to transmission errors in the bitstream
− Error in one frame will propagate to subsequent frames
− Bit errors in one part of the bit stream make the following bits undecodable
176
177. Effect of Transmission Errors
177
Example reconstructed video frames from a H.263 coded sequence, subject to packet losses
178. Error Resilient Encoding
− To help the decoder to resume normal decoding after errors occur, the encoder can
• Periodically insert INTRA mode (INTRA refresh)
• Insert resynchronization codewords at the beginning of a group of blocks (GOB)
− More sophisticated error-resilience tools
• Multiple description coding
− Trade-off between efficiency and error-resilience
− Can also use channel coding / retransmission to correct errors
178
179. Error Concealment
− With proper error-resilience tools, a packet loss typically leads to the loss of an isolated segment of a frame
− The lost region can be “recovered” from the received regions by spatial/temporal interpolation → error concealment
− Decoders on the market differ in their error concealment capabilities
179
(Figure: reconstruction without concealment versus with concealment.)
181. 181
Projective Mapping
2-D Motion: Projection of 3-D motion, depending on 3D object motion and projection operator
Optical flow: “Perceived” 2-D motion based on changes in image pattern, also depends on illumination and
object surface texture
On the left, a sphere is rotating under a
constant ambient illumination, but the
observed image does not change.
On the right, a point light source is rotating
around a stationary sphere, causing the
highlight point on the sphere to rotate
185. 185
Two Features of Projective Mapping
− Chirping: increasing perceived spatial frequency for far away objects
− Converging (Keystone): parallel lines converge in distance
(Figure: non-chirping models (original, affine, bilinear) versus chirping models (projective, relative-projective, pseudo-perspective, biquadratic).)
186. 186
Affine and Bilinear Transformation Models
Approximation of projective mapping:
I. Affine (6 parameters): Good for mapping triangles to triangles
II. Bilinear (8 parameters): Good for mapping blocks to quadrangles
187. 187
Perspective and Pixel-Coordinate Transformation Models
Approximation of projective mapping:
I. Perspective transformation
II. Pixel-coordinate transformation
Perspective:
x′1 = (a1·x1 + a2·x2 + a3) / (a7·x1 + a8·x2 + 1)
x′2 = (a4·x1 + a5·x2 + a6) / (a7·x1 + a8·x2 + 1)
Eight motion parameters: a1, a2, a3, a4, a5, a6, a7, a8
Shift:
x′1 = x1 + d1,  x′2 = x2 + d2
Two motion parameters: d1, d2
The simplest block motion model, and the one mostly used for block-based motion compensation!
191. 191
Optical flow or optic flow
− It is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by
the relative motion between an observer (an eye or a camera) and the scene
Methods for determining optical flow
− Phase correlation – inverse of normalized cross-power spectrum
− Block-based methods – minimizing sum of squared differences or sum of absolute differences,
or maximizing normalized cross-correlation
Optical flow
192. 192
Optical Flow Equation
When the illumination condition is unknown, the best one can do is to estimate the optical flow.
Constant intensity assumption → optical flow equation
Under the constant intensity assumption, with brightness or intensity ψ:
ψ(x + dx, y + dy, t + dt) = ψ(x, y, t)
But, using Taylor's expansion:
ψ(x + dx, y + dy, t + dt) ≈ ψ(x, y, t) + (∂ψ/∂x)dx + (∂ψ/∂y)dy + (∂ψ/∂t)dt
Comparing the two, we obtain the optical flow equation:
(∂ψ/∂x)·vx + (∂ψ/∂y)·vy + ∂ψ/∂t = 0, where (vx, vy) = (dx/dt, dy/dt)
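A tiny numeric check of the optical flow equation: take a brightness pattern that translates with velocity (vx, vy), estimate the partial derivatives by finite differences, and verify that ψx·vx + ψy·vy + ψt vanishes. A linear ramp is used so the Taylor expansion is exact; the pattern and velocity are illustrative:

```python
# Numeric check of the optical flow equation under constant intensity.
def psi(x, y, t, v=(1.0, 2.0)):
    """Linear brightness ramp moving with velocity v (constant intensity)."""
    return 3.0 * (x - v[0] * t) + 5.0 * (y - v[1] * t)

x = y = t = 0.0
h = 1e-3
psi_x = (psi(x + h, y, t) - psi(x - h, y, t)) / (2 * h)  # ~3
psi_y = (psi(x, y + h, t) - psi(x, y - h, t)) / (2 * h)  # ~5
psi_t = (psi(x, y, t + h) - psi(x, y, t - h)) / (2 * h)  # ~-13

vx, vy = 1.0, 2.0
residual = psi_x * vx + psi_y * vy + psi_t
print(abs(residual) < 1e-6)  # True: the flow satisfies the OF equation
```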
193. 193
Ambiguities in Motion Estimation
− The optical flow equation only constrains the flow component in the gradient direction (v_n)
− The flow component in the tangent direction (v_t) is under-determined
− In regions with constant brightness (∇ψ = 0), the flow is indeterminate → motion estimation is unreliable in regions with flat texture and more reliable near edges
194. 194
General Considerations for Motion Estimation
Two categories of approaches:
− Feature based (more often used in object tracking, 3D reconstruction from 2D)
− Intensity based (based on constant intensity assumption) (more often used for motion compensated
prediction, required in video coding, frame interpolation) → Our focus
Three important questions
− How to represent the motion field?
− What criteria to use to estimate motion parameters?
− How to search motion parameters?
195. 195
Motion Representation
Global:
Entire motion field is represented
by a few global parameters
Pixel-based:
One MV at each pixel, with some
smoothness constraint between
adjacent MVs.
Region-based:
Entire frame is divided into regions,
each region corresponding to an
object or sub- object with consistent
motion, represented by a few
parameters.
Block-based:
Entire frame is divided into blocks,
and motion in each block is
characterized by a few
parameters.
196. 196
Notations
Anchor frame: ψ1(x)
Target frame: ψ2(x)
Motion parameters: a
Motion vector at pixel x in the anchor frame: d(x)
Motion field: d(x; a), x ∈ Λ
Mapping function: w(x; a) = x + d(x; a), x ∈ Λ
198. 198
Relation Among Different Criteria
− OF (Optical Flow) criterion is good only if motion is small.
− OF criterion can often yield closed-form solution as the objective function is quadratic in MVs.
− When the motion is not small, can iterate the solution based on the OF criterion to satisfy the DFD criterion.
− Bayesian criterion can be reduced to the DFD criterion plus motion smoothness constraint
− More in the textbook
199. 199
Optimization Methods
Exhaustive search
– Typically used for the DFD criterion with p=1 (MAD)
– Guarantees reaching the global optimum
– Computation required may be unacceptable when number of parameters to search simultaneously is large!
– Fast search algorithms reach sub-optimal solution in shorter time
Gradient-based search
– Typically used for the DFD or OF criterion with p=2 (MSE)
− The gradient can often be calculated analytically
− When used with the OF criterion, closed-form solution may be obtained
– Reaches the local optimal point closest to the initial solution
Multi-resolution search
− Searches from coarse to fine resolution; faster than exhaustive search
− Avoids being trapped in a local minimum
200. 200
Gradient-based Search
Iteratively update the current estimate in the direction opposite the gradient direction.
• The solution depends on the initial condition. Reaches the local minimum closest to the initial condition
• Choice of step size:
– Fixed stepsize: the stepsize must be small to avoid oscillation, which requires many iterations
– Steepest gradient descent (adjusts the stepsize optimally)
201. 201
Block-Based Motion Estimation: Overview
• Assume all pixels in a block undergo a coherent motion, and search for the motion
parameters for each block independently
• Block matching algorithm (BMA): assumes translational motion, 1 MV per block (2 parameters)
– Exhaustive BMA (EBMA)
– Fast algorithms
• Deformable block matching algorithm (DBMA): allow more complex motion (affine,
bilinear), to be discussed later.
202. 202
Block Matching Algorithm
• Overview:
– Assume all pixels in a block undergo a translation, denoted by a single MV
– Estimate the MV for each block independently, by minimizing the DFD error over this block
• Minimizing function: E_B(d) = Σ_{x∈B} | ψ2(x + d) − ψ1(x) |^p
• Optimization method:
– Exhaustive search (feasible, as one only needs to search one MV at a time), using the MAD criterion (p = 1)
– Fast search algorithms
• Integer vs. fractional-pel accuracy search
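A minimal exhaustive BMA sketch with the MAD (p = 1) criterion on plain 2-D lists (all names are illustrative):

```python
# Exhaustive block matching (EBMA): every integer MV in a +/-R range is
# tested for one block, and the displacement with the smallest MAD wins.
def mad(anchor, target, bx, by, dx, dy, n):
    """Mean absolute difference for an n x n block at (bx, by), displaced by (dx, dy)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(anchor[by + j][bx + i] - target[by + dy + j][bx + dx + i])
    return total / (n * n)

def ebma(anchor, target, bx, by, n, R):
    """Exhaustive search for the MV of one block; returns the best (dx, dy)."""
    H, W = len(target), len(target[0])
    best_err, best_mv = None, (0, 0)
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            if not (0 <= bx + dx and bx + dx + n <= W and
                    0 <= by + dy and by + dy + n <= H):
                continue  # candidate block falls outside the frame
            err = mad(anchor, target, bx, by, dx, dy, n)
            if best_err is None or err < best_err:
                best_err, best_mv = err, (dx, dy)
    return best_mv

anchor = [[i * i + 3 * j for i in range(8)] for j in range(8)]
target = [[row[-1]] + row[:-1] for row in anchor]  # anchor shifted right by 1 pixel
print(ebma(anchor, target, bx=2, by=2, n=4, R=2))  # (1, 0)
```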
206. 206
Fractional Accuracy EBMA
• Real MVs may not always be whole multiples of a pixel. To allow sub-pixel MVs, the search stepsize must be less than 1 pixel
• Half-pel EBMA: stepsize = 1/2 pixel in both dimensions
• Difficulty:
– The target frame only has integer pels
• Solution:
– Interpolate the target frame by a factor of two before searching
– Bilinear interpolation is typically used
• Complexity:
– 4 times that of integer-pel search, plus additional operations for interpolation
• Fast algorithms:
– Search at integer precision first, then refine within a small search region at half-pel accuracy
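The interpolate-then-search idea can be sketched as a factor-of-two bilinear upsampler (a sketch only; real codecs define their exact interpolation filters):

```python
# Factor-of-two bilinear upsampling of a frame, so MVs can be searched
# in half-pixel steps on the upsampled grid.
def upsample2_bilinear(img):
    """Return an image of size (2H-1) x (2W-1) with bilinear half-pel samples."""
    H, W = len(img), len(img[0])
    up = [[0.0] * (2 * W - 1) for _ in range(2 * H - 1)]
    for y in range(H):
        for x in range(W):
            up[2 * y][2 * x] = float(img[y][x])        # integer-pel samples
    for y in range(0, 2 * H - 1, 2):                   # horizontal half-pels
        for x in range(1, 2 * W - 1, 2):
            up[y][x] = (up[y][x - 1] + up[y][x + 1]) / 2
    for y in range(1, 2 * H - 1, 2):                   # vertical half-pels
        for x in range(2 * W - 1):
            up[y][x] = (up[y - 1][x] + up[y + 1][x]) / 2
    return up

print(upsample2_bilinear([[0, 4], [8, 12]]))
# [[0.0, 2.0, 4.0], [4.0, 6.0, 8.0], [8.0, 10.0, 12.0]]
```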
207. 207
Sub-pixel Motion Compensation
− In the first stage, motion estimation finds the best match
on the integer pixel grid (circles).
− The encoder searches the half-pixel positions
immediately next to this best match (squares) to see
whether the match can be improved and if required, the
quarter-pixel positions next to the best half-pixel position
(triangles) are then searched.
− The final match, at an integer, half-pixel or quarter-pixel
position, is subtracted from the current block or
macroblock.
212. 212
Pros and Cons with EBMA
• Blocking effect (discontinuity across block boundaries) in the predicted image
– Because the block-wise translation model is not accurate. Fix: deformable BMA (next lecture)
• Motion field somewhat chaotic
– because MVs are estimated independently from block to block
– Fix 1: Mesh-based motion estimation (next lecture)
– Fix 2: Imposing smoothness constraint explicitly
• Wrong MVs in flat regions
– because motion is indeterminate when the spatial gradient is near zero
• Nonetheless, EBMA is widely used for motion-compensated prediction in video coding
– because of its simplicity and its optimality in minimizing the prediction error
213. 213
Fast Algorithms for BMA
• Key idea to reduce the computation in EBMA:
– Reduce # of search candidates:
• Only search for those that are likely to produce small errors.
• Predict possible remaining candidates, based on previous search result
– Simplify the error measure (DFD) to reduce the computation involved for each candidate
• Classical fast algorithms (save a lot of computation, but are not as accurate as EBMA)
– Three-step
– 2D-log
– Conjugate direction
– The characteristics of fast algorithms
• Many new fast algorithms have been developed since then
– Some suitable for software implementation, others for VLSI implementation (memory access, etc)
215. 215
Three Step Search (TSS) Method
TOTAL = 9 + 8 + 8 = 25 search points
Step 1 (“O”): 9 search points; find the minimum point (1st MSB of the MV): (x, y) = (a00, b00)
Step 2 (“Δ”): 8 search points; find the minimum point (2nd MSB): (x, y) = (−1c0, +1d0)
Step 3 (“□”): 8 points; find the minimum point
a and b can be +1, 0, or −1; c and d can be +1, 0, or −1
The TSS method is also called the logarithmic search method!
(Figure: first-step, second-step, and third-step search points.)
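The three steps can be sketched as follows for a ±7 search range, with a placeholder `cost` function standing in for the block-matching error (MAD); the true displacement (2, 6) is an illustrative example:

```python
# Three-step search: start with step 4, evaluate the 8 neighbours of the
# current best point at each step, then halve the step size.
def three_step_search(cost, start=(0, 0)):
    best = start
    checked = 1                              # centre of step 1
    for step in (4, 2, 1):
        candidates = [(best[0] + sx * step, best[1] + sy * step)
                      for sx in (-1, 0, 1) for sy in (-1, 0, 1)
                      if (sx, sy) != (0, 0)]
        checked += len(candidates)
        best = min(candidates + [best], key=cost)
    return best, checked

true_mv = (2, 6)
cost = lambda p: abs(p[0] - true_mv[0]) + abs(p[1] - true_mv[1])
mv, n = three_step_search(cost)
print(mv, n)  # (2, 6) 25
```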
218. 218
Three-Step Search Algorithm Example
(Figure: a ±6 × ±6 search window; the label n marks a search point of step n.)
The best matching MVs in steps 1–3 are (3,3), (3,5),
and (2,6). The final MV is (2,6). From [Musmann85].
220. Four-Step Search (4SS) Method
Step 1: search 9 points on a 5×5 grid (same centre as the TSS). The minimum point falls either on 1. a corner, 2. the centre, or 3. a side.
Go to Step 2.
220
221. Four-Step Search (4SS) Method
Steps 2 and 3: re-centre the 5×5 pattern on the current minimum point.
− If the minimum is at a corner: check 5 new points.
− If the minimum is at a side: check 3 new points.
− If the minimum is at the centre: switch to the final 8-point (3×3) step.
Go to Step 3 if the minimum is not at the centre; go to Step 4 (the final step) if it is.
221
228. Cross-search Algorithm (CSA)
An example of the CSA (cross-search
algorithm) search for w=8 pixels/frame
• Another method of fast BMA is the cross-
search algorithm (CSA) . In this method, the
basic idea is still a logarithmic step search,
but with some differences, which lead to
fewer computational search points.
• The main difference is that at each iteration
there are four search locations, which are the
end points of a cross (×) rather than (+).
• Also, at the final stage, the search points can
be either the end points of (×) or (+) crosses,
as shown in Figure.
• For a maximum motion displacement of w pixels/frame, the total number of computations becomes 5 + 4·log2(w).
228
229. 229
Step 1: 9 points (large diamond) ⇒ three cases:
1. Centre
2. Side
3. Corner
Steps 2 and 3: re-centre the large diamond on the current minimum.
Final step: with the shrunk (small) diamond (entered when the minimum is at the centre)
Minimum search points = 9 + 4 = 13 (Step 1 minimum at the centre)
Maximum search points = ? (33)
Diamond Search
236. 236
2D-Log Search, Example
(Figure: a ±6 × ±6 search window; the label n marks a search point of step n. The best-matching MVs in steps 1–5 are (0,2), (0,4), (2,4), (2,6), and (2,6); the final MV is (2,6). From [Musmann85].)
237. Hexagon-Based Search Algorithm
Step 1: The large hexagon with seven checking points is centered at (0,0), the center of a predefined search window in the motion field. If the minimum block distortion (MBD) point is found at the center of the hexagon, go to Step 3 (Ending); otherwise, go to Step 2 (Searching).
Step 2: With the MBD point of the previous search step as the center, a new large hexagon is formed. Three new candidate points are checked, and the MBD point is again identified. If the MBD point is at the center of the new hexagon, go to Step 3 (Ending); otherwise, repeat this step.
Step 3: Switch the search pattern from the large to the small hexagon. The four points covered by the small hexagon are evaluated and compared with the current MBD point. The new MBD point is the final motion vector.
239. Number of Search Points
Diamond Search: minimum search points = 9 + 4 = 13.
Number of search points: N_DS = 9 + M×n′ + 4, where M = 5 or 3 and n′ is the number of steps.
Hexagon Search: minimum search points = 7 + 4 = 11.
Number of search points: N_HEXBS = 7 + 3×n + 4, where n is the number of steps.
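The two counting formulas can be checked with a few lines of Python. The function names `n_ds` and `n_hexbs` are made up for this sketch.

```python
def n_ds(n_steps, m=5):
    """Diamond search: 9 initial points + M per refinement step + 4 final."""
    return 9 + m * n_steps + 4

def n_hexbs(n_steps):
    """Hexagon search: 7 initial points + 3 per refinement step + 4 final."""
    return 7 + 3 * n_steps + 4

print(n_ds(0), n_hexbs(0))  # minimums with zero refinement steps: 13 11
```

For the same number of refinement steps the hexagon pattern checks fewer new points (3 versus up to 5), which is why its minimum count (11) sits below the diamond's (13).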
242. Complexity/Performance: Computational Complexity

Algorithm   Maximum number of search points   w=4    w=8    w=16
TDL         2 + 7·log₂(w)                      16     23     30
TSS         1 + 8·log₂(w)                      17     25     33
MMEA        1 + 6·log₂(w)                      13     19     25
CDS         3 + 2w                             11     19     35
CSA         5 + 4·log₂(w)                      13     17     21
FSM         (2w + 1)²                          81    289   1089
OSA         1 + 4·log₂(w)                       9     13     17
The range of motion speeds is w = 4 to 16 pixels/frame.
− The accuracy of the motion estimation is expressed in terms of errors: maximum absolute error, root-mean-square error, average error, and standard deviation.
FSM: Full Search Mode
TDL: two-dimensional logarithmic
TSS: three-step search
MMEA: modified motion estimation algorithm
CDS: conjugate direction search
OSA: orthogonal search algorithm
CSA: cross-search algorithm
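The "maximum number of search points" columns can be regenerated directly from the formulas in the table. A quick Python check, assuming the formulas as printed:

```python
from math import log2

# Maximum number of search points per algorithm (w = max displacement).
formulas = {
    "TDL":  lambda w: 2 + 7 * log2(w),
    "TSS":  lambda w: 1 + 8 * log2(w),
    "MMEA": lambda w: 1 + 6 * log2(w),
    "CDS":  lambda w: 3 + 2 * w,
    "CSA":  lambda w: 5 + 4 * log2(w),
    "FSM":  lambda w: (2 * w + 1) ** 2,
    "OSA":  lambda w: 1 + 4 * log2(w),
}
for name, f in formulas.items():
    print(f"{name:4s}", [int(f(w)) for w in (4, 8, 16)])
```

For instance, FSM at w = 16 gives (2·16 + 1)² = 1089 points, which makes the appeal of the logarithmic searches obvious.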
243. Complexity/Performance: Compensation Efficiency

            Split Screen               Trevor White
Algorithm   Entropy     Standard       Entropy     Standard
            (bits/pel)  deviation      (bits/pel)  deviation
FSM         4.57        7.39           4.41        6.07
TDL         4.74        8.23           4.60        6.92
TSS         4.74        8.19           4.58        6.86
MMEA        4.81        8.56           4.69        7.46
CDS         4.84        8.86           4.74        7.54
OSA         4.85        8.81           4.72        7.51
CSA         4.82        8.65           4.68        7.42
The motion compensation efficiencies of some algorithms for a motion speed of w = 8 pixels/frame, for two test image sequences (Split Screen and Trevor White).
246. Multi-resolution Motion Estimation or Hierarchical Motion Estimation
• Problems with BMA
– Unless exhaustive search is used, the solution may not be the global minimum
– Exhaustive search requires extremely large computation
– The block-wise translational motion model is not always appropriate
• Multiresolution approach
– Aim to solve the first two problems
– First estimate the motion in a coarse resolution over low-pass filtered, down-sampled image pair
• Can usually lead to a solution close to the true motion field
– Then modify the initial solution in successively finer resolution within a small search range
• Reduce the computation
– Can be applied to different motion representations, but we will focus on its application to BMA
247.
− The assumption of monotonic variation of image intensity employed in the fast BMAs often causes
false estimations, especially for larger picture displacements.
− These methods perform well for slow moving objects, such as those in video conferencing.
− However, for higher motion speeds, due to the intrinsic selective nature of these methods, they
often converge to a local minimum of distortion.
− One method of alleviating this problem is to subsample the image to smaller sizes, such that the
motion speed is reduced by the sampling ratio.
− The process is done on a multilevel image pyramid, known as the Hierarchical Block Matching
Algorithm (HBMA).
− In this technique, pyramids of the image frames are constructed by successive two-dimensional filtering and subsampling of the current and past image frames.
Multi-resolution Motion Estimation or Hierarchical Motion Estimation
251.
− Conventional block matching with a block size of 16 pixels, either full search or any fast method, is
first applied to the highest level of the pyramid (level 2).
− This motion vector is then doubled in size, and further refinement within a ±1-pixel search range is carried out at the next level. The process is repeated down to the lowest level.
− Therefore, with an n-level pyramid, the maximum motion speed of w at the highest level is reduced to w / 2^(n−1).
− For example, a maximum motion speed of 32 pixels/frame with a three-level pyramid is reduced to 8 pixels/frame, which is quite manageable by any fast search method.
− Note that this method can also be regarded as another type of fast search, with a performance very close to that of the full search irrespective of the motion speed, while the computational complexity remains close to that of the fast logarithmic methods.
Multi-resolution Motion Estimation or Hierarchical Motion Estimation
252.
• The number of pyramid levels is L; ψ_{1,l}(x) and ψ_{2,l}(x), x ∈ Λ_l, l = 1, 2, …, L, denote the l-th level images of the anchor and target frames, where Λ_l is the set of pixels at level l
• At the l-th level, the MV to be estimated is d_l(x)
• The initial estimate, interpolated (upsampled) from the previous coarser level, is d̃_l(x) = U(d_{l−1}(x))
• Determine the update q_l(x) that minimizes the error
  E = Σ_{x∈Λ_l} |ψ_{2,l}(x + d̃_l(x) + q_l(x)) − ψ_{1,l}(x)|^p
• The new motion vector is d_l(x) = d̃_l(x) + q_l(x)
Hierarchical Block Matching Algorithm (HBMA)
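A two-level version of this coarse-to-fine procedure can be sketched as follows. This is pure Python with full search standing in for any BMA at each level; the helper names, the 2×2-averaging pyramid, and the block sizes are all illustrative assumptions.

```python
# Two-level sketch of hierarchical block matching (HBMA) on 2-D lists.

def downsample(img):
    """Halve resolution by 2x2 averaging (the pyramid's low-pass + subsample)."""
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) // 4
             for x in range(len(img[0]) // 2)]
            for y in range(len(img) // 2)]

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences; infinity when the block leaves `ref`."""
    h, w = len(ref), len(ref[0])
    if not (0 <= bx + dx and bx + dx + n <= w and
            0 <= by + dy and by + dy + n <= h):
        return float("inf")
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))

def full_search(cur, ref, bx, by, n, w, cx=0, cy=0):
    """Exhaustive search of +/-w around an initial vector (cx, cy)."""
    return min(((cx + dx, cy + dy)
                for dy in range(-w, w + 1) for dx in range(-w, w + 1)),
               key=lambda d: sad(cur, ref, bx, by, d[0], d[1], n))

def hbma(cur, ref, bx, by, n=8, w=8):
    """Estimate at half resolution with range w/2, double the MV, refine +/-1."""
    dxc, dyc = full_search(downsample(cur), downsample(ref),
                           bx // 2, by // 2, n // 2, w // 2)
    return full_search(cur, ref, bx, by, n, 1, 2 * dxc, 2 * dyc)
```

Doubling the coarse MV and then refining within a small range mirrors the update step d_l(x) = d̃_l(x) + q_l(x) above.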
255.
Overview of DBMA
• Three steps:
– Partition the anchor frame into regular blocks
– Model the motion in each block by a more complex motion model
• The 2-D motion caused by a flat surface patch undergoing rigid 3-D motion can be
approximated well by projective mapping
• Projective Mapping can be approximated by affine mapping and bilinear mapping
• Various possible mappings can be described by a node-based motion model
– Estimate the motion parameters block by block independently
• Discontinuity problems across block boundaries still remain
• Still cannot solve the problem of multiple motions within a block or changes due to illumination effects
256.
Affine and Bilinear Model
Approximation of projective mapping:
I. Affine (6 parameters): Good for mapping triangles to triangles
II. Bilinear (8 parameters): Good for mapping blocks to quadrangles
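In coordinates, the two approximations look like this (generic parameter names, not a specific standard's notation):

```python
def affine(p, x, y):
    """Affine mapping: 6 parameters, maps triangles to triangles."""
    a0, a1, a2, b0, b1, b2 = p
    return (a0 + a1 * x + a2 * y, b0 + b1 * x + b2 * y)

def bilinear(p, x, y):
    """Bilinear mapping: 8 parameters; the xy cross terms let it map a
    square to a general quadrangle."""
    a0, a1, a2, a3, b0, b1, b2, b3 = p
    return (a0 + a1 * x + a2 * y + a3 * x * y,
            b0 + b1 * x + b2 * y + b3 * x * y)
```

Setting a3 = b3 = 0 reduces the bilinear model to the affine one, which is why both are viewed as approximations of the projective mapping.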
257. Node-Based Motion Model
Control nodes in this example: block corners
Motion at other points is interpolated from the nodal MVs d_{m,k}
Control-node MVs can be described with integer or half-pel accuracy; all nodes have the same importance
Translation, affine, and bilinear are special cases of this model
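The interpolation from the four corner-node MVs can be sketched with bilinear weights (an illustrative sketch; the node ordering and names are assumptions):

```python
def interp_mv(nodes, x, y, n):
    """nodes: MVs at block corners (0,0), (n,0), (0,n), (n,n); 0 <= x, y <= n.
    Returns the interpolated MV at pixel (x, y) inside the n x n block."""
    u, v = x / n, y / n
    (d00x, d00y), (d10x, d10y), (d01x, d01y), (d11x, d11y) = nodes
    w = [(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v]  # bilinear weights
    dx = w[0] * d00x + w[1] * d10x + w[2] * d01x + w[3] * d11x
    dy = w[0] * d00y + w[1] * d10y + w[2] * d01y + w[3] * d11y
    return dx, dy
```

With all four nodal MVs equal, the model degenerates to pure translation, matching the "special cases" remark above.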
258.
Problems with DBMA
• Motion discontinuity across block boundaries, because nodal MVs are estimated
independently from block to block
– Fix: mesh-based motion estimation
• Cannot do well on blocks with multiple moving objects or changes due to illumination effects
– Three mode method
• First apply EBMA to all blocks
• Blocks with small EBMA errors have translational motion
• Blocks with large EBMA errors may have non-translational motion
– First apply DBMA to these blocks
– Blocks still having errors are non-motion compensable
• [Ref] O. Lee and Y. Wang, "Motion compensated prediction using nodal based deformable block matching," J. Visual Communications and Image Representation, 6(1):26–34, March 1995.
259.
Mesh-Based Motion Estimation: Overview
− MPEG-4 object motion
− Affine warping motion model
− Deformable polygon meshes
− Similar MAD, SSE error measures
− Trade-offs: more accurate ME vs. tremendous complexity
− Bilinear and perspective motion models are rarely used in video coding
262.
Mesh-Based Motion Model
• The motion in each element is interpolated from nodal MVs
• Mesh-based vs. node-based model:
– Mesh-based: Each node has a single MV, which influences the motion of all four adjacent elements
– Node-based: Each node can have four different MVs, depending on which element it is considered within