2. Section I
− Perceptual Artifacts in Compressed Video
− Quality Assessment in Compressed Video
− Objective Assessment of Compressed Video
− Objective Assessment of Compressed Video, Codec Assessment for Production
Section II
− Subjective Assessment of Compressed Video
− Subjective Assessment of Compressed Video, Subjective Assessment by Expert Viewing
− Performance Comparison of Video Coding Standards: An Adaptive Streaming Perspective
− Subjective Assessment by Visualizer™ Test Pattern
2
Outline
4. Blockiness, Blurriness, Exposure, Interlace, Noisiness,
Framing & Pillar-/Letter-Boxing, Flickering, Blackout, Ringing, Ghosting,
Brightness, Contrast, Freezing, Block Loss, Slicing
Some Video Artifacts (Baseband and Compressed)
5. − Consumers’ expectations for better Quality-of-Experience (QoE) have been higher than ever before.
− Constraints on the resources available for codec optimization often lead to degradation of perceptual
quality by introducing compression artifacts into the decoded video.
− Objective VQA techniques have been designed to automatically evaluate the perceptual quality of
compressed video streams.
[Diagram: the codec seeks an optimal compromise between the availability of resources (bandwidth, power, and time) and perceptual quality]
5
Compression Artifacts
7. − Location-based (spatial): If you can see the artifact when the video is paused, then it’s probably a spatial
artifact.
− Time/sequence-based (temporal): If it’s much more visible while the video plays, then it’s likely temporal.
• Many temporal artifacts in inter-frame coding algorithms originate from the propagation of
compression losses to subsequent frame predictions and from “rounding on rounding”.
− These include those artifacts generated
I. during video acquisition (e.g., camera noise, camera motion blur, and line/frame jittering)
II. during video transmission in error-prone networks (e.g., video freezing, jittering, and erroneously
decoded blocks caused by packet loss and delay)
III. during video post-processing and display (e.g., post deblocking and noise filtering, spatial scaling,
retargeting, chromatic aberration, and pincushion distortion)
7
Temporal vs. Spatial Artifacts
8. − Block-based video coding schemes create various spatial artifacts due to block partitioning and
quantization. These artifacts include
• Blurring
• Blocking
• Ringing
• Basis Pattern Effect
• Color Bleeding
− They are detected without reference to temporally neighboring frames, and thus can be better
identified when the video is paused.
− Due to the complexity of modern compression techniques, these artifacts are interrelated, and the
classification here is based mainly on their visual appearance.
8
Spatial Artifacts
[Diagram: Spatial Artifacts → Blurring, Blocking, Ringing, Basis Pattern Effect, Color Bleeding]
9. 9
Blurring (Fuzziness or Unsharpness)
− Blurring of an image refers to a smoothing of its details and edges (Reduction in sharpness of edges, spatial details)
• Caused by quantization/truncation of high-frequency transform (DCT/DWT) coefficients during compression
− More noticeable around edges, textured regions
12. 1- Removing high spatial frequencies after transform and quantization
− Blurring is a result of loss of high spatial frequency image detail, typically at sharp edges.
− Colloquially referred to as “fuzziness” or “unsharpness”.
− It makes discrete objects – as opposed to the entire video – appear out of focus.
− Since the energy of natural visual signals concentrates at low frequencies, quantization reduces high-
frequency energy in such signals, resulting in a significant blurring effect in the reconstructed signals.
12
Blurring
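A minimal sketch (not from the slides) of this mechanism: coarsely quantizing the 2-D DCT coefficients of an 8×8 block zeroes out high-frequency terms and softens a sharp edge. The block contents and the quantization step `q` are illustrative.

```python
import numpy as np

N = 8
k = np.arange(N)
# Orthonormal DCT-II basis matrix, as used in block transform coding
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * N))
C[0, :] /= np.sqrt(2.0)

block = np.zeros((N, N))
block[:, 4:] = 255.0                     # sharp vertical edge

coef = C @ block @ C.T                   # forward 2-D DCT
q = 500.0                                # very coarse quantization step (illustrative)
coef_q = np.round(coef / q) * q          # quantize and dequantize
rec = C.T @ coef_q @ C                   # inverse 2-D DCT

# High-frequency coefficients are zeroed, and the reconstructed edge is no
# longer a full 0-to-255 jump between adjacent columns.
max_step = np.abs(np.diff(rec, axis=1)).max()
```

With this step size, only the lowest-frequency coefficients survive quantization, so the reconstructed transition is spread across several pixels instead of one.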
13. 2- Smoothing by in-loop de-blocking filtering
− Another source of the blurring effect is in-loop de-blocking filtering, which is employed to reduce the blocking
artifact across block boundaries and is adopted as an option by state-of-the-art video coding standards
such as H.264/AVC and HEVC.
− The de-blocking operators are essentially spatially adaptive low-pass filters that smooth the block
boundaries, and thus produce a perceptual blurring effect.
Note:
− Sometimes, blurring is intentionally introduced by using a Gaussian function to reduce image noise or to
enhance image structures at different scales.
− Typically, this is done as a pre-processing step before compression algorithms may be applied,
attenuating high-frequency signals and resulting in more efficient compression.
13
Blurring
14. a) Reference frame
b) Compressed frame with de-blocking filter turned off
c) Compressed frame with de-blocking filter turned on
14
Blurring
15. Motion Blur
− It appears in the direction of motion corresponding to rapidly moving objects in a still image or a video.
− It happens when the image being recorded changes position (or the camera moves) during the recording of a
single frame, because of either rapid movement of objects or long exposure of slow-moving objects.
− One way to avoid motion blur is by panning the camera to track the moving objects, so the object remains
sharp but the background is blurred instead.
− Graphics, image, or video editing tools may also generate the motion blur effect for artistic reasons (e.g.,
computer-generated imagery (CGI)).
15
Blurring
16. 16
Blocking or Blockiness
− Visibility of underlying block encoding structure (false discontinuities across block boundaries)
• Caused by coarse quantization, with different quantization applied to neighboring blocks.
− More visible in smoother areas of picture
21. − Blocking is known by several names: tiling, jaggies, mosaicing, pixelating, quilting and checkerboarding.
− It is frequently seen in video compression standards, which use blocks of various sizes as the basic units for
frequency transformation, quantization and motion estimation/compensation, thus producing false
discontinuities across block boundaries.
− It occurs whenever a complex (compressed) image is streamed over a low bandwidth connection.
− At decompression, the output of certain decoded blocks makes surrounding pixels appear averaged
together and look like larger blocks.
− As displays increase in size, blocking typically becomes more visible.
− However, an increase in resolution makes blocking artifacts smaller in terms of the image size and
therefore less visible at a given viewing distance.
21
Blocking or Blockiness
22. − The lower the bit rate, the more coarsely the block is quantized,
producing blurry, low-resolution versions of the block.
− In the extreme case, only the DC coefficient, representing the
average of the data, is left for a block, so that the reconstructed
block is only a single color region.
− The DC values vary from block to block.
− The block boundary artifact is the result of independently
quantizing the blocks of transform coefficients.
− Neighboring blocks quantize their coefficients separately, leading
to discontinuities at the reconstructed block boundaries.
− These block-boundary discontinuities are usually visible,
especially in flat color regions such as the sky and faces,
where there is little detail to mask the discontinuity.
22
Blocking or Blockiness
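The DC-only extreme described above can be sketched in a few lines: reducing two neighboring blocks of a smooth ramp to their block means (their DC values) leaves flat tiles and a false edge at the boundary. The signal and block size are illustrative.

```python
import numpy as np

ramp = np.linspace(0.0, 100.0, 16)             # smooth gradient spanning two 8-pixel blocks
blocks = ramp.reshape(2, 8)
dc_only = np.repeat(blocks.mean(axis=1), 8)    # each block reduced to its DC (mean) value

inner_step = np.abs(np.diff(dc_only.reshape(2, 8), axis=1)).max()  # inside blocks: flat
boundary_step = abs(dc_only[8] - dc_only[7])                       # at the boundary: a jump
original_step = np.abs(np.diff(ramp)).max()                        # gentle per-pixel change
```

The per-pixel change in the original ramp is small, but the DC-only reconstruction concentrates the entire cross-block difference into one visible discontinuity.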
23. [Diagram: Blocking → Mosaic Effect, Staircase Effect, False Edge]
− Although all blocking effects are generated for similar reasons, their visual appearance may be
different, depending on the region where blockiness occurs.
− Therefore, we further classify the blocking effects into three subcategories.
23
Blocking or Blockiness
[Figure: reference frame, and compressed frames illustrating false edge, staircase effect, and mosaic effect]
24. − The mosaic effect usually occurs where there are luminance transitions in large low-
energy regions (e.g., walls, black/white boards, and desk surfaces).
− Due to quantization within each block, nearly all AC coefficients are
quantized to zero, and thus each block is reconstructed as a constant DC
block, where the DC values vary from block to block.
− When all blocks are put together, the mosaic effect manifests as abrupt
luminance changes from one block to another across space.
− The mosaic effect is highly visible and annoying to the visual system, because
the visual masking effect is weakest in smooth regions.
Note: Visual Masking Effect
• The reduced visibility of one image component due to the existence of
another neighboring image component.
24
Mosaic Effect
25. − Staircase effect typically happens along a diagonal line or curve, which, when mixed with the false
horizontal and vertical edges at block boundaries, creates fake staircase structures.
− Depending on root cause, staircasing can be categorized as
• A compression artifact (insufficient sampling rates)
• A scaler artifact (spatial resolution is too low)
25
Staircase Effect
26. − A false edge is a fake edge that appears near a true edge.
− It is often created by a combination of
• motion estimation/compensation-based inter-frame prediction
• the blocking effect in the previous frame
Blockiness in the previous frame is carried into the current frame via motion compensation, appearing as
artificial edges.
26
False Edge
27. − Halo surrounding objects and edges
• Caused by quantization/truncation of high-frequency transform (DCT/DWT) coefficients during compression
− Doesn’t move around frame-to-frame (unlike mosquito noise)
27
Ringing (Echoing, Ghosting)
31. Ringing is unwanted oscillation of an output signal in response to a sudden change in
the input.
− The output signal oscillates at a fading rate, similar to a bell ringing after being
struck, inspiring the name of the ringing artifact.
− Image and video signals in digital data compression and processing are band
limited.
− When they undergo frequency domain techniques such as Fourier or wavelet
transforms, or non-monotone filters such as deconvolution, a spurious and visible
ghosting or echo effect is produced near the sharp transitions or object contours.
− This is due to the well-known Gibbs phenomenon—an oscillating behavior of the
filter’s impulse response near discontinuities, in which the output takes higher values
(overshoots) or lower values (undershoots) than the corresponding input values, with
decreasing magnitude until a steady state is reached.
31
Ringing (Echoing, Ghosting)
32. − The ringing takes the form of a “halo,” band, or “ghost” near sharp edges.
− Sharp transitions in images such as strong edges and lines are transformed to many coefficients in
frequency domain representations.
− The quantization process results in partial loss or distortion of these coefficients.
− So during image reconstruction (decompression), there’s insufficient data to form as sharp an edge as in
the original.
− When the remaining coefficients are combined to reconstruct the edges or lines, artificial wave-like or
ripple structures are created in nearby regions, known as the ringing artifacts.
− Mathematically, this causes both over- and undershooting to occur at the samples around the original
edge.
− It’s the over- and undershooting that typically introduces the halo effect, creating a silhouette-like shade
parallel to the original edge.
32
Ringing (Echoing, Ghosting)
33. − The ringing effect is restricted to sharp edges or lines.
− Such ringing artifacts are most significant when the edges or lines are sharp and strong, and when the
regions near the edges or lines are smooth, where the visual masking effect is the weakest.
− Ringing doesn’t move around frame to frame (Unlike mosquito noise).
Note
− When the ringing effect is combined with object motion in consecutive video frames, a special temporal
artifact called mosquito noise is observed.
33
Ringing (Echoing, Ghosting)
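The Gibbs-phenomenon mechanism can be sketched numerically: band-limiting an ideal step with an FFT reproduces the characteristic over- and undershoot that decay with distance from the edge. The signal length and frequency cutoff are illustrative.

```python
import numpy as np

n = 256
x = np.zeros(n)
x[n // 2:] = 1.0                     # ideal step edge (0 -> 1)

X = np.fft.fft(x)
X[17:240] = 0                        # keep DC plus the 16 lowest frequency pairs
y = np.fft.ifft(X).real              # band-limited reconstruction rings near the edge

overshoot = y.max() - 1.0            # rises above the high side of the edge
undershoot = -y.min()                # dips below the low side
```

The partial Fourier sum overshoots the step by roughly 9% of the jump regardless of how many terms are kept, which is why ringing persists even at moderate compression levels.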
34. − The basis pattern effect takes its name from the basis functions of the mathematical transforms used in
compression algorithms. The artifact appears similar to the ringing effect.
− However, whereas the ringing effect is restricted to sharp edges or lines, the basis pattern is not.
− It usually occurs in regions that have texture, like trees, fields of grass, waves, etc.
− Typically, if viewers notice a basis pattern, it has a strong negative impact on perceived video quality.
− If the region is in the background and does not attract visual attention, then the effect is often ignored by
human observers.
34
Basis Pattern Effect
Reference frame Compressed frame with basis pattern effect
35. − Color bleeding occurs when the edges of one color in the image unintentionally bleed or overlap into another color.
− Colors of contrasting hue/saturation bleed across sharp brightness boundaries, looks like “sloppy painting”
• Caused by chroma subsampling (result of inconsistent image rendering across the luminance and chromatic
channels)
− Worse in images with high color detail
35
Color Bleeding (Smearing)
For example, in the most popular YCbCr 4:2:0 video format, the color channels Cb and Cr have half the resolution
of the luminance channel Y in both the horizontal and vertical dimensions.
Inconsistent Distortions Across Color Channels
− After compression, all luminance and chromatic channels exhibit various types of distortions (such as
blurring, blocking and ringing described earlier), and more importantly, these distortions are
inconsistent across color channels.
Interpolation Operations
− Moreover, because of the lower resolution in the chromatic channels, the rendering processes
inevitably involve interpolation operations, leading to additional inconsistent color spreading in the
rendering result.
− In the literature, it has been shown that chromatic distortion is helpful in color image quality assessment, but
how color bleeding affects the overall perceptual quality of compressed video is still an open
problem.
38
Color Bleeding (Smearing)
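A toy 1-D sketch of the interpolation mechanism described above, under simplifying assumptions (pair-averaging subsampling and nearest-neighbour upsampling stand in for a real 4:2:0 chain): a sharp chroma boundary that is not aligned with the subsampling grid gets averaged, so chroma bleeds across the edge while full-resolution luma stays sharp.

```python
import numpy as np

cb = np.zeros(8)
cb[3:] = 100.0                               # full-resolution chroma row, edge at pixel 3
cb_sub = cb.reshape(4, 2).mean(axis=1)       # 2:1 subsampling by pair averaging
cb_up = np.repeat(cb_sub, 2)                 # nearest-neighbour upsampling for display

# Pixels 2 and 3 now carry a mixed value (50) instead of pure 0 / 100:
# the colour has spread one pixel to each side of the original boundary.
```

Had the edge fallen exactly on a subsampling-pair boundary, no averaging would occur; misalignment with the chroma grid is what produces the bleed.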
39. Temporal artifacts refer to those distortion effects that are not observed when the video is paused, but only
during video playback.
39
Temporal Artifacts
[Diagram: Temporal Artifacts → Flickering (Mosquito Noise, Fine-granularity Flickering, Coarse-granularity Flickering), Jerkiness, Floating (Texture Floating, Edge Neighborhood Floating)]
40. Temporal artifacts are of particular interest to us for two reasons
I. As compared to spatial artifacts, temporal artifacts evolve more significantly with the development
of video coding techniques.
• For example, texture floating did not appear to be a significant issue in early video coding standards, is
more manifest in H.264/AVC video, but is largely reduced in the latest HEVC-coded video.
II. The objective evaluation of such artifacts is more challenging, and popular VQA models often fail to
account for these artifacts.
• It has been pointed out that these performance drops are largely due to the lack of proper assessment of
temporal artifacts such as flickering and floating (ghosting).
40
Temporal Artifacts
43. Flickering generally refers to frequent luminance or chrominance changes along the temporal
dimension that do not appear in the uncompressed reference video.
− It can be very eye-catching and annoying to viewers and has been identified as an important temporal
artifact with a significant impact on perceived quality.
− The most likely cause of this type of flickering is the use of GoP structures in the compression algorithm.
− All-intra (I-frame-only) algorithms are not susceptible to this type of artifact.
43
Flickering
[Diagram: Flickering → Mosquito Noise, Coarse-granularity Flickering, Fine-granularity Flickering]
44. − Haziness, shimmering, blotchy noise around objects/edges
• Varies from frame to frame, like mosquitos flying around a person’s head.
− Caused by added edges, incorrect DCT block reconstruction, and pixels of the opposite color being created.
44
Mosquito Noise (Gibbs Effect, Edge Busyness)
45. − Mosquito noise is a joint effect of object motion and time-varying spatial artifacts (such as ringing and
motion prediction error) near sharp object boundaries.
− Specifically, the ringing and motion prediction errors are most manifest in the regions near the boundaries of
objects.
− When the objects move, such noise-like time-varying artifacts move together with the objects, and thus
look like mosquitos flying around the objects.
− Since moving objects attract visual attention and the plain regions near object boundaries have a weak
visual masking effect on the noise, mosquito noise is usually easily detected and has a strong negative
impact on perceived video quality.
− A variant of flickering, it’s typified as haziness and/or shimmering around high-frequency content (sharp
transitions between foreground entities and the background or hard edges), and can sometimes be
mistaken for ringing.
45
Mosquito Noise (Gibbs Effect, Edge Busyness)
46. Coarse-granularity flickering refers to low-frequency sudden luminance changes in large spatial regions that
could extend to the entire video frame.
− The most likely reason for such flickering is the use of group-of-pictures (GoP) structures in standard video
compression techniques.
− When a new GoP starts, there is no dependency between the last P-frame in the previous GoP and the I-
frame in the current GoP.
− Thus, a sudden luminance change is likely to be observed, especially when these two frames depict the
same scene.
46
Coarse-granularity Flickering
[Diagram: GoP = 6; no prediction dependency across GoP boundaries]
47. − The frequency of coarse-granularity flickering is typically determined by the size of GoP.
− Advanced video encoders may not use fixed GoP lengths or structures, and an I-frame may be
employed only when scene change occurs, and thus coarse-granularity flickering may be avoided or
significantly reduced.
47
Coarse-granularity Flickering
[Diagram: GoP = 6 vs. GoP = 12; the larger GoP has fewer boundaries without prediction dependency]
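The fixed-GoP cadence discussed above can be sketched as follows; the closed-GoP structure and helper names are illustrative:

```python
# Illustrative sketch: in a fixed closed-GoP cadence, an I-frame starts each
# GoP, so the last P-frame of one GoP and the next I-frame share no prediction
# dependency -- the seam where coarse-granularity flicker can appear.
def frame_types(n_frames, gop=6):
    """Frame-type cadence for a fixed closed GoP of the given length."""
    return ['I' if i % gop == 0 else 'P' for i in range(n_frames)]

def gop_seams(n_frames, gop=6):
    """Indices of frames whose predecessor belongs to the previous GoP."""
    return [i for i in range(1, n_frames) if i % gop == 0]
```

With a GoP of 6, a seam occurs every 6 frames; doubling the GoP halves the seam frequency, which matches the slide's point that larger or adaptive GoPs reduce coarse-granularity flickering.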
48. Fine-granularity flickering is typically observed in large low-energy to mid-energy regions with significant
blocking effect and slow motion.
− In these regions, significant blocking effect occurs at each frame.
− The levels of blockiness and the DC values in corresponding blocks change frame by frame.
− Consequently, these regions appear to be flashing at high frequencies, frame-by-frame (as opposed to
GoP-by-GoP in coarse granularity flickering).
− Such flashing effect is highly eye-catching and perceptually annoying, especially when the associated
moving regions are of interest to the human observers.
48
Fine-granularity flickering
49. − A flicker-like artifact, jerkiness (also known as choppiness) describes the perception of individual still
images in a motion picture.
− Jerkiness is the perceived uneven or wobbly motion caused by frame sampling.
− Jerkiness occurs when the temporal resolution is not high enough to catch up with the speed of moving
objects, and thus the object motion appears to be discontinuous.
− The highly visible jerkiness is typically observed only when there is strong object motion in the frame.
− The resulting unsmooth object movement may cause significant perceptual quality degradation.
− It may be noted that the frequency at which flicker and jerkiness are perceived is dependent upon many
conditions, including ambient lighting conditions.
49
Jerkiness (Choppiness)
50. − Jerkiness is not discernible for normal playback of video at typical frame rates of 24 frames per second or
above.
− However, in visual communication systems, if a video frame is dropped by the decoder owing to its late
arrival, or if the decoding is unsuccessful owing to network errors, the previous frame would continue to be
displayed.
− Upon successful decoding of the next error-free frame, the scene on the display would suddenly be
updated. This would cause a visible jerkiness artifact.
50
Jerkiness
[Diagram: Encoder → Network Channel → Decoder]
• Dropped by the decoder owing to its late arrival
• Unsuccessful decoding owing to network errors
51. − Traditionally, jerkiness is not considered a compression artifact, but an effect of the low temporal
resolution of the video acquisition device, or a video transmission issue when the available bandwidth is not
enough to transmit all frames and some frames have to be dropped or delayed.
Telecine Judder
− Another flicker-like artifact is the telecine judder.
− It’s often caused by the conversion of 24 fps movies to a 30 or 60 fps video format. The process, known as
"3:2 pulldown" or "2:3 pulldown," cannot create a flawless copy of the original movie because 24 does not
divide evenly into 30 or 60.
− Jerkiness is also sometimes called judder.
51
Jerkiness
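The 3:2 pulldown cadence can be sketched directly; the frame labels and helper name are illustrative:

```python
# Toy 3:2 pulldown sketch: four film frames become ten video fields, so film
# frames alternately persist for 3 and 2 field times. This uneven cadence is
# what is perceived as telecine judder.
def pulldown_32(frames):
    fields = []
    for i, f in enumerate(frames):
        fields += [f] * (3 if i % 2 == 0 else 2)   # 3,2,3,2,... field repeats
    return fields

fields = pulldown_32(['A', 'B', 'C', 'D'])
# Every 4 film frames map to 10 fields: 24 fps film -> 60 fields/s video.
```

Because 24 does not divide evenly into 60, some frames are shown for 3 field times and others for only 2, so on-screen motion advances at an uneven rhythm.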
54. Floating refers to the appearance of illusory motion in certain regions, as opposed to their surrounding
background.
− Visually, these regions appear as if they were floating on top of the surrounding background.
− Typically, video encoders associate the Skip coding mode with these regions (the encoding of the
motion-compensated prediction residue is skipped), and thus the structural details within the regions remain
unchanged across frames.
− This is erroneous, as the actual details in the regions evolve over time.
− The effect is the result of the encoder erroneously selecting the Skip mode for these regions.
54
Floating
[Diagram: the encoder applies the Skip coding mode to the floating region, which then shows illusory motion relative to its surrounding background]
55. Texture floating typically occurs in large mid-energy texture regions.
− The most common case is when a scene with large textured regions, such as a water surface or trees, is
captured by a slowly moving camera.
− Despite the actual shifting of image content due to camera motion, many video encoders choose to
encode the blocks in the texture regions with zero motion and Skip mode.
− These are reasonable choices to save bandwidth without significantly increasing the mean squared error
or mean absolute error between the reference and reconstructed frames, but often create strong texture
floating illusion.
− Not surprisingly, such floating illusory motion is usually in the opposite direction to the camera
motion, with the same absolute speed.
− It has sometimes also been referred to as “ghosting”.
55
Texture Floating
[Diagram: Floating → Texture Floating, Edge Neighborhood Floating; texture floating occurs in large mid-energy texture regions]
56. a) The 200th frame of the original video
b) The 200th frame of the compressed video
(visible texture floating regions are marked manually)
c) The generated texture floating map (black regions
indicate where texture floating is detected)
56
Texture Floating Detection
57. − Among all types of temporal artifacts, texture floating is perhaps the least identified in the literature, but it is
found to be highly eye-catching and visually annoying when it exists.
Several factors are relevant in detecting and assessing texture floating:
1. Global motion:
• Texture floating is typically observed in the video frames with global camera motion, including translation, rotation
and zooming. The relative motion between the floating regions and the background creates the floating illusion in
the visual system.
• A robust global motion estimation method can be employed that uses the statistical distribution of motion vectors
from the compressed bitstream.
2. Skip mode:
• The use of Skip mode is the major source of temporal floating effect. When the compressed video stream is
available, the Skip mode can be easily detected in the syntax information.
57
Texture Floating Detection
58. 3. Local energy:
• In high-energy texture and edge regions, erroneous motion estimation and Skip mode selection are unlikely in most
high-performance video encoders. On the other hand, there is no visible texture in low-energy regions.
• Texture floating is therefore most likely seen in mid-energy regions, so we can define two threshold energy
parameters E1 and E2 to constrain the energy range for texture floating detection.
4. Local luminance:
• The visibility of texture floating is also limited by the luminance levels of the floating regions.
• Because human eyes are less sensitive to textures in very bright or very dark regions, we can define two threshold
luminance parameters L1 and L2 to consider only mid-luminance regions for texture floating identification.
5. Temporal variation similarity:
• Temporal floating is often associated with erroneous motion estimation. In the reconstruction of video frames,
erroneous motion estimation/compensation leads to significant distortions along temporal direction.
• When the original uncompressed video is available for comparison, it is useful to evaluate the similarity of
temporal variation between the reference video frames and the compressed video frames as a factor to detect
the temporal floating effect.
58
Texture Floating Detection
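Factors 3 and 4 above can be sketched as a screening step. The slides name the threshold pairs E1, E2 and L1, L2 but give no values, so the numbers below are hypothetical placeholders, and block standard deviation/mean stand in for the unspecified energy and luminance measures:

```python
import numpy as np

E1, E2 = 5.0, 40.0         # local-energy thresholds (hypothetical values)
L1, L2 = 32.0, 224.0       # local-luminance thresholds (hypothetical values)

def floating_candidates(frame, bs=8):
    """Flag blocks whose energy (std) and luminance (mean) are both mid-range."""
    h, w = frame.shape
    mask = np.zeros((h // bs, w // bs), dtype=bool)
    for i in range(h // bs):
        for j in range(w // bs):
            blk = frame[bs * i:bs * (i + 1), bs * j:bs * (j + 1)]
            mask[i, j] = (E1 < blk.std() < E2) and (L1 < blk.mean() < L2)
    return mask

rng = np.random.default_rng(0)
frame = np.full((8, 16), 128.0)                  # left block: flat, energy too low
frame[:, 8:] += rng.normal(0.0, 15.0, (8, 8))    # right block: mild mid-energy texture
mask = floating_candidates(frame)
```

Only the mildly textured block survives the screening; a full detector would combine this mask with the global-motion, Skip-mode, and temporal-variation factors listed on the slides.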
59. Edge neighborhood floating is observed in stationary regions that are next to the boundaries of moving
objects.
− Rather than remaining stationary, these regions may move together with the boundaries of the objects.
− Unlike texture floating, edge neighborhood floating may appear without global motion.
− It is often visually unpleasant because it looks as if a wrapped package surrounds and
moves together with the object boundaries.
− This effect has also been called “stationary area temporal fluctuations”.
59
Edge Neighborhood Floating
[Diagram: Floating → Texture Floating, Edge Neighborhood Floating; edge neighborhood floating occurs in stationary regions next to the boundaries of moving objects]
60. − Looks like a random noise process (snow); can be grey or colored, but is not uniform over the image
• Caused by quantization of DCT coefficients
60
Quantization Noise
61. − Contouring is a frequent problem caused by the limits of 8-bit color depth and (especially at low bitrates) truncation.
• There are not enough colors to represent the image correctly.
• It is not noticeable in high-detail areas of the image.
• It badly affects flat areas and smooth transitions.
61
Contouring (Banding)
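A minimal sketch of the mechanism: quantizing a smooth gradient to too few levels produces flat bands separated by visible steps. The level count is illustrative.

```python
import numpy as np

grad = np.linspace(0.0, 255.0, 1024)       # smooth transition, sub-unit per-pixel change
step = 32.0                                # only 8 output levels (3 effective bits)
banded = np.floor(grad / step) * step      # crude quantization

num_bands = len(np.unique(banded))         # distinct flat regions ("bands")
max_jump = np.abs(np.diff(banded)).max()   # visible step height between bands
```

The original gradient changes by roughly a quarter of a grey level per pixel, well below visibility, but after quantization the same range is rendered as a handful of flat bands with abrupt 32-level steps between them.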
62. − Similar reason to contouring
− Make video look like a poster
− Colors look flat
− Makeup looks badly applied
62
Posterization
64. − Data incorrectly saved or retrieved. Affects still images as a loss of part of the image …
… or a block shift of part of the image.
64
Data Corruption
70. − Audiovisual Quality Control
• Audio
• Video
• Interaction between audio and video: e.g., lip-sync
− File based AV Quality Control
• Part of an automated or manual workflow
• Diagnose
• Repair / Redo
− Technical Quality Control
• Container
• Metadata
• Interaction between container, audio, and video: e.g., duration of tracks
− File Based Technical Quality Control
• Part of an automated or manual workflow
• Application specifications
70
Video Quality Control in a File-based World
71. Safeguarding Audiovisual Quality
− Maintaining quality throughout the production chain
– Choose material as close to source as possible
• Prevent unneeded multi-generation
– Try to produce with the shortest/’most apt’ chain
• Prevent unneeded multi-generation
• Prevent transcoding
– Check quality
− Carefully design the production chain
– Choose the right codecs
– Choose the right equipment
71
Video Quality Control in a File-based World
72. − Video compression algorithm factors
• Decoder concealment, packetization, GOP structure, …
− Network-specific factors
• Delay, Delay variation, Bit-rate, Packet loss rate (PLR)
− Network independent factors
• Sequence
− Content, amount of motion, amount of texture, spatial and temporal resolution
− User
• Eyesight, interest, experience, involvement, expectations
− Environmental viewing conditions
• Background and room lighting; display sensitivity, contrast, and characteristics; viewing distance
72
Factors that Affect Video Quality
73. Image and Video Processing Chain
− Acquisition: aliasing, blurring, ringing, noise, contouring, distortions
− Compression: blocking/tiling, MC edges, aliasing, blurring, ringing, flicker, jerkiness
− Transmission: noise, jerkiness, blackout, packet loss, macroblocking
− Display: aliasing, blurring, ringing, color contrast artifacts, interlacing, overscan
74. The Spectrum of Visual Quality
[Scale: Perfect (Lossless) → Visually Lossless? → Good enough? → Awful]
75. What Affects Visual Quality? (1)
Compression Ratio
Computation
Delay
Codec Type
Codec Version
Transmission Errors
System Issues
75
79. − In video/image processing, pixel values may change, leading to distortions.
− Examples of added distortions:
• Filtering/interpolation distortions
• Compression distortions
• Watermarking distortions
• Transmission distortions
− It is important to know whether the added distortions are acceptable.
Objective and Subjective Measurements/Assessment
79
80. 80
Objective and Subjective Measurements/Assessment
Objective Metrics
• Peak Signal to Noise Ratio (PSNR)
• Structural Similarity Index (SSIM)
• Just Noticeable Difference (JND)
• … and Many More.
Video Quality
Measurement
Subjective Objective
Subjective Metrics
• MOS: Mean Opinion Score
• DMOS: Differential Mean Opinion Score (defined as the difference between the raw quality scores of the reference and test images)
• … and many more.
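As a concrete example of the objective side, PSNR can be computed in a few lines (a minimal sketch, assuming an 8-bit peak value of 255):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0.0 else 10.0 * np.log10(peak * peak / mse)

ref = np.zeros((4, 4))
noisy = ref + 16.0      # uniform error of 16 levels -> MSE = 256
# psnr(ref, noisy) = 10 * log10(255^2 / 256) ≈ 24.05 dB
```

PSNR is a pure pixel-difference measure; as the surrounding slides note, it often correlates poorly with perception, which motivates metrics such as SSIM and JND-based models.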
81. 81
Objective and Subjective Measurements/Assessment
Video Quality
Measurement
Subjective Objective
[Diagram: Subjective → Goldeneye, Multiple Viewer; Absolute, Comparison]
82. 82
Objective and Subjective Measurements/Assessment
Video Quality
Measurement
Subjective Objective
[Diagram: Objective → Full Reference, Reduced Reference, No Reference; Model Based, Distortion Based, Hybrid]
84. − Formation: 1997
− Experts of ITU-T study group 6 and ITU-T study group 9
− The VQEG has defined three methods for objective video quality measurement:
• Full Reference (FR)
• Reduced Reference (RR)
• No Reference (NR)
− All these models should be validated against SSCQE (Single Stimulus Continuous Quality Evaluation)
results for various video segments.
− Early results indicate that, compared with SSCQE, these methods perform satisfactorily, with a
correlation coefficient between 0.8 and 0.9.
Video Quality Experts Group (VQEG)
84
[Diagram: ITU-T study group 6 + ITU-T study group 9 → Video Quality Experts Group (VQEG)]
86. − These three models try to mimic the perceptual model of the HVS (human visual system).
− They try to
• assess specific distortions (such as blocking, blurring, texture distortion, jerkiness, freezing, etc.)
• pool (combine features, spatially and temporally)
• map the contributions of these artifacts into an overall video quality score.
− FR performs best and NR worst.
VQEG Meters and HVS Model
86
[Pipeline: Measure → Pool → Map to Quality]
87. − A full-reference (FR) quality measurement makes a comparison between a (known) reference video
signal at the input of the system and the processed video signal at the output of the system
Full-reference (FR) Method
87
88. − In a reduced-reference (RR) quality measurement, specific parameters (features) are extracted from both
the (known) reference and processed signals.
− Reference data relating to these parameters are sent using a side-channel to the measurement system.
− The measurement system extracts similar features to those in the reference data to make a comparison
and produce a quality measurement.
Reduced-reference (RR) Method
88
[Diagram: similar features are extracted from the reference data and the processed signal for comparison]
89. − A no-reference (NR) quality measurement analyses only the processed video without the need to access
the (full or partial) reference information.
No-reference (NR) Method
89
90. Image/Spatial
• Blurring
• Ringing
• Snow; Noise
• Aliasing; HD on SD and vice versa
• Colorfulness; color shifts
• PSNR due to quantization
• Contrast
• Ghosting
• Interlacing
• Motion-compensated edge artifacts!
No-reference (NR) Method
90
Video/Temporal
• Flicker
• Block flashing artifacts (mosquito noise?)
• Telecine
• Jerkiness
• Blackness
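One of the spatial cues above, blockiness, can be estimated without a reference by comparing gradients at assumed 8-pixel block boundaries with gradients elsewhere. This is an illustrative cue, not a standardized metric:

```python
import numpy as np

def blockiness_ratio(frame):
    """Mean gradient at 8-pixel column boundaries vs. elsewhere (>~1 = blocky)."""
    d = np.abs(np.diff(frame.astype(np.float64), axis=1))
    cols = np.arange(d.shape[1])
    at_boundary = (cols % 8) == 7              # differences across columns 7|8, 15|16, ...
    off = d[:, ~at_boundary].mean()
    return d[:, at_boundary].mean() / max(off, 1e-9)

# A blocky frame: four flat 8-pixel-wide tiles with jumps only at tile seams,
# versus a smooth ramp covering the same intensity range.
tiles = np.array([[0.0, 50.0, 100.0, 150.0]])
frame = np.repeat(np.repeat(tiles, 8, axis=0), 8, axis=1)   # 8 x 32 blocky frame
smooth = np.tile(np.linspace(0.0, 150.0, 32), (8, 1))       # 8 x 32 smooth frame
```

The blocky frame concentrates all its gradient energy at the seams and scores far above 1, while the smooth ramp scores near 1; a real NR metric would also check row boundaries and unknown block grids.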
92. Typical Video Quality Estimator (QE) or Video Quality Assessment System
− Pooling: combining features in space, frequency, orientation, and time
[Diagram: reference image + distorted test image → Measure → Pool → Map to Quality → absolute QE score]
Methodology
− Measure individual artifacts
− Combine many artifacts
• Combine linearly
• Combine non-linearly
− Measure physical artifacts
− Measure perceptual artifacts
− Incorporate viewing distance
(cyc/degree) into NR
− Combine NR with FR-HVS
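The Measure → Pool → Map stages above can be sketched as a toy quality estimator. The squared-error measure, mean pooling, and reciprocal mapping below are hypothetical placeholders, not any standardized model:

```python
import numpy as np

def measure(ref, dist):
    """Measure stage: a per-pixel squared-error map as a toy artifact measure."""
    return (ref.astype(float) - dist.astype(float)) ** 2

def pool(error_map):
    """Pool stage: combine the spatial error map into a single scalar (mean)."""
    return error_map.mean()

def map_to_quality(pooled, k=0.1):
    """Map stage: monotonically map the pooled error onto a 0..100 scale."""
    return 100.0 / (1.0 + k * pooled)

ref = np.tile(np.arange(16, dtype=float), (16, 1))             # toy reference image
dist = ref + np.random.default_rng(0).normal(0, 2, ref.shape)  # distorted copy

score = map_to_quality(pool(measure(ref, dist)))
print(round(score, 2))  # identical inputs would give exactly 100.0
```

Real estimators differ mainly in these three choices: which artifacts are measured, how they are pooled in space and time, and how the pooled value is mapped to subjective scores.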
93. 93
[Three subjective-vs-objective quality scatter plots: The Ideal Quality Estimator; A Quality Estimator (QE) With a Systematic Weakness (for a specific type of processing, or a specific type of image or video); A Typical QE On a Typical Dataset]
94. − Algorithm optimization
• Automated in-the-loop assessment
− Product benchmarks
• Vendor comparison to decide what product to buy
• Product marketing to convince customer to give you $$
− System provisioning
• Determine how many servers, how much bandwidth, etc.
− Content acquisition and delivery (and SLAs)
• Enter into legal agreements with other parties
− Outage detection and troubleshooting
Applications Of Video Quality Estimators
94
95. Absolute QE scores
− Absolute QE scores are useful for
• product benchmarking
• content acquisition
• system provisioning
− Absolute QE scores still depend on context. They are not truly “absolute”.
Relative QE scores
− Relative QE scores are useful for
• algorithm optimization
• product benchmarking
Absolute vs. Relative Quality Estimator (QE) Scores
95
96. − Full reference (FR)
• Most available info; requires original and decoded pixels
• ITU-T standards J.247, J.341
− Reduced reference (RR)
• Less information available; requires features extracted from the original and decoded pixels
• ITU-T standards J.246, J.342
− No-Reference Pixel-based methods (NR-P)
• Requires decoded pixels: a decoder for each video stream
− No-Reference Bitstream-based methods (NR-B)
• Processes packets containing bitstream, without decoder
• ITU-T standards: P.1201 (packet headers only); P.1202 (packet headers and bitstream info)
− Hybrid Methods (Hybrid NR-P-B)
• Hybrid models combine parameters extracted from the bitstream with a decoded video signal.
• They are therefore a mix between NR-P and NR-B models.
Quality Estimator (QE) Categorization
96
98. Use information from a collection of “similar enough” images to estimate quality of them all
− Applications
• Super-resolution
• Downsampling
• Image fusion
• Images collected of the same scene from different angles, etc.
• Egocentric video
− A relative quality estimation
− Uses effective information from overlapping regions; does not require perfect pixel alignment
Quality Estimator (QE) with Mutual Reference
98
99. − Sequence
• Content
• amount of motion, amount of texture
• spatial and temporal resolution
− User
• Eyesight
• Interest
• Experience
• Involvement
• Expectations
− Environmental viewing conditions
• Background and room lighting
• Display sensitivity, contrast, and characteristics
• Viewing distance
− The processing may improve, not degrade, quality!
Why Are FR Quality Estimators (QE) Challenging to Design?
99
100. − All the reasons FR QEs are challenging to design, plus…
− Many types of processing
• Encoding
• transmission errors
• Sampling
• backhoe, …
− Many desired signals may “look like” distortion
− Limited input information – no vref and often no vtest!
− Nonetheless, some applications require NR
Why Are NR Quality Estimators (QE) Challenging to Design?
100
101. Approach 1: Model Perception and Perceptual Attributes
− Psychology community, Martens, Kayyargadde, Allnatt, Goodman,…
Approach 2: Model System Quality
− The entire photography/camera community, Keelan,…
Approach 3:
− The image processing community
− H.R. Wu, Winkler, Marziliano, Bovik, Sheikh, Wang, Ghanbari, Reibman, Wolf
Existing Approaches to NR Quality Estimator (QE)
101
102. Original video X
Encoding parameters E(.)
Complete encoded bitstream E(X)
Network impairments (losses, jitter) L(.)
Lossy bitstream L(E(X))
Decoder (concealment, buffer, jitter) D(.)
Decoded pixels D(L(E(X)))
What Information Can You Gather?
102
[Diagram: Encoder → Network → Decoder chain, with tap points A, B, C, D]
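The chain X → E(X) → L(E(X)) → D(L(E(X))) can be mimicked with toy stand-ins; the E, L and D below (a coarse quantizer, a periodic packet drop, and copy concealment) are deliberately simplified placeholders for a real encoder, network and decoder:

```python
def E(x):
    """Toy encoder: coarse quantization of each sample."""
    return [round(v / 4) for v in x]

def L(bits):
    """Toy network: lose every 5th "packet" (None marks a loss)."""
    return [b if i % 5 else None for i, b in enumerate(bits)]

def D(bits):
    """Toy decoder: dequantize, concealing losses by repeating the last value."""
    out, prev = [], 0
    for b in bits:
        prev = prev if b is None else b * 4
        out.append(prev)
    return out

X = list(range(0, 40, 3))      # "original video" samples
decoded = D(L(E(X)))           # D(L(E(X)))
# An FR QE compares X with decoded; an NR-B QE sees only L(E(X));
# an NR-P QE sees only decoded.
print(X[:5], decoded[:5])
```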
103. Full-Reference Quality Estimator (QE)
103
[Same Encoder → Network → Decoder diagram; an FR QE compares the original video X with the decoded pixels D(L(E(X)))]
104. Quality Estimator (QE) Using Network Measurements
104
[Same Encoder → Network → Decoder diagram; this QE taps the network impairments (losses, jitter) L(.)]
105. Quality Estimator (QE) Using Lossy Bitstream
105
[Same Encoder → Network → Decoder diagram; this QE taps the lossy bitstream L(E(X))]
107. − Mean Squared Error (MSE) of a quantizer for a continuous-valued signal:
MSE = ∫ (f − Q(f))² p(f) df
• Where p(f) is the probability density function of f
− MSE for a specific image with N pixels:
MSE = (1/N) Σ_{i=1}^{N} (f_i − Q(f_i))²
107
Mean Squared Error (MSE)
108. − Signal to Noise Ratio (SNR)
− Peak SNR or PSNR
• For the error measure to be “independent of the signal energy” →
Signal energy ≡ square of the dynamic range of the image
• For an 8-bit image, peak = 255
108
Signal to Noise Ratio (SNR) and Peak Signal to Noise Ratio (PSNR)
109. 109
MSE of a Uniform Quantizer for a Uniform Source
• Uniform quantization into L levels: q = B/L
• Same error in each bin
• Error is uniformly distributed in (−q/2, +q/2)
σ_q² = ∫_{f_min}^{f_max} (f − Q(f))² (1/B) df = q²/12 = B²/(12L²)
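The classical q²/12 result for a uniform quantizer on a uniform source can be checked numerically; a minimal sketch assuming a mid-rise quantizer on (0, B):

```python
import numpy as np

rng = np.random.default_rng(1)
B, L = 1.0, 8                        # source range and number of levels
q = B / L                            # quantizer step size

f = rng.uniform(0.0, B, 1_000_000)   # uniform source on (0, B)
Qf = (np.floor(f / q) + 0.5) * q     # mid-rise quantizer: reconstruct at bin centers
mse = np.mean((f - Qf) ** 2)

print(mse, q ** 2 / 12)              # empirical MSE vs. the analytic q^2/12
print(np.var(f), B ** 2 / 12)        # source variance vs. the analytic B^2/12
```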
111. − Due to the cost and time limitations of subjective assessment, researchers are focusing on
automated quality assessment.
− Computerized quality assessment is mainly based on mathematical calculations, which may also include
properties of the human visual system (HVS).
− Ideally we want to measure performance by how close the quantized image is to the original image:
the “Perceptual Difference”.
− But it is very hard to come up with an objective measure that correlates well with perceptual quality.
− Frequently used objective measures
• Mean Squared Error (MSE) between original and quantized samples
• Signal-to-Noise Ratio (SNR)
• Peak SNR (PSNR)
111
Objective Measurement of Performance
112. − The Mean Squared Error (MSE) is the simplest objective measure and is calculated as:
− Where Y_r(x, y) and Y_d(x, y) are the luminance levels of the pixels in the reference and coded images
of size m × n, respectively.
− The PSNR is calculated from the MSE as:
Objective Assessment by PSNR and MSE
112
MSE = (1/(m × n)) Σ_{x=1}^{m} Σ_{y=1}^{n} (Y_r(x, y) − Y_d(x, y))²
PSNR = 10 log10(255² / MSE) dB
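The two formulas above translate directly into code; a minimal NumPy sketch:

```python
import numpy as np

def mse(Yr, Yd):
    """Mean squared error between two equal-size luminance images."""
    return np.mean((Yr.astype(np.float64) - Yd.astype(np.float64)) ** 2)

def psnr(Yr, Yd, peak=255.0):
    """PSNR in dB; defined as infinite for identical images."""
    e = mse(Yr, Yd)
    return float('inf') if e == 0 else 10 * np.log10(peak ** 2 / e)

ref = np.full((8, 8), 100, dtype=np.uint8)
dist = ref.copy()
dist[0, 0] = 110                       # a single pixel off by 10
print(round(psnr(ref, dist), 2))       # ≈ 46.19 dB
```

Casting to float64 before subtracting avoids uint8 wrap-around, a common bug in naive PSNR code.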
113. 113
PSNR
− The difference between PSNR and APSNR lies in how the average PSNR of a sequence is calculated.
− The correct way to calculate the average PSNR of a sequence is to compute the average MSE over all
frames (the arithmetic mean of the per-frame MSE values) and then calculate PSNR from it using the
ordinary PSNR equation:
APSNR (Average Peak Signal-to-Noise Ratio)
− But sometimes a simple average of all the per-frame PSNR values is needed.
− APSNR is implemented for this case: it calculates the average PSNR by simply averaging the per-frame
PSNR values.
− APSNR is usually about 1 dB higher than PSNR.
PSNR and APSNR
PSNR = 10 log10(255² / MSE) dB
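The difference between the two averaging orders is easy to demonstrate: because the logarithm is concave, averaging per-frame PSNR values (APSNR) always gives a result at least as large as computing PSNR from the average MSE, consistent with APSNR typically being higher. The per-frame MSE values below are made up for illustration:

```python
import numpy as np

def psnr_from_mse(m):
    """PSNR in dB from an MSE value (also works element-wise on arrays)."""
    return 10 * np.log10(255 ** 2 / m)

frame_mses = np.array([2.0, 10.0, 50.0])        # hypothetical per-frame MSEs

psnr_seq = psnr_from_mse(frame_mses.mean())     # PSNR: average the MSEs first
apsnr = psnr_from_mse(frame_mses).mean()        # APSNR: average per-frame PSNRs

print(round(psnr_seq, 2), round(apsnr, 2))      # APSNR >= PSNR
```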
114. 114
− Usually PSNR refers to Y-PSNR, because human eyes are more sensitive to luminance differences.
− However, some customers calculate PSNR by combining the Y-, U-, and V-PSNR in a custom ratio, like:
PSNR Channel
PSNR = (6 × Y-PSNR + U-PSNR + V-PSNR) / 8
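The 6:1:1 weighting can be written as a one-line helper; the per-plane PSNR values in the example are made up for illustration:

```python
def combined_psnr(y_psnr, u_psnr, v_psnr):
    """Combine per-plane PSNRs with the 6:1:1 custom ratio from the slide."""
    return (6 * y_psnr + u_psnr + v_psnr) / 8

print(combined_psnr(40.0, 44.0, 44.0))  # 41.0
```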
115. 115
Encoding y4m with the H.264 video codec, MP4 container, 2 Mbps bitrate, with PSNR measurement:
• ffmpeg -i source.y4m -codec h264 -b 2000000 destination.mp4 -psnr
Note: other codecs: -codec mpeg2video, -codec hevc, -codec vp9, …
Decoding from a coded media file (any) to y4m:
• ffmpeg -i source.mp4 destination.y4m
Encoding, Decoding and PSNR measurement by FFmpeg
116. 116
PSNR
Without saving the results in a log file:
− ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi psnr -f null -
With saving the results in a log file:
− ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi psnr=psnr.log -f null -
SSIM
Without saving the results in a log file:
− ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi ssim -f null -
With saving the results in a log file:
− ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi ssim=ssim.log -f null -
PSNR and SSIM measurement by FFmpeg
129. PSNR = 25.12 dB PSNR = 25.11 dB
Q. Huynh-Thu and M. Ghanbari, “The scope of validity of PSNR in image/video quality assessment”, Electronics Letters, 44:13 (June 2008), pp. 800-801.
129
PSNR is not Everything
133. − Take a natural image
− Give more bits to areas you look at more (saliency map)
− Give less bits to areas you look at less
− The subjective rating will be high, the PSNR low
Ex. 1: How to Trick PSNR?
133
[Images: original, attention map, and example test image (high subjective rating, low PSNR)]
134. Ex: a small part of a picture in a video is severely degraded
− This type of distortion is very common in video, where, due to a single bit error, blocks of 16×16 pixels might
be erroneously decoded.
− This has almost no effect on PSNR but can be perceived as an annoying artefact.
Ex. 2: How to Trick PSNR?
134
[Image: a picture with a small, severely degraded (distorted) region]
− It hardly affects the PSNR or any objective model parameters (depending on the area of the distortion).
− It attracts the observer’s attention, and the video would look bad if a larger part of the picture were distorted.
135. − In comparing codecs, PSNR or any objective measure should be used with great care.
− PSNR does not correlate accurately with picture quality, so it would be misleading to directly
compare PSNR values from very different algorithms.
“Ensure that the types of coding distortions are not significantly different from each other”
Ex. 3: How to Trick PSNR?
135
[Diagram: the same input fed to Coder 1 (block-based → blockiness distortion, PSNR1) and Coder 2 (filter-based → smearing distortion, PSNR2); the objective results can be different]
Expert viewers prefer blockiness distortion to smearing, while non-experts’ preferences are the opposite!
136. 1. Not shift invariant
2. Rates enhanced images as degraded
3. Difficult to handle different spatial resolutions
4. Difficult to handle different temporal resolutions
5. Does not consider underlying image
• Masking in human visual system (temporal, luminance, texture)
• Will clearly fail to accurately compare different source material
6. Does not consider relationship among pixels
• Compares different types of impairments poorly
• Blocking vs. blurring; Wavelet vs. DCT; High vs. low spatial frequency
Six Situations in Which PSNR Fails
136
137. − The main criticism against the PSNR
“The human interpretation of the distortions at different parts of the video can be different”
− Although it is hoped that this variety of interpretations can be captured by objective models, there are
still issues that not only the simple PSNR but also more sophisticated objective models may fail to
address.
Objective Assessment by PSNR
137
Example of PSNR interpretation in terms of quality for a specific video codec and streamed video:
Quality: PSNR value
Excellent quality: PSNR > 33 dB
Fair quality: 30 dB < PSNR < 33 dB
Poor quality: PSNR < 30 dB
138. What is the main reason that PSNR is still used in comparing the performance of various video codecs?
− Under similar conditions if one system has a better PSNR than the other, then the subjective quality can be
better but not worse.
− PSNR can provide information about the behaviour of the compression algorithm through the multi-
generation process.
Objective Assessment by PSNR
138
[Diagram: the same input fed to System 1 (PSNR1) and System 2 (PSNR2)]
MSE and PSNR are widely used because they are simple, easy to calculate, and mathematically easy to
deal with for optimization purposes.
− Mathematically, it’s very tractable
• Easy to model, differentiable
− Many times, increasing PSNR improves visual quality
− Experimentally, it’s easy to optimize
• Reducing error in one pixel increases PSNR
• Can reduce the error in each pixel independently
− Rate distortion optimization successes
• Search over all possible strategies, the one that minimizes the distortion (D=MSE) at the decoder,
subject to a constraint on encoding rate R.
• Shows dramatic improvement in images, video
Good Things about MSE and PSNR
139
140. A number of reasons why MSE or PSNR may not correlate well with the human perception of quality.
• Digital pixel values, on which the MSE is typically computed, may not exactly represent the light stimulus
entering the eye.
• Simple error summation, like the one implemented in the MSE formulation, may be markedly different from
the way the HVS and the brain arrive at an assessment of the perceived distortion.
• Two distorted image signals with the same amount of error energy may have very different error
structures, and hence different perceptual quality.
Some PSNR Facts
140
141. − PSNR is inaccurate in measuring video quality of a video content encoded at different frame rates
because it is not capable of assessing the perceptual trade-off between the spatial and temporal
qualities.
− PSNR follows a monotonic relationship with subjective quality in the case of full frame rate encoding
(without the presence of frame freezing or dropping) when the video content and codec are fixed.
− So PSNR can be used as an indicator of the variation of video quality when the content and codec are
fixed across the test conditions, and when the encoding is done at full frame rate without frame
freezing or dropping (which effectively changes the frame rate).
− PSNR becomes an unreliable and inaccurate quality metric when several videos with different content are
jointly assessed.
Some PSNR Facts
141
142. 142
PSNR
[Diagram: a Long GOP structure (I B B P B B … I) vs. an I-frame-only structure (I I I I …) over time]
PSNR in Different Moving Picture Types
144. 144
PSNR, GOP, Intra and Inter Coding
[Plot: PSNR vs. generation (1st, 5th, 10th) for a Long GOP codec, AVC-Intra100 (cut edit), and AVC-Intra50 (cut edit), over content such as still pictures, fast motion, confetti fall, flashing lights, and landscape]
Long GOP quality is content dependent.
145. − This measure gives the difference between the color components of the original and compressed
frames. The value of the metric is the mean absolute difference of the color components at the
corresponding points of the image.
− The values are in 0..1 (for normalized components). A value of 0 means identical frames; lower values are better.
− It can be used to identify which part of a search image is most similar to a template image.
Mean Sum of Absolute Difference (MSAD)
145
[Images: original, processed, and MSAD]
d(X, Y) = (Σ_{i=1}^{m} Σ_{j=1}^{n} |X_{i,j} − Y_{i,j}|) / (m·n)
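A minimal MSAD sketch for frames whose components are normalized to [0, 1]:

```python
import numpy as np

def msad(X, Y):
    """Mean absolute difference of corresponding pixels; 0 for identical
    frames, and bounded by 1 when the inputs lie in [0, 1]."""
    X = np.asarray(X, dtype=np.float64)
    Y = np.asarray(Y, dtype=np.float64)
    return np.abs(X - Y).mean()

a = np.zeros((4, 4))
b = np.full((4, 4), 0.25)
print(msad(a, a), msad(a, b))   # 0.0 0.25 -- lower is better
```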
146. The conventional PSNR is inaccurate when applied to video streamed over wireless and mobile networks.
− This is due to packet loss in wireless and mobile networks.
− A concept of dynamic window size is used to improve the accuracy of frame-loss detection.
− The concept is named Aligned-PSNR (APSNR).
Aligned-PSNR (APSNR)
146
Illustration of conventional PSNR.
147. − The window size depends on the total number of frame losses in the S-frame (streamed) sequence.
− In this case there are five frame losses in total, so the window size is six (Window size = SumFL + 1).
− The window limits the frame-loss search, so the algorithm only needs to find corresponding frames
within the window size.
− The processed S-frame (S-frame number x) should correspond with O-frame (original) number eight.
Aligned-PSNR (APSNR)
147
[Diagram: original vs. streamed frame sequences]
148. Two phenomena demonstrate that perceived brightness is not a simple function of intensity.
− Mach Band Effect: The visual system tends to undershoot or overshoot around the boundary of regions of
different intensities.
− Simultaneous Contrast: a region’s perceived brightness does not depend only on its intensity.
Perceived Brightness Relation with Intensity
148
Mach band effect.
Perceived intensity is
not a simple function
of actual intensity.
Examples of simultaneous contrast.
All the inner squares have the same intensity, but they appear
progressively darker as the background becomes lighter
149. − The term masking usually refers to a destructive interaction or
interference among stimuli that are closely coupled in time or
space.
− This may result in a failure in detection or errors in recognition.
− Here, we are mainly concerned with the detectability of one
stimulus when another stimulus is present simultaneously.
− The effect of one stimulus on the detectability of another,
however, does not have to decrease detectability.
Masking Recall
149
I: gray level (intensity value)
Masker: background I2 (one stimulus)
Disk: the other stimulus I1
At the threshold ∆I = I2 − I1, the object can be noticed by the HVS with a 50% chance.
150. − Under what circumstances can the disk-shaped object be
discriminated from the background (as a masker stimulus) by
the HVS? Weber’s law:
− Weber’s law states that for a relatively very wide range of I
(Masker), the threshold for disc discrimination, ∆𝑰, is directly
proportional to the intensity I.
• Bright Background: a larger difference in gray levels is needed
for the HVS to discriminate the object from the background.
• Dark Background: the intensity difference required could be
smaller.
Masking Recall
150
[Figure: Contrast Sensitivity Function (CSF); disk stimulus I1 on background masker I2]
∆I / I = Constant (≈ 0.02)
151. − The HVS demonstrates light-adaptation characteristics and, as a consequence, is sensitive to
relative changes in brightness. This effect is referred to as “luminance masking”.
“Luminance Masking: The perception of brightness is not a linear function of the luminance”
− In fact, the threshold of visibility of a brightness pattern is a linear function of the background luminance.
− In other words, brighter regions in an image can tolerate more distortion noise before it becomes
visually annoying.
− The direct impact that luminance masking has on image and video compression is related to
quantization.
− Luminance masking suggests a nonuniform quantization scheme that takes the contrast sensitivity function
into consideration.
Luminance Masking
151
152. − It can be observed that the noise is more visible in the dark area than in the bright area if one
compares, for instance, the dark and bright portions of the cloud above the bridge.
152
The bridge in Vancouver: (a) Original and (b) uniformly corrupted by AWGN.
Luminance Masking
153. Luminance Masking
− The perception of brightness is not a linear function of the luminance.
− The HVS demonstrates light adaptation characteristics and as a consequence of that it is sensitive to
relative changes in brightness.
Contrast Masking
− The changes in contrast are less noticeable when the base contrast is higher than when it is low.
− The visibility of certain image components is reduced due to the presence of other strong image
components with similar spatial frequencies and orientations at neighboring spatial locations.
Contrast Masking
153
154. With same MSE:
• The distortions are clearly visible in the ‘‘Caps’’ image.
• The distortions are hardly noticeable in the ‘‘Buildings’’ image.
• The strong edges and structure in the ‘‘Buildings’’ image
effectively mask the distortion, while it is clearly visible in the
smooth ‘‘Caps’’ image.
This is a consequence of the contrast masking property of the HVS i.e.
• The visibility of certain image components is reduced due to
the presence of other strong image components with similar
spatial frequencies and orientations at neighboring spatial
locations.
Contrast Masking
154
(a) Original ‘‘Caps’’ image (b) Original ‘‘Buildings’’ image
(c) JPEG compressed image, MSE = 160 (d) JPEG compressed image, MSE = 165
(e) JPEG 2000 compressed image, MSE =155 (f) AWGN corrupted image, MSE = 160.
155. − In developing a quality metric, a signal is first decomposed into several frequency bands and the HVS
model specifies the maximum possible distortion that can be introduced in each frequency component
before the distortion becomes visible.
− This is known as the Just Noticeable Difference (JND).
− The final stage in the quality evaluation involves combining the errors in the different frequency
components, after normalizing them with the corresponding sensitivity thresholds, using some metric such
as the Minkowski error.
− The final output of the algorithm is either
• a spatial map showing the image quality at different spatial locations
• a single number describing the overall quality of the image.
Developing a Quality Metric Using Just Noticeable Difference (JND)
155
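The normalize-then-pool step described above can be sketched as follows; the per-band errors, JND thresholds, and Minkowski exponent p are hypothetical illustrative values:

```python
import numpy as np

def minkowski_pool(errors, thresholds, p=4):
    """Normalize per-band errors by their JND thresholds, then combine
    them with a Minkowski sum of exponent p."""
    normalized = np.abs(errors) / thresholds
    return np.sum(normalized ** p) ** (1.0 / p)

band_errors = np.array([0.5, 1.2, 0.1])   # hypothetical per-band distortion
jnd = np.array([1.0, 1.0, 0.5])           # hypothetical detection thresholds

print(minkowski_pool(band_errors, jnd))   # > 1 suggests visible distortion
```

Larger p values let the most visible band dominate the pooled score; p = 1 reduces to a simple sum of normalized errors.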
156. − The Contrast Sensitivity Function (CSF) provides a description of the frequency response of the HVS, which
can be thought of as a band-pass filter.
Weber’s law:
∆I / I = Constant ≈ 0.02 → CSF = I / ∆I_min
− The HVS is less sensitive to higher spatial frequencies and this fact is exploited by most compression
algorithms to encode images at low bit rates, with minimal degradation in visual quality.
− Most HVS based approaches use some kind of modeling of the luminance masking and contrast
sensitivity properties of the HVS.
Contrast Sensitivity Function (CSF)
156
[Block diagram of HVS-based quality metrics: the reference and distorted image/video each pass through a frequency decomposition and an HVS model (luminance masking, contrast sensitivity, contrast masking), followed by error pooling]
157. − Structural information is defined as those aspects of the image that are independent of the luminance
and contrast.
− The structure of various objects in the scene is independent of the brightness and contrast of the image.
− Structural approaches to image quality assessment, in contrast to HVS-based approaches, take a top-
down view of the problem.
− It is hypothesized that the HVS has evolved to extract structural information from a scene and hence,
quantifying the loss in structural information can accurately predict the quality of an image.
− The structural philosophy overcomes certain limitations of HVS-based approaches such as:
• computational complexity
• inaccuracy of HVS models.
Image Structural Information
157
158. − PSNR and MSE are inconsistent with human eye perception.
− The distorted versions of the ‘‘Buildings’’ and ‘‘Caps’’ images
have the same MSE with respect to the references.
− The bad visual quality of the ‘‘Caps’’ image can be
attributed to the structural distortions in both the background
and the objects in the image.
− The structural philosophy can also accurately predict the
good visual quality of the ‘‘Buildings’’ image, since the
structure of the image remains almost intact in both distorted
versions.
Image Structural Information
158
(a) Original ‘‘Caps’’ image (b) Original ‘‘Buildings’’ image
(c) JPEG compressed image, MSE = 160 (d) JPEG compressed image, MSE = 165
(e) JPEG 2000 compressed image, MSE =155 (f) AWGN corrupted image, MSE = 160.
159. − The HVS demonstrates luminance and contrast masking, which SSIM (also called Wang-Bovik Index) takes
into account while PSNR does not.
− The SSIM index executes three comparisons, in terms of
• l(x,y): Luminance Comparison Measurement
• c(x,y): Contrast Comparison Measurement
• s(x,y): Structure Comparison Measurement
− Where x and y are the original and processed pictures
− The value lies in [0, 1]
Structural SIMilarity (SSIM)
159
𝑆𝑆𝐼𝑀 𝑥, 𝑦 = 𝑓(𝑙 𝑥, 𝑦 , 𝑐 𝑥, 𝑦 , 𝑠 𝑥, 𝑦 )
161. − First the mean luminance is calculated
− Then a luminance comparison is executed
− C1 is used to stabilize the division when the denominator is weak
− The value of the parameter C1 is set to (K1·L)², where K1 << 1 is a small constant
− The luminance comparison attains its maximum possible value if and only if the means of the two images
are equal.
Luminance comparison: l(x,y)
161
μ_x = (1/N) Σ_{i=1}^{N} x_i
l(x, y) = (2 μ_x μ_y + C1) / (μ_x² + μ_y² + C1)
162. − The base contrast of each signal is computed using its standard deviation (the average luminance is
removed from the signal amplitude)
− A contrast comparison is computed as shown below
− C2 is used to stabilize the division when the denominator is weak.
− The value of the parameter C2 is set to (K2·L)², where K2 << 1 is a small constant.
− L is the dynamic range of the pixel values.
Contrast Comparison c(x,y)
162
σ_x = ((1/(N−1)) Σ_{i=1}^{N} (x_i − μ_x)²)^(1/2)
σ_y = ((1/(N−1)) Σ_{i=1}^{N} (y_i − μ_y)²)^(1/2)
c(x, y) = (2 σ_x σ_y + C2) / (σ_x² + σ_y² + C2)
163. − The structural comparison is performed between the luminance and contrast normalized signals.
− Let 𝒙 and 𝒚 represent vectors containing pixels from the reference and distorted images respectively.
− For each image, the average luminance is subtracted and the result is divided by the base contrast to normalize it.
− The correlation or inner product between luminance and contrast normalized signals is an effective
measure of the structural similarity.
− The correlation between the normalized vectors is equal to the correlation coefficient between the
original signals 𝒙 and 𝒚.
Structure Comparison s(x,y)
163
Normalized signals: (x − μ_x)/σ_x and (y − μ_y)/σ_y
164. − σ_xy is the covariance of x and y, a measure of the joint variability of x and y.
− Pearson’s correlation coefficient is the test statistic that measures the statistical relationship, or association,
between two continuous variables.
− A Pearson correlation coefficient is calculated as a measure of structural similarity
− C3 is used to stabilize the division when the denominator is weak.
Structure Comparison s(x,y)
164
s(x, y) = (σ_xy + C3) / (σ_x σ_y + C3)
σ_xy = (1/(N−1)) Σ_{i=1}^{N} (x_i − μ_x)(y_i − μ_y)
165. − The SSIM output is a combination of all three components
− This SSIM model is parameterized by α, β, γ, where typically the parameter values are all set to 1.
− C1, C2 and C3 are small constants added to avoid numerical instability when the denominators of the
fractions are small.
Final Value of SSIM
165
SSIM(x, y) = [l(x, y)]^α · [c(x, y)]^β · [s(x, y)]^γ
SSIM(x, y) = [(2 μ_x μ_y + C1)/(μ_x² + μ_y² + C1)]^α · [(2 σ_x σ_y + C2)/(σ_x² + σ_y² + C2)]^β · [(σ_xy + C3)/(σ_x σ_y + C3)]^γ
166. − Two constants C1 and C2 are used to stabilize the division when the denominator is weak
• C1 = (K1·L)², where K1 << 1 is a small constant (by default K1 = 0.01)
• C2 = (K2·L)², where K2 << 1 is a small constant (by default K2 = 0.03)
• L = 2^(number of bits per pixel) − 1, the dynamic range of the pixel values
− In order to simplify the expression, C3 = C2/2.
− Consequently we get
Final Value of SSIM
166
SSIM(x, y) = [(2 μ_x μ_y + C1)/(μ_x² + μ_y² + C1)] · [(2 σ_xy + C2)/(σ_x² + σ_y² + C2)]
MSSIM: this SSIM index is averaged over the image, so it is known as Mean SSIM or MSSIM.
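The simplified two-term formula can be implemented directly. The sketch below computes a single SSIM value over the whole image; the standard metric computes the index in local windows and averages the results (MSSIM):

```python
import numpy as np

def ssim_global(x, y, L=255, K1=0.01, K2=0.03):
    """Simplified two-term SSIM (C3 = C2/2) computed over the whole image."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)                 # sigma_x^2, sigma_y^2
    cov = ((x - mx) * (y - my)).sum() / (x.size - 1)      # sigma_xy
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

rng = np.random.default_rng(0)
ref = rng.uniform(0, 255, (32, 32))
print(round(ssim_global(ref, ref), 4))      # identical images -> 1.0
print(ssim_global(ref, ref + 20) < 1.0)     # a mean shift lowers the index
```

Note how a pure mean shift leaves the contrast and structure terms at 1 but reduces the luminance term, so the index drops below 1, unlike a case PSNR would treat the same as any other error of equal energy.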
173. SSIM vs. MOS
173
[Scatter plot of SSIM vs. MOS on a broad database of images distorted by JPEG, JPEG2000, white noise, Gaussian blur, and fast-fading noise, with a best-fitting logistic function of the form a / (1 + b·e^(−t/τ))]
What is important is that the data cluster closely about the curve.
179. − Some good aspects
• Considers correlation between signal and error
• Considers local region
− Some bad aspects
• Very hard to put into an optimization
• Still requires an original signal
• Does not incorporate many features of HVS
− Has received much attention on ways to improve it
SSIM Specifications
179
180. − The MSE/PSNR can be a poor predictor of visual fidelity.
− VSNR is an efficient metric for quantifying the visual fidelity of natural images based on near-threshold and
suprathreshold properties of human vision.
− It is efficient both in terms of low computational complexity and low memory requirements.
− It operates on physical luminances and visual angle (rather than on digital pixel values and pixel-based
dimensions) to accommodate different viewing conditions.
− This metric estimates visual fidelity by computing:
1. contrast thresholds for detection of the distortions
2. a measure of the perceived contrast of the distortions
3. a measure of the degree to which the distortions disrupt global precedence and, therefore, degrade
the image’s structure.
Visual SNR (VSNR)
180
Chandler and Hemami, “A wavelet-based visual signal-to-noise ratio for natural images”, IEEE Trans. Image Processing, 2007.
181. It operates via a two-stage approach.
1- Computing contrast thresholds for detection of distortions in the presence of natural images (many
subjective tests)
− The low-level HVS properties of contrast sensitivity and visual masking (visual summation) are used via a
wavelet-based model to determine whether the distortions are below the threshold of visual detection
(i.e., whether the distortions in the distorted image are visible).
− If the distortions are below the threshold of detection, the distorted image is deemed to be of perfect
visual fidelity (VSNR =∞ ).
Visual SNR (VSNR)
181
182. It operates via a two-stage approach.
2- If the distortions are suprathreshold, the following are taken into account as an alternative measure of
structural degradation:
• the low-level visual property of perceived contrast, including
I. the low-level HVS property of contrast sensitivity
II. the low-level HVS property of visual masking
• the mid-level visual property of global precedence (i.e., the visual system’s preference for integrating
edges in a coarse-to-fine-scale fashion)
− These two properties are modeled as Euclidean distances in distortion-contrast space of a multiscale
wavelet decomposition, and VSNR is computed based on a simple linear sum of these distances.
Visual SNR (VSNR)
182
185. − Relies on modeling of the statistical image source, the
image distortion channel and the human visual
distortion channel.
− VIF was developed for image and video quality
measurement based on natural scene statistics (NSS).
− Images come from a common class: the class of natural scenes.
Objective Assessment by Visual Information Fidelity (VIF)
185
[Block diagram: a natural image source produces C (the reference image); the distortion channel produces D (the test image); C and D each pass through an HVS channel, yielding E and F; mutual information / information content are computed between these signals]
186. − Image quality assessment is done based on information
fidelity where the channel imposes fundamental limits
on how much information could flow from the source
(the reference image), through the channel (the image
distortion process) to the receiver (the human observer).
Objective Assessment by Visual Information Fidelity (VIF)
186
Mutual information between C and E quantifies the
information that the brain could ideally extract from the
reference image, whereas the mutual information
between C and F quantifies the corresponding
information that could be extracted from the test image.
VIF = (Distorted Image Information) / (Reference Image Information)
[Same VIF block diagram as on the previous slide]
187. Objective Assessment by Visual Information Fidelity (VIF)
187
VIF = 0.5999 SSIM = 0.8558 VIF = 1.11 SSIM = 0.9272
− The VIF has a distinction over traditional quality assessment methods: a linear contrast enhancement of the
reference image that does not add noise will result in a VIF value larger than unity, signifying
that the enhanced image has visual quality superior to that of the reference image.
− No other quality assessment algorithm can predict whether the visual image quality has been
enhanced by a contrast enhancement operation.
194. − The advantage of the multi-scale methods, over single-scale methods, like SSIM, is that in multi-scale
methods image details at different resolutions and viewing conditions are incorporated into the quality
assessment algorithm.
Multi-scale Structural Similarity Index (MS-SSIM)
194
[Diagram: the i-th scale; L: low-pass filter; ↓2: downsampling by a factor of 2]
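The repeated low-pass-and-downsample stages can be sketched as follows. The 2×2 averaging filter below is a simplification of the actual low-pass filter, and the per-scale SSIM terms (computed on each pair and combined with per-scale exponents) are omitted:

```python
import numpy as np

def downsample2(img):
    """Low-pass with a 2x2 average, then downsample by a factor of 2."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w]
    return (x[0::2, 0::2] + x[0::2, 1::2] + x[1::2, 0::2] + x[1::2, 1::2]) / 4.0

def ms_scales(ref, dist, n_scales=3):
    """Build (ref, dist) image pairs at successively coarser scales."""
    pairs = [(ref, dist)]
    for _ in range(n_scales - 1):
        ref, dist = downsample2(ref), downsample2(dist)
        pairs.append((ref, dist))
    return pairs

rng = np.random.default_rng(0)
ref = rng.uniform(0, 255, (64, 64))
dist = ref + rng.normal(0, 5, ref.shape)
pairs = ms_scales(ref, dist)
print([p[0].shape for p in pairs])   # [(64, 64), (32, 32), (16, 16)]
```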
195. MAD assumes that the HVS employs different strategies when judging the quality of images.
Detection-based Strategy
• When viewing images containing near-threshold distortions, the HVS tries to look past the image
content, searching for the distortions.
• For estimating distortions in the detection-based strategy, local luminance and contrast masking are
used.
Appearance-based Strategy
• When viewing images containing clearly visible distortions, the HVS tries to look past the
distortions, searching for the image’s subject matter.
• For estimating distortions in the appearance-based strategy, variations in the local statistics of spatial
frequency components are employed.
MAD (Most Apparent Distortion) Algorithm
195
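The two strategies are blended into a single score with a weight that adapts to how visible the distortion is. The following is a minimal sketch of such an adaptive combination; the exponent form follows the MAD idea, but the beta constants here are illustrative placeholders, not the published values:

```python
def mad_combine(d_detect, d_appear, beta1=0.467, beta2=0.130):
    """Adaptively blend MAD's two strategies into one score.

    d_detect: detection-based distortion (near-threshold regime).
    d_appear: appearance-based distortion (suprathreshold regime).
    The blending weight alpha leans on detection for low-distortion
    images and shifts toward appearance as distortion grows.
    beta1/beta2 are illustrative placeholders.
    """
    alpha = 1.0 / (1.0 + beta1 * d_detect ** beta2)
    return (d_detect ** alpha) * (d_appear ** (1.0 - alpha))
```

Near threshold (small `d_detect`) the score is dominated by the detection term; for clearly visible distortions the weight shifts toward the appearance term.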
196. 196
MAD (Most Apparent Distortion) Algorithm
Detection-based and Appearance-based Strategies
The block diagram of the detection-based strategy in the MAD algorithm
The block diagram of the appearance-based strategy in the MAD algorithm.
197. − The FSIM algorithm is based on the fact that the HVS understands an image mainly through its low-
level features, e.g., edges and zero crossings.
− In order to assess the quality of an image, FSIM algorithm uses two kinds of features.
1. Phase Congruency (PC):
• Physiological and psychophysical experiments have demonstrated that at points with high phase
congruency (PC), HVS can extract highly informative features.
2. Gradient Magnitude (GM):
• Phase congruency (PC) is contrast invariant, yet our perception of an image’s quality is also
affected by the local contrast of that image.
• Because of this dependency, the image gradient magnitude (GM) is used as the secondary
feature in the FSIM algorithm.
Feature Similarity Index (FSIM)
197
198. Feature Similarity Index (FSIM)
198
Calculating the FSIM measure consists of two stages:
• Computing the PC and GM maps of the images
• Computing the similarity measure between the reference and test images.
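The second stage can be sketched directly from the two feature maps. A minimal version follows, assuming the PC and GM maps have already been computed (the log-Gabor filtering that produces PC is omitted; t1 and t2 are stabilizing constants whose values here are indicative, not authoritative):

```python
def fsim_sketch(pc1, pc2, gm1, gm2, t1=0.85, t2=160.0):
    """FSIM from precomputed feature maps (flat lists of equal length).

    pc1/pc2: phase-congruency maps of reference and test image;
    gm1/gm2: gradient-magnitude maps.  Each point's similarity is
    weighted by the more salient of the two PC values.
    """
    num = den = 0.0
    for a, b, g, h in zip(pc1, pc2, gm1, gm2):
        s_pc = (2 * a * b + t1) / (a * a + b * b + t1)   # PC similarity
        s_g = (2 * g * h + t2) / (g * g + h * h + t2)    # gradient similarity
        pc_m = max(a, b)            # weight: the more salient PC value
        num += s_pc * s_g * pc_m
        den += pc_m
    return num / den if den else 1.0
```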
199. 1. The detection thresholds are predicted and a perceptually normalized response map is generated.
2. The perceptually normalized response is decomposed into several bands of different orientations and
scales (using the cortex transform, i.e., a collection of band-pass and orientation-selective filters).
3. The conditional probability of each distortion type is calculated for prediction of three distortion types
separately for each band.
4. The probability of detecting a distortion in any subband is calculated.
Quality assessment of high dynamic range (HDR) images
Dynamic range independent quality measure (DRIM)
199
The block diagram of the DRIM algorithm.
200. − This metric is a combination of
• multi-scale structural fidelity measure
• statistical naturalness measure
− The TMQI algorithm consists of two stages
• structural fidelity measurement
• statistical naturalness measurement.
− At each scale, the local structural fidelity map is computed and averaged in order to obtain a single score S.
Quality assessment of high dynamic range (HDR) images
Tone-mapped images quality index (TMQI)
200
The block diagram of the TMQI algorithm.
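The final TMQI score combines the two components in a power-weighted sum. A minimal sketch follows; the combination form matches the TMQI idea, but the parameter values below are round illustrative numbers, not the published constants:

```python
def tmqi_sketch(s, n, a=0.8, alpha=0.3, beta=0.7):
    """Combine TMQI's two components into one score.

    s: multi-scale structural fidelity in [0, 1];
    n: statistical naturalness in [0, 1].
    a balances the two terms; alpha/beta shape their sensitivity.
    All three parameter values here are illustrative.
    """
    return a * s ** alpha + (1.0 - a) * n ** beta
```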
202. − VMAF is a perceptual video quality metric that models the human visual system.
− It predicts subjective quality by ‘fusing’ elementary metrics into a final quality metric using a machine-learning
algorithm, a Support Vector Machine (SVM) regressor, which assigns weights to each elementary metric.
− The machine-learning model is trained and tested using opinion scores obtained through a subjective experiment
(e.g., the NFLX Video Dataset).
− It correlates with subjective opinion better than PSNR over a wide quality range and content (Netflix).
− VMAF enables reliable codec comparisons across the broad range of bitrates and resolutions occurring in
adaptive streaming.
Elementary quality metrics
− Visual Information Fidelity (VIF): quality is complementary to the measure of information fidelity loss
− Detail Loss Metric (DLM): separately measure the loss of details which affects the content visibility, and the
redundant impairment which distracts viewer attention
− Motion: temporal difference between adjacent frames
VMAF – Video Multi-Method Assessment Fusion
202
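The fusion step can be sketched with a simple trained regressor. Here a least-squares linear model stands in for VMAF's SVM regressor, purely for illustration; the feature rows and scores used below are synthetic, whereas real VMAF trains on subjective datasets such as NFLX:

```python
def fit_fusion(features, mos):
    """Least-squares stand-in for VMAF's SVM regressor.

    features: list of [vif, dlm, motion] rows; mos: subjective scores.
    Solves the normal equations (X^T X) w = X^T y by Gaussian
    elimination to learn one weight per elementary metric plus a bias.
    """
    x = [row + [1.0] for row in features]            # append bias term
    n = len(x[0])
    a = [[sum(r[i] * r[j] for r in x) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * y for r, y in zip(x, mos)) for i in range(n)]
    for i in range(n):                               # forward elimination
        p = max(range(i, n), key=lambda k: abs(a[k][i]))
        a[i], a[p] = a[p], a[i]
        b[i], b[p] = b[p], b[i]
        for k in range(i + 1, n):
            f = a[k][i] / a[i][i]
            a[k] = [v - f * u for v, u in zip(a[k], a[i])]
            b[k] -= f * b[i]
    w = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, n))) / a[i][i]
    return w

def predict(w, feats):
    """Fused quality score for one frame's elementary metrics."""
    return sum(wi * fi for wi, fi in zip(w, feats + [1.0]))
```

Training on (feature, opinion-score) pairs and then predicting on unseen frames mirrors the VMAF pipeline, with the SVM swapped for a linear model.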
205. 205
Ex: Video Quality due to De-interlacing
De-interlacing: difficult to perform
Good quality converters exist (price!)
720p - DNxHD (Gen0) 1080i - DNxHD (Gen0) – de-interlace – 7
source de-interlaced
206. 206
Ex: Distribution Encoder and Bit-rate
Which bit-rate to choose?
Distribution channel ‘defines’ available bit-rate
MPEG-4 does a fine job
Motion in video is important
Uncompressed
AVC/DNxHD – 10 Mbit/s H.264
AVC/DNxHD – 16 Mbit/s H.264
AVC/DNxHD – 8 Mbit/s H.264
207. 207
Ex: Distribution Encoder and Production Codec
Uncompressed HD ~ 1 500 Mbit/s
Production bitrate ~ 100 Mbit/s
Distribution encoder ~ 10 Mbit/s
Do we actually see the influence of the production codec?
AVC/AVC + 8 Mbit/s H.264 DVC/DVC + 8 Mbit/s H.264
208. Example of Simulation Chain
208
Video Quality in Production Chain
Without Pixel shift
With Pixel shift (+2H, +6V)
Camera
Encoding
Post Production Encoding (4 Generations)
209. Example of Test Setup
209
Video Quality in Production Chain
Encode Decode
HD-SDI HD-SDI
Uncompressed
Source
Gen 0
(Cam)
Gen 1
(PP1)
Gen 2
(PP2)
Gen 3
(PP3)
Gen 3
shifted
Gen 4
(PP4)
Uncompressed YCbCr Storage
HD-SDI Ingest & Playout
VRT-medialab: research and innovation
210. − The multi-generation codec assessment simulates how the production chain affects the images as a result
of multi-compression and decompression stages.
− The multi-generation codec assessment has two steps:
1. The agreed method of multi-generation testing was to visually compare the 1st, 4th and 7th generations
(including pixel shifts after each generation) with the original image under defined conditions (reference
video monitor, particular viewing environment settings, and expert viewers).
2. In addition, an objective measurement – the PSNR – was calculated to give some general trend indication
of the multi-generation performance of the individual codecs.
210
Multi-generation Codec Assessment for Production Codec Tests
212. − The standalone chain without processing simply consisted of several cascaded generations of the codec
under test, without any other modifications to the picture content apart from those applied by the codec
under test.
− This process accurately simulates the effect of a simple dubbing of the sequence and is usually not very
challenging for the compression algorithm.
− This simple chain can provide useful information about
• the performance of the sub-sampling filtering that is applied.
• the precision of the mathematical implementation of the codec.
212
Standalone Chain (without Spatial Shift) for INTRA Codec
[Diagram: Input Video → Encoder/Decoder (First Generation) → Encoder/Decoder (Second Generation) → … → Encoder/Decoder (Seventh Generation)]
213. 213
Standalone Chain (without Spatial Shift) for INTRA Codec
In fact, the most important impact on the picture quality should be incurred at the first generation,
when the encoder has to eliminate some information.
The effect of the subsequent generations should be minimal, as the encoder should basically eliminate
the same information already deleted in the first compression step.
[Diagram: Input Video → Encoder/Decoder (First Generation) → Encoder/Decoder (Second Generation) → … → Encoder/Decoder (Seventh Generation)]
216. − In a real production chain, several manipulations are applied to the picture to produce the master
• Editing
• Zoom
• NLE
• Colour correction
− A realistic simulation has to take into account this issue.
− As all these processes are currently feasible only in the uncompressed domain, the effect of the
processing is simulated by spatially shifting the image horizontally (pixel) or vertically (lines) in between
each compression step.
− Obviously, this shift makes the task of the coder more challenging, especially for those algorithms based
on a division of the picture into blocks (e.g. NxN DCT block), as in any later generation the content of each
block is different to that in the previous generation.
216
Standalone Chain (with Spatial Shift) for INTRA Codec
217. − The shift process introduces black pixels on the edges of the frame if/when necessary.
− The shifts were applied variously using software or hardware, but the method used was exactly the same
for all the algorithms under test.
• Horizontal shift (H): “only even shifts” to take into account the chroma subsampling of the 4:2:2 format.
• Vertical shift (V): shift is applied on a frame basis and is always an “even value”.
Progressive formats:
− The whole frame is shifted by a number of lines corresponding to the vertical shift applied.
• For example, a shift equal to +2V means two lines down for progressive formats.
Interlaced formats:
− Each field is shifted by a number of lines corresponding to half the vertical shift applied.
• For example, a shift equal to +2V means 1 line down for each field of an interlaced format.
217
Standalone Chain (with Spatial Shift) for INTRA Codec
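The shift rules above can be sketched for a single frame stored as a 2-D list of pixels. The following is a minimal version assuming positive (right/down) shifts only:

```python
def spatial_shift(frame, dh, dv, black=0):
    """Shift a frame right by dh pixels and down by dv lines, padding
    the exposed edges with black, as applied between generations in
    the multi-generation chain.

    Even shifts only: dh even to respect the 4:2:2 chroma subsampling,
    dv even so interlaced content moves dv // 2 lines per field and
    field parity is preserved.
    """
    assert dh % 2 == 0 and dv % 2 == 0, "only even shifts are allowed"
    h, w = len(frame), len(frame[0])
    out = [[black] * w for _ in range(h)]    # exposed edges stay black
    for y in range(dv, h):
        out[y][dh:] = frame[y - dv][:w - dh]
    return out
```

Applying, e.g., `spatial_shift(frame, 2, 6)` between generations reproduces the (+2H, +6V) condition used in the simulation chain.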
219. − The GoP structure has some important implications for the way the standalone chain has to be realized,
and introduces a further variable in the way the multi-generation test can be performed, depending on
whether the GoP alignment is guaranteed between each generation (GoP aligned) or not (GoP mis-aligned).
− The GoP is considered aligned if a frame of the original picture that is encoded at the first generation
using one of the three possible frame types (Intra, Predicted or Bidirectional) is encoded using that same
frame type in all the following generations.
− It is therefore possible to have only one multi-generation chain with “GoP alignment”.
219
Standalone Chain with GoP Alignment (without Spatial Shift) for INTER Codec
[Diagram: Input Video → Encoder/Decoder (First Generation) → Encoder/Decoder (Second Generation) → … → Encoder/Decoder (Seventh Generation)]
Ex: frame n of the original sequence is always encoded as Intra and frame n+6 as Predicted in all generations.
221. − If GoP alignment is not guaranteed, several conditions of GoP mis-alignment are possible
If GoP length L=12 → for the second generation 11 different GoP mis-alignments are possible
→ for the third generation 11 by 11 different GoP mis-alignments are possible
and so on
→ making the testing of all the possible conditions unrealistic.
− It was therefore agreed to apply one “temporal shift” equal to one frame between each generation, so
that the frame that is encoded in Intra mode in the first generation is encoded in Bidirectional mode in the
second generation and, in general, in a different mode for each following generation.
− It is interesting to underline that the alignment of the GoP in the different generations was under control
(not random) and that this was considered the likely worst case as far as the mis-alignment effect is
concerned, and was referred to in the documents as “chain without GoP alignment”.
221
Standalone Chain without GoP Alignment
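The effect of the one-frame temporal shift can be sketched with a toy frame-type function. The IBBP layout below (GoP length 12, I/P anchors every third frame) is an illustrative assumption, not the layout mandated by the tests:

```python
def frame_type(n, offset=0, gop=12, sub=3):
    """Coding type of frame n when `offset` frames of temporal shift
    have accumulated before this generation (illustrative IBBP GoP)."""
    pos = (n + offset) % gop
    if pos == 0:
        return "I"
    return "P" if pos % sub == 0 else "B"

# With GoP alignment, frame 0 is Intra at every generation; with a
# one-frame temporal shift per generation, its type changes each time.
aligned = [frame_type(0, offset=0) for g in range(3)]
misaligned = [frame_type(0, offset=g) for g in range(3)]
```

This reproduces the mis-alignment condition described above: the frame encoded in Intra mode at the first generation is encoded in Bidirectional mode at the second.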
223. − Four different possible standalone chains up to the seventh generation:
• Multigeneration chain with GoP alignment (without spatial shift)
• Multigeneration chain without GoP alignment (without spatial shift)
• Multigeneration chain with GoP alignment and spatial shift
• Multigeneration chain without GoP alignment and spatial shift
− A procedure to re-establish the spatial alignment between the original and the de-compressed version of the
test sequence is applied.
− 16 pixels on the edges of the picture are skipped to avoid taking measurements on the black pixels introduced
during the shift.
− PSNR does not correlate accurately with the picture quality and thus it would be misleading to directly compare
PSNR from very different algorithms.
− PSNR can provide information about the behaviour of the compression algorithm through the multi-generation
process.
223
Objective Measurements by PSNR
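Measured per generation, the trend can be computed with a plain PSNR that honours the edge-skipping rule above. A minimal sketch for 8-bit frames stored as 2-D lists:

```python
import math

def psnr(ref, dec, peak=255.0, margin=16):
    """PSNR between reference and decoded frames (2-D lists),
    skipping `margin` pixels at each edge so the black pixels
    introduced by the spatial shifts are not measured."""
    h, w = len(ref), len(ref[0])
    diffs = [(ref[y][x] - dec[y][x]) ** 2
             for y in range(margin, h - margin)
             for x in range(margin, w - margin)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak * peak / mse)
```

Running this on the output of each generation gives the multi-generation trend curve the tests report.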
224. − Both objective measurements (PSNR) and visual scrutiny of the picture (i.e. expert viewing) are used for
video quality evaluation.
− They are considered to be complementary.
224
Ex. 1, EBU R124-2008
225. − The viewing distance is 3H (HDTV).
− Sometimes a closer viewing distance, e.g. 1H, was used to closely observe small details and artefacts and,
when used, this condition was clearly noted in the report.
225
Ex. 1, EBU R124-2008
Compressed (Impaired) version
(e.g. Seventh generation
with spatial shift)
Original
The following displays were used during the tests (HDTV):
• CRT 32” Sony Type BVM-A32E1WM
• CRT 20” Sony Type BVM-A20F1M
• Plasma Full HD 50” Type TH50PK9EK Panasonic
• LCD 47” Type Focus
The displays and the room conditions were aligned
according to the conditions described in ITU-R BT.500-11.
• For acquisition applications, an HDTV format with 4:2:2 sampling should be used; no further horizontal or
vertical sub-sampling should be applied.
• An 8-bit bit-depth is sufficient for mainstream programmes.
• A 10-bit bit-depth is preferred for high-end acquisition.
− For production applications of mainstream HD, the EBU tests have found no reason to relax the
requirement placed on SDTV studio codecs that “Quasi-transparent quality” must be maintained after 7
cycles of encoding and recoding with horizontal and vertical pixel-shifts applied.
− All tested codecs have shown quasi-transparent quality up to at least 4 to 5 multi-generations, but have
also shown a few impairments, such as noise or loss of resolution, with critical images at the 7th generation.
→ Thus EBU Members are required to carefully design the production workflow and to avoid 7 multi-
generation steps.
226
Ex. 1, EBU R124-2008
In document R124-2008, the EBU recommends that:
− If the production/archiving format is to be based on I-frames only, the bitrate should not be less than 100
Mbit/s.
− If the production/archiving format is to be based on long-GoP MPEG-2, the bitrate should not be less than
50 Mbit/s.
− Furthermore, the expert viewing tests have revealed that:
• A 10-bit bit-depth in production is only significant for post-production with graphics and after
transmission encoding and decoding at the consumer end, if the content has been generated using
advanced colour grading, etc (e.g. graphics or animation).
• For normal moving pictures, an 8-bit bit-depth in production will not significantly degrade the HD
picture quality at the consumer’s premises.
227
Ex. 1, EBU R124-2008
228. − A contribution link was simulated by passing a signal
twice through the codec under test, with a spatial pixel
shift introduced between codec passes.
− This is equivalent to a signal passing through a pair of
cascaded codecs that would typically be
encountered on a contribution link.
− 4:2:2 colour sampling is required for professional
contribution applications.
228
Ex. 2, Simulation of a Typical Contribution Link
[Diagram: Source → Encoder/Decoder (1st Pass) → Pixel Shift (2 pixels horizontally, 2 pixels vertically) → Encoder/Decoder (2nd Pass) → Final Output]
229. − The equivalent quality using H.264/AVC was found subjectively by viewing the H.264/AVC sequences at
different bit-rates, starting at half the MPEG-2 bit-rate and increasing it by steps of 10% (of the MPEG-2
reference bit rate) until the quality of the MPEG-2 encoded sequence and that of the H.264/AVC
encoded sequence was judged to be equivalent.
− The subjective evaluation mainly tested the 2nd generation sequences, which can actually allow better
discrimination of the feeds than the 1st generation.
229
Ex. 2, Simulation of a Typical Contribution Link (Cont.)
EBU Triple Stimulus Continuous Evaluation Scale
(TSCES) quality assessment rig
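The rate-finding procedure above can be sketched as a simple loop. `judged_equivalent` is a hypothetical callback standing in for the subjective TSCES verdict; the search is capped at the MPEG-2 reference rate:

```python
def equivalent_bitrate(mpeg2_rate, judged_equivalent):
    """Bit-rate ladder from the procedure described above: start
    H.264/AVC at half the MPEG-2 rate and raise it in steps of 10%
    of the MPEG-2 reference rate until viewers judge the quality
    equivalent.  `judged_equivalent(rate)` is a hypothetical stand-in
    for the subjective viewing verdict.
    """
    rate = mpeg2_rate / 2.0
    step = mpeg2_rate * 0.10
    while rate <= mpeg2_rate and not judged_equivalent(rate):
        rate += step
    return rate
```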