How AI research is enabling next-gen codecs
Qualcomm Technologies, Inc. | July 14, 2021 | @QCOMResearch
Agenda
1. The demand for improved data compression is growing
2. AI is a compelling tool for compression
3. Our latest AI voice and video codec research
4. Our on-device neural video decoder demo
5. Future research work and challenges
The scale of video and voice being created and consumed is massive
- 1M minutes of video crossing the internet per second
- 15B minutes of talking per day on WhatsApp calls
- 82% of all consumer internet traffic is online video
- 76 minutes per day spent watching video on digital devices by US adults
- 8B average daily video views on Facebook
Sources: Cisco Visual Networking Index: Forecast and Trends, 2017–2022; WhatsApp blog, 4/28/20
Increasingly, video is all around us — providing entertainment, enhancing collaboration, and transforming industries: video streaming, video conferencing, extended reality, sports, smartphones, smart cities, smart factories, and autonomous vehicles.
Video technology revolutionized how we create and consume media
82% of internet traffic will be video by 2022¹
Enhanced video quality with fewer bits led to broad adoption across a wide range of devices and services.
1. Cisco Annual Internet Report, 2018–2023
Technical innovation drives video evolution
A regular cadence of technical advancement in video codecs has led to a massive reduction in file size: roughly ~1000x with VVC versus no compression.¹
- 1995, MPEG-2: the first evolution, focused initially on optical disc media (DVD, digital TV).
- 2003, AVC: introduced flexibility and internet video (Blu-ray, HDTV, internet); ~50% compression improvement.
- 2013, HEVC: increased efficiency for emerging video applications and technology (streaming, 4K, HDR, 360˚); ~50% compression improvement.
- 2020, EVC and VVC: continued technical advancement and robust network efficiency, enabling user experiences as demand surges (8K, immersive video); ~30–40% compression improvement.
This evolution transformed video compression, enriching experiences and driving proliferation.
1. On a 1080p video at 30 frames per second; EVC = Essential Video Coding, VVC = Versatile Video Coding
Visual quality is much more than resolution and frame rate
Preserving color accuracy, contrast, and brightness is also crucial, as is broad interoperability across all our screens.
Pixel quantity:
- Resolution: increased definition and sharpness
- Frame rate: reduced blurring and latency
Pixel fidelity:
- Color accuracy: more realistic colors through an expanded color gamut, depth, and temperature
- Contrast and brightness: increased detail through a larger dynamic range and lighting enhancements
Deep generative model research for unsupervised learning
Given unlabeled training data, generate new samples from the same distribution.
Generative models include the variational autoencoder (VAE)*, invertible models, auto-regressive models, and the generative adversarial network (GAN).
Powerful capabilities:
- Extract features by learning a low-dimensional feature representation
- Sample to generate, restore, predict, or compress data
Broad applications: computational photography, text to speech, graphics rendering, speech/video compression, and voice UI.
* VAE first introduced by D. Kingma and M. Welling in 2013
In a VAE, the encoder part takes unlabeled input data, such as images, and extracts features by learning a low-dimensional feature representation; the decoder part samples from that representation to generate, restore, predict, or compress data, producing an output as close as possible to the input.
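A minimal sketch of this encoder/decoder structure in PyTorch; the layer sizes and the MSE reconstruction term are illustrative assumptions, not any production codec:

```python
# Minimal VAE sketch: encoder learns q(z|x), decoder reconstructs p(x|z).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 256)           # encoder body
        self.to_mu = nn.Linear(256, latent_dim)     # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim) # log-variance of q(z|x)
        self.dec = nn.Sequential(                   # decoder p(x|z)
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(z)
        # Reconstruction term + KL divergence to the unit-Gaussian prior.
        recon = F.mse_loss(x_hat, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return x_hat, recon + kl
```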
AI-based compression has compelling benefits:
- No special-purpose hardware required, other than an AI accelerator
- Easy to upgrade, standardize, and deploy new codecs
- Can be specialized to a specific data distribution
- Easy to develop new codecs for new modalities
- Improved rate-distortion trade-off
- Can be optimized for advanced perceptual quality metrics
- Semantics-aware for human visual perception
- Can generate visual details not in the bitstream
Achieving state-of-the-art data compression with AI
An AI speech codec passes input speech through a learned encoder, transmits the compressed speech, and reconstructs output speech with a learned decoder, achieving 2.6x bitrate compression at the same speech quality.¹
1. Comparison between a state-of-the-art traditional speech coding algorithm and AI speech compression
Novel machine learning-based video codec research
Feedback Recurrent Autoencoder for Video Compression (Goliński et al., arXiv 2020)
End-to-end deep learning: neural-network-based I-frame and P-frame compression, achieving state-of-the-art rate-distortion compared with other learned video compression solutions.
[Pipeline diagram: at T=1, the first frame goes through an I-frame encoder, entropy encoding, bits, entropy decoding, and an I-frame decoder; at T=2, T=3, and onward, each frame goes through motion estimation, a P-frame encoder, entropy encoding, bits, entropy decoding, and a P-frame decoder, followed by warp and sum against the previous reconstruction. Visualized outputs: reconstruction, ground truth, motion vector, residual.]
Video used in images is produced by Netflix, with CC BY-NC-ND 4.0 license: https://media.xiph.org/video/derf/ElFuente/Netflix_Tango_Copyright.txt
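A conceptual sketch of one P-frame step from such a pipeline; `motion_net`, `residual_codec`, and the `warp` helper below are hypothetical stand-ins for the paper's learned modules, not its implementation:

```python
# One neural P-frame step: estimate motion, warp the previous
# reconstruction, compress the residual, then "warp + sum".
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Bilinear warp of `frame` (N,C,H,W) by optical `flow` (N,2,H,W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2,H,W)
    coords = grid.unsqueeze(0) + flow                 # displaced pixel coords
    # Normalize to [-1, 1], the range grid_sample expects.
    coords_x = 2 * coords[:, 0] / (w - 1) - 1
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)       # (N,H,W,2)
    return F.grid_sample(frame, sample_grid, align_corners=True)

def p_frame_step(x_t, x_prev_rec, motion_net, residual_codec):
    flow, flow_bits = motion_net(x_t, x_prev_rec)    # compressed motion
    pred = warp(x_prev_rec, flow)                    # motion-compensated prediction
    res_hat, res_bits = residual_codec(x_t - pred)   # compressed residual
    x_t_rec = pred + res_hat                         # "warp + sum" reconstruction
    return x_t_rec, flow_bits + res_bits
```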
Video compression utilizes temporal and spatial redundancy
A group of pictures (GOP) is a sequence of video frame types used to efficiently encode data:
- I-frame (intra): independently codes each frame; exploits spatial redundancy.
- P-frame (predicted): codes changes based on the previous frame; exploits spatial and temporal redundancy.
- B-frame (bi-directional): codes changes based on the previous and next frames; exploits spatial and temporal redundancy.
[Diagram: example GOP structures, from all-I, to I- and P-frames, to I-, P-, and B-frames, with increasing complexity and bitrate compression.]
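A toy helper that lays out frame types in a GOP; the GOP size and P-frame spacing are arbitrary example parameters, not a codec's actual configuration:

```python
# Assign a frame type to each position in a group of pictures.
def gop_frame_types(gop_size=9, p_spacing=4, use_b_frames=True):
    types = []
    for i in range(gop_size):
        if i == 0:
            types.append("I")                # intra: spatial redundancy only
        elif i % p_spacing == 0:
            types.append("P")                # predicted from the previous frame
        else:
            # B-frames reference both the previous and the next anchor frame.
            types.append("B" if use_b_frames else "P")
    return types

print(gop_frame_types())  # ['I', 'B', 'B', 'B', 'P', 'B', 'B', 'B', 'P']
```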
Existing research for B-frame codecs has limitations
Our solution combines the best of existing P-frame and B-frame codecs while allowing them to share weights.
- Existing P-frame codec: motion compression, motion compensation, and residual compression turn a single reference (Ref) plus the input into the output.
- Existing B-frame codec with bidirectional motion: bidirectional motion compression and compensation plus residual compression over two references (Ref0, Ref1); limitation: high correlation between the two flows (motion constancy).
- Existing B-frame codec with frame interpolation: frame interpolation between Ref0 and Ref1 plus residual compression; limitation: does not work well for large motion.
- Our B-frame codec: frame interpolation between Ref0 and Ref1 produces the prediction, which is then fed to a P-frame codec; P-frame and B-frame codecs can share weights.
B-frame compression through the Extended P-frame & Interpolation Codec (B-EPIC)
Extending Neural P-frame Codecs for B-frame Coding (Pourreza et al., arXiv 2021)
Illustration of different inter-frame coding methods, where $\mathbf{x}_t$ is the input frame, $\tilde{\mathbf{x}}_t$ the corresponding prediction, and $\hat{\mathbf{x}}_{\mathrm{ref}}$ a reference:
- P-frame prediction: a motion vector with respect to a single reference $\hat{\mathbf{x}}_{\mathrm{ref}}$ is transmitted.
- B-frame prediction based on bidirectional flow/warp: two motion vectors with respect to two references $\hat{\mathbf{x}}_{\mathrm{ref}_0}$ and $\hat{\mathbf{x}}_{\mathrm{ref}_1}$ are transmitted.
- B-frame prediction based on frame interpolation: the interpolation result is treated as the prediction; no motion information is transmitted.
- Our B-frame prediction approach: the interpolation result is corrected using a unidirectional motion vector, similar to a P-frame.
[Diagram legend: actual object location, predicted object location, motion vector, prediction error, interpolation, interpolation guide.]
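A short sketch of this recipe: interpolate, then let a P-frame codec correct the result. `interpolate_frames` and `p_frame_codec` are assumed learned modules, not the paper's code:

```python
# B-EPIC idea: the interpolated frame becomes the reference for an
# ordinary P-frame step, so both codecs can share weights.
def b_epic_frame(x_t, ref0, ref1, interpolate_frames, p_frame_codec):
    # 1. Interpolate between the two references; no bits are spent here.
    pred = interpolate_frames(ref0, ref1)
    # 2. Correct the interpolation with a unidirectional P-frame step,
    #    transmitting one motion field and one residual.
    x_t_rec, bits = p_frame_codec(x_t, reference=pred)
    return x_t_rec, bits
```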
Our B-frame coding provides state-of-the-art results
R. Pourreza, T. Cohen, Extending Neural P-frame Codecs for B-frame Coding, under review at ICCV 2021
Improved rate-distortion tradeoff by extending neural P-frame codecs for B-frame coding.
[Rate-distortion plots: PSNR (dB, 32–41) and MS-SSIM (0.92–1.0) versus rate (0–0.6 bits per pixel), comparing Ours, Google [CVPR-20], and H.265 (FFmpeg).]
Overfitting through instance-adaptive video compression
Improved compression via continual learning: overfit a model on one video instance and encode weight-deltas in the bitstream. This pays off whenever the model delta plus the new encoded bitstream is smaller than the baseline bitstream.
- Sender and receiver share knowledge: the global decoder weights $\theta_\mathcal{D}$, a latent prior, and a model prior.
- The sender overfits the encoder $q_\varphi(\boldsymbol{z}|\boldsymbol{x})$ and decoder $p_\theta(\boldsymbol{x}|\boldsymbol{z})$ on the video, then sends the weight-deltas $\bar{\delta}$ (entropy-coded under the model prior) together with the smaller encoded bitstream $\boldsymbol{b_z}$ (entropy-coded under the latent prior).
- The receiver applies the deltas to its copy of the decoder and decodes the bitstream into the reconstruction $\hat{\boldsymbol{x}}$.
T. van Rozendaal*, I.A.M. Huijben*, T. Cohen, Overfitting for Fun and Profit: Instance-Adaptive Data Compression, ICLR 2021
T. van Rozendaal, Y. Zhang, J. Brehmer, T. Cohen, Instance-Adaptive Video Compression: Improving Neural Codecs by Training on the Test Set, under review at ICCV 2021
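A hedged sketch of the overfitting loop, assuming a model that returns a reconstruction and a rate estimate; the loss weighting and delta packaging are illustrative, not the papers' implementation:

```python
# Fine-tune on the single video being sent, then extract weight deltas.
import copy
import torch

def overfit_and_package(model, video_frames, steps=500, lr=1e-4):
    base = copy.deepcopy(model)                 # shared global model
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        x_hat, rate = model(video_frames)       # reconstruction + rate estimate
        loss = torch.mean((x_hat - video_frames) ** 2) + 0.01 * rate
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Weight deltas are quantized and entropy-coded under a model prior;
    # sending them only pays off if delta_bits + new_bits < baseline_bits.
    deltas = {name: (p - p_base).detach()
              for (name, p), (_, p_base) in zip(model.named_parameters(),
                                                base.named_parameters())}
    return deltas
```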
Mobile-friendly deployment
- Decoding complexity can be reduced by 72% while still maintaining SOTA results.
- SOTA results from instance-adaptive video compression: 24% BD-rate savings over the leading neural codec by Google, and 29% BD-rate savings over H.265 (FFmpeg).
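BD-rate summarizes the average bitrate difference between two rate-distortion curves at equal quality. A common numerical recipe is sketched below with made-up example points (not the numbers above): fit quality versus log-rate with cubic polynomials and integrate the gap.

```python
# Bjontegaard delta rate between a reference and a test codec.
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    # Fit log-rate as a function of PSNR for each codec.
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    # Integrate both fits over the common quality interval.
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100   # percent rate change at equal PSNR

# Made-up points; a negative result means the test codec saves rate (~-20%).
print(bd_rate([100, 200, 400, 800], [32, 35, 38, 41],
              [ 80, 160, 320, 640], [32, 35, 38, 41]))
```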
Variable bitrate image compression for simpler deployment
Three levels of solutions, in increasing order of design difficulty and ease of deployment, with the goal of a single model producing a single bitstream:
- Level 0, multi-model multi-bitstream: each encoder-decoder pair is optimized for a fixed β, so N models are needed for N rate-distortion tradeoffs.
- Level 1, single-model multi-bitstream: a single encoder-decoder model; certain parameters/inputs (β1, β2, β3) are used to steer the R-D tradeoff.
- Level 2, single-model single-bitstream: a single encoder-decoder model; encode once and embed all bitrates (a.k.a. embedded/progressive/scalable coding).
Latent scaling for a single-model multi-bitstream solution
A Level 1 approach that applies a scaling factor to the latent for a different trade-off between rate and distortion: the latent $y$ is multiplied by $1/s$ before quantization and entropy coding, and multiplied back by $s$ after entropy decoding, yielding $y^{(s)}$.
Notation: $g_a$ = analysis transform / encoder; $g_s$ = synthesis transform / decoder; $x$ = input image; $\hat{x}$ = reconstructed image; $s$ = scaling factor that determines the bitrate; $\mu, \sigma$ = parameters of the prior $\mathcal{N}(\mu, \sigma)$ produced by the hyper-codec; EC = entropy coding; ED = entropy decoding.
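In code, the mechanism is only a few lines. This sketch assumes trained analysis/synthesis transforms `g_a` and `g_s` and omits the entropy coder:

```python
# Latent scaling: one trained model, many bitrates.
import torch

def encode_at_scale(g_a, x, s):
    y = g_a(x)                    # analysis transform
    y_q = torch.round(y / s)      # coarser quantization bins as s grows
    return y_q                    # entropy-coded under the scaled prior

def decode_at_scale(g_s, y_q, s):
    y_hat = y_q * s               # undo the scaling
    return g_s(y_hat)             # synthesis transform

# Larger s -> fewer distinct symbols -> lower rate, higher distortion.
```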
Nested quantization for a single-model single-bitstream solution
A Level 2 approach that uses multiple quantization levels and conditional probability to create a single bitstream with multiple quality levels: the latent is quantized at a coarse scale s2 and refined with finer bins at scale s1 for increased quality.
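A toy sketch of the idea, assuming the fine scale evenly subdivides the coarse bins; the actual quantizer and conditional probability model in the paper are more elaborate:

```python
# Nested quantization: a coarse base layer plus a refinement layer that
# says which fine bin inside each coarse bin was hit.
import torch

def nested_quantize(y, s1=0.25, s2=1.0):
    coarse = torch.round(y / s2)    # base-layer symbols
    fine = torch.round(y / s1)      # full-quality symbols
    # Refinement = offset of the fine bin within its coarse bin; its
    # probabilities can be modeled conditionally on `coarse`.
    refinement = fine - torch.round(coarse * s2 / s1)
    return coarse, refinement

def nested_dequantize(coarse, refinement, s1=0.25, s2=1.0, with_refinement=True):
    if not with_refinement:         # decode only the bitstream prefix
        return coarse * s2
    return (torch.round(coarse * s2 / s1) + refinement) * s1
```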
Variable-bitrate progressive neural image compression
Y. Lu, Y. Zhu, Y. Yang, A. Said, T. Cohen, Progressive Neural Image Compression with Nested Quantization and Latent Ordering, accepted at ICIP 2021
Achieves comparable performance to HEVC Intra but uses only a single model and a single bitstream:
- HEVC Intra [1-model, N-bitstream, Level 1]: separate bitstreams, one for each quality.
- Our scheme [1-model, 1-bitstream, Level 2]: a single bitstream where the prefixes decode to different qualities.
[Rate-distortion comparison also includes an [N-model, N-bitstream] Level 0 baseline.]
Semantic-aware image compression for improved quality
Allocate more bits to regions of interest.
[Images: source, baseline bit allocation, semantic-aware bit allocation.]
Semantic compression focuses on regions of interest
End-to-end deep learning on a foreground/background-weighted distortion loss: a mask encoder turns the segmentation mask into a mask code, the main encoder turns the input image into a code, and the decoder combines the decoded mask and code to produce the reconstructed image, trained with a foreground/background loss that applies a gain to the region of interest.
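The weighting itself is simple; below is a sketch with illustrative foreground/background weights (the actual gain values are not given in the source):

```python
# Foreground/background-weighted distortion: the codec is pushed to spend
# its bits inside the segmentation mask (the region of interest).
import torch

def weighted_distortion(x, x_hat, mask, w_fg=5.0, w_bg=1.0):
    # mask: 1 inside the region of interest, 0 elsewhere (same H,W as x).
    weights = w_fg * mask + w_bg * (1.0 - mask)
    return torch.mean(weights * (x - x_hat) ** 2)
```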
Semantic-aware image compression improves quality where it matters
State-of-the-art results for rate-distortion tradeoff: about a 5 dB PSNR gain in the region of interest (ROI) against about a 2 dB loss at non-ROI.
[Rate-distortion plot: RGB PSNR (dB, 28–42) versus rate (0.2–1.4 bits per pixel), comparing Segmentation AE (ROI, non-ROI) and Hyperprior AE (ROI, non-ROI).]
Next step: extend to video compression.
The tradeoff between perception and distortion metrics
Optimizing distortion and/or perceptual quality leads to different reconstructions:
- Rate + Distortion: very good distortion, very bad perceptual quality.
- Rate + Perception: very bad distortion, very good perceptual quality.
- Rate + Distortion + Perception: good distortion, good perceptual quality.
How GAN-based codecs work
GAN-based codecs can produce much more visually appealing content at very low bits per pixel: the encoder maps the data distribution to a latent with prior P(z), the generator reconstructs content from z, and the discriminator compares P(real | z) against P(real), pushing reconstructions toward the data distribution.
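Sketched below is the kind of objective such a codec might train with: rate plus distortion plus an adversarial term. The modules, conditioning, and loss weights are illustrative assumptions, not a specific published recipe:

```python
# Generator-side loss for a GAN-based codec: the adversarial term lets the
# generator synthesize plausible detail that is not in the bitstream.
import torch
import torch.nn.functional as F

def generator_loss(x, encoder, generator, discriminator,
                   lam_rate=1.0, lam_dist=10.0, lam_adv=0.1):
    z, rate_bits = encoder(x)            # latent + estimated rate
    x_hat = generator(z)
    d_fake = discriminator(x_hat, z)     # conditional discriminator score
    adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    dist = F.mse_loss(x_hat, x)
    return lam_rate * rate_bits.mean() + lam_dist * dist + lam_adv * adv
```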
Example on a 512x1024 image:
- Raw: 2097.15 kB at 8.0 bpp
- BPG (QP=40): 16.4 kB at 0.250 bpp, a 128x reduction
- GAN codec: 14.5 kB at 0.221 bpp, a 145x reduction
Real-time on-device neural video decode is now possible
Research demo showcases all-intra neural video decode on a smartphone; inter-frame decoding is next.
- 1280 × 704 resolution (comparable to HD 720p) at 30+ frames per second
- Mobile device powered by the Snapdragon® 888 Mobile Platform
- The encoder and parallel entropy encoding run as offline processing; on device, parallel entropy decoding runs on the CPU and the neural decoder on the AI accelerator, consuming the bitstream.
Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
Efficient on-device neural decoding
Limited to the I-frame (intra-frame) component of neural video decoding; inter-frame decoding is coming soon.
- Step 1: decoder architecture refinement, starting from the floating-point neural codec.
- Step 2: parallel entropy coding (PEC) replaces serial entropy coding between encoder and decoder.
- Step 3: AIMET quantization-aware training yields an 8-bit model with efficient decoding.
Parallel entropy coding + architecture refinement + 8-bit → fast inference; AIMET quantization-aware training → improved quantized model accuracy.
AI Model Efficiency Toolkit (AIMET): https://github.com/quic/aimet
AIMET is a product of Qualcomm Innovation Center, Inc.
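Below is a generic sketch of the core trick behind quantization-aware training: fake-quantize tensors in the forward pass and let gradients flow straight through, so the model learns weights that survive 8-bit quantization. This illustrates the concept only; it is not the AIMET API (see the repo linked above for that):

```python
# Fake quantization with a straight-through gradient estimator.
import torch

def fake_quant(x, num_bits=8):
    qmax = 2 ** num_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / qmax
    zero = x.min()
    q = torch.round((x - zero) / scale).clamp(0, qmax)
    x_q = q * scale + zero
    # Forward pass sees the quantized value; backward pass sees identity.
    return x + (x_q - x).detach()
```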
Demo video: real-time neural video decoding on a mobile device
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
Challenges for mass deployment of neural video compression
- Perceptual quality: need to develop differentiable perceptual quality metrics. GANs are promising, but not perfect, and video GANs are challenging to train due to computational requirements.
- Computational efficiency: real-time on-device video encoding and decoding is still a challenge.
- Rate-distortion improvement: bitrate needs to continue to get smaller for the foreseeable future as video demand increases.
- New modalities: lots of work to do on new modalities like point clouds, omnidirectional video, multi-camera setups (e.g., in autonomous vehicles), etc.
- AI-based compression has the potential to more efficiently share video and voice across devices.
- We are conducting leading research and development in end-to-end deep learning for video and speech coding.
- We are solving implementation challenges and demonstrating a real-time neural video decoder running on a smartphone.
Thank you
Questions?
Connect with us. Follow us on:
@QCOMResearch
www.qualcomm.com/ai
www.qualcomm.com/news/onq
https://www.youtube.com/qualcomm
http://www.slideshare.net/qualcommwirelessevolution
For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog
Nothing in these materials is an offer to sell any of the
components or devices referenced herein.
©2018–2021 Qualcomm Technologies, Inc. and/or its
affiliated companies. All Rights Reserved.
Qualcomm and Snapdragon are trademarks or registered
trademarks of Qualcomm Incorporated. Other products
and brand names may be trademarks or registered
trademarks of their respective owners.
References in this presentation to “Qualcomm” may mean Qualcomm
Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries
or business units within the Qualcomm corporate structure, as
applicable. Qualcomm Incorporated includes our licensing business,
QTL, and the vast majority of our patent portfolio. Qualcomm
Technologies, Inc., a subsidiary of Qualcomm Incorporated, operates,
along with its subsidiaries, substantially all of our engineering,
research and development functions, and substantially all of our
products and services businesses, including our QCT semiconductor
business.