Super-resolution
in the deep learning era
Jaejun Yoo
Ph.D. AI Scientist
@NAVER Clova AI
Seminar, Math Dept.,
KAIST
Inverse problems
[Diagram] Unknown original data x passes through a system H and is corrupted by noise n, producing the observed data y; the inverse problem is to recover the unknown x from y.
Model: $y = H(x) + n$
Inverse problems: scattering
[Diagram] An original density is mapped by a physical system (governed by the Maxwell or Lamé equations) to an observed field, e.g. an electromagnetic or acoustic wave, and corrupted by noise; the goal is to restore the density for physical property reconstruction* or source localization.
Model: $y = H(x) + n$
* Yoo J et al., SIAM, 2016
Inverse problems: image restoration (IR)
[Diagram] An original image (e.g. a natural image) is blurred, downsampled, and corrupted by noise, producing the observed image; the goal is to restore the image, e.g. denoising & super-resolution*.
Model: $d = G(x) + n$
* Bae W and Yoo J, CVPRW, 2017
Image restoration (IR)
Model: $y = H(x) + n$ (system H, additive noise n)
By specifying a different degradation operator H, one obtains a different IR task:
• Deblurring or deconvolution: $Hx = k \circledast x$
• Super-resolution: $Hx = (k \circledast x)\downarrow_s$
• Denoising: $Hx = Ix$
• Inpainting: $Hx = I_{\text{missing}}\, x$
General formulation of IR problems:
Given a single image $y$, solve for $x$:
• $y$: known low-resolution (LR) image
• $x$: unknown high-resolution (HR) image
• $k$: unknown blur kernel (typically set to the identity)
• $\downarrow_s$: downsample $x$ by a factor of $s$ (typically bicubic)
• $n$: additive white Gaussian noise (AWGN)
Single Image Super-Resolution: $y = (k \circledast x)\downarrow_s + n$
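To make the degradation model concrete, here is a minimal NumPy/SciPy sketch of $y = (k \circledast x)\downarrow_s + n$; the Gaussian-blur width, scale factor, noise level, and simple decimation (instead of bicubic downsampling) are illustrative choices, not values from the slides.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(x, sigma_blur=1.6, scale=3, sigma_noise=5.0):
    """Simulate y = (k * x) downsampled by s plus AWGN, for a grayscale image x in [0, 255]."""
    blurred = gaussian_filter(x, sigma=sigma_blur)        # k * x (Gaussian blur kernel)
    lr = blurred[::scale, ::scale]                        # downsample by factor s (simple decimation)
    noise = np.random.normal(0.0, sigma_noise, lr.shape)  # additive white Gaussian noise n
    return lr + noise

# Example: a random "HR" image degraded to an LR observation
hr = np.random.rand(96, 96) * 255.0
lr = degrade(hr)
print(hr.shape, lr.shape)  # (96, 96) (32, 32)
```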
Image restoration (IR)
A toy example: a 2×2 image patch with pixel values A, B, C, D observed through

$Hx = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}\begin{bmatrix} A \\ B \\ C \\ D \end{bmatrix}, \quad \mathrm{rank}(H) = 4$

With a full-rank H this is a well-posed problem and $H^{-1}$ exists; otherwise we need more constraints, assumptions, regularization, iterative methods, etc.
How do we solve the problem?
• Find the best model $H$ such that $y = H(x) + n$
• For a linear system, e.g. X-ray CT, the signal processing community minimizes the following cost function:
$y = Hx, \qquad \phi = \|y - Hx\|_2^2$
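As a toy illustration of why the plain least-squares solution is often not enough, a small NumPy example with an ill-conditioned H (the sizes and noise level are arbitrary): the unregularized solution amplifies the noise, which motivates the regularized MAP formulation on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(100, 100))
H[:, -1] = H[:, -2] + 1e-6 * rng.normal(size=100)  # nearly dependent columns -> ill-conditioned H
x_true = rng.normal(size=100)
y = H @ x_true + 0.01 * rng.normal(size=100)       # y = Hx + n

x_ls = np.linalg.lstsq(H, y, rcond=None)[0]        # minimize ||y - Hx||_2^2
print(np.linalg.norm(x_ls - x_true))               # large error: noise is amplified along the bad direction
```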
Image restoration (IR)
From a Bayesian perspective, the solution $\hat{x}$ can be obtained by solving a Maximum A Posteriori (MAP) problem:
$\hat{x} = \arg\max_x \; \log p(y \mid x) + \log p(x)$

More formally,

$\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x)$

• data fidelity $\tfrac{1}{2}\|y - Hx\|^2$: guarantees the solution accords with the degradation process
• regularization $\lambda\Phi(x)$: enforces the desired property of the output
1) Model-based optimization: What kinds of prior knowledge can we “impose on” our model?
2) Discriminative learning methods
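For the simplest choice of regularizer, $\Phi(x) = \|x\|^2$ (a Gaussian prior, i.e. Tikhonov regularization), the MAP estimate has a closed form; a minimal NumPy sketch with synthetic H and y:

```python
import numpy as np

def map_estimate(H, y, lam):
    """argmin_x 0.5*||y - Hx||^2 + lam*||x||^2 (Tikhonov, i.e. a Gaussian prior)."""
    n = H.shape[1]
    # Setting the gradient to zero gives (H^T H + 2*lam*I) x = H^T y
    return np.linalg.solve(H.T @ H + 2.0 * lam * np.eye(n), H.T @ y)

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 50))
y = H @ rng.normal(size=50) + 0.01 * rng.normal(size=50)
x_hat = map_estimate(H, y, lam=0.1)
print(x_hat.shape)  # (50,)
```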
Model-based optimization methods
What kinds of prior knowledge can we “impose on” our model?
• Sparsity
• Wavelets, DCT, PCA, etc.
• Dictionary learning
• e.g., K-SVD
• Nonlocal self-similarity
• BM3D
• Low-rankness
Sparsity / dictionary prior: $\hat{\alpha} = \arg\min_\alpha \; \tfrac{1}{2}\|y - HD\alpha\|^2 + \lambda\|\alpha\|_1$
Low-rankness prior: $\hat{X} = \arg\min_X \; \tfrac{1}{2}\|Y - X\|^2 + \lambda\|X\|_*$
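A minimal sketch of how the sparsity/dictionary objective above is typically minimized, using ISTA (iterative soft-thresholding); the dictionary D, the step size, and the data below are synthetic placeholders:

```python
import numpy as np

def ista(y, H, D, lam=0.1, n_iter=200):
    """argmin_a 0.5*||y - H D a||^2 + lam*||a||_1 via iterative soft-thresholding."""
    A = H @ D
    step = 1.0 / np.linalg.norm(A, 2) ** 2          # 1/L, L = Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ a - y)                    # gradient of the data-fidelity term
        a = a - step * grad
        a = np.sign(a) * np.maximum(np.abs(a) - step * lam, 0.0)  # soft-threshold (prox of the l1 norm)
    return D @ a                                    # reconstructed signal x = D a

rng = np.random.default_rng(0)
H = np.eye(64)                      # identity degradation (pure denoising) for the demo
D = rng.normal(size=(64, 128))      # an assumed (e.g. learned) overcomplete dictionary
y = rng.normal(size=64)
x_hat = ista(y, H, D)
```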
Image restoration (IR)
From a Bayesian perspective, the solution $\hat{x}$ can be obtained by solving a Maximum A Posteriori (MAP) problem:
$\hat{x} = \arg\max_x \; \log p(y \mid x) + \log p(x)$

More formally,

$\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x)$

• data fidelity $\tfrac{1}{2}\|y - Hx\|^2$: guarantees the solution accords with the degradation process
• regularization $\lambda\Phi(x)$: enforces the desired property of the output
1) Model-based optimization
2) Discriminative learning methods: What kinds of prior knowledge can we “learn using” our model?
Discriminative learning methods
What kinds of prior knowledge can we “learn using” our model?
$\min_\theta \; l(\hat{x}, x), \quad \text{s.t. } \hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x; \theta)$

$\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x)$ (data fidelity + regularization)
Here, we learn the prior parameter $\theta$ by optimizing a loss function $l$ on a training set of image pairs.
Discriminative learning methods
[Diagram] A CNN $f$ (Conv, ReLU, pooling, etc.) maps the observation $y$ to the restored image $x$.
General statement of the problem:
$\min_\theta \; l(\hat{x}, x), \quad \text{s.t. } \hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x; \theta)$
By replacing the MAP inference with a predefined nonlinear function $\hat{x} = f(y, H; \theta)$, solving IR problems with CNNs can be treated as a discriminative learning method: the network implicitly learns the image prior model.
SRCNN
The start of deep learning in SISR
• Links the CNN architecture to traditional “sparse coding” methods.
• The first end-to-end framework: every module is optimized jointly through the learning process
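A minimal PyTorch sketch of an SRCNN-style network trained on (bicubic-upsampled LR, HR) pairs; the 9-1-5 kernel sizes and 64/32 channels follow the original three-layer design, while the optimizer settings and the random "data" below are placeholders:

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN: patch extraction -> non-linear mapping -> reconstruction."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),           nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),
        )

    def forward(self, y_bicubic):
        return self.body(y_bicubic)   # input is the bicubic-upsampled LR image

model = SRCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
y, x = torch.rand(8, 1, 33, 33), torch.rand(8, 1, 33, 33)  # placeholder LR(upsampled)/HR patch pairs
loss = nn.functional.mse_loss(model(y), x)                  # l(x_hat, x)
loss.backward(); opt.step()
```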
SRCNN
Set5 dataset with an upscaling factor × 𝟑
SRCNN surpasses the bicubic baseline and outperforms the sparse-coding-based method.
The first-layer filters trained on upscaling factor × 𝟑
Example feature maps of different layers.
Results
SRCNN
Problems left
• Bicubic LR input usage: pros & cons
• Shallow (three-layer) network design
• Naive & implicit prior
• Upsampling methods
• Interpolation-based vs. Learning-based
• Model framework
• Pre-, Post-, Progressive- upsampling
• Iterative up-and-down sampling
• Network design
• Residual learning
• Recursive learning
• Deeper & Denser
• Etc.
Developments of deep models in SISR
Pre- vs. Post-upsampling
Progressive- upsampling
Model framework
LapSRN: motivated by the Laplacian image pyramid
• LapSRN (CVPR ‘17)
• ProSR (CVPR ‘18)
• Early stage
• SRCNN (ECCV ‘14)
• FSRCNN (ECCV ‘16)
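A minimal PyTorch sketch contrasting the two frameworks: pre-upsampling interpolates to the target size first and then refines, whereas post-upsampling extracts features in LR space and upscales once at the end with a learned sub-pixel (PixelShuffle) layer; the depths and channel counts are illustrative, not any particular paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreUpsampleSR(nn.Module):
    """Interpolate to the HR size first (e.g. bicubic), then refine with convolutions."""
    def __init__(self, scale=3):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(True),
                                  nn.Conv2d(64, 1, 3, padding=1))

    def forward(self, lr):
        up = F.interpolate(lr, scale_factor=self.scale, mode='bicubic', align_corners=False)
        return self.body(up)

class PostUpsampleSR(nn.Module):
    """Extract features in LR space, then upscale once with PixelShuffle (cheaper)."""
    def __init__(self, scale=3):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(True),
                                  nn.Conv2d(64, scale * scale, 3, padding=1))
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, lr):
        return self.shuffle(self.body(lr))

lr = torch.rand(1, 1, 32, 32)
print(PreUpsampleSR()(lr).shape, PostUpsampleSR()(lr).shape)  # both (1, 1, 96, 96)
```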
Pre- vs. Post-upsampling
Progressive- upsampling
Iterative up-and-down sampling
Model framework
DBPN: motivated by iterative back-projection in classical optimization methods.
Super-resolution results for 8× enlargement. PSNR: LapSRN (15.25 dB), EDSR (15.33 dB), and DBPN (16.63 dB).
• Deep Back-Projection Network (CVPR ‘18)
• LapSRN (CVPR ‘17)
• ProSR (CVPR ‘18)
• Early stage
• SRCNN (ECCV ‘14)
• FSRCNN (ECCV ‘16)
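Since DBPN is motivated by classical iterative back-projection, a minimal NumPy/SciPy sketch of that classical loop may help; the cubic-spline resampling used here stands in for the true down/upsampling operators and is an assumption, not the paper's learned projection units:

```python
import numpy as np
from scipy.ndimage import zoom

def iterative_back_projection(lr, scale=3, n_iter=10, step=1.0):
    """Classical IBP: repeatedly project the current HR estimate down, compare with the LR
    observation, and back-project the residual (DBPN unrolls this idea with learned layers)."""
    hr = zoom(lr, scale, order=3)                        # initial HR guess by cubic upscaling
    for _ in range(n_iter):
        simulated_lr = zoom(hr, 1.0 / scale, order=3)    # apply the assumed degradation (downsample)
        residual = lr - simulated_lr                     # error in LR space
        hr = hr + step * zoom(residual, scale, order=3)  # back-project the error to HR space
    return hr

lr = np.random.rand(32, 32)
print(iterative_back_projection(lr).shape)  # (96, 96)
```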
Variety of model designs
• Residual learning
• Recursive learning
• Deeper & Denser
• Exploit Non-local or Attention
• GANs
• Etc.
Network design
Variety of model designs
Network design
VDSR (CVPR ‘16)
Network design: 1st cornerstone, residual learning
Very Deep SR network
• The first “deep” network (20 layers)
• Proposed a practical method to actually train such “deep” layers (before batch normalization became common)
VDSR (CVPR ‘16)
Network design: 1st cornerstone, residual learning
Very Deep SR network
• The first “deep” network (20 layers)
• Proposed a practical method to actually train the “deep” layers
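A minimal PyTorch sketch of VDSR-style global residual learning: a deep stack of 3×3 convolutions predicts only the residual, and the interpolated input is added back at the output (the 20-layer depth and 64 channels follow the paper; gradient clipping and the high learning rate are omitted here):

```python
import torch
import torch.nn as nn

class VDSRLike(nn.Module):
    """20 conv layers that predict a residual; the interpolated input is added back at the end."""
    def __init__(self, depth=20, channels=64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, y_bicubic):
        return y_bicubic + self.body(y_bicubic)   # global residual (skip) connection
```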
Network design: recursive learning
DRCN (CVPR ‘16), DRRN (CVPR ‘17), MemNet (ICCV ‘17)
• Reuses the same module (fewer parameters, smaller model)
Network design: recursive learning
DRCN (CVPR ‘16), DRRN (CVPR ‘17), MemNet (ICCV ‘17)
• Reuses the same module (fewer parameters, smaller model)
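A minimal sketch of recursive learning in the DRCN/DRRN spirit: one convolutional block is applied repeatedly with shared weights, so effective depth grows without adding parameters; the recursion count and widths below are illustrative:

```python
import torch
import torch.nn as nn

class RecursiveSR(nn.Module):
    """A single shared block unrolled T times: more depth, no extra parameters."""
    def __init__(self, channels=64, recursions=9):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)
        self.shared = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True),
                                    nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True))
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)
        self.recursions = recursions

    def forward(self, y):
        f = self.head(y)
        h = f
        for _ in range(self.recursions):        # the SAME weights are reused every iteration
            h = self.shared(h) + f              # local residual back to the first feature map
        return y + self.tail(h)                 # global residual, as in DRRN
```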
Network design: deeper & denser
SRResNet (CVPR ‘17), SRDenseNet (ICCV ‘17)
• Deeper (ResNet backbone)
• Denser (DenseNet backbone)
Network design: deeper & denser
SRResNet (CVPR ‘17), SRDenseNet (ICCV ‘17)
• Deeper (ResNet backbone)
• Denser (DenseNet backbone)
Network design: 2nd cornerstone
EDSR (CVPR ‘17)
• The first to provide a backbone for the SR task
• Removed batch normalization layers
• Residual scaling
• Very stable and reproducible model
• Geometric self-ensemble method
• Exploits the pretrained 2× model for the other scales
• Performance gain
• Model size reduction (43M → 8M)
• Flexibility (partially scale-agnostic)
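A minimal sketch of the EDSR building block: a ResNet-style residual block with batch normalization removed and the residual branch scaled before the addition (0.1 is the scaling factor reported for the large model; the channel width is illustrative):

```python
import torch
import torch.nn as nn

class EDSRBlock(nn.Module):
    """Residual block without BN; the residual branch is scaled before being added back."""
    def __init__(self, channels=256, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(True),
                                  nn.Conv2d(channels, channels, 3, padding=1))
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)   # residual scaling stabilizes very deep/wide stacks
```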
Network design: 2nd cornerstone
EDSR (CVPR ‘17)
Network design: 2nd cornerstone
EDSR (CVPR ‘17)
Network design
Non-local & Attention module
Generative Adversarial Networks in Super-resolution
• "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network" (SRGAN, CVPR ‘17)
• "A fully progressive approach to single-image super-resolution" (ProSR, CVPR ‘18)
• "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks" ECCV ‘18
• Candidate of 4th cornerstone?
• "2018 PIRM Challenge on Perceptual Image Super-resolution" (ECCV ‘18)
• 3rd cornerstone, at least in terms of performance; but too sensitive, with many hyper-parameters.
• "Image Super-Resolution Using Very Deep Residual Channel Attention Networks" (RCAN, ECCV ‘18)
• "Non-local Recurrent Network for Image Restoration" (NLRN, NIPS ‘18)
• "Residual Non-local Attention Networks for Image Restoration" (RNAN, ICLR ‘19)
GANs in SR: candidate of 4th cornerstone?
Problems
• Do not align well with the traditional metrics PSNR / SSIM
• A new metric? — "2018 PIRM Challenge on Perceptual Image Super-resolution" (ECCV ‘18)
ProSRGAN (CVPR ‘18)
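A minimal sketch of how an SRGAN-style objective combines a content term with an adversarial term; the MSE content loss stands in for the VGG-feature (perceptual) loss, the 1e-3 adversarial weight follows SRGAN's reported setting, and the discriminator logits are assumed to come from some unspecified discriminator network:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def generator_loss(sr, hr, d_logits_fake, adv_weight=1e-3):
    """Content term keeps fidelity to the HR target; adversarial term rewards fooling the discriminator."""
    content = nn.functional.mse_loss(sr, hr)                       # or a VGG feature (perceptual) loss
    adversarial = bce(d_logits_fake, torch.ones_like(d_logits_fake))
    return content + adv_weight * adversarial

def discriminator_loss(d_logits_real, d_logits_fake):
    """Standard GAN discriminator objective: real -> 1, generated -> 0."""
    return (bce(d_logits_real, torch.ones_like(d_logits_real)) +
            bce(d_logits_fake, torch.zeros_like(d_logits_fake)))
```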
Summary (until now)
Methods: 1) Model-based optimization  2) Discriminative learning methods
General formulation of IR: $\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x)$
• Model-based optimization — hand-crafted priors: sparsity, low-rankness
• Discriminative learning — learned discriminative priors; predefined nonlinear functions: CNNs
Summary (until now)
Methods: 1) Model-based optimization  2) Discriminative learning methods
General formulation of IR: $\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x)$

Pros
• Model-based optimization: general enough to handle different IR problems; clear physical meaning
• Discriminative learning: data-driven end-to-end learning; efficient inference at test time
Cons
• Model-based optimization: hand-crafted priors (weak representations); optimization is time-consuming
• Discriminative learning: generality of the model is limited; interpretability of the model is limited
“Can we somehow get the best of both worlds?”
Getting the best of both worlds
Variable Splitting Methods
• We want to deal with the data-fidelity term and the regularization term separately
• Specifically, the regularization term then corresponds only to a denoising subproblem
• Alternating Direction Method of Multipliers (ADMM), Half Quadratic Splitting (HQS)

$\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(z) \quad \text{s.t. } z = x$

• Cost function of HQS: $\mathcal{L}_\mu(x, z) = \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(z) + \tfrac{\mu}{2}\|z - x\|^2$

Alternating updates:

$x^{k+1} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \tfrac{\mu}{2}\|x - z^k\|^2 = (H^T H + \mu I)^{-1}(H^T y + \mu z^k)$

$z^{k+1} = \arg\min_z \; \tfrac{1}{2(\sqrt{\lambda/\mu})^2}\|z - x^{k+1}\|^2 + \Phi(z)$

From a Bayesian perspective, the z-update is a Gaussian denoising subproblem with noise level $\sqrt{\lambda/\mu}$:

$z^{k+1} = \mathrm{Denoiser}(x^{k+1}, \sqrt{\lambda/\mu})$

1. Any gray or color denoiser can be plugged in to solve a variety of inverse problems.
2. The explicit image prior can remain unknown when solving the original problem.
3. Several complementary denoisers that exploit different image priors can be jointly utilized to solve one specific problem.
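A minimal NumPy sketch of the resulting plug-and-play HQS loop for a generic linear H: the x-step is the closed-form solve above and the z-step calls whatever denoiser is passed in. The Gaussian-filter "denoiser", the fixed μ, and the identity H in the demo are placeholders, not IRCNN's learned CNN prior or its noise-level schedule:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hqs_plug_and_play(y, H, denoiser, lam=0.01, mu=0.1, n_iter=30):
    """Half Quadratic Splitting with a plug-in denoiser acting as the prior (IRCNN-style)."""
    n = H.shape[1]
    side = int(np.sqrt(n))                       # assume x is a square image, flattened
    z = np.zeros(n)
    A = H.T @ H + mu * np.eye(n)                 # system matrix of the x-subproblem
    for _ in range(n_iter):
        x = np.linalg.solve(A, H.T @ y + mu * z)            # data-fidelity step (closed form)
        sigma = np.sqrt(lam / mu)                           # noise level of the denoising step
        z = denoiser(x.reshape(side, side), sigma).ravel()  # prior step: plug in ANY denoiser
    return x.reshape(side, side)

# Placeholder denoiser: a Gaussian filter standing in for a learned CNN denoiser
denoiser = lambda img, sigma: gaussian_filter(img, sigma=max(sigma, 0.5))

side = 16
x_true = np.random.rand(side * side)
H = np.eye(side * side)                          # identity degradation = pure denoising demo
y = H @ x_true + 0.1 * np.random.randn(side * side)
x_hat = hqs_plug_and_play(y, H, denoiser)
print(x_hat.shape)                               # (16, 16)
```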
IRCNN
HQS: Plug and Play
• Image Restoration with CNN Denoiser Prior
• Kai Zhang et al. “Learning Deep CNN Denoiser Prior for Image Restoration”
$z^{k+1} = \mathrm{Denoiser}(x^{k+1}, \sqrt{\lambda/\mu})$
IRCNN
HQS: Plug and Play
• Image Restoration with CNN Denoiser Prior
• Kai Zhang et al. “Learning Deep CNN Denoiser Prior for Image Restoration”
$z^{k+1} = \mathrm{Denoiser}(x^{k+1}, \sqrt{\lambda/\mu})$
Image deblurring performance comparison on the Leaves image (the blur kernel is a Gaussian with standard deviation 1.6; the noise level σ is 2).
IRCNN
HQS: Plug and Play
• Image Restoration with CNN Denoiser Prior
• Kai Zhang et al. “Learning Deep CNN Denoiser Prior for Image Restoration”
$z^{k+1} = \mathrm{Denoiser}(x^{k+1}, \sqrt{\lambda/\mu})$
SISR performance comparison on Set5: IRCNN can change the blur kernel and scale factor without retraining (the blur kernel is a 7×7 Gaussian with standard deviation 1.6; the scale factor is ×3).
Summary (until now)
Methods: 1) Model-based optimization  2) Discriminative learning methods
General formulation of IR: $\hat{x} = \arg\min_x \; \tfrac{1}{2}\|y - Hx\|^2 + \lambda\Phi(x)$

Pros
• Model-based optimization: general enough to handle different IR problems; clear physical meaning ✓
• Discriminative learning: data-driven end-to-end learning; efficient inference at test time ✓
Cons
• Model-based optimization: hand-crafted priors (weak representations); optimization is time-consuming
• Discriminative learning: generality of the model is limited; interpretability of the model is limited

✓✓ The plug-and-play approach (IRCNN) combines the pros of both.
DPSR (CVPR ‘19)
Deep Plug and Play Super-Resolution for Arbitrary Blur Kernels
Problems yet to be solved
• It WORKS but NO WHYS.
• Many studies are just blindly suggesting a new architecture that works.
• Recent architectures are (kind of) overfitted to the benchmark datasets.
• Bicubic-downsampling tasks are saturated (models fail under other downsampling schemes or realistic noise).
• We need more “realistic” and “pragmatic” models that work in real environments.
• Lack of fair comparisons
• Lighter (greener) and faster (inference) models
• New architectures (more than just shared parameters)
• New methods
Jaejun Yoo
Ph.D. Research Scientist
@NAVER Clova AI Research, South Korea
Interested in Generative models, Signal Processing,
Interpretable AI, and Algebraic Topology
Techblog: https://jaejunyoo.blogspot.com
Github: https://github.com/jaejun-yoo / LinkedIn: www.linkedin.com/in/jaejunyoo
Research Keywords
deep learning, inverse problem, signal processing, generative models
Thank you
Q&A?
