GENERATING SUPER-RESOLUTION IMAGES USING
TRANSFORMERS
NEERAJ BAGHEL (RSI2021003)
19 Sept 2021 at IIIT-Allahabad
INTRODUCTION
Image super-resolution
[Figure: an original low-resolution image is cropped and zoomed, then reconstructed as a high-resolution image.]
Applications:
Medical imaging.
Satellite imaging.
Digital zoom in cameras.
Image-enhancement technology for digital televisions.
MOTIVATION & OBJECTIVE
Motivation: recovering useful detail (e.g., a face or a number plate) from low-resolution images.
Objective: Super-Resolution Transformer
DESCRIPTION OF EXISTING METHODS
1) Deep learning methods, such as conditional GANs, can be used for this problem.
These methods learn an end-to-end mapping function between LR and HR images.
2) Methods based on image aligning or patch matching can also be used.
These align the LR and Ref images, e.g., via optical flow.
DESCRIPTION OF EXISTING METHODS
The authors proposed 1) a texture transformer and 2) a cross-scale feature integration module (CSFI).
Method 1: TTSR
Texture Transformer
Learnable Texture Extractor:
Extracts texture from the reference images.
Its parameters are updated during end-to-end training.
Relevance Embedding:
Unfold Q and K into n patches.
Calculate the relevance between them.
Then transfer texture from the most relevant position in V (see the sketch after the pipeline below).
Yang, Fuzhi, et al. "Learning texture transformer network for image super-resolution." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
DESCRIPTION OF EXISTING METHODS
Method 1: TTSR
Inputs: LR↑ (the up-sampled LR image), Ref↓↑ (the sequentially down/up-sampled Ref image), and Ref (the original reference image).
Learnable Texture Extractor (LTE):
Normalize the input from range [-1, 1] to range [0, 1] via f(x) = (x + 1) / 2.
Equalize the data by fixing mean and std (MeanShift).
(Lv3, Lv2, Lv1) = VGG19 features (outputs of layers 2, 7 and 12).
Relevance Embedding (create patches):
Unfold Q and K into patches with kernel (3,3), padding 1, stride 1; transpose and normalize the patches.
Compute the attention weights by batched matrix multiplication (torch.bmm).
Max over the weights gives s, the soft-attention map S; argmax gives h, the hard-attention maps H.
Texture Transfer:
Unfold V at three scales: k(3,3), p(1), s(1); k(6,6), p(2), s(2); k(12,12), p(4), s(4).
Gather the value patches at the positions in H and fold them back into the transferred texture maps T_lv3, T_lv2, T_lv1.
Backbone:
F = Conv + ReLU + res-blocks + Conv over the LR input (backbone DNN).
Output: F_out = F + Conv(Concat(F, T_lvi)) * S.
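The core relevance embedding and texture transfer can be sketched in PyTorch as follows. This is a minimal single-level illustration, assuming (B, C, H, W) feature maps already produced by the LTE; the variable names and the averaging fold are my simplifications, not the authors' released code.

```python
# Minimal sketch of TTSR-style relevance embedding and texture transfer.
import torch
import torch.nn.functional as F

def texture_transfer(q_feat, k_feat, v_feat):
    """q_feat: LTE(LR up-sampled), k_feat: LTE(Ref down/up-sampled),
    v_feat: LTE(Ref); all (B, C, H, W)."""
    B, C, H, W = q_feat.shape
    # Unfold into 3x3 patches: (B, C*9, N) with N = H*W (padding 1, stride 1).
    q = F.unfold(q_feat, kernel_size=3, padding=1)
    k = F.unfold(k_feat, kernel_size=3, padding=1)
    v = F.unfold(v_feat, kernel_size=3, padding=1)
    # Normalize patches so the inner product measures cosine relevance.
    q = F.normalize(q, dim=1).transpose(1, 2)            # (B, N, C*9)
    k = F.normalize(k, dim=1)                            # (B, C*9, N)
    rel = torch.bmm(q, k)                                # (B, N, N) relevance
    # Soft attention S: confidence of the best match per query position.
    # Hard attention H: index of the most relevant Ref patch.
    s, h = torch.max(rel, dim=2)                         # (B, N), (B, N)
    # Transfer: gather the most relevant value patch for every position.
    idx = h.unsqueeze(1).expand(-1, v.size(1), -1)       # (B, C*9, N)
    t = torch.gather(v, 2, idx)
    # Fold patches back to a feature map; divide by the overlap count so
    # overlapping patches average instead of summing.
    ones = torch.ones_like(t)
    t = F.fold(t, (H, W), kernel_size=3, padding=1)
    t = t / F.fold(ones, (H, W), kernel_size=3, padding=1)
    return t, s.view(B, 1, H, W)
```

The backbone output can then be fused as F_out = F + conv(torch.cat([F, t], dim=1)) * s, so the transferred textures are weighted by match confidence.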
DESCRIPTION OF EXISTING METHODS
Method 1: TTSR
2) Cross-Scale Feature Integration Module (CSFI)
Texture transformers are stacked at several scales, and CSFI exchanges features across scales to further enhance model performance.
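As a hedged illustration of the cross-scale idea (not the authors' exact module), one exchange step could look like the following, assuming three feature streams at 1x, 2x and 4x resolution with equal channel counts; the resizing via interpolation and the 1x1 fusion convolutions are my choices.

```python
# Minimal sketch of a cross-scale feature exchange step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleExchange(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # After concatenating the three rescaled streams, fuse with 1x1 convs.
        self.fuse1 = nn.Conv2d(3 * channels, channels, 1)
        self.fuse2 = nn.Conv2d(3 * channels, channels, 1)
        self.fuse4 = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x1, x2, x4):
        def to(x, ref):  # resize x to the spatial size of ref
            return F.interpolate(x, size=ref.shape[-2:], mode="bilinear",
                                 align_corners=False)
        y1 = self.fuse1(torch.cat([x1, to(x2, x1), to(x4, x1)], dim=1))
        y2 = self.fuse2(torch.cat([to(x1, x2), x2, to(x4, x2)], dim=1))
        y4 = self.fuse4(torch.cat([to(x1, x4), to(x2, x4), x4], dim=1))
        return y1, y2, y4
```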
3) Loss Function
1) Reconstruction loss: utilizes L1 loss (least absolute deviations) and L2 loss (least squared errors).
2) Adversarial loss: effective in generating clear and visually favorable images.
3) Perceptual loss: enhances the similarity in feature space.
Yang, Fuzhi, et al. "Learning texture transformer network for image super-resolution." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
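A minimal sketch of how these three terms could be combined into one objective. The weights and the WGAN-style generator term are common choices, not values taken from the paper; `vgg_feat` and `disc` are assumed to be a pre-trained feature extractor and a discriminator defined elsewhere.

```python
# Minimal sketch of a combined SR training objective.
import torch
import torch.nn.functional as F

def total_loss(sr, hr, vgg_feat, disc, lambda_adv=1e-3, lambda_per=1e-2):
    rec = F.l1_loss(sr, hr)                        # reconstruction (L1)
    adv = -disc(sr).mean()                         # adversarial (generator term)
    per = F.mse_loss(vgg_feat(sr), vgg_feat(hr))   # perceptual (feature space)
    return rec + lambda_adv * adv + lambda_per * per
```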
SUMMARY OF THE PROPOSED PLAN
 Transformers for computer vision problems.
 Various techniques are efficient at lower scale factors (✕2, ✕3), but the efficiency drops as we go to higher scales.
 Transformers commonly incur a heavy GPU memory cost.
 Consider the case where no reference image is available for the test data.
SUMMARY OF THE PROPOSED PLAN
 Use a different deep feature space for the texture extractor, based on a pre-trained model other than VGG19, giving the image directly as input.
 Upgrade LR↑ using self-attention over LR↑ to create LR↑+ images, which carry more information than LR↑ (a sketch follows this list).
 After that, apply a discriminator transformer for further enhanced resolution.
 Consider the case where no reference image is available for the test data.
[Figure: the discriminator compares the LR↑+ output against the HR ground truth.]
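A minimal sketch of the LR↑+ idea under stated assumptions: bicubic up-sampling followed by one self-attention layer over non-overlapping feature patches. All module sizes and names are illustrative (this plan is not yet implemented), and H and W of the up-sampled image must be divisible by the patch size.

```python
# Minimal sketch: refine an up-sampled LR image with self-attention (LR↑+).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionRefiner(nn.Module):
    def __init__(self, channels=64, heads=4, patch=8):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels * patch * patch, heads,
                                          batch_first=True)
        self.out = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, lr, scale=4):
        up = F.interpolate(lr, scale_factor=scale, mode="bicubic",
                           align_corners=False)               # LR↑
        feat = self.embed(up)                                 # (B, C, H, W)
        B, C, H, W = feat.shape                               # H, W divisible by patch
        p = self.patch
        tokens = F.unfold(feat, p, stride=p).transpose(1, 2)  # (B, N, C*p*p)
        tokens, _ = self.attn(tokens, tokens, tokens)         # self-attention
        feat = F.fold(tokens.transpose(1, 2), (H, W), p, stride=p)
        return up + self.out(feat)                            # LR↑+ (residual)
```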
DESCRIPTION OF DATASETS
 CUFED Dataset: 11,871 image pairs
 CUFED5 Dataset: 126 testing images, each with a set of 5 references
 DIV2K Dataset: 800 RGB training images and 100 validation images, with rich textures (2K resolution)
 Set5, Set14: evaluation datasets for super-resolution (5 and 14 images: buildings, animals, etc.)
 B100, Urban100: comparison at the ✕4 scale factor (100 images each)
EVALUATION TECHNIQUE
 Peak signal-to-noise ratio (PSNR)
 Structural similarity index (SSIM)
Both are used to evaluate the performance of the reconstructed SR images.
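A minimal PSNR computation in PyTorch for images scaled to [0, 1]; for SSIM one would typically use an existing implementation such as skimage.metrics.structural_similarity rather than writing it by hand.

```python
# Minimal PSNR sketch for images in [0, 1].
import torch

def psnr(sr, hr, max_val=1.0):
    # PSNR = 10 * log10(MAX^2 / MSE): the ratio between the maximum possible
    # signal power and the power of the corrupting noise (here, the MSE).
    mse = torch.mean((sr - hr) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```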
DESCRIPTION OF CODE AVAILABILITY
 SwinIR: Image Restoration Using Swin Transformer
https://github.com/JingyunLiang/SwinIR
 Light Field Image Super-Resolution with Transformers
https://github.com/ZhengyuLiang24/LFT
 Learning Texture Transformer Network for Image Super-Resolution
https://github.com/researchmm/TTSR
EXPERIMENTAL SETUP
 Python
 PyTorch, using the JupyterLab IDE
 Created a virtual environment named "transformer"
PRESENTER CONTRIBUTION
 Learned basic Transformer concepts.
 Surveyed available state-of-the-art techniques for super-resolution.
 Implemented a previous state-of-the-art technique.
 Downloaded the datasets and checked them with the state-of-the-art technique.
 Proposed a plan for future work.
QUESTIONS
THANK YOU
ATTENTION IS ALL YOU NEED
CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS
(NIPS 2017)
4 DEC 2017 – 9 DEC 2017 AT CALIFORNIA, UNITED STATES
NEERAJ BAGHEL
(RESEARCH SCHOLAR)
IIITA, INDIA
Ashish Vaswani
(Google Brain)
Noam Shazeer
(Google Brain)
Niki Parmar
(Google Research)
Jakob Uszkoreit
(Google Research)
Llion Jones
(Google Research)
Aidan N. Gomez
(University of Toronto)
Łukasz Kaiser
(Google Brain)
Illia Polosukhin
(Google Research)
OUTLINE
• NATURAL LANGUAGE PROCESSING
• ATTENTION
• PROPOSED TRANSFORMER ARCHITECTURE
• SCALED DOT-PRODUCT ATTENTION
• MULTI-HEAD ATTENTION
• SELF-ATTENTION
• FEED FORWARD LAYER
• POSITIONAL ENCODING
• BEAM-SEARCH
• EXPERIMENT
DOMAIN: NATURAL LANGUAGE PROCESSING
(APPLICATION OF ML)
PROBLEMS THAT CAN BE SOLVED BY NLP
:SENTENCE CLASSIFICATION
:SENTENCE TO SENTENCE
:LANGUAGE CONVERSION
Methods
:RNN
:LSTM
:GRU
[Figure: an RNN encoder-decoder — encoder states H0-H2 read word vectors X1-X3 ("The cat eats"); decoder states H3-H4 emit outputs Y1-Y2 ("DIE KATZ").]
Problems in RNN
Its sequential nature reduces the connectivity of hidden states to the original inputs.
Not effective when the sequence length is long.
Prevents parallelization within training samples.
Requires an attention mechanism for remembering the focused area.
Problems in CNN
:DOES NOT ALLOW TIME-SERIES CONTEXT TO FLOW.
:CAN ONLY PERFORM ONE-TO-ONE OUTPUT.
Summary:
We need attention mechanisms, which allow us to draw global dependencies between input and output in a constant number of operations.
ATTENTION:
[Figure: the same encoder-decoder, where the decoder state producing Y2 issues a query against keys K1...Kn derived from the encoder states.]
Searching keys:
Search for the keys that are similar to a query, and return the corresponding values.
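This query/key/value search is the paper's scaled dot-product attention; a minimal sketch, with (batch, sequence, dimension) tensors:

```python
# Minimal scaled dot-product attention sketch.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Relevance of every query to every key, scaled by sqrt(d_k) so the
    # softmax stays in a well-behaved range for large dimensions.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # how much of each value to return
    return weights @ v
```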
Proposed Transformer Architecture
Input: Sequence of symbol representations (x1, x2,…, xn )
Output: Sequence of symbol representations (y1, y2,…, yn )
Attention! All you need:
Self-attention in the encoder
Encoder-decoder attention
Self-attention in the decoder
FEED FORWARD LAYER
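A minimal sketch of the position-wise feed-forward layer: two linear transformations with a ReLU in between, applied identically at every sequence position (d_model = 512 and d_ff = 2048 in the paper).

```python
# Minimal position-wise feed-forward layer sketch.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # FFN(x) = max(0, x W1 + b1) W2 + b2
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):  # x: (batch, sequence, d_model)
        return self.net(x)
```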
USING BEAM-SEARCH IN SELECTING MODEL PREDICTION
 When selecting the model output, we can take the word with the highest probability and throw away the remaining candidates: greedy decoding.
 Another way to select the model output is beam search.
 Beam search: instead of only predicting the token with the best score, we keep track of k hypotheses (for example k = 4; we refer to k as the beam size). A sketch follows.
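A minimal beam-search sketch. The step function `log_probs` (returning next-token log-probabilities for a prefix), the BOS/EOS ids, and the absence of length normalization are simplifying assumptions for illustration.

```python
# Minimal beam-search decoding sketch.
import torch

def beam_search(log_probs, bos_id, eos_id, beam_size=4, max_len=20):
    # Each hypothesis is (tokens, cumulative log-probability).
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:       # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            lp = log_probs(torch.tensor(tokens))     # (vocab,) log-probs
            top_lp, top_id = lp.topk(beam_size)
            for p, i in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((tokens + [i], score + p))
        # Keep only the k best hypotheses by cumulative score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]
```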
Result
THANK YOU

Editor's Notes

  • #2 Good evening. Neeraj Baghel, enrollment no. RSI2021003; my project topic is generating super-resolution images using transformers.
  • #4 Image super-resolution aims at recovering a high-resolution image from its low-resolution counterpart. In this example we see a low-resolution image from a CCTV camera, and the objective is to read the registration number from the number plate.
  • #5 So we find the region of interest: the number plate, cropped and zoomed.
  • #6 Then we reconstruct the high-resolution image, where we can clearly see the registration number.
  • #7 So the task is to generate a high-resolution image. It is still an active area, offering the promise of overcoming resolution limitations in many applications, such as those listed. The image super-resolution task has made great strides with the development of deep learning.
  • #8-#10 Motivation: identifying a person in a low-resolution image; achieving low storage use on mobiles and transferring low-resolution images. Objective: we want to propose a novel Super-Resolution Transformer for fast and accurate image super-resolution with comparatively low computational cost and memory use. In recent years, the Transformer has made great progress in computer vision tasks owing to its strong self-attention mechanism. In SISR, similar image blocks within the image can be used as references for each other, so the texture details of the current block can be restored with reference to other blocks, which makes the Transformer a natural fit.
  • #12 On top of the texture transformer, the authors propose a cross-scale feature integration module (CSFI) to further enhance model performance. In the texture transformer, Q, K and V are the texture features extracted from an up-sampled LR image, a sequentially down/up-sampled Ref image, and the original Ref image, respectively. H and S indicate the hard/soft attention maps calculated from relevance embedding. F is the LR feature extracted from a DNN backbone, which is further fused with the transferred texture features T to generate the SR output.
  • #14 L1 loss (least absolute deviations) measures the absolute differences between the true and predicted values; L2 loss (least squared errors) measures the squared differences.
  • #15 This presents the overall architecture of the proposed model, the "Efficient SR Transformer" (ESRT). The authors present the LCB (lightweight CNN backbone) with a novel high-preserving block (HPB) and high-frequency filtering module (HFM); next, the LTB (lightweight Transformer backbone) with an efficient Transformer (ET); and finally the differences between ESRT and other SR methods.
  • #16 The lightweight CNN backbone (LCB) is built like other SR models and serves as the front part of ESRT. Its function is to extract latent SR features in advance so that the model has an initial super-resolution ability. The HPB first isolates high-frequency information with the help of the HFM. In the HPB, an adaptive residual feature block (ARFB) is introduced as the basic feature-extraction unit: an ARFB extracts the input features for the HFM, which then calculates their high-frequency information (marked P_high). The lightweight Transformer backbone (LTB), composed of efficient Transformers (ET), captures the long-term dependence of similar local regions in the image at low computational cost.
  • #18 However, previous variants of the vision Transformer commonly incur a heavy GPU memory cost, which hinders the development of Transformers in the vision area. Extensive benchmark and real-world datasets demonstrate that ESRT achieves the best trade-off between model performance and computation cost.
  • #20 Our model is trained with the DIV2K [13] dataset, which is widely used in the SISR task. DIV2K contains 800 RGB training images and 100 validation images with rich textures (2K resolution). Set5 and Set14 are common evaluation datasets for image super-resolution, containing images ranging from buildings to animal faces; Set14 consists of 14 images commonly used for testing the performance of super-resolution models. Visual comparisons at the ✕4 scale factor use the Urban100 and B100 datasets. For evaluation, we use five benchmark datasets to validate the effectiveness of our method: Set5 [38], Set14 [39], B100 [40], Urban100 [41], and Manga109 [42]. Meanwhile, peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are used to evaluate the performance of the reconstructed SR images.
  • #21 PSNR and SSIM are used to evaluate the performance of the reconstructed SR images. PSNR: the ratio between the maximum possible power of a signal and the power of the corrupting noise.