The Wise Document Translator
Introduction and Problem Setting
Problem
A project that aims to build a system that translates a document from one language to another while preserving its entire design (layout, logos, signatures, ...)
Why
• Price : professional translators are expensive and produce plain translation templates that never match the input layout
• Credibility : a translation that exactly matches the input layout gains credibility and trustworthiness
• Importance degree : documents vary in importance; translating a worksheet is not equivalent to translating a person's criminal record
Our Approach
Global Perspective
After answering the What? and the Why?, let's answer the How? ...
How?
We divided our problem into 5 sequential sub-problems :
• 1) Text Detection : localizing word-level text areas
• 2) Text Recognition : recognizing the words
• 3) Text Merging : merging words into statements
• 4) Inpainting : deleting the text areas and filling them in
• 5) Text Translation : translating the text and putting it back into the document
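The five sub-problems chain naturally into a single pipeline. A minimal sketch of that flow (every function here is a hypothetical placeholder, not the actual implementation):

```python
# Hypothetical sketch of the five-stage pipeline; each stage is a stand-in.

def detect_text(image):
    # 1) Text Detection: return word-level boxes (dummy coordinates here).
    return [(0, 0, 10, 5), (12, 0, 30, 5)]

def recognize(image, boxes):
    # 2) Text Recognition: one word per detected box (dummy words here).
    return ["Hello", "World"]

def merge(boxes, words):
    # 3) Text Merging: group neighboring words into statements.
    return [" ".join(words)]

def inpaint(image, boxes):
    # 4) Inpainting: erase the text areas and fill them (identity here).
    return image

def translate_and_render(image, statements, target_lang):
    # 5) Text Translation: translate and draw back (upper-case stand-in).
    return image, [s.upper() for s in statements]

def translate_document(image, target_lang="fr"):
    boxes = detect_text(image)
    words = recognize(image, boxes)
    statements = merge(boxes, words)
    clean = inpaint(image, boxes)
    return translate_and_render(clean, statements, target_lang)

_, out = translate_document(object())
print(out)  # ['HELLO WORLD']
```

Each stage only consumes the previous stage's output, which is what makes the sub-problems independently swappable.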
Text Detection : CRAFT
CRAFT : Idea and Data
Idea
explore individual characters and the affinity between them, linking characters to form text entities
Data
X are input images of shape (N, h, w, 3) and Y are outputs of shape (N, h, w, 2), where :
∗ N is the number of images, and h and w are the image height and width respectively
∗ for each image Xi of shape (h, w, 3), Yi consists of two matrices : a character score heatmap and an affinity/linkage score heatmap
Figure: CRAFT data sample
CRAFT : Post Processing
From the two predicted heatmaps Sr and Sa we generate word bounding boxes as follows :
• Build a binary mask M where a pixel p is set to 1 if Sr(p) > thr or Sa(p) > tha, where thr and tha are thresholding hyperparameters.
• Apply the CCL (Connected Component Labeling) algorithm to M.
• Find the minimum-area rectangle covering each component (rotated rectangles are accepted too, since the text can be inclined/rotated).
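A minimal NumPy sketch of these three steps, using a simple 4-connected flood-fill for CCL and axis-aligned boxes (in practice one would use cv2.connectedComponents and cv2.minAreaRect to obtain rotated rectangles):

```python
import numpy as np

def word_boxes(Sr, Sa, thr=0.7, tha=0.4):
    """Threshold the two heatmaps, label connected components,
    and return one axis-aligned box (x0, y0, x1, y1) per component."""
    M = (Sr > thr) | (Sa > tha)          # step 1: binary mask
    labels = np.zeros(M.shape, dtype=int)
    current = 0
    for y, x in zip(*np.nonzero(M)):     # step 2: 4-connected CCL
        if labels[y, x]:
            continue
        current += 1
        stack = [(y, x)]
        labels[y, x] = current
        while stack:
            cy, cx = stack.pop()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < M.shape[0] and 0 <= nx < M.shape[1]
                        and M[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    boxes = []
    for c in range(1, current + 1):      # step 3: box per component
        ys, xs = np.nonzero(labels == c)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes

# Two separated blobs in the character heatmap => two word boxes.
Sr = np.zeros((8, 8)); Sr[1:3, 1:3] = 1.0; Sr[5:7, 5:7] = 1.0
Sa = np.zeros((8, 8))
print(word_boxes(Sr, Sa))  # two components
```

The affinity mask Sa is what bridges the gap between characters of the same word, so thresholding the union of both heatmaps yields word-level rather than character-level components.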
Text Recognition : STR
STR : Idea and Data
Idea
After showing the fallacies and the inconsistencies raising from STR datasets and unfair
perfomances comparison and benchmarks.They proposed a four stages unified framework
laveraging previous work and also going beyond by exploring their variants on a granular way
and also general way (module wise) combinations
Figure: Four stages STR Model
Data
X are input images of shape (N, h, w, 3), where N is the number of images and h and w are the image height and width respectively; Y are the ground-truth words.
6 datasets : MJSynth, SynthText, IC13, IC15, IIIT, and SVT.
STR : Model
The four stages are ...
Transformation : to handle arbitrarily shaped and curved text, the STR network applies a Thin-Plate Spline (TPS) transformation that normalizes the input text into a rectangular shape.
Feature Extraction : the transformed image is mapped to a set of features relevant to character recognition. The authors carried out experiments on different backbones, namely ResNet, VGG, and RCNN.
Sequence Modeling : BiLSTMs capture contextual information (ba? => ba”d”, ba”g”, ba”t”). However, BiLSTMs incur memory and computation costs, so this stage can be selected or deselected as per user need.
Prediction : this stage estimates the output character sequence from the extracted image features. Two options : CTC or Attention.
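The module-wise combinations amount to a configurable pipeline in which each stage is swappable and sequence modeling is optional. A toy sketch of that control flow (every stage function is a trivial stand-in, not the paper's implementation):

```python
# Toy sketch of the four-stage STR framework with selectable modules.

def tps_transform(img):
    # Transformation: TPS rectification stand-in (identity here).
    return img

def backbone_features(img):
    # Feature Extraction: ResNet/VGG/RCNN stand-in (character codes here).
    return [ord(c) for c in img]

def bilstm_context(feats):
    # Sequence Modeling: BiLSTM contextualization stand-in (identity here).
    return feats

def ctc_decode(feats):
    # Prediction: CTC/Attention decoding stand-in.
    return "".join(chr(f) for f in feats)

def str_model(img, use_sequence_modeling=True):
    x = tps_transform(img)
    feats = backbone_features(x)
    if use_sequence_modeling:   # BiLSTM can be deselected to save compute
        feats = bilstm_context(feats)
    return ctc_decode(feats)

print(str_model("bad"))  # bad
```

Making the sequence-modeling stage a flag is exactly what lets the user trade accuracy for memory and compute.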
Inpainting
DeepFill v2 Idea
We used a SOTA model for free-form inpainting developed by the Adobe Research team in [3]. We chose a model pretrained on the Places2 dataset, since natural scenes, like documents, have large regions sharing the same texture distribution and spatial information.
Figure: DeepFill v2 SOTA results
Idea
The authors proposed several solutions for generative inpainting problems :
• Gated Convolutions : vanilla convolutions applied to an image with a hole are meaningless. Solution => a learnable mask.
• Free-form masks : local and global GANs are designed for rectangular masks. Solution => SN-PatchGAN.
Inpainting : DeepFill v2 Data and Model
Data
• the input X (for both the generator and the discriminator) has shape (N, h, w, 5), where N is the number of images and h and w are the image height and width respectively; the input channels are R, G, B, a holes mask, and a user-guidance mask (not required)
• the generator outputs hole-filled images of shape (N, h, w, 3)
• the custom discriminator (built using spectral normalization in a patch-wise fashion) outputs a tensor of shape (N, h/32, w/32, 256) storing binary variables (fake-or-real binary classification)
Model Details
• Gated Convolutions : normal convolutions are computed as follows :
Oy,x = ∑∑ W · I
PartialConv was then proposed to take only valid pixels, using the following static-mask formula :
Oy,x = ∑∑ W · (I ⊙ M / sum(M)), if sum(M) > 0
Oy,x = 0, otherwise
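A NumPy sketch of the PartialConv formula for a single output location (a minimal illustration with made-up weights, not the library implementation):

```python
import numpy as np

def partial_conv_pixel(I, M, W):
    """One PartialConv output value: the kernel sees only valid
    (mask == 1) pixels, renormalized by the count of valid pixels."""
    s = M.sum()
    if s > 0:
        return (W * (I * M / s)).sum()
    return 0.0  # the whole receptive field is a hole

W = np.ones((3, 3))                                 # toy kernel
I = np.arange(9, dtype=float).reshape(3, 3)         # toy image patch
M = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]], float)  # hole on the right
print(partial_conv_pixel(I, M, W))  # 2.0: mean of the 4 valid pixels
```

The hard binary mask M is exactly what GatedConv later replaces with a learned, soft gate.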
Inpainting : DeepFill v2 Model
Model Details (2)
The DeepFill v2 authors then generalized PartialConv to a learnable, dynamic mask through GatedConv, as follows :
Gatingy,x = ∑∑ Wg · I
Featurey,x = ∑∑ Wf · I
Oy,x = ϕ(Featurey,x) ⊙ σ(Gatingy,x)
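A NumPy sketch of a gated convolution at one spatial location, following the equations above (a minimal illustration with hypothetical weight values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_pixel(I, Wf, Wg, phi=np.tanh):
    """One GatedConv output: a learned soft mask (sigmoid of a second
    convolution) multiplies the activated feature, replacing
    PartialConv's hard binary mask."""
    feature = (Wf * I).sum()   # feature branch
    gating = (Wg * I).sum()    # gating branch
    return phi(feature) * sigmoid(gating)

I = np.ones((3, 3))
Wf = np.full((3, 3), 0.1)    # feature weights (hypothetical values)
Wg = np.full((3, 3), -10.0)  # gate weights pushed negative => gate ~ 0
out = gated_conv_pixel(I, Wf, Wg)
print(out)  # near zero: the learned gate suppresses this location
```

Since Wg is learned, the network decides per pixel how much of the feature to let through, instead of relying on a fixed valid/invalid rule.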
• SN-PatchGAN : a convolutional network (6 convolutions with kernel=5 and stride=2) is used as the discriminator, with spectral normalization applied via the default fast approximation algorithm described in SN-GAN
Loss
• Generative Loss (Hinge) : LG = −Ez∼Pz(z) [Dsn(G(z))]
• Discriminative Loss (Hinge) :
LDsn = Ex∼Pdata (x) [ReLU (1 − Dsn(x))] + Ez∼Pz(z) [ReLU (1 + Dsn(G(z)))]
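Both hinge losses can be sketched directly in NumPy over a batch of discriminator scores (the score values below are illustrative, not real model outputs):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def d_loss(d_real, d_fake):
    # Hinge discriminator loss: push real scores above +1, fake below -1.
    return relu(1.0 - d_real).mean() + relu(1.0 + d_fake).mean()

def g_loss(d_fake):
    # Hinge generator loss: raise the discriminator's score on fakes.
    return -d_fake.mean()

d_real = np.array([2.0, 0.5])   # scores on real patches
d_fake = np.array([-2.0, 0.0])  # scores on generated patches
print(d_loss(d_real, d_fake), g_loss(d_fake))  # 0.75 1.0
```

With a patch-based discriminator these scores are per-patch tensors rather than scalars, but the means are taken the same way over all elements.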
Further Improvements
Protocol
• Preprocessing : extend the preprocessing step to accept natural photos (not necessarily scanned) and to be robust to text orientation.
• Dataset : build a dataset we could use for validation, evaluation, and, if necessary, continued training (for example, text recognition for non-Latin languages).
• Evaluation metrics : subsequently, develop annotations and mathematically solid metrics to estimate the quality of our predictions.
• Text characteristics : add a text font recognition module that, given a text crop, predicts the font family, font size, bold, italic, text color, underlining, ...; it could be end-to-end, handcrafted, or a hybrid of both.
• Currency Converter : if the document contains monetary amounts (for example, bills), we could offer currency conversion.
• Merging algorithm complexity : exploit the structure of the data so as not to compare every box against every other box.
Conclusion
”We are really satisfied with the first version of this enticing project. We deeply believe that we can bring this project to life and add value to society. We look forward to endowing the system with all the proposed improvements in the next versions.”
References
[1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9365–9374, 2019.
[2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? Dataset and model analysis,” in International Conference on Computer Vision (ICCV), 2019.
[3] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4471–4480, 2019.