The Wise Document Translator
Introduction and Problem Setting
Problem
A project that aims to build a system that translates a document from one language to another while preserving its entire design (layout, logos, signatures, ...)
Why
• Price : professional translators are expensive and produce plain translation templates that never match the input layout
• Credibility : a translation that exactly matches the input layout gains credibility and trustworthiness
• Importance degree : documents vary in importance; translating a worksheet is not equivalent to translating a person's criminal record
Our Approach
Global Perspective
After answering the What? and the Why?, let's answer the How? ...
How?
We divided our problem into 5 sequential sub-problems :
• 1) Text Detection : localizing word-level text areas
• 2) Text Recognition : recognizing the words
• 3) Text Merging : merging words into statements
• 4) Inpainting : deleting the text areas and filling them in
• 5) Text Translation : translating the text and putting it back into the document
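The five sub-problems chain naturally into a single pipeline. A minimal sketch of that flow (every function here is a hypothetical placeholder, not the actual implementation):

```python
# Hypothetical sketch of the five-stage pipeline; each stage is a stand-in.

def detect_text(image):
    # 1) Text Detection: return word-level boxes (dummy coordinates here).
    return [(0, 0, 10, 5), (12, 0, 30, 5)]

def recognize(image, boxes):
    # 2) Text Recognition: one word per detected box (dummy words here).
    return ["Hello", "World"]

def merge(boxes, words):
    # 3) Text Merging: group neighboring words into statements.
    return [" ".join(words)]

def inpaint(image, boxes):
    # 4) Inpainting: erase the text areas and fill them (identity here).
    return image

def translate_and_render(image, statements, target_lang):
    # 5) Text Translation: translate and draw back (upper-case stand-in).
    return image, [s.upper() for s in statements]

def translate_document(image, target_lang="fr"):
    boxes = detect_text(image)
    words = recognize(image, boxes)
    statements = merge(boxes, words)
    clean = inpaint(image, boxes)
    return translate_and_render(clean, statements, target_lang)

_, out = translate_document(object())
print(out)  # ['HELLO WORLD']
```

Each stage only consumes the previous stage's output, which is what makes the sub-problems independently swappable.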
Text Detection : CRAFT
CRAFT : Idea and Data
Idea
explore individual characters and the affinity between them, linking characters to form text entities
Data
X are input images of shape (N, h, w, 3) and Y are outputs of shape (N, h, w, 2), where :
∗ N is the number of images, and h and w are the image height and width respectively
∗ for each image Xi of shape (h, w, 3), Yi consists of two matrices : a character score heatmap and an affinity/linkage score heatmap
Figure: CRAFT data sample
CRAFT : Post Processing
From the two predicted heatmaps Sr and Sa we generate word bounding boxes as follows :
• Build a binary mask M where a pixel p is set to 1 if Sr(p) > thr or Sa(p) > tha, where thr and tha are thresholding hyperparameters.
• Apply the CCL (Connected Component Labeling) algorithm to M.
• Find the minimum-area rectangle covering each component (rotated rectangles are accepted too, since the text can be inclined/rotated).
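A minimal NumPy sketch of these three steps, using a simple 4-connected flood-fill for CCL and axis-aligned boxes (in practice one would use cv2.connectedComponents and cv2.minAreaRect to obtain rotated rectangles):

```python
import numpy as np

def word_boxes(Sr, Sa, thr=0.7, tha=0.4):
    """Threshold the two heatmaps, label connected components,
    and return one axis-aligned box (x0, y0, x1, y1) per component."""
    M = (Sr > thr) | (Sa > tha)          # step 1: binary mask
    labels = np.zeros(M.shape, dtype=int)
    current = 0
    for y, x in zip(*np.nonzero(M)):     # step 2: 4-connected CCL
        if labels[y, x]:
            continue
        current += 1
        stack = [(y, x)]
        labels[y, x] = current
        while stack:
            cy, cx = stack.pop()
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if (0 <= ny < M.shape[0] and 0 <= nx < M.shape[1]
                        and M[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    boxes = []
    for c in range(1, current + 1):      # step 3: box per component
        ys, xs = np.nonzero(labels == c)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes

# Two separated blobs in the character heatmap => two word boxes.
Sr = np.zeros((8, 8)); Sr[1:3, 1:3] = 1.0; Sr[5:7, 5:7] = 1.0
Sa = np.zeros((8, 8))
print(word_boxes(Sr, Sa))  # two components
```

The affinity mask Sa is what bridges the gap between characters of the same word, so thresholding the union of both heatmaps yields word-level rather than character-level components.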
Text Recognition : STR
STR : Idea and Data
Idea
After showing the fallacies and the inconsistencies raising from STR datasets and unfair
perfomances comparison and benchmarks.They proposed a four stages unified framework
laveraging previous work and also going beyond by exploring their variants on a granular way
and also general way (module wise) combinations
Figure: Four stages STR Model
Data
X are input images of shape (N, h, w, 3), where N is the number of images and h and w are the image height and width respectively; Y are the ground-truth words.
6 datasets : MJSynth, SynthText, IC13, IC15, IIIT, and SVT.
STR : Model
The four stages are ...
Transformation : to handle arbitrarily shaped and curved text, the STR network applies a Thin-Plate Spline (TPS) transformation that normalizes the input text into a rectangular shape.
Feature Extraction : the transformed image is mapped to a set of features relevant to character recognition. The authors carried out experiments on different backbones, namely ResNet, VGG, and RCNN.
Sequence Modeling : BiLSTMs capture contextual information (ba? => ba”d”, ba”g”, ba”t”). However, BiLSTMs incur memory and computation costs, so this stage can be selected or deselected as per user need.
Prediction : this stage estimates the output character sequence from the extracted image features. Two options : CTC or Attention.
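The module-wise combinations amount to a configurable pipeline in which each stage is swappable and sequence modeling is optional. A toy sketch of that control flow (every stage function is a trivial stand-in, not the paper's implementation):

```python
# Toy sketch of the four-stage STR framework with selectable modules.

def tps_transform(img):
    # Transformation: TPS rectification stand-in (identity here).
    return img

def backbone_features(img):
    # Feature Extraction: ResNet/VGG/RCNN stand-in (character codes here).
    return [ord(c) for c in img]

def bilstm_context(feats):
    # Sequence Modeling: BiLSTM contextualization stand-in (identity here).
    return feats

def ctc_decode(feats):
    # Prediction: CTC/Attention decoding stand-in.
    return "".join(chr(f) for f in feats)

def str_model(img, use_sequence_modeling=True):
    x = tps_transform(img)
    feats = backbone_features(x)
    if use_sequence_modeling:   # BiLSTM can be deselected to save compute
        feats = bilstm_context(feats)
    return ctc_decode(feats)

print(str_model("bad"))  # bad
```

Making the sequence-modeling stage a flag is exactly what lets the user trade accuracy for memory and compute.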
Inpainting
DeepFill v2 Idea
We used a SOTA model for free-form inpainting developed by the Adobe Research team in [3]. We chose a model pretrained on the Places2 dataset, since natural scenes, like documents, have large regions sharing the same texture distribution and spatial information.
Figure: DeepFill v2 SOTA results
Idea
The authors proposed several solutions for generative inpainting problems :
• Gated Convolutions : vanilla convolutions applied to an image with a hole are meaningless. Solution => a learnable mask.
• Free-form masks : local and global GANs are designed for rectangular masks. Solution => SN-PatchGAN.
Inpainting : DeepFill v2 Data and Model
Data
• the input X (for both the generator and the discriminator) has shape (N, h, w, 5), where N is the number of images and h and w are the image height and width respectively; the input channels are R, G, B, a holes mask, and a user-guidance mask (not required)
• the generator outputs hole-filled images of shape (N, h, w, 3)
• the custom discriminator (built using spectral normalization in a patch-wise fashion) outputs a tensor of shape (N, h/32, w/32, 256) storing binary variables (fake-or-real binary classification)
Model Details
• Gated Convolutions : normal convolutions are computed as follows :
Oy,x = ∑∑ W · I
PartialConv was then proposed to take only valid pixels, using the following static-mask formula :
Oy,x = ∑∑ W · (I ⊙ M / sum(M)), if sum(M) > 0
Oy,x = 0, otherwise
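A NumPy sketch of the PartialConv formula for a single output location (a minimal illustration with made-up weights, not the library implementation):

```python
import numpy as np

def partial_conv_pixel(I, M, W):
    """One PartialConv output value: the kernel sees only valid
    (mask == 1) pixels, renormalized by the count of valid pixels."""
    s = M.sum()
    if s > 0:
        return (W * (I * M / s)).sum()
    return 0.0  # the whole receptive field is a hole

W = np.ones((3, 3))                                 # toy kernel
I = np.arange(9, dtype=float).reshape(3, 3)         # toy image patch
M = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 0]], float)  # hole on the right
print(partial_conv_pixel(I, M, W))  # 2.0: mean of the 4 valid pixels
```

The hard binary mask M is exactly what GatedConv later replaces with a learned, soft gate.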
Inpainting : DeepFill v2 Model
Model Details (2)
The DeepFill v2 authors then generalized PartialConv to a learnable, dynamic mask through GatedConv, as follows :
Gatingy,x = ∑∑ Wg · I
Featurey,x = ∑∑ Wf · I
Oy,x = ϕ(Featurey,x) ⊙ σ(Gatingy,x)
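A NumPy sketch of a gated convolution at one spatial location, following the equations above (a minimal illustration with hypothetical weight values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_pixel(I, Wf, Wg, phi=np.tanh):
    """One GatedConv output: a learned soft mask (sigmoid of a second
    convolution) multiplies the activated feature, replacing
    PartialConv's hard binary mask."""
    feature = (Wf * I).sum()   # feature branch
    gating = (Wg * I).sum()    # gating branch
    return phi(feature) * sigmoid(gating)

I = np.ones((3, 3))
Wf = np.full((3, 3), 0.1)    # feature weights (hypothetical values)
Wg = np.full((3, 3), -10.0)  # gate weights pushed negative => gate ~ 0
out = gated_conv_pixel(I, Wf, Wg)
print(out)  # near zero: the learned gate suppresses this location
```

Since Wg is learned, the network decides per pixel how much of the feature to let through, instead of relying on a fixed valid/invalid rule.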
• SN-PatchGAN : a convolutional network (6 convolutions with kernel=5 and stride=2) is used as the discriminator, with spectral normalization applied via the default fast approximation algorithm described in SN-GAN
Loss
• Generative Loss (Hinge) : LG = −Ez∼Pz(z) [Dsn(G(z))]
• Discriminative Loss (Hinge) :
LDsn = Ex∼Pdata (x) [ReLU (1 − Dsn(x))] + Ez∼Pz(z) [ReLU (1 + Dsn(G(z)))]
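Both hinge losses can be sketched directly in NumPy over a batch of discriminator scores (the score values below are illustrative, not real model outputs):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def d_loss(d_real, d_fake):
    # Hinge discriminator loss: push real scores above +1, fake below -1.
    return relu(1.0 - d_real).mean() + relu(1.0 + d_fake).mean()

def g_loss(d_fake):
    # Hinge generator loss: raise the discriminator's score on fakes.
    return -d_fake.mean()

d_real = np.array([2.0, 0.5])   # scores on real patches
d_fake = np.array([-2.0, 0.0])  # scores on generated patches
print(d_loss(d_real, d_fake), g_loss(d_fake))  # 0.75 1.0
```

With a patch-based discriminator these scores are per-patch tensors rather than scalars, but the means are taken the same way over all elements.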
Further Improvements
Protocol
• Preprocessing : extend the preprocessing step to accept natural photos (not necessarily scanned) and to be robust to text orientation.
• Dataset : build a dataset we could use for validation, evaluation, and, if necessary, continued training (for example, text recognition for non-Latin languages).
• Evaluation metrics : subsequently, develop annotations and mathematically solid metrics to estimate the quality of our predictions.
• Text characteristics : add a text font recognition module that, given a text crop, predicts the font family, font size, bold, italic, text color, underlining, ...; it could be end-to-end, handcrafted, or a hybrid of both.
• Currency Converter : if the document contains monetary amounts (for example, bills), we could offer currency conversion.
• Merging algorithm complexity : exploit the structure of the data so as not to compare every box against every other box.
Conclusion
”We are really satisfied with the first version of this enticing project. We deeply believe that we can bring this project to life and add value to society. We look forward to endowing the system with all the proposed improvements in the next versions.”
References
[1] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9365–9374, 2019.
[2] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? Dataset and model analysis,” in International Conference on Computer Vision (ICCV), 2019.
[3] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free-form image inpainting with gated convolution,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4471–4480, 2019.